We use Python a fair bit at Zendesk for building machine learning (ML) products. One of the common performance issues we have encountered with ML applications is memory leaks and spikes. The Python code is usually executed within containers via distributed processing frameworks such as Hadoop, Spark and AWS Batch. Each container is allocated a fixed amount of memory. Once the code execution exceeds the specified memory limit, the container terminates with an out-of-memory error.

A quick fix is to increase the memory allocation. However, this can waste resources and affect the stability of the products due to unpredictable memory spikes. The causes of memory leaks can include:

lingering large objects which are not released

reference cycles within the code

underlying libraries/C extensions leaking memory

A useful exercise is to profile the memory usage of the applications to gain a better understanding of the space efficiency of the code and the underlying packages it uses. This post covers:

profiling memory usage of the application across time

inspecting memory usage at a specific point in the program

tips for debugging memory issues

Profiling Memory Across Time

You can observe how memory usage varies over time during the execution of the Python code using the memory-profiler package.

# install the required packages

pip install memory_profiler

pip install matplotlib

# run the profiler to record the memory usage (samples every 0.1s by default)

mprof run --include-children python fantastic_model_building_code.py

# plot the recorded memory usage

mprof plot --output memory-profile.png

A. Memory profile as a function of time

The option --include-children includes the memory usage of any child processes spawned by the parent process. Graph A shows an iterative model training process in which memory increases in cycles as batches of training data are processed. The objects are released once garbage collection kicks in.

If the memory usage is constantly growing, there is a potential memory leak. Here’s a dummy sample script to illustrate this.
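A leaky script of this kind can be sketched roughly as follows (the names and sizes are made up for illustration): a module-level list keeps a reference to every batch, so memory grows with each iteration and nothing is ever released.

```python
import time

# Hypothetical leaky script: a module-level list keeps a
# reference to every batch, so memory grows on each iteration.
leaky_cache = []

def process_batch():
    data = bytearray(5 * 1024 * 1024)  # simulate loading ~5 MB of data
    leaky_cache.append(data)           # bug: the reference is never released
    return len(data)

for _ in range(20):
    process_batch()
    time.sleep(0.05)  # give mprof time to take samples
```

Running mprof against a script like this produces a steadily climbing memory curve, as shown in Graph B.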

B. Memory footprints increasing across time

A debugger breakpoint can be set once memory usage exceeds a certain threshold using the --pdb-mmem option, which is handy for troubleshooting.

Memory Dump at a Point in Time

It is important to understand the expected number of large objects in the program and whether they should be duplicated and/or transformed into different formats.

To further analyse the objects in memory, a heap dump can be created at specific lines in the program with muppy.

# install pympler (which includes muppy)

pip install pympler

Add the following to the leaky code within python_script_being_profiled.py:

from pympler import muppy, summary
import pandas as pd

all_objects = muppy.get_objects()
sum1 = summary.summarize(all_objects)

# print out a summary of the large objects
summary.print_(sum1)

# get references to certain types of objects, such as dataframes
dataframes = [ao for ao in all_objects if isinstance(ao, pd.DataFrame)]
for d in dataframes:
    print(d.columns.values)
    print(len(d))

Example of summary of memory heap dump

Another useful memory profiling library is objgraph, which can generate object graphs to inspect the lineage of objects.

Useful Pointers

Strive for a quick feedback loop

A useful approach is to create a small “test case” which runs only the leaky code in question. If the complete input data takes a long time to run, consider using a small, randomly sampled subset of it.
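For instance, a fixed-seed random sample of the input can make the test case both fast and reproducible (a minimal sketch using a plain list as a stand-in for the real dataset; the variable names are made up):

```python
import random

random.seed(42)  # fixed seed so the test case is reproducible

full_input = list(range(100_000))              # stand-in for the complete dataset
test_input = random.sample(full_input, 1_000)  # 1% subset for the quick test case
```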

Run memory-intensive tasks in a separate process

Python does not necessarily release memory immediately back to the operating system. To ensure memory is released after a piece of code has executed, it needs to run in a separate process. This page provides more details on Python garbage collection.

Debugger can add references to objects

If a breakpoint debugger such as pdb is used, any objects created and referenced manually from the debugger will remain in the memory profile. This can create a false impression of memory leaks, where objects appear not to be released in a timely manner.
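The effect can be demonstrated with a weak reference (a minimal sketch in CPython, not tied to any particular debugger): an extra name bound to an object, like one created at a pdb prompt, keeps the object alive until that name is deleted.

```python
import weakref

class Model:
    pass

m = Model()
tracker = weakref.ref(m)  # a weak reference does not keep the object alive

debugger_ref = m          # e.g. `debugger_ref = m` typed at a pdb prompt
del m
alive_after_del = tracker() is not None  # True: the extra name still holds it

del debugger_ref
collected = tracker() is None            # True: the object is now reclaimed
```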

Watch out for packages that can be leaky

Some Python libraries could potentially have memory leaks; for example, pandas has quite a few known memory leak issues.

Happy hunting!

References