Sometimes your data file is so large you can’t load it into memory at all, even with compression. So how do you process it quickly?

By loading and then processing the data in chunks, you can load only part of the file into memory at any given time. And that means you can process files that don’t fit in memory.

Let’s see how you can do this with Pandas.

Reading the full file

We’ll start with a program that just loads a full CSV into memory. In particular, we’re going to write a little program that loads a voter registration database, and measures how many voters live on every street in the city:

import pandas voters_street = pandas . read_csv ( "voters.csv" )[ "Residential Address Street Name " ] print ( voters_street . value_counts ())

If we run it we get:

$ python voter-by-street-1.py MASSACHUSETTS AVE 2441 MEMORIAL DR 1948 HARVARD ST 1581 RINDGE AVE 1551 CAMBRIDGE ST 1248 ... NEAR 111 MOUNT AUBURN ST 1 SEDGEWICK RD 1 MAGAZINE BEACH PARK 1 WASHINGTON CT 1 PEARL ST AND MASS AVE 1 Name: Residential Address Street Name , Length: 743, dtype: int64

Where is memory being spent? As you would expect, the bulk of memory usage is allocated by loading the CSV into memory.

In the following graph of peak memory usage, the width of the bar indicates what percentage of the memory is used:

The section on the left is the CSV read.

The narrower section on the right is memory used importing all the various Python modules, in particular Pandas; unavoidable overhead, basically.

You don’t have to read it all

As an alternative to reading everything into memory, Pandas allows you to read data in chunks. In the case of CSV, we can load only some of the lines into memory at any given time.

In particular, if we use the chunksize argument to pandas.read_csv , we get back an iterator over DataFrame s, rather than one single DataFrame . Each DataFrame is the next 1000 lines of the CSV:

import pandas result = None for chunk in pandas . read_csv ( "voters.csv" , chunksize = 1000 ): voters_street = chunk [ "Residential Address Street Name " ] chunk_result = voters_street . value_counts () if result is None : result = chunk_result else : result = result . add ( chunk_result , fill_value = 0 ) result . sort_values ( ascending = False , inplace = True ) print ( result )

When we run this we get basically the same results:

$ python voter-by-street-2.py MASSACHUSETTS AVE 2441.0 MEMORIAL DR 1948.0 HARVARD ST 1581.0 RINDGE AVE 1551.0 CAMBRIDGE ST 1248.0 ...

If we look at the memory usage, we’ve reduced memory usage so much that the memory usage is now dominated by importing Pandas; the actual code barely uses anything:

The MapReduce idiom

Taking a step back, what we have here is an highly simplified instance of the MapReduce programming model. While typically used in distributed systems, where chunks are processed in parallel and therefore handed out to worker processes or even worker machines, you can still see it at work in this example.

In the simple form we’re using, MapReduce chunk-based processing has just two steps:

For each chunk you load, you map or apply a processing function. Then, as you accumulate results, you “reduce” them by combining partial results into the final result.

We can re-structure our code to make this simplified MapReduce model more explicit:

import pandas from functools import reduce def get_counts ( chunk ): voters_street = chunk [ "Residential Address Street Name " ] return voters_street . value_counts () def add ( previous_result , new_result ): return previous_result . add ( new_result , fill_value = 0 ) # MapReduce structure: chunks = pandas . read_csv ( "voters.csv" , chunksize = 1000 ) processed_chunks = map ( get_counts , chunks ) result = reduce ( add , processed_chunks ) result . sort_values ( ascending = False , inplace = True ) print ( result )

Both reading chunks and map() are lazy, only doing work when they’re iterated over. As a result, chunks are only loaded in to memory on-demand when reduce() starts iterating over processed_chunks .

From full reads to chunked reads

You’ll notice in the code above that get_counts() could just as easily have been used in the original version, which read the whole CSV into memory:

def get_counts ( chunk ): voters_street = chunk [ "Residential Address Street Name " ] return voters_street . value_counts () result = get_counts ( pandas . read_csv ( "voters.csv" ))

That’s because reading everything at once is a simplified version of reading in chunks: you only have one chunk, and therefore don’t need a reducer function.

So here’s how you can go from code that reads everything at once to code that reads in chunks: