Thu 21 September 2017

This post is the first of many to come on Apache Arrow, pandas, pandas2, and the general trajectory of my work in recent times and into the foreseeable future. This is a bit of a read and overall fairly technical, but if interested I encourage you to take the time to work through it.

In this post I hope to explain as concisely as I can some of the key problems with pandas's internals and how I've been steadily planning and building pragmatic, working solutions for them. To the outside eye, the projects I've invested in may seem only tangentially-related: e.g. pandas, Badger, Ibis, Arrow, Feather, Parquet. Quite the contrary, they are all closely-interrelated components of a continuous arc of work I started almost 10 years ago.

Side note: consider making a tax-deductible donation to support pandas development

Some background

I started building pandas in April, 2008. It started out as a skunkworks that I developed mostly on my nights and weekends. I didn't know much about software engineering or even how to use Python's scientific computing stack well back then. My code was ugly and slow. I figured things out as I went and learned as much from others as I could. I didn't start doing serious C development until 2013 and C++ development until 2015. I appreciate C++ a lot more now than I would have 9 years ago.

Python was a comparatively more inhospitable place for what we might now call data science development. The problems that pandas solves for people in 2017 were not problems that people generally solved with Python at all. They generally used R, SAS, SPSS, Stata, or MATLAB, in no particular order of preference.

So maybe it's not a surprise that pandas's internal architecture has some warts. In Summer 2011, I devised a contraption known as the BlockManager , a memory management object that uses NumPy arrays internally, for managing the internal columns of data inside a pandas.DataFrame . You can see me writing about it all the way back in July 2011.

While the BlockManager and pandas's overall tight internal coupling to NumPy has served the project well historically, these things are some of the root causes of problems that plague pandas users working with larger datasets in modern times.

To put it simply, we weren't thinking about analyzing 100 GB or 1 TB datasets in 2011. Nowadays, my rule of thumb for pandas is that you should have 5 to 10 times as much RAM as the size of your dataset. So if you have a 10 GB dataset, you should really have about 64, preferably 128 GB of RAM if you want to avoid memory management problems. This comes as a shock to users who expect to be able to analyze datasets that are within a factor of 2 or 3 the size of their computer's RAM.

pandas rule of thumb: have 5 to 10 times as much RAM as the size of your dataset

There are additional, hidden memory killers in the project, like the way that we use Python objects (like strings) for many internal details, so it's not unusual to see a dataset that is 5GB on disk take up 20GB or more in memory. It's an overall bad situation for large datasets.

DataPad, Badger, and my time at Cloudera

I started DataPad in 2013 with Chang She, my longtime friend and pandas collaborator. We wanted to use the nascent PyData stack to power the visual analytics application we were building, but we ran into some serious performance issues, especially in the cloud. The responsiveness of analytics queries from the DataPad application weren't great with pandas out of the box.

So I pared down the pandas feature set to the bare essentials and created a small new implementation which we called Badger. I found that through using contiguous, immutable columnar data structures optimized for data locality, that I could get 2-20x better performance in a wide variety of operations. The biggest wins were in string processing, but there were huge gains across the board. You can see a demo of DataPad here.

Badger was definitely "startup code". When we were acquired by Cloudera in 2014, I contemplated open sourcing Badger, but felt that it would be a lot of work to clean up the code (mostly written in C, with far too many macros) for human consumption and I wanted to build a more future-proof implementation that would still be useful 10 years down the road. Releasing it as-is would have been distracting for pandas users, and I didn't want to keep developing that codebase. It's not a good idea to release codebases only to abandon them. In light of the fact that basically a rewrite was needed, I left Badger on the shelf.

I gave a talk in November 2013 with the subtitle 10 Things I Hate About Pandas, which has had almost 100,000 slide views 4 years later. It's a summary of the things that I'd learned throughout 2013 and battle scars from the first 5 years of pandas development.

The 10 (really 11) things are (paraphrasing my own words):

Internals too far from "the metal" No support for memory-mapped datasets Poor performance in database and file ingest / export Warty missing data support Lack of transparency into memory use, RAM management Weak support for categorical data Complex groupby operations awkward and slow Appending data to a DataFrame tedious and very costly Limited, non-extensible type metadata Eager evaluation model, no query planning "Slow", limited multicore algorithms for large datasets

I had begun to solve some of these problems in Badger, but the solutions were narrow in scope to the problems we were solving at DataPad. Luckily, I moved to Cloudera where there were a lot of database and big data system developers for me to learn from.

At Cloudera, I started looking at Impala, Kudu, Spark, Parquet, and other such big data storage and analysis systems. Since Python and pandas had never been involved with any of these projects, building integrations with them was difficult. The single biggest problem was data interchange, particularly moving large tabular datasets from one process's memory space to another's. It was extremely expensive, and there was no standard solution for doing it. RPC-oriented serialization protocols like Thrift and Protocol Buffers were too slow and too general purpose.

As I dug through the different points of contact between different systems, I saw a lot of commonality with the problems I'd been working on above in Badger. Zero-copy data access was the biggest thing; you need to be able to memory map complex tables to make accessing 1 terabyte of data on disk as fast and easy as 1 megabyte.

By early 2015, I was yearning for what I was then calling a "columnar data middleware" which provided zero-copy access, with rich enough support for strings, nested types, and all the other hairy JSON-like data found in the wild. Like the prototype Badger runtime, this format needed to be optimized for data locality so that we could evaluate queries at maximum speeds.

I was lucky to bump into a collection of like-minded people across many big data projects, especially folks from Apache Drill, Impala, Kudu, Spark, and others. In late 2015, to create a neutral "safe space" free from software vendor affiliation (which can make industry collaborations more complex), we worked with the Apache Software Foundation to establish Apache Arrow.

On paper, Apache Arrow was everything I had been wanting for years. But, in late 2015, all I had (as far as Python is concerned) were some Markdown specification documents. These specifications weren't even final; we set up the Apache project to create a venue for the broader community to have a dialogue about the specs and the problems that Arrow solves. We had to buckle down and build real software to make the vision real and useful. Now that I've been working on the project for almost 2 years, we've made huge progress in realizing the things that we set out to accomplish.

I strongly feel that Arrow is a key technology for the next generation of data science tools. I laid out my vision for this recently in my JupyterCon keynote.

Also in late 2015, I wrote a long set of design documents to start discussions about building a faster, cleaner core pandas implementation, which we may call pandas2. pandas is a community project that governs itself based on consensus (with me as the BDFL to break impasses). I wanted to see if the rest of the core developers agreed with my assessment of what is wrong with pandas's internals. It's been 2 years since then, and by and large there has been general agreement on the problems, but how to solve them all without disrupting the existing pandas user community is an open question. Over this time I have focused on building computational infrastructure that will largely go unseen by pandas users.

Does Arrow solve the "10 Things"?

Arrow doesn't solve all of the 10 things quite yet, but it's made huge strides toward doing so.

Arrow's C++ implementation provides essential in-memory analytics infrastructure for projects like pandas:

A runtime column-oriented memory format optimized for analytical processing performance

A zero-copy, streaming / chunk-oriented data layer designed for moving and accessing large datasets at maximum speeds

Extensible type metadata for describing a wide variety of flat and nested data types occurring in real-world systems, with support for user-defined types

What's missing from the Arrow C++ project at the moment (but not for too much longer) is:

A comprehensive analytical function "kernel" library

Logical operator graphs for graph dataflow-style execution (think TensorFlow or PyTorch, but for data frames)

A multicore schedular for parallel evaluation of operator graphs

I'll write more about the roadmap for building an analytics engine for Arrow memory (that we can use in projects like pandas) in a follow up post.

In the rest of this post, I'm going to go deeper into the "10 Things" and how they're addressed by the Arrow project.

1. Getting closer to the metal

All memory in Arrow on a per column basis, whether strings, numbers, or nested types, is arranged in contiguous memory buffers optimized for random access (single values) and scan (multiple values next to each other) performance. The idea is that you want to minimize CPU or GPU cache misses when looping over the data in a table column, even with strings or other non-numeric types.

In pandas, an array of strings is an array of PyObject pointers, and the actual string data lives inside PyBytes or PyUnicode structs that live all over the process heap. As developers, we are hamstrung by the bloated, memory-bound nature of processing these objects. In Python, the simple string 'wes' occupies 52 bytes of memory. '' occupies 49 bytes. For a great discussion of issues around this, see Jake Vanderplas's epic exposé on Why Python is Slow.

In Arrow, each string is right next to the previous one in memory, so you can scan all of the data in a column of strings without any cache misses. Processing contiguous bytes right against the metal, guaranteed.

Arrow's C/C++ API means that applications which know nothing about Python can consume or produce pristine Arrow tables and share them either in-process or via shared memory / memory maps. pandas's lack of a C or Cython API for data frames has been another big problem over time.

2. Memory mapping huge datasets

Perhaps the single biggest memory management problem with pandas is the requirement that data must be loaded completely into RAM to be processed. pandas's internal BlockManager is far too complicated to be usable in any practical memory-mapping setting, so you are performing an unavoidable conversion-and-copy anytime you create a pandas.DataFrame .

Arrow serialization design provides a "data header" which describes the exact locations and sizes of all the memory buffers for all the columns in a table. This means you can memory map huge, bigger-than-RAM datasets and evaluate pandas-style algorithms on them in-place without loading them into memory like you have to with pandas now. You could read 1 megabyte from the middle of a 1 terabyte table, and you only pay the cost of performing those random reads totalling 1 megabyte. With modern solid state drives, this is generally a good strategy.

Arrow's memory-mapping capability also allows multiple processes to work with the same large dataset without moving it or copying it in any way. We've seen this applied to great effect in the Plasma Object Store (now part of Arrow) used in the Ray project at UC Berkeley.

3. High speed data ingest and export (databases and file formats)

Arrow's efficient memory layout and rich type metadata make it an ideal container for inbound data from databases and columnar storage formats like Apache Parquet.

One of Arrow's primitive constructs is the concept of a "record batch stream", a sequence of atomic tables together comprising a large dataset. This stream processing data model is an idea for databases which serve streams of records from a database cursor.

We have been developing a high-speed connector with Parquet format. We've also seen the optimized turbodbc project for ODBC-based database connections.

I aspire to build Arrow-native connectors for many other file formats and databases, such as:

SQLite

PostgreSQL

Apache Avro

Apache ORC

CSV (a better version of pandas.read_csv )

) JSON

4. Doing missing data right

All missing data in Arrow is represented as a packed bit array, separate from the rest of the data. This makes missing data handling simple and consistent across all data types. You can also do analytics on the null bits (AND-ing bitmaps, or counting set bits) using fast bit-wise built-in hardware operators and SIMD.

The null count in an array is also explicitly stored in its metadata, so if data does not have nulls, we can choose faster code paths that skip null checking. With pandas, we cannot assume that arrays do not have null sentinel values and so most analytics has extra null checking which hurts performance. If you have no nulls, you don't even need to allocate the bit array.

Because missing data is not natively supported in NumPy, over time we have had to implement our own null-friendly versions of most key performance-critical algorithms. It would be better to have null-handling built into all algorithms and memory management from the ground up.

5. Keeping memory allocations in check

In pandas, all memory is owned either by NumPy or the Python interpreter, and it can be difficult to measure exactly how much memory is used by a given pandas.DataFrame . It's not unusual for a line of code to double or triple the memory footprint of a process due to temporary allocations, sometimes causing a MemoryError.

In Arrow's C++ implementation, all memory allocations are carefully tracked in a central "memory pool", so you know exactly how much Arrow memory is in RAM at any given time. By using "subpools" with parent-child relationships, you can precisely measure the "high water mark" in algorithms to understand the peak memory usage of analytical operations. This technique is common in databases to monitor or limit memory usage in operator evaluation. If you know that you are going to exceed available RAM, you can apply mitigation strategies like spilling to disk (where the ability to memory-map on-disk datasets is of course key).

In Arrow memory is either immutable or copy-on-write. At any given time, you know if another array references a buffer that you can see. This enables us to avoid defensive copying.

6. Supporting categorical data well

When I gave my talk in 2013, pandas did not have the pandas.Categorical type; that was implemented afterwards. But pandas's workarounds for data types not in NumPy has always been a bit warty. If you step outside pandas, you can't work with pandas Categoricals. The way that extension dtypes are implemented works, but is a bit bolted-on due to pandas's tight coupling to NumPy.

In Arrow, categorical data is a first-class citizen, and we have prioritized having an efficient and consistent representation both in-memory and on the wire or in shared memory. We support sharing categories (called dictionaries in Arrow) between multiple arrays.

pandas has other user-defined types: datetime with time zone and periods. We intend to be able to support logical data types (having a particular physical memory representation) in Arrow gracefully so that a particular system can faithfully transport its data using Arrow without having to make changes to the Arrow format documents.

7. Better groupby(...).apply operations

The way that Arrow helps is by enabling easier parallelization of groupby operations; due to other problems listed here, it is difficult or impossible to fully parallelize a df.groupby(...).apply(f) operation.

At some point, we will also want to improve the API for complex apply operations in pandas.

8. Appending to data frames

In pandas, all of the data in a column in a DataFrame must reside in the same NumPy array. This is a restrictive requirement, and frequently results in memory-doubling and additional computation to concatenate Series and DataFrame objects.

Table columns in Arrow C++ can be chunked, so that appending to a table is a zero copy operation, requiring no non-trivial computation or memory allocation. By designing up front for streaming, chunked tables, appending to existing in-memory tabler is computationally inexpensive relative to pandas now. Designing for chunked or streaming data is also essential for implementing out-of-core algorithms, so we are also laying the foundation for processing larger-than-memory datasets.

9. Adding new data types

There are multiple layers of complexity to adding new data types:

Adding new metadata

Creating dynamic dispatch rules to operator implementations in analytics

Preserving metadata through operations

For example, a "currency" type could have a currently type a string, with the data physically represented as a float64 or decimal . So you could treat the currency computationally like its numeric representation, but then carry through the currency metadata in numeric operations.

The rules about preserving metadata may be operator-dependent, so it can get complicated.

In Arrow we have decoupled the metadata representation from the details of computation and metadata nannying. In the C++ implementation, we have been planning ahead for user-defined types, so when we are focusing more on building an analytics engine it is a goal to enable the creation of user-defined operator dispatch and metadata promotion rules.

10/11. Query planning, multicore execution

When you write df[df.c < 0].d.sum() , pandas creates a temporary DataFrame df[df.c < 0] then sums the d column of that temporary object. If df contains a lot of columns, this is ridiculously wasteful. Of course you can write df.d[df.c < 0].sum() , but even that produces a temporary Series, which is then summed!

Clearly, if you know the whole expression you are evaluating you can do better and avoid these temporary allocations altogether. Additionally, many algorithms (including this example) can be parallelized amongst all the processors cores on your computer.

As part of building an analytics engine for Arrow, we also plan to build a lightweight physical "query planner" with a multicore in-process scheduler to enable many kinds of algorithms to be parallelized and evaluated efficiently. There is substantial prior art in the domain of graph data flow execution (particularly in the ML world lately, like TensorFlow and PyTorch), so this amounts to creating a graph data flow engine whose primitive unit of data is an Arrow table.

To plan ahead for this use case, in 2015, I started the Ibis project (still under active development) to create a pandas-friendly deferred expression system for static analysis and compilation these types of operations. Since an efficient multithreaded in-memory engine for pandas was not available when I started Ibis, I instead focused on building compilers for SQL engines (Impala, PostgreSQL, SQLite), similar to the R dplyr package. Phillip Cloud from the pandas core team has been actively working on Ibis with me for quite a long time.

What's next?

In an upcoming blog post, I will go into some more detail about the roadmap for building an Arrow-native multithreaded in-memory execution engine and how that's relevant to the architecture of pandas2.

Addendum: On Dask

Nowadays, a lot of people ask me about Dask (and Spark, and other such projects) and how it is helping with pandas performance and scalability. It is definitely helping in various ways, such as:

Splitting up large datasets into pieces and working with them in separate threads or separate processes

Evicting pandas data from RAM that is no longer needed

Dask makes it easy to read a directory of CSV files by running pandas.read_csv in parallel and then running a groupby operation on the entire dataset. Truly, what Matt Rocklin and team have built is an excellent piece of kit.

One issue with the Dask model is that it's using pandas as a black box. dask.dataframe does not solve pandas's inherent performance and memory use problems, but it spreads them out across multiple processes and helps mitigate them by being careful to not work with too large pieces of data all at once, which can result in an unpleasant MemoryError .

Some problems don't fit the Dask partition-parallel distributed task execution model. Also, pandas's memory management and IO challenges make Dask jobs a lot slower than they could be with a more efficient in-memory runtime.