Many organizations are trying to gather and utilise as much data as possible to improve on how they run their business, increase revenue, or how they impact the world around them. Therefore it is becoming increasingly common for data scientists to face 50GB or even 500GB sized datasets.

Now, these kind of datasets are a bit… uncomfortable to use. They are small enough to fit into the hard-drive of your everyday laptop, but way to big to fit in RAM. Thus, they are already tricky to open and inspect, let alone to explore or analyse.

There are 3 strategies commonly employed when working with such datasets. The first one is to sub-sample the data. The drawback here is obvious: one may miss key insights by not looking at the relevant portions, or even worse, misinterpret the story the data it telling by not looking at all of it. The next strategy is to use distributed computing. While this is a valid approach for some cases, it comes with the significant overhead of managing and maintaining a cluster. Imagine having to set up a cluster for a dataset that is just out of RAM reach, like in the 30–50 GB range. It seems like an overkill to me. Alternatively, one can rent a single strong cloud instance with as much memory as required to work with the data in question. For example, AWS offers instances with Terabytes of RAM. In this case you still have to manage cloud data buckets, wait for data transfer from bucket to instance every time the instance starts, handle compliance issues that come with putting data on the cloud, and deal with all the inconvenience that come with working on a remote machine. Not to mention the costs, which although start low, tend to pile up as time goes on.

In this article I will show you a new approach: a faster, more secure, and just overall more convenient way to do data science using data of almost arbitrary size, as long as it can fit on the hard-drive of your laptop, desktop or server.

Vaex

Vaex is an open-source DataFrame library which enables the visualisation, exploration, analysis and even machine learning on tabular datasets that are as large as your hard-drive. To do this, Vaex employs concepts such as memory mapping, efficient out-of-core algorithms and lazy evaluations. All of this is wrapped in a familiar Pandas-like API, so anyone can get started right away.

The Billion Taxi Rides Analysis

To illustrate this concepts, let us do a simple exploratory data analysis on a dataset that is far to large to fit into RAM of a typical laptop. In this article we will use the New York City (NYC) Taxi dataset, which contains information on over 1 billion taxi trips conducted between 2009 and 2015 by the iconic Yellow Taxis. The data can be downloaded from this website, and comes in CSV format. The complete analysis can be viewed separately in this Jupyter notebook.

Cleaning the streets

The first step is to convert the data into a memory mappable file format, such as Apache Arrow, Apache Parquet, or HDF5. An example of how to do convert CSV data to HDF5 can be found in here. Once the data is in a memory mappable format, opening it with Vaex is instant (0.052 seconds!), despite its size of over 100GB on disk: