So… no pandas 🐼?

There are some issues with pandas that its original author, Wes McKinney, outlines in his insightful blog post “Apache Arrow and the ‘10 Things I Hate About pandas’”. Many of these issues will be tackled in the next version of pandas (pandas2?), building on top of Apache Arrow and other libraries. Vaex starts with a clean slate, while keeping the API similar, and is ready to be used today.

Vaex is lazy

Vaex is not just a pandas replacement. Although it has a pandas-like API for column access, executing an expression such as np.sqrt(ds.x**2 + ds.y**2) triggers no computations. Instead, a vaex expression object is created, and when printed it shows some preview values.

Calling numpy functions with a vaex expression leads to a new expression, which delays the computation and saves RAM.
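As a minimal sketch of what this looks like (using vaex.example(), the small demo dataset that ships with vaex):

```python
import numpy as np
import vaex

ds = vaex.example()  # small demo dataset shipped with vaex (columns x, y, z, ...)

# No computation happens here: r is a lazy expression object.
r = np.sqrt(ds.x**2 + ds.y**2)

print(r)  # printing an expression shows a few preview values
```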

With the expression system, vaex performs calculations only when needed. Also, the data does not need to be local: expressions can be sent over the wire, and statistics can be computed remotely, something that the vaex-server package provides.

Virtual columns

We can also add expressions to a DataFrame, resulting in virtual columns. A virtual column behaves like a regular column but occupies no memory. Vaex makes no distinction between real and virtual columns; they are treated on equal footing.

Adding a new virtual column to a DataFrame takes no extra memory.
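A sketch of this, continuing with the ds DataFrame from above:

```python
# Assigning an expression creates a virtual column: it behaves like a
# regular column but is evaluated on the fly and occupies no memory.
ds['r'] = np.sqrt(ds.x**2 + ds.y**2)

# Statistics on the virtual column are computed in chunks, when asked for.
print(ds.r.mean())
```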

What if an expression is really expensive to compute on the fly? By using Pythran or Numba, we can optimize the computation using manual Just-In-Time (JIT) compilation.

Using Numba or Pythran we can JIT our expression to squeeze out better performance: > 2x faster in this example.
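A sketch of what this can look like, assuming the jit_numba method that vaex exposes on expressions (jit_pythran being the Pythran counterpart):

```python
# A Numba-compiled version of the same expression; evaluation stays
# lazy, but the inner loop runs as JIT-compiled machine code.
ds['r_jit'] = ds.r.jit_numba()

print(ds.r_jit.mean())
```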

JIT-ed expressions are even supported for remote DataFrames (the JIT-ing happens on the server).

Got plenty of RAM? Just materialize the column, trading memory for extra performance.

Materializing a column converts a virtual column into an in-memory array. Great for performance (~8x faster), but you need some extra RAM.
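A sketch, assuming the materialize method on a DataFrame and the virtual column r from above:

```python
# Evaluate the virtual column once and store the result as a regular
# in-memory column: faster to reuse, at the cost of RAM.
ds_mat = ds.materialize('r')

print(ds_mat.r.mean())  # now reads a precomputed array
```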

Data cleansing

Filtering a DataFrame, as in ds_filtered = ds[ds.x > 0], merely results in a reference to the existing data plus a boolean mask keeping track of which rows are selected and which are not: almost no memory usage, and no memory copying going on.

ds_filtered has a ‘view’ on the original data. Even when you filter a 1TB file, just a fraction of the file is read.
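A sketch of such a filter:

```python
# Filtering creates a shallow view: no data is copied, only a boolean
# mask that tracks which rows pass the filter.
ds_filtered = ds[ds.x > 0]

print(len(ds_filtered), 'of', len(ds), 'rows selected')
```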

Apart from filtering a DataFrame, a selection can also define a subset of the data. With selections, you can calculate statistics for multiple subsets in a single pass over the data. This is excellent for DataFrames that don’t fit into memory (out-of-core).

Passing two selections results in two means in a single pass over the data.
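For example (reusing the x and y columns of the demo dataset):

```python
# Two selections, a single pass over the data: one mean per selection.
means = ds.mean(ds.x, selection=[ds.y < 0, ds.y >= 0])

print(means)
```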

Missing values can be a real pain, and it is not always easy to decide how to treat them. With vaex, you can easily fill missing values or drop rows containing them. But here’s the thing: both the dropna and fillna methods are implemented via filtering and expressions. This means that, for example, you can try out several fill values at no extra memory cost, no matter the size of your dataset.

You can try several fill values at virtually no extra memory cost.
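A sketch, assuming fillna is available on expressions as well as on DataFrames:

```python
# Each fill is just another virtual column, so trying several fill
# values costs virtually no extra memory, whatever the dataset size.
ds['x_filled_zero'] = ds.x.fillna(0)
ds['x_filled_mean'] = ds.x.fillna(ds.x.mean())
```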

Binned Statistics

Vaex is really strong in statistics. Since we are dealing with Big Data, we need an alternative to groupby, something that is computationally much faster. Instead of grouping, you can calculate statistics on a regular N-dimensional grid, which is blazing fast. For example, it takes about a second to calculate the mean of a column in regular bins, even when the dataset contains a billion rows (yes, 1 billion rows per second!).

Every statistic method accepts a binby argument to compute statistics on a regular N-dimensional grid.

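A sketch of a binned statistic:

```python
# Mean of x in 64 regular bins of y, computed in a single pass.
mean_x = ds.mean(ds.x, binby=ds.y, limits='minmax', shape=64)

print(mean_x.shape)  # (64,)
```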

Visualizations

Making meaningful plots and visualizations is the best way to understand your data. But when your DataFrame contains 1 billion rows, making standard scatter plots not only takes a really long time, but also results in a meaningless and illegible visualization. You can get much better insight into the structure of your data if you focus on aggregate properties (e.g. counts, sum, mean, median, standard deviation, etc.) of one or more features/columns. When computed in bins, these statistics give a better idea of how the data is distributed. Vaex excels at such computations and the results are easily visualized.

Let’s see some practical examples of these ideas. We can use a histogram to visualize the contents of a single column.
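For instance, assuming the plot1d convenience method of vaex's built-in matplotlib-based helpers:

```python
# 1D histogram of a column; the binning happens out-of-core.
ds.plot1d(ds.x, limits='99.7%')
```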

This can be expanded to two dimensions, producing a heat-map. Instead of simply counting the number of samples that fall into each bin, as done in a typical heat-map, we can calculate the mean, take the logarithm of the sum, or just about any custom statistic.
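A sketch, assuming the plot method and its what and f arguments:

```python
# By default each bin holds a count; f='log' log-scales it, and `what`
# can select a different statistic per bin entirely.
ds.plot(ds.x, ds.y, f='log')          # log of the count per bin
ds.plot(ds.x, ds.y, what='mean(z)')   # mean of a third column per bin
```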

We can even make 3-dimensional volume renderings using ipyvolume.
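One way to sketch this is to let vaex compute a 3D grid and hand it to ipyvolume directly (its quickvolshow renders a volume in a Jupyter notebook):

```python
import ipyvolume as ipv

# Count samples on a 64^3 grid with vaex, then volume-render the
# log-scaled grid with ipyvolume.
grid = ds.count(binby=[ds.x, ds.y, ds.z], limits='minmax', shape=64)
ipv.quickvolshow(np.log1p(grid))
```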