For this simple benchmark, disk.frame is faster, but Dask has a more convenient syntax. JuliaDB.jl is not ready for prime time.

Please note that I have not tried to record precise times over many runs; I aim only to illustrate the order of magnitude of the speed of the different packages.

To download the data, here are some pointers.

The data can be obtained from Rapids.ai’s Fannie Mae Data distribution page. I have downloaded the 17 Years data, which contains data on 37 million loans, with over 1.89 billion rows in the Performance datasets.

Benchmark exercise: converting CSV to desired format and simple aggregation

We pick the largest single file available to give each of the tools a test run.
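One way to locate that file programmatically is sketched below in Python, using only the standard library. The function name `largest_file`, the directory name, and the glob pattern in the usage note are my own assumptions for illustration; adjust them to wherever you extracted the download.

```python
from pathlib import Path
from typing import Optional


def largest_file(directory: str, pattern: str = "*") -> Optional[Path]:
    """Return the largest regular file under `directory` matching `pattern`.

    Searches subdirectories recursively and returns None if nothing matches.
    """
    # Collect regular files only, skipping directories that match the pattern.
    files = [p for p in Path(directory).rglob(pattern) if p.is_file()]
    if not files:
        return None
    # Compare by on-disk size in bytes.
    return max(files, key=lambda p: p.stat().st_size)
```

For example, `largest_file("fannie_mae_data", "*.txt")` would return the path of the biggest `.txt` file under a (hypothetical) `fannie_mae_data` folder, or `None` if the folder contains no such file.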