



A couple of weeks back, I came across a library that claims to speed up existing pandas code by changing just one line, making it at least 2x faster than plain pandas. A claim that big gave me a reason to test it out and see the results for myself. This is the project I came across; check it out!
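For context, the advertised one-line change is the import itself: Modin exposes a pandas-compatible API and parallelizes the work across all CPU cores, using Ray or Dask under the hood. A minimal sketch of the swap (the file name here is just a placeholder):

# Before: standard, single-threaded pandas
import pandas as pd
df = pd.read_csv('my_data.csv')

# After: Modin parallelizes the same call across all cores
import modin.pandas as pd
df = pd.read_csv('my_data.csv')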

I will import two datasets of different sizes to compare the performance of both methods.

Dataset 1

Size: 445MB

import time
import pandas as pd

# Baseline: plain pandas
duration = []
for i in range(3):
    start = time.time()
    data_df = pd.read_csv('data.txt', sep=r'\s\|\|\s')
    duration.append(time.time() - start)
    del data_df

final_time_pd = sum(duration) / float(len(duration))
print('Average time for 3 runs is {} sec'.format(final_time_pd))

>>> Average time for 3 runs is 12.120 sec

import time
import modin.pandas as pd

# Modin: only the import line changes
duration = []
for i in range(3):
    start = time.time()
    data_df = pd.read_csv('data.txt', sep=r'\s\|\|\s')
    duration.append(time.time() - start)
    del data_df

final_time_pd = sum(duration) / float(len(duration))
print('Average time for 3 runs is {} sec'.format(final_time_pd))

>>> Average time for 3 runs is 6.515 sec
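Since the two scripts above differ only in their import, the timing loop can be factored into one helper that takes the module as a parameter. This is just a sketch of that idea; the time_read_csv name and its arguments are my own, not part of either library:

import time
import pandas
import modin.pandas

def time_read_csv(pd_module, path, runs=3):
    # Average the wall-clock time of read_csv over several runs
    durations = []
    for _ in range(runs):
        start = time.time()
        df = pd_module.read_csv(path, sep=r'\s\|\|\s')
        durations.append(time.time() - start)
        del df
    return sum(durations) / len(durations)

print('pandas: {:.3f} sec'.format(time_read_csv(pandas, 'data.txt')))
print('Modin:  {:.3f} sec'.format(time_read_csv(modin.pandas, 'data.txt')))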

Clearly, Modin wins in this case. Let's try another dataset.

Dataset 2

Size: 990MB

#!/usr/bin/env python