tl;dr: It's ciso8601.

I have this Python app that I'm working on. It has a cron job that downloads a listing of every single file in an S3 bucket. AWS S3 publishes a manifest of .csv.gz files: you download the manifest, and for each hashhashash.csv.gz listed in it you download that file. My program then reads these CSV files and ignores certain rows that are beyond the retention period. It basically parses the ISO-formatted datetime string, compares it with a cutoff datetime.datetime instance, and can quickly either skip the row or let it through for full processing.
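To make the shape of the problem concrete, here's a minimal sketch of that filtering step. The cutoff value and function names are illustrative; the timestamp being in column 3 matches the benchmark further down:

```python
import csv
import datetime
import gzip

# Illustrative cutoff; in the real job this would come from the retention policy.
CUTOFF = datetime.datetime(2017, 1, 1)


def rows_within_retention(lines, cutoff=CUTOFF):
    """Yield only the CSV rows whose timestamp (column 3) is at or after the cutoff."""
    for row in csv.reader(lines):
        # e.g. '2017-09-21T12:54:24.000Z' -> datetime.datetime(2017, 9, 21, 12, 54, 24)
        when = datetime.datetime.strptime(row[3], '%Y-%m-%dT%H:%M:%S.%fZ')
        if when >= cutoff:
            yield row


def process_manifest_file(path):
    # The .csv.gz files can be streamed row by row without unpacking to disk first.
    with gzip.open(path, 'rt') as f:
        for row in rows_within_retention(f):
            ...  # full processing happens here
```

That `strptime` call in the hot loop is exactly the expensive part this post is about.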

At the time of writing, it's roughly 160 .csv.gz files weighing a total of about 2GB. In total it's about 50 million rows of CSV. That means 50 million datetime parsings.

I admit, this cron job doesn't have to be super fast and it's OK if it takes an hour, since it's just a cron job running on a server in the cloud somewhere. But I would like to know: is there a way to speed up the date parsing? Because doing it in Python 50 million times per day feels expensive.

Here's the benchmark:

```python
import csv
import datetime
import random
import statistics
import time

import ciso8601


def f1(datestr):
    return datetime.datetime.strptime(datestr, '%Y-%m-%dT%H:%M:%S.%fZ')


def f2(datestr):
    return ciso8601.parse_datetime(datestr)


def f3(datestr):
    return datetime.datetime(
        int(datestr[:4]),
        int(datestr[5:7]),
        int(datestr[8:10]),
        int(datestr[11:13]),
        int(datestr[14:16]),
        int(datestr[17:19]),
    )


# Assertions
assert f1('2017-09-21T12:54:24.000Z').strftime('%Y%m%d%H%M') == \
    f2('2017-09-21T12:54:24.000Z').strftime('%Y%m%d%H%M') == \
    f3('2017-09-21T12:54:24.000Z').strftime('%Y%m%d%H%M') == '201709211254'

functions = f1, f2, f3
times = {f.__name__: [] for f in functions}

with open('046444ae07279c115edfc23ba1cd8a19.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        func = random.choice(functions)
        t0 = time.perf_counter()  # high-resolution timer
        func(row[3])
        t1 = time.perf_counter()
        times[func.__name__].append((t1 - t0) * 1000)


def ms(number):
    return '{:.5f}ms'.format(number)


for name, numbers in times.items():
    print('FUNCTION:', name, 'Used', format(len(numbers), ','), 'times')
    print('\tBEST  ', ms(min(numbers)))
    print('\tMEDIAN', ms(statistics.median(numbers)))
    print('\tMEAN  ', ms(statistics.mean(numbers)))
    print('\tSTDEV ', ms(statistics.stdev(numbers)))
```

Yeah, it's a bit ugly but it works. Here's the output:

```
FUNCTION: f1 Used 111,475 times
	BEST   0.01300ms
	MEDIAN 0.01500ms
	MEAN   0.01685ms
	STDEV  0.00706ms
FUNCTION: f2 Used 111,764 times
	BEST   0.00100ms
	MEDIAN 0.00200ms
	MEAN   0.00197ms
	STDEV  0.00167ms
FUNCTION: f3 Used 111,362 times
	BEST   0.00300ms
	MEDIAN 0.00400ms
	MEAN   0.00409ms
	STDEV  0.00225ms
```

In summary:

- f1: 0.01300 milliseconds
- f2: 0.00100 milliseconds
- f3: 0.00300 milliseconds

Or, if you compare to the slowest (f1):

- f1: baseline
- f2: 13 times faster
- f3: about 4 times faster
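Spread over the whole job, those per-call numbers add up. A back-of-envelope calculation using the best-case timings above (illustrative only, ignoring I/O and the rest of the processing):

```python
rows = 50_000_000  # roughly the daily row count

# Best-case per-call timings from the benchmark, in milliseconds.
for name, ms_per_call in [('f1', 0.013), ('f2', 0.001), ('f3', 0.003)]:
    total_seconds = rows * ms_per_call / 1000
    print(f'{name}: ~{total_seconds:,.0f} s of parsing per day')
    # f1 ≈ 650 s, f2 ≈ 50 s, f3 ≈ 150 s
```

So just swapping in ciso8601 would shave roughly ten minutes of pure parsing time off each daily run.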

UPDATE

If you know with confidence that you don't want or need timezone-aware datetime instances, you can use ciso8601.parse_datetime_unaware instead.

From the README:

"Please note that it takes more time to parse aware datetimes, especially if they're not in UTC. If you don't care about time zone information, use the parse_datetime_unaware method, which will discard any time zone information and is faster."

In my benchmark the strings I use look like this: 2017-09-21T12:54:24.000Z. I added another function to the benchmark that uses ciso8601.parse_datetime_unaware and it clocked in at exactly the same time as f2.
