Blog moved to: handyfloss.net

Entry available at: http://handyfloss.net/2008.02/summary-of-my-python-optimization-adventures/

This is a follow-up to two previous posts. In the first one I spoke about saving memory by reading a file line by line instead of all at once, and in the second one I recommended delegating heavy lifting to Unix commands.

The script reads a host.gz log file from a given BOINC project (more precisely, one I got from MalariaControl.net, because it is a small project, so its logs are also smaller), and extracts how many computers are running the project and how much credit they are getting. The statistics are broken down by operating system (Windows, Linux, MacOS and other).
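For context, each host entry in the file contains (among many other fields) lines shaped roughly like the following. This is a hand-written excerpt to illustrate the structure, not copied from a real MalariaControl.net log:

```
<host>
    <total_credit>123.45</total_credit>
    <os_name>Windows XP</os_name>
</host>
```

The scripts below only ever look at the total_credit and os_name lines.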

Version 0

Here I read the whole file into RAM, then process it with Python alone. Running time: 34.1s.

```python
#!/usr/bin/python

import re
import gzip

credit = 0
os_list = ['win', 'lin', 'dar', 'oth']

stat = {}
for osy in os_list:
    stat[osy] = [0, 0]

# Process file:
f = gzip.open('host.gz', 'r')
for line in f.readlines():
    if re.search('total_credit', line):
        credit = float(re.sub('</?total_credit>', '', line.split()[0]))
    elif re.search('os_name', line):
        if re.search('Windows', line):
            stat['win'][0] += 1
            stat['win'][1] += credit
        elif re.search('Linux', line):
            stat['lin'][0] += 1
            stat['lin'][1] += credit
        elif re.search('Darwin', line):
            stat['dar'][0] += 1
            stat['dar'][1] += credit
        else:
            stat['oth'][0] += 1
            stat['oth'][1] += credit
f.close()

# Return output:
nstring = ''
cstring = ''
for osy in os_list:
    nstring += "%15.0f " % (stat[osy][0])
    try:
        cstring += "%15.0f " % (stat[osy][1])
    except:
        print osy, stat[osy]

print nstring
print cstring
```

Version 1

The only difference is a `for line in f:` instead of `for line in f.readlines():`. This saves a LOT of memory, but is slower. Running time: 44.3s.
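The memory difference can be seen in a minimal sketch (Python 3 syntax here, and a tiny hand-made stand-in for host.gz): iterating over the open file object yields one line at a time, so memory use stays flat, whereas `f.readlines()` builds the whole list of lines in RAM before the loop even starts.

```python
import gzip

# Write a tiny gzipped stand-in for host.gz:
with gzip.open('tiny.gz', 'wt') as f:
    f.write('<total_credit>12.5</total_credit>\n')
    f.write('<os_name>Linux</os_name>\n')

# Iterate lazily, line by line, instead of calling f.readlines():
count = 0
with gzip.open('tiny.gz', 'rt') as f:
    for line in f:  # only one line held in memory at a time
        count += 1

print(count)  # prints 2
```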

Version 2

In this version, I use precompiled regular expressions, and the time saving is noticeable. Running time: 26.2s.

```python
#!/usr/bin/python

import re
import gzip

credit = 0
os_list = ['win', 'lin', 'dar', 'oth']

stat = {}
for osy in os_list:
    stat[osy] = [0, 0]

pattern = r'total_credit'; match_cre  = re.compile(pattern).match
pattern = r'os_name';      match_os   = re.compile(pattern).match
pattern = r'Windows';      search_win = re.compile(pattern).search
pattern = r'Linux';        search_lin = re.compile(pattern).search
pattern = r'Darwin';       search_dar = re.compile(pattern).search

# Process file:
f = gzip.open('host.gz', 'r')
for line in f:
    if match_cre(line, 5):  # match starting at position 5, past the indentation
        credit = float(re.sub('</?total_credit>', '', line.split()[0]))
    elif match_os(line, 5):
        if search_win(line):
            stat['win'][0] += 1
            stat['win'][1] += credit
        elif search_lin(line):
            stat['lin'][0] += 1
            stat['lin'][1] += credit
        elif search_dar(line):
            stat['dar'][0] += 1
            stat['dar'][1] += credit
        else:
            stat['oth'][0] += 1
            stat['oth'][1] += credit
f.close()

# etc.
```
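The effect can be checked with a throwaway micro-benchmark (illustrative only, unrelated to the 26.2s figure above): a precompiled pattern's bound `.search` method skips the per-call pattern lookup that module-level `re.search` performs on every iteration.

```python
import re
import timeit

line = '    <os_name>Windows XP</os_name>'
search_os = re.compile(r'os_name').search  # bind the method once

# Same match, two ways; the precompiled version avoids repeated lookups:
t_module = timeit.timeit(lambda: re.search(r'os_name', line), number=100000)
t_compiled = timeit.timeit(lambda: search_os(line), number=100000)

print('module-level: %.3fs   precompiled: %.3fs' % (t_module, t_compiled))
```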

Version 3

Later I decided to use AWK to perform the heaviest part: parsing the big file to produce a second, smaller file that Python then reads. Running time: 14.8s.

```python
#!/usr/bin/python

import os
import re

credit = 0
os_list = ['win', 'lin', 'dar', 'oth']

stat = {}
for osy in os_list:
    stat[osy] = [0, 0]

pattern = r'Windows'; search_win = re.compile(pattern).search
pattern = r'Linux';   search_lin = re.compile(pattern).search
pattern = r'Darwin';  search_dar = re.compile(pattern).search

# Distill file with AWK:
tmp = 'bhs.tmp'
os.system('zcat host.gz | awk \'/total_credit/{printf $0}/os_name/{print}\' > ' + tmp)

# Process tmp file:
f = open(tmp)
for line in f:
    line = re.sub('>', '<', line)
    aline = line.split('<')
    credit = float(aline[2])
    os_str = aline[6]
    if search_win(os_str):
        stat['win'][0] += 1
        stat['win'][1] += credit
    elif search_lin(os_str):
        stat['lin'][0] += 1
        stat['lin'][1] += credit
    elif search_dar(os_str):
        stat['dar'][0] += 1
        stat['dar'][1] += credit
    else:
        stat['oth'][0] += 1
        stat['oth'][1] += credit
f.close()

# etc.
```
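The parsing trick deserves a standalone sketch, on a hand-made line shaped like the AWK output (a total_credit line glued to an os_name line). Replacing every `>` with `<` leaves `<` as the only delimiter, so a single `split()` exposes the fields at fixed positions; note that the exact indices depend on the whitespace in the real file.

```python
import re

# A made-up line in the shape AWK produces (no leading whitespace here):
line = '<total_credit>123.45</total_credit><os_name>Windows XP</os_name>'

# One substitution + one split recovers both fields:
aline = re.sub('>', '<', line).split('<')
credit = float(aline[2])
os_str = aline[6]

print(credit, os_str)  # prints: 123.45 Windows XP
```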

Version 4

Instead of using AWK, I decided to use grep, with the idea that nothing can beat this tool when it comes to pattern matching. I was not disappointed. Running time: 5.4s.

```python
#!/usr/bin/python

import os
import re

credit = 0
os_list = ['win', 'lin', 'dar', 'oth']

stat = {}
for osy in os_list:
    stat[osy] = [0, 0]

pattern = r'total_credit'; search_cre = re.compile(pattern).search
pattern = r'Windows';      search_win = re.compile(pattern).search
pattern = r'Linux';        search_lin = re.compile(pattern).search
pattern = r'Darwin';       search_dar = re.compile(pattern).search

# Distill file with grep:
tmp = 'bhs.tmp'
os.system('zcat host.gz | grep -e total_credit -e os_name > ' + tmp)

# Process tmp file:
f = open(tmp)
for line in f:
    if search_cre(line):
        line = re.sub('>', '<', line)
        aline = line.split('<')
        credit = float(aline[2])
    else:
        if search_win(line):
            stat['win'][0] += 1
            stat['win'][1] += credit
        elif search_lin(line):
            stat['lin'][0] += 1
            stat['lin'][1] += credit
        elif search_dar(line):
            stat['dar'][0] += 1
            stat['dar'][1] += credit
        else:
            stat['oth'][0] += 1
            stat['oth'][1] += credit
f.close()

# etc.
```

Version 5

I was not completely happy yet. Reading the man page, I discovered the -F flag for grep, and decided to use it. This flag tells grep that the pattern is a fixed string rather than a regular expression, so no regex processing has to be done at all. Using the -F flag, I further reduced the running time to 1.5s.
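No listing was given for this version; assuming it is otherwise identical to version 4, the only change would be the grep invocation itself:

```python
# Sketch of the version 5 change: same pipeline as version 4, but with
# -F so the two patterns are matched as fixed strings, not regexes.
tmp = 'bhs.tmp'
cmd = 'zcat host.gz | grep -F -e total_credit -e os_name > ' + tmp
# os.system(cmd)  # then parse bhs.tmp exactly as in version 4
print(cmd)
```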

[Figure: Running time vs. script version]