Recently we ran into crippling scalability issues at my workplace. By crippling I mean our billing feed stopped! I work for a company that provides infrastructure as a service (IaaS). Our platform provides on-demand computing, including virtualized servers and network services. As part of this offering, one of the things we provide to our customers is a firewall.

Now, the way we calculate bandwidth is by using Python scripts to make a pexpect call to the firewalls every 5 minutes, getting information about how much data was transferred, and writing the usage out to RRD files. We use cron to schedule this process. Here are the steps we used to take for measuring bandwidth usage (as you can see, this is sequential):

1. Get all firewall rows from the db.

2. Loop over the rows, and for each row:

   a. Log on to the firewall (pexpect).

   b. Get the usage information (this is a string).

   c. Parse this information and write it to the corresponding RRD file.
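The per-firewall work above can be sketched roughly as follows. This is just an illustration, not our production code: `fetch_usage`, `write_rrd`, and the usage-string format are placeholders I made up for the example.

```python
import re

def fetch_usage(row):
    """Placeholder for the pexpect session: log on to the firewall
    and run the command that reports transferred data."""
    return "bytes transferred: 123456"  # made-up output format

def parse_usage(raw):
    """Pull the byte count out of a line like 'bytes transferred: 123456'.
    The format is an assumption; real devices will differ."""
    match = re.search(r"bytes transferred:\s*(\d+)", raw)
    if match is None:
        raise ValueError("unrecognized usage output: %r" % raw)
    return int(match.group(1))

def write_rrd(row, usage):
    """Placeholder for the rrdtool update against this row's RRD file."""
    print("firewall %s: %d bytes" % (row["id"], usage))

def measure_all(rows):
    # The original, strictly sequential version: one firewall at a time.
    for row in rows:
        write_rrd(row, parse_usage(fetch_usage(row)))
```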

This worked well for about 2 years, but we have been growing steadily, and as a consequence the number of devices we manage has been increasing.

A couple of months ago we discovered that the run time of this process had increased to about 7 minutes and 30 seconds. Since we execute this process every 5 minutes, we potentially had two different processes working on the same data at a given time; in other words, a potential race condition. One of my colleagues, in order to mitigate this quickly, introduced a lock file in the script. Basically, she made sure that if one process was running, another would not be able to start. She was able to get the data flowing again. However, this increased the time between two subsequent readings from 5 minutes to 10 minutes.

Many of you will have noticed that this is an embarrassingly parallel problem (because measuring, processing, and writing the data for one firewall is independent of any other firewall). Here is where we drew some inspiration from (C++'s) OpenMP, especially the construct `#pragma omp parallel for`. For those of you who have not worked with it before, it is basically a neat way to parallelize loops. The way we refactored this was to divide the list of rows into chunks and execute each chunk in a different process (yes, it is a very degenerate case of the parallel-for construct). At this time 3 processes are sufficient for us. By doing this we were able to reduce the running time of the process to about 1 minute and 22 seconds!
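In Python, this chunk-and-fork pattern can be sketched with the standard library's `multiprocessing.Pool`. Again, this is an illustration of the approach under assumed names (`process_chunk` stands in for the old sequential loop), not our exact script.

```python
from multiprocessing import Pool

def chunk(rows, n):
    """Split rows into at most n roughly equal, contiguous chunks."""
    size = (len(rows) + n - 1) // n  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def process_chunk(rows):
    """Each worker runs the old sequential loop over its own slice:
    log on, fetch usage, parse, write the RRD, as before."""
    for row in rows:
        pass  # real per-firewall work goes here

if __name__ == "__main__":
    rows = list(range(10))  # stand-in for the firewall rows from the db
    with Pool(processes=3) as pool:
        pool.map(process_chunk, chunk(rows, 3))
```

Because each chunk touches a disjoint set of firewalls and RRD files, the workers need no shared state and no locking between them, which is exactly what makes the problem embarrassingly parallel.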