Lately I've been having fun exploring data sets from the Chicago open data portal. I think it's great that the city is providing access to government data and encouraging people to explore it. One of the first data sets I looked at involved ridership on the public buses. It turns out it's very well documented: you can look at how many people rode buses on each route every day for the past several years. I downloaded a simple CSV file here and began exploring it. If you're curious yourself, here is a link to similar data that might be useful. There is even a live-updating API for stop-by-stop data for each bus, which I haven't tried out yet. Here's an interesting blog post I found about it, which describes a project done by someone in the Data Science for Social Good fellowship. It's an interesting read if you, too, care about the public bus system in the City of Chicago!

After exploring the dataset and training a few basic models to try to predict ridership, I began to wonder what other external features might be useful for such a prediction. Maybe I could scrape some information about public gatherings like sporting events to predict overcrowding on routes leading to stadiums? Perhaps some buses will have fewer riders than average on public school holidays? I decided to try to determine whether people ride the bus more when it's nice outside. If you're looking for me and it's -20F out (thanks, polar vortex), don't bother checking buses because I'll absolutely be at home.

In this blog post, I will demonstrate how to clean and merge two data sets from different sources. Here I will be combining historical daily bus ridership with daily weather data. I'll use timestamps to join the tables together, and then I'll search for correlations between the two data sets.

The Question

Does ridership on Chicago public buses correlate with air temperature?

The Data

The ridership data is publicly accessible through the Chicago open data portal, and (obviously) there's a plethora of historical weather data out there to be had. I chose to get my data from the National Oceanic and Atmospheric Administration (NOAA) website. You can choose which features interest you (air temperature, precipitation type, precipitation amount, etc.), select the date range and location of interest, and download the data for free. I downloaded my own CSV for the Chicago area to include in this analysis.

In this quick study, I'll be using numpy, matplotlib, and pandas. With the two datasets stored in pandas dataframes, there are some nice SQL-esque join functions that pandas offers.
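To give a feel for what such a join looks like, here is a minimal sketch using tiny made-up tables in place of the real CSV files. The column names (`route`, `date`, `rides`, `DATE`, `TMAX`) and all of the values are assumptions for illustration only; the actual files may be laid out differently.

```python
import pandas as pd

# Made-up stand-in for the ridership CSV (column names and values are assumptions).
ridership = pd.DataFrame({
    "route": ["9", "9", "77", "77"],
    "date": ["01/01/2014", "01/02/2014", "01/01/2014", "01/02/2014"],
    "rides": [12000, 21000, 8000, 15000],
})

# Made-up stand-in for the weather CSV (column names and values are assumptions).
weather = pd.DataFrame({
    "DATE": ["2014-01-01", "2014-01-02", "2014-01-03"],
    "TMAX": [25.0, 18.0, 5.0],
})

# The two sources write dates differently, so parse both into real timestamps
# before joining; otherwise the string keys would never match.
ridership["date"] = pd.to_datetime(ridership["date"], format="%m/%d/%Y")
weather["date"] = pd.to_datetime(weather["DATE"], format="%Y-%m-%d")

# SQL-style inner join on the shared date column: each ridership row picks up
# that day's temperature, and days present in only one table are dropped.
merged = ridership.merge(weather[["date", "TMAX"]], on="date", how="inner")
print(merged)
```

An inner join is used here so that only days present in both tables survive. Passing `how="left"` instead would keep every ridership row and leave the temperature missing where no weather record exists.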

The Analysis

The analysis in this post is quite simple. I want to compare public bus ridership on different routes with air temperature, so I'll pick a metric ($R^2$ from a linear fit, for example) to quantify how strongly the two are correlated. This blog post is mostly meant to demonstrate how to merge two data sets from different sources (ridership and weather). That said, the results are pretty interesting, too!
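As a rough sketch of what an $R^2$ metric from a linear fit could look like, the helper below fits a straight line with `numpy.polyfit` and compares the residual variance to the total variance. The function name `r_squared` and the sample numbers are hypothetical and are not taken from the real data.

```python
import numpy as np

def r_squared(x, y):
    """R^2 of a least-squares straight-line fit of y against x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Degree-1 polynomial fit returns the slope first, then the intercept.
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)        # variation the line fails to explain
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variation in y
    return 1.0 - ss_res / ss_tot

# Made-up temperatures (F) and daily ride counts for a single route.
temps = [10, 30, 50, 70, 90]
rides = [9000, 11000, 12500, 14000, 13500]
print(r_squared(temps, rides))
```

For a straight-line fit with an intercept, this value equals the square of the Pearson correlation coefficient, so `np.corrcoef(temps, rides)[0, 1] ** 2` should give the same number. A value near 1 means temperature explains most of the day-to-day variation in ridership, and a value near 0 means it explains very little.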