Macdougal St Station

Dia&Co recently outgrew its old office and moved across town to a larger space. While this sounds like the best problem for a startup to experience (we keep growing!), I was anxious.

Was I worried that with 4x the space we might lose the tight culture that comes with close quarters? No, we have been persistent in our commitment to defining and acting on our core values, and these will stand the test of a larger office.

Was I worried that 6x the conference rooms would be insufficient for our rapidly growing teams? Eh, I think there must be a physical law that says that startup conference rooms are scale invariant — they will always be booked regardless of the number of employees.

Was I worried that the increased sunlight would make it harder to concentrate? Come to NYC in the winter — I will happily welcome that problem.

No, I was worried about my new commute. As a daily user of New York City’s bike sharing program, Citi Bike, I have first-hand experience with the laws of supply and demand in a hot economy. Any rider of the Citi Bike system knows that stations are liable to be full when you need to drop a bike off, and conversely stations are often empty when you really need to check out a bike. In the evenings, I would often have to pick up my bike at a station farther from the office because the closer station was empty. “What would the situation be like at our new office?”, I wondered.

Thankfully, we here at Dia&Co are well-versed in predicting demand in a hot economy. We run a reasonably lean warehouse, so forecasting inventory levels becomes especially important for satisfying a rapidly growing customer base. Let’s put our data science hats on and get to work.

Data: When in Doubt, Get it Yourself

Anybody who reads data science blog posts online is probably sick of hearing about analysis of Citi Bike data. Every couple months, the Citi Bike service releases public datasets of every ride taken on the system over some period of time. Unfortunately, this data is not sufficient to back out how the bike occupancy levels varied as a function of time at different stations across the city.

Thankfully, I have been collecting my own dataset for a couple of months now. Citi Bike has a public API which allows one to query the current number of available bikes and docks at each station in the city. I have a small EC2 box which has been hitting the API every 2 minutes since September 18th, 2016 and storing the data in a Postgres database. As of February 14th, there are over 70 million rows in this database. The code for generating this dataset is available here.
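The collector itself mostly boils down to flattening each API response into rows ready for a bulk insert. A minimal sketch of that step, using a made-up payload shaped roughly like the station feed (the field names and station IDs here are illustrative, not the exact API schema):

```python
from datetime import datetime, timezone

# Synthetic payload mimicking the Citi Bike station feed
# (field names and IDs are illustrative, not the exact API schema).
payload = {
    "stations": [
        {"station_id": 360, "available_bikes": 5, "available_docks": 25},
        {"station_id": 128, "available_bikes": 0, "available_docks": 30},
    ]
}

# Timestamp recorded at query time; this becomes the execution_time column.
execution_time = datetime(2016, 9, 18, 8, 0, tzinfo=timezone.utc)

# Flatten the payload into tuples ready for a bulk INSERT into Postgres.
rows = [
    (execution_time, s["station_id"], s["available_bikes"], s["available_docks"])
    for s in payload["stations"]
]
```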

Dasking the DataFrame to Save Some RAM

I had previously dumped the whole database to a CSV and transported it to my local computer. Uncompressed, the dataset is 8.6 GB on disk. While my 16 GB MacBook Pro could hold everything in memory, between PyCharm, Slack, and any data manipulations, I was worried about running out of memory.

This is where dask comes in. Dask does a lot of things, from parallelized big data analytics to distributed task scheduling, and it all happens in Python. Likely the most familiar use-case for data scientists is to run pandas-esque operations on data which does not fit in memory. Dask integrates quite nicely with pandas, so one can reuse much of the same code.

Disclaimer: I should have just queried the data when it was in the Postgres database, but then I wouldn’t have had a chance to talk about dask!

We’ll start by reading the Citi Bike dataset into a dask DataFrame and pulling out the old and new office stations’ data as regular pandas DataFrames. We’ll call these old and new stations howard and macdougal, respectively, for the streets on which they reside. Dask evaluates computations "lazily" (like Spark), so the real computation does not happen until we actually pull out the relevant stations.

Visualizing Occupancy

With the two stations’ data loaded into memory as pandas DataFrames, we can start to play and get a feel for the data. To start, let’s look at how the number of docks available at the old, Howard bike station varied across the full time of the dataset.

Well that looks fairly incomprehensible.

The data is currently indexed by the execution_time which is the time that I pinged the Citi Bike API. We'll go ahead and resample the time period of the data and also get rid of a bunch of columns that are not relevant to this analysis. I'll then try averaging each day's data in order to get a simple picture of what a typical day looks like for the station. We'll ignore weekends (because I'm not working then!), and I'll also faintly plot each day's data in order to get a feeling for the variance.
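A sketch of that resample-and-average step on synthetic dock counts (the data, the sinusoidal daily profile, and the date range here are all invented for illustration):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Two synthetic weeks of 2-minute dock counts standing in for one station.
idx = pd.date_range("2016-10-03", "2016-10-14 23:58", freq="2min")
rng = np.random.default_rng(0)
docks = pd.Series(
    15 + 10 * np.sin(2 * np.pi * (idx.hour * 60 + idx.minute) / 1440)
    + rng.normal(0, 1, len(idx)),
    index=idx,
    name="available_docks",
)

# Resample to 10-minute bins and drop weekends.
resampled = docks.resample("10min").mean()
frame = resampled[resampled.index.dayofweek < 5].to_frame()
frame["hour"] = frame.index.hour + frame.index.minute / 60
frame["date"] = frame.index.date

# Average across days to get the "typical day" profile.
typical = frame.groupby("hour")["available_docks"].mean()

fig, ax = plt.subplots()
# Faint line per day to show the variance around the average.
for _, day in frame.groupby("date"):
    ax.plot(day["hour"], day["available_docks"], color="gray", alpha=0.2)
ax.plot(typical.index, typical.values, linewidth=2)
ax.set_xlabel("hour of day")
ax.set_ylabel("available docks")
```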

We can see that both the old office Howard station and the new office Macdougal station are commuter-destination bike racks. How do we know this? They fill up during work hours (the number of available bikes increases) and empty out during the evening hours. Ideally, my new office would represent a reverse commute. That looks not to be the case!

We should expect the opposite behavior from a commuter origin station. We can easily find one in Alphabet City, a neighborhood with few offices and a lack of subway stations.

Old vs. New

Back to the problem at hand — would I run into more issues picking up a bike at the new office versus the old one? To turn this question into math, I will say that my goal is to determine the probability that I will not be able to find a bike when leaving work on a given day. There are a number of ways that we could answer this with the data. Motivated by working at a fast growing startup, I opted for a method with only enough complexity to answer my question with the necessary accuracy.

I tend to leave work sometime between 6 and 7 PM. Let’s look at what the number of available docks at the old and new office stations looked like around that time range on a single day:
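Slicing out that window is a one-liner with pandas' between_time. A sketch on one synthetic evening of 2-minute samples (the counts are invented):

```python
import pandas as pd

# One synthetic evening of 2-minute samples for a single station,
# with the bike count steadily draining as commuters leave.
idx = pd.date_range("2016-10-05 17:30", "2016-10-05 19:30", freq="2min")
bikes = pd.Series(range(len(idx), 0, -1), index=idx, name="available_bikes")

# Slice out the 6-7 PM window I care about (both endpoints inclusive).
evening = bikes.between_time("18:00", "19:00")
```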

This bodes well for the new office! We can see that during the entire period from 6–7 PM, there were available bikes at the new Macdougal station. Meanwhile, the old Howard station emptied out at 6:45 PM.

Now that we’ve seen what a typical day looks like, we can formulate a plan. First, I will say that my departure time is uniformly distributed between 6 and 7 PM (planning ahead is not my strongest suit). I am also going to make the assumption that the moment the station hits zero bikes available, it will stay like that for the rest of the time period from 6–7 PM. This is because the fluctuations around zero bikes are often quick and small, so my chance of grabbing a spot during this time is likely minimal and below the resolution of our measurement. Additionally, there are often spurious effects in the data when bikes and docks break (which happens often and ought to be the subject of a separate blog post).

With the above assumptions in place, we can easily calculate the probability that I will not be able to find a bike by finding all of the times between 6 and 7 PM when we assume the bike rack to be empty and dividing that by all of the times between 6 and 7 PM.
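Under those assumptions, the estimate reduces to a forward-propagated zero mask averaged over the window, then over days. A sketch on two invented days, one that empties mid-window and one that never does:

```python
import pandas as pd

def prob_no_bike(bikes: pd.Series) -> float:
    """Fraction of 6-7 PM samples with no bikes, assuming that once the
    station hits zero it stays empty for the rest of the window."""
    window = bikes.between_time("18:00", "19:00")
    empty = window.eq(0)
    # cummax propagates the first True forward: empty from then on.
    return empty.cummax().mean()

# Two synthetic days: one empties at 6:36 PM, one never empties.
day1 = pd.Series(
    [3, 2, 1, 0, 1, 0],
    index=pd.date_range("2016-10-05 18:00", periods=6, freq="12min"),
)
day2 = pd.Series(
    [5, 4, 4, 3, 3, 2],
    index=pd.date_range("2016-10-06 18:00", periods=6, freq="12min"),
)

# Average across days to estimate P(no bike when I leave work).
p = (prob_no_bike(day1) + prob_no_bike(day2)) / 2
```

Note that day1 counts a brief bounce back to 1 bike as still empty, which is exactly the "once zero, stays zero" assumption at work.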

There you have it — my anxiety was unwarranted! I am 1.8x less likely to not be able to find a bike at the new office versus the old. Score 1 for Macdougal.

What about arriving in the morning? We can ask the same question of whether or not I will be able to park my bike between 9 and 10 AM. Unfortunately, it looks like the situation is worse at the new office.

While a solution would be to get to work earlier, I think I’d rather park at a station farther away.

Do you like diving into nonstandard datasets and extracting insights? Do you have a data project that you have been working on which you are excited about? We’re hiring and would love to hear about it!