It’s 9 AM on a Tuesday, and there’s a horrific sign blocking the entrance to the weight room — “One In, One Out.” When this happens, it means the campus gym is so crowded that staff can only let you in once someone else walks out. How could this happen today? I thought I was safe this early in the morning. The gym isn’t supposed to be packed right now!

The UC Berkeley weight room located in the RSF (Recreational Sports Facility) has a maximum capacity of about 200 students at any given time. This presents a crowding problem for a university of some 27,000 undergrads. In 2014, Ollie O’Donnell and I set out to alleviate this problem by creating an app that would tell you exactly how crowded the gym is before you go.

What can we learn from all the data gathered from the gym over the past year, and what kinds of predictions can we make about how crowded the gym will be in the future? Machine learning is the perfect tool for this task as it can incorporate many different features into the answers it gives, from time of day and temperature to whether or not it’s a holiday.

Full disclaimer before moving on: I am by no means an expert in machine learning. It’s quite new to me and, like these algorithms, I’m learning more every day. Please feel free to correct me on any mistakes I inevitably make so I can adjust my weight vector.

Great machine learning intro here, if you’re new to this.

The Data

The column is_start_of_semester is cut off, but also part of the data.

Over the past year we collected more than 29,000 people counts. Using Pandas, I merged those counts with some other helpful variables like weather and holiday information. I fetched the weather data using a handy API called Dark Sky (formerly Forecast.io). The merged table was a great format for reading historical data, but I wanted to be able to use all this history to predict the future as well.
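To give a feel for that merge step, here is a minimal sketch in Pandas. The values are made up, and every column name except the idea of a holiday flag (`timestamp`, `number_people`, `temperature`, `hour`, `is_holiday`) is an assumption for illustration, not necessarily what our real dataset uses.

```python
import pandas as pd

# Hypothetical people counts (made-up values, for illustration only)
counts = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2015-09-01 09:00", "2015-09-01 17:00", "2015-09-07 09:00",
    ]),
    "number_people": [45, 120, 30],
})

# Hypothetical weather readings taken at the same timestamps
weather = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2015-09-01 09:00", "2015-09-01 17:00", "2015-09-07 09:00",
    ]),
    "temperature": [61.0, 72.5, 64.2],
})

# Hypothetical holiday calendar (Labor Day 2015)
holidays = pd.to_datetime(["2015-09-07"])

# Left join keeps every people count, even if a weather reading is missing
data = counts.merge(weather, on="timestamp", how="left")

# Derive extra features from the timestamp
data["hour"] = data["timestamp"].dt.hour
data["is_holiday"] = data["timestamp"].dt.normalize().isin(holidays).astype(int)

print(data)
```

A left join felt like the safer choice for a sketch like this, since a gap in the weather history then shows up as a missing temperature instead of silently dropping a people count.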

Machine learning models are great for learning from large amounts of data. The general idea is to train your algorithm on about 70% of the data, test it on the other 30% to judge how accurate it is, and then use your trained model to make predictions. For a regression problem like predicting people counts, the usual test score is R² (the coefficient of determination), which measures how much of the variation in the actual counts your model’s predictions explain. A score of 1.0 means your model predicts perfectly, 0 means it does no better than always guessing the average count, and a negative score means it does even worse than that.
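Here is a rough sketch of that 70/30 workflow using scikit-learn. The data is synthetic (random hours and temperatures with an invented relationship to the count), and `LinearRegression` is just a stand-in model chosen for brevity, so treat the numbers as illustrative only. For scikit-learn regressors, the `score` method returns R².

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in features: hour of day and temperature (not real gym data)
hour = rng.integers(6, 24, size=1000)
temperature = rng.uniform(45, 85, size=1000)
X = np.column_stack([hour, temperature])

# Invented relationship plus noise, purely so the model has something to learn
y = 5.0 * hour + 0.5 * temperature + rng.normal(0, 10, size=1000)

# Hold out 30% of the rows for testing; train on the remaining 70%
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)

# R^2 on data the model has never seen
test_score = model.score(X_test, y_test)
print(f"train rows: {len(X_train)}, test rows: {len(X_test)}")
print(f"test R^2: {test_score:.3f}")

# Once trained, the model can predict a count for a new (hour, temperature) pair
prediction = model.predict(np.array([[9, 65.0]]))
```

Scoring on the held-out 30% matters because a model can look great on the rows it trained on while doing poorly on anything new.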

Attempt 1: Learning Alone

I started with my favorite language, Python, and found a handy ML library called scikit-learn. The library even lays out a map of which model to use given your objectives.