I’ve been a long-time user of Dublin Bikes since I came here over 5 years ago, and like many other users I have a mental model of which stations are busy or quiet at various times of the day. My closest station is Portobello and I’ve found that bikes can be fairly hard to come by there during mid-morning, but as I walk into town my chances of finding one increase substantially. I wanted to understand this in more detail and find out whether there are a handful of behavioural types that all stations could be categorised into, and how these might vary around the city. The bikes are used mainly for commuting to and from work, so it seems natural that some sort of spatial pattern would occur. As you can see from the map above, it turns out that we can classify the stations into three different behavioural types, and we get some pretty interesting results!

In this post I’ll explain more about what I did and what these patterns are, as well as some of the technical background. Because I’m aiming this post at general readers as well as more technical folk, I will talk about the findings first and save the technical methodology for further down. You can also click here for the full interactive map shown above and here for the GitHub repo where I’ve made my code and some example data openly available.

What are the patterns and how were they found?

I collected data every 2 minutes from a public API from January 2017 to August 2017 to build up a decently sized historical data set, from which we can work out a typical weekday usage profile for each of the 108 stations. Averaging over this many days smooths out the effects of rainy days, cold days, special events etc. From these profiles we can then find distinct behavioural patterns and classify each station accordingly. To do this I used a method called k-means clustering, which starts from the assumption that the data can be split into a fixed number, ‘k’, of categories. The algorithm then tries to find the ‘k’ average patterns that best describe this split. In our case I chose k=3 because it gives aesthetically pleasing results, although methods such as silhouette scores are a more rigorous way of choosing k. Each station is then categorised according to whichever of the three profiles it matches most closely. We are free to choose a higher value of k, and we would expect the behavioural profiles to fit the actual data better, but the results become harder to interpret as the number of possible categories grows. I hope to put online a talk I gave at PyData Dublin in October 2017 which explains this in more technical detail.
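For the more technical readers, the idea can be sketched in a few lines of plain Python. This is only an illustration and not the code from the repo: the three-point toy profiles are made up, and the farthest-point initialisation is a simplification I have assumed here to keep the example deterministic (libraries such as scikit-learn default to k-means++ instead).

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_profile(profiles):
    """Point-by-point average of a list of profiles."""
    return [sum(values) / len(values) for values in zip(*profiles)]


def kmeans(profiles, k, iterations=100):
    # Assumed initialisation: start from the first profile, then repeatedly
    # add the profile farthest from all centroids chosen so far.
    centroids = [profiles[0]]
    while len(centroids) < k:
        centroids.append(
            max(profiles, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )

    labels = None
    for _ in range(iterations):
        # Assignment step: each profile joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in profiles
        ]
        if new_labels == labels:  # nothing moved, so we have converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(profiles, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_profile(members)
    return labels, centroids


# Toy "percentage full" profiles at morning, midday and evening (invented).
profiles = [
    [90, 20, 85], [85, 25, 90],  # full overnight, empty during the day
    [15, 90, 20], [20, 85, 15],  # empty overnight, full during the day
    [50, 55, 50], [55, 50, 55],  # roughly steady all day
]
labels, centroids = kmeans(profiles, k=3)
```

The returned centroids play the role of the archetypal profiles, and the labels say which archetype each station is closest to.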

I’ve colour coded each of the distinct behaviours found using this method and plotted them in the diagram below. Recall that these are found using the average weekday usage for each station, so time of day is along the x-axis. I’ve also used the percentage capacity of each station instead of the raw number of bikes, to account for the fact that some stations have a higher capacity than others.
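As a rough sketch of how one of these profiles could be computed, the snippet below averages the percentage-full value per time-of-day bin over weekdays only. The record layout (timestamp, bikes available, capacity), the function name and the 10-minute bin width are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime


def weekday_profile(readings, bin_minutes=10):
    """Average percentage-full per time-of-day bin, weekdays only.

    `readings` is an iterable of (timestamp, bikes_available, capacity)
    tuples for ONE station; this layout is an assumption.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for ts, bikes, capacity in readings:
        if ts.weekday() >= 5 or capacity == 0:  # skip Sat/Sun and bad rows
            continue
        bin_index = (ts.hour * 60 + ts.minute) // bin_minutes
        totals[bin_index] += 100.0 * bikes / capacity
        counts[bin_index] += 1
    return {b: totals[b] / counts[b] for b in sorted(counts)}


# Invented readings for a single 20-stand station.
readings = [
    (datetime(2017, 3, 6, 8, 0), 10, 20),   # Monday, 50% full
    (datetime(2017, 3, 7, 8, 4), 5, 20),    # Tuesday, 25% full
    (datetime(2017, 3, 11, 8, 0), 20, 20),  # Saturday, ignored
]
profile = weekday_profile(readings)
```

Working in percentages means a 15-stand station and a 40-stand station that both empty out at 9am end up with similar profiles.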

Average weekday usage for the three types of stations

It’s straightforward to see that stations most closely fitting the green line are typically where people pick up bikes in the morning to commute into the city, dropping them off at blue stations. Naturally we’d expect green stations to be found in residential areas and blue stations in the city centre. In the evening the opposite occurs as people head back home from work. Red stations have a steady supply over the day and don’t fit in with either of the other two behaviours.

When we plot the stations on a map and colour code them by category a really interesting picture emerges, as shown in the top image (or click here to open a tab for the interactive map). Even though the stations are characterised only by their usage patterns, the result is a high degree of spatial clustering. While this is what you’d expect to see, I was surprised by just how clear the clusters really were: the blue stations are focussed around the Dublin 2 area where a lot of business offices are based, bordered by the in-between reds, with green stations outside the city centre in more residential areas.

Methodology

For data collection I wrote a short Python script which ran on my Raspberry Pi, querying the Dublin Bikes API every 2 minutes and appending the results to a daily CSV file, then dumping this CSV to an S3 bucket every night. I ran this from January 2017 until late August, when I finally remembered the project idea again.
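A collector along those lines might look something like the sketch below. It is not the script from the repo: the endpoint and field names are what I would assume from the JCDecaux API that serves Dublin Bikes data, the file naming is hypothetical, and the nightly S3 upload is left out.

```python
import csv
import json
import time
import urllib.request
from datetime import datetime, timezone
from pathlib import Path

# Assumed endpoint and field names (JCDecaux station feed).
API_URL = "https://api.jcdecaux.com/vls/v1/stations?contract=Dublin&apiKey={key}"
FIELDS = ["timestamp", "number", "available_bikes", "bike_stands"]


def fetch_stations(api_key):
    """Download the current state of every station as a list of dicts."""
    with urllib.request.urlopen(API_URL.format(key=api_key), timeout=30) as resp:
        return json.load(resp)


def append_snapshot(stations, out_dir, now=None):
    """Append one row per station to today's CSV, writing a header if new."""
    now = now or datetime.now(timezone.utc)
    path = Path(out_dir) / f"bikes_{now:%Y-%m-%d}.csv"  # hypothetical naming
    is_new = not path.exists()
    with path.open("a", newline="") as f:
        # extrasaction="ignore" drops any API fields we are not storing.
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        if is_new:
            writer.writeheader()
        for station in stations:
            writer.writerow({"timestamp": now.isoformat(), **station})
    return path


def run(api_key, out_dir, interval_seconds=120):
    """Poll forever, one snapshot every two minutes."""
    while True:
        append_snapshot(fetch_stations(api_key), out_dir)
        time.sleep(interval_seconds)
```

Keeping the fetch and the CSV writing in separate functions makes the writing part easy to check with a fake payload, without touching the network.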

As mentioned previously I then calculated average weekday profiles for each of the 108 stations, then used k-means clustering to determine the three archetypal behavioural categories with which I would categorise each station. I used Euclidean distance for determining similarity of the different time series, which is a bit simplistic and not generally recommended for time series data. I figured that the problem is fairly simple with only 3 behavioural types and despite this assumption I’m happy with my results. More advanced methods like dynamic time warping could be used if a more robust approach is needed.
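To illustrate the difference between the two distance measures, here is a small pure-Python sketch comparing Euclidean distance with a textbook dynamic time warping implementation. The two toy series are invented, and this is not code I used for the results above.

```python
import math


def euclidean(a, b):
    """Point-by-point distance: compares values at the same time index only."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(a, b):
    """Classic dynamic time warping with an absolute-difference step cost.

    cost[i][j] holds the cheapest alignment of a[:i] with b[:j].
    """
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # stretch b
                cost[i][j - 1],      # stretch a
                cost[i - 1][j - 1],  # advance both
            )
    return cost[len(a)][len(b)]


# Same shape, shifted by one time step (invented data).
early_peak = [0, 0, 100, 0, 0, 0]
late_peak = [0, 0, 0, 100, 0, 0]

euclidean_gap = euclidean(early_peak, late_peak)
dtw_gap = dtw(early_peak, late_peak)
```

Euclidean distance treats the two shifted peaks as far apart, while DTW can stretch the time axis so that they line up. Whether that tolerance to shifts is desirable depends on whether the timing of a peak is itself part of the behaviour you want to capture.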

The mapping was done using the Python library Folium which acts as a wrapper to the JavaScript library Leaflet. It’s fairly painless to use and it seems to stand out among the other Python mapping libraries as it can output into HTML which can then be embedded in websites or hosted standalone.
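A minimal sketch of the mapping step is shown below. The station dictionaries, the colour mapping, the approximate city-centre coordinates and the output file name are all assumptions for illustration, and Folium is imported inside the function so the helper above it can be used without the library installed.

```python
# Hypothetical mapping from cluster label to marker colour.
CLUSTER_COLOURS = {0: "green", 1: "blue", 2: "red"}


def marker_specs(stations, labels):
    """Pair each station with the colour of its cluster."""
    return [
        {
            "location": [s["lat"], s["lng"]],
            "popup": s["name"],
            "color": CLUSTER_COLOURS[label],
        }
        for s, label in zip(stations, labels)
    ]


def build_map(specs, centre=(53.3498, -6.2603), out_file="stations.html"):
    """Draw one circle per station and save the map as standalone HTML."""
    import folium  # third-party; only needed when actually drawing the map

    station_map = folium.Map(location=list(centre), zoom_start=13)
    for spec in specs:
        folium.CircleMarker(
            location=spec["location"],
            popup=spec["popup"],
            color=spec["color"],
            radius=6,
            fill=True,
            fill_color=spec["color"],
        ).add_to(station_map)
    station_map.save(out_file)
    return station_map


# Invented stations and cluster labels.
stations = [
    {"name": "Station A", "lat": 53.33, "lng": -6.265},
    {"name": "Station B", "lat": 53.344, "lng": -6.25},
]
specs = marker_specs(stations, labels=[0, 1])
```

Calling `build_map(specs)` would then write an HTML file that can be embedded in a page or hosted on its own.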

In terms of future work there’s plenty more that could be done with the data, such as looking at differences between weekdays and the effects of weather. I collected weather data for Dublin at the same time as querying the bikes API to build the dataset used for this post, so it’d be interesting to look at its effect on the usage patterns and perhaps build a predictive model for each station. Models such as ARIMAX, which can take weather in as an external input, should be able to perform well on such a problem.

Anyway, that’s it for now. Hopefully this was interesting for fellow Dublin Bikes users and technical people alike. Click here for the project GitHub repo or hit me up on Twitter.