For three months I monitored a car sharing service operating in my city, collecting roughly 200,000 routes driven with it. I started purely for the sake of data acquisition and data visualization: with this data I could build one of those fancy visualizations that illustrate how people move across a map. I did try, and here you can see an animated example I made with this data, a few lines of JavaScript, and the Google Maps API.

This sample represents around 3 hours of traffic in July 2015.

After gathering and analysing this data, I started to ask myself whether there was more information, and more possibilities, lying within these moving dots.

Shifts per day of the week

Of course there was more information in there. You can easily grasp the daily usage of the service, the coverage of a certain area, the average distance covered in a shift, and so on. This bar chart shows a clear tendency to use the service more during the weekend. But I wanted to take a deeper look, especially to see whether some privacy flaw could be found.
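As a rough illustration of how a "shifts per day of the week" count can be produced, here is a minimal sketch. The timestamps and the function name `shifts_per_weekday` are made up for the example; they are not the collected data or the code behind the chart.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical sample: one ISO timestamp per shift start (not the real data).
shift_starts = [
    "2015-07-03T18:40:00",  # Friday
    "2015-07-04T10:15:00",  # Saturday
    "2015-07-04T22:05:00",  # Saturday
    "2015-07-05T11:30:00",  # Sunday
    "2015-07-06T08:50:00",  # Monday
]

def shifts_per_weekday(timestamps):
    """Count how many shifts started on each day of the week."""
    counts = Counter()
    for ts in timestamps:
        # datetime.weekday() returns 0 for Monday up to 6 for Sunday.
        counts[WEEKDAYS[datetime.fromisoformat(ts).weekday()]] += 1
    return counts

weekday_counts = shifts_per_weekday(shift_starts)
for day in WEEKDAYS:
    print(day, weekday_counts[day])
```

Feeding these counts to any charting library gives a bar chart of the kind described here.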

The easiest attack I could think of was a kind of virtual car chase: following a car remotely. If you see somebody getting into one of these cars, you can just wait for it to reappear on the map to discover where they went. This is simple and easy to do. Let’s see if we can push it further.

I started to wonder whether cluster analysis on these shifts would make recurring route patterns emerge, and whether those patterns could be used to build a predictive model and expose privacy issues around drivers’ behaviours and habits.

So I tried to divide these routes into clusters, where the routes in each cluster share similar properties. In this case, I chose to group them by their start and end positions. That is, I wanted to see whether several routes left from one point of the city and headed to the same destination. The intuition was that a repetitive pattern could be one person moving from point A to point B, say from home to the office. This turned out to be an intuitive but incorrect assumption.
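To make the idea concrete, here is a minimal sketch of clustering routes by their start and end positions. It assumes plain k-means over 4-dimensional points (start latitude, start longitude, end latitude, end longitude); the post does not say which algorithm was actually used. The coordinates are invented, and treating latitude and longitude as flat Euclidean coordinates is a simplification that is only reasonable at city scale.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct routes
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each route goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_point(members)
    return labels, centroids

# Hypothetical routes: (start_lat, start_lng, end_lat, end_lng).
routes = [
    (45.070, 7.680, 45.060, 7.660),
    (45.071, 7.681, 45.061, 7.661),
    (45.069, 7.679, 45.059, 7.659),
    (45.030, 7.640, 45.090, 7.700),
    (45.031, 7.641, 45.091, 7.701),
    (45.029, 7.639, 45.089, 7.699),
]

labels, centroids = kmeans(routes, k=2)
print(labels)
```

Routes that share a label start and end near each other, which is exactly the "same A to same B" pattern the clustering was meant to surface.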

So this is what the route clustering produced.

Here I analysed 2000 routes and tried to divide them into 30 clusters. Every color represents a different cluster.

There were mainly three problems here that I didn’t think of at first:

1. Deciding the number k of clusters.
2. Dynamically generating k colors is not a trivial task, especially if k is big and you want them to be fairly distinguishable from each other!
3. The incorrect assumption I mentioned before, which I’m going to illustrate in a bit.
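On the color problem, one common approach (not necessarily the one used for the map above) is to space hues evenly around the color wheel. The sketch below uses Python's standard `colorsys` module; the function name `distinct_colors` and the default saturation and brightness values are arbitrary choices for the example.

```python
import colorsys

def distinct_colors(k, saturation=0.65, value=0.9):
    """Return k hex colors whose hues are evenly spaced on the color wheel."""
    colors = []
    for i in range(k):
        # Hue i / k walks once around the wheel in k equal steps.
        r, g, b = colorsys.hsv_to_rgb(i / k, saturation, value)
        colors.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return colors

palette = distinct_colors(30)
print(palette[:5])
```

This guarantees k different colors, but not k colors that are easy to tell apart: with 30 clusters, neighbouring hues are only 12 degrees apart, which is why the task stays non-trivial for large k.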

The number of clusters denotes how many groups of similar objects you think a larger set of objects can be divided into. A rule of thumb for determining the number of clusters in a data set is:

k ≈ √(n / 2)

with n as the number of objects (data points).

This is why I tried to divide a set of 2000 routes into 30 groups.
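As a quick check of that number, here is the arithmetic, assuming the rule of thumb meant here is the common k ≈ √(n / 2). The function name `rule_of_thumb_k` is just for illustration.

```python
import math

def rule_of_thumb_k(n):
    """Rule-of-thumb cluster count for n data points: sqrt(n / 2)."""
    return math.sqrt(n / 2)

k_estimate = rule_of_thumb_k(2000)
print(round(k_estimate, 1))  # a little under 32, rounded to 30 in the text
```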

But we have to consider that we were trying to make one person’s movements emerge from these shifts. On a typical day roughly 2000 shifts are made with the car sharing service. Dividing them into 30 groups, where each group represents one person’s movements, would imply that this person makes around 66 shifts per day (2000 / 30), which is unlikely.

So I started trying different numbers of clusters, making assumptions like “A typical user should use this twice a day!” to decide the next k to try.
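The reasoning can be turned around to derive k from an assumed usage rate. This sketch assumes, as the paragraphs above do, that each cluster stands for one person; the helper names are hypothetical.

```python
def shifts_per_cluster(daily_shifts, k):
    """Average number of shifts that land in each of k clusters."""
    return daily_shifts / k

def clusters_for_usage(daily_shifts, shifts_per_user):
    """Cluster count implied by an assumed number of shifts per user per day."""
    return round(daily_shifts / shifts_per_user)

per_cluster = shifts_per_cluster(2000, 30)  # far too many for one person
k_next = clusters_for_usage(2000, 2)        # "twice a day" assumption

print(per_cluster, k_next)
```

Under the "twice a day" assumption, 2000 daily shifts point to something on the order of 1000 clusters rather than 30.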