The color of each point depends on the direction of the vehicle. As you can see from the map and the table below, two destinations (red and green) are far more popular than the other two. You can see them on the map in the top left and bottom right corners.

Direction    Number of points
23022        91190
24416        79777
2            2339
24404        193

Let’s take a closer look at selected parts of the map (including the first two directions) below. With more zoom, you can see that a) there really are a lot of points and b) sometimes they are not so accurate (at least I hope so, because otherwise some of the trams ended their journey in the Odra river).

Give me the route

I’d like to get a polyline that best represents the route of a line heading in a specific direction. There are at least a few ways I can do it:

a) Go to the website of Municipal Transport Company and look for it

Sadly, all I found is a PDF with the schema of public transport in Wrocław, and it’s not accurate.

b) Go to the website providing public data of Wrocław and get GTFS files

These files describe public transport in Wrocław quite well, but not all GTFS files are provided. One of the missing files is shapes.txt, which contains the rules for drawing vehicle routes on a map.

To be fair, I could take the list of all stops of a given line and just connect them, but the accuracy of that polyline wouldn’t be acceptable.

c) Track one course of each line and direction, connect the points et voilà, I’d have all the routes I wanted

This approach is not bad, but I wouldn’t really trust it:

- GPS accuracy is quite bad sometimes, so relying on just one signal every x seconds/minutes seems dangerous

- trams and buses sometimes take a different route because of accidents on their original route

d) Take the average path between all the points we have (our choice)

This approach will let us ignore the outliers, which in our case are points that lie off the wanted route because of a) GPS inaccuracy or b) a detour. Also, we can finally make some use of our gigabytes of data, so let’s do it!

But how?

After digging up some information about approaches to this kind of problem, I found a few possibilities:

a) Principal Curve Analysis

It’d give me a smooth curve that passes through the middle of the locations, keeping the distance to them as small as possible. I need a polyline, but sampling the curve afterward wouldn’t be a problem. This approach felt a bit too complicated for our problem, maybe next time.

b) Map matching

It’d let me match my locations to the map of roads in Wrocław. It seems great, has some open source implementations, and I will probably use it someday. This time, though, I’d like to focus on averaging thousands of points, while map matching works well even with a single sequence of positions.

c) Cluster analysis (our choice)

Let’s group similar objects (in our case similarity is based on the distance between locations), then find the centers of the resulting groups and draw a polyline through these points.

There are dozens of clustering algorithms. Everyone has probably heard about k-means, which is one of the simplest yet most often used clustering algorithms. It has a few drawbacks: you have to specify the number of clusters up front, and by default it works with Euclidean distance, while our planet is, well, not flat (or is it?). Let’s use a different algorithm: density-based spatial clustering of applications with noise, or as everyone calls it…

DBSCAN

It’s perfect for us, because:

- it can work with any distance function (we’ll use Haversine)

- it doesn’t need the number of clusters as input, we describe clusters using other parameters (more info later)

- it can ignore noise/outliers
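Since Haversine distance comes up here, a minimal sketch of it may help. It uses only the standard library; the function name `haversine_km` and the Earth radius constant are assumptions made for this illustration, not part of the original code.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Distance between two tram positions of the kind found in the data set.
sample_km = haversine_km(51.124065, 17.041012, 51.124054, 17.040968)
print(f"sample positions are {sample_km * 1000:.1f} m apart")

# Why plain Euclidean distance on raw degrees is misleading around Wrocław:
# one degree of latitude and one degree of longitude are not the same length.
lat_degree_km = haversine_km(51.1, 17.0, 52.1, 17.0)
lon_degree_km = haversine_km(51.1, 17.0, 51.1, 18.0)
print(f"1 deg latitude  ~ {lat_degree_km:.1f} km")
print(f"1 deg longitude ~ {lon_degree_km:.1f} km")
```

At this latitude a degree of longitude is noticeably shorter than a degree of latitude, so treating raw coordinates as flat x/y values would stretch east-west distances.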

Clustering

I will use Python for working with the data and visualizing the results. Full code is available on GitHub here. Python is not my primary weapon, so if you have any remarks or hints about the quality of my .py code, let me know.

Our data is stored in a CSV file (which you can get here). It contains 5000 locations of tram number 1 (for one selected direction):

Latitude, Longitude

51.124065, 17.041012

51.124054, 17.040968

51.124054, 17.041037

51.12404, 17.04099

51.124054, 17.040976

(...)

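Let’s load it into a pandas DataFrame and plot all the points on a map:

import folium
import pandas

# Load the tram locations (Latitude, Longitude columns)
df = pandas.read_csv('locations.csv', index_col=False, header=0)
# Base map centered on Wrocław
superMap = folium.Map(location=[51.107885, 17.038538], zoom_start=14, tiles='OpenStreetMap')
# drawPointsOnMap is a helper from the full code
drawPointsOnMap(superMap, df.values, '#3186cc', 20)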

We can clearly see the line we would like to get. Our hope is that it’s going to be represented by (let’s say) fewer than 100 points, not 5000. Also, we see a detour on our map:

Compared to the main route, there are really just a few points on the detour route. We hope that DBSCAN is going to ignore these points.

Now, let’s use the DBSCAN algorithm to create clusters. We will use some snippets from the article Clustering to Reduce Spatial Data Set Size written by Geoff Boeing (thanks, Geoff!). We will show the centers of the clusters as red points:

clustersCenters = getDbScanClustersCenters(df, 0.03, 2)

drawPointsOnMap(superMap, clustersCenters, '#ff0000', 30)

Oops, it doesn’t look so good. The detour points ended up in about 15 clusters, and we expected to get a few more points in other parts of the route. Let’s take a look at that line:

clustersCenters = getDbScanClustersCenters(df, 0.03, 2)

The numerical values here are parameters of the DBSCAN algorithm:

0.03 is the maximum distance (in km, so here it’s 30 meters) between two objects for them to be treated as neighbors in the same cluster

2 is the minimum number of objects that a cluster has to contain. A group with fewer objects is considered to be noise
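To make these two parameters concrete, here is a minimal pure-Python sketch of DBSCAN with Haversine distance. It is not the `getDbScanClustersCenters` helper used above. The names `dbscan`, `cluster_centers`, `eps_km` and `min_samples`, the toy coordinates, and the choice of the plain mean as a cluster center are all assumptions made for illustration. The brute-force neighbor search is only suitable for small inputs.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius
NOISE = -1


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    phi1, phi2 = math.radians(p[0]), math.radians(q[0])
    d_phi = phi2 - phi1
    d_lambda = math.radians(q[1] - p[1])
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def dbscan(points, eps_km, min_samples):
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbours(i):
        # Brute-force search; a point counts as its own neighbour.
        return [j for j, q in enumerate(points)
                if haversine_km(points[i], q) <= eps_km]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point of the current cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_samples:  # j is a core point, keep growing
                queue.extend(reachable)
    return labels


def cluster_centers(points, labels):
    """Mean latitude/longitude of every cluster, skipping noise."""
    groups = {}
    for point, label in zip(points, labels):
        if label != NOISE:
            groups.setdefault(label, []).append(point)
    return [(sum(lat for lat, _ in g) / len(g), sum(lon for _, lon in g) / len(g))
            for g in groups.values()]


# Toy data: 12 points on the "main route" and 3 points on a far-away "detour".
main_route = [(51.1240 + i * 0.00001, 17.0410) for i in range(12)]
detour = [(51.1300 + i * 0.00001, 17.0500) for i in range(3)]
points = main_route + detour

loose = dbscan(points, eps_km=0.03, min_samples=2)
strict = dbscan(points, eps_km=0.03, min_samples=10)
print(len(cluster_centers(points, loose)), "centers with min_samples=2")
print(len(cluster_centers(points, strict)), "centers with min_samples=10")
```

In this toy data the three detour points form their own cluster when `min_samples` is 2, and are labeled as noise when it is 10, so only the main-route center remains.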

Let’s increase 2 to 10. If a cluster contains fewer than 10 objects, it’s going to be ignored: