Beyond simply including more data, a total rewrite of the heatmap code permitted major improvements in rendering quality. Highlights include twice the resolution, rasterizing activity data as paths instead of as points, and an improved normalization technique that ensures a richer and more beautiful visualization.

The heatmap is now available on Strava and the Strava Route Builder. The rest of this post is a technical deep dive on the details of this update.

Background

From 2015 to 2017, there were no updates to the Global Heatmap due to two engineering challenges:

Our previous heatmap code was written in low-level C and was designed to run only on a single machine. With this restriction, updating the heatmap would have taken months.

Accessing stream data required one S3 get request per activity, so reading the input data would have cost thousands of dollars and been challenging to orchestrate.

The heatmap generation code has been fully rewritten using Apache Spark and Scala. The new code leverages new infrastructure enabling bulk activity stream access and is parallelized at every step from input to output. With these changes, we have fully conquered all scaling challenges. The full global heatmap was built across several hundred machines in just a few hours, with a total compute cost of only a few hundred dollars. Going forward, these improvements will enable updating the heatmap on a regular basis.

The remaining sections of this post describe in some detail how each step of the Spark job for building the heatmap works, and provide details on the specific rendering improvements that we have made.

Input Data and Filtering

The raw input activity streams data comes from a Spark/S3/Parquet data warehouse. Several algorithms clean up and filter this data.

Most importantly, the heatmap only contains public activities and respects all privacy settings. Go here to learn more.

Additional filters remove erroneous data. Activities at higher than reasonable running speeds are excluded from the running heat layer because they are most likely mislabeled. There is also a higher speed threshold for bike rides to filter data from cars and airplanes.
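As a rough illustration of this kind of speed filter, here is a minimal Python sketch (not the production Scala/Spark code). The `Activity` fields, the use of average speed over the whole activity, and the threshold values are all assumptions made for the example; the post does not state the real thresholds or how speed is measured.

```python
from dataclasses import dataclass

# Hypothetical thresholds in metres per second. The real values are not
# given in the post; these are placeholders for illustration only.
MAX_SPEED_MPS = {"run": 7.0, "ride": 25.0}


@dataclass
class Activity:
    activity_type: str    # e.g. "run" or "ride" (assumed labels)
    distance_m: float     # total distance in metres
    moving_time_s: float  # moving time in seconds


def average_speed(activity: Activity) -> float:
    """Average speed in m/s; a non-positive duration is treated as infinitely fast."""
    if activity.moving_time_s <= 0:
        return float("inf")
    return activity.distance_m / activity.moving_time_s


def keep_for_heatmap(activity: Activity) -> bool:
    """Keep an activity only if its average speed is plausible for its type.

    Types without a configured threshold pass through unfiltered
    (an assumption of this sketch).
    """
    limit = MAX_SPEED_MPS.get(activity.activity_type, float("inf"))
    return average_speed(activity) <= limit
```

Under these assumed thresholds, a "run" covering 10 km in 10 minutes would be dropped from the running layer as likely mislabeled, while the same activity labeled as a ride would be kept.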

The intent of the heatmap is to show only data from movement. A new algorithm does a much better job of classifying stopped points within activities. If the magnitude of the time-averaged velocity of an activity stream drops too low at any point, subsequent points from that activity are filtered out until the activity moves beyond a specific radius from the initial stopped point.
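The stopped-point rule described above can be sketched roughly as follows, again in Python rather than the production Scala. This is one possible reading of the description, not the actual implementation: the trailing-window average, the planar `(t, x, y)` coordinates in seconds and metres, the decision to keep the initial stopped point itself, and all three parameter values are assumptions.

```python
import math


def filter_stopped_points(points, window_s=30.0, min_speed_mps=0.5,
                          resume_radius_m=20.0):
    """Drop points recorded while the athlete appears to be stopped.

    points: list of (t, x, y) tuples with non-decreasing t in seconds and
    x, y in metres in a local planar projection (assumed input format).
    """
    kept = []
    anchor = None  # position of the initial stopped point, while stopped
    start = 0      # index of the oldest point inside the trailing window
    for i, (t, x, y) in enumerate(points):
        if anchor is not None:
            if math.hypot(x - anchor[0], y - anchor[1]) <= resume_radius_m:
                continue  # still inside the radius: filter this point
            # The activity has left the radius: resume normal processing
            # and restart the averaging window from this point.
            anchor = None
            start = i
        # Shrink the window so it spans at most window_s seconds.
        while t - points[start][0] > window_s:
            start += 1
        t0, x0, y0 = points[start]
        dt = t - t0
        kept.append((t, x, y))
        # Magnitude of the time-averaged velocity over the window is the
        # net displacement divided by the elapsed time.
        if dt > 0 and math.hypot(x - x0, y - y0) / dt < min_speed_mps:
            anchor = (x, y)
    return kept
```

Using net displacement rather than summed point-to-point distance means GPS jitter around a stationary position averages out to a low velocity. A trailing window does introduce some lag, so a few points after the athlete stops are still kept before the threshold is crossed.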