I recently made a visualization that I thought was interesting and posted it on two great subreddits: /r/soccer and /r/dataisbeautiful.

Since lots of people requested more information about the data and the process used to generate the charts, I decided to create a GitHub repository to share these (and any future) findings.

Regarding the dataset

I collected the data from UEFA's website, specifically from the post-match timeline (an example). The process is far from refined, so some degree of error is to be expected. In fact, a bored engineer on Reddit showed, by analyzing the video and the pitch patterns, that the goal marked as 45 meters was actually scored from 42 meters. It is unclear whether the error comes from the source or from my calculations, but assume the latter rather than the former.
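To illustrate how a distance error of a few meters can creep in, here is a minimal sketch of one way a shot distance could be derived from a position on the pitch. It is not the code used for this analysis: the input format (position as a percentage of pitch length and width), the `shot_distance_m` function, and the 105 x 68 m pitch size are all assumptions made for the example.

```python
import math

# Assumed pitch dimensions in meters (a common standard size); real pitches vary.
PITCH_LENGTH_M = 105.0
PITCH_WIDTH_M = 68.0


def shot_distance_m(x_pct, y_pct, length_m=PITCH_LENGTH_M, width_m=PITCH_WIDTH_M):
    """Distance in meters from a shot position to the center of the attacked goal.

    x_pct and y_pct are hypothetical inputs: the shot position as a percentage
    of pitch length and width, with the attacked goal at x_pct=100, y_pct=50.
    """
    dx = (100.0 - x_pct) / 100.0 * length_m  # distance along the pitch
    dy = (50.0 - y_pct) / 100.0 * width_m    # sideways offset from the goal center
    return math.hypot(dx, dy)


# A shot from the center spot is half a pitch length away.
print(shot_distance_m(50.0, 50.0))  # 52.5
# The same relative position on a 100 m pitch is 2.5 m closer.
print(shot_distance_m(50.0, 50.0, length_m=100.0))  # 50.0
```

Under these assumptions, simply guessing the wrong pitch length shifts a long-range shot by a couple of meters, which is the same order of magnitude as the 45 versus 42 meter discrepancy.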

The dataset consists of all shots marked on UEFA's website as an event. This introduces a major classification bias in the analysis, but, unfortunately, I don't believe a more complete dataset is available to the public.

tl;dr: don't treat this as a scientific study.

This analysis was made as a result of a fun exercise in web scraping, statistical modelling, and visualization tools.

The tools used (because Python is awesome) were:

beautifulsoup (for web scraping)

pandas (for data clean up and analysis)

statsmodels (for modelling)

matplotlib (for visualizations)

All of it was done (and shared) using IPython Notebooks. If you don't know about this wonderful tool, check out the website and this gallery of interesting Notebooks.

It was pointed out on /r/soccer that a particular goal was missing: this stunner from Chelsea's Óscar. When investigating the data to understand why, I noticed I had only scraped half of the competition's games (only the first match day for each week).

This notebook and the dataset were updated to reflect the new data, but the above image (the original post on reddit) was kept. The conclusions remained the same.

Enough talk already, let's get down to business: