By Eliot McKinley (@etmckinley)

Recently, this tweet created a small firestorm in the soccer analytics community. While the source of the error is unknown, it was pretty clear that there weren’t 1,300 passes and 50 shots in an English League 2 match. This led to responses from prominent analysts such as StatsBomb’s Ted Knutson (including on his podcast [starts at 10:45]), Opta’s Tom Worville (an ASA alum) and Ryan Bahia, and Chris Anderson, author of The Numbers Game. All of them were saying pretty much the same thing: question the data you are using. If the data you are using to analyze a problem is not valid, then your solutions won’t be either.

So what do we know about the data that is used for soccer analysis? Previous studies have shown that people are pretty good at agreeing about what type of event occurred in a soccer game (e.g. shots, tackles). But as far as I can tell, the accuracy and precision of the locations of game events among the various data providers have not been studied. As Joe Mulberry pointed out when looking at the troubling inconsistencies between spatial tracking data and event data, small differences in locations can have big effects on downstream analysis, including the expected goals (xG) models built off that data. So what are the differences between how soccer data providers collect and report their data?
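To get a feel for why location matters, here is a minimal sketch of how a shot's coded position turns into distance to goal and the angle between the posts, two features commonly fed to xG models. It is not any provider's actual model. The function name `shot_geometry`, the coordinate convention (meters from the goal line, meters left or right of the goal's center), and the two example locations are assumptions made up for illustration; the only real-world constant is the 7.32 m goal width.

```python
import math

GOAL_WIDTH_M = 7.32  # width of a regulation goal


def shot_geometry(x, y):
    """Return (distance_m, angle_deg) for a shot location.

    Assumed convention (hypothetical, for illustration only):
    x = meters out from the goal line (must be > 0),
    y = meters left/right of the center of the goal.
    """
    distance = math.hypot(x, y)
    half = GOAL_WIDTH_M / 2
    # Angle between the lines from the shot location to each post.
    angle = math.atan2(y + half, x) - math.atan2(y - half, x)
    return distance, math.degrees(angle)


# Two hypothetical codings of the same shot, 2 m apart.
for label, loc in [("coder A", (6.0, 0.0)), ("coder B", (8.0, 0.0))]:
    dist, ang = shot_geometry(*loc)
    print(f"{label}: distance {dist:.1f} m, goal angle {ang:.1f} degrees")
```

Moving the coded location just a couple of meters farther out noticeably narrows the angle to goal, so any model that leans on these features will score the "same" shot differently depending on who coded it.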

To partially answer this, I took to Twitter. I created a Google survey that asked a user to watch a video of a goal and then code the location of a shot using Peter McKeever’s fabulous online tool. While the specifics of how companies code the data are still a bit shrouded, this method is probably a crude version of what data companies do. But instead of (presumably) well-paid and well-trained professionals doing the work, it is random, totally trustworthy people on the internet doing it for free.

I asked people to look at three different shots. The first was a headed goal from Poland in the 2018 World Cup. This one was a bit tricky because the broadcast angle and the player’s jump make determining the exact location difficult.