The 2018 World Cup is finally here. In the opening match, Saudi Arabia will take the pitch against the host country Russia, which is favored by the majority of analysts. In our opening post we explored the predictive qualities of player data versus team-level data and the challenges of World Cup predictions. For this post, we dig into data sources and share some initial modeling output.

Similar to our work in 2014, we are focused on predicting match victors for this World Cup. More specifically, we will be modeling expected goals (xG) for each team, which means estimating the likelihood that a shot will result in a goal. There are numerous approaches to xG modeling, and as the World Cup plays out we will share the inner workings of our model. To accomplish our goals (no pun intended), we are using match and touch event data from Opta Sports: specifically, their F8 (match stats) and F24 (event) feeds, from which we hope to harvest some predictive signal for these maddening matches.
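
To make the xG idea concrete, here is a minimal sketch of one common formulation, not a description of our actual model: each shot gets a goal probability from a logistic function of its distance and angle to goal, and a team's xG is the sum of those probabilities. The coefficients and function names below are made up for illustration and are not fitted to any data.

```python
import math

# Illustrative coefficients only: assumed values, NOT fitted to real shots.
INTERCEPT = -0.5
DIST_COEF = -0.1   # effect per meter of distance from goal
ANGLE_COEF = 1.2   # effect per radian of visible goal mouth


def shot_probability(distance_m, angle_rad):
    """Logistic estimate of the chance that a single shot becomes a goal."""
    z = INTERCEPT + DIST_COEF * distance_m + ANGLE_COEF * angle_rad
    return 1.0 / (1.0 + math.exp(-z))


def team_xg(shots):
    """Expected goals for a team: the sum of its per-shot probabilities.

    `shots` is an iterable of (distance_m, angle_rad) pairs.
    """
    return sum(shot_probability(d, a) for d, a in shots)
```

With these toy numbers, a close, central shot scores higher than a long-range effort from a tight angle, which is the basic behavior any xG model should reproduce.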

Digging Into the Data

The F8 data includes timestamps for only a few important events (cards and goals) and consists mostly of familiar stats aggregated at the player and team level for each game: things like shot totals broken down by body part, whether shots were taken inside or outside the box, and the end result of each shot. You can find this data on most club and league sites; however, gathering it at scale and with consistent quality would be a challenge. Opta provides quality structured feeds for a large number of leagues.

The F24 is a much more fine-grained view, capable of telling a lot more of the game story. Imagine the field as an X,Y coordinate plane. More than 75 different types of events are included, each with location, player information, and up to 20 pieces of supplemental information telling you things like the body part used and the play type the event belongs to. One could use the F24 data to replay the majority of match activity. Gathering this information at scale with high quality is challenging at best, as clubs and leagues do not publish this data. Fortunately, Opta provides wide and deep coverage for the leagues at the F24 level.
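
As a rough picture of what one of these events might look like once loaded, here is a small sketch. The class and field names are hypothetical, and we assume here that coordinates run from 0 to 100 along each axis with the attacking goal centered at x = 100, y = 50, on a 105 m by 68 m pitch; none of these details are taken from the feed specification.

```python
import math
from dataclasses import dataclass, field

# Assumed pitch dimensions in meters (real pitches vary).
PITCH_LENGTH_M = 105.0
PITCH_WIDTH_M = 68.0


@dataclass
class TouchEvent:
    """One on-ball event: hypothetical field names, for illustration only."""
    event_type: str
    player_id: str
    x: float  # assumed 0-100 along the pitch, toward the attacking goal
    y: float  # assumed 0-100 across the pitch
    qualifiers: dict = field(default_factory=dict)  # e.g. body part, play type


def distance_to_goal_m(event):
    """Straight-line distance from the event to the center of the goal."""
    dx = (100.0 - event.x) / 100.0 * PITCH_LENGTH_M
    dy = (50.0 - event.y) / 100.0 * PITCH_WIDTH_M
    return math.hypot(dx, dy)
```

A derived feature like this distance is the kind of signal that location data makes possible and that aggregated match stats cannot provide.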

We built an ingestion pipeline with Cloud Dataflow and Apache Beam to process thousands and thousands of roster, schedule, F8, and F24 files to create a lovely BigQuery dataset. This way we could easily re-run our ingest processes whenever we needed to tweak any logic or mapping, ingesting five years of data from 20+ leagues with a single serverless job! We will have a future post on data pipelining.
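
Until then, here is a minimal sketch of the kind of per-file parsing step such a pipeline might apply. It is not our production code: the XML element and attribute names are hypothetical stand-ins for an event feed, and the function uses only the Python standard library. In a Beam pipeline, a generator like this could be wrapped in a `beam.FlatMap` so that each input file fans out into many table rows.

```python
import xml.etree.ElementTree as ET


def parse_events(xml_text):
    """Yield one flat dict per <Event> element, shaped like a table row.

    Element and attribute names here are assumptions for illustration.
    """
    root = ET.fromstring(xml_text)
    for game in root.iter("Game"):
        for event in game.iter("Event"):
            yield {
                "game_id": game.get("id"),
                "type_id": int(event.get("type_id")),
                "x": float(event.get("x")),
                "y": float(event.get("y")),
                # Supplemental info: qualifier id -> value (None if absent).
                "qualifiers": {
                    q.get("qualifier_id"): q.get("value")
                    for q in event.iter("Q")
                },
            }


# Tiny made-up sample document, not real feed data.
SAMPLE = """<Games><Game id="g1">
  <Event type_id="16" x="88.5" y="50.0"><Q qualifier_id="15"/></Event>
  <Event type_id="1" x="40.0" y="30.0"/>
</Game></Games>"""

rows = list(parse_events(SAMPLE))
```

Keeping the mapping logic in one pure function like this is what makes re-running an ingest cheap: change the function, rerun the job, and the whole dataset is rebuilt consistently.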