In a previous post, I implemented a model of soccer developed by Gianluca Baio and Marta A. Blangiardo. They model the number of goals each team scores in a game as Poisson distributed, with latent scoring intensities that are a function of each team's attacking strength, defending strength, and a home team advantage. One output of the model is an estimate of every team's attacking and defending strengths, which can be used to predict the outcomes of future games.
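To make that structure concrete, here is a minimal sketch of the scoring intensities under a log-linear parameterization in the spirit of Baio and Blangiardo. The function and parameter names are my own for illustration, and the sign convention for defense (more negative means a stronger defense) is an assumption, not something taken from the earlier post.

```python
import math


def scoring_rates(home_adv, att_home, def_home, att_away, def_away):
    """Return (home_rate, away_rate), the Poisson means for goals scored.

    Assumed log-linear form:
        log(home_rate) = home_adv + att_home + def_away
        log(away_rate) = att_away + def_home
    A more negative def_* value is assumed to mean a stronger defense.
    """
    home_rate = math.exp(home_adv + att_home + def_away)
    away_rate = math.exp(att_away + def_home)
    return home_rate, away_rate


def poisson_pmf(k, rate):
    """Probability of exactly k goals given a Poisson scoring rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


# Hypothetical strengths, chosen only for illustration.
home_rate, away_rate = scoring_rates(
    home_adv=0.3, att_home=0.2, def_home=-0.1, att_away=-0.1, def_away=0.1
)
prob_home_scoreless = poisson_pmf(0, home_rate)
```

In the full model the strengths are latent and estimated from match results; the sketch only shows how a set of strengths maps to goal probabilities.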

This model had (at least) two weaknesses:

1. By modeling goals scored, it didn't give any credit for almost-goals-that-weren't. A team got no credit for generating a solid opportunity that would have been a goal if it were not for bad luck.

2. By modeling only the end-of-game goal tallies, it did not take into account the score differential, let alone the evolution of this differential over the course of the game. A team with 3 goals going into the final quarter of a game will put less effort into scoring if their opponents have 0 goals than if their opponents have 2, 3, or 4.

I considered these problems for a while and turned to American football, which is a much more structured and discretized game than soccer. A game can be broken down into drives or possessions, and data is available on the outcome of each drive. I built a model of the NFL at the drive level, which addressed both of the weaknesses of the soccer model. It addressed the first weakness by giving credit for drive progression, even if the drive didn't end in a touchdown. It addressed the second weakness by using drive-level covariates, such as score differential and time on the clock.

I then turned back to soccer. In the meantime, problem #1 had essentially been solved by 'expected goals', or xG. It was less obvious how to solve problem #2 for soccer. Because soccer is a fluid game, the modeling approach doesn't jump out at you the way it might for a discrete, possession-based game like football.

Before diving into the model, I'd like to motivate issue #2 above a bit more. If we don't control for game situation when evaluating team strength using an outcome measure like points, goals, or expected goals, we risk biasing our estimates of team strength upward for the weakest teams and downward for the strongest. This is true across sports. Imagine a strong team blowing out a weak team in the fourth quarter of an NFL game. The strong team puts in their B-team, allowing the weak team to score a couple of touchdowns. This result will make the two teams look closer in strength than they really are. Or in soccer, as described in the 'Criticisms' section of the xG article:

A team may score one or two difficult chances early in a game and sit back for the remaining of the 90 minutes, allowing their opponents to take many shots from different positions, thus increasing the opponents xG. One could then claim that the losing team achieved a higher xG therefore deserves the win. This is why xG should always be taken with additional context of the game before creating a verdict.

For this reason, if we want to use xG to evaluate team strength - and to use our strength estimates to predict the outcomes of future games - we need to take game situation into account.

The Model

I decided to approach the problem simply - by breaking up a soccer game into segments. Most segments are 10 minutes long, but the length of the last segment of each half varies depending on injury time.
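As a rough illustration of that segmentation, the sketch below splits one half into segments. It assumes cut points at 10, 20, 30, and 40 minutes, with the final segment running from minute 40 to the end of the half (45 minutes plus injury time). The function name, parameters, and exact cut points are assumptions made for this example.

```python
def split_half(half_end, segment_length=10.0, regulation=45.0):
    """Split one half into (start, end) segments, in minutes from kickoff of the half.

    half_end: total minutes played in the half, including injury time.
    Full-length segments cover as much of regulation time as fits; the
    final segment absorbs the remainder plus any injury time.
    """
    n_full = int(regulation // segment_length)  # 4 full segments by default
    bounds = [i * segment_length for i in range(n_full + 1)]
    segments = list(zip(bounds[:-1], bounds[1:]))
    segments.append((bounds[-1], half_end))  # variable-length last segment
    return segments


# A half with 2 minutes of injury time (hypothetical value).
segments = split_half(47.0)
```

Here the last segment is `(40.0, 47.0)`, so its length varies with injury time while the other four stay at 10 minutes each.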