The Premier League projections are back. The system has been fully renoobulated and is ready to go for the new season. Nerds are welcome to skip down to the extensive methodology section to see what went into these numbers.

My analysis suggests we may be looking at another three-team title race. No system worth your time will rate Manchester City and Chelsea outside the top three, or perhaps outside the top two. But based on expected goals, Liverpool deserve a seat at the title contenders table too. Arsenal have an outside shot, and Tottenham Hotspur and Manchester United are also teams in the Premier League.

My numbers last year consistently rated Liverpool among the top teams, and Liverpool's title contending run last year probably counts as the biggest "get" for expected goals as a projection system. So far this year, the Reds have totaled six points against the most difficult schedule in the league. They were home to Southampton, who look like a solid top-8 side, and then away to defending champions Manchester City and of course Spurs. While the loss of Luis Suarez will surely hurt, so far Liverpool's performances do not raise any statistical warnings. Even that loss at the Etihad was mostly driven by the quality of City's finishing rather than a clear difference in the kinds of chances created.

The system is based on expected goals. This is a method which estimates the quality of a team's overall chances created and conceded. For every shot attempted, I calculate an estimated expected goals value based on the following factors:

I'll explain in detail how each of these factors is incorporated down in that methodology section. Close and central locations are good, shots taken with feet are good, direct free kicks are good, dribbles and through-balls are very good, crosses are not so good. Fast attacks are good, long balls are less good, set plays are a little bit bad.

I create team ratings by summing their expected goals numbers for and against from the past three years. I also include team wage bill as a factor, as wages have shown strong utility in estimating team quality. The current system is about two-thirds 2013-2014 expected goals, and the other third a mix of 2014-2015 expected goals, 2012-2013 expected goals, and wage bill. Every week, I will increase the weight on the 2014-2015 data as our sample size expands.

Queens Park Rangers project as a little less likely to be relegated than their promoted brethren Burnley and Leicester City. This is because of QPR's wage bill, estimated around double those of the other two sides. This method could easily fail in this particular case, as QPR have horrifically underperformed their wage bill in the last three years. But overall, it looks to me like including wages helps with the projections.

I project matches using the same method I used last year, simulating the season 500,000 times, match by match, and collecting the results.
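To make the simulate-and-collect idea concrete, here is a minimal sketch. The post does not say how an individual match is simulated, so the Poisson goal model, the way the two teams' rates are averaged, and every name below (`draw_goals`, `simulate_seasons`, `scoring_rate`, `conceding_rate`) are assumptions for illustration rather than the actual method.

```python
import math
import random

def draw_goals(mean_goals, rng):
    """Draw a goal count from a Poisson distribution (Knuth's method)."""
    limit = math.exp(-mean_goals)
    goals, product = 0, rng.random()
    while product > limit:
        goals += 1
        product *= rng.random()
    return goals

def simulate_seasons(fixtures, scoring_rate, conceding_rate, n_sims=10_000, seed=0):
    """Return each team's average points over n_sims simulated seasons.

    fixtures: list of (home, away) pairs.
    scoring_rate / conceding_rate: hypothetical per-match goal rates by team.
    """
    rng = random.Random(seed)
    total_points = {team: 0 for team in scoring_rate}
    for _ in range(n_sims):
        for home, away in fixtures:
            # Assumed match model: average one side's attack with the other's defense.
            home_goals = draw_goals((scoring_rate[home] + conceding_rate[away]) / 2, rng)
            away_goals = draw_goals((scoring_rate[away] + conceding_rate[home]) / 2, rng)
            if home_goals > away_goals:
                total_points[home] += 3
            elif away_goals > home_goals:
                total_points[away] += 3
            else:
                total_points[home] += 1
                total_points[away] += 1
    return {team: pts / n_sims for team, pts in total_points.items()}
```

The sketch defaults to 10,000 runs to stay quick; the projections described here use 500,000. Collecting title or relegation odds would mean ranking the table inside each simulated season instead of just summing points.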

Because of rounding, the numbers may not add up quite right.

It's too early to say much about the bottom of the table. Burnley is looking like the immediate-relegation side their budget pegs them as, but even so no one is much more than a 50-50 shot at being sent back down to the Championship. Villa and Swansea have extricated themselves from the bottom with hot starts but could easily drop back down if a few results do not go their way. Projecting the rest of the bottom of the table feels like a mug's game at this point. I would be surprised to see Newcastle or West Brom still in the bottom at the end of the year, but beyond that it is hard to say.

Swansea are kind of funny. I was all ready to talk about how I'd overrated Swansea last year because I didn't include an adjustment for speed of attack, and no one attacks slower than Swansea. I had them rated at the top of the mid-table all year last year, but in my new numbers Swansea's 2013-2014 is down in the muck with Sunderland and Villa. Of course, then they went and instead took nine points from three matches. Swansea have been carried by some excellent finishing--did you see that Routledge volley?--but nine early points may be enough to keep them safe.

Southampton, like Liverpool, could reflect a failure of projection systems so early in the year. Having remade their roster and lost their manager, should we use Southampton's strong 2013-2014 season as the largest part of their projection? Maybe not. But so far the Saints have hung with Liverpool at Anfield, played to a stalemate against a good West Brom side and dominated West Ham. I don't know that I'd rate Southampton as 1-in-10 shots to finish fourth, but I think this is a top half side.

Not too much to say about Spurs yet. The early season numbers are blah, as the stompification by Liverpool at the Lane outweighs the good of the 4-0 against QPR. Last season's numbers place Spurs in a scrum with United, Everton and Southampton for the 5th-8th positions in the EPL, and they project just slightly at the top of that group now. We'll learn a lot in the next month, but it seems likely that another Europa League qualification season is in the cards.

The system isn't selling on Manchester United yet. There is no defense for how they've played over the first three matches, given the EPL's easiest schedule so far. But that's only three matches. United had a solid 0.550ish expected goals ratio last year and sport a wage bill rising into the stratosphere. While Louis van Gaal's side are underdogs for a top four finish, they're not out of it yet. My numbers cannot account for the effect of Angel Di Maria and Radamel Falcao, but by incorporating a wage bill adjustment, I give United a boost based on their ability to sign players of that caliber.

Ok. Those are the numbers. If you want to see how the sausage done got made, click on "Methodology" to reveal the nerdery hidden below.

So what did you do with your summer vacation? I made a new expected goals model. Here's how it works.

My touchstone throughout the process was "does this make football sense?" I am by training something of a skeptic of regression methods. While obviously I had to do lots of regression to create this system, I tried to make sure I was only running regressions when I understood why and how the factors involved related to the creation of better and worse chances in a football match. This has probably led to some infelicities in the math, but I hope it also means that the logic of the system can be communicated reasonably clearly.

Nerdery I: Shot Location

The basis of my model is an "exponential decay" formula. Obviously as you move further from goal, your chance of scoring decreases. But at what rate? How much better is it to be three yards from goal, compared to six, twelve or eighteen yards away? Exponential decay models suggest that your chance of scoring decreases non-linearly as you move away from goal, with larger drops occurring in moves from three to six yards, and relatively smaller drops coming at twelve to sixteen yards. This is what an exponential decay curve looks like. The data points are non-headed non-cross assisted shots bucketed by adjusted distance from goal (0-6 yards, 6-9, 9-12, 12-15 and so on).

The R-Squared on this is a comically high 0.997. That's just the luck of the bucketing to some degree, but I think it shows that exponential decay is the right model.
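For anyone who wants to try this kind of fit, here is a minimal sketch of fitting rate = a * exp(-b * distance) to bucketed conversion rates by ordinary least squares on the logarithm of the rate. The bucket midpoints follow the buckets described above, but the conversion rates are synthetic, generated from a made-up curve, and are not the data behind the 0.997 figure. Fitting on the log scale is also an assumption; the post does not say how its R-squared was computed.

```python
import math

def fit_exponential_decay(distances, rates):
    """Fit rate = a * exp(-b * distance); return (a, b, r_squared on the log scale)."""
    logs = [math.log(r) for r in rates]
    n = len(distances)
    mean_x = sum(distances) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in distances)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(distances, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(distances, logs))
    ss_tot = sum((y - mean_y) ** 2 for y in logs)
    return math.exp(intercept), -slope, 1 - ss_res / ss_tot

# Midpoints of the 0-6, 6-9, 9-12, ... yard buckets, with SYNTHETIC rates.
midpoints = [3.0, 7.5, 10.5, 13.5, 16.5, 19.5]
synthetic_rates = [0.8 * math.exp(-0.15 * d) for d in midpoints]
a, b, r_squared = fit_exponential_decay(midpoints, synthetic_rates)
```

Because the synthetic rates sit exactly on a curve, the fit recovers the made-up parameters almost perfectly. Real bucketed data would be noisier.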

Getting that "adjusted distance from goal" turned out to be a much harder task than I expected. There's no simple math for taking X,Y location on the pitch and calculating how good a shooting location it is, on average. I can take distance from the end line, but of course shots from an angle are more difficult to score. I can take distance from the center of goal, but this will treat shots from the end line as far too high-quality. I can take the angle formed by the X,Y location and the goal posts, but this ends up underrating shots from wide areas in the six- or eighteen-yard-box while massively overrating shots from long distance.

I used my old "bucket" system to test the new one, making sure that my shot location method projected the correct odds of shots being scored from different regions of the pitch.

This is the logic I settled on. The quality of location is a function of two factors: distance from goal and angle from goal. The best angle from goal is found in the center of the pitch. If you are in the roughly six-yard wide central channel, you have as good an angle on goal as exists. The further outside that six-yard central channel you move, the worse your shooting location becomes. So the calculation is (Distance from goal / relative angle to goal) where relative angle to goal is 1 when you are in the central channel and drops lower the further from that central channel you move.

More specifically, relative angle to goal is the ratio of the angle from your location to the center of goal, divided by the angle from the edge of the central channel to the center of goal, at an equally far distance from the end line. In the image below, imagine a shot taken from location X. The relative angle is the angle to the blue "X" line divided by the angle to the red "CC" line.
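One possible reading of that definition, sketched in code. The post does not spell out how the angles are measured, so this assumes each angle is taken between the end line and the line from the shot location to the center of goal, and that the central channel extends three yards either side of center. The function name and coordinate convention are hypothetical.

```python
import math

CHANNEL_HALF_WIDTH = 3.0  # half of the "roughly six-yard wide" central channel (assumed)

def relative_angle(lateral_offset, distance_from_end_line):
    """Relative angle to goal: 1.0 in the central channel, smaller further wide.

    lateral_offset: yards left or right of the center of goal.
    distance_from_end_line: yards out from the end line (must be positive).
    """
    if distance_from_end_line <= 0:
        raise ValueError("distance_from_end_line must be positive")
    offset = abs(lateral_offset)
    if offset <= CHANNEL_HALF_WIDTH:
        return 1.0
    # Angle between the end line and the line to the center of goal (assumption).
    shot_angle = math.atan2(distance_from_end_line, offset)
    channel_angle = math.atan2(distance_from_end_line, CHANNEL_HALF_WIDTH)
    return shot_angle / channel_angle
```

Under these assumptions a shot ten yards out and ten yards wide of center gets a relative angle of roughly 0.61, and the value keeps dropping as the shot moves wider.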





It turned out this still took some regression. Simply dividing distance from goal by relative angle overrated shots from wide areas. So I tested a few options and came up with the following formula.

Adjusted Distance to Goal = (Distance from end line) / (relative angle ^ 1.32)

So I raise relative angle to a power of about 1.3, and the shot conversion numbers work in each bucket. For all of the following sections, I will be basing my expected goal calculations on this "adjusted distance" number.
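As a quick sketch, the adjusted distance formula translates directly into a small function. The function and parameter names are my own labels; the 1.32 exponent is the one given above, and the relative angle is assumed to be already computed and to lie between 0 and 1.

```python
def adjusted_distance(distance_from_end_line, relative_angle, exponent=1.32):
    """Adjusted Distance to Goal = distance / (relative angle ^ exponent)."""
    if not 0 < relative_angle <= 1:
        raise ValueError("relative_angle is expected to be in (0, 1]")
    return distance_from_end_line / relative_angle ** exponent

central = adjusted_distance(12.0, 1.0)  # central channel: distance is unchanged
wide = adjusted_distance(12.0, 0.5)     # wide shot: pushed out to roughly 30 yards
```

With plain division the wide shot would be treated as a 24-yard shot. The exponent pushes it out further, which is the correction for plain division overrating shots from wide areas.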

Nerdery II: Shot Type and Assist Type

I had hoped to build one model, one long equation with adjustments for everything I could make meaning out of. But it became clear quickly that different types of shots had quite different exponential decay models. Imagine a header. This is a header not assisted by a cross, maybe it was assisted by a flick, maybe the ball is just pinging around the box after a failed clearance. If you are right on the goal mouth when you attempt this header, you're extremely likely to score. Probably just as likely as if you were kicking the ball from that location, maybe even slightly more so. But back out even six yards, and your chance of scoring with a headed shot is far lower than it would be if you could put your foot through the ball. So the implied curve for headers is quite different than the implied curve from regular shots.

And it gets more complicated if you compare that header to attempting a header off a cross. Right at the goal mouth, using your head to re-direct a ball fizzed across the box is certainly harder than knocking home a flick or a loose ball. But move back ten yards, and it might even be easier to score off a cross. Putting sufficient power behind a header from ten yards is extremely hard, but redirecting a cross at pace just takes getting your angles right.

Fundamentally, there are a bunch of different curves for different types of shots. I ended up working with five primary curves. These are (1) headers off crosses, (2) headers not off crosses, (3) non-headed shots off crosses, (4) regular shots and (5) shots following successful dribbles. You can compare the five general curves. I have included the underlying formula for each curve, though these will be updated as more factors are included.

I have written about shots following successful dribbles elsewhere. A shot from inside 10-15 yards, following a successful dribble, is often a one-on-one with the keeper. Deeper, it's almost always a shot from a reasonably open shooting position, though it is rarely a clear one-on-one. So dribbling creates better shots, but especially so within the danger zone.

You might notice two types of shots which I did not create distinct curves for. These are shots assisted by through-balls and shots from direct free kicks. I found that the curves for shots off through-balls and free kicks were very similar to the curve for regular shots. I can use a simple additive adjustment to account for the value of a free kick or a through-ball assist.

There are, I think, football reasons for this. Both free kicks and shots off through-balls only get taken from a limited area on the pitch. Direct free kicks in modern football are taken from outside the eighteen-yard box. They are scored at higher rates than other shots from these areas, by a few percentage points more. If free kick shots were taken commonly from inside the box, the relationship would become very complicated, based on where the wall could stand, the angle to goal, and so on. It might require a whole new curve. But since those shots don't happen, I don't have to worry about them.

Something similar happens with through-balls. In a full league season, there are rarely more than five shots assisted by through-balls attempted from within the six-yard box. If you attempt a through-ball into the six-yard box, the keeper will claim it. Likewise, there are basically no shots assisted by through-balls taken from over 20-25 yards. If a player receives a pass behind the defensive line and 30 yards out, he almost always has space to run with the ball into a better shooting position, closer to the goal. Because shots off through-balls are almost all taken in the region between 6-20 yards from goal, they do not form a peculiar curve with weird bits at the ends. A through-ball assist tends to add about 15 percentage points to a shot's expected goals value.

Nerdery III: Speed of Attack

The most important general factor in whether a shot is likely to be scored is its location. The next most important factor is the speed of attack. Did the attempt follow a plodding passing move during which the defense could set themselves into position? Or was it a fast break 3-on-3 situation where the defense could not easily close down the space?

As my examples show, it's not that attacking at speed is valuable in itself. The value is that fast attacks are usually attacks against an unprepared defense. Because we don't have data on the actual positions of defenders, it is necessary to create proxies for defensive pressure. Through-ball assists are the one I've used in the past. I have added shots following successful dribbles to the pile, and now we come to attacks at speed.

The relationship between speed of attack and chance of scoring is roughly linear, but it breaks down at the extremes.





Attacks at a speed faster than 8 yd/s are pretty rare. You can't run with a ball much faster than Mame Biram Diouf did last week. From the clearance to the goal takes about 10 seconds, and the ball travels about 80 yards before Diouf slots past Joe Hart. So how is it even possible to travel faster than that?

My theory is that we're seeing fast attacks that actually aren't as "usefully" fast as Diouf's. If you lump the ball forward with a goal kick some 60 yards, your big-man striker knocks it down to your little-man second striker and he takes a shot, you could cover 80 yards in only a handful of seconds. But while going 8.0 yd/s got Diouf a free shot on goal against an unprepared defense, the "route one" shot described above would be taken into the teeth of a defense that was already playing deep against the long kick.

To account for this "route one" effect, I went through every attacking move to find those which included a very long pass, one of greater than 35 yards in "vertical" length. These attacks are downgraded in relation to other attacks.
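A minimal sketch of that check, assuming each pass in an attacking move is available as a pair of start and end positions measured in yards along the length of the pitch. That data layout and the function name are hypothetical, and measuring "vertical" length as the absolute change in position is also an assumption.

```python
def has_very_long_pass(passes, threshold_yards=35.0):
    """Return 1 if any pass covers more than threshold_yards of vertical length, else 0.

    passes: list of (start_position, end_position) in yards along the pitch length.
    """
    return int(any(abs(end - start) > threshold_yards for start, end in passes))

route_one = has_very_long_pass([(10.0, 70.0), (70.0, 78.0)])            # goal kick, knockdown
patient_move = has_very_long_pass([(10.0, 30.0), (30.0, 45.0), (45.0, 60.0)])
```

The result is returned as 0 or 1 so it can be used directly as a dummy variable in a later adjustment.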

So now I have to incorporate a roughly linear speed of attack effect into the exponential model. I did this by adjusting the adjusted distance. I treat a fast-attack shot from 15 yards as if it were taken from 10 or 12, effectively. I found that shots attempted following set pieces, either corner kicks or free kicks, are slightly less likely to be scored than shots from open play. So to account for this effect, I treat shots off set plays as shots taken at a slightly below average speed.

Average attacking speed is usually about 2.5 yd/s. So I give shots following set plays a "speed" of about 2 yd/s. For every shot, I find its speed above or below average simply by subtracting 2.5 from the speed. Call this "Speed-Av." Then I adjust distance by the following formula. "LB" is a dummy variable marking whether the attacking move has included a very long pass.

Speed-Adjusted Distance = AdjDist * (1 - (0.1 * Speed-Av)) * (1 + (0.08 * LB))

So a shot following a very fast attack like Diouf's, which is about five yd/s above average, is treated as if it were taken from a distance half the distance it actually was. For this reason, I have Diouf's xG at a little bit under 0.3, even though he is well wide of goal at the moment he takes the shot. An attack at average speed from that location would probably have an xG value around 0.1 to 0.15. I'm pretty happy with how that comes out.
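The arithmetic can be checked with a short sketch. The coefficients (0.1 per yd/s, 0.08 for a very long pass, 2.5 yd/s average) come from the text; the function name and the 20-yard example distance are made up for illustration.

```python
AVERAGE_SPEED = 2.5  # yd/s, the average attacking speed quoted above

def speed_adjusted_distance(adj_dist, speed, long_ball=False):
    """AdjDist * (1 - 0.1 * Speed-Av) * (1 + 0.08 * LB)."""
    speed_av = speed - AVERAGE_SPEED
    return adj_dist * (1 - 0.1 * speed_av) * (1 + 0.08 * int(long_ball))

# A Diouf-style break: about 80 yards in about 10 seconds, so 8 yd/s.
fast_break = speed_adjusted_distance(20.0, 80 / 10)              # about 9 yards
set_play = speed_adjusted_distance(20.0, 2.0)                    # about 21 yards
route_one = speed_adjusted_distance(20.0, 2.5, long_ball=True)   # about 21.6 yards
```

At 8 yd/s the multiplier is 1 - 0.1 * 5.5 = 0.45, which is where "half the distance" comes from. The sketch does not guard against extreme speeds, where the linear relationship is said to break down.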

Nerdery IV: The Formula

I have a few more notes to add, but that's basically the method as I developed it. In the following formulas, I'm using the "Speed-Adjusted Distance" number explained above, which in turn is built from the "adjusted distance" number described above that. I'm calling this "SpDist."

1) Headers off crosses

xG = 0.64 * exp(-0.21 * SpDist)

2) Headers, not off crosses

xG = 1.13 * exp(-0.29 * SpDist)

3) Non-headed shots off crosses

xG = 0.96 * exp(-0.19 * SpDist)

4) Regular shots

xG = 0.86 * exp(-0.14 * SpDist) + 0.15 * TB + 0.045 * DFK + 0.012

"TB" is whether the shot is assisted by a through-ball, "DFK" is whether the shot is a direct free kick. If a shot is kicked, no matter from where, there's some chance of a perfect strike. So I include a small constant in kicked shots to account for this chance.

5) Shots following a successful dribble

xG = 1.02 * exp(-0.12 * SpDist) + 0.21 * TB + 0.017

I kept getting different values for through-balls and the constant with shots following successful dribbles. As I thought about it, I realized it made sense. If you receive a through-ball pass and then dribble someone, odds are really good that you're now one-on-one with the keeper. There should be a larger TB bonus for these shots.

6) Shots following a dribble of the keeper

xG = 1.15 * exp(-0.04 * SpDist)

These are a tiny subset of shots, but of 33 in the last two seasons, 20 were scored. So they require their own bucket. Spurs benefit from this number this season, as Eric Dier's goal against West Ham has an xG value around 0.65 because he went around Adrian before slotting home.

7) Shots off a rebound

xG = 0.94 * exp(-0.09 * SpDist)

These are shots immediately following either a save or a shot off the woodwork. They are much more likely to be scored than other shots, given that the keeper is rarely in position to save the shot.
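The curves can be gathered into one function, as a sketch rather than a definitive implementation. The coefficients are copied from the formulas above; the shot-type labels and function name are hypothetical, and the direct free kick term for regular shots is left out to keep the example short. Nothing here caps values at 1, since the formulas as written do not.

```python
import math

# (scale, decay rate) for the shot types with no additive terms.
SIMPLE_CURVES = {
    "header_off_cross": (0.64, 0.21),
    "header_not_off_cross": (1.13, 0.29),
    "kicked_off_cross": (0.96, 0.19),
    "keeper_dribble": (1.15, 0.04),
    "rebound": (0.94, 0.09),
}

def expected_goals(shot_type, sp_dist, through_ball=False):
    """Expected goals for one shot, given its speed-adjusted distance in yards."""
    tb = int(through_ball)
    if shot_type == "regular":
        # Direct free kick term omitted in this sketch.
        return 0.86 * math.exp(-0.14 * sp_dist) + 0.15 * tb + 0.012
    if shot_type == "after_dribble":
        return 1.02 * math.exp(-0.12 * sp_dist) + 0.21 * tb + 0.017
    scale, decay = SIMPLE_CURVES[shot_type]
    return scale * math.exp(-decay * sp_dist)

regular_10 = expected_goals("regular", 10.0)
through_ball_10 = expected_goals("regular", 10.0, through_ball=True)
```

For example, a regular shot at a speed-adjusted distance of 10 yards comes out a little above 0.22, and the same shot off a through-ball is 0.15 higher.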

And that's my xG formula. Share and enjoy.