Posted 24 July 2017 - 20:56

Hello forum users, first post! Bit of a long one, but there's a fun project I've been working on for a while and I thought I'd share it with an audience who might be interested. The project involves taking the lap time data from an F1 race and then, for each driver, obtaining an 'outright pace' estimate.

What's the motivation? Well, I often find that after watching each race, I'm left looking at the results without being entirely sure what conclusions I can draw. As far as front-running drivers are concerned, if they have all had clean races and followed similar strategies, then we usually have a reasonable idea how their pace compared. However, if they followed different strategies, or one of them lost places at the start or around the pit stops, it becomes rather difficult. In any case, by the end of the season, the details of how each race unfolded tend to get forgotten and I'm left looking at the points totals and qualifying battles, along with their obvious limitations. Meanwhile, for drivers in the midfield, their final finishing positions are hugely influenced by whether they got the rub of the green in terms of traffic, strategy, safety cars etc. If, for example, one Toro Rosso finishes 8th and the other finishes 11th, I've no idea whether that means the one who finished 8th was quicker on the day. By the end of the season, once again I'm stuck with qualifying statistics and points totals.

So that brings us to the need for race pace estimates. Pace is not the only criterion by which you can compare drivers (there are also factors such as overtaking ability, wet weather skills, how good you are on lap 1 etc.), but it's clearly an important and interesting one. It's not readily available, but using a common statistical method known as linear modelling, we can have a pretty good go at getting it, as I shall now explain.

We'll start off by displaying some lap times from the recent British Grand Prix. Here are all of Felipe Massa's lap times:

The first thing we need to do is reduce the lap time data to laps that are genuinely useful. So that means we remove laps done behind the safety car, inlaps, outlaps, lap 1, safety car restart laps and obvious outliers. We also highlight laps where a driver was not able to use the pace of their car, i.e. where they were blocked behind cars (which I define as being within 1.5 seconds of the car ahead at the start of the lap), or were overtaking cars, getting overtaken by another car, getting lapped etc. So the 'clear laps' are the ones that we would like to use to estimate the drivers' pace.

However, we can't just take a raw average of the lap times to get the drivers' pace, because the lap times are of course affected by other factors. Lap times at the start of the race are set with a high fuel load and hence are slow; lap times set near the end have a low fuel load but are frequently on very worn tyres. Linear modelling is quite well suited to adjusting for these effects. So starting with the fuel load, here are Massa's lap times again but with a trend line fitted to estimate what effect the fuel load is having on the lap time:

While this line doesn't run that closely through every lap time, it has picked up the obvious trend, namely that as the car gets lighter, lap times get faster. However, the fuel load is of course not the only factor that has a big effect on lap times. The age of the tyre and the type of tyre used are also important. We notice that Massa's lap times get much faster after lap 25, which is when he stopped for tyres. With linear modelling, it is straightforward to have multiple trend lines to reflect multiple effects on the lap time. The next plot displays the combined effect of both fuel load and tyre type and tyre age:

These trend lines, which are fitted using lap time data for all drivers in the race, suggest that tyre degradation was fairly high in the British Grand Prix. The lap times throughout each stint remained almost constant on average, meaning that the benefit of the fuel load going down was being counteracted by the tyres losing performance.
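For anyone curious how a fit like this can be set up, here is a minimal Python sketch. It is not the code behind the graphs: the lap data is synthetic and noise-free, the coefficients in TRUE are made up, the two drivers "A" and "B" are hypothetical, and the least squares step is a hand-rolled normal-equations solver rather than a proper statistics package. It only illustrates the structure: one baseline per driver, plus fuel and tyre terms shared by all drivers.

```python
# Synthetic, noise-free laps generated from made-up coefficients (not real data).
TRUE = {"fuel": 0.05, "ss_offset": -0.10, "soft_deg": 0.04, "ss_deg": 0.06}

def make_laps(driver, base, stints, race_laps=51):
    """Rows of (driver, fuel_laps, compound, tyre_age, lap_time) for one driver."""
    rows = []
    for compound, first_lap, last_lap in stints:
        for lap in range(first_lap, last_lap + 1):
            fuel = race_laps - lap      # laps of fuel left in the car
            age = lap - first_lap       # laps already done by this tyre set
            if compound == "soft":
                tyre = TRUE["soft_deg"] * age
            else:
                tyre = TRUE["ss_offset"] + TRUE["ss_deg"] * age
            rows.append((driver, fuel, compound, age,
                         base + TRUE["fuel"] * fuel + tyre))
    return rows

def design_row(driver, fuel, compound, age, drivers):
    """One baseline column per driver, then the shared fuel and tyre columns."""
    one_hot = [1.0 if d == driver else 0.0 for d in drivers]
    is_ss = 1.0 if compound == "supersoft" else 0.0
    return one_hot + [float(fuel), is_ss, age * (1.0 - is_ss), age * is_ss]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit(rows, drivers):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    X = [design_row(r[0], r[1], r[2], r[3], drivers) for r in rows]
    y = [r[4] for r in rows]
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * t for row, t in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

drivers = ["A", "B"]
rows = (make_laps("A", 90.0, [("soft", 2, 25), ("supersoft", 27, 51)])
        + make_laps("B", 91.5, [("supersoft", 2, 20), ("soft", 22, 51)]))

base_a, base_b, fuel_eff, ss_offset, soft_deg, ss_deg = fit(rows, drivers)
pace_gap = base_b - base_a   # how much slower B is than A, in seconds per lap
print(round(pace_gap, 3), round(fuel_eff, 3), round(ss_offset, 3))
```

One thing this toy setup makes visible: within a single stint, laps of fuel plus tyre age is constant, so one driver on a one-stop strategy would not be enough to separate the fuel and tyre terms. In the sketch it is the mix of different stop laps and compound orders across drivers that makes the shared terms solvable.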

The trend lines in this graph represent an equation that the linear model has produced. It can best be represented separately for each tyre. For the soft tyre, the equation is:

Fitted lap time = 92.516 + 0.055*(number of laps of fuel in car) + 0.044*(number of laps done by tyre)

For the supersoft it is:

Fitted lap time = 92.516 + 0.055*(number of laps of fuel in car) - 0.075 + 0.051*(number of laps done by tyre)

(NB tedious clarification: you might wonder why I've not simplified the equation for the supersoft tyres by subtracting the 0.075 from 92.516. The 0.075 represents the advantage of a brand new supersoft tyre over a brand new soft tyre, which is more transparent when the equations are displayed as they are.)
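To make the two equations concrete, here is a small Python sketch that just plugs numbers into them. The coefficients are the ones quoted above; the function name, the example fuel and tyre values, and the assumption that a brand new tyre counts as 0 laps done are mine.

```python
# Coefficients copied from the equations above (seconds).
FUEL_PER_LAP = 0.055                                # cost per lap of fuel on board
SUPERSOFT_OFFSET = -0.075                           # new supersoft vs new soft
DEG_PER_LAP = {"soft": 0.044, "supersoft": 0.051}   # cost per lap of tyre age

MASSA_BASE = 92.516

def fitted_lap_time(base, fuel_laps, tyre_laps, compound):
    """Evaluate the fitted lap time for one lap (hypothetical helper)."""
    offset = SUPERSOFT_OFFSET if compound == "supersoft" else 0.0
    return (base + FUEL_PER_LAP * fuel_laps
            + offset + DEG_PER_LAP[compound] * tyre_laps)

# Illustrative inputs, not taken from the race data.
soft_lap = fitted_lap_time(MASSA_BASE, fuel_laps=30, tyre_laps=10, compound="soft")
ss_lap = fitted_lap_time(MASSA_BASE, fuel_laps=20, tyre_laps=5, compound="supersoft")

# Tyre age at which the supersoft's initial advantage has been used up,
# i.e. where -0.075 + 0.051*age equals 0.044*age.
crossover_age = -SUPERSOFT_OFFSET / (DEG_PER_LAP["supersoft"] - DEG_PER_LAP["soft"])

print(round(soft_lap, 3), round(ss_lap, 3), round(crossover_age, 1))
```

Taken at face value, the fitted numbers imply that the supersoft is only the quicker tyre for roughly its first 11 laps at equal tyre age, after which its faster degradation has eaten the 0.075 head start. That is a property of these particular coefficients, not a general statement about the tyres.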

We can also display the same type of plot for Kimi Raikkonen in the Grand Prix (NB ignore the 'chronic problem' laps for now, they'll be explained further down).

In this case, the linear model equations are slightly different:

Softs: Fitted lap time = 90.708 + 0.055*(number of laps of fuel in car) + 0.044*(number of laps done by tyre)

Supersofts: Fitted lap time = 90.708 + 0.055*(number of laps of fuel in car) - 0.075 + 0.051*(number of laps done by tyre)

So notice that the numbers for the effect of the fuel load and tyres are the same for Massa and Raikkonen. I've set up the linear model so that these effects are the same for all drivers, which will be discussed shortly. The difference between the two equations is the first number: 90.708 for Raikkonen and 92.516 for Massa. This difference (of 1.808) represents the overall advantage that Raikkonen had in seconds/lap over Massa. In fact we can calculate the pace estimates for all drivers in the race using this method and they are displayed here:

The numbers displayed are:

in brackets, how far off the pace in seconds the driver was from the fastest driver

in the box, how many laps out of the total race number of laps the model was able to use (labelled as 'clear track' laps in the graphs above)

So Bottas is estimated to have been the fastest driver in the race, with Hamilton very close behind, and with both the Ferrari and the Red Bull team mates closely matched.

This is an attempt to fit a neat structure to what is quite complicated data, so naturally there are drawbacks to the method. One I will cover briefly for now is using the same fuel and tyre values for all drivers. If we look at Massa's fitted lap times above, we notice that they are on average a bit fast for his first stint on softs and a bit slow for his stint on supersofts. This suggests that, compared to the average driver, Massa had a bigger preference for supersofts over softs. Is this a problem though? I wouldn't say so: some of his laps are faster than the fitted line, some are slower, but they largely balance out so that the overall value (92.516) is a fair representation of his pace.

In some cases this justification seems a bit of a stretch though. Here are Bottas's lap times and fitted lap times for the race:

It seems that Bottas was saving tyres up until about lap 20, before making a break once in clear air until his pit stop at the end of lap 32. The issue here is that all of his clear, fast lap times have been included in the model, but some of his slower ones, for example where he was blocked on laps 14-18, are excluded. Given his pace in the laps before that, he probably was saving tyres here. Had those laps been clear, and therefore included in the model, they would likely have been slower than average and therefore brought his estimate down; likewise for his excluded slow laps 43-45. So while the effect on his overall estimate is quite small, it's possible Bottas has been slightly flattered by his estimate in this race.
This question, of why we don't model each driver's tyre usage separately, is a complicated one. This post has quite a lot of information in it already and I don't want to derail the discussion completely, so I won't go into more details yet but might well do so in a separate post in future.

There are some interesting points to consider with some other drivers too. Here is the graph for Hamilton:

This was a pretty comfortable victory for Hamilton, and in the second stint in particular, he seemed to be capable of putting in fast lap times even at the end of the race, suggesting he'd been looking after his tyres throughout the stint before that. As long as there's no threat of rain, this is a sensible tactic in case a safety car comes out. As a result, when a driver wins a race comfortably, their race pace estimate generally understates their true pace, and I would suggest this is the case here.

Here is Ricciardo's graph:

That slow last lap is clearly an outlier but hasn't been detected as one. The effect is small (excluding the lap improves Ricciardo's estimate by about 0.05) but this is annoying. I've got a straightforward cutoff for whether something is deemed to be an outlier, and unfortunately this lap time has just avoided being classified as an outlier. A more sophisticated detection method which puts less weight on laps depending on how likely it is that a lap is an outlier would be better.
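As a rough idea of what such a down-weighting scheme could look like, here is a Python sketch using Huber-style weights. This is a standard robust statistics technique that I'm swapping in for illustration, not something the model currently does; the function name, the tuning constant 1.345 and the example residuals are all assumptions.

```python
import statistics

def huber_weights(residuals, k=1.345):
    """Weight 1 for ordinary laps, shrinking towards 0 as a lap looks more extreme.

    residuals: actual lap time minus fitted lap time, in seconds.
    k: tuning constant in units of the robust scale (1.345 is a common default).
    """
    med = statistics.median(residuals)
    # Median absolute deviation, rescaled to be comparable to a standard deviation.
    mad = statistics.median([abs(r - med) for r in residuals])
    scale = 1.4826 * mad
    if scale == 0:
        return [1.0] * len(residuals)   # no spread to judge outliers against
    weights = []
    for r in residuals:
        z = abs(r) / scale
        weights.append(1.0 if z <= k else k / z)
    return weights

# Hypothetical residuals for one stint; the last lap is far slower than fitted.
example_residuals = [0.10, -0.20, 0.05, -0.10, 0.15, -0.05, 1.40]
example_weights = huber_weights(example_residuals)
print([round(w, 2) for w in example_weights])
```

The point of the design is that there is no hard cutoff: a lap that is slightly unusual keeps nearly full weight, while a lap like the slow last one gets only a small say in the fit. In a full implementation the weights would normally be recomputed after each refit until they settle down.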

One adjustment that I've included is highlighted by Stroll's graph (and also briefly appears in Raikkonen's graph above):