Formula 1 is often described as the pinnacle of motorsport, and only the best drivers get to compete in this exclusive competition. The entire Formula 1 grid consists of 20 drivers divided over 10 teams. To get the best performance out of the single-seater cars, they are fitted with a large number of sensors and communication equipment. The data generated by these sensors ranges from various measurements inside the engine to the temperature of the tyres.

All of this data is communicated between the driver and his race engineer during the race to get the best performance from the car. The communication between the driver and his engineer is called team radio. Other topics discussed over team radio include mechanical support, race strategy, complaints about other drivers and warnings about the track itself. Team radio is also used to relay warnings, complaints and penalties to and from the stewards (the referees of motorsport).

The research into this topic was mostly exploratory and is meant to show what can be done with the data and what results this yields. Hence many different methods were applied and tested.

Gathering data

The collected team radio transcripts from 2017 can be found on racefans.net [1]. For ease of use, we transferred the transcripts to a CSV file. Note that we only used the data from the races and not the data from qualifying and practice sessions. The reason for this is that the qualifying and practice sessions do not contain a lot of data and their format is completely different from that of the race data.
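As a rough illustration of this step, the sketch below reads such a CSV file and keeps only the race messages. The column names (race, session, lap, driver, message) and the sample rows are invented for this example; the layout of the actual file is not described here.

```python
import csv
import io

# Invented sample rows; the column names are assumptions,
# not the layout of the actual transcript file.
SAMPLE_CSV = """race,session,lap,driver,message
Singapore,race,1,Vettel,"I have damage, I have to stop"
Singapore,qualifying,,Hamilton,"How is the gap?"
Brazil,race,12,Hamilton,"Tyres feel good"
"""

def load_race_messages(csv_file):
    """Return only the rows recorded during a race, with lap as an int."""
    rows = []
    for row in csv.DictReader(csv_file):
        if row["session"] != "race":
            continue  # skip qualifying and practice sessions
        row["lap"] = int(row["lap"])
        rows.append(row)
    return rows

messages = load_race_messages(io.StringIO(SAMPLE_CSV))
print(len(messages), "race messages loaded")
```

With a real file, the same function could be called on an opened file object instead of the in-memory sample.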

The text data as can be found on racefans.net

The challenge of processing this data is that it contains a lot of Formula 1-related jargon. These are words that are usually not included in a general dictionary, or whose meaning differs from the dictionary definition. Sentence length also varies greatly, from simple one- or two-word confirmations to run-on sentences with multiple sub-clauses.

Preprocessing the data

Preprocessing is a necessary step in text mining. It not only reduces the size of the data and makes the processing phase quicker, it is also a way to normalize the data. During the preprocessing stage, we used three types of preprocessing.

Tokenization

Tokenization is the process of breaking up a sentence into words, symbols and characters. The resulting list of “tokens” becomes the input for the further text mining operations. Tokenization solves the issue of handling special characters and punctuation. The main goal of tokenization is to identify meaningful keywords in the text.

Stop word removal

Stop word removal is the process of removing very common words that carry little meaning. These stop-words occur with very high frequency in the text and do not add any useful information. Some common stop-words are “and”, “are”, “this” and “the”.

Stemming

Stemming is the process of reducing multiple variants of a word to one single word. This again serves to produce a standardized version of the text. For example, the words “presentation”, “presented” and “presenting” could all be reduced to the core word “present”.
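The three preprocessing steps above can be sketched in a few lines of plain Python. This is a toy illustration only: the stop-word list is a tiny hand-picked sample, and toy_stem is a hypothetical suffix stripper standing in for a real stemming algorithm. It is not the tooling used to produce the results in this post.

```python
import re

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"and", "are", "this", "the", "a", "is", "to", "of", "we"}

# Suffixes stripped by the toy stemmer, checked in this order.
SUFFIXES = ("ation", "ing", "ed", "s")

def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def toy_stem(token):
    """Strip one known suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, remove stop words, then stem what is left."""
    return [toy_stem(t) for t in remove_stop_words(tokenize(text))]

print(preprocess("We are presenting the presentation, and this is presented."))
```

On the example sentence all three variants collapse to “present”, matching the stemming example above. A real stemmer handles far more cases than this suffix list does.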

Word clouds

A word cloud is a visual representation of textual data, often used to get a first insight into the data. Word clouds can be constructed in three different ways:

Frequency-based

Significance based

Category based

In this example, frequency-based word clouds are used with some light filtering, because we want to see which words are used most in our data. After the initial wordclouds were generated, we noticed some high-frequency words that did not add much information and that were common to all the clouds (words much like stop-words, but not recognized as such). These were filtered out and the wordclouds were generated again. In the images below you can see some of the resulting wordclouds per track. The outline of each wordcloud is that of the track being discussed, and the colours are those of the corresponding country’s flag.
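The two-pass approach described above, counting word frequencies and then counting again with an extra filter list, can be sketched as follows. The messages and the filtered word are invented for illustration, and word_frequencies is a hypothetical helper; the resulting counts would then be handed to whichever word cloud renderer is used.

```python
from collections import Counter

def word_frequencies(messages, extra_filter=()):
    """Count words over all messages, skipping words in extra_filter."""
    blocked = {w.lower() for w in extra_filter}
    counts = Counter()
    for message in messages:
        for word in message.lower().split():
            word = word.strip(".,!?\"'")  # drop surrounding punctuation
            if word and word not in blocked:
                counts[word] += 1
    return counts

# Invented example messages, not taken from the transcripts.
messages = ["Box box, box now", "OK, tyres are OK", "OK, pit this lap"]

first_pass = word_frequencies(messages)
# Second pass: remove a frequent word that carries little information.
second_pass = word_frequencies(messages, extra_filter=["ok"])
print(second_pass.most_common(3))
```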

Wordclouds per track

What can be observed here is some useful information about what happened on each track. The Russian GP (a) was a race with very little overtaking and hence a lot of strategy. When it is hard to overtake, teams often look to good pit-stops and tyre strategies to gain an advantage over their opponents. Therefore we can see words such as “Pit”, “Tyres” and “Box” in Russia, indicating strategic calls. If we look at the Mexican GP (b) we can see a very big “Lewis”. This is because Lewis Hamilton secured his 4th world title on this track.

When looking at the Singapore GP (c) we can see words like “Inters” and “Damage”. This was the only rainy race of that season. The intermediate tyre, inters for short, is a tyre compound designed for wet races. Rain also makes for more challenging driving conditions, and as such it caused a lot of accidents during the race, including a large incident at the start. Because of this, words like “damage” and “out” are a lot more prevalent. The same approach was also used for the data per driver.

Wordclouds per driver

Lewis Hamilton is one of the more calculating drivers currently racing. He often asks his engineers for updates about his tyres, strategy and race pace. This is reflected in his driver wordcloud (figure 3a), where these terms show up prominently. We can also see that some Italian words show up in Sebastian Vettel’s wordcloud (figure 3b). While Sebastian Vettel himself is German, he does drive for the Italian team Ferrari. For example, he is known to frequently thank his engineers and the team using some of the Italian words that show up in his wordcloud.

Driver similarity

Another interesting item to investigate is driver dictionaries. A driver dictionary contains every word used by the driver together with how many times the driver used it. This way a vector of used words is constructed per driver. With these vectors we can calculate the “Cosine similarity” between all the drivers. This is calculated by the equation below. In the equation A and B are the word-vectors of the drivers that are being compared:
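In its standard form, with A_i and B_i the number of times word i was used by each of the two drivers and n the size of the combined vocabulary, cosine similarity is defined as:

```latex
\[
\operatorname{similarity}(A, B) = \cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \, \sqrt{\sum_{i=1}^{n} B_i^{2}}}
\]
```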

Cosine similarity

This similarity indicates how close two drivers’ dictionaries are. The value is then scaled to lie between 0 and 1 rather than -1 and 1. Here 0 means they are completely different and 1 means they are exactly the same. A score of 1 only occurs when a driver is compared with himself. Comparing all the drivers with each other results in the distance matrix shown in the figure below. From this matrix we can conclude that the drivers’ dictionaries are not close to each other at all, with a maximum distance value of 0.2 (except when comparing a driver with himself, in which case the distance is 0).
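A minimal sketch of the driver dictionaries and the pairwise comparison is given below. It makes two assumptions that are not stated above: the distance is taken to be 1 minus the cosine similarity, and no further rescaling is applied, since word counts are never negative and the cosine of two count vectors therefore already lies between 0 and 1. The function names and example messages are invented for illustration.

```python
import math
from collections import Counter

def driver_dictionary(messages):
    """Map each word a driver used to the number of times it was used."""
    return Counter(word for m in messages for word in m.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity of two word-count dictionaries."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty dictionary shares nothing with any other
    return dot / (norm_a * norm_b)

def distance_matrix(dictionaries):
    """Pairwise distances, assuming distance = 1 - cosine similarity."""
    names = sorted(dictionaries)
    return {
        (x, y): 1.0 - cosine_similarity(dictionaries[x], dictionaries[y])
        for x in names
        for y in names
    }

# Invented messages for two hypothetical drivers.
dictionaries = {
    "driver_a": driver_dictionary(["box box", "how are the tyres"]),
    "driver_b": driver_dictionary(["grazie ragazzi", "the tyres are gone"]),
}
matrix = distance_matrix(dictionaries)
print(round(matrix[("driver_a", "driver_b")], 3))
```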

Driver distance matrix

Sentiment analysis

Sentiment analysis is the practice of labelling sentences by how positive or negative they are, based on the vocabulary that is used. On our data, we used it to label each message as either positive, negative or neutral. Additionally, each message is given a compound score between -1 and 1 that indicates how positive the message is. A score of -1 is completely negative and a score of 1 is completely positive.
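To make the idea of a label plus a compound score concrete, here is a toy lexicon-based scorer. Everything in it is an assumption made for illustration: the word scores are invented, the squashing function and its alpha parameter are one possible way to map a sum onto the range -1 to 1, and the 0.05 threshold for the labels is arbitrary. It is not the sentiment tool used for the results in this post.

```python
import math

# Toy lexicon with invented word scores; a real sentiment lexicon is far larger.
LEXICON = {
    "good": 2.0, "great": 3.0, "thanks": 1.5,
    "damage": -2.5, "crash": -3.0, "slow": -1.5,
}

def compound_score(message, alpha=15.0):
    """Sum the word scores and squash the total into the range -1 to 1."""
    words = (w.strip(".,!?") for w in message.lower().split())
    total = sum(LEXICON.get(w, 0.0) for w in words)
    return total / math.sqrt(total * total + alpha)

def label(score, threshold=0.05):
    """Turn a compound score into a positive, negative or neutral label."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

for text in ["Great job, thanks", "I have damage", "Box this lap"]:
    print(text, "->", label(compound_score(text)))
```

A message without any lexicon words sums to zero and is therefore labelled neutral, which is one reason short confirmations contribute little to the analysis.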

After this labelling we plotted the scores of the messages on a lap-by-lap basis, omitting laps where no message was relayed. The resulting graph shows that a driver’s happiness can differ throughout a race and shows how events and occurrences have an impact on that happiness.
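The lap-by-lap aggregation can be sketched as below, assuming each message has already been reduced to a (lap, score) pair. The helper name and the scores are invented for illustration.

```python
from collections import defaultdict

def average_score_per_lap(scored_messages):
    """Average (lap, score) pairs per lap; laps without messages are absent."""
    by_lap = defaultdict(list)
    for lap, score in scored_messages:
        by_lap[lap].append(score)
    return {lap: sum(s) / len(s) for lap, s in sorted(by_lap.items())}

# Invented scores: two messages on lap 1, none on lap 3.
scored = [(1, -0.6), (1, -0.2), (2, 0.4), (4, 0.1), (4, 0.5)]
averages = average_score_per_lap(scored)
print(list(averages))  # laps that had at least one message
```

Because laps without messages never enter the dictionary, they are omitted from the plot rather than drawn as a score of zero.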

Lap by lap sentiment analysis

In the figure above, we can see the average scores of all the drivers combined on each lap of the 2017 Brazilian Grand Prix. This race was overall quite positive for the drivers. There was a low number of retirements and not a lot of damage. This shows in the average sentiment as well, as it only occasionally dips below 0. In figure b we have the 2017 Singapore Grand Prix. This race had three notable incidents, corresponding to the three main dips below 0 in the graph.

Event extraction

Using the Singapore Grand Prix as an example, we can show how sentiment analysis can be used to find interesting events within a race. The figure below will be used as an example.

Event 1 in this example was a 4-way crash instigated by both Ferrari drivers, Sebastian Vettel and Kimi Räikkönen, as well as Red Bull driver Max Verstappen. This crash eventually also involved Fernando Alonso, who was collected by the other cars and could not avoid them. All of these drivers ended up telling their respective engineers about the crash and its consequences. This resulted in a lot of negative sentiment.

Event 2 was caused by Toro Rosso driver Daniil Kvyat, where a driver error caused him to crash his car into one of the guard rails.

Event 3 is a combination of two related incidents. Initially, the dip is caused by Sauber driver Marcus Ericsson. He too made a driver error and hit the barrier. As a result, he spun out onto the track. Since his car was now on the track, the marshals decided to bring out a safety car. This is a slower car that all of the drivers have to follow and cannot pass.

While the safety car is out on the track, overtaking is not allowed. A lot of the grip of a Formula 1 car comes from the heat in the tyres, which makes them stick to the road surface better. While the safety car is out, the cars all start to lose grip as their tyres get colder from the slower pace. This caused a number of drivers, most notably Mercedes driver Lewis Hamilton, to complain about the speed at which the safety car was going. As a result, the negative sentiment stuck around for more consecutive laps.

Singapore Grand-Prix event analysis
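Locating events as dips below zero in the lap-by-lap sentiment could be automated along the lines of the sketch below. It groups consecutive laps with a negative average into one candidate event, which mirrors how event 3 spans several laps. The function name, the threshold of 0 and the example values are assumptions for illustration. The events discussed above were identified from the graph.

```python
def find_negative_dips(lap_averages, threshold=0.0):
    """Group consecutive laps whose average score is below threshold.

    Returns a list of (first_lap, last_lap, lowest_score) tuples.
    """
    dips = []
    current = None  # [first_lap, last_lap, lowest_score] of the open dip
    for lap in sorted(lap_averages):
        score = lap_averages[lap]
        if score < threshold:
            if current is not None and lap == current[1] + 1:
                # The dip continues on the next consecutive lap.
                current[1] = lap
                current[2] = min(current[2], score)
            else:
                if current is not None:
                    dips.append(tuple(current))
                current = [lap, lap, score]
        elif current is not None:
            dips.append(tuple(current))
            current = None
    if current is not None:
        dips.append(tuple(current))
    return dips

# Invented lap averages, not the measured Singapore values.
example = {1: -0.5, 2: 0.2, 10: -0.1, 11: -0.3, 12: 0.1, 38: -0.2}
for first, last, lowest in find_negative_dips(example):
    print(f"laps {first}-{last}: lowest average {lowest}")
```

Each reported lap range still has to be matched by hand against what happened in the race, as was done for the three events above.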

Conclusion

In this blog we researched whether we could apply several text mining techniques to Formula 1 team radio data. These techniques included data preprocessing, word cloud visualization, similarity matrices and sentiment analysis. Furthermore, a goal was to see if we could detect events from the sentiment analysis. This last part was successful and gave the results found in the previous section. The other conclusion we can draw from this research is that the data that was used was not ideal for this purpose. Some sentences were very short and some were extremely long. Additionally, a lot of jargon is used in this text data, which makes it difficult for a machine to learn from it.

This post has been made in collaboration with L.A.R. Linders