In general, based on the events data set, a soccer match consists of 1,682 ± 101 events on average (Fig. 2b), with an inter-event time of 3.59 ± 7.42 seconds between two consecutive events. On average, 59 ± 29 events are observed for a player in a match, one every 78.78 ± 105.64 seconds, confirming that soccer players are typically in ball possession for less than two minutes [22]. Passes are the most frequent events, accounting for around 50% of the total events (Fig. 2a). Duels (e.g., tackles and dribbles) are the second most frequent events (≈28%), while shots account for about 1.5% of the total events. Goals, the most important events in soccer since they determine a match outcome, are the rarest, accounting for less than 1% of the total number of events. We provide an example of all the events (1,620) observed for the match “Lazio - Internazionale” of the Italian first division (May 20, 2018), plotted at the positions on the field where they occurred (Fig. 2c).
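The summary statistics above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the field names `match_id`, `seconds` and `type` are hypothetical, the toy events are invented, and timestamps are assumed to run on a single clock per match.

```python
from collections import Counter, defaultdict
from statistics import mean

def summarize_events(events):
    """events: list of dicts with hypothetical keys 'match_id', 'seconds', 'type'."""
    times_by_match = defaultdict(list)
    for event in events:
        times_by_match[event["match_id"]].append(event["seconds"])

    # Average number of events per match.
    events_per_match = mean(len(ts) for ts in times_by_match.values())

    # Gaps between consecutive events, computed within each match.
    gaps = []
    for ts in times_by_match.values():
        ts = sorted(ts)
        gaps.extend(later - earlier for earlier, later in zip(ts, ts[1:]))
    mean_gap = mean(gaps) if gaps else None

    # Share of each event type over all events.
    type_counts = Counter(event["type"] for event in events)
    shares = {t: c / len(events) for t, c in type_counts.items()}
    return events_per_match, mean_gap, shares

# Invented toy data, for illustration only.
toy_events = [
    {"match_id": 1, "seconds": 0.0, "type": "pass"},
    {"match_id": 1, "seconds": 2.0, "type": "pass"},
    {"match_id": 1, "seconds": 6.0, "type": "duel"},
    {"match_id": 2, "seconds": 1.0, "type": "pass"},
    {"match_id": 2, "seconds": 4.0, "type": "shot"},
]
per_match, mean_gap, shares = summarize_events(toy_events)
```

If the real timestamps restart at each match period, the gaps would need to be computed per period rather than per match.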

Fig. 2 Statistics of the events data set. (a) Frequency of events per type. (b) Distribution of the number of events in soccer matches. (c) Events produced by the two teams in the match Lazio (cyan points) vs. Internazionale (black squares). The events are plotted on the position of the field where they occurred.

Spatial dimension

By looking at the position of the field where the events occur, we can investigate interesting aspects of a soccer match, such as the spatial distribution of players and events. For example, the kernel density plot in Fig. 3a shows that passes are distributed mostly in the center of the field, where most of the match takes place. As one could expect, we observe differences in the spatial distribution of events when we select the players by their role: while the events of forwards are observed mainly in the opponent’s half of the field (Fig. 3h), the events of defenders are observed mostly in their own half and on the sides of the field (Fig. 3g). Similarly, the spatial distribution of events changes with their type: attacking events (e.g., shots) are mostly observed close to the opponent’s goal (Fig. 3b), while defensive events (e.g., clearances) are mostly observed close to the team’s own goal (Fig. 3f). The spatial dimension of match events can provide information about a player’s behavior during a match, making it possible, for example, to determine a player’s profile from his average position during a match [13].
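As a rough illustration of how event positions can be aggregated by field zone, the sketch below counts events on a coarse grid. It is a simple histogram, not the kernel density estimate used for Fig. 3. The coordinate range (0 to 100 on both axes), the grid size and the sample positions are assumptions made for the example.

```python
def zone_counts(positions, nx=4, ny=3, x_max=100.0, y_max=100.0):
    """Count events per field zone on an nx-by-ny grid.

    positions: iterable of (x, y) pairs; the 0..x_max and 0..y_max ranges
    are assumptions of this sketch, not a statement about the data format.
    """
    grid = [[0] * nx for _ in range(ny)]
    for x, y in positions:
        # Clamp so that points on the far boundary fall in the last zone.
        col = min(int(x / x_max * nx), nx - 1)
        row = min(int(y / y_max * ny), ny - 1)
        grid[row][col] += 1
    return grid

# Invented pass positions, mostly around midfield.
pass_positions = [(10, 50), (48, 40), (52, 60), (55, 45), (100, 100)]
grid = zone_counts(pass_positions)
```

Filtering `positions` by event type or by player role before counting gives the per-type and per-role views discussed above.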

Fig. 3 Distribution of positions per event type. (a–f) Kernel density plots showing the distribution of the events’ positions during a match. The darker the green, the higher the number of events in a specific field zone. (g–i) Distribution of the passes’ positions during a match for each player’s role. The darker the color, the higher the number of passes in a specific field zone.

Temporal dimension

By looking at when the events occur during a game, we can investigate interesting dynamics of teams and players. For example, Fig. 4 shows that goals are scored more frequently in the second half of the match [23,24], which may reflect several factors that affect scoring, such as a decrease in attention by the defenders towards the end of the match due to a loss of stamina, or a more offensive attitude of the opponents who try to win or equalize the match. Similarly, we observe that the frequency of other rare events, like yellow and red cards, is highest in the recovery time. This could indicate a bias among referees, who are less prone to award a card at the beginning of a match (as suggested in [25]), a reduction of stamina, or an increase in players’ aggression at the end of the match.
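Counting events in fixed time windows, as in Fig. 4, can be sketched as follows. The minute values are invented, and how recovery time is encoded in the real data is not specified here: this sketch simply assumes a continuous match clock in minutes.

```python
from collections import Counter

def count_in_windows(minutes, window=5):
    """Count events per time window; each key is the window's starting minute."""
    counts = Counter()
    for m in minutes:
        counts[int(m // window) * window] += 1
    return dict(sorted(counts.items()))

# Invented goal times (in minutes) pooled over several matches.
goal_minutes = [3.2, 47.0, 48.9, 88.5, 89.1, 92.4]
goal_counts = count_in_windows(goal_minutes)
```

The same function applies unchanged to yellow or red cards by passing their timestamps instead.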

Fig. 4 In-match evolution of the number of events. Number of events (i.e., goals in the top plot, yellow cards in the middle plot and red cards in the bottom plot) that occur in all the matches in the data set, with time windows of 5 minutes.

Combining the spatial and the temporal dimensions of soccer-logs also allows computing the so-called invasion index, a measure of how close to the opponent’s goal a team plays during a match (i.e., its dangerousness), and the acceleration index, a measure of how fast a team reaches the closest position to the opponent’s goal [26]. The invasion index is computed on each possession phase, which is defined as a sequence of events on the ball made by a team before the opponents gain possession. To compute the invasion index of a possession phase, we (i) compute, for each event in the possession phase, the probability of scoring from the position where the event occurs (defined as the fraction of goals that have been scored from that position), and (ii) take the highest of these probabilities. A team’s overall invasion index during a match is simply the average invasion index across its possession phases. Figure 5 shows the invasion and acceleration indices of the teams throughout the match Roma - Fiorentina (0–2), played on April 7, 2018. We observe that Fiorentina has on average a higher invasion index than Roma (0.27 ± 0.33 and 0.23 ± 0.31, respectively).
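The two-step definition of the invasion index can be sketched as below. The zone names and scoring probabilities are invented for illustration, and treating unknown zones as having probability zero is an assumption of this sketch.

```python
from statistics import mean

def invasion_index(phase, scoring_prob):
    """Highest scoring probability among the positions of one possession phase.

    phase: list of zone labels, one per event (hypothetical representation).
    scoring_prob: zone label -> probability of scoring from that zone.
    """
    # Zones missing from the table count as probability 0 (assumption).
    return max(scoring_prob.get(zone, 0.0) for zone in phase)

def team_invasion_index(phases, scoring_prob):
    """Average invasion index across a team's possession phases in a match."""
    return mean(invasion_index(phase, scoring_prob) for phase in phases)

# Invented probabilities and possession phases.
scoring_prob = {"own_half": 0.0, "midfield": 0.01, "edge_of_box": 0.08, "six_yard": 0.4}
phases = [
    ["own_half", "midfield"],
    ["midfield", "edge_of_box", "six_yard"],
    ["own_half"],
]
team_value = team_invasion_index(phases, scoring_prob)
```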

Fig. 5 Invasion index and acceleration index for a game in the match data set. Bold lines represent the rolling mean of, respectively, invasion index (a) and acceleration index (b), while thin lines represent the individual values computed for each possession phase of each team. Purple vertical lines refer to the two goals scored by Fiorentina during the match, while the red vertical line indicates the half time of the match.

A team’s average acceleration index is another measure of its playing efficacy during a match. The acceleration index of a team’s possession phase is computed as the ratio between its invasion index and the square of the time between the first event and the most dangerous event of the possession phase. A team’s average acceleration index during a match is the average acceleration index across its possession phases. As with the invasion index, Fiorentina has a higher average acceleration index than Roma (Roma: 0.06 ± 0.16, Fiorentina: 0.07 ± 0.15).
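The ratio described above can be sketched for a single possession phase as follows. The event representation and the use of seconds as the time unit are assumptions. The text does not say what happens when the most dangerous event is also the first one, so returning `None` in that case is purely a choice of this sketch.

```python
def acceleration_index(phase):
    """Invasion index divided by the squared time needed to reach the most dangerous event.

    phase: list of (seconds, scoring_probability) tuples in time order
    (hypothetical representation).
    """
    start_time = phase[0][0]
    # max() returns the earliest event when several share the top probability.
    peak_time, peak_prob = max(phase, key=lambda event: event[1])
    elapsed = peak_time - start_time
    if elapsed == 0:
        return None  # undefined ratio; handling is an assumption of this sketch
    return peak_prob / elapsed ** 2

# Invented possession phase: the most dangerous event occurs 4 seconds in.
phase = [(100.0, 0.0), (102.0, 0.01), (104.0, 0.25), (106.0, 0.05)]
value = acceleration_index(phase)
```

Averaging this value over a team's possession phases (skipping undefined ones, under the same assumption) gives the team-level index.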

Both the invasion and the acceleration indices show that Fiorentina (the winner of the match) was more dangerous during the match, staying closer to the opponent’s goal and reaching dangerous zones faster than Roma.

Team analysis

Soccer-logs enable the analysis of the interactions between players through the reconstruction of a team’s passing network [7,14], a representation of the movements of the ball between teammates during a match. A passing network allows identifying the key players in the team, i.e., the ones with more connections to their teammates or a high passing activity [27,28]. Figure 6 shows two examples of a team passing network for the match Napoli - Juventus (Italian first division). Although Napoli made more passes than Juventus (666 vs. 332), the two passing networks show similar average weighted out-degrees (1.01 ± 0.93% and 1.10 ± 0.84%, respectively). However, Juventus’ playing style resulted in a higher connectivity [29], defined as the network’s second smallest eigenvalue (i.e., a root of the characteristic equation of a matrix). This value indicates the robustness of a team, i.e., the strength of the links between its players. Large values of connectivity between teammates are associated with a better overall team performance.
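A passing network of this kind can be built from a list of passes with a few lines of code. The sketch below assumes that passes are already available as (passer, receiver) pairs, which is a hypothetical input format, and it reports raw pass counts rather than the normalized percentages quoted above. It does not compute the connectivity eigenvalue.

```python
from collections import Counter

def passing_network(passes):
    """Directed, weighted edges: (passer, receiver) -> number of passes."""
    return Counter(passes)

def weighted_degrees(edges):
    """Weighted out-degree and in-degree of every player in the network."""
    out_deg, in_deg = Counter(), Counter()
    for (passer, receiver), weight in edges.items():
        out_deg[passer] += weight
        in_deg[receiver] += weight
    return out_deg, in_deg

# Invented passes between three players.
passes = [("A", "B"), ("A", "B"), ("B", "C"), ("C", "A"), ("A", "C")]
edges = passing_network(passes)
out_deg, in_deg = weighted_degrees(edges)
```

Summing `out_deg` and `in_deg` per player gives the total degree that determines node size in a drawing such as Fig. 6.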

Fig. 6 Representation of the player passing networks of the match Napoli-Juventus. Nodes represent players, edges represent passes between players. The size of the nodes reflects the number of ingoing and outgoing passes (i.e., node’s degree), while the size of the edges is proportional to the number of passes between the players.

The reconstruction of passing networks from soccer-logs enables several performance analyses [7]. For example, by using the passing network and the players’ position during a pass it is possible to identify the most efficient tactical patterns across teams [30,31].

Player analysis

Soccer-logs can be used to compare the performance of players and track their evolution in time. As an example, we compare three forwards with different characteristics: L. Messi (FC Barcelona), C. Ronaldo (Juventus FC) and M. Salah (Liverpool). We observe that L. Messi has the highest passing activity: while he produces 49 ± 19 passes per match on average, C. Ronaldo and M. Salah produce 26 ± 6 and 25 ± 9 passes per match, respectively. Additionally, we observe that L. Messi engages in more duels per match (25 ± 8) than C. Ronaldo and M. Salah (15 ± 5 and 21 ± 7 duels per match, respectively). The data we release to the public also enable the computation of several performance metrics, such as Flow Centrality [14] and PlayeRank [13]. A player’s flow centrality in a match is defined as his betweenness centrality in the passing network [14]. Figure 7a shows the distribution of flow centrality of L. Messi, C. Ronaldo and M. Salah for the matches in season 2017/2018. L. Messi has a higher flow centrality (0.10 ± 0.01) than C. Ronaldo and M. Salah (0.09 ± 0.01 and 0.09 ± 0.01, respectively).
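Since flow centrality is defined as betweenness centrality in the passing network, it can be illustrated with Brandes' algorithm. The sketch below treats the network as directed and unweighted and returns unnormalized scores. Whether the reported values use pass counts as weights or a normalization is not stated here, so those choices are assumptions, and the three-player network is invented.

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm on an unweighted directed graph.

    adj: node -> iterable of successors (players who receive a pass).
    Returns unnormalized betweenness scores.
    """
    nodes = set(adj) | {v for succs in adj.values() for v in succs}
    score = {v: 0.0 for v in nodes}
    for s in nodes:
        stack = []
        preds = {v: [] for v in nodes}
        sigma = {v: 0 for v in nodes}   # number of shortest paths from s
        dist = {v: -1 for v in nodes}   # -1 marks unvisited nodes
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:  # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj.get(v, ()):
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in nodes}
        while stack:  # accumulate dependencies in reverse BFS order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score

# Invented network: the midfielder links the defender and the forward.
adj = {"Def": ["Mid"], "Mid": ["Fwd", "Def"], "Fwd": ["Mid"]}
flow = betweenness(adj)
```

In this toy network every shortest path between the defender and the forward runs through the midfielder, so only the midfielder receives a non-zero score.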

Fig. 7 Distribution of flow centrality and PlayeRank score for three top players. (a) Distribution of the flow centrality of L. Messi (red line), C. Ronaldo (blue line) and M. Salah (black line) during the soccer season 2017/2018. (b) Performance quality calculated as the PlayeRank score of L. Messi (red line), C. Ronaldo (blue line), and M. Salah (black line).

The performance quality of the players during the season can be assessed using PlayeRank, a data-driven framework that offers a principled multi-dimensional and role-aware evaluation of soccer players’ performance quality in a match or in a series of matches [13]. Figure 7b shows that the three aforementioned players have different performance trends during the season. M. Salah obtained his best performance in the first part of the season and then declined over the rest of it. In contrast, L. Messi significantly increased his performance quality throughout the season, while C. Ronaldo, who did not play in the first part of the season due to an injury, has on average a performance quality slightly higher than Salah’s but lower than Messi’s. We can conclude that, according to two measures computed on soccer-logs, Messi performs best in terms of both passing centrality and performance quality.