Data science in support of esports performance and strategy (I)

First step: a general case study of the League of Legends World Championship

Before getting into the topic, here are a few disclaimers and some background on the motivation behind this project.

This article is intended for any sports, esports or tech enthusiast who would like to get involved in data analysis. My primary goal is to explore various data science tools, from statistical analysis to machine learning techniques, in order to find new ways of understanding performance and strategy in esports. Since the focus is on League of Legends, it helps if the reader has at least a basic idea of the rules of the game.

This article is the first episode of a series. As a first glimpse into the topic, it covers two main ideas:

how to get the data.

which directions future analysis could take.

The series will be a playground and a place for experiments. Since I am not yet working with any operational team, performance will not be the only focus. My motivation is mostly to apply various machine learning techniques to a topic I know well; I am convinced that new knowledge or leads could be discovered this way (or maybe not).

Which data?

The ninth League of Legends World Championship, hosted in Europe, ended with the victory of FunPlus Phoenix, the second consecutive title for a Chinese team. Each year, the World Championship concludes a full year of competition. This edition brought together 24 teams qualified from 13 regions across the world, 127 players, and more than a hundred games.

This championship, also referred to as Worlds, is a really good subject for analysis for several reasons:

the reasonable volume of data.

the diversity of players and strategies from all the different regions.

it is by far the most watched tournament in the scene, so plenty of analyses, resources and materials already exist.

We can also add regular season data from the major leagues if we want to discover more general things about regional play-styles, player profiles, drafting priorities …

How?

Knowing what to analyse is one thing; getting the data is another.

Fortunately, Riot Games, the company that develops the game, offers very detailed data about every public game. Through its public API, we can get many different values and indicators about matches and players.

However, tournament games are hosted on different servers from the ones regular players use. So we have to use a workaround and get the data not from the public API, but from another server that offers almost the same kind of data. The only thing we need is the identifier of every game, which can be hard to find. Thanks to Tim Sevenhuysen and his project Oracle’s Elixir, such data is freely available, simply by downloading dumps of whole seasonal splits and selecting the columns we want. So really, a big thank-you to him.
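To make the "download a dump and select the columns" step concrete, here is a minimal sketch with pandas. The column names (gameid, league, team, player), the league code and the sample rows are all hypothetical placeholders, not the actual layout of an Oracle’s Elixir export; a small in-memory CSV stands in for the downloaded file.

```python
import io
import pandas as pd

# Tiny stand-in for a downloaded dump; a real file is far wider.
# Column names and rows are hypothetical, for illustration only.
SAMPLE_DUMP = io.StringIO(
    "gameid,league,team,player,kills\n"
    "1001,WC,FPX,Doinb,3\n"
    "1001,WC,G2,Caps,2\n"
    "2002,LEC,G2,Caps,5\n"
)

def load_game_ids(source, league, columns=("gameid", "league", "team", "player")):
    """Read a dump, keep only the wanted columns, and list the unique game ids of one league."""
    df = pd.read_csv(source, usecols=list(columns))
    selected = df[df["league"] == league]
    return sorted(selected["gameid"].unique().tolist())

# Identifiers that could then be sent to the match data server.
worlds_ids = load_game_ids(SAMPLE_DUMP, league="WC")
```

With a real dump, `source` would simply be the path of the downloaded file, and the column names would have to be adapted to whatever the export actually contains.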

Actors involved, querying architecture and steps.

Since Oracle’s Elixir already aggregates data from the major leagues, what is the point of this work? The main goal is to get finer-grained and more complete data, whatever aspect of the game we need to focus on: players, champions, teams, regions, jungle paths, player itemization … The analysis should not be limited by the architecture. Also, with future extensions, the same code could retrieve data from regular ranked games. This would be really interesting if we want large volumes, more general tendencies or scouting tools.

Technologies involved

Only a few technologies will be used in this project. The goal is to have an effective stack of proven technologies rather than to experiment with new architectures. Here is a quick overview of the tools:

Python: if you want to learn about contemporary data analysis or machine learning, then Python should be your main focus. With its short and powerful syntax, the advantages of an interpreted language and tons of compiled libraries, Python offers high-quality tools for data handling, learning and visualization, such as Pandas, TensorFlow, scikit-learn and Seaborn.

Jupyter: since the work done for this series of articles is more exploratory than about building a robust framework for everyone to use, the notebook environment fits perfectly. Coding is more flexible that way, and it leaves more room for testing and experimenting.

MySQL: the data we need to store is clearly predefined and is probably already stored in a relational model in the original database. Moreover, we need a tool that can query large volumes and aggregate the results. SQL and its MySQL implementation are free to use and robust.

Tableau: in future analyses, in addition to the visualizations produced in Python, I could also use this software as a complement.

Database relational model.

This database model is my own, but it follows quite closely the structure of the JSON (a tree structure) returned by the Riot API. In addition, some competitive information from Oracle’s Elixir is stored in GameMetadata. Timeline information is not yet integrated, because such data needs more dedicated attention. It will be integrated for specific analyses.

Please also note that Player and Team do not reference unique entities in the competitive environment, such as G2, Invictus Gaming or Doublelift. Player stores the state of a player during one single game, and a Team is only the group of 5 players during that specific game. It would be good to have entities that track organizations and player movements, especially in these times of mercato (the transfer window). But that would require a lot of manual work; it is a separate task that could be added as a new layer in future extensions.
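To illustrate the per-game nature of these tables, here is a rough sketch of such a schema. The table names Player, Team and GameMetadata come from the model above; every column is a hypothetical placeholder, the rows are made up, and Python's built-in sqlite3 stands in for MySQL so that the example runs without a server.

```python
import sqlite3

# Table names follow the article; every column below is a hypothetical placeholder.
SCHEMA = """
CREATE TABLE GameMetadata (
    game_id   INTEGER PRIMARY KEY,
    league    TEXT,
    duration  INTEGER
);
CREATE TABLE Team (
    team_id   INTEGER PRIMARY KEY,
    game_id   INTEGER NOT NULL REFERENCES GameMetadata(game_id),
    tag       TEXT,
    win       INTEGER
);
CREATE TABLE Player (
    player_id INTEGER PRIMARY KEY,
    team_id   INTEGER NOT NULL REFERENCES Team(team_id),
    username  TEXT,
    role      TEXT,
    kills     INTEGER
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Made-up rows: one game, one team row for that game, one player row for that game.
conn.execute("INSERT INTO GameMetadata VALUES (1, 'WC', 1800)")
conn.execute("INSERT INTO Team VALUES (10, 1, 'FPX', 1)")
conn.execute("INSERT INTO Player VALUES (100, 10, 'Doinb', 'MID', 3)")

# Walking back up the tree recovers the game context of a player row.
row = conn.execute(
    """
    SELECT p.username, t.tag, g.league
    FROM Player p
    JOIN Team t ON t.team_id = p.team_id
    JOIN GameMetadata g ON g.game_id = t.game_id
    """
).fetchone()
```

In this layout the same person playing two games would appear as two Player rows, which is exactly the limitation described above.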

However, for competitive games, we must acknowledge the quality of the data. With a simple split on in-game usernames, we can retrieve the team tag and the username in a uniform way, which greatly simplifies future analysis. Also, players seem to be sorted by role, which is really useful information too.
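The username split could look like the sketch below. It assumes names of the form "TAG Username" separated by a single space. The role order used in `assign_roles` (top, jungle, mid, AD carry, support) is my assumption about how the players are sorted, and both function names are hypothetical.

```python
def split_summoner_name(name):
    """Split 'TAG Username' into (tag, username) on the first space."""
    tag, sep, username = name.strip().partition(" ")
    if not sep:
        # No space found: treat the whole string as the username, with no tag.
        return None, tag
    return tag, username.strip()

# Assumed ordering of the five players within a team.
ROLE_ORDER = ["TOP", "JUNGLE", "MID", "ADC", "SUPPORT"]

def assign_roles(names):
    """Pair the five usernames of one team with roles, by position."""
    return {split_summoner_name(n)[1]: role for n, role in zip(names, ROLE_ORDER)}

roles = assign_roles(["FPX GimGoon", "FPX Tian", "FPX Doinb", "FPX Lwx", "FPX Crisp"])
```

Splitting only on the first space keeps usernames that themselves contain spaces intact.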

Finally, some analysis.

As a first analytical step, we must choose an approach. The dataset offers many possibilities, whether we want to focus on teams, picks and bans, regional play-styles … The article is already quite long, and the goal was mainly to present the data preparation process. I think that focusing first on players and on the differences in stats between roles offers a good overview of the dataset.

Feature oriented analysis

In the following analysis, the data is aggregated (mean values) per player over World Championship matches. We only select players who have played at least 3 matches, to avoid biased mean values. Almost every value (= feature) is normalized by the duration of the match, because a player who played longer games tends to have larger totals, such as total damage. I will use the following suffixes to indicate the normalization method:

AVG / Average: The simplest aggregation. It uses the mean value across the available games.

PM / Per Minute: The total value is divided by the total duration. This method assumes that all values grow linearly over time. This is not always true: for example, champions deal more damage in the late game than in the early game. But in my opinion, this is a good trade-off compared to the raw value.

PART / Part or percentage: The value is expressed as a share of a larger value that includes it. For example, magic, physical and true damage are 3 complementary parts of total damage.
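The three suffixes and the 3-match filter can be sketched with pandas as follows. The column names and the rows are made up for illustration. The per-minute value is computed per game and then averaged, which is one possible reading of the PM definition; dividing summed totals by summed durations would be the other.

```python
import pandas as pd

# Made-up rows for illustration; column names are hypothetical.
games = pd.DataFrame({
    "player":       ["A", "A", "A", "B", "B"],
    "duration_min": [30.0, 40.0, 30.0, 25.0, 35.0],
    "total_damage": [15000, 24000, 12000, 10000, 14000],
    "magic_damage": [9000, 12000, 6000, 1000, 1400],
    "kills":        [3, 5, 1, 2, 4],
})

MIN_GAMES = 3  # players with fewer games are dropped

# PM: value divided by game duration. PART: share of an enclosing total.
per_game = games.assign(
    total_damage_PM=games["total_damage"] / games["duration_min"],
    magic_damage_PART=games["magic_damage"] / games["total_damage"],
)

# AVG: plain mean across a player's games; PM and PART are averaged the same way.
agg = per_game.groupby("player").agg(
    games_played=("kills", "count"),
    kills_AVG=("kills", "mean"),
    total_damage_PM=("total_damage_PM", "mean"),
    magic_damage_PART=("magic_damage_PART", "mean"),
)
players = agg[agg["games_played"] >= MIN_GAMES]
```

Here player B has only two games and is filtered out, so a single unusual match cannot dominate a player's mean.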

One very practical tool for getting a clear overview of all the variables is the correlation matrix. Each column is compared to all the others. Without going too deep into the definition, correlation measures whether two random variables are positively linked (blue), negatively linked (red) or not linearly related (white).

Warning: correlation is not causation.

A correlation matrix can be hard to read when columns are not ordered. To help, we can apply agglomerative clustering, treating correlation as a similarity measure. This gives a tree in which correlated variables sit on nearby leaves.

Correlation matrix, with scale, agglomerative tree and clusters.
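A matrix ordered this way can be computed along the following lines. This is a sketch, not necessarily how the figure above was produced: it assumes Pearson correlation, uses 1 - correlation as the distance, and picks average linkage from SciPy. The feature names and the synthetic data are made up.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

def clustered_correlation(features):
    """Return the correlation matrix reordered by hierarchical clustering, plus the tree."""
    corr = features.corr()  # Pearson correlation by default
    # Turn similarity into a distance: 0 when perfectly correlated, 2 when anti-correlated.
    dist = 1.0 - corr.to_numpy()
    np.fill_diagonal(dist, 0.0)
    dist = (dist + dist.T) / 2.0  # enforce exact symmetry against float noise
    tree = linkage(squareform(dist, checks=False), method="average")
    order = corr.columns[leaves_list(tree)]
    return corr.loc[order, order], tree

# Synthetic features: three driven by the same signal, one unrelated.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "gold_PM": base + 0.1 * rng.normal(size=200),
    "wards_PM": rng.normal(size=200),
    "damage_PM": base + 0.1 * rng.normal(size=200),
    "deaths_AVG": -base + 0.1 * rng.normal(size=200),
})
ordered, tree = clustered_correlation(demo)
```

With this distance, anti-correlated variables end up far apart; using 1 - |correlation| instead would group them together. For plotting, seaborn's `clustermap` draws a comparable heatmap with its dendrogram.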

Explaining every conclusion that can be drawn from the matrix and the tree would be counterproductive. However, it is worth noting at least a few leads for future analysis:

In performance-oriented work, the main goal would be to maximize win rate. We can observe that most quantitative indicators are correlated with it, with kill participation and assists having the highest values. A first interpretation: teams whose players take kills in organised, grouped fights are better rewarded.

We can also observe some values that are anti-correlated with win rate. Unsurprisingly, these include deaths and damage taken. But others are less obvious: game duration, physical damage and crowd control inflicted are anti-correlated too. These indicators could give us hints about the metagame and could be studied in more depth.

Vision, wards and assists, which are associated with the support role, are clearly separated from carry values such as farming, gold earned or damage output.

Damage on turrets is highly correlated with jungle farm, which highlights the jungler's role in lane pressure and in securing objectives.

First blood presence is only slightly correlated with win rate and does not seem significant. It should not be considered a main feature for future analyses.

Role oriented analysis

We saw that some features can be clustered and associated with specific in-game roles. We could look at the distributions and projections of these variables by role to see whether we can indeed focus more deeply on some particular roles (Jungle, AD Carry, Solo mid or top, and Support).
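As a first, very rough way to look at features by role, one could compare per-role medians with pandas. The values and column names below are made up for illustration and do not come from the dataset.

```python
import pandas as pd

# Made-up per-player aggregates; column names and values are illustrative only.
players = pd.DataFrame({
    "role": ["SUPPORT", "SUPPORT", "ADC", "ADC", "JUNGLE", "JUNGLE"],
    "wards_PM": [1.6, 1.4, 0.5, 0.7, 0.9, 1.1],
    "gold_PM": [250.0, 270.0, 450.0, 430.0, 340.0, 360.0],
})

# Median of each feature within each role.
medians = players.groupby("role")[["wards_PM", "gold_PM"]].median()

# Role with the highest median for each feature.
top_role = medians.idxmax()
```

A feature whose medians differ strongly between roles is a candidate marker for that role; full distributions (box plots, for instance) would show whether the separation holds beyond the median.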