In this article, I demonstrate some basic data analysis techniques using the pandas data analysis library for Python 3 on the League of Legends 2018 Summer Split dataset from Oracle’s Elixir.

Viewing Stats about a Single Team

First, I’d like to be able to view the past games for a team, and some interesting data about them. The dataset is set up so that each game appears twice at the team level, once from the viewpoint of each team.

For example, say we are reviewing Echo Fox vs Clutch Gaming, with a gameid of 1002620109. There will be two team-level rows: from Echo Fox's point of view, they got 10 kills and suffered 5 deaths; to Clutch Gaming, that is 5 kills and 10 deaths. Each player also gets a single row, which means 10 player rows for each game, plus an extra two for the "team" rows, which show things like teamkills and teamdeaths. In those two rows the player field is simply Team.

This means that getting both participants into a single row requires some work. First, some basic cleaning of the data is in order:

Some of the 0 and 1 values, for example first blood, are saved as strings instead of numbers. I convert them all to floats, and also encode the side as an integer: blue becomes 0 and red becomes 1.
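A minimal sketch of that cleaning step follows. The column names (fb for first blood, result, side) and the "Blue"/"Red" labels are my assumptions about the spreadsheet layout, and the two rows are invented placeholders rather than real games.

```python
import pandas as pd

# Invented placeholder rows; column names and side labels are assumptions.
raw = pd.DataFrame({
    "gameid": [1, 1],
    "team": ["Team A", "Team B"],
    "player": ["Team", "Team"],
    "side": ["Blue", "Red"],
    "fb": ["1", "0"],        # 0/1 flags stored as strings
    "result": ["1", "0"],
})

def clean(df, flag_cols=("fb", "result")):
    """Return a copy with 0/1 flag columns as floats and side as 0 (blue) / 1 (red)."""
    out = df.copy()
    for col in flag_cols:
        # errors="coerce" turns anything unparseable into NaN instead of raising
        out[col] = pd.to_numeric(out[col], errors="coerce").astype(float)
    out["side"] = out["side"].map({"Blue": 0, "Red": 1})
    return out

cleaned = clean(raw)
```

Using floats rather than integers for the flags means a missing value can be stored as NaN in the same column.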

Next, get all the games for a single team, with a team_games function.
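One way team_games could look is sketched below. It keeps only the team-level rows (where player is Team) and indexes them by gameid; indexing by gameid is my own choice, and the toy frame is invented for illustration.

```python
import pandas as pd

# Invented toy data: team-level rows plus one player row.
games = pd.DataFrame({
    "gameid": [1, 1, 1, 2, 2],
    "team": ["Team A", "Team A", "Team B", "Team A", "Team C"],
    "player": ["Team", "Player 1", "Team", "Team", "Team"],
    "teamkills": [10, 10, 5, 7, 12],
})

def team_games(df, team):
    """Team-level rows for one team, one row per game, indexed by gameid."""
    mask = (df["team"] == team) & (df["player"] == "Team")
    return df.loc[mask].set_index("gameid")

a_games = team_games(games, "Team A")
```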

Running this gives a table like this:

With a ton more data. Now we want the opposite of team_games, opponent_games:
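A possible opponent_games, under the same assumptions about the layout: find the gameids the team played in, then keep the team-level rows from those games that belong to the other side. The toy frame is invented.

```python
import pandas as pd

# Invented toy data, team-level rows only.
games = pd.DataFrame({
    "gameid": [1, 1, 2, 2, 3, 3],
    "team": ["Team A", "Team B", "Team C", "Team A", "Team B", "Team C"],
    "player": ["Team"] * 6,
    "teamkills": [10, 5, 12, 7, 3, 9],
})

def opponent_games(df, team):
    """Team-level rows of whoever played against `team`, indexed by gameid."""
    team_rows = df[df["player"] == "Team"]
    played = team_rows.loc[team_rows["team"] == team, "gameid"]
    mask = team_rows["gameid"].isin(played) & (team_rows["team"] != team)
    return team_rows.loc[mask].set_index("gameid")

a_opponents = opponent_games(games, "Team A")
```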

Putting it all together:
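As a rough sketch of the combined view, the team's rows and the opponents' rows can be merged on gameid so both participants land in a single row. The match_summary name, the _opp suffix, and the toy data are all mine, not from the dataset.

```python
import pandas as pd

# Invented toy data, team-level rows only.
games = pd.DataFrame({
    "gameid": [1, 1, 2, 2],
    "team": ["Team A", "Team B", "Team C", "Team A"],
    "player": ["Team"] * 4,
    "teamkills": [10, 5, 12, 7],
    "result": [1.0, 0.0, 1.0, 0.0],
})

def match_summary(df, team):
    """One row per game: the team's numbers next to the opponent's."""
    team_rows = df[df["player"] == "Team"]
    own = team_rows[team_rows["team"] == team]
    opp = team_rows[team_rows["team"] != team]
    # Inner merge keeps only games that `team` actually played in.
    merged = own.merge(opp, on="gameid", suffixes=("", "_opp"))
    return merged[["gameid", "team", "team_opp",
                   "teamkills", "teamkills_opp", "result"]]

summary = match_summary(games, "Team A")
```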

Gives us a nice summary of the results to date:

Now that we have a feel for the data, let’s do some actual analysis.

Stats about First Blood/Turret/Dragon…

Often a strong early game dictates the result — or does it? Let’s investigate. We can grab the percentage of first bloods/turrets easily, using our team_games function.
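Because the flags are stored as 0.0 and 1.0, the mean of a column is the share of games in which the team got that objective. A small sketch, assuming the rows come from something like team_games and that the columns are named fb, ft, fd and fbaron (assumed names, invented values):

```python
import pandas as pd

# Invented team-level rows for one team across four games.
team_rows = pd.DataFrame({
    "fb":     [1.0, 0.0, 1.0, 1.0],   # first blood
    "ft":     [0.0, 0.0, 1.0, 1.0],   # first turret
    "fd":     [1.0, 1.0, 1.0, 1.0],   # first dragon
    "fbaron": [0.0, 0.0, 0.0, 1.0],   # first baron
})

def first_objective_rates(rows, cols=("fb", "ft", "fd", "fbaron")):
    """Percentage of games in which the team secured each first objective."""
    return rows[list(cols)].mean() * 100

rates = first_objective_rates(team_rows)
```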

Which shows us:

There are a number of ways to interpret this: TSM doesn’t prioritize dragon? Perhaps they have weak lanes (thus not often getting first turret)? How about we check out the rest of the league? First, make games_by_league and teams_by_league functions:
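These two helpers might look something like the following. I am assuming a league column holding codes such as "NALCS" and "LCK"; the rows are invented.

```python
import pandas as pd

# Invented toy data; the league column and its codes are assumptions.
games = pd.DataFrame({
    "league": ["NALCS", "NALCS", "LCK", "LCK", "NALCS"],
    "team": ["Team B", "Team A", "Team X", "Team Y", "Team A"],
    "player": ["Team", "Team", "Team", "Team", "Player 1"],
})

def games_by_league(df, league):
    """Team-level rows for every game played in one league."""
    return df[(df["league"] == league) & (df["player"] == "Team")]

def teams_by_league(df, league):
    """Sorted list of the distinct teams in one league."""
    return sorted(games_by_league(df, league)["team"].unique())

na_teams = teams_by_league(games, "NALCS")
```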

Now we can loop over each team to collect the stats. Each stat is assigned a key in a dictionary, and the value is a list holding the percentage for each team in that category.
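On clean toy data, the loop could be sketched like this. The stat column names are assumed, and the final DataFrame is just a convenient way to view the dictionary.

```python
import pandas as pd

# Invented team-level rows for two teams, two games each.
team_rows = pd.DataFrame({
    "team": ["Team A", "Team A", "Team B", "Team B"],
    "fb": [1.0, 0.0, 0.0, 1.0],
    "ft": [1.0, 1.0, 0.0, 0.0],
})

stats = {"fb": [], "ft": []}            # stat name -> one percentage per team
teams = sorted(team_rows["team"].unique())

for team in teams:
    rows = team_rows[team_rows["team"] == team]
    for col in stats:
        stats[col].append(rows[col].mean() * 100)

# One row per team, one column per stat.
table = pd.DataFrame(stats, index=teams)
```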

…or not. We get an error. It turns out one game was cancelled for technical reasons, and since there was no first blood, the column is blank. We can fix this easily, by first changing all whitespace into np.nan, then using dropna to get rid of those rows.
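A sketch of that fix, assuming the blank cell arrives as an empty or whitespace-only string. The regex and the choice to drop rows based on the fb column only are mine; the rows are invented.

```python
import numpy as np
import pandas as pd

# Invented rows; game 2 stands in for the cancelled game with a blank fb cell.
rows = pd.DataFrame({
    "gameid": [1, 2, 3],
    "fb": ["1", " ", "0"],
})

cleaned = (
    rows.replace(r"^\s*$", np.nan, regex=True)   # whitespace-only cells -> NaN
        .dropna(subset=["fb"])                   # drop games with no first blood
        .assign(fb=lambda d: d["fb"].astype(float))
)
```

Passing subset to dropna limits the check to the columns that matter, so a row is not thrown away because some unrelated column happens to be empty.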

We get this nice table:

To get some more context, let’s add in the total wins for each team and sort by that:
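Since result is 1.0 for a win, summing it per team gives total wins. A sketch with invented numbers, assuming the stats table is indexed by team name:

```python
import pandas as pd

# Invented stats table, indexed by team name.
table = pd.DataFrame({"fb": [50.0, 75.0, 25.0]},
                     index=["Team A", "Team B", "Team C"])

# Invented team-level rows with a 0/1 result column.
team_rows = pd.DataFrame({
    "team": ["Team A", "Team A", "Team B", "Team B", "Team C", "Team C"],
    "result": [1.0, 0.0, 1.0, 1.0, 0.0, 0.0],
})

wins = team_rows.groupby("team")["result"].sum()
table["wins"] = wins                     # aligned on the team-name index
ranked = table.sort_values("wins", ascending=False)
```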

Now we get:

First baron is pretty consistent in the top 7. All of them are within three games of each other, though I did expect a bit of a larger gap. First dragon, turret and blood appear completely uncorrelated with wins, though (we will see later on, using the pandas corr method, that they are indeed not highly correlated).

Golden Guardians is a bit of an outlier, with a lot of first dragons, second only to Team Liquid. This could be for a number of reasons, such as the type of dragon, or some teams favoring early game junglers, for example.

Using corr to find correlation

Let’s see if there is a relationship between first dragon, turret, etc, and actually winning the game. Create a do_correlation function:
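A do_correlation function can be little more than a column selection followed by DataFrame.corr, which computes pairwise Pearson correlation by default. The column list is assumed, and the toy values are invented and deliberately extreme so the output is easy to read.

```python
import pandas as pd

# Invented rows: fbaron matches result exactly, side is its exact opposite.
team_rows = pd.DataFrame({
    "side":   [0, 1, 0, 1],
    "fb":     [1.0, 1.0, 0.0, 0.0],
    "fbaron": [1.0, 0.0, 1.0, 0.0],
    "result": [1.0, 0.0, 1.0, 0.0],
})

def do_correlation(rows, cols=("side", "fb", "fbaron", "result")):
    """Pairwise Pearson correlation matrix for the chosen columns."""
    return rows[list(cols)].corr()

corr = do_correlation(team_rows)
```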

This gives us:

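Correlation goes from -1 to +1. A value of -1 indicates an exactly opposite relationship, 0 indicates no linear relationship, and +1 indicates a perfect positive one. For example, if first blood had a correlation of 1 with result, that would mean a team wins every game in which they get first blood and loses every game in which they do not.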
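To make the number less of a black box, here is the Pearson coefficient computed by hand and compared with the pandas built-in. The two series are invented.

```python
import pandas as pd

# Invented 0/1 flags for five games.
fb = pd.Series([1.0, 1.0, 0.0, 0.0, 1.0])
result = pd.Series([1.0, 0.0, 0.0, 1.0, 1.0])

def pearson(x, y):
    """Sum of co-deviations divided by the product of the deviation norms."""
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / ((dx ** 2).sum() * (dy ** 2).sum()) ** 0.5

manual = pearson(fb, result)
builtin = fb.corr(result)          # Series.corr defaults to Pearson

same = pearson(fb, fb)             # a column against itself
opposite = pearson(fb, 1 - fb)     # a column against its exact opposite
```

Here manual and builtin agree, same comes out at 1, and opposite at -1, which are the two ends of the scale described above.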

As expected, getting the first baron has a high correlation with winning the game, 0.76. First blood also has a high correlation with first dragon — maybe the team gets first blood botlane a lot, then transitions into dragon?

We defined blue side as 0, and red side as 1. Notice everything has a negative correlation with side. That means as side goes up, everything else goes down. In other words, red side is much worse, at least for Team Liquid. Let’s see if this tendency extends to other teams.

Gives us:

First baron is still a solid indicator of who is going to win. First dragon and first turret, however, are largely unrelated to the result: 0.11 is a very weak correlation. However, there is a strong relationship between first blood and first turret. Perhaps NA teams have a tendency to play towards bot, often getting first blood (which leads to first turret)? How about compared to Korea’s LCK, the strongest region?

First turret and first baron both have higher correlations. Perhaps the LCK is better at pushing an advantage, and converting objectives into victories? First blood, again, has no obvious relationship to the result. Blue side is slightly favoured.

Conclusion and Improvements

Although machine learning libraries are the latest and greatest tools sweeping the data science community, you can draw some solid conclusions using a simpler library like pandas. I plan to do a follow-up article using scikit-learn to train some models to predict things like first blood, who wins a game, and so forth.

Even if I intend to build a predictive model using a machine learning library, I usually pull in pandas and explore the data first, to get a good intuition for what I’m working with and what kind of model I want to train.

Some areas to explore further are generating graphs, which pandas supports (using matplotlib under the hood), and doing some analysis of single players across multiple splits or even seasons. Perhaps TL’s dominant bot lane followed Doublelift when he transferred from TSM? Pandas is the perfect tool for this kind of high-level analysis.