I've been playing the video game Heroes of the Storm by Blizzard for about 3 years now. I even attended the game's professional scene's championship last year! One part of the game that has attracted me is the constantly accumulating character roster (I'll provide a short summary of the game later in this post). This means that new characters are being added, and existing characters are being tweaked if they are too powerful or not powerful enough. It's this latter feature that I am most interested in, and the purpose of this blog post: how do we know when a character is over or under powered? There are two methods: a wrong method, and a less-wrong method. This blog article's goal is to re-describe the the old, wrong method, and propose a new ideas for character balance. Keep in mind: the estimated values count for less than the description of the idea.

But before we get into the methods, a quick summary of Heroes of the Storm gameplay (you can skip if you are already familiar).

Gameplay

Heroes of the Storm is a 5-player vs 5-player team game. Before the game begins, each player choose a unique character (called heroes in the game) and is combined with four other players to battle against the opposing team. The objective is to destroy the opponents "core", using either the characters' unique abilities or powerful bonuses scattered around the play area (called a "map"). The game ends when either team's core is destroyed.

At the time of writing, there are over 70 characters, each with unique abilities and advantages. A players decision to choose a characters is based on i) their affinity for a character, ii) which characters are strategically advantageous, given what your opponent is playing (called the "meta"), or iii) which characters are over powered. Focusing on this last reason, a character can be over powered / under powered if their abilities are providing too much damage, healing, etc. Blizzard, the game's creator, will address these different power level of characters with weekly balance patches - however with the game constantly changing, there will never really be perfect equality.

Statistics of the game

Players can upload their games to third-party services, which aggregate all the games players, characters, map and outcomes into valuable statistics. A very influential statistic is the win rate of character i, defined as:

$$\frac{\text{wins}_{i}}{\text{games participated}_i} $$

which is an estimator for:

$$P(\text{win} \;|\; \text{$i$ present in game})$$

Ideally, the win rate of all characters are close to 0.5, as 0.5 is suppose to represent a character that is equally like to win as to lose (and hence "balanced"). Estimates typically range from 0.35 for the most under-powered, to 0.65 for the most over-powered. In fact, the ordering of this estimate over all characters is commonly used by players to find the most over-powered and under-powered characters, and call for "nerfs" or "buffs" respectively to rebalance the character.

But this is wrong. The win rate is only a correlation. It's not taking into account the entire causal picture. Let's explore that next.

The confounder problem

The problem with the above estimate, which I call the naive win rate, is that character selection and the chance of winning have common causes. These common causes are called confounders. For example, professional players may a) select certain characters more often than others and b) win more often. In an extreme case, if the best player in the world selected character T almost exclusively, then character T would have a higher than average win rate, but not necessarily due to the character.

Another example: depending on the other characters on an allied team, a) selection of a character changes, and b) probability of winning changes. If character T is often paired together with S because of some pair synergy, then the win rate of T will be influenced by the win rate of S.

Another example: the higher a team's average MMR (a measure of experience) may a) select a "meta" character more often and b) win more often. More specifically, the more experienced a team, the more likely they will recognise the best character to play in response to their opponents character, and because of such experience, are more likely to win the game.

Another example: when a new hero is released, or an existing hero gets major changes, less experienced players may choose them more. This has an effect on character selection, and on the win rate. Another confounder.

The point is: there exist many causes of both character selection and win probabilities. This means that the naive win rate is not necessarily a good measure of balance, but simply a correlation (not causation) between character selection and winning.

In fact, if you are familiar with causal diagrams, this is perhaps what the causal diagram looks like:

The arrow we are interested in is the arrow between character and win, which I call the causal win rate. The strength of this arrow (difference from 0), and its direction (positive or negative) tell us if a character is over or under powered. This tells us "how much more likely is a win if you choose character T instead of character S?"

Controlling confounders

We have lots of statistical methods to handle the confounders that muddy our naive win rate. The end result is an estimate of the causal win rate. However, causal inference requires you to state assumptions. Here's the big causal assumption, called exchangeability:

Controlling for the map, opponents' average experience, opponents' characters, allied characters, and allied experience, the win rate of those players that did and didn't choose character T would be the same on average if they all chose character T.

We are also going to simplify the problem, which I'm sure is wrong but it's a first approximation. Let's assume that the character in question is chosen last. That is, the decision to choose character T happens with perfect information of all other character selections and map knowledge. We are also assuming no effect modification (which certainly exists). Effect modification is when the effect (causal win rate) can be modified by other factors. For example, we already mentioned above that possible synergies can exists between pairs of characters - that is a form of effect modification. Similarly, the team composition (ex: how many healers the team has, how many damage dealers, etc.) is a form of effect modification. Like I said, we are going to ignore effect modification for now, and will return to it in another blog post.

For our modelling, we will use outcome regression, which is a causal inference method that carries with it lots of modelling assumptions (Note: there are better methods which have less model assumptions). For example: we are assuming a linear relationship between outcome and variables and we are assuming no interactions (effect modifiers).

Our regression will look like:

$$ \begin{align} \text{logit}(\text{win}_i) & = \alpha_1 \text{AlliedVallaPresent}_i \\ & + \alpha_2 \text{AlliedLiLiPresent}_i + ... \\ & + \beta_1\text{AvgAlliedMMR}_i \\ & + \beta_2\text{AvgOppMMR}_i \\ & + \beta_3\text{Map}_i + ... \\ \end{align} $$

where each a coefficient represents a scalar or vector (as in the case of \(\text{Map}_i\)).

Note that (and this is a consequence of our simple model) that all characters in the regression are symmetric. Thus we just need to run a single regression, and the parameters \(\alpha_i\) are all we are interested in.

Interpretation

What does \(\alpha_i\) represent though? Nothing without a reference point. I chose to use the character Kerrigan as the reference character. A positive alpha represents a better causal win rate relative to Kerrigan, a null alpha represents an equal causal win rate relative to Kerrigan, and a negative alpha represents a worse causal win rate relative to Kerrigan. These relative comparisons may seem like a disadvantage, but I chose Kerrigan specifically as she was the median character with respect to causal win rate (so I ran the program once, using some other character, and Kerrigan was the median in that analysis). Truthfully, it doesn't really matter who is the reference character, the ordering between characters is agnostic to the reference character. However, I choose the median character because they create a natural division of who is over-powered and who is under-powered.

Dataset

The dataset we are going to use is from user-submitted games to hotslogs.com. The time period is 4/26/2017 to 5/9/2017 which is a period inbetween two balance patches. I filtered down to Hero League and Team League games, and sampled 50k games for analysis.

Results

After we fit the regression, we have an estimated value of alpha for each character. Importantly, we also have error estimates for each alpha that we use to form confidence intervals for each alpha. If this confidence interval contains 0, we can conclude that the character is either balanced, or unbalanced too little to detect. Here are the results (clickthrough for a larger image):

The characters Li Li, Lucio, and Probius are the strongest. On the other hand, Medivh, Tassadar and E.T.C. are the weakest. For some evidence that our analysis is not absolutely wrong: 5 of the top 6 characters above were nerfed in the May 10th, 2017 balance patch.

How does our causal win rate estimates compare to the naive win rate?(Clickthrough for a larger image)

The two metrics are positively correlated, which offers evidence that the naive win rate isn't biased too much by the confounders. The characters in the upper-left and lower-right quadrants are characters that the community is probably thinking about wrong. For example, Lt. Morales has a low naive win rate, and the community might suggest she get a buff. However, she likely is over performing in this dataset. Another very interesting thing to point out is that all the support characters are "shifted" about 0.2 units to the right. Historically this period was when the double-support meta (two or more support characters per team) was gaining popularity, perhaps because Blizzard kept supports artificially over powered as the above graph suggests. Note that this support character advantage is hidden if we only looked at naive win rates. This support advantage is not a artefact of the analysis either, as when I did the same analysis for a more recent dataset, the support shift wasn't present any more.

More recent data, and the effect of balance patches

I managed to get a more recent data (as of the time of writing) from July 2018, spanning a balance patch. So we can observe what happens to heroes' win rates as they were buffed or nerfed on the July 25th balance patch. First, some of the nerfed characters:

Raynor was severely nerfed in this balance patch, and this is reflected in his pre and post win rates.

Deckard Cain and Yrel were nerfed.

Cho, famously, lost his unstoppable on Surging Dash. This significantly lowered his win rates. This is much more clear when we look at his causal win rate and not his naive win rate.

And buffs: Azmodan and Kael'Thas saw straight buffs, however Hanzo saw some buffs and some nerfs, and his causal win rate stayed about the same.

Conclusion

There are a lot of critiques one could do about this analysis:

I wasn't able to model a player's MMR, and instead used a team's average MMR, which will be less accurate. Consider this a form of measurement error.

Excluding effect modification is a serious flaw. Modelling effect modification would not only give us better estimates, but also aid in finding synergies and anti-synergies in team compositions & maps.

My model assumed a linear relationship between all variables and the outcome. Though I did include some quadratic terms for the average MMR of teams, there is likely still lots of model misspecification bias occurring.

What I hope from this article is that Heroes of the Storm players, and players of similar games, rethink how the naive win rate can be biased. However, one thing that I'd like to think about more is how biased are we? We saw that there was a high correlation between the naive and the causal win rate - does this imply we can use the naive win rate in most circumstances? I'm not sure, and it's something I want to explore. An approach is to use a causal inference model with less assumptions, and including effect modification, to see if we still see a high correlation between the two rates.

Sketch code is available on my Github here.