It’s been a while since I last wrote something here. My last piece, titled “Lies, Damn lies, and #GamerGate statistics”, was published last October, but at that point it became too time-consuming to both keep up with current game industry events and publish analytical pieces about them.

Every once in a while I still felt the itch in my fingertips, sometimes even starting a rough draft before giving up midway. But 11 months into #GamerGate, I finally found something worth tearing apart: that Sexism in Halo 3 research everyone seems to be talking about recently.

Now, people before me have already analyzed the bad journalism surrounding the recent study. I would therefore like to focus on the bad science employed in the research itself - deconstructing it on a point by point basis.

1. Do your homework:

Right there in the data collection section, my eyes caught the following paragraph:

The player skill level is an objective indicator (determined by an undisclosed algorithm by the developer) of how good or bad that particular player is in the specific playlist selected.

But this is incorrect: the Halo 3 TrueSkill system has been described in detail by Microsoft Research and is actually patented (meaning it’s published for everyone to see). There’s even a FAQ which specifies how it affects actual matchmaking and lists some of its weaknesses, as well as an interactive rank calculator which translates the TrueSkill estimate to the 1–50 rank displayed in Halo 3 multiplayer matchmaking.

One of these weaknesses is that the system can’t distinguish between individual players who keep playing on the same team, effectively making it possible for one good player to pull along a bad player if they remain in the same party.

Why is this an important bit? Apart from demonstrating that the researchers are not all that familiar with the source material, it also means that any calculation based on the rank has to rule out the possibility of artificial rank inflation. The research does not consider that possibility.

2. In the zone:

Another relevant bit which is not taken into consideration in the research is the GamerZone setting used in Xbox Live. The researchers do mention opening three separate accounts, but they do not provide any data about the GamerZone chosen for these accounts or, more importantly, the GamerZones of the sampled players.

Why is this an important bit? Although there’s no confirmation of exactly how GamerZone affects actual matchmaking, it’s important to note that players can freely choose their GamerZone out of four categories: recreation, pro, family, underground.

This effectively means that if the male account was paired with more recreational/family players, it is likely to experience less negativity than a female account that was paired with more underground players (or vice versa). Please consider what the Xbox GamerZone description for underground players is:

The underground is where anything goes. You may be a underground gamer if you are the “human beat box” or “trash-talking chucklehead” (and proud of it)

Players registering as “underground” are exactly the kind of players I would expect more negative and hostile chatter from, by this definition alone. Not accounting for such an important variable (which is available to the research team as well as anyone else playing), and not verifying that the GamerZone distribution is similar among the players encountered by the male and female accounts, puts the validity of the entire research into question.

But how is skill determined in Halo? The TrueSkill rank itself is using a single metric: winning against the other players/teams. So while individual performance affects the end result, the final outcome of the match is the only thing which matters in the long term. This means that “kill steals”, individual Kill/Death ratio, medals earned etc. do not matter.

This doesn’t stop (some) people from fixating on these metrics though. We can therefore assume there are three occasions where your own teammates (the research only studies voice chatter arriving from the same team, not opponents trying to taunt you) will hit you with negative comments:

1. You negatively impacted the overall game result (by being bad at the game), resulting in the other team winning and them being angry at you.
2. You “negatively impacted” their own game performance or experience, by playing better than them (stealing their kills, taking the power weapons and utilizing them well, etc.) and perhaps by not being a team player. That’s regardless of the end result, which might still be a win.
3. They are just a bunch of assholes regardless of their skill level, game result or any other parameter.

Why is this an important bit? The research and all the headlines it generated pointed towards “loser” players being the most offensive and negative specifically towards better playing female teammates, and also indicated submissive behavior towards better playing male teammates. But as described above, the only actual metric in the game for “winning/losing” is the end game result: If they themselves lost, the entire team also lost — including the dummy male/female voiced player. You can’t be a bigger/smaller loser than your other teammate because you both lose the exact same game.

But as the research’s raw data clearly shows, the ratio of negative comments towards the female voiced player remains about the same regardless of the game’s end result. The ratio of “sexist” comments towards the female voiced player is actually higher when the games were won: 2 out of 30 games lost vs 9 out of 52 games won!

Anyone stating based on this data that it’s “the losers” who harass women more is therefore mistaken and/or misleading others: at least in the sense of what a win or a loss actually means in Halo 3, it’s not the losers that show inconsiderate behavior.
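To make the win/loss comparison concrete, here is a quick sketch of the arithmetic using only the four counts quoted above. The function name is my own, and nothing here comes from the researchers’ analysis code.

```python
def rate(flagged_games: int, total_games: int) -> float:
    """Share of games in which a "sexist" comment was recorded."""
    return flagged_games / total_games


# Counts quoted above: 2 of 30 lost games, 9 of 52 won games.
lost_rate = rate(2, 30)
won_rate = rate(9, 52)

print(f"lost: {lost_rate:.1%}, won: {won_rate:.1%}")
# lost: 6.7%, won: 17.3%
```

With counts this small the gap may well be noise, but it certainly does not point in the direction the headlines claimed.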

4. Killing Spree

Iterating on the previous point, one could instead argue that when discussing “losers” the article was not referring to option #1 (the actual win/lose concept) but rather to option #2: the individual performance (measured by accumulated kills, deaths or the K/D ratio) or the prior prestige of the player (measured by their TrueSkill rank).

Should this metric be absolute or relative to the player who’s at the receiving end of these comments? It seems that the research tried to account for both game performance and prestige status, yet for some reason the researchers decided to compare the number of (only) positive comments towards male and female players against both absolute skill level and skill level relative to the player.

Why is this important?

Here is the thing: throughout most of the game, you don’t have a direct indication of your teammate’s skill level. Yes, it does show up in the pre- and post-game lobby, but in the game itself it doesn’t just appear on screen.

Unless the research claims that players memorize your rank before choosing how to respond during gameplay, the relative skill level metric is a tad problematic compared to the absolute skill level. They should have stuck with absolute skill only.

5. Lies, Damn Lies… you get the idea

But I was jumping ahead here directly to interpreting results. Let’s take a step back and look at the Statistical Analyses section:

In the analyses segment the researchers specify what statistical tools they chose to analyze the results. I will not go into discussing the ups and downs of various statistical tests (or my gripes with their chosen method of lumping together multiple variables which I’d argue are not entirely independent), but one question which they do not answer is WHY they chose what they chose (a generalized linear model with a Poisson distribution) over anything else in the first place. It’s their job to justify the chosen model and they did not do it.

More importantly, the different analyses they ran are not even consistent in the parameters they test against, thus obscuring the actual results even more! They decided for some reason to compare positive comments against maximum skill level and skill level variance, yet negative comments were compared against number of deaths and number of kills. Why the hell did they do that? What’s the rationale for such an arbitrary separation? Only they have the answer.

6. Can we do one better?

On top of my concerns with the tests they did decide to carry out, my actual concern is with the simple one they didn’t: a linear regression model assuming a normal distribution, with the actual results plotted on a graph. Just as a thought experiment, I decided to run it here.

X-Axis

Since the original hypothesis was that players worse than female teammates will be more hostile towards them, I decided to compare the individual game performance as it contributes to the goal of the game: Comparative kill-death ratio.

This is because when playing Team Slayer matches in Halo, a player with 10 kills but 15 deaths actually contributed more towards the other team winning the match (a -5 spread). Conversely, a player with only 5 kills but no deaths contributed more towards their own team winning the match (a +5 spread). Following this logic, we can argue that a proper metric is the difference in K/D spread between the commenting player and the treated male/female player: a positive value means that the commenting player is “better” while a negative value means that the commenting player is “worse” (at least in this specific game session).

This is very neat because we can now assume a normal distribution with the average being roughly zero (i.e., an equal chance of having a better or worse player on your team based on the matchmaking system).
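As a minimal sketch of the X-axis metric described above (the function names are mine, not anything taken from the study), the spread and the comparative spread come down to two subtractions:

```python
from typing import Tuple


def spread(kills: int, deaths: int) -> int:
    """Kill/death spread for one player in one match (kills minus deaths)."""
    return kills - deaths


def comparative_spread(commenter: Tuple[int, int], treated: Tuple[int, int]) -> int:
    """Commenter's spread minus the treated player's spread.

    Positive: the commenter out-performed the treated player in this match.
    Negative: the commenter did worse.
    """
    return spread(*commenter) - spread(*treated)


# The two players from the example above, given as (kills, deaths).
print(spread(10, 15))                        # -5
print(spread(5, 0))                          # 5
print(comparative_spread((10, 15), (5, 0)))  # -10
```

In that last line the commenting player is the one with 10 kills and 15 deaths, so the -10 marks them as the “worse” player for that session.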

Y-Axis

On the other hand, I’d also argue that comparing against just positive or negative comments (which they did in the research) is a bit absurd. Here’s the rationale:

If a person made 5 negative comments towards me, does this mean he’s more hostile/negative than a person who made “only” 3 negative comments towards me? On the surface yes, but if we take a deeper look we might realize that the same person who made 5 negative comments also made 15 positive comments and 5 neutral comments, while the person who made 3 negative comments made only these comments and nothing else. This now looks like a case where the person who made more negative comments is simply more talkative inside the game.

This is why I decided instead to measure the actual positivity/negativity level of the comments (simply calculating the difference between the number of positive and negative comments). Again we can conclude here that a value of zero represents a “neutral” player (regardless of how talkative that person is, positive and negative comments cancel each other out). It eludes me why the researchers didn’t bother with plotting this extremely important metric directly.
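Here is a small sketch of that Y-axis value, applied to the two hypothetical speakers from the example above. The function name is my own, and neutral comments are deliberately left out of the calculation.

```python
def net_positivity(positive: int, negative: int) -> int:
    """Positive comments minus negative comments; zero means 'neutral'."""
    return positive - negative


# Speaker A: 15 positive, 5 negative (plus 5 neutral, which are ignored).
talkative = net_positivity(positive=15, negative=5)
# Speaker B: 3 negative comments and nothing else.
terse = net_positivity(positive=0, negative=3)

print(talkative, terse)  # 10 -3
```

Counting negatives alone would rank speaker A as the more hostile one (5 vs 3), while the net value ranks them the other way around.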

Plotting the graph

OK, so maybe the reason is not so elusive. Plotting these values against each other shows how random this whole thing is: there’s no point in even adding a trendline for either the female voice or the male voice because the R-squared value is almost zero for both, meaning no linear correlation between relative skill level and hostility whatsoever.

Why is this an important bit? This chart, using the same raw data from the researchers, demonstrates that the entire hypothesis is invalid. Remember, this is what they stated:

female-initiated disruption of a male hierarchy incites hostile behavior from poor performing males who stand to lose the most status

But the graph above, built from their own data, entirely contradicts their hypothesis. It shows that both the “social constructionist theory” and “evolutionary theory” fail to predict the behavior of the sampled population, even if we take every other (problematic) assumption in this research at face value.

Additionally, the research’s failure to show even a single chart of the plotted data, relying instead on prediction trendlines with an imperfect fit, leads casual observers to assume that there is a perfect correlation between the sampled data and the regression/estimation trendlines. This is misleading and potentially shady statistics, amplified by the media’s tendency to take images and graphs out of context to display results.
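For anyone who wants to repeat the R-squared check on the raw data, here is a rough sketch of the calculation for a simple one-variable linear regression, where R-squared equals the squared Pearson correlation. The numbers below are made-up toy values that only show the two extremes. They are NOT the study’s data, and the function name is my own.

```python
from typing import Sequence


def r_squared(xs: Sequence[float], ys: Sequence[float]) -> float:
    """R-squared of a least-squares line fitted to (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0  # no variation on one axis: a line explains nothing
    return (sxy * sxy) / (sxx * syy)


# Toy values only: x = comparative spread, y = net positivity.
x = [-2, -1, 0, 1, 2]
perfect_trend = [-4, -2, 0, 2, 4]   # y rises in lockstep with x
no_trend = [1, -1, 0, -1, 1]        # y is unrelated to x

print(r_squared(x, perfect_trend))  # 1.0
print(r_squared(x, no_trend))       # 0.0
```

A value near 1 is what the hypothesis would need to show, while a value near 0 means a trendline tells you nothing.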

7. You suck dick

Moving on to our next point, there is a difference between plain negativity and hostility/sexism. This is where the more complex problem lies, as in order to conclude what’s “sexism” we no longer rely on mathematical and statistical equations but rather on human interpretation (and to err is human):

Both coders looked for comments and questions that appeared to be directed toward the experimental player and classified them as positive, negative, or neutral.

But here’s the kicker:

We also explored whether the negative statements in the female manipulation could be considered hostile sexism [32]

So on top of classifying the comments, they’re using a psych evaluation to distinguish negative and sexist comments, but they only apply this to the female voiced player. The bottom line is that only a minuscule share of comments were found to be sexist towards female players (coming from 11 individuals representing 13% of the sample population), but there is no mention of sexist comments towards male players. Yet the extrapolated “gamer losers are sexist” agenda is what frames the entire research and the news articles that cover it.

We don’t have the raw audio transcript to check the validity of their methods, but from the samples they did provide we can already see that the female voiced player received comments like “Lasher dude you suck dick” (implying that the person saying that either thought they were actually talking to a dude, or they might actually be Hugo from Lost). The male voiced player, on the other hand, received the comment “You suck dick”.

Yet only the comment towards the female voiced player was considered sexist in the research analysis. Since both females and males are physically capable of the actual operation of sucking dick, exactly when does the action of sucking dick become sexist? Why even imply that only females were on the receiving end of sexist comments? That’s probably the worst kind of sexism, employed by the researchers themselves, right here.

8. Some additional minor points

Erroneous data?

Something I noticed while examining the games in the raw data .csv file was that for games F061709K and F061809I the treatment gender row classifies the voice used as male. The F, however, indicates to me that it’s a female voice (all other games starting with F are female, and games starting with M are male). Possibly an oversight in these two samples?
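This kind of consistency check is easy to automate. The sketch below rests on assumptions: the column names `game_id` and `treatment_gender` are hypothetical stand-ins for whatever the real file uses, and the sample rows are made up for illustration, apart from the two game IDs mentioned above.

```python
import csv
import io

# Assumed convention: the first letter of the game ID encodes the voice used.
EXPECTED_BY_PREFIX = {"F": "female", "M": "male"}


def find_mismatches(rows):
    """Return game IDs whose prefix disagrees with the recorded gender."""
    mismatches = []
    for row in rows:
        game_id = row["game_id"]
        expected = EXPECTED_BY_PREFIX.get(game_id[:1].upper())
        recorded = row["treatment_gender"].strip().lower()
        if expected is not None and recorded != expected:
            mismatches.append(game_id)
    return mismatches


# Illustrative rows only; hypothetical column names, placeholder IDs.
sample = io.StringIO(
    "game_id,treatment_gender\n"
    "F061709K,male\n"
    "F061809I,male\n"
    "F000000A,female\n"
    "M000000B,male\n"
)
rows = list(csv.DictReader(sample))
print(find_mismatches(rows))  # ['F061709K', 'F061809I']
```

To run it against the real file, swap the `io.StringIO` sample for an opened file handle and adjust the two column names.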

A/S/L?

One other point which I couldn’t understand is how the researchers determined that all the individuals recorded were indeed male players. Was it based on voice alone? Or actual confirmation from the player himself? Or perhaps they dived into each player profile and checked their avatar?

I, for one, can’t honestly say that I can distinguish every time if the person I’m talking to is a male or female player, especially if you account for some of the adolescents playing the game before reaching puberty. If the research made the assumption that every individual talking is automatically male without confirmation, then we have a very fundamental problem here.

Play to win

Another point which was never brought up in the study is whether the researchers were playing to win every time, or whether they purposely lost on some occasions in order to get enough coverage of both wins and losses. This is important because while the researchers mentioned that the pre-recorded messages were not meant to provoke, actions sometimes speak louder than words: if your teammates even suspect that you’re losing on purpose, there is a very good chance that they will be really pissed off at you, and that has a large impact on what they say and how offensive they’ll be. If there’s anything worse than harassers in a game, it’s griefers who ruin the fun for other players by fucking up the game on purpose.

9. Real life implications:

The sexualized environment combined with men being the overwhelming vocal majority suggests an environment not unlike many current work environments where women can represent as little as 10% of the professional work force (e.g., electrical engineering, [31]). This suggests that competitive online video games may represent a common phenomenon that women encounter.

This is probably the worst argument I’ve encountered in the paper. If they were just focusing on video games and making these claims, well, that would just be a cultural problem. But here they’re making unsubstantiated assumptions about how a virtual competitive game environment, where you exist in relative anonymity and meet people that you’ll likely never encounter again, has real-life consequences for actual work environments, where you interact with your peers in a professional manner and on a daily basis.

This brings up the question of whether this whole research paper even cares in the slightest about videogames, or is just being used as another agenda piece to be pushed onto policy makers who don’t understand these nuances, forcing their hand to heavily regulate industries because someone once said something mean in an online videogame. This has a wider impact than just videogames, and that’s where a line must be drawn.

10. Minority report

So, who’s to blame here? Am I saying that there isn’t any sexism in video games at all? That female players don’t experience harassment and their share of abuse?

Of course I’m not saying that. I’m only showing why the current research is not a good indicator that female players are experiencing considerably worse backlash than their male counterparts, and especially that such behavior is coming from the “loser” players.

So let me suggest a different theory then: one that relies on the Greater Internet Fuckwad Theory as coined by Penny Arcade:

What I’m suggesting is that Halo 3 matches are not all that different from the rest of the internet. There are bound to be some total fuckwads playing the game, there is a chance that they will confront you, and there’s a chance that they’ll try to see what makes you tick, so a female voice will automatically signal to them to start with targeted harassment because they consider women to be an easy target. It’s your choice whether you want to stoop down to their level and troll them (repeatedly and politely asking them to check their headset because you can only hear cracking noises is guaranteed to cause frustration), block and report them, or simply mute/ignore and not give them the satisfaction.

And there is a reason to be optimistic, because even this research clearly demonstrated the following point (against the picture that the media and academia try to paint):

Out of 82 speaking individuals encountered, female voiced player encountered only 11 individuals who showed sexist behavior. That’s like the 10% rock bottom of the population so it’s not all that surprising to see it there. Can we cut that down to 5%? 3%? It’s already a tiny minority of abusers and for obvious reasons it will never hit absolute zero even with heavy regulation put in place.

The data clearly shows that sexist behavior wasn’t dependent on skill level, which supports one of the possibilities I mentioned above: they are just a bunch of assholes regardless of their skill level, game result or any other parameter. All we have left to do is treat them as the minority that they are and not give them any satisfaction.

This concludes our analysis, so thanks to those who lasted this long (if you just skimmed to the bottom, make sure to read at least the last two points). And so, dear journalists, if any of you are looking for a good clickbait headline which relies directly on the original research data, feel free to quote this: