On January 19th we released a comprehensive analysis of the games data on the Metacritic website. To shed some light on the findings reported in that paper, I have written this annotated blog post; it will save you from having to read all 30 pages.

Introduction

According to its Wikipedia page, the Metacritic website has a number of important flaws:

Metacritic:

• Converts each review into a percentage the site decides for itself

• Manually assesses the tone of reviews that have no explicit score and then assigns a quantitative score

• Refuses to reveal what weights are applied to which publications

Now, assuming the above is correct, the data shown at Metacritic would be highly biased and far from the truth. Statistically, you cannot assign scores at your own leisure, based on subjective perception or unvalidated methods, and then call the result a sophisticated meta-analysis. It is therefore time someone took a closer, descriptive look at the Metacritic data, to explore it and see what comes up.

Objectives

The objective of my analysis was to visualize the data, compare Critic scores with User scores both over time and cross-sectionally, look for patterns, and then try to explain some of those patterns. My primary hypothesis was that Critic metascores and User metascores differ significantly, based on the idea that users rate games differently than the “professional reviewers”, especially considering that Metacritic’s scoring method is suggestive of statistical flaws. As secondary objectives I looked at potential confounders of scores, such as time, specific critics (or reviewers) and other variables, and compared game release dates with MobyGames data. Finally, some post-hoc exploratory analyses were warranted as the interpretation of the data progressed.

Methods

Between 25 August and 10 September 2013, a custom-made tool called Slurp! extracted the following data from the Metacritic website. For each game it collected the Title, Publisher, Developer, Release Date, Platform, Critic Metascore, Number of Critics, User Metascore, Number of Users, Individual Critic Reviews (including critic name and, where present, links to the original reviews), and Individual User Reviews (including user name and review text). These data were then fed into IBM SPSS Statistics 20.0 for further analysis.
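The Slurp! tool itself is not described here in any detail. Purely as an illustration of the kind of field extraction such a scraper needs, here is a minimal Python sketch; the HTML snippet, class names, and field names are made up and do not reflect Metacritic’s actual markup:

```python
import re

# Hypothetical snippet of a game page; the real Metacritic markup differs.
SAMPLE_HTML = """
<div class="game">
  <h1 class="title">Example Game</h1>
  <span class="metascore">91</span>
  <span class="userscore">7.4</span>
  <span class="critic_count">42</span>
</div>
"""

def extract_field(html, css_class):
    """Pull the text of the first element with the given class (naive regex)."""
    match = re.search(r'class="%s">([^<]+)<' % css_class, html)
    return match.group(1).strip() if match else None

def scrape_game(html):
    """Collect a subset of the fields Slurp! gathered for each game."""
    return {
        "title": extract_field(html, "title"),
        "critic_metascore": int(extract_field(html, "metascore")),
        "user_metascore": float(extract_field(html, "userscore")),
        "n_critics": int(extract_field(html, "critic_count")),
    }

record = scrape_game(SAMPLE_HTML)
print(record["title"], record["critic_metascore"])
```

A real scraper would fetch pages over HTTP, respect rate limits, and use a proper HTML parser rather than regexes.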

Results

The total number of games listed at Metacritic was 136,496 at the time of database completion. Game review scores from 14 different platforms were collected (Table 1).

The table shows the median Critic and User scores, as well as the 5%–95% CI values, and the number of games listed at Metacritic per platform versus those with an actual score registered. Strikingly, the website listed 112,243 games for iOS, but had Critic scores for only 2,088 of those (2%) and User scores for only 413 (~0%!). Overall, the website is doing poorly: it lists a fair number of games but adds scores for only a few of them, often for less than 50% of the listed titles. The website does, however, have a strong bias towards Xbox and PlayStation (1 and 2), collecting Critic review scores for as many as 97% of all games listed for the Xbox One, for example, while adding Critic scores for less than 33% of PC games. If I were Steam, I would stop using Metacritic to “aid” my customers in deciding whether or not to buy a game; there is no rational argument to use Metacritic any longer. And not just for this reason, as will become clear later.
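Per-platform summaries of this kind (median, 5%–95% range, and score coverage) are straightforward to compute. A minimal sketch with invented toy records, using the nearest-rank percentile convention (the paper’s exact convention is not stated here):

```python
from statistics import median

# Toy records standing in for the scraped dataset; a score of None means
# the game is listed but has no Critic score registered.
games = [
    {"platform": "PC",  "critic": 85}, {"platform": "PC",  "critic": None},
    {"platform": "PC",  "critic": 70}, {"platform": "iOS", "critic": None},
    {"platform": "iOS", "critic": 90}, {"platform": "iOS", "critic": None},
]

def percentile(values, p):
    """Nearest-rank percentile on a sorted copy (one of several conventions)."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, round(p / 100 * (len(s) - 1))))
    return s[k]

def platform_summary(records, platform):
    """Median, 5%-95% range, and scoring coverage for one platform."""
    scores = [g["critic"] for g in records
              if g["platform"] == platform and g["critic"] is not None]
    listed = sum(1 for g in records if g["platform"] == platform)
    return {
        "median": median(scores),
        "p5": percentile(scores, 5),
        "p95": percentile(scores, 95),
        "coverage": len(scores) / listed,  # fraction of listed games scored
    }

print(platform_summary(games, "PC"))
```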

Critic scores and User scores differ substantially and significantly. As Figure 1 above makes apparent, there is no agreement at all between the “professional” Critics and the (end) users of games. For 11 of the 14 platforms there is a statistically significant difference (note the asterisks) between the scores given to games by critics and by users. Overall, User and Critic scores differ significantly from each other (standardized test statistic: 14.623, P < 0.0001).
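This post does not name the exact test behind that statistic. Purely to illustrate the kind of paired comparison involved, here is an exact two-sided sign test in plain Python, run on invented toy pairs (user scores rescaled to the 0–100 critic scale):

```python
from math import comb

def sign_test(pairs):
    """Exact two-sided sign test on (critic, user) score pairs.

    A crude stand-in for the test used in the paper; it asks whether
    critics score higher than users more often than chance would allow.
    """
    diffs = [c - u for c, u in pairs if c != u]  # drop ties
    n = len(diffs)
    k = sum(1 for d in diffs if d > 0)           # pairs where the critic is higher
    # exact two-sided binomial p-value under the null p = 0.5
    tail = sum(comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Toy critic-vs-user metascore pairs:
pairs = [(98, 80), (88, 38), (80, 16), (91, 74), (85, 90), (95, 70)]
print(sign_test(pairs))  # prints 0.21875
```

With only six toy pairs the result is not significant; the paper’s result rests on thousands of games.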

The median Critic and User scores decline with increasing release year of the games. Figure 3 above illustrates this nicely: it shows the release year of each game versus the median Critic and User scores for those games. Games released between 1994 and 1999 were rated at about 87%, but ratings then dropped radically to ~70% for games released in 2000 (looking at the median Critic scores). Please check the paper for more details on this.

Interestingly, Metacritic is – as said before – biased towards collecting critic reviews for console games. Table 2 above shows the number of games listed per platform, along with the median number of Critic and User reviews per game. For example, the 2367 Xbox 360 games listed have a median of 24 Critic reviews per game. Compare this to the 9568 PC games, for which only a median of 9 Critic reviews per game were collected. Surely there are many more sources of PC game criticism?

Of note, the vast amount of iOS crap Metacritic included in its database heavily skews any general statistical interpretation. Figure 6 above, which shows the number of games listed at the website by year of release, illustrates what I mean. From 2009 onwards there is a drastic increase in the number of games registered, attributable largely to a tsunami of crap iOS apps for which Metacritic doesn’t bother to register any Critic scores, and about which users don’t bother to log their views on the website. I would advise Metacritic to delete all iOS entries. Nobody cares.

The “popularity” of Metacritic influences the number of User scores per game by release year. From Figure 7 above, it is apparent that with each increase in release year, more users take the time to go to the site and log their scores. On the other hand, there is a steep decline in the number of Critic reviews collected by Metacritic for the more recent games (from 2006 onwards). Details on how this can be are in the paper. Of note, the median number of Critic reviews for PC games is also declining, from 18 for games released in 1999 to only 4 for games released in 2013. Another reason for Steam to quit spamming us with useless Metacritic scores.

If we look at the table listing the top 20 games by median Critic score, there are already signs of striking overrating by the professionals. Take a look at part of this table (complete in the paper) above. The top 20 is dominated by games released on consoles. Grand Theft Auto IV for the PS3 takes the lead: the 64 listed Critics think the game is worth a score of 98 out of 100, but the users (read: the end-gamers) disagree and rate it 80. No surprise, as GTA games are heavily hyped (marketing) and, for a number of end-gamers, don’t deliver the quality promised; disgruntled as they are, those gamers go to Metacritic to give a more “balanced” score. Yet those were 1870 users versus 64 Metacritic-chosen Critics. The power – and the truth – lies in the numbers, people!

Assuming that higher-quality games attract more buyers, and thus more users providing a review in the Metacritic database, I chose the arbitrary threshold of 1000 user scores to select an elite set of games for a subanalysis. It turned out there were only 123 games listed at Metacritic at the time of analysis for which at least 1000 users had left a score. For each of those 123 games, ordered by median User score (high to low), you can see the accompanying median Critic score in Figure 10 above. There are a great number of games listed at Metacritic where the users’ opinion of a game differs utterly from the critics’. While the median Critic score for these 123 games camps around the 90% mark, the median User score comes in below 80% for half of the 123 games, dropping to disastrous scores below 50% for a fifth of them. There is a strong indication of serious overrating by the professionals on the one hand, and of game bashing by disgruntled users on the other. Regardless, it renders the Metacritic website a useless source.
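The selection and ordering step is simple. A sketch with invented toy records (the real dataset had 123 qualifying games):

```python
from statistics import median

# Toy game records; "n_users" is the number of user scores logged.
games = [
    {"title": "A", "critic": 98, "user": 80, "n_users": 1870},
    {"title": "B", "critic": 88, "user": 38, "n_users": 8362},
    {"title": "C", "critic": 90, "user": 85, "n_users": 300},
    {"title": "D", "critic": 80, "user": 16, "n_users": 4922},
]

THRESHOLD = 1000  # the arbitrary cut-off for the "elite" subset

# Keep only games with enough user scores, ordered by user score high-to-low.
elite = [g for g in games if g["n_users"] >= THRESHOLD]
elite.sort(key=lambda g: g["user"], reverse=True)

print([g["title"] for g in elite])
print(median(g["critic"] for g in elite), median(g["user"] for g in elite))
```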

Going for a little more quality: the observed huge gaps between Critic scores and end-gamer scores seem to be associated with specific Publishers and Critics. In part of Table 4 above, you can see the games where the gap between what the Critics thought and what the Metacritic users thought is largest. Company of Heroes 2 (80% in the eyes of 80 Critics, 16%! in the eyes of 4922! users) and Diablo III (88% in the eyes of 86 Critics, 38%! in the eyes of 8362! users) are but two of the games that seem to be hyped by the “Critics” on the one hand and thrashed by the end-gamers on the other. This is at least suggestive of false scoring on the Critics’ side, as well as a strongly hyped counter-reaction by disgruntled end-gamers. Regardless, it renders Metacritic useless as a source for deciding which game to buy. There are more details on this in the paper.
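Ranking games by their Critic-minus-User gap, as in that table, is a one-liner. A sketch using the three games mentioned, with user scores rescaled to 0–100:

```python
# The three games discussed above, user scores rescaled to 0-100.
games = [
    {"title": "Company of Heroes 2", "critic": 80, "user": 16},
    {"title": "Diablo III",          "critic": 88, "user": 38},
    {"title": "Grand Theft Auto IV", "critic": 98, "user": 80},
]

# Rank games by how far the critics' view exceeds the users' view.
by_gap = sorted(games, key=lambda g: g["critic"] - g["user"], reverse=True)
for g in by_gap:
    print(g["title"], g["critic"] - g["user"])
```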

Metacritic has a strong focus (er… strong bias) on game journalism from the United States. Indeed, going over the top 20 Critic sites in terms of reviews and scores listed at Metacritic, it is quite evident that the top US game review sites are in there; see Table 5 above. Naturally, this introduces such a huge bias (the US population is only about 4% of the world population) that we cannot really take Metacritic seriously. Even if we wanted to, we would have to make so many assumptions for the US gaming and game-critic population to be representative of the world that the attempt would fail, epically.

So there you have some of the details that are in the paper. I hope it will aid you in your gaming quest. For whatever it is.