Introduction

I have a startling confession to make: sometimes I just want things to be simple. Sometimes I just want things to be easier. All this talk about “Big Data”, predictive modeling, machine learning, and all those associated bits and pieces that go along with data science can be a bit mentally exhausting. I want to take a step back and work with a smaller dataset, something simple, something easy, something everyone can relate to – after all, that’s what this blog started out being about.

A while back, someone posted on Slashdot that the folks over at Finder.com had put together data sets of the install size of every PS4 and Xbox One game released to date. Being a a console owner myself – I’m a PS4 guy, but no fanboy or hardcore gamer by any means – I thought this would be a fun and rather unique data set to play around with, one that would fit well within the category of ‘everyday analytics’. So let’s take a look shall we?

Background

Very little background required here – the dataset comprises the title, release date, release type (major or indie), console (PS4 or Xbox One), and size in GiB of all games released as of September 10th, 2015. For this post we will ignore the time-related dimension and look only at the quantity of interest: install size.

Analysis

Okay, if I gave this data to your average Excel jockey what’s the first thing they’d do? A high level summary of the data broken apart by categorical variables and summarized by quantitative? You got it!

We can see that far more PS4 games have been released than Xbox (462 vs. 336) and the relative proportions are reversed for the former platform versus the latter as release type goes. A small aside here on data visualization: it’s worth noting that the above is a good way to go for making a bar chart from a functional perspective. Since there are data labels and the y-axis metric is in the title, we can ditch the axis and maximize the data-ink ratio (well, data-pixel anyhow). I’ve also avoided using a stacked bar chart as interpretation of absolute values tends to suffer when not read from the same baseline. I’m okay with doing it for relative proportions though – as in the below, which further illustrates the difference in release type proportion between the two consoles:

Finally, how does the install size differ between the consoles and game types? If I’m an average analyst and just trying to get a grip on the data, I’d take an average to start:

We can see (unsurprisingly, if you know anything about console games) that major releases tend to be much larger in size than indie. Also in both cases, Xbox install sizes are larger on average: about 1.7x for indie titles and 1.25x for major. Okay, that’s interesting. But if you’re like me, you’ll be thinking about how 99% of the phenomena in the universe are distributed by a power law or have some kind of non-Gaussian based distribution, and so averages are actually not always such a great way to summarize data. Is this the case for our install size data set?

Yes, it is. We can see here in this combination histogram / cumulative PDF (in the form of a Pareto chart ) that the games follow a power law, with approximately 55 and 65 percent being < 5 GiB, for PS4 and Xbox games respectively But is this entirely due to the indie games having small sizes? Might the major releases be centered around some average or median size?

too affected by outliers. No, we can see that even when broken apart by type of release the power-law like distribution for install sizes persists. I compared the averages to medians found them to be still be decent representations of central tendency and notaffected by outliers. Finally we can look at the distribution of the install sizes by using another type of visualization suited for this task, the boxplot. While it is at least possible to jury-rig up a boxplot in Excel (see this excellent how-to over at Peltier Tech) Google Sheets doesn’t give us as much to work with, but I did my best (the data label is at the maximum point, and the value is the difference between the max and Q3):

The plots show that install sizes are generally greater for Xbox One vs. PS4, and that the difference (and skew) appears to be a bit more pronounced for indie games versus major releases, as we saw in the previous figures. Okay, that’s all very interesting, but what about games that are available for both consoles? Are the install sizes generally the same or do they differ? Difference in Install Size by Console Type

Because we’ve seen that the Xbox install sizes are generally larger than Playstation, here I take the PS4 size to be the baseline for games which appear on both (that is, differences are of the form XBOX Size – PS4 Size). Of the 618 unique titles in the whole data set (798 titles if you double count across platform), 179 (~29%) were available on both – so roughly only a third of games are released for both major consoles. Let’s take a look at the difference in install sizes – do those games which appear for both reflect what we saw earlier?

Yes, for both categories of game the majority are larger on Xbox than PS4 (none were the same size). Overall about 85% of the games were larger on Microsoft’s console (152/179). Okay, but how much larger? Are we talking twice as large? Five times larger? Because the size of the games varies widely (especially between the release types) I opted to go for percentages here:

Unsurprisingly, on average indie games tend to have larger differences proportionally, because they’re generally much smaller in size than major releases. We can see they are nearly twice as large on Xbox vs. PS4 while major releases about 1 and a quarter. When games are larger on PS4, there’s not as big a disparity, and the pattern across release types is the same (though keep in mind the number of games here is a lot smaller than for the former). Finally, just to ground this a bit more I thought I’d look at the top 10 games in each release type where the absolute differences are the largest. As I said before, here the difference is Xbox size minus PS4:

For major releases, the worst offender for being larger on PS4 is Batman: Arkham Night (~6.1 GiB difference) while on the opposite end, The Elder Scrolls Online has a ~19 GiB difference. Wow. For indies, we can see the absolute difference is a lot smaller for those games bigger on PS4, with Octodad having the largest difference of ~1.4 GiB (56% of its PS4 size). Warframe is 19.6 GiB bigger on Xbox than PS4, or 503% larger (!!) Finally, I’ve visualized all the data together for you so you can explore it yourself. Below is a bubble chart of the Xbox install size plotted against PS4, coloured by release type, where the size of each point represents the absolute value of the percentage difference between the platforms (with the PS4 size taken to be the baseline). So points above the diagonal are larger for Xbox than PS4, and points below the diagonal are larger for PS4 than Xbox. Also note that the scale is log-log. You can see that most of the major releases are pretty close to each other in size, as they nearly lie on the y=x line.

Conclusion

It’s been nice to get back into the swing of things and do a little simple data visualization, as well as play with a data set that falls into the ‘everyday analytics’ category.

And, as a result, we’ve learned:

XBox games generally tend to have larger install sizes than PS4 ones, even for the same title

Game install sizes follow a power law, just like most everything else in the universe (or maybe just 80% of it)

What the heck a GiB is

Until next time then, don’t fail to keep looking for the simple beauty in data all around you. References & Resources

Complete List of Xbox One Install Sizes:

Complete List of PlayStation 4 Install Sizes:

Compiled data set (Google Sheets):

Excel Box and Whisker Diagrams (Box Plots) @ Peltier Tech: