Since the release of the Dreamcast and its modem adapter, game developers have been able to collect data about player behavior in the wild. Game analytics actually goes back further, to early online PC titles such as EverQuest, which was released in 1999. Game servers were necessary for authenticating users and populating game worlds, but they also provided the capability to record gameplay data.

Since 1999, the landscape for collecting and analyzing data has changed significantly. Rather than storing data locally via log files, modern systems can track activity and apply machine learning in near real-time. Here are the four stages of game analytics systems I’ve observed during my tenure in the game industry:

- Flat Files: Data is saved locally on game servers

- Databases: Data is staged in flat files and then loaded into a database

- Data Lakes: Data is stored in Hadoop/S3 and then loaded into a database

- Serverless: Managed services are used for storage and querying

Each step in this evolution supports the collection of larger data sets and reduces the latency from gathering data to performing analysis. In this post, I’ll introduce example systems from each of these eras, and discuss the pros and cons of each approach.

Game analytics really started gaining momentum around 2009. At Bioware, Georg Zoeller built a system for collecting game telemetry during development. He presented the system at GDC 2010:

Shortly after, Electronic Arts started collecting data from games post development, to track player behavior in the wild.

There was also growing academic interest in applying analysis to game telemetry. Researchers in this field, such as Ben Medler, proposed using game analytics to personalize experiences.

While there has been a general evolution of gameplay analytics pipelines over the past two decades, there’s no fixed timeline for the progression between the different eras. Some game teams are still using systems from the earlier eras, and those may be the best fit for their use cases. There are also a number of vendor solutions available for game analytics, but I won’t cover those in this post. I’m focusing on game teams that want to collect gameplay telemetry and own the data pipeline being used.

Flat File Era

Components in a pre-database Analytics Architecture

I got started in game analytics at Electronic Arts in 2010, before EA had an organization built around data. While many game companies were already collecting massive amounts of data about gameplay, most telemetry was stored in the form of log files or other flat file formats that were stored locally on the game servers. Nothing could be queried directly, and calculating basic metrics such as monthly active users (MAU) took substantial effort.

At Electronic Arts, a replay feature built into Madden NFL 11 provided an unexpected source of game telemetry. After every game, a summary in XML format was sent to a game server, listing each play called, the moves taken during the play, and the result of the down. This resulted in millions of files that could be analyzed to learn more about how players interacted with Madden football in the wild. During my internship at EA in fall 2010, I built a regression model to analyze which features were most influential in driving player retention.

The impact of win rates on player retention in Madden NFL 11 based on preferred game mode.

About a decade before I started my internship at EA, Sony Online Entertainment was already using game analytics, by collecting gameplay data via log files stored on servers. It wasn’t until a few years later that these data sets were used for analysis and modeling, but it was still one of the first examples of game analytics. Researchers including Dmitri Williams and Nick Yee published papers based on data analyzed from the EverQuest franchise.

Storing data locally is by far the easiest approach to take when collecting gameplay data. For example, I wrote a tutorial on using PHP to store data generated by Infinite Mario. But this approach does have significant drawbacks. Here’s an overview of the tradeoffs with the approach:

Pros

- Simple: save whatever data you want, in whatever format you want

Cons

- No fault tolerance

- Data is not stored in a central location

- Huge latency in data availability

- No standard tooling or ecosystem for analysis

Flat files can work fine if you only have a few servers, but it’s not really an analytics pipeline unless you move the files to a central location. At EA, I wrote a script to pull XML files from dozens of servers to a single server that parsed the files and stored the game events in a Postgres database. This meant that we could perform analysis on gameplay data for Madden, but the dataset was incomplete and had significant latency. It was a precursor to the next era of game analytics.
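The parse-and-load step of that kind of script can be sketched as follows. The XML attributes (`id`, `type`, `yards`) are hypothetical stand-ins for whatever a game summary actually contains, and sqlite3 is used in place of Postgres so the sketch stays self-contained:

```python
import sqlite3
import xml.etree.ElementTree as ET
from pathlib import Path

def load_game_summaries(xml_dir, db_path):
    """Parse XML game summaries pulled from game servers and load
    one row per play into a central database."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS plays
                    (game_id TEXT, play_type TEXT, yards INTEGER)""")
    for xml_file in Path(xml_dir).glob("*.xml"):
        root = ET.parse(xml_file).getroot()
        game_id = root.get("id")
        rows = [(game_id, play.get("type"), int(play.get("yards", 0)))
                for play in root.iter("play")]
        conn.executemany("INSERT INTO plays VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn
```

A cron job running something like this against a directory of pulled files is essentially all the "pipeline" amounted to in this era.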

Another approach used during this era was scraping web sites to collect gameplay data for analysis. During my graduate research, I scraped websites such as TeamLiquid and GosuGamers to build a collection of professional StarCraft replays. I then built a predictive model for identifying build orders. Other analytics projects during this era included scraping websites such as the WoW Armory, and more recently SteamSpy.
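The core of such a scraper is just extracting download links from listing pages. A minimal sketch with the standard library, using an inline HTML snippet in place of a fetched page (the page structure shown is illustrative, not TeamLiquid's actual markup):

```python
from html.parser import HTMLParser

class ReplayLinkParser(HTMLParser):
    """Collect hrefs that look like replay downloads from an HTML page."""
    def __init__(self):
        super().__init__()
        self.replay_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if href.endswith(".rep"):  # StarCraft replay file extension
                self.replay_links.append(href)

# In practice the page would be fetched with urllib.request.urlopen()
# and each collected link downloaded in a second pass.
page = ('<html><body><a href="/replays/game1.rep">game 1</a>'
        '<a href="/forum/thread42">discussion</a></body></html>')
parser = ReplayLinkParser()
parser.feed(page)
print(parser.replay_links)  # ['/replays/game1.rep']
```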

Database Era

Components in an ETL-based Analytics Architecture

The utility of collecting game telemetry in a central location became apparent around 2010, and many game companies started saving game telemetry in databases. A number of different approaches were used to get event data into a database for analysts to use.

While I was at Sony Online Entertainment, we had game servers save event files to a central file server every couple of minutes. The file server then ran an ETL process about once an hour that fast loaded these event files into our analytics database, which was Vertica at the time. This process had a reasonable latency, about one hour from a game client sending an event to the data being queryable in our analytics database. It also scaled to a large volume of data, but required using a fixed schema for event data.

When I was at Twitch, we used a similar process for one of our analytics databases. The main difference from the approach at SOE was that instead of having game servers scp files to a central location, we used Amazon Kinesis to stream events from servers to a staging area on S3. We then used an ETL process to fast load data into Redshift for analysis. Since then, Twitch has shifted to a data lake approach, in order to scale to a larger volume of data and to provide more options for querying the datasets.
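Producing events for a stream like this mostly comes down to serializing the payload and picking a partition key. A sketch of that packaging step, with a hypothetical stream name and event shape (the actual `put_record` call via boto3 is shown but commented out, since it requires AWS credentials):

```python
import json
import uuid

def make_kinesis_record(event, stream_name="gameplay-events"):
    """Package a gameplay event for a Kinesis stream. Partitioning by
    player keeps one player's events ordered within a single shard."""
    return {
        "StreamName": stream_name,
        "Data": json.dumps(event).encode("utf-8"),
        "PartitionKey": event.get("player_id", str(uuid.uuid4())),
    }

record = make_kinesis_record({"player_id": "p123", "event": "match_start"})
# boto3.client("kinesis").put_record(**record)  # the actual send, omitted here
```

Downstream, a consumer (or a managed delivery service) batches these records into files on S3, which is what the hourly ETL then fast loads into Redshift.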

The databases used at SOE and Twitch were immensely valuable for both of the companies, but we did run into challenges as we scaled the amount of data stored. As we collected more detailed information about gameplay, we could no longer keep complete event history in our tables and needed to truncate data older than a few months. This is fine if you can set up summary tables that maintain the most important details about these events, but it’s not an ideal situation.
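The rollup-then-truncate pattern looks roughly like this: aggregate raw events into a daily summary table, then delete the raw rows older than the retention window. A sketch using sqlite3 and a made-up event schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (player_id TEXT, event_date TEXT, event_name TEXT);
INSERT INTO events VALUES
  ('p1', '2013-01-01', 'login'),
  ('p2', '2013-01-01', 'login'),
  ('p1', '2013-01-02', 'login');

-- Roll raw events up into daily metrics before truncating them.
CREATE TABLE daily_summary AS
  SELECT event_date,
         COUNT(DISTINCT player_id) AS dau,
         COUNT(*) AS event_count
  FROM events
  GROUP BY event_date;

-- Raw events outside the retention window can now be dropped.
DELETE FROM events WHERE event_date < '2013-01-02';
""")
rows = conn.execute(
    "SELECT * FROM daily_summary ORDER BY event_date").fetchall()
print(rows)  # [('2013-01-01', 2, 2), ('2013-01-02', 1, 1)]
```

The obvious loss is that any question not anticipated by the summary tables can no longer be answered for the truncated period.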

One of the issues with this approach is that the staging server becomes a central point of failure. It’s also possible for bottlenecks to arise where one game sends way too many events, causing events to be dropped across all of the titles. Another issue is query performance as you scale up the number of analysts working with the database. A team of a few analysts working with a few months of gameplay data may work fine, but after collecting years of data and growing the number of analysts, query performance can be a significant problem, causing some queries to take hours to complete.

Pros

- All data is stored in one place and is queryable with SQL

- Good tooling available, such as Tableau and DataGrip

Cons

- It’s expensive to keep all data in a database like Vertica or Redshift

- Events need to have a fixed schema

- Truncating tables may be necessary

Another issue with using a database as the main interface for gameplay data is that machine learning tools such as Spark’s MLlib cannot be used effectively, since the relevant data needs to be unloaded from the database before it can be operated on. One way of overcoming this limitation is to store gameplay data in a format and storage layer that works well with Big Data tools, such as saving events as Parquet files on S3. This type of configuration became more popular in the next era, and gets around the limitation of needing to truncate tables while reducing the cost of keeping all data.
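When storing events as Parquet on S3, the object keys are typically date-partitioned so that query engines can prune whole partitions instead of scanning everything. A sketch of such a key layout (the bucket name and partition fields are hypothetical):

```python
import uuid
from datetime import datetime, timezone

def event_object_key(event_name, ts, bucket="my-telemetry-lake"):
    """Build a date-partitioned S3 key for a Parquet file so that
    engines like Spark can prune partitions by event name and day."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return (f"s3://{bucket}/events/event={event_name}/"
            f"dt={dt:%Y-%m-%d}/{uuid.uuid4().hex}.parquet")

key = event_object_key("match_start", 1357776000)  # hypothetical timestamp
```

A query filtered to one event and one day then only touches the objects under that `event=…/dt=…` prefix.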

Data Lake Era

Components in a Data Lake Analytics Architecture

The data storage pattern that was most common while I was working as a data scientist in the game industry was the data lake pattern. The general pattern is to store semi-structured data in a distributed database, and run ETL processes to extract the most relevant data to analytics databases. A number of different tools can be used for the distributed database: at Electronic Arts we used Hadoop, at Microsoft Studios we used Cosmos, and at Twitch we used S3.

This approach enables teams to scale to massive volumes of data, and provides additional fault tolerance. The main downside is that it introduces additional complexity, and can result in analysts having access to less data than if a traditional database approach was used, due to lack of tooling or access policies. Most analysts will interact with data in the same way in this model, using an analytics database populated from data lake ETLs.

One of the benefits of this approach is that it supports a variety of different event schemas, and you can change the attributes of an event without impacting the analytics database. Another advantage is that analytics teams can use tools such as Spark SQL to work with the data lake directly. However, most places I worked at restricted access to the data lake, eliminating many of the benefits of this model.

Pros

- Scales to massive amounts of data

- Supports flexible event schemas

- Expensive queries can be migrated to the data lake

Cons

- Significant operational overhead

- ETL processes may introduce significant latency

- Some data lakes lack mature tooling

The main drawback with the data lake approach is that usually a whole team is needed just to keep the system operational. This makes sense for large organizations, but may be overkill for smaller companies. One of the ways of taking advantage of using a data lake without the cost of operational overhead is by using managed services.

Serverless Era

Components in a managed Analytics Architecture (GCP)

In the current era, game analytics platforms incorporate a number of managed services, which enable teams to work with data in near real-time, scale up systems as necessary, and reduce the overhead of maintaining servers. I never experienced this era while I was working in the game industry, but saw signs of this transition happening. Riot Games is using Spark for ETL processes and machine learning, and needed to spin up infrastructure on demand. Some game teams are using elastic computing methods for game services, and it makes sense to utilize this approach for analytics as well.

After GDC 2018, I decided to try out building a sample pipeline. In my current job I’ve been using Google Cloud Platform, and it seems to have good tooling for setting up a managed data lake and query environment. The result was this tutorial, which uses DataFlow to build a scalable pipeline.
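The ingestion edge of a serverless pipeline is often just a small function that validates an incoming event before handing it to a managed service. A sketch in the style of a Cloud Function, with hypothetical required fields; the actual Pub/Sub or BigQuery client calls are omitted since they require GCP credentials:

```python
import json
import time

REQUIRED_FIELDS = {"game", "event_name", "player_id"}

def handle_event(raw_body):
    """Validate an incoming telemetry event and stamp a server-side time,
    returning the record a managed pipeline would receive downstream."""
    event = json.loads(raw_body)
    missing = REQUIRED_FIELDS - set(event)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    event["server_time"] = time.time()
    # In a real deployment this record would be published to Pub/Sub or
    # streamed into BigQuery; both client calls are omitted from this sketch.
    return event
```

Everything after this point (buffering, autoscaling, storage) is handled by the managed services, which is the point of the serverless approach.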

Pros

- The same benefits as using a data lake

- Autoscales based on storage and query needs

- Minimal operational overhead

Cons

- Managed services can be expensive

- Many services are platform specific and may not be portable

In my career I had the most success with the database era approach, since it provided the analytics team with access to all of the relevant data. However, it wasn’t a setup that would continue to scale, and most teams that I worked on have since moved to data lake environments. For a data lake environment to be successful, analyst teams need access to the underlying data, and mature tooling to support their processes. If I were to build a pipeline today, I would definitely start with a serverless approach.