By Ben Sykes

Ensuring a consistently great Netflix experience while continuously pushing innovative technology updates is no easy feat. How can we be confident that updates are not harming our users? And that we’re actually making measurable improvements when we intend to?

Using real-time logs from playback devices as a source of events, we derive measurements in order to understand and quantify how seamlessly users’ devices are handling browsing and playback.

Log to Metric Data Pipeline

Once we have these measures, we feed them into a database. Every measure is tagged with anonymized details about the kind of device being used, for example, whether the device is a Smart TV, an iPad or an Android Phone. This enables us to classify devices and view the data according to various aspects. This in turn allows us to isolate issues that may only affect a certain group, such as a version of the app, certain types of devices, or particular countries.

This aggregated data is available immediately for querying, either via dashboards or ad-hoc queries. The metrics are also continuously checked for alarm signals, such as if a new version is affecting playback or browsing for some users or devices. These checks are used to alert the responsible team which can address the issue as quickly as possible.

During software updates, we enable the new version for a subset of users and use these real-time metrics to compare how the new version is performing vs the previous version. Any regression in the metrics gives us a signal to abort the update and revert those users getting the new version back to the previous version.

With this data arriving at over 2 million events per second, getting it into a database that can be queried quickly is formidable. We need sufficient dimensionality for the data to be useful in isolating issues and as such we generate over 115 billion rows per day. At Netflix we leverage Apache Druid to help tackle this challenge at our scale.