Data Collection

There are primarily two data streams that are used to determine the trending videos:

Play events: Videos that are played by our members

Impression events: Videos seen by our members in their viewport

Netflix embraces a Service Oriented Architecture (SOA) composed of many small, fine-grained services that each do one thing and do it well. In that vein, the Viewing History service captures all the videos that are played by our members. Beacon is another service that captures all impression events and user activities within Netflix. The requirement to compute recommendations in real time presents us with an exciting challenge: making our data collection and processing pipeline a low-latency, highly scalable, and resilient system. We chose Kafka, a distributed messaging system, for our data pipeline because it has proven able to handle millions of events per second. All the data collected by the Viewing History and Beacon services is sent to Kafka.

Data Processing

We built a custom stream processor that consumes the play and impression events from Kafka and computes the following aggregated data:

Play popularity: How many times a video is played

Take rate: Ratio of play events to impression events for a given video
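As a rough illustration of these two aggregates, here is a minimal Python sketch. It is not the actual stream processor: the (event_type, video_id) tuple shape and the function name aggregate are assumptions made for this example, and it works on an in-memory batch rather than a live stream.

```python
from collections import Counter


def aggregate(events):
    """Compute play popularity and take rate per video.

    `events` is an iterable of (event_type, video_id) tuples, where
    event_type is "play" or "impression". This shape is hypothetical.
    """
    plays = Counter()
    impressions = Counter()
    for event_type, video_id in events:
        if event_type == "play":
            plays[video_id] += 1
        elif event_type == "impression":
            impressions[video_id] += 1

    stats = {}
    for video_id in plays.keys() | impressions.keys():
        shown = impressions[video_id]
        stats[video_id] = {
            "play_popularity": plays[video_id],
            # Take rate is undefined when a video has no impressions.
            "take_rate": plays[video_id] / shown if shown else None,
        }
    return stats
```

A video with four impressions and one play would get a play popularity of 1 and a take rate of 0.25.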

The first step in the data processing layer is to join the play and impression streams. We join them by request id, which is a unique identifier used to tie the front end calls to the backend service calls. With this join, all the play and impression events for a given request id are grouped together, as illustrated in this figure.
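The grouping that this join produces can be sketched as follows. This is a simplified batch version under stated assumptions: events are plain dictionaries with a request_id key, and the function name join_by_request_id is made up for illustration. A real streaming join would also need some time window to bound how long it waits for matching events, which is omitted here.

```python
from collections import defaultdict


def join_by_request_id(plays, impressions):
    """Group play and impression events that share a request id.

    Both inputs are iterables of dicts carrying a "request_id" key
    (a hypothetical event shape).
    """
    joined = defaultdict(lambda: {"plays": [], "impressions": []})
    for event in impressions:
        joined[event["request_id"]]["impressions"].append(event)
    for event in plays:
        joined[event["request_id"]]["plays"].append(event)
    return dict(joined)
```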

This joined stream is then partitioned by video id, so that all the play and impression events for a given video are processed by the same consumer instance. This way, each consumer can atomically calculate the total number of plays and impressions for every video. The aggregated play popularity and take rate data are persisted into Cassandra, as shown in this figure.
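To show why partitioning by video id keeps each video's counts on a single consumer, here is a small sketch of key-based routing. The CRC32 hash is a stand-in chosen for this example and is not necessarily what Kafka's partitioner uses; partition_for, route, and the event shape are likewise hypothetical.

```python
import zlib


def partition_for(video_id, num_partitions):
    """Map a video id to a partition index with a stable hash."""
    return zlib.crc32(video_id.encode("utf-8")) % num_partitions


def route(events, num_partitions):
    """Distribute events (dicts with a "video_id" key) across partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for event in events:
        index = partition_for(event["video_id"], num_partitions)
        partitions[index].append(event)
    return partitions
```

Because the partition depends only on the video id, every event for a given video lands in the same partition, so one consumer can keep that video's running totals without coordinating with the others.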

Real Time Data Monitoring

Given the importance of data quality to the recommendation system and the user experience, we continuously run canary analysis on the event streams. The checks range from simple validations, such as the presence of mandatory attributes within an event, to more complex ones, such as detecting the absence of an event within a time window. With appropriate alerting in place, this real time stream monitoring lets us catch data regressions within minutes of every UI push.
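The two kinds of validation mentioned above might look roughly like this. The list of required fields, the function names, and the use of numeric timestamps in seconds are all assumptions for the sake of the example, not the actual checks we run.

```python
# Hypothetical set of mandatory attributes for an event.
REQUIRED_FIELDS = ("request_id", "video_id", "timestamp")


def missing_fields(event):
    """Return the mandatory attributes that are absent or None."""
    return [name for name in REQUIRED_FIELDS if event.get(name) is None]


def silent_windows(timestamps, start, end, window_seconds):
    """Return the start times of windows in [start, end) with no events.

    Timestamps are numbers in seconds; windows are fixed-size and
    aligned to `start`.
    """
    seen = {
        int((t - start) // window_seconds)
        for t in timestamps
        if start <= t < end
    }
    # Ceiling division so a trailing partial window is also checked.
    num_windows = int(-(-(end - start) // window_seconds))
    return [
        start + i * window_seconds
        for i in range(num_windows)
        if i not in seen
    ]
```

An alert could then fire whenever missing_fields returns a non-empty list for an event, or when silent_windows reports a gap for an event type that is expected to arrive continuously.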

It is imperative that the Kafka consumers keep up with the incoming load into Kafka. Processing an event that is already minutes old will neither produce a real trending effect nor help us find data regressions quickly.
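One simple way to notice that consumers are falling behind is to look at how old events are by the time they are processed. The sketch below assumes each event carries a creation timestamp in seconds; the 60-second threshold and the function names are illustrative choices, not values from our system.

```python
def is_stale(event_timestamp, now, max_age_seconds=60.0):
    """True if the event is older than the allowed processing delay."""
    return (now - event_timestamp) > max_age_seconds


def stale_fraction(event_timestamps, now, max_age_seconds=60.0):
    """Fraction of events in a batch that arrived too late to be useful."""
    timestamps = list(event_timestamps)
    if not timestamps:
        return 0.0
    stale = sum(1 for t in timestamps if is_stale(t, now, max_age_seconds))
    return stale / len(timestamps)
```

A rising stale fraction suggests the consumers need more capacity or that the incoming load has spiked.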

Bringing it all together

On a live user request, the aggregated play popularity and take rate data, along with other explicit signals such as members’ viewing history and past ratings, are used to compute a personalized Trending Now row. The following figure shows the end-to-end infrastructure for building the Trending Now row.

Netflix has a data-driven culture that is key to our success. With billions of member viewing events and tens of millions of categorical preferences, we have endless opportunities to improve our recommendations even further.

We are in the midst of replacing our custom stream processor with Spark Streaming. Stay tuned for an upcoming tech blog on our resiliency testing on Spark Streaming.

If you would like to join us in tackling these kinds of challenges, we are hiring!
