by Daniel C. Weeks

In previous posts, we discussed how the Hadoop platform at Netflix leverages AWS’s S3 offering (read more here). In short, Netflix considers S3 the “source of truth” for all data warehousing. There are many attractive features that draw us to this service including: 99.999999999% durability, 99.99% availability, effectively infinite storage, versioning (data recovery), and ubiquitous access. In combination with AWS’s EMR, we can dynamically expand/shrink clusters, provision/decommission clusters based on need or availability of reserved capacity, perform live cluster swapping without interrupting processing, and explore new technologies all utilizing the same data warehouse. In order to provide the capabilities listed above, S3 makes one particular concession which is the focus of this discussion: consistency.

The consistency guarantees for S3 vary by region and operation (details here), but in general, any list or read operation is susceptible to inconsistent information depending on preceding operations. For basic data archival, consistency is not a concern. However, in a data centric computing environment where information flows through a complex workflow of computations and transformations, an eventually consistent model can cause problems ranging from insidious data loss to catastrophic job failure.

Over the past few years, sporadic inaccuracies presented which only after extensive investigation pointed to consistency as the culprit. With the looming concern of data inaccuracy and no way to identify the scope or impact, we invested some time exploring how to diagnose issues resulting from eventual consistency and methods to mitigate the impact. The result of this endeavor is a library that continues to evolve, but is currently in production here at Netflix: s3mper (latin: Always).

Netflix is pleased to announce that s3mper is now released as open source under the Apache License v2.0. We hope that the availability of this library will inspire constructive discussion focusing on how to better manage consistency at scale with the Hadoop stack across the many cloud offerings currently available.

How Inconsistency Impacts Processing

The Netflix ETL Process is predominantly Pig and Hive jobs scheduled through enterprise workflow software that resolves dependencies and manages task execution. To understand how eventual consistency affects processing, we can distill the process down to a simple example of two jobs where the results of one feed into another. If we take a look at Pig-1 from the diagram, it consists of two MapReduce jobs in a pipeline. The initial dataset is loaded from S3 due to the source location referencing an S3 path. All intermediate data is stored in HDFS since that is the default file system. Consistency is not a concern for these intermediate stages. However, the results from Pig-1 are stored directly back to S3 so the information is immediately available for any other job across all clusters to consume.



Pig-2 is activated based on the completion of Pig-1 and immediately lists the output directories of the previous task. If the S3 listing is incomplete when the second job starts, it will proceed with incomplete data. This is particularly problematic, as we stated earlier, because there is no indication that a problem occurred. The integrity of resulting data is entirely at the mercy of how consistent the S3 listing was when the second job started.



A variety of other scenarios may result in consistency issues, but inconsistent listing is our primary concern. If the input data is incomplete, there is no indication anything is wrong with the result. Obviously it is noticeable when the expected results vary significantly from long standing patterns or emit no data at all, but if only a small portion of input is missing the results will appear convincing. Data loss occurring at the beginning of a pipeline will have a cascading effect where the end product is wildly inaccurate. Due to the potential impact, it is essential to understand the risks and methods to mitigate loss of data integrity.

Approaches to Managing Consistency

The Impractical

When faced with eventual consistency, the most obvious (and naive) approach is to simply wait a set amount of time before a job starts with the expectation that data will show up. The problem is knowing how long “eventual” will last. Injecting an artificial delay is detrimental because it defers processing even if requisite data is available and still misses data if it fails to materialize in time. The result is a net loss for both processing time and confidence in the resulting data.

Staging Data

A more common approach to processing in the cloud is to load all necessary data into HDFS, complete all processing, and store the final results to S3 before terminating the cluster. This approach works well if processing is isolated to a single cluster and performed in batches. As we discussed earlier, having the ability to decouple the data from the computing resources provides flexibility that cannot be achieved within a single cluster. Persistent clusters also make this approach difficult. Data in S3 may far exceed the capacity of the HDFS cluster and tracking what data needs to be staged and when it expires is a particularly complex problem to solve.

Consistency through Convention

Conventions can be used to eliminate some cases of inconsistency. Read and list inconsistency resulting from overwriting the same location can result in data corruption in that a listing may include old versions of data with new therefore producing an amalgam of two incomplete datasets. Eliminating update inconsistency is achievable by imposing a convention where the same location is never overwritten. Here at Netflix, we encourage the use of a batching pattern, where results are written into partitioned batches and the Hive metastore only references the valid batches. This approach removes the possibility of inconsistency due to update or delete. For all AWS regions except US Standard that provide “read-after-write” consistency, this approach may be sufficient, but relies on strict adherence.

Secondary Index

S3 is designed with an eventually consistent index, which is understandable in context of the scale and the guarantees it provides. At smaller scale, it is possible to achieve consistency through use of a consistent, secondary index to catalog file metadata while backing the raw data on S3. This approach becomes more difficult to achieve as the scale increases, but as long as the secondary index can handle the request rate and still provide guaranteed consistency, it will suffice. There are costs to this approach. The probability of data loss and the complexity increases while performance degrades due to relying on two separate systems.

S3mper: A Hybrid Approach

S3mper is an experimental approach to tracking file metadata through use of a secondary index that provides consistent reads and writes. The intent is to identify when an S3 list operation returns inconsistent results and provide options to respond. We implemented s3mper using aspects to advise methods on the Hadoop FileSystem interface and track file metadata with DynamoDB as the secondary index. The reason we chose DynamoDB is that it provides capabilities similar to S3 (e.g. high availability, durability through replication), but also adds consistent operations and high performance.

What makes s3mper a hybrid approach is its use of the S3 listing for comparison and only maintaining a window of consistency. The “source of truth” is still S3, but with an additional layer of checking added. The window of consistency allows for falling back to the S3 listing without concern that the secondary index will fail and lose important information or risk consistency issues that arise from using tools outside the hadoop stack to modify data in S3.

The key features s3mper provides include (see here for more detailed design and options):

Recovery : When an inconsistent listing is identified, s3mper will optionally delay the listing and retry until consistency is achieved. This will delay a job only long enough for data to become available without unnecessarily impacting job performance.

: When an inconsistent listing is identified, s3mper will optionally delay the listing and retry until consistency is achieved. This will delay a job only long enough for data to become available without unnecessarily impacting job performance. Notification : If listing cannot be achieved, a notification is sent immediately and a determination can be made as to whether to kill the job or let it proceed with incomplete data.

: If listing cannot be achieved, a notification is sent immediately and a determination can be made as to whether to kill the job or let it proceed with incomplete data. Reporting : A variety of events are sent to track the number of recoveries, files missed, what jobs were affected, etc.

: A variety of events are sent to track the number of recoveries, files missed, what jobs were affected, etc. Configurability : Options are provided to control how long a job should wait, how frequently to recheck a listing, and whether to fail a job if the listing is inconsistent.

: Options are provided to control how long a job should wait, how frequently to recheck a listing, and whether to fail a job if the listing is inconsistent. Modularity : The implementations for the metastore and notifications can be overridden based on the environment and services at your disposal.

: The implementations for the metastore and notifications can be overridden based on the environment and services at your disposal. Administration: Utilities are provided for inspecting the metastore and resolving conflicts between the secondary index in DynamoDB and the S3 index.

S3mper is not intended to solve every possible case where inconsistency can occur. Deleting data from S3 outside of the hadoop stack will result in divergence of the secondary index and jobs being delayed unnecessarily. Directory support is also limited such that recursive listings are still prone to inconsistency, but since we currently derive all our data locations from a Hive metastore, this does not impact us. While this library is still in its infancy and does not support every case, using it in combination with the conventions discussed earlier will alleviate the concern for our workflow and allow for further investigation and development of new capabilities.

Performance in production

S3mper has been running in production at Netflix for a few months and the result is an interesting dataset with respect to consistency. For context, Netflix operates out of the US Standard region where we run tens of thousands of Pig, Hive, and Hadoop jobs across multiple clusters of varying size and process several hundreds of terabytes of data every day.

The number of listings is hard to estimate because any given job will perform several listings depending on the number of partitions processed, but s3mper is tracking every interaction Hadoop has with S3 across all clusters and datasets. At any given time, DynamoDB contains metadata on millions of files within our configured 24 hour sliding window of consistency. We keep track of metrics on how frequently s3mper recovers a listing (i.e. postpones a job until it receives a complete listing) and when the delay is exceeded resulting in a job executing with data acquired through an inconsistent listing.