This article is based on my Strata NY’18 Conference talk on Circuit Breakers to safeguard against garbage in, garbage out. Check out the talk video recording for details.

Imagine a business metric showing a sudden spike: is the spike real, or is it a data quality problem? Analysts and data engineers today spend hours, days, and even weeks analyzing whether a given metric is correct! In other words, Time-to-Reliable-Insights today is unbounded, and is a widespread pain-point across the industry. At Intuit, we are working on addressing the data quality problem at scale, and recently presented our platform (called QuickData SuperGlue) at the Strata Conference in New York, 2018. This blog summarizes the key details from the talk.

Analogous to the circuit breaker pattern in micro-services architectures, we are designing circuit breakers for data pipelines. In the presence of data quality issues, the circuit opens, preventing low quality data from propagating to downstream processes. The result is that data will be missing from reports for the time-periods of low quality, but if present, it is guaranteed to be correct. This proactive approach bounds Time-to-Reliable-Insights to minutes by automating data availability to be directly proportional to data quality. It also eliminates the unsustainable fire-fighting required to verify and fix metrics/reports on a case-by-case basis. The rest of the blog describes the details of implementing and deploying circuit breakers, and is divided into three sections:

Data Pipeline Ground Realities

Circuit Breaker Pattern for Data Pipelines

Implementing Circuit Breakers in Production

Data Pipeline Ground Realities

A data pipeline is a logical abstraction representing the sequence of data transformations required for converting raw data into insights. In our data platform, we have thousands of data pipelines running daily. Each pipeline ingests data from different sources, and applies a sequence of ETL and analytical queries to generate insights in the form of reports, dashboards, ML models, and output tables. These insights are used both for data-driven business operations and for in-product customer experiences.

We ingest 4 types of data collected across 100s of relational DBs, as well as NoSQL stores (Key-Value, Document):

User Entered Data (UED): Data entered by customers while using the products

Behavioral Analytics Data: Clickstream data capturing usage of the product

Enterprise data: Back-office systems for customer care, billing, etc.

Third party data: Includes social feeds, lender files, bank data, etc.

Data from these sources is ingested into a Lake (HDFS/S3) using batch ingestion frameworks such as Kafka, Sqoop, Oracle GoldenGate, and a dozen home-grown tools. Data in the Lake is then analyzed using multiple query engines (Hive, Spark) as well as moved to MPP warehouses such as Vertica. Results are made available via serving databases such as Cassandra and Druid. To show the complexity of a real-world data pipeline, below is the logical view of a pipeline generated using QuickData SuperGlue:

Within a data pipeline, data quality issues are introduced at different stages. We categorize the issues into three buckets: a) source-related issues; b) ingestion-related issues; c) referential integrity issues. For each bucket, the following are the most common issues we have experienced. The root causes of these issues are a combination of operational errors, incorrect logic, lack of change management, and data model inconsistencies.

Circuit Breaker Pattern for Data Pipelines

Electric circuit breakers were invented to proactively manage electrical surges that would otherwise overload circuits and cause house fires. By trading off the availability of electricity, a circuit breaker prevents potential fires. The circuit breaker pattern is also popular in micro-services architectures: instead of having the API wait for a slow or saturated micro-service, the circuit breaker proactively skips calling the service. The end result is a predictable API response time, with the trade-off that certain services may not be available. When the micro-service issue is resolved, the breaker is closed and the service becomes available again.

Circuit breakers for data pipelines follow a similar pattern. The quality of the data is proactively analyzed; if it falls below a threshold, instead of letting the pipeline jobs continue and mix high and low quality data, the circuit is opened, preventing downstream processing of the low quality data batch. There is an implicit guarantee that the available insights will always be reliable, i.e., if the data was of low quality, it will be missing. This is an easy-to-understand contract across data engineers, analysts, data scientists, and other consumers of insights, and it replaces the need to reactively verify results on a case-by-case basis.

Data ingested in the Lake is persisted in hourly or daily batches in a staging area. Each batch is analyzed for data quality; when an issue is detected, the circuit is opened, preventing downstream processing of that batch. When the circuit is open, teams are alerted to diagnose the issue. Once the issue is resolved, the batch is backfilled and made available for downstream processing.

In the data pipeline circuit breaker pattern, there are two states:

Circuit Closed: Data is flowing through the pipeline.

Circuit Open: Data is not flowing, i.e., an issue has been discovered, so downstream data is not available.

In both the open and closed states, the quality of data partitions (hourly/daily) is continuously checked. When a data batch meets the quality threshold, the circuit moves from the open to the closed state. Conversely, when data quality checks fail, the circuit moves from closed to open. Data quality issues are distinguished into Hard and Soft Alerts. In contrast to Hard Alerts, Soft Alerts do not change the state of the circuit, but show a warning along with the insights.
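The two states and the Hard/Soft Alert transitions can be sketched as a small state machine. This is a minimal illustration of the pattern, not the SuperGlue implementation; the class and method names are hypothetical:

```python
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"   # data flowing through the pipeline
    OPEN = "open"       # downstream processing blocked

class AlertLevel(Enum):
    NONE = 0
    SOFT = 1   # warn, but keep the circuit in its current state
    HARD = 2   # open the circuit

class DataPipelineBreaker:
    """Minimal sketch of the two-state breaker described above."""

    def __init__(self):
        self.state = CircuitState.CLOSED
        self.warnings = []

    def on_batch_checked(self, batch_id, alert):
        if alert == AlertLevel.HARD:
            self.state = CircuitState.OPEN      # block downstream jobs
        elif alert == AlertLevel.SOFT:
            self.warnings.append(batch_id)      # surface warning with insights
        else:
            self.state = CircuitState.CLOSED    # quality check passed: close circuit
        return self.state
```

A Hard Alert opens the circuit and blocks downstream jobs; a Soft Alert only records a warning; and a batch that passes the quality check closes the circuit again.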

Implementing Circuit Breakers in Production

Implementing circuit breakers requires three core functions:

Tracking data lineage: Finds all tables and jobs involved in the transformation from source tables to output tables, reports, and ML models.

Profiling the data pipeline: Tracks events, stats, and anomalies associated with the data pipeline. Profiling is divided into operational and data-level, as described later.

Controlling the circuit breaker: Triggers the circuit based on the issues discovered by profiling.

Tracking data lineage is accomplished by analyzing the queries associated with the pipeline jobs. Specifically, a pipeline is composed of jobs; each job is composed of one or more scripts; and each script consists of one or more statements, mostly SQL with some Pig. Each query is analyzed for its input and output tables, and the jobs are glued together, with the output of one job becoming the input of the next. The lineage of a pipeline is defined as an array of triplets <Job Name, Input Tables, Output Tables>. This is not a one-time activity; lineage continuously evolves as the pipeline changes.
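As an illustration of extracting a lineage triplet from a single SQL statement, here is a naive regex-based sketch. A production system would use a proper SQL/Pig parser; the regexes below only handle simple INSERT/CREATE/FROM/JOIN forms and are illustrative:

```python
import re

# Naive patterns for output and input tables; a real system uses a SQL parser.
OUTPUT_RE = re.compile(
    r"\b(?:INSERT\s+(?:OVERWRITE\s+TABLE|INTO)|CREATE\s+TABLE)\s+([\w.]+)",
    re.IGNORECASE)
INPUT_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)

def lineage_triplet(job_name, sql):
    """Return the <Job Name, Input Tables, Output Tables> triplet for one statement."""
    outputs = set(OUTPUT_RE.findall(sql))
    inputs = set(INPUT_RE.findall(sql)) - outputs
    return (job_name, sorted(inputs), sorted(outputs))
```

Running this over every statement of every script, and chaining outputs to inputs, yields the array of triplets that defines the pipeline's lineage.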

Profiling is divided into two buckets:

Operational Profiling: The focus is on job health and Data Fabric health. Job health involves tracking execution-related stats such as start time, completion time, etc. Data Fabric health focuses on tracking events and stats from system components, namely source databases, ingestion tools, scheduling frameworks, analytical engines, serving databases, and publishing frameworks (such as Tableau, QlikView, SageMaker, etc.)

Data Profiling: The focus in this bucket is on analyzing data-related patterns. This is a fairly broad topic, divided into three sub-buckets, and a topic for a future blog.

Controlling the circuit breaker is based on the issues discovered through profiling. Issues are detected using either absolute thresholds/rules or relative checks based on anomaly detection/ML models.
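A minimal sketch of this control logic might combine an absolute rule with a relative check on a single batch; the thresholds and function names below are hypothetical examples, not our production values:

```python
def batch_quality_alert(row_count, null_rate, history_counts,
                        max_null_rate=0.01, max_drop=0.5):
    """Illustrative control logic: absolute rule plus relative check.

    Hard alert if the null rate breaches an absolute threshold; soft alert
    if the row count drops sharply versus the recent average.
    """
    if null_rate > max_null_rate:
        return "hard"                     # opens the circuit
    avg = sum(history_counts) / len(history_counts)
    if row_count < (1 - max_drop) * avg:
        return "soft"                     # warn, circuit stays closed
    return "ok"
```

The returned alert level then drives the state transition: a hard alert opens the circuit for that batch, while a soft alert lets the batch through with a warning.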

As discussed earlier, detected issues are categorized into Hard Alerts and Soft Alerts; while Hard Alerts open the circuit, Soft Alerts keep the circuit closed but generate a warning. Some examples of Hard and Soft Alerts for Operational and Data Profiling are shown below:

To summarize, today there is no well-defined way to manage data quality issues detected in data pipelines. Circuit Breakers for Data Pipelines is a pattern that makes data availability proportional to data quality. By defining how Hard and Soft Alerts control the circuit, it allows changing the proportionality slope for critical tables that feed hundreds of downstream tables.

There is a rockstar team driving this effort, which started the project as an engineering-days hackathon! Come join us!

For related articles, check out my blog series on the Changing Data Platform Landscape.