It goes without saying that you should be testing your data quality. I’ve written about this myself before, as have many others. You can’t make informed decisions if you don’t trust your data!

Cloud data warehouses enable us to manipulate data more easily than ever before. This is fantastic for the data analytics team that wants to move quickly, but it can lead to some problems. When we start to create hundreds or even thousands of datasets in our data warehouse, understanding which of them are accurate, up-to-date, and correct becomes challenging.

When we launched the Dataform open-source framework, we shipped it with the concept of data assertions: a way for you to write SQL queries that quickly detect issues with your production data and help you catch problems early. Over the past year we’ve seen more and more use cases for data quality tests, and learned how they are used in practice.

This is an advanced guide to data quality testing with Dataform for people who are currently using BigQuery, Redshift, Snowflake or Azure SQL Data Warehouse and SQL to manage their ELT data transformation processes.

In this article we’ll cover some of those use cases, some of the new features in Dataform that make testing data quality super easy, and lots of handy SQL snippets that should be useful regardless of whether or not you are using Dataform.

Testing basics

Dataform is an open-source data modeling framework. It allows you to manage all of your tables, views, dependencies, and pipelines with SQL, and to version control their definitions. Part of that framework includes data tests, and we have recently introduced a simpler way to get started writing these tests.

Here’s a simple example of a SQLX file in a Dataform project that defines a dataset that we’ll use as a base for the examples below, containing information and stats about our customers.

definitions/customer_stats.sqlx:

config {
  type: "table"
}

SELECT
  customer_id,
  email,
  attribution_source AS source,
  sessions
FROM
  customers
LEFT JOIN (
  SELECT
    customer_id,
    SUM(sessions) AS sessions
  FROM
    sessions
  GROUP BY
    customer_id)
USING
  (customer_id)

Unique keys

Most cloud data warehouses don’t have the concept of primary keys or a built-in way to check that a key is unique.

To add a key uniqueness check to the dataset we created above, we simply add the relevant assertion to the config block of our dataset:

config {
  type: "table",
  assertions: {
    index: ["customer_id"]
  }
}

That’s it! When we run our pipeline, Dataform generates a query, specific to our configured warehouse type, that looks for duplicate values of the key in the dataset and warns us if it finds any.
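Under the hood, a uniqueness check boils down to a familiar SQL pattern: group by the key and return any group containing more than one row. Here’s a minimal sketch of that pattern using Python’s built-in sqlite3 as a stand-in warehouse (the table and data are made up for illustration, and the exact SQL Dataform generates varies by warehouse):

```python
import sqlite3

# In-memory database with a deliberately duplicated customer_id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_stats (customer_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customer_stats VALUES (?, ?)",
    [(1, "a@example.com"), (2, "b@example.com"), (2, "b2@example.com")],
)

# The assertion pattern: any row returned means the key is not unique.
duplicates = conn.execute("""
    SELECT customer_id, COUNT(*) AS row_count
    FROM customer_stats
    GROUP BY customer_id
    HAVING COUNT(*) > 1
""").fetchall()

print(duplicates)  # [(2, 2)] -- customer_id 2 appears twice, so the check fails
```

Because the query only returns rows when duplicates exist, an empty result means the check passed.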

Nullness checks

Another common data integrity requirement which most warehouses don’t support is column non-nullness checking. We can easily add this by extending our config block:

config {
  type: "table",
  assertions: {
    index: ["customer_id"],
    nonNull: ["email"]
  }
}
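The non-nullness check follows the same pattern: the assertion query selects any rows where the required column is NULL, and passes only when nothing comes back. A sketch with sqlite3 standing in for the warehouse (sample data is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_stats (customer_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customer_stats VALUES (?, ?)",
    [(1, "a@example.com"), (2, None)],  # customer 2 is missing an email
)

# The assertion pattern: any row returned has a NULL in a required column.
null_rows = conn.execute(
    "SELECT customer_id FROM customer_stats WHERE email IS NULL"
).fetchall()

print(null_rows)  # [(2,)] -- one offending row, so the check fails
```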

Custom row checks

The shorthands above don’t cover conditions that are specific to our application’s business logic. What if we want to write something a bit more custom?

We can use the rowConditions feature of the assertions config block to write SQL based rules that should apply to every row in our data.

For example, let’s say we track where our customers came from using a string value: either “outbound” or “inbound”.

We can write a check for this using a custom row condition:

config {
  type: "table",
  assertions: {
    index: ["customer_id"],
    nonNull: ["email"],
    rowConditions: [
      "source IN ('inbound', 'outbound')"
    ]
  }
}

Row conditions capture many practical data quality checks we see people writing that aren’t covered by the above shorthands, and adding new conditions is easy.
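Conceptually, each row condition is inverted to build the assertion query: select every row for which the condition does not hold, so that any returned row is a failure. A sketch with sqlite3 (the exact query Dataform generates may differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_stats (customer_id INTEGER, source TEXT)")
conn.executemany(
    "INSERT INTO customer_stats VALUES (?, ?)",
    [(1, "inbound"), (2, "outbound"), (3, "partner")],  # "partner" is not allowed
)

row_condition = "source IN ('inbound', 'outbound')"

# Invert the condition: any row returned violates the business rule.
bad_rows = conn.execute(
    f"SELECT customer_id, source FROM customer_stats WHERE NOT ({row_condition})"
).fetchall()

print(bad_rows)  # [(3, 'partner')]
```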

Advanced use cases

For these examples, we’re going to use custom-built assertions in Dataform. Dataform assertions are SQL queries that look for problems in data: if the assertion passes, the query returns zero rows. You can read more about assertions here.

Data freshness

Let’s assume you rely on data being loaded into your warehouse by some external system, e.g. a product like Stitch Data that incrementally loads data from your production database into your warehouse.

It would be great to know when that data stops flowing, so that we can fix any issues with the connection, or get a warning before people start asking why the dashboards have stopped updating.

Here’s an example assertion for BigQuery that checks that the most recently loaded data is less than 6 hours old:

config {
  type: "assertion"
}

WITH freshness AS (
  SELECT
    TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(load_timestamp), HOUR) AS delay
  FROM
    production_database_staging)
SELECT
  *
FROM
  freshness
WHERE
  delay > 6

This query will return a single row if the current delay on our table is greater than 6 hours, and cause the assertion to fail.
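The same freshness logic can be exercised locally. In the sketch below, sqlite3 plays the part of the warehouse, julianday arithmetic replaces BigQuery’s TIMESTAMP_DIFF, and the table contents are fabricated so the data is deliberately 8 hours stale (table name matches the example; everything else is illustrative):

```python
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE production_database_staging (load_timestamp TEXT)")

# Fabricate a last load 8 hours in the past (sqlite treats these strings as UTC).
stale = (datetime.now(timezone.utc) - timedelta(hours=8)).strftime("%Y-%m-%d %H:%M:%S")
conn.execute("INSERT INTO production_database_staging VALUES (?)", (stale,))

# Same shape as the BigQuery assertion: a returned row means the data is stale.
stale_rows = conn.execute("""
    WITH freshness AS (
        SELECT (julianday('now') - julianday(MAX(load_timestamp))) * 24 AS delay
        FROM production_database_staging)
    SELECT delay FROM freshness WHERE delay > 6
""").fetchall()

print(len(stale_rows))  # 1 row returned, so the assertion would fail
```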

Data completeness

Sometimes just looking at freshness doesn’t cut it. It’s common to have an ETL system that loads daily dumps of data into partitioned tables in our warehouse. If the load for a given day fails, that could indicate a problem which a freshness assertion won’t necessarily capture.

Instead, we want to check that there is some data present for each of the last 30 complete days. We can use BigQuery’s very handy generate_date_array function for this, assuming our load table uses date partitioning:

config {
  type: "assertion"
}

WITH required_dates AS (
  SELECT
    *
  FROM
    UNNEST(GENERATE_DATE_ARRAY(
      DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY),
      DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY),
      INTERVAL 1 DAY)) AS date)
SELECT
  rd.date
FROM
  required_dates rd
LEFT JOIN (
  SELECT DISTINCT
    _PARTITIONDATE AS date
  FROM
    data_load_tables) dlt
USING
  (date)
WHERE
  dlt.date IS NULL

This will return the dates that are missing, and zero rows if there are no dates missing.
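To see the anti-join logic in isolation, here’s a sketch that reproduces it with sqlite3, using a recursive CTE in place of GENERATE_DATE_ARRAY and a fixed as-of date so the result is deterministic (all names and data are illustrative):

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data_load_tables (d TEXT)")

# Load 29 of the 30 complete days before the as-of date, skipping one on purpose.
as_of = date(2024, 6, 30)
missing_day = as_of - timedelta(days=15)  # 2024-06-15
for n in range(1, 31):
    day = as_of - timedelta(days=n)
    if day != missing_day:
        conn.execute("INSERT INTO data_load_tables VALUES (?)", (day.isoformat(),))

# Recursive CTE stands in for GENERATE_DATE_ARRAY; the anti-join returns
# required dates that have no matching partition.
missing = conn.execute("""
    WITH RECURSIVE required_dates(d) AS (
        SELECT date(:as_of, '-30 days')
        UNION ALL
        SELECT date(d, '+1 day') FROM required_dates
        WHERE d < date(:as_of, '-1 day'))
    SELECT rd.d
    FROM required_dates rd
    LEFT JOIN (SELECT DISTINCT d FROM data_load_tables) dlt USING (d)
    WHERE dlt.d IS NULL
""", {"as_of": as_of.isoformat()}).fetchall()

print(missing)  # [('2024-06-15',)] -- the day we skipped
```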

ML model accuracy (BigQuery)

Here’s a cool one: BigQuery allows us to train and use ML models entirely within BigQuery, using a SQL-like syntax.

What if our ML model accuracy decreases for some reason? We probably want to know about it, and we may want to actually gate deployment of our model on the accuracy being over a certain baseline level.

config {
  type: "assertion"
}

SELECT
  *
FROM
  ML.EVALUATE(
    MODEL my_ml_model,
    (
      SELECT
        *
      FROM
        source_data
      WHERE
        dataframe = 'evaluation'))
WHERE
  accuracy < 0.8

If our accuracy drops below 80%, the assertion will fail. By making this assertion a dependency of our model deployment query, we can also make sure we don’t push new models that don’t meet this bar. Check out our recent blog post on building an end-to-end ML pipeline with BigQuery, which covers this in more detail.

Unit tests

Assertions aren’t hermetic: they depend on the production data in your warehouse, which is changing all the time. This means your assertions can start failing even when your data pipeline and SQL queries haven’t changed at all.

This is very much intended behaviour, but not always what we want. For testing query logic without a dependency on production data we can use unit tests that make it easy to mock out data sources. We won’t cover this here, but you can read more in our unit testing SQL queries with Dataform blog post.

Conclusion

Using SQL is great for doing data quality testing, and Dataform makes it even easier than before to configure both common and advanced data quality tests with just a few lines of code.

If you have testing use cases that you don’t think are covered by the above, we’d love to hear about them!