I’ve previously written about how I think we should prefer writing processing pipelines in pure SQL. However, a big difference between SQL and more widely-used languages is that those other languages generally have a strong tradition of unit testing.

Usually, when we talk about ‘tests’ in the context of SQL, we don’t actually mean unit tests. Instead, the term generally refers to data tests, which are really assertions that the data itself conforms to some test criteria.

Unit tests are not assertions. Unit tests verify the logic of a SQL query by running that query on some fixed set of inputs. Assertions necessarily depend upon the real datasets which they validate, while unit tests should never depend on any real data.

The benefits of unit tests

Unit testing is a standard practice in software engineering. Unit tests help ensure that difficult pieces of logic or complex interactions between components work as expected - and continue to work as expected as the surrounding code changes.

Unit tests should not have any external dependencies; tests run the code in question on some faked inputs, ensuring that changes outside of that unit of code do not affect the test. This means that the success or failure of the test comes down purely to the code’s logic. Thus, if the test fails, you know exactly where to start debugging.

Why isn’t SQL unit testing widespread?

In standard languages, a unit test typically consists of injecting fake input into the code under test and checking that the output matches some expected result. However, SQL scripts don’t label their input datasets - typically, they’re just defined statically inline in a FROM clause. This makes it difficult to inject fake input test data into your SQL code.

The result of this is that most SQL code goes untested.

The solution

Various SQL frameworks let you define layers of indirection between your SQL and its input(s); i.e. you declare and label the input datasets upon which a query depends. Unit testing frameworks can use this indirection to replace real input data with faked versions.

We can then run the code under test, using some faked input, and compare the output result rows against a set of expected outputs. If the actual output of the code under test matches the expected output, the test passes; if not, it fails.

This technique is simple and gives you real power to verify that a SQL script does what you think it does. You can pass faked inputs to your SQL that your real data may not currently contain, giving you confidence that it can robustly handle a wide range of data.

Test case support in Dataform

When using Dataform’s enriched SQL, you reference input datasets using either the ref() or resolve() function. This functionality gives us an easy way to inject fake input datasets into a script, thus enabling users to write unit tests.

We have defined a new type of Dataform script: test . In a test query, you specify:

The query you’re testing

The faked inputs, each labeled with their referenced name

The expected output of running the query on the faked inputs

Behind the scenes, when you run the test, we dynamically replace the inputs to the dataset query with your faked input data. We then run the dataset query, along with the query that defines your expected output, and check that the resulting rows match. Simple!

An example

Here’s a worked example written using Dataform’s JavaScript API.



// First, define a dataset - we’ll follow this up with the unit test. publish ( " age_groups " ). query ( ctx => ` SELECT FLOOR(age / 5) * 5 AS age_group, COUNT(1) AS user_count FROM ${ ctx . ref ( " ages " )} GROUP BY age_group ` ); // Now, define the unit test. test ( " test_age_groups " ) // Specify the name of the dataset under test. . dataset ( " age_groups " ) // Provide the fake input “ages” dataset. . input ( " ages " , ` SELECT 15 AS age UNION ALL SELECT 21 AS age UNION ALL SELECT 24 AS age UNION ALL SELECT 34 AS age ` ) // Provide the expected output of running “age_groups” on the “ages” dataset. . expect ( ` SELECT 15 AS age_group, 1 AS user_count UNION ALL SELECT 20 AS age_group, 2 AS user_count UNION ALL SELECT 30 AS age_group, 1 AS user_count ` );

Alternatively, if you prefer to use Dataform’s enriched SQL, the unit test would look as follows (note that publishing the dataset is elided for simplicity):



config { type : " test " , dataset : " age_groups " } input " ages " { SELECT 15 AS age UNION ALL SELECT 21 AS age UNION ALL SELECT 24 AS age UNION ALL SELECT 34 AS age } SELECT 15 AS age_group , 1 AS user_count UNION ALL SELECT 20 AS age_group , 2 AS user_count UNION ALL SELECT 30 AS age_group , 1 AS user_count

For more details, see our documentation.

We’ve released this functionality as part of the v1.0.0 release of our @dataform NPM packages. Dataform Web will soon support test cases, too. Let us know what you think!