Diftong is a tool that compares two tabular databases resulting from different versions of a workflow to detect and prevent potential unwanted alterations.

Introduction

Klarna’s Checkout product is integrated with 130,000+ online merchants. Making credit and fraud assessments for every single purchase in real time requires enormous amounts of high-quality data available at a moment’s notice. This is done by using a group of data transformation workflows to fetch the required data for each assessment.

Even though risk and fraud assessment workflows might have the highest name recognition at Klarna, they are by no means the only workflows executed on a regular basis. There are currently about 300 workflows executed per day, each producing data that is important for Klarna’s day-to-day business operations.

Everything would be fine if those workflows were consistent and never changed, but this is not the case. Workflows often need to be improved and updated to cope with evolving business needs and the rapid growth of the data. Some examples of these updates are:

More efficient and better-optimized implementations of transformations.

Inclusion of new data models.

A change in business requirements.

An infrastructure upgrade that can affect the results of a transformation even if no changes were made to the workflow itself.

Any effort to improve or update a workflow comes with a hard requirement: the quality of the output data must be upheld. Hence, we need data validation to ensure that this is the case.

Easy enough, right?

Not really, and for the following reasons:

Massive amounts of data and long running workflows.

Complex workflows.

Many heterogeneous input data sources with unknown data quality.

Many use cases for output data.

In general, validating big data workflows is a bit like looking for a needle in a haystack.

The big idea

Preserving data quality after developing a new version of a workflow is critical. Manual data validation can be time-consuming, error-prone, and tedious. So it might seem like a good idea to either skip any meaningful validation and hope for the best, or to abort the update. Both alternatives have negative impacts, as they lead either to producing data of poor quality or to blocking any attempt to improve the organization’s methods.

As a result, much-needed workflow updates and improvements might end up being postponed or canceled altogether after a short-term cost-benefit analysis, which in turn, over time, produces slower workflows with degraded data quality. All of this seemed like a waste of manpower and computing resources, so we decided to do something about it.

In this post we provide a short description of a big data validation tool, called Diftong, that offers an automated and more agile way of doing large-scale data validation by comparing the outputs of a workflow before and after an update.

The validation process

Diftong automatically computes overall statistics of the differences between two tabular databases that have the same schema. The validation process in Diftong can be described in three main steps:

Deduplication, detecting duplication in data.

Row validation, resulting in row-based statistics.

Column validation, resulting in column-based statistics.

To explain the steps more clearly, we will go through a running example consisting of two databases with identical schemas that are to be validated; let’s call them DB1 and DB2. Each database contains a table named “Users” with four columns (id, name, salary, and birthday). A sample of the data is shown below. Again, this is just a simple example to explain how the tool works; the real use cases involve larger databases with millions of records.

Data in “Users” table — DB1 and DB2

1. Deduplication step

As duplicate rows (with matching values in all columns) may be present in big data, we need to account for them. So each row in both databases (DB1 and DB2) is counted in this step, and the results are stored in two new databases (DDB1 and DDB2). Each new database contains the same tables and data as the original databases before deduplication, but with one additional column showing the number of times each row appears in a table, to be used in later stages of the validation process.

The deduplication step

Back to our running example, the deduplication step detected that the row with id=2 is duplicated in the second database as shown below.

Data in “Users” table — DDB1 and DDB2
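In SQL terms, deduplication is essentially a GROUP BY over all columns. A minimal Python sketch of the same idea is shown below; the sample rows are hypothetical stand-ins for the “Users” table, since the exact example values live in the figures:

```python
from collections import Counter

def deduplicate(rows):
    """Mimic Diftong's deduplication step: count how many times each
    full row appears and append that count as an extra column."""
    counts = Counter(tuple(row) for row in rows)
    return [(*row, n) for row, n in counts.items()]

# Hypothetical stand-in for DB2's "Users" table, with id=2 duplicated.
db2_users = [
    (1, "Adam", 1000, "1984-04-17"),
    (2, "Eva", 1500, "1983-07-05"),
    (2, "Eva", 1500, "1983-07-05"),
]

ddb2_users = deduplicate(db2_users)
# The duplicated row now carries a count of 2, the others a count of 1.
```

Keeping the count as an extra column, rather than simply dropping duplicates, is what lets the later steps detect differences in how often a row occurs.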

2. Row validation step

This step gives an overview of the total number of changes in each table, where the differences between all rows are calculated. Detecting differences and similarities at the row level gives a clear idea of any changes that might have occurred. If changes are found, a deeper level of statistics, based on each column in that table, will be calculated in the next step.

The generated statistics in this step are:

Number of rows in DB1 and DB2

Absolute difference of DB1 and DB2

Number of rows in DDB1 and DDB2

Absolute difference of DDB1 and DDB2

Total number of differences between DDB1 and DDB2

Percent of differences between DDB1 and DDB2

The row validation step

Looking at the data in the “Users” table in DDB1 and DDB2, we can see that five rows out of seven contain differences in the values.

Row differences between “Users” table in DDB1 and DDB2

Hence, the generated statistics from this step reflect those changes, and we can see in the following table that there is a 71.4% difference in the data.

Row validation statistics — “Users” table
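The row-level statistics can be sketched as follows. This is a simplified Python version, assuming a row “differs” when its exact value combination (including the duplicate count) is missing from the other table, with the percentage taken over the distinct rows across both tables; Diftong’s actual implementation, and its exact counting conventions, live in SQL over the deduplicated databases:

```python
def row_validation_stats(ddb1_rows, ddb2_rows):
    """Row-based difference statistics between two deduplicated
    tables, where each row tuple ends with its duplicate count."""
    s1, s2 = set(ddb1_rows), set(ddb2_rows)
    differing = s1 ^ s2            # rows not present in both tables
    universe = s1 | s2
    return {
        "rows_ddb1": len(s1),
        "rows_ddb2": len(s2),
        "abs_difference": abs(len(s1) - len(s2)),
        "total_differences": len(differing),
        "pct_differences": (round(100 * len(differing) / len(universe), 1)
                            if universe else 0.0),
    }

# Hypothetical mini-example: one row matches, one row has changed.
stats = row_validation_stats(
    [(1, "Adam", 1000, 1), (3, "Sofie", 2000, 1)],
    [(1, "Adam", 1000, 1), (3, "sofia", 2500, 1)],
)
```

Note that under this counting convention a changed row contributes both its old and its new version to the difference count, which is why per-column statistics in the next step are needed to pinpoint what actually changed.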

3. Column validation step

If differences were detected in the previous step, a validation based on the values in each column is needed to get a more detailed view of the changes in the data. In addition, this step gives a better understanding of the distribution of the data by measuring how the differences are spread. In other words, starting from the results of the row validation step, column-based statistics are calculated for the tables that contain differences.

The generated statistics are:

Total number of differences in data

Maximum and minimum difference in data

First quartile Q1 (25th percentile)

Second quartile Q2 (50th percentile)

Third quartile Q3 (75th percentile)

The column validation step

Looking at the field level differences displayed in our example below, we can see that:

Name: there is one difference in two letters, the first and the last (Sofie/sofia), in the record with id=3.

Salary: there is one difference of 500 in the record with id=3.

Birthday: there is one difference of 1,800,000 milliseconds in the record with id=3.

Row count: there is one duplicate row in the record with id=2.

Note that extra rows, like the record with id=4, are not taken into consideration.

Column differences between two tables in DDB1 and DDB2

Since there is exactly one difference in each column, the generated statistics (Max, Min, Q1, Q2, and Q3) end up having the same values in this example:

Column validation statistics — “Users” table
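Given the list of per-row absolute differences found in one column, the quartile statistics can be sketched as below. The function name is illustrative, not Diftong’s actual API; with a single difference, all quartiles collapse to that one value, as in the example:

```python
import statistics

def column_validation_stats(diffs):
    """Column-based statistics over the per-row differences found
    in one column (rows with no difference are excluded)."""
    if len(diffs) == 1:
        # A single value is its own quartile everywhere.
        q1 = q2 = q3 = diffs[0]
    else:
        q1, q2, q3 = statistics.quantiles(diffs, n=4, method="inclusive")
    return {
        "total": len(diffs),
        "min": min(diffs),
        "max": max(diffs),
        "q1": q1, "q2": q2, "q3": q3,
    }

# The salary column in the example has a single difference of 500,
# so every statistic ends up being 500.
salary_stats = column_validation_stats([500])
```

With millions of records the quartiles become genuinely informative: a large maximum with small quartiles points at a few outliers, while high quartiles indicate a systematic shift across the whole column.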

Results

Using Diftong in real use cases related to Klarna’s core business has shown that the ability to identify differences between large-scale databases helps in managing data transformation workflows.

In particular, Diftong has helped Klarna validate workflow optimizations. It has added great value to the company, as the automatic comparison of databases enables a more agile approach to workflow improvements and updates. More specifically, such comparisons:

Reduce the delay in improving workflow performance through the rapid identification of any differences in the result data produced by two versions of a workflow. This, in turn, makes it easier for the workflow owner to trust and verify the correctness of the improvement.

Improve the quality of the data due to easier identification of issues such as corrupt information and inconsistent semantics.

Further readings

This study is published in the Journal of Big Data, where the full article with additional examples and test cases can be found.

Acknowledgements

This post was co-written by Johan Petrini. Many thanks to our Klarna colleagues in the Data Domain for their support, especially Erik Zeitler, and to Uppsala University in Sweden.