Introduction

Sometimes you need to compare data across two BigQuery tables. You may want to do this generically (match entire-row-by-row), sometimes comparing by key. This is far easier than you may think!

Sample Data

NOTE: You need the transactions dataset for this SQL.

Create a left_table with a million rows. This will contain the base table for comparison.

CREATE OR REPLACE TABLE `transactions.left_table` AS

SELECT

r AS id,

CAST(r AS STRING) AS col1,

CAST(FLOOR(RAND() * 1000) AS STRING) AS col2,

'x' AS col3,

'y' AS col4

FROM

UNNEST(GENERATE_ARRAY(1, 1000000)) r;

Create a right_table with mostly the same data, but some deletes, updates, and inserts. There is a probability used for deletes and updates, while 100 rows are simply inserted.

CREATE OR REPLACE TABLE `transactions.right_table` AS

WITH

DataChanges AS (

SELECT

* EXCEPT (col3, col4),

IF(RAND() < 0.001, 'new-x', col3) AS col3,

col4

FROM

`transactions.left_table`

WHERE

RAND() > 0.001

),

NewData AS (

SELECT

r AS id,

CAST(r AS STRING) AS col1,

CAST(FLOOR(RAND() * 1000) AS STRING) AS col2,

'x' AS col3,

'y' AS col4

FROM

UNNEST(GENERATE_ARRAY(1000000, 1000100)) r

)

SELECT * FROM DataChanges

UNION ALL

SELECT * FROM NewData;

Whole Table Comparison

A whole table comparison will generate a FARM_FINGERPRINT per row. This will allow us to do a FULL OUTER JOIN on this hash and see what values exist on one side but not the other.

Note that without a key we cannot tell if a row has been updated.

WITH

LeftData AS (

SELECT

a AS data,

FARM_FINGERPRINT(FORMAT("%T", a)) AS h

FROM

`transactions.left_table` AS a

),

RightData AS (

SELECT

b AS data,

FARM_FINGERPRINT(FORMAT("%T", b)) AS h

FROM

`transactions.right_table` AS b

)

SELECT

IF(l.h IS NULL,"New on right","New on left") AS Change,

IF(l.h IS NULL,r.data,l.data).*

FROM

LeftData l

FULL OUTER JOIN RightData r

ON l.h = r.h

WHERE

l.h IS NULL OR

r.h IS NULL

The above could be changed so that — rather than listing the rows — it could count the rows with a GROUP BY.

Key by Key Comparison

Rather than just comparing the whole table, we want to compare row-by-row. So we will use the key to determine if it’s an insert, update, or delete.

SELECT

CASE

WHEN a.id IS NULL AND b.id IS NOT NULL THEN "I"

WHEN a.id IS NOT NULL AND b.id IS NULL THEN "D"

ELSE "U"

END AS op,

IF(b.id IS NULL, a, b).*

FROM

`transactions.left_table` a

FULL OUTER JOIN `transactions.right_table` b

ON a.id = b.id

WHERE

a.id IS NULL OR

b.id IS NULL OR

FARM_FINGERPRINT(FORMAT("%T", a)) <>

FARM_FINGERPRINT(FORMAT("%T", b));

This will generate a Change-Data-Capture like log of changes between left_table and right_table. Below is the sample output from running this.

Conclusion

BigQuery can be used for creating test cases, validating your data, generating CDC logs from snapshots fairly easily. This can be done without an explicit column-by-column comparison.

Thanks to Thinh Ha for the comparison queries.