If you're reading this, you are probably using Stitch Data for ETL, replicating an OLTP database into BigQuery, Redshift, or another data warehouse. It's a cool NoOps ETL tool for a data engineer: there is no infrastructure to manage. De-duplication is one of the key components of data analytics, and normally you could handle it during the ETL process. But Stitch works in a different way: if you are using CDC, it performs append-only operations.

We have a data pipeline that syncs MySQL tables to BigQuery in near real time. Meanwhile, the source database receives a lot of updates, so every sync produces many duplicates (each id ends up with multiple versions). Stitch has a solution for this: it adds some additional columns to your target table to identify the most recent row.

- `_sdc_batched_at` (TIMESTAMP) — indicates when Stitch loaded the batch the record was a part of into the data warehouse. Example data: `2019-08-08 14:55:08.82+00`
- `_sdc_sequence` (INTEGER) — a Unix epoch (in milliseconds) that indicates the order in which data points were considered for loading. Example data: `1565276078922000095`
- `_sdc_primary_key` (STRING) — applicable only if Stitch doesn't detect a primary key in the source table; Stitch will use this column to de-dupe data. Example data: `5d8b9a05-33cc-4d5f-8163-4474814b46c6`
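To see what these columns buy us, here is a minimal Python sketch (with made-up rows and values) of the rule the de-dup queries implement: for each `id`, keep only the row with the latest `_sdc_batched_at`, breaking ties with `_sdc_sequence`.

```python
# Hypothetical rows as Stitch might append them: the same id appears
# once per captured change (append-only CDC).
rows = [
    {"id": 1, "name": "old", "_sdc_batched_at": "2019-08-08 14:54:00", "_sdc_sequence": 1565276078922000095},
    {"id": 1, "name": "new", "_sdc_batched_at": "2019-08-08 14:55:08", "_sdc_sequence": 1565276078922000099},
    {"id": 2, "name": "only", "_sdc_batched_at": "2019-08-08 14:55:08", "_sdc_sequence": 1565276078922000101},
]

# Keep, per id, the row with the highest (_sdc_batched_at, _sdc_sequence).
latest = {}
for row in rows:
    current = latest.get(row["id"])
    if current is None or (row["_sdc_batched_at"], row["_sdc_sequence"]) > (
        current["_sdc_batched_at"],
        current["_sdc_sequence"],
    ):
        latest[row["id"]] = row

deduped = list(latest.values())
print(deduped)  # id 1 keeps only its "new" version; id 2 is untouched
```

Of course, with tens of millions of rows this belongs in the warehouse, not in application code — which is what the SQL below does.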

Based on the above columns, we can perform the de-duplication. Stitch provides the following query for this.

```sql
# Standard SQL
SELECT
  DISTINCT o.*
FROM
  `bhuvi-project.sanbox.bhuvi_test` o
INNER JOIN (
  SELECT
    id,
    MAX(_sdc_sequence) AS seq,
    MAX(_sdc_batched_at) AS batch
  FROM
    `bhuvi-project.sanbox.bhuvi_test`
  GROUP BY
    id) oo
ON
  o.id = oo.id
  AND o._sdc_sequence = oo.seq
  AND o._sdc_batched_at = oo.batch
```

I have tested the query on my dataset.

Total rows: 23,806,735

Total size: 2.44GB

The above query looks like a statement we would use in any normal transactional database, and it relies on aggregations plus a self-join.

Let's scan the complete table with the above query. I used https://bqvisualiser.appspot.com/ to visualize the query plan.

Total Execution time: 56.483 sec

A better way of doing this:

BigQuery is an analytic database. It handles the GROUP BY and MAX efficiently, but let's try the analytical way with window functions and get rid of the old-school approach.

```sql
# Standard SQL
WITH
  cte AS (
  SELECT
    ROW_NUMBER() OVER(PARTITION BY id ORDER BY _sdc_batched_at DESC, _sdc_sequence DESC) AS rownum,
    *
  FROM
    `bhuvi-project.sanbox.bhuvi_test`)
SELECT
  *
FROM
  cte
WHERE
  rownum = 1;
```

Total execution time: 37.248 sec