A customer of my popular SQL training (which you should book!) has recently challenged me to optimise a hierarchical query that merges an archive log’s deltas in order to obtain a snapshot of some record at a given point in time. In this article, I will reproduce their problem statement in a simplified version and show how this can be done with SQL Server, using a few cool SQL features:

JSON (because that’s how they stored their deltas)

Window functions, which are so useful

CROSS APPLY, a cool, alternative way to join dependent data

Aggregating CASE expressions

All of these are topics covered in the training, which were immediately applicable to this problem statement.

The problem statement

This was their archive design. They designed for uncertainty, meaning that for some entities in their system, they did not know what kinds of attributes would be part of the entity in the future. Given their application design, users could even add their own custom attributes to an entity.

This kind of thing is typically solved with the EAV (Entity Attribute Value) model, a “workaround” to denormalise data sets in SQL databases in the event of such schema uncertainty.

EAV can be implemented in several ways:

Through classic SQL tables only

An example implementation is this:

CREATE TABLE eav_classic (
  entity_type     VARCHAR(100) NOT NULL,
  entity_id       BIGINT       NOT NULL,
  attribute_name  VARCHAR(100) NOT NULL,
  attribute_type  VARCHAR(100) NOT NULL,
  attribute_value VARCHAR(100) NULL,

  CONSTRAINT eav_classic_pk
    PRIMARY KEY (entity_type, entity_id, attribute_name)
);

The drawbacks of this non-normalised design are immediately obvious. Most notably, there is no simple way to establish referential integrity. But this may be totally OK, especially for archive logs and for smaller databases (Datomic does something similar).

Through tables containing JSON or XML data

Whenever you have schema-on-read data, JSON or XML data types may be appropriate, so this is a perfectly valid alternative:

CREATE TABLE eav_json (
  entity_type VARCHAR(100)   NOT NULL,
  entity_id   BIGINT         NOT NULL,
  attributes  VARCHAR(10000) NOT NULL
    CHECK (ISJSON(attributes) = 1),

  CONSTRAINT eav_json_pk
    PRIMARY KEY (entity_type, entity_id)
);

If your database supports a JSON data type, you will obviously prefer that over the above emulation.

For the rest of this article, I will use the JSON variant.

Versioning the EAV table

Versioning data in an EAV model is much easier than in a normalised schema: we can just add a version number and/or timestamp to each record. In their case, something like this may make sense:

CREATE TABLE history (
  id          BIGINT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
  ts          DATETIME              NOT NULL,
  entity_type VARCHAR(100)          NOT NULL,
  entity_id   BIGINT                NOT NULL,
  delta       VARCHAR(8000)         NOT NULL
    CHECK (ISJSON(delta) = 1)
);

INSERT INTO history (entity_type, entity_id, ts, delta)
VALUES ('Person', 1, '2000-01-01 00:00:00', '{"first_name": "John", "last_name": "Doe"}'),
       ('Person', 1, '2000-01-01 01:00:00', '{"age": 37}'),
       ('Person', 1, '2000-01-01 02:00:00', '{"age": 38}'),
       ('Person', 1, '2000-01-01 03:00:00', '{"city": "New York"}'),
       ('Person', 1, '2000-01-01 04:00:00', '{"city": "Zurich", "age": null}');

This table now contains a set of deltas applied to the Person entity with ID = 1. It corresponds to the following sequence of SQL statements on an ordinary entity:

INSERT INTO person (id, first_name, last_name)
VALUES (1, 'John', 'Doe');

UPDATE person SET age = 37 WHERE id = 1;
UPDATE person SET age = 38 WHERE id = 1;
UPDATE person SET city = 'New York' WHERE id = 1;
UPDATE person SET city = 'Zurich', age = NULL WHERE id = 1;

You could even see their hand-written log as a kind of transaction log of the database system, much like what you can extract using products like Oracle GoldenGate or Debezium. If you think of the transaction log as an event stream, then the RDBMS's current data representation is like a snapshot that you get by applying any number of deltas to your tables.
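To make the "snapshot = fold of deltas" idea concrete, here is a quick Python sketch (the deltas are hard-coded from the example above; `snapshot` is a hypothetical helper, not part of the SQL solution we'll build):

```python
# A sketch of the "snapshot is a fold of deltas" idea, in plain Python.
# The deltas are hard-coded from the example; none of this is part of
# the SQL solution itself.
deltas = [
    ("2000-01-01 00:00:00", {"first_name": "John", "last_name": "Doe"}),
    ("2000-01-01 01:00:00", {"age": 37}),
    ("2000-01-01 02:00:00", {"age": 38}),
    ("2000-01-01 03:00:00", {"city": "New York"}),
    ("2000-01-01 04:00:00", {"city": "Zurich", "age": None}),
]

def snapshot(as_of):
    """Apply all deltas with ts <= as_of, oldest first."""
    state = {}
    for ts, delta in deltas:
        if ts <= as_of:
            state.update(delta)
    # An attribute explicitly set to null is absent from the snapshot
    return {k: v for k, v in state.items() if v is not None}

print(snapshot("2000-01-01 02:00:00"))
# {'first_name': 'John', 'last_name': 'Doe', 'age': 38}
```

The SQL query we will develop below computes exactly this, only set-based and inside the database.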

Sometimes, you don’t want to completely change your architecture and go full “event sourcing”, but just need this kind of log for a specific set of auditable entities. And for reasons like still having to support very old SQL Server versions, or other databases, you may also choose not to use the SQL:2011 temporal table feature, which has been implemented in SQL Server 2016 and more recent versions.

With that out of the way…

How to access any arbitrary snapshot version?

When we visually process our HISTORY table, we can see that Person ID = 1 had the following values at any given time:

TIME       FIRST_NAME  LAST_NAME  AGE  CITY
---------------------------------------------
00:00:00   John        Doe
01:00:00   John        Doe        37
02:00:00   John        Doe        38
03:00:00   John        Doe        38   New York
04:00:00   John        Doe             Zurich

Remember, this is always the same record of Person ID = 1, its snapshots represented at different points on the time axis. The goal here is to be able to find the record of John Doe at any given time.

Again, if we had been using the SQL:2011 temporal table feature, we could write

-- SQL Server
SELECT *
FROM Person
FOR SYSTEM_TIME AS OF '2000-01-01 02:00:00.0000000';

-- Oracle (flashback query)
SELECT *
FROM Person
AS OF TIMESTAMP TIMESTAMP '2000-01-01 02:00:00';

Side note: Do note that Oracle’s flashback query needs to be properly configured:

Not all data is “flashbackable”

DDL tends to destroy the archive

Proper grants are needed to access the flashback archive

Similar limitations may apply in SQL Server.

What if the RDBMS can’t help us?

If, again, for some reason we cannot use the RDBMS’s temporal table features, we’ll roll our own as we’ve seen. So, our SQL Server query to access the snapshot at any given time might look like this:

SELECT
  '{' +
  string_agg(
    CASE type
      WHEN 0 THEN NULL
      ELSE '"' + [key] + '": ' +
        CASE type
          WHEN 1 THEN '"' + value + '"'
          ELSE value
        END
    END,
  ', ') + '}'
FROM (
  SELECT *,
    row_number() OVER (
      PARTITION BY [key] ORDER BY ts DESC) rn
  FROM history
  OUTER APPLY openjson(delta)

  -- Apply all deltas prior to any given snapshot
  WHERE ts <= '2000-01-01 02:00:00'
) t
WHERE rn = 1;

What does this query do? Consider again our deltas at 04:00:00:

TIME       FIRST_NAME  LAST_NAME  AGE   CITY
----------------------------------------------
00:00:00   John *      Doe *
01:00:00   John        Doe        37
02:00:00   John        Doe        38
03:00:00   John        Doe        38    New York
04:00:00   John        Doe        - *   Zurich *

(* = the last delta applied to that attribute, as of this snapshot)

Observe how each value has some color encoding:

Strong, red: the current snapshot’s attribute value, when the last delta was applied to any given attribute

Strong, black: a previous snapshot’s attribute value, when a previous, superseded delta was applied to any given attribute

Light grey: a previous snapshot’s attribute value that was inherited from an earlier delta

For any given snapshot, we want to find the strong, red values. E.g. at a previous snapshot time, the color encoding would have been:

At 03:00:00

TIME       FIRST_NAME  LAST_NAME  AGE   CITY
----------------------------------------------
00:00:00   John *      Doe *
01:00:00   John        Doe        37
02:00:00   John        Doe        38 *
03:00:00   John        Doe        38    New York *
04:00:00   John        Doe        -     Zurich

(* = the last delta applied to that attribute, as of this snapshot)

At 02:00:00

TIME       FIRST_NAME  LAST_NAME  AGE   CITY
----------------------------------------------
00:00:00   John *      Doe *
01:00:00   John        Doe        37
02:00:00   John        Doe        38 *
03:00:00   John        Doe        38    New York
04:00:00   John        Doe        -     Zurich

(* = the last delta applied to that attribute, as of this snapshot)

So, our query needs to find the delta that was applied last for any given attribute.

With SQL, we can find that easily. We can assign a row number to each delta per attribute in reverse order, something like this:

TIME       FIRST_NAME  LAST_NAME  AGE     CITY
------------------------------------------------------
00:00:00   John (1)    Doe (1)
01:00:00   John        Doe        37 (3)
02:00:00   John        Doe        38 (2)
03:00:00   John        Doe        38      New York (2)
04:00:00   John        Doe        - (1)   Zurich (1)
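The reverse numbering can be sketched in plain Python before we do it in SQL (the key/value rows are hard-coded from the expanded deltas; `number_per_key` is a hypothetical helper, not part of the final solution):

```python
# A plain Python sketch of the reverse numbering above: number each
# delta per attribute, newest first. Hard-coded rows; hypothetical helper.
rows = [
    ("00:00:00", "first_name", "John"),
    ("00:00:00", "last_name", "Doe"),
    ("01:00:00", "age", "37"),
    ("02:00:00", "age", "38"),
    ("03:00:00", "city", "New York"),
    ("04:00:00", "city", "Zurich"),
    ("04:00:00", "age", None),
]

def number_per_key(rows):
    """Assign rn = 1, 2, ... per key, ordered by ts descending."""
    counts = {}
    numbered = []
    for ts, key, value in sorted(rows, key=lambda r: r[0], reverse=True):
        counts[key] = counts.get(key, 0) + 1
        numbered.append((ts, key, value, counts[key]))
    return numbered

# Keeping only rn = 1 yields the latest value per attribute
latest = {key: value for ts, key, value, rn in number_per_key(rows) if rn == 1}
print(latest)
# {'city': 'Zurich', 'age': None, 'first_name': 'John', 'last_name': 'Doe'}
```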

Once we have that row number, we just filter out only those deltas whose row number is 1. Something like:

SELECT
  [key], value,
  row_number() OVER (
    PARTITION BY [key] ORDER BY ts DESC) rn
FROM history
OUTER APPLY openjson(delta)
ORDER BY [key], ts;

Notice the OUTER APPLY openjson(delta) syntax. This just expands the JSON structure into key/value/type columns, which we can use more easily in a SQL query. Other database systems may have similar syntax for similar purposes. The result of the above query is:

key        |value    |rn
-----------|---------|---
age        |37       |3
age        |38       |2
age        |         |1
city       |New York |2
city       |Zurich   |1
first_name |John     |1
last_name  |Doe      |1

Filtering the ones whose row number is 1:

SELECT [key], value
FROM (
  SELECT ts, [key], value,
    row_number() OVER (
      PARTITION BY [key] ORDER BY ts DESC) rn
  FROM history
  OUTER APPLY openjson(delta)
) t
WHERE rn = 1
ORDER BY ts, [key];

This yields:

key        |value
-----------|-------
first_name |John
last_name  |Doe
age        |
city       |Zurich

Exactly the data we wanted, in key/value form. Notice that this filtering step could have been done with DISTINCT ON in PostgreSQL, or with KEEP (DENSE_RANK FIRST ORDER BY ..) in Oracle – an exercise which I shall leave to the reader (feel free to leave the solution in the comments!)

And now, finally, just re-assemble the JSON using the SQL Server 2017 STRING_AGG function. PostgreSQL would offer us JSON_AGG here, and Oracle has JSON_OBJECTAGG. With STRING_AGG, you have to take care of manually escaping all values according to JSON syntax rules, which is bad. In my example, I just replaced " by \". Other characters need escaping too, so if there is a built-in feature, use that instead of string processing.
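To see why quote-replacement alone is insufficient, here is a quick illustration using Python’s json module as an independent reference encoder (not part of the SQL solution):

```python
import json

# The example escapes only the double quote. A value containing a raw
# newline or a backslash shows why that is not enough:
tricky = 'He said "hi"\nC:\\temp'

naive = '"' + tricky.replace('"', '\\"') + '"'
proper = json.dumps(tricky)

print(proper)  # "He said \"hi\"\nC:\\temp"

# The naive version is broken in two ways: the raw newline makes it
# invalid JSON, and the backslash in C:\temp would be misread as the
# start of a \t (tab) escape sequence.
try:
    json.loads(naive)
except json.JSONDecodeError:
    print("naive escaping produced invalid JSON")

assert json.loads(proper) == tricky  # a real encoder round-trips cleanly
```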

The STRING_AGG function aggregates a CASE expression which translates different JSON data types into different formats, where:

0 is NULL (and nulls are not aggregated)

1 is string

everything else can be taken at its value for simplicity, e.g. numbers or booleans

Every value (except nulls) is prefixed by the JSON object’s attribute name (“key”).

SELECT
  '{' +
  string_agg(
    CASE type
      WHEN 0 THEN NULL
      ELSE '"' + replace([key], '"', '\"') + '": ' +
        CASE type
          WHEN 1 THEN '"' + replace(value, '"', '\"') + '"'
          ELSE value
        END
    END,
  ', ') + '}'
FROM (
  SELECT *,
    row_number() OVER (
      PARTITION BY [key] ORDER BY ts DESC) rn
  FROM history
  OUTER APPLY openjson(delta)

  -- Apply all deltas prior to any given snapshot
  WHERE ts <= '2000-01-01 04:00:00'
) t
WHERE rn = 1;

This produces

{"city": "Zurich", "first_name": "John", "last_name": "Doe"}

A final query gets us the entire history of snapshots (watch out for the performance of this one – it could definitely be optimised):

SELECT
  ts,
  (
    SELECT
      '{' +
      string_agg(
        CASE type
          WHEN 0 THEN NULL
          ELSE '"' + replace([key], '"', '\"') + '": ' +
            CASE type
              WHEN 1 THEN '"' + replace(value, '"', '\"') + '"'
              ELSE value
            END
        END,
      ', ') + '}'
    FROM (
      SELECT *,
        row_number() OVER (
          PARTITION BY [key] ORDER BY ts DESC) rn
      FROM history
      OUTER APPLY openjson(delta)

      -- Apply all deltas prior to any given snapshot
      WHERE ts <= x.ts
    ) t
    WHERE rn = 1
  )
FROM history x
GROUP BY ts;

It yields:

ts       |
---------|--------------------------------------------------------------------------
00:00:00 |{"first_name": "John", "last_name": "Doe"}
01:00:00 |{"age": 37, "first_name": "John", "last_name": "Doe"}
02:00:00 |{"age": 38, "first_name": "John", "last_name": "Doe"}
03:00:00 |{"age": 38, "city": "New York", "first_name": "John", "last_name": "Doe"}
04:00:00 |{"city": "Zurich", "first_name": "John", "last_name": "Doe"}

So, the complete history of all the snapshot versions of the Person with ID = 1.

Very cool, and definitely good enough for their archive / audit query requirements.