Recently, I’ve been working with a database that makes liberal uses of triggers to audit database changes. Each table has a trigger that records the action taken (INSERT, UPDATE, DELETE), the change in value, and the time at which the action occurred. The history recorded effectively paints a picture of database history, thereby requiring much less effort on the part of the application designer to persist data in any cohesive way.

Scanning through some of this historical data with a coworker, we discovered to our surprise that the timestamps were not monotonically increasing. In fact, updates to the same records looked to be confusingly out of order. Abstracting away the actual data, we were looking at a situation like so:

testdb=# SELECT id, old_val, new_val, created_at FROM history; id | old_val | new_val | created_at ---+---------+---------+---------------------------- 1 | a | b | 2015-05-11 12:00:00.000000 2 | b | c | 2015-05-11 12:00:30.000000 3 | c | d | 2015-05-11 12:00:15.000000

Logically, the progression of values still makes sense: a -> b -> c -> d. Verifying the original table, the value is persisted as d, suggesting the most recent update (by id value) is in fact the most accurate representation of the current state of the database. How then, do we generate out of order timestamps?

The Investigation

Transactions in Postgres assure data integrity by locking modified rows until that transaction has been completed. Let’s say I have a users table, and a user named ‘a’. I also have a trigger, which populates my history table, which in turn has a default created_at value of NOW():

testdb=# SELECT * FROM users; id | name ---+------ 1 | a (1 row) testdb=# \d history Column | Type | Modifiers --------------------------------+-----------------------------+------------------------------------------------------ id | integer | not null default nextval('history_id_seq'::regclass) old_val | character varying(255) | new_val | character varying(255) | created_at | timestamp with time zone | not null default now()

Now, in two different windows, I can open simultaneous transactions, a few seconds apart:

testdb=# # SESSION 1 | testdb=# # SESSION 2 testdb=# START TRANSACTION; | testdb=# testdb=# SELECT now(); | testdb=# now | testdb=# ------------------------------- | testdb=# 2015-05-11 20:57:50.255651-04 | testdb=# (1 row) | testdb=# testdb=# | testdb=# START TRANSACTION; testdb=# | testdb=# SELECT now(); testdb=# | now testdb=# | ------------------------------- testdb=# | 2015-05-11 20:58:00.466234-04 testdb=# | (1 row) testdb=# | testdb=# START TRANSACTION;

Importantly, my 2nd session was started roughly 10 seconds after my 1st. If I wait a few seconds, and check now() again, I see the following:

testdb=# # SESSION 1 | testdb=# # SESSION 2 testdb=# SELECT now(); | testdb=# SELECT now(); now | now ------------------------------- | ------------------------------- 2015-05-11 20:57:50.255651-04 | 2015-05-11 20:58:00.466234-04 (1 row) | (1 row)

Once a transaction is open, Postgres fully stops time. Now, in my 2nd session, I’ll update the name of my user ‘a’. Since I’m still in a separate transactions, the name of my user will still appear as ‘a’ in my 1st session.

testdb=# # SESSION 1 | testdb=# # SESSION 2 testdb=# | testdb=# UPDATE users SET name='b' WHERE id=1; testdb=# | UPDATE 1 testdb=# | testdb=# SELECT * FROM users; testdb=# | id | name testdb=# | ---+------ testdb=# | 1 | b testdb=# | (1 row) testdb=# SELECT * FROM users; | testdb=# id | name | testdb=# ---+------ | testdb=# 1 | a | testdb=# (1 row) | testdb=#

If I then attempt to update my user’s name in my 1st session, the update will hang, as the transaction will not be able to secure a lock on the row in question. While the 2nd transaction remains open, the 2nd session will continue to hold a lock on the updated row to prevent any loss or overwrite or changed yet uncommitted data.

testdb=# # SESSION 1 | testdb=# # SESSION 2 testdb=# UPDATE users SET name='c' WHERE id=1; | testdb=# | testdb=# SELECT * FROM users; | id | name | ---+------ | 1 | b | (1 row)

Committing my 2nd session, will release the lock to my 1st, allowing the update to complete. However, until my 1st session is committed, reads outside of the transaction will not show the update from b -> c.

testdb=# # SESSION 1 | testdb=# # SESSION 2 testdb=# UPDATE users SET name='c' WHERE id=1; | testdb=# | testdb=# COMMIT TRANSACTION; UPDATE 1 | testdb=# testdb=# | testdb=# SELECT * FROM users; testdb=# | id | name testdb=# | ---+------ testdb=# | 1 | b testdb=# | (1 row) testdb=# SELECT * FROM users; | testdb=# id | name | testdb=# ---+------ | testdb=# 1 | c | testdb=# (1 row) | testdb=# testdb=# COMMIT TRANSACTION; | testdb=# testdb=# | testdb=# SELECT * FROM users; testdb=# | id | name testdb=# | ---+------ testdb=# | 1 | c testdb=# | (1 row)

As soon as each update is performed, the write to the history table is also prepped by the trigger. However, this write also isn’t executed until the transaction closes. Since the 2nd transaction closes first, it posts first, and carries with it the timestamp from the start of the transaction.

The Conclusion

The timeline of our investigation is as follows:

SESSION 1 start: 2015-05-11 20:57:50.255651-04 close SESSION 2 start: 2015-05-11 20:58:00.466234-04 close

The end result is a history table in which the created_at timestamps are ‘out of order’:

testdb=# SELECT id, old_val, new_val, created_at FROM history; id | old_val | new_val | created_at ---+---------+---------+---------------------------- 1 | a | b | 2015-05-11 20:58:00.466234-04 2 | b | c | 2015-05-11 20:57:50.255651-04

Despite the default on the history table’s created_at column, the value of now() at the time the row is prepped is coming from the timestamp frozen at the start of the transaction. The timestamp() is not a reflection of time at the write time, but at the row prep time. The id, in contrast, is acquired at the time the row posts to the history table, and is taken from the next value available in the id sequence.1

The ordering of the table by id, rather than the created_at value, therefore represents the most accurate history of how the data has changed.

* All above code examples are in psql syntax.

1 Because sequence values must be immediately available to all transactions, incrementing a sequence cannot be undone. This prevents locking of a sequence value itself, as well as preventing a collision between a value updated within an open transaction and the state of the sequence after a closed transaction. See sequence documentation and transaction isolation documentation for further information.