I recently had to produce reports on a table containing events of a user's account balance. The user can deposit and withdraw from their account, and support personnel can set the account's credit, which is the maximum amount the user can overdraw.

The table looked roughly like this:

db=# SELECT * FROM event ; id | account | type | happened_at | data ----+---------+------------+------------------------+----------------------------------- 1 | 1 | created | 2019-08-01 15:14:13+03 | {"credit": 0, "delta_balance": 0} 2 | 1 | deposited | 2019-08-01 15:15:15+03 | {"delta_balance": 100} 3 | 1 | withdraw | 2019-08-02 09:35:33+03 | {"delta_balance": -50} 4 | 1 | credit_set | 2019-08-03 16:14:12+03 | {"credit": 50} 5 | 1 | withdraw | 2019-08-03 14:45:44+03 | {"delta_balance": -30} 6 | 1 | credit_set | 2019-08-03 16:14:12+03 | {"credit": 100} 7 | 1 | withdraw | 2019-08-03 16:15:09+03 | {"delta_balance": -50} (7 rows)

Expand for table and data setup

CREATE TABLE event ( id serial primary key , account int , type text , happened_at timestamp with time zone , data jsonb ); SET dateformat YMD ; INSERT INTO event ( id , account , type , happened_at , data ) VALUES ( 1 , 1 , 'created' , '2019-08-01 15:14:13' , '{"delta_balance": 0, "credit": 0}' ), ( 2 , 1 , 'deposited' , '2019-08-01 15:15:15' , '{"delta_balance": 100}' ), ( 3 , 1 , 'withdraw' , '2019-08-02 09:35:33' , '{"delta_balance": -50}' ), ( 4 , 1 , 'credit_set' , '2019-08-03 16:14:12' , '{"credit": 50}' ), ( 5 , 1 , 'withdraw' , '2019-08-03 14:45:44' , '{"delta_balance": -30}' ), ( 6 , 1 , 'credit_set' , '2019-08-03 16:14:12' , '{"credit": 100}' ), ( 7 , 1 , 'withdraw' , '2019-08-03 16:15:09' , '{"delta_balance": -50}' ) ;

What deposit boxes used to look like. Photo by Tim Evans

To get the current balance of an account, we sum the changes in delta_balance :

db=# SELECT account , SUM (( data -> 'delta_balance' ) :: int ) AS balance FROM event GROUP BY account ; account | balance ---------+--------- 1 | -30

The data field contains information specific to each type of event. To extract the value of delta_balance from the data column we use the arrow operator provided by PostgreSQL.

The result of the query shows that the current balance of account 1 is -30. This means the account is in overdraft. To check if this is within the allowed range, we need to compare it to the credit set for this account. The credit for account 1 was initially set to 0 when the account was created. The credit was then adjusted twice, and is currently set to 100.

To get the current state of an account, we need its aggregated balance and the latest credit that was set for it.

In Oracle there is a function called last we can be use to get the last credit_set event. A query using last might look like this:

-- Oracle SELECT account , MAX ( CASE WHEN type = 'credit_set' THEN data ELSE null END ) KEEP ( DENSE_RANK LAST ORDER BY id ) AS credit FROM event GROUP BY account ;

PostgreSQL also has a LAST_VALUE analytic function. Analytics functions cannot be used in a group by the way aggregate functions do:

db=# SELECT account , LAST_VALUE ( data ) OVER ( PARTITION BY account ORDER BY id RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS credit FROM event WHERE type = 'credit_set' GROUP BY account ; ERROR: column "event.data" must appear in the GROUP BY clause or be used in an aggregate function LINE 3: LAST_VALUE(data) OVER (

The error tells us that the data field used in the analytic function must be used in the group by. This is not really what we want. To use PostgreSQL's LAST_VALUE function, we need to remove the group by:

db=# SELECT account , LAST_VALUE ( data ) OVER ( PARTITION BY account ORDER BY id RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS credit FROM event WHERE type = 'credit_set' ; account | credit ---------+----------------- 1 | {"credit": 100} 1 | {"credit": 100}

These are not exactly the results we need. Analytic, or window functions, operate on a set of rows, and not in a group by.

PostgreSQL doesn't have a built-in function to obtain the first or last value in a group using group by.

To get the last value in a group by, we need to get creative!

The plain SQL solution is to divide and conquer. We already have a query to get the current balance of an account. If we write another query to get the credit for each account, we can join the two together and get the complete state of an account.

To get the last event for each account in PostgreSQL we can use DISTINCT ON :

db=# SELECT DISTINCT ON ( account ) account , data -> 'credit' AS credit FROM event WHERE type = 'credit_set' ORDER BY account , id DESC ; account | credit ---------+-------- 1 | 100

Great! Using DISTINCT ON we got the last credit set for each account.

DISTINCT ON I've written about the many faces of DISTINCT in PostgreSQL.

The next step is to join the two queries, and get the complete account state:

db=# SELECT a . account , a . balance , b . credit FROM ( SELECT account , SUM (( data -> 'delta_balance' ) :: int ) AS balance FROM event GROUP BY account ) AS a JOIN ( SELECT DISTINCT ON ( account ) account , data -> 'credit' AS credit FROM event WHERE type = 'credit_set' ORDER BY account , id DESC ) AS b ON a . account = b . account ; account | balance | credit ---------+---------+-------- 1 | -30 | 100

We got the expected result.

Before we move on, let's take a glance at the execution plan:

QUERY PLAN --------------------------------------------------------------------------------------------- Hash Join (cost=44.53..49.07 rows=4 width=44) Hash Cond: (event.account = b.account) -> HashAggregate (cost=25.00..27.00 rows=200 width=12) Group Key: event.account -> Seq Scan on event (cost=0.00..17.50 rows=750 width=36) -> Hash (cost=19.49..19.49 rows=4 width=36) -> Subquery Scan on b (cost=19.43..19.49 rows=4 width=36) -> Unique (cost=19.43..19.45 rows=4 width=40) -> Sort (cost=19.43..19.44 rows=4 width=40) Sort Key: event_1.account, event_1.id DESC -> Seq Scan on event event_1 (cost=0.00..19.39 rows=4 width=40) Filter: (type = 'credit_set'::text)

The event table is being scanned twice, once for each subquery. The DISTINCT ON subquery also requires a sort by account and id . The two subqueries are then joined using a hash-join.

PostgreSQL is unable to combine the two subqueries into a single scan of the table. If the event table is very large, performing two full table scans, and a sort and a hash join, might become slow and consume a lot of memory.

common table expression It's tempting to use common table expression (CTE) to make the query more readable. But, CTE's are currently optimization fences, and using it here will most definitely prevent PostgreSQL from performing any optimization that involves both subqueries.

Using good ol' SQL got us the answer, but it took two passes on the table. We can do better with the following trick:

db#=> SELECT account , SUM (( data -> 'delta_balance' ) :: int ) AS balance , ( MAX ( ARRAY [ id , ( data -> 'credit' ) :: int ]) FILTER ( WHERE type = 'credit_set' ))[ 2 ] AS credit FROM event GROUP BY account ; account | balance | credit ---------+---------+-------- 1 | -30 | 100

This is so much simpler than the previous question, so let's break it down.

To understand what exactly is going on here, we first need to understand how PostgreSQL compares arrays:

Array comparisons compare the array contents element-by-element, using the default B-tree comparison function for the element data type.

When comparing arrays, PostgreSQL will go element by element and compare the values according to their type. To demonstrate:

db=# SELECT greatest ( ARRAY [ 1 , 200 ], ARRAY [ 2 , 100 ]); greatest ---------- {2,100}

The first element of the second array (2) is larger than the first element of the first array (1), so it's the greatest.

db=# SELECT greatest ( ARRAY [ 1 , 200 ], ARRAY [ 1 , 201 ]); greatest ---------- {1,201} (1 row)

The first elements of both arrays are equal (1), so PostgreSQL moves on to the next element. In this case, the second element of the second array (201) is the greatest.

Using this feature of PostgreSQL, we construct an array where the first element is the value to sort by, and the second element is the value we want to keep. In our case, we want to get the credit by the max id :

MAX ( ARRAY [ id , ( data -> 'credit' ):: int ])

Not all events set credit, so we need to restrict the result to credit_set events:

MAX ( ARRAY [ id , ( data -> 'credit' ):: int ]) FILTER ( WHERE type = 'credit_set' )

The result of this expression is an array:

db#=> SELECT account , MAX ( ARRAY [ id , ( data -> 'credit' ) :: int ]) FILTER ( WHERE type = 'credit_set' )) FROM event GROUP BY account ; account | max ---------+--------- 1 | {6,100}

We only want the second element, the value of credit :

( MAX ( ARRAY [ id , ( data -> 'credit' ):: int ]) FILTER ( WHERE type = 'credit_set' ))[ 2 ]

And this is it! This way we can get the last credit set for each account.

Next, let's examine the execution plan:

QUERY PLAN --------------------------------------------------------------- HashAggregate (cost=32.50..34.50 rows=200 width=16) Group Key: account -> Seq Scan on event (cost=0.00..17.50 rows=750 width=72)

Simple plan for a simple query!

The main benefit of this approach is that it only needs one pass of the table and no sort.

This approach is very useful, but it has some restrictions:

caveat All elements must be of the same type.

PostgreSQL does not support arrays with different types of elements. As an example, if we wanted to get the last credit set by date, and not by id:

db=# SELECT account , SUM (( data -> 'delta_balance' ) :: int ) AS balance , ( MAX ( ARRAY [ happened_at , ( data -> 'credit' ) :: int ] ) FILTER ( WHERE type = 'credit_set' ))[ 2 ] AS credit FROM event GROUP BY account ; ERROR: ARRAY types timestamp with time zone and integer cannot be matched LINE 4: (MAX(ARRAY[happened_at, (data-> 'credit' ) :: int ]) FILTER ...

PostgreSQL tells us that an array cannot contain both timestamps and integers.

We can overcome this restriction in some cases by casting one of the elements:

db=# SELECT account , SUM (( data -> 'delta_balance' ) :: int ) AS balance , ( MAX ( ARRAY [ EXTRACT ( 'EPOCH' FROM happened_at ), ( data -> 'credit' ) :: int ] ) FILTER ( WHERE type = 'credit_set' ))[ 2 ] AS credit FROM event GROUP BY account ; account | balance | credit ---------+---------+-------- 1 | -30 | 100

We converted the timestamp to epoch, which is the number of seconds since 1970. Once both elements are of the same type, we can use the array trick.

caveat Query might consume a little more memory.

This one if a bit of a stretch, but as we demonstrated in the past, large group and sort keys consume more memory in joins and sorts. Using the array trick, the group key is an array, which is a bit larger than the plain fields we usually sort by.

Also, the array trick can be used to "piggyback" more than one value:

( MAX ( ARRAY [ id , ( data -> 'credit' ):: int , EXTRACT ( 'EPOCH' FROM happened_at ) ]) FILTER ( WHERE type = 'credit_set' ))[ 2 :]

This query will return both the last credit set, and the date in which it was set. The entire array is used for sorting, so the more values we put in the array, the larger the group key gets.

The array trick is very useful, and can significantly simplify complicated queries and improve performance. We use it to produce reports on time series data, and to generate read models from event tables.

Call out to my readers I'm pretty sure using user-defined aggregate function in PostgreSQL it should be possible to create a function with the signature MAX_BY(key, value) . I haven't had time to dig deep into custom aggregate functions, but if any of the readers do, please share your implementation and i'll be happy to post it here.

UPDATED: Aug 17, 2019

In the few days following the publication of this article, I received several suggestions and comments from readers. This is a summary of the comments I received, and my thoughts on them.

One commenter on Reddit suggested using ARRAY_AGG :

db=# SELECT account , (( ARRAY_AGG ( data ORDER BY happened_at DESC ) FILTER ( WHERE type = 'credit_set' ))[ 1 ] -> 'credit' ) :: int AS credit FROM event GROUP BY account ; account | credit ---------+-------- 1 | 50

This approach obviously works, and it doesn't require the key and the value to be of the same type, which is a big limitation of the array trick.

The downside to this approach is that it requires a sort which might become expensive with very large data sets:

QUERY PLAN ------------------------------------------------------------------ GroupAggregate ( cost = 1.17..1.36 rows = 7 width = 8 ) Group Key : account -> Sort ( cost = 1.17..1.19 rows = 7 width = 76 ) Sort Key : account -> Seq Scan on event ( cost = 0.00..1.07 rows = 7 width = 76 )

Another commenter on Reddit suggested using window functions in combination with DISTINCT ON . This is the original suggestion:

SELECT DISTINCT ON ( account ) account , FIRST_VALUE (( data -> 'credit' ):: int ) OVER w , LAST_VALUE (( data -> 'credit' ):: int ) OVER w , SUM (( data -> 'credit' ):: int ) OVER w FROM event WHERE type = 'credit_set' WINDOW w AS ( PARTITION BY account ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING )

The query uses both DISTINCT ON and window functions. It works by calculating the aggregates using the window function on the entire set (all rows of the account), and then fetch the first or last row using DISTINCT ON .

To make the window functions behave like a "group by" and calculate the aggregates on the entire set, the bound is defined as BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING , meaning "for the entire partition".

To avoid repeating the window for every aggregate, the query uses a WINDOW clause to define a named window that can be used multiple times in the query.

This query however, is not really working because the where clause is restricted to events with type credit_set . To get the complete status of the account, we also need to aggregate the balance of all events.

To actually make this approach work, we need to make the following adjustments:

db=# SELECT DISTINCT ON ( account ) account , LAST_VALUE (( data -> 'credit' ) :: int ) OVER ( PARTITION BY account ORDER BY ( CASE WHEN type = 'credit_set' THEN happened_at ELSE null END ) ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) as credit , SUM ( COALESCE (( data -> 'delta_balance' ) :: int , 0 )) OVER ( PARTITION BY account ) AS balance FROM event ; account | credit | balance ---------+--------+--------- 1 | 100 | -30

What changes did we make:

We had to ditch the where clause so all events are processed.

We also had to do some "creative sorting" to get the last_credit set event.

set event. We removed the named window because it was no longer reused.

The plan also gotten more complicated:

Unique ( cost = 1.19..1.55 rows = 7 width = 24 ) -> WindowAgg ( cost = 1.19..1.54 rows = 7 width = 24 ) -> WindowAgg ( cost = 1.19..1.38 rows = 7 width = 48 ) -> Sort ( cost = 1.19..1.20 rows = 7 width = 44 ) Sort Key : account , ( CASE WHEN ( type = 'credit_set' :: text ) THEN happened_at ELSE NULL :: timestamp with time zone END ) NULLS FIRST -> Seq Scan on event ( cost = 0.00..1.09 rows = 7 width = 44 )

Two sorts, and several aggregates. The bottom line, in my opinion, is that this approach is harder to maintain and it yields a significantly more complicated plan. I wouldn't use it in this case. However, like the previous approach, it is not restricted by the type of the key and value.

In response to my tweet, a reader pointed me to an old wiki page with an implementation of two custom aggregate functions FIRST and LAST .

After creating the custom aggregates in the database as instructed in the wiki page, the query can look like this:

SELECT account , SUM ( COALESCE (( data ->> 'delta_balance' ):: int , 0 )) AS balance , LAST (( data ->> 'credit' ):: int ) FILTER ( WHERE type = 'credit_set' ) AS credit FROM event GROUP BY account ;

The main issue I found with this approach is that the order seems to be arbitrary. First and last can only be defined in the context of some order. I couldn't find a way to provide a field to sort by, so I consider this approach flawed for this use case.

As I suspected, this use case is ideal for custom aggregates and extensions, and indeed, another reader on Reddit pointed me the extension "first_last". The API is roughly similar to the custom aggregate above, but is also offers a way to sort the results so the first and last are not arbitrary.

I did not install the extension, but the query should look something like that: