Part of my job is designing helper applications around a set of data in a database designed by a third-party software vendor. The vendor’s database design is … interesting, to say the least; many of the records are versioned, so a common challenge is to find the newest record for a given entity.



This isn’t always as easy as it sounds, because determining “newest” often requires looking at multiple columns. To understand, consider the following (contrived) example:

-- TABLE values
 collection | book | chapter | line | value
------------+------+---------+------+-------
       1234 |    3 |       2 |   34 | fizz
       1234 |    1 |      17 |   23 | blah
       1235 |    2 |      12 |    7 | quuz
       1235 |    7 |       8 |   44 | qaax
       1235 |    7 |       9 |    2 | fooz

In this table, we have values that are found within lines, within chapters, within books, within a collection. This might be entries in some kind of deed book or log, for example. Suppose we want to generate a table containing only the newest value – that is, the value found in the last line of the last chapter of the last book in each collection. Correct results would be:

 collection | value
------------+-------
       1234 | fizz
       1235 | fooz
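Before reaching for SQL, it may help to see that “newest” here is just lexicographic ordering on the (book, chapter, line) tuple. The following Python sketch (the data layout is a stand-in for the table above, not the vendor’s actual schema) makes the target semantics concrete:

```python
# Sample rows mirroring the table above: (collection, book, chapter, line, value)
rows = [
    (1234, 3, 2, 34, "fizz"),
    (1234, 1, 17, 23, "blah"),
    (1235, 2, 12, 7, "quuz"),
    (1235, 7, 8, 44, "qaax"),
    (1235, 7, 9, 2, "fooz"),
]

def newest_per_collection(rows):
    """Return {collection: value}, where "newest" means the greatest
    (book, chapter, line) tuple under Python's lexicographic ordering."""
    newest = {}
    for coll, book, chapter, line, value in rows:
        key = (book, chapter, line)
        if coll not in newest or key > newest[coll][0]:
            newest[coll] = (key, value)
    return {coll: v for coll, (_, v) in newest.items()}

print(newest_per_collection(rows))  # {1234: 'fizz', 1235: 'fooz'}
```

This matches the expected results table: the tuple comparison orders first by book, then chapter, then line, which is exactly what the SQL below has to reproduce.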

A first approach

At first blush, this seems simple, and you might be tempted to try something like this:

SELECT collection, value
FROM values v
WHERE book = (SELECT MAX(book) FROM values WHERE collection = v.collection)
  AND chapter = (SELECT MAX(chapter) FROM values WHERE collection = v.collection)
  AND line = (SELECT MAX(line) FROM values WHERE collection = v.collection)

The problem is, this doesn’t work; in fact, it returns no rows. Why? Consider collection 1234:

MAX(book) is 3
MAX(chapter) is 17
MAX(line) is 34

…which means we’re looking for a record with book=3, chapter=17, and line=34. No such record exists.

One possible approach to the problem is a “fractal” approach, in which each subsequent WHERE subclause needs to include all of the previous ones:

SELECT collection, value
FROM values v
WHERE book = (SELECT MAX(book) FROM values
              WHERE collection = v.collection)
  AND chapter = (SELECT MAX(chapter) FROM values
                 WHERE collection = v.collection
                   AND book = (SELECT MAX(book) FROM values
                               WHERE collection = v.collection))
  AND line = (SELECT MAX(line) FROM values
              WHERE collection = v.collection
                AND chapter = (SELECT MAX(chapter) FROM values
                               WHERE collection = v.collection
                                 AND book = (SELECT MAX(book) FROM values
                                             WHERE collection = v.collection))
                AND book = (SELECT MAX(book) FROM values
                            WHERE collection = v.collection))

YUCK! What a mess! It does work, but it’s confusing and tedious to write; can you imagine this query if there were four or five fields we needed to order by? Or if values were some complex subquery rather than a table? I’ve had to write queries like this on other database systems where this dataset has lived, and they’re some of the most hideous queries I’ve ever written. It’s bad enough with a simple example table, but in a real query situation it’s enough to make your head spin.
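Both behaviors are easier to believe after running them. Here’s a self-contained sketch using SQLite in place of Postgres (the table is renamed vals because VALUES is a reserved word in SQLite; the correlated-subquery behavior being demonstrated is the same):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE vals (collection INT, book INT, chapter INT, line INT, value TEXT);
INSERT INTO vals VALUES
  (1234, 3, 2, 34, 'fizz'), (1234, 1, 17, 23, 'blah'),
  (1235, 2, 12, 7, 'quuz'), (1235, 7, 8, 44, 'qaax'), (1235, 7, 9, 2, 'fooz');
""")

# The naive query: each MAX() is computed independently, so it demands a row
# that is simultaneously in the max book, max chapter, AND max line.
naive = db.execute("""
    SELECT collection, value FROM vals v
    WHERE book = (SELECT MAX(book) FROM vals WHERE collection = v.collection)
      AND chapter = (SELECT MAX(chapter) FROM vals WHERE collection = v.collection)
      AND line = (SELECT MAX(line) FROM vals WHERE collection = v.collection)
""").fetchall()
print(naive)  # [] -- no rows, as described above

# The "fractal" query: each subquery repeats all the constraints above it.
fractal = db.execute("""
    SELECT collection, value FROM vals v
    WHERE book = (SELECT MAX(book) FROM vals WHERE collection = v.collection)
      AND chapter = (SELECT MAX(chapter) FROM vals
                     WHERE collection = v.collection
                       AND book = (SELECT MAX(book) FROM vals
                                   WHERE collection = v.collection))
      AND line = (SELECT MAX(line) FROM vals
                  WHERE collection = v.collection
                    AND book = (SELECT MAX(book) FROM vals
                                WHERE collection = v.collection)
                    AND chapter = (SELECT MAX(chapter) FROM vals
                                   WHERE collection = v.collection
                                     AND book = (SELECT MAX(book) FROM vals
                                                 WHERE collection = v.collection)))
""").fetchall()
print(sorted(fractal))  # [(1234, 'fizz'), (1235, 'fooz')]
```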

A different approach

If you only need to find the latest value for a single collection, there’s a much easier trick:

SELECT collection, value
FROM values
WHERE collection = 1234
ORDER BY book DESC, chapter DESC, line DESC
LIMIT 1;

ORDER BY works exactly how we want it to in this case: it orders first by book, then by chapter within each book, then by line within each chapter. By ordering in descending order and just snipping off the first result, we end up with just the newest value. Unfortunately, we can’t use this approach to build a table of results for all collections. Or can we?

Enter window functions

Window functions are a little-used feature of PostgreSQL that allow you to apply aggregate functions over groups of records based on matching values in a column. In other words, if we wanted to tally salaries by department, we could SUM the salaries using a window function OVER a PARTITION defined by a department column. We can also tell the window function how to order the records within each partition, so we could (for example) tally the salaries progressively by pay grade. Or, to apply this to our values table, we could aggregate value over records grouped by collection, in which rows are ordered by book, chapter, and line. It might look like this:

SELECT DISTINCT collection,
       FIRST(value) OVER (PARTITION BY collection
                          ORDER BY book DESC, chapter DESC, line DESC)
FROM values;

What this query does is group (PARTITION) our records by collection, order each group as we specified in the OVER clause, and then run the FIRST aggregate function on the value field in each group (FIRST being an aggregate function that gives us the first value in the series). Unlike a similar GROUP BY operation, this still gives us a row for each row in the original table; but since the aggregated value will be the same within each group, a DISTINCT will reduce us down to one row per collection.
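The single-collection ORDER BY/LIMIT trick above is easy to verify before going further. A quick sketch, with SQLite standing in for Postgres (table renamed vals, since VALUES is reserved in SQLite):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE vals (collection INT, book INT, chapter INT, line INT, value TEXT);
INSERT INTO vals VALUES
  (1234, 3, 2, 34, 'fizz'), (1234, 1, 17, 23, 'blah'),
  (1235, 2, 12, 7, 'quuz'), (1235, 7, 8, 44, 'qaax'), (1235, 7, 9, 2, 'fooz');
""")

# Newest value for one collection: sort descending, keep only the first row.
row = db.execute("""
    SELECT collection, value FROM vals
    WHERE collection = 1234
    ORDER BY book DESC, chapter DESC, line DESC
    LIMIT 1
""").fetchone()
print(row)  # (1234, 'fizz')
```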
There’s only one real problem with that query: FIRST doesn’t exist (sorry, copy-pasters). As nice as it would have been, there’s no aggregate function that returns the first record in a series. Is there a way around this?

Enter the ARRAY

Array types in a database may at first just seem like a dangerous tool for thwarting normalization, but there are situations where they come in quite handy. This is one of them. PostgreSQL has an array_agg() aggregate function that, well, aggregates values into an array. Thus, the query:

SELECT collection, array_agg(value)
FROM values
GROUP BY collection;

returns:

 collection | array_agg
------------+------------------
       1234 | {fizz,blah}
       1235 | {quuz,qaax,fooz}

Now, by using a window function instead of a GROUP BY, we can control the order in which the values end up in the array:

SELECT collection,
       array_agg(value) OVER (PARTITION BY collection
                              ORDER BY book DESC, chapter DESC, line DESC)
FROM values;

This gives us:

 collection | array_agg
------------+------------------
       1234 | {fizz}
       1234 | {fizz,blah}
       1235 | {fooz}
       1235 | {fooz,qaax}
       1235 | {fooz,qaax,quuz}

Ah, now you can see that the first value in each array is the correct one. We just need to pop off that first value and add DISTINCT so that we only have one record per collection:

SELECT DISTINCT collection,
       (array_agg(value) OVER (PARTITION BY collection
                               ORDER BY book DESC, chapter DESC, line DESC))[1]
FROM values;

Notice two things here:

Unlike every other array/list/tuple/collection syntax known to programmer-kind, PostgreSQL arrays start at 1, not 0. So the first value in the array is at index 1.

We need to put the whole window expression in parentheses and attach the subscript to that, because it’s the whole expression (not just array_agg(value)) that evaluates to the array.

This query gives us the correct output, and it’s considerably more approachable than the fractal approach. Not only that, but adding more ORDER BY columns is trivial.
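The same mechanics can be exercised end-to-end in a runnable sketch. SQLite (standing in for Postgres here) has no array_agg or [1] subscript, so this approximation uses json_group_array as the array-building aggregate over the window and json_extract to pull the first element (JSON paths are 0-based, unlike Postgres arrays); the partition/order/DISTINCT structure is the same as the final query above:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE vals (collection INT, book INT, chapter INT, line INT, value TEXT);
INSERT INTO vals VALUES
  (1234, 3, 2, 34, 'fizz'), (1234, 1, 17, 23, 'blah'),
  (1235, 2, 12, 7, 'quuz'), (1235, 7, 8, 44, 'qaax'), (1235, 7, 9, 2, 'fooz');
""")

# json_group_array stands in for Postgres's array_agg, and
# json_extract(..., '$[0]') for the [1] subscript.  The growing window
# frame means the first array element is always the partition's newest value.
rows = db.execute("""
    SELECT DISTINCT collection,
           json_extract(
               json_group_array(value) OVER (
                   PARTITION BY collection
                   ORDER BY book DESC, chapter DESC, line DESC),
               '$[0]')
    FROM vals
""").fetchall()
print(sorted(rows))  # [(1234, 'fizz'), (1235, 'fooz')]
```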