Blog

The promise of Postgres JSON

JSON support is the most interesting new Postgres feature of the last few years. It relaxes the primary constraint of SQL databases — the rigid schema structure — by letting you store semistructured data in your tables alongside other data. It also decisively counters the NoSQL trend by giving users document store-like semantics in a proven, mature database technology.

Clearly we’re big fans. But while the getters and setters are straightforward, the functions for analysis queries are a bit counterintuitive. In this post, we’ll cover a few common patterns and how to build them out.

Joining on array values

Often, when adding many-to-one data to a table, you’ll start with a simple array. Yet inevitably an unanticipated analysis need will come up, and you’ll need to write joins using the data in the arrays.

For example, let’s say we stored each user’s device UUIDs in a JSON array, but now we want to join it to our newly-created devices table. Our users table might look like this:

We’ll start by breaking the table out into one row per users.id per device_uuid. We’ll do that with the handy json_array_elements function:

select id, json_array_elements(device_uuids) device_uuid from users

json_array_elements is a set-defining function, meaning it returns more than one row for each row it’s called on. In this case we’re going to get one row per user per device_uuid. Here are the results:

Now we have our device_uuid column! We can finally join users to devices. For example, if we wanted number of users per device platform, we could do:

select devices.platform, count(distinct users.id) from devices join ( select user_id, json_array_elements(device_uuids) device_uuid from users ) users_with_device_uuids using (device_uuid) group by 1

All we had to do was subselect our previous query and write a quick group-and-count!

Aggregating over arrays

Similarly, we might quickly store individual values in an array, but then need to aggregate those values later. For example, we might store per-user website load times in an array in the users table like so:

By storing them this way, we’ve made getting per-user average load times more annoying than if we had used a separate pageload_logs table. But, as before, we can create that table on the fly with the json_array_elements function:

select user_id, json_array_elements(load_times) load_time from users

Now we can just cast to float and compute a per-user average:

select id, avg(load_time::text::numeric) from ( select id, json_array_elements(load_times) load_time from users ) t group by 1

Note the bizarre double-typecast: ::text::numeric. JSON functions will return integers or text but never floating-point values. Fortunately you can cast to text and then to numeric.

This query gives us the results we were looking for:

Finally, we can even compute a global average the same way, just without the grouping:

select avg(load_time::text::numeric) from ( select user_id, json_array_elements(load_times) load_time from users ) t

And now we have the single average load time:

Aggregating object values

If you’re logging events from mobile devices and other clients, you may be using a JSON column for event data, with a key for each event type and a value for the count. For example, each timestamp might have an object like this:

{ pageview: 12, shopping_cart_click: 7, purchase: 3 }

Now let’s sum the total number of events we got on each day. We’ll use the set-defining function json_each_text to split out the keys and values. Let’s start by just selecting the raw keys and values:

select * from events, json_each_text(events.event_data)

Take a good look at that join:

from events, json_each_text(events.event_data)

This is a lateral cartesian join. It’s short for:

from events cross join lateral json_each_text(events.event_data)

The second half of the join references the table in the first half of the join, restricting the keys and values to the right event.

This gives us the results we expect:

Given that we have the keys and values right there in the join, we can rewrite our query with a simple group-and-count:

select date(event_time), sum(value::float) from events, json_each_text(events.event_data) group by 1

Note the cast of value to float. We chose json_each_text instead of json_each because it returns text, which can be cast directly to float.

Now we’ve got the total events per day!

Building blocks

By now, you’ve probably identified our favorite trick: split the JSON data out into rows and then aggregate. Our favorite functions for doing this are json_array_elements for arrays and json_each_text for objects. Paired with a little creative join logic, you’ll be analyzing JSON data in no time!