Why nesting is so cool

Daniel Mintz, Chief Data Evangelist

When you're setting up a data warehouse, one of the key questions is how to structure your data. And one of the trickiest types of data to deal with is hierarchically structured data (like orders and the order_items that make them up; or the pageloads that make up sessions ).

Do you completely normalize the data into a snowflake schema? Completely denormalize it into a very wide table with lots of repeated values? Or do something in the middle like a star schema?

Historically, that's a decision you had to make as you were designing your warehouse. And it always involved trade offs. Denormalization (one big wide table with lots of repeated values) takes up more space. But by eliminating joins, it's faster for some queries. Normalization, on the other hand, avoids repeated values and yields more compact tables. So it's more space efficient and makes other types of queries faster.

But some of the newest warehousing technologies, like Google BigQuery, offer a third option: don't decide, nest.

To explain what this means and why it's so powerful, let's imagine a (really) simple dataset describing orders and the items that make up each:

Order 1, 2017-01-25: {2 Apples @ $0.50 each, 3 Bananas @ $0.25 each} Order 2, 2017-01-26: {4 Carrots @ $0.20 each, 1 Date @ $0.60 each, 1 Endive @ $2 each} Order 3, 2017-01-27: {2 Fennel @ $1 each, 2 Bananas @ $0.25 each}

One way to structure this data would be in a snowflake schema:

This makes counting orders (SELECT COUNT(*) FROM orders) really cheap since it doesn't need to worry about any of the dimensions of those orders. But getting revenue by day requires two joins (and joins are usually expensive):

SELECT order_date, SUM(unit_price * quantity) FROM orders JOIN order_items USING(order_id) JOIN items USING(product_id) GROUP BY 1 ORDER BY 1

Alternatively, you could completely denormalize the data into one big, wide table:

That makes the revenue query dead simple and very cheap:

SELECT order_date, SUM(quantity * unit_price) FROM orders_denorm GROUP BY 1 ORDER BY 1

But now to get a count of orders, you have to do:

SELECT COUNT(DISTINCT order_id) FROM orders_denorm

That scans 7 rows instead of 3; plus DISTINCT is expensive. (And yes, when your table is 7 rows it’s not a big deal, but what if it’s 7 billion?)

So neither is ideal. Really, what you'd like is not to choose your schema until the last moment, when you know what kind of query you're running and can choose the optimal schema. And that's what nesting does.

So here's what the same data looks like nested in BigQuery:

Basically, you've nested a whole table into each row of the orders table. Another way to think about it is that your table contains a pre-constructed one-to-many join. The join is there if you need it, at no additional cost, but when you don't need it, you don't have to use it.

Now, when you just want to know how many orders there were, you don't have to scan repeated data or use DISTINCT . You just do:

SELECT COUNT(*) FROM orders

But when you do want to know more about the items in the orders, you simply UNNEST the order_items fields. This is far less computationally expensive than a join, because the data for each row is already colocated. That means the unnesting can be performed in parallel (whereas joining, even when joining on a distribution key, must be performed serially).

The SQL for unnesting can be a bit strange at first, but you quickly get the hang of it. And luckily, Looker can write it for you anyway. Here's how you'd find out the total revenue for the last three days:

SELECT order_date, SUM(order_items.quantity * order_items.unit_price) FROM orders LEFT JOIN UNNEST(orders.order_items) as order_items GROUP BY 1 ORDER BY 1

(You'll note that even though there appears to be a LEFT JOIN in the query, since you're just joining a nested column that's colocated with the row it's joining, computationally this join is almost costless.)

To sum up, when you're dealing with data that is naturally nested, leaving it nested until query time gives you the best of both worlds: great performance no matter what level of the hierarchy you're interested in.