Onwards!

Specifying a data schema that matches your data access patterns is important for many database solutions, and BigQuery is no exception. BigQuery is a column store and works best when the data are: 1) denormalised and 2) stored in a single table.

Storing all the data in a single table might seem troublesome, but BigQuery’s column types include arrays (lists) and structs (objects). This enables you to include data you would otherwise store in auxiliary tables. In doing so, you avoid the need for JOIN operations on (potentially) billions of rows.

Another best practice is to cluster your data (I strongly recommend reading Felipe Hoffa’s post [1] on it). Functionally transparent, data clustering causes BigQuery to store related rows in close proximity. This can have profound beneficial effects on performance, and thus on costs [1]. So let’s use that as well.

The resulting table creation query would look something like this. The data will be partitioned by date and clustered by channel (shop type) because Bob typically likes his statistics by channel and then by user — know your data access patterns.

CREATE TABLE `bobs_knob_shops.sessions` (
  channel_id STRING,
  visitor_id STRING,
  timezone_offset INT64,
  start_time TIMESTAMP,
  end_time TIMESTAMP,
  hit_count INT64,
  hits ARRAY<STRUCT<
    timestamp TIMESTAMP,
    url STRING,
    referrer_url STRING
  >>
)
PARTITION BY DATE(start_time)
CLUSTER BY channel_id, visitor_id
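As a quick sanity check on those choices, here is a hypothetical query in Bob’s preferred access pattern (the date range and results are made up): statistics per channel, then per visitor. Filtering on the partition column and grouping on the clustered columns lets BigQuery skip unrelated storage blocks, which keeps the scanned bytes, and thus the costs, down.

-- Monthly hit counts per channel and visitor (hypothetical date range).
SELECT
  channel_id,
  visitor_id,
  SUM(hit_count) AS hits
FROM `bobs_knob_shops.sessions`
WHERE DATE(start_time) BETWEEN '2025-01-01' AND '2025-01-31'
GROUP BY channel_id, visitor_id
ORDER BY channel_id, hits DESC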

Marvellous. And now we hit our first snag. Below is an example event message, and if you take a step back, you might spot the problem. (There is nothing wrong with this data or its format.)
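Something along these lines (the values are invented, but the fields are exactly what we will store):

{
  "channel_id": "web",
  "visitor_id": "4a7c-11c3-b0b5",
  "timestamp": "2025-01-01T23:59:42Z",
  "timezone_offset": -300,
  "url": "https://shop.example/knobs/brass-art-deco",
  "referrer_url": "https://shop.example/knobs"
}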

The issue is that we receive single events, whereas we designed our rows to be complete sessions. In practice, this means that rather than appending a new row, we would need to update part of an existing one (unless the event is the first in its session). And updating rows, while possible, is not something BigQuery was designed for. Updating rows is a rather expensive operation; it runs through all the bytes in the table. To add insult to injury, you cannot run multiple updates on the same table in parallel.
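To make that concrete, here is a sketch of the UPDATE each incoming event would trigger; all values are invented.

-- Append one hit to an open session. Every such statement scans the
-- table, and no two of them can run on the table in parallel.
UPDATE `bobs_knob_shops.sessions`
SET
  hits = ARRAY_CONCAT(hits, [STRUCT(
    TIMESTAMP '2025-01-01 23:59:42 UTC' AS timestamp,
    'https://shop.example/knobs/brass-art-deco' AS url,
    'https://shop.example/knobs' AS referrer_url)]),
  hit_count = hit_count + 1,
  end_time = TIMESTAMP '2025-01-01 23:59:42 UTC'
WHERE
  channel_id = 'web'
  AND visitor_id = '4a7c-11c3-b0b5'
  AND DATE(start_time) = '2025-01-01'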

Getting Back On Track

We could use an additional technology, like Apache Beam, to aggregate events into sessions and trigger a write to BigQuery once a session has concluded. For our use case, however, this is not needed. We will use an unaggregated, flat table for our events and ship them over to the final sessions table every so often.

This ties in nicely with a closing note in Felipe Hoffa’s post [1]. He mentions that inserting data at multiple times during a day will weaken the clustering. So using a single table would never have been a good idea (and yes, that would have been an issue for the Beam job as well).

Hence, an additional table it is, and we will copy its contents over every night. We will use the UTC time zone, the default on Google Cloud, to determine midnight. It is best to be explicit about time zones, but this way the punishment for an accidental omission is less severe (if present at all). We will skip clustering and partitioning; we do not expect to query this table a lot, and it will contain at most a single day of data.

CREATE TABLE `bobs_knob_shops.events` (
  channel_id STRING,
  visitor_id STRING,
  timestamp TIMESTAMP,
  timezone_offset INT64,
  url STRING,
  referrer_url STRING
)
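The nightly copy job could then be little more than a scheduled query along these lines; a sketch, assuming it runs just after UTC midnight and naively rolls each visitor’s day (per channel) into one session.

-- Aggregate the daily table's events into sessions and append them.
-- The SELECT list matches the column order of the sessions table.
INSERT INTO `bobs_knob_shops.sessions`
SELECT
  channel_id,
  visitor_id,
  -- Assumption: the offset is stable within a visitor's day.
  ANY_VALUE(timezone_offset) AS timezone_offset,
  MIN(timestamp) AS start_time,
  MAX(timestamp) AS end_time,
  COUNT(*) AS hit_count,
  ARRAY_AGG(
    STRUCT(timestamp, url, referrer_url)
    ORDER BY timestamp
  ) AS hits
FROM `bobs_knob_shops.events`
GROUP BY channel_id, visitor_id

After a successful run, the daily table would be emptied (for instance with a TRUNCATE TABLE statement), taking care not to drop events that arrived after midnight.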

Don’t you love late night online knob shopping sprees? Alice sure does. Charles not so much, but he lives in a wildly different time zone. You might have already guessed it: a job that simply copies over all the data in the daily table would leave you with broken sessions.