I’ve thoroughly enjoyed writing short (and sometimes a bit longer) bite-sized tips for my #GTMTips topic. With the advent of Google Analytics: App + Web and particularly the opportunity to access raw data through BigQuery, I thought it was a good time to get started on a new tip topic: #BigQueryTips.

For Universal Analytics, getting access to the BigQuery export with Google Analytics 360 has been one of the major selling points for the expensive platform. The hybrid approach of getting access to raw data that has nevertheless been annotated with Google Analytics’ sessionization schema and any integrations (e.g. Google Ads) the user might have enabled is a powerful thing indeed.

With Google Analytics: App + Web, the platform is moving away from the data model that has existed since the days of Urchin, and is instead converging with Firebase Analytics, to which we have already had access with native Android and iOS applications.

Fortunately, BigQuery export for App + Web properties comes at no additional platform costs - you only pay for storage and queries just as you would if creating a BigQuery project by yourself in Google’s cloud platform.

So, with data flowing into BigQuery, I thought it time to start writing about how to build queries against this columnar data store. One of the reasons is because I want to challenge myself, but the other reason is that there just isn’t that much information available online when it comes to using SQL with Firebase Analytics’ BigQuery exports.

To get things started with this new topic, I’ve enlisted the help of my favorite SQL wizard in #measure, Pawel Kapuscinski. He’s never short of a solution when tricky BigQuery questions pop up in Measure Slack, so I thought it would be great to get him to contribute with some of his favorite tips for how to approach querying BigQuery data with SQL.

Hopefully, #BigQueryTips will expand to more articles in the future. There certainly is a lot of ground to cover!

Getting started

First of all, you’ll naturally need a Google Analytics: App + Web property. Here are some guides for getting started:

Next, you need to enable the BigQuery export for the new property, and for that you should follow this guide.

Once you have the export up-and-running, it’s a good time to take a moment to learn and/or get a refresher on how SQL works. For that, there is no other tutorial online that comes even close to the free, interactive SQL tutorial at Mode Analytics.

And once you’ve learned the difference between LEFT JOIN and CROSS JOIN , you can take a look at some of the sample data sets for iOS Firebase Analytics and Android Firebase Analytics. Play around with them, trying to figure out just how much querying a typical relational database differs from accessing data stored in columnar structure as BigQuery does.

At this point, you should have your BigQuery table collecting daily data dumps from the App + Web tags firing on your site, so let’s work with Pawel and introduce some pretty useful BigQuery SQL queries to get started on that data analysis path!

Tip #1: CASE and GROUP BY

Our first tip covers two extremely useful SQL statements: CASE and GROUP BY . Use these to aggregate and group your data!

CASE

CASE statements are similar to the if...else statements used in other languages. The condition is introduced with the WHEN keyword, and the first WHEN condition that matches will have its THEN value returned as the value for that column.

You can use the ELSE keyword at the end to specify a default value. If ELSE is not defined and no conditions are met, the column gets the value null .

SOURCE TABLE

user age 12345 15 23456 54

SQL

SELECT user , age, CASE WHEN age >= 90 THEN "90+" WHEN age >= 50 THEN "50-89" WHEN age >= 20 THEN "20-49" ELSE "0-19" END AS age_bucket FROM some_table

QUERY RESULT

user age age_bucket 12345 15 0-19 23456 54 50-89

The CASE statement is useful for quick transformations and for aggregating the data based on simple conditions.

GROUP BY

GROUP BY is required every time you want to summarize your data. For example, when you do calculations with COUNT (to return the number of instances) or SUM (to return the sum of instances), you need to indicate a column to group these calculations by (unless you’re only retrieving the calculated column) . GROUP BY is thus most often used with aggregate functions such as COUNT , MAX , ANY_VALUE , SUM , and AVG . It’s also used with some string functions such as STRING_AGG when aggregating multiple rows into a single string.

SOURCE TABLE

user mobile_device_model 12345 iPhone 5 12345 Nokia 3310 23456 iPhone 7

SQL

SELECT user , COUNT (mobile_device_model) AS device_count FROM table GROUP BY 1

QUERY RESULT

user device_count 12345 2 23456 1

In the query above, we assume that a single user can have more than one device associated with them in the table that is being queried. As we do a COUNT of all the devices for a given user, we need to group the results by the user column for the query to work.

USE CASE: Device category distribution across users

Let’s keep the theme of users and devices alive for a moment.

Users and events are the key metrics for App + Web. This is fundamentally different to the session-centric approach of Google Analytics (even though there are reverberations of “sessions” in App + Web, too). It’s much closer to a true hit stream model than before.

However, event counts alone tell as nothing without us drilling into what kind of event happened.

In this first tip, we’ll learn to calculate the number of users per device category. As the concept of “User” is still bound to a unique browser client ID, if the same person visited the website on two different browser instances or devices, they would be counted as two users.

This is what the query looks like:

SELECT CASE WHEN device.category = "desktop" THEN "desktop" WHEN device.category = "tablet" AND app_info.id IS NULL THEN "tablet-web" WHEN device.category = "mobile" AND app_info.id IS NULL THEN "mobile-web" WHEN device.category = "tablet" AND app_info.id IS NOT NULL THEN "tablet-app" WHEN device.category = "mobile" AND app_info.id IS NOT NULL THEN "mobile-app" END AS device, COUNT ( DISTINCT user_pseudo_id) AS users FROM `dataset.analytics_accountId.events_2*` GROUP BY 1

The query itself is simple, but it makes use of the two statements covered in this chapter effectively. CASE is used to segment “mobile” and “tablet” users further into web and app groups (something you’ll find useful once you start collecting data from both websites and mobile apps), and GROUP BY displays the count of unique device IDs per device category.

Tip #2: DISTINCT and HAVING

The next two keywords we’ll cover are HAVING and DISTINCT . The first is great for filtering results based on aggregated values. The latter is used to deduplicate results to avoid calculating the same result multiple times.

DISTINCT

The DISTINCT keyword is used to deduplicate results.

SOURCE TABLE

user device_category session_id 12345 desktop abc123 12345 desktop def234 12345 tablet efg345 23456 mobile fgh456

SQL

SELECT user , COUNT ( DISTINCT device_category) AS device_category_count FROM table GROUP BY 1

QUERY RESULT

user device_category_count 12345 2 23456 1

For example, if the user had three sessions with device categories desktop , desktop , and tablet , then a query for COUNT(DISTINCT device.category) would return 2 , as there are just two instances of distinct device categories.

HAVING

The HAVING clause can be used at the end of the query, after all calculations have been done, to filter the results. All the rows selected for the query are still processed, even if the HAVING statement strips out some of them from the result table.

SOURCE TABLE

user device_category session_id 12345 desktop abc123 12345 desktop def234 12345 tablet efg345 23456 mobile fgh456

SQL

SELECT user , COUNT ( DISTINCT device_category) AS device_category_count FROM table GROUP BY 1 HAVING device_category_count > 1

QUERY RESULT

user device_category_count 12345 2

It’s similar to WHERE (see next chapter), but unlike WHERE which is used to filter the actual records that are processed by the query, HAVING is used to filter on aggregated values. In the query above, device_category_count is an aggregated count of all the distinct device categories found in the data set.

USE CASE: Distinct device categories per user

Since the very name, App + Web, implies cross-device measurement, exploring some ways of grouping and filtering data based on users with more than one device in use seems fruitful.

The query is similar to the one in the previous chapter, but this time instead of grouping by device category, we’re grouping by user and counting the number of unique device categories each user has visited the site with. We’re filtering the data to only include users with more than one device category in the data set.

This is an exploratory query. You can then extend it to actually detect different models instead of just using device category. Device category is a fickle dimension to use, as in the example dataset used for this article, many times a device labelled as “Apple iPhone” was actually counted as a Desktop device.

See this article by Craig Sullivan to understand how messed up device attribution in web analytics actually is.

SELECT user_pseudo_id, COUNT ( DISTINCT device.category) AS used_devices_count, STRING_AGG( DISTINCT device.category) AS distinct_devices, STRING_AGG(device.category) AS devices FROM `dataset.analytics_accountId.events_2*` GROUP BY 1 HAVING used_devices_count > 1

As a bonus, you can see how STRING_AGG can be used to concatenate multiple values into a single column. This is useful for identifying patterns that emerge across multiple rows of data!

Tip #3: WHERE

WHERE

If you want to filter the records against which you’ll run your query, WHERE is your best friend. As it filters the records that are processed, it’s also a great way to reduce the (performance and monetary) cost of your queries.

SOURCE TABLE

user session_id landing_page_path 12345 abc123 /home/ 12345 bcd234 /purchase/ 23456 cde345 /home/ 34567 def456 /contact-us/

SQL

SELECT * FROM table WHERE user = '12345'

QUERY RESULT

user session_id landing_page_path 12345 abc123 /home/ 12345 bcd234 /purchase/

The WHERE clause is used to filter the rows against which the rest of the query is made. It is introduced directly after the FROM statement, and it reads as “return all the rows in the FROM table that match the condition of the WHERE clause”.

Do note that WHERE can’t be used with aggregate values. So if you’ve done any calculations with the rows returned from the table, you need to use HAVING instead.

Due to this, WHERE is less expensive in terms of query processing than HAVING .

USE CASE: Engaged users

As mentioned before, App + Web is event-driven. With Firebase Analytics, Google introduced a number of automatically collected, pre-defined events to help users gather useful data from their apps and websites without having to flex a single muscle.

One such event is user_engagement. This is fired when the user has engaged with the website or app for a specified period of time.

Since this is readily available as a custom event, we can create a simple query that uses the WHERE clause to return only those records where the user was engaged.

The key to WHERE is that it’s executed after the FROM clause (which specifies the table to be queried). If a row doesn’t match the condition in WHERE , it won’t be used to match the rest of the query against.

SELECT event_date, COUNT ( DISTINCT user_pseudo_id) AS engaged_users FROM `dataset.analytics_accountId.events_20190922` WHERE event_name = "user_engagement" GROUP BY 1

See how we’re including data from just one date? We’re also only including DISTINCT users, so if a user was engaged more than once during the day we’re only counting them once.

Tip #4: SUBQUERIES and ANALYTIC functions

A subquery in SQL means a query within a query. They can emerge in multiple different places, such as within SELECT clauses, FROM clauses, and WHERE clauses.

Analytic functions let you do calculations that cover other rows than just the one that is being processed. It’s similar to aggregations such as SUM and AVG except it doesn’t result in rows getting grouped into a single result. That’s why analytic functions are perfectly suited for things like running totals/averages or, in analytics contexts, determining session boundaries by looking at first and last timestamps, for example.

SUBQUERY

The point of a subquery is to run calculations on table data, and return the result of those calculations to the enclosing query. This lets you logically organize your queries, and it makes it possible to do calculations on calculations, which would not be possible in a single query.

SOURCE TABLE

user event_name 12345 session_start 23456 user_engagement 34567 user_engagement 45678 link_click

SQL

SELECT COUNT ( DISTINCT user ) AS all_users, ( SELECT COUNT ( DISTINCT user ) FROM table WHERE event_name = 'user_engagement' ) AS engaged_users FROM table

QUERY RESULT

all_users engaged_users 4 2

In this example, the subquery is within the SELECT statement, meaning the subquery result is bundled into a single column of the main query.

Here, the engaged_users column retrieves the count of all distinct user IDs from the table, where these users had an event named user_engagement collected at any time.

The main query then combines this with a count of all distinct user IDs without any restrictions, and thus you get both counts in the same table.

You couldn’t have achieved with just the main query, since the WHERE statement applies to all SELECT columns in the table. That’s why we needed the subquery - we had to use a WHERE statement that only applied to the engaged_users column.

ANALYTIC FUNCTIONS

Analytic functions can be really difficult to understand, since with SQL you’re used to going over the table row-by-row, and comparing the query against each row one at a time.

With an analytic function, you stretch this logic a bit. The query still goes over the source table row-by-row, but this time you can reference other rows when doing calculations.

SOURCE TABLE

user event_name event_timestamp 12345 click 1001 12345 click 1002 23456 session_start 1012 23456 user_engagement 1009 34567 click 1000

SQL

SELECT event_name, user , event_timestamp FROM ( SELECT user , event_name, event_timestamp, RANK() OVER (PARTITION BY event_name ORDER BY event_timestamp) AS rank FROM table ) WHERE rank = 1

QUERY RESULT

event_name user event_timestamp click 34567 1000 user_engagement 23456 1009 session_start 23456 1012

In this query, we take each event and see which user sent the first such event in the table.

For each row in the table, the RANK() OVER (PARTITION BY event_name ORDER BY event_timestamp) AS rank is run. The partition is basically a “reference” table with all the event names matching the current row’s event name, and their respective timestamps.

These timestamps are then ordered in ascending order (within the partition). The RANK() OVER part of the function returns the current event name’s rank in this table.

To walk through an example, let’s say the query engine encounters the first row of the table. The RANK() OVER (PARTITION BY event_name ORDER BY event_timestamp) creates the reference table that looks like this for the first row:

user event_name event_timestamp rank 12345 click 1000 1 12345 click 1001 2 12345 click 1002 3

The query then checks how the current row matches against this partition, and returns 2 as the first row in the table was rank 2 of its partition.

This partition is ephemeral - it’s only used to calculate the result of the analytic function.

For the purposes of this exercise, this analytic function is furthermore done in a subquery, so that the main query can filter the result using a WHERE for just those items that had rank 1 (first timestamp of any given event).

I recommend checking out the first paragraphs in this document - it explains how the partitioning works.

USE CASE: First interaction per event

Let’s extend the example from above into the App + Web dataset.

We’ll create a list of client IDs together with the event name, the timestamp, and the timestamp converted to a readable (date) format.

SELECT user_pseudo_id, event_name, event_timestamp, DATETIME(TIMESTAMP_MICROS(event_timestamp), "Europe/Helsinki" ) AS date FROM ( SELECT user_pseudo_id, event_name, event_timestamp, RANK() OVER (PARTITION BY event_name ORDER BY event_timestamp) AS rank FROM `dataset.analytics_accountId.events_20190922` ) WHERE rank = 1 ORDER BY event_timestamp

The logic is exactly the same as in the introduction to analytic functions above. The only difference is how we use the DATETIME and TIMESTAMP_MICROS to turn the UNIX timestamp (stored in BigQuery) into a readable date format.

Don’t worry - analytic functions in particular are a tough concept to understand. Play around with the different analytic functions to get an idea of how they work with the source idea.

There are quite a few articles and videos online that explain the concept further, so I recommend checking out the web for more information. We will also return to analytic functions many, many times in future #BigQueryTips posts.

Tip #5: UNNEST and CROSS JOIN

The dataset exported by App + Web does not end up in a relational database where we could use the JOIN key to quickly pull in extra information about pages, sessions, and users.

Instead, BigQuery arranges data in nested and repeated fields - that’s how it can have a single row represent all the hits of a session (as in the Google Analytics dataset).

The problem with this approach is that it’s not too intuitive to access these nested values.

Enter UNNEST , particularly when coupled with CROSS JOIN .

UNNEST means that the nested structure is actually expanded so that each item within can be joined with the rest of the columns in the row. This results in the single row becoming multiple rows, where each row corresponds to one value in the nested structure. This column-to-rows is achieved with a CROSS JOIN , where every item in the unnested structure is joined with each column in the rest of the table.

UNNEST and CROSS JOIN

The two keywords go intrinsically together, so we’ll treat them as such.

SOURCE TABLE

timestamp event params.key params.value 1000 click time

type

device 10

right-click

iPhone

SQL

SELECT timestamp , event, event_params. key AS event_params_key FROM table AS t CROSS JOIN UNNEST (t.params) AS event_params

QUERY RESULT

timestamp event event_params_key 1000 click time 1000 click type 1000 click device

See what happened? The nested structure within params was unnested so that each item was treated as a separate row in its own column. Then, this unnested structure was cross-joined with the table. A CROSS JOIN combines every row from table with every row from the unnested structure.

This is how you end up with a structure that you can then use for calculations in your main query.

It’s a bit complicated since you always need to do the UNNEST - CROSS JOIN exercise, but once you’ve done it a few times it should become second nature.

In fact, there’s a shorthand for writing the CROSS JOIN that might make things easier to read (or not). You can replace the CROSS JOIN statement with a comma. This is what the example query would look like when thus modified.

SELECT timestamp , event, event_params. key AS event_params_key FROM table AS t, UNNEST (t.params) AS event_params

USE CASE: Pageviews and Unique Pageviews per user

Let’s look at a query for getting the count of pageviews and unique pageviews per user.

SELECT event_date, event_params.value.string_value as page_url, COUNT (*) AS pageviews, COUNT ( DISTINCT user_pseudo_id) AS unique_pageviews FROM `dataset.analytics_accountId.events_20190922` AS t, UNNEST (t.event_params) AS event_params WHERE event_params. key = "page_location" AND event_name = "page_view" GROUP BY 1 , 2 ORDER BY 3 DESC

In the FROM statement we define the table as usual, and we give it an alias t (good practice, as it helps keep the queries lean).

The next step is to UNNEST one of those nested fields, and then CROSS JOIN it with the rest of our table. The WHERE clause is used to make sure only rows that have the page_location key and the page_view event type are included in the final table.

Finally, we GROUP BY the event date and page URL, and then we sort everything by the count of pageviews in descending order.

Conceptually, the UNNEST and CROSS JOIN create a new table that has a crazy number of rows, as every row in the original table is multiplied by the number of rows in the event_params nested structure. The WHERE clause is thus your friend in making sure you only analyze the data that answers the question you have.

Tip #6: WITH…AS and LEFT JOIN

The final keywords we’ll go through are WITH...AS and LEFT JOIN . The first lets you create a type of subquery that you can reference in your other queries as a separate table. The second lets you join two datasets together while preserving rows that do not match against the join condition.

WITH…AS

WITH...AS is very similar conceptually to a subquery. It allows you to build a table expression that can then be used by your queries. The main difference to a subquery is that the table gets an alias, and you can then use this alias wherever you want to refer to the table’s data. A subquery would need to be repeated in all places where you want to reference it, so WITH makes custom queries more accessible.

SOURCE TABLE

user event_name event_timestamp 12345 click 1000 12345 user_engagement 1005 12345 click 1006 23456 session_start 1002 34567 session_start 1000 34567 click 1002

SQL

WITH users_who_clicked AS ( SELECT DISTINCT user FROM table WHERE event_name = 'click' ) SELECT * FROM users_who_clicked

QUERY RESULT

user 12345 34567

It’s not the most mind-blowing of examples, but it should illustrate how to use table aliases. After the WITH...AS clause, you can then reference the table in e.g. FROM statements and in joins.

See the next example where we make this more interesting.

LEFT JOIN

A LEFT JOIN is one of the most useful data joins because it allows you to account for unmatched rows as well.

A LEFT JOIN takes all rows in the first table (“left” table), and the rows that match a specific join criterion in the second table (“right” table). For all the rows that did not have a match, the first table is populated with a null value. Rows in the second table that do not have a match in the first table are discarded.

SOURCE TABLE

user event_name event_timestamp 12345 click 1000 12345 user_engagement 1005 12345 click 1006 23456 session_start 1002 34567 session_start 1000 34567 click 1002

SQL

WITH users_who_clicked AS ( SELECT DISTINCT user FROM table WHERE event_name = 'click' ) SELECT t1. user , CASE WHEN t2. user IS NOT NULL THEN "true" ELSE "false" END AS user_did_click FROM table as t1 LEFT JOIN users_who_clicked as t2 ON t1. user = t2. user GROUP BY 1 , 2

QUERY RESULT

user user_did_click 12345 true 23456 false 34567 true

With the LEFT JOIN , we take all the user IDs in the main table, and then we take all the rows in the users_who_clicked table (that we created in the WITH...AS tutorial above). For all the rows where the user ID from the main table has a match in the users_who_clicked table, we populate a new column named user_did_click with the value true.

For all the user IDs that do not have a match in the users_who_clicked table, the value false is used instead.

USE CASE: Simple segmenting

Let’s put a bunch of things we’ve now learned together, and replicate some simple segmenting functionality from Google Analytics in BigQuery.

WITH engaged_users_table AS ( SELECT DISTINCT event_date, user_pseudo_id FROM `dataset.analytics_accountId.events_20190922` WHERE EVENT_NAME = "user_engagement" ), pageviews_table AS ( SELECT DISTINCT event_date, event_params.value.string_value AS page_url, user_pseudo_id FROM `simoahava-com.analytics_206575074.events_20190922` AS t, UNNEST (t.event_params) AS event_params WHERE event_params. key = "page_location" AND event_name = "page_view" ) SELECT t1.event_date, page_url, COUNTIF(t2.user_pseudo_id IS NOT NULL ) AS engaged_users_visited, COUNTIF(t2.user_pseudo_id IS NULL ) AS not_engaged_users_visited FROM pageviews_table AS t1 LEFT JOIN engaged_users_table AS t2 ON t1.user_pseudo_id = t2.user_pseudo_id GROUP BY 1 , 2 ORDER BY 3 DESC

Let’s chop this down to pieces again.

In the query above, we have two table aliases created. The first one named engaged_users_table should be familiar - it’s the query we built in tip #3.

The second one named pageviews_table should ring a bell as well. We built it in tip #5.

We create these as table aliases so that we can use them in the subsequent join.

Now, let’s look at the rest of the query:

SELECT t1.event_date, page_url, COUNTIF(t2.user_pseudo_id IS NOT NULL ) AS engaged_users_visited, COUNTIF(t2.user_pseudo_id IS NULL ) AS not_engaged_users_visited FROM pageviews_table AS t1 LEFT JOIN engaged_users_table AS t2 ON t1.user_pseudo_id = t2.user_pseudo_id GROUP BY 1 , 2 ORDER BY 3 DESC

Focus on the part after the FROM clause. Here, we take the two tables, and we LEFT JOIN them. The pageviews_table is the “left” table, and engaged_users is the “right” table.

LEFT JOIN takes all the rows in the “left” table and the matching rows from the “right” table. The matching criterion is established in the ON clause. Thus, a match is made if the user_pseudo_id is the same between the two tables.

To illustrate, this is what a simple LEFT JOIN without any calculations or aggregations would look like:

Here you can see that on rows 1, 9, and 10 there was no match for the user_pseudo_id in the engaged_users table. This means that the user who dispatched this pageview was not considered “engaged” in the date range selected for analysis.

If you’re wondering what COUNTIF does, then check out these two statements detail:

COUNTIF( t2.user_pseudo_id IS NOT NULL ) AS engaged_users_visited, COUNTIF( t2.user_pseudo_id IS NULL ) AS not_engaged_users_visited

The first one increments the engaged_users_visited counter by one for all the rows where the user visited the page AND had a row in the engaged_users table.

The second one increments the not_engaged_users_visited counter by one for all the rows where the user visited the page and DID NOT have a row in the engaged_users table.

We can do this because the LEFT JOIN leaves the t2.* columns null in case the user ID from the pageviews_table was not found in the engaged_users_table .

SUMMARY

Phew!

That wasn’t a quick foray into BigQuery after all.

However, if you already possess a rudimentary understanding of SQL (and if you don’t, just take a look at the Mode Analytics tutorial), then the most complex things you need to wrap your head around are the ways in which BigQuery leverages its unique columnar format.

It’s quite different to a relational database, where joins are the most powerful tool in your kit for running analyses against different data structure.

In BigQuery, you need to understand the nested structures and how to UNNEST them. This, for many, is a difficult concept to grasp, but once you consider how UNNEST and CROSS JOIN simply populate the table as if it had been a flat structure all along, it should help you build the queries you want.

Analytic functions are arguably the coolest thing since Arnold Schwarzenegger played Mr. Freeze. They let you do aggregations and calculations without having to build a complicated mess of subqueries across all the rows you want to process. We’ve only explored a very simple use case here, but in subsequent articles we’ll definitely cover these in more detail.

Finally, huge thanks to Pawel for providing the SQL examples and walkthroughs. Simo contributed some enhancements to the queries, but his contribution is mainly editorial. Thus, direct all criticism and suggestions for improvement to him, and him alone.

If you’re an analyst working with Google Analytics and thinking about what you should do to improve your skill set, we recommend learning SQL and then going to town on either any one of the copious BigQuery sample datasets out there, or on your very own App + Web export.

Let us know in the comments if questions arose from this guide!