While BigQuery is often the perfect tool for doing data science and machine learning with your Google Analytics data, it can sometimes be frustrating to query basic web analytics metrics. In this post, I’ll walk through calculating some fundamental metrics at the page level by replicating the All Pages report for the Google Merchandise Store in BigQuery.

I will be using Google’s sample dataset for the Google Merchandise store for August 1, 2016. You can compare this data to data in Google Analytics by viewing the Google Analytics demo account. I will be using the standard SQL dialect.

Word of warning: Although this post discusses the All Pages report – probably the most popular report in Google Analytics – it is not for the faint of heart. We will discuss a lot of technical details surrounding Google Analytics definitions, the SQL language, and the format of the BigQuery data. If you have not worked in BigQuery before, you may want to start off with a gentler introduction.

Primary Dimension – Page

In BigQuery the page dimension, or URL, is stored in the field hits.page.pagePath. Recall that Google Analytics data is stored at the session level. To access hit level information, we will need to unnest our table by hits. The query below shows all URLs that were visited by users.

SELECT hits.page.pagePath FROM 'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA, UNNEST(GA.hits) AS hits GROUP BY hits.page.pagePath

Pageviews

Next, we want to add in the number of pageviews associated with each URL. If you look at the Google Analytics schema, the only pageview related field you will see is totals.pageviews. Use extreme caution when using this field! It is a session level field (it’s counting the total number of pageviews in the entire session) and it should only be mixed with hit level fields with extreme care. This metric is not useful for replicating the All Pages report. Instead, we will need to calculate this metric ourselves.

So, what is a pageview and how do we calculate it? We often think of pageviews as the number of times that a page was loaded in the browser. In Google Analytics, this is measured by the number of times we send a “pageview” hit to Google Analytics. (Recall there are other types of hits that can be sent to Google Analytics, such as event, social, timing, and transaction hits.)

To calculate pageviews in Google Analytics, we need to count the number of times a hit of type PAGE is associated with each URL. As above, we will use the field hits.page.pagePath to identify the URL. Then filter the results by setting the hits.type field equal to ‘PAGE’ and count up the hits.

SELECT hits.page.pagePath, COUNT(*) AS pageviews FROM 'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA, UNNEST(GA.hits) AS hits WHERE hits.type = 'PAGE' GROUP BY hits.page.pagePath ORDER BY pageviews DESC

Unique Pageviews

In Google Analytics, unique pageviews represent the number of sessions during which a page was viewed. Once again, we want to focus on the hits.type equals ‘PAGE’, but rather than count the hits, now we want a count distinct on the number of sessions.

To count unique sessions, we need a unique way of identifying a session. The visitId field looks promising for this. However, this identifier has some issues when a session ends a midnight, so I recommend using visitStartTime instead. Note that the visitStartTime will only be unique per user, so we need to concatenate the visitStartTime with the fullVisitorId to get a globally unique session ID.

Using this session identifier, we can now compute unique pageviews.

SELECT pagepath, COUNT(*) AS pageviews, COUNT(DISTINCT session_id) AS unique_pageviews FROM ( SELECT hits.page.pagePath, CONCAT(fullVisitorId, CAST(visitStartTime AS STRING)) AS session_id FROM 'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA, UNNEST(GA.hits) AS hits WHERE hits.type = 'PAGE') GROUP BY pagePath ORDER BY pageviews DESC

Average Time on Page

The formula for average time on page is: Total Time on Page / (Pageviews - Exits) .

We will need to compute each of the components of this formula separately, then compute the average using the formula above.

Pageviews

Easy – we have already calculated this above.

SELECT hits.page.pagePath, COUNT(*) AS pageviews FROM 'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA, UNNEST(GA.hits) AS hits WHERE hits.type = 'PAGE' GROUP BY hits.page.pagePath ORDER BY pageviews DESC

Exits

Fortunately, exit pages are designated with the hits.isExit field. We just need to count these up to get total exits on the page.

SELECT pagePath, SUM(exits) AS exits FROM ( SELECT hits.page.pagePath, CASE WHEN hits.isExit IS NOT NULL THEN 1 ELSE 0 END AS exits FROM 'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA, UNNEST(GA.hits) AS hits) GROUP BY pagePath ORDER BY exits DESC

Total Time on Page

This one is tricky. Recall that Google Analytics calculates the time on page by comparing the timestamps of hits sent in on different pages. For non-exit pageviews, the time on page is computed by subtracting the timestamp of the next pageview hit from the timestamp of the current pageview hit. For any pageview that is an exit, we compute the time on page as the difference between the timestamp of the last interaction hit minus the timestamp of the pageview hit.

If a page is an exit page and it’s the only page of a session, and there are no interaction events, then this page is considered a bounce and does not contribute to total time on page. Put more simply, bounces do not affect the avg time on page metric.

The timestamp information for each hit can be found in the hits.time field. Note that this field is measured in milliseconds, so you will need to divide the final time by 1,000 to get back to seconds. For each pageviews, we need to compute two additional columns – the timestamp for the next page as well as the timestamp for the last interaction hit.

We can compute the timestamp for the next pageview (within that session) by using a lead window function.

SELECT fullVisitorId, visitStartTime, pagePath, hit_time, LEAD(hit_time) OVER (PARTITION BY fullVisitorId, visitStartTime ORDER BY hit_time) AS next_pageview FROM ( SELECT fullVisitorId, visitStartTime, hits.page.pagePath, hits.time / 1000 AS hit_time FROM 'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA, UNNEST(GA.hits) AS hits WHERE hits.type = 'PAGE')

We can create a new column with the timestamp for the last interaction hit by finding the maximum timestamp out of all of the interaction events. Interaction hits are designated by the field hits.isInteraction.

SELECT fullVisitorId, visitStartTime, hits.page.pagePath, MAX(IF(hits.isInteraction IS NOT NULL, hits.time, 0)) OVER (PARTITION BY fullVisitorId, visitStartTime) as last_interaction FROM 'bigquery-public-data.google_analytics_sample.ga_sessions_20160801', UNNEST(hits) AS hits WHERE hits.type = 'PAGE'

Be careful when combining these two queries. Filtering in the where clause is a little tricky because we need to include events to get a last interaction hit value. Therefore, we need to be a little patient before calculating the next_pageview timestamp.

SELECT fullVisitorId, visitStartTime, pagePath, hit_time, type, isExit, last_interaction, LEAD(hit_time) OVER (PARTITION BY fullVisitorId, visitStartTime ORDER BY hit_time) AS next_pageview FROM ( SELECT fullVisitorId, visitStartTime, pagePath, hit_time, type, isExit, last_interaction FROM ( SELECT fullVisitorId, visitStartTime, hits.page.pagePath, hits.type, hits.isExit, hits.time / 1000 AS hit_time, MAX(IF(hits.isInteraction IS NOT NULL, hits.time / 1000, 0)) OVER (PARTITION BY fullVisitorId, visitStartTime) AS last_interaction FROM 'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA, UNNEST(GA.hits) AS hits) WHERE type = 'PAGE')

Now we can compute time on page by comparing these two columns to the hits.time of the pageview.

SELECT fullVisitorId, visitStartTime, pagePath, hit_time, type, isExit, CASE WHEN isExit IS NOT NULL THEN last_interaction - hit_time ELSE next_pageview - hit_time END AS time_on_page FROM ( SELECT fullVisitorId, visitStartTime, pagePath, hit_time, type, isExit, last_interaction, LEAD(hit_time) OVER (PARTITION BY fullVisitorId, visitStartTime ORDER BY hit_time) AS next_pageview FROM ( SELECT fullVisitorId, visitStartTime, pagePath, hit_time, type, isExit, last_interaction FROM ( SELECT fullVisitorId, visitStartTime, hits.page.pagePath, hits.type, hits.isExit, hits.time / 1000 AS hit_time, MAX(IF(hits.isInteraction IS NOT NULL, hits.time / 1000, 0)) OVER (PARTITION BY fullVisitorId, visitStartTime) AS last_interaction FROM 'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA, UNNEST(GA.hits) AS hits) WHERE type = 'PAGE'))

Put It All Together

Finally, we are ready to aggregate and compute the average time on page.

SELECT pagePath, pageviews, exits, total_time_on_page, CASE WHEN pageviews = exits THEN 0 ELSE total_time_on_page / (pageviews - exits) END AS avg_time_on_page FROM ( SELECT pagePath, COUNT(*) AS pageviews, SUM(IF(isExit IS NOT NULL, 1, 0)) AS exits, SUM(time_on_page) AS total_time_on_page FROM ( SELECT fullVisitorId, visitStartTime, pagePath, hit_time, type, isExit, CASE WHEN isExit IS NOT NULL THEN last_interaction - hit_time ELSE next_pageview - hit_time END AS time_on_page FROM ( SELECT fullVisitorId, visitStartTime, pagePath, hit_time, type, isExit, last_interaction, LEAD(hit_time) OVER (PARTITION BY fullVisitorId, visitStartTime ORDER BY hit_time) AS next_pageview FROM ( SELECT fullVisitorId, visitStartTime, pagePath, hit_time, type, isExit, last_interaction FROM ( SELECT fullVisitorId, visitStartTime, hits.page.pagePath, hits.type, hits.isExit, hits.time / 1000 AS hit_time, MAX(IF(hits.isInteraction IS NOT NULL, hits.time / 1000, 0)) OVER (PARTITION BY fullVisitorId, visitStartTime) AS last_interaction FROM 'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA, UNNEST(GA.hits) AS hits) WHERE type = 'PAGE'))) GROUP BY pagePath) ORDER BY pageviews DESC

Entrances

Fortunately, entrances by page are easy to compute. There is a field called hits.isEntrance that we can use to determine whether that pageview is an entrance.

SELECT pagePath, SUM(entrances) AS entrances FROM ( SELECT hits.page.pagePath, CASE WHEN hits.isEntrance IS NOT NULL THEN 1 ELSE 0 END AS entrances FROM 'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA, UNNEST(GA.hits) AS hits) GROUP BY pagePath ORDER BY entrances DESC

Bounce Rate

Bounce rate is another complicated field in BigQuery. The formula for bounce rate is: Bounces / Sessions . We will need to compute bounces and sessions separately.

Bounces

Bounces are attributed to the first interaction hit in a session in which there is exactly one interaction event. We can determine if there was exactly one interaction event in the session by using the totals.bounces field. Now we just need to find the first interaction hit in the session.

The hits.hitNumber field will work well for this task. We will just need to use a window function to identify the hit number for the first interaction hit in the session. (Note that we can determine if a hit is an interaction hit by using the hits.isInteraction field.)

SELECT fullVisitorId, visitStartTime, pagePath, CASE WHEN hitNumber = first_interaction THEN bounces ELSE 0 END AS bounces FROM ( SELECT fullVisitorId, visitStartTime, hits.page.pagePath, totals.bounces, hits.hitNumber, MIN(IF(hits.isInteraction IS NOT NULL, hits.hitNumber, 0)) OVER (PARTITION BY fullVisitorId, visitStartTime) AS first_interaction FROM 'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA, UNNEST(GA.hits) AS hits)

Sessions

Sessions are attributed to the first hit (interaction or not) in a session where there is at least one interaction event. (Keep in mind that sessions are closely related to but not the same as entrances). Fortunately, we can use the totals.sessions field to identify whether there is at least one interaction event. Now we just need to identify the first hit in the session.

To find the first hit in the session, it is very tempting to just check if hits.hitNumber equals 1. Unfortunately, like visitId, this field does not restart at midnight. So, we will need to use a window function to identify the first hit number in the session. We can then use this column to compute the number of sessions associated with each page.

SELECT fullVisitorId, visitStartTime, pagePath, CASE WHEN hitNumber = first_hit THEN visits ELSE 0 END AS sessions FROM ( SELECT fullVisitorId, visitStartTime, hits.page.pagePath, totals.visits, hits.hitNumber, MIN(hits.hitNumber) OVER (PARTITION BY fullVisitorId, visitStartTime) AS first_hit FROM 'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA, UNNEST(GA.hits) AS hits)

Aggregate and Combine

Now we just need to aggregate bounces and sessions then divide.

select pagePath, bounces, sessions, CASE WHEN sessions = 0 THEN 0 ELSE bounces / sessions END AS bounce_rate from ( SELECT pagePath, SUM(bounces) AS bounces, SUM(sessions) AS sessions FROM ( SELECT fullVisitorId, visitStartTime, pagePath, CASE WHEN hitNumber = first_interaction THEN bounces ELSE 0 END AS bounces, CASE WHEN hitNumber = first_hit THEN visits ELSE 0 END AS sessions FROM ( SELECT fullVisitorId, visitStartTime, hits.page.pagePath, totals.bounces, totals.visits, hits.hitNumber, MIN(IF(hits.isInteraction IS NOT NULL, hits.hitNumber, 0)) OVER (PARTITION BY fullVisitorId, visitStartTime) AS first_interaction, MIN(hits.hitNumber) OVER (PARTITION BY fullVisitorId, visitStartTime) AS first_hit FROM 'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA, UNNEST(GA.hits) AS hits)) GROUP BY pagePath) ORDER BY sessions DESC

% Exit

The formula for % Exit is: Exits / Pageviews .

We have already calculated pageviews and exits above. Now we just need to combine these into a single query.

SELECT pagePath, pageviews, exits, CASE WHEN pageviews = 0 THEN 0 ELSE exits / pageviews END AS exit_rate FROM ( SELECT pagepath, COUNT(*) AS pageviews, SUM(exits) AS exits FROM ( SELECT hits.page.pagePath, CASE WHEN hits.isExit IS NOT NULL THEN 1 ELSE 0 END AS exits FROM 'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA, UNNEST(GA.hits) AS hits WHERE hits.type = 'PAGE') GROUP BY pagePath) ORDER BY pageviews DESC

Page Value

Calculating page value is beyond the scope of this post. In general, I would recommend using the Google Analytics Core Reporting API to pull the page value or create your own custom content scoring method.

Accessing standard Google Analytics metrics through BigQuery can be more painful and time-consuming than using the Core Reporting API. However, BigQuery can give you the power to add more advanced reporting and analysis to the standard Google Analytics reports, and adding in those standard metrics can help provide context to your analysis.