Google Analytics (via reports or the API) typically deals with aggregated data, where metrics are already summed and averaged for you, and you can easily request a tabular report of (say) Sessions by Date. On the other hand, the Google Analytics BigQuery Export Schema contains a wealth of raw data.

While the raw data opens up infinite possibilities, it also means that most Google Analytics Metrics and some Dimensions are not included in the export. These pieces of information, such as ga:hasSocialSourceReferral , ga:channelGrouping , ga:daysSinceLastSession , ga:pagePathLevel1 , ga:landingPagePath , ga:contentGroupXX , and ga:previousPagePath need to be computed.

Perhaps most importantly, the goals that we’ve configured inside of Google Analytics are not stored in BigQuery and will need to be computed from scratch.

Brief Recap of the BigQuery Schema

BigQuery is a structured, table-based SQL database. In the BigQuery export, each row represents a session. Inside each session is the hit, custom dimensions, and other information about the session and hits. Below is an illustration of some of the fields within the export.

Note that the session-level custom dimensions hits are repeated within the session and how the hit-level custom dimensions are repeated within each hit; this is one of the special properties of BigQuery: repeated fields.

Also note how the custom dimensions, hits, and totals have named fields within them; this is another one of BigQuery’s special properties: nested records.

Recreating Metrics

Since no metrics are contained within BigQuery, let us first examine methods to compute them. If you’re familiar with Custom Reports or the Google Analytics API, then you’re familiar with the concept of having metrics computed for a group of dimensions. In BigQuery, and SQL at large, that concept translates into aggregates and GROUP BY .

A simple example is the number of sessions per day (metric= ga:sessions , dimension= ga:date ). We need to aggregate the sessions in the export by counting them, GROUP ed BY date .

SELECT date, SUM(totals.visits) AS sessions FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] GROUP BY date

date sessions 120130910 63

SELECT date, Filters can be performed with a WHERE clause, when filtering data in the table, or HAVING , when filtering an aggregate value. To count only sessions with transactions, we can filter on totals.transactions .

SUM(totals.visits) AS sessions FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] WHERE totals.transactions > 0 GROUP BY date

date sessions 120130910 16

HAVING is similar to a WHERE but works on aggregate values, e.g. metrics. Using the above example, we could find all days with more than 70 sessions. (Since the LondonCycleHelmet dataset is only a single day, the results are immediately useful.)

SELECT date, SUM(totals.visits) AS sessions FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] GROUP BY date HAVING sessions > 70

date sessions Zero Results returned

Now let us consider a slightly more advanced example: computing the metric ga:percentNewSessions with dimensions of ga:medium , we can aggregate the number of sessions by a count of all and a count of new sessions, GROUP ed BY trafficSource.medium .

(We, unfortunately, cannot aggregate by averaging the new sessions flag in the export, because it is set to NULL , instead of 0, if the session is not new; and aggregates ignore null values, i.e. average would return 1.)

SELECT trafficSource.medium, SUM(totals.newVisits) / SUM(totals.visits) AS percentNewSessions FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] GROUP BY trafficSource.medium

trafficSource_medium percentNewSessions referral 0.5714285714285714 organic 0.4782608695652174 cpc 0.6 (none) 0.7222222222222222

We can also compute multiple metrics at once. Additionally, we are not limited to the 7 dimensions and 10 metrics that the Google Analytics API limits us to.

SELECT CONCAT(trafficSource.source,"/",trafficSource.medium) AS sourceMedium, newSessions / sessions AS percentNewSessions, bounces / sessions AS bounceRate FROM ( SELECT trafficSource.source, trafficSource.medium, SUM(totals.visits) AS sessions, SUM(totals.newVisits) AS newSessions, SUM(totals.bounces) AS bounces, FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] GROUP BY trafficSource.source, trafficSource.medium)

sourceMedium percentNewSessions bounceRate technologysauce.com/referral 0.5714285714285714 0.14285714285714285 google/organic 0.4782608695652174 0.21739130434782608 google/cpc 0.6 0.2 (direct)/(none) 0.7222222222222222 0.1111111111111111

Interesting New Metrics

Since we can leverage all of BigQuery against our raw data, we can compute metrics that dont’t exist in Google Analytics or the API, e.g. quantiles or bucketing/binning.

Here we’ll compute the 25%ile, median, and 75%ile of ga:sessionDuration ( totals.timeOnSite in BigQuery). We’re also showing of the ability, and common usage pattern, of using a subquery to do a calculation, or convert data for usage later on in the query.

SELECT trafficSource.medium, SUM(totals.visits) AS sessions, -- 5 will give the min, 25%, 50%, 75%, max with 20% error -- the more buckets, the better the approximation (error = 1/number of buckets) -- at the cost of more computation -- QUANTILES returns all of the buckets, use NTH to extract the bucket you want NTH(2, QUANTILES(totals.timeOnSite, 5)) AS firstQuartile, NTH(3, QUANTILES(totals.timeOnSite, 5)) AS mean, NTH(3, QUANTILES(totals.timeOnSite, 5)) AS thirdQuartile FROM ( SELECT trafficSource.medium, totals.visits, -- Sessions with a single page view will have no time on site reported IF(totals.timeOnSite IS NULL, 0, totals.timeOnSite) AS totals.timeOnSite FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]) GROUP BY trafficSource.medium

trafficSource_medium sessions firstQuartile mean thirdQuartile referral 7 7 20 20 organic 23 5 18 18 cpc 15 13 23 23 (none) 18 8 18 18

Now we’ll bucket/bin our data in preparation for a histogram.

SELECT trafficSource.medium, SUM(totals.visits) AS sessions, SUM(IF(timeOnSite_bucket5sec = 0,1,0)) AS B0_4, SUM(IF(timeOnSite_bucket5sec = 1,1,0)) AS B5_9, SUM(IF(timeOnSite_bucket5sec = 2,1,0)) AS B10_14, SUM(IF(timeOnSite_bucket5sec = 3,1,0)) AS B15_19, SUM(IF(timeOnSite_bucket5sec = 4,1,0)) AS B20_24, SUM(IF(timeOnSite_bucket5sec = 5,1,0)) AS B25_29, SUM(IF(timeOnSite_bucket5sec > 5,1,0)) AS B30_ FROM ( SELECT trafficSource.medium, totals.visits, -- Sessions with a single page view will have no time on site reported FLOOR(IF(totals.timeOnSite IS NULL, 0, totals.timeOnSite) / 5) AS timeOnSite_bucket5sec FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]) GROUP BY trafficSource.medium

trafficSource_medium sessions B0_4 B5_9 B10_14 B15_19 B20_24 B25_29 B30_ referral 7 1 1 0 1 1 1 2 organic 23 5 3 2 2 2 1 8 cpc 15 3 0 1 1 3 1 6 (none) 18 2 4 1 2 1 0 8

Dimensions

Some dimensions are simple to compute, while others require more ingenuity. Often these require subqueries, like we have seen earlier.

Extracting Dimensions from Other Fields

ga:pagePathLevel1 is an example of an easy-to-extract dimension. It is the first segment of the ga:pagePath ( hits.page.pagePath in BigQuery), which we can pull out using SPLIT and NTH .

SELECT hits.page.pagePath, NTH(1, SPLIT(hits.page.pagePath, '/')) AS pagePathLevel1, NTH(2, SPLIT(hits.page.pagePath, '/')) AS pagePathLevel2 FROM -- We need to explicitly FLATTEN otherwise SPLIT will not behave as expected FLATTEN([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits.page.pagePath)

hits_page_pagePath pagePathLevel1 pagePathLevel2 /helmets/foldable.html helmets foldable.html / null null /vests/ vests null /vests/yellow.html vests yellow.html

However, we still need to bring along values we want to use later. For instance, if we wanted to compute the bounce rate per ga:pagePathLevel1 , we should bring along totals.bounces .

SELECT pagePathLevel1, ROUND(SUM(totals.bounces) / SUM(totals.visits), 2) AS bounceRate FROM ( SELECT totals.visits, -- If no session bounced, then SUM of nulls is null, not 0 IF(totals.bounces IS NULL, 0, 1) AS totals.bounces, NTH(1, SPLIT(hits.page.pagePath, '/')) AS pagePathLevel1, FROM -- We need to explicitly FLATTEN otherwise SPLIT will not behave as expected FLATTEN([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits.page.pagePath)) GROUP BY pagePathLevel1

pagePathLevel1 bounceRate helmets 0.07 null 0.09 vests 0.03 login.html 0.0

Beyond simple transformations on fields, sometimes we want to create “arbitrary” groupings. In the case of Content Groupings that can be based on data already in the export (e.g. URL or Custom Dimension), we can tease it out via a CASE statement, among other ways.

SELECT hits.page.pagePath, CASE WHEN pagePathLevel1 = 'helmets' THEN "Products" WHEN pagePathLevel1 = 'vests' THEN "Products" WHEN pagePathLevel1 IN ('basket.html', 'shipping.html', 'billing.html', 'confirm.html') THEN "Checkout" WHEN pagePathLevel1 IS NULL OR pagePathLevel1 = 'login.html' THEN "RestOfSite" ELSE 'Unknown' END AS contentGroup1, CASE WHEN pagePathLevel2 = 'foldable.html' THEN "Foldable" WHEN pagePathLevel2 IN ('orange.html', 'yellow.html') THEN "Color" END AS contentGroup2 FROM ( SELECT fullVisitorId, visitID, hits.page.pagePath, NTH(1, SPLIT(hits.page.pagePath, '/')) AS pagePathLevel1, NTH(2, SPLIT(hits.page.pagePath, '/')) AS pagePathLevel2, FROM -- We need to explicitly FLATTEN otherwise SPLIT will not behave as expected FLATTEN([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits.page.pagePath)) GROUP BY hits.page.pagePath, contentGroup1, contentGroup2

hits_page_pagePath contentGroup1 contentGroup2 /helmets/foldable.html Products Foldable / RestOfSite null /vests/ Products null /vests/yellow.html Products Color /vests/orange.html Products Color /login.html RestOfSite null /basket.html Checkout null

Slightly More Interesting Example

ga:landingPagePath is similarly easy to find using row-scoped aggregates; what that means is that for each row, we’re going to compute an aggregate based on a repeated value, in this case the hits.page.pagePath and hits.page.type . Since a session need not start with a pageview, we can’t simply grab all the hits with a hit.hitNumber equal to 1. To solve this problem, we’ll take the first hits.page.pagePath where hits.page.type is “PAGE”.

SELECT fullVisitorId, visitId, -- Uses the knowledge that hits are stored in chronological order FIRST(IF(hits.type = "PAGE", hits.page.pagePath, NULL)) WITHIN RECORD AS landingPagePath FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]

fullVisitorId visitId landingPagePath 380066991751227408 1378805776 /helmets/foldable.html 712553853382222331 1378804218 /vests/ 881288060286722202 1378803865 /helmets/ 881288060286722202 1378804975 /

More Complex Example

For a metric like ga:daysSinceLastSession that requires knowledge of more than one session, in this case, a user’s previous session, we can use windowing functions.

Windowing functions allow us to compute a value for the current row given the value of other rows in the “window” of data we’re looking at. In the following example, LAG returns the row before the row being looked at as defined by PARTITION BY fullvisitorid ORDER BY visitStartTime ASC . PARTITION BY acts much like a GROUP BY clause, in that it creates new windows for each unique value of the fields listed. ORDER BY sorts the window according to the fields and order listed.

We can read LAG(visitStartTime) OVER (PARTITION BY fullvisitorid ORDER BY visitStartTime ASC) as “looking at only the sessions for a given user, ordered by visitStartTime , what was the visitStartTime previous to the the row I’m currently on”.

SELECT fullvisitorid, visitId, DATEDIFF(SEC_TO_TIMESTAMP(visitStartTime), SEC_TO_TIMESTAMP(prevVisitStartTime)) AS daysSinceLastSession, FLOOR((visitStartTime - prevVisitStartTime)/60) as minutesSinceLastSession FROM ( SELECT fullvisitorid, visitId, visitStartTime, LAG(visitStartTime) OVER (PARTITION BY fullvisitorid ORDER BY visitStartTime ASC) AS prevVisitStartTime FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910])

fullvisitorid visitId daysSinceLastSession minutesSinceLastSession 3960256913998800485 1378805348 null null 4158255675143559999 1378818879 null null 4158255675143559999 1378819831 0 15.0

We are finding the visitStartTime of the previous session chronologically, and combining it with the visitStartTime of the current session to compute the days between the previous and current session. Above prevVisitStartTime will be NULL if there is no previous session, causing DATEDIFF return a NULL as well.

(The LondonCycleHelmet dataset is only of a single day, so we will only see NULL for the first session and 0 for the subsequent ones. That is why I also included minutesSinceLastSession to make the output a little more exciting!)

Even More Complicated Dimensions

Some dimensions, such as ga:channelGrouping and ga:hasSocialSourceReferral , require more computation that you can (or are willing to) write as a SQL statement. For this we leverage User Defined Functions, (UDFs) in BigQuery. UDFs are pieces of JavaScript that run in V8 on the same machine your data is on. For more information see my post, Self-Joins, Windowing, and User Defined Functions in BigQuery, or the BigQuery Documentation. UDFs take in a row of data, and return 0 or more rows.

In our case, we can use them to compute the Default Channel Grouping found in Google Analytics.

UDF:

function computeDefaultChannelGroup(row, output) { if( row.trafficSourceSource == '(direct)' && ( row.trafficSourceMedium == '(not set)' || row.trafficSourceMedium == '(none)') ) { output("Direct"); } else if (row.trafficSourceMedium=="organic") { output("Organic"); } else if (row.trafficSourceMedium=="referral") { output("Referral"); } else if (row.trafficSourceMedium=="email") { output("Email"); } else if ( ( row.trafficSourceMedium == "cpv" || row.trafficSourceMedium == "cpa" || row.trafficSourceMedium == "cpp" || row.trafficSourceMedium == "content-text") ) { output("Other Advertising"); } else if ( ( row.trafficSourceMedium == "cpc" || row.trafficSourceMedium == "ppc" || row.trafficSourceMedium == "paidsearch") && row.adNetworkType != "Content" ) { output("Paid Search"); } else if ( ( row.trafficSourceMedium == "social" || row.trafficSourceMedium == "social-network" || row.trafficSourceMedium == "social-media" || row.trafficSourceMedium == "sm" || row.trafficSourceMedium == "social network" || row.trafficSourceMedium == "social media") || false // Social Source Referral exactly match Yes ) { output("Social"); } else if ( ( row.trafficSourceMedium == "display" || row.trafficSourceMedium == "cpm" || row.trafficSourceMedium == "banner") || row.adNetworkType == "Content" ) { output("Display"); } else { output("Unknown"); } } function defaultChannelGroup(row, emit) { function output(channelGroup) { emit({channelGroup:channelGroup, fullVisitorId: row.fullVisitorId, visitId: row.visitId, }); } computeDefaultChannelGroup(row, output); } bigquery.defineFunction( 'defaultChannelGroup', // Name of the function exported to SQL ['trafficSourceMedium', 'trafficSourceSource', 'fullVisitorId', 'visitId','adNetworkType'], // Names of input columns [{'name': 'channelGroup', 'type': 'string'}, // Output schemac {'name': 'fullVisitorId', 'type': 'string'}, {'name': 'visitId', 'type': 'integer'}, ], defaultChannelGroup // Reference to JavaScript UDF );

SQL:

SELECT fullVisitorId, visitId, channelGroup FROM defaultChannelGroup( SELECT trafficSource.Medium AS trafficSourceMedium, trafficSource.Source AS trafficSourceSource, fullVisitorId, visitId, -- LCH is too old to have the needed field -- trafficSource.adwordsClickInfo.adNetworkType NULL AS adNetworkType FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910])

fullVisitorId visitId channelGroup 380066991751227408 1378805776 Referral 712553853382222331 1378804218 Organic 881288060286722202 1378803865 Organic 881288060286722202 1378804975 Organic 881288060286722202 1378805870 Organic 1677140157296205498 1378803386 Paid Search

Note that above doesn’t take into account social referrers, as these are not in the export. A previous post provides a list we can use to build up social refers to check against.

Content Groupings, if too complicated to do with a CASE and/or REGEXP_EXTRACT , can also be computed via UDFs.

Wrap Up

Hopefully this post helps demystify what is and isn’t contained in the BigQuery export and gives enough examples and information on to compute missing dimensions and metrics from the export. BigQuery is an extremely flexible tool that can free your analysis from the constraints of the API, even if everything the API provides isn’t immediately available.