Best practices from a Data Engineering perspective

Introduction

BigQuery is a serverless service that allows users to process huge volumes of data. Although other companies offer the same type of service (e.g. AWS Athena), BigQuery has the best user interface and the best performance for running queries in a serverless service. As many salespeople would say, it is as if BigQuery has some witchcraft under the hood and can magically aggregate huge volumes of data. I almost chuckled when I watched a video advertisement for a video-game maker product that narrated the following:

There is an automatic dungeon generator function that allows even beginners to easily make dungeons. All you have to do is decide on the style and the size and your dungeon is done. It is almost like cheating.

In summary, this perspective is not wrong: BigQuery has abstracted away a lot of the work that used to require manual effort from Data Engineers. That is not to say beginners cannot produce results without prerequisite knowledge of how the product works in the first place, but if they ever do so, most of the time it will be done in an inefficient manner.

In addition, it is good to understand the business model that BigQuery is playing in the marketplace. First of all, BigQuery is a product that, to this day, has added a lot of features and support compared to AWS Athena in both user experience and performance. For that reason, Google has to protect its proprietary technology. BigQuery currently uses Colossus as its distributed file system and Capacitor as its columnar storage format. These technologies are proprietary and cannot be replicated in any ecosystem other than BigQuery. To put it in an analogy, the iOS operating system is great software, but only devices that Apple creates can run it. In contrast, AWS Athena is more like an Android device. Athena stores database tables as files in either Parquet or ORC format in the S3 data lake. These files are not restricted to Athena: the same files can be consumed by other services, such as other AWS services like Redshift Spectrum and EMR, or by provisioning your own cloud Hadoop instance that connects to the S3 data lake. In order for other services to access tables stored in BigQuery, the tables need to be exported to Google Cloud Storage in a non-columnar format, either programmatically via the API or manually via the user interface. Very recently, BigQuery has abstracted away the programming required to import tables from Redshift and files from S3 into BigQuery via its user interface, but not the other way around. Regardless, these features do reduce the friction of trying out BigQuery, in the worst case at least as a supporting database.

In the same fashion that iPhones become more expensive the more device and cloud storage you buy, BigQuery becomes more expensive the more data your queries consume (to a big extent) and the more data is stored within BigQuery (to some extent). BigQuery pricing indicates that it costs $5 per terabyte of data queried and $20 per month per terabyte stored within BigQuery ($10 per month if said data is not edited within 90 days). Although query costs are the thing to watch out for, storage costs also matter in the long term: as data accumulates, the monthly rental cost for that storage grows. For that reason, it is good to regularly clean up tables that are no longer used and to clean and normalize existing raw data. Compared to Athena, where the files can be compressed when queried, BigQuery bills the tables uncompressed. What I found was that the same queries I run on BigQuery cost four to five times less in AWS Athena because the data is compressed. Of course, running AWS Athena feels like having a cheap Android device that is slower than the latest iPhone, but overall, if the queries are not compute intensive, there are a lot of savings to be had by querying in AWS Athena. Nonetheless, the biggest benefit of BigQuery is that it provides the best user experience of all. In contrast to configuring a data lake, which requires a lot of developers for provisioning and maintenance, BigQuery has evolved over the years to where you can do all these tasks in two to three clicks. So, does BigQuery offer value for heavy-duty work where you don't want to get your hands dirty? Yes, it does. And you can save much more on your BigQuery costs if you follow best practices from a Data Engineering perspective.

I found that BigQuery's cost best practices documentation on "Materialize query results in stages" was not comprehensive enough (if you have not read Google's best practices documentation yet, I highly recommend reading it as a complement to this article). It is much easier to explain in practice now that BigQuery supports DDL statements and scheduled queries. In this step-by-step guide, you will learn how to run efficient queries against a public session dataset that Google provides.

Foreword: Cost savings are for illustration purposes only

Actual billing costs may differ from what is presented in this article, as the dataset we are going to play around with is smaller than the minimum query billing size. Both BigQuery and AWS Athena bill a minimum of 10MB per query. The dataset we will be dealing with is the activity log of an e-commerce site between 2016-08-23 and 2016-09-22, consisting of 6MB. By finishing the step-by-step guide, the table will be reduced to 300KB. Regardless of size, due to the minimum billing charge, you will be billed for 10MB every time you query the initial 6MB of data, and still be billed for 10MB every time you query our reduced 300KB partitioned table, even when picking only the columns and date ranges we need. Given that the computed size of a query over a one-month date range is less than 10MB, our optimization will not save any actual costs here at all. However, in most actual cases, a month of activity logs on an e-commerce site will be over a gigabyte instead, due to:

The variety of dimensions to pick from: If we explore the public dataset we are playing with, we will see that a lot of the dimensions have the values null, "(not set)" and "not available in demo dataset". For this reason, there are not a lot of dimensions to explore in this dataset, reducing the columns we are able to pick and thus the size of the data we query. An actual e-commerce site will have most of the columns populated rather than omitted due to privacy policy, as in this public dataset.

High traffic website: The dataset we are working with is very low in traffic. If we run the below query, we get 1.5K-3K (median 2.5K) sessions, 8K-18K (median 14K) hits, and 60K-150K (median 105K) product impressions per day within the date ranges we will be working with. Using the median values, each hour of each day an average of 100 visitors access the website, producing 600 interactions and 4,500 product impressions. In contrast, a high traffic website usually gets more than five thousand visitors every hour, producing more than a hundred thousand interactions and a quarter million product impressions.
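The query itself was not reproduced here; below is a sketch of how those per-day figures could be computed from the public sample, assuming sessions are counted as one row per daily table, hits via the length of the hits nested field, and impressions via the isImpression flag on hits.product:

```sql
SELECT
  date,
  COUNT(*) AS sessions,
  SUM(ARRAY_LENGTH(hits)) AS hits,
  -- Correlated subquery over the nested hits/product fields of each session
  SUM((SELECT COUNTIF(p.isImpression)
       FROM UNNEST(hits) AS h, UNNEST(h.product) AS p)) AS product_impressions
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20160823' AND '20160922'
GROUP BY date
ORDER BY date
```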

The whole idea from the above is that if your query actually hits a small dataset (less than 10MB), like the one we are going to work with, it is better to extract it and place it in a relational database, as BigQuery is OLAP rather than OLTP. Relational databases can handle aggregations like the above over small date ranges. Aside from concurrency limits, the 10MB minimum billing charge most likely exists so users do not abuse the service by using BigQuery as a relational database. Please take this into consideration when optimizing queries in BigQuery.

The problem

There may be several users running the below query directly from their dashboards:
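The query is not shown in the original; below is a minimal sketch of what such a dashboard query might look like, assuming it aggregates e-commerce actions by date and country (the column aliases and the $start/$end parameter syntax are assumptions that depend on your dashboard tool):

```sql
SELECT
  PARSE_DATE('%Y%m%d', date) AS date,
  geoNetwork.country AS country,
  h.eCommerceAction.action_type AS action_type,
  COUNT(*) AS actions
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
  UNNEST(hits) AS h
-- $start and $end are substituted by the dashboard, e.g. '20160823' and '20160922'
WHERE _TABLE_SUFFIX BETWEEN $start AND $end
GROUP BY 1, 2, 3
```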

The problem with the above query is that every time a user requests the dashboard for a specific date range, the parameters $start and $end stored in the dashboard are passed to the query and a big amount of data has to be computed in BigQuery. The cost accumulates if it is queried often. For instance, if my start parameter is "20160823" and my end parameter is "20160922", I will be charged for only 6 megabytes. However, if this query is fetched 30 times per day over a 30-day date range on each request, then I will be charged for 180MB per day and 5.4GB per month.

Our goal here is to make this query at least four times less expensive than the current setup. One step is aggregating the records only once by updating a separate table incrementally at regular intervals. In the past there were many ways users could increment an existing table, mostly programmatically via the API. Now that BigQuery allows users to schedule queries and run DDL statements, we will walk through how all of this can easily be done through the query console user interface. Before you begin, make sure the account being used has the proper credentials for creating tables and scheduling queries in BigQuery. Given that our queries run on a columnar database, we will start with how to reduce the number of records by either converting dimensions into metrics or generalizing them. In most cases, we have seen that if all the steps are followed, query cost can be reduced by 20 to 40 times.

Step 1: Converting defined dimensions into metrics

An important element of columnar databases is that query cost is based on the columns selected, not the number of records. Aside from storage, this means there is no need to be selective about how many columns are stored in a table, as you are only charged for the columns you pick per request. The problem then lies only with the records, which can be controlled, in a limited way, only by partitioning, clustering, or sharding.

If you had run the previous query with the aforementioned parameters, it would have returned 21011 records. It makes more sense to reduce the records further when some of the dimensions have a defined set of values. For instance, according to the BigQuery Export Schema documentation, the column hits.eCommerceAction.action_type is limited to the values between zero and nine. If we place each value in its own column, the records are reduced further. For instance, the below query produced 9179 records instead:
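The pivoted query is likewise not reproduced; below is a sketch of how each defined action_type value could be converted into its own metric column with COUNTIF (the metric names and the subset of action types shown are illustrative):

```sql
SELECT
  PARSE_DATE('%Y%m%d', date) AS date,
  geoNetwork.country AS country,
  -- One metric column per defined action_type value
  COUNTIF(h.eCommerceAction.action_type = '2') AS product_views,
  COUNTIF(h.eCommerceAction.action_type = '3') AS add_to_carts,
  COUNTIF(h.eCommerceAction.action_type = '5') AS checkouts,
  COUNTIF(h.eCommerceAction.action_type = '6') AS purchases
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
  UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20160823' AND '20160922'
GROUP BY 1, 2
```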

Step 2: Generalising existing dimensions

The final dashboard for our problem only needs information at the region level instead of the country level. The public Google Analytics session sample provides the region of the session in the session custom dimension with an index value of four. If we replace the country dimension with the region dimension, our output reduces to 1608 records. That is thirteen times smaller than what we started with.
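A sketch of how the region could be pulled from the session-level custom dimension with index four and used in place of country, assuming the same pivoted metrics as the previous step:

```sql
SELECT
  PARSE_DATE('%Y%m%d', date) AS date,
  -- Session-level custom dimension with index 4 holds the region in this sample
  (SELECT value FROM UNNEST(customDimensions) WHERE index = 4) AS region,
  COUNTIF(h.eCommerceAction.action_type = '2') AS product_views,
  COUNTIF(h.eCommerceAction.action_type = '6') AS purchases
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
  UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20160823' AND '20160922'
GROUP BY 1, 2
```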

Step 3: Aggregating the data only once

Why?

Aggregating only once, by incrementally updating a staging table on a daily basis, will save a lot of costs if said queries used to run frequently on top of raw tables.

So far, we have reduced the number of records from 21011 to 1608, yet we still get charged six megabytes per request. What is going on? As mentioned in the introduction, BigQuery charges based on the columns and records processed, not the number of records output. If you have not noticed, the ga_sessions table is fetched with a wildcard and filtered by _table_suffix to pick only the tables we want (each table stores one specific date). This way of separating data across several tables that share the same name is called sharding. If we count the number of records that we aggregate from, we fetch 434760:
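Assuming the records are counted at the level the aggregation reads them (one row per hit), that count could be verified with something like:

```sql
-- Count the hit-level rows the aggregation actually scans
SELECT COUNT(*) AS records
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
  UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '20160823' AND '20160922'
```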

If we replace fetching our results from the raw table with a new, incrementally updated table that already contains the aggregated data we need, then when someone actually queries the results, it will compute against 0.36% (1608/434760) of the records we computed before.

There are some rules when aggregating data only once. Metrics that require counting distinct values, for instance, do not work. From a data engineering perspective, there is a general rule that metrics that need to count distinct values within a date range should not be used, as they violate the rule of aggregating data only once; other workarounds should exist for those metrics instead. If you still need to include metrics that violate this rule, then for the most efficient use of resources, create a table that aggregates the raw data on a weekly or monthly basis for only those metrics, and join it with the table that gets updated daily. The same approach applies to records whose values change over time. For instance, you can set a condition to only refresh records that were created within the last thirty days if said records change over time.

Steps to update table incrementally

A. Creating the view

For now, we will focus on records that do not change over time. In order to aggregate the data on a scheduled basis, we need to make the parameters we used previously run dynamically and save the query as a view. Technically, we would have used:

to retrieve data for the previous day, but since the public ga_sessions table is no longer updated, we go back around two to three years from the present day (if you are reading this long after it was published, the query may need to go farther back; modify the number of days subtracted in the INTERVAL expression):
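The original filters were shown as images; below is a sketch of both variants of the _TABLE_SUFFIX filter (the 1100-day offset is an arbitrary illustration of "around three years back" and will need adjusting to land inside the sample's date range):

```sql
-- Yesterday, if the dataset were still being updated:
WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))

-- Going back around three years to land inside the sample's date range:
WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1100 DAY))
```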

We save the above query as a view by clicking Save view in the query editor, in a new or existing dataset called test, with the name public_ga_sessions_summary.

B. Scheduling the view on a daily basis

1. Create a new query with the following:
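The query was not reproduced in the original; presumably it simply selects from the view saved in the previous step:

```sql
SELECT * FROM test.public_ga_sessions_summary
```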

2. Click Schedule query. For the name of the scheduled query, we will call it public_ga_sessions_summary and store it in dataset test with table name public_ga_sessions_summary_table, with write preference WRITE_TRUNCATE and partitioning field date (currently, the partitioning field in scheduled queries is only available in classic BigQuery, which can be accessed at https://bigquery.cloud.google.com/ ). Then we click Add.

What this does is create a table which updates with the sessions that existed yesterday. Then, after twenty-four hours, BigQuery will automatically update the table test.public_ga_sessions_summary_table for the following day. Besides controlling the records with the partition field date, you can schedule the query to shard the table in the same format as how we queried the google_analytics_sample.ga_sessions table. This has its own drawbacks (you cannot update multiple tables at once, and performance is slower) and benefits (the metadata of each table can differ). For more information on the other options for scheduling queries, you can check the documentation Google provides, which includes backfilling previous dates with manual scheduled runs. For our use case, we partitioned the table, so the way to filter records is by filtering on the column we partitioned by. In our case, that is the date column.

Another important thing to mention is to make sure scheduled queries run under a generic Google account, such as a department account like customer_service@company_name.com. If scheduled queries are tied to a user who later gets off-boarded, those scheduled queries will be removed and will affect users that depend on their deliverables.

C. Refreshing the table on an ad-hoc basis

There are many use cases where we need to refresh the table:

New columns: Sometimes, we want to add columns and populate values for those columns for previous dates. In addition, there is currently a big caveat in BigQuery: a table can only add new columns. Existing columns cannot be renamed or have their data type modified, and columns cannot be removed. You only get one chance when adding new columns, so be careful. Otherwise, you will be stuck with those columns.

Business logic: Some of the business logic for some columns has changed, requiring previous records to be updated.

Backfilling a date range: It was requested to fill data for previous dates that are currently missing or need to be replaced.

Our case is backfilling a date range. We used to query the past thirty days, but we only have one past day, as the query below shows:
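A sketch of such a check, assuming the summary table and its date partition column from the previous steps:

```sql
-- Show which days the summary table currently covers
SELECT date, COUNT(*) AS records
FROM test.public_ga_sessions_summary_table
GROUP BY date
ORDER BY date
```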

First, we need to create an upsert table containing the last thirty days, which will be used for our upsert statement. It is basically the same query as our view, but explicitly filling in the missing 29 days we want:
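A sketch of creating the upsert table with DDL, assuming the view's logic is repeated over a thirty-day suffix range (the column list is abbreviated; the real query carries whatever columns the view selects, and the day offsets are illustrative):

```sql
CREATE OR REPLACE TABLE test.public_ga_sessions_summary_upsert AS
SELECT
  PARSE_DATE('%Y%m%d', date) AS date,
  (SELECT value FROM UNNEST(customDimensions) WHERE index = 4) AS region,
  COUNTIF(h.eCommerceAction.action_type = '2') AS product_views
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
  UNNEST(hits) AS h
-- Same dynamic offset as the view, widened to cover the last thirty days
WHERE _TABLE_SUFFIX
  BETWEEN FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1130 DAY))
      AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1100 DAY))
GROUP BY 1, 2
```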

Then a merge (upsert) statement on our table fills in the missing records:
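A sketch of the MERGE, assuming date and region form the composite primary key and using the metric column from the earlier sketches (COALESCE guards the nullable region key, as noted below):

```sql
MERGE test.public_ga_sessions_summary_table AS t
USING test.public_ga_sessions_summary_upsert AS u
  ON t.date = u.date
 AND COALESCE(t.region, '') = COALESCE(u.region, '')  -- NULL-safe key match
WHEN MATCHED THEN
  UPDATE SET product_views = u.product_views
WHEN NOT MATCHED THEN
  INSERT ROW
```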

The above query overwrites existing records with new records when a match is found on the composite primary keys; otherwise, it inserts them as new records. One thing to note is to make sure the composite primary keys are not NULL when joined (use COALESCE if they can be). In cases where you only want to backfill selected columns (new columns, or columns whose business logic has changed) for previous dates, the upsert table only needs to contain the composite primary keys and the selected columns, and the merge statement only needs to specify the selected columns in the UPDATE SET clause, without the "when not matched then" condition.

There is a simpler way of backfilling date ranges in the case where we can remove all existing records within those date ranges. First, we run a delete statement over our date range:
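A sketch of the delete, assuming the same illustrative thirty-day window as the upsert table:

```sql
DELETE FROM test.public_ga_sessions_summary_table
WHERE date BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 1130 DAY)
               AND DATE_SUB(CURRENT_DATE(), INTERVAL 1100 DAY)
```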

And then we run an insert statement from the upsert table we created before:
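A sketch of the insert, assuming both tables share the same schema:

```sql
INSERT INTO test.public_ga_sessions_summary_table
SELECT * FROM test.public_ga_sessions_summary_upsert
```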

Result

When running the query below again, we see we have data for the previous thirty days. We also see that the total number of records is the same as before:
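A sketch of verifying the result, assuming the same summary table:

```sql
SELECT
  COUNT(*) AS records,
  MIN(date) AS first_day,
  MAX(date) AS last_day
FROM test.public_ga_sessions_summary_table
```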

The other thing to mention is that if we run our query against the final table instead of the raw table, we get charged only 300KB. Our previous query cost 6MB. That is a savings of 95%.

Also, if you do not need all the columns in a query, you will save even more. A good idea is converting raw columns to integers whenever they store numbers in string data types, or creating boolean flags out of them. The smaller the data type, the less it costs to query. NULL values are considered data-free. Picking only the dimension columns region and date and the metrics sessions and product_view costs us 50KB instead.

In addition, since we partitioned by the date column, we save even more by fetching only the date ranges we want. Picking only the date range between 2016-08-28 and 2016-08-29 reduces the query cost to 3KB.

Other optimizations to know

Some optimizations that are trickier and more challenging to implement, but good to be familiar with for future endeavors, are:

Hierarchical tables

In case you have not noticed, the table we have been working with all this time contains data in a hierarchical structure, mostly as nested sets. Hierarchical tables are a way of removing data redundancy in columnar databases that hold big data. In databases that hold a normal volume of data and lean relational, a star schema that splits the data into dimension and fact tables should suffice for efficient querying. In contrast, for a columnar database holding a lot of big data, it is not efficient to keep all the data in separate tables, as joins take more computing time under those conditions. That is where columns with nested (usually repeated) fields come in handy. For instance, if we wanted to visualize our public Google Analytics session sample with each nested field as its own separate table, it would look like the following (search "This row and nested fields" within the BigQuery Export Schema documentation, as well as any data type that is RECORD, to find out how many we can start with; for now, we limit ourselves to custom dimensions within sessions, hits, and products):

A visualization of our sessions table into separate tables

Take the above illustration as speculation, as each organization's session table may differ (some may have 1-to-1 relationships instead of 1-to-many to the above tables, or use a different composite primary key than we do here). However the session table is structured, what we can observe from the above is that if we wanted to hold all the tables in one denormalized table, we would join from one table (i.e. sessions) to another (i.e. hits) to another (i.e. product, promotion, refund, promotionActionInfo). Not only that, each join is a one-to-many relationship. After one join into another join, our record count multiplies. If we take our earlier definition of a high traffic website for a single hour, our 4 thousand sessions become 100 thousand interactions after joining sessions with hits, and 250 thousand impressions after joining products from hits. Now, if I wanted to pick only the composite primary keys of the session (fullvisitorid, visitid) from such a table, the query would have to read 250 thousand records instead of 4 thousand, as the table is now denormalized and data redundant. If we do the math, the query is now at least 50 times more expensive for selecting only those columns. If each table is instead its own nested field, we avoid that problem and can query the data as efficiently as if it were all normalized. Whenever we want to do a join, we just use the UNNEST keyword, put the nested field we want to join with between the parentheses, and we are good to go. Notice that there is no need for an "ON" clause, because the nested field is already configured with how it connects to the rest of the data within the table. By default, UNNEST joins behave as INNER joins. If you need to keep rows where the nested field is empty (a 1-to-0 relationship), you must use LEFT JOIN with UNNEST explicitly. Ultimately, avoiding joins saves BigQuery the compute resources of linking tables on the fly, while at the same time reducing data redundancy by keeping the data normalized.
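As a sketch of both join styles against one daily table of the sample (the selected columns are illustrative):

```sql
SELECT
  s.fullVisitorId,
  s.visitId,
  h.hitNumber,
  p.v2ProductName
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20160823` AS s,
  UNNEST(s.hits) AS h             -- implicit INNER join with the nested field
LEFT JOIN UNNEST(h.product) AS p  -- LEFT JOIN UNNEST keeps hits with no products
```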

The benefit of having a table in such a format shows when multiple users run various aggregations on top of it. For instance, the marketing team may want to look at the click-through rate from the promotion table (Query A), the purchasing team may want to look at the product table to see which products are viewed most (Query B), and the IT team may want to predict how much bandwidth to allocate by looking at previous interactions in the hits table (Query C). Whichever path each query takes, a denormalized table will be more expensive than a more normalized one in hierarchical format with nested fields. For illustration purposes, let us say Queries A, B, and C run each day and cost $5, $15 and $3 respectively. If they ran against a denormalized table, let us assume each would cost five times more. In that case, we save $92 per day and $2760 per month. The more queries we run against a denormalized table, the more we can save by structuring it in a hierarchical format with nested fields.

When a table is stored with nested fields, not only do we avoid spending compute resources on joins across several tables, we also save a lot of query cost by avoiding data redundancy.

In summary, it is good practice for a data practitioner to transform tables into hierarchical form when:

A lot of queries run on top of a denormalized table: As illustrated previously, several thousand dollars can be saved per month, not only on query costs but also on storage costs, as there will be less data redundancy.

A lot of fragmented tables contain redundant data and are expensive to join: Often, several imported tables share the same common columns, while joining them at the lowest granularity level is too compute intensive. If it is possible to consolidate all of those tables into one hierarchical table, we can eliminate data redundancy while enabling queries at a lower granularity level than we could manage before.

There are also multiple use cases where I have found the need to create a better version of an existing imported hierarchical table with nested fields:

Create calculated metrics that are compute and cost intensive: There may be a lot of metrics that are compute and cost intensive when used frequently each day. Imagine Queries A, B, and C each tried to fetch how many times each session added an item to the cart by going into the hits nested field. Instead of letting Queries A, B, and C compute the same metric on their own every day at the hits nested field, it is better to compute it once and store it as a summary column on the session table.

Move values placed in the wrong location: This can happen often, and it is embarrassing, but you may find that a custom dimension value that is on the hits table was supposed to be on the session table, leaving that value redundant. Moving values is one good reason to create another version of the same table to reduce data redundancy.

The challenge is that maintaining nested fields has its own caveats. For instance, currently in BigQuery, overwriting a partition of a hierarchical table is treated as a whole, including its nested fields. Say that, due to business logic changes, I wanted to overwrite only some columns of the sessions table (which consists of 5GB) for a specific date range, and not the hits table (which consists of 25GB). I could do that for 5GB if the tables were separate, but it is not possible if hits is a nested field within the session table: the overwrite would cost 30GB even though none of the fields within hits need updating. However, in most cases, the cost of maintaining the tables is negligible compared to the savings on the queries that run on top of them. For a better introduction, with examples, to the benefits of nested fields, I thoroughly recommend reading the article Why Nesting Is So Cool from the data-discovery platform Looker.

Cluster Columns

Another popular option for optimizing queries is defining cluster columns on tables. Similar to partition columns, cluster columns try to put the specific values of the said column in as few files (buckets) as possible. For that reason, a value "fits" within a bucket much more easily when it is unique enough not to show up in too many records. If such a column is often used for filtering, clustering on that column will definitely reduce cost when it appears in the filter. However, it does not work exactly the same as partition columns.

If I had to say, partition columns are first class while cluster columns are second class for saving query costs. If you partition by a specific column and filter on it, you are guaranteed the query will only retrieve the data matching the value you are filtering on, because all the files with that specific value are stored in a specific folder. This does not happen with cluster columns, which are more of an approximation. What I speculate BigQuery does under the hood is create a fixed number of files under a folder and try to group the same value into as few of those files as possible. So there may be overlap, where a specific value resides in more than one file, or where half of all values are stored in one file because the data under the folder is so small that BigQuery bucketed all the values into two files (best practices for AWS Athena indicate that each file should be at minimum 128MB, so BigQuery must also be bucketing with a minimum size for optimization purposes). Overall, with trial and error, what we have seen matters most is:

Size of the table you are clustering: If the partition for a specific date contains over 1GB and you want to cluster that partition by a field, clustering will help; but if said partition is less than 100MB, you may not see much savings with clustering, as BigQuery may set a minimum size for each bucket.

The cluster field has high cardinality (a large number of unique values): The higher the cardinality of the cluster column, the more easily each value fits into fewer buckets. With fewer buckets to scan, the cheaper our queries become.

Selecting few values to filter on within the cluster field: The more values we filter on with our cluster field, the more buckets need to be scanned, as each value may be stored in a different bucket.

Selecting few columns from the table: Especially if the table has a lot of columns, we see that the same filter against a cluster field saves proportionally more when selecting a few columns compared to selecting all the columns in the table.

Overall, clustering is very tricky, but if you know what is commonly filtered on in queries and configure it properly, it can save up to 80% in costs when it gets used.

Other best practices

Monitor logs: The database administrator should provide summary and detail table views of the query costs run by each user. The summary view should show the total cost each user accrued by month and day, and the detail view should show the cost of each query in descending order for this month, including the frequency with which the query was run this month, its first and last occurrence dates, and by which user. Regularly checking your user email in the summary view to see total costs, and where the query costs come from in the detail view, is important for gaining visibility into your daily usage.

A database administrator of the project should be able to import query activity into BigQuery by creating exports through Stackdriver Logging. The export options let you log BigQuery services into a dataset of your choice within BigQuery, with the ability to use advanced filters to pick only the relevant logs you need. We found that queries billed over 10 million bytes (10MB) were the only ones we wanted to pay attention to. This was achieved with the following advanced filter:
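The filter itself was not reproduced; below is a sketch of such a filter under the legacy (v1) BigQuery audit-log schema. The exact field paths are an assumption and may differ in your Logging version:

```
resource.type="bigquery_resource"
protoPayload.methodName="jobservice.jobcompleted"
protoPayload.serviceData.jobCompletedEvent.job.jobStatistics.totalBilledBytes>10000000
```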

After creating the export, you will see a new table within the selected dataset that stores the logs we wanted to retrieve. The next task is to create views where users can get summary and detail views of their contributing costs. Google provides several good example views that you can create from the query logs.

Mostly for experimental use: When computed dimensions and metrics are not specific to one team and can be shared across several departments, it makes sense for regular users not to compute them and instead delegate them to other departments, such as the Business Intelligence and Data Warehouse teams. It is much better if the same metrics/dimensions are not created multiple times and are instead maintained once by a single team. In most cases, we expect such projects to be ad hoc. If a project ever becomes permanent, it can still be run by users for testing/experimental purposes until the relevant departments have time to add the required dimensions/fields to the data warehouse. Last but not least, make sure scheduled queries are stopped when their deliverables are no longer in use, and that the tables created have an expiry date. For instance, running the below script immediately sets the number of partitions retained in the table we worked with before to 15 days:
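A sketch of such a script using BigQuery DDL table options (this DDL may not have existed when the original was written; the bq command-line flag --time_partitioning_expiration achieves the same result):

```sql
-- Keep only the most recent 15 daily partitions
ALTER TABLE test.public_ga_sessions_summary_table
SET OPTIONS (partition_expiration_days = 15)
```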

The following day, it will automatically remove the earliest day in that table, and our scheduled job will insert the latest day. Without setting an expiration on our tables, as long as the scheduled job runs daily, our storage costs will increase linearly.