I have been working with Google BigQuery for the past three years, and this blog is to share the things I gathered along the way. One of my most important learnings has been that there are many ways to write a SQL query on BigQuery, but knowing what works best in terms of cost and time efficiency will help everyone write efficient queries.

Before I share four effective ways to write efficient queries on BigQuery, let me share why I feel there is a need to consider this:

BigQuery is a serverless, highly-scalable, and cost-effective cloud data warehouse with an in-memory BI Engine and built-in machine learning. BigQuery runs blazing-fast SQL queries on gigabytes to petabytes of data and simplifies joining public or commercial datasets with your data.

On BigQuery (BQ), on-demand queries are charged based on the number of bytes read. The pricing of BQ is usage-based, i.e., you pay as you use the BQ resources which could be the amount of data you store, query, and insert by streaming. Since you have practically unlimited resources, it is easy to spend a lot if you are not cautious enough.

Here are four optimizations that will help you save cost and time of your queries:

1. Using clustering with partitioned tables.

Clustering can improve the performance of queries that aggregate data or use filter clauses. Clustering is useful in the following cases: a) when you work with partitioned tables, and b) when your queries commonly filter or aggregate on specific columns.

Query without clustered table

Query with clustered table

As you can see, this table is clustered on the sqldate and actor1geo_fullname fields, bringing the bytes read down from 185 GB to 437 MB.
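To make the pattern concrete, here is a minimal sketch of creating and querying a clustered, partitioned table. The table and column names (`mydataset.events_clustered`, `event_date`, `country`, `actor_name`) are illustrative, not the ones from the screenshots above:

```sql
-- Create a table partitioned by date and clustered on the columns
-- most often used in filters (all names here are hypothetical).
CREATE TABLE mydataset.events_clustered
PARTITION BY event_date
CLUSTER BY country, actor_name AS
SELECT event_date, country, actor_name, mentions
FROM mydataset.events_raw;

-- Filtering on the partitioning and clustering columns lets BigQuery
-- prune partitions and blocks, so far fewer bytes are scanned.
SELECT actor_name, SUM(mentions) AS total_mentions
FROM mydataset.events_clustered
WHERE event_date BETWEEN '2021-01-01' AND '2021-01-31'
  AND country = 'US'
GROUP BY actor_name;
```

The bytes-read estimate in the BigQuery console (before you run the query) is the easiest way to see the effect of clustering on your own tables.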

2. Using REGEXP_CONTAINS instead of UPPER() and LOWER() functions.

Since each character needs to be mapped individually, LOWER() and UPPER() do not perform well on Unicode text. Instead, use REGEXP_CONTAINS() and add the case-insensitivity flag (?i) to your regex to do a case-insensitive search. Multiple search values on the same field can also be combined into a single regex.

Query with upper() function

Query with regexp_contains()

As the processing times of the two queries show, there is a difference of about 4 seconds. The gap will be even greater when you are working on a larger dataset.
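The rewrite looks like the sketch below; the table and column names (`mydataset.articles`, `title`) are hypothetical:

```sql
-- Case-insensitive search with UPPER(): every character of every row
-- must be transformed before the comparison.
SELECT COUNT(*) AS matches
FROM mydataset.articles
WHERE UPPER(title) LIKE '%BIGQUERY%'
   OR UPPER(title) LIKE '%CLOUD%';

-- Same search with REGEXP_CONTAINS: (?i) makes the pattern
-- case-insensitive, and both search terms fit in a single regex.
SELECT COUNT(*) AS matches
FROM mydataset.articles
WHERE REGEXP_CONTAINS(title, r'(?i)(bigquery|cloud)');
```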

3. Using ARRAY_AGG() instead of ROW_NUMBER().

The ROW_NUMBER() function sometimes fails with a “Resources exceeded” error as data volume grows, because there are too many elements to ORDER BY within a single partition. By using ARRAY_AGG() with an ORDER BY and LIMIT 1 inside the aggregation, you can avoid this problem, as BigQuery is allowed to drop everything except the top record in each GROUP BY group.

Query using row_number()

Query using ARRAY_AGG()

In this case, you can see the query rewritten using ARRAY_AGG(). Even though it might run a little slower, it will not fail with the “Resources exceeded” error, which can be a bit frustrating.
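A common use of this pattern is deduplication, i.e., keeping only the latest row per key. A sketch of both versions, assuming a hypothetical `mydataset.events` table with `user_id` and `created_at` columns:

```sql
-- ROW_NUMBER(): every row in each partition must be ordered, which can
-- exhaust memory when a single user_id has very many rows.
SELECT * EXCEPT(rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
  FROM mydataset.events
)
WHERE rn = 1;

-- ARRAY_AGG(... ORDER BY ... LIMIT 1): BigQuery can discard all but the
-- top row per group while aggregating, so it degrades more gracefully.
SELECT latest.*
FROM (
  SELECT ARRAY_AGG(t ORDER BY created_at DESC LIMIT 1)[OFFSET(0)] AS latest
  FROM mydataset.events AS t
  GROUP BY user_id
);
```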

4. Using approximate functions.

In cases where you don’t need exact counts, you can use approximate functions such as APPROX_COUNT_DISTINCT(). Approximate functions yield results that are typically within about 1% of the exact number.

In this case, the query runs a few seconds faster when you use the approximate function. Again, the difference would be substantial on a larger dataset.
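As a sketch, here is the exact count next to its approximate counterpart (the `mydataset.events` table and `user_id` column are hypothetical):

```sql
-- Exact distinct count: all values must be shuffled and compared.
SELECT COUNT(DISTINCT user_id) AS exact_users
FROM mydataset.events;

-- Approximate distinct count: cheaper to compute, and typically within
-- about 1% of the exact value.
SELECT APPROX_COUNT_DISTINCT(user_id) AS approx_users
FROM mydataset.events;
```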

These are some of the best practices that will help you get the most out of BQ. Many others are listed in BigQuery’s official documentation.

Feel free to comment on my post and let me know if I missed any other optimization technique. Also, if you like my post, don’t forget to clap :)