BigQuery performance tips: Searching for text 8x faster — “LOWER() is slower” (DEPRECATED)

I was stunned when some of my BigQuery queries were taking a minute or so, instead of the usual few seconds. The culprit: Case insensitive searches. The good news: There are ways to make them way faster.

Note from the team: Stay tuned! Very soon BigQuery will turn this advice irrelevant. Note 2: Good news! BigQuery has rendered this advice obsolete — LOWER() and UPPER() behave way faster now. Will I blog about this? Stay tuned!

The problem

Processing gigabytes of data with BigQuery should be super fast. For example the following query performs a case insensitive search over 5GB of text in 18 seconds:

#standardSQL

SELECT COUNT(*) c

FROM `bigquery-public-data.hacker_news.full`

WHERE LOWER(text) LIKE '%bigquery%' # 18s

Usually BigQuery is faster than this, but the underlying problem becomes really apparent when adding new search terms. For example, when searching for 3 terms, a similar query takes almost a minute:

#standardSQL

SELECT COUNT(*) c

FROM `bigquery-public-data.hacker_news.full`

WHERE LOWER(text) LIKE '%bigquery%'

OR LOWER(text) LIKE '%big query%' # 34s #standardSQL

SELECT COUNT(*) c

FROM `bigquery-public-data.hacker_news.full`

WHERE LOWER(text) LIKE '%bigquery%'

OR LOWER(text) LIKE '%big query%'

OR LOWER(text) LIKE '%google cloud%' # 52s

How can I improve my query performance?

BigQuery performance tip: Avoid using LOWER() and UPPER()

LOWER() and UPPER() operations have a hard time when dealing with Unicode text: each character needs to be mapped individually and they can also be multi-byte.

There are 3 solutions that I covered on Stack Overflow:

Solution 1: Case insensitive regex

A faster alternative to LOWER() : Use REGEX_MATCH() and add the case insensitive (?i) modifier to your regular expression:

#standardSQL

SELECT COUNT(*) c

FROM `bigquery-public-data.hacker_news.full`

WHERE REGEXP_CONTAINS(text, '(?i)bigquery') # 7s # REGEXP_CONTAINS(text, '(?i)bigquery')

# OR REGEXP_CONTAINS(text, '(?i)big query') # 9s # REGEXP_CONTAINS(text, '(?i)bigquery')

# OR REGEXP_CONTAINS(text, '(?i)big query')

# OR REGEXP_CONTAINS(text, '(?i)google cloud') # 11s

Query execution time gets much better:

1 search term: 18s down to 8s

2 search terms: 34s down to 9s

3 search terms: 52s down to 11s.

Solution 2: Combine regexes

Why do 3 searches when a regular expression can combine them into 1?

#standardSQL

SELECT COUNT(*) c

FROM `bigquery-public-data.hacker_news.full`

WHERE REGEXP_CONTAINS(text, '(?i)(bigquery|big query|google cloud)') # 7s

3 terms in 7s — nice.

Solution 3: Transform to bytes

This is uglier, but shows that UPPER() and LOWER() perform way better when dealing with individual bytes — for equivalent results in these searches:

#standardSQL

SELECT COUNT(*) c

FROM `bigquery-public-data.hacker_news.full`

WHERE LOWER(CAST(text AS BYTES)) LIKE b'%bigquery%'

OR LOWER(CAST(text AS BYTES)) LIKE b'%big query%'

OR LOWER(CAST(text AS BYTES)) LIKE b'%google cloud%' # 7s

In closing

LOWER() is slower! Use the regex (?i) modifier instead.

Note from the team: Good news! BigQuery has rendered this advice obsolete — LOWER() and UPPER() behave way faster now. Will I blog about this? Stay tuned!

H/T to Elliott Brossard for his insider BigQuery tips.

Next steps

If this worked for you, please feel free to comment with your performance improvements.

Want more stories? Check my Medium, follow me on twitter, and subscribe to reddit.com/r/bigquery. And try BigQuery — every month you get a full terabyte of analysis for free.