And that are all the steps BigQuery had to go through to run our query. Now, if this is a little hard to read, we have some alternatives. For example, the legacy BigQuery web UI has more compact results:

Here we can see that the slowest operations were computing while reading the 56 million rows table twice.

So if I want to improve the performance of my query, I have to focus on that. Well, let’s do it — if I change the 2:

FROM `fh-bigquery.wikidata.subclasses_20190822`

to:

FROM `fh-bigquery.wikidata.wikidata_latest_20190822`

Now my query runs in half the time! And then we moved the slowest part elsewhere:

Which is this JOIN now:

It even shows us that it’s looking for all the super-heroes between ‘3-D Man’ and ‘Zor-El’… yes, it’s going through the whole alphabet.

For an even deeper view, check out the BigQuery Query plan visualizer by Stephan Meyn:

#5 The power of Materialization

It’s really cool to have these tables in BigQuery. But how did I load them?

You can see that I periodically bring new raw files into GCS (Google Cloud Storage), and then I read them raw into BigQuery.

In the case of the Wikipedia pageviews, I do all the CSV parsing inside BigQuery, as there are many edge cases, and I need to solve some case by case.

Then I materialize this tables periodically into my partitioned and clustered tables.

In the case of Wikidata, they have some complicated JSON — so I read each JSON row raw into BigQuery. I could parse it with SQL, but I that’s not enough. And that brings us to our next super power.

#6 Navigating the multiverse

So we live in this SQL universe. It’s an incredible place to manipulate and understand data, but each universe has its limitations and its rules. What if we could jump to a different universe, with different rules and powers, and manage to connect both universes, somehow? What if we could jump into the.. JavaScript universe?

First, let’s talk about the UDFs — User Defined Functions:

[Go through SQL UDFs, sharing them, like fhoffa.x.random_int() or fhoffa.x.median() , then introduce the power of natural language processing by downloading a random JS library from the Internet, and using it from within BigQuery]

And thus I can now take each row of Wikidata JSON and parse it inside BigQuery, using whatever JavaScript logic I want to use, and then I materialize this into BigQuery.

#7 Time travel

Let’s take this table. It’s a beautiful table, with a couple thousand rows. But not everyone is happy — turns out someone wants to delete half of it’s rows — randomly.

How would our super-enemy pull this off?

DELETE FROM `fh-bigquery.temp.peaceful_universe`

WHERE fhoffa.x.random_int(0,10) < 5

Oh no. Half of the rows of our peaceful universe are gone. Randomly. How is that even fair? How will we ever recover from this?

“Mr Hoffa, I don’t feel so good” — one row.

5 days later

We learned how to move forward without these rows, but we still miss them. If only there was a way to travel back in time and bring them back.

Yes we can.

Instead of:

SELECT COUNT(*)

FROM `fh-bigquery.temp.peaceful_universe`

we can write:

SELECT COUNT(*)

FROM `fh-bigquery.temp.peaceful_universe`

FOR SYSTEM TIME AS OF

TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -5 MINUTE)

And to replace the table with the past one:

CREATE OR REPLACE TABLE `fh-bigquery.temp.peaceful_universe`

AS

SELECT *

FROM `fh-bigquery.temp.peaceful_universe`

FOR SYSTEM TIME AS OF

TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -5 MINUTE)

Warning: CREATE OR REPLACE TABLE deletes the table history, so write the results elsewhere. But soon this warning won’t be necessary.

#8 The power of super-speed

#9 Invulnerability

Our most annoying enemy:

Division by zero

From the BigQuery docs:

SAFE. prefix

Syntax:

SAFE.function_name()

Description

If you begin a function with the SAFE. prefix, it will return NULL instead of an error.

Operators, such as + and = , do not support the SAFE. prefix. To prevent errors from a division operation, use SAFE_DIVIDE. Some operators, such as IN , ARRAY , and UNNEST , resemble functions, but do not support the SAFE. prefix. The CAST and EXTRACT functions also do not support the SAFE. prefix. To prevent errors from casting, use SAFE_CAST.

#10 The power of self-control

All super-heroes struggle when they first discover their super-powers. Having super strength is cool, but you can break a lot of things if you’re not careful. Having super-speed is fun — but only if you also learn how to break. And if you’re accidentally exposed to gamma rays, you might never be able to control your powers, once you get angry. Don’t get angry!

Look mom, I can query 5 petabytes of data in 3 minutes!

That’s super cool, until you remember that querying one petabyte is one thousand times more expensive than querying one terabyte. And you only have one free terabyte every month. If you have not entered a credit card, don’t worry — you will have your free terabyte every month, no need to have a credit card. But if you want to go further, now you need to be aware of your budget and set-up cost controls.

Look at the docs:

And if that’s not clear enough, I wrote an illustrated guide on Stack Overflow:

Remember — with great powers, comes great responsibility. Turn on your cost controls.