How to backup a BigQuery table (or dataset) to Google Cloud Storage and restore from it

String together calls to the bq command-line utility

BigQuery is fully managed and takes care of maintaining redundant copies of your data in storage. It also supports "time travel": the ability to query a snapshot of a table as it existed at an earlier point in time. So, if an ETL job goes bad and you want to revert to yesterday's data, you can simply do:

CREATE OR REPLACE TABLE dataset.table_restored
AS
SELECT *
FROM dataset.table
FOR SYSTEM_TIME AS OF
  TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)

However, time travel is restricted to the past 7 days. There are situations (playback, regulatory compliance, etc.) in which you might need to restore a table as it existed 30 days ago or 1 year ago.

Python scripts to backup and restore

For your convenience, I’ve put together a pair of Python programs for backing up and restoring BigQuery tables and datasets. Get them from this GitHub repository: https://github.com/GoogleCloudPlatform/bigquery-oreilly-book/tree/master/blogs/bigquery_backup

Here’s how you use the scripts:

To back up a table to GCS

./bq_backup.py --input dataset.tablename --output gs://BUCKET/backup

The script saves a schema.json, a tabledef.json, and the extracted data in Avro format to GCS.

You can also back up all the tables in a dataset:

./bq_backup.py --input dataset --output gs://BUCKET/backup

Restore tables one by one by specifying a destination dataset:

./bq_restore.py --input gs://BUCKET/backup/fromdataset/fromtable --output destdataset

How the Python scripts work

The scripts use the BigQuery command-line utility bq to put together a backup and restore solution.

bq_backup.py invokes the following commands:

bq show --schema dataset.table              # schema.json
bq --format=json show dataset.table         # tbldef.json
bq extract --destination_format=AVRO \
    dataset.table gs://.../data_*.avro      # AVRO files

It saves the JSON files to Google Cloud Storage alongside the AVRO files.
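In outline, the backup step amounts to assembling those three bq invocations per table. The helper below is an illustrative sketch of that logic, not the actual script (which lives in the GitHub repository linked above); the function name and GCS layout are assumptions:

```python
# Illustrative sketch of how a backup script could assemble its bq calls.
def backup_commands(table, gcs_prefix):
    """Build the three bq invocations for backing up dataset.table."""
    dataset, name = table.split(".")
    dest = f"{gcs_prefix}/{dataset}/{name}"
    return [
        # table schema -> schema.json
        ["bq", "show", "--schema", table],
        # full table definition (partitioning, clustering, view query) -> tbldef.json
        ["bq", "--format=json", "show", table],
        # the data itself, as Avro shards
        ["bq", "extract", "--destination_format=AVRO",
         "--use_avro_logical_types", table, f"{dest}/data_*.avro"],
    ]

cmds = backup_commands("mydataset.mytable", "gs://BUCKET/backup")
```

Each entry could then be executed with `subprocess.run(cmd, check=True, capture_output=True)` and the captured output written to GCS.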

bq_restore.py uses the table definition to find characteristics such as whether the table is time-partitioned, range-partitioned, or clustered and then invokes the command:

bq load --source_format=AVRO \
    --time_partitioning_expiration ... \
    --time_partitioning_field ... \
    --time_partitioning_type ... \
    --clustering_fields ... \
    --schema ... \
    todataset.table_name \
    gs://.../data_*.avro
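That flag assembly can be sketched as follows. The field names (`timePartitioning`, `clustering`) are those found in the JSON that `bq --format=json show` emits, but the helper itself is illustrative, not the actual script. One detail worth noting: the table definition stores the expiration in milliseconds (`expirationMs`), while the bq flag takes seconds:

```python
# Illustrative sketch of mapping the saved table definition (tbldef.json)
# onto bq load flags; not the actual bq_restore.py.
def load_flags(tbldef):
    flags = []
    tp = tbldef.get("timePartitioning")
    if tp:
        flags.append("--time_partitioning_type=" + tp.get("type", "DAY"))
        if "field" in tp:
            flags.append("--time_partitioning_field=" + tp["field"])
        if "expirationMs" in tp:
            # tbldef.json stores milliseconds; the bq flag takes seconds
            flags.append("--time_partitioning_expiration="
                         + str(int(tp["expirationMs"]) // 1000))
    clustering = tbldef.get("clustering")
    if clustering:
        flags.append("--clustering_fields=" + ",".join(clustering["fields"]))
    return flags

flags = load_flags({
    "timePartitioning": {"type": "DAY", "field": "ts", "expirationMs": "86400000"},
    "clustering": {"fields": ["user_id", "country"]},
})
# -> ['--time_partitioning_type=DAY', '--time_partitioning_field=ts',
#     '--time_partitioning_expiration=86400', '--clustering_fields=user_id,country']
```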

In the case of views, the scripts store and restore the view definition. The view definition is part of the table definition JSON, and to restore it, the script simply needs to invoke:

bq mk --view query --nouse_legacy_sql todataset.table_name

Enjoy!

1. Do you need to back up to Google Cloud Storage?

After I published this article, Antoine Cas responded on Twitter, asking whether you really need to export the data to Cloud Storage at all.

He’s absolutely right. This article assumes that you want a backup in the form of files on Cloud Storage. This might be because your compliance auditor wants the data to be exported out to some specific location, or it might be because you want the backups to be processable by other tools.

If you do not need the backup to be in the form of files, a much simpler way to back up your BigQuery table is to use bq cp to create/restore the backup:

# backup
date=...
bq mk dataset_${date}
bq cp dataset.table dataset_${date}.table

# restore
bq cp dataset_20200301.table dataset_restore.table
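If you go this route, the dated-dataset naming is easy to script. The helper below is hypothetical (not part of the repository above) and just builds the two bq commands for a given snapshot date:

```python
from datetime import date

# Hypothetical helper for the bq cp approach; the names here are
# illustrative and not part of the backup repo.
def copy_commands(table, on=None):
    """Build bq commands that snapshot dataset.table into dataset_YYYYMMDD.table."""
    dataset, name = table.split(".")
    suffix = (on or date.today()).strftime("%Y%m%d")
    snapshot = f"{dataset}_{suffix}"
    return [
        ["bq", "mk", snapshot],
        ["bq", "cp", table, f"{snapshot}.{name}"],
    ]

cmds = copy_commands("dataset.table", on=date(2020, 3, 1))
# -> [['bq', 'mk', 'dataset_20200301'],
#     ['bq', 'cp', 'dataset.table', 'dataset_20200301.table']]
```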

2. Are the backup files compressed?

Yes! The data is stored in Avro format, and the Avro format employs compression. The two JSON files (table definition and schema) are not compressed, but those are relatively tiny.

As an example, I backed up a BigQuery table with 400 million rows that took 11.1 GB in BigQuery. When storing it as Avro on GCS, the same table was saved as 47 files each of which was 174.2 MB, so 8.2 GB. The two JSON files occupied a total of 1.3 MB, essentially just roundoff. This makes sense because the BigQuery storage is optimized for interactive querying whereas the Avro format is not.
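The arithmetic checks out (using decimal GB, i.e. 1 GB = 1000 MB):

```python
avro_gb = 47 * 174.2 / 1000   # 47 shards of 174.2 MB each
assert round(avro_gb, 1) == 8.2
savings = 1 - 8.2 / 11.1      # vs. 11.1 GB in BigQuery storage
print(f"{avro_gb:.1f} GB on GCS, {savings:.0%} smaller")  # -> 8.2 GB on GCS, 26% smaller
```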

3. Can you actually roundtrip from BigQuery to Avro and back?

BigQuery can export most primitive types and nested and repeated fields into Avro. For full details on the Avro representation, please see the documentation. The backup tool specifies use_avro_logical_types, so DATE and TIME are stored as date and time-micros respectively.

That said, you should verify that the backup/restore works on your table.