My Experience with BQ

Playing with Sharded tables in BigQuery

Let’s learn how to create a backup of Sharded tables in BigQuery.

Photo by Kolar.io on Unsplash

If you want to do it yourself or somebody tells you to create a backup/ copy of a table in BigQuery then you’ll jump and say yeah I can do this in a jiffy. Sounds very simple right, but at this moment you don’t know which type of table it is. We’ve to create a backup of ga_sessions table which looks like this:

1. The visual look of ga_session table

From the image what all thoughts come to your mind? ga_sessions is the table name but what’s this 75 in front of it? Never seen this kind of table though. Also, the table icon looks somehow different.

If you haven’t had any experience in exporting to BigQuery from the Google Analytics metrics, then your first step is to identify which type of table it is. Is it a cluster table? Nah 🤔

Ohh so it looks like a partitioned table. Yeah, it’s a partitioned table. But I never saw a date filter in front of the table. Also, when I’m trying to copy the table why am I getting the data only for the current day in the destination table? Huh 🤨

2. Table with date filter

If the table is partitioned then usually we’ll get the info about the table under the Details tab (refer to image below). In our case, nothing is specified about the type of table in the details tab of the table, what’s next🧐?

3. Partitioned tables are highlighted like this

This kind of table is referred to as the “Sharded” table. Let’s get some understanding of types of tables in BigQuery. We’ll not go deeper into this article. Stay hungry to eat more such topics later.

Table Partitioning:

A special BigQuery table that is divided into segments known as partitions is referred to as the Partition table. For better maintainability and query performance, it’s better to partition the tables with massive data.

There are 3 options in BQ to partition tables:

Partitioned tables:

While creating a partition table, we specify the column that should be used to partition like TIMESTAMP or DATE column. Based on the values stored in these columns, data will be written to the respective partitions. Max 4000 partitions are allowed per partitioned table. Ingestion-time partitioned tables:

A pseudo column called _PARTITIONTIME will be used in this to retrieve data. BQ will automatically load the data into date-based partitions which are based on the ingestion date of the data. Sharded tables:

What is Sharding? Sharding is splitting up the data into small chunks based on some logic and these small chunks are called Shards.

These tables just follow the time-based naming approach in the table name. They are called as date-shared tables. Ex: [PREFIX]_YYYYMMDD.

Unlike partitions, there is no restriction on sharded tables, but we can reference only 1000 tables in a query. When we create sharded tables, BQ maintains a copy of the schema and metadata for each sharded table.

Now, we’ve understood most types of tables, let’s go back to our problem. Basically, in this scenario, there is a separate table for each day and the prefix of the table is the same for all i.e. ga_sessions_. For example: If we refer to image 2, we’ll have two separate tables for both the dates i.e. ga_sessions_20200118 and ga_sessions_20200117.

Steps:

Take the list of all the days that you wish to copy by hitting below SQL command. [You can modify this as per your need]

SELECT

REPLACE(STRING_AGG(CONCAT('"', partition_name,'"') ORDER BY partition_name ), ","," ")

FROM

(

SELECT

DISTINCT date AS partition_name

FROM

`[Project_ID].[DATASET_NAME].ga_sessions_*`

ORDER BY PARSE_DATE(“%Y%m%d”, date)

) Output:

"20180322" "20180323" "20180324" "20180325" "20180326" "20180327"...

2. Now, go to google console, select the project and click on Activate Cloud Shell.

4. Activate cloud shell for BigQuery

3. Check the current project using echo $DEVSHELL_PROJECT_ID. If you’re in an incorrect project then change it and move to step 4.

4. Create a bash script and specify the days that you just got from the above query.

tables=(“20180322” “20180323” “20180324” “20180325” “20180326” “20180327”…)

5. Iterate through all the days available in tables variable and use BigQuery’s copy command to move a table from source to destination.

Syntax: [Note: There is a space between the source and destination table name]

bq cp -a source_table destination_table

6. Our, script file looks like this:

tables=("20180322" "20180323" "20180324" "20180325" "20180326" "20180327"…) for val in ${tables[*]}; do

bq cp -a [source_dataset].ga_sessions_$val [destination_dataset].ga_sessions_backup_$val done

What this script will do when we execute it? It will copy one table from the source to the destination. We can change the destination table name like we did it here or we can keep it as it is if it’s different dataset or project.

So we’re done, right?

Photo by Diego PH on Unsplash

Wait, the story doesn’t end here…

How to validate whether all the tables are copied or not?

WITH first_ga_session AS (

SELECT MIN(PARSE_DATE("%Y%m%d", REGEXP_EXTRACT(table_id, '20[0–9]

{6,6}'))) AS day

FROM `[PROJECT_ID].[DATASET_NAME].__TABLES__` AS ga_tables

WHERE table_id LIKE 'ga_sessions_backup_2%'

),

all_days AS (

SELECT period

FROM

UNNEST(GENERATE_DATE_ARRAY((SELECT day from first_ga_session),

CURRENT_DATE())) AS period

),

available_ga_sessions AS (

SELECT PARSE_DATE("%Y%m%d", REGEXP_EXTRACT(table_id, '20[0–9]

{6,6}')) AS ga_day

FROM `[PROJECT_ID].[DATASET_NAME].__TABLES__` AS ga_tables

WHERE table_id LIKE 'ga_sessions_backup_2%'

) SELECT A.period AS Day, B.ga_day AS Available_session FROM all_days A LEFT JOIN available_ga_sessions B ON A.period = B.ga_day WHERE B.ga_day IS NULL

The above query will give us all the days that are missing in the destination table.

Wrap Up

Here we learned about how to backup sharded tables in BigQuery. I used this approach many times with huge and massive size tables, it is always helpful.

I hope this article was useful for you, and if you haven’t tried it before then start playing with it. It will be FUN!!!

Thank you for spending your time & feel free to contact me. I would love to hear your questions, comments, or suggestions.