TL;DR: link to the code repo, with an example Airflow DAG, at the bottom.

When Google made Apache Airflow a managed service in GCP last year I was enthusiastic, mostly because I had looked into Airflow before, and I found it nice to have a task scheduler that is written in actual code instead of point-and-click. Airflow was the first proper task scheduler for GCP; prior to this, if you wanted a scheduler you had to use a third-party service or cron. Cron is fine if you have tasks like “ping that every minute” or “reload cache every hour”, but once you start having tasks like “load data from that source, then put it into another source, then email someone”, you really need a task manager. Because your jobs will crash, and then you need to restart something in the middle instead of kicking off the entire pipeline again. You do not want to write those kinds of scripts without a framework!

Almost a year has passed since Airflow made it to GCP, and while there have been many good additions that make GCP and BigQuery integration easier, there is still no documentation covering how to move data from Cloud SQL to BigQuery on a recurring basis. Hopefully an easy way of moving data from Cloud SQL to BigQuery will arrive in the near future.

First solution: automate with gcloud and Cloud Composer!

What Google has done well, however, is gcloud, the toolbox of GCP. You can do whatever you can do in the GUI (and even a bit more), and in the GUI you can do exports and imports.

The solution I will present now is best suited for batch-like data transfers in the “daily” range. Say, if you wanted to move your data warehouse tables to BigQuery every night.

With gcloud it is possible to trigger an export job (to CSV) from a Cloud SQL table to Cloud Storage, and to do this from within Cloud Composer. Since the table export also takes a custom SQL query, you can make the export a bit smarter, so you don’t have to do a full copy every day.
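To make that concrete, here is a minimal sketch of how such a gcloud invocation can be built from Python (as you would in a DAG file); the instance, database, and table names are made-up placeholders, and the optional WHERE clause is what lets you do incremental instead of full copies:

```python
def build_export_command(instance, database, table, gcs_uri, where=None):
    """Return a `gcloud sql export csv` invocation as an argument list.

    `where` is an optional SQL predicate, e.g. for incremental loads
    that only export rows changed since the last run.
    """
    query = f"SELECT * FROM {table}"
    if where:
        query += f" WHERE {where}"
    return [
        "gcloud", "sql", "export", "csv", instance, gcs_uri,
        f"--database={database}",
        f"--query={query}",
    ]
```

In Airflow, the joined command can simply be handed to a BashOperator.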

So my initial plan looked something like this:

1. Export the table to Cloud Storage
2. Import into a BigQuery staging area
3. Possibly a BigQuery SQL step to merge and join

After working on this for some hours I ran into a very irritating bug that required me to alter the pipeline slightly. For some reason Google doesn’t know how to export Cloud SQL to CSV properly: if a value is NULL, the exported CSV is broken (“foo”,“”,“bar” becomes “foo”,“N”,“bar”). The bug has been around for quite some time, so to use exports, we have to work around it.

In my case it could be solved by some script hacking and creative uses of gsutil and sed. Relevant code can be found here.
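For illustration, the same fix can be sketched in pure Python; this mirrors what the sed substitution does. Note it is a blunt text replacement: a column that legitimately contains the string "N" would be clobbered too, which is an inherent limitation of this workaround.

```python
import re

# The export bug turns NULL fields into the literal field "N".
# Replace those fields with empty strings, matching at the start,
# middle, or end of a CSV line.
NULL_FIELD = re.compile(r'(^|,)"N"(?=,|$)')

def fix_nulls(line: str) -> str:
    """Undo the Cloud SQL CSV export NULL bug on a single line."""
    return NULL_FIELD.sub(r'\1""', line)
```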

The other thing I hadn’t really thought about when I started this task was the schema. With just a CSV file, you will get very generic names on the columns (column1, column2, etc.). When I first wrote this, I was planning on adding a JSON file containing the data model for each table to be synced, but then I thought… why not grab it from MySQL? Then we don’t have to spend (manual) time maintaining a separate data model.
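A sketch of that idea: read (column name, type) pairs out of MySQL’s information_schema and turn them into a BigQuery JSON schema. The type map below is a partial example for illustration, not a complete mapping:

```python
import json

# Partial MySQL -> BigQuery type map; anything unknown falls back to STRING.
MYSQL_TO_BQ = {
    "int": "INTEGER", "bigint": "INTEGER", "tinyint": "INTEGER",
    "float": "FLOAT", "double": "FLOAT", "decimal": "FLOAT",
    "date": "DATE", "datetime": "TIMESTAMP", "timestamp": "TIMESTAMP",
    "varchar": "STRING", "text": "STRING", "char": "STRING",
}

def bq_schema(columns):
    """Build a BigQuery JSON schema from (name, mysql_type) pairs,
    e.g. the result of:
      SELECT column_name, data_type FROM information_schema.columns
      WHERE table_name = '...' ORDER BY ordinal_position;
    """
    return json.dumps(
        [{"name": name,
          "type": MYSQL_TO_BQ.get(mysql_type, "STRING"),
          "mode": "NULLABLE"}
         for name, mysql_type in columns],
        indent=2,
    )
```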

Fortunately it is pretty easy to add new tasks to Airflow, with the dependencies they need, so that went really well.

Revisited steps:

1. Create a bucket in the same location as where your BigQuery data set is; this will temporarily store data for you. You probably want to set a retention policy of a fixed time (for example 1 week), because you will hold quite some duplicate data here.
2. Export the table to Cloud Storage as CSV, using gcloud.
3. Export the MySQL schema (column name, type) as CSV, using gcloud.
4. Create a BigQuery JSON schema from the exported CSV file, and fix some data types, like MySQL DATE being exported as YYYY-mm-dd hh:mm:ss, while BigQuery wants this as YYYY-mm-dd only.
5. Fix NULL values in the export due to this bug (https://cloud.google.com/sql/docs/mysql/known-issues#import-export): exported NULL values will break the CSV. Using gsutil and sed.
6. Import it into BigQuery.
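The steps above map naturally onto Airflow tasks. A minimal sketch of such a DAG, using the Airflow 1.x BashOperator and made-up instance, bucket, and table names (the real pipeline, including the schema-conversion step, is in the linked repo):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {"owner": "airflow", "start_date": datetime(2018, 1, 1)}

with DAG("cloudsql_to_bigquery", default_args=default_args,
         schedule_interval="@daily") as dag:

    # Step 2: export the table itself to CSV in the staging bucket.
    export_table = BashOperator(
        task_id="export_table",
        bash_command=(
            "gcloud sql export csv my-instance "
            "gs://my-staging-bucket/orders.csv "
            "--database=shop --query='SELECT * FROM orders'"
        ),
    )

    # Step 3: export the schema (column name, type) the same way.
    export_schema = BashOperator(
        task_id="export_schema",
        bash_command=(
            "gcloud sql export csv my-instance "
            "gs://my-staging-bucket/orders_schema.csv "
            "--database=shop --query=\"SELECT column_name, data_type "
            "FROM information_schema.columns WHERE table_name='orders'\""
        ),
    )

    # Step 5: work around the NULL export bug with gsutil and sed.
    fix_nulls = BashOperator(
        task_id="fix_nulls",
        bash_command=(
            "gsutil cat gs://my-staging-bucket/orders.csv "
            "| sed 's/,\"N\",/,\"\",/g' "
            "| gsutil cp - gs://my-staging-bucket/orders_fixed.csv"
        ),
    )

    # Step 6: load the fixed CSV into BigQuery using the generated schema.
    load_to_bq = BashOperator(
        task_id="load_to_bq",
        bash_command=(
            "bq load --source_format=CSV --replace "
            "staging.orders gs://my-staging-bucket/orders_fixed.csv "
            "orders_schema.json"
        ),
    )

    export_table >> fix_nulls >> load_to_bq
    export_schema >> load_to_bq
```

Because each step is its own task, a crash mid-pipeline means you can clear and re-run just the failed task instead of the whole chain.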

Disclaimer: I have not tested how step 4 works with very large files.