Loading MySQL backup files into BigQuery — straight out of Cloud SQL

Reading MySQL backup files into BigQuery is not always easy. Now you can load these backups into a Cloud SQL MySQL instance — and then have BigQuery read straight out of MySQL.

The problem

Let’s say you need to load some existing MySQL backups into BigQuery — like the MySQL dumps that GHTorrent publishes. But when you try to load them into BigQuery, you find that it’s not straightforward:

bq load --autodetect temp.users1 gs://fh/users.csv

BigQuery error in load operation: Error processing job: Error while reading data, error message: CSV table encountered too many errors, giving up. Failure details:

- gs://fh/users.csv: Error while reading data, error message: Error detected while parsing row starting at position: 1558192614. Error: Data between close double quote (") and field separator.

The solution: Cloud SQL as a BigQuery federated source

Now that BigQuery can read tables straight out of Cloud SQL instances (MySQL and PostgreSQL), we can just load our MySQL backups into Cloud SQL, and then:

CREATE TABLE temp.users AS
SELECT * FROM EXTERNAL_QUERY(
  'fh-bigquery.US.mycloudsql'
  , 'SELECT * FROM users;')

# This statement created a new table named fh-bigquery:temp.users.

Step by step

Set up a MySQL instance in Cloud SQL

It’s pretty straightforward to create a MySQL instance:
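If you prefer the command line over the console, a minimal sketch with gcloud looks like this. The tier, region, and root password are placeholder choices, not necessarily the ones used for this post:

# Create a small MySQL instance named mycloudsql (tier, region, and
# root password below are placeholders).
gcloud sql instances create mycloudsql \
  --database-version=MYSQL_5_7 \
  --tier=db-n1-standard-1 \
  --region=us-central1 \
  --root-password=[CHOOSE_A_ROOT_PASSWORD]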

Then create a user for BigQuery to connect with:

Creating a user for BigQuery
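From the command line, the equivalent is roughly the following. The user name and password are placeholders; any MySQL user BigQuery can authenticate with will do:

# Create a MySQL user for BigQuery to connect with
# (user name and password are placeholders).
gcloud sql users create bigquery \
  --instance=mycloudsql \
  --password=[CHOOSE_A_PASSWORD]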

Create a database too:

Creating the ghtorrent database
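Or, if you'd rather stay in the terminal:

# Create the ghtorrent database on the instance we just set up.
gcloud sql databases create ghtorrent --instance=mycloudsql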

You can connect to it:

gcloud sql connect mycloudsql --user=root --quiet

And set up your tables — in this case, taking the users table definition from GHTorrent’s schema.sql:



SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0;
SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0;
SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='TRADITIONAL,ALLOW_INVALID_DATES';
SET @OLD_TIME_ZONE=@@session.time_zone;

DROP SCHEMA IF EXISTS `ghtorrent` ;
CREATE SCHEMA IF NOT EXISTS `ghtorrent` DEFAULT CHARACTER SET utf8 ;
USE `ghtorrent` ;

-- -----------------------------------------------------
-- Table `ghtorrent`.`users`
-- -----------------------------------------------------
DROP TABLE IF EXISTS `ghtorrent`.`users` ;

CREATE TABLE IF NOT EXISTS `ghtorrent`.`users` (
  `id` INT(11) NOT NULL AUTO_INCREMENT COMMENT '',
  `login` VARCHAR(255) NOT NULL COMMENT '',
  `company` VARCHAR(255) NULL DEFAULT NULL COMMENT '',
  `created_at` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT '',
  `type` VARCHAR(255) NOT NULL DEFAULT 'USR' COMMENT '',
  `fake` TINYINT(1) NOT NULL DEFAULT '0' COMMENT '',
  `deleted` TINYINT(1) NOT NULL DEFAULT '0' COMMENT '',
  `long` DECIMAL(11,8) COMMENT '',
  `lat` DECIMAL(10,8) COMMENT '',
  `country_code` CHAR(3) COMMENT '',
  `state` VARCHAR(255) COMMENT '',
  `city` VARCHAR(255) COMMENT '',
  `location` VARCHAR(255) NULL DEFAULT NULL COMMENT '',
  PRIMARY KEY (`id`) COMMENT '')
ENGINE = InnoDB
DEFAULT CHARACTER SET = utf8;

Note the SQL_MODE setting with ALLOW_INVALID_DATES, which will create some trouble for us later.

Load your backups into Cloud SQL

Once you’ve created the database and the tables are set up, loading an existing CSV file from GCS into Cloud SQL is a one-step operation:

gcloud sql import csv \
  mycloudsql \
  gs://fh-bigquery/ghtorrent/mysql-2019-02-01/users.csv \
  --database=ghtorrent --table=users;

Connect BigQuery to Cloud SQL

In the new BigQuery web UI you can create a connection to the Cloud SQL instance we just created:
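If you prefer to script this step, the bq command line tool can create the connection as well. This is only a sketch: the instance connection name (project:region:instance), database, user, and password below are placeholders to replace with your own values.

# Create a Cloud SQL connection resource that EXTERNAL_QUERY can use.
# The connection id 'mycloudsql' matches the EXTERNAL_QUERY example above;
# instanceId, username, and password are placeholders.
bq mk --connection \
  --location=US \
  --connection_type=CLOUD_SQL \
  --properties='{"instanceId":"fh-bigquery:us-central1:mycloudsql","database":"ghtorrent","type":"MYSQL"}' \
  --connection_credential='{"username":"bigquery","password":"[PASSWORD]"}' \
  mycloudsql

Once the connection exists, a tiny federated query is a quick way to check that it works before creating any tables:

SELECT * FROM EXTERNAL_QUERY(
  'fh-bigquery.US.mycloudsql'
  , 'SELECT COUNT(*) c FROM users;')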