At Shopify, we run a large fleet of MySQL servers, with numerous replica-sets (internally known as “shards”) spread across three Google Cloud Platform (GCP) regions. Given the petabyte scale size and criticality of data, we need a robust and efficient backup and restore solution. We drastically reduced our Recovery Time Objective (RTO) to under 30 minutes by redesigning our tooling to use disk-based snapshots, and we want to share how it was done.

Challenges with Existing Tools

For several years, we backed up our MySQL data using Percona’s Xtrabackup utility, stored its output in files, and archived them on Google Cloud Storage (GCS). While pretty robust, it provided a significant challenge when backing up and restoring data. The amount of time taken to back up a petabyte of data spread across multiple regions was too long, and increasingly hard to improve. We perform backups in all availability regions to decrease the time it takes to restore data cross-region. However, the restore times for each of our shards was more than six hours, which forced us to accept a very high RTO.

While this lengthy restore time was painful when using backups for disaster recovery, we also leverage backups for day-to-day tasks, such as re-building replicas. Long restore times also impaired our ability to scale replicas up and down in a cluster for purposes like scaling our reads to replicas.

Overcoming Challenges

Since we run our MySQL servers on GCP’s Compute Engine VMs using Persistent Disk (PD) volumes for storage, we invested time in leveraging PD’s snapshot feature. Using snapshots was simple enough, conceptually. In terms of storage, each initial snapshot of a PD volume is a full copy of the data, whereas the subsequent ones are automatically incremental, storing only data that has changed.

In our benchmarks, an initial snapshot of a multi-terabyte PD volume took around 20 minutes and each incremental snapshot typically took less than 10 minutes. The incremental nature of PD snapshots allows us to snapshot disks very frequently, helps us with having the latest copy of data, and minimizes our Mean Time To Recovery.

Modernizing our Backup Infrastructure

Taking a Backup

We built our new backup tooling around the GCP API to invoke PD snapshots. This tooling takes into account the availability regions and zones, the role of MySQL instance (replica or master) and the other MySQL consistency variables. We deployed this tooling in our Kubernetes infrastructure as CronJobs, giving the jobs a distributed nature and avoiding tying them to our individual MySQL VMs allowing us to avoid having to handle coordination in case of a host failure. The CronJob is scheduled to run every 15 minutes across all the clusters in all of our available regions, helping us avoid costs related to snapshot transfer across different regions.



Backup workflow selecting replica and calling disk API to snapshot, per cron schedule

The backup tooling creates snapshots of our MySQL instances nearly 100 times a day across all of our shards, totaling thousands of snapshots every day with virtually no failures.

Since we snapshot so frequently, it can easily cost thousands of dollars every day for snapshot storage if the snapshots aren’t deleted correctly. To ensure we only keep (and pay for) what we actually need, we built a framework to establish a retention policy that meets our Disaster Recovery needs. The tooling enforcing our retention policy is deployed and managed using Kubernetes, similar to the snapshot CronJobs. We create thousands of snapshots every day, but we also delete thousands of them, keeping only the latest two snapshots for each shard, and dailies, weeklies, etc. in each region per our retention policy



Backup retention workflow, listing and deleting snapshots outside of retention policy

Performing a Restore

Having a very recent snapshot always at the ready provides us with the benefit of being able to use these snapshots to clone replicas with the most recent data possible. Given the small amount of time it takes to restore snapshots by exporting a snapshot to a new PD volume, this has brought down our RTO to typically less than 30 minutes, including recovery from replication lag.



Backup restore workflow, selecting a latest snapshot and exporting to disk and attaching to a VM

Additionally, restoring a backup is now quite simple: The process involves creating new PDs with source as the latest snapshot to restore and starting MySQL on top of that disk. Since our snapshots are taken while MySQL is online, after restore it must go through MySQL InnoDB instance recovery, and within a few minutes the instance is ready to serve production queries.

Assuring Data Integrity and Reliability

While PD snapshot-based backups are obviously fast and efficient, we needed to ensure that they are reliable, as well. We run a backup verification process for all of the daily backups that we retain. This means verifying two daily snapshots per shard, per region.



In our backup verification tooling, we export each retained snapshot to a PD volume, attached to Kubernetes Jobs and verify the following:

if a MySQL instance can be started using the backup

if replication can be started using MySQL Global Transaction ID (GTID) auto-positioning with that backup

if there is any InnoDB page-level corruption within the backup



Backup verification process, selecting daily snapshot, exporting to disk and spinning up a Kubernetes job to run verification steps

This verification process restores and verifies more than a petabyte of data every day utilizing fewer resources than expected.

PD snapshots are fast and efficient, but the snapshots created exist only inside of GCP and can only be exported to new PD volumes. To ensure data availability in case of catastrophe, we needed to store backups at an offsite location. We created tooling which backs up the data contained in snapshots to an offsite location. The tooling exports the selected snapshot to new PD volume and runs Kubernetes Jobs to compress, encrypt and archive the data, before transferring them as files to an offsite location operated by another provider.

Evaluating the Pros and Cons of Our New Backup and Restore Solution

Pros

Using PD snapshots allows for faster backups compared to traditional file-based backup methods.

Backups taken using PD snapshots are faster to restore, as they can leverage vast computing resources available to GCP.

The incremental nature of snapshots results in reduced backup times, making it possible to take backups more frequently.

The performance impact on the donors of snapshots is noticeably lower than the performance impact of the donors of xtrabackup based backups.

Cons

Using PD snapshots is more expensive for storage compared to traditional file based backups stored in GCS.

The snapshot process itself doesn’t perform any integrity checks, for example, scanning for InnoDB page corruption, ensuring data consistency, etc. which means additional tools may need to be built.

Because snapshots are not inherently stored as a conveniently packaged backup, it is more tedious to copy, store, or send them off-site.

We undertook this project at the start of 2019 and, within a few months, we had a very robust backup infrastructure built around Google Cloud’s Persistent disk snapshot API. This tooling has been serving us well and has introduced us to new possibilities like, scaling replicas up and down for reads quickly using these snapshots apart from Disaster recovery.

If database systems are something that interests you, we're looking for Database Engineers to join