After spending Saturday writing the backup and restoration logic for the Offer Drive backend, I realized that I’ve written some variant of this exact same code in literally every job I’ve held — Amazon, Twitter, and now OfferLetter.io.

It’s that time again…

Database backup/restoration is important to get right — most of the time, you won’t need it. But when you need it, you will really need it. As with all operational tasks, ask yourself:

Will I get this correct when everything’s breaking and I’m exhausted at 3AM?

Some deeper notes are below.

Scripting

For writing the scripts themselves, I’ve built and rebuilt backup/restore scripts in Perl, Pig, Python, and, once, some crap hacked together in Bash. My favorite implementation is probably the current one for Offer Drive.

Python, Boto3, Fabric. Simple to understand, test, and implement.
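As a sketch of that stack (the database names, bucket, and paths here are hypothetical, and the pg_dump/PostgreSQL choice is my assumption — the stack above only names Python, Boto3, and Fabric):

```python
# Sketch of a Python + Boto3 backup step, assuming a PostgreSQL backend.
# Bucket, database, and path names are hypothetical.
import datetime
import subprocess

def dump_command(db_name, out_path):
    """Build the pg_dump invocation for a single-database dump."""
    return ["pg_dump", "--format=custom", "--file=" + out_path, db_name]

def run_backup(db_name, bucket):
    """Dump the database locally, then ship the artifact to S3; returns the key."""
    import boto3  # imported lazily so the helper above is testable without the AWS SDK
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    local_path = "/tmp/{}.{}.dump".format(db_name, stamp)
    subprocess.run(dump_command(db_name, local_path), check=True)
    key = "backups/{}/{}.dump".format(db_name, stamp)
    boto3.client("s3").upload_file(local_path, bucket, key)
    return key
```

The whole thing is a couple of plain functions — which is most of what “simple to understand, test, and implement” means in practice.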

Storage

The actual storage of your backups should be done in a trusted, reliable, and distinct medium — different hosts, different datacenter. Storing backups exclusively on the same machine, or even in the same datacenter, as the rest of your fleet will lead to problems.

Just use S3. If you have to use an on-premise storage layer, hopefully you have simple ways of testing, monitoring, and distributing your backups.

Encryption

Your database backups will probably benefit from being encrypted. This introduces the usual array of encryption-related challenges — rotating keys regularly, choosing strong keys, keeping the keys themselves secure, having a revocation policy, upgrading if there’s a flaw in the crypto discovered, etc.

However, I haven't always encrypted when surface area has been small, and/or data volumes haven’t been significant. Crypto is always tricky to get right, and complicates debugging due to harder introspection. Unless the scale or requirements are sufficient, I’d much rather invest time in shipping quickly and making sure the entry points to getting the S3 keys themselves (development machine, source repo, my brain) are secure.

I’m also a fan of creating stub encryptBackup/decryptBackup functions that are filled in later as the requirements evolve.
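Something like these no-op stubs (the names come from above; the pass-through behavior is the point — real crypto gets swapped in only when requirements demand it):

```python
# No-op encryption hooks: identity functions today, real crypto later.
def encryptBackup(payload, key=None):
    # Stub: return the payload unchanged until encryption is actually required.
    return payload

def decryptBackup(payload, key=None):
    # Stub: the inverse of encryptBackup, also a pass-through for now.
    return payload
```

Because the call sites exist from day one, turning encryption on later is a two-function change rather than a refactor.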

Naming

Good naming conventions are always fun to concoct. Treat the name as an interface onto the backup process. My general criteria for database backup names are that they should:

- Be easily human-parseable — for debugging, maintainability, and development
- Contain necessary information for scripts
- Easily maintain an obvious chronological ordering

A format I like: {hostname}.{timestamp}.{partition}.{format}
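A sketch of that format as a pair of helpers (the fixed-width UTC timestamp layout is my assumption — anything that sorts lexicographically in chronological order will do):

```python
import datetime

# Fixed-width UTC timestamp: lexicographic sort order == chronological order.
TIMESTAMP_FMT = "%Y%m%dT%H%M%SZ"

def backup_name(hostname, when, partition, fmt="dump"):
    """Render {hostname}.{timestamp}.{partition}.{format}."""
    return "{}.{}.{}.{}".format(hostname, when.strftime(TIMESTAMP_FMT), partition, fmt)

def parse_backup_name(name):
    """Invert backup_name; rsplit tolerates dotted hostnames like db1.example.com."""
    hostname, stamp, partition, fmt = name.rsplit(".", 3)
    return hostname, datetime.datetime.strptime(stamp, TIMESTAMP_FMT), int(partition), fmt
```

Splitting from the right is the one subtlety: hostnames contain dots, so the first field has to absorb whatever is left over.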

Time

Time in a distributed system is hard. Anyone who’s worked at sufficient scale has seen the usual array of stuck clocks, misconfigured timezones, or possibly even system clocks counting backwards (yes, really). Getting time right means getting restores right when you’re exhausted at 3AM. Getting time wrong means that, well, you don’t get restores right when you’re exhausted at 3AM.

Accordingly, the timestamp we use in the filename should be seen as a guide. If you’re debugging an issue, don’t unconditionally assume your timestamps (and therefore your sort order) are accurate.

Most of the time, relying on system clocks is not a problem, but it’s a good assumption to question if things are going wrong. You can also use S3’s LastModified object metadata as a cross-check.
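One way to make that cross-check concrete — compare the timestamp embedded in each backup’s name against what S3 says. The helper is pure so it’s easy to test; the S3-listing wrapper, bucket, and prefix are hypothetical and assume the naming format above:

```python
# Flag backups whose filename clock and S3 clock disagree — a hint that a
# system clock was wrong when the backup ran.
import datetime

def timestamps_disagree(name_ts, last_modified, tolerance=datetime.timedelta(hours=1)):
    """True when the embedded timestamp and S3's LastModified diverge beyond tolerance."""
    return abs(last_modified - name_ts) > tolerance

def suspect_backups(bucket, prefix):
    """Yield keys whose embedded timestamp disagrees with S3's metadata.
    Assumes {hostname}.{timestamp}.{partition}.{format} naming."""
    import boto3  # lazy import: keeps timestamps_disagree testable without the AWS SDK
    s3 = boto3.client("s3")
    for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", []):
        stamp = obj["Key"].rsplit(".", 3)[1]
        name_ts = datetime.datetime.strptime(stamp, "%Y%m%dT%H%M%SZ")
        if timestamps_disagree(name_ts, obj["LastModified"].replace(tzinfo=None)):
            yield obj["Key"]
```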

Schema Migrations

If you’re iterating on a production backend quickly, rapid schema migration and iteration can complicate debugging and your backup/restore procedures. Depending on the strictness of the exact tooling you’re using, you could inadvertently ignore tables, miss certain foreign key relations, and so forth.

But! Here’s the paradox — good backup/restoration can actually increase velocity of backend changes, especially when your project is new or low-volume. If you’re not as afraid of your data being blown away, you’ll have the confidence to ship faster, knowing you can quickly undo any damage.

1. Manually run backups both immediately before and after schema migrations.
2. Want fancier? Dump out the schema, version number, and other necessary metadata in a separate META file associated with the backup. More complicated, but it can help in especially obtuse environments, especially if the data layer is seeing rapid/decoupled iteration.
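The META sidecar could be as small as a JSON blob next to the backup — the field names here are hypothetical; the point is capturing enough context to make a restore unambiguous:

```python
# Sketch of a META sidecar for a backup: schema version plus whatever
# context makes a later restore unambiguous. Field names are hypothetical.
import json

def build_meta(backup_name, schema_version, extra=None):
    """Assemble the sidecar contents as a plain dict."""
    meta = {"backup": backup_name, "schema_version": schema_version}
    meta.update(extra or {})
    return meta

def write_meta(meta, path):
    """Write the sidecar next to the backup, e.g. foo.dump -> foo.dump.META."""
    with open(path, "w") as f:
        json.dump(meta, f, indent=2, sort_keys=True)
```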

Verifying

Create a standardized process, even if there’s only one of you. Once a month, run a manual backup/restore whose input and output you monitor. If it takes too much manual intervention, automate it. Fabric, SaltStack, Ansible, RightScale, whatever. Just get it done.
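One cheap sanity check for the monthly drill: restore into a scratch database and diff per-table row counts against the source. It’s coarse — it won’t catch corrupted rows — but it catches missed tables and truncated dumps fast. This helper and its inputs are hypothetical (any DB API cursor can produce the count dicts):

```python
# Compare per-table row counts between the source database and a freshly
# restored copy; return the tables that disagree.
def row_count_diff(source_counts, restored_counts):
    """Map of table -> (source_count, restored_count) for every mismatch.
    A missing table shows up with None on the side that lacks it."""
    diffs = {}
    for table in set(source_counts) | set(restored_counts):
        a, b = source_counts.get(table), restored_counts.get(table)
        if a != b:
            diffs[table] = (a, b)
    return diffs
```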

If you are not testing both backup and restoration, then you do not have a backup strategy.

This is the most commonly ignored best practice I’ve seen, and, due to the false sense of security it creates, can be the most disastrous.

User Interface

Are you 100% confident in your ability to type the right command when you’re exhausted at 3AM, and to verify that it worked? If not, fix it. Make it a one-liner in your script. Make sure you get an email whether it succeeds or fails.
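A sketch of that one-liner, with the notification path injected so the wrapper itself is trivially testable — in production the notify callable might wrap smtplib or an SES call (both of those are my assumption; the post just says “email”):

```python
# One-liner restore wrapper that always reports the outcome, success or
# failure, before letting any failure propagate.
import subprocess

def restore_and_notify(cmd, notify):
    """Run the restore command, then notify whether it succeeded or failed."""
    try:
        subprocess.run(cmd, check=True)
    except Exception:
        notify("restore FAILED: {}".format(" ".join(cmd)))
        raise
    notify("restore succeeded: {}".format(" ".join(cmd)))
```

The try/except/raise shape matters: a failed restore still sends the notification, and still crashes loudly instead of being swallowed.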

A Note to Library Authors

If you’re maintaining backup/restoration scripts/extensions, your docs should always describe:

- How to back up
- How to restore
- What security guarantees exist (if any)

It will help your users maintain best practices and will help your library become more widely adopted.

Concluding

Hopefully you will never have to use your backup/restoration scripts. But when you’re writing them, assume that you will need them at the worst possible time — exhausted at 3AM — and make sure everything still works easily. Your customers will thank you.

Thanks to Gopal and Michelle for their input on this post.