A few months ago, we took around 15 seconds of unexpected API downtime during a planned database migration. We're always careful about deploying schema changes, so we were surprised to see one go so badly wrong. As a payments company, the uptime of our API matters more than most - if we're not accepting requests, our merchants are losing money. It's not in our nature to leave issues like this unexplored, so naturally we set about figuring out what went wrong. This is what we found out.

Background

We're no strangers to zero-downtime schema changes. Having the database stop responding to queries for more than a second or two isn't an option, so there's a bunch of stuff you learn early on. It's well covered in other articles, and it mostly boils down to:

Don't rename columns/tables which are in use by the app - always copy the data and drop the old one once the app is no longer using it

Don't rewrite a table while you have an exclusive lock on it (e.g. no ALTER TABLE foos ADD COLUMN bar varchar DEFAULT 'baz' NOT NULL )

) Don't perform expensive, synchronous actions while holding an exclusive lock (e.g. adding an index without the CONCURRENTLY flag)

This advice will take you a long way. It may even be all you need to scale this part of your app. For us, it wasn't, and we learned that the hard way.

The migration

Jump back to late January. At the time, we were building invoicing for our Pro product. We'd been through a couple of iterations, and settled on model/table names. We'd already deployed an earlier revision, so we had to rename the tables. That wasn't a problem though - the tables were empty, and there was no code depending on them in production.

The foreign key constraints on those tables had out of date names after the rename, so we decided to drop and recreate them. Again, we weren't worried. The tables were empty, so there would be no long-held lock taken to validate the constraints.

So what happened?

We deployed the changes, and all of our assumptions got blown out of the water. Just after the schema migration started, we started getting alerts about API requests timing out. These lasted for around 15 seconds, at which point the migration went through and our API came back up. After a few minutes collecting our thoughts, we started digging into what went wrong.

First, we re-ran the migrations against a backup of the database from earlier that day. They went through in a few hundred milliseconds. From there we turned back to the internet for an answer.

Information was scarce. We found lots of blog posts giving the advice from above, but no clues on what happened to us. Eventually, we stumbled on an old thread on the Postgres mailing list, which sounded exactly like the situation we'd ran into. We kept looking, and found a blog post which went into more depth.

In order to add a foreign key constraint, Postgres takes AccessExclusive locks on both the table with the constraint, and the one it references while it adds the triggers which enforce the constraint. When a lock can't be acquired because of a lock held by another transaction, it goes into a queue. Any locks that conflict with the queued lock will queue up behind it. As AccessExclusive locks conflict with every other type of lock, having one sat in the queue blocks all other operations on that table.

Here's a worked example using 3 concurrent transactions, started in order:

SELECT DISTINCT ( email ) FROM parent ; ALTER TABLE child ADD CONSTRAINT parent_fk FOREIGN KEY ( parent_id ) REFERENCES parent NOT VALID ; SELECT * FROM parent WHERE id = 123 ;

While the tables we were adding the constraints to were unused by the app code at that point, the tables they referenced were some of the most heavily used. An unfortunately timed, long-running read query on the parent table collided with the migration which added the foreign key constraint.

The ALTER TABLE statement itself was fast to execute, but the effect of it waiting for an AccessExclusive lock on the referenced table caused the downtime - read/write queries issued by calls to our API piled up behind it, and clients timed out.

Avoiding downtime

Applications vary too much for there to be a "one size fits all" solution to this problem, but there are a few good places to start:

Eliminate long-running queries/transactions from your application. Run analytics queries against an asynchronously updated replica. It's worth setting log_min_duration_statement and log_lock_waits to find these issues in your app before they turn into downtime.

Set lock_timeout in your migration scripts to a pause your app can tolerate. It's better to abort a deploy than take your application down.

in your migration scripts to a pause your app can tolerate. It's better to abort a deploy than take your application down. Split your schema changes up. Problems become easier to diagnose. Transactions around DDL are shorter, so locks aren't held so long.

Keep Postgres up to date. The locking code is improved with every release.

Whether this is worth doing comes down to the type of project you're working on. Some sites get by just fine putting up a maintenance page for the 30 seconds it takes to deploy. If that's not an option for you, then hopefully the advice in this post will help you avoid unexpected downtime one day.