At Discourse we have always been huge fans of continuous deployment. Every commit we make heads to our continuous integration test suite. If all the tests pass (ui, unit, integration, smoke) we automatically deploy the latest version of our code to https://meta.discourse.org.

This pattern and practice we follow allows the thousands self-installers out there to safely upgrade to the tests-passed version whenever they feel like it.

Because we deploy so often we need to take extra care not to have any outages during deployments. One of the most common reasons for outages during application deployment is database schema changes.

The problem with schema changes

Our current deployment mechanism roughly goes as follows:

Migrate database to new schema

Bundle up application into a single docker image

Push to registry

Spin down old instance, pull new instance, spin up new instance (and repeat)

If we ever create an incompatible database schema we risk breaking all the old application instances running older versions of our code. In practice, this can lead to tens of minutes of outage!

In ActiveRecord the situation is particularly dire cause in production the database schema is cached and any changes in schema that drop or rename columns very quickly risk breaking every query to the affected model raising invalid schema exceptions.

Over the years we have introduced various patterns to overcome this problem and enable us to deploy schema changes safely, minimizing outages.

Tracking rich information about migrations

ActiveRecord has a table called schema_migrations where is stores information about migrations that ran.

Unfortunately the amount of data stored in this table is extremely limited, in fact it boils down to:

connection.create_table(table_name, id: false) do |t| t.string :version, version_options end

The table has a lonely column storing the “version” of migrations that ran.

It does not store when the migration ran It does not store how long it took the migration to run It has nothing about the version of Rails that was running when the migration ran

This lack of information, especially, not knowing when stuff ran makes creating clean systems for dealing with schema changes hard to build. Additionally, debugging strange and wonderful issues with migrations is very hard without rich information.

Discourse, monkey patches Rails to log rich information about migrations:

github.com discourse/discourse/blob/6a3c8fe69c16ad7360046f145db6689c18e91005/lib/freedom_patches/schema_migration_details.rb module FreedomPatches module SchemaMigrationDetails def exec_migration(conn, direction) rval = nil time = Benchmark.measure do rval = super end sql = <<SQL INSERT INTO schema_migration_details( version, hostname, name, git_version, duration, direction, rails_version, created_at ) values ( This file has been truncated. show original

Our patch provides us a very rich details surrounding all the migration circumstances. This really should be in Rails.

Defer dropping columns

Since we “know” when all previous migrations ran due to our rich migration logging, we are able to “defer drop” columns.

What this means is that we can guarantee we perform dangerous schema changes after we know that the new code is in place to handle the schema change.

In practice if we wish to drop a column we do not use migrations for it. Instead our db/seed takes care of defer dropping.

github.com discourse/discourse/blob/6a3c8fe69c16ad7360046f145db6689c18e91005/db/fixtures/009_users.rb#L36-L62 Migration::ColumnDropper.drop( table: 'users', after_migration: 'DropEmailFromUsers', columns: %w[ email email_always mailing_list_mode email_digests email_direct email_private_messages external_links_in_new_tab enable_quoting dynamic_favicon disable_jump_reply edit_history_public automatically_unpin_topics digest_after_days auto_track_topics_after_msecs new_topic_duration_minutes last_redirected_to_top_at This file has been truncated. show original

These defer drops will happen at least 30 minutes after the particular migration referenced ran (in the next migration cycle), giving us peace of mind that the new application code is in place.

If we wish to rename a column we will create a new column, duplicate the value into the new column, mark the old column readonly using a trigger and defer drop old column.

If we wish to drop or rename a table we follow a similar pattern.

The logic for defer dropping lives in ColumnDropper and TableDropper.

Not trusting ourselves

A big problem with spectacular special snowflake per-application practices is enforcement.

We have great patterns for ensuring safety, however sometimes people forget that we should never drop a column or a table the ActiveRecord migration way.

To ensure we never make the mistake of committing dangerous schema changes into our migrations, we patch the PG gem to disallow certain statements when we run them in the context of a migration.

Want to DROP TABLE ? Sorry, an exception will be raised. Want to DROP a column, an exception will be raised.

This makes it impractical to commit highly risky schema changes without following our best practices:

== 20180321015226 DropRandomColumnFromUser: migrating ========================= -- remove_column(:categories, :name) WARNING ------------------------------------------------------------------------------------- An attempt was made to drop or rename a column in a migration SQL used was: 'ALTER TABLE "categories" DROP "name"' Please use the deferred pattrn using Migration::ColumnDropper in db/seeds to drop or rename columns. Note, to minimize disruption use self.ignored_columns = ["column name"] on your ActiveRecord model, this can be removed 6 months or so later. This protection is in place to protect us against dropping columns that are currently in use by live applications. rake aborted! StandardError: An error has occurred, this and all later migrations canceled: Attempt was made to rename or delete column /home/sam/Source/discourse/db/migrate/20180321015226_drop_random_column_from_user.rb:3:in `up' Tasks: TOP => db:migrate (See full trace by running task with --trace)

This logic lives in safe_migrate.rb. Since this is a recent pattern we only enforce it for migrations after a certain date.

Alternatives

Some of what we do is available in gem form and some is not:

Strong Migrations offers enforcement. It also takes care of a bunch of interesting conditions like nudging you to create indexes concurrently in postgres. Enforcement is done via patching active record migrator, meaning that if anyone does stuff with SQL direct it will not be caught.

Zero downtime migrations very similar to strong migrations.

Outrigger allows you to tag migrations. This enables you to amend your deploy process so some migrations run pre-deploy and some run post-deploy. This is the simplest technique for managing migrations in such a way that you can avoid downtimes during deploy.

Handcuffs: very similar to outrigger, define phases for your migrations

What should you do?

Our current pattern for defer dropping columns and tables works for us, but is not yet ideal. Code that is in charge of “seeding” data now is also in charge of amending schema and timing of column drops is not as tightly controlled as it should be.

On the upside, rake db:migrate is all you need to run and it works magically all the time. Regardless of how you are hosted and what version your schema is at.

My recommendation though for what I would consider best practice here is a mixture of a bunch of ideas. All of it belongs in Rails.

Enforcement of best practices belongs in Rails

I think enforcement of safe schema changes should be introduced into ActiveRecord. This is something everyone should be aware of. It is practical to do zero downtime deploys today with schema changes.

class RemoveColumn < ActiveRecord::Migration[7.0] def up # this should raise an error remove_column :posts, :name end end

To make it work, everyone should be forced to add the after_deploy flag to the migration:

class RemoveColumn < ActiveRecord::Migration[7.0] after_deploy! # either this, or disable the option globally def up # this should still raise if class Post has no ignored_columns: [:name] remove_column :posts, :name end end

class RemoveColumn < ActiveRecord::Migration[7.0] after_deploy!(force: true) def up # this should work regardless of ignored_columns remove_column :posts, :name end end

I also think the ideal enforcement is via SQL analysis, however it is possible that this is a bit of a can-of-worms at Rails scale. For us it is practical cause we only support one database.

rake db:migrate should continue to work just as it always did.

For backwards compatibility rake db:migrate should run all migrations including after_deploy migrations. Applications who do not care about “zero downtime deploys” should also be allowed to opt out of the safety.

New post and pre migrate rake tasks should be introduced

To run all the application code compatible migrations you would run:

rake db:migrate:pre # runs all migrations without `after_deploy!`

To run all the destructive operations you would run:

rake db:migrate:post # runs all migrations with `after_deploy!`

Conclusion

If you are looking to start with “safe” zero downtime deploys today I would recommend: