by German Monfort (@gmonfort) and Prabhat Jha (@prabhatjha)

We recently moved our main production SQL database from Heroku PostgreSQL to AWS RDS. We hope that our approach, outlined below, can serve as a guideline for others.

Why Migrate: Cost and Control

Before we go into details, the primary reasons for the migration were cost and control. We have a decent amount of AWS credit and AWS reserved instances are cheaper than regular on-demand instances. We have been happy campers on Heroku for the past 3 years, and a significant part of our services still run on Heroku. After all, it’s designed to do most of things right without you having to worry about OS, Middleware, etc. Heroku is a great platform with good support, and our intention here is not to find fault with it.

For context, this is a very high level system diagram of our main production components which directly connect to our relational database:

Looking at our monthly infrastructure bill we realized that we were spending the most on our Postgres database. We therefore took the following steps to migrate:

Only 12 Steps: From Preparation to Migration

Collect database versions, storage size, CPU, IOPS, and PG extensions for all the different environments on Heroku. Find AWS RDS instances closest to that on Heroku and pick the next tier. The next tier gives us the buffer to smoothly handle anticipated growth. Launch services on AWS for all data migration actions for the production environment.

a. Create ec2 instance with high network capacity.

b. Install psql v9.6.



c. sudo yum install postgresql96-9.6.4-1.77.amzn1.x86_64

d. Launch RDS instance and time it.

e. Create a replica of RDS and time it.

Hot tip: If all of your databases are not on the same version, create multiple ec2 instances for each database. For example, you cannot use pg_dump/psql for v9.6 to dump data from v9.4.

Update RDS pg_settings to optimize for faster pg_restore using RDS parameter groups.

Here is a screenshot of the parameters we ended up changing.

Hot tip: Standard pg_settings on RDS are not optimal for a fast restore

Save the configurations for the Heroku app just in case you need to abort the migration.

<span style="font-weight: 400">heroku config -a staging-app > staging-app.txt</span>

Take Heroku apps offline (e.g. in maintenance mode). Dump and restore the data.

We played with several parameters for pg_dump and pg_restore. The following took the least time for a combined dump and restore operation:

pg_dump -h HEROKU_DB_HOST -d DB_NAME -U USER_ID -j 8 -Fd -v -f heroku-prod-dump pg_restore --verbose --clean --no-acl --no-owner -h RDS_DB_HOST -U USER_ID -d DB_NAME -j 8 --format=d heroku-prod-dump

Hot tip: The most important parameter is “-j” as emphasized in the notation.

This runs the dump in parallel by dumping njobs tables simultaneously. This option not only reduces the time of the dump, but it also increases the load on the database server. You can only use this option with the directory output format because this is the only output format where multiple processes can write their data at the same time.



pg_dump will open njobs + 1 connections to the database, so make sure your max_connections setting is high enough to accommodate all connections.

This step took us approximately 3 hours combined, one hour for pg_dump and two hours for pg_restore.

We also took this opportunity to take a look at unused indices:

select s.schemaname as sch, s.relname as rel, s.indexrelname as idx,

s.idx_scan as scans, pg_size_pretty(pg_relation_size(s.relid)) as ts,

pg_size_pretty(pg_relation_size(s.indexrelid)) as "is" from

pg_stat_user_indexes s join pg_index i on i.indexrelid=s.indexrelid

left join pg_constraint c on i.indrelid=c.conrelid and

array_to_string(i.indkey, '') = array_to_string(c.conkey, '') where

i.indisunique is false and pg_relation_size(s.relid) > 1000000 and

s.idx_scan < 10000000 and c.confrelid is null order by s.idx_scan asc,

pg_relation_size(s.relid) desc;

Hot tip: If you have big unused indexes, then dropping them before doing any migration will save you the most time compared to all other tweaks you can apply to pg_dump or pg_restore. In our case, we had some indexes on `jsonb` columns which were not used.

Restore pg_settings back to default 9.6 settings on RDS. Create a read replica on RDS. Detach follower database from apps on Heroku.

<span style="font-weight: 400">heroku addons:detach HEROKU_DB_FOLLOWER -a staging-survey</span>

Point apps using follower database to read replica on AWS.

<span style="font-weight: 400">heroku config:set DATABASE_URL=AWS_RDS_READ_REPLICA -a staging-survey</span>

ASIDE: You may have noticed in the diagram above that we use PG_BOUNCER on the app using follower database. We found that just setting `DATABASE_URL` is not enough. For some reason it worked in the staging environment but did not work in production. Please let us know if you know why. We had to explicitly create a new environment variable AWS_RDS_RO_URL (name could be anything other than DATABASE_URL)

<span style="font-weight: 400">heroku config:set PGBOUNCER_URLS=AWS_RDS_RO_URL -a staging-survey</span>

Hot tip: To be on the safer side, create an environment variable that is not named DATABASE_URL for PG_BOUNCER to work.

Point app using master database to master AWS RDS

Based on how we were able to do this step for follower database, our instinct was to do the same.

<span style="font-weight: 400">heroku config:set DATABASE_URL=AWS_RDS_MASTER -r staging-mainapp</span>

Hot tip: It turns out that if a master database is attached to the billing app, in Heroku’s lingo, you cannot detach it without destroying the database.

We did not want to destroy the database in case we needed to abort the migration operation, Murphy’s law and all.

We could not find this information anywhere on the interwebs. Our sincere thanks to Heroku support for suggesting an elegant workaround.

<span style="font-weight: 400">heroku addons:attach HEROKU_DB_NAME_FOR_MASTER --as my-backup_db -r staging-mainapp</span>

Now you can detach DATABASE_URL and point it to RDS.

heroku addons:detach DATABASE -r staging-mainapp heroku config:set DATABASE_URL=AWS_RDS_URL -r staging-mainapp

Hot tip: Provide a secondary attachment for the current DATABASE_URL so that it has something to hold onto. Basically, you end up having two attachments to the same database.

How did we do?

So far, our new RDS instance choices have been working out very well.

Here is a graph of API call volume in last 3 months:

Here is a graph of response times. You can see that there are no red bars which means errors and slow responses times are things of past for now.

ASIDE: Tuning the new DB for Customer Support Use

AWS does not have out of the box tool like Heroku’s Dataclip but our Customer Success and Support teams rely heavily on it for intelligent customer support.

Read Replica Optimization

Our CS people usually run queries that take an excruciatingly long time, i.e. more than 30 seconds. The default setting on RDS is to timeout the queries that take longer than 30 seconds. We ended up updating the following parameters on read replica instance powering our new new shiny Data Clip.

Other not so hot, but sanity preserving tips

Dry run. Dry run. Dry run. -- We did 3 dry runs in our non-production environment to gain some muscle memory, to have a good sense of how long this whole process would take, and have a checklist to perform after each step. Since we successfully reverted changes back to the original state each time, we were confident that even in the worst case of aborting, we would not have an inconsistent state.

Practice helped us not to panic during step no. 10 when PG_Bouncer config did not work as expected in production. Because we had looked into PG_Bouncer code we knew different ways to configure it.

During each dry run and for actual production migration, German and I were on screen sharing and double checking each other’s work.

While making jokes during a dry run was allowed, it was not allowed during production migration. No joke can relieve you from the stress of production data migration so why complicate things? YMMV.

It took us all together 7 hours, while we had estimated it to be 7.5 hours. It helped set correct expectation with our families. Very important, especially if it involves a weekend.

Between the two of us, I think we consumed non-trivial amounts of coffee and chai.

While we knew infra-ops is stressful, it gave us more appreciation of how stressful it actually is.

I am sure there are optimizations we missed -- both on the technical side as well as on the process side. We would be happy to know your tips on how we could have done better. Please comment here or tweet us.