The following post describes how we migrated a large NoSQL database from one vendor to another, in production, without any downtime or data loss. The methodology described here worked for us and proved to be safe and simple. There are surely other techniques for migrating NoSQL databases, but these guidelines let you do it with minimal risk while keeping the ability to roll back at almost any point in the process. This post is based on a talk I gave at DevOpsDays in Nov 2017.

At AppsFlyer, we have many NoSQL databases. AppsFlyer's data processing pipeline updates an internal state as it processes the events we get from mobile devices. This state is mission critical and is processed at high throughput.

One of those databases is our app-installs database. It takes a write hit for roughly every new install that AppsFlyer attributes, and an additional read hit whenever AppsFlyer receives any launch or in-app event.

Some numbers (true to the time of writing):

- ~2,000 write IOPS
- ~180,000 read IOPS
- ~2 billion records stored, with a replication factor of 3

And, yes, it is mission critical.

But the time had come to replace it. It took us very far, and for that it deserves credit. But lately, as traffic and the resulting throughput demand have risen, it has simply not been able to deliver the stability we need.

Here are a few examples of that instability:

- Scaling out went out of control: we were using 45 r3.4xl instances.
- Daily backups could not keep up and took longer than a day to complete.
- XDCR (cross-datacenter replication) stopped working.
- Losing a node (and adding one back) came at a performance hit and caused system instability, and with that many nodes on AWS the probability that something will go wrong is very high.
- We were not happy with the answers we got from the paid support.

These were all issues we could no longer live with.

Migrating data is always hard. It is hard enough with relational databases, and of course harder still in production, while the databases are in use. Doing it with NoSQL databases is even more complex, because among the dozens of NoSQL technologies out there, there is nothing in common: no common way to serialize or represent data, no standard query language, no schema, and when it comes to out-of-the-box migration tools, you are on your own.

So after some research and trials we chose Aerospike. We had a few months until our enterprise plan with Couchbase was due to expire, so we started planning. Since most vendors, for obvious reasons, do not encourage migrating data out of their database, we were left with two basic options: write a migration tool ourselves ('out-of-band migration') or use the application to migrate the data for us ('in-band migration'). In many cases one of these methods is sufficient, for reasons I will enumerate shortly; in other cases you will have to use a combination of the two, as we did.

First phase (migrating inserts and updates):

The basic idea here is to use your application code to duplicate both inserts and updates ('upserts') to your new database. In other words, every write or update performed on the old database is also performed on the new one. This effectively starts migrating all of your new data to the new cluster going forward. In many cases, a delete operation counts as a write here, at least from the application's point of view.
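As a rough sketch, the dual-write step might look like the following. This is a hypothetical illustration, not our actual code: `old_db` and `new_db` stand in for the two vendors' clients, and the method names (`upsert`, `delete`) are assumptions rather than any real vendor API.

```python
class DualWriteStore:
    """Mirror every write and delete from the old database to the new one.

    A minimal sketch, assuming both clients expose `upsert(key, value)`
    and `delete(key)`; real Couchbase/Aerospike clients differ.
    """

    def __init__(self, old_db, new_db, on_error=None):
        self.old_db = old_db
        self.new_db = new_db
        # Failures on the new database are reported, not propagated.
        self.on_error = on_error or (lambda exc: None)

    def upsert(self, key, value):
        # The old database remains the source of truth: its write must succeed.
        self.old_db.upsert(key, value)
        # The new database gets a best-effort copy; a failure here must not
        # break the request, since the new data is still fully redundant.
        try:
            self.new_db.upsert(key, value)
        except Exception as exc:
            self.on_error(exc)

    def delete(self, key):
        # From the application's point of view, a delete is just another
        # write, so it is mirrored the same way.
        self.old_db.delete(key)
        try:
            self.new_db.delete(key)
        except Exception as exc:
            self.on_error(exc)
```

Swallowing errors from the new database is deliberate at this stage: its data is redundant, so a failed mirror write should cost you a log line and a metric, never a dropped request.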

This has a few implications:

- You are changing your application code. This may impact application performance, introduce new bugs, and complicate existing code. Take those risks into account.
- The data being written into the new database is completely redundant. If (when) you discover a bug, it is okay to just wipe out the entire database and start over. The data is still being served from the good, old database, so the risk of data loss/corruption is low.
- You can start with a small cluster and scale out as you go. This is a good opportunity to practice scaling out.

Testing the first phase:

This is the time to do some data sanity tests:

Compare the migrated data between the two databases and verify it meets your expectations. Note that many databases have different semantics for record TTLs, if you use them.
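A sanity check like this can be as simple as sampling keys written since the dual-write deploy and diffing the two databases. The sketch below is illustrative; `get_with_ttl` is an assumed helper returning `(value, ttl_seconds)` or `None`, and the TTL slack accounts for vendors expiring records with slightly different semantics.

```python
import random

def compare_sample(keys, old_db, new_db, sample_size=1000, ttl_slack=60):
    """Compare a random sample of keys between the two databases.

    Returns a list of (key, reason) mismatches. Assumes both clients
    expose get_with_ttl(key) -> (value, ttl_seconds) | None.
    """
    mismatches = []
    for key in random.sample(keys, min(sample_size, len(keys))):
        old_rec = old_db.get_with_ttl(key)
        if old_rec is None:
            continue  # expired or deleted since we sampled; skip it
        new_rec = new_db.get_with_ttl(key)
        if new_rec is None:
            mismatches.append((key, "missing in new db"))
        elif old_rec[0] != new_rec[0]:
            mismatches.append((key, "value differs"))
        # TTL semantics differ between vendors, so allow some slack
        # rather than demanding an exact match.
        elif abs(old_rec[1] - new_rec[1]) > ttl_slack:
            mismatches.append((key, "ttl differs"))
    return mismatches
```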

Get to know the new technology:

- Look at the logs.
- Set up automatic backups, and restore from them!
- Inspect the actual RAM and disk usage of the database and check that it meets your initial sizing expectations.
- Wire up your metrics system and configure the proper alerting. And yes, you guessed it: test your alerts.

Remember, after all, this is soon to be your shiny new production database.

Second phase:

This phase is a bit trickier, but still very simple. Once you have enough confidence in the new database, it is time to start serving from it. The idea is to serve data from the new database and, if you fail to find a record there, fall back to the old database and serve it from there.

An important note: configure a metric that counts the cases where a record was not found in the new database but was found in the old one. Such a case means the record has not yet been migrated by the application. It also means that once this metric stops rising for long enough, the migration is over.
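The read path for this phase can be sketched as follows. Again a hypothetical illustration: `old_db`/`new_db` stand in for the two clients, and `metrics.incr` mimics a StatsD-style counter; the metric names are made up for the example.

```python
class MigratingReader:
    """Serve reads from the new database, falling back to the old one.

    A minimal sketch, assuming both clients expose get(key) -> value | None.
    """

    def __init__(self, old_db, new_db, metrics):
        self.old_db = old_db
        self.new_db = new_db
        self.metrics = metrics

    def get(self, key):
        try:
            value = self.new_db.get(key)
        except Exception:
            # The new database is down: fail over to the old cluster,
            # which still holds all the data.
            self.metrics.incr("newdb.error")
            return self.old_db.get(key)
        if value is not None:
            return value
        value = self.old_db.get(key)
        if value is not None:
            # Found only in the old database: not migrated yet. When this
            # counter stops rising for long enough, the migration is over.
            self.metrics.incr("migration.miss_and_hit")
        return value
```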

This phase has a few implications:

- If your new database crashes, the application code should quickly fail over to the old cluster.
- The old database is still updated with all the data, so it is still safe to go back and start from scratch if (when) you find that something went wrong.
- This mechanism may have latency and other performance implications. Nothing is free, you see.
- This is the first true combat test of your new database. For safety, if possible, roll out the use of the new database gradually. Try serving non-VIP data from it first, such as your beta users, or even generated test data.

The important thing is that the birth pains of your new database will affect only your less important data. If you are in a hurry, trigger an explicit migration for every such miss-and-hit: for every miss-and-hit event, copy the missing record from the application to the new database. This will speed up the migration. Make sure you can support the additional I/O load, both on the databases and on the application servers.
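This copy-on-miss variant is a small extension of the fallback read. As before, the clients, method names, and metric names below are illustrative assumptions, not a real API.

```python
def get_with_repair(key, old_db, new_db, metrics):
    """Fallback read that also migrates the record on a miss-and-hit.

    A sketch, assuming get(key) -> value | None and upsert(key, value)
    on both clients, and a StatsD-style metrics.incr(name).
    """
    value = new_db.get(key)
    if value is not None:
        return value
    value = old_db.get(key)
    if value is not None:
        metrics.incr("migration.miss_and_hit")
        # Best-effort copy into the new database. This adds write load on
        # the new database and on the application servers, so budget for
        # the extra I/O before turning it on.
        try:
            new_db.upsert(key, value)
        except Exception:
            metrics.incr("migration.repair_failed")
    return value
```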

Testing time again.

Whatever you tested after the first migration phase, reiterate. The load should be much higher, and the database should be larger too, in order to support the new read capacity. If you can afford it, it is highly recommended to leave it working for a few days, or even weeks. Traffic may vary, and some errors might appear only after some mileage.

Is that it?

For quite a lot of use cases this is enough. If your records are short-lived and/or the application naturally hits every record frequently, you just need to wait it out. Sit back and watch the miss-and-hit metric diminish to zero. To follow that metric, we used our metrics stack, which includes StatsD, Graphite and Grafana. The stack itself does not matter, as long as you can track the frequency of this metric over time.

A record-by-record explicit migration:

For more sensitive or complex use cases, you will need to make sure that the entire set of records has been migrated to the new database before the sun sets on the old one. What we had to do now was enumerate the entire set of keys in the old database and migrate the missing ones.
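The final sweep can be sketched as below. The client methods (`scan_keys`, `exists`, `get`, `upsert`) are assumptions for the example; a real key enumeration would go through the old database's scan or streaming API and would typically be batched and rate-limited so it does not compete with production traffic.

```python
def migrate_missing(old_db, new_db):
    """Enumerate all keys in the old database and copy any the new one lacks.

    Returns (migrated, skipped) counts. A sketch only: real scans should be
    batched and throttled to protect production traffic.
    """
    migrated = skipped = 0
    for key in old_db.scan_keys():
        if new_db.exists(key):
            skipped += 1
            continue
        value = old_db.get(key)
        if value is not None:  # the record may have expired or been deleted
            new_db.upsert(key, value)
            migrated += 1
    return migrated, skipped
```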