Motivation

Some time ago at Mediatoolkit, we had our application deployed on multiple servers in a single data center in the US, and it worked fairly smoothly.

As time went by, our demand for servers increased and so did our expenses. We found a better deal with another hosting service and decided to make a full migration to a new data center, located in Europe (Croatia).

At first, it seemed like a straightforward job: simply stop all services and databases, copy everything to the new data center, and then start all databases and services there. We were prepared for some downtime in between, but nothing too dramatic.

Problems

I started playing around by copying random files to test how fast the transfer would be. Using scp, the maximum transfer speed was around 0.5MB/sec, which was quite a shock. I thought that scp's encryption must be adding some overhead, so I tried good old netcat, and…

…you guessed it, 0.5MB/sec again.

To resolve the issue, we contacted both of our hosting providers to complain about the link throughput. They responded that the service worked fine and that the trans-ocean link had a throughput of 10Gbit/sec. After some time and a couple of exchanged emails, it was obvious that we would not get better connection throughput.

Our data warehouse held around 5TB of data, and at 0.5MB/sec it would take a couple of months to get the transfer done. We figured it would be faster if they just unplugged all the hard drives and sent them by mail. Not by airplane, but by boat.
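
To put a number on that estimate, here is a quick back-of-the-envelope calculation. It is only a sketch: it assumes decimal units (1TB = 10^12 bytes, 1MB = 10^6 bytes) and a perfectly steady transfer rate, and the function name is made up for this illustration.

```python
SECONDS_PER_DAY = 86_400
TB = 10**12  # decimal terabyte (assumption)
MB = 10**6   # decimal megabyte (assumption)

def transfer_days(size_bytes, rate_bytes_per_sec):
    """Days needed to move size_bytes at a constant rate."""
    return size_bytes / rate_bytes_per_sec / SECONDS_PER_DAY

# 5TB over a single 0.5MB/sec connection
days = transfer_days(5 * TB, 0.5 * MB)
print(f"{days:.0f} days")  # prints "116 days"
```

Under these assumptions a single connection needs months, which is what ruled out the simple stop-copy-start plan.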

Databases we were using were:

MongoDB (some instances of mongo-rocks and some instances of tokumx storage engine)

Lucene

RocksDB

Redis

plain filesystem for some binary files (images, pdf documents)

Another thing I noticed was that I could run two scp transfers at the same time and each of them still got 0.5MB/sec. Then I tried 3 transfers and still got 0.5MB/sec each, totaling 1.5MB/sec.

Then came the idea to see how many simultaneous transfers I could spin up before the total speed stopped going up.

I hit the limit at 30 parallel connections, totaling 10MB/sec. Fantastic, a 20x speed-up, but unfortunately still too slow. At that speed we could expect, at best, almost 5 days of downtime if everything ran smoothly.

We definitely couldn’t afford such a long downtime and needed to figure out something else.

It was one of those moments when you can’t do anything but bang your head against the wall.

Implementation

After a while, an idea popped up.

What if we copied all databases while they were operating and later stopped the databases and then re-synced all the changes?

The first tool that came to mind was rsync.

I experimented with it for a while and discovered a couple of problems in our case:

In our initial tests, it was slow when the directory contained a huge number of small files.

We also couldn’t use multiple connections to perform the transfer (though it’s probably doable with a complex bash script that handles multiple files concurrently).

It seemed to hang for hours when going through large files (>100GB).

Another option was to use replication for the MongoDB and Redis instances. The problem was that the connections were too slow to handle peak write rates, so the replicas couldn’t keep up with the primaries. On top of that, Lucene and RocksDB are libraries, which have no concept of a service or of replication.

The next thing I did was start writing a prototype application that could perform the transfers.

Although I felt like I was reinventing the wheel, I had to try something as none of the other solutions worked for us.

The basic idea, inspired by how torrents work, was to split a file into smaller chunks and then handle each chunk separately.

Now, let’s say you want to sync a source file with its remote counterpart. There are a couple of cases:

A remote file does not exist: full file transfer is needed.

A remote file exists with the same size and the same digest hash: the two files are almost certainly identical, so nothing needs to be transferred.

Both files exist but their sizes and/or digest hashes mismatch: now we know that the source and destination files are out of sync, and further analysis will determine which portions of the file need to be re-transferred.

In the last case we don’t want to give up and transfer the whole file, but rather split both files into chunks and treat each chunk as a mini-file. Now, for each pair of corresponding chunks, we can do the same comparison of digest hashes. If they match, we consider the chunks equal. If not, we transfer that particular chunk.
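
The three cases and the chunk-level comparison can be sketched roughly like this. This is not Pareco's actual code: both files are local paths standing in for the two ends of the connection, SHA-256 and the 1MiB chunk size are my own assumptions (the digest and chunk size are not specified above), and the function names are hypothetical. Edge cases such as empty files or a destination that is longer than the source are left out.

```python
import hashlib
import os

CHUNK_SIZE = 1024 * 1024  # hypothetical chunk size, chosen for illustration

def file_digest(path, offset=0, length=None):
    """SHA-256 of a byte range of a file (the whole file when length is None)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        f.seek(offset)
        remaining = length
        while remaining is None or remaining > 0:
            block = f.read(65536 if remaining is None else min(65536, remaining))
            if not block:  # end of file
                break
            h.update(block)
            if remaining is not None:
                remaining -= len(block)
    return h.hexdigest()

def chunks_to_transfer(src, dst, chunk_size=CHUNK_SIZE):
    """Return the indices of the source chunks that have to be sent."""
    src_size = os.path.getsize(src)
    n_chunks = (src_size + chunk_size - 1) // chunk_size
    # Case 1: the destination file does not exist, so send everything.
    if not os.path.exists(dst):
        return list(range(n_chunks))
    # Case 2: same size and same digest, so assume the files are equal.
    if os.path.getsize(dst) == src_size and file_digest(src) == file_digest(dst):
        return []
    # Case 3: out of sync, so compare chunk by chunk and keep the mismatches.
    return [
        i for i in range(n_chunks)
        if file_digest(src, i * chunk_size, chunk_size)
        != file_digest(dst, i * chunk_size, chunk_size)
    ]
```

For example, with a 4-byte chunk size, a source containing aaaabbbbcccc and a destination containing aaaaXXXXcccc would yield [1]: only the middle chunk has to travel over the wire.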

Note that the processing of different chunks of a file is totally independent and therefore highly parallelizable. The same argument holds for many small files: each one can be handled concurrently, which speeds up the transfer.

The speed-up from processing files and chunks concurrently comes from the fact that exchanging metadata about them is mostly bound by network latency. A typical flow is “ask the server if the file exists”, then wait for a response, “ask the server to calculate digest hashes”, then wait for a response, and so on. With parallelization, we get to fill the gaps in which our application would otherwise do nothing but wait for the server, and thus do more work in less time.
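
To illustrate why concurrency helps with latency-bound requests, here is a small simulation. It is only a sketch: the "server" is a function that sleeps for a made-up round-trip time, and the names, the 20ms latency and the default of 30 workers are assumptions for the example, not measurements from our setup.

```python
import time
from concurrent.futures import ThreadPoolExecutor

LATENCY = 0.02  # simulated round-trip time in seconds (hypothetical)

def remote_digest(chunk_index):
    """Stand-in for a request to the server: all the time is spent waiting."""
    time.sleep(LATENCY)
    return f"digest-{chunk_index}"

def sequential(indices):
    """Ask for one digest at a time, waiting for each response."""
    return [remote_digest(i) for i in indices]

def parallel(indices, workers=30):
    """Keep many requests in flight so the waiting periods overlap."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(remote_digest, indices))

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

indices = list(range(20))
seq_result, seq_time = timed(sequential, indices)
par_result, par_time = timed(parallel, indices)
print(f"sequential: {seq_time:.2f}s, parallel: {par_time:.2f}s")
```

Both variants return the same digests in the same order. The sequential one pays the round trip once per request, while the threaded one overlaps the waits, so its total time stays close to a single round trip as long as there are enough workers.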

Preparation

After I completed the first working prototype, it was time to test it against a live production database. The test was on a MongoDB instance, approximately 120GB in size. We started transferring the files of the running database and it ran for a couple of hours. After the first round of transfer completed, it needed just one more run to sync the changes that had occurred during the first round. We stopped the database and started the sync. It completed in 15 minutes and we started the original database back up again.

And then, the moment of truth… would the copied database be OK, without any corruption?
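
The two-round procedure can be mimicked with a toy in-memory model. This is a sketch under loud assumptions: byte arrays stand in for the database files, the chunk size is a tiny 4 bytes so the example stays readable, SHA-256 is my own choice of digest, and the "live write" is simulated with a callback. None of these names come from the real tool.

```python
import hashlib

CHUNK = 4  # tiny hypothetical chunk size, for readability only

def digest(data):
    return hashlib.sha256(bytes(data)).digest()

def sync_pass(source, dest, on_chunk_copied=None):
    """Copy every chunk whose digest differs; return how many were copied."""
    copied = 0
    for start in range(0, len(source), CHUNK):
        chunk = bytes(source[start:start + CHUNK])
        if digest(chunk) != digest(dest[start:start + CHUNK]):
            dest[start:start + len(chunk)] = chunk
            copied += 1
            if on_chunk_copied:
                on_chunk_copied(start // CHUNK)
    return copied

source = bytearray(b"aaaabbbbccccdddd")  # the "live database file"
dest = bytearray()                       # the empty new server

def live_write(chunk_index):
    # Simulated database write: chunk 0 changes after it was already copied.
    if chunk_index == 1:
        source[0:4] = b"AAAA"

# Round 1: the database keeps writing while we copy.
first = sync_pass(source, dest, on_chunk_copied=live_write)
consistent_after_first = dest == source

# Round 2: the database is stopped, so nothing changes underneath us.
second = sync_pass(source, dest)

print(first, consistent_after_first, second, dest == source)
# prints "4 False 1 True"
```

The first round moves all the data but leaves a stale chunk behind. The second round, run with writes stopped, only has to move the chunk that changed, which is why the final sync is so much shorter than the initial copy.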

…

It was!

We felt huge relief knowing that the prototype worked and that we would actually be able to migrate all the data in a reasonable amount of time.

A week before the final migration, we started the migration scripts to slowly copy the data while all the databases were running and performing writes.

There were a couple of failed transfers, which we restarted so they could continue where they left off. Once everything had been copied for the first time, the copy was not yet consistent, since the databases had kept writing in the meantime, but it was now possible to take the final step of our migration.

Doing it

So, the day of migration came.

Our users were notified that there would be downtime during the day.

Afterward, we stopped all services and databases and started running our migration scripts. We completed the re-sync of everything in a mere 50 minutes, much better than the ETA of a couple of hours. Then we started all the databases and services and, voila! The whole system was up and running.

The only thing we did afterward was some checking and monitoring to see if everything worked fine. It did, so we decided that we could start up the web server and call it done.

Conclusion

I find it amazing that we reduced the migration downtime from a couple of months to a single hour. I feel especially proud of having successfully developed and used a custom-made tool for this migration.

Even if a simple migration may sound basic, and the whole dataset wasn’t too big, it was still a challenge to do it with the least possible downtime.

We decided to open-source our project and named it Pareco, which stands for PArallel REmote COpy.

You can check out the project on GitHub: Pareco.