As part of improving the user experience within our company’s ecosystem, we decided to extract an important piece of functionality from a huge monolithic mess into a separate, independent service. Once the service was ready, the next step was to migrate millions of records from the old database, hosted in MongoDB, to a new one in Postgres. On top of that, the source database contained a lot of redundant and duplicate data that we wanted to clean up during this ETL process.

Since we were already using Go for most of our services, it was the obvious choice for building our migration tool as well. This article will mostly discuss the design decisions we made along the way and focus less on the code itself.

TL;DR — We managed to reduce the average time to migrate 1 million records from ~30 minutes to less than 25 seconds.

First Attempt — The quick and dirty way

Our first attempt was to distribute the load evenly across an equal number of Go-routines and let each handle its share independently. The process involved the following steps:

Getting the total number of records that we needed to import.

Dividing it into equal chunks and distributing the load between multiple workers. For instance, 1 million records can be divided into chunks of 100K each and handled by 10 Go-routines.

Letting each worker query the source database (Mongo in our case), transform its part of the data, and save it to the destination database (Postgres in our case).

Distributing workload to an equal number of Go-routines
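The chunking step above can be sketched roughly as follows. The `chunkRanges` helper and the worker body are illustrative, not the actual tool: each worker would query Mongo for its range (e.g. with skip/limit), transform the documents, and batch-insert them into Postgres.

```go
package main

import (
	"fmt"
	"sync"
)

// chunkRanges splits total records into contiguous [start, end) ranges,
// one per worker. E.g. 1 million records across 10 workers gives
// ten ranges of 100K each.
func chunkRanges(total, numWorkers int) [][2]int {
	size := (total + numWorkers - 1) / numWorkers
	var ranges [][2]int
	for start := 0; start < total; start += size {
		end := start + size
		if end > total {
			end = total
		}
		ranges = append(ranges, [2]int{start, end})
	}
	return ranges
}

func main() {
	total := 1_000_000 // in practice, a count query against the source DB
	var wg sync.WaitGroup
	for _, r := range chunkRanges(total, 10) {
		wg.Add(1)
		go func(start, end int) {
			defer wg.Done()
			// Here each worker would read its slice from Mongo,
			// transform it, and write it to Postgres.
			fmt.Printf("worker handling records %d..%d\n", start, end)
		}(r[0], r[1])
	}
	wg.Wait()
}
```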

We tested with a few thousand records and it seemed to work perfectly. After all, how hard can it be, right? But our excitement was short-lived. As we ran the tool against 100 thousand to a million records, the migration got slower and slower.

The average time to migrate 1 million records on an otherwise idle database was ~30 minutes. That means, theoretically, it would take more than 5 hours to migrate 10 million records, even during low-traffic hours. And we had a lot more than that to migrate.

Why not just increase the number of workers? Just because you can doesn’t mean you should. Our workers were already fast enough; adding more didn’t help much and actually made things worse.

Debugging the tool

Since we were confident that our workers were fast enough, we started by checking database metrics in an attempt to figure out the issue. We were using Postgres’ multi-row insert, with an update on conflict, in batches of 1,000 records at a time. So for a million records we would run 1,000 transactions in Postgres, which seemed perfectly reasonable, and the metrics didn’t show anything suspicious.