GitLab.com was recently migrated from Azure to Google Cloud Platform (GCP) to make it better suited to mission-critical workloads that require the lowest error rates and the highest availability. The key to making this possible without lengthy downtime was mirroring GitLab.

Among the reasons behind GitLab's decision to migrate to GCP was the aim of improving performance and consistency. GCP's support for Kubernetes, writes GitLab's Chrissie Buchanan, was another compelling factor. However, the project to have GitLab run entirely on Kubernetes was postponed until after the migration was complete.

The GitLab team quickly realized that the naive approach of shutting down GitLab.com, copying all data from Azure to GCP, changing the DNS to point to the new servers, and restarting the services was not going to be feasible. Indeed, copying about half a petabyte of data and then verifying that all of it had been transferred correctly was going to require a huge amount of downtime.
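A rough back-of-envelope estimate illustrates the scale of the problem. The sketch below assumes a sustained 10 Gbit/s link and decimal units (1 PB = 10^15 bytes); both are illustrative assumptions, not figures reported by GitLab, and the helper name `transfer_hours` is hypothetical.

```python
PETABYTE = 10**15  # bytes, decimal units (assumption)


def transfer_hours(size_bytes: float, link_gbit_per_s: float) -> float:
    """Hours needed to copy size_bytes over a link sustaining link_gbit_per_s."""
    bytes_per_second = link_gbit_per_s * 1e9 / 8
    return size_bytes / bytes_per_second / 3600


# About half a petabyte, as stated in the article.
data_size = 0.5 * PETABYTE

# 10 Gbit/s is an assumed sustained throughput, not a measured value.
hours = transfer_hours(data_size, link_gbit_per_s=10.0)
print(f"{hours:.0f} hours (~{hours / 24:.1f} days) for the copy alone")
```

Under these assumptions the copy alone would take roughly 111 hours, or more than four days, before any verification pass is counted.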

Therefore, GitLab engineers took a different route: adding a new feature to GitLab to enable mirroring across multiple, self-synchronizing GitLab instances. Usually, mirroring is used to improve performance and reliability when distributing data across the cloud. In this case, it would also be used to fail over from the Azure-based service and replace it with the one running on GCP.

The new feature was called Geo, and it made it possible to migrate all GitLab data without stopping the main service. When all the data had been transferred, GitLab engineers started working on a procedure to fail over from Azure to the GCP environment and then make the latter the new primary. This procedure was extremely delicate, and it took a long, iterative fix-and-try process to get right.
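The general replicate-then-promote pattern described above can be sketched as a toy model. This is not Geo's implementation: the `Site` class and the `replicate` and `fail_over` functions are hypothetical names, and repositories are reduced to a dictionary of revision labels purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    name: str
    role: str  # "primary", "secondary", or "retired"
    repos: dict = field(default_factory=dict)  # repo name -> revision label
    read_only: bool = False


def replicate(primary: Site, secondary: Site) -> int:
    """Copy every repo that is missing or stale on the secondary; return the count."""
    copied = 0
    for repo, revision in primary.repos.items():
        if secondary.repos.get(repo) != revision:
            secondary.repos[repo] = revision
            copied += 1
    return copied


def fail_over(primary: Site, secondary: Site) -> None:
    """Freeze writes, drain remaining changes, verify, then promote the secondary."""
    primary.read_only = True          # the short downtime window starts here
    replicate(primary, secondary)     # drain changes made since the last sync
    if secondary.repos != primary.repos:
        raise RuntimeError("secondary is not in sync; aborting failover")
    primary.role = "retired"
    secondary.role = "primary"
    secondary.read_only = False


azure = Site("azure", "primary", {"project-a": "rev-42", "project-b": "rev-7"})
gcp = Site("gcp", "secondary")

replicate(azure, gcp)                 # bulk sync while the service stays live
azure.repos["project-a"] = "rev-43"   # a write lands after the bulk sync
fail_over(azure, gcp)
print(gcp.role, gcp.repos == azure.repos)
```

The point of the pattern is that the bulk copy happens while the primary is still serving traffic, so only the final drain and verification fall inside the downtime window.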

InfoQ spoke with Brandon Jung, VP of Alliances at GitLab, and Andrew Newdigate, Staff Engineer, Infrastructure, to learn more.

The migration process as you described it was pretty straightforward once you understood the key was using mirroring, except for handling the final failover step. Could you describe some of the difficulties you found there and how you solved them?

Andrew: One example. This was a small issue, but it was one we overlooked for a long time: GitLab the company uses GitLab.com for almost all of its workflow, including the failover process, which was implemented as an issue template, in Markdown, on GitLab.com.

Since we were practicing the failover against our staging instance, but our workflow was running on GitLab.com, our production instance, it was always available during the staging failover.

It was only quite late in the migration process that somebody pointed out that we would not be able to use GitLab.com for our workflow during the actual production failover, as it would be down.

In hindsight, this was a very obvious problem, but somehow we had missed it, probably because we are so used to using our product.

The solution was easy: we used GitLab’s push mirroring feature to maintain a replica of the Migration project on a separate, internal GitLab instance. Any changes made to GitLab.com would be replicated to the mirror. During the failover we would use that instance instead of production.

GitLab.com is reporting a significant improvement in both daily error rates and overall service availability. How do you explain the fact that moving to GCP brought such an improvement?

Brandon: Consistent networking on GCP was a critical part that improved the API endpoint performance.

Andrew: Brandon mentioned networking and this has been one of the major factors. I also discussed other reasons why we’ve seen the improvements we have, in this blog post. In that article, I discuss 5 other factors, not all of them technical.

Have you factored out the contribution to those improvements that might have come from the use of Geo?

Andrew: Geo is a replication and offsite mirroring solution, also suitable for disaster recovery, but it’s not something that in itself would help improve the availability of GitLab.com. In fact, once the failover was complete, we disabled Geo on GitLab.com. We turned it back on a few months ago, and it now functions as an offsite backup solution. In the event of a major, data-centre-wide outage, we would fail over to it; however, it does not receive production traffic directly.

The migration process was no easy task to carry through. What did the GitLab team learn from migrating to GCP?

Brandon: Overall we were able to successfully move from Microsoft Azure to Google Cloud because of our culture of development and our focus on open source. By moving to Google Cloud Platform and giving users an easy way to get started with Kubernetes, GitLab improved performance, availability, and customer service.

Andrew: The GCP Migration project was one of the largest engineering projects carried out by GitLab up until that time. It required coordination between multiple teams across the organisation: Engineering (Geo, CI, Plan), QA Team, Infrastructure, Marketing, and Support. Multiple work streams needed to be delivered in coordination. GitLab has a very well-defined and mature engineering delivery process that is well suited to our remote-only culture, but we needed to adapt and extend these processes to deliver this company-wide project on time and safely. Since we were dogfooding GitLab (the product) to deliver the project, we were able to provide feedback to our product managers about areas that could be improved for running large multi-team projects. Our product management team were extremely responsive, and much of our feedback has now been incorporated into the product.

If you are interested in the full details of the GitLab migration to GCP, we recommend reading the related blog posts on the GitLab website.