Why and how we converted 89 git repositories to one monorepo

I work at Impossible Software, where we offer realtime video personalization as SaaS.

Two years ago we switched from manually setting up Git repositories on a server to using Gitlab for managing Git repositories. The number of Git repositories grew rapidly, mostly because of a 1:1:1 mapping of Git repository → Jenkins job → Debian package.

Skip to the last section if you’re only interested in the how and not so much in the why.

ETOOMANYREPOS

Overall there were 100+ repositories with 89 in active use.

├── api
│   ├── router
│   ...
├── deployment
│   ├── ami-builder
│   ├── aws-infrastructure
│   ...
├── packaged
│   ├── awscli
│   ├── docker
│   ├── elasticsearch
│   ...
├── support
│   ├── ldap-auth-server
│   ├── runit-addons
│   ...

The problem with the sheer number of repositories was discoverability. It wasn’t clear which piece of functionality could be found where, and people were only subscribed in Gitlab to the repositories they needed for their own work. That is one problem a monorepo fixes: you always have all the code in a single local checkout, and a quick run of ack or find will turn up whatever you are looking for.
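As a toy illustration (directory and file names are made up, not from our actual tree), a single grep or find run searches across all former repositories at once:

```shell
# Build a tiny stand-in for a monorepo checkout.
set -e
cd "$(mktemp -d)"
mkdir -p api/router deployment/ami-builder
echo 'func RouteRequest() {}' > api/router/router.go
echo 'build_ami() { :; }'     > deployment/ami-builder/build.sh

# One search covers what used to be dozens of separate checkouts.
grep -rl RouteRequest .      # -> ./api/router/router.go
find . -name 'router.go'     # -> ./api/router/router.go
```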

Build process has become absurd

We used Debian packages as build artifacts, which made a lot of sense when the final result was Amazon Machine Images composed of Debian packages.

When we switched to deploying software with Docker in early 2015 we took an incremental approach and made Docker images like the existing AMIs.

So here’s how the software ended up in Docker images:

Git repo → Jenkins job → Debian package created by fpm → local Debian repository managed by aptly → aptly pushes to S3 repository → docker build --no-cache=true → Docker image

The remote repository on S3 is needed because we also install Debian packages directly on the EC2 hosts. This would still be ok if all you were doing was building final Docker images for production. But what about development, testing, etc.? Every build of a Docker image has to go through this pipeline, so fixing anything means modifying the working copies in question, pushing to master, opening the Jenkins web interface, and waiting for the final push to the S3 repository to complete before building the Docker image again. We’re talking durations of at least a minute, plus manually watching the Jenkins job. It takes longer when the aptly repository gets spammed with packages and ends up containing hundreds of package versions; then we’re talking 2+ minutes.

This is an absurdly convoluted path when all we really want is to get local code into a locally built Docker image for development or testing.

Dependencies

There were several dependencies between repositories that were problematic. Our core product is written in C++ and has Python bindings. The only project that actually used these Python bindings, however, lived in a separate repository. So these two repositories had to be updated in lockstep.

There was common code for Python projects that was located in a single repository. It was pulled in by a git reference in the requirements.txt files.
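Such a git reference in requirements.txt looks roughly like this (the host, path, and egg name here are made up for illustration):

```text
# requirements.txt: pip pulls the shared code straight from the Git server
git+ssh://git@gitlab.example.com/python/common.git@master#egg=common
requests==2.7.0
```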

All third-party Python packages were transformed into wheels and put into a separate wheels repository, which was cloned or pulled on each build.

All in all, these multiple repositories almost always made it more difficult to get stuff done and get stuff right.

Enter the Monorepo

Inspired by “Go in a monorepo”, I thought about our situation. While I was originally no big fan of the idea of a monolithic repository when I first learned that Google uses this approach, I became convinced that a monorepo would be the far lesser evil for us.

It would allow us to touch multiple projects and still make atomic commits. It would also allow us to fix our build process: going from a local change to a local build artifact and Docker image with a single command.

I managed to convince the team that the monorepo was a good idea and got the ok to implement it.

Conversion process

So we had lots of Git repositories managed by Gitlab, conveniently organized into groups. The only practical approach with so many repositories was to preserve that structure: the original repo router in the Gitlab group api would end up at /api/router in the monorepo.

Here’s the code I ended up with: https://gist.github.com/ghaering/b0e95087771dbefa386f edited to remove confidential information. It only imports the master branches of the original git repositories, which was acceptable for us.

Here’s how the code works:

First the original repositories are cloned locally.

The local repositories are rewritten using git filter-branch --index-filter. This is much, much faster than using --tree-filter, which you’ll often find recommended.
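A minimal, self-contained sketch of that rewrite step follows; the repo and path names (router, api/router/) are illustrative, not taken from the gist. The index-filter body is the standard subdirectory-move recipe from the git-filter-branch documentation:

```shell
set -e
cd "$(mktemp -d)"

# Stand-in for one of the original repositories.
git -c init.defaultBranch=master init -q router
cd router
echo 'package router' > router.go
git add router.go
git -c user.email=ci@example.com -c user.name=ci commit -q -m 'add router.go'

# Rewrite the whole history so every file lives under api/router/.
# --index-filter edits the index directly, which is much faster than
# --tree-filter because no per-commit working-tree checkout happens.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f --index-filter '
    git ls-files -s | sed "s|\t\"*|&api/router/|" |
        GIT_INDEX_FILE=$GIT_INDEX_FILE.new git update-index --index-info &&
    mv "$GIT_INDEX_FILE.new" "$GIT_INDEX_FILE"
' HEAD

git ls-files   # now prints api/router/router.go
```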

In the monorepo, remotes are added pointing to the rewritten original repos and they’re simply imported via git pull.
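The import step can be sketched like this, again with made-up repo names. Note that Git 2.9 and later refuse to merge unrelated histories unless told otherwise; that flag was not yet necessary when we did the conversion:

```shell
set -e
cd "$(mktemp -d)"

# Stand-in for an already-rewritten original repo (files under api/router/).
git -c init.defaultBranch=master init -q rewritten-router
(
    cd rewritten-router
    mkdir -p api/router
    echo 'package router' > api/router/router.go
    git add .
    git -c user.email=ci@example.com -c user.name=ci commit -q -m 'rewritten history'
)

# The monorepo adds each rewritten repo as a remote and pulls it in.
git -c init.defaultBranch=master init -q monorepo
cd monorepo
git -c user.email=ci@example.com -c user.name=ci commit -q --allow-empty -m 'monorepo root'
git remote add router ../rewritten-router
git -c user.email=ci@example.com -c user.name=ci -c pull.rebase=false \
    pull -q --no-edit --allow-unrelated-histories router master

ls api/router   # the imported files, with their full history in git log
```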

This makes sure the entire history is intact, unlike other approaches I’ve tried.

Final words

If you consider consolidating multiple git repositories into a monorepo, I hope you’ll find my code useful. And if you’re not using a monorepo, maybe you could find some arguments for monorepos in this post.