Repo style wars: mono vs multi

This essay was originally written while consulting for Eero, who has graciously allowed me to share it.

The Fundamental Law of Repo Topology is that you must not have cyclical dependencies between repos. If you do, you are in for a world of hurt when you have to perform a series of non-atomic changes to update libraries.1 Going with a monorepo has the advantage that you never have this problem because there’s only one repo. On the other hand, working in a monorepo implies certain things about the rest of your development process and even your philosophy of development.
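To make the law concrete, here is a minimal sketch of a check for cycles in a repo-level dependency graph, using a depth-first search. The repo names and the shape of the `repo_deps` mapping are hypothetical; in practice the graph would have to be assembled from whatever build or package metadata your repos actually publish.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of repo names, or None if there is none."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color = {repo: WHITE for repo in deps}
    path: List[str] = []

    def visit(repo: str) -> Optional[List[str]]:
        color[repo] = GRAY
        path.append(repo)
        for dep in deps.get(repo, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so we have looped back to it.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[repo] = BLACK
        return None

    for repo in list(deps):
        if color[repo] == WHITE:
            found = visit(repo)
            if found:
                return found
    return None


# Hypothetical repos: auth-lib and core-utils depend on each other.
repo_deps = {
    "app": ["auth-lib", "ui-kit"],
    "auth-lib": ["core-utils"],
    "core-utils": ["auth-lib"],
    "ui-kit": [],
}

print(find_cycle(repo_deps))  # ['auth-lib', 'core-utils', 'auth-lib']
```

A check along these lines could run whenever a repo adds a dependency, so that a cycle is rejected before anyone has to untangle it with a sequence of non-atomic releases.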

Two philosophies

The fundamental difference between the monorepo and multirepo philosophies boils down to a difference about what will allow teams working together on a system to go fastest. The multirepo view, in extreme form, is that if you let every sub-team live in its own repo, they have the flexibility to work in their area however they want, using whatever libraries, tools, development workflow, etc. will maximize their productivity. The cost, obviously, is that anything not developed within a given repo has to be consumed as if it were a third-party library or service, even if it was written by the person sitting one desk over. If you find a bug in, say, a library you use, you have to fix it in the appropriate repo, get a new artifact published, and then go back to your own repo to make the change to your code. In the other repo you have to deal with not only a different code base but potentially with different libraries and tools or even a different workflow. Or maybe you just have to ask someone who owns that system to make the change for you and wait for them to get around to it.

The monorepo view, on the other hand, is that that friction, especially when dealing with more complicated dependency graphs, is much more costly than multirepo advocates recognize and that the productivity gains to be had by letting different teams go their own way aren’t really all that significant: while it may be the case that some teams will find a locally optimal way of working, it is also likely that their gains will be offset by other teams choosing a sub-optimal way of working, probably by cargo culting some other team’s approach and then letting it bitrot. By putting all your eggs in the one basket of the monorepo you can then afford to invest in watching that basket carefully. Clearly the friction of having to make changes in other repos or, worse, having to wait for other teams to make changes for you, is largely avoided in a monorepo because anyone can (and is in fact encouraged to) change anything. If you find a bug in a library, you can fix it and get on with your life with no more friction than if you had found a bug in your own code.

But this does point at what is perhaps the sharpest practical difference between the monorepo and multirepo philosophies: who is responsible for making the changes necessary to deal with library changes. That is, if the owners of Library X make a change to their API, who’s responsible for fixing the code that uses X? In a monorepo it has to be the Library X author because they can’t check in their change until the whole build is clean; they have to find and fix any code that their change would break. In other words, library authors are forced to balance the benefits of making breaking changes against the cost of fixing all the code they break because they’re going to do the fixing themselves.
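The "find any code that their change would break" step amounts to walking the dependency graph backwards from the changed library. The sketch below shows the idea on a toy graph; the target names and the `build_graph` mapping are made up for illustration, and a real monorepo build tool would normally answer this with its own reverse-dependency query rather than hand-written code.

```python
from collections import deque
from typing import Dict, List, Set


def affected_targets(deps: Dict[str, List[str]], changed: str) -> Set[str]:
    """Every target that depends on `changed`, directly or transitively."""
    # Invert the edges: for each target, who depends on it?
    dependents: Dict[str, List[str]] = {}
    for target, target_deps in deps.items():
        for dep in target_deps:
            dependents.setdefault(dep, []).append(target)

    # Breadth-first walk outward from the changed target.
    seen: Set[str] = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for user in dependents.get(current, []):
            if user not in seen:
                seen.add(user)
                queue.append(user)
    return seen


# Hypothetical monorepo targets; keys depend on the targets listed as values.
build_graph = {
    "//services/billing": ["//libs/x", "//libs/money"],
    "//services/search": ["//libs/index"],
    "//libs/index": ["//libs/x"],
    "//libs/money": [],
    "//libs/x": [],
}

print(sorted(affected_targets(build_graph, "//libs/x")))
# ['//libs/index', '//services/billing', '//services/search']
```

The size of that set is roughly the bill the Library X author has to pay before a breaking change can land.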

In a multirepo world, on the other hand, the Library X folks would simply check in their changes to their repo and then publish a new versioned artifact and it falls to the teams working in other repos to—at some point—change their dependency on X to the newer version and make whatever changes they need to to deal with it.

Again, this choice really rests on what you think will let developers go fastest. Obviously for library authors, the multirepo approach requires less work in the short term—they just make the changes they want and publish a new binary. And consumers of the library are not slowed down by having to deal with the Library X changes and their code base does not get destabilized by having someone come in and muck with it. So life seems good—they can wait until an opportune time to adopt the new version and make the necessary changes to their own code.

But there’s no such thing as a free lunch. In the long term, both library maintainers and library users pay for their short-term gains. Maintainers have to support multiple versions of their library since not everyone will upgrade right away. And consumers not only have to eventually upgrade to new versions of things, they may be blocked from upgrading when they want to because other code they depend on hasn’t upgraded yet.2
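The "blocked from upgrading" case can be sketched as follows. Assume, purely for illustration, that each published artifact records the major version of Library X it was built against and that two different major versions cannot coexist in one build. The names `deps`, `x_major`, and the artifacts themselves are hypothetical.

```python
from typing import Dict, List


def upgrade_blockers(
    deps: Dict[str, List[str]],
    x_major: Dict[str, int],
    consumer: str,
    wanted_major: int,
) -> List[str]:
    """Transitive dependencies of `consumer` still built against a different major version of X."""
    blocked: List[str] = []
    seen = set()
    stack = list(deps.get(consumer, []))
    while stack:
        dep = stack.pop()
        if dep in seen:
            continue
        seen.add(dep)
        # Artifacts missing from x_major do not use X at all and cannot block.
        if dep in x_major and x_major[dep] != wanted_major:
            blocked.append(dep)
        stack.extend(deps.get(dep, []))
    return sorted(blocked)


# Hypothetical artifacts: app wants X 2.x, but part of its dependency tree is still on 1.x.
deps = {
    "app": ["reporting-lib", "auth-lib"],
    "reporting-lib": ["chart-lib"],
    "auth-lib": [],
    "chart-lib": [],
}
x_major = {"reporting-lib": 1, "auth-lib": 2, "chart-lib": 1}

print(upgrade_blockers(deps, x_major, "app", 2))  # ['chart-lib', 'reporting-lib']
```

Until the owners of both listed libraries publish versions built against X 2.x, the app team cannot move, no matter how much they want the new version.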

Implications of a monorepo

The main challenge of running a monorepo is that it will naturally get larger over time. Since you can’t scale horizontally (by splitting into multiple repos), you have to scale vertically, and not all your tools will necessarily be up to the challenge. In particular:

You will need a specialized build tool that is designed for building everything from source and intelligently caching built artifacts. Google’s Bazel, Facebook’s Buck, and Twitter’s Pants are the contenders with Bazel probably the best bet.

If your repo gets too big, standard version control may choke on it. Git can certainly handle the Linux kernel but that’s not really all that big a repo. Github focuses on scaling horizontally—lots of little repos—but it’s not clear how ready they are for a truly giant single repo. Twitter runs perhaps the biggest monolithic git repo in the world and it is still pretty painful to deal with. And I don’t think many of Twitter’s git hacks have been upstreamed. On the other hand, Facebook’s Mercurial hacks are available and Perforce could probably handle anything you’re going to throw at it with no problem. That said, people are still working on git so if it can handle your current codebase and number of developers without hiccups, you may have enough headroom to grow a git repo for some time.

It cuts against the Git and Mercurial model to check out a subset of the repo. In any event you will need tooling support to figure out what you need to check out in order to be able to build an arbitrary target within your monorepo.

IDEs may have trouble with a giant workspace or non-standard build tools. (Though Bazel seems to have IntelliJ integration and Pants purports to, though it never worked well. Facebook solved this problem by building their own IDE.)
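On the partial-checkout point above, the tooling you need boils down to computing the transitive closure of a target's dependencies and mapping it back to directories. Here is a rough sketch, assuming Bazel-style labels of the form `//path/to/package:name`; the graph itself is invented for the example.

```python
from typing import Dict, List, Set


def directories_to_check_out(deps: Dict[str, List[str]], target: str) -> List[str]:
    """Directories holding `target` and everything it transitively depends on."""
    needed: Set[str] = set()
    stack = [target]
    while stack:
        current = stack.pop()
        if current in needed:
            continue
        needed.add(current)
        stack.extend(deps.get(current, []))
    # Turn a label like "//libs/http:client" into the directory "libs/http".
    return sorted({label.lstrip("/").split(":")[0] for label in needed})


# Hypothetical build graph; keys depend on the labels listed as values.
build_graph = {
    "//services/billing:server": ["//libs/money:money", "//libs/http:client"],
    "//services/search:server": ["//libs/http:client"],
    "//libs/http:client": ["//libs/log:log"],
    "//libs/money:money": [],
    "//libs/log:log": [],
}

print(directories_to_check_out(build_graph, "//services/billing:server"))
# ['libs/http', 'libs/log', 'libs/money', 'services/billing']
```

The resulting list could then be handed to whatever partial-checkout mechanism your version control offers (for Git, something like `git sparse-checkout set`), though the dependency graph has to come from the build tool either way.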

Some other considerations:

Because all the code is in one place, there is little natural barrier to tangling it all together. Thus you will need constant vigilance to maintain a good overall structure. On the plus side, global refactorings are possible.

If you develop public open source projects you either don’t get the benefits of the monorepo for those projects or have to kludge around syncing the code to Github where, unless you are extremely fancy, the outside contributor experience will be second class. (They can’t just send a Github PR that gets merged but rather something that then gets applied inside the monorepo and eventually published back out to Github.)

Version management: in monorepos everything in master is on the same version. You can still have gaps between what’s in master and what’s deployed, especially if some projects actually build from deploy branches for sanity’s sake.

Implications of a multirepo

Perhaps the main advantage of a multirepo is that it avoids the tooling challenges of a monorepo since existing tools are already designed to deal with the scale of project you’re likely to put in a single repo. However, there are other issues, even if you believe that the multirepo approach will get you productivity gains:

You need to be very strict and intentional about managing versions and baking in the cost of updating code to use new versions of libraries it depends on. You may need to tool up around tracking dependencies globally to give people a proper understanding of what needs to be updated when.

The way multirepos force a certain loose coupling between projects becomes a drawback when reality dictates that things that used to be coupled should be decoupled or things that used to be decoupled need to be joined closer together. You will want tooling to make this not terrible.

To mitigate the problems caused by having to work in multiple repos you’ll probably want to provide some standardization of tooling, conventions, etc., though this cuts against the basic tenet of the multirepo philosophy.

Finding code is harder when you first have to find the right repo. At the very least you’ll probably need to maintain some kind of catalog of repos or strong naming convention that will allow someone looking at, say, a dependency in a build file, to figure out what repo they need to look in to find the code.
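The catalog and the global dependency tracking mentioned above can start out very small. The sketch below keeps one record per published artifact, giving the repo that owns it and its latest published version, and uses it both to locate code and to report which pinned dependencies have fallen behind. Every artifact name, repo URL, and version here is invented for illustration.

```python
from typing import Dict, List, NamedTuple, Tuple


class CatalogEntry(NamedTuple):
    repo: str    # where the source for the artifact lives
    latest: str  # most recently published version


# Hypothetical catalog, e.g. maintained as part of each repo's release process.
CATALOG: Dict[str, CatalogEntry] = {
    "com.example:money": CatalogEntry("git@example.com:payments/money-lib.git", "3.1.0"),
    "com.example:http-client": CatalogEntry("git@example.com:infra/http-client.git", "1.8.2"),
}


def repo_for(artifact: str) -> str:
    """Answer 'which repo do I look in?' for a dependency seen in a build file."""
    return CATALOG[artifact].repo


def stale_dependencies(pinned: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """(artifact, pinned version, latest version) for every dependency that is behind."""
    return sorted(
        (artifact, version, CATALOG[artifact].latest)
        for artifact, version in pinned.items()
        if artifact in CATALOG and version != CATALOG[artifact].latest
    )


# Versions pinned in some hypothetical consumer repo's build file.
billing_pins = {"com.example:money": "2.4.0", "com.example:http-client": "1.8.2"}

print(repo_for("com.example:money"))
print(stale_dependencies(billing_pins))
```

Run across all repos, a report like this gives people a rough picture of what needs to be updated and when, and it doubles as the index that tells them where the code lives.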

Best of both worlds?

Can you get the best of both worlds somehow? Probably not. You could certainly tool up around a multirepo to make multiple repos act more like a single monorepo (this is what Android does with a tool called repo) and you could establish a consistent toolchain, etc. across all repos to reduce the costs of making changes in other repos. But at that point it’s not clear what benefit you are getting from multiple repos unless there are other reasons to have some code in separate repos.

Or, in a monorepo, you could use branches to mimic the local stability you get in a multirepo: every project would live in a long-lived branch and only merge changes from other projects when needed. But then you get all the potential for irresolvable diamond dependencies preventing you from upgrading when you actually want to.

Further reading