Lessons from PostgreSQL's Git transition


The PostgreSQL Project finally switched from CVS to Git in September 2010, and did its first release based on the new Git repository on October 5. Making the switch happen took years and resulted in at least one near-disaster. Other projects that are contemplating, or working on, a transition in their version control system may find useful lessons in how PostgreSQL fared.

A History of CVS

Switching version control systems is a relatively straightforward process for a young project, but not for an old one. From 1996 through mid-September 2010, the PostgreSQL Project was developed using CVS. In fact, the date the CVS server went live — July 8, 1996 — is generally considered the "birthday" of the open-source project, which came out of the ten-year-old university project. Twenty-one major releases and 154 minor releases were committed, branched, and packaged on CVS. A large web and development infrastructure existed around CVS, as well as a multi-step, multi-role release procedure.

In 2004, as Subversion was beginning to become popular, PostgreSQL contributors first started to argue about switching away from CVS. However, Subversion was not seen as sufficiently mature at the time, or as offering enough advantages over CVS. This discussion, and occasional flamewars, continued to crop up on the main "pgsql-hackers" mailing list. Those who wanted to migrate off of CVS split into multiple camps based on the system they preferred, including Subversion, Arch, and Monotone. And, of course, a large group of developers saw no reason to change at all.

As with many projects which operate by rough consensus, where there is no consensus, there is no action. The community verdict was to wait for one version control system to become the clear leader. In retrospect, this turned out to have been a good decision.

First Git Mirror

In 2007, one of the git-cvsimport maintainers from New Zealand decided to set up a persistent and frequently updated Git mirror for the PostgreSQL CVS tree. This mirror was extremely popular and one Google Summer of Code student even did his project using the Git mirror instead of the CVS repository. An increasing number of PostgreSQL contributors started using Git mirrors.

In 2008 a few members of the PostgreSQL web team decided to set up an "official" git.postgresql.org, using FromCVS for conversion, and Gitweb with custom administration scripts for the web version. This was done without obtaining community consensus, and caused some controversy. But once the flames died out, many people started using the repository. However, the mirror became a source of frustration for the developers who depended on it. Synchronization with CVS was often undependable, with changes made in CVS failing to show up in Git.

While one or two Subversion mirrors were created well before the Git mirror, and there was even a Bazaar mirror on Launchpad, none of these were at all popular. By mid-2008, it was clear that a majority of PostgreSQL developers favored an eventual switch to Git.

Evaluation

pgCon is the annual international PostgreSQL conference in Ottawa, Canada. By pgCon 2009, PostgreSQL was one of only a handful of major projects still using CVS. It was time to switch to something, and a long discussion ensued at the developer meeting. With a version of Git available for Windows, a major hurdle had been cleared, but several issues remained to be solved, most of them having to do with project infrastructure.

First, the project has an automated regression testing infrastructure called the PostgreSQL Buildfarm. This network of donated servers and virtual machines does daily or hourly CVS checkouts, builds PostgreSQL, and runs regression tests. The Buildfarm would need to be updated to use Git, which was a challenge because not all of the operating systems represented in the build farm (such as UnixWare and AIX) had Git packages available.

The second challenge was the code review process. Using CVS, PostgreSQL contributors submitted their patches in context diff format to the pgsql-hackers mailing list and the CommitFest application. The committers did not want to change this process.

It was also unclear whether it would be possible to recreate past releases from Git.

Decision to Switch

By pgCon 2010, these issues had been resolved. The Buildfarm code had been patched, and the web team planned to set up a CVS mirror of Git for the build farm members who could not run Git. Developers had tested several back branches, and tweaked the conversion process until it was possible to produce a back-branch release which was identical to the one produced by CVS.
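One way to check that a release rebuilt from Git is "identical" to the CVS one is to hash every file in the two exported source trees and list the paths that differ. The following is only a sketch of that idea, not the procedure the PostgreSQL developers used; the function names are hypothetical, and it assumes both trees have been exported to plain directories without any version control metadata.

```python
import hashlib
from pathlib import Path


def tree_digests(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def tree_differences(cvs_root, git_root):
    """Return the sorted paths that are missing from one tree or differ in content."""
    a = tree_digests(cvs_root)
    b = tree_digests(git_root)
    return sorted(path for path in a.keys() | b.keys() if a.get(path) != b.get(path))
```

An empty list means the two trees match byte for byte; any path returned is a file to inspect by hand.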

The Buildfarm developers, primarily Andrew Dunstan, worked on transitioning build farm members to using the Git mirror in advance of the switchover. However, the git.postgresql.org mirror proved too unreliable for this purpose, and Andrew had to set up another mirror on GitHub using some custom scripts. That took longer than the changes to the Buildfarm client code. But it worked, and the build farm servers started to convert to Git.

Accordingly, those at the developer meeting set a date: on August 18th, 2010, PostgreSQL would switch over. The Git repository would become the canonical code base and the CVS repository would become the mirror. This date was based on the assumption that 9.0 would be released by then, and August 18th would be after the first Alpha release for 9.1. Suitably, at pgCon Andrew did a talk on "Git for PostgreSQL developers".

Failure to Switch Over

By August 13th, things were looking somewhat different; 9.0 was not released yet. But everything else was scheduled, so the conversion went ahead. First, the developers froze the CVS tree. Next, Magnus Hagander employed the cvs2git tool to do a final conversion from CVS to Git. Then all of the PostgreSQL developers were asked to test the new repository. Things looked OK.

Then, the day before the Git repository was expected to be opened to new patches, committer Robert Haas noticed a problem:

The first few revs look OK, but [then] you get to this: 2010-02-28 PostgreSQL... This commit was manufactured by cvs2svn to create branch REL8_3_STABLE. Prior to that commit, this history is nonsense - it appears to be the history of our 9.0 development prior to that date. I would say we're going back to good old CVS.

It seems that, where we had ported patches from later versions to earlier versions, cvs2git had manufactured inappropriate merge commits. Nobody had noticed because it didn't affect the head of any branch, just the history.
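Problems like this hide in the middle of the history, so it helps to list every synthetic commit across all branches rather than only inspecting branch tips. The sketch below is an illustration, not the check the project ran. It assumes the log was produced with tab-separated hash, date, and subject fields, and it searches for the marker text quoted above. The function name is hypothetical, and not every hit is an error: the conversion tool also manufactures legitimate branch-creation commits, so the result is a review list.

```python
# Marker text taken from the commit message quoted above.
MARKER = "This commit was manufactured by"


def find_manufactured(log_text):
    """Return (hash, date) pairs for commits whose subject contains MARKER.

    Each input line is expected to look like "<hash>\t<date>\t<subject>".
    Lines that do not have three tab-separated fields are ignored.
    """
    hits = []
    for line in log_text.splitlines():
        parts = line.split("\t", 2)
        if len(parts) == 3 and MARKER in parts[2]:
            hits.append((parts[0], parts[1]))
    return hits
```

Input in that shape can be generated with `git log --all --date=short --format='%H%x09%ad%x09%s'`; each commit returned can then be examined together with its parents.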

We reverted to CVS, and postponed switchover until after 9.0 was released.

The Second Conversion

Over the next few weeks, web team member Magnus Hagander worked with cvs2git developers Max Bowsher and Michael Haggerty to clean up issues with the conversion. In addition to the false merge commits, there were quite a number of other artificial commits and weird history in the converted version, such as branches which didn't exist and reappearing deleted files. A good portion of this was due to the fact that not only were we running an old version of CVS, but repository steward Marc Fournier and others had also done "CVS surgery" a number of times to fix problems over the years.

A second conversion test broke the ability to recreate any release. We discovered the new Debian server we were running it on did not default to ISO date format and had changed all the date strings for the releases. Fixed. A fair amount of "dirty history" in CVS simply could not be converted and needed to be patched in CVS before conversion. That was patched by Tom Lane.

Magnus, Max, Robert, and Tom stuck with it and resolved all of the issues which were considered roadblocks. cvs2git's next release will contain several improvements which are a result of our conversion. Several community members also added documentation to the PostgreSQL wiki on how we would use Git for development, including how to use git.postgresql.org, and how to commit.

On September 20, we released version 9.0 of PostgreSQL. Since Magnus was not heavily involved in the 9.0 release, he was able to spend his time preparing. So, on September 21st, we switched to Git.

Not Done Yet

However, the actual day of conversion was hardly the end. There was still a lot of minor cleanup to do. Contributors had to prune fictitious branches, move tags which appeared in the wrong place, and clean up many other issues.

Because the community wanted to preserve old releases exactly as they had been in CVS, we needed to preserve the file header tags in each file and remove them only from the "tip" of each branch. These are the comments which CVS automatically creates in each file which look like this:

* $PostgreSQL: pgsql/src/tools/fsync/test_fsync.c,v 1.27 2010/02/26 02:01:39 momjian Exp $

Git doesn't use these tags since they don't work with atomic multi-file commits, so they needed to be removed from current development without removing them from the history.
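As a rough illustration of the cleanup at the branch tips, the sketch below drops every line that carries an expanded CVS keyword like the one shown above. It is not the script the project used. The keyword list and the function name are assumptions, and removing whole lines can leave an empty comment block behind, so real files would still need a manual pass.

```python
import re

# Matches an expanded CVS keyword such as "$PostgreSQL: path,v 1.27 ... Exp $".
# The set of keyword names here is an assumption made for illustration.
CVS_KEYWORD = re.compile(r"\$(?:PostgreSQL|Header|Id):[^$\n]*\$")


def strip_keyword_lines(source):
    """Return source with every line containing an expanded CVS keyword removed."""
    kept = [
        line
        for line in source.splitlines(keepends=True)
        if not CVS_KEYWORD.search(line)
    ]
    return "".join(kept)
```

Applying this only to the working tree at the head of each branch, and committing the result, leaves the tags untouched in all earlier history.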

Administrators also had to fix people's usernames, since contributors had freely used idiosyncratic nicknames and incomplete user information for git.postgresql.org before it was the canonical repository. These were changed to their main e-mail addresses with their real names. And, most of all, hackers who had waited until the conversion to learn Git were asking for command help on the mailing lists.

After the conversion, the project had to resolve policies and contents for .gitignore files, which tell Git what kinds of files to ignore. These are used to prevent developers from accidentally sending in patches containing editor backup files or build artifacts.

One issue which still isn't resolved is the CVS mirror for the Git repository. git-cvsserver turned out to have serious scalability limits, which leave it unable to support the build farm servers. As a result, Andrew Dunstan required all of the build farm machine owners to migrate their nodes to Git immediately rather than gradually. However, there are still a few machines which are very valuable for testing and cannot run Git and, so far, there is no solution for those.

No Merge Commits

PostgreSQL had introduced a new workflow for reviewing patches in 2008, called "CommitFests" — a bi-monthly process where committers and reviewers clear out the pending patch queue. By 2010, the bugs had been worked out of the CommitFests and everyone thought they were working well. Among other things, the new workflow was helping train new contributors. So nobody wanted to change it. Also, many committers were okay with the migration only if they could still review patches the way they were used to.

The result was the adoption of a policy which will surprise veteran Git users. Per Magnus's blog:

... We still allow any developers (and committers) to use whatever parts of git they want as they develop, but for commits going into the main tree, we are making a number of restrictions ... We will not allow merge commits ... We will not use the author field in git to tag it with the patches original author ... we will require that author and committer are always set to the same thing, and we will then credit the author(s) (along with the reviewer(s)) in the commit message ...

Yes, that's correct. No merge commits. To submit a patch, extract it as a context diff and e-mail it. Committers are to apply the patch under their own names, without branch history. The project has decided, more-or-less, to use Git like it was CVS as far as commits to the main repository are concerned. Rather than adapt the PostgreSQL project's workflow to Git, Git would be adapted to the project's workflow.
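For readers who have only ever seen unified diffs, here is what a context diff looks like when generated programmatically. This is a minimal sketch using Python's standard difflib module, not the tooling PostgreSQL contributors use. The function name and the a/ and b/ path prefixes are assumptions for illustration.

```python
import difflib


def make_context_diff(old_text, new_text, filename):
    """Return a context-format diff between two versions of one file."""
    diff = difflib.context_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile="a/" + filename,
        tofile="b/" + filename,
    )
    return "".join(diff)


if __name__ == "__main__":
    # Changed lines are marked with "!" in both the old and new hunks.
    print(make_context_diff("a\nb\nc\n", "a\nB\nc\n", "example.c"))
```

Unlike a unified diff, the context format shows the old and new versions of each hunk separately, which is the layout the committers were accustomed to reviewing.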

It's possible, even likely, that eventually the PostgreSQL project will move towards the "normal Git workflow" where branches and merges would be used for feature work. But it definitely won't happen this year.

The First Release

The conversion wasn't really complete until the project did a release from it. That happened on Tuesday, October 5th with a combined security update. Unsurprisingly, there was a problem.

Security releases usually have fairly tight timing; from the moment someone commits the security patches, the issues fixed are publicly visible. Due to that, and limited familiarity with the Git commands, the source packager missed the final commit on the latest branch (9.0.1). That caused the release to be delayed by a day.

Git can't be blamed for everything, though. The release was delayed for another 2 hours because the Subversion repository which holds the www.postgresql.org web code locked up. But the release did go out.

Benefits of Git

Now that the migration is mostly done, many people are discovering the benefits of working with an up-to-date, distributed version control tool.

Robert Haas has documented how to get commit summaries and sizes from Git. He wrote a Perl script (which Tom Lane improved) that allows you to produce a changelog suitable for release notes from Git. Andrew Dunstan also found new ways to sync his local repository.

The pgAdmin project has also switched to Git and the pgWeb project will hopefully soon follow.

In the future, Git should allow developers both to work on longer-lived forks more easily and to test them more fully; we've already seen this with synchronous replication and SE-Postgres. Hopefully the translation teams will also be able to take advantage of forks on git.postgresql.org to collaborate on translating the docs and messages. Most importantly, Git should help prevent bit-rot in features which take months or years to develop.

Best of all, most PostgreSQL developers were able to continue hacking away without being involved in the switchover at all.

Lessons Learned

Assuming that there are any projects out there that have not yet switched to their distributed version control system of choice, here are a few things to learn from our migration:

Start with a Git mirror.

Designate a specific "Git migration team". Make sure they have lots of free time.

Your first attempt to migrate will probably fail, so you need to be prepared for more than one.

Changing your infrastructure, workflow, and build tool dependencies is harder than the repository conversion.

Make friends with the conversion tool authors.

Write lots of docs about the new tools and workflow.

The more history you have on your current system, the more work conversion is going to be.

Things which are broken in your current history are not going to fix themselves when you migrate.

When testing the conversion, make sure to look at more than HEAD and branch-tips.

The biggest lesson, though, is not to be in a hurry! It was over three years from PostgreSQL's first Git mirror to final conversion, and 16 months of actual preparation. If you take your time and are ready to retry things that don't work the first time, you should be able to have a successful migration to Git.