Since Linus Torvalds' presentation at Google about git in May 2007, adoption of and interest in Distributed Version Control Systems have been constantly rising. We will introduce the concept of Distributed Version Control, see when to use it, why it may be better than what you're currently using, and have a look at three actors in the area: git, Mercurial and Bazaar.

What?

A Version Control System (VCS, also called SCM) is responsible for keeping track of several revisions of the same unit of information. It is commonly used in software development to manage source code projects. Historically, the first VCS of choice was CVS, started in 1986. Since then many other SCMs have flourished, each with its specific advantages over CVS: Subversion (2000), Perforce (1995), CVSNT (1998), ...



In December 1999, in order to manage the mainline kernel sources, Linus chose BitKeeper, described as "the best tool for the job". Prior to this, Linus was integrating each patch manually. While all its predecessors worked in a client-(central) server model, BitKeeper was the first VCS to allow a truly distributed system in which everybody owns their own master copy. Due to licensing conflicts, BitKeeper was later abandoned in favor of git (Apr, 2005). Other systems following the same model are available: Mercurial (Apr, 2005), Bazaar (Mar, 2005), darcs (Nov, 2004), Monotone (Apr, 2003).

Why?

Or, to ask a more precise question: why are central VCSs (and notably Subversion) not satisfying?

Several things are blamed on Subversion:

The major reason is that branching is easy but merging is a pain (and one doesn't go without the other). Any sizable project you work on will likely need easy gymnastics with split, dev and test branches. Subversion has no history-aware merge capability, which forces its users to track manually exactly which revisions have been merged between branches, an error-prone process.

No way to push changes to another user (without submitting to the Central Server).

Subversion fails to merge changes when files or directories are renamed.

The trunk/tags/branches convention can be considered misleading.

Offline commits are not possible.

.svn files pollute your local directories.

svn:externals can be harmful to handle.

Performance.

Modern DVCSs fixed these issues, both through their own implementation tricks and through the fact that they are distributed. But as we will see in the conclusion, Subversion has not given up yet.
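To make "history-aware merge" more concrete, here is a minimal sketch in Python. It assumes a toy revision graph in which every revision records its parents; the revision ids and the function names are hypothetical and do not reflect how git, Mercurial or Bazaar store history internally. The point is only that once merges are recorded in the graph, the tool itself can work out which revisions of a branch have not been merged yet, instead of leaving that bookkeeping to the user.

```python
# Toy revision graph: each revision id maps to the ids of its parents.
# A merge revision simply has two parents. All names are illustrative.
def ancestors(parents, rev):
    """Return rev plus every revision reachable through parent links."""
    seen = set()
    stack = [rev]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def unmerged(parents, target, source):
    """Revisions in source's history that target has not incorporated yet."""
    return ancestors(parents, source) - ancestors(parents, target)


history = {
    "a": [],          # initial commit
    "b": ["a"],       # work on the main line
    "c": ["a"],       # work on a feature branch
    "d": ["c"],       # more work on the feature branch
    "m": ["b", "c"],  # main line merges the feature branch as of c
}

# After the merge m, only d is still missing from the main line.
print(sorted(unmerged(history, "m", "d")))
```

Because the merge revision m remembers both of its parents, a second merge of the feature branch only has to deal with d; nothing needs to be tracked by hand.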

How?

Decentralization

Distributed Version Control Systems take advantage of the peer-to-peer approach. Clients can communicate with each other and maintain their own local branches without having to go through a central server/repository. Synchronization then takes place between the peers, who decide which changesets to exchange.
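The peer-to-peer exchange can be pictured with a small Python sketch. This is a deliberately simplified model, not the behavior of any real tool: the Peer class and its methods are hypothetical, changesets are reduced to bare ids, and the dependency of a changeset on its ancestors (which real DVCSs enforce) is ignored.

```python
class Peer:
    """A toy repository: a name plus the set of changeset ids it holds."""

    def __init__(self, name, changesets=()):
        self.name = name
        self.changesets = set(changesets)

    def commit(self, changeset_id):
        # Commits are purely local; no server is contacted.
        self.changesets.add(changeset_id)

    def missing_from(self, other):
        """Changesets that `other` holds and this peer does not."""
        return other.changesets - self.changesets

    def pull(self, other, wanted=None):
        """Copy changesets from `other`; `wanted` optionally restricts which."""
        incoming = self.missing_from(other)
        if wanted is not None:
            incoming &= set(wanted)
        self.changesets |= incoming
        return incoming


alice = Peer("alice", {"c1"})
bob = Peer("bob", {"c1"})
alice.commit("c2")
alice.commit("c3")

# Bob decides to take only c2 from Alice; c3 stays on Alice's side.
pulled = bob.pull(alice, wanted={"c2"})
print(sorted(pulled), sorted(bob.missing_from(alice)))
```

No central repository appears anywhere in this exchange: each peer holds its own history and chooses what to accept.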

This results in some striking differences and advantages from a centralized system:

No canonical, reference copy of the codebase exists by default; only working copies.

Disconnected operations: common operations such as commits, viewing history, diff, and reverting changes are fast, because there is no need to communicate with a central server. Even if a central server exists (for a stable, reference or backup version), it should be queried far less often than in a CVCS setup if distribution is used well.

Each working copy is effectively a remote backup of the codebase and change history, providing natural security against data loss.

Experimental branches: creating and destroying branches are simple and fast operations.

Collaboration between peers is made easy.

For an introduction to DVCS collaboration practices, you might have a look at the Intro to Distributed Version Control (Illustrated) or at possible collaboration workflows.

You should also be aware that opting for a DVCS has some disadvantages, notably in terms of complexity. The decentralized view is very different from the central world, and your developers might need some time to get used to it. Changeset tracking instead of file tracking can also be confusing, even if it is very powerful and makes it theoretically possible to track a method as it moves between files.

Who?

The battle rages on! Some of the Good and the Bad.

The good and the bad come essentially from an updated compilation of blogs (some old arguments are no longer true) and from my personal experience.

Note that this is a very short list of features (git, for example, has more than 150 commands), and some issues might be more critical than others.

Note that the survey offered no option to choose Ruby as a language of proficiency. It would be interesting to add it for the 2008 survey.

It's also funny to see that ~1/3 of people use Distributed VCS (here git) in collaboration with ... 0 or 1 person!

GUIs

Screenshots: gitk on Linux, TortoiseHg on Windows, OliveGtk on Linux.

The GUIs look nearly the same, with a preference for the effectiveness of gitk. TortoiseHg (with folder watch activated) was really slow with a big repository like Mozilla.

A quick and non-exhaustive look at performance

Conditions of the bench

git is still leading the performance battle, but Hg and Bzr have made great improvements in the past year.

You should notice that Mercurial doubles the number of files in your repository (the history is kept per file in .hg/store/data). It doesn't seem to be a good choice for Windows systems running on NTFS.

It's also interesting to see that git leans heavily on the operating system when executing commands. While Hg and Bzr do not spend a large proportion of their time in the system, git can spend 10-40% of its CPU time in system calls. This raises the question of how it will perform on Windows, where the git developers won't have access to all the system performance tricks they are used to on Linux.
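For readers who want to check this kind of figure on their own repository, here is one possible way to estimate the share of CPU time a command spends in system calls, sketched in Python. It is not the procedure used for this article; the function name is hypothetical, and the child-process counters returned by os.times() are only meaningful on Unix-like systems and have coarse resolution, so very short commands may report zero.

```python
import os
import subprocess
import sys


def system_time_share(argv):
    """Fraction of a child command's CPU time that was spent in the kernel."""
    before = os.times()
    subprocess.run(argv, check=True, stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL)
    after = os.times()
    # children_* counters accumulate CPU time of child processes waited for.
    user = after.children_user - before.children_user
    system = after.children_system - before.children_system
    total = user + system
    return system / total if total > 0 else 0.0


if __name__ == "__main__":
    # Placeholder command; substitute e.g. ["git", "status"] inside a repository.
    share = system_time_share([sys.executable, "-c", "sum(range(10**6))"])
    print(f"system share: {share:.0%}")
```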

Single merges and merge queues should be tested too; this is a tedious part to benchmark.

Benchmarks should also be run on Windows because:

Even if your server is running on *nix, many developers still have a Windows environment at work, and a DVCS transfers more processing to the developer's workstation.

Performance might be really different on a Windows machine.

When?

Experience stories.

I had the chance to catch up with Kelly O'Hair from Sun about Sun's choice of Hg for OpenJDK.

Sebastien Auvray: I read the reasons for migrating from TeamWare to Mercurial but had remaining questions. Did you simply follow the OpenSolaris choice?

Kelly O'Hair: To some degree yes, but the OpenSolaris choice also became the Sun wide choice to any Sun Software teams having to convert. The OpenSolaris investigation was pretty complete and they had all the exact requirements we had. We had to convert for OpenJDK, because TeamWare was unacceptable for an open source project, the answer of Mercurial was pretty obvious for us.

Or did you run a fresh tournament and try the other DVCS again (git, ...)?

We did not do a detailed re-investigation, that seemed like a waste of time. The only other possible choice in my view was git, and git wasn't giving Windows a priority, which we needed. Again the choice was obvious.

The OpenSolaris reports date from April 2006, which is 2 years ago.

Understood. Some things may have changed, git has improved, but the ball was rolling, and Mercurial was improving too.

Also did you encounter any specific problems in the migration?

File permissions and ownership can be a problem in sharing a repository via an NFS or UFS file system, so we finally set up a server to handle the shared repositories, the better way. That could be made easier.

The other issue is that using hooks to rollback or filter pushes creates a window where someone could accidentally pull changes that will be rolled back, so you have to use a pair of repositories, one for pushes and one for pulls, with an automatic sync after the hooks run to sync them up.

Using forests also introduces a problem because a forest push is just a set of individual pushes, and if one push failed, technically you would want to rollback all other pushes. Nobody is doing this, and just taking their chances. If the repositories in the forest are fairly independent, this is not a real problem.

In the day-to-day usage?

Remains to be seen. Change like this is easy for some, harder for others. Given time, I think most people have and will adapt and learn to love it.

The concept of "working set files" (having to do 'hg update') and having to merge changesets that don't seem to merge anything is confusing to people. Also, the idea that they are pushing changesets and not files is something people have a problem with, "Why can't I just bringover this one file?".

What is better than TeamWare?

Much much much faster than TeamWare. Our teams in China and Russia are looking forward to full deployment because they don't need to keep mirrors of integration areas. Refreshes (pulls) are very fast over slow connections.

The state of the repository in Mercurial is well-defined, unlike TeamWare which allowed for partial workspaces, TeamWare was just a loose bag of individually managed files (SCCS files).

The changeset concept was missing in TeamWare, along with the concept of well known simple state of the entire repository (a simple changeset id).

Is there anything you're missing from TeamWare?

People are missing the email notifications and putback/bringover transaction history, but the changeset provides much of that.

What may be missing is some kind of repository transaction history, but again, email archives of Mercurial events could provide this.

Is Hg becoming the VCS of choice for Sun including internal projects? Or is Sun using it only for public projects that need openness?

Both internal and external projects are converting, where it makes sense.

I've seen a big increase in interest from internal projects that are taking the plunge.

I also caught up with Pierre d'Herbemont from VLC to get their opinion about git.

Sebastien Auvray: Firstly what was the version control system you were using prior to using Git?

Pierre d'Herbemont: SVN and a git-svn mirror.

When did you migrate?

We opened a git mirror of the svn tree, to ease VLC Google Summer of Code projects. So that was back then. Then we totally migrated to git on March 1st-2nd 2008.

Why did you choose Git over its competitors?

Over SVN: Git is fast. Branch is cheap. Atomic Commits. Rebasing on top of another tree.

Over other distributed system: Proven user base (Linux Kernel). I have been successfully using it while working on Wine. Git is sexy. And Some core developers had experiences with Git, whereas no one has with Mercurial and such. Nothing technical there.

Also did you encounter any specific problems in the migration? In the day-to-day usage?

We encountered some troubles with Trac and Buildbot. Their support for Git is really minimal, especially in their release versions. We had to check out Buildbot's latest trunk. For Trac we are using a crippled Git plugin. Trac Git Plugin needs Trac 0.11. But Trac 0.11 isn't stable and has some known memleak that prevents us from switching. So basically we are waiting for them to fix that...

It took some time for some committers to get accustomed to Git. But after two days, everything seemed fine. And some Git beginners are starting to really enjoy Git.

So what?

Choosing between a distributed VCS and a central VCS is far from easy. A DVCS will definitely change the way you work and collaborate. Subversion, one of the leading central VCSs, has not given up the performance and features battle, and version 1.5 should bring good compromises. It can count on its existing user base and on its simplicity (at the cost of some pain). In very specific cases, such as projects dealing with large opaque binary files, Subversion would be better than a DVCS because client-side space usage stays constant. Also, if you use partial checkouts heavily, svn will perform better (though heavy use of them reveals a problem in how your modules are set up).

Once you have chosen between a distributed and a central solution, it will also be hard to compare the competitors within that area, as implementations, commands and ultimately performance can be very different. And there are no real existing benchmarks for the common operations.

In this hard battle, Bazaar lost many influential early adopters (Mozilla, Solaris, OpenJDK) because of its poor early performance. It also has to be said that the Bazaar website is a lot more marketing-oriented: it publicizes not-entirely-true differences with its competitors, and publishes benchmark comparisons only on space efficiency, with no timing comparison of daily commands: diff, add, ...

I feel that even though the 3 projects started at nearly the same time, bzr faced a lot more performance and design problems early on, making it a bit less mature than its competitors today.

A phenomenon not seen before: some choices seem to have emerged based on the language used by the communities. Java / Sun related developments seem more interested in Mercurial, while C / Linux / Ruby / Rails related projects are attracted by git.

I hope this article enlightened you. Your experiences and feedback are always welcome!

Credits:

People who kindly accepted my interview: Kelly O'Hair, Pierre d'Herbemont.

Ian Clatworthy for his help and reactivity on the conversion of the Mozilla Hg Repository to Bzr.

#git, #mercurial, #bzr on Freenode IRC, #mozilla on Mozilla IRC.

Athletism Picture by Antonio Juez

Random quotes:

Linus Torvalds: "Subversion has been the most pointless project ever started". "If you like using CVS, you should be in some kind of mental institution or somewhere else".

Mark Shuttleworth (Ubuntu / Canonical Ltd.): "Merging is the key to software developer collaboration."

Ian Clatworthy (Canonical / Bazaar): "By 2011-2012, I predict this technology will be widely adopted and many teams will wonder how they once managed without it."

Assaf Arkin in Git forking for fun and profit originally: "Apache built a great infrastructure around SVN, lots of sweat and tears went into making it happen, and at first I felt like we’re circumventing all of that. But the longer I thought about it, the more I realized that Git is just more social than SVN, and that’s exactly what Apache is about."

[Article updated on 20080512 according to the comments here and from Ian Clatworthy and reedit]:

Bzr plugins and Windows Gui added: rebase, ..., Wildcat BZR, ...

Hg Shelve added.

SLOC for Hg updated (HTML doc used to be counted, I kept contrib which is responsible for the presence of Lisp and Tcl/Tk).

Repository size for git updated after doing proper repack command ( git repack -a -f -d --window=100 --depth=100 until size becomes constant) (Thanks to the comment by dhamma vicaya).

Apologies:

darcs and Monotone were not taken into account in this comparison because it was already hard work to gather all this information and to actually test those 3 DVCS. Strangely, even though they are the oldest on the DVCS scene, the focus is more on the DVCS I reviewed here (which, I admit, doesn't help shift the focus; but darcs and Monotone users/developers are welcome to post comments and advertise here!).

References:

The very exhaustive Wikipedia page about Git.

Distributed Revision Control Wikipedia page.

Comparison of Revision Control Software Wikipedia page.



Distributed Version Control - Why and How by Ian Clatworthy, Canonical (Bazaar).

Intro to Distributed Version Control (Illustrated) by Kalid Azad.

Distributed Version Control Systems by Dave Dribin (who finally chose Mercurial).

Why Distributed Version Control by Wincent Colaiuta.

Source Code Management for OpenSolaris. OpenSolaris SCM Project History (2005).

Mercurial OpenJDK Questions by Kelly O'Hair, Sun.

Why I chose git by Aristotle Pagaltzis.

Distributed SCM by Gnome crew.

FreeBSD SCM Requirements.

Open Office Requirements.

Mozilla VCS Requirements.

Use Mercurial you git! by Ian Dees.

What a DVCS gets you (maybe) by Bill de hÓra.

The Differences Between Mercurial and Git.

And all URLs referenced in this article.

Cheat Sheets:

Git Cheat Sheet

Mercurial Cheat Sheet

Bazaar Quick Start Card

Benchmark conditions.

The benchmark was run on an AMD Athlon(tm) 64 Processor 3500+ with 1GB RAM, on Linux Kubuntu 6.10 Edgy x86_64 with an ext3 filesystem.

Each command was run 8 times (and the best and worst times were discarded). The runs were done locally through the filesystem (other protocols should definitely be tested too: even though DVCS are not coupled to a central server, badly implemented network communication can lower performance for the user).
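The timing procedure can be approximated with a short Python sketch. This is not the script used for the benchmark: the function names are hypothetical, wall-clock time is measured, and averaging the six remaining runs is an assumption, since only the discarding of the best and worst times is stated above.

```python
import statistics
import subprocess
import sys
import time


def time_command(argv, runs=8):
    """Run argv `runs` times and return the wall-clock duration of each run."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


def trimmed_mean(durations):
    """Drop the single best and worst timings, then average the rest."""
    if len(durations) < 3:
        raise ValueError("need at least three timings to trim both extremes")
    kept = sorted(durations)[1:-1]
    return statistics.mean(kept)


if __name__ == "__main__":
    # Placeholder command; substitute e.g. ["hg", "status"] inside a repository.
    timings = time_command([sys.executable, "-c", "pass"])
    print(f"{trimmed_mean(timings):.4f} s over {len(timings) - 2} kept runs")
```

Dropping the extremes limits the influence of a cold cache on the first run or of an unrelated spike in system load on any single run.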

Versions used are:

The repository consists of a snapshot of 12456 changesets (from 20080303, 70853 total revisions from the hg repository) and ~30000 files from the Mozilla repository (originally in hg format and converted to a git repository thanks to hg-fast-export.sh, and to a Bazaar repository thanks to hg-fast-export.sh coupled with the fast-import plugin).

Default file formats were used, and the git repository size remained the same after running git-gc (which can be considered normal for a freshly migrated repository). One file was modified ( dom/src/base/nsDOMClassInfo.cpp ) just like in a benchmark test done by Jst 1.5 years ago.

About the Author

Sébastien Auvray is a senior software designer and technology enthusiast. After being forced to use CVS and svn, he now has to suffer the daily usage of Perforce at work. Sébastien is also one of the Ruby editors of InfoQ.