I’ve spent a lot of time and bandwidth on this blog thinking out loud about version-control systems and software forges. In my last post, I announced that I was going to try to sneak up on the problem of designing a better software forge by enhancing Roundup.

Over the last three years I’ve gotten a couple different versions of the following response to my thinking-out-loud: “Centralized forges and bugtracking are old-school thinking, as hoary as centralized VCSes. Why shouldn’t all that metadata live in the project repo and be peer-merged on demand the way code is?”

This is a good question, but I think the people advocating systems like Bugs Everywhere, scmbug, and ticgit have invested a lot of cleverness in the wrong answer.

In order to see why, we need to look at a very basic question: Why do DVCSes work? And more to the point, why did it take us so long to figure out that they would work?

The original, oldest-school RCS/SCCS model of source control used locking. When you wanted to modify a file, you checked it out and locked it. Then you modified it, checked it back in, and released the lock.

The locking model was founded on two assumptions. First, that modification conflicts would be frequent and severe enough that they could only be prevented by granting programmers temporary keep-out exclusions. Second, that the timescale of whole coherent changes – the interval between “I start working on this thought” and “I’m done” – would generally be pretty short.

The people who wrote these early version control systems believed, in other words, that source code has locality and contention patterns a lot like a database. Databases beg to have access to them carefully serialized, with reliable locking, because their usage pattern is one that involves frequent access contention over small pieces of data.

But both database-centric assumptions were incorrect. Actual experience showed that modification conflicts in source code are rare, usually mechanically resolvable, and when not, almost always easy to resolve by eyeball and hand. On the other hand, the actual time scale of coherent changes is long enough that locking all the files they touch over the whole span would frequently cause lock contention, even though conflicts over the individual small spans of code actually being modified within them are rare.

As we gradually figured this out over a span of about twenty years, conflict resolution in VCSes moved from lock-based to merge-based, with DVCSes at the end of that evolution.

DVCSes are based on the assumption that a programmer can clone a repo, disappear into a cave, and spend days or weeks coding in isolation in the sublime confidence that when he/she wants to rejoin the world, peer-merging with other repos will still be pretty easy. And…usually…it is easy. Large projects like the Linux kernel depend especially heavily on this assumption – they’d collapse if it weren’t true.
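To make the cave workflow concrete, here is a minimal sketch using git. Every repository name, file name, identity, and commit message below is invented for the demo, and it assumes a git recent enough (2.28+) to support `init -b`:

```shell
# Hypothetical two-repo session; all names here are made up.
set -e
work=$(mktemp -d)
cd "$work"
git init -q -b main upstream && cd upstream
git config user.email mainline@example.com && git config user.name Mainline
echo 'core code' > core.c
git add core.c && git commit -q -m 'initial state'

# J. Random Neckbeard clones and disappears into the cave...
cd .. && git clone -q upstream cave && cd cave
git config user.email neckbeard@example.com && git config user.name Neckbeard
echo 'big refactor' > refactor.c
git add refactor.c && git commit -q -m 'weeks of isolated hacking'

# ...while the public repo moves on, in different files:
cd ../upstream
echo 'new feature' > feature.c
git add feature.c && git commit -q -m 'mainline work continues'

# Rejoining the world: the peer-merge succeeds automatically,
# because the two changesets touch disjoint code.
cd ../cave
GIT_MERGE_AUTOEDIT=no git pull -q --no-rebase origin main
ls    # core.c  feature.c  refactor.c
```

The point of the sketch is the last two commands: after weeks of divergence, rejoining costs one pull, because nothing the cave dweller touched collided with mainline.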

Code, it turns out, is not like a database. Strictly serializing access to source code isn’t that important, because most changesets are mutually irrelevant and mutually commutative. (Though when you put it that way, a programmer will be apt to gulp and boggle before eventually conceding the point.)
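That commutativity claim can be demonstrated with nothing fancier than diff(1) and patch(1). A toy example, with throwaway file names invented for illustration:

```shell
# Two changes that touch different lines can be applied in either
# order and yield the identical result. All files here are throwaways.
cd "$(mktemp -d)"
printf 'alpha\nbeta\ngamma\n' > base.txt
sed '1s/alpha/ALPHA/' base.txt > a.txt     # change A edits line 1
sed '3s/gamma/GAMMA/' base.txt > b.txt     # change B edits line 3
diff -U0 base.txt a.txt > a.patch || true  # diff exits 1 when files differ
diff -U0 base.txt b.txt > b.patch || true

cp base.txt ab.txt; patch -s ab.txt a.patch; patch -s ab.txt b.patch
cp base.txt ba.txt; patch -s ba.txt b.patch; patch -s ba.txt a.patch

cmp -s ab.txt ba.txt && echo 'same result either way'
```

Merge-based VCSes are, at bottom, a bet that real-world changesets look like A and B here far more often than they overlap.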

OK, so, the people who advocate decentralized forges and bugtracking have on the face of it a strong historical case. Decentralization worked big-time for managing code changes; isn’t it silly, a repetition of old-school locking-VCS narrow-think, to doubt that it will work just as well for…say…bug-tracking?

Indeed, a stupid person could reject distributed bug-tracking for stupid reasons. But that doesn’t make all reasons for doubt stupid – and the right question to ask is whether bug records (and other pieces of project metadata) have an access pattern more like source code or a database.

I think bug trackers are more like databases than like source code in almost all relevant ways. It’s not necessarily relevant that they’re normally implemented on top of databases; I’m talking about the human workflow around them.

The difference is this: When you check out a revision of software, you need to have a coherent state of it, but it’s not necessary that you have every single bleeding-edge changeset of every developer hacking on it everywhere. If J. Random Neckbeard is off in a cave somewhere refactoring a major subsystem, you don’t even want to see his changes until he decides he’s reached a good point to merge up to the public repository.

The natural cycle of conversation around bug reports is a lot tighter. If you’re part of an issue-tracker thread that is trying to characterize a bug, and someone else posts critical test results, you want to know about that right now.

It’s not that modification conflicts are important in this context; generally they aren’t. No, the issues are (a) timeliness, and (b) having a defined rendezvous point where you can browse fresh metadata, chatter, and attachments related to your bug. All these pull in the direction of centralization, or at the very least a single aggregated event feed at a known location – something more like a blog or social network than a flock of DVCS repos passing around changesets.

Other things pull this way as well. Consider this very apt quote from Jonathan Corbet in 2008:

A bug tracker serves as a sort of to-do list for developers, but there is more to it than that. It is also a focal point for a conversation between developers and users. Most users are unlikely to be impressed by a message like “set up a git repository and run these commands to file or comment on a bug.” There is, in other words, value in a central system with a web interface which makes the issue tracking system accessible to a wider community. Any distributed bug tracking system which does not facilitate this wider conversation will, in the end, not be successful.

He’s got a strong point. The perceived technical elegance of distributed bug-tracking gains us nothing if it locks out people who aren’t developers.

Corbet also reminds us of an interesting fact when he brings up to-do lists: projects normally have several different kinds of to-do lists, and they are managed in different ways.

At one extreme, we have roadmap and design documents that change infrequently and have code-like access patterns – that is, modification conflicts are unusual, and having the version of the design document matched to your code is usually much more important than having the latest version.

At the other extreme, we have the implicit to-do queue provided by an issue tracker. Items on this tend to change much more quickly and have shorter lifetimes.

Somewhere in the middle is the traditional TO-DO file, which tends to be a sort of graffiti wall describing medium-scale tasks.

The point I’m driving at here is that the differing ways we manage these to-do lists are a consequence of the workflow around them. To-dos with code-like access patterns want to live in the repository with your code; to-dos with database-like access patterns want to live in a bug tracker or some other specialized database-like engine (blog, wiki, whatever).

There’s a more general point here about software forges. Software forges – centralized rendezvous points where project metadata lives in something that is not your repository – make sense precisely to the extent that some project metadata is not like code.

Bug databases are the most obvious example. Another is wikis. Mailing lists are a third: when you’re on a mailing list, you really want the latest state of the conversation, not just whatever state your repo happened to capture on the last pull.

To sum up: there are natural roles for both the DVCS and the bugtracker/forge, defined by the workflows around them. If we try and force either tool to cover the entire role of the other, the “solution” won’t be comfortable for developers and users, won’t scale well, and just plain won’t fit – no matter how much love and ingenuity we expend on a sweet technical hack.