I’ve been mostly blog-silent for the last week because I’ve been working my tail off on a new project. It’s reposurgeon, a tool for performing surgery on repository histories, and there are several interesting things to note about it.

One of my regular guests on the blog, apparently a younger programmer, recently left a comment regretting the old days he wasn’t there for. He reports feeling stifled by the experience of Googling every time he has a nifty programming idea and discovering that someone else has already done it.

But with new days come new challenges, and new opportunities. Reposurgeon exploits a possibility that didn’t exist until quite recently; there has never been anything quite like it before, though there were very partial precedents in svndumpfilter and my own svnsquash/svncutter tool.

reposurgeon is a repository-history editor. With it, you can edit old comments and metadata, remove junk commits (of the sort frequently generated by repository conversion tools such as cvs2svn), and perform various other operations that version-control systems (VCSes) don’t want to let you do.

I wrote it because I’ve been doing a bunch of repository conversions recently and I wanted a way to deal with the crufty artifacts those tend to create. But there are other obvious uses; one would be expunging repo content that has some intellectual-property issue from the history, so you’re not re-infringing every time somebody clones the repository…

And when I say “editor”, I mean a tool general enough to have uses I’m not anticipating. It has a rather powerful little minilanguage in it for specifying selection sets. It lets you dump the metadata from the VCS history (committer, commit date, and comment text) in a textual form that you can edit and feed back into it to modify the history.

Something like this, though limited to one specific VCS, could have been written before now. But the astute reader will note that I haven’t mentioned a specific VCS. In fact, reposurgeon can operate on histories created by RCS, CVS, Subversion, git, Mercurial, bzr, and probably several other VCSes about which I haven’t the faintest clue.

Yes, you may boggle now. If you are wondering where in the fleeping frack I get off claiming to support version-control systems I admit I don’t know anything about, this shows you are paying attention. It has been rumored that I am a clever fellow, but what, what, *what*? Right. You deserve an explanation and shall have one.

The trick that enables reposurgeon to do its magic is that it only fakes editing repositories. What it actually edits is git-fast-import command streams.

Aha! Some of you are already nodding knowingly. For the rest of the audience, a “git-fast-import command stream” is a format Linus Torvalds and friends invented to flatten a git repository history into one big file that can be used to reconstitute the repository by another instance of git.

This is meant to enable writing import tools; the moment you have an exporter that can generate this format from (say) Mercurial or bzr, you can transcode repositories from the other system to git as easily as you pour water from one cup to another. (Well. Some older systems that use only local usernames as committer IDs rather than full email addresses have an issue, but it’s easily worked around in a variety of ways.) And exporters are easy to write; if your special VCS has a command-line interface at all, odds are building an exporter is about a day’s work in the scripting language of your choice.

This stream format has useful properties. It’s easy to parse and self-describing. But the really important property is that it expresses an ontology or data model that is both very simple and general enough to capture the state of repositories made not just with git but with other VCSes. It has to, or it wouldn’t be good enough to support importers – too much state would get lost in transition.
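To make “easy to parse” concrete, here is a minimal sketch in Python of pulling one commit out of a stream. It is not reposurgeon’s actual parser: the function name parse_commit is made up, the author identity is a placeholder, and it understands only the handful of commands that appear in this one sample commit.

```python
import io

# Sample commit in fast-import syntax. The identity is a made-up placeholder.
SAMPLE = b"""commit refs/heads/master
mark :425
author J. Random Hacker <jrh@example.com> 1288572477 -0400
committer J. Random Hacker <jrh@example.com> 1288572477 -0400
data 27
Documentation improvement.
from :423
M 100755 :424 reposurgeon
"""

def parse_commit(stream):
    """Parse a single commit command from a binary stream into a dict."""
    commit = {"fileops": []}
    header = stream.readline().decode().rstrip("\n")
    if not header.startswith("commit "):
        raise ValueError("expected a commit command, got %r" % header)
    commit["branch"] = header.split(" ", 1)[1]
    while True:
        raw = stream.readline()
        if not raw or raw == b"\n":
            break  # end of stream, or the blank line that ends a command
        line = raw.decode().rstrip("\n")
        keyword, _, rest = line.partition(" ")
        if keyword == "data":
            # The comment is length-prefixed, so no quoting rules are needed.
            commit["comment"] = stream.read(int(rest)).decode()
        elif keyword in ("mark", "author", "committer", "from"):
            commit[keyword] = rest
        else:
            commit["fileops"].append(line)  # e.g. "M 100755 :424 reposurgeon"
    return commit

commit = parse_commit(io.BytesIO(SAMPLE))
print(commit["branch"], commit["mark"], commit["fileops"])
```

The byte count on the data line is what makes the format self-describing: the reader never has to guess where a comment ends.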

So, lots of git fans have written exporters for a huge range of VCSes. Not to be outdone, fans of other VCSes have written importers for their systems. They don’t want all the migration to be one-way, you see. Lossless import capability makes a VCS a potential destination, giving it a competitive advantage over systems that can only export projects away from themselves.

Look what’s happening here! Without necessarily intending it, the git crew have created a de-facto-standard interchange format for passing around version-control histories. This is huge. Because what it actually does is decouple the whole I’ve-got-a-project-history thing from any individual version-control system.

Watch for second-order consequences of this in the future. In particular, I predict that VCSes will increasingly converge on supporting exactly the set of abstractions in a fast-import stream. They’re a good enough set, and being interoperable will prove a powerful lure.

While the rumor that I’m a clever fellow isn’t entirely false, the most important knack I have is for seeing the stupid-obvious possibilities that have been sitting under people’s noses all along – in this case, the possibility for a VCS-independent history-editing tool. What reposurgeon actually does is take a fast-import stream in one end, allow you to hack it in various interesting ways, then ship the modified repo out its other end as a modified fast-import stream.

It can look like you’re editing a repository, sure. But that’s because reposurgeon has a method table in it that’s indexed by VCS type and contains a small handful of command-line templates for each of them, including an importer command and an exporter command. And that, basically, is it; that handful of commands is all reposurgeon knows about any specific VCS, and probably all it will ever need to know. Adding reposurgeon support for new VCSes is easy and doesn’t require changing any executable code at all.
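A method table of that kind might look something like the sketch below. This is a guess at the shape, not a copy of reposurgeon’s real table: the names VCS_TABLE, detect_vcs and command_for, the "subdirectory" key, and the whole "examplevcs" entry are invented for illustration. Only the two git command lines are real commands.

```python
import os
import shlex

# Hypothetical method table: one entry of command-line templates per VCS.
VCS_TABLE = {
    "git": {
        "subdirectory": ".git",               # assumed way to recognize a repo
        "exporter": "git fast-export --all",  # writes a stream to stdout
        "importer": "git fast-import --quiet",  # reads a stream from stdin
    },
    # Supporting another system is just one more entry of data.
    # "examplevcs" and its commands are made up.
    "examplevcs": {
        "subdirectory": ".examplevcs",
        "exporter": "examplevcs export-stream",
        "importer": "examplevcs import-stream",
    },
}

def detect_vcs(path):
    """Return the name of the first VCS whose marker directory exists in path."""
    for name, methods in VCS_TABLE.items():
        if os.path.isdir(os.path.join(path, methods["subdirectory"])):
            return name
    return None

def command_for(vcstype, role):
    """Return the argv list for a VCS's "exporter" or "importer" template."""
    return shlex.split(VCS_TABLE[vcstype][role])

print(command_for("git", "exporter"))
```

Because everything VCS-specific lives in data, the code that edits the stream never needs to know which system the stream came from.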

If you want to work with a VCS that isn’t in the list, use the exporter of your choice to dump to a stream file, then tell reposurgeon to load from that. When you’re done, tell reposurgeon to write the stream to another file, then use whatever importer you like to rebuild a repo from it.

There is one significant drawback to operating this way. In a system like git or Mercurial that uses hashes of a commit’s content to identify it, the IDs of anything that’s downstream of a commit you alter or delete will change. This will tend to hose people trying to sync from the modified repo. You do not, repeat not, want to use reposurgeon on a publicly-visible repo – not unless you can get everyone to re-clone it in a clean directory afterwards.
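A toy model shows why the damage propagates. This is not git’s or Mercurial’s real object format; the function commit_ids is invented here, and it assumes only that each ID is a hash covering the commit’s own content plus its parent’s ID.

```python
import hashlib

def commit_ids(comments):
    """Toy model: each ID is a SHA-1 over the parent's ID plus the comment."""
    ids = []
    parent = ""
    for comment in comments:
        parent = hashlib.sha1((parent + "\n" + comment).encode()).hexdigest()
        ids.append(parent)
    return ids

before = commit_ids(["Initial import", "Fix tpyo", "Add feature"])
after = commit_ids(["Initial import", "Fix typo", "Add feature"])

# Editing the second comment changes its ID and every ID downstream of it.
changed = [old != new for old, new in zip(before, after)]
print(changed)  # [False, True, True]
```

The third commit’s comment was never touched, yet its ID changes anyway because its parent’s ID did.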

But this isn’t reposurgeon’s fault; any surgery tool would have the same issue. In a way, that’s liberating; metadata that no surgery tool could possibly preserve even in principle is metadata that reposurgeon doesn’t have to worry about preserving.

There are a couple other fun things about reposurgeon. I wrote and documented the whole thing in eight days from a standing start, and if after seeing the manual page you think that’s a lot of work for eight days you’re not wrong. I was able to be that productive because (a) I didn’t pause to reinvent any wheels, and (b) I stuck to a brutally simple, minimalist design.

An example of not reinventing wheels is how I support metadata editing. Look at the following session transcript; I’ve added some whitespace and comments (beginning with ;;) for clarity:

esr@snark:~/WWW/reposurgeon$ reposurgeon
reposurgeon% read
reposurgeon: from git repo at '.'......(0.20 sec) done.

;; That's the repo being grabbed

reposurgeon% list 426
426 2010-11-01T00:47:57 Documentation improvement.

;; That's a summary listing of event 426, a commit

reposurgeon% write 426
commit refs/heads/master
mark :425
author Eric S. Raymond 1288572477 -0400
committer Eric S. Raymond 1288572477 -0400
data 27
Documentation improvement.
from :423
M 100755 :424 reposurgeon

;; That's how it looks as a fragment of a git-import stream

reposurgeon% mailbox_out 426
------------------------------------------------------------------------------
Event-Number: 426
Author: Eric S. Raymond
Author-Date: Mon 01 Nov 2010 00:47:57 -0400
Committer: Eric S. Raymond
Committer-Date: Mon 01 Nov 2010 00:47:57 -0400

Documentation improvement.

;; And that's how it looks when it's been mailboxized.

That ‘mailboxized’ version is the form you get to edit. It contains all the metadata you can safely modify and nothing that you can’t, with the (unavoidable) exception of the event number. In fact, reposurgeon has an “edit” command you can use that grabs as much of the repository’s metadata as you want, launches an editor session, then de-mailboxizes what you leave when you exit your editor and applies the changed bits.
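Here is a rough sketch of what de-mailboxizing could look like with Python’s stock email parser. It is a guess at the idea, not reposurgeon’s code: the function diff_metadata and the placeholder identity are invented, and the sketch assumes the dashed delimiter line has already been stripped from each message.

```python
from email import message_from_string

# A mailboxized event, minus the dashed delimiter. The identity is made up.
ORIGINAL = """\
Event-Number: 426
Author: J. Random Hacker <jrh@example.com>
Author-Date: Mon 01 Nov 2010 00:47:57 -0400
Committer: J. Random Hacker <jrh@example.com>
Committer-Date: Mon 01 Nov 2010 00:47:57 -0400

Documentation improvement.
"""

def diff_metadata(original_text, edited_text):
    """Return (event number, dict of the fields the user actually changed)."""
    old = message_from_string(original_text)
    new = message_from_string(edited_text)
    changes = {}
    for key in new.keys():
        # The event number only identifies the commit; it is not editable.
        if key != "Event-Number" and new[key] != old[key]:
            changes[key] = new[key]
    if new.get_payload() != old.get_payload():
        changes["comment"] = new.get_payload()
    return int(new["Event-Number"]), changes

# Simulate an editor session that rewrote only the comment text.
edited = ORIGINAL.replace("Documentation improvement.", "Improve the documentation.")
event, changes = diff_metadata(ORIGINAL, edited)
print(event, changes)
```

Comparing the edited text against the original is one simple way to apply only “the changed bits” and leave everything else in the history alone.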

The point here is that I didn’t invent anything I didn’t have to. Reposurgeon isn’t some glossy idiotic GUI thing where you have to edit commit metadata via a form full of clicky-boxes; you get to use your own editor and deal with a data format as simple as an email message.

Reposurgeon is a command-line tool in the classic Unix style. (Yeah, I wrote a book about that once.) Part of the reason I wrote it that way is that it meant I got to use the Python cmd.Cmd class as my interpreter framework – once again, not inventing anything I didn’t have to. But I would have done it anyway, because command-line tools can be scripted. And that was a goal.

(This project is, by the way, reason #2317 why I heart Python. The ready-to-use convenience of the cmd.Cmd class, the email parser, and shlex were absolutely essential to getting this done without bogging me down in low-level implementation details.)

Back to scripting. Here’s a command I could have used to generate that transcript:

reposurgeon 'verbose 1' read 'list 426' 'write 426' 'mailbox_out 426'

See that? Reposurgeon actually improves on the classic style by having no command-line options. None. Instead…the command-line arguments are interpreted exactly the same way user input typed to the prompt would be. There’s only one specially-interpreted command-line cookie: ‘-’, which means “run the interactive interpreter”.

So, for example, I can say “reposurgeon read list -” to my shell prompt; reposurgeon will cheerfully read the repo in the current directory, list its commits and tags, then hand me a prompt. This is what I mean by brutally (and effectively) simple.
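The argument handling is simple enough to sketch in a few lines on top of cmd.Cmd. This is an illustration of the idea rather than reposurgeon’s source: ToySurgeon, its two do-nothing commands, and main are all invented names, and the commands only record that they were called.

```python
import cmd

class ToySurgeon(cmd.Cmd):
    """A toy interpreter; its commands only record that they were called."""
    prompt = "toysurgeon% "

    def __init__(self, stdout=None):
        super().__init__(stdout=stdout)
        self.log = []

    def do_read(self, line):
        """read: pretend to load the repository in the current directory."""
        self.log.append("read")
        self.stdout.write("toysurgeon: pretending to read '.'\n")

    def do_list(self, line):
        """list [SELECTION]: pretend to list the selected events."""
        self.log.append(("list " + line).strip())
        self.stdout.write("toysurgeon: pretending to list %s\n" % (line or "everything"))

    def do_EOF(self, line):
        """Leave the interactive loop on end-of-file."""
        return True

def main(argv, interpreter=None):
    """Treat each argument as a typed command; '-' starts the prompt."""
    if interpreter is None:
        interpreter = ToySurgeon()
    for arg in argv:
        if arg == "-":
            interpreter.cmdloop()      # hand control to the interactive prompt
        else:
            interpreter.onecmd(arg)    # same code path as a typed command line
    return interpreter

surgeon = main(["read", "list 426"])
print(surgeon.log)
```

Calling main(["read", "list", "-"]) would run the two commands and then drop into the prompt, which mirrors the “reposurgeon read list -” invocation above.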

It’s not perfect, of course. I’ve only tested it on small repos with linear histories, which is why I’m calling the initial release 0.1; it needs to be torture-tested using large repos with tricky topologies, and it needs to be tested on things that aren’t git. There are some operations I have planned but haven’t implemented yet, like doing a topological cut of a repo into two repos that keeps the rest of the branch structure intact.

But it’s a good start. It’s already quite useful for my original goal, cleaning up cruft from repo conversions. The core classes are solid. The expression language works. The code is properly factored. It should make a good platform for implementing more complex surgical operations like history merges.

Best of all, every time I add a capability to the tool, it will support every single VCS, now and in the future, that speaks the import-stream format. And the fact that this is even conceivable is a pretty good reason not to pine for the old days.

UPDATE: Thanks to Russ Nelson’s suggestion, the project now travels under the sign of the blue sturgeon. Heh.