Large file management with git-annex

As its introduction says, git-annex sounds like something of a paradox. It uses Git to manage files that are larger than Git can easily handle—without checking them into the repository. But git-annex provides ways to track those files using much of the same infrastructure as Git, so that moving or deleting those files can all be tracked in much the same way as committed files. In addition, git-annex allows for branches and distributed clones of its trees.

Developer Joey Hess lists two use cases for git-annex that will appeal to folks who juggle many large files on multiple storage devices, frequently move between different locations and computers, or some combination thereof. Because git-annex tracks the locations of the actual data files, which may not be locally present, it can act like a hierarchical storage manager. The filenames will be present in the repository, but their content may need to be fetched from elsewhere or from a currently offline disk. git-annex will fetch the data if it can find it in an online repository or ask that a particular repository be made available.

In addition, git-annex ensures that there is at least one copy—though it can be configured to keep more than one—of a file's contents available before dropping the file from a local repository. That way, the user can drop a large file (or files) from their laptop, say, while knowing that the contents are still available on some other repository that git-annex was able to contact. For "The Archivist", which is one of Hess's use cases, that is essential, so that they can reorganize their files at will, while knowing that they can't be accidentally deleted.

But those same attributes are useful to "The Nomad" (Hess's other use case):

When she has 1 bar on her cell, Alice queues up interesting files on her server for later. At a coffee shop, she has git-annex download them to her USB drive. High in the sky or in a remote cabin, she catches up on podcasts, videos, and games, first letting git-annex copy them from her USB drive to the netbook (this saves battery power). When she's done, she tells git-annex which to keep and which to remove. They're all removed from her netbook to save space, and Alice [knows] that next time she syncs up to the net, her changes will be synced back to her server.

It does all this via a git-annex binary that is built from Haskell sources. That allows git-annex to integrate with Git, so using it is as simple as "git annex ...". Unlike many free software utilities, git-annex also comes with fairly extensive documentation, including a man page and a walk-through. As might be expected, the code is available via a Git repository, though Debian unstable users can apt-get install it.

When files are added to git-annex, their content is moved to a .git/annex/objects directory and a symbolic link is created using the original filename and pointing to the content. Those symbolic links are handled by Git directly, while git-annex arranges for the content to be present as requested. Creating a repository is pretty straightforward:

    $ mkdir ~/annextst
    $ cd ~/annextst
    $ git init
    $ git annex init "desktop repo"

The "git annex init" command gives the annex a name that can be used to identify the repository later on. One then adds files to the repository in a fairly obvious way:

    $ cp /tmp/big_file .
    $ git annex add .
    add big_file ok
    $ git commit -a -m "added big_file"

The last command may seem a bit surprising, but Git is what will track the symbolic link(s) that "git annex add" created. As the walk-through shows, that Git repository can be cloned elsewhere (on another machine or a removable USB device, for example) and then each of those repositories can be added as remote repositories (i.e. "git remote") of each other. The only additional step for turning the clone into a git-annex repository is to do:

    $ git annex init "some other repo"

in the cloned directory.
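To make the symbolic-link arrangement described above more concrete, here is a minimal sketch that mimics it with ordinary shell tools. It is not git-annex's implementation: the annex_add function, the objects directory in a scratch location, and the "SHA1-" key naming are all assumptions made for illustration, and it relies on sha1sum from GNU coreutils.

```shell
#!/bin/sh
set -eu

# Hypothetical scratch area; git-annex itself keeps content under .git/annex/objects.
work=$(mktemp -d)
cd "$work"
mkdir -p objects

printf 'pretend this is a very large file\n' > big_file

# annex_add FILE: move FILE's content into the store and leave a symlink behind.
# Illustrative only; handles files in the current directory.
annex_add() {
    key="SHA1-$(sha1sum "$1" | cut -d' ' -f1)"   # assumed key naming
    mv "$1" "objects/$key"
    ln -s "objects/$key" "$1"
}

annex_add big_file

ls -l big_file    # the name is now a link into objects/
cat big_file      # the content is still readable through the link
```

Only the small symbolic link would need to be committed to Git in this arrangement; the content itself stays outside the history, which is the point of the design.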

Getting file content is as simple as doing:

    $ git annex get some_file

while removing files is done with:

    $ git annex drop some_file

though that may fail if git-annex cannot find another copy in the repositories it can currently contact (which can, of course, be overridden). Syncing between repositories is done with the usual "git pull" command. Another nice feature of git-annex is that it works seamlessly with files that are already present in the Git repository, so handling a combination of giant and normal-sized files is easy.
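The safety check behind dropping a file can be sketched in a few lines of shell. This is only an illustration of the idea, not how git-annex does it internally: the count_copies and safe_drop functions, the NUMCOPIES variable, and the use of plain directories as stand-ins for repositories are all hypothetical.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
mkdir -p "$work/laptop" "$work/usb"

# count_copies KEY REPO...: print how many of the listed repos hold KEY.
count_copies() {
    key=$1; shift
    n=0
    for repo in "$@"; do
        if [ -f "$repo/$key" ]; then n=$((n + 1)); fi
    done
    echo "$n"
}

# safe_drop KEY LOCAL REMOTE...: remove LOCAL's copy only if enough
# reachable remotes still hold the content (NUMCOPIES, default 1).
safe_drop() {
    key=$1; local_repo=$2; shift 2
    required=${NUMCOPIES:-1}
    found=$(count_copies "$key" "$@")
    if [ "$found" -ge "$required" ]; then
        rm -f "$local_repo/$key"
        echo "drop $key ok"
    else
        echo "drop $key failed: $found of $required required copies found" >&2
        return 1
    fi
}

printf 'data\n' > "$work/laptop/some_file"

# No other copy exists yet, so the drop is refused.
safe_drop some_file "$work/laptop" "$work/usb" || echo "kept local copy"

# Once the USB drive has the content, the drop goes through.
cp "$work/laptop/some_file" "$work/usb/"
safe_drop some_file "$work/laptop" "$work/usb"
```

Setting NUMCOPIES to a higher value in this sketch corresponds to the configurable minimum number of copies mentioned earlier.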

There are several types of storage back-ends that git-annex can use to store the key-value pairs that relate the filename to its contents. The default is WORM (write once, read many), which is also the least expensive because it assumes that file contents do not change once they have been stored. The SHA1 backend stores the file content object based on its SHA1 hash, which can be an expensive operation on very large files, but will track changes to the contents. There is also a URL backend that fetches the content from an external URL (as the name implies).
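The trade-off between the WORM and SHA1 backends can be illustrated with a rough sketch. It assumes that a WORM-style key is derived only from cheap metadata (file name, size, and modification time) while a SHA1-style key is derived from the content. The worm_key and sha1_key functions and the exact key strings are hypothetical and should not be read as git-annex's on-disk format. The script also assumes GNU stat and sha1sum.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
cd "$work"

# Cheap key: metadata only, so the file is never read.
worm_key() {
    size=$(wc -c < "$1" | tr -d ' ')
    mtime=$(stat -c %Y "$1")    # GNU stat; BSD stat would need -f %m
    printf 'WORM-s%s-m%s--%s\n' "$size" "$mtime" "$(basename "$1")"
}

# Expensive key: the whole file must be read and hashed.
sha1_key() {
    printf 'SHA1-%s\n' "$(sha1sum "$1" | cut -d' ' -f1)"
}

printf 'hello\n' > demo.txt
worm_before=$(worm_key demo.txt)
sha1_before=$(sha1_key demo.txt)

# Change the content while keeping the size and modification time the same.
touch -r demo.txt stamp
printf 'jello\n' > demo.txt
touch -r stamp demo.txt

worm_after=$(worm_key demo.txt)
sha1_after=$(sha1_key demo.txt)

echo "WORM before: $worm_before"
echo "WORM after:  $worm_after"
echo "SHA1 before: $sha1_before"
echo "SHA1 after:  $sha1_after"
```

In this sketch the metadata-based key is unchanged after the edit, while the hash-based key differs, which is why a content hash can track changes at the cost of reading every byte of a very large file.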