A new hash algorithm for Git

The Git source-code management system is famously built on the SHA‑1 hashing algorithm, which has become an increasingly weak foundation over the years. SHA‑1 is now considered to be broken and, despite the fact that it does not yet seem to be so broken that it could be used to compromise Git repositories, users are increasingly worried about its security. The good news is that work on moving Git past SHA‑1 has been underway for some time, and is slowly coming to fruition; there is a version of the code that can be looked at now.

How Git works, simplified

To understand why SHA‑1 matters to Git, it helps to have an idea of how the underlying Git database works. What follows is an oversimplified view of how Git manages objects; readers who are already familiar with this material can skip it.

Git is often described as being built on a content-addressable filesystem — one where you can look up an object if you know that object's contents. That may not seem particularly useful, but there's more than one way to "know" those contents. In particular, you can substitute a cryptographic hash for the contents themselves; that hash is rather easier to work with and has some other useful properties.

Git stores a number of object types, using SHA‑1 hashes to identify them. So, for example, the SHA‑1 hash of drivers/block/floppy.c in a 5.6-merge-window kernel, as calculated by Git, is 485865fd0412e40d041e861506bb3ac11a3a91e3. Conceptually, at least, Git will store that version of floppy.c in a file, using that hash as its name; early versions of Git actually did that. If somebody makes a change to floppy.c, even just removing an extra space from the end of a line, the result will have a completely different SHA‑1 hash and will be stored under a different name.

A Git repository is thus full of file-content objects (called "blobs") with SHA‑1 names; since a new one is created for each revision of a file, they tend to proliferate. Your editor's kernel repository currently contains 8,647,655 objects. But blobs are not the only types of objects stored in a Git repository.

An individual file object holds a particular set of contents, but it has no information about where that file appears in the repository hierarchy. If floppy.c is moved to drivers/staging someday, its hash will remain the same, so its representation in the Git object database will not change. Keeping track of how files are organized into a directory hierarchy is the job of a "tree" object. Any given tree object can be thought of as a collection of blobs (each identified by its SHA‑1 hash, of course) associated with their location in the directory tree. As one might expect, a tree object has an SHA‑1 hash of its own that is used to store it in the repository.

Finally, a "commit" object records the state of the repository at a particular point in time. A commit contains some metadata (committer, date, etc.) along with the SHA‑1 hash of a tree object reflecting the current state of the repository. With that information, Git can check out the repository at a given commit, reproducing the state of the files in the repository at that point. Importantly, a commit also contains the hash of the previous commit (or multiple commits in the case of a merge); it thus records not just the state of the repository, but the previous state, making it possible to determine exactly what changed.

Commits, too, have SHA‑1 hashes, and the hash of the previous commit (or commits) is included in that calculation. If two chains of development end up with the same file contents, the resulting commits will still have different hashes. Thus, unlike some other source-code management systems, Git does not (conceptually, at least) record "deltas" from one revision to the next. Instead, it forms a sort of blockchain, with each block containing the state of the repository at a given commit.
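The way blobs, trees, and commits chain together can be sketched in a few lines. This is a simplified model rather than Git's own code: it handles only regular files in a single directory, sorts entries by plain byte order, and uses a placeholder identity and timestamp; all of the function names are invented for the example.

```python
import hashlib

def object_id(kind: bytes, body: bytes) -> str:
    # Every object is hashed as "<kind> <size>\0" followed by its body.
    header = kind + b" %d\x00" % len(body)
    return hashlib.sha1(header + body).hexdigest()

def tree_body(entries):
    # entries: (mode, name, hex id). Each entry is stored as
    # "<mode> <name>\0" followed by the 20 raw bytes of the hash.
    out = b""
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1]):
        out += mode + b" " + name + b"\x00" + bytes.fromhex(hex_id)
    return out

def commit_body(tree_id: str, parent_ids, message: bytes) -> bytes:
    # Placeholder identity and date (an assumption for this sketch).
    ident = b"A U Thor <author@example.com> 1580000000 +0000"
    lines = [b"tree " + tree_id.encode()]
    lines += [b"parent " + p.encode() for p in parent_ids]
    lines += [b"author " + ident, b"committer " + ident, b"", message]
    return b"\n".join(lines) + b"\n"

def snapshot(content: bytes, parents, message: bytes):
    blob = object_id(b"blob", content)
    tree = object_id(b"tree", tree_body([(b"100644", b"floppy.c", blob)]))
    commit = object_id(b"commit", commit_body(tree, parents, message))
    return blob, tree, commit

blob1, tree1, commit1 = snapshot(b"version 1\n", [], b"first")
blob2, tree2, commit2 = snapshot(b"version 2\n", [commit1], b"second")

# Same file contents as the first commit, but a different history:
blob3, tree3, commit3 = snapshot(b"version 1\n", [commit2], b"revert")
print(tree3 == tree1, commit3 == commit1)
```

A change to the file alters the blob, tree, and commit names in turn, while reverting the contents restores the old tree name but not the old commit name, since the parent hash is part of what gets hashed.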

Why hash security matters

The compromise of kernel.org in 2011 created a fair amount of concern about the security of the kernel source repository. If an attacker were able to put a backdoor into the kernel code, the result could be the eventual compromise of vast numbers of deployed systems. Malicious code placed into the kernel's build system could be run behind any number of corporate and government firewalls. It was not a pleasant scenario but, thanks to the use of Git, it was also not a particularly likely one.

Let us imagine that some attacker has gained control of kernel.org and wants to place some evil code into floppy.c — something unspeakable like a change that replaces random sectors with segments from Rick Astley videos, say. Somehow this change would have to be incorporated into the repository so that it would be included in subsequent pulls. But the change to floppy.c changes its SHA‑1 hash; that, in turn, will change every tree object containing the evil floppy.c and every commit that includes it as well. The head commit for the repository would certainly change, as would older ones if the attacker tried to make the change appear to have happened in the distant past.

Somewhere out there is certainly some developer who actually memorizes SHA‑1 hashes and would immediately notice a change like that. The rest of us probably would not, but Git will. The distributed nature of Git means that there are many copies of the repository out there; as soon as a developer tries to pull from or push to the corrupted repository, the operation will fail due to the mismatched hashes between the two repositories and the corruption will come to light.

Repository integrity is also protected by signed tags, which include the hash for a specific commit and a cryptographic signature. The chain of hashes leading up to a given tag cannot be changed without invalidating the tag itself. The use of signed tags is not universal in the kernel community (and rare to nonexistent in many other projects), but mainline kernel releases are signed that way. When one sees Linus Torvalds's signature on a tag, one knows that the repository is in the state he intended when the tag was applied.

All of this depends on the strength of the hash used, though. If our attacker is able to modify floppy.c in such a way that its SHA‑1 hash does not change, that modification could well go undetected. That is why the news of SHA‑1 hash collisions creates concern; if SHA‑1 cannot be trusted to detect hostile changes, then it is no longer assuring the integrity of the repository.
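To see why a collision defeats this check, consider a toy model in which the hash has been deliberately weakened. The sketch below truncates SHA‑1 to three hex digits, which is purely an assumption made so that a collision can be found in a fraction of a second; real SHA‑1 collisions remain enormously more expensive, and the file contents shown are invented.

```python
import hashlib
from itertools import count

WEAK_HEX_DIGITS = 3  # toy assumption: keep only 12 bits of the hash

def weak_blob_id(content: bytes) -> str:
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()[:WEAK_HEX_DIGITS]

def verify(name: str, content: bytes) -> bool:
    # The integrity check: recompute the hash and compare with the name.
    return weak_blob_id(content) == name

def forge(target_name: str) -> bytes:
    # Brute-force search for different contents with the same weak name.
    for n in count():
        candidate = b"/* evil variant %d */\n" % n
        if weak_blob_id(candidate) == target_name:
            return candidate

good = b"/* honest floppy driver */\n"
name = weak_blob_id(good)
evil = forge(name)

print(verify(name, good), verify(name, evil), evil != good)
```

Once two different sets of contents share a name, the verification step has no way to tell them apart; everything built on top of the hash (trees, commits, signed tags) inherits that blindness.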

The world has not ended yet, fortunately. It is still reasonably expensive to create any sort of SHA‑1 hash collision at all. Creating any new version of floppy.c with the same hash would be hard. An attacker would not just have to do that, though; this new version would have to contain the desired hostile code, still function as a working floppy driver, and not look like an obfuscated C code contest entry (at least not more than it already does). Creating such a beast is probably still unfeasible. But the writing is clearly on the wall; the time when SHA‑1 is too weak for Git is rapidly approaching.

Moving to a stronger hash

Back in the early days of Git, Torvalds was unconcerned about the possibility of SHA‑1 being broken. As a result, he never designed in the ability to switch to a different hash; SHA‑1 is fundamental to how Git operates. As of 2017, the Git code was full of declarations like:

unsigned char sha1[20];

In other words, the type of the hash was deeply wired into the code, and it was assumed that hashes would fit into a 20-byte array.

At that time, Git developer brian m. carlson was already at work to separate the Git core from the specific hash being used; indeed, he had been working on it since 2014. It was unclear what hash might eventually replace SHA‑1, but it was possible to create an abstract type for object hashes that would hide that detail. At this point, that work is done and merged.

The decision on a replacement hash algorithm was made in 2018. A number of possibilities were considered, but the Git community settled on SHA‑256 as the next-generation Git hash. The commit enshrining that choice cites its relatively long history, wide support, and good performance. The community has also decided on (and mostly implemented) a transition plan that is well documented; most of what follows is shamelessly cribbed from that transition document.

With the hash algorithm abstracted out of the core Git code, the transition is, on the surface, relatively easy. A new version of Git can be made with a different hash algorithm, along with a tool that will convert a repository from the old hash to the new. With a simple command like:

git convert-repo --to-hash=sha-256 --frobnicate-blobs --climb-subtrees \
    --liability-waiver=none --use-shovels --carbon-offsets

a user can leave SHA‑1 behind (note that the specific command-line options may differ). There is only one problem with this plan, though: most Git repositories do not operate in a vacuum. This sort of flag-day conversion might work for a tiny project, but it's not going to work well for a project like the kernel. So Git needs to be able to work with both SHA‑1 and SHA‑256 hashes for the foreseeable future. There are a number of implications to this requirement that make themselves felt throughout the system.

One of the transition design goals is that SHA‑256 repositories should be able to interoperate with SHA‑1 repositories managed by older versions of Git. If kernel.org updates to the new format, developers running older versions should still be able to pull from (and push to) that site. That will only happen if Git continues to track the SHA‑1 hashes for each object indefinitely.

For blobs, this tracking will happen through the maintenance of a set of translation tables; given a hash generated with one algorithm, Git will be able to look up the corresponding hash from the other. Needless to say, this lookup will only succeed for objects that are actually in the repository. These translation tables will be maintained in the "pack files" that hold most objects in a contemporary Git repository. There will be a separate table for "loose objects" that are stored as separate files rather than in packs; the cost of lookups in that table is seen as being high enough that measures need to be taken to minimize the number of loose objects in any given repository.
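Conceptually, a translation table is just a two-way mapping between the two names of each object. The sketch below keeps that mapping in a Python dictionary; the class and method names are invented, and the real tables live in pack-file indexes in a format that is not modeled here.

```python
import hashlib
from typing import Optional

def blob_id(algo: str, content: bytes) -> str:
    header = b"blob %d\x00" % len(content)
    return hashlib.new(algo, header + content).hexdigest()

class TranslationTable:
    """Maps each stored blob's SHA-1 name to its SHA-256 name and back."""

    def __init__(self):
        self._other = {}

    def record(self, content: bytes):
        sha1 = blob_id("sha1", content)
        sha256 = blob_id("sha256", content)
        self._other[sha1] = sha256
        self._other[sha256] = sha1
        return sha1, sha256

    def translate(self, hex_id: str) -> Optional[str]:
        # Only objects that were recorded can be translated.
        return self._other.get(hex_id)

table = TranslationTable()
sha1, sha256 = table.record(b"hello\n")
print(table.translate(sha1) == sha256)    # known object: translates
print(table.translate("0" * 40))          # unknown object: None
```

There is no way to compute one hash from the other without the contents, which is why the lookup can only succeed for objects the repository actually holds.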

The handling of other object types is a bit more complicated. An SHA‑1 tree object, for example, must contain SHA‑1 hashes for the objects in the tree. So if such a tree object is requested, Git will have to locate the SHA‑256 version, then translate all the object hashes contained within it before returning it. Similar translations will be required for commits. Signed tags will contain both hashes.
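The tree case can be illustrated with a short sketch that walks a SHA‑256 tree body and rewrites each embedded hash using a translation table. It assumes that the tree entry layout is the same in both formats apart from the hash width, handles only regular files, and uses invented function names and file contents.

```python
import hashlib
from typing import Dict, List, Tuple

def blob_id(algo: str, content: bytes) -> bytes:
    header = b"blob %d\x00" % len(content)
    return hashlib.new(algo, header + content).digest()

def tree_body(algo: str, files: List[Tuple[bytes, bytes]]) -> bytes:
    # files: (name, content). Each entry is "<mode> <name>\0<raw hash>".
    return b"".join(
        b"100644 " + name + b"\x00" + blob_id(algo, content)
        for name, content in sorted(files)
    )

def translate_tree(sha256_body: bytes, to_sha1: Dict[bytes, bytes]) -> bytes:
    out = bytearray()
    pos = 0
    while pos < len(sha256_body):
        nul = sha256_body.index(b"\x00", pos)
        new_id = sha256_body[nul + 1:nul + 33]    # 32-byte SHA-256 name
        out += sha256_body[pos:nul + 1]           # mode, name, NUL
        out += to_sha1[new_id]                    # 20-byte SHA-1 name
        pos = nul + 33
    return bytes(out)

files = [(b"floppy.c", b"driver code\n"), (b"Makefile", b"obj-y += floppy.o\n")]
table = {blob_id("sha256", c): blob_id("sha1", c) for _, c in files}

translated = translate_tree(tree_body("sha256", files), table)
print(translated == tree_body("sha1", files))
```

The translated body is byte-for-byte what an older Git would have stored, so hashing it with SHA‑1 yields the tree name that older clients expect.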

With this machinery in place, Git installations will be interoperable during the transition. Eventually, all users will have upgraded to SHA‑256-capable versions of Git, at which point repository owners could begin turning off the SHA‑1 capability and removing the translation tables. The transition will, at that point, be complete.

Some inconvenient details

There are likely to be some glitches along the way, naturally. One of them is a simple human-factors problem: when a user supplies a hash value, should it be interpreted as SHA‑1 or SHA‑256? In some cases, it's unambiguous; SHA‑1 hashes are 160 bits wide, so a 256-bit hash must be SHA‑256, for example. But a shorter hash could be either, since hashes can be (and often are) abbreviated. The transition document describes a multi-phase process during which the interpretation of hash values would change, but most users are unlikely to go through that process.
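The ambiguity comes down to simple arithmetic on the number of hex digits: 160 bits is 40 digits and 256 bits is 64. A small sketch of the resulting rule follows; the function name is invented, and this is not the multi-phase policy from the transition document.

```python
import string

SHA1_HEX = 160 // 4     # 40 hex digits
SHA256_HEX = 256 // 4   # 64 hex digits

def possible_algorithms(hex_id: str) -> set:
    """Return the hash algorithms a user-supplied hex string could name."""
    if not hex_id or any(c not in string.hexdigits for c in hex_id):
        raise ValueError("not a hex string")
    if len(hex_id) > SHA256_HEX:
        raise ValueError("too long for either algorithm")
    if len(hex_id) > SHA1_HEX:
        return {"sha256"}          # too long to be SHA-1
    return {"sha1", "sha256"}      # may be an abbreviation of either

print(possible_algorithms("485865fd0412"))
print(possible_algorithms("a" * 64))
```

Note that even a full 40-digit string is ambiguous in principle, since it could be an abbreviated SHA‑256 value.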

There is, of course, a way to unambiguously give a hash value in the new Git code, and hashes of both types can even be mixed on the command line; this example comes from the transition document:

git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256}

For a Git user interface this is relatively straightforward and concise, but one can still imagine that users might tire of it relatively quickly. The obvious solution to this sort of bracket fatigue is to fully transition a project to SHA‑256 as quickly as possible.
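For the curious, picking apart the suffix notation shown in the example is not hard. The following sketch uses a regular expression to split a name into its hex digits and optional algorithm tag; it is an illustration of the syntax only, with an invented function name, and does not reflect how Git itself parses revisions.

```python
import re

_NAME = re.compile(r"^(?P<hex>[0-9a-f]+)(?:\^\{(?P<algo>sha1|sha256)\})?$")

def parse_object_name(text: str):
    """Split 'abac87a^{sha1}' into ('abac87a', 'sha1')."""
    match = _NAME.match(text)
    if match is None:
        raise ValueError("unrecognized object name: " + text)
    # The algorithm is None when no suffix was given.
    return match["hex"], match["algo"]

for name in ("abac87a^{sha1}", "f787cac^{sha256}", "abac87a"):
    print(parse_object_name(name))
```

A name without a suffix comes back with no algorithm attached, leaving the caller to apply whatever default interpretation is in force.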

There is another issue out there, though: there are a lot of SHA‑1 hash values in the wild. The kernel repository currently contains over 40,000 commits with a Fixes: tag; each one of those includes an SHA‑1 hash. These hash values also can be found in bug-tracker histories, release announcements, vulnerability disclosures, and more. In a repository without SHA‑1 compatibility, all of those hashes will become meaningless. To address this issue, one can imagine that the Git developers may eventually add a mode where translations for old SHA‑1 hashes remain in the repository, but no SHA‑1 hashes for new objects are added.

Current state

Much of the work to implement the SHA‑256 transition has been done, but it remains in a relatively unstable state and most of it is not even being actively tested yet. In mid-January, carlson posted the first part of this transition code, which clearly only solves part of the problem:

First, it contains the pieces necessary to set up repositories and write _but not read_ extensions.objectFormat. In other words, you can create a SHA‑256 repository, but will be unable to read it.

The value of write-only repositories is generally agreed to be relatively low; not even SCCS was so limited. Carlson's purpose in posting the code at this stage is to try to reveal any core issues that will be harder to change as the work progresses. Developers who are interested in where Git is going may well want to take a close look at this code; converting their working repositories over is not recommended, though.

As it turns out, carlson's work goes well beyond what has been put out for testing now; he will post it when he is ready, but really curious people can see it now in his GitHub repository. This work is unlikely to land on the systems of most Git users for some time yet, but it is good to know that it is getting close to ready. The Git developers (carlson in particular) have quietly been working on this project for years; we will all benefit from it.

