The SHA-1 hash algorithm has been known for at least a decade to be weak; while no generated hash collisions had been reported, it was assumed that this would happen before too long. On February 23, Google announced that it had succeeded at this task. While the technique used is computationally expensive, this event has clarified what most developers have known for some time: it is time to move away from SHA-1. While the migration has essentially been completed in some areas (SSL certificates, for example), there are still important places where it is heavily used, including at the core of the Git source-code management system. Unsurprisingly, the long-simmering discussion in the Git community on moving away from SHA-1 is now at a full boil.

Git uses SHA-1 extensively. When a "blob" (a revision of a file, essentially) is placed in a repository, the blob's contents (with some Git metadata) are hashed, and the result is used to identify the blob thereafter. The other types of objects in a Git repository — "trees", identifying a directory hierarchy full of blobs, and "commits", describing revisions — are also identified by their SHA-1 hashes. The hash of each commit object is calculated from, among other things, the hash of the previous commit in the chain. The result is that the same commit ID in two repositories is, in the absence of hash collisions, guaranteed to refer not only to the same set of files, but to an identical history leading to that state.
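The blob-hashing scheme is simple enough to reproduce in a few lines. This sketch computes the same ID that `git hash-object` assigns to a file's contents: the SHA-1 of a short "blob" header followed by the data itself (the helper name `git_blob_id` is ours, not Git's):

```python
import hashlib

def git_blob_id(data: bytes) -> str:
    """Compute the ID Git assigns to a blob: the SHA-1 of a header
    ("blob <size in bytes>" plus a NUL) followed by the contents."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

# The empty blob has the same well-known ID in every Git repository:
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the object ID is derived purely from the header and contents, any two repositories storing the same file revision will agree on its name.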

The use of SHA-1 in this way makes it difficult to tamper with the files in a repository; a change to a file will change the resulting hash, so the change will be noticed. Git thus functions as a sort of source-code blockchain that encodes the full repository history in its current state. If an attacker can generate two files with the same SHA-1 hash, though, it may be possible to substitute one for the other in a repository without being detected. That could be a way to get hostile code into any project stored in Git — an outcome that is generally viewed as a bad thing.

There are two separate issues that need to be considered when looking at the implications of SHA-1's demise: how urgent is the need for Git to switch to something else, and how can that switch be carried out?

The sky isn't quite falling yet

The discussion on the use of SHA-1 in Git is far from new. Linus Torvalds posted the first version of Git on April 7, 2005, saying: "It's not an SCM, it's a distribution and archival mechanism. I bet you could make a reasonable SCM on top of it, though." Three weeks later, a conversation on the wisdom of the SHA-1 choice was already underway. At that time, Torvalds responded that SHA-1 is not the real security mechanism used in Git and, as a result, even a full compromise of the hash function would not necessarily be a problem.

His argument, essentially, was that the distributed nature of Git repositories makes an attack difficult even given the ability to generate collisions cheaply. Just replacing an object in a repository is not enough; the attacker would have to find a way to distribute that object to other repositories around the world, which is not an easy task. The colliding object would have to function as C source (if the kernel were the target of attack here), and would have to stand up to a casual inspection — it would have to look like proper kernel source. That increases the difficulty of generating the collision significantly. There are, he noted, easier ways to get bad code into the kernel: "So if you actually wanted to corrupt the kernel tree, you'd do it by just fooling me into accepting a crap patch. Hey, it happens all the time."

After the Google announcement, Torvalds posted a lengthy message about SHA-1 and Git. He pointed out that generating collisions in source code is harder than with PDF files (as Google used) because the latter can contain a great deal of invisible data that does not change the formatted result. The kernel's chain of trust is where the project's security really lies, he said. He also noted that the fingerprints of the technique used to generate this SHA-1 collision are easy to detect, so any eventual attack based on this method can be easily defended against. That said, he also noted (without getting into details) that there is a plan to move Git away from SHA-1 in the near future.

Torvalds's view is fairly sanguine, in other words; others are a bit more worried, for a number of reasons. Not everybody uses Git just for C source code, for example; less "transparent" file types might be more easily subject to attack. The scariest possibility might be firmware blobs, which are just binary code; modifications to such a blob will not be easy to notice via any sort of inspection.

The distribution argument has a significant flaw as well: Git repositories are often not as widely distributed as one might think. There are central hosting sites, such as GitHub and kernel.org, that contain large numbers of repositories; these sites routinely keep objects in a single, central store to reduce storage and backup needs. Kernel.org probably has hundreds of kernel repositories, but commit c470abd4fde40ea6a0846a2beab642a578c0b8cd (tagging the 4.10 release) is the same object in all of them. A single bad object in a central site like this could thus contaminate many repositories.

Joey Hess described how such an attack might be carried out. An attacker gets a subsystem maintainer to accept the "good" version of an object; meanwhile, the "bad" version is placed in a repository on the hosting site. When the maintainer pushes their repository to that site, the bad object may displace the good one, since it already exists in the repository with the SHA-1 ID of the good object. That bad object would then be propagated in any subsequent pushes or pulls.

It is also worth noting that there is a certain amount of invisible data even in "transparent" files like C source. The Git headers themselves have some dark corners where the bits needed to force a collision can be hidden. The good news there is that such an attack is relatively easy to detect. In many cases, the existing "git fsck" functionality will find it, and central sites tend to run fsck regularly already. It turns out that the Git transfer.fsckObjects configuration variable can be used to force a check whenever objects move between repositories. Even Torvalds was surprised to learn that this option exists; there is now talk of enabling it by default.
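For the curious, turning these checks on for all of one's repositories is a one-line configuration change (the capitalization of transfer.fsckObjects follows the Git documentation; the variable name is matched case-insensitively):

```shell
# Verify incoming objects on every fetch or clone, and incoming
# pushes on the receiving side:
git config --global transfer.fsckObjects true

# transfer.fsckObjects is the fallback for these two more
# specific variables, which can also be set individually:
git config --global fetch.fsckObjects true
git config --global receive.fsckObjects true

# A full integrity check of an existing repository can be run by hand:
git fsck --full
```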

Moving on

The Git developers may feel that the weakening of SHA-1 is not an emergency, but there also appears to be a strong consensus that, after all these years, the project needs to move on to a more secure hash algorithm. While it may be true that, as Torvalds said, there is a plan for this transition, it must be said that the plan is in a rather early stage, and that there are some problems that must be solved first.

The first of those is fairly prosaic. A quick look through the Git source turns up a great many variable declarations like:

unsigned char sha1[20];

In other words, the format of the hash used to identify every object in a Git repository is declared as a basic type with a hard-coded constant size. One might think that the developers involved could have avoided this situation, but it is an issue that must be dealt with now. Brian Carlson has been working on switching to an opaque struct object_id type for some time (he first mentioned this work in April 2014), but it is slow going. As of February 25, he still had over 1100 sites in need of conversion.

That work, in any case, is just code refactoring. A trickier task is figuring out how to introduce a new hash algorithm without breaking existing repositories, without requiring the rewriting of the history in those repositories, and while maximizing interoperability. The plan, as worked out primarily by Torvalds and Jeff King, is to introduce a new blob type that is identified by a new hash type (possibly parameterizing the hash type so that the next transition is easier). These blobs could only exist in a repository managed by a version of Git that is new enough to understand their format.

Once the "use the new hash type" bit has been flipped on a given repository, all new objects must use that type. New objects would not be allowed to contain pointers to old-hash objects, with one exception: a new-hash commit could have an old-hash parent. The intent behind this rule is to make the transition to the new object IDs happen as quickly as possible; once the bit is flipped, all new work uses those IDs.

One result of this approach would be some inevitable duplication of objects around the transition, as the same files are stored under both the old and new IDs. The alternative is to perform some sort of mapping or otherwise allow objects to be known under both the old and new IDs, but that would add some significant complexity and would also increase the amount of data stored in the repository. In fact, that kind of mapping could grow in a hurry as the number of hash algorithms used by Git grows. So it seems more likely that the one-time duplication cost is the path that will be chosen.

Once a repository moves to the new format, any other repositories that push to or pull from that repository will also have to change. An attempt to pull from a new-format repository into one that hasn't made the transition will simply fail. So there will be a flag day of sorts for most projects. In the kernel's case, there will presumably come a day when the kernel.org repository starts using new IDs, and the rest of the community will have to follow suit. Such a change should probably happen a fair while after Git itself is capable of using the new IDs so that the updated software is widely distributed by the time it is needed.

A tiny little detail that hasn't yet been worked out is which hash algorithm will be chosen to succeed SHA-1. Most developers appear to think that SHA-3 is the logical next step, but that discussion has not yet begun in earnest.

So, while the sky may not be falling, it is showing increasing signs of structural instability. As has been seen, moving Git to a new hash type is not a trivial task; it will not be accomplished overnight, or even this year if one looks realistically at what needs to be done. The time has certainly come for the project to finally start making real progress on this perennial wishlist item. The good news is that the developers involved would appear to have heard this message and are bringing a new focus to the task.

We would like to be able to trust our software when we run it — that's one reason why we're free software enthusiasts — but without the ability to trust the hardware we run it on, no amount of openness and security in our software can save us. Georg Greve, one-time president of the Free Software Foundation Europe, spent nearly an hour talking about "How Open POWER is changing the game and why the Free Software Community should care". It was a talk that was in many ways an old-time pep rally rather than a technical presentation.

Greve founded the FSFE and steered it for nearly nine years. He's been honored by the German government for services to free software and open standards. But his current daytime job is CEO of Kolab which, while it is an excellent project, is very much a software one. So why, he asked, did he choose to dedicate his time on-stage at FOSDEM to talking about hardware?

We need hardware to run our software on. If we want control over our software, we had better have trust in that underlying hardware. There are two routes to that trust; one is faith-based, and the other is through verifiability. At the moment, the CPUs at the heart of the desktop and server equipment that most of us run our free software on are generally made by Intel, and its track record for both justifiable good faith and verifiable openness is not all that good, he said.

Greve pointed out that every modern Intel x86-type processor contains a second, internal CPU that you cannot audit, but that can take over your machine. That means you can't tell what the people who made your hardware, or the governments to whom those people are beholden, are asking your hardware to do; some recent events give cause for nervousness about what that might be. Greve paused before noting that Kolab is deliberately a European enterprise: "Snowden made us nervous. The recent election confirmed our concerns."

Even worse than what the makers of the hardware might ask it to do is what black hats might ask of it. Sooner or later someone will discover a vulnerability in the management CPU; imagine a rootkit that you cannot keep out, cannot detect, and cannot remove. No, we need a platform that gives us openness, control, and the ability to build our own, he said.

An audience member asked whether early Intel processors might now be clear of patent protection, and therefore eligible for being used as a basis for such a project. Greve replied that he was fairly sure you could lawfully build an 80286, but why would you? The important issue isn't merely open hardware, it's open, cutting-edge hardware.

Fortunately, IBM decided it felt the same way. It took its Power architecture CPUs (yes, the chips formerly used in Apple Macs and many other systems) and gave the architecture in its entirety to the OpenPOWER Foundation. Members of the foundation are allowed to customize OpenPOWER CPUs in order to create products that meet their needs. There's a definite focus on the data-center end of things; OpenPOWER is clearly aimed at people who have big computing needs. It remains focused right down to the sales end: there are companies shipping products based on these CPUs right now.

A question from the audience noted that the individual and academic memberships are non-voting (more precisely, a single board member represents all the "associate and academic" members, no matter how many there may be). The higher and more expensive tiers of membership get board seats based on the number of members. Greve conceded the point, but it was clear that he hoped the foundation would evolve in increasing openness through community participation and that, in any case, there aren't all that many alternatives.

Greve sees OpenPOWER as part of a sea change in mindset. There is an increasing awareness that products based on the old "trust us" mantra are becoming decreasingly attractive in a world that has realized that governments will get their fingers in wherever they can. He drew attention to the OpenCAPI consortium, which is trying to develop a new, open, high-performance bus architecture. One of its major players is AMD: "organizations that have not traditionally shared our core values are suddenly coming on board". We of the free-software community have a lot to offer: we already know about collaborating, sharing, and engaging. Since big players are suddenly listening to words such as these, we've been handed a real opportunity to shape the discussion and its trajectory, he said.

He did briefly mention the TALOS workstation, a Kickstarter project to produce Power-based open machines; he accepted that it had crashed and burned but he wasn't interested in examining the failure in too much depth. He noted that such a device would be quite valuable and hoped that development might restart. Meanwhile, effort right now was better spent in engaging with OpenPOWER; we should build for it, break it, and reassemble it, he said. He offered to get audience members OpenPOWER hardware if they were serious about working with the platform. He also asked them to help spread the word: "When IBM tries to communicate something, it ends up being a well-kept secret. They are horrible communicators."

Anyone with tendrils into the European Parliament was urged to try to raise awareness of the issue. Airbus was the European reaction to US dominance of the airframe industry, which is an industry that was deemed critical enough that Europe needed its own dog in the race; Greve can't see why computer hardware is in any way less important. He noted that China is already building its own customized OpenPOWER chips; it removed the US cryptographic elements, which weren't trusted, and replaced them with its own. Why, he asked, isn't Europe doing exactly the same thing?

For all that this was more of an old-time political rally than a technical talk, and even if you don't rush to line up behind his banner, it's difficult not to concede that Greve has something of a point.

[Thanks to the Linux Foundation, LWN's travel sponsor, for making this article possible.]

This is the last of the articles from this year's FOSDEM and, though the other articles were about particular talks, what follows are my personal impressions from this engaging conference. This was my first FOSDEM. I've known about it for years, but never been before; in retrospect I regret that, because it's not like any other conference I've ever been to.

The first huge difference is that it's free to attend, and fiercely proud of it ("FOSDEM is free. Not just free as in software, but free as in beer, though this being Belgium, the beer is not free."). You don't register to attend, you just show up. Yes, a contribution is asked for, and yes, the queue to donate €25 (and get a T-shirt) can stretch round the block, but there's no pressure put on you to contribute.

The second oddity is that although the FOSDEM staff arrange a substantial core conference — six main talk streams, on and off — most of what happens at FOSDEM isn't actually organized by FOSDEM. There are over forty developer rooms; FOSDEM makes sure the rooms are ready, the A/V equipment is working, and records the talks on video, but arranging the talk program is left to the teams whose room proposals have been accepted. Nearly sixty different organizations had stands at the venue. The Birds-of-a-feather (BoF) sessions are like the developer rooms, but even less formal. There are also the lightning talks: a room is set aside so that people who have something to show, but don't want to do a full talk, can get fifteen minutes on stage to say what they can. An amazing amount of collaborative activity happens at FOSDEM simply because everyone is there, and FOSDEM makes it easy for it to happen.

That means a lot of people; some eight thousand attended FOSDEM over the course of the weekend. Some of those I spoke to weren't planning to go to a single talk. They were there purely to meet fellow collaborators and hack on their projects; they intended to watch the talks on video later. These are people who spend long working weeks coding professionally, and what they do on their time off and at their own expense is exactly the same thing, but on stuff they love, with people they respect. When a significant number of the attendees all show up in the same place at the same time, such as for the opening or closing keynotes, you really get a sense for how many people are there.

One of the reasons you don't get a feel for how many people are at FOSDEM until that point is that the conference is quite widely geographically distributed. Yes, it's all on the Solbosch campus of the Université Libre de Bruxelles (ULB), but that's not small, and although FOSDEM doesn't use all of it, what it does use is fairly spread out. So quite a lot of time is spent walking from one building to another; you follow the ant trails like everyone else, but I still found myself looking at a map each time I moved, for most of day one. It can take ten minutes walking to get all the way across the conference, so when deciding which talks to listen to, where they are is nearly as important as when. As ever, the organizers try to hack to improve people's lives; this year saw the first incarnation of http://nav.fosdem.org (no link, as it's no longer up), a mobile-friendly app designed to get you from talk to talk as efficiently as possible. I tried it, and it was clearly a first incarnation; things like this only get better. The free scheduling and map app on F-Droid was magnificent, though, and kept up with all the room changes. I didn't bother with a paper schedule the whole weekend.

There is a downside to this popularity. FOSDEM can't magically resize the rooms that ULB provides. The conference is aware that the big hall, Janson, is really too big for anything except the opening and closing ceremonies, and that most of the other rooms are too small. They have a system of FULL/OPEN notices for the doors of the particularly-desirable rooms to indicate when a room is at legal capacity, after which time it's one-in-one-out; the queues for entry can get rather long. In the opening keynote they named the Python and Ruby development rooms as being well known for quickly filling up; if you're not there early for those, you're probably not getting in.

FOSDEM definitely has its own way of doing things. We were told in no uncertain terms that the WiFi provided was IPv6-only. A second network, with the subtle SSID "FOSDEM-ancient", was provided for devices that couldn't run IPv6, but we were encouraged to agitate for IPv6-compatible devices ("bang on the heads of those responsible, particularly if they're here").

I spoke to several young Germans who told me how professionally the Chaos Communications Congress video-records its conferences, but FOSDEM proudly displayed its unique render farm; I was told the organizers were able to video the whole of FOSDEM for less than CCC spent on a single room. They were even more proud of the core router and server setup ("this is a lot better than last year's, because it's not balancing on chairs").

One serious event, the discovery during the conference of a systematic attempt to harvest Google account details from participants via a malicious WiFi network, was efficiently dealt with, and Google was notified about the affected accounts.

It is a conference that people love. Those who organize it, those who volunteer at it, and those who attend alike seem to feel really strongly about it. Many who are there attend at their own expense, using their own free time to do so. If I may be forgiven for gazing balefully at my fellow Britons for a minute, I've never felt so surrounded by young Europeans — people who came from all corners of the continent, got by in a whole variety of languages not least of which was English, and collaborated as if national borders were not particularly relevant to them. I still can't believe my country desires not to be a part of that.

FOSDEM is a conference where the opening address isn't given by some high-profile industry figure, but by an impassioned young Bulgarian hacker who spends five minutes of the time talking very movingly about her mother's experiences as a COBOL programmer in 1990s Bulgaria. It is the only time I've ever been in the same room as a thousand other people all of whom know how to put their phones on silent, and remember to do so. FOSDEM is different, and it is fun. You should consider going.

[Thanks to the Linux Foundation, LWN's travel sponsor, for making this article possible.]
