[Cryptography] SHA1 collisions make Git vulnerable to attacks by third parties, not just repo maintainers

I tried to fix this when git was young, when it would've been easy. Linus rejected the suggestion and didn't seem to understand the threat. He wired assumptions about SHA1 deeply into git. In the next few years, nasty people will teach him the threat model, with ungentle manipulations of his and many other people's source trees.

	John

To: torvalds at osdl.org, gnu at toad.com
Subject: SHA1 is broken; be sure to parameterize your hash function
Date: Sat, 23 Apr 2005 15:21:07 -0700
From: John Gilmore <gnu at new.toad.com>

It's interesting watching git evolve. I have one comment, which is that the code and the contributors are throwing around the term "SHA1 hash" a lot. They shouldn't. SHA1 has been broken; it's possible to generate two different blobs that hash to the same SHA1 hash. (MD5 has totally failed; there's a one-machine one-day crack. SHA1 is still *hard* to crack.) But as Jon Callas and Bruce Schneier said: "Attacks always get better; they never get worse. It's time to walk, but not run, to the fire exits. You don't see smoke, but the fire alarms have gone off. It's time for us all to migrate away from SHA-1." See the summary with bibliography at:

  http://www.schneier.com/crypto-gram-0503.html

Since we don't have a reliable long-term hash function today, you'll have to change hash functions a few years out. Some foresight now will save much later pain in keeping big trees like the kernel secure. Either that, or you'll want to re-examine git's security assumptions now: what are the implications if multiple different blobs can be intentionally generated that have the same hash? My initial guess is that changing hash functions will be easier than making git work in the presence of unreliable hashing.

In the git sources, you'll need to install a better hash function when one is invented. For now, just make sure the code and the repositories are modular -- they don't care what hash function is in use.

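A minimal sketch of what "modular" could mean in code, assuming git's documented blob naming (the hash of "blob <length>\0<content>"). The function name object_id and the "algo:hex" tagging are hypothetical illustrations, not anything in the git sources; the tag is there so that nothing has to guess the hash function from the length of the name.

```python
import hashlib


def object_id(kind: str, body: bytes, algo: str = "sha1") -> str:
    """Name an object by hashing a git-style header plus the body.

    The algorithm is a parameter, and the returned name carries it
    explicitly (hypothetical "algo:hex" form), so callers never infer
    the hash function from the length of the digest.
    """
    header = f"{kind} {len(body)}\0".encode()
    digest = hashlib.new(algo, header + body).hexdigest()
    return f"{algo}:{digest}"


# Same content, two hash functions; only the parameter changes.
print(object_id("blob", b""))            # sha1:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(object_id("blob", b"", "sha256"))  # a 64-hex-digit name under a different function
```

The SHA1 name printed for the empty blob is the one git itself assigns to an empty file; the sha256 variant only shows that swapping the function need not touch the callers.
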
Whether that means making a single git repository able to use several hash functions, or merely making it possible to have one repository that uses SHA1 and another that uses some future WonderHash, is a system design decision for you and the git contributors to make.

The simplest case -- copying a repository with one hash function into a new repository using a different hash function -- will change not only all the hashes, but also the contents of objects that use hash values to point to other objects. If any of those objects are signed (e.g. by PGP keys) then those signatures will not be valid in the new copy.

Adding support now for SHA256 as well as SHA1 would make it likely that at least git has no wired-in dependencies on the *names* or *lengths* of hashes, and let you explore the system level issues. (I wouldn't build in the assumption that each different hash function produces a different length output, either, though these two happen to.)

	Enjoy,
	John Gilmore

Date: Mon, 25 Apr 2005 13:38:40 -0700 (PDT)
From: Linus Torvalds <torvalds at osdl.org>
To: Seth David Schoen <schoen at eff.org>
cc: John Gilmore <gnu at toad.com>, Kees Cook <kees at osdl.org>
Subject: Re: John Gilmore on SHA-1 [gnu at toad.com: Pls forward to Linus: SHA1 is broken]
In-Reply-To: <20050425192520.GS14282 at zork.net>

...

As to your SHA1 concerns:

> It's interesting watching git evolve. I have one comment, which is
> that the code and the contributors are throwing around the term "SHA1
> hash" a lot. They shouldn't. SHA1 has been broken; it's possible to
> generate two different blobs that hash to the same SHA1 hash.

Actually, even the theoretical breaking has not been proven for a pre-existing SHA1 hash (ie you need to control both the starting point for it), and more importantly, git really uses the SHA1 as a _hash_, not necessarily as a cryptographically secure one.

IOW, security doesn't actually depend on the hash being cryptographic, and all git really wants is to avoid collisions, ie it wants it to hash the contents well. That, sha1 definitely does, and even an md5sum would suffice (but having 160 bits instead of "just" 128 obviously adds to the space, so that's always a bonus).

Of course, the fact that sha1 is also very expensive to try to fool is a big bonus, since it means that it's just another layer on the real security model.

But the _real_ security comes from the fact that git is distributed, which means that a developer should never actually use a public tree for his development. For example, I've got two separate firewall layers (and a NAT) in between me and the internet, and my personal tree is on that machine. I never actually trust or use the external trees - I just push the result to them.

This is something you cannot do with a centralized SCM server like SVN or other traditional crud. A centralized one obviously has to be accessible to all the developers, which means that it's forced to be open enough to be much more easily attackable, and also means that there is a single point of failure also from a security standpoint.

In contrast, even if somebody were to compromise my machine, that does _not_ automatically compromise the trees of other developers. They'd still have all the pristine objects, and never even fetch an object from me that has the same name (ie sha1 hash) as one they already have.

In other words, to really break a git archive, you need to

 - be able to replace an existing SHA1 hash'ed object with one that hashes to the same thing (_not_ the breakage that has been shown to be possible already)

 - the replacement has to still honor all the other git consistency checks (even "blob" objects have them: they need to have a valid header with a valid length, so it's not sufficient to just find another object that hashes to the right thing, you have to find an object with a valid header that hashes to the right thing)

 - you have to break in to _all_ archives that already have that object and replace it quietly enough that nobody notices.

Quite frankly, it's not worth worrying about. It's a hell of a lot easier to just break a source archive with other means (ie pay a developer ten million dollars to just insert the back door you want inserted).

		Linus

To: David Wagner <daw at cs.berkeley.edu>
Subject: Re: Linus Torvalds: Re: SHA1 is broken
Date: Fri, 29 Apr 2005 01:20:21 -0700
From: John Gilmore <gnu at toad.com>

> SHA1 isn't totally broken yet. The attack still requires at least
> 2^60 work to find a collision.

Knew that -- but "Attacks never get harder, only easier."

> No one has publicly reported finding a collision in SHA1 yet.

I thought the Chinese team had reported four pairs of colliding plaintexts -- they just hadn't revealed exactly how they generated them. Or are you distinguishing "finding" from "generating" a collision?

> One question I would have is what is the impact of a SHA1 collision on
> his system? In other words, what harm can you do if you can find SHA1
> collisions efficiently? I'm not familiar with his source mgmt system,
> but if there is little harm one can do with a collision, then maybe it
> just doesn't matter very much.

Here's the mailing list for git:

  http://kerneltrap.org/mailarchive/15/overview/browse/month

Somewhere in there it told me where to find the sources, which include a design document about how it works.

Ah, there it is:

  http://www.kernel.org/pub/software/scm/cogito/
  http://www.kernel.org/pub/software/scm/cogito/README

Basically, it assumes, deeply embedded, that if two blobs have the same hash, they ARE THE SAME BLOB. You can destroy its integrity by feeding it various blobs which happen to hash to the same values. He seems to think that the only possible attack is that someone would go in and modify the database by hand -- rather than feeding it new input that confuses it.

	John

PS (added 25 Feb 2017): If you assume NSA is six months or a year ahead of the open academic/industrial sector in attacking SHA1, what would they have already subverted using a similar attack? Hmm, check the "cmp" and "diff" sources! If you don't trust the SHA1 hashes that say two trees are the same, the second step is comparing the trees of files directly. Making an input pattern that causes cmp and diff to always say, "yup, no differences here!" would allow any fraudulently inserted modifications to spread much further.
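
The "same hash means same blob" assumption discussed above can be illustrated with a toy content-addressed store. This is a sketch under loud assumptions, not git's code: ToyStore, weak_id and find_collision are hypothetical names, and the hash is deliberately truncated to 8 bits so that a colliding input can be brute-forced in a moment (a stand-in for a real collision attack, which is far more expensive). The point is only that the store is confused by being fed new input, with nobody editing its database by hand.

```python
import hashlib
from itertools import count


def weak_id(data: bytes) -> str:
    # Deliberately weak name: only the first 2 hex digits (8 bits) of SHA1,
    # so that collisions are trivial to find for this demonstration.
    return hashlib.sha1(data).hexdigest()[:2]


class ToyStore:
    """Toy object store that trusts the name: same id is taken as same content."""

    def __init__(self):
        self.objects = {}

    def add(self, data: bytes) -> str:
        oid = weak_id(data)
        # If the id is already present, the new data is assumed identical
        # and is silently dropped; the existing object wins.
        self.objects.setdefault(oid, data)
        return oid

    def get(self, oid: str) -> bytes:
        return self.objects[oid]


def find_collision(target: bytes) -> bytes:
    # Brute-force a different input with the same weak id as the target.
    want = weak_id(target)
    for i in count():
        candidate = b"evil-%d" % i
        if candidate != target and weak_id(candidate) == want:
            return candidate


store = ToyStore()
good = b"good code"
evil = find_collision(good)

store.add(evil)           # the attacker's blob arrives first, as ordinary input
oid = store.add(good)     # the honest blob is treated as "already present"
print(store.get(oid) == good)   # False: the name now resolves to the attacker's bytes
```

With a full-strength hash the same logic is safe only for as long as nobody can produce two inputs with the same name, which is exactly the property in question.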