Stack Overflow is leaking user emails

Unsafe use of md5 hashes

I am developing GitSpo, a “Google Alerts” for developers. I have not figured out exactly what it is yet, but it is growing fast and people like it. A big part of GitSpo is aggregating data from different social networks, such as Twitter, LinkedIn, and Stack Overflow. That is how I noticed something odd: Stack Overflow user profiles use Gravatar by default.

Leaks due to old pipes

For those of you not familiar, Gravatar is a service that allows you to associate an image (an avatar) with your email. That image can then be used by other websites (e.g. Stack Overflow) to display an avatar for people signing up on their website. A user’s avatar is found by hashing their email. For example, my email is gajus@gajus.com, and anyone who has my email can generate a Gravatar URL:

https://www.gravatar.com/avatar/74a5bd659b3a8af09a336a932eebe3b1

This URL loads my avatar.

The service was launched in 2007 and grew rapidly, at least in part due to it being the default avatar for comments left on WordPress sites. It is a neat idea: upload an avatar once and have it follow you around the Internet. Update your Gravatar, and your avatar updates across all websites. Unfortunately, the hashing algorithm they’ve chosen is not particularly safe.

The Gravatar image URL is generated by MD5 hashing a trimmed, lower-case representation of your email, i.e. md5('gajus@gajus.com') === '74a5bd659b3a8af09a336a932eebe3b1'. MD5 is a fast hash, which makes it cheap to compute in bulk. Using MD5 to hash private data was a bad choice even at the time. Today, there are MD5 databases that contain over 90 trillion hashes. Furthermore, as most emails contain only a narrow range of characters ( /^[a-z@\-.]+$/ ) and you can assume their ending (popular email domains like @gmail.com ), there are far fewer permutations that need to be pre-hashed than the length of an email address would suggest.
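The normalise-then-hash step described above can be sketched in a few lines of Python. This is only an illustration of the scheme as described in this post (trim, lower-case, MD5). The names gravatar_hash, gravatar_url and GRAVATAR_BASE are made up for the example and are not part of any Gravatar library.

```python
import hashlib

# Base URL taken from the example URL in this post.
GRAVATAR_BASE = "https://www.gravatar.com/avatar/"


def gravatar_hash(email: str) -> str:
    """Return the MD5 hex digest of the trimmed, lower-cased email."""
    normalized = email.strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def gravatar_url(email: str) -> str:
    """Build the avatar URL for an email (hypothetical helper)."""
    return GRAVATAR_BASE + gravatar_hash(email)


if __name__ == "__main__":
    # Whitespace and letter case do not change the resulting URL.
    print(gravatar_url("  Gajus@Gajus.com "))
    print(gravatar_url("gajus@gajus.com"))
```

Because the hash depends only on the normalised email, anyone who knows or guesses an address can compute the same URL without ever contacting Gravatar.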

As an experiment, I picked hashes of 1000 Stack Overflow profiles and used one of the MD5 ‘decryption’ services, which gave me 721 emails (a 72% success rate).

However, the interesting use case is not getting the emails. A lot of developer emails are already semi-public, e.g. GitHub user emails can be obtained from their public profile, commit logs, license files, or even comments in the code. As GitSpo has an index of all public GitHub users and repositories, I was able to extract the associated email addresses, hash them, and match them to Stack Overflow profiles. All 1000 of them.
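The matching step above amounts to a join on the hash. Here is a minimal sketch, assuming you already hold a list of known emails and a list of avatar hashes taken from profile image URLs. The function names and the example.com addresses are hypothetical, and this is not GitSpo’s actual code.

```python
import hashlib


def gravatar_hash(email: str) -> str:
    """MD5 hex digest of the trimmed, lower-cased email."""
    return hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()


def match_emails_to_hashes(known_emails, avatar_hashes):
    """Map each avatar hash to a known email that produces it.

    Builds a hash -> email lookup table once, then checks every
    avatar hash against it. Hashes with no known email are skipped.
    """
    lookup = {gravatar_hash(email): email for email in known_emails}
    return {h: lookup[h] for h in avatar_hashes if h in lookup}


if __name__ == "__main__":
    # Hypothetical data: emails from one source, hashes from another.
    emails = ["alice@example.com", "Bob@Example.org"]
    hashes = [gravatar_hash("bob@example.org"), "0" * 32]
    print(match_emails_to_hashes(emails, hashes))
```

No hash ever has to be reversed here: hashing a known email and comparing is enough to link two otherwise separate profiles.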

It is worth noting that Stack Overflow is not the only service that is using Gravatar (WordPress, HootSuite, TechDirt, Disqus, just to name a few others). Stack Overflow simply stood out because it is a developer resource and it surprised me that this slipped through the cracks.

There is not much Stack Overflow can do about it today, because many copies of its website are already floating around the Internet. However, it would be best to stop relying on Gravatar for new users joining the system.