Last week I was contacted by someone alerting me to the presence of a spam list. A big one. That's a bit of a relative term though because whilst I've loaded "big" spam lists into Have I been pwned (HIBP) before, the largest to date has been a mere 393m records and belonged to River City Media. The one I'm writing about today is 711m records which makes it the largest single set of data I've ever loaded into HIBP. Just for a sense of scale, that's almost one address for every single man, woman and child in all of Europe. This blog posts explains everything I know about it.

Firstly, the guy who contacted me is Benkow moʞuƎq and he's done some really interesting malware and spambot analysis in the past. During our communication over the last week, I had a read of his piece on Spambot safari #2 - Online Mail System which is a good example of the sort of work he's been doing (it's also a good example of how dodgy some of this spammer code is!) He went on to explain how he'd located a machine used by the "Onliner Spambot" and pointed me to a path on an IP address with directory listing enabled:

I've obfuscated a bunch of info here because as of the time of writing, the server is still up and I don't want to give away any information that could result in the data being spread further. The IP address is actually based in the Netherlands and Benkow and I have been in touch with a trusted source there who's communicating with law enforcement in an attempt to get it shut down ASAP. Until that time, I'm not going to share file names in their entirety although I'll certainly describe anything of relevance in them.

Before I dive into the data, Benkow has posted a dedicated piece on the mechanics of this spambot that's worth a read. You can also find a great story on ZDNet from Zack Whittaker which is a good overview of the situation. The gap I want to fill here is to explain what I can about the data because there'll be a very large number of people finding themselves on HIBP and wondering what an earth is going on. If you haven't already read Benkow's piece, there's 2 important classes of data you need to understand:

Email addresses. That's it - just masses and masses of email addresses used to deliver spam to. In some cases, a single file may contain tens or even hundreds of millions of addresses. Email addresses and passwords. Benkow explains that these are used in an attempt to abuse the owners' SMTP server in order to deliver spam. I also believe that many of these may simply be aggregations from various other breach sources I'll talk about a little later on.

Getting on to the data itself, the first place to start is with an uncomfortable truth: my email address is in there. Twice:

That first file is the 14GB one from the earlier directly listing whilst the second is 131MB. In many cases, I found the same data in both the former larger file and a subsequent smaller one. Interestingly, as you can see from the suffix above, both refer to "UK" (I'm certainly not from the United Kingdom) whilst others refer to "AU" (although I'm not in there). There are no other 2 letter country codes represented in the file names but clearly when we're talking many hundreds of millions of addresses here, a heap of them are from other locations so take those suffixes with a grain of salt.

One of the files with the "NewFile_" prefix contained over 43k rows associated with the Roads and Maritime Services of my neighbouring state here in Australia:

Every row contains RMSETollDontReply@rms.nsw.gov.au in quotes followed by "support@" and then predominantly .com.au domains, albeit with over 13k .ru domains. This email address is used to send notifications relating to the "E-Tag" device installed on your car windscreen so that you can pay tolls. I know this because I've received a bunch of them in the past:

I'll take a stab at it and say that there's not many legitimate drivers using the New South Wales toll road system with Russian email addresses! Clearly, the constant alias on every one of these accounts is auto-generated. Interestingly, I saw a similar pattern with the B2B USA Businesses spam list I loaded last month with many comments like this:

I received a domain alert on this one. Went through the process, turned out to be an invented address (sales@domain). Address doesn't exist. — Peter Bance (@peterbance) July 19, 2017

There's also some pretty poorly parsed data in there which I suspect may have been scraped off the web. For example, Employees-bringing-in-their-own-electrical-appliances.htmlmark.cornish@bowelcanceruk.org.uk appears twice:

The first file is the same one my own email address was in and the second is the same file name structure albeit with a different number in it. And if you're wondering why I've publicly listed someone else's address, it's because it's already publicly listed:

But of course, the data in the dump has a bunch of junk prefixed to the address, junk which appears to be an HTML file name and may indicate the "address" was scraped off the web and the parsing simply wasn't done very well. The point here is that there's going to be a bunch of addresses here that simply aren't very well-formed so whilst the "711 million" headline is technically accurate, the number of real humans in the data is going to be somewhat less.

And then we get into passwords. One file is named numerically and contains 1.2m rows like this:

A random selection of a dozen different email addresses checked against HIBP showed that every single one of them was in the LinkedIn data breach. Now this is interesting because assuming that's the source, all those passwords were exposed as SHA1 hashes (no salt) so it's quite possible these are just a small sample of the 164m addresses that were in there and had readily crackable passwords.

A similar file (with a similar naming structure) contains 4.2m email address and password pairs, this time with every single account having a hit on the massive Exploit.In combo list. This should give you an appreciation of how our data is redistributed over and over again once it's out there in the public domain.

Other files have equal or even greater numbers; one has 29m rows of email address and password pairs, the former of which consistently shows up in HIBP, albeit without an exclusive data breach pattern per the previous two examples. Another is named in a fashion that suggests it's a large Aussie set of addresses and true to its name, there's 12.5m rows in there which would mean roughly one per every 2 people in the country. A large portion of those don't show up in HIBP at all, including the 379k .gov.au ones which appear to be a mixture of legitimate, fake and malformed addresses.

Yet another file contains over 3k records with email, password, SMTP server and port (both 25 and 587 are common SMTP ports):

This immediately illustrates the value of the data: thousands of valid SMTP accounts give the spammer a nice range of mail servers to send their messages from. There are many files like this too; another one contained 142k email addresses, passwords, SMTP servers and ports.

Some of the data was quite a jumble of info, for example the file with 20m rows of Russian email addresses surrounded by what appears to be file names containing SQL:

And it goes on and on. Email addresses, passwords and SMTP servers and ports spread across tens of gigabytes of files. It took HIBP 110 data breaches over a period of 2 and a half years to accumulate 711m addresses and here we go, in one fell swoop, with that many concentrated in a single location. It's a mind-boggling amount of data.

The above examples are by no means exhaustive, rather they're intended to illustrate just how diverse the data is. It helps explain both the massive number of records and the inevitable responses I'll get of "there are addresses in there which aren't real". It also illustrates how broad the sources of data inevitably are; finding yourself in this data set unfortunately doesn't give you much insight into where your email address was obtained from nor what you can actually do about it. I have no idea how this service got mine, but even for me with all the data I see doing what I do, there was still a moment where I went "ah, this helps explain all the spam I get". And that's the unfortunate reality for all of us: our email addresses are a simple commodity that's shared and traded with reckless abandon, used by unscrupulous parties to bombard us with everything from Viagra offers to promises of Nigerian prince wealth. That, unfortunately, is life on the web today.

All 711 million records are now searchable in HIBP.

Edit 1: For the folks asking for passwords (or if one was present for the account), please read Here are all the reasons I don't make passwords available via Have I been pwned.

Edit 2: For those asking for an indication of whether a password exists for an email address or not, there are several major challenges with this:

The source data is in such a mess the parsing would be a nightmare. At present, I just point a regex at it and suck the email addresses out. If I could readily pull all the passwords and indicate whether it existed or not, I'd still be bombarded with requests for "can I please have the password" which brings us back to the previous edit. Looking at this more generally, data breaches frequently have a broad range of attributes (the "Compromised data" section on each breach listed on the Who's been pwned page. Indicating which attribute was present on each account would be extremely time consuming. I simply don't have a construct for this in HIBP at present. This would be a ground-up new feature.

For this particular incident, if you're creating strong, unique passwords on each service (get a password manager if you don't have one already) and using multi-step verification wherever possible, I wouldn't be at all worried. If you're not, now's a great time to start!

Edit 3: Regarding those asking "what do I do now", you'll fall into one of two camps:

If you're using a password manager and creating genuinely strong, unique passwords everywhere, I wouldn't be in the least bit worried. The sources of this list are from various legacy incidents, for example LinkedIn which hashed passwords. Yes, it was only SHA1 without a salt, your 20+ char random password still isn't getting cracked. Every password I saw in the data set was weak; I didn't see any evidence to suggest there were plain text data sources. If you're not already using a password manager, start now. Now! You've already had weak passwords exposed (which they almost certainly are if you're not using a password manager), and they've been out there for some time. Go and grab 1Password and spend a couple of hours learning to use it and changing the passwords of your most important accounts now and everything else as time permits.

Further to that, regardless of which camp you're in, enable multi-step verification on everything. Even if your password is exposed (either because it's weak or was stored by the site in plain text), this renders the credentials alone absolutely useless.

None of this is hard, it requires just a little bit of effort and makes a fundamental difference to your security profile.