The tech news has recently seen quite a lot of chatter about an alleged haul of Apple credentials, apparently about 250 million of them in all. Allegedly. Maybe. Or was it 300 million? No - wait - it might have only been 200 million. The number itself has been the source of plenty of debate even among the members of the Turkish Crime Family (TCF) themselves. Now who really knows if they're Turkish or a family; the only part of the name we can get any consensus on is the "crime" component, courtesy of them attempting to extort Apple as they threaten to delete account contents and remote-wipe Apple devices if payment isn't forthcoming by... today. The 7th of April. Which, of course, it won't be, but it all begs the question - what data do they actually have?

Zack Whittaker who wrote the first story I mentioned above has been following this incident pretty closely. He managed to get his hands on a sample of the alleged haul of data and sent it through to me for further analysis. By running Have I been pwned (HIBP) and having 2.6 billion accounts from various data breaches to refer to, I've got a great data set with which to reference incidents like this. I want to walk you through what I've found and ultimately how I've identified where the vast majority of accounts have come from.

The data Zack received looks like this:

[redacted]@icloud.com;Kid123
[redacted]@icloud.com;hihihi
[redacted]@icloud.com;rita00
[redacted]@mac.com;Jungheepak
[redacted]@icloud.com;215100

There are 69,355 email addresses and they're almost entirely spread across 4 domains:

mac.com: 42,051
me.com: 24,459
icloud.com: 2,720
me.com.au: 45

That's 99.88% of the addresses on those 4 Apple domains. Throughout this analysis, you'll see figures that are very close to absolutes, but just not quite "perfect". Ultimately, I'm going to show how this list was cobbled together so keep in mind that we're talking about humans (likely children or very young adults), doing fallible human things.
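The domain breakdown is easy to reproduce on any file in the same `email;password` shape. What follows is only a minimal sketch: the function name `domain_counts` and the sample records are made up for illustration, and the original counting may well have been done with a different tool. The last two lines simply cross-check the 99.88% figure from the counts quoted above.

```python
from collections import Counter

def domain_counts(lines):
    """Count email domains in 'email;password' records (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        email, _, _password = line.partition(";")
        # Everything after the last "@" is treated as the domain.
        _, _, domain = email.rpartition("@")
        counts[domain.lower()] += 1
    return counts

# Made-up records in the same shape as the sample, not real data.
sample = [
    "user1@icloud.com;Kid123",
    "user2@mac.com;hihihi",
    "user3@me.com;rita00",
    "user4@example.org;215100",
]
counts = domain_counts(sample)
apple_domains = {"mac.com", "me.com", "icloud.com", "me.com.au"}
apple_share = sum(counts[d] for d in apple_domains) / sum(counts.values())
print(counts.most_common())
print(f"{apple_share:.2%} of the toy records are on the 4 Apple domains")

# Cross-check of the figures quoted in the post.
reported = {"mac.com": 42_051, "me.com": 24_459, "icloud.com": 2_720, "me.com.au": 45}
print(f"{sum(reported.values()) / 69_355:.2%}")
```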

Here's the top 10 passwords:

Football95: 2,345
disney1: 1,270
(blank): 841
123456: 773
111111: 705
dthomas: 414
Password: 344
Cannon: 200
conrad76: 176
duece2: 168

There's the usual selection of poor choices people make here in terms of simplicity and predictability, but those first 2 in particular are way too specific to be from a generic collection of data. It suggests bias in either the sample set or the way the data was extracted from it, that is there's a factor that's skewed the results towards these two.

So far, this is all just pretty generic analysis of the data as it was provided to me. I really wanted to see how it stacked up against what's in HIBP because that's where I can actually add a unique perspective. So I took the 69,355 records but before going any further, grabbed a distinct list of email addresses which turned out to only total 52,742. In other words, 24% of the records contain email addresses that appear more than once. (Also, keep that in mind when hearing figures of "hundreds of millions of records" - how many people does that actually mean?) I took this unique set and fired it all into HIBP which returned results like this:

You're looking at the tail end of the analysis here and for each email address, I'm showing which breaches it appeared in. There were two things that really stood out to me when I first ran this:

Firstly, there's a very high hit rate here. Just eyeballing it I could see that almost every account in the Apple data I checked had been breached before (I'm going to show how this isn't actually from Apple, but I'll call it that for simplicity's sake.) This is really unusual because if you read back through the HIBP Twitter feed, you'll see there's usually around half of the accounts in any given data set that are already in the system. In this case, 51,707 of the unique 52,742 email addresses had a hit. More than 98% of the email addresses had already appeared in data breaches loaded into HIBP. This says to me that they were almost certainly sourced from existing breaches, which brings me to the second observation:

There's a lot of Evony entries in the screen above. It appears on almost every row and Evony is a breach with "only" 29 million accounts. That pales in comparison to the likes of Dropbox (68 million), LinkedIn (164 million) and MySpace (359 million). It also made me curious - what was the spread of accounts in data breaches? So I pulled the stats:

Evony: 40,751
MySpace: 17,635
Lastfm: 11,137
Adobe: 9,629
LinkedIn: 8,773
RiverCityMedia: 8,416
Dropbox: 7,397
Tumblr: 4,319
Fling: 1,498
AshleyMadison: 1,362

Look at how high Evony is here - more than 77% of the email addresses in the TCF data are also in the Evony breach. That's more than double MySpace even though the social media platform had 12 times as many accounts exposed in their breach. Now keep in mind that the numbers above add up to way more than the number in the Apple data as many accounts appear in multiple breaches. As the earlier image shows, a bunch of these accounts have been pwned over and over again; the second one from the top was in Evony, Modern Business Solutions and River City Media. But Evony remains way over-represented which is also enormously coincidental in terms of the timing:

I was sent the Evony data right around the time the Apple ransom came out. As data starts circulating, it's rapidly picked up by nefarious parties and used for, well, nefarious reasons and that appears to be precisely what happened here. I decided to pull the original Evony data I was sent back out of my local archives and take a closer look. Here's what it looks like:

The first file contains the usernames, email and IP address and an unsalted MD5 hash of the password. The second file contains cracked passwords alongside email addresses:

Note the dates on the files too - the first one is mere weeks before the TCF shenanigans began and the second one looks like it was prepared around the middle of August last year which is consistent with reports. In fact, cracking data breach passwords and selling them to anyone who would pay was what the now defunct LeakedSource was infamous for, and indeed they appear to have been the source of this incident coming to light according to the article (this practice also inevitably contributed to why it's now defunct!)

Having the Evony data from the breach alongside the Apple data allowed me to work out how many of the email address and password pairs in the latter were sourced from the former. I started out with a sanity check of the total intersection based on email address alone:

That's a total of 40,751 matches which aligns with the check I did against the live HIBP system. What's really important is when the match is done on password as well:

99.9% of the accounts that are in both the Evony breach and the Apple data share the same password. This is way too high to represent password reuse from across systems so in other words, the folks that cobbled this together constructed a significant portion of it from the Evony data. Clearly, they were leaning very heavily on this particular breach in order to construct the Apple list and that now poses a very interesting question - how many of the alleged millions of accounts actually are there? I mean they've said "Hey, we've got this massive list we're going to reset, here's a sample to prove we're serious" and they sent 69k rows to Zack, can we look at this data and draw some conclusions about the 200 or 250 or 300 million - or whatever hundreds of millions claim? Is the data anywhere near that size?
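The two checks (email alone, then email plus password) can be sketched in a few lines. This is an illustration under assumptions rather than the queries actually used: it assumes the breach side is a mapping of email address to an unsalted MD5 hex digest of the UTF-8 password, and the names `md5_hex` and `match_against_breach` and all of the records are hypothetical.

```python
import hashlib

def md5_hex(password: str) -> str:
    """Unsalted MD5 of the UTF-8 password, as a lowercase hex string."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()

def match_against_breach(sample_pairs, breach_hashes):
    """Return (email_matches, pair_matches) for a sample against one breach.

    sample_pairs:  iterable of (email, plaintext_password), assumed distinct
    breach_hashes: dict of lowercase email -> MD5 hex digest from the breach
    """
    email_matches = 0
    pair_matches = 0
    for email, password in sample_pairs:
        key = email.lower()  # assumption: addresses compare case-insensitively
        if key not in breach_hashes:
            continue
        email_matches += 1
        if md5_hex(password) == breach_hashes[key].lower():
            pair_matches += 1
    return email_matches, pair_matches

# Made-up data: three addresses overlap, two of them with the same password.
breach_hashes = {
    "user1@mac.com": md5_hex("sunshine1"),
    "user2@me.com": md5_hex("tiger99"),
    "user3@icloud.com": md5_hex("something-else"),
}
sample_pairs = [
    ("user1@mac.com", "sunshine1"),
    ("User2@me.com", "tiger99"),
    ("user3@icloud.com", "hihihi"),
    ("user4@mac.com", "123456"),
]
emails, pairs = match_against_breach(sample_pairs, breach_hashes)
print(f"{emails} email matches, {pairs} with the same password ({pairs / emails:.1%})")
```

A ratio near 100% on the second number is what points to copying rather than to people independently reusing passwords.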

Let's look at it like this: if they built up a big list from various data breaches (of which Evony is obviously the predominant one) and then took a small slice of those hundreds of millions of accounts to send to Zack as a sample, then the Apple addresses in that sample would represent a small part of those in the overall Evony data breach too. This is a theory that's easy to test via one simple query:

Think about what this means: irrespective of the list TCF built up, there are 40,866 Apple email addresses in the Evony breach belonging to those 4 popular domains. Yet somehow - miraculously - the TCF "sample" included 99.6% of all Apple accounts in the Evony breach they relied so heavily on. The chances of them grabbing something like 0.02% of their alleged hundreds of millions haul, sending them to Zack as a "sample" and that amazingly representing almost every single last Apple address in the Evony data breach is unfathomably slim.
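To put a rough number on "unfathomably slim", here is a back-of-the-envelope sketch. It makes two generous assumptions that are not claims from the data: that every one of the 40,866 Apple addresses in Evony really is in the alleged full list, and that the 69,355 row sample was drawn uniformly at random from it. The function name `expected_overlap` is hypothetical.

```python
def expected_overlap(breach_apple_addresses, sample_size, claimed_total):
    """Expected count of breach addresses in a uniform random sample of the claimed list."""
    return breach_apple_addresses * sample_size / claimed_total

EVONY_APPLE_ADDRESSES = 40_866   # Apple-domain addresses in the Evony breach
SAMPLE_ROWS = 69_355             # rows sent as the "sample"
OBSERVED_MATCHES = 40_751        # sample addresses actually found in Evony

for claimed in (200_000_000, 250_000_000, 300_000_000):
    expected = expected_overlap(EVONY_APPLE_ADDRESSES, SAMPLE_ROWS, claimed)
    print(
        f"claimed {claimed:,}: sample is {SAMPLE_ROWS / claimed:.3%} of the list, "
        f"expected about {expected:.0f} Evony matches, observed {OBSERVED_MATCHES:,}"
    )
```

Under those assumptions a random sample would be expected to contain somewhere around ten to fifteen of the Evony Apple addresses, not tens of thousands.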

We've already established that more than 77% of the unique email addresses in the Apple data came from Evony so that clearly accounts for the lion's share of it, but what about the last 22.x%? When I looked at the top 10 sites the Apple data was found in via the HIBP queries, the one that stood out the most after Evony was Last.FM. Whilst it had fewer matches than MySpace (11,137 versus 17,635), it had a significantly higher proportion of accounts based on the size of the breach which was 37 million records, a mere tenth of the size of MySpace's. So I pulled the source data for that breach and dug deeper.
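One way to make "proportion based on the size of the breach" concrete is to normalise the hit counts by breach size. The sketch below uses only the hit counts and approximate breach sizes quoted in this post; the metric itself (hits per million breached accounts) and the helper name `hits_per_million` are my own choices for illustration.

```python
# name -> (hits in the TCF sample, approximate number of accounts in the breach)
breaches = {
    "Evony": (40_751, 29_000_000),
    "MySpace": (17_635, 359_000_000),
    "Lastfm": (11_137, 37_000_000),
    "LinkedIn": (8_773, 164_000_000),
    "Dropbox": (7_397, 68_000_000),
}

def hits_per_million(hits, breach_size):
    """Sample hits per million accounts in the breach."""
    return hits / (breach_size / 1_000_000)

# Rank breaches by how over-represented they are relative to their size.
ranked = sorted(breaches.items(), key=lambda kv: hits_per_million(*kv[1]), reverse=True)
for name, (hits, size) in ranked:
    print(f"{name:<9}{hits_per_million(hits, size):8.1f} hits per million breached accounts")
```

On this measure Evony sits far ahead of everything else and Last.FM comes second, well ahead of the much larger MySpace breach.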

I grabbed a list of all Apple email addresses that I hadn't already matched via the Evony data (that is both the email and password matched) and found 12,044 distinct ones left. Of those, 9,008 were also in the Last.FM breach so I joined across the data and dumped out all the MD5 hashes from that breach and all the plain text passwords from the Apple data. This meant I had one big hash list from Last.FM and one big plain text password list from Apple and what I was curious about was how many of these matched. This would tell me how much of the remaining Apple data was potentially cobbled together by cracking Last.FM password hashes.
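The idea of using one data set's plain text passwords as a word list against another's hashes can be sketched as follows. This swaps in Python's `hashlib` for a dedicated cracking tool and assumes the hashes are plain unsalted MD5 of the UTF-8 password; the function names and all of the records are hypothetical.

```python
import hashlib

def md5_hex(password: str) -> str:
    return hashlib.md5(password.encode("utf-8")).hexdigest()

def crack_with_wordlist(target_hashes, wordlist):
    """Return {hash: plaintext} for each target hash produced by a word in the list."""
    targets = {h.lower() for h in target_hashes}
    cracked = {}
    for word in set(wordlist):
        digest = md5_hex(word)
        if digest in targets:
            cracked[digest] = word
    return cracked

def accounts_matching(sample, breach, cracked):
    """Emails whose sample password is the cracked plain text of their own breach hash."""
    return sorted(
        email
        for email, password in sample.items()
        if email in breach and cracked.get(breach[email].lower()) == password
    )

# Made-up data: email -> plain text (sample) and email -> MD5 hash (breach).
sample = {
    "user1@mac.com": "sunshine1",
    "user2@me.com": "sunshine1",
    "user3@icloud.com": "tiger99",
    "user4@mac.com": "qwerty",
}
breach = {
    "user1@mac.com": md5_hex("sunshine1"),
    "user2@me.com": md5_hex("sunshine1"),
    "user3@icloud.com": md5_hex("different"),
    "user4@mac.com": md5_hex("tiger99"),
}

cracked = crack_with_wordlist(breach.values(), sample.values())
matched = accounts_matching(sample, breach, cracked)
print(f"{len(cracked)} hashes cracked, {len(matched)} accounts match on email and password")
```

Two things fall out of the toy data: one cracked hash can cover several accounts, and a cracked hash only counts towards an account if the plain text matches that same account's password in the sample.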

Those 9,008 common accounts had 8,565 unique plain text passwords from Apple and 8,776 MD5 hashes from Last.FM. The former gave me a word list with which to attempt cracking the latter and the result would illustrate how many of the 9,008 accounts were potentially sourced from this breach. Here's what I found:

Here we're seeing hashcat cracking a third of all the Last.FM hashes using the passwords in the Apple data. But it's more than the 2,893 records shown here because multiple accounts had the same password and once expanded out, we get a lot more. A further 3,243 email addresses not already found in the Evony data matched perfectly to accounts in the Last.FM breach. So between that and Evony, we're up to about 44k of about 53k accounts now accounted for or in other words, we've clearly identified the likely source of 83% of the records. I could keep going - I could load the MySpace breach and the LinkedIn breach and keep cracking hashes and filling in gaps, but the source of the data was now abundantly clear. Let's apply Occam's Razor to this and I'll draw the most obvious conclusion possible from the whole thing:

The list of Apple accounts is not hundreds of millions, it is instead less than 53k and it's comprised predominantly of accounts from the Evony data breach and a small handful of others.
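The tally behind the "about 44k of about 53k" figure can be rebuilt from numbers already quoted in this post. One assumption is baked in and worth flagging: that the 12,044 addresses left after the Evony match are counted against the 52,742 distinct addresses, so the difference is the number matched on both email and password.

```python
UNIQUE_ADDRESSES = 52_742        # distinct email addresses in the sample
UNMATCHED_AFTER_EVONY = 12_044   # distinct addresses not matched on email + password
EVONY_EMAIL_MATCHES = 40_751     # addresses found in Evony by email alone
LASTFM_PAIR_MATCHES = 3_243      # further addresses matched via Last.FM

# Assumption: matched-on-pair count is simply the complement of the unmatched count.
evony_pair_matches = UNIQUE_ADDRESSES - UNMATCHED_AFTER_EVONY
attributed = evony_pair_matches + LASTFM_PAIR_MATCHES

print(f"Evony email + password matches: {evony_pair_matches:,}")
print(f"share of Evony email matches with the same password: "
      f"{evony_pair_matches / EVONY_EMAIL_MATCHES:.1%}")
print(f"attributed to Evony or Last.FM: {attributed:,} of {UNIQUE_ADDRESSES:,} "
      f"({attributed / UNIQUE_ADDRESSES:.1%})")
```

Under that assumption the numbers line up with the 99.9% and 83% figures quoted earlier.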

Now, that's not to say there's no risk at hand here, but rather that the risk is no different to the one we're faced with after every data breach: a bunch of people have reused their passwords and they're now going to have other accounts pwned as a result. But that's a very different story to the headlines of "hundreds of millions of Apple accounts will be reset and iPhones wiped". It's nowhere near as bad as 53k either because a significant chunk of those people won't have reused their passwords. Of those that have, many may no longer even be valid for Apple services and indeed Zack found that when he reached out to people listed in the sample data. But here's something even more significant - Apple has the sample set I've been analysing which puts them well and truly one step in front of TCF. That doesn't necessarily mean they're going to lock accounts out or force password resets, but it does mean they can associate a much higher risk rating with these accounts and protect them in other ways. Plus of course there's a small portion of those who will have multi-factor authentication enabled so even a correct password will be useless. Think of all these factors as a funnel which gradually decreases the usefulness of the accounts such that only a tiny fraction of the alleged haul is actually of any use whatsoever:

The conclusions above are all pretty reasonable based on the evidence at hand. There are probably some nuances I've missed (I usually get some pretty insightful comments on these sorts of posts so I'm looking forward to those), but for the most part I think we can all be reasonably sure about what's gone on here. With that now understood, let's turn our attention to TCF for a moment and I'll start with Zack's commentary here:

Immature and naive "hacker" thinks wiping over 200 million iPhones will get him rich, and not bludgeoned to death by an angry mob. How cute. pic.twitter.com/lkysePbuFb — Zack Whittaker (@zackwhittaker) March 27, 2017

Zack has been pretty blunt here but it's hard to argue with his logic. These are very likely kids, either in the legal sense or in the "they're a lot younger than most of us" one and Zack is right - they're painting a big target on their backs. Of course they're full of bravado and frankly, are probably a bit excited about how much attention they've received, but this is now starting to play out in a very familiar way. If they keep going, it will also have a very familiar ending.

It's familiar in the same way it was for kids like Jake Davis of LulzSec fame. Like TCF, he felt beyond reach as he went about wreaking havoc on the internet until his inevitable capture. In that link you see him turning up at court with his mum and judging by the look on her face, she wasn't very amused by the whole thing. (Side note: Jake has gone on to do fantastic things and is a great example of how these kids can later turn their lives around.) Same again for Alex Davros who was running LeakedSource; seemed like a great idea at the time as he revelled in internet anonymity, not such a good idea now. Similar story again for the guy behind the TalkTalk attack, just that we have no idea who he is; when you're 17 there are child protection acts in place because hey, kids do stupid things and need saving from themselves.

And their behaviour is generally pretty consistent with that demographic too. At one stage an old pic of me appeared as their profile photo:

It later appeared in a tweet which was subsequently deleted. It doesn't bother me in the least but again, it gives you a pretty good idea of the sophistication of the "threat actor" involved here.

There's more oddness in the Twitter profile itself, namely with the follower count:

At first glance, that's a good effort for an account only set up last month! Except, of course, they're mostly fake:

There's obviously a lot more accounts following now than just a couple of weeks ago, but they're clearly mostly fake. Whether this is intentional to artificially inflate their perceived sense of significance or it's someone else altogether messing with their reputation is not clear, but it's still part of the broader set of observations around consistency, trustworthiness and the general likelihood that they can actually deliver on their threats.

The chance of anything of significance happening to Apple accounts today is near zero. Not just because of the observations above but due to the simple fact that the locations they've harvested the data from have well and truly already been pilfered by others attempting to do the same thing. This is classic credential stuffing and services like Apple's are amongst the biggest targets after breaches such as Evony. The only people likely to be adversely impacted by this are those who chose poor passwords that were readily cracked, reused them across services, had them exposed in (probably) the Evony data breach, don't have multi-factor auth turned on at Apple, failed to change them after all the news about this and finally, were not protected by Apple come the deadline. In other words, innocent people who made a series of very bad security choices.

As for TCF, they're at a bit of a crossroads as of now. On the one hand, they've inevitably got some small number of accounts that actually work and they may even be able to wipe some data from a number of them. Thing is though, it almost certainly won't be any more than a tiny fraction of a percent which won't do great things for their credibility. Yet even if it is a small number, once you start destroying other people's things the law tends to take you a bit more seriously and the inevitable conclusion of that isn't pretty. On the other hand, now seems like a good time to get out while the getting is good. They had some fun, got some good press and to their credit, certainly put the issue of password reuse and account security back in the headlines. Apple obviously isn't giving them any dough and they'll have realised that by now, so it seems like a really good time to let this all drift off into data breach history...

Edit: Shortly after the deadline passed, TCF claimed to have received payment:

Hello everybody, look what we have here https://t.co/I3B0wh1Udv — Turkish Crime Family (@turkcrimefamily) April 7, 2017

However, it had absolutely nothing to do with Apple, ransoms or indeed with TCF themselves: