This week, I started looking into a large database backup file which turned out to contain the personal data of a significant portion of the South African population. It's an explosive situation with potentially severe ramifications and I've been bombarded by questions about it over the last 48 hours. This post explains everything I know.

Who Am I and Why Do I Have This Data?

Some background context is important as I appreciate there are a lot of folks out there who haven't heard of me or what I do. I'm an independent Australian (I have a Microsoft Regional Director title but RDs don't actually work for Microsoft) and I specialise in security training for folks who build online systems. For the last 4 years, I've also run a free service called Have I Been Pwned (HIBP) which aggregates data breaches and presently contains about 4.8 billion records from these incidents. In simple terms, this means that when there's a hack of a service like Dropbox, LinkedIn or MySpace and the data is published online (as each of those was last year), supporters of HIBP frequently send that data to me so that I can help people impacted by the incident learn of their exposure. People either search by email address on the website or I automatically notify subscribers. About 1.7 million people presently subscribe to those notifications and I've had up to 3 million people visit the site in a single day after a major data breach.

On March 14 this year, someone sent me a 27GB file called "masterdeeds.sql" which was a MySQL database backup file. There was nothing immediately remarkable about it; there was no clear indication of a source (many similar examples include the source website in the file name) and there were "only" 2.2 million email addresses in the file (I was dealing with breaches containing tens or even hundreds of millions of records at the time). It went into an archive folder with literally hundreds of other similar files which, time permitting, I'd come back to and review later.

Fast forward to this month and I'm running out of space on the disk holding the breaches I'm yet to process. I start working through the largest incidents first; one of those is Victory Phones which has since made headlines due to it containing Republican donor records. Another is the masterdeeds.sql file which I begin loading into a local database on my laptop for further analysis. The import runs for several days until eventually last Sunday, I had to get on a plane to head interstate and run some training which meant turning off the machine and ceasing the process. It stopped after importing 31,631,992 records. (You'll read later how the complete size is significantly larger than this.)

Tuesday my time, I had the afternoon free so I sat in my hotel room and started looking closer at the data. It was clear there were a lot of South Africa references in there but just by looking at the data, I still couldn't work out the origin so I tweeted out for some help:

South African followers: I have a very large breach titled "masterdeeds". Names, genders, ethnicities, home ownership; looks gov, ideas? — Troy Hunt (@troyhunt) October 17, 2017

I followed up by sharing the script that creates the database table in the hope that someone would recognise the field names:

Multiple South African Twitter followers then chimed in with thoughts on the origin. Several of them also got in touch with me privately and shared personal information about themselves so that I could verify the accuracy of the data. Searching through the partially imported database, I didn't find everyone who contacted me but for those I did find, the data was always accurate. Realising that the government issued IDs were also present, I began searching the 27GB file directly for the ID rather than the incomplete database. Every search for every person that sent me their number returned a hit.
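For the technically curious, searching a file of that size doesn't require loading it into a database at all. The following is a minimal sketch of the idea rather than the exact tooling I used: it streams the file in chunks and keeps a small overlap so a match spanning two chunks isn't missed. The function name and chunk size are illustrative assumptions.

```python
def dump_contains(path, needle: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if `needle` occurs anywhere in the file at `path`.

    The file is read in fixed-size chunks so memory use stays flat,
    even when a single INSERT statement in the dump is enormous.
    """
    pattern = needle.encode("utf-8")
    if not pattern:
        raise ValueError("needle must not be empty")
    overlap = len(pattern) - 1  # bytes carried over between chunks
    tail = b""
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                return False  # reached end of file without a match
            window = tail + chunk
            if pattern in window:
                return True
            # Keep just enough trailing bytes to catch a match that
            # straddles the boundary with the next chunk.
            tail = window[-overlap:] if overlap else b""
```

Calling `dump_contains("masterdeeds.sql", some_id)` answers the yes/no question for one ID; it makes a full pass over the file per search, which is slow but perfectly adequate for a handful of lookups.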

During this process, I learned that these government issued IDs contain both the owner's date of birth and gender which is usually considered very personal data. This resource on decoding your South African ID number explains it quite clearly:
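To illustrate just how much the number alone gives away, here's a small sketch that pulls the date of birth and gender out of a 13-digit ID. It assumes the commonly documented layout (the first six digits are YYMMDD and the next four are a sequence number where 5000 and above indicates male); the function name is my own and the sample number is made up, not taken from the data.

```python
from datetime import date
from typing import Optional


def decode_sa_id(id_number: str, today: Optional[date] = None) -> dict:
    """Extract date of birth and gender from a 13-digit ID number."""
    if len(id_number) != 13 or not (id_number.isascii() and id_number.isdigit()):
        raise ValueError("expected exactly 13 digits")
    today = today or date.today()
    yy, mm, dd = int(id_number[0:2]), int(id_number[2:4]), int(id_number[4:6])
    # The two-digit year is ambiguous; assume the latest century that
    # does not put the birth year in the future.
    century = 2000 if 2000 + yy <= today.year else 1900
    birth = date(century + yy, mm, dd)  # raises ValueError on an impossible date
    # Assumed layout: digits 7-10 are a sequence number, 5000 and above = male.
    gender = "male" if int(id_number[6:10]) >= 5000 else "female"
    return {"date_of_birth": birth, "gender": gender}


# Made-up number for illustration only.
print(decode_sa_id("8001015009087", today=date(2017, 10, 17)))
```

The century guess is a simplification (someone over 100 would be decoded incorrectly), but it's enough to show that anyone holding the ID also holds the owner's birthday and gender.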

I also learned that like social security numbers in the US, the IDs are frequently used for identity verification and should be considered secret. Disclosure en masse like this could have serious ramifications for all sorts of situations where folks in South Africa are required to prove their identity, primarily because it's enormously useful information for people wishing to impersonate others.

Attributing the Source

The morning after my original tweets seeking support, I had a number of emails from Tefo Mohapi of iAfrikan. Tefo had done some great investigative work in an attempt to track down the source of the data which he later covered in two stories. The first was South Africa's Largest Ever Data Breach in which he identified a company named Dracore as a possible source. The Dracore website explains how they offer "data enrichment" services which includes the following:

Our data services are designed to help you access top quality, reliable tracing data – fast. Our database is continually updated 24 hours a day, 7 days a week, 365 days a year.

Dracore themselves then refer to this data as "a goldmine of information":

Which is all beginning to sound analogous to the Master Deeds data we were dealing with. Tefo made multiple attempts to reach out to them which resulted in the following response:

Escalating This Matter To Our Legal Counsel

Now I want to make something clear here: the resulting investigation indicated that whilst the data may have been originally "enriched" by Dracore, another party was subsequently responsible for the leak. However, there is only one acceptable response Dracore could have given at this point and it's "let us do everything we can to get to the bottom of this as a matter of priority". I'm enormously disappointed to see a response like this which puts self-interest in front of the privacy of tens of millions of South Africans.

Shortly after the original piece, Tefo followed up with a story titled Is Dracore Data Sciences Responsible For South Africa's Largest Ever Data Leak? In that piece, he said the following:

Dracore is also known for having a number of clients in the real estate business. This, however, does not necessarily mean they were responsible for the site where the leaked records were found.

Again, I want to be clear about this: whilst it appears the original source of the data was Dracore, it's always been entirely possible that a customer of theirs was responsible for disclosing it. In that post, Tefo identified that customer as Jigsaw Holdings. It's best you read his original article to understand how he joined those dots; I'd prefer to focus purely on the data exposure here.

In fairness to Dracore, I'd also like to share a link to their response.

Where Was the Data Located and When Was It Removed?

During his investigation, Tefo was contacted by an individual going by the name of Flash Gordon on Twitter. It turns out it was this person who originally located the data and I was able to date when I received it by looking back at my DMs with him or her. "Flash" was also able to advise that alarmingly, the data was still publicly exposed 7 months on from when they'd originally located it. Let me talk about that in more detail.

Flash had found the entire 27GB file sitting on a publicly facing web server. It had literally been published there and then the server configured to allow directory browsing. What this meant is that anyone with a web browser could go to that address and see all the files hosted on the site. The Master Deeds file had a "Last modified" date of 8 April 2015; it could have been exposed since that date.

This is really alarming because it means at the absolute least, the data was left open to the public for 7 months. At worst, it was 2.5 years if we go all the way back to the "Last modified" date in early 2015. In fact, it could have been exposed for even longer because that's just the date it was last changed, not when it was created and not when it was necessarily placed on that server.
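If you want to check the arithmetic on those exposure windows, a quick sketch follows. The received date and the "Last modified" date come from this post; the takedown date of 18 October 2017 is my assumption based on the Wednesday confirmation mentioned in this post, and the helper name is purely illustrative.

```python
from datetime import date


def whole_months_between(start: date, end: date) -> int:
    """Count complete calendar months elapsed from start to end."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1  # the final month has not fully elapsed yet
    return months


received = date(2017, 3, 14)       # when the file was sent to me
last_modified = date(2015, 4, 8)   # "Last modified" date on the server
taken_down = date(2017, 10, 18)    # assumed takedown date

print(whole_months_between(received, taken_down))       # 7
print(whole_months_between(last_modified, taken_down))  # 30, i.e. 2.5 years
```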

Tefo did his utmost yesterday to get the data taken offline and eventually, I got confirmation at 10:30 Wednesday morning South Africa time that it was down.

Who Else Has the Data?

I have absolutely no idea how far this has spread. What I can say with confidence though is that people are constantly scanning the web looking for precisely this sort of data. I've been involved with a bunch of similar cases in the past including the Red Cross Blood Service. In fact, I presented at the AusCERT conference earlier this year and shared part of the conversation I had with the individual who found that data (not Flash):

"Just scanning IPs" - it's frequently highly-automated and indiscriminate. It was the same story with Michael Page and the Indian pathology lab to name just a couple of others. These were discovered by individuals simply browsing the web via automated tools.

The logs of the server involved may reveal how many times the data has been requested. That is if they exist and if they go back far enough and even then, at the very least they'll show that unauthorised parties accessed the data. They'll give no indication how much further the data was spread after that.
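As a rough idea of what that log review might look like, here's a sketch that counts successful downloads of a given path per client address. It assumes the server wrote its access log in the widely used Common Log Format, which I have no way of confirming; the path and addresses in the sample are hypothetical placeholders.

```python
import re
from collections import Counter

# Common Log Format: client ident user [timestamp] "METHOD path proto" status size
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)[^"]*" (\d{3}) ')


def downloads_by_client(lines, target_path):
    """Count successful GET requests for target_path, keyed by client address."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        client, method, path, status = match.groups()
        # 200 is a full download, 206 a partial (ranged or resumed) one.
        if method == "GET" and path == target_path and status in {"200", "206"}:
            hits[client] += 1
    return hits


sample = [
    '203.0.113.5 - - [01/Apr/2017:10:00:00 +0200] "GET /backups/example.sql HTTP/1.1" 200 1024',
    '203.0.113.9 - - [02/Apr/2017:11:30:00 +0200] "GET /backups/example.sql HTTP/1.1" 404 0',
    '203.0.113.5 - - [03/Apr/2017:09:15:00 +0200] "GET /backups/example.sql HTTP/1.1" 206 512',
]
print(downloads_by_client(sample, "/backups/example.sql"))
```

Even a clean result from something like this only covers the period the logs were retained for, and it says nothing about where a downloaded copy went next.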

At this time, the only safe assumption is that the owner of the data has lost control of it.

How Do I Know if My Data Was Exposed?

I'll start with the easy bit: I've loaded the 2.2 million unique email addresses in the data set into HIBP. You can search for your email there now and it will give you a yes or no answer as to whether it exists, but obviously the addresses only represent a small portion of the overall data set.

I do not have any plans to make the personal identification numbers searchable. Given the sensitivity of that data, it's not information I want to be responsible for managing on a service like this. However, given the size of the data as compared to the population of South Africa, there's an extremely high likelihood that anyone with an ID is in the data set.

What's the Total Size of The Data?

As I mentioned earlier, I had to stop the original data import at about 31 million rows. For the more technically inclined, the data was being restored to a MySQL database and there were multiple indexes defined in the script which always slows down insert statements. Yesterday, I dropped those indexes and ran the import again. This time it completed in the space of a few hours. This was the result:

My original import of the South African "Master Deeds" data didn't complete. Just ran a complete one: 60,323,827 rows with unique gov IDs. pic.twitter.com/ONlmJP2RtW — Troy Hunt (@troyhunt) October 18, 2017
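The principle behind dropping the indexes is to load the rows first and build the indexes once at the end, rather than updating them on every insert. Here's a minimal sketch of that pattern using Python's built-in SQLite module as a stand-in for MySQL; the table, column and index names are hypothetical and not those in the real dump.

```python
import sqlite3


def bulk_load_then_index(rows):
    """Load rows into an unindexed table, then build the index afterwards."""
    conn = sqlite3.connect(":memory:")
    # No indexes yet, so each insert only has to append a row.
    conn.execute("CREATE TABLE people (id_number TEXT, surname TEXT)")
    with conn:  # commit the whole batch in one transaction
        conn.executemany("INSERT INTO people VALUES (?, ?)", rows)
    # Build the index in a single pass once all the data is in place.
    conn.execute("CREATE INDEX idx_people_id ON people (id_number)")
    return conn


conn = bulk_load_then_index([("0000000000001", "Example"), ("0000000000002", "Sample")])
print(conn.execute("SELECT COUNT(*) FROM people").fetchone()[0])  # 2
```

How much time this saves depends on the engine and the data, so treat it as an illustration of the ordering rather than a benchmark.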

The fact that I originally had only just over half the data loaded helps explain why some records weren't found when I originally queried the restored data but were subsequently found when I searched through the source file. As for that 60 million number, why is it so high? I mean South Africa only had a population of 55 million in 2015, how is the number larger than that? It turns out that the data also contains records where the individual is flagged as "deceased". South Africans living abroad may also account for the high number. The only thing we can confidently conclude is that the data represents a significant portion of the country.

What Now?

There are no easy or happy answers to this. People often ask if it's possible to "cleanse" data like this from the internet to which I usually reply that "trying to do that is like trying to remove piss from a pool".

A question that must be asked is whether South Africa wants private organisations like Dracore (allegedly) collating this much information about its citizens. To the best of my understanding, this wasn't done with consent; people didn't willingly provide their data for "enrichment" purposes. Now maybe that's still a totally legal activity on their behalf, but is it really in the country's best interests for an organisation to collate and then sell data to other parties in this fashion? The potential ramifications are now becoming clear.

Obviously, attribution is going to need to be confirmed at some point too. It's looking likely that Jigsaw was responsible for losing the data but to the best of my knowledge, they're yet to accept responsibility. Mind you, there's not a lot they can do about it at this time other than to help authorities understand the extent to which they may have leaked the data.

In terms of authorities, this raises a difficult question for the government and organisations alike; with this much data about this many people having been exposed for this long, what's the impact on identity verification processes? I mean if people need to provide data such as name, address and government issued ID in order to prove who they are, how does that change when an untold number of people have this information for the entire country? That's what worries me more than anything because for that, there are no easy answers.

I want to provide some further info based on both questions that have come up and new information. I've been interstate all week dealing with this between flights and running security training but am home now and have a bit more time to focus on it.

The first thing is that in one of the images above, I show just over 60 million records impacted. This is shown in MySQL's schema inspector which, as I've subsequently learned, is an estimate only:

SELECT COUNT(*) is exact, the schema inspector provides an estimate only. Estimates can cause problems with optimiser, too. — Chris Thompson (@yegct) October 19, 2017

I've since run more accurate counts on the data and want to share both the queries and results here to ensure there's no ambiguity. Firstly, the total record count is 66,360,837:

The split between living and deceased shows approximately 57 million alive and 9 million passed away:
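For anyone wanting to run the same style of check on their own data, the shape of those two queries is an exact COUNT(*) followed by a GROUP BY on the status flag. The sketch below uses an in-memory SQLite table with hypothetical table and column names and toy rows; it is not the actual schema or the exact queries I ran against MySQL.

```python
import sqlite3


def count_records(rows):
    """Return the exact row count and a per-status breakdown."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (id_number TEXT, status TEXT)")
    conn.executemany("INSERT INTO people VALUES (?, ?)", rows)
    # COUNT(*) scans the rows, so the figure is exact rather than an estimate.
    total = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
    # One row per distinct status value, each with its own exact count.
    by_status = dict(
        conn.execute("SELECT status, COUNT(*) FROM people GROUP BY status")
    )
    conn.close()
    return total, by_status


total, by_status = count_records(
    [("1", "ALIVE"), ("2", "ALIVE"), ("3", "DECEASED")]
)
print(total, by_status)
```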

Just in case you're doing the maths, this totals 4 less than the previous image. There were 4 extraneous records that appear to be data integrity issues at the source as they simply have a name in them. I'm conscious that the number of living records seems very high, so I decided to aggregate the data by age group:

I was pretty stunned to see that - 19% of the records in there are apparently children. That's not including teenagers either and if we add them, that figure jumps to 29%. Another search indicates just how young some of the children are:

Why on earth would you want little kids in this database?! As of today these are 3-year-olds and no, there are no names or other personal data on those records but... why?!
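The age breakdown above boils down to working out each person's age on a reference date and bucketing it. Here's a sketch of that calculation; the cut-offs (under 13 as a child, 13 to 19 as a teenager) are my assumption for illustration and the birth dates are invented.

```python
from collections import Counter
from datetime import date


def age_on(birth: date, when: date) -> int:
    """Age in whole years on the given date."""
    years = when.year - birth.year
    if (when.month, when.day) < (birth.month, birth.day):
        years -= 1  # birthday has not happened yet this year
    return years


def age_group_shares(birth_dates, when: date) -> dict:
    """Return the fraction of records falling into each age group."""
    def group(age: int) -> str:
        if age < 13:
            return "child"
        if age < 20:
            return "teenager"
        return "adult"

    counts = Counter(group(age_on(b, when)) for b in birth_dates)
    total = sum(counts.values())
    return {name: counts[name] / total for name in counts}


births = [date(2014, 6, 1), date(2002, 1, 1), date(1980, 1, 1), date(1950, 5, 5)]
print(age_group_shares(births, date(2017, 10, 20)))
```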

We've also now seen an admission of who owned the server the data was left exposed on and yes, it was Jigsaw. This is an important step and obviously there's a lot of things that will need to play out, but at least the authorities now know which organisation to start with.

Speaking of authorities, the Department of Home Affairs is apparently now investigating as are the Hawks ("Directorate for Priority Crimes Investigations") which are important developments. This incident has to have government involvement because of the significance and impact on South Africans.

And finally, the CEO of Dracore did an interview that's worth listening to below:

It's not a flattering interview but despite that, I'm sympathetic to Chantelle in terms of her having been absolutely blindsided by this. Data breaches always take the organisation involved by surprise and they're almost always ill-equipped to deal with them - they've just simply never expected it to happen. Also keep in mind that whilst it wasn't Dracore's server that the data was exposed on, it seems highly likely that they were the original source of it (or at least a significant portion of it) so I'm also sympathetic in that sense.

But the indignation expressed by the host of that interview is reflective of what I'm hearing from folks in South Africa. The premise of a private company collecting huge troves of data about individuals without their consent then monetising it by selling it to other organisations who may then mishandle it (as is obviously the case), is alarming. Dracore may not be the ones who published the data to the internet, but questions must be asked about whether an organisation like that should have had it in the first place.

Adding a second, longer interview with Chantelle from Dracore which goes into more detail and provides her with a better opportunity to explain how Dracore operates. Whilst I don't agree with her on several points, I concur with her response towards the end regarding the necessity to notify those impacted by this incident. But as the host of the interview says, the logistics of how you do that is quite another thing. (Re the 75m record comment, I haven't seen this figure represented before, certainly all the ones quoted by me are in the post and first update above.)

The bigger issue that goes beyond Dracore and frankly, beyond just South Africa, is non-consensual mass data collection, especially when done for commercial purposes. This incident demonstrates precisely what privacy advocates have been so concerned about and the victims of this are the innocent parties involved. Take a look at how Europe defines and handles personal data (see the "Understanding Personal Data" video) and consider the fundamentally different view of how it can be collected, who owns it and then what can be done with it. There would be a very different discussion happening right now if this incident occurred in the EU.