Earlier this week, I read a really interesting piece on 3 things that need to be done to save the web. The first observation was that "we’ve lost control of our personal data" and the author went on to observe the following:

As our data is then held in proprietary silos, out of sight to us, we lose out on the benefits we could realise if we had direct control over this data and chose when and with whom to share it. What’s more, we often do not have any way of feeding back to companies what data we’d rather not share

Now this wasn't written by just some random bloke on the internet, it was by Tim Berners-Lee, you know, the guy who kinda invented the web as we know it today. He made other very insightful observations as well, but that was the one that really stuck with me when I read it given how much I deal with incidents where precisely that has happened - people have lost control of their data. This week has provided many examples of that, including the story on the smart vibrator (there's two words I never expected to see together!) obtaining very personal data from people. The simple reality today is that our personal data is spread across places well beyond our control, which brings me to the issue at hand.

I was recently sent a large file of data from a source who's been quite reliable in the past. This one was a 52.2GB CSV file containing JSON data, the likes of which you'd see come from a MongoDB, just like CloudPets a couple of weeks ago. It was big - very big - and in total contained 33,698,126 records. I'm going to talk more about the contents in a moment, but the easiest way to put this in context is just to show you an actual record:

{ "netprospex contact id":"177496766", "first name":"Zack", "last name":"Whittaker", "job title":"Writer Editor", "email":"zack.whittaker@cbsinteractive.com", "contact phone 1":"(415) 344-2000", "contact phone 2":"(415) 344-2000", "primary job function":"Marketing", "all job functions":"Creative", "joblevel":"", "company name":"CBS Interactive Inc.", "d-u-n-s":"808539506", "company phone":"(415) 344-2000", "location type":"HQ", "street address":"235 2Nd St", "city":"San Francisco", "state":"CA", "postal code":"94105", "county":"San Francisco", "country":"US", "web address":"http://www.zdnet.com", "revenue":"246860181", "revenuerange":"$100 mil to less than $250 mil", "employees":"600", "employee range":"500 to less than 1,000", "primary industry":"Advertising & Marketing", "all industries":"Advertising &Marketing; Information Collection & Delivery", "primary sic code":"7319", "primary sic description":"Advertising, nec", "company name (us ultimate parent)":"National Amusements, Inc.", "d-u-n-s (us ultimate parent)":"49422439", "street address (us ultimate parent)":"846 University Ave", "city (us ultimate parent)":"Norwood", "state (us ultimate parent)":"MA", "postalcode (us ultimate parent)":"02062", "country (us ultimate parent)":"US", "revenue (us ultimate parent)":"27613349110", "revenue range (us ultimate parent)":"$1 bil and above", "employees (us ultimate parent)":"133269", "employee range (us ultimate parent)":"100,000 and above" }

As you can see, this is Zack Whittaker and Zack is a journo who does a lot of great work for ZDNet (obviously, he also consented to me using his data). I've worked on stories with Zack in the past and finding him here made it a pretty easy decision who to talk further with about the data. But let me give you some first impressions of my own then I'll come back around to Zack.

This data is very corporate. It obviously has a lot of info relating to Zack's employer (CBS Interactive owns ZDNet along with a bunch of other online assets) and for the most part, it looks like fairly openly available info. But a few things were really nagging me about the data:

Firstly, it's perfect. Every name is properly cased, every email address is well-formed and there are none of the tell-tale signs of user-entered data. This didn't come from any sort of mass collection exercise such as buying marketing lists River City Media style, it was almost certainly carefully curated at some central point.
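To make "perfect" a bit more concrete, here's a minimal sketch of the sort of sanity check I mean. It's illustrative only, not the tooling actually used on this data: the function name `signs_of_user_entry` and the deliberately loose email pattern are my own assumptions, and the field names are taken from the record shown above.

```python
import re

# Deliberately loose pattern (an assumption, not a full RFC 5322 check):
# something@something.tld with no whitespace and a single "@".
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def signs_of_user_entry(record):
    """Return a list of reasons a record looks hand-typed rather than curated."""
    problems = []
    for field in ("first name", "last name"):
        value = record.get(field, "")
        # Stray leading or trailing whitespace is typical of form input.
        if value != value.strip():
            problems.append(f"{field} has stray whitespace")
        # All-lowercase or all-uppercase names are another common tell.
        name = value.strip()
        if len(name) > 1 and (name.islower() or name.isupper()):
            problems.append(f"{field} is not properly cased")
    if not EMAIL_RE.match(record.get("email", "")):
        problems.append("email is not well-formed")
    return problems


# The record shown earlier passes cleanly; a made-up messy one does not.
clean = {
    "first name": "Zack",
    "last name": "Whittaker",
    "email": "zack.whittaker@cbsinteractive.com",
}
messy = {"first name": "zack", "last name": "WHITTAKER ", "email": "not-an-email"}
print(signs_of_user_entry(clean))
print(signs_of_user_entry(messy))
```

Run something like that across a typical breached user table and you'd expect plenty of hits; the observation here is that this data set shows none of those tell-tale signs.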

Secondly, the data is 100% US. Every single "country" value is precisely as you see above for Zack. It's from all over the US as you'd expect with a set of records that large; California is the most represented with over 4 million records, then New York state with 2.7 million, Texas with 2.6 million and so on.
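Checks like these are just tallies over a single field. Here's a rough sketch of how you could reproduce them, assuming the file has been reduced to one JSON object per line (that layout is my assumption for illustration, as is the `tally_field` helper name). It streams line by line so a 52GB file never has to fit in memory.

```python
import json
from collections import Counter


def tally_field(lines, field):
    """Count how often each value of `field` appears across JSON lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        counts[record.get(field, "")] += 1
    return counts


# Tiny made-up sample standing in for the real file. With a file on disk
# you would pass the open file object instead, e.g.
#   with open(path, encoding="utf-8") as f: tally_field(f, "state")
sample = [
    '{"state": "CA", "country": "US"}',
    '{"state": "NY", "country": "US"}',
    '{"state": "CA", "country": "US"}',
]
print(tally_field(sample, "country"))
print(tally_field(sample, "state").most_common(2))
```

If the "country" tally comes back with a single key, the data is 100% from one country, and `most_common` on the "state" tally gives the ranking by state.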

Thirdly - and this is really a conclusion from the previous two points - it feels like data that was provided as a commercial feed of US businesses and their employees. This looks precisely like the sort of thing people would pay money for as it's a pretty valuable set of information. Which brings us to NetProspex.

The very first attribute in the JSON above is "netprospex contact id", a unique identifier which appears on every record. A quick Google and we end up on the NetProspex website which is a service provided by Dun & Bradstreet (D&B). It doesn't take long to build a hypothesis about how this data would have been used:

And just in case there was any doubt:

We help marketers develop and manage their B2B data. Our multi-faceted data quality processes — backed by the world's largest commercial database and seamless integration into your marketing systems — enables you to identify the best opportunities, build stronger relationships and accelerate growth for your company.

This was all starting to feel a bit déjà vu and despite the legitimacy of D&B as a commercial entity (they're also publicly listed), it was hard not to draw parallels to the way the River City Media data was intended to be used. But this data could potentially be far more valuable than the RCM data; it was very carefully curated, had very valuable corporate data attributes and whilst much smaller in size, obviously covered a significant portion of corporate America.

I decided to reach out to Zack because as I've written many times before, reporters like him are very adept at getting to the bottom of these issues. They're also well-practiced at getting answers from companies and obviously, D&B would need to be contacted. Zack and I spoke at length, particularly about the potential ramifications of the data. To frame that discussion, let me share the breakdown of results for the top 10 companies in the data set:

DOD Cce : 101,013

United States Postal Service : 88,153

AT&T Inc. : 67,382

Wal-Mart Stores, Inc. : 55,421

CVS Health Corporation : 40,739

The Ohio State University : 38,705

Citigroup Inc. : 35,292

Wells Fargo Bank, National Association : 34,928

Kaiser Foundation Hospitals : 34,805

International Business Machines Corporation : 33,412
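For anyone wanting to slice data like this themselves, a breakdown of this kind can be sketched as a single pass that groups job titles under each company. This is a minimal illustration under my own assumptions rather than the exact process used here: the helper names are hypothetical, the sample records are made up, and only the "company name" and "job title" fields come from the record shown earlier.

```python
from collections import Counter, defaultdict


def titles_by_company(records):
    """Group records by company, counting each job title within it."""
    grouped = defaultdict(Counter)
    for record in records:
        company = record.get("company name", "")
        title = record.get("job title", "")
        grouped[company][title] += 1
    return grouped


def top_companies(grouped, n=10):
    """Return the n companies with the most records, largest first."""
    totals = Counter(
        {company: sum(titles.values()) for company, titles in grouped.items()}
    )
    return totals.most_common(n)


# Made-up sample records for illustration only.
records = [
    {"company name": "DOD Cce", "job title": "Soldier"},
    {"company name": "DOD Cce", "job title": "Soldier"},
    {"company name": "DOD Cce", "job title": "Intelligence Analyst"},
    {"company name": "United States Postal Service", "job title": "Clerk"},
]
grouped = titles_by_company(records)
print(top_companies(grouped, n=2))
print(grouped["DOD Cce"].most_common(1))
```

The same grouped structure answers both questions: which organisations are most represented, and which roles are most common within any one of them.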

The Department of Defence was the most heavily represented and obviously seeing over 100k military personnel piqued our attention. There are over 10k unique job titles in there too, titles such as "Soldier" (which was the most common with 2.7k entries), but also titles like "Ammunition Specialist" (91 people) and "Chemical Engineer" (32) along with the sorts of roles you'd expect in the army such as "Intelligence Analyst" (715) and "Platoon Sergeant" (670). When you look at that list and ask "How would the US military feel about this data - complete with PII and job title - being circulated?", you can't help but feel it poses some serious risks. (The ISIS kill list of last year was one of the first things I thought of.) We've been bombarded by news of state sponsored hacking recently and frankly, if I was a foreign power with a deep interest in infiltrating US military operations, I'd be very interested in a nicely curated list pointing me directly to hundreds of intelligence analysts.

And then you move on to corporate America. Take Wells Fargo, for example: this list makes it very easy to build a comprehensive picture of people and their roles. There's everyone in the C-suite, but that's a pretty openly accessible set of data anyway. So go down a rung and you've got 45 Vice Presidents: Senior Vice Presidents, Assistant Vice Presidents, Executive Vice Presidents, all with names and email addresses alongside job titles. The value for very targeted spear phishing is enormous because you can carefully craft messages that refer to specific individuals of influence and their roles within the organisation. For example, an attacker could send a message on behalf of the "Vice President, Senior Private Banker" (her name is easily discoverable) to an accountant in the firm requesting an urgent transfer. Techniques such as whaling are made all the easier with this data and whilst that was always possible anyway, having so much of it in one place enables the automation of attacks across a broad range of organisations.

Moving on, Zack successfully got in touch with D&B and by all accounts they applied urgency to the issue and promptly investigated further. They confirmed a number of things we already knew; the data in isolation was not considered sensitive, it's mostly contact info and they sell the data to "thousands of customers". They also noted that some customers may then on-sell the data and that there are commercial models which provide access to certain segments of the information depending on what the customer is willing to pay for.

In terms of where this data specifically came from, D&B don't believe it was directly from one of their systems and with thousands of customers purchasing this information, we may well never know who lost it. They did tell Zack they believed the data was about 6 months old though so at least that helps us date the incident somewhat. And there was an incident - someone exposed this data (most likely via an unprotected MongoDB as with CloudPets) and another party downloaded it. Call it a leak or a data breach, but this is valuable information (it is not cheap to purchase!) which was never meant to see the light of day in this way.

They did go on to explain that the collection and sale of the data complied with US law and I have no reason to doubt that, although they also stated that it included "no PII data" which doesn't really reconcile with the definition:

information that can be used on its own or with other information to identify, contact, or locate a single person, or to identify an individual in context

When you have someone's first and last names, their job title and their email address along with the company they work for, you have PII. And that's really what makes this a highly volatile collection of data; this much personal information on this many people and set in the context of their professional roles poses numerous risks to the organisations involved here. I often work with companies attempting to mitigate the damage of their organisational data being publicly exposed (frequently due to data breaches), and I can confidently say that knowing this information is out there circulating would concern many of them.

We've lost control of our personal data and as Berners-Lee said only a few days ago, we often do not have any way of feeding back to companies what data we’d rather not share. Particularly when D&B believe they're operating legally by selling this information, what chance do we have - either as individuals or corporations - of regaining control of data like this? Next to zero and about the only thing you can do right now is assess whether you've been exposed by searching for it in Have I been pwned. You can also read Zack's piece on ZDNet for another perspective on the issue.

Let me finish on a lighter note: There are 3 records for individuals with a first name of "Donald", a last name of "Trump" and a job title of "President". They occupy genuine roles within legitimate businesses and just happen to share these three data points with the 45th bloke at the top. Their industries are "Airlines, Airports & Air Services", "Insurance" and... "Hair Salons".