The UpGuard Cyber Risk Team can now confirm that a cloud storage repository containing information belonging to LocalBlox, a personal and business data search service, was left publicly accessible, exposing 48 million records of detailed personal information on tens of millions of individuals, gathered and scraped from multiple sources.

This data includes names, physical addresses, dates of birth, scraped data from LinkedIn and Facebook, Twitter handles, and more. Ashfaq Rahman, co-founder of LocalBlox, a company that bills itself as the “World's Most Comprehensive Cross Device Identity Graph on Businesses, Consumers and Geo Audiences,” has confirmed to UpGuard that the exposed information belongs to them.

In the wake of the Facebook/Cambridge Analytica debacle, the importance of massive sets of psychographic data is becoming more and more apparent. The exposed LocalBlox dataset combines standard personal information, like name and address, with data about the person’s internet usage, such as their LinkedIn histories and Twitter feeds. This combination begins to build a three-dimensional picture of every individual affected (who they are, what they talk about, what they like, even what they do for a living): in essence, a blueprint from which to create targeted persuasive content, like advertising or political campaigning. If the legitimate uses of the data aren’t enough to give pause, the illegitimate uses range from traditional identity theft, to fraud, to ammunition for social engineering scams such as phishing.

The Discovery

On February 18th, 2018, the UpGuard Cyber Risk Team discovered an Amazon Web Services S3 bucket located at the subdomain “lbdumps” that was configured for access via the internet, with its contents publicly downloadable. The bucket contained one 151.3 GB compressed file, which, when decompressed, revealed a 1.2 TB ndjson (newline-delimited JSON) file. Metadata in a header file pointed to LocalBlox as the owner. After downloading and beginning to analyze this extremely large data file, the UpGuard Cyber Risk Team notified LocalBlox of the exposure on February 28th; the bucket was secured later that day.
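For readers unfamiliar with the format, the following is a minimal sketch of how a newline-delimited JSON file can be processed one record at a time, which is the usual approach when a file is far too large to load into memory. It is an illustration only, not the tooling used in this analysis; the function name and the sample records are hypothetical.

```python
import io
import json
from typing import Iterable, Iterator


def iter_ndjson(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield json.loads(line)


# Hypothetical two-record sample standing in for a very large file.
# Any text-mode file object works here, since files iterate line by line.
sample = io.StringIO('{"name": "A"}\n\n{"name": "B"}\n')
record_count = sum(1 for _ in iter_ndjson(sample))
print(record_count)
```

Because each line is parsed and then discarded, memory use stays roughly constant regardless of how many records the file holds.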

The file name provides some indication of the contents: “final_people_data_2017_5_26_48m.json.” As hinted, the massive file contains 48 million records, each in JSON format and separated by new lines. This master list corroborates information about individuals gathered from a variety of sources. The exposed data is broad, including individuals’ names, physical addresses, dates of birth, scraped LinkedIn job histories, public Facebook data, and Twitter handles. In addition, it appears the prominent real estate site Zillow is used in the process as well, with information from the service's listings being somehow blended into the larger data pool. The database appears to work by tracking an IP address, matching collected data to that IP address when able, and thus providing a clearer image of the behavior and background of the user at that IP address.

The image of the "final_people_data" file in the repository.

Also of interest are exposed source fields, providing some indication of where the scraps of data were collected from. Some are fairly unambiguous, pointing to aggregated content, purchased marketing databases, or even information caches sold by payday loan operators to businesses seeking marketing data. Other fields are more mysterious, such as a source field labeled “ex.”

Included among the data are several Facebook data points, filled in from queries like the one below, which is present in the dataset. In those instances, the <query> and <email> fields were populated with the person's name and email address:

"term":"[name:>http://www.facebook.com/search.php?q=<query>,, email:>http://www.facebook.com/search.php?init=s:email&q=<email>&type=users]

Some of the data points associated with these queries include pictures, skills, lastUpdated, companies, currentJob, familyAdditionalDetails, Favorites, mergedIdentities, and a field labeled allSentences, which includes other text from the search results. That text includes results suggesting that this information was scraped from the Facebook HTML rather than gathered through the API. For example, this text from one record appears to come from the Facebook page footer in 2016:

English (US) , EspaÃ±ol , FranÃ§ais (France) , ä¸.æ–‡(ç®€ä½“) , Ø§Ù„Ø¹Ø±Ø¨ÙŠØ© , PortuguÃªs (Brasil) , Italiano , í•œêµ.ì–´ , Deutsch , à¤¹à¤¿à¤¨à¥.à¤¦à¥€ , æ—¥æœ¬èªž , , ","Sign UpLog InMessengerFacebook LiteMobileFind FriendsPeoplePagesPlacesGamesLocations ","CelebritiesGroupsMomentsInstagramAboutCreate AdCreate PageDevelopersCareersPrivacyCookies ","Ad ChoicesTermsHelpSettingsActivity Log ","Facebook Â© 2016 "
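The garbled language names in that footer text (“EspaÃ±ol,” “FranÃ§ais”) are consistent with UTF-8 text that was at some point decoded using a single-byte Western encoding, a common side effect of scraping HTML without honoring its declared character set. The sketch below shows how that kind of damage can be reversed. It assumes the intermediate encoding was Windows-1252, which the source does not confirm, and the function name is hypothetical.

```python
def fix_mojibake(text: str) -> str:
    """Try to undo UTF-8 text that was mis-decoded as Windows-1252.

    Returns the input unchanged when the round trip is not possible,
    for example when the text was never damaged or bytes were lost.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


repaired = [fix_mojibake(s) for s in ("EspaÃ±ol", "FranÃ§ais (France)", "Deutsch")]
print(repaired)
```

Segments where bytes were dropped along the way cannot be recovered by this approach and are returned as they were.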

This data highlights the ease with which Facebook data can be scraped, and the ubiquity of Facebook information in psychographic datasets. According to their website, “LocalBlox is the First Global Customer Intelligence Platform to search, combine and validate deep business and people profiles – at scale.” The exposed data wasn’t just a customer list, but the very product LocalBlox offers. Their value statements about the power of their data provide some insight into exactly why exposing such data is extremely dangerous. According to the LocalBlox website, “The need for deeper, more accurate data about individual businesses and consumers is becoming more urgent to compete.” This data is valuable because it can be used effectively, and this efficacy can become dangerous if put to malicious use.

The Significance

Social awareness of data exposure and its consequences has grown in parallel with the scope of datasets being aggregated, stored, shipped, and copied by numerous organizations around the world. The LocalBlox dataset, 1.2 terabytes in size, contained 48 million records covering a similar or smaller number of individual people. The presence of scraped data from social media sites like Facebook also highlights an important fact: all too often, data held by widely used websites can be targeted by unknown third parties seeking to monetize this information. In such cases, both a targeted website like Facebook and any affected users are being victimized, as personal information entrusted to the social network is snatched up for the benefit of a platform of which no one is aware.

More importantly, the data gathered on these people connected their identity with their online behaviors and activity, all in the context of targeted marketing, i.e. how best to persuade them. It is exactly this persuasive factor that lies at the heart of discussions about how data is gathered and sold: when aggregated together at scale, your psychographic data can be used to influence you. It is what makes exposures of this nature so dangerous, and also what drives the business model not only of LocalBlox, but of the entire data analytics industry. As it says on the LocalBlox website, the “Data and Analytics Market is Booming,” and this is reflected in the advertising copy the site employs.

The LocalBlox website.

With this kind of business interest in data harvesting, processing, and resale, it should be no wonder that so many massive and intrusive data sets exist in the world, providing companies and political parties with detailed blueprints on how to influence people.

What should be a wonder is that these datasets aren’t better secured and administered. This exposure was not the result of a clever hack or a well-planned scheme, but of a simple misconfiguration of an enterprise asset, an S3 storage bucket, which left the data open to the entire internet. The profit derived from data must come with the responsibility of protecting its integrity and privacy. Cloud storage itself provides functionality and speed at a reasonable cost, but cloud assets require careful configuration: the thin line between private and public can be erased with the flip of a single switch. The lack of controls around common IT processes is what allows critical errors like this to slip into production, eroding the privacy of millions of people.
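As one illustration of the kind of control that can catch such an error before it reaches production, the sketch below scans a bucket policy document for statements that allow access to any principal without a condition. It is a simplified example under stated assumptions: the source does not say whether this bucket was exposed through a policy or an access control list, the function name and the bucket name “example-bucket” are hypothetical, and a real review would also need to cover ACLs and account-level settings.

```python
import json


def public_statements(policy_json: str) -> list[dict]:
    """Return Allow statements that apply to every principal unconditionally."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be unwrapped
        statements = [statements]

    flagged = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal")
        aws = principal.get("AWS") if isinstance(principal, dict) else None
        is_wildcard = (
            principal == "*"
            or aws == "*"
            or (isinstance(aws, list) and "*" in aws)
        )
        if is_wildcard and "Condition" not in stmt:
            flagged.append(stmt)
    return flagged


# Hypothetical policy that makes every object in a bucket world-readable.
open_policy = json.dumps({
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::example-bucket/*",
    }],
})
findings = public_statements(open_policy)
print(len(findings))
```

A check like this could run as part of a deployment review, so that a policy granting world-readable access has to be approved deliberately rather than slipping through unnoticed.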