Nearly two years on from the revelation of the Facebook-Cambridge Analytica scandal, you’d hope that Facebook has taken steps to stop bad actors from exploiting data that we share online. While Facebook has taken some of these steps, one loophole still exists. It’s a loophole which allows bad actors to see incredibly private information about users, information that the user may not share with their closest friends. That loophole is scraping.

What is Scraping?

Scraping is simply the act of taking public information from websites. By public information, we’re talking about the sort of information that’s accessible to anyone who views the site. If you wanted to store weather data, you could scrape a weather site. If you wanted to store sports results, you could scrape match reports. If the data is publicly available, chances are that you can scrape it.

In the above examples, you could scrape the data manually. That is, you could visit all the pages whose data you wanted to store, and copy it into a file. However, this isn’t how scraping is normally done.

Typically, people code bots which scrape web pages for them. These bots can visit a huge number of sites and monitor them 24/7, to ensure that they capture any data which is displayed on these sites.

What Do Bots Actually Scrape?

The most common scraping bots actually power search engines. These bots scrape sites, looking for all the other sites which the original site links to. If the bot can find links to other sites, it then scrapes these too. The bot looks for sites which the new site links to, and so on. The process carries on and on, until the bots have found every site available on the internet (or at least every site that is linked to by at least one other).
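The link-following process described above is essentially a breadth-first traversal. Here is a minimal sketch in Python, assuming a tiny made-up link graph held in memory rather than real sites (the `LINKS` dictionary, the `*.example` page names, and the `crawl` function are all hypothetical). It illustrates the idea only, and is not how any particular search engine works.

```python
from collections import deque

# Hypothetical link graph standing in for the web: page -> pages it links to.
LINKS = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example", "d.example"],
    "c.example": ["a.example"],
    "d.example": [],
    "e.example": ["a.example"],  # links out, but nothing links to it
}


def crawl(start, links):
    """Return every page reachable from `start`, in order of discovery."""
    discovered = [start]
    visited = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        # Look at each page the current page links to.
        for target in links.get(page, []):
            if target not in visited:
                visited.add(target)
                discovered.append(target)
                queue.append(target)
    return discovered


discovered = crawl("a.example", LINKS)
print(discovered)
```

In this toy graph, e.example is never discovered because no other page links to it, which is the caveat mentioned above.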

From this data, search engines such as Google and Bing are able to build comprehensive databases of sites, and use these to deliver search results. Every time you make a search, the search engine is calling upon masses of data which it has gained from scraping sites.

This is a fairly benign use of scraping. Here, scraping is being employed in a way which benefits everyone involved. The search engines (Google, Bing, etc.) benefit because they can deliver relevant sites to users. Users benefit because they can search for sites on these engines. The sites benefit because search engines afford them greater visibility.

From Good to Bad

Not all uses of scraping are as benevolent though. Just as scraping can be used to create all-encompassing search engines, scraping can be used to mine huge troves of personal data.

One such way of mining personal data is to scrape social media sites, such as Facebook. Scraping users’ profile pages can give basic information about them, who they’re friends with, and what photos they’ve posted.

Facebook realise the potential harm to user privacy from allowing anyone to scrape profiles. For this reason, most elements of a typical Facebook profile are set to private, meaning that they can’t be viewed by anyone who that user hasn’t added as a friend. If you try to scrape a random person’s Facebook account, you may not be able to pull much information other than their name, their profile picture, and any old posts on their timeline which haven’t been made private.

Facebook’s attempts to prevent profile scraping are praiseworthy, but they don’t go far enough. This is because some of the most valuable information that users create while using Facebook products isn’t surfaced on their profile at all.

Pages & Groups

Facebook pages and groups are two products which many of us are familiar with. By liking pages, we can express affinity for certain brands or causes, and add their content to our timelines. By joining groups we can become part of online communities, and share with others that share our interests or identities.

The sheer number of pages and groups that exist on Facebook is testament to the value they bring to people. The fact that there are so many pages and groups on Facebook also means that there is a wealth of data to be gained from knowing who likes what page, and who is part of what group.

Some of the pages we like, or groups that we’re members of, are fairly benign. If an advertiser wants to see if I like cycling, they don’t need to see that I like fan pages for a number of professional cyclists to work this out. They can simply rely on any of the many cycling-related ‘interest audiences’ which Facebook offers them for targeting.

But what if an advertiser wants to target someone based on much more personal attributes? What if an advertiser wants to target someone based on their sexuality, their religion, or their ethnicity?

A quick look at the groups and pages that exist on Facebook shows that there are huge numbers of them which appeal to people with these kinds of attributes. If you’re LGBTQ you might like the LGBTQ Nation page, if you’re Muslim you might be a member of United Muslims, and if you’re black you might be a fan of Black Lives Matter.

If a bad actor had access to which pages and groups you followed, they’d be able to deduce a great deal about what sort of person you are. So, can bad actors access this data?

Scraping Groups & Pages

For every page, there is a list of people that like that page. For every group, there is a list of people that are members of that group. Facebook don’t make these lists readily available, but that doesn’t mean that they are particularly hard to find.

Say you want to find all the people who belong to a particular group. Let’s start with a group which doesn’t appeal to people with protected characteristics, like Running Events. This is a UK-based group for people to share running events. We can find a list of its members simply by appending /members to its URL.

The full list of group members doesn’t immediately populate; you have to keep scrolling down in order to see it all. This would be somewhat tedious for a human to do, meaning that a manual approach wouldn’t scale. Fortunately it’s fairly simple to write bots which can not only access the member list of a group, but which can also keep scrolling down the page as a human would, causing Facebook to load more members.

Once the bot has scrolled all the way to the bottom, it can now start scraping the page. It does this by saving the page’s HTML, and looking for markers which indicate users’ profile URLs. It programmatically runs through the entire HTML, and saves the profile URL of each user.
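The "looking for markers" step can be illustrated with a generic example. Here is a minimal sketch using Python’s standard html.parser module, assuming a small made-up HTML snippet and a fictional social.example domain (the `LinkCollector` class, the sample markup, and the URL prefix are all hypothetical and don’t reflect any real site’s page structure). It collects each link whose address starts with a given prefix, skipping duplicates.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect unique href values that start with a given prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags carry the links we care about.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith(self.prefix):
                if value not in self.links:
                    self.links.append(value)


# Made-up saved HTML; a real page would be far larger and messier.
SAMPLE_HTML = """
<ul>
  <li><a href="https://social.example/profile/alice">Alice</a></li>
  <li><a href="https://social.example/profile/bob">Bob</a></li>
  <li><a href="https://social.example/help">Help</a></li>
  <li><a href="https://social.example/profile/alice">Alice (again)</a></li>
</ul>
"""

collector = LinkCollector("https://social.example/profile/")
collector.feed(SAMPLE_HTML)
profile_links = collector.links
print(profile_links)
```

The point of the sketch is how little is involved: once the HTML is saved, pulling out every matching link takes a few dozen lines of standard-library code.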

At this point the bot has potentially done its job. It’s successfully scraped the profile URL from every person who is a member of Running Events. This on its own is already a worrying feat. In addition to scraping profile URLs, the bot could then scrape those URLs to pull data points such as people’s names, and whatever other attributes they make public on their profile.

If the actor behind the bot already has some data on these people, they could augment this data based on the results of their scraping. For example, if a running retailer already has a comprehensive customer database, they could use the data they’ve scraped to learn which of their users are interested in running events.

Custom Audiences: Enriching The Data

Perhaps whoever is carrying out this scraping doesn’t just want to know who is into running events, but wants to target these people with advertisements. How do they do this?

Facebook allows you to upload customer data into its ad platform in order to target those customers, a process known as custom audience creation. Facebook don’t want to allow advertisers to be able to target users whose profiles they’ve scraped, so you can’t simply give Facebook the list of profile URLs you’ve just found.

To be able to target these users, you need to enrich the data, and be able to find email addresses and phone numbers for the users whose profiles have been scraped. If you can pass Facebook emails and phone numbers, in addition to their names, then Facebook will have enough fields to be able to match your data against Facebook users, effectively letting you target the people whose profiles you’ve scraped.

So, if all you know is someone’s name and Facebook profile, how do you get their email address or phone number?

People Search Engines

Search engines let us search sites. People search engines (PSEs) let us search people. It’s as simple as that.

While very few of us will have ever used a PSE, there are a variety of them available online. PSEs all work in the same way. They hold huge databases of personal details, and allow searchers to look people up by providing one or more of the fields stored in this database.

One of the fields that these databases hold is social media profile URLs. By providing a Facebook profile URL to a PSE, the engine is able to find the user in its database with that same profile URL, and tell you everything else it knows about that user.

Some of the most well-known PSEs available currently are Pipl and CatchID. Both Russian companies, they offer APIs which allow users to upload hundreds of thousands of social media profiles to them. In return, users are offered everything that the PSEs know about the profile that’s been uploaded. This often includes phone numbers and emails.

If someone were to scrape a list of people who belong to a particular Facebook group, or who like a certain page, they could easily upload their profile URLs to a PSE. The PSE would, in most cases, be able to find a phone number and email for the person whose profile URL was uploaded. If you have a list of people’s names, emails, and phone numbers, you can then upload this into Facebook in order to target these people with ads.

Think this all sounds like too much work? Worry not, there are services which can handle the scraping and data enrichment for you. One such service is LeadEnforce, which automates the whole process of scraping group members and page fans, and enriches this data with people search engines like Pipl and CatchID. LeadEnforce plans start at $99 a month.

What Does This Mean for User Privacy?

When we’re reminded of how much sensitive information we express through our group memberships, and our page likes, it’s easy to see why the above presents a huge threat to user privacy online. If an LGBTQ person likes the LGBTQ Nation page, or if a Muslim is a member of the United Muslims group, they’re exposing pieces of sensitive personal information to any bad actor with the technical know-how to build a scraping bot.

Once a bad actor has access to this information, there are countless ways it can be abused. Scraped data could be used to serve voter-suppression ads to specific minorities, reducing their electoral turnout by suggesting that opposition candidates dislike their minority group. Scraped data could be used to target pharmaceutical ads to people with specific medical conditions, conditions which the bad actor has gleaned from member lists of groups like The Hairloss Crusaders.

It isn’t just about who you show ads to; it’s also about who you don’t show ads to. A homophobic restaurant owner could scrape data from local LGBT pages and set up their ads so that they don’t show to these users. A loan provider could create audiences of those in debt management groups, and ensure these people don’t see any of their loan ads. The possibilities, sadly, are endless.

Teaching Facebook What a Minority Looks Like

The threat posed by scraping isn’t limited to the people whose data is being scraped. By uploading data of people who like a certain page or group, a bad actor can teach Facebook what these people look like. A bad actor can do this by creating a lookalike audience from the data they upload.

Facebook populates the lookalike audience with its users who most closely resemble those whose details have been uploaded. In this way, a bad actor could use lookalikes to find audiences of people with protected characteristics.

If a bad actor wanted to target or exclude Jews from seeing their ads on Facebook, they could upload data for people who like Jewish pages or groups, and create a lookalike audience from that data. The lookalike audience would likely contain plenty of people who aren’t Jewish, but crucially it would likely over-index on people who are Jewish. This means that the proportion of people in the lookalike audience who are Jewish would be much higher than, say, the national average.
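To make "over-index" concrete, here is a small worked calculation in Python. All of the figures are invented purely for illustration: the audience size, the number of group members within it, and the 2% population share are assumptions, not real Facebook or census numbers. The `over_index` function is likewise a hypothetical helper.

```python
def over_index(share_in_audience, share_in_population):
    """Ratio of a group's share of an audience to its share of the population."""
    return share_in_audience / share_in_population


# Hypothetical figures, for illustration only.
audience_size = 10_000
audience_members_in_group = 1_500
population_share = 0.02  # assumed share of the general population

audience_share = audience_members_in_group / audience_size
ratio = over_index(audience_share, population_share)
print(f"{audience_share:.0%} of the audience vs {population_share:.0%} nationally: {ratio:.1f}x")
```

With these made-up numbers, the group makes up 15% of the lookalike audience against 2% of the wider population, so the audience over-indexes by a factor of roughly 7.5 even though most of its members don’t belong to the group.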

Beyond Just Advertising

To make matters worse, the danger doesn’t just stop with advertising. Bad actors could create entire databases of people based on specific characteristics, and use this to inform business decisions. A health insurance provider could scrape pages and groups related to medical conditions en masse, and use this information to deny people coverage, or inflate prices.

Arguably you wouldn’t even need to scrape page or group member lists for this. If you want to see all of a user’s page likes or groups then you can, if they haven’t been set to private, simply append /likes or /groups to their Facebook profile URL to get a complete list. In just seconds, you can learn things about a person that even their best friends may not know.

To Wrap Up

Facebook may disallow web scraping in their terms and conditions, but the fact that they make it so easy to carry out implies that they don’t see it as a serious issue. With the amount of data exposed by being able to see someone’s page likes, or their groups, the threat to user privacy is severe. For as long as Facebook don’t take steps to actually prevent web scraping, it will remain an ongoing threat to user privacy.