Claims that social media data won the presidency are greatly exaggerated

A piece of data science mythology has been floating around the internet for several weeks now. It surfaced most recently in Vice, and it tells the story of a firm, Cambridge Analytica, that was supposedly instrumental in Donald Trump’s campaign.

The story goes that by analysing marketing and social media data during the EU Referendum, data scientists were able to model the personalities of voters in unprecedented detail, helping the Leave campaign to an unlikely victory. Shortly after that, the firm was employed by the Trump campaign where, we are told, it contributed to another unlikely victory.

For me this story is like candy floss – it looks nice and substantial, but when you stick it in your mouth there's not much there and you’re still hungry. The reporting leaves a ton of questions unanswered, and when you try to look into them the results are less than satisfying.

Before we even get into methods, there’s Ted Cruz. The article posted by Vice doesn’t just gloss over him; it tries to present his campaign as some sort of victory for Cambridge Analytica’s approach. This would be the campaign in which Ted Cruz was wiped out in a few short weeks by a reality TV demagogue with no data science operation, and subjected to months-long national humiliation.

They mention the Iowa caucuses on 1 February 2016, where the data science outfit helped to identify target voters. Cruz did indeed win, but took just 27 per cent of the vote in a four-way race, only three points ahead of Trump. The authors don’t mention the next three states in February – New Hampshire, South Carolina and Nevada – where he was thrashed. Nor do they mention Super Tuesday, on 1 March, where Trump thrashed him by double-digit margins in six states.

“The story of the Republican primaries is actually that Cambridge Analytica’s flashy data science team got beaten by a dude with a thousand-dollar website”

Remember, Cambridge Analytica weren't hired by Trump until June that year. Meanwhile, Trump’s entire data science operation was, as the article admits, “a marketing entrepreneur and failed start-up founder who created a rudimentary website for Trump for $1,500.”

So the story of the Republican primaries is actually that Cambridge Analytica’s flashy data science team got beaten by a dude with a thousand-dollar website. To turn that into this breathtaking story of an unbeatable voodoo-science outfit, powering Trump inexorably to victory, is quite a stretch. Who else have they even worked for? Without a list of clients it's very easy to cherry-pick the winners.

That's before we even get into the question of what they were actually doing. The authors tell us that Cambridge Analytica were using some combination of survey data, content scraped from social media, and traditional marketing data. They're then doing some kind of sentiment analysis to build a 'five traits' profile of millions of Americans (and Britons, in the case of the Brexit campaign).

The five traits model is a real thing in psychology, sure, and it may have some predictive power for things like mortality. It’s important to note, however, that it is neither undisputed nor flawless. It’s also true that you can take demographic data and correlate it with political leaning with a reasonable amount of success – we know that votes in the EU referendum tended to correlate with education, for example. That’s the big grain of truth at the heart of the story.

“Establishing personality traits from someone’s Facebook feed is at best untested science”

But let’s just think this through. First, this is data that’s available to every other major campaign data science outfit. It’s not some secret buried hard drive they found. Second, this usage would go far beyond anything that any published science can support. OCEAN personality traits (the same five traits model) would normally be assessed through a questionnaire. To establish them from someone’s Facebook feed is at best an untested piece of science. Is their feed representative? Is it even public or available to you? Is your algorithm 100 per cent confident or (more likely) only 75 per cent?

Then there's the challenge of bringing all this data together with any degree of accuracy. How confidently can you match a given Facebook account to a given record on the electoral roll? You might get lucky and find some location information that can match you to the only person with a specific name in a given town, and you might then be able to match that to a credit report or other bit of data.

What you end up with is a series of steps that individually sound plausible, but collectively turn to mush. Only 60 per cent of Britons are on Facebook. Of those, many will only use it sporadically. Maybe half have their profiles public. Maybe half of those yield enough information to do an accurate OCEAN profile. Maybe 75 per cent of those yield data unambiguous enough to match to a credit report. Multiply those together and you’re down to about 10 per cent of the population. Of course I’m eyeballing the numbers here for illustration, but they’re not unrealistic and you see where I’m going with this.
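To make that back-of-the-envelope arithmetic explicit, here is a minimal Python sketch. The shares are the eyeballed figures from the paragraph above, not measured values, and the step labels and the `surviving_share` helper are hypothetical names used purely for illustration. The "sporadic use" step has no number attached, so it is left out, which makes the final figure an upper bound if anything.

```python
# Illustrative shares only: the eyeballed figures from the text, not measurements.
# Each fraction is conditional on having passed the previous step.
FUNNEL = [
    ("on Facebook", 0.60),
    ("profile is public", 0.50),
    ("enough content for an OCEAN profile", 0.50),
    ("unambiguous match to a credit report", 0.75),
]


def surviving_share(steps):
    """Return the running share of the population left after each step."""
    share = 1.0
    running = []
    for label, fraction in steps:
        share *= fraction  # conditional shares multiply
        running.append((label, share))
    return running


# Print the share of the whole population remaining after each filter.
for label, share in surviving_share(FUNNEL):
    print(f"{label:<40} {share:6.1%}")
```

The last row comes out at a little over 11 per cent, which is where the "about 10 per cent" figure comes from. Nudging any single guess up or down changes the total, but every extra step can only shrink it.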

“There’s no evidence of this voodoo marketing in action”

A claim attributed to the company is, “We have profiled the personality of every adult in the United States of America – 220 million people.” By the same rough arithmetic, only something like 20-30 million of those could have been profiled using social media data. Even for that sample, there’s no way of independently verifying whatever unpublished techniques they’re using. For the vast bulk of those people the only data available will be the bog-standard marketing data used by any other direct marketing firm.

And that seems to have played out on the ground. There’s no evidence of this voodoo marketing in action, and we have plenty of anecdotes pointing to less-than-stellar use of data by campaigns. Leonid Bershidsky wrote an excellent piece in Bloomberg in which he describes his own experience:

“I would have believed in the efficiency of these shamanic manipulations had I not been the recipient of numerous e-mail messages from the Trump campaign that designated me as a ‘Big League Supporter’ and doggedly asked for contributions and moral support, though I am disqualified as a Russian citizen. Whatever contact lists Trump’s data team had, it didn’t even match them against open social network data. Cambridge Analytica’s microtargeting was obviously failing in my case. Even though I’d given my e-mail address to the campaigns of Bernie Sanders and Clinton, too, as I registered for their rallies, they didn’t senselessly bombard me with messages as Trump did.”

Then we come to the twist in the tale. After the story first appeared online, a spokesman for Cambridge Analytica came forward with the following statement: “Cambridge Analytica does not use data from Facebook. It has had no dealings with Dr. Michal Kosinski. It does not subcontract research. It does not use the same methodology. Psychographics was hardly used at all.”

Now, you may or may not choose to believe that statement, but if you agree with my assessment so far, it seems likely to be the truth. Even if some Big Voodoo Data Science Company did have all this data, it still wouldn’t tell campaigns how to use it effectively. Nor would it have had much influence on the most effective strategies in play, like the news media war that Donald Trump engaged in so successfully.

So if we step right back and look at all this, what do we see? We see a data science firm with Steve Bannon on the board, making bigly claims about its powers, whose exact methodology is unclear to us. We see a candidate, Donald Trump, who used the same successful strategy right the way through his campaign whether he was employing Cambridge Analytica or a random dude with HTML skills. We have another candidate, Ted Cruz, who used the same firm and tanked. We have another candidate, Hillary Clinton, who used something very similar to Cambridge Analytica and also lost.

How exactly do you turn all that into the story of an unstoppable data science behemoth?