
Is the NSA Blinded by Big Data?

Who says it’s the best method to catch the bad guys?

Rabbits hop around but that’s not very threatening. Hops on a network, however, are a different game.

When a rabbit hops on a lawn, each hop adds a yard or so. Not so in a network topology. In a network, the difference between one and two hops is huge. The difference between two and three hops is humongous because the effect of “hops” on traversing a network is exponential rather than additive. Exponential mechanisms don’t add, they multiply—and that is the key to understanding a multitude of modern phenomena ranging from viral videos to pandemics.

Awareness of the impact of exponential mechanisms is why people who study social networks and surveillance took notice—even if lawmakers listening did not—when NSA deputy director Chris Inglis said in Congressional testimony June 18 that the NSA went out “two or three hops” in a suspect’s network of contacts, compared with the previously reported two.

If you knew your network math, you weren’t hearing “oh, it’s three hops, not two.” You were hearing “wait, the NSA surveillance sweep area is potentially hundreds of times larger than what we thought?”

To explain the math, let’s focus a bit on who may count as a suspect’s contacts. If Snowden’s allegations and some of the reporting we’ve read from major media outlets are true, the companies that provide NSA with “metadata” (in other words, “who’s speaking with whom”) include Google, Facebook, Yahoo, YouTube, Skype, AOL and Apple.

An average person on Facebook has about 130 friends. Adolescents and young adults often have many more: a Pew study found that teens on Facebook have a median of 300 friends. In addition to your Facebook friends, a “contact” could be anyone you emailed using a Gmail, Yahoo or Hotmail account or instant messaged via a Microsoft service. Add anyone with whom you had a Skype conversation. Then there are reports that NSA collects cell phone metadata. Clearly, there will be some overlap among your Facebook friends, your email contacts and your phone contacts, but it seems reasonable to assume that there will be many who are on one list but not on the others.

Let’s say that there are about another 150 people outside your Facebook friends whom you might have emailed, called or otherwise been in detectable contact with over a year. About 300 “contacts” per person sounds like a reasonable baseline. Let’s do the math from this relatively modest base.

At one hop from a suspect, you have an investigation of about 300 people who are “one degree” out in the person’s network. This would be everyone she or he has contacted digitally through these services. We don’t know the time-frame, but let’s say in the last few years, as PRISM seems to be a few years old. You could reasonably investigate which of a suspect’s 300 or so contacts deserve a second look. While association alone should not be sufficient grounds for a charge, it’s obviously reasonable to look at who a suspect associates with directly.

At two hops, you have a community under suspicion. This is everyone who has been contacted by anyone the suspect has contacted. Assuming each of your “contacts” also has about 300 contacts, that would be about 90,000 people. Hence, if you contact a local politician, everyone she contacts could be included in this list. Most people have friends who have more friends than they do (this is called the “friendship paradox”) simply because you are more likely to be friends with people who have more friends. Thus, at two hops out, you are capturing large numbers of people who happen to be in contact with your more popular friends.

At three hops, you have a dragnet. If we stick to our base number and calculate 300*300*300, we are looking at 27 million people. This is no longer a community, it’s a good chunk of a nation.
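For readers who want to check the arithmetic, here is a minimal Python sketch of the hop math above. It rests on the same simplifying assumptions as the text: every person has the baseline 300 contacts and no two contact lists overlap, so the result is an upper bound rather than an estimate. The function name `reach_at_hop` is just an illustrative label.

```python
# Baseline taken from the text: roughly 300 contacts per person.
CONTACTS_PER_PERSON = 300


def reach_at_hop(hops: int, contacts: int = CONTACTS_PER_PERSON) -> int:
    """Upper bound on the number of people exactly `hops` steps out.

    Assumes everyone has the same number of contacts and that no
    contact lists overlap, so each hop multiplies the previous total.
    """
    return contacts ** hops


for hops in (1, 2, 3):
    print(f"{hops} hop(s): {reach_at_hop(hops):,} people")

# How much larger the sweep gets when going from two hops to three.
print("growth from 2 to 3 hops:", reach_at_hop(3) // reach_at_hop(2))
```

Under these assumptions, going from two hops to three multiplies the sweep by the full contact count, which is why the difference between the two figures matters so much.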

[In response to comments here and on Twitter: Yes, obviously, there will be overlaps as you traverse the network, so the true figure may be lower than 300*300*300. On the other hand, 300 is actually a somewhat low estimate of how many “contacts” a person has; research suggests the number may be as high as 600. Since I don’t know how NSA defines contacts (or how far back in time it traces them), and since the numbers here are not meant to be precise estimates but to point out that the network quickly grows into the millions, making qualitative analyses impossible, I’m letting it stand.]

Obviously, NSA’s surveillance program, as outlined by Snowden’s allegations, raises constitutional and ethical issues. The Fourth Amendment implications are obvious. There are also First and Fifth Amendment questions. Is there a chilling effect on free speech? What about protections against self-incrimination?

PRISM also creates many political roadblocks for dissidents outside the United States. While most of us were so focused on how much data Facebook and Microsoft were turning over to NSA, and how, I bet leaders of many countries were more keenly focused on a slide titled “You Should Use Both.” Along with PRISM, this slide listed a program titled “UPSTREAM,” which, allegedly, provided “collection of communications on fiber cables and infrastructure as data flows past.” Hence, many countries are going to try to make sure that their data does not “flow past” U.S. infrastructure—which means more local servers and hubs subject to governmental control.

There is another side to having this much data, though, one that is not getting enough attention amid the legal and ethical considerations for the surveilled. I can see why this much data is attractive to an analyst, or to a person running an intelligence agency. Data is immensely appealing to me as a social scientist. But our experience over the last few years shows that the blinding attraction of the data might cause us to lose sight of the fact that big data is not suited to every research problem.

Three hops create a data deluge. As in many scientific fields, this level of data collection means that NSA is going through a shift from a “data poor” environment to a “data rich” environment. In that, it is not alone. Network social scientists who for decades struggled with ways to measure one-degree networks accurately (just your direct contacts) now find themselves looking at datasets of millions of “edges” (or links in the network). Oceanographers can play with 450 million data points around the world’s seas from just one project. Astronomers find that their datasets double each year. And so on.

Data deluge is a problem even at two hops but an insurmountable one at three hops. Data deluge encourages certain methods and discourages others. And I believe that the ones it encourages are less appropriate to NSA’s mandate—identifying security threats—than the ones it discourages.

At three hops out, you cannot examine individuals—you have to start relying on easily identifiable markers. You have to squint and look at outlines. You have to engage in pattern recognition. If you have swept meta-data on 27 million individuals, what do you look for? Males? Muslims? Those who bought guns? Fertilizers? Pressure cookers? It has to be something. There simply cannot be a semblance of individual examination. By necessity, the “sweep” has to be algorithmic and be trained to look for specific behaviors. (And pattern recognition is often another way of saying stereotypes.)

I am not dissing pattern recognition. It can be a great method for identifying regularities. We may find stereotypes noxious, but they are so widespread partly because they developed as heuristics to help our brains deal with the data deluge of the world. How are you going to speak with someone you just met? We often judge people, quickly, on their appearance and fit them into our mental categories. Three-piece suit? Lawyer. Tattoos and piercings? Artist. And so on. It often works, to a degree, and we adjust as we go. We make mistakes and learn to be cautious and hold back on what our pattern recognition antennas are telling us.

Pattern recognition is great for big things that happen again and again. Things that jump out from the data and that happen repeatedly so we can figure out commonalities. Storms and hurricanes. Supernovas. Patterns of migration. Family formation. Economic development.

But you know what pattern-recognition is worst at? It’s picking out things that are rare, for which there is not enough data to pick out regularities. The diamond in the rough, so to speak. The needle in the haystack. The person with the tattoos and the piercings who is the lawyer.

Overall, pattern-recognition will “underfit” the data if what you are looking for is an anomaly. Say, a terrorist wanting to blow up innocents to make a political point. (Here, I’m using “underfit” to mean that the pattern you are looking for just won’t fit your target—there is no pattern to your target. It’s too rare, too case-by-case).

What the past decade has shown us, thankfully, is that the number of people who want to murder Americans on American soil as a part of the conflict loosely defined as “war on terror” is not that many. We have had more deaths in the past ten years on U.S. soil due to crazed young adult men with guns than to anything that fits a reasonable definition of terrorism. The Boston bombers were arguably among the best fit but even there, it’s hard to deny the great overlap with the “crazed young man” category. Anyway, no pattern recognition system would have picked the sweet-faced Jahar who attended his high school prom.

Pattern recognition also “over-fits”: in other words, it produces “matches” where there are none, and that’s the reason we find stereotypes to be noxious, even if we can point to a statistical reason for them. A pattern recognition algorithm would come up with “young Saudi men” as the closest heuristic for the 9/11 hijackers. Yet, the “young Saudi male” friend in my own social network is a graduate of the Columbia School of Journalism and knows how to find the best pizza near Dupont Circle after midnight.
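To see why false matches swamp real ones when the target is rare, consider a back-of-the-envelope sketch in Python. Every number in it is an assumption made up for illustration, not a claim about any real program: a sweep of 27 million people (the three-hop figure above), 10 actual targets among them, an algorithm that catches 99 percent of real targets, and one that wrongly flags just 0.1 percent of everyone else. The function name `flagged_breakdown` is likewise only a label for this sketch.

```python
def flagged_breakdown(population, true_targets, sensitivity, false_positive_rate):
    """Return (expected true matches, expected false matches, precision).

    Precision is the share of all flagged people who are real targets.
    """
    true_matches = true_targets * sensitivity
    false_matches = (population - true_targets) * false_positive_rate
    precision = true_matches / (true_matches + false_matches)
    return true_matches, false_matches, precision


# All four inputs are hypothetical values chosen for illustration only.
true_matches, false_matches, precision = flagged_breakdown(
    population=27_000_000,      # three-hop sweep size from the text
    true_targets=10,            # assumed number of real threats
    sensitivity=0.99,           # assumed share of real threats flagged
    false_positive_rate=0.001,  # assumed share of innocents flagged
)

print(f"expected true matches:  {true_matches:.1f}")
print(f"expected false matches: {false_matches:,.0f}")
print(f"share of flags that are real: {precision:.4%}")
```

Even with an algorithm far more accurate than anything we could reasonably expect, these made-up inputs yield roughly 27,000 innocent people flagged for about ten real targets. Change the assumptions and the exact figures move, but when the target is this rare the imbalance remains.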

Overfitting the data through stereotypes is not just unfair and injurious to our sense of justice. It also produces exclusion and pushes away the very demographics that one assumes, if the pattern is to be believed, are more likely to be in a contentious relationship with the broader culture. Exclusion has costs for the excluded but also for the one doing the excluding: I’d have been eating stale sandwiches rather than delicious wood-fired pizza. And the NYPD would not have spent so much time and effort surveilling all the mosques in the region without a single criminal lead.

Thus, the impulse to “collect it all” has downsides besides the ethical and legal pitfalls discussed extensively. Even at two hops, let alone three, NSA’s surveillance program can be bad for anti-terrorism by shifting focus and resources from individual investigations, a more fitting method for rare events, to pattern recognition—which this data deluge will almost surely necessitate—and which is not a good method for detecting rare events.

I suppose the counterargument would be to say that the “hops” are not random or comprehensive and that NSA analysts only follow actual leads. However, such actual, concrete leads would justify even those old-fashioned warrants. If that’s all that is going on, we should be told about this (but then why do we need PRISM?).

I suspect, however, that, as is happening in many academic fields, the NSA is sorely tempted by all the data at its fingertips and is adjusting its methods to the data rather than to its research questions. That’s called looking for your keys under the light.