The National Security Agency's (NSA) apparatus for spying on what passes over the Internet, phone lines, and airwaves has long been the stuff of legend, with the public catching only brief glimpses into its Leviathan nature. Thanks to the documents leaked by former NSA contractor Edward Snowden, we now have a much bigger picture.

When that picture is combined with federal contract data and other pieces of the public record—as well as information from other whistleblowers and investigators—it's possible to deduce a great deal about what the NSA has built and what it can do.

We've already looked at the NSA's basic capabilities of collecting, managing, and processing "big data." But the recently released XKeyscore documents provide a much more complete picture of how the NSA feeds its big data monsters and how it gets "situational awareness" of what's happening on the Internet. What follows is an analysis of how XKeyscore works and how the NSA's network surveillance capabilities have evolved over the past decade.

Boot camp

After the attacks of September 11, 2001, and the subsequent passage of the USA PATRIOT Act, the NSA and other organizations within the federal intelligence, defense, and law enforcement communities rushed to up their game in Internet surveillance. The NSA had already developed a "signals intelligence" operation that spanned the globe. But it had not had a mandate for sweeping surveillance operations (let alone permission for them) since the Foreign Intelligence Surveillance Act (FISA) was passed in 1978. (Imagine what Richard Nixon could have done with Facebook monitoring.)

The Global War On Terror, or GWOT as it was known around DC's beltway, opened up the purse strings for everything on the intelligence, surveillance, and reconnaissance (ISR) shopping list. The NSA's budget is hidden within the larger National Intelligence Program (NIP) budget. But some estimates suggest that the NSA's piece of that pie is between 17 and 20 percent—putting its cumulative budget from fiscal year 2006 through 2012, conservatively, at about $58 billion.

Early on, the NSA needed a quick fix. It got that by buying largely off-the-shelf systems for network monitoring, as evidenced by the installation of hardware from Boeing subsidiary Narus at network tap sites such as AT&T's Folsom Street facility in San Francisco. In 2003, the NSA worked with AT&T to install a collection of networking and computing gear—including Narus' Semantic Traffic Analyzer (STA) 6400—to monitor the peering links for AT&T's WorldNet Internet service. Narus' STA software, which evolved into the Intelligent Traffic Analyzer line, was also used by the FBI as a replacement for its Carnivore system during that time frame.

Catching packets like tuna (not dolphin-safe)

Narus' system is broken into two parts. The first is a computing device in-line with the network that watches the metadata in the packets passing by for ones that match "key pairs," which can be a specific IP address or a range of IP addresses, a keyword within a Web browser request, or a pattern identifying a certain type of traffic such as a VPN or Tor connection.
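As a rough illustration of the kind of rule matching described above, the following Python sketch checks packet metadata against an address range and a keyword. It is not Narus' code: the `Rule`, `matches`, and `select` names, the choice of fields, and the sample addresses (taken from ranges reserved for documentation) are all assumptions made for the example.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    """Hypothetical filter rule: an address range, a keyword, or both."""
    network: Optional[str] = None   # CIDR range, e.g. "198.51.100.0/24"
    keyword: Optional[str] = None   # substring to look for in the request


def matches(rule: Rule, src_ip: str, dst_ip: str, request: str) -> bool:
    """Return True when every condition set on the rule is satisfied."""
    if rule.network is not None:
        net = ipaddress.ip_network(rule.network)
        if not any(ipaddress.ip_address(ip) in net for ip in (src_ip, dst_ip)):
            return False
    if rule.keyword is not None and rule.keyword.lower() not in request.lower():
        return False
    return True


def select(rules, src_ip, dst_ip, request):
    """A packet is passed on for analysis if any rule matches it."""
    return any(matches(r, src_ip, dst_ip, request) for r in rules)


rules = [Rule(network="198.51.100.0/24"), Rule(keyword="example-term")]
print(select(rules, "198.51.100.7", "203.0.113.9", "GET /"))
```

Even in this toy form, every rule adds a check that has to run against every packet inspected.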

Packets that match those rules are thrown to the second part of Narus' system—a collection of analytic processing systems—over a separate high-speed network backbone by way of messaging middleware similar to the transaction systems used in financial systems and commodity trading floors.

In the current generation of Narus' system, the processing systems run on commodity Linux servers and re-assemble network sessions as they're captured, mining them for metadata, file attachments, and other application data and then indexing and dumping that information to a searchable database.

There are a couple of trade-offs with Narus' approach. For one thing, the number of rules loaded on the network-sensing machine directly affects how much traffic it can handle: the more rules, the more compute power burned and memory consumed per packet, and the fewer packets that can be handled simultaneously. When I interviewed Neil Harrington, Narus' director of product management for cyber analytics, last year, he said that "with everything turned on" on a two-way, 10-gigabit Ethernet connection (that is, with all of the pre-configured filters turned on) "out of the possible 20 gigabits, we see about 12. If we turn off tag pairs that we’re not interested in, we can make it more efficient."

In other words, to handle really big volumes of data and not miss anything with a traffic analyzer, you have to widen the scope of what you collect. The processing side can handle the extra data—as long as the bandwidth of the local network fabric isn't exceeded and you've added enough servers and storage. But that means that more information is collected "inadvertently" in the process. It's like catching a few dolphins so you don't miss the tuna.

Collecting more data brings up another issue: where to put it all and how to transport it. Even when you store just the cream skimmed off the top of the 129.6 terabytes per day that can be collected from a 10-gigabit network tap, you're still faced with at least tens of terabytes of data per tap that need to be written to a database. The laws of physics prevented the NSA from moving all that digested data back over its own private networks to a central data center; getting all the raw packets collected by the taps back home was out of the question.
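The 129.6-terabyte figure is consistent with a tap sustaining the roughly 12 gigabits per second quoted earlier, rather than a single 10-gigabit direction. A quick check in Python, assuming decimal units (1 terabyte = 10^12 bytes) and a constant rate over 24 hours; the function name is just for illustration:

```python
def terabytes_per_day(gigabits_per_second: float) -> float:
    """Daily volume in decimal terabytes for a constant line rate."""
    bytes_per_second = gigabits_per_second * 1e9 / 8
    seconds_per_day = 24 * 60 * 60
    return bytes_per_second * seconds_per_day / 1e12


print(terabytes_per_day(12))  # throughput observed "with everything turned on"
print(terabytes_per_day(10))  # one direction of a 10-gigabit link
```

That works out to 129.6 terabytes per day at 12 gigabits per second and 108 terabytes per day at 10.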

NSA, Web startup style

All of these considerations were behind the design of XKeyscore. Based on public data (such as "clearance" job listings and other sources), the NSA used a small internal startup-like organization made up of NSA personnel and contract help from companies such as defense contractor SAIC to build and maintain XKeyscore. The XKeyscore product team used many of the principles of "agile" development and the so-called "devops" approach to running a Web operation—shipping code early and often, having support staff and developers work alongside each other, and reacting quickly to customer demands with new (and sometimes experimental) features.

Built with the same fundamental front-end principles (albeit with some significant custom code thrown in), XKeyscore solved the problem of collecting at wire speed by dumping a lot more to a local storage "cache." And it balanced the conflict between minimizing how much data got sent home to the NSA's data centers and giving analysts flexibility and depth in how they searched data by using the power of Web interfaces like Representational State Transfer (REST).

XKeyscore takes the data brought in by the packet capture systems connected to the NSA's taps (Update: This technology is code-named TURMOIL) and processes it with arrays of Linux machines. The Linux processing nodes can run a collection of "plugin" analysis engines that look for content in captured network sessions; there are specialized plugins for mining packets for phone numbers, e-mail addresses, webmail and chat activity, and the full content of users' Web browser sessions. For selected traffic, XKeyscore can also generate a full replay of a network session between two Internet addresses.

But rather than dumping everything back to the mother ship, each XKeyscore site keeps most of the data in local caches. According to the documents leaked by Snowden, those caches can hold approximately 3 days of raw packet data—full "logs" of Internet sessions. There's also a local database at the network tap sites that can keep up to 30 days of locally processed metadata.
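The two retention windows can be pictured as rolling buffers that discard anything older than a cutoff. The sketch below only illustrates that idea. The `RollingStore` class and its methods are hypothetical, and the 3-day and 30-day windows are the approximate figures cited above.

```python
from collections import deque
from datetime import datetime, timedelta


class RollingStore:
    """Hypothetical time-bounded store: items older than `window` are dropped."""

    def __init__(self, window: timedelta):
        self.window = window
        self._items = deque()  # (timestamp, item) pairs, oldest first

    def add(self, timestamp: datetime, item) -> None:
        # Assumes items arrive in time order, as captured traffic would.
        self._items.append((timestamp, item))
        self.expire(timestamp)

    def expire(self, now: datetime) -> None:
        cutoff = now - self.window
        while self._items and self._items[0][0] < cutoff:
            self._items.popleft()

    def __len__(self) -> int:
        return len(self._items)


raw_sessions = RollingStore(timedelta(days=3))   # full session content
metadata = RollingStore(timedelta(days=30))      # extracted metadata only
```

The practical consequence is the one the documents describe: a query about last week can still hit metadata, but the underlying session content is already gone unless it was pulled out earlier.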

Only data related to a specific case file is pulled back over the network to the NSA's central database. The rest of the data is available through federated search—a search request is distributed across all of the XKeyscore tap sites, and any results are returned and aggregated.
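Federated search in general follows a fan-out and merge pattern. The sketch below shows that generic pattern only. The `Site` class, its in-memory records, and `federated_search` are invented for illustration and say nothing about how XKeyscore's own interface is built.

```python
from concurrent.futures import ThreadPoolExecutor


class Site:
    """Hypothetical stand-in for one remote site with its own local index."""

    def __init__(self, name, records):
        self.name = name
        self.records = records  # list of dicts held only at this site

    def search(self, predicate):
        # Each site evaluates the query against its own data.
        return [dict(r, site=self.name) for r in self.records if predicate(r)]


def federated_search(sites, predicate):
    """Send the same query to every site in parallel and merge the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(sites))) as pool:
        per_site = pool.map(lambda s: s.search(predicate), sites)
        return [hit for hits in per_site for hit in hits]


sites = [
    Site("site-a", [{"lang": "fr"}, {"lang": "en"}]),
    Site("site-b", [{"lang": "fr"}]),
]
print(federated_search(sites, lambda r: r["lang"] == "fr"))
```

The design point is that only matching records cross the network. The bulk of the data stays where it was collected.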

To ask XKeyscore a question, analysts go to an internal Web front-end on the Joint Worldwide Intelligence Communications System (JWICS), the top-secret/sensitive compartmented information (TS/SCI) network shared by the intelligence community and the Department of Defense. They create a query, which is distributed across XKeyscore's approximately 150 sites. These sites include network taps at telecommunications peering points run by the NSA's Special Source Operations (SSO) division, systems tied to the network intercept sites of friendly foreign intelligence agencies, and the sites operated by "F6" (the joint CIA-NSA Special Collection Service), "black bag" operators who handle things like mid-ocean fiber taps.

The kinds of questions that can be asked of XKeyscore are limited only by the plugins that the NSA deploys and the creativity of the query. Any sort of metadata that can be extracted from a network session—the language used, IP address geolocation, the use of encryption, filenames of enclosures—can be tracked, cross-indexed, and searched. When the flow of data past a tap point is low, much of that information can be queried or monitored in near-real time. The only limiting factors are that the traffic has to pass through one of the NSA's tap points and that most of the data captured is lost after about three days.

How much is in there?

Like Narus, XKeyscore performs best on high volumes of traffic by "going shallow," applying a small number of rules to determine what traffic gets captured and processed. As a result, the probability that it is collecting information unrelated to the people the NSA is really interested in (those for whom the agency has FISA warrants and National Intelligence case files) is fairly high. But there have been steady improvements to the filter hardware that does the collection for XKeyscore.

For the collection points inside the US that collect data that is "one end foreign" (1EF), meaning traffic between an IP address in the US and one overseas, the SSO deployed a new system in 2012 that it said allows "more than 75 percent of the traffic to pass through the filter," according to information from The Guardian. That means that the large majority of traffic passing through US telecommunications peering points can be screened based on the rule sets used for packet capture. Depending on how wide the aperture of those rules is, that could mean either that the NSA is able to "go deep" on 75 percent of traffic and capture just the information it's looking for (with 25 percent of traffic slipping by untouched), or that 75 percent of traffic is getting dumped to cache to be processed, and is searchable with XKeyscore while it's sitting there.

That's a very important distinction. But since The Guardian has not released the document that quote was from, it's impossible to tell at the moment whether the NSA has improved its ability to respect the privacy of citizens or if it is just indexing even more of their daily Internet lives while hunting for terrorist suspects.