You think you have trouble downloading those torrents of Game of Thrones? Imagine the bandwidth pain of downloading audio from potentially thousands of phone calls captured by recording gear in a far-off country over a connection that makes DSL look good.

National Security Agency documents released this week by The Washington Post gave a glimpse of an NSA program that allows the agency to capture the voice content of virtually every phone call in an unnamed country and perform searches against the stored calls’ metadata to find and listen to conversations for up to a month after they happened. The revelation is just the latest demonstration of how the NSA has used big data technology to make its foreign surveillance both more massive and more manageable.

Just as the NSA and GCHQ have used Xkeyscore to make it possible to search through torrents of Internet traffic captured by the Turmoil monitoring systems scattered around the world, a system called Retrospective (or Retro) allows analysts to search through phone calls that are up to 30 days old based on call metadata. Originally developed for the NSA’s Mystic international telephone monitoring effort as a “one-off” capability, Retro may now be used in a number of other countries, scooping up calls that undoubtedly include ones that have nothing to do with the NSA’s foreign intelligence goals.

Of course, whether that capture can be considered monitoring comes down to semantics. In the NSA’s reasoning, it’s not “surveillance” until a human listens in. And since most of the calls accessible by Retrospective are flushed from its “cache” after a month without being queried, the NSA could argue that the calls have never been surveilled.

Pressure drop

The source of all the calls in the unnamed country is a “signals intelligence asset” that is referred to by the NSA as Scalawag. According to an NSA briefing paper published by the Post, Scalawag had “long since reached the point where it was collecting and sending home far more than the bandwidth could handle.” That was because the NSA was shipping every call grabbed by Scalawag back to the US for processing, regardless of its priority—even though the calls had to be stored locally first.

That over-collection was slowing down retrieval and processing of the most critical intelligence data from Scalawag, making it increasingly difficult for the NSA to do timely collection and analysis. In the summer of 2011, the briefing paper states, the NSA’s Special Source Operations unit (the division within the agency that handles monitoring performed through corporate “partnerships”) took steps to “alleviate pressure on bandwidth and decrease latency of [high-priority] data.”

First, SSO went through the “taskings” against Scalawag and cleared house, deleting monitoring tasks that brought back data that analysts had never touched. Other ongoing monitoring tasks had their priority of collection lowered based on how often the data returned was used. But the volume of data being shipped back to the NSA in the US continued to grow, and the breathing room gained by those measures quickly disappeared.

In December of 2011, SSO essentially banned lower-priority requests against Scalawag. But to really take a bite out of the bandwidth crunch, SSO also moved to handle searching and processing voice data closer to the source, with the Retrospective “retrieval tool.”

Reach out and touch someone

Based on what we already know about the NSA’s approach to remote data storage and the contents of the briefing memo, here’s a best guess at how Retrospective and Mystic work:

Being able to capture the voice content of virtually every phone call within a country, regardless of its size, means having practically unlimited access to its telecommunications infrastructure. Either the telecommunications system of the country being targeted is highly centralized, or the NSA needs to have monitoring gear installed at every telephone exchange in the country. Theoretically, a Retrospective “back-end” could be sitting at each of these exchanges, processing data to reduce the amount that needs to be dumped back to a central database.

Digitally captured voice data and the related call metadata are stored and indexed, likely in a “big data” data store such as Accumulo, the “NoSQL” database developed at the NSA and contributed to the Apache Software Foundation as open source in 2011. Just as with Xkeyscore, the Retrospective back end could then run a variety of indexing and processing tasks against the captured data, eliminating work that an analyst would otherwise need to do later to sort through it. It’s possible that tasks such as speech-to-text processing could also be performed on calls, for indexing purposes, by a software agent attached to the data store.

Back at the NSA, an analyst can launch a search against calls in one of the Retrospective nodes using a Universal Tasking Tool (UTT), a front-end tool that can use a wide assortment of identifiers: phone number, location, customer name, time of call, or practically anything else that can be derived from the call metadata or from preprocessing analytics that characterize a call’s content. The search parameters are then sent from the UTT out to the Retrospective server.

Each search sent by the UTT to Retrospective is turned into a “worker” task—a software robot that continuously checks the indexes of the call repository for matching calls. Since the repository holds up to 30 days' worth of calls, it can effectively reach back into the past and immediately start returning calls that have already happened. As new calls get recorded, the worker program will catch them in nearly real time, queuing them up to be sent back to the NSA in the US.
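
To make the "worker" idea concrete, here is a minimal sketch of a standing query that first reaches back through stored calls and then picks up new ones as they arrive. It is only an illustration of the guess described above, not the NSA's actual design: the `CallRecord` and `Worker` names, the plain Python list standing in for the call repository, and the phone numbers are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# The article says the repository holds up to 30 days' worth of calls.
RETENTION = timedelta(days=30)


@dataclass(frozen=True)
class CallRecord:
    """Hypothetical metadata row for one recorded call."""
    call_id: int
    caller: str
    callee: str
    started: datetime


class Worker:
    """Hypothetical standing query keyed on a single phone number."""

    def __init__(self, selector):
        self.selector = selector
        self.seen = set()  # call_ids already queued for return

    def matches(self, call):
        return self.selector in (call.caller, call.callee)

    def poll(self, repository, now):
        """Return matching calls inside the retention window, each only once."""
        cutoff = now - RETENTION
        hits = [
            call for call in repository
            if call.started >= cutoff
            and call.call_id not in self.seen
            and self.matches(call)
        ]
        self.seen.update(call.call_id for call in hits)
        return hits


now = datetime(2014, 3, 20)
repository = [
    CallRecord(1, "555-0100", "555-0199", now - timedelta(days=10)),
    CallRecord(2, "555-0123", "555-0100", now - timedelta(days=40)),  # aged out
    CallRecord(3, "555-0177", "555-0188", now - timedelta(days=1)),   # no match
]
worker = Worker("555-0100")
past_hits = worker.poll(repository, now)   # reaches back into stored calls
repository.append(CallRecord(4, "555-0100", "555-0142", now))
new_hits = worker.poll(repository, now)    # only the newly recorded call
```

The first `poll` returns the ten-day-old call, and the second returns only the call recorded afterward, which mirrors the "retrospective first, then near real time" behavior described above.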

Just how quickly calls get shipped back to the NSA is based on the priority that was assigned to the monitoring. Priority 1 tasks (the most urgent surveillance needs) will jump to the head of the line, while Priority 2 or 3 tasks will be added to the queue behind them. All of the calls in the queue will then be transferred back to the NSA’s network in the US and added to their requesters’ accounts on the NSA’s voice communications analysis platform, Nucleon.
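
The queuing behavior described here is a standard priority queue. The sketch below shows one way it could work, assuming that calls of equal priority go out in the order they were queued (the documents don't say). The `TransferQueue` class and the call identifiers are hypothetical; only the three priority levels come from the reporting.

```python
import heapq
from itertools import count


class TransferQueue:
    """Hypothetical outbound queue: lower priority number ships first."""

    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker keeps FIFO order within a priority

    def enqueue(self, priority, call_id):
        heapq.heappush(self._heap, (priority, next(self._order), call_id))

    def drain(self):
        """Yield (priority, call_id) pairs in the order they would be sent."""
        while self._heap:
            priority, _, call_id = heapq.heappop(self._heap)
            yield priority, call_id


queue = TransferQueue()
queue.enqueue(3, "call-a")
queue.enqueue(1, "call-b")
queue.enqueue(2, "call-c")
queue.enqueue(1, "call-d")
shipped = list(queue.drain())
# Priority 1 calls go first (b, then d), followed by c, then a.
```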

According to budget documents obtained by the Post, the Retrospective technology—as part of the Mystic program—may already be deployed in as many as five additional countries beyond that of Scalawag. The beauty (if you can call it that) of using a distributed search technology is that it makes it easier for the NSA to scale this sort of surveillance, while keeping the impact on its relatively low-bandwidth network connections to the facilities that host the taps to a minimum. The biggest limiting factor of the approach is the required local storage capacity, but that could be knocked down significantly through audio compression. At worst, the NSA would have to reduce the time period of its call “buffer” to accommodate greater call volume, or install more disk drives.
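
A back-of-the-envelope calculation shows why compression and buffer length are the levers that matter. The call volume and average call length below are made-up illustrative figures, not numbers from the documents. The bit rates are those of two common telephony codecs: 64 kbit/s for uncompressed G.711 and 8 kbit/s for G.729.

```python
def storage_bytes(calls_per_day, avg_seconds, bitrate_bps, days=30):
    """Bytes needed to hold `days` of audio at a constant bit rate."""
    return calls_per_day * avg_seconds * bitrate_bps * days // 8


def buffer_days(capacity_bytes, calls_per_day, avg_seconds, bitrate_bps):
    """Whole days of calls that fit in a fixed amount of local storage."""
    bits_per_day = calls_per_day * avg_seconds * bitrate_bps
    return capacity_bytes * 8 // bits_per_day


# Hypothetical workload: 1 million calls a day, 2 minutes each.
CALLS, SECONDS = 1_000_000, 120

uncompressed = storage_bytes(CALLS, SECONDS, 64_000)  # 28.8 TB for 30 days
compressed = storage_bytes(CALLS, SECONDS, 8_000)     # 3.6 TB for 30 days

# If call volume doubles and no disks are added, the buffer shrinks.
shrunk = buffer_days(uncompressed, 2 * CALLS, SECONDS, 64_000)  # 15 days
```

Under these assumptions, an eightfold drop in bit rate cuts the 30-day footprint from 28.8 TB to 3.6 TB, while doubling call volume on fixed storage halves the buffer to 15 days.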