As Ars reported yesterday, documents provided to The Washington Post by former National Security Agency contractor Edward Snowden show that the NSA was able to harvest enormous amounts of unencrypted information from Google and Yahoo by grabbing the data straight off the companies' wide-area networks. Analysis of the documents alongside previously leaked data and other information explains why engineers affiliated with Google shouted expletives when they were shown how the NSA effectively bypassed the safeguards that the companies had put in place to protect customer data.

In an interview with Bloomberg TV yesterday, NSA Director Gen. Keith Alexander said, "I can tell you factually we do not have access to Google servers [or] Yahoo servers." Technically, Gen. Alexander's denial is truthful: the NSA did not access Google's or Yahoo's servers itself. But the agency's MUSCULAR program, undertaken in collaboration with the United Kingdom's NSA equivalent, the GCHQ, does tap into the traffic of the networks that link those companies' data centers.

The taps, described in NSA documents as a "minor circuit move," simply plugged into the telecommunications infrastructure that carries Google's and Yahoo's private fiber links. They gave the NSA access inside the two companies' Internet perimeters, allowing the agency to scan and capture massive amounts of data, so much that the NSA's Special Source Operations complained that it had too much garbage to sort through.

Forget the PRISM—go for the clear

The NSA already has access to selected content on Google and Yahoo through its PRISM program, a collaborative effort with the FBI that compels cloud providers to turn over select information through a FISA warrant. And it collects huge quantities of raw Internet traffic at major network exchange points, allowing the agency to perform keyword searches in real time against the content and metadata of individual Internet packets.

But much of that raw traffic is encrypted, and the PRISM requests are relatively limited in scope. So the NSA went looking for a way to get the same sort of access to encrypted traffic to cloud providers that it had with unencrypted raw Internet traffic. The solution that the NSA and the GCHQ devised was to tap into the networks of the providers themselves as they crossed international borders.

Google and Yahoo maintain a number of overseas data centers to serve their international customers, and Internet traffic to Google and Yahoo is typically routed to the data center closest to the user. The Web and other Internet servers that handle those requests generally communicate with users via a Secure Sockets Layer (SSL) encrypted session and act as a gateway to other services running within the data center. In the case of Google, these include services like Gmail message stores, search engines, Maps requests, and Google Drive documents. Within Google's internal network, these requests are passed unencrypted, and requests often travel across multiple Google data centers to generate results.

In addition to passing user traffic, the fiber connections between data centers are also used to replicate data among them for backup and universal access. Yahoo, for example, replicates users' mailbox archives between data centers to ensure that they're available in case of an outage. In July of 2012, according to documents Snowden provided to The Washington Post, Yahoo began transferring entire e-mail accounts between data centers in its NArchive format, possibly as part of a consolidation of operations.

By gaining access to networks within Google's and Yahoo's security perimeters, the NSA was able to effectively defeat the SSL encryption used to protect customers' Web connections to the cloud providers, giving the agency's network filtering and data mining tools unfettered access to the content passing over the network. As a result, the NSA had access to millions of messages and Web transactions per day without having to use its FISA warrant power to compel Google or Yahoo to provide the data through PRISM. And it gained access to complete mailboxes of e-mail at Yahoo—including attachments that would not necessarily show up as part of intercepted Webmail sessions, because users would download them separately.

But the NSA and the GCHQ had to devise ways to process the streams of data passing between data centers to make them useful. That meant reverse-engineering some of the software and network interfaces of the cloud providers so that they could break apart data streams optimized to be sent across wide-area networks over multiple simultaneous data links. It also meant creating filtering capabilities that allowed the NSA and the GCHQ to separate traffic of intelligence interest from the vast amount of inter-data center communications that have nothing to do with user activity. So the NSA and the GCHQ configured a "distributed data distribution system" similar to XKeyscore to collect, filter, and process the content on those networks. That is how the NSA described MUSCULAR in an FAQ about the BOUNDLESSINFORMANT metadata search tool, a document acquired by the American Civil Liberties Union.

Mailbox overload

Even with filtering, the volume of that data presented a problem to NSA analysts. When Yahoo started performing its mailbox transfers, that data rapidly started to eclipse other sources of data being ingested into PINWALE, the NSA's primary analytical database for processing intercepted Internet traffic. PINWALE also pulls in data harvested by the XKeyscore system and processes about 60 gigabytes of data per day passed to it from collection systems.

By February of 2013, Yahoo mailboxes were accounting for about a quarter of that daily traffic. And because many of the mailboxes contained e-mail messages that were months or years old, most of the data was of little use to analysts looking for current information: fifty-nine percent of the mail in the archives was over 180 days old.
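To put those proportions in rough absolute terms, the figures above can be worked through in a few lines of Python. This is only a back-of-the-envelope sketch. It assumes that the "quarter" applies to the roughly 60 gigabytes per day cited for PINWALE, and that the 59 percent share of old mail (reported per message) holds by data volume as well. Neither assumption is confirmed by the documents, and the function and constant names are illustrative.

```python
# Figures quoted in the article; how they combine is an assumption (see below).
DAILY_INGEST_GB = 60.0   # approximate daily volume processed by PINWALE
YAHOO_SHARE = 0.25       # "about a quarter" of that daily traffic
STALE_FRACTION = 0.59    # share of archived mail older than 180 days


def yahoo_volume_gb(daily_gb=DAILY_INGEST_GB, share=YAHOO_SHARE):
    """Estimated daily volume of Yahoo mailbox data, in gigabytes."""
    return daily_gb * share


def stale_volume_gb(yahoo_gb, stale_fraction=STALE_FRACTION):
    """Estimated daily volume of mail older than 180 days, in gigabytes.

    Assumption: the fraction of old messages by count is treated as the
    fraction by bytes, which the source does not state.
    """
    return yahoo_gb * stale_fraction


yahoo_gb = yahoo_volume_gb()
stale_gb = stale_volume_gb(yahoo_gb)

print(f"Yahoo mailbox data per day: {yahoo_gb:.2f} GB")
print(f"Of which older than 180 days: {stale_gb:.2f} GB")
```

Under those assumptions, Yahoo mailboxes would account for about 15 gigabytes a day, of which a little under 9 gigabytes would be mail more than six months old.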

So the analysts requested "partial throttling" of Yahoo content to prevent data overload. "Numerous analysts have complained of [the Yahoo data's] existence," the notes from the PowerPoint slide on MUSCULAR stated, "and the relatively small intelligence value it contains does not justify the sheer volume of collection at MUSCULAR (1/4th of the daily total collect)."

This isn't the first time the NSA has had to throttle its data intercepts. It also had to throttle back its collection of Webmail address books, instead focusing on instant messaging "buddy lists." In 2012, the NSA created a "defeat" that exposed address book data over the Webmail protocols for Gmail, Yahoo, and Hotmail, and it later added Facebook instant messaging friends lists. That collection has been narrowed now to just Facebook address books, in part because of an episode last year in which a Yahoo mail account monitored by the NSA was hacked and used by spammers, causing the account's address book to grow exponentially with unrelated e-mail addresses.