The NSA’s Data Haul Is Bigger Than You Can Possibly Imagine

Editor’s note: Shortly after this story was published, the Washington Post released a series of eye-popping leaked documents showing that the National Security Agency has accidentally intercepted the communications of thousands of people it had no right to spy on. The story below is in many ways the precursor to that blockbuster revelation.

The NSA, as intelligence historian Matthew Aid shows, collects so much information online that even its mistakes are enormous. Every day, it actively analyzes the rough equivalent of what’s inside the Library of Congress and "touches," to use the agency’s term, another 2,990 Libraries’ worth of data. With such a huge haul, even the lowest of error rates (one in a hundred thousand, say) still produces terabytes and terabytes of improperly harvested data. It still means thousands and thousands of people are wrongly caught in the surveillance driftnet.

The NSA’s defenders will point to the many times the agency’s intelligence analysts followed the rules, and got things right. But that misses the point; no one expects these analysts, or the systems they use, to be flawless. The problem is that the surveillance net is so very large that even the most minuscule of imperfections can have an outsized impact. And that calls into question whether the NSA’s intelligence-collection efforts have grown too big for their own good.

The electronic spies at the National Security Agency have tried lately to play down the amount of Internet traffic they inspect — and play up how central that monitoring is to stopping terrorist attacks. Neither one of those arguments is entirely true. Yes, the NSA claimed in a recently released white paper that it "touches" only 1.6 percent of the planet’s online data, but the agency neglected to note that this is roughly equivalent to the Library of Congress’s entire textual collection, inspected 2,990 times every day. And sure, the NSA’s Internet surveillance has been instrumental in some counterterrorism operations. But this analysis of online communications has also been central to U.S. spying on places like Syria, Libya, China, and Iran.

The importance of the Internet as an intelligence source for the NSA cannot be overstated. The NSA may have made its Cold War reputation intercepting phone and radio traffic; these days, it’s all about the Net. According to information gathered from interviews with three former or currently serving U.S. intelligence officials conducted over the past month, the NSA is now producing high-grade intelligence information on a multitude of national and transnational targets at levels never before achieved in the agency’s history. Here are a few examples of the intelligence reportedly derived from NSA’s intercepts of the contents of emails and other Internet-based communications systems:

* According to a recently retired U.S. intelligence analyst, much of what the U.S. intelligence community knows, or thinks it knows, about the Iranian nuclear program is based largely on intercepted online communications.

* Intercepted emails and other Internet communications have been an essential source of information about what has been transpiring in Syria and the countries surrounding it since the Syrian civil war broke out in early 2011.

* The NSA’s ability to exploit email traffic, both plaintext and encrypted, has proved to be a critically important tool allowing the U.S. intelligence community to track military activities around the world, particularly in certain key countries in the Middle East, South Asia, and the Far East. For instance, intercepted Internet traffic reportedly played an important role in allowing the U.S. intelligence community to keep close tabs on the activities of military units loyal to Muammar al-Qaddafi during Libya’s civil war in 2011.

* Intercepted emails and text messages were also essential to the success of Gen. David Petraeus’s Baghdad "surge" operation in Iraq in the spring and summer of 2007. According to an Aug. 9 NSA white paper, "The senior U.S. commander in Iraq credited signals intelligence with being a prime reason for the significant progress made by U.S. troops in the 2008 [actually 2007] surge, directly enabling the removal of almost 4,000 insurgents from the battlefield."

* According to one official, intelligence information derived from Internet signals collection, or SIGINT (for "signals intelligence"), has been responsible, directly or indirectly, for more than 60 percent of the al Qaeda terrorists captured or killed since the 9/11 attacks.

* Since 2008, signals intelligence derived from mobile phone and email intercepts has become the principal intelligence source used by the CIA, the Defense Intelligence Agency, and Joint Special Operations Command to target drone strikes and commando raids against al Qaeda terrorists and local insurgent targets in northern Pakistan and Yemen. Signals intelligence has become so important to the U.S. intelligence community’s counterterrorism effort that it has given birth to a new type of CIA intelligence officer called a human intelligence targeting officer (HTO). The HTO is responsible for fusing real-time signals intelligence concerning the locations of al Qaeda officials with available intelligence received from agents in order to direct CIA Reaper drones equipped with Hellfire air-to-surface missiles to their targets.

Working in close conjunction with its English-speaking partners in Britain, Canada, Australia, and New Zealand, the NSA is currently engaged in two Internet-related SIGINT collection programs.

The first involves the collection of Internet metadata — who communicates with whom and how. The domestic component of this program, which started shortly after 9/11, involved AT&T, Verizon, and Sprint providing the NSA with massive volumes of Internet usage data for all their subscribers in the United States and overseas. This program was officially terminated in December 2011 after Sen. Mark Udall and Sen. Ron Wyden questioned whether the program was producing sufficient intelligence to justify continuing to fund it. Whether the NSA still retains the massive database of Internet metadata is unknown. But the agency isn’t in the habit of throwing things away.

Either way, the NSA continues to collect the exact same sort of Internet metadata on foreign targets to this very day (though determining who’s a foreigner and who’s not can be a near-impossible task, as my FP colleague Shane Harris has shown). Every minute of every day of the year, the NSA’s vast array of computers sweeps the entire global Internet using almost exactly the same search and sweep techniques as Google, collecting vast amounts of metadata on Internet usage around the world. The metadata that the NSA and its partners collect every day yields vast amounts of information on computer systems and email communications links of particular interest to the agency: Internet protocol (IP) addresses, email accounts, user names, domains, service providers, server locations, ports, blocked sites, browser(s) used, dates and times of logins, length of web sessions, website addresses (URLs) visited, IP addresses contacted, and, for Skype users, all phone numbers called.

The Internet metadata program has been particularly useful for identifying which email links use PGP or other encryption systems, which automatically earns that particular system increased scrutiny by the NSA’s computer-hacking organization, the Office of Tailored Access Operations, to determine whether this communications traffic might be of intelligence value.

Separate from the Internet metadata program, the NSA and its overseas partners intercept the content of vast amounts of communications and digital data traffic carried on the Internet, especially email traffic. The NSA and its English-speaking partners are intercepting, machine-reading, and caching millions (if not billions) of emails every day. According to previously published reports, the agency may even be able to read emails that were encrypted with a wide variety of commercially available encryption systems.

Getting at the vast and growing volume of email and related communications traffic being carried over the Internet is, from a purely technical standpoint, a relatively easy proposition for the NSA because, according to industry estimates, roughly 80 percent of the world’s Internet traffic either originates in the United States or transits through Internet service providers and/or computer servers in the United States.

And what the NSA cannot access, sources report, the agency’s British, Canadian, Australian, and New Zealand SIGINT partners oftentimes can. They do this by covertly collecting all Internet and data traffic being carried on all fiber-optic cables that touch on their territory.

The majority of the Internet traffic entering, leaving, or transiting through the United States travels through one of 32 fiber-optic-cable landing points or terminals: 20 on the U.S. East Coast and 12 on the West Coast. According to the consulting firm TeleGeography in Washington, D.C., 56 global fiber-optic cable systems carrying Internet and digital data traffic to and from Europe, Asia, the Middle East, Africa, Latin America, and the Caribbean are connected to these 32 cable landing points.

The NSA can now access almost all traffic transiting through these fiber-optic cable systems (except those cables connecting the lower U.S. mainland with Alaska) pursuant to a classified program called Upstream. Upstream consists of four subordinate programs called Fairview, Stormbrew, Blarney, and Oakstar. An April 2013 top secret PowerPoint slide leaked by Edward Snowden to the Washington Post indicates that Stormbrew focuses on Internet traffic passing between the United States and Asia, while Blarney appears to cover traffic between the United States and Europe and the Middle East. The precise functions of the Fairview and Oakstar programs are not yet known.

Getting at this traffic is technically feasible only because of the NSA’s intimate relationships with the largest American telecommunications companies and Internet service providers. Thanks to a series of secret cooperative agreements with America’s three largest telecommunications companies (AT&T, Verizon, and Sprint), the NSA has since 9/11 been given access to virtually all foreign Internet traffic carried by these underwater fiber-optic cable systems. These access agreements with the "Big Three" telecommunications companies are legally sanctioned by warrants that are routinely renewed every 90 days by the Foreign Intelligence Surveillance Court in Washington, D.C.

AT&T, Verizon, and Sprint can access most Internet traffic transiting the United States via these fiber-optic cables because at some point the traffic passes through one or more gateway nodes, backbone nodes, remote access routers, Internet exchange points, or network access points in the United States that are operated by the "Big Three." At these points, Internet traffic of interest to the agency is intercepted by NSA equipment (euphemistically referred to as "black boxes" by company personnel) that is operated and maintained by specially cleared personnel on the payroll of the telecommunications companies.

For example, all Internet and data traffic from Latin America and the Caribbean arrives in the United States via eight submarine fiber-optic cables whose terminals are located in Florida at Jacksonville, Vero Beach, West Palm Beach, Spanish River Park, Boca Raton, Hollywood, North Miami Beach, and Miami. All Internet traffic from these eight fiber-optic cables is forwarded to the AT&T backbone node facility in Orlando, Florida, where email and data traffic of interest to the NSA is instantly copied and sent via secure buried fiber-optic cable links to NSA headquarters for processing, analysis, and reporting.

And since September 2007, the NSA has been able to expand and enhance its coverage of global Internet communications traffic through a now-infamous program called PRISM, which uses orders issued by the Foreign Intelligence Surveillance Court that permit the NSA to access emails and other communications traffic held by nine American companies: Microsoft, Google, Yahoo!, Facebook, PalTalk, YouTube, Skype, AOL, and Apple.

Thanks to PRISM, for the past six years the NSA has been exploiting a plethora of other communications systems besides emails that also use the Internet as their platform: voice-over-Internet protocol (VoIP) systems like Skype, instant messaging and text messaging systems, social networking sites, and web chat sites and forums, to name but a few. The NSA is also reading emails and text messages carried on 3G and 4G wireless traffic around the world because many of these systems are operated by American companies, such as Verizon Wireless.

No matter how you measure it, the amount of intercepted Internet-based communications traffic that the NSA must process, analyze, and report on is massive and getting larger by the day.

In an unclassified white paper released on Aug. 9, the NSA claimed that it "touches" only 1.6 percent of the 1,826 petabytes of traffic carried by the Internet each day, which equates to approximately 29.2 petabytes of communications data. To give a sense of how much raw data this is, the Library of Congress’s entire textual collection, the world’s largest, holds an estimated 10 terabytes of data, which is equivalent to 0.009765625 petabytes. In other words, the NSA collects just from intercepted Internet traffic the equivalent of the entire textual collection of the Library of Congress 2,990 times every day.
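
For readers who want to check the arithmetic, here is a minimal Python sketch of the calculation above. The traffic, percentage, and Library of Congress figures are the ones quoted in this story; the conversion of 1,024 terabytes per petabyte is an assumption inferred from the 0.009765625 figure, and the function and constant names are purely illustrative.

```python
# Figures quoted in the article.
DAILY_INTERNET_TRAFFIC_PB = 1826   # petabytes carried by the Internet
TOUCHED_FRACTION = 0.016           # the 1.6 percent the NSA says it "touches"
LIBRARY_OF_CONGRESS_TB = 10        # estimated size of the collection

# Assumption: binary conversion, inferred from 10 TB = 0.009765625 PB.
TB_PER_PB = 1024


def touched_petabytes(total_pb=DAILY_INTERNET_TRAFFIC_PB,
                      fraction=TOUCHED_FRACTION):
    """Petabytes of traffic the NSA says it touches."""
    return total_pb * fraction


def library_equivalents(data_pb,
                        library_tb=LIBRARY_OF_CONGRESS_TB,
                        tb_per_pb=TB_PER_PB):
    """How many Library of Congress collections fit in data_pb petabytes."""
    library_pb = library_tb / tb_per_pb
    return data_pb / library_pb


touched = touched_petabytes()
print(f"Touched: {touched:.3f} PB")
print(f"Libraries (unrounded input): {library_equivalents(touched):,.1f}")
print(f"Libraries (input rounded to 29.2 PB): {library_equivalents(29.2):,.1f}")
```

The 2,990 figure appears to come from rounding the touched volume to 29.2 petabytes before dividing. Carrying the unrounded 29.216 petabytes through gives a result closer to 2,992, a difference that does not change the picture.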

Of this amount, according to the NSA, only 0.025 percent of the intercepted Internet material is selected for review based on a vast and ever-changing "key word" or "key phrase" alert system. On paper this sounds reasonably manageable until you realize that the daily amount of material in question is the equivalent of 75 percent of the Library of Congress’s entire collection.
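
The 75 percent comparison can be checked the same way. This is a rough sketch, not the NSA's own method: it takes the 29.2 petabytes, the 0.025 percent, and the 10-terabyte Library of Congress estimate from this story, and it assumes 1,024 terabytes per petabyte. The names are hypothetical.

```python
# Figures quoted in the article.
TOUCHED_PB_PER_DAY = 29.2      # petabytes the NSA says it touches
REVIEW_FRACTION = 0.00025      # 0.025 percent selected for review
LIBRARY_OF_CONGRESS_TB = 10    # estimated size of the collection

# Assumption: binary conversion between terabytes and petabytes.
TB_PER_PB = 1024


def reviewed_terabytes(touched_pb=TOUCHED_PB_PER_DAY,
                       fraction=REVIEW_FRACTION,
                       tb_per_pb=TB_PER_PB):
    """Terabytes of touched traffic selected for review."""
    return touched_pb * fraction * tb_per_pb


def share_of_library(reviewed_tb, library_tb=LIBRARY_OF_CONGRESS_TB):
    """Reviewed volume as a fraction of the Library of Congress estimate."""
    return reviewed_tb / library_tb


reviewed = reviewed_terabytes()
print(f"Selected for review: {reviewed:.2f} TB")
print(f"Share of the Library of Congress: {share_of_library(reviewed):.0%}")
```

Under these assumptions the reviewed material comes to roughly 7.5 terabytes, which is where the three-quarters figure comes from.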

More and more of the data to be reviewed is Chinese. Although the Internet was invented in the United States, its future is in China, which has seen its online population increase a hundredfold in the last 10 years and now boasts double the number of America’s Internet users. That means the NSA’s ability to access Chinese communications, which sources confirm is the U.S. intelligence community’s top Tier I target after al Qaeda and other foreign terrorist groups, has also increased a hundredfold in just the past decade, and the NSA’s access to Chinese communications will only continue to grow incrementally as tens of millions more Chinese people are expected to get online in the next few years.

The same is true about Russia, another increasingly important Tier I high-priority target for the U.S. intelligence community and another place where Internet usage is growing. As Russian President Vladimir Putin’s relations with Washington continue to deteriorate, the U.S. intelligence community’s prioritization of Russia as an intelligence target has risen significantly in just the past two months.

But if there is one area where Internet-based signals intelligence has played a particularly critical role, it is in the field of counterterrorism. The NSA has confirmed that al Qaeda and other terrorist leaders in the Middle East and South Asia depend on email and other Internet-based communications systems to communicate with one another because, according to a leaked 2009 NSA inspector general’s report, "they are ubiquitous, anonymous, and usually free of charge," allowing terrorist leaders to "access Web-based email accounts and similar services from any origination point around the world." Of course U.S. spies are going to try to listen in.