In August 2006, the Electronic Frontier Foundation (EFF) sought government records concerning the Federal Bureau of Investigation (FBI)'s Investigative Data Warehouse (IDW) pursuant to the Freedom of Information Act (FOIA). After the FBI failed to respond to EFF's requests within the timeline provided by the FOIA, EFF filed a lawsuit on October 17, 2006. Records began to arrive in September 2007. On April 14, 2009, the government filed a brief stating that no more documents were going to be provided, despite the Obama Administration's new guidelines on FOIA.

The following report is based upon the records provided by the FBI, along with public information about the IDW and the datasets included in the data warehouse.

I. Overview of the Investigative Data Warehouse

The Investigative Data Warehouse is a massive data warehouse, which the Bureau describes as "the FBI's single largest repository of operational and intelligence information." As described by FBI Section Chief Michael Morehart in 2005, the "IDW is a centralized, web-enabled, closed system repository for intelligence and investigative data." Unidentified FBI agents have described it as "one-stop shopping" for FBI agents and an "uber-Google." According to the FBI, "[t]he IDW system provides data storage, database management, search, information presentation, and security services."

Documents show that the FBI began spending funds on the IDW in fiscal year 2002, "and system implementation was completed in FY 2005." "IDW 1.1 was released in July 2004 with enhanced functionality, including batch processing capabilities." The FBI worked with Science Applications International Corporation (SAIC), Convera and Chiliad to develop the project, among other contractors. As of January 2005, the IDW contained "more than 47 sources of counterterrorism data, including information from FBI files, other government agency data, and open source news feeds." A chart in the FBI documents shows IDW growing rapidly, breaking the half-billion mark in 2005. By March 2006, the IDW had 53 data sources and over half a billion (587,186,453) documents. By September 2008, the IDW had grown to nearly one billion (997,368,450) unique documents. The Library of Congress, by way of comparison, has about 138 million (138,313,427) items in its collection.

In addition to storing vast quantities of data, the IDW provides a content management and data mining system that is designed to permit a wide range of FBI personnel (investigative, analytical, administrative, and intelligence) to access and analyze aggregated data from over fifty previously separate datasets included in the warehouse. Moving forward, the FBI intends to increase its use of the IDW for "link analysis" (looking for links between suspects and other people – i.e. the Kevin Bacon game) and to start "pattern analysis" (defining a "predictive pattern of behavior" and searching for that pattern in the IDW's datasets before any criminal offence is committed – i.e. pre-crime).

II. IDW Systems Architecture

According to an FBI project description, "The IDW system environment consists of a collection of UNIX and NT servers that provide secure access to a family of very large-scale storage devices. The servers provide application, web servers, relational database servers, and security filtering servers. User desktop units that have access to FBINet can access the IDW web application. This provides browser-based access to the central databases and their access control units. The environment is designed to allow the FBI analytic and investigative users to access any of the data sources and analytic capabilities of the system for which they are authorized. The entire configuration is designed to be scalable to enable expansion as more data sources and capabilities are added."

A DOJ Inspector General report explained: "Data processing is conducted by a combination of Commercial-Off-the-Shelf (COTS) applications, interpreted scripts, and open-source software applications. Data storage is provided by several Oracle Relational Database Management Systems (DBMS) and in proprietary data formats. Physical storage is contained in Network Attached Storage (NAS) devices and component hard disks. Ethernet switches provide connectivity between components and to FBI LAN/WAN. An integrated firewall appliance in the switch provides network filtering."

IDW Subsystems

Pursuant to the IDW Concept of Operations, the IDW has two main subsystems, the IDW-Secret (IDW-S) and IDW-Special Projects Team (IDW-SPT). It also has a development platform (IDW-D) and a subsystem for maintenance and testing (IDW-I).

IDW-Secret

The IDW-S system is the main subsystem of the IDW, which is authorized to process classified national security data up to, and including, information designated Secret. However, IDW-S is not authorized to process any Top Secret data nor any Sensitive Compartmented Information (SCI). The addition of IDW-TS/SCI, a Top Secret/Sensitive Compartmented Information level data mart, appears to remain in the planning stages. The IDW-S system is the successor of the Secure Counter-Terrorism/Collaboration Operational Prototype Environment (SCOPE).

IDW-Special Projects Team

According to an Inspector General report, "[i]n November 2003, the Counterterrorism Division, along with the Terrorist Financing Operations Section (TFOS), in the FBI began a special project to augment the existing IDW system with new capabilities for use by FBI and non-FBI agents on the JTTFs. The FBI Office of Intelligence is the executive sponsor of the IDW. The IDW Special Projects Team was originally initiated for the 2004 Threat Task Force." By May 2006, the "Special Project Team provided services to 5 task forces or operations."

As described by the FBI:

Special Projects Team (SPT) Subsystem

The Special Projects Team (SPT) Subsystem allows for the rapid import of new specialized data sources. These data sources are not made available to the general IDW users but instead are provided to a small group of users who have a demonstrated "need-to-know". The SPT System is similar in function to the IDW-S system. With the main difference is a different set of data sources. The SPT System allows its users to access not only the standard IDW Data Store but the specialized SPT Data Store.

IDW Features

In 2004, Willie Hulon, then the Deputy Assistant Director for the Counterterrorism Division, said that the FBI was "introducing advanced analytical tools to help us make the most of the data stored in the IDW. These tools allow FBI agents and analysts to look across multiple cases and multiple data sources to identify relationships and other pieces of information that were not readily available using older FBI systems. These tools 1) make database searches simple and effective; 2) give analysts new visualization, geo-mapping, link-chart capabilities and reporting capabilities; and 3) allow analysts to request automatic updates to their query results whenever new, relevant data is downloaded into the database."

Deputy Assistant Director Hulon also asserted that "[w]hen the IDW is complete, Agents, JTTF [Joint Terrorism Task Force] members and analysts, using new analytical tools, will be able to search rapidly for pictures of known terrorists and match or compare the pictures with other individuals in minutes rather than days. They will be able to extract subjects' addresses, phone numbers, and other data in seconds, rather than searching for it manually. They will have the ability to identify relationships across cases. They will be able to search up to 100 million pages of international terrorism-related documents in seconds." (Since then, the number of records has grown nearly ten-fold.)

At the FBI National Security Branch's "request, the FBI's Office of the Chief Technology Officer (OCTO) has developed an 'alert capability' that allows users of IDW to create up to 10 queries of the system and be automatically notified when a new document is uploaded to the database that meets their search criteria."

"Users can search for terms within a defined parameter of one another. For example, the search: 'flight school' NEAR/10 'lessons' would return all documents where the phrase 'flight school' occurred within 10 words of the word 'lessons.' Users can also specify whether they want exact searches, or if they want the search tool to include other synonyms and spelling variants for words and names."

"IDW includes the ability to search across spelling variants for common words, synonyms and meaning variants for words, as well as common misspellings of words. If a user misspells a common word, IDW will run the search as specified, but will prompt the user to ask if they intended to run the search with the correct spelling."

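The proximity operator in the quoted example ('flight school' NEAR/10 'lessons') can be illustrated with a short sketch. The released records do not describe how the IDW's search tools are implemented, so the Python below is only an illustration: the function names `tokenize` and `near`, the simple word tokenizer, and the way distance is counted (from the nearest edge of the phrase) are all assumptions.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def near(text, phrase, word, distance):
    """Return True if `phrase` occurs within `distance` words of `word`.

    Distance is measured in token positions from the nearest edge of the
    phrase to the word. This convention is an assumption for illustration.
    """
    tokens = tokenize(text)
    phrase_tokens = tokenize(phrase)
    target = word.lower()
    n = len(phrase_tokens)
    if n == 0:
        return False

    # Start index of every occurrence of the phrase.
    phrase_starts = [
        i for i in range(len(tokens) - n + 1)
        if tokens[i:i + n] == phrase_tokens
    ]
    # Index of every occurrence of the single word.
    word_positions = [i for i, tok in enumerate(tokens) if tok == target]

    for start in phrase_starts:
        end = start + n - 1
        for pos in word_positions:
            if pos > end:
                gap = pos - end
            elif pos < start:
                gap = start - pos
            else:
                gap = 0  # the word falls inside the phrase itself
            if gap <= distance:
                return True
    return False
```

Under these assumptions, `near(document_text, "flight school", "lessons", 10)` plays the role of the NEAR/10 query in the FBI's example. Synonym and spelling-variant expansion, which the FBI also describes, is not modeled here.
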
In its 2004 report to the 9-11 Commission, the FBI used an example (shown on the right) to illustrate the planned use of the IDW for data mining and link analysis, showing i2's Analyst's Notebook. i2 described the program as "the world's most powerful visual investigative analysis software," which is able to analyze "vast amounts of raw, multi-format data gathered from a wide variety of sources."

By 2006, the IDW was processing between 40,000 and 60,000 "interactive transactions" in any given week, along with between 50 and 150 batch jobs. An example of a batch process is where "the complete set of Suspicious Activity Reports is compared to the complete set of FBI terrorism files to identify individuals in common between them."
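A batch comparison of this kind amounts to finding the individuals who appear in both collections. The sketch below is a hypothetical illustration, not the FBI's method: the record layout (dicts with "id" and "names" keys), the function names, and the decision to match on normalized names alone are all assumptions.

```python
def normalize(name):
    """Collapse case, punctuation and extra whitespace so that trivially
    different spellings of a name compare equal."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


def index_by_name(records):
    """Map each normalized name to the set of record ids that mention it."""
    by_name = {}
    for record in records:
        for name in record["names"]:
            by_name.setdefault(normalize(name), set()).add(record["id"])
    return by_name


def individuals_in_common(reports, case_files):
    """Return {name: (report ids, case file ids)} for every normalized
    name that appears in both collections."""
    report_index = index_by_name(reports)
    case_index = index_by_name(case_files)
    shared = report_index.keys() & case_index.keys()
    return {name: (report_index[name], case_index[name]) for name in sorted(shared)}
```

Matching on names alone will conflate different people who share a name, so any real comparison would need additional identifiers. The sketch shows only the set-intersection step.
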

Datasets in the IDW

According to various FBI documents, the following 38 data sources were included in the IDW on or before August 2004. Of these, IDW-S included at least the first six items.

In August 2004, the FBI was considering adding several more datasets: the "FBI's Telephone Application, DHS data sources such as US-VISIT and SEVIS, Department of State data sources such as the Consular Consolidated Database (CCD), and Treasury Enforcement Communication System (TECS)." A later document shows that at least "most" of the Telephone Application is now in the IDW.

The Telephone Application (TA) "provides a central repository for telephone data obtained from investigations." "The TA is an investigative tool that also serves as the central repository for all telephone data collected during the course of FBI investigations. Included are pen register data, toll records, trap/trace, tape-edits, dialed digits, airnet (pager intercepts), cellular activity, push-to-talk, and corresponding subscriber information." Records obtained through National Security Letters are placed in the Telephone Application, as well as the IDW by way of the ACS system.

"The United States Visitor and Immigrant Status Indicator Technology (US-VISIT) Program is an integrated, automated biometric entry-exit system that records the arrival and departure of aliens; conducts certain terrorist, criminal, and immigration violation checks on aliens; and compares biometric identifiers to those collected on previous encounters to verify identity."

The Consular Consolidated Database (CCD) is a set of databases that includes "current and archived data from all of the Department of State's Consular Affairs post databases around the world. This includes the data from the Automated Biometric Identification System (ABIS), ARCS, Automated Cash Register System (ACS), Consular Lookout and Support System (CLASS), Consular Shared Tables (CST), DataShare, Diversity Visa Information System (DVIS), Immigrant Visa Information System (IVIS), Immigrant Visa Overseas (IVO), Non-Immigrant Visa (NIV), Visa Opinion Information Service (VOIS), and Waiver Review System (WRS) applications. The CCD also provides access to passport data in the Travel Document Information System (TDIS), Passport Lookout and Tracking System (PLOTS), and Passport Information Electronic Records System (PIERS). In addition to Consular Affairs data, other data from external agencies is integrated into the CCD, such as the 'Master Death Database' from the Social Security Administration."

The Student and Exchange Visitor Information System (SEVIS) "maintains information on nonimmigrant students and exchange visitors (F, M and J Visas) and their dependents, and also on their associated schools and sponsors."

The Treasury Enforcement Communication System (TECS) "is a computerized information system designed to identify individuals and businesses suspected of, or involved in violation of federal law. The TECS is also a communications system permitting message transmittal between Treasury law enforcement offices and other Federal, national, state, and local law enforcement agencies."

Unidentified Additional Data Sources Added to IDW

The FBI set up an Information Sharing Policy Group (ISPG), chaired by the Executive Assistant Directors of Administration and Intelligence, to review requests to ingest additional datasets into the IDW, in response to Congressional "privacy concerns that may arise from FBI engaging in 'data mining.'"

In February 2005, the Counterterrorism Division asked for 8 more data sources. While the names of the data sources are redacted, items 1, 2 and 4 came from the Department of Homeland Security, and items 6, 7 and 8 were additional IntelPlus file rooms. The February 2005 email chain also refers to "2 data sets approved at the meeting yesterday" and "2 data set under consideration." In context, it appears that one of the two approved datasets was IntelPlus, which contained three file rooms. The FBI would "get all of the DHS data from the FTTTF [Foreign Terrorist Tracking Task Force] including the [Redacted]."

In March 2005, the Information Sharing Policy Group approved seven more unidentified datasets for the Special Projects Team version of the IDW. In May 2005, ISPG approved an additional seven unidentified datasets for the IDW-SPT. The IDW Special Projects Team "ingested and published a new telephone-type data source" on two dates: February 18, 2005, and March 18, 2005.

In August 2005, the "[Redacted] Reports Collection" was moved from the limited access IDW-SPT to the more widely available IDW-S. "This [Redacted] dataset contains copies of reports regarding [Redacted]."

Data Retention

As of March 2005:

There is no current Disposition Schedule for IDW. We have looked at the system and it is on our list of systems to be scheduled. With no Disposition Schedule, there is really no limitation on importing data, at least not from a records management standpoint. But, they will not be able to delete or destroy any of that information until a Disposition Schedule is approved.

Nevertheless, the IDW has a process to delete files: "it can occur that data for which IDW-S is not authorized is ingested into IDW-S. When such data is discovered on IDW-S it is necessary to delete this data and to update the Document Tracking Database with the appropriate 'DEL' status for the file." The IDW also has a "secure delete" function.

III. Privacy Impact Assessment

The E-Government Act of 2002, Section 208, establishes a requirement for agencies to conduct privacy impact assessments (PIAs) for electronic information systems and collections.

A May 12, 2005 email from an unidentified employee in the FBI's Office of the General Counsel to FBI General Counsel Valerie Caproni notes that the author was "nervous about mentioning PIA in context of national security systems." The author admitted that "It is true the FBI currently requires PIAs for NS [national security] systems as well as non-NS systems." However, the author thought that the policy might change. Accordingly the author "recommend[ed] against raising congressional consciousness levels and expectations re NS PIAs." Caproni's response is short: "ok."

This email was in reply to a May 11 email from Caproni expressing her desire to "slide something in about PIA" to give a "sense that we really do worry about the privacy interests of uninvolved people whose data we slurp up."

However, this strategy failed. Congressional consciousness levels were raised by an August 30, 2006 Washington Post article on the IDW, in which EFF Senior Counsel David Sobel raised the issue of the IDW's lack of a formally published PIA.

The day the Post article ran, several FBI emails discussed the privacy concerns raised by the IDW. One Office of the General Counsel employee (only identified as Bill) explained the FBI's desire to play down the concerns: "I'm with [Redacted] in view that if everyone ([Redacted]) starts running around with their hair on fire on this, they will just be pouring gas on something that quite possibly would just fade away if we just shrug it off."

After these discussions, the FBI released the following response to the article:

Federal Bureau of Investigation

Response to Investigative Data Warehouse (IDW) Press Article for Senate Appropriations Committee

September 7, 2006

There are two concerns being expressed about IDW in the article. One deals with whether the FBI has complied with the Privacy Act's requirement to publish a "systems notice" in the Federal Register and the other is whether the FBI has complied with the privacy impact analysis requirements of the "E-Government Act."

The answer to the first question is "yes." We consider IDW to be part of the FBI's Central Record System, an "umbrella" system that is comprised of all of the FBI's investigative files. While it is true that "IDW" isn't specifically mentioned in the CRS Privacy Act System Notice, we don't believe that is necessary. The system notice does state: "In recent years ... the FBI has been confronted with increasingly complicated cases, which require more intricate information processing capabilities. Since these complicated investigations frequently involve massive volumes of evidence and other investigative information, the FBI uses its computers, when necessary to collate, analyze, and retrieve investigative information in the most accurate and expeditious manner possible." The system notice describes in reasonable detail what information we obtain, what routine uses we make of it, the authorities for maintaining the system and so forth. This notice is published in the Federal Register and is publicly available. In our view, we are compliant with both the letter and spirit of the Privacy Act in this regard.

The answer to the second question is also "yes." In fact, since IDW has been categorized as a "national security system," the E-Government Act does not require it to undergo a privacy impact analysis (PIA) at all. Even so, FBI and DOJ policy requires a PIA to be conducted. For IDW, the FBI has done several PIA's. We did one for the original system and did others as significant datasets were added to IDW. None of these systems were published since the law does not require them to be conducted in the first place.

The point is that we have done far more to analyze the privacy implications of IDW than the law requires. Yes, the analyses have not been conducted in the public domain but Congress weighed the costs and benefits of conducting such an analysis in public and chose to exclude national security systems from that requirement when it passed the E-Government act.

For purposes of the E-Government Act, a National Security System is "an information system operated by the federal government, the function, operation or use of which involves: (a) intelligence activities, (b) cryptologic activities related to national security, (c) command and control of military forces, (d) equipment that is an integral part of a weapon or weapons systems, or (e) systems critical to the direct fulfillment of military or intelligence missions."

A heavily redacted March 2005 FBI Electronic Communication enclosed a completely redacted Privacy Impact Assessment about the IDW. In August 2007, the Office of the Inspector General conducted an audit of "all major Department [of Justice] information technology (IT) systems and planned initiatives." The OIG noted that it "did not obtain PIAs or explanations for the FBI's IDW."

IV. The Future of the IDW is Data Mining

When the FBI explained the IDW to Congress in 2004, it noted that when FBI Director Mueller testified about the IDW in 2003, he "used the term 'data mining' to be synonymous with 'advanced analysis.' The FBI does not conduct 'data mining' in accordance with the GAO definition, which means mining through large volumes of data with the intention of automatically predicting future activities."

Nevertheless, in March 2003, the FBI issued its Fiscal Year 2004 (Oct. 2003 – Sep. 2004) budget, in which the Bureau had requested a new "Communications Application":

The FBI requests $4,600,000 to obtain a software application that is capable of conducting sophisticated link analysis on extremely high volumes of telephone toll call data and other relational data. This software would enable the FBI to leverage modern technology to expeditiously conduct analyses of large collections of relational data.

By 2005, the FBI was still trying to minimize Congressional concerns over data mining. The FBI was concerned that the "distinction between a data mart and a data mining vehicle will be lost on those who just think we are looking into citizens' lives too much." On March 1, 2005, an unidentified Office of Congressional Affairs (OCA) employee noted in an email (emphasis original):

We had agreed on the following sentence as a way of avoiding some of the intricacies of data mining policy: "Where permitted by law, and appropriate to an authorized work activity, information gleaned from searching non-FBI databases may be included in FBI systems and, once there, may be accessed by employees conducting searches in furtherance of other authorized activities." Unfortunately, I couldn't get that to fly, since that was the crux of the Senator's inquiry.

In October 2005, FBI emails discussed the response to the August 2005 GAO report on data mining by the Foreign Terrorist Tracking Task Force (FTTTF). "In 2001, Homeland Security Presidential Directive-2 established the Foreign Terrorist Tracking Task Force (FTTTF) to provide actionable intelligence to law enforcement to assist in the location and detention and ultimate removal of terrorists and their supporters from the US." The FTTTF "operates two information systems—one unclassified and one classified—that form the basis of its data mining activities," using tools such as the i2 Analyst's Notebook application, Query Tracking and Initiation Program (QTIP), and Wareman. In addition to the FBI, "the participants in the FTTTF include the Department of Defense, the Department of Homeland Security's Bureaus of Immigration and Customs Enforcement and the Customs and Border Protection, the State Department, the Social Security Administration, the Office of Personnel Management, the Department of Energy, and the Central Intelligence Agency."

In these 2005 emails, an OCA employee suggested a limitation on the scope of the FBI's response to Congress: "Maybe we say that 'FTTTF refers to an operational task force. We understand the question to ask about data mining initiatives of FTTTF.'"

Around the same time, an unidentified Office of the General Counsel employee wrote:

Finally – I'm concerned about the statement that we only have 3 data mining projects in the FBI. In the cover letter, you make the point that our definition of data mining only includes large sets of data but I still think the definition is very broad and could include other systems. For example, what about STAS systems? I am not familiar with those systems -(but we are starting work on a PIA so I will be in the near future) but my sense is that they collect and sift through a lot of data. What about EDMS and some of the other systems that collect tech cut data from FISAs and allow analysts to search through the data for relevant info? I would think that could be considered data mining under your definition - but I'll defer to the CIO's office on this issue. We just need to make sure we can distinguish these other projects.

A few years later, however, the FBI became less circumspect about marrying the data sets of the IDW with the data mining capabilities of the FTTTF. For the FBI's FY2007 War Supplemental budget request, the FBI requested $10 million to consolidate the IDW and the FTTTF "and to develop and deploy a robust infrastructure capable of receiving, processing, and managing the quality of substantially increased amounts of additional data."

In its FY2008 "budget justification," the FBI explained that "[t]he Investigative Data Warehouse (IDW), combined with FTTTF's existing applications and business processes, will form the backbone of the NSB's data exploitation system." The FBI also requested "$11,969,000 ... for the National Security Branch Analysis Center (NSAC)." It explains:

Once operational, the NSAC will be tasked to satisfy unmet analytical and technical needs of the NSB, particularly in the areas of bulk data analysis, pattern analysis, and trend analysis. … The NSAC will provide subject-based "link analysis" through the utilization of the FBI's collection datasets, combined with public records on predicated subjects. "Link analysis" uses datasets to find links between subjects, suspects, and addresses or other pieces of relevant information, and other persons, places, and things. This technique is currently being used on a limited basis by the FBI; the NSAC will provide improved processes and greater access to this technique to all NSB components. The NSAC will also pursue "pattern analysis" as part of its service to the NSB. "Pattern analysis" queries take a predictive model or pattern of behavior and search for that pattern in datasets. The FBI's efforts to define predictive models and patterns of behavior will improve efforts to identify "sleeper cells."
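The "link analysis" described in the budget justification can be pictured as a search over a graph whose nodes are people, phone numbers, addresses and other items, and whose edges are records connecting them. The sketch below is a generic breadth-first search, offered only as an illustration. The records do not disclose the NSAC's actual algorithms, and the function names and sample entities are hypothetical.

```python
from collections import deque


def build_graph(links):
    """Build an undirected adjacency map from (entity, entity) pairs."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_link(graph, start, goal):
    """Breadth-first search for the shortest chain of links between two
    entities. Returns the chain as a list, or None if none exists."""
    if start not in graph or goal not in graph:
        return None
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in sorted(graph[node]):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None
```

With hypothetical links such as `("Person A", "555-0100")` and `("555-0100", "Person B")`, `shortest_link` returns the chain connecting Person A to Person B through the shared phone number. Every additional hop widens the set of people swept in, which is why the technique raises the false-positive and privacy concerns discussed in this report.
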

"The National Security Analysis Center (NSAC) would bring together nearly 1.5 billion records created or collected by the FBI and other government agencies, a figure the FBI expects to quadruple in coming years." In June 2007, after seeing this budget request and noting that "[d]ocuments predict the NSAC will include six billion records by FY2012," the House Science and Technology Committee asked the Government Accountability Office to investigate the National Security Branch Analysis Center.

In 2008, the non-partisan National Research Council issued a 352-page study concluding that data mining is not an effective tool in the fight against terrorism. The report noted the poor quality of the data, the inevitability of false positives, the preliminary nature of the scientific evidence and individual privacy concerns in concluding that "automated identification of terrorists through data mining or any other mechanism is neither feasible as an objective nor desirable as a goal of technology development efforts."

Acronyms