The CIA offers an electronic search engine that lets you mine about 11 million agency documents that have been declassified over the years. It's called CREST, short for CIA Records Search Tool. But this represents only a portion of the CIA's declassified materials, and if you want unfettered access to the search engine, you'll have to physically visit the National Archives at College Park, Maryland.

Using the Freedom of Information Act, historians and researchers have urged the CIA to provide them with their own copy of the CREST electronic database, so that they can seek greater insight into U.S. history and even build up additional checks and balances against the government's approach to official secrecy. But the agency won't do it. "Basically, the CIA is saying that the database of declassified documents is itself classified," explains Steve Aftergood, a senior research analyst with the Federation of American Scientists, who oversees the federation's government secrecy project.

It's an irony that represents a much larger problem in the world of declassified government documents. According to Aftergood – a researcher some have called "the Yoda of Official Secrecy" – most government agencies haven't even gone as far as the CIA in providing online access to declassified documents, and as it stands, there's no good way of electronically searching declassified documents from across disparate agencies.

>'We may never completely understand official secrecy, but the best solution may be to just throw massive amounts of data at it.' Matthew Connelly

"The state of the declassified archives is really stuck in the middle of the 20th Century," says Aftergood. He calls it a "fairly dismal picture," but he also says there's an enormous opportunity to improve the way we research declassified materials – and improve it very quickly – through the use of modern technology.

That's the aim of a new project launched by a team of historians, mathematicians, and computer scientists at Columbia University in New York City. Led by Matthew Connelly – a Columbia professor trained in diplomatic history – the project is known as The Declassification Engine, and it seeks to provide a single online database for declassified documents from across the federal government, including the CIA, the State Department, and potentially any other agency.

The project is still in the early stages, but the team has already assembled a database of documents that stretches back to the 1940s, and it has begun building new tools for analyzing these materials. In aggregating all documents into a single database, the researchers hope to not only provide quicker access to declassified materials, but to glean far more information from these documents than we otherwise could.

In the parlance of the day, the project is tackling these documents with the help of Big Data. If you put enough of this declassified information in a single place, Connelly believes, you can begin to predict what government information is still being withheld. Many documents are declassified only with certain text redacted, for instance, and Connelly aims to develop tools that predict what text has been removed. "We may never completely understand official secrecy," Connelly says, "but the best solution may be to just throw massive amounts of data at it."

The trouble, as Connelly freely acknowledges, is that if you build a system that can reveal redacted text or predict what data is still classified, you may cross certain ethical and political boundaries. "You can imagine where the project would reach a point where it became threatening to declassifiers and make them more reticent to use redactions, as opposed to not releasing the documents in the first place," says David Pozen, a Columbia law professor who specializes in government secrecy, has worked on secrecy issues for the State Department, and has closely followed the creation of The Declassification Engine. "That's the potential perverse consequence of this work."

Like the CIA, other government agencies are already working to improve electronic access to declassified documents. The State Department offers an "online reading room" for declassified materials, and the National Archives now runs a National Declassification Center that seeks to centralize the government's declassification efforts (the National Archives and the Declassification Center were not immediately available to discuss this story). But according to many outside researchers, we're still a long way from the sort of consolidation they're looking for.

"Scholars have never been satisfied," says Richard Immerman, a professor of history at Temple University who has been working with declassified documents since the 1970s. "The problems hanging over classification have been severe, pretty much since the beginning, and the process has really not gotten much better. The problem is under-resourced and under-staffed, and those doing the work are under-trained."

In many cases, documents are declassified only because individuals request them under the Freedom of Information Act, and this often means they're spread to the four winds. "There are a lot of declassified documents out there. Some of them are in historians' basements. Some are in specific libraries. Some are in digital archives. And they're in different formats. No one has systematically collected them into a searchable, usable, user-friendly database," says Columbia law professor David Pozen.

The Declassification Engine seeks to remedy this, but that's only the first step. Columbia's Matthew Connelly first dreamed up the idea when he realized that although more and more government documents are now created in electronic format, a dwindling percentage are declassified in electronic format. The rise of digital records, he told himself, should provide more opportunities for researchers, not fewer.

"When I began to notice that more and more of this stuff was born digital," he says, "I began to think you could start to use computational methods to try to figure out what was being withheld."

>'This is entirely premised on there being the same document released at different times or by different agencies, with certain text being visible in one version but not the other.' David Pozen

That's why he has enlisted the help of Columbia's David Madigan, the chair of the university's statistics department, and Michael Collins, a computer science professor who specializes in natural language processing and machine learning. Working alongside a fourth researcher – an MIT computer science PhD candidate named Alexander Rush – the team has already built tools that can analyze document redactions in new ways.

What their database of declassified materials has shown is that many documents are declassified at multiple times, often by multiple agencies, and that the redactions will differ depending on who is doing the declassifying and when. At the very least, says David Pozen, this suggests "a certain lack of meticulousness" on the part of government declassifiers. But it also provides a means of predicting redacted text in other documents. If you know what's been redacted in some cases, you can predict what has been redacted in others.

"This is entirely premised on there being the same document released at different times or by different agencies, with certain text being visible in one version but not the other," Pozen explains. "At the very least, it's unproblematic to look at how documents diverge and try to learn some lessons."
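The comparison Pozen describes can be sketched in a few lines of code. This is a hypothetical illustration, not the project's actual tooling: it assumes two releases of the same document that align token-for-token, with a literal "[REDACTED]" marker standing in for blacked-out text.

```python
REDACTED = "[REDACTED]"

def merge_releases(version_a, version_b):
    """Align two releases of the same document and recover any token
    that is redacted in one version but visible in the other."""
    tokens_a = version_a.split()
    tokens_b = version_b.split()
    if len(tokens_a) != len(tokens_b):
        raise ValueError("sketch assumes releases align token-for-token")
    merged = []
    for a, b in zip(tokens_a, tokens_b):
        if a == REDACTED and b != REDACTED:
            merged.append(b)   # visible only in the second release
        else:
            merged.append(a)   # visible in the first, or redacted in both
    return " ".join(merged)

# Invented example: each agency redacted a different token.
version_a = "The ambassador met [REDACTED] in Vienna on Tuesday"
version_b = "The ambassador met Gray in Vienna on [REDACTED]"
recovered = merge_releases(version_a, version_b)
```

Real declassified releases would of course not align so neatly, so the team's actual methods would need fuzzier matching, but the core idea is the same: divergent redactions leak information when the versions are laid side by side.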

Connelly says the team is already working to determine the probability that a certain redaction is, say, a place name or an individual. And they can point to certain terms and names that increase the likelihood that information in a document will be redacted. But before going much further, he and others on the project aim to explore the ethical and political ramifications of such work. To that end, they held a conference in New York early this month, bringing together various historians, computer scientists, and other academics to discuss the matter, including Steve Aftergood and David Pozen.
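A toy version of the probability estimate Connelly describes could look like the following. The training pairs are invented, and the real project presumably uses far richer context than the single preceding word assumed here; this only shows the shape of the idea: learn from redactions whose contents were later recovered, then score new ones.

```python
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (preceding_word, redaction_type) pairs
    drawn from redactions whose contents were later recovered."""
    counts = defaultdict(Counter)
    for prev_word, kind in examples:
        counts[prev_word][kind] += 1
    return counts

def predict(counts, prev_word):
    """Return the most likely redaction type after prev_word,
    along with its estimated probability."""
    c = counts[prev_word]
    total = sum(c.values())
    if total == 0:
        return None, 0.0
    kind, n = c.most_common(1)[0]
    return kind, n / total

# Invented training data: what followed "in" or "with" in recovered text.
examples = [("in", "place"), ("in", "place"), ("with", "person"),
            ("in", "person"), ("with", "person")]
model = train(examples)
kind, prob = predict(model, "in")  # most redactions after "in" were places
```

A production system would use natural-language-processing models of the kind Collins works on rather than raw word counts, but the output is the same in spirit: a probability that a given black bar hides a place name, a person, or something else.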

On the one hand, researchers worry that the government is actively holding back the progress of digital researchers. Aftergood cites the CIA's stance on the CREST database as an example. The agency has released 11 million digital declassified documents, but it won't release the database providing access to those documents. "The CIA's stance seems to confirm one of the premises of The Declassification Engine project – that the collection of declassified documents may have emergent properties, that the whole is somehow greater than the parts," says Aftergood.

But the aim isn't to antagonize. The aim is to improve life for historians and researchers. Those involved in the project aren't looking to cross those ethical and political lines. "We don't even want to start tightroping on them," says Temple University professor Immerman, another who has followed the progress of The Declassification Engine. "We want to make things better."