A leaked "Data Mining Research Problem Book", marked "top secret strap 1", details some of the key techniques used by GCHQ to sift through the huge volumes of data it pulls continuously from the Internet.

Originally obtained by Edward Snowden, the 96-page e-book has been published by Boing Boing, along with a second, short document entitled "What's the worst that can happen?" Boing Boing describes the second document as "a kind of checklist for spies who are seeking permission to infect their adversaries' computers or networks with malicious software."

The data mining handbook was written by researchers from the Heilbronn Institute for Mathematical Research in Bristol, a partnership between GCHQ and the University of Bristol. According to Boing Boing, "Staff spend half their time working on public research, the other half is given over to secret projects for the government."

The handbook provides valuable insights into some of the details of GCHQ's data mining work, at least as it was in September 2011, when the document was written. At that time, some of the "bearers"—Internet links—were producing 10 gigabits per second. As the handbook notes: "A 10G bearer produces a phenomenal amount of data: far too much to store, or even to process in any complicated way." As a result, "To make things manageable, the first step is to discard the vast majority of the packets we see."
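
To give a sense of scale, here is a rough back-of-the-envelope calculation for a single 10G bearer. It is only a sketch: it assumes the link runs at full capacity around the clock and uses decimal units, neither of which comes from the handbook, and the function name is purely illustrative.

```python
# Rough daily volume for one 10 Gbit/s bearer.
# Assumptions (not from the handbook): 100% utilisation, decimal units.
LINK_RATE_BITS_PER_SECOND = 10 * 10**9


def bytes_per_day(bits_per_second: int) -> int:
    """Convert a sustained link rate into bytes carried in 24 hours."""
    seconds_per_day = 24 * 60 * 60
    return bits_per_second // 8 * seconds_per_day


daily_bytes = bytes_per_day(LINK_RATE_BITS_PER_SECOND)
daily_terabytes = daily_bytes / 10**12
print(f"{daily_terabytes:.0f} TB per day")  # prints "108 TB per day"
```

Under those assumptions, one bearer alone carries on the order of a hundred terabytes a day, which illustrates why discarding most packets is described as the first step.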

However, it is chiefly content that is discarded, not metadata. Here's why: "There are extremely stringent legal and policy constraints on what we can do with content, but we are much freer in how we can store and use metadata. Moreover, there is obviously a much higher volume of content than metadata. For these reasons, metadata feeds will usually be unselected—we pull everything we see; on the other hand, we generally only process content that we have a good reason to target." This confirms the central role played by metadata in GCHQ's surveillance, and that essentially all of it is already being collected, even before the Snooper's Charter puts that collection on a firmer legal footing.

One interesting comment concerns the false positives that data mining can throw up: "It is important to point out that tolerance for false positives is very low: if an analyst is presented with three leads to look at, one of which is probably of interest, then they might have the time to follow that up. If they get a list of three hundred, five of which are probably of interest, then that is not much use to them." This would seem to reinforce the argument for targeted, rather than mass, surveillance, although the handbook is obviously concerned with the latter.
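
The handbook's two scenarios can be compared with a little arithmetic. The snippet below is a minimal sketch using only the figures in the quote; the "precision" framing and the function name are our own illustration, not the handbook's terminology.

```python
def precision(leads_of_interest: int, leads_presented: int) -> float:
    """Fraction of the leads shown to an analyst that are worth following up."""
    return leads_of_interest / leads_presented


# Figures taken from the handbook quote above.
short_list = precision(1, 3)     # three leads, one of interest
long_list = precision(5, 300)    # three hundred leads, five of interest

# Leads the analyst would check for nothing in each case.
wasted_short = 3 - 1
wasted_long = 300 - 5

print(f"short list: {short_list:.1%} useful, {wasted_short} dead ends")
print(f"long list:  {long_list:.1%} useful, {wasted_long} dead ends")
```

The longer list contains more genuine leads in absolute terms, but the analyst has to wade through 295 dead ends to find them, compared with two for the shorter list.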

Also notable is a section on steganography, the technique of hiding a message within another file: "Some targets try to hide their communications through the use of steganography. One approach is to slightly alter the coefficients in a JPEG image to encode the hidden data whilst trying to minimise visual changes in the JPEG." The fact that data-mining techniques have been developed to spot steganographic communications implies that steganography is more than a theoretical option for targets.
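
For readers unfamiliar with the idea, the toy example below shows one way that "slightly altering the coefficients" could carry hidden bits. It is a sketch under loud assumptions: the handbook does not say which scheme targets use, the least-significant-bit method here is simply a common textbook approach, and the code works on a plain list of integers standing in for quantised JPEG coefficients rather than on a real image file. All names are hypothetical.

```python
# Toy least-significant-bit (LSB) embedding in a list of integers that
# stand in for quantised JPEG coefficients. Illustrative only: this is an
# assumed scheme, not one described in the handbook, and no real JPEG
# parsing is done here.

def usable(coefficient: int) -> bool:
    """Skip 0 and 1 so that embedding never changes which slots are usable."""
    return coefficient not in (0, 1)


def embed(coefficients: list[int], bits: list[int]) -> list[int]:
    """Overwrite the lowest bit of each usable coefficient with a message bit.

    Bits that do not fit in the available coefficients are silently dropped.
    """
    out = list(coefficients)
    remaining = iter(bits)
    for index, value in enumerate(out):
        if not usable(value):
            continue
        bit = next(remaining, None)
        if bit is None:
            break  # whole message embedded
        out[index] = (value & ~1) | bit
    return out


def extract(coefficients: list[int], count: int) -> list[int]:
    """Read back the lowest bit of the first `count` usable coefficients."""
    return [value & 1 for value in coefficients if usable(value)][:count]


cover = [12, -7, 0, 3, 1, -2, 5, 0, 4, -1]
message = [1, 0, 1, 1, 0]

stego = embed(cover, message)
recovered = extract(stego, len(message))
largest_change = max(abs(a - b) for a, b in zip(cover, stego))
print(stego, recovered, largest_change)
```

Each coefficient moves by at most one, which is why such changes are hard to see in the image itself and why statistical techniques are needed to detect them.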

Most of the handbook is devoted to reviewing the rather abstruse mathematics that can be applied to extract useful information from the huge stores of metadata that GCHQ gathers. Nonetheless, along the way, it provides insights into some of the key GCHQ programmes that would be almost impossible to obtain any other way.

When Ars asked GCHQ whether the leaked document was genuine, a spokesperson said: "We have no comment to make on the story," and simply offered its boilerplate reply to all such requests: "It is longstanding policy that we do not comment on intelligence matters. Furthermore, all of GCHQ's work is carried out in accordance with a strict legal and policy framework, which ensures that our activities are authorised, necessary and proportionate, and that there is rigorous oversight, including from the Secretary of State, the Interception and Intelligence Services Commissioners and the Parliamentary Intelligence and Security Committee. All our operational processes rigorously support this position. In addition, the UK's interception regime is entirely compatible with the European Convention on Human Rights."

That last claim is about to be tested in court. As Ars reported recently, the European Court of Human Rights (ECtHR) has said that blanket surveillance without sufficient safeguards is a violation of basic rights. A ruling by the ECtHR on whether GCHQ's activities are "entirely compatible with the European Convention on Human Rights" is expected soon.