What is an internet buffer?

In essence, an internet buffer is a little like Sky+, but on an almost unimaginably large scale. GCHQ, assisted by the NSA, intercepts and collects a large fraction of the internet traffic flowing into and out of the UK. This is then filtered to get rid of uninteresting content, and what remains is stored for a period of time – three days for content and 30 days for metadata.

The result is that GCHQ and NSA analysts have a vast pool of material to look back on if they are not watching a particular person in real time – just as you can use TV catch-up services to watch a programme you hadn't heard about.

How is it done?

GCHQ appears to have intercepts placed on most of the fibre-optic communications cables in and out of the country. This seems to involve some degree of co-operation – voluntary or otherwise – from companies operating either the cables or the stations at which they come into the country.

These agreements, and the exact identities of the companies that have signed up, are regarded as extremely sensitive, and classified as top secret. Staff are instructed to be very careful about sharing information that could reveal which companies are "special source" providers, for fear of "high-level political fallout". In one document, the companies are described as "intercept partners".

How does it operate?

The system seems to operate by allowing GCHQ to survey internet traffic flowing through different cables at regular intervals, and then automatically detecting which are most interesting, and harvesting the information from those.

The documents suggest GCHQ was able to survey about 1,500 of the 1,600 or so high-capacity cables in and out of the UK at any one time, and aspired to harvest information from 400 or so at once – a quarter of all traffic.

As of last year, the agency had gone halfway, attaching probes to 200 fibre-optic cables, each with a capacity of 10 gigabits per second. In theory, that gave GCHQ access to a flow of 21.6 petabytes in a day, equivalent to 192 times the British Library's entire book collection.
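The daily figure can be checked with some back-of-the-envelope arithmetic. The sketch below is only an illustration: it takes the cable count and per-cable capacity quoted above, assumes every cable runs at full capacity around the clock, and uses decimal petabytes (10^15 bytes). The function and constant names are made up for this example.

```python
# Rough check of the theoretical daily flow quoted in the article.
CABLES_PROBED = 200               # figure from the article
GBITS_PER_SECOND_PER_CABLE = 10   # figure from the article
SECONDS_PER_DAY = 24 * 60 * 60


def daily_volume_petabytes(cables, gbits_per_second):
    """Return the theoretical daily flow in decimal petabytes (10**15 bytes).

    Assumes every cable is saturated for the full 24 hours.
    """
    bits_per_day = cables * gbits_per_second * 10**9 * SECONDS_PER_DAY
    bytes_per_day = bits_per_day / 8
    return bytes_per_day / 10**15


if __name__ == "__main__":
    volume = daily_volume_petabytes(CABLES_PROBED, GBITS_PER_SECOND_PER_CABLE)
    print(f"{volume:.1f} petabytes per day")
```

Under those assumptions, 200 cables at 10 gigabits per second come to 2 terabits per second, or 250 gigabytes per second, which over 86,400 seconds gives the 21.6 petabytes cited.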

GCHQ documents say efforts are made to automatically filter out UK-to-UK communications, but it is unclear how this would be defined, or whether it would even be possible in many cases.

For example, an email sent using Gmail or Yahoo from one UK citizen to another would be very likely to travel through servers outside the UK. Distinguishing these from communications between people in the UK and outside would be a difficult task.

What does this let GCHQ do?

GCHQ and NSA analysts, who share direct access to the system, are repeatedly told they need a justification to look for information on targets in the system and can't simply go on fishing trips – under the Human Rights Act, searches must be necessary and proportionate. However, when they do search the data, they have lots of specialist tools that let them obtain a huge amount of information from it: details of email addresses, IP addresses, who people communicate with, and what search terms they use.

What's the difference between content and metadata?

The simple analogy for content and metadata is that content is a letter, and metadata is the envelope. However, internet metadata can reveal much more than that: where you are, what you are searching for, who you are messaging and more.

One of the documents seen by the Guardian sets out how GCHQ defines metadata in detail, noting that "we lean on legal and policy interpretations that are not always intuitive". It notes that in an email, the "to", "from" and "cc" fields are metadata, but the subject line is content. The document also sets out how, in some circumstances, even passwords can be regarded as metadata.
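To make the email example concrete, here is a minimal sketch of how a message might be split along the lines the document describes. It is purely illustrative and is not a description of any GCHQ tool: the function name, the sample message and the decision to treat only "to", "from" and "cc" as metadata are assumptions drawn from the sentence above, and the parsing uses Python's standard email module.

```python
from email import message_from_string

# Assumption: only the three fields named in the article count as metadata.
METADATA_HEADERS = ("To", "From", "Cc")


def split_email(raw):
    """Split a raw email into (metadata, content) dictionaries.

    Hypothetical helper for illustration: address fields are metadata,
    while the subject line and body are treated as content.
    """
    msg = message_from_string(raw)
    metadata = {
        name: msg[name] for name in METADATA_HEADERS if msg[name] is not None
    }
    content = {"Subject": msg["Subject"], "Body": msg.get_payload()}
    return metadata, content


if __name__ == "__main__":
    # Made-up sample message using reserved example.com addresses.
    raw = (
        "From: alice@example.com\n"
        "To: bob@example.com\n"
        "Cc: carol@example.com\n"
        "Subject: Lunch on Friday\n"
        "\n"
        "Shall we meet at noon?\n"
    )
    metadata, content = split_email(raw)
    print("metadata:", metadata)
    print("content:", content)
```

Even in this toy case, the metadata alone shows who is writing to whom, without a word of the message being read.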

The distinction is a very important one to GCHQ with regard to the law, the document explains: "There are extremely stringent legal and policy constraints on what we can do with content, but we are much freer in how we can store metadata. Moreover, there is obviously a much higher volume of content than metadata.

"For these reasons, metadata feeds will usually be unselected – we pull everything we see; on the other hand, we generally only process content that we have a good reason to target."