Millions of files are being shared via peer-to-peer sites

The speed of current services, like BitTorrent, is limited by the number of people sharing a specific file, such as a movie or song.

Similarity-Enhanced Transfer (SET) works by spotting chunks of identical data in files that are an exact or near match to the one needed.

Using SET the researchers have seen speed increases of up to 500%.

Looks familiar

The findings are outlined in a paper, Exploiting Similarity for Multi-Source Downloads Using File Handprints, written by David Andersen of Carnegie Mellon University, Himabindu Pucha of Purdue University and Michael Kaminsky of Intel Research.

Current file-sharing systems, like BitTorrent, work best when there are multiple sources of a specific shared file.

When a file is shared it is divided into chunks and those parcels of data are distributed to groups of people who are searching for that file.

The more sources of those chunks there are, the more information there is that can be sent to a user, resulting in faster download speeds.
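The chunk-based approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of any real client: the four-byte chunk size, the function names and the two in-memory "peers" are all hypothetical, and SHA-1 is used simply as a convenient way to give each chunk an identifier.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; tiny for illustration, real systems use far larger chunks


def split_into_chunks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Cut the file into consecutive fixed-size chunks (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def chunk_id(chunk: bytes) -> str:
    """Identify a chunk by the hash of its contents."""
    return hashlib.sha1(chunk).hexdigest()


def download(wanted_ids: list[str], peers: list[dict[str, bytes]]) -> bytes:
    """Rebuild a file by taking each wanted chunk from any peer that holds it."""
    pieces = []
    for cid in wanted_ids:
        holder = next((p for p in peers if cid in p), None)
        if holder is None:
            raise LookupError(f"no peer holds chunk {cid}")
        pieces.append(holder[cid])
    return b"".join(pieces)


original = b"abcdefghij"
chunks = split_into_chunks(original)
wanted = [chunk_id(c) for c in chunks]

# Hypothetical peers, each holding only part of the file.
peer_a = {chunk_id(c): c for c in chunks[:2]}
peer_b = {chunk_id(c): c for c in chunks[1:]}

rebuilt = download(wanted, [peer_a, peer_b])
```

Neither peer holds the whole file, yet together they can supply every chunk. With only one of them available the download cannot complete, which is why more sources matter.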

Identical pieces

But these services often fail to deliver fast speeds because there are not enough users sharing the chunks of the specific file wanted.

"A big limitation of BitTorrent is that it only lets clients share data if they're downloading the exact same file," said Professor Andersen. "This means that the available client pool for any particular file is smaller than it needs to be."

SET is faster because, as the trio realised, many files being shared on the net contain identical pieces of data even though they appear to be different.

Professor Andersen said he was "shocked" by this discovery.

"In retrospect, of course, the causes of the similarity (different audio tracks, different artist names, etc.) make a lot of sense, but I honestly didn't expect it at all," he told the BBC News website.

The SET system assigns files a similarity ranking and can take chunks of data from files that are either identical or similar to the one being searched for.

The lower the similarity ranking that SET searches for, the more sources for that data are likely to be found.
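The trade-off can be illustrated with a small Python sketch. The article does not say how SET computes its ranking, so the measure used here (the fraction of the wanted file's chunks that a candidate file also contains) is an assumption, as are the function names and the made-up chunk labels.

```python
def similarity(target_chunks: set[str], candidate_chunks: set[str]) -> float:
    """Fraction of the target's chunks that the candidate also contains."""
    if not target_chunks:
        return 0.0
    return len(target_chunks & candidate_chunks) / len(target_chunks)


def usable_sources(target: set[str], candidates: dict[str, set[str]],
                   min_similarity: float) -> list[str]:
    """Names of candidate files whose similarity meets the threshold."""
    return sorted(name for name, chunks in candidates.items()
                  if similarity(target, chunks) >= min_similarity)


# Hypothetical chunk identifiers for the wanted file and four other files.
target = {"a", "b", "c", "d"}
candidates = {
    "exact": {"a", "b", "c", "d"},     # similarity 1.0
    "retagged": {"a", "b", "c", "x"},  # similarity 0.75
    "trailer": {"a", "y", "z"},        # similarity 0.25
    "unrelated": {"q"},                # similarity 0.0
}

strict = usable_sources(target, candidates, 1.0)    # only the exact copy
relaxed = usable_sources(target, candidates, 0.5)   # exact and retagged
loosest = usable_sources(target, candidates, 0.2)   # adds the trailer too
```

Each step down in the threshold admits another source, at the cost of having to search among files that hold less of the wanted data.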

"The extra overhead of locating these sources does not out-weigh the benefit of using them to help saturate the recipient's available bandwidth," wrote the scientists. "Indeed, exploiting similar sources can significantly improve download time," they added.

SET uses a technique called handprinting - which has been used to filter junk e-mails - to seek out other files that contain some of the data in the file that a file-sharing program has requested.
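One way to picture a handprint is as a small, deterministic sample of a file's chunk hashes. The sketch below assumes the sample is simply the k smallest hashes in sorted order; the article does not give SET's exact procedure, and the function names, the value of k and the example chunks are all hypothetical.

```python
import hashlib


def chunk_hashes(chunks: list[bytes]) -> set[str]:
    """Hash every chunk of a file."""
    return {hashlib.sha1(c).hexdigest() for c in chunks}


def handprint(chunks: list[bytes], k: int = 2) -> set[str]:
    """Assumed handprint: the k smallest chunk hashes of the file."""
    return set(sorted(chunk_hashes(chunks))[:k])


def may_share_data(file_a: list[bytes], file_b: list[bytes], k: int = 2) -> bool:
    """Two files are flagged as similar if their handprints overlap."""
    return bool(handprint(file_a, k) & handprint(file_b, k))


# Two hypothetical songs that differ only in their first chunk (the label).
song = [b"tag: Madonna", b"verse", b"chorus", b"bridge", b"outro"]
mislabelled = [b"tag: Madona", b"verse", b"chorus", b"bridge", b"outro"]
other = [b"news", b"weather", b"sport"]

similar = may_share_data(song, mislabelled)  # handprints overlap
different = may_share_data(song, other)      # no chunks in common
```

Because every file is sampled in the same sorted order, files with many chunks in common tend to pick the same hashes, so comparing a handful of values per file is enough to flag likely sources without comparing the files in full.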

For example, a search for a particular Madonna song may result in a wide range of titles because the tags, or labels, have been filled in differently, or incorrectly.

SET knows that the music data of a number of differently-labelled Madonna songs is identical and that chunks of data from those files can be shared between users.

In tests, SET improved the transfer time of an MP3 music file by 71%. A 55MB movie trailer downloaded 30% faster when the researchers' techniques were used to draw data from other movie trailers that were 47% similar.

Prof Andersen said that SET could help most with less popular files.

"It gives clients a larger pool of other clients to draw from," he said. "This probably won't matter much for super-popular data, where there is already a huge set of people downloading it, but our experiments suggest that in the other cases, SET can help a lot."

Retro-fitting BitTorrent to use SET could be quite straightforward, said Prof Andersen, though he added that some of its techniques would need much more work to make it very good at finding files that have some, but not all, of the data that a user wants.

He said he hoped developers would take the ideas and build on them.

"This is a technique that I would like people to steal," he said, adding that he hopes to use SET in a service for sharing software or academic papers.

"It would make P2P transfers faster and more efficient," he added, "and developers should just take the idea and use it in their own systems."