In popular perception, BitTorrent is a decentralized protocol; after all, all that data is coming from other peers and not from a central server, right? But because searching for particular files on BitTorrent networks can be a dodgy proposition, most BitTorrent users rely on torrent indexes like those provided by, ahem, The Pirate Bay, giving the system a central choke point. Shut down the torrent aggregators and files become much more difficult to find, so it's no surprise that content owners have recently targeted aggregators like Demonoid, OiNK, and the aforementioned The Pirate Bay. Now, a new project out of Cornell hopes to provide good quality, approximate keyword searching directly through BitTorrent networks—a truly decentralized system that doesn't rely on aggregators.

The state of P2P search

Cornell's "Cubit" project is the brainchild of graduate student Bernard Wong, his advisor Emin Gun Sirer, and Microsoft Research's Aleksandrs Slivkins. The goal of the project, in the words of its authors, is to provide "an efficient, accurate and robust method to handle imprecise string search in filesharing applications." Wong tells me that the motivation is misspellings, both in searches and in filenames, and he points to stats showing that a full 20 percent of Google searches for Britney Spears spell the singer's name incorrectly.



Bernard Wong

P2P applications can perform searches, but most aren't very good at it. Distributed hash tables (DHTs) are one common approach, but because of the nature of hashes, they are generally good only at finding exact matches. Building an approximate DHT search system by iterating through all possible spelling variations of a query is, in the understated terms of the Cubit team, a "highly inefficient solution." So most users just visit aggregators, but as Wong points out, such sites are getting raided or sent takedown notices, and they provide a centralized point of failure. While the team doesn't advocate copyright infringement, they are concerned with building robust, truly decentralized P2P networks, and they see approximate keyword search as crucial to that effort.
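To get a feel for why brute-force enumeration fails, consider how many strings sit within edit distance 1 of a single query over a 27-character alphabet (lowercase letters plus the space). The short Python sketch below counts them; the function name and alphabet are my own illustration, not anything from Cubit:

```python
from string import ascii_lowercase

ALPHABET = ascii_lowercase + " "  # illustrative alphabet: a-z plus space

def edit1(word):
    """All distinct strings exactly one deletion, substitution,
    or insertion away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    subs = {a + c + b[1:] for a, b in splits if b for c in ALPHABET if c != b[0]}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    return deletes | subs | inserts

print(len(edit1("britney spears")))  # 769
```

That's 769 candidate DHT lookups to cover a single typo in one 14-character query, and the count roughly squares for two typos, which is why exact-match lookups can't practically absorb fuzzy search by enumeration.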

Yes, it's a "very cool research problem," as Wong put it, but it's also a chance to build the sort of decentralized search that could one day do what the RIAA and MPAA have so far failed to do: make The Pirate Bay irrelevant.

"We do hope that this type of system can be used in place of a centralized aggregator," Wong tells me, but that day is still a long way off. Aggregators have the potential to offer more metadata and better organization than Cubit, which relies on data from filenames and the comments sections of files.

"Edit distance" is everything

To start, the team built an Azureus (now called "Vuze") plugin for Cubit to demonstrate the technology. With the plugin installed, Cubit creates a lightweight overlay network that exists in parallel with BitTorrent but is used only for searches. Cubit's central insight is to abandon hashes, which are good only at detecting identical matches, and to instead build a network based on "edit distance."

Edit distance is "equal to the minimum number of insertions, deletions, and substitutions needed to transform one string to another." The edit distance between "ring" and "rings" is 1, for example, while the number of changes needed to go from "ring" to "earring" is 3 (see example below).



Edit distance between nodes

All files on all machines running Cubit are given a node ID, like "ring" or "earring," and the computer builds an internal map of all the nodes based on their edit distance from one another. When a user accidentally searches for "rong," nodes with the lowest edit distance from that word appear first in the results list. That means "ring" and "rang" would show up near the top of the list, since each has an edit distance of one, while "rings" would be one of the next results because of its edit distance of two. This is all grossly simplified; tech heads who want to read about "Levenshtein distance" and "small-world construction" should check out the official paper describing Cubit (PDF).
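To make the ranking concrete, here's a minimal Python sketch of the classic dynamic-programming algorithm for Levenshtein distance, used to sort a handful of made-up node IDs against the misspelled query "rong." This illustrates the metric itself, not Cubit's actual overlay routing:

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute ca -> cb
        prev = curr
    return prev[-1]

# The article's examples check out:
assert levenshtein("ring", "rings") == 1
assert levenshtein("ring", "earring") == 3

# Rank hypothetical node IDs by edit distance from a misspelled query.
nodes = ["earring", "rings", "ring", "rang"]
print(sorted(nodes, key=lambda n: levenshtein("rong", n)))
# ['ring', 'rang', 'rings', 'earring']
```

The distance-one matches "ring" and "rang" sort to the front, "rings" follows at distance two, which is exactly the ordering described above.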

Because of the way the node map is created, searches don't require querying every peer in the network; the team claims they need an order of magnitude fewer queries than DHT systems while providing more useful results. In addition, the system works for any language in which a "word similarity metric" (the edit distance) can be defined, so nothing about it is particular to English.

What's in a name?

So why choose "Cubit" for a project to build approximate search into P2P apps? Wong says that the team needed a name, the system relies on "edit distance," and cubit is an old unit of measurement. The name seemed "fitting."

To see the system in action, we installed the plugin and ran two searches. The first was through Vuze's own search functionality, and we looked for "machinime" (a misspelling of "machinima"). Predictably, there were no results, though "machinima" returned plenty. When we ran the search through Cubit, numerous results popped up, but most were for "machine" rather than "machinima."



Cubit results in Vuze

While the system, when complete, should make it simple to find and start torrent downloads without using an index, Wong points out that it's not a boon to would-be copyright infringers. It makes it neither harder nor easier for investigators to find the IP addresses of people sharing files; they just need to search the network rather than the index. What Cubit can do is force content owners to go after the end users sharing particular files rather than simply trying to hobble BitTorrent by shutting down the biggest indexes, bringing torrent search into the fully decentralized world.