The transfer of scientific data has emerged as a significant challenge as datasets continue to grow in size and demand for open access sharing increases. Current methods for file transfer do not scale well for large files and can cause long transfer times. In this study we present BioTorrents, a website that allows open access sharing of scientific data and uses the popular BitTorrent peer-to-peer file sharing technology. BioTorrents allows files to be transferred rapidly due to the sharing of bandwidth across multiple institutions and provides more reliable file transfers due to the built-in error checking of the file sharing technology. BioTorrents contains multiple features, including keyword searching, category browsing, RSS feeds, torrent comments, and a discussion forum. BioTorrents is available at http://www.biotorrents.net.

Introduction

The amount of data being produced in the sciences continues to expand at a tremendous rate[1]. In parallel, and also at an increasing rate, is the demand to make this data openly available to other researchers, both pre-publication[2] and post-publication[3]. Considerable effort and attention have been given to improving the portability of data by developing data format standards[4], minimal information for experiment reporting[5]–[8], data sharing policies[9], and data management[10]–[13]. However, the practical aspect of moving data from one location to another has stayed relatively unchanged: data is still transferred using either Hypertext Transfer Protocol (HTTP) [14] or File Transfer Protocol (FTP) [15]. These protocols require that a single server be the source of the data and that all requests for data be handled from that single location (Fig. 1A). In addition, the server hosting the data must have a large amount of bandwidth to provide adequate download speeds for all data requests. Unfortunately, as the number of requests for data increases and the provider's bandwidth becomes saturated, the access time for each data request can increase rapidly. Even when ample bandwidth is available, these file transfer methods require that the data be centrally stored, making the data inaccessible if the server malfunctions.
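The effect of bandwidth saturation on a single server can be illustrated with a deliberately simplified model. The sketch below assumes that the server's outgoing bandwidth is divided equally among all simultaneous requests and that there is no protocol overhead; the function name and the example figures (a 10 GB dataset and a 1 Gbit/s link) are hypothetical and are not measurements from this study.

```python
def download_time_seconds(file_size_bytes: float,
                          server_bandwidth_bytes_per_s: float,
                          concurrent_requests: int) -> float:
    """Estimate the time for one client to download a file from a single server.

    Assumption (not from the paper): the server's bandwidth is split evenly
    among all concurrent requests, with no overhead or client-side limits.
    """
    if concurrent_requests < 1:
        raise ValueError("concurrent_requests must be at least 1")
    per_client_bandwidth = server_bandwidth_bytes_per_s / concurrent_requests
    return file_size_bytes / per_client_bandwidth


# Hypothetical example: a 10 GB dataset on a server with a 1 Gbit/s uplink
# (125,000,000 bytes per second).
FILE_SIZE = 10_000_000_000
SERVER_BANDWIDTH = 125_000_000

time_alone = download_time_seconds(FILE_SIZE, SERVER_BANDWIDTH, 1)
time_crowded = download_time_seconds(FILE_SIZE, SERVER_BANDWIDTH, 100)
print(f"1 client: {time_alone:.0f} s, 100 clients: {time_crowded:.0f} s")
```

Under these assumptions the download takes 80 seconds for a single client and 8000 seconds when 100 clients request the file at once, because each client's share of the fixed server bandwidth shrinks as requests are added.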

Figure 1. Illustration of differences between traditional and peer to peer file transfer protocols. A) Traditional file transfer protocols such as HTTP and FTP use a single host for obtaining a dataset (grey filled black box), even though other computers contain the same file or partial copies while downloading (partially filled black box). This can cause transfers to be slow due to bandwidth limitations or if the host fails. B) The peer-to-peer file transfer protocol, BitTorrent, breaks up the dataset into small pieces (shown as pattern blocks within black box), and allows sharing among computers with full copies or partial copies of the dataset. This allows faster transfer times and decentralization of the data. https://doi.org/10.1371/journal.pone.0010071.g001

Several solutions have been proposed to address the challenges of moving large amounts of data. Bio-Mirror (http://www.bio-mirror.net/) was started in 1999 and consists of several servers in various countries sharing identical datasets. Bio-Mirror improves download speeds, but requires that the data be replicated across all servers, is restricted to only very popular genomic datasets, and does not include fast-growing datasets such as the Sequence Read Archive (SRA) (http://www.ncbi.nlm.nih.gov/sra). The Tranche Project (https://trancheproject.org/) is the software behind the Proteome Commons (https://proteomecommons.org/) proteomics repository. The focus of the Tranche Project is to provide a secure repository that can be shared across multiple servers. Because all bandwidth is provided by these dedicated Tranche servers, considerable administration and funding are necessary to maintain such a service. An alternative to these repository-like resources is to use a peer-to-peer file transfer protocol. Peer-to-peer networks allow users to share datasets directly with one another without the need for a central repository to provide the data hosting or bandwidth for downloading. One of the earliest and most popular peer-to-peer protocols is Gnutella (http://rfc-gnutella.sourceforge.net/), which is the protocol behind many popular file sharing clients such as LimeWire (http://www.limewire.com/), Shareaza (http://shareaza.sourceforge.net/), and BearShare (http://www.bearshare.com/). Unfortunately, this protocol was centered on sharing individual files and does not scale well for sharing very large files. In comparison, the BitTorrent protocol [16] handles large files very well, is actively being developed, and is a very popular method for data transfer. For example, BitTorrent can be used to transfer data from the Amazon Simple Storage Service (S3) (http://aws.amazon.com/s3/), is used by Twitter (http://twitter.com/) to distribute files to a large number of servers (http://github.com/lg/murder), and is used to distribute numerous types of media.

The BitTorrent protocol works by first splitting the data into small pieces (usually 512 KB to 2 MB in size), allowing the large dataset to be distributed in pieces and downloaded from various sources (Fig. 1B). A checksum is created for each file piece to verify the integrity of the data being received, and these checksums are stored within a small “torrent” file. The torrent file also contains the address of one or more “trackers”. The tracker is responsible for maintaining a list of clients that are currently sharing the torrent, so that clients can make direct connections with other clients to obtain the data. A BitTorrent software client (see Table 1) uses the data in the torrent file to contact the tracker and to transfer the data between computers containing either full or partial copies of the dataset. Therefore, bandwidth is shared and distributed among all computers in the transaction instead of a single source providing all of the required bandwidth. The total available bandwidth grows as the number of clients sharing the data increases, so capacity scales with demand rather than being limited by a single host. The end result is faster transfer times, lower bandwidth requirements for any single source, and decentralization of the data.

Torrent files have been hosted on numerous websites and, in theory, scientific data can already be transferred using any one of these BitTorrent trackers. However, many of these websites contain materials that violate copyright laws and are prone to being shut down due to copyright infringement. In addition, the vast majority of data on these trackers is not science related, which makes searching or browsing for legitimate scientific data nearly impossible. Therefore, to improve upon the open sharing of scientific data we created BioTorrents, a legal BitTorrent tracker that hosts scientific data and software.