Tahoe is a secure distributed filesystem designed to conform to the principle of least authority. The project's developers announced this month the release of version 1.5, which includes bug fixes and improvements to portability and performance, including a 10 percent boost to file upload speed over high-latency connections.

Tahoe's underlying architecture is similar to that of a peer-to-peer network. Files are distributed across multiple nodes in a manner that allows data integrity to be maintained even if individual nodes are compromised or fail. Tahoe uses AES encryption to protect file contents from tampering and scrutiny. It can be used to establish a relatively fault-tolerant storage pool that spans a number of conventional computers over a local network or the Internet. This approach to cloud storage might be more appropriately described as "crowd" storage.

Tahoe was originally developed with funding from Allmydata, a company that provides Web backup services. When Allmydata was founded, the company had highly ambitious plans for distributed storage. It initially offered a service through which individual consumers could get cheap storage capacity on the distributed grid in exchange for letting the grid use some of their own local storage.

The idea was that every user would be able to get the benefits of distributed off-site backups by sharing a portion of their local drive space with the rest of the network. The company eventually dropped that strategy and now self-hosts all of its backup storage. The Tahoe source code, which is made available under the terms of the GNU General Public License (GPL), can be used to build distributed storage grids that function in much the same manner as Allmydata's original concept.

When a file is uploaded to Tahoe, it is encrypted and split into pieces that are spread out across ten separate nodes. Using a variation of Reed-Solomon error correction, Tahoe can reconstruct the file from the pieces held by any three of the original ten nodes. This helps to ensure data integrity when some nodes are unavailable, and is somewhat similar to how RAID storage works. Tahoe uses a library called zfec that provides an efficient implementation of the error correction code and exposes it through a Python API. For those of you who are finding this all a bit hard to follow, there is a simple interactive mockup that illustrates visually how Tahoe's distributed storage works.
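
To make the three-of-ten idea concrete, here is a minimal toy sketch in pure Python. It is not zfec's API and not Tahoe's actual encoding: the function names `encode` and `decode`, the prime field of size 257, and the three-byte block are all assumptions chosen for illustration. It treats three bytes as the coefficients of a polynomial, hands out ten evaluations of that polynomial as "shares", and rebuilds the original bytes from any three of them.

```python
from itertools import combinations

P = 257  # smallest prime above 255, so every byte value is a field element

def encode(block, n):
    """Treat the k bytes of `block` as polynomial coefficients and
    evaluate that polynomial at x = 1..n, giving n (x, y) shares."""
    shares = []
    for x in range(1, n + 1):
        y = 0
        for coeff in reversed(block):  # Horner's rule, highest degree first
            y = (y * x + coeff) % P
        shares.append((x, y))
    return shares

def decode(shares, k):
    """Recover the k original bytes from any k distinct shares
    using Lagrange interpolation over the field of integers mod P."""
    shares = shares[:k]
    coeffs = [0] * k
    for i, (xi, yi) in enumerate(shares):
        basis = [1]  # coefficients of the i-th Lagrange basis numerator
        denom = 1
        for j, (xj, _) in enumerate(shares):
            if j == i:
                continue
            # Multiply the basis polynomial by (x - xj).
            grown = [0] * (len(basis) + 1)
            for d, c in enumerate(basis):
                grown[d] = (grown[d] - c * xj) % P
                grown[d + 1] = (grown[d + 1] + c) % P
            basis = grown
            denom = (denom * (xi - xj)) % P
        # Divide by the denominator via Fermat's little theorem.
        scale = yi * pow(denom, P - 2, P) % P
        for d, c in enumerate(basis):
            coeffs[d] = (coeffs[d] + c * scale) % P
    return bytes(coeffs)

block = b"abc"                # k = 3 bytes of (already encrypted) data
shares = encode(block, 10)    # one share per node, ten nodes in total

# Any three surviving shares are enough to rebuild the block.
recovered = [decode(list(subset), 3) for subset in combinations(shares, 3)]
print(all(r == block for r in recovered))
```

The trade-off in any scheme of this kind is storage overhead: each share is about one third the size of the data it protects, so keeping ten of them costs roughly 3.3 times the original file size in exchange for tolerating the loss of up to seven nodes.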

Although Tahoe is a distributed filesystem, it is not entirely decentralized. It needs a central node, called an Introducer, which is responsible for getting new nodes connected to existing nodes on the grid. Tahoe is designed to minimize its dependency on the Introducer, but the Introducer is still effectively a central point of failure. If it goes down, existing nodes will still be able to communicate with each other and propagate data, but the grid won't be able to connect new nodes. The developers hope to address this limitation in a future version.
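
The failure mode described above can be illustrated with a small toy model. This is only a sketch of the behavior as the article describes it, not Tahoe's code or protocol: the `Introducer` and `Node` classes, their methods, and the `online` flag are all hypothetical names invented for the example.

```python
class Introducer:
    """Hypothetical stand-in for the central node that hands out peer lists."""

    def __init__(self):
        self.online = True
        self.known = []

    def register(self, node):
        # A new node can only learn about the grid while the introducer is up.
        if not self.online:
            raise ConnectionError("introducer unreachable")
        peers = list(self.known)
        self.known.append(node)
        return peers


class Node:
    """Hypothetical storage node that remembers the peers it has met."""

    def __init__(self, name):
        self.name = name
        self.peers = set()

    def join(self, introducer):
        # Connect to every node the introducer already knows about.
        for peer in introducer.register(self):
            self.peers.add(peer)
            peer.peers.add(self)

    def can_reach(self, other):
        # Established connections do not depend on the introducer.
        return other in self.peers


intro = Introducer()
a, b = Node("a"), Node("b")
a.join(intro)
b.join(intro)

intro.online = False          # simulate the introducer going down

still_connected = a.can_reach(b) and b.can_reach(a)

c = Node("c")
try:
    c.join(intro)
    joined = True
except ConnectionError:
    joined = False

print(still_connected, joined)
```

In this model the two nodes that joined before the outage keep their connection, while the newcomer cannot join until the introducer returns.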

Tahoe is being used in a number of different ways. A common configuration documented on the project's wiki is the "friendnet", a group of roughly ten nodes that are connected over the Internet and provide shared secure storage capacity with optional filesharing. Another potential usage scenario is installing Tahoe on individual workstations on an office network and using their excess disk capacity as a storage pool. The Tahoe wiki describes that kind of setup as a "hivecache".

The Tahoe source code is primarily written in Python with the Twisted framework. The code base is highly portable and can run on Windows, Mac OS X, Linux, Solaris, and several flavors of BSD. It runs entirely in userspace and doesn't require any kernel modules or other low-level components. It works fine on regular commodity hardware and has no special requirements. Installation instructions are available at the project's web site.
