This week Facebook will complete its roll-out of a new photo storage system designed to reduce the social network’s reliance on expensive proprietary solutions from NetApp and Akamai. The new large blob storage system, named Haystack, is a custom-built file system solution for the over 850 million photo s uploaded to the site each month (500 GB per day!). Jason Sobel, a former NetApp engineer, led Facebook‘s effort to design a more cost-effective and high-performance storage system for their unique needs. Robert Johnson, Facebook‘s Director of Engineering, mentioned the new storage system rollout in a Computerworld interview last week. Most of what we know about Haystack comes from a Stanford ACM presentation by Jason Sobel in June 2008. Haystack will allow Facebook to operate its massive photo archive from commodity hardware while reducing its dependence on CDN s in the United States.

The old Facebook system

Facebook has two main types of photo storage: profile photo s and photo libraries. Members upload photo s to Facebook and treat the transaction as digital archive with very few deletions and intermittent reads. Profile photo s are a per-member representation stored in multiple viewing sizes (150 px , 75 px , etc). The past Facebook system relied heavily on CDN s from Akamai and Limelight to protect its origin servers from a barrage of expensive requests and improve latency.

Facebook profile photo access is accelerated by Cachr, an image server powered by evhttp with a memcached backing store. Cachr protects the file system from new requests for heavily-accessed files.

The old photo storage system relied on a file handle cache placed in front of NetApp to quickly translate file name requests into a inode mapping. When a Facebook member deletes a photo its index entry is removed but the file still exists within the backing file system. Facebook photo s’ file handling cache is powered by lighttpd with a memcache storage layer to reduce load on the NetApp filers.

No need for POSIX

Facebook photo graphs are viewable by anyone in the world aware of the full asset URL . Each URL contains a profile ID , photo asset ID , requested size, and a magic hash to protect against brute-force access attempts.

/ [pvid] _ [key] _ [magic] _ [size] .jpg

Traditional file systems are governed by the POSIX standard governing metadata and access methods for each file. These file systems are designed for access control and accountability within a shared system. An Internet storage system written once and never deleted, with access granted to the world, has little need for such overhead. A POSIX -compliant node must specifically contain:

File length

Device ID

Storage block pointers

File owner

Group owner

Access rights on each assignment: read, write execute

Change time

Modification time

Last access time

Reference counts

Only the top three POSIX requirements matter to a file system such as Facebook. Its servers care where the file is located and its total length but have little concern for file system owners, access rights, timestamps, or the possibility of linked references. The additional overhead of POSIX -compliant metadata storage and lookup on NetApp Filers led to 3 disk I/O operations for each photo read. Facebook simply needs a fast blob store but was stuck inside a file system.

Haystack file storage

Haystack stores photo data inside 10 GB bucket with 1 MB of metadata for every GB stored. Metadata is guaranteed to be memory-resident, leading to only one disk seek for each photo . Haystack servers are built from commodity servers and disks assembled by Facebook to reduce costs associated with proprietary systems.

The Haystack index stores metadata about the one needle it needs to find within the Haystack. Incoming requests for a given photo asset are interpreted as before, but now contain a direct reference to the storage offset containing the appropriate data.

Cachr remains a first line-of-defense to Haystack lookups, quickly processing requests and loading images from memcached where appropriate. Haystack provides a fast and reliable file backing for these specialized requests.

Reduced CDN costs

The high performance of Haystack combined with new data center presence on the east and west coasts of the United States reduces Facebook’s reliance on costly CDNs. Facebook does not currently have the points of presence to match a specialist such as Akamai, but the combined latency of speed of light plus file access should be performant enough to reduce CDN in areas where Facebook already has existing data center assets. Facebook can partner with specialized CDN operators in markets such as Asia where it has no foreseeable physical presence to boost its access times for Asian market files.

Summary

Facebook has invested in its own large blob storage solution to replace expensive proprietary offerings from NetApp and others. The new server structure should reduce Facebook’s total cost per photo for both storage and delivery moving forward.

Big companies don’t always listen to the growing needs of application specialists such as Facebook. Yet you can always hire away their engineering talent to build you a new custom solution in-house, which is what Facebook has done.

Facebook has hinted at releasing more details about Haystack later this month, which may include an open-source roadmap.