History is full of great rivalries: France vs. England, Red Sox vs. Yankees, Sherlock Holmes vs. Moriarty, Ken vs. Ryu in Street Fighter... When it comes to Apache Hadoop data storage in the cloud, though, the biggest rivalry lies between the Hadoop Distributed File System (HDFS) and Amazon's Simple Storage Service (S3).

While Apache Hadoop has traditionally worked with HDFS, S3 also meets Hadoop's file system requirements. Companies such as Netflix have used this compatibility to build Hadoop data warehouses that store information in S3, rather than HDFS.

So what's all the hype about with S3, and is S3 better than HDFS for Hadoop cloud data storage? To understand the pros and cons of HDFS and S3, let's resolve this tech rivalry... in battle!

Before we get started, we'll provide a general overview of S3 and HDFS and the points of distinction between them. The main differences between HDFS and S3 are:

S3 is more scalable than HDFS.

When it comes to durability, S3 has the edge over HDFS.

Data in S3 is always persistent, unlike data in HDFS.

S3 is more cost-efficient and likely cheaper than HDFS.

HDFS excels when it comes to performance, outshining S3.

What is HDFS?

HDFS (Hadoop Distributed File System) was built to be the primary data storage system for Hadoop applications. A project of the Apache Software Foundation, HDFS seeks to provide a distributed, fault-tolerant file system that can run on commodity hardware.

The HDFS layer of a cluster consists of a master node (also called a NameNode) that manages one or more slave nodes, each of which runs a DataNode instance. The NameNode keeps track of the data's location, while the DataNodes are tasked with storing and retrieving this data. Because files in HDFS are automatically stored across multiple machines, HDFS has built-in redundancy that protects against node failures and data loss.
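To make the division of labor concrete, here is a toy sketch of the NameNode's bookkeeping. It is not Hadoop code: the class name `ToyNameNode`, the node names, and the round-robin placement are all assumptions made for illustration (real HDFS uses rack-aware placement). It only shows why a single node failure does not lose data when every block has several replicas.

```python
from itertools import cycle


class ToyNameNode:
    """Holds metadata only: which DataNodes store a replica of each block."""

    def __init__(self, datanodes, replication=3):
        if replication > len(datanodes):
            raise ValueError("need at least as many DataNodes as replicas")
        self.replication = replication
        self.block_map = {}  # (file name, block index) -> list of DataNode names
        self._next_node = cycle(datanodes)

    def add_file(self, name, num_blocks):
        """Record replica locations for each block of a new file."""
        for index in range(num_blocks):
            # Simplified round-robin placement; real HDFS placement is rack-aware.
            replicas = [next(self._next_node) for _ in range(self.replication)]
            self.block_map[(name, index)] = replicas

    def readable_after_failure(self, failed_node):
        """True if every block still has a replica on a surviving node."""
        return all(
            any(node != failed_node for node in replicas)
            for replicas in self.block_map.values()
        )


namenode = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
namenode.add_file("logs.txt", num_blocks=5)
print(namenode.readable_after_failure("dn1"))  # True: two other replicas remain
```

With `replication=1` the same check fails as soon as the node holding a block goes down, which is the risk that replication is meant to remove.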

What is Amazon S3?

Amazon S3 (Simple Storage Service) is a cloud IaaS (infrastructure as a service) solution from Amazon Web Services for object storage via a convenient web-based interface. According to Amazon, the benefits of S3 include "industry-leading scalability, data availability, security, and performance."

The basic storage unit of Amazon S3 is the "object", which consists of a file with an associated key (its unique identifier within a bucket) and metadata. These objects are stored in buckets, which function similarly to top-level folders or directories and which reside within the AWS region of your choice.

Round 1: Scalability

The showdown over scalability comes down to the question of manual vs. automatic scaling.

HDFS relies on local storage that you have to scale yourself. If you want to increase your storage space, you'll either have to add larger hard drives to existing nodes (vertical scaling) or add more machines to the cluster (horizontal scaling). This is feasible, but more costly and complicated than S3.

S3 scales automatically and elastically according to your current data usage, without any need for action on your part. Even better, Amazon doesn't have predetermined limits on total storage, so you have a practically infinite amount of space available.

Bottom line: The first round goes to S3, thanks to its greater scalability, flexibility, and elasticity.

Round 2: Durability

Data "durability" refers to the ability to keep your information intact long-term in cloud data storage, without suffering bit rot or corruption. So when it comes to durability, which is better: S3 or HDFS?

A statistical model for HDFS data durability suggests that the probability of losing a block of data (64 megabytes by default) on a large 4,000 node cluster (16 petabytes total storage, 250,736,598 block replicas) is 0.00000057 (5.7 x 10^-7) in the next 24 hours, and 0.00021 (2.1 x 10^-4) in the next 365 days. However, most clusters contain only a few dozen instances, and so the probability of losing data can be much higher.
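As a sanity check, the two HDFS figures can be related with a few lines of arithmetic. The sketch below assumes that each day is an independent trial with the same loss probability; that independence assumption is ours, not necessarily the one used by the cited model, and the function name is made up for illustration.

```python
def annual_loss_probability(daily_probability, days=365):
    """Probability of at least one loss in `days` days, assuming independent days."""
    return 1.0 - (1.0 - daily_probability) ** days


daily = 5.7e-7  # 24-hour figure quoted above
yearly = annual_loss_probability(daily)
print(f"{yearly:.1e}")
```

Under that assumption the result comes out at roughly 2.1 x 10^-4, in line with the 365-day figure quoted above.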

S3 is designed for a durability of 99.999999999% of objects per year. This means that if you store 10,000,000 objects, you can expect to lose a single object once every 10,000 years on average (see the S3 FAQ).
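The 10,000-year figure follows directly from the durability percentage. Here is the arithmetic as a short sketch, using exact fractions to avoid rounding; the function names are ours, chosen for illustration.

```python
from fractions import Fraction

# 99.999999999% expressed as an exact fraction
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)


def expected_losses_per_year(num_objects, durability=DURABILITY):
    """Average number of objects lost per year for a given object count."""
    return num_objects * (1 - durability)


def years_per_lost_object(num_objects, durability=DURABILITY):
    """Average number of years between single-object losses."""
    return 1 / expected_losses_per_year(num_objects, durability)


print(years_per_lost_object(10_000_000))  # 10000
```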

The news gets even better for S3 users. One of my colleagues at Xplenty recently took an AWS workshop, and Amazon representatives reportedly claimed that they hadn’t actually lost a single object in the default S3 storage over the entire history of the service. (The cheaper Reduced Redundancy Storage (RRS) option, with a durability of only 99.99%, is also available.)

Bottom line: S3 wins again. Large clusters may have excellent durability, but in most cases S3 is more durable than HDFS.

Round 3: Persistence

In the world of cloud data storage, "persistence" refers to the survival of data after the process that creates it has finished.

With HDFS on ephemeral instance storage, data doesn't persist once the EC2 or EMR instances are stopped or terminated. You can use EBS volumes to persist the data on EC2, but this adds cost.

On the other hand, data is always persistent in S3—simple as that.

Bottom line: S3 comes out on top this round: it offers out-of-the-box data persistence, and HDFS doesn't.

Round 4: Price

In order to preserve data integrity, HDFS stores three copies of each block of data by default. This means exactly what it sounds like: HDFS requires triple the amount of storage space for your data—and therefore triple the cost. While you don't have to enable data replication in triplicate, storing just one copy is highly risky, putting you in danger of data loss.

Amazon handles the issue of data backups on S3 itself, so you pay for only the storage that you actually need. S3 also supports storing compressed files, which can help slash your storage costs.
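The effect of replication on the bill can be sketched in a few lines. The price below is a made-up placeholder, not an AWS or hardware quote, and it is deliberately the same for both systems so that only the replication factor differs; real per-terabyte prices for HDFS disks and S3 are not equal.

```python
def raw_storage_tb(logical_tb, replication_factor):
    """Raw capacity that must be provisioned to hold `logical_tb` of data."""
    return logical_tb * replication_factor


def monthly_storage_cost(logical_tb, replication_factor, price_per_tb_month):
    """Monthly bill for the raw capacity at a flat per-terabyte price."""
    return raw_storage_tb(logical_tb, replication_factor) * price_per_tb_month


LOGICAL_TB = 100
PLACEHOLDER_PRICE = 10.0  # hypothetical currency units per TB-month, not a real quote

hdfs_cost = monthly_storage_cost(LOGICAL_TB, 3, PLACEHOLDER_PRICE)  # three replicas
s3_cost = monthly_storage_cost(LOGICAL_TB, 1, PLACEHOLDER_PRICE)    # billed once
print(hdfs_cost / s3_cost)  # 3.0
```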

Bottom line: S3 is the clear winner for this one, thanks to the lower storage overhead costs.

Round 5: Performance

So far, the comparison between HDFS and S3 hasn't even been a competition—S3 comes out on top for scalability, durability, persistence, and price. But what about the question of performance?

The good news is that HDFS performance is excellent. Because data is stored and processed on the same machines, access and processing speed are lightning-fast.

Unfortunately, S3 doesn't perform as well as HDFS: latency is higher and data throughput is lower. However, Hadoop jobs are usually chains of MapReduce jobs whose intermediate data is stored in HDFS and the local file system. Apart from the initial read from Amazon S3 and the final write back to it, you get the throughput of the nodes' local disks.

We recently ran some tests with TestDFSIO, a read/write test utility for Hadoop, on a cluster of m1.xlarge instances with four ephemeral disk devices per node. The results confirm that HDFS performs better:

| | HDFS on Ephemeral Storage | Amazon S3 |
|---|---|---|
| Read | 350 mbps/node | 120 mbps/node |
| Write | 200 mbps/node | 100 mbps/node |
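To put those figures in perspective, the short sketch below computes how many times faster HDFS was than S3 in this test. The numbers are copied from the results above; the dictionary layout and function name are ours.

```python
# Per-node throughput from the TestDFSIO results above (units as reported)
THROUGHPUT = {
    "HDFS": {"read": 350, "write": 200},
    "S3": {"read": 120, "write": 100},
}


def hdfs_speedup(operation):
    """How many times faster HDFS was than S3 for 'read' or 'write'."""
    return THROUGHPUT["HDFS"][operation] / THROUGHPUT["S3"][operation]


for operation in ("read", "write"):
    print(f"{operation}: {hdfs_speedup(operation):.1f}x")
# read: 2.9x
# write: 2.0x
```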

Bottom line: HDFS finally wins a round, thanks to its strong all-round performance.

Round 6: Security

Some people think that HDFS isn't secure, but that's a common misconception. Hadoop provides user authentication via Kerberos and authorization via file system permissions. Hadoop 2 takes this even further with a feature called HDFS Federation: dividing a cluster into several namespaces, which helps restrict users to only the data to which they should have access. In addition, data can be uploaded to Amazon instances securely via SSL.

S3 also has built-in security. It supports user authentication to control data access. By default, only the bucket and object owners have access to data. Further permissions can be granted to users and groups via bucket policies and Access Control Lists (ACLs). S3 also allows you to encrypt data and upload it securely via SSL.

Bottom line: It’s a tie: both HDFS and S3 have robust security measures.

Round 7: Limitations

Even though HDFS can store files of any size, it has well-documented issues storing very small files, which should be concatenated or combined into Hadoop Archives. In addition, data saved on a certain cluster in HDFS is only available to machines on that cluster, and cannot be used by instances outside the cluster.

That's not the case with S3: data is independent of Hadoop clusters, and can be processed by any number of clusters simultaneously. However, files on S3 have several limitations of their own. A single upload request is limited to 5 gigabytes (larger objects, up to 5 terabytes, require multipart upload), which older Hadoop S3 connectors did not always handle. Columnar storage formats such as Parquet or ORC are also more demanding on S3. Hadoop needs to access particular byte ranges in these files, and on S3 each such access is a separate ranged request over the network rather than a cheap seek on a local disk.

Bottom line: Another tie: both options have limitations that you should be aware of before choosing between HDFS and S3.

HDFS vs. S3: Who Wins?

With better scalability, durability, built-in persistence, and lower prices, S3 is tonight's winner! Still, HDFS comes away with a couple of important consolation prizes. For better performance and fewer constraints on file size and storage format, HDFS is the way to go. Whether you use HDFS or Amazon S3, Xplenty can help you integrate your data storage with any of your data destinations. Contact us today for a free 7-day trial on our platform.

Originally published: March 20th, 2014