When I told people that we were going to run our infrastructure on Amazon’s EC2 most people recoiled in disgust. I heard lots and lots of horror stories about how you simply couldn’t run production environments on EC2. Disk IO was horrible, throughput was bad, etc. Someone should seriously tell Amazon, since a lot of their own infrastructure runs in EC2 and AWS. For the most part, Amazon’s AWS were internal tools that they released publicly.

We’ve been ironing out kinks in our production environment for the last few weeks and one of the things that worried me was if these assertions were true. So, I set out to run a fairly comprehensive test of Disk IO and throughput. I ran hdparm -t , bonnie++ , and iozone against ephemeral drives in various configurations along with EBS volumes in various configurations.

For all of my tests I tested the regular ephemeral drives as they were installed (“Normal”), LVM in a JBOD setup (“LVM”), the two ephemeral drives in a software RAID0 setup (“RAID0”), a single 100GB EBS volume (“EBS”), and two 100GB EBS volumes in a software RAID0 setup (“EBS RAID0”). All of the tests were ran on large instances (m1.large). All tests were ran using the XFS file system.

hdparm -t

While hdparm -t isn’t the most comprehensive test in the world, I think it’s a decent gut check for simple throughput. To give a little context, I remember Digg’s production servers, depending on setup, ranging from 180MB/sec. to 340MB/sec. I’m guessing if you upgraded to the XL instance type and did a RAID0 across the four drives it has you’d see even better numbers with the RAID0 ephemeral drives.

What I also found pretty interesting about these numbers is that the EBS volumes stacked up “okay” against the ephemeral drives and that the EBS volumes in a RAID0 didn’t gain us a ton of throughput. Considering that the EBS volumes run over the network, which I assume is gigabit ethernet, 94MB/sec. is pretty much saturating that network connection and, to say the least, impressive given the circumstances.

For most applications, I’d guess that EBS throughput is just fine. The raw throughput only becomes a serious requirement when you’re moving around lots of large files. Most applications move around lots of small files, which I’d likely use S3 for anyways. If your application needs to move around lots of large files I’d consider using a RAID0 ephemeral drive setup with redundancy (e.g. MogileFS spreading files across many nodes in various data centers).

bonnie++

Disregarding the Input/Output Block performance of the ephemeral RAID0 setup here, it’s extremely interesting to note that EBS IO performance is better than the ephemeral drives and that EBS in a RAID0 was better in almost every metric as the ephemeral drives.

That all being said, RAID0 ephemeral drives are the clear winner here. I do wonder, however, if you could set up a RAID0 EBS array that had, say, four or six or eight volumes that’d be faster than the RAID0 setup.

If your application is IO bound then I’d probably recommend using EBS volumes if you can afford it. Otherwise, I’d use RAID0. Again, the trick with the ephemeral drives is to ensure your data is replicated across multiple nodes. Of course, this is cloud computing we’re talking about, so you should be doing that anyways.

Here’s the CPU numbers of the various configurations. One thing to note here is that EBS, LVM, and software RAID all come with CPU costs. Somewhat interesting to note is that the EBS has substantially less CPU usage in all areas except Input/Output Per Char.

If your application is both CPU and IO bound then I’d probably recommend upgrading your instance to an XL.

The last bonnie++ results are the random seeks per second and, wow, was I surprised. A single EBS runs pretty much dead even with the LVM JBOD and the EBS RAID0 is on par with the RAID0 ephemeral drives.

To say I was surprised by these numbers would be an understatement. The lesson here is that, if your application does lots of random seeking, you’ll want to use either EBS RAID0 volumes or RAID0 ephemeral drives.

iozone

Before running these tests I’d never even heard of this application, but it seemed to be used by quite a few folks so I thought I’d give it a shot.

Again, some interesting numbers from EBS volumes. What I found pretty interesting here is that the EBS RAID0 setups actually ended up being slower in a few metrics than a single EBS volume. No idea why that may be.

The other thing to note is that the single EBS volume outperformed the ephemeral RAID0 setup in a few different metrics, most notably being random writes.

Conclusions

I think the overall conclusion here is that disk IO and throughput on EC2 is pretty darn good. I have a few other conclusions as well.