MongoDB with Amazon’s EBS and Provisioned IOPS

By David Mytton, CEO & Founder of Server Density.

Published on the 2nd October, 2012.

This is a guest post by Charity Majors, a full stack systems geek who is currently building exciting things at Parse. She likes sed and awk and peaty scotches.

We here at Parse recently upgraded all of our MongoDB replica sets to Amazon’s Provisioned IOPS, which is an EBS-backed offering that guarantees you a certain number of IOPS at least 99% of the time.

Previous setup

Our MongoDB replica nodes were previously backed by four regular EBS volumes using RAID 10. Performance was acceptable most of the time: we use the EC2 flavor with the maximum available RAM, m2.4xlarge, and our working set fits fairly well into memory. But we were very sensitive to any transient increase in reads or writes, and any time we had to touch the disk, we suffered. A handful of import or export jobs could cause a lot of writes to pile up and latency to spike (especially since we’re still running on MongoDB 2.0, with its global write lock). Any single network event or co-tenancy event would impact the entire replica set in unpredictable ways, and four volumes means four potential points of failure or tenancy issues.

The PIOPS offering was exciting to us because it’s advertised as being backed by proper database hardware, meaning lower seek times and faster random reads and writes. You also get dedicated Ethernet throughput to your EBS volumes, so you don’t have to share disk I/O and production query traffic on the same network device.

We decided to use two striped 1000-IOPS volumes instead of the old RAID 10 configuration. We wanted to stripe for the extra speed (1000 is currently the maximum IOPS you can allocate per volume, though higher limits are coming). We don’t care about the redundancy of RAID 10, since it takes less time for us to reprovision a new replica set member from scratch than it would to repair a degraded RAID array.

Before the upgrade our average latency from the ELB point was around 200-300 ms, and we regularly saw spikes of a second or more. (This is the round trip processing time from when a query hits the ELB until it exits our systems. It’s an average of really fast traffic like API and web, and slower traffic like Cloud Code and massive push notification batches.) A few times a day we would see performance degrade even further to several seconds with no apparent proximate cause.

Switching to PIOPS

After switching to the new striped PIOPS volumes, our average end-to-end latency from ELB dropped by more than half to around 100 ms. More importantly, our latency is almost completely flat with no spikes. It’s really fun to skim back through our MMS graphs and observe sizable spikes in lock percentage, queues, and page faults that never registered so much as a blip on our latency graphs.

Switching primary to a cold secondary is also dramatically better on PIOPS. We generally see about 7 minutes of minimally degraded performance, where average round trip latency goes back up to 200-400 ms and a handful of writes time out. Before PIOPS it was impossible to switch to a cold replica set member — requests would pile up and start timing out in the app server layer, and it would take nearly two hours to read the full working set into memory and restore our traffic to normal rates. So we had to run warmup scripts constantly on our secondaries.

Another great thing about PIOPS is that it’s backed by all the normal EBS tools, which makes backups and provisioning a breeze. We use ec2-consistent-snapshot to lock Mongo on a secondary and take a snapshot every 30 minutes. And we use Chef to provision our EC2 nodes, so we can automatically create a new replica set member from snapshot and join it to the replica set in a matter of minutes.

A few caveats about snapshotting

Never EVER snapshot on the primary. Your write lock percentage will bounce between 0% and 100% until the snapshot completes, and performance will severely degrade.

It’s a good idea to set the priority on your snapshot node to 0, so it will never become primary.

When you restore from snapshot, it lazy-loads the blocks from S3. Never make a newly-restored node your primary or it will be miserably slow until it finishes downloading all the blocks.
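As a rough illustration of the priority caveat above, here is a minimal Python sketch that takes a replica set configuration document (the shape returned by rs.conf() in the mongo shell) and returns a copy with the snapshot node pinned to priority 0. The helper name demote_snapshot_member and the hostnames are made up for this example, and the sketch only edits a dictionary; actually applying the result (for instance with rs.reconfig()) is left out.

```python
import copy


def demote_snapshot_member(config, snapshot_host):
    """Return a copy of a replica set config with snapshot_host at priority 0.

    `config` is assumed to look like the document returned by rs.conf():
    a dict with a "version" number and a "members" list of dicts, each
    carrying at least "_id" and "host". The input is left untouched.
    """
    new_config = copy.deepcopy(config)
    matched = False
    for member in new_config["members"]:
        if member["host"] == snapshot_host:
            # Priority 0 means this member can never be elected primary.
            member["priority"] = 0
            matched = True
    if not matched:
        raise ValueError("no replica set member with host %r" % snapshot_host)
    # A reconfiguration needs a higher version number than the current one.
    new_config["version"] = new_config.get("version", 0) + 1
    return new_config


# Hypothetical three-node replica set; the hostnames are placeholders.
example_config = {
    "_id": "rs0",
    "version": 4,
    "members": [
        {"_id": 0, "host": "db1.example.com:27017"},
        {"_id": 1, "host": "db2.example.com:27017"},
        {"_id": 2, "host": "db-snapshot.example.com:27017"},
    ],
}

updated_config = demote_snapshot_member(example_config, "db-snapshot.example.com:27017")
```

Only the snapshot node changes; the other members keep whatever priority they already had, so normal failover between them is unaffected.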

Also, be aware that you really, really don’t want to hit the IOPS ceiling. MongoDB does not handle this gracefully. All your writes and Mongo activity will just hang for a while. This causes your connection counts, queues, and lock percentages to soar, and is likely to send you into a death spiral until you restart some app servers or reduce traffic on the supply side to recover. You should definitely provision enough IOPS so you never have to spike up to 100%. A good rule of thumb is to take the aggregate daily averages reported by “sar -d” and provision at least 2-3 times that number of IOPS, depending on how spiky your traffic is.
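To make the rule of thumb concrete, here is a small Python sketch of the arithmetic. It assumes the daily average you read off “sar -d” is the tps column (I/O requests per second), that striping spreads I/O evenly across volumes, and that the per-volume cap is the 1000 IOPS mentioned earlier in this post. The function name plan_piops and the sample numbers are invented for illustration.

```python
import math

# Per-volume ceiling quoted in this post; adjust if the limit changes.
MAX_IOPS_PER_VOLUME = 1000


def plan_piops(avg_daily_iops, headroom=3.0, max_per_volume=MAX_IOPS_PER_VOLUME):
    """Estimate how many IOPS to provision and how to split them across a stripe.

    avg_daily_iops: aggregate daily average I/O rate (assumed: sar -d tps).
    headroom: multiplier on the average, 2-3 depending on how spiky traffic is.
    Returns (target_iops, volume_count, iops_per_volume).
    """
    if avg_daily_iops <= 0:
        raise ValueError("avg_daily_iops must be positive")
    if headroom < 1:
        raise ValueError("headroom below 1 would provision less than the average")
    target_iops = math.ceil(avg_daily_iops * headroom)
    # Number of striped volumes needed to stay under the per-volume cap.
    volume_count = math.ceil(target_iops / max_per_volume)
    # Split the target evenly, rounding up so the total never falls short.
    iops_per_volume = math.ceil(target_iops / volume_count)
    return target_iops, volume_count, iops_per_volume


# Made-up example: a node averaging 600 IOPS with fairly spiky traffic.
spiky_plan = plan_piops(600, headroom=3.0)

# Made-up example: a quieter node averaging 250 IOPS with steadier traffic.
steady_plan = plan_piops(250, headroom=2.0)
```

With these sample inputs, the spiky node comes out at 1800 IOPS across two striped volumes of 900 each, and the steady node fits on a single 500-IOPS volume.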