Case Study: SSDs in AnandTech's Server Environment

For the majority of the history of AnandTech we've hosted our own server infrastructure. A benefit of running our own infrastructure is that we're able to gain a lot of hands-on experience with enterprise environments that we'd otherwise have to report on from a distance.

When I first started covering SSDs four years ago I became obsessed with the idea of migrating nearly every system over to something SSD based. The first to make the switch were our CPU testbeds. Moving away from mechanical drives ensured better benchmark consistency between runs as any variation in IO load was easily absorbed by the tremendous amount of headroom that an SSD offered. The holy grail of course was migrating all of the AnandTech servers over to SSDs. Over the years our servers seem to die in the following order: hard drives, power supplies, motherboards. We tend to stay on a hardware platform until the systems start showing the signs of their age (e.g. motherboards start dying), but that's usually long enough that we encounter an annoying number of hard drive failures. A well validated SSD should have a predictable failure rate, making it an ideal candidate for an enterprise environment where downtime is quite costly and in the case of a small business, very annoying.

Our most recent server move is a long story for a separate article, but to summarize, we recently switched hosting providers and data centers. Our hardware was formerly on the east coast and the new datacenter is in the middle of the country. At our old host we were trying out a new cloud platform while our new home would be a mixture of a traditional back-end with a virtualized front-end. With a tight timetable for the move and no desire to deploy an easily portable solution at our old home before making the move, we were faced with a difficult task: how do we physically move our servers halfway across the country with minimal downtime?

Thankfully our new host had temporary hardware very similar in capabilities to our new infrastructure that they were willing to put the site on as we moved our hardware. The only exception was, as you might guess, a relative lack of SSDs. Our new hardware uses a combination of consumer and enterprise SSDs but our new host only had mechanical drives or consumer grade SSDs on tap (Intel SSD 320s). The fact that we had run the site's databases off of a mechanical drive array for years meant that even a small number of consumer drives should be more than capable of handling the load. Remembering back to some of our earliest lessons in the SSD space: a single solid state drive can offer an order of magnitude better random IO performance than even the fastest hard drives. Sequential performance is typically closer, but with modern 3Gbps SSDs you're still looking at roughly a 100MB/s advantage in sequential throughput over the fastest mechanical drives.

Not wanting to deal with any potential IO performance issues, we decided to deploy a bunch of the consumer grade Intel SSDs that our host had on hand in our temporary DB servers. This isn't uncommon; in fact a huge percentage of enterprise workloads are served just fine by consumer SSDs. It's only the absolute heaviest of workloads that demand eMLC or SLC drives. Given that we were going to stay on this platform for a short period of time, there was no need for eMLC/SLC drives.

As we know from years of dealing with SSDs, NAND cells have a finite lifespan. Consumer grade MLC NAND is good for about 5000 program/erase cycles per cell, while enterprise grade MLC (eMLC/MLC-HET) can get you over 6x that. While most client workloads won't ever hit 5000 p/e cycles, a heavy enough enterprise workload can definitely reach that point. It's not just the amount of writing you do, it's also how much each write is amplified. Remember that although NAND is programmed at the page level (4KB - 8KB), it can only be erased at the block level (512KB - 2048KB). This imbalance in write/erase granularity means that eventually the drive will have to write more to NAND than the host sent to it (e.g. you go to write 8KB but have to read, modify and write an entire 2048KB block as there are no empty blocks to write to). The ratio of NAND writes to host writes is referred to as write amplification. The combination of workload and write amplification is what determines the longevity of any SSD, but it's in the enterprise world that you actually need to pay attention to it.
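To make the ratio concrete, here is a minimal Python sketch of the definition. The function name is purely illustrative, and the 8KB page and 2048KB block sizes are simply the upper ends of the ranges quoted above, not measurements from any particular drive.

```python
def write_amplification(nand_bytes_written, host_bytes_written):
    """Ratio of bytes physically written to NAND to bytes the host sent."""
    if host_bytes_written <= 0:
        raise ValueError("host_bytes_written must be positive")
    return nand_bytes_written / host_bytes_written

KB = 1024

# Worst case from the text: an 8KB host write forces a read-modify-write
# of an entire 2048KB block because no empty blocks are available.
worst_case = write_amplification(2048 * KB, 8 * KB)
print(worst_case)  # 256.0
```

A value of 1.0 would mean every host byte costs exactly one byte of NAND writes; anything above that eats into the drive's p/e cycles faster than the host write volume alone suggests.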



Write amplification around 1 may be realistic for client (read heavy) workloads, but not in the enterprise

Our host had eight Intel SSD 320s (120GB) on hand that we could use for our temporary database servers. From a performance standpoint these drives should be more than enough to handle our workload, but would they be reliable?

The easiest way to combat write amplification is to increase the amount of spare area on an SSD. NAND that isn't user addressable can be used for background operations and will help ensure that there are empty blocks to be written to as often as possible. If writes happen on empty vs. full blocks, write amplification goes down considerably.

We deployed the Intel SSD 320s partitioned down to 100GB each. Curious as to how they've been holding up over the past ~9 days that they've been running as our primary environment, I installed and ran Intel's SSD Toolbox on our DB servers.

As I mentioned in our review of Intel's SSD 710 you can actually measure things like write amplification, host writes and estimated lifespan using SMART attributes on the latest Intel SSDs. This won't work on the Intel SSD 510, X25-E or first generation X25-M, but on all other Intel SSDs it will.

The attributes of importance are E1, E2, E3 and E4 (these are hex representations of the SMART attribute numbers 225 - 228). E2 - E4 are all timer dependent, while E1 gives you an indication of just how much the host has written to the drive over its life. I'll get into the timer based counters shortly, but when we originally set up the servers I didn't have the foresight to reset the counters to get an accurate estimate of write amplification on our workload. Instead I'll look at total bytes written and use some internal estimates for write amplification to gauge lifespan.

The eight drives are divided among two database servers running two different applications (one for the main site and one for the forums/ads). The latter is more heavily loaded but both are pretty demanding. The four drives in each server are configured in a RAID10 to increase capacity and offer some redundancy. A simple RAID1 of larger drives would be fine, but 120GB drives are all we had on hand at the time. The total number of writes across all of the drives for the past 9 days is documented in the table below:

AnandTech Database Server SSD Workload

SMART Attribute E1    MS SQL Server (Main Site DB)    MySQL Server (Forums/Ads DB)
Drive 0               576.28GB                        1.03TB
Drive 1               563.28GB                        1.03TB
Drive 2               564.44GB                        1.13TB
Drive 3               568.03GB                        1.13TB

Each drive in the first DB server has seen around 570GB of writes in 9 days, or roughly 63GB/day. The drives in the second DB server have gone through between 1.03TB and 1.13TB of writes in the same period of time, or at least 114GB/day. Note that both workloads are an order of magnitude greater than an average consumer workload of 5 - 10GB/day. That's not to say that we can't run these workloads on consumer SSDs; we just need to be careful.

With no write amplification we could run on these consumer drives practically indefinitely. With each MLC NAND cell good for 5000 program/erase cycles, we could write to the drives 5000 times over before we started to lose NAND. Based on the numbers above, we'd blow through a p/e cycle every ~2 days on the first DB server and every ~1 day on the second server, so even the busier set of drives would take well over a decade to use up its 5000 cycles.
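The arithmetic behind those figures can be sketched in a few lines of Python. This is only an illustration under the same simplifications the text makes: the full 120GB is treated as the amount written per p/e cycle, write amplification is assumed to be exactly 1, and the per-server totals are the rounded 570GB and 1.03TB values. The function and constant names are made up for this example.

```python
DRIVE_CAPACITY_GB = 120   # Intel SSD 320 capacity used in the text
RATED_PE_CYCLES = 5000    # consumer MLC rating quoted above
DAYS_OBSERVED = 9

def days_per_pe_cycle(host_writes_gb, days=DAYS_OBSERVED,
                      capacity_gb=DRIVE_CAPACITY_GB, write_amp=1.0):
    """Days needed to write the drive's full capacity to NAND once."""
    nand_gb_per_day = host_writes_gb * write_amp / days
    return capacity_gb / nand_gb_per_day

for name, written_gb in [("MS SQL", 570), ("MySQL", 1030)]:
    per_cycle = days_per_pe_cycle(written_gb)
    years = per_cycle * RATED_PE_CYCLES / 365
    print(f"{name}: {written_gb / DAYS_OBSERVED:.0f}GB/day, "
          f"one p/e cycle every {per_cycle:.1f} days, "
          f"about {years:.0f} years at 1x write amplification")
```

Passing a different `write_amp` shows how quickly that comfortable margin shrinks once each host write costs more than one NAND write.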

While we like to assume write amplification is nice and low, in reality it isn't. Intel's own datasheets tell us the worst case write amplification for the 320:

If you divide the rated endurance (the column on the right) by the drive capacity (the column on the left) you'll come up with 125 program/erase cycles per cell (if you define 1TB as one trillion bytes). If we assume that each cell is good for 5000 p/e cycles (Intel's 25nm MLC NAND spec) then it means that we're actually writing 40x what we think we're writing. This 40x value gives us an upper bound for write amplification on Intel's SSD 320. That's far lower than the theoretical maximum write amplification of 256 (writing 2048KB for every 8KB write sent to the host), but it's safe to say that Intel's firmware won't let things get that bad.
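As a quick check of that reasoning, the sketch below reproduces the 40x bound in Python. It uses only the two numbers given in the text (125 full-drive writes implied by the datasheet and a 5000 cycle NAND rating); the variable names are illustrative, and the 15TB figure is just 125 multiplied by a 120GB capacity.

```python
RATED_PE_CYCLES = 5000        # Intel 25nm MLC rating quoted in the text
DATASHEET_DRIVE_WRITES = 125  # rated endurance divided by drive capacity
CAPACITY_GB = 120

# If the NAND survives 5000 cycles but the datasheet only promises 125
# full-drive writes from the host, each host write must cost up to
# 5000 / 125 NAND writes in Intel's worst case.
worst_case_write_amp = RATED_PE_CYCLES / DATASHEET_DRIVE_WRITES

# Host writes that correspond to 125 full-drive writes of a 120GB drive,
# with 1TB defined as 1000GB as in the text.
rated_endurance_tb = DATASHEET_DRIVE_WRITES * CAPACITY_GB / 1000

print(worst_case_write_amp)  # 40.0
print(rated_endurance_tb)    # 15.0
```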

Write amplification of 40x isn't very good but it's also not very realistic for the majority of workloads. Our database workloads are heavy but they are not perfectly random writes over all LBAs for the life of the drive. Those workloads do exist, but we're simply not an example of one. A more realistic, but still conservative estimate for write amplification in our case would be 10x (just based on some internal estimates for write amplification). The actual write amp is likely less than half that but again, I wanted to be conservative.

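Calculating longevity based on this data is pretty simple. Multiply the total bytes written by the estimated write amplification and divide by the drive capacity to get the number of p/e cycles used, then scale that daily rate up to the number of cycles available: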

Intel SSD 320 - 120GB - SSD Longevity

                                  MS SQL Server (Main Site DB)  MySQL Server (Forums/Ads DB)
Total Bytes Written (9 days)      576GB                         1030GB
Estimated Write Amplification     10x                           10x
Drive Capacity                    120GB                         120GB
Cycles Used                       48                            85.8
Cycles Used per Day               5.33                          9.53
Worst Case P/E Cycles Available   5000                          5000
Estimated Lifespan (Days)         937.5                         524.3
Estimated Life (Years)            2.57 years                    1.44 years
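The same calculation can be expressed as a small Python function, which makes it easy to plug in a different write amplification estimate. This is a sketch of the arithmetic in the table, not a tool from Intel; the function name and dictionary keys are invented for illustration.

```python
def estimate_lifespan(host_writes_gb, days_observed, write_amp,
                      capacity_gb, rated_pe_cycles):
    """Project drive life from observed host writes and an assumed write amplification."""
    cycles_used = host_writes_gb * write_amp / capacity_gb
    cycles_per_day = cycles_used / days_observed
    lifespan_days = rated_pe_cycles / cycles_per_day
    return {
        "cycles_used": cycles_used,
        "cycles_per_day": cycles_per_day,
        "lifespan_days": lifespan_days,
        "lifespan_years": lifespan_days / 365,
    }

for name, written_gb in [("MS SQL", 576), ("MySQL", 1030)]:
    r = estimate_lifespan(written_gb, days_observed=9, write_amp=10,
                          capacity_gb=120, rated_pe_cycles=5000)
    print(f"{name}: {r['cycles_used']:.1f} cycles used, "
          f"{r['lifespan_days']:.1f} days, {r['lifespan_years']:.2f} years")
```

Halving `write_amp` to 5 doubles both estimates, to roughly 5 and 3 years.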

With 10x write amplification we're looking at roughly 2.5 years for one set of drives and 1.4 years for the other set. From a cost standpoint, it's likely cheaper to go the consumer drive route and proactively replace drives compared to jumping to eMLC although that's not always desirable from an uptime perspective. If our write amplification estimates are off (they are) then you can expect something more along the lines of 5 years for the first set of drives and 3 for the second set.

We could combat write amplification by setting aside even more spare area for the drives. Partitioning them as 80GB or even 60GB drives would tangibly reduce write amplification and give us even more time on these consumer drives.

This isn't so much an issue for us as our stay on the 320s is temporary, but it did bring up an interesting question. As enterprise SSD endurance is heavily dependent on write amplification, how would Intel's SandForce based SSD 520 handle enterprise workloads given that its effective write amplification is often times less than 1x?

If you were able to average 0.5x write amplification with Intel's SSD 520, your 5000 p/e cycle MLC NAND would behave more like 10000 p/e cycle NAND. While that's still short of what you get from eMLC, it is perhaps enough to be a better balance of price/performance for many SMB enterprise customers.

It was time to do some investigating...

Intel's SSD 520 in the Enterprise

I went through the basic premise of SandForce's controller architecture in our review of the 520. By integrating a real time data compression/deduplication engine in the data path of the controller, SandForce can reduce the number of physical writes it commits to NAND. It's an interesting way of combating the issue of finite NAND flash endurance. It works very well on desktop systems (BSOD issues aside), and for many enterprise workloads it should do similarly well. By writing less, you can get more endurance out of your NAND, making it an ideal technology for use in the enterprise where NAND endurance is more of a concern.

The limitations are serious however. You cannot further compress something that is already compressed and data sets that are truly random in makeup can't be compressed either. If your enterprise workload triggers either of these conditions, or if you're working with encrypted data, you're not going to get a big benefit from SandForce's technology.

There are still a lot of enterprise workloads (including portions of ours) that just revolve around reading and writing simple text (e.g. pages of a review, or tracking banner impressions). For these workloads, SandForce could do quite well.

Intel's SSDs have often been used in datacenter environments, including the consumer drives for reasons I've already described. Armed with a full set of Intel SSDs I put all of them through our newly created Enterprise SSD suite to see how well they performed.

Enterprise SSD Comparison

                                           Intel SSD 710      Intel X25-E      Intel SSD 520                 Intel SSD 320
Capacities                                 100 / 200 / 300GB  32 / 64GB        60 / 120 / 180 / 240 / 480GB  80 / 120 / 160 / 300 / 600GB
NAND                                       25nm HET MLC       50nm SLC         25nm MLC                      25nm MLC
Max Sequential Performance (Reads/Writes)  270 / 210 MBps     250 / 170 MBps   550 / 520 MBps                270 / 220 MBps
Max Random Performance (Reads/Writes)      38.5K / 2.7K IOPS  35K / 3.3K IOPS  50K / Not Listed IOPS         39.5K / 600 IOPS
Endurance (Max Data Written)               500TB - 1.5PB      1 - 2PB          Not Listed                    5 - 60TB
Encryption                                 AES-128            -                AES-256                       AES-128
Power Safe Write Cache                     Y                  N                N                             Y
Temp Sensor                                Y                  N                N                             N

It's worth pointing out that the Intel SSD 520 and 510 are both 6Gbps drives, while many servers deployed today still only support 3Gbps SATA. I've provided results for both 3Gbps and 6Gbps configurations to showcase the differences.

The Test

Note that although we debuted these tests in previous reviews, the results here aren't comparable due to some changes in the software build on the system.

CPU: Intel Core i7 2600K running at 3.4GHz (Turbo & EIST Disabled)
Motherboard: Intel H67 Motherboard
Chipset: Intel H67
Chipset Drivers: Intel 9.1.1.1015 + Intel RST 10.2
Memory: Qimonda DDR3-1333 4 x 1GB (7-7-7-20)
Video Card: eVGA GeForce GTX 285
Video Drivers: NVIDIA ForceWare 190.38 64-bit
Desktop Resolution: 1920 x 1200
OS: Windows 7 x64

Enterprise Storage Bench - Oracle Swingbench

We begin with a popular benchmark from our server reviews: the Oracle Swingbench. This is a pretty typical OLTP workload that focuses on servers with a light to medium workload of 100 - 150 concurrent users. The database size is fairly small at 10GB, however the workload is absolutely brutal.

Swingbench consists of over 1.28 million read IOs and 3.55 million writes. The read/write GB ratio is nearly 1:1 (bigger reads than writes). Parallelism in this workload comes through aggregating IOs as 88% of the operations in this benchmark are 8KB or smaller. This test is actually something we use in our CPU reviews so its queue depth averages only 1.33. We will be following up with a version that features a much higher queue depth in the future.

SLC NAND offers great write latency and we see a definite advantage to the older X25-E here in our Swingbench test. Only a 6Gbps SSD 520 is able to deliver better performance; everything else trails the 3+ year old drive. Note that the Marvell based Intel SSD 510, even on a 6Gbps controller, is the slowest drive in Intel's lineup. From a write amplification perspective, Marvell's controller has always been significantly behind Intel's own creations so the drop in performance isn't surprising. The 710 actually delivers performance that's lower than the 320, but you do get much better endurance out of the 710.

While throughput isn't much better on the 6Gbps Intel SSD 520, average service time is tangibly lower. There's clearly a benefit to higher bandwidth IO interfaces in the enterprise space, which is a big reason we're seeing a tremendous push for PCIe based SSDs. The 710 does well here but not nearly as well as the X25-E which continues to behave like a modern SSD thanks to its SLC NAND.

Enterprise Storage Bench - Microsoft SQL UpdateDailyStats

Our next two tests are taken from our own internal infrastructure. We do a lot of statistics tracking at AnandTech - we record traffic data to all articles as well as aggregate traffic for the entire site (including forums) on a daily basis. We also keep track of a running total of traffic for the month. Our first benchmark is a trace of the MS SQL process that does all of the daily and monthly stats processing for the site. We run this process once a day as it puts a fairly high load on our DB server. Then again, we don't have a beefy SSD array in there yet :)

The UpdateDailyStats procedure is mostly reads (3:1 ratio of GB reads to writes) with 431K read operations and 179K write ops. Average queue depth is 4.2 and only 34% of all IOs are issued at a queue depth of 1. The transfer size breakdown is as follows:

AnandTech Enterprise Storage Bench - MS SQL UpdateDailyStats IO Breakdown

IO Size    % of Total
8KB        21%
64KB       35%
128KB      35%

Our SQL tests are much more dependent on sequential throughput and thus we really see some impressive gains from moving to a 6Gbps SATA interface. Among the 3Gbps results the Intel SSD 520 is now the top performer, followed once again by the X25-E. To be honest, most of these drives do perform the same as they bump into the limits of 3Gbps SATA.

Once again we see a huge reduction in service time from the Intel SSD 520 running on a 6Gbps interface. Even on a 3Gbps interface the 520 takes the lead while the bulk of the 3Gbps drives cluster together around 14.4ms. Note the tangible difference in performance between the 300GB and 160GB Intel SSD 320. The gap isn't purely because of additional NAND parallelism; the 300GB drive ends up with more effective spare area since the workload size doesn't scale up with drive capacity. What you're looking at here is the impact of spare area on performance.

Enterprise Storage Bench - Microsoft SQL WeeklyMaintenance

Our final enterprise storage bench test once again comes from our own internal databases. We're looking at the stats DB again however this time we're running a trace of our Weekly Maintenance procedure. This procedure runs a consistency check on the 30GB database followed by a rebuild index on all tables to eliminate fragmentation. As its name implies, we run this procedure weekly against our stats DB.

The read:write ratio here remains around 3:1 but we're dealing with far more operations: approximately 1.8M reads and 1M writes. Average queue depth is up to 5.43.

Again, huge gains from the 520 on a 6Gbps interface. Moving over to a 3Gbps interface, all of these drives basically perform the same thanks to the 3Gbps SATA limitation.

Measuring How Long Your Intel SSD Will Last

Earlier in this review I talked about Intel's SMART attributes that allow you to accurately measure write amplification for a given workload. If you fire up Intel's SSD Toolbox or any tool that allows you to monitor SMART attributes you'll notice a few fields of interest. I mentioned these back in our 710 review, but the most important for our investigations here are E2 (226) and E4 (228):

The raw value of attribute E2, when divided by 1024, gives you an accurate report of the amount of wear on your NAND, as a percentage, since the last timer reset. In this case we're looking at an Intel X25-M G2 (the earliest drive to support E2 reporting) whose E2 value is at 9755. Dividing that by 1024 gives us 9.526% (the field is only accurate to three decimal places).
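The conversion is a single division, shown here as a short Python sketch using the X25-M G2 value from the text. The function name is made up for illustration.

```python
def e2_wear_percent(raw_value):
    """Convert the raw E2 (226) value to percent of NAND wear since the last timer reset."""
    return raw_value / 1024

print(f"{e2_wear_percent(9755):.3f}%")  # 9.526%
```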

I mentioned that this data is only accurate since the last timer reset; that's where the value stored in E4 comes into play. By issuing a SMART EXECUTE OFFLINE IMMEDIATE subcommand 40h to the drive you'll reset the timer in E4 and the data stored in E2 and E3. The data in E2/E3 will then reflect the wear incurred since you reset the timer, giving you a great way of measuring write amplification for a specific workload.

How do you reset the E4 timer? I've always used smartmontools to do it. Download the appropriate binary from sourceforge and execute the following command:

    smartctl -t vendor,0x40 /dev/hdX

where X is the drive whose counter you're trying to reset (e.g. hda, hdb, hdc, etc.).

Doing so will reset the E4 counter to 65535. The counter then begins at 0 and will count up in minutes. After the first 60 minutes you'll get valid data in E2/E3. While E2 gives you an indication of how much wear your workload puts on the NAND, E3 gives you the percentage of IO operations that are reads since you reset E4. E3 is particularly useful for determining how write heavy your workload is at a quick glance. Remember, it's the process of programming/erasing a NAND cell that is most destructive - read heavy workloads are generally fine on consumer grade drives.
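If you're collecting these raw values in a script, the rules above can be captured in a small helper. The Python below is a sketch built only on the behaviour described here (E4 reads 65535 right after a reset, data becomes valid after 60 minutes, E2 is divided by 1024, E3 is the read percentage). The function name is hypothetical, and reading the raw values from the drive is left to whatever SMART tool you already use.

```python
E4_JUST_RESET = 65535     # value E4 shows immediately after a reset, per the text
MIN_VALID_MINUTES = 60    # E2/E3 are only valid after the first hour

def read_workload_counters(e2_raw, e3_raw, e4_raw):
    """Interpret the timed workload attributes E2, E3 and E4.

    Returns None while the counters are not yet valid.
    """
    if e4_raw == E4_JUST_RESET or e4_raw < MIN_VALID_MINUTES:
        return None
    return {
        "minutes": e4_raw,
        "wear_percent": e2_raw / 1024,
        "read_percent": e3_raw,
        "write_percent": 100 - e3_raw,
    }

# Made-up raw values for illustration only, not taken from a real drive.
print(read_workload_counters(e2_raw=0, e3_raw=0, e4_raw=65535))
print(read_workload_counters(e2_raw=1024, e3_raw=75, e4_raw=120))
```

Returning `None` rather than zeros keeps a freshly reset drive from being mistaken for one that has seen no wear.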

I reset the workload timer (E4) on all of the Intel SSDs that supported it and ran a loop of our MS SQL Weekly Maintenance benchmark that resulted in 320GB of writes to the drive. I then measured wear on the NAND (E2) and used that to calculate how many TBs we could write to these drives, using this workload, before we'd theoretically wear out their NAND. The results are below:

There are a few interesting takeaways from this data. For starters, Intel's SSD 710 uses high endurance MLC (aka eMLC, MLC-HET) which is good for a significant increase in p/e cycles. With tens of thousands of p/e cycles per NAND cell, the Intel SSD 710 offers nearly an order of magnitude better endurance than the Intel SSD 320. Part of this endurance advantage is delivered through an incredible amount of spare area. Remember that although the 710 featured here is a 200GB drive it actually has 320GB of NAND on board. If you set aside a similar amount of spare area on the 320 you'd get a measurable increase in endurance. We actually see an example of that if you look at the gains the 300GB SSD 320 offers over the 160GB drive. Both drives are subjected to the same sized workload (just under 60GB), but the 300GB drive has much more unused area to use for block recycling. The result is an 85% increase in estimated drive lifespan for an 87.5% increase in drive capacity.

What does this tell us about how long these drives would last? If all they were doing was running this workload once a week, even the 160GB SSD 320 would be just fine. In reality our SQL server does far more than this but even then we'd likely be ok with a consumer drive. Note that the 800TB of writes for the 160GB 320 is well above the 15TB Intel rates the drive for. The difference here is that Intel is calculating lifespan based on 4KB random writes with a very high write amplification. If you work backwards from these numbers for the MLC drives you'll end up with around 4000 - 5000 p/e cycles. In reality, even 25nm Intel NAND lasts longer than what it's rated for so what you're seeing is that this workload has a very low write amplification on these drives thanks to its access pattern and small size relative to the capacity of these drives.
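Both steps of that reasoning, extrapolating lifetime writes from a wear reading and then working backwards to p/e cycles, can be sketched in Python. The function names are illustrative, the extrapolation assumes wear scales linearly with writes, and the 0.04% wear figure in the example is a made-up input chosen so that it lands on the 800TB quoted for the 160GB drive rather than a value measured in this article.

```python
def projected_lifetime_writes_tb(writes_gb, wear_percent):
    """Host writes (in TB) needed to reach 100% wear, extrapolating linearly."""
    return writes_gb / (wear_percent / 100) / 1000

def implied_pe_cycles(lifetime_writes_tb, capacity_gb):
    """Full-drive writes implied by a lifetime host write figure."""
    return lifetime_writes_tb * 1000 / capacity_gb

# Hypothetical reading: 320GB of writes producing 0.04% wear.
lifetime_tb = projected_lifetime_writes_tb(320, 0.04)
print(round(lifetime_tb))

# Working backwards from the 800TB quoted for the 160GB SSD 320.
print(implied_pe_cycles(800, 160))  # 5000.0
```

The implied cycle count only equals the true number of NAND p/e cycles when write amplification is close to 1, which is why landing near the 5000 cycle rating points to a very low write amplification for this workload.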

Every workload is going to be different but what may have been a brutal consumer of IOs in the past may still be right at home on consumer SSDs in your server.

Final Words

The X25-E remains one of the fastest Intel SSDs in the enterprise despite being three generations old from a controller standpoint. The inherent advantages of SLC NAND are undeniable. Intel's SSD 520 regularly comes close to the X25-E in performance and easily surpasses it if you've got a 6Gbps interface. Over a 3Gbps interface, most of these drives end up performing very similarly.

We also showed a clear relationship between performance and drive capacity/spare area. Sizing your drive appropriately for your workload is extremely important for both client and enterprise SSD deployments. On the client side we've typically advocated keeping around 20% of your drive free at all times, but for enterprise workloads with high writes you should shoot for a larger amount. How much spare area obviously depends on your workload but if you do a lot of writing, definitely don't skimp on capacity.

What's most interesting to me is that although the 520 offers great performance, it didn't offer a tremendous advantage in endurance in our tests. Its endurance was in line with the SSD 320, if not a bit lower if we normalize to capacity. Granted this will likely vary depending on the workload, but don't assume that the 520 alone will bring you enterprise class endurance thanks to its lower write amplification.

This brings us to the final point. If endurance is a concern, there really is no replacement for the Intel SSD 710. Depending on the workload you get almost an order of magnitude improvement in drive longevity. You do pay for that endurance though. While an Intel SSD 320 performs similarly to the 710 in a number of areas, the 710 weighs in at around $6/GB compared to sub-$2/GB for the 320. If you can get by with the consumer drives, either the 320 or 520, they are a much better solution from a cost perspective.

Intel gives you the tools to figure out how much NAND endurance you actually need; the only trick is that you'll need to run your workload on an Intel SSD to figure that out first. It's a clever way to sell your drives. The good news is that if you're moving from a hard drive based setup you should be able to at least try out your workload on a small number of SSDs (maybe even one if your data isn't too large) before deciding on a final configuration. There are obviously software tools you can use to monitor writes but they won't give you an idea of write amplification.