A year ago we added SMART metrics collection to our monitoring agent, which collects disk drive attributes on our clients’ servers.

So here are a couple of interesting cases from the real world.

Because we needed it to work without installing any additional software, such as smartmontools, we collect only the basic, non-vendor-specific attributes rather than all of them. That lets us provide a consistent experience, and it spares us the burdensome task of maintaining a knowledge base of vendor-specific details, which I like a lot :)

This time we’ll discuss only one SMART attribute, the “media wearout indicator”. Its normalized value shows the percentage of “write resource” left in the device. Under the hood, the device keeps track of the number of erase cycles the NAND media has undergone, and the percentage is calculated against the maximum number of cycles rated for that device. The normalized value declines linearly from 100 to 1 as the average erase cycle count increases from 0 to that maximum.
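To make the linear mapping concrete, here is a minimal sketch of how such a normalized value could be derived from an erase cycle count. The function name, the rounding, and the 3000-cycle rating in the usage line are assumptions made for illustration; the actual firmware calculation is not public and may differ.

```python
def media_wearout_indicator(avg_erase_cycles: float, max_erase_cycles: float) -> int:
    """Approximate the normalized value: 100 for a fresh device, 1 when worn out.

    Hypothetical helper: real firmware may round or clamp differently.
    """
    if max_erase_cycles <= 0:
        raise ValueError("max_erase_cycles must be positive")
    # Fraction of the rated erase cycles already consumed, clamped to [0, 1].
    used_fraction = min(max(avg_erase_cycles / max_erase_cycles, 0.0), 1.0)
    # Linear decline from 100 to 1; the value is never reported below 1.
    return max(1, round(100 - 99 * used_fraction))


# Assumed rating of 3000 cycles, chosen only for the example.
print(media_wearout_indicator(0, 3000))     # fresh device
print(media_wearout_indicator(3000, 3000))  # rated cycles exhausted
```

Clamping at 1 mirrors the behaviour described here: once the rated cycles are used up, the indicator stays at its floor rather than going to zero.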

Are there any actually dead SSDs?

Although SSDs are pretty common nowadays, just a couple of years ago you could hear a lot of fearful talk about SSD wearout. We wanted to see whether any of it was true, so we searched for the maximum wearout across all the devices of all of our clients.

The lowest remaining value we found was just 1%.

According to the docs, the value simply never goes below 1%. So this drive is worn out.

We notified this client. It turned out to be a dedicated server at Hetzner. Their support replaced the device:

Do SSDs die fast?

Since we introduced SMART monitoring for some of our clients a while ago, we have accumulated some history, and now we can see it on a timeline.

Unfortunately, the server with the highest wearout rate across our clients’ servers was added to okmeter.io monitoring only two months ago:

This chart indicates that in just these two months it burned through 8% of its “write resource”.

So under that load, 100% of this SSD’s lifetime will be used up in 100 / (8 / 2) = 25 months, or roughly 2 years.
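The same back-of-the-envelope estimate can be written as a small helper. This is only a sketch: the function name is made up for illustration, and it assumes the wear rate stays constant, which a two-month window cannot really guarantee.

```python
def months_to_full_wearout(percent_used: float, months_observed: float) -> float:
    """Linear extrapolation: months needed to burn 100% at the observed rate."""
    if percent_used <= 0 or months_observed <= 0:
        raise ValueError("both arguments must be positive")
    percent_per_month = percent_used / months_observed
    return 100.0 / percent_per_month


# Numbers from the chart: 8% of the write resource used in 2 months.
months = months_to_full_wearout(percent_used=8, months_observed=2)
print(f"{months:.0f} months, about {months / 12:.1f} years")
```

Note that this is the total lifetime of a fresh drive at this rate; for a drive that is already partly worn, the remaining time would use the remaining percentage instead of 100.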

Is that a lot or too little? I don’t know. But let’s check what kind of load it’s serving.

As you can see, it’s Ceph doing all the disk writes, but it’s not doing these writes for itself: it’s a storage system for some application. This particular environment was running under Kubernetes, so let’s sneak a peek at what’s running inside:

It’s Redis! You might have noticed that the values here diverge from the previous chart: they are 2 times lower (probably due to Ceph’s data replication). But the load profile is the same, so we conclude it’s Redis after all.

Let’s see what Redis is doing:

So on average it’s fewer than 100 write commands per second. As you might know, there are two ways Redis makes actual writes to disk:

RDB, which periodically snapshots the whole dataset to disk, and

AOF, which writes a log of all the changes.
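Because an RDB snapshot rewrites the whole dataset each time, its disk write volume depends on dataset size and dump frequency rather than on the number of write commands. Here is a rough sketch of that arithmetic. The function name is hypothetical, and the 1 GiB snapshot size and 60-second interval are assumed values for illustration only, not measurements from this server.

```python
def rdb_bytes_written_per_day(snapshot_size_bytes: int, dump_interval_seconds: int) -> int:
    """Estimate daily disk writes caused by periodic full-dataset snapshots."""
    if snapshot_size_bytes < 0 or dump_interval_seconds <= 0:
        raise ValueError("size must be non-negative and interval positive")
    dumps_per_day = 86_400 // dump_interval_seconds
    # Every dump writes the full snapshot again, regardless of how little changed.
    return snapshot_size_bytes * dumps_per_day


GIB = 1024 ** 3
# Assumed values: a 1 GiB snapshot dumped every 60 seconds.
daily = rdb_bytes_written_per_day(snapshot_size_bytes=1 * GIB, dump_interval_seconds=60)
print(f"{daily / GIB:.0f} GiB written per day")
```

This is why a modest command rate can still translate into heavy, steady disk writes when snapshots are frequent.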

It’s obvious that here we are seeing RDB with 1-minute dumps:

Case: SSD + RAID

We see that there are three common patterns of server storage system setup with SSDs:

Two SSDs in a RAID-1 that holds everything there is.

Some HDDs + SSDs in a RAID-10. We see that setup a lot on traditional RDBMS servers: the OS, WAL and some “cold” data live on HDDs, while the SSD array holds the hottest data.

Just a bunch of SSDs (JBOD) for a NoSQL database such as Apache Cassandra.

In the first case, with RAID-1, writes go to both disks symmetrically, and wearout happens at the same rate:

Looking for anomalies, we found one server where it was completely different: