July 9, 2018: This blog has been updated to include recommendations on when to not trim disks in order to prevent major slowdowns in production environments.

Is your Elasticsearch "Trimmed"?

Here at Elastic we regularly benchmark the performance of Elasticsearch. The results are publicly available. Looking at the results, we have observed a recurring pattern of performance degradation:

In the picture above we can see peak performance on 2016/09/26 gradually dropping until it recovers again on 2016/10/03, followed by another steady performance drop recovering on 2016/10/10. Have you spotted the pattern yet?

Date       Day of the Week
20160926   Monday
20161003   Monday
20161010   Monday
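The weekdays listed for these three dates can be double-checked with a few lines of Python. This is only a small sanity check; the helper name `weekday_name` is made up for this example.

```python
from datetime import datetime

# Weekday names indexed by datetime.weekday(), where Monday is 0.
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# Dates on which benchmark performance recovered, written as YYYYMMDD.
recovery_dates = ["20160926", "20161003", "20161010"]


def weekday_name(yyyymmdd: str) -> str:
    """Return the English weekday name for a date written as YYYYMMDD."""
    return DAYS[datetime.strptime(yyyymmdd, "%Y%m%d").weekday()]


weekdays = {d: weekday_name(d) for d in recovery_dates}
for date, day in weekdays.items():
    print(date, day)
```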

Having dismissed the possible influence of periodic astronomical effects, we looked for an explanation in the weekly jobs scheduled on the operating system instead.

Our benchmarks run on bare-metal servers that use two SSDs in a software RAID-0 configuration and have Ubuntu 16.04 installed.

The following is the content of cron.weekly on an Ubuntu Server 16.04 LTS based VM. Notice the fstrim job.

vagrant@ubuntu1604vbox:/etc/cron.weekly$ ls
fstrim  man-db

What on earth is TRIM?

Quoting the Wikipedia article:

A trim command (known as TRIM in the ATA command set, and UNMAP in the SCSI command set) allows an operating system to inform a solid-state drive (SSD) which blocks of data are no longer considered in use and can be wiped internally.

SSD drives are composed of groups of Flash (NAND) memory cells. Three major differences from traditional magnetic disks can affect the performance of SSD drives:

Memory cells cannot be overwritten in one operation: existing data need to be erased first before they can be replaced with new data.

Memory cells support a limited number of program/erase cycles.

Memory cells are organized in pages and blocks and SSD drives can only erase entire blocks!

These limitations have forced manufacturers to implement garbage collection (GC), in which blocks containing both pages with data and pages marked for deletion are emptied after their pages with data have been copied to free blocks. SSDs therefore create more IO internally than the OS explicitly generates for write operations; this phenomenon is called write amplification.

When the OS gets a request to delete a file, the filesystem only updates its metadata and doesn't erase the disk addresses holding the actual data. The SSD therefore still treats those pages as valid and keeps copying them during GC runs, which worsens write amplification.

To address this problem, OSes can issue the TRIM command to make the SSD drive aware of erased data; this brings two advantages:

Prevents unnecessary data movement (of data already erased by the OS) during GC runs

Improves the efficiency of GC due to more free space available
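The effect of TRIM on garbage collection can be illustrated with a toy model. This is a deliberately simplified sketch, not how any real SSD controller works: the four-page block, the page labels and the function `pages_to_copy` are all assumptions made for illustration.

```python
def pages_to_copy(block, trimmed):
    """Return the pages GC must relocate before it can erase `block`.

    `block` lists the logical pages stored in one flash block; None marks a
    page the drive already knows is stale (for example, after an overwrite).
    `trimmed` is the set of pages the OS has reported as unused via TRIM.
    """
    return [p for p in block if p is not None and p not in trimmed]


# One flash block with four pages: two pages of file "a", one stale page,
# and one page of file "b".
block = ["a1", "a2", None, "b1"]

# The filesystem has deleted file "a", but only its own metadata changed.
deleted_by_fs = {"a1", "a2"}

# Without TRIM the drive has no idea that a1 and a2 are garbage.
copies_without_trim = pages_to_copy(block, trimmed=set())

# With TRIM the drive was told, so only b1 still needs to be preserved.
copies_with_trim = pages_to_copy(block, trimmed=deleted_by_fs)

print(len(copies_without_trim), len(copies_with_trim))
```

In this model the drive copies three pages to reclaim the block without TRIM and only one page with it. The two avoided copies are exactly the unnecessary data movement described above.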

What does it mean practically?

It means that your SSD drives will operate faster if the filesystem passes information to them about files that have been deleted.

This is what the fstrim cron job does on Ubuntu.

Does it really make such a big difference?

You bet! After moving the fstrim cronjob (after 2016/10/17) to a daily schedule we can see the earlier graph stabilizing:

The detrimental effect of not running TRIM can be seen in ten consecutive runs of a different (PMC) benchmark below. The filesystem was TRIMmed only once, right before the benchmarks started:

Is there any reason NOT to run trim?

While trimming disks before executing benchmarks makes sense (as one example), there may be reasons you don't want to do it in production environments.

Depending on how long ago disks were trimmed, disk performance may severely degrade during trim execution and the trim may take a long time to finish.

For Elastic Cloud Enterprise (ECE) installations, the control plane may become unstable during lengthy trims, with system-wide repercussions. As a result, we recommend avoiding trimming ECE disks manually.

In other production cases, it's strongly recommended to schedule trim execution only during a quiet traffic period or, if the system is constantly running hot, to disable periodic trimming entirely.
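One way to restrict trimming to a quiet period is to have a wrapper script check the clock before it does anything. The sketch below only covers that check; the function name `in_quiet_window` and the 01:00 to 04:00 window are hypothetical, and the right window depends entirely on your own traffic pattern.

```python
from datetime import time


def in_quiet_window(now: time, start: time, end: time) -> bool:
    """Return True if `now` falls within [start, end).

    Windows that cross midnight (for example 23:00 to 02:00) are supported.
    """
    if start <= end:
        return start <= now < end
    # The window wraps past midnight: late evening or early morning counts.
    return now >= start or now < end


# Hypothetical quiet period between 01:00 and 04:00.
QUIET_START = time(1, 0)
QUIET_END = time(4, 0)

print(in_quiet_window(time(2, 30), QUIET_START, QUIET_END))
print(in_quiet_window(time(12, 0), QUIET_START, QUIET_END))
```

A scheduled job could call this with `datetime.now().time()` and skip the trim whenever it returns False.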

Caveats

On CentOS 7 the fstrim.service and its associated systemd timer are disabled by default, so you will need to enable the timer and probably override its default schedule as well.

If you are using LVM and frequently perform resize operations, you will need to set issue_discards to 1 in /etc/lvm/lvm.conf.
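To see how issue_discards is currently set, the configuration file can be scanned for the first uncommented assignment. This is a rough sketch that assumes the usual `name = value` layout of lvm.conf; the helper `issue_discards_enabled` and the sample text are made up for illustration, and the check does not replace LVM's own tooling.

```python
import re


def issue_discards_enabled(lvm_conf_text: str) -> bool:
    """Return True if the first uncommented issue_discards setting is 1."""
    # ^\s* only allows whitespace before the name, so lines that start
    # with '#' (comments) are never matched.
    match = re.search(r"^\s*issue_discards\s*=\s*(\d+)",
                      lvm_conf_text, re.MULTILINE)
    return match is not None and match.group(1) == "1"


# Illustrative excerpt, not copied from a real system.
sample = """
devices {
    # issue_discards = 1
    issue_discards = 0
}
"""

print(issue_discards_enabled(sample))
```

On a real host the text would come from reading /etc/lvm/lvm.conf, for example with `open("/etc/lvm/lvm.conf").read()`.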

The dm-crypt Linux encryption layer, if used, also requires special handling. Instructions for configuring TRIM under LVM / dm-crypt can be found on the Arch Linux wiki page.

The TRIM issue affects cloud instances as well, if they provide direct access to SSDs. This is the case with instance store-backed Linux AMIs on some Amazon EC2 instance types.

Conclusion

Elasticsearch is an IO-intensive application. If you are running it on SSD storage, you should check whether TRIM is enabled and, depending on your load, adjust the frequency of the TRIM job. You may be surprised by the performance gains!
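As a first step, you can check whether a block device accepts discard requests at all. On Linux the kernel exposes this through the `discard_max_bytes` attribute under `/sys/block/<device>/queue/`, where a value of 0 means discards are not supported. The sketch below reads that attribute; the function name `supports_discard` is an assumption for this example, and a True result only means the device can be trimmed, not that a periodic trim job is actually scheduled.

```python
from pathlib import Path


def supports_discard(device: str, sysfs_root: str = "/sys/block") -> bool:
    """Return True if the kernel reports a non-zero discard_max_bytes."""
    path = Path(sysfs_root) / device / "queue" / "discard_max_bytes"
    try:
        return int(path.read_text().strip()) > 0
    except (OSError, ValueError):
        # Missing or unreadable attribute: treat the device as unsupported.
        return False


if __name__ == "__main__":
    root = Path("/sys/block")
    if root.is_dir():
        # Print every block device the kernel knows about and its status.
        for dev in sorted(p.name for p in root.iterdir()):
            print(dev, supports_discard(dev))
```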