ZFS is an incredible filesystem and solves many of my local and shared data storage needs.

While, I do like the idea of clustered ZFS wherever possible, sometimes it's not practical, or I need some geographical separation of storage nodes.

One of the use cases I have is for high-performance replicated storage on Linux application servers. For example, I support a legacy software product that benefits from low-latency NVMe SSD drives for its data. The application has an application-level mirroring option that can replicate to a secondary server, but it's often inaccurate and is a 10-minute RPO.

I've solved this problem by having a secondary server (also running ZFS on similar or dissimilar hardware) that can be local, remote or both. By combining the three utilities detailed below, I've crafted a replication solution that gives me continuous replication, deep snapshot retention and flexible failover options.

zfs-auto-snapshot - https://github.com/zfsonlinux/zfs-auto-snapshot

Just a handy tool to enable periodic ZFS filesystem level snapshots. I typically run with the following schedule on production volumes:

# /etc/cron.d/zfs-auto-snapshot PATH="/usr/bin:/bin:/usr/sbin:/sbin" */5 * * * * root /sbin/zfs-auto-snapshot -q -g --label=frequent --keep=24 // 00 * * * * root /sbin/zfs-auto-snapshot -q -g --label=hourly --keep=24 // 59 23 * * * root /sbin/zfs-auto-snapshot -q -g --label=daily --keep=14 // 59 23 * * 0 root /sbin/zfs-auto-snapshot -q -g --label=weekly --keep=4 // 00 00 1 * * root /sbin/zfs-auto-snapshot -q -g --label=monthly --keep=4 //

Syncoid (Sanoid) - https://github.com/jimsalterjrs/sanoid

This program can run ad-hoc snap/replication of a ZFS filesystem to a secondary target. I only use the syncoid portion of the product.

Assuming server1 and server2, simple command run from server2 to pull data from server1:

#!/bin/bash /usr/local/bin/syncoid root@server1:vol1/data vol2/data exit $?

Monit - https://mmonit.com/monit/

Monit is an extremely flexible job scheduler and execution manager. By default, it works on a 30-second interval, but I modify the config to use a 15-second base time cycle.

An example config that runs the above replication script every 15 seconds (1 cycle)

check program storagesync with path /usr/local/bin/run_storagesync.sh every 1 cycles if status != 0 then alert

This is simple to automate and add via configuration management. By wrapping the execution of the snapshot/replication in Monit, you get centralized status, job control and alerting (email, SNMP, custom script).

The result is that I have servers that have multiple months of monthly snapshots and many points of rollback and retention within: https://pastebin.com/zuNzgi0G - Plus, a continuous rolling 15-second atomic replica:

# monit status