I'll start with my tweet:

One of the frustrating things about operating ZFS on Linux is that the ARC size is critical but ZFS's auto-tuning of it is opaque and apparently prone to malfunctions, where your ARC will mysteriously shrink drastically and then stick there.

Linux's regular filesystem disk cache is very predictable; if you do disk IO, the cache will relentlessly grow to use all of your free memory. This sometimes disconcerts people when free reports that there's very little memory actually free, but at least you're getting value from your RAM. This is so reliable and regular that we generally don't think about 'is my system going to use all of my RAM as a disk cache', because the answer is always 'yes'.

(The general filesystem cache is also called the page cache.)

This is unfortunately not the case with the ZFS ARC in ZFS on Linux (and it wasn't necessarily the case even on Solaris). ZFS has both a current size and a 'target size' for the ARC (called 'c' in ZFS statistics). When your system boots, this target size starts out as the maximum allowed size for the ARC, but various events afterward can cause it to be reduced (which obviously limits the size of your ARC, since that's the target's purpose). In practice, this reduction in the target size is both pretty sticky and rather mysterious, as ZFS on Linux doesn't currently expose enough statistics to tell you why your ARC target size shrank in any particular case.
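As a concrete sketch of where these numbers live: on Linux, the ARC statistics are exposed in /proc/spl/kstat/zfs/arcstats, where 'size' is the current ARC size, 'c' is the target size, and 'c_max' is the maximum allowed size. The sample data below is made up for illustration, but it's in the same three-column kstat format:

```python
# Compare the ARC's current size ('size') with its target size ('c')
# and maximum ('c_max'). On a real system you would read
# /proc/spl/kstat/zfs/arcstats; this sample text is invented, but it
# follows the same 'name type data' kstat layout.
SAMPLE_ARCSTATS = """\
name                            type data
size                            4    125627801600
c                               4    140737488355
c_max                           4    166430511104
"""

def parse_arcstats(text):
    """Parse kstat 'name type data' lines into a dict of integers."""
    stats = {}
    for line in text.splitlines()[1:]:  # skip the header line
        name, _ktype, data = line.split()
        stats[name] = int(data)
    return stats

stats = parse_arcstats(SAMPLE_ARCSTATS)
# If 'c' is well below 'c_max', the target size has been shrunk.
for field in ("size", "c", "c_max"):
    print("%-6s %6.1f GB" % (field, stats[field] / 1e9))
```

If 'c' sits well below 'c_max' and stays there, you're seeing exactly the sticky shrinkage described above.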

The net effect is that the ZFS ARC is not infrequently quite shy and hesitant about using memory, in stark contrast to Linux's normal filesystem cache. The default maximum ARC size starts out as only half of your RAM (unlike the regular filesystem cache, which will use all of it), and then it shrinks from there, sometimes very significantly, and once shrunk it only recovers slowly (if at all).

This sounds theoretical, so let me make it practical. We have six production ZFS on Linux NFS fileservers, all with 196 GB of RAM and a manually set ARC maximum size of 155 GB. At the moment their ARC sizes range from 117 GB to 145 GB; specifically, 117 GB, 127 GB, three at 132 GB, and 145 GB. On top of this, the fileserver at 117 GB of ARC is a very active one with some very popular and big filesystems (such as our mail spool, which is perennially the most active filesystem we have). Even if we're still getting a good ARC hit rate during active periods, I'm pretty sure that we could get some use out of caching more ZFS data in RAM than we currently are.

(We don't currently have ongoing ARC stats for our fileservers, so I don't know what the ARC hit rate is or why ARC misses happen (cf).)
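If we did collect ongoing stats, the basic ARC hit rate is simple arithmetic over the cumulative 'hits' and 'misses' counters from arcstats. A sketch, with invented counter values; since the counters are cumulative since boot, you sample them twice and take the difference to get a rate over an interval:

```python
# ARC hit rate from the cumulative 'hits' and 'misses' arcstats
# counters. These numbers are invented; on a real fileserver you would
# read /proc/spl/kstat/zfs/arcstats at two points in time.
def hit_rate(hits, misses):
    """Return the hit rate as a percentage (0.0 if there was no traffic)."""
    total = hits + misses
    return 100.0 * hits / total if total else 0.0

# Two hypothetical samples taken some interval apart:
before = {"hits": 1_000_000, "misses": 50_000}
after = {"hits": 1_900_000, "misses": 150_000}

# The interval's activity is the delta between the two samples.
delta_hits = after["hits"] - before["hits"]
delta_misses = after["misses"] - before["misses"]
print("interval hit rate: %.1f%%" % hit_rate(delta_hits, delta_misses))
# 900,000 hits vs 100,000 misses over the interval -> 90.0%
```

(This only tells you the hit rate, of course, not why the misses happened, which is a separate gap in the available statistics.)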

Part of the problem here is not just that the ARC target size shrinks; it's that you can't tell why, and there aren't really any straightforward and reliable ways to tell ZFS to reset it. And since you can't tell why the ARC target size shrank, you can't tell whether ZFS actually had a good reason for shrinking the ARC. The auto-sizing is great when it works, but it's very opaque when it doesn't, and you can't tell the difference.

PS: Several years ago, I saw memory competition between the ARC and the page cache on my workstation, but then the issue went away. I don't think our fileserver ARC issues are due to page cache contention, partly because the entire ext4 root filesystem on them is only around 20 GB. Even if all of it is completely cached in RAM, there's a bunch of ARC shrinkage that's left unaccounted for. Similarly, the sum of smem's PSS for all user processes is only a gigabyte or two. There just isn't very much happening on these machines.

PPS: This is with an older version of ZFS on Linux, but my office workstation with a bleeding edge ZoL doesn't do any better (in fact it does worse, with periodic drastic ARC collapses).