Today, I happened to notice that one of my email log scanning scripts wasn't reporting on a log entry that I knew was there (because another, related script was reporting it). My log scanning script starts out with a grep to filter out some things I don't want to include:

grep -hv 'a specific pattern' "$@" | exigrep '...' | [...]

I had all sorts of paranoid thoughts about whether I had misunderstood exactly what the -v option did, or if exigrep was doing something peculiar, and so on. But eventually I ran the grep by itself on the file, piped to less, and jumped to the end because I happened to know that the missing entry was relatively late in the file. What I was expecting was that the grep output would just stop at some point. What I actually found was simple:

2020-04-13 16:07:06 H=(111iu.com) [223.165.241.9] [...]
2020-04-13 16:07:07 unexpected disconnection [...]
Binary file /var/log/exim4/mainlog matches

Ah. Yes. How helpful. While reading along in what it had up until then thought was a text file, GNU Grep encountered some funny characters (in a DKIM signature information line, as it happened), decided that the file was actually binary, and from that point on reported nothing for the rest of the file except that final line.

(This is a different and much more straightforward cause than the time GNU Grep thought some text files were binary because of a filesystem bug combined with its clever tricks.)
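The behaviour is easy to reproduce on a small scale. The following is a minimal sketch, not the original log: the file names and contents are made up, and a NUL byte stands in for the funny characters because it triggers binary detection regardless of locale. With a file this small, GNU Grep usually notices the NUL before printing anything, so it prints no text lines at all instead of stopping partway through.

```shell
# Two hypothetical sample files: one plain text, one with a NUL byte in the middle line.
clean=$(mktemp)
binary=$(mktemp)
printf 'first entry\nodd bytes\nlast entry\n' > "$clean"
printf 'first entry\nodd \000 bytes\nlast entry\n' > "$binary"

# The same filter on both files; count how many lines reach standard output.
clean_lines=$(grep -v 'no such pattern' "$clean" | wc -l | tr -d ' ')
binary_lines=$(grep -v 'no such pattern' "$binary" 2>/dev/null | wc -l | tr -d ' ')

# The clean file passes through whole. For the other one, GNU Grep prints at most
# a "Binary file ... matches" notice (newer releases send it to standard error).
echo "clean: $clean_lines lines, binary: $binary_lines lines"

rm -f "$clean" "$binary"
```

Anything downstream in a pipeline sees only what made it to standard output, which is why the missing entries vanish silently.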

I generally like the GNU versions of standard Unix utilities and the things that they've added, but this is not one of them, especially when GNU Grep's output is not going to a terminal. If it has started out printing text lines, it should continue to do so rather than surprise people this way.

The valuable learning experience here is that any time I'm processing a text file with GNU Grep (which is pretty much all of the time in my scripts), I should explicitly force it to always treat things as text. This is unfortunately going to make some scripts more awkward, because sometimes I have pipelines with several greps involved as text is filtered and manipulated. Either I spray '-a' over all of the greps, or I try to figure out what minimal LC_<something> environment variable will turn this off, or I reach for the gigantic hammer of 'LC_ALL=C' (as suggested by the GNU Grep manpage).
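As a rough sketch of what the '-a' approach looks like in a multi-grep pipeline, here it is written out both ways: once with '-a' on every grep, and once through a small wrapper function so the flag only appears in one place. The sample log and the tgrep name are made up for illustration. The 'LC_ALL=C' route is left out because, as far as the GNU Grep documentation suggests, it changes how improperly encoded bytes are treated but NUL bytes would still count as binary.

```shell
# Hypothetical sample log with a NUL byte in one of the lines we want to keep.
log=$(mktemp)
printf 'keep one\nskip me\nkeep \000 two\nkeep three\n' > "$log"

# Option 1: pass -a to every grep in the pipeline.
sprayed=$(grep -a -v 'skip' "$log" | grep -a -c 'keep')

# Option 2: a tiny wrapper (tgrep is an invented name) so -a is written only once.
tgrep() { grep -a "$@"; }
wrapped=$(tgrep -v 'skip' "$log" | tgrep -c 'keep')

# Both pipelines should count all three 'keep' lines, including the one with the NUL.
echo "sprayed: $sprayed, wrapped: $wrapped"

rm -f "$log"
```

Every grep in the chain needs the flag, because each later grep reads the earlier one's output and makes its own decision about whether that input is binary.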