I used to work on a Java application that ran 24/7 and logged to a file on disk. The log file was rotated every week and usually weighed in at around 4GB.

When the shit hit the fan, I checked the log and tried to reverse-engineer how things got so bad. This is similar to what investigators do with a black box after a plane crash. How do you inspect a 4GB file with a text editor? You might be surprised to know how far Vim can take you in that direction. I have opened gigabyte-sized files before, and it worked … for some value of “worked”.

Luckily, the log4j format the application used contained the timestamp in an ISO 8601-style format. It looked something like YYYY-MM-DD hh:mm:ss. Thankfully, this is trivial to parse, and because the fields are fixed-width and run from most to least significant, lexicographic sorting (read: plain old sorting) keeps the dates in chronological order.
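To make the sorting claim concrete, here is a minimal sketch. The three timestamps are made up for illustration, and sorted.txt is just a scratch file name.

```shell
# Three made-up timestamps, deliberately out of chronological order.
# LC_ALL=C forces a plain byte-wise sort, independent of the locale.
printf '%s\n' \
  '2009-06-28 05:00:00' \
  '2009-06-27 23:59:59' \
  '2009-06-28 04:15:30' | LC_ALL=C sort > sorted.txt

# The plain sort comes out in chronological order.
cat sorted.txt
```

The same property is what makes it safe to treat "first line matching a later hour" as a boundary in the log.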

grep and sed

I’ll cover a simpler example and come back to dates later.

(seq might be called gseq on your system)

seq 10000 > 10000.txt

This created a 10000-line file with one number, from 1 to 10000, per line. I used this contrived example instead of ISO 8601 formatted dates because it was simple to generate and the relationship between the line number and the line content is obvious.

The next piece of the puzzle is grep. Grep has the -n/--line-number flag to “prefix each line of output with the line number”.

We’re going to extract from the line containing 444 to the line containing 2000. Of course, we know what those line numbers are because of how we generated this file. This is usually not the case.
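A naive first attempt might look like the sketch below. The bare pattern is an assumption about what a first try looks like, chosen to show how many lines it hits in the generated file.

```shell
# Regenerate the sample file so the example is self-contained
# (use gseq instead of seq if that is what your system calls it).
seq 10000 > 10000.txt

# A bare pattern matches every line that contains 444 anywhere,
# not just the one line we are after. Show the first few hits,
# each prefixed with its line number by -n.
grep -n 444 10000.txt | head -n 3

# Count how many lines the loose pattern matches in total.
grep -c 444 10000.txt
```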

Right, we want only the first match. A bare 444 also matches lines such as 1444 and 4440, so part of the solution is a tighter regular expression like ^444$. The other part, and the reason I ran this example, is to notice that grep keeps parsing the file after the first match is found. On huge files, waiting for grep to finish is both time-consuming and unnecessary.

The -m NUM/--max-count=NUM flag will “stop reading a file after NUM matching lines.”
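Putting the two flags together with anchored patterns could look like this. It is a sketch against the generated file; the use of cut to strip everything after the line number is my own addition.

```shell
# Regenerate the sample file so the example is self-contained.
seq 10000 > 10000.txt

# -n prefixes the match with its line number, -m 1 stops reading
# at the first match, and ^...$ anchors the pattern to the whole line.
grep -n -m 1 '^444$' 10000.txt
grep -n -m 1 '^2000$' 10000.txt

# Keep only the line number (the part before the first colon).
start=$(grep -n -m 1 '^444$' 10000.txt | cut -d: -f1)
end=$(grep -n -m 1 '^2000$' 10000.txt | cut -d: -f1)
echo "start=$start end=$end"
```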

Combining the line numbers, we can slice the log with sed:

sed -n '444,2000p' 10000.txt
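In the same spirit as grep's -m flag, sed will keep reading to the end of the file after it has printed line 2000. One possible refinement, which is my suggestion rather than part of the workflow described here, is to tell sed to quit at the last line of the range. slice.txt is a hypothetical output name.

```shell
# Regenerate the sample file so the example is self-contained.
seq 10000 > 10000.txt

# Print lines 444 through 2000, then quit at line 2000 instead of
# scanning the remaining 8000 lines.
sed -n '444,2000p;2000q' 10000.txt > slice.txt

# Sanity check: first line, last line, and number of lines extracted.
head -n 1 slice.txt
tail -n 1 slice.txt
wc -l < slice.txt
```

On a multi-gigabyte log where the interesting window sits near the start, the early quit can save most of the read.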

Discussion

Why not skip grep and just RTF sed manual?

sed -n '/^444$/,/^2000$/p' 10000.txt

My reason: I want to visually confirm that my regular expressions matched the right lines. The time I would have saved by bypassing grep would be wasted the first time I opened an extract that didn't contain what I really wanted.

Why not just grep for timestamp and use that?

That’s a subtle point. The log files contained YYYY-MM-DD hh:mm:ss at the beginning of almost every line.

Initially, I tried:

grep '^2009-06-28 04:' log.file

To get the log lines between 4am and 5am on a specific date.

This was simple to understand and explain, and it worked beautifully until we remembered that the timestamp was on almost every line, not every line: the output was missing the stack traces. It was also missing other multi-line log messages, although those were rare in that application.

So, I used:

grep -m 1 -n '^2009-06-28 04:' log.file

grep -m 1 -n '^2009-06-28 05:' log.file



and used sed to extract the lines in between.
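The whole procedure can be sketched end to end as follows. The log contents, the class name in the stack trace, and the file names sample.log and window.log are all made up for illustration, and the sketch assumes both patterns match at least one line.

```shell
# Build a tiny stand-in for the real log (contents are invented).
cat > sample.log <<'EOF'
2009-06-28 03:59:58 INFO  before the window
2009-06-28 04:00:01 ERROR something broke
java.lang.IllegalStateException: boom
    at com.example.Foo.bar(Foo.java:42)
2009-06-28 04:30:00 INFO  still in the window
2009-06-28 05:00:00 INFO  after the window
EOF

# Line number of the first entry logged during the 04:00 hour ...
start=$(grep -m 1 -n '^2009-06-28 04:' sample.log | cut -d: -f1)
# ... and of the first entry logged during the 05:00 hour.
end=$(grep -m 1 -n '^2009-06-28 05:' sample.log | cut -d: -f1)

# Print from start up to the line just before end, then quit.
# Lines without a timestamp (the stack trace) come along for free.
last=$((end - 1))
sed -n "${start},${last}p;${last}q" sample.log > window.log
cat window.log
```

If nothing was logged during one of the two hours, the corresponding grep prints nothing and the sed address ends up empty, so a real script would want to check start and end before slicing.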