To me, log files have typically been text files written to /var/log/ by a small system utility called syslogd, rsyslogd, or syslog-ng. Log files are also what applications create somewhere on the file system into which they write status messages or full stack traces. Log files contain plain text, so I can unleash the full power of the UNIX toolbox onto them to find what I’m looking for.

Information “mining” on dozens of huge log files over hundreds of machines with a combination of gzip -d, grep, and less doesn’t really work though, and that’s putting it mildly. Thankfully I follow the right people on Twitter and have heard of Logstash and all the related buzzwords.

I spent a few hours doing some research and took some notes to try and clarify how the different bits and pieces fit together to make a big picture. (The fact that many of the bits are interchangeable compounds the confusion.) The drawing I made during my tests helped me understand how the bits interact, and I hope it will help you as well.

Here are some notes on what I’ve learned so far about these components.

ElasticSearch

ElasticSearch is a distributed RESTful search server based on Apache Lucene. It provides a scalable search solution which can be used to search all kinds of documents with a schema-free index using JSON over HTTP.
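To make that concrete, here is a minimal sketch of what such a document looks like. The field names are illustrative, not prescribed by ElasticSearch (the index is schema-free), and the index/type names in the comments are made up:

```python
import json

# A log event as ElasticSearch would store it: a plain JSON document.
doc = {
    "@timestamp": "2012-08-02T08:52:52Z",
    "host": "web01",
    "message": "named[1234]: zone transfer completed",
}
payload = json.dumps(doc)

# Indexing and searching are plain HTTP against a running instance,
# e.g. (assuming the default port 9200 on localhost):
#   PUT http://127.0.0.1:9200/logs/event/1   with `payload` as the body
#   GET http://127.0.0.1:9200/logs/_search?q=host:web01
print(payload)
```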

Graylog2

If you look at my drawing above, you’ll see some parallels between Graylog2 and Logstash: both accept messages somehow or other, and both store them in ElasticSearch. That is where the comparison ends.

The Graylog2 server is written in Java and accepts log messages via a UDP or TCP syslog listener on port 514 (configurable) or via AMQP. It also requires a MongoDB installation, in which it stores user accounts, some of its configuration, and statistics.

In addition to syslog, Graylog2 accepts messages via GELF, either directly on the Graylog2 server or from remote locations, using one of the supported language bindings. Whatever Graylog2 receives, it stores in ElasticSearch.
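On the wire, a basic GELF message is just zlib-compressed JSON sent as a UDP datagram. Here’s a sketch in Python; the server name is a placeholder, not a real endpoint:

```python
import json
import zlib

# A minimal GELF message: "version", "host" and "short_message" are the
# essential fields.
msg = {
    "version": "1.0",
    "host": "web01",
    "short_message": "zone transfer completed",
    "level": 6,  # syslog severity: informational
}
packet = zlib.compress(json.dumps(msg).encode("utf-8"))

# To actually ship it, import socket and send the datagram to the
# Graylog2 server (GELF listens on UDP port 12201 by default):
#   socket.socket(socket.AF_INET, socket.SOCK_DGRAM).sendto(
#       packet, ("graylog.example.net", 12201))
print(len(packet))
```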

Logstash

The monolithic Logstash program is an indexer and/or a Web server, depending on the arguments you launch it with. It also comes bundled with a copy of ElasticSearch (but I’ll ignore that here: I prefer to use external instances of ElasticSearch).

The Logstash indexer reads logs from a variety of sources, called “inputs”, for example from files (cleverly following growing files – think tail -f and log-rotation), from sockets, with a “syslog” listener, from message-queues (AMQP and ZeroMQ), pipes, or Redis. Each source is labelled with a “type” which I can use later to have Logstash apply particular filters to specific “types” (i.e. sources).

Logstash’ filters let me pick up messages from “inputs” and massage them. I can “grep” for specific lines, join lines with “multiline”, “split” lines, and (crazy but true) use “zeromq” to process the message off-site, waiting for a response before continuing. Most importantly, the “grok” filter allows me to use regular expressions to chop a log line into fields which I can “mutate” (another filter) or simply have Logstash store in ElasticSearch. Breaking a log line into fields means I can search by particular fields. (I’ll show you an example shortly.)

Outputs, finally, instruct Logstash how to handle messages which were “input” and possibly “filtered”. Recall we applied a “type” to each message source, and I can use these types in “outputs” to segregate logs to different places. (Typically most inputs will go to “elasticsearch”, but I could, say, additionally store messages in a file output.) There is a large variety of outputs: “stdout” and “elasticsearch” ought to be self-explanatory. Others include “e-mail”, “file”, “http”, “amqp”, “redis”, “nagios”, “pipe”, “tcp”, “zeromq”, etc.

This is what a Logstash configuration can look like:

input {
  file {
    type      => "bindaxfr"
    path      => [ "/var/log/named/axfr.*" ]
    format    => [ "plain" ]
    add_field => [ "fofo", "babar" ]
  }
}

filter {
  grok {
    type         => "bindaxfr"
    patterns_dir => [ "/usr/local/share/grok/patterns" ]
    pattern      => "%{DNSAXFRCOMPLETE}"
    add_tag      => "axfr"
    add_field    => [ "named_raw_message", "%{@message}" ]
  }
}

output {
  elasticsearch { host => '127.0.0.1' }
}

I define a single input, a single filter, and an output to ElasticSearch.
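To illustrate how a “type” segregates outputs, here’s a sketch extending that output section (old-style Logstash configuration syntax; the archive path is made up) so that “bindaxfr” messages additionally land in a “file” output:

```
output {
  elasticsearch { host => '127.0.0.1' }
  file {
    type => "bindaxfr"                       # only messages of this type
    path => "/var/log/archive/bindaxfr.log"  # made-up path
  }
}
```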

Visualizing the lot, a.k.a. The GUI

Logstash provides an optional simple Web interface through which I can search for whatever it has added to ElasticSearch.

A separate project, Kibana, provides a PHP-based interface which connects directly to ElasticSearch. Apropos ElasticSearch: here’s a UI you must get for looking into it.

Graylog2-Web is a separate Ruby on Rails project which offers a very sexy-looking interface into what is stored in ElasticSearch. Filters allow me to drill down and search for specific fields, “streams” enable me to use pre-defined filters for doing the same, and I can monitor streams and have them issue alerts or forward messages to other Graylog2 endpoints.

Shipping logs

So how do I get logs from machines into Graylog2 or Logstash? Well, there are a lot of ways to accomplish that:

To Graylog2

Configure machines to forward syslog entries to the remote Graylog2 syslog-listener, or have rsyslogd store in ElasticSearch itself.

Ship log entries via GELF to the Graylog2 server.

Use Gelfino, a tiny GELF server (written in Clojure), as a forwarding endpoint to Graylog2.

logix is a Python daemon which queues events over AMQP for transmission to Graylog2’s AMQP listener.
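The syslog route above can be as small as a one-line rsyslogd fragment (the hostname is a placeholder; a single @ means UDP, @@ means TCP):

```
# /etc/rsyslog.conf: forward all messages to the Graylog2 syslog listener
*.*  @graylog.example.net:514
```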

To Logstash

Copy logfiles via the “traditional” methods (e.g. rsync) to the Logstash server and use a “file” input.

I can run Logstash’ “syslog” input on individual servers or centrally, and have, say, syslogd forward messages to that.

I can run Logstash on shipping servers, feed logfiles into it, do the filtering thing there, and send results off to ElasticSearch (or Graylog2) from there. (Read “Centralized Setup with Event Parsing”, which uses Redis as a broker, or “Load Balancing Logstash with Redis”.)

Lumberjack is a promising new lightweight utility (by Jordan Sissel) which will collect logs locally to ship them elsewhere.

An alternative seems to be Beaver – a Python daemon that chews on logs and sends their content to a remote Logstash server via Redis or 0MQ. What I particularly like about the Redis way is that it can buffer to disk if the indexer (Logstash) goes down for maintenance.

Grok

Grok is a program (and API)

that allows you to easily parse logs and other files. With grok, you can turn unstructured log and event data into structured data. […] You can match any number of complex patterns on any number of inputs (processes and files) and have custom reactions.

I found the best way to start learning grok was by using it standalone, i.e. without Logstash, starting with the tutorial example and going on from there; a glance at the Logstash Cookbook helped me too. I set myself the task of parsing a BIND zone transfer log file into individual fields, so I started off by building an appropriate regular expression using parts already defined in the Grok package to parse a line like this

02-Aug-2012 08:52:52.600 transfer of 'example.com/IN' from 192.168.37.53#53: Transfer completed: 8 messages, 052 records, 20965 bytes, 1.925 secs (372200 bytes/sec)

with this:

DNSAXFRCOMPLETE %{MONTHDAY}-%{MONTH}-%{YEAR} %{HOUR}:%{MINUTE}:%{SECOND}\.%{INT} transfer of '%{HOSTNAME:zone}/IN' from %{IP:clientip}#%{POSINT}: Transfer completed:.*messages, %{POSINT:records} records, %{POSINT:octets} bytes,%{SPACE}%{NUMBER:seconds}

Note how I grab a field (e.g. HOSTNAME) and rename it (zone). What happens here is: as soon as grok finds a regular expression matching the definition of “HOSTNAME”, it assigns that to a field called zone, which Logstash will store in ElasticSearch. Here is an example of what the individual fields look like via the Graylog2 Web interface:

(I can click on the pink bits to have Graylog2 use a field as a filter.)
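To get a feel for what grok’s %{PATTERN:name} renaming does, here is a rough Python equivalent using named capture groups on the same sample line. This is a simplified stand-in for the DNSAXFRCOMPLETE pattern, not grok itself:

```python
import re

line = ("02-Aug-2012 08:52:52.600 transfer of 'example.com/IN' "
        "from 192.168.37.53#53: Transfer completed: 8 messages, "
        "052 records, 20965 bytes, 1.925 secs (372200 bytes/sec)")

# Named groups play the role of grok's %{HOSTNAME:zone}, %{IP:clientip},
# etc.: each match is assigned to a field name.
pat = re.compile(
    r"transfer of '(?P<zone>[\w.-]+)/IN' from (?P<clientip>[\d.]+)#\d+: "
    r"Transfer completed:.*messages, (?P<records>\d+) records, "
    r"(?P<octets>\d+) bytes, (?P<seconds>[\d.]+)"
)

m = pat.search(line)
print(m.groupdict())  # fields ready to be stored, e.g. in ElasticSearch
```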

This is all very powerful, and I have a lot to learn; I feel it’s going to be worth it.

Further reading