I have worn many hats over the past few years: System Administrator, PostgreSQL and MySQL DBA, Perl Programmer, PHP Programmer, Network Administrator, and Security Engineer/Officer. The common thread is having the data I need available, searchable, and visible.

So what data am I talking about? Honestly, everything. System logs, application logs, events, system performance data, and network traffic data are key requirements for making any tough infrastructure decision, if not for the trivial infrastructure and implementation decisions we have to make every day.

I'm in the midst of implementing a comprehensive solution, and this post is a brain dump and road map for how I went about it, and why.

Step 1: syslog

I usually start with a sane solution for transporting the events occurring on UNIX and Windows servers to a central log host. You may need to aggregate events at a data center level depending on throughput. There are a number of options available to you for centrally logging with syslog; the favorites seem to be rsyslog and syslog-ng.

Both are excellent choices, but if you have a tight budget and are reading between the lines of popular regulatory policy (SOX, PCI-DSS, FISMA, FERPA, etc.), you may want to give some thought to two features in particular: guaranteed delivery and encrypted transfer. These are not hard and fast rules that auditors check for right now, but I expect they will be in the near future.

With rsyslog, both features are available in the open source solution, whereas this is not the case with syslog-ng. However, rsyslog does not run on Windows, so if you have a large number of Windows servers, you probably need to spend money on a central logging solution anyway. I am not in this position, so I chose rsyslog.

The main drawback to rsyslog is the configuration file syntax. Syslog-ng did away with the legacy syslog config file syntax in favor of a readable, sensible format. Rsyslog decided to maintain the legacy syslog configuration syntax and extend it for new features. This is maddening, but if you don't have the budget for syslog-ng and need encryption and/or guaranteed delivery, you can make it work.

Central Log Server

First, we need to set up a place for our logs to land. Configuring the rsyslog central server means configuring where we want the logs to live, and how we'd like to receive them. I'm calling this host 'logstorage-01'.

I'll explain the configuration step by step. The first part sets the default templates, work directory, and loads the modules we need:

    # Rsyslog Defaults
    $ActionFileDefaultTemplate RSYSLOG_TraditionalFileFormat
    $WorkDirectory /var/run/rsyslog

    # Modules
    $ModLoad immark
    $ModLoad imudp
    $ModLoad imtcp
    $ModLoad imklog
    $ModLoad imuxsock

Now, I establish that I want to listen on tcp and udp 514:

    ## Enable Listeners
    $InputTCPServerRun 514
    $UDPServerRun 514

Rsyslog uses templates for both filenames and output. This is an example of both. The RemoteHost template will be used to determine the filename for each message that comes in. The ArcSightFormat template is going to be used to reformat the message in a way that an ArcSight Agent can handle.

    # Templates
    $template RemoteHost,"/var/log/remote/%HOSTNAME%/%$YEAR%/%$MONTH%-%$DAY%.log"
    $template ArcSightFormat,"<%PRI%>%TIMESTAMP% %fromhost-ip% %syslogtag%%msg:::sp-if-no-1st-sp%%msg:::drop-last-lf%\n"
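To make the two templates a little more concrete, here's a minimal Python sketch of what they compute. The function names `remote_host_path` and `pri` are hypothetical and only mimic the substitution; rsyslog does all of this internally. It assumes the two-digit month and day that rsyslog's `$MONTH` and `$DAY` properties produce, and the standard syslog rule that the `<PRI>` value is the facility times 8 plus the severity.

```python
from datetime import date


def remote_host_path(hostname: str, day: date) -> str:
    """Mimic the RemoteHost template: one directory per host and year, one file per day."""
    return f"/var/log/remote/{hostname}/{day.year:04d}/{day.month:02d}-{day.day:02d}.log"


def pri(facility: int, severity: int) -> int:
    """Syslog priority value as it appears inside <...> at the start of a message."""
    return facility * 8 + severity


# Hypothetical host name, used only for illustration.
print(remote_host_path("web-01", date(2012, 3, 7)))
# daemon facility (3) at info severity (6)
print(pri(3, 6))
```

So a message from a host called web-01 on March 7th would land in /var/log/remote/web-01/2012/03-07.log, which makes per-host, per-day archival and cleanup easy to script.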

Now, we're ready to start doing things with messages. The first action I choose is to discard all connection-related messages from snmpd, as these consume a lot of disk space. You can disable this type of logging in snmpd, but it also serves as a good example of log filtering and the discard action '~'.

    # Discard SNMPD Connection Messages
    if $programname == 'snmpd' and ( $msg contains 'Connection from UDP' or $msg contains 'Received SNMP packet(s) from UDP' ) then ~
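If the expression syntax is unfamiliar, the rule above reads as a plain boolean predicate. Here's a rough Python equivalent; the function name `is_snmpd_noise` is made up for illustration, and I'm assuming `contains` behaves as a case-sensitive substring match.

```python
def is_snmpd_noise(programname: str, msg: str) -> bool:
    """Return True when a message matches the snmpd discard rule."""
    noisy_fragments = (
        "Connection from UDP",
        "Received SNMP packet(s) from UDP",
    )
    # The program name must match exactly; either fragment may appear anywhere in the message.
    return programname == "snmpd" and any(fragment in msg for fragment in noisy_fragments)
```

Anything for which this returns True would be dropped before it reaches the rules that follow.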

At this point, due to the '~' any message matching snmpd and the strings I've specified will have been discarded. I now want to log everything to disk using my RemoteHost template. It is important to note that local syslog messages will also be caught by this next rule:

    # Archival Storage
    # All Messages, locally and remote stored to these rules
    *.* ?RemoteHost

The *.* tells rsyslog to log everything, and ?RemoteHost is the template used for the file name. This next rule demonstrates how to send selected messages to a UDP listener using a message format:

    # ArcSight
    if $programname == 'named' then @arcsight.example.com;ArcSightFormat

So, in this example, anything from named is forwarded to arcsight.example.com over UDP (@) on port 514 (the default), using the format (;) ArcSightFormat for the message.
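The forwarding action packs a lot into one token, so here's a small Python sketch that pulls it apart. The `ForwardAction` type and `parse_forward_action` function are hypothetical helpers, not part of rsyslog, and the sketch only covers the simple forms used in this post: one '@' for UDP, '@@' for TCP, an optional ':port', and an optional ';Template'. It ignores forwarding options and bracketed IPv6 addresses.

```python
import re
from typing import NamedTuple, Optional


class ForwardAction(NamedTuple):
    protocol: str
    host: str
    port: int
    template: Optional[str]


# @ or @@, then host, then optional :port, then optional ;TemplateName
_ACTION = re.compile(
    r"^(?P<at>@@?)(?P<host>[^:;]+)(?::(?P<port>\d+))?(?:;(?P<template>\S+))?$"
)


def parse_forward_action(action: str) -> ForwardAction:
    """Split a simple rsyslog forwarding action into its parts."""
    match = _ACTION.match(action)
    if match is None:
        raise ValueError(f"not a forwarding action: {action!r}")
    return ForwardAction(
        protocol="tcp" if match.group("at") == "@@" else "udp",
        host=match.group("host"),
        port=int(match.group("port") or 514),  # 514 is the syslog default
        template=match.group("template"),
    )


print(parse_forward_action("@arcsight.example.com;ArcSightFormat"))
```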

It's at this point that our log archival and any additional remote forwarding we need is complete. The next thing we do is discard any messages not sourced from 'logstorage-01':

    # If not sourced locally, stop processing message.
    :source , !isequal , "logstorage-01" ~

So now only local events are left and we implement local logging. This format should be familiar to anyone who's worked with syslogd before:

    # Local Logging
    *.info;mail.none;authpriv.none;cron.none    /var/log/messages
    authpriv.*                                  /var/log/secure
    mail.*                                      -/var/log/maillog
    kern.*                                      /var/log/kern.log
    cron.*                                      /var/log/cron
    *.emerg                                     *
    uucp,news.crit                              /var/log/spooler
    local7.*                                    /var/log/boot.log

Setting up the clients

Next, we'd like to receive logs on the central server, so we need to set up our clients to send messages. At this point, I'm not configuring encryption of messages, but I would like guaranteed delivery of the messages to the central log server. rsyslog has a few ways to do this, including its own protocol for delivery. I don't need insane amounts of guarantee; using TCP and an on-disk queue will get me most of the way there and is simple to implement.

So, here's my rsyslog.conf, one step at a time:

    # Rsyslog Defaults
    $ActionFileDefaultTemplate RSYSLOG_TraditionalFileFormat
    $WorkDirectory /var/run/rsyslog    # Default Location for Work Files

    # Modules
    $ModLoad immark
    $ModLoad imklog
    $ModLoad imuxsock

Nothing crazy there: load the modules we need and set the standard template for messages.

    # Local Logging
    *.info;mail.none;authpriv.none;cron.none    /var/log/messages
    authpriv.*                                  /var/log/secure
    mail.*                                      -/var/log/maillog
    kern.*                                      /var/log/kern.log
    cron.*                                      /var/log/cron
    *.emerg                                     *
    uucp,news.crit                              /var/log/spooler
    local7.*                                    /var/log/boot.log

Again, nothing earth shattering. Basic syslogd style capture of messages to disk. Now we're ready to send messages to our central log storage server. So this would be a good time to remove anything we don't want to send from the stream. Again, I've used an snmpd connection message filter as a demonstration. Anything matching it will be discarded by the '~'.

    # Discard SNMPD Spam
    if $programname == 'snmpd' and ( $msg contains 'Connection from UDP' or $msg contains 'Received SNMP packet(s) from UDP' ) then ~

So, now we're ready to actually send log messages to the central server. The first thing we need to configure is the on-disk queue. We do this as follows:

    # Remote Logging with On Disk Queueing Enabled
    $ActionQueueType LinkedList         # Asynchronous Forwarding Mechanism
    $ActionQueueFileName centralwork    # Enable disk mode queue
    $ActionResumeRetryCount -1          # Infinite Retries
    $ActionQueueSaveOnShutdown on       # Save Queue on Exit for reprocessing

And the last thing we need is a destination for the logs for this queue:

*.* @@logstorage-01.example.com:514

One thing to point out is the use of '@@', which specifies that we want to use TCP.
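If you're wondering what "TCP plus an on-disk queue with infinite retries" buys you, here's a deliberately tiny Python sketch of the idea. This is not how rsyslog implements its queues; the `SpoolingForwarder` class and the injected `send` callable are hypothetical, and the sketch assumes one single-line message per call. Messages that can't be delivered are written to a spool file and retried, in order, the next time anything is forwarded.

```python
import os
from typing import Callable, List


class SpoolingForwarder:
    """Forward lines via `send`; keep undelivered lines on disk until they go through."""

    def __init__(self, send: Callable[[str], None], spool_path: str) -> None:
        self.send = send              # raises OSError when the destination is unreachable
        self.spool_path = spool_path  # survives restarts, like a saved queue

    def _read_spool(self) -> List[str]:
        if not os.path.exists(self.spool_path):
            return []
        with open(self.spool_path, encoding="utf-8") as handle:
            return [line.rstrip("\n") for line in handle]

    def _write_spool(self, lines: List[str]) -> None:
        with open(self.spool_path, "w", encoding="utf-8") as handle:
            handle.writelines(line + "\n" for line in lines)

    def forward(self, line: str) -> None:
        # Oldest messages first, so ordering is preserved across outages.
        pending = self._read_spool() + [line]
        while pending:
            try:
                self.send(pending[0])
            except OSError:
                break  # destination is down: stop and keep the rest on disk
            pending.pop(0)
        self._write_spool(pending)
```

In a real client, `send` would write to a TCP connection to the central log server. The point of the sketch is only the behavior: nothing is thrown away while the server is unreachable, and the backlog drains once it comes back.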

Reflections

What we have at this point is a fairly reliable transport of our syslog messages from our UNIX hosts to our central log server. Remember, we configured the central log server with both TCP and UDP listeners. This means that for systems which rsyslog doesn't support and may not be able to use TCP delivery, we can send legacy UDP messages to logstorage-01 and it will work.

To get Windows servers participating, you may want to investigate: syslog-win32, S.N.A.R.E., or eventlog-to-syslog. I don't have much experience with them, but they will communicate to rsyslog in this setup.

Now that we have rsyslog configured, we could use syslog as the backend for our application logging. Don't get too angry, you can always use something like scribe or Flume as well. That's the subject of another write-up though.

Step 2: Doing something with these messages

Before going any further, I'd like to address the 1,000 lb gorilla in the room: Splunk. I have never worked with Splunk. Even with only ~150 servers at my previous job, I exceeded the 500 MB of logs allowed per day. (I am a fan of syslog; why invent another logging protocol if there's already one available?) That being said, people I trust who have experience with it say it's amazing.

As a matter of fact, I've never encountered someone who's used Splunk and had anything remotely negative to say about its performance, scalability, or user experience. The only complaint I've ever heard is that it is expensive; not "organic beef" expensive, but Aston Martin expensive. If you have that kind of budget to spend on logging, go for it. I've been working for far too many poor companies for too long and could not fathom spending hundreds of thousands of dollars on logging software.

I am insanely curious about how close to Splunk's interface and utility I can get with open source software. What follows is my attempt to do just that.

Graylog2

When I first started down this road, someone suggested I take a look at Graylog2. Its user interface is fantastic, it leverages cool-sounding technologies like "MongoDB", and it uses a cartoon gorilla from The Oatmeal. When you log in, as the interface loads it says: "Mounting party hats!" How cutting edge is that? Awesome.

I love software that has a sense of humor. It adds to the user experience. Under the hood, Graylog2 has a number of awesome features, including its own log format for passing messages around in a way that allows for easy serialization and deserialization of data in the log stream. This format is GELF.

I didn't have too many problems installing Graylog2, and honestly I was very happy with the interface and configurability. I sent only a small stream of log traffic to it and was able to get data out very quickly. I also noticed that the release I downloaded was the first to support ElasticSearch as a storage backend in addition to MongoDB.

For those of you unfamiliar with ElasticSearch, it's a clustered full-text search platform based on Lucene. If you've never had the privilege of working with ElasticSearch, I can tell you it is magic. I support several ElasticSearch clusters in a production environment and can tell you first hand it's a wonderful product from my standpoint. The only complaint I have is it's too much magic. It makes me feel insignificant as a system administrator because I have to do very little to support it. It's also incredibly fast and incredibly scalable.

Well, it's scalable if you design your indexes in a certain way. And this is where the show stops for Graylog2. You see, ElasticSearch uses sharding to distribute data across the cluster. You can specify how many shards you want an index to have when you create it. You can also specify how many copies of each shard you'd like to keep across the cluster for redundancy. You can even do some neat things like saying "keep a copy at each datacenter and never have all copies of one shard in the same rack in the same datacenter." This means your ability to scale performance is set at the time of index creation. If I have 5 shards, I can scale up to 5 cluster nodes and gain performance. After that, I'm simply gaining redundancy, as only one copy of each shard is the primary at any one time.
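Here's a short Python sketch of the knobs being described. The `number_of_shards` and `number_of_replicas` settings are the real ElasticSearch index settings; the helper names `index_settings` and `total_shard_copies` are mine, and nothing here talks to a cluster. It only builds the request body you'd send when creating an index and does the arithmetic on how many shard copies exist.

```python
import json


def index_settings(primary_shards: int, replicas_per_shard: int) -> dict:
    """Body for an index-creation request with a fixed shard layout."""
    return {
        "settings": {
            "number_of_shards": primary_shards,        # chosen at creation time
            "number_of_replicas": replicas_per_shard,  # extra copies of each shard
        }
    }


def total_shard_copies(primary_shards: int, replicas_per_shard: int) -> int:
    """Primaries plus replicas: the most nodes that can each hold a copy of this index."""
    return primary_shards * (1 + replicas_per_shard)


print(json.dumps(index_settings(5, 1)))
print(total_shard_copies(5, 1))
```

With 5 shards and 1 replica there are 10 shard copies in total, but still only 5 primaries, which is why the shard count you pick up front matters so much.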

So why is this a problem with Graylog2? Well, Graylog2 uses a single index for its entire database. Perhaps this is a side-effect of their relatively late adoption of ElasticSearch as a backend for log storage. But it means that you need to build the index at the time of initial installation to cope with the load of the logs for the future of your logging solution. Sound easy? Well, it's not. If you have a large volume of logs and you intend on keeping them around for compliance reasons for a long period of time, Graylog2's use of ElasticSearch will cause significant performance problems for you, even if you were to know you need 20 nodes in your cluster.

So, for a large installation with high volume, I cannot recommend Graylog2. It's beautiful, it's fun, but the ElasticSearch indexing scheme is currently broken.

Logstash

So, what else is there? Well, there's Logstash. Logstash is more of a log routing or translation tool than anything else. Take a look at the list of inputs, filters, and outputs it supports:

inputs: amqp, exec, file, gelf, redis, stdin, stomp, syslog, tcp, twitter, xmpp, zeromq

filters: date, dns, gelfify, grep, grok, grokdiscovery, json, multiline, mutate, split

outputs: amqp, elasticsearch, elasticsearch_river, file, ganglia, gelf, graphite, internal, loggly, mongodb, nagios, null, redis, statsd, stdout, stomp, tcp, websocket, xmpp, zabbix, zeromq

AH HA! You'll notice that one of the outputs logstash supports is ElasticSearch. So why use logstash instead of Graylog2? It has to do with the indexes. Graylog2 implements a single index 'graylog2' in the ElasticSearch cluster. This makes the search API fairly simple, as I simply specify that index to search from and give my filter criteria. The downside: this index is ENORMOUS, so simple or unbounded searches could dramatically impact the availability of the entire cluster.

Logstash's developers seem to have more experience with the ElasticSearch model and designed it with scalability in mind. The logstash elasticsearch output mechanism creates a new index every day. This means there's a little more logic needed on the search front-end to specify which indexes to look in for the data you're querying, but you can change the sharding definitions on a daily basis and grow your cluster as your needs change. This also allows for index optimization. Someone much smarter than me can explain better, but if an index is in a read-only (or infrequent write) state, like yesterday's index, it can be highly optimized with Lucene to yield better performance.
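The "little more logic on the search front-end" mostly amounts to turning a date range into a list of index names. Here's a minimal Python sketch, assuming the daily indexes are named like `logstash-2012.03.07` (the naming pattern commonly used by Logstash; adjust the prefix and format if yours differs). The function name `daily_indexes` is hypothetical.

```python
from datetime import date, timedelta
from typing import List


def daily_indexes(start: date, end: date, prefix: str = "logstash-") -> List[str]:
    """List the per-day index names covering start..end, inclusive."""
    if end < start:
        raise ValueError("end must not be before start")
    names = []
    for offset in range((end - start).days + 1):
        day = start + timedelta(days=offset)
        names.append(f"{prefix}{day:%Y.%m.%d}")
    return names


# ElasticSearch accepts a comma-separated list of indexes in the search URL path.
print(",".join(daily_indexes(date(2012, 2, 28), date(2012, 3, 1))))
```

A search for the last three days then touches three small indexes instead of one enormous one, and older days can be optimized or dropped without affecting today's writes.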

Extracting Custom Data

One of my first uses for Logstash was to provide a better UI for OSSEC-HIDS. OSSEC does an amazing job of security monitoring hosts and aggregations of hosts, but the interface is fairly behind where I feel it needs to be. However, I could say the same thing about Logstash. That's fine, because we can leverage the strengths of Logstash to provide better interfaces.

I found this awesome write-up on getting OSSEC alerts to Logstash for processing.

Where Logstash fails

Logstash isn't perfect; its front-end leaves MUCH to be desired. However, given the infrastructure and flexibility it affords, I'd prefer the developers focus on the inputs, filters, and outputs rather than spend valuable resources on front-ends. If you've learned anything in the open source community, it's that someone will fix that problem. And it turns out they have:

Searchable and Visible: Kibana

As I've pointed out, the ElasticSearch storage in logstash is excellent. It uses the indexes exactly as they were designed to be used. This means the performance, reliability, and scalability of logstash's storage backend are on par with Splunk. However, its front-end has nothing on Splunk. Enter Kibana.

Kibana is a Bootstrap-based PHP front-end which leverages the indexes Logstash creates in ElasticSearch to provide a beautiful front-end to the log searching. It also adds in the functionality to do trending and analysis of logs from Logstash! Keep in mind you can create your own fields in logstash using grok, so we can extract, trend, score, and analyze data in real-time in a fairly beautiful and powerful interface.

Kibana fills the gap in the Logstash interface perfectly. It doesn't give me everything I'd get with Splunk, but I've only scratched the surface of the functionality I can extract with Logstash.

Step 3: All your stats are belong to Graphite

Graphite is awesome. If you're not using it, DROP EVERYTHING AND GET IT RUNNING NOW. For a foray into its awesomeness, here's a quick overview of its features. This presentation is based off something that Jason Dixon shared with me, so please make sure you visit his blog and read his series of misnamed "Unhelpful Graphite Tips":

If you've taken the time to read the overview and those tips, you're probably thinking "OMG I CAN GRAPH EVERYTHING." If you're not thinking that, re-read everything and reconsider. Perhaps you missed timeShift()? Perhaps you're not a wanna-be statistics nerd like me? I strongly suggest you become one, maybe start here.

So, you can write grok patterns to extract metrics from your logs, and then use statsd to track those metrics in Graphite.
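To show the shape of that pipeline without a running Logstash, here's a rough Python stand-in: a regular expression plays the role of the grok pattern, and the result is formatted as a statsd counter packet (`bucket:value|c`). The sshd log format, the bucket name `logs.sshd.failed_password`, and both function names are assumptions for illustration; in Logstash itself you'd use the grok filter and the statsd output instead.

```python
import re
import socket
from typing import Optional

# Stand-in for a grok pattern: match failed SSH logins and capture user and source IP.
SSH_FAIL = re.compile(
    r"Failed password for (?:invalid user )?(?P<user>\S+) from (?P<ip>\S+)"
)


def statsd_counter(line: str, bucket: str = "logs.sshd.failed_password") -> Optional[bytes]:
    """Return a statsd counter packet if the log line matches, else None."""
    if SSH_FAIL.search(line) is None:
        return None
    return f"{bucket}:1|c".encode("ascii")


def send_statsd(packet: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send; 8125 is statsd's conventional port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (host, port))


line = "sshd[4242]: Failed password for root from 10.0.0.5 port 51234 ssh2"
print(statsd_counter(line))
```

statsd aggregates those counters and flushes them to Graphite, so every pattern you can match in a log line becomes something you can graph.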

STFU, Information Overload

Agreed, this is a lot for a single post, and I still have so much to say on all the technology here. I will try to keep brain-dumping as I fine-tune my setup. I aimed to answer the question, how close can I get to Splunk with open source software? I honestly don't know. I'm relying on you folks to find out. Start a conversation with me on twitter, @reyjrar.

UPDATE!