The ability to efficiently analyze and query the data shipped to the ELK Stack depends on the readability and quality of data. This implies that if unstructured data (e.g., plain text logs) is being ingested into the system, it must be translated into structured form enriched with valuable fields. Regardless of the data source, pulling the logs and performing some magic to format, transform, and enrich them is necessary to ensure that they are parsed correctly before being shipped to Elasticsearch.

Logstash is a data pipeline that helps us process logs and other event data from a variety of sources. With over 200 plugins, Logstash can connect to a variety of sources and stream data at scale to a central analytics system. One of the best solutions for the management and analysis of logs and events is the ELK stack (Elasticsearch, Logstash and Kibana).

Data transformation and normalization in Logstash are performed using filter plugins. This article focuses on one of the most popular and useful filter plugins, the Logstash Grok filter, which is used to parse unstructured data into structured data, making it ready for aggregation and analysis in the ELK Stack. This allows us to use advanced features like statistical analysis on value fields, faceted search, filters, and more. If we couldn't classify and break down data into separate fields, all searches would be full text, which would not allow us to take full advantage of Elasticsearch and Kibana search. Grok is perfect for syslog logs, Apache and other web server logs, MySQL logs, and, in general, any log format that is written for humans and includes plain text.

The Grok filter ships with a variety of regular expressions and patterns for common data types and expressions found in logs (e.g., IP address, username, email, hostname). When Logstash reads through the logs, it can use these patterns to find semantic elements of the log message that we want to turn into structured fields.

Thus, the Grok filter works by combining text patterns into something that matches your logs. You can tell Grok what data to search for by defining a Grok pattern: %{SYNTAX:SEMANTIC}

The SYNTAX is the name of the pattern that will match your text. For example, the NUMBER pattern can match 4.55, 4, 8, and any other number, and the IP pattern can match 54.3.244.2 or 174.49.99.1, etc.

The SEMANTIC is the identifier given to a matched text. You can think of this identifier as the key in the key-value pair created by the Grok filter, with the value being the text matched by the pattern. Using the example above, 4.55, 4, or 8 could be the duration of some event, and 54.3.244.2 could be the client making a request.

We can express this quite simply using the Grok patterns %{NUMBER:duration} and %{IP:client} and then refer to them in the filter definition:

filter {
  grok {
    match => { "message" => "%{IP:client} %{NUMBER:duration}" }
  }
}
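To make the mechanics concrete, here is a rough Python sketch (not Logstash code) of what Grok does with such a definition: each %{SYNTAX:SEMANTIC} reference is swapped for a named capture group. The PATTERNS dictionary holds simplified stand-ins for the real pattern library.

```python
import re

# Simplified stand-ins for the shipped pattern library (assumption:
# the real IP and NUMBER regexes are more elaborate than these).
PATTERNS = {
    "IP": r"(?:[0-9]{1,3}\.){3}[0-9]{1,3}",
    "NUMBER": r"[+-]?[0-9]+(?:\.[0-9]+)?",
}

def grok_compile(expr):
    """Replace each %{SYNTAX:SEMANTIC} with a named capture group."""
    def repl(m):
        syntax, semantic = m.group(1), m.group(2)
        return "(?P<{}>{})".format(semantic, PATTERNS[syntax])
    return re.compile(re.sub(r"%\{(\w+):(\w+)\}", repl, expr))

rx = grok_compile("%{IP:client} %{NUMBER:duration}")
print(rx.match("55.3.244.1 0.043").groupdict())
# {'client': '55.3.244.1', 'duration': '0.043'}
```

The SEMANTIC becomes the group name, so the match result is exactly the key-value structure described above.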

As we’ve mentioned, Logstash ships with lots of predefined patterns. Patterns consist of a label and a regex, e.g.: USERNAME [a-zA-Z0-9._-]+

Let’s take a look at some other available patterns. (You can find a full list here.)

# Basic Identifiers
USERNAME [a-zA-Z0-9._-]+
USER %{USERNAME}
INT (?:[+-]?(?:[0-9]+))
BASE10NUM (?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+)))
NUMBER (?:%{BASE10NUM})
BASE16NUM (?<![0-9A-Fa-f])(?:[+-]?(?:0x)?(?:[0-9A-Fa-f]+))
BASE16FLOAT \b(?<![0-9A-Fa-f.])(?:[+-]?(?:0x)?(?:(?:[0-9A-Fa-f]+(?:\.[0-9A-Fa-f]*)?)|(?:\.[0-9A-Fa-f]+)))\b

# Networking
MAC (?:%{CISCOMAC}|%{WINDOWSMAC}|%{COMMONMAC})
CISCOMAC (?:(?:[A-Fa-f0-9]{4}\.){2}[A-Fa-f0-9]{4})
WINDOWSMAC (?:(?:[A-Fa-f0-9]{2}-){5}[A-Fa-f0-9]{2})
COMMONMAC (?:(?:[A-Fa-f0-9]{2}:){5}[A-Fa-f0-9]{2})

# Paths
PATH (?:%{UNIXPATH}|%{WINPATH})
UNIXPATH (/([\w_%!$@:.,+~-]+|\\.)*)+
TTY (?:/dev/(pts|tty([pq])?)(\w+)?/?(?:[0-9]+))
URIHOST %{IPORHOST}(?::%{POSINT:port})?
# uripath comes loosely from RFC1738, but mostly from what Firefox
# doesn't turn into %XX
URIPATH (?:/[A-Za-z0-9$.+!*'(){},~:;=@#%&_\-]*)+

# Months: January, Feb, 3, 03, 12, December
MONTHNUM (?:0?[1-9]|1[0-2])
MONTHNUM2 (?:0[1-9]|1[0-2])
MONTHDAY (?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9])

# Log formats
SYSLOGBASE %{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} %{SYSLOGPROG}:

# Log levels
LOGLEVEL ([Aa]lert|ALERT|[Tt]race|TRACE|[Dd]ebug|DEBUG|[Nn]otice|NOTICE|[Ii]nfo|INFO|[Ww]arn?(?:ing)?|WARN?(?:ING)?|[Ee]rr?(?:or)?|ERR?(?:OR)?|[Cc]rit?(?:ical)?|CRIT?(?:ICAL)?|[Ff]atal|FATAL|[Ss]evere|SEVERE|EMERG(?:ENCY)?|[Ee]merg(?:ency)?)

A great feature is that patterns can contain other patterns, e.g.: SYSLOGTIMESTAMP %{MONTH} +%{MONTHDAY} %{TIME}
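A minimal Python sketch of this composition, assuming much-simplified versions of the MONTH, MONTHDAY, and TIME regexes, could resolve nested %{NAME} references recursively:

```python
import re

# Toy pattern library where entries reference each other (assumption:
# these are simplified regexes, not the shipped Logstash definitions).
PATTERNS = {
    "MONTH": r"[A-Z][a-z]{2}",
    "MONTHDAY": r"(?:[12][0-9]|3[01]|0?[1-9])",
    "TIME": r"\d{2}:\d{2}:\d{2}",
    "SYSLOGTIMESTAMP": r"%{MONTH} +%{MONTHDAY} %{TIME}",
}

def expand(name):
    """Recursively resolve %{NAME} references inside a pattern."""
    body = PATTERNS[name]
    return re.sub(r"%\{(\w+)\}", lambda m: expand(m.group(1)), body)

rx = re.compile(expand("SYSLOGTIMESTAMP"))
print(bool(rx.fullmatch("Jan  1 06:25:43")))  # True
```

This is essentially how a high-level pattern like SYSLOGTIMESTAMP bottoms out in plain regular expressions.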

By default, all semantics (e.g., duration or client) are saved as strings. Optionally, we can add a data type conversion to our Grok pattern. For example, %{NUMBER:num:int} converts the num semantic from a string to an integer. The only conversions currently supported are int and float.

Let’s take a look at a more realistic example to illustrate how the Grok filter works. Let’s assume we have an HTTP log message like this:

55.3.244.1 GET /index.html 15824 0.043

Many such log messages are stored in /var/log/http.log, so we can use the Logstash file input, which tails the log files and emits an event whenever a new log message is added. In the filter part of the configuration, we define Syntax-Semantic pairs that sequentially match each pattern available in the Grok filter to a specific element of the log message.

input {
  file {
    path => "/var/log/http.log"
  }
}
filter {
  grok {
    match => { "message" => "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}" }
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
}

In the example above, we represented the log message as:

%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}

This will add a few extra fields (e.g., client or method) to the event alongside the original message field before the event is sent to Elasticsearch.
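For illustration, the same extraction can be approximated with a plain Python regex, using hand-written, simplified stand-ins for the IP, WORD, URIPATHPARAM, and NUMBER patterns (the real Grok definitions are stricter):

```python
import re

# Hand-written, simplified equivalents of the Grok patterns used above.
HTTP_LOG = re.compile(
    r"(?P<client>(?:\d{1,3}\.){3}\d{1,3}) "  # %{IP:client}
    r"(?P<method>\w+) "                      # %{WORD:method}
    r"(?P<request>\S+) "                     # %{URIPATHPARAM:request}
    r"(?P<bytes>\d+) "                       # %{NUMBER:bytes}
    r"(?P<duration>\d+(?:\.\d+)?)"           # %{NUMBER:duration}
)

fields = HTTP_LOG.match("55.3.244.1 GET /index.html 15824 0.043").groupdict()
print(fields)
# {'client': '55.3.244.1', 'method': 'GET', 'request': '/index.html',
#  'bytes': '15824', 'duration': '0.043'}
```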

Let’s verify this by running Logstash with the above configuration. First, save the log message above in /var/log/http.log (or any file you prefer), and then start Logstash. The event indexed in Elasticsearch will contain the following fields:

client : 55.3.244.1
method : GET
request : /index.html
bytes : 15824
duration : 0.043

Target Variables

A pattern can store the matched value in a new field. Specify the field name in the Grok filter:

filter {
  grok {
    match => [ "message", "%{USERNAME:user}" ]
  }
}

This would match a username in the message and store it in a field called ‘user’.

Casting

Groked fields are strings by default. Numeric fields (int and float) can be declared in the pattern:

filter {
  grok {
    match => [ "message", "%{NUMBER:bytes:int}" ]
  }
}

Note that this is just a hint that Logstash will pass along to Elasticsearch when it tries to insert the event. If the field already exists in the index with a different type, this won’t change the mapping in Elasticsearch until a new index is created.
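As a sketch of what this conversion step amounts to, the following Python snippet (a simplified model, not Logstash internals) casts the declared fields and leaves the rest as strings:

```python
# Only int and float conversions are supported, mirroring Grok's
# :int / :float suffixes (assumption: simplified model of the filter).
CONVERTERS = {"int": int, "float": float}

def cast_fields(event, types):
    """types maps field name -> 'int' or 'float'; other fields stay strings."""
    return {k: CONVERTERS[types[k]](v) if k in types else v
            for k, v in event.items()}

event = {"bytes": "15824", "duration": "0.043", "method": "GET"}
casted = cast_fields(event, {"bytes": "int", "duration": "float"})
print(casted)  # {'bytes': 15824, 'duration': 0.043, 'method': 'GET'}
```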

Custom Patterns

Sometimes Logstash doesn’t ship with the pattern we need, so we have a few options for this situation.

First, we can use the Oniguruma syntax for named capture, which will let you match a piece of text and save it as a field:

(?<field_name>the pattern here)

For example, postfix logs have a queue id that is a 10- or 11-character hexadecimal value. We can capture that easily like this:

(?<queue_id>[0-9A-F]{10,11})
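The same capture works in Python, with the caveat that Python spells named groups (?P<name>...) rather than Oniguruma’s (?<name>...):

```python
import re

# Python uses (?P<name>...) for named groups; the regex body is identical.
line = "postfix/cleanup[21403]: BEF25A72965: message-id=..."
queue = re.search(r"(?P<queue_id>[0-9A-F]{10,11})", line)
print(queue.group("queue_id"))  # BEF25A72965
```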

Alternatively, you can create a custom patterns file.

Create a directory called patterns with a file in it called extra. The file name doesn’t matter, but do name it meaningfully for yourself.

In that file, write the pattern you need as the pattern name, a space, then the regexp for that pattern.

For example, doing the postfix queue id example as above:

# contents of ./patterns/postfix:
POSTFIX_QUEUEID [0-9A-F]{10,11}
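A small Python sketch of how such a file can be parsed, assuming the "NAME regex" line format shown above (this is a model of the format, not Logstash's actual loader):

```python
# Each non-comment, non-blank line of a patterns file is
# "NAME <space> regex" (assumption: mirrors the format shown above).
def load_patterns(text):
    patterns = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, _, body = line.partition(" ")
        patterns[name] = body
    return patterns

pats = load_patterns("# contents of ./patterns/postfix:\n"
                     "POSTFIX_QUEUEID [0-9A-F]{10,11}\n")
print(pats)  # {'POSTFIX_QUEUEID': '[0-9A-F]{10,11}'}
```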

Then use the patterns_dir setting in this plugin to tell Logstash where our custom patterns directory is. Here’s a full example with a sample log:

Jan 1 06:25:43 mailserver14 postfix/cleanup[21403]: BEF25A72965: message-id=<20130101142543.5828399CCAF@mailserver14.example.com>

The corresponding Grok filter configuration will be:

filter {
  grok {
    patterns_dir => ["./patterns"]
    match => { "message" => "%{SYSLOGBASE} %{POSTFIX_QUEUEID:queue_id}: %{GREEDYDATA:syslog_message}" }
  }
}

The above will match and result in the following fields:

timestamp : Jan 1 06:25:43
logsource : mailserver14
program : postfix/cleanup
pid : 21403
queue_id : BEF25A72965
syslog_message : message-id=<20130101142543.5828399CCAF@mailserver14.example.com>

The timestamp, logsource, program, and pid fields come from the SYSLOGBASE pattern, which is itself defined in terms of other patterns. If the input doesn’t match the pattern, a _grokparsefailure tag will be added to the event.
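The failure path can be sketched in Python as follows; this is a simplified model of the behavior, not the actual filter implementation:

```python
import re

# Simplified model: on a miss, tag the event instead of adding fields.
def grok_apply(regex, event):
    m = re.match(regex, event["message"])
    if m:
        event.update(m.groupdict())
    else:
        event.setdefault("tags", []).append("_grokparsefailure")
    return event

bad = grok_apply(r"(?P<queue_id>[0-9A-F]{10,11}):",
                 {"message": "not a postfix line"})
print(bad["tags"])  # ['_grokparsefailure']
```

Watching for events carrying this tag is the standard way to spot log lines your patterns don't cover.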

Common Examples

Syslog:

grok {
  match => { "message" => "%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}" }
  add_field => [ "received_at", "%{@timestamp}" ]
  add_field => [ "received_from", "%{host}" ]
}

Nginx:

grok {
  match => [ "message", "%{COMBINEDAPACHELOG}+%{GREEDYDATA:extra_fields}" ]
  overwrite => [ "message" ]
}

Apache:

grok {
  match => [
    "message", "%{COMBINEDAPACHELOG}+%{GREEDYDATA:extra_fields}",
    "message", "%{COMMONAPACHELOG}+%{GREEDYDATA:extra_fields}"
  ]
  overwrite => [ "message" ]
}

MySQL:

grok {
  match => [ "message", "(?m)^%{NUMBER:date} *%{NOTSPACE:time} %{GREEDYDATA:message}" ]
  overwrite => [ "message" ]
  add_field => { "mysql_time" => "%{date} %{time}" }
}

Elasticsearch:

grok {
  match => [
    "message", "\[%{TIMESTAMP_ISO8601:timestamp}\]\[%{DATA:loglevel}%{SPACE}\]\[%{DATA:source}%{SPACE}\]%{SPACE}\[%{DATA:node}\]%{SPACE}\[%{DATA:index}\] %{NOTSPACE} \[%{DATA:updated-type}\]",
    "message", "\[%{TIMESTAMP_ISO8601:timestamp}\]\[%{DATA:loglevel}%{SPACE}\]\[%{DATA:source}%{SPACE}\]%{SPACE}\[%{DATA:node}\] (\[%{NOTSPACE:Index}\]\[%{NUMBER:shards}\])?%{GREEDYDATA}"
  ]
}

Custom Application Log:

Let’s consider the following application log:

2015-04-17 16:32:03.805 ERROR [grok-pattern-demo-app,BDS567TNP,2424PLI34934934KNS67,true] 54345 --- [nio-8080-exec-1] org.qbox.logstash.GrokApplicarion : this is a sample message

We have the following Grok pattern configured for the above application logs:

match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} *%{LOGLEVEL:level} \[%{DATA:application},%{DATA:minQId},%{DATA:maxQId},%{DATA:debug}] %{DATA:pid} --- *\[%{DATA:thread}] %{JAVACLASS:class} *: %{GREEDYDATA:log}" }

For input data that matches this pattern, Logstash creates a JSON record as shown below.

{
  "minQId" => "BDS567TNP",
  "debug" => "true",
  "level" => "ERROR",
  "log" => "this is a sample message",
  "pid" => "54345",
  "thread" => "nio-8080-exec-1",
  "tags" => [],
  "maxQId" => "2424PLI34934934KNS67",
  "@timestamp" => 2015-04-17 17:02:03.301,
  "application" => "grok-pattern-demo-app",
  "@version" => "1",
  "class" => "org.qbox.logstash.GrokApplicarion",
  "timestamp" => "2015-04-17 16:32:03.805"
}

Debugging

There is an online Grok debugger available for building and testing patterns.

It offers three fields:

- The first field accepts one (or more) log lines
- The second takes the Grok pattern
- The third shows the result of filtering the first by the second

Demonstration of a Custom Application Log using the Grok Debugger:

Dissect Filter

The Grok filter gets the job done, but it can suffer from performance issues, especially if the pattern doesn’t match. An alternative is the Dissect filter, which is based on separators rather than regular expressions. Unfortunately, there’s no online debugger for it, but it’s much easier to write a separator-based filter than a regex-based one. The mapping equivalent to the above is:

%{timestamp} %{+timestamp} %{level}[%{application},%{minQId},%{maxQId},%{debug}] %{pid} %{}[%{thread}] %{class}:%{log}
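The idea behind separator-based parsing can be sketched in Python; the dissect helper below is a much-simplified model of the real filter, not its implementation:

```python
# Split on fixed delimiters instead of matching regexes (assumption:
# a toy model of separator-based parsing in the spirit of Dissect).
def dissect(line, delimiters, names):
    fields, rest = {}, line
    for delim, name in zip(delimiters, names):
        value, _, rest = rest.partition(delim)
        if name:                      # "" means skip this token, like %{}
            fields[name] = value
    fields[names[-1]] = rest          # the remainder goes to the last field
    return fields

line = "2015-04-17 16:32:03.805 ERROR [app,q1,q2,true]"
out = dissect(line, [" ", " ", " ["], ["date", "time", "level", "bracket"])
print(out["level"])  # ERROR
```

Because the parser only scans for literal delimiters, it does constant work per field, which is why Dissect tends to be cheaper than Grok on non-matching input.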

There are slight differences when moving from a regex-based filter to a separator-based one. Some strings end up padded with spaces. There are two ways to handle that:

- Change the logging pattern in the application, which might make direct log reading harder
- Strip the additional spaces with Logstash

Using the second option, the final filter configuration is:

filter {
  dissect {
    mapping => { "message" => ... }
  }
  mutate {
    strip => [ "log", "class" ]
  }
}

Conclusion

Grok is a library of expressions that make it easy to extract data from our logs. You can select from hundreds of available Grok patterns. There are many built-in patterns that are supported out-of-the-box by Logstash for filtering items such as words, numbers, and dates (see the full list of supported patterns here). If you cannot find the pattern you need, you can write your own custom pattern.

The Grok filter is powerful and used by many to structure data. However, depending on the specific log format to parse, writing the filter expression might be quite a complex task. The Dissect filter, based on separators, is an alternative that makes it much easier, at the price of some additional handling. It is also an option to consider in case of performance issues.

Give It a Whirl!

It’s easy to spin up a standard hosted Elasticsearch cluster on our Qbox data centers. Note, too, that you can provision your own AWS Credits on Qbox Private Hosted Elasticsearch.

Questions? Drop us a note, and we’ll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment with Qbox.