James Huang is an enterprise solutions architect at Amazon Web Services.

AWS recently announced support for Elasticsearch 5.1. The announcement mentioned a few important improvements in the open source software and in Amazon Elasticsearch Service (Amazon ES), the managed service. Elasticsearch 5.1 includes three important changes: a new scripting language called Painless, significant improvements to indexing throughput, and a new ingest model, which is described in this blog post.

Many operations teams use Logstash or homegrown pipelines of data transformation scripts and tools to modify log data before it’s stored and indexed in Elasticsearch. The new ingest model makes use of pipelines and processors offered in Elasticsearch 5.1 and Amazon ES to simplify your log analytics architecture and centralize administration of your data transformation logic.

If you’re using the ELK stack, this method allows you to remove Logstash and to instead use Beats to ship logs directly to Elasticsearch for processing and indexing.

Ingestion and pipelines

In Elasticsearch 5.1, you can think of a pipeline as a factory line of processors, where each processor has a single task to perform on the data (for example, applying a grok pattern).
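To make the factory-line idea concrete, here is a minimal Python sketch that models a pipeline as an ordered list of functions, each doing one job on a document. It only illustrates the concept: the processor functions (trim_message, split_message) and run_pipeline are hypothetical names, and this is not how Elasticsearch implements processors internally.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]
Processor = Callable[[Document], Document]


def trim_message(doc: Document) -> Document:
    """Hypothetical processor: strip surrounding whitespace from 'message'."""
    out = dict(doc)
    out["message"] = str(out["message"]).strip()
    return out


def split_message(doc: Document) -> Document:
    """Hypothetical processor: split 'message' on whitespace into 'tokens'."""
    out = dict(doc)
    out["tokens"] = str(out["message"]).split()
    return out


def run_pipeline(processors: List[Processor], doc: Document) -> Document:
    """Pass the document through each processor, in order."""
    for processor in processors:
        doc = processor(doc)
    return doc


result = run_pipeline(
    [trim_message, split_message],
    {"message": "  10.0.0.3 GET /index.jsp 1024 0.134 "},
)
print(result["tokens"])
```

The order of the list matters: each processor sees the output of the one before it, just as the processors in an ingest pipeline run in the order they are declared.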

The following are some of the well-known processors built into Elasticsearch 5.1:

Grok

JSON

Date

Convert

Trim

Sort

For more information about processors, see the Elasticsearch Reference 5.1.

Because Amazon ES is a managed service, it does not support custom processor plugins. However, you can use the script processor with an inline script (where your code is not referenced through a file or id) to customize logic.
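As a rough illustration of an inline script, the following Python snippet builds a pipeline body that contains a single script processor and prints it as JSON. The lang and inline keys follow the script processor options in the Elasticsearch 5.x reference; the Painless one-liner, the description, and the message_length field are assumptions made up for this sketch, not something this post defines.

```python
import json

# Hypothetical pipeline body: one script processor with an inline Painless script.
# "message_length" is an illustrative field name, not part of the original example.
script_pipeline = {
    "description": "sketch: record the length of the raw message",
    "processors": [
        {
            "script": {
                "lang": "painless",
                # The script is embedded here, not referenced by file or id.
                "inline": "ctx.message_length = ctx.message.length()",
            }
        }
    ],
}

# Print the JSON body you could paste into a PUT /_ingest/pipeline/<name> request.
print(json.dumps(script_pipeline, indent=2))
```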

Scenario

Say we’re working on some access log data where a single row of data looks like this:

10.0.0.3 GET /index.jsp 1024 0.134

In this scenario, we have an IP address, HTTP method, name of the document to be retrieved, size (in bytes), and the latency of the request.

It’s common to transform this into separate searchable fields by using a grok pattern. Before Elasticsearch 5.1, that meant using Logstash or scripting tools that modified the data into separate JSON attributes before it was sent to Elasticsearch. Now, however, you can set up a dedicated pipeline for this web access log format. The pipeline can use a grok processor.

First, you’ll need a cluster. For instructions, see Getting Started with Amazon Elasticsearch Service Domains.

When you create your domain, be sure to choose 5.1 from the Elasticsearch version drop-down box.

After the domain is created, copy the DNS endpoint from the dashboard.

Set up the client and access proxy

To protect a production Elasticsearch cluster, secure it as described in the Amazon ES documentation and in the How to Control Access to Your Amazon Elasticsearch Service Domain blog post.

For testing purposes, assuming there is no sensitive data being used, you might be able to use a simple access proxy like this one.

Note: This signing proxy was developed by a third party, not AWS. AWS is not responsible for the functioning or suitability of external content. This proxy can be used for development and testing, but is not suitable for production workloads.

To make things easier, we’ll use a Chrome plugin called Sense as our REST client.

To install the plugin, open a Chrome browser session, navigate to https://chrome.google.com/webstore/detail/sense-beta/lhjgkmllcaadmopgmanpapmpjgmfcfig, and then click the button to add the plugin to your browser.

Or you can navigate to the Chrome web store, search for Sense, and then add the plugin.

Any REST client will do, but Sense understands Elasticsearch APIs and will auto-complete parts of your request for you. Plus, it’s got fancy colors.

In the Sense client, the left side is for making requests. The right side (with a black background) displays the results of those requests.

If you’ve already installed the access proxy, you can point your Sense client to your local machine where the proxy is running. For example, instead of https://foo.bar.amazonaws.com:9200, you would point it to http://localhost:9200.

Pipeline and processor

To create the pipeline, in Sense, type (or copy and paste) the following into the left side of the request window:

PUT /_ingest/pipeline/my51pipeline
{
  "description": "my sample pipeline to show grok",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{IPV4:myclientip} %{WORD:httpmethod} %{URIPATHPARAM:uri_requested} %{NUMBER:sizebytes} %{NUMBER:duration}"]
      }
    }
  ]
}

Here’s what we are doing:

Defining a pipeline named my51pipeline.

Adding a single processor into the pipeline (in this case, the built-in processor, grok).

Setting up the pattern for grok (which field values to assign to which attribute names).
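If you would rather script this step than use Sense, the same request can be built with Python's standard library. This is a sketch under assumptions: it targets http://localhost:9200, where the signing proxy from the previous section is assumed to be listening, and build_pipeline_request is a hypothetical helper name. The code only builds the request; the final comment shows how you might send it.

```python
import json
import urllib.request

# Assumption: the access proxy described earlier listens on this address.
ENDPOINT = "http://localhost:9200"

PIPELINE = {
    "description": "my sample pipeline to show grok",
    "processors": [
        {
            "grok": {
                "field": "message",
                "patterns": [
                    "%{IPV4:myclientip} %{WORD:httpmethod} "
                    "%{URIPATHPARAM:uri_requested} "
                    "%{NUMBER:sizebytes} %{NUMBER:duration}"
                ],
            }
        }
    ],
}


def build_pipeline_request(name: str) -> urllib.request.Request:
    """Build (but do not send) the PUT request that defines the pipeline."""
    return urllib.request.Request(
        url=f"{ENDPOINT}/_ingest/pipeline/{name}",
        data=json.dumps(PIPELINE).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_pipeline_request("my51pipeline")
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request).read()
```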

For grok patterns that are useful for Elastic Load Balancing and Amazon S3 logs and can be used out of the box, visit this GitHub repo. (The link originally used in this blog post referred to a GitHub link for v5.x that is no longer available. This new link provides an equivalent resource for an updated version of Amazon Elasticsearch Service, as of May 2020.)

The format of the grok pattern is a series of %{SYNTAX:SEMANTIC} pairs.
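To see the pairs explicitly, this short Python sketch pulls each %{SYNTAX:SEMANTIC} pair out of the pattern used in this post. The regular expression PAIR is written for this illustration only and is not part of grok itself.

```python
import re

GROK_PATTERN = (
    "%{IPV4:myclientip} %{WORD:httpmethod} %{URIPATHPARAM:uri_requested} "
    "%{NUMBER:sizebytes} %{NUMBER:duration}"
)

# SYNTAX names the grok pattern to match; SEMANTIC names the field to fill.
PAIR = re.compile(r"%\{(?P<syntax>\w+):(?P<semantic>\w+)\}")

pairs = [(m.group("syntax"), m.group("semantic")) for m in PAIR.finditer(GROK_PATTERN)]
for syntax, semantic in pairs:
    print(f"{syntax:>12} -> {semantic}")
```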

%{IPV4:myclientip} indicates that you want grok to pull out the first column of information, which should match the regex behind the IPV4 grok pattern, and that the value should be assigned to myclientip.

If all goes well, then for the sample log line below, 10.0.0.3 goes into the attribute named myclientip. This happens at ingest time, and only when this pipeline and processor are used.

10.0.0.3 GET /index.jsp 1024 0.134
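The extraction can be approximated outside Elasticsearch with a regular expression that uses named groups. The sub-patterns below are simplified stand-ins chosen for this sketch; the real IPV4, WORD, URIPATHPARAM, and NUMBER grok definitions are stricter. The helper name parse is hypothetical.

```python
import re

# Simplified stand-ins for the grok patterns; not the actual grok definitions.
LOG_LINE = re.compile(
    r"(?P<myclientip>\d{1,3}(?:\.\d{1,3}){3}) "
    r"(?P<httpmethod>\w+) "
    r"(?P<uri_requested>\S+) "
    r"(?P<sizebytes>\d+(?:\.\d+)?) "
    r"(?P<duration>\d+(?:\.\d+)?)"
)


def parse(message: str):
    """Return the named fields for a matching log line, or None if it does not match."""
    match = LOG_LINE.fullmatch(message)
    return match.groupdict() if match else None


fields = parse("10.0.0.3 GET /index.jsp 1024 0.134")
print(fields)
```

Every value in this sketch comes back as a string. Regular-expression capture always yields text, which is one reason to consider explicit mappings for fields that should be numeric.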

So let’s try it.

PUT /_ingest/pipeline/my51pipeline
{
  "description": "my sample pipeline to show grok",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{IPV4:myclientip} %{WORD:httpmethod} %{URIPATHPARAM:uri_requested} %{NUMBER:sizebytes} %{NUMBER:duration}"]
      }
    }
  ]
}

Copy and paste the pipeline into the left side of Sense, and then click the green triangle on the right of the entry.

On the right side of Sense, you should see:

{
  "acknowledged": true
}

So far, so good. Let’s see the pipeline definition.

Type this into Sense:

GET /_ingest/pipeline

The response should include the definition of the pipeline you just created.

Now let’s feed in some data and inspect it. This first time we’re not going to use the pipeline:

PUT /logindex/accesslog/notpipelined1
{
  "message": "10.0.0.3 GET /index.jsp 1024 0.134"
}

We’re creating an index in Elasticsearch called logindex and putting in some data of type accesslog. This document’s id is “notpipelined1”.

Click the green triangle to send the request. The response confirms that the document was indexed.

Now let’s view that entry.

On the left side of Sense, type the following and then click the green triangle.

GET /logindex/accesslog/notpipelined1

The good news is that we have data in the cluster. The bad news is that the message is stored as a single string rather than split into fields, so it is not as usable for searching and filtering.

Let’s try it again, this time with the pipeline:

PUT /logindex/accesslog/ispipelined2?pipeline=my51pipeline
{
  "message": "10.0.0.3 GET /index.jsp 1024 0.134"
}
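The only difference between the two index requests is the pipeline query parameter. The following Python sketch builds both request paths so that the difference is easy to see; document_path is a hypothetical helper written for this illustration.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def document_path(index: str, doc_type: str, doc_id: str,
                  pipeline: Optional[str] = None) -> str:
    """Build the request path for indexing one document, optionally through a pipeline."""
    path = "/" + "/".join(quote(part, safe="") for part in (index, doc_type, doc_id))
    if pipeline:
        # The pipeline is selected per request through a query parameter.
        path += "?" + urlencode({"pipeline": pipeline})
    return path


plain = document_path("logindex", "accesslog", "notpipelined1")
piped = document_path("logindex", "accesslog", "ispipelined2", pipeline="my51pipeline")
print(plain)
print(piped)
```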

Note: If you were using Filebeat, you could specify the pipeline name in your filebeat.yml file as follows:

output.elasticsearch:
  hosts: ["myamazonelasticsearch.amazonaws.com:9200"]
  pipeline: my51pipeline

In Sense, you’ll get an acknowledgement of the success.

Let’s observe the result. Type the following into Sense:

GET /logindex/accesslog/ispipelined2

You have now separated the fields, thanks to your pipeline and grok processor.

In addition to ingesting and transforming the data, you might also want to set up an index template and mappings to ensure that your data is typed properly. For more information, see Index Templates.
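As a sketch of what such a template might look like for this example, the following Python snippet builds a template body and prints it as a request you could paste into Sense. The template name accesslog_template, the logindex* pattern, and the field types are assumptions chosen for illustration; the top-level "template" key reflects the Elasticsearch 5.x index template syntax.

```python
import json

# Hypothetical template: applies to any index whose name starts with "logindex".
template = {
    "template": "logindex*",
    "mappings": {
        "accesslog": {
            "properties": {
                "myclientip": {"type": "ip"},
                "httpmethod": {"type": "keyword"},
                "uri_requested": {"type": "keyword"},
                "sizebytes": {"type": "long"},
                "duration": {"type": "float"},
            }
        }
    },
}

# "accesslog_template" is an illustrative name, not one used elsewhere in this post.
print("PUT /_template/accesslog_template")
print(json.dumps(template, indent=2))
```

A template only affects indexes created after it is stored, so an index that already exists, such as logindex here, would keep its current mappings.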

This new ingest model allows businesses to transform data in a centralized, native manner in Amazon Elasticsearch Service.

Log shipping and transformation alternatives

An alternative option for getting log data into Amazon ES is to send log data to Amazon CloudWatch Logs through a downloadable agent. The log data can then be streamed into Amazon ES through a Lambda function. For more information, see Streaming CloudWatch Logs Data to Amazon Elasticsearch Service.

This method is compatible with the new ingest model because you control the Lambda function that sends log data to Amazon ES, so you can change it to name a pipeline in its requests.
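One way a function like this could route documents through a pipeline is to add the pipeline query parameter to its bulk request. The Python sketch below shows only the shape of such a request. The bulk_request helper and its defaults are hypothetical, and the function that AWS provides for CloudWatch Logs streaming may be structured quite differently.

```python
import json
from typing import Iterable, Tuple


def bulk_request(messages: Iterable[str],
                 index: str = "logindex",
                 doc_type: str = "accesslog",
                 pipeline: str = "my51pipeline") -> Tuple[str, str]:
    """Return the path and newline-delimited body for a bulk request through a pipeline."""
    lines = []
    for message in messages:
        # Each document is an action line followed by a source line.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps({"message": message}))
    # The bulk body must end with a newline.
    body = "\n".join(lines) + "\n"
    return f"/_bulk?pipeline={pipeline}", body


path, body = bulk_request(["10.0.0.3 GET /index.jsp 1024 0.134"])
print(path)
print(body, end="")
```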

Log data can also be ingested through a downloadable Java agent, which can buffer that log data to an Amazon Kinesis Firehose stream. If you want to transform or enrich log data, Amazon Kinesis Firehose lets you call a Lambda function for custom processing. For more information, see Amazon Kinesis Firehose Data Transformation. The advantage of using a Lambda function for transformation is that it can call other services, such as Amazon DynamoDB, to enrich your data. The disadvantage of this ETL method is that you must reimplement, in your transformation Lambda function, the logic that the Elasticsearch processors would otherwise provide.
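To illustrate that trade-off, here is a minimal Python sketch of a Firehose transformation function that does by hand what the grok processor did earlier. The event and response shapes (recordId, result, and base64-encoded data) follow the Firehose data transformation contract. The regular expression is a loose stand-in for the grok pattern, and the choice to mark unparseable lines as ProcessingFailed is an assumption for this example.

```python
import base64
import json
import re

# Loose stand-in for the grok pattern: five space-separated tokens.
LOG_LINE = re.compile(
    r"(?P<myclientip>\S+) (?P<httpmethod>\S+) (?P<uri_requested>\S+) "
    r"(?P<sizebytes>\S+) (?P<duration>\S+)"
)


def lambda_handler(event, context):
    """Split each incoming log line into named fields and return it as JSON."""
    output = []
    for record in event["records"]:
        message = base64.b64decode(record["data"]).decode("utf-8").strip()
        match = LOG_LINE.fullmatch(message)
        if match is None:
            # Return the record unchanged and flag it as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
            continue
        payload = json.dumps(match.groupdict()).encode("utf-8")
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(payload).decode("utf-8"),
        })
    return {"records": output}
```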

Happy pipelining!