We can use an ingest node to pre-process documents before the actual indexing takes place. An ingest node intercepts bulk and index requests, applies the transformations, and then passes the documents back to the index or bulk APIs.

Elasticsearch 5.x introduced the concept of the Ingest Node. It is a node in the cluster like any other, but with the ability to run a pipeline of processors that can modify incoming documents. The most frequently used Logstash filters have been implemented as processors.

Before Elasticsearch 5.x, however, there were mainly two ways to transform the source data into documents: Logstash filters, or pre-processing it yourself.

Indexing documents into the cluster can then be done either one at a time with the index API or in batches with the bulk API.

We previously discussed Automatic Keyword Extraction via the Elasticsearch 5.0 Ingest API. Here, we will go over what an Ingest Node is and what types of operations it can perform, and then walk through a concrete example from scratch: parsing and displaying CSV data using Elasticsearch and Kibana.

We can enable ingest on any node or even have dedicated ingest nodes. Ingest is enabled by default on all nodes. To disable ingest on a node, configure the following setting in the elasticsearch.yml file:

node.ingest: false

We define a pipeline that specifies a series of processors to pre-process documents before indexing. Each processor transforms the document in some way. For example, you may have a pipeline that consists of one processor that removes a field from the document followed by another processor that renames a field.
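As an illustration, a pipeline implementing exactly that remove-then-rename example could be defined as follows (the field names raw_line and zip_code are hypothetical):

```json
{
  "description": "Drop a temporary field, then rename another",
  "processors": [
    { "remove": { "field": "raw_line" } },
    { "rename": { "field": "zip_code", "target_field": "zip" } }
  ]
}
```

The processors run in the order listed, and each one receives the document as the previous processor left it.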

For this post, we will be using hosted Elasticsearch on Qbox.io. You can sign up or launch your cluster here, or click “Get Started” in the header navigation. If you need help setting up, refer to “Provisioning a Qbox Elasticsearch Cluster.”

We will use a New York State open catalog of farmers markets. This CSV file was updated in May 2017, and it consists of 20 fields that detail the time and location of community farmers markets as well as the name and phone number of the market manager. The goal will be to use the Ingest feature of Elasticsearch in a Qbox-provisioned Elasticsearch cluster. (Qbox provides out-of-box solutions for Elasticsearch, Kibana, and many Elasticsearch analysis and monitoring plugins.) We will set up an ingest pipeline to parse the data into structured JSON, index the data, and use Kibana to build a map of New York City that includes all of these community farmers markets.

Prerequisites

The amount of CPU, RAM, and storage that your Elasticsearch server will require depends on the volume of logs that you intend to gather. For this tutorial, we will be using a Qbox provisioned Elasticsearch with the following minimum specs:

Provider: AWS

Version: 5.1.1

RAM: 1 GB

CPU: 1 vCPU

Replicas: 0

The above specs can be changed per your desired requirements. Please select the appropriate names, versions, regions for your needs. For this example, we used Elasticsearch version 5.3.2, the most recent version. We support all versions of Elasticsearch on Qbox. (To learn more about the major differences between 2.x and 5.x, click here.)

Endpoint

REST API

https://13107969358abaa8f0cf:16550c2323@eb843037.qb0x.com:30579

Authentication

Username = 13107969358abaa8f0cf

Password = 16550c2323

TRANSPORT (NATIVE JAVA)

eb843037.qb0x.com:30579

We will use the Ingest feature from Elasticsearch instead of Logstash as a way to remove the need of extra software/architecture setup for a simple problem that can be solved just with Elasticsearch.

With all this in place, we will be able to visualize the data and answer questions such as “Where can we find a farmers market?”, “Where are most of the farmers markets located?”, and “Which area is the densest?”. Our data will come from a plain text file and will be turned into insights.

Setup

We will download the CSV file with this data from the Export to CSV feature included in the following website: https://data.ny.gov/Farmers-Markets-in-New-York-State.

We will use a Linux script composed of a simple loop that iterates through the CSV lines and sends each one to our Qbox-provisioned Elasticsearch endpoint. In addition, we need Kibana enabled, both to use the Dev Tools and to build the dashboard.


In order to parse the file, we need to replace all double quotes with single quotes and delete the first line of the file (the header) before processing it. This can be done with your preferred tool. Each entry should then look like this (note the single quotes):
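As a sketch, the quote replacement and header removal can be done with sed; it is shown here on a tiny inline sample rather than the real download:

```shell
# Two-line stand-in for the downloaded CSV: a header row plus one record
printf 'County,Market Name\nAlbany,"Altamont Farm Stand"\n' > sample.csv

# 1d deletes the header line; s/"/'/g turns every double quote into a single quote
sed -e '1d' -e "s/\"/'/g" sample.csv > sample_clean.csv

cat sample_clean.csv   # Albany,'Altamont Farm Stand'
```

The same two expressions run unchanged against Farmers_Markets_in_New_York_State.csv.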

Albany,Altamont Farm Stand,'Altamont Free Library, 181 Main St.',181 Main Street,Altamont,NY,12009,Kelly Best,5188618554,,'Monday 11am-6pm, Friday 11am-6pm, Saturday 10am-2pm',July 1-September 1,M,Y,N,N,Y,42.70149,-74.03254,'(42.70149, -74.03254)'

Parsing using Pipeline Processor

A pipeline is a definition of a series of processors that are to be executed in the same order as they are declared. A pipeline consists of two main fields: a description and a list of processors:

{ "description" : "...", "processors" : [ ... ] }

The description is a special field to store a helpful description of what the pipeline does.

The processors parameter defines a list of processors to be executed in order.

In order to search and build dashboards, we need to parse the plain text into structured JSON. For this, we will send the data to Elasticsearch using the following script:

while read f1; do
  curl -XPOST 'https://13107969358abaa8f0cf:16550c2323@eb843037.qb0x.com:30579/farmer_markets_info/market' -H "Content-Type: application/json" -d "{ \"market\": \"$f1\" }"
done < Farmers_Markets_in_New_York_State.csv

This script will read the file (named Farmers_Markets_in_New_York_State.csv), line by line, and send the following initial json to Elasticsearch:

{ "market": "Albany,Altamont Farm Stand,'Altamont Free Library, 181 Main St.',181 Main Street,Altamont,NY,12009,Kelly Best,5188618554,,'Monday 11am-6pm, Friday 11am-6pm, Saturday 10am-2pm',July 1-September 1,M,Y,N,N,Y,42.70149,-74.03254,'(42.70149, -74.03254)'" }

The json contains just one field, “market”, holding a whole unstructured line. Once the json is sent to Elasticsearch, we need to break the market field into multiple fields, each containing a single value from the line. It is highly recommended to use the simulate API from Elasticsearch to develop and test the pipeline before actually creating it. Initially, we start with a document and an empty pipeline:

curl -XPOST ES_HOST:ES_PORT/_ingest/pipeline/_simulate -d ‘{ "pipeline": {}, "docs": [ { "market": "Albany,Altamont Farm Stand,’Altamont Free Library, 181 Main St.’,181 Main Street,Altamont,NY,12009,Kelly Best,5188618554,,’Monday 11am-6pm, Friday 11am-6pm, Saturday 10am-2pm’,July 1-September 1,M,Y,N,N,Y,42.70149,-74.03254,’(42.70149, -74.03254)’" } ] }’

There are many processors available, so you should review all of them to choose which to use. In this case we will simply use the Grok Processor, which allows us to easily define a simple pattern for our lines. The idea of the following pipeline is to parse the line with grok and finally remove the field containing the full line:

curl -XPOST ES_HOST:ES_PORT/_ingest/pipeline/_simulate -d ‘{ "pipeline": { "description": "Parsing the NYC Farmer Markets", "processors": [ { "grok": { "field": "market", "patterns": [ “%{DATA:country},%{DATA:market_name},'%{DATA:_location}',%{DATA:addr_line_1},%{DATA:city},%{DATA:state},%{NUMBER:zip},%{DATA:contact},%{NUMBER:phone},%{DATA:market_link},'%{DATA:operation_hours}',%{DATA:operation_season},%{DATA:operating_months},%{DATA},%{DATA},%{DATA},%{DATA},%{NUMBER:location.lat},%{NUMBER:location.lon},'%{DATA}'” ] } }, { "remove": { "field": "market" } } ] }, "docs": [ { "_index": "community_farmers_info", "_type": "market", "_id": "AVvJZVQEBr2flFKzrrkr", "_score": 1, "_source": { "market": "Albany,Altamont Farm Stand,’Altamont Free Library, 181 Main St.’,181 Main Street,Altamont,NY,12009,Kelly Best,5188618554,,’Monday 11am-6pm, Friday 11am-6pm, Saturday 10am-2pm’,July 1-September 1,M,Y,N,N,Y,42.70149,-74.03254,’(42.70149, -74.03254)’" } } ] }’

This pipeline produces the following document, with structured fields, ready to be indexed into Elasticsearch:

{ "docs": [ { "doc": { "_id": "AVvJZVQEBr2flFKzrrkr", "_type": "market", "_index": "community_farmers_info", "_source": { "_location": "Altamont Free Library, 181 Main St.", "zip": "12009", "addr_line_1": "181 Main Street", "country": "Albany", "operation_hours": "Monday 11am-6pm, Friday 11am-6pm, Saturday 10am-2pm", "operation_season": "July 1-September 1", "city": "Altamont", "market_name": "Altamont Farm Stand", "phone": "5188618554", "contact": "Kelly Best", "operating_months": "M", "location": { "lon": "-74.03254", "lat": "42.70149" }, "state": "NY", "market_link": "" }, "_ingest": { "timestamp": "2017-06-19T06:55:15.781Z" } } } ] }

Indexing the Documents

Before indexing the documents, we need to create an index template that matches the index name. In order to do document filtering and geo-location queries in Elasticsearch/Kibana visualizations, certain field types require a specific mapping definition.


This can be done by explicitly creating the index in advance, but it is more flexible to use an index template: any new index whose name matches the template will be created automatically with all of its settings and mappings. The name of the index in this case will start with farmer_markets, so we use the following template to match it:

curl -XPUT ES_HOST:ES_PORT/_template/nyc_template -H 'Content-Type: application/json' -d '{
  "template": "farmer_markets*",
  "settings": { "number_of_shards": 1 },
  "mappings": {
    "market": {
      "properties": {
        "country": { "type": "keyword" },
        "market_name": { "type": "keyword" },
        "_location": { "type": "keyword" },
        "addr_line_1": { "type": "keyword" },
        "city": { "type": "keyword" },
        "state": { "type": "keyword" },
        "zip": { "type": "keyword" },
        "contact": { "type": "keyword" },
        "phone": { "type": "keyword" },
        "operation_hours": { "type": "keyword" },
        "operation_season": { "type": "keyword" },
        "operating_months": { "type": "keyword" },
        "location": { "type": "geo_point" }
      }
    }
  }
}'

After creating the template, we need to take the ingest pipeline from the simulate step and put the pipeline itself into Elasticsearch so that we can invoke it at indexing time. The command to put the ingest pipeline should look like:

curl -XPUT ES_HOST:ES_PORT/_ingest/pipeline/parse_nyc_csv -H 'Content-Type: application/json' -d @- <<'EOF'
{
  "description": "Parsing the NYC farmer markets",
  "processors": [
    {
      "grok": {
        "field": "market",
        "patterns": [
          "%{DATA:country},%{DATA:market_name},'%{DATA:_location}',%{DATA:addr_line_1},%{DATA:city},%{DATA:state},%{NUMBER:zip},%{DATA:contact},%{NUMBER:phone},%{DATA:market_link},'%{DATA:operation_hours}',%{DATA:operation_season},%{DATA:operating_months},%{DATA},%{DATA},%{DATA},%{DATA},%{NUMBER:location.lat},%{NUMBER:location.lon},'%{DATA}'"
        ]
      }
    },
    { "remove": { "field": "market" } }
  ]
}
EOF

Now, we just need to read the file and send the data to Elasticsearch using the pipeline we created. The pipeline is specified in the URL:

while read f1; do
  curl -XPOST 'https://13107969358abaa8f0cf:16550c2323@eb843037.qb0x.com:30579/farmer_markets_info/market?pipeline=parse_nyc_csv' -H "Content-Type: application/json" -d "{ \"market\": \"$f1\" }"
done < Farmers_Markets_in_New_York_State.csv

Let it run for a while, and it will create each document one at a time. Note that here we are using the index API and not the bulk API. In order to make it faster and more robust for production use cases, we recommend using the bulk API to index these documents. At the end of the ingest process, we will end up with 747 markets in the farmer_markets_info index.
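As a sketch of that bulk approach (the field name market and the endpoint path match the examples above), the whole cleaned file can be converted to an NDJSON body and indexed in a single request through the same pipeline:

```shell
# One-line stand-in for the cleaned CSV used throughout this post
printf 'Albany,Altamont Farm Stand\n' > markets.csv

# For every CSV row, emit a bulk action line followed by the source document
awk '{ print "{\"index\":{}}"; print "{\"market\": \"" $0 "\"}" }' markets.csv > bulk_body.ndjson

cat bulk_body.ndjson

# A single request then indexes everything through the ingest pipeline:
# curl -XPOST 'ES_HOST:ES_PORT/farmer_markets_info/market/_bulk?pipeline=parse_nyc_csv' \
#      -H 'Content-Type: application/x-ndjson' --data-binary @bulk_body.ndjson
```

This works here because the preprocessing step already replaced embedded double quotes; any remaining ones would need JSON escaping.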

Kibana Dashboards / Visualizations

In order to build the dashboard, we first need to add the index pattern to Kibana. For that, go to Management and add the index pattern farmer_markets_info. You should uncheck the “Index contains time-based events” option because this is not time-series data (our data doesn’t contain a date-time field).

Adding an index pattern to Kibana is pretty easy, and once the index pattern is selected, Kibana provides many features to sort, filter, and visualize data.

We can filter through the indexed documents using Kibana. By adding some additional visualizations, such as a Saved Search and the type of entrance, we can easily build a tool to search for specific farmers markets in New York City.

After this, we can create our Tile Map visualization showing all the farmers markets in this dataset for New York City. Go to Visualize and choose the Tile Map type. By choosing the geolocation field and the cardinality of the market name, we get a quick and easy view of the existing markets.

As we can see here, downtown Manhattan is the area with the most farmers markets. With Kibana we can select a specific rectangle and also filter on a value such as “state”. For example, of the 747 markets, around 77 are located in Manhattan, and 36 markets operate in spring (April or May).

Give It a Whirl!

It’s easy to spin up a standard hosted Elasticsearch cluster on any of our 47 Rackspace, Softlayer, Amazon, or Microsoft Azure data centers. And you can now provision your own AWS Credits on Qbox Private Hosted Elasticsearch.

Questions? Drop us a note, and we’ll get you a prompt response.

Not yet enjoying the benefits of a hosted ELK stack enterprise search on Qbox? We invite you to create an account today and discover how easy it is to manage and scale your Elasticsearch environment in our cloud hosting service.