Like many other companies, we use Impala as the primary tool our analysts rely on to retrieve data from the DWH. As in any other IT project, we needed two important things: Logging and Monitoring.

Without these two components, it is extremely hard to understand how your infrastructure is behaving, and which queries/users/processes are hurting performance.

As you may know, Impala offers a nice UI with some metrics and a small query log (up to 1000 queries). That’s alright if you have a single Impala node, but with a cluster it becomes hard to gather all that knowledge from every node.

Besides the raw data, it’s useful to have alerts if something is misbehaving, or to help locate some poorly performing query.

Let’s start talking about the metrics, what we built, and why it is so useful to us.

Measure Everything!

Why do we need metrics? When do they make sense? Let’s start by imagining a couple of scenarios.

Imagine we’re going to tweak Impala’s configuration to improve performance. In this scenario, we need to get insights before we start working on it. Without metrics, how are we going to be sure the changes are improving anything?

Another scenario: we have a limit of 40 concurrent queries on Impala and need some system to alert us if the value goes higher than 40.

Those situations have a straightforward solution with statsd, Grafana, and Impala Monitor.

Solution

We built a tool that ships all the metrics and query logs from the Impala UI to Grafana and Kibana. This way, we now have a much better understanding of whether any change or query is affecting the platform.

Impala Monitor is a Python 3.6 application built to retrieve the metrics/query log from Impala UI automatically. At this point, we are going to focus on the metrics.

The monitoring service can be configured by passing some parameters:

nodes : Specify your Impala nodes. For instance: node-01:25000

seconds : How often the data has to be retrieved. A value between 3 and 10 seconds should give you useful insights.

graphite-node : To which Graphite node the information should be sent.

graphite-port : Which port it should connect to.

graphite-prefix : Which prefix should be used to generate the key names.

env : Which environment is being monitored. For instance: staging or production.
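
With these parameters, a hypothetical invocation could look like this (the script name and exact flag syntax are illustrative, only the parameter names come from the list above):

$ python impala_monitor.py --nodes node-01:25000,node-02:25000 --seconds 5 --graphite-node statsd.internal --graphite-port 8125 --graphite-prefix impala --env production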

With those parameters set, the monitoring app will start retrieving metrics. Let’s dive into how it manages that.

Using Python’s concurrent.futures library, we can take advantage of multithreading to speed up the process of retrieving data from the Impala UI. Let’s take the example of collecting metrics for each node:
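
Here is a minimal sketch of that approach, assuming each node exposes its JSON metrics on the debug web UI under /jsonmetrics (the endpoint path is an assumption):

from concurrent.futures import ThreadPoolExecutor, as_completed

import requests


def load_url(node, timeout=10):
    """Fetch the JSON metrics of a single Impala node, e.g. node-01:25000."""
    response = requests.get('http://{}/jsonmetrics'.format(node), timeout=timeout)
    response.raise_for_status()
    return node, response.json()


def fetch_metrics(nodes, max_workers=4):
    """Collect the metrics of every node in parallel using a thread pool."""
    metrics = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Each submit() returns a Future that will eventually hold the result of load_url.
        futures = [executor.submit(load_url, node) for node in nodes]
        for future in as_completed(futures):
            node, payload = future.result()
            metrics[node] = payload
    return metrics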

As you can see here, we created a new ThreadPoolExecutor with a maximum number of workers. Each execution generates a Future that will eventually return the value produced by the load_url method.

This piece of code allows us to parallelize the task of getting the metrics for each node, meaning we can have many nodes and speed up the execution simply by increasing the number of workers. The optimal configuration here depends on your hardware/instance’s limits.

However, the Impala JSON metrics endpoint returns quite a large chunk of data, which we need to filter down to only the information we care about. To achieve that, we implemented an intermediate step that sends only the filtered metrics to statsd.
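
A sketch of that intermediate step, assuming the statsd Python client; the metric names in the whitelist are only examples, and the host, port and prefix would come from the parameters described earlier:

import statsd

# Example whitelist; the real one depends on which Impala metrics you care about.
METRICS_WHITELIST = [
    'impala-server.num-queries-registered',
    'impala-server.num-fragments-in-flight',
]

# Host, port and prefix map to graphite-node, graphite-port and graphite-prefix.
client = statsd.StatsClient('statsd.internal', 8125, prefix='production.impala')


def send_metrics(node, payload):
    """Send only the whitelisted metrics of one node to statsd as gauges."""
    node_key = node.split(':')[0]  # 'node-01:25000' -> 'node-01'
    for name in METRICS_WHITELIST:
        if name in payload:
            client.gauge('{}.{}'.format(node_key, name), payload[name])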

With all those metrics we can build some nice graphs in Grafana like:

Memory usage per Impala node

We can also take advantage of Grafana’s alerting system and get alerts if the maximum number of concurrent queries is close to being reached:

Alerting module if the max number of queries is reached

We automated the deployment of this service with Ansible and run it under Supervisor to make sure it is always up and running.

Logging your queries

A fundamental companion to metrics is logging. It is nice to know how the platform is performing, but when something goes wrong, we need to know why.

Let’s take an example. At 2am we saw a memory peak in our metrics dashboard. We knew that a large query must have been executed, but we had no way of finding out exactly which one, because Impala only keeps the last 1000 queries in its log.

With logging on ElasticSearch and using Kibana, we now have the ability to search for specific times and attributes. For instance, we can find out not only which query was executed on node X at time Y, but also the amount of memory it consumed, vCores it used, the number of rows it fetched, and other data besides.

View on Impala UI of the last 25 completed queries

In this case, we need to do some parsing to transform that HTML into meaningful data. We can extract some useful information like:

Query

Query ID (necessary to avoid duplicates)

Username

Fetched rows

Memory usage (which is stored on the Profile page)

To accomplish this, we are using some useful Python packages like: BeautifulSoup4, lru-dict and elasticsearch.

BeautifulSoup4 is a fantastic tool to parse HTML content. Here you can see an example of retrieving data from a table like the one in the image above:
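
A minimal sketch of that parsing step, assuming the completed queries live in the first table of the /queries page and that the first row contains the column headers:

from bs4 import BeautifulSoup
import requests


def parse_completed_queries(node):
    """Scrape the completed-queries table from an Impala node's debug UI."""
    html = requests.get('http://{}/queries'.format(node), timeout=10).text
    soup = BeautifulSoup(html, 'html.parser')

    # Assumption: the completed queries are in the first table of the page
    # and the header cells describe the columns.
    table = soup.find('table')
    headers = [th.get_text(strip=True) for th in table.find_all('th')]

    queries = []
    for row in table.find_all('tr')[1:]:  # skip the header row
        cells = [td.get_text(strip=True) for td in row.find_all('td')]
        if len(cells) == len(headers):
            queries.append(dict(zip(headers, cells)))
    return queries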

If you ever need to parse much more complex HTML structures, I’d recommend lxml, which allows you to use XPath to find the elements you need.

When working with these kinds of logs, you must always consider data sanitization and standardization. For instance, converting everything to the same time and size units enables more powerful searches in the Kibana UI.
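
For example, a small helper like the one below (assuming memory is reported as strings such as '1.50 GB', which is an assumption about the profile format) can normalize everything to bytes before indexing:

import re

# Assumed input format: a number followed by a unit, e.g. '512.00 MB' or '1.50 GB'.
UNIT_FACTORS = {'B': 1, 'KB': 1024, 'MB': 1024 ** 2, 'GB': 1024 ** 3, 'TB': 1024 ** 4}


def memory_to_bytes(value):
    """Normalize a human-readable memory string to bytes, e.g. '1.50 GB' -> 1610612736."""
    match = re.match(r'([\d.]+)\s*([A-Z]+)', value.strip())
    if not match:
        return None
    number, unit = match.groups()
    return int(float(number) * UNIT_FACTORS.get(unit, 1))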

After all the payloads have been cleaned, we can start sending information to ElasticSearch. The simplest way to get Elasticsearch + Kibana up and running is with Docker.

We are going to use elk-docker for this purpose:

$ sudo docker run -p 5601:5601 -p 9200:9200 -p 5044:5044 -it --name elk sebp/elk

And send a payload that looks like:
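
The field names and values below are illustrative, based on the attributes we parse above; the document is indexed with the official elasticsearch Python client, reusing the query ID as the document ID to avoid duplicates:

from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])

# Illustrative document; field names mirror the attributes parsed from the Impala UI.
payload = {
    'query_id': 'a14b9d2f38c4e7d1:9f2e6c5b00000000',
    'query': 'SELECT country, COUNT(*) FROM orders GROUP BY country',
    'user': 'analyst_01',
    'node': 'node-01',
    'fetched_rows': 125000,
    'memory_usage_bytes': 1610612736,
    'start_time': '2018-03-21T02:03:11',
    'state': 'FINISHED',
}

# The query ID doubles as the document ID so re-indexing the same query is a no-op.
es.index(index='impala-queries', id=payload['query_id'], body=payload)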

With this setup we can find insights by querying Kibana:

Now for each query we have a full overview:

Furthermore, with Kibana we can create Dashboards based on queries. This example displays the average number of fetched rows and number of failed queries:

Conclusion

With the ability to understand your platform’s weaknesses and strengths, you gain the freedom to play around and try new ideas safely.

It makes sense to have these tools in place before dealing with refactoring (coding or infrastructure). You will sleep better, I promise!

Looking for a new job opportunity? Become a part of our team! We are constantly on the lookout for great talent. Help us in our mission of delivering fresh ingredients to your door and make home cooking accessible to everyone.