Distributed tracing is a method used to profile and monitor applications, especially those built using a microservices architecture. It helps pinpoint where failures occur and what causes poor performance. In addition, it can be used to visualize and understand communication between microservices without additional documentation (which tends to get stale quickly).

For a long time there were two major options to choose between: Zipkin and Jaeger. But there is a relatively new beast in the Elastic stack called APM. With Elastic APM it’s possible to store distributed tracing information inside Elasticsearch and visualize it in Kibana. This is extremely useful if you are already using Elasticsearch and Kibana for logs, because it opens up an out-of-the-box solution for linking your logs and tracing information together. No more jumping between systems to correlate logs and traces.

High-Level Architecture

Image taken from the APM docs

APM Agents

APM agents are parts of your services that collect and post tracing data to the APM server. The Elastic team has already created agents for the most popular programming languages (e.g. Go, Node.js, .NET, Java). The full list of supported agents and the corresponding docs can be found in the APM documentation.

APM Server

APM Server is responsible for processing requests from agents. It performs validation and security checks, transforms the data into valid documents, and stores them in Elasticsearch indices.

Elasticsearch

It’s a distributed search and analytics engine. In simple terms, you can treat it as a kind of database.

Kibana

A rich visualization platform that works with Elasticsearch.

Add APM Agent to a Node.js Service

It’s incredibly easy to add an APM agent to your existing service, and it requires a minimal amount of changes.

First of all, the package should be added to the app:

npm install elastic-apm-node --save

The next step is to require and start the agent:

const apm = require('elastic-apm-node').start({
  serviceName: 'my-awesome-service',
  serverUrl: 'http://localhost:8200',
})

The most important thing here is that this should be the first module you require in your app (details can be found below). The Node.js APM agent instruments libraries during startup: it wraps Node.js core modules and installed external libraries so it can automatically start, configure, and end transactions and spans. The list of instrumented modules can be found in the elastic-apm-node repository and in the official docs. As a developer you can disable instrumentation for any specific module by passing the disableInstrumentations option to your agent, or even disable automatic instrumentation altogether via the instrument option.
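For example, here is a minimal sketch of both options (the redis module name is used purely for illustration):

const apm = require('elastic-apm-node').start({
  serviceName: 'my-awesome-service',
  serverUrl: 'http://localhost:8200',
  // skip automatic instrumentation only for specific modules
  disableInstrumentations: ['redis'],
  // or disable automatic instrumentation completely; transactions
  // and spans would then have to be created manually
  // instrument: false,
})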

The APM agent is now up and running. It starts and finishes transactions for incoming HTTP requests, creates spans for outbound requests and external storage calls, and passes the corresponding HTTP headers to the next service (context propagation), all without a single additional line of code from the service developer.
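All of the above is automatic, but the agent also exposes an API for manual instrumentation, which is handy for non-HTTP work such as queue consumers. A minimal sketch, assuming a hypothetical handle function with your business logic:

const apm = require('elastic-apm-node'); // the agent is started elsewhere

async function processQueueMessage(message) {
  // transactions are not created automatically for non-HTTP work,
  // so start one manually
  const transaction = apm.startTransaction('process-queue-message', 'messaging');
  const span = apm.startSpan('handle-payload');
  try {
    await handle(message); // hypothetical business logic
  } finally {
    if (span) span.end();
    if (transaction) transaction.end();
  }
}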

It’s worth mentioning that the start function accepts tons of settings, so fine-grained configuration of your agent is possible. The folks at Elastic did a great job here by providing a rather flexible approach: it’s possible to use a global config file, environment variables, and configuration via code. Settings are applied in the following priority (see the example after the list):

options passed via environment variables — the highest priority

options passed in to agent.start()

options from global config file

default options — the lowest priority
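A sketch of how this priority plays out for the service name (ELASTIC_APM_SERVICE_NAME is a real agent environment variable; the names themselves are made up):

// started as: ELASTIC_APM_SERVICE_NAME=my-renamed-service node app.js
const apm = require('elastic-apm-node').start({
  serviceName: 'my-awesome-service',
});
// the agent reports as "my-renamed-service": the environment
// variable has higher priority than the agent.start() option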

Going back to starting the APM agent, and the phrase mentioned above: the agent must be started before you require any other modules in your Node.js application. If a module is loaded before the agent, it won’t be instrumented, and its transactions and spans will simply never show up in your traces.

APM Agents Usage Recommendations

The recommendations below are based on my personal experience with the Node.js Elastic APM agent.

Inspect transaction and span details

If instrumentation is turned on, be ready to look through the data the agent collects. For example, the APM agent stores request headers and the body for inbound HTTP requests, and it adds current user information based on what it finds in your request object. This leads to the following issues:

storing senseless information that increases memory consumption in Elasticsearch

storing sensitive information. Tokens and other sensitive data are rather often passed via headers (luckily, in 90% of cases the APM server strips this kind of information)

the agent adds the user id, name, and email to a transaction if it can find such information in the request object. E.g. if your app uses passport, which stores user information in request.user, and the user is not specified explicitly, the APM agent will take those values from the request. Storing such information can be a GDPR issue
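If the built-in stripping is not enough, the agent lets you register a filter that can redact any payload before it’s sent to the APM server. A minimal sketch, assuming a hypothetical x-internal-token header you want to hide:

const apm = require('elastic-apm-node');

apm.addFilter((payload) => {
  const request = payload.context && payload.context.request;
  if (request && request.headers && request.headers['x-internal-token']) {
    request.headers['x-internal-token'] = '[REDACTED]';
  }
  // the (possibly modified) payload must be returned,
  // otherwise the event is discarded
  return payload;
});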

Use Labels and Custom Context

It’s possible to attach additional information to transactions and spans: labels and custom context. The main difference between them is that labels are indexed (searchable) and custom context is not.

Examples for labels:

The system already had a correlation id (how to add a correlation id) before tracing was introduced, so the correlation id is a good candidate to be added as a label to all transactions.

Blue-green and canary releases. This information can also be stored in labels to distinguish between different variants of a service. It’s even possible to use the globalLabels setting in the APM agent to set up such labels.

Examples for custom context:

storing context-related information that does not make sense to search by, but that helps during analysis by providing some app-specific context
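A minimal sketch of both APIs, assuming a correlationId value provided by your correlation-id middleware and a made-up basket object:

const apm = require('elastic-apm-node');

function trackRequestContext(correlationId, basket) {
  // labels are indexed, so you can search and filter by them in Kibana
  apm.setLabel('correlationId', correlationId);

  // custom context is not indexed; it’s only shown
  // on the transaction details page
  apm.setCustomContext({ basket });
}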

Handle Multi-Cluster Environment

If there are multiple clusters (e.g. separate clusters for a specific customer or a specific location), then information about the cluster should definitely be tracked in transactions so you can filter by it. There are two options to track that data.

The first option is to set the cluster name via a label. With this approach there will be only one record per service on the services page in Kibana.

The second option is to set the cluster name via the environment agent parameter. In this case you can easily filter by environment in Kibana; there is a dedicated quick filter on all pages.
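A minimal sketch of the second option (the CLUSTER_NAME variable and the fallback value are made up for illustration):

const apm = require('elastic-apm-node').start({
  serviceName: 'my-awesome-service',
  serverUrl: 'http://localhost:8200',
  // shows up as the dedicated "environment" filter in Kibana
  environment: process.env.CLUSTER_NAME || 'eu-cluster-1',
})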

Inspect Resource Consumption

When adopting tracing, make sure it has been tested without sampling (i.e. with all transactions recorded), because memory consumption can increase. Personally, I ran into several memory leaks that, luckily, were fixed shortly after.

Support Quick Switch Between Sampling Rates

It does not really make sense to have tracing enabled at full power all the time. Usually it’s enabled at full power only during a dedicated performance investigation. The rest of the time it should run at 10–20% power to reduce the senseless load on Elasticsearch; 10–20% is still enough to build reports and keep a high-level view of the performance and issues of your system.
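A minimal sketch of such a switch: set a low default rate in code and rely on the ELASTIC_APM_TRANSACTION_SAMPLE_RATE environment variable (which has higher priority) to go to full power when needed:

const apm = require('elastic-apm-node').start({
  serviceName: 'my-awesome-service',
  // keep 10% by default; setting ELASTIC_APM_TRANSACTION_SAMPLE_RATE=1
  // in the environment switches to full power without a code change
  transactionSampleRate: 0.1,
})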

Unfortunately there is no “deep sampling”/“sampling propagation”. This means that if there are two services, ServiceA and ServiceB, one calls the other, and both have a sample rate configured, then sampling is applied in each service separately. A side effect is that it’s possible to have ServiceA transactions without any corresponding ServiceB transactions, and vice versa. The best behaviour, IMHO, would be: if the root transaction is persisted, then all sub-transactions are forced to be persisted as well; and if the root transaction is NOT persisted, then no sub-transactions are persisted either. I call it sampling propagation.

Enrich Log Records with Tracing Information

To be able to correlate log records and tracing data, log records must be enriched with tracing information that includes one or more of the following fields:

transaction.id

trace.id

span.id

The agent provides an API to get these values for the current transaction: apm.currentTraceIds. The only remaining step is to inject them into your particular logger.

In addition, it makes sense to add trace.id and transaction.id to response headers.
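A minimal sketch of both ideas, assuming a hypothetical logger and Express-style middleware (the response header names are made up; apm.currentTraceIds itself is the real agent API):

const apm = require('elastic-apm-node');

// enrich every log record with the current trace ids
function logInfo(logger, message) {
  // currentTraceIds yields keys like 'trace.id' and 'transaction.id'
  logger.info({ ...apm.currentTraceIds, message });
}

// expose trace ids to clients via response headers
function traceHeaders(req, res, next) {
  const ids = apm.currentTraceIds;
  if (ids['trace.id']) res.set('trace-id', ids['trace.id']);
  if (ids['transaction.id']) res.set('transaction-id', ids['transaction.id']);
  next();
}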

Be aware of custom thenables

Most of the instrumentation works via async hooks. Keep in mind that instrumentation won’t work for libraries that use custom thenables (e.g. mongoose uses its own promise implementation out of the box) due to a V8 issue with async hooks and custom thenables.

How to fix mongoose?

First, register the built-in promise as the default promise implementation for mongoose:

mongoose.Promise = global.Promise;

Second, use exec() to run database operations:

async function getDataNotTrackedByAPM() {
  const result = await Model.find();
  return result;
}

async function getData() {
  const result = await Model.find().exec();
  return result;
}

How do you make sure that a custom thenable is not in use in your code? I have a rather dirty approach, but it helped to migrate quite a big codebase (good test coverage is required, though):

if (process.env.NODE_ENV !== `production`) {
  mongoose.Query.prototype.then = function() {
    throw new Error(
      `use exec() because requests cannot be tracked by APM Agent`,
    );
  };
}

Configure Local Development Environment

Configuring a local development environment is easy-peasy with Docker. It requires only two files:

apm-server.yml

apm-server:
  host: "0.0.0.0:8200"

output.elasticsearch:
  hosts: ["elasticsearch:9200"]

docker-compose.yml

version: "3"

services:

elasticsearch:

image: docker.elastic.co/elasticsearch/elasticsearch:7.4.0

ports:

- 9200:9200

- 9300:9300

environment:

- discovery.type=single-node



apm-server:

image: docker.elastic.co/apm/apm-server:7.4.0

depends_on:

- elasticsearch

environment:

- output.elasticsearch.hosts=["elasticsearch:9200"]

volumes:

- "./apm-server.yml:/usr/share/apm-server/apm-server.yml:ro"

ports:

- 8200:8200



kibana:

image: docker.elastic.co/kibana/kibana:7.4.0

depends_on:

- elasticsearch

ports:

- 5601:5601

environment:

- ELASTICSEARCH_HOSTS=http://elasticsearch:9200

To start using distributed tracing, run docker-compose up. Then open Kibana at http://localhost:5601, find the APM tab, and follow the setup instructions.


Conclusion

The Elastic APM solution is quite capable and can be easily adopted thanks to the already implemented APM agents, with all sorts of “magic” inside that minimizes the effort required from service developers. This solution also removes the unnecessary step of adding extra resources for visualization if Kibana is already in place.
