Péter Márton, Co-Founder of RisingStack

Microservices is a powerful architecture pattern with many advantages, but it also brings new debugging challenges, as it's a distributed architecture that moves complexity to the network.

Distributed tracing (and OpenTracing) provides a solution by giving enough visibility and information about cross-process communication.

This article explains the basics of distributed tracing as well as shows an open-source solution to debug Node.js based microservices applications.

UPDATE: This article mentions Trace, RisingStack's Node.js Monitoring platform, several times. In October 2017, Trace was merged with Keymetrics's APM solution. Click here to give it a try!

Microservices debugging

Microservices is a powerful architecture pattern which helps your company move fast and ship features frequently: it maximizes the impact of autonomous teams by allowing them to design, build, and deploy their services independently, with full ownership over the lifecycle of their applications.

However, we shouldn't forget that a microservices architecture produces a distributed system which moves the complexity to the network layer.

Developers who have experience building and operating microservices know that debugging and observing a distributed system is challenging, as the communication between components doesn't happen through in-memory function calls. It also means that we no longer have stack traces.

This is where distributed tracing comes to the rescue and provides visibility for microservices.

Distributed Tracing

Traditional monitoring tools such as metrics and logging solutions still have their place, but they often fail to provide visibility across services. This is where distributed tracing thrives.

Distributed tracing provides enough visibility to debug microservices architectures by propagating transactions across distributed services and collecting information from cross-process communication.

The idea of distributed tracing is not new: Google has been using it internally to understand system behavior and reason about performance issues for more than a decade. In 2010, Google also published a whitepaper about their internal solution, called Dapper.

Distributed tracing gives visibility into microservices communication

Distributed Tracing Concepts

The Google Dapper whitepaper introduces the two basic elements of distributed tracing: Span and Trace.

Span

A Span represents a logical unit of work in the system that has an operation name, start time and duration. Spans may be nested and ordered to model causal relationships. An RPC call like an HTTP request or a database query is an example of a span, but you can also represent internal operations with spans.

Spans are controlled by events in a system. They can be started, finished and extended with operational data that makes debugging easier.

For example, when we make an HTTP call to another service, we want to start a span, finish it when the response is received, and decorate it with the status code and other metadata along the way.
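To make this lifecycle concrete, here is a minimal sketch of a span as a plain object. This is purely illustrative and not any real tracer's API:

```javascript
// Minimal, illustrative span implementation (not a real tracer API)
class Span {
  constructor (operationName) {
    this.operationName = operationName
    this.startTime = Date.now()
    this.tags = {}
    this.finished = false
  }

  // Decorate the span with operational data
  setTag (key, value) {
    this.tags[key] = value
  }

  // Close the span and record its duration
  finish () {
    this.duration = Date.now() - this.startTime
    this.finished = true
  }
}

// An HTTP call modeled as a span: start it, decorate it, finish it
const span = new Span('http_request')
span.setTag('http.status_code', 200)
span.finish()
```

A real tracer adds identifiers, parent references, and reporting on top of this, but the start-decorate-finish shape is the same.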

Trace

A Trace is represented by one or more spans. It's an execution path through the system. You can think of it as a DAG (Directed Acyclic Graph) of spans.

Trace: graph of spans on a timeline, source: Jaeger

Context propagation

To be able to connect spans and define relationships between them, we need to share some tracing context both within and between processes. For example, we need to define the parent-child relation between spans.

Cross-process communication can happen via different channels and protocols like HTTP requests, RPC frameworks, or messaging workers. To share the tracing context, we can use meta headers. For example, in an HTTP request, we can use request headers like X-Trace or Trace-Parent-ID.
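As a sketch of the idea, context propagation over HTTP headers could look like the snippet below. The header names here are made up for illustration; real tracers define their own (Jaeger, for example, uses uber-trace-id):

```javascript
// Illustrative only: the x-trace-id / x-span-id header names are hypothetical

// Client side: write the current span's context into outgoing request headers
function injectContext (spanContext, headers) {
  headers['x-trace-id'] = spanContext.traceId // same for every span in the trace
  headers['x-span-id'] = spanContext.spanId   // becomes the child span's parent id
  return headers
}

// Server side: read the incoming headers to continue the same trace
function extractContext (headers) {
  if (!headers['x-trace-id']) {
    return null // no incoming trace context: start a new trace
  }
  return {
    traceId: headers['x-trace-id'],
    parentId: headers['x-span-id']
  }
}
```

The receiving service uses the extracted context as the parent of the spans it creates, which is what links the two processes into one trace.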

To manage a span's lifecycle and handle context propagation, we need to instrument our code. We will discuss instrumentation in the next section.

Instrumentation

In the Tracing Concepts section, we discussed that we need to instrument our code to start and finish spans, to decorate them with metadata and to connect them between different processes.

This kind of instrumentation takes time and produces extra code, as we need to touch every part of our application to propagate the tracing context both within and between processes.

We can write this kind of instrumentation on our own, or we can use an out-of-the-box solution like Trace, our Node.js Monitoring & Debugging Platform.

If you decide to do the instrumentation on your own, you should be very careful while doing so. Instrumentation can introduce bugs and cause performance issues in your application, or it can simply make your code very hard to read.

OpenTracing

Okay, in case you decided that you want to do the instrumentation on your own, wouldn't it be great if you could do it in a vendor-neutral way?

I mean, who wants to spend weeks or months instrumenting their code if they have to repeat the process whenever they want to try out a different distributed tracing solution?

Nobody, right?!

This is exactly the challenge that OpenTracing addresses by providing a standard, vendor-neutral interface for instrumentation.

The OpenTracing standard also means that maintainers of open-source libraries and service providers can ship their solutions with built-in, vendor-neutral instrumentation for distributed tracing.

How cool would it be if the request and express npm packages came with built-in OpenTracing instrumentation?

Today we are not there yet. We need to instrument our own code as well as the libraries that we use in our application.

OpenTracing Example

Let's see the following simple code snippet that makes a request to a remote site:

const request = require('request')

// Request options
const uri = 'https://risingstack.com'
const method = 'GET'
const headers = {}

request({ uri, method, headers }, (err, res) => {
  if (err) {
    return
  }
})

Now let's see the very same code snippet when it's instrumented with OpenTracing:

const request = require('request')
const { Tags, FORMAT_HTTP_HEADERS } = require('opentracing')
const tracer = require('./my-tracer') // jaeger etc.

// Request options
const uri = 'https://risingstack.com'
const method = 'GET'
const headers = {}

// Start a span
const span = tracer.startSpan('http_request')
span.setTag(Tags.HTTP_URL, uri)
span.setTag(Tags.HTTP_METHOD, method)

// Send span context via request headers (parent id etc.)
tracer.inject(span, FORMAT_HTTP_HEADERS, headers)

request({ uri, method, headers }, (err, res) => {
  // Error handling
  if (err) {
    span.setTag(Tags.ERROR, true)
    span.setTag(Tags.HTTP_STATUS_CODE, err.statusCode)
    span.log({ event: 'error', message: err.message, err })
    span.finish()
    return
  }

  // Finish span
  span.setTag(Tags.HTTP_STATUS_CODE, res.statusCode)
  span.finish()
})

It's easy to see that the instrumented code is much more complicated and requires more effort on our side.

Cross-process propagation in Node.js

Earlier in this article, we discussed that distributed tracing requires cross-process Context Propagation to share information between processes and connect spans.

This kind of coordination between different parts of the application needs a standard solution, like a specific request header that each application must send and understand.

OpenTracing solves this elegantly: it gives the tracer provider the freedom to define these headers, while providing a well-defined instrumentation interface for setting and reading them.

Let's see a Node.js example on how you can share context in an HTTP request:

// Client side of HTTP request
const span = tracer.startSpan('http_request')
const headers = {}

tracer.inject(span, FORMAT_HTTP_HEADERS, headers)

request({ uri, method, headers }, (err, res) => { ... })

This is how you can read the context and define the relation between spans on the server side of the very same request:

// Server side of HTTP request
app.use((req, res) => {
  const parentSpanContext = tracer.extract(FORMAT_HTTP_HEADERS, req.headers)
  const span = tracer.startSpan('http_server', {
    childOf: parentSpanContext
  })
})

You can see that the extract(..) and inject(..) methods provide a vendor-neutral way to share context between processes.

The previous code snippet adds different request headers depending on the tracing vendor. For example, with the Jaeger tracer (see below), it adds the uber-trace-id header to your HTTP request.

Sampling

Distributed tracing has other challenges besides instrumentation. For example, in most cases we cannot collect tracing information from all of our communication, as it would be too much data to report, store, and process. Instead, we need to sample our traces and spans to keep the data small but representative.

In our sampling algorithm, we can weight traces based on different aspects like priority, error type, or occurrence.
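One simple strategy is probabilistic sampling. The sketch below keeps roughly a fixed fraction of traces and derives the decision deterministically from the trace id, so every process agrees on whether a given trace is sampled; this is an illustration, not any particular tracer's actual algorithm:

```javascript
// Illustrative probabilistic sampler: keep roughly `rate` of all traces.
// Deriving the decision from the trace id keeps it consistent across processes.
function shouldSample (traceId, rate) {
  // Map the first 8 hex characters of the trace id to a value in [0, 1]
  const bucket = parseInt(traceId.slice(0, 8), 16) / 0xffffffff
  return bucket < rate
}
```

Production samplers are usually more sophisticated, e.g. always keeping traces that contain errors or adapting the rate per operation, but the head-based decision happens at the root span just like here.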

In Trace, our Node.js Monitoring & Debugging tool, we collect and group traces by similarity. This doesn't just make them easy to overview; you can also see each error's occurrence count and make decisions based on that.



Traces by similarity and occurrence

Open-source Tracers

We call the application that collects, stores, processes, and visualizes distributed tracing data a Tracer. The most popular open-source tracers today are Zipkin and Jaeger:

Zipkin's design is based on the Google Dapper paper and was open-sourced by Twitter in 2012.

Jaeger is a newer distributed tracing solution built around OpenTracing and released in April 2017.

In the next section, we will dig deeper into Jaeger, as it is OpenTracing-compatible.

Jaeger

Jaeger is an OpenTracing-compatible tracer that was built and open-sourced by Uber in 2017. You can read more about the history and evolution of tracing at Uber in their article.

Jaeger's backend is implemented in Go and uses Cassandra as data storage, while the UI is built with React.

The agent and collector can also accept Zipkin spans, transforming them into Jaeger's data model before storage.



Architecture of Jaeger

You can try out Jaeger with Docker, using the pre-built image that contains all of the necessary components:

docker run -d \
  -p 5775:5775/udp \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 5778:5778 \
  -p 16686:16686 \
  -p 14268:14268 \
  jaegertracing/all-in-one:latest

Jaeger's UI gives us insight into trace durations and provides a search interface, as well as a timeline visualization to look at and inspect traces.



List of traces on Jaeger UI

Jaeger and Node.js

Jaeger's npm package is called jaeger-client. It provides an OpenTracing interface with a built-in agent, so you can instrument your code as we did above in the OpenTracing section.

You might ask: Is there a way I can skip instrumentation?

The answer is yes! :)

RisingStack is pleased to announce the @risingstack/jaeger-node npm package, which provides automatic instrumentation for Node.js core modules, the most popular database drivers (MongoDB, PostgreSQL, Redis, etc.), and web frameworks like express.



Automatic instrumentation for Node.js and npm libraries with Jaeger

The jaeger-node library is built around the Node.js feature called async_hooks, which makes it possible to track asynchronous operations inside the application efficiently and accurately.

However, while async_hooks is the future of debugging and monitoring Node.js asynchronous resources, it is still an experimental feature.

Which means: please do not use it in production yet.


Conclusion

There are new standards and tools like OpenTracing and Jaeger that can bring us the future of tracing, but we need to work together with open-source maintainers to make them widely adopted.

In the final episode of our Node.js at Scale series, we're discussing how you can build an API Gateway using Node.js.