I have been following distributed tracing technologies – Zipkin, OpenTracing, Jaeger, and others – for several years without deeply trialing any of them. Just prior to the holidays, we were fielding a number of those “why is this slow?” questions about an Express application, written in TypeScript, providing an API endpoint. The API fronts several data sources – MongoDB, InfluxDB, and Redis – and we run it all in containers, deployed and orchestrated with Kubernetes.

I decided that some quieter time over the holiday was the perfect opportunity to experiment and take a serious stab at implementing some form of tracing with external presentation, to shed some light on what was happening within the API endpoint, so we could have good conversations about the scope and structure of what’s being returned by the various API endpoints.

I wrote about Jaeger (and OpenTracing) in Kubernetes for Developers, which gave me some passing familiarity, so I chose to go that route again to see what we could learn. A new complication to the scenario was that we wrote our API server in TypeScript. While the concepts are well documented, there aren’t any detailed tutorials or examples showing how to enable tracing in a TypeScript/Express application. My goal with this article is to provide one of those detailed examples, which may serve as a roadmap (or highlight potholes) for others interested in doing the same.

The resources that I used to source and start the learning included the book I wrote (Kubernetes for Developers), and also:

https://www.jaegertracing.io – more marketing than detail, but provided a starting point for finding other resources and hinting at where to look

https://github.com/jaegertracing/jaeger-client-node – the source repository for the NodeJS jaeger client implementation, primarily the README (outside of when I needed to look at how something was implemented) for developer-level notes about the implementation details.

https://github.com/yurishkuro/opentracing-tutorial – the real source for a lot of the starting points, with concrete implementation examples, even if they didn’t always follow the same patterns I needed. Yuri Shkuro did some lovely work providing these examples. I didn’t hassle him while learning or using any of this, so any misinterpretation of what was expected is mine.

Getting a toe hold and seeing a trace

The first step in my journey was to get tracing enabled and see any single result. There are a lot more examples for Go than for Express/NodeJS, and adding TypeScript and getting the libraries to work seamlessly within the type-constrained interfaces wasn’t entirely obvious.

The client itself installs cleanly with npm or yarn, as does opentracing, so getting the libraries was as simple as:

npm install opentracing jaeger-client --save

The jaeger client for node is written using Flow, a different approach to adding type constraints on top of JavaScript that doesn’t translate directly to TypeScript. Fortunately, it implements OpenTracing, and OpenTracing offers a TypeScript type definition library, so it was possible to load that and use OpenTracing’s definitions with the jaeger client’s implementation.

In the end, I got this operational in TypeScript with:

import * as opentracing from 'opentracing';

From there, you can set up the overall tracing client to work with your code. The tutorials inspired some of this, but the gist of the setup is to define an initialization function that passes in the tracer configuration, including sampling and reporting of when traces are captured. For getting a toe-hold and just starting:

const initJaegerTracer = require('jaeger-client').initTracer;

function initTracer(serviceName: string) {
  const config = {
    serviceName: serviceName,
    sampler: {
      type: 'const',
      param: 1,
    },
    reporter: {
      logSpans: false, // when true, this logs whenever we send a span
    },
  };
  const options = {
    logger: {
      info: function logInfo(msg: string) {
        console.log('INFO ', msg);
      },
      error: function logError(msg: string) {
        console.log('ERROR ', msg);
      },
    },
  };
  return initJaegerTracer(config, options);
}

export const tracer = initTracer('my-service-api') as opentracing.Tracer;

I chose to use the ‘const’ sampler (meaning “trace everything”). I initially had logSpans set to true to see how it was working, but switched it to false as soon as I saw it was operational, since it generated a huge amount of additional console logging that cluttered my output.

The logger in this function looks like it was built so you could pass in any of a variety of logging implementations (an example of which might be Winston). For getting started, I just stuck with simple console.log statements.
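As a sketch of that pluggability (the capturing logger here is hypothetical, just to illustrate the shape that options.logger expects: anything with info and error methods):

```typescript
// The options.logger passed to initJaegerTracer only needs `info` and
// `error` methods, so anything matching this shape can be dropped in.
interface TracerLogger {
  info(msg: string): void;
  error(msg: string): void;
}

// Hypothetical example: a logger that collects messages into an array
// (a Winston instance would satisfy the same interface).
function makeCapturingLogger(sink: string[]): TracerLogger {
  return {
    info: (msg: string) => { sink.push(`INFO ${msg}`); },
    error: (msg: string) => { sink.push(`ERROR ${msg}`); },
  };
}
```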

Once I had the tracer set up, I could use it to create spans, annotate them, finish them, and let the jaeger-client library take care of getting them out of my code and to their external destination. This also implies there’s somewhere to capture and display this information, and the Jaeger getting-started documentation came in useful here.

The Jaeger project sets up and offers an all-in-one in-memory store and UI for collecting and visualizing traces that’s perfect for running in a Docker container right on your development machine.

I used the example directly, exposing all the ports (even though I only use a few) and it fired right up:

docker run -d --name jaeger \
  -e COLLECTOR_ZIPKIN_HTTP_PORT=9411 \
  -p 5775:5775/udp \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 5778:5778 \
  -p 16686:16686 \
  -p 14268:14268 \
  -p 9411:9411 \
  jaegertracing/all-in-one:1.8

Something to note (maybe obvious, but I’ll mention it): to kill off this background container, you use the command:

docker kill jaeger

and before you try to run it again, you may want to clean it up and remove it. You can do that with the command:

docker rm /jaeger

Once it’s up and running, you can access port 16686 on localhost (open http://localhost:16686) to see the Jaeger UI (it’s boring until you get traces into it).

With an endpoint ready to receive, you can start to create spans and send them in. Yuri Shkuro’s tutorial code (lesson 2) really cemented this piece for me, and it was relatively light work to translate it to TypeScript.

I started off by picking one express route that we’d already implemented and enabling it with a trace. Setting up a trace was pretty straightforward, and a pattern quickly emerged for how I’d repeat it, so it moved into a function to simplify the code in the route.

Creating a span is primarily a call to tracer.startSpan. You can create the current span stand-alone, or build it up with references to other spans: either another span reference directly, or a span context extracted from somewhere else. For express routes, there are handy helper utilities to look for and pull a tracing context out of the HTTP request headers, so the logic I settled on was to try to pull that context and, if it doesn’t exist, create a stand-alone span.

I’m still experimenting with what and how much to annotate onto spans; so far I’ve chosen to use the function names from the routes as the span names, and to annotate the controller for a bit of grouping.

export function createControllerSpan(controller: string, operation: string, headers: any) {
  let traceSpan: opentracing.Span;
  // NOTE: OpenTracing type definitions at
  // https://github.com/opentracing/opentracing-javascript/blob/master/src/tracer.ts
  const parentSpanContext = tracer.extract(opentracing.FORMAT_HTTP_HEADERS, headers);
  if (parentSpanContext) {
    traceSpan = tracer.startSpan(operation, {
      childOf: parentSpanContext,
      tags: {
        [opentracing.Tags.SPAN_KIND]: opentracing.Tags.SPAN_KIND_RPC_SERVER,
        [opentracing.Tags.COMPONENT]: controller
      }
    });
  } else {
    traceSpan = tracer.startSpan(operation, {
      tags: {
        [opentracing.Tags.SPAN_KIND]: opentracing.Tags.SPAN_KIND_RPC_SERVER,
        [opentracing.Tags.COMPONENT]: controller
      }
    });
  }
  return traceSpan;
}

With the span created, the other critical piece is to report when the span should be marked as completed. The finish() function on a span can be invoked directly, but it quickly became apparent that it was handy to also annotate the returned result code and whether it should be considered an error. This ends up being most useful in the UI, letting you filter down to the “interesting” traces.

I ended up also encapsulating that into a function:

export function finishSpanWithResult(span: opentracing.Span, status: number, errorTag?: boolean) {
  span.setTag(opentracing.Tags.HTTP_STATUS_CODE, status);
  if (errorTag) {
    span.setTag(opentracing.Tags.ERROR, true);
  }
  span.finish();
}

I’m not particularly thrilled with the errorTag?: boolean optional on the tail end of this function; I’m sure there are more aesthetically pleasing ways to handle it. I’m still prone to thinking in function signatures with the named parameters common in other languages (Python, for instance), so take my specific implementation with a grain of salt.
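For comparison, a version taking an options object reads closer to those named arguments. This is just a sketch with a pared-down span interface; the helper name and tag strings are stand-ins, not the opentracing constants:

```typescript
// Pared-down stand-in for opentracing.Span so this sketch runs alone.
interface SpanLike {
  setTag(key: string, value: unknown): void;
  finish(): void;
}

// Hypothetical alternative signature: the call site reads like named
// arguments, e.g. finishSpan(span, { status: 500, error: true })
function finishSpan(span: SpanLike, opts: { status: number; error?: boolean }) {
  span.setTag('http.status_code', opts.status);
  if (opts.error) {
    span.setTag('error', true);
  }
  span.finish();
}
```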

Between these two functions, I have the pieces to create and finish spans, which results in code within an express route that looks something akin to:

import { createControllerSpan, finishSpanWithResult } from '../helpers/tracing';
import * as opentracing from 'opentracing';

// ...

// within the express route function:
const traceSpan = createControllerSpan('myController', 'doSomethingCool', args.headers);
if (somethingNotFound) { // stand-in condition for a failed lookup
  finishSpanWithResult(traceSpan, 404);
  return res.status(404).send();
}
try {
  // do work here, resulting in a resultList object that we'll return:
  finishSpanWithResult(traceSpan, 200);
  return res.status(200).json(resultList);
} catch (error) {
  finishSpanWithResult(traceSpan, 500, true);
  console.log('error while listing things ', error);
  return res.status(500).send();
}

With this in place, the code’s previously existing tests (which leverage supertest to exercise the API endpoints) were generating traces just by running, which turned out to be immensely useful.

I’m fairly confident that the logic I put into my routes could have been encapsulated within an express middleware, and we may take that path in the future. We are also using Swagger 2.0 as a definition layer, and the swagger-express client tooling sets up each operationId as a middleware function, so it was not as convenient to try that with our particular setup.
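A middleware version might look something like the sketch below. The request/response types are pared down so the example stands alone (the real thing would use express’s types), and the span-starting callback stands in for a helper like createControllerSpan; treat it as a rough shape, not code we actually run:

```typescript
// Pared-down shapes so the sketch runs without express installed; the
// real middleware would use express.Request/Response/NextFunction.
interface ReqLike { headers: Record<string, string>; }
interface ResLike { statusCode: number; on(event: string, cb: () => void): void; }
interface SpanLike { setTag(key: string, value: unknown): void; finish(): void; }

// startSpan stands in for a helper like createControllerSpan.
function tracingMiddleware(
  startSpan: (operation: string, headers: Record<string, string>) => SpanLike,
  operation: string
) {
  return (req: ReqLike, res: ResLike, next: () => void) => {
    const span = startSpan(operation, req.headers);
    // finish the span (with the status code) when the response completes
    res.on('finish', () => {
      span.setTag('http.status_code', res.statusCode);
      if (res.statusCode >= 500) {
        span.setTag('error', true);
      }
      span.finish();
    });
    next();
  };
}
```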

Adding this code to every route was frankly pretty tedious, but I took advantage of the time to normalize each route to a more consistent structure. While it was a bit time consuming, it wasn’t hard, and the end result was structurally more consistent routes across all the controllers.

This level of implementation gets you a view of how long an API call took, but not the detail of why it took so long. As a necessary first step it was valuable in itself, but the overhead of Jaeger and OpenTracing for just this information is akin to swatting a fly with a flamethrower. The real value comes from extending the tracing so you can see what within the API call took time, and how long it took.

Datagram size problems

Just this level of tracing was enough to highlight a quirk worth calling out. While implementing and running the tests to make sure I didn’t break anything, I started seeing a repeated error message in the unit test console output, starting with EMSGSIZE. After some digging, it turned out that this error was caused by MacOS’s default datagram size being MUCH smaller than the jaeger-client library expected.

It boils down to how jaeger-client chose to transmit data while keeping the process of collecting and transmitting it as light as possible. The client collects traces and manages their transmission in the background using UDP, a “fire and forget” protocol whose packet is called a datagram, the maximum size of which is set by your operating system. The default on Linux (and most Unix OSes) is 65,536 bytes, but the built-in default on MacOS is 9,216 bytes. The error was jaeger-client reporting that it was unable to compose and send the traces, because it expected the larger datagram size.

The solution was fairly straightforward: tell the local OS to allow those datagrams to be quite a bit larger (matching what Linux does). On MacOS you can view the current size with the command:

sysctl net.inet.udp.maxdgram

and you can set it with:

sudo sysctl net.inet.udp.maxdgram=65536

Once I did that, the error messages disappeared, and I saw significantly more traces hitting Jaeger that I could dig through.

But what is it doing?

At this point, we had a nice visualization of API call timing, but I wanted to leverage the real win of tracing by adding sub-spans to break up the work that was happening. Our API code works with multiple external data resources, so those were the natural places to wrap with tracing.

We are using Mongoose and its schemas, as well as the Influx node client libraries, to access these remote resources. I’m sure with more work I could have made the wrapping less intrusive, but with a few additional helper methods I was able to quickly get 90% of the value and a view into what’s taking how much time.

First I made a method to initiate a span within a Mongoose schema method (we use the virtual methods within Mongoose to consolidate our data models in the code):

export function createSchemaSpan(schemaName: string, operation: string, parentSpan?: opentracing.Span) {
  if (parentSpan) {
    return tracer.startSpan(operation, {
      childOf: parentSpan,
      tags: {
        [opentracing.Tags.SPAN_KIND]: opentracing.Tags.SPAN_KIND_RPC_SERVER,
        [opentracing.Tags.COMPONENT]: schemaName
      }
    });
  } else {
    return tracer.startSpan(operation, {
      tags: {
        [opentracing.Tags.SPAN_KIND]: opentracing.Tags.SPAN_KIND_RPC_SERVER,
        [opentracing.Tags.COMPONENT]: schemaName
      }
    });
  }
}

The logic follows the same pattern as the express route span creation, but instead of taking headers apart looking for a span context, it accepts a parent span, which is much more readily available at the level of a Mongoose virtual method call. I then added an optional parentTraceSpan parameter to each of those function calls (which are typically invoked from the express routes), and used it with this function to set up a span for any work that happened, continuing the cascade down.
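The cascade ends up looking roughly like this sketch. The types and names here are stand-ins so the example runs on its own; the real versions live on our Mongoose schemas and use opentracing.Span and our tracer:

```typescript
// Stand-ins so the sketch runs on its own; the real code uses
// opentracing.Span/Tracer and a Mongoose virtual method.
interface SpanLike { finish(): void; }
interface TracerLike {
  startSpan(name: string, opts?: { childOf?: SpanLike }): SpanLike;
}

// A hypothetical model method: it accepts an optional parent span,
// opens a child span for its own work, and finishes it when done, so
// every level of the call stack continues the cascade.
async function recentReadings(tracer: TracerLike, parentSpan?: SpanLike): Promise<number[]> {
  const span = parentSpan
    ? tracer.startSpan('Site.recentReadings', { childOf: parentSpan })
    : tracer.startSpan('Site.recentReadings');
  const readings = [1, 2, 3]; // stand-in for the real data access
  span.finish();
  return readings;
}
```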

The other place getting MongoDB time was document queries. This got a little awkward with TypeScript, mostly because I did not feel confident about how generics work for parameterizing functions like this, so I took an uglier shortcut: down-casting the types and re-casting them on result, so that I could write a wrapper that consistently traced the call.

The end result is this function:

export async function traceMongoQuery(parentSpan: opentracing.Span, traceName: string, documentQuery: DocumentQuery<any, any>) {
  const traceSpan = tracer.startSpan(traceName, {
    childOf: parentSpan,
    tags: {
      [opentracing.Tags.SPAN_KIND]: opentracing.Tags.SPAN_KIND_RPC_SERVER,
      [opentracing.Tags.COMPONENT]: 'mongodb'
    }
  });
  const documentResult = await documentQuery;
  traceSpan.finish();
  return documentResult;
}

which ends up getting used in a pattern like:

const possibleSite = await traceMongoQuery(traceSpan, 'Site.findOne', Site.findOne({ name: siteId })) as ISiteModel;

It means that when I trace the query, I need to know what should come back out so I can cast it back into the right object model, which is a little stupid, but it kept everything working. Certainly not my proudest moment, but it’s functional.

The other data source we use is InfluxDB, and a separate method went into wrapping those queries, which had a less heavily typed setup than what we had with MongoDB and Mongoose in TypeScript.

The function I used to wrap the influx queries:

// return type for an influx query is Promise<influx.IResults<{}>>
export async function tracedInfluxQuery(
  query: string,
  options?: influx.IQueryOptions,
  parentSpan?: opentracing.Span,
  spanName: string = 'influx-query'
): Promise<influx.IResults<{}>> {
  let traceSpan: opentracing.Span;
  if (parentSpan) {
    traceSpan = tracer.startSpan(spanName, {
      childOf: parentSpan,
      tags: {
        [opentracing.Tags.SPAN_KIND]: opentracing.Tags.SPAN_KIND_RPC_SERVER,
        'db.type': 'influxdb',
        'db.statement': query
      }
    });
  } else {
    traceSpan = tracer.startSpan(spanName, {
      tags: {
        [opentracing.Tags.SPAN_KIND]: opentracing.Tags.SPAN_KIND_RPC_SERVER,
        'db.type': 'influxdb',
        'db.statement': query
      }
    });
  }
  const client = influxClient();
  const result = await client.query(query, options);
  traceSpan.finish();
  return result;
}

You can see this allowed me to embed the specific query into the span. It’s a lot of metadata baggage to add to a span, and while I think it’ll be useful, beyond being cool to see it hasn’t paid off yet. As we promote this experiment into active systems with real usage, the tradeoff between consuming more space and bandwidth and the value gained will hopefully become clearer. I might get just as much value from a shorter tag, or a more meaningful grouping of span names, instead of the explicit query itself.
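If the full statement turns out to be too much baggage, one cheap middle ground would be truncating the query before tagging it. A hypothetical helper (the 120-character cap is an arbitrary placeholder):

```typescript
// Hypothetical: cap the db.statement tag at a fixed length so spans
// stay small while keeping enough of the query to recognize it.
function truncateForTag(statement: string, maxLength: number = 120): string {
  if (statement.length <= maxLength) {
    return statement;
  }
  return statement.slice(0, maxLength - 3) + '...';
}
```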

Much of the time-consuming lifting we do is with time series data stored in InfluxDB, so while I wanted the MongoDB calls included for context, I expected more detail to be useful where sub-spans were doing something with Influx, and in the end, that was a pretty good assumption.

There were a few places in the code where I manually created a sub-span and wrapped some logic with it. We use async/await throughout our code to keep it easier to understand and reason about, so I was tempted to wrap the promise and make that a nice boundary for a span, but in the end a few strategic additions around heavier code in the routes or elsewhere did what I wanted, and was a bit simpler.

Bringing it all together and seeing results

In the end, nearly all functions called from the API endpoints accept a span as an optional parameter, and all of them create spans whether or not a parent is present, using the logic above. With the tests actively exercising the code at various levels, just running our local tests generated a bunch of traces, and I found quite a bit of value in checking my progress by running the test suite and viewing the traces reported into Jaeger on my desktop.

Some of the heavier lifting in the code is post-processing data we get back from Influx, some of which can be quite extensive. When we wrapped some of what appeared to be simple functional code, it quickly highlighted how expensive getting that data (with our chosen algorithm) was, and it exposed a few O(n^2) methods that were a lot more obvious in hindsight than in reading the code.

I also found that it was relatively easy to forget to close off a span, and when you do, that span simply never shows up in Jaeger. So I quickly found myself checking that the sub-spans I expected actually appeared, as a means of knowing whether I’d wrapped the logic correctly.

In the end, getting to this point consumed probably 16 to 20 hours, mostly repetitious work where you could easily see progress as you went. Even at this point, with just the local test suite driving the effort, there was a clear and valuable pay-off that starts to answer “why are some things taking a while?”

Deploying to staging

The next step was to see it operating with some closer-than-test-suite real data. The infrastructure for a storage system for traces was more than I wanted to tackle just to start getting data back; Jaeger does a good job of setting it up, but there are a lot of moving parts for a “downtime holiday project”.

We run all our code in a Kubernetes cluster, and since I already had an ephemeral memory-based container available, I made a Kubernetes service of it. JaegerTracing has some Kubernetes templates for those inclined to explore, and they helped me understand what will be needed in the future for a more solid, use-it-constantly implementation of the service.

I ended up forking what they had available and using it to make my own “ephemeral jaeger” using just in-memory storage.

#
# Copyright 2017-2018 The Jaeger Authors
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
# in compliance with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software distributed under the License
# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
# or implied. See the License for the specific language governing permissions and limitations under
# the License.
#
apiVersion: v1
kind: List
items:
- apiVersion: extensions/v1beta1
  kind: Deployment
  metadata:
    name: jaeger-deployment
    labels:
      app: jaeger
      jaeger-infra: jaeger-deployment
  spec:
    replicas: 1
    strategy:
      type: Recreate
    template:
      metadata:
        labels:
          app: jaeger
          jaeger-infra: jaeger-pod
        annotations:
          prometheus.io/scrape: "true"
          prometheus.io/port: "16686"
      spec:
        containers:
        - env:
          - name: COLLECTOR_ZIPKIN_HTTP_PORT
            value: "9411"
          image: jaegertracing/all-in-one
          # all-in-one Dockerfile: https://github.com/jaegertracing/jaeger/blob/master/cmd/all-in-one/Dockerfile
          # CMD ["--sampling.strategies-file=/etc/jaeger/sampling_strategies.json"]
          args: ["--sampling.strategies-file=/etc/jaeger/sampling_strategies.json", "--memory.max-traces=20000"]
          # all-in-one image is an in-memory image, with no default limit on how much memory will be
          # consumed capturing traces, so this puts a little bit of a limit on it.
          name: jaeger
          ports:
          - containerPort: 5775
            protocol: UDP
          - containerPort: 6831
            protocol: UDP
          - containerPort: 6832
            protocol: UDP
          - containerPort: 5778
            protocol: TCP
          - containerPort: 16686
            protocol: TCP
          - containerPort: 9411
            protocol: TCP
          readinessProbe:
            httpGet:
              path: "/"
              port: 14269
            initialDelaySeconds: 5
- apiVersion: v1
  kind: Service
  metadata:
    name: jaeger-query
    labels:
      app: jaeger
      jaeger-infra: jaeger-service
  spec:
    ports:
    - name: query-http
      port: 80
      protocol: TCP
      targetPort: 16686
    selector:
      jaeger-infra: jaeger-pod
    type: NodePort
- apiVersion: v1
  kind: Service
  metadata:
    name: jaeger-collector
    labels:
      app: jaeger
      jaeger-infra: collector-service
  spec:
    ports:
    - name: jaeger-collector-tchannel
      port: 14267
      protocol: TCP
      targetPort: 14267
    - name: jaeger-collector-http
      port: 14268
      protocol: TCP
      targetPort: 14268
    - name: jaeger-collector-zipkin
      port: 9411
      protocol: TCP
      targetPort: 9411
    selector:
      jaeger-infra: jaeger-pod
    type: ClusterIP
- apiVersion: v1
  kind: Service
  metadata:
    name: jaeger-agent
    labels:
      app: jaeger
      jaeger-infra: agent-service
  spec:
    ports:
    - name: agent-zipkin-thrift
      port: 5775
      protocol: UDP
      targetPort: 5775
    - name: agent-compact
      port: 6831
      protocol: UDP
      targetPort: 6831
    - name: agent-binary
      port: 6832
      protocol: UDP
      targetPort: 6832
    - name: agent-configs
      port: 5778
      protocol: TCP
      targetPort: 5778
    clusterIP: None
    selector:
      jaeger-infra: jaeger-pod
- apiVersion: v1
  kind: Service
  metadata:
    name: zipkin
    labels:
      app: jaeger
      jaeger-infra: zipkin-service
  spec:
    ports:
    - name: jaeger-collector-zipkin
      port: 9411
      protocol: TCP
      targetPort: 9411
    clusterIP: None
    selector:
      jaeger-infra: jaeger-pod

Most of this detail is stock from their examples, and offers a bunch of functionality that I’m not even touching (alternate transports, and drop-in support for accepting Zipkin traces).

Buried in this detail is an option I found on the all-in-one container to limit the number of traces stored in memory; since I wasn’t putting many other limits on this system, I wanted at least some constraining factor. You can find it in the args of the pod spec:

args: ["--sampling.strategies-file=/etc/jaeger/sampling_strategies.json", "--memory.max-traces=20000"]

20,000 traces may end up being way too small, but for an ephemeral service implementation, I thought it made a reasonable starting point.

The other trick needed to make this all work in Kubernetes was adding something to listen for the UDP data being sent, running on the same Pod as the API code, and forwarding it to the ephemeral service I just set up. I had this part in notes from my work on Kubernetes for Developers, so I picked it back up and used it again almost verbatim. The end result is specifying two containers in the template we use for the API code (this pattern is called a “sidecar”), so the additions end up looking like:

{{- if .Values.tracing.enabled }}
- name: jaeger-agent
  image: "{{ .Values.tracing.image }}:{{ .Values.tracing.tag }}"
  ports:
  - containerPort: 5775
    protocol: UDP
  - containerPort: 5778
  - containerPort: 6831
    protocol: UDP
  - containerPort: 6832
    protocol: UDP
  command:
  - "/go/bin/agent-linux"
  - "--collector.host-port=jaeger-collector:14267"
{{- end }}

We use Helm for deployment into a cluster, and this is a snippet from the deployment template in our Helm chart. The default values we’re using for the agent are:

image: "jaegertracing/jaeger-agent"
tag: "1.8"

With this in place, we deployed the tracing basics into our staging environment, where we have more realistic integrations and data flowing; the end result is a tremendously more valuable view. We access the Jaeger console with a port-forward right now, and it quickly opened a window into what was taking time behind the API calls, giving us a base for some really good conversations.

Where next?

This whole holiday experiment has been a great success so far, but there’s more work to do to use it seriously. Rolling the ephemeral tracing infrastructure into our production environment and switching from the constant trace-everything sampler to a probabilistic one is likely one of the first tasks.
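A sketch of what that sampler change might look like in the config object passed to initJaegerTracer (the 10% rate here is an arbitrary placeholder, not a recommendation):

```typescript
// Hypothetical production config: the 'probabilistic' sampler traces a
// fraction of requests, instead of everything ('const' with param 1).
const productionTracerConfig = {
  serviceName: 'my-service-api',
  sampler: {
    type: 'probabilistic',
    param: 0.1, // trace roughly 10% of requests
  },
  reporter: {
    logSpans: false,
  },
};
```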

The API in question is the back-end not directly for our customers, but for an Angular web application in a browser. How we structure those calls from Angular, when we make them, and how much we try to do in each call is where I expect to get some immediate value, along with being able to answer the questions behind “why is it taking so long for this to render?”

I have this crazy idea about viewing portions of the single-page application (what gets displayed) as a sort of costed effort. Now that we have “time taken” and visibility of where it goes, we can look at what we load, when we load it, and “how much it costs” from a time and user-responsiveness point of view. Maybe we could annotate a wireframe with “how expensive” each data visualization piece is, even if it’s done manually on a whiteboard, using Jaeger data about the “time-expense” of the API calls needed as the source for that information.

It’s also quite possible that I’m a little too “budget concept” happy, and the majority of the value will be for the developers, designers, and product managers in our company: a concrete way to see and address the desire to make really outstanding web-based user interfaces.

When we have it loaded into our production environment, getting familiar with Jaeger’s filtering and UI may also help us identify problems quite a bit more effectively, and I’m actively looking forward to seeing this bolster the data we’re already capturing with our production Prometheus monitoring.

And finally, I’m also interested to see where Grafana 6.0 goes, with its lovely overlays for metrics capture, now building into logging, and perhaps linking to tracing in the future.

I’d love to be able to add what’s happening in the Angular application and cascade the tracing from there. I don’t know of a way (outside of a lot more service infrastructure, and getting a LOT more familiar with jaeger-client to make a variant that works from a browser) to capture and forward those traces. I’m honestly not sure I’d always want it in place either, whereas I do want it always there and on at the API layer, at least while it isn’t noticeably impacting the service calls. There might be something interesting we can do for a single browser with a special setup that would allow us to profile from the user interface all the way back to the services; that’s another interesting thought experiment for another down-time.