In this post, we are going to drill down on observability, a core component of every engineering and ops team moving into production. We’ll discuss the importance of each pillar: metrics, logging, and distributed tracing, and provide some best practices and real-world examples from serverless.

Metrics in observability

The most straightforward and traditional form of monitoring is to simply look at metrics. Metrics, and especially trends in those metrics, can reveal fundamental statistics about our production systems, for example:

CPU or memory consumption spikes.

Traffic and request trends.

Latencies across services that we use.

Metrics from Grafana.

Google’s SRE book states that the four golden signals for monitoring distributed systems are latency, traffic, errors, and saturation. Although this sounds simple, monitoring them correctly requires some expertise and time. The process includes:

Collecting all metrics from environments, applications, resources, and services. This can include, for example, our K8s cluster, cloud resources, Java and Node.js applications, and our Redis cluster. Each entity requires different treatment to record its metrics.

Shipping the metrics to a unified platform that needs to handle the right scale, aggregate all metrics, and present the right data (for example, percentiles instead of averages).

Ultimately, we need to build a dashboard for every application or environment we have and, most importantly, define the right alerts.
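To see why percentiles beat averages for presenting latency data, here is a minimal sketch in Python (the latency samples are made up):

```python
import statistics

# Hypothetical latency samples (ms) for one endpoint; one slow outlier.
latencies_ms = [11, 12, 12, 13, 13, 14, 15, 950]

avg = statistics.mean(latencies_ms)       # skewed badly by the single outlier
median = statistics.median(latencies_ms)  # what a typical request actually saw
p95 = statistics.quantiles(latencies_ms, n=100)[94]  # 95th-percentile cut point

print(avg, median, p95)  # average is ~10x the median here
```

A dashboard showing only the average would report ~130ms while almost every request completed in ~13ms; the p95 surfaces the outlier without distorting the typical case.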

With serverless, CPU and memory become less relevant; don’t settle for basic charts of invocations and errors. Go further and observe every operation your function performs. Any API that you call should be monitored as well.
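One lightweight way to observe every outbound operation is to wrap each call with a timer that reports its duration. A sketch, where the decorator, the metric name, and the sink are our own illustrative names rather than any library’s API:

```python
import functools
import time

def timed(metric_name, sink):
    """Report the duration of every call as sink(metric_name, milliseconds)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # Emit the duration even when the call raises.
                sink(metric_name, (time.perf_counter() - start) * 1000)
        return wrapper
    return decorator

# In production the sink would ship to your metrics platform;
# here we just collect in memory.
collected = []

@timed("payments_api.latency_ms", lambda name, ms: collected.append((name, ms)))
def call_payments_api():
    time.sleep(0.01)  # stand-in for the real HTTP call
    return "ok"
```

Applying this to every external API call gives you per-dependency latency metrics without touching the call sites themselves.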

Logging in observability

Metrics can only tell us ‘good or bad.’ They don’t provide any information, or a way to explore, why an application isn’t working properly.

To troubleshoot any kind of problem, we need to understand the flow of our code or the services we use. To accomplish that, we print out logs (to a file, socket, or service) that contain everything from free text to detailed exceptions.

Logs in Elastic.

Every engineer is familiar with the scenario of troubleshooting a problem or a bug and the frustration of hunting down the right log. This happens due to some pitfalls and inherent issues with logging:

They are generated manually. If you haven’t logged something, it won’t be there (and then you’ll add it, deploy your code, and wait for the problem to happen again 🤦‍♂).

Usually, they have no context. That means that once you’ve found the log you’re looking for, you still need to search for all the related logs that your code or service emitted during the same event.

There are just too many logs from so many services that it’s really hard to navigate (or sift) through them efficiently.

In order to get the most out of logging:

Add metadata to your log lines; for example, service/function name, stage, request ID, etc.

Automate the process of logging events in your code using instrumentation. We will discuss that in the tracing section below.

Make sure to index logs the right way in your logging service, so you’ll be able to leverage the analytics of such tools. Having analytics on top of logs (using metadata and dimensions) will help you understand much more complex trends in your application.

Record custom metrics. This fits with the previous pillar, but it can help you discover business KPIs; for example, the number of users who signed up in the last week.
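A minimal way to attach such metadata is a JSON formatter on Python’s standard logging module, so every line becomes a searchable, indexable record (the field names `service` and `request_id` are our own convention, not a standard):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line with searchable metadata fields."""
    def format(self, record):
        entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Metadata (hypothetical field names) that a log platform can index:
            "service": getattr(record, "service", None),
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# `extra` attaches the metadata to the record without changing call sites much.
logger.info("user signed up", extra={"service": "checkout", "request_id": "req-123"})
```

With the request ID on every line, finding “all related logs for this event” becomes a single indexed query instead of a manual hunt.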

Distributed tracing in observability

Tracing is the emerging pillar of observability, and it plays a major role alongside metrics and logging for microservices and serverless. The purpose of tracing is to collect data about an operation in a way that lets us follow it across different services.

A trace’s timeline taken from Jaeger.

In a modern application running microservices or serverless, we have to put much more focus on the distributed part of tracing. The most popular standard for such tracing is OpenTracing (or the newer OpenTelemetry). Distributed tracing describes a framework for collecting data about events (for example, for a DB query we collect the hostname, table name, duration, operation, etc.); these records are called spans and carry context. It also describes a way to inject and extract “trace IDs” between your services.
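The inject/extract idea can be sketched in a few lines. The `X-Trace-Id` header name here is our own placeholder; real tracers define their own propagation format (for example, the W3C Trace Context standard uses a `traceparent` header):

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name for this sketch

def inject(headers, trace_id):
    """Attach the current trace ID to an outgoing request's headers."""
    headers[TRACE_HEADER] = trace_id
    return headers

def extract(headers):
    """Read the trace ID on the receiving service, or start a new trace."""
    return headers.get(TRACE_HEADER) or uuid.uuid4().hex

# Service A injects before calling service B...
outgoing = inject({}, "trace-abc")
# ...and service B extracts, so both services tag their spans with "trace-abc".
```

Because every service propagates the same ID, a backend can later stitch all spans of one request into a single trace.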

The preferable way to efficiently capture traces across your code is to instrument it, so that tracing isn’t done manually for every call. Instrumentation wraps certain calls; for example, every time you make an HTTP call, it is routed through a middleware that records the information to the trace.

Since traces are captured in a structured way, we can ask much more meaningful questions than with logs; for example, you can find all events that are “insert” operations, that took more than 300ms, and that are tagged with a specific customer ID.

Trace captures data in a structured form.
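Because each span is a structured record, that example query is just a filter. A sketch over dict-shaped spans (the field names and values are illustrative, not any tracer’s schema):

```python
# Hypothetical spans, as a tracing backend might store them.
spans = [
    {"operation": "insert", "duration_ms": 420, "tags": {"customer_id": "c-42"}},
    {"operation": "select", "duration_ms": 15,  "tags": {"customer_id": "c-42"}},
    {"operation": "insert", "duration_ms": 90,  "tags": {"customer_id": "c-7"}},
]

# "All insert operations over 300ms for customer c-42":
slow_inserts = [
    s for s in spans
    if s["operation"] == "insert"
    and s["duration_ms"] > 300
    and s["tags"].get("customer_id") == "c-42"
]
```

The same question asked of free-text logs would require fragile string matching; structure makes it a precise query.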

There are a few key things to keep in mind:

Instrumenting and tracing your applications is a long process that requires maintenance over time. There’s no quick win if you choose to implement it on your own.

We discussed only trace collection above. Next comes the step of shipping the traces to some service; Jaeger is probably the most popular service for viewing and searching traces.

In order to get the most out of tracing:

Enrich traces with tags. Tags will allow you to pinpoint events in your complex system and analyze them as a dimension; for example, how many times userId=X hit a specific event and how long it took. Good tags can be a user ID, item ID, event type, or anything that is specific to your system.

Traces can serve as a core component for troubleshooting since they bring context to logs. To that end, consider recording payloads in the traces. For example, for every call to a DB, add the query; and for every HTTP call, add the headers and the body of the request/response.
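Recording a payload can be as simple as attaching it to the span as a tag. A sketch with illustrative field names (OpenTelemetry’s semantic conventions use a similar `db.statement` attribute for query text):

```python
def db_query_span(query, duration_ms):
    """Build a span record that carries the query text for later troubleshooting."""
    return {
        "operation": "db.query",
        "duration_ms": duration_ms,
        "tags": {"db.statement": query},  # the payload, searchable later
    }

# Each DB call produces a span that keeps the exact query it ran.
span = db_query_span("SELECT id FROM users WHERE email = %s", 12.5)
```

When a slow or failing request surfaces, the span already contains the query that caused it, so there is no need to reproduce the problem just to see the payload.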

Looking at tons of logs or charts without any context can be tough. By using traces you can visualize complex service maps and transactions across your system.

Visualizing traces and payloads is a powerful troubleshooting tool (Epsagon).

Summary

Observability plays an important part in every modern application. It requires a lot of planning, heavy lifting, and maintenance to apply best practices. Bringing the pillars together, rather than scattering them across separate tools, can have a powerful effect on the productivity of engineering teams, empowering them to collaborate. This plays a big role when it comes to choosing a tool that consolidates everything.

In addition, it is crucial to automate your processes so they don’t impact the day-to-day workflows of engineering. Choosing a managed solution can bring a lot of benefits, just as choosing a database, message queue, or web server as a service from your cloud vendor does.

At Epsagon, we are building a state-of-the-art tracing and monitoring solution tailored for serverless and microservices. If you’re interested in learning more, just drop us a message.