As the influential business thinker Peter Drucker put it, “What gets measured gets improved.” Collecting high-quality data is essential to finding and fixing problems when they arise. As the software industry has migrated from building monolithic applications to microservices and other forms of distributed computing, traditional monitoring methods have become increasingly difficult to apply. Instead of one single application that receives requests, manages data, and presents that data to users, these tasks are now handled by individual, discrete services. This means that when a problem occurs, we need to find all the actors involved and understand what they do before we can start piecing together what went wrong.

Luckily, over time, a number of practices have emerged that make it possible to monitor not only what an application does but also the resources it consumes. Here we’ll look at five different approaches to monitoring distributed applications, discuss their key advantages and disadvantages, and talk about how they can help you ensure that your software stays up and running.

Traditional Logging Methods

Let’s first look at two methods currently used for monitoring and debugging applications.

Method 1: Monitoring Individual Services and Applications

On the surface, it should be possible to continue using the same methods for troubleshooting a microservice as we would a monolithic application. After all, a microservice is simply a smaller version of a monolithic app, designed to handle a specific task or business process.

When a service or application encounters a runtime error, developers will typically first consult the stack trace it generates. In compiled languages, such as Java, the trace is usually written to an error log; in interpreted languages, such as Python, it is printed directly to the console. Either way, the stack trace indicates where and when the problem occurred. If there is some form of application log, developers will check this to see the requests that the application received, when they occurred, and when the application responded to them.
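As a quick illustration, here is a minimal, hypothetical Python script with a runtime bug. The traceback Python produces walks the call chain from the entry point down to the failing line, which is exactly the information a developer starts from:

```python
import traceback

def get_discount(price, percent):
    # Bug: a discount of zero percent leads to division by zero
    return price / (percent / 100)

def handle_request(price):
    return get_discount(price, 0)

try:
    handle_request(100)
except ZeroDivisionError:
    # traceback.format_exc() returns the same text the interpreter would
    # print uncaught: the call chain from handle_request down to the
    # failing line in get_discount, plus the exception message
    print(traceback.format_exc())
```

The printed trace names each function in the chain (`handle_request`, then `get_discount`) along with its file and line number, ending with the exception type and message.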

In order to get a complete picture of any given issue, it is important to also look at the system logs of the host device. These logs cannot directly identify a specific issue, but they can provide important background information. For example, a higher than expected level of system resource usage can indicate that a service has a memory leak, while a process trying to access restricted filesystem locations could indicate a malware infection.

Relying on a service’s application and system log data has a number of advantages. First, most software already has some form of built-in logging, and if it doesn’t, adding a logging mechanism is relatively simple. Once the data has been captured, many tools are available to read, process, and make sense of it. However, this approach only shows us what is happening within a single service. In modern distributed systems, a single request is handled by multiple services, so focusing on one individual service only surfaces problems within that process and ignores their impact on the wider system.

Method 2: Log Aggregation

If looking at the logs of a single service is no longer enough to locate and diagnose problems, the answer should be to combine the logs of as many services as possible. Log aggregation achieves this by collecting logging data from multiple sources and aggregating the results. Agents installed onto the relevant hosts collect the logs and then stream this data to a server for processing. Log aggregation is performed by a number of existing open-source and third-party tools and services, many of which are based on the Elasticsearch, Logstash, and Kibana (ELK) stack. These tools provide powerful search mechanisms that enable you to match patterns in the recorded data.
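To make logs easy to aggregate, services typically emit structured records that shippers and ELK-style pipelines can parse without fragile pattern matching. As a minimal sketch using only Python’s standard library (the service name `checkout` is a hypothetical example), one JSON object per line might look like this:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line, the shape
    most log shippers and aggregation pipelines expect."""
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": "checkout",  # hypothetical service name
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order received")
```

Because every service emits the same fields, the aggregation server can index and search them uniformly, regardless of which service produced the line.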

The disadvantage of this approach is that each log entry still describes only an individual service, so the aggregated data lacks the contextual information needed to show a problem’s wider impact. Due to storage restrictions, the log files themselves may not be preserved for as long as an organization’s retention policies prescribe, which means the logging data required to spot long-term trends will not be available. Many cloud-based services offer unlimited log storage, enabling you to catch such long-term trends, but using a cloud-based solution can be expensive over the long term.

Understanding and Implementing Distributed Tracing

The two methods described in the previous section may seem different, but in practice, there is little difference between collecting data from a single service and collecting it from many. This is because what interests us is not the individual or collective problems of the services themselves, but how those problems affect a single request as it moves through your system.

Distributed tracing is a method used to profile and monitor microservice-based architectures. This approach follows the progress of a single request from its origin to its destination, across multiple systems and domains, and takes into account all participants and processes. In a serverless environment, such as AWS Lambda, distributed tracing can be used to capture and forward correlation IDs through different event sources. As a result, it is better suited than traditional methods to locating application failures and improving performance in a distributed, cloud-based system. One of the biggest advantages of distributed tracing is that it lets you see what’s happening in your serverless applications.
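The core mechanism behind forwarding correlation IDs is simple: each service reuses the ID it received, or mints one if it is the first hop. A minimal sketch (the header name `X-Correlation-ID` is a common convention, not a standard, and is an assumption here):

```python
import uuid

CORRELATION_HEADER = "X-Correlation-ID"  # assumed header name; conventions vary

def ensure_correlation_id(headers):
    """Reuse the caller's correlation ID if present, otherwise mint one.
    Every log line and downstream call then carries the same ID, so a
    single request can be followed across services."""
    headers = dict(headers)
    if CORRELATION_HEADER not in headers:
        headers[CORRELATION_HEADER] = str(uuid.uuid4())
    return headers

# First hop: no ID yet, so one is generated and forwarded downstream
outgoing = ensure_correlation_id({})

# Later hop: the incoming ID is preserved unchanged
passthrough = ensure_correlation_id({"X-Correlation-ID": "abc-123"})
```

Searching the aggregated logs for one correlation ID then reconstructs the full path of that request through the system.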

Let’s now look at three different ways to take advantage of distributed tracing.

Method 1: Do It Yourself

In this approach, you build your own internal tracing system from scratch. The system tracks all payloads, uniquely identifies messages and requests, and lets you follow requests through your system. By taking this route, you can repurpose your existing tools by writing code that integrates them with your tracing solution. The main advantage here is that you can use your existing infrastructure, knowledge, and skillsets, and over time the system can be further customized to fit your current and future needs.
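At its simplest, a homegrown system revolves around spans: timed units of work that share a trace ID so they can be stitched back into one request. The following is a bare-bones sketch of that idea, not a production design; a real DIY system would also propagate the trace ID across process boundaries and ship finished spans to a collector:

```python
import time
import uuid

class Span:
    """A minimal span: unique IDs plus wall-clock timing, used as a
    context manager around a unit of work."""
    def __init__(self, name, trace_id=None, parent_id=None):
        self.name = name
        self.trace_id = trace_id or uuid.uuid4().hex  # shared per request
        self.span_id = uuid.uuid4().hex               # unique per operation
        self.parent_id = parent_id
        self.start = self.end = None

    def __enter__(self):
        self.start = time.monotonic()
        return self

    def __exit__(self, *exc):
        self.end = time.monotonic()
        return False  # never swallow exceptions

with Span("handle_request") as root:
    # Child spans share the trace_id, linking them to the same request
    with Span("query_db", trace_id=root.trace_id,
              parent_id=root.span_id) as child:
        pass  # the traced work would happen here
```

Even this toy version shows how much the DIY route leaves to you: ID propagation, storage, sampling, and visualization all still need to be built.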

Although the DIY approach may seem like a good idea, like building your own cryptographic software, it just isn’t worth the time and effort. Most organizations simply don’t have enough time, money, or experience to pull it off.

Method 2: Use Open Frameworks

Our second solution is similar to the first, with one major difference: you still write a significant amount of code, but you use an existing open, distributed tracing framework to do the heavy lifting. OpenTracing and OpenCensus are two examples of popular open frameworks; the two projects have since merged to form OpenTelemetry.

Although better than the previous one, this approach still suffers from many of the same issues. First, it relies on you to do the necessary coding and integration work. In addition, you have to take on many of the issues associated with open-source projects. Your chosen framework may have an active community and decent documentation, but when you run into an issue, many times you’ll have to solve it on your own. The following sample Python code shows how you could integrate an open framework with your application code.

import logging
import sys

import pika
from opencensus.trace.tracer import Tracer
from opencensus.ext.zipkin.trace_exporter import ZipkinExporter
from opencensus.trace.samplers import always_on

ze = ZipkinExporter(service_name="dr-test",
                    host_name='localhost',
                    port=9411,
                    endpoint='/api/v2/spans')
tracer = Tracer(exporter=ze, sampler=always_on.AlwaysOnSampler())

def main():
    connection = pika.BlockingConnection(
        pika.ConnectionParameters(host='localhost'))
    channel = connection.channel()
    channel.queue_declare(queue='task_queue', durable=True)
    with tracer.span(name="main"):
        message = ''.join(sys.argv[1:])
        channel.basic_publish(exchange='',
                              routing_key='task_queue',
                              body=message,
                              properties=pika.BasicProperties(delivery_mode=2))
        logging.info("Sent " + message)
    connection.close()

if __name__ == '__main__':
    main()

Method 3: Automated Distributed Tracing Solution

Another alternative is to use a cloud-based service. Not only does this approach provide an overall better solution than building your own, but it also has a number of additional benefits. First off, it can automatically monitor any requests generated by your software and track them across multiple systems. This means that at different stages of the request’s path, it can send notifications to alert you to problems or indicate the request’s progress.

The results that such a system produces are far more reliable and consistent than anything you could hope to build yourself. Like the tools you would otherwise build, these solutions are cross-platform and support multiple development stacks and languages. Any data recorded by the system can also be viewed, analyzed, and presented in a number of visual formats and charts.

Conclusion: Finding the Best Solution

Monitoring software has always been challenging, but trying to monitor applications in distributed environments makes an already hard problem that much harder.

At the start of this article, we looked at the different factors that make traditional logging and monitoring solutions unsuitable for modern microservices running in distributed environments. To understand this issue in depth, we looked at the problems associated with using individual application and system logs, and at the benefits and disadvantages of log aggregation. Next, we looked at the advantages of distributed tracing and investigated three possible approaches to implementing and deploying it.

Clearly, using an automated distributed tracing solution is the preferred choice over the other two routes: DIY and an open framework. This is where Epsagon can help. Epsagon provides everything you need to perform automated distributed tracing across major cloud providers without writing a single line of code. It can handle synchronous events, asynchronous events, and message queues, such as RabbitMQ. Plus, it supports both server and serverless environments, including AWS Lambda and AWS Fargate.

Not only does this approach have all the advantages of the second, but it also adds numerous other benefits, such as cross-platform support, ease of use, and analytics. Following the path of automated distributed tracing with a tool such as Epsagon to monitor your distributed apps will deliver better results at a lower cost.

Interested in automated distributed tracing? Request a 1:1 demo of Epsagon.
