Despite a gap of more than fifty years and numerous changes in the computing landscape, the computer scientists who worked on NASA’s Apollo program and today’s developers have faced two common challenges. First, how can you pass messages between entities operating in a harsh and unpredictable environment? Both groups solved this problem by making remote entities as autonomous as possible and by passing messages using asynchronous protocols. Second, how can you monitor mission-critical software operating in these environments? Here, modern developers have been able to take advantage of newer techniques, such as distributed tracing.

One key issue developers face when trying to monitor applications in an asynchronous environment is how to ensure that a message or event triggers a response in a target process or service. A common solution to this issue is to create direct connections between processes. Unfortunately, this approach effectively turns an asynchronous process into a synchronous one.
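The difference between a direct connection and an asynchronous hand-off can be sketched in a few lines of Python. This is only an in-memory illustration: `send_direct`, `send_queued`, and `receiver` are hypothetical names, and a `queue.Queue` stands in for a real message broker.

```python
import queue

events = []  # records the order in which things happen


def receiver(message):
    """Stand-in for the target service."""
    events.append(f"handled {message}")
    return "ack"


def send_direct(handler, message):
    # Direct connection: the sender runs the receiver itself and
    # cannot continue until the receiver has finished.
    return handler(message)


def send_queued(mailbox, message):
    # Asynchronous hand-off: the sender only enqueues the message
    # and returns immediately.
    mailbox.put(message)


# Synchronous path: the message is handled before the sender moves on.
send_direct(receiver, "m1")
events.append("sender continues after m1")

# Asynchronous path: the sender moves on first, the receiver drains later.
mailbox = queue.Queue()
send_queued(mailbox, "m2")
events.append("sender continues after m2")
while not mailbox.empty():
    receiver(mailbox.get())
```

In the queued path the sender never learns whether the message was handled, which is exactly the gap that monitoring has to close.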

In this article, we will explore the use of distributed tracing in an asynchronous environment using three common scenarios. After highlighting the problems involved in each scenario, we will explain how you can use distributed tracing to overcome them.

Building Asynchronous Applications Without Distributed Tracing

Let’s start by looking at three scenarios that try to ensure that a message or event triggers a response without distributed tracing.

Scenario 1: Serverless Computing

Serverless computing platforms, such as AWS Lambda, provide Functions as a Service (FaaS). FaaS lets you replace servers by writing functions that respond to specific events. Let’s examine how you could use an FaaS approach to build an asynchronous application.

Suppose you want to build a web service that waits for a specific request, processes the request, and then sends the results to another service. Let’s say that the service reads streaming data from a source and creates an event for another process to use. To build this type of scenario, you need two serverless functions: one to send the original request and another to receive the processed data. AWS lets you build both the sending and receiving functions using its Lambda service.

To collect the data streams between the two functions, you can use Amazon Kinesis. Kinesis lets you collect large streams of data and process records in real time, so you can transform data, generate alerts, and send messages as records arrive. Using the AWS Lambda console, you can create this type of application from an existing blueprint designed for this scenario.

Once you’ve selected your blueprint, all you need to do is modify your code to handle your given scenario, as shown below:

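    // Illustrative pattern only: matches "<level> <message>". Replace it with your own format.
    const parser = /^(\w+)\s+(.*)$/;

    exports.handler = async (event, context) => {
        const output = event.records.map((record) => {
            // Each incoming record carries its payload as a base64-encoded string.
            const payload = Buffer.from(record.data, 'base64').toString('utf8');
            const match = payload.match(parser);
            if (match) {
                // Illustrative result shape built from the captured groups.
                const result = { level: match[1], message: match[2] };
                return {
                    recordId: record.recordId,
                    result: 'Ok',
                    data: Buffer.from(JSON.stringify(result)).toString('base64'),
                };
            }
            // Pass unparseable records through unchanged and flag them as failed.
            return {
                recordId: record.recordId,
                result: 'ProcessingFailed',
                data: record.data,
            };
        });
        return { records: output };
    };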

But now comes the tricky part: How do you add distributed tracing into the mix?

One solution to this problem is to use AWS Step Functions. Step Functions lets you manage and orchestrate services using a state-based workflow. You describe the workflow in a JSON specification whose tasks invoke Lambda functions, and you can use it to model the same workflow that we described above.

First, you add error handling code to the function, which triggers a step function that captures a unique identifier. You can then use this identifier for tracing errors.

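    exports.handler = (event, context, callback) => {
        // Custom error type; its name is what the state machine matches on.
        function TraceError(message) {
            this.name = 'TraceError';
            this.message = message;
        }
        TraceError.prototype = new Error();

        const error = new TraceError('This is a tracing error!');
        // Returning the error through the callback marks the invocation as failed.
        callback(error);
    };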

Next, you create a step function that captures the error in the function. After the error is captured, the task tries to resume script execution, and if that fails, terminates the task. The results of the task are displayed in the Step Functions console. In the definition below, RecoveryState and TerminateMachine are minimal placeholder states.

    {
      "StartAt": "CreateAccount",
      "States": {
        "CreateAccount": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:us-east-1:123456789012:function:FailFunction",
          "Catch": [
            {
              "ErrorEquals": ["TraceError"],
              "Next": "RecoveryState"
            },
            {
              "ErrorEquals": ["States.ALL"],
              "Next": "TerminateMachine"
            }
          ],
          "End": true
        },
        "RecoveryState": {
          "Type": "Pass",
          "End": true
        },
        "TerminateMachine": {
          "Type": "Fail",
          "Error": "TraceError",
          "Cause": "Unrecoverable error in CreateAccount"
        }
      }
    }
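The catch clauses in the definition above are tried in the order in which they are listed, and the first one whose ErrorEquals list matches decides the next state. The Python sketch below mimics that routing locally. It is a simplification, not the Step Functions implementation: `next_state` is a hypothetical helper, and `States.ALL` is treated as matching every error name.

```python
def next_state(catchers, error_name):
    """Return the "Next" target of the first matching catcher, or None."""
    for catcher in catchers:
        names = catcher["ErrorEquals"]
        # Simplification: "States.ALL" is treated as a wildcard for any error.
        if error_name in names or "States.ALL" in names:
            return catcher["Next"]
    return None  # no catcher matched, so the error is not handled


# The two catchers from the state machine definition.
CATCHERS = [
    {"ErrorEquals": ["TraceError"], "Next": "RecoveryState"},
    {"ErrorEquals": ["States.ALL"], "Next": "TerminateMachine"},
]

traced = next_state(CATCHERS, "TraceError")
other = next_state(CATCHERS, "SomeOtherError")
```

Because the specific catcher is listed first, a TraceError is routed to the recovery state, while any other error falls through to the catch-all.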

On the surface, Step Functions seems to offer a good solution to our problem. But this service is not without its issues. First, it is subject to numerous limits and conditions that make it difficult to use for an extended period of time. Second, using it to monitor big and complex scenarios is difficult and time-consuming. Lastly, it’s not the best approach for handling and monitoring asynchronous events, making Step Functions less than ideal for distributed tracing scenarios.

Scenario 2: Simple Notification Service (SNS)

Amazon provides a number of tools that can help you implement this type of scenario, such as its Simple Notification Service (SNS). SNS is a messaging service that lets microservices send messages to one another. Since it is part of the AWS ecosystem and Amazon lets you integrate SNS with your own code, you can add SNS calls directly to your Lambda function.

For example, the following code sample from Amazon’s SNS tutorial shows how existing Java code can work with an SNS topic, in this case by listing the tags attached to it:

    // Assumes snsClient and topicArn have already been created.
    final ListTagsForResourceRequest listTagsForResourceRequest = new ListTagsForResourceRequest();
    listTagsForResourceRequest.setResourceArn(topicArn);
    final ListTagsForResourceResult listTagsForResourceResult =
            snsClient.listTagsForResource(listTagsForResourceRequest);
    System.out.println(String.format("ListTagsForResource: \tTags for topic %s are %s.\n",
            topicArn, listTagsForResourceResult.getTags()));

Using SNS, a message can be sent a maximum of three times before it is moved to a dead-letter queue and ignored. SNS is also a highly efficient service that can handle one hundred messages in parallel. You can then use a step function to capture a unique identifier for tracing errors. In this case, the step function publishes a message that tells you when an error is detected:

    "Amazon SNS: Publish a message": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "Message": "Error detected!",
        "PhoneNumber": "+999999999"
      },
      "Next": "NEXT_STATE"
    }

On paper, this all sounds great. But to take advantage of this feature, you need to ensure that your services are optimized to handle such high throughput. Otherwise, your messages will get stuck and slow down all related operations. This is less than ideal for distributed tracing, which creates large amounts of additional data and could slow the system down even further.
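The retry-then-dead-letter behavior described in this scenario can be simulated locally to see what ends up being ignored. The sketch below does not call SNS. `deliver_with_retries` and `make_flaky` are hypothetical helpers, the dead-letter queue is a plain list, and the limit of three attempts follows the figure quoted above.

```python
def deliver_with_retries(message, handler, dead_letters, max_attempts=3):
    """Try to deliver a message; park it in dead_letters after max_attempts failures.

    Returns the attempt number that succeeded, or None if the message was parked.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return attempt
        except Exception as exc:  # any handler failure counts as a failed delivery
            last_error = exc
    dead_letters.append(
        {"message": message, "error": repr(last_error), "attempts": max_attempts}
    )
    return None


def make_flaky(failures):
    """Build a handler that fails the given number of times, then succeeds."""
    state = {"left": failures}

    def handler(message):
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("temporary failure")

    return handler


dead_letters = []
recovered = deliver_with_retries("a", make_flaky(2), dead_letters)  # succeeds on try 3
parked = deliver_with_retries("b", make_flaky(5), dead_letters)     # never succeeds
```

Without an identifier that links the parked message back to the request that produced it, the dead-letter entry tells you that something failed but not which workflow it belonged to.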

Scenario 3: Write Your Own Service

Another approach is to write your own code and deploy it to your own servers or host it on AWS. In this case, you can replace the AWS Lambda function with your own Java code. You can also replace Kinesis and SNS with RabbitMQ, an open-source messaging platform. The following code shows how easy it is to build a simple messaging client to handle this scenario:

    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    public class Send {

        private final static String QUEUE_NAME = "hello";

        public static void main(String[] argv) throws Exception {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("localhost");
            // try-with-resources closes the channel and the connection automatically.
            try (Connection connection = factory.newConnection();
                 Channel channel = connection.createChannel()) {
                channel.queueDeclare(QUEUE_NAME, false, false, false, null);
                String message = "Send Message";
                // Publish to the default exchange, routed by queue name.
                channel.basicPublish("", QUEUE_NAME, null, message.getBytes("UTF-8"));
                System.out.println(" [x] Sent '" + message + "'");
            }
        }
    }

The easiest way to collect tracing data with RabbitMQ is to create a connection factory and capture any unhandled exceptions. So, for our Java example you can add the following:

    // customHandler is your own implementation of com.rabbitmq.client.ExceptionHandler.
    ConnectionFactory factory = new ConnectionFactory();
    factory.setExceptionHandler(customHandler);

In addition, RabbitMQ provides its own metrics collectors that you can use for collecting error and performance data. You can even integrate your custom metrics with a number of backend services. For example, to collect Micrometer metrics, you can add the following to your Java code:

    // The collector reports into a Micrometer MeterRegistry; SimpleMeterRegistry keeps metrics in memory.
    ConnectionFactory connectionFactory = new ConnectionFactory();
    MicrometerMetricsCollector metrics = new MicrometerMetricsCollector(new SimpleMeterRegistry());
    connectionFactory.setMetricsCollector(metrics);
    // Micrometer counter that tracks the number of published messages.
    metrics.getPublishedMessages();

Once you’ve written all this extra code, you should be well on your way. But getting code to run is only half the battle. You also need to consider how to monitor code once it is finally deployed, and this is where the problems start, especially if you want to use distributed tracing.

In theory, you will have to create direct connections or triggers between your services and your monitoring platforms. But these direct connections turn an asynchronous process into a synchronous one, which undermines the benefits of distributed tracing. The bigger problem is that it’s hard to inject trace IDs into services. Moreover, every trigger you add to a service adds latency to it.
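To make the trace ID problem concrete, the sketch below shows the basic idea of carrying an identifier inside the message itself, so that sender and receiver can be correlated without a direct connection. It is an in-memory illustration under stated assumptions: `publish` and `consume` are hypothetical helpers, the `trace_id` header name is made up for this example, and a `queue.Queue` stands in for the broker.

```python
import queue
import uuid


def publish(broker, body, trace_id=None):
    """Attach a trace ID to the message headers and enqueue the message."""
    trace_id = trace_id or uuid.uuid4().hex  # start a new trace if none is given
    broker.put({"headers": {"trace_id": trace_id}, "body": body})
    return trace_id


def consume(broker, log):
    """Take one message off the queue and log it under the sender's trace ID."""
    message = broker.get()
    trace_id = message["headers"]["trace_id"]
    log.append((trace_id, "consumed", message["body"]))
    return message


broker = queue.Queue()
log = []

sent_id = publish(broker, "order-42")
log.append((sent_id, "published", "order-42"))
consume(broker, log)
```

Both log entries share the same trace ID even though the producer never called the consumer. Doing this by hand in every service, for every message format, is the part that becomes hard to maintain.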

Distributed Tracing Scenario

Instead of trying to build your own solution, you can build something that uses distributed tracing to monitor your application. Distributed tracing frameworks such as OpenTracing and OpenCensus allow you to do just this. Both frameworks give you the tools you need to monitor your applications with distributed tracing, and you’ll also benefit from active support and development communities. The Python sample below illustrates how you could integrate an open framework with your own application code:

    import logging
    import sys

    import pika
    from opencensus.ext.zipkin.trace_exporter import ZipkinExporter
    # Newer OpenCensus releases may expose AlwaysOnSampler directly from opencensus.trace.samplers.
    from opencensus.trace.samplers import always_on
    from opencensus.trace.tracer import Tracer

    # Export spans to a Zipkin collector running locally.
    ze = ZipkinExporter(service_name="dr-test", host_name="localhost",
                        port=9411, endpoint="/api/v2/spans")
    tracer = Tracer(exporter=ze, sampler=always_on.AlwaysOnSampler())


    def main():
        logging.basicConfig(level=logging.INFO)
        logger = logging.getLogger("send_message")

        connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
        channel = connection.channel()
        channel.queue_declare(queue="task_queue", durable=True)

        # Everything inside this block is recorded as a single span named "main".
        with tracer.span(name="main"):
            message = " ".join(sys.argv[1:])
            channel.basic_publish(
                exchange="",
                routing_key="task_queue",
                body=message,
                properties=pika.BasicProperties(delivery_mode=2),  # persistent message
            )
            logger.info("Sent %s", message)
        connection.close()


    if __name__ == "__main__":
        main()
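To see roughly what a tracer records when you wrap code in a span, here is a toy recorder written in plain Python. It is not the OpenCensus API: `ToyTracer` is a hypothetical class that only captures each span's name, parent, and duration in memory, where a real tracer would also assign IDs and export the spans to a backend such as Zipkin.

```python
import time
from contextlib import contextmanager


class ToyTracer:
    """Minimal in-memory span recorder, for illustration only."""

    def __init__(self):
        self.finished = []  # completed spans, in the order they ended
        self._stack = []    # spans that are currently open

    @contextmanager
    def span(self, name):
        record = {
            "name": name,
            # The innermost open span, if any, becomes the parent.
            "parent": self._stack[-1]["name"] if self._stack else None,
            "start": time.perf_counter(),
        }
        self._stack.append(record)
        try:
            yield record
        finally:
            record["duration"] = time.perf_counter() - record["start"]
            self._stack.pop()
            self.finished.append(record)


toy = ToyTracer()
with toy.span("main"):
    with toy.span("publish"):
        pass  # the traced work would happen here
```

Inner spans finish before the spans that enclose them, so `toy.finished` lists "publish" ahead of "main", and the parent field is what lets a tracing backend rebuild the call tree.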

Another benefit of this approach is that you can mix and match the different parts. For example, you can still take advantage of the AWS ecosystem and utilize Kinesis, SNS, or Step Functions to supplement your existing code instead of trying to reinvent the wheel.

Conclusion: Not Everything Has to Be Difficult

At the start of this article, we compared the challenges faced by NASA’s computer scientists working on the Apollo program and modern developers today. In certain situations, such as ensuring that different components and processes can exchange data, we saw that there was much common ground between both groups.

But when it came to monitoring, the approaches diverged, with modern developers able to take advantage of techniques that were not previously available, such as distributed tracing. We then looked at three different approaches to monitoring asynchronous applications and the advantages and disadvantages of each. Finally, we looked at how distributed tracing offers a solution that is both simpler and more effective.

Distributed tracing has definite advantages over trying to build your own solution or trying to adopt other approaches. However, to take your distributed tracing solution to the next level, you should consider using Epsagon. Epsagon offers a complete solution for any cloud-based, distributed, asynchronous application and provides the observability you need.
