Part 1 of this series discussed some of the main advantages of microservices, and touched on some areas to consider when working with microservices.

Part 2 considered how containers fit into the microservices story. In today's post we will look at some basic patterns and best practices for implementing microservices.

Introduction

Design and implementation patterns that can be applied to a microservices-based application vary widely depending on the application's specific scenario. A single blog post describing every possible scenario and its appropriate design pattern would not do justice to the richness of this topic. That said, recent engagements with customers have shown that most of them run into a common set of issues when getting started with microservices. The goal of this blog post is to give an overview of the basic patterns and best practices that address those common issues, as a starting point for what to keep in mind when getting started with microservices.

General Best Practices

Before you start thinking about architectural and implementation best practices, it is important to put some general principles in place for the successful delivery of microservices-based applications. For services teams to be set up for maximum success, they must be truly independent and able to release on their own cadence. They must also be able to use the languages, frameworks and products that best meet their needs. These are critical factors for successful microservices projects, and many organizations are not set up for them, which can be an impediment to success. Some companies release services and updates to services up to one hundred times per day, which makes automation a key requirement. Typically, everything from continuous integration all the way to continuous delivery and deployment is automated. The next blog post will cover DevOps and automation in more detail.

With the increased freedom every team has, it is crucially important to invest in standards (e.g. common log formats, naming conventions, API documentation, etc.) to avoid chaos. Think about what would happen if each team had its own log format. How would one be able to ever correlate events or transactions that span multiple services? Next we will cover some of the best practices that are more related to implementation.

Design for Stateless Compute

Let’s define stateless first, as there are very few applications that are truly stateless. In almost every case an application or service has some state, such as session state or application and configuration data. In the past, especially in non-dynamic environments, state has been stored with the service instance, for example in memory. As mentioned before, a microservices world is very dynamic, so you may not know where a new instance of the service gets spun up when scaling out, or in the case of a failure. If the state of that service is stored with the instance, the state is lost when the instance goes away and a new instance will not be able to use it. The recommended best practice for this case is to push that data into highly available managed services. As you’ll see in the last part of this blog series we are building some really cool services to enable those scenarios.
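To make the idea concrete, here is a minimal sketch of a service that keeps no state of its own. The names `StateStore`, `InMemoryStore` and `CartService` are made up for illustration, and the in-memory store only stands in for a highly available managed service so that the example runs locally.

```python
from typing import Dict, List, Optional, Protocol


class StateStore(Protocol):
    """Anything that can persist state outside the service instance."""

    def get(self, key: str) -> Optional[str]: ...

    def put(self, key: str, value: str) -> None: ...


class InMemoryStore:
    """Local stand-in for a highly available managed store (assumption)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def put(self, key: str, value: str) -> None:
        self._data[key] = value


class CartService:
    """Holds no per-session state itself; everything goes through the store."""

    def __init__(self, store: StateStore) -> None:
        self._store = store

    def add_item(self, session_id: str, item: str) -> None:
        items = self.items(session_id)
        items.append(item)
        self._store.put(session_id, ",".join(items))

    def items(self, session_id: str) -> List[str]:
        current = self._store.get(session_id)
        return current.split(",") if current else []


shared_store = InMemoryStore()
instance_a = CartService(shared_store)
instance_a.add_item("session-1", "book")

# Pretend instance_a has failed: a replacement instance sees the same state.
instance_b = CartService(shared_store)
instance_b.add_item("session-1", "pen")
print(instance_b.items("session-1"))  # ['book', 'pen']
```

Because the service instances only hold a reference to the store, any instance can serve any request, which is what makes scaling out and replacing failed instances safe.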

Design for Failure

In a distributed system you should always assume that service calls will fail due to faults. Faults can be due to many factors, and not just because of faulty code. For example, failures can also occur due to issues with the network or the infrastructure. There are two types of faults: transient faults and non-transient faults. Transient faults can happen anytime, and most of the time the operation will succeed after a few retries. Non-transient faults are more permanent; for example, when you are trying to access a directory that has been deleted. Designing for failure means writing your code in a way that handles those types of faults to ensure that the application is always responsive and returns something to the user. The table below shows some of the patterns that can be applied.

Patterns and example scenarios:

Retry: In the microservices world, a microservice may fail the first time it makes a request of another service, whether it is another microservice or a managed service in a cloud environment. Those failures might be very short-lived but without any retrial policy in place, the requesting service is forced to go into failure handling mode. With a retrial policy in place, the underlying infrastructure can retry without the knowledge of the requesting microservice, thus providing improved failure handling. There are several retry strategies that can be applied:

- Fixed interval: retrying at a fixed rate. One should be cautious with choosing very short intervals and a high number of retries, as this can be interpreted as a DOS attack on the service.

- Exponential backoff: using progressively longer waits between retries.

- Random: defining random retry intervals.

Circuit Breaker: In contrast to the Retry pattern which enables services to retry an operation, the circuit breaker pattern prevents the service from performing an operation that is likely to fail. For example, a client service can use a circuit breaker to prevent further remote calls over the network when a downstream service is not functioning properly. This can also prevent the network from becoming congested by a sudden spike in failed retries by one service to another, and it can also prevent cascading failures. Self-healing circuit breakers check the downstream service at regular intervals and reset the circuit breaker when the downstream service starts functioning properly.

Bulkhead: Bulkhead prevents taking an entire system down from a failure of a single component by compartmentalizing a system. Assume we have a request-based, multi-threaded application that uses three different components, A, B, and C. If requests to component C start to hang, eventually all request handling threads will hang, waiting for an answer from C. This would make the application entirely non-responsive. Bulkhead prevents this by using only a certain number of threads from the system solely reserved for use by C, so that not all the threads are consumed waiting for C.

Fallback: Fallback provides an alternative solution to a requesting service when the dependent service fails. For example, a movie client fails to load the most popular movies list which is saved in a remote service. If there is a failure to that remote service, there could be an alternative service which might show some generic movies list, or it could show the last cached list.
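The Retry and Fallback patterns described above can be combined in a few lines. The sketch below is one possible way to do it, not a reference implementation: `TransientError`, `retry_with_backoff` and the movie functions are hypothetical names, and the delay values are arbitrary. It retries only transient faults, waits progressively longer between attempts with some random jitter, and returns a fallback result once the attempts are used up.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for faults that are worth retrying."""


def retry_with_backoff(
    operation: Callable[[], T],
    fallback: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry transient faults with exponential backoff, then fall back.

    Any other exception is treated as non-transient and propagates.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                break  # out of attempts; use the fallback below
            # Exponential backoff: base, 2x base, 4x base, ... up to max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Random jitter spreads out retries coming from many callers.
            sleep(delay * random.uniform(0.5, 1.0))
    return fallback()


calls = {"count": 0}


def popular_movies() -> List[str]:
    """Simulated remote call that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("temporary network issue")
    return ["Movie A", "Movie B"]


def cached_movies() -> List[str]:
    """Fallback: a generic or last cached list."""
    return ["Generic Movie"]


# sleep is replaced with a no-op here so the example finishes instantly.
print(retry_with_backoff(popular_movies, cached_movies, sleep=lambda _: None))
```

Capping the number of attempts and the maximum delay keeps the caller responsive, and it avoids the flood of rapid retries that the fixed-interval strategy warns about.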





Design for Backward and Forward Compatibility

As services will be deployed completely independently and autonomously, you need to make sure that the update to your service does not break existing services that communicate with yours. That means that your code should be backward and forward compatible. So what does that mean?

In Figure 1 the Processing Service works against v1.1 of the Order Service. Now we update the Order Service to v1.2 to implement some new functionality. Backward compatibility means that this change will not break the functionality of the Processing Service. New fields added to the API should be optional or should have sensible defaults. Never rename existing fields. It is recommended that you test the new version of the API by passing it old messages, so that you identify any problems before releasing the new version.

Figure 1: Backward compatibility

Forward compatibility is a less common requirement for microservices, but if you need to be able to roll back a service, it must be implemented. Forward compatibility means that the current version of a service keeps working, without being updated, when it receives messages intended for a newer version of itself. In general, follow Postel’s law, “Be conservative in what you do, be liberal in what you accept from others”. Ignore any additional fields that are passed along and don’t throw errors.
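As an illustration of both rules, the sketch below parses an order message in a tolerant way. The message shape and the field names (`orderId`, `quantity`, `giftWrap`, `priority`) are invented for this example. A newly added field is optional with a default, so old messages still parse, and unknown fields are ignored, so messages from a newer sender do not cause errors.

```python
import json
from dataclasses import dataclass


@dataclass
class Order:
    order_id: str
    quantity: int
    # Hypothetical field added in a later version. It is optional and has a
    # sensible default, so messages from older senders remain valid.
    gift_wrap: bool = False


def parse_order(raw: str) -> Order:
    """Tolerant reader: use known fields, default missing ones, ignore the rest."""
    data = json.loads(raw)
    return Order(
        order_id=data["orderId"],
        quantity=int(data["quantity"]),
        gift_wrap=bool(data.get("giftWrap", False)),
    )


# Message from an older sender: the new field is missing.
old_message = '{"orderId": "A-1", "quantity": 2}'
# Message from a newer sender: carries a field this version does not know.
newer_message = '{"orderId": "A-2", "quantity": 1, "giftWrap": true, "priority": "high"}'

print(parse_order(old_message))
print(parse_order(newer_message))
```

Feeding recorded old messages through a function like `parse_order` in a test suite is one simple way to carry out the "pass old messages to the new version" check before a release.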

Besides the implementation of API backward and forward compatibility, it is also important to properly document the APIs and their version history. A good practice is to use Semantic Versioning (major.minor.patch) for your microservices. Good documentation not only helps consumers get started with the API quickly, but can also convey best practices for consuming and working with the API. A tool that has been quite successful is Swagger, whose specification became the basis of the OpenAPI Specification maintained by the OpenAPI Initiative (OAI), which is focused on creating, evolving and promoting a vendor-neutral API description format. You can use it for interactive documentation of your API, as well as client SDK generation and discoverability.

Design for Efficient Services Communication

At a high level one can differentiate between external service communication and internal service communication. The second part of this blog series covered an example of external service communication by discussing the API Gateway, which allows one, amongst other things (such as AuthN, AuthZ, throttling or protocol translation), to route traffic from clients to microservices. Most of the time HTTP is used as the protocol for the communication between clients and the API gateway or other external services. Now you may ask, why bother differentiating external and internal services communication? Let’s think about it a bit more.

Large microservices applications can have hundreds or even thousands of services, and the more services you have, the more communication and data exchange needs to happen. As a result, the protocol selected becomes an important factor that can impact performance. While HTTP (soon HTTP/2) is a natural choice for communication from a client to your service over the internet, you can choose lighter-weight protocols built directly on TCP or UDP for communication between internal services to improve performance.

The second aspect that is often overlooked is how data serialization and deserialization can impact the overall performance, and in the worst case become a bottleneck. One way to reduce this penalty, besides choosing a good JSON serializer, is to consider whether you really need to re-serialize an object when a downstream service works with the same object. Instead of building a new object, you can augment the deserialized one and pass it on to the next service. Another option to improve performance is to use a binary format like Protocol Buffers.

Design for Proper Asynchronous Messaging

In a microservices application each service instance is typically a process. As a result, services must interact using an inter-process communication (IPC) mechanism. NGINX put together a great blog post describing several ways to implement IPC in a microservices architecture.

One very common approach for IPC is to use messaging. Pretty much everyone knows how to read messages from and write messages to a queue, but there are some design practices and considerations to keep in mind when using asynchronous messaging as an IPC mechanism in your application. As with asynchronous calls in general, one of the biggest advantages of asynchronous messaging is that the service does not block while waiting for a response from another service. The flip side is that asynchronous messaging also introduces some challenges to be aware of. Depending on your messaging solution, the order in which messages arrive may not be guaranteed. Often this is not a problem, but if your service requires or expects its messages in a certain order, you need to add logic to your service to restore that order. Another challenge arises when dealing with repeated messages; a message could be sent more than once, for example as part of retry logic. In this case it is critical to make sure that your service is able to detect duplicate messages and handle them appropriately.

A good practice is to design for idempotency, which is covered later in this blog post. Another problem, which happens quite frequently, is how to handle messages that cannot be processed. For example, if a message is malformed or contains corrupt data, it may cause a receiver to throw an exception. A good practice is to set such messages aside by moving them to a dedicated queue as part of the exception handling, so receivers do not keep trying to process the “poisoned” message. The poisoned message can then be taken from that queue for diagnostic purposes. These are just some of the challenges one may encounter when working with asynchronous messaging. It should be quite evident by now that messaging is a rich topic which could easily fill an entire blog post, but the good news is that many messaging systems such as Apache Kafka or RabbitMQ offer features that help address most of the concerns mentioned above.
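The handling of poisoned messages can be sketched with two in-process queues. This is only an illustration under simplifying assumptions: Python's standard `queue.Queue` stands in for a real message broker, the dedicated queue is called `dead_letters` here (a common name for it, not one used by any specific product in this post), and the `drain` function and message fields are hypothetical.

```python
import json
import queue
from typing import Callable, List


def drain(
    inbox: "queue.Queue[str]",
    dead_letters: "queue.Queue[str]",
    handle: Callable[[dict], None],
) -> None:
    """Process every queued message; park the ones that cannot be handled."""
    while True:
        try:
            raw = inbox.get_nowait()
        except queue.Empty:
            return
        try:
            handle(json.loads(raw))
        except (ValueError, KeyError):
            # Malformed JSON raises ValueError, a missing field raises KeyError.
            # Move the message aside for diagnostics instead of retrying it.
            dead_letters.put(raw)


inbox: "queue.Queue[str]" = queue.Queue()
dead_letters: "queue.Queue[str]" = queue.Queue()
processed: List[str] = []

for raw in ['{"orderId": "A-1"}', "not json", '{"unexpected": true}']:
    inbox.put(raw)

drain(inbox, dead_letters, lambda msg: processed.append(msg["orderId"]))
print(processed)             # ['A-1']
print(dead_letters.qsize())  # 2
```

The raw message is stored unchanged in the second queue, so whoever investigates later sees exactly what the receiver saw.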

Design for Idempotency

Whether dealing with messages or data, one should always try to design for idempotency. Messages for example can be received and processed more than once based on failed receivers, retry policies, etc. Ideally the receiver should handle the message in an idempotent way, so that the repeated call produces the same result.

The example below illustrates what this actually means. Let’s assume a service needs to add some money to an account. The message could look similar to the one below.

{
  "credit": {
    "forAccount": "12345",
    "amount": "100"
  }
}

Let’s assume further that the first operation fails due to some network issues and the receiver cannot pick up the message. As you have seen before, it is good practice to implement retries. As a result, the sender submits the message again. Now you end up with two identical messages. If the receiver now picks up the messages and processes them both, the account is credited not $100 but $200. To avoid this, you need to ensure idempotency. A common way of ensuring that an operation is idempotent is to add a unique identifier to the message and to make sure that the service only processes a message whose identifier it has not seen before. Below is an example of the same message, but with an identifier added.

{
  "credit": {
    "creditID": "124e456-e89b-12d3-a456-426655440000",
    "forAccount": "12345",
    "amount": "100"
  }
}

Now the receiver can check whether the message has already been processed before processing it. This is also commonly referred to as de-duping. The same principle applies to data updates. The bottom line is that one should design operations to be idempotent, so that each step can be repeated without impacting the system. For more information on idempotency patterns see Jonathan Oliver’s blog.
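A receiver for the credit message could look roughly like the sketch below. The `AccountService` class is hypothetical, and the set of processed identifiers is kept in memory only to keep the example short; a real service would need to keep it in a durable store, ideally updated together with the balance.

```python
from typing import Dict, Set


class AccountService:
    """Credits an account at most once per creditID."""

    def __init__(self) -> None:
        self.balances: Dict[str, int] = {}
        # Assumption: in-memory for brevity. A real service would persist this.
        self._processed_ids: Set[str] = set()

    def handle_credit(self, message: dict) -> bool:
        """Apply the credit; return False if the message is a duplicate."""
        credit = message["credit"]
        credit_id = credit["creditID"]
        if credit_id in self._processed_ids:
            return False  # already applied, so ignore the repeat
        account = credit["forAccount"]
        self.balances[account] = self.balances.get(account, 0) + int(credit["amount"])
        self._processed_ids.add(credit_id)
        return True


message = {
    "credit": {
        "creditID": "124e456-e89b-12d3-a456-426655440000",
        "forAccount": "12345",
        "amount": "100",
    }
}

service = AccountService()
service.handle_credit(message)
service.handle_credit(message)  # the retried duplicate
print(service.balances["12345"])  # 100
```

Handling the same message twice leaves the balance at 100, which is exactly the property the retry logic on the sender side relies on.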

Design for Eventual Consistency

According to the CAP theorem, a system that must tolerate network partitions has to choose between consistency and availability. Most microservices applications are designed to be highly available, which means giving up strong consistency. The problem we encounter is that business transactions typically span multiple services that touch different data stores. Patterns such as event sourcing combined with compensating transactions have proven to be very effective, but ultimately the specific application scenario determines the best way to implement eventual consistency.
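To give a feel for the compensating transaction part, here is a deliberately small sketch. It does not cover event sourcing, the step names are invented, and the function `run_with_compensation` is a hypothetical helper rather than an API of any framework. Each step is paired with an action that undoes it, and when a later step fails, the steps that already completed are undone in reverse order.

```python
from typing import Callable, List, Tuple

# Each step is a pair: (action, compensation that undoes the action).
Step = Tuple[Callable[[], None], Callable[[], None]]


def run_with_compensation(steps: List[Step]) -> bool:
    """Run steps in order; on failure, undo the completed ones in reverse."""
    completed: List[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


log: List[str] = []


def fail_shipping() -> None:
    """Simulates a downstream service that is unavailable."""
    raise RuntimeError("shipping service unavailable")


ok = run_with_compensation([
    (lambda: log.append("reserve stock"), lambda: log.append("release stock")),
    (lambda: log.append("charge card"), lambda: log.append("refund card")),
    (fail_shipping, lambda: log.append("cancel shipment")),
])
print(ok)   # False
print(log)  # ['reserve stock', 'charge card', 'refund card', 'release stock']
```

Between the failure and the last compensation the system is temporarily inconsistent, which is the trade-off eventual consistency accepts. In a real system the compensations themselves can fail and would need retries, so they should be idempotent as well.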

Design for Operations

One of the most important aspects of successful microservices implementations is being able to monitor the system and services, while also diagnosing issues. Besides monitoring the host machines and containers, a good strategy is required for monitoring and diagnosing your services. A key element of monitoring and diagnostics is data collection, which is normally accomplished through logging. As mentioned earlier, all services should adhere to a common log format to give diagnostics a solid foundation. For following requests across services, OpenTracing, for example, offers a vendor-neutral open standard for distributed tracing that can be used in a polyglot microservices application. Besides a common log format you need to consider what to log. There is always the classic question of how much logging is too much and how much is not enough. As with so many things, this depends on your scenario, but the list below, based on many customer scenarios, should give you a good starting point:

Healthy state data: It is important to log how your system is behaving at a healthy state. This data can later be used as a baseline for comparison with anomalies. Some of the data to be collected should include service start events, heartbeat, etc.

Log what is not working: This seems to be straightforward, but in addition to error details and stack traces, one should also log semantic information such as user requests, etc.

Log performance data on critical path: Aggregating performance metrics into percentiles allows one to identify system-wide long tail performance issues.
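To see why percentiles are more telling than averages for long tail issues, consider the short sketch below. The latency samples are made up for illustration; the calculation uses `statistics.quantiles` from the Python standard library (available from Python 3.8).

```python
import statistics

# Made-up request latencies in milliseconds; two requests are very slow.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 13, 900, 14, 15]

# With n=100, quantiles() returns the 99 cut points for percentiles 1 to 99.
cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

mean = statistics.mean(latencies_ms)
print(f"mean={mean:.1f} ms  p50={p50:.1f} ms  p95={p95:.1f} ms  p99={p99:.1f} ms")
```

The median stays close to the typical request, the mean is pulled up by the two outliers without showing how bad they are, and the high percentiles expose the slow tail directly.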

Finally, one of the most important best practices is to use a correlation or activity ID. This ID is generated when a request is made to the application and is passed on to all downstream services. This allows you to trace a request from beginning to end, even though it spans multiple independent services. Figure 2 shows a waterfall chart generated based on an activity ID.

Figure 2: Waterfall chart based on an Activity ID

Distributed tracing solutions such as Zipkin have proven to be very effective when it comes to monitoring and diagnosing microservices-based applications.
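The correlation ID practice described above can be sketched with the Python standard library alone. Several things here are assumptions made for the example: the header name `X-Correlation-ID`, the helper functions `begin_request` and `outgoing_headers`, and the logger name. The idea is to reuse the ID from the incoming request if there is one, create it otherwise, stamp it on every log line, and pass it on with every downstream call.

```python
import contextvars
import logging
import uuid
from typing import Dict

HEADER = "X-Correlation-ID"  # assumed header name, pick one and standardize it

# Holds the ID of the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


def begin_request(headers: Dict[str, str]) -> str:
    """Reuse the caller's ID if present, otherwise start a new one."""
    cid = headers.get(HEADER) or str(uuid.uuid4())
    correlation_id.set(cid)
    return cid


def outgoing_headers() -> Dict[str, str]:
    """Headers to send along on calls to downstream services."""
    return {HEADER: correlation_id.get()}


handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s %(correlation_id)s %(message)s"))
handler.addFilter(CorrelationFilter())
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

begin_request({})                # edge service: no ID yet, so one is created
logger.info("order received")    # the log line carries the ID
downstream = outgoing_headers()  # send these headers to the next service
```

With the ID in every log line of every service, the log entries for a single request can be collected and ordered, which is the raw material for a waterfall view of that request.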

Summary

As mentioned in the introduction, which patterns to use depends very much on your specific scenario. This blog post explored some of the basic design principles to keep in mind when you start to design and develop microservices-based applications. Please let us know if you find this information useful and if you would like more details on certain aspects.

The next blog post will cover DevOps for containerized microservices. Stay tuned.