In recent years there’s been a growing trend to move away from large all-in-one applications. These “monoliths”, developed with one codebase and delivered as one large system, are hard to maintain. In their place, the industry now favors splitting off the component systems into individual services. As separate “microservices”, they perform the smallest functions possible grouped into logical units. They are independent deliverables: deployable, replaceable and upgradeable on their own.

Going further into the Practicality Beats Purity series, this article will cover the implications of transitioning to a microservices architecture.

What are the benefits of microservices?

This architecture provides a level of modularity and portability that you’re unable to get out of a monolith. It’s easy to repurpose or reuse individual systems for other parts of the business. A great example is a user authorization service that implements company-wide security standards. With good design, we can reuse it across various applications and organizations in a large business.

You also gain the ability to scale. Implementations that use easy-to-proxy protocols like HTTPS can spin up multiple instances of the same service across multiple physical or virtual machines at any point in time. That’s great during peak usage, and scaling back down during quieter hours helps lower overall infrastructure costs. If you’re in the cloud, idle instances can even be shut down completely, directly saving money.

Along with this scalability comes availability. Once a service is independent enough to scale, it’s just as easy to keep extra instances running in separate infrastructure resources. When one fails, the others can take over the load while the original restarts. Developers can debug from a stack trace that’s either emailed to them or stored in a centralized logging system. In fact, you could choose to ignore the failure unless the same one occurs more than X times over Y minutes, saving your debugging time for the wider-reaching problems.
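
The “more than X times over Y minutes” rule can be sketched as a small sliding-window counter. This is only an illustration: the class name, the failure signature string and the thresholds are all made up, and timestamps are passed in explicitly so the logic is easy to follow.

```python
from collections import defaultdict, deque


class FailureThreshold:
    """Flag a failure only when it repeats more than `max_count` times
    within `window_seconds` (the X and Y from the text)."""

    def __init__(self, max_count, window_seconds):
        self.max_count = max_count
        self.window_seconds = window_seconds
        self._seen = defaultdict(deque)  # failure signature -> timestamps

    def record(self, signature, now):
        """Record one failure at time `now` (in seconds).

        Returns True when the failure has crossed the threshold."""
        times = self._seen[signature]
        times.append(now)
        # Forget occurrences that have fallen out of the window.
        while times and now - times[0] > self.window_seconds:
            times.popleft()
        return len(times) > self.max_count


# Hypothetical values: alert on more than 3 identical failures in 5 minutes.
tracker = FailureThreshold(max_count=3, window_seconds=300)
alerts = [tracker.record("KeyError in billing", t) for t in (0, 60, 120, 180)]
```

Here only the fourth occurrence crosses the threshold. In a real system the signature would probably be derived from the exception type and the top of the stack trace.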

The same mechanism enables continuous zero-downtime deployments by individually upgrading service instances. After one upgrade completes and traffic starts to flow, it’s easy to roll back to the previous version if there’s a problem. This helps maintain uptime while fixes are ongoing. It’s also a great way of doing A/B testing with user interfaces: just run multiple instances of the same interface, some of which have slight changes to the GUI. Then monitor how users respond to those changes and decide whether to move forward with them or not.
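
As a rough sketch of upgrading instances one at a time, the function below walks through a fleet and restores the previous versions as soon as one instance looks unhealthy. The instance names, version strings and the `is_healthy` callback are hypothetical stand-ins for whatever your orchestration tooling provides.

```python
def rolling_upgrade(instances, new_version, is_healthy):
    """Upgrade one instance at a time; restore every instance to its
    previous version on the first failed health check.

    `instances` maps instance name -> running version and is updated in place.
    `is_healthy(name, version)` is a caller-supplied check (an assumption here).
    Returns True if the whole fleet ends up on `new_version`.
    """
    previous = dict(instances)  # snapshot to roll back to
    for name in list(instances):
        instances[name] = new_version
        if not is_healthy(name, new_version):
            instances.update(previous)  # roll everything back
            return False
    return True


fleet = {"web-1": "1.0", "web-2": "1.0"}
upgraded = rolling_upgrade(fleet, "1.1", lambda name, version: True)
```

While the loop runs, the instances that have not been touched yet keep serving traffic on the old version, which is what keeps the deployment free of downtime.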

Implementing microservices does add quite a bit of complication to the system as a whole. To help mitigate that, it’s necessary to provide well-defined interfaces between all the services. Going through the thought process of those definitions will help surface architectural problems with the application that you may have otherwise missed. It also helps define tests that guarantee the promises, or contracts, that the interfaces must follow. While the overall integration of the services becomes harder to test, the individual functions or services become easier.
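
One way to turn an interface contract into a test is to describe the expected fields and types, then check every response against that description. The sketch below assumes a made-up contract for a user record; the field names are illustrative, not taken from any real service.

```python
# Hypothetical contract for a user record returned by a user service.
USER_CONTRACT = {"id": int, "email": str, "active": bool}


def contract_violations(payload, contract):
    """Return a list of human-readable problems; empty means the payload
    honors the contract."""
    problems = []
    for field, expected in contract.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif type(payload[field]) is not expected:
            # Exact type match, so a bool is not accepted where an int is promised.
            actual = type(payload[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    return problems


good = contract_violations({"id": 1, "email": "ana@example.com", "active": True}, USER_CONTRACT)
bad = contract_violations({"id": "1", "email": "ana@example.com"}, USER_CONTRACT)
```

Both the team that owns the service and the teams that consume it can run a check like this, which is what makes the contract enforceable rather than just documented.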

Separating into these smaller systems allows an increase in speed of feature delivery for each subsystem. In other words, you don’t need an entirely new version of your application because you made a change in user management. You can simply update the user management service and deliver that feature separately, without having to wait for the rest of the application to catch up, as long as it doesn’t impact the promises of its interfaces.

Complexity of Infrastructure

Implementing microservices does have a cost and sticking with a monolith has its own set of advantages. The largest cost, which can sneak up on you, is this shift from complexity of the codebase to complexity of the infrastructure. If you don’t have the right architecture, tools, compute or human resources in place to manage it, you can find yourself in deep water very quickly.

Running individual services means that you have to track which ones are up and in which compute resources they are running. You have to provide a way to dynamically configure them so they can run on any available resources, otherwise it defeats the purpose. Plus you need simple ways of restarting, scaling, monitoring and moving them to different compute nodes.
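
To make “track which ones are up and where” concrete, here is a minimal in-memory registry based on heartbeats. It is a sketch under simple assumptions: each instance reports in periodically, and anything silent for longer than a time-to-live is considered down. The service and node names are hypothetical, and a production setup would normally rely on a dedicated service discovery tool instead.

```python
class ServiceRegistry:
    """Track which service instances are alive and on which node they run."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self._last_seen = {}  # (service, node) -> time of last heartbeat

    def heartbeat(self, service, node, now):
        """Called periodically by each running instance."""
        self._last_seen[(service, node)] = now

    def live_nodes(self, service, now):
        """Nodes whose instance of `service` reported within the TTL."""
        return sorted(
            node
            for (name, node), seen in self._last_seen.items()
            if name == service and now - seen <= self.ttl_seconds
        )


registry = ServiceRegistry(ttl_seconds=30)
registry.heartbeat("users", "node-a", now=0)
registry.heartbeat("users", "node-b", now=20)
registry.heartbeat("billing", "node-a", now=20)
```

Asking for the live `users` nodes at time 40 would return only `node-b`, since `node-a` has been silent for longer than the TTL. That is the signal to restart or move the instance.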

Time Spent Coordinating

Splitting a monolith into individual pieces also relinquishes control of the codebase. While this gains a lot of the benefits described previously, it also requires more coordination between the owners of the individual services. Defining interface contracts becomes more important and takes more time, often turning into a point of contention.

Maintaining an internal “service compatibility matrix” is also required, one that tracks which version of the user service works with which version of billing, and with which version of the core business logic. Things work better if you define the interfaces such that the services can discover this, but you still need manual intervention when performing upgrades that can no longer function with the rest of the ecosystem.
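
A compatibility matrix can start out as plain data plus one lookup function. The versions and service names below are invented for illustration; the point is that an upgrade can be checked against what is currently deployed before it goes out.

```python
# Hypothetical matrix: (service, version) -> peer versions it is known to work with.
COMPATIBILITY = {
    ("users", "2.1"): {"billing": {"1.4", "1.5"}, "core": {"3.0"}},
    ("users", "2.2"): {"billing": {"1.5"}, "core": {"3.0", "3.1"}},
}


def incompatible_peers(service, version, deployed):
    """Return the deployed peers that `service` at `version` is not known
    to work with. `deployed` maps service name -> running version."""
    supported = COMPATIBILITY.get((service, version), {})
    return sorted(
        peer
        for peer, peer_version in deployed.items()
        if peer != service and peer_version not in supported.get(peer, set())
    )


deployed = {"users": "2.1", "billing": "1.4", "core": "3.0"}
blockers = incompatible_peers("users", "2.2", deployed)
```

In this example, upgrading `users` to 2.2 is blocked by `billing` 1.4, so billing has to be upgraded first. An unknown version is treated as incompatible with everything, which is the cautious default.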

Adding interdependent features also becomes more complicated due to the extra coordination needed to release them. This is yet another point of contention, especially if development is split into different teams with different priorities.

Varied Coding Practices

Monoliths provide an easier way of enforcing coding standards and practices across the entirety of the application. Changing this architecture trades readability and consistency for flexibility. In return, each piece can use the practices that best suit its ultimate goal of interacting with the rest of the ecosystem.

In other words, you can write the compute-intensive statistical analysis code in C, the highly concurrent message queue in Erlang, and the business logic in Python. Depending on your organization, this may be worth it, but it’s important to keep in mind the main drawback: it’s highly likely that the person who wrote the business logic will have no clue how to fix a bug in the message queue.

Latency

A microservice architecture can increase end-user latency when compared to the monolith. Working with anything that imposes response time metrics becomes a problem because typical implementations use a layered approach. And traversing each layer now incurs a latency cost as a result of crossing network boundaries. It’s no longer just about working between threads or processes. For example, if you split a user-management service out of the main business logic, a request to list all possible users becomes something like:

1. The customer requests a user list from the application.
2. The application load balancer receives this request and uses an algorithm to determine which core application instance to send it to.
3. One of the core instances receives the request and routes it internally to a user list endpoint.
4. The endpoint function no longer knows how to get the list of everything, so it sends the request to the user management service.
5. The load balancer for the user management service receives the request and determines which instance to send it to.
6. The user management code receives the request, processes it and returns a response.
7. The core instance can now process the response from user management, add anything it may need to and return its own response.
8. Finally, the customer receives the response.

In a regular monolith, you could probably perform items 3-7 in microseconds + query time. In a microservices system, you’re talking milliseconds.

Optimizing the scale of each service is also a consideration. If you want to handle 1K requests per second in your application, and all of your endpoints require user information from the user management service, then this service must also handle 1K req/sec. It seems obvious now, but when the codebase and the teams designing it grow large enough, it’s easy to miss trivial things like this and run into situations like the perfect machine learning code choking while waiting on user validation.
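
The arithmetic is simple enough to keep in a small helper, which makes it harder to forget as endpoints are added. The endpoint names and rates below are hypothetical; the only input you really need is how many calls each endpoint makes to the shared service per incoming request.

```python
def downstream_load(endpoint_rates, calls_per_request):
    """Requests per second a shared service must absorb.

    `endpoint_rates` maps endpoint -> incoming requests per second.
    `calls_per_request` maps endpoint -> calls it makes to the shared
    service for each incoming request (0 if the endpoint is not listed).
    """
    return sum(
        rate * calls_per_request.get(endpoint, 0)
        for endpoint, rate in endpoint_rates.items()
    )


# Hypothetical traffic split adding up to 1K requests per second.
rates = {"/orders": 600, "/reports": 400}
one_call_each = downstream_load(rates, {"/orders": 1, "/reports": 1})
chatty_reports = downstream_load(rates, {"/orders": 1, "/reports": 3})
```

With one user lookup per request the user management service sees the same 1K req/sec as the application. If a single endpoint starts making three lookups per request, the shared service needs to handle more traffic than the application itself.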

Debugging

Debugging also gets far more complicated:

You’ll need some sort of central logging system. This is now yet another service to manage and maintain that could be a single point of failure.

Take special care not to divulge sensitive information in your logs because now they’re going over the network, not to stdout or some local file.

Timestamps are now blurry. You can’t trust them across services, even if they’re on the same physical hardware but different virtual machines (this is a CPU architecture issue).

If you’re doing something very time sensitive, you’ll need to manage it through the logging system or some other centralized mechanism.

It’s also hard to keep track of the request that initiated a set of actions inside the larger system unless you pass that information back and forth (and list it in your logs).
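
Two of the points above, keeping secrets out of the logs and tracking the originating request, can be handled in one place with Python’s standard `logging` and `contextvars` modules. This is a sketch under assumptions: the `X-Request-ID` header is a common convention rather than something this article’s system is known to use, and the secret pattern is deliberately simplistic.

```python
import contextvars
import logging
import re
import uuid

# Holds the ID of the request currently being handled.
request_id_var = contextvars.ContextVar("request_id", default="-")

# Illustrative pattern only; real redaction rules need more care.
SECRET_PATTERN = re.compile(r"(password|token)=\S+")


class RequestContextFilter(logging.Filter):
    """Stamp every record with the request ID and mask obvious secrets."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        record.msg = SECRET_PATTERN.sub(r"\1=***", record.getMessage())
        record.args = None  # the message is already fully formatted
        return True


def start_request(incoming_headers):
    """Reuse the caller's request ID when present, otherwise mint one."""
    request_id = incoming_headers.get("X-Request-ID") or uuid.uuid4().hex
    request_id_var.set(request_id)
    return request_id


def outgoing_headers():
    """Headers to attach when this service calls another one."""
    return {"X-Request-ID": request_id_var.get()}
```

Attach the filter to whichever handler ships records to the central logging system and include `%(request_id)s` in its format string. Every service that calls `start_request` on the way in and sends `outgoing_headers()` on the way out will then log the same ID for the whole chain of calls.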

Summarizing

Problems, bugs and mistakes are a reality of life. Choosing between these architectures is about exchanging one set of problems for another. In this case, you give up core software issues for infrastructure and operational issues.

Maintaining multiple code bases is complicated, and managing many services, along with how many of them are online, also gets complicated quickly. Availability through scale is easier than built-in resiliency, but finding and debugging problems across multiple instances of services is considerably more complicated.

Patching a problem on multiple instances of the same service while you wait on a real fix to finish testing is tricky and has little room for error. You need the right teams to manage that in order for it to be worth the effort.

If you don’t need the scale, then who cares? If you need the availability, consider a monolith with a scale of 2. Don’t pick microservices because it’s “cool”. Pick it because it’s the right solution for the problem you’re trying to solve and for your organizational structure.