Amir Souchami is Director of Architecture and Security at ironSource. With a great passion for technology, he's constantly learning the latest to stay sharp and create highly scalable and modular solutions with a positive business ROI. Amir loves working with teams and individuals to envision and achieve their goals. Follow up with him to chat about: Empathy, Yoga, Hiking, Startups, AdTech, Machine Learning, Stream Processing, Continuous Delivery & MicroServices.

How breaking apps into fine-grained microservices introduces complications that might end up in a mega-disaster, and what you can do to avoid it.

At ironSource, we work with Service-Oriented Architecture (SOA), which has been around for decades, and have embraced its latest iteration: microservices. Using a microservices approach to application development enables us to improve resilience and expedite our time to market. It’s easier to develop, test, deploy, and most importantly, change and maintain our entire application stack when it’s broken up into little pieces.

That said, breaking an app into smaller units does not mean that everything works perfectly right away. In the last year, we observed several service availability issues, and after a lot of investigation, we realized that to avoid these issues, it’s important to implement microservices in a particular way. During our research, we found and mitigated a few bad practices in order to prevent potential doomsday scenarios. In this article I will delve into one of the techniques that has helped our R&D teams enjoy the full benefit of microservices, allowing them to sleep well at night.

Are My Services Coupled to Each Other?

Traditional “request-driven” architectures (e.g. REST) are the simplest and most common pattern of service communication: Service A asks Service B for some information and waits, then Service B responds and sends the information back to Service A.

Working with HTTP APIs is one of those basic things developers learn and commonly utilize. It’s clear when a request was received and acknowledged by the corresponding service, and there are plenty of tools to debug HTTP APIs. As such, it was our default go-to method to communicate between services across the system. The simplicity of “request-driven” communication served us very well in moving fast, delivering new features, and growing our systems to accommodate all needs.

Unfortunately, this pattern entails a tight coupling of services. While in small systems it works perfectly fine, for an application built out of dozens of services, this sort of coupling hinders development agility and blocks rapid scaling.

The main risk to pay attention to when using this pattern is that each “core” service (e.g. service C in the above illustration) becomes a single point of failure. It can create a performance bottleneck or, worse, cause downtime for the services that depend on it, so the whole dependency chain gets disrupted (i.e. a mega-disaster). It might sound easy to work around, but every service added to the chain requires a service discovery mechanism (or even a service mesh in big systems), failover and retries, circuit breakers, timeouts, and caching mechanisms, which makes it a great challenge to get everything working flawlessly end-to-end.
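To give a sense of the per-hop machinery listed above, here is a minimal circuit-breaker sketch in Python. The names `CircuitBreaker` and `CircuitOpenError`, the thresholds, and the injectable `clock` are hypothetical choices made for illustration; this is not our production code, and real deployments usually rely on a library or a service mesh for this.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the downstream service."""


class CircuitBreaker:
    """Hypothetical breaker: opens after `max_failures` consecutive errors,
    then rejects calls until `reset_after` seconds have passed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("downstream call skipped")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0           # any success closes the circuit fully
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_from_service_c, request)`, where `fetch_from_service_c` is whatever function performs the HTTP call. Every synchronous hop in the chain needs its own instance and its own tuning, which is part of why the end-to-end result is hard to get right.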

In practice, using synchronous communication, like REST, across the entire system makes it behave like a monolith, or more precisely a distributed monolith, which prevents you from enjoying the full benefits of microservices.

In order to untangle that mess, at ironSource we’re moving many core services to communicate using an asynchronous event-driven architecture. Our approach is to let our core service publish the information it provides whenever that information is updated, rather than waiting for another service to request that information. By using “push” instead of “pull”, our system handles data in real time. Now we can dispose of many of the complex cache management and purging mechanisms, service discovery, and retry techniques that we used in order to maintain the reliability and performance of the system while using synchronous service communication.

Moreover, services can now asynchronously publish events to a resilient message broker (Kafka in our case). They trust the broker to route the message to the right service, and the receiving parties subscribe to the key events that interest them. Adding subscribers is easy, and new subscribers put no further load on the publisher service.
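The publish/subscribe relationship described above can be sketched with a tiny in-memory stand-in for the broker. `InMemoryBus`, the `campaign.updated` topic, and the event fields are all hypothetical names for illustration. Unlike a real broker such as Kafka, this toy delivers events synchronously inside `publish` and keeps nothing if a subscriber is absent; it only shows that the publisher does not know who is listening.

```python
from collections import defaultdict


class InMemoryBus:
    """Hypothetical stand-in for a message broker (illustration only)."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of handlers

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # The publisher never learns who, if anyone, is listening.
        for handler in self._subscribers[topic]:
            handler(event)


bus = InMemoryBus()
billing_seen, reporting_seen = [], []

# Two independent services subscribe to the same key event.
bus.subscribe("campaign.updated", billing_seen.append)
bus.subscribe("campaign.updated", reporting_seen.append)

# The core service pushes the change once, as soon as it happens.
bus.publish("campaign.updated", {"campaign_id": 42, "budget": 1000})
```

Adding a third subscriber is one more `subscribe` call and requires no change to the publishing service.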

Why Kafka Then?

We’ve chosen Kafka because it is very resilient and reliable, and it has a great community and documentation resources. Unlike many traditional message brokers, Kafka enables us to replicate the data in a highly available manner and control the data retention policy, so events can be kept even after they are consumed by multiple consumers. Events expire automatically according to that retention policy rather than when they are read. Therefore, if you’re using event sourcing and want to reproduce a state out of the events log, Kafka can act as a persistent storage source from which to reconstruct the current and past states. Moreover, Kafka brings native support for advanced scenarios like streaming and real-time aggregations (with KSQL) and connects natively to many components in our infrastructure (e.g. Spark Streaming, Cassandra, Elasticsearch, S3, etc.), so it makes our lives slightly easier.
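The retention behavior described above, where reading an event does not remove it and only age does, can be illustrated with a toy log. `RetainedLog` and its methods are hypothetical and are not Kafka’s API; in Kafka itself retention is configured per topic (for example with `retention.ms`) and applied to whole log segments rather than to individual records.

```python
class RetainedLog:
    """Hypothetical append-only log with time-based retention (not Kafka's API)."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self._records = []       # list of (offset, timestamp, event)
        self._next_offset = 0

    def append(self, event, now):
        self._records.append((self._next_offset, now, event))
        self._next_offset += 1

    def expire(self, now):
        # Drop records by age only; whether anyone has read them is irrelevant.
        cutoff = now - self.retention_seconds
        self._records = [r for r in self._records if r[1] >= cutoff]

    def read_from(self, offset):
        return [(off, event) for off, _, event in self._records if off >= offset]


log = RetainedLog(retention_seconds=3600)
log.append("price=10", now=0)
log.append("price=12", now=1800)

# Each consumer tracks its own offset, so reading does not delete anything.
first_read = log.read_from(0)
second_read = log.read_from(0)

log.append("price=15", now=4000)
log.expire(now=4000)             # the record written at t=0 is older than one hour
after_expiry = log.read_from(0)
```

Because offsets survive expiry, a consumer that stored offset 1 can still resume from exactly where it stopped.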

In asynchronous communication, a service may still rely on another service, meaning the API and the dependency between the parties still exist. But if one service fails or is overloaded and slow to respond, it will not affect the other services, since they’re now loosely coupled and contain everything they need to respond.

So the benefit of a common event bus is that it eliminates the single point of failure and performance bottlenecks we had when our core services communicated synchronously. For example, the queue keeps the messages sent to Service B until it’s back up and able to consume them.

This leads us to the “one-hop” rule:

“By default, a service should not call other services to respond to a request, except in exceptional circumstances.”

A service should be self-contained and manage its own data. Allowing a service to call other services adds overhead to the request and can result in a very slow or unresponsive service. If you see that you need multiple calls, back and forth between several services, I encourage you to explore whether using the async event-driven pattern, or even merging these services into one (micro-monolith), can provide you with a healthier service.
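One way to picture the rule is a service that answers requests purely from a local copy of the data it needs, kept fresh by events. The sketch below is an assumption-laden illustration: `PricingView`, `CheckoutService`, and the event fields are hypothetical names, and the wiring from the event bus to `on_price_changed` is left out.

```python
class PricingView:
    """Hypothetical local read model kept up to date by events."""

    def __init__(self):
        self._prices = {}  # product_id -> latest known price

    def on_price_changed(self, event):
        # Called by the event consumer; no request is sent to the pricing service.
        self._prices[event["product_id"]] = event["price"]

    def price_of(self, product_id):
        return self._prices.get(product_id)


class CheckoutService:
    """Answers requests from its own data only (zero outbound hops)."""

    def __init__(self, pricing_view):
        self.pricing_view = pricing_view

    def quote(self, product_id, quantity):
        price = self.pricing_view.price_of(product_id)
        if price is None:
            return {"status": "unknown_product"}
        return {"status": "ok", "total": price * quantity}


view = PricingView()
checkout = CheckoutService(view)
view.on_price_changed({"product_id": "sku-1", "price": 5})
view.on_price_changed({"product_id": "sku-1", "price": 7})  # later update wins
quote = checkout.quote("sku-1", 3)
```

The trade-off is that the local copy is only eventually consistent: a quote reflects the latest price event the service has consumed, which may lag slightly behind the source.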

Of course, every rule has exceptions. In some cases, you need to make a conscious decision to communicate synchronously between services in order to respond to a request.

For example, a classic case for synchronous communication is a centralized authentication service that receives a synchronous server call from multiple user-facing APIs in order to validate and authenticate a user token. Separating the authentication service unlocks agility in terms of development and deployment, creates high cohesion for any service that needs authentication, and lets teams work separately and productively. Moreover, the separation enables different auto-scaling patterns and resource allocations for the authentication service and the other services that use it. And most importantly, it enables you to synchronously block any user request that fails authentication.

Wait, What if the Event Bus Becomes a Single Point of Failure?

While eliminating points of failure within the system by using an event bus, you might be concerned that the event bus is itself a single point of failure. Well, keep doing what you always do for robustness and scaling: distribute it, deploy multiple instances of the event bus to achieve high availability, and figure out whether you need to take care of retries. With Kafka, you get many options and configurations to tune it for high availability. If you have multiple different use cases and big loads, my advice is to consider spinning up multiple clusters, and even deploying a cluster across multiple data centers (using an active-active or active-passive topology). With that in mind, I recommend checking what happens if you lose the event bus completely. Although the outcome seems obvious, you are testing for more than the failure; you are also testing the recovery. Understanding how your system acts upon failure is essential for ensuring resiliency.
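A test for losing the bus and recovering from it could look roughly like the sketch below. It assumes one possible mitigation, a publisher that buffers events locally while the bus is unreachable; `FlakyBus`, `BufferingPublisher`, and `BusUnavailable` are hypothetical names, and nothing here is a Kafka client feature.

```python
from collections import deque


class BusUnavailable(Exception):
    """Raised by the test double while the bus is 'down'."""


class FlakyBus:
    """Hypothetical bus double whose availability the test can toggle."""

    def __init__(self):
        self.up = True
        self.delivered = []

    def publish(self, event):
        if not self.up:
            raise BusUnavailable()
        self.delivered.append(event)


class BufferingPublisher:
    """Keeps events in a local buffer while the bus is down, preserving order."""

    def __init__(self, bus):
        self.bus = bus
        self.pending = deque()

    def publish(self, event):
        self.pending.append(event)
        self.flush()

    def flush(self):
        while self.pending:
            try:
                self.bus.publish(self.pending[0])
            except BusUnavailable:
                return False     # keep the event; retry on the next flush
            self.pending.popleft()
        return True


bus = FlakyBus()
publisher = BufferingPublisher(bus)

publisher.publish("e1")
bus.up = False                   # simulate losing the event bus completely
publisher.publish("e2")
publisher.publish("e3")
backlog_during_outage = list(publisher.pending)

bus.up = True                    # simulate recovery
recovered = publisher.flush()
```

The in-memory buffer in this sketch is lost if the publishing process itself crashes, so a test like this also surfaces the question of whether pending events need to be persisted.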

Putting It All Together

Our services are now emitting events, which results in a log of facts that is:

Reproducible: a state of the system, at a given point in time, can be reproduced by replaying the log of events.

Redundant: the log is partitioned and replicated for high availability using Kafka.

Decoupled from any specific datastore: events are usually serialized using formats such as JSON, Avro, etc.

Immutable: once an event is emitted it cannot be changed.

The log enables us to reconstruct the current and past states by processing events (i.e. event sourcing). The single source of truth becomes the data repository where the events are stored. Each service that consumes data from the log becomes self-contained and is no longer coupled to other microservices, whether they’re up, down, or just slow. Unchaining services from synchronous dependencies improved our fault isolation, and our system can remain almost unaffected by the failure of a single module. The overall reliability and performance increased, and that’s a huge win for our business.
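Reconstructing state from the log amounts to folding over the events up to the point in time you care about. The following is a minimal sketch under stated assumptions: the `Event` fields, the account-balance domain, and the `replay` function are hypothetical, JSON stands in for whatever serialization format is used, and the log is assumed to be ordered by timestamp.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: an emitted event cannot be modified
class Event:
    timestamp: int
    account: str
    delta: int

    def to_json(self):
        return json.dumps({"timestamp": self.timestamp,
                           "account": self.account,
                           "delta": self.delta})

    @staticmethod
    def from_json(raw):
        return Event(**json.loads(raw))


def replay(serialized_log, until=None):
    """Rebuild account balances by folding events up to `until` (inclusive)."""
    balances = {}
    for raw in serialized_log:
        event = Event.from_json(raw)
        if until is not None and event.timestamp > until:
            break  # assumes the log is ordered by timestamp
        balances[event.account] = balances.get(event.account, 0) + event.delta
    return balances


log = [Event(1, "acme", 100).to_json(),
       Event(2, "acme", -30).to_json(),
       Event(3, "globex", 50).to_json()]

current_state = replay(log)           # every event applied
state_at_t2 = replay(log, until=2)    # the system as it was at timestamp 2
```

Replaying the same log always yields the same state, which is what makes the log, rather than any one service’s database, the source of truth.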

What’s Next?

We’ve broken our app into fine-grained microservices, and we’ve placed a resilient event bus in the middle of everything to coordinate the communication and enable each service to be self-contained. We even organized our teams into “squads”, teams small enough to be fed by two pizzas. In the next posts in this series we’ll deal with further questions that might raise red flags:

Does a change to one microservice require changes to other microservices?

Will a microservice deployment require other microservices to be deployed at the same time?

Is there a team of developers who work across a large number of microservices?

Do microservices share a lot of the same code or models?
