As the charts show, as the RPS on Service A increases, the time the client takes to obtain a response from Service B increases too. Yet Service B's Prometheus stats show little difference, and its response time stays fairly consistent. What explains this behavior?

THE TCP GAMEPLAY

A lot happens under the hood when two services interact over TCP.

Several things come into play: the number of sockets involved (socket layer), the network I/O speed (protocol/interface layer), and the number of cores on the machine, apart from the application code itself. Bottlenecks can occur at any of these layers. Sockets may run out because of a system limit, or the server may not have enough threads available to serve requests, driving up the load average and delaying responses. This is why Service B's metrics record only the response time at the application layer, whereas Service A's metrics for Service B account for the whole TCP cycle, including connection time, TCP wait time, and so on.
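The gap between the two measurements can be made visible from the client side. The sketch below (my own illustration, not from the original services; the local test server stands in for Service B) uses Go's net/http/httptrace to split a request into connection-setup time and time-to-first-byte. Under load, the connect/wait portion grows even while server-side handler time stays flat:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/httptrace"
	"time"
)

// timeRequest performs a GET and reports how long connection setup took
// versus how long until the first response byte arrived. The difference
// is time spent outside the server's application layer.
func timeRequest(url string) (connect, firstByte time.Duration, err error) {
	var start, connStart time.Time
	trace := &httptrace.ClientTrace{
		ConnectStart: func(network, addr string) { connStart = time.Now() },
		ConnectDone: func(network, addr string, err error) {
			connect = time.Since(connStart)
		},
		GotFirstResponseByte: func() { firstByte = time.Since(start) },
	}
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return 0, 0, err
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
	start = time.Now()
	resp, err := http.DefaultTransport.RoundTrip(req)
	if err != nil {
		return 0, 0, err
	}
	resp.Body.Close()
	return connect, firstByte, nil
}

// startTestServer spins up a local stand-in for Service B and returns its URL.
func startTestServer() string {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	}))
	return srv.URL
}

func main() {
	connect, firstByte, err := timeRequest(startTestServer())
	if err != nil {
		panic(err)
	}
	fmt.Printf("connect=%v time-to-first-byte=%v\n", connect, firstByte)
}
```

Service B's Prometheus histogram would only ever see the handler's time; the client's `firstByte` includes the dial, any queueing, and the TCP handshake on top of it.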

How do we solve this problem in general? The obvious approaches are to add more servers or upgrade the hardware. But is that always necessary? Does the data served by the other service really need to be fetched in real time on every single request? These are questions every application developer must consider. The configurations or values we fetch from the other service may not update frequently (let alone every second). That is where things can be optimized.

KEEP EVERYTHING IN MEMORY

High-frequency trading systems have always intrigued me: they are constrained by the same problems, yet they manage to overcome them and respond in real time. Here are a few principles to note when building an HFT platform:

- Optimize code for performance
- Keep everything in memory
- Keep hardware underutilized
- Keep reads sequential, and exploit cache locality
- Do as much work asynchronously as possible

Keeping everything in memory is a viable option for us, since in most of our cases RAM remained underutilized compared to the cores. Remember, Service A requests data from Service B every single time, but Service B's data doesn't change frequently. When updates/writes are infrequent, one can cache the other service's response locally for a few minutes and avoid requesting the same data again and again. Just as a processor's L2/L3 caches avoid hitting main memory or disk on every access, the same idea applied at the service level lets you handle more traffic with better response times.

ZIZOU, an alternate answer

With this in mind, I wrote a simple library called Zizou, which serves as an in-memory (sharded) cache: it stores keys locally in memory, with an eviction policy enabled. It acts as another cache layer local to the system, similar to what one would store in Redis to avoid lookups to other services for the same stale data. And since the data is served from memory, it helps your service respond quickly and frees up resources for other important tasks.

Using it is pretty simple: Get/Set functions keep a map of the key-value pairs in memory, each with a specified expiry.

A background goroutine runs periodically and evicts expired keys from memory to free up the heap.

While this may seem like a trivial caching mechanism that many have implemented before, this simple hack has saved me a lot of money in infrastructure: it avoids unnecessary scaling of services as traffic increases, reduces traffic to other microservices in our cluster, and keeps Kubernetes from scaling up services that would otherwise scale on the custom RPS resource metric.

Here’s the code.

Live, laugh, and always Cache.

Resources: