At my current client, we have been dealing with a recurring problem while scaling their cloud systems to serve an ever-increasing customer base.

As with any software that needs to scale, we’ve been seeing and solving scalability problems along the way. Anyone who has ever scaled a system to accommodate a lot of concurrent users knows that issues show up in places you aren’t expecting.

While scaling the system, we’ve made it more resilient and fault tolerant, and we’ve learned to mitigate many of the problems as they arise. We’ve improved logging and metrics, so we know what is happening and can see stability problems before they affect the end users.

Redis loses connection, and will not reconnect

In our system, we use StackExchange.Redis (v1.2.6) to communicate with our Redis server. The Redis instance handles all cached data and all communication between services. It is a vital part of our architecture, which currently depends heavily on this connection. The odd thing is that we keep losing the connection to it, even though Redis itself is not even breaking a sweat.

StackExchange.Redis is supposed to be able to recover from outages, and in most cases it does. But every once in a blue moon, it doesn’t. It thinks it has recovered and that it is connected, but in reality all subsequent calls fail and the pub/sub connection is completely dead until the application is restarted.

We have multiple services, and only one of them goes down at a time; every other application continues to run just fine.
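Because the multiplexer can report a healthy connection while calls fail, a health check has to make a real round trip instead of trusting the reported state. The following is a minimal sketch of that idea, not the code running in our services: the class and method names are invented for the example, and logging goes to the console only to keep it short.

```csharp
using System;
using StackExchange.Redis;

// Hypothetical helper, for illustration only.
public static class RedisProbe
{
    // Returns true only if a real PING round trip succeeds.
    public static bool CanRoundTrip(ConnectionMultiplexer mux)
    {
        try
        {
            TimeSpan interactive = mux.GetDatabase().Ping();     // command connection
            TimeSpan subscription = mux.GetSubscriber().Ping();  // pub/sub side
            Console.WriteLine(
                $"redis ping ok: {interactive.TotalMilliseconds} ms / {subscription.TotalMilliseconds} ms");
            return true;
        }
        catch (RedisException ex)    // includes RedisConnectionException
        {
            Console.WriteLine($"redis ping failed: {ex.Message}");
            return false;
        }
        catch (TimeoutException ex)  // the command timed out
        {
            Console.WriteLine($"redis ping timed out: {ex.Message}");
            return false;
        }
    }

    // Logs the library's own view of the connection, for comparison with the probe.
    public static void LogConnectionEvents(ConnectionMultiplexer mux)
    {
        mux.ConnectionFailed += (sender, e) =>
            Console.WriteLine($"connection failed: {e.EndPoint} {e.ConnectionType} {e.FailureType}");
        mux.ConnectionRestored += (sender, e) =>
            Console.WriteLine($"connection restored: {e.EndPoint} {e.ConnectionType}");
    }
}
```

A probe like this, run on a timer, is one way to notice the state described above: the events say the connection was restored, yet the round trip keeps failing.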

At first we thought it was a bug in our own code, but as time went on we fixed every issue around the library that we could find or even think of, and the problem remained.

Examples:

Avoiding IO thread starvation

Minimizing load on Redis: removing unnecessary/redundant calls

Reducing bytes transferred (Protocol buffers)

Hard reconnect: reconnects managed by our software
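For readers unfamiliar with the last item: a hard reconnect means throwing away the multiplexer and building a new one, instead of waiting for the library to recover by itself. Below is a minimal sketch of that pattern, assuming a single shared multiplexer. It is illustrative rather than our production code, and the class name `ReconnectingRedis` and the throttle interval are made up for the example.

```csharp
using System;
using StackExchange.Redis;

// Hypothetical wrapper, for illustration only.
public sealed class ReconnectingRedis : IDisposable
{
    private readonly string _configuration;
    private readonly TimeSpan _minInterval;
    private readonly object _gate = new object();
    private ConnectionMultiplexer _mux;
    private DateTime _lastReconnectUtc = DateTime.MinValue;

    public ReconnectingRedis(string configuration, TimeSpan minInterval)
    {
        _configuration = configuration;
        _minInterval = minInterval;
        _mux = ConnectionMultiplexer.Connect(configuration);
    }

    // Callers fetch the multiplexer on every use instead of caching it,
    // so they pick up the replacement after a reconnect.
    public ConnectionMultiplexer Current
    {
        get { lock (_gate) { return _mux; } }
    }

    // Call when commands keep failing although the multiplexer claims to be connected.
    // Returns false if the call was throttled.
    public bool ForceReconnect()
    {
        lock (_gate)
        {
            DateTime now = DateTime.UtcNow;
            if (now - _lastReconnectUtc < _minInterval)
            {
                return false; // avoid a reconnect storm
            }
            _lastReconnectUtc = now;

            ConnectionMultiplexer old = _mux;
            _mux = ConnectionMultiplexer.Connect(_configuration); // throws if the server is unreachable
            try { old.Dispose(); } catch (Exception) { /* the old instance is already broken */ }
            return true;
        }
    }

    public void Dispose()
    {
        lock (_gate) { _mux.Dispose(); }
    }
}
```

Note that pub/sub subscriptions belong to the multiplexer they were created on, so they have to be registered again on the new instance after a swap.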



After seeing that not even a hard reconnect would work, and that the multiplexer could not get a new connection without restarting the entire application, it became clear that StackExchange.Redis does have a problem reconnecting in some rare cases.

Since our issues started, similar reports have been showing up on the library’s GitHub page.

Examples of these issues:

The solution

In issue #871, Marc Gravell (the creator of the library) addresses this by saying that they have not been able to find the root cause of these issues and are therefore essentially replacing the entire network stack inside the library. This will cause breaking changes for some users, but should fix the issues that many of us have been having.

As he explains in the issue, the following has been addressed in 2.0:

The deliverables of this work are, in order:

network connection stability

especially on TLS (cloud), non-Windows systems and .NET Core

better protection from thread-pool starvation

lower allocations generally

better handling of “backlog” network buffers

cleaner code

We plan is to release this as the next “major”, 2.0.

Quote from StackExchange.Redis issue #871

The biggest change is the adoption of Microsoft’s System.IO.Pipelines, which helps them handle the complexity of the network layer and integrate better with the .NET Core stack.

System.IO.Pipelines was built to run the ASP.NET Core Kestrel web server, so it is well tested in production.

Most network code is boilerplate that everyone has to write, and everyone therefore also needs to consider the many corner cases this code has to cover. That often leads to very complex code that is error-prone and hard to maintain.

We know this first hand, because one of our internal services manages TCP sockets, and that is one of the areas where we’ve spent a lot of time and care to be able to scale it.
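To give a feel for the kind of boilerplate System.IO.Pipelines takes over, here is a small sketch that reads newline-delimited messages from a pipe. It is a generic example of the Pipelines API, not code from StackExchange.Redis: the in-memory `Pipe` stands in for a socket, and the message contents are placeholders. Depending on the target framework, it may need the System.IO.Pipelines NuGet package.

```csharp
using System;
using System.Buffers;
using System.IO.Pipelines;
using System.Text;
using System.Threading.Tasks;

public static class PipelinesSketch
{
    public static async Task Main()
    {
        var pipe = new Pipe();

        // Producer: in real code these bytes would arrive from a socket.
        // They are written in two chunks that split a message in the middle;
        // the reader below does not care where chunk boundaries fall.
        await pipe.Writer.WriteAsync(Encoding.ASCII.GetBytes("PING\r\nPO"));
        await pipe.Writer.WriteAsync(Encoding.ASCII.GetBytes("NG\r\n"));
        pipe.Writer.Complete();

        await ReadLinesAsync(pipe.Reader);
    }

    private static async Task ReadLinesAsync(PipeReader reader)
    {
        while (true)
        {
            ReadResult result = await reader.ReadAsync();
            ReadOnlySequence<byte> buffer = result.Buffer;

            // Consume every complete line that is currently buffered.
            while (TryReadLine(ref buffer, out ReadOnlySequence<byte> line))
            {
                Console.WriteLine(Encoding.ASCII.GetString(line.ToArray()).TrimEnd('\r'));
            }

            // Consumed: everything before buffer.Start. Examined: everything.
            // A trailing partial line stays in the pipe for the next read.
            reader.AdvanceTo(buffer.Start, buffer.End);

            if (result.IsCompleted)
            {
                break;
            }
        }
        reader.Complete();
    }

    private static bool TryReadLine(ref ReadOnlySequence<byte> buffer, out ReadOnlySequence<byte> line)
    {
        SequencePosition? newline = buffer.PositionOf((byte)'\n');
        if (newline == null)
        {
            line = default;
            return false;
        }

        line = buffer.Slice(0, newline.Value);                          // up to, not including, '\n'
        buffer = buffer.Slice(buffer.GetPosition(1, newline.Value));    // skip past '\n'
        return true;
    }
}
```

Buffer management, partial messages and back-pressure are handled by the pipe, which is exactly the part of hand-written socket code where the corner cases tend to hide.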

So we upgraded to 2.0

We were expecting to spend a day or two changing our code to accommodate the breaking changes they had warned us about. But luckily there were none, for us anyway. The entire library was plug’n’play, and we could ship to our testing servers a few minutes after upgrading.

Our use of Redis is simple though, so that might be why we haven’t seen any breaking changes. We mainly use:

GET/SET

Sorted Sets

Pub/sub

Transactions
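To show how small that API surface is, here is a rough sketch that touches each of the four features. The connection string, key names and channel name are placeholders rather than anything from our system, and the calls are limited to ones that, to our knowledge, have the same shape in the 1.x and 2.0 releases.

```csharp
using System;
using StackExchange.Redis;

public static class RedisUsageSketch
{
    public static void Main()
    {
        // Placeholder connection string; point it at your own server.
        using (ConnectionMultiplexer mux = ConnectionMultiplexer.Connect("localhost:6379"))
        {
            IDatabase db = mux.GetDatabase();

            // GET/SET, with an expiry on the key
            db.StringSet("greeting", "hello", TimeSpan.FromMinutes(5));
            string greeting = db.StringGet("greeting");

            // Sorted sets: add members with scores, read the top ten
            db.SortedSetAdd("scores", "alice", 42);
            db.SortedSetAdd("scores", "bob", 17);
            RedisValue[] top = db.SortedSetRangeByRank("scores", 0, 9, Order.Descending);

            // Pub/sub: subscribe with a handler, then publish
            ISubscriber sub = mux.GetSubscriber();
            sub.Subscribe("events", (channel, message) => Console.WriteLine($"{channel}: {message}"));
            sub.Publish("events", "something happened");

            // Transactions: commands are queued and only sent on Execute,
            // and only applied if the condition holds at that point.
            ITransaction tran = db.CreateTransaction();
            tran.AddCondition(Condition.KeyNotExists("job:1:owner"));
            tran.StringSetAsync("job:1:owner", "worker-a");
            bool committed = tran.Execute();

            Console.WriteLine($"{greeting}, top entries: {top.Length}, committed: {committed}");
        }
    }
}
```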

So far, we’ve been running the updated library in production for 30 days and have had no incidents where the connection was lost and could not be recovered.

Conclusion

As developers, we are quick to judge ourselves, and even our predecessors, when odd bugs appear in production. We certainly did, and we tore our application to pieces many times to figure out why we were seeing these problems in production (on multiple platforms: Azure, AWS, etc.).

In our case, it took us a long time to realize that the problem might be the tool (StackExchange.Redis), although we did update the library to its latest version multiple times, hoping that this would solve it.

When 2.0 finally arrived, we were unsure whether we had time to upgrade because of the breaking changes, which turned out not to affect us at all.

Sometimes it actually is the tool that is broken. Thankfully, the team at StackExchange, and Marc Gravell especially, are competent and honest developers, so they addressed the problem and rewrote a lot of the code to fix an issue that they could not reproduce themselves.

A big thanks to Marc and the rest of the contributors for making a great tool that we can all use for free!