Let's first recall the requirements we defined at the beginning of this post in order to build an acceptable solution. I defined a set of six top-level requirements. Starting at the top, the new solution must stop duplicate requests. Using a distributed lock across the entire cluster guarantees that no two threads can take a lock on the same key. Hashing the API request parameters into a unique key (for the lock) in turn guarantees that no two identical requests can be executed by any two threads simultaneously. The lock is held for the duration of the controller action, so race conditions on the database or partner APIs are eliminated. This satisfies the first requirement.
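To make the key derivation concrete, here is a minimal sketch in Python (the actual implementation is in .NET, and the exact endpoint-plus-parameters canonicalization shown here is an assumption on my part):

```python
import hashlib
import json

def lock_key(endpoint, params):
    """Derive a deterministic lock key from an API request.

    Sorting the parameters before hashing makes the key independent of
    parameter order, so two identical requests always map to the same
    lock key. Illustrative sketch only; the production key derivation
    may differ.
    """
    canonical = json.dumps({"endpoint": endpoint, "params": params},
                           sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Any two threads that compute the same key then contend for the same distributed lock, which is what makes the duplicate check cluster-wide.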

Moving on, the second requirement states that the proposed solution shall not require any new infrastructure. The Apache Ignite cluster is embedded into the application host, so there is no additional infrastructure to manage. There is a slight overhead in monitoring the Ignite cluster's health, metrics, and logs to ensure it is working properly. However, this overhead is much lower than it would be if I had to manage a standalone Redis cluster or something similar.

Next is complexity. Reiterating my earlier point, the solution to the duplicate request problem is needed for an existing, complex enterprise application. The solution must not be complex and must not take a long time to develop. Apache Ignite is complex, but it does a great job of hiding that complexity behind a well-documented and highly optimized API. I like to talk about software complexity in terms of the first law of thermodynamics. The law states that energy is conserved: it can never be created or destroyed, only transformed into another state, which (in my opinion) is exactly paralleled in software complexity. You can hide it, you can move it around, you can transform it (into someone else's problem), but you can never eliminate it. In this case, we transfer the complexity of the solution into Apache Ignite's domain. We let Ignite's cluster handle all the necessary distributed computing, leaving us with an elegant and simple implementation.

The fourth requirement, performance, is a bit more involved than the previous ones. There are theoretical and measured aspects. Theoretically, Apache Ignite uses an efficient, low-level socket messaging layer to facilitate communication between cluster nodes. According to a GridGain article titled "GridGain Confirms: Apache Ignite Performance 2x Faster than Hazelcast", Ignite is quite fast: the atomic PUT benchmarks show a latency of 0.56 milliseconds. Even though we aren't actually using the PUT operation in our solution, the PUT operation uses locking to achieve atomicity, so 0.56 milliseconds is effectively a ceiling estimate of the latency we should expect. According to the same article, we can expect a throughput of 115,000 atomic PUT operations per second, which is orders of magnitude more than the traffic my application receives. Additionally, Apache Ignite's benchmarks are built on the Yardstick framework, which allows benchmarks to be executed against deployed clusters; I would be surprised if the GridGain article didn't use this framework to generate the reported numbers. I have not had the time or the need to benchmark my deployed cluster yet, though I envision that the next blog in this series will be centered around performance and benchmarking; stay tuned for that.

The extensibility requirement is met in several ways. First, if the need ever arises for a standalone Ignite cluster, I can spin one up (with the dreaded new infrastructure) and offload the computation and memory footprint from my application instance nodes, all while remaining completely transparent to the application. I would still need to run the JVM and an Ignite instance on each application node; however, these Ignite instances would be lightweight, serving only as communication ports into the cluster while doing no computation and hosting no data. Second, I can use the Ignite cluster for more than just distributed locks if the need arises: I can use it to cache database queries, cache API responses, or even schedule compute jobs outside API actions; the possibilities are endless.

Without getting into the specifics of the code, I did write a set of interfaces for .NET applications which hide the implementation details of working with an in-memory data grid. As far as I could determine, no such standard interfaces exist today, but having one is essential to preventing the dreaded vendor lock-in. Ignite.NET becomes an in-memory-data-grid provider, and the application can swap in a new provider at any time. This is just good coding practice, nothing revolutionary, but it solidifies the extensibility requirement. According to the benchmarks above, Ignite is so far the most performant IMDG, but as new implementations come to market, or Hazelcast's performance improves, I'll be able to switch with minimal effort.
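As a rough illustration of that provider abstraction (the real interfaces are in .NET; the names and the single-process stand-in below are hypothetical):

```python
import threading
from abc import ABC, abstractmethod

class DistributedLockProvider(ABC):
    """Vendor-neutral lock provider interface.

    Ignite, Hazelcast, or any future IMDG would each supply its own
    implementation, so the application never depends on a vendor API.
    """

    @abstractmethod
    def try_acquire(self, key: str) -> bool: ...

    @abstractmethod
    def release(self, key: str) -> None: ...

class LocalLockProvider(DistributedLockProvider):
    """Single-process stand-in used here so the sketch is runnable."""

    def __init__(self):
        self._held = set()
        self._mutex = threading.Lock()

    def try_acquire(self, key: str) -> bool:
        with self._mutex:
            if key in self._held:
                return False  # an identical request already holds the lock
            self._held.add(key)
            return True

    def release(self, key: str) -> None:
        with self._mutex:
            self._held.discard(key)
```

Swapping providers is then a matter of wiring up a different `DistributedLockProvider` implementation at startup, with no changes to the controllers.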

For all intents and purposes this solution works, and in my opinion works well. Since implementing it I have seen virtually zero duplicate requests, have not had to run any cleanup scripts, and have even been able to identify misbehaving clients. By monitoring the logs I could see when duplicate requests were being rejected by the new locking mechanism, and then ask the clients creating those requests to fix their systems. However, I have also seen some interesting consequences of such a robust duplicate check mechanism. Keep reading this ridiculously long post to find out!

Lessons learned

Many lessons were learned in the process of implementing the Apache Ignite based solution. Mainly, I had to learn all about IMDGs, brush up on my distributed computing concepts, and dive deep into Apache Ignite, JNI, and more. That's the boring stuff though; let's talk about how I took down my production application cluster for a few minutes (╯°□°）╯︵ ┻━┻).

IMDGs are very powerful, and as with anything powerful they can be dangerous if not carefully implemented. The same can be said of any distributed computing platform, but in my specific case a slight oversight resulted in an application outage. In my attempt to be clever and over-engineer this solution, I added a bit of code that would attempt to self-heal the Ignite cluster in case of unexpected network partitions. What I was trying to do was shut down the Ignite node and exit the cluster if a network partition was detected. Ignite has built-in mechanisms to detect such network events by providing event listeners. Taking advantage of this framework, I added logic to shut down the node to prevent a split-brain cluster. Shutting down the Ignite node would (in the worst case) result in falling back to the original duplicate check logic; not desired, though an acceptable degraded state. A split-brain cluster would be much worse, because the application would continue working but the cluster would lose its ability to fully prevent duplicate requests: nodes in the split clusters would behave as if they were successfully taking locks on unique keys, while the same keys could be locked in the unreachable cluster.

However, a bug in my partition handling logic was actually sending a shutdown signal to all nodes, instead of shutting down only the specific partitioned node. This resulted in a cascading effect of every Ignite node attempting to shut down at the same time. Coupled with some built-in restart logic, which was meant to restart a node after an unexpected shutdown, this left the Ignite nodes stuck in a strange state that locked down the entire cluster. Cascading further into chaos, CPU spiked on the host machines and caused request queues to build up in my IIS application pools. Retry logic triggered in my API clients queued even more requests, bringing the whole application cluster to a halt. Was this a problem with Apache Ignite?
No, this was a problem with my implementation, but it highlighted a few problems with my design. For one, it highlighted the risk of co-locating the Ignite nodes with application nodes: had Ignite been running in a standalone cluster, the application would have remained largely unaffected. Additionally, it highlighted the fact that when introducing a relatively new technology into a critical system, especially one that relies on distributed computing, care must be taken to ensure proper failure scenarios are addressed. I have since introduced circuit breakers and more sophisticated logic around network partition handling and failure fallbacks that prevent such scenarios from playing out.
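The corrected partition handling boils down to reacting only to the local node's segmentation. Here is a hedged sketch of that decision in Python (the event shape and `stop_node` callback are hypothetical stand-ins for the real Ignite event listener and node shutdown):

```python
def on_segmentation_event(event, local_node_id, stop_node):
    """React to a cluster segmentation event defensively.

    The original bug effectively broadcast the shutdown to every node;
    the fix is to stop only the node that observed itself segmented.
    `stop_node` stands in for whatever actually stops the local Ignite
    instance, which falls the node back to the legacy duplicate check.
    """
    if event["segmented_node_id"] == local_node_id:
        stop_node(local_node_id)  # degrade gracefully on this node only
        return True               # this node left the cluster
    return False                  # another node segmented; do nothing locally
```

The key property is that a segmentation event observed anywhere in the cluster never triggers a shutdown anywhere except on the affected node itself.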

There was another interesting consequence of the distributed lock based duplicate prevention mechanism. Since the lock is held for the duration of the request, interesting things start to occur if the request takes a long time to complete. Requests can take a long time for any number of reasons, e.g. timing out at the database, request queues building up due to a spike in load, or outages in any of the downstream dependencies. I'm sure I'm not the only one who has seen API requests get "stuck" in the absence of proper timeout configurations and circuit breakers; generally speaking, these types of problems occur periodically in most systems. However, when requests are locked and you have clients that retry on failure, you start to reject the retries because they are... well, duplicates, by the nature of the design. Depending on the underlying cause of the timeouts, this effect can act as a type of circuit breaker and can potentially help alleviate the problem... or make it worse. Usually, whenever a dependency outage causes timeouts, retrying clients (which do not employ circuit breakers) make the problem worse by sending large numbers of requests (doomed to failure) in a vicious feedback cycle. The duplicate check, which rejects the retries, begins to act as an implicit circuit breaker: no new retried requests are added to the request queue, and no extra load is added to the underlying system that is choking and causing the outage in the first place. I'm not going to argue whether this observed effect is desirable or not, but I will state that I haven't seen it be a problem yet. I keep mentioning circuit breakers and timeouts and request queues... those are all interesting topics in their own right and deserve a full blog post, so I'll leave them be for now.
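The implicit circuit-breaker effect can be sketched with an in-process stand-in for the distributed lock (names here are illustrative, not the production code): while the original request is still in flight, any retry with the same key fails fast instead of joining the queue.

```python
import threading

in_flight = set()
guard = threading.Lock()

def handle_request(key, work):
    """Reject a request outright if an identical one is still in flight.

    While the original request is stuck, retries are turned away
    immediately rather than queued, which is the implicit
    circuit-breaker behavior described above.
    """
    with guard:
        if key in in_flight:
            return "rejected: duplicate"
        in_flight.add(key)
    try:
        return work()  # the controller action; may be slow or stuck
    finally:
        with guard:
            in_flight.discard(key)
```

Whether this fail-fast behavior helps or hurts depends on the client: clients that back off recover gracefully, while clients that retry in a tight loop simply collect rejections instead of adding load.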

Problems with Ignite.NET

Apache Ignite is a great in-memory-data-grid framework/platform: fully featured, mature, and backed by a large, thriving community. I was working with Apache Ignite.NET, a JNI-based .NET wrapper for Apache Ignite. In several months of testing and production deployment I encountered a small number of problems worth discussing.

I encountered an issue where a new Ignite node would get stuck on startup when attempting to join a cluster by establishing a connection with a host machine where no Ignite node was running. Basically, when a node tries to join a cluster, it tries to establish a connection with at least one node in an existing cluster. To figure out which machines to try, a service discovery registry is used to get a list of IPs of the application cluster's host machines. If the application cluster is up, but no Ignite node has been started on any of the host machines, a new Ignite.NET node gets stuck attempting to connect to each of the IPs in the list. On the surface this appears to be a bug in Ignite.NET, because I tried a similar scenario with pure Ignite and didn't see such problems. A workaround involved coordinating Ignite cluster startup to ensure that a joining node has at least one Ignite node to connect to. The very first node must know it is first, skip connecting to other nodes, and establish a new cluster instead.
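The startup coordination can be sketched as a simple bootstrap decision, with `can_reach` as a hypothetical probe that reports whether an Ignite node is already listening on a host (none of these names come from the Ignite API):

```python
def bootstrap(candidate_hosts, can_reach):
    """Decide whether to join an existing cluster or form a new one.

    Probe each discovered host in order; join the first one with a live
    Ignite node. If nobody is reachable, assume this node is first and
    form a new cluster instead of blocking forever on connect attempts.
    """
    for host in candidate_hosts:
        if can_reach(host):
            return ("join", host)
    return ("create", None)
```

In practice the probe needs a short timeout per host so a dead IP in the discovery list cannot stall startup, which is exactly the hang the workaround avoids.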

As far as issues in the production cluster go, there was only one, seemingly caused by a network partition. If you remember, about 100 pages back, I mentioned how I introduced a fatal bug while writing the network partition failure handling. The main motivation there was to prevent a split-brain cluster from forming. This turned out to be a real use case, with the exception of the whole split-brain part. After running the Apache Ignite cluster in production for some time, I realized that network partitions, though rare, do happen, and 99.9% of the time Ignite handles them well: the partitioned node leaves the cluster and re-joins later once the network is fixed. A split-brain cluster has never formed; however, there was one instance when a partitioned node got into a strange state where every attempt to take a distributed lock was treated as if the key had already been locked. Strangely enough, Ignite's network partition event did not fire on the affected node. The rest of the cluster did detect that the affected node had left, and appropriate measures were automatically taken to re-balance the affinity function. However, the partitioned node itself continued running as if it were part of the larger cluster, which resulted in all API requests on that node being rejected. The strange part is that a split-brain cluster did not form: the partitioned node used the same affinity function and attempted to send lock requests to the mapped cluster nodes, but was unable to communicate with them. I tried to make sense of what went wrong and looked through all the JVM logs from Ignite, but nothing really stood out. This was also not something I could reproduce, nor have I seen it in my production cluster since.

I do have to make one disclaimer: my original solution was based on Ignite.NET 2.1.0, and there have since been two dot releases. As of this writing the latest available version is 2.3.0. According to the release notes, most of the updates are bug fixes, performance improvements, and added support for .NET Core. Unfortunately, I have not tested the newest version to see if the issues I encountered have been fixed.

Final thoughts

I'll keep this short since this article is already long enough. The Apache Ignite In-Memory-Data-Grid is a powerful, robust, highly optimized distributed computing platform. It can be used to build data processing pipelines and caching solutions, run in-memory analytics on large data sets, and even implement atomic REST API actions. This was definitely a fun project; I'll be looking to expand my usage of Apache Ignite and start taking full advantage of this incredible platform in the future.