Photo by Connor Danylenko from Pexels

What is Chaos Engineering?

The rise of companies such as Amazon and Netflix drove the development of microservices. While this new architecture offered many advantages, it also presented its own dilemma: when one service fails, how does the rest of the system respond?

In the early 2000s, Amazon began the GameDay program, which tested its systems by purposefully injecting critical failures into them. While these experiments revealed unexpected flaws, the program was stopped for fear of the failures adversely affecting customers.

Then along came Netflix. In August 2008, Netflix began migrating from its own datacenter to the AWS Cloud after a devastating database corruption halted shipment of DVDs for several days. This move was accompanied by a transition to a microservices architecture, which introduced more places where the Netflix system could fail. Building on the idea of Amazon’s GameDay program, Netflix began deploying Chaos Monkeys into its system. The Chaos Monkey purposefully terminated instances in the live production environment, and the system automatically replaced them to limit the impact on customers. In the Principles of Chaos Engineering, the Chaos Engineering pioneers at Netflix state, “Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” This practice has allowed Netflix to test the redundancy of its system before a large-scale failure occurs.

Chaos Engineering and Chaos Tools should be used to make sure that when a system fails, the impact of the failure is mitigated as much as possible. While Netflix and other companies have built Chaos Tools to test various microservices and databases, one area where Chaos Engineering has not yet reached is GraphQL.

Why is Chaos Engineering Helpful in GraphQL?

This article is not going to cover the many advantages of GraphQL; there is already a plethora of articles only a quick Google search away. And while GraphQL offers many advantages, it also has weaknesses that can quietly wreak havoc in a code base if they are not pre-emptively sought out and remedied.

By convention, a GraphQL server typically sends an HTTP status code of 200 back to the client even if the query was not fully successful; any errors are reported in an "errors" field in the response body instead. But what happens if there is latency in the query? Or if data is missing? The status code does not indicate these problems. If the rest of the code base depends on a successful GraphQL query and receives a 200 code despite an error, how will the rest of the system compensate for the latency or missing data?
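To make the problem concrete, here is a minimal Python sketch of a client-side check that looks past the status code. It assumes the common response shape with top-level "data" and "errors" keys; the helper names (null_paths, inspect_response) and the sample payload are hypothetical and exist only for illustration. Note that a null field is not always a failure, since a schema may allow nullable fields, so treat the report as a prompt for investigation rather than a verdict.

```python
import json
from typing import Any


def null_paths(value: Any, prefix: str = "") -> list[str]:
    """Return the dotted path of every null value found in the response data."""
    if value is None:
        return [prefix or "data"]
    paths: list[str] = []
    if isinstance(value, dict):
        for key, child in value.items():
            paths.extend(null_paths(child, f"{prefix}.{key}" if prefix else key))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            paths.extend(null_paths(child, f"{prefix}[{index}]"))
    return paths


def inspect_response(status_code: int, body: str) -> dict[str, Any]:
    """Summarize a GraphQL response without trusting the status code alone."""
    payload = json.loads(body)
    errors = payload.get("errors") or []
    missing = null_paths(payload.get("data"))
    return {
        "status_code": status_code,
        "error_messages": [error.get("message", "") for error in errors],
        "null_paths": missing,
        "fully_successful": status_code == 200 and not errors and not missing,
    }


# Hypothetical payload: the request "succeeded" with a 200, yet one field failed.
sample_body = json.dumps({
    "data": {"user": {"name": "Ada", "email": None}},
    "errors": [{"message": "email service unavailable", "path": ["user", "email"]}],
})
report = inspect_response(200, sample_body)
print(report)
```

A check like this only reveals that something went wrong; it does not tell you how the rest of the application behaves when it does, which is the question chaos experiments are meant to answer.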

This is where Chaos Engineering tools are effective. Purposefully injecting latency or returning missing data on random queries can show developers how their system will react when these issues occur with GraphQL.
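The idea can be sketched in a few lines of Python. This is not how ChaosQoaLa is implemented; it is a generic illustration that wraps a resolver-like function and, for a random share of calls, adds a delay and blanks out selected fields. The names with_chaos, latency_seconds, drop_fields, and probability, as well as the resolve_user function, are all assumptions made up for this example.

```python
import random
import time
from typing import Any, Callable, Iterable, Optional


def with_chaos(
    resolver: Callable[..., Any],
    *,
    latency_seconds: float = 0.0,
    drop_fields: Iterable[str] = (),
    probability: float = 1.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], Any] = time.sleep,
) -> Callable[..., Any]:
    """Wrap a resolver so that a share of its calls is delayed and loses data."""
    rng = rng or random.Random()
    dropped = set(drop_fields)

    def wrapped(*args: Any, **kwargs: Any) -> Any:
        result = resolver(*args, **kwargs)
        # Leave the call untouched unless it falls inside the affected share.
        if rng.random() >= probability:
            return result
        if latency_seconds > 0:
            sleep(latency_seconds)
        if isinstance(result, dict):
            # Simulate missing data by nulling out the selected fields.
            return {key: (None if key in dropped else val) for key, val in result.items()}
        return result

    return wrapped


def resolve_user(user_id: int) -> dict:
    """Hypothetical resolver standing in for a real data fetch."""
    return {"id": user_id, "name": "Ada", "email": "ada@example.com"}


# Record the delays instead of really sleeping so the example runs instantly.
recorded_delays: list = []
chaotic_resolver = with_chaos(
    resolve_user,
    latency_seconds=0.5,
    drop_fields=("email",),
    probability=1.0,
    sleep=recorded_delays.append,
)
chaotic_result = chaotic_resolver(1)
print(chaotic_result, recorded_delays)
```

Setting probability below 1.0 affects only some calls, which is closer to how such experiments are usually run against a live system: a small share of traffic is disturbed while the rest serves as a baseline.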

Enter ChaosQoaLa

Our team has developed an open-source Chaos Engineering tool, ChaosQoaLa, to run chaos experiments on GraphQL queries. With our CLI tool, you can inject latency into your queries and return responses with missing data to see how they affect the steady state you have defined. Once you have run ChaosQoaLa, the tool produces a results file that you can upload to our site to receive a data visualization showing how the experiments have impacted your codebase. This helps you preempt major issues in your codebase and improve the user experience.

For more information on how to use ChaosQoaLa, check out the project on GitHub.