Gremlin and Failure as a Service

1:49 - Failure as a Service is about causing the kinds of problems you see in production systems on purpose.

2:15 - The Gremlin Failure as a Service system comprises three main pieces: a compiled Linux binary agent that causes the bad behaviours; a control plane with safety features such as a dead man's switch, plus an API that allows integration with standard build and deploy pipelines; and a web interface.
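As an illustration of the dead man's switch idea, here is a minimal Python sketch. It is not Gremlin's implementation; the class name, the `revert` callback and the injectable `clock` are all hypothetical. The agent reverts the injected failure if it has not heard from the control plane within a timeout.

```python
import time


class DeadMansSwitch:
    """Reverts an attack unless the control plane keeps checking in.

    Hypothetical sketch: names and behaviour are assumptions, not Gremlin's API.
    """

    def __init__(self, timeout_s, revert, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.revert = revert  # callback that undoes the injected failure
        self.clock = clock    # injectable so the logic can be tested without sleeping
        self.last_heartbeat = clock()
        self.reverted = False

    def heartbeat(self):
        """Record that the control plane was heard from just now."""
        self.last_heartbeat = self.clock()

    def poll(self):
        """Called periodically by the agent; reverts once the timeout has passed."""
        silent_for = self.clock() - self.last_heartbeat
        if not self.reverted and silent_for > self.timeout_s:
            self.revert()
            self.reverted = True
        return self.reverted
```

The point of the design is that losing contact is treated as a reason to stop: the safe state is the default, and the attack continues only while heartbeats keep arriving.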

3:18 - Version 1 of Gremlin is focused on infrastructure failures: noisy neighbours, JVM leaks, network issues and the like.

4:25 - The goal is to end up running the service in production, as that is where you get the most value, but you would typically start in dev or test and work your way up to production.

4:56 - It is harder than you might think to cause failures, but there are a good number of open source libraries out there that can do it. However, they tend to not make it easy to revert the impact.

5:40 - Failure injection is somewhat analogous to a vaccine. We want to inject these bad behaviours so our developers can build immunities to them.

6:45 - Gremlin supports auditing. It also doesn’t need to run as root; the clients and agents all need to authenticate to run attacks, and those privileges can be revoked. If they are revoked, the agents will automatically clean up.

7:43 - Gremlin is currently in closed beta, with an open beta planned at the end of the year.

LDFI at Netflix

8:55 - Lineage-driven Fault Injection (LDFI) is a technique that Peter Alvaro and others at UC Berkeley came up with, where you look at something that went right and ask what could have prevented that.

9:32 - The approach uses a satisfiability solver to find the combinations of failures that could break the system, and then tests them with real failures. If a test results in a customer-facing error, a bug has been found. If it doesn’t, that combination can be eliminated from future experiments.
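The solve-then-test loop can be sketched in Python under some loud assumptions. The lineage of a successful request is modelled as a list of alternative paths, each a set of service names. A brute-force search for minimal "hitting sets" (sets of services that break every known path) stands in for the satisfiability solver. The function names, the service names and the `causes_user_error` callback are hypothetical, and this is not the Netflix implementation.

```python
from itertools import combinations


def minimal_hitting_sets(paths):
    """Return minimal sets of services that intersect every successful path.

    Brute-force stand-in for a SAT solver; only practical for tiny inputs.
    """
    services = sorted(set().union(*paths))
    found = []
    for size in range(1, len(services) + 1):
        for combo in combinations(services, size):
            candidate = set(combo)
            # A superset of an earlier solution is not minimal, so skip it.
            if any(prev <= candidate for prev in found):
                continue
            # The candidate must knock out at least one service on every path.
            if all(candidate & path for path in paths):
                found.append(candidate)
    return found


def explore(paths, causes_user_error):
    """Test each hypothesis with a (hypothetical) real failure injection."""
    bugs, ruled_out = [], []
    for hypothesis in minimal_hitting_sets(paths):
        if causes_user_error(hypothesis):
            bugs.append(hypothesis)       # customer-facing error: a bug
        else:
            ruled_out.append(hypothesis)  # tolerated: skip in future experiments
    return bugs, ruled_out


# Hypothetical lineage: the request succeeds via "ratings" or via its cache.
paths = [{"api", "ratings"}, {"api", "ratings_cache"}]
hypotheses = minimal_hitting_sets(paths)
```

With this lineage the search proposes two experiments: fail `api` alone, or fail `ratings` and `ratings_cache` together. Every other combination is either harmless given the known paths or a superset of one of these two.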

10:01 - The team at Netflix ran this on the start-up sequence for Netflix devices, which touches around 100 services. Using this approach, the team was able to fully explore the space with just 200 experiments, versus the 1E+30 (a 1 followed by 30 zeros) it would require using brute force.
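One way to arrive at that order of magnitude, assuming each of roughly 100 services can independently either fail or not (an assumption made here for illustration, not something stated in the podcast), is to count every possible subset of failed services:

```python
services = 100  # approximate number of services touched, per the notes above

# Each service is either failed or healthy, so there are 2^100 combinations.
brute_force_experiments = 2 ** services

print(brute_force_experiments)           # 1267650600228229401496703205376
print(f"{brute_force_experiments:.2e}")  # 1.27e+30
```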

11:37 - One thing the team learnt at Netflix was that it was easier to approach academics than they thought! An invitation to lunch resulted in Peter Alvaro and Kolton Andrus’ collaboration in taking Peter’s theoretical model and building it into an automated failure testing system at Netflix.

Why failure testing matters

14:06 - As you scale up, you encounter failure more often.

14:25 - In addition, in a DevOps world where you build it and run it, failure testing gives you an opportunity to train. What both Amazon and Netflix found was that this meant you don’t get paged in the middle of the night.

15:51 - If you are doing failure testing in production, you can cause customer impact, but you are watching it and can roll it back if things start to go wrong. So you might have an outage for a minute or so, but you avoid a much longer one in the real world.

16:38 - Netflix’s failure testing system differs from Gremlin version 1 in that it is built at the application layer. Gremlin version 2 will add this.

17:00 - This allows request-level fault injection for a specific user or a specific device. So you test it for one user, then in the lab, then in production for 0.1% of traffic, and gradually increase the scope.
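A minimal sketch of how that kind of scoping decision might look in Python. The `FaultScope` type, the `should_inject` function and the hashing scheme are hypothetical and are not taken from the Netflix or Gremlin systems. Explicitly targeted users and devices always get the fault; everyone else is sampled deterministically by request ID.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class FaultScope:
    """Who a fault applies to (hypothetical structure)."""
    user_ids: set = field(default_factory=set)
    device_ids: set = field(default_factory=set)
    traffic_percent: float = 0.0  # 0.1 means 0.1% of all requests


def should_inject(scope, user_id, device_id, request_id):
    """Decide whether this particular request should see the fault."""
    # Explicit targets first: one test user or one lab device.
    if user_id in scope.user_ids or device_id in scope.device_ids:
        return True
    # Otherwise hash the request ID into one of 100,000 buckets, so the
    # same request always gets the same answer and sampling is repeatable.
    digest = hashlib.sha256(request_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100_000
    return bucket < scope.traffic_percent * 1_000


# Widening the blast radius step by step.
one_user = FaultScope(user_ids={"test-user"})
small_slice = FaultScope(traffic_percent=0.1)
everyone = FaultScope(traffic_percent=100.0)
```

Hashing rather than calling a random number generator is a design choice in this sketch: it keeps the decision stable if several services evaluate the same request.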

17:50 - The goal is to run a failure at 100% in production, because then you really know you can handle it.

Lessons learnt at Amazon and Netflix

18:11 - Amazon makes extensive use of metrics. Part of doing failure testing well is about having dashboards, alerts and paging.

18:40 - One of the value propositions for Gremlin is that you see when things fail, so you can test that your alerts work and that you get paged as you should.

19:03 - The most important metric is the top performance metric for the business. So for Amazon retail, this would be, “can people order”. At Netflix, it would be, “can people stream”.

19:36 - Then you drill down to the service metrics: are your API calls working correctly? Has your throughput dropped? What do your load and memory profiles look like?

20:05 - Being able to trace infrastructure with something like Dapper or Zipkin offers tremendous value. At Netflix the failure injection system is integrated with the tracing system, which meant that when they caused a failure they could see all the points in the system that it touched.

Why do failure testing

21:12 - It depends on the cost of being wrong. An hour of downtime at Facebook is estimated to cost $1.7 million in lost ad revenue. At Amazon, if you are down on Black Friday that is also going to cost you a great deal.

22:05 - An outage can also have less tangible costs, such as damage to your brand's reputation and to customer trust.

Getting started with failure testing

22:41 - Start by taking the chaos monkey approach and just break things in test.

22:54 - If you want to run in production though, you do need the ability to roll back. Running in production should be your end goal because that is what matters.
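The roll-back requirement can be sketched as a Python context manager that guarantees the revert step runs, combined with a simple abort condition. All of the names here (`injected_failure`, `run_experiment`, and the `apply`, `revert` and `is_healthy` callbacks) are hypothetical, and a real system would also wait between health checks.

```python
from contextlib import contextmanager


@contextmanager
def injected_failure(apply, revert):
    """Apply a failure and always revert it, even on error or early exit."""
    apply()
    try:
        yield
    finally:
        revert()


def run_experiment(apply, revert, is_healthy, checks):
    """Hold the failure for a number of health checks, aborting on the first bad one."""
    with injected_failure(apply, revert):
        for _ in range(checks):
            if not is_healthy():
                return "aborted"  # leaving the block still triggers revert()
    return "completed"
```

Because the revert lives in a `finally` clause, it runs whether the experiment completes, aborts early, or raises an exception part-way through.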

Gremlin plans

24:03 - Longer term (version 3) Gremlin hopes to have automated LDFI.

24:20 - Gremlin is also working on helping teams train for operational readiness. It's important to test engagement to make sure alerts go off, the right person gets paged, and so on.

25:11 - The firm also wants to be able to train new team members and is using a red/blue team approach to this.

26:24 - The firm also wants to think about what is needed to run large-scale failure tests: for example, abort conditions, a Slack bot to let people know that a failure test is being run, and post-mortem questions.

Resources

QCon London Presentation

Automating Failure Testing Research at Internet Scale Paper

Original Lineage-driven Fault Injection Paper (Paper, The Morning Paper Summary)

Tools

Dapper

Zipkin

Companies mentioned

Amazon

Gremlin.inc

Slack

Netflix