In serverless architectures, the developer gives up control over many, or even most, components. This is true of SaaS products in general, but in a fully serverless system the number of points where the developer retains full control shrinks even further. On AWS, user code is limited to Lambda functions, API Gateway mappings, and IoT rules. There is no way, for example, to induce a premature shutdown of the underlying EC2 instance handling an API Gateway connection, or to cause SNS to fail when invoked by an event on S3. And while the compute components of serverless systems are generally stateless (a good practice), that doesn't mean a degraded system will still meet its requirements around latency, data loss, or management of distributed transactions.

While unit testing of Lambda code is fairly straightforward (see my recent article), it does not suffice to verify that a full system is production-ready; integration testing is required. However, integration testing for serverless architectures presents a problem. For the purposes of this article, I will assume the system uses only AWS services. How can we test the situation where DynamoDB has less-than-perfect reliability? Does our system degrade gracefully? Do our logging and monitoring adequately inform us of the problem?

In traditional architectures, a system like Netflix's Chaos Monkey (and the related pieces of the Simian Army) serves this purpose by randomly shutting down VMs and interfering with network traffic. If a system has no SaaS components, nearly every error condition can be tested this way.

With SaaS components, we have no way to induce the components themselves to behave abnormally. In a fully serverless system, the only control we have is over the code we put in. Given that constraint, how can we do integration testing in the spirit of Chaos Monkey? What would Monkeyless Chaos look like?

Starting from the assumption that we are using only AWS services, plus the further assumption that we are using Python (just to pick a particular SDK; the requirements apply equally to other languages), we can establish some requirements for such a system:

Requirements for Monkeyless Chaos

1. A system for injecting errors into boto3 SDK calls

> This exists: botocore's Stubber class provides a template for implementing a more focused error-injection class

2. A system for intercepting the creation of boto3 Sessions, Clients, and Resources

> The same injection system in boto3 will work for this

> This is so we can inject the error injector when the chaos library is loaded

3. A system for specifying the errors to inject and how often (and where) they should appear (a sketch of how requirements 1–3 could fit together follows this list)

> Service errors can be referred to by name, as the service definitions in botocore suffice to translate a name into an actual exception

> We also need network errors, such as latency or timeouts, and perhaps corrupted data

> Allow placebo's pill format for direct specification of return values

> This system should allow varying degrees of specificity, from “this particular Lambda can’t reach Kinesis 60% of the time” to “all requests to Kinesis from all Lambdas fail”

> The error specifications should be able to be changed at run time, without requiring redeployment, to allow simulating outage scenarios (e.g., how long does recovery take once an outage is over?)

4. It should be possible to deploy the system without any of this code included at all, so that it would be impossible to use it to cause system degradation by accidental or malicious means.
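To make requirements 1 through 3 concrete, here is a minimal sketch of the injection side, hooking botocore's event system so that injected failures surface to calling code as ordinary service errors. The spec format and the `chaotic_client` helper are illustrative assumptions, not an existing library:

```python
import random

import boto3
from botocore.exceptions import ClientError

# Hypothetical error spec format: "service.operation" patterns mapped to
# an error code and an injection probability. In the real system this
# would be loaded from an error spec table rather than hardcoded.
ERROR_SPEC = {
    "kinesis.PutRecord": {"error": "ProvisionedThroughputExceededException",
                          "probability": 0.6},
    "dynamodb.*": {"error": "InternalServerError",
                   "probability": 0.1},
}

def _inject_errors(event_name, **kwargs):
    # botocore emits events named "before-call.<service>.<operation>"
    _, service, operation = event_name.split(".")
    for pattern, rule in ERROR_SPEC.items():
        svc, op = pattern.split(".")
        if svc == service and op in ("*", operation):
            if random.random() < rule["probability"]:
                # Raising here propagates out of the client call, so the
                # caller sees what looks like a real service error.
                raise ClientError(
                    {"Error": {"Code": rule["error"],
                               "Message": "injected by chaos layer"}},
                    operation)

def chaotic_client(service_name, **kwargs):
    """Create a boto3 client with the error injector registered.

    A full implementation would instead intercept Session/Client/Resource
    creation (requirement 2) so user code needs no changes.
    """
    client = boto3.client(service_name, **kwargs)
    # Registering on the "before-call" prefix catches every operation.
    client.meta.events.register("before-call", _inject_errors)
    return client
```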

This system would likely use a DynamoDB table, shared by all components of the system, to satisfy requirement #3. This table could have a well-known name, but this would limit the system to a single error specification per account, which would not work for people who deploy multiple systems into the same account. A well-known table with a mapping from Lambda name (+version) to error spec table name could be used to have multiple error specifications: on startup, each Lambda would index into the well-known table using its own name to find the error spec table it should use.
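A sketch of that startup lookup, with assumed table, key, and attribute names:

```python
import boto3

WELL_KNOWN_TABLE = "chaos-error-spec-index"  # assumed name

def find_error_spec_table(function_name, function_version):
    """Find the error spec table this Lambda should use, if any."""
    dynamodb = boto3.client("dynamodb")
    key = "{}:{}".format(function_name, function_version)
    response = dynamodb.get_item(
        TableName=WELL_KNOWN_TABLE,
        Key={"function": {"S": key}})
    item = response.get("Item")
    # No entry means no chaos is configured for this function.
    return item["spec_table"]["S"] if item else None
```

Each Lambda would run this once, on its first invocation, using the name and version from its context object.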

Support for environment variables in Lambda would reduce the pain of the scheme described above: the error spec table name would simply be provided to each Lambda through an environment variable at deployment time. Ryan Scott Brown suggested to me that error specs could be provided inline to Lambdas in environment variables, but this wouldn't satisfy the requirement that they be changeable at run time.
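With environment variables available, the lookup would collapse to a configuration read. A sketch, assuming a variable named `ERROR_SPEC_TABLE` and the lookup function from the previous sketch:

```python
import os

def error_spec_table(context):
    # Prefer a table name set at deployment time; fall back to the
    # well-known-table lookup sketched above.
    table = os.environ.get("ERROR_SPEC_TABLE")
    if table:
        return table
    return find_error_spec_table(context.function_name,
                                 context.function_version)
```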

To extend this beyond AWS services, the first logical step is HTTP calls. The system should allow similar specifications for HTTP errors, along with a way to inject them into common HTTP libraries like requests.
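For requests specifically, a transport adapter is one natural injection point. A sketch, where `ChaosAdapter` and its `failure_rate` knob are assumptions for illustration:

```python
import random

import requests
from requests.adapters import HTTPAdapter
from requests.exceptions import ConnectionError, Timeout

class ChaosAdapter(HTTPAdapter):
    """Transport adapter that fails a fraction of outgoing requests."""

    def __init__(self, failure_rate=0.1, **kwargs):
        self.failure_rate = failure_rate
        super(ChaosAdapter, self).__init__(**kwargs)

    def send(self, request, **kwargs):
        if random.random() < self.failure_rate:
            # Simulate the network errors called for in requirement 3.
            raise random.choice([
                ConnectionError("injected connection failure"),
                Timeout("injected timeout"),
            ])
        return super(ChaosAdapter, self).send(request, **kwargs)

session = requests.Session()
session.mount("https://", ChaosAdapter(failure_rate=0.25))
```

Mounting the adapter on a session only affects code that uses that session; a fuller implementation would intercept requests at import time, mirroring requirement 2.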

This is one approach to integration testing for serverless systems. It is working well for us, but there may be other approaches that work better (feedback is very welcome here or on Twitter!). I hope to see support for these use cases in SDKs (e.g., built-in error stubbing). I’m torn about support in services themselves. An API that can cause a service to cease functioning normally would make me (and OpSec) very nervous. Hopefully, as the serverless space matures, support for integration testing will be a new bullet point on the serverless manifesto.