The S3 SLA is broken (and how to fix it)

At approximately 2009-01-14 05:26, Amazon's Simple Storage Service suffered some form of internal failure, resulting in a sharp increase in the rate of request failures. According to Amazon, there were "increased error rates"; according to my logs, 100% of the PUT requests the tarsnap server made to S3 failed. For somewhat more than half an hour (I don't know the exact duration) it was impossible for the tarsnap server to store any data to S3, effectively putting it out of service as far as storing backups was concerned; and presumably other S3 users met a similar fate.

At approximately 2009-01-16 15:20, the S3 PUT error rate jumped from its usual level of less than 0.1% up to roughly 1%; and as I write this, the error rate remains at that elevated level. However, the tarsnap server, like all well-designed S3-using applications, retries failed PUTs, so aside from a very slight increase in effective request latency, this prolonged period of elevated error rates has had no effect on tarsnap whatsoever; nor, presumably, has it had any significant impact on any other well-designed S3-using applications.
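For readers who have not written this kind of code, here is a minimal sketch of retrying a failed PUT with exponential backoff. It is not tarsnap's actual implementation: `TransientS3Error`, `put_with_retries`, and the zero-argument `put` callable are hypothetical names standing in for whatever S3 client is in use.

```python
import random
import time


class TransientS3Error(Exception):
    """Hypothetical stand-in for an "InternalError" or "ServiceUnavailable" reply."""


def put_with_retries(put, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call put() until it succeeds, backing off exponentially between tries.

    `put` is any zero-argument callable that raises TransientS3Error on a
    retryable failure; it is a placeholder, not a real S3 client call.
    """
    for attempt in range(max_attempts):
        try:
            return put()
        except TransientS3Error:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Wait roughly base_delay * 2^attempt, with jitter so that many
            # clients do not all retry at the same instant.
            sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.0))
```

If failures were independent (an assumption, not something the logs above establish), a 1% per-request error rate would leave about a 0.01^5 = 1 in 10 billion chance of all five attempts failing. Against a 100% error rate, no amount of retrying helps.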

According to the S3 Service Level Agreement, these two outages -- one which rendered applications inoperative for half an hour, and the other which had little or no impact -- are equal in severity.

This peculiar situation is caused by the overly simplistic form which the SLA takes: It provides a guarantee on the average Error Rate, completely neglecting the fact that -- given that applications can retry failures -- the impact of errors is a highly non-linear function of the error rate. I observed no outages in S3 during December 2008, yet even without using tricks which can be used to artificially raise the computed error rate, the occasional failures which result from S3's nature as a distributed system -- failures which occur by design -- were enough that the error rate I experienced (as computed in accordance with the SLA) was 0.098% -- just barely short of the 0.1% which would have triggered a refund. At the same time, 0.1% of a month is 40-44 minutes (depending on the number of days in the month), so if S3 failed completely for 30 minutes but every request made outside of that interval succeeded, nobody would get a refund under the SLA.
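The arithmetic behind that last claim can be checked with a few lines of Python. This is a sketch under a simplifying assumption: requests arrive at a steady rate, so the average error rate caused by a total outage equals the fraction of the month the outage lasted. The function names are illustrative only.

```python
MINUTES_PER_DAY = 24 * 60


def refund_threshold_minutes(days_in_month, threshold=0.001):
    """Minutes of total outage that add up to `threshold` (0.1%) of a month."""
    return days_in_month * MINUTES_PER_DAY * threshold


def averaged_error_rate(outage_minutes, days_in_month):
    """Average error rate when every request fails during the outage and none
    fail outside it, assuming a steady request rate (a simplification)."""
    return outage_minutes / (days_in_month * MINUTES_PER_DAY)


for days in (28, 29, 30, 31):
    print(f"{days} days: {refund_threshold_minutes(days):.2f} minutes")

# A 30-minute total outage in a 30-day month stays under the 0.1% line.
print(f"{averaged_error_rate(30, 30):.4%}")
```

Under that assumption, a 30-minute total outage in a 30-day month averages out to roughly 0.069%, below the 0.098% baseline measured in a month with no outages at all.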

Put simply, the design of the SLA results in refunds being given in response to harmless failures, yet not being given in response to harmful failures: The wrong people get refunds.

If I were in charge at Amazon, I would adjust the S3 SLA as follows:

Definitions

"Failed Request" means: A request for which S3 returned either "InternalError" or "ServiceUnavailable" error status.

"Non-GET Request" means: Requests other than GET requests, e.g., PUT, COPY, POST, LIST, or DELETE requests.

"Severely Errored Interval" for an S3 account means: A five-minute period, starting at a multiple of 5 minutes past an hour, during which either

At least 5 GET requests associated with the account are Failed Requests, and the number of GET requests associated with the account which are Failed Requests is more than 0.5% of the total number of GET requests associated with the account; or

At least 5 Non-GET Requests associated with the account are Failed Requests, and the number of Non-GET Requests associated with the account which are Failed Requests is more than 5% of the total number of Non-GET Requests associated with the account.

"Monthly Uptime Percentage" means: 100% minus the number of Severely Errored Intervals divided by the total number of five-minute periods in the billing cycle (i.e., 288 times the number of days).

Three notes are in order here:

The use of Severely Errored Intervals as a metric in place of simply computing the average Error Rate would distinguish the low baseline rate of errors which result from S3's design (and are mostly harmless) from the exceptional periods where S3's error rate spikes upwards (often, but not always, to 100%). In so doing, this change would make it possible to increase the guaranteed Monthly Uptime Percentage without increasing the number of SLA credits given.

I distinguish between GET failures and non-GET failures for two simple reasons: First, GET failures are far less common, so it wouldn't hurt Amazon to offer a strengthened guarantee for GETs; and second, in many situations a GET failure is more problematic than a PUT failure -- not least because web browsers downloading public files from S3 don't automatically retry failed requests.

The dual requirement that at least 5% (or 0.5% for GETs) of requests fail AND that there be at least 5 failed requests makes it extremely unlikely that Error Rate increasing tricks could be used to artificially push an interval across the threshold required to qualify as Severely Errored.
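The proposed definitions translate fairly directly into code. The sketch below is one possible reading of them, not anything Amazon has implemented: the `(timestamp, method, failed)` log format and all function names are assumptions made for illustration, and the request volumes in the usage example are synthetic.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def interval_start(ts):
    """Round a timestamp down to a multiple of 5 minutes past the hour."""
    return ts.replace(minute=ts.minute - ts.minute % 5, second=0, microsecond=0)


def exceeds(failed, total, min_failures, max_fraction):
    """Both tests must pass: enough failures in absolute AND relative terms."""
    return failed >= min_failures and failed > max_fraction * total


def severely_errored_intervals(requests):
    """Return the start times of all Severely Errored Intervals.

    `requests` is an iterable of (timestamp, method, failed) tuples, a
    hypothetical log format; `failed` is True for a Failed Request.
    """
    # counts[interval][is_get] holds [failed_count, total_count].
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for ts, method, failed in requests:
        bucket = counts[interval_start(ts)][method == "GET"]
        bucket[0] += failed
        bucket[1] += 1
    bad = set()
    for start, by_kind in counts.items():
        get_failed, get_total = by_kind[True]
        other_failed, other_total = by_kind[False]
        if (exceeds(get_failed, get_total, 5, 0.005)
                or exceeds(other_failed, other_total, 5, 0.05)):
            bad.add(start)
    return bad


def monthly_uptime_percentage(num_severely_errored, days_in_cycle):
    """100% minus the share of five-minute periods that were Severely Errored."""
    return 100.0 * (1 - num_severely_errored / (288 * days_in_cycle))


# Synthetic data: 30 minutes during which every PUT fails (one every 6 seconds)...
outage = [(datetime(2009, 1, 14, 5, 25) + timedelta(seconds=6 * i), "PUT", True)
          for i in range(300)]
# ...versus one interval with 1000 PUTs, of which 1% fail.
elevated = [(datetime(2009, 1, 16, 15, 20) + timedelta(milliseconds=200 * i),
             "PUT", i % 100 == 0) for i in range(1000)]

print(len(severely_errored_intervals(outage)))    # total outage
print(len(severely_errored_intervals(elevated)))  # 1% error rate
```

With this synthetic data the total outage marks six intervals as Severely Errored, while the interval with a 1% PUT error rate marks none: it has 10 failures, which meets the minimum of 5, but 10 out of 1000 is well under the 5% threshold.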

Now, I don't expect Amazon to adopt this suggestion overnight, and I suspect that even if they are inspired to fix the SLA they'll do it in such a way that the result is at most barely recognizable as being related to what I've posted here; but I hope this will at least spark some discussions about making the set of people who receive SLA credits better reflect the set of people affected by outages.

And Amazonians -- I know you're going to be reading this, since I logged hundreds of you reading my last post about the S3 SLA -- if this does open your eyes a bit, could you let me know? It's always a bit unsettling to see a deluge of traffic coming from an organization but not to hear anything directly. :-)
