In the first article of this Rate Limiting series I introduced the motivations for rate limiting, and discussed several implementation options (depending on whether you own both sides of the communication or not) and the associated tradeoffs. This article dives a little deeper into the need for rate limiting with API gateways.

Why Rate Limit with an API Gateway?

In the first article I discussed options for where to implement rate limiting: the source, the sink, or middleware (literally a service in the middle of the source and sink).

When exposing your application via a public API you typically have to implement rate limiting within the sink or middleware that you own. Even if you control the source (client) application, you will typically want to guard against bugs that cause excess API requests, and also against bad actors who may attempt to subvert your client applications.

The Stripe blog has an excellent article on “Scaling your API with rate limiters”, which I’ll be referencing throughout this post, and the opening section talks about how rate limiting can help make your API more reliable in the following scenarios:

One of your users is responsible for a spike in traffic that is overwhelming your application, and you need to stay up for everyone else.

One of your users has a misbehaving script which is accidentally sending you a lot of requests (trust me, this happens more often than you think — I’ve personally created load test scripts that accidentally triggered a self-inflicted denial of service!). Or, even worse, one of your users is intentionally trying to overwhelm your servers.

A user is sending you a lot of lower-priority requests, and you want to make sure that it doesn’t affect your high-priority traffic. For example, users sending a high volume of requests for analytics data could affect critical transactions for other users.

Something in your system has gone wrong internally, and as a result you can’t serve all of your regular traffic and need to drop low-priority requests.

At Datawire we have seen these patterns firsthand, particularly with organisations exposing “freemium” style public APIs, where it is a clear business requirement to be able to prioritise traffic for paying customers, and also to protect against bad actors (intentional or otherwise).

The Basics of Rate Limiting and Load Shedding

Fundamentally, rate limiting is simple. For each request property you want to limit against, you keep a count of the number of times each unique instance of that property is seen, and reject the associated request if the count exceeds the specified limit per time unit. For example, if you wanted to limit the number of requests each client made, you would use the “client identifier” property (perhaps set via the query string key “clientId”, or included in a request header), and keep a count for each identifier.

You would also specify a maximum number of requests per time unit, and potentially define an algorithm for how the count is decremented, rather than simply resetting the counter at the start of each time unit (more on this later). When a request arrives at the API gateway, it will increment the appropriate request count and check whether this increase means that the maximum allowable requests per time unit has been exceeded. If so, the request is rejected, most commonly by returning a “Too Many Requests” HTTP 429 status code to the calling client.
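To make this concrete, here is a minimal sketch of a fixed-window limiter keyed on a client identifier. This is an illustration only, not any particular gateway's implementation, and the class and method names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal fixed-window rate limiter: at most maxRequests per client per window.
public class FixedWindowRateLimiter {
    private static final class Window {
        final long start;  // when this window began
        int count;         // requests seen so far in this window
        Window(long start) { this.start = start; }
    }

    private final int maxRequests;
    private final long windowMillis;
    private final Map<String, Window> windows = new HashMap<>();

    public FixedWindowRateLimiter(int maxRequests, long windowMillis) {
        this.maxRequests = maxRequests;
        this.windowMillis = windowMillis;
    }

    // Returns true if the request is allowed; a false result is what the
    // gateway would translate into an HTTP 429 "Too Many Requests".
    public synchronized boolean allow(String clientId, long nowMillis) {
        Window w = windows.get(clientId);
        if (w == null || nowMillis - w.start >= windowMillis) {
            w = new Window(nowMillis);   // previous window expired: reset the count
            windows.put(clientId, w);
        }
        if (w.count >= maxRequests) {
            return false;                // limit for this window already reached
        }
        w.count++;
        return true;
    }
}
```

Passing the clock in explicitly keeps the sketch testable; a real implementation would use the system clock and, as discussed below, would need the counters somewhere other than a per-instance map.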

Closely related to rate limiting is “load shedding”. The primary difference here is that the decision to reject traffic is not based on a property of an individual request (e.g. the clientId), but on the overall state of the application (e.g. database under heavy load). Implementing the ability to shed load at the point of ingress can prevent a major customer incident if the system is still partially up and running but needs time to recover (or be fixed).
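A load shedding check, by contrast, consults a piece of global state rather than a per-client counter. A minimal sketch, where the overload signal and the priority classification are both hypothetical:

```java
// Sketch of load shedding at the point of ingress: the decision is driven by
// overall system state (a flag flipped by some health monitor), with the
// request contributing only its priority classification.
public class LoadShedder {
    public enum Priority { LOW, HIGH }

    private volatile boolean overloaded = false;

    // Called by a monitoring loop when e.g. the database is under heavy load.
    public void setOverloaded(boolean overloaded) {
        this.overloaded = overloaded;
    }

    // Under overload, shed low-priority traffic (e.g. analytics requests)
    // while continuing to serve high-priority traffic (e.g. transactions).
    public boolean shouldServe(Priority priority) {
        return !overloaded || priority == Priority.HIGH;
    }
}
```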

Challenges with API Gateways

The majority of open source and commercial API gateways offer rate limiting, but one of the challenges with many of these implementations is scalability. Running your API gateway on a single compute instance is relatively simple, and this means you can keep the rate limiting counters in memory. For example, if you were rate limiting on clientId, you would simply check and set (increment) the clientId in an in-memory map with an associated integer counter. However, this approach does not scale past the single instance to a cluster of gateway instances.

I’ve seen some developers attempt to get around this limitation either by using sticky sessions or by dividing the total maximum number of allowable requests by the number of rate limiting instances. However, neither of these approaches works reliably when deploying and operating applications in a highly dynamic “cloud native” environment, where instances are destroyed and recreated on demand, and also scaled dynamically.

The best solution to overcome this limitation is to use some form of high-performance centralised data store to manage the request count. For example, at Lyft, the team use Redis (presumably run as a highly-available Redis Sentinel cluster) to track this rate limiting data via their Envoy proxy, which is deployed as a sidecar to all of their services and datastores. There are some potential issues to be aware of with this approach, particularly around the atomicity of the check-and-set operations with Redis. For performance reasons it is recommended to avoid the use of locking, and both Stripe and Figma have talked about using the Lua scripting functionality (with guaranteed atomicity) within the Redis engine.
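The reason atomicity matters is that the increment and the limit check must happen as one indivisible step: two gateway instances that each read the same counter value could both conclude the request is under the limit and both admit it. A Lua script gives Redis that guarantee server-side; the same idea, sketched in-process with hypothetical names (and deliberately simplified, since in a real deployment the counter lives in the shared store, which is the whole point):

```java
import java.util.concurrent.ConcurrentHashMap;

// Atomic increment-and-check: the read-modify-write happens as a single
// operation on the map entry (via merge), so two concurrent callers cannot
// both observe "limit - 1" and both be admitted.
public class AtomicCounterLimiter {
    private final int max;
    private final ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();

    public AtomicCounterLimiter(int max) {
        this.max = max;
    }

    // Increments the count for this key and reports whether the request
    // is still within the limit.
    public boolean incrementAndCheck(String key) {
        int newCount = counts.merge(key, 1, Integer::sum);
        return newCount <= max;
    }
}
```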

Other challenges often encountered relate to the ability to extract request (meta)data for use in determining the rate limit, and to specifying (or implementing) the associated algorithm used to determine whether a specific request should be rejected. Ideally you would like to be able to specify rate limiting in relation to various client properties (e.g. request HTTP method, location, device, etc.) and the decomposition of your backend (e.g. service endpoint, semantic information such as a user-initiated request vs an app-initiated request, payload expectations).
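One way to picture this kind of property-based limiting is a limit looked up by a composite key built from whichever request metadata the gateway can extract. The sketch below is purely illustrative (the names, the key structure, and the per-minute unit are all hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of limits keyed on composite request properties: here the key is
// "HTTP method + endpoint", with a default limit when no specific rule matches.
public class DescriptorLimits {
    private final Map<String, Integer> limits = new HashMap<>();
    private final int defaultLimitPerMinute;

    public DescriptorLimits(int defaultLimitPerMinute) {
        this.defaultLimitPerMinute = defaultLimitPerMinute;
    }

    public void setLimit(String method, String endpoint, int maxPerMinute) {
        limits.put(method + " " + endpoint, maxPerMinute);
    }

    // The most specific matching rule wins; otherwise fall back to the default.
    public int limitFor(String method, String endpoint) {
        return limits.getOrDefault(method + " " + endpoint, defaultLimitPerMinute);
    }
}
```

A real gateway would match on many more dimensions (client identity, location, priority tier, and so on), but the lookup-by-descriptor shape stays the same.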

Rate Limiting via an External Service

An interesting solution to overcome many of the challenges discussed in the previous section was presented by the Lyft Engineering team last year, when they talked about how the Envoy proxy they use as (what we are now calling) a service mesh implements rate limiting by calling out to an external RateLimit service for each request. The RateLimit service conforms to the RateLimit protobuf defined here, which is effectively a rate limit API. The Datawire team has built the open source Ambassador API gateway on top of the Envoy Proxy, and recently Alex Gervais has implemented the same rate limiting support for Ambassador.
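The shape of that API is roughly as follows. This is a trimmed paraphrase of the Lyft protobuf definitions at the time of writing, not the authoritative version — consult the linked definition for the full, current contract:

```protobuf
// The proxy sends the request's descriptors; the service answers OK or OVER_LIMIT.
service RateLimitService {
  rpc ShouldRateLimit (RateLimitRequest) returns (RateLimitResponse) {}
}

message RateLimitRequest {
  string domain = 1;                            // configuration namespace, e.g. per gateway
  repeated RateLimitDescriptor descriptors = 2; // request metadata to limit on
  uint32 hits_addend = 3;                       // how many hits this request counts for
}

message RateLimitDescriptor {
  message Entry {
    string key = 1;    // e.g. "clientId" or "method"
    string value = 2;  // e.g. "abc123" or "GET"
  }
  repeated Entry entries = 1;
}
```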

As you now have access to a protobuf rate limit service API, you can implement a rate limit service in any language you like (or at least any language with protobuf support, which covers most modern languages). You also now have complete freedom to implement any rate limiting algorithm you like within the service, and to base the rate limiting decision on any metadata you want to pass to the service. The examples within the Lyft RateLimit service provide some interesting inspiration! It’s also worth mentioning that, as the Ambassador API gateway runs within Kubernetes, any rate limiting service you create can take advantage of Kubernetes to handle scaling and fault tolerance.
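As one example of the algorithmic freedom this gives you: instead of resetting a counter at the start of each window, a token bucket refills capacity continuously and lets unused capacity absorb short bursts. A minimal sketch (names hypothetical, and not the implementation used by any of the services mentioned above):

```java
// Token bucket: tokens refill at a steady rate up to a burst capacity,
// and each admitted request consumes one token.
public class TokenBucket {
    private final double capacity;       // maximum burst size
    private final double refillPerMilli; // tokens added per millisecond
    private double tokens;
    private long lastRefillMillis;

    public TokenBucket(double capacity, double refillPerSecond, long nowMillis) {
        this.capacity = capacity;
        this.refillPerMilli = refillPerSecond / 1000.0;
        this.tokens = capacity;          // start full
        this.lastRefillMillis = nowMillis;
    }

    public synchronized boolean tryAcquire(long nowMillis) {
        // Refill proportionally to elapsed time, capped at capacity.
        tokens = Math.min(capacity,
                tokens + (nowMillis - lastRefillMillis) * refillPerMilli);
        lastRefillMillis = nowMillis;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```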

Wrapping Up with a Look to the Next Article

In this second article of our rate limiting series you have learned about the motivations for rate limiting and load shedding with an API gateway, and you have also explored some of the challenges of doing this. In the final section of the article I presented some ideas around integrating rate limiting within an API gateway deployed on a modern cloud native platform (like Kubernetes, ECS, etc.), and discussed how using an external service could allow a lot of flexibility in implementing your exact requirements for a rate limiting algorithm.

Join me for the final part of the series next week, where we take a look at implementing a Java rate limiting service for the Ambassador API gateway (here’s a sneak peek of some of the code!).

In the meantime, please do feel free to email any questions, or jump on the Ambassador Gitter channel.

Continue reading the other articles in this four part series:

Part 1: Rate Limiting: A Useful Tool with Distributed Systems

Part 3: Implementing a Java Rate Limiting Service for the Ambassador API Gateway

Part 4: Designing a Rate Limiting Service for Ambassador