Check out the follow-up post on how to calculate the rate limit status which is often returned in response headers.

Rate limiting is a simple enough idea, but implementing it efficiently is less obvious. When you go searching for algorithms the token/leaky bucket algorithms inevitably come up. Implementations typically track the number of tokens and the timestamp of the last request. When a new request comes in they look at the timestamp and calculate how much time has passed and add/subtract the appropriate number of tokens from the bucket. The generic cell rate algorithm is a more efficient way to implement leaky bucket that only requires us to track a timestamp.

The GCRA page on Wikipedia is a bit heavy on the details with words like “asynchronous transfer mode networks”, “virtual channels”, and a bunch of other jargon that doesn’t matter to those of us just interested in rate limiting. It took me a while to figure out how to map a rate limit policy like “5 requests per minute” to the description, so here’s a cheat sheet and the intuition I came up with to reason about the algorithm.

The algorithm

Let’s say you want to enforce “$X$ requests per $P$ seconds”. To do this let:

$$T = P/X \\

\tau = P - T$$

When a request comes in we set the arrival time $t_a$ to the current time. If this is the first request, we initialize $TAT = t_a$. If $t_a < TAT - \tau$ then reject the request. Otherwise we allow the request and update $TAT$:

$$TAT = \max(t_a, TAT) + T$$

The intuition

The best way I found to understand this is to map it back to the leaky bucket metaphor. We have a bucket that holds tokens and has a fixed capacity. There’s a hole in the bucket that leaks tokens at a fixed rate of one token per second. Each request that arrives costs some number of tokens and the request is allowed if they can be added to the bucket without overflowing it and rejected otherwise.

Let’s pretend we want to enforce that requests arrive at a fixed rate. That is, we want requests to come every $T = P/X$ seconds. Then all we need to do is set our bucket capacity to $T$ and add $T$ tokens per request. Requests are rejected if the bucket is not empty and the bucket empties every $T$ seconds. Since tokens are removed at a rate of one per second, we can think of $T$ as the time it will take for the tokens added by a request to leak from the bucket.

Requests usually don’t arrive at a fixed rate, so we allow clients to send a burst of requests before we limit them. For instance, if we have a policy that allows for 5 requests per minute, the client may send 5 requests all at once, but then will need to wait at least 12 seconds before it can make another. To allow for this we increase the bucket capacity to $P$. 5 requests will fill up the bucket with 60 tokens and another will be allowed after 12 seconds.

You can think of $TAT$ as the next time the bucket will empty. This makes sense since we update $TAT$ by adding $T$ to $TAT$ or $t_a$ if $t_a > TAT$ (i.e. the bucket is already empty). However, we don’t need the bucket to be empty to allow the request. The bucket just needs to have room for $T$ more tokens. Put another way, a request is rejected if the bucket has more than $\tau = P - T$ tokens in it. Combined with the fact that tokens leak at a rate of one per second and $TAT$ is when the bucket will empty, this translate to rejecting requests when $t_a < TAT - \tau$.