Implementing GCRA in Python

Rate limiting with the Generic Cell Rate Algorithm

We enforce rate limits on our systems to protect against attacks and to enforce usage tiers. A rate limit ensures that an action can only be performed at a specified rate; any requests over that rate are limited, and another action, typically a response with a 429 status code, is taken instead.

Rate limits are usually enforced per user-agent, although they can be per route (regardless of the user-agent), or even per application. The key to a rate-limiting system is that each limit is applied per rate-key: a typical example is the remote IP address or the member ID of an authenticated user.

Rate limits are defined by a limit, L (the number of requests), and a period, P, such that requests at a rate of at most L/P are possible without being limited (blocked).

Time Bucketed

The naive approach to rate limiting, and the one we initially adopted at Smarkets, is a time-bucketed algorithm. In this algorithm, a count and an expiry time are stored for each rate-key. On the first request the count is set to the limit, L, and the expiry to the current time plus the period. On each subsequent request the count is decremented, and if it drops below zero the request is limited, unless the expiry is in the past, in which case the bucket is reset.
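The time-bucketed algorithm can be sketched as follows. This is a minimal illustration, not our production code: the class name is my own, and the buckets live in an in-memory dict rather than a shared store.

```python
import time


class TimeBucketLimiter:
    """Naive time-bucketed limiter: at most `limit` requests per
    `period`-second window, each window starting at its first request."""

    def __init__(self, limit, period):
        self.limit = limit
        self.period = period
        self.buckets = {}  # rate-key -> (remaining count, expiry time)

    def is_limited(self, key, now=None):
        now = time.monotonic() if now is None else now
        count, expiry = self.buckets.get(key, (None, None))
        if count is None or now >= expiry:
            # First request for this key, or the bucket has expired: reset,
            # counting this request against the fresh quota.
            self.buckets[key] = (self.limit - 1, now + self.period)
            return False
        if count <= 0:
            return True  # quota exhausted until the bucket expires
        self.buckets[key] = (count - 1, expiry)
        return False
```

The `now` parameter is only there to make the behaviour easy to demonstrate; callers would normally omit it.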

The downside of this approach is that it doesn't actually enforce a rate; rather, it limits the number of requests in discrete periods, with each period anchored to its first request. This means the entire quota can be exhausted at the start of a bucket, with no further requests allowed until the bucket period has elapsed.

Typically the next step is to implement the leaky-bucket approach; however, we skipped over this in favour of the more elegant Generic Cell Rate Algorithm (GCRA).

Generic Cell Rate Algorithm

To understand this approach, let's start by aiming to limit requests to R = L/P, and for simplicity set L = 1 (we'll revisit this later) so that the target rate is R = 1/P. At this target rate, we expect requests to be separated by at least P; any that aren't should be limited. Put simply, if the arrival time t of the current request is within P of the arrival time s of the previous request, i.e. t - s < P, then the request must be limited.

By defining the Theoretical Arrival Time, TAT, of the next request as the arrival time of the current request s plus P, i.e. TAT = s + P, we find that if t < TAT the request should be limited. Hence we can simply calculate and store TAT on every request.

If L = 1 is the only case you need to support you can stop here; in practice, however, L is often greater than 1. This is because it is often desirable to allow requests to bunch within a short period of time while remaining within the rate limit over a larger period. For example, if R = 10/60s (10 requests per 60 seconds) a user could make all 10 requests in the first six seconds, whereas with R = 1/6s (one request per six seconds) the user would have to wait six seconds between requests, for the same rate over the minute.

To cope with L > 1, we need to reconsider what the Theoretical Arrival Time of the next request should be. In the L = 1 case it was always the current request's arrival time s plus P. When L > 1, however, up to L requests may arrive within P, so consecutive requests need only be separated by P/L. It is now also possible for requests to bunch, i.e. for the arrival time s to be less than the expected TAT. When this happens, the next request's Theoretical Arrival Time is TAT' = TAT + P/L. When requests don't bunch and s is greater than TAT, TAT' = s + P/L. Combining the two cases, TAT' = max(TAT, s) + P/L.
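The update rule can be written down directly. Times here are plain numbers of seconds, and the function name is my own:

```python
def next_tat(tat, s, period, limit):
    """Compute TAT' = max(TAT, s) + P/L: the theoretical arrival
    time of the request following one that arrived at time s,
    given the previously stored TAT."""
    return max(tat, s) + period / limit
```

For example, with P = 60 and L = 10 the separation P/L is six seconds: a request at s = 0 against an initial TAT of 0 yields TAT' = 6, and a bunched request at s = 0 against TAT = 6 yields TAT' = 12.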

Now that we can calculate and store the TAT for the next request, we just need to decide when to limit. From the above it is clear that each request arriving before its TAT increases the stored TAT by the interval P/L. If the new TAT' would exceed the current time plus the period, the request should be limited: TAT' - t > P, or equivalently TAT - t > P - P/L. When L = 1 this reduces to TAT - t > 0, i.e. t < TAT, as stated previously. Note that the TAT should not be updated for limited requests, as they don't have a theoretical arrival time.

An example allowed request. The theoretical time of arrival, TAT, is close enough to the actual time of arrival, t, therefore the request is allowed. P is the period of the rate limit and L the number of requests allowed in that period.

An example limited request. The theoretical time of arrival, TAT, is too far in the future for the actual time of arrival, therefore the request is limited. P is the period of the rate limit and L the number of requests allowed in that period.

In Python this can be implemented with this snippet:
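The following is a sketch of such a snippet, with the same `now` parameter for demonstrability as above. The class name is my own, and the TATs are held in an in-memory dict; a production deployment would typically keep them in a shared store such as Redis, with the check-and-update performed atomically.

```python
import time


class GCRARateLimiter:
    """Limit each rate-key to `limit` requests per `period` seconds
    using the Generic Cell Rate Algorithm."""

    def __init__(self, limit, period):
        self.period = period
        self.interval = period / limit  # P / L, the emission interval
        self.tats = {}  # rate-key -> theoretical arrival time (TAT)

    def is_limited(self, key, now=None):
        now = time.monotonic() if now is None else now
        tat = max(self.tats.get(key, now), now)
        # Limit if the updated TAT would exceed now + P,
        # i.e. TAT - t > P - P/L.
        if tat - now > self.period - self.interval:
            return True  # limited: the stored TAT is NOT updated
        self.tats[key] = tat + self.interval  # TAT' = max(TAT, t) + P/L
        return False
```

With limit = 10 and period = 60, ten requests arriving together are allowed, the eleventh is limited, and one further request becomes available every six seconds thereafter.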

Conclusion

Most rate-limit implementations in existence are either time-bucket or leaky-bucket based, yet both approaches are inferior to GCRA. The time-bucket approach requires the storage of two values and does not enforce a rate, while the leaky-bucket approach requires a separate process to continually drain the bucket. In comparison, GCRA requires only a single stored value, the TAT, and can be implemented in around 10 lines of Python.

I think GCRA is rarely used because of its perceived complexity and relative obscurity. Hopefully this article goes some way to addressing these issues and convincing others to adopt GCRA.