Chapter 9. Rate Limiter

  • Book Excerpt from "C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations"
  • by David Spuler, Ph.D.

What is a Rate Limiter?

A rate limiter or “throttling” component prevents clients from sending too many requests to a server within a given time period. For example, a server might have a rate limit policy of “100 requests per minute” that all clients must adhere to.

Servers have two basic methods for dealing with an exceeded rate limit:

  • Rejection — disallow the client’s transaction with an error message.
  • Smoothing — introduce a delay or another load-reduction method without rejecting.

Servers don’t really like having to force rate limits on their clients. After all, they want happy customers. Hence, servers will attempt to improve their capacity in other ways:

  • Load balancing technologies
  • Bigger servers with GPUs
  • More C++ low-latency coders

But at some point, if demand for your service is unlimited, you have to say no.

Rate limiters are a general technology component and may apply to numerous types of servers and services that allow multiple clients:

  • Trading exchanges limiting HFT order submissions.
  • AI servers limiting the number of Norse poems people can request through their API.
  • Web sites limiting the number of browser page views or online transactions.
  • Email servers limiting the pass-through of emails (spam prevention).

I’m sure you can think of some more.

Client vs Server Rate Limiters

Servers and clients have different issues, but both can implement a rate limiter C++ component. The basic idea is:

  • Servers — block a client from submitting too many orders (keeping request volume within the limit).
  • Clients — figure out when an order can safely be submitted (maximizing throughput within the limit).

There’s a significant architectural difference for the two contexts:

  • Server rate limiter — per-client rate limits for many clients.
  • Client rate limiter — tracks the rate between one client (me!) and one server.

The scalability requirements for the server are much greater. Hence, a server will often implement its multi-client rate limiter using an in-memory database such as Redis or Memcached. The client-side rate limiter component is much less complex, and it’s usually a simple C++ class.

As you can see, the objectives for servers versus clients are somewhat opposite, but the coding issues are similar. The server is effectively maintaining multiple rate limiters with one for each client. The client is maintaining one rate limiter component for its own connection to the server. These are two sides of the same coin:

    Client rate limiters are abstract models of the server rate limiter.

The server has the real rate limiter that will actually block orders. The client’s rate limiter is a theoretical model that attempts to emulate the server-side logic to thereby predict whether we can send an order or not. Hence, to implement a client-side rate limiter you need to know as much as possible about the server’s rate limiter:

  • Rate limit thresholds — e.g., how many transactions in what time period?
  • Rate limit algorithm — e.g., time-based or total transactions?

The rate limit thresholds are usually part of the documentation. Note that rate limits may not be time-based, such as where each client is allowed (or can buy) some “credits” and then consumes one or more credits with each submitted request.

The algorithm used by the server-side rate limiter may be trickier to discern. There’s a whole bunch of theory about the best way to do this on a server. Here’s a selection of algorithms (a token-bucket sketch follows the list):

  • Fixed counts (credits)
  • Token bucket
  • Leaky bucket
  • Fixed window
  • Sliding window (log variant)
  • Sliding window (counter variant)
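
As a taste of how these algorithms work, here is a minimal token-bucket sketch: tokens refill at a fixed rate up to a burst capacity, and each request consumes one token. This is illustrative only; the class and member names are not taken from any particular server.

    #include <algorithm>
    #include <chrono>

    class TokenBucket {
        using Clock = std::chrono::steady_clock;
        double rate_;              // tokens added per second
        double capacity_;          // maximum burst size
        double tokens_;            // current token count
        Clock::time_point last_;   // time of the last refill
        void refill() {
            auto now = Clock::now();
            std::chrono::duration<double> elapsed = now - last_;
            last_ = now;
            tokens_ = std::min(capacity_, tokens_ + elapsed.count() * rate_);
        }
    public:
        TokenBucket(double tokens_per_sec, double capacity)
            : rate_(tokens_per_sec), capacity_(capacity),
              tokens_(capacity), last_(Clock::now()) {}
        bool try_acquire() {                  // true if the request is allowed
            refill();
            if (tokens_ < 1.0) return false;  // bucket empty: reject
            tokens_ -= 1.0;
            return true;
        }
    };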

There are also other lower-level features of the server-side rate limiter algorithm to consider:

  • Rate limit violations — does the server reject, smooth, or delay the client transaction?
  • Retry permissions — does the server’s rejection include a data field with the recommended time for a retry?

And then you have to code all that into your client rate limiter.

Rate Limiter Optimizations

Client-side rate limiters are part of the “hotpath” and are performance-critical. After all, the rate limiter component is queried immediately before submitting a transaction, so any latency in the rate limiter check adds directly to trade submission latency.

How to run fast?

Well, it depends on the server’s algorithm. For example, if the server allows one request per minute, then only record the timestamp of your last submission. And if the server’s method is based on a fixed number of credits (e.g., a free trial with an upper bound on credits, or a way to purchase a number of credits), then the client can simply maintain a running count of its own credit stash.
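
Both cases reduce to almost no state at all. Here is a minimal sketch of the two, assuming a steady clock; the class names are hypothetical:

    #include <chrono>

    // Case 1: "one request per minute" -- only the last timestamp matters.
    class OneRequestPerMinute {
        using Clock = std::chrono::steady_clock;
        Clock::time_point last_{};   // epoch default: the first request passes
    public:
        bool try_acquire() {
            auto now = Clock::now();
            if (now - last_ < std::chrono::minutes(1)) return false;
            last_ = now;
            return true;
        }
    };

    // Case 2: fixed credits -- just maintain a running count.
    class CreditLimiter {
        long credits_;
    public:
        explicit CreditLimiter(long credits) : credits_(credits) {}
        bool try_acquire() {
            if (credits_ <= 0) return false;
            --credits_;
            return true;
        }
        void add_credits(long n) { credits_ += n; }  // e.g., after a purchase
    };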

More interesting optimizations arise in the rate limit tracking of fixed window and sliding window algorithms. The general idea for the rate limit is:

    N requests in M seconds.

In a fixed window algorithm, the allowed limit is reset every M seconds. It’s like the server has an interrupt timer running every M seconds, which resets the allowed client transactions. Indeed, this is one way to implement it, although it’s not the fastest for the server, because it would have to touch every client’s counter every M seconds.

Nevertheless, having a timer running every M seconds is more efficient for the client-side implementation. But you have to make sure that your client-side interrupt is synchronized with the server’s timestamps, or else chaos ensues.
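
Instead of a literal timer interrupt, a client can get the same effect by resetting the window lazily on each check. A minimal sketch, with illustrative names (note the comment about aligning with the server’s window boundaries):

    #include <chrono>

    class FixedWindowLimiter {
        using Clock = std::chrono::steady_clock;
        Clock::duration window_;                        // M seconds
        long limit_;                                    // N requests per window
        long count_ = 0;                                // used in current window
        Clock::time_point window_start_ = Clock::now();
    public:
        FixedWindowLimiter(Clock::duration window, long limit)
            : window_(window), limit_(limit) {}
        bool try_acquire() {
            auto now = Clock::now();
            if (now - window_start_ >= window_) {   // window expired:
                window_start_ = now;                // lazy reset (should be
                count_ = 0;                         // aligned with the server!)
            }
            if (count_ >= limit_) return false;
            ++count_;
            return true;
        }
    };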

A sliding window algorithm is a more accurate way to limit client requests. Whereas a fixed window algorithm can be manipulated by the client in a way that allows 2N transactions in a window-length period (e.g., N requests just before a window boundary and N more just after it), a sliding window will correctly limit the client to only N transactions.

However, a sliding window algorithm is more complicated to code, and it requires tracking the timestamps of many requests. The methods to implement a sliding window rate limiter include:

  • Naive request queue with removals.
  • Fixed array of N timestamps.
  • Ring buffer of N timestamps.

Compact data representation. But before we look at the code, there’s a space optimization, which also helps with speed due to cache locality: we can throw away most of the transaction. We only need the timestamp, so we can compact the data significantly.

It might be desirable to store other aspects of the request, such as an order ID, especially in testing mode. But the algorithms discussed below work only on the timestamp, and don’t ever need to go back to the original order or request. In production mode, a client-side rate limiter component will tell you whether or not you have permission to submit a trade, but it can’t give you a list of the transactions you did previously.
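
For example, each stored entry can shrink to a single 64-bit integer. A small sketch, where the choice of a steady clock and nanosecond units is an assumption:

    #include <chrono>
    #include <cstdint>

    // Hypothetical compact entry: one 64-bit nanosecond count per request,
    // instead of the whole order object.
    using TimestampNs = std::uint64_t;   // nanoseconds since the clock's epoch

    inline TimestampNs now_ns() {
        using namespace std::chrono;
        return static_cast<TimestampNs>(
            duration_cast<nanoseconds>(
                steady_clock::now().time_since_epoch()).count());
    }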

Naive request queue algorithm. The naive approach comes from realizing this is an “order-of-insertion” problem, so we need a queue, where the requests are stored as they are received, with their timestamp being the only important field. The basic idea goes like this (a deque-based sketch follows the list):

  • Remove all old transactions on the queue that are outside the M seconds time window.
  • Check if there are fewer than N transactions still in the queue.
  • If so, success, and add our request to the queue.
  • Otherwise, fail with a rejected transaction (and don’t add it to the queue).
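
Here is that naive idea as a deque-based sketch, with illustrative names and a steady clock assumed:

    #include <chrono>
    #include <cstddef>
    #include <deque>

    class NaiveSlidingWindow {
        using Clock = std::chrono::steady_clock;
        std::deque<Clock::time_point> times_;  // timestamps, oldest at the front
        Clock::duration window_;               // M seconds
        std::size_t limit_;                    // N requests
    public:
        NaiveSlidingWindow(Clock::duration window, std::size_t limit)
            : window_(window), limit_(limit) {}
        bool try_acquire() {
            auto now = Clock::now();
            // Remove all timestamps that have fallen outside the window.
            while (!times_.empty() && now - times_.front() >= window_)
                times_.pop_front();
            // Fewer than N left? Allowed: record this request.
            if (times_.size() >= limit_) return false;  // rejected: not recorded
            times_.push_back(now);
            return true;
        }
    };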

But this idea is not great coding, and I’m understating it here for politeness reasons. This method is super-inefficient because we are doing:

  • Insertions of new requests (even if only timestamps).
  • Removals of out-of-date requests.
  • Linear scanning of the request list.

Fixed array of N items. A key insight is that to manage a rate limit of N items for a fixed time period, we only need to track the last N requests. Hence, we only need to store the last N items, and we can maintain a fixed array of exactly N items, or rather, exactly N timestamp values. Throw that dynamic queue data structure to the curb!

The simplest idea is an order-of-insertion array, but we shouldn’t use a plain array where every removal shifts the remaining elements down toward index zero. Instead, we should use a ring or circular buffer data structure.

Fixed-size ring buffer. The simple idea is to maintain a fixed-size ring buffer containing the last N timestamps. This is effectively implementing a fixed-size queue of N items in an array or vector container.

It’s an implementation choice whether to use a compile-time size with std::array or a run-time fixed size with std::vector for the ring buffer. If the rate limit rarely changes, then N is a compile-time constant and we can use std::array. However, we can use std::vector by doing a single heap allocation (a resize() or reserve() call) in the startup phase of the trading application, away from the hotpath. The vector method is more flexible because we can load N from a configuration file, rather than needing a re-compile.
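
Putting this together, here is a sketch of the std::vector variant, still with the startup branch that the prefilled version below removes. Names are illustrative; with one allocation at startup and O(1) work per request, the hotpath check stays cheap:

    #include <chrono>
    #include <cstddef>
    #include <vector>

    class RingRateLimiter {
        using Clock = std::chrono::steady_clock;
        std::vector<Clock::time_point> ring_;  // the last N timestamps
        Clock::duration window_;               // M seconds
        std::size_t next_ = 0;                 // slot holding the oldest timestamp
        std::size_t stored_ = 0;               // < N only during startup
    public:
        RingRateLimiter(std::size_t n, Clock::duration window)
            : ring_(n), window_(window) {}     // single allocation, off the hotpath
        bool try_acquire() {
            auto now = Clock::now();
            if (stored_ < ring_.size()) {      // startup: under N requests ever
                ring_[next_] = now;
                next_ = (next_ + 1) % ring_.size();
                ++stored_;
                return true;
            }
            // Full ring: ring_[next_] is the oldest of the last N timestamps.
            if (now - ring_[next_] < window_)  // N requests already in the window
                return false;
            ring_[next_] = now;                // newest overwrites oldest
            next_ = (next_ + 1) % ring_.size();
            return true;
        }
    };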

Prefilled ring buffer. One minor optimization is to note that the client-side rate limiter in a ring buffer is doing a needless branch. There are two distinct cases:

    1. Startup — the first N transactions.

    2. Ongoing — the rest of the transaction requests.

If the total number of submitted transactions is less than N, then our queue is only partially filled, and the transaction is definitely allowed: we’ve never submitted enough transactions, regardless of the time period!

But branches are not great, as discussed in the chapter on branch prediction. In the spirit of branchless coding, let’s not even check for the condition. Instead, we can pre-fill the initial ring buffer with N zero timestamp values at startup, and pretend that it already has N elements stored in it. Thus, we can remove an “if” statement (goodbye, branch, we won’t miss you!).

Note that doing this also allows a secondary optimization: we no longer need both “head” and “tail” indices. The ring buffer is always full, and the most recent item is always right next to the oldest timestamp. So, we only need a single offset.
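
Here is the prefilled variant as a sketch: the vector’s value-initialized time points act as the N zero timestamps, so the startup branch and the second index both disappear. Names are illustrative:

    #include <chrono>
    #include <cstddef>
    #include <vector>

    class PrefilledRingLimiter {
        using Clock = std::chrono::steady_clock;
        std::vector<Clock::time_point> ring_;  // value-initialized: N "zero" times
        Clock::duration window_;               // M seconds
        std::size_t oldest_ = 0;               // the single offset
    public:
        PrefilledRingLimiter(std::size_t n, Clock::duration window)
            : ring_(n), window_(window) {}     // vector zero-fills the N slots
        bool try_acquire() {
            auto now = Clock::now();
            if (now - ring_[oldest_] < window_) return false;
            ring_[oldest_] = now;                    // newest replaces oldest...
            oldest_ = (oldest_ + 1) % ring_.size();  // ...which sits right next to it
            return true;
        }
    };

If N is rounded up to a power of two, the modulo can be replaced by a bit-mask, which avoids a division on the hotpath.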

Unfortunately, we can’t remove everything! If it weren’t for those pesky orders, we could do it all at compile-time.

Advanced Client Rate Limiting Issues

Computing retry time. A useful extra feature for the rejection logic of our rate limit algorithm: compute the wait time required until a re-submission of this trade would be accepted. This can be returned to the caller as useful information.

However, it’s not a simple algorithm in our fixed-size ring buffer queue. Whether we want to always compute this for the caller, or provide an API for the caller to ask for this information, is a judgement call. Note that if our order is going to be rejected anyway, we’re no longer in the hotpath, so adding extra computations has a low penalty.
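
For the simplest question (how long until the very next submission would be accepted?), the prefilled ring buffer already holds the answer, since the oldest timestamp is directly at hand. A member-function sketch for the hypothetical PrefilledRingLimiter above:

    // Sketch: wait time until the next submission could succeed, as a member
    // of the hypothetical PrefilledRingLimiter above. Zero means "submit now".
    Clock::duration retry_after() const {
        auto age = Clock::now() - ring_[oldest_];
        return age >= window_ ? Clock::duration::zero() : window_ - age;
    }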

Server timestamp synchronization. There can be a difference between the timestamp as perceived by the server receiving your message and the timestamp you used to decide about rate limits. This is a rare issue, and it can cut both ways.

  • False positives — you submit a trade and it gets refused.
  • False negatives — you withhold a trade that would have been allowed.

The difference in timestamps can occur at both ends of the queue. Depending on which end, this rare issue could trigger a false positive or false negative.

Extensions

  1. Extend the client-side rate limiter algorithm to use time delays and retries. When should a rate limiter suggest a delay versus rejecting the transaction?
  2. Extend the client-side rate limiting algorithm to accept requests from multiple threads.
  3. How would you handle false positives? The client rate limiter says the trade is allowed, so the trade execution component submits the trade, but the exchange server rejects its submission. Add a feature allowing the trade execution component to report “bad trades” to the rate limiter. Should it report “good trades” to the rate limiter?
  4. Examine the use of lower-precision timestamp values in the ring buffer. Instead of a 64-bit unsigned long, can you use 32 bits? Or less?
  5. Analyze the use of differences in timestamps to compact the data type. Instead of the full timestamp, can you use the number of clock ticks since the program startup timestamp?
  6. Consider how to handle incoming transactions that are out-of-order according to their timestamps. For example, your code is accepting candidate trade transactions from multiple servers.
  7. What statistics should be tracked and recorded to allow monitoring and management of a rate limiter software component in production?

References

  1. Peer D., September 27, 2024, Building A Custom Api Rate Limiting Tool In C++, https://peerdh.com/blogs/programming-insights/building-a-custom-api-rate-limiting-tool-in-c (This is a server implementation of rate limiting for multiple users)
  2. Mike Cheng, 2015, ratelimiter: A C++ Rate limiter implementation, https://github.com/mfycheng/ratelimiter (Server version)
  3. Geeks for Geeks, 16 Mar, 2023, How to Design a Rate Limiter API | Learn System Design, https://www.geeksforgeeks.org/how-to-design-a-rate-limiter-api-learn-system-design/ (This is a server-side rate limiter; examines bucket, leaky bucket, etc.)
  4. Stack Overflow, 2015, Rate limiting algorithm for throttling request, https://stackoverflow.com/questions/26647166/rate-limiting-algorithm-for-throttling-request
  5. Learn X by Example, 2025, Rate Limiting in C++, https://learnxbyexample.com/cpp/rate-limiting/ (Server-side rate limiter.)
  6. Arpit Bhayani, Apr 05, 2020, System Design: Sliding window based Rate Limiter, https://www.codementor.io/@arpitbhayani/system-design-sliding-window-based-rate-limiter-157x7sburi (Server-side rate limiting.)
  7. Aman Kumar Pandey, Jan 9, 2025, Understanding rate limiting and its implementation, https://medium.com/@amandevbhardwaj/understanding-rate-limiting-and-its-implementation-70bb5e33f63a
  8. Abhishek Dey, The Algorists, 2025, Distributed API Rate Limiter, https://lowleveldesign.io/SystemDesign/RateLimiter (Implements various server-side algorithms.)
  9. Wikipedia, 2025, Token bucket, https://en.m.wikipedia.org/wiki/Token_bucket
  10. Wikipedia, 2025, Leaky bucket, https://en.wikipedia.org/wiki/Leaky_bucket
  11. 4sily, 2017, Simple rate limiter: A quick-and-dirty implementation of RPS limiter, https://github.com/4sily/rate-limiter-cpp
  12. Geeks for Geeks, 7 Nov, 2024, Rate Limiting in System Design, https://www.geeksforgeeks.org/rate-limiting-in-system-design/
  13. Jan Gaspar, 2013, Chapter 9. Boost.Circular Buffer, https://www.boost.org/doc/libs/1_64_0/doc/html/circular_buffer.html (General implementation of a circular buffer that can be used in the rate limiter.)
  14. Robert Mosolgo, April 5, 2021, How we scaled the GitHub API with a sharded, replicated rate limiter in Redis, https://github.blog/engineering/how-we-scaled-github-api-sharded-replicated-rate-limiter-redis/
  15. Hiresh Trivedi, Aug 5, 2021, Designing a Distributed Rate Limiter — Introduction, https://medium.com/wineofbits/designing-a-distributed-rate-limiter-introduction-731afd345a66
  16. Ruslan Diachenko, Feb 5, 2024, Sliding Window Rate Limiting and its Memory-Optimized Variant, https://rdiachenko.com/posts/arch/rate-limiting/sliding-window-algorithm/ (Client-side use of a deque for the sliding window algorithm.)
  17. Stack Overflow, 2023 (updated), Sliding window algorithm for rate limiting requests per second windows, https://stackoverflow.com/questions/69161879/sliding-window-algorithm-for-rate-limiting-requests-per-second-windows
  18. Book Notes, 2022, Design a Rate Limiter, https://books.dwf.dev/docs/system-design/c5
  19. Ronak Chatterjee, 2023, A high frequency trading system built with C++: High performance, low latency high frequency trading system written from scratch in C++, https://github.com/nyarosu/hft
  20. Ranjan (Man of steel), 2025 (updated), Live High-Frequency Trading Exchange Engine, https://github.com/ranjan2829/Live-High-Frequency-Trading-Exchange-Engine
  21. Ruy Dan, 2025 (updated), A Ring Buffer implementation with a fixed-size buffer, developed in Zig, https://github.com/ruy-dan/ring-buffer
  22. Stack Overflow, 2020 (updated), How do I implement a circular list (ring buffer) in C?, https://stackoverflow.com/questions/215557/how-do-i-implement-a-circular-list-ring-buffer-in-c
  23. Ralf Holly, 2020, Circular Adventures VII: A Ring Buffer Implementation, https://www.approxion.com/circular-adventures-vii-a-ring-buffer-implementation/
