Aussie AI
Book Excerpt from "Advanced C++ Memory Techniques: Efficiency and Safety"
by David Spuler, Ph.D.
Chapter 3. Cache Warming
What is Cache Warming?
Cache warming is a specific type of prefetching optimization aimed at keeping the various memory caches fresh. It typically involves scanning through all the memory data required for the "hot path," even though there's no real intention to use the data (until later). This dry-run traversal keeps the cache warm, so that when the hot path is executed for real (e.g., a trade execution in HFT), its memory accesses are very fast.
There are multiple ways to trigger the prefetching of data needed to keep the cache warm:
- Low-level C++ prefetching primitives.
- Copies to volatile temporary variables.
- Explicit dry-run parameters in the code.
Unlike other types of CPU prefetching, cache warming is something your C++ code does directly, rather than a hardware-enabled feature. It’s up to you to determine what data is needed the most in hot path computations, and then pre-load that data on every pass-through. You effectively do a “dry run” of the hot path, but access the memory to ensure it’s maintained in the cache.
Note that cache warming is not always a guaranteed win. Using the “dry run” approach can end up with a lot of extra conditional tests:
if (!dry_run) {
    // Do something
}
This can negatively impact performance in two ways:
- Runtime cost of testing the flag, and
- Extra branches of code that risk CPU branch mispredictions.
As with every low-latency optimization, you really need to time it to see whether these costs are less than the gain from faster memory cache accesses.
Memory Prefetch Primitives
Although you can “manually” prefetch data in basic C++ code, there are also some builtins that are convenient for larger amounts of data. Some of the C++ primitives to use for cache warming include:
- __builtin_prefetch (GCC/Clang builtin)
- _mm_prefetch (x86 intrinsic from <xmmintrin.h>)
Prefetching is more effective on some data structures than others, with a general preference for contiguous data blocks. Cache locality, driven by "cache lines" of 64-256 bytes, is another reason. As a practical example, contiguous arrays are better than dispersed data structures like linked lists and trees. This means that the contiguous memory layout of std::vector can be prefetched more effectively than the spread-out memory used by std::list objects.
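As a rough sketch (not from the book's code), here is one way to warm a contiguous array with __builtin_prefetch on GCC/Clang; the PriceLevel order-book type and the 64-byte cache-line assumption are hypothetical, and _mm_prefetch from <xmmintrin.h> is the equivalent x86 intrinsic.
#include <cstddef>
#include <vector>

struct PriceLevel { double price; int quantity; };   // hypothetical order-book entry

// Touch every cache line of the hot data so it stays resident in the caches.
void warm_order_book(const std::vector<PriceLevel>& book)
{
    const char* base = reinterpret_cast<const char*>(book.data());
    const std::size_t bytes = book.size() * sizeof(PriceLevel);
    for (std::size_t i = 0; i < bytes; i += 64) {     // assume 64-byte cache lines
        __builtin_prefetch(base + i, 0 /*read*/, 3 /*keep in all cache levels*/);
    }
}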
Volatile Temporary Variables
Another approach for manual prefetching is to use the volatile specifier on temporary variables. If data is assigned to a volatile temporary variable, the optimizer cannot remove the apparently unused assignment. For example, consider this statement:
int temp = my_order_book[0];
The C++ compiler may notice that "temp" is not used anywhere else, so it can throw away the entire assignment statement. The solution is to add the volatile specifier:
volatile int temp = my_order_book[0];
The compiler is now forced to load the data from memory even when the value appears unused by the rest of the code, because accessing a volatile variable is itself a side-effect. Note that we only want to declare the temporary variables as volatile, not the shared global data arrays we're trying to prefetch. We don't want the main data structures to have this status. If our main global variables or arrays were declared volatile, this would actually interfere with keeping them in registers and the memory caches. They would effectively be uncached!
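For instance, a minimal warming pass over a hypothetical global order-book array (the names and sizes are illustrative, not from the book's code) might look like this:
constexpr int kLevels = 1024;        // hypothetical order-book depth
int g_order_book[kLevels];           // hot data: deliberately NOT declared volatile

// Dry-run traversal: each volatile read is a side-effect the compiler must keep,
// so the data is pulled into the caches without being modified.
void warm_order_book_volatile()
{
    for (int i = 0; i < kLevels; i += 16) {      // 16 ints = one 64-byte cache line
        volatile int temp = g_order_book[i];
        (void)temp;                              // silence unused-variable warnings
    }
}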
Dry-Run Executions
A simple approach to cache warming is to still execute all the steps, even if you’re not going to do anything. For example, in HFT, you could call the “execute trade” function even if the decision is to not trade any stocks.
The method is simply to pass a Boolean flag indicating a “dry run” or “test run” or “warm-up run” or whatever term you like. A simple conceptual example:
if (!dry_run) {
    orderobj.setup(ticker, price);
    execute_trade(orderobj);
}
A better way to get more cache warming is to populate all the objects as if you were going to actually do a trade. At the very last step, the dry-run flag is tested, and no trade gets submitted.
orderobj.setup(ticker, price);
if (!dry_run) {
    execute_trade(orderobj);
}
But we really want to warm up the entire path, even the trade execution logic. Hence, we go deeper by passing the flag inside:
orderobj.setup(ticker, price);
execute_trade(orderobj, dry_run);
And our trade execution code looks like:
void execute_trade(Order &order, bool dry_run)
{
    if (!dry_run) {
        g_order_count++; // Count total
        // Other accounting stuff...
        // Submit the order...
    }
}
That isn’t really much better, is it? We didn’t warm anything extra, but just pushed the test inside the function.
Double Data Trouble
We still need to actually prefetch some data! One way is to double up all our data. The basic data for order-count tracking looks like this:
int g_order_count = 0;
One common trick is to use an array of two values with two meanings:
- Live data
- Dry-run data (unused)
Hence, our order count becomes:
int g_order_count[2] = { 0, 0 };
Then we can try this:
if (!dry_run) {
    g_order_count[0]++; // Live run
}
else {
    g_order_count[1]++; // Dummy
}
The point of the dummy is that we access the [1] array element in order to warm up the [0] element (without changing it). This works because of "false sharing" of "cache lines," which is usually a slowdown problem, but here it offers an advantage: we can warm the cache by touching adjacent array elements, without disturbing the main data. (Note that here we don't use the alignas trick to avoid false sharing, because we actually want it to occur!)
In the spirit of branchless programming, we can make this code tighter by mapping the Boolean flag to 0 and 1 integer values:
g_order_count[(int)dry_run]++;
Note that we have actually added extra computation to our hot path! Instead of a global variable increment, it’s now an array index lookup plus the increment. We need to measure our optimizations to ensure that the gain from memory cache warming is greater than the extra cost of these array indexing operations. (We’ve also added a large amount of extra computation to our cold path, including whole extra function invocations, but we care less about that.)
Our conceptual trade execution routine starts to look like:
void execute_trade(Order &order, bool dry_run)
{
    g_order_count[(int)dry_run]++; // Count total
    // Other accounting stuff... same tricks
    if (!dry_run) {
        // Submit the order...
    }
}
The idea is that our “dry run” mode has run over as much of the code as possible, only stopping short of actually submitting the order. By maintaining two copies of all data, with dry-run and live values, we can prefetch all of those arrays into memory caches.
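Putting the pieces together, a hypothetical message handler (all names here are illustrative stand-ins, not a real trading API) would run the whole trade path on every incoming update, flipping the dry-run flag only when a real trade is wanted:
#include <string>

struct MarketData { std::string ticker; double price; };

struct Order {
    std::string ticker;
    double price = 0.0;
    void setup(const std::string& t, double p) { ticker = t; price = p; }
};

int g_order_count[2] = { 0, 0 };         // [0] = live count, [1] = dry-run dummy

void execute_trade(Order &order, bool dry_run)
{
    g_order_count[(int)dry_run]++;       // touches the counter's cache line either way
    // Other accounting stuff... same doubled-data trick
    if (!dry_run) {
        // Submit the order to the exchange...
    }
}

// Every market update exercises the trade path, keeping its data warm
// even when the decision is not to trade.
void on_market_data(const MarketData& md, bool trade_wanted)
{
    Order order;
    order.setup(md.ticker, md.price);
    execute_trade(order, /*dry_run=*/!trade_wanted);
}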
Problems with Cache Warming
The above cache warming double-array trick has used false sharing of cache lines for good, not evil. And yet it has a problem: false sharing.
Our use of false sharing was harmless (and helpful) because we assumed only a single thread was in use. There’s no cache invalidation slowdown when it’s only one thread. The cache warming idea for the L1 and L2 caches requires a single thread, although the L3 cache can be warmed for multiple threads. Hence, this cache warming idea has limitations:
- Single thread required for all order submissions (if you want L1/L2 cache warming).
- Thread pools and other multi-thread design patterns are therefore problematic.
We cannot really have a thread pool model where each consumer thread could potentially submit a trade. The above cache warming logic only works for one thread. If we try to use multiple threads, our cache warming logic is actually a cache freezing de-optimization, because we’ve got the “false sharing” problem for real.
Even worse, consider what happens if we try to use a thread pool model with the following modifications:
(a) multiple consumers, where each thread tries to decide whether to trade,
(b) single trade submission thread.
In other words, multiple decider threads, where each decider then hands off to the single trading thread (which is kept warm).
But then we've made another conceptual error. The hot path should really include the decision logic, as the overall latency runs from receiving incoming data to submitting a trade. However, we haven't kept the cache warm for these multiple "decider" threads, particularly for all the data they use in deciding whether to trade, so the decision modules won't run fast.
Possible solutions include:
- Single thread for all decision and order submission (with L1/L2 warming), or
- Keep multiple threads warm (tricky!), or
- Modify the cache warming code to use reads only, not writes, avoiding the cache invalidation problem (see the sketch at the end of this section), or
- Only warm up the L3 cache (for multiple threads).
But these solutions have additional problems:
- Single order thread idea lacks a failover or backup plan.
- Single order thread cannot issue two trades without blocking.
- Warming multiple threads means each thread needs its own copy of the data.
None of these solutions are great, so that’s why they pay you the big bucks.
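As a rough illustration of the read-only idea from the list above (reusing the doubled g_order_count array from earlier), the warming code can load the live counter through a volatile temporary rather than writing to a dummy slot, so the cache line is fetched without being dirtied and without invalidating other cores' copies:
extern int g_order_count[2];     // [0] = live count, [1] = dry-run dummy (as before)

// Read-only warming: a load brings the cache line in without marking it dirty,
// so it does not trigger the invalidations that a write would.
void warm_counters_readonly()
{
    volatile int temp = g_order_count[0];   // forced read; value is discarded
    (void)temp;
}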
Further Optimizing Cache Warming
A further iteration of advanced cache warming would be to actually submit a dummy order, for example if the exchange connectivity allows sending test-only transactions. Doing this would keep warm any data structures that sit inside the client API of the exchange connection.
The advantage of the use of dry-run cache warming is that all the various data structures used to prepare a trade are kept warm in the memory caches (L1/L2/L3). The downside is extra processing that occurs whenever you’re not trading. In other words, there are extra computations done on the “cold path” every time, just to keep the “hot path” all snuggly and warm.
The code to traverse all the memory data structures can be a significant cost in itself, although it only occurs during the cold path. There are several advanced tweaks to optimize your cache warming code:
- Exploit cache line sizes for quicker loading of contiguous data.
- Limit cache warming to the total L1/L2/L3 cache size.
A further optimization of cache warming is to use “cache lines” to your advantage. The L1/L2 caches don’t work on individual bytes, but on blocks of memory called “cache lines”, which are usually sized between 64 bytes and 256 bytes (e.g., Intel is usually 64 bytes, Apple M2 is 128 bytes, some other CPUs are 256 bytes). Hence, to load a “cache line” of 64 bytes on an Intel CPU, you only need to load one of the bytes from the 64-byte block. Your C++ code doesn’t need to explicitly touch every element of a vector to have the entire vector hot as a fresh-baked oven loaf in the cache. Admittedly, this doesn’t speed up the hot path itself, but only the preliminary cache warming code.
An important limitation of cache warming is the maximum sizes of the L1, L2, and L3 caches. If you’re trying to warm up the CPU cache for your 7B AI model, that’s 7 billion floating-point numbers, and trying to keep them all in the CPU cache isn’t going to work. On the other hand, you can probably preload an entire 7B model into the CPU RAM (i.e., global memory, not the caches), or into the GPU’s VRAM, but that’s preloading not cache warming, and it’s a slightly different story.
If you know your CPU’s cache size, you can optimize your cache warming strategy by only trying to prefetch that much data. If you load more data than the cache size, the newly warmed data is just evicting other data from the cache that you prefetched earlier in the warming code. Hence, prefetching exactly the amount of data equal to your CPU cache size is the optimal cache warming strategy.
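A minimal sketch combining both tweaks might look like the following; the 64-byte cache line and the 1 MB warming budget are assumptions you would tune for your actual CPU:
#include <cstddef>

constexpr std::size_t kCacheLineSize = 64;         // assumed; typical for Intel CPUs
constexpr std::size_t kWarmBudget    = 1u << 20;   // assumed 1 MB budget (roughly L2-sized)

// Warm at most kWarmBudget bytes, touching one byte per cache line.
void warm_buffer(const char* data, std::size_t size)
{
    const std::size_t limit = (size < kWarmBudget) ? size : kWarmBudget;
    for (std::size_t i = 0; i < limit; i += kCacheLineSize) {
        volatile char temp = data[i];    // one read pulls in the whole cache line
        (void)temp;
    }
}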
References
- Paul Bilokon, Burak Gunduz, 8 Sep 2023, C++ Design Patterns for Low-latency Applications Including High-frequency Trading, https://arxiv.org/abs/2309.04259, Code: https://github.com/0burak/imperial_hft
- Sarah Butcher & Alex McMurray, 2 January 2025, The C++ techniques you need for $600k hedge fund jobs, https://www.efinancialcareers.com/news/low-latency-c
- Edelweiss Global Markets Oct 14, 2024, Cache-Warming, https://edelweissgm.github.io/hft/2024/10/14/CacheWarming.html
- Ibrahim Essam, Jul 19, 2024, Cache warming and memory access, https://ibrahimessam.com/posts/cache/
- Nimrod Sapir, 2019, High-Frequency Trading and Ultra Low Latency Development Techniques, https://corecppil.github.io/CoreCpp2019/Presentations/Nimrod_High_Frequency_Trading.pdf, Code: https://github.com/DanielDubi/StaticFlatMap
- Daniel Lemire, April 2018, Is software prefetching (__builtin_prefetch) useful for performance? https://lemire.me/blog/2018/04/30/is-software-prefetching-__builtin_prefetch-useful-for-performance/
- Johnny’s Software Lab, March 31, 2024, The pros and cons of explicit software prefetching, https://johnnysswlab.com/the-pros-and-cons-of-explicit-software-prefetching/
- Katecpp, Oct 5, 2015, Improve performance with cache prefetching, http://katecpp.github.io/cache-prefetching/