Chapter 19. Cache Locality

  • Book Excerpt from "C++ AVX Optimization: CPU SIMD Vectorization"
  • by David Spuler, Ph.D.

What is Cache Locality?

Cache locality is the idea of staying “local” in our accesses to memory locations, so as to maximize the benefit of the hardware caches inside the CPU. There are two general categories of cache locality:

  • Instruction cache locality — machine code instruction execution.
  • Memory cache locality — data access from memory locations.

There’s a lot going on in the CPU in terms of caching accesses and also prefetching possible future accesses. Cache locality is the idea of ensuring that our C++ code maximizes the value of those hardware cache optimizations.

Caching occurs primarily at a lower level than multithreading, which means that each thread’s execution can benefit from these optimizations. Most of the methods to improve cache locality relate to the general code structure, rather than to specific ways of doing thread synchronization or other multithreading requirements. The general ideas include:

  • Tight code blocks and loops — instruction cache locality.
  • Localized and predictable memory access sequences — data cache locality.

You can do both together if you like, since they have orthogonal speedups. Easier said than done!

There are various tools you can use to examine the rates of cache hits and cache misses in the instruction or data caches. Some of the main ones include:

  • perf (Linux)
  • Cachegrind (part of Valgrind)
  • Intel VTune
  • gperftools
  • AMD uProf
  • likwid-perfctr

Depending on how you look at it, caching makes cache locality either more or less important in multithreaded applications versus sequential code. It’s more important in multithreading because we have lots of threads in different places doing different things, all of which need good cache locality. Or maybe it’s less important, because every context switch tends to evict a thread’s cached instructions and data, so why bother with cache locality? I’ll leave it to you to judge that.

Instruction Cache Locality

The instruction cache stores recently executed machine code instructions in a CPU hardware cache. There’s also a separate mechanism of “instruction prefetching” to try to load the next instruction that will be executed. As part of this prefetching method, there’s also “branch prediction” in the CPU, which attempts to predict which of two branch directions will get chosen.

To get the best out of these instruction speedups, our C++ code should generally use:

  • Short and tight loops
  • Fewer branches

Keeping loops short will mean that the CPU stays within the same block of code, maximizing the chances that it already has an instruction in its cache. Interestingly, this means that some common code optimizations can be bad for instruction cache locality:

  • Inlining of functions
  • Loop unrolling

Both of these can cut both ways, since they both reduce branches, but also lengthen code blocks. Whenever you’re tempted to maximize your use of such optimizations, think about the plight of the poor instruction cache as it tries to keep up.

Branches are a separate issue from short code blocks. In fact, long code sequences of compute instructions are fine for branch prediction. To maximize the CPU’s branch prediction capability, we should either have few branches, or at least have very predictable branches. At the limit, we could use branchless programming, which is a set of tricks to get rid of branches. See Chapter 4 for more on branch prediction and branchless coding methods.
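
As a minimal sketch of the branchless style (my own illustration, not taken from Chapter 4), here is a maximum function written with and without a branch. The branchless version selects the result with a bitmask instead of a conditional jump, although a modern compiler may well emit a conditional move for either form:

    // Branchy version: the CPU must predict the outcome of the comparison.
    int max_branchy(int a, int b) {
        if (a > b)
            return a;
        return b;
    }

    // Branchless version: build an all-ones or all-zeros mask from the
    // comparison result and select the answer with bitwise operations.
    int max_branchless(int a, int b) {
        int mask = -(a > b);             // -1 (all ones) if a > b, else 0
        return (a & mask) | (b & ~mask);
    }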

Data Cache Locality

There are numerous changes you can make to improve the cache locality of memory accesses. And there are rather a lot of different caches involved in CPU memory accesses:

  • L1 and L2 caches (per-core)
  • L3 cache (shared across cores)
  • TLB (caches virtual-to-physical address translations)
  • NUMA effects (memory attached to another socket or core cluster is slower to access)

There are some general recommendations for the entire application that aim to reduce memory cache misses:

  • Use less memory!
  • Fewer memory allocations
  • Smaller data sizes

But particular algorithms can also be modified to keep nearby memory in the caches. Data structures can affect the level of cache locality, with improvements such as:

  • Separate cold data from hot data — improve cache locality for use of hot data.
  • Structure of Arrays (SoA) vs Array of Structures (AoS) — which one is best depends on the context (see the sketch after this list).
  • Contiguous data structures — arrays and vectors, not linked lists or binary trees.
  • Compact data structures — smaller memory sizes are easier to maintain in the cache.
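
To make the SoA versus AoS distinction concrete, here is a minimal sketch (the struct names are mine, not from the book). When a loop touches only one field, the SoA layout keeps that field contiguous in memory, which favors both cache locality and SIMD vectorization:

    #include <vector>

    // Array of Structures (AoS): each point's x, y, z are adjacent in memory.
    struct PointAoS {
        float x, y, z;
    };

    // Structure of Arrays (SoA): all x values are contiguous, then all y, then all z.
    struct PointsSoA {
        std::vector<float> x, y, z;
    };

    // Summing only the x coordinates walks contiguous memory in the SoA layout,
    // whereas the AoS layout drags the unused y and z fields through the cache.
    float sum_x(const PointsSoA &points) {
        float sum = 0.0f;
        for (float xv : points.x)
            sum += xv;
        return sum;
    }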

The code execution of various algorithms can alter the sequence of memory accesses, and thereby maximize cache locality. Some well-known improvements include:

  • Loop segmenting — process short sub-sequences of a longer array.
  • Tiling algorithms — process 2D “tiles” in a matrix or multidimensional data structure (also called “blocking”).

The goal of these algorithm modifications is to iterate over a small sub-section of the data, keeping cache locality during that “hot” computation, and then move on to the next part. This works particularly well with matrix multiplication, because it involves multiple computations with every element of the matrix.
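
Here is a rough sketch of a tiled (blocked) matrix multiplication, assuming row-major n x n matrices; the tile size of 32 is an illustrative guess that would need tuning for the target cache sizes:

    #include <algorithm>
    #include <cstddef>

    // Tiled (blocked) matrix multiply: computes C += A * B for n x n row-major
    // matrices. The caller should zero-initialize C first. Each tile keeps a
    // small working set of A, B and C in the cache while it is processed.
    void matmul_tiled(const float *A, const float *B, float *C, std::size_t n) {
        const std::size_t T = 32;  // tile size: an assumption, tune for your caches
        for (std::size_t i0 = 0; i0 < n; i0 += T)
            for (std::size_t k0 = 0; k0 < n; k0 += T)
                for (std::size_t j0 = 0; j0 < n; j0 += T)
                    for (std::size_t i = i0; i < std::min(i0 + T, n); ++i)
                        for (std::size_t k = k0; k < std::min(k0 + T, n); ++k) {
                            float a = A[i * n + k];
                            for (std::size_t j = j0; j < std::min(j0 + T, n); ++j)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }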

There are also some dynamic approaches whereby you can manually ensure that data is already in the cache when you need it:

  • Memory prefetching
  • Cache warming

See Chapter 3 for more about prefetching and cache warming.
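
For memory prefetching specifically, GCC and Clang offer the __builtin_prefetch intrinsic as a hint to pull a cache line in early. Here is a small sketch; the prefetch distance of 16 elements is an assumption rather than a tuned value, and the hint may be ignored by the CPU:

    #include <cstddef>

    // Sum an array while prefetching a fixed distance ahead of the current index.
    // __builtin_prefetch is a GCC/Clang builtin (MSVC has _mm_prefetch instead).
    float sum_with_prefetch(const float *data, std::size_t n) {
        const std::size_t ahead = 16;  // prefetch distance: an untuned guess
        float sum = 0.0f;
        for (std::size_t i = 0; i < n; ++i) {
            if (i + ahead < n)
                __builtin_prefetch(&data[i + ahead]);  // hint only
            sum += data[i];
        }
        return sum;
    }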

Memory Hierarchy

To fully understand the caches, we need to know all the different types of memory used in a C++ program. Handling memory properly is one of the most important parts of C++ optimization, because memory access is much slower than CPU computation. Memory is the bottleneck, and you need to know where the compiler puts everything.

Learn to love the linker-loader!

When your program starts running, the “loader” puts all sorts of things in different places. The basic moving parts that happen before execution starts are:

  • Instructions — the code’s machine instructions.
  • Global read-write memory — initialized or zero-initialized global variables.
  • Read-only data — string literal data.

To get deeper into the memory segments used by the linker-loader, these are the main ones:

  • text — stores the machine code instructions (read-only, executable)
  • bss — zero-initialized global data, such as global arrays without non-zero initializers (read-write)
  • data — initialized non-zero global variable data (read-write)
  • rodata — read-only data such as string literals or constant globals (read-only)

Yes, the “text” segment has a confusing name, and it’s sometimes called the “code” segment. According to Wikipedia, BSS stands for “Block Started by Symbol,” but you didn’t need to know that.
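
As a rough illustration (the variable names are mine, and exact placement is compiler- and platform-dependent), here is where a few typical declarations usually end up:

    #include <cstdio>

    int g_zeroed[1000];               // bss: zero-initialized global (read-write)
    int g_counter = 42;               // data: initialized non-zero global (read-write)
    const char *g_message = "hello";  // the "hello" literal typically goes in rodata
    const int g_limit = 100;          // rodata: constant global (read-only)

    int main() {                      // the compiled instructions go in text
        std::printf("%d %d %s %d\n", g_zeroed[0], g_counter, g_message, g_limit);
        return 0;
    }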

All of the above segments are statically resolved, for the most part, by the linker. However, once the program gets going, there are more dynamic allocations of memory within its virtual address space. The main types of dynamic memory are:

  • Stack memory — the function call stack with parameters and local variables (also alloca).
  • Heap memory — dynamically allocated by the new operator or malloc function.
  • Thread-local storage — via the “thread_local” keyword (C++11).
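
A tiny sketch of those three kinds of dynamic memory (the names are illustrative only):

    thread_local int g_tls_counter = 0;    // thread-local storage: one copy per thread

    void dynamic_memory_example() {
        int local_var = 42;                // stack memory: gone when the function returns
        int *heap_buffer = new int[1000];  // heap memory: lives until delete[]
        heap_buffer[0] = local_var;
        ++g_tls_counter;                   // persists for the lifetime of this thread
        delete[] heap_buffer;
    }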

See Chapter 8 for more about reducing stack and heap memory, and now let’s discuss thread-local storage.

Thread-Local Storage

Thread-Local Storage (TLS) is memory that is exclusive to a particular thread. The other threads do not have access to it. In C++, this is defined via the “thread_local” keyword, available since C++11. The usage is simple:

    thread_local int tls_variable;

There are also some earlier and non-standard versions:

  • _Thread_local — the C11 keyword for thread-local storage in C.
  • __thread — GCC non-standard modifier with similar semantics.
  • __declspec(thread) — the equivalent in Microsoft Visual C++.

The key features of thread_local variables are:

  • Accessible in one thread only.
  • Persistent memory storage.
  • Variables, objects or arrays only (cannot have a thread_local function).

Per-thread access. If you declare a variable as “thread_local” then the C++ compiler has to enforce those semantics. Every access to that variable in C++ must go to the current thread’s version of it. Typically, this means that the variable has multiple copies, with different addresses for each thread.

How is it implemented? It’s not necessarily using any particular hardware support behind the scenes, and it’s not necessarily using any magic per-thread caching. The C++ compiler can allocate different addresses per thread for the same variable, and then ensure that accesses within each thread get the correct version. After all, the C++ compiler knows that a particular variable is “thread_local” because it’s declared with that storage specifier.

Persistent memory semantics. The thread_local specifier is very similar to the static keyword in terms of its memory persistence. Its effect is similar to:

  • Global variables (with external linkage)
  • static file-scope variables
  • static local variables (in a function)
  • static data members (in a C++ class)

A thread_local variable is created when a thread starts and destroyed when the thread finishes. This has some implications:

  • Only the main thread’s copy exists at program startup.
  • Each new thread’s copy is created dynamically, along with the thread itself.
  • A copy does not persist across thread shutdown and restart.

Note that persistence and scope are different things. Persistence is whether the data is maintained across multiple accesses, whereas scope is simply whether its name can be referenced within code statements.

For example, if you use a thread_local variable as a local variable in a function, its value will persist across invocations of that function, and it will always have the same address. However, its scope is limited to within the function, where its name is accessible. This is the same as a static local variable, but with the extra semantics that each thread sees only its own version. If multiple threads call the function, they’ll get different versions of the thread_local variable inside the function.
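
Here is a small sketch of that behavior (the function and variable names are mine): each thread calling the function gets its own persistent counter, so both threads report a count of 3.

    #include <cstdio>
    #include <thread>

    int bump_counter() {
        thread_local int count = 0;   // one persistent copy per calling thread
        return ++count;
    }

    void worker(const char *name) {
        int last = 0;
        for (int i = 0; i < 3; ++i)
            last = bump_counter();
        std::printf("%s count: %d\n", name, last);   // each thread prints 3
    }

    int main() {
        std::thread t1(worker, "thread 1");
        std::thread t2(worker, "thread 2");
        t1.join();
        t2.join();
        return 0;
    }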

Thread-local variables occupy a special niche in the programmer’s bag of tricks. You don’t need to wrap accesses with any locking or other synchronizations, which is nice. They are like atomics, in that they cannot be messed up by another thread, but unlike atomics because they are not shared across threads. The main usage is to have some shared code, but also have a special non-shared variable, especially where you want the variable to persist, such as having per-thread counters, flags, intermediate calculations, and so on.

References

  1. Wikipedia, May 2025 (accessed), .bss, https://en.wikipedia.org/wiki/.bss
  2. Milan Stevanovic, 2014, Advanced C and C++ Compiling, Apress, https://www.amazon.com.au/dp/B01HXFLQH0/
  3. John R. Levine, 1999, Linkers and Loaders, Morgan Kaufmann, https://www.amazon.com/dp/1558604960
  4. cppreference.com, May 2025 (accessed), Storage class specifiers, https://en.cppreference.com/w/c/language/storage_class_specifiers.html
  5. Microsoft, 2021, Thread Local Storage (TLS), https://learn.microsoft.com/en-us/cpp/parallel/thread-local-storage-tls
  6. Larry Jones, 27 Feb 2025, Mastering Concurrency and Multithreading in C++: Unlock the Secrets of Expert-Level Skills, https://www.amazon.com.au/Mastering-Concurrency-Multithreading-Secrets-Expert-Level-ebook/dp/B0DYSB519C/

 
