Aussie AI

Instruction Cache Locality

  • Book Excerpt from "C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations"
  • by David Spuler, Ph.D.

The instruction cache is a CPU hardware cache that stores recently executed machine code instructions. There’s also a separate mechanism, “instruction prefetching,” which tries to load the next instructions before they are needed. As part of this prefetching, the CPU also performs “branch prediction,” which attempts to predict which way a conditional branch will go.

To get the best out of these instruction speedups, our C++ code should generally use:

  • Short and tight loops
  • Fewer branches

Keeping loops short means that the CPU stays within the same block of code, maximizing the chances that the next instruction is already in its cache. Interestingly, this means that some common code optimizations can be bad for instruction cache locality:

  • Inlining of functions
  • Loop unrolling

Each of these cuts both ways: it removes branches or function calls, but it also lengthens the code. Whenever you’re tempted to maximize your use of such optimizations, think about the plight of the poor instruction cache as it tries to keep up.
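As a rough illustration of the loop unrolling trade-off, here is a minimal sketch with two versions of the same summation. The function names sum_compact and sum_unrolled4 are hypothetical, chosen only for this example. The compact loop is a short block of code with one loop branch per element; the unrolled loop has about a quarter as many loop branches, but its body is several times longer. Note that an optimizing compiler may unroll or vectorize the compact version on its own, so which one is faster is something to measure, not assume.

```cpp
#include <cassert>
#include <cstddef>

// Compact form: a short loop body that occupies little instruction cache,
// but executes one loop branch per element.
long long sum_compact(const int* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    long long total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}

// Manually unrolled by 4: about a quarter of the loop branches,
// at the cost of a longer code block.
long long sum_unrolled4(const int* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    long long total = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        total += data[i];
        total += data[i + 1];
        total += data[i + 2];
        total += data[i + 3];
    }
    for (; i < n; ++i) {  // handle the 0 to 3 leftover elements
        total += data[i];
    }
    return total;
}
```

Both functions return the same result for any input; the only difference is the shape of the generated code. In a small program the longer version will probably still fit in the instruction cache, so the cost of unrolling tends to show up only when many such expanded blocks compete for cache space.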

Branches are a separate issue from code length. In fact, long sequences of compute instructions are fine for branch prediction, because they contain no branches to predict. To maximize the CPU’s branch prediction capability, we should either have few branches, or at least have very predictable branches. At the limit, we could use branchless programming, which is a set of tricks to get rid of branches. See Chapter 4 for more on branch prediction and branchless coding methods.
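One common branchless trick is to turn a data-dependent “if” into arithmetic. The sketch below is only an illustration, with hypothetical function names count_above_branchy and count_above_branchless. The first version has a conditional branch whose direction depends on the data, which is hard to predict when the values are random. The second adds the result of the comparison (which converts to 0 or 1) directly to the counter, so the loop body has no data-dependent branch. A compiler may already generate branch-free code for the first version, so it is worth checking the generated assembly or timing both.

```cpp
#include <cassert>
#include <cstddef>

// Branching version: the 'if' is a conditional branch that depends on the data.
std::size_t count_above_branchy(const int* data, std::size_t n, int threshold) {
    assert(data != nullptr || n == 0);
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (data[i] > threshold) {
            ++count;
        }
    }
    return count;
}

// Branchless version: the comparison yields false or true, which converts
// to 0 or 1, so the result is added without a data-dependent branch.
std::size_t count_above_branchless(const int* data, std::size_t n, int threshold) {
    assert(data != nullptr || n == 0);
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        count += static_cast<std::size_t>(data[i] > threshold);
    }
    return count;
}
```

The loop condition itself is still a branch in both versions, but it is taken on every iteration except the last, which makes it very predictable.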
