Aussie AI
Chapter 4. Branch Prediction
-
Book Excerpt from "Advanced C++ Memory Techniques: Efficiency and Safety"
-
by David Spuler, Ph.D.
What is Branch Prediction?
Branch prediction is a CPU optimization that improves efficiency by anticipating upcoming branches. During execution, the CPU tries to predict which of the two paths of a branch is more likely to be taken. Some CPUs also perform "speculative execution" of future instructions to get ahead; this work must be discarded if the "wrong" branch is actually taken by the code.
For the programmer, these branch prediction capabilities give the opportunity to further optimize your code to capitalize on the CPU’s abilities. Optimization techniques for the C++ programmer include:
- Eliminating branches in the hotpath so that the code runs straight and narrow (i.e., fast!).
- Hinting to the compiler about the most likely branches (e.g., the [[likely]] and [[unlikely]] attributes).
- Keeping unavoidable branches in the same neighborhood (e.g., short loop bodies).
Branch prediction has a problem in HFT: the hot path (i.e., actually submitting a trade) is rarely executed. All of the branch prediction logic would favor the cold path, as it would always be predicted. But what we want is for the branch predictor to always choose the hot path, even though that prediction would usually be wrong. Thus, all of HFT is at odds with a whole swathe of computing theory about branch prediction. HFT needs a "set opposite world mode" flag, but I've yet to find one in the GCC documentation.
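As a small illustration of working around this (a hypothetical sketch; the function and thresholds are invented, not from any real trading system), one option is to deliberately invert the usual advice and mark the rare-but-critical branch as [[likely]] (C++20), so the compiler lays the trade path out as the straight-line code:

```cpp
#include <cstdint>

// Hypothetical sketch: the trade branch is rarely taken at runtime,
// but we mark it [[likely]] (C++20) anyway, so the compiler arranges
// the trade path as the fall-through, straight-line code.
inline int on_market_data(std::int64_t price, std::int64_t buy_threshold) {
    if (price < buy_threshold) [[likely]] {  // rare in practice, but latency-critical
        return 1;  // submit trade (the hotpath we want fast)
    }
    return 0;      // common do-nothing path
}
```

Note that this only affects static code layout and compiler heuristics; it does not change the CPU's dynamic predictor, which will still learn the real branch frequencies.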
Types of Branches
First things first: analyze your hotpath code for branching. The main types of branches in C++ code include:
- if statements and if-else statements.
- Loop conditions and loop bodies.
- Loop control statements: break, continue.
- Function calls and return statements.
- switch statements (multi-way branching).
Some of the less obvious types of branches are:
- Ternary operator (?:).
- Short-circuiting in the && and || operators.
There are also hidden branches in C++ code features such as:
- Virtual function calls
- Function pointers (and function names)
Branch Compiler Hints
There are several ways for the programmer to give "hints" to the compiler and its optimizer about which pathways are more likely. As always, the compiler is free to ignore hints, so you have to check the assembly output to see what effect your changes had. Some of the ways to give hints include:
- [[likely]] and [[unlikely]] path attributes (C++20).
- likely() and unlikely() condition macros (common wrappers for __builtin_expect; not standard C++).
- noexcept specifier (C++11).
- [[noreturn]] attribute (C++11).
- [[assume(expression)]] attribute (C++23).
GCC also has various extensions available to give hints:
- __builtin_expect(expression, value) (GCC extension).
- hot (GCC function attribute).
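As a sketch of the GCC extension in use (the likely/unlikely macro names are a widespread convention, e.g. from the Linux kernel, not part of GCC itself):

```cpp
// Conventional wrappers around the GCC/Clang built-in; the !! forces
// the expression to 0 or 1, which __builtin_expect compares against.
#define my_likely(x)   __builtin_expect(!!(x), 1)
#define my_unlikely(x) __builtin_expect(!!(x), 0)

// The optimizer is told the error branch is unlikely, so it can move
// that code out of the straight-line hotpath.
int parse_value(int raw) {
    if (my_unlikely(raw < 0)) {
        return -1;   // rare error path
    }
    return raw * 2;  // hot path
}
```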
Branch Profiling
Branch profiling is the recording of pathway statistics to determine the most likely branches. These statistics can then be fed back into the compiler's optimization mode, so that the optimizer can perform branch-aware optimizations. Hence, there is a two-step process whereby better branch prediction can be incorporated into your C++ executable code.
GCC has capabilities to store and use branch prediction statistics in its optimization phase. The arguments to use are:
- -fprofile-arcs (GCC command-line option).
- -fprofile-generate (GCC command-line option).
- -fprofile-use (GCC command-line option).
Following this process allows GCC to generate more optimal code under assumptions based on the branch frequencies seen in its profiled executions. Obviously, this is an automated method, but it needs multiple steps in the build:
- Compile without branch hints
- Run the tests
- Output the branch prediction data
- Re-compile the code with branch optimizations enabled
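The four build steps might look something like this (a sketch only; file names and inputs are placeholders, and exact flags can vary by GCC version):

```shell
# Step 1: build an instrumented binary that records branch counts
g++ -O2 -fprofile-generate main.cpp -o app

# Steps 2 and 3: run representative workloads; GCC writes *.gcda profile files
./app training_input.txt

# Step 4: rebuild, letting the optimizer use the recorded profile
g++ -O2 -fprofile-use main.cpp -o app
```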
Note that for HFT, the fully hot path (i.e., trade execution) is actually a rare branch, so this historical branch data won't be that useful. One solution is to run the profiling build in a test mode in which the hotpath is always dummy-executed! Other early parts of the HFT hotpath, such as the trading decision logic that runs on every incoming market data event, can still benefit either way. Obviously, non-HFT applications can always benefit, as the most likely paths are also the most heavily-executed.
Branch Heuristics
In the absence of other branch prediction data, the CPU and compiler tools fall back on some heuristics. Some of the common ones include:
- The if code block is more likely to be executed than the else code block.
- Loops tend to be executed multiple times.
- Backward branches are assumed to be loop iterations (and are predicted taken, per the prior assumption).
Hence, we can make some heuristic recommendations for how to organize your code:
- Put common-case code in the if block.
- Put error handling in the else block.
- Avoid loops that execute only once.
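Putting those recommendations together, a sketch of the preferred layout (the function and its "common case" are invented for illustration):

```cpp
// Common case in the if block, error handling in the else block,
// matching the static prediction heuristics described above.
int handle_length(int len) {
    if (len > 0) {        // common case: predicted taken
        return len * 2;   // normal processing
    } else {
        return -1;        // rare error path
    }
}
```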
Branch Elimination
The simplest way to avoid branch prediction issues is to have fewer branches. There are various ways to achieve this, ranging from minor code tricks to re-writing your entire algorithm to have fewer conditional tests.
Which branches to eliminate? The worst kinds of branches that need elimination include:
- Long if-else-if sequences
- Nested if-else statements
What data is being tested by a branch condition is also critical. Some of the most problematic branches are based on unpredictable conditions:
- Branches depending on user inputs
- Branches depending on random numbers
- Branches depending on system clocks
The best types of conditional tests include:
- Compile-time known tests
- Predictable conditions
The techniques available to eliminate your least favorite branches include:
- Reorganize the overall algorithm to have fewer branches.
- Defer or combine error checking for multiple errors so that there’s only one error handling branch.
- Function call optimizations such as inlining and call hierarchy flattening.
- Loop conditional test reductions such as loop unrolling and iteration bounds known at compile-time.
- Branchless programming techniques and tricks to change conditional paths to arithmetic computations.
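As one example of the error-combining idea (a sketch with invented names), several checks can be folded into a single flag so the hotpath pays for only one branch:

```cpp
// Bitwise OR evaluates all three tests with no short-circuit branches;
// only the final combined check is a branch, and it is well-predicted
// when errors are rare.
int process_triple(int a, int b, int c) {
    bool bad = (a < 0) | (b < 0) | (c < 0);
    if (bad) {
        return -1;        // single combined error branch
    }
    return a + b + c;     // hotpath
}
```

The trade-off is that all three tests are always evaluated, so this only pays off when the tests themselves are cheap.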
Branchless Programming Tricks
Branchless programming is a variety of coding tricks
to get rid of control flow branches.
The main approach is to remove conditional tests,
such as if statements,
by using a variety of arithmetic computations instead.
Code that has no branches in a long block
can run very fast on a CPU
because of instruction prefetching.
Advantages of branchless programming:
- Avoids branch prediction issues (CPU speedup).
- Avoids warp divergence in CUDA C++ (GPU speedup).
- Job security
Possible general software engineering disadvantages of these branchless arithmetic bit tricks:
- Code complexity — isn’t it a good thing?
- Unreadable code — as if we care.
- Maintainability — is someone else’s problem.
Even worse, the speed benefit might be a mirage. The issues include:
- De-optimizations from too many arithmetic operators — benchmark your tricks!
- Don’t underestimate the optimizer’s capability on simple code.
- Tricks can confuse the optimizer (undermining any benefit).
The types of methods for branchless coding include:
- Bit masks
- Bit arithmetic (bitshifts, bitwise AND/OR/XOR)
- Mapping Boolean flags to 0 or 1
- Mapping logical operator results to 0 or 1
- Lookup tables
- Conditional move (CMOV) assembly statements
- Ternary operator (?:)
Some of the more traditional C++ optimization techniques can also reduce branching:
- Loop code hoisting of conditional tests.
- Compile-time settings and configurations.
Ternary Operator and CMOV
Using the C++ ternary operator is one way to help the compiler write branchless code.
Consider the basic if statement:
if (x > y) {
max = x;
}
else {
max = y;
}
This can be more concisely written with a ternary operator:
max = (x > y) ? x : y;
The ternary operator can be implemented in the compiler backend using a CMOV (conditional move) register assignment statement. This is a branchless instruction that implements the conditional assignment very efficiently.
In theory, both pieces of code are equivalent, and the compiler really should generate identical code. In practice, the use of the ternary operator makes it easier on those poor compiler engineers, because it's 100% guaranteed that an assignment is required, whereas the if statement requires a significant amount of extra compile-time static analysis to deduce that both branches are setting the same variable. The C++ compiler is more likely to emit a branchless CMOV assembly instruction with a ternary operator.
Boolean Flags are 0 and 1
Another way to reduce branches is to use Boolean flags in arithmetic, using them as having the values of integer 0 and 1. Here’s a simple example:
bool inc_flag;
int x = 0;
if (inc_flag) {
x++;
}
This can be implemented in a branchless manner:
x += (int)inc_flag;
Note that the type cast to int is not strictly needed, but it helps with readability and avoids compiler or static analyzer warnings.
Whether that is faster needs testing, because it forces an addition operator into a pathway that previously had none; but at least it's branchless, which helps with branch prediction.
That was a simple example, but many other ideas are possible. Instead of this:
if (clear_flag) x = 0;
You can try this branchless version:
x *= (int)!clear_flag;
I’m betting that it’s actually slower, since multiplication is an expensive operation, but who’s to know without running a benchmark.
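The only way to settle that bet is a benchmark. Here's a minimal timing-harness sketch (illustrative only; real benchmarks need warmup, repetition, and care to stop the optimizer from deleting the measured loop entirely):

```cpp
#include <chrono>
#include <vector>

// Time a callable in nanoseconds using the steady clock.
template <typename F>
long long time_ns(F f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}

// Branchy version of the conditional clear.
int run_branchy(std::vector<int>& v, bool clear_flag) {
    for (int& x : v) { if (clear_flag) x = 0; }
    return v.back();
}

// Branchless version: multiply by 0 or 1.
int run_branchless(std::vector<int>& v, bool clear_flag) {
    int keep = (int)!clear_flag;
    for (int& x : v) { x *= keep; }
    return v.back();
}
```

Run both over the same data via time_ns and compare. A good optimizer may hoist the flag test out of the branchy loop entirely, which is exactly the kind of surprise benchmarking reveals.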
Logical Operators are 0 and 1
In the same vein, the Boolean values of the && and || operators can
be treated as 0 and 1 in integer arithmetic expressions.
Here’s an example of the maximum computation:
max = (x > y) * x + (y >= x) * y;
Again, the ternary operator’s CMOV instruction is probably faster than this de-optimization.
Bitwise XOR Tricks
There’s the well-known XOR trick to swap two integer variables without using a temporary:
x = x ^ y;
y = y ^ x;
x = x ^ y;
Don’t worry; nobody understands how this works. But it uses three assignments, no temporary variable, and no branches.
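For the skeptical, the trick rests on one identity: a ^ b ^ b == a. Annotating each step (x0 and y0 are the original values):

```cpp
// XOR swap, step by step. Caveat: if x and y refer to the SAME object,
// the first step zeroes it (x ^ x == 0), so never use this on aliases.
void xor_swap(unsigned& x, unsigned& y) {
    x = x ^ y;  // x == x0 ^ y0
    y = y ^ x;  // y == y0 ^ (x0 ^ y0) == x0
    x = x ^ y;  // x == (x0 ^ y0) ^ x0 == y0
}
```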
Sign Bit Extension Masks
If you’re doing any arithmetic with negative values, you can use bitwise tricks by creating two masks depending on the sign bit. The idea is that the bitmask is:
- All 0’s if the number is positive (or zero).
- All 1’s if the number is negative.
In other words, the bitmask is 32 bits all set to the same value as the sign bit. The bitmask value is either 0 or 0xFFFFFFFF (which is also the artist formerly known as -1). We can generate this using the right bitshift operator:
unsigned int mask = x >> 31;
Yes, I really should portably compute the bitshift count using CHAR_BIT and sizeof(int), as nicely done in [Farrier, 2025].
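A sketch of that portable version (assuming the C++20 arithmetic-shift guarantee, or a typical sign-extending platform on earlier standards):

```cpp
#include <climits>

// Derive the sign-bit position from the type's real width instead of
// hard-coding 31.
constexpr int kSignShift = sizeof(int) * CHAR_BIT - 1;  // 31 for 32-bit int

// All 1s for negative x, all 0s otherwise.
unsigned int sign_mask(int x) {
    // C++20 defines >> on negative values as an arithmetic (sign-extending)
    // shift; earlier standards leave the result implementation-defined.
    return (unsigned int)(x >> kSignShift);
}
```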
Example: RELU Activation Function
Let’s have a go at making the RELU function branchless. RELU is an “activation function” in LLM backends, and it’s quite simple:
if (x < 0) {
RELU = 0;
}
else {
RELU = x;
}
In other words, change negatives to zero, but leave positives unchanged. Here’s the ternary version (faster):
RELU = (x < 0) ? 0 : x;
The basic idea for a branchless, bitwise RELU is:
unsigned int umask = (x >> 31); // All 0’s or 1's
RELU = (x | umask);
Actually, that's buggy: the bit mask is the wrong way around, and the combining operator should be a bitwise-AND. Here's the correction:
unsigned int umask = ((-x) >> 31); // All 1's for positive x, all 0's for negative
RELU = (x & umask);
Beware that this might be a de-optimization: the ternary version might compile to a single CMOV instruction, whereas this version uses three operators: negation, right bitshift, and bitwise-AND.
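A common alternative formulation avoids the negation entirely (and with it the signed-overflow risk when x is INT_MIN): shift first, then invert the mask. A sketch, assuming a 32-bit int and an arithmetic right shift (guaranteed since C++20):

```cpp
// Branchless RELU via sign mask: x >> 31 is all 1s for negative x,
// all 0s otherwise; inverting and ANDing keeps positives, zeroes negatives.
int relu_bitwise(int x) {
    return x & ~(x >> 31);  // assumes 32-bit int and arithmetic shift
}

// Reference version for comparison.
int relu_ternary(int x) {
    return (x < 0) ? 0 : x;
}
```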
Sign Bitshift Portability
There's a major portability problem with this code: in C++17 and earlier, right bitshift on a negative signed integer gives an implementation-defined result. The compiler is free to shift in zero bits or to sign-extend from the leftmost bit position, at its sole discretion. (C++20 finally defines it as an arithmetic shift, i.e., sign-extending.) Hence, you need to check your platform and language standard to see what the >> operator does, and whether this rightshift bitmask idea will work.
Note that we cannot fix this by doing the right bitshift on an unsigned type, which is guaranteed to shift in a zero bit (well-defined in standard C++, but not what we want). Note also that left bitshift is no better: before C++20, left-shifting a negative value is actually undefined behavior, and in any case a left shift brings in zero bits on the right side (again, not what we want).
Of course, you can create the sign-based bitmask more portably by avoiding the right bitshift operator, but this loses the branchless benefits:
unsigned int mask = (x >= 0) ? 0 : 0xFFFFFFFF;
That’s safe and slow, and what’s the point of that?
Lookup Tables
Precomputing lookup tables is a fast way to get a double benefit: fast computation and branchless code. A good example in the standard C++ library is the set of character classification functions. Here's a slow, branching version:
#define islower(c) (((c) >= 'a') && ((c) <= 'z') )
This has lots of computation, and there are also branches in the short-circuiting of the && operator. A faster version uses a precomputed lookup table with 256 bytes:
#define islower(c) _islower_table[(unsigned char)(c)]
This is faster and branchless, at the cost of 256 bytes of global memory, and has already been done for you in the standard libraries by those uber-brainy compiler engineers.
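The library's internal table (called _islower_table above) isn't portable to poke at, but building your own equivalent is easy. A sketch:

```cpp
#include <array>

// Precompute a 256-entry table: 1 for 'a'..'z', 0 for everything else.
static std::array<unsigned char, 256> make_islower_table() {
    std::array<unsigned char, 256> t{};
    for (int c = 'a'; c <= 'z'; ++c) {
        t[(unsigned char)c] = 1;
    }
    return t;
}
static const std::array<unsigned char, 256> g_islower_table = make_islower_table();

// Branchless, table-driven classification: one indexed load, no tests.
inline bool is_lower_fast(unsigned char c) {
    return g_islower_table[c] != 0;
}
```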
Instruction Reordering Optimizations
Instruction reordering is an optimization performed inside the CPU, whereby it actually runs the machine code instructions out of order. In simple terms, it works like this:
- Delay any opcodes that don’t have the data they need (e.g., from memory).
- Run any instructions that are ready as soon as possible.
There’s a whole smash of fun to be had researching how this all works in the CPU. There are schedulers and “stations” and various queues and caches. Kudos to all those hardware engineers.
Another special type of fun is for compiler engineers. GCC does a lot of fancy optimizations in the code generation backend in terms of taking advantage of instruction orders.
But what about C++? Is there anything you can do in C++ to optimize your code? Or with inline assembly instructions?
Safety first. Most of the discussion of out-of-order execution and C++ occurs in relation to safety. Problems can arise across multiple threads if the reads and writes from our C++ statements run out-of-order. I mean, how can it be good to just run my C++ code in any random order that the CPU chooses?
Preventing out-of-order errors involves "memory order" constraints. These are especially useful for correctly implementing lock-free algorithms with atomics, but they also act as memory barriers that can prevent any undesirable types of out-of-order execution.
Speed second. But the goal is to go faster! Rather than stopping the CPU from reordering instructions by using memory barriers, let’s maximize it! There are at least two major ideas:
- Minimize memory-waiting delays
- Exploit out-of-order instructions
The first point is to minimize the slowdowns whereby instructions get delayed. The main culprit is memory access, which has well-known solutions such as: cache hit maximization, cache-line awareness, tiled memory accesses, contiguous memory blocks, reducing data sizes, etc.
Other than cache locality, there’s not a lot of discussion anywhere in books or on the internet about exploiting out-of-order instruction execution to make code run faster. But there’s some discussion of this in Agner Fog’s astounding CPU resources; see (Fog, 2024). The key point is:
Free extra parallelism!
The average CPU has hidden parallelism in terms of its various computation pathways. For example, the CPU can run these two computations in parallel:
- Integer arithmetic — Arithmetic-Logic Unit (ALU)
- Floating-point arithmetic — Floating-Point Unit (FPU)
That’s not the full list. Some CPUs can run different types of integer arithmetic, such as addition and multiplication, on separate pathways. Similarly, some of the SIMD operations run separately from the non-SIMD instructions.
So, you can see the opportunity here, right? Not only can the CPU run the same operations in parallel via SIMD instructions, but it can run two (or more!) different types of computations in parallel.
Unfortunately, the opportunities for huge improvements to your C++ are somewhat limited. For example, if you have a computation with both integer and floating-point computations, can you parallelize them? Yes, but only in limited circumstances, where:
- The two computations don't depend on each other's results.
- The computations don't require memory accesses.
- The operands are values already in CPU registers.
If there’s a dependency, they can’t run in parallel. And if they both require memory requests, that’s the bottleneck regardless of whether the instructions can run in parallel. The data needs to be already loaded from memory into CPU registers to run fast.
That’s quite a list of limitations. Hence, I haven’t quite solved the problem of a faster vector dot product using instruction out-of-order execution.
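On a smaller scale than a full dot product, the dependency-breaking idea can at least be sketched with multiple accumulators: splitting one serial addition chain into two independent chains gives the out-of-order core work it can overlap (a standard summation technique, shown here purely as an illustration):

```cpp
// Two accumulator chains: s0 and s1 have no dependency on each other,
// so their additions can execute in parallel inside the CPU.
float sum_two_chains(const float* a, int n) {
    float s0 = 0.0f, s1 = 0.0f;
    int i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];      // chain 0
        s1 += a[i + 1];  // chain 1, independent of chain 0
    }
    if (i < n) {
        s0 += a[i];      // leftover element for odd n
    }
    return s0 + s1;
}
```

Note that this also changes the floating-point summation order, so results can differ slightly from a strict left-to-right sum.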
References
- Paul Bilokon, Burak Gunduz, 8 Sep 2023, C++ Design Patterns for Low-latency Applications Including High-frequency Trading, https://arxiv.org/abs/2309.04259, Code: https://github.com/0burak/imperial_hft
- Sarah Butcher & Alex McMurray, 2 January 2025, The C++ techniques you need for $600k hedge fund jobs, https://www.efinancialcareers.com/news/low-latency-c
- Paul Alexander Bilokon, Maximilian Lucuta, Erez Shermer, 27 Aug 2023, Semi-static Conditions in Low-latency C++ for High Frequency Trading: Better than Branch Prediction Hints, https://arxiv.org/abs/2308.14185, Code: https://github.com/maxlucuta/semi-static-conditions (Advanced branch prediction analysis, a way to do branches by self-modifying code at assembly level.)
- John Farrier, March 2025, Branch Prediction: The Definitive Guide for High-Performance C++, https://johnfarrier.com/branch-prediction-the-definitive-guide-for-high-performance-c/
- Srdjan Delić, Apr 10, 2023, Branchless programming — Why your CPU will thank you, https://sdremthix.medium.com/branchless-programming-why-your-cpu-will-thank-you-5f405d97b0c8
- Jared Gorski, 11 August, 2020, Branchless programming, https://jaredgorski.org/notes/branchless-programming/
- Algorithmica, March 2025 (accessed), Branchless Programming, https://en.algorithmica.org/hpc/pipelining/branchless/
- Michael Kerrisk, Oct 5, 2012, How much do __builtin_expect(), likely(), and unlikely() improve performance? http://blog.man7.org/2012/10/how-much-do-builtinexpect-likely-and.html
- Agner Fog, 28 May, 2024 (last update), The microarchitecture of Intel, AMD, and VIA CPUs: An optimization guide for assembly programmers and compiler makers, https://www.agner.org/optimize/microarchitecture.pdf
- GCC, March 2025 (accessed), Common Function Attributes, https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html
- GNU, May 2025 (accessed), Adjusting the Instruction Scheduler, https://gcc.gnu.org/onlinedocs/gccint/Scheduling.html