Aussie AI
Chapter 16. Branch Prediction
-
Book Excerpt from "C++ AVX Optimization: CPU SIMD Vectorization"
-
by David Spuler, Ph.D.
What is Branch Prediction?
Branch prediction is a CPU optimization whereby efficiency is improved by anticipating upcoming branches. As it executes, the CPU tries to predict which of the two paths of a branch is more likely to be taken. Some CPUs also perform “speculative execution” of future instructions to get ahead; that speculative work must be discarded if the “wrong” branch is actually taken by the code.
For the programmer, these branch prediction capabilities give the opportunity to further optimize your code to capitalize on the CPU’s abilities. Optimization techniques for the C++ programmer include:
- Eliminating branches in the hotpath so that the code runs straight and narrow (i.e., fast!).
- Hinting to the compiler about the most likely branches (e.g., [[likely]] and [[unlikely]] attributes).
- Keeping unavoidable branches in the same neighborhood (e.g., short loop bodies).
Branch prediction has a problem in HFT: the hot path is rarely executed (i.e., actually submitting a trade). The branch prediction logic will learn to predict the cold path, since that is what almost always runs. But what we want is for the branch predictor to always choose the hot path, even though that prediction would usually be wrong. Thus, all of HFT is at odds with a whole swathe of computing theory about branch prediction. HFT needs a “set opposite world mode” flag, but I’ve yet to find one in the GCC documentation.
Types of Branches
First things: analyze your hotpath code for branching. The main types of branches in C++ code include:
- if statements and if-else statements.
- Loop conditions and loop bodies.
- Loop control statements: break, continue.
- Function calls and return statements.
- switch statements (multi-way branching).
Some of the less obvious types of branches are:
- Ternary operator (?:)
- Short-circuiting in the && and || operators
There are also hidden branches in C++ code features such as:
- Virtual function calls
- Function pointers (and function names)
Branch Compiler Hints
There are several ways for the programmer to give “hints” to the compiler and its optimizer about which pathways are more likely. As always, the compiler is free to ignore hints, so you have to check in the assembly output what effect your changes have. Some of the ways to give hints include:
- [[likely]] and [[unlikely]] path attributes (C++20).
- likely() and unlikely() condition macros (commonly defined via __builtin_expect; see below).
- noexcept specifier (C++11).
- [[noreturn]] attribute (C++11).
- [[assume(expression)]] attribute (C++23).
GCC also has various extensions available to give hints:
- __builtin_expect(expression, value) (GCC builtin)
- hot (GCC function attribute)
It’s common in pre-C++20 Linux code to define your own macro versions for use with the GCC compiler:
#define likely(expr) __builtin_expect((expr), 1)
#define unlikely(expr) __builtin_expect((expr), 0)
Branch Profiling
Branch profiling is the recording of pathway stats to analyze the most likely branches. This can also be re-used in the compiler’s optimization mode, so that the optimizer can perform branch-aware optimizations. Hence, there is a two-step process whereby better branch prediction can be incorporated into your C++ executable code.
GCC has capabilities to store and use branch prediction statistics in its optimization phase. The arguments to use are:
- -fprofile-arcs (GCC command-line option)
- -fprofile-generate (GCC command-line option)
- -fprofile-use (GCC command-line option)
Following this process allows GCC to generate more optimal code based on the branch frequencies observed in the profiled runs. Obviously, this is an automatic method, but it needs multiple steps in the build:
- Compile with profiling instrumentation (e.g., -fprofile-generate).
- Run the tests on representative inputs.
- The instrumented binary writes out the branch profile data.
- Re-compile with -fprofile-use so the optimizer can apply branch-aware optimizations.
Note that for HFT, the fully hot path (i.e., trade execution) is actually a rare branch, so this historical branch data won’t be that useful. One solution is to run GCC in a test mode in which the hotpath is always dummy-executed! Other early parts of the hotpath in HFT can still benefit in both situations, such as the trading decision logic, which is always executed on incoming market data. Obviously, non-HFT applications can always benefit, as the most likely paths are also the most heavily-executed.
Branch Heuristics
In the absence of other branch prediction data, the CPU and compiler tools fall back on some heuristics. Some of the common ones include:
- The if code block is more likely to be executed than the else code block.
- Loops tend to be executed multiple times.
- Backwards branches are assumed to be loop iterations (and are predicted taken, due to the prior assumption).
Hence, we can make some heuristic recommendations for how to organize your code:
- Put common-case code in the if block.
- Have error handling in the else block.
- Avoid loops that only ever execute once.
Branch Elimination
The simplest way to avoid branch prediction issues is to have fewer branches. There are various ways to achieve this, ranging from minor code tricks to re-writing your entire algorithm to have fewer conditional tests.
Which branches to eliminate? The worst kinds of branches that need elimination include:
- Long if-else-if sequences
- Nested if-else statements
What data is being tested by a branch condition is also critical, and some of the problematic branches are based on unpredictable conditions:
- Branches depending on user inputs
- Branches depending on random numbers
- Branches depending on system clocks
The best types of conditional tests include:
- Compile-time known tests
- Predictable conditions
The techniques available to eliminate your least favorite branches include:
- Reorganize the overall algorithm to have fewer branches.
- Defer or combine error checking for multiple errors so that there’s only one error handling branch.
- Function call optimizations such as inlining and call hierarchy flattening.
- Loop conditional test reductions such as loop unrolling and iteration bounds known at compile-time.
- Branchless programming techniques and tricks to change conditional paths to arithmetic computations.
Branchless Programming Tricks
Branchless programming is a variety of coding tricks
to get rid of control flow branches.
The main approach is to remove conditional tests,
such as if statements,
by using a variety of arithmetic computations instead.
Code that has no branches in a long block
can run very fast on a CPU
because of instruction prefetching.
Advantages of branchless programming:
- Avoids branch prediction issues (CPU speedup).
- Avoids warp divergence in CUDA C++ (GPU speedup).
- Job security
Possible general software engineering disadvantages of these branchless arithmetic bit tricks:
- Code complexity — isn’t it a good thing?
- Unreadable code — as if we care.
- Maintainability — is someone else’s problem.
Even worse, the speed benefit might be a mirage. The issues include:
- De-optimizations from too many arithmetic operators — benchmark your tricks!
- Don’t underestimate the optimizer’s capability on simple code (even if it’s “branchy”).
- Code tricks can confuse the optimizer (undermining any benefit).
- Memory access costs may dominate over branchless code.
One of the risks with branchless code is that it runs too fast, and gets blocked by memory access delays. Hence, you may need to combine branchless code sequences with software-based memory prefetch primitives, such as:
- __builtin_prefetch() (GCC/Clang builtin)
- _mm_prefetch() (SSE intrinsic)
Branchless Coding Techniques
Now, let’s look at some of the fun tricks in branchless C++ sequences. The types of methods for branchless coding of basic sequential CPU code include:
- Bit masks
- Bit arithmetic (bitshifts, bitwise AND/OR/XOR)
- Mapping Boolean flags to 0 or 1
- Mapping logical operator results to 0 or 1
- Multiplications by 0 or 1 using Booleans
- Lookup tables (maybe)
- Conditional move (CMOV) assembly statements
- Ternary operator (?:)
How can we do that in AVX? Note that all SIMD operations are branchless and efficient. All of the basic arithmetic should be vectorized into SIMD operations where possible. Branches can also be removed in parallel using some types of AVX instructions. The AVX equivalents of these branchless coding ideas include:
- Boolean flags — map to 0 or all 1’s (0xFFFFFFFF, i.e., -1) in AVX.
- Bitmask tricks — permute/shuffle AVX intrinsics.
- CMOV/ternary operator — combined “cmp” and “blend” AVX primitives.
Some of the more traditional C++ optimization techniques can also reduce branching as an extra benefit:
- Loop code hoisting of conditional tests.
- Compile-time settings and configurations.
Ternary Operator and CMOV
Using the C++ ternary operator is one way to help the compiler write branchless code.
Consider the basic if statement:
if (x > y) {
max = x;
}
else {
max = y;
}
This can be more concisely written with a ternary operator:
max = (x > y) ? x : y;
The ternary operator can be implemented in the compiler backend using a CMOV (conditional move) instruction. This is a branchless instruction that implements the conditional assignment very efficiently.
In theory, both pieces of code are equivalent, and the compiler
really should generate identical code.
In practice, the use of the ternary operator
makes it easier on those poor compiler engineers,
because it’s 100% guaranteed that an assignment is required,
whereas the if statement requires a significant amount of extra compile-time static
analysis
to deduce that both assignments are setting the same variable.
The C++ compiler is more likely to emit a branchless CMOV assembly statement
with a ternary operator.
Boolean Flags are 0 and 1
Another way to reduce branches is to use Boolean flags in arithmetic, using them as having the values of integer 0 and 1. Here’s a simple example:
bool inc_flag;
int x = 0;
if (inc_flag) {
x++;
}
This can be implemented in a branchless manner:
x += (int)inc_flag;
Note that the type cast to int is not really needed, but helps with readability,
and ensures you don’t get compiler or static analyzer warnings.
Whether that is faster needs testing, because it forces an addition operator into a pathway that previously had none; but at least it’s branchless, so it helps with branch prediction.
That was a simple example, but many other ideas are possible. Instead of this:
if (clear_flag) x = 0;
You can try this branchless version:
x *= (int)!clear_flag;
It’s not clear that this is faster, since multiplication is an expensive operation, but a good compiler can actually notice that it’s a fake multiplication over two possible values (0 and 1), and the optimizer can then use a CMOV instruction. Who’s to know without checking the assembly code or running a benchmark.
Logical Operators are 0 and 1
In the same vein, the Boolean values of the && and || operators can
be treated as 0 and 1 in integer arithmetic expressions.
Here’s an example of the maximum computation:
max = (x > y) * x + (y >= x) * y;
Note that the optimizer can notice that a multiplication over a Boolean operand can be replaced with a CMOV, and there are two here. Again, the ternary operator’s single CMOV instruction is probably faster than this possible de-optimization, because this version has either two multiplications or two CMOV instructions.
Bitwise XOR Tricks
There’s the well-known XOR trick to swap two integer variables without using a temporary:
x = x ^ y;
y = y ^ x;
x = x ^ y;
Don’t worry; nobody understands how this works. But it uses three assignments, no temporary variable, and no branches. (One caveat: if x and y are the same object, the first XOR zeroes it, and both end up as zero.)
Self XOR to Zero
There’s also a well-known assembly language trick
of zeroing a register using XOR with itself.
The idea is that instead of an “x=0” statement, do this:
x ^= x; // Self XOR
The result is zero, and we don’t even need to initialize the variable! However, we don’t usually do this in C++ (reading an uninitialized variable is technically undefined behavior there), but the equivalent is common in assembly listings and compiler backend implementations.
Sign Bit Extension Masks
If you’re doing any arithmetic with negative values, you can use bitwise tricks by creating two masks depending on the sign bit. The idea is that the bitmask is:
- All 0’s if the number is positive (or zero).
- All 1’s if the number is negative.
In other words, the bitmask is 32 bits all set to the same bit value as the sign bit.
The bitmask value is either 0 or 0xFFFFFFFF, which is also the artist formerly known as -1.
One way is a ternary operator:
unsigned int mask = (x >= 0) ? 0 : 0xFFFFFFFFu;
We can also generate this bitmask using the right bitshift operator and sign extension:
unsigned int mask = x >> 31;
Yes, I really should portably compute the bitshift count
using CHAR_BIT and sizeof(int) as nicely done in [Farrier, 2025].
Subtraction Bit Mask
Another way to get the same result is by noting the joke about -1 being the same value.
Hence, this trick with subtraction on 2’s complement signed integers works:
unsigned int mask = (unsigned) ( (int)(x >= 0) - 1 );
The comparison generates an integer 0 or 1, and then we subtract 1 to get either 0 or 0xFFFFFFFF.
Hence, the “>=” comparison stays the same as in the ternary version; it’s the subtraction that flips 1 to 0 and 0 to all 1’s.
All of the type casts are “free” without runtime costs, and are probably not necessary
because implicit conversions would work, anyway.
Example: RELU Activation Function
Let’s have a go at making the RELU function branchless. RELU is an “activation function” in LLM backends, and it’s quite simple:
if (x < 0) {
RELU = 0;
}
else {
RELU = x;
}
In other words, change negatives to zero, but leave positives unchanged. Here’s the ternary version (faster):
RELU = (x < 0) ? 0 : x;
The mask-by-subtraction version reverses the comparison to “<” (so the mask is all 1’s exactly when x is non-negative) and combines with bitwise-AND to get:
unsigned int mask = (unsigned)((int)(x < 0) - 1);
RELU = x & mask;
Another idea for a branchless version of a bitwise RELU is:
unsigned int umask = (x >> 31); // All 0’s or all 1’s
RELU = (x | umask);
Actually, that’s buggy, with the bit masking the wrong way around: the mask is all 1’s precisely for the negative values we want to zero out. Here’s the correction, complementing the mask and switching to bitwise-AND:
unsigned int umask = ~(x >> 31); // All 1’s if x >= 0, all 0’s if x < 0
RELU = (x & umask);
Beware this might be a de-optimization, because the ternary version might be a single CMOV instruction, whereas this version has three operators: right bitshift, bitwise-NOT, and bitwise-AND.
Sign Bitshift Portability
There’s a major portability problem with this code, because right bitshift on a negative signed integer is implementation-defined behavior before C++20 (C++20 finally pins it down as an arithmetic shift). An older implementation is free to shift in zero bits or to sign-extend the leftmost bit, at its sole discretion. Hence, you need to check your platform (and language standard) to see what the >> operator does, and whether this right-shift bitmask idea will work.
Note that we cannot fix this by doing the right bitshift on an unsigned type,
which is guaranteed to shift in a zero bit
(well-defined in standard C++, but not what we want).
Note also that left bitshift is no escape: it always shifts zero bits in on the right side (again, not what we want), and left-shifting a negative signed value was likewise undefined behavior before C++20.
Of course, you can create the sign-based bitmask more portably by avoiding the right bitshift operator, but this loses the branchless benefits:
unsigned int mask = (x >= 0) ? 0 : 0xFFFFFFFF;
That’s safe and slow, and what’s the point of that?
Lookup Tables
Precomputation of lookup tables is a fast way to get a double benefit of fast computation and branchless code. A good example in the standard C++ library are the functions for character types. Here’s a slow branching version:
#define islower(c) (((c) >= 'a') && ((c) <= 'z') )
This has multiple comparisons, branches in the short-circuiting of the && operator, and the classic macro hazard of evaluating its argument twice.
A faster version uses a precomputed lookup table with 256 bytes.
#define islower(c) _islower_table[(unsigned char)(c)]
This is faster and branchless, at the cost of 256 bytes of global memory, and has already been done for you in the standard libraries by those uber-brainy compiler engineers.
References
- Paul Bilokon, Burak Gunduz, 8 Sep 2023, C++ Design Patterns for Low-latency Applications Including High-frequency Trading, https://arxiv.org/abs/2309.04259, Code: https://github.com/0burak/imperial_hft
- Sarah Butcher & Alex McMurray, 2 January 2025, The C++ techniques you need for $600k hedge fund jobs, https://www.efinancialcareers.com/news/low-latency-c
- Paul Alexander Bilokon, Maximilian Lucuta, Erez Shermer, 27 Aug 2023, Semi-static Conditions in Low-latency C++ for High Frequency Trading: Better than Branch Prediction Hints, https://arxiv.org/abs/2308.14185, Code: https://github.com/maxlucuta/semi-static-conditions (Advanced branch prediction analysis, a way to do branches by self-modifying code at assembly level.)
- John Farrier, March 2025, Branch Prediction: The Definitive Guide for High-Performance C++, https://johnfarrier.com/branch-prediction-the-definitive-guide-for-high-performance-c/
- Srdjan Delić, Apr 10, 2023, Branchless programming — Why your CPU will thank you, https://sdremthix.medium.com/branchless-programming-why-your-cpu-will-thank-you-5f405d97b0c8
- Jared Gorski, 11 August, 2020, Branchless programming, https://jaredgorski.org/notes/branchless-programming/
- Algorithmica, March 2025 (accessed), Branchless Programming, https://en.algorithmica.org/hpc/pipelining/branchless/
- Michael Kerrisk, Oct 5, 2012, How much do __builtin_expect(), likely(), and unlikely() improve performance? http://blog.man7.org/2012/10/how-much-do-builtinexpect-likely-and.html
- Agner Fog, 28 May, 2024 (last update), The microarchitecture of Intel, AMD, and VIA CPUs: An optimization guide for assembly programmers and compiler makers, https://www.agner.org/optimize/microarchitecture.pdf
- GCC, March 2025 (accessed), Common Function Attributes, https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html
- Algorithmica, July 2025 (accessed), Binary Search, https://en.algorithmica.org/hpc/data-structures/binary-search/ (Shows a branchless binary search algorithm with prefetching.)
- Paul-Virak Khuong, Pat Morin, 15 Mar 2017 (v2), Array Layouts for Comparison-Based Searching, https://arxiv.org/abs/1509.05053 (Branchless and cached versions of binary search on sorted arrays.)
- Agner Fog, 22 June 2024 (last updated), Optimizing subroutines in assembly language: An optimization guide for x86 platforms, https://www.agner.org/optimize/optimizing_assembly.pdf