Aussie AI
Chapter 16. Branch Prediction
-
Book Excerpt from "C++ AVX Optimization: CPU SIMD Vectorization"
-
by David Spuler, Ph.D.
What is Branch Prediction?
Branch prediction is a CPU optimization whereby efficiency is improved by anticipating upcoming branches. As it executes, the CPU tries to predict which of the two paths of a branch is more likely to be taken. Some CPUs also perform “speculative execution” of future instructions to get ahead; that speculative work must be discarded if the “wrong” branch is actually taken by the code.
For the programmer, these branch prediction capabilities give the opportunity to further optimize your code to capitalize on the CPU’s abilities. Optimization techniques for the C++ programmer include:
- Eliminating branches in the hotpath so that the code runs straight and narrow (i.e., fast!).
- Hinting to the compiler about the most likely branches (e.g., [[likely]] and [[unlikely]] attributes).
- Keeping unavoidable branches in the same neighborhood (e.g., short loop bodies).
Branch prediction has a problem in HFT: the hot path is rarely executed (i.e., actually submitting a trade). The branch prediction logic will learn to predict the cold path, since that is what almost always runs. But what we want is for the branch predictor to always choose the hot path, even though that prediction would usually be wrong. Thus, all of HFT is at odds with a whole swathe of computing theory about branch prediction. HFT needs a “set opposite world mode” flag, but I’ve yet to find one in the GCC documentation.
Types of Branches
First things: analyze your hotpath code for branching. The main types of branches in C++ code include:
- if statements and if-else statements.
- Loop conditions and loop bodies.
- Loop control statements: break, continue.
- Function calls and return statements.
- switch statements (multi-way branching).
Some of the less obvious types of branches are:
- Ternary operator (?:)
- Short-circuiting in the && and || operators
There are also hidden branches in C++ code features such as:
- Virtual function calls
- Function pointers (and function names)
Branch Compiler Hints
There are several ways for the programmer to give “hints” to the compiler and its optimizer about which pathways are more likely. As always, the compiler is free to ignore hints, so you have to check in the assembly output what effect your changes have. Some of the ways to give hints include:
- [[likely]] and [[unlikely]] path attributes (C++20).
- likely() and unlikely() condition macros (commonly defined via __builtin_expect; see below).
- noexcept specifier (C++11).
- [[noreturn]] attribute (C++11).
- [[assume(expression)]] attribute (C++23).
GCC also has various extensions available to give hints:
- __builtin_expect(expression, value) (GCC builtin)
- hot (GCC function attribute)
It’s common in pre-C++20 Linux code to define your own macro versions for use with the GCC compiler:
#define likely(expr) __builtin_expect((expr), 1)
#define unlikely(expr) __builtin_expect((expr), 0)
Branch Profiling
Branch profiling is the recording of pathway stats to analyze the most likely branches. This can also be re-used in the compiler’s optimization mode, so that the optimizer can perform branch-aware optimizations. Hence, there is a two-step process whereby better branch prediction can be incorporated into your C++ executable code.
GCC has capabilities to store and use branch prediction statistics in its optimization phase. The arguments to use are:
- -fprofile-arcs (GCC command-line option)
- -fprofile-generate (GCC command-line option)
- -fprofile-use (GCC command-line option)
Following this process allows GCC to generate more optimal code based on the branch frequencies observed in the profiled runs. Obviously, this is an automatic method, but it needs multiple steps in the build:
- Compile with profiling instrumentation (e.g., -fprofile-generate).
- Run the tests on representative inputs.
- The instrumented binary writes out the branch profile data.
- Re-compile with -fprofile-use so the optimizer can apply branch-aware optimizations.
Note that for HFT, the fully hot path (i.e., trade execution) is actually a rare branch, so this historical branch data won’t be that useful. One solution is to run GCC in a test mode in which the hotpath is always dummy-executed! Other early parts of the hotpath in HFT can still benefit in both situations, such as the trading decision logic, which is always executed on incoming market data. Obviously, non-HFT applications can always benefit, as the most likely paths are also the most heavily-executed.
Branch Heuristics
In the absence of other branch prediction data, the CPU and compiler tools fall back on some heuristics. Some of the common ones include:
- The if code block is more likely to be executed than the else code block.
- Loops tend to be executed multiple times.
- Backwards branches are assumed to be loop iterations (and are predicted taken, due to the prior assumption).
Hence, we can make some heuristic recommendations for how to organize your code:
- Put common-case code in the if block.
- Have error handling in the else block.
- Avoid loops that only ever execute once.
Branch Elimination
The simplest way to avoid branch prediction issues is to have fewer branches. There are various ways to achieve this, ranging from minor code tricks to re-writing your entire algorithm to have fewer conditional tests.
Which branches to eliminate? The worst kinds of branches that need elimination include:
- Long if-else-if sequences
- Nested if-else statements
What data is being tested by a branch condition is also critical, and some of the problematic branches are based on unpredictable conditions:
- Branches depending on user inputs
- Branches depending on random numbers
- Branches depending on system clocks
The best types of conditional tests include:
- Compile-time known tests
- Predictable conditions
The techniques available to eliminate your least favorite branches include:
- Reorganize the overall algorithm to have fewer branches.
- Defer or combine error checking for multiple errors so that there’s only one error handling branch.
- Function call optimizations such as inlining and call hierarchy flattening.
- Loop conditional test reductions such as loop unrolling and iteration bounds known at compile-time.
- Branchless programming techniques and tricks to change conditional paths to arithmetic computations.
Branchless Programming Tricks
Branchless programming is a variety of coding tricks
to get rid of control flow branches.
The main approach is to remove conditional tests,
such as if statements,
by using a variety of arithmetic computations instead.
Code that has no branches in a long block
can run very fast on a CPU
because of instruction prefetching.
Advantages of branchless programming:
- Avoids branch prediction issues (CPU speedup).
- Avoids warp divergence in CUDA C++ (GPU speedup).
- Job security
Possible general software engineering disadvantages of these branchless arithmetic bit tricks:
- Code complexity — isn’t it a good thing?
- Unreadable code — as if we care.
- Maintainability — is someone else’s problem.
Even worse, the speed benefit might be a mirage. The issues include:
- De-optimizations from too many arithmetic operators — benchmark your tricks!
- Don’t underestimate the optimizer’s capability on simple code (even if it’s “branchy”).
- Code tricks can confuse the optimizer (undermining any benefit).
- Memory access costs may dominate over branchless code.
One of the risks with branchless code is that it runs too fast, and gets blocked by memory access delays. Hence, you may need to combine branchless code sequences with software-based memory prefetch primitives, such as:
- __builtin_prefetch() (GCC/Clang builtin)
- _mm_prefetch() (SSE intrinsic)
Branchless Coding Techniques
Now, let’s look at some of the fun tricks in branchless C++ sequences. The types of methods for branchless coding of basic sequential CPU code include:
- Bit masks
- Bit arithmetic (bitshifts, bitwise AND/OR/XOR)
- Mapping Boolean flags to 0 or 1
- Mapping logical operator results to 0 or 1
- Multiplications by 0 or 1 using Booleans
- Lookup tables (maybe)
- Conditional move (CMOV) assembly statements
- Ternary operator (?:)
How can we do that in AVX? Note that all SIMD operations are branchless and efficient. All of the basic arithmetic should be vectorized into SIMD operations where possible. Branches can also be removed in parallel using some types of AVX instructions. The AVX equivalents of these branchless coding ideas include:
- Boolean flags — map to 0 or all 1’s (0xFFFFFFFF, i.e., -1) in AVX.
- Bitmask tricks — permute/shuffle AVX intrinsics.
- CMOV/ternary operator — combined “cmp” and “blend” AVX primitives.
Some of the more traditional C++ optimization techniques can also reduce branching as an extra benefit:
- Loop code hoisting of conditional tests.
- Compile-time settings and configurations.
Ternary Operator and CMOV
Using the C++ ternary operator is one way to help the compiler write branchless code.
Consider the basic if statement:
if (x > y) {
max = x;
}
else {
max = y;
}
This can be more concisely written with a ternary operator:
max = (x > y) ? x : y;
The ternary operator can be implemented in the compiler backend using a CMOV (conditional move) instruction. This is a branchless instruction that implements the conditional assignment very efficiently.
In theory, both pieces of code are equivalent, and the compiler
really should generate identical code.
In practice, the use of the ternary operator
makes it easier on those poor compiler engineers,
because it’s 100% guaranteed that an assignment is required,
whereas the if statement requires a significant amount of extra compile-time static
analysis
to deduce that both assignments are setting the same variable.
The C++ compiler is more likely to emit a branchless CMOV assembly statement
with a ternary operator.
Boolean Flags are 0 and 1
Another way to reduce branches is to use Boolean flags in arithmetic, using them as having the values of integer 0 and 1. Here’s a simple example:
bool inc_flag;
int x = 0;
if (inc_flag) {
x++;
}
This can be implemented in a branchless manner:
x += (int)inc_flag;
Note that the type cast to int is not really needed, but helps with readability,
and ensures you don’t get compiler or static analyzer warnings.
Whether that is faster needs testing, because it forces an addition operator into a pathway that previously had none; but at least it’s branchless, so it helps with branch prediction.
That was a simple example, but many other ideas are possible. Instead of this:
if (clear_flag) x = 0;
You can try this branchless version:
x *= (int)!clear_flag;
It’s not clear that this is faster, since multiplication is an expensive operation, but a good compiler can actually notice that it’s a fake multiplication over two possible values (0 and 1), and the optimizer can then use a CMOV instruction. Who’s to know without checking the assembly code or running a benchmark.
Logical Operators are 0 and 1
In the same vein, the Boolean values of the && and || operators can
be treated as 0 and 1 in integer arithmetic expressions.
Here’s an example of the maximum computation:
max = (x > y) * x + (y >= x) * y;
Note that the optimizer can notice that a multiplication over a Boolean operand can be replaced with a CMOV, and there are two here. Again, the ternary operator’s single CMOV instruction is probably faster than this possible de-optimization, because this version has either two multiplications or two CMOV instructions.
Bitwise XOR Tricks
There’s the well-known XOR trick to swap two integer variables without using a temporary:
x = x ^ y;
y = y ^ x;
x = x ^ y;
Don’t worry; nobody understands how this works. But it uses three assignments, no temporary variable, and no branches. (One caveat: if x and y are the same object, the first XOR zeroes it, and both end up as zero.)
Self XOR to Zero
There’s also a well-known assembly language trick
of zeroing a register using XOR with itself.
The idea is that instead of an “x=0” statement, do this:
x ^= x; // Self XOR
The result is zero, and we don’t even need to initialize the variable! However, we don’t usually do this in C++ (reading an uninitialized variable is technically undefined behavior there), but the equivalent is common in assembly listings and compiler backend implementations.
Sign Bit Extension Masks
If you’re doing any arithmetic with negative values, you can use bitwise tricks by creating two masks depending on the sign bit. The idea is that the bitmask is:
- All 0’s if the number is positive (or zero).
- All 1’s if the number is negative.
In other words, the bitmask is 32 bits all set to the same bit value as the sign bit.
The bitmask value is either 0 or 0xFFFFFFFF, which is also the artist formerly known as -1.
One way is a ternary operator:
unsigned int mask = (x >= 0) ? 0 : 0xFFFFFFFFu;
We can also generate this bitmask using the right bitshift operator and sign extension:
unsigned int mask = x >> 31;
Yes, I really should portably compute the bitshift count
using CHAR_BIT and sizeof(int) as nicely done in [Farrier, 2025].
Subtraction Bit Mask
Another way to get the same result is by noting the joke about -1 being the same value.
Hence, this trick with subtraction on 2’s complement signed integers works:
unsigned int mask = (unsigned) ( (int)(x >= 0) - 1 );
The comparison generates an integer 0 or 1, and then we subtract 1 to get either 0 or 0xFFFFFFFF.
Hence, the “>=” comparison stays the same as in the ternary version; it’s the subtraction that flips 1 to 0 and 0 to all 1’s.
All of the type casts are “free” without runtime costs, and are probably not necessary
because implicit conversions would work, anyway.
Example: RELU Activation Function
Let’s have a go at making the RELU function branchless. RELU is an “activation function” in LLM backends, and it’s quite simple:
if (x < 0) {
RELU = 0;
}
else {
RELU = x;
}
In other words, change negatives to zero, but leave positives unchanged. Here’s the ternary version (faster):
RELU = (x < 0) ? 0 : x;
The mask-by-subtraction version reverses the comparison to “<” (so the mask is all 1’s exactly when x is non-negative) and combines with bitwise-AND to get:
unsigned int mask = (unsigned)((int)(x < 0) - 1);
RELU = x & mask;
Another idea for a branchless version of a bitwise RELU is:
unsigned int umask = (x >> 31); // All 0’s or all 1’s
RELU = (x | umask);
Actually, that’s buggy, with the bit masking the wrong way around: the mask is all 1’s precisely for the negative values we want to zero out. Here’s the correction, complementing the mask and switching to bitwise-AND:
unsigned int umask = ~(x >> 31); // All 1’s if x >= 0, all 0’s if x < 0
RELU = (x & umask);
Beware this might be a de-optimization, because the ternary version might be a single CMOV instruction, whereas this version has three operators: right bitshift, bitwise-NOT, and bitwise-AND.
Sign Bitshift Portability
There’s a major portability problem with this code, because right bitshift on a negative signed integer is implementation-defined behavior before C++20 (C++20 finally pins it down as an arithmetic shift). An older implementation is free to shift in zero bits or to sign-extend the leftmost bit, at its sole discretion. Hence, you need to check your platform (and language standard) to see what the >> operator does, and whether this right-shift bitmask idea will work.
Note that we cannot fix this by doing the right bitshift on an unsigned type,
which is guaranteed to shift in a zero bit
(well-defined in standard C++, but not what we want).
Note also that left bitshift is no escape: it always shifts zero bits in on the right side (again, not what we want), and left-shifting a negative signed value was likewise undefined behavior before C++20.
Of course, you can create the sign-based bitmask more portably by avoiding the right bitshift operator, but this loses the branchless benefits:
unsigned int mask = (x >= 0) ? 0 : 0xFFFFFFFF;
That’s safe and slow, and what’s the point of that?
Lookup Tables
Precomputation of lookup tables is a fast way to get a double benefit of fast computation and branchless code. A good example in the standard C++ library are the functions for character types. Here’s a slow branching version:
#define islower(c) (((c) >= 'a') && ((c) <= 'z') )
This has multiple comparisons, branches in the short-circuiting of the && operator, and the classic macro hazard of evaluating its argument twice.
A faster version uses a precomputed lookup table with 256 bytes.
#define islower(c) _islower_table[(unsigned char)(c)]
This is faster and branchless, at the cost of 256 bytes of global memory, and has already been done for you in the standard libraries by those uber-brainy compiler engineers.
References
- Paul Bilokon, Burak Gunduz, 8 Sep 2023, C++ Design Patterns for Low-latency Applications Including High-frequency Trading, https://arxiv.org/abs/2309.04259, Code: https://github.com/0burak/imperial_hft
- Sarah Butcher & Alex McMurray, 2 January 2025, The C++ techniques you need for $600k hedge fund jobs, https://www.efinancialcareers.com/news/low-latency-c
- Paul Alexander Bilokon, Maximilian Lucuta, Erez Shermer, 27 Aug 2023, Semi-static Conditions in Low-latency C++ for High Frequency Trading: Better than Branch Prediction Hints, https://arxiv.org/abs/2308.14185, Code: https://github.com/maxlucuta/semi-static-conditions (Advanced branch prediction analysis, a way to do branches by self-modifying code at assembly level.)
- John Farrier, March 2025, Branch Prediction: The Definitive Guide for High-Performance C++, https://johnfarrier.com/branch-prediction-the-definitive-guide-for-high-performance-c/
- Srdjan Delić, Apr 10, 2023, Branchless programming — Why your CPU will thank you, https://sdremthix.medium.com/branchless-programming-why-your-cpu-will-thank-you-5f405d97b0c8
- Jared Gorski, 11 August, 2020, Branchless programming, https://jaredgorski.org/notes/branchless-programming/
- Algorithmica, March 2025 (accessed), Branchless Programming, https://en.algorithmica.org/hpc/pipelining/branchless/
- Michael Kerrisk, Oct 5, 2012, How much do __builtin_expect(), likely(), and unlikely() improve performance? http://blog.man7.org/2012/10/how-much-do-builtinexpect-likely-and.html
- Agner Fog, 28 May, 2024 (last update), The microarchitecture of Intel, AMD, and VIA CPUs: An optimization guide for assembly programmers and compiler makers, https://www.agner.org/optimize/microarchitecture.pdf
- GCC, March 2025 (accessed), Common Function Attributes, https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html
- Algorithmica, July 2025 (accessed), Binary Search, https://en.algorithmica.org/hpc/data-structures/binary-search/ (Shows a branchless binary search algorithm with prefetching.)
- Paul-Virak Khuong, Pat Morin, 15 Mar 2017 (v2), Array Layouts for Comparison-Based Searching, https://arxiv.org/abs/1509.05053 (Branchless and cached versions of binary search on sorted arrays.)
- Agner Fog, 22 June 2024 (last updated), Optimizing subroutines in assembly language: An optimization guide for x86 platforms, https://www.agner.org/optimize/optimizing_assembly.pdf