Aussie AI
Chapter 4. Branch Prediction
-
Book Excerpt from "Advanced C++ Memory Techniques: Efficiency and Safety"
-
by David Spuler, Ph.D.
What is Branch Prediction?
Branch prediction is a CPU optimization that improves efficiency by anticipating upcoming branches. During execution, the CPU tries to predict which of the two paths of a branch is more likely to be taken. Some CPUs also perform "speculative execution" of future instructions to get ahead; this work must be discarded if the "wrong" branch is actually taken by the code.
For the programmer, these branch prediction capabilities give the opportunity to further optimize your code to capitalize on the CPU’s abilities. Optimization techniques for the C++ programmer include:
- Eliminating branches in the hotpath so that the code runs straight and narrow (i.e., fast!).
- Hinting to the compiler about the most likely branches (e.g., the [[likely]] and [[unlikely]] attributes).
- Keeping unavoidable branches in the same neighborhood (e.g., short loop bodies).
Branch prediction has a problem in HFT: the hot path (i.e., actually submitting a trade) is rarely executed. All of the branch prediction logic would favor the cold path, as it would always be predicted. But what we want is for the branch predictor to always choose the hot path, even though that prediction would usually be wrong. Thus, all of HFT is at odds with a whole swathe of computing theory about branch prediction. HFT needs a "set opposite world mode" flag, but I've yet to find one in the GCC documentation.
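As a small illustration of working around this (a hypothetical sketch; the function and thresholds are invented, not from any real trading system), one option is to deliberately invert the usual advice and mark the rare-but-critical branch as [[likely]] (C++20), so the compiler lays the trade path out as the straight-line code:

```cpp
#include <cstdint>

// Hypothetical sketch: the trade branch is rarely taken at runtime,
// but we mark it [[likely]] (C++20) anyway, so the compiler arranges
// the trade path as the fall-through, straight-line code.
inline int on_market_data(std::int64_t price, std::int64_t buy_threshold) {
    if (price < buy_threshold) [[likely]] {  // rare in practice, but latency-critical
        return 1;  // submit trade (the hotpath we want fast)
    }
    return 0;      // common do-nothing path
}
```

Note that this only affects static code layout and compiler heuristics; it does not change the CPU's dynamic predictor, which will still learn the real branch frequencies.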
Types of Branches
First things first: analyze your hotpath code for branching. The main types of branches in C++ code include:
- if statements and if-else statements.
- Loop conditions and loop bodies.
- Loop control statements: break, continue.
- Function calls and return statements.
- switch statements (multi-way branching).
Some of the less obvious types of branches are:
- Ternary operator (?:).
- Short-circuiting in the && and || operators.
There are also hidden branches in C++ code features such as:
- Virtual function calls
- Function pointers (and function names)
Branch Compiler Hints
There are several ways for the programmer to give "hints" to the compiler and its optimizer about which pathways are more likely. As always, the compiler is free to ignore hints, so you have to check the assembly output to see what effect your changes had. Some of the ways to give hints include:
- [[likely]] and [[unlikely]] path attributes (C++20).
- likely() and unlikely() condition macros (common wrappers for __builtin_expect; not standard C++).
- noexcept specifier (C++11).
- [[noreturn]] attribute (C++11).
- [[assume(expression)]] attribute (C++23).
GCC also has various extensions available to give hints:
- __builtin_expect(expression, value) (GCC extension).
- hot (GCC function attribute).
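As a sketch of the GCC extension in use (the likely/unlikely macro names are a widespread convention, e.g. from the Linux kernel, not part of GCC itself):

```cpp
// Conventional wrappers around the GCC/Clang built-in; the !! forces
// the expression to 0 or 1, which __builtin_expect compares against.
#define my_likely(x)   __builtin_expect(!!(x), 1)
#define my_unlikely(x) __builtin_expect(!!(x), 0)

// The optimizer is told the error branch is unlikely, so it can move
// that code out of the straight-line hotpath.
int parse_value(int raw) {
    if (my_unlikely(raw < 0)) {
        return -1;   // rare error path
    }
    return raw * 2;  // hot path
}
```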
Branch Profiling
Branch profiling is the recording of pathway statistics to determine the most likely branches. These statistics can then be fed back into the compiler's optimization mode, so that the optimizer can perform branch-aware optimizations. Hence, there is a two-step process whereby better branch prediction can be incorporated into your C++ executable code.
GCC has capabilities to store and use branch prediction statistics in its optimization phase. The arguments to use are:
- -fprofile-arcs (GCC command-line option).
- -fprofile-generate (GCC command-line option).
- -fprofile-use (GCC command-line option).
Following this process allows GCC to generate more optimal code under assumptions based on the branch frequencies seen in its profiled executions. Obviously, this is an automated method, but it needs multiple steps in the build:
- Compile without branch hints
- Run the tests
- Output the branch prediction data
- Re-compile the code with branch optimizations enabled
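The four build steps might look something like this (a sketch only; file names and inputs are placeholders, and exact flags can vary by GCC version):

```shell
# Step 1: build an instrumented binary that records branch counts
g++ -O2 -fprofile-generate main.cpp -o app

# Steps 2 and 3: run representative workloads; GCC writes *.gcda profile files
./app training_input.txt

# Step 4: rebuild, letting the optimizer use the recorded profile
g++ -O2 -fprofile-use main.cpp -o app
```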
Note that for HFT, the fully hot path (i.e., trade execution) is actually a rare branch, so this historical branch data won't be that useful. One solution is to run the profiling build in a test mode in which the hotpath is always dummy-executed! Other early parts of the HFT hotpath, such as the trading decision logic that runs on every incoming market data event, can still benefit either way. Obviously, non-HFT applications can always benefit, as the most likely paths are also the most heavily-executed.
Branch Heuristics
In the absence of other branch prediction data, the CPU and compiler tools fall back on some heuristics. Some of the common ones include:
- The if code block is more likely to be executed than the else code block.
- Loops tend to be executed multiple times.
- Backward branches are assumed to be loop iterations (and are predicted taken, per the prior assumption).
Hence, we can make some heuristic recommendations for how to organize your code:
- Put common-case code in the if block.
- Put error handling in the else block.
- Avoid loops that execute only once.
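Putting those recommendations together, a sketch of the preferred layout (the function and its "common case" are invented for illustration):

```cpp
// Common case in the if block, error handling in the else block,
// matching the static prediction heuristics described above.
int handle_length(int len) {
    if (len > 0) {        // common case: predicted taken
        return len * 2;   // normal processing
    } else {
        return -1;        // rare error path
    }
}
```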
Branch Elimination
The simplest way to avoid branch prediction issues is to have fewer branches. There are various ways to achieve this, ranging from minor code tricks to re-writing your entire algorithm to have fewer conditional tests.
Which branches to eliminate? The worst kinds of branches that need elimination include:
- Long if-else-if sequences
- Nested if-else statements
What data is being tested by a branch condition is also critical. Some of the most problematic branches are based on unpredictable conditions:
- Branches depending on user inputs
- Branches depending on random numbers
- Branches depending on system clocks
The best types of conditional tests include:
- Compile-time known tests
- Predictable conditions
The techniques available to eliminate your least favorite branches include:
- Reorganize the overall algorithm to have fewer branches.
- Defer or combine error checking for multiple errors so that there’s only one error handling branch.
- Function call optimizations such as inlining and call hierarchy flattening.
- Loop conditional test reductions such as loop unrolling and iteration bounds known at compile-time.
- Branchless programming techniques and tricks to change conditional paths to arithmetic computations.
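As one example of the error-combining idea (a sketch with invented names), several checks can be folded into a single flag so the hotpath pays for only one branch:

```cpp
// Bitwise OR evaluates all three tests with no short-circuit branches;
// only the final combined check is a branch, and it is well-predicted
// when errors are rare.
int process_triple(int a, int b, int c) {
    bool bad = (a < 0) | (b < 0) | (c < 0);
    if (bad) {
        return -1;        // single combined error branch
    }
    return a + b + c;     // hotpath
}
```

The trade-off is that all three tests are always evaluated, so this only pays off when the tests themselves are cheap.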
Branchless Programming Tricks
Branchless programming is a variety of coding tricks
to get rid of control flow branches.
The main approach is to remove conditional tests,
such as if statements,
by using a variety of arithmetic computations instead.
Code that has no branches in a long block
can run very fast on a CPU
because of instruction prefetching.
Advantages of branchless programming:
- Avoids branch prediction issues (CPU speedup).
- Avoids warp divergence in CUDA C++ (GPU speedup).
- Job security
Possible general software engineering disadvantages of these branchless arithmetic bit tricks:
- Code complexity — isn’t it a good thing?
- Unreadable code — as if we care.
- Maintainability — is someone else’s problem.
Even worse, the speed benefit might be a mirage. The issues include:
- De-optimizations from too many arithmetic operators — benchmark your tricks!
- Don’t underestimate the optimizer’s capability on simple code.
- Tricks can confuse the optimizer (undermining any benefit).
The types of methods for branchless coding include:
- Bit masks
- Bit arithmetic (bitshifts, bitwise AND/OR/XOR)
- Mapping Boolean flags to 0 or 1
- Mapping logical operator results to 0 or 1
- Lookup tables
- Conditional move (CMOV) assembly statements
- Ternary operator (?:)
Some of the more traditional C++ optimization techniques can also reduce branching:
- Loop code hoisting of conditional tests.
- Compile-time settings and configurations.
Ternary Operator and CMOV
Using the C++ ternary operator is one way to help the compiler write branchless code.
Consider the basic if statement:
if (x > y) {
max = x;
}
else {
max = y;
}
This can be more concisely written with a ternary operator:
max = (x > y) ? x : y;
The ternary operator can be implemented in the compiler backend using a CMOV (conditional move) register assignment statement. This is a branchless instruction that implements the conditional assignment very efficiently.
In theory, both pieces of code are equivalent, and the compiler really should generate identical code. In practice, the use of the ternary operator makes it easier on those poor compiler engineers, because it's 100% guaranteed that an assignment is required, whereas the if statement requires a significant amount of extra compile-time static analysis to deduce that both branches are setting the same variable. The C++ compiler is more likely to emit a branchless CMOV assembly instruction with a ternary operator.
Boolean Flags are 0 and 1
Another way to reduce branches is to use Boolean flags in arithmetic, using them as having the values of integer 0 and 1. Here’s a simple example:
bool inc_flag;
int x = 0;
if (inc_flag) {
x++;
}
This can be implemented in a branchless manner:
x += (int)inc_flag;
Note that the type cast to int is not strictly needed, but it helps with readability and avoids compiler or static analyzer warnings.
Whether that is faster needs testing, because it forces an addition operator into a pathway that previously had none; but at least it's branchless, which helps with branch prediction.
That was a simple example, but many other ideas are possible. Instead of this:
if (clear_flag) x = 0;
You can try this branchless version:
x *= (int)!clear_flag;
I’m betting that it’s actually slower, since multiplication is an expensive operation, but who’s to know without running a benchmark.
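The only way to settle that bet is a benchmark. Here's a minimal timing-harness sketch (illustrative only; real benchmarks need warmup, repetition, and care to stop the optimizer from deleting the measured loop entirely):

```cpp
#include <chrono>
#include <vector>

// Time a callable in nanoseconds using the steady clock.
template <typename F>
long long time_ns(F f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}

// Branchy version of the conditional clear.
int run_branchy(std::vector<int>& v, bool clear_flag) {
    for (int& x : v) { if (clear_flag) x = 0; }
    return v.back();
}

// Branchless version: multiply by 0 or 1.
int run_branchless(std::vector<int>& v, bool clear_flag) {
    int keep = (int)!clear_flag;
    for (int& x : v) { x *= keep; }
    return v.back();
}
```

Run both over the same data via time_ns and compare. A good optimizer may hoist the flag test out of the branchy loop entirely, which is exactly the kind of surprise benchmarking reveals.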
Logical Operators are 0 and 1
In the same vein, the Boolean values of the && and || operators can
be treated as 0 and 1 in integer arithmetic expressions.
Here’s an example of the maximum computation:
max = (x > y) * x + (y >= x) * y;
Again, the ternary operator’s CMOV instruction is probably faster than this de-optimization.
Bitwise XOR Tricks
There’s the well-known XOR trick to swap two integer variables without using a temporary:
x = x ^ y;
y = y ^ x;
x = x ^ y;
Don’t worry; nobody understands how this works. But it uses three assignments, no temporary variable, and no branches.
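For the skeptical, the trick rests on one identity: a ^ b ^ b == a. Annotating each step (x0 and y0 are the original values):

```cpp
// XOR swap, step by step. Caveat: if x and y refer to the SAME object,
// the first step zeroes it (x ^ x == 0), so never use this on aliases.
void xor_swap(unsigned& x, unsigned& y) {
    x = x ^ y;  // x == x0 ^ y0
    y = y ^ x;  // y == y0 ^ (x0 ^ y0) == x0
    x = x ^ y;  // x == (x0 ^ y0) ^ x0 == y0
}
```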
Sign Bit Extension Masks
If you’re doing any arithmetic with negative values, you can use bitwise tricks by creating two masks depending on the sign bit. The idea is that the bitmask is:
- All 0’s if the number is positive (or zero).
- All 1’s if the number is negative.
In other words, the bitmask is 32 bits all set to the same value as the sign bit. The bitmask value is either 0 or 0xFFFFFFFF (which is also the artist formerly known as -1). We can generate this using the right bitshift operator:
unsigned int mask = x >> 31;
Yes, I really should portably compute the bitshift count using CHAR_BIT and sizeof(int), as nicely done in [Farrier, 2025].
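A sketch of that portable version (assuming the C++20 arithmetic-shift guarantee, or a typical sign-extending platform on earlier standards):

```cpp
#include <climits>

// Derive the sign-bit position from the type's real width instead of
// hard-coding 31.
constexpr int kSignShift = sizeof(int) * CHAR_BIT - 1;  // 31 for 32-bit int

// All 1s for negative x, all 0s otherwise.
unsigned int sign_mask(int x) {
    // C++20 defines >> on negative values as an arithmetic (sign-extending)
    // shift; earlier standards leave the result implementation-defined.
    return (unsigned int)(x >> kSignShift);
}
```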
Example: RELU Activation Function
Let’s have a go at making the RELU function branchless. RELU is an “activation function” in LLM backends, and it’s quite simple:
if (x < 0) {
RELU = 0;
}
else {
RELU = x;
}
In other words, change negatives to zero, but leave positives unchanged. Here’s the ternary version (faster):
RELU = (x < 0) ? 0 : x;
The basic idea for a branchless, bitwise RELU is:
unsigned int umask = (x >> 31); // All 0’s or 1's
RELU = (x | umask);
Actually, that's buggy: the bit mask is the wrong way around, and the combining operator should be a bitwise-AND. Here's the correction:
unsigned int umask = ((-x) >> 31); // All 1's for positive x, all 0's for negative
RELU = (x & umask);
Beware that this might be a de-optimization: the ternary version might compile to a single CMOV instruction, whereas this version uses three operators: negation, right bitshift, and bitwise-AND.
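A common alternative formulation avoids the negation entirely (and with it the signed-overflow risk when x is INT_MIN): shift first, then invert the mask. A sketch, assuming a 32-bit int and an arithmetic right shift (guaranteed since C++20):

```cpp
// Branchless RELU via sign mask: x >> 31 is all 1s for negative x,
// all 0s otherwise; inverting and ANDing keeps positives, zeroes negatives.
int relu_bitwise(int x) {
    return x & ~(x >> 31);  // assumes 32-bit int and arithmetic shift
}

// Reference version for comparison.
int relu_ternary(int x) {
    return (x < 0) ? 0 : x;
}
```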
Sign Bitshift Portability
There's a major portability problem with this code: in C++17 and earlier, right bitshift on a negative signed integer gives an implementation-defined result. The compiler is free to shift in zero bits or to sign-extend from the leftmost bit position, at its sole discretion. (C++20 finally defines it as an arithmetic shift, i.e., sign-extending.) Hence, you need to check your platform and language standard to see what the >> operator does, and whether this rightshift bitmask idea will work.
Note that we cannot fix this by doing the right bitshift on an unsigned type, which is guaranteed to shift in a zero bit (well-defined in standard C++, but not what we want). Note also that left bitshift is no better: before C++20, left-shifting a negative value is actually undefined behavior, and in any case a left shift brings in zero bits on the right side (again, not what we want).
Of course, you can create the sign-based bitmask more portably by avoiding the right bitshift operator, but this loses the branchless benefits:
unsigned int mask = (x >= 0) ? 0 : 0xFFFFFFFF;
That’s safe and slow, and what’s the point of that?
Lookup Tables
Precomputing lookup tables is a fast way to get a double benefit: fast computation and branchless code. A good example in the standard C++ library is the set of character classification functions. Here's a slow, branching version:
#define islower(c) (((c) >= 'a') && ((c) <= 'z') )
This has lots of computation, and there are also branches in the short-circuiting of the && operator. A faster version uses a precomputed lookup table with 256 bytes:
#define islower(c) _islower_table[(unsigned char)(c)]
This is faster and branchless, at the cost of 256 bytes of global memory, and has already been done for you in the standard libraries by those uber-brainy compiler engineers.
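The library's internal table (called _islower_table above) isn't portable to poke at, but building your own equivalent is easy. A sketch:

```cpp
#include <array>

// Precompute a 256-entry table: 1 for 'a'..'z', 0 for everything else.
static std::array<unsigned char, 256> make_islower_table() {
    std::array<unsigned char, 256> t{};
    for (int c = 'a'; c <= 'z'; ++c) {
        t[(unsigned char)c] = 1;
    }
    return t;
}
static const std::array<unsigned char, 256> g_islower_table = make_islower_table();

// Branchless, table-driven classification: one indexed load, no tests.
inline bool is_lower_fast(unsigned char c) {
    return g_islower_table[c] != 0;
}
```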
Instruction Reordering Optimizations
Instruction reordering is an optimization performed inside the CPU, whereby it actually runs the machine code instructions out of order. In simple terms, it works like this:
- Delay any opcodes that don’t have the data they need (e.g., from memory).
- Run any instructions that are ready as soon as possible.
There’s a whole smash of fun to be had researching how this all works in the CPU. There are schedulers and “stations” and various queues and caches. Kudos to all those hardware engineers.
Another special type of fun is for compiler engineers. GCC does a lot of fancy optimizations in the code generation backend in terms of taking advantage of instruction orders.
But what about C++? Is there anything you can do in C++ to optimize your code? Or with inline assembly instructions?
Safety first. Most of the discussion of out-of-order execution and C++ occurs in relation to safety. Problems can arise across multiple threads if the reads and writes from our C++ statements run out-of-order. I mean, how can it be good to just run my C++ code in any random order that the CPU chooses?
Preventing out-of-order errors involves "memory order" constraints. These are especially useful for correctly implementing lock-free algorithms with atomics, but they also act as memory barriers that can prevent any undesirable types of out-of-order execution.
Speed second. But the goal is to go faster! Rather than stopping the CPU from reordering instructions by using memory barriers, let’s maximize it! There are at least two major ideas:
- Minimize memory-waiting delays
- Exploit out-of-order instructions
The first point is to minimize the slowdowns whereby instructions get delayed. The main culprit is memory access, which has well-known solutions such as: cache hit maximization, cache-line awareness, tiled memory accesses, contiguous memory blocks, reducing data sizes, etc.
Other than cache locality, there’s not a lot of discussion anywhere in books or on the internet about exploiting out-of-order instruction execution to make code run faster. But there’s some discussion of this in Agner Fog’s astounding CPU resources; see (Fog, 2024). The key point is:
Free extra parallelism!
The average CPU has hidden parallelism in terms of its various computation pathways. For example, the CPU can run these two computations in parallel:
- Integer arithmetic — Arithmetic-Logic Unit (ALU)
- Floating-point arithmetic — Floating-Point Unit (FPU)
That’s not the full list. Some CPUs can run different types of integer arithmetic, such as addition and multiplication, on separate pathways. Similarly, some of the SIMD operations run separately from the non-SIMD instructions.
So, you can see the opportunity here, right? Not only can the CPU run the same operations in parallel via SIMD instructions, but it can run two (or more!) different types of computations in parallel.
Unfortunately, the opportunities for huge improvements to your C++ are somewhat limited. For example, if you have a computation with both integer and floating-point computations, can you parallelize them? Yes, but only in limited circumstances, where:
- The two computations don't depend on each other's results.
- The computations don't require memory accesses.
- The operands are values already in CPU registers.
If there’s a dependency, they can’t run in parallel. And if they both require memory requests, that’s the bottleneck regardless of whether the instructions can run in parallel. The data needs to be already loaded from memory into CPU registers to run fast.
That’s quite a list of limitations. Hence, I haven’t quite solved the problem of a faster vector dot product using instruction out-of-order execution.
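On a smaller scale than a full dot product, the dependency-breaking idea can at least be sketched with multiple accumulators: splitting one serial addition chain into two independent chains gives the out-of-order core work it can overlap (a standard summation technique, shown here purely as an illustration):

```cpp
// Two accumulator chains: s0 and s1 have no dependency on each other,
// so their additions can execute in parallel inside the CPU.
float sum_two_chains(const float* a, int n) {
    float s0 = 0.0f, s1 = 0.0f;
    int i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];      // chain 0
        s1 += a[i + 1];  // chain 1, independent of chain 0
    }
    if (i < n) {
        s0 += a[i];      // leftover element for odd n
    }
    return s0 + s1;
}
```

Note that this also changes the floating-point summation order, so results can differ slightly from a strict left-to-right sum.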
References
- Paul Bilokon, Burak Gunduz, 8 Sep 2023, C++ Design Patterns for Low-latency Applications Including High-frequency Trading, https://arxiv.org/abs/2309.04259, Code: https://github.com/0burak/imperial_hft
- Sarah Butcher & Alex McMurray, 2 January 2025, The C++ techniques you need for $600k hedge fund jobs, https://www.efinancialcareers.com/news/low-latency-c
- Paul Alexander Bilokon, Maximilian Lucuta, Erez Shermer, 27 Aug 2023, Semi-static Conditions in Low-latency C++ for High Frequency Trading: Better than Branch Prediction Hints, https://arxiv.org/abs/2308.14185, Code: https://github.com/maxlucuta/semi-static-conditions (Advanced branch prediction analysis, a way to do branches by self-modifying code at assembly level.)
- John Farrier, March 2025, Branch Prediction: The Definitive Guide for High-Performance C++, https://johnfarrier.com/branch-prediction-the-definitive-guide-for-high-performance-c/
- Srdjan Delić, Apr 10, 2023, Branchless programming — Why your CPU will thank you, https://sdremthix.medium.com/branchless-programming-why-your-cpu-will-thank-you-5f405d97b0c8
- Jared Gorski, 11 August, 2020, Branchless programming, https://jaredgorski.org/notes/branchless-programming/
- Algorithmica, March 2025 (accessed), Branchless Programming, https://en.algorithmica.org/hpc/pipelining/branchless/
- Michael Kerrisk, Oct 5, 2012, How much do __builtin_expect(), likely(), and unlikely() improve performance? http://blog.man7.org/2012/10/how-much-do-builtinexpect-likely-and.html
- Agner Fog, 28 May, 2024 (last update), The microarchitecture of Intel, AMD, and VIA CPUs: An optimization guide for assembly programmers and compiler makers, https://www.agner.org/optimize/microarchitecture.pdf
- GCC, March 2025 (accessed), Common Function Attributes, https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html
- GNU, May 2025 (accessed), Adjusting the Instruction Scheduler, https://gcc.gnu.org/onlinedocs/gccint/Scheduling.html