Aussie AI

Branchless C++ Programming Tricks

Book Excerpt from "C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations"

by David Spuler, Ph.D.

Branchless C++ Programming Tricks

Branchless programming is a variety of coding tricks to get rid of control flow branches. The main approach is to remove conditional tests, such as if statements, by using a variety of arithmetic computations instead. Code that has no branches in a long block can run very fast on a CPU because of instruction prefetching.

Advantages of branchless programming:

Avoids branch prediction issues (CPU speedup).
Avoids warp divergence in CUDA C++ (GPU speedup).
Job security

Possible general software engineering disadvantages of these branchless arithmetic bit tricks:

Code complexity — isn’t it a good thing?
Unreadable code — as if we care.
Maintainability — is someone else’s problem.

Even worse, the speed benefit might be a mirage. The issues include:

De-optimizations from too many arithmetic operators — benchmark your tricks!
Don’t underestimate the optimizer’s capability on simple code (even if it’s “branchy”).
Code tricks can confuse the optimizer (undermining any benefit).
Memory access costs may dominate over branchless code.

One of the risks with branchless code is that it runs too fast, and gets blocked by memory access delays. Hence, you may need to combine branchless code sequences with software-based memory prefetch primitives, such as with GCC builtins:

__builtin_prefetch()
_mm_prefetch()

Branchless Coding Techniques

Now, let’s look at some of the fun tricks in branchless C++ sequences. The types of methods for branchless coding include:

Bit masks
Bit arithmetic (bitshifts, bitwise AND/OR/XOR)
Mapping Boolean flags to 0 or 1
Mapping logical operator results to 0 or 1
Multiplications by 0 or 1 using Booleans
Lookup tables
Conditional move (CMOV) assembly statements
Ternary operator (?:)

Some of the more traditional C++ optimizations techniques can also reduce branching as an extra benefit:

Loop code hoisting of conditional tests.
Compile-time settings and configurations.

Ternary Operator and CMOV

Using the C++ ternary operator is one way to help the compiler write branchless code. Consider the basic if statement:

    if (x > y) {
        max = x;
    }
    else {
        max = y;
    }

This can be more concisely written with a ternary operator:

    max = (x > y) ? x : y;

The ternary operator can be implemented in the compiler backend using a CMOV (conditional move) register assignment statement. This is a branchless instruction that implements the conditional assignment very efficiently.

In theory, both pieces of code are equivalent, and the compiler really should generate identical code. In practice, the use of the ternary operator makes it easier on those poor compiler engineers, because it’s 100% guaranteed that an assignment is required, whereas the if statement requires a significant amount of extra compile-time static analysis to deduce that both assignments are setting the same variable. The C++ compiler is more likely to emit a branchless CMOV assembly statement with a ternary operator.

Boolean Flags are 0 and 1

Another way to reduce branches is to use Boolean flags in arithmetic, using them as having the values of integer 0 and 1. Here’s a simple example:

    bool inc_flag;
    int x = 0; 

    if (inc_flag) {
        x++;
    }

This can be implemented in a branchless manner:

    x += (int)inc_flag

Note that the type cast to int is not really needed, but helps with readability, and ensures you don’t get compiler or static analyzer warnings.

Whether that is faster is something that needs testing because it forces an addition operator into one of the pathways that previously had none, but at least its branchless so it helps with branch prediction.

That was a simple example, but many other ideas are possible. Instead of this:

    if (clear_flag) x = 0;

You can try this branchless version:

    x *= (int)!clear_flag;

It’s not clear that this is faster, since multiplication is an expensive operation, but a good compiler can actually notice that it’s a fake multiplication over two possible values (0 and 1), and the optimizer can then use a CMOV instruction. Who’s to know without checking the assembly code or running a benchmark.

Logical Operators are 0 and 1

In the same vein, the Boolean values of the && and || operators can be treated as 0 and 1 in integer arithmetic expressions. Here’s an example of the maximum computation:

    max = (x > y) * x + (y >= x) * y;

Note that the optimizer can notice that a multiplication over a Boolean operand can be replaced with a CMOV, and there are two here. Again, the ternary operator’s single CMOV instruction is probably faster than this possible de-optimization, because this version has either two multiplications or two CMOV instructions.

Bitwise XOR Tricks

There’s the well-known XOR trick to swap two integer variables without using a temporary:

    x = x ^ y;
    y = y ^ x;
    x = x ^ y;

Don’t worry; nobody understands how this works. But it uses three assignments, no temporary variable, and no branches.

Self XOR to Zero

There’s also a well-known assembly language trick of zeroing a register using XOR with itself. The idea is that instead of an “x=0” statement, do this:

    x ^= x;  // Self XOR

The result is zero, and we don’t even need to initialize the variable! However, we don’t usually do this in C++, but the equivalent is common in assembly listings and compiler backend implementations.

Sign Bit Extension Masks

If you’re doing any arithmetic with negative values, you can use bitwise tricks by creating two masks depending on the sign bit. The idea is that the bitmask is:

All 0’s if the number is positive (or zero).
All 1’s if the number is negative.

In other words, the bitmask is 32 bits all set to the same bit value as the sign bit. The bitmask value is either 0 or 0xFFFFFFFF, which is also that artist previously known as -1. One way is a ternary operator:

    unsigned int mask = (x >= 0) ? 0 : 0xFFFFFFFFu;

We can also generate this bitmask using the right bitshift operator and sign extension:

    unsigned int mask = x >> 31;

Yes, I really should portably compute the bitshift count using CHAR_BIT and sizeof(int) as nicely done in [Farrier, 2025].

Subtraction Bit Mask

Another way to get the same result is by noting the joke about -1 being the same value. Hence, this trick with subtraction on 2’s complement signed integers works:

    unsigned int mask = (unsigned) ( (int)(x < 0) - 1 );

The comparison generates an integer 0 or 1, and then we subtract 1 to get either 0xFFFFFFFF or 0. Hence, we needed to reverse the comparison test to “<” instead. All of the type casts are “free” without runtime costs, and are probably not necessary because implicit conversions would work, anyway.

Example: RELU Activation Function

Let’s have a go at making the RELU function branchless. RELU is an “activation function” in LLM backends, and it’s quite simple:

    if (x < 0) {
        RELU = 0;
    }
    else {
        RELU = x;
    }

In other words, change negatives to zero, but leave positives unchanged. Here’s the ternary version (faster):

    RELU = (x < 0) ? 0 : x;

The mask-by-subtraction version combines with bitwise-and to get:

    unsigned int mask = (x < 0) - 1;
    RELU &= mask;

Another idea for a branchless version of a bitwise RELU is:

    unsigned int umask = (x >> 31); // All 0’s or 1's
    RELU = (x | umask);

Actually, that’s buggy, with the bit masking the wrong way around. Here’s the correction:

    unsigned int umask = ((-x) >> 31); // All 0’s or 1's
    RELU = (x | umask);

Beware this might be a de-optimization, because the ternary version might be a single CMOV instructions, whereas this version has three operators: negative, right bitshift, and bitwise-AND.

Sign Bitshift Portability

There’s a major portability problem with this code, because right bitshift on a negative signed integer is actually undefined behavior in C++. The compiler is free to shift in zero bits or to sign bit extend on the leftmost bit position, in its sole discretion. Hence, you need to check your platform to see what the >> operator does, and whether this rightshift bitmask idea will work.

Note that we cannot fix this by doing the right bitshift on an unsigned type, which is guaranteed to shift in a zero bit (well-defined in standard C++, but not what we want). Note also that this is only undefined for right bitshift, not for left bitshift, which is well-defined and always shifts zero bits in on the right side (again, not what we want).

Of course, you can create the sign-based bitmask more portably by avoiding the right bitshift operator, but this loses the branchless benefits:

    unsigned int mask = (x >= 0) ? 0 : 0xFFFFFFFF;

That’s safe and slow, and what’s the point of that?

Lookup Tables

Precomputation of lookup tables is a fast way to get a double benefit of fast computation and branchless code. A good example in the standard C++ library are the functions for character types. Here’s a slow branching version:

    #define islower(c)   (((c) >= 'a') && ((c) <= 'z') )

This has lots of computation and there are also branches in the short-circuiting of the && operator.

A faster version uses a precomputed lookup table with 256 bytes.

    #define islower(c)  _islower_table[(unsigned char)(c)]

This is faster and branchless, at the cost of 256 bytes of global memory, and has already been done for you in the standard libraries by those uber-brainy compiler engineers.

References

Paul Bilokon, Burak Gunduz, 8 Sep 2023, C++ Design Patterns for Low-latency Applications Including High-frequency Trading, https://arxiv.org/abs/2309.04259, Code: https://github.com/0burak/imperial_hft
Sarah Butcher & Alex McMurray, 2 January 2025, The C++ techniques you need for $600k hedge fund jobs, https://www.efinancialcareers.com/news/low-latency-c
Paul Alexander Bilokon, Maximilian Lucuta, Erez Shermer, 27 Aug 2023, Semi-static Conditions in Low-latency C++ for High Frequency Trading: Better than Branch Prediction Hints, https://arxiv.org/abs/2308.14185, Code: https://github.com/maxlucuta/semi-static-conditions (Advanced branch prediction analysis, a way to do branches by self-modifying code at assembly level.)
John Farrier, March 2025, Branch Prediction: The Definitive Guide for High-Performance C++, https://johnfarrier.com/branch-prediction-the-definitive-guide-for-high-performance-c/
Srdjan Delić, Apr 10, 2023, Branchless programming — Why your CPU will thank you, https://sdremthix.medium.com/branchless-programming-why-your-cpu-will-thank-you-5f405d97b0c8
Jared Gorski, 11 August, 2020, Branchless programming, https://jaredgorski.org/notes/branchless-programming/
Algorithmica, March 2025 (accessed), Branchless Programming, https://en.algorithmica.org/hpc/pipelining/branchless/
Michael Kerrisk, Oct 5, 2012, How much do __builtin_expect(), likely(), and unlikely() improve performance? http://blog.man7.org/2012/10/how-much-do-builtinexpect-likely-and.html
Agner Fog, 28 May, 2024 (last update), The microarchitecture of Intel, AMD, and VIA CPUs: An optimization guide for assembly programmers and compiler makers, https://www.agner.org/optimize/microarchitecture.pdf
GCC, March 2025 (accessed), Common Function Attributes, https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html
Algorithmica, July 2025 (accessed), Binary Search, https://en.algorithmica.org/hpc/data-structures/binary-search/ (Shows a branchless binary search algorithm with prefetching.)
Paul-Virak Khuong, Pat Morin, 15 Mar 2017 (v2), Array Layouts for Comparison-Based Searching, https://arxiv.org/abs/1509.05053 (Branchless and cached versions of binary search on sorted arrays.)
Agner Fog, 22 June 2024 (last updated), Optimizing subroutines in assembly language: An optimization guide for x86 platforms, https://www.agner.org/optimize/optimizing_assembly.pdf

• Online: Table of Contents

• PDF: Free PDF book download

• Buy: C++ Ultra-Low Latency

C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations:

Low-level C++ efficiency techniques
C++ multithreading optimizations
AI LLM inference backend speedups
Low latency data structures
Multithreading optimizations
General C++ optimizations

Get your copy from Amazon: C++ Ultra-Low Latency

Aussie AI

Branchless C++ Programming Tricks