Chapter 37. Atomics & Memory Orders
Book Excerpt from "C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations"
by David Spuler, Ph.D.
What are Atomics?
Atomic variables are a C++11 feature whereby an operation on a variable
is performed “atomically” without requiring explicit locks for cross-thread synchronization.
The std::atomic library in the <atomic> header file
exists to provide these capabilities across platforms
in standard C++.
Note that there’s also a C version called _Atomic.
Atomics are mainly used to implement the “lock-free” versions of thread-safe data structures like concurrent stacks and queues. But that’s the advanced stuff!
The first point is to note that atomics can implement thread-safe algorithms for much simpler requirements, such as:
- Counters
- Sums
- Maximum or minimum
- Boolean flags
Don’t wrap a mutex or a lock check around a simple counter — use an atomic instead.
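For example, here is a minimal sketch of a multithreaded counter using std::atomic<int> instead of a mutex (the names and loop counts are illustrative, not from the book's code):
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

std::atomic<int> g_counter{0};   // hypothetical global counter

void worker()
{
    for (int i = 0; i < 100000; ++i) {
        ++g_counter;   // atomic increment: no mutex, no lost updates
    }
}

int main()
{
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back(worker);
    for (auto& th : threads)
        th.join();
    std::cout << g_counter << "\n";   // always prints 400000
    return 0;
}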
Standard Atomic Class
The atomic library provides a class template with pre-defined specializations for lots of different types. Hence, you can use atomics with various types of variables:
std::atomic<int> g_my_atomic_counter;
You can instantiate the atomic template with your own class types, but only if it satisfies various properties (e.g., trivially copyable). The main use of atomics is with scalar types such as integral types or pointers, which are almost always efficient.
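For instance, here is a minimal sketch of wrapping a small, trivially copyable user-defined type in std::atomic (the Point2D struct is hypothetical); note that only whole-object operations such as load() and store() are available, not arithmetic operators:
#include <atomic>

struct Point2D {    // trivially copyable, so std::atomic<Point2D> is allowed
    int x;
    int y;
};

std::atomic<Point2D> g_shared_point{Point2D{0, 0}};

void update_point()
{
    g_shared_point.store(Point2D{1, 2});   // the whole struct is written atomically
    Point2D p = g_shared_point.load();     // the whole struct is read atomically
}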
Implicit atomics.
Note that the performance of the atomic library can be very fast
for simple scalar variables.
On many platforms, an increment will compile down to a single machine code instruction on
the underlying int variable,
but on some obscure platforms it might be more complex.
For example, on many CPU platforms, an aligned read or write of an int
variable is implicitly atomic, because it runs in only a single CPU instruction.
Hence, the members of std::atomic<int> might simply be a nothingburger
that just accesses the integer variable underneath.
Emulated atomics. On the other hand, some platforms cannot really implement atomics properly for more complicated types, and have to fall back on their own locking algorithms. Most C++ code using an atomic should still work either way, but this gives insight into its performance characteristics on different platforms.
To check on the status,
there is the is_lock_free()
member function and the C++17 is_always_lock_free static constant in std::atomic to test whether
a particular instantiation is truly atomic, or whether the library
has to emulate atomicity using hidden locks and mutexes.
The first tests at run-time whether a particular variable is lock-free, and the second is a compile-time
constant saying whether that type of atomic is always lock-free, which is a hair-splitting difference, but occasionally matters.
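For example, a quick check might look like this (a minimal sketch; the printed labels are illustrative):
#include <atomic>
#include <iostream>

int main()
{
    std::atomic<int> a{0};
    std::cout << "lock-free now: " << a.is_lock_free() << "\n";    // run-time check on this object
    std::cout << "always lock-free: "
              << std::atomic<int>::is_always_lock_free << "\n";    // C++17 compile-time constant
    return 0;
}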
Atomic type aliases. If you get tired of typing the angle brackets for the template instantiation, there are some handy type aliases available since C++11, such as:
- atomic_int
- atomic_short
- atomic_bool
- atomic_size_t
There’s a lot more, but I’m sure you get the idea.
Basic Atomic Operators
Integer types are particularly well-supported by the atomic library. In simple cases, you can use the atomic variable in a way that mimics its use for the underlying type. You can access the integer value of the above atomic just by using its name, and use various operator overloads that the atomic library provides for each type, such as assignment and increment.
For example, if you wanted to track a counter of things happening across multiple threads, you could just do this in every thread using a global-scope atomic variable:
g_my_atomic_counter++; // incremented atomically
The unary operators defined for atomics on integer types include:
- Prefix and postfix ++ (increment)
- Prefix and postfix -- (decrement)
There are also various binary operators:
- Assignment (operator=)
- Extended assignment (e.g., operator+=)
Note that although there are no explicitly defined overloads
for common binary operators (e.g., + or -),
you can simply use the name of the atomic variable in such expressions,
and it will be treated as an integer,
via the implicit conversion operator to the underlying type.
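For example, a sketch of mixing an atomic counter into ordinary integer expressions (the values are illustrative):
#include <atomic>

std::atomic<int> g_my_atomic_counter{0};

void example()
{
    g_my_atomic_counter = 5;                // operator= (atomic store)
    g_my_atomic_counter += 2;               // operator+= (atomic read-modify-write)
    int total = g_my_atomic_counter + 10;   // implicit conversion to int, then plain addition (total == 17)
    (void)total;
}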
Don’t move or copy atomics. Although you can do various operations on the variable wrapped by an atomic, you technically cannot copy or move the entire atomic object itself. Both the copy and move versions of the constructor and assignment operator are deleted.
Advanced Atomic Operations
Each operation on an atomic variable is guaranteed by the C++ library to be performed as a single indivisible operation. However, there are cases where you want more control over the operation on the atomic, and also additional features that control reads and writes to the variable. Some of the more complex methods available for atomic variables include:
- load() — get the value (atomically).
- store() — write a value to the variable.
- exchange() — store a new value and return the old value.
These methods also have the ability to define a “memory order” for synchronization with other reads and writes to the variable. This is a complicated issue in synchronizing atomics across multiple threads for lock-free programming.
There are more complicated arithmetic operations with similar features. Some of the useful operations that you can perform include:
- fetch_add() — addition
- fetch_sub() — subtraction
- fetch_max() — maximum (C++26)
- fetch_min() — minimum (C++26)
There are also binary bitwise operations (since C++11) for atomics of integral types:
- fetch_and() — bitwise-and
- fetch_or() — bitwise-or
- fetch_xor() — bitwise-xor
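Here is a small sketch combining several of these operations (the starting value and arguments are illustrative):
#include <atomic>

std::atomic<int> g_value{10};

void example()
{
    int old1 = g_value.exchange(20);    // store 20, returns the previous value (10)
    int old2 = g_value.fetch_add(5);    // add 5, returns the previous value (20)
    g_value.fetch_and(0xFF);            // bitwise-and with 0xFF
    int now = g_value.load();           // read the current value (25)
    (void)old1; (void)old2; (void)now;  // silence unused-variable warnings
}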
Atomic flags.
The C++11 library also includes the std::atomic_flag class,
which is useful for concurrency.
This is a simple lock-free boolean flag whose minimal interface (test_and_set() and clear())
can be used to build synchronization primitives such as spinlocks.
It’s simpler than defining your own versions using the basic std::atomic class with a scalar type.
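For example, here is a minimal spinlock sketch built on std::atomic_flag (the SpinLock class name and details are illustrative, not a production-quality lock):
#include <atomic>

class SpinLock {
public:
    void lock()
    {
        // test_and_set() returns the previous value: spin while it was already set.
        while (m_flag.test_and_set(std::memory_order_acquire)) {
            // busy-wait (a real implementation might pause or yield here)
        }
    }
    void unlock()
    {
        m_flag.clear(std::memory_order_release);
    }
private:
    std::atomic_flag m_flag = ATOMIC_FLAG_INIT;   // starts in the "clear" state
};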
C++20 Atomics
C++20 adds some extra member functions to std::atomic
that give it new functionality that sounds a lot like a condition variable
or a spinlock.
The goal of adding these newer C++20 features
was improved efficiency over similar synchronization methods.
The members are:
- wait() — blocking call to wait until an atomic changes.
- notify_one() — notify one waiting thread.
- notify_all() — notify all the threads that are waiting.
These primitives allow a thread to wait for an atomic to change, which is a blocking call until its value changes (there are no spurious returns where the value has not changed). The notification methods allow for one or all threads to be signalled about a change to an atomic.
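As a minimal C++20 sketch (the g_ready flag and message are illustrative), one thread can block on an atomic until another thread changes it and sends a notification:
#include <atomic>
#include <iostream>
#include <thread>

std::atomic<bool> g_ready{false};

void waiter()
{
    g_ready.wait(false);        // block while the value is still 'false'
    std::cout << "Ready!\n";
}

int main()
{
    std::thread t(waiter);
    g_ready = true;             // change the value...
    g_ready.notify_all();       // ...then wake any waiting threads
    t.join();
    return 0;
}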
There are also some useful type aliases that can help you pick the most efficient type of atomic on a platform. These types are declared in C++20:
- atomic_signed_lock_free
- atomic_unsigned_lock_free
Memory Orders
Memory orders are a feature of advanced atomics, also defined in <atomic>.
The goal is to help interleave atomic operations with other atomic or non-atomic arithmetic
in a way that does not cause race conditions or other synchronization failures.
The enumeration std::memory_order defines constants for a number of “memory orders”
that can be used in atomic operations.
Simple atomics don’t require any fancy memory orders. You don’t really need to worry about memory orders for the very simple uses of atomics such as counters, which default to the safest and most restrictive memory order. But memory orders are critical for implementing advanced lock-free data structures with atomics.
The idea of memory orders is to block the compiler from doing some reordering optimizations that will break your code. If you don’t set any particular memory order, then the default memory order is used, which is “sequential consistency” and has these properties:
- The most restrictive memory model — blocking the optimizer.
- The safest — least likely to cause concurrency bugs.
- The slowest — compiler reordering optimizations are blocked.
The definition is
std::memory_order_seq_cst
from <atomic>.
It’s not very readable,
but
I guess no-one on the standards committee wanted to type “sequential consistency”
in their code.
There are a number of memory order constants that you can use. Here’s a list to help confuse the matter:
- std::memory_order_relaxed — “relaxed” (the least restrictive, fastest, and riskiest).
- std::memory_order_acquire — “acquire” (restricts memory reads).
- std::memory_order_release — “release” (restricts memory writes).
- std::memory_order_consume — “consume” (affects dependent operations).
- std::memory_order_acq_rel — “acquire-release” (combination of reads/writes).
- std::memory_order_seq_cst — “sequential consistency” (default, most restrictive, safest).
What do they do? Umm, nobody really knows, so just use whatever AI suggests. Let’s move on to the next chapter.
Using Memory Orders
If you’re still here, here’s the first point: you don’t define an atomic variable with a specific memory order. Rather, the memory orders are passed as optional parameters for the major atomic operations:
- load() — get the value of an atomic variable.
- store() — set an atomic variable.
Every operation on an atomic can choose a memory order. Here’s the sliding scale of options available to you:
- Relaxed — bugs.
- Sequential consistency — slugs.
Or you can choose something in the middle if you really know what you’re doing. Pay your money and take your chances.
Relaxed Memory Order
The “relaxed” mode doesn’t do much. It’s pretty chill about whatever the compiler wants to do, and no ordering constraints are applied to the optimizer (only the atomicity of the operation itself is guaranteed). Hence, it’s the fastest and most unsafe mode, where the compiler is “relaxed” and the programmer is the one who ends up “stressed.”
Using the relaxed mode is a significant optimization, so it pays to consider when you can get away with it. Some of the simpler uses of atomic variables for counters or flags don’t need any memory synchronization at all. Let’s declare some atomics:
std::atomic<int> g_atomic_counter;
std::atomic<bool> g_atomic_shutdown_flag;
The question is whether there are any other dependent variable reads or writes happening around your operation on the atomic variable. Examples where there are often no such dependencies include:
- Basic atomic counter
- Global flag for all threads
If you’re using an atomic<int> variable as a counter
of something, it’s quite possible that nothing depends on it.
You want every thread to be able to increment the counter (without losing one),
but this is guaranteed by atomic semantics.
The default is “sequential consistency” for this:
g_atomic_counter++;
But it might actually be faster to do this in “relaxed” mode:
g_atomic_counter.fetch_add(1, std::memory_order_relaxed);
Another example is our global “shutdown” flag that tells all the threads to close up shop. As an atomic, we can directly assign it, which uses the “sequential consistency” memory order:
g_atomic_shutdown_flag = true;
There aren’t really any dependent operations on this flag, other than the threads occasionally check it. Note that an atomic flag like this doesn’t do any signalling by default, so we’re assuming that other threads are watching, or getting signalled another way. In any case, we can probably use “relaxed” mode to set our atomic flag:
g_atomic_shutdown_flag.store(true, std::memory_order_relaxed);
We might also want to test std::atomic_flag, to see if it’s any faster,
since it’s a pre-defined class with similar semantics.
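As a sketch of the reading side (the worker_loop function is illustrative and assumes nothing else depends on the ordering of this flag), each thread can poll the shutdown flag with relaxed loads:
#include <atomic>
#include <chrono>
#include <thread>

std::atomic<bool> g_atomic_shutdown_flag{false};

void worker_loop()
{
    while (!g_atomic_shutdown_flag.load(std::memory_order_relaxed)) {
        // ... do one unit of work ...
        std::this_thread::sleep_for(std::chrono::milliseconds(10));   // hypothetical pacing
    }
    // clean up and exit
}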
Load and Store Memory Orders
The atomic load() and store() operations allow a memory order to be specified. Both of them default to “sequential consistency” (slow and safe), if no memory order argument is specified.
The alternative memory orders are quite limited for these primitives, because some memory orders cause undefined behavior. In addition to the default “sequential consistency” memory order, the options for a more efficient memory order are:
- load() — “consume” or “acquire” or “relaxed”
- store() — “release” or “relaxed”
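For example, here is the classic release/acquire “publication” pattern as a sketch (the variable names are illustrative): the release store on the flag guarantees that the earlier write to the payload is visible to any thread that observes the flag with an acquire load.
#include <atomic>
#include <iostream>
#include <thread>

int g_payload = 0;                      // ordinary (non-atomic) data
std::atomic<bool> g_published{false};   // flag guarding the payload

void producer()
{
    g_payload = 42;                                       // write the data first
    g_published.store(true, std::memory_order_release);   // then publish it
}

void consumer()
{
    while (!g_published.load(std::memory_order_acquire)) {   // wait for publication
        // spin
    }
    std::cout << g_payload << "\n";   // guaranteed to print 42
}

int main()
{
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
    return 0;
}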
Undefined Behavior
There are some memory orders that are simply incorrect, and lead to “undefined behavior” according to the C++ standard. Some examples include:
- load() — memory orders memory_order_release and memory_order_acq_rel are undefined.
- store() — memory orders memory_order_consume, memory_order_acquire and memory_order_acq_rel are undefined.
Note that “acquire-release” memory order cannot be used at all with these methods.
Data Hazards are not Memory Orders
You may have heard of an ordering issue called “data hazards” that includes problems such as:
- Read-After-Write (RAW)
- Write-After-Write (WAW)
- Write-After-Read (WAR)
- Read-After-Read (RAR) (harmless!)
However, data hazards are not actually related to memory ordering, nor even to multithreading. Instead, data hazards are a pipelining issue inside the CPU’s instruction scheduler related to “instruction reordering” and “out-of-order” execution. The two sound similar because both involve problematic orderings, but memory orders concern the visibility of operations across multiple threads, whereas data hazards arise inside the CPU from instruction ordering within a single thread. Hence, data hazards can be delegated to the hardware engineers, and we C++ programmers have one less thing to worry about!
Extensions
- Explore the use of std::atomic<bool> versus std::atomic_flag in modern C++.
- Examine the performance of std::atomic for various types, examining the costs of primitives such as locking and unlocking, along with basic class operations such as construction and destruction.
- Research the details of all the various memory orders.
References
- Emily Dawson, April 2025, Multithreading with C++: Parallel Programming Guide, https://www.amazon.com/dp/B0F494Z76L/
- Sourav Ghosh, July 2023, Building Low Latency Applications with C++, Packt Publishing, https://www.amazon.com/dp/1837639353
- CPP Reference, May 2025 (accessed), std::atomic, https://en.cppreference.com/w/cpp/atomic/atomic
- CPP Reference, May 2025 (accessed), std::memory_order, https://en.cppreference.com/w/cpp/atomic/memory_order