
Chapter 38. Lock-Free Data Structures

  • Book Excerpt from "C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations"
  • by David Spuler, Ph.D.


What are Lock-Free Data Structures?

Lock-free programming is a method of optimizing multithreaded code to avoid locks (i.e., mutexes) by using atomics instead. Mutexes have a significant overhead, whereas atomics are more efficient, but that’s not the only benefit. The advantages in speed and lower latency arise from reducing:

  • Overhead of mutexes and lock guards
  • Lock contention overhead
  • Lost performance from threads blocked awaiting a resource
  • Context-switching overhead

Generally speaking, there should be a higher throughput with none of the threads blocked to wait, which also avoids context switching. Threads can execute an atomic operation and keep going, which is better for CPU utilization, assuming there is enough work needing to be done.
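For instance, a shared counter guarded by a mutex can be replaced by a single atomic increment. The sketch below is a hypothetical illustration (the names run_both, add_locked, and add_lockfree are ours, not from any library) contrasting the two styles:

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

// Lock-based: every increment takes and releases the mutex.
std::mutex counter_mutex;
long locked_counter = 0;
void add_locked(int n) {
    for (int i = 0; i < n; ++i) {
        std::lock_guard<std::mutex> guard(counter_mutex);
        ++locked_counter;
    }
}

// Lock-free: each increment is a single atomic read-modify-write;
// no thread ever blocks, so there are no lock-induced context switches.
std::atomic<long> lockfree_counter{0};
void add_lockfree(int n) {
    for (int i = 0; i < n; ++i) {
        lockfree_counter.fetch_add(1, std::memory_order_relaxed);
    }
}

// Run both styles with the same thread count and workload.
long run_both(int nthreads, int n) {
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t) {
        threads.emplace_back(add_locked, n);
        threads.emplace_back(add_lockfree, n);
    }
    for (auto& th : threads) th.join();
    return lockfree_counter.load();
}
```

Both versions produce the same final count; the difference is that the atomic version never parks a thread in the kernel waiting for the lock.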

Lock-free algorithms also have some safety and resilience advantages. Since the threads no longer block waiting for locks, this avoids some pitfalls of multithreading:

  • Lower risk of deadlock or livelock
  • Reduced chance of priority inversion

The main disadvantage of lock-free programming:

  • Your brain will explode.

The internet is littered with articles about failed attempts to write lock-free algorithms, even by some of the best programmers. There are many ways to go wrong in the quest to get rid of mutexes.

There are actually some real downsides to lock-free programming, and it’s not an automatic performance win. Some issues include:

  • Load balancing properties can change or worsen if no thread ever blocks.
  • Lock-free primitives are not always faster than mutexes or other locking methods.
  • Low-contention applications may perform worse under lock-free methods.
  • Weirdly, overall contention can worsen if no thread ever gets swapped out.

And worst of all, errors in coding the complex lock-free algorithms can not only cause bugs, but can also introduce insidious slugs!

Implementing Lock-Free Methods

Lock-free programming is the hardest part of multithreading. If you can do this, you can do anything. But the reverse also applies: if you’re still struggling to do other types of multithreading, don’t try to do this yet. To do lock-free programming, you really need to understand:

  • Overall locking strategies (mutexes, locks)
  • Atomics (basic usage)
  • Memory orders (in relation to atomics)

Note that “lock-free” programming does not mean that you just search up “mutex” in vi, and then hit the “dd” button. No, lock-free programming is not just sequential programming. Instead, the idea is to switch to a faster concurrency method than mutexes, so this is the main idea:

  • std::mutex — lock-based programming.
  • std::atomic — lock-free programming.

The overall idea is to use an “atomic” operation instead of a mutex. However, simple atomic loads and stores are not adequate: you need the “compound” read-modify-write operations. Hence, to make this work, it’s usually a quite complex atomic operation, such as a “Compare-And-Swap” (CAS) operation or a “Fetch-and-Add” computation.

Let’s examine the “Compare-And-Swap” approach in detail. This is how a CAS operation works, with a number of steps all done atomically in one unbreakable sequence:

  • Access a variable (that you want to set atomically).
  • Compare it to the “old” or “expected” value.
  • If it’s equal to the old value, then successfully update to the new value (and done).
  • If it’s not equal to the old value, someone else has already updated it, so we fail (and then loop around and retry).
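The steps above can be modeled in ordinary code. The model_cas function below is a hypothetical illustration only (our name, not a library function); real CAS executes the whole body as one indivisible hardware operation:

```cpp
// Plain-code model of Compare-And-Swap. Real CAS executes this entire
// body atomically; this model just shows the logic.
bool model_cas(int& value, int& expected, int desired) {
    if (value == expected) {
        value = desired;      // success: install the new value
        return true;
    }
    expected = value;         // failure: report the value actually seen
    return false;
}
```

Note the failure path: it writes the observed value back into the “expected” argument, a detail that matters later in this chapter.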

What a mouthful! Fortunately, C++ has the std::atomic class (since C++11) to take care of all that. The main routines to use for a CAS instruction are:

    std::atomic::compare_exchange_weak
    std::atomic::compare_exchange_strong

Note that you will also need to know about “memory orders” around atomic primitives, as controlled via the std::memory_order library.

Weak or Strong CAS?

Should you use the weak or strong version of the CAS primitive? The strong version is guaranteed not to fail for “spurious” reasons; it fails only if the atomic’s value is not what you expected. By comparison, the weak version can fail for two reasons:

  • Wrong atomic value
  • Spurious error failures

Thus, the strong version seems better, but even so, the most common idiom for using CAS in lock-free programming is the use of the weak version, but in a loop. This idea simply retries if the weak CAS primitive fails, whether due to the underlying atomic variable’s value being wrong, or due to the obscure spurious failures.

Both the weak and strong CAS primitives are usually in a loop. The weak CAS can fail for two reasons, being spurious failures or another thread modified the value, and needs to retry. The strong CAS will not fail for spurious reasons, but can still fail for the second reason, and still often needs a loop. The weak version is more commonly used because it’s somewhat more efficient, under the assumption that spurious failures are rare.
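Here is the idiomatic weak-CAS retry loop, illustrated with a hypothetical “atomic maximum” helper (atomic_store_max is our name, not a standard function):

```cpp
#include <atomic>

// Idiomatic weak-CAS retry loop: keep trying until we either install
// our value or observe that a larger value is already stored.
void atomic_store_max(std::atomic<int>& target, int value) {
    int old = target.load(std::memory_order_relaxed);
    while (old < value &&
           !target.compare_exchange_weak(old, value)) {
        // On failure (spurious or otherwise), 'old' now holds the
        // current value, and the loop condition re-tests it.
    }
}
```

Whether the weak CAS fails spuriously or because another thread moved the value, the loop simply re-evaluates and retries.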

Example: Lock-Free Stack Array

A lock-free stack implemented in an array is a great example to use, because it has only one moving piece: the stack pointer. This is an integer index used to identify the level of usage in the array, and also doubles as a counter of how many items are on the stack.

Here’s our basic interface for an array-based stack:

    #include <atomic>      // std::atomic
    #include <stdexcept>   // std::runtime_error (used by the throws below)

    template<typename T, int N>
    class LFStackArray {
    private:
        std::atomic<int> sp_;   // Stack pointer
        T arr_[N];              // Fixed-size array
    public:
        LFStackArray() : sp_{-1}, arr_{} { }
        ~LFStackArray() { }
        LFStackArray(const LFStackArray&) = delete;
        LFStackArray(LFStackArray&&) = delete;
        LFStackArray& operator=(const LFStackArray&) = delete;
        LFStackArray& operator=(LFStackArray&&) = delete;
    };

And here are some basic member functions:

    bool empty() const { return sp_ == -1; }
    bool full() const { return sp_ == N - 1; }
    int count() const { return sp_ + 1; }

And let’s define the main member functions implementing the stack LIFO functionality. The top() function is a const member that does not pop the stack (like the standard C++ stack container):

    T top() const {
        if (sp_ == -1) {
            throw std::runtime_error("Stack underflow top");
        }
        else {
            return arr_[sp_];
        }
    }

Here’s the pop() function that decrements the stack pointer:

    void pop() {
        if (sp_ == -1) {
            throw std::runtime_error("Stack underflow pop");
        }
        else {
            sp_--; 
        }
    }

And here’s the push() member function that increments the stack pointer:

    void push(const T& item) {
        if (full()) {
            throw std::runtime_error("Stack overflow");
        }
        else {
            arr_[++sp_] = item;
        }
    }

See Any Bugs?

This code will run fine in many cases, but it has several concurrency bugs if multiple threads are pushing and popping. The sp_ variable is atomic, and all of the operations on this variable will be correctly serialized. Problems arise because each of the main member functions accesses the atomic variable twice.

Any interleaving access in another thread that modifies the stack pointer between those two accesses can break the code. Since all the member functions are short, and the two accesses are within a few instructions, these bugs would be rare situations, but are still an insidious problem.

One way to fix these problems would be to just remove the exception-handling code. All of the extra reads on the stack pointer are to detect overflow and underflow conditions. But that’s not a great design decision to make, based solely on our lack of expertise in lock-free programming.

CAS Versions

A better idea is to use the “Compare-And-Swap” (CAS) idiom, repeated in a loop, for proper lock-free versions. Here’s an updated pop() method:

    void pop() {
        int oldsp = sp_.load();
        if (oldsp == -1) {
            throw std::runtime_error("Stack underflow pop");
        }
        while(!sp_.compare_exchange_weak(oldsp, oldsp - 1)) {
            // Nothing (try again)
        }
    }

Note that using sp_.load() in this version is not really different from just using sp_ by name (an implicit load), but the second part now uses a CAS loop. The CAS call checks that the stack pointer still has the expected value (i.e., that it was not modified by some other thread), and we loop around until it succeeds. Since this is a “weak” CAS call, it can also fail for spurious reasons, but that’s not a problem, because we just retry in that situation, too.

What’s missing?

There are no memory orders specified anywhere, so the atomic calls are defaulting to sequentially consistent ordering (std::memory_order_seq_cst) in both the load and the CAS loop. That’s the safest memory order, but it’s needlessly inefficient here.

There are three places where we can specify an alternative memory order. But which to choose? The best choices are:

  • Initial load() call — “relaxed” memory order (fastest)
  • Weak CAS success — “acquire” memory order
  • Weak CAS failure (retry loop) — “relaxed” memory order (fastest)

We can get away with relaxed mode for the initial load() because it’s not critical. We’re testing for an error, and we also don’t care too much about ordering of accesses leading up to the CAS test.

Similarly, we also don’t much care about ordering whenever the weak CAS call fails, whether for spurious reasons or because the value has changed. Either way, we’re just looping back to re-try, and the ordering up to the next CAS call doesn’t matter.

However, we do care on a successful CAS result, which is when the stack pointer is actually being updated to a new value. Hence, we choose “acquire” rather than “relaxed” for that option.

Our new function looks like:

    void pop() {
        int oldsp = sp_.load(std::memory_order_relaxed);
        if (oldsp == -1) {
            throw std::runtime_error("Stack underflow pop");
        }
        while(!sp_.compare_exchange_weak(oldsp, oldsp - 1, 
            std::memory_order_acquire,   // Success mode
            std::memory_order_relaxed)   // Failure mode
            ) {
            // Nothing (try again)
        }
    }

Still Buggy!

There’s an obscure problem in the above pop() function that indicates a misunderstanding of how the CAS primitives work. They don’t just update the atomic, but also the passed-in parameter.

Let’s consider a stack that currently contains one element, but two threads are both trying to pop the stack. Consider this sequence:

  • Thread A: starts and calls load() inside pop()
  • Context switch
  • Thread B: runs a full pop() function to pop the stack (i.e., load() and then CAS success).
  • Context switch
  • Thread A: continues, but weak CAS fails (value of sp_ was changed by Thread B).
  • Thread A: loops around to retry.
  • Thread A: weak CAS now succeeds using the sp_ value updated by Thread B (yes, really)

The end result of all this is a major bug:

  • The atomic is updated to an invalid stack pointer value of -2.
  • Thread A’s pop should have detected underflow and thrown an exception (Thread B had already popped the only element), but instead it silently decremented the empty stack’s pointer.

The CAS primitive can change the variable passed as the expected value. Hence, the value of oldsp in the above code can change before and after the loop, and also whenever there’s a loop-around to retry. Look at the official signature of the weak CAS function, and you’ll notice that the first argument containing the “expected” or “old” value is a non-const reference. The second argument is not a reference.

Why aren’t they the same?

The compare_exchange_weak() function can modify the expected value (used to test), but not the “new” value, used to store. This means that:

    (a) If it succeeds, the “old” variable is left unchanged (it already matched the value stored in the atomic).

    (b) If it fails and loops around, the “old” variable will have whatever value another thread changed it to.
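A small demonstration of both cases (the wrapper name cas_failure_updates_expected is ours):

```cpp
#include <atomic>
#include <cassert>

// Demonstrates that compare_exchange rewrites the "expected" argument
// only on failure, and never touches the "desired" argument.
bool cas_failure_updates_expected() {
    std::atomic<int> a{5};
    int expected = 3;                              // deliberately wrong
    bool ok = a.compare_exchange_strong(expected, 7);
    assert(!ok && expected == 5);                  // failure: expected rewritten to 5
    ok = a.compare_exchange_strong(expected, 7);   // now expected matches
    assert(ok && expected == 5 && a.load() == 7);  // success: expected untouched
    return ok;
}
```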

When you examine lock-free versions with CAS primitives and loops, you’ll notice a few things about the pattern:

    1. The “old” value should be retested every loop iteration (after CAS failure).

    2. The “old” value should also be retested after the loop (after CAS success).

    3. The “expected” value parameter to weak CAS may need to be re-computed each iteration, based on the “old” value (which can change), rather than using an unchanging separate variable to contain the expected value.

This is getting complicated! Well, yes, that’s the fun of lock-free coding.

Anyway, here’s the final version with the corrected weak CAS calls. This defers the underflow test until after the loop, where it catches both cases of underflow: initial underflow or an underflow caused by some other thread popping the stack out from under us.

    void pop() {
        int oldsp = sp_.load(std::memory_order_relaxed);        
        while (oldsp != -1 && !sp_.compare_exchange_weak(oldsp, oldsp - 1,
            std::memory_order_acquire,   // Success mode
            std::memory_order_relaxed)   // Failure mode
            ) {
            // Nothing (try again)
        }
        if (oldsp == -1) {
            throw std::runtime_error("Stack underflow pop");
        }
    }

Hopefully, this new version now has all the various concurrent execution sequences covered:

  • Success: valid pop on the stack without any changes by another thread.
  • Success: valid pop on the stack where another thread removes one (or more) stack elements, but doesn’t fully empty the stack.
  • Underflow: Pop on an already-empty stack.
  • Underflow: Pop on a non-empty stack, but another thread empties the stack before us.

Difficulties with Lock-Free Coding

What’s so hard about coding a lock-free algorithm? Well, it’s a totally different way of thinking about concurrency compared to the use of standard locking mechanisms. Some of the problems include:

  • Catering for all possible instruction ordering sequences.
  • Choosing the right memory order to guarantee correctness.
  • Higher-level concurrency problems with the interface.
  • Handling the “ABA” problem where updates are missed.

We’ve already discussed instruction ordering and memory orders in Chapter 7 on atomics, so let’s look at the other issues now.

High-Level Race Conditions

Even when we’ve correctly implemented the member functions with atomics and memory orders, this lock-free stack is still problematic to use. The stack itself will stay consistent no matter what member functions are called in what order, but there are higher-level concurrency problems with any paired usage of multiple member functions, such as sequences like:

  • Top and then pop
  • Test empty before calling pop
  • Test full before calling push

What we’d need to do is define some more composite member functions using lock-free methods. Ideas for new methods to add in the interface for better usage in multithreaded applications include:

  • Top-and-pop
  • Pop-if-not-empty
  • Push-if-not-full
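As a sketch of how such a composite might look, here is a hypothetical try_pop (pop-if-not-empty) member, shown in a cut-down standalone copy of the chapter’s array stack. Returning a success flag instead of throwing avoids the racy empty-then-pop pairing, though the element read itself is still simplistic with respect to concurrent pushes:

```cpp
#include <atomic>

// Cut-down sketch of the chapter's array stack with a composite
// "pop-if-not-empty" operation. The push here is the simplified
// version from the text, kept only so the example is self-contained.
template<typename T, int N>
class LFStackArray {
    std::atomic<int> sp_{-1};   // stack pointer (index of top element)
    T arr_[N] = {};             // fixed-size array
public:
    void push(const T& item) { arr_[++sp_] = item; }

    // Composite pop-if-not-empty: CAS loop, then report success/failure.
    bool try_pop(T& out) {
        int oldsp = sp_.load(std::memory_order_relaxed);
        while (oldsp != -1 &&
               !sp_.compare_exchange_weak(oldsp, oldsp - 1,
                   std::memory_order_acquire,    // success mode
                   std::memory_order_relaxed)) { // failure mode: retry
        }
        if (oldsp == -1) return false;  // stack was (or became) empty
        out = arr_[oldsp];              // note: still racy vs. concurrent push
        return true;
    }
};
```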

ABA Problems

The ABA problem is a more general concurrency issue that is particularly applicable to lock-free sequences. The ABA problem occurs when a shared variable undergoes this uncommon sequence of updates in one thread:

  • Initial value — A
  • Update to value — B
  • Second update — A

The problem occurs in a second thread, and it’s not really obvious why it’s tricky. After the ABA sequence, the second thread sees the value as A, which is unchanged from the prior value it would have seen. Hence, the second thread doesn’t know about the intervening value B. Depending on context, this is sometimes no problem, or sometimes a major concurrency error whereby the second thread wrongly assumes that nothing has changed, and the data structure hasn’t been updated.

A good example is an array implementation of a lock-free stack. The index value is incremented by one on a push, and then decremented back to its prior value by a thread doing a pop. This is an ABA sequence on an integer index, whereby another thread might assume that the stack index has not changed. If an ABA sequence occurs in the first thread while a second thread is accessing the “top” element, the second thread may see that the stack index is unchanged and assume the same element is still on top of the stack, when in fact there’s an entirely different value there. Remember, it’s the atomic integer representing the stack index that has undergone the ABA sequence, not the actual object on the stack.

This problematic sequence can occur in any synchronization style including both locking and lock-free algorithms. It’s quite an insidious bug, because the ABA sequence doesn’t occur that often. In particular, a lot of the lock-free methods of using a CAS operation in a loop are checking for the old value, and can be vulnerable to an ABA sequence.
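One common mitigation, sketched below under our own names (TaggedIndex and pop_index are hypothetical, not library functions), is to pack the index together with a monotonically increasing tag into a single 64-bit atomic. Even if the index returns to a prior value (A to B back to A), the tag differs, so a stale CAS fails:

```cpp
#include <atomic>
#include <cstdint>

// Pack a 32-bit stack index and a 32-bit modification counter ("tag")
// into one 64-bit word, so every update changes the atomic's bits.
struct TaggedIndex {
    static uint64_t pack(int32_t index, uint32_t tag) {
        return (uint64_t(uint32_t(index)) << 32) | tag;
    }
    static int32_t index(uint64_t packed) { return int32_t(uint32_t(packed >> 32)); }
    static uint32_t tag(uint64_t packed)  { return uint32_t(packed); }
};

// Pop-style decrement: lower the index and bump the tag in one CAS,
// so a thread holding a stale (index, tag) pair cannot succeed.
bool pop_index(std::atomic<uint64_t>& state) {
    uint64_t old = state.load(std::memory_order_relaxed);
    while (TaggedIndex::index(old) != -1 &&
           !state.compare_exchange_weak(old,
               TaggedIndex::pack(TaggedIndex::index(old) - 1,
                                 TaggedIndex::tag(old) + 1))) {
        // retry with the refreshed 'old'
    }
    return TaggedIndex::index(old) != -1;
}
```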

C++20 Atomics and Lock-Free

The C++20 standard introduced some extra primitives in the std::atomic class, which are similar in nature to condition variables and spinlocks (or a hybrid thereof). The primitives include:

  • wait() — blocking call to wait until an atomic changes.
  • notify_one() — notify one waiting thread.
  • notify_all() — notify all the threads that are waiting.

An important point to note is that if you’re using these C++20 primitives to implement thread synchronization, it’s more like using locks than a lock-free algorithm. Having threads block while waiting on an update to an atomic variable is a great feature of C++20, but it’s no longer really a lock-free data structure: the main hallmark of lock-free programming, where every thread keeps going, is missing in that style.

Freestanding Atomic Functions

Standard C++ provides a number of “free-standing” atomic functions that mirror the member functions. For example, there is:

  • atomic_fetch_add() — similar to fetch_add().
  • atomic_fetch_sub() — matches fetch_sub().

There are also “explicit” versions of many of these functions:

  • atomic_fetch_add() — default memory order.
  • atomic_fetch_add_explicit() — extra argument for the memory order.

The differences in all these free-standing versions of the functions compared with the main C++ ones are:

  • Not member functions (free-standing).
  • Accept a pointer to an atomic, not a reference.
  • Memory orders can be specified in the “explicit” versions.
  • Consistent with the C-language versions defined in the C11 standard.

The reason for pointer arguments is that these functions are consistent with the C language versions declared in <stdatomic.h> in C11 (not C++11).
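A quick sketch of the freestanding style (the freestanding_demo wrapper name is ours):

```cpp
#include <atomic>

// Freestanding atomic functions take a pointer to the atomic,
// mirroring the C11 <stdatomic.h> interface.
int freestanding_demo() {
    std::atomic<int> a{10};
    std::atomic_fetch_add(&a, 5);              // default memory order; a becomes 15
    std::atomic_fetch_sub_explicit(&a, 1,
        std::memory_order_relaxed);            // explicit memory order; a becomes 14
    return a.load();
}
```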

Always Faster?

Note that lock-free algorithms are not always a speed improvement. There is a significant case to be made that lock-free algorithms can increase thread contention. Hence, it is important to run before-and-after timings if you’re switching from a lock implementation to a lock-free version of your thread-safe data structure. Since the concern relates to lock contention when there are multiple threads, it is important to time the performance of your overall application across multiple threads under a realistic load, rather than just benchmarking the low-level lock-free queue primitives. See the references section for various Stack Overflow conversations involving quite animated discussions on when and whether a lock-free algorithm is better or worse than locking methods.

Portability issues

There are also a variety of non-standard methods to achieve lock-free programming with primitives in older code platforms, or in a platform-specific manner. Some of the primitives are:

  • InterlockedCompareExchange — Win32 version in <winnt.h>.
  • OSAtomicCompareAndSwapInt — iOS/Mac variant in <OSAtomic.h>
  • __atomic_compare_exchange — older GCC built-in version.

Note that the std::atomic class is not actually guaranteed to be a lock-free atomic operation on every platform. It’s a good idea to test your platform using the “is_lock_free” primitive as part of your initialization or self-testing code. Note that is_lock_free() is a non-static member function, so it needs an object to call it on (C++17 also offers the compile-time is_always_lock_free constant):

    std::atomic<int> x;
    assert(x.is_lock_free());                              // run-time check
    static_assert(std::atomic<int>::is_always_lock_free);  // C++17, compile-time

Extensions

  1. Implement a lock-free version of fine-grained locking using lock striping on a vector data structure with an array of atomics instead of mutexes (see discussion of “lock striping” in Chapter 4).
  2. Implement a lock segmenting version of lock-free fine-grained locking on a vector data structure using atomic arrays, not mutexes (see discussion of “lock segmenting” in Chapter 4).

References

  1. Paul Bilokon, Burak Gunduz, 8 Sep 2023, C++ Design Patterns for Low-latency Applications Including High-frequency Trading, https://arxiv.org/abs/2309.04259, Code: https://github.com/0burak/imperial_hft
  2. Jeff Preshing, Jun 12, 2012, An Introduction to Lock-Free Programming, https://preshing.com/20120612/an-introduction-to-lock-free-programming/
  3. Deb Haldar, August 17, 2017, Top 20 C++ multithreading mistakes and how to avoid them, https://acodersjourney.com/top-20-cplusplus-multithreading-mistakes/
  4. Apple, March 2025 (accessed), Mac OSX documentation, https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man3/OSAtomicAdd32.3.html
  5. Wikipedia, March 2025 (accessed), Non-blocking algorithm, https://en.wikipedia.org/wiki/Non-blocking_algorithm
  6. Herb Sutter, September 08, 2008, Lock-Free Code: A False Sense of Security, Dr Dobbs Magazine (archived), https://web.archive.org/web/20150901211737/http://www.drdobbs.com/article/print?articleId=210600279&siteSectionName=cpp
  7. Microsoft, 24 May, 2022, InterlockedCompareExchange function (winnt.h), https://learn.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-interlockedcompareexchange
  8. GNU Foundation, March 2025 (accessed), 6.26 Built-in Functions for Memory Model Aware Atomic Operations, https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
  9. CPP Reference, March 2025 (accessed), std::atomic<T>::compare_exchange_weak, std::atomic<T>::compare_exchange_strong, https://en.cppreference.com/w/cpp/atomic/atomic/compare_exchange
  10. CPP Reference, March 2025 (accessed), std::memory_order, https://en.cppreference.com/w/cpp/atomic/memory_order
  11. Sourav Ghosh, July 2023, Building Low Latency Applications with C++, Packt Publishing, https://www.amazon.com/dp/1837639353
  12. Tim Blechmann, 2011, Chapter 17. Boost.Lockfree, https://www.boost.org/doc/libs/1_53_0/doc/html/lockfree.html, https://www.boost.org/doc/libs/1_53_0/doc/html/boost/lockfree/queue.html
  13. Stack Overflow, 2011, Do lock-free algorithms really perform better than their lock-full counterparts? https://stackoverflow.com/questions/5680869/do-lock-free-algorithms-really-perform-better-than-their-lock-full-counterparts
  14. Stack Overflow, 2017, Using Boost.Lockfree queue is slower than using mutexes, https://stackoverflow.com/questions/43540943/using-boost-lockfree-queue-is-slower-than-using-mutexes

 
