Chapter 36. Lock Contention

  • Book Excerpt from "C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations"
  • by David Spuler, Ph.D.

What is Lock Contention?

Lock contention is a multithreading slowdown where threads are blocked waiting on locks held by other threads. If your code has a lot of busy threads, then any synchronization code (e.g., using mutexes or condition variables) can lead to contention over access to shared data.

Note that lock contention is not the same thing as lock overhead. Lock contention is the extent to which threads get blocked waiting for a lock. Lock overhead is the extra cost of the lock-related library calls themselves, such as requesting a lock, releasing a lock, creating a mutex, or destroying a mutex.

All multithreaded applications have some level of lock contention, otherwise why would they need locks at all? Hence, optimizing to reduce lock contention is something that you can’t avoid. General points about lock contention include:

  • More threads means more opportunities for lock contention.
  • So does having more locks (all other things being equal).
  • Unpopular shared data is unlikely to cause contention.
  • Fine-grain locking is desirable for often-used data.

In the worst case, you get to a deadlock situation, which upgrades the lock contention problem from a slug to a bug.
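
To make the problem concrete, here’s a minimal sketch of contention in action (the function names slow_worker and fast_worker are made up for illustration): one thread holds the mutex for a long critical section, so the other thread spends most of its time blocked on the lock rather than doing useful work.

    #include <chrono>
    #include <iostream>
    #include <mutex>
    #include <thread>

    std::mutex g_mutex;        // guards g_shared_value
    int g_shared_value = 0;

    void slow_worker()
    {
        std::lock_guard<std::mutex> guard(g_mutex);
        // Long critical section: the lock stays held for the whole sleep.
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        ++g_shared_value;
    }

    void fast_worker()
    {
        // May block here while slow_worker holds g_mutex; that wait is contention.
        std::lock_guard<std::mutex> guard(g_mutex);
        ++g_shared_value;
    }

    int main()
    {
        std::thread t1(slow_worker);
        std::thread t2(fast_worker);
        t1.join();
        t2.join();
        std::cout << g_shared_value << "\n";   // prints 2
    }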

Optimizing Lock Contention

General strategies for reducing lock contention include:

  • Short critical sections
  • Reduce total lock requirements
  • Acquire locks late
  • Release locks early

Here’s the best one:

  • No synchronization — don’t use any locks at all!

Unfortunately, the “no locks” plan has its limitations, being mostly limited to read-only data used by multiple readers. Nevertheless, your first thought should be whether there’s a way to do this without needing a lock at all.

Some of the specific strategies for using fewer locks or otherwise reducing contention include:

  • Consider using fewer threads (so less contention for locks).
  • Maximize lock-friendly data handling (e.g., prefer “immutable” read-only data).
  • Review lock granularity (fine-grain vs coarse-grain vs a hybrid strategy).
  • Tolerate lockless output (e.g., out-of-order debug logging messages aren’t so bad).
  • Limit block scope of std::lock_guard to release the lock early.
  • Use std::unique_lock and other variants for more flexibility.
  • Copy data to temporary variables to release locks before processing data (see the sketch after this list).
  • Use queues as the preferred method to transfer large amounts of data.
  • Avoid false sharing (can impact lock contention issues).
  • Release locks before blocking system calls, I/O waits, or network actions.
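
As an illustration of the “copy then release” idea from the list above, here’s a minimal sketch (the names g_shared_record and expensive_processing are hypothetical): the mutex is held only long enough to copy the shared data into a local variable, and the slow processing then runs on the private copy with the lock already released.

    #include <mutex>
    #include <string>

    std::mutex g_data_mutex;       // guards g_shared_record
    std::string g_shared_record;   // shared data updated by other threads

    std::string expensive_processing(const std::string& s)  // stand-in for slow work
    {
        return s + " (processed)";
    }

    void handle_record()
    {
        std::string local_copy;
        {
            // Hold the lock only long enough to copy the shared data out.
            std::lock_guard<std::mutex> guard(g_data_mutex);
            local_copy = g_shared_record;
        }   // mutex released here
        // The slow work runs on the private copy without blocking other threads.
        std::string result = expensive_processing(local_copy);
        (void)result;
    }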

Some examples of other advanced strategies include:

  • Reader-friendly data structures (e.g., versioned data structures, copy-on-write).
  • Kernel bypass (for I/O efficiency).
  • Double lock check method (first check without a lock, then acquire the lock and check again; see the sketch after this list).
  • Exponential backoff when waiting (e.g., avoiding spinlock busy waits).
  • Shard or partition data across multiple threads (avoids need for locks).
  • Use message-passing via std::promise and std::future rather than shared memory.
  • Thread-specific queues and “work stealing” design pattern.
  • Lock-free algorithms with atomics not mutexes (very tricky to get right).
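
Here’s a minimal sketch of the double lock check method from the list above, applied to lazy initialization of a hypothetical Config object: the fast path checks an atomic pointer without taking the lock at all, and the mutex is only acquired (and the check repeated) when the object has not been created yet.

    #include <atomic>
    #include <mutex>

    struct Config { int port = 8080; };   // hypothetical lazily-built shared object

    std::atomic<Config*> g_config{nullptr};
    std::mutex g_config_mutex;

    Config* get_config()
    {
        // First check: lock-free fast path once the object already exists.
        Config* cfg = g_config.load(std::memory_order_acquire);
        if (cfg == nullptr) {
            // Second check: take the lock and re-check before constructing.
            std::lock_guard<std::mutex> guard(g_config_mutex);
            cfg = g_config.load(std::memory_order_relaxed);
            if (cfg == nullptr) {
                cfg = new Config();
                g_config.store(cfg, std::memory_order_release);
            }
        }
        return cfg;
    }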

Avoid Lock Guard Delayed Unlocking

The std::lock_guard class is a wonderfully safe way to use mutexes, because it helps us avoid deadlocks and severe thread starvation if we forget to unlock our mutex (as if!). Unfortunately, it’s so easy to use that it can cause programmers to stop thinking about when the lock actually gets released.

The problem is that we can accidentally hold the lock for too long, which increases lock contention. Here’s an example of the concept:

    #include <mutex>

    std::mutex g_my_mutex;

    void process_critical_data()
    {
        // Step 1. Lock
        std::lock_guard<std::mutex> mylockguard(g_my_mutex);
        // Step 2. Get the data...
        // Step 3. Process the data ...
    }

The problem is that we haven’t really thought too much about where we should unlock. The above code doesn’t release the mutex until after we’ve finished processing the data at Step 3, when the function returns, which is needlessly long.

One way to fix this would be to use some other more flexible locking wrappers that allow explicit control of the unlocking. Your basic choices are:

  • std::lock_guard — can only unlock in its destructor (inflexible).
  • std::unique_lock — allows an explicit unlock call (more flexible).
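
For comparison, here’s a sketch of the same idea using std::unique_lock with an explicit early unlock (the function is renamed here to avoid confusion with the std::lock_guard versions):

    #include <mutex>

    std::mutex g_my_mutex;   // the same global mutex as in the earlier example

    void process_critical_data_v2()
    {
        // Step 1. Lock
        std::unique_lock<std::mutex> mylock(g_my_mutex);
        // Step 2. Get the data...
        mylock.unlock();   // Explicit early release (std::lock_guard cannot do this)
        // Step 3. Process the data ...
    }

The trade-off is that std::unique_lock carries a little extra state (an “owns the lock” flag), so a tightly-scoped std::lock_guard is often the leaner choice.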

A simpler solution is to explicitly control the scoping that sets when the destructor of std::lock_guard triggers the release of the lock. Here’s a better version:

    void process_critical_data()
    {
        {
            // Step 1. Lock
            std::lock_guard<std::mutex> mylockguard(g_my_mutex);
            // Step 2. Get the data...
        }
        // Step 3. Process the data ...
    }

This has added an extra pair of { } braces around the first two steps. This triggers the scoping mechanism, so that the std::lock_guard destructor is called and the mutex is unlocked immediately after Step 2, at the inner right brace. Then Step 3 can process the data to its heart’s content without blocking any other threads.

Fine-Grain vs Coarse-Grain Locking

Locking granularity has two basic strategies: go small or go big. Here’s a summary:

  • Coarse-grain — lock an entire data structure while updating it.
  • Fine-grain — lock only in the exact critical code sequence that updates the data structure, deep in its internals.

The characteristics of these strategies can be summarized:

  • Coarse-grain — longer duration, fewer locks overall.
  • Fine-grain — shorter duration, more locks.

Fine-grain locking improves performance for data that is used often. By limiting the granularity of locking, each thread holds the lock for only a short period while performing a low-level update, so many threads can have the lock in turn.

However, fine-grain locking means frequent locking and unlocking, which involves some overhead. It also increases the overall complexity of the concurrency algorithms by needing multiple locks for small pieces of data, thereby creating a greater risk of mistakes, such as an incorrect acquisition order for multiple locks causing a deadlock.

Coarse-grain locking can reduce performance because it locks data for a longer period of time while a broader update to a higher-level data structure is performed. The chance of lock contention over that long duration is higher than with fine-grain locking: any thread seeking a frequently-requested lock is less likely to find a window to acquire it, so coarse-grain locking is best for rarely-used data.

The advantage of fewer, higher-level locks is simplicity. There is not only a lower risk of deadlock errors, but also fewer chances to go wrong in ensuring that concurrency rules are followed and access to the shared data is properly controlled. For example, updating a large data structure under a single lock means that concurrency errors cannot occur at a lower level. Thus, it’s easier for the thread to maintain a coherent state of the data structure, because there won’t be any interleaved changes from other threads.

A hybrid locking strategy is a trade-off: fine-grain locks for frequently-accessed critical sections, and coarse-grain locking for less popular data. This can be a pragmatic solution that balances speed with lower development complexity and risk mitigation.
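
As a sketch of the difference, consider a hypothetical shared counter map: the coarse-grain version uses one mutex for the whole map, while the fine-grain version shards the map into buckets, each with its own mutex, so threads that hit different buckets don’t contend at all.

    #include <array>
    #include <functional>
    #include <mutex>
    #include <string>
    #include <unordered_map>

    // Coarse-grain: one mutex guards the entire map.
    class CoarseCounter {
        std::mutex m_;
        std::unordered_map<std::string, long> counts_;
    public:
        void increment(const std::string& key) {
            std::lock_guard<std::mutex> guard(m_);   // blocks updates to all other keys too
            ++counts_[key];
        }
    };

    // Fine-grain: the map is split into buckets, each with its own mutex.
    class FineCounter {
        static constexpr size_t kBuckets = 16;
        struct Bucket {
            std::mutex m;
            std::unordered_map<std::string, long> counts;
        };
        std::array<Bucket, kBuckets> buckets_;
        size_t index(const std::string& key) const {
            return std::hash<std::string>{}(key) % kBuckets;
        }
    public:
        void increment(const std::string& key) {
            Bucket& b = buckets_[index(key)];
            std::lock_guard<std::mutex> guard(b.m);  // blocks only this one bucket
            ++b.counts[key];
        }
    };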

Lock-Free Algorithms

Lock-free programming is a method of optimizing multithreaded code to avoid locks (i.e., mutexes). The speed advantages arise from avoiding:

  • The overhead of mutex operations.
  • Lost performance from threads blocked awaiting a resource.

The main disadvantage of lock-free programming:

  • Your brain will explode.

The internet is littered with articles about failed attempts to write lock-free algorithms, even by some of the best programmers. There are many ways to go wrong in the quest to get rid of mutexes.

Note that “lock-free” programming does not mean that you just search up “mutex” in vi, and then hit the “dd” button. No, lock-free programming is not just sequential programming. Instead, the idea is to switch to a faster concurrency method than mutexes, so this is the main idea:

  • std::mutex — lock-based programming.
  • std::atomic — lock-free programming.

The overall idea is to use an “atomic” operation instead of a mutex. To make this work, it’s usually a quite complex atomic operation, such as a “Compare-And-Swap” (CAS) operation.

This is how a CAS operation works, with a number of steps all done atomically in one unbreakable sequence:

  • Access a variable (that you want to set atomically).
  • Compare it to the “old” or “expected” value.
  • If it’s equal to the old value, then successfully update to the new value (and done).
  • If it’s not equal to the old value, someone else has already updated it, so we fail (and then loop around and retry).

What a mouthful! Fortunately, C++ has the std::atomic class (since C++11) to take care of all that. The main routines to use for a CAS instruction are:

    std::atomic::compare_exchange_weak
    std::atomic::compare_exchange_strong

Note that you will also need to know about “memory orders” around atomic primitives, as controlled via the std::memory_order library.
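
Here’s a minimal sketch of a CAS retry loop: a lock-free push onto a singly-linked stack using compare_exchange_weak (the Node type and g_head are hypothetical). Note that this is the easy half; a matching lock-free pop is much harder to get right (the infamous “ABA problem”), which is why lock-free algorithms are usually best left to a well-tested library.

    #include <atomic>

    struct Node {
        int value;
        Node* next;
    };

    std::atomic<Node*> g_head{nullptr};   // top of the lock-free stack

    void push(int value)
    {
        Node* node = new Node{value, nullptr};
        Node* old_head = g_head.load(std::memory_order_relaxed);
        // Classic CAS retry loop: keep trying until no other thread has
        // changed g_head between our read and our update.
        do {
            node->next = old_head;   // old_head is refreshed on each failed CAS
        } while (!g_head.compare_exchange_weak(
                     old_head, node,
                     std::memory_order_release,     // success: publish the new node
                     std::memory_order_relaxed));   // failure: just retry
    }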

There are also a variety of non-standard ways to achieve lock-free programming, using primitives from older codebases or platform-specific APIs. Some of these primitives are:

  • InterlockedCompareExchange — Win32 version in <winnt.h>.
  • OSAtomicCompareAndSwapInt — iOS/Mac variant in <OSAtomic.h>.
  • __atomic_compare_exchange — GCC compiler built-in (the __atomic builtins).

Note that the std::atomic class is not actually guaranteed to use a lock-free atomic operation on every platform. It’s a good idea to test your platform using the “is_lock_free” member function as part of your initialization or self-testing code:

    std::atomic<int> x;
    assert(x.is_lock_free());  // fails if std::atomic<int> falls back to a lock

Thread Pools

Thread pools are a design pattern in C++ multithreading that avoids the cost of creating and destroying threads by using long-running threads. Instead of incurring this thread overhead, a “pool” of threads is created in advance, and these threads sit idle until work is available to be done. The main characteristics are:

  • Idle threads wait for work (e.g., off a task queue).
  • Threads are not destroyed after completing a chunk of work.

Thread pools are mostly used in a “producer-consumer” design pattern, although they can also be used in other ways. There are effectively two thread pools in this design pattern:

  • Producer thread pool
  • Consumer thread pool

Typically, one or more producer threads adds work items to a queue, such as when it receives new data from a network source (e.g., exchange connection in HFT). Another group of consumer threads is idle waiting to pull work off the queue. Consumers do the work, return the results, and then add themselves back to the group of idle consumer threads awaiting more work.
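
Here’s a minimal consumer-side thread pool sketch (the class and member names are made up for illustration): producer threads call submit to push tasks onto a shared queue, and idle worker threads sleep on a condition variable until work arrives.

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    class ThreadPool {
    public:
        explicit ThreadPool(size_t num_threads) {
            for (size_t i = 0; i < num_threads; ++i)
                workers_.emplace_back([this] { worker_loop(); });
        }
        ~ThreadPool() {
            {
                std::lock_guard<std::mutex> guard(mutex_);
                stopping_ = true;
            }
            cv_.notify_all();
            for (auto& t : workers_) t.join();
        }
        // Producer side: add a work item and wake one idle worker.
        void submit(std::function<void()> task) {
            {
                std::lock_guard<std::mutex> guard(mutex_);
                tasks_.push(std::move(task));
            }
            cv_.notify_one();
        }
    private:
        // Consumer side: each worker loops, waiting for tasks until shutdown.
        void worker_loop() {
            for (;;) {
                std::function<void()> task;
                {
                    std::unique_lock<std::mutex> lock(mutex_);
                    cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                    if (stopping_ && tasks_.empty())
                        return;
                    task = std::move(tasks_.front());
                    tasks_.pop();
                }
                task();   // run the task with the queue lock already released
            }
        }
        std::vector<std::thread> workers_;
        std::queue<std::function<void()>> tasks_;
        std::mutex mutex_;
        std::condition_variable cv_;
        bool stopping_ = false;
    };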

References

  1. Paul Bilokon, Burak Gunduz, 8 Sep 2023, C++ Design Patterns for Low-latency Applications Including High-frequency Trading, https://arxiv.org/abs/2309.04259, Code: https://github.com/0burak/imperial_hft
  2. Jeff Preshing, Jun 12, 2012, An Introduction to Lock-Free Programming, https://preshing.com/20120612/an-introduction-to-lock-free-programming/
  3. Deb Haldar, August 17, 2017, Top 20 C++ multithreading mistakes and how to avoid them, https://acodersjourney.com/top-20-cplusplus-multithreading-mistakes/
  4. Apple, March 2025 (accessed), Mac OSX documentation, https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man3/OSAtomicAdd32.3.html
  5. Wikipedia, March 2025 (accessed), Non-blocking algorithm, https://en.wikipedia.org/wiki/Non-blocking_algorithm
  6. Herb Sutter, September 08, 2008, Lock-Free Code: A False Sense of Security, Dr Dobbs Magazine (archived), https://web.archive.org/web/20150901211737/http://www.drdobbs.com/article/print?articleId=210600279&siteSectionName=cpp
  7. Microsoft, 24 May, 2022, InterlockedCompareExchange function (winnt.h), https://learn.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-interlockedcompareexchange
  8. GNU Foundation, March 2025 (accessed), 6.26 Built-in Functions for Memory Model Aware Atomic Operations, https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
  9. CPP Reference, March 2025 (accessed), std::atomic::compare_exchange_weak, std::atomic::compare_exchange_strong, https://en.cppreference.com/w/cpp/atomic/atomic/compare_exchange
  10. CPP Reference, March 2025 (accessed), std::memory_order, https://en.cppreference.com/w/cpp/atomic/memory_order
  11. Sarah Butcher & Alex McMurray, 2 January 2025, The C++ techniques you need for $600k hedge fund jobs, https://www.efinancialcareers.com/news/low-latency-c
  12. Alex McMurray, 12 February 2024, The expert C++ programming technique you need to know for a HFT interview, https://www.efinancialcareers.com/news/the-expert-c-programming-technique-you-will-need-to-know-for-a-hft-interview
  13. Paul J. Lucas Jul 13, 2023, Advanced Thread Safety in C++, https://dev.to/pauljlucas/advanced-thread-safety-in-c-3ap5
  14. Larry Jones, 27 Feb 2025, Mastering Concurrency and Multithreading in C++: Unlock the Secrets of Expert-Level Skills, https://www.amazon.com.au/Mastering-Concurrency-Multithreading-Secrets-Expert-Level-ebook/dp/B0DYSB519C/
  15. Dung Le, Aug 13, 2020, Optimizations for C++ multi-threaded programming, https://medium.com/distributed-knowledge/optimizations-for-c-multi-threaded-programs-33284dee5e9c
  16. Karthikeyan Akkapalli, Aug 25, 2024, Multithreading in C++: Concepts, Challenges, Advanced Techniques, and Applications, https://medium.com/@karthikeyan_akkapalli/multithreading-in-c-concepts-challenges-advanced-techniques-and-applications-b97cbdcf31c7

 
