Chapter 33. Fine-Grained vs Coarse Locking
Book Excerpt from "C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations"
by David Spuler, Ph.D.
What is Coarse Locking?
Coarse-grained locking is a simple method of achieving synchronization with relatively few lock objects and few calls to locking primitives. The locks are “coarse” because they control large chunks, such as the entire body of a member function, or access to an entire data structure. Some examples of using coarse locking include:
- Long sequences of code with a lock request at the start and lock release at the end.
- Single per-class mutex for your entire data structure.
- One global mutex for all the critical sections of code.
The effect of coarse-grained locking is to limit all accesses to the code block or data structure to one thread at a time (i.e., full serialization). By comparison, fine-grained locking has multiple locks and more frequent locking and unlocking requests, but over shorter blocks of code, or controlling access to smaller portions of a data structure. The fine-grained locking approach performs better, but requires a lot more effort to code correctly.
Coarse-grained locking can be added to your code relatively quickly. Hence, the advantages of coarse locking include:
- Simplicity
- Thread-safety (it does work)
- Low lock overhead in some cases (fewer total calls to locking primitives)
- Lower risk of concurrency bugs (easy to implement)
The downsides of coarse locking are mainly about performance:
- Blocking other threads for longer (poor synchronization)
- Locking overhead required for read-only accesses
- Serialization of multiple concurrent readers (low parallelism)
- Increased lock contention (for the single mutex)
Now, let’s have a look at how you’d code up coarse locking.
Adding Coarse Locking
A common requirement for locking is to create your own thread-safe containers, like stacks and queues, since the standard C++ containers are not actually thread-safe. If you have a class where you want its main data structure to be thread-safe, there’s a surprisingly simple way to add coarse locking. The steps are:
- Add a mutex as a data member.
- Add a mutex lock and unlock call to every member function.
Here’s what the mutex data member to control synchronization in every object looks like:
#include <mutex>
#include <vector>
class MyVector {
private:
std::mutex mtx_;
std::vector<int> vec_;
// ...
public:
int get_count() { return vec_.size(); }
// ...
};
Here’s a sum() class member function, but without any thread synchronization:
int sum() {
int isum = 0;
for (int i = 0; i < vec_.size(); i++) {
isum += vec_[i];
}
return isum;
}
Here’s the code to add locking to this function, with a mutex lock call at the top and an unlock call at the end:
int sum() {
mtx_.lock(); // Acquire lock
int isum = 0;
for (int i = 0; i < vec_.size(); i++) {
isum += vec_[i];
}
mtx_.unlock(); // Release lock
return isum;
}
Actually, this method of directly using std::mutex is not ideal: it isn’t exception-safe (if an exception is thrown while the lock is held, the mutex is never released), and you have to make temporary copies of internal data, even in simple getters:
int get_count() {
mtx_.lock(); // Acquire lock
int iret = vec_.size();
mtx_.unlock(); // Release lock
return iret;
}
An even simpler approach is to use the special wrapper class, std::lock_guard, which means you
only add one lock guard declaration statement to the top of every member function.
std::lock_guard<std::mutex> lock(mtx_);
The mutex is automatically unlocked at the end of the function, or whenever it returns, by the destructor of the lock guard wrapper object. This fixes the problems above with simple getters: no temporary variable is needed, because the unlocking occurs automatically after the return expression has been evaluated (i.e., effectively at the closing right brace of the member function, which is where the destructor runs), and the lock is also released if an exception propagates out of the function. The downside is that you now have two objects:
- Mutex object (data member)
- Lock guard object (function-local scope)
Here’s how it looks in the code:
int get_count() {
std::lock_guard<std::mutex> lock(mtx_); // Acquire lock
return vec_.size();
} // Release lock here!
Here’s how the sum() member function looks:
int sum() {
std::lock_guard<std::mutex> lock(mtx_); // Acquire lock
int isum = 0;
for (int i = 0; i < vec_.size(); i++) {
isum += vec_[i];
}
return isum;
} // Implicit lock release here
Note that we can control the locking and release of a lock guard object by enclosing it in a narrower scope block. Here’s the use of a dummy pair of braces to control the scope for a marginal efficiency gain:
int sum() {
int isum = 0;
{
std::lock_guard<std::mutex> lock(mtx_); // Acquire lock
for (int i = 0; i < vec_.size(); i++) {
isum += vec_[i];
}
} // Implicit lock release here
return isum;
}
As you can see, adding coarse locking to your whole class can be as simple as adding a single statement at the top of every member function. The main downsides include:
- Forgetting one of the member functions (concurrency bug).
- Performance issues from holding the lock too long.
- Read-only access to your object requires locking calls.
Note that std::lock_guard is not the only standard locking helper to consider. Other examples in the standard library include:
- std::unique_lock — movable lock wrapper with an explicit unlock() method.
- std::lock — function that locks multiple mutexes without deadlocking.
- std::try_lock — function that tries to lock one or more mutexes without blocking.
- std::scoped_lock — lock guard that can own several mutexes at once.
- std::shared_lock — shared (reader) lock over a std::shared_mutex.
Notably, there’s the std::unique_lock wrapper, which has the advantage that it has an explicit unlock() method.
This means you can more easily release the lock early if it’s no longer needed,
which reduces lock contention and delays in other threads.
The unique lock wrapper still has the destructor as a backup
to release the lock if the mutex hasn’t already been unlocked before the end.
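For example, here’s a sketch of releasing a lock early with std::unique_lock, using a variant of the earlier MyVector class (the class and function names here are purely illustrative):
#include <mutex>
#include <vector>
class MyVectorEarlyRelease {
private:
    std::mutex mtx_;
    std::vector<int> vec_;
public:
    int sum_then_log() {
        std::unique_lock<std::mutex> lock(mtx_); // Acquire lock
        int isum = 0;
        for (int x : vec_) {
            isum += x;
        }
        lock.unlock(); // Release early: the shared data is no longer needed
        // ... slower post-processing (e.g., logging) can run here without blocking other threads
        return isum; // The unique_lock destructor does nothing, since we already unlocked
    }
};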
Disadvantages of Coarse Locking
This approach of using a coarse locking mechanism in literally every member function looks really inefficient, and it is! There are significant performance problems:
- Basic getter member functions become needlessly inefficient.
- Other const member functions need to lock just for read-only access.
Do we really need to lock the basic getters? For example, if a getter is just returning the current count of objects, does it need a lock? Probably not!
I mean, it returns an integer for the count, which is close to an atomic operation, and almost certainly atomic on many CPUs. And if another thread is modifying the data structure, changing the count, this is the caller’s synchronization problem. It’s really a “higher-level” multithreading problem than the issue here, where we’re only trying to keep the data structure itself consistent across multiple calls to its member functions.
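If you do want to drop the mutex from a simple getter while keeping the read well-defined, one sketch (not from the original class: the count_ member and push_back() wrapper are assumptions for illustration) is to mirror the size in a std::atomic that the getter can read without any lock:
#include <atomic>
#include <cstddef>
#include <mutex>
#include <vector>
class MyVectorAtomicCount {
private:
    std::mutex mtx_;
    std::vector<int> vec_;
    std::atomic<std::size_t> count_{0}; // mirrors vec_.size()
public:
    void push_back(int x) {
        std::lock_guard<std::mutex> lock(mtx_); // writers still lock
        vec_.push_back(x);
        count_.store(vec_.size(), std::memory_order_relaxed); // keep the mirror updated
    }
    std::size_t get_count() const {
        return count_.load(std::memory_order_relaxed); // no mutex needed in the getter
    }
};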
However, not all const member functions can avoid needing a lock.
For example, a sum() function above that scans over all of the elements of the vector
still needs synchronized access, to avoid some other thread modifying an element
in the middle of a scan.
Overall, this coarse-grained locking approach works in terms of thread-safety, but is not ideal in terms of performance. Although we can probably avoid some of the locks in the simple getter functions, several speed problems remain with this approach:
- Locking overhead even if the caller is only ever reading.
- Multiple concurrent readers are needlessly serialized.
- Not efficient for multiple readers and a single writer.
The cost overhead from coarse locking can be quite significant.
Coarse Locking Overhead
Let’s see how much it costs to add coarse-grained locking via lock guards. I chose a basic standard queue container, with just integers. As a control, I declared a basic queue wrapper class without any synchronization.
#include <queue>
template<typename T>
class QueueWrapNoSync {
private:
std::queue<T> m_q;
public:
int count() const { return m_q.size(); }
T front() const { return m_q.front(); }
void pop() { m_q.pop(); }
void push(T t) { m_q.push(t); }
};
Next, I created another class with a mutex in the objects, and lock guard statements added to every member function.
template<typename T>
class QueueWrapLockGuard {
private:
std::queue<T> m_q;
mutable std::mutex m_mutex; // mutable, so const member functions can lock it
public:
int count() const {
std::lock_guard<std::mutex> lock(m_mutex);
return m_q.size();
}
T front() const {
std::lock_guard<std::mutex> lock(m_mutex);
return m_q.front();
}
void pop() {
std::lock_guard<std::mutex> lock(m_mutex);
m_q.pop();
}
void push(T t) {
std::lock_guard<std::mutex> lock(m_mutex);
m_q.push(t);
}
};
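The benchmark driver itself isn’t shown in this excerpt, but a rough sketch of the kind of two-thread harness behind the “1 Same Queue, 2 Threads” numbers below would look something like this (the helper name and iteration count are illustrative assumptions, not the actual test code):
#include <chrono>
#include <thread>
template<typename Q>
long long time_two_threads(Q& q, int iterations) {
    auto worker = [&q, iterations]() {
        for (int i = 0; i < iterations; i++) {
            q.push(i); // each thread pushes before it pops,
            q.pop();   // so the shared queue never underflows
        }
    };
    auto start = std::chrono::steady_clock::now();
    std::thread t1(worker);
    std::thread t2(worker);
    t1.join();
    t2.join();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
}
// Usage (hypothetical iteration count):
//   QueueWrapLockGuard<int> q;
//   long long usec = time_two_threads(q, 1000000);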
Here are the results:
2 Different Queues, Sequential (No synchronization): 64839 microseconds
1 Queue, 1 Thread (No synchronization): 35528 microseconds
2 Different Queues, 2 Threads (No synchronization): 38123 microseconds
1 Same Queue, 2 Threads (No synchronization, Buggy!): 38097 microseconds
1 Same Queue, 2 Threads (Lock Guard): 56024 microseconds
This shows that two threads running with lock guard synchronization adds about 47% overhead versus without synchronization, although admittedly it removes the bugs. So, it is serializing approximately half of the second thread’s execution, which is presumably the amount of time it is blocked waiting for a lock.
So, here we have coarse-grained locking significantly increasing the time cost, because the lock guards effectively serialized most of the interface to our queue. We can partially solve this by changing the getter to not use locks, because it’s probably atomic, and tweaking the lock guard to release the lock slightly early.
The full solution to the cost overhead: fine-grained locking!
Fine-Grained Locking
Fine-grained locking is using locks over shorter blocks of code or smaller parts of a data structure. The general ideas are:
- More locks
- Smaller portions of data locked
- Shorter duration lock holds
Some of the goals of finer granularity locking include:
- Lock contention improved
- Reducing thread blocking delays (less waiting for a lock)
- Allowing multiple concurrent readers (shared read lock)
In the above example, it’s difficult to insert granular locks into the queue data structure, because std::queue is a standard library container whose internals we cannot easily modify. However, we can certainly apply fine-grained locking approaches to our own hand-coded containers. Some of the methods to get finer granularity of locking include:
- Lock durations — acquire locks late, release locks early.
- Granular locks — multiple locks on parts of data structures.
- Read-write locking — shared reader versus unique writer locks.
- Lock-free programming — using atomics instead of mutexes.
Some of these approaches are now discussed, and some are also covered in other chapters.
Granular Data Structure Locking
Whereas coarse-grained locking has one mutex for the entire data structure or container object, fine-grained locking uses many more mutexes or locks. The first point about implementing these strategies is to get used to the idea that mutexes and locks are just objects. Hence, we can use:
- Arrays of mutexes and locks
- Mutex or lock object data members
This means we can put mutexes or other locking objects inside our own objects, as parts of containers, or wherever else we need them. In turn, this allows much more granular approaches that achieve the benefits of fine-grained locking. The idea is to use many locks, such as:
- Locks for each individual node in a container.
- Locks for sub-parts of the data structure.
Locking each node in a data structure is as fine-grained as it is possible to go in a container. The idea is that a writer thread can modify other elements in the container, as long as it’s not changing the one that you’re using. This can be effective at avoiding lock contention, as other threads would rarely be blocked, but does increase lock overhead for every object.
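As a minimal sketch of the per-node idea (illustrative only: a fully thread-safe list also needs care around structural changes such as insertion and deletion, e.g. hand-over-hand locking):
#include <mutex>
struct ListNode {
    int value = 0;
    ListNode* next = nullptr;
    std::mutex node_mutex; // one mutex per node
};
// Update one node's payload while holding only that node's lock;
// threads working on other nodes are not blocked at all.
void update_value(ListNode* node, int new_value) {
    std::lock_guard<std::mutex> lock(node->node_mutex);
    node->value = new_value;
}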
A less fine-grained approach is to use fewer locks, but still maintain multiple locks per container. Some examples of using fewer locks than per-node, but still having many locks for portions of a data structure include:
- Linked list sub-lists with locking from its sub-head node.
- Binary tree with locking used on a subtree.
- Hash table with locks for each bucket chain.
The idea with these methods is to avoid blocking other threads on every access to the entire container. For example, if your hash table has an array of mutexes, one per bucket, then readers and writers only contend for elements that map to the same hash bucket, which is a relatively rare event, so lock contention is greatly reduced.
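Here’s a minimal sketch of the per-bucket idea for a simple separate-chaining hash set of integers (the class and member names are illustrative, not from any standard container):
#include <cstddef>
#include <list>
#include <mutex>
class GranularHashSet {
private:
    static constexpr std::size_t NBUCKETS = 1024;
    std::list<int> buckets_[NBUCKETS];  // separate chaining
    std::mutex bucket_mutex_[NBUCKETS]; // one mutex per bucket chain
    static std::size_t bucket_of(int key) {
        return static_cast<std::size_t>(key) % NBUCKETS;
    }
public:
    void insert(int key) {
        std::size_t b = bucket_of(key);
        std::lock_guard<std::mutex> lock(bucket_mutex_[b]); // lock only this chain
        buckets_[b].push_back(key);
    }
    bool contains(int key) {
        std::size_t b = bucket_of(key);
        std::lock_guard<std::mutex> lock(bucket_mutex_[b]); // other buckets stay unblocked
        for (int x : buckets_[b]) {
            if (x == key) return true;
        }
        return false;
    }
};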
Lock Striping
Another variant of this approach is called “lock striping,” and is a trade-off between the number of mutex objects and lock contention. The idea is to map all our data to a smaller number of mutexes. For example, in a hash table with a thousand buckets, rather than also using a thousand mutexes, we can use many fewer, and map the buckets to mutexes. The idea is like this in our container template:
T hashtable_[NBUCKETS]; // bucket array
std::mutex lockarr_[NLOCKS]; // far fewer locks than buckets
// ...
size_t bucket = hash_function(key);
size_t lockoffset = bucket % NLOCKS; // map this bucket onto one of the locks
Here, we could have a hash table with 1,000 buckets, but only 10 locks. This is still a tenfold reduction in lock contention compared to the coarse locking approach of one lock per data structure.
Lock striping reduces the number of mutexes required, but will slightly increase lock contention compared to the granular approach of having one mutex per bucket. I’m not sure that I recommend lock striping in this example, because the advantage of using fewer mutexes for our hash table is mainly space reduction rather than speed, and don’t we have plenty of that? On the other hand, there is extra cost per lock in terms of initialization (mutex constructors) and shutdown (mutex destructors), so lock striping can reduce this cost compared to fully fine-grained locking.
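For illustration, here’s a small sketch of the striped mapping packaged as a reusable helper (the class name StripedLocks and the default of 16 stripes are arbitrary choices):
#include <cstddef>
#include <mutex>
template<std::size_t NLOCKS = 16>
class StripedLocks {
private:
    std::mutex locks_[NLOCKS];
public:
    // Map any bucket index (or key hash) onto one of the NLOCKS stripes.
    std::mutex& lock_for(std::size_t bucket) {
        return locks_[bucket % NLOCKS];
    }
};
// Usage: protect a 1,000-bucket table with only 16 mutexes.
//   StripedLocks<16> stripes;
//   std::lock_guard<std::mutex> guard(stripes.lock_for(bucket));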
Lock Segmenting
Lock segmenting is another middle-of-the-road approach, with similar ideas to lock striping, in the sense that it uses fewer locks than data points. The idea is to have one lock per “segment” of the data structure being used. This has particular applicability to linear data structures, such as vectors and arrays, in areas such as linear algebra and AI engines.
If we have a vector of data, we often want algorithms to operate on “segments” of that data, so as to maintain cache locality advantages in a CPU architecture. Note that a GPU architecture prefers a striped approach, but that’s in CUDA C++ with a totally different type of on-GPU threading model, not in C++ multithreading.
Doing data processing fast with cache-aware multithreading means that each thread operates on a segment of contiguous data, and we have a controller thread that’s scheduling different threads to work on segments of the vector. Here’s the idea of a hand-coded vector container that’s segmenting the locks according to the data:
// Template parameters: NARRAY = data size, NLOCKS = chosen lock granularity (number of segments)
float arrdata_[NARRAY];
const size_t SEGMENT_SIZE = NARRAY / NLOCKS; // elements per segment
std::mutex lockarr_[NLOCKS]; // one mutex per segment
static_assert(NARRAY % NLOCKS == 0, "segments must divide evenly"); // avoid a partial segment
size_t map_offset_to_lock(size_t offset) {
    assert(offset < NARRAY);
    size_t lock_offset = offset / SEGMENT_SIZE; // which segment holds this element?
    assert(lock_offset < NLOCKS);
    return lock_offset;
}
Hence, any code that’s working on a segment of the array does this to figure out which mutex to acquire:
void process_segment(size_t offset) {
size_t lock_offset = map_offset_to_lock(offset);
lockarr_[lock_offset].lock();
// ... etc
lockarr_[lock_offset].unlock();
}
The idea is similar to lock striping, but this uses division rather than modulus. Nearby offsets in lock segmenting will usually get the same lock, as they’re in the same segment of contiguous data, whereas adjacent elements would get different mutexes in the lock striping approach. The advantage of lock segmenting over lock striping is that it allows contiguous data processing, whether reading or writing, and therefore has cache locality efficiency.
Higher-Level Concurrency Problems
Note that higher-level concurrency issues can occur with these approaches, such as lock striping or lock segmenting. The problems arise if you have whole-of-data algorithms, which limit the value of these middle-level locking ideas. For example, consider if you have two high-level methods that work on the entire vector of data:
- Sum vector — calculate the sum of the whole vector of data, or some other linear algebra metric like a dot product (reader).
- Scale vector — multiply the entire array by a factor (writer).
If both of these methods acquire locks one segment at a time (or one stripe at a time), then the segment-level operations are going to get interleaved. For example, some of the segments might get modified by the scaling method before being summed, and others after, so the returned sum mixes scaled and unscaled elements and is simply wrong.
There’s nothing wrong with the concurrency at the segment level, but the application-level logic is broken. The concurrency solutions at a higher-level are not pretty:
- Vector-level lock to serialize all whole-of-vector algorithms (a read-write lock), or
- Acquire all of the many locks for each segment or stripe (ugh!).
It’s not quite that bad, since we’d use read-write locking, so that multiple reader algorithms could still run concurrently on the entire array. However, writer algorithms would get totally serialized on this approach, blocking all other readers and writers, which is not optimal.
More efficient would be to have a more complex scheduling algorithm, so that the “scale vector” method runs in a pipelined fashion, processing one segment behind the “sum vector” method, but that’s tricky to do if each such segment is running in a different thread. However, if you don’t do this, they’re potentially going to interfere with each other at the higher-level, creating actual bugs in the application logic, despite being correctly synchronized at the segment level.
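For completeness, here’s a sketch of the second (admittedly ugly) option above: a whole-of-vector writer that acquires every segment lock before touching the data, using the names from the lock segmenting example. The locks are acquired in ascending index order, so two such whole-of-vector operations cannot deadlock each other, and the per-segment workers only ever hold one lock at a time:
void scale_whole_vector(float factor) {
    for (size_t i = 0; i < NLOCKS; i++) {
        lockarr_[i].lock(); // acquire every segment lock, always in ascending order
    }
    for (size_t i = 0; i < NARRAY; i++) {
        arrdata_[i] *= factor; // safe: no reader or writer can touch any segment now
    }
    for (size_t i = 0; i < NLOCKS; i++) {
        lockarr_[i].unlock(); // release them all once the whole-vector update is done
    }
}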
Read-Write Locking
An important improvement to lock contention is to allow multiple readers to access concurrently, but any “writer” must have unique access. This idea can be used for both coarse and fine-grained locking, and can be combined with moderate approaches like lock striping or lock segmenting. The goals of the read-write lock approach are:
- Multiple readers at the same time (but not any writers).
- Every writer needs exclusive control (no other readers or writers).
This is efficient in situations where there are lots of readers processing the data, and fewer writers. However, it can be less successful where readers and writers are accessing the data structure with approximately the same frequency, such as passing work on a queue in the producer-consumer model. Actually, in that model, both the producers and consumers are writers (not just readers), as they each push or pop the queue. You can make consumers into readers by using a delayed-pop idea, but eventually someone has to clean up the mess.
The standard C++ library has builtin support for achieving read-write locking.
The way to achieve this is with a “std::shared_mutex” instead of a basic mutex.
The changes to our code can be summarized:
- Use a std::shared_mutex in the data members (from the <shared_mutex> header file).
- Readers request a std::shared_lock over the shared mutex (multiple concurrent readers).
- Writers request a std::unique_lock over the shared mutex (exclusive access).
Here’s the modified code in full:
#include <shared_mutex>
template<typename T>
class QueueWrapReadWrite {
private:
std::queue<T> m_q;
mutable std::shared_mutex m_mutex; // Read-write mutex; mutable for const members
public:
int count() const {
std::shared_lock<std::shared_mutex> lock(m_mutex); // Reader
return m_q.size();
}
T front() const {
std::shared_lock<std::shared_mutex> lock(m_mutex); // Reader
return m_q.front();
}
void pop() {
std::unique_lock<std::shared_mutex> lock(m_mutex); // Writer
m_q.pop();
}
void push(T t) {
std::unique_lock<std::shared_mutex> lock(m_mutex); // Writer
m_q.push(t);
}
};
Here is the comparison of speed against the basic lock guard version with non-concurrent readers:
1 Queue, 2 Threads (Lock Guard): 55214 microseconds
1 Queue, 2 Threads (Read-Write): 51687 microseconds
There was about a 6.4% improvement by adding shared reader locking.
This makes sense, because most of the testing involves write operations of push() and pop(),
but there is a small gain in concurrency from the read-only count() and front() operations.
But overall, it’s still quite a lot of overhead, when you recall the lock guard version
was 47% extra CPU time compared to without synchronization.
Maybe we should try a lock-free version.