Aussie AI

Chapter 30. Common Multithreading Bugs & Slugs

  • Book Excerpt from "C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations"
  • by David Spuler, Ph.D.

Multithreading Bugs Overview

Modern C++ is hard enough, and multithreading adds another layer of complexity. You’re not alone, and bugs abound in multithreaded C++ code!

Some of the beginner bugs and simple misunderstandings include:

  • Linux linking problem with the “pthreads” library (needs “-pthread” linker option).
  • main() does not wait for other threads and needs to call join().
  • Calling join() inside the new thread causes a deadlock.
  • Crashing on join() because the thread is no longer “joinable” (test via the joinable() method).

Here are some simple mistakes you can make when trying to convert your application to multithreading:

  • Not using any synchronization for your threads (Yikes!).
  • Not locking in all the places.
  • Forgetting locking for cout and cerr output.
  • Not unlocking on all paths.
  • Double-locking a mutex.
  • Double-unlocking a mutex.

Once you get into running multiple threads, here are some gotchas in terms of assumptions and misunderstandings:

  • Assuming that the standard C++ containers are always thread-safe.
  • Assuming that int or pointer operations are atomic without using std::atomic.
  • The volatile specifier is not a synchronization method.

Let’s examine some of these simpler multithreading mistakes.

Main Thread Exits Early

Here’s a simple “Hello World” program using standard threading. It looks totally fine, right?

    #include <iostream>
    #include <thread>

    void thread_function()
    {
        std::cout << "Hello world!" << std::endl;
    }

    int main()
    {
        std::thread t1(thread_function);
        return 0;
    }

Can you see the bug? The program typically won’t print anything, and in fact it usually aborts.

Why? Because there’s nothing stopping the main() function, which just keeps going and exits immediately. It doesn’t wait for the other thread to even start, let alone finish, but is indifferent to its plight. Worse, destroying a still-joinable std::thread (as happens when t1 goes out of scope here) calls std::terminate(), so the program crashes rather than exiting cleanly.

That’s one of the things to understand, but there are actually a few fundamental points to note here:

  • Launching a new thread is a non-blocking operation.
  • Exiting the program kills all unfinished threads (and destroying a joinable std::thread calls std::terminate).
  • To wait for a thread, call join().

Hence, to fix the program, you need to do this in the main() function:

    std::thread t1(thread_function);
    t1.join();   // Wait!

After this change, the main thread will politely wait for the other thread to print its message and finish. The join() function has the following features:

  • Blocking call that waits for the other thread to finish.
  • Immediate return if the other thread has already finished.

Self-Join Deadlock. Note that you cannot call join() from inside the new thread itself. Logically, this is an immediate deadlock: the join() call in the thread is waiting for itself to finish, but it cannot finish because it’s waiting (is anyone else a fan of Catch 22?). In practice, the standard threads library does check for this case: the standard lists it as an error condition for join(), and implementations throw std::system_error with the resource_deadlock_would_occur error code when a thread attempts to join itself.

Anyway, just don’t do that. It’s the main thread that needs to join the new thread from the outside, not the other way around.

Joinable Safety Check. In the above simple code, it’s not necessarily needed, but safer thread code would validate that the thread is allowed to join before trying to do so, because it crashes if you’re wrong! For example, a “detached” thread is non-joinable. Here’s the simplest check:

    if (t1.joinable()) t1.join(); // Safer

Note that in addition to join(), there’s also a method called detach(), but join() is much simpler to use correctly. If the main thread needs to wait for a detached thread’s work before exiting, it requires synchronization via some other method, because you can’t join() a detached thread, as we just discussed.

Linux Linking Problem

You may find that a standard C++ program using the standard thread library compiles but fails to link with GCC on Linux, at least with older versions. The underlying reason is that standard C++ threads are implemented on top of POSIX threads in the GCC implementation on Linux.

The problem is that the POSIX threads library (usually called “pthreads”) is not being linked. You need to add the “-pthread” flag (note: no “s”) to the compile and link steps. The error looks like this:

    .../thread:127: undefined reference to `pthread_create'

And the fix is to add this linking flag for GCC:

    -pthread

Here’s the line in my Makefile for my testing build:

    LINKFLAGS=-L/usr/lib64/ -g $(PFLAGS) -pthread

Volatile Misunderstanding

This is a common misunderstanding of a longstanding feature of C++ (and also C). The “volatile” specifier in C++ is not for synchronization. In particular, it is useless for multithreading because it:

  • Does nothing to order or synchronize memory accesses across threads.
  • Does not make a variable atomic.

Not only won’t it do anything useful for your multithreading synchronization, but it will actually slow your code down because it interferes with the optimizer.

The purpose of volatile is much more mundane than multithreaded code, and relates only to sequential programming, with these features instead:

  • Indicates that this variable or address has “side effects” that the compiler does not know about.
  • Blocks the compiler from “optimizing out” reads or writes to this variable.

The main real-world uses of the volatile specifier include:

  • Mapping an I/O device to a variable or memory address.
  • Stopping compiler optimizations when doing code benchmarking of low-level arithmetic.

The first one of these is the reason that it exists in the C++ language (and originally in C, too). The idea is to tell the compiler that a variable or address represents an input or output device. So, if the compiler sees the same variable or address read twice, it doesn’t optimize the second one out, which would be faulty if that address represents incoming data from a peripheral device or network feed. Similarly, if you write the same value to that variable, intending to send two bytes to an output device, the compiler is stopped from blocking you.

The use in benchmarking is a programmer trick that really misuses a language feature. But there’s nothing wrong with that, because the semantics of volatile are well-defined and have existed in the language since forever. It was standardized into the C language in the ANSI C standard of 1989/1990, and was formally incorporated into C++98.

The volatile specifier is a wonderful feature of C++ that I’ve used often. But, as mentioned above, don’t use volatile as a synchronization method, because nowhere in the above list of its features is anything related to multithreading or concurrency.

Advanced Multithreading Bugs

As you progress to greater multithreading knowledge, the bugs get harder:

  • Race conditions — a variety of orders that can have different results.
  • Deadlock — often from wrongly-ordered acquisition of multiple locks.
  • Livelock — a weird kind of near-deadlock cycling.
  • Memory order errors — with atomics and lock-free data structures.
  • High-level concurrency issues — sigh, the low-level concurrency code was working so well.
  • Thread starvation — a low-priority thread never gets any juice.
  • Priority inversion — weirdly, a low-priority thread gets all the juice.

That’s more than enough! However, there’s another important category of C++ multithreading bugs:

    All the other C++ bugs you already know about.

Multithreaded code still uses basic sequential C++ code in every thread. There might be a few bugs to watch out for in that!

Multithreading Slugs

There are plenty of ways to improve the performance of a C++ multithreading application. In fact, you could write a whole book on it!

Some of the higher-level slugs to avoid include:

  • Using sequential code instead of multiple threads (the horrors!).
  • Launching too many threads (leads to thread overhead).
  • Too many runnable threads per core.

Some possible slowdowns in your locking strategy:

  • Using coarse-grained locking around an entire data structure (per-container locking).
  • Using a single per-class mutex as a static data member (per-class locking).
  • Using unique locks for read operations, instead of shared read-write locks.
  • Using a mutex for a simple integer counter (or a Boolean status flag), when atomics would be enough.

Some of the low-level slugs in locking synchronization include:

  • Overlong lock holding with std::lock_guard destructor unlocking.
  • Not freeing a lock when no longer needed (e.g., when doing computation).
  • Holding a lock while doing the last computations, instead of copying data to local variables (and then unlocking before the computations).
  • Holding a lock before an I/O operation or other blocking kernel system call.

Some other ideas for areas to address for performance:

  • Thread function arguments are pass-by-value by default (e.g., for objects).
  • Not using a thread pool instead of launching/destroying lots of threads.
  • Don’t do core pinning (thread affinity) with core zero (it’s the main Linux kernel core).
  • Blocking calls to select() in socket programming.
  • Not doing any real work in the main thread (it’s a useful worker, too!).

Fake Multithreading

One weirdly common slug is “redundant thread computations” due to a simple programming bug. This means that multiple threads are repeating the exact same work, but nobody notices because it’s a slug rather than a bug.

For example, if you’re optimizing a “vector-add” operation that takes two vectors and outputs a third vector, and the vectors are very long (e.g., in AI), then you might try to have different segments of a vector processed in different threads to parallelize the operation. But if you mess up the indices, such as when your boss calls you away to an important meeting while you’re coding, each thread can end up with the wrong loop bounds.

If you actually send work to each thread that has the full index range, rather than a sub-segment, then each thread scans the entire vector and outputs the entire third vector. This is insidious because the results should be correct, but it’s re-computing the same arithmetic operations multiple times in parallel.

There’s nothing wrong with your high-level design; the code that assigns jobs to threads just still uses the full length n where it should use each thread’s sub-range bounds. You can go crazy and optimize your multithreaded vector-add operation with producer-consumer thread pools and lock-free queues, and then add work stealing for load balancing, but if your indices are wrong, it’s all moot. Slugs and bugs can live together!


 
