Chapter 31. Thread Overhead

  • Book Excerpt from "C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations"
  • by David Spuler, Ph.D.

What is Thread Overhead?

Thread overhead is the extra cost of creating and destroying threads, at the start and end of multithreaded algorithm execution. This is effectively an extra cost that you wouldn’t have in a single-threaded C++ application, and is offset against the performance gain of parallelizing your code into multiple threads. Hence, the two main components of thread overhead are:

  • Launching a new thread
  • Destroying a finished thread

Note that these costs do not involve any other thread, but are specific to a single thread. There are some other less obvious causes of extra thread overhead:

  • Constructors of thread_local objects (thread-local storage)
  • Destructors of thread-local objects on thread shutdown

These per-thread costs are analogous to the startup and shutdown costs of C++ global objects in a non-threaded program. A normal C++ program has extra code that runs before main() for global object constructors, and destructors that run after the application finishes.
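
As a small illustration (a sketch of my own, not part of the measurements below), a thread_local object with a non-trivial constructor and destructor adds work to every thread that uses it, at startup (or first use) and again at shutdown:

    #include <cstdio>
    #include <thread>

    struct PerThreadState {
        PerThreadState()  { std::printf("thread_local constructor\n"); }
        ~PerThreadState() { std::printf("thread_local destructor\n"); }
    };

    thread_local PerThreadState tls_state;   // one instance per thread

    void worker()
    {
        (void)&tls_state;   // odr-use triggers construction in this thread
    }

    int main()
    {
        std::thread t(worker);
        t.join();   // the worker's tls_state destructor has run by the time join() returns
    }

With compilers like GCC, the constructor typically runs lazily on the first use within each thread, and the matching destructor runs as that thread exits.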

Measuring Basic Thread Overhead

You’re supposed to use a “thread pool” to avoid all the basic overhead of starting and stopping threads. I was wondering how much that overhead would actually be, so I decided to time it, using a dummy example.
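
For context, here's a rough sketch of what a thread pool looks like (the MiniPool class and its interface are purely illustrative, not from any particular library): the worker threads are launched once and reused, so each submitted task avoids the per-thread launch and teardown costs measured below.

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    class MiniPool {
    public:
        explicit MiniPool(std::size_t nthreads) {
            for (std::size_t i = 0; i < nthreads; ++i) {
                workers_.emplace_back([this] {
                    for (;;) {
                        std::function<void()> task;
                        {
                            std::unique_lock<std::mutex> lock(mtx_);
                            cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
                            if (stop_ && tasks_.empty()) return;   // drain, then exit
                            task = std::move(tasks_.front());
                            tasks_.pop();
                        }
                        task();   // run the task outside the lock
                    }
                });
            }
        }
        ~MiniPool() {
            {
                std::lock_guard<std::mutex> lock(mtx_);
                stop_ = true;
            }
            cv_.notify_all();
            for (auto& w : workers_) w.join();   // thread teardown happens once, here
        }
        void submit(std::function<void()> task) {
            {
                std::lock_guard<std::mutex> lock(mtx_);
                tasks_.push(std::move(task));
            }
            cv_.notify_one();
        }
    private:
        std::vector<std::thread> workers_;            // launched once in the constructor
        std::queue<std::function<void()>> tasks_;
        std::mutex mtx_;
        std::condition_variable cv_;
        bool stop_ = false;
    };

With something like this, running work is just pool.submit([]{ ... }), with no new std::thread per task; the timings below give a feel for what that saves.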

Here’s my simple benchmarking function that just consumes some time, using a volatile loop counter so that the compiler doesn’t optimize the loop away:

    void thread_function(int n)
    {
        for (volatile int i = 0; i < n; i++) {
            // nothing
        }
    }

I wanted the code to execute some real instructions, rather than just sleeping for a delay (e.g., with std::this_thread::sleep_for()), in case that made any difference to the status of the thread before shutdown.

Here is the instrumentation I used to try to measure thread startup and shutdown overhead using the high-resolution clock in the <chrono> library:

    // The time points (before, before_thread, after_thread, after_join, now) are
    // std::chrono::high_resolution_clock::time_point variables declared beforehand.
    {
        before = std::chrono::high_resolution_clock::now();
        before_thread = before;
        std::thread t1(thread_function, n);
        after_thread = std::chrono::high_resolution_clock::now();
        t1.join();
        after_join = std::chrono::high_resolution_clock::now();
    }
    now = std::chrono::high_resolution_clock::now();

And the computations of the different costs in microseconds are:

    diff_thread_function = std::chrono::duration_cast<std::chrono::microseconds>(now - before).count();
    startup_thread_function = std::chrono::duration_cast<std::chrono::microseconds>(after_thread - before_thread).count();
    shutdown_thread_function = std::chrono::duration_cast<std::chrono::microseconds>(now - after_join).count();
    compute_thread_function = std::chrono::duration_cast<std::chrono::microseconds>(after_join - after_thread).count();
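
Putting those fragments together, a minimal self-contained harness looks something like this (my reconstruction, reusing the thread_function above; the iteration count is arbitrary):

    #include <chrono>
    #include <cstdio>
    #include <thread>

    void thread_function(int n)
    {
        for (volatile int i = 0; i < n; i++) {
            // nothing
        }
    }

    int main()
    {
        const int n = 100000;   // arbitrary iteration count for illustration
        std::chrono::high_resolution_clock::time_point
            before, before_thread, after_thread, after_join, now;
        {
            before = std::chrono::high_resolution_clock::now();
            before_thread = before;
            std::thread t1(thread_function, n);
            after_thread = std::chrono::high_resolution_clock::now();
            t1.join();
            after_join = std::chrono::high_resolution_clock::now();
        }
        now = std::chrono::high_resolution_clock::now();

        auto usec = [](auto a, auto b) {
            return (long long)std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
        };
        std::printf("total: %lld, startup: %lld, compute: %lld, shutdown: %lld (microseconds)\n",
            usec(before, now), usec(before_thread, after_thread),
            usec(after_thread, after_join), usec(after_join, now));
        return 0;
    }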

Lambda and Functor Threads

A thread body can be defined in ways other than a normal named function call, such as a function pointer (not much different), a lambda function, or a functor (function object). Hence, I decided to test the various ways that a thread body of executable instructions can be defined:

  • Standard function (i.e., with a name)
  • Lambda function (anonymous function)
  • Functor (function object)

The named function is shown above with the timing instrumentation around it. Here’s the lambda function version with the anonymous [] syntax:

    std::thread t1(
        [](int n) {
            for (volatile int i = 0; i < n; i++) {
                // nothing
            }
        }, n);

And here’s a functor for your viewing pleasure, which is a “function object” where the operator() has been defined:

    struct Functor {
        void operator()(int n) {
            for (volatile int i = 0; i < n; i++) {
                // nothing
            }
        }
    };
    Functor functor;
    std::thread t1(functor, n);
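
The non-threaded timings in the next section also include a call through a std::function wrapper; that part of the harness isn't shown here, but the variant presumably amounts to something like this (a guess at its shape, with <functional> included):

    std::function<void(int)> fn = thread_function;   // wrap the named function
    fn(n);   // ordinary (non-threaded) indirect call through std::function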

Timing Results

Timing on Linux with GCC, these are the non-threaded timings:

    Basic Function: 34 microseconds
    Basic Function (Repeat): 36 microseconds
    Basic Inline: 34 microseconds
    Basic Ptr-to-Fn: 34 microseconds
    Basic Functor: 32 microseconds
    Basic Lambda: 34 microseconds
    Basic std::function: 33 microseconds

And these are the threaded calls on Linux with GCC:

    Thread Function (First): 228 microseconds (startup: 144, compute: 84, shutdown: 0)
    Thread Function (Repeat): 43 microseconds (startup: 3, compute: 39, shutdown: 0)
    Thread Function (Repeat): 41 microseconds (startup: 2, compute: 39, shutdown: 0)
    Thread Lambda: 42 microseconds (startup: 2, compute: 39, shutdown: 0)
    Thread Functor: 40 microseconds (startup: 2, compute: 37, shutdown: 0)

Note that these are microseconds! The roughly 200 microseconds of overhead from initializing the threading machinery on the first launch is not even half a millisecond. And that’s only the first call, with subsequent thread launches showing only 2 or 3 microseconds of startup overhead on Linux!

Timing on Windows (MSVS) for the non-threaded function calls:

    Basic Function: 107 microseconds
    Basic Function (Repeat): 102 microseconds
    Basic Inline: 78 microseconds
    Basic Ptr-to-Fn: 97 microseconds
    Basic Functor: 136 microseconds
    Basic Lambda: 87 microseconds
    Basic std::function: 182 microseconds

And here are the Windows timings of the thread launches for the same functions:

    Thread Function (First): 3387 microseconds (startup: 74, compute: 3312, shutdown: 0)
    Thread Function (Repeat): 649 microseconds (startup: 41, compute: 607, shutdown: 0)
    Thread Function (Repeat): 729 microseconds (startup: 35, compute: 694, shutdown: 0)
    Thread Lambda: 621 microseconds (startup: 34, compute: 586, shutdown: 0)
    Thread Functor: 539 microseconds (startup: 30, compute: 509, shutdown: 0)

A few conclusions can be drawn:

  • Thread launch overhead is about 26% on Linux (43 vs 34) and 500% (649 vs 107) on Windows (admittedly, an unfair comparison of a Linux server versus a Windows laptop!).
  • The first thread launch has a large extra time cost, which disappears on a repeat, perhaps from initialization of the thread mechanisms (or perhaps it’s just a cold cache?).
  • There’s not much difference between running a thread body with a standard named function, lambda function, or functor (function object) on either platform.

Limitations. This is a dummy function and the overhead would be proportionally less if the thread did more computation. This is a single test of a single function for a single iteration count. There might be a few statisticians who want to object to that level of sampling.

Furthermore, as you can see, my timing method isn’t particularly effective at separately computing the startup and shutdown costs of a thread. A lot of the startup cost and shutdown cost seems to be hidden inside the compute time, while the main thread is waiting with the join() call. Nevertheless, the total costs are quite indicative of the extra overheads, especially on the very first thread launch.

Synchronization and Context Switch Overhead

The above discussion is about the overhead of threads starting and stopping. There are various other types of overhead, arising in the middle of a thread’s execution, that should also be optimized:

  • Synchronization overhead — extra cost of mutexes, locks, atomics, etc. (see the sketch after this list).
  • Thread wait durations — blocked while awaiting a mutex or lock.
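
As a very rough illustration of the first of these (a micro-benchmark sketch of my own, using the same <chrono> timing approach as above, not figures from this chapter), you can compare uncontended mutex lock/unlock pairs against plain relaxed atomic increments:

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <mutex>

    void time_sync_primitives(int n)
    {
        std::mutex m;
        std::atomic<int> counter{0};

        auto t0 = std::chrono::high_resolution_clock::now();
        for (volatile int i = 0; i < n; i++) {
            m.lock();     // uncontended: no blocking, but still not free
            m.unlock();
        }
        auto t1 = std::chrono::high_resolution_clock::now();
        for (volatile int i = 0; i < n; i++) {
            counter.fetch_add(1, std::memory_order_relaxed);   // single atomic read-modify-write
        }
        auto t2 = std::chrono::high_resolution_clock::now();

        std::printf("mutex: %lld usec, atomic: %lld usec\n",
            (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count(),
            (long long)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count());
    }

Under contention the mutex numbers get much worse, because a blocked thread usually means a context switch, which is the subject of the rest of this section.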

There are also some slowdowns that arise because your code is now split up into multiple threads, which have to be scheduled and time-sliced by the OS and the hardware. Some of the general areas of cost include:

  • Context switches — cost of swapping threads in and out of CPU execution.
  • Scheduling costs — the OS choosing which thread to run next.

There are also some slowdowns that occur in the hardware caches during these switches, because the OS does not save and restore the contents of the hardware caches. The new thread starts with cold hardware caches, leading to cache misses and performance problems in several areas:

  • Memory cache invalidation — context switches lose low-level L1/L2/L3 CPU cache advantages.
  • TLB cache loss — the virtual address cache is lost.
  • CPU instruction pipeline — this stalls because execution location has moved.
  • Instruction prefetch — this is cleared because a new thread starts elsewhere.
  • Memory prefetch cache — the new thread is unlikely to be accessing the same memory locations.
  • NUMA issues — loss of data locality if the thread resumes on a different core or NUMA node.

In other words, everything that the CPU does to make code run fast gets undermined by a context switch.

What causes a context switch? Context switches can arise at the end of a scheduling time-slice, or whenever the thread uses a primitive that can block it. Some examples that trigger a context switch include:

  • Synchronization — waiting for a lock or mutex.
  • System calls — those that block, such as for I/O or networking.

A context switch involves saving all the state of the current thread and then loading the context of the new thread. This has its own direct cost, and, as described above, the new thread also starts its time-slice with cold hardware caches. Hence, context switches are expensive, and minimizing their occurrence is an important part of optimizing multithreaded code.
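
One practical way to see how often a piece of code is being switched out (a Linux-specific sketch; RUSAGE_THREAD is a glibc/Linux extension that needs _GNU_SOURCE, which g++ defines by default) is to read the context-switch counters from getrusage() before and after the region of interest:

    #include <cstdio>
    #include <sys/resource.h>

    void report_context_switches()
    {
        struct rusage ru;
        if (getrusage(RUSAGE_THREAD, &ru) == 0) {   // use RUSAGE_SELF for the whole process
            std::printf("voluntary: %ld, involuntary: %ld\n",
                ru.ru_nvcsw,     // voluntary: blocked on a lock, I/O, etc.
                ru.ru_nivcsw);   // involuntary: preempted at the end of a time-slice
        }
    }

Comparing the counts before and after a region gives a rough idea of how many context switches that region is causing, and whether they are voluntary (blocking) or involuntary (preemption).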

 
