Aussie AI

Chapter 29. Multithreading Optimizations

  • Book Excerpt from "C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations"
  • by David Spuler, Ph.D.

C++ Multithreading Optimizations

Multithreading is the art of parallelizing on a multicore CPU, often as part of low latency programming. Threads have been around since at least the 1990s (e.g., POSIX threads), before most CPUs even had “cores,” but recent advancements have made them much easier to code. C++11 introduced a standardized thread library, std::thread (along with std::mutex and std::atomic), and C++17 added parallel execution policies for the standard library algorithms.

What is Multithreading?

In this discussion, threads run on the CPU, and you can have many threads per CPU (or per “core”). Multithreading and multicore programming are largely the same thing, or at least they’re in the same ballpark.

Other types of threads can differ quite a lot. For example, there is a slightly different idea of “threads” on GPUs in the CUDA C++ programming language. You can run 1024 threads in a single block on an NVIDIA GPU, but you might not want to do that on your CPU, lest you run out of stack space. CUDA C++ supports this by allocating a quite restricted amount of GPU memory (sometimes called VRAM) to the call stack of each GPU thread in a grid. Hence, stack overflow is a thing on GPUs, too.

How Not to Multithread

If you’re looking for a short career as a multithreading programmer, here are some suggestions:

  • Launch as many CPU threads as you possibly can, ideally one per vector element, just like you do in a low-level GPU kernel for AI inference.
  • Put huge buffer objects as local variables on your call stack, and then launch multiple threads running that function.
  • Fix your huge local buffer variables by making them static, because that function won’t ever get run twice at the same time.
  • Use mutexes around every access to all your variables, just to be safe.
  • Recursion will get you fired in any coding job, except university lecturer, so it’s best to pretend you’ve never heard of it.

High-Level Multithreading Optimization

The first point, above all else: multithreading is a high-level optimization in itself. Hence, you want to be judicious in choosing where to use your threads, and at what level.

Some of the issues that control the overall concurrency achieved by a multithreaded architecture include:

  • Abstraction level choices for splitting the work across threads.
  • Thread pool design pattern — avoid creating and destroying threads.
  • Thread specializations — e.g., producer and consumer threads model.
  • Message-passing design pattern to avoid locking — e.g., with a paired future and promise.
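As a sketch of the last pattern, a paired std::promise and std::future lets a worker thread hand a result back to the caller without any explicit locking. The worker function and values below are hypothetical placeholders, not code from this book:

```cpp
#include <future>
#include <thread>

// Hypothetical worker computation (placeholder).
int process_request(int request) {
    return request * 2;
}

// The worker thread owns the promise; the caller only reads the future.
// No mutex is needed because the future/promise pair synchronizes the handoff.
int run_message_passing(int request) {
    std::promise<int> result_promise;
    std::future<int> result_future = result_promise.get_future();

    std::thread worker([&result_promise, request] {
        result_promise.set_value(process_request(request));
    });

    int result = result_future.get();  // blocks until the worker sets the value
    worker.join();
    return result;
}
```

The same pairing generalizes to producer-consumer designs: the producer holds promises, the consumer holds futures, and neither side touches shared mutable state directly.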

Focusing on the data can also be useful to optimize:

  • Multithreading-friendly data structures — e.g., queues (esp. lock-free versions).
  • Maximize read-only “immutable” data usage — avoids blocking concurrent readers.
  • Advanced data structure read-write models — copy-on-write, versioned data structures.
  • Shard data across threads (or use other types of data partitioning) — reduces the amount of synchronization needed.
  • Reduce disk writes — e.g., use in-memory logging with occasional disk writes.
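Sharding can be as simple as giving each thread its own contiguous chunk of the input and its own output slot, so the only synchronization is the final join(). A minimal sketch (the function name and data are illustrative, not from the book's code):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Each thread sums its own contiguous shard of the input and writes the
// partial result into its own slot, so no locks are needed beyond join().
long shard_sum(const std::vector<int>& data, std::size_t num_threads) {
    std::vector<long> partials(num_threads, 0);
    std::vector<std::thread> threads;
    std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (std::size_t t = 0; t < num_threads; ++t) {
        threads.emplace_back([&, t] {
            std::size_t begin = t * chunk;
            std::size_t end = std::min(begin + chunk, data.size());
            for (std::size_t i = begin; i < end; ++i)
                partials[t] += data[i];  // each thread touches only its own slot
        });
    }
    for (auto& th : threads) th.join();
    return std::accumulate(partials.begin(), partials.end(), 0L);
}
```

Note that adjacent slots in the partials vector can still share a cache line; padding or per-thread local accumulators avoid that cost in production code.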

Ways to optimize by focusing on the execution pathways include:

  • Slowpath removal — keep the hot path small and tight.
  • Defer error handling — most error code is uncommonly executed (i.e., a slowpath), so avoid, defer, or combine error detection code branches.
  • Cache warming — keep the hotpath bubbling away.
  • Full hotpath optimizations — e.g., for HFT, the hotpath is not just “trade” but actually the full latency from data feed ingestion to execution, so it’s actually “receive-analyze-decide-and-trade.”
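One way to defer error handling, as a hedged sketch: accumulate an error flag inside the hot loop without branching, and test it once afterwards. The negative-input "error" here is a made-up condition purely for illustration:

```cpp
#include <vector>

// Returns the sum, or -1 if any input was negative (hypothetical error case).
// The hot loop is branch-free; the error check is deferred to a single
// test after the loop, keeping the common path small and tight.
long sum_with_deferred_check(const std::vector<int>& data) {
    long sum = 0;
    int error_accum = 0;
    for (int x : data) {
        sum += x;
        error_accum |= (x < 0);  // record errors without branching
    }
    return error_accum ? -1 : sum;  // single check on the slowpath boundary
}
```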

Some of the more pragmatic points include:

  • How many threads?
  • How long should each thread run?
  • When to exit a thread versus waiting.

There’s no wrong or right answer to these questions, as they depend on the application and the problem you’re trying to solve.

Low-Level Multithreading Optimization

There are various ways to modify how you run threads in order to improve their concurrency. These are not as impactful as the higher-level threading choices, but they are still important. Some methods to change the lower-level thread architecture include:

  • Core pinning (processor affinity) — every popular thread can have a favorite core.
  • Early unlocking — e.g., copy data to local variables, release lock, then do the computations.
  • Cache locality improvements (L1 cache and memory prefetch cache).
  • Branch reductions — keep the instruction pointer on the straight-and-narrow.
  • Lock-free algorithms — avoiding mutex overhead and blocked thread delays.

Ways to avoid slow-downs in multithreading, and therefore increase speed:

  • Minimizing thread launch and shutdown overheads.
  • Releasing locks early by avoiding unnecessary computation, I/O waits, etc.
  • Minimizing context switches.
  • Memory reductions (e.g., allocated memory; reduce thread-specific call stack size).
  • Avoid spinlocks (busy waiting) or mitigate them with exponential backoff methods.
  • Avoiding “false sharing,” where unrelated variables end up on the same CPU cache line (e.g., use alignas(64) to separate unrelated atomics).
  • Check that std::lock_guard is not unnecessarily delaying the unlock (it holds the mutex until it goes out of scope).
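The false-sharing point can be illustrated with alignas(64): each atomic counter gets its own cache line, so two threads incrementing different counters do not keep invalidating each other's lines. This is a minimal sketch assuming 64-byte cache lines (std::hardware_destructive_interference_size is the portable C++17 alternative):

```cpp
#include <atomic>
#include <thread>

// alignas(64) forces each counter onto its own 64-byte cache line,
// so the two writer threads do not false-share.
struct Counters {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

long run_counters(long iterations) {
    Counters c;
    std::thread t1([&] {
        for (long i = 0; i < iterations; ++i)
            c.a.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&] {
        for (long i = 0; i < iterations; ++i)
            c.b.fetch_add(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    return c.a.load() + c.b.load();
}
```

The result is the same with or without the alignment; only the number of cache-coherency stalls changes, which is why false sharing is so easy to miss in testing.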

Sequential C++ Code Optimizations

An important point about the code running in any thread: it’s just C++ code. Each thread runs a sequential set of instructions, with its own call stack. Hence, all of the many ways to optimize normal C++ code also apply to all of the code in the thread.

Hence, all of the basic ideas for C++ code optimizations apply:

  • Compile-time processing — constexpr, constinit, etc.
  • Operator efficiency — e.g., replace multiply with bitshift or addition.
  • Data type optimizations — e.g., integers versus floating-point.
  • Memory optimizations — cache warming (prefetching), memory reductions.
  • Loop optimizations — e.g., loop unrolling, code hoisting, and many more.
  • Compiler hints — e.g., [[likely]] statements.
  • Function call optimizations — e.g., inlining, always_inline, etc.
  • C++ class-level optimizations — e.g., specializing member functions.
  • Algorithm improvements — various non-concurrency improvements, such as precomputation, caching, approximations, etc.

So, the bad news is that once you’ve coded your multithreaded algorithm, you still have to go and do all the other types of sequential optimizations. Oh, come on, who are we kidding? — it’s loads of bonus fun.

