Aussie AI

Chapter 5. Hotpath Optimizations

  • Book Excerpt from "C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations"
  • by David Spuler, Ph.D.


What is Hotpath Optimization?

Hotpath optimization is a C++ optimization strategy in HFT whereby the most important code is prioritized and super-optimized. Whereas the traditional “hotpath” in C++ code is the most heavily executed code, in HFT the hotpath is a rarely executed sequence of high importance (i.e., submitting the trade). Hence, optimizing the hotpath can mean different things:

  • Profiling the most heavily executed code (traditional C++ code).
  • Running the GPU profilers on CUDA C++ kernels (for AI applications).
  • Optimizing the rare but most important pathway (HFT applications).

Using the various C++ profiler tools won’t help you much in HFT hotpath optimization. Well, actually it can, but only if you have a way to modify the code in test mode so that it always runs the hotpath sequence. But take care with this idea, as maybe it shouldn’t really submit a thousand live buy orders to the exchange when it’s running under Valgrind in the nightly build.

Hotpath Optimization Techniques

The idea with hotpath examination is to put every single instruction under the microscope. Especially for HFT, every microsecond counts, and there are many ways to squeeze out more speed. There are two main categories of optimizations:

  • Concurrency optimizations — multithreading-related code changes.
  • General C++ optimizations — all of the rest!

With regard to multithreading, the hotpath should not be subjected to any of the delays that can beset a single thread. Some of the methods for speedup include:

  • CPU pinning — give the hot thread its own core (completely avoids context switching)
  • Avoid locking on the hotpath (as much as possible) via lock-free algorithms or read-only data structures.
  • Cache warming via prefetching of shared data needed by the hotpath.
  • Keep the cache warm all the way down into the NIC.
  • Use a lock-free queue data structure to avoid contention issues.
  • Use custom thread pools with only preallocated memory block pools.
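As one example from the list above, cache warming via prefetching might be sketched as follows, using the GCC/Clang `__builtin_prefetch` intrinsic. The `Quote` struct and the prefetch distance are illustrative assumptions, not code from a real trading system:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical shared data that the hotpath is about to read.
struct Quote {
    double bid;
    double ask;
};

// Walk the quotes while prefetching a few entries ahead, so each entry
// is already in cache by the time the loop body reaches it.
double sum_mid_prices(const std::vector<Quote>& quotes) {
    constexpr std::size_t kAhead = 8;  // prefetch distance (a tuning guess)
    double total = 0.0;
    for (std::size_t i = 0; i < quotes.size(); ++i) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + kAhead < quotes.size()) {
            // rw=0 (read), locality=1 (low temporal locality)
            __builtin_prefetch(&quotes[i + kAhead], 0, 1);
        }
#endif
        total += (quotes[i].bid + quotes[i].ask) / 2.0;
    }
    return total;
}
```

The right prefetch distance depends on the cache line size and the cost of the loop body, so it needs measuring rather than guessing in production code.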

Beyond the multithreading code changes, there are another few hundred general types of C++ optimizations to consider. A number of chapters cover these, but here’s a smattering of some interesting techniques:

  • Hoist code out of the hotpath by using precomputation.
  • Remove slowpaths by deferring handling of error checks.
  • Maximize compile-time computation (e.g., constexpr, TMP if you must).
  • Don’t allocate or free memory; use only preallocated memory or global memory.
  • Use in-memory databases for any significant amounts of incoming data.
  • Review data de-serialization and serialization costs.
  • Don’t log, or defer logging to the end, or write to an in-memory logger.
  • Replace every if statement with branchless coding tricks.
  • Examine every code statement in the entire hotpath (even at assembly level).

Odds are high that you’ll find something to improve, no matter how many times you look at the same stretch of code.

Network Optimizations

In a network-heavy application, such as HFT, the speed of networking is critically important. Many of the main optimizations are hardware issues:

  • Custom NIC
  • Fast switches

Note that there can be multiple networks attached to one server:

  • Public network
  • Private network

The purpose of a private network is to send messages only between your servers and any administrative consoles. This private or “out-of-band” network can be used for things like:

  • Monitoring and administration messages
  • Sending data between servers (e.g., quotes data in HFT, or KV cache data in LLM inference).

Although hardware and its related network connections are critical, let’s not forget the software. Your C++ code needs to talk to the network, to receive incoming data and to emit actions (e.g., a trade in HFT). Network-related optimizations to the C++ code in the hotpath can include:

  • Use kernel bypass to custom NICs for fast networking.
  • Keep the client network connection warm (method depends on the API).
  • Use custom wrappers for TCP and UDP network processing.

For extra speed, you may need to wrap or re-implement the TCP and UDP code. Some of the default algorithms for networking introduce some minor safety checks and other delays, which interfere with your need for speed. Linux socket programming can be a lot of fun. I can remember coding a custom version of the select primitive, which is loads of bitmask fiddling.

Core Pinning

Core pinning is a multithreading optimization where a thread is “pinned” to one of the cores to give it higher priority. This means that the important thread that runs the hotpath can have guaranteed CPU availability, rather than waiting for the default thread scheduling algorithms. Hence, it can help avoid scheduling delays and contention worries for the main hotpath thread.

Core pinning is also called “thread affinity” and has multiple other names (e.g., “processor affinity” or “CPU affinity” or “CPU pinning”), but if you hear the words “pinning” or “affinity” in relation to threads, this is it.

Pinning has other meanings in related architectures. There’s a higher-level type of pinning whereby whole processes or applications are pinned to a CPU core by the operating system, rather than just a single thread, which isn’t quite the same thing. Note also that CUDA C++ has another type of “pinned memory” for GPUs, but that’s a memory upload optimization rather than a compute improvement.

The other side of core pinning is that you obviously don’t pin the less important threads. All the lower-priority threads have fewer cores available, and are downgraded.

On Windows, you can set up a process-level CPU pinning for an application via the GUI. On Linux, there is a “taskset” command that allows running a program with core pinning.

Both Windows and Linux have non-standard system calls that can set up pinning for either a process or a thread. Programmatic C++ APIs on Linux are:

  • Pinning processes — sched_setaffinity
  • Pinning threads — pthread_setaffinity_np or pthread_attr_setaffinity_np

On Windows, these are the C++ APIs:

  • Pinning processes — SetProcessAffinityMask
  • Pinning threads — SetThreadAffinityMask
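On Linux, pinning the calling thread might be sketched as follows with `pthread_setaffinity_np` (the `_np` suffix flags it as non-portable, and the wrapper function name here is illustrative):

```cpp
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to a single core (Linux-specific; g++ defines
// _GNU_SOURCE by default, which pthread_setaffinity_np requires).
bool pin_current_thread(int core_id) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);           // start with an empty CPU set
    CPU_SET(core_id, &cpuset);   // allow exactly one core
    return pthread_setaffinity_np(pthread_self(),
                                  sizeof(cpu_set_t), &cpuset) == 0;
}
```

For the full effect of guaranteed availability, the pinned core is typically also isolated from the scheduler (e.g., via the kernel’s `isolcpus` boot parameter), so that no other threads are ever placed on it.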

The use of core pinning is a very powerful type of hotpath optimization. The main pathways are super-optimized because:

  • No context switches
  • Highest priority execution
  • Guaranteed core availability (no delay)

In-Memory Logging

The last thing you want is for your hotpath to block waiting for log messages to get written to disk. Hence, your options for logging include:

  • Don’t log!
  • Buy a faster SSD disk (what’s next after NVMe?)
  • Store log messages in memory

Not logging messages can be an option in some cases. This refers to tracing and debugging messages that aren’t business-critical. Some of the approaches to disable logging include:

  • Compiling-out unimportant tracing.
  • Disabling logging but having it still in the code.
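The first of these, compiling-out, is often done with a macro that expands to nothing in release builds, so the tracing arguments are never even evaluated and no branch appears on the hotpath. A minimal sketch (the `TRACE` macro name is illustrative; the `NDEBUG` convention is borrowed from `assert`):

```cpp
#include <cstdio>

// In release builds (NDEBUG defined), TRACE expands to nothing:
// zero runtime cost, no branch, arguments never evaluated.
#ifdef NDEBUG
#define TRACE(...) ((void)0)
#else
#define TRACE(...) std::fprintf(stderr, __VA_ARGS__)
#endif

// Example hotpath function: the TRACE call vanishes in release builds.
int process_order(int qty) {
    TRACE("processing order qty=%d\n", qty);
    return qty * 2;  // placeholder for real work
}
```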

Using a Boolean control flag to enable or disable logging can be an effective solution. On the other hand, you can end up with a lot of these:

    if (g_debug) {
        // Log a message
    }

These can be inefficient on a hotpath for two reasons:

  • Cost of testing the global flag multiple times, and
  • Extra branches that interfere with branch prediction.

On the other hand, this can be very flexible and the above costs can be a small price to pay in some applications. You can enable or disable the global flag based on:

  • Command-line options (i.e., add a “-debug” setting).
  • Sending a SIGUSR1 signal to the process (toggle debug mode).

Whatever the choice regarding debug or tracing-related logging, you can’t avoid business-related logging. For example, an HFT application needs to track any actual trades sent, and update any risk management applications.

The solution for this is to use an in-memory logging C++ class. The features that you need include:

  • Log messages are copied to an in-memory queue (preferably lock-free).
  • A separate log-writing class pulls these messages off the queue.
  • The thread writing log messages to disk is low-priority in the background.

In this way, you can have quite extensive logging, but the critical path is all in memory, and the slower writing to disk is deferred to a background task that can run in the quiet periods.
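A minimal sketch of such a logger class follows. For brevity it uses a mutex-protected queue, whereas a production hotpath would prefer a lock-free ring buffer as described above; all the names are illustrative:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// In-memory logger: the hotpath only copies a message into a queue;
// a low-priority background thread drains the queue and does the slow I/O.
class InMemoryLogger {
public:
    InMemoryLogger() : done_(false), writer_([this] { drain(); }) {}

    ~InMemoryLogger() {
        {
            std::lock_guard<std::mutex> lk(mu_);
            done_ = true;
        }
        cv_.notify_one();
        writer_.join();
    }

    // Called from the hotpath: enqueue only, never touch the disk.
    void log(std::string msg) {
        {
            std::lock_guard<std::mutex> lk(mu_);
            queue_.push(std::move(msg));
        }
        cv_.notify_one();
    }

    // Block until the background writer has drained everything queued so far.
    void flush() {
        std::unique_lock<std::mutex> lk(mu_);
        drained_.wait(lk, [this] { return queue_.empty(); });
    }

    // For inspection: messages the writer has "persisted" so far.
    std::vector<std::string> written() {
        std::lock_guard<std::mutex> lk(mu_);
        return written_;
    }

private:
    void drain() {
        std::unique_lock<std::mutex> lk(mu_);
        for (;;) {
            cv_.wait(lk, [this] { return done_ || !queue_.empty(); });
            while (!queue_.empty()) {
                std::string msg = std::move(queue_.front());
                queue_.pop();
                // Real code would write msg to disk here, off the hotpath.
                written_.push_back(std::move(msg));
            }
            drained_.notify_all();
            if (done_) return;
        }
    }

    std::mutex mu_;
    std::condition_variable cv_;
    std::condition_variable drained_;
    std::queue<std::string> queue_;
    std::vector<std::string> written_;
    bool done_;
    std::thread writer_;  // declared last so it starts after the other members
};
```

Note that the writer thread is joined in the destructor, so any messages still queued at shutdown are flushed to disk before the program exits.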

 
