Aussie AI

Chapter 10. Slowpath Removal

  • Book Excerpt from "C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations"
  • by David Spuler, Ph.D.

What is Slowpath Removal?

Slowpath removal is a multithreading optimization whereby the cold paths are removed, merged, or deferred. The idea is to give priority to the hotpath by avoiding, as much as possible, any branches leading to the slowpath.

Not all code belongs on the hotpath. Some examples of slowpath logic include:

  • Error handling
  • Logging
  • Self-testing code

Note that I really mean removal of these paths. There are actually two optimizations in slowpath removal:

  • Avoiding the cost of testing for errors.
  • Removing the slowpath’s instruction sequences from the code.

We don’t just want to avoid testing for errors, but we actually want there to be zero branches in the hotpath code sequence. The reasons for this include:

  • Branch prediction optimizations (i.e., branch elimination), and
  • Instruction cache optimization.

Another point: to keep the hotpath short, with good locality in the instruction prefetch cache, we want to minimize any slowpath code on that path. Hence, if you cannot avoid having a slowpath sequence in the hotpath, encapsulate it into a separate function, and don’t inline that slowpath function. That way, only the test for the slowpath condition (e.g., an error flag test) and a single function call to the slowpath function appear in the instruction block along the hotpath.
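As a sketch of this structure (the function and flag names are hypothetical, and the noinline/cold attributes are GCC/Clang-specific), the hotpath below contains only one test and one call into the out-of-line slowpath handler:

```cpp
#include <cstdint>

// Hypothetical shared error state, for illustration only.
static uint32_t g_errflags = 0;

// Slowpath handler: deliberately NOT inlined, so its instructions
// stay out of the hotpath's instruction cache lines.
__attribute__((noinline, cold))
void handle_error_slowpath(uint32_t flags) {
    // Logging, cleanup, alerting, etc. all live here.
    g_errflags |= flags;
}

// Hotpath: the only slowpath footprint left in this instruction
// sequence is the test and a single function call.
int process_order(int qty) {
    if (qty <= 0) [[unlikely]] {      // C++20 hint: keep this branch cold
        handle_error_slowpath(0x1);
        return -1;
    }
    return qty * 2;                   // the fast, branch-light work
}
```

The `[[unlikely]]` attribute is a further nudge to the compiler to lay out the cold branch away from the fall-through hotpath.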

If the hotpath code sequence is short and tight on the CPU, it runs a lot faster than if it has to think about alternative pathways.

Error Handling Slowpaths

Error handling is a common example of a slowpath. Most of the failures and exception states of execution are not on the hotpath, as they are uncommon events compared to success. They’re called exceptions for a reason!

The problem with errors is that you have to check for them, even though they almost never happen. Okay, yes, so they can happen, and good programmers always check their return codes and so on. But when you’re trying to go fast, you want to focus on success and winning.

The choices for error handling are therefore on the scale between two extremes:

  • Repeatedly check every error (slow)
  • Don’t check for any errors (unsafe)

There are some trade-offs in the middle ground:

  • Check for fewer errors in production, but more in offline self-testing.
  • Use in-memory logging data structures to defer outputting data to log files.
  • Defer error checking until multiple error statuses can be checked at once.
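The in-memory logging trade-off can be sketched like this (a minimal illustrative ring buffer, not the book’s API): the hotpath does a single store into preallocated memory, while formatting and file I/O are deferred to a drain step that runs off the hotpath:

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Small fixed-size record: cheap to copy on the hotpath.
struct LogRecord {
    int code;
    double value;
};

class RingLogger {
    std::array<LogRecord, 1024> ring_{};  // preallocated, no heap use
    std::size_t head_ = 0;
public:
    // Hotpath: one store and an index bump; no allocation, no I/O.
    void record(int code, double value) {
        ring_[head_ % ring_.size()] = LogRecord{code, value};
        ++head_;
    }
    // Slowpath: called off the hotpath to drain buffered entries
    // (e.g., for formatting and writing to a log file).
    std::vector<LogRecord> drain() {
        std::size_t n = head_ < ring_.size() ? head_ : ring_.size();
        std::vector<LogRecord> out(ring_.begin(), ring_.begin() + n);
        head_ = 0;
        return out;
    }
};
```

A production version would need to handle wrap-around ordering and possibly concurrent producers, but the shape is the same: store now, format later.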

Deferring Error Checks

The idea of deferred error checking is to not immediately check every error status. Instead, we try to keep going and ignore possible error states, and then check for them as late as possible.

Traditional error checking is to immediately test for a failure return code. Here’s an example:

    bool oksetup = orderobj.setup(ticker, price);
    if (!oksetup) {
        // Fail...
    }
    bool oktrade = orderobj.submit_trade();
    if (!oktrade) {
        // Fail...
    }
    bool oklog = logger.record(ticker, price);
    if (!oklog) {
        // Fail...
    }

The basic structure is a long sequence of if statements, with error handling interleaved into the main hotpath. Yes, you could micro-optimize the above, such as by avoiding three separate Boolean variables, but you get the idea. This is a slow control flow that mixes the hotpath and the slowpath.

Faster is to run as fast as possible with all the steps, and only check for problems at the end. If we can defer error checking until after the trade has submitted, then our error handling code is completely out of the hotpath. Here’s the basic concept of doing deferred error checking at the end:

    bool oksetup = orderobj.setup(ticker, price);
    bool oktrade = orderobj.submit_trade();
    bool oklog = logger.record(ticker, price);
    if (!oktrade || !oksetup || !oklog) {
        // Fail...
    }

We might optimize this using bit flags for error codes and pass-by-reference parameters:

    uint32_t errflags = 0;
    orderobj.setup(ticker, price, errflags);
    orderobj.submit_trade(errflags);
    logger.record(ticker, price, errflags);
    if (errflags) {
        // Fail...
    }

The tricky part here is whether the trade submitter or logger functions will crash if the first function fails. We have to design all the routines to be pass-through, or at least non-crashing, even if an earlier routine has had an error. This is easier said than done!

You have to take care to really defer the error checks, not just hide them. For example, if your second routine needs to check for an error status from the first function (so it doesn’t crash), then you haven’t really deferred the error checking until after the hotpath has finished. Instead, it’s just hidden further down the call stack inside the individual functions.
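Here is one way the pass-through design might look (the function names and error bits are illustrative): each routine records failure into the shared bit flags and is safe to call even after an earlier failure. Note that the cheap internal guard in submit_trade is exactly the kind of hidden check the paragraph above warns about — it must be trivial, or the deferral gains nothing:

```cpp
#include <cstdint>

// Illustrative error bits, one per pipeline step.
enum : uint32_t {
    ERR_SETUP = 1u << 0,
    ERR_TRADE = 1u << 1,
};

// Records failure but never aborts the pipeline.
void setup(int price, uint32_t& errflags) {
    if (price <= 0) errflags |= ERR_SETUP;   // record, don't branch out
}

// Pass-through: safe to call after a failed setup.
void submit_trade(uint32_t& errflags) {
    if (errflags & ERR_SETUP) {              // cheap guard, not a full check
        errflags |= ERR_TRADE;
        return;
    }
    // ... actually submit the trade ...
}
```

The caller then runs all steps unconditionally and tests `errflags` once at the end, as in the earlier example.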

Removing Error Checks

Safe C++ programming practices always have us doing a lot of extra work to check for a myriad of coding problems:

  • Function parameter validation
  • Function error return code checking
  • Assertion failures
  • Self-testing code failures
  • Memory allocation failures
  • File loading errors (e.g., file not found, disk full)
  • Valgrind runtime checking

But if we want to go fast, many of these can be removed. Goodbye to slow code! Hello, speed.

Not all of the above error situations are that common, and many of them are under our own control, since they’re really just checking for our own coding errors. Some of the error avoidance strategies for the critical code in the hotpath include:

  • Don’t use memory allocation (avoids allocation failures).
  • Avoid disk-full issues with logging via good Linux admin practices and lightweight monitoring.
  • Compile-out parameter validation, assertions, and self-testing code for production (but include them in unit tests and offline automated test harnesses).
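For example, parameter validation can be compiled out along these lines (SELFTEST is an assumed build flag, not a standard macro):

```cpp
#include <cassert>

// Validation compiled in for offline testing, compiled out
// (to nothing) in production builds.
#if SELFTEST
  #define VALIDATE(cond) assert(cond)
#else
  #define VALIDATE(cond) ((void)0)   // zero cost in production
#endif

int scale_price(int price) {
    VALIDATE(price > 0);   // checked only in test builds
    return price * 100;
}
```

In the nightly-build test harness, compiling with `-DSELFTEST=1` re-enables every check; the production build pays nothing.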

If compiling out all of the safety stuff gives you concerns, here’s the plan:

  • Don’t write buggy code!

Oh, wait! That’s not so easy. But here’s what we can do: mitigate against human frailty by shaking out all the bugs before they get to production.

One of the main ways to have very fast production code while still mitigating unforeseen coding failures is to max out the use of automated testing in offline mode. Here’s the basic plan:

  • CI/CD — faster unit tests.
  • Nightly builds — longer automated tests, static analysis, etc.

We can and should run basic unit tests as part of CI/CD, but then we should thrash the whole thing to death in nightly builds. This means enabling lots of self-testing code and other very slow tests that would hurt developer productivity if run in CI/CD. Hence, nightly builds should run stress tests under Valgrind, even running the same tests across multiple platforms, compilers, and optimization levels. We maximize the testing offline to mitigate the risk of removing these tests in production.

Never-Failing Functions

As programmers, we’ve had it drummed into us that every function should return a success or failure status. But, why?

Some functions should never fail. If it’s a function that does not access external resources, the most common reasons for failure are internal ones (e.g., called with the wrong parameters) or very rare states (e.g., memory allocation failure). Every one of these reasons is something under our control:

  • Don’t call it with bad parameters.
  • Don’t use allocated memory.

As an example, consider a function to set up an order object to submit a trade, which is obviously on the hotpath. This is the traditional C++ style:

    bool ok = orderobj.setup(ticker, price);
    if (!ok) {
        // Handle the error...
    }
    // Keep going (submit the trade)

Here’s a faster method whereby we only check for those “under-our-control” coding issues in offline regression tests. The basic idea is to have the error checks only in test modes:

#if SELFTEST // unit test mode
    bool ok = orderobj.setup(ticker, price);
    if (!ok) {
        // Handle the error...
    }
#else  // Production mode (hotpath)
    (void) orderobj.setup(ticker, price);
#endif
    // Keep going (submit the trade)

In fact, we probably should further optimize the function itself to have a void return type in production, and never even think about returning an error code. We could use tricky #if sequences, or have two versions of the entire function. If we make it inline, then the optimizer might get rid of some of the unused return statements, but why do we need them in the first place?
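One way to sketch the “two versions” approach (with a hypothetical Order class) is to switch the entire signature on the build flag, so the production build has a void setup with no return-code plumbing at all:

```cpp
struct Order {
#if SELFTEST
    // Test build: report bad parameters so offline tests catch them.
    bool setup(int price) {
        if (price <= 0) return false;
        price_ = price;
        return true;
    }
#else
    // Production build: cannot "fail" -- the caller guarantees
    // price > 0, which the offline tests have already verified.
    void setup(int price) {
        price_ = price;
    }
#endif
    int price_ = 0;
};
```

The production function has no return statement, no status variable at the call site, and nothing for the branch predictor to mispredict.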

The main slowness that we can’t get rid of in the hotpath is the return codes or exceptions from third-party APIs, network connections, and system resources, which really can fail in production. However, we already discussed these above, along with the strategies to defer those checks until after the hotpath.


 
