Chapter 3. System Optimizations
Book Excerpt from "C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations"
by David Spuler, Ph.D.
Optimizing the Whole System
There are a lot of moving pieces in a whole low latency system. Optimizing them is an elegant dance, where each component plays a part. There’s no single answer to this, and it’s an ongoing process of continuous efficiency improvement.
Instead, you need to look at all the different components in your hardware and software stack. At each layer, you need to consider:
- Better or newer components
- Configuration settings for the component
- Optimized programming
The good news is that optimizations to most of the layers are cumulative. You can optimize the hardware, the C++ software, and the network, and get a triple benefit.
Low Latency System Components
If you want to build a low latency system, here are some of the basic components in your stack. A single system may include:
- Hardware — CPU, GPU, FPGA, NPU, etc.
- Memory (RAM)
- Disk storage — e.g., SSD (NVMe)
- Network interface card (NIC)
The software stack looks like:
- Operating system kernel layer — Linux or bust.
- System software tools and services/daemons
- Compiler tools and system libraries
- Middleware software (e.g., Kafka)
- API/SDK clients (e.g., HFT exchange connectivity)
- Application software (your C++!)
Beyond the single system, there are various other system components:
- Network switch or router devices
- Network connections (e.g., wired, optical, microwave)
- Load balancer devices
- Backup storage devices
Combining Multithreading and SIMD CPU Instructions
You can double up! C++ multithreading can be combined with CPU SIMD instructions, stacking one optimization on top of another. It’s totally allowed, and you can even put it on your resume. The idea is basically this structure:
- Multithreading architecture — higher-level CPU parallelization.
- SIMD instructions — lower-level CPU vectorization.
Some of the main CPU architectures with SIMD parallelization include:
- AVX — x86 (e.g., Intel or AMD)
- ARM Neon — ARM chips (e.g., Apple Silicon on iOS/Mac, and most mobile devices)
Note that there are variants of each of these SIMD architectures, available on different chips. For example, x86 has SSE (128 bits), AVX and AVX-2 (256 bits), AVX-512 (you can figure it out), and the newer AVX-10 (a converged successor to AVX-512, still topping out at 512 bits).
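To make the double-up concrete, here’s a minimal sketch of both levels at once, assuming an x86-64 CPU with AVX2 (compile with something like g++ -O2 -mavx2 -pthread). Two C++ threads each sum half of an array, and each thread uses 256-bit AVX intrinsics internally:

    #include <immintrin.h>   // AVX2 intrinsics
    #include <cstddef>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Sum a slice of floats, 8 at a time, in 256-bit AVX registers.
    static float sum_slice_avx(const float* data, std::size_t n) {
        __m256 acc = _mm256_setzero_ps();
        std::size_t i = 0;
        for (; i + 8 <= n; i += 8)
            acc = _mm256_add_ps(acc, _mm256_loadu_ps(data + i));
        float lanes[8];
        _mm256_storeu_ps(lanes, acc);
        float total = 0.0f;
        for (float f : lanes) total += f;      // horizontal reduction
        for (; i < n; ++i) total += data[i];   // scalar tail
        return total;
    }

    int main() {
        std::vector<float> v(1000000, 1.0f);
        const std::size_t half = v.size() / 2;
        float a = 0.0f, b = 0.0f;
        std::thread t1([&] { a = sum_slice_avx(v.data(), half); });
        std::thread t2([&] { b = sum_slice_avx(v.data() + half, v.size() - half); });
        t1.join(); t2.join();
        std::printf("sum = %f\n", a + b);      // expect 1000000
    }

The thread split is the coarse-grained parallelism; the eight-floats-at-a-time loop is the fine-grained vectorization.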
Combining Multithreading and GPU Vectorization
If you’ve sold your car to buy a PC that has both a fast CPU and a high-end NVIDIA GPU, there’s good news to think about while you ride the bus: both chips run at the same time. (Wow, in parallel, even.)
In fact, there are “threads” on both the CPU and the GPU. However, C++ CPU threads are much higher-level than the CUDA C++ threads on the GPU. The idea is:
- CPU threads — big chunks of work.
- GPU threads — very granular computations.
CPU threads are not that granular, and you use them to do quite large chunks of work, not just one addition instruction. For example, you might have threads pulling incoming user requests off the queue, and a thread might handle the entire user request, perhaps launching some other threads on the CPU or GPU to do so.
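As a minimal sketch of this division of labor, in CUDA C++ (the GPU dialect of C++ mentioned above; assumes an NVIDIA GPU and the CUDA toolkit, compiled with something like nvcc -std=c++17), one coarse-grained CPU thread launches a GPU kernel, while the GPU runs one very granular thread per array element:

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <thread>

    // GPU side: each CUDA thread handles just one element.
    __global__ void vecAdd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b, *c;
        cudaMallocManaged(&a, n * sizeof(float));  // unified memory, for brevity
        cudaMallocManaged(&b, n * sizeof(float));
        cudaMallocManaged(&c, n * sizeof(float));
        for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

        // CPU side: one thread owns the whole "add two vectors" chunk of work.
        std::thread gpuWorker([&] {
            vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
            cudaDeviceSynchronize();
        });
        // ...the main CPU thread is free to do other work here...
        gpuWorker.join();
        std::printf("c[0] = %f\n", c[0]);          // expect 3.0
        cudaFree(a); cudaFree(b); cudaFree(c);
    }

One CPU thread, a million GPU threads: that’s the granularity gap in action.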
There are some parallels (haha) between coding CPU and GPU threads:
- Both types of threads have a call stack.
- Both have “global” or “shared” memory to use across threads.
- Overhead of thread launches and exits is a thing for both CPU and GPU threads.
Note that there’s also a new generation of “mini-GPUs” called Neural Processing Units (NPUs), which aren’t as powerful as a fully-fledged GPU. NPUs tend to be used in “AI Phones” and other “edge” devices, which aren’t as powerful as a PC. Most of the comments about combining C++ multithreading and GPU coding also apply to NPUs, just a little slower.
Going for the Triple-Double
You can even triple up your parallelism:
- Multithreading/multicore (CPU)
- SIMD instructions (CPU)
- GPU vectorization
Can you stack even more levels of parallelism into just one C++ program? Yes, of course:
- Linux processes (parallelism at a higher level).
- Networking communications (the NIC runs in parallel, too).
There are optimizations for those levels, too.
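The process level is easy to demo. Here’s a minimal sketch using the classic Linux fork() call, where each child process works independently of the parent:

    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        for (int child = 0; child < 2; ++child) {
            pid_t pid = fork();              // clone the whole process
            if (pid == 0) {                  // child branch
                std::printf("child %d (pid %d) doing its share\n",
                            child, (int)getpid());
                _exit(0);                    // child exits immediately
            }
        }
        while (wait(nullptr) > 0) { }        // parent reaps both children
        std::printf("parent: all children done\n");
    }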
Advanced Linux O/S Optimizations
It doesn’t end with the C++ code. There are other things you can optimize in the Linux O/S:
- Process priorities — be nice and turn yours up to eleven! (There’s a sketch after this list.)
- Linux system processes — turn off the various Linux system processes that you don’t need (so they don’t compete for CPU time).
- Kernel bypass — direct NIC manipulations.
- Overlap communications and compute — e.g., PCIe bus GPU-to-memory upload/download.
- Networking technologies — e.g., TcpDirect and Onload; RDMA.
- Linux kernel optimizations — e.g., network buffer settings; disable writes that update the “file access date” when reading a file.
- Linux system settings — ensure you don’t have accounting or security modes on.
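For the process-priority item above, here’s a minimal Linux-specific sketch that pins the current thread to one CPU core and requests real-time scheduling (compile with -pthread; the priority change typically needs root or the CAP_SYS_NICE capability):

    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>

    int main() {
        // Pin this thread to core 2 so the scheduler won't migrate it.
        cpu_set_t cpus;
        CPU_ZERO(&cpus);
        CPU_SET(2, &cpus);
        int rc = pthread_setaffinity_np(pthread_self(), sizeof(cpus), &cpus);
        if (rc != 0) std::fprintf(stderr, "setaffinity failed: %d\n", rc);

        // Request real-time FIFO scheduling at a high priority (1..99).
        sched_param sp{};
        sp.sched_priority = 80;
        rc = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
        if (rc != 0) std::fprintf(stderr, "setschedparam failed: %d (try root)\n", rc);

        std::printf("Pinned and prioritized; do latency-critical work here.\n");
    }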
There are also some other items on the advanced menu:
- Overclock your CPU (and the GPU)
- Buy a bigger box
- Get a faster SSD disk (e.g., NVMe)
- Assembly language
- Microwave communications
- FPGA
There’s always more, but I’ve run out of room in your web browser.
Serving and Deployment Optimizations
If your software has to do multiple things at once, such as talking to multiple people (users), or communicating with multiple stock trading platforms, then there are many system-level practicalities that affect latency.
If your low latency application is a public-facing consumer website, there are a lot of deployment issues to scale up to a lot of users. Some of the issues to consider in the whole end-to-end latency of a request going through a system include:
- DNS lookup time
- Connection handshake time
- SSL time
- Load balancing (there’s a sketch at the end of this section)
- Round-robin DNS
- Parallelization (multiple servers)
- Utility servers
- Caching (e.g., ETags)
- CDNs
- Database lookup time
- Database indexes
- Keep-warm server architectures
Building a low-latency system is more than just coding up some C++. You have to put together a bunch of off-the-shelf components.
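As a tiny illustration of the load-balancing and round-robin items above, here’s a hypothetical sketch of the core idea behind round-robin request distribution; the server addresses are made up, and real load balancers layer health checks and weighting on top:

    #include <array>
    #include <atomic>
    #include <cstddef>
    #include <cstdio>
    #include <string>

    class RoundRobinPool {
        std::array<std::string, 3> servers_{"10.0.0.1", "10.0.0.2", "10.0.0.3"};
        std::atomic<std::size_t> next_{0};
    public:
        // Thread-safe: fetch_add hands each caller the next server in turn.
        const std::string& pick() {
            return servers_[next_.fetch_add(1, std::memory_order_relaxed)
                            % servers_.size()];
        }
    };

    int main() {
        RoundRobinPool pool;
        for (int i = 0; i < 5; ++i)
            std::printf("request %d -> %s\n", i, pool.pick().c_str());
    }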
Network Optimization
If your algorithm has to talk between two computers, there’s a network in between. The time spent sending data across the wire and back is a key part of the latency, so a faster system needs optimized network traffic, too. The main techniques for network optimization include the following (a sketch of one classic tweak follows the list):
- Higher bandwidth network connections
- Advanced network protocols
- Compressing network data sizes
- Spreading bandwidth usage over time (avoiding peaks)
- Overlapping computation and communications
- Direct access to peripherals (local and remote)
- Direct access to memory (local and remote)
- Sticky sessions (keeps session data local)
- Sharing cache data between multiple servers
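As one concrete example, here’s a minimal Linux-sockets sketch of a classic low-latency tweak: disabling Nagle’s algorithm with TCP_NODELAY, so small messages go out immediately instead of being coalesced:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) { std::perror("socket"); return 1; }

        // Send small writes immediately rather than waiting to batch them.
        int one = 1;
        if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) != 0)
            std::perror("setsockopt(TCP_NODELAY)");

        // Bigger kernel send buffers can help absorb bursts (tune to taste).
        int bufsize = 1 << 20;  // 1 MB
        if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize)) != 0)
            std::perror("setsockopt(SO_SNDBUF)");

        close(fd);               // connect()/send() would follow in real code
    }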
There’s a whole book that needs to be written about network optimizations! Should be done by Tuesday.
References
These are some good articles on optimizing an entire AI LLM backend system:
- Character.AI, June 20, 2024, Optimizing AI Inference at Character.AI, https://research.character.ai/optimizing-inference/
- Apple, June 2024, Introducing Apple’s On-Device and Server Foundation Models, https://machinelearning.apple.com/research/introducing-apple-foundation-models
- Together AI, Nov 13, 2023, Announcing Together Inference Engine – the fastest inference available, https://www.together.ai/blog/together-inference-engine-v1
- Ryan Lucchese, Niki Birkner, Yaron Hagai, Virginia Adams, August 13, 2024, A practitioner’s guide to testing and running large GPU clusters for training generative AI models, Together AI, https://www.together.ai/blog/a-practitioners-guide-to-testing-and-running-large-gpu-clusters-for-training-generative-ai-models
And these are some references about entire HFT system optimizations:
- Larry Jones, Feb 27, 2025, Mastering Concurrency and Multithreading in C++: Unlock the Secrets of Expert-Level Skills, https://www.amazon.com.au/Mastering-Concurrency-Multithreading-Secrets-Expert-Level-ebook/dp/B0DYSB519C/
- Sebastien Donadio, Sourav Ghosh, Romain Rossier, June 17, 2022, Developing High-Frequency Trading Systems: Learn how to implement high-frequency trading from scratch with C++ or Java basics, https://www.amazon.com/Developing-High-Frequency-Trading-Systems-high-frequency-ebook/dp/B09ZV5L2T7/
- Irene Aldridge, April 2013, Wiley, High-Frequency Trading: A Practical Guide to Algorithmic Strategies and Trading Systems, https://www.amazon.com/High-Frequency-Trading-Practical-Algorithmic-Strategies-ebook/dp/B00B0H9S5K