Chapter 3. System Optimizations
Book Excerpt from "C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations"
by David Spuler, Ph.D.
Optimizing the Whole System
There are a lot of moving pieces in a whole low latency system. Optimizing them is an elegant dance, where each component plays a part. There’s no single answer to this, and it’s an ongoing process of continuous efficiency improvement.
Instead, you need to look at all the different components in your hardware and software stack. At each layer, you need to consider:
- Better or newer components
- Configuration settings for the component
- Optimized programming
The good news is that optimizations to most of the layers are cumulative. You can optimize the hardware, the C++ software, and the network, and get a triple benefit.
Low Latency System Components
If you want to build a low latency system, here are some of the basic components in your stack. A single system may include:
- Hardware — CPU, GPU, FPGA, NPU, etc.
- Memory (RAM)
- Disk storage — e.g., SSD (NVMe)
- Network interface card (NIC)
The software stack looks like:
- Operating system kernel layer — Linux or bust.
- System software tools and services/daemons
- Compiler tools and system libraries
- Middleware software (e.g., Kafka)
- API/SDK clients (e.g., HFT exchange connectivity)
- Application software (your C++!)
Beyond the single system, there are various other system components:
- Network switch or router devices
- Network connections (e.g., wired, optical, microwave)
- Load balancer devices
- Backup storage devices
Combining Multithreading and SIMD CPU Instructions
You can double up! C++ multithreading can be combined with CPU SIMD instructions, stacking one optimization on top of another. It’s totally allowed, and you can even put it on your resume. The idea is basically this structure:
- Multithreading architecture — higher-level CPU parallelization.
- SIMD instructions — lower-level CPU vectorization.
Some of the main CPU architectures with SIMD parallelization include:
- AVX — x86 (e.g., Intel or AMD)
- ARM Neon — ARM chips (e.g., Apple Silicon on iOS/Mac, and most mobile devices)
Note that there are variants of each of these SIMD architectures, available on different chips. For example, x86 has SSE (128 bits), AVX and AVX-2 (256 bits), AVX-512 (you can figure it out), and the newer AVX-10 (a converged successor to AVX-512, still topping out at 512 bits).
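To make the double-up concrete, here’s a minimal sketch of both levels at once, assuming an x86-64 CPU with AVX2 (compile with something like g++ -O2 -mavx2 -pthread). Two C++ threads each sum half of an array, and each thread uses 256-bit AVX intrinsics internally:

    #include <immintrin.h>   // AVX2 intrinsics
    #include <cstddef>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Sum a slice of floats, 8 at a time, in 256-bit AVX registers.
    static float sum_slice_avx(const float* data, std::size_t n) {
        __m256 acc = _mm256_setzero_ps();
        std::size_t i = 0;
        for (; i + 8 <= n; i += 8)
            acc = _mm256_add_ps(acc, _mm256_loadu_ps(data + i));
        float lanes[8];
        _mm256_storeu_ps(lanes, acc);
        float total = 0.0f;
        for (float f : lanes) total += f;      // horizontal reduction
        for (; i < n; ++i) total += data[i];   // scalar tail
        return total;
    }

    int main() {
        std::vector<float> v(1000000, 1.0f);
        const std::size_t half = v.size() / 2;
        float a = 0.0f, b = 0.0f;
        std::thread t1([&] { a = sum_slice_avx(v.data(), half); });
        std::thread t2([&] { b = sum_slice_avx(v.data() + half, v.size() - half); });
        t1.join(); t2.join();
        std::printf("sum = %f\n", a + b);      // expect 1000000
    }

The thread split is the coarse-grained parallelism; the eight-floats-at-a-time loop is the fine-grained vectorization.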
Combining Multithreading and GPU Vectorization
If you’ve sold your car to buy a PC that has both a fast CPU and a high-end NVIDIA GPU, there’s good news to think about while you ride the bus: both chips run at the same time. (Wow, in parallel, even.)
In fact, there are “threads” on both the CPU and the GPU. However, C++ CPU threads are much higher-level than the CUDA C++ threads on the GPU. The idea is:
- CPU threads — big chunks of work.
- GPU threads — very granular computations.
CPU threads are not that granular, and you use them to do quite large chunks of work, not just one addition instruction. For example, you might have threads pulling incoming user requests off the queue, and a thread might handle the entire user request, perhaps launching some other threads on the CPU or GPU to do so.
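As a minimal sketch of this division of labor, in CUDA C++ (the GPU dialect of C++ mentioned above; assumes an NVIDIA GPU and the CUDA toolkit, compiled with something like nvcc -std=c++17), one coarse-grained CPU thread launches a GPU kernel, while the GPU runs one very granular thread per array element:

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <thread>

    // GPU side: each CUDA thread handles just one element.
    __global__ void vecAdd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b, *c;
        cudaMallocManaged(&a, n * sizeof(float));  // unified memory, for brevity
        cudaMallocManaged(&b, n * sizeof(float));
        cudaMallocManaged(&c, n * sizeof(float));
        for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

        // CPU side: one thread owns the whole "add two vectors" chunk of work.
        std::thread gpuWorker([&] {
            vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
            cudaDeviceSynchronize();
        });
        // ...the main CPU thread is free to do other work here...
        gpuWorker.join();
        std::printf("c[0] = %f\n", c[0]);          // expect 3.0
        cudaFree(a); cudaFree(b); cudaFree(c);
    }

One CPU thread, a million GPU threads: that’s the granularity gap in action.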
There are some parallels (haha) between coding CPU and GPU threads:
- Both types of threads have a call stack.
- Both have “global” or “shared” memory to use across threads.
- Overhead of thread launches and exits is a thing for both CPU and GPU threads.
Note that there’s also a new generation of “mini-GPUs” called Neural Processing Units (NPUs), which aren’t as powerful as a fully-fledged GPU. NPUs tend to be used in “AI Phones” and other “edge” devices, which aren’t as powerful as a PC. Most of the comments about combining C++ multithreading and GPU coding also apply to NPUs, just a little slower.
Going for the Triple-Double
You can even triple up your parallelism:
- Multithreading/multicore (CPU)
- SIMD instructions (CPU)
- GPU vectorization
Can you stack even more levels of parallelism into just one C++ program? Yes, of course:
- Linux processes (parallelism at a higher level).
- Networking communications (the NIC runs in parallel, too).
There are optimizations for those levels, too.
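The process level is easy to demo. Here’s a minimal sketch using the classic Linux fork() call, where each child process works independently of the parent:

    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        for (int child = 0; child < 2; ++child) {
            pid_t pid = fork();              // clone the whole process
            if (pid == 0) {                  // child branch
                std::printf("child %d (pid %d) doing its share\n",
                            child, (int)getpid());
                _exit(0);                    // child exits immediately
            }
        }
        while (wait(nullptr) > 0) { }        // parent reaps both children
        std::printf("parent: all children done\n");
    }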
Advanced Linux O/S Optimizations
It doesn’t end with the C++ code. There are other things you can optimize in the Linux O/S:
- Process priorities — be nice and turn yours up to eleven! (There’s a sketch after this list.)
- Linux system processes — turn off the various Linux system processes that you don’t need (so they don’t compete for CPU time).
- Kernel bypass — direct NIC manipulations.
- Overlap communications and compute — e.g., PCIe bus GPU-to-memory upload/download.
- Networking technologies — e.g., TcpDirect and Onload; RDMA.
- Linux kernel optimizations — e.g., network buffer settings; disable writes that update the “file access date” when reading a file.
- Linux system settings — ensure you don’t have accounting or security modes on.
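For the process-priority item above, here’s a minimal Linux-specific sketch that pins the current thread to one CPU core and requests real-time scheduling (compile with -pthread; the priority change typically needs root or the CAP_SYS_NICE capability):

    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>

    int main() {
        // Pin this thread to core 2 so the scheduler won't migrate it.
        cpu_set_t cpus;
        CPU_ZERO(&cpus);
        CPU_SET(2, &cpus);
        int rc = pthread_setaffinity_np(pthread_self(), sizeof(cpus), &cpus);
        if (rc != 0) std::fprintf(stderr, "setaffinity failed: %d\n", rc);

        // Request real-time FIFO scheduling at a high priority (1..99).
        sched_param sp{};
        sp.sched_priority = 80;
        rc = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
        if (rc != 0) std::fprintf(stderr, "setschedparam failed: %d (try root)\n", rc);

        std::printf("Pinned and prioritized; do latency-critical work here.\n");
    }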
There are also some other items on the advanced menu:
- Overclock your CPU (and the GPU)
- Buy a bigger box
- Get a faster SSD disk (e.g., NVMe)
- Assembly language
- Microwave communications
- FPGA
There’s always more, but I’ve run out of room in your web browser.
Serving and Deployment Optimizations
If your software has to do multiple things at once, such as talking to multiple people (users), or communicating with multiple stock trading platforms, then there are many system-level practicalities that affect latency.
If your low latency application is a public-facing consumer website, there are a lot of deployment issues to scale up to a lot of users. Some of the issues to consider in the whole end-to-end latency of a request going through a system include:
- DNS lookup time
- Connection handshake time
- SSL time
- Load balancing (there’s a sketch at the end of this section)
- Round-robin DNS
- Parallelization (multiple servers)
- Utility servers
- Caching (e.g., ETags)
- CDNs
- Database lookup time
- Database indexes
- Keep-warm server architectures
Building a low-latency system is more than just coding up some C++. You have to put together a bunch of off-the-shelf components.
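As a tiny illustration of the load-balancing and round-robin items above, here’s a hypothetical sketch of the core idea behind round-robin request distribution; the server addresses are made up, and real load balancers layer health checks and weighting on top:

    #include <array>
    #include <atomic>
    #include <cstddef>
    #include <cstdio>
    #include <string>

    class RoundRobinPool {
        std::array<std::string, 3> servers_{"10.0.0.1", "10.0.0.2", "10.0.0.3"};
        std::atomic<std::size_t> next_{0};
    public:
        // Thread-safe: fetch_add hands each caller the next server in turn.
        const std::string& pick() {
            return servers_[next_.fetch_add(1, std::memory_order_relaxed)
                            % servers_.size()];
        }
    };

    int main() {
        RoundRobinPool pool;
        for (int i = 0; i < 5; ++i)
            std::printf("request %d -> %s\n", i, pool.pick().c_str());
    }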
Network Optimization
If your algorithm has to talk between two computers, there’s a network in between. The time spent sending data across the wire and back is a key part of the latency, so a faster system needs optimized network traffic, too. The main techniques for network optimization include the following (a sketch of one classic tweak follows the list):
- Higher bandwidth network connections
- Advanced network protocols
- Compressing network data sizes
- Spreading bandwidth usage over time (avoiding peaks)
- Overlapping computation and communications
- Direct access to peripherals (local and remote)
- Direct access to memory (local and remote)
- Sticky sessions (keeps session data local)
- Sharing cache data between multiple servers
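As one concrete example, here’s a minimal Linux-sockets sketch of a classic low-latency tweak: disabling Nagle’s algorithm with TCP_NODELAY, so small messages go out immediately instead of being coalesced:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) { std::perror("socket"); return 1; }

        // Send small writes immediately rather than waiting to batch them.
        int one = 1;
        if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) != 0)
            std::perror("setsockopt(TCP_NODELAY)");

        // Bigger kernel send buffers can help absorb bursts (tune to taste).
        int bufsize = 1 << 20;  // 1 MB
        if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize)) != 0)
            std::perror("setsockopt(SO_SNDBUF)");

        close(fd);               // connect()/send() would follow in real code
    }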
There’s a whole book that needs to be written about network optimizations! Should be done by Tuesday.
References
These are some good articles on optimizing an entire AI LLM backend system:
- Character.AI, June 20, 2024, Optimizing AI Inference at Character.AI, https://research.character.ai/optimizing-inference/
- Apple, June 2024, Introducing Apple’s On-Device and Server Foundation Models, https://machinelearning.apple.com/research/introducing-apple-foundation-models
- Together AI, Nov 13, 2023, Announcing Together Inference Engine – the fastest inference available, https://www.together.ai/blog/together-inference-engine-v1
- Ryan Lucchese, Niki Birkner, Yaron Hagai, Virginia Adams, August 13, 2024, A practitioner’s guide to testing and running large GPU clusters for training generative AI models, Together AI, https://www.together.ai/blog/a-practitioners-guide-to-testing-and-running-large-gpu-clusters-for-training-generative-ai-models
And these are some references about entire HFT system optimizations:
- Larry Jones, Feb 27, 2025, Mastering Concurrency and Multithreading in C++: Unlock the Secrets of Expert-Level Skills, https://www.amazon.com.au/Mastering-Concurrency-Multithreading-Secrets-Expert-Level-ebook/dp/B0DYSB519C/
- Sebastien Donadio, Sourav Ghosh, Romain Rossier, June 17, 2022, Developing High-Frequency Trading Systems: Learn how to implement high-frequency trading from scratch with C++ or Java basics, https://www.amazon.com/Developing-High-Frequency-Trading-Systems-high-frequency-ebook/dp/B09ZV5L2T7/
- Irene Aldridge, April 2013, Wiley, High-Frequency Trading: A Practical Guide to Algorithmic Strategies and Trading Systems, https://www.amazon.com/High-Frequency-Trading-Practical-Algorithmic-Strategies-ebook/dp/B00B0H9S5K