
Chapter 1. Parallel Programming

  • Book Excerpt from "CUDA C++ Optimization: Coding Faster GPU Kernels"
  • by David Spuler


Parallel Execution

The reason that GPUs make your code run fast is parallel execution. You need to understand the basics of parallel algorithms for two reasons:

    1. Understanding how the GPU is parallelizing computations, and

    2. Organizing your high-level algorithms for greater parallelism.

The first reason is mostly about one type of parallelism within the GPU. The second ideally combines multiple types of parallelism to maximize the overall speed of the whole algorithm.

CUDA offers one particular type of parallelism, which is often known as Single Instruction Multiple Data (SIMD). What SIMD really means is, for example, doing the same addition in parallel on 1,000 elements of a vector. The “single instruction” is the addition, and the “multiple data” is the many vector elements.

However, CUDA is not just SIMD; it is more accurately called Single Instruction Multiple Threads (SIMT). This is a lot like SIMD, but somewhat more generalized, because it isn’t just running a single arithmetic instruction over the data. Rather, each computation runs in a separate “thread,” in parallel and in lock-step with many other threads over the data. Each thread has a fully-fledged runtime environment, and they can all run in parallel under your control. As we’ll see, CUDA threads can actually be like mini-programs, which is much more powerful than SIMD.
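Here is a minimal sketch of the SIMT model (the kernel and variable names are illustrative, not from the book): each thread computes exactly one element of the result vector.

    // SIMT sketch: every thread adds one pair of vector elements.
    __global__ void vector_add(const float* a, const float* b, float* c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)                // guard threads that fall past the end of the data
            c[i] = a[i] + b[i];
    }

Launching this kernel with a million threads performs a million additions “at once,” which is the SIMD-style behavior, while each thread remains free to branch and compute independently, which is the extra SIMT generality.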

NVIDIA GPU Architectures

All of this is only possible from many years of GPU hardware optimizations. The state-of-the-art has continually advanced and recent GPUs are much more powerful than ones from only a few years ago. NVIDIA names its architectural generations for GPUs after famous individuals in Science, Mathematics, Physics, and Medicine:

  • Blackwell (B100, B200 in 2024/2025) — Elizabeth Blackwell was a famous American Scientist and Physician, and was the first woman to gain an official medical degree in the USA, in 1849.
  • Hopper (H100 in 2022) — Grace Hopper, an American Mathematician, Computer Scientist and U.S. Navy Rear Admiral, known for her pioneering work on compilers and her central influence on the COBOL programming language (1959).
  • Lovelace (L40 in 2022) — Ada Lovelace, an English Mathematician and early Computer Scientist in the 1800s, known for her work on the “analytical engine” of Charles Babbage, including foreseeing its applications beyond pure numeric calculation.
  • Ampere (A100 in 2020) — André-Marie Ampère, a famous French Physicist and Mathematician, known for electrical inventions such as the solenoid, and whose name is used as the “amps” unit for electrical current in modern day.
  • Turing (T4 in 2018) — Alan Turing, a famous British Mathematician, Computer Scientist and Cryptographer, known for laying the theoretical foundations of computing and for wartime code-breaking.
  • Volta (V100 in 2017) — Alessandro Volta, a famous Italian Physicist and pioneer of electrical science, whose name also denotes the unit of “volts” in modern usage.
  • Pascal (P100 in 2016) — Blaise Pascal, a famous French Mathematician, Physicist, and Philosopher known for mathematical contributions such as Pascal’s Triangle of binomial coefficients.
  • Maxwell (2014) — James Clerk Maxwell, a famous Scottish Mathematician and Physicist, known for the classical theory of electromagnetism.
  • Kepler (2012) — Johannes Kepler, a famous German astronomer and mathematician in the 1500s/1600s, known for his laws of planetary motion.
  • Fermi (2010) — Enrico Fermi, a famous Italian-American physicist and Nobel laureate, known for his work on nuclear physics in the 1900s.
  • Tesla (2006) — Nikola Tesla, a famous Serbian-American engineer, physicist, and inventor, known for his pioneering work in electricity in the late 1800s and early 1900s.

CUDA C++

CUDA is the software stack for NVIDIA GPUs, and CUDA C++ is the programming language used to program them. But CUDA isn’t just a programming environment; it’s a whole ecosystem of tools, libraries, and platforms for parallel programming. A monumental amount of work has gone into offering an amazing suite of C++ libraries for almost anything you could think of.

Optimizing CUDA C++ has some aspects that are the same as standard C++ efficiency techniques, but then there’s a whole gamut of extra techniques for fast parallelism and vectorization. The main advantage of CUDA C++ is that you can write these highly parallelized “kernels” to run on the GPU in a high-level language based on C++. This offers the benefits not only of speed, but also programmer productivity.

CUDA C++ uses a “dual model” of programming, whereby you write two programs inside the same C++ source code. There are two main environments that you need to code for:

  • Host code — runs on the CPU.
  • Device code — runs on the GPU.

Both types of code are written in fully high-level C++ statements. There are only a few differences that identify GPU code:

  • __global__ specifier on GPU kernel functions.
  • A non-standard <<<...>>> “triple chevron” syntax for launching these GPU kernels.

If not for these differences, you’d hardly notice you weren’t in standard C++. Hence, CUDA C++ leverages the experiences and capabilities of the C++ ecosystem, and then extends it to broad vectorization on the GPU.
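Here is a hedged sketch of the dual model in a single source file (the kernel name, sizes, and values are illustrative, not from the book). The host code allocates GPU memory, copies data across, launches the kernel with the triple chevron syntax, and copies the result back:

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cstdlib>

    // Device code: __global__ marks this function as a GPU kernel.
    __global__ void scale(float* data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= factor;
    }

    // Host code: runs on the CPU and orchestrates the GPU work.
    int main()
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float* h_data = (float*)malloc(bytes);
        for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

        float* d_data = nullptr;
        cudaMalloc(&d_data, bytes);                                  // allocate GPU memory
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // copy input to the GPU

        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        scale<<<blocks, threads>>>(d_data, 2.0f, n);                 // triple-chevron kernel launch
        cudaDeviceSynchronize();                                     // wait for the GPU to finish

        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // copy the result back
        printf("first element: %f\n", h_data[0]);                    // expect 2.0

        cudaFree(d_data);
        free(h_data);
        return 0;
    }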

In a typical application, the host code on the CPU is the “controller” of the overarching application algorithm, and the GPU is used as the grunt worker to smash through huge reams of data processing in its massively parallel hardware architecture. It’s hard for the poor CPU to keep up.

Porting CPU C++ to the GPU

Using CUDA C++ to unleash the power of the GPU onto your sequential C++ code is a great way to improve its performance. The first point to note: keep the CPU code!

The CPU code is retained when porting to CUDA C++. The overall application is still controlled by the CPU code (called the “host”), and the tighter, computationally expensive areas are sent to the GPU (called the “device”). Hence, the logical flow of your application still follows the same sequence at a high level.

Also, the language differences between CUDA C++ and standard C++ are minimal. As a practical matter, the CUDA C++ compiler (nvcc) still uses the CPU’s C++ compiler (e.g., GCC) for compiling the non-GPU code on the CPU platform, and only does special compilation of the GPU code (called “kernels”).
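As a simple illustration (the file name and architecture flag are hypothetical placeholders), a mixed host/device source file is typically compiled in one step:

    # nvcc forwards the host code to the host C++ compiler and compiles the kernels itself.
    nvcc -O3 -arch=sm_80 myapp.cu -o myapp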

Hence, it’s not a rewrite (sigh). Instead, you have to trawl through your CPU code to find parallelization opportunities. Some general pointers on where to start:

  • Profile your CPU code to identify busy areas (as if you didn’t already know!)
  • Look for highly parallelizable algorithms (e.g., vectorizable, linear, 2D data tables, etc.)
  • Unrolled loops are often a good indicator that you were trying to vectorize a busy loop. Roll them back up and build a CUDA kernel instead (see the sketch after this list).
  • Sharding of data tables often indicates an area that can be parallelized.
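
As a rough sketch of rolling an unrolled loop back up (the function names are illustrative only), the CPU version below squeezes out instruction-level parallelism by hand, while the GPU version simply gives each iteration to its own thread:

    // CPU version: loop manually unrolled by 4 (remainder handling omitted for brevity).
    void square_cpu(float* x, int n)
    {
        for (int i = 0; i + 3 < n; i += 4) {
            x[i]   *= x[i];
            x[i+1] *= x[i+1];
            x[i+2] *= x[i+2];
            x[i+3] *= x[i+3];
        }
    }

    // GPU version: the loop is rolled back up; each thread handles one element.
    __global__ void square_gpu(float* x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            x[i] *= x[i];
    }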

Some of the pitfalls of porting sequential code to CUDA C++:

  • Multithreaded CPU code does not map one-to-one at a high level onto CUDA kernels. Don’t try to port a huge high-level C++ block into a huge GPU kernel.
  • Data transfer costs in partially ported CPU code can be a bottleneck, if the CPU and GPU have to keep sending computed data back-and-forth (in this case, finish the porting job!)

Overall, it should be an enjoyable and mind-expanding task to port most of your sequential C++ code to CUDA C++ on the GPU.

Optimizing CUDA C++

Optimizing your code is the main topic of this book. At a high level, some of the most important decisions in this regard include:

  • Measure code performance with the CUDA profiler tools.
  • Use the already-written CUDA library functions, as they’re very fast (see the sketch after this list).
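
For instance, here is a hedged sketch of leaning on an existing library rather than writing your own kernel: Thrust, which ships with the CUDA toolkit, provides a ready-made parallel reduction. The sizes and values are illustrative only.

    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>
    #include <cstdio>

    int main()
    {
        // A million elements on the GPU, all initialized to 1.0f.
        thrust::device_vector<float> d_vec(1 << 20, 1.0f);

        // Library-provided parallel reduction; no hand-written kernel needed.
        float sum = thrust::reduce(d_vec.begin(), d_vec.end(), 0.0f);

        printf("sum = %f\n", sum);  // expect 1048576.0
        return 0;
    }

The profilers are likewise run as tools rather than code: Nsight Systems (nsys) for whole-application timelines and Nsight Compute (ncu) for individual kernels.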

To focus your research on the types of CUDA C++ optimizations to consider, the main points are:

  • Parallelization is the main focus of code optimization for GPUs, affecting everything from the top-level task parallelism to the low-level GPU threads.
  • Memory access and data transfer optimizations need to be considered in GPU applications, more so than in standard C++ code, because there’s simply a lot more data flying around (see the sketch after this list).
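
As one hedged sketch of the data transfer point (buffer names and sizes are illustrative), pinned host memory and asynchronous copies on a stream let transfers overlap with other work instead of stalling the whole pipeline:

    #include <cuda_runtime.h>

    int main()
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        // Pinned (page-locked) host memory enables faster, truly asynchronous transfers.
        float* h_buf = nullptr;
        cudaMallocHost(&h_buf, bytes);

        float* d_buf = nullptr;
        cudaMalloc(&d_buf, bytes);

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Asynchronous copy: the CPU can queue kernels while the transfer is in flight.
        cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);

        // ... launch kernels on the same stream here ...

        cudaStreamSynchronize(stream);   // wait for all queued work to finish

        cudaStreamDestroy(stream);
        cudaFree(d_buf);
        cudaFreeHost(h_buf);
        return 0;
    }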

There are multiple chapters in this book on each of those areas. You should be done soon if you speed-read over the weekend!

And it’s not just the software. There are other layers of optimizations that are very important, perhaps more so than the software:

  • GPU hardware — more capable GPUs, and configuring them optimally.
  • Networking technologies — high-volume switches and cabling are at a premium.

Happy optimizing!

 
