Aussie AI

Chapter 3. CUDA for C++ Programmers

  • Book Excerpt from "CUDA C++ Debugging: Safer GPU Kernel Programming"
  • by David Spuler

Basics of CUDA C++ Programming

CUDA C++ is similar to C++, but with many extensions and idiosyncrasies. Here are some of the salient differences.

Parallel programming C++ features. These capabilities are the main superpower of CUDA and the reason it exists. The idea is to take SIMD to the extreme, and send the same computations to the GPU in massive groups (e.g., 10,000+ operations in parallel). This requires not just a syntax change, with GPU function “__global__” declarations and triple-angle-brackets for “<<<...>>>” invocations, but an entirely new way of thinking about how to optimize the algorithms.
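
For example, here is a minimal sketch of what that looks like in practice (the kernel name add_one and the sizes are purely illustrative):

    // Device code: __global__ marks this as a GPU kernel.
    __global__ void add_one(int *data)
    {
        data[threadIdx.x] += 1;    // each GPU thread handles one array element
    }

    int main()    // host code: runs on the CPU
    {
        const int n = 256;
        int *d_data = NULL;
        cudaMalloc(&d_data, n * sizeof(int));     // allocate GPU memory
        cudaMemset(d_data, 0, n * sizeof(int));   // zero it
        add_one<<<1, n>>>(d_data);                // triple-angle-bracket launch: 1 block of 256 threads
        cudaDeviceSynchronize();                  // wait for the GPU to finish
        cudaFree(d_data);
        return 0;
    }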

Filename suffixes. Most programs in CUDA C++ are written with “.cu” as the filename suffix for source code. CUDA header files often use “.cuh” but can also simply use “.h” or “.hpp” rather than a CUDA-specific filename. The intermediate PTX assembly files, created by nvcc, have a “.ptx” suffix, if you like that low-level kind of programming.

CUDA Development Tools. Some of the development tools include:

  • nvcc — NVIDIA C++ compiler.
  • cuda-gdb — CUDA’s gdb-based debugger.
  • compute-sanitizer — CUDA’s memory-checker (like valgrind) and three other error detection tools: racecheck, synccheck, and initcheck.
  • ncu — Nsight Compute command-line (CLI) performance profiler.
  • nvprof — NVIDIA performance profiler (although deprecated).

There are also various tools with graphical interfaces and extensive IDE integration. This offers many ways to be productive when coding CUDA C++.

Differences from Standard C++

By now you’ve probably noticed that CUDA C++ programming is a lot like C++ programming, but with some extra stuff. The main things are:

  • Extra #include directives for CUDA header files.
  • __global__ specifier (equivalently, __device__ means device-only).
  • <<< blocks, threads_per_block >>> kernel launch syntax.

A lot of standard C++ code can be run on the GPU in the way that you’d expect, such as:

  • C++ operators
  • Expressions
  • if statements
  • Loops
  • switch statements
  • Types
  • Local variables
  • Assignments
  • printf output

I don’t know what you think, but I find this quite weird! When I was learning CUDA, I expected it to be launching special SIMD instructions and intrinsic function calls to do vector operations. But, no, it’s more like just normal C++ programming, which makes it much easier to learn. They must have some very smart compiler design engineers at NVIDIA working on the CUDA Toolkit (and obviously a lot of brainy hardware engineers there, too).
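
As a small illustration, here is a made-up kernel that does nothing but run ordinary C++ constructs (a loop, an if statement, local variables, and printf) on the GPU:

    #include <stdio.h>    // needed for printf, even in device code

    __global__ void demo_kernel(int n)
    {
        int total = 0;                    // ordinary local variable, one per thread
        for (int i = 0; i < n; i++) {     // ordinary C++ loop
            if (i % 2 == 0)               // ordinary if statement
                total += i;
        }
        printf("Thread %d: total=%d\n", threadIdx.x, total);   // device-side printf
    }

    int main()
    {
        demo_kernel<<<1, 4>>>(10);     // 4 GPU threads each run the same C++ code
        cudaDeviceSynchronize();       // wait so the device printf output is flushed
        return 0;
    }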

Auto-included CUDA header files. An interesting and also rather pleasant improvement to the CUDA C++ programming environment is that CUDA doesn’t need many #include directives. Your CUDA C++ “hello world” program doesn’t need to actually include <cuda_runtime.h> or various others, because nvcc does it auto-magically for you, if it’s processing a “.cu” file. You can call CUDA C++ APIs like cudaMemcpy and cudaMalloc without an explicit header file include.

This is not a hidden Easter egg that AI programmers whisper about, but an officially documented feature. The CUDA C++ programming guide explicitly says that nvcc “implicitly includes cuda_runtime.h” at the top of the source file.

Really, wouldn’t it be nice if every C++ compiler did this? Why do we need this boilerplate at the top of every C++ program, when the compiler could almost always guess which header files we want when it sees the functions we’ve called?

Unfortunately, even nvcc doesn’t guess header files for non-CUDA code. If you want to call printf, you still have to include <stdio.h> (or <cstdio>). And various CUDA add-on libraries still need to be explicitly included.
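
For example, this minimal “.cu” file compiles with nvcc even though it has no CUDA-specific #include lines (only <stdio.h> for printf):

    #include <stdio.h>    // still needed for printf
    // Note: no #include <cuda_runtime.h> -- nvcc adds it implicitly for .cu files.

    int main()
    {
        float *d_buf = NULL;
        cudaMalloc(&d_buf, 1024 * sizeof(float));    // CUDA API call, no explicit header
        printf("Device pointer: %p\n", (void *)d_buf);
        cudaFree(d_buf);
        return 0;
    }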

Kernel Limitations. Although the GPU code does look like ordinary C++, there are some important limitations on the device “kernel” functions that run on the GPU.

  • No return type — kernel functions are declared void for a reason.
  • No pass-by-reference — don’t use & parameters for kernels.
  • main cannot be device code — the program always starts in the CPU host code.

There are quite a lot of other limitations for kernel code on the GPU, but we’ll get to them later.
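
For example, a kernel that respects the rules above returns void and takes pointers rather than references (the names here are illustrative):

    // OK: void return type, pointer and value parameters only.
    __global__ void scale(float *data, int n, float factor)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;    // results go back via the pointer, not a return value
    }

    // NOT OK: a kernel cannot return a value or take a reference parameter.
    // __global__ float bad_kernel(float &x);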

However, the host code is not restricted! There are far fewer C++ limitations on the host code, because it runs on the CPU. In fact, nvcc uses the underlying platform’s C++ compiler, such as GCC, for the host code, so a lot more is possible on the CPU side.

CUDA Dual Programming Model

To create a CUDA program, you need parts that run on the CPU, and parts that run on the GPU. You’re probably fairly familiar with how to compile code to run on a CPU, and the CUDA C++ program is very similar.

But how do you write code for the GPU?

The answer couldn’t be simpler: you just write C++ in the same file. CUDA has a “dual” programming model in its C++ files (by which I mean its “.cu” files). The two parts are:

  • Host code — runs on the CPU
  • Device code — runs on the GPU

Host code versus device code. Your CUDA C++ code specifies both of these two distinct types of code. Host code runs on the CPU of the computer that “hosts” the GPU (or GPUs), and is intended to prepare data for the GPU, process the computed results, and other such high-level tasks. Device code is the optimized low-level code that actually runs on the GPU in a massively parallel SIMD manner, and it’s typically called a “kernel.”

How does the compiler know which is which? The short answer is:

  • CPU host code — the default meaning of ordinary C++ functions.
  • GPU device code — extra __global__ keyword (with double underscores).

The only other trick is when the CPU code launches a kernel on the GPU code. It’s like a function call, but it’s called a “launch” and it uses a special triple-chevron syntax. A simple example of a “CPU-to-GPU” execution launch of a GPU device function would look like this:

    my_gpu_kernel<<<1,1>>>(parameters);

The two numbers inside the angled brackets are the number of blocks to launch, and how many threads-per-block. Note that more advanced launches can have up to four launch-configuration arguments, as shown below.
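
In full, the four launch-configuration arguments are the blocks, the threads-per-block, the dynamic shared memory size in bytes, and the CUDA stream to use (a sketch; the extra variable names are illustrative):

    // Full launch configuration: blocks, threads-per-block,
    // dynamic shared memory bytes, and the stream to run on.
    my_gpu_kernel<<<num_blocks, threads_per_block, shared_bytes, stream>>>(parameters);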

In more detail, a normal function definition without any extra keywords is for the CPU, whereas the device code for the GPU has a new CUDA-specific keyword, the “__global__” specifier, on the C++ function declaration. One keyword is all that’s needed to tell the compiler to run a function on the GPU rather than the CPU.

There are also other specifiers: “__device__” and “__host__.” They can be useful to indicate functions that run on either the GPU device or the CPU host, and can be combined if both are true (e.g., a low-level utility function). However, note that “__device__” GPU functions cannot be launched or called from host code, so only “__global__” is used for kernel entry points. The meaning of “global” is effectively that control flow can cross over from the CPU host to the device GPU kernel function.

CUDA’s mixing of host code and device code together is sometimes called “heterogeneous computing.” A single CUDA C++ source file can contain code for both the host CPU and the GPU device. The host code is more like the usual type of non-GPU programming and uses mostly the standard C++ features. The device code typically runs across multiple GPU “threads” in parallel, uses a combination of basic C++ syntax with various SIMD builtin extensions, and you have to think in “vectorized logic” to write these device functions.

CUDA Programming Control Flow Model

When I was first learning CUDA, I thought it would be very focused on SIMD operations. What I mean is that adding two vectors would be done by uploading the data for both vectors, and then sending an opcode for “add” so that the GPU would do a parallel SIMD addition on the vectors. In other words, I thought it would be somewhat “declarative” or like assembly language or similar to AVX SIMD instructions on x86 chips.

Not at all!

The CUDA programming model is almost like a full CPU-based computing environment, multiplied by a thousand, running on the GPU. It works like having a thousand mini-programs running in multiple threads on the GPU. The top-level features include:

  • Instructions (i.e., GPU-specific machine-code)
  • Data
  • Storage (for computed results)

You write an entire C++ function for each kernel computation, and then this function gets run in parallel across lots of “threads” on the GPU. In particular, each CUDA thread has its own versions of:

  • Program counter (instruction pointer)
  • Function call stack
  • Variables (on the “stack” or in “registers” underneath)
  • Local memory (e.g., for local variables)

You write your kernel with almost the full capabilities of C++. For example, some of the basic stuff you can use includes:

  • Sequences of statements
  • If statements
  • Loops
  • Arithmetic expressions
  • Parameter passing
  • Variables

There are some limitations, however. For example, recursion and overloaded operator functions are restricted, and template usage is somewhat limited.

Turing completeness. If you like your obscure Computer Science theory (and who can honestly say they don’t), you can see that this covers all of the three key control flow capabilities:

  • Sequence
  • Selection
  • Iteration

And the fourth requirement, data storage, is also covered by variables and various layers of memory. This means that GPU threads are “Turing complete” computation models. Hence, the GPU runs your CUDA kernel code almost like a thousand tiny fully-complete computers, all running the same code in parallel. Declarative that!

GPU Program Flow

A typical CUDA program has a conceptual sequence something like this:

  • Initialization
  • Copy data from CPU memory to the GPU
  • Launch the GPU kernel (execute it)
  • Copy the results back from GPU memory to the CPU
  • Cleanup

Let’s analyze these steps in more detail.

Initialization. The program initialization in the host code may have all the usual program initialization, but it also usually has one more step: allocating memory on the GPU. This is done via the cudaMalloc function. Addresses from malloc refer to host memory and are called “host pointers.” Similarly, the pointer returned by cudaMalloc points into GPU device memory and is called a “device pointer.”

Copy to GPU memory. Copying data between the host memory and the GPU memory uses the cudaMemcpy function, with its direction parameter set to cudaMemcpyHostToDevice. This runs in the host code, but affects the memory on the device. Note that a device pointer can be stored and passed around in the host code (even though it cannot be dereferenced there), so the host code can allocate, free, and copy GPU memory.

Kernel launch. The host code that does the launching of the kernel uses the <<<...>>> triple-angle-bracket syntax. This starts the code running in the GPU. The kernel code itself is also defined in the CUDA C++ program, as a function with a “__global__” specifier.

Copying data from GPU memory. Copying the result data back from the GPU uses the cudaMemcpy function again, but with a twist. This function has an extra parameter that specifies whether to perform a host-to-GPU or GPU-to-host memory copy operation. Hence, the reverse copy just uses the other parameter, via cudaMemcpyDeviceToHost rather than cudaMemcpyHostToDevice.

Cleanup. The final program cleanup code is all of the standard program-ending logic. One final step may be to call free for host pointers, and cudaFree to release any device pointers with addresses of GPU memory objects, thereby avoiding memory leaks in either host or device memory.
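
Putting all five steps together, a skeleton program looks something like this (a sketch only, with error checking omitted and illustrative names):

    #include <stdio.h>
    #include <stdlib.h>

    __global__ void square(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= data[i];
    }

    int main()
    {
        const int n = 1024;
        const size_t bytes = n * sizeof(float);

        // Initialization: host and device allocations
        float *h_data = (float *)malloc(bytes);            // host pointer
        for (int i = 0; i < n; i++) h_data[i] = (float)i;
        float *d_data = NULL;
        cudaMalloc(&d_data, bytes);                        // device pointer

        // Copy data from CPU memory to the GPU
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

        // Launch the GPU kernel
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        square<<<blocks, threads>>>(d_data, n);

        // Copy the results back from GPU memory to the CPU
        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
        printf("h_data[3] = %f\n", h_data[3]);             // expect 9.0

        // Cleanup
        cudaFree(d_data);    // release device memory
        free(h_data);        // release host memory
        return 0;
    }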

CUDA Parallel Execution Model

CUDA has various layers of parallelism, some of which map to hardware components in NVIDIA GPUs, and some are more of a software abstraction. These are relevant to the GPU portion of the C++ code, i.e., the device code. This model specifies how many parallel invocations of the device code get launched for your kernel.

Threads. A thread is the lowest level of compute execution. CUDA threads are more of a software abstraction than a direct mapping to hardware.

Blocks. Multiple threads are organized into “blocks” of combined execution. Each block has a fixed number of threads.

Grids. The “grid” is the total span of all the blocks, which contain all the threads. Since threads-per-block is a fixed number (fixed for each kernel launch, though not the same across launches), the structure of all the blocks is somewhat “rectangular” in shape.

Streaming Multiprocessors (SMs). The streaming multiprocessors, sometimes just called “multiprocessors” or “SMs,” are the top-level execution units on a GPU. There are relatively few of them, and execution of a grid (i.e., multiple blocks of threads) is scheduled onto the SMs, with each block running on a single SM and the blocks of a grid usually spread across multiple SMs on the same GPU.

These are the four main conceptual structures: threads in blocks in a grid in a “multiprocessor” (i.e., SM). However, there are some other terms used.

Warps. A warp is a group of threads, usually 32 threads on NVIDIA chips. Blocks are actually organized into warps, each of 32 threads, so warps are a structure that sits awkwardly between threads and blocks in size.

Clusters. NVIDIA’s H100 (Hopper) chips introduced another level in the thread hierarchy: thread block clusters. These allow cooperative programming across threads that span different blocks within the same cluster.

Why do we care about all this hardware stuff? In some sense, we don’t care that much about these abstractions of the GPU hardware layers when programming CUDA, since our C++ only does a small amount of logic related to them. A lot of the issues of scheduling execution across different threads, blocks, and cores are hidden from us by the CUDA C++ compiler. However, there are some reasons to pay attention.

Thread computations. The main aspect of CUDA coding is to write the C++ function for each thread (i.e., each invocation of a “kernel” in a separate thread), and we only care about blocks, grids, and SMs because we want to be sure that enough threads are launched to perform all of our computations in parallel. Hence, every CUDA kernel launch involves some arithmetic computations about blocks, warps, and threads. We need enough for full parallelism!
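
The standard arithmetic is a rounded-up division, so that there are at least as many threads as data elements (a sketch with illustrative names):

    int threads_per_block = 256;                                         // a common choice
    int num_blocks = (n + threads_per_block - 1) / threads_per_block;    // round up, don't truncate
    my_kernel<<<num_blocks, threads_per_block>>>(d_data, n);             // launches at least n threads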

Sharing data. Another reason arises when transferring data between different parts of our CUDA code. There are various different levels of memory and caches in a GPU. Some of these memory structures are limited to within a warp, within a block or within an SM. Hence, if we want our algorithm to share intermediate results across different threads running different parts of the kernel, and we want to use the fastest type of memory to achieve this, then we have to pay attention to which threads can access which data from which other threads, via shared memory and memory caches.

Features of CUDA C++ Programming

Non-blocking asynchronous kernel calling. When the host code calls a GPU kernel (e.g., a function declared as __global__), the invocation via the <<<...>>> syntax does not block and wait. It runs asynchronously, launching the GPU kernel, but continuing the execution of the host code immediately after the call. Hence, it will return before the results are available from the GPU kernel.

Maybe you want to wait until the results are available from the GPU? One way to make the host code block to await the completion of a kernel is the cudaDeviceSynchronize API, which blocks the host code on the CPU until all previously launched GPU work has completed. This is a useful safety catch, but can also be a performance slug if you’re needlessly waiting.
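
A typical pattern looks like this (a sketch, reusing the illustrative names from earlier):

    my_gpu_kernel<<<num_blocks, threads_per_block>>>(parameters);   // returns immediately
    cudaDeviceSynchronize();    // block the CPU here until all prior GPU work completes
    // ...now it is safe to inspect or copy the results...

Note that an ordinary synchronous cudaMemcpy of the results on the default stream also waits for earlier kernels to finish, which is why simple programs often appear correct without an explicit synchronize.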

Unified memory model. CUDA allows programmers to use a “unified memory model” whereby the same block of memory is available to both host and device code. The same memory address space is abstracted so that both the CPU code and the GPU kernels can access the same memory. This simplifies some aspects of sharing data between the main program and the GPU acceleration kernels. The same memory can even be shared across multiple GPUs, but that’s jumping ahead a little bit.
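
Here is a small sketch of managed (unified) memory in action, using cudaMallocManaged so that both the CPU and the GPU dereference the same pointer (the kernel name is illustrative):

    #include <stdio.h>

    __global__ void triple(int *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 3;
    }

    int main()
    {
        const int n = 16;
        int *data = NULL;
        cudaMallocManaged(&data, n * sizeof(int));   // one pointer, valid on CPU and GPU
        for (int i = 0; i < n; i++) data[i] = i;     // host writes directly
        triple<<<1, n>>>(data, n);                   // device updates the same memory
        cudaDeviceSynchronize();                     // wait before the host reads it again
        printf("data[5] = %d\n", data[5]);           // prints 15, with no cudaMemcpy at all
        cudaFree(data);
        return 0;
    }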

Device memory management. The GPU memory can be managed via the host code using builtin functions. The main builtin functions for managing GPU device memory are:

  • cudaMalloc — allocate GPU memory (equivalent to malloc).
  • cudaFree — de-allocate GPU memory (equivalent to free).
  • cudaMemcpy — copy bytes between host and device memory, or within device memory (i.e., GPU memcpy).
  • cudaMemset — set all bytes to the same value (i.e., GPU memset).

There isn’t a cudaCalloc function to zero the memory, but you can combine cudaMalloc with cudaMemset to create your own.
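
For example, a tiny wrapper along these lines gives calloc-like behavior (a sketch; the name cudaCallocLike is made up for illustration):

    // Hypothetical helper: allocate GPU memory and zero it, calloc-style.
    cudaError_t cudaCallocLike(void **devptr, size_t bytes)
    {
        cudaError_t err = cudaMalloc(devptr, bytes);    // allocate device memory
        if (err != cudaSuccess) return err;
        return cudaMemset(*devptr, 0, bytes);           // set every byte to zero
    }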

Memory transfer costs. An important point in using CUDA code for AI engines is that various Transformer inference algorithms are memory-bound, rather than compute-bound. Generally speaking, for inference tasks, the initial “prefill” phase (or “prompt processing”) before the first token is emitted is compute-bound (i.e., a very busy GPU), whereas the subsequent decoding phase of token-by-token generation (i.e., “autoregressive decoding”) is memory-bound. Hence, the cost of transferring data between the different memory cache levels, or sending data up to the GPU, or pulling the results down from the GPU to the CPU, can be a bottleneck. Although the unified memory model is very convenient, it hides a lot of the data transfers between the CPU and GPU code, which must be optimized for faster AI kernels.

CUDA C++ Syntax

CUDA C++ is an extension of C++ syntax, and many features are the same. The CUDA extensions are many, mostly aimed at parallel programming support. However, CUDA lags in the adoption of some of the advanced standard features, so not everything is available.

Comments. Comments are supported via the “//” single-line and /*...*/ multi-line C++ comment styles. As with C++, the /*...*/ comments do not nest.

Host code versus device code. The syntax is slightly different for the (non-GPU) “host code” versus the GPU-executed “device code” in a CUDA program. Device code for the GPU is specified via two syntax differences:

    (a) “__global__” specifier, which declares a GPU-executable function, and

    (b) “<<<...>>>” triple-angle-bracket syntax, which is akin to calling the GPU function (with parameters).

Starting execution on the GPU is conceptually more involved than a function call in a sequential C++ program, but the invocation of a kernel on the GPU is the effect. The “__global__” specifier allows the function to be called not just from the host, but also (via dynamic parallelism) from the GPU itself (i.e., from the host or the device). There is also “__device__” for a GPU function only callable from the GPU, and “__host__” for a non-GPU host-executed function only callable from the host program.

Note that each of these function types runs on either the host or the device, but not both. However, you can declare a function as both “__device__” and “__host__” and there’s a preprocessor macro “__CUDA_ARCH__” which can be used to define different blocks of code that execute on the host versus the GPU, or indeed for different types of GPU architectures.
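
For example, one function can be compiled for both sides, with __CUDA_ARCH__ selecting code per target (a sketch, with an illustrative function name):

    #include <stdio.h>

    // Compiled twice by nvcc: once for the host, once for the device.
    __host__ __device__ void where_am_i(void)
    {
    #ifdef __CUDA_ARCH__
        printf("Running on the GPU (arch %d)\n", __CUDA_ARCH__);   // device compilation pass
    #else
        printf("Running on the CPU host\n");                       // host compilation pass
    #endif
    }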

The default meaning for a function without any specifier is the same as “__host__” where the function only runs on the CPU (not the GPU) and can only be called from the host code. Hence, you don’t usually need to use any of the specifiers except for the tight GPU kernel code.

inline functions. There are extra specifiers that control inlining optimization of functions (a one-line example follows this list):

  • __forceinline__
  • __noinline__
  • __inline_hint__
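
For example (the helper function is illustrative):

    // A small device helper marked for aggressive inlining.
    __device__ __forceinline__ float squared(float x) { return x * x; }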

Builtin variables. There are various builtin variables or constants that are available to device programs (a small indexing example follows this list).

  • threadIdx — thread index in a block
  • blockIdx — block index in a grid
  • gridDim — grid dimensions (blocks-per-grid)
  • blockDim — block dimensions (threads-per-block)
  • warpSize — size of a warp (how many threads; usually 32)
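
The classic use of these builtins is to compute a unique global index for each thread (a sketch of the standard idiom, with illustrative names):

    __global__ void my_kernel(float *data, int n)
    {
        // Standard idiom: unique global index for this thread across all blocks
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {               // bounds check: the last block may have spare threads
            data[i] = 2.0f * data[i];
        }
    }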

Memory address specifiers. The CUDA memory model has an extended, shared address space. There are various C++ specifiers that can be applied to variables or addresses:

  • __device__
  • __constant__
  • __shared__
  • __grid_constant__
  • __managed__

Pointer Specifiers. The CUDA language supports various extended specifiers for pointers (see the example signature after this list):

  • __restrict__ for non-aliased restricted pointers, which lets the compiler optimize memory accesses more aggressively.
  • const can be used at two levels in pointer declarations.
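
For example (an illustrative kernel signature using both):

    // const input through non-aliased (__restrict__) pointers helps the compiler
    // cache and reorder memory accesses more aggressively.
    __global__ void copy_array(const float *__restrict__ in, float *__restrict__ out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }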

Thread synchronization functions. There are various ways to synchronize parallel execution of multiple threads (a small shared-memory example follows this list). The builtin functions include:

  • cudaDeviceSynchronize()
  • __syncthreads()
  • __syncwarp()
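
Here is a small sketch of __syncthreads() guarding block-level shared memory (it assumes a launch of exactly one block of 256 threads, e.g., reverse_block<<<1, 256>>>(d_data)):

    __global__ void reverse_block(int *data)
    {
        __shared__ int tmp[256];             // shared memory: visible to the whole block
        int t = threadIdx.x;
        tmp[t] = data[t];                    // each thread writes one element
        __syncthreads();                     // wait until every thread in the block has written
        data[t] = tmp[blockDim.x - 1 - t];   // now safe to read other threads' elements
    }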

Device time functions. These functions are GPU equivalents of the standard C++ clock timing functions (a small sketch follows the list):

  • clock()
  • clock64()
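
For example, a rough device-side timing sketch (these count GPU clock cycles, not wall-clock time; the kernel is illustrative):

    __global__ void timed_kernel(long long *elapsed)
    {
        long long start = clock64();                 // per-SM cycle counter
        // ...the work being timed would go here...
        elapsed[threadIdx.x] = clock64() - start;    // cycles elapsed for this thread
    }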

Kernel Function Limitations

There are a number of limitations when writing GPU kernels in CUDA C++. Some limitations apply to the initial kernel launch and some apply more generally to any code running on the GPU. Note that the GPU-executed device portion of your C++ code is the functions with __global__ or __device__ specifiers.

Some of the limitations of kernel launches and device functions include:

  • Stack memory size is limited to 32K (for function parameters and local variables).
  • Pass-by-reference disallowed for kernel launches.
  • Variable-argument functions disallowed in kernel launches.
  • No copy constructor calls for kernel launch arguments — a bitwise copy applies to object parameters.
  • static local variables disallowed in any kernel functions (use shared memory instead).
  • Global variables not available in the normal sense (use global memory, constant memory, or shared memory instead).

Some other capabilities are somewhat limited in device code:

  • Function pointers
  • Recursion (not recommended anyway!)
  • friend functions
  • operator functions
  • template usage (some limits, but also powerful).

That’s not even the full list, but there are far more C++ features that are supported, compared to these restrictions (e.g., basic operators, mathematical functions, control flow, etc.). Overall, these limitations are not central to coding up an algorithm in CUDA’s version of SIMD parallelism on a GPU. Most of these are coding features that you can live without!

 
