Aussie AI
Chapter 12. CUDA Compute Optimizations
-
Book Excerpt from "CUDA C++ Optimization: Coding Faster GPU Kernels"
-
by David Spuler
Parallelization
Really? CUDA can do parallelization on a GPU? Who would have thunk it.
Obviously, the whole point of a GPU is to do parallelization in its kernel threads. The main ways to do parallelization on a GPU include:
- Run kernel threads in parallel (i.e., vectorization or “data parallelism”).
- Run multiple kernels in parallel (i.e., higher-level “task parallelism” in the algorithm).
- Multi-GPU architectures if you’re flush.
The general idea of taking an array or vector of data and operating in parallel on each of its elements is called “vectorization.” This is the bread-and-butter of CUDA kernels! The compiler can also auto-optimize in some cases to do additional lower-level vectorization by unrolling loops within kernels.
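For example, here is a minimal sketch of a vectorized kernel, where each thread handles one element of the arrays (the kernel name and launch sizes are illustrative):
__global__ void vector_add(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {              // safety test for the last partial block
        c[i] = a[i] + b[i];   // each thread processes one element in parallel
    }
}
// Launch with enough blocks to cover all n elements, e.g.:
//   int threads = 256;
//   int blocks = (n + threads - 1) / threads;
//   vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);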
Task parallelism means looking for high-level “tasks” within your application that can be parallelized (i.e., they don’t depend on each other’s results). This can be managed in CUDA applications using “streams,” which appear in many of the CUDA Runtime primitives. By defining different activities in different streams, CUDA has the information to know which tasks depend on each other, and can automatically schedule for the best possible parallelism.
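As a rough sketch (the kernel and variable names here are hypothetical), two independent tasks can be issued into separate streams so that the scheduler is free to overlap them:
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);
kernel_task1<<<blocks1, threads1, 0, s1>>>(d_data1);  // task A in stream 1
kernel_task2<<<blocks2, threads2, 0, s2>>>(d_data2);  // task B in stream 2
cudaStreamSynchronize(s1);  // wait only when each result is needed
cudaStreamSynchronize(s2);
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);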
And we can stop there, because the rest of the improvements seem less important. It’s quite plausible to have the CPU running as the overall controller of the GPU kernels, but using sequential logic itself. Time to go and get a cup of coffee.
If you’re still reading, there are some other ways to squeeze out a bit more juice. Some of the additional ways to add parallelism to your CUDA applications include:
- CPU and GPU workload parallelism.
- Overlapping data transfers and GPU computation.
- Overlapped data transfers for the CPU, too.
CPU and GPU Workload Parallelization. In any CUDA architecture, you have both a CPU and a GPU, and you can put them both to work in parallel. I mean, the GPU will probably do 99% of the grunt work, but let’s at least keep the CPU happy with some token jobs.
Kernel launches are asynchronous on the host, so the CPU can keep going with its code while the GPU starts cranking away on those threads. To achieve this type of simple CPU-GPU compute parallelism, you need to do this:
1. Launch the GPU kernel in a stream.
2. Don’t synchronize!
3. Do other work on the CPU.
4. Synchronize with the stream when it’s ready.
The host code needs to avoid synchronization,
because that would cause the CPU to block.
Obviously, that means to avoid calling cudaDeviceSynchronize,
which is the worst because it blocks the CPU to wait for everything.
However, there are also many CUDA functions that have an “implicit synchronization” with the GPU,
such as:
cudaGetLastError,
cudaMemcpy, cudaMemset, cudaMalloc, and various others.
Instead, the CPU needs to do its other work, while watching the stream asynchronously for completion
of the GPU kernel.
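Here is a minimal sketch of those four steps (the kernel and the CPU work function are placeholders):
cudaStream_t stream;
cudaStreamCreate(&stream);
my_kernel<<<blocks, threads, 0, stream>>>(d_data, n);  // 1. launch in a stream
// 2. no cudaDeviceSynchronize here!
do_cpu_work();                   // 3. other work on the CPU while the GPU runs
cudaStreamSynchronize(stream);   // 4. block only when the kernel result is needed
cudaStreamDestroy(stream);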
Synchronization Slugs
The other side of the coin for parallelization is synchronization. You can’t do too little synchronization, because that’s a bug in your parallel algorithm. But if you do too much synchronization, then it’s a slug.
If the CPU or GPU is sitting around waiting for somebody else to do something, that’s not fast. Over-synchronization can be a bottleneck in many algorithms. High-level inter-dependence should be reviewed to see if any algorithm-level changes can be made. Low-level use of synchronization primitives should be reviewed to ensure that none are redundant.
Some of the beginner mistakes in synchronization include:
- Calling cudaDeviceSynchronize at every second statement.
- Hidden implicit synchronization in cudaGetLastError and others.
- Blocking nature of simple CUDA runtime functions (e.g., cudaMemcpy, cudaMalloc).
- Too many calls to __syncthreads() and __syncwarp().
- Accidentally leaving “serialized kernel launch” settings enabled (e.g., CUDA_LAUNCH_BLOCKING).
- Overuse of atomics (like in Dune) can cause delays due to synchronization.
- Redundant barriers are unnecessary synchronization, whereby no race condition would arise without the barrier.
The host should avoid synchronization with device kernels
as much as possible, so as to allow the host to launch
multiple kernels in parallel, or to perform other computations
on the CPU while the GPU is also processing.
Obviously, the ability to do this depends on the overall algorithm.
Explicit synchronization occurs with CUDA Runtime calls such
as cudaDeviceSynchronize, but there is also “implicit synchronization”
in other functions, such as cudaMemcpy and cudaMemset.
More effective types of synchronization include:
- Use cudaPeekAtLastError to avoid implicit synchronization.
- Use cudaStreamSynchronize with streams on the host, along with asynchronous primitives, for more granular control of synchronization.
- Synchronize blocks and warps when you need to, but not too often, and hopefully just right.
- Avoid redundant atomics, but use them when you need to.
It’s hard to get the balance right, but at least you can watch Netflix in your office while you try.
Thread Bottlenecks
Imagine if you went to all the trouble of analyzing your application to find task parallelism, and then implemented kernels to launch threads to vectorize all of that, and then it still ran slow?
How awkward!
Don’t let your threads get bogged down for really simple reasons. Various things, both internal and external, can prevent your kernel threads from achieving greatness. Always beware the various bottlenecks that may arise in thread execution, such as:
1. Excessively Large Threads. The extreme of “thread coarsening” is large kernels running bloated threads, with a large body of code, or many sub-functions being called, which results in very low thread parallelism. Such code might need some refactoring, or even an algorithm re-design.
2. Function Call Overhead versus Inlining Functions.
When the GPU kernel calls another function, this creates
function call overhead and uses stack memory.
Inline functions are allowed in device code,
and benefit from compile-time optimizations and reduced stack memory usage.
You can also consider whether to use preprocessor macros versus inline functions.
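A minimal sketch of an inlined device function (the names are illustrative; __forceinline__ requests inlining, although nvcc usually inlines small device functions anyway):
__device__ __forceinline__ float squared(float x)
{
    return x * x;   // inlined into the caller: no call overhead, no stack frame
}
__global__ void square_all(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = squared(in[i]);
}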
3. Recursion. Well, if you’re using recursion in production code, especially in a kernel, you deserve to go slow. I have no sympathy for you! The only valid place for recursion is in a University assignment about binary tree traversals. Recursion and GPUs don’t mix.
4. CUDA Runtime Errors.
An aspect of both performance and debugging is
to check every CUDA Runtime API function for an error return.
A failing kernel is not just a bug, but also a slow kernel.
Most CUDA programmers use macros to wrap every call to these functions.
You can also use cudaGetLastError or cudaPeekAtLastError,
but be aware of implicit synchronization.
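One common style of wrapper macro looks roughly like this (the macro name is illustrative, not a CUDA API; it needs <cstdio> and <cstdlib> for fprintf and exit):
#define CUDA_CHECK(call) \
    do { \
        cudaError_t err_ = (call); \
        if (err_ != cudaSuccess) { \
            fprintf(stderr, "CUDA error: %s at %s:%d\n", \
                cudaGetErrorString(err_), __FILE__, __LINE__); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)
// Usage:
//   CUDA_CHECK(cudaMalloc(&d_ptr, bytes));
//   my_kernel<<<blocks, threads>>>(d_ptr, n);
//   CUDA_CHECK(cudaPeekAtLastError());   // catches launch errors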
5. Assertions Failing.
Assertion failures
from
the assert function
on devices are a special case of CUDA Runtime errors.
These
will allow the working threads to keep going,
while some threads fail, which is a slow-down.
These failures also trigger cudaErrorAssert errors.
There’s a great book called CUDA C++ Debugging
with a whole chapter written by me on assertions.
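A quick sketch of a device-side assertion (the kernel is illustrative; device assert requires including <cassert> or <assert.h>):
#include <cassert>
__global__ void scale_kernel(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    assert(factor != 0.0f);        // fires on the device if the condition is false
    if (i < n) data[i] *= factor;
}
// A failed device assertion later shows up on the host as cudaErrorAssert.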
6. Debug Trace Statements.
The kernel has the built-in printf function,
but it slows down performance,
especially for large amounts of output,
so only use it when debugging.
Come on, you don’t really need all that trace output!
We have debuggers for that.
Also, another chapter in the CUDA C++ Debugging book.
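If you must keep some kernel tracing, one option is to hide the printf calls behind a compile-time flag, as in this sketch (the macro and flag names are made up):
#ifdef MY_KERNEL_DEBUG
#define KERNEL_TRACE(...) printf(__VA_ARGS__)
#else
#define KERNEL_TRACE(...) ((void)0)
#endif
__global__ void traced_kernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] * 2.0f;
        KERNEL_TRACE("thread %d: %f\n", i, out[i]);  // compiled out in release builds
    }
}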
7. Busy Wait Functions or Timers. In the category of “you did this to yourself”: don’t implement timers or “sleep” functions on the GPU using a busy wait. It’s better to run timers on the host, and the kernel should not be marking time by spinning in a loop.
8. Insidious CUDA Runtime Slugs.
The CUDA runtime checks a lot of its pointer arguments, but there are a few
cases where sending the wrong type of pointer with Unified Addressing causes a slug,
rather than a CUDA error code.
The whole idea is to relieve you of the burden of managing pointers, but sometimes,
if you send a host pointer instead of a device pointer,
it will end up transferring data without you realizing it.
This may occur in host code or device code.
For example, try sending a host pointer from cudaMallocHost
as an argument where cudaMemcpy expects a device pointer
(in a few cases only).
Additional Thread-Level Optimizations
1. Thread Coarsening. This optimization involves having “coarse” threads that perform more work. A classic example is unrolling loops inside a kernel thread. The advantage of having coarser threads is reduced kernel launch overhead for a set amount of computation. However, they can also worsen performance by reducing thread parallelism, so there is always a balance, and benchmarking is desirable to optimize.
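A sketch of a coarsened kernel, where each thread processes several strided elements so that warp accesses stay coalesced (the factor of 4 and the names are illustrative):
__global__ void scale_coarse(float* data, int n, float factor)
{
    const int COARSEN = 4;                          // illustrative coarsening factor
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;            // total number of threads
    for (int k = 0; k < COARSEN; ++k) {
        int i = tid + k * stride;                   // strided so warps stay coalesced
        if (i < n) data[i] *= factor;
    }
}
// Launch with roughly n / 4 threads and benchmark against the
// one-element-per-thread version, since coarsening reduces parallelism.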
2. Loop Unrolling.
The idea with loop unrolling is to perform multiple computations
without testing the loop condition.
Fixed-size loops with the size known at compile-time
can even be fully unrolled.
Note that although loop unrolling is often very powerful in sequential programming,
it can actually reduce thread parallelism and introduce non-coalesced memory access patterns
when used in CUDA kernels.
The cost of a simple index variable comparison (e.g., an “i < n” test),
where both i and n will be in a register,
is low compared to global memory access costs (often a factor of 100 or more).
Hence, the benefit of loop unrolling in CUDA is more about avoiding kernel launch overhead
than about loop comparison test overhead.
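A sketch of compiler-directed unrolling with a fixed-size inner loop (the names and chunk size are illustrative):
__global__ void sum_chunks(const float* in, float* out, int n)
{
    const int CHUNK = 8;                  // fixed size known at compile time
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int base = tid * CHUNK;
    float sum = 0.0f;
    #pragma unroll                        // fully unrolls this fixed-size loop
    for (int k = 0; k < CHUNK; ++k) {
        if (base + k < n) sum += in[base + k];
    }
    out[tid] = sum;
}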
3. Instruction Cache Optimizations (Instruction Locality). Every processor, whether CPU or GPU, maintains its own instruction pointer and has an instruction cache that aims to speed up the loading of instructions. Using instructions that are “close” to each other can be significantly faster, whereas branching widely over a large set of instructions destroys instruction locality.
4. Avoid Redundant Extra Threads. Threads are allocated in warps of 32, and cannot be launched in a smaller number. Hence, if the problem size does not divide evenly, there can be “extra” threads running or sitting inactive, which wastes resources. Entire warps of threads can also run needlessly in some poorly designed algorithms. Take care to examine your “safety” conditions in threads, in case you accidentally have many threads doing nothing.
5. Detect Redundant Thread Computations. This is where multiple threads are doing the same computations on the same data. This is more likely in a novice CUDA kernel, but experts occasionally make errors that cause additional threads to perform the same work. This error can be hard to detect, because nothing fails: multiple threads are all doing the correct calculations, just redundantly. Sometimes, it can be whole redundant warps!
GPU Configurations
Some additional ways to maximize your use of a GPU are listed below:
1. GPU-Specific Optimizations.
More advanced GPUs allow additional types of optimizations.
Make sure you take advantage of the full hardware capabilities.
These can be examined programmatically in the CUDA C++ code
with API calls such as cudaGetDeviceProperties.
Useful information includes the Compute Capability level
and details of grid dimension constraints.
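For example, a sketch of checking the Compute Capability and grid limits at run time:
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);   // device 0
printf("Compute Capability: %d.%d\n", prop.major, prop.minor);
printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
printf("Max grid dimensions: %d x %d x %d\n",
    prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);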
2. Lazy Loading of GPU Module Instructions.
This is a CUDA optimization to kernel launching
that delays the loading of machine code instructions by the GPU
until they are needed.
It is enabled by default since CUDA 12.2, and can
be controlled by setting the environment variable CUDA_MODULE_LOADING
to either LAZY (default) or EAGER.
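The usual approach is to set the variable in the shell before launching the application; setting it from inside main is shown below only as a sketch, on the assumption that it runs before the first CUDA call initializes the runtime:
// In the shell:  CUDA_MODULE_LOADING=EAGER ./my_app
// Or, as an in-process sketch (Linux setenv; must precede any CUDA call):
#include <cstdlib>
int main()
{
    setenv("CUDA_MODULE_LOADING", "EAGER", 1);   // override the LAZY default
    // ... CUDA initialization and kernel launches follow ...
    return 0;
}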
3. GPU Overclocking.
This is something that can be considered,
but it’s also a risk.
GPU overheating is a real problem in production environments,
which is why NVIDIA has the command-line dcgmi GPU diagnostic tool (it’s part
of NVIDIA’s Data Center GPU Manager, not the CUDA Toolkit, so it’s not installed by default).
The risk of GPU failure is
especially high when the GPUs have been running for a while,
or have been previously used in Bitcoin mining or other
heavy loads.
Note that GPU failure is often insidious, calculating incorrect matrix multiplications,
rather than triggering a runtime error!
You can get the current GPU clock speed in deviceQuery (a CUDA sample program)
or via the nvidia-smi command, or programmatically in CUDA C++
using the cudaGetDeviceProperties function
and the clockRate property:
// GPU clock speed (clockRate is reported in kHz)
int device_number = 0;
cudaDeviceProp prop;
CUDACHK( cudaGetDeviceProperties(&prop, device_number) );
double ghz = prop.clockRate / 1.0e6;  // kHz to GHz
printf("GPU Clock Rate: %d kHz (%3.2f GHz)\n", prop.clockRate, ghz);
Also available are other properties, such as the memory clock speed and the width of the PCIe bus (in bits) between GPU and RAM.
Note that you cannot set any of these GPU properties via the CUDA Runtime API
(i.e., there’s no “cudaSetDeviceProperties”).
You might be able to programmatically change the GPU clock speed
using the NVIDIA Management Library (NVML) “nvmlDeviceSetApplicationsClocks” API, or
in some circumstances by launching an “nvidia-smi” command as a sub-process.
4. Process Prioritization.
An application running as a process on Linux
can be given an increased priority when launched,
using Linux operating system commands (i.e., the nice command).
This can give the application priority access to the CPU,
above many other
less important processes.
Alternatively, the nice command can also be used to lower the priority
of less important processes.
If you want to run at the highest priority:
nice -n -20 a.out
The default adjustment for the nice command is 10 (i.e., a lower priority).
Note that there’s also the renice command to adjust the priority of an already-running process.
This suggestion assumes that you have the security privileges to do so.
Programmatically, in C++ host code,
you could use these APIs:
- nice function in <unistd.h> (Linux)
- setpriority or getpriority (Linux)
- SetPriorityClass in <processthreadsapi.h> (Windows API)
Note that the nice C++ function adds an increment to the current priority value,
rather than setting an absolute value (use setpriority for that).
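A sketch of raising the priority from Linux host code with setpriority (this normally fails without root privileges):
#include <sys/resource.h>
#include <cstdio>
int main()
{
    if (setpriority(PRIO_PROCESS, 0, -20) != 0) {  // 0 means the calling process
        perror("setpriority");                     // e.g. permission denied
    }
    // ... launch CUDA work as usual ...
    return 0;
}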
5. Device Property Testing.
Examining the GPU device properties seems like it should be fast,
but there are two ways with different speeds.
Generally, cudaGetDeviceProperties is slower
because it has to get all of the properties,
some of which require an expensive PCIe bus read.
Depending on which attribute you seek,
cudaDeviceGetAttribute can be much faster.
The slower attributes include
cudaDevAttrClockRate,
cudaDevAttrKernelExecTimeout,
cudaDevAttrMemoryClockRate, and
cudaDevAttrSingleToDoublePrecisionPerfRatio.
For an example of how to get every type of device property, look
up the free deviceQuery sample in NVIDIA’s GitHub area.
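As a sketch, individual attributes can be queried one at a time:
int sm_count = 0, warp_size = 0;
cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, 0);  // device 0
cudaDeviceGetAttribute(&warp_size, cudaDevAttrWarpSize, 0);
printf("SMs: %d, warp size: %d\n", sm_count, warp_size);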
6. GPU Isolation. This is an optimization where your application only uses a single GPU on a multi-GPU system. Without paying attention to “isolating” its execution, an application will consume resources across multiple GPUs, which wastes overall resources. This method does not optimize the execution of the application that is using isolation, but benefits the other workloads on the other GPUs. It’s kind of like quarantine for silicon beings.
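Two common ways to pin an application to a single GPU are sketched below (the device numbers are illustrative):
// In the shell, hide all but one device (only device 2 is visible to the app):
//   CUDA_VISIBLE_DEVICES=2 ./my_app
// Or in host code, select a device explicitly before any kernel launches:
cudaSetDevice(0);   // use GPU 0 for all subsequent CUDA calls on this host thread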