Chapter 14. Grid Optimizations
Book Excerpt from "CUDA C++ Optimization: Coding Faster GPU Kernels"
by David Spuler
Grid Size Optimizations
Choosing the grid size is an important aspect of optimization for CUDA kernels. The goals of this analysis include:
- Occupancy rates (GPU thread utilization)
- Load balancing of the workload over the GPU (and over multiple waves)
- Wave optimizations (execution of multiple blocks)
Risks include allocating too many threads for the GPU, or inefficient wave patterns, such as the “tail effect” of a small final wave. Choose block sizes carefully. Experimentation and benchmarking may be required to find the best size.
Grid size basics. How many blocks and how many threads-per-block should you run? The basic constraints are that the warp size is 32 threads and the block size (i.e., threads-per-block) is at most 1024 threads. One rule of thumb is to start with a block size of 256 or 512 threads; whatever you choose must be a multiple of the warp size (i.e., 32).
Note that there are no NVIDIA GPUs for which these constraints are higher (at least, as of this writing). However, different GPUs can support more or fewer blocks, and differ in the overall size of a wave (multiple blocks).
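For example, a common starting point is to fix the block size and then round the block count up to cover the whole problem. This is a minimal sketch, assuming a hypothetical kernel mykernel that processes n elements:
int threads_per_block = 256;  // multiple of the warp size (32)
int blocks = (n + threads_per_block - 1) / threads_per_block;  // round up to cover all n elements
mykernel <<< blocks, threads_per_block >>> (/* kernel arguments */);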
You can print out the basic grid constraints of your GPU to check they haven't changed by using cudaGetDeviceProperties:
int device_number = 0;
cudaDeviceProp prop;
CUDACHK( cudaGetDeviceProperties(&prop, device_number) );
// Warp size (should be 32)
printf("Warp Size: %d threads\n", (int)prop.warpSize);
// Maximum Threads per block (should be 1024)
printf("Max Threads Per Block: %d threads\n", (int)prop.maxThreadsPerBlock);
You should get this output:
Warp Size: 32 threads
Max Threads Per Block: 1024 threads
The block size should be a multiple of 32 threads, which is the warp size. If it isn't, the program will still run, but the GPU only schedules threads on a per-warp basis, so the "extra" threads in the last, partially filled warp will be idle, and unavailable to you or any other GPU kernel. A simple way to ensure this inefficiency doesn't accidentally occur:
assert(threads_per_block % 32 == 0);
You can use the bitwise optimization "& 0x1F" instead of "% 32" if you don't trust compiler design engineers (but don't forget the extra parentheses!).
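For example, the equivalent bitwise test is below; the extra parentheses are needed because "==" binds more tightly than "&":
assert((threads_per_block & 0x1F) == 0);  // same test as "threads_per_block % 32 == 0"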
Block size limit is 1024. The block size is also limited to a maximum of 1024 threads, and common sizes in CUDA programs are 256 or 512, but it can be exactly 1024 if you prefer.
However, if you choose a block size of more than 1024 threads, the kernel launch will immediately fail with an error. Weirdly, this type of synchronous kernel launch error is insidious, because it gets lost if you don't check for it right away.
mykernel <<< blocks, 1025 >>> (); // BUG!
CUDACHK( cudaMemcpy(...) ); // Won’t fail!
Note that kernels cannot return a value and must be of type void.
So, you can’t do:
err = mykernel <<< blocks, 1025 >>> (); // Compile error!
And similarly, this idea fails:
CUDACHK( mykernel <<< blocks, 1025 >>> () ); // Compile error!
The way to actually catch this synchronous error is with an immediate call to cudaPeekAtLastError or cudaGetLastError:
mykernel <<< blocks, 1025 >>> ();
CUDACHK( cudaPeekAtLastError() ); // Fails (correctly!)
CUDACHK( cudaMemcpy(...) );
Since we may not want even this small error-checking overhead on every kernel launch in production code, a better solution for catching these synchronous kernel launch failures is either:
(a) Don't do that!, or
(b) Put your post-launch call to cudaPeekAtLastError or cudaGetLastError inside an "#if DEBUG" block that disables it for non-debug builds, or
(c) Use an assert(threads_per_block <= 1024) before the kernel launch, which can be compiled out by defining NDEBUG.
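As a minimal sketch of option (b), assuming a project-level DEBUG macro:
mykernel <<< blocks, threads_per_block >>> ();
#if DEBUG
CUDACHK( cudaPeekAtLastError() );  // check launch errors in debug builds only
#endif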
Balancing workload. One way that can help to balance the workload in the GPU is to choose a number of blocks that is an exact multiple of the number of streaming multiprocessors (SMs) on the chip. You can get this number as a device property via cudaDeviceGetAttribute:
// Streaming Multiprocessor count
int smcount;
CUDACHK( cudaDeviceGetAttribute(&smcount, cudaDevAttrMultiProcessorCount, 0));
printf("Number of SMs: %d\n", smcount);
This number can then be used in the block count calculation, as shown below. Note that this does not guarantee workload balancing, but it does give the GPU a better chance of achieving it.
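For example, this sketch rounds the block count up to a multiple of the SM count queried above (assuming a hypothetical kernel mykernel over n elements; the kernel must bounds-check, since some blocks may receive no work):
int blocks = (n + threads_per_block - 1) / threads_per_block;   // blocks needed to cover n elements
blocks = ((blocks + smcount - 1) / smcount) * smcount;          // round up to a multiple of the SM count
mykernel <<< blocks, threads_per_block >>> (/* kernel arguments */);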
Wave Optimizations
In CUDA, a “wave” is a set of thread blocks running in parallel. Hence, it is a high-level concept of a large amount of parallel execution. Optimizing the number of waves is an important aspect of choosing the number of blocks and block size for your kernel.
Generally, if you can parallelize your algorithm well enough, you would want only one wave, so that the entire workload finishes at once. But the workload commonly exceeds the number of threads possible on a GPU, so it has to be split into multiple waves.
Tail Effect (Small Final Wave). One aspect of wave optimization is to avoid the “tail effect” at the end of your algorithm. This refers to having a smaller wave at the end, because the early waves didn’t quite finish off all of the workload. Hence, you get a final wave with a very low occupancy level.
The tail effect occurs when computation is split over multiple waves. If the algorithm maximizes occupancy for the start of the algorithm, there is often a much smaller last wave that does any left-over computations. This final phase has a low occupancy, and the overall GPU utilization can often be improved by a more balanced allocation of computations to slightly smaller waves, or to a “single-wave kernel” where possible.
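To see whether a tail wave is likely, you can estimate the wave count from the total block count and how many blocks the GPU can run concurrently. This is a rough sketch using cudaOccupancyMaxActiveBlocksPerMultiprocessor (part of the occupancy API discussed below), assuming a hypothetical kernel mykernel launched with total_blocks blocks of threads_per_block threads:
int smcount = 0;
CUDACHK( cudaDeviceGetAttribute(&smcount, cudaDevAttrMultiProcessorCount, 0) );
int blocks_per_sm = 0;
CUDACHK( cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, mykernel, threads_per_block, 0) );
int blocks_per_wave = smcount * blocks_per_sm;                        // blocks that can run concurrently
int waves = (total_blocks + blocks_per_wave - 1) / blocks_per_wave;   // round up
int tail_blocks = total_blocks % blocks_per_wave;                     // non-zero means a partial final wave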
One tip that can be useful for wave optimization is to consider SM balancing in your grid and block size values. Choosing a grid size where the number of blocks is a multiple of the number of multiprocessors can encourage a load-balanced execution by the GPU, although it does not guarantee it. This would mitigate the tail effect if the heuristic is effective. Note that you can programmatically find the number of SMs:
// SMs on the GPU
int device_number = 0;
cudaDeviceProp prop;
CUDACHK( cudaGetDeviceProperties(&prop, device_number) );
printf("SMs on GPU: %d SMs\n", (int)prop.multiProcessorCount);
Single-Wave Kernels. The idea of a single-wave kernel is simply to design the algorithm in such a way that each kernel can run as a single wave on the GPU, achieving maximum occupancy. Achieving this may involve an algorithm re-design and careful computation of grid dimensions.
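One common way to approach this (a sketch, not the only design) is a grid-stride loop, where the grid is sized to roughly fill the GPU once and each thread loops over multiple elements. This assumes smcount holds the SM count queried above, and d_data points to n floats in device memory:
__global__ void scale_kernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;  // total number of threads in the grid
    for ( ; i < n; i += stride) {
        data[i] *= factor;  // each thread processes multiple elements
    }
}
// Launch with a grid sized to roughly one wave (e.g., a small multiple of the SM count):
int blocks = smcount * 4;  // heuristic; tune by benchmarking
scale_kernel <<< blocks, 256 >>> (d_data, n, 2.0f);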
Occupancy Optimization
High occupancy rates are an important goal in selecting the grid size. Occupancy is measured as the number of active threads as a percentage of the total theoretically possible threads in a GPU. There is a useful “occupancy calculator” and also an “occupancy API” available from CUDA to help.
Occupancy API. There are some CUDA API functions that can be useful in calculating the best grid and block size dimensions for achieving high occupancy. These are defined in "cuda_occupancy.h", which is auto-included via "cuda_runtime.h" by nvcc for ".cu" files.
Some of these API calls are:
- cudaOccupancyMaxPotentialBlockSize
- cudaOccupancyMaxActiveBlocksPerMultiprocessor
- cudaOccupancyMaxPotentialBlockSizeVariableSMem
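For example, cudaOccupancyMaxPotentialBlockSize suggests a block size (and a minimum grid size) that maximizes occupancy for a particular kernel. A minimal sketch, assuming a kernel named mykernel with no dynamic shared memory and n elements to process:
int min_grid_size = 0;  // minimum grid size needed for maximum occupancy
int block_size = 0;     // suggested block size
CUDACHK( cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, mykernel, 0, 0) );
int blocks = (n + block_size - 1) / block_size;  // cover n elements with the suggested block size
mykernel <<< blocks, block_size >>> (/* kernel arguments */);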
Additional properties of the GPU and the SMs can be queried via the cudaGetDeviceProperties function. Note that these are read-only properties of the hardware and there's no "set" API. There are dozens of them, but some are:
- multiProcessorCount - number of SMs per GPU.
- totalGlobalMem - total global memory.
- sharedMemPerBlock - shared memory per block.
- warpSize - the warp size (32, always!).
- regsPerBlock - maximum registers per block.
- maxThreadsPerBlock - maximum threads-per-block (1024).
- maxThreadsPerMultiProcessor - maximum threads per SM.
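For example, a few of these fields can be printed from the same cudaDeviceProp structure filled in by cudaGetDeviceProperties earlier:
printf("Total global memory: %zu bytes\n", prop.totalGlobalMem);
printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
printf("Registers per block: %d\n", prop.regsPerBlock);
printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);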
The function cudaDeviceGetLimit can be used to examine three other configuration settings, which are read-write GPU limits (settable via cudaDeviceSetLimit):
- cudaLimitPrintfFifoSize - printf FIFO buffer size
- cudaLimitStackSize - stack size
- cudaLimitMallocHeapSize - heap size
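For example, to query the heap size used by in-kernel malloc (a minimal sketch; the corresponding "set" call is cudaDeviceSetLimit):
size_t heap_size = 0;
CUDACHK( cudaDeviceGetLimit(&heap_size, cudaLimitMallocHeapSize) );
printf("Device malloc heap size: %zu bytes\n", heap_size);
CUDACHK( cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024) );  // e.g., raise it to 64MB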
General occupancy techniques. If you're not achieving high occupancy, there are a number of optimizations to consider:
- Too few warps. If you're not allocating enough warps of threads to utilize the SMs, then your occupancy will simply be low. Consider increasing the parallelization of your algorithm so that more warps are used. But don't go too high, either.
- Warps per kernel: choose your block count and block size so that the number of warps assigned to each multiprocessor matches the maximum number of warps allowed per SM.
- Blocks per SM: ensure the number of blocks in a kernel is adjusted according to the maximum blocks allowed per SM. It may be desirable to adjust the block size to contain more warps, and thereby reduce the number of blocks.
- Register bottlenecks. There are only a limited number of registers on each SM. Hence, using too many registers in your kernels may restrict occupancy. Consider reducing the allowed number of registers via compiler options or launch bounds (see the sketch after this list), although be aware that this reduces memory efficiency (registers may spill to slower local memory) and comes at a runtime cost.
- Shared memory bottlenecks. There is also a restricted amount of shared memory available on each SM. This can be an area that reduces your occupancy rates. Review the amount of shared memory being used by your kernel.
- Unbalanced threads. If your kernel has a lot of threads at the start, but reduces to fewer active threads, it is said to be unbalanced. This can occur in many types of kernels, notably reductions, and may lead to poor load balancing over the SMs.
- Unbalanced blocks. Similarly, if some of the blocks remain active for longer than others, this is unbalanced. Review the block-level activity across your kernel for effects on occupancy in the multiprocessors.
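For the register bottleneck above, here is a sketch of the two common ways to cap register usage: a per-kernel __launch_bounds__ hint, or a global compiler flag (the specific values shown are only examples):
// Option 1: per-kernel hint (at most 256 threads per block, at least 4 blocks per SM)
__global__ void __launch_bounds__(256, 4) mykernel2(float *data, int n)
{
    // ... kernel body ...
}
// Option 2: cap registers for the whole compilation unit on the nvcc command line:
//   nvcc -maxrregcount=32 mykernel.cu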