Aussie AI
Chapter 5. Debugging Simple CUDA Kernels
Book Excerpt from "CUDA C++ Debugging: Safer GPU Kernel Programming"
by David Spuler
Grid Dimensions
One of the most common tasks in a kernel is to work out its own position in the grid. The builtin variables to use are:
- gridDim — grid dimension (how many blocks in the grid)
- blockDim — block dimension (how many threads per block)
- blockIdx — block index (in a grid)
- threadIdx — thread index (in a block)
Each thread in a grid has a unique pair of values for the block index and thread index. The typical CUDA idiom for the starting offset into a linear grid looks like this:
int starti = blockDim.x * blockIdx.x + threadIdx.x;
The above computation has a unique value in each thread, which ensures that each thread is working on a different index, such as when processing a vector or other linear array.
Note that the thread and block index values start at zero, like C++ array offsets, rather than at one.
The thread index ranges from zero to blockDim.x-1,
and the block index ranges from zero to gridDim.x-1.
These values differ between threads (and blocks),
but stay constant within each thread until it completes.
Dimension values are non-zero and fixed during the execution of a thread. Neither the block dimension (threads-per-block) nor the grid dimension (blocks-per-grid) can be zero. All blocks in a grid have the same number of threads, so these dimension values are the same for all of the threads executing a given kernel, and do not change while the kernel is running.
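As a quick debugging sanity check, you can print these values from a single thread with device-side printf. Here is a minimal sketch (the kernel name is just for illustration):
#include <cstdio>  // device printf

__global__ void dump_grid_config()
{
    // Print from one thread only, to avoid a flood of duplicate output.
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        printf("gridDim=(%u,%u,%u) blockDim=(%u,%u,%u)\n",
               gridDim.x, gridDim.y, gridDim.z,
               blockDim.x, blockDim.y, blockDim.z);
    }
}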
For a fixed-size operation on an AI model, these builtin dimension variables
could actually be replaced by a numeric constant.
However, this micro-optimization is not usually worth doing,
and the optimizer in the nvcc compiler
or the ptxas assembler is presumably handling this behind the scenes anyway.
An individual thread may not need to know this,
but the total number of threads in a grid can be calculated
by multiplying gridDim (how many blocks in the grid)
and blockDim (how many threads per block):
int total_threads = blockDim.x * gridDim.x;
Usually, the “.x” attribute is accessed from these builtin variables
to get the value.
However, these variables are all objects of type “dim3”
and have three parameters: x, y, and z.
The most common type of grid is a “linear grid,”
which uses only the x value; the y and z values default to one.
However, advanced grids can be two-dimensional or three-dimensional,
with y and/or z values greater than one.
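For example, the host-side launch configuration determines these shapes; a plain integer gives a linear grid, while dim3 values give multi-dimensional grids, with unspecified dimensions defaulting to one. A minimal sketch (the kernel names here are hypothetical placeholders):
// Launching with plain integers gives a linear grid:
my_vector_kernel<<< 64, 256 >>>( /* args */ );   // gridDim=(64,1,1), blockDim=(256,1,1)
// Multi-dimensional launches use explicit dim3 values:
dim3 block_dim(16, 16);   // blockDim=(16,16,1), i.e. 256 threads per block
dim3 grid_dim(8, 8);      // gridDim=(8,8,1), i.e. 64 blocks
my_matrix_kernel<<< grid_dim, block_dim >>>( /* args */ );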
One-Dimensional Vector Kernels
Let’s consider a vector addition kernel doing an in-place operation: y=y+x.
There are many ways to do so with different kernel functions:
- Single addition per kernel
- Single addition with a protective if statement
- Single block with a block-stride loop
- Looping over a strip
- Looping with a general grid stride
Generally, experienced CUDA programmers will handle this with a “grid-stride” loop (a preview sketch appears below), but let’s have a look at some of the other methods first.
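As a preview, here is a rough sketch of what a grid-stride version of the in-place vector addition can look like; the kernel name is just for illustration.
__global__ void aussie_vector_add_gridstride(float *x, float *y, int n)
{
    int stride = blockDim.x * gridDim.x;  // total number of threads in the grid
    for (int id = blockDim.x * blockIdx.x + threadIdx.x; id < n; id += stride) {
        y[id] = x[id] + y[id];  // each thread handles every stride-th element
    }
}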
Single Operation Kernel
In this simple type of kernel, each thread does the addition on a single vector element.
No loops.
The main point is to launch enough threads to cover all n elements of the vectors.
__global__
void aussie_vector_add(float *x, float *y, int n)
{
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    y[id] = x[id] + y[id];
}
...
int threads_per_block = 256;
int blocks = n / threads_per_block;
aussie_vector_add<<< blocks, threads_per_block >>>(x, y, n);
Bug!
This code actually has an error if n is not an exact multiple of 256,
due to integer division truncation.
We launch one block too few, and the last few “extra” elements of the vectors won’t get added.
One fix is to account for the extra cases by launching one more block, using the modified computation:
int blocks = (n + threads_per_block - 1) / threads_per_block;
But this has another insidious and non-obvious bug, as discussed below.
Kernel Safety Checks
The problem with the basic single-operation kernel
shown above
is that sometimes the value of id equals or exceeds n.
This occurs if n is not an exact multiple of the threads_per_block value.
The above code launches one more block to handle the extra cases,
but the problem is that it launches 256 threads for that block, even if there aren’t 256 extra cases.
These threads don’t know any better, and try to run,
with the index computation:
int id = blockDim.x * blockIdx.x + threadIdx.x;
In these problematic threads,
this creates an index value for id that is larger than or equal to n.
Then the accesses y[id] and x[id] are array bounds violations
in global memory.
Kaboom!
One way to solve this is to always ensure that n is a multiple of a reasonable block size,
such as 256.
Note that this should be a multiple of the warp size, which is usually 32.
This is quite a viable plan in AI, because most of the Transformer operations can be
chosen so that vectors, matrices and tensors have fixed dimensions that are multiples of 32.
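If you can’t control n directly, a related option is to pad the allocation up to the next block-size multiple, so every launched thread touches valid (if unused) memory. A host-side sketch only, with a hypothetical device buffer d_x:
int threads_per_block = 256;
// Round n up to the next multiple of the block size.
int padded_n = ((n + threads_per_block - 1) / threads_per_block) * threads_per_block;
float *d_x = NULL;
cudaMalloc((void**)&d_x, padded_n * sizeof(float));   // allocate the padded size
cudaMemset(d_x, 0, padded_n * sizeof(float));         // zero the tail padding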
That trade-off comes with some risks, and defensive coding says to do more to protect against crashes.
Hence, another safer solution to the array bounds error
is to add a protective if statement in every thread:
__global__ void aussie_vector_add(float *x, float *y, int n)
{
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < n) {  // Safety!
        y[id] = x[id] + y[id];
    }
}
...
int threads_per_block = 256;
int blocks = (n + threads_per_block - 1) / threads_per_block;
aussie_vector_add<<< blocks, threads_per_block >>>(x, y, n);
All of the threads now must execute an extra “if” comparison.
In the problematic extra threads, the if statement ensures that no work is done.
This is a waste of GPU resources in launching unnecessary threads, but at least it doesn’t crash.
One possible compromise solution is to change the if statement to an assertion,
which you allow to run live during test runs:
#include <assert.h>  // needed for the device-side assert

__global__ void aussie_vector_add(float *x, float *y, int n)
{
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    assert(id < n);  // stops this thread if the index is out of range
    y[id] = x[id] + y[id];
}
This will actually prevent the crash because the builtin kernel assert primitive stops the thread.
However, it’s probably not fault-tolerant for the entire application, because the kernel
launch will return a cudaErrorAssert runtime error.
Then, for performance reasons, you can remove the assertions from production builds by defining
the special macro NDEBUG,
and then ship that build to customers (fingers crossed!).
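On the host side, a failed device assertion typically shows up when you check the launch for errors. A minimal sketch of that check, using the standard CUDA runtime error calls:
aussie_vector_add<<< blocks, threads_per_block >>>(x, y, n);
cudaError_t err = cudaGetLastError();   // catches launch-configuration errors
if (err == cudaSuccess) {
    err = cudaDeviceSynchronize();      // a failed device assert surfaces here
}
if (err != cudaSuccess) {
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));  // e.g. cudaErrorAssert
}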
You should also add some assertions at the kernel launch to ensure that the total number of threads
you need
is an exact multiple of the block size:
assert(n % threads_per_block == 0);
Note that there’s no way in CUDA to launch blocks of different sizes.
We can’t launch most of the blocks with a fixed thread size like 256, and then launch one extra block
with a few extra threads.
If we try to work around this by launching two kernels with the same function (i.e., the main blocks
and one “extra” block),
the block index (blockIdx) and block dimension (blockDim)
will have the wrong values in the threads of the second launch.
Hence, we need to either guarantee at the algorithmic level
that the number of elements is an exact multiple of the block size (fast),
or add safety checks inside the kernel (slow).
Two-Dimensional Matrix Kernels
Things get trickier in two-dimensional kernels, such as for matrix operations. Let’s look at a matrix addition operation, which generalizes the vector addition idea.
Vector kernels need only use the “.x” attributes, but matrix kernels
also need to look at the “.y” values.
Here is the two-dimensional index calculation:
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
However, you can’t do this with dynamic array sizes:
m3[x][y] = m1[x][y] + m2[x][y];
On the other hand, this array syntax works for fixed-size matrices, if the dimensions are known at compile-time. And if you can guarantee that, the above operation would be very fast, so don’t let me discourage you!
Instead, for dynamic arrays in two dimensions, you need to “linearize” the matrix offsets into a single flat one-dimensional array, which contains all the matrix elements, ordered one row at a time. The calculation is:
int id = x + y * nx;
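For example, with nx = 10 columns, the element at x = 2, y = 3 maps to id = 2 + 3 * 10 = 32 in the flattened array.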
Our full kernel for matrix addition looks like this:
__global__ void matrix_add(
    float *m3, const float *m1, const float *m2,
    int nx, int ny)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int id = x + y * nx;  // Linearize
    m3[id] = m1[id] + m2[id];
}
Again, we have a problem with unsafe array bounds violations if there are extra threads launched. The safer version is:
__global__ void matrix_add_safe(
    float *m3, const float *m1, const float *m2,
    int nx, int ny)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < nx && y < ny) {
        int id = x + y * nx;  // Linearize
        m3[id] = m1[id] + m2[id];
    }
}
This is safe but marginally inefficient if there are extra cases. As before, there are other options such as using assertions, or being extra careful to ensure that our matrix dimensions are fully consistent with the grid dimensions, so that there aren’t any extra threads.
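The matching launch configuration uses dim3 values for both the block and grid shapes, with the same round-up division applied in each axis. A sketch, assuming a 16x16 block shape:
dim3 block_dim(16, 16);   // 256 threads per block
dim3 grid_dim((nx + block_dim.x - 1) / block_dim.x,
              (ny + block_dim.y - 1) / block_dim.y);
matrix_add_safe<<< grid_dim, block_dim >>>(m3, m1, m2, nx, ny);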
Warps and Lanes
So far, we’ve been talking about blocks and threads, but let’s add another wrinkle. There’s a reason that the number of threads should be divisible by 32: warps.
What do you call a group of 32 threads? No, it’s not a block, but a “warp.” Typical CUDA blocks run multiple warps, each containing 32 threads, which is why block sizes should be a multiple of the warp size.
Hence, our revised terminology is:
- Thread — a single execution unit.
- Warp — 32 threads.
- Block — multiple warps (of 32 threads each).
- Grid — multiple blocks.
The GPU kernel code sometimes needs to know the details of which warp it’s in, just like we’ve already seen it work out its thread offset. You can calculate which warp a kernel thread is in by examining the thread index in a block.
int warpid = threadIdx.x / 32;
If the block size is 64 threads, there are 2 warps, and the first 32 threads will get 0 here, and the latter 32 threads will get 1 as the warp index.
Lanes. Within a warp, the 32 threads are actually numbered 0..31. These are called the “lanes” for a warp, and you can calculate each thread’s lane from the thread index using the integer remainder operator:
int laneid = threadIdx.x % 32;
This will compute the “lane” from 0..31. Each thread in a warp has a different lane, but multiple threads across a block can have the same lane.
If you want to be cleaner in your coding, use a named constant for the warp size:
const int WARP_THREADS = 32;
int warpid = threadIdx.x / WARP_THREADS;
int laneid = threadIdx.x % WARP_THREADS;
Or if you want dirtier coding that bets against the compiler design engineers, you can needlessly micro-optimize:
int warpid = threadIdx.x >> 5;
int laneid = threadIdx.x & 0x1F;
Lanes are not needed for simple kernels, but they are very important when doing more complex kernels. For example, using the warp shuffle operations to optimize a kernel to avoid shared memory will require careful tracking of the lane index within each of the threads.
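As a taste of where lane numbers matter, here is an illustrative sketch (not a kernel from this chapter) of a warp-level sum using __shfl_down_sync; after the loop, lane 0 of each warp holds the total for its 32 lanes.
__device__ float warp_sum(float val)
{
    // Each step pulls a value from the lane 'offset' positions higher.
    for (int offset = 16; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xFFFFFFFF, val, offset);
    }
    return val;  // only lane 0 has the full warp total
}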