Aussie AI
Chapter 2. Debugging CUDA Hello World
-
Book Excerpt from "CUDA C++ Debugging: Safer GPU Kernel Programming"
-
by David Spuler
Buggy First CUDA C++ Program
Amusingly, a typical “Hello World” program written in CUDA C++ will have a bug. How’s that for a nice introduction to the parallel programming world? Talk about a steep learning curve!
This chapter looks at a “hello world” program written in CUDA C++ that just tries to print out a message. Such a humble goal, and yet it fails, of course, so the remainder of the chapter is spent debugging the code.
If you’re a beginner at CUDA C++, you’ll need to install the CUDA Toolkit on a computer with a GPU. If you don’t have a GPU, you can use Google Colab, which provides GPU runtimes (mostly for free), as discussed later in the chapter. And if you’re already an advanced CUDA programmer, well, you’ll already have a GPU environment, but why are you reading this chapter?
Let’s have a go at a basic program:
// Hello World, basic CPU version
#include <cstdio>  // for printf
int main()
{
    printf("Hello CUDA!\n");
}
Yes, that runs just fine and the output is:
Hello CUDA!
There’s only one problem with this code: it’s not running on the GPU. You can’t call yourself a CUDA programmer if you run code on a CPU. All of the advanced CUDA programmers rip out the CPU from their computers, and run with just a GPU and a bunch of ping pong balls instead.
Now for some real programming. The basic idea of GPU programming in CUDA C++ is:
- Have both CPU “host” code and GPU “device” code in the same C++ file.
- The default is that C++ code is for the CPU host.
- We mark GPU device code with the “__global__” specifier (yes, it has four underscores).
The way your program runs on a GPU is:
- Execution starts in the CPU at the main function.
- The GPU function (called a “kernel”) is just sitting there, twiddling thumbs, waiting.
- The CPU “launches” a GPU function.
- The GPU then runs that kernel function.
Hence, to modify our C++ code to run on the GPU, we need to:
- Define a function.
- Declare it as “__global__”.
- Launch it using a weird syntax.
Here’s the very first attempt at a program that runs on a GPU:
// Hello World, buggy GPU version
#include <cstdio>  // for printf
__global__ void aussie_cuda_hello_world()
{
    printf("Aussie CUDA says Hello World!\n");
}
int main()
{
    aussie_cuda_hello_world<<< 1, 1 >>>();
}
All this does is say “hello” without any other computations.
The way that you know it’s running on the GPU is the extra specifier “__global__”
on the function definition.
Not only is this function rather dull, it’s also sluggish, because we really don’t want the GPU
to be calling printf, except for debug tracing.
Where’s the CPU code?
Any function that does not have a CUDA specifier like __global__ or __device__ is by default a CPU function.
It’s called “host” code in the CUDA vernacular,
since CPU is the “host,” and GPU is the “device.”
GPU code is called “device code” or a “kernel.”
For our hello world example,
the CPU code is just the main function.
The strangest part of the CUDA syntax is pretty obvious: the “triple chevron” tokens “<<<” and “>>>” surrounding some numbers.
This code is a “launch” or “invocation” of the GPU “kernel.”
It’s easier to visualize the function call without the fancy CUDA syntax,
which is basically a zero-parameter standard C++ function call:
aussie_cuda_hello_world();
However, it has some extra parameters inserted between the function name and the function arguments:
<<< 1, 1 >>>
The meaning of the numbers is clearer if we do this:
<<< blocks, threads_per_block >>>
The modified kernel invocation would look like this:
int blocks = 1;
int threads_per_block = 1;
aussie_cuda_hello_world<<< blocks, threads_per_block >>>();
This syntax launches multiple copies of the kernel function aussie_cuda_hello_world,
each of which is called a “thread.”
How many?
In this case, we are launching 1 block with 1 thread per block, so there is a grand total of 1*1 = 1 function call to our kernel,
which is not exactly “multiple copies” of the kernel, as I vaguely promised above.
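By the way, if you’re building this yourself, CUDA source files normally use a “.cu” extension and are compiled with the nvcc compiler driver. Something like the following should work, although the file name “hello_world.cu” is just a placeholder for whatever you called your file:
nvcc -o hello_world hello_world.cu
./hello_world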
So, anyway, here’s our full CUDA “hello world” program. Let’s run it so we can bask in glory. Here’s the output:
(sound of crickets)
What?!? There’s no output! How can there be a bug when there are literally only two statements?
Fixing Hello World
Okay, here’s the problem, in simple terms: the CPU didn’t wait for the GPU’s output. The whole program finished on the CPU before the GPU output anything.
The solution is simple: make the CPU wait.
The simplest way to do this is to call cudaDeviceSynchronize:
aussie_cuda_hello_world<<< 1, 1 >>> ();
cudaDeviceSynchronize();
This forces the CPU to wait for all the GPU kernel threads to finish,
which is called “synchronization.”
Hence, cudaDeviceSynchronize is a “blocking” or “synchronous” type of CUDA call.
One nice feature of the nvcc compiler is that you don’t need any #include
of a header file to call cudaDeviceSynchronize.
This is because nvcc automatically includes “cuda_runtime.h” at the top of the CUDA C++ file.
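Putting it all together, here’s a minimal sketch of the corrected program. The check on the return value of cudaDeviceSynchronize is an optional extra, but it’s a good habit to pick up in a debugging book; cudaGetErrorString just turns the error code into a readable message.
// Hello World, fixed GPU version (minimal sketch; error check is optional)
#include <cstdio>  // for printf
__global__ void aussie_cuda_hello_world()
{
    printf("Aussie CUDA says Hello World!\n");
}
int main()
{
    aussie_cuda_hello_world<<< 1, 1 >>>();
    cudaError_t err = cudaDeviceSynchronize();  // wait for the GPU kernel to finish
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));  // report what went wrong
    }
}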
More Details on GPU Output Buffering
This section is optional and quite advanced, but if you really want to know (and maybe you don’t), here’s a deeper look at why the output disappeared. In more detail, there are actually three problems:
1. CUDA kernel launches with “<<<...>>>” are asynchronous.
2. Kernel output is buffered, rather than immediately output.
3. Buffered GPU kernel output is discarded on CPU host program exit.
That’s rather a mouthful. Let’s try to break it down:
- The CPU didn’t hang around for the GPU to do anything, because it doesn’t wait for GPU kernels to finish.
- But the C++ code in main had no further statements, so the whole program immediately exited (the CPU part).
- The GPU still did its work, and called printf correctly inside the GPU code.
- The weird part is that printf inside the GPU is not actually printed out immediately by the GPU. Instead, it’s stored (“buffered”) for the CPU to print out later.
- Buffered GPU output doesn’t get printed until the CPU runs again afterwards.
- But the CPU had already exited, so the CPU wasn’t still there anymore to print out any of the GPU output.
- So, the GPU just gave up and threw it all away instead, and then the GPU quit too.
Any clearer? Maybe made it worse? I told you this section was optional for a reason.
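If you’d like to see the buffering with your own eyes, here’s a small sketch of an experiment (my own illustration, with made-up messages, not part of the main example). The CPU’s first message should appear before the GPU’s, because the kernel’s buffered output is only flushed to the host at the synchronization call.
// Sketch: watching GPU printf buffering in action
#include <cstdio>
__global__ void buffered_hello()
{
    printf("GPU: hello from the kernel\n");   // buffered on the device
}
int main()
{
    buffered_hello<<< 1, 1 >>>();             // asynchronous launch
    printf("CPU: kernel launched, not synchronized yet\n");
    cudaDeviceSynchronize();                  // GPU output gets flushed here
    printf("CPU: after synchronization\n");
}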
Running Multiple Threads
Every call to the function in each block starts in a new thread at the same time, and runs in lock-step over the same set of statements. All of the parallel calls to the kernel function have the same function parameters (i.e., none in this case).
This is also a rather dull kernel invocation, because it only runs a single instance of the GPU kernel. That’s not parallel! It’s only running 1 copy of the kernel at a time, which is something that a real CUDA programmer would never, ever do.
Let’s run 5 threads, so we have 5 versions of the kernel running instead.
Here’s our updated code, including the bug fix call to cudaDeviceSynchronize:
// Hello World, 5 threads version
#include <cstdio>  // for printf
__global__ void aussie_cuda_hello_world()
{
    printf("Aussie CUDA says Hello World!\n");
}
int main()
{
    int blocks = 1;
    int threads_per_block = 5;
    aussie_cuda_hello_world<<< blocks, threads_per_block >>>();
    cudaDeviceSynchronize();
}
Here’s the output:
Aussie CUDA says Hello World!
Aussie CUDA says Hello World!
Aussie CUDA says Hello World!
Aussie CUDA says Hello World!
Aussie CUDA says Hello World!
This runs 5 threads, because we have launched 1 block, and each block has 5 threads. Now let’s modify it so that it tells you which threads are running. We can do this with a statement:
int tid = threadIdx.x;
Our new kernel is this:
__global__ void aussie_cuda_hello_world()
{
    int tid = threadIdx.x;
    printf("GPU thread %d says Hello World!\n", tid);
}
Here’s the output:
GPU thread 0 says Hello World!
GPU thread 1 says Hello World!
GPU thread 2 says Hello World!
GPU thread 3 says Hello World!
GPU thread 4 says Hello World!
As you can see, the 5 threads are numbered 0..4 for their “thread index” value.
Also, don’t be misled by the fact that they appeared in sequential order from 0..4, because that’s
an idiosyncrasy of the printf handling in GPU kernel code.
All five threads are actually running in parallel!
Running Multiple Blocks
We’ve only had a single “block” of threads so far. Let’s try running two blocks by changing our CPU code:
int blocks = 2; // Hooray!
int threads_per_block = 5;
aussie_cuda_hello_world<<< blocks, threads_per_block >>>();
cudaDeviceSynchronize();
And we can also make each thread figure out what block it’s in using the blockIdx “block index” variable.
Here’s our updated GPU kernel code:
__global__ void aussie_cuda_hello_world()
{
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    printf("GPU block %d thread %d says Hello World!\n", bid, tid);
}
The output is:
GPU block 1 thread 0 says Hello World!
GPU block 1 thread 1 says Hello World!
GPU block 1 thread 2 says Hello World!
GPU block 1 thread 3 says Hello World!
GPU block 1 thread 4 says Hello World!
GPU block 0 thread 0 says Hello World!
GPU block 0 thread 1 says Hello World!
GPU block 0 thread 2 says Hello World!
GPU block 0 thread 3 says Hello World!
GPU block 0 thread 4 says Hello World!
There were two blocks of five threads each, so 10 threads ran in total, and they all ran in parallel. Here we can see that the two blocks had a “block index” of 0 and 1, and they printed in reverse order. Also note that the thread index was always 0..4 in both blocks (i.e., not 0..4 and 5..9).
Finally, let’s show how to get two blocks of five threads to properly count to 10. The way to work out the “index” of a thread in the whole “grid” (multiple blocks) is to use this CUDA code, which is the most common statement you’ll see every day in CUDA C++:
int id = blockIdx.x * blockDim.x + threadIdx.x;
Note that “blockDim.x” means “block dimension” and
is a builtin variable that is the “threads-per-block” value,
so it will equal 5 here.
Hence, this is the new GPU kernel C++ function:
__global__ void aussie_cuda_hello_world()
{
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    printf("GPU index %d block %d thread %d says Hello World!\n", id, bid, tid);
}
Here’s the output:
GPU index 5 block 1 thread 0 says Hello World!
GPU index 6 block 1 thread 1 says Hello World!
GPU index 7 block 1 thread 2 says Hello World!
GPU index 8 block 1 thread 3 says Hello World!
GPU index 9 block 1 thread 4 says Hello World!
GPU index 0 block 0 thread 0 says Hello World!
GPU index 1 block 0 thread 1 says Hello World!
GPU index 2 block 0 thread 2 says Hello World!
GPU index 3 block 0 thread 3 says Hello World!
GPU index 4 block 0 thread 4 says Hello World!
Okay, so it worked! Even though it looks messy, our 10 kernels counted from 0..9 in parallel.
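That global index is usually put to better use than printing. As a hedged preview of where this pattern leads (the kernel name, array name, and size below are made up purely for illustration), each thread typically uses its index to work on one element of an array, with a bounds check in case more threads were launched than there are elements:
// Sketch: each thread processes one array element via its global index
__global__ void aussie_add_one(int* data, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (id < n)                                      // guard against extra threads
        data[id] = data[id] + 1;
}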
Why Are Blocks Needed?
Why do you need blocks to run CUDA kernel functions on the GPU at all?
I mean, they complicate the index calculations.
Why can’t you just always use 1 block, and then specify as many threads as you want?
Then every thread could just get its number from threadIdx without that weird calculation
involving blockIdx and blockDim.
Here’s the idea for using single blocks:
int blocks = 1;
int threads = 16384;
aussie_cuda_hello_world<<< blocks, threads >>>();
Here’s a more CUDA-style use of multiple blocks, with a typical block size of 256 or 512 threads per block:
int blocks = 64;
int threads = 256; // 64x256=16384
aussie_cuda_hello_world<<< blocks, threads >>>();
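In practice, the block count is usually computed from the total number of threads you want, rather than hard-coded. A common sketch (the total of 16384 is just the same illustrative number as above) uses rounding-up integer division so that the last, partially-filled block still gets launched:
int total = 16384;                              // total threads wanted (illustrative)
int threads = 256;                              // threads per block
int blocks = (total + threads - 1) / threads;   // round up: 64 blocks here
aussie_cuda_hello_world<<< blocks, threads >>>();
cudaDeviceSynchronize();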
Why do we need to use multiple blocks? Wouldn’t all the threads run in parallel either way?
Short answer: no, the GPU does not actually run all threads in parallel. It depends.
Longer answer: The reasons that blocks are needed are mainly relevant to more advanced CUDA C++ programming, but here are some:
- Each GPU has hard limits on the block size.
- Shared memory with the “__shared__” specifier has block scope.
- Scheduling on the GPU is at a block level.
Okay, so the GPU has a fixed limit on threads-per-block. But even if we could use an unlimited number of threads in each block, we shouldn’t do so. But why?
Shared memory is an important optimization discussed later. This faster memory has block scope, so we need to control the block size to maximize this benefit.
Scheduling is also a low-level GPU issue whereby the Streaming Multiprocessors (SMs) only work on complete blocks. The underlying scheduler tries to allocate as many blocks in parallel as it can, but sometimes it cannot fit all of them, because it’s got too many other workloads. This refers to production usage of a GPU, where it’s overheating from doing other important computation work (e.g., serving Taylor Swift companion bots), not just the cold, lazy GPU under your desk that’s only been playing Fortnite.
The scheduler on the GPU does not necessarily run all the blocks in parallel, but has to guarantee that all the threads in a single block do. Hence, a big block will have to wait longer for enough space to be free on the GPU, whereas smaller blocks can get scheduled more easily.
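If you’re curious what the hard limits actually are on your own GPU, the CUDA runtime can tell you. Here’s a minimal sketch that queries device 0 (an assumption; multi-GPU machines have more devices) and prints the per-block thread limit and the SM count:
// Sketch: query block-size limits and SM count for device 0
#include <cstdio>
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);          // properties of GPU device 0
    printf("GPU name: %s\n", prop.name);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Number of SMs: %d\n", prop.multiProcessorCount);
}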