Appendix: CUDA Puzzles

  • Book Excerpt from "CUDA C++ Debugging: Safer GPU Kernel Programming"
  • by David Spuler

Instructions: Here are some puzzles on CUDA C++ debugging for your full and total enjoyment, or to use for tormenting CUDA C++ job applicants. The choice is entirely yours to make!

Every one of these code sequences has a bug in it, and usually a serious one. Catch all the bugs if you can!

Mostly these are insidious run-time errors, but a few might get a helpful warning, or even a compile-time error. Note that #include lines have been removed from some examples for brevity, so that’s not the answer! Also excluded for simplicity is common scaffolding such as the idiomatic CUDACHK runtime error-checking macro, other runtime error-checking code, and self-testing unit test functions that add up vector elements. If a missing declaration is all you can find, keep looking!
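If you want to compile and run the snippets yourself, an error-checking macro along these lines will do the job (a sketch of one common pattern; the book’s own CUDACHK may differ in detail):

    #include <cstdio>
    #include <cstdlib>

    // Minimal runtime error-checking macro: report and bail on any CUDA error
    #define CUDACHK(call) do { \
        cudaError_t err_ = (call); \
        if (err_ != cudaSuccess) { \
            fprintf(stderr, "CUDA error: %s at %s:%d\n", \
                cudaGetErrorString(err_), __FILE__, __LINE__); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)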

CUDA Puzzle #1

Puzzle Code: Here’s the device kernel:

    __global__ void aussie_add_vector_puzzle1(
          const float*v1, 
          const float*v2, 
          float* vout, 
          int n
        )
    {
        // Compute offset
        int id = threadIdx.x;
        if (id < n) { // Safety
            vout[id] = v1[id] + v2[id];  // Add one element
        }
    }

And here’s the host code that launches the kernel:

    // Kernel launch sequence
    int nthreads = 256;
    int blocks = (n + nthreads - 1) / nthreads;
    aussie_add_vector_puzzle1<<< blocks, nthreads>>>(dv1, dv2, dv3, n);
    CUDACHK( cudaDeviceSynchronize() );

Question: Where’s the bug?

CUDA Puzzle #2

Puzzle Code: Here is the device code:

    __global__ void aussie_vector_scalar_puzzle2(
          float* vout, 
          int n,
          float scalar
        )
    {
        // Compute offset
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < n) { // Safety
            vout[id] *= scalar;  // Scale element
        }
    }

And here is the host code with kernel launch:

    // Kernel launch sequence
    float recip = 1.0f / divisor;
    int nthreads = 2048;
    int blocks = (n + nthreads - 1) / nthreads;
    aussie_vector_scalar_puzzle2<<< blocks, nthreads>>>(dv, n, recip);
    CUDACHK( cudaDeviceSynchronize() );

Question: Where’s the bug?

CUDA Puzzle #3

Puzzle Code: Here’s the device code for a 2D kernel:

    __global__ void matrix_add_puzzle3(
        float *m3, const float *m1, const float *m2, 
        int nx, int ny)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        int id = x + y * nx; // Linearize
        if (x < nx || y < nx) {  // Safety
            m3[id] = m1[id] + m2[id];
        }
    }

Question: Where’s the bug?

CUDA Puzzle #4

Puzzle Code: Here’s the kernel code:

    __global__ void matrix_add_safe_puzzle4(
        float *m3, const float *m1, const float *m2, 
        int nx, int ny)
    {
       int x = blockIdx.x + blockDim.x + threadIdx.x;
       int y = blockIdx.y + blockDim.y + threadIdx.y;
       if (x < nx && y < ny) {  // Safety
           int id = x + y * nx; // Linearize
           m3[id] = m1[id] + m2[id];
       }
    }

Question: Where’s the bug?

CUDA Puzzle #5

Puzzle Code:

    __global__ void matrix_add_safe_puzzle5(
        float *m3, const float *m1, const float *m2, 
        int nx, int ny)
    {
       int x = blockIdx.x * blockDim.x + threadIdx.x;
       int y = blockIdx.y * blockDim.x + threadIdx.y;
       if (x < nx && y < ny) {  // Safety
           int id = x + y * nx; // Linearize
           m3[id] = m1[id] + m2[id];
       }
    }

Question: Where’s the bug?

CUDA Puzzle #6

Puzzle Code: Here’s the code for the 2D kernel:

    __global__ void matrix_add_safe_puzzle6(
        float *m3, const float *m1, const float *m2, 
        int nx, int ny)
    {
       int x = blockIdx.x * blockDim.x + threadIdx.x;
       int y = blockIdx.y * blockDim.y + threadIdx.y;
       if (x < nx && y < ny) {  // Safety
           int id = x + y * ny; // Linearize
           m3[id] = m1[id] + m2[id];
       }
    }

Question: Where’s the bug?

CUDA Puzzle #7

Puzzle Code: Here’s the device code:

    __global__ void aussie_clearvec_puzzle7(
          float* v, int n )
    {
        // Compute offset
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < n) { // Safety
            v[id] = 0.0;  // Clear element
        }
    }

And here’s the host code with the kernel launch:

    // Kernel launch sequence
    int nthreads = 32;
    int blocks = 1;
    aussie_clearvec_puzzle7 <<< blocks, n>>>(dv, n);
    CUDACHK( cudaDeviceSynchronize() );

Question: Where’s the bug?

CUDA Puzzle #8

Puzzle Code: Here’s the kernel code:

     __global__ void matrix_add_safe_puzzle8(
        float *m3, const float *m1, const float *m2, 
        int nx, int ny)
    {
       int x = blockIdx.x * blockDim.x + threadIdx.x;
       int y = blockIdx.y * blockDim.y + threadIdx.y;
       if (x < nx /*X* / && y < nx /*Y*/ ) {
           int id = x + y * nx; // Linearize
           m3[id] = m1[id] + m2[id];
       }
    }

Question: Where’s the bug?

CUDA Puzzle #9

Puzzle Code: Here’s the kernel code:

    __global__ void matrix_hadamard_safe_puzzle9(
        float *m3, const float *m1, const float *m2, 
        int nx, int ny)
    {
       int x = blockIdx.x * blockDim.x + threadIdx.x;
       int y = blockIdx.y * blockDim.y + threadIdx.y;
       if (x < nx /*X*/ && y /*Y*/ ) {  // Safety
           int id = x + y * nx; // Linearize
           m3[id] = m2[id] * m1[id];
       }
    }

Question: Where’s the bug?

CUDA Puzzle #10

Puzzle Code: Here’s the kernel code:

    __global__ void matrix_diff_safe_puzzle10(
        float *m3, const float *m1, const float *m2, 
        int nx, int ny)
    {
       int x = blockIdx.x * blockDim.x + threadIdx.x;
       int y = blockIdx.y * blockDim.y + threadIdx.y;
       if (x >= nx /*X*/ && y >= ny /*Y*/ ) {
           int id = x + y * nx; // Linearize
           m3[id] = m1[id] - m2[id];
       }
    }

Question: Where’s the bug?

CUDA Puzzle #11

Puzzle Code: Here’s the kernel code:

    __global__ void matrix_add_safe_puzzle11(
        float *m3, const float *m1, const float *m2, 
        int nx, int ny)
    {
       int x = blockIdx.x * blockDim.x + threadIdx.x;
       int y = blockIdx.y * blockDim.y + threadIdx.y;
       if (x < nx /*X*/ & y < ny /*Y*/ ) {
           int id = x + y * nx; // Linearize
           m3[id] = m1[id] + m2[id];
       }
    }

Question: Where’s the bug?

CUDA Puzzle #12

Puzzle Code: Here’s the kernel code:

    __global__ void matrix_clear_safe_puzzle12(
        float *m, int nx, int ny)
    {
       int x = blockIdx.x * blockDim.x + threadIdx.x;
       int y = blockIdx.y * blockDim.y + threadIdx.y;
       if (x < nx && y < ny) {
           m[x][y] = 0.0;
       }
    }

Question: Where’s the bug?

CUDA Puzzle #13

Puzzle Code: Here’s the kernel code:

    __global__ void aussie_add_vector_puzzle13(
          const float*v1, 
          const float*v2, 
          float* vout, 
          int n
        )
    {
        // Compute offset
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id <= n) { // Safety
            vout[id] = v1[id] + v2[id];  // Add one element
        }
    }

Question: Where’s the bug?

CUDA Puzzle #14

Puzzle Code: Here’s the GPU kernel:

    __global__ void aussie_add_vector_puzzle14(
          const float*v1, 
          const float*v2, 
          float* vout, 
          int n
        )
    {
        // Compute offset
        int id = blockIdx.x * blockDim.x * threadIdx.x;
        if (id <= n) { // Safety
            vout[id] = v1[id] + v2[id];  // Add one element
        }
    }

Question: Where’s the bug?

CUDA Puzzle #15

Puzzle Code: Here’s the kernel device code for the GPU:

    __global__ void aussie_add_vector_puzzle15(
          const float*v1, 
          const float*v2, 
          float* vout, 
          int n
        )
    {
        // Compute offset
        int lane = threadIdx.x & 1F;
        int id = blockIdx.x * blockDim.x + lane;
        if (id <= n) { // Safety
            vout[id] = v1[id] + v2[id];  // Add one element
        }
    }

And here’s the kernel launch code:

    // Kernel launch sequence
    int nthreads = 32;
    int blocks = (n + nthreads - 1) / nthreads;
    aussie_add_vector_puzzle15 <<< blocks, nthreads>>>(dv1, dv2, dv3, n);
    CUDACHK( cudaDeviceSynchronize() );

Question: Where’s the bug?

CUDA Puzzle #16

Puzzle Code: Here’s the kernel device code for the GPU:

    __global__ void aussie_clear_vector_puzzle16 (
          float* v, int n)
    {
        // Compute offset
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (! (id >= n)) { // Safety
            v[id] = 0.0;  // Clear element
        }
    }

And here’s the kernel launch code:

    int nthreads = (1<<6);
    int blocks = n + nthreads - 1 / nthreads;
    aussie_clear_vector_puzzle16 <<< blocks, nthreads >>> (dv, n);

Question: Where’s the bug?

CUDA Puzzle #17

Puzzle Code: Here’s the kernel device code for the GPU:

    __global__ void aussie_clear_vector_puzzle17(
          float* v, int n) {
        // Compute offset
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < n) { // Safety
            v[id] = 0.0;  // Clear element
        }
    }

And here’s the kernel launch:

    int nthreads = 27;
    int blocks = (n + nthreads - 1) / nthreads;
    aussie_clear_vector_puzzle17 <<< blocks, nthreads >>> (dv, n);

Question: Where’s the bug?

CUDA Puzzle #18

Puzzle Code: Here’s some GPU code:

    __global__ void aussie_clear_vector_puzzle18(
          float* vout, int n)
    {
        // Compute offset
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < n) { // Safety
            vout[id] = 0.0;  // Clear element
        }
    }

And here’s the kernel launch:

    int nthreads = 032;
    int blocks = (n + nthreads - 1) / nthreads;
    aussie_clear_vector_puzzle18 <<< blocks, nthreads >>> (dv, n);

Question: Where’s the bug?

CUDA Puzzle #19

Puzzle Code: Here’s the kernel code:

    __global__ void aussie_clear_vector_puzzle19(
          char* v, 
          int n
        )
    {
        // Compute offset
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < n) { // Safety
            v[id] = 0.0;  // Clear element
        }
    }

Question: Where’s the bug?

CUDA Puzzle #20

Puzzle Code: Here’s the kernel code:

    __global__ void aussie_clear_vector_puzzle20(
          float* v, 
          int n
        )
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < n) v[id] = 0;
    }

This is the host code:

    int nthreads = 64;
    int blocks = (n + (nthreads - 1)) / nthreads;
    aussie_clear_vector_puzzle20 <<< blocks, nthreads >>> (dv, sz);

Question: Where’s the bug?

CUDA Puzzle #21

Puzzle Code: Here’s the kernel:

    __global__ void aussie_clearvec_puzzle21(
          float* v, 
          int n
        )
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < n) { // Safety
            v[id] = 0.0;  // Clear element
        }
        else {
            assert(id >= n);
        }
    }

And here’s the host C++ code:

    int nthreads = 32;
    int blocks = (n + nthreads - 1) / n;
    aussie_clearvec_puzzle21 <<< blocks, nthreads >>> (dv, n);

Question: Where’s the bug?

CUDA Puzzle #22

Puzzle Code: Here’s the kernel in 2D:

    __global void matrix_add_safe_puzzle22(
        float *m3, const float *m1, const float *m2, 
        int nx, int ny)
    {
       int x = blockIdx.x * blockDim.x + threadIdx.x;
       int y = blockIdx.y * blockDim.y + threadIdx.y;
       if (x < nx /*X*/ && y < ny /*Y*/ ) {  // Safety
           int id = x + y * nx; // Linearize
           m3[id] = m1[id] + m2[id];  // Add
       }
    }

Question: Where’s the bug?

CUDA Puzzle #23

Puzzle Code: Here’s the kernel code:

    __global__ void aussie_clearvec_puzzle23(
          float* v, int n)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        (id < n || (v[id] = 0.0));
    }

Question: Where’s the bug?

CUDA Puzzle #24

Puzzle Code: Here’s the kernel code:

    __global__ void aussie_clearvec_puzzle24(
          float* v, int n)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        assert(id < n && (v[id] = 0.0));
    }

Question: Where’s the bug?

CUDA Puzzle #25

Puzzle Code: Here’s the kernel code:

    __global__ void aussie_clearvec_puzzle25(
          float* v, int n )
    {
        // Compute offset with lane
        int lane = threadIdx.x & 0x1F;
        int id = blockIdx.x * blockDim.x + lane;
        if (id < n) { // Safety
            v[id] = 0.0;  // Clear element
        }
    }

Question: Where’s the bug?

CUDA Puzzle #26

Puzzle Code: Here’s the kernel code:

    __global__ void aussie_clearvec_puzzle26(
          float* v, int n )
    {
        // Compute offset with lane
        int lane = threadIdx.x & 0x1F;
        int id = blockIdx.x * blockIdx.x + lane;
        if (id < n) { // Safety
            v[id] = 0.0;  // Clear element
        }
    }

And here’s the kernel launch:

    // Kernel launch sequence
    int nthreads = 32;
    int blocks = (n + nthreads - 1) / nthreads;
    aussie_clearvec_puzzle26 <<< blocks, nthreads>>>(dv, n);
    CUDACHK( cudaDeviceSynchronize() );

Question: Where’s the bug?

CUDA Puzzle #27

Puzzle Code: The kernel code is:

    __global__ void aussie_clearvec_puzzle27(
          float* v, int n )
    {
        int lane = threadIdx.x & 0x1F;
        int id = blockIdx.x * blockDim.x + lane;
        assert(id < n);
        v[id] == 0.0;
    }

And here’s the kernel launch:

    // Kernel launch sequence
    int n = 256*32;  // multiple of 32
    int nthreads = 32;
    int blocks = (n + nthreads - 1) / nthreads;
    aussie_clearvec_puzzle27 <<< blocks, nthreads>>>(dv, n);
    CUDACHK( cudaDeviceSynchronize() );

Question: Where’s the bug?

CUDA Puzzle #28

Puzzle Code: Here’s the GPU kernel:

    __global__ void matrix_add_safe_puzzle28(
        float *m3, const float *m1, const float *m2, 
        int nx, int ny)
    {
       int x = blockIdx.x * blockDim.x + threadIdx.x;
       int y = blockIdx.y * blockDim.y + threadIdx.y;
       if (x < nx /*X*/ && x < ny /*Y*/ ) {  // Safety
           int id = x + y * nx; // Linearize
           m3[id] = m1[id] + m2[id];
       }
    }

Question: Where’s the bug?

CUDA Puzzle #29

Puzzle Code: The kernel code is:

    __global__ void aussie_clearvec_puzzle29(
          float* v, 
          int n
        )
    {
        // Compute offset using threadIdx
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id <=> n) { // Safety
            v[id] = 0.0;  // Clear element
        }
    }

Question: Where’s the bug?

CUDA Puzzle #30

Puzzle Code: Here’s the kernel code:

    __global__ void aussie_clearvec_puzzle30(
          float* v, int n )
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < n) { // Safety
            v[id] = 0.0;  // Clear element
        }
    }

And here’s the launch code:

    // Kernel launch sequence
    int n = 1>>12;  // multiple of 32
    int nthreads = 32;
    int blocks = (n + nthreads - 1) / nthreads;
    aussie_clearvec_puzzle30 <<< blocks, nthreads>>>(dv, n);
    CUDACHK( cudaDeviceSynchronize() );

Question: Where’s the bug?

CUDA Puzzle #31

Puzzle Code:

    #define BITS 5

    __global__ void aussie_clearvec_puzzle31(
          float* v, int n )
    {
        assert(blockDim.x == 32);
        int id = blockIdx.x << BITS + threadIdx.x;
        if (id < n) { // Safety
            v[id] = 0.0; // Clear element
        }
    }

And here’s the launch code:

    // Kernel launch sequence
    int n = 1u << 15;  // multiple of 32
    int nthreads = 1 << BITS;  // 32  
    int blocks = ( n + nthreads - 1) / nthreads;
    aussie_clearvec_puzzle31 <<< blocks, nthreads>>>(dv, n);
    CUDACHK( cudaDeviceSynchronize() );

Question: Where’s the bug?

CUDA Puzzle #32

Puzzle Code: Here’s the kernel code:

    __global__ void aussie_clearvec_puzzle32(
          float* v, int n )
    {
        // Compute offset
        int id = blockIdx.x * blockDim.x + (threadIdx.x & 0x1F);
        if (id < n) { // Safety
            v[id] = 0.0;  // Clear element
        }
    }

And here’s the launch code:

    // Kernel launch sequence
    int n = 1u << 15;  // multiple of 32
    int nthreads = 32;
    int blocks = ( n + nthreads - 1) / blocks;
    aussie_clearvec_puzzle32 <<< blocks, nthreads>>>(dv, n);
    CUDACHK( cudaDeviceSynchronize() );

Question: Where’s the bug?

Answers

Final Words: How did you go with a full warp of puzzles? Was it fun? Or was it fully warped? Remember that if you’re ever having trouble debugging your CUDA kernels, make like an Aussie and turn your C++ code upside-down.

Answer #1: The kernel does not use blockIdx in the computation of the index, so it won’t ever set the higher elements of a vector. This will only add vector elements 0..31, probably many times over in parallel across different blocks and warps. The kernel will not crash, but won’t work correctly for vectors with more than 32 elements.
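For reference, the fixed kernel computes a global index from both the block and the thread (a sketch, with the hypothetical name aussie_add_vector_fixed1); the launch code is unchanged:

    __global__ void aussie_add_vector_fixed1(
          const float* v1,
          const float* v2,
          float* vout,
          int n
        )
    {
        // Global index: block offset plus thread offset
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < n) { // Safety
            vout[id] = v1[id] + v2[id];  // Add one element
        }
    }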

Answer #2: The block size in nthreads is 2048, but more than 1024 threads exceeds the limits allowed for block size. Hence, the kernel will fail to launch, with a synchronous failure.

Only part marks if you thought the only problem was that the divisor reciprocal calculation was not protected against divide-by-zero errors. The blocks calculation should really be capped at a maximum, too, as this code will exceed the grid-size limits for very large n values. But capping blocks alone won’t work, because there’s no grid-stride loop in the kernel. Does the code need more comments?
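For reference, a grid-stride version of the kernel might look like this (a sketch, with the hypothetical name aussie_vector_scalar_strided); it stays correct even when the grid is capped:

    __global__ void aussie_vector_scalar_strided(
          float* vout, int n, float scalar)
    {
        // Grid-stride loop: each thread covers multiple elements
        int stride = blockDim.x * gridDim.x;
        for (int id = blockIdx.x * blockDim.x + threadIdx.x;
             id < n;
             id += stride) {
            vout[id] *= scalar;  // Scale element
        }
    }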

Answer #3: The || operator should be && (and, while you’re there, the second test should compare y against ny, not nx). The safety test is not very safe.

Answer #4: The index computation should use multiplication, blockIdx.x * blockDim.x, not addition (+).

Answer #5: Typo. One of the blockDim.x should be blockDim.y.

Answer #6: Typo. ny should be nx in the id calculation. Works fine if it’s a square matrix!

Answer #7: The launch uses n as the block size, rather than nthreads. This will only work for vectors of sizes up to 1024. If n is ever larger, there will be more than 1024 threads, the hard limit on block sizes for a GPU. Hence, the kernel will fail to launch with a synchronous error.
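A corrected launch computes the block count and passes nthreads, not n, as the block size (a sketch):

    // Kernel launch sequence (fixed)
    int nthreads = 32;
    int blocks = (n + nthreads - 1) / nthreads;
    aussie_clearvec_puzzle7 <<< blocks, nthreads >>>(dv, n);
    CUDACHK( cudaDeviceSynchronize() );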

Answer #8: There’s a nested comment problem that will comment-out the “y < nx” test, because there’s a space between “*” and “/”. You’d probably get a compiler warning, and hopefully you pay attention to them!

Answer #9: Should be “&& y < ny” not just “&& y”.

Answer #10: The two Boolean safety tests are reversed, with >= operators, so the only threads that pass the test are out-of-range ones, and every assignment the kernel makes is invalid. If it is launched with grid dimensions that exactly match the matrix, it will simply do nothing.

Answer #11: Should be “&&” (logical-and operator) not “&” (bitwise-and operator). As it happens, the relational tests bind more tightly than “&”, so this version computes the right answer, but it loses short-circuit evaluation, it’s a precedence trap for the next edit, and most compilers will warn about it.

Answer #12: Two-dimensional array syntax m[x][y] won’t work on a linearized array. The computation of the linearized index is also missing. Fortunately, this should be a compiler error, albeit a confusing one.
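One corrected form linearizes the index just like the other 2D puzzles (a sketch, with the hypothetical name matrix_clear_fixed12):

    __global__ void matrix_clear_fixed12(
        float *m, int nx, int ny)
    {
       int x = blockIdx.x * blockDim.x + threadIdx.x;
       int y = blockIdx.y * blockDim.y + threadIdx.y;
       if (x < nx && y < ny) {
           int id = x + y * nx; // Linearize
           m[id] = 0.0f;
       }
    }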

Answer #13: Safety test is off-by-one, and should be “id < n” not “id <= n”.

Answer #14: Should be “+ threadIdx.x” (addition) not “*” (multiplication). (The off-by-one “id <= n” test from Puzzle #13 is still lurking here, too.)

Answer #15: The constant “1F” should be “0x1F” (hexadecimal 31). Without the “0x” prefix it isn’t a valid integer literal at all, so this is a compile-time error (luckily!). Note that the use of lane in this way is dubious (it should use threadIdx.x), but it’s not a bug here because nthreads is only 32.

Answer #16: Could you do 1<<6 in your head? But that’s not the bug. There are missing parentheses in the calculation of blocks, which should be “(n + nthreads - 1) / nthreads”. As written, the division binds tightest, so it computes n + nthreads - (1 / nthreads), and the integer division 1 / nthreads is zero, giving far too many blocks.

Answer #17: Surely, this is an easy one. The number of threads per block should be a multiple of 32, not 27.

Answer #18: The initializer for nthreads is 032, which is an octal constant in C++ equal to decimal 26, not 32.

Answer #19: The device parameter should be “float*” not “char*”. You’d get a compiler error, but then, without thinking about it much, you might just add a pointer cast to the argument, right?

Answer #20: Yes, it’s the wrong zero, but that won’t crash it. Kernel parameter “sz” should be “n” and that will probably crash. Presumably, sz is the byte size used for memory allocation and equals n*sizeof(float), which is too large. The kernel could overflow its array bounds if n is not a clean multiple of 64, because the safety test has a threshold that’s four times too high.

Answer #21: The assertion is trivially true, and harmless. The part that isn’t harmless is that the calculation of “blocks” mixes up “nthreads” and “n” in the divisor. The value for blocks will always be 1 (for any n of warp size or more), so only the first 32 elements ever get cleared.

Answer #22: Specifier “__global” is invalid and should be “__global__” (with suffix underscores). It’s a harmless problem as there’s a compiler error to remind you.

Answer #23: It’s a tricky attempt to use an expression instead of a safety if test. Perhaps the idea is to avoid if statements for branch coherence? However, the short-circuiting of the “||” operator is the wrong logic, with its “or else” meaning: the assignment only executes when id is out of range, which is exactly the invalid case. The idea would work with “&&” (and its “and then” logic), but it wouldn’t really do anything to change branch divergence anyway (if that was the intention, rather than just showing off fancy coding skills).
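For completeness, the “and then” version that would actually work looks like this (a sketch, with the hypothetical name aussie_clearvec_fixed23; still questionable style):

    __global__ void aussie_clearvec_fixed23(
          float* v, int n)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        // "and then" logic: the assignment runs only when id < n
        (void)(id < n && (v[id] = 0.0f));
    }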

Answer #24: The assertion always fails, because its second operand, the assignment v[id] = 0.0, evaluates to zero (i.e., false). Furthermore, the kernel will do nothing if assertions are ever “compiled out” for production mode. Also, a failed device-side assertion puts an error on the whole kernel, even if only one thread trips it, so it’s not really a good way to combine the safety test with assertions.

Answer #25: If the block size is more than 32 threads, this will miss data for the threads with a higher thread index, because “lane” is always 0..31 here. Portions of the vector won’t be processed.
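The straightforward fix uses the full thread index rather than just the warp lane (a sketch, with the hypothetical name aussie_clearvec_fixed25):

    __global__ void aussie_clearvec_fixed25(
          float* v, int n )
    {
        // Full thread index, valid for any block size
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < n) { // Safety
            v[id] = 0.0f;  // Clear element
        }
    }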

Answer #26: The second “blockIdx” should be “blockDim.” This use of lane is dubious, but works here for a block size of 32.

Answer #27: No, the assertion should not fail. But the “==” operator on v[id] should be “=”. It’s a null-effect statement, not an assignment, and should get a compiler warning.

Answer #28: The “if” condition is testing “x” twice; the second test should be “y < ny”.

Answer #29: The spaceship operator “<=>” (three-way comparison) should be just “<”. Although “<=>” is valid in modern C++, its result is a comparison-category type that doesn’t convert to bool, so this won’t even compile as an if condition.
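If you were determined to keep the spaceship, its result has to be compared against zero, as in this host-side sketch (device-side use would further depend on your toolchain’s C++20 support):

    #include <compare>

    bool in_range(int id, int n) {
        return (id <=> n) < 0;  // equivalent to id < n
    }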

Answer #30: “1>>12” is zero, not 4096; it should presumably be “1<<12”. And with n equal to zero, blocks also computes to zero, so the launch itself is invalid.

Answer #31: Operator precedence error. Here, blockDim.x is 32, and x << 5 would be the same as x * 32, but the << operator has lower precedence than the + operator, whereas multiplication has higher precedence. As written, the code shifts blockIdx.x left by BITS + threadIdx.x bits. Parentheses are needed around “blockIdx.x << BITS”.
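The corrected index computation, with the shift parenthesized (a sketch):

    // (blockIdx.x << BITS) equals blockIdx.x * 32; then add the thread offset
    int id = (blockIdx.x << BITS) + threadIdx.x;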

Answer #32: Typo. The variable blocks is used in its own initializer, which reads an uninitialized value, with undefined results, possibly even a divide-by-zero. In any case, it’s an incorrect calculation of the number of blocks; the divisor should be nthreads.

 
