Aussie AI

Chapter 17. CUDA Portability

  • Book Excerpt from "CUDA C++ Debugging: Safer GPU Kernel Programming"
  • by David Spuler

Portability of CUDA C++ Applications

The portability model of CUDA programs to multiple architectures is quite complicated. Hence, let’s start with the most basic point about CUDA:

    Only NVIDIA GPUs are supported.

Beyond that, things get more complicated. There are two specific issues for the portability of your code:

    1. Host code portability (CPU), and

    2. Device code portability (GPU).

If you’re trying to run an AI application in the data center, then it’s probably running the host code on Linux and the device code on an H100 GPU. But if you’re using CUDA to write an application for gaming or video editing on a desktop PC, then the host code is running on an x86 CPU, and the GPU is a graphics card like a GeForce RTX 4090 or whatever is the latest chip as you read this.

Forget portability in AI! The simplest case is the one where you don’t have to worry about any of this. And this is often the case for an AI workload, where you have control over all of the Linux machines with their eight-packs of H100s. You only need to compile for this one platform. Hence, stop reading this section, because you don’t care about portability: just compile it for your one platform and go to lunch.

Summary of Commands and API Calls

There are a lot of details in the discussion below, but let’s do a quick summary of the things that you might need. Here are some of the Linux commands you might use:

  • nvcc --version
  • nvidia-smi
  • whereis cuda
  • which nvcc

Here are some of the many nvcc compiler flags:

  • -g or --debug — CUDA compiler flag for compilation in debug mode, with extra debug information put into the executable (i.e., similar to “-g” flag for GCC).
  • -G or --device-debug — CUDA compiler option for “device debug” mode, when compiling CUDA C++ code that runs on the GPU.
  • -lineinfo or --generate-line-info — NVCC generates extra information for profiling.
  • -pg or --profile — generates profiler information for use with gprof.

Here are the CUDA C++ preprocessor macros defined during nvcc compilation, which mostly have a double underscore as both prefix and suffix:

  • __NVCC__ — predefined preprocessor macro when compiling in nvcc.
  • __CUDACC__ — another preprocessor macro when compiling CUDA C++.
  • CUDART_VERSION — CUDA Runtime version as a number (preprocessor macro).
  • __CUDA_ARCH__ — GPU architecture preprocessor macro as a constant number (but be warned that this works in device code only and is undefined in host code).
  • __CUDACC_DEBUG__ — preprocessor macro set when compiling in debug mode.

Here are the CUDA Runtime C++ API calls:

  • cudaRuntimeGetVersion — CUDA Runtime version (C++ function call).
  • cudaDeviceGetAttribute — get attributes of a GPU device (the similarly named cuDeviceGetAttribute is the Driver API equivalent).
  • cudaGetDeviceCount — how many GPUs on this box?
  • cudaGetDeviceProperties — get properties of the current GPU.
  • cudaSetDevice — set the current GPU device, so you can query its properties.
  • cudaDriverGetVersion — CUDA driver version details.

I won’t be insulted if you stop reading now and hit Stack Overflow instead.

Detailed CUDA Portability

CUDA compilation model. Supporting multiple platforms makes compilation more complicated. The compilation model in CUDA has support for several types of files:

  • Executable files (e.g., Linux executables)
  • Binary files (“.cubin”)
  • PTX assembly files
  • Non-CUDA C++ source files
  • CUDA C++ files (“.cu”)

I’ve mixed some host and device code issues together here, but I don’t feel bad because that’s what CUDA does inside its C++ programs. Anyway, let’s split it out.

Host code portability. The host code is like a normal non-CUDA C++ program. You need it to compile into a native binary, just as you would any other C++ program on Linux or Windows. The output from compiling host code is a native executable file (not a “.cubin” file).

The nvcc compiler can do this, but it’s not really doing everything itself. Behind the scenes, it actually calls another non-CUDA C++ compiler, such as g++ on Linux.

For the host code, nvcc generates an intermediate C++ format, with all the CUDA syntax removed (e.g., __global__ and the “<<<...>>>” triple chevron syntax). Hence, nvcc acts like a cross compiler that outputs C++ as its target language.

Beyond this, the portability issues for getting the host code running on Linux versus Windows versus MacOS are the same types of concerns as for a non-CUDA C++ program. There are literally whole books on C++ portability, so we’ll be here for a while if I get started.

Device code portability. Where CUDA really shines is its support for multiple GPU chips. I mean, only NVIDIA ones, but it’s still great. You can use nvcc to output two low-level formats:

  • CUDA binary files (“.cubin”)
  • PTX assembly language files

The binary files are specific to each GPU, and are machine code for the GPU chip. Hence, you cannot just copy a “.cubin” file from one GPU architecture to another. You have to specify the target GPU architecture when you create a binary file.

To support multiple GPU types in your application, you’ve got two main options for your build process:

  • Manage lots of “.cubin” files (not recommended), or
  • Compile to PTX assembly language

PTX is a text-based assembly language format that’s much lower level than C++. The PTX assembly language files are further compiled to binary code by the GPU’s device driver. What this really means is that every GPU device driver contains an assembler, and does “just-in-time compilation” to create machine code from PTX (really, shouldn’t it be called “just-in-time assembling”?). The command-line version of the PTX assembler is called ptxas.

Note that the PTX language is not fully compatible across all GPU architectures. There are some options that control which level of “compute compatibility” need to be supported in the output PTX files. Hence, this adds another wrinkle to the build process, although maybe you won’t be using any of the less powerful GPUs.
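For example, a build line along these lines (the architecture numbers are illustrative; adjust for the GPUs you actually target) embeds binaries for two specific architectures plus forward-compatible PTX:

```shell
# Embed SASS binaries for sm_80 and sm_90, plus PTX for compute_90,
# so newer GPUs can JIT-compile the kernel at load time.
nvcc -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_90,code=sm_90 \
     -gencode arch=compute_90,code=compute_90 \
     kernel.cu -o kernel
```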

And just to confuse matters, there’s a third option called “just-in-time compilation” of C++. This is where you can actually distribute the device code’s CUDA C++ source code to multiple GPUs, rather than using binary or PTX assembly files. The NVRTC library can compile CUDA C++ files to PTX on the fly, which can then be assembled to binary code by the GPU device driver.

Summary. Let’s wrap up this portability discussion with an overview of the various options.

  • One CPU, one GPU — just use nvcc to build Linux executables and “.cubin” device binary files.
  • One CPU, many GPUs — compile to PTX, or to binary, or use just-in-time NVRTC C++ compilation.
  • Many CPUs, many GPUs — my head hurts; let’s outsource.

Detecting Host versus Device Code

The simplest way to separate host and device code is to use different functions. It’s a basic separation with “__global__” or “__device__” for device functions, and either no specifier or “__host__” for host code.

An even purer method is to separate the host code into its own source code file. In some cases, you could even have the basic C++ functions for host code in a non-CUDA C++ source file, or even link in a simple C++ non-CUDA library (e.g., via g++ options).

But none of that is CUDA style! After all, the “U” in CUDA means “Unified” and we’re supposed to smash it all into one source file. Hence, if you want to do different things on the host and the device, you need to detect it in the C++ code itself.

Preprocessor macro method. Whether the code is run on the host or the device can be detected at compile-time. The simplest way is to use a preprocessor macro.

    #if __CUDA_ARCH__ 
       // Device code
    #else
       // Host code
    #endif

Another alternative way is:

    #ifdef __CUDA_ARCH__
       // Device code
    #endif

And for host code:

    #if !defined(__CUDA_ARCH__)
       // Host code
    #endif

Build your own symbols. Maybe you want it to look clearer in the code?

    #if IS_DEVICE_CODE
      // kernel
    #else
      // host
    #endif

To permit this, you can define your own macros to hide these details. Note that this idea won’t work in a header file:

    #ifdef __CUDA_ARCH__   // Fails!
    #define IS_DEVICE_CODE 1
    #else
    #define IS_DEVICE_CODE 0
    #endif

The above idea fails because __CUDA_ARCH__ is evaluated by nvcc at the point where the header file is processed, where it is treated as host code, so the macro would always be defined as 0. Instead, this should work in a header, by making the expansion of your macro happen later:

    #define IS_DEVICE_CODE  ( __CUDA_ARCH__ > 0 )  // Better
    #define IS_HOST_CODE  ( __CUDA_ARCH__ == 0 )

Note that these will work in preprocessor expressions (e.g., #if), but not at runtime in “if” tests, where a compilation error will result. The undefined value of the __CUDA_ARCH__ macro name in host code defaults to zero in preprocessor conditional expressions, but not elsewhere in C++ statements.

Detecting GPU Architectures in Device C++

You can detect the “compute capability” of your NVIDIA GPU within device code using the “__CUDA_ARCH__” preprocessor macro. This macro is not set in host code, which can be used to distinguish host versus device code, as already discussed above.

The main use of this macro is to select different code paths for more capable GPUs. Here’s a compile-time example that uses faster code when the compute capability is high enough, with fallback code for an older, less capable GPU:

    #if __CUDA_ARCH__ >= 800
       // Compute capability 8.0 and above
    #else
       // Less capable GPU
    #endif

Is CUDA Installed?

You can check on a Linux box whether the CUDA Toolkit software is installed in various ways. Here’s a selection of commands you can use. First, you can just try to run the compiler:

    nvcc

Here’s the output:

    nvcc fatal   : No input files specified; use option --help for more information

Use the whereis command on Linux:

    whereis cuda

The output is:

    cuda: /usr/local/cuda

You can list the CUDA file directory:

    ls /usr/local/cuda/

Here’s the output file listing:

    bin compute-sanitizer  extras  include  nvml  res  src
    compat      doc        gds     lib64    nvvm  share  targets

If CUDA is not installed, you get an error with most of these commands. Simples.

Detecting CUDA Version

Using the nvcc compiler’s version flag is one way:

    nvcc --version

If you’re running in Google Colab, you’ll need to add a prefix “!” to the Cell command to run any of these Linux shell commands properly. The command in a new “+Code” cell is simply:

    !nvcc --version

Here’s the output I get, which shows “12.2” in various ways:

    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2023 NVIDIA Corporation
    Built on Tue_Aug_15_22:02:13_PDT_2023
    Cuda compilation tools, release 12.2, V12.2.140
    Build cuda_12.2.r12.2/compiler.33191640_0

version.txt disappeared. According to the internet (i.e., Stack Overflow), the installed version of CUDA Runtime is stored in a text file on Linux:

    cat /usr/local/cuda/version.txt

But it doesn’t work. Although there is a directory /usr/local/cuda/, here’s what I get on my Google Colab virtual box running CUDA 12:

    cat: /usr/local/cuda/version.txt: No such file or directory

So, it looks like version.txt is gone, at least by CUDA version 12.

nvidia-smi command. You can’t really use the nvidia-smi command for this purpose, because it inspects your GPU chip and driver, rather than the CUDA Toolkit software install. The command is simply:

    nvidia-smi

Here’s the output:

Sun Sep 29 04:43:04 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Although it says “CUDA Version 12.2”, this is the highest CUDA version supported by the installed driver, not the version of the CUDA Toolkit software. It’s not a reliable indicator of the software install, as the driver can support a newer CUDA version than what’s currently installed. (And wow, someone at NVIDIA really took some time to make the columns line up and pretty up the ASCII layout, because coding this is fiddly work!)

Mixing CUDA and Non-CUDA C++

There are times when you want to use the same shared C++ code in both CUDA and non-CUDA applications. How can you tell which type of compiler is running? One way is to detect preprocessor macros:

  • __NVCC__ is set when nvcc is compiling.
  • __CUDACC__ is also set when a CUDA C++ compiler is compiling (i.e., nvcc).
  • __CUDA_ARCH__ specifies host mode or various device architectures.

Hence, you can detect CUDA versus non-CUDA compilation via methods such as:

    #ifdef __NVCC__
      // CUDA C++
    #else
      // Non-CUDA C++
    #endif

There are a number of other supporting preprocessor macros that can be used to further identify compiler versions:

  • __CUDACC_VER_MAJOR__ is the major version number of the nvcc compiler.
  • __CUDACC_VER_MINOR__ is the minor version number.
  • __CUDACC_VER_BUILD__ is the build number.

There are also some preprocessor macros that indicate the “modes” that nvcc is compiling in:

  • __CUDACC_DEBUG__ for device-debug mode.
  • __CUDACC_RDC__ for relocatable device code mode.
  • __CUDACC_EWP__ for extensible whole program mode.

CUDA Portability Traps

There are a few traps in coding portable code:

  • __CUDA_ARCH__ is an undefined macro in host code.
  • Misspelled preprocessor macro names are not flagged; they silently evaluate to zero in #if expressions.

To the point about undefined C++ preprocessor macros, here’s a bug:

    #if NVCC  // BUG!
      // CUDA-only code
    #endif

This is a typo of __NVCC__, but it’s also a silent bug. C++ converts unknown symbols in #if expressions to 0, so this fails. Here’s a little trick for your header file:

    #define NVCC Maybe you meant __NVCC__ ?
    #define __NVCC Maybe you meant __NVCC__ ?
    #define __NVCC_ Maybe you meant __NVCC__ ?
    #define _NVCC_ Maybe you meant __NVCC__ ?
    #define NVCC__ Maybe you meant __NVCC__ ?

Now you’ll get compiler errors if you typo them in an #if expression. Unfortunately, I don’t have a trick for #ifdef or the defined operator, so this is still a silent bug:

    #ifdef NVCC  // Wrong!

An alternative strategy would be to tolerate accidental typos of macro names by adding this in your header file:

    #ifdef __NVCC__
    #define NVCC __NVCC__
    #define _NVCC_ __NVCC__
    #endif

It might be easier to just use a grep command on your C++ source code files:

    grep -r NVCC | grep -v __NVCC__

This doesn’t actually catch all cases, such as mixing them, but it’s probably good enough. Alternatively, you can directly scan for all the badly written versions, using regular expressions to avoid matching the correct one.

You need multiple versions for each of the other preprocessor macros, such as __CUDA_ARCH__ as well:

    grep -r CUDA_ARCH | grep -v __CUDA_ARCH__
    grep -r CUDAARCH

And then you have to add it to your build scripts.

C++ Operator Portability Pitfalls

Most of the low-level arithmetic code in C++ algorithms looks quite standardized. Well, not so much. The general areas where C++ code that looks standard is actually non-portable include trappy issues such as:

  • Arithmetic overflow of integer or float operators.
  • Integer % remainder and / division operators on negatives.
  • Right bitshift operator >> on a negative signed integer is not division.
  • Divide-by-zero doesn’t always crash on all CPUs and GPUs.
  • Order of evaluation of expression operands (e.g., with side-effects).
  • Order of evaluation of function arguments.
  • Functions that should be Boolean are not always (e.g., isdigit, isalpha).
  • Functions that don’t return well-defined results (e.g., strcmp, memcmp).
  • Initialization order for static or global objects is undefined.
  • memcmp is not an array equality test for non-basic types (e.g., structures).

Note that these errors are not only portability problems, but can arise in any C++ program. In particular, different levels of optimization in C++ compilers may cause different computations, leading to insidious bugs.

Signed right bitshift is not division

The shift operators << and >> are often used to replace multiplication or division by a power of 2 as a low-level optimization. However, it is dangerous to use >> on negative numbers. Right shift is not equivalent to division for negative values. Note that the problem does not arise for unsigned data types, which are never negative, and for which right shifting is always a division.

There are two separate issues involved in shifting signed types with negative values: first, that the compiler may choose between two distinct methods of implementing >>, and second, that neither of these approaches is equivalent to division (although one approach is often close). Prior to C++20, it is implementation-defined whether >> on negative values will:

    (a) sign extend, or

    (b) shift in zero bits.

Each compiler must choose one of these methods, document it, and use it consistently for all applications of the >> operator (since C++20, sign extension is required). The use of shifting in zero bits is never equal to division for a negative number, since it shifts a zero bit into the sign bit, causing the result to be a nonnegative integer (dividing a negative number by two and getting a positive result is not division!). Shifting in zero bits is always used for unsigned types, which explains why right shifting on unsigned types is a division.

Divide and remainder on negative integers

Extreme care is needed when the integer division and remainder operators / and % are applied to negative values. Actually, no, forget that, because you should never use division or remainder in a kernel anyway, and if you must, then choose a power of two and use bitwise operations instead: division by a power of two is an unsigned right bitshift, and remainder is a bitwise-and (with one less than the power of two).

Anyway, another reason to avoid these operators occurs with negatives. Problems arise if a program assumes, for example, that -7/2 equals -3 (rather than -4). Prior to C++11, the direction of truncation of the / operator was implementation-defined if either operand was negative; since C++11, truncation is toward zero, but older compilers and other languages may differ.

Order of evaluation errors

Humans would assume that expressions are evaluated left-to-right. However, in C++ the order of evaluation of operands for most binary operators is unspecified, and modifying the same object twice without sequencing is undefined behavior. This makes it possible for compilers to apply very good optimizing algorithms to the code. Unfortunately, it also leads to some problems that the programmer must be aware of.

To see the effect of side effects, consider the increment operator in the expression below, which is a dangerous side effect:

    y = (x++) + (x * 2);

Because the order of evaluation of the addition operator is not specified, there are two orders in which the expression could actually be executed. The programmer’s intended order is left-to-right:

    temp = x++;
    y = (temp) + (x * 2);

The other incorrect order is right-to-left:

    temp = x * 2;
    y = (x++) + (temp);

In the first case, the increment occurs before x*2 is evaluated. In the second, the increment occurs after x*2 has been evaluated. Obviously, the two interpretations give different results. This is a bug because it is undefined which order the compiler will choose.

Function-call side effects

If there are two function calls in the one expression, the order of the function calls can be important. For example, consider the code below:

    x = f() + g();

Our first instinct is to assume a left-to-right evaluation of the “+” operator. If both functions produce output or both modify the same global variable, the result of the expression may depend on the order of evaluation of the “+” operator, which is undefined in C++.

Order of evaluation of assignment operator

Order of evaluation errors are a complicated problem. Most binary operators have unspecified order of evaluation — even the assignment operators. A simple assignment statement can be the cause of an error. This error can occur in assignment statements such as:

   a[i] = i++;   // Bug

The problem here is that “i” has a side effect applied to it (i.e., ++), and is also used without a side effect. Before C++17, the order of evaluation of the = operator was unspecified, so it was undefined whether the increment side effect occurs before or after the evaluation of i in the array index. C++17 requires the right-hand side to be evaluated first, but the code is still confusing and non-portable to older compilers.

Function-call arguments

Another form of the order of evaluation problem occurs because the order of the evaluation of arguments to a function call is not specified in C++. It is not necessarily left-to-right, as the programmer expects it to be. For example, consider the function call:

    fn(a++, a);  // Bug

Which argument is evaluated first? Is the second argument the new or old value of a? The order is unspecified in C++, and before C++17 this code was outright undefined behavior.

Order of initialization of static objects

A special order of evaluation error exists because the order of initialization of static or global objects is not defined across files. Within a single file the ordering is the same as the textual appearance of the definitions. For example, the Chicken object is always initialized before the Egg object in the following code:

    Chicken chicken; // Chicken comes first
    Egg egg;

Note that within a single declaration, the declarators are also initialized in textual order, so c1 is constructed before c2 in the code below:

    Chicken c1, c2;

If the declarations of the global objects “chicken” and “egg” appear in different files that are linked together using independent compilation, it is undefined which will be constructed first.

memcmp cannot test array equality

For equality tests on many types of arrays, the memcmp function might seem an efficient way to test if two arrays are exactly equal. However, it only works in a few simple situations (e.g., arrays of int), and is buggy in several cases:

  • Floating-point has two zeros (positive and negative zero), so it fails.
  • Floating-point also has multiple numbers representing NaN (not-a-number).
  • If there’s any padding in the array, such as arrays of objects or structures.
  • Bit-field data members may have undefined padding.

You can’t skip a proper element-wise comparison by just looking at the bytes.

Data Type Sizes

There are a variety of portability issues with the sizes of basic data types in C++. Some of the problems include:

  • Fundamental data type byte sizes (e.g., how many bytes is an “int”).
  • Pointer versus integer sizes (e.g., do void pointers fit inside an int?).
  • size_t is usually unsigned long, not unsigned int.

Typical AI engines work with 32-bit floating-point (the float type). Note that for 32-bit integers you cannot assume that int is 32 bits, but must use a specific fixed-width type. Furthermore, if you assume that short is 16-bit, int is 32-bit, and long is 64-bit, well, you’d be incorrect on some platforms: long is 64 bits on 64-bit Linux (LP64) but only 32 bits on 64-bit Windows (LLP64), and the C++ standard only requires relative sizes, such as that long is at least as big as int.

Your startup portability check should check that sizes are what you want:

    // Test basic numeric sizes
    aussie_assert(sizeof(int) == 4);
    aussie_assert(sizeof(float) == 4);
    aussie_assert(sizeof(short) == 2);

Another more efficient way is the compile-time static_assert method:

    static_assert(sizeof(int) == 4);
    static_assert(sizeof(float) == 4);
    static_assert(sizeof(short) == 2);

And you should also print them out in a report, or to a log file, for supportability reasons. Here’s a useful way with a macro that uses the “#” stringize preprocessor operator and also the standard adjacent string concatenation feature of C++.

    #define PRINT_TYPE_SIZE(type) \
        printf("Config: sizeof " #type " = %d bytes (%d bits)\n", \
        (int)sizeof(type), 8*(int)sizeof(type));

You can print out whatever types you need:

    PRINT_TYPE_SIZE(int);
    PRINT_TYPE_SIZE(float);
    PRINT_TYPE_SIZE(short);

Here’s the output on my Windows laptop with MSVS:

    Config: sizeof int = 4 bytes (32 bits)
    Config: sizeof float = 4 bytes (32 bits)
    Config: sizeof short = 2 bytes (16 bits)

Standard Library Types: Other data types to consider are the built-in ones in the standard library. I’m looking at you, size_t and time_t, and a few others that belong on Santa’s naughty list. People often assume that size_t is the same as “unsigned int” but it’s usually “unsigned long” (or “unsigned long long” on 64-bit Windows). Here’s a partial solution:

    PRINT_TYPE_SIZE(size_t);
    PRINT_TYPE_SIZE(clock_t);
    PRINT_TYPE_SIZE(ptrdiff_t);

Data Representation Pitfalls

Portability of C++ to platforms also has data representation issues such as:

  • Floating-point oddities (e.g., negative zero, Inf, and NaN).
  • Whether “char” means “signed char” or “unsigned char”.
  • Endian-ness of integer byte storage (i.e., do you prefer “big endian” or “little endian”?).
  • Whether zero bytes represent zero integers, zero floating-point, and null pointers.

Zero is not always zero? You probably assume that a 4-byte integer containing “0” has all four individual bytes equal to zero. It seems completely reasonable, and is correct on many platforms, but not all. There’s a theoretical portability problem on a few obscure platforms. There are computers where integer zero or floating-point 0.0 is not four zero bytes. If you want to check, here’s a few lines of code for your platform portability self-check code at startup:

    int i2 = 0;
    unsigned char* cptr2 = (unsigned char*)&i2;
    for (size_t i = 0; i < sizeof(int); i++) {
        assert(cptr2[i] == 0);
    }

Are null pointers all-bytes-zero, too? Here’s the code to check NULL in a “char*” type:

    // Test pointer NULL portability
    char *ptr1 = NULL;
    unsigned char* cptr3 = (unsigned char*)&ptr1;
    for (size_t i = 0; i < sizeof(char*); i++) {
        assert(cptr3[i] == 0);
    }

What about 0.0 in floating-point? You can test it explicitly with portability self-testing code:

    // Test float zero portability
    float f1 = 0.0f;
    unsigned char* cptr4 = (unsigned char*)&f1;
    for (size_t i = 0; i < sizeof(float); i++) {
        assert(cptr4[i] == 0);
    }

It is important to include these tests in a portability self-test, because you’re relying on this whenever you use memset or calloc.

Pointers versus Integer Sizes

You didn’t hear this from me, but apparently you can store pointers in integers, and vice-versa, in C++ code. Weirdly, you can even get paid for doing this. But it only works if the byte sizes are big enough, and it’s best to self-test this portability risk during program startup. What exactly you want to test depends on what you’re (not) doing, but here’s one example:

    // Test LONGs can be stored in pointers
    aussie_assert(sizeof(char*) >= sizeof(long));
    aussie_assert(sizeof(void*) >= sizeof(long));
    aussie_assert(sizeof(int*) >= sizeof(long));
    // ... and more

Note that a better version in modern C++ would use “static_assert” to test these sizes at compile-time, with zero runtime cost.

    static_assert(sizeof(char*) >= sizeof(long));
    static_assert(sizeof(void*) >= sizeof(long));
    static_assert(sizeof(int*) >= sizeof(long));

In this way, you can perfectly safely mix pointers and integers in a single variable. Just don’t tell the SOC compliance officer.
