Aussie AI
Chapter 17. CUDA Portability
Book Excerpt from "CUDA C++ Debugging: Safer GPU Kernel Programming"
by David Spuler
Portability of CUDA C++ Applications
The portability model of CUDA programs to multiple architectures is quite complicated. Hence, let’s start with the most basic point about CUDA:
Only NVIDIA GPUs are supported.
Beyond that, things get more complicated. There are two specific issues for the portability of your code:
1. Host code portability (CPU), and
2. Device code portability (GPU).
If you’re trying to run an AI application in the data center, then it’s probably running the host code on Linux and the device code on an H100 GPU. But if you’re using CUDA to write an application for gaming or video editing on a desktop PC, then the host code is running on an x86 CPU, and the GPU is a graphics card like a GeForce RTX 4090 or whatever is the latest chip as you read this.
Forget portability in AI! The simplest case is the one where you don’t have to worry about any of this, and that is often the case for an AI workload, where you have control over all of the Linux machines with their eight-packs of H100s. You only need to compile for this one platform. Hence, stop reading this section, because you don’t care about portability: just compile for your one platform and go to lunch.
Summary of Commands and API Calls
There are a lot of details in the discussion below, but let’s do a quick summary of the things that you might need. Here are some of the Linux commands you might use:
- nvcc --version
- nvidia-smi
- whereis cuda
- which nvcc
Here are some of the many nvcc compiler flags:
- -g or --debug — CUDA compiler flag for compilation in debug mode, with extra debug information put into the executable (i.e., similar to the “-g” flag for GCC).
- -G or --device-debug — CUDA compiler option for “device debug” mode, when compiling CUDA C++ code that runs on the GPU.
- -lineinfo or --generate-line-info — NVCC generates extra line information for profiling.
- -pg or --profile — generates profiler information for use with gprof.
Here are the CUDA C++ preprocessor macros defined
during nvcc compilation,
which mostly have a double underscore as both prefix and suffix:
- __NVCC__ — predefined preprocessor macro when compiling in nvcc.
- __CUDACC__ — another preprocessor macro when compiling CUDA C++.
- CUDART_VERSION — CUDA Runtime version as a number (preprocessor macro).
- __CUDA_ARCH__ — GPU architecture preprocessor macro as a constant number (but be warned that this works in device code only and is undefined in host code).
- __CUDACC_DEBUG__ — preprocessor macro set when compiling in debug mode.
Here are the CUDA Runtime C++ API calls:
- cudaRuntimeGetVersion — CUDA Runtime version (C++ function call).
- cuDeviceGetAttribute — get attributes of the current GPU device.
- cudaGetDeviceCount — how many GPUs on this box?
- cudaGetDeviceProperties — get properties of the current GPU.
- cudaSetDevice — set the current GPU device, so you can query its properties.
- cudaDriverGetVersion — CUDA driver version details.
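For a taste of these API calls, here’s a minimal host-side sketch (error checking omitted for brevity) that loops over the GPUs on the box and reports the name and compute capability of each:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);  // how many GPUs on this box?
    for (int i = 0; i < n; i++) {
        cudaSetDevice(i);  // make GPU i the current device
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("GPU %d: %s (compute capability %d.%d)\n",
            i, prop.name, prop.major, prop.minor);
    }
    return 0;
}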
I won’t be insulted if you stop reading now and hit Stack Overflow instead.
Detailed CUDA Portability
CUDA compilation model. Supporting multiple platforms is more complicated. The compilation model in CUDA has support for several types of files:
- Executable files (e.g., Linux executables)
- Binary files (“.cubin”)
- PTX assembly files
- Non-CUDA C++ source files
- CUDA C++ files (“.cu”)
I’ve mixed some host and device code issues together here, but I don’t feel bad because that’s what CUDA does inside its C++ programs. Anyway, let’s split it out.
Host code portability.
The host code is like a normal non-CUDA C++ program.
You need it to compile into a native binary, just
as you would any other C++ program on Linux or Windows.
The output from compiling host code is a native executable file (not a “.cubin” file).
The nvcc compiler can do this, but it’s not really
doing everything itself.
Behind the scenes, it actually calls another non-CUDA C++ compiler,
such as g++ on Linux.
For the host code, nvcc generates an intermediate C++ format,
with all the CUDA syntax removed (e.g., __global__ and the “<<<...>>>” triple chevron syntax).
Hence, nvcc acts like a cross compiler that outputs C++ as its target language.
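For example, here’s a hypothetical build line where the -ccbin option tells nvcc which host compiler to delegate to:

nvcc -ccbin g++ -o myapp main.cu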
Beyond this, the portability issues for getting the host code running on Linux versus Windows versus MacOS are the same types of concerns as for a non-CUDA C++ program. There are literally whole books on C++ portability, so we’ll be here for a while if I get started.
Device code portability.
Where CUDA really shines is its support for multiple GPU chips.
I mean, only NVIDIA ones, but it’s still great.
You can use nvcc to output two low-level formats:
- CUDA binary files (“.cubin”)
- PTX assembly language files
The binary files are specific to each GPU,
and are machine code for the GPU chip.
Hence, you cannot just copy a “.cubin” file from one GPU architecture to another.
You have to specify the target GPU architecture when you create a binary file.
To support multiple GPU types in your application, you’ve got two main options for your build process:
- Manage lots of “.cubin” files (not recommended), or
- Compile to PTX assembly language
PTX is a text-based assembly language format that’s much lower level than C++.
The PTX assembly language files are further compiled to binary code by the GPU’s device driver.
What this really means is that every GPU device driver contains an assembler,
and does “just-in-time compilation” to create machine code from PTX
(really, shouldn’t it be called “just-in-time assembling”?).
The command-line version of the PTX assembler is called ptxas.
Note that the PTX language is not fully compatible across all GPU architectures. There are some options that control which level of “compute capability” needs to be supported in the output PTX files. Hence, this adds another wrinkle to the build process, although maybe you won’t be using any of the less powerful GPUs.
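As a rough illustration of the build commands (hypothetical file names; -arch and -gencode select the virtual and real architectures):

nvcc -cubin -arch=sm_80 kernel.cu -o kernel.cubin
nvcc -ptx -arch=compute_70 kernel.cu -o kernel.ptx
nvcc -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 app.cu -o app

The last line builds a “fat binary” containing code for two architectures in one executable.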
And just to confuse matters, there’s a third option called “just-in-time compilation” of C++. This is where you can actually distribute the device code’s CUDA C++ source code to multiple GPUs, rather than using binary or PTX assembly files. The NVRTC library can compile CUDA C++ files to PTX on the fly, which can then be assembled to binary code by the GPU device driver.
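Here’s a minimal sketch of the NVRTC approach (error checking omitted; link with -lnvrtc), compiling a kernel from a source string into PTX text at runtime:

#include <cstdio>
#include <string>
#include <nvrtc.h>

int main() {
    const char* src = "__global__ void k(float *x) { x[threadIdx.x] *= 2.0f; }";
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "k.cu", 0, nullptr, nullptr);
    const char* opts[] = { "--gpu-architecture=compute_70" };
    nvrtcCompileProgram(prog, 1, opts);  // CUDA C++ source to PTX
    size_t n = 0;
    nvrtcGetPTXSize(prog, &n);
    std::string ptx(n, '\0');
    nvrtcGetPTX(prog, &ptx[0]);
    printf("%s", ptx.c_str());  // PTX text, ready for the driver to JIT
    nvrtcDestroyProgram(&prog);
    return 0;
}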
Summary. Let’s wrap up this portability discussion with an overview of the various options.
- One CPU, one GPU — just use nvcc to build Linux executables and “.cubin” device binary files.
- One CPU, many GPUs — compile to PTX, or to binary, or use just-in-time NVRTC C++ compilation.
- Many CPUs, many GPUs — my head hurts; let’s outsource.
Detecting Host versus Device Code
The simplest way to separate host and device code is to use different functions.
It’s a basic separation with “__global__” or “__device__” for device functions,
and either no specifier or “__host__” for host code.
An even purer method is to separate the host code into its own source code file.
In some cases, you could even have the basic C++ functions for host code in a non-CUDA C++ source file,
or even link in a simple C++ non-CUDA library (e.g., via g++ options).
But none of that is CUDA style! After all, the “U” in CUDA means “Unified” and we’re supposed to smash it all into one source file. Hence, if you want to do different things on the host and the device, you need to detect it in the C++ code itself.
Preprocessor macro method. Whether the code is run on the host or the device can be detected at compile-time. The simplest way is to use a preprocessor macro.
#if __CUDA_ARCH__
// Device code
#else
// Host code
#endif
Another alternative way is:
#ifdef __CUDA_ARCH__
// Device code
#endif
And for host code:
#if !defined(__CUDA_ARCH__)
// Host code
#endif
Build your own symbols. Maybe you want it to look clearer in the code?
#if IS_DEVICE_CODE
// kernel
#else
// host
#endif
To permit this, you can define your own macros to hide these details. Note that this idea won’t work in a header file:
#ifdef __CUDA_ARCH__ // Fails!
#define IS_DEVICE_CODE 1
#else
#define IS_DEVICE_CODE 0
#endif
The above idea fails because the #ifdef test of __CUDA_ARCH__ is evaluated when nvcc preprocesses your header file, where it is treated as host code, so IS_DEVICE_CODE always ends up defined as 0.
Instead, this should work in a header, by making the expansion of your macro happen later:
#define IS_DEVICE_CODE ( __CUDA_ARCH__ > 0 ) // Better
#define IS_HOST_CODE ( __CUDA_ARCH__ == 0 )
Note that these will work in preprocessor expressions (e.g., #if),
but not at runtime in “if” tests, where a compilation error will result.
The undefined value of the __CUDA_ARCH__ macro name in host code
defaults to zero in preprocessor conditional expressions,
but not elsewhere in C++ statements.
Detecting GPU Architectures in Device C++
You can detect the “compute capability” of your NVIDIA GPU within device code
using the “__CUDA_ARCH__” preprocessor macro.
This macro is not set in host code, which can be used to distinguish host versus device code, as already discussed above.
The main use of this macro is to select different code for more capable GPUs. Here’s an example of how to use faster code with a higher compute capability, but also have fallback code for a lower one on an older GPU. An example of the compile-time method:
#if __CUDA_ARCH__ >= 800
// Compute capability 8.0 and above
#else
// Less capable GPU
#endif
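There’s also a runtime analogue on the host side, if you’ve built multiple kernels and want to pick between them while the program runs. A minimal sketch (device 0, error checking omitted) using the major and minor fields of cudaDeviceProp:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);  // query device 0
if (prop.major >= 8) {
    // launch the kernel built for compute capability 8.0+
} else {
    // launch the fallback kernel for older GPUs
}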
Is CUDA Installed?
You can check on a Linux box whether the CUDA Toolkit software is installed in various ways. Here’s a selection of commands you can use. First, you can just try to run the compiler:
nvcc
Here’s the output:
nvcc fatal : No input files specified; use option --help for more information
Use the whereis command on Linux:
whereis cuda
The output is:
cuda: /usr/local/cuda
You can list the CUDA file directory:
ls /usr/local/cuda/
Here’s the output file listing:
bin compute-sanitizer extras include nvml res src
compat doc gds lib64 nvvm share targets
If CUDA is not installed, you get an error with most of these commands. Simples.
Detecting CUDA Version
Using the nvcc compiler’s version flag is one way:
nvcc --version
If you’re running in Google Colab, you’ll need to add a prefix “!” to the Cell command to run any of these Linux shell commands properly.
The command in a new “+Code” cell is simply:
!nvcc --version
Here’s the output I get, which shows “12.2” in various ways:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
version.txt disappeared. According to the internet (i.e., Stack Overflow), the installed version of CUDA Runtime is stored in a text file on Linux:
cat /usr/local/cuda/version.txt
But it doesn’t work. Although there is a directory /usr/local/cuda/,
here’s what I get on my Google Colab virtual box running CUDA 12:
cat: /usr/local/cuda/version.txt: No such file or directory
So, it looks like version.txt is gone, at least by CUDA version 12.
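You can also query the version in code, using the Runtime API calls listed earlier. The returned integer encodes major*1000 + minor*10, so CUDA 12.2 reports 12020. A minimal sketch:

int rt = 0, drv = 0;
cudaRuntimeGetVersion(&rt);   // installed runtime, e.g., 12020 for 12.2
cudaDriverGetVersion(&drv);   // highest CUDA version the driver supports
printf("CUDA Runtime %d, driver supports %d\n", rt, drv);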
nvidia-smi command.
You can’t really use the nvidia-smi command for this issue either, because it is inspecting your GPU chip’s capabilities, rather than the CUDA Toolkit software install.
The command is simply:
nvidia-smi
Here’s the output:
Sun Sep 29 04:43:04 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 35C P8 9W / 70W | 0MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Although it says “CUDA Version: 12.2”, this is talking about the hardware and driver, not the software install. It’s not a reliable indicator of the Toolkit install, as it reports the highest CUDA version that the installed driver supports, rather than what’s currently installed. (And wow, someone at NVIDIA really took some time to make the columns line up and pretty up the ASCII layout, because coding this is fiddly work!)
Mixing CUDA and Non-CUDA C++
There are times when you want to use the same shared C++ code in both CUDA and non-CUDA applications. How can you tell which type of compiler is running? One way is to detect preprocessor macros:
- __NVCC__ is set when nvcc is compiling.
- __CUDACC__ is also set when a CUDA C++ compiler is compiling (i.e., nvcc).
- __CUDA_ARCH__ specifies host mode or various device architectures.
Hence, you can detect CUDA versus non-CUDA compilation via methods such as:
#ifdef __NVCC__
// CUDA C++
#else
// Non-CUDA C++
#endif
There are a number of other supporting preprocessor macros that can be used to further identify compiler versions:
- __CUDACC_VER_MAJOR__ is the major version number of the nvcc compiler.
- __CUDACC_VER_MINOR__ is the minor version number.
- __CUDACC_VER_BUILD__ is the build number.
There are also some preprocessor macros that indicate the “modes”
that nvcc is compiling in:
- __CUDACC_DEBUG__ for device-debug mode.
- __CUDACC_RDC__ for relocatable device code mode.
- __CUDACC_EWP__ for extensible whole program mode.
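For instance, here’s a small sketch that logs the compiler version into your startup report:

#ifdef __CUDACC__
printf("Compiled by nvcc %d.%d.%d\n",
    __CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__, __CUDACC_VER_BUILD__);
#endif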
CUDA Portability Traps
There are a few traps in coding portable code:
- __CUDA_ARCH__ is an undefined macro in host code.
- Misspelled preprocessor macros are not flagged in C++ #if expressions.
To the point about undefined C++ preprocessor macros, here’s a bug:
#if NVCC // BUG!
// CUDA-only code
#endif
This is a typo of __NVCC__, but it’s also a silent bug. The C++ preprocessor converts unknown symbols in #if expressions to 0, so the CUDA-only code is silently excluded.
Here’s a little trick for your header file:
#define NVCC Maybe you meant __NVCC__ ?
#define __NVCC Maybe you meant __NVCC__ ?
#define __NVCC_ Maybe you meant __NVCC__ ?
#define _NVCC_ Maybe you meant __NVCC__ ?
#define NVCC__ Maybe you meant __NVCC__ ?
Now you’ll get compiler errors if you typo them in an #if expression.
Unfortunately, I don’t have a trick for #ifdef or the defined operator, so this is still a silent bug:
#ifdef NVCC // Wrong!
An alternative strategy would be to tolerate accidental typos of macro names by adding this in your header file:
#ifdef __NVCC__
#define NVCC __NVCC__
#define _NVCC_ __NVCC__
#endif
It might be easier to just use a grep command on your C++ source code files:
grep -r NVCC | grep -v __NVCC__
This doesn’t actually catch all cases, such as mixing them, but it’s probably good enough. Alternatively, you can directly scan for all the badly written versions, using regular expressions to avoid matching the correct one.
You need multiple versions for each of the other processor macros, such as __CUDA_ARCH__ as well:
grep -r CUDA_ARCH | grep -v __CUDA_ARCH__
grep -r CUDAARCH
And then you have to add it to your build scripts.
C++ Operator Portability Pitfalls
Most of the low-level arithmetic code in C++ algorithms looks quite standardized. Well, not so much. The general areas where C++ code that looks standard is actually non-portable include trappy issues such as:
- Arithmetic overflow of integer or float operators.
- Integer % remainder and / division operators on negatives.
- Right bitshift operator >> on a negative signed integer is not division.
- Divide-by-zero doesn’t always crash on all CPUs and GPUs.
- Order of evaluation of expression operands (e.g., with side-effects).
- Order of evaluation of function arguments.
- Functions that should be Boolean are not always (e.g., isdigit, isalpha).
- Functions that don’t return well-defined results (e.g., strcmp, memcmp, etc.).
- Initialization order for static or global objects is undefined.
- memcmp is not an array equality test for non-basic types (e.g., structures).
Note that these errors are not only portability problems, but can arise in any C++ program. In particular, different levels of optimization in C++ compilers may cause different computations, leading to insidious bugs.
Signed right bitshift is not division
The shift operators << and >> are often used to replace multiplication by a power of 2 for a low-level optimization. However, it is dangerous to use >> on negative numbers. Right shift is not equivalent to division for negative values. Note that the problem does not arise for unsigned data types that are never negative, and for which shifting is always a division.
There are two separate issues involved in shifting signed types with negative values: firstly, that the compiler may choose two distinct methods of implementing >>, and secondly, that neither of these approaches is equivalent to division (although one approach is often close). It is implementation-defined whether >> on negative values will:
(a) sign extend, or
(b) shift in zero bits.
Each compiler must choose one of these methods, document it, and use it consistently for all applications of the >> operator. The use of shifting in zero bits is never equal to division for a negative number, since it shifts a zero bit into the sign bit, causing the result to be a nonnegative integer (dividing a negative number by two and getting a positive result is not division!). Shifting in zero bits is always used for unsigned types, which explains why right shifting on unsigned types is a division.
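A one-line self-test makes the difference concrete. On a typical sign-extending compiler, this prints -4 and -3:

int x = -7;
printf("%d %d\n", x >> 1, x / 2);  // shift gives -4, division gives -3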
Divide and remainder on negative integers
Extreme care is needed when the integer division and remainder operators / and % are applied to negative values. Actually, no, forget that, because you should never use division or remainder in a kernel, and if you must, then choose a power-of-two and use bitwise operations instead: division by a power of two is an unsigned right bitshift, and the remainder is a bitwise-and with one less than the divisor.
Anyway, another reason to avoid these operators occurs with negatives. Problems arise if a program assumes, for example, that -7/2 equals -3 (rather than -4). Before C++11, the direction of truncation of the / operator was implementation-defined if either operand was negative; modern C++ mandates truncation toward zero, but older compilers and C dialects vary.
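Here’s a minimal sketch of the power-of-two trick, which is safe on unsigned values:

unsigned n = 100;
unsigned q = n >> 3;  // same as n / 8 for unsigned values
unsigned r = n & 7;   // same as n % 8 for unsigned values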
Order of evaluation errors
Humans would assume that expressions are evaluated left-to-right. However, in C++ the order of evaluation of the operands of most binary operators is unspecified, and modifying the same object twice in one expression without sequencing is undefined behavior. This freedom makes it possible for compilers to apply very good optimizing algorithms to the code. Unfortunately, it also leads to some problems that the programmer must be aware of.
To see the problem, consider the increment operator in the expression below, which is a dangerous side effect.
y = (x++) + (x * 2);
Because the order of evaluation of the addition operator is not specified, there are two orders in which the expression could actually be executed. The programmer’s intended order is left-to-right:
temp = x++;
y = (temp) + (x * 2);
The other, unintended, order is right-to-left:
temp = x * 2;
y = (x++) + (temp);
In the first case, the increment occurs before x*2 is evaluated. In the second, the
increment occurs after x*2 has been evaluated. Obviously, the two interpretations give
different results.
This is a bug because it is undefined which order the compiler will choose.
Function-call side effects
If there are two function calls in the one expression, the order of the function calls can be important. For example, consider the code below:
x = f() + g();
Our first instinct is to assume a left-to-right evaluation of the “+” operator.
If both functions produce output or both modify the same global variable, the result of the
expression may depend on the order of evaluation of the “+” operator,
which is undefined in C++.
Order of evaluation of assignment operator
Order of evaluation errors are a complicated problem. Most binary operators have unspecified order of evaluation — even the assignment operators. A simple assignment statement can be the cause of an error. This error can occur in assignment statements such as:
a[i] = i++; // Bug
The problem here is that “i” has a side effect applied to it (i.e., ++), and is also used without a side
effect.
Because the order of evaluation around the = operator was unspecified before C++17 (which now sequences the right-hand side first), it is undefined on older compilers whether the increment side effect occurs before or after the evaluation of i in the array index.
Function-call arguments
Another form of the order of evaluation problem occurs because the order of the evaluation of arguments to a function call is not specified in C++. It is not necessarily left-to-right, as the programmer expects it to be. For example, consider the function call:
fn(a++, a); // Bug
Which argument is evaluated first? Is the second argument the new or old value of a? The order is actually unspecified in C++.
Order of initialization of static objects
A special order of evaluation error exists because the order of initialization of static or global objects is not defined across files. Within a single file the ordering is the same as the textual appearance of the definitions. For example, the Chicken object is always initialized before the Egg object in the following code:
Chicken chicken; // Chicken comes first
Egg egg;
However, as for any declarations there is no specified left-to-right ordering for
initialization of objects within a single declaration. Therefore, it is undefined which of
c1 or c2 is initialized first in the code below:
Chicken c1, c2;
If the declarations of the global objects “chicken” and “egg” appear in different files that are linked together using independent compilation, it is undefined which will be constructed first.
memcmp cannot test array equality
For equality tests on many types of arrays, the memcmp function might seem an efficient way to test
if two arrays are exactly equal.
However, it only works in a few simple situations (e.g., arrays of int),
and is buggy in several cases:
- Floating-point has two zeros (positive and negative zero), so it fails.
- Floating-point also has multiple numbers representing NaN (not-a-number).
- If there’s any padding in the array, such as arrays of objects or structures.
- Bit-field data members may have undefined padding.
You can’t skip a proper comparison by looking at the bytes.
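For example, the negative-zero failure is easy to demonstrate (assuming <cassert> and <cstring> are included):

float a = 0.0f, b = -0.0f;
assert(a == b);                          // true: +0.0 equals -0.0
assert(memcmp(&a, &b, sizeof(a)) != 0);  // but the bytes differ!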
Data Type Sizes
There are a variety of portability issues with the sizes of basic data types in C++. Some of the problems include:
- Fundamental data type byte sizes (e.g., how many bytes is an “int”).
- Pointer versus integer sizes (e.g., do void pointers fit inside an int?).
- size_t is usually unsigned long, not unsigned int.
Typical AI engines work with 32-bit floating-point (float type).
Note that for 32-bit integers you cannot assume that int is 32 bits, but must use a specific fixed-width type such as int32_t. Furthermore, if you assume that short is 16-bit, int is 32-bit, and long is 64-bit everywhere, well, you’d be incorrect. Most 64-bit platforms keep int at 32 bits but disagree about long (it’s 64-bit on LP64 Linux, but 32-bit on LLP64 Windows), and the C++ standard only requires relative sizes, such as that long is at least as big as int.
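The usual fix is the fixed-width types in <cstdint>; a minimal sketch:

#include <cstdint>
int32_t counter = 0;   // exactly 32 bits on every platform
uint8_t byte_val = 0;  // exactly 8 bits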
Your startup portability check should check that sizes are what you want:
// Test basic numeric sizes
aussie_assert(sizeof(int) == 4);
aussie_assert(sizeof(float) == 4);
aussie_assert(sizeof(short) == 2);
Another more efficient way is the compile-time static_assert method:
static_assert(sizeof(int) == 4);
static_assert(sizeof(float) == 4);
static_assert(sizeof(short) == 2);
And you should also print them out in a report, or to a log file, for supportability reasons.
Here’s a useful way with a macro
that uses the “#” stringize preprocessor operator and also the standard adjacent string concatenation feature of C++.
#define PRINT_TYPE_SIZE(type) \
printf("Config: sizeof " #type " = %d bytes (%d bits)\n", \
(int)sizeof(type), 8*(int)sizeof(type));
You can print out whatever types you need:
PRINT_TYPE_SIZE(int);
PRINT_TYPE_SIZE(float);
PRINT_TYPE_SIZE(short);
Here’s the output on my Windows laptop with MSVS:
Config: sizeof int = 4 bytes (32 bits)
Config: sizeof float = 4 bytes (32 bits)
Config: sizeof short = 2 bytes (16 bits)
Standard Library Types
Other data types to consider are the built-in ones in the standard libraries.
I’m looking at you, size_t and time_t, and a few others that belong on Santa’s naughty list.
People often assume that size_t is the same as “unsigned int”
but it’s actually usually “unsigned long”.
Here’s a partial solution:
PRINT_TYPE_SIZE(size_t);
PRINT_TYPE_SIZE(clock_t);
PRINT_TYPE_SIZE(ptrdiff_t);
Data Representation Pitfalls
Portability of C++ to platforms also has data representation issues such as:
- Floating-point oddities (e.g., negative zero, Inf, and NaN).
- Whether “char” means “signed char” or “unsigned char”.
- Endian-ness of integer byte storage (i.e., do you prefer “big endian” or “little endian”?).
- Whether zero bytes represent zero integers, zero floating-point, and null pointers.
Zero is not always zero? You probably assume that a 4-byte integer containing “0” has all four individual bytes equal to zero. It seems completely reasonable, and is correct on many platforms, but not all. There’s a theoretical portability problem on a few obscure platforms. There are computers where integer zero or floating-point 0.0 is not four zero bytes. If you want to check, here’s a few lines of code for your platform portability self-check code at startup:
int i2 = 0;
unsigned char* cptr2 = (unsigned char*)&i2;
for (int i = 0; i < sizeof(int); i++) {
assert(cptr2[i] == 0);
}
Are null pointers all-bytes-zero, too?
Here’s the code to check NULL in a “char*” type:
// Test pointer NULL portability
char *ptr1 = NULL;
unsigned char* cptr3 = (unsigned char*)&ptr1;
for (int i = 0; i < sizeof(char*); i++) {
assert(cptr3[i] == 0);
}
What about 0.0 in floating-point? You can test it explicitly with portability self-testing code:
// Test float zero portability
float f1 = 0.0f;
unsigned char* cptr4 = (unsigned char*)&f1;
for (int i = 0; i < sizeof(float); i++) {
assert(cptr4[i] == 0);
}
It is important to include these tests in a portability self-test,
because you’re relying on this whenever you use memset or calloc.
Pointers versus Integer Sizes
You didn’t hear this from me, but apparently you can store pointers in integers, and vice-versa, in C++ code. Weirdly, you can even get paid for doing this. But it only works if the byte sizes are big enough, and it’s best to self-test this portability risk during program startup. What exactly you want to test depends on what you’re (not) doing, but here’s one example:
// Test LONGs can be stored in pointers
aussie_assert(sizeof(char*) >= sizeof(long));
aussie_assert(sizeof(void*) >= sizeof(long));
aussie_assert(sizeof(int*) >= sizeof(long));
// ... and more
Note that a better version in modern C++ would use “static_assert” to test these sizes at compile-time,
with zero runtime cost.
static_assert(sizeof(char*) >= sizeof(long));
static_assert(sizeof(void*) >= sizeof(long));
static_assert(sizeof(int*) >= sizeof(long));
In this way, you can perfectly safely mix pointers and integers in a single variable. Just don’t tell the SOC compliance officer.
References
- Horton, Mark, Portable C Software, Prentice Hall, 1990, https://www.amazon.com/Portable-Software-Mark-R-Horton/dp/0138680507.
- Jaeschke, Rex, Portability and the C Language, Hayden Books, 1989, https://www.amazon.com/Portability-Language-Hayden-Books-library/dp/0672484285.
- Lapin, J. E., Portable C and UNIX System Programming, Prentice Hall, 1987, https://www.amazon.com/Portable-Systems-Programming-Prentice-hall-Processing/dp/0136864945.
- Rabinowitz, Henry, and Schaap, Chaim, Portable C, Prentice Hall, 1990, https://www.amazon.com/Portable-C-Prentice-Hall-Software/dp/0136859674.
- Spuler, David, Generative AI in C++, March 2024, https://www.amazon.com/Generative-AI-Coding-Transformers-LLMs-ebook/dp/B0CXJKCWX9/.