Aussie AI Blog

CUDA C++ Job Interview Questions

  • September 23, 2025
  • by David Spuler, Ph.D.

CUDA C++ Job Interview Questions

GPUs Versus CPUs

  1. What is a GPU?

    A GPU is a "graphics processing unit" that doesn't do graphics anymore. Well, games still use GPUs for graphics, but nobody cares about anything other than AI these days. Generally, a GPU is a big "vectorizer" that does lots of arithmetic in parallel. The early GPUs were great at rendering 3D game scenes, where each pixel could be calculated in parallel.

  2. What does a GPU cost?

    More than your bonus. PC gamer GPUs are less than a thousand. Data center GPUs are pricier. If you have venture funding, you buy them using "per wheelbarrow load" pricing (it's a joke; don't say that in your interview).

  3. What are GPUs good at?

    Lots of parallel computations that are homogeneous. You want to be doing the same computations on lots of data, many times over.

  4. What are GPUs not good at?

    Heterogeneous parallelism. If we're trying to do lots of different things at the same time, in parallel, the GPU isn't great. CPU multithreading is better at this.

  5. What optimizations do CPUs have that GPUs don't?

    Each core in a multicore CPU is actually much larger than the "mini-cores" that run in a GPU. Not only do CPU cores run at a faster clock speed, but they also have advanced low-level microarchitectural features that are not present on a GPU, such as speculative execution and branch prediction. The GPU makes up for its simpler cores with scale: many more GPU mini-cores (thousands) than cores in a CPU multicore (dozens).

  6. Can you use a GPU without a CPU?

    Not really. The CPU is still the overarching controller for the whole application in most cases.

  7. Does each GPU need its own CPU?

    No. Advanced architectures can have multiple GPUs for each CPU. There are "multi-GPU" architectures.

  8. GPUs are good for throughput, but CPUs are better at low latency. Do you agree?

    Yes and no. GPUs are better at throughput of homogeneous computations (but not so great if the compute is heterogeneous). CPUs have a faster clock speed, so they can be lower latency if the "fast path" or "hot path" doesn't have too much computation. However, if there's a lot of homogeneous computation to do, which is the case with AI models and their endless matrix multiplications, then a GPU cranking through all the work will have lower latency than a CPU trying with its puny multicore SIMD instructions. I like to think of the way that GPUs achieve low latency as an equation: Throughput + Parallelization = Low Latency. The equation for a CPU is therefore: Thin compute + Fast clock speed = Low Latency.

  9. Which has the fastest clock speed - CPU or GPU?

    CPUs. You can easily buy CPUs in home PCs with 4GHz or 5GHz clock speed, whereas your GPU might be around 1.6GHz. The GPU is just doing so much computation in parallel that running too fast will overheat it.

  10. Can you overclock a GPU?

    Yes, just like a CPU. But you can't go much faster, because then it can overheat. And GPUs do "wear out" over time and eventually fail, which is more likely the more work they do, and the hotter they run.

Basic CUDA Programming Concepts

  1. What is host code?

    C++ code that runs on the CPU. That's the "host."

  2. What is device code?

    C++ code that runs on the GPU. That's the "device."

  3. What does CUDA stand for?

    Something that's no longer relevant. Preferable to think of "barracuda" if you ask me. Officially, it's "Compute Unified Device Architecture" (CUDA).

  4. What is a kernel?

    In CUDA C++ programming, a "kernel" is something that runs on a GPU, usually a function. Hence, a kernel may be running lots of GPU threads. (Generally, the term "kernel" in AI can be used more generally as well, where it refers to some nugget of computation, such as a "matrix multiplication kernel", which could be running on either a CPU or GPU, depending on what you're talking about.)

  5. Does CUDA C++ run on AMD GPUs?

    No, CUDA C++ is NVIDIA-GPU specific. There's not even an emulator. If you want to support GPUs from both NVIDIA and AMD (and others), you need to use a higher-level umbrella technology such as OpenCL or SYCL.

  6. What programming languages does CUDA support?

    C and C++. Also Fortran and others we don't care about, since we're C++ programmers here. Amongst friends.

  7. What CPUs does CUDA device code run on?

    None. Device code runs on GPUs, not CPUs. Host code is for CPUs.

  8. Is CUDA C++ compiled or interpreted?

    Compiled. Well, mostly. It's a bit messier at the very low levels, because of how CUDA supports multiple different types of GPU architectures. But no, we'll ignore that and say that it's not interpreted like Python, and not a "byte-code interpreter" architecture on a virtual machine like Java. It actually compiles to executable instructions on both the CPU and GPU (but different ones).

  9. What's the name of the CUDA C++ compiler?

    nvcc (officially the "NVIDIA CUDA Compiler").

  10. What compiler does nvcc use?

    Umm, nvcc? No, nvcc actually uses two compilers: itself for the device code, and some other C++ compiler for the host code. It's a "pass-through" compiler for the CPU code, so nvcc hands the host code off to GCC, Clang, or MSVC. For the device code, however, it's a proper compiler that converts the CUDA C++ into GPU instructions, however that works.

  11. Does CUDA have assembly language?

    Yes, on the GPU it's called PTX, and there's also SASS at an even lower level. You can see the PTX assembly using the "--keep" (or "-ptx") option of the nvcc compiler. For the CPU host code, it's not PTX, but whatever the CPU uses, such as x86 or Arm assembly, which you can see with the "gcc -S" option.

  12. Does host code support modern C++?

    Yes, whatever the underlying compiler allows (e.g., GCC or MSVC).

  13. Does GPU device code support modern C++?

    Somewhat. It's a totally different compiler, so support for the various standard C++ features keeps being added. Generally speaking, most of the things you like in modern C++ are probably available in CUDA C++ kernels. However, there are some specific limitations for device code C++ in a GPU kernel. For example, a kernel can't be a virtual function, and there are various other obscure restrictions like that.

  14. Does NVIDIA sell CPUs?

    Yes, they do now! There's the Grace CPU, based on an ARM architecture. There's also a partnership with Intel that has other CPUs planned. Also, NVIDIA has "superchips" which are a combination of CPU-GPU, and there are "racks" that have lots of both CPUs and GPUs.

Floating-Point C++ Programming

  1. What is the standard for floating-point computation?

    IEEE 754.

  2. Is there a negative zero in floating-point?

    Yes, IEEE 754 has two zeros: normal zero and "negative zero." Positive zero has all bits zero. Negative zero has all bits zero, except the "sign bit" is one.

  3. Is floating-point computation commutative?

    Yes, assuming you mean addition and multiplication (subtraction and division aren't commutative anyway). Overflow and rounding can occur, but the result is the same either way around.

  4. Is floating-point computation associative?

    No. Both rounding and overflow depend on the order of evaluation, and overflows to infinity are "sticky." So "(Negative+Positive)+Positive" might not overflow, but "Negative+(Positive+Positive)" can overflow if the two positives are large. Multiplication is more likely to overflow than addition. (Note that integer arithmetic does not necessarily have this problem, because overflows are ignored and wrap around in 2's complement integer arithmetic.)
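
    A quick way to see this on any machine is a sketch like the following (the exact constants just need to differ widely in magnitude, or sum past the float maximum of about 3.4e38):

      #include <cstdio>

      int main() {
          // Rounding depends on the grouping: prints 1 vs 0.
          float x = 1e20f, y = -1e20f, z = 1.0f;
          printf("%g vs %g\n", (x + y) + z, x + (y + z));

          // Overflow to Inf is "sticky" and also depends on grouping: prints 3e+38 vs inf.
          float a = -3e38f, b = 3e38f, c = 3e38f;
          printf("%g vs %g\n", (a + b) + c, a + (b + c));
          return 0;
      }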

  5. What is FP4?

    It is a 4-bit representation of a floating-point number, with 1 sign bit, 2 exponent bits, and 1 mantissa bit. This sounds very silly, but is actually a thing. FP4 is widely used in AI model quantization on GPUs, with special support on Blackwell and Rubin GPUs.

  6. Is there an FP1?

    Well, sort of. It's just a single bit that's the sign bit, so it's either +1 or -1. However, there are "binary quantization" AI models that use 1 bit for each model weight, but we usually call that "INT1" or "binarized" or "bitnet" rather than "FP1".

  7. Why can't you use memcmp or other byte-comparisons to test equality on arrays of float?

    There are several floating-point cases where memcmp and the "==" operator disagree: zero and negative zero compare equal with "==" but have different bit patterns, and NaNs never compare equal with "==" even when their bits are identical. The problem isn't limited to floating-point data, either: there are also general C++ problems with padding bits and padding bytes, such as structure fields needing padding bytes for alignment, and padding bits in C++ "bit-fields".
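
    Here's a minimal sketch of the negative-zero case:

      #include <cstdio>
      #include <cstring>

      int main() {
          float a = 0.0f, b = -0.0f;
          printf("operator==: %d\n", a == b);                        // 1: equal as floats
          printf("memcmp:     %d\n", memcmp(&a, &b, sizeof a) == 0); // 0: different bits
          return 0;
      }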

  8. Why does IEEE754 use an "offset" in the exponent bits?

    It makes "less-than" or "greater-than" comparisons of floating-point numbers faster, because the bit representations are in sorted order if you treat them as integers (at least for values of the same sign; negatives need a small extra tweak because of the sign-magnitude representation). However, it might make addition/multiplication floating-point operations slower, because the exponent offset needs to be adjusted, although it's all done in hardware microcode these days, so it probably doesn't matter much.
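
    You can see the sorted-order property directly for non-negative values (a sketch only; "bits_of" is just an illustrative helper, and negative values need the extra tweak mentioned above):

      #include <cstdint>
      #include <cstdio>
      #include <cstring>

      static std::uint32_t bits_of(float f) {
          std::uint32_t u;
          std::memcpy(&u, &f, sizeof u);  // type-pun safely via memcpy
          return u;
      }

      int main() {
          float a = 1.5f, b = 2.25f;  // both non-negative
          printf("as floats:   %d\n", a < b);                    // 1
          printf("as integers: %d\n", bits_of(a) < bits_of(b));  // 1 -- same ordering
          return 0;
      }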

AI C++ Programming

  1. Why are GPUs fast at AI?

    AI models run lots of matrix multiplications with massive sizes. This much computation overwhelms the average CPU, whereas a GPU can vectorize it all. This is true of both AI training (building new models or fine-tuning existing models) and AI inference (running queries on a model for users).

  2. How do you port your CUDA C++ to different GPUs?

    Short answer: you don't, because CUDA C++ does it for you in the nvcc compiler. Longer answer: you have to optimize for different characteristics of GPUs in terms of their grid sizes, memory sizes, and so on. Also, some CUDA C++ features require a "compute capability" that may not exist on earlier versions of GPUs, so compatibility with older GPUs is a concern.

  3. What is on-device AI?

    This usually refers to AI running on smaller devices without a data center GPU, such as a PC, laptop, tablet, or smartphone. It's also called "edge AI," and this may even include less powerful devices like network switches or Internet-of-Things (IoT) devices like security cameras and your AI refrigerator. So, on-device or edge AI is not usually related to CUDA C++ programming.

Advanced CUDA C++ Programming

  1. What are tensors?

    3D arrays. Technically, a vector is a 1D tensor, a matrix is a 2D tensor, and what people usually mean by a tensor is 3D (or higher).

  2. What is a "slice" of a tensor?

    Short answer: a matrix. A slice of a 3D tensor is one of its 2D layers, which is just a matrix.

  3. What is the "shape" of a tensor?

    It's the dimensions. A 2-D matrix arr[N][M] has "shape" of "N,M". Obviously, the dimensions cannot be negative or zero. The idea generalizes for 3D. The "shape" of a tensor is often represented generally as a vector of positive integers.

  4. Can you access the same tensor using different shape?

    Yes, if the underlying data is stored in one big contiguous memory block. In 2D, arr[N][M] can also be accessed as arr[M][N], for example. Note that the two shapes should have the same number of elements, which is true in that simple example, because N*M=M*N. (Whether it's a good idea to use a different shape is another question, and it's often a mistake to do so, which is called a "tensor shape error" and is a very common problem.)

  5. Can you run SIMD instructions on a GPU kernel?

    Surprisingly, yes, but not what you're probably thinking. No, you cannot run CPU SIMD instructions on a GPU, such as the x86 AVX or Arm Neon/SVE instruction sets. The GPU runs a totally different instruction set to x86 or Arm CPUs.

    However, yes, the GPU does have its own SIMD instructions, which you can access via the types "float2" or "float4". These SIMD features don't get discussed as much at CUDA parties as SIMT vectorization, because they only double (float2) or quadruple (float4) the computation, whereas SIMT threads can run 1,024 in parallel. However, you can double that 1,024 to 2,048 by using "float2" (or quadruple it with "float4"), so maybe they should get more attention. (Also, you can obviously run the appropriate set of SIMD instructions in your host code on the CPU: AVX on x86 from Intel or AMD, or Neon/SVE on Arm CPUs, including the NVIDIA Grace CPU.)
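
    Here's a minimal sketch of a kernel using float4 (assuming the arrays are 16-byte aligned and the length is a multiple of 4, which a real kernel would have to check or handle):

      __global__ void add_float4(const float4 *a, const float4 *b, float4 *c, int n4) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n4) {
              // One thread processes 4 adjacent floats via a single vectorized load/store.
              float4 x = a[i], y = b[i];
              c[i] = make_float4(x.x + y.x, x.y + y.y, x.z + y.z, x.w + y.w);
          }
      }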

  6. Explain predication on a GPU.

    Well, weird. Each CUDA "warp" runs 32 threads in parallel, each in a kind of "mini-core" (I don't mean a tensor core, but the low-level concept of how things run in warps). In CPU multithreading, each thread has its own instruction pointer, and they can run different code, all over the place (with different data). GPU threads in a warp do have their own data (in registers) but share an instruction pointer.

    Imagine if all 32 GPU threads only had one instruction pointer, and had to all execute the exact same instruction at the same time. That works fine for a fixed sequence of instructions, but what about branching?

    How could you possibly run an "if-then-else" sequence on a warp in a way that some of the 32 threads did the "if" body and the rest did the "else" body? The answer is called "predication." There's a 32-bit predication mask that tells which of the 32 warp threads should run the current instruction, and which should effectively run a "nop" (no-operation) because their bit is not set. The GPU actually runs both the "if" and "else" blocks on all 32 threads in the warp. That's so weird, I'll say it again: the GPU just runs both parts of an if-statement, and all 32 threads run both the if and else blocks. Huh? How can they be different? Well, the compiler splits the 32 threads into two sets using the 32 predication bits. It turns the predication bits on for the if-block, and then inverts the predication bits so that the other threads in the warp run the else-block. You can see predicated instructions in the PTX assembly code, marked with a "@p" prefix.
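
    For a concrete picture, here's a trivial kernel with a data-dependent branch; for a short if/else like this, the compiler will typically emit predicated instructions, which you can inspect by compiling with "nvcc -ptx" and looking for the "@p" prefixes:

      __global__ void clamp_or_double(float *x, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) {
              if (x[i] < 0.0f)          // threads in the same warp may disagree here
                  x[i] = 0.0f;          // runs under the predication mask
              else
                  x[i] = x[i] * 2.0f;   // runs under the inverted mask
          }
      }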

  7. Explain predication on a CPU.

    Mostly, it's a GPU thing. CPUs do have limited forms of predication, such as conditional-move instructions and AVX-512 mask registers, but ordinary CPU branching runs more normally, relying on branch prediction and speculative execution instead.

Simple Low-Level Coding Questions for CUDA C++ Programming

  1. Check if an address has a specified alignment using C++ (or C).
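
    One possible answer is a sketch like this ("is_aligned" is just an illustrative name; the alignment must be a power of two):

      #include <cstdint>

      // True if ptr is aligned to 'alignment' bytes (alignment must be a power of two).
      inline bool is_aligned(const void *ptr, std::uintptr_t alignment) {
          return (reinterpret_cast<std::uintptr_t>(ptr) & (alignment - 1)) == 0;
      }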

  2. Extract the floating-point sign bit using bitwise arithmetic.

    Hint: convert it to an integer.

  3. Extract exponent and mantissa bits from a 32-bit float using bitwise tricks.
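
    A sketch answering questions 2 and 3 together, treating an FP32 as 1 sign bit, 8 exponent bits, and 23 mantissa bits (the helper names are illustrative only):

      #include <cstdint>
      #include <cstring>

      inline std::uint32_t float_bits(float f) {
          std::uint32_t u;
          std::memcpy(&u, &f, sizeof u);  // avoids strict-aliasing problems
          return u;
      }

      inline unsigned sign_bit(float f) { return float_bits(f) >> 31; }
      inline unsigned exponent_bits(float f) { return (float_bits(f) >> 23) & 0xFFu; }  // biased exponent
      inline std::uint32_t mantissa_bits(float f) { return float_bits(f) & 0x7FFFFFu; } // 23 fraction bits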

  4. Implement a test for floating-point zero using bitwise tricks.

    Hint: don't forget about negative zero!

  5. Implement a test for floating-point NaN using bitwise tricks.

  6. Implement a test for floating-point Inf using bitwise tricks.
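
    A sketch covering questions 4 to 6, reusing the "float_bits" helper from the sketch above (zero: all bits zero apart from the sign; Inf: exponent all ones with a zero mantissa; NaN: exponent all ones with a non-zero mantissa):

      inline bool is_zero_fp(float f) {   // matches both +0.0 and -0.0
          return (float_bits(f) & 0x7FFFFFFFu) == 0;
      }
      inline bool is_inf_fp(float f) {    // +Inf or -Inf
          return (float_bits(f) & 0x7FFFFFFFu) == 0x7F800000u;
      }
      inline bool is_nan_fp(float f) {    // any NaN payload
          return (float_bits(f) & 0x7FFFFFFFu) > 0x7F800000u;
      }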

  7. Convert BF16 to FP32 using bitwise arithmetic tricks.
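
    A sketch, assuming the BF16 value arrives as a raw 16-bit integer: BF16 is just the top 16 bits of an FP32, so widening is a 16-bit left shift.

      #include <cstdint>
      #include <cstring>

      inline float bf16_to_fp32(std::uint16_t bf16_bits) {
          std::uint32_t u = static_cast<std::uint32_t>(bf16_bits) << 16;  // low 16 bits become zero
          float f;
          std::memcpy(&f, &u, sizeof f);
          return f;
      }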

  8. Detect big-endian versus little-endian architectures using C++ (or C).

    For a runtime test, create an int variable, initialize its value to 1, take its address, convert that to an "unsigned char*" pointer, and then test the first byte at addr[0] to see if it's 0 or 1. Note that there are compile-time methods: non-standard preprocessor macros in GCC, and a standardized method with "std::endian" in C++20. For C, there's also "__STDC_ENDIAN_LITTLE__" and other <stdbit.h> macros in C23.
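
    The runtime test described above looks like this as a sketch:

      #include <cstdio>

      int main() {
          int one = 1;
          const unsigned char *bytes = reinterpret_cast<const unsigned char *>(&one);
          if (bytes[0] == 1)
              printf("little-endian\n");  // least-significant byte stored first
          else
              printf("big-endian\n");
          return 0;
      }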

  9. Implement a dynamically-allocated 2-dimensional array so that arr[i][j] indexing works using C++ (or C).

  10. Implement a dynamically-allocated 2-dimensional array so that arr[i][j] indexing works, but all the array elements are in a single contiguous block.

  11. Can you implement a dynamically-allocated 2-dimensional array over contiguous memory in a faster way?

    Yes, relax the requirement that arr[i][j] actually works, in which case you don't need the extra array of row pointers arr[i]. There's just one big memory block with all the data, and no other memory allocations. Instead, use an index computation method over the low-level contiguous array, like arr[COMPUTE_OFFSET(i,j)], where the index is computed arithmetically using the size of the rows. You can use a C-style macro, a short C++ inline function, or wrap the whole request for a matrix element in a member function of your matrix class.
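
    Here's a sketch of the single-block approach with the index computation in a member function (row-major layout assumed; "Matrix2D" is just an illustrative name):

      #include <cstddef>
      #include <vector>

      struct Matrix2D {
          std::size_t rows, cols;
          std::vector<float> data;  // one contiguous block, row-major
          Matrix2D(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c) {}
          float &at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
      };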

  12. You have a 2-D matrix stored in a contiguous memory block. Compute the address of arr[i][j] using i and j, in C++ (or C).

  13. You have a 3-D tensor stored in a contiguous memory block. Compute the address of arr[i][j][k] using i, j, and k, in C++ (or C).
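
    For a contiguous [N][M][K] tensor in row-major order, the offset arithmetic generalizes from the 2-D case of question 12, which is just i*M + j (a sketch; "offset3d" is an illustrative name):

      #include <cstddef>

      // Offset of element (i, j, k) in a contiguous [N][M][K] row-major tensor.
      inline std::size_t offset3d(std::size_t i, std::size_t j, std::size_t k,
                                  std::size_t M, std::size_t K) {
          return (i * M + j) * K + k;
      }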

  14. Compute the number of elements in a tensor given its shape as a vector.

    This is the product of all the numbers in the shape vector. For a 2D matrix, the number of elements in arr[N][M] is N*M, and this generalizes for 3D tensors and beyond.
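
    A sketch using std::accumulate:

      #include <cstddef>
      #include <functional>
      #include <numeric>
      #include <vector>

      // Number of elements = product of all the dimensions in the shape vector.
      inline std::size_t num_elements(const std::vector<std::size_t> &shape) {
          return std::accumulate(shape.begin(), shape.end(),
                                 std::size_t{1}, std::multiplies<std::size_t>());
      }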

Advanced Coding Questions for AI C++ Programming (CUDA or non-CUDA C++)

  1. Implement a basic matrix-matrix multiplication.
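
    As a starting point, here's a naive triple-loop sketch for contiguous row-major matrices (a real answer would go on to discuss blocking/tiling, and shared-memory tiles on a GPU):

      // C[N][M] = A[N][K] * B[K][M], all contiguous row-major arrays.
      void matmul(const float *A, const float *B, float *C, int N, int K, int M) {
          for (int i = 0; i < N; ++i) {
              for (int j = 0; j < M; ++j) {
                  float sum = 0.0f;
                  for (int k = 0; k < K; ++k)
                      sum += A[i * K + k] * B[k * M + j];
                  C[i * M + j] = sum;
              }
          }
      }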

  2. Implement a basic matrix-vector multiplication.

  3. Implement a matrix transpose operation.

  4. Set a matrix to be an identity matrix.

  5. Implement a triangular-matrix version of matrix-matrix multiplication.

  6. Implement a convolution.

  7. Implement a custom iterator over a matrix.

Systems Design Questions for CUDA C++ Programming

Note that "systems design" questions means:

  • No coding
  • Talk through the "design"

Here are some questions:

  1. How to port a CPU workload to CUDA C++?

  2. How can you optimize a CUDA C++ application?

    General points to make before diving into optimization techniques:

    • Profile the existing code for bottlenecks
    • Examine the overall algorithm and data structures for high-level parallelization opportunities.

    The next level down for optimizing CUDA C++ code is to consider these separate areas:

    • Compute
    • Memory access costs
    • Network data transfer costs

  3. Compare optimizing vector addition on a CPU versus a GPU?

    Vector addition is an "element-wise" operation that is very easy to parallelize. In fact, it's "embarrassingly parallel" (that's a real term). Hence, this operation is all "vertical" and does not have any "horizontal" or "reduction" computation. Since we can forget about reductions, there are actually two important aspects:

    • Compute parallelization method
    • Memory access costs

    The basic compute parallelization method is:

    • CPU — multithreading (e.g., std::thread) and each thread could use SIMD instructions (e.g., x86 AVX or Arm Neon/SVE low-level machine instructions)
    • GPU — SIMT method with CUDA C++ kernel threads

    The memory access ordering issues are opposite for CPU and GPU:

    • CPU — each thread (with or without CPU SIMD instructions) works on adjacent array elements in a contiguous segment of the vector (i.e., a "segmented" parallelization).
    • GPU — each thread in a CUDA kernel works on a "stride" so that the 32 threads of a warp collectively process a segment of 32 adjacent vector elements, giving a "coalesced memory addressing" pattern, which is fast on a GPU (see the kernel sketch below). Using the "segmented" version on a GPU is slower.

    Some finesses to this answer include:

    • CPU — talk about "cache line" sizes (e.g., usually 64 bytes, sometimes 128 bytes).
    • GPU — talk about "grid-stride loops" in the kernel threads (to maintain coalesced addressing).
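
    As a concrete illustration of the coalesced, grid-stride pattern above, here's a minimal sketch of the GPU vector addition kernel (adjacent threads touch adjacent elements, and the loop lets a fixed-size grid cover any n):

      __global__ void vec_add(const float *a, const float *b, float *c, int n) {
          int stride = blockDim.x * gridDim.x;
          for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
              c[i] = a[i] + b[i];  // coalesced: adjacent threads read adjacent elements
          }
      }
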
  4. Compare optimizing vector dot product on a CPU versus a GPU?

  5. Discuss row-major order versus column-major order in two-dimensional arrays or matrices.

  6. Design a fast matrix-matrix multiplication kernel (in C++ or in CUDA C++).

  7. Design a fast matrix-vector multiplication kernel (in C++ or in CUDA C++).

  8. Design the backend for ChatGPT (with hundreds of millions of users).

  9. Optimize the backend for serving the answers to LLM requests.

  10. Optimize the backend of an LLM inference engine.

  11. Design an AI application that uses the OpenAI API for LLM inference (i.e., an LLM API "wrapper" app) so that it scales.

  12. How would you save money on token costs for an LLM API "wrapper" app?

  13. How would you handle security credentials for users in an LLM API "wrapper" app?

What Else Should I Research?

If you have a job interview for an AI C++ engineer, whether it's for CUDA C++ on a GPU, or CPU C++ coding, here's what else to know:

  • Python
  • AI frameworks like TensorFlow, PyTorch, and JAX (there are several others).
  • Inference engines like vLLM, SGLang, or TensorRT (again, many others).
  • AI LLM kernels (e.g., attention, FFN/MLP, activation functions, etc.)
  • Non-LLM model types (e.g., diffusion models, convolutions, CNNs, RNNs).
  • Graph algorithms (because of "graph compilers" that run AI models).
  • LLM Inference Optimizations (e.g., quantization, pruning, MoE, etc.)
  • LLM Training methods (basic training techniques, fine-tuning, etc.)
  • Floating point numbers (e.g., the IEEE 754 standards and the newer variants like BF16).
  • Vectors, matrices, and tensors (e.g., how to use contiguous memory).
  • Matrix multiplication (it's 99% of all electricity usage when running AI engines).
  • C programming (low-level operations like arrays and pointers).
  • General C++ programming (i.e., modern C++ containers, algorithms, data structures, all the usual suspects).

Yeah, I know, that's rather a lot!

More AI Research Topics

Read more about:

Aussie AI Advanced C++ Coding Books



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging



C++ AVX Optimization: CPU SIMD Vectorization:
  • Introduction to AVX SIMD intrinsics
  • Vectorization and horizontal reductions
  • Low latency tricks and branchless programming
  • Instruction-level parallelism and out-of-order execution
  • Loop unrolling & double loop unrolling

Get your copy from Amazon: C++ AVX Optimization: CPU SIMD Vectorization



C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations:
  • Low-level C++ efficiency techniques
  • C++ multithreading optimizations
  • AI LLM inference backend speedups
  • Low latency data structures
  • Multithreading optimizations
  • General C++ optimizations

Get your copy from Amazon: C++ Ultra-Low Latency



Advanced C++ Memory Techniques: Efficiency & Safety:
  • Memory optimization techniques
  • Memory-efficient data structures
  • DIY memory safety techniques
  • Intercepting memory primitives
  • Preventive memory safety
  • Memory reduction optimizations

Get your copy from Amazon: Advanced C++ Memory Techniques



Safe C++: Fixing Memory Safety Issues:
  • The memory safety debate
  • Memory and non-memory safety
  • Pragmatic approach to safe C++
  • Rust versus C++
  • DIY memory safety methods
  • Safe standard C++ library

Get it from Amazon: Safe C++: Fixing Memory Safety Issues



Efficient C++ Multithreading: Modern Concurrency Optimization:
  • Multithreading optimization techniques
  • Reduce synchronization overhead
  • Standard container multithreading
  • Multithreaded data structures
  • Memory access optimizations
  • Sequential code optimizations

Get your copy from Amazon: Efficient C++ Multithreading



Efficient Modern C++ Data Structures:
  • Data structures overview
  • Modern C++ container efficiency
  • Time & space optimizations
  • Contiguous data structures
  • Multidimensional data structures

Get your copy from Amazon: Efficient C++ Data Structures



Low Latency C++: Multithreading and Hotpath Optimizations: advanced coding book:
  • Low Latency for AI and other applications
  • C++ multithreading optimizations
  • Efficient C++ coding
  • Time and space efficiency
  • C++ slug catalog

Get your copy from Amazon: Low Latency C++