
Chapter 10. CUDA Data Transfer Optimizations

  • Book Excerpt from "CUDA C++ Optimization: Coding Faster GPU Kernels"
  • by David Spuler


Bottlenecks in Data Transfer

The cost of transferring data can be a bottleneck. This includes:

  • Host-device data transfers (i.e., CPU-to-GPU or GPU-to-CPU).
  • Multi-GPU data transfers.
  • Server-to-server data transfers over the network.

General optimizations that reduce overall data requirements (e.g., smaller data types, algorithm-level changes) are effective because they reduce the overall transfer volume. There are also several optimization techniques that are specific to data transfers, including:

  • Pinned host memory that cannot be paged out by the operating system.
  • Overlapping data transfers and CPU computations.
  • Overlapping data transfers and GPU kernel computations.
  • Splitting data transfers into chunks that can be overlapped.
  • Prefetching to trigger a data transfer sooner (e.g., cudaMemPrefetchAsync).

Host-Device Transfer Costs

If you are using a simple kernel, the typical sequence is:

  • malloc — allocate CPU memory.
  • Initialize CPU copy of input data.
  • cudaMalloc — allocate the device vector on the GPU.
  • cudaMemcpy — copy input data up from CPU-to-GPU.
  • Kernel launch — snazzy grid-stride loops.
  • Synchronize — e.g., cudaDeviceSynchronize to wait.
  • cudaMemcpy — copy results back from GPU-to-CPU.
  • cudaFree — cleanup device memory.
  • free — cleanup CPU memory.

In this sequence, you can see where the data transfers are: cudaMemcpy. This does the data transfers in both directions.
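As a minimal sketch of that sequence, using a trivial element-wise kernel for illustration (the function and variable names here are mine, and error checking is omitted):

    #include <cstdlib>   // malloc, free

    // Minimal sketch of the typical host-device sequence
    __global__ void vector_clear(float *v, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] = 0.0f;   // element-wise vector clear
    }

    void run_simple_kernel(int n)
    {
        size_t bytes = n * sizeof(float);
        float *h_v = (float*)malloc(bytes);                     // allocate CPU memory
        for (int i = 0; i < n; i++) h_v[i] = 1.0f;              // initialize CPU copy of input data
        float *d_v = NULL;
        cudaMalloc((void**)&d_v, bytes);                        // allocate device vector on GPU
        cudaMemcpy(d_v, h_v, bytes, cudaMemcpyHostToDevice);    // copy input data up (CPU-to-GPU)
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        vector_clear<<<blocks, threads>>>(d_v, n);              // kernel launch
        cudaDeviceSynchronize();                                // wait for the kernel to finish
        cudaMemcpy(h_v, d_v, bytes, cudaMemcpyDeviceToHost);    // copy results back (GPU-to-CPU)
        cudaFree(d_v);                                          // cleanup device memory
        free(h_v);                                              // cleanup CPU memory
    }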

When I did some timings on simple non-overlapped non-pinned data transfers and a very basic kernel (element-wise vector clear), I saw this pattern of results across multiple different versions of my kernel (basic, segmented, grid-stride loop):

  • cudaMalloc remote allocation — 5%
  • cudaMemcpy input data transfer up — 40%
  • Kernel launch and synchronization — 10%
  • cudaMemcpy results data transfer back down — 40%
  • cudaFree remote de-allocation — 5%
  • Everything else — negligible.

Oh, and one point: I excluded the CPU costs of initializing the vector with data, and running the self-tests on the CPU to unit test that it worked. I had to do this because the CPU costs of a couple of for loops came in at around 70% of execution time (CPUs are slow!). Hence, the above analysis covers only data transfer costs and GPU computations.

Let’s just stop and bask in that glory for a moment. We made the right choice in learning CUDA because it’s so very fast. On the other hand, it means you bought the wrong book, because you’re optimizing only 30% of the problem, and you need a book on non-CUDA C++ optimization on the CPU.

Early Runtime Initialization. Also, I noticed that the very first call to cudaMalloc in the program execution was very expensive, circa 100ms, but the second and subsequent cudaMalloc calls were no longer this costly. Presumably, this is the setup time for Unified Memory addressing, with a handshake between the CPU and GPU happening across the PCIe bus behind-the-scenes.

If you don’t want your first user query to have an extra 100ms delay, then issue a dummy call to warm up the GPU and trigger the CUDA Runtime API initialization. In my case, I removed this cost from the main computation path simply by adding a call to cudaDeviceSynchronize earlier in my startup code.
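As a sketch, either approach works at program startup (the cudaFree(0) call is a common alternative idiom for forcing the runtime to initialize):

    // Warm up the CUDA Runtime at startup, off the critical query path
    void warm_up_gpu()
    {
        cudaFree(0);                // common idiom: forces context/runtime initialization
        cudaDeviceSynchronize();    // or simply synchronize once, as described above
    }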

Data Transfer Costs. Analyzing the above data without that extra initialization cost, we can surmise a few notable factors:

  • Direction not important — the data transfers were approximately the same cost in either direction.
  • Hidden CPU-GPU synchronization costs in Unified Memory address management (i.e., cudaMalloc, cudaFree).
  • Data transfer costs were 80%!
  • Allocation/cleanup of remote memory raised that to 90%!

Admittedly, this was a very simple kernel, but it nevertheless underscores the fact that host-device data transfer costs are significant. My test code is definitely not compute-bound. What can we do about that?

Pinned Host Memory

Host memory is normally subject to operating system paging, so primitives like cudaMemcpy run slower because the driver must first stage the data through an internal pinned buffer (an extra copy). We can avoid this expense by using “pinned” memory (also called “page-locked”), where operating system paging is blocked, and the memory transfer can run faster. The calls to manage pinned allocated memory are:

  • cudaMallocHost or cudaHostAlloc
  • cudaFreeHost

Note that this is host memory, so you make these changes to the handling of CPU data:

  • malloc — replace with cudaMallocHost or cudaHostAlloc.
  • free — replace with cudaFreeHost.

You don’t change cudaMalloc or cudaFree! In fact, you leave them both in there, because that’s the other step of setting up the allocated memory on the device side. Pinned memory is on the host side.

My basic test code looked like this:

    float *v = NULL;
    if (pinned) {
        CUDACHK( cudaMallocHost((void**)&v, n * sizeof(float)) );  // Pinned host array
    }
    else {
        v = (float*)malloc(n * sizeof(float));  // Non-pinned host memory
    }

And the matching de-allocation code at the end:

    if (pinned) {
        CUDACHK(cudaFreeHost(v));  // Free pinned host vector
    }
    else {  // not pinned...
        free(v);   // Free non-pinned host vector
    }

When I made these changes and re-profiled the CPU time for this version, it was worse! The overall time cost on the CPU went up. The general pattern of the time cost changed to:

  • cudaMallocHost pinned allocation — 40-50% (very high!)
  • cudaMalloc remote allocation — 3% (lower)
  • cudaMemcpy input data transfer up — 10-15% (down from 40%)
  • Kernel launch and synchronization — 10%
  • cudaMemcpy results data transfer back down — 10-15% (down from 40%)
  • cudaFree remote de-allocation — 3% (lower)
  • cudaFreeHost pinned de-allocation — 15-25% (very high!)

Whereas the calls to malloc and free were “negligible” in the non-pinned version, the calls to cudaMallocHost and cudaFreeHost were very expensive. On the other hand, the amount of CPU time used by the cudaMemcpy calls, both up and down, was reduced to about half its original cost.

Is this a bad result?

What this shows is that initialization of pinned host memory is expensive, but that the data transfer time can be cut to roughly half. Hence, there’s no efficiency gain, and in fact a loss, if you have to allocate and de-allocate the pinned host memory for every user query. However, if you can pre-initialize pinned memory blocks on the host and then re-use them over many user queries, the latency for your users will be much reduced, because the data transfer costs are down by about 50%.
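A sketch of that structure, assuming a single re-usable host buffer sized for the largest query (the names are illustrative):

    // Sketch: allocate a pinned host buffer once at startup, re-use it for every query
    static float *g_host_buf = NULL;      // pinned host buffer
    static size_t g_buf_bytes = 0;

    void init_host_buffer(size_t max_bytes)    // call once at startup
    {
        cudaMallocHost((void**)&g_host_buf, max_bytes);   // pay the pinning cost only once
        g_buf_bytes = max_bytes;
    }

    void free_host_buffer()                    // call once at shutdown
    {
        cudaFreeHost(g_host_buf);
        g_host_buf = NULL;
        g_buf_bytes = 0;
    }

    // Per-query code then fills g_host_buf and calls cudaMemcpy as usual, getting the
    // faster pinned transfers without a per-query cudaMallocHost/cudaFreeHost.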

Overlapping Data Transfers and CPU Computation

When the host code launches a kernel, it does so asynchronously, and the CPU can go do other work while the GPU processes the kernel, thereby overlapping or parallelizing CPU and GPU computation. A similar idea can be applied to overlapping data transfers and CPU computation.

To do this, we need an asynchronous way for the host to start data transfers. The default behavior of cudaMemcpy is synchronous on the host: it blocks waiting for the data transfer to complete. Hence, you need to change to cudaMemcpyAsync, which is the non-blocking version (for the copy to be truly asynchronous, the host memory involved should be pinned). Since the host no longer waits around for the data to be copied up to or down from the GPU, the host code can do other computations in parallel.

Usually, the host will need to know when the GPU work has finished on the data that it uploaded. Or it may need to know when results data from the GPU has been downloaded. Thus, the host code needs a way to determine when:

    (a) an asynchronous data transfer has completed, and/or

    (b) the kernel processing on that data has occurred.

This type of synchronization is usually done via CUDA streams.

Streams

What are CUDA streams? A stream is essentially a queue of work jobs: you queue up a series of jobs on a stream, all jobs on the same stream run sequentially in the order they were queued, but work can run in parallel across different streams. In this way, CUDA C++ offers an easy way for the application code to specify which work can be parallelized and what must remain serialized.

CUDA Runtime calls that don’t specify a stream run on the “default stream.” This is a simple but inflexible way of running some work, and it does not offer parallelization as fine-grained as explicitly created streams.

CUDA handles most of the mechanics of streams, and the programmer just has to queue up the work; the CUDA Runtime takes care of the parallel scheduling and sequencing requirements of these jobs.

The basic idea for efficient parallelization using streams from the host code viewpoint is:

    1. Load up the data transfers, kernels, and other work jobs onto a stream.

    2. Do other stuff in parallel.

    3. Synchronize when the stream has finished its work.

For example, if the host code needs to do a vector dot product computation:

    1. Queue up two asynchronous host-to-device data transfers for the two input vectors onto the stream (i.e., cudaMemcpyAsync).

    2. Queue up the vector dot product GPU kernel launch (asynchronously on the same stream).

    3. Queue up the device-to-host data transfer of the results vector (same stream).

    4. Do other stuff in parallel on the CPU.

    5. Synchronize with the stream when the results data is available.

This is highly parallel for the CPU, but less so for the GPU. Although the GPU can do work from other kernels in parallel, most of this sequence is sequential. The GPU has to execute the data transfers, kernel launch and return data transfer in serial order, as they are queued on the same stream.
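A sketch of this queuing pattern, assuming pinned host buffers and a hypothetical dot_product_kernel that writes a single float result to d_result (all names here are illustrative):

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    // 1. Queue the two asynchronous host-to-device input transfers
    cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, stream);
    // 2. Queue the kernel launch on the same stream (4th launch parameter)
    dot_product_kernel<<<blocks, threads, 0, stream>>>(d_a, d_b, d_result, n);
    // 3. Queue the device-to-host transfer of the result
    cudaMemcpyAsync(h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost, stream);
    // 4. Do other stuff in parallel on the CPU
    do_other_cpu_work();
    // 5. Synchronize with the stream when the result is needed
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);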

Asynchronous CUDA Operations

Efficient use of streams requires asynchronous actions. Although kernel launches are asynchronous, most of the simplest CUDA runtime functions are synchronous by default, such as cudaMalloc, cudaMemset, and cudaMemcpy. The code will block waiting for them to finish, even if the work is being done on the other hardware (e.g., the CPU host code blocks waiting for the GPU to do a memory set or memory allocation). To do parallelization and overlapping optimizations with streams, you need to use the asynchronous CUDA methods:

  • Kernel launch mykernel<<<...>>> syntax (asynchronous by default)
  • cudaMemcpyAsync (and also cudaMemcpy2DAsync and cudaMemcpy3DAsync)
  • cudaMemsetAsync
  • cudaMemPrefetchAsync
  • cudaMallocAsync (CUDA 11.2)
  • cudaFreeAsync (CUDA 11.2)

The most basic GPU operation is already asynchronous: launching kernels. You have to do the work of synchronizing the host with a launched kernel yourself, such as by calling cudaDeviceSynchronize, which is simple but inefficient. For the other operations, you need to make two changes:

    1. Add a stream argument to the kernel launch syntax, and

    2. Call the asynchronous versions of the CUDA APIs, and provide a stream to track them.
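A sketch of both changes together (the kernel and buffer names are illustrative; cudaMallocAsync and cudaFreeAsync require CUDA 11.2 or later):

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    float *d_v = NULL;
    cudaMallocAsync((void**)&d_v, bytes, stream);                        // stream-ordered allocation
    cudaMemcpyAsync(d_v, h_v, bytes, cudaMemcpyHostToDevice, stream);    // asynchronous upload
    mykernel<<<blocks, threads, 0, stream>>>(d_v, n);                    // stream is the 4th launch parameter
    cudaMemcpyAsync(h_v, d_v, bytes, cudaMemcpyDeviceToHost, stream);    // asynchronous download
    cudaFreeAsync(d_v, stream);                                          // stream-ordered de-allocation
    cudaStreamSynchronize(stream);                                       // wait for everything queued above
    cudaStreamDestroy(stream);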

Overlapping Kernels and Data Transfers

It is also possible to overlap GPU kernel execution with asynchronous data transfers. Note that the GPU is capable of doing data transfers without launching any threads, and the threads can be doing other things (i.e., other workloads can run). These host-device data transfers are processed in a different part of the GPU hardware (the copy engines), which runs in parallel with the execution of kernel threads. This is true for both uploads (CPU-to-GPU) and downloads (GPU-to-CPU), and has been supported in GPUs since approximately Compute Capability 1.1 (i.e., for many years).

However, a naive attempt to overlap an upload of input data (i.e., CPU-to-GPU data transfer) and the kernel that processes that data, is not going to work. The kernel has to wait! It cannot start processing until it has the data.

This model only works if there are other kernels doing work on the GPU. The other kernels that don’t require the input data can run in parallel with the data transfer. However, our kernel is blocked waiting for the data upload to complete.

Overlapping Partial Transfers and Computations. The idea of overlapping data transfers and computations doesn’t need to involve different kernels to get a parallelization gain. At first thought, you can’t overlap the data transfer for a kernel that requires the data, because the kernel has to wait for the full data transfer to complete. Only when it has the data can it begin computation. Hence, it seems that this kernel can’t do any work in parallel with the data transfer.

This is annoying, because surely the kernel could start adding two vectors, even if it didn’t have the whole vectors. Indeed, that’s the idea, but CUDA cannot do it automatically for you. If the kernel is an algorithm that can be parallelized into chunks (e.g., any element-wise vector or matrix operation, just for starters), the transfer and processing of the separate chunks can be overlapped by splitting them onto multiple streams.

Data transfers and kernel processing of two different chunks can be overlapped. Obviously, you have to serialize the processing of each chunk, so that the upload of one chunk has to finish before the kernel operates on that same chunk. However, the kernel can start processing on the first chunk (after it finishes uploading) while the second chunk is uploading in parallel. This overlaps the first kernel computation with the second data transfer, and then the second is processed while the third is uploaded, and this can continue over many pairs of different chunks. The only parts that are not overlapped are the first and last chunks.

The implementation of this idea requires multiple streams to be initialized, one for each chunk, and each of these streams needs to have work queued:

    1. Transfer the chunk (e.g., vector segment) to the GPU.

    2. Kernel launch to work on that chunk.

    3. Transfer the resulting chunk back to the CPU.

Note that you can’t set up a single thread to handle each chunk (why would you want to?). You have to queue up a full kernel launch, with one or more blocks, and warps of threads, on the right stream for that chunk.
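A sketch of the chunked pattern (the kernel name is illustrative, and the host array is assumed to be pinned so that the asynchronous copies can genuinely overlap):

    const int NCHUNKS = 4;
    cudaStream_t streams[NCHUNKS];
    int chunk_n = n / NCHUNKS;                     // assume n divides evenly, for simplicity
    size_t chunk_bytes = chunk_n * sizeof(float);
    int threads = 256;
    int blocks = (chunk_n + threads - 1) / threads;
    for (int c = 0; c < NCHUNKS; c++) {
        cudaStreamCreate(&streams[c]);
        int offset = c * chunk_n;
        // 1. Transfer this chunk (vector segment) to the GPU
        cudaMemcpyAsync(d_v + offset, h_v + offset, chunk_bytes,
                        cudaMemcpyHostToDevice, streams[c]);
        // 2. Kernel launch to work on that chunk only
        process_chunk<<<blocks, threads, 0, streams[c]>>>(d_v + offset, chunk_n);
        // 3. Transfer the resulting chunk back to the CPU
        cudaMemcpyAsync(h_v + offset, d_v + offset, chunk_bytes,
                        cudaMemcpyDeviceToHost, streams[c]);
    }
    for (int c = 0; c < NCHUNKS; c++) {
        cudaStreamSynchronize(streams[c]);         // wait for all chunks to finish
        cudaStreamDestroy(streams[c]);
    }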

Another even more advanced way to parallelize work on chunks of data would be to use a “persistent kernel” that handles each chunk. A persistent kernel never exits, but has its threads endlessly running, tirelessly waiting for work, and then receiving their work jobs on a scheduler queue. It’s quite an involved architecture, and is discussed in a later chapter.

Additional Host-Device Optimizations

Tuning the interaction between the host and device code is a never-ending programming task. Here are some additional techniques:

1. Managed Memory. The CUDA APIs for “managed memory” (e.g., cudaMallocManaged) are convenient, but not necessarily more efficient. On the other hand, a hand-coded transfer scheme might use a poor algorithm, whereas the CUDA Runtime manages the transfer of data behind-the-scenes for you.

2. Reduce Data Transfer Volume. The cost of data transfer is often a significant cost, whether it is host-device transfers, or between GPUs or servers. Algorithmic changes can sometimes be used to reduce the size of these transfers, or you can use smaller data types (i.e., quantization).

3. Batch Multiple Small Transfers. If you have a lot of small data transfers, the overhead can be significant. Batching them together into a single, larger transfer won’t reduce volume, but it cuts down on the overhead.

4. Don’t Transfer Interim Results. The last thing you want is data going back-and-forth between CPU and GPU. If your CPU application has multiple steps in the overall top-level algorithm, don’t interleave CPU and GPU computations. Instead, you need to migrate all of the processing to the GPU, rather than send back partial results that are re-processed by the CPU. If you’re porting legacy CPU code to the GPU, don’t stop halfway!

5. Larger Kernel Launch Parameters. CUDA C++ allows passing data from host-to-device as kernel launch parameters, which are stored in constant memory. Although these are more commonly used to pass single values, such as pointers, sizes, and dimensions, this is also a fast way to pass larger data chunks to the GPU efficiently. These parameters were previously limited to 4KB in total size, but the limit is now about 32KB per launch on the latest GPUs and CUDA versions.

6. Direct Memory Access (DMA) and GPUDirect. This optimizes GPU to storage accesses locally, such as accessing an NVMe or flash memory device. This bypasses the CPU by allowing direct memory transfers from local storage to the GPU. It works by copying memory blocks asynchronously over the PCIe bus.

7. PCIe Bus Bandwidth. This is the bus for data transfer between CPU and GPU. Its speed is a key consideration for host-device data transfers (e.g., via cudaMemcpy), which are an important part of overall CUDA kernel performance.

8. Zero-Copy Memory. This is pinned memory on the host, like that used for asynchronous memory copying, but mapped so that the GPU can access it directly without any explicit copy (i.e., no call to cudaMemcpy). However, the data on the host is accessed by the GPU over the PCIe bus, which is not especially efficient and can have significant latency.
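A sketch of mapped zero-copy memory is below (older devices may also require cudaSetDeviceFlags(cudaDeviceMapHost) before context creation; the kernel name is illustrative):

    float *h_v = NULL, *d_v = NULL;
    cudaHostAlloc((void**)&h_v, bytes, cudaHostAllocMapped);   // pinned + mapped host memory
    cudaHostGetDevicePointer((void**)&d_v, h_v, 0);            // device-side alias, no cudaMemcpy needed
    mykernel<<<blocks, threads>>>(d_v, n);                     // kernel reads/writes host memory over PCIe
    cudaDeviceSynchronize();
    cudaFreeHost(h_v);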

Networking Optimizations

Larger GPU applications may need to transfer data between multiple GPUs or across the network to other servers. A large data center running many CUDA backends will also have a lot of work to do in terms of the various networking protocols and software stacks. There are various optimizations in these cases, and here’s a summary of some of them.

A data center running H100 GPUs will have different types of networking:

    1. Front-end networking — Ethernet for external access.

    2. Back-end networking — high-bandwidth inter-GPU transfers.

    3. Out-of-band networking — for internal monitoring and management.

The front-end networking is typically an Ethernet connection from the internet into the hosts. This is how customers and external users connect into the data center for reaching servers and for data storage needs.

The back-end technologies are much more intense and high-bandwidth, because they manage bursts of inter-GPU communications for reductions and gather operations. AI training applications have a particularly bursty pattern of concurrent data sending at high volume when updating the parameters. Technologies to use include InfiniBand, Spectrum-X, or RoCEv2 Ethernet. This may require optimizations to the NVIDIA Collective Communications Library (NCCL), such as tuning to make Ethernet run fast enough. Different connectivity topologies may be considered for the network switches and GPU servers, and hardware options include the choice between optical and electrical cabling.

Monitoring and management of both software and hardware devices is important, as failures and errors are common in hot GPUs and in other network devices and servers. Monitoring tools for data centers include Grafana and Prometheus. Insidious GPU failures from overheating, which can cause incorrect results in computations, can be diagnosed by running self-diagnostics, such as NVIDIA’s “dcgmi” diagnostics (at level 4).

As a CUDA C++ programmer you are largely abstracted away from all this frantic nonsense in the networking layers of a high-end datacenter, but it’s a very specialist skill in high demand at the moment. Some additional information on particular networking technologies is below, and more details are also available in the references.

1. Remote Direct Memory Access (RDMA). This is a network protocol whereby servers can access the memory in other servers, without having to interrupt the remote CPU (or GPU). Using RDMA can allow fast network data transfers between servers without slowing down their computations.

2. Lazy Connection Establishment in NCCL. This is an optimization to the NVIDIA Collective Communications Library (NCCL) protocol, often pronounced as “nickel,” for inter-GPU communication. Lazy connection establishment delays the establishment of connections by the GPU until they are required, thereby reducing the initialization time for NCCL. The feature is controlled by the NCCL_RUNTIME_CONNECT environment variable, and can be disabled by setting this to zero. Note that this is not the same optimization as “lazy loading,” which refers to GPU loading of machine code instructions.

3. Multi-GPU Peer-to-Peer Memory Access. This is sometimes called “P2P” in CUDA and is relevant to motherboards with multiple GPUs running on them. It is an optimization method that involves one GPU accessing the memory of another GPU directly, without any involvement of the CPU (see the sketch after this list).

4. NVLink Data Transfers. This method is for multi-GPU communication within a server. NVLink is a faster interconnect that bypasses the PCIe bus for data transfer, allowing GPUs to communicate with each other more efficiently.

5. Memory-Mapped I/O. This is an optimization where I/O peripherals are directly connected to memory, rather than needing the CPU’s involvement to control data transfers. There are a variety of peripherals that could be attached to your CUDA algorithm, starting with a Tardis or a Holodeck.
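Returning to peer-to-peer access (item 3 above), here is a sketch of enabling P2P and issuing a direct GPU-to-GPU copy (the device numbers, buffer names, and stream are illustrative):

    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, 0, 1);   // can GPU 0 access GPU 1's memory?
    if (can_access) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);         // enable P2P access from GPU 0 to GPU 1
        // Direct GPU-to-GPU copy, with no staging through CPU memory
        cudaMemcpyPeerAsync(d_dst_gpu0, 0, d_src_gpu1, 1, bytes, stream);
    }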

References

  1. Mark Harris, Dec 04, 2012, How to Optimize Data Transfers in CUDA C/C++, NVIDIA Technical Blog, https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/
  2. Mark Harris, Dec 13, 2012, How to Overlap Data Transfers in CUDA C/C++, NVIDIA Technical Blog, https://developer.nvidia.com/blog/how-overlap-data-transfers-cuda-cc/
  3. Mark Harris, Jan 22, 2015, GPU Pro Tip: CUDA 7 Streams Simplify Concurrency, NVIDIA Technical Blog, https://developer.nvidia.com/blog/gpu-pro-tip-cuda-7-streams-simplify-concurrency/
  4. Vivek Kini and Jake Hemstad, Jul 27, 2021, Using the NVIDIA CUDA Stream-Ordered Memory Allocator, Part 1, NVIDIA Technical Blog, https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-1/
  5. Ram Cherukuri, Dec 16, 2020, Enhancing Memory Allocation with New NVIDIA CUDA 11.2 Features, NVIDIA Technical Blog, https://developer.nvidia.com/blog/enhancing-memory-allocation-with-new-cuda-11-2-features/
  6. Dylan Patel and Daniel Nishball, Oct 03, 2024, AI Neocloud Playbook and Anatomy, https://www.semianalysis.com/p/ai-neocloud-playbook-and-anatomy
  7. Together AI, Nov 13, 2023, Announcing Together Inference Engine – the fastest inference available, https://www.together.ai/blog/together-inference-engine-v1

 
