Aussie AI
Chapter 5. CUDA Profiling Tools
Book Excerpt from "CUDA C++ Optimization: Coding Faster GPU Kernels"
by David Spuler
Profiling CUDA C++ Execution
There are various ways to time the execution of your CUDA programs, falling into two basic strategies:
- Profiler tools
- Timers in the C++ code
This is basically an “inside or outside” choice. Using C++ timers wrapped around your functions is discussed in the next chapter. This chapter focuses on profiler tools that take your executable and examine its runtime efficiency.
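The profiler reports in this chapter come from running the tools over a trivial test executable (a.out). Its source isn't reproduced in this excerpt, so here is only a minimal sketch of that kind of program, assuming a simple vector-clearing kernel with the same name as the one that appears in the nvprof report later in the chapter:

#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: clear each element of a float vector to zero.
__global__ void aussie_clear_vector_kernel_basic(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = 0.0f;
}

int main()
{
    const int n = 32 * 1024;   // 32,768 elements, matching the launch statistics later
    float *host = (float*)calloc(n, sizeof(float));
    float *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));   // error checking omitted for brevity
    cudaMemset(dev, 0, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    aussie_clear_vector_kernel_basic<<<1024, 32>>>(dev, n);   // 1,024 blocks of 32 threads
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    free(host);
    return 0;
}

Compiling this with nvcc and no -o option produces a.out by default on Linux, which is then the executable handed to the profilers.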
Some examples of the profiler tools available for CUDA C++ include:
- NVIDIA Visual Profiler — performance profiling with a GUI interface.
- Nsight Systems — system profiling and tracing.
- Nsight Compute — performance profiling for CUDA kernels.
- Nsight Graphics — specialized profiling for graphics applications.
- Nsight Deep Learning Designer — profiler focused on AI/ML applications.
Command-line profiler tools include:
- ncu — NVIDIA Nsight Compute CLI.
- nvprof — command-line profiler (now deprecated).
There are also some advanced APIs and SDKs available if you want to get ambitious and do some very deep integrations into the CUDA profiling tools:
- CUDA Profiling Tools Interface (CUPTI) — profiling and tracing integration API.
- NVIDIA Tools Extension SDK (NVTX) — tool integration API for annotating code with named ranges (see the sketch after this list).
- Nsight Perf SDK — performance profiling for graphics applications.
- Nsight Tools JupyterLab Extension — extension for profiling of Python applications.
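As a small taste of what that integration looks like, here is a minimal sketch of using NVTX to wrap a region of host code in a named range, so that the range shows up in Nsight timelines. It assumes the NVTX v3 header that ships with the CUDA toolkit, and the range name and the work inside it are placeholders:

#include <nvtx3/nvToolsExt.h>   // NVTX v3 header-only API, bundled with the CUDA toolkit

void run_profiled_section()
{
    nvtxRangePushA("clear-vectors");   // open a named range ("clear-vectors" is a placeholder name)
    // ... kernel launches or other work to be highlighted in the profiler timeline ...
    nvtxRangePop();                    // close the range
}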
Nsight Compute CLI: ncu
I’m a big fan of command-line tools on Linux,
so this section demonstrates the various reports from the CLI.
The most functional profiler from NVIDIA is the
Nsight Compute CLI, which is simply “ncu” on the command-line.
It also goes by the much more impressive name “nv-nsight-cu-cli”
and whoever chose that name must be good at touch typing.
If you prefer the GUI version, this is launched by the “nv-nsight-cu” command.
There are a variety of graphical reports available in that interface,
but I’d rather stick to text, thank you very much,
mainly because I was programming Unix back when there were 27 types of Unix,
rather than 27 flavors of Linux.
Running the command-line interface to Nsight Compute is as simple as this:
ncu a.out
The result is a text-based report at the end of execution with a variety of sections, focused on kernel efficiency.
Here is one section:
Section: GPU Speed Of Light Throughput
----------------------- ------------- ------------
Metric Name Metric Unit Metric Value
----------------------- ------------- ------------
DRAM Frequency cycle/nsecond 4.91
SM Frequency cycle/usecond 576.05
Elapsed Cycles cycle 4,962
Memory Throughput % 3.10
DRAM Throughput % 0.02
Duration usecond 8.61
L1/TEX Cache Throughput % 5.46
L2 Cache Throughput % 2.36
SM Active Cycles cycle 2,811.53
Compute (SM) Throughput % 9.81
----------------------- ------------- ------------
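The "Speed of Light" percentages are throughput as a fraction of each unit's theoretical peak, so compute at 9.81% and memory at 3.10% mean this tiny kernel comes nowhere near saturating the GPU. The numbers are also internally consistent: 4,962 elapsed cycles at 576.05 cycles per microsecond works out to the reported duration of about 8.61 microseconds.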
Here is another report summary showing details of the grid size and other parameters of the kernel launch:
Section: Launch Statistics
-------------------------------- --------------- ---------------
Metric Name Metric Unit Metric Value
-------------------------------- --------------- ---------------
Block Size 32
Function Cache Configuration CachePreferNone
Grid Size 1,024
Registers Per Thread register/thread 16
Shared Memory Configuration Size Kbyte 32.77
Driver Shared Memory Per Block byte/block 0
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block byte/block 0
Threads thread 32,768
Waves Per SM 1.60
-------------------------------- --------------- ---------------
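Those figures line up with the <<<1024, 32>>> launch in the sketch earlier: a grid of 1,024 blocks, each with a block size of 32 threads, gives the 32,768 total threads reported here.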
And here is an example of the occupancy-related report:
Section: Occupancy
------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
------------------------------- ----------- ------------
Block Limit SM block 16
Block Limit Registers block 128
Block Limit Shared Mem block 16
Block Limit Warps block 32
Theoretical Active Warps per SM warp 16
Theoretical Occupancy % 50
Achieved Occupancy % 35.00
Achieved Active Warps Per SM warp 11.20
------------------------------- ----------- ------------
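Unpacking the arithmetic: the block limits of 16 (for blocks per SM and for shared memory) are the binding constraint, and with a block size of only 32 threads (one warp per block) that caps the kernel at 16 theoretically active warps per SM. Against the 32-warp-per-SM maximum implied by the warp limit, 16 warps gives the 50% theoretical occupancy, and the achieved 11.20 active warps per SM is likewise the 35% achieved occupancy.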
There are a variety of command-line options for ncu, such as:
- -h or --help — helpful helping helpers.
- -v or --version — profiler version information.
- --mode — useful for attaching to running processes.
- --hostname — used for remote debugging.
- --devices — limit profiling to chosen GPU devices.
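For example, to restrict profiling to the first GPU in a multi-GPU machine, the --devices option takes a device index (or a comma-separated list of indices):

ncu --devices 0 ./a.out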
There are numerous other command-line options, but I’ve run out of room in the e-book. You might have to read the documentation.
NVIDIA Profiler: nvprof
The command-line nvprof profiler tool is now “deprecated”
and won’t be around forever,
but it’s still useful for now.
There is an “nvprof Transition Guide” document available,
which was like learning
about submarine sonar in The Hunt for Red October,
except more useful.
The report from nvprof is much simpler than that from ncu,
which has more functionality.
The focus is much more on the percentage usage by function call.
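Running it is just as simple as running ncu, pointing it at the same test executable:

nvprof ./a.out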
Here’s a report from nvprof, which looks a lot like
the output I’m used to from gprof:
==3137== Profiling application: ./a.out
==3137== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 36.23% 13.728us 1 13.728us 13.728us 13.728us [CUDA memcpy HtoD]
32.18% 12.192us 1 12.192us 12.192us 12.192us [CUDA memcpy DtoH]
22.13% 8.3840us 1 8.3840us 8.3840us 8.3840us aussie_clear_vector_kernel_basic(float*, int)
9.46% 3.5840us 1 3.5840us 3.5840us 3.5840us [CUDA memset]
API calls: 96.45% 93.891ms 3 31.297ms 872ns 93.889ms cudaDeviceGetLimit
2.63% 2.5584ms 21 121.83us 107.43us 140.16us cudaGetDeviceProperties
0.31% 299.66us 1 299.66us 299.66us 299.66us cudaLaunchKernel
0.15% 145.36us 114 1.2750us 142ns 56.265us cuDeviceGetAttribute
0.14% 139.41us 2 69.704us 50.862us 88.547us cudaMemcpy
0.14% 133.71us 1 133.71us 133.71us 133.71us cudaMalloc
0.11% 109.20us 1 109.20us 109.20us 109.20us cudaFree
0.04% 41.426us 1 41.426us 41.426us 41.426us cudaMemset
0.01% 12.107us 1 12.107us 12.107us 12.107us cuDeviceGetName
0.01% 5.0840us 1 5.0840us 5.0840us 5.0840us cuDeviceGetPCIBusId
0.00% 4.4260us 1 4.4260us 4.4260us 4.4260us cuDeviceTotalMem
0.00% 1.5590us 3 519ns 212ns 1.0520us cuDeviceGetCount
0.00% 1.2630us 2 631ns 521ns 742ns cudaGetLastError
0.00% 1.1280us 2 564ns 177ns 951ns cuDeviceGet
0.00% 823ns 1 823ns 823ns 823ns cuModuleGetLoadingMode
0.00% 257ns 1 257ns 257ns 257ns cuDeviceGetUuid
As you can see, this gives a simple breakdown of the time spent
in the various routines and APIs.
Interestingly, it looks like cudaDeviceGetLimit is quite expensive, although that 93 milliseconds is most likely the one-off cost of lazy CUDA context initialization being charged to the first runtime API call, rather than cudaDeviceGetLimit itself being slow.