Chapter 5. CUDA Profiling Tools

  • Book Excerpt from "CUDA C++ Optimization: Coding Faster GPU Kernels"
  • by David Spuler

Profiling CUDA C++ Execution

There are various ways to time the execution of your CUDA programs, but they fall into two basic strategies:

  • Profiler tools
  • Timers in the C++ code

This is basically an “inside or outside” choice. Using C++ timers wrapped around your functions is discussed in the next chapter. This chapter focuses on profiler tools that take your executable and examine its runtime efficiency.

Some examples of profiler tools available for CUDA C++ include:

  • NVIDIA Visual Profiler — performance profiling with a graphical interface.
  • Nsight Systems — system profiling and tracing.
  • Nsight Compute — performance profiling for CUDA kernels.
  • Nsight Graphics — specialized profiling for graphics applications.
  • Nsight Deep Learning Designer — profiler focused on AI/ML applications.

Command-line profiler tools include:

  • ncu — NVIDIA Nsight Compute CLI.
  • nvprof — command-line profiler (now deprecated).

There are also some advanced APIs and SDKs available if you want to get ambitious and integrate deeply with the CUDA profiling tools:

  • CUDA Profiling Tools Interface (CUPTI) — profiling and tracing integration API.
  • NVIDIA Tools Extension SDK (NVTX) — tool integration API for annotating your own code with named ranges and markers (see the sketch after this list).
  • Nsight Perf SDK — performance profiling for graphics applications.
  • Nsight Tools JupyterLab Extension — extension for profiling of Python applications.
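
Of these, NVTX is the one you are most likely to touch directly: you wrap regions of your own host code in named ranges, and those ranges then show up on the Nsight Systems timeline. Here is a minimal sketch, assuming the header-only NVTX 3 API that ships with recent CUDA toolkits; the function being annotated is just a placeholder:

    #include <nvtx3/nvToolsExt.h>   // NVTX v3 header (header-only, ships with recent CUDA toolkits)

    void run_one_step(void)         // placeholder for some phase of your program
    {
        nvtxRangePushA("run_one_step");   // open a named range
        // ... kernel launches, memory copies, etc. ...
        nvtxRangePop();                   // close the range
    }

In older toolkits the header is nvToolsExt.h and you also link with -lnvToolsExt; NVTX 3 needs no extra library.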

Nsight Compute CLI: ncu

I’m a big fan of command-line tools on Linux, so this section demonstrates the various reports from the CLI. The most functional profiler from NVIDIA is the Nsight Compute CLI, which is simply “ncu” on the command-line. It also goes by the much more impressive name “nv-nsight-cu-cli” and whoever chose that name must be good at touch typing.

If you prefer the GUI version, it is launched with the “nv-nsight-cu” command (or “ncu-ui” in more recent toolkits). There are a variety of graphical reports available in that interface, but I’d rather stick to text, thank you very much, mainly because I was programming Unix back when there were 27 types of Unix, rather than 27 flavors of Linux.

Running the command-line interface to Nsight Compute is as simple as this:

    ncu ./a.out

The result is a text-based report at the end of execution with a variety of sections, focused on kernel efficiency.

Here is one section:

    Section: GPU Speed Of Light Throughput
    ----------------------- ------------- ------------
    Metric Name               Metric Unit Metric Value
    ----------------------- ------------- ------------
    DRAM Frequency          cycle/nsecond         4.91
    SM Frequency            cycle/usecond       576.05
    Elapsed Cycles                  cycle        4,962
    Memory Throughput                   %         3.10
    DRAM Throughput                     %         0.02
    Duration                      usecond         8.61
    L1/TEX Cache Throughput             %         5.46
    L2 Cache Throughput                 %         2.36
    SM Active Cycles                cycle     2,811.53
    Compute (SM) Throughput             %         9.81
    ----------------------- ------------- ------------
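
“Speed of Light” here means the percentage of the device’s theoretical peak throughput that the kernel actually achieved. In this run, both compute throughput (9.81%) and memory throughput (3.10%) are tiny fractions of peak, which is about what you would expect from a trivially small kernel that finishes in under 10 microseconds.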

Here is another report summary showing details of the grid size and other parameters of the kernel launch:

    Section: Launch Statistics
    -------------------------------- --------------- ---------------
    Metric Name                          Metric Unit    Metric Value
    -------------------------------- --------------- ---------------
    Block Size                                                    32
    Function Cache Configuration                     CachePreferNone
    Grid Size                                                  1,024
    Registers Per Thread             register/thread              16
    Shared Memory Configuration Size           Kbyte           32.77
    Driver Shared Memory Per Block        byte/block               0
    Dynamic Shared Memory Per Block       byte/block               0
    Static Shared Memory Per Block        byte/block               0
    Threads                                   thread          32,768
    Waves Per SM                                                1.60
    -------------------------------- --------------- ---------------
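
These numbers fall straight out of the kernel launch configuration. As a rough sketch of the kind of kernel and launch that would produce them (the kernel name and signature are taken from the nvprof output later in this chapter, the body is a guess from the name, and the rest is just arithmetic from the table):

    // Hypothetical kernel matching the launch statistics above.
    __global__ void aussie_clear_vector_kernel_basic(float *v, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n) v[i] = 0.0f;                          // clear one element per thread
    }

    int main()
    {
        int n = 32768;                                   // "Threads" = 32,768
        float *d_v = nullptr;
        cudaMalloc(&d_v, n * sizeof(float));             // device vector to clear
        int threads_per_block = 32;                      // "Block Size" = 32
        int blocks = (n + threads_per_block - 1) / threads_per_block;   // "Grid Size" = 1,024
        aussie_clear_vector_kernel_basic<<<blocks, threads_per_block>>>(d_v, n);
        cudaDeviceSynchronize();
        cudaFree(d_v);
        return 0;
    }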

And here is an example of the occupancy-related report:

    Section: Occupancy
    ------------------------------- ----------- ------------
    Metric Name                     Metric Unit Metric Value
    ------------------------------- ----------- ------------
    Block Limit SM                        block           16
    Block Limit Registers                 block          128
    Block Limit Shared Mem                block           16
    Block Limit Warps                     block           32
    Theoretical Active Warps per SM        warp           16
    Theoretical Occupancy                     %           50
    Achieved Occupancy                        %        35.00
    Achieved Active Warps Per SM           warp        11.20
    ------------------------------- ----------- ------------
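
To decode this: occupancy is the ratio of resident warps to the hardware maximum per SM, which the limits above imply is 32 warps on this GPU. With 32-thread blocks, each block is a single warp, so the block-per-SM limit of 16 caps the theoretical count at 16 warps, i.e. 50% theoretical occupancy, and the achieved 35% is simply the 11.2 warps actually resident on average divided by the same 32-warp maximum (11.2 / 32 = 0.35).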

There are a variety of command-line options for ncu, such as:

  • -h or --help — helpful helping helpers.
  • -v or --version — profiler version information.
  • --mode — select launch or attach behavior; useful for attaching to running processes.
  • --hostname — used for remote profiling on another machine.
  • --devices — limit profiling to the chosen GPU devices.

There are numerous other command-line options, but I’ve run out of room in the e-book. You might have to read the documentation.
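
That said, a few common invocations are worth sketching here; the exact flags vary a little between Nsight Compute versions, and the kernel name below is just a placeholder, so check ncu --help before relying on them:

    ncu --set full ./a.out             # collect the full set of report sections (slower)
    ncu -k my_kernel ./a.out           # profile only kernels whose name matches
    ncu --launch-count 1 ./a.out       # stop after profiling the first kernel launch
    ncu -o report ./a.out              # write report.ncu-rep for the Nsight Compute GUI
    ncu --csv ./a.out                  # emit the metrics in CSV form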

NVIDIA Profiler: nvprof

The command-line nvprof profiler tool is now “deprecated” and won’t be around forever, but it’s still useful for now. There is an “nvprof Transition Guide” document available, which was like learning about submarine sonar in The Hunt for Red October, except more useful.

The report from nvprof is much simpler than the one from ncu, which has more functionality. The focus is on the percentage of time spent in each function call.
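
Running it is just as simple:

    nvprof ./a.out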

Here’s a report from nvprof, which looks a lot like the output I’m used to from gprof:


==3137== Profiling application: ./a.out
==3137== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   36.23%  13.728us         1  13.728us  13.728us  13.728us  [CUDA memcpy HtoD]
                   32.18%  12.192us         1  12.192us  12.192us  12.192us  [CUDA memcpy DtoH]
                   22.13%  8.3840us         1  8.3840us  8.3840us  8.3840us  aussie_clear_vector_kernel_basic(float*, int)
                    9.46%  3.5840us         1  3.5840us  3.5840us  3.5840us  [CUDA memset]
      API calls:   96.45%  93.891ms         3  31.297ms     872ns  93.889ms  cudaDeviceGetLimit
                    2.63%  2.5584ms        21  121.83us  107.43us  140.16us  cudaGetDeviceProperties
                    0.31%  299.66us         1  299.66us  299.66us  299.66us  cudaLaunchKernel
                    0.15%  145.36us       114  1.2750us     142ns  56.265us  cuDeviceGetAttribute
                    0.14%  139.41us         2  69.704us  50.862us  88.547us  cudaMemcpy
                    0.14%  133.71us         1  133.71us  133.71us  133.71us  cudaMalloc
                    0.11%  109.20us         1  109.20us  109.20us  109.20us  cudaFree
                    0.04%  41.426us         1  41.426us  41.426us  41.426us  cudaMemset
                    0.01%  12.107us         1  12.107us  12.107us  12.107us  cuDeviceGetName
                    0.01%  5.0840us         1  5.0840us  5.0840us  5.0840us  cuDeviceGetPCIBusId
                    0.00%  4.4260us         1  4.4260us  4.4260us  4.4260us  cuDeviceTotalMem
                    0.00%  1.5590us         3     519ns     212ns  1.0520us  cuDeviceGetCount
                    0.00%  1.2630us         2     631ns     521ns     742ns  cudaGetLastError
                    0.00%  1.1280us         2     564ns     177ns     951ns  cuDeviceGet
                    0.00%     823ns         1     823ns     823ns     823ns  cuModuleGetLoadingMode
                    0.00%     257ns         1     257ns     257ns     257ns  cuDeviceGetUuid

As you can see, this gives a simple breakdown of the time spent in the various routines and APIs. Interestingly, it looks like cudaDeviceGetLimit is quite expensive, although the huge spread between its minimum (872ns) and maximum (93.889ms) calls suggests the first call is really paying the one-time cost of CUDA context initialization, rather than the function itself being slow.

 
