Aussie AI

Chapter 4. AI Kernel Optimization

  • Book Excerpt from "CUDA C++ Optimization: Coding Faster GPU Kernels"
  • by David Spuler


GPU Demand for AI

The future is bright for GPU hardware, and also for CUDA C++ programmers. The demand should remain strong for GPUs to run AI backends because of these trends:

  • Consumer adoption — increasing steadily, even though most people haven’t fully figured out what to do with it yet.
  • Business productionization — there is a continual release of POC projects into real-world usage.
  • Multi-step reasoning algorithms — as exemplified by the recent OpenAI o1 model, the way to make AI models smarter is to run multiple sub-queries (e.g., Chain-of-Thought, Reflection, etc.), which sends AI backend processing requirements soaring.
  • On-device is inadequate — although some workloads will migrate to on-device LLMs, they just won’t be fast enough for multi-step algorithms.
  • Application-layer is just starting — expect a whole generation of new AI-centric applications.

With these tailwinds and the power of NVIDIA’s next-generation GPUs, I expect CUDA C++ will remain in demand for years to come.

AI Kernel Bottlenecks

AI models and inference engines have some particular characteristics that enable additional optimization techniques. Some of the AI-specific optimizations for CUDA kernels are discussed here.

The main bottlenecks in AI computations are similar across all engines. There are two main number-crunching sequences in AI inference:

  • Attention algorithm
  • Feed-forward networks (FFNs)

Note that “attention” is sometimes called “self-attention” or “attention-heads” or “Multi-Head Attention” (MHA). Furthermore, another older name for the FFNs is “Multi-Layer Perceptron” (MLP).

Both the attention kernel and the FFN kernels perform matrix multiplications, but they’re not the same thing. The FFN is the more classic case: two matrix multiplications with an activation function in between. Attention kernels also use matrix multiplication, but inside a very specific higher-level algorithm built around the Q, K, and V matrices.
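
As a minimal sketch of that FFN structure (not any production engine’s kernel code), the two matrix multiplications with an activation in between might look like the naive CUDA C++ below. The dimension names (d_model, d_ff) and the ReLU placeholder are illustrative assumptions; real engines call library GEMMs and use activations such as GELU or SiLU.

    #include <cuda_runtime.h>

    // Minimal FFN sketch: Y = Activation(X * W1) * W2
    // Naive kernels for illustration only; production engines use cuBLAS/CUTLASS.
    __global__ void matmul(const float* A, const float* B, float* C,
                           int M, int K, int N) {
        // C[M x N] = A[M x K] * B[K x N], one thread per output element
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < M && col < N) {
            float sum = 0.0f;
            for (int k = 0; k < K; ++k)
                sum += A[row * K + k] * B[k * N + col];
            C[row * N + col] = sum;
        }
    }

    __global__ void relu(float* X, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) X[i] = fmaxf(X[i], 0.0f);   // activation between the two matmuls
    }

    // Projection up to d_ff, element-wise activation, projection back to d_model.
    void ffn_forward(const float* X, const float* W1, const float* W2,
                     float* H, float* Y, int tokens, int d_model, int d_ff,
                     cudaStream_t stream) {
        dim3 block(16, 16);
        dim3 grid1((d_ff + 15) / 16, (tokens + 15) / 16);
        dim3 grid2((d_model + 15) / 16, (tokens + 15) / 16);
        matmul<<<grid1, block, 0, stream>>>(X, W1, H, tokens, d_model, d_ff);  // first matmul
        int n = tokens * d_ff;
        relu<<<(n + 255) / 256, 256, 0, stream>>>(H, n);                       // activation
        matmul<<<grid2, block, 0, stream>>>(H, W2, Y, tokens, d_ff, d_model);  // second matmul
    }

The point is the shape of the computation: a projection up to the larger FFN dimension, an element-wise activation, and a projection back down.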

Attention Kernels

For attention optimizations, you have probably heard of at least one of the newer memory-efficient attention algorithms:

  • Group-Query Attention (GQA)
  • Multi-Query Attention (MQA)
  • Flash Attention (versions 1, 2, and 3, so far)
  • Paged attention

There are some variations, such as combining Flash Attention with Paged attention. There have also been derivative papers on other LLM components (i.e., not the attention heads), such as Flash Decoding and Flash Inference.

There was a flurry of work on memory-efficient attention kernels that led to Paged attention (implemented in the open-source vLLM engine) and Flash Attention (versions 1, 2, and 3). However, the pace of those theoretical papers has since slowed, although the CUDA C++ implementations of these algorithms are now appearing in industry engines.

The newer area of research focus is the KV cache bottleneck in attention computations, which is now getting more attention (haha), especially its unbounded memory growth for long-context queries. There is currently an endless stream of papers on KV cache compression, which are effectively a re-application of all the model compression theories to the KV cache: KV cache quantization (e.g., 4-bit), KV cache pruning, KV cache layer fusion, and KV cache matrix factorization, just to name a few.
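
To see why the KV cache has become the bottleneck for long contexts, a back-of-the-envelope size calculation helps. The sketch below uses illustrative model dimensions (roughly a 7B-class model with 32 layers, 32 KV heads, and a head dimension of 128); these are assumptions for the arithmetic, not any specific engine’s configuration.

    #include <cstdio>

    // Back-of-envelope KV cache size:
    //   2 (K and V) x layers x kv_heads x head_dim x context x batch x bytes/element
    int main() {
        long long layers = 32, kv_heads = 32, head_dim = 128;   // illustrative 7B-class model
        long long context = 32768;                              // long-context query
        long long batch = 1;
        double bytes_fp16 = 2.0, bytes_int4 = 0.5;

        double fp16_gb = 2.0 * layers * kv_heads * head_dim * context * batch * bytes_fp16 / 1e9;
        double int4_gb = 2.0 * layers * kv_heads * head_dim * context * batch * bytes_int4 / 1e9;
        printf("KV cache at FP16:  %.1f GB\n", fp16_gb);   // about 17 GB
        printf("KV cache at 4-bit: %.1f GB\n", int4_gb);   // about 4 GB
        return 0;
    }

At a 32K context, that works out to roughly 17 GB of VRAM per sequence at FP16, which is a big slice of even a high-end GPU’s memory, and is exactly why 4-bit KV cache quantization is such an active area.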

Prefix and Substring KV Caching

The other alternative to compressing the KV cache is to avoid computing it at all by caching the data. This optimization means that almost the entire prefill phase of computation is avoided, leading to very low latency. The benefit is so high that several industry players are offering cheaper per-token prices for “cached tokens” (e.g., OpenAI and DeepSeek). No doubt, there are CUDA C++ engineers rushing to add this functionality at many other AI platform companies.

Prefixes are surprisingly common in AI inference. For example, the conversational history with a chatbot is always a known prefix. However, prefix caching only works partially for RAG chunks, which must be first in line to be seen as a prefix.

The only problem with prefix caching? It needs a prefix. This is problematic for RAG chunks and other documents, which aren’t always at the start of the token stream. Hence, the obvious generalization is to make it work for non-prefix substrings of tokens. For example, prefix KV caching requires having already seen a particular prefix, whereas substring KV caching could work for any appearance of a known token sequence, in any position (e.g., RAG chunks in any order). There’s already research on this “substring KV caching” or “fused KV caches,” but only a handful of papers have been released so far. I predict more!

How do we merge multiple KV caches? It seems like it should be a fancy GPU computation, and yet, no. Instead, we just put one after the other, in a sequence, and pretend that we’ve computed the cache for the full token sequence. This is “fused KV caching” and it’s obviously a fast method that requires only memory copies and literally zero extra computation. It seems like this surprisingly simple result needs a little more research, but it may be that the solution really is that easy. There are only a couple of papers in this area so far: Yao et al. (2024) and Gim et al. (2023).
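
As a minimal sketch of this “concatenate and pretend” idea, fusing precomputed per-chunk caches is little more than back-to-back device memory copies. The struct, field names, and the contiguous [tokens x kv_dim] layout below are illustrative assumptions; a paged KV layout in a real engine would use an indirection table instead.

    #include <cuda_runtime.h>
    #include <vector>

    // Fused KV caching sketch: concatenate precomputed per-chunk KV caches along
    // the sequence dimension, with no recomputation (for one layer, one head group).
    struct KVChunk {
        const float* keys;     // [chunk_len x kv_dim] on the device
        const float* values;   // [chunk_len x kv_dim] on the device
        int chunk_len;
    };

    // Copies each chunk's K and V directly after the previous one; returns the
    // total token count, after which attention runs over the fused sequence.
    int fuse_kv_chunks(const std::vector<KVChunk>& chunks,
                       float* fused_keys, float* fused_values,
                       int kv_dim, cudaStream_t stream) {
        int offset_tokens = 0;
        for (const KVChunk& c : chunks) {
            size_t bytes = (size_t)c.chunk_len * kv_dim * sizeof(float);
            size_t dst_off = (size_t)offset_tokens * kv_dim;
            cudaMemcpyAsync(fused_keys + dst_off, c.keys, bytes,
                            cudaMemcpyDeviceToDevice, stream);
            cudaMemcpyAsync(fused_values + dst_off, c.values, bytes,
                            cudaMemcpyDeviceToDevice, stream);
            offset_tokens += c.chunk_len;
        }
        return offset_tokens;
    }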

Matrix Multiplication Kernels

And for the FFNs, there are already some highly optimized MatMul/GEMM kernels to consider. However, there are many variants of matrix multiplication, including quantized versions (e.g., 8-bit or 4-bit integer kernels) and sparse matrix multiplications. An AI engine uses both matrix-matrix multiplication and matrix-vector multiplication computations.

Note that most matrices in AI kernels are rectangular rather than square. In fact, they’re usually three-dimensional tensors shaped like a brick, rather than two-dimensional matrices.

Choosing the right FFN matrix multiplication kernel is critical to the overall speed of your CUDA-optimized AI inference engine. You might want to take some extra time to benchmark your CUDA C++ implementation carefully.
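
One reasonable baseline for such benchmarking is a library GEMM. The sketch below calls cuBLAS for an FP32 matrix multiplication on row-major data; the wrapper name and the row-major convention are assumptions for illustration, and quantized or FP16 paths would typically go through cuBLASLt or CUTLASS instead.

    #include <cublas_v2.h>

    // Baseline FP32 GEMM via cuBLAS, to benchmark hand-written kernels against.
    // Computes row-major C[M x N] = A[M x K] * B[K x N]. Since cuBLAS is
    // column-major, we ask it for C^T = B^T * A^T, which has the same memory layout.
    void gemm_rowmajor(cublasHandle_t handle,
                       const float* A, const float* B, float* C,
                       int M, int N, int K) {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    N, M, K,          // dimensions of the transposed problem
                    &alpha,
                    B, N,             // B^T is N x K with leading dimension N
                    A, K,             // A^T is K x M with leading dimension K
                    &beta,
                    C, N);            // C^T is N x M with leading dimension N
    }

If your hand-written kernel can’t get close to the library call for your actual matrix shapes, the library call wins.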

Compute-Bound versus Memory-Bound

The above analysis of the two main computation costs is somewhat misleading, because it ignores memory costs. The main GPT architecture is called the “Transformer” and it has a lot of components, with different characteristics across multiple “layers” of components. Generally, it has been found that the initial “prefill” phase before outputting the first token is compute-bound, whereas the subsequent “decoding algorithm” that spits out one token at a time (called “autoregressive decoding”) is memory-bound. Note that compute-bound means “GPU-bound” and memory-bound really means “VRAM-bound.”

However, it’s even more nuanced than that:

  • Prefill phase — compute-bound.
  • Attention algorithm (during decoding phase) — memory-bound.
  • Feed-forward networks (during decoding) — compute-bound.

The reason is complicated, but it’s basically that FFN matrices are more static than the attention matrices during the decoding phase. The FFNs are sitting there multiplying the same matrices over and over (i.e., compute-bound). However, the attention algorithm gets new KV matrices from the “KV cache” at every iteration, so it has the extra memory cost of managing that changing data (i.e., memory-bound). Hence, the above “memory-efficient attention algorithms” are focused on reducing memory and data access costs.

The prefill phase has both FFN and attention computations (and builds the KV cache), but both subcomponents are compute-bound. The prefill phase is not as dynamic with its data: it processes the input query tokens rather than generating new ones, so all of the tokens are known ahead of time, the data can be loaded in bulk, and the GPU simply crunches away (i.e., compute-bound for both FFNs and attention).
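
One way to make the memory-bound claim concrete is a rough roofline check: compare a kernel’s arithmetic intensity (FLOPs per byte of data moved) against the GPU’s compute-to-bandwidth ratio. The sketch below does this for a single decode-step attention head; the hardware numbers and dimensions are illustrative assumptions, not any particular GPU’s specification.

    #include <cstdio>

    // Roofline check: a kernel is memory-bound when its arithmetic intensity
    // (FLOPs per byte moved) is below the GPU's compute-to-bandwidth ratio.
    int main() {
        double peak_tflops = 60.0;       // FP16 throughput, illustrative
        double mem_bw_gbs  = 2000.0;     // HBM bandwidth in GB/s, illustrative
        double machine_balance = peak_tflops * 1e12 / (mem_bw_gbs * 1e9);

        long long seq_len = 8192, head_dim = 128;       // one attention head, one decode step
        double flops = 2.0 * 2.0 * seq_len * head_dim;  // QK^T plus scores*V
        double bytes = 2.0 * seq_len * head_dim * 2.0;  // read all of K and V at FP16
        double intensity = flops / bytes;

        printf("Machine balance:       %.0f FLOPs/byte\n", machine_balance);
        printf("Decode attention head: %.1f FLOPs/byte -> %s\n", intensity,
               intensity < machine_balance ? "memory-bound" : "compute-bound");
        return 0;
    }

With roughly one FLOP per byte, against a machine balance in the tens of FLOPs per byte, the decode-phase attention head is nowhere near compute-limited, which is exactly why the memory-efficient attention and KV cache work above matters so much.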

State-of-the-Art Kernels

Some examples of the kinds of CUDA C++ kernel algorithms that are state-of-the-art as of this writing:

  • Hybrid local-global attention — alternating layers of sliding window attention and global attention (see the sketch after this list).
  • 4-bit kernels for weights, activations, and the KV cache — dense, sparse, GEMM/GEMV, fused activation functions, and other variants.
  • Advanced KV caching — prefix KV caching, RAG cache, etc.
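
For the hybrid local-global idea, a minimal sketch of the layer schedule is below. The 1-in-6 ratio echoes the Character.AI configuration discussed later in this chapter; the function names, window size, and ratio are otherwise illustrative assumptions.

    // Hybrid local-global layer schedule: most layers use a sliding window over
    // recent tokens, with a full global-attention layer interleaved periodically.
    inline bool is_global_layer(int layer, int global_every = 6) {
        return (layer % global_every) == 0;     // e.g., layers 0, 6, 12, ...
    }

    inline int attention_span(int layer, int seq_len, int local_window = 1024) {
        // Global layers attend to the whole sequence; local layers attend only
        // to a fixed-size window of the most recent tokens.
        int span = is_global_layer(layer) ? seq_len : local_window;
        return span < seq_len ? span : seq_len;
    }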

Some of the even more obscure research-level algorithms that I believe may break through to mainstream include:

  • Fused/substring KV caching — generalizing prefix KV caches.
  • Block-level mixed-precision integer quantization — different quantization parameters for vector segments.
  • Block Floating-Point (BFP) — granular representation of floating-point exponents, allowing integer arithmetic.
  • Multi-LoRA memory-efficient loading/swapping algorithms — avoiding the weight update to the large model.
  • Heuristic generalized speculative decoding — eschewing the small drafter model for efficient heuristics.
  • Parallel decoding/lookahead decoding/multi-token prediction — bypassing autoregression.
  • Advanced prefill optimizations — going beyond chunked prefill and phase splitting (disaggregated prefill).

Might need some CUDA C++ written for all of those!

State-of-the-Art Industry Backends

It’s somewhat difficult to determine the state-of-the-art in LLM serving backends, as used by the industry’s top players. Much of this information is commercially sensitive and is no longer appearing in public research papers (maybe it’s in patents!). Nevertheless, there are some public papers and articles about various issues. Let’s look at a few of them.

Character.AI companion-bot backend. As detailed in their blog post, Character.AI serves a very high level of traffic to their models. Their inference optimization techniques include:

  • INT8 quantization of weights and activations (sketched below)
  • KV cache quantization (also INT8)
  • MatMul INT8 kernels
  • INT8 training (QAT)
  • Hybrid attention with interleaved layers of local attention and global attention (with global attention for only approximately 1 in every 6 layers)
  • KV cache compression
  • Multi-Query Attention (MQA)
  • KV cache layer fusion
  • Session KV caching (for chat sessions)
  • Prefix KV caching
  • Sticky sessions (avoids copying session caches)

They cite a 13.5X reduction in cost versus use of commercial model hosting (i.e., by doing it themselves), and also a 33X reduction compared to when they started optimizing.
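
As a minimal illustration of the INT8 approach in that list, the sketch below shows symmetric INT8 quantization and dequantization of a weight tensor. Real deployments typically use per-channel scales, quantization-aware training, and fused INT8 GEMM kernels (e.g., via cuBLASLt or CUTLASS), so this is a conceptual sketch rather than Character.AI’s actual code.

    #include <cuda_runtime.h>
    #include <stdint.h>

    // Symmetric per-tensor INT8 quantization: q = round(w / scale), clamped to [-127, 127].
    // The host computes scale = max(|w|) / 127 once per tensor (or per channel).
    __global__ void quantize_int8(const float* w, int8_t* q, float inv_scale, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = rintf(w[i] * inv_scale);          // scale and round to nearest
            v = fminf(fmaxf(v, -127.0f), 127.0f);       // clamp to the INT8 range
            q[i] = (int8_t)v;
        }
    }

    // Approximate reconstruction of the FP32 weight from its INT8 code.
    __global__ void dequantize_int8(const int8_t* q, float* w, float scale, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) w[i] = q[i] * scale;
    }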

Together AI data center networking. In various papers and announcements, Together AI has offered details of its backend platform. Some example software optimizations include:

  • CUDA backend for NVIDIA GPUs
  • Flash Attention
  • Flash Decoding
  • Medusa decoding (multi-token prediction)

Together AI also described their GPU cluster management, including the networking aspects and validation steps. This involves techniques and components such as:

  • H100 GPUs
  • InfiniBand networking
  • NVLink and NVSwitch
  • NCCL
  • HPC-X
  • SLURM (workload management)
  • Remote Direct Memory Access (RDMA) using GPUDirect
  • Telegraf (open source monitoring)

Important steps in the process include:

  • GPU validation
  • Network validation
  • Storage validation

Optimizing the AI Stack

CUDA C++ optimization is important, but it’s only part of the optimization puzzle. The inference engine running the LLM queries is one component of the AI tech stack. Optimizing CUDA C++ will make the AI engine backend run faster, but there are many other techniques for optimizing the entire tech stack of an AI application. Some of the main ideas include:

  • Hardware layer — buy a bigger GPU (and write CUDA for it!)
  • OS layer — tweak your Linux kernel parameters and process priorities.
  • Model compression — make the AI models smaller via quantization, pruning, or distillation (or just use a smaller model!).
  • Deployment stack — optimize serving and scheduling of user requests.
  • Streaming — pass back partial results incrementally.
  • Caching — use an inference cache, prompt caching, KV caching, or other types.
  • Networking and bandwidth optimizations — use RDMA, NVLink, and others.

RAG Stack Optimization. The RAG architecture has all of the above, and also has a set of additional optimizations:

  • Faster vector databases (e.g., indexing, in-memory databases, nearest neighbor optimizations).
  • RAG-specific KV caching (i.e., prefix caching, and fused KV caching to come soon).
  • Rerankers and packers to better organize multiple RAG chunks.

That’s all getting to be far beyond a CUDA C++ conversation. Overall, let’s note that the inference time of an LLM remains one of the main bottlenecks in the AI stack, so the user’s latency will be greatly affected by your CUDA C++ optimization skills.

References

 
