Aussie AI Blog

Layerwise Pipelined Overlapping of Prefill and Decode

  • 15th April, 2026
  • by David Spuler, Ph.D.

Here's an interesting idea for optimizing the prefill and decode phases of an LLM. Separate optimization of prefill and decoding has been getting a lot of research attention lately, especially "prefill-decode disaggregation" (or PD-disaggregation for short), which runs the prefill and decode logic separately on different groups of GPUs.

What about running prefill and decode on the same GPU, but with pipelining? The idea rests on the fact that the attention computation in a given layer depends only on that layer's KV cache entries, which are computed from the previous layer's outputs. This is true for prefill, and is why we can fully parallelize prefill across all tokens at once. It's also true for attention in the decoding phase when computing the next output token.
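To see the per-layer data flow, here is a minimal sketch of prefill over a whole prompt, using toy scalar stand-ins for the real matrix math (all names here, such as `toy_layer` and `prefill`, are hypothetical, not any real API). The key point is that the KV cache for layer k is fully populated before layer k+1 runs:

```python
# Toy model: each "layer" is a scalar transform so the data flow is easy to see.
NUM_LAYERS = 4

def toy_layer(layer_idx, hidden):
    """Process all token positions of one layer at once (prefill parallelism).

    Returns the next hidden states plus the (K, V) cache entries for this
    layer, which depend only on this layer's *input* -- i.e. the output of
    the previous layer.
    """
    k_cache = [h * 2 for h in hidden]        # toy stand-in for K projections
    v_cache = [h + 1 for h in hidden]        # toy stand-in for V projections
    next_hidden = [h + k + v for h, k, v in zip(hidden, k_cache, v_cache)]
    return next_hidden, (k_cache, v_cache)

def prefill(token_embeddings):
    """Run all layers over the whole prompt; collect per-layer KV caches."""
    hidden = token_embeddings
    kv_caches = []
    for layer_idx in range(NUM_LAYERS):
        hidden, kv = toy_layer(layer_idx, hidden)
        kv_caches.append(kv)   # layer k's cache is complete before layer k+1 runs
    return hidden, kv_caches

final_hidden, kv_caches = prefill([1.0, 2.0, 3.0])
print(len(kv_caches))  # one KV cache per layer
```

Because each layer's cache is finished before the next layer starts, it is tempting to let a decode step begin consuming the early layers' caches while prefill is still working on the later layers.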

Hence, why can't we overlap the decoding phase with the prefill phase, as a pipelining optimization, by simply running the decoder stack one layer behind? A layer of decoding could start as soon as prefill has computed the KV cache it needs. This would overlap the computation of the early layers of decode with the later layers of prefill.
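The hoped-for schedule can be sketched as a simple table of pipeline steps, where decode trails prefill by exactly one layer (a hypothetical scheduler, just to visualize the overlap):

```python
NUM_LAYERS = 4

def pipelined_schedule(num_layers):
    """Yield (prefill_layer, decode_layer) pairs for each pipeline step.

    None means that stage is idle at this step: decode starts one step
    after prefill, and finishes one step after prefill ends.
    """
    steps = []
    for t in range(num_layers + 1):
        prefill_layer = t if t < num_layers else None
        decode_layer = t - 1 if t >= 1 else None
        steps.append((prefill_layer, decode_layer))
    return steps

for p, d in pipelined_schedule(NUM_LAYERS):
    print(f"prefill layer: {p}, decode layer: {d}")
```

If this schedule were legal, the first decode step would cost only one extra layer of latency beyond prefill, instead of a full extra pass through the stack.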

Alas, it doesn't work.

The problem is that the very first layer of the decoding phase requires an input, and that input depends on the final layer of prefill. To get started, the decoder stack needs an embedding vector for the prior token. And the "prior token" is the single token that's output at the end of the prefill phase. We just don't know which token it's going to be. Regardless of whether the final prefill layer gives us a hidden state (to compute logits), or the logits, or the token ID (from sampling those logits), these are all equivalent forms of the same dependency. The next run of the decoder stack cannot start anything without this input from prefill.
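The blocking dependency chain can be spelled out in a few lines of toy code (again with hypothetical stand-in functions, not a real model): hidden state to logits to sampled token ID to embedding, and every link in the chain sits at the end of prefill:

```python
def final_layer_logits(last_hidden):
    """Toy unembedding: map the final hidden state to a 3-token vocabulary."""
    return [last_hidden * w for w in (0.1, 0.5, 0.4)]

def sample_token(logits):
    """Greedy sampling: the argmax token ID."""
    return max(range(len(logits)), key=lambda i: logits[i])

def decode_first_layer_input(prefill_final_hidden, embedding_table):
    # hidden state -> logits -> token ID -> embedding: this whole chain
    # sits at the END of prefill, so decode layer 0 cannot start earlier.
    token_id = sample_token(final_layer_logits(prefill_final_hidden))
    return embedding_table[token_id]

embeddings = [10.0, 20.0, 30.0]   # toy embedding table, one float per token ID
x0 = decode_first_layer_input(2.0, embeddings)
print(x0)
```

No reordering of these steps helps: the embedding fed into decode layer 0 is a function of the prefill stack's very last output, so the one-layer-behind schedule stalls before its first step.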

Can this idea be saved? Could the decoder stack start with an approximate embedding input, and then adjust things later at the end? I don't really see how.

Well, we can try to speculate on what the output token will be, based on predictions from the early layers during prefill, and see if we can start decoding early. But that's not a new optimization: we've just re-invented something similar to "self-speculative decoding," where the early layers of a model act as its own draft model. And that method is superior because it works generally for all decoding steps, not just the first decode iteration after prefill.
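To make the analogy concrete, here is a toy sketch of the self-speculative pattern, with a scalar stand-in for the layer computation and a crude stand-in for sampling (all names hypothetical): draft a token from the first few layers, then verify it against the full stack.

```python
NUM_LAYERS = 4
DRAFT_LAYERS = 2   # the early layers double as the "draft model"

def run_layers(hidden, num_layers):
    """Toy stand-in for running part of the transformer stack."""
    for _ in range(num_layers):
        hidden = hidden * 2 + 1
    return hidden

def predict_token(hidden):
    """Toy stand-in for sampling a token ID from a hidden state."""
    return int(hidden) % 3

def self_speculate(x):
    draft = predict_token(run_layers(x, DRAFT_LAYERS))   # cheap early guess
    full = predict_token(run_layers(x, NUM_LAYERS))      # full verification
    accepted = (draft == full)
    return draft, full, accepted

draft, full, accepted = self_speculate(1.0)
print(draft, full, accepted)
```

When the draft is accepted, the work spent on the early-layer guess is saved; when it is rejected, the full pass has already produced the correct token, so nothing is lost except the draft computation.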

More AI Research Topics

Read more about:

Aussie AI Advanced C++ Coding Books



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging



C++ AVX Optimization: CPU SIMD Vectorization:
  • Introduction to AVX SIMD intrinsics
  • Vectorization and horizontal reductions
  • Low latency tricks and branchless programming
  • Instruction-level parallelism and out-of-order execution
  • Loop unrolling & double loop unrolling

Get your copy from Amazon: C++ AVX Optimization: CPU SIMD Vectorization



C++ Ultra-Low Latency: Multithreading and Low-Level Optimizations:
  • Low-level C++ efficiency techniques
  • C++ multithreading optimizations
  • AI LLM inference backend speedups
  • Low latency data structures
  • Multithreading optimizations
  • General C++ optimizations

Get your copy from Amazon: C++ Ultra-Low Latency



Advanced C++ Memory Techniques: Efficiency & Safety:
  • Memory optimization techniques
  • Memory-efficient data structures
  • DIY memory safety techniques
  • Intercepting memory primitives
  • Preventive memory safety
  • Memory reduction optimizations

Get your copy from Amazon: Advanced C++ Memory Techniques



Safe C++: Fixing Memory Safety Issues:
  • The memory safety debate
  • Memory and non-memory safety
  • Pragmatic approach to safe C++
  • Rust versus C++
  • DIY memory safety methods
  • Safe standard C++ library

Get it from Amazon: Safe C++: Fixing Memory Safety Issues



Efficient C++ Multithreading: Modern Concurrency Optimization:
  • Multithreading optimization techniques
  • Reduce synchronization overhead
  • Standard container multithreading
  • Multithreaded data structures
  • Memory access optimizations
  • Sequential code optimizations

Get your copy from Amazon: Efficient C++ Multithreading



Efficient Modern C++ Data Structures:
  • Data structures overview
  • Modern C++ container efficiency
  • Time & space optimizations
  • Contiguous data structures
  • Multidimensional data structures

Get your copy from Amazon: Efficient C++ Data Structures



Low Latency C++: Multithreading and Hotpath Optimizations: advanced coding book:
  • Low Latency for AI and other applications
  • C++ multithreading optimizations
  • Efficient C++ coding
  • Time and space efficiency
  • C++ slug catalog

Get your copy from Amazon: Low Latency C++