Aussie AI Blog

What is Prefill?

  • April 18th, 2026
  • by David Spuler, Ph.D.

Prefill is one of the first steps in LLM inference and training computations. The prefill phase operates on the prompt tokens, and is also known as "prompt processing" or "input processing". The phase after prefill is called the "decoding phase", and it generates the output tokens one at a time. One way to think about this distinction:

  • Prefill — reads the input prompt.
  • Decoding — outputs the answers.

Prefill is like thinking about the question, and decoding is like endless speaking. Your LLM does both, whereas most people seem to do mostly one or the other.
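
The two-phase control flow can be sketched in a few lines of Python. This is only an illustration: `toy_forward`, `generate`, and `VOCAB_SIZE` are hypothetical names, and the "model" is a stand-in arithmetic function rather than a real Transformer. The point is the shape of the loop: one call over the whole prompt, then one call per output token.

```python
from typing import List

VOCAB_SIZE = 50  # hypothetical toy vocabulary size


def toy_forward(new_tokens: List[int], kv_cache: List[int]) -> int:
    """Stand-in for a full forward pass (NOT a real Transformer).

    Appends one cache entry per new token, then returns a next-token id
    that depends on everything seen so far.
    """
    kv_cache.extend(new_tokens)
    return (sum(kv_cache) * 31 + len(kv_cache)) % VOCAB_SIZE


def generate(prompt: List[int], max_new_tokens: int) -> List[int]:
    kv_cache: List[int] = []
    # Prefill: ONE call over all prompt tokens; it yields at most one token.
    next_token = toy_forward(prompt, kv_cache)
    output = [next_token]
    # Decoding: one call per output token, each fed the previous output token.
    while len(output) < max_new_tokens:
        next_token = toy_forward([next_token], kv_cache)
        output.append(next_token)
    return output


tokens = generate(prompt=[3, 14, 15, 9], max_new_tokens=5)
print(tokens)
```

In this sketch the first output token is attributed to the prefill call, which is one of the two conventions for where that token belongs.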

Prefill is like encoding. The prefill phase is important in the decoder-only Transformer architectures that are the mainstay of modern LLMs. The original vanilla Transformers (in 2017) were encoder-decoder architectures, and the change to more efficient decoder-only layouts was possible because the prefill steps take over the "encoding" of the input in a decoder-only version.

Prefill and decode are similar. Both phases do a "forward pass" of the entire LLM decoder stack of layers. The prefill and decode modes do almost the same set of computations:

  • Same overall Transformer algorithms (e.g., activation functions, attention algorithms, etc.)
  • Same number of model layers.
  • Same LLM weights everywhere.
  • Attention modules (same sizes and weights)
  • KV cache data is saved during attention (for both).
  • FFN modules (same sizes and weights; same FFN vs GLU choices).
  • Both operate on each token (full forward pass).

Prefill and decode are different. Some of the distinctions between prefill and decoding include:

  • Input: the input to prefill is the embedding vector of every prompt token at once, whereas each step of decoding takes the embedding vector of the single token that was just generated, and sees earlier tokens only through the KV cache (the per-token input vectors are the same size in both phases).
  • Output: prefill does not output tokens (except at most one) but prefill does output a "hidden state" (activations) and its attention modules store KV data, whereas decoding outputs many tokens (and also tracks a hidden state and stores KV data).
  • Causal masking is required to prevent prefill from looking ahead at future tokens, whereas decoding doesn't need this extra step (because there simply aren't any future tokens).
  • Prefill is typically compute-bound, whereas decoding is typically memory-bound (limited by the bandwidth needed to read the weights and KV cache data for each output token).
  • Prefill is parallelization-friendly over all input prompt tokens; decoding is inherently sequential per output token ("autoregressive"), although decoding is usually still parallelized across "batches" of queries from multiple users.
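
The compute-bound versus memory-bound distinction can be made concrete with a rough arithmetic-intensity estimate. The sketch below is a back-of-envelope model, not a measurement: it assumes 16-bit weights (2 bytes each), counts only the weight matrix multiplications, and ignores activation and KV cache traffic. The function name `weight_arithmetic_intensity` is made up for this example.

```python
def weight_arithmetic_intensity(tokens_per_pass: int,
                                bytes_per_weight: float = 2.0) -> float:
    """FLOPs per byte of weights read, for the weight matrix multiplies only.

    Each weight contributes one multiply and one add (2 FLOPs) per token,
    but is read from memory once per pass, however many tokens share it.
    """
    return 2.0 * tokens_per_pass / bytes_per_weight


prefill = weight_arithmetic_intensity(tokens_per_pass=2048)  # whole prompt at once
decode = weight_arithmetic_intensity(tokens_per_pass=1)      # one token per step
batched = weight_arithmetic_intensity(tokens_per_pass=32)    # 32 users' decode steps
print(prefill, decode, batched)  # 2048.0 1.0 32.0
```

Under these assumptions, a 2048-token prefill does about 2048 FLOPs per weight byte, while a single-user decode step does about 1, which is why batching across users matters so much for decoding.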

Parallel prefill computation. How can prefill operate in parallel if all the prior tokens haven't finished? Prefill is parallelized along the sequence dimension, with every token processed in parallel, so no token is finished until the very end (they all finish at the same time). Firstly, the input to each token's first layer is not dependent on the prior token's computation, but is the fixed embedding vector for that token (possibly with positional encoding). Secondly, a token's attention module in a given layer uses keys and values computed from the prior layer's output (not from the current layer's output, which isn't completed yet). Thirdly, the input to each layer (except the first) is the result of the prior layer's computation, which has already been completed for every token. Hence, the entire stack can do each token in parallel, but the different layers and sub-layer stages have to be done in lock-step. You can't parallelize different steps up the stack, but can only do the same layer's step across the sequence of tokens.
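
A small pure-Python sketch may help show why this works for a single attention layer. It assumes the queries, keys, and values for every position are already available (as they are once the prior layer has finished), and checks that computing all positions "at once" with a causal restriction gives the same result as walking through the tokens one by one with a growing KV cache. All function names here are illustrative, and the list comprehension stands in for what a GPU would do in parallel.

```python
import math
from typing import List

Vector = List[float]


def attend(q: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Softmax attention of one query over the given keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def prefill_attention(qs, ks, vs):
    """All positions at once: position i sees keys/values 0..i (causal mask).

    No iteration reads another iteration's output, so they could all
    run in parallel.
    """
    return [attend(qs[i], ks[: i + 1], vs[: i + 1]) for i in range(len(qs))]


def sequential_attention(qs, ks, vs):
    """One position at a time, growing a KV cache, as decoding does."""
    k_cache, v_cache, outputs = [], [], []
    for q, k, v in zip(qs, ks, vs):
        k_cache.append(k)
        v_cache.append(v)
        outputs.append(attend(q, k_cache, v_cache))
    return outputs


qs = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
ks = [[0.2, 0.1], [1.0, -1.0], [0.3, 0.7]]
vs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
parallel = prefill_attention(qs, ks, vs)
sequential = sequential_attention(qs, ks, vs)
print(parallel == sequential)
```

The slice `ks[: i + 1]` plays the role of the causal mask: it is the only thing that stops position i from seeing later positions.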

Main prefill optimizations. The prefill has some specific optimization techniques that are well-known:

  • Parallelization per token — the entire computation is a compute-heavy run through the LLM stack, with each token's processing done in parallel.
  • Chunked prefill — the input prompt is split up into fixed-size segments of tokens, for a more orderly prefill computation phase (except the final segment may not be the same size, but zero padding can be used to force this, too).
  • Triangular matrix optimizations for the causal mask — rather than materializing a naive triangular mask matrix and applying it as an extra matrix operation, the masking is usually fused into the attention kernels.
  • Tile skipping — fusing triangular matrix masking into tiled matrix multiplications can involve: (a) skipping tiles (if all zero), and (b) using separate specialized tile mini-kernels for sparse tiles (on the diagonal with some zero and some non-zero), and dense tiles (no zeros). For example, Flash Attention is a tiled matrix multiplication method and can use this tile-skipping optimization in its logic.
  • Prefill-decoding disaggregation — the prefill and decode phases are so different (compute-bound vs memory-bound) that they run different code, and often run on different GPUs or wholly separate servers.
  • Different kernels for prefill vs decoding — not only can they be disaggregated to different GPUs, it's typical to run different kernels on prefill vs decoding (e.g., Flash Attention is great for prefill, but other methods are better on decoding, although Flash Attention 3 does better on decoding as well).
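
Two of the items above, chunked prefill and tile skipping, are mostly index bookkeeping, which a short sketch can illustrate. The helper names `chunk_prompt` and `classify_tile` and the padding id `PAD` are assumptions for this example, not part of any particular inference engine, and a real kernel would do this logic on the GPU rather than in Python.

```python
from typing import List

PAD = 0  # hypothetical padding token id; padded positions must be masked later


def chunk_prompt(tokens: List[int], chunk_size: int) -> List[List[int]]:
    """Split a prompt into fixed-size chunks, zero-padding the last one."""
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    if chunks and len(chunks[-1]) < chunk_size:
        chunks[-1] = chunks[-1] + [PAD] * (chunk_size - len(chunks[-1]))
    return chunks


def classify_tile(row_start: int, col_start: int, tile: int) -> str:
    """Classify one square tile of the causal mask (query rows x key columns).

    Query i may attend to key j only when j <= i.
    """
    row_end = row_start + tile - 1
    col_end = col_start + tile - 1
    if col_start > row_end:
        return "skip"      # every key is in the future of every query
    if col_end <= row_start:
        return "dense"     # every key is visible to every query
    return "diagonal"      # mixed tile: needs per-element masking


chunks = chunk_prompt(list(range(1, 11)), chunk_size=4)
grid = [[classify_tile(r, c, tile=4) for c in (0, 4)] for r in (0, 4)]
print(chunks)  # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 0, 0]]
print(grid)    # [['diagonal', 'skip'], ['dense', 'diagonal']]
```

For an 8-token sequence with 4-wide tiles, one of the four tiles is skipped entirely and only the two diagonal tiles need element-level masking.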

Prefill outputs at most one token. The prefill phase can output the very first token of the answer, or it can send its data off for decoding to output the first token. I've read various descriptions of prefill, where some say it outputs the first token, and some say it doesn't. The hidden state or output vector computed by prefill for the very last token of the prompt can be used to output the very first token of the output. This final-token hidden state can be sent to the unembedding module to create the "logits" (which softmax turns into "probabilities"), and then the decoding algorithm can choose a token to output from this set of probabilities (e.g., greedy decoding, top-k decoding, whatever you like). Sometimes, this first token output is considered to be part of the decoding phase, and the prefill phase sends its final output hidden state for that purpose. Hence, prefill is said to output at most one token.
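
The last step of that pipeline, from the final-token hidden state to a chosen token, can be sketched as follows. The tiny hidden vector and 4-entry unembedding matrix are made-up numbers for illustration, and the helper names are hypothetical; a real model would use a hidden size in the thousands and a vocabulary of tens of thousands of entries.

```python
import math
from typing import List


def unembed(hidden: List[float], unembedding: List[List[float]]) -> List[float]:
    """Logits: one dot product per vocabulary entry."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in unembedding]


def softmax(logits: List[float]) -> List[float]:
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(probs: List[float]) -> int:
    """Greedy decoding: pick the most probable token id."""
    return max(range(len(probs)), key=probs.__getitem__)


def top_k_candidates(probs: List[float], k: int) -> List[int]:
    """The k most probable token ids, from which top-k decoding would sample."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


final_hidden = [1.0, -2.0]          # hidden state of the LAST prompt token
unembedding = [[0.5, 0.0],          # one row per vocabulary entry
               [1.0, -1.0],
               [-1.0, 0.5],
               [0.0, 0.2]]
probs = softmax(unembed(final_hidden, unembedding))
first_token = greedy(probs)
candidates = top_k_candidates(probs, k=2)
print(first_token, candidates)  # 1 [1, 0]
```

Whether this code runs at the end of the prefill phase or at the start of the decoding phase is exactly the bookkeeping question described above.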

Wacky prefill optimizations. Here are some weird and wonderful facts about optimizing the prefill phase.

  • Final-layer prefill FFN skipping — the final-layer hidden state output by prefill is only used for the very last token of the prompt (to emit the first output token), and the final-layer hidden state of the first N-1 prompt tokens is discarded. Hence, the final layer's FFN can be skipped for those N-1 tokens. The attention module cannot be skipped in the final layer, because it creates the KV cache, which is needed by the attention modules for the future output tokens in the decoding phase. Running the last layer's FFN just for one token is rather inefficient on GPU architectures, but many chunks of prefill don't include the final token. Another approach is to defer that prefill FFN step for the final token to a separate decoding phase module, but this requires sending the hidden state vector to a separate GPU that's handling the decoding phase (assuming disaggregation), and starting that first token's processing in the middle of the last layer.
  • First-layer prefill precomputation (with positional encoding awareness) — the input to the first layer of the prefill phase is the embedding vector for each token, which is a known vector (it's a row in the embedding matrix). The parts of the first layer that come before attention mixes the tokens together (e.g., the normalization and the Q, K, and V projections) depend only on that token's own embedding, not on the other tokens. Since we know the input vector for every token in the vocabulary, we could precompute those per-token results, but for the first layer only (the attention output itself, and all of the second and subsequent layers, depend on the other tokens in the prompt). Note that positional encoding methods are problematic for this tricky optimization, whether it's done in the embedding vectors (e.g., classic sinusoidal PE) or inside the attention module (e.g., RoPE, YaRN). The precomputation trick has to adjust for that, or else use NoPE!
  • First-token prefill precomputation — since the input to the first token is the embedding vector, and there's no attention over other tokens (the first token cannot look ahead, and has nothing behind it to look back at), the entire computation for the first token depends only on which token it is, so it can be precomputed.
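
One way to see why the first-token case is so simple: attention over a single position has a softmax weight of exactly 1, so the attention output is just that token's own value vector, whatever the query and key happen to be. The sketch below checks this numerically with made-up vectors; `first_token_attention` is a hypothetical helper name, and positional encoding is ignored.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def first_token_attention(q: List[float], k: List[float],
                          v: List[float]) -> List[float]:
    """Attention for position 0: the only visible key/value is its own."""
    score = sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))
    weights = softmax([score])  # softmax over one score is always [1.0]
    return [weights[0] * x for x in v]


v = [0.25, -1.5, 3.0]
out_a = first_token_attention([1.0, 2.0, 3.0], [0.5, 0.5, 0.5], v)
out_b = first_token_attention([-4.0, 0.0, 9.0], [2.0, -1.0, 0.1], v)
print(out_a == v, out_b == v)  # True True
```

Because nothing in this computation depends on any other token, the result for each possible first token could be stored in a lookup table, at the memory cost mentioned below.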

So, geometrically, there are optimizations on prefill on the bottom (first layer precomputation), on the top (last layer FFN skipping), and the left (first-token precomputation). The precomputation optimizations save compute, but cost more memory, so your mileage may vary.

Prefix KV caching. The prefill computation for any previously seen prefix of the prompt tokens can be avoided completely by loading a previously-stored KV cache. This may avoid using prefill completely in some cases (e.g., a user's query is identical to one seen before), and will even more often work partially (e.g., the system prompt is fixed and its KV data precomputed, but the user's query hasn't been seen before). Note that the decoding phase needs all of the KV cache data: both the data loaded by prefix KV caching and any newer KV cache data computed by prefill.
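
A minimal sketch of the lookup side of prefix KV caching is shown below. It assumes a plain dictionary keyed by token tuples and uses strings as placeholders for the real per-token KV tensors; `longest_cached_prefix` and `KVData` are hypothetical names, and a production server would likely index prefixes far more efficiently than this linear scan.

```python
from typing import Dict, List, Tuple

KVData = List[str]  # placeholder for real per-token KV tensors


def longest_cached_prefix(
    prompt: List[int], cache: Dict[Tuple[int, ...], KVData]
) -> Tuple[KVData, List[int]]:
    """Return (reused KV data, prompt tokens that still need prefill)."""
    for length in range(len(prompt), 0, -1):  # try the longest prefix first
        key = tuple(prompt[:length])
        if key in cache:
            return cache[key], prompt[length:]
    return [], list(prompt)  # cache miss: prefill the whole prompt


system_prompt = [101, 7, 8, 9]
cache = {tuple(system_prompt): ["kv101", "kv7", "kv8", "kv9"]}

reused, remaining = longest_cached_prefix(system_prompt + [42, 43], cache)
print(len(reused), remaining)  # 4 [42, 43]
```

Here the fixed system prompt is served from the cache and only the two new user tokens go through prefill; decoding then needs the reused entries plus the two newly computed ones.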

Speculative decoding is like prefill. Speculative decoding is a "draft-then-verify" optimization for the decoding phase, where a cheap model tries to "draft" (guess or predict) a few tokens ahead, and then the main larger model tries to "verify" them in parallel. This is very similar to drafting a sequence, and then running prefill on the candidate tokens. Hence, the reason why speculative decoding is an optimization is much the same as for prefill: the parallelization made possible by verifying multiple candidate tokens in a sequence all at once. This verification can be run in parallel, up the stack in lock-step, in a way similar to doing prompt tokens in parallel in the prefill phase. It's the same kind of vertical parallelization across multiple tokens.
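
The verify step can be sketched for the simplest case, greedy decoding, where a drafted token is accepted only if it matches the large model's own top choice. This is an illustrative simplification (sampling-based speculative decoding uses a probabilistic acceptance rule instead), and `verify_draft` and `toy_target` are hypothetical names. The key observation is that all of the target-model predictions are independent of one another, so a real system can compute them in one prefill-style pass.

```python
from typing import Callable, List


def verify_draft(context: List[int], draft: List[int],
                 target_next: Callable[[List[int]], int]) -> List[int]:
    """Greedy draft-then-verify: keep drafted tokens while the target agrees.

    target_next(prefix) is the large model's greedy choice after `prefix`.
    The len(draft) + 1 predictions do not depend on each other, which is
    what lets a real system evaluate them in a single parallel pass.
    """
    predictions = [target_next(context + draft[:i]) for i in range(len(draft) + 1)]
    accepted: List[int] = []
    for drafted, predicted in zip(draft, predictions):
        if drafted != predicted:
            accepted.append(predicted)  # replace the first wrong guess and stop
            return accepted
        accepted.append(drafted)
    accepted.append(predictions[-1])    # every guess was right: bonus token
    return accepted


def toy_target(prefix: List[int]) -> int:
    """Toy 'large model': always predicts the last token plus one, mod 10."""
    return (prefix[-1] + 1) % 10


partly_right = verify_draft([1, 2, 3], [4, 5, 9, 0], toy_target)
all_right = verify_draft([1, 2, 3], [4, 5], toy_target)
print(partly_right, all_right)  # [4, 5, 6] [4, 5, 6]
```

In the first call, two drafted tokens are accepted and the third is corrected, so one verification pass yields three output tokens instead of one.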
