Aussie AI

Parallel Decoding

  • Last Updated 30 August, 2025
  • by David Spuler, Ph.D.

What is Parallel Decoding?

Parallel decoding is an LLM optimization method that produces two or more tokens in parallel. This is faster than the vanilla LLM "autoregressive" decoding method, which outputs only one token at a time, in a sequential manner. Parallel decoding algorithms aim to break the autoregression bottleneck in the decoder's output loop. The idea is to output as many tokens in parallel as possible, which is much faster than greedy decoding or beam search decoding, both of which are autoregressive. For more information on the basics of the sequential decoder, see non-autoregressive decoding algorithms.
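As a rough illustration of the difference, here is a minimal Python sketch contrasting the sequential autoregressive loop with a generic draft-then-verify parallel decoding loop. The model interface (next_token, draft_block, verify_block) is a hypothetical stand-in for whatever API a real inference engine exposes, so this is a sketch of the control flow only, not a working decoder.

    # Minimal sketch: "model" is a hypothetical object with next_token() for
    # one-at-a-time decoding, and draft_block()/verify_block() for a generic
    # draft-then-verify parallel decoder.

    def autoregressive_decode(model, prompt_ids, max_new_tokens):
        """Vanilla decoding: one full forward pass per emitted token."""
        tokens = list(prompt_ids)
        for _ in range(max_new_tokens):
            tokens.append(model.next_token(tokens))  # the sequential bottleneck
        return tokens

    def parallel_decode(model, prompt_ids, max_new_tokens, block_size=4):
        """Draft-then-verify: guess several tokens, keep the verified prefix."""
        tokens = list(prompt_ids)
        target = len(prompt_ids) + max_new_tokens
        while len(tokens) < target:
            draft = model.draft_block(tokens, block_size)  # cheap multi-token guess
            accepted = model.verify_block(tokens, draft)   # one parallel verification pass
            tokens.extend(accepted if accepted else draft[:1])  # always make progress
        return tokens[:target]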

Types of Parallel Decoding Optimizations

There are several types of "parallel decoding" algorithms:

  • Speculative decoding
  • Generalized speculative decoding
  • Aggressive decoding
  • Lookup decoding
  • Prompt lookup decoding
  • Lookahead decoding

Note that the methods above all compute multiple tokens in parallel for a single query within a single model. There are also various ways to parallelize decoding at a higher level by using multiple models, which is called "ensemble decoding" (e.g., big-little decoding, consensus decoding, collaborative decoding).
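To make one of the single-model methods concrete, here is a rough sketch of the drafting step in prompt lookup decoding: candidate tokens are proposed by finding the most recent n-gram of the text inside the prompt and copying the tokens that follow it, after which the model checks the copied draft in a single parallel verification pass. The function name and parameters below are illustrative assumptions, not any particular library's API.

    # Illustrative drafting step for prompt lookup decoding (assumed interface).
    def prompt_lookup_draft(prompt_ids, generated_ids, ngram_size=3, max_draft=8):
        """Propose draft tokens by matching the latest n-gram against the prompt."""
        context = list(prompt_ids) + list(generated_ids)
        key = tuple(context[-ngram_size:])
        for i in range(len(prompt_ids) - ngram_size):
            if tuple(prompt_ids[i:i + ngram_size]) == key:
                # Copy the tokens that followed this n-gram in the prompt.
                return list(prompt_ids[i + ngram_size : i + ngram_size + max_draft])
        return []  # no match: fall back to ordinary one-token decoding

The verification step is then the same draft-then-verify pattern sketched earlier: the model scores every draft position in one forward pass and keeps only the prefix it agrees with.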

Research on Parallel Decoding

Papers on parallel decoding algorithms include (see also non-autoregressive decoding algorithms):

n-gram decoding

N-gram decoding is an LLM optimization that emits a sequence of tokens at each decoding step, rather than only one token at a time. An "n-gram" decoding algorithm is one that generates more than one token (i.e., n tokens in an "n-gram") in a single step. This is usually done with parallel execution, because running it sequentially is not much of an optimization: that is how normal autoregressive decoding generates n-grams anyway.
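The key question for n-gram decoding is how many of the n proposed tokens can actually be kept. One common acceptance rule, shown in the hedged sketch below, is to keep the longest prefix of the draft that matches what the model would have chosen greedily at each position, plus the model's first correction, so each step still emits at least one token. The greedy predictions are assumed to come from a single parallel forward pass (not shown here).

    # Hedged sketch of a longest-matching-prefix acceptance rule for n-gram drafts.
    def accept_ngram(draft, greedy_predictions):
        """Keep the draft tokens that agree with the model, plus one correction."""
        accepted = []
        for drafted, predicted in zip(draft, greedy_predictions):
            if drafted == predicted:
                accepted.append(drafted)    # model agrees with the guessed token
            else:
                accepted.append(predicted)  # first disagreement: take the model's token
                break
        return accepted                     # always at least one token of progress

    # Example: draft = [12, 7, 99, 3], model greedily predicts [12, 7, 42, 3]
    # -> accepted = [12, 7, 42], i.e. three tokens from one verification pass.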

Research on n-gram generation:

Blockwise Parallel Decoding

Blockwise parallel decoding is an LLM inference optimization whereby the decoder outputs multiple tokens at once, by processing in blocks. It is a type of parallel decoding that improves efficiency beyond the vanilla LLM decoder, which emits one token at a time in an autoregressive, sequential manner.
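The benefit of blockwise decoding depends on how many tokens of each block survive verification. A back-of-the-envelope estimate, using assumed numbers rather than benchmarks, is sketched below: if a verification pass costs roughly the same as one ordinary decoding step and an average of 3 tokens per block are accepted, then generating 300 tokens needs roughly 100 sequential model calls instead of 300.

    # Rough speedup estimate for blockwise decoding (all numbers are assumptions).
    tokens_to_generate = 300
    avg_accepted_per_block = 3.0   # assume ~3 tokens of each block pass verification
    verify_cost_in_steps = 1.0     # assume one verify pass costs ~1 ordinary decode step
    sequential_calls = tokens_to_generate / avg_accepted_per_block * verify_cost_in_steps
    print(sequential_calls)        # 100.0 model calls instead of 300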

Research on blockwise parallel decoding:

  • Stern, Mitchell Thomas, 2020, Structured Neural Models and Structured Decoding for Natural Language Processing, Ph.D. Thesis, Computer Science, University of California, Berkeley, https://escholarship.org/uc/item/4m2211b5 https://escholarship.org/content/qt4m2211b5/qt4m2211b5.pdf
  • Chen Zhang, Zhuorui Liu, Dawei Song, 23 Apr 2024, Beyond the Speculative Game: A Survey of Speculative Execution in Large Language Models, Beijing Institute of Technology, China, https://arxiv.org/abs/2404.14897 (Strong survey specific to speculative decoding and other draft-then-verify optimization techniques.)
  • Taehyeon Kim, Ananda Theertha Suresh, Kishore Papineni, Michael Riley, Sanjiv Kumar, Adrian Benton, 14 Apr 2024, Towards Fast Inference: Exploring and Improving Blockwise Parallel Drafts, Google Research, https://arxiv.org/abs/2404.09221 (Improving blockwise parallel decoding via top-k decoding and generation of predicted n-grams.)
  • Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
  • Mahsa Khoshnoodi, Vinija Jain, Mingye Gao, Malavika Srikanth, Aman Chadha, 24 May 2024 (v2), A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models, https://arxiv.org/abs/2405.13019
  • Taehyeon Kim, Ananda Theertha Suresh, Kishore Papineni, Michael Riley, Sanjiv Kumar, Adrian Benton, 2024, Exploring and Improving Drafts in Blockwise Parallel Decoding, https://openreview.net/pdf?id=KtnUTS1f91

Lookahead Decoding

Lookahead decoding is a type of parallel decoding where the algorithm attempts to "look ahead" at more than one future token in parallel. This allows the output of two or more tokens at a time, which is more efficient than the standard autoregressive decoding algorithms.
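Lookahead decoding is often described in terms of Jacobi-style iteration: a window of future positions is guessed, all positions in the window are refined in parallel from the current guesses, and n-grams that stop changing become candidates for verification. The sketch below shows only that refinement loop, with a hypothetical refine_window() call standing in for the parallel forward pass.

    # Simplified Jacobi-style refinement loop for lookahead decoding (assumed API).
    def lookahead_step(model, context, window, max_iters=4):
        """Iteratively refine a window of guessed future tokens in parallel."""
        for _ in range(max_iters):
            # Hypothetical call: one parallel pass re-predicts every window position.
            new_window = model.refine_window(context, window)
            if new_window == window:   # fixed point reached: guesses stopped changing
                break
            window = new_window
        return window                  # candidate n-gram to verify and partially accept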

MEDUSA Decoding

Medusa decoding is an LLM optimization that uses multiple decoding heads to emit tokens in parallel. This avoids the sequential bottleneck of emitting one token at a time, as done by the standard autoregressive LLM decoding algorithms.
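The distinguishing idea in Medusa is architectural: several small extra heads sit on top of the backbone model's final hidden state, with each extra head predicting a token one position further into the future, and the resulting draft is then checked with the usual parallel verification pass. Below is a hedged PyTorch sketch of the drafting step, using made-up class and method names rather than the real Medusa code (which also uses richer head structures and tree-based verification).

    # Hedged sketch of Medusa-style multi-head drafting (illustrative names only).
    import torch
    import torch.nn as nn

    class MedusaHeads(nn.Module):
        """K small heads over one hidden state, each guessing one position further ahead."""
        def __init__(self, hidden_size, vocab_size, num_heads=4):
            super().__init__()
            self.heads = nn.ModuleList(
                nn.Linear(hidden_size, vocab_size) for _ in range(num_heads)
            )

        def draft(self, last_hidden):  # last_hidden: tensor of shape [hidden_size]
            """Return one greedy draft token per head, all from the same hidden state."""
            return [int(torch.argmax(head(last_hidden))) for head in self.heads]

    # Usage sketch: draft = heads.draft(h_t) proposes the next few tokens, which the
    # backbone model then verifies in a single parallel forward pass.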

More Research on Decoding Algorithms

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++ programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging

More AI Research

Read more about: