Yoryck AI
Lookahead Decoding
Last Updated 11 June, 2025
by David Spuler, Ph.D.
What is Lookahead Decoding?
Lookahead decoding is a parallel decoding method that looks ahead in the sequence to predict upcoming tokens. The idea is to "guess" or "draft" the most likely next tokens, usually several at a time, which can then be verified in parallel for a speedup. It is similar to speculative decoding in that there are both drafting and verification phases, but in lookahead decoding both are performed by the same model.
The method relies on parallel computation, making it effective for parallel GPU implementations, but less effective on low-resource platforms such as AI PCs or AI phones.
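To make the draft-and-verify loop concrete, below is a minimal, hypothetical Python sketch; it is not taken from any of the papers or libraries cited on this page. It replaces the LLM with a toy next-token function, drafts a lookahead window from a small bigram cache (standing in for the n-gram pool that lookahead decoding builds from its Jacobi trajectories), verifies the whole window in one round, and accepts the longest matching prefix. The names (toy_next_token, lookahead_decode, window, and so on) are illustrative only, and a real implementation would score all window positions in a single batched GPU forward pass rather than a Python loop.

```python
# Minimal sketch of lookahead-style decoding (illustrative, not a real library API).
# A toy next-token function stands in for the LLM; in practice the verification
# step is one batched forward pass over all lookahead positions.

def toy_next_token(prefix):
    """Stand-in for an LLM's greedy next-token function (depends only on the last token)."""
    return (prefix[-1] * 7 + 3) % 20

def autoregressive_decode(prompt, n_tokens):
    """Baseline: one model call per generated token."""
    seq = list(prompt)
    for _ in range(n_tokens):
        seq.append(toy_next_token(seq))
    return seq[len(prompt):]

def lookahead_decode(prompt, n_tokens, window=4):
    """Draft a window of tokens from a bigram cache, verify them in one round,
    accept the longest matching prefix, and learn new bigrams from the verified
    outputs (as lookahead decoding learns n-grams from its Jacobi trajectories)."""
    seq = list(prompt)
    generated = []
    cache = {}                      # last token -> predicted next token
    rounds = 0
    while len(generated) < n_tokens:
        rounds += 1
        # Draft: chain through the cache to guess the next `window` tokens.
        guesses, t = [], seq[-1]
        for _ in range(window):
            t = cache.get(t, 0)     # unknown continuation -> dummy token 0
            guesses.append(t)
        # Verify: model output at every window position, conditioned on the
        # accepted sequence plus the earlier guesses (one batched pass in practice).
        verified, context = [], list(seq)
        for g in guesses:
            verified.append(toy_next_token(context))
            context.append(g)
        # Learn bigrams from this round's trajectory.
        cache[seq[-1]] = verified[0]
        for g, v in zip(guesses, verified[1:]):
            cache[g] = v
        # Accept the longest prefix where guess == model output; the first
        # mismatching position still yields one correct token.
        accepted = []
        for g, v in zip(guesses, verified):
            accepted.append(v)
            if g != v:
                break
        seq.extend(accepted)
        generated.extend(accepted)
    return generated[:n_tokens], rounds

if __name__ == "__main__":
    prompt = [2, 5]
    base = autoregressive_decode(prompt, 12)
    fast, rounds = lookahead_decode(prompt, 12, window=4)
    assert base == fast             # lossless: identical output tokens
    print("tokens:", fast)
    print(f"{rounds} verification rounds vs 12 sequential steps")
```

In this toy run the bigram cache quickly learns the model's repeating pattern, so later rounds accept the whole window: the 12 tokens are produced in fewer verification rounds than the 12 steps of ordinary autoregressive decoding, while the output stays identical (lossless).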
Research on Lookahead Decoding
Research papers include:
- Yichao Fu, Peter Bailis, Ion Stoica, Hao Zhang, Nov 21, 2023, Break the Sequential Dependency of LLM Inference Using Lookahead Decoding, https://lmsys.org/blog/2023-11-21-lookahead-decoding/, Code: https://github.com/hao-ai-lab/LookaheadDecoding (Generates tokens in parallel by using Jacobi iteration.)
- Yao Zhao, Zhitian Xie, Chenyi Zhuang, Jinjie Gu, 4 Jan 2024 (v2), Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy, https://arxiv.org/abs/2312.12728, Code: https://github.com/alipay/PainlessInferenceAcceleration
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 22 Apr 2024, A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- Vgel, December 18, 2023, How to make LLMs go fast, https://vgel.me/posts/faster-inference/
- Shixiaowei02, Oct 2024, TensorRT-LLM 0.13.0 Release, https://github.com/NVIDIA/TensorRT-LLM/releases/tag/v0.13.0
- NVIDIA, Dec 2024, Speculative Sampling, https://nvidia.github.io/TensorRT-LLM/advanced/speculative-decoding.html
- NVIDIA, Dec 2024, Lookahead Speculative Decoding, https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/lookahead/README.md
- Yinfeng Xia, Huiyan Li, Chenyang Le, Manhong Wang, Yutao Sun, Xingyang Ma, Yanmin Qian, 4 Jun 2025, MFLA: Monotonic Finite Look-ahead Attention for Streaming Speech Recognition, https://arxiv.org/abs/2506.03722
- Yixuan Wang, Shiyu Ji, Yijun Liu, Yuzhuang Xu, Yang Xu, Qingfu Zhu, Wanxiang Che, 24 May 2025, Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query, https://arxiv.org/abs/2505.20334
More Research on Decoding Algorithms
- Decoding algorithms (overview)
— Non-autoregressive decoding
— Greedy decoding
— Top-k decoding
— Top-p decoding
— Min-P Sampling
— Flash decoding
— Beam search decoding
— Edit decoding
— Contrastive decoding
— Constrained decoding
- Parallel decoding (overview)
— Blockwise parallel decoding
— n-gram parallel decoding
— Lookahead decoding
— Medusa decoding
— Consensus decoding
- Speculative decoding (overview)
— Generalized speculative decoding
— Aggressive decoding
— Lookup decoding
— Retrieval lookup decoding
— Prompt lookup decoding
— Self speculative decoding
— Tree speculative decoding
— Superposed decoding
— Hierarchical speculative decoding
— Heuristic speculative decoding
— Multi-token speculative decoding
— Sequential speculative decoding
More AI Research
(For feedback, suggestions or corrections, please email research@yoryck.com.)