Yoryck AI
Lookahead Decoding
Last Updated 11 June, 2025
by David Spuler, Ph.D.
What is Lookahead Decoding?
Lookahead decoding is a parallel decoding method that looks ahead in the sequence to predict upcoming tokens. The idea is to "guess" or "draft" the most likely next tokens, usually several at a time, which can then be verified in parallel for a speedup. It is similar to speculative decoding in that there are both drafting and verification phases, but in lookahead decoding both are performed by the same model.
The method relies on parallel computation, making it effective for parallel GPU implementations, but less effective on low-resource platforms such as AI PCs or AI phones.
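To make the draft-and-verify loop concrete, below is a minimal, hypothetical Python sketch; it is not taken from any of the papers or libraries cited on this page. It replaces the LLM with a toy next-token function, drafts a lookahead window from a small bigram cache (standing in for the n-gram pool that lookahead decoding builds from its Jacobi trajectories), verifies the whole window in one round, and accepts the longest matching prefix. The names (toy_next_token, lookahead_decode, window, and so on) are illustrative only, and a real implementation would score all window positions in a single batched GPU forward pass rather than a Python loop.

```python
# Minimal sketch of lookahead-style decoding (illustrative, not a real library API).
# A toy next-token function stands in for the LLM; in practice the verification
# step is one batched forward pass over all lookahead positions.

def toy_next_token(prefix):
    """Stand-in for an LLM's greedy next-token function (depends only on the last token)."""
    return (prefix[-1] * 7 + 3) % 20

def autoregressive_decode(prompt, n_tokens):
    """Baseline: one model call per generated token."""
    seq = list(prompt)
    for _ in range(n_tokens):
        seq.append(toy_next_token(seq))
    return seq[len(prompt):]

def lookahead_decode(prompt, n_tokens, window=4):
    """Draft a window of tokens from a bigram cache, verify them in one round,
    accept the longest matching prefix, and learn new bigrams from the verified
    outputs (as lookahead decoding learns n-grams from its Jacobi trajectories)."""
    seq = list(prompt)
    generated = []
    cache = {}                      # last token -> predicted next token
    rounds = 0
    while len(generated) < n_tokens:
        rounds += 1
        # Draft: chain through the cache to guess the next `window` tokens.
        guesses, t = [], seq[-1]
        for _ in range(window):
            t = cache.get(t, 0)     # unknown continuation -> dummy token 0
            guesses.append(t)
        # Verify: model output at every window position, conditioned on the
        # accepted sequence plus the earlier guesses (one batched pass in practice).
        verified, context = [], list(seq)
        for g in guesses:
            verified.append(toy_next_token(context))
            context.append(g)
        # Learn bigrams from this round's trajectory.
        cache[seq[-1]] = verified[0]
        for g, v in zip(guesses, verified[1:]):
            cache[g] = v
        # Accept the longest prefix where guess == model output; the first
        # mismatching position still yields one correct token.
        accepted = []
        for g, v in zip(guesses, verified):
            accepted.append(v)
            if g != v:
                break
        seq.extend(accepted)
        generated.extend(accepted)
    return generated[:n_tokens], rounds

if __name__ == "__main__":
    prompt = [2, 5]
    base = autoregressive_decode(prompt, 12)
    fast, rounds = lookahead_decode(prompt, 12, window=4)
    assert base == fast             # lossless: identical output tokens
    print("tokens:", fast)
    print(f"{rounds} verification rounds vs 12 sequential steps")
```

In this toy run the bigram cache quickly learns the model's repeating pattern, so later rounds accept the whole window: the 12 tokens are produced in fewer verification rounds than the 12 steps of ordinary autoregressive decoding, while the output stays identical (lossless).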
Research on Lookahead Decoding
Research papers include:
- Yichao Fu, Peter Bailis, Ion Stoica, Hao Zhang, Nov 21, 2023, Break the Sequential Dependency of LLM Inference Using Lookahead Decoding, https://lmsys.org/blog/2023-11-21-lookahead-decoding/, Code: https://github.com/hao-ai-lab/LookaheadDecoding (Generates tokens in parallel by using Jacobi iteration.)
- Yao Zhao, Zhitian Xie, Chenyi Zhuang, Jinjie Gu, 4 Jan 2024 (v2), Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy, https://arxiv.org/abs/2312.12728, Code: https://github.com/alipay/PainlessInferenceAcceleration
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 22 Apr 2024, A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- Vgel, December 18, 2023, How to make LLMs go fast, https://vgel.me/posts/faster-inference/
- Shixiaowei02, Oct 2024, TensorRT-LLM 0.13.0 Release, https://github.com/NVIDIA/TensorRT-LLM/releases/tag/v0.13.0
- NVIDIA, Dec 2024, Speculative Sampling, https://nvidia.github.io/TensorRT-LLM/advanced/speculative-decoding.html
- NVIDIA, Dec 2024, Lookahead Speculative Decoding, https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/lookahead/README.md
- Yinfeng Xia, Huiyan Li, Chenyang Le, Manhong Wang, Yutao Sun, Xingyang Ma, Yanmin Qian, 4 Jun 2025, MFLA: Monotonic Finite Look-ahead Attention for Streaming Speech Recognition, https://arxiv.org/abs/2506.03722
- Yixuan Wang, Shiyu Ji, Yijun Liu, Yuzhuang Xu, Yang Xu, Qingfu Zhu, Wanxiang Che, 24 May 2025, Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query, https://arxiv.org/abs/2505.20334
More Research on Decoding Algorithms
- Decoding algorithms (overview)
— Non-autoregressive decoding
— Greedy decoding
— Top-k decoding
— Top-p decoding
— Min-P Sampling
— Flash decoding
— Beam search decoding
— Edit decoding
— Contrastive decoding
— Constrained decoding
- Parallel decoding (overview)
— Blockwise parallel decoding
— n-gram parallel decoding
— Lookahead decoding
— Medusa decoding
— Consensus decoding
- Speculative decoding (overview)
— Generalized speculative decoding
— Aggressive decoding
— Lookup decoding
— Retrieval lookup decoding
— Prompt lookup decoding
— Self speculative decoding
— Tree speculative decoding
— Superposed decoding
— Hierarchical speculative decoding
— Heuristic speculative decoding
— Multi-token speculative decoding
— Sequential speculative decoding
More AI Research
(For feedback, suggestions or corrections, please email research@yoryck.com.)