Aussie AI

Greedy Decoding

Last Updated 6 January, 2025

by David Spuler, Ph.D.

What is Greedy Decoding?

Greedy decoding is a simplistic type of decoding, where the decoder simply picks the output token with the highest predicted probability. This algorithm is very efficient, but regarded as unreliable in terms of accuracy and the quality of output text. More reliable alternative decoding algorithms include top-k decoding, top-p decoding, and beam search decoding.

Problems with Greedy Decoding

One problem is that greedy decoding is deterministic, which means that the AI engine is certain to always output the same token from the same inputs. This leads to an unimpressive lack of creativity and variety, and in the worst cases, can lead to loops of repetitive phrasing and poor flow of natural language text. This repetition is called "neural text degeneration."

Greedy decoding doesn't look ahead by even one word, so it can get tripped up by two-word phrases or longer sequences of more accurate text. It is also an autoregressive decoding model, because the single output token is added to the inputs for the next phase of decoding.

Optimizing Greedy Decoding

Speed is a major advantage of greedy decoding. The algorithm to choose a token is simply the "max" function on the vector of logits. And this can be further optimized by noticing that there's no longer any need to use Softmax to convert the logit outputs from log-scale to full probabilities. The exponential function is a monotonically increasing function, so the token with the highest logit value will also have the highest probability, so the incoming logit vector can simply be scanned for its maximum, skipping the entire Softmax calculation.

Research on Greedy Decoding

Articles and papers on greedy decoding:

Jiatao Gu, Kyunghyun Cho, Victor O.K. Li, 2017, Trainable Greedy Decoding for Neural Machine Translation, arXiv preprint arXiv:1702.02429, https://arxiv.org/abs/1702.02429
James Briggs Feb 25, 2021, The Three Decoding Methods For NLP, Towards Data Science https://towardsdatascience.com/the-three-decoding-methods-for-nlp-23ca59cb1e9d
Y Chen, VOK Li, K Cho, SR Bowman, 2018, A stable and effective learning strategy for trainable greedy decoding, https://arxiv.org/abs/1804.07915
Gian Wiher, Clara Meister, Ryan Cotterell, Mar 2022, On Decoding Strategies for Neural Text Generators, https://arxiv.org/abs/2203.15721 (An evaluation of a variety of decoding algorithms including beam search, top-k, and top-p.)
Chufan Shi, Haoran Yang, Deng Cai, Zhisong Zhang, Yifan Wang, Yujiu Yang, Wai Lam, 10 Feb 2024, A Thorough Examination of Decoding Methods in the Era of LLMs, https://arxiv.org/abs/2402.06925 (Evaluates a number of decoding algorithms with several 7B models including Llama2-7B, and also with 4-bit and 8-bit quantization.)
Yunsheng Ni, Chuanjian Liu, Yehui Tang, Kai Han, Yunhe Wang, 13 May 2024, EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models, https://arxiv.org/abs/2405.07542 Code: https://github.com/niyunsheng/EMS-SD (Speculative decoding across multiple queries by avoiding padding tokens and optimizing the KV cache.)
David Spuler, March 2024, Chapter 26. Decoding Algorithms, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
Chengbo Liu, Yong Zhu, 1 Apr 2024 (v2), SDSAT: Accelerating LLM Inference through Speculative Decoding with Semantic Adaptive Tokens, https://arxiv.org/abs/2403.18647 Code: https://github.com/hasuoshenyun/SDSAT
Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, Zaid Harchaoui, 24 Jun 2024, From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models, https://arxiv.org/abs/2406.16838 (Survey and theoretical analysis of many different decoding algorithms, along with various ways to speed them up such as speculative decoding and KV caches.)
Domas Grigaliūnas, Mantas Lukoševičius, 29 Jul 2024, Inference acceleration for large language models using "stairs" assisted greedy generation, https://arxiv.org/abs/2407.19947 (A draft-and-verify method that is similar to speculative decoding.)
Seungrok Jung., 15, Mar 2024, Large language model inference optimizations on AMD GPUs, ROCm blogs, https://rocm.blogs.amd.com/artificial-intelligence/llm-inference-optimize/README.html
David Spuler, March 2024, Greedy Decoding, in Generative AI in C++, https://www.aussieai.com/book/ch26-greedy-decoding
Xuezhi Wang, Denny Zhou, 23 May 2024 (v2), Chain-of-Thought Reasoning Without Prompting, https://arxiv.org/abs/2402.10200 ("CoT decoding" is examining the alternative paths in the decoding algorithm, which is somewhat similar to Chain-of-Thought reasoning.)
Haoran Wang, Kai Shu, Jan 2025, MakeEveryTokenCount: ASystematic Survey on Decoding Methods for Foundation Model, https://www.researchgate.net/profile/Haoran-Wang-96/publication/387703971_Make_Every_Token_Count_A_Systematic_Survey_on_Decoding_Methods_for_Foundation_Models/links/67784c8ce74ca64e1f49eb15/Make-Every-Token-Count-A-Systematic-Survey-on-Decoding-Methods-for-Foundation-Models.pdf https://github.com/wang2226/Awesome-LLM-Decoding

Neural Text Degeneration

The problem of repetitious and looping text decoding is called neural text degeneration in research papers. This occurs primarily in deterministic decoding algorithms such as greedy decoding, and is largely resolved by stochastic methods such as top-k decoding.

Research on Neural Text Degeneration: Research papers include:

Zihao Fu, Wai Lam, Anthony Man-Cho So, and Bei Shi. 2021. A theoretical analysis of the repetition problem in text generation. In Thirty-Fifth AAAI Conference on Artificial Intelligence. https://arxiv.org/abs/2012.14660
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi, 2019, The curious case of neural text degeneration, International Conference on Learning Representations, https://arxiv.org/abs/1904.09751
Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi, and Zaid Harchaoui. 2021. MAUVE: Measuring the gap between neural text and human text using divergence frontiers. Advances in Neural Information Processing Systems. https://proceedings.neurips.cc/paper/2021/file/260c2432a0eecc28ce03c10dadc078a4-Paper.pdf