Aussie AI
Decoding Algorithms
Last Updated 27 August, 2025
by David Spuler, Ph.D.
What are Decoding Algorithms?
The decoding algorithm in Transformer AI engines is the method whereby the decoder emits tokens for the output message. At the end of each decoding step, the output is a list of "logits" with probabilities for the predictions of the next token. The algorithm by which the decoder decides to output one token, or multiple tokens, and which ones, is called the decoding algorithm.
Logits vs Activations
Each decoding step is aimed at producing a single token (i.e., the next word to output). Producing the output of a decoding phase for one token is actually a two-step process:
- The "activation vector" or "activations" are computed (numbers representing "embeddings"), and then
- The "logits" are computed from the activation vector (called "unembedding").
These two vectors are not usually the same size:
- Activations vector — size is the "hidden" model dimension (e.g., 4096)
- Logits vector — size is the model vocabulary size (e.g., 50,000 unique tokens).
Logits are very closely related to tokens, and there is one logit value per token. Each logit value represents the LLM's prediction of how likely it would be to output this token as the next one. We could simply take the logit with the highest value, which corresponds to the most likely token according to the LLM, and output that token. This is called "greedy decoding."
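As a rough illustration of greedy decoding, here is a minimal Python sketch. It assumes the logits arrive as a plain list of raw scores, one per token id, and that a softmax is used to turn them into probabilities (a common convention, but an assumption here). The function names and the five-token toy vocabulary are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw logit scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_decode_step(logits):
    """Return the token id with the highest logit (greedy decoding)."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Hypothetical logits for a toy vocabulary of 5 tokens.
logits = [1.5, -0.3, 4.2, 0.0, 2.1]
probs = softmax(logits)
token_id = greedy_decode_step(logits)
print(token_id)  # prints 2
```

Softmax preserves the ordering of the scores, so the greedy choice is the same whether we take the maximum over the raw logits or over the probabilities.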
Note that most of the LLM's processing is not using logits, but uses activation vectors. The logits only appear at the very end of a decoding phase. In all the interim steps, which are usually multiple layers of computations inside the model, we use an "embedding space" representation called an activation vector. We don't actually work on "tokens" or their probabilities, but we work on the probabilities of what I call the "signals" in the embedding space, as stored in the activation vector.
The activations are a vector of numbers representing the likelihood of each "signal" in the embeddings. For example, a signal might be something like "noun" or "adjective" signals, but there are literally thousands of them, and not everyone understands what every value in an embedding actually represents in reality. But the LLM sure knows!
This vector of numbers is the "activation vector" and is usually shortened to "activations." These values represent the extent to which each signal has been "activated" in each neuron. Each layer of computation modifies the activations, and we get the final activations after the final model layer.
At this very final phase, we have these "activations," whose length is the model's internal dimension (e.g., 4096), which represents how many signals it's tracking. We need to convert them to logit probabilities, one per token, and there are perhaps 50,000 tokens (depends on the "vocabulary size" but 50,000 or 100,000 is common). Hence, we need to convert a 4096-length vector of numbers ("activations") into a 50,000-length vector of numbers ("logits").
The "unembedding matrix" is what we use. Multiplying the activation vector by this matrix, which is large and rectangular (e.g., 4096x50,000), is how this is done. This converts the 4096-vector into a 50,000-vector. The unembedding matrix is large, and expensive to use, which is why we only do this once per decoding phase, rather than once per layer.
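To make the shapes concrete, here is a small Python sketch of the unembedding step as a vector-matrix multiplication. The toy sizes (8 and 20) stand in for the real ones (e.g., 4096 and 50,000), the random values are placeholders rather than real model weights, and the function name `unembed` is hypothetical. A real engine would use an optimized matrix kernel instead of Python loops.

```python
import random

def unembed(activations, unembedding):
    """Multiply a hidden-size activation vector by a (hidden x vocab) matrix."""
    hidden = len(activations)
    vocab = len(unembedding[0])
    assert len(unembedding) == hidden, "matrix rows must match hidden size"
    logits = [0.0] * vocab
    for i in range(hidden):
        a = activations[i]
        row = unembedding[i]
        for j in range(vocab):
            logits[j] += a * row[j]  # accumulate the dot product for token j
    return logits

# Toy sizes: hidden dimension 8, vocabulary size 20 (placeholders only).
HIDDEN, VOCAB = 8, 20
rng = random.Random(0)
activations = [rng.uniform(-1, 1) for _ in range(HIDDEN)]
unembedding = [[rng.uniform(-1, 1) for _ in range(VOCAB)] for _ in range(HIDDEN)]

logits = unembed(activations, unembedding)
print(len(activations), "->", len(logits))  # prints 8 -> 20
```

The cost is proportional to the hidden size times the vocabulary size, which is why doing this once per decoding step, rather than once per layer, matters.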
Anyway, the computation of activations is not the decoding algorithm. Nor is the multiplication by the unembedding matrix to get the logits vector. Rather, the decoding algorithm is the final phase, which operates on the logits vector of probabilities for each of the 50,000 tokens, and thereby chooses the next token to output.
Types of Decoding Algorithms
There are several possible decoding algorithms for the basic situation of choosing one token to output from a vector of probabilities for each token:
- Greedy decoding — always choose the highest-probability token.
- Top-k sampling (random sampling) — choose from k most likely tokens.
- Top-p sampling (nucleus sampling) — a finesse on the top-k decoding algorithm.
- Beam search decoding — a more complex "tree" search of multiple token sequences.
- Edit decoding — using the input context to help decode the output (e.g., grammar checking).
The above are all variations on a theme: take a vector of token probabilities as the input, and analyze these probabilities to choose exactly one of the tokens as the output.
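The two sampling variants can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the logits are a plain list of scores, softmax is used to get probabilities, and the function names are hypothetical. Top-k keeps the k highest-scoring tokens; top-p keeps the smallest set of top tokens whose cumulative probability reaches p. Both then sample randomly from the survivors.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_sample(logits, k, rng):
    """Sample a token id from the k highest-scoring tokens."""
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top_ids])  # renormalize over the top k
    return rng.choices(top_ids, weights=probs, k=1)[0]

def top_p_sample(logits, p, rng):
    """Sample from the smallest set of top tokens with cumulative probability >= p."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    # random.choices accepts relative weights, so no explicit renormalization.
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]

rng = random.Random(0)
logits = [1.5, -0.3, 4.2, 0.0, 2.1]  # hypothetical 5-token vocabulary
print(top_k_sample(logits, 2, rng), top_p_sample(logits, 0.9, rng))
```

With k=1, top-k sampling reduces to greedy decoding. The difference with top-p is that the number of candidate tokens adapts to how confident the model is at each step.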
At a higher level, there are more advanced options, and the main classes of decoding algorithms are:
- Autoregressive decoding
- Non-Autoregressive (NAR) decoding
- Parallel decoding
- Multi-token output
Other issues for decoding algorithms include:
- Prefill phase (runs before decoding)
- Temperature (scaling hyper-parameter that affects decoding)
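As a brief sketch of how temperature affects decoding, the usual approach is to divide each logit by the temperature before the softmax. The Python below assumes that convention; the function names and example values are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, temperature):
    """Scale logits by 1/temperature before softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return [x / temperature for x in logits]

logits = [1.5, -0.3, 4.2, 0.0, 2.1]  # hypothetical scores
for t in (0.5, 1.0, 2.0):
    probs = softmax(apply_temperature(logits, t))
    print(t, round(max(probs), 3))
```

A temperature below 1.0 sharpens the distribution toward the top token (closer to greedy decoding), whereas a temperature above 1.0 flattens it, so sampling methods pick lower-ranked tokens more often.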
Parallel Decoding Algorithms
There are several types of parallel optimizations for decoding:
- Speculative decoding
- Generalized speculative decoding
- Lookahead decoding
- Lookup decoding (including "prompt lookup decoding" and "retrieval lookup decoding")
- Parallel decoding (generally)
Multi-model decoding algorithms have also been examined:
- Supervised decoding (see big-little architectures)
- Ensemble decoding (see ensemble architectures)
- Collaborative decoding
- Consensus decoding
Hybrid Decoding Optimizations
The decoding algorithm may also be combined with other optimizations that improve the decoding process, such as:
- Non-autoregressive decoding
- Token pruning
- Prompt compression (input compression)
Beam Search Decoding
Beam search decoding is an advanced type of decoding that works on a tree of potential output sequences. This is a complex search space that keeps multiple candidate token sequences in reserve, until it chooses the best one. Beam search can look ahead a few tokens, and then backtrack to choose a different final output.
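Here is a minimal Python sketch of the idea, not a production implementation. It assumes a hypothetical `next_log_probs(seq)` callback that returns log-probabilities for each candidate next token, and it uses a tiny hand-made table (`TOY`) in place of a real model. At each step, every kept sequence is extended by every token, and only the `beam_width` best-scoring sequences survive.

```python
import math

def beam_search(next_log_probs, start, beam_width, steps):
    """Keep the beam_width best partial sequences at each step."""
    beams = [(0.0, [start])]  # (cumulative log-probability, token sequence)
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for token, logp in next_log_probs(seq).items():
                candidates.append((score + logp, seq + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]  # best (score, sequence) found

# Toy next-token probabilities, keyed by the previous token (hypothetical).
TOY = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"C": 0.5, "D": 0.5},
    "B": {"C": 0.9, "D": 0.1},
}

def toy_next_log_probs(seq):
    return {tok: math.log(p) for tok, p in TOY[seq[-1]].items()}

greedy_score, greedy_seq = beam_search(toy_next_log_probs, "<s>", 1, 2)
beam_score, beam_seq = beam_search(toy_next_log_probs, "<s>", 2, 2)
print(greedy_seq, beam_seq)
```

In this toy table, a beam width of 1 (which is just greedy decoding) commits to "A" because it looks best at the first step, and ends with a total probability of 0.30. With a beam width of 2, the sequence starting with "B" is kept in reserve and wins with 0.36, which is the effect of looking ahead and then choosing a different output.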
Research papers on beam search:
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, Zhihao Jia, 2024. SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification, ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, April 2024, Pages 932–949, https://doi.org/10.1145/3620666.3651335 https://dl.acm.org/doi/abs/10.1145/3620666.3651335 Code: https://github.com/flexflow/FlexFlow/
- Jared Lichtarge, Christopher Alberti, Shankar Kumar, Noam Shazeer, and Niki Parmar. 2018. Weakly supervised grammatical error correction using iterative decoding. CoRR, abs/1811.01710. https://arxiv.org/abs/1811.01710 (Beam search decoding with a high threshold to emit corrections.)
- Jindrich Libovicky, Jindrich Helcl, Marek Tlusty, Ondrej Bojar, and Pavel Pecina. 2016. CUNI system for WMT16 automatic post-editing and multimodal translation tasks. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 646–654, Berlin, Germany. https://arxiv.org/abs/1606.07481 (Post-editing of machine translation.)
- Daniel Dahlmeier, Hwee Tou Ng, 2012, A Beam-Search Decoder for Grammatical Error Correction, Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 568–578, Jeju Island, Korea, 12–14 July 2012, https://aclanthology.org/D12-1052.pdf
- Xiaoming (Jason) Cui, Ashraf Bhuiyan, 2023, Optimizing Transformer Model Inference on Intel® Processors, https://www.intel.com/content/www/us/en/developer/articles/technical/optimize-transformer-model-inference-processors.html
- Ashwin K. Vijayakumar, Michael Cogswell, Ramprasaath R. Selvaraju, Qing Sun, Stefan Lee, David J. Crandall, and Dhruv Batra. 2018. Diverse beam search for improved description of complex scenes. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 7371–7379. AAAI Press. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17329
- Xiaohui Wang, Ying Xiong, Yang Wei, Mingxuan Wang, Lei Li, Apr 2021, LightSeq: A High Performance Inference Library for Transformers, https://arxiv.org/pdf/2010.13887.pdf
- Chufan Shi, Haoran Yang, Deng Cai, Zhisong Zhang, Yifan Wang, Yujiu Yang, Wai Lam, 10 Feb 2024, A Thorough Examination of Decoding Methods in the Era of LLMs, https://arxiv.org/abs/2402.06925 (Evaluates a number of decoding algorithms with several 7B models including Llama2-7B, and also with 4-bit and 8-bit quantization.)
- GC Garbacea, 2023, Neural Language Generation for Content Adaptation: Explainable, Efficient Low-Resource Text Simplification and Evaluation, Ph.D. thesis, Computer Science and Engineering, University of Michigan, https://deepblue.lib.umich.edu/bitstream/handle/2027.42/178028/garbacea_1.pdf?sequence=1 (Broad thesis with sections on beam search decoding optimizations and AI safety issues such as bias.)
- Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Guided open vocabulary image captioning with constrained beam search, 2017, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 936–945, https://arxiv.org/abs/1612.00576
- Chris Hokamp and Qun Liu, 2017, Lexically constrained decoding for sequence generation using grid beam search. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1535–1546, https://arxiv.org/abs/1704.07138
- Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, Oct 2023, Efficient Memory Management for Large Language Model Serving with PagedAttention, SOSP ’23, October 23–26, 2023, Koblenz, Germany, https://dl.acm.org/doi/pdf/10.1145/3600006.3613165 (The original Paged Attention and vLLM paper, focusing on optimizing memory size of the KV cache using methods similar to operating-system memory paging.)
- Zhaorun Chen, Zhuokai Zhao, Hongyin Luo, Huaxiu Yao, Bo Li, Jiawei Zhou, July 2024, HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:7824-7846, 2024, https://proceedings.mlr.press/v235/chen24bi.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/chen24bi/chen24bi.pdf https://github.com/BillChan226/HALC
- Tinghui Zhu, Kai Zhang, Jian Xie, Yu Su, 4 Feb 2024 (v2), Deductive Beam Search: Decoding Deducible Rationale for Chain-of-Thought Reasoning, https://arxiv.org/abs/2401.17686
- Jungo Kasai, Keisuke Sakaguchi, Ronan Le Bras, Dragomir Radev, Yejin Choi, and Noah A. Smith. 2024. A Call for Clarity in Beam Search: How It Works and When It Stops. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 77–90, Torino, Italia. ELRA and ICCL. https://aclanthology.org/2024.lrec-main.7/ https://aclanthology.org/2024.lrec-main.7.pdf
- Zongyue Qin, Zifan He, Neha Prakriya, Jason Cong, Yizhou Sun, 25 Sep 2024, Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference, https://arxiv.org/abs/2409.16560
- Shixiaowei02, Oct 2024, TensorRT-LLM 0.13.0 Release Latest, https://github.com/NVIDIA/TensorRT-LLM/releases/tag/v0.13.0
- Yejin Lee, Anna Sun, Basil Hosmer, Bilge Acun, Can Balioglu, Changhan Wang, Charles David Hernandez, Christian Puhrsch, Daniel Haziza, Driss Guessous, Francisco Massa, Jacob Kahn, Jeffrey Wan, Jeremy Reizenstein, Jiaqi Zhai, Joe Isaacson, Joel Schlosser, Juan Pino, Kaushik Ram Sadagopan, Leonid Shamis, Linjian Ma, Min-Jae Hwang, Mingda Chen, Mostafa Elhoushi, Pedro Rodriguez, Ram Pasunuru, Scott Yih, Sravya Popuri, Xing Liu, Carole-Jean Wu, 30 Sep 2024, Characterizing and Efficiently Accelerating Multimodal Generation Model Inference, https://arxiv.org/abs/2410.00215 (Analyzes the bottlenecks in inference, finding the usual problems of autoregression, but also more interesting issues such as that linear kernels can be expensive, and KV cache reordering is a bottleneck in beam search, and layer skipping is analyzed.)
- Xinyu Lin, Chaoqun Yang, Wenjie Wang, Yongqi Li, Cunxiao Du, Fuli Feng, See-Kiong Ng, Tat-Seng Chua, 8 Oct 2024 (v2), Efficient Inference for Large Language Model-based Generative Recommendation, https://arxiv.org/abs/2410.05165
- Rongxiang Wang and Felix Xiaozhu Lin. 2024. Turbocharge Speech Understanding with Pilot Inference. In Proceedings of the 30th Annual International Conference on Mobile Computing and Networking (ACM MobiCom '24). Association for Computing Machinery, New York, NY, USA, 1299–1313. https://doi.org/10.1145/3636534.3690694 https://dl.acm.org/doi/abs/10.1145/3636534.3690694 https://dl.acm.org/doi/pdf/10.1145/3636534.3690694 ("Pilot inference" is a specialized mix of caching, computation reuse, and backtracking in beam search for speech understanding, and is somewhat related to speculative decoding, and similar to continual inference for processing a stream.)
- NVIDIA, Dec 2024, Multi-Head, Multi-Query, and Group-Query Attention, https://nvidia.github.io/TensorRT-LLM/advanced/gpt-attention.html#kv-cache
- Xuezhi Wang, Denny Zhou, 23 May 2024 (v2), Chain-of-Thought Reasoning Without Prompting, https://arxiv.org/abs/2402.10200 ("CoT decoding" is examining the alternative paths in the decoding algorithm, which is somewhat similar to Chain-of-Thought reasoning.)
- Xiangjue Dong, Maria Teleki, James Caverlee, 18 Dec 2024, A Survey on LLM Inference-Time Self-Improvement, https://arxiv.org/abs/2412.14352 https://github.com/dongxiangjue/Awesome-LLM-Self-Improvement (Broad survey of reasoning improvement methods from multi-step inference to RALM to decoding algorithms.)
- Haoran Wang, Kai Shu, Jan 2025, Make Every Token Count: A Systematic Survey on Decoding Methods for Foundation Models, https://www.researchgate.net/profile/Haoran-Wang-96/publication/387703971_Make_Every_Token_Count_A_Systematic_Survey_on_Decoding_Methods_for_Foundation_Models/links/67784c8ce74ca64e1f49eb15/Make-Every-Token-Count-A-Systematic-Survey-on-Decoding-Methods-for-Foundation-Models.pdf https://github.com/wang2226/Awesome-LLM-Decoding
- Edward Beeching, Lewis Tunstall, Sasha Rush, Dec 16, 2024, Scaling Test Time Compute with Open Source Models, https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute
- Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Hubert Niewiadomski, Torsten Hoefler, 23 Jan 2025 (v3), Reasoning Language Models: A Blueprint, https://arxiv.org/abs/2501.11223 (Survey and blueprint for how to build a Large Reasoning Model.)
- Wendi Cui, Jiaxin Zhang, Zhuohang Li, Hao Sun, Damien Lopez, Kamalika Das, Bradley A. Malin, Sricharan Kumar, 26 Feb 2025, Automatic Prompt Optimization via Heuristic Search: A Survey, https://arxiv.org/abs/2502.18746 (Survey of auto prompting, from basic LLM enhancements to some methods quite similar to RALM and TALM.)
- Komal Kumar, Tajamul Ashraf, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Phillip H.S. Torr, Salman Khan, Fahad Shahbaz Khan, 28 Feb 2025, LLM Post-Training: A Deep Dive into Reasoning Large Language Models, https://arxiv.org/abs/2502.21321 https://github.com/mbzuai-oryx/Awesome-LLM-Post-training
- Yangchao Wu, Zongyue Qin, Alex Wong, Stefano Soatto, 20 May 2025, STree: Speculative Tree Decoding for Hybrid State-Space Models, https://arxiv.org/abs/2505.14969
- Mikhail Andronov, Natalia Andronova, Michael Wand, Jürgen Schmidhuber, Djork-Arné Clevert, 2 Aug 2025, Fast and scalable retrosynthetic planning with a transformer neural network and speculative beam search, https://arxiv.org/abs/2508.01459
- Harold Silvère Kiossou and Siegfried Nijssen and Pierre Schaus, 8 Aug 2025, A Generic Complete Anytime Beam Search for Optimal Decision Tree, https://arxiv.org/abs/2508.06064
Phrase Banning
Phrase banning is a feature extension for LLM decoding to disallow selected words or phrases, rather than a speed optimization. The idea is to block the LLM from outputting certain words or phrases, rather than post-processing LLM output to remove words. Words or phrases can be "banned" at the decoder level, forcing the LLM decoding phase to backtrack whenever it tries to emit a disallowed word or phrase. If a model has whole-word tokenization, then individual words can be banned at the current decoding step, by modifying simple decoding algorithms like greedy or top-k/top-p decoding. However, banning multi-word phrases or other multi-token sequences requires backtracking similar to beam search decoding. In fact, it makes sense to merge the phrase banning algorithm into beam search or other tree decoding methods. Banning phrases is usually efficient, because it has only a small token search cost to detect the phrases, and although backtracking is expensive, hopefully it is a relatively rare condition.
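The backtracking idea can be sketched in Python on top of greedy decoding. This is only an illustration under several assumptions: `next_scores(seq)` is a hypothetical callback returning a score per candidate token, the `TOY` table stands in for a real model, and the backtracking policy (rewind to where the banned phrase started and block its first token at that position) is one possible choice, not necessarily what any particular engine does.

```python
def decode_with_banned_phrases(next_scores, prompt, banned, max_new_tokens):
    """Greedy decoding that backtracks whenever a banned phrase is emitted."""
    out = list(prompt)
    blocked = {}  # position -> set of tokens that may not be emitted there
    while len(out) - len(prompt) < max_new_tokens:
        pos = len(out)
        scores = next_scores(out)
        allowed = {t: s for t, s in scores.items() if t not in blocked.get(pos, set())}
        if not allowed:
            break  # every candidate at this position is blocked
        out.append(max(allowed, key=allowed.get))
        for phrase in banned:
            n = len(phrase)
            if len(out) - len(prompt) >= n and tuple(out[-n:]) == tuple(phrase):
                start = len(out) - n
                del out[start:]  # backtrack to where the phrase began
                # Blocks recorded for later positions belonged to the old path.
                blocked = {p: toks for p, toks in blocked.items() if p <= start}
                blocked.setdefault(start, set()).add(phrase[0])
                break
    return out

# Toy next-token scores keyed by the previous token (hypothetical).
TOY = {
    "<s>": {"a": 0.9, "the": 0.1},
    "a": {"rich": 0.7, "vivid": 0.3},
    "rich": {"tapestry": 0.8, "history": 0.2},
    "vivid": {"history": 1.0},
    "tapestry": {"</s>": 1.0},
    "history": {"</s>": 1.0},
}

def toy_scores(seq):
    return TOY[seq[-1]]

plain = decode_with_banned_phrases(toy_scores, ["<s>"], [], 4)
clean = decode_with_banned_phrases(toy_scores, ["<s>"], [("rich", "tapestry")], 4)
print(plain)
print(clean)
```

Detection is just a comparison of the tail of the output against each banned phrase, which is cheap. The expensive part is the rewind, since the discarded tokens have to be decoded again along a different path.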
Research papers on phrase banning:
- Lost Ruins, Oct 11, 2024, koboldcpp-1.76, https://github.com/LostRuins/koboldcpp/releases/tag/v1.76 (Release includes "anti-slop" using "phrase banning" decoding algorithm.)
- Sam Paech, 2024, antislop-sampler, https://github.com/sam-paech/antislop-sampler?tab=readme-ov-file (Decoding algorithm for "phrase banning" with backtracking.)
- Bilgehan Sel, Dingcheng Li, Phillip Wallis, Vaishakh Keshava, Ming Jin, Siddhartha Reddy Jonnalagadda, 11 Mar 2025, Backtracking for Safety, https://arxiv.org/abs/2503.08919
Tree Decoding
Tree decoding is the use of alternative pathways in decoding, in the form of a hierarchical tree. This idea is a generalization of beam search decoding. One of the applications of tree decoding is in the attempt to mimic Chain-of-Thought reasoning in a single inference step using a tree of pathways in CoT decoding.
Research papers on tree decoding:
- Ziyu Wan, Xidong Feng, Muning Wen, Stephen Marcus Mcaleer, Ying Wen, Weinan Zhang, Jun Wang, July 2024, AlphaZero-Like Tree-Search can Guide Large Language Model Decoding and Training, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:49890-49920, 2024, https://proceedings.mlr.press/v235/wan24c.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/wan24c/wan24c.pdf
- Xiangxiang Gao, Weisheng Xie, Yiwei Xiang, Feng Ji, 17 Dec 2024, Falcon: Faster and Parallel Inference of Large Language Models through Enhanced Semi-Autoregressive Drafting and Custom-Designed Decoding Tree, https://arxiv.org/abs/2412.12639
- Xuezhi Wang, Denny Zhou, 23 May 2024 (v2), Chain-of-Thought Reasoning Without Prompting, https://arxiv.org/abs/2402.10200 ("CoT decoding" is examining the alternative paths in the decoding algorithm, which is somewhat similar to Chain-of-Thought reasoning.)
- Xiangjue Dong, Maria Teleki, James Caverlee, 18 Dec 2024, A Survey on LLM Inference-Time Self-Improvement, https://arxiv.org/abs/2412.14352 https://github.com/dongxiangjue/Awesome-LLM-Self-Improvement (Broad survey of reasoning improvement methods from multi-step inference to RALM to decoding algorithms.)
- Xidong Feng, Ziyu Wan, Muning Wen, Ying Wen, Weinan Zhang, and Jun Wang. 2023. Alphazero-like tree-search can guide large language model decoding and training. In NeurIPS 2023 Foundation Models for Decision Making Workshop. https://arxiv.org/abs/2309.17179
- Zongyue Qin, Zifan He, Neha Prakriya, Jason Cong, Yizhou Sun, 25 Sep 2024, Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference, https://arxiv.org/abs/2409.16560
- Penghui Yang, Cunxiao Du, Fengzhuo Zhang, Haonan Wang, Tianyu Pang, Chao Du, Bo An, 24 Feb 2025, LongSpec: Long-Context Speculative Decoding with Efficient Drafting and Verification, https://arxiv.org/abs/2502.17421 https://github.com/sail-sg/LongSpec
- Yifu Ding, Wentao Jiang, Shunyu Liu, Yongcheng Jing, Jinyang Guo, Yingjie Wang, Jing Zhang, Zengmao Wang, Ziwei Liu, Bo Du, Xianglong Liu, Dacheng Tao, 27 Feb 2025 (v2), Dynamic Parallel Tree Search for Efficient LLM Reasoning, https://arxiv.org/abs/2502.16235
- Yangchao Wu, Zongyue Qin, Alex Wong, Stefano Soatto, 20 May 2025, STree: Speculative Tree Decoding for Hybrid State-Space Models, https://arxiv.org/abs/2505.14969
- Yuhao Shen, Junyi Shen, Quan Kong, Tianyu Liu, Yao Lu, Cong Wang, 16 May 2025, Speculative Decoding via Hybrid Drafting and Rollback-Aware Branch Parallelism, https://arxiv.org/abs/2506.01979
Contrastive Decoding
Contrastive decoding is a method whereby the probabilities of two or more outputs are "contrasted" to choose the best token to output. This can be done by examining prior layers during inference, or it can be done with multiple models.
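For the multi-model case, a minimal Python sketch looks like the following. It assumes two sets of logits over the same vocabulary, from a stronger "expert" model and a weaker "amateur" model, and it scores each token by the difference in log-probabilities, restricted to tokens the expert already finds plausible. The `alpha` threshold, the function names, and the example numbers are illustrative assumptions.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_total = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_total for x in logits]

def contrastive_step(expert_logits, amateur_logits, alpha=0.1):
    """Pick the token maximizing expert log-prob minus amateur log-prob."""
    exp_lp = log_softmax(expert_logits)
    ama_lp = log_softmax(amateur_logits)
    # Plausibility cutoff: expert probability must be >= alpha * its maximum.
    cutoff = max(exp_lp) + math.log(alpha)
    best_id, best_score = None, -math.inf
    for i, (e, a) in enumerate(zip(exp_lp, ama_lp)):
        if e < cutoff:
            continue  # skip tokens the expert considers implausible
        score = e - a
        if score > best_score:
            best_id, best_score = i, score
    return best_id

# Hypothetical logits over a 3-token vocabulary.
expert = [2.0, 1.8, -3.0]
amateur = [3.0, 0.5, 1.0]
print(contrastive_step(expert, amateur))  # prints 1
```

Greedy decoding on the expert alone would choose token 0 here. The contrast favors token 1, which the expert rates almost as highly but the amateur rates much lower. The plausibility cutoff stops a token that both models consider unlikely from winning merely because the amateur dislikes it even more.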
Research papers on contrastive decoding:
- Chufan Shi, Haoran Yang, Deng Cai, Zhisong Zhang, Yifan Wang, Yujiu Yang, Wai Lam, 10 Feb 2024, A Thorough Examination of Decoding Methods in the Era of LLMs, https://arxiv.org/abs/2402.06925 (Evaluates a number of decoding algorithms with several 7B models including Llama2-7B, and also with 4-bit and 8-bit quantization.)
- Kaiyan Zhang, Jianyu Wang, Ning Ding, Biqing Qi, Ermo Hua, Xingtai Lv, Bowen Zhou, 18 Jun 2024, Fast and Slow Generating: An Empirical Study on Large and Small Language Models Collaborative Decoding, https://arxiv.org/abs/2406.12295 Code: https://github.com/TsinghuaC3I/FS-GEN
- Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, Zaid Harchaoui, 24 Jun 2024, From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models, https://arxiv.org/abs/2406.16838 (Survey and theoretical analysis of many different decoding algorithms, along with various ways to speed them up such as speculative decoding and KV caches.)
- Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Luke Zettlemoyer, Mike Lewis, 10 Jul 2023 (v2), Contrastive Decoding: Open-ended Text Generation as Optimization, https://arxiv.org/abs/2210.15097
- Hyunjong Ok, Jegwang Ryu, Jaeho Lee, 26 Jun 2024, Decoding with Limited Teacher Supervision Requires Understanding When to Trust the Teacher, https://arxiv.org/abs/2406.18002 (Examines the idea of not using the larger model to always verify, and when to trust either the smaller or larger models, which is an idea that generalized beyond speculative decoding.)
- Zexuan Qiu, Zijing Ou, Bin Wu, Jingjing Li, Aiwei Liu, Irwin King, 25 Jun 2024, Entropy-Based Decoding for Retrieval-Augmented Large Language Models, https://arxiv.org/abs/2406.17519 (Enhanced decoding algorithm for multi-document RAG processing.)
- Hongyi Yuan, Keming Lu, Fei Huang, Zheng Yuan, Chang Zhou, 13 Mar 2024 (v2), Speculative Contrastive Decoding, https://arxiv.org/abs/2311.08981
- Zhaorun Chen, Zhuokai Zhao, Hongyin Luo, Huaxiu Yao, Bo Li, Jiawei Zhou, July 2024, HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:7824-7846, 2024, https://proceedings.mlr.press/v235/chen24bi.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/chen24bi/chen24bi.pdf https://github.com/BillChan226/HALC
- F. Li, X. zhang and P. Zhang, 2024, Mitigating Hallucination Issues in Small-Parameter LLMs through Inter-Layer Contrastive Decoding, 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 2024, pp. 1-8, doi: 10.1109/IJCNN60899.2024.10650644, https://ieeexplore.ieee.org/abstract/document/10650644
- Xiangjue Dong, Maria Teleki, James Caverlee, 18 Dec 2024, A Survey on LLM Inference-Time Self-Improvement, https://arxiv.org/abs/2412.14352 https://github.com/dongxiangjue/Awesome-LLM-Self-Improvement (Broad survey of reasoning improvement methods from multi-step inference to RALM to decoding algorithms.)
- Haoran Wang, Kai Shu, Jan 2025, Make Every Token Count: A Systematic Survey on Decoding Methods for Foundation Models, https://www.researchgate.net/profile/Haoran-Wang-96/publication/387703971_Make_Every_Token_Count_A_Systematic_Survey_on_Decoding_Methods_for_Foundation_Models/links/67784c8ce74ca64e1f49eb15/Make-Every-Token-Count-A-Systematic-Survey-on-Decoding-Methods-for-Foundation-Models.pdf https://github.com/wang2226/Awesome-LLM-Decoding
- Phuc Phan, Hieu Tran, Long Phan, 23 Aug 2024 (v2), Distillation Contrastive Decoding: Improving LLMs Reasoning with Contrastive Decoding and Distillation, https://arxiv.org/abs/2402.14874
- Nikhil Anand, Nov 14, 2024, Making LLMs more Truthful with DoLa: A Contrastive Decoding Approach (Part I), https://ai.gopubby.com/making-llms-more-truthful-with-dola-a-contrastive-decoding-approach-part-i-1c2f90c91996 (Decoding by examining probabilities across layers.)
- Hongxiang Zhang, Hao Chen, Muhao Chen, Tianyi Zhang, 2 Jun 2025 (v2), Active Layer-Contrastive Decoding Reduces Hallucination in Large Language Model Generation, https://arxiv.org/abs/2505.23657
- Che-Yu Chou, Hung-Hsuan Chen, 14 Aug 2025, Contrastive ECOC: Learning Output Codes for Adversarial Defense, https://arxiv.org/abs/2508.10491
- Shan Shen, Shenglu Hua, Jiajun Zou, Jiawei Liu, Jianwang Zhai, Chuan Shi, Wenjian Yu, 14 Aug 2025, Transferable Parasitic Estimation via Graph Contrastive Learning and Label Rebalancing in AMS Circuits, https://arxiv.org/abs/2507.06535
- Lei Tian, Xiaomin Li, Liqian Ma, Hao Yin, Zirui Zheng, Hefei Huang, Taiqing Li, Huchuan Lu, Xu Jia, 14 Aug 2025, CCL-LGS: Contrastive Codebook Learning for 3D Language Gaussian Splatting, https://arxiv.org/abs/2505.20469
- Amr Mousa, Neil Karavis, Michele Caprio, Wei Pan and Richard Allmendinger, 14 Aug 2025, TAR: Teacher-Aligned Representations via Contrastive Learning for Quadrupedal Locomotion, https://arxiv.org/abs/2503.20839
- Weijia Yang, Tian Lan, Leyuan Liu, Wei Chen, Tianqing Zhu, Sheng Wen, Xiaosong Zhang, 19 Jul 2025, CASPER: Contrastive Approach for Smart Ponzi Scheme Detecter with More Negative Samples, https://arxiv.org/abs/2507.16840
- Xiaoqiang He, 21 Jul 2025, CLAMP: Contrastive Learning with Adaptive Multi-loss and Progressive Fusion for Multimodal Aspect-Based Sentiment Analysis, https://arxiv.org/abs/2507.16854
- Piotr Masztalski, Michał Romaniuk, Jakub Żak, Mateusz Matuszewski, Konrad Kowalczyk, 23 Jul 2025, Clustering-based hard negative sampling for supervised contrastive speaker verification, https://arxiv.org/abs/2507.17540
- Arsh Tangri, Nichols Crawford Taylor, Haojie Huang, Robert Platt, 22 Jul 2025, Equivariant Goal Conditioned Contrastive Reinforcement Learning, https://arxiv.org/abs/2507.16139
- Xiaoya Li, Xiaofei Sun, Albert Wang, Jiwei Li and Chris Shum, 22 Jul 2025, CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning, https://arxiv.org/abs/2507.14111
- Zhijie Wang, Zixin Xu, Zhiyuan Pan, 24 Jul 2025, GCC-Spam: Spam Detection via GAN, Contrastive Learning, and Character Similarity Networks, https://arxiv.org/abs/2507.14679
- Yajiao Dai, Jun Li, Zhen Mei, Yiyang Ni, Shi Jin, Zengxiang Li, Sheng Guo, Wei Xiang, 12 Jul 2025, Semi-Supervised Federated Learning via Dual Contrastive Learning and Soft Labeling for Intelligent Fault Diagnosis, https://arxiv.org/abs/2507.14181
- Xiaotong Luo, Shengda Zhuo, Min Chen, Lichun Li, Ruizhao Lu, Wenqi Fan, Shuqiang Huang and Yin Tang, 12 Jul 2025, From Bias to Behavior: Learning Bull-Bear Market Dynamics with Contrastive Modeling, https://arxiv.org/abs/2507.14182
- Yiming Xu, Zhen Peng, Bin Shi, Xu Hua, Bo Dong, Song Wang, Chen Chen, 19 Jul 2025, Revisiting Graph Contrastive Learning on Anomaly Detection: A Structural Imbalance Perspective, https://arxiv.org/abs/2507.14677
- Abdul-Kazeem Shamba, Kerstin Bach and Gavin Taylor, 20 Jul 2025, eMargin: Revisiting Contrastive Learning with Margin-Based Separation, https://arxiv.org/abs/2507.14828
- Jinzhi Wang, Bin Li, Qingke Peng, Haozhou Li, Zeyuan Zeng, Ruimeng Li, Kaixuan Yang, Jiangbo Zhang, Biyi Zhou, Yaoying Wang, 20 Jul 2025, LumiCRS: Asymmetric Contrastive Prototype Learning for Long-Tail Conversational Recommender Systems, https://arxiv.org/abs/2507.04722
- Sho Oshima, Yuji Okamoto, Taisei Tosaki, Ryosuke Kojima, Yasushi Okuno, 19 Jul 2025, Supervised Graph Contrastive Learning for Gene Regulatory Network, https://arxiv.org/abs/2505.17786
- Chaoqun Cui, Caiyan Jia, 10 Aug 2025, Propagation Tree Is Not Deep: Adaptive Graph Contrastive Learning Approach for Rumor Detection, https://arxiv.org/abs/2508.07201
- WonJun Moon, Hyun Seok Seong, Jae-Pil Heo, 11 Aug 2025, Selective Contrastive Learning for Weakly Supervised Affordance Grounding, https://arxiv.org/abs/2508.07877
- Mohammad Zia Ur Rehman, Anukriti Bhatnagar, Omkar Kabde, Shubhi Bansal, Nagendra Kumar, 7 Aug 2025, ImpliHateVid: A Benchmark Dataset and Two-stage Contrastive Learning Framework for Implicit Hate Speech Detection in Videos, https://arxiv.org/abs/2508.06570
- Mengting Pan, Fan Li, Xiaoyang Wang, Wenjie Zhang, Xuemin Lin, 10 Aug 2025, HiTeC: Hierarchical Contrastive Learning on Text-Attributed Hypergraph with Semantic-Aware Augmentation, https://arxiv.org/abs/2508.03104
- Binxiong Li, Yuefei Wang, Binyu Zhao, Heyang Gao, Benhan Yang, Quanzhou Luo, Xue Li, Xu Xiang, Yujie Liu, Huijie Tang, 28 Jul 2025, Attributed Graph Clustering with Multi-Scale Weight-Based Pairwise Coarsening and Contrastive Learning, https://arxiv.org/abs/2507.20505
- Chengkai Wang, Di Wu, Yunsheng Liao, Wenyao Zheng, Ziyi Zeng, Xurong Gao, Hemmings Wu, Zhoule Zhu, Jie Yang, Lihua Zhong, Weiwei Cheng, Yun-Hsuan Chen and Mohamad Sawan, 27 Jul 2025, NeuroCLIP: A Multimodal Contrastive Learning Method for rTMS-treated Methamphetamine Addiction Analysis, https://arxiv.org/abs/2507.20189
- Wenhao Ma, Yu-Cheng Chang, Jie Yang, Yu-Kai Wang, Chin-Teng Lin, 28 Jul 2025, Contrastive learning-based agent modeling for deep reinforcement learning, https://arxiv.org/abs/2401.00132
- Sanqing Qu, Tianpei Zou, Florian Röhrbein, Cewu Lu, Guang Chen, Dacheng Tao, Changjun Jiang, 26 Jul 2025, GLC++: Source-Free Universal Domain Adaptation through Global-Local Clustering and Contrastive Affinity Learning, https://arxiv.org/abs/2403.14410
- Maximillian Chen and Ruoxi Sun and Tomas Pfister and Sercan Ö. Arık, 27 Jul 2025, Learning to Clarify: Multi-turn Conversations with Action-Based Contrastive Self-Training, https://arxiv.org/abs/2406.00222
- Yu Tai, Xinglong Wu, Hongwei Yang, Hui He, Duanjing Chen, Yuanming Shao and Weizhe Zhang, 28 Jul 2025, How to Bridge Spatial and Temporal Heterogeneity in Link Prediction? A Contrastive Method, https://arxiv.org/abs/2411.00612
- Kristin Qi, Jiali Cheng, Youxiang Zhu, Hadi Amiri, Xiaohui Liang, 28 Jul 2025, Unveil Multi-Picture Descriptions for Multilingual Mild Cognitive Impairment Detection via Contrastive Learning, https://arxiv.org/abs/2505.17067
- Fabrizio Lo Scudo, Alessio De Rango, Luca Furnari, Alfonso Senatore, Donato D'Ambrosio, Giuseppe Mendicino and Gianluigi Greco, 23 Jul 2025, Advancing Wildfire Risk Prediction via Morphology-Aware Curriculum Contrastive Learning, https://arxiv.org/abs/2507.21147
- Yaoyu Zhang and Chi-Guhn Lee, 28 Jul 2025, A Contrastive Diffusion-based Network (CDNet) for Time Series Classification, https://arxiv.org/abs/2507.21357
- David A Kelly and Hana Chockler, 31 Jul 2025, Causal Identification of Sufficient, Contrastive and Complete Feature Sets in Image Classification, https://arxiv.org/abs/2507.23497
- Binxiong Li, Xu Xiang, Xue Li, Quanzhou Lou, Binyu Zhao, Yujie Liu, Huijie Tang, Benhan Yang, 31 Jul 2025, GCL-GCN: Graphormer and Contrastive Learning Enhanced Attributed Graph Clustering Network, https://arxiv.org/abs/2507.19095
- Qile Liu, Weishan Ye, Lingli Zhang, Zhen Liang, 31 Jul 2025, EEG-SCMM: Soft Contrastive Masked Modeling for Cross-Corpus EEG-Based Emotion Recognition, https://arxiv.org/abs/2408.09186
- Ziwei Wang, Siyang Li, Xiaoqing Chen, and Dongrui Wu, 31 Jul 2025, MVCNet: Multi-View Contrastive Network for Motor Imagery Classification, https://arxiv.org/abs/2502.17482
- Gianluca Carloni, Biagio Brattoli, Seongho Keum, Jongchan Park, Taebum Lee, Chang Ho Ahn, Sergio Pereira, 29 Jul 2025, Pathology Foundation Models are Scanner Sensitive: Benchmark and Mitigation with Contrastive ScanGen Loss, https://arxiv.org/abs/2507.22092
- Sara Sarto, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara, 29 Jul 2025, Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training, https://arxiv.org/abs/2410.07336
- Tuna Han Salih Meral, Enis Simsar, Federico Tombari, Pinar Yanardag, 29 Jul 2025, Contrastive Test-Time Composition of Multiple LoRA Models for Image Generation, https://arxiv.org/abs/2403.19776
- Zizhuo Zhang, Jianing Zhu, Xinmu Ge, Zihua Zhao, Zhanke Zhou, Xuan Li, Xiao Feng, Jiangchao Yao, Bo Han, 1 Aug 2025, Co-Reward: Self-supervised Reinforcement Learning for Large Language Model Reasoning via Contrastive Agreement, https://arxiv.org/abs/2508.00410
- Yiming Xu, Xu Hua, Zhen Peng, Bin Shi, Jiarun Chen, Xingbo Fu, Song Wang, Bo Dong, 1 Aug 2025, Text-Attributed Graph Anomaly Detection via Multi-Scale Cross- and Uni-Modal Contrastive Learning, https://arxiv.org/abs/2508.00513
- Shiyi Liu, Buwen Liang, Yuetong Fang, Zixuan Jiang and Renjing Xu, 1 Aug 2025, Hierarchical Multi-Label Contrastive Learning for Protein-Protein Interaction Prediction Across Organisms, https://arxiv.org/abs/2507.02724
- Amrit Rajeev, Udayaadithya Avadhanam, Harshula Tulapurkar, SaiBarath Sundar, 1 Aug 2025, Small sample-based adaptive text classification through iterative and contrastive description refinement, https://arxiv.org/abs/2508.00957
- Xiaoya Li, Xiaofei Sun, Albert Wang, Chris Shum and Jiwei Li, 4 Aug 2025, CRINN: Contrastive Reinforcement Learning for Approximate Nearest Neighbor Search, https://arxiv.org/abs/2508.02091
- Jing Lan, Hexiao Ding, Hongzhao Chen, Yufeng Jiang, Ng Nga Chun, Gerald W.Y. Cheng, Zongxi Li, Jing Cai, Liang-ting Lin, Jung Sun Yoo, 3 Aug 2025, Contrastive Multi-Task Learning with Solvent-Aware Augmentation for Drug Discovery, https://arxiv.org/abs/2508.01799
- Yujia Tong, Tian Zhang, Jingling Yuan, Yuze Wang, Chuang Hu, 3 Aug 2025, LetheViT: Selective Machine Unlearning for Vision Transformers via Attention-Guided Contrastive Learning, https://arxiv.org/abs/2508.01569
- Kosmas Pinitas and Konstantinos Makantasis and Georgios N. Yannakakis, 30 Jul 2025, Privileged Contrastive Pretraining for Multimodal Affect Modelling, https://arxiv.org/abs/2508.03729
- Hyungbin Kim, Incheol Baek, Yon Dohn Chung, 6 Aug 2025, Decoupled Contrastive Learning for Federated Learning, https://arxiv.org/abs/2508.04005
- Thang Duc Tran, Thai Hoang Le, 6 Aug 2025, WSS-CL: Weight Saliency Soft-Guided Contrastive Learning for Efficient Machine Unlearning Image Classification, https://arxiv.org/abs/2508.04308
- Rui Zuo, Simon Khan, Zifan Wang, Garrett Ethan Katz, Qinru Qiu, 6 Aug 2025, Why the Agent Made that Decision: Contrastive Explanation Learning for Reinforcement Learning, https://arxiv.org/abs/2411.16120
- Sahil Sethi, David Chen, Thomas Statchen, Michael C. Burkhart, Nipun Bhandari, Bashar Ramadan, Brett Beaulieu-Jones, 6 Aug 2025, ProtoECGNet: Case-Based Interpretable Deep Learning for Multi-Label ECG Classification with Contrastive Learning, https://arxiv.org/abs/2504.08713
- Tianchen Fang, Guiru Liu, 7 Aug 2025, RegionMed-CLIP: A Region-Aware Multimodal Contrastive Learning Pre-trained Model for Medical Image Understanding, https://arxiv.org/abs/2508.05244
- Kang Liu and Zhuoqi Ma and Zikang Fang and Yunan Li and Kun Xie and Qiguang Miao, 7 Aug 2025, PriorRG: Prior-Guided Contrastive Pre-training and Coarse-to-Fine Decoding for Chest X-ray Report Generation, https://arxiv.org/abs/2508.05353
- Wonjun Kang, Byeongkeun Ahn, Minjae Lee, Kevin Galim, Seunghyuk Oh, Hyung Il Koo, Nam Ik Cho, 7 Aug 2025, UNCAGE: Contrastive Attention Guidance for Masked Generative Transformers in Text-to-Image Generation, https://arxiv.org/abs/2508.05399
- Willian T. Lunardi, Abdulrahman Banabila, Dania Herzalla, and Martin Andreoni, 7 Aug 2025, Contrastive Representation Modeling for Anomaly Detection, https://arxiv.org/abs/2501.05130
- Shengzhu Yang, Jiawei Du, Shuai Lu, Weihang Zhang, Ningli Wang, Huiqi Li, 8 Aug 2025, CLIPin: A Non-contrastive Plug-in to CLIP for Multimodal Semantic Alignment, https://arxiv.org/abs/2508.06434
- Zihu Wang, Boxun Xu, Hejia Geng, Peng Li, 8 Aug 2025, Khan-GCL: Kolmogorov-Arnold Network Based Graph Contrastive Learning with Hard Negatives, https://arxiv.org/abs/2505.15103
- Huifa Li, Jie Fu, Xinlin Zhuang, Haolin Yang, Xinpeng Ling, Tong Cheng, Haochen xue, Imran Razzak, Zhili Chen, 7 Aug 2025, scAGC: Learning Adaptive Cell Graphs with Contrastive Guidance for Single-Cell Clustering, https://arxiv.org/abs/2508.09180
- Ziyu Liu, Azadeh Alavi, Minyi Li, Xiang Zhang, 13 Aug 2025, A Unified Contrastive-Generative Framework for Time Series Classification, https://arxiv.org/abs/2508.09451
- Han Yu, Huiyuan Yang, Akane Sano, 12 Aug 2025, LEAVES: Learning Views for Time-Series Biobehavioral Data in Contrastive Learning, https://arxiv.org/abs/2210.07340
- Minghui Sun, Matthew M. Engelhard, Benjamin A. Goldstein, 15 Aug 2025, Borrowing From the Future: Enhancing Early Risk Assessment through Contrastive Learning, https://arxiv.org/abs/2508.11210
- Bin Ma, Yifei Zhang, Yongjin Xian, Qi Li, Linna Zhou, Gongxun Miao, 15 Aug 2025, A Cross-Modal Rumor Detection Scheme via Contrastive Learning by Exploring Text and Image internal Correlations, https://arxiv.org/abs/2508.11141
- Haojie Zhang, Yixiong Liang, Hulin Kuang, Lihui Cen, Zhe Qu, Yigang Cen, Min Zeng, Shichao Kan, 8 Aug 2025, Contrastive Regularization over LoRA for Multimodal Biomedical Image Incremental Learning, https://arxiv.org/abs/2508.11673
- Reza Shirkavand, Shangqian Gao, Peiran Yu, Heng Huang, 17 Aug 2025, Cost-Aware Contrastive Routing for LLMs, https://arxiv.org/abs/2508.12491
- Alicja Ziarko, Michal Bortkiewicz, Michal Zawalski, Benjamin Eysenbach and Piotr Milos, 18 Aug 2025, Contrastive Representations for Temporal Reasoning, https://arxiv.org/abs/2508.13113
- Yihan Wang, Yiwei Lu, Guojun Zhang, Franziska Boenisch, Adam Dziedzic, Yaoliang Yu, Xiao-Shan Gao, 16 Aug 2025, MUC: Machine Unlearning for Contrastive Learning with Black-box Evaluation, https://arxiv.org/abs/2406.03603
- Kai Sun, Yushi Bai, Zhen Yang, Jiajie Zhang, Ji Qi, Lei Hou and Juanzi Li, 17 Aug 2025, Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models, https://arxiv.org/abs/2505.20152
- Lingyu Si, Jingyao Wang, Wenwen Qiang, 19 Aug 2025, A Generalized Learning Framework for Self-Supervised Contrastive Learning, https://arxiv.org/abs/2508.13596
- Ruobing Jiang, Yacong Li, Haobing Liu, Yanwei Yu, 19 Aug 2025, Incorporating Attributes and Multi-Scale Structures for Heterogeneous Graph Contrastive Learning, https://arxiv.org/abs/2503.13911
- Tianxi Cai, Feiqing Huang, Ryumei Nakada, Linjun Zhang, Doudou Zhou, 19 Aug 2025, Contrastive Learning on Multimodal Analysis of Electronic Health Records, https://arxiv.org/abs/2403.14926
- Qian Zhanga, Ruilin Zhang, Jun Xiao, Yifan Liu and Zhe Wang, 12 Aug 2025, MCLPD:Multi-view Contrastive Learning for EEG-based PD Detection Across Datasets, https://arxiv.org/abs/2508.14073
- Chen-Hao Chang, Hui-Ju Hung, Chia-Hsun Lu, Chih-Ya Shen, 20 Aug 2025, Enhancing Contrastive Link Prediction With Edge Balancing Augmentation, https://arxiv.org/abs/2508.14808
- Guilhem Fauré (MULTISPEECH), Mostafa Sadeghi (MULTISPEECH), Sam Bigeard (MULTISPEECH), Slim Ouni (LORIA, MULTISPEECH), 20 Aug 2025, Towards Skeletal and Signer Noise Reduction in Sign Language Production via Quaternion-Based Pose Encoding and Contrastive Learning, https://arxiv.org/abs/2508.14574
- Yifan Zhang, Junhui Hou, 20 Aug 2025, Is Contrastive Distillation Enough for Learning Comprehensive 3D Representations?, https://arxiv.org/abs/2412.08973
- Yi Yuan, Joseph Van Duyn, Runze Yan, Zhuoyi Huang, Sulaiman Vesal, Sergey Plis, Xiao Hu, Gloria Hyunjung Kwak, Ran Xiao, Alex Fedorov, 21 Aug 2025, Learning ECG Representations via Poly-Window Contrastive Learning, https://arxiv.org/abs/2508.15225
- Junho Song, Jong-Hwan Jang, DongGyun Hong, Joon-myoung Kwon, and Yong-Yeon Jo, 21 Aug 2025, CREMA: A Contrastive Regularized Masked Autoencoder for Robust ECG Diagnostics across Clinical Domains, https://arxiv.org/abs/2407.07110
- Pouria Mortezaagha, Arya Rahgozar, 17 Aug 2025, An Auditable Pipeline for Fuzzy Full-Text Screening in Systematic Reviews: Integrating Contrastive Semantic Highlighting and LLM Judgment, https://arxiv.org/abs/2508.15822
- Wenqiao Zhu, Ji Liu, Rongjuncheng Zhang, Haipang Wu, Yulun Zhang, 21 Aug 2025, CARFT: Boosting LLM Reasoning via Contrastive Learning with Annotated Chain-of-Thought-based Reinforced Fine-Tuning, https://arxiv.org/abs/2508.15868
- Yulin Zhu, Xing Ai, Yevgeniy Vorobeychik, Kai Zhou, 22 Aug 2025, Robust Graph Contrastive Learning with Information Restoration, https://arxiv.org/abs/2307.12555
- Yushi Lin, Peng Yang, 23 Aug 2025, A Decoupled LOB Representation Framework for Multilevel Manipulation Detection with Supervised Contrastive Learning, https://arxiv.org/abs/2508.17086
- Muhammad Aqeel, Danijel Skocaj, Marco Cristani, Francesco Setti, 25 Aug 2025, A Contrastive Learning-Guided Confident Meta-learning for Zero Shot Anomaly Detection, https://arxiv.org/abs/2508.17827
- Bin Tan, Wangyao Ge, Yidi Wang, Xin Liu, Jeff Burtoft, Hao Fan, Hui Wang, 25 Aug 2025, PCR-CA: Parallel Codebook Representations with Contrastive Alignment for Multiple-Category App Recommendation, https://arxiv.org/abs/2508.18166
Flash Decoding
Flash decoding is a memory-reducing decoding algorithm introduced by the research team better known for "flash attention" (versions 1, 2, and 3 so far). It applies similar memory access reductions to the decoding phase.
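As a rough illustration of the idea, the Flash-Decoding post by Tri Dao and colleagues (listed below) publicly describes splitting the cached keys and values into chunks, computing attention for the single new query over each chunk independently, and then merging the partial results using each chunk's softmax statistics. The following is a minimal pure-Python sketch of that split-and-merge arithmetic only. The function names are hypothetical, the chunks are processed sequentially here rather than in parallel GPU kernels, and it should not be read as the authors' implementation.

```python
import math

def attend_chunk(q, keys, values):
    """Attention of one query over one KV chunk.

    Returns (unnormalized weighted value sum, max score, sum of exp weights),
    which is enough information to merge chunks later.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, m, sum(weights)

def split_kv_attention(q, keys, values, chunk_size):
    """Compute attention chunk by chunk, then merge with a rescaling step."""
    partials = [
        attend_chunk(q, keys[i:i + chunk_size], values[i:i + chunk_size])
        for i in range(0, len(keys), chunk_size)
    ]  # on a GPU these chunks could run in parallel (assumption of this sketch)
    global_max = max(m for _, m, _ in partials)
    dim = len(values[0])
    numer = [0.0] * dim
    denom = 0.0
    for out, m, d in partials:
        factor = math.exp(m - global_max)  # rescale chunk to the global max
        denom += factor * d
        for j in range(dim):
            numer[j] += factor * out[j]
    return [x / denom for x in numer]

def reference_attention(q, keys, values):
    """Ordinary single-pass softmax attention, for comparison."""
    out, _, denom = attend_chunk(q, keys, values)
    return [x / denom for x in out]
```

The merged result matches ordinary attention up to floating-point rounding, whatever the chunk size. The potential benefit is that a long KV cache can keep many compute units busy even when there is only one query token per step.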
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, Yu Wang, 2024, FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics, Part of Proceedings of Machine Learning and Systems 6 (MLSys 2024) Conference, PDF: https://proceedings.mlsys.org/paper_files/paper/2024/file/5321b1dabcd2be188d796c21b733e8c7-Paper-Conference.pdf (Next generation of Flash Decoding, with improved asynchronous parallelism of Softmax in both prefill and decoding phases, heuristic dataflow management algorithms, and enhanced GEMM during the decoding phase.)
- Together AI, Nov 13, 2023, Announcing Together Inference Engine – the fastest inference available, https://www.together.ai/blog/together-inference-engine-v1
- Tri Dao, Daniel Haziza, Francisco Massa, Grigory Sizov, October 12, 2023, Flash-Decoding for long-context inference, https://www.together.ai/blog/flash-decoding-for-long-context-inference
- Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
- Aniruddha Nrusimha, William Brandon, Mayank Mishra, Yikang Shen, Rameswar Panda, Jonathan Ragan-Kelley, Yoon Kim, 28 May 2025, FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference, https://arxiv.org/abs/2505.22758 https://github.com/aninrusimha/flashformer (Optimizing kernels for low latency in a single isolated query, not a batch, via kernel fusion and running all components in one kernel, along with programming techniques like metaprogramming.)
Top-p Decoding
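Top-p decoding, also called nucleus sampling, is a longstanding decoding method that examines the cumulative probabilities of the top candidate tokens. It samples only from the smallest set of highest-probability tokens whose cumulative probability reaches the threshold p. Top-p is usually combined with top-k decoding into a hybrid top-k top-p decoding algorithm.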
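The cumulative-probability cutoff can be sketched in a few lines of Python. This is a minimal illustration, not any particular engine's implementation: the function names and the optional `top_k` argument (for the hybrid top-k top-p variant) are assumptions made for this example.

```python
import math
import random

def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p=0.9, top_k=None):
    """Return {token_id: renormalized probability} for the top-p candidates."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]  # hybrid: apply the top-k cut first
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:  # smallest prefix whose mass reaches p
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def sample_top_p(logits, p=0.9, top_k=None, rng=random):
    """Sample one token id from the top-p candidates."""
    nucleus = top_p_filter(softmax(logits), p, top_k)
    ids = list(nucleus)
    weights = [nucleus[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]

# Example: with p=0.75, only the two most likely tokens survive.
candidates = top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.75)
```

In the example, the kept set contains tokens 0 and 1, because their combined probability (0.8) is the first cumulative total to reach 0.75. The two low-probability tokens can never be sampled.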
Research papers on Top-p decoding:
- David Spuler, March 2024, Chapter 26. Decoding Algorithms, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, Zaid Harchaoui, 24 Jun 2024, From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models, https://arxiv.org/abs/2406.16838 (Survey and theoretical analysis of many different decoding algorithms, along with various ways to speed them up such as speculative decoding and KV caches.)
- Hugging Face, 2024, Text Generation Inference, https://huggingface.co/docs/text-generation-inference/index
- David Spuler, March 2024, Top-p Decoding, in Generative AI in C++, https://www.aussieai.com/book/ch26-top-p-decoding
- Haoran Wang, Kai Shu, Jan 2025, Make Every Token Count: A Systematic Survey on Decoding Methods for Foundation Models, https://www.researchgate.net/profile/Haoran-Wang-96/publication/387703971_Make_Every_Token_Count_A_Systematic_Survey_on_Decoding_Methods_for_Foundation_Models/links/67784c8ce74ca64e1f49eb15/Make-Every-Token-Count-A-Systematic-Survey-on-Decoding-Methods-for-Foundation-Models.pdf https://github.com/wang2226/Awesome-LLM-Decoding
Min-P Decoding
Min-p decoding is a recent, minor modification to decoding that mainly improves accuracy (rather than efficiency), without reducing efficiency either. Similar to top-p decoding, min-p tries to avoid outputting tokens with too-low probabilities, so top-p and min-p have the same goal. However, min-p sets a minimum probability threshold that changes dynamically: the cutoff scales with the probability of the most likely token, so it is strict when the model is confident and permissive when the distribution is flat. The discovery of "min-p" was a nice piece of research work, since it is a small coding change that improves accuracy without sacrificing latency.
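A minimal sketch of the dynamic threshold follows, assuming the formulation in the Min P Sampling paper listed below, where the cutoff is a base value multiplied by the top token's probability. The function names and the default of 0.1 are illustrative assumptions, not a reference implementation.

```python
import random

def min_p_filter(probs, min_p=0.1):
    """Keep tokens whose probability is at least min_p times the top probability.

    Returns {token_id: renormalized probability}.
    """
    threshold = min_p * max(probs)  # dynamic: scales with model confidence
    kept = [i for i, q in enumerate(probs) if q >= threshold]
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def sample_min_p(probs, min_p=0.1, rng=random):
    """Sample one token id from the min-p filtered distribution."""
    kept = min_p_filter(probs, min_p)
    ids = list(kept)
    return rng.choices(ids, weights=[kept[i] for i in ids], k=1)[0]

# Confident distribution: the threshold is 0.09, so only token 0 survives.
peaked = min_p_filter([0.9, 0.05, 0.03, 0.02], min_p=0.1)

# Flat distribution: the threshold is only 0.03, so every token survives.
flat = min_p_filter([0.3, 0.25, 0.25, 0.2], min_p=0.1)
```

The two examples show the contrast with a fixed cutoff: the same `min_p` value prunes aggressively when one token dominates and barely prunes when many tokens are plausible.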
Research on min-p decoding:
- Ignacio de Gregorio, Aug 2024, Elevate LLM Performance by 20% Instantly with Min-P, https://medium.com/@ignacio.de.gregorio.noblejas/elevate-llm-performance-by-20-instantly-with-min-p-c961fe1daf3b
- Hugging Face, 2024, Min P style sampling - an alternative to Top P/TopK #27670, https://github.com/huggingface/transformers/issues/27670
- Minh Nguyen, Andrew Baker, Andreas Kirsch, Clement Neo, 1 Jul 2024, Min P Sampling: Balancing Creativity and Coherence at High Temperature, https://arxiv.org/abs/2407.01082
- Joao Gante, May 2024, New sampling strategy dropped in 🤗 transformers -- Min P sampling, Hugging Face, https://huggingface.co/posts/joaogante/319451541682734
Constrained Decoding
Constrained decoding is an optimization of the decoding algorithm where there are extra constraints on the token that can be output. This extra information can be used to either force inclusion of a particular token, or to exclude a subset of the tokens from consideration. Examples where there is extra information to use in decoding include:
- Programming language syntax (code generation)
- Parts-of-speech identification
For example, if you're programming an LLM decoding algorithm to output C++ code, then you know that the token 'if' is almost always followed by a token '(' in the code syntax (setting aside newer forms such as 'if constexpr'). Hence, there's not really any need for a full LLM computation after an 'if' token, and a simple heuristic can be used instead. This idea is using the "constraint" of the language syntax to do "constrained decoding."
Clearly, that heuristic would be much faster, and easily coded. However, it's not all strawberries and cream, because the next token won't have a KV cache for the current token, if we use this heuristic. Hence, the next token would need to do a "mini-prefill" computation to calculate the KV cache, which means there's almost no point in avoiding the current token's computation (i.e., we are simply pushing the current token's computation onto the next token).
However, we've seen this issue of a "missing KV cache" before in early exit or layer skipping optimizations, where the KV cache is missing for any skipped layers (see KV caching). And there are various tricks to avoid fully re-computing the KV cache, such as propagation of the prior one or fusion with another layer. Similar ideas can be used when constrained decoding skips an LLM computation and the next token's KV cache is thereby absent.
Overlapped parallel computation can be used to address the missing KV cache, as is also possible for early exit. The constraints of the language grammar allow the second token's inference to start almost immediately, possibly via a heuristic that does not even involve LLM layer execution. However, the computation of the current token's KV cache can still be completed, in parallel to the next token's decoding cycle, by ensuring that the next token's layers are staggered a little behind the current token's KV cache computation. This overlaps the next token's decoding phase with the current token's KV cache computation.
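The two uses of constraints described in this section, forcing a token and excluding tokens, can be sketched as follows. This is a toy illustration under stated assumptions: the six-token vocabulary, the `allowed_after` grammar rule, and the `compute_logits` callback are all hypothetical, and the sketch deliberately ignores the KV cache issue discussed above.

```python
import math

# Hypothetical toy vocabulary for illustration only.
VOCAB = ["if", "(", "x", ")", "{", "}"]
TOKEN_ID = {tok: i for i, tok in enumerate(VOCAB)}

def allowed_after(prev_token):
    """Toy grammar rule: 'if' must be followed by '('; otherwise anything goes."""
    if prev_token == "if":
        return {TOKEN_ID["("]}
    return set(range(len(VOCAB)))

def mask_logits(logits, allowed_ids):
    """Exclude disallowed tokens by setting their logits to -infinity."""
    return [x if i in allowed_ids else -math.inf for i, x in enumerate(logits)]

def next_token(prev_token, compute_logits):
    """Greedy constrained decoding step.

    compute_logits is a zero-argument callable standing in for the full LLM
    forward pass; it is only invoked when the grammar leaves a real choice.
    """
    allowed = allowed_after(prev_token)
    if len(allowed) == 1:
        # Forced token: the heuristic answers without running the model.
        return VOCAB[next(iter(allowed))]
    masked = mask_logits(compute_logits(), allowed)
    best = max(range(len(masked)), key=lambda i: masked[i])
    return VOCAB[best]
```

When the grammar permits exactly one token, `compute_logits` is never called, which is the shortcut the text describes. In a real engine that skipped step would still leave the KV cache entry for the current token to be filled in somehow.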
Research papers on constrained decoding:
- Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng, 6 Jun 2024 (v2), SGLang: Efficient Execution of Structured Language Model Programs, https://arxiv.org/abs/2312.07104 https://github.com/sgl-project/sglang
- K Ahmed, KW Chang, G Van den Broeck, Oct 2024, Controllable Generation via Locally Constrained Resampling, Neurips Safe Generative AI Workshop 2024, https://openreview.net/pdf?id=v091fzXTu0
- Gaya Mehenni, Amal Zouaq, 23 Nov 2024, Ontology-Constrained Generation of Domain-Specific Clinical Summaries, https://arxiv.org/abs/2411.15666
- Will Kurt, Nov 2024, Say What You Mean: A Response to 'Let Me Speak Freely', https://blog.dottxt.co/say-what-you-mean.html
- Zhi Rui Tam, Cheng-Kuang Wu, Yi-Lin Tsai, Chieh-Yen Lin, Hung-yi Lee, Yun-Nung Chen, 14 Oct 2024 (v3), Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models, https://arxiv.org/abs/2408.02442
- Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Guided open vocabulary image captioning with constrained beam search, 2017, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 936–945, https://arxiv.org/abs/1612.00576
- Chris Hokamp and Qun Liu, 2017, Lexically constrained decoding for sequence generation using grid beam search. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1535–1546, https://arxiv.org/abs/1704.07138
- Yizhe Zhang, Guoyin Wang, Chunyuan Li, Zhe Gan, Chris Brockett, and Bill Dolan. Pointer: Constrained text generation via insertion-based generative pre-training. arXiv preprint arXiv:2005.00558, 2020. https://arxiv.org/abs/2005.00558
- Saibo Geng, Martin Josifoski, Maxime Peyrard, Robert West, 18 Jan 2024 (v6), Grammar-Constrained Decoding for Structured NLP Tasks without Finetuning, https://arxiv.org/abs/2305.13971 https://github.com/epfl-dlab/GCD
- Yanjun Fu, Ethan Baker, Yu Ding, Yizheng Chen, 20 Jul 2024 (v3), Constrained Decoding for Secure Code Generation, https://arxiv.org/abs/2405.00218 https://codeguardplus.github.io/
- Zekun Hao, David W. Romero, Tsung-Yi Lin, Ming-Yu Liu, 12 Dec 2024, Meshtron: High-Fidelity, Artist-Like 3D Mesh Generation at Scale, https://arxiv.org/abs/2412.09548 https://research.nvidia.com/labs/dir/meshtron/ (Optimizations to avoid the quadratic Transformer cost, in both training and inference, include "hourglass neural architecture" analogous to widthwise pruning or slimming, sliding window attention, rolling KV cache, truncated sequence training, and a "robust sampling strategy" that is effectively a type of constrained decoding based on mesh layouts.)
- Xiaoxi Li, Jiajie Jin, Yujia Zhou, Yongkang Wu, Zhonghua Li, Qi Ye, Zhicheng Dou, 16 Dec 2024, RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation, https://arxiv.org/abs/2412.11919 https://github.com/sunnynexus/RetroLLM
- Xiangjue Dong, Maria Teleki, James Caverlee, 18 Dec 2024, A Survey on LLM Inference-Time Self-Improvement, https://arxiv.org/abs/2412.14352 https://github.com/dongxiangjue/Awesome-LLM-Self-Improvement (Broad survey of reasoning improvement methods from multi-step inference to RALM to decoding algorithms.)
- Haoran Wang, Kai Shu, Jan 2025, Make Every Token Count: A Systematic Survey on Decoding Methods for Foundation Models, https://www.researchgate.net/profile/Haoran-Wang-96/publication/387703971_Make_Every_Token_Count_A_Systematic_Survey_on_Decoding_Methods_for_Foundation_Models/links/67784c8ce74ca64e1f49eb15/Make-Every-Token-Count-A-Systematic-Survey-on-Decoding-Methods-for-Foundation-Models.pdf https://github.com/wang2226/Awesome-LLM-Decoding
- Theia Vogel, December 18, 2023, How to make LLMs go fast, https://vgel.me/posts/faster-inference/
- D Banerjee, T Suresh, S Ugare, S Misailovic, G Singh, Mar 2025, Preserving Reasoning Capabilities Under Constrained LLM Generation, https://openreview.net/pdf?id=RX3GIOkGHr
- Changran Xu, Yi Liu, Yunhao Zhou, Shan Huang, Ningyi Xu, Qiang Xu, 18 Mar 2025, Speculative Decoding for Verilog: Speed and Quality, All in One, https://arxiv.org/abs/2503.14153
- Niels Mündler and Jasper Dekoninck and Martin Vechev, 13 Aug 2025, Constrained Decoding of Diffusion LLMs with Context-Free Grammars, https://arxiv.org/abs/2508.10111
- Lingxiao Li, Salar Rahili, Yiwei Zhao, 20 Aug 2025, Correctness-Guaranteed Code Generation via Constrained Decoding, https://arxiv.org/abs/2508.15866
Multi-Token Decoding
Multi-token decoding is an optimization whereby two or more tokens are output in a single decoding step. The idea of multi-token decoding is to train a special type of model so that it predicts not just the next token, but also the one after that (and possibly more). This improves on autoregressive decoding because the output is no longer one-at-a-time.
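To make the control flow concrete, here is a toy sketch of a decoding loop that emits several tokens per model call. The `toy_multi_head_model` function is a hypothetical stand-in for a model trained with extra prediction heads (it simply counts upward), and the loop accepts every predicted token without checking it. Practical systems often add a verification step for the extra tokens, which this sketch omits.

```python
def toy_multi_head_model(tokens, num_heads, vocab_size):
    """Hypothetical stand-in for a multi-token model.

    Head j returns a logit vector for the token at offset j+1 after the
    current sequence. This toy just predicts last+1, last+2, and so on.
    """
    last = tokens[-1]
    all_logits = []
    for j in range(num_heads):
        target = (last + 1 + j) % vocab_size
        all_logits.append([1.0 if t == target else 0.0 for t in range(vocab_size)])
    return all_logits

def argmax(xs):
    """Index of the largest value (greedy choice)."""
    return max(range(len(xs)), key=lambda i: xs[i])

def multi_token_decode(prompt, max_new_tokens, num_heads=2, vocab_size=10):
    """Greedy loop that appends up to num_heads tokens per model call.

    Returns (full token list, number of model calls).
    """
    tokens = list(prompt)
    steps = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        head_logits = toy_multi_head_model(tokens, num_heads, vocab_size)
        steps += 1
        for logits in head_logits:
            if len(tokens) - len(prompt) >= max_new_tokens:
                break
            tokens.append(argmax(logits))
    return tokens, steps

# Six new tokens take three model calls with two heads, versus six with one.
tokens_two_heads, calls_two_heads = multi_token_decode([0], 6, num_heads=2)
tokens_one_head, calls_one_head = multi_token_decode([0], 6, num_heads=1)
```

Both runs produce the same token sequence here because the toy model is deterministic. The difference is in the number of model calls, which is where the speedup would come from.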
- Shikhar Tuli, Chi-Heng Lin, Yen-Chang Hsu, Niraj K. Jha, Yilin Shen, Hongxia Jin, 1 May 2024, DynaMo: Accelerating Language Model Inference with Dynamic Multi-Token Sampling, https://arxiv.org/abs/2405.00888 (A model trained to predict multiple tokens ahead.)
- Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve, 30 Apr 2024, Better & Faster Large Language Models via Multi-token Prediction, https://arxiv.org/abs/2404.19737 Project: https://huggingface.co/facebook/multi-token-prediction
- Michael Nuñez, July 4, 2024, Meta drops AI bombshell: Multi-token prediction models now open for research, https://venturebeat.com/ai/meta-drops-ai-bombshell-multi-token-prediction-models-now-open-for-research/
- Zongyue Qin, Ziniu Hu, Zifan He, Neha Prakriya, Jason Cong, Yizhou Sun, 12 Jul 2024, Multi-Token Joint Speculative Decoding for Accelerating Large Language Model Inference, https://arxiv.org/abs/2407.09722
- Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774, 2024 https://arxiv.org/abs/2401.10774
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Taehyeon Kim, Ananda Theertha Suresh, Kishore Papineni, Michael Riley, Sanjiv Kumar, Adrian Benton, 2024, Exploring and Improving Drafts in Blockwise Parallel Decoding, https://openreview.net/pdf?id=KtnUTS1f91
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
- David Spuler, 25th August, 2024, Hot Inference Optimization Techniques, https://www.aussieai.com/blog/hot-inference-research
- Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Tri Dao, September 11, 2023, Medusa: Simple framework for accelerating LLM generation with multiple decoding heads, https://www.together.ai/blog/medusa
- Wei Zhong, Manasa Bharadwaj, 1 Jun 2024 (v2), S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs, https://arxiv.org/abs/2405.20314
- Desh Raj, Gil Keren, Junteng Jia, Jay Mahadeokar, Ozlem Kalinli, 12 Sep 2024, Faster Speech-LLaMA Inference with Multi-token Prediction, https://arxiv.org/abs/2409.08148
- Zilin Xiao, Hongming Zhang, Tao Ge, Siru Ouyang, Vicente Ordonez, Dong Yu, 8 Oct 2024, ParallelSpec: Parallel Drafter for Efficient Speculative Decoding, https://arxiv.org/abs/2410.05589 (Multi-token prediction in draft models for speculative decoding.)
- Siru Ouyang, Shuohang Wang, Minhao Jiang, Ming Zhong, Donghan Yu, Jiawei Han, Yelong Shen, 14 Oct 2024, Temperature-Centric Investigation of Speculative Decoding with Knowledge Distillation, https://arxiv.org/abs/2410.10141 https://github.com/ozyyshr/TempSpec
- Tan Dat Nguyen, Ji-Hoon Kim, Jeongsoo Choi, Shukjae Choi, Jinseok Park, Younglo Lee, Joon Son Chung, 17 Oct 2024, Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding, https://arxiv.org/abs/2410.13839
- Anonymous Authors, Oct 2024, Optimized Multi-Token Joint Decoding With Auxiliary Model for LLM Inference, https://openreview.net/pdf?id=ZHhBawo3k5
- Pengfei Wu, Jiahao Liu, Zhuocheng Gong, Qifan Wang, Jinpeng Li, Jingang Wang, Xunliang Cai, Dongyan Zhao, 27 Oct 2024, FIRP: Faster LLM inference via future intermediate representation prediction, https://arxiv.org/abs/2410.20488
- DP Ghosh, DA Team, Oct 29, 2024, Multi-Token Prediction with Extended Transformer Layers, https://www.researchgate.net/profile/Debiprasad-Ghosh/publication/385311204_Multi-Token_Prediction_with_Extended_Transformer_Layers/links/671fdd2c55a5271cdee28059/Multi-Token-Prediction-with-Extended-Transformer-Layers.pdf
- Yash Akhauri, Safeen Huda, Mohamed S. Abdelfattah, 26 Nov 2024, Attamba: Attending To Multi-Token States, https://arxiv.org/abs/2411.17685
- Shibaranjani Dasgupta, Chandan Maity, Somdip Mukherjee, Rohan Singh, Diptendu Dutta, Debasish Jana, 14 Dec 2024, HITgram: A Platform for Experimenting with n-gram Language Models, https://arxiv.org/abs/2412.10717
- Y Li, K Livescu, J Zhou, Dec 2024, Beyond Token Generation: Adaptive Chunk-Distilled Language Modeling, 38th Conference on Neural Information Processing Systems (NeurIPS 2024), https://neurips2024-enlsp.github.io/papers/paper_90.pdf (Generate multiple tokens in decoding by inserting RAG chunks directly into the decoding output.)
- Tim Urista, Dec 2024, Dramatically Reduce Inference Costs with DeepSeek-V3: A New Era in Open-Source LLMs, https://ai.gopubby.com/dramatically-reduce-inference-costs-with-deepseek-v3-a-new-era-in-open-source-llms-4f1adf760ee1
- Yanhong Li, Karen Livescu, Jiawei Zhou, 31 Dec 2024, Chunk-Distilled Language Modeling, https://arxiv.org/abs/2501.00343 (Multi-token decoding using retrieval.)
- Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, Zaid Harchaoui, 20 Nov 2024 (v2), From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models, https://arxiv.org/abs/2406.16838
- Minhajul Hoque, Jan 4, 2025, DeepSeek V3: How They Achieved Big Results with Small Compute, https://ai.plainenglish.io/deepseek-v3-how-they-achieved-big-results-with-small-compute-fb694606d59a (DeepSeek optimizations included FP8 quantization with outlier handling, attention and KV cache optimization via Multi-head Latent Attention (MLA), and multi-token decoding.)
- Nandini Lokesh Reddy, Jan 2025, DeepSeek: Bridging Performance and Efficiency in Modern AI, https://medium.com/@nandinilreddy/deepseek-bridging-performance-and-efficiency-in-modern-ai-106181a85693
- Qianhui Zhao, Li Zhang, Fang Liu, Xiaoli Lian, Qiaoyuanhe Meng, Ziqian Jiao, Zetong Zhou, Borui Zhang, Runlin Guo, Jia Li, 24 Feb 2025, CodeSwift: Accelerating LLM Inference for Efficient Code Generation, https://arxiv.org/abs/2502.17139 (Using draft sequences from a datastore of code, to achieve parallel inference, similar to prompt lookup decoding or retrieval lookup decoding.)
- Yunhai Hu, Zining Liu, Zhenyuan Dong, Tianfan Peng, Bradley McDanel, Sai Qian Zhang, 27 Feb 2025, Speculative Decoding and Beyond: An In-Depth Review of Techniques, https://arxiv.org/abs/2502.19732
- Yijiong Yu, 26 Mar 2025, Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence, https://arxiv.org/abs/2503.20533 https://github.com/yuyijiong/parallel-decoding-in-one-sequence
- Chengen Wang, Murat Kantarcioglu, 14 Mar 2025, A Review of DeepSeek Models' Key Innovative Techniques, https://arxiv.org/abs/2503.11486
- L. Xiong et al., May 2025, DeepSeek: Paradigm Shifts and Technical Evolution in Large AI Models, IEEE/CAA Journal of Automatica Sinica, vol. 12, no. 5, pp. 841-858, May 2025, doi: 10.1109/JAS.2025.125495, https://ieeexplore.ieee.org/abstract/document/11005752
- Anastasios Gerontopoulos, Spyros Gidaris, Nikos Komodakis, 15 May 2025, Multi-Token Prediction Needs Registers, https://arxiv.org/abs/2505.10518
- Somesh Mehra, Javier Alonso Garcia, Lukas Mauch, 13 Feb 2025, On multi-token prediction for efficient LLM inference, https://arxiv.org/abs/2502.09419
- Xiaohao Liu, Xiaobo Xia, Weixiang Zhao, Manyi Zhang, Xianzhi Yu, Xiu Su, Shuo Yang, See-Kiong Ng, Tat-Seng Chua, 23 May 2025, L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models, https://arxiv.org/abs/2505.17505
- Stephen Diehl, 2025, Attention Wasn't All We Needed, https://www.stephendiehl.com/posts/post_transformers/
- Anirudhan Badrinath, Prabhat Agarwal, Laksh Bhasin, Jaewon Yang, Jiajing Xu, Charles Rosenberg, 6 Aug 2025, PinRec: Outcome-Conditioned, Multi-Token Generative Retrieval for Industry-Scale Recommendation Systems, https://arxiv.org/abs/2504.10507
Stop Tokens
Stop tokens are one way whereby LLMs can be trained to control the length of their output. The idea is that stop tokens are incorporated at the end of answers during the training phase, and when they occur in an inference phase, they cause the LLM to stop outputting further tokens at that point.
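On the inference side, the check is a small addition to the decoding loop. The sketch below is a minimal illustration with hypothetical names: `next_token_fn` stands in for one full decoding step of a model, and `make_scripted_model` is a toy replacement that replays a fixed list of token ids.

```python
def generate(next_token_fn, prompt, stop_ids, max_new_tokens=50):
    """Decode until a stop token appears or the length limit is reached.

    The stop token itself is not included in the returned output.
    """
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token_fn(tokens)
        if tok in stop_ids:
            break  # the model signalled that its answer is complete
        tokens.append(tok)
    return tokens[len(prompt):]

def make_scripted_model(script, prompt_len):
    """Toy stand-in for an LLM: replays a fixed script of token ids."""
    def next_token_fn(tokens):
        return script[len(tokens) - prompt_len]
    return next_token_fn

EOS = 0  # hypothetical stop token id for this example
prompt = [11, 12]
model = make_scripted_model([5, 7, 9, EOS, 4], prompt_len=len(prompt))

output = generate(model, prompt, stop_ids={EOS})
truncated = generate(model, prompt, stop_ids={EOS}, max_new_tokens=2)
```

The first call stops at the stop token and never emits the trailing token in the script. The second call shows the usual fallback, a hard length limit, which applies when no stop token has appeared yet.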
Research papers with coverage of stop token techniques:
- Louis-François Bouchard, May 10, 2024, How LLMs Know When to Stop Generating? Understand how LLMs like GPT-4 decide when they have answered your question, https://pub.towardsai.net/how-llms-know-when-to-stop-generating-b82a9a57e2c4
- Lianghong Guo, Yanlin Wang, Ensheng Shi, Wanjun Zhong, Hongyu Zhang, Jiachi Chen, Ruikai Zhang, Yuchi Ma, Zibin Zheng, 29 Jul 2024, When to Stop? Towards Efficient Code Generation in LLMs with Excess Token Prevention, https://arxiv.org/abs/2407.20042 Code: https://github.com/DeepSoftwareAnalytics/CodeFast
- Jiaming Li, Lei Zhang, Yunshui Li, Ziqiang Liu, yuelin bai, Run Luo, Longze Chen, Min Yang, 1 Oct 2024 (v2), Ruler: A Model-Agnostic Method to Control Generated Length for Large Language Models, https://arxiv.org/abs/2409.18943 https://github.com/Geaming2002/Ruler
- Bradley Butcher, Michael O'Keefe, James Titchener, 16 Dec 2024, Precise Length Control in Large Language Models, https://arxiv.org/abs/2412.11937
General Research on Decoding Algorithms
Papers on the various decoding methods include:
- S Bae, J Ko, H Song, SY Yun, Oct 2023, Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding, arXiv preprint arXiv:2310.05424, https://arxiv.org/pdf/2310.05424.pdf, Code: https://github.com/raymin0223/fast_robust_early_exit (Combination of early-exit with a "shallow-deep module" and parallel decoding.)
- Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, Richard Socher, 2018, Non-Autoregressive Neural Machine Translation, International Conference on Learning Representations, https://arxiv.org/abs/1711.02281 (Parallel decoding early paper.)
- Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. 2019. Mask-predict: Parallel decoding of conditional masked language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 6111–6120. Association for Computational Linguistics. https://arxiv.org/abs/1904.09324
- Jiatao Gu and Xiang Kong. 2021. Fully non-autoregressive neural machine translation: Tricks of the trade. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 120–133, https://arxiv.org/abs/2012.15833
- Nikolay Savinov, Junyoung Chung, Mikolaj Binkowski, Erich Elsen, and Aaron van den Oord. 2022. Step-unrolled denoising autoencoders for text generation. International Conference on Learning Representations. https://arxiv.org/abs/2112.06749
- Andrea Santilli, Silvio Severino, Emilian Postolache, Valentino Maiorca, Michele Mancusi, Riccardo Marin, and Emanuele Rodolà. May 2023. Accelerating transformer inference for translation via parallel decoding. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 12336–12355. Association for Computational Linguistics. https://arxiv.org/abs/2305.10427
- Y Zhang, Y Zhang, L Cui, G Fu, Oct 2023, Non-autoregressive Text Editing with Copy-aware Latent Alignments, arXiv preprint arXiv:2310.07821, https://arxiv.org/pdf/2310.07821.pdf
- Tri Dao, Daniel Haziza, Francisco Massa, Grigory Sizov, October 13, 2023, Flash-Decoding for long-context inference, PyTorch Blog, https://pytorch.org/blog/flash-decoding/
- Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, Richard Socher, Sep 2019, CTRL: A Conditional Transformer Language Model for Controllable Generation, https://arxiv.org/abs/1909.05858, Code: https://github.com/salesforce/ctrl
- Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, Mar 2022, Training language models to follow instructions with human feedback, https://arxiv.org/abs/2203.02155 (InstructGPT main paper from OpenAI in 2022.)
- Ning Gong, Nianmin Yao, June 2023, A generalized decoding method for neural text generation, Computer Speech & Language, Volume 81, 101503, https://www.sciencedirect.com/science/article/abs/pii/S0885230823000220
- Cohere, 2023, Temperature, https://docs.cohere.com/docs/temperature
- GC Garbacea, 2023, Neural Language Generation for Content Adaptation: Explainable, Efficient Low-Resource Text Simplification and Evaluation, Ph.D. thesis, Computer Science and Engineering, University of Michigan, https://deepblue.lib.umich.edu/bitstream/handle/2027.42/178028/garbacea_1.pdf?sequence=1 (Broad thesis with sections on beam search decoding optimizations and AI safety issues such as bias.)
- Yizhe Zhang, Guoyin Wang, Chunyuan Li, Zhe Gan, Chris Brockett, and Bill Dolan. Pointer: Constrained text generation via insertion-based generative pre-training. arXiv preprint arXiv:2005.00558, 2020. https://arxiv.org/abs/2005.00558
- Bryan Eikema and Wilker Aziz. 2020. Is MAP decoding all you need? The inadequacy of the mode in neural machine translation. Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020, pages 4506–4520. International Committee on Computational Linguistics. https://arxiv.org/abs/2005.10283
- Haoran Yang, Deng Cai, Huayang Li, Wei Bi, Wai Lam, Shuming Shi, May 2023, A Frustratingly Simple Decoding Method for Neural Text Generation, https://arxiv.org/abs/2305.12675
- Clara Meister, Tiago Pimentel, Gian Wiher, and Ryan Cotterell. 2022. Typical decoding for natural language generation. arXiv preprint arXiv:2202.00666, https://arxiv.org/abs/2202.00666 (The "typical sampling" decoding algorithm.)
- Yixuan Su, Tian Lan, Yan Wang, Dani Yogatama, Lingpeng Kong, and Nigel Collier. 2022. A contrastive framework for neural text generation. Advances in Neural Information Processing Systems, https://arxiv.org/abs/2202.06417 (The "contrastive search" decoding algorithm.)
- Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894–6910, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://arxiv.org/abs/2104.08821 (Contrastive learning of sentence embeddings, which is background for "contrastive" methods rather than a decoding algorithm in itself.)
- John Hewitt, Christopher D. Manning, and Percy Liang. 2022. Truncation sampling as language model desmoothing. In Findings of the Conference on Empirical Methods in Natural Language Processing (Findings of EMNLP). https://arxiv.org/abs/2210.15191 (The "truncation sampling" decoding algorithm.)
- Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Luke Zettlemoyer, and Mike Lewis. 2022. Contrastive decoding: Open-ended text generation as optimization. arXiv preprint arXiv:2210.15097, https://arxiv.org/abs/2210.15097 (A "contrastive decoding" algorithm.)
- Ari Holtzman, Jan Buys, Maxwell Forbes, Antoine Bosselut, David Golub, Yejin Choi, 2018, Learning to Write with Cooperative Discriminators, https://arxiv.org/abs/1805.06087
- Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle Pineau, and Laurent Charlin, 2020, Language GANs falling short, International Conference on Learning Representations. https://arxiv.org/abs/1811.02549
- Moin Nadeem, Tianxing He, Kyunghyun Cho, and James Glass, 2020, A systematic characterization of sampling algorithms for open-ended language generation, Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pp. 334–346. https://arxiv.org/abs/2009.07243, Code: https://github.com/moinnadeem/characterizing-sampling-algorithms
- Hugh Zhang, Daniel Duckworth, Daphne Ippolito, and Arvind Neelakantan, 2021, Trading off diversity and quality in natural language generation, EACL 2021, p. 25, https://arxiv.org/abs/2004.10450
- Yunqi Zhu, Xuebing Yang, Yuanyuan Wu, Wensheng Zhang, 22 Mar 2024, Hierarchical Skip Decoding for Efficient Autoregressive Text Generation, https://arxiv.org/abs/2403.14919 (A new decoding algorithm called Hierarchical Skip Decoding involving layer skipping.)
- Yassir Fathullah, Puria Radmard, Adian Liusie, Mark J. F. Gales, 2024, Who Needs Decoders? Efficient Estimation of Sequence-Level Attributes with Proxies, Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics Volume 1: Long Papers, pages 1478–1496 March 17-22, 2024, https://aclanthology.org/2024.eacl-long.89.pdf (Non-autoregressive decoding methods in special use cases such as machine language translation.)
- Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna, 3 Jun 2024, Demystifying Platform Requirements for Diverse LLM Inference Use Cases, https://arxiv.org/abs/2406.01698 Code: https://github.com/abhibambhaniya/GenZ-LLM-Analyzer (Analysis of cost of serving LLMs, including separate profiles of prefill versus decoding phases, and the cost of extra prompt processing in RAG architectures with prepended information.)
- Yechen Xu, Xinhao Kong, Tingjun Chen, Danyang Zhuo, 4 Jun 2024 (v2), Conveyor: Efficient Tool-aware LLM Serving with Tool Partial Execution, https://arxiv.org/abs/2406.00059 Code: https://github.com/conveyor-sys/conveyor (Speeding up inference by partially running tools in parallel with the LLM query processing, rather than sequentially after the LLM request, by detecting tool requests deep inside the decoding algorithm and starting them immediately, before the LLM has finished generating the fully decoded output.)
- Hao (Mark) Chen, Wayne Luk, Ka Fai Cedric Yiu, Rui Li, Konstantin Mishchenko, Stylianos I. Venieris, Hongxiang Fan, 28 May 2024, Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference, https://arxiv.org/abs/2405.18628 Code: https://github.com/hmarkc/parallel-prompt-decoding (Similar to speculative decoding with extra trained prompt tokens and a tree-structured verification of multiple optional draft sequences.)
- Maxime Peyrard, Martin Josifoski, Robert West, 21 Mar 2024, The Era of Semantic Decoding, https://arxiv.org/abs/2403.14562
- Ethan Shen, Alan Fan, Sarah M Pratt, Jae Sung Park, Matthew Wallingford, Sham M. Kakade, Ari Holtzman, Ranjay Krishna, Ali Farhadi, Aditya Kusupati, 28 May 2024, Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass, https://arxiv.org/abs/2405.18400 https://github.com/RAIVNLab/SuperposedDecoding (Generating multiple possible drafts from a single decoding algorithm with one model pass by superimposing embeddings and using top-k decoding.)
- Rya Sanovar, Srikant Bharadwaj, Renee St. Amant, Victor Rühle, Saravan Rajmohan, 17 May 2024, Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers, https://arxiv.org/abs/2405.10480
- Yifu Qiu, Zheng Zhao, Yftah Ziser, Anna Korhonen, Edoardo M. Ponti, Shay B. Cohen, 15 May 2024, Spectral Editing of Activations for Large Language Model Alignment, https://arxiv.org/pdf/2405.09719 Code: https://github.com/yfqiu-nlp/sea-llm
- D Shin, May 8, 2024, Multi-User Language Model Resource Allocation Using Contextual Pause Token Aware Transformers, Technical Disclosure Commons, https://www.tdcommons.org/dpubs_series/6981/ PDF: https://www.tdcommons.org/cgi/viewcontent.cgi?article=8121&context=dpubs_series (Interesting idea of training a model how and when to pause during inference, so it can be pre-empted if needed, and thus the overall system can schedule batching of multiple queries more optimally.)
- Shujian Zhang, Korawat Tanwisuth, Chengyue Gong, Pengcheng He, Mingyuan Zhou, 7 May 2024, Switchable Decision: Dynamic Neural Generation Networks, https://arxiv.org/abs/2405.04513 (Switching and skipping sub-layer components such as attention heads, FFNs, or input token skipping, using decisions made based on allocating computation resources.)
- Shikhar Tuli, Chi-Heng Lin, Yen-Chang Hsu, Niraj K. Jha, Yilin Shen, Hongxia Jin, 1 May 2024, DynaMo: Accelerating Language Model Inference with Dynamic Multi-Token Sampling, https://arxiv.org/abs/2405.00888 (A model trained to predict multiple tokens ahead.)
- Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve, 30 Apr 2024, Better & Faster Large Language Models via Multi-token Prediction, https://arxiv.org/abs/2404.19737
- Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, Mohammad Rastegari, 22 Apr 2024, OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework, Apple Research, https://arxiv.org/abs/2404.14619 Code: https://huggingface.co/apple/OpenELM
- Jared Lichtarge, Christopher Alberti, Shankar Kumar, Noam Shazeer, Niki Parmar, 31 Oct 2018, Weakly Supervised Grammatical Error Correction using Iterative Decoding, https://arxiv.org/abs/1811.01710
- Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan, 20 Jan 2024, Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads, https://arxiv.org/abs/2401.11181 (Separating the prefill and decoding phases for optimization.)
- Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee, 31 Aug 2023, SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills, https://arxiv.org/abs/2308.16369 (Examines the different GPU costs of prefill vs decoding phases, and optimizes decoding by "piggybacking" off the more intense computation during prefill.)
- You Zhou, Xiujing Lin, Xiang Zhang, Maolin Wang, Gangwei Jiang, Huakang Lu, Yupeng Wu, Kai Zhang, Zhe Yang, Kehang Wang, Yongduo Sui, Fengwei Jia, Zuoli Tang, Yao Zhao, Hongxuan Zhang, Tiannuo Yang, Weibo Chen, Yunong Mao, Yi Li, De Bao, Yu Li, Hongrui Liao, Ting Liu, Jingwen Liu, Jinchi Guo, Xiangyu Zhao, Ying Wei, Hong Qian, Qi Liu, Xiang Wang, Wai Kin (Victor) Chan, Chenliang Li, Yusen Li, Shiyu Yang, Jining Yan, Chao Mou, Shuai Han, Wuxia Jin, Guannan Zhang, Xiaodong Zeng, Nov 2023, On the Opportunities of Green Computing: A Survey, https://arxiv.org/abs/2311.00447 (Extensive survey of environmental and green AI issues, along with a survey of various optimization methods to reduce AI resource requirements in training and inference.)
- Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Aashaka Shah, Saeed Maleki, Ricardo Bianchini, 30 Nov 2023, Splitwise: Efficient generative LLM inference using phase splitting, https://arxiv.org/abs/2311.18677 (Separates the two Transformer phases of initial prompt computation or prefill to generate the KV cache, and the token generation phase or decoding algorithm onto two machines.)
- Yao Zhao, Zhitian Xie, Chenyi Zhuang, Jinjie Gu, Jan 2024, Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy, https://arxiv.org/abs/2312.12728 Code: https://github.com/alipay/PainlessInferenceAcceleration
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Yang Song, Chenlin Meng, Renjie Liao, Stefano Ermon, 2021, Accelerating Feedforward Computation via Parallel Nonlinear Equation Solving, Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021, https://proceedings.mlr.press/v139/song21a/song21a.pdf
- Yichao Fu, Peter Bailis, Ion Stoica, Hao Zhang, Nov 21, 2023, Break the Sequential Dependency of LLM Inference Using Lookahead Decoding, https://lmsys.org/blog/2023-11-21-lookahead-decoding/ Code: https://github.com/hao-ai-lab/LookaheadDecoding (Generates tokens in parallel by using Jacobi iteration.)
- N Varshney, A Chatterjee, M Parmar, C Baral, Oct 2023, Accelerating LLM Inference by Enabling Intermediate Layer Decoding, arXiv preprint arXiv:2310.18581, https://arxiv.org/pdf/2310.18581.pdf (Dynamic confidence-based early exiting analysis on LLaMA models.)
- Chufan Shi, Haoran Yang, Deng Cai, Zhisong Zhang, Yifan Wang, Yujiu Yang, Wai Lam, 10 Feb 2024, A Thorough Examination of Decoding Methods in the Era of LLMs, https://arxiv.org/abs/2402.06925 (Evaluates a number of decoding algorithms with several 7B models including Llama2-7B, and also with 4-bit and 8-bit quantization.)
- Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, Dacheng Tao, 5 Feb 2024. A Survey on Transformer Compression. https://arxiv.org/abs/2402.05964 (Model compression survey paper with focus on pruning, quantization, knowledge distillation, and efficient architecture design.)
- Xuanlei Zhao, Bin Jia, Haotian Zhou, Ziming Liu, Shenggan Cheng, Yang You, 2 Mar 2024, HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices, https://arxiv.org/abs/2403.01164
- Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, Ramachandran Ramjee, 4 Mar 2024, Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve, https://arxiv.org/abs/2403.02310 (Faster latency by scheduling of prefill and decoding algorithm phases.)
- C Hooper, S Kim, H Mohammadzadeh, H Genc, Oct 2023, SPEED: Speculative Pipelined Execution for Efficient Decoding, https://arxiv.org/pdf/2310.12072.pdf
- Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Guided open vocabulary image captioning with constrained beam search, 2017, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 936–945, https://arxiv.org/abs/1612.00576
- Chris Hokamp and Qun Liu, 2017, Lexically constrained decoding for sequence generation using grid beam search. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1535–1546, https://arxiv.org/abs/1704.07138
- David Spuler, March 2024, Chapter 26. Decoding Algorithms, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- S Yang, G Lee, J Cho, D Papailiopoulos, 2023, Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding, https://arxiv.org/abs/2307.05908
- Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, kangdi chen, Yuhan Dong, Yu Wang, 2024, FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics, Part of Proceedings of Machine Learning and Systems 6 (MLSys 2024) Conference, PDF: https://proceedings.mlsys.org/paper_files/paper/2024/file/5321b1dabcd2be188d796c21b733e8c7-Paper-Conference.pdf (Next generation of Flash Decoding, with improved asynchronous parallelism of Softmax in both prefill and decoding phases, heuristic dataflow management algorithms, and enhanced GEMM during the decoding phase.)
- kipply's blog, 2023-03-30, Transformer Taxonomy (the last lit review), https://kipp.ly/transformer-taxonomy/ (Papers for all the Transformer architectures and milestone papers for the major optimization improvements on them.)
- Trenton Bricken, November 20, 2019, Tail Free Sampling A new way to sample from language models for text generation, https://www.trentonbricken.com/Tail-Free-Sampling/ (Alternative to top-k/top-p decoding.)
- Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, Zaid Harchaoui, 24 Jun 2024, From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models, https://arxiv.org/abs/2406.16838 (Survey and theoretical analysis of many different decoding algorithms, along with various ways to speed them up such as speculative decoding and KV caches.)
- Mouxiang Chen, Hao Tian, Zhongxin Liu, Xiaoxue Ren, Jianling Sun, 5 Jun 2024 (v2), JumpCoder: Go Beyond Autoregressive Coder via Online Modification, https://arxiv.org/abs/2401.07870 Code: https://github.com/Keytoyze/JumpCoder
- Zexuan Qiu, Zijing Ou, Bin Wu, Jingjing Li, Aiwei Liu, Irwin King, 25 Jun 2024, Entropy-Based Decoding for Retrieval-Augmented Large Language Models, https://arxiv.org/abs/2406.17519 (Enhanced decoding algorithm for multi-document RAG processing.)
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
- Jiaao He, Kezhao Huang, Jidong Zhai, July 2024, FASTDECODE: High-Throughput LLM Serving through Disaggregating Attention Computation, https://openreview.net/pdf?id=GahfuPsGw2 (Distributing KV caches to multiple nodes.)
- Xukun Liu, Bowen Lei, Ruqi Zhang, Dongkuan Xu, 27 Jun 2024, Adaptive Draft-Verification for Efficient Large Language Model Decoding, https://arxiv.org/abs/2407.12021 Project: https://anonymous.4open.science/r/ADED-C7D5 (A draft-and-verification method that is similar to speculative decoding, but differs.)
- Leo Donisch, Sigurd Schacht, Carsten Lanquillon, 6 Aug 2024, Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations, https://arxiv.org/abs/2408.03130
- Yunjia Xi, Hangyu Wang, Bo Chen, Jianghao Lin, Menghui Zhu, Weiwen Liu, Ruiming Tang, Weinan Zhang, Yong Yu, 11 Aug 2024, A Decoding Acceleration Framework for Industrial Deployable LLM-based Recommender Systems, https://arxiv.org/abs/2408.05676 (Determining when speculative decoding is most beneficial.)
- Sidharth Mudgal, Jong Lee, Harish Ganapathy, Yaguang Li, Tao Wang, Yanping Huang, Zhifeng Chen, Heng-Tze Cheng, Michael Collins, Trevor Strohman, Jilin Chen, Alex Beutel, Ahmad Beirami, July 2024, Controlled Decoding from Language Models, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:36486-36503, 2024, https://proceedings.mlr.press/v235/mudgal24a.html
- Wenhong Zhu, Hongkun Hao, Zhiwei He, Yiming Ai, Rui Wang, July 2024, Improving Open-Ended Text Generation via Adaptive Decoding, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:62386-62404, 2024, https://proceedings.mlr.press/v235/zhu24d.html
- Chenhan Yuan, Fei Huang, Ru Peng, Keming Lu, Bowen Yu, Chang Zhou, Jingren Zhou, 20 Aug 2024, Predicting Rewards Alongside Tokens: Non-disruptive Parameter Insertion for Efficient Inference Intervention in Large Language Model, https://arxiv.org/abs/2408.10764 Code: https://github.com/chenhan97/Otter (Inference intervention in the decoding algorithm.)
- Yi Cheng, Xiao Liang, Yeyun Gong, Wen Xiao, Song Wang, Yuji Zhang, Wenjun Hou, Kaishuai Xu, Wenge Liu, Wenjie Li, Jian Jiao, Qi Chen, Peng Cheng, Wayne Xiong, 3 Oct 2024 (v2), Integrative Decoding: Improve Factuality via Implicit Self-consistency, https://arxiv.org/abs/2410.01556 (Prepends a previous response to improve decoding accuracy.)
- Xinyi Zeng, Yuying Shang, Yutao Zhu, Jiawei Chen, Yu Tian, 9 Oct 2024, Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level, https://arxiv.org/abs/2410.06809
- K Ahmed, KW Chang, G Van den Broeck, Oct 2024, Controllable Generation via Locally Constrained Resampling, Neurips Safe Generative AI Workshop 2024, https://openreview.net/pdf?id=v091fzXTu0
- Yuxuan Liu, Wenyuan Li, Laizhong Cui, Hailiang Yang, 17 Oct 2024, Cerberus: Efficient Inference with Adaptive Parallel Decoding and Sequential Knowledge Enhancement, https://arxiv.org/abs/2410.13344
- Rongxiang Wang and Felix Xiaozhu Lin. 2024. Turbocharge Speech Understanding with Pilot Inference. In Proceedings of the 30th Annual International Conference on Mobile Computing and Networking (ACM MobiCom '24). Association for Computing Machinery, New York, NY, USA, 1299–1313. https://doi.org/10.1145/3636534.3690694 https://dl.acm.org/doi/abs/10.1145/3636534.3690694 https://dl.acm.org/doi/pdf/10.1145/3636534.3690694 ("Pilot inference" is a specialized mix of caching, computation reuse, and backtracking in beam search for speech understanding, and is somewhat related to speculative decoding, and similar to continual inference for processing a stream.)
- Yixiong Fang, Ziran Yang, Zhaorun Chen, Zhuokai Zhao, Jiawei Zhou, 9 Dec 2024, From Uncertainty to Trust: Enhancing Reliability in Vision-Language Models with Uncertainty-Guided Dropout Decoding, https://arxiv.org/abs/2412.06474
- Xuezhi Wang, Denny Zhou, 23 May 2024 (v2), Chain-of-Thought Reasoning Without Prompting, https://arxiv.org/abs/2402.10200 ("CoT decoding" is examining the alternative paths in the decoding algorithm, which is somewhat similar to Chain-of-Thought reasoning.)
- Y Li, K Livescu, J Zhou, Dec 2024, Beyond Token Generation: Adaptive Chunk-Distilled Language Modeling, 38th Conference on Neural Information Processing Systems (NeurIPS 2024), https://neurips2024-enlsp.github.io/papers/paper_90.pdf (Generate multiple tokens in decoding by inserting RAG chunks directly into the decoding output.)
- Xiangjue Dong, Maria Teleki, James Caverlee, 18 Dec 2024, A Survey on LLM Inference-Time Self-Improvement, https://arxiv.org/abs/2412.14352 https://github.com/dongxiangjue/Awesome-LLM-Self-Improvement (Broad survey of reasoning improvement methods from multi-step inference to RALM to decoding algorithms.)
- Mehul Damani, Idan Shenfeld, Andi Peng, Andreea Bobu, Jacob Andreas, 7 Oct 2024, Learning How Hard to Think: Input-Adaptive Allocation of LM Computation, https://arxiv.org/abs/2410.04707
- Jianyi Zhang, Da-Cheng Juan, Cyrus Rashtchian, Chun-Sung Ferng, Heinrich Jiang, Yiran Chen, 27 Nov 2024 (v2), SLED: Self Logits Evolution Decoding for Improving Factuality in Large Language Models, https://arxiv.org/abs/2411.02433 https://jayzhang42.github.io/sled_page/ (Decoding algorithm that compares logit values in the final layer with those from earlier layers.)
- Yuval Shalev, Amir Feder, Ariel Goldstein, 19 Jun 2024, Distributional reasoning in LLMs: Parallel reasoning processes in multi-hop reasoning, https://arxiv.org/abs/2406.13858 (Using embeddings from intermediate model layers in decoding to mimic reasoning pathways.)
- Eden Biran, Daniela Gottesman, Sohee Yang, Mor Geva, Amir Globerson, 14 Oct 2024 (v2), Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries, https://arxiv.org/abs/2406.12775 (Backpatching prior layers using embeddings from the current activations to mimic multi-step reasoning.)
- Jacob Pfau, William Merrill, Samuel R. Bowman, 24 Apr 2024, Let's Think Dot by Dot: Hidden Computation in Transformer Language Models, https://arxiv.org/abs/2404.15758 (Use of dummy "filler tokens" similar to "pause tokens" or "reasoning tokens" to aid multi-step reasoning in decoding.)
- Haoran Wang, Kai Shu, Jan 2025, Make Every Token Count: A Systematic Survey on Decoding Methods for Foundation Models, https://www.researchgate.net/profile/Haoran-Wang-96/publication/387703971_Make_Every_Token_Count_A_Systematic_Survey_on_Decoding_Methods_for_Foundation_Models/links/67784c8ce74ca64e1f49eb15/Make-Every-Token-Count-A-Systematic-Survey-on-Decoding-Methods-for-Foundation-Models.pdf https://github.com/wang2226/Awesome-LLM-Decoding
- Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, and Lidia S. Chao. 2019, Learning deep transformer models for machine translation. In Proc. of ACL, 2019. https://arxiv.org/abs/1906.01787
- Xuezhe Ma, Chunting Zhou, Xian Li, Graham Neubig, and Eduard H. Hovy. FlowSeq: Non-autoregressive conditional sequence generation with generative flow. In Proc. of EMNLP, 2019. https://arxiv.org/abs/1909.02480
- Raphael Shu, Jason Lee, Hideki Nakayama, and Kyunghyun Cho. 2020, Latent-variable non-autoregressive neural machine translation with deterministic inference using a delta posterior. In Proc. of AAAI, 2020. https://arxiv.org/abs/1908.07181
- Huan Ma, Jingdong Chen, Guangyu Wang, Changqing Zhang, 1 Feb 2025, Estimating LLM Uncertainty with Logits, https://arxiv.org/abs/2502.00290
- Zeyu Tang, Zhenhao Chen, Loka Li, Xiangchen Song, Yunlong Deng, Yifan Shen, Guangyi Chen, Peter Spirtes, Kun Zhang, 5 Feb 2025, Reflection-Window Decoding: Text Generation with Selective Refinement, https://arxiv.org/abs/2502.03678 (Combination of sliding window attention with pausing.)
- Weihua Du, Yiming Yang, Sean Welleck, 7 Feb 2025, Optimizing Temperature for Language Models with Multi-Sample Inference, https://arxiv.org/abs/2502.05234 https://github.com/StigLidu/TURN
- Jacob Trauger, Ambuj Tewari, 16 May 2025, On Next-Token Prediction in LLMs: How End Goals Determine the Consistency of Decoding Algorithms, https://arxiv.org/abs/2505.11183
- Zhibin Wang, Rui Ning, Chao Fang, Zhonghui Zhang, Xi Lin, Shaobo Ma, Mo Zhou, Xue Li, Zhongfeng Wang, Chengying Huan, Rong Gu, Kun Yang, Guihai Chen, Sheng Zhong, Chen Tian, 23 May 2025, FlashForge: Ultra-Efficient Prefix-Aware Attention for LLM Decoding, https://arxiv.org/abs/2505.17694
- Niels Mündler and Jasper Dekoninck and Martin Vechev, 13 Aug 2025, Constrained Decoding of Diffusion LLMs with Context-Free Grammars, https://arxiv.org/abs/2508.10111
- Haonan Ge, Yiwei Wang, Ming-Hsuan Yang, Yujun Cai, 14 Aug 2025, MRFD: Multi-Region Fusion Decoding with Self-Consistency for Mitigating Hallucinations in LVLMs, https://arxiv.org/abs/2508.10264
- Timon Merk, Saeed Salehi, Richard M. Koehler, Qiming Cui, Maria Olaru, Amelia Hahn, Nicole R. Provenza, Simon Little, Reza Abbasi-Asl, Phil A. Starr, Wolf-Julian Neumann, 13 Aug 2025, Pre-trained Transformer-models using chronic invasive electrophysiology for symptom decoding without patient-individual training, https://arxiv.org/abs/2508.10160
- Keyu Chen, Zhifeng Shen, Daohai Yu, Haoqian Wu, Wei Wen, Jianfeng He, Ruizhi Qiao, Xing Sun, 14 Aug 2025, ASPD: Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs, https://arxiv.org/abs/2508.08895
- Ran Wang, Xiaoxuan Liu, Hao Ren, Gang Chen, Fanchao Qi, Maosong Sun, 22 Jul 2025, WGRAMMAR: Leverage Prior Knowledge to Accelerate Structured Decoding, https://arxiv.org/abs/2507.16768
- Sijin Yu, Zijiao Chen, Wenxuan Wu, Shengxian Chen, Zhongliang Liu, Jingxin Nie, Xiaofen Xing, Xiangmin Xu, Xin Zhang, 22 Jul 2025, From Flat to Round: Redefining Brain Decoding with Surface-Based fMRI and Cortex Structure, https://arxiv.org/abs/2507.16389
- Yuxi Lin and Yaxue Fang and Zehong Zhang and Zhouwu Liu and Siyun Zhong and Fulong Yu, 22 Jul 2025, Decoding Translation-Related Functional Sequences in 5'UTRs Using Interpretable Deep Learning Models, https://arxiv.org/abs/2507.16801
- Arindam Ghosh, Mark Fuhs, Bongjun Kim, Anurag Chowdhury, Monika Woszczyna, 14 Jul 2025, ASR-Guided Speaker-Role Diarization and Diarization-Guided ASR Decoding, https://arxiv.org/abs/2507.17765
- Milad Taghipour, Bane Vasic, 23 Jul 2025, Action-List Reinforcement Learning Syndrome Decoding for Binary Linear Block Codes, https://arxiv.org/abs/2507.17893
- Alex Liu, Lief Esbenshade, Shawon Sarkar, Victor Tian, Zachary Zhang, Kevin He, Min Sun, 23 Jul 2025, Decoding Instructional Dialogue: Human-AI Collaborative Analysis of Teacher Use of AI Tool at Scale, https://arxiv.org/abs/2507.17985
- Anushka Tiwari, Sayantan Pal, Rohini K. Srihari, Kaiyi Ji, 19 Jul 2025, Task-Agnostic Continual Prompt Tuning with Gradient-Based Selection and Decoding, https://arxiv.org/abs/2507.14725
- Xiaojuan Zhang and Tianyu Jiang and Haoxiang Zong and Chen Zhang and Chendan Li and Marta Molinas, 13 Jul 2025, AI-Based Impedance Encoding-Decoding Method for Online Impedance Network Construction of Wind Farms, https://arxiv.org/abs/2507.14187
- Donghoon Kim, Minji Bae, Kyuhong Shim, Byonghyo Shim, 21 Jul 2025, Visually Guided Decoding: Gradient-Free Hard Prompt Inversion with Language Models, https://arxiv.org/abs/2505.08622
- Taeyoun Kwon, Junhyuk Ahn, Taegeun Yun, Heeju Jwa, Yoonchae Choi, Siwon Park, Nam-Joon Kim, Jangchan Kim, Hyun Gon Ryu, and Hyuk-Jae Lee, 9 Aug 2025, Whisfusion: Parallel ASR Decoding via a Diffusion Transformer, https://arxiv.org/abs/2508.07048
- Lilit Grigoryan, Vladimir Bataev, Nikolay Karpov, Andrei Andrusenko, Vitaly Lavrukhin, Boris Ginsburg, 10 Aug 2025, FlexCTC: GPU-powered CTC Beam Decoding with advanced Contextual Abilities, https://arxiv.org/abs/2508.07315
- Hao Yang, Qinghua Zhao, Lei Li, 28 Jul 2025, How Chain-of-Thought Works? Tracing Information Flow from Decoding, Projection, and Activation, https://arxiv.org/abs/2507.20758
- David Ye, Jan Williams, Mars Gao, Stefano Riva, Matteo Tomasetto, David Zoro, J. Nathan Kutz, 28 Jul 2025, PySHRED: A Python package for SHallow REcurrent Decoding for sparse sensing, model reduction and scientific discovery, https://arxiv.org/abs/2507.20954
- Jinzhou Wu, Baoping Tang, Qikang Li, Yi Wang, Cheng Li, Shujian Yu, 28 Jul 2025, When Brain Foundation Model Meets Cauchy-Schwarz Divergence: A New Framework for Cross-Subject Motor Imagery Decoding, https://arxiv.org/abs/2507.21037
- Max Peeperkorn, Tom Kouwenhoven, Dan Brown and Anna Jordanous, 28 Jul 2025, Mind the Gap: Conformative Decoding to Improve Output Diversity of Instruction-Tuned Large Language Models, https://arxiv.org/abs/2507.20956
- Ningyuan Xi, Xiaoyu Wang, Yetao Wu, Teng Chen, Qingqing Gu, Yue Zhao, Jinxian Qu, Zhonglin Jiang, Yong Chen, Luo Ji, 26 Jul 2025, MeTHanol: Modularized Thinking Language Models with Intermediate Layer Thinking, Decoding and Bootstrapping Reasoning, https://arxiv.org/abs/2409.12059
- Vishal Raman, Vijai Aravindh R, 29 Jul 2025, Evo-DKD: Dual-Knowledge Decoding for Autonomous Ontology Evolution in Large Language Models, https://arxiv.org/abs/2507.21438
- Dian Chen, Yansong Qu, Xinyang Li, Ming Li, Shengchuan Zhang, 31 Jul 2025, XSpecMesh: Quality-Preserving Auto-Regressive Mesh Generation Acceleration via Multi-Head Speculative Decoding, https://arxiv.org/abs/2507.23777
- Shukai Gong, Yiyang Fu, Fengyuan Ran, Quyu Kong, Feng Zhou, 31 Jul 2025, TPP-SD: Accelerating Transformer Point Process Sampling with Speculative Decoding, https://arxiv.org/abs/2507.09252
- Songsheng Wang, Rucheng Yu, Zhihang Yuan, Chao Yu, Feng Gao, Yu Wang and Derek F. Wong, 30 Jul 2025, Spec-VLA: Speculative Decoding for Vision-Language-Action Models with Relaxed Acceptance, https://arxiv.org/abs/2507.22424
- Woojae Jeong, Aditya Kommineni, Kleanthis Avramidis, Colin McDaniel, Donald Berry, Myzelle Hughes, Thomas McGee, Elsi Kaiser, Dani Byrd, Assal Habibi, B. Rael Cahn, Idan A. Blank, Kristina Lerman, Dimitrios Pantazis, Sudarsana R. Kadiri, Takfarinas Medani, Shrikanth Narayanan, and Richard M. Leahy, 30 Jul 2025, Decoding Neural Signatures of Semantic Evaluations in Depression and Suicidality, https://arxiv.org/abs/2507.22313
- Ramchalam Kinattinkara Ramakrishnan, Zhaocong Yuan, Shaojie Zhuo, Chen Feng, Yicheng Lin, Chenzheng Su, Xiaopeng Zhang, 31 Jul 2025, OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding, https://arxiv.org/abs/2507.02659
- Manh Nguyen, Sunil Gupta and Hung Le, 4 Aug 2025, CAAD: Context-Aware Adaptive Decoding for Truthful Text Generation, https://arxiv.org/abs/2508.02184
- Yike Zhang and Zhiyuan He and Huiqiang Jiang and Chengruidong Zhang and Yuqing Yang and Jianyong Wang and Lili Qiu, 4 Aug 2025, LeanK: Learnable K Cache Channel Pruning for Efficient Decoding, https://arxiv.org/abs/2508.02215
- Taehan Lee, Hyukjun Lee, 3 Aug 2025, Token Pruning in Audio Transformers: Optimizing Performance and Decoding Patch Importance, https://arxiv.org/abs/2504.01690
- Bolian Li, Yifan Wang, Anamika Lochab, Ananth Grama, Ruqi Zhang, 3 Aug 2025, Cascade Reward Sampling for Efficient Decoding-Time Alignment, https://arxiv.org/abs/2406.16306
- Fatih Gulec, Hamdan Awan, Nigel Wallbridge, Andrew W. Eckford, 5 Aug 2025, Decoding and Engineering the Phytobiome Communication for Smart Agriculture, https://arxiv.org/abs/2508.03584
- Jilong Li, Zhenxi Song, Jiaqi Wang, Meishan Zhang, Honghai Liu, Min Zhang, Zhiguo Zhang, 5 Aug 2025, BrainECHO: Semantic Brain Signal Decoding through Vector-Quantized Spectrogram Reconstruction for Whisper-Enhanced Text Generation, https://arxiv.org/abs/2410.14971
- Md Raisul Kibria, Sébastien Lafond, Janan Arslan, 6 Aug 2025, Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models, https://arxiv.org/abs/2508.04427
- Enyu Zhou, Kai Sheng, Hao Chen, Xin He, 6 Aug 2025, CARD: Cache-Assisted Parallel Speculative Decoding for Efficient Large Language Model Inference, https://arxiv.org/abs/2508.04462
- Shunqi Mao, Chaoyi Zhang, Weidong Cai, 6 Aug 2025, Through the Magnifying Glass: Adaptive Perception Magnification for Hallucination-Free VLM Decoding, https://arxiv.org/abs/2503.10183
- Kang Liu and Zhuoqi Ma and Zikang Fang and Yunan Li and Kun Xie and Qiguang Miao, 7 Aug 2025, PriorRG: Prior-Guided Contrastive Pre-training and Coarse-to-Fine Decoding for Chest X-ray Report Generation, https://arxiv.org/abs/2508.05353
- Jungbin Cho, Junwan Kim, Jisoo Kim, Minseo Kim, Mingu Kang, Sungeun Hong, Tae-Hyun Oh, Youngjae Yu, 7 Aug 2025, DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding, https://arxiv.org/abs/2411.19527
- Hossein Entezari Zarch, Lei Gao, Chaoyi Jiang, Murali Annavaram, 7 Aug 2025, DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding, https://arxiv.org/abs/2504.05598
- Woojeong Kim, Junxiong Wang, Jing Nathan Yan, Mohamed Abdelfattah, Alexander M. Rush, 11 Aug 2025, OverFill: Two-Stage Models for Efficient Language Model Decoding, https://arxiv.org/abs/2508.08446
- Ziqi Wang, Hailiang Zhao, Cheng Bao, Wenzhuo Qian, Yuhao Yang, Xueqiang Sun, Shuiguang Deng, 1 Aug 2025, XFMNet: Decoding Cross-Site and Nonstationary Water Patterns via Stepwise Multimodal Fusion for Long-Term Water Quality Forecasting, https://arxiv.org/abs/2508.08279
- Lingzhe Zhang, Liancheng Fang, Chiming Duan, Minghua He, Leyi Pan, Pei Xiao, Shiyu Huang, Yunpeng Zhai, Xuming Hu, Philip S. Yu, Aiwei Liu, 12 Aug 2025, A Survey on Parallel Text Generation: From Parallel Decoding to Diffusion Language Models, https://arxiv.org/abs/2508.08712
- Xingyou Song, Dara Bahri, 12 Aug 2025, Decoding-based Regression, https://arxiv.org/abs/2501.19383
- Qiaoqiao Ren, Remko Proesmans, Yuanbo Hou, Francis wyffels, and Tony Belpaeme, 12 Aug 2025, Touch and Tell: Multimodal Decoding of Human Emotions and Social Gestures for Robots, https://arxiv.org/abs/2412.03300
- Changhong Jing, Yan Liu, Shuqiang Wang, Bruce X.B. Yu, Gong Chen, Zhejing Hu, Zhi Zhang, Yanyan Shen, 15 Aug 2025, PTSM: Physiology-aware and Task-invariant Spatio-temporal Modeling for Cross-Subject EEG Decoding, https://arxiv.org/abs/2508.11357
- Oscar Mañas, Pierluca D'Oro, Koustuv Sinha, Adriana Romero-Soriano, Michal Drozdzal, Aishwarya Agrawal, 15 Aug 2025, Controlling Multimodal LLMs via Reward-guided Decoding, https://arxiv.org/abs/2508.11616
- Pengcheng Huang, Shuhao Liu, Zhenghao Liu, Yukun Yan, Shuo Wang, Zulong Chen, Tong Xiao, 18 Aug 2025, PC-Sampler: Position-Aware Calibration of Decoding Bias in Masked Diffusion Models, https://arxiv.org/abs/2508.13021
- Jihoon Park, Seungeun Oh, and Seong-Lyun Kim, 18 Aug 2025, Energy-Efficient Wireless LLM Inference via Uncertainty and Importance-Aware Speculative Decoding, https://arxiv.org/abs/2508.12590
- Yuanhao Li, Badong Chen, Wenjun Bai, Yasuharu Koike, Okito Yamashita, 5 Aug 2025, Robust Sparse Bayesian Learning Based on Minimum Error Entropy for Noisy High-Dimensional Brain Activity Decoding, https://arxiv.org/abs/2508.11657
- Dylan Cope, Peter McBurney, 18 Aug 2025, Decoding Communications with Partial Information, https://arxiv.org/abs/2508.13326
- Oriana Presacan, Alireza Nik, Vajira Thambawita, Bogdan Ionescu, Michael Riegler, 19 Aug 2025, A Comparative Study of Decoding Strategies in Medical Text Generation, https://arxiv.org/abs/2508.13580
- Sanggeon Yun, Raheeb Hassan, Ryozo Masukawa, Mohsen Imani, 20 Aug 2025, MissionHD: Data-Driven Refinement of Reasoning Graph Structure through Hyperdimensional Causal Path Encoding and Decoding, https://arxiv.org/abs/2508.14746
- Majid Daliri, Christopher Musco, Ananda Theertha Suresh, 20 Aug 2025, Coupling without Communication and Drafter-Invariant Speculative Decoding, https://arxiv.org/abs/2408.07978
- Julian Oestreich and Lydia Müller, 21 Aug 2025, Evaluating Structured Decoding for Text-to-Table Generation: Evidence from Three Datasets, https://arxiv.org/abs/2508.15910
- Yicheng Ji, Jun Zhang, Heming Xia, Jinpeng Chen, Lidan Shou, Gang Chen, Huan Li, 22 Aug 2025, SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning, https://arxiv.org/abs/2508.16201
- Lingxiao Li, Salar Rahili, Yiwei Zhao, 20 Aug 2025, Correctness-Guaranteed Code Generation via Constrained Decoding, https://arxiv.org/abs/2508.15866
- Jungyoub Cha, Hyunjong Kim, Sungzoon Cho, 22 Aug 2025, SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences, https://arxiv.org/abs/2505.20776
- Xuekang Wang, Shengyu Zhu, Xueqi Cheng, 25 Aug 2025, Speculative Safety-Aware Decoding, https://arxiv.org/abs/2508.17739
- Jaydip Sen, Subhasis Dasgupta, Hetvi Waghela, 21 Aug 2025, Confidence-Modulated Speculative Decoding for Large Language Models, https://arxiv.org/abs/2508.15371
- Abdul Rehman Akbar, Usama Sajjad, Ziyu Su, Wencheng Li, Fei Xing, Jimmy Ruiz, Wei Chen, Muhammad Khalid Khan Niazi, 22 Aug 2025, CellEcoNet: Decoding the Cellular Language of Pathology with Deep Learning for Invasive Lung Adenocarcinoma Recurrence Prediction, https://arxiv.org/abs/2508.16742
- Ziyin Zhang and Jiahao Xu and Tian Liang and Xingyu Chen and Zhiwei He and Rui Wang and Zhaopeng Tu, 24 Aug 2025, Draft Model Knows When to Stop: Self-Verification Speculative Decoding for Long-Form Generation, https://arxiv.org/abs/2411.18462
More Research on Decoding Algorithms
- Decoding algorithms (overview)
— Non-autoregressive decoding
— Greedy decoding
— Top-k decoding
— Top-p decoding
— Min-P Sampling
— Flash decoding
— Beam search decoding
— Edit decoding
— Contrastive decoding
— Constrained decoding
- Parallel decoding (overview)
— Blockwise parallel decoding
— n-gram parallel decoding
— Lookahead decoding
— Medusa decoding
— Consensus decoding
- Speculative decoding (overview)
— Generalized speculative decoding
— Aggressive decoding
— Lookup decoding
— Retrieval lookup decoding
— Prompt lookup decoding
— Self speculative decoding
— Tree speculative decoding
— Superposed decoding
— Hierarchical speculative decoding
— Heuristic speculative decoding
— Multi-token speculative decoding
— Sequential speculative decoding
More AI Research
Read more about: