
11. Reading & Writing

  • Book Excerpt from "The Sweetest Lesson: Your Brain vs AI"
  • by David Spuler, Ph.D.


“If you don’t have time to read,

you don’t have the time to write.”

— Stephen King.

AI Reading and Writing

Humans and AI engines share a strange commonality in how they communicate. For a human with the ability to both read and write, there are two somewhat related facts of life:

  • It’s easier to read than to write.
  • It’s easier to understand a foreign language than to speak it.

Weirdly, this turns out to also be true for AI engines. The coders for AI engines have done a lot of work in both areas, but reading remains easier than writing.

In the olden days, which means 2017 in AI, the LLM engines based on the “Transformer” architecture (invented at Google) had two explicit components, called the “encoder” and the “decoder.” The idea was basically:

  • Encoders — reading
  • Decoders — writing

The encoder would take its input text and “encode” it into internal vectors, which it used to “understand” the text (i.e., reading). The decoder would then use those internal numbers to emit new numbers representing the words it was outputting (i.e., writing).
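To make that idea concrete, here is a rough sketch of the old two-part flow, using PyTorch’s built-in Transformer module purely for illustration; the sizes and the random input vectors are made up, and a real model would add tokenization, embeddings, and an output layer on top.

    import torch
    import torch.nn as nn

    # A 2017-style encoder-decoder Transformer (tiny, illustrative sizes).
    model = nn.Transformer(d_model=64, nhead=4,
                           num_encoder_layers=2, num_decoder_layers=2,
                           batch_first=True)

    src = torch.randn(1, 10, 64)   # "reading": 10 input word vectors
    tgt = torch.randn(1, 3, 64)    # "writing": the 3 output word vectors so far

    memory = model.encoder(src)       # encode the input into internal vectors
    out = model.decoder(tgt, memory)  # decode those vectors into output vectors
    print(out.shape)                  # torch.Size([1, 3, 64])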

However, that combined encoder-decoder architecture didn’t last long, and the encoder was found to be mostly redundant, since it was doing computations very similar to the decoder’s, only it wasn’t actually decoding anything. Confused? You’re like most AI engineers, if so. Anyway, you could do both the reading and writing inside the decoder, using a trick called the “prefill” phase.

No need for half the brain.

Hence, the encoder was removed from AI engines and only the decoder was left. This was called a “decoder-only” architecture and was used in the early GPT models, such as GPT-2. The newer idea is effectively:

  • Prefill — reading
  • Decoding — writing

Prefill is an algorithm that runs in the decoder, so these two phases run on the same GPU chips (well, not always, but let’s assume that for now). The decoder in the newer decoder-only architecture is similar to the one in the early encoder-decoder version, but obviously doesn’t need the parts that used to accept data from the encoder (which no longer exists).
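As a rough sketch of those two phases, here is what prefill-then-decode looks like with GPT-2 via the Hugging Face transformers library (greedy decoding only, no error handling); the cache being passed around is the “KV cache” covered in the next section.

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    prompt_ids = tok("The capital of France is", return_tensors="pt").input_ids

    with torch.no_grad():
        # Prefill ("reading"): one parallel pass over all the prompt tokens,
        # which fills the internal cache (past_key_values).
        out = model(prompt_ids, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

        # Decoding ("writing"): emit one token at a time, reusing the cache.
        new_ids = [next_id]
        for _ in range(5):
            out = model(next_id, past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
            new_ids.append(next_id)

    print(tok.decode(torch.cat(new_ids, dim=-1)[0]))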

Have we lost something by not having an encoder? It’s certainly possible that some insight is missed by treating reading like it’s a subset of writing, rather than trying to fully understand first. However, we gained: speed.

We need a lot more speed to do advanced data processing, such as voice or video analysis. It’s all just data to an AI engine, but there’s much more data in audio and video. The way it works for speech models:

  • Understanding speech — “reading” of voice numbers.
  • Speaking — “writing” of audio data (i.e., decoding).

Again, there’s not usually an encoder for voice models. But don’t worry, because the poor old encoder component didn’t get completely thrown into landfill. Encoder-decoder architectures are still used for some two-phase type operations like foreign language translation, where we’re reading in one language and writing in another. In these cases, the reading is very different from the writing, so having two separate components works well.

Prefill for Reading

The way that prefill works doesn’t really make sense compared to human reading. The AI engine looks at every word in its input prompt and processes them all at once, in parallel. The reason it can do this is that this phase does not output anything, and is really a preparatory phase before writing.

What’s it prefilling?

The reason it’s called “prefill” is that it fills in the “KV caches” for the input tokens (e.g., the user’s question). These are extra numbers used by the second phase, the decoding phase, to figure out which new words to output.
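Here is a toy illustration of what gets “prefilled,” with made-up sizes and random weights standing in for the real learned ones: every prompt token’s key (K) and value (V) vectors are computed in one parallel pass and saved for later.

    import torch

    d = 64                         # illustrative embedding size
    W_k = torch.randn(d, d)        # key projection weights (random stand-ins)
    W_v = torch.randn(d, d)        # value projection weights (random stand-ins)

    prompt = torch.randn(10, d)    # 10 input tokens as vectors

    # One matrix multiply per projection covers all 10 tokens at once;
    # the results are stored as the "KV cache" for the decoding phase to reuse.
    kv_cache = {"K": prompt @ W_k, "V": prompt @ W_v}
    print(kv_cache["K"].shape, kv_cache["V"].shape)   # both torch.Size([10, 64])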

Humans can actually do this type of parallel reading that AI does in prefill, but it’s not the natural way that we read. If you train yourself, you can actually start reading a novel by scanning multiple lines at once, across the page, like you are ingesting groups of words as they cross your visual field. It’s hard to learn, but it can be done (by humans).

GPU chips are naturally parallel, and AI engines can read all the words of a document in parallel without even blinking. The LLM ingests all of the words at once, and then cross-analyzes them to find the relationships between all of the words at their different positions. This is all done in parallel in the prefill phase, analyzing all of the inputs before it starts emitting words in the decoding phase.

Decoding for Writing

If you read about LLM theory, you will discover that the way an LLM writes text in the “decoding” phase is to output one word at a time. It only looks backwards at the previous words, figures out the probabilities for each of the possible next words, and then chooses the best one. The decoder is actually “masked” so that it cannot look ahead (not that there’s anything up ahead to see), and it only looks backwards, never backtracking.
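In code, that textbook “masking” is just a triangular mask over the attention scores, so that each word can only attend to itself and earlier words. A toy version, with random numbers standing in for real scores:

    import torch

    n = 5                                    # 5 words written so far
    scores = torch.randn(n, n)               # raw attention scores (random here)
    mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))   # block any looking ahead
    weights = torch.softmax(scores, dim=-1)
    print(weights[0])   # the first word can only attend to itself: [1, 0, 0, 0, 0]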

This is total baloney.

It may have been true in the very earliest theories of neural networks, but modern LLMs do a whole smash of other stuff in their decoding algorithm. I mean, you can’t write a good piece of prose based only on past words, especially if you only ever look one word ahead. Humans do writing by thinking forward many words, revising, going back and changing, and so on. Hence, in reality, your LLM does a lot of coding tricks to decide on the best words to write:

  • Trying multiple possible word sequences (in parallel).
  • Backtracking and restarting phrases and sections.

There are some fancy names for some of these algorithms, which mostly run in parallel on GPU chips:

  • Beam search
  • Speculative decoding
  • Multi-Token Prediction (MTP)

Beam search keeps several candidate word sequences alive at once and drops the weaker ones as it goes, which is a kind of built-in backtracking. Speculative decoding drafts a few words ahead with a smaller, cheaper model, then checks them all in parallel with the full model, backtracking over whatever the draft got wrong. Multi-token prediction is about emitting more than one word at a time, and is usually combined with beam search or spec dec (that’s short for speculative decoding).
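To give a flavor of the first of these, here is a toy beam search sketch, assuming a hypothetical score_next() function that returns log-probabilities for possible next words; a real engine scores whole batches of beams in parallel on the GPU.

    import math

    def beam_search(score_next, start, width=2, steps=3):
        # Each beam is (total log-probability, list of words so far).
        beams = [(0.0, [start])]
        for _ in range(steps):
            candidates = []
            for logp, words in beams:
                for word, wlogp in score_next(words).items():
                    candidates.append((logp + wlogp, words + [word]))
            # Keep only the best `width` sequences; the rest are abandoned,
            # which is where the "backtracking" effectively happens.
            beams = sorted(candidates, reverse=True)[:width]
        return beams

    # Tiny fake scorer, purely for illustration.
    fake_scores = lambda words: {"cat": math.log(0.6), "dog": math.log(0.4)}
    print(beam_search(fake_scores, "the"))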

Humans struggle to do this, although there are probably some people who can speak more than one word at a time in parallel. Strangely, an image of Elon Musk just popped into my subconscious brain. But I digress.

All of these decoding methods are done in parallel in the “forward pass” of the decoding algorithm. This is like the “fast thinking” mode of writing, and it is not really doing advanced rational logic or any higher-level analysis of the words. In some of the fancier models, there’s also a second “slow” method of revising the output, whereby the LLM can make changes to what it’s written, or possibly have an even bigger model check over its work. That’s like your English teacher peeking over your shoulder while you compose your persuasive writing essay.

Reading Limitations

Reading doesn’t mean understanding. The LLM reads the words as “tokens” that represent the text. There are also tokens for images, audio, and video, which are more complicated, but they use the same basic architecture. A voice LLM sees your speech patterns as just numbers, too.

Words are actually a higher-level abstraction than the numbers in voice or images, so they should be easier to interpret. However, the problem with reading words is that all the LLM knows about is words. The meaning of words is often obscured, and there are problems with simple things like how numbers get represented as word-like tokens, which makes it hard to do basic arithmetic.
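You can see the number problem directly by asking a tokenizer how it splits some arithmetic; this sketch assumes the Hugging Face transformers library is available, and the exact splits vary between tokenizers.

    from transformers import GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    print(tok.tokenize("12345 + 67890"))
    # The digits come out as arbitrary multi-digit chunks rather than one token
    # per number (or per digit), which is part of why arithmetic is awkward.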

Understanding words also involves a lot more than just text patterns. There are hidden meanings of all sorts, following rules that humans know or have learned. For example, how do you know that it’s a joke that we “drive on parkways and park on driveways”? LLMs are bad at jokes or sarcasm without special training.

Common sense is also hard to explain in words. It’s like there’s a mapping between pairs or groups of words. For example, cats are “svelte” but dogs are not, and we “sit on sofas, not tables,” and so on, but there are all manner of exceptions to those so-called rules.

Our world is difficult to describe in words, because it’s three-dimensional. Babies learn that if they try to crawl under a desk, they might smack their head, because one failure gives them a powerful learning signal involving lots of noise and cuddles. How does an AI even know what three dimensions are? It’s not in the word patterns.

The fourth dimension, time, is also tricky. Food appears on a plate and then disappears, not the other way around, and yet both ways it would have the same words. They call this “temporal reasoning” and, since it has a fancy name, that means it’s a research area with lots of difficulty and plenty of obscure research papers. AI models are not good with time. In contrast, a human child moves through time intrinsically, and comes to understand that idea at a very deep level.

References

Prefill Phase. Research papers on “prefill” (i.e., reading), most of which relate to performance improvements in the prefill phase:

  1. Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, Jeff Dean, 9 Nov 2022, Efficiently Scaling Transformer Inference, https://arxiv.org/abs/2211.05102 (The paper that seems to have coined the term “prefill” and examines some aspects of prefill vs decoding phases in optimization.)
  2. Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang, 19 Mar 2024 (v2), DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving, https://arxiv.org/abs/2401.09670 (Optimizing LLMs differently in the prefill and decoding phases.)
  3. Juntao Zhao, Borui Wan, Yanghua Peng, Haibin Lin, Chuan Wu, 2 Mar 2024, LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization, https://arxiv.org/abs/2403.01136 (Deployment of LLMs on heterogenous GPUs and also differences between the two phases of decoder-only Transformers: prefill and decoding computations.)
  4. Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan, 20 Jan 2024, Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads, https://arxiv.org/abs/2401.11181 (Separating the prefill and decoding phases for optimization.)
  5. Siyan Zhao, Daniel Israel, Guy Van den Broeck, Aditya Grover, 15 Apr 2024, Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models, https://arxiv.org/abs/2404.09529 Code: https://github.com/siyan-zhao/prepacking (Optimizes prefill KV cache computations by batching multiple query prefill phases together via packing, since prefill token sequence lengths are fully known, and further combined with simple modifications to positional encoding and masking to avoid cross-query attention.)
  6. VLLM, 2024, Performance and Tuning: Chunked Prefill, https://docs.vllm.ai/en/v0.4.2/models/performance.html
  7. Schwinn Saereesitthipitak, Ashish Rao, Cathy Zhou, William Li, 2024, Prophet: An LLM Inference Engine Optimized For Head-of-Line Blocking, https://www.scs.stanford.edu/24sp-cs244b/projects/Prophet_An_LLM_Inference_Engine_Optimized_For_Head_of_Line_Blocking.pdf (Faster inference serving via iterative scheduling, separating prefill and decoding phase computations for batching, using priority-based schedulers with preemption, and controlling transfer of KV caches from prefill to decoders.)
  8. Daliang Xu, Hao Zhang, Liming Yang, Ruiqi Liu, Mengwei Xu, and Xuanzhe Liu, 11 June 2024, WiP: Efficient LLM Prefilling with Mobile NPU, EdgeFM '24: Proceedings of the Workshop on Edge and Mobile Foundation Models, June 2024, Pages 33 - 35, https://doi.org/10.1145/3662006.3662066 https://dl.acm.org/doi/abs/10.1145/3662006.3662066 PDF: https://dl.acm.org/doi/pdf/10.1145/3662006.3662066 (Faster NPU prefill via chunked prefilling using sequences of tokens, along with INT8 NPU quantization that is aware of outliers and offloads FP32 calculations from NPU back to CPU.)
  9. Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan, 26 Jun 2024 (v2), MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool, https://arxiv.org/abs/2406.17565 (Combined session-based prefix KV caching with disaggregation of prefill and decoding phases.)
  10. Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu, 2 Jul 2024 (v2), Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving, https://arxiv.org/abs/2407.00079 Code: https://github.com/kvcache-ai/Mooncake (Disaggregates prefill and decoding phases for scheduling, with chunked prefill, while managing the KV cache.)
  11. Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu, 2 Jul 2024, MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention, https://arxiv.org/abs/2407.02490 Code: https://aka.ms/MInference
  12. Daliang Xu, Hao Zhang, Liming Yang, Ruiqi Liu, Gang Huang, Mengwei Xu, Xuanzhe Liu, 8 Jul 2024, Empowering 1000 tokens/second on-device LLM prefilling with mllm-NPU, https://arxiv.org/abs/2407.05858
  13. Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari, 17 Jul 2024, LLM Inference Serving: Survey of Recent Advances and Opportunities, https://arxiv.org/abs/2407.12391
  14. Runheng Liu, Xingchen Xiao, Heyan Huang, Zewen Chi, Zhijing Wu, 7 May 2024, FlashBack: Efficient Retrieval-Augmented Language Modeling for Long Context Inference, https://arxiv.org/abs/2405.04065 (Optimize RAG by appending rather than prepending documents, and modifying the attention for improvements in KV caching, by shimming or replacing some of the CUDA GPU low-level memory management APIs to avoid the need to rewrite kernels with extra higher-level memory management code.)
  15. Haifeng Qian, Sujan Kumar Gonugondla, Sungsoo Ha, Mingyue Shang, Sanjay Krishna Gouda, Ramesh Nallapati, Sudipta Sengupta, Xiaofei Ma, Anoop Deoras, 24 Apr 2024, BASS: Batched Attention-optimized Speculative Sampling, https://arxiv.org/abs/2404.15778 (Optimizes batched multi-query use of speculative decoding with consideration of GPU utilization in prefill and decoding phases.)
  16. Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, Yipeng Ma, Yicheng Shan, Zhengfan Yuan, Cailong Li, Yajing Sun, Tiandeng Wu, Xing Chu, Ruizhi Huan, Li Ma, Xiao You, Wenting Zhou, Yunpeng Ye, Wen Liu, Xiangkun Xu, Yongsheng Zhang, Tiantian Dong, Jiawei Zhu, Zhe Wang, Xijian Ju, Jianxun Song, Haoliang Cheng, Xiaojing Li, Jiandong Ding, Hefei Guo, Zhengyong Zhang, 15 Aug 2024, P/D-Serve: Serving Disaggregated Large Language Model at Scale, https://arxiv.org/abs/2408.08147 (Comprehensive serving system addressing disaggregated prefill and KV cache transfer with RDMA.)
  17. Junlin Lv, Yuan Feng, Xike Xie, Xin Jia, Qirong Peng, Guiming Xie, 19 Sep 2024, CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs, https://arxiv.org/abs/2409.12490
  18. Shuowei Jin, Xueshen Liu, Qingzhao Zhang, Z. Morley Mao, 4 Oct 2024, Compute Or Load KV Cache? Why Not Both? https://arxiv.org/abs/2410.03065
  19. Aurick Qiao, Zhewei Yao, Samyam Rajbhandari, Yuxiong He, 4 Oct 2024, SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation, https://arxiv.org/abs/2410.03960
  20. Maxwell Horton, Qingqing Cao, Chenfan Sun, Yanzi Jin, Sachin Mehta, Mohammad Rastegari, Moin Nabi, 10 Oct 2024, KV Prediction for Improved Time to First Token, https://arxiv.org/abs/2410.08391 https://github.com/apple/corenet/tree/main/projects/kv-prediction (Small model creates an approximation of the KV cache for use by a larger model.)
  21. Amy Yang, Jingyi Yang, Aya Ibrahim, Xinfeng Xie, Bangsheng Tang, Grigory Sizov, Jongsoo Park, Jianyu Huang, 4 Nov 2024, Context Parallelism for Scalable Million-Token Inference, https://arxiv.org/abs/2411.01783
  22. Gursimran Singh, Xinglu Wang, Ivan Hu, Timothy Yu, Linzi Xing, Wei Jiang, Zhefeng Wang, Xiaolong Bai, Yi Li, Ying Xiong, Yong Zhang, Zhenan Fan, 25 Dec 2024, Efficiently serving large multimedia models using EPD Disaggregation, https://arxiv.org/abs/2501.05460 (Disaggregation of three steps: encoding, prefill, and decoding.)
  23. Daliang Xu, Hao Zhang, Liming Yang, Ruiqi Liu, Gang Huang, Mengwei Xu, and Xuanzhe Liu, 2025, Fast On-device LLM Inference with NPUs, Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (ASPLOS '25), Association for Computing Machinery, New York, NY, USA, 445–462, https://doi.org/10.1145/3669940.3707239 https://dl.acm.org/doi/abs/10.1145/3669940.3707239 (Offloading chunked prefill computations to NPUs.)
  24. Xunhao Lai, Jianqiao Lu, Yao Luo, Yiyuan Ma, Xun Zhou, 28 Feb 2025, FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference, https://arxiv.org/abs/2502.20766 (Prefill optimization that dynamically applies different attention patterns, including sparse attention, for KV computations, based on the input query.)
  25. Yunkai Liang, Zhangyu Chen, Pengfei Zuo, Zhi Zhou, Xu Chen, Zhou Yu, 26 Mar 2025, Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation, https://arxiv.org/abs/2503.20552
  26. Abhimanyu Rajeshkumar Bambhaniya, Hanjiang Wu, Suvinay Subramanian, Sudarshan Srinivasan, Souvik Kundu, Amir Yazdanbakhsh, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna, April 2025, Understanding and Optimizing Multi-Stage AI Inference Pipelines, https://arxiv.org/abs/2504.09775 https://ui.adsabs.harvard.edu/abs/2025arXiv250409775R/abstract
  27. Kuntai Du, Bowen Wang, Chen Zhang, Yiming Cheng, Qing Lan, Hejian Sang, Yihua Cheng, Jiayi Yao, Xiaoxuan Liu, Yifan Qiao, Ion Stoica, Junchen Jiang, 12 May 2025, PrefillOnly: An Inference Engine for Prefill-only Workloads in Large Language Model Applications, https://arxiv.org/abs/2505.07203

Beam Search. Research papers on beam search decoding algorithms, which track multiple candidate word sequences and prune the weaker ones as they go, include:

  1. Xiaohui Wang, Ying Xiong, Yang Wei, Mingxuan Wang, Lei Li, Apr 2021, LightSeq: A High Performance Inference Library for Transformers, https://arxiv.org/pdf/2010.13887.pdf
  2. James Briggs, Feb 25, 2021, The Three Decoding Methods For NLP, Towards Data Science, https://towardsdatascience.com/the-three-decoding-methods-for-nlp-23ca59cb1e9d
  3. GC Garbacea, 2023, Neural Language Generation for Content Adaptation: Explainable, Efficient Low-Resource Text Simplification and Evaluation, Ph.D. thesis, Computer Science and Engineering, University of Michigan, https://deepblue.lib.umich.edu/bitstream/handle/2027.42/178028/garbacea_1.pdf?sequence=1 (Broad thesis with sections on beam search decoding optimizations and AI safety issues such as bias.)
  4. Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould, 2017, Guided open vocabulary image captioning with constrained beam search, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 936–945, https://arxiv.org/abs/1612.00576
  5. Chris Hokamp and Qun Liu, 2017, Lexically constrained decoding for sequence generation using grid beam search, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1535–1546, https://arxiv.org/abs/1704.07138
  6. G Keren, Feb 2023, A Token-Wise Beam Search Algorithm for RNN-T, arXiv preprint arXiv:2302.14357, https://arxiv.org/abs/2302.14357
  7. Gian Wiher, Clara Meister, Ryan Cotterell, Mar 2022, On Decoding Strategies for Neural Text Generators, https://arxiv.org/abs/2203.15721 (An evaluation of a variety of decoding algorithms including beam search, top-k, and top-p.)
  8. Ashwin K. Vijayakumar, Michael Cogswell, Ramprasaath R. Selvaraju, Qing Sun, Stefan Lee, David J. Crandall, and Dhruv Batra, 2016, Diverse beam search: Decoding diverse solutions from neural sequence models, CoRR, abs/1610.02424, https://arxiv.org/abs/1610.02424 (An algorithm variant called “diverse beam search” decoding.)
  9. Ilya Sutskever, Oriol Vinyals, and Quoc V Le, 2014, Sequence to sequence learning with neural networks, arXiv preprint arXiv:1409.3215, https://arxiv.org/abs/1409.3215 (Early paper using a kind of beam search decoding and top-k decoding.)
  10. Kenton Murray, David Chiang, Aug 2018, Correcting Length Bias in Neural Machine Translation, https://arxiv.org/abs/1808.10006 (Brevity problems in beam search decoding.)
  11. Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, Zhihao Jia, 2024, SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification, ASPLOS’24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, April 2024, Pages 932–949, https://doi.org/10.1145/3620666.3651335 https://dl.acm.org/doi/abs/10.1145/3620666.3651335 Code: https://github.com/flexflow/FlexFlow/
  12. Jared Lichtarge, Christopher Alberti, Shankar Kumar, Noam Shazeer, and Niki Parmar, 2018, Weakly supervised grammatical error correction using iterative decoding, CoRR, abs/1811.01710, https://arxiv.org/abs/1811.01710 (Beam search decoding with a high threshold to emit corrections.)
  13. Jindrich Libovicky, Jindrich Helcl, Marek Tlusty, Ondrej Bojar, and Pavel Pecina, 2016, CUNI system for WMT16 automatic post-editing and multimodal translation tasks, Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 646–654, Berlin, Germany, https://arxiv.org/abs/1606.07481 (Post-editing of machine translation.)
  14. Daniel Dahlmeier, Hwee Tou Ng, 2012, A Beam-Search Decoder for Grammatical Error Correction, Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 568–578, Jeju Island, Korea, 12–14 July 2012, https://aclanthology.org/D12-1052.pdf
  15. Ashwin K. Vijayakumar, Michael Cogswell, Ramprasaath R. Selvaraju, Qing Sun, Stefan Lee, David J. Crandall, and Dhruv Batra, 2018, Diverse beam search for improved description of complex scenes, In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 7371–7379, AAAI Press, https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17329
  16. Tinghui Zhu, Kai Zhang, Jian Xie, Yu Su, 4 Feb 2024 (v2), Deductive Beam Search: Decoding Deducible Rationale for Chain-of-Thought Reasoning, https://arxiv.org/abs/2401.17686
  17. Jungo Kasai, Keisuke Sakaguchi, Ronan Le Bras, Dragomir Radev, Yejin Choi, and Noah A. Smith, 2024, A Call for Clarity in Beam Search: How It Works and When It Stops, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 77–90, Torino, Italia. ELRA and ICCL, https://aclanthology.org/2024.lrec-main.7/ https://aclanthology.org/2024.lrec-main.7.pdf
  18. Zongyue Qin, Zifan He, Neha Prakriya, Jason Cong, Yizhou Sun, 25 Sep 2024, Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference, https://arxiv.org/abs/2409.16560
  19. Shixiaowei02, Oct 2024, TensorRT-LLM 0.13.0 Release, https://github.com/NVIDIA/TensorRT-LLM/releases/tag/v0.13.0
  20. Rongxiang Wang and Felix Xiaozhu Lin, 2024, Turbocharge Speech Understanding with Pilot Inference, Proceedings of the 30th Annual International Conference on Mobile Computing and Networking (ACM MobiCom '24). Association for Computing Machinery, New York, NY, USA, 1299–1313, https://doi.org/10.1145/3636534.3690694 https://dl.acm.org/doi/abs/10.1145/3636534.3690694 https://dl.acm.org/doi/pdf/10.1145/3636534.3690694 (“Pilot inference” is a specialized mix of caching, computation reuse, and backtracking in beam search for speech understanding, and is somewhat related to speculative decoding, and similar to continual inference for processing a stream.)

Decoding Algorithms. Research papers on decoding algorithms in general:

  1. Xiangjue Dong, Maria Teleki, James Caverlee, 18 Dec 2024, A Survey on LLM Inference-Time Self-Improvement, https://arxiv.org/abs/2412.14352 https://github.com/dongxiangjue/Awesome-LLM-Self-Improvement (Broad survey of reasoning improvement methods from multi-step inference to RALM to decoding algorithms.)
  2. Chufan Shi, Haoran Yang, Deng Cai, Zhisong Zhang, Yifan Wang, Yujiu Yang, Wai Lam, 10 Feb 2024, A Thorough Examination of Decoding Methods in the Era of LLMs, https://arxiv.org/abs/2402.06925 (Evaluates a number of decoding algorithms with several 7B models including Llama2-7B, and also with 4-bit and 8-bit quantization.)
  3. Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, Zaid Harchaoui, 24 Jun 2024, From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models, https://arxiv.org/abs/2406.16838 (Survey and theoretical analysis of many different decoding algorithms, along with various ways to speed them up such as speculative decoding and KV caches.)
  4. Haoran Wang, Kai Shu, Jan 2025, Make Every Token Count: A Systematic Survey on Decoding Methods for Foundation Model, https://www.researchgate.net/profile/Haoran-Wang-96/publication/387703971_Make_Every_Token_Count_A_Systematic_Survey_on_Decoding_Methods_for_Foundation_Models/links/67784c8ce74ca64e1f49eb15/Make-Every-Token-Count-A-Systematic-Survey-on-Decoding-Methods-for-Foundation-Models.pdf https://github.com/wang2226/Awesome-LLM-Decoding
  5. Edward Beeching, Lewis Tunstall, Sasha Rush Dec 16, 2024, Scaling Test Time Compute with Open Source Models, https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute
  6. Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Hubert Niewiadomski, Torsten Hoefler, 23 Jan 2025 (v3), Reasoning Language Models: A Blueprint, https://arxiv.org/abs/2501.11223 (Survey and blueprint for how to build a Large Reasoning Model.)
  7. Wendi Cui, Jiaxin Zhang, Zhuohang Li, Hao Sun, Damien Lopez, Kamalika Das, Bradley A. Malin, Sricharan Kumar, 26 Feb 2025, Automatic Prompt Optimization via Heuristic Search: A Survey, https://arxiv.org/abs/2502.18746 (Survey of auto prompting, from basic LLM enhancements to some methods quite similar to RALM and TALM.)
  8. Komal Kumar, Tajamul Ashraf, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Phillip H.S. Torr, Salman Khan, Fahad Shahbaz Khan, 28 Feb 2025, LLM Post-Training: A Deep Dive into Reasoning Large Language Models, https://arxiv.org/abs/2502.21321 https://github.com/mbzuai-oryx/Awesome-LLM-Post-training

Tree Decoding. Research papers on “tree decoding,” which is the general idea of trying multiple word output pathways, and then backtracking:

  1. Ziyu Wan, Xidong Feng, Muning Wen, Stephen Marcus Mcaleer, Ying Wen, Weinan Zhang, Jun Wang, July 2024, AlphaZero-Like Tree-Search can Guide Large Language Model Decoding and Training, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:49890-49920, 2024, https://proceedings.mlr.press/v235/wan24c.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/wan24c/wan24c.pdf
  2. Xiangxiang Gao, Weisheng Xie, Yiwei Xiang, Feng Ji, 17 Dec 2024, Falcon: Faster and Parallel Inference of Large Language Models through Enhanced Semi-Autoregressive Drafting and Custom-Designed Decoding Tree, https://arxiv.org/abs/2412.12639
  3. Yangchao Wu, Zongyue Qin, Alex Wong, Stefano Soatto, 20 May 2025, STree: Speculative Tree Decoding for Hybrid State-Space Models, https://arxiv.org/abs/2505.14969
  4. Xuezhi Wang, Denny Zhou, 23 May 2024 (v2), Chain-of-Thought Reasoning Without Prompting, https://arxiv.org/abs/2402.10200 (“CoT decoding” is examining the alternative paths in the decoding algorithm, which is somewhat similar to Chain-of-Thought reasoning.)
  5. Xiangjue Dong, Maria Teleki, James Caverlee, 18 Dec 2024, A Survey on LLM Inference-Time Self-Improvement, https://arxiv.org/abs/2412.14352 https://github.com/dongxiangjue/Awesome-LLM-Self-Improvement (Broad survey of reasoning improvement methods from multi-step inference to RALM to decoding algorithms.)
  6. Xidong Feng, Ziyu Wan, Muning Wen, Ying Wen, Weinan Zhang, and Jun Wang, 2023, Alphazero-like tree-search can guide large language model decoding and training, NeurIPS 2023 Foundation Models for Decision Making Workshop, https://arxiv.org/abs/2309.17179
  7. Zongyue Qin, Zifan He, Neha Prakriya, Jason Cong, Yizhou Sun, 25 Sep 2024, Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference, https://arxiv.org/abs/2409.16560
  8. Penghui Yang, Cunxiao Du, Fengzhuo Zhang, Haonan Wang, Tianyu Pang, Chao Du, Bo An, 24 Feb 2025, LongSpec: Long-Context Speculative Decoding with Efficient Drafting and Verification, https://arxiv.org/abs/2502.17421 https://github.com/sail-sg/LongSpec
  9. Yifu Ding, Wentao Jiang, Shunyu Liu, Yongcheng Jing, Jinyang Guo, Yingjie Wang, Jing Zhang, Zengmao Wang, Ziwei Liu, Bo Du, Xianglong Liu, Dacheng Tao, 27 Feb 2025 (v2), Dynamic Parallel Tree Search for Efficient LLM Reasoning, https://arxiv.org/abs/2502.16235
  10. Yuhao Shen, Junyi Shen, Quan Kong, Tianyu Liu, Yao Lu, Cong Wang, 16 May 2025, Speculative Decoding via Hybrid Drafting and Rollback-Aware Branch Parallelism, https://arxiv.org/abs/2506.01979

 
