Aussie AI Blog
Structured Decoding
-
April 27, 2026
-
by David Spuler, Ph.D.
What is Structured Decoding?
Structured decoding means generating output tokens where there is a pattern or constraint on the output. It's also been called "constrained decoding," "constraint decoding," or "guided decoding." This method has become widely used in industry, such as in the vLLM, xGrammar, SGLang, and TensorRT-LLM engines, and by the major frontier coding tools, such as Claude Code, OpenAI Codex, Google Gemini, and Cursor.
Attention to structured decoding has taken off recently, particularly because of the explosion of usage in code generation models, such as Claude Code, Codex, and Cursor. However, it was also used much earlier with models that output JSON or XML formats. In particular, JSON has often been used as both the input and the return-data format for LLM-integrated tools, which means that when an LLM makes a tool's "function call," it must generate valid JSON text.
Model Accuracy with Structured Decoding
One of the upsides to structured decoding is that the model's output becomes more correct. Early LLM usage with JSON often saw incorrectly formatted data coming straight out of the LLM's decoder. One solution back in the bad old days was to heuristically correct simple syntax problems in the JSON output as a post-processing step on model output. However, recent LLMs have become a lot better at this, and malformed output is now rare in code generation models. Some of the technologies used to perform structured decoding include:
- Regexps
- CFG grammars
- BNF grammars
- Finite State Machines (FSMs)
- Automata (implementations of FSMs)
- Heuristics
Early versions of structured decoding, starting around 2021, used regular expressions to predict tokens. Subsequently, more advanced versions were built on Context-Free Grammars (CFGs) and Backus-Naur Form (BNF) grammars.
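As a rough illustration of the regex approach, here is a minimal sketch of my own in Python, using the third-party "regex" module (which, unlike the standard "re" module, supports partial matching); the toy vocabulary is a placeholder assumption, since a real tokenizer has tens of thousands of entries:

    # Minimal sketch of regex-constrained token filtering (illustrative only).
    # Requires the third-party "regex" module: pip install regex
    import regex

    # Example constraint: a JSON-style key-value pair like "count": 42
    PATTERN = regex.compile(r'"[a-z_]+":\s*\d+')

    def allowed_tokens(prefix: str, vocab: dict[int, str]) -> set[int]:
        """Return ids of tokens that keep the output a viable regex prefix."""
        allowed = set()
        for token_id, token_text in vocab.items():
            # partial=True also accepts strings that are a prefix of a full match
            if PATTERN.fullmatch(prefix + token_text, partial=True):
                allowed.add(token_id)
        return allowed

    # Toy vocabulary (placeholder for a real tokenizer's vocabulary):
    vocab = {0: '"co', 1: 'unt"', 2: ': 4', 3: '2', 4: 'hello'}
    print(allowed_tokens('"co', vocab))  # -> {1, 4}: both keep a viable prefix

Scanning the whole vocabulary on every decoding step like this is slow, which is exactly why the precomputed bitmask approach described next matters.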
Converting a grammar over text characters into an automaton over token ids is quite a complex problem, but it has been achieved, and there are research papers and industry solutions for it. More optimized versions now precompute a bitmask over the model vocabulary, so that the decoding phase knows which tokens are allowed or disallowed by the rules.
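To make the bitmask idea concrete, here is a minimal sketch of my own in Python/NumPy (not any engine's actual code): the set of allowed token ids is packed into 32-bit words, and at decode time the mask is expanded and applied to the logits.

    # Sketch: pack an allowed-token set into a bitmask over the vocabulary,
    # then apply it to the logits at decode time. Illustrative only; engines
    # like xGrammar precompute such masks per grammar state.
    import numpy as np

    VOCAB_SIZE = 32000

    def pack_bitmask(allowed_ids: set[int]) -> np.ndarray:
        """Pack allowed token ids into uint32 words, one bit per token."""
        mask = np.zeros((VOCAB_SIZE + 31) // 32, dtype=np.uint32)
        for tid in allowed_ids:
            mask[tid >> 5] |= np.uint32(1 << (tid & 31))
        return mask

    def apply_bitmask(logits: np.ndarray, mask: np.ndarray) -> np.ndarray:
        """Set disallowed tokens' logits to -inf so they can't be sampled."""
        # Expand bits to booleans (assumes a little-endian platform).
        bits = np.unpackbits(mask.view(np.uint8), bitorder="little")
        allowed = bits[:VOCAB_SIZE].astype(bool)
        out = logits.copy()
        out[~allowed] = -np.inf
        return out

    logits = np.random.randn(VOCAB_SIZE).astype(np.float32)
    masked = apply_bitmask(logits, pack_bitmask({11, 42, 1000}))
    print(int(np.argmax(masked)))  # greedy pick: one of 11, 42, or 1000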
Use Cases for Structured Decoding
Programming language syntax is much more rigid than English, and the decoding phase can take advantage of this to shortcut the generation of some of the tokens. Relevant types of tasks that have rules about their structure include the main usages in the AI industry:
- Programming languages (code generation)
- HTML web pages and CSS formatting
- Function calling syntax (LLM tool usage)
- Symbolic execution (LLMs generate and execute Python scripts)
And there's also a wide variety of secondary use cases, where information is in a structured format:
- Tables of data
- Spreadsheets
- Genetic DNA sequences
- Written music scores
Nevertheless, code generation is the hot use case, where structured decoding aims to make the output more accurate.
Structured Decoding Efficiency
The use of structured decoding can improve efficiency in various ways:
- Speculative decoding — faster drafting of token sequences (and often 100% accuracy).
- Faster decoding module — fewer logits to choose from.
- Beam search decoding optimizations — there's only one path, or fewer paths, for the tree of possible output sequences.
- Avoids wasted corrections — less need for self-checking or external syntax verifiers.
- Token skipping — see below for concerns about this one.
Speculative decoding is the big one!
Speculative Drafting with Structured Decoding
In formats like JSON, it's sometimes possible to predict long sequences ahead with 100% accuracy. This is great for speculative decoding, since you can process many tokens in parallel rather than running them sequentially. Hence, you can use structured decoding as a method for "drafting" the candidate sequences of tokens, which is the first phase of speculative decoding. Usually, you'd use a smaller LLM as the "draft model" to create candidate tokens, but here you don't need a drafter LLM at all. Instead, you can use a non-LLM method such as an automaton (finite state machine) or a simpler heuristic, which allows super-fast drafting.
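Here's a minimal sketch of grammar-based drafting (my own toy Python code; a real automaton would be compiled from the grammar, whereas here the transition table is a hard-coded stand-in):

    # Sketch: use a grammar automaton as the "drafter" in speculative decoding.
    # While the automaton allows exactly one next token, the draft is forced,
    # so we can extend it without calling any model at all.

    # Toy automaton: state -> set of allowed next token ids. (Assumption: a
    # real implementation compiles this table from a grammar.)
    TRANSITIONS = {
        0: {101},        # only one legal token: forced, can be drafted
        1: {102},        # forced again
        2: {103, 104},   # ambiguous: hand drafting back to the LLM
    }

    def next_allowed(state: int) -> set[int]:
        return TRANSITIONS.get(state, set())

    def advance(state: int, token: int) -> int:
        return state + 1  # toy transition; real automata are grammar-driven

    def draft_with_automaton(state: int, max_draft: int):
        """Greedily draft tokens the grammar forces; stop at any ambiguity."""
        draft = []
        while len(draft) < max_draft:
            allowed = next_allowed(state)
            if len(allowed) != 1:     # ambiguous or dead end: stop drafting
                break
            (token,) = allowed
            draft.append(token)
            state = advance(state, token)
        return draft, state

    print(draft_with_automaton(0, max_draft=8))  # -> ([101, 102], 2)

The target model then verifies the whole draft in one parallel forward pass, exactly as in ordinary speculative decoding; the forced tokens should be accepted at (or near) a 100% rate.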
It works great!
The efficiency gain from this approach is very good, because it parallelizes any token that the grammar can predict. The accuracy of token prediction by structured decoding is very high, so this is speculative decoding with a high acceptance rate, making it even faster. It also avoids the need to verify multiple candidate sequences at many points, since there's often only one possible sequence. Finally, at the points where the grammar's syntax cannot predict the output unambiguously, the algorithm can revert to using a small LLM as the drafter feeding candidates to speculative decoding's verifier model.
There's an interesting wrinkle when the output contains both English text and program code. You want the small draft LLM to handle the English drafting, but structured decoding to kick in as the speculative drafter whenever it sees some code. There's nothing wrong with having two drafters, or switching between drafters. The English-speaking LLM will get confused by the code, and the automaton won't understand the English prose, but it doesn't matter, because the verifier checks both in parallel and will pick the right drafts. I don't know an elegant solution for implementing this, but the automaton can simply keep watching the token sequence until it sees some code that matches its pattern.
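One inelegant but workable way to structure the switching is sketched below (all helper functions here, namely in_code_region, automaton_draft, and llm_draft, are my assumptions, not any engine's API):

    # Sketch: route drafting between a small draft LLM and a grammar automaton
    # depending on whether we're currently inside a code region.

    def pick_draft(context_tokens: list[int], max_draft: int) -> list[int]:
        if in_code_region(context_tokens):        # e.g., inside a code fence
            draft, _ = automaton_draft(context_tokens, max_draft)
            if draft:                             # the grammar forced something
                return draft
        return llm_draft(context_tokens, max_draft)  # English, or ambiguous code

    # Either way, the verifier model checks the draft in parallel, so choosing
    # the "wrong" drafter only costs some rejected draft tokens, not correctness.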
Here are some research papers on combining speculative decoding with a drafter using structured decoding:
- Minghao Yan, Saurabh Agarwal, and Shivaram Venkataraman, 2025, Decoding Speculative Decoding, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6460–6473, Albuquerque, New Mexico, Association for Computational Linguistics, https://aclanthology.org/2025.naacl-long.328/ https://aclanthology.org/2025.naacl-long.328.pdf
- Ziyang Liu, 20 Apr 2026, Copy-as-Decode: Grammar-Constrained Parallel Prefill for LLM Editing, https://arxiv.org/abs/2604.18170
- Nishanth Nakshatri, Shamik Roy, Rajarshi Das, Suthee Chaidaroon, Leonid Boytsov, Rashmi Gangadharaiah, 10 Feb 2025 (v2), Constrained Decoding with Speculative Lookaheads, https://arxiv.org/abs/2412.10418
- Zhuocheng Gong, Jiahao Liu, Ziyue Wang, Pengfei Wu, Jingang Wang, Xunliang Cai, Dongyan Zhao, Rui Yan, 23 Jul 2024, Graph-Structured Speculative Decoding, https://arxiv.org/abs/2407.16207
Can we do even better? When the correct tokens are 100% predicted by a grammar, why bother verifying the tokens at all? Surely, you could just skip them (i.e., output them immediately), and not use the verification phase at all. The token skipping idea sounds like it'd be an even bigger speedup, but there's a major problem with that technique.
Structured Decoding for Token Skipping
My naive idea is to use structured decoding for an efficiency gain via token skipping. If you can tell from a "context-free grammar" specification (Yacc and Bison, anyone?), or from simpler heuristics (e.g., in C++ there's always a left parenthesis after the "if" keyword), what the next programming token should be, then why waste LLM compute on that token?
Faster to skip it?
You could skip it as part of speculative decoding, or even without using spec dec at all. Unfortunately, it's not quite so straightforward, in terms of both speed and accuracy. There are concerns that constrained decoding leads to a decline in the accuracy and diversity of LLM responses, and can reduce their capabilities in reasoning and creativity. Anyone working with coding models knows that they're not perfect yet, although massively improved. The speed gains from naive structured decoding with token skipping are also smaller than expected.
Blame that pesky KV cache!
Although it's fast to skip the layer stack for the current token (just output it!), the attention modules of future tokens still need the KV cache data for that token, and that data is only produced by running the token through the attention modules. For a skipped token, there's zero KV cache data at any layer.
There's a hole in my KV bucket.
Can we fill the hole? This missing-KV-cache issue for constrained decoding is similar to the KV cache problems that arise in the "early exit of layers" optimization. The early-exit research papers have various tricks for re-generating missing KV caches later. Some of the "KV cache fixup" solutions we could borrow from early exit research include:
- Recomputation — loses most of the efficiency benefit of token skipping.
- Propagation — use the KV cache from a prior layer (effectively doing layerwise fused KV caches); see the sketch after this list.
- Approximation — try to calculate an approximate version without full recomputation.
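As a rough illustration of the propagation idea, here is a minimal sketch in PyTorch (the per-layer cache layout is my assumption; real engines store KV per layer, head, and position):

    # Sketch: the "propagation" KV fixup from early-exit research. If a token
    # only computed layers 0..k-1, copy the deepest computed layer's KV entry
    # into the missing layers as a cheap approximation (not exact attention).
    import torch

    def propagate_kv(kv_cache: list[torch.Tensor], pos: int, k: int) -> None:
        """kv_cache: per-layer tensors of shape (seq_len, heads, head_dim).
        Fill layers k..L-1 at position `pos` from layer k-1's entry."""
        for layer in range(k, len(kv_cache)):
            kv_cache[layer][pos] = kv_cache[k - 1][pos]

    # Toy cache: 4 layers, 8 positions, 2 heads, head_dim 4; suppose only
    # layer 0 was computed for the token at position 3 before early exit:
    cache = [torch.zeros(8, 2, 4) for _ in range(4)]
    cache[0][3] = torch.randn(2, 4)
    propagate_kv(cache, pos=3, k=1)   # layers 1..3 now reuse layer 0's KV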
Arguably it's worse for structured decoding than for early exit, because there's no cache at all for the skipped token, whereas early exit has some layers computed. We don't even have a first layer's KV cache to propagate to the other layers.
If you're thinking about the KV recomputation idea, well, at least it's only the attention module. Wrong! The FFN/GLU computation is needed at each layer as input to the next layer, so that the next layer's attention module can recompute its KV cache. You're basically redoing prefill for the skipped tokens. And if you're now thinking, great, prefill is much faster and compute-bound, which is far better than sequential decoding for the skipped tokens, you are 100% correct! You've just re-invented the idea of using structured decoding to draft sequences for speculative decoding, since prefill and spec dec's verification phase are similar types of parallelization. Maybe go file a patent anyway.
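In sketch form, the recomputation path looks like this (emit and model.prefill are assumed interfaces, not real APIs; the point is just that backfilling KV for a run of skipped tokens is one parallel, prefill-style pass):

    # Sketch: even "skipped" tokens need a KV-backfill pass, and that pass is
    # a mini-prefill, i.e., the same work pattern as spec dec's verification.

    def emit_forced_run(model, forced_tokens: list[int], kv_cache) -> None:
        emit(forced_tokens)                # output them immediately (free!)
        # One batched, prefill-style forward pass over the whole run, through
        # attention AND FFN/GLU at every layer, purely to populate the KV
        # cache entries that future tokens' attention will need:
        model.prefill(forced_tokens, kv_cache)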
Because of these KV issues, the main usage of structured decoding in industry models and engines does not do token skipping, even when the grammar fully predicts the tokens. The heuristic of auto-inserting new tokens where the grammar requires them, without running the model stack, has been largely abandoned.
The modern method still runs a full execution of all layers, and enforces the language rules at the end. This means imposing a mask over invalid tokens when the decoding algorithm chooses a token from the logits, which makes the model smarter, not faster.
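In Hugging Face Transformers terms, that masking step can be written as a custom LogitsProcessor, as in this minimal sketch (allowed_ids_for is an assumed hook into the grammar state, not a real library function):

    # Sketch: grammar masking as a Hugging Face Transformers LogitsProcessor.
    # The model runs all layers as usual; only the final token choice is
    # constrained by adding -inf to the logits of disallowed tokens.
    import torch
    from transformers import LogitsProcessor, LogitsProcessorList

    class GrammarMaskProcessor(LogitsProcessor):
        def __call__(self, input_ids: torch.LongTensor,
                     scores: torch.FloatTensor) -> torch.FloatTensor:
            mask = torch.full_like(scores, float("-inf"))
            for row, seq in enumerate(input_ids):
                allowed = allowed_ids_for(seq)   # assumed grammar-state hook
                mask[row, list(allowed)] = 0.0
            return scores + mask

    # Usage with generate():
    #   model.generate(inputs, logits_processor=LogitsProcessorList(
    #       [GrammarMaskProcessor()]))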
References on Structured Decoding
Structured decoding is seeing a lot of research papers, and getting hotter by the second. Research papers include:
- Rajaa El Hamdani, Samy Haffoudhi, Nils Holzenberger, Fabian Suchanek, Thomas Bonald, and Fragkiskos D. Malliaros, 27 Sep 2025, Retrieval-Constrained Decoding Reveals Underestimated Parametric Knowledge in Language Models, https://arxiv.org/abs/2509.23417
- Parv Kapoor, Akila Ganlath, Changliu Liu, Sebastian Scherer, Eunsuk Kang, 1 Sep 2025, Constrained Decoding for Robotics Foundation Models, https://arxiv.org/abs/2509.01728
- Ran Wang, Xiaoxuan Liu, Hao Ren, Gang Chen, Fanchao Qi, Maosong Sun, 22 Jul 2025, WGRAMMAR: Leverage Prior Knowledge to Accelerate Structured Decoding, https://arxiv.org/abs/2507.16768
- Donghoon Kim, Minji Bae, Kyuhong Shim, Byonghyo Shim, 21 Jul 2025, Visually Guided Decoding: Gradient-Free Hard Prompt Inversion with Language Models, https://arxiv.org/abs/2505.08622
- Oscar Mañas, Pierluca D'Oro, Koustuv Sinha, Adriana Romero-Soriano, Michal Drozdzal, Aishwarya Agrawal, 15 Aug 2025, Controlling Multimodal LLMs via Reward-guided Decoding, https://arxiv.org/abs/2508.11616
- Julian Oestreich and Lydia Müller, 21 Aug 2025, Evaluating Structured Decoding for Text-to-Table Generation: Evidence from Three Datasets, https://arxiv.org/abs/2508.15910
- Guofu Xie, Chen Zhang, Xiao Zhang, Yunsheng Shi, Ting Yao and Jun Xu, 4 Oct 2025, Merge and Guide: Unifying Model Merging and Guided Decoding for Controllable Multi-Objective Generation, https://arxiv.org/abs/2510.03782
- Zhenhua Liu, Lijun Li, Ruizhe Chen, Yuxian Jiang, Tong Zhu, Zhaochen Su, Wenliang Chen, Jing Shao, 4 Oct 2025, Evolutionary Guided Decoding: Iterative Value Refinement for LLMs, https://arxiv.org/abs/2503.02368
- Piotr Komorowski, Elena Golimblevskaia, Reduan Achtibat, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek, 30 Sep 2025, Attribution-Guided Decoding, https://arxiv.org/abs/2509.26307
- Niels Mündler, Jasper Dekoninck, and Martin Vechev, 13 Aug 2025, Constrained Decoding of Diffusion LLMs with Context-Free Grammars, https://arxiv.org/abs/2508.10111
- Lingxiao Li, Salar Rahili, Yiwei Zhao, 20 Aug 2025, Correctness-Guaranteed Code Generation via Constrained Decoding, https://arxiv.org/abs/2508.15866
- Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng, 6 Jun 2024 (v2), SGLang: Efficient Execution of Structured Language Model Programs, https://arxiv.org/abs/2312.07104 https://github.com/sgl-project/sglang
- K Ahmed, KW Chang, G Van den Broeck, Oct 2024, Controllable Generation via Locally Constrained Resampling, Neurips Safe Generative AI Workshop 2024, https://openreview.net/pdf?id=v091fzXTu0
- Gaya Mehenni, Amal Zouaq, 23 Nov 2024, Ontology-Constrained Generation of Domain-Specific Clinical Summaries, https://arxiv.org/abs/2411.15666
- Will Kurt, Nov 2024, Say What You Mean: A Response to 'Let Me Speak Freely', https://blog.dottxt.co/say-what-you-mean.html
- Zhi Rui Tam, Cheng-Kuang Wu, Yi-Lin Tsai, Chieh-Yen Lin, Hung-yi Lee, Yun-Nung Chen, 14 Oct 2024 (v3), Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models, https://arxiv.org/abs/2408.02442
- Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould, 2017, Guided Open Vocabulary Image Captioning with Constrained Beam Search, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 936–945, https://arxiv.org/abs/1612.00576
- Chris Hokamp and Qun Liu, 2017, Lexically constrained decoding for sequence generation using grid beam search. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1535–1546, https://arxiv.org/abs/1704.07138
- Yizhe Zhang, Guoyin Wang, Chunyuan Li, Zhe Gan, Chris Brockett, and Bill Dolan. Pointer: Constrained text generation via insertion-based generative pre-training. arXiv preprint arXiv:2005.00558, 2020. https://arxiv.org/abs/2005.00558
- Saibo Geng, Martin Josifoski, Maxime Peyrard, Robert West, 18 Jan 2024 (v6), Grammar-Constrained Decoding for Structured NLP Tasks without Finetuning, https://arxiv.org/abs/2305.13971 https://github.com/epfl-dlab/GCD
- Yanjun Fu, Ethan Baker, Yu Ding, Yizheng Chen, 20 Jul 2024 (v3), Constrained Decoding for Secure Code Generation, https://arxiv.org/abs/2405.00218 https://codeguardplus.github.io/
- Zekun Hao, David W. Romero, Tsung-Yi Lin, Ming-Yu Liu, 12 Dec 2024, Meshtron: High-Fidelity, Artist-Like 3D Mesh Generation at Scale, https://arxiv.org/abs/2412.09548 https://research.nvidia.com/labs/dir/meshtron/ (Optimizations to avoid the quadratic Transformer cost, in both training and inference, include "hourglass neural architecture" analogous to widthwise pruning or slimming, sliding window attention, rolling KV cache, truncated sequence training, and a "robust sampling strategy" that is effectively a type of constrained decoding based on mesh layouts.)
- Xiaoxi Li, Jiajie Jin, Yujia Zhou, Yongkang Wu, Zhonghua Li, Qi Ye, Zhicheng Dou, 16 Dec 2024, RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation, https://arxiv.org/abs/2412.11919 https://github.com/sunnynexus/RetroLLM
- Xiangjue Dong, Maria Teleki, James Caverlee, 18 Dec 2024, A Survey on LLM Inference-Time Self-Improvement, https://arxiv.org/abs/2412.14352 https://github.com/dongxiangjue/Awesome-LLM-Self-Improvement (Broad survey of reasoning improvement methods from multi-step inference to RALM to decoding algorithms.)
- Haoran Wang, Kai Shu, Jan 2025, Make Every Token Count: A Systematic Survey on Decoding Methods for Foundation Model, https://www.researchgate.net/profile/Haoran-Wang-96/publication/387703971_Make_Every_Token_Count_A_Systematic_Survey_on_Decoding_Methods_for_Foundation_Models/links/67784c8ce74ca64e1f49eb15/Make-Every-Token-Count-A-Systematic-Survey-on-Decoding-Methods-for-Foundation-Models.pdf https://github.com/wang2226/Awesome-LLM-Decoding
- D Banerjee, T Suresh, S Ugare, S Misailovic, G Singh, Mar 2025, Preserving Reasoning Capabilities Under Constrained LLM Generation, https://openreview.net/pdf?id=RX3GIOkGHr
- Changran Xu, Yi Liu, Yunhao Zhou, Shan Huang, Ningyi Xu, Qiang Xu, 18 Mar 2025, Speculative Decoding for Verilog: Speed and Quality, All in One, https://arxiv.org/abs/2503.14153
- Devansh, Sep 2025, The Chocolate Milk Cult’s Guide to Inference Scaling for AI Models: How to Reduce the costs of Running LLMs https://machine-learning-made-simple.medium.com/the-chocolate-milk-cults-guide-to-inference-scaling-for-ai-models-50aa2290eb50 (Deep analysis of using many progressive optimizations to real-life LLM inference.)
- Nan Xu, Shiheng Li, Shengchao Hou, 23 Apr 2026 (v2), From Image to Music Language: A Two-Stage Structure Decoding Approach for Complex Polyphonic OMR, https://arxiv.org/abs/2604.20522
- Yifan Le, 16 Apr 2026, Schema Key Wording as an Instruction Channel in Structured Generation under Constrained Decoding https://arxiv.org/abs/2604.14862
- Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ruihang Lai, Ziyi Xu, Yilong Zhao, Tianqi Chen, 12 May 2025 (v3), XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models, https://arxiv.org/abs/2411.15100 (Speeding up CFG-based structured decoding with precomputed token masks.)
- Terry Koo, Frederick Liu, Luheng He, 5 Aug 2024 (v3), Automata-based constraints for language model decoding https://arxiv.org/abs/2407.08103
- Bailin Wang, Zi Wang, Xuezhi Wang, Yuan Cao, Rif A. Saurous, Yoon Kim, 3 Nov 2023 (v3), Grammar Prompting for Domain-Specific Language Generation with Large Language Models, https://arxiv.org/abs/2305.19234
- Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve, 31 Jan 2024 (v3), Code Llama: Open Foundation Models for Code, https://arxiv.org/abs/2308.12950
- Sahil Chaudhary, 2023, Code Alpaca: An Instruction-following LLaMA Model trained on code generation instructions, https://github.com/sahil280114/codealpaca
- Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, et al., 13 Dec 2023 (v2), StarCoder: may the source be with you!, https://arxiv.org/abs/2305.06161 https://openreview.net/forum?id=KoFOg41haE
- Frederikke I. Marin, Dennis Pultz, Wouter Boomsma, 6 May 2025, Gene finding revisited: improved robustness through structured decoding from learned embeddings, https://arxiv.org/abs/2505.03377
- Zhimin Qiu, Di Wu, Feng Liu, Yuxiao Wang, 28 Jan 2026 (v2), Structure-Aware Decoding Mechanisms for Complex Entity Extraction with Large-Scale Language Models, https://arxiv.org/abs/2512.13980
- Avinash Reddy, Thayne T. Walker, James S. Ide, Amrit Singh Bedi, 8 Feb 2026, Draft-Conditioned Constrained Decoding for Structured Generation in LLMs, https://arxiv.org/abs/2603.03305
- Let's Data Science February 11, 2026, Structured Outputs: Making LLMs Return Reliable JSON, https://letsdatascience.com/blog/structured-outputs-making-llms-return-reliable-json
- Hongxu Zhou, 7 Apr 2026, From Hallucination to Structure Snowballing: The Alignment Tax of Constrained Decoding in LLM Reflection, https://arxiv.org/abs/2604.06066 https://github.com/hongxuzhou/agentic_llm_structured_self_critique
- Brandon T. Willard, Rémi Louf, 19 Aug 2023 (v4), Efficient Guided Generation for Large Language Models, https://arxiv.org/abs/2307.09702
- Zhengyang Su, Isay Katsman, Yueqi Wang, Ruining He, Lukasz Heldt, Raghunandan Keshavan, Shao-Chuan Wang, Xinyang Yi, Mingyan Gao, Onkar Dalal, Lichan Hong, Ed Chi, Ningren Han, 26 Feb 2026, Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators, https://arxiv.org/abs/2602.22647 https://github.com/youtube/static-constraint-decoding
- Aaron Pham, January 15, 2025, Structured Decoding in vLLM: A Gentle Introduction: Understand structure decoding and vLLM and how recent XGrammar integration can contribute to 5x improvement in TPOT, https://www.bentoml.com/blog/structured-decoding-in-vllm-a-gentle-introduction
- Kanghee Park, Timothy Zhou, Loris D'Antoni, 15 Jul 2025 (v2), Flexible and Efficient Grammar-Constrained Decoding, https://arxiv.org/abs/2502.05111
- Liangsheng Yin, Ying Sheng, Lianmin Zheng Feb 5, 2024, Fast JSON Decoding for Local LLMs with Compressed Finite State Machine, https://www.lmsys.org/blog/2024-02-05-compressed-fsm/
- Chuangtao Chen, Grace Li Zhang, Xunzhao Yin, Cheng Zhuo, Bing Li, Ulf Schlichtmann, 17 Apr 2026 (v2), KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs, https://arxiv.org/abs/2604.13226