Aussie AI Blog

What's New in Speculative Decoding?

March 3rd, 2025

by David Spuler, Ph.D.

Speculative decoding was one of the earliest LLM efficiency improvements to parallelize multiple decoding steps. And yet, there seems to be a never-ending supply of new research papers on the topic.
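For readers who want a refresher on the basic idea, here is a minimal sketch of greedy draft-and-verify decoding. It is a toy under stated assumptions, not any particular engine's implementation: the "models" are hypothetical Python functions that map a token list to the next token, and the names `speculative_step`, `generate`, `toy_target`, and `toy_draft` are made up for illustration. In a real engine, the verification checks would run as one batched forward pass of the large model, which is where the parallelism comes from; here they are a plain loop.

```python
from typing import Callable, List

# Assumption: a "model" is any function from a token list to its greedy next token.
NextToken = Callable[[List[int]], int]

def speculative_step(target: NextToken, draft: NextToken,
                     tokens: List[int], k: int = 4) -> List[int]:
    """Run one draft-and-verify round and return the newly accepted tokens."""
    # 1. The cheap draft model proposes k tokens, one after another.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(tokens + proposed))

    # 2. The target model checks each proposed position in order.
    accepted: List[int] = []
    for guess in proposed:
        correct = target(tokens + accepted)
        accepted.append(correct)          # always keep the target's own token
        if correct != guess:
            return accepted               # first mismatch ends the round
    # Every guess matched, so one extra target token is appended as a bonus.
    accepted.append(target(tokens + accepted))
    return accepted

def generate(target: NextToken, draft: NextToken,
             prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Repeat draft-and-verify rounds until max_new tokens have been added."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        tokens.extend(speculative_step(target, draft, tokens, k))
    return tokens[:len(prompt) + max_new]

# Hypothetical toy models: the target counts upward modulo 10,
# and the draft agrees except that it guesses wrongly after a 5.
def toy_target(tokens: List[int]) -> int:
    return (tokens[-1] + 1) % 10

def toy_draft(tokens: List[int]) -> int:
    return 0 if tokens[-1] == 5 else (tokens[-1] + 1) % 10
```

Because every accepted token is one the target model would have chosen anyway, the output of this greedy variant matches ordinary greedy decoding with the target model alone. A better draft model only changes how many tokens are accepted per round.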

So, what's new? Here are some of the more recent research areas:

Draft model accuracy: more papers on this, as always; forgive me if I yawn!
Multiple parallel draft models: ongoing improvements to this idea.
Multi-query prompt lookup decoding: this generalizes prompt lookup decoding to search not only the current prompt context, but also any previous queries in the history.
Distributed speculative decoding: optimal use of speculative decoding when inference processing is distributed over multiple GPUs or multiple servers.
Long context speculative decoding: examination of particular optimizations when applying speculative decoding to long or ultralong contexts.
Vision and multimodal speculative decoding: visual tokenization is very different from text tokenization.
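To make the multi-query prompt lookup idea a little more concrete, here is a rough sketch of the drafting side. It is an illustration under assumptions, not the algorithm from any specific paper: the function name `lookup_draft`, its parameters, and the choice to search sources in the order given are all hypothetical. The idea is to take the last few tokens generated so far, find the same n-gram in the current prompt or in an earlier query from the history, and propose the tokens that followed it as the draft.

```python
from typing import List, Sequence

def lookup_draft(context: Sequence[int],
                 sources: Sequence[Sequence[int]],
                 ngram: int = 3,
                 max_draft: int = 5) -> List[int]:
    """Propose draft tokens by finding the context's last n-gram in the sources.

    `sources` is assumed to hold the current prompt first, followed by any
    earlier queries from the history; earlier entries take priority.
    """
    if len(context) < ngram:
        return []
    key = list(context[-ngram:])
    for src in sources:
        # Scan right to left so the most recent match in a source wins.
        # The start bound guarantees at least one token follows the match.
        for start in range(len(src) - ngram - 1, -1, -1):
            if list(src[start:start + ngram]) == key:
                follow = start + ngram
                return list(src[follow:follow + max_draft])
    return []   # no match anywhere: fall back to normal decoding

# Hypothetical token IDs: the current context ends in 1, 2, 3, which only
# appears with a continuation in an earlier query from the history.
current = [7, 8, 9, 1, 2, 3]
history = [[1, 2, 3, 4, 5, 6]]
draft_tokens = lookup_draft(current, [current] + history)
```

The proposed tokens are only guesses. They would still be verified by the main model in the usual speculative decoding way, so a bad match costs some wasted work but does not change the output.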

Read more about types of speculative decoding.

AI Books from Aussie AI

The Sweetest Lesson: Your Brain Versus AI

The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:

Your brain is 50 times bigger than the best AI engines.
Truly intelligent AI will require more compute!
Another case of the bitter lesson?
Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson

RAG Optimization

RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:

Smarter RAG
Faster RAG
Cheaper RAG
Agentic RAG
RAG reasoning

Get your copy from Amazon: RAG Optimization

Generative AI Applications

Generative AI Applications book:

Deciding on your AI project
Planning for success and safety
Designs and LLM architectures
Expediting development
Implementation and deployment

Get your copy from Amazon: Generative AI Applications

Generative AI in C++

Generative AI programming book:

Generative AI coding in C++
Transformer engine speedups
LLM models
Phone and desktop AI
Code examples
Research citations

Get your copy from Amazon: Generative AI in C++

CUDA C++ Optimization

CUDA C++ Optimization book:

Faster CUDA C++ kernels
Optimization tools & techniques
Compute optimization
Memory optimization

Get your copy from Amazon: CUDA C++ Optimization

CUDA C++ Debugging

CUDA C++ Debugging book:

Debugging CUDA C++ kernels
Tools & techniques
Self-testing & reliability
Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging