Aussie AI

Token Reduction

  • Last Updated 15 August, 2025
  • by David Spuler, Ph.D.

What is Token Reduction?

Token reduction is an LLM inference optimization that speeds up processing by reducing the number of tokens the model must handle. It is relevant to any inference workload, but is particularly applicable to multi-step reasoning algorithms, such as Chain-of-Thought, because their intermediate reasoning steps generate long sequences of tokens.

Token reduction methods have a long history of research in single-step inference optimizations. Common cases where input texts contain many tokens include RAG chunks and the accumulated conversational history in chatbot sessions; reducing the total number of tokens improves speed in all of these situations.
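
As a concrete illustration, one common approach is to truncate the conversational history to a fixed token budget, keeping only the most recent turns. The sketch below is a minimal example, assuming the tiktoken library for token counting; the function name and budget value are illustrative, not from any particular framework.

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    def truncate_history(turns: list[str], max_tokens: int = 1024) -> list[str]:
        """Keep the most recent conversation turns that fit in a token budget."""
        kept: list[str] = []
        total = 0
        for turn in reversed(turns):       # walk from newest to oldest
            n = len(enc.encode(turn))
            if total + n > max_tokens:
                break                      # budget exhausted; drop older turns
            kept.append(turn)
            total += n
        return list(reversed(kept))        # restore chronological order

Dropping the oldest turns first preserves the recent context that matters most to the next response, while capping the input token count.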

One of the simpler ways to reduce tokens is to use prompt engineering techniques. There are two main approaches, illustrated in the sketch after this list:

  • Write a concise prompt for the LLM (fewer input tokens).
  • Politely ask the LLM to "be concise" as part of the prompt (fewer output tokens).
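
The difference can be measured directly by counting tokens. This sketch assumes the tiktoken tokenizer; the prompt wording is purely illustrative.

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    verbose = ("I was wondering if you could possibly explain to me, "
               "in as much detail as you feel is appropriate, what "
               "token reduction means in the context of LLM inference?")
    concise = "Explain token reduction in LLM inference. Be concise."

    print(len(enc.encode(verbose)))   # many input tokens
    print(len(enc.encode(concise)))   # fewer input tokens; the "Be concise"
                                      # instruction also trims output tokens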

There are various technical ways to do token reduction as part of inference optimization, and some of the many sub-techniques include (with a toy sketch after this list):

  • Prompt compression (shortening the input text before inference)
  • Token pruning (dropping unimportant tokens during processing)
  • Token merging (combining similar tokens into one)
  • Context summarization (condensing long histories or RAG chunks)
  • Output length limits (capping the number of generated tokens)
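
As a toy illustration of token pruning, the sketch below drops common stopwords from a retrieved RAG chunk before inference. Real token-pruning methods score token importance inside the model (e.g., via attention weights); this word-level filter is only a stand-in to show the input shrinking, and the stopword list is an arbitrary example.

    import re

    STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "that", "it"}

    def prune_tokens(text: str) -> str:
        """Toy token pruning: drop common stopwords from a retrieved chunk."""
        words = re.findall(r"\w+|[^\w\s]", text)   # split words and punctuation
        kept = [w for w in words if w.lower() not in STOPWORDS]
        return " ".join(kept)

    chunk = "The goal of token reduction is to cut the cost of inference."
    print(prune_tokens(chunk))   # shorter text, roughly the same meaning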

Research on Token Reduction

Some of the general papers on token reduction strategies:

Token Efficiency in Reasoning and CoT

Token reduction is one of the main methods to improve the efficiency of reasoning models, especially those using Chain-of-Thought (CoT) algorithms. At a high level, token counts in CoT can be reduced by skipping reasoning steps or pruning reasoning paths, and there are also various low-level methods.

Blog articles on reasoning efficiency:

More research information on general efficiency optimization techniques for reasoning models:

Efficiency optimizations to Chain-of-Thought that aim to reduce token processing include (see the sketch after this list):

  • Concise CoT prompting (instructing the model to reason briefly)
  • Step skipping (omitting redundant intermediate reasoning steps)
  • Path pruning (abandoning unpromising reasoning branches early)
  • Early stopping (halting once an answer has been reached)
  • Reasoning token budgets (hard caps on interim token generation)
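
As a minimal sketch of a reasoning token budget, the code below caps the total tokens spent on interim CoT steps. The generate_step callable is a hypothetical stand-in for a model call, and the whitespace token count is a deliberate simplification.

    def run_budgeted_cot(question: str,
                         generate_step,          # hypothetical model call:
                                                 # (prompt) -> next reasoning step
                         max_reasoning_tokens: int = 500) -> str:
        """Cap the total tokens spent on interim Chain-of-Thought steps."""
        steps: list[str] = []
        spent = 0
        while spent < max_reasoning_tokens:
            step = generate_step(question + "\n" + "\n".join(steps))
            spent += len(step.split())           # crude whitespace token count
            steps.append(step)
            if step.strip().startswith("Answer:"):  # model signals completion
                break
        return steps[-1] if steps else ""

The budget hard-stops runaway reasoning chains; a production system would use the model's real tokenizer and a more reliable completion signal.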

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++: Generative AI programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging

More AI Research

Read more about: