Aussie AI

Vocabulary Trimming

  • Last Updated 24 January, 2025
  • by David Spuler, Ph.D.

Vocabulary trimming in LLMs is the reduction of the token vocabulary size as an inference optimization. A smaller vocabulary shrinks the embedding matrix and the final unembedding (logits) layer, thereby reducing both the computation cost and the memory size of the model weights.
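As a minimal sketch (hypothetical function names, NumPy-based), trimming the vocabulary amounts to keeping only a subset of rows of the embedding matrix (and, symmetrically, of the unembedding layer), together with a remapping from old token ids to new ones:

```python
import numpy as np

def trim_vocabulary(embedding, keep_ids):
    """Keep only the embedding rows for the retained token ids.

    embedding: [vocab_size, embed_dim] weight matrix.
    keep_ids:  list of token ids to retain.
    Returns the trimmed matrix and an old-id -> new-id mapping.
    """
    trimmed = embedding[keep_ids, :]   # [len(keep_ids), embed_dim]
    remap = {old: new for new, old in enumerate(keep_ids)}
    return trimmed, remap

# Toy example: vocabulary of 10 tokens, embedding dimension 4.
emb = np.random.rand(10, 4)
trimmed, remap = trim_vocabulary(emb, [0, 2, 5, 9])
print(trimmed.shape)   # (4, 4)
print(remap[5])        # 2
```

In a real model, the same row selection must also be applied to the output projection, and the tokenizer's merge rules must be filtered consistently with the retained ids.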

On the downside, a reduced vocabulary generally means that some texts must be expressed with more tokens. The token sequence length therefore increases for some input prompts, so this dimension of LLM layer processing gets worse even as the vocabulary dimension improves. Hence, there are important tradeoffs in this approach.
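The sequence-length tradeoff can be illustrated with a toy greedy longest-match tokenizer (a hypothetical simplification, not any particular LLM tokenizer): removing a multi-character token from the vocabulary forces the same text to be encoded with more, shorter tokens:

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization against a set of known tokens."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible match first.
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            # Fall back to a single character (assumed always encodable).
            tokens.append(text[i])
            i += 1
    return tokens

full_vocab = {"the", "th", "t", "h", "e"}
trimmed_vocab = {"t", "h", "e"}   # "the" and "th" trimmed away

print(greedy_tokenize("the", full_vocab))     # ['the'] -> 1 token
print(greedy_tokenize("the", trimmed_vocab))  # ['t', 'h', 'e'] -> 3 tokens
```

The same text costs three times as many tokens under the trimmed vocabulary, so any per-token savings must be weighed against the longer sequence.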

Vocabulary trimming and lexical shortlisting have been used in Neural Machine Translation (NMT) for translation between languages. This research predates much of the LLM research, with many NMT techniques using other types of AI models rather than LLMs and Transformers. The use of vocabulary trimming in LLMs remains largely unexplored and is an area warranting further research.

Related areas of LLM inference optimization include:

Research on Vocabulary Trimming

Research papers on reducing the size of an LLM vocabulary:
