Aussie AI Blog

State-of-the-Art LLM Backends

  • 26th August, 2024
  • by David Spuler, Ph.D.

It's hard to pin down the state of the art in LLM serving backends as used by the industry's top players. Much of this information is commercially sensitive and no longer appears in public research papers (perhaps it's in patents!). Nevertheless, some public papers and blog posts do cover the key issues, so let's look at a few of them.

Character.AI companion-bot backend. As detailed in its engineering blog, Character.AI serves a very high volume of traffic to its models. Its inference optimization techniques include:

  • INT8 quantization of weights and activations
  • KV cache quantization (also INT8)
  • MatMul INT8 kernels
  • INT8 quantization-aware training (QAT)
  • Hybrid attention with interleaved layers of local attention and global attention (with global attention for only approximately 1 in every 6 layers)
  • KV cache compression
  • Multi-Query Attention (MQA)
  • KV cache layer fusion
  • Session KV caching (for chat sessions)
  • Prefix KV caching
  • Sticky sessions (avoids copying session caches)
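
Character.AI's actual kernels aren't public, but the core idea behind several of the items above — symmetric INT8 quantization of weights and activations, with matmuls accumulated in integer arithmetic — can be sketched as follows (function names and the per-tensor scaling scheme are illustrative, not taken from their stack):

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map floats onto [-127, 127]."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float values from the INT8 codes."""
    return q.astype(np.float32) * scale

def matmul_int8(qa, sa, qb, sb):
    """INT8 matmul: integer dot products accumulated in INT32, rescaled once."""
    acc = qa.astype(np.int32) @ qb.astype(np.int32)
    return acc.astype(np.float32) * (sa * sb)

# Quantize both weights and activations, then multiply in integer arithmetic.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)  # "weights"
x = rng.standard_normal((8, 64)).astype(np.float32)   # "activations"
qw, sw = quantize_int8(w)
qx, sx = quantize_int8(x)
y_int8 = matmul_int8(qx, sx, qw, sw)
y_fp32 = x @ w
rel_err = np.linalg.norm(y_int8 - y_fp32) / np.linalg.norm(y_fp32)
```

The same quantize/dequantize pattern applies to KV cache quantization: store the cached keys and values as INT8 codes plus a scale, and dequantize (or compute directly in INT8) at attention time.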

Character.AI cites a 13.5x reduction in cost versus commercial model hosting (i.e., by serving inference in-house), and a 33x reduction compared to when it began optimizing.
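
Several items in the list above concern KV cache reuse. The gist of prefix KV caching — reuse cached K/V tensors for the longest already-seen prefix and compute only the new tail — can be sketched like this (the "KV builder" is a toy stand-in for a real attention projection, and all names are illustrative):

```python
import numpy as np

def kv_for_token(tok: int, d: int = 8) -> np.ndarray:
    """Toy stand-in for computing one token's K/V entry (deterministic)."""
    rng = np.random.default_rng(tok)
    return rng.standard_normal(d)

cache: dict = {}   # maps token-sequence prefix -> stacked K/V entries

def kv_with_prefix_cache(tokens: list):
    """Reuse cached K/V for the longest cached prefix; compute only the tail.
    Returns the full K/V stack and how many tokens were actually computed."""
    for cut in range(len(tokens), 0, -1):
        prefix = tuple(tokens[:cut])
        if prefix in cache:
            kv = cache[prefix]
            computed = len(tokens) - cut
            break
    else:
        kv, computed = np.empty((0, 8)), len(tokens)
    for tok in tokens[len(tokens) - computed:]:
        kv = np.vstack([kv, kv_for_token(tok)])
    cache[tuple(tokens)] = kv
    return kv, computed

kv1, c1 = kv_with_prefix_cache([1, 2, 3, 4])        # cold: computes all 4 tokens
kv2, c2 = kv_with_prefix_cache([1, 2, 3, 4, 5, 6])  # warm: computes only 2
```

Session KV caching and sticky sessions follow the same logic at the chat-session level: keep the session's cache resident on one server so follow-up turns never recompute (or copy) the conversation prefix.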

Apple Intelligence for On-Device Inference. In announcing its "Apple Intelligence" initiative in June 2024, Apple released certain information about the platform, specifically in relation to on-device execution of LLMs on iPhones and Macs. The exact details are somewhat opaque, but some aspects include:

  • M-series CPUs with NPU capabilities
  • 3B base LLM (with 16-bit precision)
  • LoRA adapters for fine-tuning (with 16-bit parameters, sized "in the tens of millions")
  • Multi-LoRA inference
  • Grouped Query Attention (GQA)
  • Low-bit quantization for some parameters (mixed 2-bit and 4-bit quantizations)
  • Talaria (Apple's model performance analysis tool)
  • KV cache quantization (bit precision undisclosed)
  • KV cache optimizations for "KV cache update" (details undisclosed)
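
Apple hasn't published the mechanics of its adapters, but the general pattern of multi-LoRA inference — one frozen base weight shared across all requests, with a small per-task low-rank adapter applied on top — can be sketched as follows (dimensions, names, and the scaling convention are illustrative):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha: float = 16.0):
    """y = x @ W + (alpha / r) * x @ A @ B, where A (d x r) and B (r x d)
    form a low-rank adapter over the frozen base weight W (d x d)."""
    r = A.shape[1]
    return x @ W + (alpha / r) * (x @ A) @ B

rng = np.random.default_rng(1)
d, r = 32, 4
W = rng.standard_normal((d, d))   # shared base model weight (frozen)
adapters = {                      # one small adapter per task/feature
    "summarize": (rng.standard_normal((d, r)), np.zeros((r, d))),
    "rewrite":   (rng.standard_normal((d, r)), rng.standard_normal((r, d))),
}
x = rng.standard_normal((1, d))

# Multi-LoRA: each request routes through the base weights plus its own adapter.
y_sum = lora_forward(x, W, *adapters["summarize"])  # B = 0 -> same as base model
y_rw  = lora_forward(x, W, *adapters["rewrite"])
```

Since each adapter is only on the order of tens of millions of parameters (versus 3B for the base model), many adapters can be kept resident and swapped per request at negligible cost.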

With these optimizations, Apple reported iPhone 15 Pro performance of a time-to-first-token latency of about 0.6 milliseconds per prompt token, and a decoding rate of 30 tokens per second.
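
A quick back-of-the-envelope calculation from those two numbers (the prompt and output lengths here are made up for illustration):

```python
def end_to_end_latency_s(prompt_tokens: int, output_tokens: int,
                         prefill_ms_per_token: float = 0.6,
                         decode_tokens_per_s: float = 30.0) -> float:
    """Total latency = prefill time (time to first token) + decode time."""
    ttft_s = prompt_tokens * prefill_ms_per_token / 1000.0
    decode_s = output_tokens / decode_tokens_per_s
    return ttft_s + decode_s

# e.g. a 1,000-token prompt with a 120-token reply:
# 0.6 s of prefill plus 4.0 s of decoding = 4.6 s end to end.
latency = end_to_end_latency_s(1000, 120)
```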

Together AI data center networking. In various papers and announcements, Together AI has disclosed details of its backend platform. Example software optimizations include:

  • CUDA backend for NVIDIA GPUs
  • Flash Attention
  • Flash Decoding
  • Medusa decoding (multi-token prediction)
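
Flash Attention's core trick is computing softmax attention over K/V tiles with a running maximum and running denominator, so the full n-by-n score matrix is never materialized. The math (not Together AI's CUDA kernel) can be sketched in NumPy against a naive reference:

```python
import numpy as np

def attention_naive(Q, K, V):
    """Reference attention: materializes the full n x n score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def attention_online(Q, K, V, tile: int = 16):
    """Flash-style streaming attention: process K/V in tiles, keeping a
    running row max (m), running softmax denominator (l), and a rescaled
    unnormalized partial output (o)."""
    n, d = Q.shape
    o = np.zeros((n, d))
    m = np.full((n, 1), -np.inf)   # running maximum per query row
    l = np.zeros((n, 1))           # running softmax denominator
    for j in range(0, K.shape[0], tile):
        Kj, Vj = K[j:j+tile], V[j:j+tile]
        S = Q @ Kj.T / np.sqrt(d)                        # scores for this tile only
        m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
        P = np.exp(S - m_new)                            # tile-local exponentials
        correction = np.exp(m - m_new)                   # rescale earlier partials
        l = l * correction + P.sum(axis=-1, keepdims=True)
        o = o * correction + P @ Vj
        m = m_new
    return o / l

rng = np.random.default_rng(2)
Q = rng.standard_normal((48, 8))
K = rng.standard_normal((48, 8))
V = rng.standard_normal((48, 8))
```

Flash Decoding applies the same tiling idea across the KV cache during the decoding phase, parallelizing over the sequence dimension when there is only one query token per step.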

Together AI has also described its GPU cluster management, including the networking aspects and validation steps. This involves techniques and components such as:

  • H100 GPUs
  • InfiniBand networking
  • NVLink and NVSwitch
  • NCCL
  • HPC-X
  • SLURM (workload management)
  • Remote Direct Memory Access (RDMA) (GPUDirect RDMA)
  • Telegraf (open source monitoring)

Important steps in the process include:

  • GPU validation
  • Network validation
  • Storage validation
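
Together AI's actual validation suite isn't public, but the overall pattern — run each check on a node and admit it to the cluster only if everything passes — might look like the following toy sketch (the check bodies are placeholders; a real GPU burn-in would run known-answer kernels per device, and a real network check needs a peer, so it's omitted here):

```python
import os
import tempfile
import numpy as np

def check_gpu() -> bool:
    """Placeholder compute check: verify a matmul produces consistent
    results (stand-in for per-device known-answer burn-in kernels)."""
    rng = np.random.default_rng(0)
    a = rng.standard_normal((256, 256))
    return np.allclose(a @ np.eye(256), a)

def check_storage() -> bool:
    """Write-then-read a file and verify the bytes round-trip intact."""
    data = os.urandom(1 << 16)
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(data)
        path = f.name
    try:
        with open(path, "rb") as f:
            return f.read() == data
    finally:
        os.remove(path)

def validate_node(checks: dict) -> dict:
    """Run each named validation check and collect pass/fail results."""
    return {name: fn() for name, fn in checks.items()}

results = validate_node({"gpu": check_gpu, "storage": check_storage})
healthy = all(results.values())   # only healthy nodes join the cluster
```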
