Aussie AI

Serving and Deployment

  • Last Updated 25 April, 2026
  • by David Spuler, Ph.D.

Serving

Serving is the practical matter of how to architect the full production application around the LLM. Other components may include a web server, application server, RAG datastore, retriever, load balancer, and more. Several serving-level techniques also affect the speed of inference:

  • Batching
  • Prefill versus decoding phase
  • Scheduling
  • Load balancing
  • Frameworks (backend)
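
To make the batching item concrete, here is a minimal toy sketch comparing static batching with continuous batching. It is a simplified model under stated assumptions, not how any particular framework implements it: each request is represented only by the number of decode steps it needs, one step costs the same regardless of how many batch slots are occupied, and the prefill phase is ignored. The function names `static_batching_steps` and `continuous_batching_steps` are hypothetical, chosen for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Total decode steps when each batch runs until its longest request ends.

    Toy assumption: requests are grouped in arrival order, and short requests
    leave their slots idle until the whole batch finishes.
    """
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """Total decode steps when a freed slot is refilled before the next step.

    Toy assumption: a waiting request can join the running batch at any
    step boundary (prefill cost is ignored).
    """
    queue = deque(lengths)   # requests still waiting, in arrival order
    active = []              # remaining decode steps per running request
    steps = 0
    while queue or active:
        # Fill any free slots from the waiting queue.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Every running request emits one token; finished requests leave.
        active = [n - 1 for n in active if n - 1 > 0]
    return steps


if __name__ == "__main__":
    lengths = [8, 2, 2, 2]   # hypothetical decode lengths for four requests
    print(static_batching_steps(lengths, batch_size=2))      # 10
    print(continuous_batching_steps(lengths, batch_size=2))  # 8
```

In this toy example the long request holds one slot for all eight steps, while the three short requests pass through the other slot one after another, so the continuous scheduler finishes in fewer steps than the static one. Real serving systems also have to account for prefill cost, KV cache memory, and per-step latency that grows with batch size, none of which are modeled here.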

LLM Serving: Book Excerpts and Blog Articles

Free book excerpts (full-text chapters online and free PDF downloads) and related Aussie AI blog articles:

Research on LLM Serving

Recently, there has been an explosion of papers about the practical aspects of deployment, orchestration, and serving of LLM inference. Here are some of the papers:

Deployment

Research on LLM deployment:

Batching

Research papers on batching:

Continuous Batching

Research papers on continuous batching:

Frameworks

Research on inference frameworks as part of serving:

Serverless

Scheduling

Load Balancing

Research papers on AI load balancing:

Networking

Research papers on networking optimizations for LLMs:

AI Tech Stack

Research on AI tech stacks:
