Aussie AI

AI Middleware

  • Last Updated 12 December, 2024
  • by David Spuler, Ph.D.

AI middleware is the layer in the tech stack above the LLMs. It can provide services such as prompt extensions, conversational history management, and other relatively low-level functionality. As such, middleware can operate as a wrapper around remote AI API services, or can run near a self-hosted open source LLM in the same local servers.

Features of LLM Middleware

Some of the features that middleware components typically provide, in a layer above the individual LLM inference engine, include:

  • Multi-LLM access (helping to avoid "vendor lock-in")
  • Prompt templating (e.g., adding global instructions)
  • Programmatic prompting (i.e., automatic prompt improvement)
  • Conversational history management (chatbots and Q&A versions)
  • Prompt caching (e.g., prefix KV caching)
  • Logging
  • Monitoring and observability
  • Reporting and statistics tracking
  • User identity and security credential management
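Two of the features above, multi-LLM access and prompt templating, can be sketched with a few lines of Python. This is a hypothetical illustration, not a real library: the `Middleware` class, its backend interface, and the stub backends are all invented for the example.

```python
# Minimal sketch of an AI middleware layer (hypothetical interfaces, not a real library).
# It illustrates multi-LLM access via a common backend interface (avoiding vendor
# lock-in), prompt templating with a global instruction, and simple request logging.

from typing import Callable, Dict, List, Tuple

class Middleware:
    def __init__(self, backends: Dict[str, Callable[[str], str]],
                 template: str = "{prompt}"):
        self.backends = backends          # name -> LLM call; swap vendors freely
        self.template = template          # global prompt template
        self.log: List[Tuple[str, str, str]] = []   # logging/observability hook

    def query(self, prompt: str, backend: str) -> str:
        full_prompt = self.template.format(prompt=prompt)   # prompt templating
        reply = self.backends[backend](full_prompt)         # dispatch to chosen LLM
        self.log.append((backend, full_prompt, reply))      # record for monitoring
        return reply

# Stub backends standing in for real remote or local LLM APIs:
backends = {
    "model_a": lambda p: "A:" + p,
    "model_b": lambda p: "B:" + p,
}
mw = Middleware(backends, template="Answer concisely. {prompt}")
print(mw.query("What is middleware?", "model_a"))
# -> A:Answer concisely. What is middleware?
```

Because every backend is hidden behind the same callable interface, switching vendors is a one-line change in the `backends` dictionary rather than a rewrite of application code.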

Generally speaking, most components of the RAG architecture are not considered to fit under the category of "middleware." Components such as vector databases, rerankers, packers, and other RAG components have their own category in the RAG stack.

However, recent advances in Chain-of-Thought and other multi-step, inference-based reasoning algorithms have spawned another use of AI middleware at a much higher level. An AI middleware layer can wrap individual LLM queries into sequences of multiple steps, thereby implementing reasoning algorithms such as:

  • Reflection
  • LLM as Judge
  • Chain-of-Thought
  • Best-of-N
  • Skeleton-of-Thought

And there are many more such multi-step reasoning algorithms.
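One of the simplest of these multi-step wrappers, Best-of-N, can be sketched as follows. This is a hypothetical illustration: the `best_of_n` function, the stub LLM, and the length-based judge are all invented for the example; in practice the judge would often be another LLM call (i.e., LLM as Judge).

```python
# Hypothetical sketch of a middleware-level reasoning wrapper implementing Best-of-N:
# sample N candidate answers from the LLM, then keep the one a scoring function
# (a "judge") rates highest. The stub LLM and judge below stand in for real calls.

import itertools
from typing import Callable, List

def best_of_n(llm: Callable[[str], str], judge: Callable[[str], float],
              prompt: str, n: int = 3) -> str:
    candidates: List[str] = [llm(prompt) for _ in range(n)]  # N independent samples
    return max(candidates, key=judge)                        # keep the best-scored one

# Stub LLM cycling through answers of varying length, and a judge preferring longer:
answers = itertools.cycle(["short", "a bit longer", "the longest candidate answer"])
llm = lambda prompt: next(answers)
judge = len

print(best_of_n(llm, judge, "Explain AI middleware.", n=3))
# -> the longest candidate answer
```

Note that the wrapper issues multiple LLM queries per user request, which is why these reasoning algorithms trade extra inference cost for better answers.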

Research on AI Middleware

Research papers on AI middleware components and the overall AI tech stack:

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications: new book on planning and building LLM applications:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++: generative AI programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging

More AI Research

Read more about: