Aussie AI

AI Middleware

  • Last Updated 12 December, 2024
  • by David Spuler, Ph.D.

AI middleware is the layer in the tech stack above the LLMs. It can provide services such as prompt extensions, conversational history management, and other relatively low-level functionality. As such, middleware can operate as a wrapper around remote AI API services, or can run near a self-hosted open source LLM in the same local servers.

Features of LLM Middleware

Some of the features that middleware components typically provide, in a layer above the individual LLM inference engine, include:

  • Multi-LLM access (helping to avoid "vendor lock-in")
  • Prompt templating (e.g., adding global instructions)
  • Programmatic prompting (i.e., automatic prompt improvement)
  • Conversational history management (chatbots and Q&A versions)
  • Prompt caching (e.g., prefix KV caching)
  • Logging
  • Monitoring and observability
  • Reporting and statistics tracking
  • User identity and security credential management
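Two of the features above, multi-LLM access and prompt templating, can be sketched with a few lines of Python. This is a hypothetical illustration, not a real library: the `Middleware` class, its backend interface, and the stub backends are all invented for the example.

```python
# Minimal sketch of an AI middleware layer (hypothetical interfaces, not a real library).
# It illustrates multi-LLM access via a common backend interface (avoiding vendor
# lock-in), prompt templating with a global instruction, and simple request logging.

from typing import Callable, Dict, List, Tuple

class Middleware:
    def __init__(self, backends: Dict[str, Callable[[str], str]],
                 template: str = "{prompt}"):
        self.backends = backends          # name -> LLM call; swap vendors freely
        self.template = template          # global prompt template
        self.log: List[Tuple[str, str, str]] = []   # logging/observability hook

    def query(self, prompt: str, backend: str) -> str:
        full_prompt = self.template.format(prompt=prompt)   # prompt templating
        reply = self.backends[backend](full_prompt)         # dispatch to chosen LLM
        self.log.append((backend, full_prompt, reply))      # record for monitoring
        return reply

# Stub backends standing in for real remote or local LLM APIs:
backends = {
    "model_a": lambda p: "A:" + p,
    "model_b": lambda p: "B:" + p,
}
mw = Middleware(backends, template="Answer concisely. {prompt}")
print(mw.query("What is middleware?", "model_a"))
# -> A:Answer concisely. What is middleware?
```

Because every backend is hidden behind the same callable interface, switching vendors is a one-line change in the `backends` dictionary rather than a rewrite of application code.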

Generally speaking, most components of the RAG architecture are not considered to fit under the category of "middleware." Components such as vector databases, rerankers, packers, and other RAG components have their own category in the RAG stack.

However, recent advances in Chain-of-Thought and other multi-step, inference-based reasoning algorithms have spawned another use of AI middleware at a much higher level. An AI middleware layer can wrap individual LLM queries into sequences of multiple steps, thereby implementing reasoning algorithms such as:

  • Reflection
  • LLM as Judge
  • Chain-of-Thought
  • Best-of-N
  • Skeleton-of-Thought

And there are many more such multi-step reasoning algorithms.
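One of the simplest of these multi-step wrappers, Best-of-N, can be sketched as follows. This is a hypothetical illustration: the `best_of_n` function, the stub LLM, and the length-based judge are all invented for the example; in practice the judge would often be another LLM call (i.e., LLM as Judge).

```python
# Hypothetical sketch of a middleware-level reasoning wrapper implementing Best-of-N:
# sample N candidate answers from the LLM, then keep the one a scoring function
# (a "judge") rates highest. The stub LLM and judge below stand in for real calls.

import itertools
from typing import Callable, List

def best_of_n(llm: Callable[[str], str], judge: Callable[[str], float],
              prompt: str, n: int = 3) -> str:
    candidates: List[str] = [llm(prompt) for _ in range(n)]  # N independent samples
    return max(candidates, key=judge)                        # keep the best-scored one

# Stub LLM cycling through answers of varying length, and a judge preferring longer:
answers = itertools.cycle(["short", "a bit longer", "the longest candidate answer"])
llm = lambda prompt: next(answers)
judge = len

print(best_of_n(llm, judge, "Explain AI middleware.", n=3))
# -> the longest candidate answer
```

Note that the wrapper issues multiple LLM queries per user request, which is why these reasoning algorithms trade extra inference cost for better answers.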

Research on AI Middleware

Research papers on AI middleware components and the overall AI tech stack:

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications: new book on planning and building LLM applications:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++: generative AI programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging

More AI Research

Read more about: