Aussie AI
Chapter 12. Architectures of AI Projects
Book Excerpt from "Generative AI Applications: Planning, Design and Implementation"
by David Spuler
Easier Architectures
Here are some thoughts on the easier ways to build an AI project:
- Renting GPU juice from cloud vendors or GPU-specific hosting vendors is easier than buying GPUs from NVIDIA for your own servers.
- Cloud-based AI is easier and more solid than on-device inference with AI phones or AI PCs, which are both newer and unproven.
- RAG architectures using document databases are easier than fine-tuning a model with those documents (even with fine-tuning speedups like LoRA).
- Agentic architectures with retrieval capabilities are easier than fine-tuning a model.
- Wrapping ChatGPT is easier than using open source models, but the gap is closing with various good open source servers and toolchains for better productionization.
- Foundation models start with the letter ’f’ for a reason. Don’t try to make your own from scratch, even if it’s only small.
To be a little more specific, here’s an approximate hierarchy of AI architecture complexity, from easiest to hardest, in terms of the difficulty of building or coding:
- Wrap architecture (stateless)
- Wrap architecture (stateful, with conversation history)
- RAG architecture (stateless)
- RAG architecture (stateful)
- Agentic architecture (stateless)
- Agentic architecture (stateful)
- Fine-tuning with LoRA (PEFT)
- Fine-tuning with multi-LoRA serving
- Fine-tuning (full parameter)
- AI platform (based on commercial or open-source models, with fine-tuning)
- Foundation model training (from scratch)
I’m not sure where to put “on-device” in that list. It’s still too early in the cycle to really tell what the SDKs for Google Android and Apple Intelligence offer.
For all of the “wrap” versions, it’s easier to wrap a commercial service (e.g., OpenAI API), because that does all of the backend work for you. Using an open source model requires setting up your own backend server with both an LLM and an open source inference engine. These are both quite freely available online, but it’s still a bit more work than using a paid hosting service.
Components of AI Architectures
A lot of the components of a full production AI architecture are similar to any other online application. However, there are some additional AI-specific components, of which the basic ones are:
- LLM
- Backend inference engine
And every AI application will also require work on the client side:
- Prompt engineering module
- UX components
If it’s a RAG architecture, there are some extra components (a minimal sketch of how they fit together follows this list):
- Datastore
- Embedding module
- Retriever (keyword and/or vector lookup)
- Vector database
- Query merge component
- Citation manager
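To make those RAG components a little more concrete, here’s a minimal sketch of how they fit together for a single query. Everything here is a toy stand-in: the embed() and call_llm() functions are placeholders for your real embedding model and inference engine, and the “vector database” is just an in-memory list.

import numpy as np
def embed(text: str) -> np.ndarray:
    # Placeholder: a real system calls an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(8)
def call_llm(prompt: str) -> str:
    # Placeholder: a real system calls the LLM inference engine here.
    return f"(LLM answer based on a prompt of {len(prompt)} characters)"
# Tiny in-memory "vector database": (document text, embedding) pairs.
documents = ["Refunds are processed within 5 days.", "Shipping is free over $50."]
vector_db = [(doc, embed(doc)) for doc in documents]
def retrieve(query: str, top_k: int = 2) -> list:
    # Retriever: rank documents by cosine similarity to the query embedding.
    q = embed(query)
    scored = [(float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))), doc)
              for doc, v in vector_db]
    return [doc for _, doc in sorted(scored, reverse=True)[:top_k]]
def rag_answer(query: str) -> str:
    # Query merge: combine the retrieved chunks with the user's question.
    chunks = retrieve(query)
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    prompt = f"Answer using only these sources:\n{context}\n\nQuestion: {query}"
    answer = call_llm(prompt)
    # Citation manager (simplified): report which chunks were supplied.
    return answer + "\nSources: " + ", ".join(f"[{i + 1}]" for i in range(len(chunks)))
print(rag_answer("How long do refunds take?"))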
Some of the less obvious extra components to consider in a live generative AI deployment include:
- Observability monitoring (instrumentation)
- Conversation state manager
- Caching module (there are about ten types! A basic exact-match cache is sketched below)
- Prompt shield
- Safety monitor
- LoRA adapters (or “multi-LoRA”)
- Tool integrations (“function calling”)
I’m sure I could think of a few more to add to the above list, but that’ll do for now.
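As for the caching module mentioned above, here’s a sketch of the most basic kind: an exact-match cache of prompt-and-response pairs. The call_llm() function is again a placeholder for your real inference call.

from functools import lru_cache
def call_llm(prompt: str) -> str:
    # Placeholder for the real (slow, expensive) LLM inference call.
    return f"(answer to: {prompt})"
@lru_cache(maxsize=1024)
def cached_llm(prompt: str) -> str:
    # Identical prompts are answered from memory instead of re-running inference.
    return call_llm(prompt)
print(cached_llm("What is RAG?"))   # calls the LLM
print(cached_llm("What is RAG?"))   # served from the cache

Real caching modules get much fancier than this: normalizing prompts, caching embeddings or KV data, or matching on semantic similarity rather than exact text.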
Software Architecture
If you’re building a significant portion of the software architecture for your AI project yourself, then remember that nothing is more important to the speed of a program than its architecture. I mean, look at AI. The whole architecture is a massive fail, endlessly bloated with far too many weights and a brute-force algorithm. Sadly, that’s the best we’ve come up with so far, but there’s a lot of research on these architectural issues that may eventually fix them.
Anyway, as any professional programmer will tell you, it’s not difficult to choose a better architecture for your AI project. Fortunately, the best software architecture in the world is well-known to everyone, and is clearly this one:
- Object-oriented objects (OOO)
- Client-server
- Server-client
- Message passing
- Thin client
- Gamification
- Virtualization with Vectorization
- Model-View-Controller (MVC)
- UI-Application-Database (3-level)
- Event-Driven Architecture (EDA)
- #include “beer.h”
- Clouds
- Fog computing
- Postel’s law
- RTFM
- Microservices architecture
- Service Oriented Architecture (SOA)
- RESTful API architecture
- Intelligent Autonomous Agent (IAA)
- Intentional virality
- Goto considered helpful
Actually, sorry, that wasn’t the best architecture in the world; it was just a tribute.
AI Tech Stack
The tech stack for an AI project is similar to a non-AI project, with a few extra components. The choice of underlying hardware (i.e., GPUs) also matters much more than in many other types of projects. The tech stack looks something like this:
- User interface (client)
- Web server (e.g., Apache or Nginx)
- Application server
- Load balancer (e.g., HAProxy, Nginx, Traefik)
- Message Queue/Event Streaming (e.g., RabbitMQ or Apache Kafka)
- AI request manager
- AI Inference Engine (and model)
- Operating system (e.g., Linux vs Windows)
- CPU hardware (e.g., Intel vs AMD)
- GPU hardware (e.g., NVIDIA V100 vs A100)
Some of these layers are optional or could be merged into a single component. Also, if you’re using a remote hosted AI engine, whether open source hosting or wrapping a commercial engine through their API, then the bottom layers are not always your responsibility.
AI engine choices. How much of your AI tech stack will you control? Probably, at least for your first AI project, you’re renting a commercial model by the token, hosted in the cloud. In this case, you have almost no control at all.
Alternatively, if you’re running your own open source model with an open inference engine (e.g., vLLM or Llama.cpp) on Linux servers in the basement, then you’ve got more problems, ahem, I mean, more flexibility. With full control over the hardware and software, it makes sense to make symbiotic choices that allow maximum optimization of the combined system. For example, if you’ve decided to run the system on a particular GPU version, then your AI engine can assume that hardware acceleration is available, and you don’t need to waste resources ensuring your engine’s software runs on any other hardware platform. However, you might need to periodically check that your GPU is still running an LLM and not mining Bitcoin.
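As an aside, the client-side code for a self-hosted setup can look almost identical to a call to a commercial API, just pointed at your own server. Here’s a rough sketch that assumes you’ve already started an inference server exposing an OpenAI-compatible HTTP endpoint (vLLM provides one, and Llama.cpp’s server has a compatible mode) on localhost port 8000; the URL, port, and model name all depend on your own deployment.

import json
import urllib.request
def local_chat(prompt: str, model: str = "my-local-model") -> str:
    # The model name must match whatever model your server is actually serving.
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    req = urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
print(local_chat("Summarize our returns policy in one sentence."))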
Chains and Toolchains
AI architectures involve a lot of components, rather than just the Transformer engine and a model. These components are typically called “tools” and usually have these features:
- Receive input (e.g., text prompt)
- Process it in some way
- Produce output (often text, but not always)
The idea is that each tool is:
- Modular
- Specific (one task)
- Reusable
When we put more than one of these “tools” in a sequence, such that the second tool uses the output of the first tool as its input, this is called a “chain” or a “toolchain”. This terminology is particularly formalized in LangChain architectures.
We’ve seen this idea before. If you’re a Linux boffin, this is a toolchain, too:
cat input.txt | sort | uniq -c | sort -rn > output.txt
The main idea is that each tool does a specific task, and they can be put together in a sequence. This is really a founding architectural idea from the original Unix systems.
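The same idea translates directly into code. Here’s a minimal sketch of a chain in plain Python, where each “tool” is just a function and the chain feeds one tool’s output into the next; the two tools here are trivial placeholders.

from collections import Counter
def clean(text: str) -> str:
    # Tool 1: normalize the input text.
    return text.lower().strip()
def count_words(text: str) -> str:
    # Tool 2: count word frequencies, most common first.
    counts = Counter(text.split())
    return "\n".join(f"{n} {word}" for word, n in counts.most_common())
def chain(text: str, tools) -> str:
    # Each tool waits for the previous tool's output, just like a Unix pipeline.
    for tool in tools:
        text = tool(text)
    return text
print(chain("The cat sat on the mat ", [clean, count_words]))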
The main problem with toolchains is that they aren’t easily parallelizable: the second tool has to wait for the output of the first tool, so execution is inherently sequential. This is often, but not always, a real limitation of the idiom. One way a chain can be parallelized is if the second tool only needs partial output from the first tool to do its work, in which case the tools can run concurrently using a streaming model, which is, again, the original way that Unix pipes worked.
Wrap Architectures
Wrap architectures are those where you create a “wrapper” around someone else’s LLM. This is where you use someone else’s AI-as-a-Service (AaaS) platform, and pay by the token (or by the hour perhaps). The advantages of this are:
- No server work
- No LLM work
Basically, half of your AI app is pre-built for you. Your main work, sketched briefly after this list, is:
- Prompt engineering
- UX improvements
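As a sketch of what that work looks like, here’s a minimal stateless wrap: a little prompt engineering wrapped around a single API call. The endpoint, header format, and model name follow the OpenAI API at the time of writing; substitute your own vendor’s equivalents, and note that a real app would add error handling and retries.

import json
import os
import urllib.request
SYSTEM_PROMPT = "You are a polite support assistant for the Acme store."
def wrap_query(user_question: str) -> str:
    # Prompt engineering: wrap the raw user input in your own instructions.
    payload = {
        "model": "gpt-4o-mini",
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_question},
        ],
    }
    req = urllib.request.Request(
        "https://api.openai.com/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
print(wrap_query("Where is my order?"))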
The main downsides are:
- Pesky invoices with big numbers for per-token charges.
- You pay for both “up” and “down” tokens (i.e., input and output).
- You also pay for “embeddings” (vector-based).
- You have no control of the server side.
- Privacy issues regarding where user data is stored.
For some wrap architectures, consideration needs to be given to who pays for the LLM. It’s quite possible to wrap an LLM with the user paying, but to do so, the user needs to provide their own “keys” for the API. This can help with some of those “pesky” per-token charges.
But I don’t know why I’m mentioning all these non-issues. You just send the invoices to your boss, whose only response will be: “Spend more!” The issue of “no control” is probably not that significant, because these companies are likely better at running LLMs than you are. Also, the main thing you’d want to change is to fine-tune an LLM with your proprietary data, or upload your data for a RAG architecture, and all of the major LLM platforms have that covered.
The privacy issues are somewhat problematic. If customers are your target for the LLM app, then it’s their private data, but it’s your problem. If your app is internal, then you have employees “leaking” proprietary info to a third-party vendor whenever they use your “internal” AI app. But, come on, company employees are already doing that by getting ChatGPT to edit their M&A takeover memo, aren’t they?
Another consideration is whether you allow the user to bring their own AI. That is, instead of wrapping OpenAI and asking for their OpenAI keys, users could bring Claude or Gemini or Mistral or Grok or HuggingFace or a few dozen more. In such cases, you are literally wrapping an AI you do not necessarily know much about. It may be useful to control this to some extent, but that may increase your testing costs.
If you are providing the API credentials, you need to be careful not to expose them anywhere. Do not hardcode them into your user’s application in any form. Instead, you’ll need to add your own server between the user and the API endpoints. It’s also good practice to rotate the API key frequently, as there is a market for compromised AI keys.
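Here’s a rough sketch of that idea: a thin server-side proxy where the client only ever talks to your server, and the vendor API key lives in a server-side environment variable that you can rotate without touching the client. Flask and requests are used for brevity, and the endpoint and model name again follow OpenAI conventions.

import os
import requests
from flask import Flask, jsonify, request
app = Flask(__name__)
API_KEY = os.environ["OPENAI_API_KEY"]   # stays on the server, easy to rotate
@app.route("/chat", methods=["POST"])
def chat():
    user_message = request.get_json()["message"]
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": user_message}],
        },
        timeout=60,
    )
    answer = resp.json()["choices"][0]["message"]["content"]
    return jsonify({"answer": answer})
if __name__ == "__main__":
    app.run(port=5000)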
On-Premises LLM Architecture
On-premises AI architectures are those where you run the LLM yourself on your own servers. This is a great architecture compared to commercial API wrap architectures, because you don’t have to pay those nasty per-token charges to your LLM provider. Instead, what you do is use open source everything:
- Linux server software
- Apache or Nginx web server
- Open source inference frameworks
- Open source LLMs
And you only need to provide:
- Server hardware
- GPU chips
- Spare parts
- Firewall devices
- Liquid cooling
- Raised floor space
- System administrators
- Build engineers
- Linux experts
- Open source LLM experts
- System builders
- Network engineers
So, totally free!
Client architecture. For an on-premises LLM deployment, the rest of the AI app, on the client side, is very similar to a “wrap” architecture with a third-party LLM vendor, except that you’re wrapping your own LLM. You can use an on-premises self-wrap architecture with any clients: web browser, apps, etc. All of the work that you do for prompt engineering and UX is effectively the same.
But an on-premises implementation of LLMs does require you to do some other extra modules of work:
- Security credential management (i.e., logins, passwords, forgotten password, etc.)
- Fine-tuning methods for open source models using your data (a minimal LoRA sketch follows this list).
- Monitoring and “observability” (LLMOps).
- Performance tuning and scaling.
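For the fine-tuning item in the list above, here’s a minimal sketch of attaching LoRA adapters to an open source model with the Hugging Face peft library. The small model name is only an example, and the actual training loop, data preparation, and evaluation are all omitted.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model
base = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
config = LoraConfig(
    r=8,                                  # low-rank adapter dimension
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()        # only the adapter weights are trainable
# ...run your usual training loop on your proprietary data, then save just
# the small adapter weights (not the whole base model):
model.save_pretrained("my-lora-adapter")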
Private Cloud LLM Architecture
The half-way measure is to have your own “private cloud” and that’s not an oxymoron at all! I’m not even sure what that really means, but I’ve seen it plenty of times in nice glossy articles. Maybe it means:
- Running your own hardware servers, but with cloud server software on them, or
- Renting cloud hardware, but putting your own software on it.
The first one is basically the same as “on-premises”, is it not? The computers are down in the basement, and if you’re good at this stuff, then you’re running “cloud software” like Apache, or maybe Nginx if you’re really trendy, and therefore it’s a “cloud” deployment. If you’re not good at this stuff, then someone else is sending spam emails out from your basement.
Alternatively, you can rent physical servers from one of the hyperscalers, like AWS, Azure, or GCP. Instead of running your own physical hardware, you can rent access to their servers, whether "bare metal" servers or per-hour rentals, or whatever you can talk them into. This is really a cloud architecture, and not really that private if you ask me, but you didn’t, so I’ll be quiet about that.
Two-Step Architectures
All of these architectures have different backends, but the front-end piece is much the same for each of them. The details differ depending on what type of app you’re creating, in terms of business logic, but you’re still mainly doing prompt engineering and user interface work for the non-server parts.
In an ideal world, you could just write up some JavaScript to do the GUI components and also do some fancy string concatenation for the prompt engineering. No need for you to do anything else!
Back in the real world, you need what I call a “two-step architecture,” because regardless of what’s happening with the LLM, you still need all of the following (a minimal sketch appears at the end of this section):
- App serving basics — e.g., IP address location, DNS, etc.
- Basic back-end web serving capabilities — e.g., serving out those JavaScript files (and images, CSS files, fonts, blah blah blah).
- User logins and credential management
- Monitoring and management features
Oh, no! I just reverted to 1990s vocabulary. The correct terms are “observability” and “orchestration” and “LLMOps” for doing, you know, monitoring and management.
Anyway, what you’ll find is that you still need all your Linux backend software developers, even if you’re just using the ChatGPT API. That old backend computer expertise is important for the basic server infrastructure, even if you can offload the new-fangled LLM expertise to the ChatGPT API (for a fee).
And yeah, I know, we had two-step architectures back in the 1990s, too. But I forgot what they were called, so I invented my own terminology. See how AI hype works?
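To make the two steps concrete, here’s a minimal sketch of the shape of such a server: ordinary web serving of your front-end files on one path, and the AI request path on another. Flask is used for brevity, and ask_llm() is a placeholder for the call to your LLM vendor or your own inference server.

from flask import Flask, jsonify, request, send_from_directory
app = Flask(__name__, static_folder="static")
def ask_llm(message: str) -> str:
    # Placeholder: in practice this is the API call to the LLM backend.
    return f"(LLM reply to: {message})"
@app.route("/")
def index():
    # Step one: plain old web serving of your JavaScript front end.
    return send_from_directory(app.static_folder, "index.html")
@app.route("/api/chat", methods=["POST"])
def chat():
    # Step two: the AI request path, behind your own server.
    app.logger.info("chat request received")   # basic monitoring hook
    return jsonify({"answer": ask_llm(request.get_json()["message"])})
if __name__ == "__main__":
    app.run(port=8080)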
Security Credential Management
One of the things we’ve glossed over in the above discussion is security. How do you validate that only your legitimate users can use your LLM-based app?
This is a problem even for wrap architectures with third-party APIs. For example, with the OpenAI API, you get a set of credentials to use the API, and you’ll get billed for anyone accessing the API with these hexadecimal strings.
But where do you put this credential file?
If you’re doing a phone app or a downloaded application for Windows or Mac, you can’t just put the credential file in there. It’s not protected, and someone will find it and abuse it. In theory, you could hard-code it into your client-side app with some kind of encryption, but it would have to be reversible encryption, and a hacker could probably disassemble that.
It seems like the platform vendors should have a better solution for this, but I think it’s still an issue. It’s not easy to fix, but maybe this will improve in the near future.
Web-based applications don’t have any client-side files, which makes things simpler, because obviously you store the credentials on your server. Hence, if you’re doing a two-step web-based application, then at least you can keep the API credentials away from the user. It’s a small file buried somewhere in a Linux server, and also in your version control system, available to all 1,000 of your developers at your company, and also uploaded onto Github in one of their skunkworks projects, too.
The good news is that your users can’t just steal your credentials by using your app’s front-end interface. But you can’t just allow them to anonymously send unlimited queries to ChatGPT via your two-step server-side pathway. So, then you’ve got to code up your own login capabilities for your app interface and manage the users on your server. The work never ends!
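Here’s a rough sketch of one piece of that work: a per-user quota check sitting in front of the server-side AI path, so that even logged-in users can’t send unlimited queries through to your LLM backend. The numbers are arbitrary, and a production version would use shared storage (e.g., Redis) rather than an in-memory dictionary.

import time
from collections import defaultdict, deque
WINDOW_SECONDS = 3600
MAX_REQUESTS_PER_WINDOW = 50
_recent = defaultdict(deque)   # user id -> timestamps of recent requests
def allow_request(user_id: str) -> bool:
    # Return True if this (already authenticated) user is under their quota.
    now = time.time()
    history = _recent[user_id]
    while history and now - history[0] > WINDOW_SECONDS:
        history.popleft()                     # drop requests outside the window
    if len(history) >= MAX_REQUESTS_PER_WINDOW:
        return False                          # over quota: reject or queue
    history.append(now)
    return True
# Usage inside your request handler, after login/session validation:
# if not allow_request(current_user_id):
#     return "Too many requests", 429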
On-Device AI Architectures
On-device architectures in the AI industry have come to mean one of these:
- AI phones
- AI PCs
But the term “on-device” actually covers all sorts of other “edge” devices, such as:
- Cars
- Trucks
- Tractors
- Drones
- Nuclear-capable fighter jets
- IoT network devices
- Raspberry Pi
- Smart refrigerators
- Toaster ovens with personality
Okay, so I’m joking about the military applications, for which old-school ML models are more useful than generative AI. I really don’t think we’re going to see drones with LLMs on board, unless the goal is to fly over to the enemy and tell them some good jokes.
On-device inference is one of the hottest areas in the tech industry at the moment, but it might not be relevant to your business AI applications. It depends on what user devices are going to be used to access your AI applications. Your input devices might well be phones, tablets, or laptops, but even then you might not care about on-device inference, because those devices can simply connect to LLMs in the cloud via the internet. On the other hand, maybe you care tremendously about on-device LLM capabilities, in which case the following chapters examine AI phones and AI PCs in more detail.