
Chapter 11. Requirements for AI Projects

  • Book Excerpt from "Generative AI Applications: Planning, Design and Implementation"
  • by David Spuler

AI Project Requirements

Researching the detailed requirements of your AI project might not be a bad idea, considering the potential cost outlay involved. You’re probably familiar with the general issues of requirements design, so I’ll focus mainly on the AI-specific issues. The main goal of requirements planning is to create a printed and nicely bound document that makes a big thud when you throw it onto the CEO’s desk.

Researching your AI project will involve issues such as:

  • What is the specific AI use case?
  • What in-house proprietary data could be used for training?
  • Existing staff AI expertise levels.
  • Capacity of existing hardware for training or inference workloads.
  • Vendors and costs of AI-specific hosting versus in-house capabilities.
  • Adding any random feature requested by the Marketing Department (to position the company as an “AI play”).

Some of the specific decisions in moving ahead with a project plan include:

  • Use case specific requirements
  • Proprietary training data cleansing
  • Choice of foundational model
  • Commercial versus open source models
  • Training or fine-tuning versus RAG

It’s not all about AI, and general tech project requirements also apply:

  • User interface platform
  • Backend hosting and deployment issues
  • Development processes
  • Security risk mitigations
  • Backup and recovery procedures

In addition to technology issues, there are also broader legal and regulatory issues to consider such as:

  • Responsible AI (safety issues)
  • Governmental AI regulatory compliance
  • Internet regulatory compliance (non-AI)
  • Organizational legal compliance (e.g., HIPAA, SOC)
  • Copyright law
  • Privacy law
  • IP ownership (e.g., who owns the generated text, code or other artifacts).

These legal and regulatory issues have been mostly covered in other chapters. But they have an impact on requirements, and can require some significant development activity.

Top 10 Really Big Optimizations

Overall cost is one of the big constraints that you’ll have for an AI project. Let’s take a step back and consider the massive optimizations for your entire project. Here are some ways to save megabucks:

    1. Buy an off-the-shelf commercial AI-based solution instead.

    2. Wrap a commercial model rather than training your own foundation model (e.g., OpenAI API).

    3. Test multiple commercial foundational model API providers and compare pricing.

    4. Use an open source pre-trained model and engine (e.g., Meta’s Llama models).

    5. Avoid fine-tuning completely via Retrieval-Augmented Generation (RAG).

    6. Choose smaller model dimensions when designing your model.

    7. Choose a compressed open source pre-trained pre-quantized model (e.g., quantized Llama).

    8. Cost-compare GPU hosting options for running your model.

    9. Use cheaper commercial API providers for early development and testing.

    10. Use smaller open-source models for early development and testing.

If ten cost reductions aren’t enough for you, don’t worry, I’ve got more! There are plenty of ways discussed in this book to improve inference efficiency. And when you’re making an LLM more efficient to execute on GPUs, faster also means cheaper. It’s useful to reevaluate items 1-10 above regularly, as the whole area changes rapidly, with new technology appearing and costs getting pushed down. As a rough illustration, a simple cost-comparison calculation is sketched below.
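
To put rough numbers on items 3, 9, and 10, a simple back-of-the-envelope calculation is usually enough to rank providers by monthly cost. Below is a minimal sketch in Python; the provider names, per-token prices, and traffic volumes are all hypothetical placeholders, so substitute your shortlisted vendors’ current published pricing and your own traffic estimates.

    # Back-of-the-envelope monthly cost comparison for commercial LLM APIs.
    # All prices and volumes are hypothetical placeholders; replace them with
    # real published pricing and your own traffic estimates.

    # Hypothetical price list: (input $ per 1M tokens, output $ per 1M tokens)
    providers = {
        "big-frontier-api": (5.00, 15.00),
        "mid-tier-api": (0.50, 1.50),
        "hosted-small-open-model": (0.10, 0.30),
    }

    queries_per_month = 100_000       # estimated traffic
    input_tokens_per_query = 1_500    # prompt + context + conversation history
    output_tokens_per_query = 300     # typical answer length

    for name, (in_price, out_price) in providers.items():
        cost = queries_per_month * (
            input_tokens_per_query * in_price / 1_000_000
            + output_tokens_per_query * out_price / 1_000_000
        )
        print(f"{name:25s} ~${cost:,.2f} per month")

Input and output tokens are usually priced separately, which is why the sketch tracks them separately; RAG context and conversation history can easily dominate the input side.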

Build versus Buy

Before we dive into the mechanics of building your own AI thingummy in a huge project, it’s worth considering the various existing products and tools. You might not need a development project at all, but simply a DevOps project to integrate a new third-party commercial product into your company’s infrastructure.

For example, if your project goal is to have staff writers being more productive in creating drafts of various documents or marketing copy, there’s this product called ChatGPT from OpenAI. Maybe you’ve heard of it?

Actually, there are any number of other tools for writer productivity using AI capabilities, some of which use ChatGPT underneath, and some of which are independent. Similarly, there are already a number of “AI coding copilot” type products, which might make your programmers even more amazingly, astoundingly, RSU-worthily useful than they already are. Across the whole spectrum of creative endeavors, there are also numerous AI products that create images, animations, 3D models, and videos.

More generally, there are starting to be AI products for almost every use case that you can think of, and in all of the major industry verticals (e.g., medicine, law, finance, etc.) so it’s worth a little research as to what’s currently available that might suit your needs. I’m reluctant to offer lists, because it’s changing daily. Anyway, it’s not my job to review them; it’s yours!

Overall, it’s fun to build anything with AI technology, but it’s faster to use something that’s already been built. And these new AI tools are actually so amazing that it’s also fun to test them.

Foundation Model Choices

What model are you going to use as the Foundation Model? There are really three major options:

  • Commercial models
  • Open source models
  • Build Your Own (BYO)

Of course, there’s that fourth option of not using AI, which, as anyone in the AI industry will tell you, leads to analysts shunning your stock, instant bankruptcy, and your toenails catching on fire.

Building your own model is a viable option for small to medium models that you want to train on your own data set. However, only the major tech companies have been successful at training a massive LLM foundation model, given the expertise required and the expense of training.

The alternative is to choose an existing foundation model that is pre-trained on lots of general data. Then you would fine-tune that model on whatever proprietary data you want to use.

If you have no specific extra data for fine-tuning, then you’re basically using a commercial or open source model underneath. You can still achieve significant customization of an existing model without fine-tuning, using techniques such as prompt engineering, Retrieval-Augmented Generation (RAG), and the simple idea of mixing heuristics with AI inference results.
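
As a concrete illustration of that last point, here is a minimal sketch of customizing an off-the-shelf model purely with prompt engineering and RAG-style context injection, with no fine-tuning at all. The retrieve_chunks and call_llm functions are hypothetical placeholders for your own vector-store lookup and whichever commercial or open source model you choose.

    # Minimal sketch: customizing a pre-trained model with no fine-tuning,
    # using prompt engineering plus RAG-style context injection.
    # retrieve_chunks() and call_llm() are hypothetical placeholders for your
    # own vector-store lookup and whichever model API or engine you use.

    SYSTEM_PROMPT = (
        "You are a support assistant for ExampleCorp. Answer only from the "
        "provided context. If the answer is not in the context, say you don't know."
    )

    def retrieve_chunks(question: str, top_k: int = 3) -> list[str]:
        """Placeholder: nearest-neighbor lookup against your document store."""
        return ["(chunk 1 of proprietary docs)", "(chunk 2)", "(chunk 3)"][:top_k]

    def call_llm(prompt: str) -> str:
        """Placeholder: call a commercial API or a local open source model."""
        return "(model answer)"

    def answer(question: str) -> str:
        # Assemble the prompt: instructions, retrieved context, then the question.
        context = "\n\n".join(retrieve_chunks(question))
        prompt = f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
        return call_llm(prompt)

    print(answer("What is ExampleCorp's refund policy?"))

The point is that the proprietary knowledge lives in the retrieved context and the prompt wording, not in the model weights, so you can swap the underlying model later without re-training anything.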

Open Source Models

When ChatGPT burst into public consciousness around February 2023, there were already lots of open source models. However, they were mostly smaller models and nowhere near as capable. Nevertheless, you could get a lot of value from them at no cost.

Open source models took a huge jump forward when Meta released its Llama model into the open source world. It was licensed only for non-commercial and research purposes, but it was immediately used in numerous ways by the open source community. This model was also used as a base to create other models that were, theoretically at least, freed of the non-commercial limitations of the original Llama license. That legal issue was never tested and became moot shortly afterwards when Llama version 2 came out.

Meta open sourced its Llama2 model for both commercial and non-commercial usage in July 2023. The license was non-standard, but for most users (those who were not already large companies), it was largely free of restrictions. You should review the details of the Llama2 license yourself, along with any future Meta model releases, but it has already been widely used in the open source community.

Commercial-Usage Open Source Models

Although Llama2 probably tops the list, there are several other major models that have been open-sourced under permissive licenses. Again, you should check the license details yourself, as even the permissive licenses impose some level of restrictions or obligations. Here is my list of some of the better models that I think can be used commercially:

  • Llama2 from Meta (Facebook Research) under a specific license called the Llama 2 Community License Agreement.
  • Mistral 7B and Mixtral 8x7B (both with an Apache 2.0 license).
  • MPT-7B from MosaicML (DataBricks) with Apache 2.0 license.
  • Falcon 7B/40B from the Technology Innovation Institute (TII) (Apache 2.0 license)
  • FastChat T5 from LMSYS (Apache 2.0 license)
  • Cerebras GPT AI Model (Apache 2.0 license)
  • GPT4All models (various); some under MIT License.
  • H2O GPT AI model (Apache 2.0 license)
  • Orca Mini 13B (MIT License)
  • Zephyr 7B Alpha (MIT License)

This list is already out-of-date as you read this, I’m sure. There are new models coming out regularly, and there are also various new models being created from other models, such as quantized versions and other re-trained derivative models.

Model Size

Choosing a model size is an important part of the project. For starters, the size of a model correlates directly with the cost of both training and inference in terms of GPU juice. Making an astute choice about the type of model you need for your exact use case can have a large impact on the initial and ongoing costs of an AI project.

There’s no doubt that bigger models are enticing. The general rule seems to be that bigger models are more capable, and a multi-billion parameter model seems to be table stakes for a major AI model these days. And the top commercial models are starting to exceed a trillion parameters.

However, some research is starting to cast doubt on this, at least in the sense that ever-larger models may not always result in increased intelligence. For example, GPT-4 is rumored to be eight models merged together in a Mixture-of-Experts (MoE) architecture, each of size about 220B parameters, rather than one massive model of 1.76T parameters.

Quality matters, not just quantity. The quality of the data set used for training, and the quality of the various training techniques, are also important. That quality is important for intelligence shouldn’t be surprising. In fact, what should be surprising is that sheer quantity has been so successful at raising AI capabilities.

Model optimizations. How can you have a model that’s smarter and faster and cheaper? Firstly, the open source models have improved quickly and continue to do so. Some are starting to offer quite good functionality at very high speed. There are models that have been compressed (e.g., quantization, pruning, etc.), and there are open source engines that offer various newer AI optimization features (e.g., Flash Attention). You can download both models and engine source code, and run the open source models yourself (admittedly, with hosting costs for renting your own GPUs, or using a commercial GPU hosting service).

For a commercial API, you can’t change their engines until you apply for a job there. However, you can reduce the number of queries being sent to a commercial API, mainly by putting a cache in front of the calls. This cuts costs and speeds up replies for common prompts (or similar ones), with the trade-off that non-cached queries have a slightly slower response time from the additional failed cache lookup. An inference cache is a cache of the responses to identical queries, whereas a semantic cache finds “close-enough” matches in prior queries using nearest-neighbor vector database lookups.
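
Here is a minimal sketch of those two caching ideas: an exact-match inference cache backed by a dictionary, plus a crude semantic cache that reuses a prior answer when the query embedding is close enough to an earlier one. The embed and call_llm functions are hypothetical stand-ins for a real embedding model and API client, and the similarity threshold is something you would need to tune.

    # Sketch of an inference cache (exact match) plus a semantic cache
    # (nearest-neighbor match on embeddings). embed() and call_llm() are
    # hypothetical stand-ins for a real embedding model and LLM API client.
    import math

    exact_cache: dict[str, str] = {}                    # prompt text -> response
    semantic_cache: list[tuple[list[float], str]] = []  # (embedding, response)
    SIMILARITY_THRESHOLD = 0.95                         # tune for your use case

    def embed(text: str) -> list[float]:
        """Placeholder embedding; swap in a real embedding model in practice."""
        vec = [0.0] * 64
        for i, byte in enumerate(text.encode("utf-8")):
            vec[i % 64] += byte
        return vec

    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
        return dot / norm if norm else 0.0

    def call_llm(prompt: str) -> str:
        """Placeholder: the real (slow, costly) call to the model API."""
        return f"(model answer to: {prompt})"

    def cached_query(prompt: str) -> str:
        # 1. Inference cache: exact match on the prompt text.
        if prompt in exact_cache:
            return exact_cache[prompt]
        # 2. Semantic cache: close-enough match on the prompt embedding.
        query_vec = embed(prompt)
        for vec, response in semantic_cache:
            if cosine(query_vec, vec) >= SIMILARITY_THRESHOLD:
                return response
        # 3. Cache miss: pay for a real model call and remember the answer.
        response = call_llm(prompt)
        exact_cache[prompt] = response
        semantic_cache.append((query_vec, response))
        return response

In production you would bound the cache size, add expiry, and use a vector database for the nearest-neighbor search instead of a linear scan; the trade-off to watch is serving a cached answer to a query that is only superficially similar.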

Latency and Response Time

Performance requirements for the overall system are an important part of the design. For a simple text-generation response, such as a chatbot or Q&A, there are two main factors in evaluating the speed of its answers:

  • Initial response time (prefill phase or encoding mode)
  • Tokens per second (decoding mode)

Response Time (Latency). The initial time delay before an AI engine emits the first word of its response is called the “latency” or “response time.” It is also sometimes called the “time to first token” (which is shortened to TTFT in research papers) or the “prefill time.” There are two factors here that cause high latency:

  • Cloud round-trip message time (network)
  • Prefill or encoding time (engine)

For a cloud-based AI engine on a phone, the LLM isn’t on the phone, and the prompt question must be sent to the cloud engine over the network. Hence, there is time for the message to get sent into the cloud and back again with the answer (i.e., a “round-trip” network message).

For on-device native LLM execution on a phone, there’s no network, because the engine runs inside the phone. Hence, native execution’s response time depends only on engine speed, specifically the prefill phase.

Prefill phase. Both the cloud and on-device versions have to run the query on an engine. AI engines often have a significant delay before starting to reply. For example, a 2-second response time is not uncommon. An AI engine in the cloud probably runs faster because it can have powerful GPUs on a big box in a data center somewhere near a big lake. Local AI engines running on a phone don’t have a big GPU, or may not have one at all, so they run slower.

Why are Transformers slow to initially respond? The way that Transformers work is to do an initial phase that is called “prefill” in decoder-only engines (e.g., GPT), or “encoding” in the older style of encoder-decoder engines. This phase calculates a lot of data about the input prompt, but doesn’t emit any output tokens. Hence, it’s also called the “prompt processing” phase.

The length of the prefill phase also depends on the size of the input prompt. More tokens to process in the input means a slower initial response time.

The input tokens are not only the user’s question. They also include the “context” that’s sent to the AI engine with the query, such as the conversation history in a chatbot, or the “chunks” of documents from a datastore in a RAG architecture for Q&A. These extra context tokens can be significantly longer than the simple query prompt.

Hence, this prefill or encoding time is part of the initial delay before an AI engine answers. For an on-device phone LLM, it’s the main delay (there’s no network delay). But once the LLM starts talking, then it goes faster in the “decoding phase.”

Decoding speed. The second phase of an AI engine’s answer is when it starts outputting its response, one token at a time (i.e., roughly a word at a time). How fast this runs is called the “decoding speed” and is usually measured in tokens-per-second. Even with a big GPU, the engines are still quite slow, and may run at tens of tokens per second. Reportedly, GPT-3.5 can run at a maximum of around 100 tokens per second. Note that tokens are not always whole words, so words-per-second will be slightly less than this (e.g., maybe 75% of this).

On a phone, it’s slower because the engine has less hardware support. The state of the art at the moment is more like single-digit tokens per second, i.e., only a few words per second of output.

Note that there isn’t a big delay between subsequent token outputs in the decoding mode. There’s only that initial delay before the first one. The time to output each new token is about the same for each token in the answer, because it basically re-runs the same decoder computations for every output token.
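
This suggests that latency requirements should be specified and measured as two separate numbers: time-to-first-token and tokens-per-second. Below is a minimal measurement sketch around a streaming generator; stream_tokens is a hypothetical placeholder (with simulated delays) for whatever streaming API or local engine you actually use.

    # Measuring the two latency numbers separately: time-to-first-token (TTFT,
    # the prefill delay) and the decoding rate in tokens per second.
    # stream_tokens() is a hypothetical placeholder for a real streaming call.
    import time

    def stream_tokens(prompt: str):
        """Placeholder streaming generator; yields tokens one at a time."""
        time.sleep(0.5)                     # simulated prefill delay
        for token in "The quick brown fox jumps over the lazy dog".split():
            time.sleep(0.05)                # simulated per-token decode time
            yield token

    def measure(prompt: str) -> None:
        start = time.perf_counter()
        first_token_time = None
        count = 0
        for token in stream_tokens(prompt):
            if first_token_time is None:
                first_token_time = time.perf_counter()   # end of prefill phase
            count += 1
        if first_token_time is None:
            print("No tokens received")
            return
        ttft = first_token_time - start
        decode_time = time.perf_counter() - first_token_time
        rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
        print(f"TTFT: {ttft * 1000:.0f} ms, decode rate: {rate:.1f} tokens/sec")

    measure("Why is the sky blue?")

Measuring the two numbers separately also tells you which phase to attack: the prefill (e.g., shorter context) or the decoding (e.g., a smaller or quantized model).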

How Fast is Needed?

How fast does an AI engine need to run on a phone? Well, it depends on whether your AI is writing you a novel or helping you scroll through your TikTok feed.

Consider the different speed requirements if it’s generating text, voice, images, or video. Also, in what way does it need to be interactive, or is the user just passively reading or watching? Maybe it’s controlling some phone aspects in the UI, too.

But let’s take basic LLM text generation as an example. This means use cases such as a chatbot, Q&A service, or written document drafting.

The way to think about this is to consider human nature in relation to the two main aspects of speed:

  • Response time
  • Decoding speed (output token generation)

Firstly, humans are impatient. We’re used to getting a very snappy response from websites and computers. What that means is that something like a 200ms initial response time is desirable.

The news is better for the decoding phase. Humans don’t read fast. An average reading speed is about 240 words-per-minute, which is only 4 words per second.

Hence, it seems likely that the biggest problem with on-device AI responsiveness is going to be the initial prefill time delay until the first word starts appearing.

The SOTA for running small models (e.g., 1B or 2B) on a phone is not there yet. Papers talk about an initial response time in seconds rather than milliseconds, so the prefill phase is problematic. Decoding rates are better, with high single-digit tokens-per-second achievable after the first token. Hence, the initial delay is the main concern, but the decoding speed thereafter is less worrying.
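
As a quick sanity check on those numbers, converting human reading speed into a required decoding rate takes only a few lines. The tokens-per-word ratio below is the rough approximation used above (words are roughly 75% of tokens).

    # Rough check: what decoding rate keeps up with a human reader?
    reading_speed_wpm = 240     # average reading speed, words per minute
    tokens_per_word = 1.33      # rough approximation: words are ~75% of tokens

    words_per_second = reading_speed_wpm / 60                       # 4.0
    tokens_per_second_needed = words_per_second * tokens_per_word   # ~5.3

    print(f"About {tokens_per_second_needed:.1f} tokens/sec keeps up with reading speed")

So high single-digit tokens-per-second decoding is roughly at parity with reading speed, which is why the prefill delay, not the decoding rate, is the main worry on-device.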

Incremental Output

The assumption here for decoding speed is that the partial answer starts being emitted by the engine before it has finished its full output. We don’t want to have to wait for the AI engine’s entire inference phase before showing some words to the user. In other words, incremental or “streamed” output of AI results is critical.

Voice output. Incrementally creating words to say as voice output might seem slightly more problematic than reading text. If the engine’s decoding output speed doesn’t keep up with a normal rate of speech, then the voice-based AI assistant will sound staccato. But it’s actually not a problem, because people read faster than they speak. Average human reading speed is around 240 words-per-minute (4 words-per-second), whereas speech is about half that at 100-130 words-per-minute (about 2 words-per-second). Hence, a voice assistant outputting the answer from an AI engine is unlikely to get ahead of the answer, assuming it’s basically reading out the text response. In fact, it might be more problematic in reverse, with the voice assistant falling behind, such as if the user can both read and hear the text response at the same time.

Accuracy Requirements

A lot of the decisions about LLMs come down to a trade-off:

  • Accuracy versus cost
  • Accuracy versus speed

The accuracy-versus-cost trade-off is exemplified by the choice of which commercial LLM API to use. The current front-runner in terms of accuracy is OpenAI’s GPT-4o, which also wins the award for highest per-token cost. Even if you stick with OpenAI, you can trade down to GPT-3.5, which is a less capable model, but costs less, too.

Smartness versus speed. Accuracy versus speed is more of a consideration when you’re working with open source models, because you don’t have much control over OpenAI’s speed. For example, you might have to choose between two versions of Llama or Mistral models:

  • Full precision 32-bit floating point model (FP32, not quantized)
  • Quantized 4-bit integer model (INT4 quantized)

The FP32 model uses full-size data, and is accurate but slower. An INT8 version uses 8-bit integers (i.e., 256 possible values), which is faster but less accurate. Note that INT8 models are surprisingly accurate, even though they’re a quarter the size of a 32-bit model. Furthermore, a lot of people are using 4-bit (INT4) models, which are an eighth of the size, but still quite accurate. Smaller models run much faster, and can be quite adequate for many use cases.
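
To see why quantization has such a large effect on speed and hosting cost, here is a minimal sketch of the weight-memory arithmetic for a 7B-parameter model at different precisions (weights only; activations, the KV cache, and runtime overhead are extra).

    # Approximate weight memory for a 7B-parameter model at different precisions.
    # Weights only: activations, KV cache, and runtime overhead are extra.
    params = 7_000_000_000

    bytes_per_weight = {
        "FP32 (not quantized)": 4.0,
        "FP16/BF16": 2.0,
        "INT8 quantized": 1.0,
        "INT4 quantized": 0.5,
    }

    for precision, nbytes in bytes_per_weight.items():
        gigabytes = params * nbytes / (1024 ** 3)
        print(f"{precision:22s} ~{gigabytes:.1f} GB")

A model that fits in a quarter or an eighth of the memory also moves proportionally fewer bytes per token, which is a large part of where the speedup comes from, since inference is typically memory-bandwidth bound.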

On-device versus cloud-based. The same speed-smartness tradeoff arises with on-device execution of AI models. For example, Apple Intelligence allows some queries to run on the device itself, which uses a 3B model. Alternatively, the query can be sent to the cloud, which uses a bigger model that’s more accurate and capable, but it’s also slower.

Use cases. How much accuracy do you need? It depends on the use case for your app and whether it’s business-critical or not. Some examples where high accuracy might not be required:

  • Creative writing
  • Image generation (for recreation)

In theory, an app that’s responding to customers seems like it has a higher accuracy requirement than one for your internal staff. But is that true? Surely, you should not undervalue the importance of your staff getting accurate information.

The trade-off between accuracy and speed is less important when the LLM does “bigger” tasks. For example, consider writing a cover letter for a job application given the position details and resume, or answering a high school history question. Whilst it may take some time for the prefill phase and the subsequent decoding to emit the whole result, typically the results are “better” and delivered “faster” than the user could manage themselves. The “speed” is often not really noticed at all, and accuracy is far more important.

I guess context matters a little bit, since this is less true of interactive tasks. For example, if the LLM is auto-completing the next part of a text message being written, or some code in a programmer’s IDE, speed matters there. Perhaps we have discovered a new AI scaling law: the smaller the reward from the LLM, the faster it needs to be. The bigger the reward, the less speed matters.

Another factor in considering accuracy requirements is whether the LLM output will be reviewed by a human. Outputs that are going straight out onto the web would need higher accuracy than something that will be revised and curated by a human. On the other hand, don’t assume that humans are good at reviewing reams of text, as they’ll have the tendency to just hit the “Approve” button without reading it properly. Maybe you should use an LLM for that?
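
If you do go down that path, the idea is a simple automated review gate in front of the human approver (or the publish button). The sketch below is hypothetical: call_review_llm stands in for whichever model you use as the reviewer, and the review criteria would need tailoring to your use case.

    # Sketch of an LLM-based review gate: a second model checks generated output
    # before it reaches the human approver (or the web). call_review_llm() is a
    # hypothetical placeholder for a real model API call.

    REVIEW_PROMPT = (
        "Review the following draft for factual errors, policy violations, and "
        "unsupported claims. Reply with 'APPROVE' or 'REJECT: <reason>'.\n\nDraft:\n"
    )

    def call_review_llm(prompt: str) -> str:
        """Placeholder: call your reviewer model here."""
        return "APPROVE"

    def review_gate(draft: str) -> tuple[bool, str]:
        verdict = call_review_llm(REVIEW_PROMPT + draft)
        approved = verdict.strip().upper().startswith("APPROVE")
        return approved, verdict

    ok, verdict = review_gate("Our widget is 100% guaranteed to cure boredom.")
    print("Send to human approver" if ok else f"Blocked: {verdict}")

An automated gate doesn’t remove the need for human review on high-stakes outputs; it just catches the obvious problems before a bored human waves them through.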

 
