Aussie AI
Chapter 3. LLM Technology Overview
Book Excerpt from "Generative AI Applications: Planning, Design and Implementation"
by David Spuler
What is AI?
AI is such a trendy and overhyped term that I hardly need to tell you what it stands for. Every single company on the planet is now calling themselves an “AI Company” and they’re not incorrect. I mean, my toaster is technically an AI engine because there’s silicon in there somewhere and it’s “intelligent” enough not to burn bread.
And when you get your dream job as an overpaid Software Engineer doing LLMs at a major tech company, the phrase “AI Engineer” is a great term to impress your kids. Your official title, “ML Engineer”, not so much.
This cuts both ways, though. If you’re haggling over the price of your new car at the dealership, maybe stick to ML Engineer. Similarly, if you send your resume to a major tech company with “AI Engineer” as your career aspiration, they’ll throw it in the trash and say, “Noob!” with a bemused look on their face.
AI means anything you want it to, but ML means “Machine Learning” to anyone important enough to have that title. The category of ML is specific to a piece of software that actually “learns” to be smarter (e.g., by “training”). The main ones this book is about are:
- Transformers (e.g., ChatGPT’s “engine”)
- Large Language Models (LLMs) (e.g., GPT-3 or GPT-4)
- Neural Networks (NNs) (i.e., an “artificial brain”)
The general category of Deep Learning (DL) is the subset of ML involving neural networks. Hence, Transformers are a subset of DL. Some of the other more specific types of ML include:
- Computer Vision (CV)
- Autonomous Vehicles (AVs) — self-driving cars
- Product Recommendation Systems — e-commerce
- Machine Translation (MT) — foreign language translation
- Content relevancy algorithms — social media feeds
Looking forward, some of the aspirations of the AI industry are capabilities such as:
- Artificial General Intelligence (AGI) — human-like reasoning
- Artificial Super-Intelligence (ASI) — who knows what?
AI Technology Trends
Multi-model AI is here already. We’re in the early stages of discovering what can be achieved by putting multiple AI models together. The formal research term for this is “ensemble” AI. For example, GPT-4 is rumored to be an eight-model architecture, and this will spur on many similar projects. As multiple-model approaches achieve greater levels of capability, this will in turn create further demand for AI models and their underlying infrastructure. This will amplify the need for optimizations in the underlying AI engines.
Multimodal engines. Multimodality is the ability of an AI to understand inputs in both text and images, and also to produce outputs in those same modalities. Google Gemini and OpenAI’s GPT-4o are notable large multimodal models. This area of technology is only at the beginning of its journey.
Fine-tuning fights back! For some time, the recommendation has been to use RAG rather than fine-tuning. However, the advent of Low-Rank Adapters (LoRA) and “multi-LoRA” architectures has changed that. LoRA architectures make it cheaper to fine-tune models, and also faster for inference, due to lower memory overhead. Notably, Apple Intelligence on-device inference is based on a single 3B foundation model and multi-LoRA, with dozens of swappable LoRA adapters. The advantage is low memory usage compared to having multiple foundation models, but with capabilities of specialized versions of the main foundation model for particular use cases.
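The core LoRA trick can be sketched in a few lines: instead of fine-tuning a full weight matrix, you train two small low-rank matrices and add their product to the frozen base weights. The dimensions, scaling factor, and initialization below are illustrative assumptions, not any particular model's actual configuration.

```python
import numpy as np

# Sketch of the LoRA idea: rather than retrain a full d x d weight matrix W,
# train two small low-rank matrices A (r x d) and B (d x r), and compute
# W_effective = W + (alpha / r) * B @ A at inference time.
d, r, alpha = 512, 8, 16          # hypothetical dimensions and scaling factor
W = np.random.randn(d, d)         # frozen base-model weights (not updated)
A = np.random.randn(r, d) * 0.01  # trained adapter: down-projection
B = np.zeros((d, r))              # trained adapter: up-projection (zero-init)

x = np.random.randn(d)            # an input activation vector
base = W @ x                      # base model path
adapted = base + (alpha / r) * (B @ (A @ x))  # base plus low-rank update

# A full fine-tune stores d*d new numbers per matrix; LoRA stores only
# 2*d*r, which is why dozens of adapters can share one base model.
print(d * d, 2 * d * r)  # 262144 vs 8192
```

Swapping adapters means swapping only the small A and B matrices, which is the low-memory property that makes a multi-LoRA architecture like Apple's practical on-device.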
Longer Contexts. The ability of AI engines to handle longer texts has been improving, both in terms of computational efficiency and better understanding and generation results (called “length generalization”). GPT-2 had a context window of 1024 tokens, GPT-3 had 2048, and GPT-4 originally had versions from 4k up to 32k, but has now advanced to 128k tokens as I write this (November, 2023). An average fictional novel starts at 50,000 words and goes up to 200,000 words, so we’re getting to the neighborhood of having AI engines generate a full work from a single prompt, although, at present, the quality is rather lacking compared to professional writers.
AI Gadgets. The use of AI in the user interface has made alternative form factors viable. Some of the novel uses of AI in hardware gadgets include the Rabbit R1, Humane Ai Pin, and Rewind Pendant. There’s been some overreach in these startups, and they haven’t achieved their potential, but I think there’s still room for big changes in how people use AI. On the other hand, maybe the smartphone will win, and the go-to AI interface will be voice conversations with a smart AI assistant app in your pocket.
Agents and Multi-Agent. Agents are LLMs that can do stuff. This means accessing more data sources (e.g., reading your emails), or performing actions for you (e.g., sending an email). Data source agents are sometimes called “plug-ins” after the OpenAI feature. Action agents or “write” agents may require human approval or could run unattended. Both data source and action agents will require a lot of integration work. CrewAI, Pythagora, and Devin are examples of multi-agent frameworks that also support tool usage.
Autonomous agents. There are also “autonomous” types of agents that sit in the background and work on a continual basis, like Windows services or Unix daemons, rather than waiting for human requests. The autonomous agent architecture is a combination of an AI engine and LLMs with a datastore and a scheduler.
Small Language Models. Although the mega-size foundation models still capture all the headlines, small or medium size models are becoming more common in both open source and commercial settings. They even have their own acronym now: Small Language Models (SLMs). Notably, Google, Apple and Microsoft are all on the SLM bandwagon. Google has its Gemma models, which are lightweight models that can work on both low-end devices and NVIDIA GPUs. Apple Intelligence has a 3B foundation model as the basis for its on-device inference on iPhone and Mac, using the optimization of multiple LoRA adapters for fine-tuning. Microsoft has also been doing some work in this area with its Orca and Phi models. IBM Granite 3.0 models are enterprise-focused and open source with 3B/8B versions. Apparently 7B or 8B is “small” now.
Specialized Models. High quality, focused training data can obviate the need for a monolithic model. Training a specialized model for a particular task can be effective, and at a much lower cost. Expect to see a lot more of this in medicine, finance, and other industry verticals.
Data Feed Integrations. AI engines cannot answer every question alone. They need to access data from other sources, such as the broad Internet or specific databases such as real estate listings or medical research papers. Third-party data feeds can be integrated using a RAG-style architecture.
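The RAG-style flow is simple at its core: retrieve relevant text from the data feed, then prepend it to the user's question as context for the model. Here is a minimal sketch; the listings, the keyword-overlap scoring, and the function names are all illustrative stand-ins (real systems use vector embeddings for retrieval).

```python
import re

# Stand-in for a third-party data feed (e.g., real estate listings).
DOCS = [
    "Listing 1: 3-bed house, oak street, $500k",
    "Listing 2: 2-bed apartment, main street, $300k",
]

def tokens(s: str) -> set:
    return set(re.findall(r"\w+", s.lower()))

def retrieve(query: str) -> str:
    # Pick the document sharing the most words with the query.
    # Real RAG uses embedding similarity; keyword overlap is a toy proxy.
    return max(DOCS, key=lambda doc: len(tokens(query) & tokens(doc)))

def build_prompt(question: str) -> str:
    # Prepend the retrieved text so the LLM can answer from the feed.
    return f"Context: {retrieve(question)}\nQuestion: {question}"

print(build_prompt("What does the apartment on main street cost?"))
```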
Tool Integrations. Answering some types of questions requires integration with various “tools” that the AI engine can use for supplemental processing of user requests. For example, answering “What time is it?” is not possible via training with the entire Wikipedia corpus, but requires integration with a clock. Implementing an AI engine so that it knows both when and how to access tools is a complex engineering issue.
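The clock example can be sketched as a tiny tool dispatcher. In a real engine the LLM itself decides when a tool is needed; the naive keyword check and the function names below are illustrative stand-ins for that decision.

```python
from datetime import datetime

# Hypothetical tool: a clock the engine can consult at request time.
def clock_tool() -> str:
    return datetime.now().strftime("%H:%M")

TOOLS = {"time": clock_tool}  # registry of available tools

def answer(query: str) -> str:
    # "When" to use a tool: a real engine lets the model decide;
    # this sketch uses a naive keyword check as a stand-in.
    if "time" in query.lower():
        return f"The current time is {TOOLS['time']()}."
    return "(fall through to normal LLM generation)"

print(answer("What time is it?"))
```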
The Need for Speed. The prevailing problem at the moment is that AI engines are too inefficient, requiring too much computation and too many GPU cycles. The solution will be research into faster software algorithms and optimization techniques, and increasingly powerful hardware underneath.
Why is AI Slow?
Why is AI so slow? It’s a fair question, since the computing power required by AI algorithms is legendary. The cost of training big models is prohibitive, and getting even small models to run fast on a developer’s desktop PC is problematic.
But why?
The bottleneck is the humble multiplication.
All AI models use “weights”: numbers, often quite small fractions, that encode how likely or desirable a particular feature is. In an LLM, a weight might encode the probability of the next word being correct. For example, simplifying considerably, a weight of “2.0” for the word “dog” would make that word twice as likely to be the next word, and a weight of “0.5” for “cat” would halve the probability of outputting that word. Each of these weights is multiplied against other probabilities in many of the nodes in a neural network.
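The dog/cat example can be made concrete in a few lines: multiply the base probabilities by the weights, then re-normalize so the scores sum to one. All the numbers below are made up for illustration.

```python
# Toy illustration of weight multiplication: weights scale the
# likelihood of candidate next words, then scores are re-normalized.
probs = {"dog": 0.3, "cat": 0.3, "fish": 0.4}    # made-up base probabilities
weights = {"dog": 2.0, "cat": 0.5, "fish": 1.0}  # made-up learned weights

scaled = {w: probs[w] * weights[w] for w in probs}
total = sum(scaled.values())
adjusted = {w: s / total for w, s in scaled.items()}
print(adjusted)  # "dog" is now the most likely next word
```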
How many multiplications? Lots! By which we mean billions every time it runs. A model of size 3B has 3 billion weights or “parameters” and each of these needs multiplication to work. GPT-3 as used by the first ChatGPT release had 175B weights, and GPT-4 apparently has more (it’s confidential but an apparent “leak” rumored that it’s a multi-model architecture with 8 models of 220B parameters each, giving a total of more than 1.7 trillion trained parameters).
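A back-of-envelope calculation shows how fast this adds up, using the rough assumption of one multiplication per weight per generated token:

```python
# Roughly one multiply per weight per generated token.
params_3b = 3_000_000_000      # a "small" 3B model
params_gpt3 = 175_000_000_000  # GPT-3
tokens = 100                   # a short paragraph of output

print(params_3b * tokens)    # ~300 billion multiplies for 100 tokens
print(params_gpt3 * tokens)  # ~17.5 trillion for GPT-3
```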
Why so many weights? Short answer: because every weight is a little tiny bit of braininess.
Longer answer: because it has weights for every combination. Simplifying, a typical LLM will maintain a vector representation of words (called the model’s “vocabulary”), where each number is the probability of emitting that word next. Actually, it’s more complicated, with the use of “embeddings” as an indirect representation of the words, but conceptually the idea is to track word probabilities. To process these word tokens (or embeddings), the model has a set of “weights”, also sometimes called “parameters”, which are typically counted in the billions in advanced LLMs (e.g., a 3B model is considered “small” these days and OpenAI’s GPT-3 had 175B).
Why is it slow on my PC? Each node of the neural network inside the LLMs is doing floating-point multiplications across its vocabulary (embeddings), using the weights, whereby multiplication by a weight either increases or decreases the likelihood of an output. And there are many nodes in a layer of an LLM that need to do these computations, and there are multiple layers in a model that each contain another set of those nodes. And all of that is just to spit out one word of a sentence in a response. Eventually, the combinatorial explosion of the sheer number of multiplication operations catches up to reality and overwhelms the poor CPU.
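The layered structure described above can be sketched as a stack of matrix-vector multiplications. The dimensions and the tanh activation are illustrative assumptions, not any real model's architecture, but the operation count is the point: even this toy stack does tens of millions of multiplies per token.

```python
import numpy as np

# Toy stack of layers, each multiplying the full activation vector,
# roughly what generating a single token costs (dimensions made up).
d_model, n_layers = 1024, 24
layers = [np.random.randn(d_model, d_model) * 0.01 for _ in range(n_layers)]

x = np.random.randn(d_model)   # one token's activation vector
for W in layers:               # every layer multiplies the whole vector
    x = np.tanh(W @ x)         # d_model * d_model multiplies per layer

mults_per_token = n_layers * d_model * d_model
print(mults_per_token)  # ~25 million multiplies for just this small stack
```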
Bigger and Smarter AI
Although the compute cost of AI is a large negative, let us not forget that this is what achieves the results. The first use of GPUs for AI was a breakthrough that heralded the oncoming age of big models. Without all that computing power, we wouldn’t have discovered how eloquent an LLM could be when helping us reorganize the laundry cupboard.
Here’s a list of some of the bigger models that have already been delivered in terms of raw parameter counts:
- MPT-30B (MosaicML) — 30 billion
- Llama2 (Meta) — 70 billion
- Grok-1 (xAI) — 70 billion
- GPT-3 (OpenAI) — 175 billion
- Jurassic-1 (AI21 Labs) — 178 billion
- Gopher (DeepMind/Google) — 280 billion
- PaLM-2 (Google) — 340 billion
- MT-NLG (Microsoft/NVIDIA) — 530 billion
- PaLM-1 (Google) — 540 billion
- Switch-Transformer (Google) — 1 trillion
- Gemini Ultra (Google) — (unknown)
- Claude 2 (Anthropic) — 130 billion (unconfirmed)
- GPT-4 (OpenAI) — 1.76 trillion (unconfirmed)
- BaGuaLu (Sunway, China) — 174 trillion (not a typo)
Note that not all of these parameter counts are official, with some based on rumors or estimates from third parties. Also, some counts listed here are not apples-to-apples comparisons. For example, Google’s Switch Transformer is a different architecture.
The general rule of AI models still remains: bigger is better. If you’re promoting your amazing new AI foundation model to investors, it’d better have a “B” after its parameter count number (e.g., 70B), and soon it’ll need a “T” instead. All of the major tech companies are talking about trillion-parameter models now.
The rule that bigger is better is somewhat nuanced now. For example, note that Google’s PaLM version 2 had fewer parameters (340B) than PaLM version 1 (540B), but more capabilities. It seems likely that a few hundred billion parameters is getting to be enough for most use cases, and there is more value in quality of training at that level.
Furthermore, training bigger models only works if you have the data to feed it. The availability of trillions of tokens of input data is starting to be a limiting factor for the industry.
Another change is the appearance of multi-model architectures. Notably, the rumored architecture of GPT-4 is almost two trillion parameters, but not in one model. Instead, the new architecture is (apparently) an eight-model architecture, each with 220 billion parameters, in a “mixture-of-experts” architecture, for a total of 1.76 trillion parameters. Again, it looks like a few hundred billion parameters is enough for quality results.
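The parameter arithmetic behind that rumor is simple enough to check:

```python
# The rumored GPT-4 arithmetic: eight expert models of 220B parameters each.
experts, params_each = 8, 220_000_000_000
print(experts * params_each)  # 1.76 trillion total parameters
```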
We’re only at the start of the multi-model wave, which is called “ensemble architectures” in the research literature. But it seems likely that the overall count of parameters will go upwards from here, in the many trillions of parameters, whether in one big model or several smaller ones combined.
Faster AI
It’ll be a few years before a trillion-parameter model runs on your laptop, but the situation is not hopeless for AI’s sluggishness. After all, we’ve all seen amazing AI products such as ChatGPT that respond very quickly. They aren’t slow, even with millions of users, but the cost to achieve that level of speed is very high. The workload sent to the GPU is immense and those electrons aren’t free.
There is currently a large trade-off in AI models: go big or go fast.
The biggest models have trillions of parameters and are lumbering behemoths dependent on an IV-drip of GPU-juice. Alternatively, you can call a large commercial model through its provider’s API (e.g., OpenAI’s API, Google PaLM API, etc.); a major API has a dollar cost per request, although it probably replies quickly.
Smaller models are available if you want to run fast. You can pick one of several smaller open-source models. Here’s a list of some of them:
- Llama2 (Meta) — 70 billion
- MPT-30B (MosaicML) — 30 billion
- MPT-7B (MosaicML) — 7 billion
- Mistral-7B (Mistral AI) — 7 billion
The compute cost of models in the 7B range is much lower. The problem with using smaller models is that they’re not quite as smart, although a 7B model’s capabilities still amaze me. These can definitely be adequate for many use cases, but tend not to be for areas that require finesse in the outputs, or detailed instruction following. Given the level of intense competition in the AI industry, a sub-optimal output may not be good enough.
For more capability, there are larger open-source models, such as Meta’s Llama2 models, which have up to 70 billion parameters. But that just brings us back to the high compute costs of big models. They might be free of licensing costs, but they’re not free in terms of GPU hosting costs.
What about both faster and smarter? So, you want to have your cake and eat it, too? That’s a little trickier to do, but I know of a book that’s got hundreds of pages on exactly how to do that.
There are many ways to make an AI engine go faster. The simplest is to use more GPUs, and that’s probably been the prevailing optimization used to date. However, companies can’t go on with that business model forever, and anyway, we’ll need even more power to run the super-advanced new architectures, such as the multi-model AI engines that are emerging.
Algorithm-level improvements to AI are required to rein in the compute cost in terms of both cash and environmental impact. An entire industry is quickly evolving and advancing to offer faster and more efficient hardware and software to cope with ever-larger models.
But you can save your money for that Galápagos vacation: code it yourself. This whole book offers a survey of the many ways to combat the workload with optimized data structures and algorithms.
Human ingenuity is also on the prowl for new solutions and there are literally thousands of research papers on how to run an AI engine faster. The continued growth of models into trillions of parameters seems like a brute-force solution to a compute problem, and many approaches are being considered to achieve the same results with fewer resources. Some of these ideas have made their way into commercial and open source engines over the years, but there are many more to be tested and explored.
AGI
The race to Artificial General Intelligence (AGI) is well under way. It seems plausible that it’s possible to have highly intelligent LLMs, but there are also some reasons to doubt. Some of the reasons to be bullish on AGI include:
- Brute-force usually wins — it’s called the “bitter lesson” and refers to the fact that computers usually beat humans not by using smarter algorithms, but simply by out-computing us with simpler algorithms.
- Rapid progress against benchmarks — watch any LLM leaderboard for a few months, and there are always new models doing better.
- Multimodal capabilities — there are many new models that can process inputs and outputs involving images, video, and voice.
Not everyone thinks that LLMs can achieve AGI. In one theory, an LLM is compared to our subconscious, and we need something else that mirrors our rational conscious brain, which would then combine with an LLM to have human-like AGI. There are a lot of LLM limitations, but some of the major obstacles to achieving AGI include:
- Reasoning difficulties — LLMs have trouble generalizing information to more complex tasks, or conceptually similar tasks.
- Continual learning is missing — LLMs currently learn nothing new from the conversations or articles they read. The LLM literally forgets everything it reads.
Overall, LLMs are great at memorization and parroting. Effectively, it’s all imitation. Will it ever be real?