Chapter 13. Transformers & LLMs

  • Book Excerpt from "Generative AI Applications: Planning, Design and Implementation"
  • by David Spuler

AI Engines & Models

An AI application is really two components and it’s not very complicated:

  • Engine — Transformer
  • Model — LLM

Transformers are a type of neural network engine that calculates the answers in Generative AI. The Large Language Model (LLM) contains all of the data about the relationships between words and their relative positioning.

In terms of technology, the distinction between engines and models is also very simple:

  • Engine — code
  • Model — data

The runtime code is the “engine” and the grunt work is often done in C++ under a Python wrapper. The data is the “model” which is literally all numbers, and no code. So far, not so exciting.

Where it gets more interesting is in the complex meshing between engines and models. Not all engines work with all models, and vice-versa. Even Transformers are tightly interwoven with their LLM data. There are many variants of Transformer architectures, and the data won’t work with an architecture that’s different.

Engines and models are symbiotic and you need both to get anything done. An engine without a model usually means you ran out of compute budget before training finished, whereas a model without an engine can't really happen, because it takes an engine to create a model via training.

Engines

What’s an engine? The engine is code that you have to write. All of the fast low-level code is usually written in C++, but the higher-level control code is often written in Python. Somebody has probably used Java to do AI engines, but I’m not a fan of having ten directory levels. If you’re using Visual Basic or Perl, we’re in trouble.

All of the action is done by the engine based on the data in the model file. The engine needs to load the model file, receive a user query, crank the query through the model weights, and output the best ideas it can think of.
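
To make that concrete, here's a minimal inference sketch in Python, assuming the Hugging Face transformers library and the small "gpt2" model as stand-ins (your engine might instead be llama.cpp, vLLM, or something written in-house):

    # Minimal inference sketch: load a model (the data), run a query (the code).
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")      # vocabulary and tokenization rules
    model = AutoModelForCausalLM.from_pretrained("gpt2")   # the model file: billions of numbers

    prompt = "The capital of France is"
    inputs = tokenizer(prompt, return_tensors="pt")        # words -> token numbers
    outputs = model.generate(**inputs, max_new_tokens=20)  # crank the query through the weights
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))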

There are two main types of engine, and they're so closely related that they're almost the same thing. Conceptually, there are two engines:

  • Training engine
  • Inference engine

The training engine computes answers to queries, compares the results to expectations, and then updates the weights in the model. The “loss function” calculates how close the results are to what’s expected in the training data set. It’s also sometimes called an “error function” because it computes an error metric between the computed results and the expected results. At a very high level, the basic architecture is:

    Training engine = Inference engine + Loss function + Weight Updater

The training engine is used for training (surprise!) and for mini-training tasks like “fine-tuning” the model with small amounts of extra data. The main purpose of the training engine is to create the model by continually updating the weights.
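
In rough PyTorch-style code, one training step looks something like the sketch below (a minimal illustration of the formula above, not any particular framework's actual training loop):

    import torch

    def training_step(model, optimizer, loss_fn, inputs, expected):
        logits = model(inputs)            # the inference engine part: compute answers
        loss = loss_fn(logits, expected)  # the loss function: distance from expectations
        optimizer.zero_grad()
        loss.backward()                   # work out how each weight contributed to the error
        optimizer.step()                  # the weight updater: nudge the weights
        return loss.item()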

The inference engine handles user queries at runtime. It requires a model that has been built during training, which is used to answer the users’ prompts according to whatever has been trained into the model.

These two types of engines have the same inference component. A “training engine” is the inference engine plus a mechanism to compare results with expectations and then update weights appropriately. The central difference is that a training engine changes the weights, because it’s creating the model, whereas the weights are static during inference. The weights are not updated by user queries. If you like programming (hopefully?), here’s another way to think about model weights:

  • Training engine — Read/Write
  • Inference engine — Read-Only

Both of these engines do the same inference computations on weights for the “Read” phase. Hence, they share a lot of components, but the training engine adds extra components (the “Write” parts). The basic hyper-parameters of the model (e.g., number of weights, number of layers) must be identical for the training and inference phases. Hence, a query computation done by the training engine is the same set of computations as the same query done by the inference engine after training is complete.
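
In PyTorch terms, that read-only behavior at inference time is usually made explicit, as in this small sketch (the tiny linear layer is just a stand-in for a real Transformer):

    import torch

    model = torch.nn.Linear(4, 2)   # stand-in for a real Transformer
    inputs = torch.randn(1, 4)

    model.train()                   # training mode: weights are read/write

    model.eval()                    # inference mode: weights are read-only
    with torch.no_grad():           # no gradients, so nothing can be updated
        logits = model(inputs)      # the same forward computation either way
    print(logits)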

Transformers

What’s a Transformer? It’s an engine that processes LLMs, and it’s an advanced type of neural network. The first Transformer was published and open-sourced by Google Research in 2017 (the “Attention Is All You Need” paper), and then everything got out of hand. And now, I have to mention that Transformers can be of three basic types:

  • Vanilla (encoder-decoder)
  • Decoder-only (e.g., GPT or Gemini)
  • Encoder-only (e.g., BERT)

Now you’ll want definitions for those too? We’ll be here all night, and it’s not even a joke, because the whole book is about Transformers and LLMs. Here’s a list of some of the well-known Transformer-LLM architectures:

  • GPT-4 (OpenAI)
  • Gemini (Google)
  • Llama (Meta)
  • Mixtral (Mistral)
  • Celebrity models (Character.AI)
  • Claude (Anthropic)
  • Grok (xAI)

Yes, I’ve missed a few! Although it all started as encoder-decoders in 2017, most of the modern Transformers are decoder-only (because it’s faster). In addition, they’ve all tweaked the vanilla Transformer in lots of different ways and usually have also published nice research papers about it (until recently).

And I know what you’re wondering: it’s always about ChatGPT from OpenAI. No, ChatGPT isn’t on the list, because it’s not really a Transformer architecture or an LLM. Rather, it’s more like an “app” (chatbot) or a “brand” or a “platform” that sits on top, using all that GPT stuff.

Models

What’s a model? An AI model is literally a binary data file with mostly numbers and a few text strings. For really big models with billions of parameters, it’s multiple files, but still only numbers and no code.

I really mean it: zero code. You won’t find any programs or scripts, and not even any HTML markup. If you’re looking for rules like “if the previous word was ’the’ then output a noun” then you’re out of luck, not to mention that you’re about thirty years behind the times, because that’s a rule-based expert system, and it’s not how models work in this century.

The main thing in a model file is “weights” which are literally fractional numbers. Billions of them. They’re sometimes called “parameters” when being more precise, but it’s the same idea.

Weights are multipliers of “signals,” such as whether a particular word should be output next. A fractional weight less than one makes a word less likely (dampening the signal), whereas a weight greater than one increases the likelihood of outputting that word (amplifying the signal). A zero weight means don’t output that word. A negative weight means really, really don’t output the word.
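
A toy numeric example of that amplifying and dampening behavior (illustrative only; real models combine billions of these multiplications):

    signal = 0.8           # how strongly some feature "votes" for a word
    print(signal * 1.5)    # weight above one amplifies the signal  -> 1.2
    print(signal * 0.5)    # weight below one dampens it            -> 0.4
    print(signal * 0.0)    # zero weight silences it                -> 0.0
    print(signal * -2.0)   # negative weight pushes against it      -> -1.6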

Programmers don’t create model files. You won’t have to edit a model file and click away on your calculator to get the right parameter numbers. The numbers inside a model file are auto-generated by the training engine.

In fact, it’s hard even to look at a model file, because it’s so crammed full of numbers. You can do a basic sanity check that it’s not spoiled with bogus Inf (infinity) and NaN (not-a-number) floating-point values, but you can’t see the intelligence by looking at the numbers, even if you squint. However, programmers do have to decide on the meta-parameters for their model before they run the training phase.
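
That sanity check can be automated. Here's a minimal sketch using PyTorch, assuming the model is stored as an ordinary .pth checkpoint (other formats, such as safetensors or GGUF, need different loaders), with a hypothetical file name:

    import torch

    def check_weights(path):
        """Report any NaN or Inf values in a PyTorch checkpoint's tensors."""
        state_dict = torch.load(path, map_location="cpu")
        for name, tensor in state_dict.items():
            if torch.is_floating_point(tensor) and not torch.isfinite(tensor).all():
                bad = (~torch.isfinite(tensor)).sum().item()
                print(f"{name}: {bad} non-finite values (NaN/Inf)")
        print("Check complete.")

    check_weights("model.pth")   # hypothetical checkpoint file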

What are meta-parameters? The meta-parameters (also called “hyper-parameters”) of the model are counts such as how many billion parameters it has, how many layers it has, and how many different words it understands (typically around 50,000). These are all static, fixed numeric values. Most of the meta-parameters are fixed from training through to inference, such as the “dimensions” of the model (e.g., the number of “layers” in the model). The size of the model, in terms of how many billions of parameters, is mostly fixed too, except that there are some tricky ways to speed up inference by reducing or modifying parameters, called “pruning” and “quantization,” but now we’re jumping ahead about twenty chapters.
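
For a concrete feel, here are the kinds of meta-parameters you'd pin down before training, sketched as a GPT-2-style configuration with the Hugging Face transformers library (the numbers are illustrative, not a recommendation):

    from transformers import GPT2Config, GPT2LMHeadModel

    config = GPT2Config(
        vocab_size=50257,   # how many different tokens it understands
        n_layer=12,         # number of layers
        n_head=12,          # attention heads per layer
        n_embd=768,         # embedding ("model") dimension
    )
    model = GPT2LMHeadModel(config)   # freshly initialized, untrained weights
    print(sum(p.numel() for p in model.parameters()), "parameters")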

Large Language Models (LLMs)

What’s an LLM? There’s nothing really special about Large Language Models (LLMs) used by ChatGPT, Gemini, or Llama, compared to other types of AI model files, except that they’re:

    (a) large,

    (b) language-focused (not images), and

    (c) a model.

Well, you asked, so I answered.

More specifically, LLMs tend to be model files that are processed by Transformers, rather than other types of AI engines.

What’s a Foundation Model? This is a large and general-purpose model that’s already been broadly trained. Any model that has billions of parameters and gets mentioned in a press release is usually a foundation model. The biggest foundation models might support text in multiple languages along with programming language coding knowledge.

Technically, if a foundation model has image-generating capabilities as well as text output, or can also receive an image as part of its input, then it’s no longer a plain text model (i.e., it’s not really an LLM). Instead, this advanced model type is often distinguished as a “multi-modal” model. And if there are two of those working together, then it’s a “multi-modal multi-model” and you should try saying that ten times in a row.

Other Types of LLMs

LLMs are not the only game in town. There were AI models before LLMs, and there are some newer models, as well. Here’s my list of non-LLMs:

  • Predictive ML models — old-school AI with various types of neural networks.
  • Diffusion models — image-related, using new tech, but not based on Transformers.

Some of the newer types of models based on LLMs and Transformers include:

  • Small Language Models (SLMs) — only a few billion parameters, small enough to fit in the palm of your hand.
  • Mixture-of-Experts (MoE) — multiple LLMs wrapped in one uber-LLM, such as GPT-4, Mixtral, and Google Gemini.
  • Large Multimodal Models (LMMs) — the latest thing!
  • Large Vision Models (LVMs) and Vision Transformers (ViTs) — machine vision with Transformer architectures.
  • Time Series Forecasting LLMs — data prediction with Transformers (e.g., Amazon Chronos, Nixtla TimeGPT, Salesforce Moirai, IBM Tiny Time Mixers, Google TimesFM).

Training and Fine-tuning

What’s the difference between training and fine-tuning? At least three zeros.

Training is how you shove all of the brain power from the entire Wikipedia corpus into a bunch of numbers. It takes a long time and the GDP of a small country to train a big model. Training is the big cost of a lot of AI projects.

The bad news about training is that if you mess it up, you have to start all over again. Well, this isn’t quite true, because training runs in batches of data. If the evaluation fails, you have to revert to the prior model candidate, since you can’t “un-train” an AI model. However, a review can also suggest areas where a model needs more training, or needs to be directed towards new behavior or personality features. In addition to batched training, there is also research on “incremental learning” as a thing.

What is fine-tuning? Fine-tuning refers to smaller amounts of training that are done to a model that’s already been fully trained. If you’re training a new model from scratch, even a small one, then that’s training, not fine-tuning.

The most common use of fine-tuning is to modify a powerful foundation model to do something more specific. Most foundation models have been broadly trained on general information. You might want to specialize the model for a particular use case or for a new set of data. This can be done in two ways:

  • Fine-tuning
  • Retrieval-Augmented Generation (RAG)

Fine-tuning adds new “knowledge” to the LLM about the extra training content. One risk is that it does so by changing the behavior of the underlying foundation model. Hence, the fine-tuning could change the LLM enough that it “forgets” whole areas of “information” that were in the foundation model, or rather, makes it prefer the new fine-tuning data over the existing data. There is little that can be done to control what the LLM “forgets” other than testing for this afterwards.

Proprietary business data is a common reason to fine-tune a foundation model (but there’s also RAG to consider). For example, to create a support chatbot for customers using your products, you can customize a foundation model to know about your company’s internal product documents via fine-tuning. To do this, you would fine-tune the foundation model using this extra internal data. In this way, a small amount of fine-tuning has added knowledge to the model about the new data, which it can then incorporate into its answers to users. It’s basically telling the LLM: when answering questions on this topic, prefer the newly trained data over whatever you previously “knew,” with the new knowledge baked into the model’s weights rather than supplied in the prompt (which is the RAG approach).
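
Under the hood, a fine-tuning run is just more training, starting from the already-trained weights. Here's a deliberately tiny sketch using the Hugging Face transformers library and a couple of made-up product Q&A strings (real fine-tuning jobs use proper datasets, batching, many epochs, and evaluation):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")   # start from the foundation model
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

    # Hypothetical proprietary Q&A data, formatted as prompt/answer text.
    examples = [
        "Q: What warranty does the Widget 3000 have? A: Two years, parts and labor.",
        "Q: Does the Widget 3000 support Wi-Fi? A: Yes, 2.4GHz and 5GHz.",
    ]

    model.train()
    for text in examples:
        batch = tokenizer(text, return_tensors="pt")
        outputs = model(**batch, labels=batch["input_ids"])  # causal LM loss on the new data
        optimizer.zero_grad()
        outputs.loss.backward()
        optimizer.step()                                     # the weights actually change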

RAG is not training. Note that Retrieval-Augmented Generation (RAG) is not a type of training or fine-tuning. In fact, it’s a way to avoid them like the plague.

RAG is an architectural add-on where the Transformer can talk to a component that knows how to “retrieve” extra information or documents, such as proprietary internal business documents about your products. This extra data is used as input context during inference of the model, thereby extending the basic model to answer questions specific to this extra material.

The main point is that RAG avoids the expense of training and fine-tuning, while incurring some extra cost in implementing the RAG component. Note that a RAG architecture is also not “tied” to the model/engine at all, and it is generally very easy to switch engines or models behind the scenes in the RAG application code.
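
A toy version of the RAG flow might look like the sketch below, with a trivial keyword-overlap retriever standing in for a real vector database (the documents and the prompt wording are purely illustrative):

    # Hypothetical internal documents, pre-split into retrievable chunks.
    documents = [
        "The Widget 3000 has a two-year warranty covering parts and labor.",
        "The Widget 3000 supports both 2.4GHz and 5GHz Wi-Fi networks.",
    ]

    def retrieve(query, docs, top_k=1):
        """Trivial keyword-overlap retriever (stand-in for embedding search)."""
        def score(doc):
            return len(set(query.lower().split()) & set(doc.lower().split()))
        return sorted(docs, key=score, reverse=True)[:top_k]

    def build_prompt(query):
        context = "\n".join(retrieve(query, documents))
        # Retrieved chunks are injected into the prompt; the model itself is unchanged.
        return f"Answer using only this context:\n{context}\n\nQuestion: {query}\nAnswer:"

    # The assembled prompt is then sent to an ordinary, unmodified inference engine.
    print(build_prompt("What warranty does the Widget 3000 have?"))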

Data sets. High-quality data is fundamental to both fine-tuning/training and RAG techniques. Historically, many of the training data sets have been painstakingly compiled by humans. A newer technique is to use the output of one LLM as the input training data set for another model. This method and other types of “synthetic data” are being used more and more widely.

The required structure of the data is different for RAG versus fine-tuning/training. RAG data needs to be split into “chunks” of relevance to particular questions, which can map reasonably well to internal documents. Data for fine-tuning may require more of a Q&A or conversational structure, which is not typical of product datasheets or corporate policy documents! For example, if you want a chatbot that answers questions, then the training data needs to be a lot of question/answer pairs. With RAG, on the other hand, the data is wrapped into a question as part of a hidden prompt.

Just feeding an LLM the whole of your corporate marketing materials, or even the whole of Wikipedia, will not help that much with fine-tuning or training. Feeding an LLM a steady diet of Wikipedia text means it can likely “complete” an input in the style of Wikipedia, and probably cite paragraphs verbatim from the corpus, but not necessarily answer questions about the content in Wikipedia. The training data needs to be a lot of question/answer type pairs in order to answer questions (i.e., so that the “completion” is an answer to the question). LLMs just predict a likely next word and next sentence, so the LLM has to learn to predict an answer after a question. Similarly, for a translation use case, if you want your LLM to translate English to Klingon, the training data can’t just be the English Wikipedia content combined with the Klingon one, but needs to be pairs of English phrases and their corresponding Klingon phrases.
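
To make the structural difference concrete, here's a sketch of how the two kinds of data records are often laid out (the field names, the Widget 3000 answers, and the Klingon are all illustrative, not any particular vendor's schema):

    import json

    # Fine-tuning data: question/answer (or source/target) pairs, often stored as JSONL.
    finetune_records = [
        {"prompt": "Q: What is the Widget 3000's warranty? A:", "completion": " Two years."},
        {"prompt": "Translate to Klingon: Hello.", "completion": " nuqneH."},
    ]

    # RAG data: free-form document chunks, retrieved and wrapped into a prompt at query time.
    rag_chunks = [
        {"id": "warranty-01", "text": "The Widget 3000 has a two-year warranty."},
    ]

    with open("finetune.jsonl", "w") as f:   # hypothetical file name
        for record in finetune_records:
            f.write(json.dumps(record) + "\n")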

Inference

What is inference? The term “inference” is the AI way of saying “running” or “executing” the AI model. Inference and training are different phases. When you’re training or fine-tuning, that’s not inference. But when you’re done and deploy your model live for a nickel a query, that’s inference. When you ask ChatGPT a question, you’re sending a “query” or “prompt” to its “inference engine” and when it politely refuses to do what you ask, that’s the output results of its “inference” code.

What are latency and throughput? Latency is how long your inference engine takes to answer a single query; it’s similar to the idea of “response time” for a single user. Throughput is a measure of how many queries your engine can handle over time, which relates more to how fast your engine can handle a group of users submitting many queries.
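
A crude way to see the difference in code (the query function is a placeholder for a real engine call; serious benchmarking needs concurrency, batching, and warm-up runs):

    import time

    def run_query(prompt):
        """Placeholder for a real call to your inference engine."""
        time.sleep(0.05)             # pretend the engine takes 50 milliseconds
        return "some answer"

    prompts = ["hello"] * 20
    start = time.time()
    for p in prompts:
        run_query(p)                 # one at a time, like a single user
    elapsed = time.time() - start

    print(f"Latency:    {elapsed / len(prompts):.3f} seconds per query")
    print(f"Throughput: {len(prompts) / elapsed:.1f} queries per second")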

Types of inference. There’s not only one type of inference, and the exact algorithm depends on what you’re trying to do. Some of the types include:

  • Completions. This means extending the prompt into a longer answer. Common use cases include auto-writing text or answering questions.
  • Translation. Convert your Python code comments into Klingon.
  • Summarization. Taking the input prompt, such as a paragraph or document, and creating a brief summary.
  • Grammatical Error Correction (GEC). Also known to non-researchers as “editing.”
  • Transformation. Changing the tone or style of a text input, or changing the formatting and presentation.
  • Categorization. Sorting the inputs into a set of different categories, which you can think of as an extreme form of summarization.

Inference Settings. In addition to choosing the overarching type of inference algorithm, there are some common parameters to control an inference request.

  • Temperature. A higher temperature setting gives your engine a fever, and makes it output silly words. This is known as “creativity.”
  • Token limit. This is the simple idea of limiting the number of words (tokens) that the engine is allowed to output in its response.
  • Formatting. Do you want the engine to output plain text, a table, or some other format?

This is only a sample list, and API providers typically have many more options. There are also usually various other parameters related to security and tracking of requests. For example, you probably have to submit your security credentials (e.g., an API key) along with a unique ID for the request. This helps the API validate your request and helps you keep track of which end user submitted the request, so you can send the answer back to them.
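
For example, with the OpenAI Python client those settings look roughly like this (a sketch only; parameter names vary between vendors and API versions, and the model name is just an example):

    from openai import OpenAI

    client = OpenAI()   # assumes an API key is set in the OPENAI_API_KEY environment variable
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Summarize the plot of Hamlet in one sentence."}],
        temperature=0.7,   # the "creativity" dial
        max_tokens=100,    # token limit on the response
    )
    print(response.choices[0].message.content)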

Context and Conversations

If you’re creating a chatbot, you create a UI that accepts the user’s input, sends it off to the AI engine via the network, and then outputs the answer back to the user. The user and the engine go back-and-forth with a stream of requests and responses, thereby creating a conversation.

Oh, really?

What’s missing is the “context” of every request that’s part of the conversation. You cannot just send the user’s latest response off to the engine, because:

AI engines are stateless.

Hence, the default AI engine doesn’t remember what else it’s already said. Maybe it’s because the GPUs have stolen all their RAM.

Instead, it’s up to you as the programmer to store and re-submit the entire conversational history with every request. This is a wonderful situation when you’re paying per input token.
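
In practice, that means your application keeps a growing list of messages and sends the whole thing on every turn, as in this sketch (the send function is a placeholder for a real API call):

    history = []   # the application, not the engine, remembers the conversation

    def send_to_engine(messages):
        """Placeholder for a real request to a stateless inference engine."""
        return f"(reply to: {messages[-1]['content']})"

    def chat_turn(user_text):
        history.append({"role": "user", "content": user_text})
        reply = send_to_engine(history)   # the ENTIRE history is sent every time
        history.append({"role": "assistant", "content": reply})
        return reply

    print(chat_turn("What's a Transformer?"))
    print(chat_turn("And how is that different from an LLM?"))   # needs the earlier context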

The API vendors are starting to handle context management for you. The OpenAI API provides helpful ways to structure the historical context in a request. The context is also stored in the KV cache of an LLM, and the idea of swapping the KV cache in and out is becoming a thing. It’s called “prefix KV caching” and it’s already available in vLLM, DeepSeek, Anthropic, Google Gemini, and OpenAI. This allows the computations for a repeated prompt prefix to be re-used within a single user’s session, or even shared across many users, like a “short-term memory” transplant occurring at each context switch.

Extended Transformers

The main type of Transformer that gets all the hype is the Generative Pre-Trained Transformer (GPT). This is the basic text processing Transformer that can process words and generate output with surprisingly human-like elegance.

Modern research has been applying Transformers to other types of input and use cases. The result has been various new extensions of Transformer architectures.

  • Multi-modal Transformer. This refers to Transformers that can accept images (or video) as input, rather than just simple text prompts.
  • Vision Transformer (ViT). These apply Transformer technology to computer vision applications, such as self-driving cars.
  • Bidirectional Transformer. This is an older research architecture that hasn’t received as much attention lately. The idea is that it can examine its input data from both directions at the same time. The main example is “Bidirectional Encoder Representations from Transformers” (BERT) and its many variants.
  • Retrieval-Augmented Generation (RAG). This is an architecture where a Transformer is combined with a separate component that “retrieves” extra data (e.g., a document search mechanism). The idea is to extend the AI engine to new data without extra training.
  • Ensemble inference. An “ensemble cast” is a Hollywood term that means a film with a group of famous actors all starring together in the same story. Someone with a sense of humor (or very large ambitions) decided to use the same term for a group of AI models all working together to create the same masterpiece.

Some of the major areas of Transformer research involve addressing the resource-hungry nature of their execution. For example, a basic Transformer has quadratic cost complexity in terms of the input length. Hence, there are numerous modifications in Transformer architectures being created, both in industry and research labs. See Part VII of this book for a full literature review of the extensive body of research related to Transformers.

Other Types of Neural Networks

The Transformer was a breakthrough in the evolution of neural networks. One of its main advantages was its capacity to perform calculations in parallel, allowing it to increase intelligence through sheer brute-force computation. This led to a massive increase in the size of models, to the multi-billion-parameter scale, which we now call Large Language Models (LLMs).

Before the Transformer, there were many different neural network architectures. Several of these designs are still being used today in areas where they are stronger than Transformers.

Recurrent Neural Networks (RNNs). An early type of neural network that worked iteratively through a sequence. An RNN processes its inputs one token at a time, creating its output response, and then feeds its own output back in as an input to its next step. Hence, it is “recurrent” in processing its own output, which is also known as “auto-regressive” mode when the same idea occurs in Transformers. Transformers have largely displaced RNNs for applications in text processing and generative AI. However, there are still research papers attempting to revive RNNs with advancements, or to create hybrid Transformer-RNN architectures.
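
For contrast with the Transformer's parallelism, here's a minimal PyTorch sketch of an RNN stepping through a sequence one token at a time, carrying its hidden state from step to step (toy sizes and random inputs, purely for illustration):

    import torch

    rnn = torch.nn.RNN(input_size=8, hidden_size=8, batch_first=True)
    tokens = torch.randn(1, 5, 8)          # a sequence of 5 toy token embeddings

    hidden = None
    for t in range(tokens.shape[1]):
        step = tokens[:, t:t+1, :]         # one token at a time, strictly in order
        out, hidden = rnn(step, hidden)    # the hidden state carries over to the next step
    print(out.shape)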

Generative Adversarial Networks (GANs). These are an advanced type of image-generating neural network. The idea is to combine two models: one that generates candidate images (the “generator”), and another that evaluates them (the “discriminator”). By a weird kind of “fighting” between the two models, the generator gradually creates better images that please the discriminator. The results are surprisingly effective, and this technology is still in use today.

Convolutional Neural Networks (CNNs). Whereas RNNs and Transformers are focused on input sequences, CNNs are better at input data that has a structure, especially the spatial structure inherent in images. Modern image processing and computer vision technology still uses CNNs, although enhanced Transformer architectures, such as multimodal or vision transformers, can also be used. CNNs are good at splitting an image into separate input “channels” and then applying a “filter” to each channel. Hence, CNNs have been holding their own against Transformers in areas related to image processing.

There are various other types of neural network, which all have some research attention:

  • Long short-term memory (LSTM) — a type of RNN.
  • Spiking neural networks (SNNs)
  • Liquid neural networks (LNNs)
  • State Space Models (SSMs) — e.g., Mamba.
  • Quantum neural networks (QNNs)

There’s another architecture that’s fairly commonly used, in both consumer and business settings. It’s called a Carbon-Based Model (CBM). However, CBMs are very difficult to train, often needing 20 years’ worth of training just to start being useful, while consuming numerous inputs and producing only dismissive waves and occasional grunts as output.

This book is mostly about Transformers, so the interested reader is referred to the research literature for these architectures, or to a psychology textbook for the bio-robots. As a general rule, there are so many research papers being written about AI that there are literally exceptions to everything. But those intrepid researchers are doing a great service to programmers by giving us lots of gritty algorithms to code up.
