Book Excerpt from "Generative AI Applications: Planning, Design and Implementation," by David Spuler
Chapter 14. Training and Fine-Tuning
Training Options
It’s easy to make a small fortune in LLM model training these days. You start with a big fortune, and then do training.
If you want a new model, and none of the off-the-shelf commercial or open source models are good enough, here are your basic options for training a smarter model:
- Train a new foundation model from scratch.
- Fine-tuning (FT) of an existing model.
- LoRA-based fine-tuning
- Retrieval-Augmented Generation (RAG) using a document database.
It’s a great list. Let’s start at the top.
Training a Foundation Model
Training your own foundation model from scratch is kind of expensive, and I don’t really recommend you try to train your own foundation model at home. Also, why bother? The top LLMs are so good these days, both commercial and open-source, and you can rent or buy them. Hence, training a new model from scratch is probably relegated to non-language ML projects that use your own proprietary non-text data.
But don’t listen to me. If you really have a nine-figure funding round, then go ahead and train your own foundation LLM. Send me a selfie from your private jet.
On the other hand, fine-tuning an existing model is cheaper. And you can use LoRA fine-tuning, which is even cheaper, too. RAG is cheaper still (probably), but it’s not even a type of training, so it should really be banned by the European Union for false advertising.
Still reading? That means you still want to do training. Which is fine, I guess, provided the GPU hosting cost isn’t coming out of your pay packet. In terms of optimizing a training project, here are some methods that might be worth considering:
- Choose smaller model dimensions (smaller is cheaper, but bigger is smarter).
- Evaluate open-source vs commercial models.
- Evaluate fine-tuning (FT) vs Retrieval-Augmented Generation (RAG).
- Quantized models (“model compression” methods).
- Knowledge distillation (train a small model using a large “teacher” model).
- Dataset distillation (train a small model using auto-generated outputs from a large model).
Fine-Tuning
Hopefully, you’ve come to your senses, and dropped the idea of training a foundation model from scratch. But then, why not fine-tune one?
If you want to do fine-tuning, here is a checklist of things to consider:
- Data availability — fine-tuning needs some specialized data, preferably some that only you own.
- Good data — having bad data is worse than no data.
- Prompt engineering versus fine-tuning — goals of fine-tuning can sometimes be more easily achieved via tweaks in prompt engineering (e.g., brand voice with prepended custom instructions).
- RAG — ensure you’ve considered RAG versus fine-tuning.
- Cost — various types of fine-tuning have different cost-to-accuracy profiles.
How does fine-tuning work? An existing foundation model is trained on new material using the standard AI training methods. The use of extra specialist text to further train a model is called “fine-tuning.” This is a longstanding method in AI theory, and fine-tuning can be performed on all of the major AI platforms. In the fine-tuning approach, the result of the re-training is that proprietary information about your products is all “inside” the model.
Types of fine-tuning. There are multiple ways to do fine-tuning, but the basic categories are:
- Full-parameter fine-tuning
- Parameter-efficient fine-tuning (PEFT)
Full-parameter fine-tuning is where you tweak every parameter in the entire base model. This is like continuing the original training of a foundation model, but with a specialized fine-tuning data set. It’s also quite costly in terms of GPU juice to do full training, although the idea is to use a lot less data than would be required for training the original foundation model.
Parameter-efficient fine-tuning or PEFT is the new way to do fine-tuning. The idea is to “freeze” most of the parameters in the main foundation model, and tweak a subset of them. This makes updating the model weights much faster.
The most popular way to do PEFT is called Low-Rank Adaptation, which has the cute acronym LoRA. This uses a trick with low-rank matrices and some fancy matrix algebra, so that you train only the parameters in a pair of much smaller matrices, which cuts the cost of fine-tuning. LoRA also has a big advantage during inference: it adds zero cost at runtime, except for an initial setup cost when loading the LoRA adapter. A common strategy is “multi-LoRA,” which maintains multiple specialized versions of a base model as a set of fine-tuned LoRA adapters.
Fine-Tuning Algorithm
The training algorithm used for fine-tuning is conceptually similar, regardless of whether you’re doing full-parameter fine-tuning, or PEFT, or LoRA, or the many other sub-variants. You grab some data and run it through a training engine that supports whatever method you want.
The general training algorithm at a very high level is as follows:
(a) Split the training data 80/20 (sometimes 90/10) into data to train with (training dataset) and data to evaluate the result of the training (validation dataset). If you have enough training data, use multiple training and validation datasets.
(b) Feed each input into the network, compare the output with the expected answer using the “loss function” to compute an “error,” and use that error to tweak the weights according to the learning rate.
(c) After all of the training data (the 80%) has been fed in, use the validation dataset to evaluate the new model’s performance. This uses new data that the model has not yet seen, also in a question-and-expected-response format.
(d) Based on the evaluation, you can accept the model or make major changes. For example, if you give it totally unseen data (i.e., the 20%) and it only responds correctly 50% of the time, you need to decide whether to continue with the next training dataset, or whether it’s time to redesign the model and try again. If the model performs poorly, you have to allocate blame: whether the training data is good, whether the model’s structure is correct, whether the loss function is correct, whether the learning rate for the incremental weight changes on each iteration is aggressive enough (or too aggressive), whether the biases are wrong, and so on. To fix it, tweak the model meta-parameters (e.g., number of layers, number of nodes per layer) or the training algorithm meta-parameters (e.g., learning rate), then go back to the first step and start over.
This is only a top-level outline of a training algorithm. There are many improvements and finesses to get to a fully advanced fine-tuning algorithm.
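As a concrete illustration of steps (a) through (d), here is a minimal sketch of the train/validate loop in PyTorch. A tiny linear model and random data stand in for the real LLM and dataset (these are illustrative assumptions, not the book’s specific setup); the overall structure is the same for genuine fine-tuning.

```python
# Minimal sketch of the train/validate loop described above, using PyTorch.
# A toy linear model and random data stand in for the real LLM and dataset.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, random_split

# Toy dataset: inputs and expected outputs (stand-ins for real training pairs).
X = torch.randn(1000, 16)
y = torch.randn(1000, 1)
dataset = TensorDataset(X, y)

# Step (a): split 80/20 into training and validation datasets.
train_set, val_set = random_split(dataset, [800, 200])
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=32)

model = nn.Linear(16, 1)                                     # stand-in for the base model
loss_fn = nn.MSELoss()                                       # the "loss function"
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)   # the learning rate

for epoch in range(3):
    # Step (b): feed inputs, compute the error, tweak the weights.
    model.train()
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        error = loss_fn(model(inputs), targets)
        error.backward()
        optimizer.step()

    # Step (c): evaluate on the held-out validation data the model hasn't seen.
    model.eval()
    with torch.no_grad():
        val_loss = sum(loss_fn(model(inputs), targets).item()
                       for inputs, targets in val_loader) / len(val_loader)

    # Step (d): inspect the validation result and decide whether to accept the
    # model, adjust meta-parameters (learning rate, model size), or start over.
    print(f"epoch {epoch}: validation loss {val_loss:.4f}")
```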
Training Data for Fine-tuning
One of the biggest obstacles to a fine-tuning project is getting enough data. Many projects where fine-tuning seems like a good idea are scuttled when there is no critical mass of data with which to train. Fine-tuning usually requires more data than RAG.
For fine-tuning data to be viable, it usually needs to have:
(a) Several cases of every concept you want to teach it, with both input and expected output. Depending on the NN architecture, it may also need a score indicating how good the output is.
(b) Corner cases and extra data to capture subtle details or complexities.
(c) Held-back extra cases which are not used for training, but are used to evaluate the state of the training.
Gathering the data is likely the hardest part of training. And the more training iterations you need to do, the more data you need. Training data management is mostly a non-coding task, involving processing of the data files, such as chunking, organizing, indexing, and generating embeddings. It’s arduous to some extent, but not high in technical difficulty.
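To make the data requirements concrete, here is a minimal sketch of what a curated fine-tuning dataset might look like as JSONL, with one prompt/response pair per line. The field names and the example product are purely illustrative; check your training platform’s documentation for its exact schema.

```python
# Sketch of a curated fine-tuning dataset in JSONL format.
# Field names ("prompt", "response") and the example product are illustrative.
import json

examples = [
    {"prompt": "What is the warranty period for the X200 widget?",
     "response": "The X200 widget comes with a two-year limited warranty."},
    {"prompt": "Does the X200 widget support Bluetooth?",
     "response": "Yes, the X200 widget supports Bluetooth 5.0."},
    # ...plus several cases for every concept, corner cases, and held-back
    # examples kept aside for validation, as described above.
]

with open("finetune_train.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```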
Model-Based Training
Another way to do training is to have the new model talk to another, previously trained system. Knowledge distillation is one of these techniques; it is already available in major AI frameworks and has a high level of sophistication. Another, simpler method is to train a new model on the prompt-answer pairs from another large model, which doesn’t have a consistent name in the literature, but is sometimes called “dataset distillation” or “downstream data.”
If you like computers, then you can probably think of a few ways to create your own data. Alas, some other smart bunnies have beaten you to it! The general idea of using computer-generated input data is called “synthetic data,” but that doesn’t always mean data from another model’s output.
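As a sketch of the prompt-answer collection idea described above, the loop below harvests a large model’s answers to a list of prompts and saves them as training data for a smaller model. The query_large_model function is a hypothetical stand-in for whatever API or local inference call you would actually use; the prompts are illustrative.

```python
# Sketch of "dataset distillation": harvest prompt-answer pairs from a large
# teacher model to use as training data for a smaller model.
import json

def query_large_model(prompt: str) -> str:
    # Hypothetical stand-in: replace with a real API call or local inference.
    return "PLACEHOLDER ANSWER for: " + prompt

prompts = [
    "Explain our return policy in one sentence.",
    "List three key features of the X200 widget.",
]

with open("distilled_train.jsonl", "w", encoding="utf-8") as f:
    for prompt in prompts:
        pair = {"prompt": prompt, "response": query_large_model(prompt)}
        f.write(json.dumps(pair) + "\n")
```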
What is LoRA?
LoRA is a type of Parameter-Efficient Fine-Tuning (PEFT). What that means is that instead of “full parameter” fine-tuning, we’re only going to train a small subset. The idea is to “freeze” the main model, and only train a small set of differences.
Full-parameter fine-tuning is where we continue the whole shebang of training, as if we were continuing the original from-scratch training. That is expensive and very memory-intensive, because we have to update every single parameter. It’s also inefficient later if we want more than one fine-tuned model, because each one is a massive, fully separate copy of the model.
Hence, the advantages of LoRA are:
- Efficient fine-tuning
- Efficient inference if using more than one (i.e., multi-LoRA)
How does LoRA work? The idea of fine-tuning is that we want to change a set of numbers in a matrix. So, instead of changing the weights in the original model matrix, we could (conceptually) generate a separate “differences” matrix. Then later, in the inference phase, we could load the main model matrix and the difference matrix, add them together, and we’ve got a final model. In practice it’s more complicated, but assume we can generate a difference matrix.
But how is that more efficient? Storing two big matrices, an original and a difference matrix, isn’t better than just storing one.
Well, it’s more efficient because instead of storing a big difference matrix, we use “low-rank decomposition” of matrices, which is a type of matrix factorization, to generate two much smaller matrices. The idea is that these two smaller matrices re-generate the big matrix when multiplied together.
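Here is a minimal from-scratch sketch of that idea in PyTorch: the base layer’s weights are frozen, and only the two small matrices (called A and B here) are trainable. The dimensions and initialization are illustrative assumptions; real libraries such as Hugging Face’s PEFT add scaling factors, dropout, and per-layer configuration on top of this.

```python
# Minimal sketch of a LoRA-style layer: freeze the base weights, train only
# the low-rank pair A and B.
import torch
from torch import nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                              # freeze the foundation weights
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)    # trainable low-rank matrix
        self.B = nn.Parameter(torch.zeros(d_out, rank))          # trainable low-rank matrix

    def forward(self, x):
        # Base output plus the low-rank "difference" (B @ A acting on x).
        return self.base(x) + x @ self.A.T @ self.B.T

# Usage: wrap an existing layer; only A and B receive gradient updates.
layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(4, 512))    # shape (4, 512)
```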
Matrix decomposition. If you remember matrix multiplication of non-square matrices from High School, the dimensions work like this:
M x N times N x P = M x P
What happened to N? The inner dimension, N, is not used in the final size.
Hence, if we want the final matrix to be 1000x1000, we have M=P=1000. Yes, I know, in practice it would be powers of two sizes, but that makes the math too hard for me.
Now, the funny thing is that to create our MxP size matrix, it doesn’t matter what N is, because it disappears. The smallest case is to use N=1. Hence, we have a 1,000x1 matrix (a column) times a 1x1,000 matrix (a row).
This is very efficient! Instead of 1,000,000 weights in the big matrix, we have only 2,000 weights in a column and a row. Hence, we have very few weights to train: as a percentage, that’s just 0.2% of the original. No AI calculator required.
Unfortunately, this matrix trick doesn’t work perfectly. Not all 1,000x1,000 matrices can be exactly decomposed into a column times a row. Instead, we have to find the closest that we can, which means that low-rank decomposition is an approximation.
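To see what “closest approximation” means, the snippet below uses a truncated SVD, which gives the best rank-N approximation of a matrix in a least-squares sense. This is just an illustration of the approximation idea; LoRA training actually learns the two small matrices directly via gradient descent rather than computing an SVD.

```python
# Illustration of low-rank approximation via truncated SVD (not how LoRA
# itself trains its matrices, just a demonstration of the approximation).
import torch

W = torch.randn(1000, 1000)            # some big "difference" matrix
U, S, Vh = torch.linalg.svd(W)

r = 8                                  # target rank
B = U[:, :r] * S[:r]                   # 1000 x r
A = Vh[:r, :]                          # r x 1000
W_approx = B @ A                       # best rank-r approximation of W

print(torch.linalg.matrix_rank(W_approx).item())               # 8
print(torch.linalg.norm(W - W_approx) / torch.linalg.norm(W))  # approximation error
# A random matrix compresses poorly; the LoRA insight is that real fine-tuning
# weight *updates* tend to have low intrinsic rank, so the approximation is good.
```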
LoRA adapters. In most LoRA implementations, we don’t use N=1. Higher values of N make the approximations better, and values like N=8 to N=256 are more typical. But the number of parameters in a LoRA adapter is still much less than the quadratic size of the large matrix.
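Plugging in some numbers shows how the adapter size grows with N (the “rank”) but stays well below the full matrix. The dimension 4096 is used here as a typical hidden size for a mid-sized LLM; it is illustrative, not tied to any particular model.

```python
# Trainable weights: full d x d matrix versus a LoRA pair of rank r
# (one d x r matrix plus one r x d matrix).
d = 4096                               # illustrative hidden dimension
full = d * d                           # ~16.8 million weights
for r in (1, 8, 64, 256):
    lora = 2 * d * r
    print(f"rank {r:3d}: {lora:>9,} LoRA weights "
          f"({100 * lora / full:.2f}% of the full matrix)")
```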
So, the “LoRA adapter” is actually just a set of two small low-rank matrices. It’s much less to store and less expensive to load into memory.
The whole thing is a little more complicated than that because every layer is different. We need a set of low-rank matrices for each layer. Hence, a LoRA adapter is a set of per-layer paired low-rank matrices.
Also, there can be more than one matrix per layer. There are variations in LoRA not just in the matrix dimensions, but also in terms of which parameters are targeted within a layer: attention weights and/or FFN weights. Generally, the collective wisdom is that attention weights are the first priority for LoRA, with FFN weights added if you want extra accuracy (at extra cost). If you do both, it’s really two LoRA adapters per layer of the model, which means four low-rank matrices per layer.
Inference. Initializing an inference server with the fine-tuned model is slightly slower, due to an additional setup step of loading the LoRA adapter and merging in its low-rank matrices. But once it’s loaded, there is zero extra latency in inference. The extra parameters in the LoRA adapter are no longer used during inference.
Inference works by loading the main foundation model and also the LoRA adapters into memory. The pairs of decomposed matrices are multiplied together to create a large “difference” matrix, and this is simply added to the main foundation model. After this step, the LoRA adapters are no longer needed, and only the modified foundation model is used for inference, in the normal manner. Neither the low-rank matrices nor the computed difference matrix need remain in memory. Note that practical coded engines don’t actually ever store the big difference matrix in memory, through the magic of kernel fusion optimizations.
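Conceptually, the merge step looks like the following sketch (as noted above, real engines fuse these operations and never materialize the full difference matrix; the dimensions here are illustrative).

```python
# Conceptual sketch of merging a LoRA adapter into the base weights at load
# time. W is one frozen base weight matrix; B and A are its trained LoRA pair.
import torch

d, r = 4096, 16
W = torch.randn(d, d)                 # base foundation-model weights
B = torch.randn(d, r) * 0.01          # trained low-rank pair (the adapter)
A = torch.randn(r, d) * 0.01

W += B @ A                            # add the "difference" matrix once
# From here, inference uses only the merged W: zero extra runtime cost.
```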
Multi-LoRA
The idea of multi-LoRA arises if we want two or more fine-tuned models. Note that multiple pairs of low-rank matrices, one per layer, is still considered to be a single LoRA adapter. Multi-LoRA is only when you do the whole fine-tuning twice, and you get two or more LoRA adapters.
The advantages of multi-LoRA include:
- Multiple fine-tuned models that are specialized off one foundation model.
- Fine-tuning costs are lower than full-parameter fine-tuning.
- Inference extra cost is literally zero.
- Switching between multiple fine-tuned models is efficient.
To the last point, let’s see how “hot swapping” of multiple LoRA adapters works. To “unload” a LoRA adapter, the process is similar to loading, but we use subtraction of the differences rather than addition. Simply load the two low-rank matrices of the LoRA adapter and then (a) multiply them together to re-create the differences matrix, and (b) subtract (rather than add) these values from the larger matrices, thereby yielding the original values of the foundation model. In practice, rather than creating a large difference matrix in memory, these two operations can be merged using kernel fusion. In fact, the third operation of loading the next LoRA adapter can also be fused into a triple operation.
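In code, a conceptual hot swap is just that subtract-then-add sequence. Again, a real serving engine would fuse these steps rather than building full difference matrices; the dimensions and adapter names below are illustrative.

```python
# Conceptual hot swap between two LoRA adapters on the same loaded base model.
import torch

d, r = 4096, 16
W = torch.randn(d, d)                                             # base weights
adapter1 = (torch.randn(d, r) * 0.01, torch.randn(r, d) * 0.01)   # (B, A) pair
adapter2 = (torch.randn(d, r) * 0.01, torch.randn(r, d) * 0.01)

W += adapter1[0] @ adapter1[1]        # currently serving with adapter 1

# Swap: subtract adapter 1's differences, then add adapter 2's.
W -= adapter1[0] @ adapter1[1]        # restore the original base weights
W += adapter2[0] @ adapter2[1]        # now serving with adapter 2
```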
This multi-LoRA method has actually been chosen for Apple Intelligence in its on-device version. Technically, since they’re also using quantization, it’s probably using Quantized LoRA (QLoRA) and multi-QLoRA.
Apple wanted different models for each specialized use case, with dozens of different AI tasks that need a model, but didn’t want to store multiple large foundation models. Hence, it uses a multi-LoRA architecture, where each specialized model is generated from one foundation model (for on-device, it’s understandably not a huge LLM but a small language model of size 3B) plus one of many LoRA adapters.
Each LoRA adapter is much smaller than the 3B foundation model, with a size in the “tens of millions” of parameters. Hence, the LoRA adapters can be swapped on-the-fly at relatively low cost every time your iPhone wants to do a different AI task.
Training FAQs
What is pre-training? It’s not a specific type of training algorithm; it just means the broad, general training that the model has already been through before any fine-tuning. This term mostly appears in Generative Pre-trained Transformers (GPT), which you may have heard of.
It’s common for a commercial service to offer access to a pre-trained model. For example, the OpenAI API allows you to send queries to the pre-trained GPT models, which have a broad level of trained capabilities. Similarly, there are numerous open source pre-trained models available, such as the Meta Llama2 model and various smaller ones.
What is re-training? Again, this isn’t really a technical term. It usually just means the same as fine-tuning.
What is knowledge distillation? Knowledge distillation (KD) is an optimization technique that creates a smaller model from a large model by having the large model train the small model. Hence, it’s a type of “auto-training” using a bigger teacher model to train the smaller student model. The reason it’s faster is that once the training is complete, you use only the smaller student model for processing user queries, and don’t use the bigger model for inference at all.
Distillation is a well-known and often used approach to save cost but retain accuracy. For example, a large foundation model usually has numerous capabilities that you don’t care about. There are various ways to use “distillation” to have the large model teach the smaller model, but within a subset of its capabilities. There are ways to share inference results and also more advanced internal weight-transfer strategies.
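One common flavor of knowledge distillation trains the student to match the teacher’s softened output distribution. Here is a minimal sketch of that loss in PyTorch; the temperature value and vocabulary size are illustrative, and a full recipe usually mixes this with an ordinary hard-label loss.

```python
# Sketch of a standard knowledge-distillation loss: KL divergence between the
# teacher's and student's temperature-softened output distributions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return kl * (temperature ** 2)    # rescale so gradients stay comparable

# Example: a batch of 4 queries over a 32,000-token vocabulary.
student_logits = torch.randn(4, 32000, requires_grad=True)
teacher_logits = torch.randn(4, 32000)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```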
What is model initialization? That’s where you use malloc to allocate a memory block that exceeds the capacity of your machine. Umm, no. Model initialization is an important part of the training algorithm, and as you have probably already guessed, this refers to the start of training.
Since training creates a smart model by updating parameters incrementally by small amounts, it works better if the parameters start out close to where they need to be. So, you don’t just start training with the model full of zeros. Instead, you try to “jumpstart” the process with better initialization. However, it’s far from clear what the best choice of initialization values should be, and there are lots of research papers on this topic.
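For example, in PyTorch you might initialize a layer with one of the standard schemes rather than zeros; which scheme works best depends on the architecture and activation functions. The layer size below is illustrative.

```python
# Sketch of common weight-initialization schemes in PyTorch.
import torch
from torch import nn

layer = nn.Linear(512, 512)
nn.init.xavier_uniform_(layer.weight)     # "Glorot" initialization
nn.init.zeros_(layer.bias)

# A common alternative for ReLU-style activations:
nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
```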
References
- Yarally T, Cruz L, Feitosa D, et al., 2023, Uncovering energy-efficient practices in deep learning training: Preliminary steps towards green AI, International Conference on AI Engineering - Software Engineering for AI (CAIN), https://arxiv.org/abs/2303.13972
- A. Apicella, F. Donnarumma, F. Isgrò, and R. Prevete, 2021, A survey on modern trainable activation functions, Neural Networks, vol. 138, pp.14–32, https://arxiv.org/abs/2005.00817 (Extensive survey all about training with activation functions, e.g., RELU, Swish, Maxout, leaky RELU.)
- R. Immonen, T. Hämäläinen et al., 2022, Tiny machine learning for resource-constrained microcontrollers, Journal of Sensors, vol. 2022, https://www.hindawi.com/journals/js/2022/7437023/ (Survey of on-device training for TinyML/edge computing.)
- P Freire, E Manuylovich, JE Prilepsky, SK Turitsyn, 2023, Artificial neural networks for photonic applications—from algorithms to implementation: tutorial, Advances in Optics and Photonics, Sep 2023, https://opg.optica.org/directpdfaccess/f0ae8746-2f89-4ac4-bb598eda29c7977c_539680/aop-15-3-739.pdf?da=1&id=539680&seq=0&mobile=no (Large survey covering many aspects of the future of training optimization.)
- Marcos Treviso, Tianchu Ji, Ji-Ung Lee, Betty van Aken, Qingqing Cao, Manuel R. Ciosici, Michael Hassid, Kenneth Heafield, Sara Hooker, Pedro H. Martins, Andre F. T. Martins, Peter Milder, Colin Raffel, Edwin Simpson, Noam Slonim, Niranjan Balasubramanian, Leon Derczynski, Roy Schwartz, Aug 2022, Efficient Methods for Natural Language Processing: A Survey. arxiv:2209.00099[cs], August 2022. http://arxiv.org/abs/2209.00099
- Epoch AI, 2024, How Much Does It Cost to Train Frontier AI Models? https://epochai.org/blog/how-much-does-it-cost-to-train-frontier-ai-models
- Ben Cottier, Robi Rahman, Loredana Fattorini, Nestor Maslej, David Owen, 31 May 2024, The rising costs of training frontier AI models, https://arxiv.org/abs/2405.21015
- Bender, Emily M.; Gebru, Timnit; McMillan-Major, Angelina; Shmitchell, Shmargaret (2021), On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?, FAccT ’21. Association for Computing Machinery. pp. 610–623.
- Apple, June 2024, Introducing Apple’s On-Device and Server Foundation Models, https://machinelearning.apple.com/research/introducing-apple-foundation-models
- Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, Mohammad Rastegari, 22 Apr 2024, OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework, Apple Research, https://arxiv.org/abs/2404.14619 Code: https://huggingface.co/apple/OpenELM