Aussie AI
15. Faster AI
Book Excerpt from "The Sweetest Lesson: Your Brain vs AI"
by David Spuler, Ph.D.
“You’re gonna need a bigger boat.”
— Jaws, 1975.
Faster AI
AI engines are already amazingly fast, but it’s not enough. LLMs are just not big enough, despite already being ginormous. How much computation is going to be needed for the next generation of even smarter models?
Lots more.
It’s not clear yet how far into the future we need to go to get the capacity that’s needed. Processing AI computations on GPUs is continually getting faster. NVIDIA is now on an annual cadence of GPU updates, where each generation of silicon beasts gets faster and faster. There are really two things that are getting faster:
- Faster training — baking braininess into brawny silicon chips.
- Faster inference — answering your queries and generating poetry.
The weird thing about the speed of AI engines is that there’s really only one bottleneck. It’s a complex architecture, but most of the compute grunt is consumed in one small part of the code. AI kernel engineers are forever fiddling with about 1,000 lines of code.
Endless Matrix Multiplications
You studied this in High School math class: matrix multiplications. Remember how you had a weird 2x2 square of numbers, and you had to multiply across a row of one and down a column of the other? AI is based on that type of matrix multiplication, and almost nothing else.
Who knew that High School math could be useful?
AI engineers like to make it sound more complicated than that. Our favorite word to use is “tensors” and we hope nobody realizes that it’s just matrices, only more than one. Our 2x2x2 tensor in 3D just means that we have two of the 2x2 matrices.
Hence, when you see the press releases about how great NVIDIA GPUs are with “Tensor Cores” you can roll your eyes, and think: I learned that stuff in High School.
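If you’d like to see it spelled out, here’s a minimal C++ sketch of that High School matrix multiplication. The numbers are invented, and the real kernels inside a GPU are vastly more elaborate and run in parallel, but the arithmetic is the same.

// matmul_naive.cpp -- a deliberately simple matrix multiplication sketch.
// Real AI kernels do the same arithmetic, just tiled, vectorized, and fused.
#include <cstdio>

int main() {
    const int N = 2;                        // 2x2, just like High School
    float a[N][N] = {{1, 2}, {3, 4}};
    float b[N][N] = {{5, 6}, {7, 8}};
    float c[N][N] = {};                     // result, starts at zero

    for (int i = 0; i < N; ++i)             // each row of A
        for (int j = 0; j < N; ++j)         // each column of B
            for (int k = 0; k < N; ++k)     // multiply across, add down
                c[i][j] += a[i][k] * b[k][j];

    printf("%g %g\n%g %g\n", c[0][0], c[0][1], c[1][0], c[1][1]);
    return 0;
}

A “tensor” version just wraps one more loop around this, to churn through a whole stack of these little matrices in a batch.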
500 Ways to Optimize
There are over 500 ways to optimize an AI model. Not an exaggeration — I literally made a list and counted them. I’m not going to force you to read it here, but it’s in the references. If you’re a new parent with trouble getting the little one to sleep, feel free to read it to them.
The main bottleneck in AI is the above matrix multiplications, and everything else is secondary. Although training and inference are very different algorithms in AI, they both come down to matrices, ahem, sorry, I meant to say tensors. There are a lot of ways to improve things, but the main categories are:
- Fewer matrix multiplications
- Smaller matrices
- Different types of arithmetic
To do fewer matrix multiplications, we’ve already talked about a big one: Mixture-of-Experts. For example, DeepSeek’s model has over 600B parameters in total, but only about 37B of them (a handful of “experts”) are activated each time it crunches out a new word, which means far fewer matrix multiplications.
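Here’s a toy sketch of how that routing works, assuming a simple top-k gate. The gate scores below are invented; in a real engine a small router network produces them for every new word, and only the winning experts get their matrices multiplied.

// moe_router.cpp -- toy Mixture-of-Experts routing: score all experts,
// run only the top-k. The gate scores here are made up for illustration.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    // Pretend the gating network produced these scores for 8 experts.
    std::vector<float> gate_scores = {0.02f, 0.31f, 0.05f, 0.22f,
                                      0.01f, 0.09f, 0.26f, 0.04f};
    const int k = 2;  // only 2 of the 8 experts run for this token

    // Rank expert indices by score, highest first.
    std::vector<int> order(gate_scores.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = (int)i;
    std::partial_sort(order.begin(), order.begin() + k, order.end(),
                      [&](int a, int b) { return gate_scores[a] > gate_scores[b]; });

    for (int i = 0; i < k; ++i)
        printf("Run expert %d (score %.2f)\n", order[i], gate_scores[order[i]]);
    // The other 6 experts (and their giant weight matrices) are skipped entirely.
    return 0;
}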
Another way is to use smaller matrices inside smaller models. There’s an optimization called “distillation” and I wonder how it got that name? The idea is that you train a huge model, called the “teacher” model. And then you get the teacher to talk to another smaller model, so as to train it, or to “distill” the big model’s answers into the smaller one. It’s called the “student” and I’m not making this stuff up.
Anyway, the smaller model is faster.
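For the curious, here’s a toy sketch of the distillation step itself, assuming the standard soft-label recipe: the student is nudged to match the teacher’s softened probabilities, not just the single right answer. The logits below are made up for illustration.

// distill_loss.cpp -- toy knowledge-distillation loss: cross-entropy between
// the teacher's softened probabilities and the student's. Logits are made up.
#include <cmath>
#include <cstdio>
#include <vector>

// Softmax with a temperature T; higher T gives softer probabilities.
std::vector<double> softmax(const std::vector<double>& logits, double T) {
    std::vector<double> p(logits.size());
    double sum = 0.0;
    for (size_t i = 0; i < logits.size(); ++i) { p[i] = std::exp(logits[i] / T); sum += p[i]; }
    for (double& x : p) x /= sum;
    return p;
}

int main() {
    // Pretend these are logits for 4 vocabulary words.
    std::vector<double> teacher = {3.0, 1.0, 0.2, -1.0};
    std::vector<double> student = {2.0, 1.5, 0.1, -0.5};
    const double T = 2.0;  // distillation temperature

    std::vector<double> pt = softmax(teacher, T);
    std::vector<double> ps = softmax(student, T);

    // Cross-entropy of the student against the teacher's soft targets.
    double loss = 0.0;
    for (size_t i = 0; i < pt.size(); ++i) loss -= pt[i] * std::log(ps[i]);

    printf("distillation loss = %.4f\n", loss);  // smaller means the student mimics the teacher better
    return 0;
}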
Another way to get a smaller model is, you know, just train a smaller model. Go get a refund on half your GPU chips, and just build a small model from the beginning, without needing to train the huge teacher model first. AI engineers like to argue about which approach is better. In short: smaller models are faster but less accurate.
In fact, you can train more than one small model for less cost than a big one, which should really be called a “multi-mini-model” architecture (triple-M!). Instead, it’s called “multi-LoRA” because AI researchers like quirky names. If you’ve got an iPhone with Apple Intelligence baked in, that’s the model architecture it’s using.
There are many ways to improve the arithmetic. The first trick is that someone noticed that matrix multiplication is really just two things:
1. Multiply stuff, and
2. Add it up.
Here’s an idea: do both things together. There are now CPU and GPU hardware instructions that do exactly that, and it’s called “Fused Multiply-Add” or FMA. Hence, if your code has both a multiplication and an addition operation, then you’re behind the times.
Fuse them together!
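In C++ the fused operation is available as std::fma, and a dot product (which is the inner loop of every matrix multiplication) is basically a chain of them. A tiny sketch, with made-up numbers:

// fma_dot.cpp -- a dot product written as fused multiply-adds.
// std::fma(a, b, c) computes a*b + c in one step (one rounding, and on
// modern chips, one hardware instruction).
#include <cmath>
#include <cstdio>

int main() {
    float x[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float w[4] = {0.5f, -1.0f, 0.25f, 2.0f};

    float acc = 0.0f;
    for (int i = 0; i < 4; ++i)
        acc = std::fma(x[i], w[i], acc);   // multiply and add, fused

    printf("dot product = %g\n", acc);     // 0.5 - 2 + 0.75 + 8 = 7.25
    return 0;
}

Compilers can often do this fusion for you anyway, but the point stands: one instruction instead of two.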
Another way is to use smaller data. For example, there’s a fancy technique called “quantization” that really just means using smaller numbers. The latest Blackwell GPU chips are great at “FP4” processing in all those Tensor Cores. What this means is using data that is 4-bit floating-point (FP4) instead of 32-bit floating-point (FP32) for all the numbers we’re using inside the matrix multiplications. This drop from 32-bit to 4-bit is eight times less computation, according to my calculations.
Hence, eight times faster AI engines.
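Here’s a toy sketch of the quantization idea. Real FP4 is a miniature floating-point format, so this sketch uses the simpler integer flavor instead: squash each 32-bit weight into one of 16 levels plus one shared scale factor (the weights below are invented).

// quantize4.cpp -- toy 4-bit symmetric quantization: each 32-bit weight is
// rounded to one of 16 integer levels (-8..7) plus one shared scale factor.
// Real FP4 is a tiny floating-point format, but the storage saving is the same.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> weights = {0.82f, -0.13f, 0.004f, -0.55f, 0.31f, -0.97f};

    // One scale for the whole group: map the largest magnitude to level 7.
    float max_abs = 0.0f;
    for (float w : weights) max_abs = std::max(max_abs, std::fabs(w));
    float scale = max_abs / 7.0f;

    for (float w : weights) {
        int q = (int)std::lround(w / scale);        // quantize: 4-bit level
        q = std::max(-8, std::min(7, q));           // clamp to the 16 levels
        float back = q * scale;                     // dequantize for the matmul
        printf("%+.3f -> level %+d -> %+.3f\n", w, q, back);
    }
    // 4 bits instead of 32 bits per weight: one eighth of the memory traffic.
    return 0;
}

The answers come out slightly wrong, which is the whole trade-off: a little accuracy for a lot of speed.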
Another way to make matrices smaller is to “prune” their numbers. I mean, if your model has a billion numbers, maybe you don’t really need all of them. Unfortunately, since AI engineers don’t really understand what all of the numbers do, we also don’t really understand what happens if we throw some away. The early research papers on this technique, from around 1990, were about “Optimal Brain Damage” and it’s an inside joke that’s not really a joke.
The idea is that if a number is 0.0001, then just pretend it’s actually zero. If the weight of a signal is that small, it might as well be zero, so just throw it away. The simplest way is just to scan all the numbers through the whole model for small ones to discard, which is called “magnitude pruning.”
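A sketch of magnitude pruning in its simplest possible form, with a made-up handful of weights and an arbitrary threshold:

// magnitude_prune.cpp -- toy magnitude pruning: any weight whose absolute
// value is below a threshold is treated as zero and thrown away.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> weights = {0.0001f, -0.52f, 0.0008f, 0.91f,
                                  -0.0002f, 0.33f, 0.0004f, -0.0009f};
    const float threshold = 0.001f;

    int zeroed = 0;
    for (float& w : weights) {
        if (std::fabs(w) < threshold) { w = 0.0f; ++zeroed; }
    }

    printf("Pruned %d of %zu weights (%.0f%% now zero)\n",
           zeroed, weights.size(), 100.0 * zeroed / weights.size());
    return 0;
}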
The other way is to look at numbers in particular parts of the model, which is called “structured pruning.” It’s like throwing away a whole lobe or a whole cortex. There are literally four ways to prune the different model structures, along the different model dimensions:
- Length — input token pruning (i.e., throw away words)
- Depth — layer pruning (“early exit”; sketched just after this list)
- Width — attention “head” pruning (“sparse attention”)
- Model dimension — embeddings pruning (“activation sparsity”)
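Of those four, depth pruning (“early exit”) has the easiest control flow to sketch: run the layers one by one, and if the model already looks confident enough, skip the rest. The layers and confidence scores below are faked, purely to show the shape of the loop.

// early_exit.cpp -- toy depth pruning ("early exit"): stop running layers
// once the model is confident enough. Layers and confidences are faked here.
#include <cstdio>

// Pretend confidence score after each layer (a real engine would compute
// this from the intermediate output, e.g. the top softmax probability).
float fake_confidence_after_layer(int layer) {
    return 0.30f + 0.09f * layer;   // confidence creeps up layer by layer
}

int main() {
    const int num_layers = 32;
    const float exit_threshold = 0.85f;

    int layers_run = 0;
    for (int layer = 0; layer < num_layers; ++layer) {
        ++layers_run;                      // ... run the real layer here ...
        float conf = fake_confidence_after_layer(layer);
        if (conf >= exit_threshold) {
            printf("Early exit after layer %d (confidence %.2f)\n", layer + 1, conf);
            break;
        }
    }
    printf("Ran %d of %d layers\n", layers_run, num_layers);
    return 0;
}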
Any way you cut it (I mean, prune it), there are fewer weights and less computation. Hence, you prune and prune until there are more zeros than non-zeros, and then it’s called “sparsity.” And there are so many ways to do “sparse matrix multiplications” inside a GPU that AI engineers call it “sparse MatMul” or “spGEMM” or other names you can’t print.
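For a taste of it, here’s a sparse matrix-vector multiply in the common CSR (compressed sparse row) format: only the non-zero weights are stored, and only they get multiplied (the matrix and vector values are invented).

// sparse_matvec.cpp -- sparse matrix-vector multiply in CSR format:
// only non-zero weights are stored and multiplied.
#include <cstdio>
#include <vector>

int main() {
    // A 3x4 matrix with mostly zeros:
    //   [ 2 0 0 1 ]
    //   [ 0 0 3 0 ]
    //   [ 0 4 0 0 ]
    std::vector<float> vals    = {2, 1, 3, 4};      // the non-zeros only
    std::vector<int>   cols    = {0, 3, 2, 1};      // their column indices
    std::vector<int>   row_ptr = {0, 2, 3, 4};      // where each row starts in vals

    std::vector<float> x = {1, 2, 3, 4};            // input vector
    std::vector<float> y(3, 0.0f);                  // output vector

    for (int row = 0; row < 3; ++row)
        for (int idx = row_ptr[row]; idx < row_ptr[row + 1]; ++idx)
            y[row] += vals[idx] * x[cols[idx]];     // skip all the zeros

    printf("y = [%g, %g, %g]\n", y[0], y[1], y[2]); // [6, 9, 8]
    return 0;
}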
Anyway, fast.
Training
Inference takes milliseconds to answer a query. Training a big model can take weeks or months. There are faster GPUs and software algorithms that have sped up training since the olden days (last year), but models are getting bigger, too.
When training a big AI model, the most important thing you need is blankets. All of the AI engineers on the training team have to sleep in the data center for days or weeks. The facility is air-conditioned and liquid-cooled down as low as you can get, which isn’t actually that low when you have 100,000 overclocked GPUs running. The training engineers aren’t really doing much except watching the monitoring dashboard and waiting for the training phase to go: splat!
I’m only half kidding.
Training really does fail, and quite often, which is why you use “checkpoints” for training updates (it’s a fancy word for backups). The idea is that you checkpoint often, and when it all falls over in a crying heap, then you restart from the last checkpoint. When you realize that it costs tens of millions of dollars to run training on a big cluster, you can see why it’s important to have frequent backups.
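The checkpointing logic itself is about as glamorous as it sounds. Here’s a toy sketch, with the “model” shrunk down to a few floats written to a local file; a real checkpoint is terabytes of weights and optimizer state sharded across a cluster, and the file name and step counts here are invented.

// checkpoint.cpp -- toy training-loop checkpointing: save the "model"
// every few steps so a crash only loses the work since the last save.
// The model here is just a few floats in a local file.
#include <cstdio>
#include <vector>

const char* kCheckpointFile = "checkpoint.bin";

bool save_checkpoint(int step, const std::vector<float>& weights) {
    FILE* f = fopen(kCheckpointFile, "wb");
    if (!f) return false;
    fwrite(&step, sizeof(step), 1, f);
    fwrite(weights.data(), sizeof(float), weights.size(), f);
    fclose(f);
    return true;
}

bool load_checkpoint(int& step, std::vector<float>& weights) {
    FILE* f = fopen(kCheckpointFile, "rb");
    if (!f) return false;   // no checkpoint yet: start from scratch
    bool ok = fread(&step, sizeof(step), 1, f) == 1;
    ok = ok && fread(weights.data(), sizeof(float), weights.size(), f) == weights.size();
    fclose(f);
    return ok;
}

int main() {
    std::vector<float> weights(4, 0.0f);   // the entire "model"
    int step = 0;
    if (load_checkpoint(step, weights))
        printf("Resuming from step %d\n", step);

    const int total_steps = 100, save_every = 10;
    for (; step < total_steps; ++step) {
        weights[step % 4] += 0.01f;        // pretend this is a training update
        if (step % save_every == 0)
            save_checkpoint(step, weights);
    }
    save_checkpoint(step, weights);        // final save
    printf("Finished at step %d\n", step);
    return 0;
}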
Why does training fail?
Ah, yes, there are plenty of research papers on that. There are issues like vanishing gradients, exploding gradients, training instabilities due to outliers, and all sorts of other obscure statistical reasons, which are boring, so let’s skip those. More interesting reasons include:
1. Hardware bugs
2. Software bugs
3. Real bugs
4. Space bugs
Hardware bugs are surprisingly common. Well, no, actually not common, but when you have 100,000 GPUs, then it’s not uncommon. GPUs overheat and burn out. Old GPUs are the worst, especially if you’ve been a cheapskate and bought used GPUs from a retired Bitcoin miner. Weirdly, brand new GPUs can fail, too, due to an occasional fault in the chip manufacturing process. The error rate is very low, but multiply that by 100,000. Middle-aged GPUs are the sweet spot, so you ideally want to run in your new GPUs with some gentle warm-ups (like a new Porsche). Actually, training algorithms are now good enough to detect a failing GPU and isolate it, but it’s not always 100% accurate, so that’s when the sleeping engineers need waking up.
Software bugs are errors in the code written by the C++ programmers who built the training software for the GPUs. Of course, that never happens, since good coders never misplace a semicolon, except for that one time with Windows Vista almost 20 years ago. Furthermore, we can always blame the hardware engineers or the network engineers. Anyway, let’s move on.
Real bugs are not very common these days. If a roach or a small lizard gets into the data center, and ends up next to a GPU, it’ll get baked into the training process, literally. Thankfully, most data centers are locked-down so tight to keep out the dust that not many insects or reptiles make an impact.
Space bugs are the funniest thing! Here is where you, Dear Reader, are 100% convinced that it’s a joke, and yet, it’s really not.
Cosmic rays are high-energy particles that come in from space. Normally, nobody notices them because they rarely interact with matter, and even if they do, it’s such a minor effect that even a human body cell will ignore it and move on with life. However, GPUs have transistors that are etched into silicon at a scale of a few nanometers, which could be two, but let’s say five. In comparison, a human cell is at least 10,000 and maybe 100,000 nanometers in size.
Once in a while, a GPU gets hit by a cosmic ray. And once in a while, it goes near a tiny transistor embedded in the silicon. And once in a while, it hits a transistor. And once in a while, it hits at the right time.
A bit gets toggled.
It’s a very rare event. But multiply the odds by 100,000 GPUs. And then multiply again by a process that takes days or weeks.
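If you want to see how much damage one toggled bit can do, here’s a tiny C++ demonstration: take an ordinary 32-bit weight, flip a single exponent bit, and watch the value explode (the bit position is chosen for dramatic effect).

// bitflip.cpp -- what one cosmic-ray bit flip can do to a 32-bit weight:
// toggle a single bit and see how far the value moves.
#include <cstdio>
#include <cstdint>
#include <cstring>

int main() {
    float weight = 0.75f;

    uint32_t bits;
    std::memcpy(&bits, &weight, sizeof(bits));   // view the float's raw bits
    bits ^= (1u << 30);                          // toggle one high exponent bit

    float corrupted;
    std::memcpy(&corrupted, &bits, sizeof(corrupted));

    printf("before: %g\nafter:  %g\n", weight, corrupted);
    // One flipped bit and a polite 0.75 becomes an enormous number.
    return 0;
}

This is one reason serious data-center GPUs ship with error-correcting (ECC) memory, which can catch and fix most single-bit flips.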
Don’t Forget the Network
Programmers tend to avoid network engineers, because they’re covered with dust from crawling under the raised floor to fix cables all the time. But AI engines need networking with the highest bandwidth to connect those clusters of 100,000 GPUs that do all the work. The reason for networking depends on the phase:
- Training — send the training data out, get the weight updates back.
- Inference — send the user queries out, share some caches, get homework essays back.
Yawn! It’s true that NVIDIA has a bunch of networking products like NVLink and stuff, with coverage of InfiniBand, Ethernet, BlueField, and more. However, the company only makes about $13 billion US dollars in annual revenue from networking products, so nobody pays any attention to the networking area.
I mean, peanuts!
Pity the poor SVP of Networking at NVIDIA. They run a business bigger than most companies on the stock market, but never get a CNBC interview.
The Question
Here’s the real question about AI technology: is it fast enough? Short answer: No.
We’re getting better at running big models at a faster clip, but these big models aren’t really that great. I mean, they’re amazing in some areas, but, come on, so many dumb things are still coming out of their mouths.
The whole industry is going to need a bigger model, or, more likely, a ton of much bigger models, each of them tweaked to run in particular contexts (e.g., doctors, lawyers, marketing managers, etc.). We’re not close to that.
In order to fix all of the areas where AI is still faulty, we need to train it with lots more data, and make the models much bigger. Hence, here’s the real question to ask:
How much faster is needed?
References
Articles and papers on the many ways to optimize AI engines:
- David Spuler, September 2, 2024, 500+ LLM Inference Optimization Techniques, Aussie AI Blog, https://www.aussieai.com/blog/llm-inference-optimization
- Babak Hassibi and David Stork, 1992, Second order derivatives for network pruning: Optimal brain surgeon, Advances in Neural Information Processing Systems (NeurIPS 5), https://proceedings.neurips.cc/paper/1992/file/303ed4c69846ab36c2904d3ba8573050-Paper.pdf
- Yann LeCun, John Denker, and Sara Solla, 1989, Optimal brain damage, in D. Touretzky (ed.), Advances in Neural Information Processing Systems, volume 2, Morgan Kaufmann, https://proceedings.neurips.cc/paper_files/paper/1989/file/6c9882bbac1c7093bd25041881277658-Paper.pdf
- Diyuan Wu, Ionut-Vlad Modoranu, Mher Safaryan, Denis Kuznedelev, and Dan Alistarh, August 30, 2024, The Iterative Optimal Brain Surgeon: Faster Sparse Recovery by Leveraging Second-Order Information, https://arxiv.org/abs/2408.17163
- Stock Dividend Screener, March 8, 2025, Nvidia Data Center Revenue Breakdown: Compute & Networking, https://stockdividendscreener.com/technology/semiconductor/nvidia/data-center-revenue/#networking