
Chapter 22. AI Research Overview

  • Book Excerpt from "Generative AI Applications: Planning, Design and Implementation"
  • by David Spuler


The Three S’s of AI Research

There are three main types of AI research, which I’ve taken to calling “The Three S’s” when I categorize them. The areas are:

  • Smartness
  • Speed
  • Safety

I’m really a fan of speed. That’s our main area of research at Aussie AI, with expertise and a few patents filed in low-level kernel optimizations, on-device inference, and accelerator add-on components. So, this might be a long chapter if I let myself go.

Oh, wait! We’ve already written a whole book on AI speedups, which is titled Generative AI in C++: Coding Transformers and LLMs. It’s all about how to code up Transformer internals, and the many types of kernel optimizations to use for the components (e.g., KV caching, kernel fusion, memory efficiency, etc.). But that book was written in March 2024, and there are about five new types of KV caching in the research papers, so there are some parts that need updating. I’ll try to be brief.

Smartness Research

The overall goal of AI researchers is, you know, artificial intelligence. Since we’ve got the “artificial” part well covered, there’s a ton of research on “intelligence” and I call it “smartness” research, just to fit in with my alliterative fun. There are many subareas of smartness research, such as:

  • Artificial General Intelligence (AGI)
  • Artificial Super-Intelligence (ASI)
  • Use cases
  • Prompt engineering
  • Reasoning
  • Mathematics

The above research is mostly about making super-smart AI models, no matter what the cost in electrons, and reducing this expense is delegated to other AI researchers. Some of the hotter areas in AI “smartness” research include:

  • Trillion-parameter models
  • Mixture-of-experts (MoE) and other multi-model “ensemble” architectures
  • Multimodal “omni” models
  • Multi-step reasoning algorithms (e.g., Chain-of-Thought)
  • Small Language Model (SLM) capability improvements
  • Multi-agent architectures (“agentic architectures”)

There’s lots of research happening, some of which appears in papers, and the rest is hidden away behind closed doors. The capabilities of models are astounding and still increasing, but we’re not that close to AGI yet.

Multi-step reasoning. The hottest area at the moment is the use of multiple inference steps to improve the overall reasoning of an LLM. This has received a surge of interest since OpenAI released its “o1” model, which was code-named “strawberry” in reference to a well-known reasoning problem: LLMs could not correctly count the number of occurrences of the letter “r” in “strawberry” (they would say two rather than three). This model used the “Chain-of-Thought” (CoT) method of multiple steps of inference to improve its results.

This is a fundamental change of focus. Until this point, the main way to make an LLM smarter was to give it more parameters and better training data. This was based on the “scaling laws,” which said that AI gets smarter as you scale up the parameters. However, this has been somewhat superseded by the “inference scaling laws,” which say that an LLM can be smarter if you give it more time to run better inference analysis in multiple steps.

There are many subtypes of this multi-step inference approach to reasoning. Chain-of-Thought is obviously getting the most attention because of its use by OpenAI. Hence, the subfields of AI reasoning research include (a sketch of one of the simplest approaches, Best-of-N, follows the list):

  • Chain-of-Thought (CoT)
  • Self-reflection
  • LLM as Judge
  • Tree of Thoughts (ToT)
  • Best-of-N (BoN)
  • Skeleton of Thoughts (SoT)
  • Graph of Thoughts (GoT)
  • Agentic architectures
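
As promised, here is a minimal Best-of-N sketch in C++. The llm_generate() and score_answer() functions below are trivial placeholders (not real APIs) standing in for whatever inference and verifier calls your platform provides; the point is simply that spending more inference calls on the same question, and keeping the best-scoring answer, is one of the easiest ways to trade compute for smartness.

    // Best-of-N sketch: sample N candidate answers for the same question,
    // score each one with a verifier, and return the best-scoring candidate.
    // The "LLM" and "verifier" below are placeholders so the sketch compiles;
    // in practice they would be calls into your inference platform.
    #include <iostream>
    #include <string>

    // Hypothetical stand-in for an LLM inference call (not a real API).
    std::string llm_generate(const std::string& prompt, int sample_id) {
        return "candidate answer #" + std::to_string(sample_id) + " for: " + prompt;
    }

    // Hypothetical stand-in for a verifier or reward model (not a real API).
    double score_answer(const std::string& answer) {
        return static_cast<double>(answer.size());  // placeholder scoring rule
    }

    std::string best_of_n(const std::string& question, int n) {
        std::string best;
        double best_score = -1.0;
        for (int i = 0; i < n; ++i) {
            // Each extra sample spends more inference compute ("inference scaling").
            std::string candidate = llm_generate("Think step by step: " + question, i);
            double score = score_answer(candidate);
            if (score > best_score) {
                best_score = score;
                best = candidate;
            }
        }
        return best;
    }

    int main() {
        std::cout << best_of_n("How many r's are in strawberry?", 5) << "\n";
        return 0;
    }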

There is also a lot of crossover between these ideas and prompt engineering. The two areas are mostly orthogonal, so there is a combinatorial explosion of options when you consider that all of the above multi-step algorithms can also use different prompting optimizations at every step.

On the other hand, some recent research put out by Apple has cast doubt on whether LLMs are doing any type of reasoning at all. Their paper asserts that most of the results of LLMs are due to memorization and pattern matching, rather than any generalized reasoning analysis.

Personally, I think it’s another case of the “bitter lesson” whereby human researchers always think that advancements must come by coding up human-like reasoning algorithms, but the best solution for computers is often simply brute-force computations. I guess time will tell who’s right!

References on Reasoning

Below are a number of relevant research papers, and the full list is available at https://www.aussieai.com/research/reasoning.

  1. Jason Wei and Denny Zhou, May 11, 2022, Language Models Perform Reasoning via Chain of Thought, https://research.google/blog/language-models-perform-reasoning-via-chain-of-thought/
  2. Cameron R. Wolfe, Jul 24, 2023, Chain of Thought Prompting for LLMs: A practical and simple approach for “reasoning” with LLMs, https://towardsdatascience.com/chain-of-thought-prompting-for-llms-33c963eead38
  3. Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, Karthik Narasimhan, 3 Dec 2023 (v2), Tree of Thoughts: Deliberate Problem Solving with Large Language Models, https://arxiv.org/abs/2305.10601 Code: https://github.com/princeton-nlp/tree-of-thought-llm
  4. Cameron R. Wolfe, Aug 21, 2023, Tree of Thoughts Prompting. Solving multi-step problems with LLMs via deliberate planning and exploration, https://cameronrwolfe.substack.com/p/tree-of-thoughts-prompting
  5. M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, M. Podstawski, L. Gianinazzi, J. Gajda, T. Lehmann, H. Niewiadomski, P. Nyczyk et al., 2024, Graph of thoughts: Solving elaborate problems with large language models, Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 16, pp. 17682–17690. https://arxiv.org/abs/2308.09687
  6. Cameron R. Wolfe, Jan 3, 2024, Graph-Based Prompting and Reasoning with Language Models. Understanding graph of thoughts prompting and several variants, https://towardsdatascience.com/graph-based-prompting-and-reasoning-with-language-models-d6acbcd6b3d8
  7. Xuefei Ning, Zinan Lin, November 17, 2023, Skeleton-of-Thought: Parallel decoding speeds up and improves LLM output, Microsoft Research Blog, https://www.microsoft.com/en-us/research/blog/skeleton-of-thought-parallel-decoding-speeds-up-and-improves-llm-output/ Code: https://github.com/imagination-research/sot/
  8. Cogni Down Under, Sep 2024, Reflection 70B: The AI That Thinks Before It Speaks, https://medium.com/@cognidownunder/reflection-70b-the-ai-that-thinks-before-it-speaks-8a70d3a0e38a
  9. Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, Vinija Jain, Samrat Mondal, Aman Chadha, 5 Feb 2024, A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications, https://arxiv.org/abs/2402.07927
  10. Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Ion Stoica, Matei Zaharia, James Zou, 4 Jun 2024 (v2), Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems, https://arxiv.org/abs/2403.02419
  11. Zehui Chen, Kuikun Liu, Qiuchen Wang, Jiangning Liu, Wenwei Zhang, Kai Chen, Feng Zhao, 29 Jul 2024, MindSearch: Mimicking Human Minds Elicits Deep AI Searcher, https://arxiv.org/abs/2407.20183 Code: https://github.com/InternLM/MindSearch Project: https://mindsearch.netlify.app
  12. Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, Azalia Mirhoseini, 31 Jul 2024, Large Language Monkeys: Scaling Inference Compute with Repeated Sampling, https://arxiv.org/abs/2407.21787 (Generating multiple answers by repeated inference queries, and then using a verifier to choose the best one, which is shown to greatly increase overall accuracy.)
  13. Justin Chih-Yao Chen, Archiki Prasad, Swarnadeep Saha, Elias Stengel-Eskin, Mohit Bansal, 18 Sep 2024, MAgICoRe: Multi-Agent, Iterative, Coarse-to-Fine Refinement for Reasoning, https://arxiv.org/abs/2409.12147 https://github.com/dinobby/MAgICoRe
  14. Xiaohan Xu, Chongyang Tao, Tao Shen, Can Xu, Hongbo Xu, Guodong Long, Jian-guang Lou, 29 Feb 2024 (v2), Re-Reading Improves Reasoning in Large Language Models, https://arxiv.org/abs/2309.06275

Safety Research

There’s a lot of research about making AIs safer. Well, actually, there’s a lot more research on smartness and speed than there is on safety (about 2% of all AI papers), but with 250,000 research papers published per year on AI, there are still plenty of safety papers to fill your weekends.

We’ve covered a whole host of safety topics in Chapter 10, and every bullet point in that chapter is a whole research field in itself. Some of the main ones are:

  • Hallucinations
  • Bias
  • Fairness
  • Explainability
  • Adversarial attacks

Apple Intelligence. In addition to speed, safety has also been a priority in the training of Apple’s AI models for both on-device and cloud-based inference. Here are some of the approaches they used:

  • Data preparation (of training data)
  • Filtering profanity
  • Filtering personal details — e.g., credit card numbers.
  • Refusal-specific training
  • Model evaluation on safety benchmarks

I must admit I’m not fully versed in AI safety research. I’m more focused on speed, especially advanced software kernels and on-device inference.

Speed Research

AI research tends to use the word “performance” to mean “intelligence” or “smartness” in the vernacular. Hence, you have to look for more specific words like “efficiency,” “latency,” or “throughput” to find all the speed papers. Even the word “optimization” is ambiguous: it can mean optimizing accuracy, and it appears in the titles of both types of papers.

The basic categorization of speed research papers goes like this:

  • Hardware acceleration
  • Software acceleration
  • Some combination thereof

Personally, as a software engineer, I tend to skim over all the hardware papers, because they’re in the “too hard basket” for me. But hardware systems and silicon chips are where the greatest speed advances have been made in the past, and that will likely continue for the foreseeable future.

The best software acceleration algorithms for AI engines and models tend to be in the range of ten-fold improvement in speed, whereas hardware speedup is hundred-fold and above. And there are plenty of software papers that give improvements of ten or twenty percent, which is obviously still valuable when you consider the cost of running AI platforms, but it’s not the kind of revolution possible with hardware advances.

SOTA Speed Research

What is the state-of-the-art for speedup optimizations in AI? Mainly I’m going to focus on inference, although there are speedups for training and fine-tuning as well. There are two main areas:

  • Data center inference (i.e., lots of GPUs)
  • On-device inference (GPU-free phones and PCs)

In terms of data center optimizations, the main focus is exploiting all that parallel capacity available from multiple GPUs. Some of the newer multi-GPU optimizations include:

  • Multi-GPU scheduling with preemption
  • Serving optimizations
  • Batching optimizations
  • Disaggregating prefill and decoding phases
  • Offloading

There are several software improvements that can be used in both on-device and data center inference. Reducing memory usage or the number of computations helps on any inference platform. Some of the software optimizations that have garnered traction in both open-source and commercial platforms include:

  • 4-bit quantization (of weights, activations, and/or KV cache)
  • Grouped-Query Attention (GQA) — beyond Multi-Query Attention (MQA)
  • KV caching in autoregressive decoding (this is basic table stakes now)
  • Flash attention
  • Flash decoding
  • Paged attention
  • Paged Flash attention (combined)
  • Prefix KV caching (session-based or generic)
  • Chunked prefills
  • Multi-LoRA
  • Continuous batching

There are several software optimizations that are specific to parallelization, and thus more beneficial for GPU-based data centers (although NPUs are increasingly making this comment incorrect!). Examples include:

  • Kernel optimizations
  • Speculative decoding
  • Prompt lookup decoding
  • Distributed tensor parallelism

Commercial AI Platform Speedups

But what are the big companies using? Well, it’s hard to say, because the big US companies have gone quiet and aren’t putting out many research papers. Most of the best papers now are coming out of China. Maybe I should look in the US companies’ patent filings, but there’s a big lag time between filing a patent and its public availability. Nevertheless, here are some examples.

Character.AI platform: This is the company that does AI companions online, co-founded by prominent AI researcher Noam Shazeer, so it should be using some advanced stuff. They recently put out a research blog article describing their data center platform, which is obviously GPU-based. Apparently, they’re doing 20,000 queries per second, which is astounding. The techniques that they mentioned included:

  • GPUs
  • INT8 inference quantization (for weights, attention, and KV cache data)
  • INT8 quantized training
  • Multi-Query Attention (MQA)
  • Hybrid Local/Global Attention — interleaving layers of local attention and global attention.
  • Cross-layer KV sharing — a type of “layer fusion” in the KV cache data.
  • “Stateful caching” — session-based caching.
  • Session-based multi-turn KV caching (prefix KV caching)
  • “Sticky sessions” — each network session returns to the same box, so its prefix KV cache data is there.

According to the blog post, their estimate is a 13x cost reduction by using these techniques versus a more naive commercial platform, and a 30-fold efficiency improvement since inception.

Apple Intelligence: At the other end of the spectrum, Apple recently announced details of their on-device capabilities, planned for late 2024 and early 2025. Their main methods include:

  • M-series hardware (i.e., NPUs)
  • Small language models — a 3B on-device foundation model.
  • Multiple LoRA adapters (fine-tuning of small models)

Apple’s strategy here is to use a small-ish model for the foundation model, with a 3B on-device model and a “larger” server-based model for cloud execution. Interestingly, their other main on-device strategy is to use fine-tuned versions of this small on-device model, but doing so via multiple LoRA adapters.

LoRA adapters are a way to do fine-tuning by adding a small number of extra parameters, while leaving the main foundation model’s parameters “frozen” (unchanged during fine-tuning). This is a type of Parameter-Efficient Fine-Tuning (PEFT).

LoRA adapters are advantageous in a few ways for on-device inference. Apple mentions using different LoRA adapters per use case, and also a size in the “tens of millions of parameters.” Hence, the LoRA adapters are much smaller than the 3B model, making them easier to switch in and out of memory, while still offering more specialized models for different activities. Apple calls this “on-the-fly specialization” of the foundation model.
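
To see why adapter swapping is cheap, here is a minimal sketch of the LoRA forward pass for a single linear layer: y = W*x + scale*B*(A*x), where W is the frozen base weight matrix and A and B are the small low-rank adapter matrices. The naming and dimensions here are illustrative only, not Apple’s actual configuration.

    // LoRA forward pass for one linear layer: y = W*x + scale * B*(A*x).
    // W is the frozen base weight matrix; A and B are the small adapter
    // matrices, so swapping use cases means swapping only A and B.
    // Dimensions are illustrative, not any particular model's.
    #include <cstddef>
    #include <vector>

    using Vec = std::vector<float>;
    using Mat = std::vector<std::vector<float>>;  // row-major: Mat[row][col]

    Vec matvec(const Mat& m, const Vec& x) {
        Vec y(m.size(), 0.0f);
        for (std::size_t i = 0; i < m.size(); ++i)
            for (std::size_t j = 0; j < x.size(); ++j)
                y[i] += m[i][j] * x[j];
        return y;
    }

    Vec lora_forward(const Mat& W,      // d_out x d_in (frozen base weights)
                     const Mat& A,      // r x d_in     (adapter, rank r)
                     const Mat& B,      // d_out x r    (adapter, rank r)
                     float scale,       // typically alpha / r
                     const Vec& x) {    // input vector, length d_in
        Vec base = matvec(W, x);        // frozen path: unchanged by fine-tuning
        Vec down = matvec(A, x);        // project down to rank r
        Vec up   = matvec(B, down);     // project back up to d_out
        for (std::size_t i = 0; i < base.size(); ++i)
            base[i] += scale * up[i];   // add the adapter's contribution
        return base;
    }

Because the rank r is tiny compared to the layer dimensions, the adapter matrices hold only a small fraction of the parameters of W, which is what makes per-use-case swapping practical.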

This LoRA approach is better than having multiple fine-tuned versions of the 3B foundation model and trying to swap gigabytes of weights in and out of memory. Instead, just load the 3B model once and leave it in memory permanently, while swapping the much smaller adapters. This may also simplify the process of providing software updates or fixes to the models over the internet.

Some details of the LoRA adapters are also given. They are applied to the attention matrices, the attention projection matrix, and the fully connected layers. The numeric precision is a little unclear, since the document mentions 2-bit to 4-bit configurations, but also 16-bit representations of the LoRA weights.

The size of the models is the main factor for faster on-device inference, but some additional software acceleration techniques are also used. For inference speedup, in addition to LoRA adapters, these techniques are mentioned:

  • Grouped-query attention (memory-efficient)
  • Shared embedding/unembedding tensors
  • Smaller on-device vocabulary (49k local versus 100k for server-based)
  • LoRA adapters (see above)
  • Activation quantization
  • Embedding quantization
  • Efficient KV cache update method for neural engines (details undisclosed!)

Apple has not disclosed the details of its KV cache methods as yet. I wonder whether it is based on session-based prefix KV caching, which would make sense for on-device inference, since every device effectively has only one session.

The only downside to all of this is that most earlier iPhone models won’t have the hardware to run these features. For iPhone 15 and beyond, Apple reports latency measurements of 0.6 ms per prompt token (i.e., prefill) and 30 tokens per second of decoding.

Training efficiency. Although not directly relevant to users, Apple has also detailed some of its training improvements in efficiency and safety. For its training capabilities, Apple mentions:

  • TPUs and on-premise GPUs
  • Data parallelism
  • Tensor parallelism
  • Pipeline parallelism
  • Fully Sharded Data Parallel (FSDP)

Model Compression

Model compression is where you make the LLM smaller. Using a smaller LLM is a simple way to go faster because it reduces both memory usage and the total number of computations needed to run inference. There are several types of “model compression” you can consider:

  • Small Language Models (SLMs)
  • Quantization
  • Pruning

Quantization is commonplace, and only the most stringent accuracy requirements would have you eschew it. Typical quantization sizes are 16-bit integer or floating point (INT16 or FP16) and 8-bit integer (INT8), and even 4-bit integer quantization (INT4) is commonly used as a good speed-versus-accuracy trade-off. Note that 32-bit floating point is the basic non-quantized size, so 16-bit is half-size, 8-bit is a quarter, and 4-bit is an eighth. For more about quantization research, see https://www.aussieai.com/research/quantization.
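
As a concrete example, here is a minimal sketch of symmetric per-tensor INT8 weight quantization; real quantizers usually work per-channel or per-group and handle outliers more carefully, but the core round-and-scale idea is the same.

    // Symmetric per-tensor INT8 quantization: map floats in [-max, +max]
    // to integers in [-127, +127], keeping one float scale for dequantization.
    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct QuantizedTensor {
        std::vector<int8_t> data;
        float scale;   // dequantized value = data[i] * scale
    };

    QuantizedTensor quantize_int8(const std::vector<float>& weights) {
        float max_abs = 0.0f;
        for (float w : weights) max_abs = std::max(max_abs, std::fabs(w));
        QuantizedTensor q;
        q.scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
        q.data.reserve(weights.size());
        for (float w : weights) {
            long v = std::lround(w / q.scale);
            if (v > 127) v = 127;       // clamp to the INT8 range
            if (v < -127) v = -127;
            q.data.push_back(static_cast<int8_t>(v));
        }
        return q;
    }

    float dequantize_at(const QuantizedTensor& q, std::size_t i) {
        return q.data[i] * q.scale;     // approximate original weight
    }

Relative to 32-bit floats, INT8 stores one byte per weight instead of four, and INT4 packs two weights per byte.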

Pruning is where you discard small weights by making them zero. You can do “unstructured pruning” where small weights are zeroed no matter where they are. This is related to “sparsity” where there are mostly zeros, but they are not quite the same thing.
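
Here is a minimal sketch of unstructured magnitude pruning: zero out every weight whose absolute value falls below a chosen threshold, and report the resulting sparsity. Note that the zeros only translate into speed if the kernel or hardware can exploit the sparsity.

    // Unstructured magnitude pruning: zero every weight whose magnitude is
    // below the threshold. Returns the fraction of weights zeroed (sparsity).
    #include <cmath>
    #include <cstddef>
    #include <vector>

    double magnitude_prune(std::vector<float>& weights, float threshold) {
        std::size_t zeroed = 0;
        for (float& w : weights) {
            if (std::fabs(w) < threshold) {
                w = 0.0f;       // "discard" the small weight
                ++zeroed;
            }
        }
        return weights.empty() ? 0.0 : static_cast<double>(zeroed) / weights.size();
    }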

Alternatively, you can do “structured pruning” where you cut out only big structures. There are literally four dimensions on which you can prune structures, which is usually done dynamically:

  • Depth — layers of the model (e.g., early exit, layer pruning, layer skipping).
  • Width — the neurons across a layer (e.g., attention head pruning, filter pruning, channel pruning).
  • Length — the input token sequence (e.g., input prompt pruning, token pruning, token merging).
  • Internal dimension — embeddings vector pruning methods.

However, pruning is not as commonplace as quantization, and many of the above methods are still mostly in research papers, rather than widespread industry practice. For example, you’ll note that both Character.AI and Apple Intelligence mention quantization, but neither mentions pruning. For more research on model pruning, see https://www.aussieai.com/research/model-pruning.

Finally, note that there are other types of model compression, such as knowledge distillation, weight sharing, layer fusion, and weight clustering. For example, Character.AI mentions layer fusion, but it’s a little different because it is in relation to KV cache data rather than weights.

Kernel Optimizations

The low-level code that runs the Transformer inference steps is called a “kernel.” The main types of kernels that have the most optimizations are:

  • Matrix multiplications
  • Attention
  • Decoding algorithms
  • KV cache

The main computations in LLMs are matrix multiplications (“MatMul”), which are generalized to “tensor products.” In the early days, a lot of brain power went into making matrices and tensors compute faster. Surprisingly, there are still some breakthroughs happening in MatMul, especially as on-device platforms have different characteristics. For more about MatMul research, see https://www.aussieai.com/research/matmul.
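
For reference, here is the textbook triple-loop MatMul that all the fancy kernels are competing against. Even this simple version shows one classic optimization: the i-k-j loop order streams through rows of B and C sequentially, which is kinder to the cache than the naive i-j-k order; real kernels add tiling, vectorization, and threading on top.

    // Baseline matrix multiply C = A * B, with all matrices row-major.
    // A is n x k, B is k x m, C is n x m.
    #include <cstddef>
    #include <vector>

    void matmul(const std::vector<float>& A,
                const std::vector<float>& B,
                std::vector<float>& C,
                std::size_t n, std::size_t k, std::size_t m) {
        C.assign(n * m, 0.0f);
        for (std::size_t i = 0; i < n; ++i) {
            for (std::size_t p = 0; p < k; ++p) {
                float a = A[i * k + p];
                for (std::size_t j = 0; j < m; ++j) {
                    C[i * m + j] += a * B[p * m + j];  // accumulate into row i of C
                }
            }
        }
    }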

Attention kernel optimizations. Attention is the special algorithm that made Transformers famous. However, it’s also slow, and has now been optimized by dozens of variations, most of which aim to either: (a) reduce total computations by paying less attention to some tokens, or (b) make it more “memory efficient” in its computation pathways. Here’s a list:

  • Multi-Head Attention (MHA) — the basic idea from 2017.
  • Multi-Query Attention (MQA)
  • Grouped-Query Attention (GQA)
  • Local Attention
  • Sliding Window Attention
  • Linear Attention (other types)
  • Flash Attention — there’s also a Flash Attention version 2.
  • Paged Attention — the claim to fame of the vLLM open-source platform.
  • Paged Flash attention (combination of methods)
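
To make option (a) above concrete, here is a minimal single-head sketch of causal attention with a sliding window, so that each token attends only to the previous few tokens. The layout and dimensions are illustrative; production kernels like Flash Attention also restructure the computation for memory efficiency (option (b)), which this sketch does not attempt.

    // Single-head causal attention with a sliding window: token i attends only
    // to tokens in [i - window + 1, i], capping per-token work at O(window * d)
    // instead of O(i * d). Q, K, V, and Out are n x d matrices, row-major.
    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    void sliding_window_attention(const std::vector<float>& Q,
                                  const std::vector<float>& K,
                                  const std::vector<float>& V,
                                  std::vector<float>& Out,
                                  std::size_t n, std::size_t d, std::size_t window) {
        Out.assign(n * d, 0.0f);
        const float inv_sqrt_d = 1.0f / std::sqrt(static_cast<float>(d));
        for (std::size_t i = 0; i < n; ++i) {
            std::size_t start = (i + 1 > window) ? i + 1 - window : 0;
            // Scaled dot-product scores for the local window only.
            std::vector<float> scores;
            for (std::size_t j = start; j <= i; ++j) {
                float dot = 0.0f;
                for (std::size_t c = 0; c < d; ++c)
                    dot += Q[i * d + c] * K[j * d + c];
                scores.push_back(dot * inv_sqrt_d);
            }
            // Softmax over the window (subtract the max for numerical stability).
            float mx = *std::max_element(scores.begin(), scores.end());
            float sum = 0.0f;
            for (float& s : scores) { s = std::exp(s - mx); sum += s; }
            for (float& s : scores) s /= sum;
            // Weighted sum of the V rows in the window.
            for (std::size_t j = start; j <= i; ++j) {
                float w = scores[j - start];
                for (std::size_t c = 0; c < d; ++c)
                    Out[i * d + c] += w * V[j * d + c];
            }
        }
    }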

For more research on attention optimization, see https://www.aussieai.com/research/attention.

Parallel decoding algorithms. The autoregressive decoding method is a bottleneck that enforces sequential execution of an LLM, one token at a time. Various ways to parallelize this step have been discovered (one of them, speculative decoding, is sketched after the list):

  • Speculative decoding
  • Retrieval lookup decoding
  • Prompt lookup decoding
  • Self-speculative decoding
  • Tree speculative decoding
  • Multi-token prediction
  • Aggressive decoding
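
As promised, here is a minimal greedy speculative decoding sketch. The draft_next_token() and target_next_token() functions are trivial placeholders for a small draft model and the large target model (not real APIs), and a real engine would verify the whole draft batch in a single parallel forward pass of the target model, which is where the speedup comes from.

    // Greedy speculative decoding sketch: a small draft model proposes k tokens
    // cheaply, the large target model checks them, and we keep the longest
    // prefix where the target's own greedy choice agrees with the draft,
    // then append one corrected (or bonus) token from the target.
    #include <cstddef>
    #include <vector>

    using Token = int;

    // Placeholder stand-ins for the draft and target models (not real APIs).
    Token draft_next_token(const std::vector<Token>& context)  { return static_cast<Token>(context.size() % 7); }
    Token target_next_token(const std::vector<Token>& context) { return static_cast<Token>(context.size() % 5); }

    // Extends `context` by between 1 and k+1 tokens per call.
    void speculative_step(std::vector<Token>& context, int k) {
        // 1. The draft model proposes k tokens autoregressively (but cheaply).
        std::vector<Token> proposed;
        std::vector<Token> scratch = context;
        for (int i = 0; i < k; ++i) {
            Token t = draft_next_token(scratch);
            proposed.push_back(t);
            scratch.push_back(t);
        }
        // 2. The target model verifies: accept agreeing tokens, and stop at the
        //    first disagreement, substituting the target's own choice.
        for (int i = 0; i < k; ++i) {
            Token expected = target_next_token(context);
            if (expected == proposed[static_cast<std::size_t>(i)]) {
                context.push_back(proposed[static_cast<std::size_t>(i)]);  // accepted draft token
            } else {
                context.push_back(expected);   // target's correction; stop here
                return;
            }
        }
        // All k draft tokens accepted: the target still yields one bonus token.
        context.push_back(target_next_token(context));
    }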

For more information on speculative decoding and other parallel decoding advances, see https://www.aussieai.com/research/speculative-decoding.

KV caching. The KV cache was an innovation at the time, but it’s a basic method nowadays. When the Transformer computes the 10th token, it uses computed data from the first 9 tokens, and someone noticed that you could store that “Key-Value” (KV) data in memory, and avoid re-doing some computations for those first 9 tokens. That’s the basic “KV cache” method.

Unfortunately, it has become too much of a good thing. If you write a Tolstoy-length novel, about 700,000 words, which is about a million tokens, then you have a KV cache that stores data for each of those one million tokens. And it’s a lot of memory per token: a slice of key and value data for every layer. So the memory needed for the KV cache can actually get bigger than the whole LLM, and those things aren’t small. The back-of-the-envelope calculation below shows why.
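
Here is that back-of-the-envelope calculation of KV cache size, using illustrative model dimensions (not any particular production model): the cache stores a key vector and a value vector per token, per layer, per KV head.

    // Back-of-the-envelope KV cache size:
    //   bytes = 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes_per_value
    // The dimensions below are illustrative, roughly mid-sized-LLM territory.
    #include <cstdio>

    int main() {
        const double layers = 32, kv_heads = 32, head_dim = 128;
        const double bytes_per_value = 2.0;   // FP16
        const double tokens = 1000000.0;      // a Tolstoy-length context
        double bytes = 2.0 * layers * kv_heads * head_dim * tokens * bytes_per_value;
        std::printf("KV cache: %.0f GB for %.0f tokens\n", bytes / 1e9, tokens);
        // Prints roughly 524 GB -- far bigger than the model weights themselves.
        return 0;
    }

This is also why Grouped-Query Attention and KV cache quantization help so much: fewer KV heads and fewer bytes per value shrink that product directly.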

The first solution was to limit the length of the “context window” so that you cannot track such long texts. Early models had context lengths of 2048 or 4096, and then gradually improved to 8k and 32k. But that’s not very good for understanding a long book, or for processing big images or video, so researchers have developed two types of optimizations: (a) attention optimizations, as above, and (b) KV cache compression.

The idea with “KV cache compression” is to use less memory for the KV cache. Amusingly, almost every method researchers ever tried for model compression for LLM weights can also be applied to compressing the KV cache data — quantization, pruning, and so on. Quantization of the KV cache data is quite commonplace, such as in the Character.AI platform, and this also uses layer fusion of KV cache data (which is similar to depthwise pruning of the KV cache). There are various newer research papers on techniques such as lengthwise per-token KV cache data pruning.

Maybe we could do a KV cache for the KV cache? KV-squared cache. Now there’s a patentable idea!

For more research on the various KV caching methods, see https://www.aussieai.com/research/caching.

An important limitation of the basic KV cache, whether compressed or not, is that it only works within the current query for one user. Optimizing inference across multiple queries from multiple users is where other accelerators come in.

AI Accelerators

The main way to speed up an AI engine is to use better silicon, and find an inference engine that supports the right chipsets. Once you’ve maxed that out, you have to look at software, which means kernel optimizations as already discussed above, and then other add-on accelerator components, which is what this section is about.

The first point is don’t just optimize your LLM and its Transformer, no matter how much fun that is. Any production architecture has other components, such as a basic web server, utility servers, DNS server, identity validation, and so on. You should measure end-to-end response time and optimize the whole system.

In order to further confuse the issue, let us note that a lot of these “accelerators” for LLMs actually plug into the KV cache mechanism. However, the idea is a “global KV cache” that works across multiple queries, rather than the basic KV cache within a single query. But first, let’s pretend we don’t know anything about the letters K and V.

Basic Inference Caching. If you have an AI application with lots of users pounding it with queries, how would you speed it up? Well, one thing to think about is that users often ask the same questions. If it’s your internal HR chatbot, here’s one:

        How do I sign up for 401(k)?

Presumably, you’ve got a policy document for that, and you can just return an answer that links to that document. Hence, every person who asks that question can receive the same answer. Thus, the idea of an “Inference Cache” is born: put a basic text-to-text cache mechanism between the query input and the LLM. Any queries that get a cache hit never reach the LLM at all.
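
Here is a minimal sketch of such a text-to-text inference cache, where run_llm() is a placeholder for the real inference call (not a real API):

    // Text-to-text inference cache: identical queries are answered from a
    // hash map and never reach the LLM again.
    #include <string>
    #include <unordered_map>

    // Placeholder stand-in for the real inference call (not a real API).
    std::string run_llm(const std::string& query) {
        return "LLM answer for: " + query;
    }

    class InferenceCache {
    public:
        std::string answer(const std::string& query) {
            auto it = cache_.find(query);
            if (it != cache_.end()) return it->second;   // cache hit: no LLM call
            std::string result = run_llm(query);         // cache miss: run inference
            cache_[query] = result;
            return result;
        }
    private:
        std::unordered_map<std::string, std::string> cache_;
    };

In practice you would also skip caching for any query that triggers tools or external data, which is the next point.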

Non-Cacheable Queries. Like everything in AI, there are exceptions to this method. Not all queries can be cached. For example, think about this one:

        What time is it?

Any query that cannot be answered from just the LLM weights cannot be cached. If the answer requires any external data access or any tool usage (e.g., a clock in this example), then that query must bypass the cache. Note that LLMs already have mechanisms to detect when a query needs external data sources or computation tools (e.g., “trigger tokens” and “function calls”), so you can extend those interfaces to also bypass the cache whenever they are triggered.

Semantic Cache. Another problem with the basic inference cache is that the tokens must be identical. Slightly different wording of the same question will cause a cache miss. The generalization is therefore to detect queries that have different tokens, but the same meaning, using a semantic cache. This method uses embeddings and a vector database, and works better than a basic text-to-text inference cache. As with the inference cache, you insert it between the input and the LLM. The same restrictions about non-cacheable queries with data sources or tools also apply to the semantic cache.
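
Here is a minimal sketch of the semantic cache lookup. The embed() function below is a toy letter-frequency embedding standing in for a real embedding model, and the linear scan stands in for a vector database’s nearest-neighbor search; the similarity threshold is something you would tune.

    // Semantic cache sketch: cache hits are decided by cosine similarity of
    // query embeddings, not by exact text match.
    #include <cmath>
    #include <cstddef>
    #include <string>
    #include <vector>

    // Toy embedding (26-dim letter counts), a stand-in for a real embedding model.
    std::vector<float> embed(const std::string& text) {
        std::vector<float> v(26, 0.0f);
        for (char c : text)
            if (c >= 'a' && c <= 'z') v[static_cast<std::size_t>(c - 'a')] += 1.0f;
        return v;
    }

    float cosine(const std::vector<float>& a, const std::vector<float>& b) {
        float dot = 0.0f, na = 0.0f, nb = 0.0f;
        for (std::size_t i = 0; i < a.size(); ++i) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (std::sqrt(na) * std::sqrt(nb) + 1e-9f);
    }

    struct CachedAnswer {
        std::vector<float> embedding;   // embed(original query text)
        std::string answer;
    };

    // Returns true and fills answer_out if any cached query is close enough.
    bool semantic_lookup(const std::vector<CachedAnswer>& cache,
                         const std::string& query, float threshold,
                         std::string& answer_out) {
        std::vector<float> q = embed(query);
        for (const CachedAnswer& entry : cache) {
            if (cosine(q, entry.embedding) >= threshold) {
                answer_out = entry.answer;
                return true;    // semantic cache hit: skip the LLM
            }
        }
        return false;           // miss: run inference, then insert the new entry
    }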

RAG Accelerators. If you have a RAG architecture, there are other components to speed up. For example, the performance depends on the datastore and retrieval mechanism. Speeding up databases and indexed retrieval methods sounds like something we’ve done before, like for the last 50 years or more.

Also, you can put an inference cache or a semantic cache component in front of a RAG query architecture. This will work provided that the RAG retriever isn’t accessing external data, or doing anything time-dependent. If you add new data to the RAG datastore, you’ll need to clear some or all of the cache.

Note that caching of results for recurring queries is a technique for database optimization, too. In this case, it refers to basic database querying where you cache the chunks of text that a RAG retriever returns, rather than having any unsavory interactions with the KV cache.

But there’s also a global KV cache optimization for RAG architectures. No doubt, you’re pleased to hear that. Instead of storing text chunks, you can store pre-calculated KV cache data. This idea is a variant of prefix global KV caching and fused global KV caching.

High-level accelerators. So, if you’ve been skimming, here’s my summary of all the different types of accelerators mentioned above. First, the high-level ones that plug in externally to the Transformer’s inference engine:

  • Inference cache
  • Semantic cache
  • RAG retriever cache (database cache)
  • RAG inference cache
  • RAG semantic cache
  • Prompt compression (token pruning/token merging)
  • Prompt shield (block some queries, more for safety than speed)

Low-level KV cache accelerators. And here are the really fun ones that plug deeply into the KV cache mechanism inside the Transformer kernels:

  • Global KV cache (inference cache)
  • Semantic global KV cache (semantic cache)
  • Prefix global KV cache
  • Session global KV cache
  • Fused global KV cache
  • RAG prefix global KV cache
  • RAG fused global KV cache

And for extra coding fun, any of these KV cache methods can be further optimized by the various “KV cache compression” methods:

  • KV cache quantization
  • KV cache layer fusion (depth dimension)
  • KV cache token pruning (length dimension)
  • KV cache head fusion (width dimension)

Those ones are my favorites!

References on Inference Optimization

We have catalogued literally over 500 different techniques for speeding up LLM inference, from activation quantization to zero skipping. Some of these techniques are widely used in industry (e.g., quantization, KV cache compression), whereas others are very obscure (e.g., zero-multiplication models, logarithmic number systems). Here are the links for our inference optimization research literature review:

  1. Aussie AI, Nov 2024, AI Research Overview, https://www.aussieai.com/research/overview.
  2. David Spuler, September 2024, 500+ LLM Inference Optimization Techniques, Aussie AI Blog, https://www.aussieai.com/blog/llm-inference-optimization.
  3. David Spuler, August 2024, Hot Inference Optimization Techniques, Aussie AI Blog, https://www.aussieai.com/blog/hot-inference-research.
  4. Aussie AI, Nov 2024, Inference Optimization Techniques, https://www.aussieai.com/research/list.

This is our extensive AI literature survey, focused on inference optimizations, along with other areas such as reasoning algorithms. There are subpages on this site with research paper lists for almost any inference optimization technique you care to examine. If you prefer looking at code, or reading the above research in e-book or print formats, there is our book Generative AI in C++ published in March 2024, where Part VII of the book is focused entirely on research areas.

Green AI

Environmental impact is a concern for AI architectures, because they consume so much GPU juice. The environmental concerns include the usage of these precious resources:

  • Electricity to run GPUs.
  • Water to cool them.

In fact, the papers I’ve read tend to say that an AI-based query costs about ten times more than a regular search using Google or Bing. It may get worse. We’re not at the end of this AI ride, obviously, and multi-AI architectures and multi-step reasoning may increase the load. On the other hand, small language models, kernel optimizations, and on-device inference may reduce costs.

There are a lot of “green AI” research papers, with broad analyses about the overarching measurement of environmental impact, metrics to use, and other general areas. For more research on green AI, see https://www.aussieai.com/research/green.

The point I like the most is that this area has a huge overlap with “speed” research. Faster AI engines mean less electricity required, and I’m rather good at making AI run faster, so the work I’m doing is environmentally friendly!

And that’s a nice way to finish out this book. Apparently, profiling my vectorized CPU code with gprof and tweaking GPU kernels has some good karma.

 
