Aussie AI
Tuning an AI Engine
- Book Excerpt from "Generative AI in C++"
- by David Spuler, Ph.D.
As with any other C++ application, tuning an AI engine requires timing and profiling of the underlying C++ code. To do so, you'll need a batch interface whereby the prompt query text can be supplied as a command-line argument, or via a text file.
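A batch interface of this kind can be quite small. The following is a minimal sketch, not the book's implementation: the function name `load_prompt` and the `--prompt` and `--file` flags are hypothetical, chosen only for illustration.

```cpp
#include <fstream>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical batch-mode helper (names and flags are assumptions):
//   engine --prompt "query text"   prompt supplied directly on the command line
//   engine --file prompt.txt       prompt read from a text file
// Returns std::nullopt on bad usage or if the file cannot be opened.
std::optional<std::string> load_prompt(const std::vector<std::string>& args)
{
    if (args.size() != 2) return std::nullopt;

    if (args[0] == "--prompt") {
        return args[1];
    }
    if (args[0] == "--file") {
        std::ifstream in(args[1], std::ios::binary);
        if (!in) return std::nullopt;
        std::ostringstream buf;
        buf << in.rdbuf();  // read the whole file unchanged, so every run sees identical text
        return buf.str();
    }
    return std::nullopt;
}
```

In a real `main(int argc, char** argv)`, the arguments could be passed as `std::vector<std::string>(argv + 1, argv + argc)`. Reading the file in binary mode is a deliberate choice here, so that the engine receives byte-for-byte the same prompt on every profiling run.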
To measure what impact your code optimizations are having on your Transformer engine's performance, you'll need to re-run exactly the same query (or many queries) after each major code change. To isolate the effects of C++ engine code changes, you ideally need everything else to stay exactly the same:
- Hardware (same CPU, same GPU, same settings, etc.)
- Thread and OS settings
- Server load (i.e., avoid other processes running)
- Inference query (exactly the same text)
- Model file
- Configuration settings (e.g., temperature).
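With those factors held fixed, re-running the same query can be wrapped in a small timing harness. This is a sketch under stated assumptions: `run_query` is a stand-in for whatever inference entry point your engine exposes, and `TimingStats` and `time_query` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Summary of wall-clock timings over repeated runs, in milliseconds.
struct TimingStats {
    double min_ms;
    double median_ms;  // upper-middle element when the run count is even
    double max_ms;
};

// Runs the same prompt `runs` times and collects wall-clock timings.
// `run_query` is a placeholder (assumption) for the engine's inference call.
TimingStats time_query(const std::function<void(const std::string&)>& run_query,
                       const std::string& prompt, int runs)
{
    assert(runs > 0);
    std::vector<double> ms;
    ms.reserve(static_cast<std::size_t>(runs));

    for (int i = 0; i < runs; ++i) {
        // steady_clock is monotonic, so system clock adjustments cannot skew the result
        const auto start = std::chrono::steady_clock::now();
        run_query(prompt);
        const auto stop = std::chrono::steady_clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(stop - start).count());
    }

    std::sort(ms.begin(), ms.end());
    return TimingStats{ ms.front(), ms[ms.size() / 2], ms.back() };
}
```

Comparing the minimum and median before and after a code change tends to be more robust than comparing a single run, because occasional slow runs caused by residual server load only affect the maximum.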
To really finesse the engine profiling, you can ensure that it returns exactly the same results, as for regression testing, by managing these code issues:
- Random number seed (e.g., impacts the top-k decoding algorithm).
- Time-specific tools (e.g., the time function needs an intercept so it doesn't change).
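Both issues can be handled with a few lines of code. The sketch below is illustrative only: `engine_time`, `freeze_time`, `unfreeze_time`, and `sample_top_k` are hypothetical names, and the top-k sampler is a simplified stand-in rather than any particular engine's decoder. It assumes the engine calls `engine_time()` wherever it would otherwise call `std::time`, and that the top-k probabilities are positive.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <ctime>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical time intercept: when frozen, every call returns the same value.
static bool g_time_frozen = false;
static std::time_t g_frozen_time = 0;

void freeze_time(std::time_t t) { g_time_frozen = true; g_frozen_time = t; }
void unfreeze_time() { g_time_frozen = false; }
std::time_t engine_time() { return g_time_frozen ? g_frozen_time : std::time(nullptr); }

// Simplified top-k sampling: pick one of the k most probable tokens,
// weighted by probability. The generator is passed in explicitly, so the
// caller controls the seed.
int sample_top_k(const std::vector<float>& probs, int k, std::mt19937& rng)
{
    assert(k > 0 && static_cast<std::size_t>(k) <= probs.size());

    // Token indices, with the k highest-probability ones moved to the front.
    std::vector<int> idx(probs.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return probs[a] > probs[b]; });

    std::vector<double> weights;
    weights.reserve(static_cast<std::size_t>(k));
    for (int i = 0; i < k; ++i) weights.push_back(probs[idx[i]]);

    std::discrete_distribution<int> pick(weights.begin(), weights.end());
    return idx[pick(rng)];
}
```

Seeding with a constant, such as `std::mt19937 rng(12345)`, makes the sampled token sequence repeat from run to run of the same binary. Note that `std::mt19937` output is fully specified by the C++ standard, but the mapping performed by `std::discrete_distribution` is implementation-defined, so results may differ between standard library implementations.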
The other part is to test your AI engine separately from other parts of the system. Yes, the overall system performance is important, but that is a separate performance analysis from the pure C++ profiling of the Transformer engine. Some of the issues include:
- RAG databases. Test the engine on its query after the retriever has looked up its chunks of text. The full input for a profiling query should be the extra RAG context plus a question.
- Inference cache. Ensure the engine is not bypassed by the caching component. If the exact same query runs super-fast the second time you test it, umm, that's the cache, not your code changes.
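Both points can be sketched in code. The helpers below are hypothetical: `build_profiling_input` assumes the retriever's chunks have already been saved as strings and simply joins them ahead of the question, and `looks_like_cache_hit` is a rough heuristic with an arbitrary default threshold, not a definitive test.

```cpp
#include <string>
#include <vector>

// Builds the full engine input for a profiling run: the pre-retrieved RAG
// chunks followed by the user's question. The blank-line separator is an
// assumption; use whatever prompt layout your system actually sends.
std::string build_profiling_input(const std::vector<std::string>& chunks,
                                  const std::string& question)
{
    std::string input;
    for (const auto& chunk : chunks) {
        input += chunk;
        input += "\n\n";
    }
    input += question;
    return input;
}

// Heuristic warning: if a repeat of the same query is more than `ratio`
// times faster than the first run, a cache is probably answering it.
bool looks_like_cache_hit(double first_ms, double repeat_ms, double ratio = 10.0)
{
    return repeat_ms * ratio < first_ms;
}
```

Saving the built input to a text file gives a fixed profiling query that exercises the engine without involving the retriever at all.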
To test the overall response time to the user, system tuning is required. The responsiveness of the RAG retriever component, the cache hit ratio, and other practical deployment issues are all important for real-world performance. See Chapter 7 for more information on efficient architectures for deploying AI engines.
