Chapter 4. Cheaper RAG
Book Excerpt from "RAG Optimization: Accurate and Efficient LLM Applications"
by David Spuler and Michael Sharpe
RAG Cost Optimization
The first point to note is that RAG is itself a cost optimization. The idea of RAG was to allow the LLM to give more accurate answers without the expense of extra fine-tuning or training: use chunks of retrieved data plus prompting, rather than changing the LLM itself.
Nevertheless, a RAG project is also a significant expense. The cost of building a RAG application is like that of any other IT project, with several main components:
- Development and testing costs
- Tech platform operating costs
- LLM token costs
Overall cost is one of the big constraints that you’ll face in a RAG project. Let’s take a step back and consider the biggest cost optimizations for your entire project:
1. Buy an off-the-shelf commercial AI-based solution instead of a DIY RAG project (haha!).
2. Test multiple commercial foundational model API providers and compare pricing.
3. Use an open source pre-trained model and engine (e.g., Meta’s Llama models).
4. Choose a compressed open source pre-trained pre-quantized model (e.g., quantized Llama).
5. Use cheaper commercial API providers for early development and testing.
6. Use smaller open-source models for early development and testing.
Reducing Production Costs
Your RAG application has several major components, which can each contribute to the operating costs of the overall system. The main components tend to be:
- LLM inference costs
- Vector database query costs
- Keyword datastore query costs
- Database refresh and maintenance costs
There are also all the traditional types of IT costs for running the backend of a large user application:
- Web servers (e.g., running Apache or Nginx)
- Utility servers
- Test and deployment servers
- Licensing costs (e.g., domain names)
- Internet query costs (e.g., DNS).
Reducing Token Costs
The main RAG cost has typically been the token cost of performing all of your users’ LLM inference queries, which are often large due to the hidden cost of RAG data chunks and the underlying prompt engineering (e.g., global instructions and/or the prompt template used for packing). However, per-token pricing of LLM inference has been plummeting over the last year or more, and there are now many cost-effective options to consider for the LLM model and the inference engine.
There are two main choices: engine and model. You can choose a smaller or cheaper model, but this is a trade-off of cost against speed and accuracy. Nevertheless, choose your underlying LLM carefully:
- Choose the best model for your speed-versus-smartness trade-off
- Consider quantized models (a loading sketch follows this list)
- Consider Small Language Models (SLMs)
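As a concrete illustration of the quantized-model option, here’s a minimal sketch of loading a 4-bit quantized open source model via the Hugging Face transformers and bitsandbytes libraries (both assumed installed); the model name is just an example, and the right quantization settings depend on your hardware and accuracy needs.

# Minimal sketch: load an open source LLM with 4-bit weight quantization
# using Hugging Face transformers + bitsandbytes (assumed installed).
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"   # example model name only

quant_config = BitsAndBytesConfig(load_in_4bit=True)  # roughly 4x smaller than FP16

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",   # spread layers across available GPU(s)
)

inputs = tokenizer("What is RAG?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

A quantized model trades a small accuracy loss for a large memory saving, which often lets you serve the same model on cheaper GPUs.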
The most basic ways to reduce costs for your RAG application focus on the LLM inference engine component:
- Choose cheaper LLM inference providers (there are many!)
- Consider LLM APIs versus LLM hosting providers
- Consider commercial hosting versus self-hosted open source LLMs
- Choose your underlying GPU hardware
- Use cheaper models for testing and debugging, to some extent (a model-cascade sketch follows this list)
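One well-known pattern combining several of these ideas is a model cascade: send each query to a cheap model first, and escalate to an expensive model only when the cheap answer looks unreliable (see the FrugalGPT paper in the research list below). Here’s a minimal sketch of that routing logic; call_llm() is a placeholder for your real inference client, and the confidence heuristic is a deliberately crude assumption.

# Minimal sketch of a FrugalGPT-style model cascade (placeholder helpers).
CHEAP_MODEL = "small-model"       # placeholder name for a cheap, fast model
EXPENSIVE_MODEL = "large-model"   # placeholder name for a strong, slow model

def call_llm(model: str, prompt: str) -> str:
    """Placeholder for a real LLM API call (commercial or self-hosted)."""
    raise NotImplementedError("wire this up to your inference provider")

def looks_reliable(answer: str) -> bool:
    """Crude stand-in for a confidence check (refusal/length heuristics)."""
    refusals = ("i don't know", "i'm not sure", "i cannot answer")
    return len(answer) > 20 and not answer.lower().startswith(refusals)

def cascade_query(prompt: str) -> str:
    answer = call_llm(CHEAP_MODEL, prompt)
    if looks_reliable(answer):
        return answer                          # cheap path: most queries end here
    return call_llm(EXPENSIVE_MODEL, prompt)   # fallback: pay more for quality

If most queries are easy, the expensive model runs only on the hard minority, which can cut average per-query cost substantially.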
Other low-level ways to make RAG LLM inference queries cheaper focus on reducing the text itself (a combined sketch follows this list):
- Fewer chunks returned by the retriever and/or reranker
- Smaller chunks (fewer tokens)
- Context compression (automatically reducing size of chunks)
- Chunk pre-summarization (chunk compression)
- Cached tokens from API providers (i.e., prefix KV caching)
- Text-to-text inference cache on the front (i.e., avoid LLM inference completely!)
- Shorten the global system instructions and/or make the prompt template more concise
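To make a couple of these ideas concrete, here’s a minimal sketch combining two of them: limiting the number and size of retrieved chunks before prompt packing, and a simple exact-match text-to-text cache in front of the LLM. The call_llm() helper is a placeholder, and truncating by characters is a simplification; a real system would count tokens and might use embedding-based (semantic) cache keys.

# Minimal sketch: fewer/smaller chunks plus a text-to-text inference cache.
MAX_CHUNKS = 3          # fewer chunks passed to the LLM
MAX_CHUNK_CHARS = 800   # smaller chunks (character count as a token proxy)

_inference_cache: dict[str, str] = {}  # exact-match text-to-text cache

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call."""
    raise NotImplementedError("wire this up to your inference provider")

def build_prompt(question: str, chunks: list[str]) -> str:
    trimmed = [c[:MAX_CHUNK_CHARS] for c in chunks[:MAX_CHUNKS]]
    context = "\n---\n".join(trimmed)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

def cached_rag_query(question: str, chunks: list[str]) -> str:
    prompt = build_prompt(question, chunks)
    if prompt in _inference_cache:      # cache hit: no LLM inference at all
        return _inference_cache[prompt]
    answer = call_llm(prompt)
    _inference_cache[prompt] = answer
    return answer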
We have a list of over 500 techniques for LLM inference optimization in the Appendix, with various ideas that will reduce compute and memory cost, if you want some bedtime reading.
Financial Optimizations
An AI project is expensive in terms of the hardware, the software, and the people you need. There are some considerations that can reduce the cost somewhat.
Use existing assets. What internal data assets do you possess? Can you re-purpose any of your company’s existing hardware assets? And can you “re-purpose” any of your staff, too?
Buy vs rent. If it’s floating, flying, or foundational modeling: rent, don’t buy! Similarly, do you need to buy your own servers and GPUs? The decision may be different for the different phases of a project:
- Development and testing
- Training the model (fine-tuning/specialization)
- Inference (live execution)
For example, you might want to buy for the training phases and rent for the inference phase. This depends on how much training you need, the size of your model, and whether you plan to avoid fine-tuning on proprietary data by using RAG instead. The cost of inference depends on user counts, which differ significantly between an internal employee project and a live public user application.
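To put the buy-versus-rent trade-off into numbers, here’s a tiny break-even calculation; the prices and utilization figure are purely illustrative assumptions, not market quotes.

# Illustrative buy-vs-rent break-even arithmetic (all numbers are assumptions).
purchase_price = 30_000.0   # assumed up-front cost of one high-end GPU (USD)
rental_rate = 2.00          # assumed cloud price per GPU-hour (USD)
utilization = 0.5           # fraction of wall-clock time the GPU is busy

break_even_hours = purchase_price / rental_rate                  # 15,000 GPU-hours
wall_clock_months = break_even_hours / (24 * 30 * utilization)   # about 42 months
print(f"Buying beats renting after {break_even_hours:,.0f} busy GPU-hours "
      f"(about {wall_clock_months:.0f} months at {utilization:.0%} utilization)")

The utilization term is the crux: at low utilization (development, early deployment), renting usually wins; at sustained high utilization (heavy training), buying can pay off.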
Idle VMs and GPUs. Watch out for virtual machines and rented GPUs being idle early in the project. You’re paying money for nothing in such cases. This can occur in the development phases and in the early live deployment when user levels are low.
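One simple mitigation is to monitor utilization directly. Here’s a minimal sketch that flags idle NVIDIA GPUs by querying nvidia-smi; the idle threshold is an arbitrary assumption, and your cloud provider’s billing alerts can serve the same purpose.

# Minimal sketch: flag idle NVIDIA GPUs via nvidia-smi (threshold is arbitrary).
import subprocess

IDLE_THRESHOLD = 5  # percent utilization below which we call a GPU "idle"

result = subprocess.run(
    ["nvidia-smi", "--query-gpu=index,utilization.gpu",
     "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
)
for line in result.stdout.strip().splitlines():
    index, util = (field.strip() for field in line.split(","))
    if int(util) < IDLE_THRESHOLD:
        print(f"GPU {index} idle at {util}% utilization -- paying for nothing?")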
Scrimp on developer models. During the development and testing phases, there’s no need for gold-plated AI models. The cost of development and testing of your AI application can be reduced by using low-end models for simple testing. Many of the components needed are not dependent on whether the AI engine returns stellar results. Initial development, prototyping, and ongoing regression testing of these parts of the system can proceed with small models.
There is also vendor support for testing on lower-end models: various AI platforms offer interfaces that mimic OpenAI’s API at a lower cost, so you can develop and test on these platforms, and then do final testing on the live commercial platform (a configuration sketch follows).
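Because many providers expose OpenAI-compatible endpoints, switching between a cheap test backend and the production platform can be a configuration change rather than a code change. Here’s a minimal sketch using the official openai Python client; the environment variable names and default values are illustrative assumptions.

# Minimal sketch: point the OpenAI client at a cheaper OpenAI-compatible
# backend for dev/test, then at the real platform for final testing.
import os
from openai import OpenAI

client = OpenAI(
    # e.g., a local or discount OpenAI-compatible server during development
    base_url=os.environ.get("LLM_BASE_URL", "http://localhost:8000/v1"),
    api_key=os.environ.get("LLM_API_KEY", "dummy-key-for-local-testing"),
)
model = os.environ.get("LLM_MODEL", "test-model")  # cheap model for dev runs

response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Smoke test: reply with OK."}],
)
print(response.choices[0].message.content)

Unset the environment variables (or point them at the commercial platform) for the final pre-release test pass.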
Technical Debt in AI Projects
Everything’s changing fast in AI research and industry practice. Hence, the current methods of building and deploying AI applications are a work-in-progress. Nobody really knows what’s optimal in regard to:
- What to use AI for?
- Which models to choose?
- What tech infrastructure to use?
- How to optimize?
- How to address safety concerns?
Hence, as part of planning an AI project, consider paying extra attention to the “technical debt” inherent in this situation. You may need to refresh your tech stack much sooner than in a non-AI project. It’s hard to quantify in terms of effort or timescales, but it’s an important issue to note in your AI project proposals. The key point is to budget extra funding for post-launch maintenance tasks.
AI Costs Research
Research papers and articles on costs of AI applications in general, including API usage and inference costs per token:
- Waleed Kadous, August 23, 2023, Llama 2 is about as factually accurate as GPT-4 for summaries and is 30X cheaper, https://www.anyscale.com/blog/llama-2-is-about-as-factually-accurate-as-gpt-4-for-summaries-and-is-30x-cheaper Code: https://github.com/anyscale/factuality-eval
- Matt Murphy, Tim Tully, Derek Xiao, January 18, 2024, The Modern AI Stack: Design Principles for the Future of Enterprise AI Architectures, Menlo Ventures, https://menlovc.com/perspective/the-modern-ai-stack-design-principles-for-the-future-of-enterprise-ai-architectures/ (Various details about the AI tech stack, organizational AI maturity levels, and several interesting facts: inference is 95% of AI cost now, 60% of organizations are using multi-model methods, RAG is the dominant architecture currently, and AI application development teams are primarily made up of non-ML software engineers leveraging on top of AI models.)
- Batchmon, July 18, 2024, AI paid for by Ads – the gpt-4o mini inflection point, https://batchmon.com/blog/ai-cheaper-than-ads/
- Anthropic, 15 Aug 2024, Prompt caching with Claude, https://www.anthropic.com/news/prompt-caching (Anthropic is now supporting prompt caching with approximately tenfold reduction in token pricing.)
- Andrew Ng, Sep 2024, X post, https://x.com/AndrewYNg/status/1829190549842321758 (Dropping token prices for LLMs means developers can focus on the app layer.)
- David Spuler, March 2024, Financial Optimizations, in Generative AI in C++, https://www.aussieai.com/book/ch5-financial-optimizations
- Florian Douetteau, September 7, 2024, Get ready for a tumultuous era of GPU cost volatility, https://venturebeat.com/ai/get-ready-for-a-tumultuous-era-of-gpu-cost-volitivity/
- Tanay Jaipuria, Sep 16, 2024, The Plummeting Cost of Intelligence: On smaller models, on-device inference and the path to zero cost intelligence, https://www.tanayj.com/p/the-plummeting-cost-of-intelligence
- Lingjiao Chen, Matei Zaharia, James Zou, 9 May 2023, FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance, https://arxiv.org/abs/2305.05176
- Robert Corwin, Nov 2024, Running Large Language Models Privately: A comparison of frameworks, models, and costs, https://towardsdatascience.com/running-large-language-models-privately-a-comparison-of-frameworks-models-and-costs-ac33cfe3a462
- Paul Krill Dec 05, 2024, OpenAI unveils API for tracking OpenAI API usage, costs, https://www.infoworld.com/article/3618202/openai-unveils-api-for-tracking-openai-api-usage-costs.html
- Waleed Kadous, May 17, 2023, Numbers every LLM Developer should know, https://www.anyscale.com/blog/num-every-llm-developer-should-know (Includes discussion of “be concise” prompting.)
- Tim Urista, Dec 2024, Dramatically Reduce Inference Costs with DeepSeek-V3: A New Era in Open-Source LLMs, https://ai.gopubby.com/dramatically-reduce-inference-costs-with-deepseek-v3-a-new-era-in-open-source-llms-4f1adf760ee1
- Kyle Wiggers, December 23, 2024, A popular technique to make AI more efficient has drawbacks, https://techcrunch.com/2024/12/23/a-popular-technique-to-make-ai-more-efficient-has-drawbacks/
- Ryan Browne, Dec 31 2024, Alibaba slashes prices on large language models by up to 85% as China AI rivalry heats up, https://www.cnbc.com/2024/12/31/alibaba-baba-cloud-unit-slashes-prices-on-ai-models-by-up-to-85percent.html
- Kyle Wiggers, January 5, 2025, OpenAI is losing money on its pricey ChatGPT Pro plan, CEO Sam Altman says, https://techcrunch.com/2025/01/05/openai-is-losing-money-on-its-pricey-chatgpt-pro-plan-ceo-sam-altman-says/ (OpenAI is losing money on its $200/month plan because people use it too much.)
- Janelle Teng, Dec 24, 2024, Unwrapping OpenAI’s o3, https://nextbigteng.substack.com/p/unwrapping-openai-o3-reasoning-model (“...it costs a whopping $17-$20 per task to run o3 in low-compute mode...o3 and other CoT models are currently expensive at inference”)
- Austin Starks, Jan 2025, You are an absolute moron for believing in the hype of “AI Agents”, https://medium.com/@austin-starks/you-are-an-absolute-moron-for-believing-in-the-hype-of-ai-agents-c0f760e7e48e
- Akash Bajwa Jan 27, 2025, The Post-R1 World: AI Economics Have Irreversibly Changed, https://akashbajwa.substack.com/p/the-post-r1-world
- Mohammed Karimkhan Pathan, February 3, 2025, Open-source revolution: How DeepSeek-R1 challenges OpenAI’s o1 with superior processing, cost efficiency, https://venturebeat.com/ai/open-source-revolution-how-deepseek-r1-challenges-openais-o1-with-superior-processing-cost-efficiency/
- R Szilágyi, 2024, OpenSource alternatives of Generative Artificial Intelligence for SME's, Journal of Agricultural Informatics, Vol. 15 No. 2 (2024), https://doi.org/10.17700/jai.2024.15.2.733 https://journal.magisz.org/index.php/jai/article/view/733 https://journal.magisz.org/index.php/jai/article/view/733/412
- Wade Tyler Millward, February 4, 2025, Google Q4 2024 Earnings: CEO Pichai Says DeepSeek Models Less ‘Efficient’ Than Gemini’s, 'The cost of actually using it is going to keep coming down, which will make more use cases feasible,' Alphabet CEO Sundar Pichai says. https://www.crn.com/news/ai/2025/google-q4-2024-earnings-ceo-pichai-says-deepseek-models-less-efficient-than-gemini-s
- Sam Altman, Feb 10, 2025, Three Observations, https://blog.samaltman.com/three-observations (Talks about scaling laws, inference costs reducing, and AGI. One of them: “The cost to use a given level of AI falls about 10x every 12 months, and lower prices lead to much more use.”)
- Dave Salvator, January 23, 2025, Fast, Low-Cost Inference Offers Key to Profitable AI. The NVIDIA inference platform boosts AI inference performance, saving millions of dollars across retail, telco and more, https://blogs.nvidia.com/blog/ai-inference-platform/
- swyx, Jan 2025, X post: updated price-elo pareto frontier with DeepSeek v3/r1 and Gemini 2 flash thinking 2 results, https://x.com/swyx/status/1882933368444309723
- Reuters, February 26, 2025, DeepSeek cuts off-peak pricing for developers by up to 75%, https://www.reuters.com/technology/chinas-deepseek-cuts-off-peak-pricing-by-up-75-2025-02-26/
- Reuters, March 1, 2025, China’s DeepSeek claims theoretical cost-profit ratio of 545% per day, https://www.reuters.com/technology/chinas-deepseek-claims-theoretical-cost-profit-ratio-545-per-day-2025-03-01/
- Zhouzi Li, Benjamin Berg, Arpan Mukhopadhyay, Mor Harchol-Balter, 31 Jul 2024 (v3), How to Rent GPUs on a Budget, https://arxiv.org/abs/2406.15560
- Prasanth Aby Thomas, Feb 27, 2025, DeepSeek offers steep discounts, escalating AI price war, https://www.infoworld.com/article/3834662/deepseek-offers-steep-discounts-escalating-ai-price-war.html
- Alberto Romero, Mar 19, 2025, Why the Cost of AI Keeps Magically Going Down? Houses, healthcare, and education surely don’t work that way, https://www.thealgorithmicbridge.com/p/why-the-cost-of-ai-keeps-magically
- Lavanya Gupta, May 1, 2025, Hidden costs in AI deployment: Why Claude models may be 20-30% more expensive than GPT in enterprise settings, https://venturebeat.com/ai/hidden-costs-in-ai-deployment-why-claude-models-may-be-20-30-more-expensive-than-gpt-in-enterprise-settings/ (The Claude tokenizer apparently creates more tokens than GPT.)
- Peter Wayner, Apr 28, 2025, 14 tiny tricks for big cloud savings, https://www.infoworld.com/article/3964101/14-tiny-tricks-for-big-cloud-savings.html
AI Costs and Revenue Market Research
Articles and research papers on AI companies and their costs or revenues include:
- Tanay Jaipuria, Oct 01, 2024, OpenAI and Anthropic Revenue Breakdown: Breaking down revenue growth, the consumer subscription businesses and the importance of partnerships to the API business, https://www.tanayj.com/p/openai-and-anthropic-revenue-breakdown
- Ashu Garg, August 30, 2024, The AI Hype: $600B question or $4.6T+ opportunity? Foundation Capital, https://foundationcapital.com/the-ai-hype-600b-question-or-4-6t-opportunity/ (Quote: “...today’s uncertainty will give rise to the next magnificent seven...”)
- Kaya Ginsky, June 27, 2024, Figma CEO says it is ‘eating cost’ of AI upgrade for customers in 2024, https://www.cnbc.com/2024/06/27/figma-ceo-says-its-eating-cost-of-ai-for-customers-in-2024-upgrade-.html
- Jing Hu, Aug 2024, The AI Bubble. A Reality I Just Realized in the GenAI Landscape, https://ai.gopubby.com/the-ai-bubble-some-reality-i-just-realized-in-the-genai-startup-landscape-c837a567ae3e
- Ben Evans, Nov 2024, For 2025, ‘AI eats the world’, https://www.ben-evans.com/presentations
- Paula Rooney, 18 Jul 2024, GenAI sticker shock sends CIOs in search of solutions, CIO, https://www.cio.com/article/2518411/genai-sticker-shock-sends-cios-in-search-of-solutions.html
- Grant Gross, 21 Nov 2024, CIOs view cost management as possible AI value killer, https://www.cio.com/article/3608214/cios-view-cost-management-as-possible-ai-value-killer.html
- Paula Rooney, 23 Jan 2025, Cost concerns put CIOs’ AI strategies on edge, https://www.cio.com/article/3808191/cost-concerns-put-cios-ai-strategies-on-edge.html (Most businesses want consumption-based pricing to control costs, not up-front commitments.)
- Grant Gross, 19 Dec 2024, How will AI agents be priced? CIOs need to pay attention, https://www.cio.com/article/3624540/how-will-ai-agents-be-priced-cios-need-to-pay-attention.html
Training Costs Research
Research papers on AI training costs include:
- Epoch AI, 2024, How Much Does It Cost to Train Frontier AI Models? https://epochai.org/blog/how-much-does-it-cost-to-train-frontier-ai-models
- Ben Cottier, Robi Rahman, Loredana Fattorini, Nestor Maslej, David Owen, 31 May 2024, The rising costs of training frontier AI models, https://arxiv.org/abs/2405.21015
- David Linthicum, Aug 23, 2024, Navigating the AI frontier, https://www.infoworld.com/article/3491416/navigating-the-ai-frontier.html
- Maxwell Zeff, February 5, 2025, Researchers created an open rival to OpenAI’s o1 ‘reasoning’ model for under $50, https://techcrunch.com/2025/02/05/researchers-created-an-open-rival-to-openais-o1-reasoning-model-for-under-50/
- Kyle Wiggers, January 11, 2025, Researchers open source Sky-T1, a ‘reasoning’ AI model that can be trained for less than $450, https://techcrunch.com/2025/01/11/researchers-open-source-sky-t1-a-reasoning-ai-model-that-can-be-trained-for-less-than-450/
- NovaSky, Jan 2025, Sky-T1: Train your own O1 preview model within $450, https://novasky-ai.github.io/posts/sky-t1/
- Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4, https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
- Will Henshall June 3, 2024, The Billion-Dollar Price Tag of Building AI, Time, https://time.com/6984292/cost-artificial-intelligence-compute-epoch-report/
- Jaime Sevilla, Edu Roldán, May 28, 2024, Training Compute of Frontier AI Models Grows by 4-5x per Year, Epoch AI blog, https://epochai.org/blog/training-compute-of-frontier-ai-models-grows-by-4-5x-per-year