Chapter 13. Reasoning and RAG
-
Book Excerpt from "RAG Optimization: Accurate and Efficient LLM Applications"
-
by David Spuler and Michael Sharpe
Remember The Wall?
Do you remember when there were a lot of articles about AI companies “hitting the wall”? Several major companies were having trouble training their next-generation models, finding only incremental improvements in performance. What a wonderful outcome for RAG, because then all those billions of dollars in funding for training trillion-parameter models could be redirected to RAG.
It’s nice to have a dream.
Actually, no, the new new thing at the end of 2024 became inference. This was exemplified by the OpenAI “o1” (“Strawberry”) model release in September 2024, which was based on the “Chain-of-Thought” reasoning strategy. The basic idea was to run multiple LLM inference steps for every user question, rather than making a “one-shot” attempt to answer the user. Over multiple repeated steps, the model could reassess its own output, and then converge on a much smarter answer.
It worked great. A little slow, but great.
But then the pendulum swung back the other way with the release of the DeepSeek R1 reasoning model in January 2025. Instead of multiple steps of inference, it used “longer answers” as a way for the model to reason its way to an answer. It turned out that you could train an LLM to do reasoning better just by training it on a lot of such longer sequences, which it could then mimic. The result was smarter answers at a much lower token cost than multi-step reasoning. Who knew that “talking to yourself” would be an AI strategy for reasoning?
The point of all this is that there are now reasoning models that you can use with RAG. But do you really need advanced reasoning to answer a user’s question about your refund policy? It’s a somewhat open question, and the use of reasoning models with RAG is in its early stages.
Reasoning Prompts
The first point about reasoning models is that you can get a long way with prompting. In fact, much of the early research on better LLM reasoning showed that you could get a model to do better with a few tweaks to the prompts. The classic example is the original Chain-of-Thought algorithm, which was a single-step prompting method of simply telling the LLM:
Let’s think step-by-step.
Surprisingly, this actually made the LLM think step-by-step, which was closer to reasoning than basic answers.
There’s a whole swathe of prompting strategies that you can use with your RAG application, if you want to. The types of prompt engineering methods for improving the model’s ability to “reason” with more intelligence include:
- Chain-of-Thought (CoT)
- Emotional prompting
- Skeleton-of-Thought (SoT)
The relevance of all these reasoning prompts to your RAG application is that you can simply add words to your prepended system instruction prompt, if you want a more reasoning-ish RAG application.
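For example, here's a minimal sketch in Python of what that wiring looks like; the function name and prompt wording are our own illustration, not a standard API:

def build_rag_prompt(chunks: list[str], question: str,
                     reasoning_hint: str = "Let's think step-by-step.") -> str:
    # Prepend the system instructions (with the reasoning hint),
    # then the retrieved chunks, then the user's question.
    context = "\n\n".join(chunks)
    return ("You are a helpful assistant. Answer using only the context below. "
            f"{reasoning_hint}\n\nContext:\n{context}\n\nQuestion: {question}")

Swapping in a different reasoning_hint string is all it takes to trial any of the prompting strategies below.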
Chain-of-Thought
This is an advanced technique where the LLM can do better just with a little encouragement, like a toddler on a swing. The idea is to suggest via prompting that the LLM should generate a sequence of steps, which thereby helps it to reason in steps.
Step-by-Step. In its simplest form, this is a method where the prompt has a helpful reminder prepended, encouraging the LLM to proceed “step-by-step” in its answer. The idea is literally to include in the prompt an English sentence that says something like: Let’s try step-by-step.
More advanced versions of CoT use trickier prompting strategies. Complex prompting templates can be used to encourage stepwise refinement of multiple possible answers, so as to select the final choice in a more ordered way.
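One well-known example of this multi-answer style is “self-consistency”: sample several chains of thought (at a non-zero temperature, so the samples differ), then take a majority vote over their final answers. A rough sketch, assuming a hypothetical llm() wrapper around whatever inference API you use:

from collections import Counter

def llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical wrapper around your inference API

def self_consistency(question: str, samples: int = 5) -> str:
    # Sample several independent chain-of-thought answers.
    chains = [llm(f"{question}\nLet's think step-by-step. "
                  "End your reply with 'Answer: <final answer>'.")
              for _ in range(samples)]
    # Extract the final answer line from each chain.
    finals = [c.rsplit("Answer:", 1)[-1].strip() for c in chains]
    # Majority vote selects the final choice in an ordered way.
    return Counter(finals).most_common(1)[0][0]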
Emotional Prompting
LLMs supposedly don’t have emotions, and yet appealing to their emotional vulnerability seems to improve the accuracy of answers. Anecdotally, some users reported that they got better answers if they begged or yelled at ChatGPT. In November 2023, research was published confirming that LLMs did respond to “emotional stimuli” in the prompts.
The technique is to add an emotional sentence to the prompt. For example, after the main question, append: This is very important to my career. Another one was: You’d better be sure.
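In code, this is about as simple as prompt engineering gets; a tiny sketch with a made-up function name:

def add_emotional_stimulus(prompt: str,
        stimulus: str = "This is very important to my career.") -> str:
    # Append the emotive sentence after the main question.
    return f"{prompt} {stimulus}"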
Nobody thinks they’ve got emotions or have become aware of their inner child. But somehow, the addition of emotive wording to a prompt triggered better answers. Are there emotional signals hiding in all that training data? Actually, the paper discusses why it works, and suggests a simpler explanation: the extra sentences add more definitive and positive word signals, such as “important” and “sure.”
But they aren’t very sure, although it’s certainly important to their career. I cried when I read that paper.
Skeleton-of-Thought Prompting
The skeleton-of-thought (SoT) method is from some recent research, and it has been getting significant attention in the literature. SoT is not just a reasoning improvement method, but has two goals:
- Smarter, and
- Speedier
The SoT idea mimics the way humans would write a long paragraph. Most writers don’t just have the words stream out of their fingertips in one long writing session. Why should we expect the LLM to do that?
Instead, the SoT method is a multi-phase writing method that works in a more human-like fashion:
1. Generate a rough outline (e.g., paragraph main points or a list).
2. Process each sub-point in a separate LLM query.
3. Run a final LLM query to combine all the results nicely.
This few-shot method aims to generate a much better answer than a one-shot response. Each sub-point should get more detailed consideration, and then the final output should be well-written. It’s almost like a RAG architecture with a query for each sub-point, but the answers come out of the LLM itself.
Or, you know, why couldn’t the sub-answers come out of a RAG system? Oh, wow! I just invented the multi-RAG multi-shot multi-model, which I’ll now officially name the “manga” model.
Anyway, this combined multi-response idea in SoT isn’t just more effective. It’s also faster, because each sub-point can be processed in parallel. Each paragraph’s LLM query can be running at the same time, although the first outlining query, and the final summarizing query, must still run sequentially. But still, that’s three LLM query phases, rather than many more if there are ten sub-points in the answer.
Finally, note that although this is faster in terms of latency, it’s inefficient in terms of computation cost. The parallelization reduces the time it takes to come back to the user, but all those parallelized sub-point LLM requests are chewing GPU juice. It’s also not going to work well with “on-device” models, such as AI phones and PCs, where parallel capabilities are limited.
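Here’s a rough sketch of the three SoT phases, assuming the same kind of hypothetical llm() wrapper as earlier; a thread pool suffices for the parallel middle phase because the sub-point requests are I/O-bound API calls that spend their time waiting on the network:

from concurrent.futures import ThreadPoolExecutor

def llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical wrapper around your inference API

def skeleton_of_thought(question: str) -> str:
    # Phase 1 (sequential): generate the rough outline.
    outline = llm(f"Write a short outline answering the question below, "
                  f"one main point per line.\nQuestion: {question}")
    points = [p.strip() for p in outline.splitlines() if p.strip()]

    # Phase 2 (parallel): expand each sub-point in a separate LLM query.
    with ThreadPoolExecutor(max_workers=8) as pool:
        drafts = list(pool.map(
            lambda point: llm(f"Question: {question}\n"
                              f"Expand this point into a paragraph: {point}"),
            points))

    # Phase 3 (sequential): combine all the results nicely.
    return llm(f"Question: {question}\nCombine these paragraphs into one "
               "well-written answer:\n\n" + "\n\n".join(drafts))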
Two-Step Reasoning
The advanced LLMs don’t do all of their answers in one LLM inference sequence. In fact, they do many, and the state-of-the-art is “multi-step” reasoning. One of the basic multi-step methods is the use of “tools,” which works like this:
- LLM devises a “plan” to execute the user’s query, including tool executions.
- Execute the tools to get their outputs.
- LLM executes the final query to summarize the overall response, including any data from the tools.
This method has two LLM inference computations, whereas the “tools” are probably non-LLM code applications. This is assuming that tools are doing things like:
(a) computations — e.g., a clock or calculator, and/or
(b) data source integrations — e.g., searching real estate listings in another database.
Big LLMs have lots of calculation-type tools, and they also can integrate with a variety of disparate data sources. The issues of tool integrations and data sources are covered in a separate section.
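Here’s a minimal sketch of the two-step tool pattern, again with a hypothetical llm() wrapper and two toy tools; a real system would use a structured function-calling format (usually JSON) rather than this simplified line-based plan:

import datetime

def llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical wrapper around your inference API

# The "tools" are ordinary non-LLM code: computations and data lookups.
TOOLS = {
    "clock": lambda arg: datetime.datetime.now().isoformat(),
    "listing_search": lambda arg: f"(toy stand-in for a database search on '{arg}')",
}

def answer_with_tools(question: str) -> str:
    # LLM inference #1: devise a plan, including any tool executions.
    plan = llm(f"Available tools: {', '.join(TOOLS)}. List the tool calls "
               f"needed to answer, one per line as 'tool: argument'.\n"
               f"Question: {question}")

    # Execute the tools to get their outputs (plain code, no LLM inference).
    outputs = []
    for line in plan.splitlines():
        name, _, arg = line.partition(":")
        if name.strip() in TOOLS:
            outputs.append(f"{name.strip()}: {TOOLS[name.strip()](arg.strip())}")

    # LLM inference #2: summarize the overall response, including tool data.
    return llm(f"Question: {question}\nTool outputs:\n" + "\n".join(outputs) +
               "\nWrite the final answer using this data.")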
Multi-Step Reasoning
A more generalized idea for advanced reasoning capabilities is that the LLM makes a plan, which can include any number of other LLM sub-processing tasks. The idea is also called “few-shot” processing, because it allows multiple LLM calls, rather than “one-shot” methods, where there’s only a single LLM request. This is an area of state-of-the-art research in trying to reach AGI, by improving the LLM’s ability to plan and reason.
You usually don’t even know it’s happening if you use a third-party inference API to get answers to your queries. Which is good news if you don’t happen to have a PhD in machine learning.
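As one concrete flavor of the multi-step idea, here’s a sketch of a draft-critique-refine loop (in the spirit of “Self-Refine” style methods), once more with a hypothetical llm() wrapper:

def llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical wrapper around your inference API

def refine_loop(question: str, max_steps: int = 4) -> str:
    # Start with a one-shot draft, then repeat critique-and-revise steps.
    draft = llm(f"Answer this question: {question}")
    for _ in range(max_steps):
        critique = llm(f"Question: {question}\nDraft answer: {draft}\n"
                       "Reply with just DONE if the draft is correct and "
                       "complete; otherwise list its problems.")
        if critique.strip().upper().startswith("DONE"):
            break  # the model is satisfied with its own output
        draft = llm(f"Question: {question}\nDraft: {draft}\n"
                    f"Problems: {critique}\nWrite an improved answer.")
    return draft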
There are many more prompting techniques, both zero-shot and few-shot, that you can research. Here is just a smattering:
- Rephrase and Respond (RaR)
- Re-reading (RE2) — appends “Read the question again:” and the question a second time (see the sketch after this list).
- Self-Ask — encourages the LLM to ask “follow-up questions.”
- Memory-of-Thought
- Active Prompting
- Ensemble prompting — various multi-answer combination ideas.
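Of these, the RE2 trick is small enough to show in full; a one-function sketch:

def re2_prompt(question: str) -> str:
    # Re-reading (RE2): state the question, then cue the model to re-read it.
    return f"{question}\nRead the question again: {question}"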
Unfortunately, I’ve run out of electrons, so I’m not going to cover all of these. There are various huge survey papers on the internet if you happen to like strange nonsense that actually works.
Recent Research on RAG Reasoning
The level of recent research attention to “multi-step reasoning” is off the charts. We’ve now seen papers on “chain-of-X” for almost every word in the English language. And this has now started to make its way into RAG research, although it’s not a flood of papers by any means.
Every implementation of a RAG application can use any of those advanced reasoning algorithms, which would probably make it 1% smarter and 100% slower. Arguably, the types of advanced reasoning problems that OpenAI Strawberry or DeepSeek R1 are better at solving are not really in the bread basket for RAG, but it’s worth a try!
Nevertheless, there’s room for some more research on RAG and reasoning. For example, does RAG with Chain-of-Thought work better if you prepend the RAG chunks at every step of the way, or is it better just to have them at the first step? We’re yet to see that in a research paper!
Research papers on reasoning models and RAG include:
- B Zhan, A Li, X Yang, D He, Y Duan, S Yan, 2024, RARoK: Retrieval-Augmented Reasoning on Knowledge for Medical Question Answering, 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 2837-2843, DOI: 10.1109/BIBM62325.2024.10822341, https://www.computer.org/csdl/proceedings-article/bibm/2024/10822341/23onp6dXOSI (RAG combined with Chain-of-Thought for medical reasoning.)
- Xinyan Guan, Jiali Zeng, Fandong Meng, Chunlei Xin, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun, Jie Zhou. 3 Feb 2025, DeepRAG: Thinking to Retrieval Step by Step for Large Language Models, https://arxiv.org/abs/2502.01142
- P Verma, SP Midigeshi, G Sinha, A Solin, N Natarajan, Mar 2025, Plan*RAG: Efficient Test-Time Planning for Retrieval Augmented Generation, ICLR 2025 review, https://openreview.net/pdf?id=gi9aqlYdBk (Improve RAG reasoning efficiency via planning for parallel reasoning.)
- Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, Wanxiang Che, 13 Mar 2025 (v2), Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models, https://arxiv.org/abs/2503.09567 (Massive and broad survey of all types of reasoning.)
- Mingyue Cheng, Yucong Luo, Jie Ouyang, Qi Liu, Huijie Liu, Li Li, Shuo Yu, Bohou Zhang, Jiawei Cao, Jie Ma, Daoyu Wang, Enhong Chen, 17 Mar 2025 (v2), A Survey on Knowledge-Oriented Retrieval-Augmented Generation, https://arxiv.org/abs/2503.10677
- Thang Nguyen, Peter Chin, Yu-Wing Tai, 26 May 2025, MA-RAG: Multi-Agent Retrieval-Augmented Generation via Collaborative Chain-of-Thought Reasoning, https://arxiv.org/abs/2505.20096