9. Reasoning Models
Book Excerpt from "The Sweetest Lesson: Your Brain vs AI"
by David Spuler, Ph.D.
“As the pace of AI progress accelerates,
developing superintelligence is coming into sight.”
— Mark Zuckerberg, June 2025.
Reasoning Prompts
The first point about reasoning models is that you can get a long way with prompting. In fact, most of the early research on better LLM reasoning showed that the models did better with just a few tweaks to the prompts. The classic example is the original Chain-of-Thought algorithm, which was a single-step prompting method of simply telling the LLM:
Let’s think step-by-step.
Surprisingly, this actually made the LLM think step-by-step, producing answers that were closer to reasoning than its usual direct responses.
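In code, that is the entire trick. Here is a minimal sketch of a zero-shot CoT prompt builder (a plain string function; wire it up to whatever model API you use):

    def build_cot_prompt(question: str) -> str:
        # Zero-shot Chain-of-Thought: the whole "upgrade" is one extra sentence.
        return question + "\n\nLet's think step-by-step."

    print(build_cot_prompt("How many prime numbers are less than 20?"))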
There’s a whole swathe of prompting strategies that you can use with your RAG application, if you want to. The types of prompt engineering methods for improving the model’s ability to “reason” with more intelligence include:
- Chain-of-Thought (CoT)
- Emotional prompting
- Skeleton-of-Thought (SoT)
The relevance of all these reasoning prompts to your RAG application is that you can simply add words to your prepended system instructions, if you want a more reasoning-ish RAG application.
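As a minimal sketch of what that looks like, assuming a typical system-instructions-plus-context prompt layout (the retrieval step is omitted here):

    def build_rag_prompt(question: str, chunks: list) -> str:
        # Prepended system instructions, now with a reasoning nudge added.
        system = ("You are a helpful assistant. Answer using the provided context. "
                  "Think step-by-step before giving your final answer.")
        context = "\n\n".join(chunks)
        return f"{system}\n\nContext:\n{context}\n\nQuestion: {question}"

    print(build_rag_prompt("What is our refund policy?",
                           ["Refunds are available within 30 days of purchase."]))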
Chain-of-Thought
This is an advanced technique where the LLM can do better just with a little encouragement, like a toddler on a swing. The idea is to suggest via prompting that the LLM should generate a sequence of steps, which thereby helps it to reason in steps.
Step-by-Step. In its simplest form, this is a method where the prompt has a helpful reminder prepended, encouraging the LLM to proceed “step-by-step” in its answer. The idea is literally to include in the prompt an English sentence that says something like: Let’s try step-by-step.
More advanced versions of CoT use trickier prompting strategies. Complex prompting templates can be used to encourage stepwise refinement of multiple possible answers, so as to select the final choice in a more ordered way.
AI researchers are nothing if not copiers (e.g., they copied the human brain). Hence, everyone’s jumped on the bandwagon for technique names, and there’s also:
- Chain-of-Draft
- Chain-of-Tree
- Chain-of-Recall
- Chain-of-Verify
- Chain-of-Edit
- Chain-of-Symbols
- Chain-of-Tables
- Chain-of-Note
I have no idea what any of these are for, except that Chain-of-Draft is good at efficient reasoning, but feel free to look the rest up. Even better, come up with your own Chain-of-Bananas technique, and retire on patent royalties.
Emotional Prompting
LLMs supposedly don’t have emotions, and yet appealing to their emotional vulnerability seems to improve the accuracy of answers. Anecdotally, some users reported that they got better answers if they begged or yelled at ChatGPT. In November 2023, research was published confirming that LLMs did respond to “emotional stimuli” in the prompts.
The technique is to add an emotional sentence to the prompt. For example, after the main question, append: This is very important to my career. Another one was: You’d better be sure.
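In code, the whole technique is one string concatenation. A minimal sketch, using the two stimuli sentences quoted above:

    EMOTIONAL_STIMULI = [
        "This is very important to my career.",
        "You'd better be sure.",
    ]

    def emotional_prompt(question: str, stimulus: int = 0) -> str:
        # Append an emotional sentence after the main question.
        return f"{question} {EMOTIONAL_STIMULI[stimulus]}"

    print(emotional_prompt("What is the boiling point of ethanol?"))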
Nobody thinks they’ve got emotions or have become aware of their inner child. But somehow, the addition of emotive wording to a prompt triggers better answers. Are there some kind of emotional signals in all that training data? Actually, the paper discusses why it works, and suggests a simpler explanation: the extra sentences add more definitive and positive word signals such as “important” and “sure.”
But they aren’t very sure, although it’s certainly important to their career. I cried when I read that paper.
Skeleton-of-Thought Prompting
The skeleton-of-thought (SoT) method is from some recent research, and it has been getting significant attention in the literature. SoT is not just a reasoning improvement method, but has two goals:
- Smarter, and
- Speedier
The SoT idea mimics the way humans would write a long paragraph. Most writers don’t just have the words stream out of their fingertips in one long writing session. Why should we expect the LLM to do that?
Instead, the SoT method is a multi-phase writing method that works in a more human-like fashion:
1. Generate a rough outline (i.e., with paragraph main points or a list).
2. Process each sub-point in a separate LLM query.
3. Run a final LLM query to combine all the results nicely.
This few-shot method aims to generate a much better answer than a one-shot response. Each sub-point should get a more detailed consideration, and then the final output should be well-written. It’s almost like a RAG architecture with a query for each sub-point, but the answers come out of the LLM itself.
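Here is a minimal sketch of the three SoT phases, with phase 2 parallelized across threads. The llm() function is a hypothetical stand-in for a real model API call:

    from concurrent.futures import ThreadPoolExecutor

    def llm(prompt: str) -> str:
        # Hypothetical placeholder: substitute your real LLM API call here.
        return f"[LLM output for: {prompt[:40]}...]"

    def skeleton_of_thought(question: str) -> str:
        # Phase 1: generate a rough outline (sequential).
        outline = llm(f"Write a short bullet-point outline answering: {question}")
        points = [p for p in outline.split("\n") if p.strip()]
        # Phase 2: expand each sub-point in a separate LLM query (parallel).
        with ThreadPoolExecutor() as pool:
            drafts = list(pool.map(
                lambda p: llm(f"Expand this outline point into a paragraph: {p}"),
                points))
        # Phase 3: one final query combines all the results nicely (sequential).
        return llm("Combine these paragraphs into a polished answer:\n"
                   + "\n\n".join(drafts))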
Or, you know, why couldn’t the sub-answers come out of a RAG system? Oh, wow! I just invented the multi-RAG multi-shot multi-model, which I’ll now officially name the “manga” model.
Anyway, this combined multi-response idea in SoT isn’t just more effective. It’s also faster, because each sub-point can be processed in parallel. Each paragraph’s LLM query can be running at the same time, although the first outlining query, and the final summarizing query, must still run sequentially. But still, that’s three LLM query phases, rather than many more if there are ten sub-points in the answer.
Finally, note that although this is faster in terms of latency, it’s inefficient in terms of computation cost. The parallelization reduces the time it takes to come back to the user, but all those parallelized sub-point LLM requests are chewing GPU juice. It’s also not going to work well with “on-device” models, such as AI phones and PCs, where parallel capabilities are limited.
Two-Step Reasoning
Advanced LLMs don’t do all of their answers in one LLM inference sequence. In fact, they do many, and the state-of-the-art is “multi-step” reasoning. One of the basic multi-step methods is the use of “tools”, which works like this:
- LLM devises a “plan” to execute the user’s query, including tool executions.
- Execute the tools to get their outputs.
- LLM executes the final query to summarize the overall response, including any data from the tools.
This method has two LLM inference computations, whereas the “tools” are probably non-LLM code applications. This is assuming that tools are doing things like:
(a) computations — e.g., a clock or calculator, and/or
(b) data source integrations — e.g., searching real estate listings in another database.
Big LLMs have lots of calculation-type tools, and they can also integrate with a variety of disparate data sources. The issues of tool integrations and data sources are covered in a separate section.
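Here is a minimal sketch of the two-step pattern, with a toy clock tool. Real systems emit structured “tool call” tokens rather than plain text, but the overall shape is the same; llm() is a hypothetical placeholder:

    from datetime import datetime

    # Toy non-LLM tool; big deployments have many of these.
    TOOLS = {
        "clock": lambda: datetime.now().isoformat(timespec="seconds"),
    }

    def llm(prompt: str) -> str:
        # Hypothetical placeholder for a real model call.
        return "clock" if "Which tool" in prompt else f"[answer based on: {prompt[:60]}]"

    def answer_with_tools(question: str) -> str:
        # Step 1: the LLM devises a plan, here reduced to naming one tool.
        tool_name = llm(f"Which tool is needed for: {question}").strip()
        # Execute the chosen tool (ordinary non-LLM code).
        tool_output = TOOLS[tool_name]() if tool_name in TOOLS else ""
        # Step 2: a final LLM query summarizes, with the tool data included.
        return llm(f"Question: {question}\nTool output: {tool_output}\nAnswer:")

    print(answer_with_tools("What time is it?"))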
Multi-Step Reasoning
A more generalized idea for advanced reasoning capabilities is that the LLM makes a plan, which can include any number of other LLM sub-processing tasks. The idea is also called “few-shot” processing, because it allows multiple LLM calls, in contrast to “one-shot” methods, where there’s only a single LLM request. This is an area of state-of-the-art research in trying to reach AGI, by improving the LLM’s ability to plan and reason.
You usually don’t even know it’s happening if you use a third-party inference API to get answers to your queries. Which is good news if you don’t happen to have a PhD in machine learning.
There’s not always a clear distinction between one-step and multi-step reasoning. “Hybrid reasoning models” are LLMs that use a combination of single-step reasoning and inference-based multi-step reasoning. For example, Large Reasoning Models (LRMs) may be trained to use only a single step for some queries. Another example is that smaller, less powerful reasoning models may be improved by using multi-step inference-based reasoning, known as “test time compute.”
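The general shape is a loop: the model keeps proposing next steps until it decides it is done. This is a sketch only; the DONE convention and the llm() helper are made-up illustrations, not any vendor’s actual protocol:

    def llm(prompt: str) -> str:
        # Hypothetical placeholder for a real model call.
        return "DONE: final answer"

    def multi_step(question: str, max_steps: int = 8) -> str:
        scratchpad = f"Task: {question}\n"
        for _ in range(max_steps):
            step = llm(scratchpad + "\nWhat is the next reasoning step, "
                       "or 'DONE: <answer>' if finished?")
            if step.startswith("DONE:"):
                return step[len("DONE:"):].strip()
            scratchpad += f"\nStep: {step}"  # accumulate the interim reasoning
        return llm(scratchpad + "\nGive your best final answer now.")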
There are many more prompting techniques, both zero-shot and few-shot, that you can research. Here is just a smattering:
- Rephrase and Respond (RaR)
- Re-reading (RE2) — appends “Read the question again:” and the question a second time.
- Self-Ask — encourages the LLM to ask “follow-up questions.”
- Memory-of-Thought
- Active Prompting
- Ensemble prompting — various multi-answer combination ideas.
Unfortunately, I’ve run out of electrons, so I’m not going to cover all of these. There are various huge survey papers on the internet if you happen to like strange nonsense that actually works.
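As one concrete example of just how mechanical these tricks are, here is the Re-reading (RE2) wrapper, per its description above:

    def re2_prompt(question: str) -> str:
        # RE2: state the question, then tell the model to read it again.
        return f"{question}\nRead the question again: {question}"

    print(re2_prompt("A farmer has 17 sheep; all but 9 run away. How many are left?"))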
Deep Research Models
Deep research models are advanced LLMs that can complete complex research projects. They’re kind of like multi-multi-multi-hop reasoning models.
Typically, deep research models will search for information on a topic, and then perform a reasoning function on the information. Advanced models may use multiple steps of information search and reasoning to refine their answer. There are several commercial models available that can perform deep research tasks, including:
- Google Gemini
- OpenAI Deep Research
There are several main architectural components that are key to a deep research model:
- Web search plugin (“web agent”) or other information source
- Large Reasoning Model
- Multi-step reasoning algorithm (e.g., Chain-of-Thought)
There is also an overarching algorithm that controls the whole plan. This may be LLM-based planning or could be other non-LLM heuristics (or some combination of both).
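Here is a sketch of that overarching search-then-reason loop, assuming hypothetical web_search() and llm() helpers:

    def llm(prompt: str) -> str:
        # Hypothetical placeholder for a Large Reasoning Model call.
        return "SEARCH: follow-up query" if "next search" in prompt else "[report text]"

    def web_search(query: str) -> str:
        # Hypothetical stand-in for a web search plugin ("web agent").
        return f"[search results for: {query}]"

    def deep_research(topic: str, max_rounds: int = 3) -> str:
        notes, query = "", topic
        for _ in range(max_rounds):
            notes += "\n" + web_search(query)              # gather information
            decision = llm(f"Notes so far:{notes}\n"       # reason over it
                           "Reply 'SEARCH: <query>' for the next search, "
                           "or 'WRITE' if you have enough.")
            if not decision.startswith("SEARCH:"):
                break
            query = decision[len("SEARCH:"):].strip()
        return llm(f"Write a research report on {topic} using:{notes}")

    print(deep_research("LLM reasoning costs"))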
Deep research models are not cheap to run, because they have to process lots of information and perform multiple reasoning steps. There is also the cost of information access, such as an internet search, although this would typically be less than the LLM inference costs. There are various ways to reduce LLM reasoning costs by cutting the number of tokens produced and processed at each reasoning step.
Meta-Reasoning
Meta-reasoning is thinking about how to solve a problem. People say that AI is bad at “planning,” but I say that’s being quite unfair. The first order of business for an LLM that’s been presented with a user’s question is planning. Rather than thinking of the answer, it should first think about the plan for how to answer. The decisions it needs to consider (and be trained about) include:
- Whether to invoke a “tool” such as a clock or calculator.
- Should an internet search be done?
- Is a query of a “plugin” required?
- Does it need a chunk of a company-specific document (in a RAG system)?
I’m a little vague on the difference between a “tool” and a “plugin.” And isn’t an internet search kind of the same? And looking up a text chunk in a RAG vector database? They’re all kind of “tools” that you “plug in” to your LLM. Maybe it’s just me.
Anyway, all those integrations above require an extra “step” or “hop” out to some other integrated system. Not all queries require these steps. There are also other issues in relation to producing the answer directly from the LLM:
- What output format is required (e.g., text or HTML).
- Should the LLM refuse to answer this question?
- Should it ask the user a followup question before answering?
Want to know something funny? These are all the same! They’re just training. If you want an LLM to refuse a query, you train it with bad queries and nice words that are refusals. If you want it to ask followup questions to clarify ambiguous questions, you train the LLM with ambiguous queries along with brief answers that contain the followup questions. If you want the LLM to invoke the clock when you ask for the time, you have to train it with queries about the time and answers that contain tricky “tool tokens” that run the clock. We’re giving the LLM too much credit for being humanlike.
It’s all just words to the LLM. Planning, not at all.
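Still, for intuition, here is what that routing decision looks like when written out as ordinary code. In a real LLM the “plan” is implicit in the trained weights, and everything here is a made-up toy:

    def plan_route(query: str) -> str:
        # Toy meta-reasoning: decide how to answer before answering.
        q = query.lower()
        if "time" in q or "date" in q:
            return "tool:clock"      # invoke a tool
        if "latest" in q or "news" in q:
            return "web_search"      # extra internet search step
        if "our company" in q or "policy" in q:
            return "rag_lookup"      # fetch a document chunk (RAG)
        return "parametric"          # answer directly from the weights

    print(plan_route("What is the latest news on AI?"))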
The final option in “planning” a response to a user’s query is: none of the above. There are a lot of questions that an LLM can just answer immediately, without hesitation, and without consulting some other system. When it does this without any extra step, we say it’s using its “parametric knowledge,” because it’s only using whatever has been trained into the many billions of “parameters” that are the weights in its model.
Don’t worry if you don’t understand how that all works, because I don’t, and nobody else does. There are just billions of magic numbers that can help you fold your pillowcases into origami patterns.
Not Following Instructions
LLMs are great at following instructions, and it’s largely considered a “solved problem” in AI. If you ask the AI to “pretend to be Gollum” or “answer in Klingon” then you can be fairly sure it will do so.
There’s also a whole lot of research on not following instructions, because there are lots of answers that AIs shouldn’t give. This whole area of theory is called “refusal training” and the opposite research is where users try to bypass the refusals, which is called “jailbreaking” the AI.
But recently, there was a weird example of LLMs exhibiting a totally different failure to follow instructions. The research by Shojaee et al. (2025) at Apple found that LLMs could not use the information in questions to solve complex puzzle problems. This is a different type of “instructions” from the above:
- Procedural steps
- Abstract puzzle problems
Why did the AI fail? Well, it’s all words to the LLM, and clearly the words in the question that described how to solve the puzzle couldn’t be understood properly by the LLM. It’s like it couldn’t map the words given in the prompt to the steps it was using in its reasoning.
Part of me wants to say that maybe we need a whole different architecture that understands “concepts” better, one that can map words to concepts representing abstract steps, and back again to words at the end. Surely there’s a chasm to bridge here between:
- Words
- Abstract concepts
The idea of a failure to map words to reasoning steps seems somewhat reasonable, until you realize that reasoning models use words to describe their steps. In fact, they “think out loud” in words, whether they’re doing a one-step or a multi-step reasoning method.
So, I wonder if I’m about to be taught another “bitter lesson” about having models follow solution steps given in the question. Maybe the problem was just not enough training data where the solution was given in the problem. Maybe the AI engine was just not paying enough attention to the original question. I mean, who hasn’t forgotten to read the whole question in a math test?
And there’s also the fact that the basic level of LLM instruction following is simply based on training. LLMs don’t follow even the most basic human instructions by themselves, but need a huge training set of examples of human instructions and the correct LLM output. Then the poor AIs get smacked with a stick called “reinforcement learning” and “penalties” during training if they get the answer wrong. That’s how they learn to speak Klingon.
Hence, I have a horrible feeling that this apparently huge hole in AI abstract reasoning, where the solution given in the question is unusable by the AI, might be fixable with a few thousand examples of math quiz questions that also contain big hints about the answer.
RAG Reasoning
The level of recent research attention to “multi-step reasoning” is off the charts. We’ve now seen papers on “chain-of-X” for almost every word in the English language. And this has now started to make its way into RAG research, although it’s not a flood of papers by any means.
Every implementation of a RAG application can use any of those advanced reasoning algorithms, which would probably make it 1% smarter and 100% slower. Arguably, the types of advanced reasoning problems that OpenAI Strawberry or DeepSeek R1 are better at solving are not really in the bread basket for RAG, but it’s worth a try!
Nevertheless, there’s room for some more research on RAG and reasoning. For example, does RAG with Chain-of-Thought work better if you prepend the RAG chunks at every step of the way, or is it better just to have them at the first step? We’re yet to see that in a research paper!
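If you wanted to try that experiment yourself, the two regimes differ only in where the chunks get prepended. A sketch, with llm() as a hypothetical model call:

    def llm(prompt: str) -> str:
        return "[step output]"  # hypothetical placeholder

    def rag_cot(question: str, chunks: list, steps: int,
                chunks_every_step: bool) -> str:
        context = "\n".join(chunks)
        history = ""
        for i in range(steps):
            # Regime A: prepend chunks at every step; Regime B: first step only.
            prefix = context if (chunks_every_step or i == 0) else ""
            history += "\n" + llm(f"{prefix}\n{history}\n"
                                  f"Step {i + 1} of reasoning about: {question}")
        return llm(f"{history}\nFinal answer to: {question}")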
Implicit Associative Priors
There’s a weird thing about LLMs where they shouldn’t be good at something, but they are. If your LLM starts writing a story about a guy named Morgan, then the LLM will use the name “Morgan” again later in the story.
Everyone says that AIs are bad at “generalization.” However, if I change the first part of the story to be about a woman named Mary, rather than Morgan, then the LLM will dutifully write the rest of the story using the name “Mary” instead. The story will be changed in a very general way to use a different name and gender of the main character.
Sounds like generalization to me.
I can’t really see how this works. Also, it’s going to work regardless of exactly where the name appears in the text. I can add a few extra adjectives, changing the name’s position, and it will still get re-used.
Note that I’m not talking about “context windows,” where it used to be that LLMs would forget that name if it was 4,000 words earlier in the story. Rather, I’m asking the question: how does it know to output the same name? The first time, sure, it just chooses a random name. What about the second time? What is it in its training that tells it to re-use whatever name tokens appeared earlier in the story, rather than generating a random name each time?
The answer is called “associative priors,” and the LLM gradually learns to re-use tokens from earlier in its context, rather than any fixed name. This behavior arises over the course of extensive training, whereby the model is paying “attention” to the tokens in the previous text. In fact, to learn to pay attention to tokens at an earlier position, there is a technology called “positional encoding.” Weirdly, this technology doesn’t actually try to tell the LLM which position is best; positional encoding simply makes the numbers at every position a little bit different, and then the LLM learns which positions are important during training.
Don’t worry; nobody else understands how this works, either.
Somehow, the combination of weights and “positional encoding” allows the LLM to re-use words from earlier in the text. Eventually, after a lot of training data has passed under the bridge, this generalizes beyond knowing which words fit after other ones, up to the higher-level behavior of re-using tokens from an earlier position, irrespective of the value of those tokens.
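For the curious, here is the classic sinusoidal positional encoding from the original Transformer paper, which is exactly the “make the numbers at every position a little bit different” trick:

    import math

    def positional_encoding(pos: int, d_model: int) -> list:
        # Each position gets a unique pattern of sine and cosine values;
        # the model learns during training which positions matter.
        pe = []
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe.append(math.sin(angle))  # even dimensions
            pe.append(math.cos(angle))  # odd dimensions
        return pe[:d_model]

    print(positional_encoding(pos=3, d_model=8))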
Really Radical Reasoning References
Implicit Associative Priors. Research papers on associative priors:
- Giorgos Borboudakis, Ioannis Tsamardinos, 9 Aug 2014, Scoring and Searching over Bayesian Networks with Causal and Associative Priors, https://arxiv.org/abs/1408.2057
- Yuwei Sun, Hideya Ochiai, Zhirong Wu, Stephen Lin, Ryota Kanai, 11 Mar 2025 (v4), Associative Transformer, https://arxiv.org/abs/2309.12862
- Frederick Liu and Besim Avci, 2019, Incorporating Priors with Feature Attribution on Text Classification, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6274–6283, Florence, Italy. Association for Computational Linguistics, https://aclanthology.org/P19-1631/, PDF: https://aclanthology.org/P19-1631.pdf
- Menglin Wang, Zhun Zhong, Xiaojin Gong, 13 Feb 2025, Prior-Constrained Association Learning for Fine-Grained Generalized Category Discovery, https://arxiv.org/abs/2502.09501
- Shuai Li, Kui Jia, Xiaogang Wang, 12 Jan 2017 (v2), Automatic Discoveries of Physical and Semantic Concepts via Association Priors of Neuron Groups, https://arxiv.org/abs/1612.09438
- Feng He, Chao Zhang, Zhixue Zhao, 4 Dec 2024, Implicit Priors Editing in Stable Diffusion via Targeted Token Adjustment, https://arxiv.org/abs/2412.03400
- J. Urain, A. T. Le, A. Lambert, G. Chalvatzaki, B. Boots and J. Peters, 2022, Learning Implicit Priors for Motion Optimization, 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 2022, pp. 7672-7679, doi: 10.1109/IROS47612.2022.9981264, https://ieeexplore.ieee.org/abstract/document/9981264
One-Step Reasoning. Research papers on one-step reasoning include:
- Zhipeng Chen, Yingqian Min, Beichen Zhang, Jie Chen, Jinhao Jiang, Daixuan Cheng, Wayne Xin Zhao, Zheng Liu, Xu Miao, Yang Lu, Lei Fang, Zhongyuan Wang, Ji-Rong Wen, 6 Mar 2025, An Empirical Study on Eliciting and Improving R1-like Reasoning Models, https://arxiv.org/abs/2503.04548 https://github.com/RUCAIBox/Slow_Thinking_with_LLMs
- Yijiong Yu, 4 Dec 2024 (v3), Patience Is The Key to Large Language Model Reasoning, https://arxiv.org/abs/2411.13082 (Training a reasoning model to give longer one-step answers using training data with long answers as positive examples and short answers as negative answers.)
- Yijiong Yu, 16 Jan 2025 (v4), Do LLMs Really Think Step-by-step In Implicit Reasoning? https://arxiv.org/abs/2411.15862 https://github.com/yuyijiong/if_step_by_step_implicit_CoT
- Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, Wanxiang Che, 13 Mar 2025 (v2), Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models, https://arxiv.org/abs/2503.09567 (Massive and broad survey of all types of reasoning.)
- Jiaran Ye, Zijun Yao, Zhidian Huang, Liangming Pan, Jinxin Liu, Yushi Bai, Amy Xin, Liu Weichuan, Xiaoyin Che, Lei Hou, Juanzi Li, 29 May 2025, How does Transformer Learn Implicit Reasoning? https://arxiv.org/abs/2505.23653
- Michael Nuñez, July 15, 2025, OpenAI, Google DeepMind and Anthropic sound alarm: ‘We may be losing the ability to understand AI’, https://venturebeat.com/ai/openai-google-deepmind-and-anthropic-sound-alarm-we-may-be-losing-the-ability-to-understand-ai/ (Monitoring the text-based interim “thinking-out-loud” reasoning of models in CoT.)
- Tomek Korbak, Mikita Balesni, (and many more authors) July 2025, Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety, https://tomekkorbak.com/cot-monitorability-is-a-fragile-opportunity/cot_monitoring.pdf
Multi-Hop Reasoning. Research papers on multi-step or multi-hop reasoning include:
- Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, Karthik Narasimhan, 3 Dec 2023 (v2), Tree of Thoughts: Deliberate Problem Solving with Large Language Models, https://arxiv.org/abs/2305.10601 Code: https://github.com/princeton-nlp/tree-of-thought-llm
- Justin Chih-Yao Chen, Archiki Prasad, Swarnadeep Saha, Elias Stengel-Eskin, Mohit Bansal, 18 Sep 2024, MAgICoRe: Multi-Agent, Iterative, Coarse-to-Fine Refinement for Reasoning, https://arxiv.org/abs/2409.12147 https://github.com/dinobby/MAgICoRe
- Xiaohan Xu, Chongyang Tao, Tao Shen, Can Xu, Hongbo Xu, Guodong Long, Jian-guang Lou, 29 Feb 2024 (v2), Re-Reading Improves Reasoning in Large Language Models, https://arxiv.org/abs/2309.06275
- TED, Oct 2024, Multi-Step Reasoning Agents, https://tedai-sanfrancisco.ted.com/glossary/multi-step-reasoning-agents/
- Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, Tushar Khot, 30 Jan 2023 (v2), Complexity-Based Prompting for Multi-Step Reasoning, https://arxiv.org/abs/2210.00720
- Junting Lu, Oct 2024 (accessed), Awesome-LLM-Reasoning-Techniques, https://github.com/Junting-Lu/Awesome-LLM-Reasoning-Techniques
- Cameron R. Wolfe, Dec 23, 2023, Tree of Thoughts Prompting: Solving multi-step problems with LLMs via deliberate planning and exploration, https://towardsdatascience.com/tree-of-thoughts-prompting-65a3e51f9ac4
- Data Camp, Jul 10, 2024, Chain-of-Thought Prompting: Step-by-Step Reasoning with LLMs, https://www.datacamp.com/tutorial/chain-of-thought-prompting
- Pankaj, Dec 21, 2023, Chain of Thought Prompting: Guiding LLMs Step-by-Step, https://medium.com/@pankaj_pandey/chain-of-thought-prompting-guiding-llms-step-by-step-e6eac32d02d8
- Cobus Greyling, Aug 2, 2023, 12 Prompt Engineering Techniques, https://cobusgreyling.medium.com/12-prompt-engineering-techniques-644481c857aa
- Cameron R. Wolfe, Jan 3, 2024, Graph-Based Prompting and Reasoning with Language Models: Understanding graph of thoughts prompting and several variants, https://towardsdatascience.com/graph-based-prompting-and-reasoning-with-language-models-d6acbcd6b3d8
- Jason Wei and Denny Zhou, May 11, 2022, Language Models Perform Reasoning via Chain of Thought, https://research.google/blog/language-models-perform-reasoning-via-chain-of-thought/
- Tanay Jaipuria, Oct 29, 2024, OpenAI’s o-1 and inference-time scaling laws, https://www.tanayj.com/p/openais-o-1-and-inference-time-scaling
- Jinlin Wang, Suyuchen Wang, Ziwen Xia, Sirui Hong, Yun Zhu, Bang Liu, Chenglin Wu, 28 Oct 2024, FACT: Examining the Effectiveness of Iterative Context Rewriting for Multi-fact Retrieval, https://arxiv.org/abs/2410.21012
- Carl Franzen, November 20, 2024, DeepSeek’s first reasoning model R1-Lite-Preview turns heads, beating OpenAI o1 performance, https://venturebeat.com/ai/deepseeks-first-reasoning-model-r1-lite-preview-turns-heads-beating-openai-o1-performance/
- mshumer, Nov 2024, Open Reasoning Engine, https://github.com/mshumer/OpenReasoningEngine
- Eric Horvitz , Harsha Nori , Naoto Usuyama , November 27, 2024 Advances in run-time strategies for next-generation foundation models, Microsoft Research Blog, https://www.microsoft.com/en-us/research/blog/advances-in-run-time-strategies-for-next-generation-foundation-models/
- Harsha Nori, Naoto Usuyama, Nicholas King, Scott Mayer McKinney, Xavier Fernandes, Sheng Zhang, Eric Horvitz, 6 Nov 2024, From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond, https://arxiv.org/abs/2411.03590
- Hieu Tran, Zonghai Yao, Junda Wang, Yifan Zhang, Zhichao Yang, Hong Yu, 5 Dec 2024 (v2), RARE: Retrieval-Augmented Reasoning Enhancement for Large Language Models, https://arxiv.org/abs/2412.02830
- Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, Lixin Gu, Xuehui Wang, Qingyun Li, Yimin Ren, Zixuan Chen, Jiapeng Luo, Jiahao Wang, Tan Jiang, Bo Wang, Conghui He, Botian Shi, Xingcheng Zhang, Han Lv, Yi Wang, Wenqi Shao, Pei Chu, Zhongying Tu, Tong He, Zhiyong Wu, Huipeng Deng, Jiaye Ge, Kai Chen, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang, 6 Dec 2024, Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling, https://arxiv.org/abs/2412.05271
- Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber, 16 Oct 2024 (v2), Agent-as-a-Judge: Evaluate Agents with Agents, https://arxiv.org/abs/2410.10934
- Kyle Wiggers, December 14, 2024, ‘Reasoning’ AI models have become a trend, for better or worse, https://techcrunch.com/2024/12/14/reasoning-ai-models-have-become-a-trend-for-better-or-worse/
- Arda Sevinc, Abdurrahman Gumus, 9 Dec 2024, AutoReason: Automatic Few-Shot Reasoning Decomposition, https://arxiv.org/abs/2412.06975
- Ekin Akyürek, Mehul Damani, Linlu Qiu, Han Guo, Yoon Kim, Jacob Andreas, 11 Nov 2024, The Surprising Effectiveness of Test-Time Training for Abstract Reasoning, https://arxiv.org/abs/2411.07279
- Noam Brown, Tuomas Sandholm, 16 Nov 2017 (v3), Safe and Nested Subgame Solving for Imperfect-Information Games, https://arxiv.org/abs/1705.02955 (An early pre-AI paper on reasoning in multiple steps.)
- Maxwell Zeff, November 20, 2024, Current AI scaling laws are showing diminishing returns, forcing AI labs to change course, https://techcrunch.com/2024/11/20/ai-scaling-laws-are-showing-diminishing-returns-forcing-ai-labs-to-change-course/ ("at least 10 to 20x gains in model performance ...intelligent prompting, UX decisions, and passing context at the right time into the models...")
- Agnostiq, Dec 2024, multi-agent-llm: LLM based Multi-Agent methods: Lean implementation of various multi-agent LLM methods, including Iteration of Thought (IoT), https://github.com/AgnostiqHQ/multi-agent-llm
- Xiangjue Dong, Maria Teleki, James Caverlee, 18 Dec 2024, A Survey on LLM Inference-Time Self-Improvement, https://arxiv.org/abs/2412.14352 https://github.com/dongxiangjue/Awesome-LLM-Self-Improvement (Broad survey of reasoning improvement methods from multi-step inference to RALM to decoding algorithms.)
- Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu, 30 Dec 2024, Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs, https://arxiv.org/abs/2412.21187
- Rohin Manvi, Anikait Singh, Stefano Ermon, 3 Oct 2024, Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation, https://arxiv.org/abs/2410.02725
- Yiwei Li, Peiwen Yuan, Shaoxiong Feng, Boyuan Pan, Xinglin Wang, Bin Sun, Heda Wang, and Kan Li, 19 Jan 2024, Escape sky-high cost: Early-stopping self-consistency for multi-step reasoning, The Twelfth International Conference on Learning Representations, 2024, https://arxiv.org/abs/2401.10480 https://github.com/Yiwei98/ESC (Uses “early stopping” idea to improve CoT efficiency during inference.)
- Akash Bajwa, Jan 06, 2025, Test-Time Search: A Path To AGI: Stacking Scaling Laws And Reward Engineering, https://akashbajwa.substack.com/p/test-time-search-a-path-to-agi
- Zekun Xi, Wenbiao Yin, Jizhan Fang, Jialong Wu, Runnan Fang, Ningyu Zhang, Jiang Yong, Pengjun Xie, Fei Huang, Huajun Chen, 16 Jan 2025, OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking, https://arxiv.org/abs/2501.09751 (Iteratively going deeper into a topic while generating.)
- Siddharth Narayanan, James D. Braza, Ryan-Rhys Griffiths, Manu Ponnapati, Albert Bou, Jon Laurent, Ori Kabeli, Geemi Wellawatte, Sam Cox, Samuel G. Rodriques, Andrew D. White, 30 Dec 2024, Aviary: training language agents on challenging scientific tasks, https://arxiv.org/abs/2412.21154 (Using smaller models combined with multi-step reasoning to compete with big models with 100x less inference cost.)
- Kuang-Huei Lee, Ian Fischer, Yueh-Hua Wu, Dave Marwood, Shumeet Baluja, Dale Schuurmans, Xinyun Chen, 17 Jan 2025, Evolving Deeper LLM Thinking, https://arxiv.org/abs/2501.09891 (An alternative search strategy broad/deep, compared to CoT and reflection.)
- Edward Beeching, Lewis Tunstall, Sasha Rush Dec 16, 2024, Scaling Test Time Compute with Open Source Models, https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute
- Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Hubert Niewiadomski, Torsten Hoefler, 23 Jan 2025 (v3), Reasoning Language Models: A Blueprint, https://arxiv.org/abs/2501.11223 (Survey and blueprint for how to build a Large Reasoning Model.)
- Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto, 3 Feb 2025 (v2), s1: Simple test-time scaling, https://arxiv.org/abs/2501.19393 https://github.com/simplescaling/s1 (Method of “budget forcing” that allows either shortening or lengthening multi-step reasoning sequences.)
- Manish Sanwal, 3 Feb 2025 (v2), Layered Chain-of-Thought Prompting for Multi-Agent LLM Systems: A Comprehensive Approach to Explainable Large Language Models, https://arxiv.org/abs/2501.18645
- Sebastian Raschka, PhD, Feb 05, 2025, Understanding Reasoning LLMs: Methods and Strategies for Building and Refining Reasoning Models, https://magazine.sebastianraschka.com/p/understanding-reasoning-llms
- Ling Yang, Zhaochen Yu, Bin Cui, Mengdi Wang, 10 Feb 2025, ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates, https://arxiv.org/abs/2502.06772 https://github.com/Gen-Verse/ReasonFlux (RALM-like retrieval of reasoning prompt templates at inference time.)
- Hanmeng Liu, Zhizhang Fu, Mengru Ding, Ruoxi Ning, Chaoli Zhang, Xiaozhang Liu, Yue Zhang, 13 Feb 2025, Logical Reasoning in Large Language Models: A Survey, https://arxiv.org/abs/2502.09100
- Zeping Yu, Yonatan Belinkov, Sophia Ananiadou, 15 Feb 2025, Back Attention: Understanding and Enhancing Multi-Hop Reasoning in Large Language Models, https://arxiv.org/abs/2502.10835
- Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, Ion Stoica, 20 Feb 2025, S*: Test Time Scaling for Code Generation, https://arxiv.org/abs/2502.14382 https://github.com/NovaSky-AI/SkyThought
- Ben Dickson, February 20, 2025, How test-time scaling unlocks hidden reasoning abilities in small language models (and allows them to outperform LLMs), https://venturebeat.com/ai/how-test-time-scaling-unlocks-hidden-reasoning-abilities-in-small-language-models-and-allows-them-to-outperform-llms/
- Zhiyuan Zeng, Qinyuan Cheng, Zhangyue Yin, Yunhua Zhou, Xipeng Qiu, 17 Feb 2025, Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? https://arxiv.org/abs/2502.12215
- Zihao Zeng, Xuyao Huang, Boxiu Li, Zhijie Deng, 19 Feb 2025, SIFT: Grounding LLM Reasoning in Contexts via Stickers, https://arxiv.org/abs/2502.14922 https://github.com/zhijie-group/SIFT (Multi-step reasoning where the LLM first generates a modified prompt that summarizes the key points, and then does inference for both the original and modified prompts, then comparing results and adjusting forwards and backwards.)
- Marthe Ballon, Andres Algaba, Vincent Ginis, 21 Feb 2025, The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer, https://arxiv.org/abs/2502.15631
- Maxwell Zeff, February 24, 2025, Anthropic launches a new AI model that ‘thinks’ as long as you want, https://techcrunch.com/2025/02/24/anthropic-launches-a-new-ai-model-that-thinks-as-long-as-you-want/
- Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Mengdi Zhang, Jian Shao, Yueting Zhuang, 13 Mar 2025 (v2), InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models, https://arxiv.org/abs/2503.06692
- Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, Wanxiang Che, 13 Mar 2025 (v2), Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models, https://arxiv.org/abs/2503.09567 (Massive and broad survey of all types of reasoning.)
- Eric Zhao, Pranjal Awasthi, Sreenivas Gollapudi, 20 Feb 2025 (v2), Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification, https://arxiv.org/abs/2502.01839 (Wrapping a single model with a Best-of-N approach that self-selects the best answer can significantly improve reasoning rates.)
- Qianjun Pan, Wenkai Ji, Yuyang Ding, Junsong Li, Shilian Chen, Junyi Wang, Jie Zhou, Qin Chen, Min Zhang, Yulan Wu, Liang He, 8 May 2025 (v2), A Survey of Slow Thinking-based Reasoning LLMs using Reinforced Learning and Inference-time Scaling Law, https://arxiv.org/abs/2505.02665
- Michael Nuñez, July 15, 2025, OpenAI, Google DeepMind and Anthropic sound alarm: ‘We may be losing the ability to understand AI’, https://venturebeat.com/ai/openai-google-deepmind-and-anthropic-sound-alarm-we-may-be-losing-the-ability-to-understand-ai/ (Monitoring the text-based interim “thinking-out-loud” reasoning of models in CoT.)
- Tomek Korbak, Mikita Balesni, (and many more authors) July 2025, Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety, https://tomekkorbak.com/cot-monitorability-is-a-fragile-opportunity/cot_monitoring.pdf
Deep Research Models. Research papers on deep research models include:
- HuggingFace, February 3, 2025, OpenAI’s Deep Research vs DeepSeek R1, https://huggingface.co/blog/LLMhacker/openais-deep-research-vs-deepseek-r1
- Timothy B. Lee, Feb 24, 2025, These experts were stunned by OpenAI Deep Research: “I would use this model professionally,” an antitrust lawyer told me, https://www.understandingai.org/p/these-experts-were-stunned-by-openai
- Ravi Teja, December 24, 2024, Google Gemini’s Deep Research: What is it and How to Use it? https://techwiser.com/google-geminis-deep-research-what-is-it-and-how-to-use-it/
- Dave Citron, Dec 11, 2024, Try Deep Research and our new experimental model in Gemini, your AI assistant: Deep Research rolls out to Gemini Advanced subscribers today, saving you hours of time. Plus, you can now try out a chat optimized version of 2.0 Flash Experimental in Gemini on the web, Google Blog, https://blog.google/products/gemini/google-gemini-deep-research/
- Sundar Pichai, Demis Hassabis, Koray Kavukcuoglu, Dec 11, 2024, Introducing Gemini 2.0: our new AI model for the agentic era, Google Blog, https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/
- Emma Roth, Dec 12, 2024, Google built an AI tool that can do research for you: With Deep Research, you can ask Gemini to scour the web on your behalf and write up a report based on its findings, https://www.theverge.com/2024/12/11/24318217/google-gemini-advanced-deep-research-launch
- Asif Razzaq, March 8, 2025, Meet Manus: A New AI Agent from China with Deep Research + Operator + Computer Use + Lovable + Memory, https://www.marktechpost.com/2025/03/08/meet-manus-a-new-ai-agent-from-china-with-deep-research-operator-computer-use-lovable-memory/
- Jordan Gibbs, Feb 17, 2025, 4 Lifechanging ChatGPT Features You May Not Know About (Feb. 2025): ChatGPT has been releasing a ton of powerful features recently… Are you caught up? https://medium.com/@jordan_gibbs/4-lifechanging-chatgpt-features-you-may-not-know-about-feb-2025-01eaeb4e68c1
- Jim the AI Whisperer, Mar 10, 2025, My “Prompt Grafting” technique outperforms Deep Research in head-to-head test — and is 5900% faster: My prompt gets more insightful, comprehensive research from AI, https://generativeai.pub/prompt-grafting-vs-deep-research-faster-better-ai-essays-824968b1a47a
Hybrid Reasoning Models. Research papers on hybrid one-step/multi-step reasoning include:
- Maxwell Zeff, February 24, 2025, Anthropic launches a new AI model that ‘thinks’ as long as you want, https://techcrunch.com/2025/02/24/anthropic-launches-a-new-ai-model-that-thinks-as-long-as-you-want/
- Xiaoyu Tian, Liangyu Chen, Na Liu, Yaxuan Liu, Wei Zou, Kaijiang Chen, Ming Cui, 24 Nov 2023 (v4), DUMA: a Dual-Mind Conversational Agent with Fast and Slow Thinking, https://arxiv.org/abs/2310.18075
- Daniele Paliotta, Junxiong Wang, Matteo Pagliardini, Kevin Y. Li, Aviv Bick, J. Zico Kolter, Albert Gu, François Fleuret, Tri Dao, 27 Feb 2025, Thinking Slow, Fast: Scaling Inference Compute with Distilled Reasoners, https://arxiv.org/abs/2502.20339
- Jianyuan Zhong, Zeju Li, Zhijian Xu, Xiangyu Wen, Qiang Xu, 16 Feb 2025, Dyve: Thinking Fast and Slow for Dynamic Process Verification, https://arxiv.org/abs/2502.11157
- Kangan Qian, Zhikun Ma, Yangfan He, Ziang Luo, Tianyu Shi, Tianze Zhu, Jiayin Li, Jianhui Wang, Ziyu Chen, Xiao He, Yining Shi, Zheng Fu, Xinyu Jiao, Kun Jiang, Diange Yang, Takafumi Matsumaru, 27 Nov 2024, FASIONAD : FAst and Slow FusION Thinking Systems for Human-Like Autonomous Driving with Adaptive Feedback, https://arxiv.org/abs/2411.18013
- DiJia Su, Sainbayar Sukhbaatar, Michael Rabbat, Yuandong Tian, Qinqing Zheng, 13 Oct 2024, Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces, https://arxiv.org/abs/2410.09918
- Konstantina Christakopoulou, Shibl Mourad, Maja Matarić, 10 Oct 2024, Agents Thinking Fast and Slow: A Talker-Reasoner Architecture, https://arxiv.org/abs/2410.08328
- Pengbo Hu, Ji Qi, Xingyu Li, Hong Li, Xinqi Wang, Bing Quan, Ruiyu Wang, Yi Zhou, 21 Aug 2023 (v2), Tree-of-Mixed-Thought: Combining Fast and Slow Thinking for Multi-hop Visual Reasoning, https://arxiv.org/abs/2308.09658
- Thilo Hagendorff, Sarah Fabi, Michal Kosinski, 2 Aug 2023 (v2), Thinking Fast and Slow in Large Language Models, https://arxiv.org/abs/2212.05206
- Wenlin Yao, Haitao Mi, Dong Yu, 25 Sep 2024, HDFlow: Enhancing LLM Complex Problem-Solving with Hybrid Thinking and Dynamic Workflows, https://arxiv.org/abs/2409.17433
- Kyle Wiggers, March 4, 2025, Amazon is reportedly developing its own AI ‘reasoning’ model: Amazon reportedly wants to get in on the AI “reasoning” model game, https://techcrunch.com/2025/03/04/amazon-is-reportedly-developing-its-own-ai-reasoning-model/
- X Zhang, F Zhang, C Du, C Du, T Pang, W Gao, M Lin, Mar 2025, LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation, https://openreview.net/pdf?id=DfgfGTfObm
- Supreeth Koundinya, March 10, 2025, Manus is a Wrapper of Anthropic’s Claude, and It’s Okay, https://analyticsindiamag.com/ai-features/manus-is-a-wrapper-of-anthropics-claude-and-its-okay/ (“Manus didn’t just slap an API on a model. They built an autonomous system that can execute deep research, deep thinking, and multi-step tasks in a way that no other AI have.”)
- Sean Michael Kerner, March 18, 2025, Nvidia debuts Llama Nemotron open reasoning models in a bid to advance agentic AI, https://venturebeat.com/ai/nvidia-debuts-llama-nemotron-open-reasoning-models-in-a-bid-to-advance-agentic-ai/
- Xiaoye Qu, Yafu Li, Zhaochen Su, Weigao Sun, Jianhao Yan, Dongrui Liu, Ganqu Cui, Daizong Liu, Shuxian Liang, Junxian He, Peng Li, Wei Wei, Jing Shao, Chaochao Lu, Yue Zhang, Xian-Sheng Hua, Bowen Zhou, Yu Cheng, 27 Mar 2025, A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond, https://arxiv.org/abs/2503.21614
RAG Reasoning. Research papers on reasoning models and RAG include:
- B Zhan, A Li, X Yang, D He, Y Duan, S Yan, 2024, RARoK: Retrieval-Augmented Reasoning on Knowledge for Medical Question Answering, 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 2837-2843, DOI: 10.1109/BIBM62325.2024.10822341, https://www.computer.org/csdl/proceedings-article/bibm/2024/10822341/23onp6dXOSI (RAG combined with Chain-of-Thought for medical reasoning.)
- Xinyan Guan, Jiali Zeng, Fandong Meng, Chunlei Xin, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun, Jie Zhou. 3 Feb 2025, DeepRAG: Thinking to Retrieval Step by Step for Large Language Models, https://arxiv.org/abs/2502.01142
- P Verma, SP Midigeshi, G Sinha, A Solin, N Natarajan, Mar 2025, Plan *RAG: Efficient Test-Time Planning for Retrieval Augmented Generation, ICLR 2025 review, https://openreview.net/pdf?id=gi9aqlYdBk (Improve RAG reasoning efficiency via planning for parallel reasoning.)
- Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, Wanxiang Che, 13 Mar 2025 (v2), Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models, https://arxiv.org/abs/2503.09567 (Massive and broad survey of all types of reasoning.)
- Mingyue Cheng, Yucong Luo, Jie Ouyang, Qi Liu, Huijie Liu, Li Li, Shuo Yu, Bohou Zhang, Jiawei Cao, Jie Ma, Daoyu Wang, Enhong Chen, 17 Mar 2025 (v2), A Survey on Knowledge-Oriented Retrieval-Augmented Generation, https://arxiv.org/abs/2503.10677
- Thang Nguyen, Peter Chin, Yu-Wing Tai, 26 May 2025, MA-RAG: Multi-Agent Retrieval-Augmented Generation via Collaborative Chain-of-Thought Reasoning, https://arxiv.org/abs/2505.20096