Aussie AI
Training Reasoning Models
-
Last Updated 29 August, 2025
-
by David Spuler, Ph.D.
What is Training Reasoning Models?
Training reasoning models means selecting training data and techniques specifically to improve reasoning results. Depending on the model architecture, the aim may be better reasoning in a single inference step, or multi-step improvements such as Chain-of-Thought. Hybrid methods train the LLM to output "longer answers" containing multiple reasoning steps, while still running in a single step of inference.
The importance of training directly toward the goal of better reasoning was demonstrated by the release of the impressive DeepSeek R1 reasoning model, which beat other Large Reasoning Models on several metrics. This was achieved using a dataset of approximately 800,000 reasoning-specific training examples. The result was R1, a single-step reasoning model that produced longer answers, "talking to itself" through the steps of reasoning out the answer to a question.
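As an illustration of what "reasoning-specific training data" can look like, here is a minimal sketch that formats question/reasoning-trace/answer triples into single training strings, with the intermediate steps wrapped in think-style tags so a single-step model learns to emit its reasoning before the final answer. The tag convention and the sample data are illustrative assumptions, not the actual DeepSeek R1 format.

```python
# Format (question, reasoning steps, answer) triples into training
# strings where the reasoning appears inside <think> tags, so the
# model learns to "talk to itself" before giving the final answer.
# The tag convention and sample data are illustrative only.

def format_reasoning_example(question, steps, answer):
    trace = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))
    return (
        f"Question: {question}\n"
        f"<think>\n{trace}\n</think>\n"
        f"Answer: {answer}"
    )

examples = [
    ("What is 12 * 15?",
     ["12 * 15 = 12 * 10 + 12 * 5", "120 + 60 = 180"],
     "180"),
]

corpus = [format_reasoning_example(q, s, a) for q, s, a in examples]
print(corpus[0])
```

A corpus built this way can then be fed to an ordinary supervised fine-tuning loop; the reasoning behaviour comes from the data format, not from any change to the training algorithm.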
Research on Training Reasoning Models
Research papers include:
- Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Hubert Niewiadomski, Torsten Hoefler, 23 Jan 2025 (v3), Reasoning Language Models: A Blueprint, https://arxiv.org/abs/2501.11223 (Survey and blueprint for how to build a Large Reasoning Model.)
- Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, Xiang Yue, 5 Feb 2025, Demystifying Long Chain-of-Thought Reasoning in LLMs, https://arxiv.org/abs/2502.03373 https://github.com/eddycmu/demystify-long-cot
- Sebastian Raschka, PhD, Feb 05, 2025, Understanding Reasoning LLMs: Methods and Strategies for Building and Refining Reasoning Models https://magazine.sebastianraschka.com/p/understanding-reasoning-llms
- Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, Yingying Zhang, Fei Yin, Jiahua Dong, Zhijiang Guo, Le Song, Cheng-Lin Liu, 25 Feb 2025 (v2), From System 1 to System 2: A Survey of Reasoning Large Language Models, https://arxiv.org/abs/2502.17419
- Komal Kumar, Tajamul Ashraf, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Phillip H.S. Torr, Salman Khan, Fahad Shahbaz Khan, 28 Feb 2025, LLM Post-Training: A Deep Dive into Reasoning Large Language Models, https://arxiv.org/abs/2502.21321
- Tergel Munkhbat, Namgyu Ho, Seo Hyun Kim, Yongjin Yang, Yujin Kim, Se-Young Yun, 28 Feb 2025 (v2), Self-Training Elicits Concise Reasoning in Large Language Models, https://arxiv.org/abs/2502.20122
- Qianxi He, Qianyu He, Jiaqing Liang, Yanghua Xiao, Weikang Zhou, Zeye Sun, Fei Yu, 27 Feb 2025, Order Doesn't Matter, But Reasoning Does: Training LLMs with Order-Centric Augmentation, https://arxiv.org/abs/2502.19907
- Yihang Yao, Zhepeng Cen, Miao Li, William Han, Yuyou Zhang, Emerson Liu, Zuxin Liu, Chuang Gan, Ding Zhao, 25 Feb 2025, Your Language Model May Think Too Rigidly: Achieving Reasoning Consistency with Symmetry-Enhanced Training, https://arxiv.org/abs/2502.17800
- Yong Zhang, Bingyuan Zhang, Zhitao Li, Ming Li, Ning Cheng, Minchuan Chen, Tao Wei, Jun Ma, Shaojun Wang, Jing Xiao, 18 Feb 2025, Self-Enhanced Reasoning Training: Activating Latent Reasoning in Small Models for Enhanced Reasoning Distillation, https://arxiv.org/abs/2502.12744
- Jackson Coleman, Isaiah Lawrence, Benjamin Turner, 9 Feb 2025, Multi-granular Training Strategies for Robust Multi-hop Reasoning Over Noisy and Heterogeneous Knowledge Sources, https://arxiv.org/abs/2502.05944
- Xinhao Yao, Ruifeng Ren, Yun Liao, Yong Liu, 7 Feb 2025, Unveiling the Mechanisms of Explicit CoT Training: How Chain-of-Thought Enhances Reasoning Generalization, https://arxiv.org/abs/2502.04667
- Daman Arora, Andrea Zanette, 11 Feb 2025 (v2), Training Language Models to Reason Efficiently, https://arxiv.org/abs/2502.04463
- Pengxiao Lin, Zhongwang Zhang, Zhi-Qin John Xu, 20 Feb 2025 (v2), Reasoning Bias of Next Token Prediction Training, https://arxiv.org/abs/2502.02007
- Zui Chen, Tianqiao Liu, Mi Tian, Qing Tong, Weiqi Luo, Zitao Liu, 18 Feb 2025 (v2), Advancing Math Reasoning in Language Models: The Impact of Problem-Solving Data, Data Synthesis Methods, and Training Stages, https://arxiv.org/abs/2501.14002
- Alberto Romero, Jan 2025, DeepSeek, a little-known Chinese startup, released R1 yesterday, https://substack.com/@thealgorithmicbridge/note/c-87664591-
- DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z.F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, et al. (100+ additional authors not shown), 22 Jan 2025, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, https://arxiv.org/abs/2501.12948 (The DeepSeek R1 large reasoning model.)
- Akash Bajwa Jan 27, 2025, The Post-R1 World: AI Economics Have Irreversibly Changed, https://akashbajwa.substack.com/p/the-post-r1-world
- Mohammed Karimkhan Pathan, February 3, 2025, Open-source revolution: How DeepSeek-R1 challenges OpenAI’s o1 with superior processing, cost efficiency, https://venturebeat.com/ai/open-source-revolution-how-deepseek-r1-challenges-openais-o1-with-superior-processing-cost-efficiency/
- Guiyao Tie, Zeli Zhao, Dingjie Song, Fuyang Wei, Rong Zhou, Yurou Dai, Wen Yin, Zhejian Yang, Jiangyue Yan, Yao Su, Zhenhan Dai, Yifeng Xie, Yihan Cao, Lichao Sun, Pan Zhou, Lifang He, Hechang Chen, Yu Zhang, Qingsong Wen, Tianming Liu, Neil Zhenqiang Gong, Jiliang Tang, Caiming Xiong, Heng Ji, Philip S. Yu, Jianfeng Gao, 8 Mar 2025, A Survey on Post-training of Large Language Models, https://arxiv.org/abs/2503.06072
- Xiaomi LLM-Core Team: Bingquan Xia, Bowen Shen, Cici, Dawei Zhu, Di Zhang, Gang Wang, Hailin Zhang, Huaqiu Liu, Jiebao Xiao, Jinhao Dong, Liang Zhao, Peidian Li, Peng Wang, Shihua Yu, Shimao Chen, Weikun Wang, Wenhan Ma, Xiangwei Deng, Yi Huang, Yifan Song, Zihan Jiang, Bowen Ye, Can Cai, Chenhong He, Dong Zhang, Duo Zhang, Guoan Wang, Hao Tian, Haochen Zhao, Heng Qu, Hongshen Xu, Jun Shi, Kainan Bao, QingKai Fang, Kang Zhou, Kangyang Zhou, Lei Li, Menghang Zhu, Nuo Chen, Qiantong Wang, Shaohui Liu, Shicheng Li, Shuhao Gu, Shuhuai Ren, Shuo Liu, Sirui Deng, Weiji Zhuang, Weiwei Lv, Wenyu Yang, Xin Zhang, Xing Yong, Xing Zhang, Xingchen Song, Xinzhe Xu, Xu Wang, Yihan Yan, Yu Tu, Yuanyuan Tian, Yudong Wang, Yue Yu, Zhenru Lin, Zhichao Song, Zihao Yue, Xiaomi, 12 May 2025, MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining, https://arxiv.org/abs/2505.07608
- Jiaran Ye, Zijun Yao, Zhidian Huang, Liangming Pan, Jinxin Liu, Yushi Bai, Amy Xin, Liu Weichuan, Xiaoyin Che, Lei Hou, Juanzi Li, 29 May 2025, How does Transformer Learn Implicit Reasoning? https://arxiv.org/abs/2505.23653
- Wei Sun, Qianlong Du, Fuwei Cui, Jiajun Zhang, 23 Jul 2025, An Efficient and Precise Training Data Construction Framework for Process-supervised Reward Model in Mathematical Reasoning, https://arxiv.org/abs/2503.02382
- Yanjun Zheng, Xiyang Du, Longfei Liao, Xiaoke Zhao, Zhaowen Zhou, Bo Zhang, Jiawei Liu, Xiang Qi, Zhe Li, Zhiqiang Zhang, Wei Wang and Peng Zhang, 23 Jul 2025, Agentar-Fin-R1: Enhancing Financial Intelligence through Domain Expertise, Training Efficiency, and Advanced Reasoning, https://arxiv.org/abs/2507.16802
- Mehul Damani, Isha Puri, Stewart Slocum, Idan Shenfeld, Leshem Choshen, Yoon Kim, Jacob Andreas, 22 Jul 2025, Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty, https://arxiv.org/abs/2507.16806
- Run-Ze Fan and Zengzhi Wang and Pengfei Liu, 22 Jul 2025, MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning, https://arxiv.org/abs/2507.16812
- Datta Nimmaturi, Vaishnavi Bhargava, Rajat Ghosh, Johnu George, Debojyoti Dutta, 24 Jul 2025, Predictive Scaling Laws for Efficient GRPO Training of Large Reasoning Models, https://arxiv.org/abs/2507.18014
- Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, Jiawei Han, 21 Jul 2025, Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning, https://arxiv.org/abs/2503.09516
- Lijie Yang, Zhihao Zhang, Arti Jain, Shijie Cao, Baihong Yuan, Yiwei Chen, Zhihao Jia, Ravi Netravali, 9 Aug 2025, Less Is More: Training-Free Sparse Attention with Global Locality for Efficient Reasoning, https://arxiv.org/abs/2508.07101
- Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, Guang Shi, 14 Aug 2025, Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models, https://arxiv.org/abs/2508.10751
- Kai Zhao, Yanjun Zhao, Jiaming Song, Shien He, Lusheng Zhang, Qiang Zhang, Tianjiao Li, 8 Aug 2025, SABER: Switchable and Balanced Training for Efficient LLM Reasoning, https://arxiv.org/abs/2508.10026
- Hongbo Jin, Ruyang Liu, Wenhao Zhang, Guibo Luo, Ge Li, 3 Aug 2025, CoT-Vid: Dynamic Chain-of-Thought Routing with Self Verification for Training-Free Video Reasoning, https://arxiv.org/abs/2505.11830
- Hasan Abed Al Kader Hammoud, Kumail Alhamoud, Abed Hammoud, Elie Bou-Zeid, Marzyeh Ghassemi, Bernard Ghanem, 12 Aug 2025, Train Long, Think Short: Curriculum Learning for Efficient Reasoning, https://arxiv.org/abs/2508.08940
- Qiaoyu Zheng, Yuze Sun, Chaoyi Wu, Weike Zhao, Pengcheng Qiu, Yongguo Yu, Kun Sun, Yanfeng Wang, Ya Zhang and Weidi Xie, 21 Aug 2025, End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning, https://arxiv.org/abs/2508.15746
- Danlong Yuan, Tian Xie, Shaohan Huang, Zhuocheng Gong, Huishuai Zhang, Chong Luo, Furu Wei, Dongyan Zhao, 22 Aug 2025, Efficient RL Training for Reasoning Models via Length-Aware Optimization, https://arxiv.org/abs/2505.12284
Other Reasoning Topics
Some other areas to research include:
- LLM Reasoning Techniques
- Small Reasoning Models (SRMs)
- Hybrid Reasoning Models
- Reasoning Theory
- Reasoning Survey Papers
- Reasoning Research Papers (long list)
- Reasoning Model Evaluation
- Large Reasoning Model (LRM)
- Open-Source Reasoning Models
- Program Synthesis-based Reasoning
- Temporal Reasoning
Specific reasoning techniques:
- Chain-of-Thought (CoT)
- Tree-of-Thought (ToT)
- Graph Reasoning
- Skeleton-of-Thought Reasoning
- Reflection
- LLM-as-Judge
- System 2 Reasoning
- Best-of-N
Reasoning and CoT Efficiency Topics
Reasoning models can be expensive in terms of token count, whether from long answers in a single inference step or from many steps in test-time compute approaches. Exploring multiple trees or alternative reasoning pathways adds further cost.
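A back-of-the-envelope comparison illustrates the point. Assuming a per-token price and some token counts (all numbers below are hypothetical), a long reasoning trace or a tree of alternative pathways can cost an order of magnitude more than a plain direct answer:

```python
# Rough token-cost comparison for reasoning outputs.
# All prices and token counts are illustrative assumptions.
PRICE_PER_1K_TOKENS = 0.002  # hypothetical output-token price in dollars

def cost(tokens):
    return tokens / 1000 * PRICE_PER_1K_TOKENS

plain_answer = cost(300)          # short direct answer
long_cot = cost(3000)             # single step with a long reasoning trace
tree_of_thought = cost(3000 * 8)  # 8 alternative reasoning pathways

print(f"plain:  ${plain_answer:.4f}")
print(f"CoT:    ${long_cot:.4f}")
print(f"ToT x8: ${tree_of_thought:.4f}")
```

Under these assumptions the tree-based run is 80 times the cost of the plain answer, which is why the efficiency techniques below matter.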
More research information on general efficiency optimization techniques for reasoning models:
- Reasoning inference optimization (RIO)
- Chain-of-Thought (CoT) optimization
- Small Reasoning Models (SRMs)
- Adaptive Inference Time Compute
- Reasoning Tokens
General methods of achieving CoT efficiency, and optimizations from specific types of Chain-of-Thought, include:
- Hidden Token Chain-of-Thought (HCoT)
- Continuous Chain-of-Thought (Coconut)
- Chain of Draft (CoD)
- CoT Reasoning Decoding
- Concise Chain-of-Thought
- Constrained Chain-of-Thought
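Several of the techniques above work purely at the prompt level. As a minimal sketch of the Concise or Constrained Chain-of-Thought idea (the instruction wording and word budget here are illustrative, not taken from any specific paper), the prompt simply asks the model to reason within an explicit budget:

```python
# Build a Constrained Chain-of-Thought style prompt that asks the
# model to reason step by step, but within an explicit word budget.
# The instruction wording and default budget are illustrative.

def constrained_cot_prompt(question, max_words=60):
    return (
        f"{question}\n"
        f"Think step by step, but use at most {max_words} words "
        "for your reasoning, then give the final answer on its own line."
    )

prompt = constrained_cot_prompt(
    "A train travels 120 km in 2 hours. What is its average speed?"
)
print(prompt)
```

The budget trades a small accuracy risk for a large reduction in output tokens, which is the core trade-off in most concise-CoT methods.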
AI Books from Aussie AI
- The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory. Get your copy from Amazon: The Sweetest Lesson
- RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures. Get your copy from Amazon: RAG Optimization
- Generative AI Applications book. Get your copy from Amazon: Generative AI Applications
- Generative AI programming book. Get your copy from Amazon: Generative AI in C++
- CUDA C++ Optimization book. Get your copy from Amazon: CUDA C++ Optimization
- CUDA C++ Debugging book. Get your copy from Amazon: CUDA C++ Debugging