Aussie AI
Long List of LLM Reasoning Papers
Last Updated 26 August, 2025
by David Spuler, Ph.D.
Reasoning Research Papers
This page is a long list of paper citations in the area of LLM reasoning. For a more useful discussion and categorization of this research, see our recent blog articles related to reasoning:
- Reasoning inference optimization
- Reasoning is the New AI Middleware
- Reasoning Decoding Algorithms
- 500 LLM inference optimization techniques
Multi-Step Inference for Reasoning
A general list of papers on multi-step reasoning, also known as "test-time compute":
- Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, Karthik Narasimhan, 3 Dec 2023 (v2), Tree of Thoughts: Deliberate Problem Solving with Large Language Models, https://arxiv.org/abs/2305.10601 Code: https://github.com/princeton-nlp/tree-of-thought-llm
- Zehui Chen, Kuikun Liu, Qiuchen Wang, Jiangning Liu, Wenwei Zhang, Kai Chen, Feng Zhao, 29 Jul 2024, MindSearch: Mimicking Human Minds Elicits Deep AI Searcher, https://arxiv.org/abs/2407.20183 Code: https://github.com/InternLM/MindSearch Project: https://mindsearch.netlify.app
- Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, Azalia Mirhoseini, 31 Jul 2024, Large Language Monkeys: Scaling Inference Compute with Repeated Sampling, https://arxiv.org/abs/2407.21787 (Generating multiple answers by repeated inference queries, and then using a verifier to choose the best one, which is shown to greatly increase overall accuracy.)
- David Gu, July 18, 2024, Text Compression for Efficient Language Generation, Master’s Thesis, Distributed Computing Group, Computer Engineering and Networks Laboratory, ETH Zürich, https://pub.tik.ee.ethz.ch/students/2023-HS/MA-2023-19.pdf (Training and inference at the sentence level, including caching of embeddings per sentence, which also has the side-effect of compressing the input prompts and reducing computation analogously to token pruning.)
- Ignacio de Gregorio, Aug 2024, Grokking, a New Form of Reasoning, https://medium.com/@ignacio.de.gregorio.noblejas/grokking-a-new-form-of-reasoning-6785ea89d2ec
- Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Ion Stoica, Matei Zaharia, James Zou, 4 Jun 2024 (v2), Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems, https://arxiv.org/abs/2403.02419
- Asankhaya Sharma (codelion), Sep 2024, Optillm: Optimizing inference proxy for LLMs, https://github.com/codelion/optillm
- Justin Chih-Yao Chen, Archiki Prasad, Swarnadeep Saha, Elias Stengel-Eskin, Mohit Bansal, 18 Sep 2024, MAgICoRe: Multi-Agent, Iterative, Coarse-to-Fine Refinement for Reasoning, https://arxiv.org/abs/2409.12147 https://github.com/dinobby/MAgICoRe
- Xiaohan Xu, Chongyang Tao, Tao Shen, Can Xu, Hongbo Xu, Guodong Long, Jian-guang Lou, 29 Feb 2024 (v2), Re-Reading Improves Reasoning in Large Language Models, https://arxiv.org/abs/2309.06275
- Artem Shelamanov, Sep 2024, Why OpenAI’s o1 Model Is A Scam, https://pub.towardsai.net/why-openais-o1-model-is-a-scam-eb3356c3d70e
- Rinon Gal, Adi Haviv, Yuval Alaluf, Amit H. Bermano, Daniel Cohen-Or, Gal Chechik, 2 Oct 2024, ComfyGen: Prompt-Adaptive Workflows for Text-to-Image Generation, https://arxiv.org/abs/2410.01731 https://comfygen-paper.github.io/
- Yi Cheng, Xiao Liang, Yeyun Gong, Wen Xiao, Song Wang, Yuji Zhang, Wenjun Hou, Kaishuai Xu, Wenge Liu, Wenjie Li, Jian Jiao, Qi Chen, Peng Cheng, Wayne Xiong, Integrative Decoding: Improve Factuality via Implicit Self-consistency, 3 Oct 2024 (v2), https://arxiv.org/abs/2410.01556 (Prepends a previous response to improve decoding accuracy.)
- Zhenwen Liang, Ye Liu, Tong Niu, Xiangliang Zhang, Yingbo Zhou, Semih Yavuz, 5 Oct 2024, Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification, https://arxiv.org/abs/2410.05318
- Sonya Huang, Pat Grady, and o1, Sequoia, October 9, 2024, Generative AI’s Act o1, https://www.sequoiacap.com/article/generative-ais-act-o1/
- Yingqian Cui, Pengfei He, Xianfeng Tang, Qi He, Chen Luo, Jiliang Tang, Yue Xing, 21 Oct 2024, A Theoretical Understanding of Chain-of-Thought: Coherent Reasoning and Error-Aware Demonstration, https://arxiv.org/abs/2410.16540
- Jiangming Liu, Matt Gardner, Shay B. Cohen, Mirella Lapata, 7 Jun 2021 (v2), Multi-Step Inference for Reasoning Over Paragraphs, https://arxiv.org/abs/2004.02995
- Aditya Kalyanpur, Kailash Karthik Saravanakumar, Victor Barres, CJ McFate, Lori Moon, Nati Seifu, Maksim Eremeev, Jose Barrera, Abraham Bautista-Castillo, Eric Brown, David Ferrucci, 24 Jul 2024 (v4), Multi-step Inference over Unstructured Data, https://arxiv.org/abs/2406.17987
- Banghao Chen, Zhaofeng Zhang, Nicolas Langrené, Shengxin Zhu, 5 Sep 2024 (v5), Unleashing the potential of prompt engineering in Large Language Models: a comprehensive review, https://arxiv.org/abs/2310.14735
- Xiaodong Liu, Kevin Duh, Jianfeng Gao, 30 Mar 2019 (v2), Stochastic Answer Networks for Natural Language Inference, https://arxiv.org/abs/1804.07888
- TED, Oct 2024, Multi-Step Reasoning Agents, https://tedai-sanfrancisco.ted.com/glossary/multi-step-reasoning-agents/
- Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, Tushar Khot, 30 Jan 2023 (v2), Complexity-Based Prompting for Multi-Step Reasoning, https://arxiv.org/abs/2210.00720
- Junting Lu, Oct 2024 (accessed), Awesome-LLM-Reasoning-Techniques, https://github.com/Junting-Lu/Awesome-LLM-Reasoning-Techniques
- Cameron R. Wolfe, Dec 23, 2023, Tree of Thoughts Prompting. Solving multi-step problems with LLMs via deliberate planning and exploration, https://towardsdatascience.com/tree-of-thoughts-prompting-65a3e51f9ac4
- Data Camp, Jul 10, 2024, Chain-of-Thought Prompting: Step-by-Step Reasoning with LLMs, https://www.datacamp.com/tutorial/chain-of-thought-prompting
- Pankaj, Dec 21, 2023, Chain of Thought Prompting: Guiding LLMs Step-by-Step, https://medium.com/@pankaj_pandey/chain-of-thought-prompting-guiding-llms-step-by-step-e6eac32d02d8
- Cobus Greyling, Aug 2, 2023, 12 Prompt Engineering Techniques, https://cobusgreyling.medium.com/12-prompt-engineering-techniques-644481c857aa
- Cameron R. Wolfe, Aug 21, 2023, Tree of Thoughts Prompting. Solving multi-step problems with LLMs via deliberate planning and exploration, https://cameronrwolfe.substack.com/p/tree-of-thoughts-prompting
- Cameron R. Wolfe, Jan 3, 2024, Graph-Based Prompting and Reasoning with Language Models. Understanding graph of thoughts prompting and several variants… https://towardsdatascience.com/graph-based-prompting-and-reasoning-with-language-models-d6acbcd6b3d8
- Jason Wei and Denny Zhou, May 11, 2022, Language Models Perform Reasoning via Chain of Thought, https://research.google/blog/language-models-perform-reasoning-via-chain-of-thought/
- Cameron R. Wolfe, Jul 24, 2023, Chain of Thought Prompting for LLMs: A practical and simple approach for “reasoning” with LLMs, https://towardsdatascience.com/chain-of-thought-prompting-for-llms-33c963eead38
- Siwei Wu, Zhongyuan Peng, Xinrun Du, Tuney Zheng, Minghao Liu, Jialong Wu, Jiachen Ma, Yizhi Li, Jian Yang, Wangchunshu Zhou, Qunshu Lin, Junbo Zhao, Zhaoxiang Zhang, Wenhao Huang, Ge Zhang, Chenghua Lin, J.H. Liu, 22 Oct 2024 (v2), A Comparative Study on Reasoning Patterns of OpenAI's o1 Model, https://arxiv.org/abs/2410.13639
- Arun Shankar, Oct 2024, Designing Cognitive Architectures: Agentic Workflow Patterns from Scratch, https://medium.com/google-cloud/designing-cognitive-architectures-agentic-workflow-patterns-from-scratch-63baa74c54bc
- Tanay Jaipuria, Oct 29, 2024, OpenAI's o-1 and inference-time scaling laws, https://www.tanayj.com/p/openais-o-1-and-inference-time-scaling
- Jinlin Wang, Suyuchen Wang, Ziwen Xia, Sirui Hong, Yun Zhu, Bang Liu, Chenglin Wu, 28 Oct 2024, FACT: Examining the Effectiveness of Iterative Context Rewriting for Multi-fact Retrieval, https://arxiv.org/abs/2410.21012
- Latent Space, Nov 05, 2024, Inference, Fast and Slow. When System 1/System 2 analogies are not enough: The 6 types of LLM inference, https://www.latent.space/p/inference-fast-and-slow
- Shanghaoran Quan, Tianyi Tang, Bowen Yu, An Yang, Dayiheng Liu, Bofei Gao, Jianhong Tu, Yichang Zhang, Jingren Zhou, Junyang Lin, 31 Oct 2024, Language Models can Self-Lengthen to Generate Long Texts, https://arxiv.org/abs/2410.23933
- LangChain, Nov 7, 2024. SCIPE - Systematic Chain Improvement and Problem Evaluation, https://blog.langchain.dev/scipe-systematic-chain-improvement-and-problem-evaluation/ https://github.com/garg-ankush/scipe/tree/main
- X Wang, L Mu, J Zhang, H Xu, 2024, Multi-pass Decoding for Grammatical Error Correction, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9904–9916, November 12-16, 2024, https://aclanthology.org/2024.emnlp-main.553.pdf
- Siyun Zhao, Yuqing Yang, Zilong Wang, Zhiyuan He, Luna K. Qiu, Lili Qiu, 23 Sep 2024, Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely, https://arxiv.org/abs/2409.14924
- Guowei Xu, Peng Jin, Li Hao, Yibing Song, Lichao Sun, Li Yuan, 15 Nov 2024, LLaVA-o1: Let Vision Language Models Reason Step-by-Step, https://arxiv.org/abs/2411.10440
- Carl Franzen, November 20, 2024, DeepSeek’s first reasoning model R1-Lite-Preview turns heads, beating OpenAI o1 performance, https://venturebeat.com/ai/deepseeks-first-reasoning-model-r1-lite-preview-turns-heads-beating-openai-o1-performance/
- mshumer, Nov 2024, Open Reasoning Engine, https://github.com/mshumer/OpenReasoningEngine
- Eric Horvitz, Harsha Nori, Naoto Usuyama, November 27, 2024, Advances in run-time strategies for next-generation foundation models, Microsoft Research Blog, https://www.microsoft.com/en-us/research/blog/advances-in-run-time-strategies-for-next-generation-foundation-models/
- Harsha Nori, Naoto Usuyama, Nicholas King, Scott Mayer McKinney, Xavier Fernandes, Sheng Zhang, Eric Horvitz, 6 Nov 2024, From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond, https://arxiv.org/abs/2411.03590
- Hieu Tran, Zonghai Yao, Junda Wang, Yifan Zhang, Zhichao Yang, Hong Yu, 5 Dec 2024 (v2), RARE: Retrieval-Augmented Reasoning Enhancement for Large Language Models, https://arxiv.org/abs/2412.02830
- Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, Lixin Gu, Xuehui Wang, Qingyun Li, Yimin Ren, Zixuan Chen, Jiapeng Luo, Jiahao Wang, Tan Jiang, Bo Wang, Conghui He, Botian Shi, Xingcheng Zhang, Han Lv, Yi Wang, Wenqi Shao, Pei Chu, Zhongying Tu, Tong He, Zhiyong Wu, Huipeng Deng, Jiaye Ge, Kai Chen, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang, 6 Dec 2024, Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling, https://arxiv.org/abs/2412.05271
- Arda Sevinc, Abdurrahman Gumus, 9 Dec 2024, AutoReason: Automatic Few-Shot Reasoning Decomposition, https://arxiv.org/abs/2412.06975 https://github.com/miralab-ai/autoreason
- Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber, 16 Oct 2024 (v2), Agent-as-a-Judge: Evaluate Agents with Agents, https://arxiv.org/abs/2410.10934
- Kyle Wiggers, December 14, 2024, ‘Reasoning’ AI models have become a trend, for better or worse, https://techcrunch.com/2024/12/14/reasoning-ai-models-have-become-a-trend-for-better-or-worse/
- Ekin Akyürek, Mehul Damani, Linlu Qiu, Han Guo, Yoon Kim, Jacob Andreas, 11 Nov 2024, The Surprising Effectiveness of Test-Time Training for Abstract Reasoning, https://arxiv.org/abs/2411.07279
- Noam Brown, Tuomas Sandholm, 16 Nov 2017 (v3), Safe and Nested Subgame Solving for Imperfect-Information Games, https://arxiv.org/abs/1705.02955 (An early pre-LLM paper on reasoning in multiple steps.)
- Maxwell Zeff, November 20, 2024, Current AI scaling laws are showing diminishing returns, forcing AI labs to change course, https://techcrunch.com/2024/11/20/ai-scaling-laws-are-showing-diminishing-returns-forcing-ai-labs-to-change-course/ ("at least 10 to 20x gains in model performance ...intelligent prompting, UX decisions, and passing context at the right time into the models...")
- Agnostiq, Dec 2024, multi-agent-llm: LLM based Multi-Agent methods: Lean implementation of various multi-agent LLM methods, including Iteration of Thought (IoT), https://github.com/AgnostiqHQ/multi-agent-llm
- Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, Augustus Odena, 30 Nov 2021, Show Your Work: Scratchpads for Intermediate Computation with Language Models, https://arxiv.org/abs/2112.00114
- Xiangjue Dong, Maria Teleki, James Caverlee, 18 Dec 2024, A Survey on LLM Inference-Time Self-Improvement, https://arxiv.org/abs/2412.14352 https://github.com/dongxiangjue/Awesome-LLM-Self-Improvement (Broad survey of reasoning improvement methods from multi-step inference to RALM to decoding algorithms.)
- Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu, 30 Dec 2024, Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs, https://arxiv.org/abs/2412.21187
- Rohin Manvi, Anikait Singh, Stefano Ermon, 3 Oct 2024, Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation, https://arxiv.org/abs/2410.02725
- Yiwei Li, Peiwen Yuan, Shaoxiong Feng, Boyuan Pan, Xinglin Wang, Bin Sun, Heda Wang, and Kan Li, 19 Jan 2024, Escape sky-high cost: Early-stopping self-consistency for multi-step reasoning. The Twelfth International Conference on Learning Representations, 2024, https://arxiv.org/abs/2401.10480 https://github.com/Yiwei98/ESC (Uses "early stopping" idea to improve CoT efficiency during inference.)
- Akash Bajwa, Jan 06, 2025, Test-Time Search: A Path To AGI: Stacking Scaling Laws And Reward Engineering, https://akashbajwa.substack.com/p/test-time-search-a-path-to-agi
- NovaSky, Jan 2025, Sky-T1: Train your own O1 preview model within $450, https://novasky-ai.github.io/posts/sky-t1/
- Tong Xiao, Jingbo Zhu, 16 Jan 2025, Foundations of Large Language Models, https://arxiv.org/abs/2501.09223 (Huge 230-page paper on many topics such as training, prompting, alignment, and long context.)
- Zekun Xi, Wenbiao Yin, Jizhan Fang, Jialong Wu, Runnan Fang, Ningyu Zhang, Jiang Yong, Pengjun Xie, Fei Huang, Huajun Chen, 16 Jan 2025, OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking, https://arxiv.org/abs/2501.09751 (Iteratively going deeper into a topic while generating.)
- Siddharth Narayanan, James D. Braza, Ryan-Rhys Griffiths, Manu Ponnapati, Albert Bou, Jon Laurent, Ori Kabeli, Geemi Wellawatte, Sam Cox, Samuel G. Rodriques, Andrew D. White, 30 Dec 2024, Aviary: training language agents on challenging scientific tasks, https://arxiv.org/abs/2412.21154 (Using smaller models combined with multi-step reasoning to compete with big models with 100x less inference cost.)
- Kuang-Huei Lee, Ian Fischer, Yueh-Hua Wu, Dave Marwood, Shumeet Baluja, Dale Schuurmans, Xinyun Chen, 17 Jan 2025, Evolving Deeper LLM Thinking, https://arxiv.org/abs/2501.09891 (An alternative search strategy broad/deep, compared to CoT and reflection.)
- Edward Beeching, Lewis Tunstall, Sasha Rush Dec 16, 2024, Scaling Test Time Compute with Open Source Models, https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute
- Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Hubert Niewiadomski, Torsten Hoefler, 23 Jan 2025 (v3), Reasoning Language Models: A Blueprint, https://arxiv.org/abs/2501.11223 (Survey and blueprint for how to build a Large Reasoning Model.)
- Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, Han Cai, Bingchen Liu, Daquan Zhou, Song Han, 30 Jan 2025, SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer, https://arxiv.org/abs/2501.18427 (Diffusion model optimization using block-level depth pruning and inference-time scaling.)
- S Wang, X Zhang, J Ma, A Hwang, Z Yu, Jan 2025, JumpStarter: Getting Started on Personal Goals with Adaptive Personal Context Curation, https://sitong-wang.github.io/data/JumpStarter.pdf (Long-term planning of goal-oriented long multi-step projects.)
- Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto, 3 Feb 2025 (v2), s1: Simple test-time scaling, https://arxiv.org/abs/2501.19393 https://github.com/simplescaling/s1 (Method of "budget forcing" that allows either shortening or lengthening multi-step reasoning sequences.)
- Manish Sanwal, 3 Feb 2025 (v2), Layered Chain-of-Thought Prompting for Multi-Agent LLM Systems: A Comprehensive Approach to Explainable Large Language Models, https://arxiv.org/abs/2501.18645
- Sebastian Raschka, PhD, Feb 05, 2025, Understanding Reasoning LLMs: Methods and Strategies for Building and Refining Reasoning Models, https://magazine.sebastianraschka.com/p/understanding-reasoning-llms
- Ling Yang, Zhaochen Yu, Bin Cui, Mengdi Wang, 10 Feb 2025, ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates, https://arxiv.org/abs/2502.06772 https://github.com/Gen-Verse/ReasonFlux (RALM-like retrieval of reasoning prompt templates at inference time.)
- Hanmeng Liu, Zhizhang Fu, Mengru Ding, Ruoxi Ning, Chaoli Zhang, Xiaozhang Liu, Yue Zhang, 13 Feb 2025, Logical Reasoning in Large Language Models: A Survey, https://arxiv.org/abs/2502.09100
- Zeping Yu, Yonatan Belinkov, Sophia Ananiadou, 15 Feb 2025, Back Attention: Understanding and Enhancing Multi-Hop Reasoning in Large Language Models, https://arxiv.org/abs/2502.10835
- Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, Ion Stoica, 20 Feb 2025, S*: Test Time Scaling for Code Generation, https://arxiv.org/abs/2502.14382 https://github.com/NovaSky-AI/SkyThought
- Ben Dickson, February 20, 2025, How test-time scaling unlocks hidden reasoning abilities in small language models (and allows them to outperform LLMs), https://venturebeat.com/ai/how-test-time-scaling-unlocks-hidden-reasoning-abilities-in-small-language-models-and-allows-them-to-outperform-llms/
- Zhiyuan Zeng, Qinyuan Cheng, Zhangyue Yin, Yunhua Zhou, Xipeng Qiu, 17 Feb 2025, Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? https://arxiv.org/abs/2502.12215
- Shubham Parashar, Blake Olson, Sambhav Khurana, Eric Li, Hongyi Ling, James Caverlee, Shuiwang Ji, 18 Feb 2025, Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights, https://arxiv.org/abs/2502.12521
- Zihao Zeng, Xuyao Huang, Boxiu Li, Zhijie Deng, 19 Feb 2025, SIFT: Grounding LLM Reasoning in Contexts via Stickers, https://arxiv.org/abs/2502.14922 https://github.com/zhijie-group/SIFT (Multi-step reasoning where the LLM first generates a modified prompt that summarizes the key points, and then does inference for both the original and modified prompts, then comparing results and adjusting forwards and backwards.)
- Marthe Ballon, Andres Algaba, Vincent Ginis, 21 Feb 2025, The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer, https://arxiv.org/abs/2502.15631
- Maxwell Zeff, February 24, 2025, Anthropic launches a new AI model that ‘thinks’ as long as you want, https://techcrunch.com/2025/02/24/anthropic-launches-a-new-ai-model-that-thinks-as-long-as-you-want/
- Kif Leswing, Feb 26 2025, Nvidia CEO Huang says AI has to do ’100 times more’ computation now than when ChatGPT was released, https://www.cnbc.com/2025/02/26/nvidia-ceo-huang-says-next-generation-ai-will-need-more-compute.html (The thesis that AI reasoning will need 100 times more compute, regardless of whether it is a single-step "long answers" model thinking out loud, or a multi-step test time compute model.)
- Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, Yingying Zhang, Fei Yin, Jiahua Dong, Zhijiang Guo, Le Song, Cheng-Lin Liu, 25 Feb 2025 (v2), From System 1 to System 2: A Survey of Reasoning Large Language Models, https://arxiv.org/abs/2502.17419
- Wenkai Yang, Shuming Ma, Yankai Lin, Furu Wei, 25 Feb 2025, Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning, https://arxiv.org/abs/2502.18080 (Trying to generate the "shortest correct response" by examining the lengths needed for CoT.)
- Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Mengdi Zhang, Jian Shao, Yueting Zhuang, 13 Mar 2025 (v2), InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models, https://arxiv.org/abs/2503.06692
- Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, Wanxiang Che, 13 Mar 2025 (v2), Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models, https://arxiv.org/abs/2503.09567 (Massive and broad survey of all types of reasoning.)
- Eric Zhao, Pranjal Awasthi, Sreenivas Gollapudi, 20 Feb 2025 (v2), Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification https://arxiv.org/abs/2502.01839 (Wrapping a single model with a Best-of-N approach that self-selects the best answer can significantly improve reasoning rates.)
- Qianjun Pan, Wenkai Ji, Yuyang Ding, Junsong Li, Shilian Chen, Junyi Wang, Jie Zhou, Qin Chen, Min Zhang, Yulan Wu, Liang He, 8 May 2025 (v2), A Survey of Slow Thinking-based Reasoning LLMs using Reinforced Learning and Inference-time Scaling Law, https://arxiv.org/abs/2505.02665
- Michael Nuñez, July 15, 2025, OpenAI, Google DeepMind and Anthropic sound alarm: ‘We may be losing the ability to understand AI’, https://venturebeat.com/ai/openai-google-deepmind-and-anthropic-sound-alarm-we-may-be-losing-the-ability-to-understand-ai/ (Monitoring the text-based interim "thinking-out-loud" reasoning of models in CoT.)
- Tomek Korbak, Mikita Balesni, (and many more authors) July 2025, Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety, https://tomekkorbak.com/cot-monitorability-is-a-fragile-opportunity/cot_monitoring.pdf
- Sebastian Raschka, Mar 8, 2025, Inference-Time Compute Scaling Methods to Improve Reasoning Models: Part 1: Inference-Time Compute Scaling Methods, https://sebastianraschka.com/blog/2025/state-of-llm-reasoning-and-inference-scaling.html
- Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, Biqing Qi, Wanli Ouyang, Bowen Zhou, 10 Feb 2025, Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling, https://arxiv.org/abs/2502.06703
- Yilong Chen, Junyuan Shang, Zhenyu Zhang, Yanxi Xie, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang, 23 Feb 2025 (v2), Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking, https://arxiv.org/abs/2502.13842
- Brown Ebouky, Andrea Bartezzaghi, Mattia Rigotti, 13 Jun 2025, Eliciting Reasoning in Language Models with Cognitive Tools, https://arxiv.org/abs/2506.12115
- Tao Xu, Dung-Yang Lee and Momiao Xiong, 21 Jul 2025, Reinforcement Learning in hyperbolic space for multi-step reasoning, https://arxiv.org/abs/2507.16864
- Yaozong Gan, Guang Li, Ren Togo, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama, 23 Jul 2025, Cross-domain Multi-step Thinking: Zero-shot Fine-grained Traffic Sign Recognition in the Wild, https://arxiv.org/abs/2409.01534
- Tong Yang, Yu Huang, Yingbin Liang, Yuejie Chi, 11 Aug 2025, Multi-head Transformers Provably Learn Symbolic Multi-step Reasoning via Gradient Descent, https://arxiv.org/abs/2508.08222
- Jiaye Lin, Yifu Guo, Yuzhen Han, Sen Hu, Ziyi Ni, Licheng Wang, Mingguang Chen, Daxin Jiang, Binxing Jiao, Chen Hu, Huacan Wang, 4 Aug 2025, SE-Agent: Self-Evolution Trajectory Optimization in Multi-Step Reasoning with LLM-Based Agents, https://arxiv.org/abs/2508.02085
- Jiashuo Yu, Yue Wu, Meng Chu, Zhifei Ren, Zizheng Huang, Pei Chu, Ruijie Zhang, Yinan He, Qirui Li, Songze Li, Zhenxiang Li, Zhongying Tu, Conghui He, Yu Qiao, Yali Wang, Yi Wang, Limin Wang, 4 Aug 2025, VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos, https://arxiv.org/abs/2506.10857
- Shaofeng Yin, Ting Lei, Yang Liu, 5 Aug 2025, ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools, https://arxiv.org/abs/2508.03284
- Aske Plaat, Annie Wong, Suzan Verberne, Joost Broekens, Niki van Stein, Thomas Back, 13 Aug 2025, Multi-Step Reasoning with Large Language Models, a Survey, https://arxiv.org/abs/2407.11511
- Ayoub Ben Chaliah and Hela Dellagi, 18 Aug 2025, Datarus-R1: An Adaptive Multi-Step Reasoning LLM for Automated Data Analysis, https://arxiv.org/abs/2508.13382
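Several of the papers above (e.g. Large Language Monkeys, and the collaborative-verification work) share the same Best-of-N skeleton: sample many candidate answers from the model, then let a verifier pick the best one. A minimal sketch of that pattern is below; `noisy_model` and `checker` are hypothetical toy stand-ins for a real LLM call and a real verifier or reward model, not anything from the cited papers.

```python
import random

def best_of_n(generate, verify, prompt, n=8):
    """Best-of-N sampling: draw n candidate answers and return the one
    the verifier scores highest (first occurrence wins ties)."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=verify)

# Toy stand-ins: a "model" that answers 2+3 correctly only ~30% of the
# time, and a verifier that scores an answer with an arithmetic check.
def noisy_model(prompt):
    return 5 if random.random() < 0.3 else random.choice([4, 6, 7])

def checker(answer):
    return 1.0 if answer == 2 + 3 else 0.0

random.seed(0)
single = sum(noisy_model("2+3=?") == 5 for _ in range(1000)) / 1000
random.seed(0)
bon = sum(best_of_n(noisy_model, checker, "2+3=?", n=8) == 5
          for _ in range(1000)) / 1000
print(single, bon)  # Best-of-N accuracy is well above single-shot accuracy
```

With a 30% per-sample accuracy, the chance that at least one of 8 samples is correct is roughly 94%, which is the accuracy gain these repeated-sampling papers exploit, at the cost of N times the inference compute.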
General Reasoning Papers
- Junlin Wang, Siddhartha Jain, Dejiao Zhang, Baishakhi Ray, Varun Kumar, Ben Athiwaratkun, 11 Jun 2024 (v2), Reasoning in Token Economies: Budget-Aware Evaluation of LLM Reasoning Strategies, https://arxiv.org/abs/2406.06461
- Xuan Zhang, Chao Du, Tianyu Pang, Qian Liu, Wei Gao, Min Lin, 13 Jun 2024, Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs, https://arxiv.org/abs/2406.09136 Code: https://github.com/sail-sg/CPO
- Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, Karthik Narasimhan, 3 Dec 2023 (v2), Tree of Thoughts: Deliberate Problem Solving with Large Language Models, https://arxiv.org/abs/2305.10601 Code: https://github.com/princeton-nlp/tree-of-thought-llm
- Hayden Field, June 20, 2024, OpenAI competitor Anthropic announces its most powerful AI yet, CNBC, https://www.cnbc.com/2024/06/20/anthropic-claude-3point5-sonnet-ai-announced.html
- Wai-Chung Kwan, Xingshan Zeng, Yuxin Jiang, Yufei Wang, Liangyou Li, Lifeng Shang, Xin Jiang, Qun Liu, Kam-Fai Wong, 30 Jan 2024, MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models, https://arxiv.org/abs/2401.16745 Code: https://github.com/KwanWaiChung/MT-Eval
- M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, M. Podstawski, L. Gianinazzi, J. Gajda, T. Lehmann, H. Niewiadomski, P. Nyczyk et al., “Graph of thoughts: Solving elaborate problems with large language models,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 16, 2024, pp. 17682–17690. https://arxiv.org/abs/2308.09687
- Q. Sun, Z. Yin, X. Li, Z. Wu, X. Qiu, and L. Kong, “Corex: Pushing the boundaries of complex reasoning through multi-model collaboration,” arXiv preprint arXiv:2310.00280, 2023. https://arxiv.org/abs/2310.00280
- Tianle Li, Wei-Lin Chiang, Lisa Dunlap, May 20, 2024, Introducing Hard Prompts Category in Chatbot Arena, https://lmsys.org/blog/2024-05-17-category-hard/
- Myeonghwa Lee, Seonho An, Min-Soo Kim, 18 Jun 2024, PlanRAG: A Plan-then-Retrieval Augmented Generation for Generative Large Language Models as Decision Makers, https://arxiv.org/abs/2406.12430 Code: https://github.com/myeon9h/PlanRAG
- Rahul Verma, June 21, 2024, OpenAI's GPT-5 Pushed Back To Late 2025, But Promises Ph.D.-Level Abilities, https://in.mashable.com/tech/77593/openais-gpt-5-pushed-back-to-late-2025-but-promises-phd-level-abilities
- Krista Opsahl-Ong, Michael J Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, Omar Khattab, 17 Jun 2024, Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs, https://arxiv.org/abs/2406.11695
- Sachit Menon, Richard Zemel, Carl Vondrick, 20 Jun 2024, Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities, https://arxiv.org/abs/2406.14562
- Lina M. Rojas-Barahona, 2024, Talking to Machines: do you read me? Computation and Language, Université de Lorraine, https://hal.science/tel-04620199/document
- Rafe Brena, May 24, 2024, 3 Key Differences Between Human and Machine Intelligence You Need to Know: AI is an alien intelligence https://pub.towardsai.net/3-key-differences-between-human-and-machine-intelligence-you-need-to-know-7a34dcee2cd3 (Good article about how LLMs don't have "emotions" or "intelligence" and they don't "pause".)
- Vishal Rajput, Apr 11, 2024, What’s next for AI: AI agentic workflows? https://medium.com/aiguys/next-for-llms-and-rag-ai-agentic-workflows-1869ba0a6796
- Rachel Metz, July 12, 2024, OpenAI Scale Ranks Progress Toward ‘Human-Level’ Problem Solving: The company believes its technology is approaching the second level of five on the path to artificial general intelligence, Bloomberg, https://www.bloomberg.com/news/articles/2024-07-11/openai-sets-levels-to-track-progress-toward-superintelligent-ai?sref=P6Q0mxvj
- Sirui Chen, Bo Peng, Meiqi Chen, Ruiqi Wang, Mengying Xu, Xingyu Zeng, Rui Zhao, Shengjie Zhao, Yu Qiao, Chaochao Lu, 1 May 2024, Causal Evaluation of Language Models, https://arxiv.org/abs/2405.00622 Project: https://opencausalab.github.io/CaLM
- Anna Tong and Katie Paul July 16, 2024, Exclusive: OpenAI working on new reasoning technology under code name ‘Strawberry’, https://www.reuters.com/technology/artificial-intelligence/openai-working-new-reasoning-technology-under-code-name-strawberry-2024-07-12/
- Zehui Chen, Kuikun Liu, Qiuchen Wang, Jiangning Liu, Wenwei Zhang, Kai Chen, Feng Zhao, 29 Jul 2024, MindSearch: Mimicking Human Minds Elicits Deep AI Searcher, https://arxiv.org/abs/2407.20183 Code: https://github.com/InternLM/MindSearch Project: https://mindsearch.netlify.app
- Ethan Mollick, May 12, 2024, Superhuman? What does it mean for AI to be better than a human? And how can we tell? https://www.oneusefulthing.org/p/superhuman
- Ignacio de Gregorio, Aug 2024, Grokking, a New Form of Reasoning, https://medium.com/@ignacio.de.gregorio.noblejas/grokking-a-new-form-of-reasoning-6785ea89d2ec
- Zarif Bin Akhtar, Mapping Generative Artificial Intelligence (GAI's) Exciting Future: From Gemini to Q* and Beyond, https://publications.eai.eu/index.php/airo/article/view/5962 https://doi.org/10.4108/airo.5962 PDF: https://publications.eai.eu/index.php/airo/article/view/5962/3329
- Guangyan Sun, Mingyu Jin, Zhenting Wang, Cheng-Long Wang, Siqi Ma, Qifan Wang, Ying Nian Wu, Yongfeng Zhang, Dongfang Liu, 16 Aug 2024, Visual Agents as Fast and Slow Thinkers, https://arxiv.org/abs/2408.08862
- Adam Zewe, June 14, 2024, Technique improves the reasoning capabilities of large language models: Combining natural language and programming, the method enables LLMs to solve numerical, analytical, and language-based tasks transparently, MIT News, https://news.mit.edu/2024/technique-improves-reasoning-capabilities-large-language-models-0614
- Pei Zhou, Jay Pujara, Xiang Ren, Xinyun Chen, Heng-Tze Cheng, Quoc V. Le, Ed H. Chi, Denny Zhou, Swaroop Mishra, Huaixiu Steven Zheng, 6 Feb 2024, Self-Discover: Large Language Models Self-Compose Reasoning Structures, https://arxiv.org/abs/2402.03620
- Tinghui Zhu, Kai Zhang, Jian Xie, Yu Su, 4 Feb 2024 (v2), Deductive Beam Search: Decoding Deducible Rationale for Chain-of-Thought Reasoning, https://arxiv.org/abs/2401.17686
- Xiaoyu Liu, Paiheng Xu, Junda Wu, Jiaxin Yuan, Yifan Yang, Yuhang Zhou, Fuxiao Liu, Tianrui Guan, Haoliang Wang, Tong Yu, Julian McAuley, Wei Ai, Furong Huang, 14 Mar 2024, Large Language Models and Causal Inference in Collaboration: A Comprehensive Survey, https://arxiv.org/abs/2403.09606
- Jiace Zhu, Yingtao Shen, Jie Zhao, An Zou, 25 Aug 2024, Path-Consistency: Prefix Enhancement for Efficient Inference in LLM, https://arxiv.org/abs/2409.01281
- Cogni Down Under, Sep 2024, Reflection 70B: The AI That Thinks Before It Speaks, https://medium.com/@cognidownunder/reflection-70b-the-ai-that-thinks-before-it-speaks-8a70d3a0e38a
- Anson Ho, Tamay Besiroglu, Ege Erdil, David Owen, Robi Rahman, Zifan Carl Guo, David Atkinson, Neil Thompson, Jaime Sevilla, 9 Mar 2024, Algorithmic progress in language models, https://arxiv.org/abs/2403.05812
- Alberto Romero, Sep 10, 2024, Big News: OpenAI to Launch AI Model That Can Reason in 2 Weeks, https://www.thealgorithmicbridge.com/p/big-news-openai-to-launch-ai-model
- Zifan Zheng, Yezhaohui Wang, Yuxin Huang, Shichao Song, Bo Tang, Feiyu Xiong, Zhiyu Li, 5 Sep 2024, Attention Heads of Large Language Models: A Survey, https://arxiv.org/abs/2409.03752 https://github.com/IAAR-Shanghai/Awesome-Attention-Heads (This survey is about making attention mechanisms more performant, accurate and intelligent, rather than improving efficiency.)
- Asankhaya Sharma (codelion), Sep 2024, Optillm: Optimizing inference proxy for LLMs, https://github.com/codelion/optillm
- Louis Bouchard, Sep 12, 2024, OpenAI's o1 Model: The Future of Reasoning AI? What Sets It Apart, How OpenAI's o1 Model Thinks Through Problems (And Why It's Slower), https://www.louisbouchard.ai/openai-o1/
- OpenAI, September 12, 2024, Introducing OpenAI o1-preview, A new series of reasoning models for solving hard problems. https://openai.com/index/introducing-openai-o1-preview/
- OpenAI, September 12, 2024, Learning to Reason with LLMs, https://openai.com/index/learning-to-reason-with-llms/
- Nathan Lambert, Sep 05, 2024, OpenAI’s Strawberry, LM self-talk, inference scaling laws, and spending more on inference, Whether or not scaling works, we should spend more on inference, https://www.interconnects.ai/p/openai-strawberry-and-inference-scaling-laws
- Ignacio de Gregorio Noblejas, September 15, 2024, OpenAI Launches o1. Here’s All You Need to Know, https://thetechoasis.beehiiv.com/p/openai-launches-o1-heres-need-know
- Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, Hang Li, 27 Jun 2024 (v2), ReFT: Reasoning with Reinforced Fine-Tuning, https://arxiv.org/abs/2401.08967
- Tianqiao Liu, Zui Chen, Zitao Liu, Mi Tian, Weiqi Luo, 13 Sep 2024, Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding, https://arxiv.org/abs/2409.08561
- Michael Nuñez, September 16, 2024, SambaNova challenges OpenAI’s o1 model with Llama 3.1-powered demo on HuggingFace, https://venturebeat.com/ai/sambanova-challenges-openais-o1-model-with-llama-3-1-powered-demo-on-huggingface/
- Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, Greg Durrett, 18 Sep 2024, To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning, https://arxiv.org/abs/2409.12183
- Santosh Kumar Radha, Yasamin Nouri Jelyani, Ara Ghukasyan, Oktay Goktas, 19 Sep 2024, Iteration of Thought: Leveraging Inner Dialogue for Autonomous Large Language Model Reasoning, https://arxiv.org/abs/2409.12618
- Justin Chih-Yao Chen, Archiki Prasad, Swarnadeep Saha, Elias Stengel-Eskin, Mohit Bansal, 18 Sep 2024, MAgICoRe: Multi-Agent, Iterative, Coarse-to-Fine Refinement for Reasoning, https://arxiv.org/abs/2409.12147 https://github.com/dinobby/MAgICoRe
- Siyun Zhao, Yuqing Yang, Zilong Wang, Zhiyuan He, Luna K. Qiu, Lili Qiu, 23 Sep 2024, Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely, https://arxiv.org/abs/2409.14924
- Artem Shelamanov, Sep 2024, Why OpenAI’s o1 Model Is A Scam, https://pub.towardsai.net/why-openais-o1-model-is-a-scam-eb3356c3d70e
- Chloe Berger, October 2, 2024, Mark Cuban says his puppy is ‘smarter than AI is today’, https://fortune.com/2024/10/01/mark-cuban-dog-puppy-smarter-than-ai/
- Julia Love and Rachel Metz, October 2, 2024, Google Is Working on Reasoning AI, Chasing OpenAI’s Efforts, https://www.bloomberg.com/news/articles/2024-10-02/google-is-working-on-reasoning-ai-chasing-openai-s-efforts
- Zhenwen Liang, Ye Liu, Tong Niu, Xiangliang Zhang, Yingbo Zhou, Semih Yavuz, 5 Oct 2024, Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification, https://arxiv.org/abs/2410.05318
- Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, Mehrdad Farajtabar, 7 Oct 2024, GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models, https://arxiv.org/abs/2410.05229
- Sonya Huang, Pat Grady, and o1, Sequoia, October 9, 2024, Generative AI’s Act o1, https://www.sequoiacap.com/article/generative-ais-act-o1/
- Ignacio de Gregorio Noblejas, October 20, 2024, The Anti-LLM Revolution Begins, https://thetechoasis.beehiiv.com/p/the-anti-llm-revolution-begins
- Asif Razzaq, October 13, 2024, OpenR: An Open-Source AI Framework Enhancing Reasoning in Large Language Models, https://www.marktechpost.com/2024/10/13/openr-an-open-source-ai-framework-enhancing-reasoning-in-large-language-models/
- Siwei Wu, Zhongyuan Peng, Xinrun Du, Tuney Zheng, Minghao Liu, Jialong Wu, Jiachen Ma, Yizhi Li, Jian Yang, Wangchunshu Zhou, Qunshu Lin, Junbo Zhao, Zhaoxiang Zhang, Wenhao Huang, Ge Zhang, Chenghua Lin, J.H. Liu, 22 Oct 2024 (v2), A Comparative Study on Reasoning Patterns of OpenAI's o1 Model, https://arxiv.org/abs/2410.13639
- Latent Space, Nov 05, 2024, Inference, Fast and Slow. When System 1/System 2 analogies are not enough: The 6 types of LLM inference https://www.latent.space/p/inference-fast-and-slow
- Will Lockett, Nov 2024, Apple Calls BS On The AI Revolution, They aren’t late to the AI game; they are just the only sceptical big tech company. https://medium.com/predict/apple-calls-bullshit-on-the-ai-revolution-ae38fdf83392
- Anthony Ha, Nov 2024, OpenAI reportedly developing new strategies to deal with AI improvement slowdown, https://techcrunch.com/2024/11/09/openai-reportedly-developing-new-strategies-to-deal-with-ai-improvement-slowdown/
- Michael Nuñez, November 11, 2024, AI’s math problem: FrontierMath benchmark shows how far technology still has to go, https://venturebeat.com/ai/ais-math-problem-frontiermath-benchmark-shows-how-far-technology-still-has-to-go/
- Kyle Orland, 13 Nov 2024, What if AI doesn’t just keep getting better forever? New reports highlight fears of diminishing returns for traditional LLM training. https://arstechnica.com/ai/2024/11/what-if-ai-doesnt-just-keep-getting-better-forever/
- Carl Franzen, November 20, 2024, DeepSeek’s first reasoning model R1-Lite-Preview turns heads, beating OpenAI o1 performance, https://venturebeat.com/ai/deepseeks-first-reasoning-model-r1-lite-preview-turns-heads-beating-openai-o1-performance/
- Zhi Rui Tam, Cheng-Kuang Wu, Yi-Lin Tsai, Chieh-Yen Lin, Hung-yi Lee, Yun-Nung Chen, 14 Oct 2024 (v3), Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models, https://arxiv.org/abs/2408.02442
- Janelle Teng, Nov 26, 2024, AI's reasoning quandary, https://nextbigteng.substack.com/p/ais-reasoning-quandary
- Qwen Team, November 28, 2024, QwQ: Reflect Deeply on the Boundaries of the Unknown, https://qwenlm.github.io/blog/qwq-32b-preview/
- mshumer, Nov 2024, Open Reasoning Engine, https://github.com/mshumer/OpenReasoningEngine
- Tom Schaul, 25 Nov 2024, Boundless Socratic Learning with Language Games, https://arxiv.org/abs/2411.16905
- Alberto Romero, Dec 06, 2024, OpenAI Announces o1 Model And ChatGPT Pro ($200/Mo). OpenAI Christmas event: Day 1 of 12, https://www.thealgorithmicbridge.com/p/openai-announces-o1-model-and-chatgpt
- Justin Chih-Yao Chen, Zifeng Wang, Hamid Palangi, Rujun Han, Sayna Ebrahimi, Long Le, Vincent Perot, Swaroop Mishra, Mohit Bansal, Chen-Yu Lee, Tomas Pfister, 29 Nov 2024, Reverse Thinking Makes LLMs Stronger Reasoners, https://arxiv.org/abs/2411.19865
- Tiernan Ray, Dec. 10, 2024, How Cerebras boosted Meta's Llama to 'frontier model' performance. The company also demonstrates initial training of a one-trillion-parameter AI model on a single machine using conventional DDR5 memory chips. https://www.zdnet.com/article/how-cerebras-boosted-metas-llama-to-frontier-model-performance/
- Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, Yuandong Tian, 9 Dec 2024, Training Large Language Models to Reason in a Continuous Latent Space, https://arxiv.org/abs/2412.06769 (Performing reasoning in a model trained to operate in the embedding vector space, rather than more directly in the token space.)
- Arda Sevinc, Abdurrahman Gumus, 9 Dec 2024, AutoReason: Automatic Few-Shot Reasoning Decomposition, https://arxiv.org/abs/2412.06975 https://github.com/miralab-ai/autoreason
- Kyle Wiggers, December 14, 2024, ‘Reasoning’ AI models have become a trend, for better or worse, https://techcrunch.com/2024/12/14/reasoning-ai-models-have-become-a-trend-for-better-or-worse/
- Vincent-Pierre Berges, Barlas Oguz, December 12, 2024, Memory Layers at Scale, Meta, https://ai.meta.com/research/publications/memory-layers-at-scale/ https://github.com/facebookresearch/memory (Augmentation of an LLM with an additional key-value associative memory, by replacing some FFNs with a "memory layer".)
- Ekin Akyürek, Mehul Damani, Linlu Qiu, Han Guo, Yoon Kim, Jacob Andreas, 11 Nov 2024, The Surprising Effectiveness of Test-Time Training for Abstract Reasoning, https://arxiv.org/abs/2411.07279
- Noam Brown, Tuomas Sandholm, 16 Nov 2017 (v3), Safe and Nested Subgame Solving for Imperfect-Information Games, https://arxiv.org/abs/1705.02955 (An early pre-LLM paper on reasoning in multiple steps.)
- Maxwell Zeff, November 20, 2024, Current AI scaling laws are showing diminishing returns, forcing AI labs to change course, https://techcrunch.com/2024/11/20/ai-scaling-laws-are-showing-diminishing-returns-forcing-ai-labs-to-change-course/ ("at least 10 to 20x gains in model performance ...intelligent prompting, UX decisions, and passing context at the right time into the models...")
- Agnostiq, Dec 2024, multi-agent-llm: LLM based Multi-Agent methods: Lean implementation of various multi-agent LLM methods, including Iteration of Thought (IoT), https://github.com/AgnostiqHQ/multi-agent-llm
- Denise Holt, Dec 18, 2024, VERSES AI Crushes OpenAI o1 in Head to Head Competition: VERSES AI's New Genius™ Platform Delivers Far More Performance than OpenAI's Most Advanced Model at a Fraction of the Cost. https://deniseholt.substack.com/p/verses-ai-crushes-openai-o1-in-head
- Junnan Liu, Hongwei Liu, Linchen Xiao, Ziyi Wang, Kuikun Liu, Songyang Gao, Wenwei Zhang, Songyang Zhang, Kai Chen, 18 Dec 2024 (v2), Are Your LLMs Capable of Stable Reasoning? https://arxiv.org/abs/2412.13147 https://github.com/open-compass/GPassK
- Alberto Romero, Dec 21, 2024, OpenAI o3 Model Is a Message From the Future: Update All You Think You Know About AI. Incredible, a miracle, more than just a better state-of-the-art AI model. https://www.thealgorithmicbridge.com/p/openai-o3-model-is-a-message-from
- Sabrina Ortiz, Dec. 20, 2024, OpenAI unveils its most advanced o3 reasoning model on its last day of 'shipmas', https://www.zdnet.com/article/openai-unveils-its-most-advanced-o3-reasoning-model-on-its-last-day-of-shipmas/
- Jie He, Nan Hu, Wanqiu Long, Jiaoyan Chen, Jeff Z. Pan, 22 Dec 2024, MINTQA: A Multi-Hop Question Answering Benchmark for Evaluating LLMs on New and Tail Knowledge, https://arxiv.org/abs/2412.17032 https://github.com/probe2/multi-hop/ (Model evaluation of reasoning abilities.)
- Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, Dacheng Tao, 24 Dec 2024, Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search, https://arxiv.org/abs/2412.18319 https://github.com/HJYao00/Mulberry (Multimodal multi-step reasoning like CoT.)
- Xiangjue Dong, Maria Teleki, James Caverlee, 18 Dec 2024, A Survey on LLM Inference-Time Self-Improvement, https://arxiv.org/abs/2412.14352 https://github.com/dongxiangjue/Awesome-LLM-Self-Improvement (Broad survey of reasoning improvement methods from multi-step inference to RALM to decoding algorithms.)
- Junying Chen, Zhenyang Cai, Ke Ji, Xidong Wang, Wanlong Liu, Rongsheng Wang, Jianye Hou, Benyou Wang, 25 Dec 2024, HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs, https://arxiv.org/abs/2412.18925
- Lori Dajose, December 17, 2024, Thinking Slowly: The Paradoxical Slowness of Human Behavior, https://www.caltech.edu/about/news/thinking-slowly-the-paradoxical-slowness-of-human-behavior
- Aske Plaat, Annie Wong, Suzan Verberne, Joost Broekens, Niki van Stein, Thomas Back, 16 Jul 2024, Reasoning with Large Language Models, a Survey, https://arxiv.org/abs/2407.11511
- Mingyu Jin, Weidi Luo, Sitao Cheng, Xinyi Wang, Wenyue Hua, Ruixiang Tang, William Yang Wang, Yongfeng Zhang, 21 Nov 2024 (v2), Disentangling Memory and Reasoning Ability in Large Language Models, https://arxiv.org/abs/2411.13504 https://github.com/MingyuJ666/Disentangling-Memory-and-Reasoning
- Allen Nie, Yi Su, Bo Chang, Jonathan N. Lee, Ed H. Chi, Quoc V. Le, Minmin Chen, 8 Oct 2024, EVOLvE: Evaluating and Optimizing LLMs For Exploration, https://arxiv.org/abs/2410.06238
- Tianhao Wu, Janice Lan, Weizhe Yuan, Jiantao Jiao, Jason Weston, Sainbayar Sukhbaatar, 14 Oct 2024, Thinking LLMs: General Instruction Following with Thought Generation, https://arxiv.org/abs/2410.10630 (Training an LLM to reason by generating additional "thoughts" during training.)
- Xiang Huang, Jiayu Shen, Shanshan Huang, Sitao Cheng, Xiaxia Wang, Yuzhong Qu, 27 Dec 2024, TARGA: Targeted Synthetic Data Generation for Practical Reasoning over Structured Data, https://arxiv.org/abs/2412.19544
- Xiaoxue Cheng, Junyi Li, Wayne Xin Zhao, Ji-Rong Wen, 2 Jan 2025, Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking, https://arxiv.org/abs/2501.01306
- Mayi Xu, Yunfeng Ning, Yongqi Li, Jianhao Chen, Jintao Wen, Yao Xiao, Shen Zhou, Birong Pan, Zepeng Bao, Xin Miao, Hankun Kang, Ke Sun, Tieyun Qian, 2 Jan 2025, Reasoning based on symbolic and parametric knowledge bases: a survey, https://arxiv.org/abs/2501.01030 (Extensive survey of reasoning from CoT to knowledge graphs to table-based reasoning.)
- Yixin Ji, Juntao Li, Hai Ye, Kaixin Wu, Jia Xu, Linjian Mo, Min Zhang, 5 Jan 2025, Test-time Computing: from System-1 Thinking to System-2 Thinking, https://arxiv.org/abs/2501.02497
- Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, Zhicheng Dou, 9 Jan 2025, Search-o1: Agentic Search-Enhanced Large Reasoning Models, https://arxiv.org/abs/2501.05366 https://github.com/sunnynexus/Search-o1 (RAG retrieval and agentic methods applied to Large Reasoning Models.)
- Yuhang Liu, Pengxiang Li, Zishu Wei, Congkai Xie, Xueyu Hu, Xinchen Xu, Shengyu Zhang, Xiaotian Han, Hongxia Yang, Fei Wu, 8 Jan 2025, InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection, https://arxiv.org/abs/2501.04575
- Andrea Matarazzo, Riccardo Torlone, 3 Jan 2025, A Survey on Large Language Models with some Insights on their Capabilities and Limitations, https://arxiv.org/abs/2501.04040 (Broad survey with many LLM topics covered from history to architectures to optimizations.)
- Ben Hylak and swyx & Alessio, Jan 12, 2025, o1 isn’t a chat model (and that’s the point): How Ben Hylak turned from o1 pro skeptic to fan by overcoming his skill issue. https://www.latent.space/p/o1-skill-issue (Prompting reasoning models like "o1" is different from previous model generations.)
- NovaSky, Jan 2025, Sky-T1: Train your own O1 preview model within $450, https://novasky-ai.github.io/posts/sky-t1/
- Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, Hisham Cholakkal, Ivan Laptev, Mubarak Shah, Fahad Shahbaz Khan, Salman Khan, 10 Jan 2025, LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs, https://arxiv.org/abs/2501.06186
- François Chollet, 25 Nov 2019 (v2), On the Measure of Intelligence, https://arxiv.org/abs/1911.01547
- Siddharth Narayanan, James D. Braza, Ryan-Rhys Griffiths, Manu Ponnapati, Albert Bou, Jon Laurent, Ori Kabeli, Geemi Wellawatte, Sam Cox, Samuel G. Rodriques, Andrew D. White, 30 Dec 2024, Aviary: training language agents on challenging scientific tasks, https://arxiv.org/abs/2412.21154 (Using smaller models combined with multi-step reasoning to compete with big models with 100x less inference cost.)
- Di Zhang, Jianbo Wu, Jingdi Lei, Tong Che, Jiatong Li, Tong Xie, Xiaoshui Huang, Shufei Zhang, Marco Pavone, Yuqiang Li, Wanli Ouyang, Dongzhan Zhou, 21 Nov 2024 (v2), LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning, https://arxiv.org/abs/2410.02884
- Dan Zhang, Sining Zhoubian, Ziniu Hu, Yisong Yue, Yuxiao Dong, Jie Tang, 18 Nov 2024 (v3), ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search, https://arxiv.org/abs/2406.03816 https://github.com/THUDM/ReST-MCTS
- Jun Wang, Meng Fang, Ziyu Wan, Muning Wen, Jiachen Zhu, Anjie Liu, Ziqin Gong, Yan Song, Lei Chen, Lionel M. Ni, Linyi Yang, Ying Wen, Weinan Zhang, 12 Oct 2024, OpenR: An Open Source Framework for Advanced Reasoning with Large Language Models, https://arxiv.org/abs/2410.09671 https://openreasoner.github.io/
- Yiwei Qin, Xuefeng Li, Haoyang Zou, Yixiu Liu, Shijie Xia, Zhen Huang, Yixin Ye, Weizhe Yuan, Hector Liu, Yuanzhi Li, Pengfei Liu, 8 Oct 2024, O1 Replication Journey: A Strategic Progress Report -- Part 1. https://arxiv.org/abs/2410.18982
- Matthias Bastian, Oct 6, 2024, Study reveals major reasoning flaws in smaller AI language models, https://the-decoder.com/study-reveals-major-reasoning-flaws-in-smaller-ai-language-models/
- Paul Sawers, January 23, 2025, Meta’s Yann LeCun predicts a ‘new AI architectures paradigm’ within 5 years and ‘decade of robotics’, https://techcrunch.com/2025/01/23/metas-yann-lecun-predicts-a-new-ai-architectures-paradigm-within-5-years-and-decade-of-robotics/
- Latent Space, Jan 25, 2025, Why o3-mini *had* to be free: the coming DeepSeek R1, 2.0 Flash, and Sky-T1 Price War: 2025's biggest surprise so far: Reasoning is less of a moat than anyone thought. https://www.latent.space/p/reasoning-price-war
- Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Hubert Niewiadomski, Torsten Hoefler, 23 Jan 2025 (v3), Reasoning Language Models: A Blueprint, https://arxiv.org/abs/2501.11223 (Survey and blueprint for how to build a Large Reasoning Model.)
- Akash Bajwa Jan 27, 2025, The Post-R1 World: AI Economics Have Irreversibly Changed, https://akashbajwa.substack.com/p/the-post-r1-world
- G Wang, S Zhang, T Zhan, Z Shen, J Li, X Hu, X Sun, Jan 2025, Unlocking the Mysteries of OpenAI o1: A Survey of the Reasoning Abilities of Large Language Models, https://openreview.net/pdf?id=J0ADLa2rNp
- Lan Pan, Hanbo Xie, Robert C. Wilson, 29 Jan 2025, Large Language Models Think Too Fast To Explore Effectively, https://arxiv.org/abs/2501.18009
- Yue Wang, Qiuzhi Liu, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Linfeng Song, Dian Yu, Juntao Li, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu, 30 Jan 2025, Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs, https://arxiv.org/abs/2501.18585
- Han Zhong, Yutong Yin, Shenao Zhang, Xiaojun Xu, Yuanxin Liu, Yifei Zuo, Zhihan Liu, Boyi Liu, Sirui Zheng, Hongyi Guo, Liwei Wang, Mingyi Hong, Zhaoran Wang, 31 Jan 2025, BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning, https://arxiv.org/abs/2501.18858
- Sebastian Raschka, PhD, Feb 05, 2025, Understanding Reasoning LLMs: Methods and Strategies for Building and Refining Reasoning Models https://magazine.sebastianraschka.com/p/understanding-reasoning-llms
- Ahmed El-Kishky, Alexander Wei, Andre Saraiva, Borys Minaev, Daniel Selsam, David Dohan, Francis Song, Hunter Lightman, Ignasi Clavera, Jakub Pachocki, Jerry Tworek, Lorenz Kuhn, Lukasz Kaiser, Mark Chen, Max Schwarzer, Mostafa Rohaninejad, Nat McAleese, o3 contributors, Oleg Mürk, Rhythm Garg, Rui Shu, Szymon Sidor, Vineet Kosaraju, Wenda Zhou, 3 Feb 2025, Competitive Programming with Large Reasoning Models, https://arxiv.org/abs/2502.06807 (OpenAI's paper on o3 that has similar conclusions to what DeepSeek showed about Reinforcement Learning for reasoning models, namely that "scaling general-purpose reinforcement learning" still works.)
- Hieu Minh "Jord" Nguyen, 10 Feb 2025, A Survey of Theory of Mind in Large Language Models: Evaluations, Representations, and Safety Risks, https://arxiv.org/abs/2502.06470
- Daniel Fleischer, Moshe Berchansky, Gad Markovits, Moshe Wasserblat, 13 Feb 2025, SQuARE: Sequential Question Answering Reasoning Engine for Enhanced Chain-of-Thought in Large Language Models, https://arxiv.org/abs/2502.09390 https://github.com/IntelLabs/RAG-FiT/tree/square
- Salvatore Raieli, Feb 2025, The LLMs’ Dilemma: Thinking Too Much OR Too Little? Exploring the fine line between deep reasoning and computational overkill in large language models, https://levelup.gitconnected.com/the-llms-dilemma-thinking-too-much-or-too-little-619a7532a47e
- Ben Dickson, February 20, 2025, How test-time scaling unlocks hidden reasoning abilities in small language models (and allows them to outperform LLMs), https://venturebeat.com/ai/how-test-time-scaling-unlocks-hidden-reasoning-abilities-in-small-language-models-and-allows-them-to-outperform-llms/
- Ali Razghandi, Seyed Mohammad Hadi Hosseini, Mahdieh Soleymani Baghshah, 20 Feb 2025, CER: Confidence Enhanced Reasoning in LLMs, https://arxiv.org/abs/2502.14634 (Using model confidence metrics, i.e., logits, to evaluate reasoning pathways.)
- Zhipeng Chen, Yingqian Min, Beichen Zhang, Jie Chen, Jinhao Jiang, Daixuan Cheng, Wayne Xin Zhao, Zheng Liu, Xu Miao, Yang Lu, Lei Fang, Zhongyuan Wang, Ji-Rong Wen, 6 Mar 2025, An Empirical Study on Eliciting and Improving R1-like Reasoning Models, https://arxiv.org/abs/2503.04548 https://github.com/RUCAIBox/Slow_Thinking_with_LLMs
- Xiaoye Qu, Yafu Li, Zhaochen Su, Weigao Sun, Jianhao Yan, Dongrui Liu, Ganqu Cui, Daizong Liu, Shuxian Liang, Junxian He, Peng Li, Wei Wei, Jing Shao, Chaochao Lu, Yue Zhang, Xian-Sheng Hua, Bowen Zhou, Yu Cheng, 27 Mar 2025, A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond, https://arxiv.org/abs/2503.21614
- Kenneth Payne, Baptiste Alloui-Cros, 3 Jul 2025, Strategic Intelligence in Large Language Models: Evidence from evolutionary Game Theory, https://arxiv.org/abs/2507.02618
- Bin Hong, Jiayu Liu, Zhenya Huang, Kai Zhang, Mengdi Zhang, 13 Aug 2025, Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization, https://arxiv.org/abs/2508.10164
- Jingde Cheng, 14 Aug 2025, Why Cannot Large Language Models Ever Make True Correct Reasoning?, https://arxiv.org/abs/2508.10265
- Chuhuai Yue, Chengqi Dong, Yinan Gao, Hang He, Jiajun Chai, Guojun Yin and Wei Lin, 14 Aug 2025, Promoting Efficient Reasoning with Verifiable Stepwise Reward, https://arxiv.org/abs/2508.10293
- Mengtao Zhou, Sifan Wu, Huan Zhang, Qi Sima, Bang Liu, 14 Aug 2025, What to Ask Next? Probing the Imaginative Reasoning of LLMs with TurtleSoup Puzzles, https://arxiv.org/abs/2508.10358
- Runqi Qiao, Qiuna Tan, Peiqing Yang, Yanzi Wang, Xiaowan Wang, Enhui Wan, Sitong Zhou, Guanting Dong, Yuchen Zeng, Yida Xu, Jie Wang, Chong Sun, Chen Li, Honggang Zhang, 14 Aug 2025, We-Math 2.0: A Versatile MathBook System for Incentivizing Visual Mathematical Reasoning, https://arxiv.org/abs/2508.10433
- Yushi Feng, Junye Du, Yingying Hong, Qifan Wang, Lequan Yu, 14 Aug 2025, PASS: Probabilistic Agentic Supernet Sampling for Interpretable and Adaptive Chest X-Ray Reasoning, https://arxiv.org/abs/2508.10501
- Maël Jullien, Marco Valentino, and André Freitas, 14 Aug 2025, The Knowledge-Reasoning Dissociation: Fundamental Limitations of LLMs in Clinical Natural Language Inference, https://arxiv.org/abs/2508.10777
- Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, Guang Shi, 14 Aug 2025, Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models, https://arxiv.org/abs/2508.10751
- Li Wang, Changhao Zhang, Zengqi Xiu, Kai Lu, Xin Yu, Kui Zhang, Wenjun Wu, 7 Aug 2025, Decoupling Understanding from Reasoning via Problem Space Mapping for Small-scale Model Reasoning, https://arxiv.org/abs/2508.10019
- Chuan Li, Qianyi Zhao, Fengran Mo, Cen Chen, 7 Aug 2025, FedCoT: Communication-Efficient Federated Reasoning Enhancement for Large Language Models, https://arxiv.org/abs/2508.10020
- Kai Zhao, Yanjun Zhao, Jiaming Song, Shien He, Lusheng Zhang, Qiang Zhang, Tianjiao Li, 8 Aug 2025, SABER: Switchable and Balanced Training for Efficient LLM Reasoning, https://arxiv.org/abs/2508.10026
- Christopher Pinier, Sonia Acuña Vargas, Mariia Steeghs-Turchina, Dora Matzke, Claire E. Stevenson, Michael D. Nunez, 12 Aug 2025, Large Language Models Show Signs of Alignment with Human Neurocognition During Abstract Reasoning, https://arxiv.org/abs/2508.10057
- Juyuan Wang, Rongchen Zhao, Wei Wei, Yufeng Wang, Mo Yu, Jie Zhou, Jin Xu, Liyan Xu, 14 Aug 2025, ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning, https://arxiv.org/abs/2508.10419
- Mojtaba Safari, Shansong Wang, Mingzhe Hu, Zach Eidex, Qiang Li, and Xiaofeng Yang, 14 Aug 2025, Performance of GPT-5 in Brain Tumor MRI Reasoning, https://arxiv.org/abs/2508.10865
- Xingyu Wu, Yuchen Yan, Shangke Lyu, Linjuan Wu, Yiwen Qiu, Yongliang Shen, Weiming Lu, Jian Shao, Jun Xiao, Yueting Zhuang, 14 Aug 2025, LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization, https://arxiv.org/abs/2507.15758
- Liang Zhang, Edith Aurora Graf, 14 Aug 2025, Mathematical Computation and Reasoning Errors by Large Language Models, https://arxiv.org/abs/2508.09932
- Atin Pothiraj, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal, 13 Aug 2025, CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting, https://arxiv.org/abs/2504.15485
- GLM-V Team: Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Bin Chen, Boyan Shi, Changyu Pang, Chenhui Zhang, Da Yin, Fan Yang, Guoqing Chen, Jiazheng Xu, Jiale Zhu, Jiali Chen, Jing Chen, Jinhao Chen, Jinghao Lin, Jinjiang Wang, Junjie Chen, Leqi Lei, Letian Gong, Leyi Pan, Mingdao Liu, Mingde Xu, Mingzhi Zhang, Qinkai Zheng, Sheng Yang, Shi Zhong, Shiyu Huang, Shuyuan Zhao, Siyan Xue, Shangqin Tu, Shengbiao Meng, Tianshu Zhang, Tianwei Luo, Tianxiang Hao, Tianyu Tong, Wenkai Li, Wei Jia, Xiao Liu, Xiaohan Zhang, Xin Lyu, Xinyue Fan, Xuancheng Huang, Yanling Wang, Yadong Xue, Yanfeng Wang, Yanzi Wang, Yifan An, et al. (22 additional authors not shown), 14 Aug 2025, GLM-4.1V-Thinking and GLM-4.5V: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning, https://arxiv.org/abs/2507.01006
- Zhangquan Chen, Ruihui Zhao, Chuwei Luo, Mingze Sun, Xinlei Yu, Yangyang Kang, Ruqi Huang, 14 Aug 2025, SIFThinker: Spatially-Aware Image Focus for Visual Reasoning, https://arxiv.org/abs/2508.06259
- Yanhui Li, Yunkang Cao, Chengliang Liu, Yuan Xiong, Xinghui Dong, Chao Huang, 14 Aug 2025, IAD-R1: Reinforcing Consistent Reasoning in Industrial Anomaly Detection, https://arxiv.org/abs/2508.09178
- Mo Yu, Tsz Ting Chung, Chulun Zhou, Tong Li, Rui Lu, Jiangnan Li, Liyan Xu, Haoshu Lu, Ning Zhang, Jing Li, Jie Zhou, 14 Aug 2025, PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts, https://arxiv.org/abs/2508.09848
- Yihao Xue, Baharan Mirzasoleiman, 22 Jul 2025, LoRA is All You Need for Safety Alignment of Reasoning LLMs, https://arxiv.org/abs/2507.17075
- Qifan Zhang, Nuo Chen, Zehua Li, Miao Peng, Jing Tang, Jia Li, 23 Jul 2025, Improving LLMs' Generalized Reasoning Abilities by Graph Problems, https://arxiv.org/abs/2507.17168
- Luca Salvatore Lorello, Nikolaos Manginas, Marco Lippi, Stefano Melacci, 23 Jul 2025, LTLZinc: a Benchmarking Framework for Continual Learning and Neuro-Symbolic Temporal Reasoning, https://arxiv.org/abs/2507.17482
- Yu Li, Zhuoshi Pan, Honglin Lin, Mengyuan Sun, Conghui He, Lijun Wu, 23 Jul 2025, Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning, https://arxiv.org/abs/2507.17512
- Xinyao Liu, Diping Song, 23 Jul 2025, Constructing Ophthalmic MLLM for Positioning-diagnosis Collaboration Through Clinical Cognitive Chain Reasoning, https://arxiv.org/abs/2507.17539
- Zhao Song, Song Yue, Jiahao Zhang, 23 Jul 2025, Thinking Isn't an Illusion: Overcoming the Limitations of Reasoning Models via Tool Augmentations, https://arxiv.org/abs/2507.17699
- Tao Xu, Dung-Yang Lee and Momiao Xiong, 21 Jul 2025, Reinforcement Learning in hyperbolic space for multi-step reasoning, https://arxiv.org/abs/2507.16864
- Rishi Parekh, Saisubramaniam Gopalakrishnan, Zishan Ahmad, Anirudh Deodhar, 23 Jul 2025, Leveraging Knowledge Graphs and LLM Reasoning to Identify Operational Bottlenecks for Warehouse Planning Assistance, https://arxiv.org/abs/2507.17273
- Zhuokun Chen, Zeren Chen, Jiahao He, Mingkui Tan, Jianfei Cai, and Bohan Zhuang, 23 Jul 2025, R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning, https://arxiv.org/abs/2507.17307
- Xuchen Li, Xuzhao Li, Shiyu Hu, Kaiqi Huang, Wentao Zhang, 22 Jul 2025, CausalStep: A Benchmark for Explicit Stepwise Causal Reasoning in Videos, https://arxiv.org/abs/2507.16878
- Aleksandr Perevalov, Andreas Both, 22 Jul 2025, Text-to-SPARQL Goes Beyond English: Multilingual Question Answering Over Knowledge Graphs through Human-Inspired Reasoning, https://arxiv.org/abs/2507.16971
- Zhe Xu, Ziyi Liu, Junlin Hou, Jiabo Ma, Cheng Jin, Yihui Wang, Zhixuan Chen, Zhengyu Zhang, Zhengrui Guo, Fengtao Zhou, Yingxue Xu, Xi Wang, Ronald Cheong Kin Chan, Li Liang, Hao Chen, 23 Jul 2025, A Versatile Pathology Co-pilot via Reasoning Enhanced Multimodal Large Language Model, https://arxiv.org/abs/2507.17303
- Situo Zhang, Hanqi Li, Lu Chen, Zihan Zhao, Xuanze Lin, Zichen Zhu, Bo Chen, Xin Chen, Kai Yu, 23 Jul 2025, Reasoning-Driven Retrosynthesis Prediction with Large Language Models via Reinforcement Learning, https://arxiv.org/abs/2507.17448
- Alexander R. Fabbri, Diego Mares, Jorge Flores, Meher Mankikar, Ernesto Hernandez, Dean Lee, Bing Liu, Chen Xing, 23 Jul 2025, MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs, https://arxiv.org/abs/2507.17476
- Nima Fathi, Amar Kumar, Tal Arbel, 22 Jul 2025, AURA: A Multi-Modal Medical Agent for Understanding, Reasoning & Annotation, https://arxiv.org/abs/2507.16940
- Adrian Kaiser and Claudiu Leoveanu-Condrei and Ryan Gold and Marius-Constantin Dinu and Markus Hofmarcher, 23 Jul 2025, HyDRA: A Hybrid-Driven Reasoning Architecture for Verifiable Knowledge Graphs, https://arxiv.org/abs/2507.15917
- Wei Sun, Qianlong Du, Fuwei Cui, Jiajun Zhang, 23 Jul 2025, An Efficient and Precise Training Data Construction Framework for Process-supervised Reward Model in Mathematical Reasoning, https://arxiv.org/abs/2503.02382
- Lai Wei, Yuting Li, Kaipeng Zheng, Chen Wang, Yue Wang, Linghe Kong, Lichao Sun, Weiran Huang, 23 Jul 2025, Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start, https://arxiv.org/abs/2505.22334
- Yanjun Zheng, Xiyang Du, Longfei Liao, Xiaoke Zhao, Zhaowen Zhou, Bo Zhang, Jiawei Liu, Xiang Qi, Zhe Li, Zhiqiang Zhang, Wei Wang and Peng Zhang, 23 Jul 2025, Agentar-Fin-R1: Enhancing Financial Intelligence through Domain Expertise, Training Efficiency, and Advanced Reasoning, https://arxiv.org/abs/2507.16802
- Shai Shalev-Shwartz and Amnon Shashua, 13 Jul 2025, From Reasoning to Super-Intelligence: A Search-Theoretic Perspective, https://arxiv.org/abs/2507.15865
- Yin Wu, Daniel Slieter, Vivek Subramanian, Ahmed Abouelazm, Robin Bohn, and J. Marius Zöllner, 17 Jul 2025, Why Braking? Scenario Extraction and Reasoning Utilizing LLM, https://arxiv.org/abs/2507.15874
- Andy E. Williams, 18 Jul 2025, The Recursive Coherence Principle: A Formal Constraint on Scalable Intelligence, Alignment, and Reasoning Architecture, https://arxiv.org/abs/2507.15880
- Lisa Dargasz, 20 Jul 2025, Integrating Reason-Based Moral Decision-Making in the Reinforcement Learning Architecture, https://arxiv.org/abs/2507.15895
- Shengchao Liu, Hannan Xu, Yan Ai, Huanxin Li, Yoshua Bengio, Harry Guo, 21 Jul 2025, Expert-Guided LLM Reasoning for Battery Discovery: From AI-Driven Hypothesis to Synthesis and Characterization, https://arxiv.org/abs/2507.16110
- Fred Mutisya (1 and 2), Shikoh Gitau (1), Christine Syovata (2), Diana Oigara (2), Ibrahim Matende (2), Muna Aden (2), Munira Ali (2), Ryan Nyotu (2), Diana Marion (2), Job Nyangena (2), Nasubo Ongoma (1), Keith Mbae (1), Elizabeth Wamicha (1), Eric Mibuari (1), Jean Philbert Nsengemana (3), Talkmore Chidede (4) ((1) Qhala (Nairobi, Kenya), (2) Kenya Medical Association (Nairobi, Kenya), (3) Africa CDC (Addis Ababa, Ethiopia), (4) AfCFTA (Accra, Ghana)), 22 Jul 2025, Mind the Gap: Evaluating the Representativeness of Quantitative Medical Language Reasoning LLM Benchmarks for African Disease Burdens, https://arxiv.org/abs/2507.16322
- Lucas de Lara (IECL), 22 Jul 2025, Canonical Representations of Markovian Structural Causal Models: A Framework for Counterfactual Reasoning, https://arxiv.org/abs/2507.16370
- Bo Hou and Xin Tan and Kai Zheng and Fang Liu and Yinghao Zhu and Li Zhang, 22 Jul 2025, LLM-Driven Collaborative Model for Untangling Commits via Explicit and Implicit Dependency Reasoning, https://arxiv.org/abs/2507.16395
- Jean Lelong, Adnane Errazine and Annabelle Blangero, 22 Jul 2025, Agentic RAG with Knowledge Graphs for Complex Multi-Hop Reasoning in Real-World Applications, https://arxiv.org/abs/2507.16507
- Xu Yang, Qi Zhang, Shuming Jiang, Yaowen Xu, Zhaofan Zou, Hao Sun, Xuelong Li, 22 Jul 2025, METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark, https://arxiv.org/abs/2507.16206
- Mehul Damani, Isha Puri, Stewart Slocum, Idan Shenfeld, Leshem Choshen, Yoon Kim, Jacob Andreas, 22 Jul 2025, Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty, https://arxiv.org/abs/2507.16806
- Junhao Shen, Haiteng Zhao, Yuzhe Gu, Songyang Gao, Kuikun Liu, Haian Huang, Jianfei Gao, Dahua Lin, Wenwei Zhang, Kai Chen, 22 Jul 2025, Semi-off-Policy Reinforcement Learning for Vision-Language Slow-thinking Reasoning, https://arxiv.org/abs/2507.16814
- Isaac Shi and Zeyuan Li and Fan Liu and Wenli Wang and Lewei He and Yang Yang and Tianyu Shi, 13 Jul 2025, eSapiens's DEREK Module: Deep Extraction & Reasoning Engine for Knowledge with LLMs, https://arxiv.org/abs/2507.15863
- Run-Ze Fan and Zengzhi Wang and Pengfei Liu, 22 Jul 2025, MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning, https://arxiv.org/abs/2507.16812
- Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, Fu-En Yang, 22 Jul 2025, ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning, https://arxiv.org/abs/2507.16815
- Xiuwei Chen, Wentao Hu, Hanhui Li, Jun Zhou, Zisheng Chen, Meng Cao, Yihan Zeng, Kui Zhang, Yu-Jie Yuan, Jianhua Han, Hang Xu, Xiaodan Liang, 22 Jul 2025, C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning, https://arxiv.org/abs/2507.16518
- Ang Li, Charles Wang, Kaiyu Yue, Zikui Cai, Ollie Liu, Deqing Fu, Peng Guo, Wang Bill Zhu, Vatsal Sharan, Robin Jia, Willie Neiswanger, Furong Huang, Tom Goldstein, Micah Goldblum, 22 Jul 2025, Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning, https://arxiv.org/abs/2507.16746
- Edward Y. Chang, Zeyneb N. Kaya, Ethan Chang, 22 Jul 2025, The Unified Cognitive Consciousness Theory for Language Models: Anchoring Semantics, Thresholds of Activation, and Emergent Reasoning, https://arxiv.org/abs/2506.02139
- Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, Yasin Abbasi Yadkori, 22 Jul 2025, Hierarchical Reasoning Model, https://arxiv.org/abs/2506.21734
- Yitong Lin, Jiaying He, Jiahe Chen, Xinnan Zhu, Jianwei Zheng, Tao Bo, 22 Jul 2025, BioGraphFusion: Graph Knowledge Embedding for Biological Completion and Reasoning, https://arxiv.org/abs/2507.14468
- Shangke Lyu, Linjuan Wu, Yuchen Yan, Xingyu Wu, Hao Li, Yongliang Shen, Peisheng Jiang, Weiming Lu, Jun Xiao, Yueting Zhuang, 22 Jul 2025, Hierarchical Budget Policy Optimization for Adaptive Reasoning, https://arxiv.org/abs/2507.15844
- Haiteng Zhao, Chang Ma, Fangzhi Xu, Lingpeng Kong, Zhi-Hong Deng, 22 Jul 2025, BioMaze: Benchmarking and Enhancing Large Language Models for Biological Pathway Reasoning, https://arxiv.org/abs/2502.16660
- Xiachong Feng, Longxu Dou, Lingpeng Kong, 22 Jul 2025, Reasoning Does Not Necessarily Improve Role-Playing Ability, https://arxiv.org/abs/2502.16940
- Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu, Toby Boyd, Brad Hekman, Aaron Parisi, Chaoyi Zhang, Kornraphop Kawintiranon, Tania Bedrax-Weiss, Oliver Wang, Ya Xu, Ollie Purkiss, Uri Mendlovic, Ilaï Deutel, Nam Nguyen, Adam Langley, Flip Korn, Lucia Rossazza, Alexandre Ramé, Sagar Waghmare, Helen Miller, Nathan Byrd, Ashrith Sheshan, Raia Hadsell, Sangnie Bhardwaj, Pawel Janus, Tero Rissa, Dan Horgan, Sharon Silver, Ayzaan Wahid, Sergey Brin, Yves Raimond, Klemen Kloboves, et al. (3255 additional authors not shown), 22 Jul 2025, Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities, https://arxiv.org/abs/2507.06261
- SaiBarath Sundar, Pranav Satheesan, Udayaadithya Avadhanam, 23 Jul 2025, I2I-STRADA -- Information to Insights via Structured Reasoning Agent for Data Analysis, https://arxiv.org/abs/2507.17874
- Mutian Yang, Jiandong Gao, and Ji Wu, 24 Jul 2025, Decoupling Knowledge and Reasoning in LLMs: An Exploration Using Cognitive Dual-System Theory, https://arxiv.org/abs/2507.18178
- Zhuang Qiang Bok and Watson Wei Khong Chua, 24 Jul 2025, Reasoning Beyond the Obvious: Evaluating Divergent and Convergent Thinking in LLMs for Financial Scenarios, https://arxiv.org/abs/2507.18368
- Shiye Lei, Zhihao Cheng, Kai Jia, Dacheng Tao, 24 Jul 2025, Revisiting LLM Reasoning via Information Bottleneck, https://arxiv.org/abs/2507.18391
- Xiaoxu Guo, Siyan Liang, Yachao Cui, Juxiang Zhou, Lei Wang, Han Cao, 21 Jul 2025, Multimodal Fine-grained Reasoning for Post Quality Evaluation, https://arxiv.org/abs/2507.17934
- Datta Nimmaturi, Vaishnavi Bhargava, Rajat Ghosh, Johnu George, Debojyoti Dutta, 24 Jul 2025, Predictive Scaling Laws for Efficient GRPO Training of Large Reasoning Models, https://arxiv.org/abs/2507.18014
- Matthias Otth, Jonas Hübotter, Ido Hakimi, Andreas Krause, 24 Jul 2025, Maximizing Prefix-Confidence at Test-Time Efficiently Improves Mathematical Reasoning, https://arxiv.org/abs/2507.18122
- Dongyang Guo, Yasmeen Abdrabou, Enkeleda Thaqi, Enkelejda Kasneci, 24 Jul 2025, Multimodal Behavioral Patterns Analysis with Eye-Tracking and LLM-Based Reasoning, https://arxiv.org/abs/2507.18252
- Rana Alshaikh, Israa Alghanmi, Shelan Jeawak, 24 Jul 2025, AraTable: Benchmarking LLMs' Reasoning and Understanding of Arabic Tabular Data, https://arxiv.org/abs/2507.18442
- Busra Icoz, Goksel Biricik, 24 Jul 2025, Automated Code Review Using Large Language Models with Symbolic Reasoning, https://arxiv.org/abs/2507.18476
- Andres M Bran, Theo A Neukomm, Daniel P Armstrong, Zlatko Jončev, Philippe Schwaller, 23 Jul 2025, Chemical reasoning in LLMs unlocks strategy-aware synthesis planning and reaction mechanism elucidation, https://arxiv.org/abs/2503.08537
- Bowen Zhang, Pengcheng Luo, 24 Jul 2025, OR-LLM-Agent: Automating Modeling and Solving of Operations Research Optimization Problems with Reasoning LLM, https://arxiv.org/abs/2503.10009
- David Guzman Piedrahita, Yongjin Yang, Mrinmaya Sachan, Giorgia Ramponi, Bernhard Schölkopf, Zhijing Jin, 24 Jul 2025, Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games, https://arxiv.org/abs/2506.23276
- Martina Miliani, Serena Auriemma, Alessandro Bondielli, Emmanuele Chersoni, Lucia Passaro, Irene Sucameli, Alessandro Lenci, 24 Jul 2025, ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models, https://arxiv.org/abs/2502.15487
- Junming Liu, Siyuan Meng, Yanting Gao, Song Mao, Pinlong Cai, Guohang Yan, Yirong Chen, Zilin Bian, Ding Wang, Botian Shi, 24 Jul 2025, Aligning Vision to Language: Annotation-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning, https://arxiv.org/abs/2503.12972
- Renato Ghisellini and Remo Pareschi and Marco Pedroni and Giovanni Battista Raggi, 18 Jul 2025, From Extraction to Synthesis: Entangled Heuristics for Agent-Augmented Strategic Reasoning, https://arxiv.org/abs/2507.13768
- Shmuel Berman, Jia Deng, 4 Jul 2025, VLMs have Tunnel Vision: Evaluating Nonlocal Visual Reasoning in Leading VLMs, https://arxiv.org/abs/2507.13361
- Binbin Ji, Siddharth Agrawal, Qiance Tang, and Yvonne Wu, 6 Jul 2025, Enhancing Spatial Reasoning in Vision-Language Models via Chain-of-Thought Prompting and Reinforcement Learning, https://arxiv.org/abs/2507.13362
- Peican Zhu, Yubo Jing, Le Cheng, Bin Chen, Xiaodong Cui, Lianwei Wu, Keke Tang, 17 Jul 2025, SEER: Semantic Enhancement and Emotional Reasoning Network for Multimodal Fake News Detection, https://arxiv.org/abs/2507.13415
- Ishant Chintapatla, Kazuma Choji, Naaisha Agarwal, Andrew Lin, Hannah You, Charles Duong, Kevin Zhu, Sean O'Brien, Vasu Sharma, 17 Jul 2025, COREVQA: A Crowd Observation and Reasoning Entailment Visual Question Answering Benchmark, https://arxiv.org/abs/2507.13405
- Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, Wanxiang Che, 18 Jul 2025, Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models, https://arxiv.org/abs/2503.09567
- Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, Mehrdad Farajtabar, 18 Jul 2025, The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity, https://arxiv.org/abs/2506.06941
- Zhiting Mei, Christina Zhang, Tenny Yin, Justin Lidard, Ola Shorinwa, Anirudha Majumdar, 18 Jul 2025, Reasoning about Uncertainty: Do Reasoning Models Know When They Don't Know?, https://arxiv.org/abs/2506.18183
- Ahmed Bahloul, Simon Malberg, 18 Jul 2025, From Roots to Rewards: Dynamic Tree Reasoning with RL, https://arxiv.org/abs/2507.13142
- Thomas Foster, Anya Sims, Johannes Forkel, Mattie Fellows, Jakob Foerster, 18 Jul 2025, Learning to Reason at the Frontier of Learnability, https://arxiv.org/abs/2502.12272
- Constantin Venhoff, Iván Arcuschin, Philip Torr, Arthur Conmy, Neel Nanda, 17 Jul 2025, Understanding Reasoning in Thinking Language Models via Steering Vectors, https://arxiv.org/abs/2506.18167
- Jiayu Song, Mahmud Elahi Akhter, Dana Atzil Slonim, Maria Liakata, 18 Jul 2025, Temporal reasoning for timeline summarisation in social media, https://arxiv.org/abs/2501.00152
- Z.Z. Ren, Zhihong Shao, Junxiao Song, Huajian Xin, Haocheng Wang, Wanjia Zhao, Liyue Zhang, Zhe Fu, Qihao Zhu, Dejian Yang, Z.F. Wu, Zhibin Gou, Shirong Ma, Hongxuan Tang, Yuxuan Liu, Wenjun Gao, Daya Guo, Chong Ruan, 18 Jul 2025, DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition, https://arxiv.org/abs/2504.21801
- Bingqian Lin, Yunshuang Nie, Khun Loun Zai, Ziming Wei, Mingfei Han, Rongtao Xu, Minzhe Niu, Jianhua Han, Liang Lin, Cewu Lu, Xiaodan Liang, 18 Jul 2025, EvolveNav: Self-Improving Embodied Reasoning for LLM-Based Vision-Language Navigation, https://arxiv.org/abs/2506.01551
- Humza Sami, Mubashir ul Islam, Pierre-Emmanuel Gaillardon, Valerio Tenace, 18 Jul 2025, Adaptive Multi-Agent Reasoning via Automated Workflow Generation, https://arxiv.org/abs/2507.14393
- Michael J. Zellinger and Matt Thomson, 18 Jul 2025, Fail Fast, or Ask: Mitigating the Deficiencies of Reasoning LLMs with Human-in-the-Loop Systems Engineering, https://arxiv.org/abs/2507.14406
- Cole Robertson and Philip Wolff, 21 Jul 2025, LLM world models are mental: Output layer evidence of brittle world model use in LLM mechanical reasoning, https://arxiv.org/abs/2507.15521
- Licheng Liu, Zihan Wang, Linjie Li, Chenwei Xu, Yiping Lu, Han Liu, Avirup Sil, Manling Li, 18 Jul 2025, A Simple "Try Again" Can Elicit Multi-Turn LLM Reasoning, https://arxiv.org/abs/2507.14295
- Ole-Christoffer Granmo and Youmna Abdelwahab and Per-Arne Andersen and Paul F. A. Clarke and Kunal Dumbre and Ylva Grønninsæter and Vojtech Halenka and Runar Helin and Lei Jiao and Ahmed Khalid and Rebekka Omslandseter and Rupsa Saha and Mayur Shende and Xuan Zhang, 20 Jul 2025, The Tsetlin Machine Goes Deep: Logical Learning and Reasoning With Graphs, https://arxiv.org/abs/2507.14874
- Yiyuan Yang, Zichuan Liu, Lei Song, Kai Ying, Zhiguang Wang, Tom Bamford, Svitlana Vyetrenko, Jiang Bian, Qingsong Wen, 20 Jul 2025, Time-RA: Towards Time Series Reasoning for Anomaly with LLM Feedback, https://arxiv.org/abs/2507.15066
- Jiaao Li, Kaiyuan Li, Chen Gao, Yong Li, Xinlei Chen, 21 Jul 2025, EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent, https://arxiv.org/abs/2507.15428
- Seok Hwan Song, Mohna Chakraborty, Qi Li, Wallapak Tavanapong, 21 Jul 2025, Is Large Language Model Performance on Reasoning Tasks Impacted by Different Ways Questions Are Asked?, https://arxiv.org/abs/2507.15707
- Sahana Srinivasan, Xuguang Ai, Thaddaeus Wai Soon Lo, Aidan Gilson, Minjie Zou, Ke Zou, Hyunjae Kim, Mingjia Yang, Krithi Pushpanathan, Samantha Yew, Wan Ting Loke, Jocelyn Goh, Yibing Chen, Yiming Kong, Emily Yuelei Fu, Michelle Ongyong Hui, Kristen Nwanyanwu, Amisha Dave, Kelvin Zhenghao Li, Chen-Hsin Sun, Mark Chia, Gabriel Dawei Yang, Wendy Meihua Wong, David Ziyou Chen, Dianbo Liu, Maxwell Singer, Fares Antaki, Lucian V Del Priore, Jost Jonas, Ron Adelman, Qingyu Chen, Yih-Chung Tham, 21 Jul 2025, BEnchmarking LLMs for Ophthalmology (BELO) for Ophthalmological Knowledge and Reasoning, https://arxiv.org/abs/2507.15717
- Yihao Li, Jiayi Xin, Miranda Muqing Miao, Qi Long, Lyle Ungar, 21 Jul 2025, The Impact of Language Mixing on Bilingual LLM Reasoning, https://arxiv.org/abs/2507.15849
- Fengxiang Cheng, Haoxuan Li, Fenrong Liu, Robert van Rooij, Kun Zhang, Zhouchen Lin, 21 Jul 2025, Empowering LLMs with Logical Reasoning: A Comprehensive Survey, https://arxiv.org/abs/2502.15652
- Shaohang Wei, Wei Li, Feifan Song, Wen Luo, Tianyi Zhuang, Haochen Tan, Zhijiang Guo, Houfeng Wang, 19 Jul 2025, TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios, https://arxiv.org/abs/2505.12891
- Kun Xiang and Heng Li and Terry Jingchen Zhang and Yinya Huang and Zirong Liu and Peixin Qu and Jixi He and Jiaqi Chen and Yu-Jie Yuan and Jianhua Han and Hang Xu and Hanhui Li and Mrinmaya Sachan and Xiaodan Liang, 21 Jul 2025, SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning, https://arxiv.org/abs/2505.19099
- Xin Wang, Jiyao Liu, Yulong Xiao, Junzhi Ning, Lihao Liu, Junjun He, Botian Shi, Kaicheng Yu, 21 Jul 2025, THE-Tree: Can Tracing Historical Evolution Enhance Scientific Verification and Reasoning?, https://arxiv.org/abs/2506.21763
- Konstantinos I. Roumeliotis, Ranjan Sapkota, Manoj Karkee, Nikolaos D. Tselikas, 18 Jul 2025, Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning, https://arxiv.org/abs/2507.10571
- Michal Spiegel, Michal Štefánik, Marek Kadlčík, Josef Kuchař, 21 Jul 2025, Attend or Perish: Benchmarking Attention in Algorithmic Reasoning, https://arxiv.org/abs/2503.01909
- Junhong Lin, Xinyue Zeng, Jie Zhu, Song Wang, Julian Shun, Jun Wu, Dawei Zhou, 21 Jul 2025, Plan and Budget: Effective and Efficient Test-Time Scaling on Large Language Model Reasoning, https://arxiv.org/abs/2505.16122
- Justin Chih-Yao Chen, Sukwon Yun, Elias Stengel-Eskin, Tianlong Chen, Mohit Bansal, 18 Jul 2025, Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning, https://arxiv.org/abs/2503.05641
- Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, Jiawei Han, 21 Jul 2025, Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning, https://arxiv.org/abs/2503.09516
- Bo-Cheng Chiu, Jen-Jee Chen, Yu-Chee Tseng and Feng-Chi Chen, 21 Jul 2025, DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs, https://arxiv.org/abs/2506.11558
- Yiming Yang, Yueru Luo, Bingkun He, Hongbin Lin, Suzhong Fu, Chao Zheng, Zhipeng Cao, Erlong Li, Chao Yan, Shuguang Cui, Zhen Li, 20 Jul 2025, TopoStreamer: Temporal Lane Segment Topology Reasoning in Autonomous Driving, https://arxiv.org/abs/2507.00709
- Anirudh Choudhary, Mosbah Aouad, Krishnakant Saboo, Angelina Hwang, Jacob Kechter, Blake Bordeaux, Puneet Bhullar, David DiCaudo, Steven Nelson, Nneka Comfere, Emma Johnson, Olayemi Sokumbi, Jason Sluzevich, Leah Swanson, Dennis Murphree, Aaron Mangold, Ravishankar Iyer, 19 Jul 2025, RACR-MIL: Rank-aware contextual reasoning for weakly supervised grading of squamous cell carcinoma using whole slide images, https://arxiv.org/abs/2308.15618
- Blair Johnson, Clayton Kerce, Faramarz Fekri, 8 Aug 2025, GLIDR: Graph-Like Inductive Logic Programming with Differentiable Reasoning, https://arxiv.org/abs/2508.06716
- Amit Dhanda, 10 Aug 2025, Multi-Dimensional Summarization Agents with Context-Aware Reasoning over Enterprise Tables, https://arxiv.org/abs/2508.07186
- Yi Tang, Kaini Wang, Yang Chen, Guangquan Zhou, 10 Aug 2025, EndoAgent: A Memory-Guided Reflective Agent for Intelligent Endoscopic Vision-to-Decision Reasoning, https://arxiv.org/abs/2508.07292
- He Kong, Die Hu, Jingguo Ge, Liangxiong Li, Hui Li and Tong Li, 10 Aug 2025, Pentest-R1: Towards Autonomous Penetration Testing Reasoning Optimized via Two-Stage Reinforcement Learning, https://arxiv.org/abs/2508.07382
- Wenkai Li, Liwen Sun, Zhenxiang Guan, Xuhui Zhou, Maarten Sap, 11 Aug 2025, 1-2-3 Check: Enhancing Contextual Privacy in LLM via Multi-Agent Reasoning, https://arxiv.org/abs/2508.07667
- Zhenpeng Su, Leiyu Pan, Xue Bai, Dening Liu, Guanting Dong, Jiaming Huang, Wenping Hu, Guorui Zhou, 11 Aug 2025, Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization, https://arxiv.org/abs/2508.07629
- Zihe Liu, Jiashun Liu, Yancheng He, Weixun Wang, Jiaheng Liu, Ling Pan, Xinyu Hu, Shaopan Xiong, Ju Huang, Jian Hu, Shengyi Huang, Siran Yang, Jiamang Wang, Wenbo Su, Bo Zheng, 11 Aug 2025, Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning, https://arxiv.org/abs/2508.08221
- Tong Yang, Yu Huang, Yingbin Liang, Yuejie Chi, 11 Aug 2025, Multi-head Transformers Provably Learn Symbolic Multi-step Reasoning via Gradient Descent, https://arxiv.org/abs/2508.08222
- Logan Cross, Erik Brockbank, Tobias Gerstenberg, Judith E. Fan, Daniel L. K. Yamins, Nick Haber, 25 Jul 2025, Understanding Human Limits in Pattern Recognition: A Computational Model of Sequential Reasoning in Rock, Paper, Scissors, https://arxiv.org/abs/2508.06503
- Wenhan Liu, Xinyu Ma, Weiwei Sun, Yutao Zhu, Yuchen Li, Dawei Yin, Zhicheng Dou, 9 Aug 2025, ReasonRank: Empowering Passage Ranking with Strong Reasoning Ability, https://arxiv.org/abs/2508.07050
- Lijie Yang, Zhihao Zhang, Arti Jain, Shijie Cao, Baihong Yuan, Yiwei Chen, Zhihao Jia, Ravi Netravali, 9 Aug 2025, Less Is More: Training-Free Sparse Attention with Global Locality for Efficient Reasoning, https://arxiv.org/abs/2508.07101
- Fabio Vitali, 10 Aug 2025, From Knowledge to Conjectures: A Modal Framework for Reasoning about Hypotheses, https://arxiv.org/abs/2508.07304
- Anirudh Iyengar Kaniyar Narayana Iyengar, Srija Mukhopadhyay, Adnan Qidwai, Shubhankar Singh, Dan Roth, Vivek Gupta, 11 Aug 2025, InterChart: Benchmarking Visual Reasoning Across Decomposed and Distributed Chart Information, https://arxiv.org/abs/2508.07630
- Chaohong Guo, Xun Mo, Yongwei Nie, Xuemiao Xu, Chao Xu, Fei Yu, and Chengjiang Long, 11 Aug 2025, TAR-TVG: Enhancing VLMs with Timestamp Anchor-Constrained Reasoning for Temporal Video Grounding, https://arxiv.org/abs/2508.07683
- Shoaib Ahmmad, Zubayer Ahmed Aditto, Md Mehrab Hossain, Noushin Yeasmin, Shorower Hossain, 11 Aug 2025, Autonomous Navigation of Cloud-Controlled Quadcopters in Confined Spaces Using Multi-Modal Perception and LLM-Driven High Semantic Reasoning, https://arxiv.org/abs/2508.07885
- Meixiu Long, Duolin Sun, Dan Yang, Junjie Wang, Yue Shen, Jian Wang, Peng Wei, Jinjie Gu, Jiahai Wang, 11 Aug 2025, DIVER: A Multi-Stage Approach for Reasoning-intensive Information Retrieval, https://arxiv.org/abs/2508.07995
- Zhonghao Yan, Muxi Diao, Yuxuan Yang, Jiayuan Xu, Kaizhou Zhang, Ruoyan Jing, Lele Yang, Yanxi Liu, Kongming Liang and Zhanyu Ma, 11 Aug 2025, MedReasoner: Reinforcement Learning Drives Reasoning Grounding from Clinical Thought to Pixel-Level Precision, https://arxiv.org/abs/2508.08177
- Shansong Wang and Mingzhe Hu and Qiang Li and Mojtaba Safari and Xiaofeng Yang, 11 Aug 2025, Capabilities of GPT-5 on Multimodal Medical Reasoning, https://arxiv.org/abs/2508.08224
- Yuan Sui, Yufei He, Tri Cao, Simeng Han, Yulin Chen, Bryan Hooi, 8 Aug 2025, Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models, https://arxiv.org/abs/2502.19918
- Alberto Pozanco, Marianela Morales, Daniel Borrajo, Manuela Veloso, 11 Aug 2025, A Planning Compilation to Reason about Goal Achievement at Planning Time, https://arxiv.org/abs/2503.09545
- Annie Wong, Thomas Bäck, Aske Plaat, Niki van Stein and Anna V. Kononova, 10 Aug 2025, Reasoning Capabilities of Large Language Models on Dynamic Tasks, https://arxiv.org/abs/2505.10543
- Lucia Cipolina-Kun and Marianna Nezhurina and Jenia Jitsev, 10 Aug 2025, Game Reasoning Arena: A Framework and Benchmark for Assessing Reasoning Capabilities of Large Language Models via Game Play, https://arxiv.org/abs/2508.03368
- Yiye Chen, Harpreet Sawhney, Nicholas Gydé, Yanan Jian, Jack Saunders, Patricio Vela, Ben Lundell, 8 Aug 2025, Schema-Guided Scene-Graph Reasoning based on Multi-Agent Large Language Model System, https://arxiv.org/abs/2502.03450
- Rong Cheng, Jinyi Liu, Yan Zheng, Fei Ni, Jiazhen Du, Hangyu Mao, Fuzheng Zhang, Bo Wang, Jianye Hao, 9 Aug 2025, DualRAG: A Dual-Process Approach to Integrate Reasoning and Retrieval for Multi-Hop Question Answering, https://arxiv.org/abs/2504.18243
- Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song, 11 Aug 2025, Learning to Reason without External Rewards, https://arxiv.org/abs/2505.19590
- Yunpeng Gao, Zhigang Wang, Pengfei Han, Linglin Jing, Dong Wang, Bin Zhao, 11 Aug 2025, Exploring Spatial Representation to Enhance LLM Reasoning in Aerial Vision-Language Navigation, https://arxiv.org/abs/2410.08500
- Seyed Pouyan Mousavi Davoudi, Amin Gholami Davodi, Alireza Amiri-Margavi, Alireza Shafiee Fard, Mahdi Jafari, 9 Aug 2025, Collective Reasoning Among LLMs: A Framework for Answer Validation Without Ground Truth, https://arxiv.org/abs/2502.20758
- Lo Pang-Yun Ting, Chengshuai Zhao, Yu-Hua Zeng, Yuan Jee Lim, Kun-Ta Chuang, Huan Liu, 9 Aug 2025, Leaps Beyond the Seen: Reinforced Reasoning Augmented Generation for Clinical Notes, https://arxiv.org/abs/2506.05386
- Mihir Godbole, Xiangbo Gao, Zhengzhong Tu, 9 Aug 2025, DRAMA-X: A Fine-grained Intent Prediction and Risk Reasoning Benchmark For Driving, https://arxiv.org/abs/2506.17590
- Minghao Guo, Xi Zhu, Jingyuan Huang, Kai Mei, Yongfeng Zhang, 11 Aug 2025, ReaGAN: Node-as-Agent-Reasoning Graph Agentic Network, https://arxiv.org/abs/2508.00429
- Andrew Kiruluta, 18 Jul 2025, Wavelet Logic Machines: Learning and Reasoning in the Spectral Domain Without Neural Networks, https://arxiv.org/abs/2507.19514
- Aditya Sharma, Linh Nguyen, Ananya Gupta, Chengyu Wang, Chiamaka Adebayo, and Jakub Kowalski, 26 Jul 2025, Inducing Causal World Models in LLMs for Zero-Shot Physical Reasoning, https://arxiv.org/abs/2507.19855
- Enjun Du, Siyi Liu, Yongqi Zhang, 28 Jul 2025, Mixture of Length and Pruning Experts for Knowledge Graphs Reasoning, https://arxiv.org/abs/2507.20498
- Ansh Poonia, Maeghal Jain, 28 Jul 2025, Dissecting Persona-Driven Reasoning in Language Models via Activation Patching, https://arxiv.org/abs/2507.20936
- Dong Du, Shulin Liu, Tao Yang, Shaohua Chen, Yang Li, 26 Jul 2025, UloRL: An Ultra-Long Output Reinforcement Learning Approach for Advancing Large Language Models' Reasoning Abilities, https://arxiv.org/abs/2507.19766
- Eunkyu Park, Wesley Hanwen Deng, Gunhee Kim, Motahhare Eslami, Maarten Sap, 27 Jul 2025, Cognitive Chain-of-Thought: Structured Multimodal Reasoning about Social Situations, https://arxiv.org/abs/2507.20409
- Xun Liang, Xin Guo, Zhongming Jin, Weihang Pan, Penghui Shang, Deng Cai, Binbin Lin, Jieping Ye, 28 Jul 2025, Enhancing Spatial Reasoning through Visual and Textual Thinking, https://arxiv.org/abs/2507.20529
- Adrien Bazoge, 28 Jul 2025, MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation, https://arxiv.org/abs/2507.20917
- Aleksandar Pavlovic, Emanuel Sallinger, Steven Schockaert, 26 Jul 2025, Faithful Differentiable Reasoning with Reshuffled Region-based Embeddings, https://arxiv.org/abs/2406.09529
- Zheng Zhang, Nuoqian Xiao, Qi Chai, Deheng Ye, Hao Wang, 28 Jul 2025, MultiMind: Enhancing Werewolf Agents with Multimodal Reasoning and Theory of Mind, https://arxiv.org/abs/2504.18039
- Kun Li, Zhennan Wu, Shoupeng Wang, Jia Wu, Shirui Pan and Wenbin Hu, 28 Jul 2025, DrugPilot: LLM-based Parameterized Reasoning Agent for Drug Discovery, https://arxiv.org/abs/2505.13940
- Yun Qu, Qi Wang, Yixiu Mao, Vincent Tao Hu, Björn Ommer, Xiangyang Ji, 28 Jul 2025, Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models?, https://arxiv.org/abs/2507.04632
- Xuzhao Li and Xuchen Li and Shiyu Hu and Yongzhen Guo and Wentao Zhang, 26 Jul 2025, VerifyBench: A Systematic Benchmark for Evaluating Reasoning Verifiers Across Domains, https://arxiv.org/abs/2507.09884
- Ningyuan Xi, Xiaoyu Wang, Yetao Wu, Teng Chen, Qingqing Gu, Yue Zhao, Jinxian Qu, Zhonglin Jiang, Yong Chen, Luo Ji, 26 Jul 2025, MeTHanol: Modularized Thinking Language Models with Intermediate Layer Thinking, Decoding and Bootstrapping Reasoning, https://arxiv.org/abs/2409.12059
- Shamus Sim and Tyrone Chen, 28 Jul 2025, Critique of Impure Reason: Unveiling the reasoning behaviour of medical Large Language Models, https://arxiv.org/abs/2412.15748
- Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, Pengfei Liu, 28 Jul 2025, LIMO: Less is More for Reasoning, https://arxiv.org/abs/2502.03387
- Runyu Jiao, Alice Fasoli, Francesco Giuliari, Matteo Bortolon, Sergio Povoli, Guofeng Mei, Yiming Wang, Fabio Poiesi, 28 Jul 2025, Free-form language-based robotic reasoning and grasping, https://arxiv.org/abs/2503.13082
- Xiangning Yu, Zhuohan Wang, Linyi Yang, Haoxuan Li, Anjie Liu, Xiao Xue, Jun Wang, Mengyue Yang, 26 Jul 2025, Causal Sufficiency and Necessity Improves Chain-of-Thought Reasoning, https://arxiv.org/abs/2506.09853
- Yifu Han and Geo Zhang, 27 Jul 2025, Reinforcement learning fine-tuning of language model for instruction following and math reasoning, https://arxiv.org/abs/2506.21560
- Shenghe Zheng, Qianjia Cheng, Junchi Yao, Mengsong Wu, Haonan He, Ning Ding, Yu Cheng, Shuyue Hu, Lei Bai, Dongzhan Zhou, Ganqu Cui, Peng Ye, 28 Jul 2025, Scaling Physical Reasoning with the PHYSICS Dataset, https://arxiv.org/abs/2506.00022
- Khanh Son Pham, Christian Witte, Jens Behley, Johannes Betz, Cyrill Stachniss, 28 Jul 2025, Coherent Online Road Topology Estimation and Reasoning with Standard-Definition Maps, https://arxiv.org/abs/2507.01397
AI Books from Aussie AI
- The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory. Get your copy from Amazon: The Sweetest Lesson
- RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures. Get your copy from Amazon: RAG Optimization
- Generative AI Applications book. Get your copy from Amazon: Generative AI Applications
- Generative AI programming book. Get your copy from Amazon: Generative AI in C++
- CUDA C++ Optimization book. Get your copy from Amazon: CUDA C++ Optimization
- CUDA C++ Debugging book. Get your copy from Amazon: CUDA C++ Debugging
More AI Research
Read more about: