Aussie AI
Best of N Inference
-
Last Updated 26 August, 2025
-
by David Spuler, Ph.D.
Research on Best of N Inference
Research papers include:
- Siwei Wu, Zhongyuan Peng, Xinrun Du, Tuney Zheng, Minghao Liu, Jialong Wu, Jiachen Ma, Yizhi Li, Jian Yang, Wangchunshu Zhou, Qunshu Lin, Junbo Zhao, Zhaoxiang Zhang, Wenhao Huang, Ge Zhang, Chenghua Lin, J.H. Liu, 22 Oct 2024 (v2), A Comparative Study on Reasoning Patterns of OpenAI's o1 Model, https://arxiv.org/abs/2410.13639
- Hanshi Sun, Momin Haider, Ruiqi Zhang, Huitao Yang, Jiahao Qiu, Ming Yin, Mengdi Wang, Peter Bartlett, Andrea Zanette, 26 Oct 2024, Fast Best-of-N Decoding via Speculative Rejection, https://arxiv.org/abs/2410.20290
- Do Xuan Long, Duong Ngoc Yen, Anh Tuan Luu, Kenji Kawaguchi, Min-Yen Kan, Nancy F. Chen, 1 Nov 2024, Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models, https://arxiv.org/abs/2411.00492
- Yinlam Chow, Guy Tennenholtz, Izzeddin Gur, Vincent Zhuang, Bo Dai, Sridhar Thiagarajan, Craig Boutilier, Rishabh Agarwal, Aviral Kumar, Aleksandra Faust, 18 Dec 2024, Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models, https://arxiv.org/abs/2412.15287
- Violet Xiang, Charlie Snell, Kanishk Gandhi, Alon Albalak, Anikait Singh, Chase Blagden, Duy Phung, Rafael Rafailov, Nathan Lile, Dakota Mahan, Louis Castricato, Jan-Philipp Franken, Nick Haber, Chelsea Finn, 8 Jan 2025, Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought, https://arxiv.org/abs/2501.04682
- Tong Xiao, Jingbo Zhu, 16 Jan 2025, Foundations of Large Language Models, https://arxiv.org/abs/2501.09223 (Huge 230-page paper covering many topics such as training, prompting, alignment, and long context.)
- Kuang-Huei Lee, Ian Fischer, Yueh-Hua Wu, Dave Marwood, Shumeet Baluja, Dale Schuurmans, Xinyun Chen, 17 Jan 2025, Evolving Deeper LLM Thinking, https://arxiv.org/abs/2501.09891 (An alternative search strategy broad/deep, compared to CoT and reflection.)
- Edward Beeching, Lewis Tunstall, Sasha Rush Dec 16, 2024, Scaling Test Time Compute with Open Source Models, https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute
- Yafu Li, Zhilin Wang, Tingchen Fu, Ganqu Cui, Sen Yang, Yu Cheng, 21 Jan 2025, From Drafts to Answers: Unlocking LLM Potential via Aggregation Fine-Tuning, https://arxiv.org/abs/2501.11877 (Fine-tune an LLM to accept multiple candidate answers and output a final one.)
- Weihua Du, Yiming Yang, Sean Welleck, 7 Feb 2025, Optimizing Temperature for Language Models with Multi-Sample Inference, https://arxiv.org/abs/2502.05234 https://github.com/StigLidu/TURN
- Juntai Cao, Xiang Zhang, Raymond Li, Chuyuan Li, Shafiq Joty, Giuseppe Carenini, 27 Feb 2025, Multi2: Multi-Agent Test-Time Scalable Framework for Multi-Document Processing, https://arxiv.org/abs/2502.20592 (Test-time compute applied to the multi-document summarization use case.)
- Komal Kumar, Tajamul Ashraf, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Phillip H.S. Torr, Salman Khan, Fahad Shahbaz Khan, 28 Feb 2025, LLM Post-Training: A Deep Dive into Reasoning Large Language Models, https://arxiv.org/abs/2502.21321 https://github.com/mbzuai-oryx/Awesome-LLM-Post-training
- Chengsong Huang, Langlin Huang, Jixuan Leng, Jiacheng Liu, Jiaxin Huang, 25 Feb 2025, Efficient Test-Time Scaling via Self-Calibration, https://arxiv.org/abs/2503.00031
- Yiming Wang, Pei Zhang, Siyuan Huang, Baosong Yang, Zhuosheng Zhang, Fei Huang, Rui Wang, 3 Mar 2025, Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding, https://arxiv.org/abs/2503.01422
- Yiwei Li, Jiayi Shi, Shaoxiong Feng, Peiwen Yuan, Xinglin Wang, Yueqi Zhang, Ji Zhang, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li, 7 Mar 2025, Speculative Decoding for Multi-Sample Inference, https://arxiv.org/abs/2503.05330 (Optimizing speculative decoding when generating multiple answers for a single query, such as for Best-of-N reasoning.)
- Eric Zhao, Pranjal Awasthi, Sreenivas Gollapudi, 20 Feb 2025 (v2), Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification, https://arxiv.org/abs/2502.01839 (Wrapping a single model with a Best-of-N approach that self-selects the best answer can significantly improve reasoning rates.)
- Ningning Wang, Xavier Hu, Pai Liu, He Zhu, Yue Hou, Heyuan Huang, Shengyu Zhang, Jian Yang, Jiaheng Liu, Ge Zhang, Changwang Zhang, Jun Wang, Yuchen Eleanor Jiang, Wangchunshu Zhou, 24 Jul 2025, Efficient Agents: Building Effective Agents While Reducing Cost, https://arxiv.org/pdf/2508.02694 https://github.com/OPPO-PersonalAI/OAgents
- Shubham Toshniwal, Ivan Sorokin, Aleksander Ficek, Ivan Moshkov, Igor Gitman, 23 Jul 2025, GenSelect: A Generative Approach to Best-of-N, https://arxiv.org/abs/2507.17797
- Jizhou Guo, Zhaomin Wu, Hanchen Yang, Philip S. Yu, 29 Jul 2025, Mining Intrinsic Rewards from LLM Hidden States for Efficient Best-of-N Sampling, https://arxiv.org/abs/2505.12225
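The papers above study variations of the same core loop: sample N candidate answers from the model, score each candidate (for example with a reward model, a verifier, or self-evaluation), and return the highest-scoring one. Below is a minimal Python sketch of that loop, assuming hypothetical generate() and score() helpers as stand-ins for real LLM and reward-model calls; it is an illustration of the basic idea, not the method of any paper listed here.

```python
import random

def generate(prompt: str, temperature: float = 0.8) -> str:
    """Hypothetical stand-in for one sampled LLM completion."""
    return f"candidate answer to '{prompt}' (seed={random.random():.3f})"

def score(prompt: str, answer: str) -> float:
    """Hypothetical stand-in for a reward model or verifier score."""
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    """Sample N candidate answers and return the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(score(prompt, c), c) for c in candidates]
    return max(scored)[1]

if __name__ == "__main__":
    print(best_of_n("What is 17 * 24?", n=4))
```

In a real system, generate() would be a temperature-sampled LLM call and score() would be a trained reward model or task-specific verifier; much of the research listed above is about reducing the cost of this loop (e.g., early rejection of weak candidates) or improving how the best candidate is selected.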
AI Books from Aussie AI
- The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory. Get your copy from Amazon: The Sweetest Lesson
- RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures. Get your copy from Amazon: RAG Optimization
- Generative AI Applications book. Get your copy from Amazon: Generative AI Applications
- Generative AI programming book. Get your copy from Amazon: Generative AI in C++
- CUDA C++ Optimization book. Get your copy from Amazon: CUDA C++ Optimization
- CUDA C++ Debugging book. Get your copy from Amazon: CUDA C++ Debugging
More AI Research Topics
Read more about:
- 500+ LLM Inference Optimization Techniques
- What's Hot in LLM Inference Optimization in 2025?
- Inference Optimization Research
- « Research Home