Aussie AI

Debugging AI Models and Frameworks

  • Last Updated 14 August, 2025
  • by David Spuler, Ph.D.

I heard a rumor that AI frameworks are just code, and AI models are just data. So there must be bugs! This article is about real, hard-core coding bugs, the nasty kind that sneak in with all of the performance tuning going on, not the higher-level AI problems of safety and accuracy.

The reality is that an AI engine is some of the most difficult code you'll ever see. Parallelized code of any kind (e.g., low-level hardware acceleration, multi-threading, multi-GPU) multiplies that complexity by another order of magnitude. Hence, the basics of high-quality coding practice are more important than ever, such as:

  • Unit tests
  • Assertions and self-testing code
  • Debug tracing code
  • Automated system tests (regression testing)
  • Error handling (e.g. starting with checking error return codes)
  • Exception handling (wrapping code in a full exception handling stack)

All of these techniques involve a significant chunk of extra coding work. The classic estimate is that full error and exception handling can be 80% of a finalized software product, which makes it four times the size of the core logic! Maybe that estimate is a little outdated, given improvements in modern tech stacks, but it still contains a grain of truth.

There are many programming tools to help the debugging cycle:

  • C++ memory debugging tools (e.g. Valgrind on Linux)
  • Performance profiling tools (for "de-slugging")
  • Memory usage tracking (i.e., allocated memory measurement)
  • Interactive debugging tools (e.g., in the IDE, GNU gdb, etc.)

Random Number Seeds

Neural network code often uses random numbers: to improve accuracy, for stochastic algorithms, or just for randomized testing. A random number generator needs a "seed" to get started, which is set via the "srand" function in C++ (declared in <cstdlib>). The typical way to initialize the generator, so that it's truly random, is to use the current time:

    srand((unsigned)time(NULL));

But that's not good for debugging! We don't want randomness when we're trying to reproduce a bug!

A generalized plan is to have a debugging or regression testing mode where the seed is fixed.

    if (g_yapi_debug_srand_seed != 0) {
        srand((unsigned)g_yapi_debug_srand_seed);   // Non-random randomness!
    }
    else {  // Normal run
        srand((unsigned)time(NULL));
    }

The test harness has to set the global debug variable when it's doing a regression test. For example, it can be hard-coded into a testing function, or it can be set via a command-line argument to your test harness executable.

This is better, but if a bug occurs in production, we won't know what seed was used. So better code also prints out the seed number, in case you need it later to reproduce a bug that occurred live.

    if (g_yapi_debug_srand_seed != 0) {
        srand((unsigned)g_yapi_debug_srand_seed);   // Non-random randomness!
    }
    else {  // Normal run
        long iseed = (long)time(NULL);
        fprintf(stderr, "INFO: Random number seed: %ld 0x%lx\n", iseed, iseed);
        srand((unsigned)iseed);
    }

Research on Debugging AI Framework Code

Papers on the issues of debugging the actual code that runs AI models, including the code inside the frameworks and ML compilers, include:

General Debugging Techniques Research

Research on general program debugging methods:

Testing AI Applications

Research on testing of AI apps in general (not just model evaluation of LLMs):

GPU Testing

GPU testing covers a variety of techniques for detecting errors in GPU hardware, which can arise from aging, overheating, or transient soft errors. GPU failures are a common problem in large-scale AI training jobs: a modern cluster can contain 100,000+ GPU chips, each of which has a small chance of failure. There are various tools, both commercial and open source, for running stress tests on GPUs.

Papers on GPU testing:

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++: Generative AI programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging

More AI Research

Read more about: