Aussie AI

Mechanistic Interpretability

  • Last Updated 27 August 2025
  • by David Spuler, Ph.D.

Mechanistic interpretability is the analysis of an LLM's internal computations during inference, with the goal of understanding or interpreting why the model emitted the answers that it did. This involves analyzing the activation signals in the latent space of the embeddings. Mechanistic interpretability was initially a read-only analysis to aid explainability, but arithmetic modification of the activations is also possible, in methods such as attention steering and activation patching; a minimal sketch of such an intervention appears below.
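To make the read-versus-write distinction concrete, here is a minimal sketch of activation steering in PyTorch. It assumes a Hugging Face GPT-2 model; the choice of layer, the steering strength, and the two contrast prompts are illustrative assumptions, and the difference-of-means steering vector is a deliberately simplified recipe, not any particular paper's method.

```python
# Minimal activation-steering sketch (illustrative, not a canonical method).
# Requires: pip install torch transformers

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

LAYER = 6    # transformer block to intervene on (illustrative assumption)
ALPHA = 4.0  # steering strength (illustrative assumption)

def mean_activation(prompt: str) -> torch.Tensor:
    """Read-only interpretability step: mean hidden state after block LAYER."""
    ids = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so block LAYER's
    # output sits at index LAYER + 1.
    return out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0)

# Steering vector: direction from a "negative" prompt's activations
# toward a "positive" prompt's activations (difference of means).
steer = (mean_activation("I love this film. It is wonderful.")
         - mean_activation("I hate this film. It is terrible."))

def steering_hook(module, inputs, output):
    # Write step: arithmetically modify the activations by adding the
    # steering direction to every token position in this block's output.
    hidden = output[0]
    return (hidden + ALPHA * steer,) + tuple(output[1:])

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
try:
    ids = tokenizer("The movie was", return_tensors="pt")
    with torch.no_grad():
        steered = model.generate(**ids, max_new_tokens=20, do_sample=False)
    print(tokenizer.decode(steered[0]))
finally:
    handle.remove()  # always detach the hook so the model is left unmodified
```

In practice, steering vectors are usually estimated from many prompt pairs and applied only at selected layers and token positions; the sketch shows only the core mechanism: read the activations, compute a direction, and add it back in during generation.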

Research on Mechanistic Interpretability

Research papers on mechanistic interpretability:

  • Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao, 2 Jul 2024, A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models, https://arxiv.org/abs/2407.02646
  • Mengru Wang, Yunzhi Yao, Ziwen Xu, Shuofei Qiao, Shumin Deng, Peng Wang, Xiang Chen, Jia-Chen Gu, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang, 4 Dec 2024 (v4), Knowledge Mechanisms in Large Language Models: A Survey and Perspective, https://arxiv.org/abs/2407.15017
  • Xintong Wang, Jingheng Pan, Longqin Jiang, Liang Ding, Xingshan Li, Chris Biemann, 23 Oct 2024, CogSteer: Cognition-Inspired Selective Layer Intervention for Efficient Semantic Steering in Large Language Models, https://arxiv.org/abs/2410.17714
  • Hesam Hosseini, Ghazal Hosseini Mighan, Amirabbas Afzali, Sajjad Amini, Amir Houmansadr, 15 Nov 2024, ULTra: Unveiling Latent Token Interpretability in Transformer Based Understanding, https://arxiv.org/abs/2411.12589
  • Naomi Saphra, Sarah Wiegreffe, 7 Oct 2024, Mechanistic? https://arxiv.org/abs/2410.09087
  • Leonard Bereska, Efstratios Gavves, 23 Aug 2024 (v3), Mechanistic Interpretability for AI Safety -- A Review, https://arxiv.org/abs/2404.14082
  • Neel Nanda, 8 Jul 2024, An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2, https://www.alignmentforum.org/posts/NfFST5Mio7BCAQHPA/an-extremely-opinionated-annotated-list-of-my-favourite
  • Nikhil Anand, Dec 20, 2024, Understanding “steering” in LLMs And how simple math can solve global problems. https://ai.gopubby.com/understanding-steering-in-llms-96faf6e0bee7
  • Chashi Mahiul Islam, Samuel Jacob Chacko, Mao Nishino, Xiuwen Liu, 7 Feb 2025, Mechanistic Understandings of Representation Vulnerabilities and Engineering Robust Vision Transformers, https://arxiv.org/abs/2502.04679
  • Ala N. Tak, Amin Banayeeanzade, Anahita Bolourani, Mina Kian, Robin Jia, Jonathan Gratch, 8 Feb 2025, Mechanistic Interpretability of Emotion Inference in Large Language Models, https://arxiv.org/abs/2502.05489
  • Artem Kirsanov, Chi-Ning Chou, Kyunghyun Cho, SueYeon Chung, 11 Feb 2025, The Geometry of Prompting: Unveiling Distinct Mechanisms of Task Adaptation in Language Models, https://arxiv.org/abs/2502.08009
  • Zeping Yu, Yonatan Belinkov, Sophia Ananiadou, 15 Feb 2025, Back Attention: Understanding and Enhancing Multi-Hop Reasoning in Large Language Models, https://arxiv.org/abs/2502.10835
  • Lukasz Bartoszcze, Sarthak Munshi, Bryan Sukidi, Jennifer Yen, Zejia Yang, David Williams-King, Linh Le, Kosi Asuzu, Carsten Maple, 24 Feb 2025, Representation Engineering for Large-Language Models: Survey and Research Challenges, https://arxiv.org/abs/2502.17601
  • Samuel Miller, Daking Rai, Ziyu Yao, 20 Feb 2025, Mechanistic Understanding of Language Models in Syntactic Code Completion, https://arxiv.org/abs/2502.18499
  • Yifan Zhang, Wenyu Du, Dongming Jin, Jie Fu, Zhi Jin, 27 Feb 2025, Finite State Automata Inside Transformers with Chain-of-Thought: A Mechanistic Study on State Tracking, https://arxiv.org/abs/2502.20129
  • Kevin Meng, David Bau, Alex Andonian, Yonatan Belinkov, 13 Jan 2023 (v5), Locating and Editing Factual Associations in GPT, https://arxiv.org/abs/2202.05262
  • Yingbing Huang, Deming Chen, Abhishek K. Umrawal, 28 Feb 2025, JAM: Controllable and Responsible Text Generation via Causal Reasoning and Latent Vector Manipulation, https://arxiv.org/abs/2502.20684
  • J. Katta, M. Allanki and N. R. Kodumuru, "Understanding Sarcasm Detection Through Mechanistic Interpretability," 2025 4th International Conference on Sentiment Analysis and Deep Learning (ICSADL), Bhimdatta, Nepal, 2025, pp. 990-995, doi: 10.1109/ICSADL65848.2025.10933475. https://ieeexplore.ieee.org/abstract/document/10933475/
  • Ying Shen, Lifu Huang, 20 Mar 2025, LLM Braces: Straightening Out LLM Predictions with Relevant Sub-Updates, https://arxiv.org/abs/2503.16334
  • Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adria Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, Eric J. Michaud, Stephen Casper, Max Tegmark, William Saunders, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Tom McGrath, 27 Jan 2025, Open Problems in Mechanistic Interpretability, https://arxiv.org/abs/2501.16496
  • Hang Chen, Jiaying Zhu, Xinyu Yang, Wenya Wang, 15 May 2025, Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates, https://arxiv.org/abs/2505.10039
  • Jingcheng Niu, Xingdi Yuan, Tong Wang, Hamidreza Saghir, Amir H. Abdi, 14 May 2025, Llama See, Llama Do: A Mechanistic Perspective on Contextual Entrainment and Distraction in LLMs, https://arxiv.org/abs/2505.09338
  • Vishnu Kabir Chhabra, Mohammad Mahdi Khalili, 5 Apr 2025, Towards Understanding and Improving Refusal in Compressed Models via Mechanistic Interpretability, https://arxiv.org/abs/2504.04215
  • Michael Nuñez, July 15, 2025, OpenAI, Google DeepMind and Anthropic sound alarm: ‘We may be losing the ability to understand AI’, https://venturebeat.com/ai/openai-google-deepmind-and-anthropic-sound-alarm-we-may-be-losing-the-ability-to-understand-ai/ (Monitoring the text-based interim "thinking-out-loud" reasoning of models in CoT.)
  • Tomek Korbak, Mikita Balesni, (and many more authors) July 2025, Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety, https://tomekkorbak.com/cot-monitorability-is-a-fragile-opportunity/cot_monitoring.pdf
  • M Toker, I Galil, H Orgad, R Gal, Y Tewel, G Chechik, 2025, Padding Tone: A Mechanistic Analysis of Padding Tokens in T2I Models, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7618–7632, April 29 – May 4, 2025, https://aclanthology.org/anthology-files/pdf/naacl/2025.naacl-long.389.pdf (Padding effect on image generation.)
  • Abir Harrasse, Philip Quirke, Clement Neo, Dhruv Nathawani, Luke Marks and Amir Abdullah, 27 Jul 2025, TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research, https://arxiv.org/abs/2503.12730
  • Xi Chen, Aske Plaat, Niki van Stein, 24 Jul 2025, How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding, https://arxiv.org/abs/2507.22928
  • Charles O'Neill, Mudith Jayasekara, Max Kirkby, 12 Aug 2025, Resurrecting the Salmon: Rethinking Mechanistic Interpretability with Domain-Specific Sparse Autoencoders, https://arxiv.org/abs/2508.09363
  • Neta Glazer, Yael Segal-Feldman, Hilit Segev, Aviv Shamsian, Asaf Buchnick, Gill Hetz, Ethan Fetaya, Joseph Keshet, Aviv Navon, 21 Aug 2025, Beyond Transcription: Mechanistic Interpretability in ASR, https://arxiv.org/abs/2508.15882
