Aussie AI
Activation Patching
-
Last Updated 25 August, 2025
-
by David Spuler, Ph.D.
What is Activation Patching?
Activation patching is an LLM interpretability and control technique that involves directly modifying the model's dynamic activation computations. In other words, the numbers in the "latent space" of embedding vectors can be changed directly at run-time, without any change to the model's weights. Activation patching is often used to perform activation steering, which can adjust attributes such as the tone or style of the output. It is also related to other techniques that directly modify activation vectors, such as prompt tuning and prefix tuning. The general classes of algorithms that work directly on embedding activations in latent space include mechanistic interpretability and representation engineering.
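As a concrete illustration, here is a minimal sketch of the basic patch-and-rerun loop in Python, assuming the HuggingFace transformers library with GPT-2. The layer index, prompts, and helper function names are illustrative assumptions for this sketch, not a standard API.

```python
# Minimal activation patching sketch: cache one layer's activations
# from a "clean" prompt, then overwrite that layer's output during a
# run on a different prompt. Assumes HuggingFace transformers + GPT-2;
# LAYER and the prompts are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tok = AutoTokenizer.from_pretrained("gpt2")
LAYER = 6  # which transformer block to patch (illustrative)

def get_activations(prompt):
    """Clean run: cache the hidden states output by one layer."""
    cache = {}
    def hook(module, inputs, output):
        cache["h"] = output[0].detach()  # GPT-2 blocks return a tuple
    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return cache["h"]

def run_patched(prompt, patched_h):
    """Patched run: replace that layer's output with the cached values."""
    def hook(module, inputs, output):
        # Shapes must match, so both prompts must tokenize to the
        # same number of tokens.
        assert patched_h.shape == output[0].shape
        return (patched_h,) + output[1:]
    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        logits = model(**tok(prompt, return_tensors="pt")).logits
    handle.remove()  # always restore the unpatched model
    return logits

clean_h = get_activations("Paris is the capital of France")
patched_logits = run_patched("Rome is the capital of France", clean_h)
```

Comparing the patched logits against an unpatched run on the same prompt indicates how much that layer's activations contribute to the model's behavior, which is how many of the papers below localize where a concept is represented inside the network.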
Research on Activation Patching
Research papers include:
- Neel Nanda, 8 Jul 2024, An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2, https://www.alignmentforum.org/posts/NfFST5Mio7BCAQHPA/an-extremely-opinionated-annotated-list-of-my-favourite
- Neel Nanda, 4 Feb 2024, Attribution Patching: Activation Patching At Industrial Scale, https://www.neelnanda.io/mechanistic-interpretability/attribution-patching
- Yixin Ji, Juntao Li, Hai Ye, Kaixin Wu, Jia Xu, Linjian Mo, Min Zhang, 5 Jan 2025, Test-time Computing: from System-1 Thinking to System-2 Thinking, https://arxiv.org/abs/2501.02497
- Hanyu Zhang, Xiting Wang, Chengao Li, Xiang Ao, Qing He, 10 Jan 2025, Controlling Large Language Models Through Concept Activation Vectors, https://arxiv.org/abs/2501.05764 (Training a vector used to control the model on certain attributes; see the steering sketch after this list.)
- Clément Dumas, Chris Wendler, Veniamin Veselovsky, Giovanni Monea, Robert West, 9 Jan 2025 (v3), Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers, https://arxiv.org/abs/2411.08745
- Wei Jie Yeo, Ranjan Satapathy, Erik Cambria, 1 Nov 2024 (v2), Towards Faithful Natural Language Explanations: A Study Using Activation Patching in Large Language Models, https://arxiv.org/abs/2410.14155
- Stefan Heimersheim, Neel Nanda, 23 Apr 2024, How to use and interpret activation patching, https://arxiv.org/abs/2404.15255
- Aleksandar Makelov, Georg Lange, Neel Nanda, 6 Dec 2023 (v2), Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching, https://arxiv.org/abs/2311.17030
- Fred Zhang, Neel Nanda, 17 Jan 2024 (v2), Towards Best Practices of Activation Patching in Language Models: Metrics and Methods, https://arxiv.org/abs/2309.16042
- Ansh Poonia, Maeghal Jain, 28 Jul 2025, Dissecting Persona-Driven Reasoning in Language Models via Activation Patching, https://arxiv.org/abs/2507.20936
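Related to the concept-activation-vector approach in the Zhang et al. paper above, here is a minimal sketch of activation steering, where a direction vector is added to one layer's hidden states to nudge generation. The random vector, layer index, and scaling factor are placeholder assumptions; a real concept vector would be learned or derived from contrastive prompts.

```python
# Minimal activation steering sketch: add a scaled direction vector
# to the hidden states at one layer during generation. Assumes
# HuggingFace transformers + GPT-2; the vector itself is a random
# placeholder standing in for a learned concept vector.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tok = AutoTokenizer.from_pretrained("gpt2")
LAYER, ALPHA = 6, 4.0  # injection point and steering strength (illustrative)

# Placeholder for a learned concept vector, normalized to unit length.
steer = torch.randn(model.config.hidden_size)
steer = steer / steer.norm()

def steering_hook(module, inputs, output):
    # Add the scaled direction to every token's hidden state;
    # broadcasting handles both prefill and cached decode steps.
    return (output[0] + ALPHA * steer,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
ids = tok("The weather today is", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20)
handle.remove()
print(tok.decode(out[0]))
```

Unlike patching, which swaps in activations cached from another run, steering perturbs the activations in a fixed direction, so it needs no matching source prompt and works token-by-token during generation.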
AI Books from Aussie AI
- The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory. Get your copy from Amazon: The Sweetest Lesson
- RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures. Get your copy from Amazon: RAG Optimization
- Generative AI Applications book. Get your copy from Amazon: Generative AI Applications
- Generative AI programming book. Get your copy from Amazon: Generative AI in C++
- CUDA C++ Optimization book. Get your copy from Amazon: CUDA C++ Optimization
- CUDA C++ Debugging book. Get your copy from Amazon: CUDA C++ Debugging