Aussie AI
Activation Patching
-
Last Updated 25 August, 2025
-
by David Spuler, Ph.D.
What is Activation Patching?
Activation patching is an LLM interpretability and control technique that involves directly modifying the model's dynamic activation computations. In other words, the numbers in the "latent space" of embedding vectors can be changed directly at run-time, without any change to the model's weights. Activation patching is often used to perform activation steering, which can adjust attributes such as the tone or style of the output. It is also related to other techniques that directly modify activation vectors, such as prompt tuning and prefix tuning. The general classes of algorithms that work directly on embedding activations in latent space include mechanistic interpretability and representation engineering.
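As a concrete illustration, here is a minimal sketch of the basic patch-and-rerun loop in Python, assuming the HuggingFace transformers library with GPT-2. The layer index, prompts, and helper function names are illustrative assumptions for this sketch, not a standard API.

```python
# Minimal activation patching sketch: cache one layer's activations
# from a "clean" prompt, then overwrite that layer's output during a
# run on a different prompt. Assumes HuggingFace transformers + GPT-2;
# LAYER and the prompts are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tok = AutoTokenizer.from_pretrained("gpt2")
LAYER = 6  # which transformer block to patch (illustrative)

def get_activations(prompt):
    """Clean run: cache the hidden states output by one layer."""
    cache = {}
    def hook(module, inputs, output):
        cache["h"] = output[0].detach()  # GPT-2 blocks return a tuple
    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return cache["h"]

def run_patched(prompt, patched_h):
    """Patched run: replace that layer's output with the cached values."""
    def hook(module, inputs, output):
        # Shapes must match, so both prompts must tokenize to the
        # same number of tokens.
        assert patched_h.shape == output[0].shape
        return (patched_h,) + output[1:]
    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        logits = model(**tok(prompt, return_tensors="pt")).logits
    handle.remove()  # always restore the unpatched model
    return logits

clean_h = get_activations("Paris is the capital of France")
patched_logits = run_patched("Rome is the capital of France", clean_h)
```

Comparing the patched logits against an unpatched run on the same prompt indicates how much that layer's activations contribute to the model's behavior, which is how many of the papers below localize where a concept is represented inside the network.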
Research on Activation Patching
Research papers include:
- Neel Nanda, 8 Jul 2024, An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2, https://www.alignmentforum.org/posts/NfFST5Mio7BCAQHPA/an-extremely-opinionated-annotated-list-of-my-favourite
- Neel Nanda, 4 Feb 2024, Attribution Patching: Activation Patching At Industrial Scale, https://www.neelnanda.io/mechanistic-interpretability/attribution-patching
- Yixin Ji, Juntao Li, Hai Ye, Kaixin Wu, Jia Xu, Linjian Mo, Min Zhang, 5 Jan 2025, Test-time Computing: from System-1 Thinking to System-2 Thinking, https://arxiv.org/abs/2501.02497
- Hanyu Zhang, Xiting Wang, Chengao Li, Xiang Ao, Qing He, 10 Jan 2025, Controlling Large Language Models Through Concept Activation Vectors, https://arxiv.org/abs/2501.05764 (Training a vector used to control the model on certain attributes; see the steering sketch after this list.)
- Clément Dumas, Chris Wendler, Veniamin Veselovsky, Giovanni Monea, Robert West, 9 Jan 2025 (v3), Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers, https://arxiv.org/abs/2411.08745
- Wei Jie Yeo, Ranjan Satapathy, Erik Cambria, 1 Nov 2024 (v2), Towards Faithful Natural Language Explanations: A Study Using Activation Patching in Large Language Models, https://arxiv.org/abs/2410.14155
- Stefan Heimersheim, Neel Nanda, 23 Apr 2024, How to use and interpret activation patching, https://arxiv.org/abs/2404.15255
- Aleksandar Makelov, Georg Lange, Neel Nanda, 6 Dec 2023 (v2), Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching, https://arxiv.org/abs/2311.17030
- Fred Zhang, Neel Nanda, 17 Jan 2024 (v2), Towards Best Practices of Activation Patching in Language Models: Metrics and Methods, https://arxiv.org/abs/2309.16042
- Ansh Poonia, Maeghal Jain, 28 Jul 2025, Dissecting Persona-Driven Reasoning in Language Models via Activation Patching, https://arxiv.org/abs/2507.20936
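Related to the concept-activation-vector approach in the Zhang et al. paper above, here is a minimal sketch of activation steering, where a direction vector is added to one layer's hidden states to nudge generation. The random vector, layer index, and scaling factor are placeholder assumptions; a real concept vector would be learned or derived from contrastive prompts.

```python
# Minimal activation steering sketch: add a scaled direction vector
# to the hidden states at one layer during generation. Assumes
# HuggingFace transformers + GPT-2; the vector itself is a random
# placeholder standing in for a learned concept vector.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tok = AutoTokenizer.from_pretrained("gpt2")
LAYER, ALPHA = 6, 4.0  # injection point and steering strength (illustrative)

# Placeholder for a learned concept vector, normalized to unit length.
steer = torch.randn(model.config.hidden_size)
steer = steer / steer.norm()

def steering_hook(module, inputs, output):
    # Add the scaled direction to every token's hidden state;
    # broadcasting handles both prefill and cached decode steps.
    return (output[0] + ALPHA * steer,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
ids = tok("The weather today is", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20)
handle.remove()
print(tok.decode(out[0]))
```

Unlike patching, which swaps in activations cached from another run, steering perturbs the activations in a fixed direction, so it needs no matching source prompt and works token-by-token during generation.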
AI Books from Aussie AI
- The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory. Get your copy from Amazon: The Sweetest Lesson
- RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures. Get your copy from Amazon: RAG Optimization
- Generative AI Applications book. Get your copy from Amazon: Generative AI Applications
- Generative AI programming book. Get your copy from Amazon: Generative AI in C++
- CUDA C++ Optimization book. Get your copy from Amazon: CUDA C++ Optimization
- CUDA C++ Debugging book. Get your copy from Amazon: CUDA C++ Debugging