Aussie AI
Representation Engineering
Last Updated 26 August, 2025
by David Spuler, Ph.D.
What is Representation Engineering?
Representation engineering is a dynamic method of modifying the behavior of an LLM, with some similarities to fine-tuning. Whereas fine-tuning changes the weights, representation engineering creates a set of vectors that can be added to the activations dynamically during inference, thereby offering some extra control over the LLM.
For example, you can determine which numbers in the activations represent "happiness" or "sadness," and then amplify or reduce those signals. The effect is to make the LLM's results happier or sadder.
There is some similarity between representation engineering and prompt engineering. After all, it is well known that you can prepend global instructions specifying the desired tone of voice (e.g., "use an optimistic tone" or "use a sad tone"). However, the effects of the two techniques are somewhat different, and representation engineering allows finer-grained control, because you can scale the control vectors (e.g., apply only 50% of the happiness vector). Again, prompt engineering could use instructions like "use a half-optimistic tone," but the level of granularity is much coarser. Furthermore, prepending extensive global instructions is quite expensive at inference time in terms of extra tokens to process, whereas representation engineering adds no extra tokens, requiring only a few vector additions during inference.
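The scaling mentioned above is literally a scalar coefficient on the vector addition. Here is a minimal sketch in Python, where the "happiness" vector and the activation are random stand-ins for real model values (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 8  # toy embedding size

activation = rng.normal(size=hidden_dim)        # one activation vector
happiness_vector = rng.normal(size=hidden_dim)  # hypothetical control vector

def apply_control(activation, control, coeff):
    """Add a control vector at a chosen strength (coeff=0.5 means 50%)."""
    return activation + coeff * control

half_happy = apply_control(activation, happiness_vector, 0.5)
# A negative coefficient subtracts the direction, pushing the other way.
half_sad = apply_control(activation, happiness_vector, -0.5)
```

The coefficient gives a continuous dial that a prompt instruction like "use a half-optimistic tone" cannot match.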
How Does It Work?
The idea underlying representation engineering is to figure out which activation numbers are relevant to particular signals, and then to either amplify or reduce those signals at run-time. The activations are in embedding space, so the trick is to work out which elements of the embedding vectors to modify for particular effects.
This method is not a speed optimization, but it can change the results of the LLM. Representation engineering involves an additional "training" phase, but with a different goal than normal training or fine-tuning. The extra inference cost of using representation engineering after training is quite low: just vector additions to the activations at every layer.
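To see why the inference overhead is low, compare the operation counts: a vector addition is O(d) per layer, whereas each layer's matrix multiplications are at least O(d^2). A back-of-envelope sketch with hypothetical model dimensions:

```python
d = 4096     # hypothetical hidden size
layers = 32  # hypothetical layer count

add_ops = layers * d             # one vector addition per layer
matmul_ops = layers * 2 * d * d  # just two projections per layer, a lower bound

overhead = add_ops / matmul_ops
print(f"control-vector overhead: {overhead:.6%} of projection FLOPs")
```

Even against this deliberately low estimate of the layer cost, the control-vector additions are a tiny fraction of the total work.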
Control Vectors
The "control vectors" are the numbers that are added to the activation vectors during inference. The idea is similar to "bias vectors" in model architectures, but control vectors are not pre-trained into the weights like biases.
There are various ways of creating the control vectors that are to be added during inference. This involves a training phase in which data sets representing opposite effects are processed, a method called "contrastive prompting." The activations from two opposite-tuned prompts can be compared to see which embedding numbers have changed. In this way, it is determined which numbers can be added (or subtracted) to amplify a trait (or reduce it).
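One simple form of contrastive prompting is a difference of means: collect activations at a given layer for prompts written with opposite traits, average each group, and subtract. The sketch below plants a known trait direction in synthetic activations (random stand-ins for real model outputs) and recovers it:

```python
import numpy as np

rng = np.random.default_rng(1)
hidden_dim = 16
n_prompts = 50

# A hidden trait direction, planted only so the demo can check its answer.
true_direction = rng.normal(size=hidden_dim)

base = rng.normal(size=(n_prompts, hidden_dim))
happy_acts = base + true_direction  # e.g. from "use an optimistic tone" prompts
sad_acts = base - true_direction    # e.g. from "use a sad tone" prompts

# Difference of per-class means = this layer's control vector.
control = happy_acts.mean(axis=0) - sad_acts.mean(axis=0)

# The recovered vector should point the same way as the planted direction.
cosine = control @ true_direction / (
    np.linalg.norm(control) * np.linalg.norm(true_direction))
```

Real methods (e.g., PCA over activation differences) are more robust, but the difference-of-means idea is the core of the approach.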
Technically, the control vectors are a set of vectors, one per layer. These are added to the activation vectors after every layer. Note that they can contain both positive and negative numbers, which will increase or reduce a signal.
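The per-layer application can be sketched as a toy forward pass, with one control vector added after each layer and a coefficient to scale (or disable) the steering; the layer weights and vectors here are illustrative random values, not a real model:

```python
import numpy as np

rng = np.random.default_rng(2)
hidden_dim = 8
n_layers = 4

weights = [rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
           for _ in range(n_layers)]
control_vectors = [rng.normal(size=hidden_dim)
                   for _ in range(n_layers)]  # one vector per layer

def forward(x, coeff=0.0):
    """Run the toy model; coeff scales the control vectors (0 disables them)."""
    for w, cv in zip(weights, control_vectors):
        x = np.tanh(w @ x)  # stand-in for a transformer layer
        x = x + coeff * cv  # control vector added after the layer
    return x

x0 = rng.normal(size=hidden_dim)
steered = forward(x0, coeff=0.5)
unsteered = forward(x0, coeff=0.0)
```

In a real framework this addition would typically be attached via a per-layer hook on the residual stream rather than written into the model code.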
The whole area of representation engineering is still emerging. Results are somewhat ad hoc, and understanding what the numbers in the embeddings actually mean remains a tricky problem.
Research on Representation Engineering
Research papers on representation engineering include:
- Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, Dan Hendrycks, 10 Oct 2023 (v3), Representation Engineering: A Top-Down Approach to AI Transparency, https://arxiv.org/abs/2310.01405 https://github.com/andyzoujm/representation-engineering
- Theia Vogel, January 22, 2024, Representation Engineering Mistral-7B an Acid Trip, https://vgel.me/posts/representation-engineering/
- Lukasz Bartoszcze, Sarthak Munshi, Bryan Sukidi, Jennifer Yen, Zejia Yang, David Williams-King, Linh Le, Kosi Asuzu, Carsten Maple, 24 Feb 2025, Representation Engineering for Large-Language Models: Survey and Research Challenges, https://arxiv.org/abs/2502.17601
- Kevin Meng, David Bau, Alex Andonian, Yonatan Belinkov, 13 Jan 2023 (v5), Locating and Editing Factual Associations in GPT, https://arxiv.org/abs/2202.05262
- Yingbing Huang, Deming Chen, Abhishek K. Umrawal, 28 Feb 2025, JAM: Controllable and Responsible Text Generation via Causal Reasoning and Latent Vector Manipulation, https://arxiv.org/abs/2502.20684
- Hao-Xiang Xu, Jun-Yu Ma, Zhen-Hua Ling, Ningyu Zhang, Jia-Chen Gu, 25 Feb 2025, Constraining Sequential Model Editing with Editing Anchor Compression, https://arxiv.org/abs/2503.00035
- Qiyuan Deng, Xuefeng Bai, Kehai Chen, Yaowei Wang, Liqiang Nie, Min Zhang, 13 Mar 2025, Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model, https://arxiv.org/abs/2503.10093
- Bilgehan Sel, Dingcheng Li, Phillip Wallis, Vaishakh Keshava, Ming Jin, Siddhartha Reddy Jonnalagadda, 11 Mar 2025, Backtracking for Safety, https://arxiv.org/abs/2503.08919
- Alexander Ku, Declan Campbell, Xuechunzi Bai, Jiayi Geng, Ryan Liu, Raja Marjieh, R. Thomas McCoy, Andrew Nam, Ilia Sucholutsky, Veniamin Veselovsky, Liyi Zhang, Jian-Qiao Zhu, Thomas L. Griffiths, 17 Mar 2025, Using the Tools of Cognitive Science to Understand Large Language Models at Different Levels of Analysis, https://arxiv.org/abs/2503.13401
- Kenneth J. K. Ong, Lye Jia Jun, Hieu Minh "Jord" Nguyen, Seong Hah Cho, Natalia Pérez-Campanero Antolín, 17 Mar 2025, Identifying Cooperative Personalities in Multi-agent Contexts through Personality Steering with Representation Engineering, https://arxiv.org/abs/2503.12722
- Moreno D'Incà, Elia Peruzzo, Xingqian Xu, Humphrey Shi, Nicu Sebe, Massimiliano Mancini, 14 Mar 2025, Safe Vision-Language Models via Unsafe Weights Manipulation, https://arxiv.org/abs/2503.11742
- Yanshu Li, Yi Cao, Hongyang He, Qisen Cheng, Xiang Fu, Xi Xiao, Tianyang Wang, Ruixiang Tang, 8 Aug 2025, M$^2$IV: Towards Efficient and Fine-grained Multimodal In-Context Learning via Representation Engineering, https://arxiv.org/abs/2504.04633
- Jun Li, Kai Li, Shaoguo Liu, Tingting Gao, 15 Aug 2025, Enhancing Supervised Composed Image Retrieval via Reasoning-Augmented Representation Engineering, https://arxiv.org/abs/2508.11272