Aussie AI

Chapter 5. Neurology Versus Numbers

  • Book Excerpt from "The Sweetest Lesson: Your Brain vs AI"
  • by David Spuler, Ph.D.

Chapter 5. Neurology Versus Numbers

 

 

 

“No, I’m not interested in developing a powerful brain.”

— Alan Turing

 

 

 

Neurology versus Numbers

AI models are broadly similar to the human brain, since they are programmatic models of its neurons and synapses. There’s another similarity between LLMs and brains:

    Nobody knows how they work!

Sure, there are neurologists who know about the brain, and technologists who know about AI models. But both areas are really quite underwhelming when you consider how little is known. For example, neurologists really don’t know:

    Why do we need sleep?

    What causes dementia?

Incidentally, the answer may be the same: clearing away brain cell detritus. Our brain cells clean up waste by-products during sleep, and dementia may be an impairment in that process.

Anyway, technologists really don’t understand their computerized AI models either. We can run them on huge computing clusters, and we know what they can do, but our understanding of how the models do these things is a little vague. For example, questions that AI researchers struggle with include:

    Why do models lack common sense?

    Why do they make stuff up?

Amusingly, for almost every area of neurology study, there’s a parallel research area in artificial brains.

Activated Brain Regions

You don’t use your whole brain all of the time. And I don’t mean that "only 10%" rule that Hollywood is fond of. What I mean is that there are different regions of the brain that are “activated” by different activities. For example, we know about “lobes” in the brain:

  • Frontal lobe (front) — high-level functioning (e.g., decisions, motor control, speech).
  • Parietal lobe (top) — sensory inputs, numbers/mathematics, and hand-eye coordination.
  • Temporal lobe (sides) — audio input, hearing, memory, and language understanding.
  • Occipital lobe (back) — higher-level visual processing (e.g., detecting shapes and movement).

And there are many more sub-regions of the brain that have been analyzed in much more depth:

  • Brainstem — basic life-preserving functions (breathing, heartbeat).
  • Motor cortex — voluntary movement controls.
  • Visual cortex — processing what the eyes see.
  • Cerebrum — high-level rationality and logic.
  • Cerebellum — movement and motor functions.

Neurology researchers have studied these brain regions in detail by: (a) the joy of dissection, and (b) having people do different things while their brain is being watched by an MRI. You can literally see that different parts of the brain “light up” in response to different stimuli or when the brain is doing different types of work. Other parts of the brain are stuck in idle.

AI researchers are doing two main things in response to these types of neurology research:

    1. Copying it!

    2. Studying AI models.

I’m not sure which one came first, but AI engineers are doing both.

AI Copies of Brain Regions

The brain’s structure is much more complicated than an AI model’s. In fact, whereas the brain has all sorts of different components scattered all around the skull, an LLM is rather simple in structure.

Your brain is rather special in structure, and asymmetrical. In fact, the left hemisphere and right hemisphere do different things, which isn’t the case in AI models. Instead, the average AI model is rigid in its structure, and symmetrical in all three dimensions.

The “Mixture-of-Experts” or “MoE” architecture, reportedly used in GPT-4, is like dividing the AI model into brain regions. Different parts of an AI model get activated in response to different questions.

Firstly, note that experts aren’t chosen based on the whole question you ask them. It’s not like the model analyzes the input and then decides which part of the LLM will handle the whole thing. That’s not the Mixture-of-Experts architecture that I’m talking about here, although there is a different thing that does work that way, which is called “model routing.” But here, I’m talking about having an LLM where internal submodels inside the model are turned on and off while it’s processing a response.

Note that the “experts” in an LLM are split on the strength of the signals along the “embedding dimension” of the matrices, also known as the “width” of the model. It’s not that one expert handles one word and a different expert handles the next one; that would be splitting along the “lengthwise” dimension (or token dimension). Rather, each word is converted into “embeddings,” a type of “semantic vector” that gives it various weights on different signals (e.g., is it a noun like “cat” or a verb like “jump”). The model figures out which signals are strongest, and hence, which expert should process the data. So, simplifying greatly, one expert might process nouns while a different expert processes verbs.
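
To make that concrete, here’s a toy sketch in PyTorch of a Mixture-of-Experts layer with a simple top-1 gate that looks at each token’s embedding vector and picks one expert to run. The sizes are made up, and real MoE models use fancier top-k gating over many more experts, so treat this as an illustration only:

    import torch
    import torch.nn as nn

    class ToyMoE(nn.Module):
        """Toy Mixture-of-Experts layer with a simple top-1 gate."""
        def __init__(self, dim=64, num_experts=4, hidden=128):
            super().__init__()
            self.gate = nn.Linear(dim, num_experts)   # scores each expert per token
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
                for _ in range(num_experts)
            )

        def forward(self, x):                         # x: (num_tokens, dim)
            choice = self.gate(x).argmax(dim=-1)      # pick one expert per token
            out = torch.zeros_like(x)
            for i, expert in enumerate(self.experts):
                mask = (choice == i)
                if mask.any():
                    out[mask] = expert(x[mask])       # only the chosen expert runs
            return out

    tokens = torch.randn(10, 64)     # ten token embedding vectors
    print(ToyMoE()(tokens).shape)    # torch.Size([10, 64])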

You can see how this would be efficient. Running the DeepSeek R1 model would be slow if all 600 billion weights were used for every word. Instead, it only uses about 37 billion of them for each word. It’s potentially a different set of 37 billion each time it goes around to output another word, but it’s still roughly 600/37 ≈ 16 times faster at each word, and hence, also that much faster at producing the whole output text. You can also see why this might be faster than GPT-4, which reportedly has about 200 billion weights in its experts, compared to only 37 billion in DeepSeek.
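
If you want to check that arithmetic yourself, here’s the back-of-envelope version in Python, using the same approximate parameter counts quoted above:

    total_weights  = 600e9   # roughly, all the weights in DeepSeek R1
    active_weights = 37e9    # weights actually used for each output word
    print(f"Per-word speedup: about {total_weights / active_weights:.0f}x")   # ~16x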

Vision and audio processing in AI models is very different in a sense (haha). There’s not really a full equivalent of the visual cortex in an LLM. Usually, the LLM doesn’t get a “raw feed” of vision or sound like the brain does, but instead gets one that’s already had factors like color and intensity computed for it. Hence, the brain has some neuron power assigned to processing raw input signals, which the LLMs don’t need.

Studying AI Model Numbers

AI researchers have been studying artificial brains in some depth, and you don’t even need to use an MRI to do this. In fact, you can just code it up. It sounds easy, but in reality, it’s a little tricky to analyze model weights, because it’s like trying to discern something interesting about what is literally a billion random numbers.

The whole area of studying AI weight numbers has a really great name: mechanistic interpretability. I’m not even sure what to write about that. I mean, sure, it’s interpreting the numbers, I guess, and they’re “mechanistic” in some way. Thankfully, most AI researchers shorten it to “mech interp” to hide the fact that they still don’t understand what the numbers actually do.

Nevertheless, AI researchers have discovered all sorts of intriguing facts about LLM weights. These are just numbers, and they can be examined and also changed. Weirdness ensues in multiple ways:

  • Attention sink — the first token.
  • Layer importance — three levels of AI.
  • Too many weights — thinning and shrinking numbers.

One weird thing is the “attention sink” research, which finds that there’s always one token that gets far too much attention: the very first one. So, the first word you use in a prompt matters a lot more than any other single word, and no-one really knows why that is. Maybe it’s not weird to think that AI only really listens to one word, because that’s how teenagers work.
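
You can actually watch the attention sink happen. Here’s a rough sketch using the Hugging Face transformers library and a small GPT-2 model (chosen purely for illustration), which averages, layer by layer, how much attention every position pays to the very first token:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tok("The quick brown fox jumps over the lazy dog", return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_attentions=True)

    # out.attentions: one (batch, heads, seq, seq) tensor per layer.
    # For each layer, average the attention that every query position
    # pays to key position 0 (the first token).
    sink = torch.stack([a[0].mean(dim=0)[:, 0] for a in out.attentions])
    print("Mean attention on the first token, per layer:", sink.mean(dim=1))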

Layers are another area of research. At the highest level is the study of “layer importance” in AI models. The structure of every model is that the computation goes repeatedly through many layers, sometimes over a hundred, of near-identical number crunching.

Like in Shrek, layers.

The different layers of the model matter and they do different things. The early layers of an LLM do the high-level topic selection and guidance, whereas the middle layers start choosing the most likely next word to output. The final layers are “finesse” layers that choose between several good options for the best output.

Researchers have studied this by looking internally at how the numbers change after each layer, and also by doing “layer skipping” or “early exiting” to see what happens if some of the layers are removed. It turns out that you can skip a lot of layers and still have a reasonable level of intelligence.
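
Here’s a rough sketch of the crudest version of that experiment, again on a small GPT-2 model from Hugging Face. The model choice and the “keep half the layers” decision are just illustrative, not a tuned recipe:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Chop off the second half of the transformer layers.
    keep = len(model.transformer.h) // 2
    model.transformer.h = torch.nn.ModuleList(model.transformer.h[:keep])
    model.config.n_layer = keep

    ids = tok("The capital of France is", return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**ids, max_new_tokens=5)
    print(tok.decode(out[0]))   # see how much intelligence survives the surgery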

Maybe LLMs are only really using 10% of their AI brains?

The 10% Rule

Could it be that LLMs are like humans and waste their brain capacity? It’s not actually a joke, because AI models are often shrunk to be smaller. Turns out if you have billions of numbers, and you don’t understand what they all do, then there’s some redundancy in there.

One technique is called “quantization” and that name is probably because it has nothing to do with quantum physics. Anyway, that’s the name of the most popular technique. The goal is to make the model smaller, so it can run faster, while giving up some accuracy. In other words: a lot faster but slightly dumber.

Common levels of shrinkage, ahem, I mean, quantization, are from 32-bit to 4-bit, which is an eight-fold decline in LLM brain size. In other words, a 4-bit quantized model is only using 12.5% of its original brain power.
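
Here’s a sketch of the basic idea in PyTorch. This is naive whole-tensor quantization; real quantizers use per-group scales, calibration data, and packed 4-bit storage, so consider it an illustration only:

    import torch

    def quantize_4bit(w: torch.Tensor):
        scale = w.abs().max() / 7.0                    # 4-bit signed ints span -8..7
        q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
        return q, scale

    def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        return q.float() * scale

    w = torch.randn(4, 4)            # pretend this is a slice of an LLM weight matrix
    q, scale = quantize_4bit(w)
    w_hat = dequantize(q, scale)
    print("Max rounding error:", (w - w_hat).abs().max().item())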

There’s another technique called “pruning” (good name) or “sparsification” (bad name), which involves setting numbers to zero. This is equivalent to “pruning” (removing) a synaptic connection between two neurons. In neurosurgery, this is done to block connections that are causing problems, such as in epilepsy, whereas in AI the weights are mainly removed just to get a speedup. Hence, different goals for pruning, but you can even combine pruning with quantization. So, yes, we can get below 10% usage of an LLM brain.
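
And here’s a matching sketch of naive magnitude pruning, which simply zeroes out the smallest half of the weights. Real pruning methods are smarter about which connections to cut, and often prune in structured blocks:

    import torch

    def magnitude_prune(w: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
        k = max(1, int(sparsity * w.numel()))
        threshold = w.abs().flatten().kthvalue(k).values   # k-th smallest magnitude
        return torch.where(w.abs() > threshold, w, torch.zeros_like(w))

    w = torch.randn(8, 8)
    w_pruned = magnitude_prune(w, sparsity=0.5)
    print("Fraction of weights zeroed:", (w_pruned == 0).float().mean().item())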

Neurosurgery on AI Brains

Researchers haven’t just studied the numbers, but have also changed them. It’s like stimulating neurons in the LLM with an electric prod, just like they do in neurosurgery with Deep Brain Stimulation (DBS) to treat Parkinson’s disease.

The basic idea in AI research is called “activation patching” where you modify the outputs of each of those layers as things progress. If you change the numbers, then the output will change.

Note that this is not just about suppressing neurons, as in “sparsifying” the numbers by setting lots of them to zero. That’s a speedup, but activation patching is about changing the results, not just making the model run faster.
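
Here’s a minimal sketch of the mechanics using PyTorch forward hooks on a small GPT-2 model: cache one layer’s output while running a “source” prompt, then overwrite that layer’s output while running a “target” prompt. The layer index and the prompts are illustrative, and the two prompts are assumed to tokenize to the same length:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    layer = model.transformer.h[5]            # the layer to patch (index is arbitrary)
    cache = {}

    def save_hook(module, inputs, output):
        cache["acts"] = output[0].detach()    # GPT-2 blocks return a tuple

    def patch_hook(module, inputs, output):
        return (cache["acts"],) + output[1:]  # swap in the cached hidden states

    # 1. Run the "source" prompt and save the layer's activations.
    handle = layer.register_forward_hook(save_hook)
    with torch.no_grad():
        model(**tok("Paris is the capital of France", return_tensors="pt"))
    handle.remove()

    # 2. Run the "target" prompt with those activations patched in.
    handle = layer.register_forward_hook(patch_hook)
    with torch.no_grad():
        patched = model(**tok("Rome is the capital of Italy", return_tensors="pt"))
    handle.remove()

    print("Patched next-word guess:", tok.decode([patched.logits[0, -1].argmax().item()]))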

What can you change?

Turns out, it’s not easy. There’s no simple algorithm to look at the numbers in an AI model and know what to change. Who knew that brain surgery could be so difficult?

Researchers have ways of changing the numbers, but they’re not easy algorithms, and always involve watching the numbers first. The most fun you can have on a Sunday night in San Jose is running your AI engines and watching how the numbers change.

The most successful idea here is called “attention steering” or just “steering” for short. The idea is that you can “steer” the AI model toward some type of tone, such as optimistic or pessimistic. This only works with pairs of styles. You run your AI engine with prompts that you know will cause optimistic or pessimistic outputs, and you know which is which. Then you look at the numbers from both, and run a “diff” on a big vector of numbers to find the ones that differ between optimistic and pessimistic answers. And that means the internal numbers along the way, not just the different types of words at the end, which would be too easy.

That difference vector is a list of numbers representing the set of activations that differ between the two styles. Hence, you end up with two vectors, one for the positive style and one for the negative. Later, when you want to do “steering,” you can modify your model layers so that these extra numbers are added to the outputs.

Bizarrely, this actually works.
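
Here’s a rough sketch of that recipe, again using PyTorch hooks on a small GPT-2 model. The contrasting prompts, the layer index, and the scaling factor are all illustrative guesses, and real steering methods average over many prompt pairs:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    layer = model.transformer.h[6]                   # layer to steer (arbitrary choice)

    def mean_activation(prompt):
        acts = {}
        def hook(module, inputs, output):
            acts["h"] = output[0].mean(dim=1)        # average over token positions
        handle = layer.register_forward_hook(hook)
        with torch.no_grad():
            model(**tok(prompt, return_tensors="pt"))
        handle.remove()
        return acts["h"]

    # "Diff" the internal numbers for an optimistic versus a pessimistic prompt.
    steer = (mean_activation("What a wonderful, hopeful day this is.")
             - mean_activation("Everything is going terribly wrong today."))

    def steering_hook(module, inputs, output):
        return (output[0] + 4.0 * steer,) + output[1:]   # the 4.0 is an arbitrary knob

    handle = layer.register_forward_hook(steering_hook)
    ids = tok("My opinion of the weather is", return_tensors="pt")
    print(tok.decode(model.generate(**ids, max_new_tokens=15)[0]))
    handle.remove()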

Unfortunately, it’s hard to drill down deeper than this to give finer steering control. We can find the signals that tend towards positivity or negativity, but we can’t really identify the meaning of every number in the vector. Each number must mean something, or be a signal toward some output, but which number signals which output?

Anyway, I mean, it’s intriguing research, and I love it so much, but what’s the practical point? It’s easier to just command your AI to "please output in an optimistic tone" at the end of your prompt. That works, too.

References

References on the “attention sink” where the very first token gets too much “attention”:

  1. Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, Min Lin, 14 Oct 2024, When Attention Sink Emerges in Language Models: An Empirical View, https://arxiv.org/abs/2410.10781 https://github.com/sail-sg/Attention-Sink
  2. Tianyu Guo, Druv Pai, Yu Bai, Jiantao Jiao, Michael I. Jordan, Song Mei, 17 Oct 2024, Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs, https://arxiv.org/abs/2410.13835
  3. Zunhai Su, Zhe Chen, Wang Shen, Hanyu Wei, Linge Li, Huangqi Yu, Kehong Yuan, 25 Jan 2025, RotateKV: Accurate and Robust 2-Bit KV Cache Quantization for LLMs via Outlier-Aware Adaptive Rotations, https://arxiv.org/abs/2501.16383
  4. Xinyi Wu, Yifei Wang, Stefanie Jegelka, Ali Jadbabaie, 4 Feb 2025, On the Emergence of Position Bias in Transformers, https://arxiv.org/abs/2502.01951

References on “layer importance” inside AI model structures:

  1. Anton Razzhigaev, Matvey Mikhalchuk, Elizaveta Goncharova, Nikolai Gerasimenko, Ivan Oseledets, Denis Dimitrov, Andrey Kuznetsov, 19 May 2024, Your Transformer is Secretly Linear, https://arxiv.org/abs/2405.12250 (Replacing model layers in the decoder with linear approximations.)
  2. BS Akash, V Singh, A Krishna, LB Murthy, L Kumar, April 2024, Investigating BERT Layer Performance and SMOTE Through MLP-Driven Ablation on Gittercom, Lecture Notes on Data Engineering and Communications Technologies (LNDECT,volume 200), https://link.springer.com/chapter/10.1007/978-3-031-57853-3_25
  3. Jiachen Jiang, Jinxin Zhou, Zhihui Zhu, 20 Jun 2024, On Layer-wise Representation Similarity: Application for Multi-Exit Models with a Single Classifier, https://arxiv.org/abs/2406.14479 (Using layer similarity for early exit classifiers, which is also related to layer fusion.)
  4. Vedang Lad, Wes Gurnee, Max Tegmark, 27 Jun 2024, The Remarkable Robustness of LLMs: Stages of Inference, https://arxiv.org/abs/2406.19384 (Deleting and swapping adjacent model layers. Hypothesizes that the first layer is effectively detokenization, the early layers focus on “features”, the middle layers focus on “ensemble predictions” and the latter layers “sharpen” or finalize, with a lot of suppression happening near the end.)
  5. Xu Cheng, Lei Cheng, Zhaoran Peng, Yang Xu, Tian Han, Quanshi Zhang, July 2024, Layerwise Change of Knowledge in Neural Networks, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:8038-8059, 2024, https://proceedings.mlr.press/v235/cheng24b.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/cheng24b/cheng24b.pdf
  6. Siqi Fan, Xin Jiang, Xiang Li, Xuying Meng, Peng Han, Shuo Shang, Aixin Sun, Yequan Wang, Zhongyuan Wang, 9 Jul 2024 (v3), Not All Layers of LLMs Are Necessary During Inference, https://arxiv.org/abs/2403.02181
  7. Amit Ben Artzy, Roy Schwartz, 5 Sep 2024, Attend First, Consolidate Later: On the Importance of Attention in Different LLM Layers, https://arxiv.org/abs/2409.03621
  8. Tianyu Guo, Druv Pai, Yu Bai, Jiantao Jiao, Michael I. Jordan, Song Mei, 17 Oct 2024, Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs, https://arxiv.org/abs/2410.13835
  9. Khashayar Gatmiry, Nikunj Saunshi, Sashank J. Reddi, Stefanie Jegelka, Sanjiv Kumar, 29 Oct 2024, On the Role of Depth and Looping for In-Context Learning with Task Diversity, https://arxiv.org/abs/2410.21698
  10. Guangyuan Shi, Zexin Lu, Xiaoyu Dong, Wenlong Zhang, Xuanyu Zhang, Yujie Feng, Xiao-Ming Wu, 23 Oct 2024, Understanding Layer Significance in LLM Alignment, https://arxiv.org/abs/2410.17875
  11. Jason Du, Kelly Hong, Alishba Imran, Erfan Jahanparast, Mehdi Khfifi, Kaichun Qiao, 13 Jan 2025, How GPT learns layer by layer, https://arxiv.org/abs/2501.07108 https://github.com/ALT-JS/OthelloSAE
  12. Ming Li, Yanhong Li, Tianyi Zhou, 31 Oct 2024, What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective, https://arxiv.org/abs/2410.23743
  13. Jiachen Jiang, Yuxin Dong, Jinxin Zhou, Zhihui Zhu, 22 May 2025, From Compression to Expansion: A Layerwise Analysis of In-Context Learning, https://arxiv.org/abs/2505.17322

References on “mechanistic interpretability” or “mech interp” if you’re trendy:

  1. Neel Nanda, 8th Jul 2024, An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2, https://www.alignmentforum.org/posts/NfFST5Mio7BCAQHPA/an-extremely-opinionated-annotated-list-of-my-favourite
  2. Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao, 2 Jul 2024, A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models, https://arxiv.org/abs/2407.02646
  3. Mengru Wang, Yunzhi Yao, Ziwen Xu, Shuofei Qiao, Shumin Deng, Peng Wang, Xiang Chen, Jia-Chen Gu, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang, 4 Dec 2024 (v4), Knowledge Mechanisms in Large Language Models: A Survey and Perspective, https://arxiv.org/abs/2407.15017
  4. Leonard Bereska, Efstratios Gavves, 23 Aug 2024 (v3), Mechanistic Interpretability for AI Safety -- A Review, https://arxiv.org/abs/2404.14082
  5. Chashi Mahiul Islam, Samuel Jacob Chacko, Mao Nishino, Xiuwen Liu, 7 Feb 2025, Mechanistic Understandings of Representation Vulnerabilities and Engineering Robust Vision Transformers, https://arxiv.org/abs/2502.04679
  6. Ala N. Tak, Amin Banayeeanzade, Anahita Bolourani, Mina Kian, Robin Jia, Jonathan Gratch, 8 Feb 2025, Mechanistic Interpretability of Emotion Inference in Large Language Models, https://arxiv.org/abs/2502.05489
  7. Lukasz Bartoszcze, Sarthak Munshi, Bryan Sukidi, Jennifer Yen, Zejia Yang, David Williams-King, Linh Le, Kosi Asuzu, Carsten Maple, 24 Feb 2025, Representation Engineering for Large-Language Models: Survey and Research Challenges, https://arxiv.org/abs/2502.17601
  8. Samuel Miller, Daking Rai, Ziyu Yao, 20 Feb 2025, Mechanistic Understanding of Language Models in Syntactic Code Completion, https://arxiv.org/abs/2502.18499
  9. Yifan Zhang, Wenyu Du, Dongming Jin, Jie Fu, Zhi Jin, 27 Feb 2025, Finite State Automata Inside Transformers with Chain-of-Thought: A Mechanistic Study on State Tracking, https://arxiv.org/abs/2502.20129
  10. J. Katta, M. Allanki and N. R. Kodumuru, 2025, Understanding Sarcasm Detection Through Mechanistic Interpretability, 2025 4th International Conference on Sentiment Analysis and Deep Learning (ICSADL), Bhimdatta, Nepal, 2025, pp. 990-995, doi: 10.1109/ICSADL65848.2025.10933475. https://ieeexplore.ieee.org/abstract/document/10933475/
  11. Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adria Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, Eric J. Michaud, Stephen Casper, Max Tegmark, William Saunders, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Tom McGrath, 27 Jan 2025, Open Problems in Mechanistic Interpretability, https://arxiv.org/abs/2501.16496
  12. Jingcheng Niu, Xingdi Yuan, Tong Wang, Hamidreza Saghir, Amir H. Abdi, 14 May 2025, Llama See, Llama Do: A Mechanistic Perspective on Contextual Entrainment and Distraction in LLMs, https://arxiv.org/abs/2505.09338
  13. Vishnu Kabir Chhabra, Mohammad Mahdi Khalili, 5 Apr 2025, Towards Understanding and Improving Refusal in Compressed Models via Mechanistic Interpretability, https://arxiv.org/abs/2504.04215

References on “activation patching” (like deep brain stimulation):

  1. Neel Nanda, Feb 4, 2024, Attribution Patching: Activation Patching At Industrial Scale, https://www.neelnanda.io/mechanistic-interpretability/attribution-patching
  2. Clément Dumas, Chris Wendler, Veniamin Veselovsky, Giovanni Monea, Robert West, 9 Jan 2025 (v3), Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers, https://arxiv.org/abs/2411.08745
  3. Wei Jie Yeo, Ranjan Satapathy, Erik Cambria, 1 Nov 2024 (v2), Towards Faithful Natural Language Explanations: A Study Using Activation Patching in Large Language Models, https://arxiv.org/abs/2410.14155
  4. Stefan Heimersheim, Neel Nanda, 23 Apr 2024, How to use and interpret activation patching, https://arxiv.org/abs/2404.15255
  5. Aleksandar Makelov, Georg Lange, Neel Nanda, 6 Dec 2023 (v2), Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching, https://arxiv.org/abs/2311.17030
  6. Fred Zhang, Neel Nanda, 17 Jan 2024 (v2), Towards Best Practices of Activation Patching in Language Models: Metrics and Methods, https://arxiv.org/abs/2309.16042

References on “attention steering” that changes AI behavior by playing with its internal numbers:

  1. Zhuohan Gu, Jiayi Yao, Kuntai Du, Junchen Jiang, 21 Nov 2024 (v2), LLMSteer: Improving Long-Context LLM Inference by Steering Attention on Reused Contexts, https://arxiv.org/abs/2411.13009
  2. Qingru Zhang, Chandan Singh, Liyuan Liu, Xiaodong Liu, Bin Yu, Jianfeng Gao, Tuo Zhao, 1 Oct 2024 (v2), Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs, https://arxiv.org/abs/2311.02262 https://github.com/QingruZhang/PASTA
  3. Baifeng Shi, Siyu Gai, Trevor Darrell, Xin Wang, 11 Jul 2023 (v2), TOAST: Transfer Learning via Attention Steering, https://arxiv.org/abs/2305.15542 https://github.com/bfshi/TOAST
  4. Yibin Wang, Weizhong Zhang, Jianwei Zheng, Cheng Jin, 20 Aug 2024 (v3), PrimeComposer: Faster Progressively Combined Diffusion for Image Composition with Attention Steering, https://arxiv.org/abs/2403.05053 https://github.com/CodeGoat24/PrimeComposer
  5. Kyle O’Brien, David Majercak, Xavier Fernandes, Richard Edgar, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, Forough Poursabzi-Sangde, 18 Nov 2024, Steering Language Model Refusal with Sparse Autoencoders, https://arxiv.org/abs/2411.11296
  6. Xintong Wang, Jingheng Pan, Longqin Jiang, Liang Ding, Xingshan Li, Chris Biemann, 23 Oct 2024, CogSteer: Cognition-Inspired Selective Layer Intervention for Efficient Semantic Steering in Large Language Models, https://arxiv.org/abs/2410.17714
  7. Hanyu Zhang, Xiting Wang, Chengao Li, Xiang Ao, Qing He, 10 Jan 2025, Controlling Large Language Models Through Concept Activation Vectors, https://arxiv.org/abs/2501.05764 (Training a vector used to control the model on certain attributes.)
  8. Liu Yang, Ziqian Lin, Kangwook Lee, Dimitris Papailiopoulos, Robert Nowak, 16 Jan 2025, Task Vectors in In-Context Learning: Emergence, Formation, and Benefit, https://arxiv.org/abs/2501.09240
  9. Peixuan Han, Cheng Qian, Xiusi Chen, Yuji Zhang, Denghui Zhang, Heng Ji, 4 Feb 2025 (v2), Internal Activation as the Polar Star for Steering Unsafe LLM Behavior, https://arxiv.org/abs/2502.01042
  10. Daniel Beaglehole, Adityanarayanan Radhakrishnan, Enric Boix-Adserà, Mikhail Belkin, 6 Feb 2025, Aggregate and conquer: detecting and steering LLM concepts by combining nonlinear predictors over multiple layers, https://arxiv.org/abs/2502.03708 https://github.com/dmbeaglehole/neural_controllers
  11. Nikhil Anand, Dec 20, 2024, Understanding “steering” in LLMs: And how simple math can solve global problems, https://ai.gopubby.com/understanding-steering-in-llms-96faf6e0bee7
  12. Somnath Banerjee, Sayan Layek, Pratyush Chatterjee, Animesh Mukherjee, Rima Hazra, 16 Feb 2025, Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment, https://arxiv.org/abs/2502.11244
  13. Seongheon Park, Xuefeng Du, Min-Hsuan Yeh, Haobo Wang, Yixuan Li, 1 Mar 2025, How to Steer LLM Latents for Hallucination Detection? https://arxiv.org/abs/2503.01917
  14. Marco Scialanga, Thibault Laugel, Vincent Grari, Marcin Detyniecki, 3 Mar 2025, SAKE: Steering Activations for Knowledge Editing, https://arxiv.org/abs/2503.01751
  15. Kenneth J. K. Ong, Lye Jia Jun, Hieu Minh “Jord” Nguyen, Seong Hah Cho, Natalia Pérez-Campanero Antolín, 17 Mar 2025, Identifying Cooperative Personalities in Multi-agent Contexts through Personality Steering with Representation Engineering, https://arxiv.org/abs/2503.12722
  16. Moreno D’Incà, Elia Peruzzo, Xingqian Xu, Humphrey Shi, Nicu Sebe, Massimiliano Mancini, 14 Mar 2025, Safe Vision-Language Models via Unsafe Weights Manipulation, https://arxiv.org/abs/2503.11742
  17. Changho Shin, Xinya Yan, Suenggwan Jo, Sungjun Cho, Shourjo Aditya Chaudhuri, Frederic Sala, 25 Mar 2025 (v2), TARDIS: Mitigating Temporal Misalignment via Representation Steering, https://arxiv.org/abs/2503.18693
  18. Jingcheng Niu, Xingdi Yuan, Tong Wang, Hamidreza Saghir, Amir H. Abdi, 14 May 2025, Llama See, Llama Do: A Mechanistic Perspective on Contextual Entrainment and Distraction in LLMs, https://arxiv.org/abs/2505.09338
  19. Yao Huang, Huanran Chen, Shouwei Ruan, Yichi Zhang, Xingxing Wei, Yinpeng Dong, 28 May 2025, Mitigating Overthinking in Large Reasoning Models via Manifold Steering, https://arxiv.org/abs/2505.22411 https://github.com/Aries-iai/Manifold_Steering

 
