Aussie AI

Refusal Module

  • Last Updated 17 November, 2025
  • by David Spuler, Ph.D.

Research on Refusal Module

Research papers include:

  • Andy Arditi, Oscar Obeso, Aaquib111, wesg, Neel Nanda, 27th Apr 2024, Refusal in LLMs is mediated by a single direction, LessWrong, https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction
  • Maxime Labonne, June 13, 2024, Uncensor any LLM with abliteration, https://huggingface.co/blog/mlabonne/abliteration
  • NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared ReLU.)
  • Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri, 26 Jun 2024, WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs, https://arxiv.org/abs/2406.18495
  • Maksym Andriushchenko, Nicolas Flammarion, 16 Jul 2024, Does Refusal Training in LLMs Generalize to the Past Tense? https://arxiv.org/abs/2407.11969 Code: https://github.com/tml-epfl/llm-past-tense
  • Kylie Robison, Jul 20, 2024, OpenAI’s latest model will block the ‘ignore all previous instructions’ loophole, https://www.theverge.com/2024/7/19/24201414/openai-chatgpt-gpt-4o-prompt-injection-instruction-hierarchy
  • Xinyi Hou, Yanjie Zhao, Haoyu Wang, 3 Aug 2024, Voices from the Frontier: A Comprehensive Analysis of the OpenAI Developer Forum, https://arxiv.org/abs/2408.01687
  • Asir Saadat, Tasmia Binte Sogir, Md Taukir Azam Chowdhury, Syem Aziz, 16 Oct 2024, When Not to Answer: Evaluating Prompts on GPT Models for Effective Abstention in Unanswerable Math Word Problems, https://arxiv.org/abs/2410.13029
  • Kyle O'Brien, David Majercak, Xavier Fernandes, Richard Edgar, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, Forough Poursabzi-Sangdeh, 18 Nov 2024, Steering Language Model Refusal with Sparse Autoencoders, https://arxiv.org/abs/2411.11296
  • Mohit Sewak, Dec 6, 2024, Prompt Injection Attacks on Large Language Models, https://pub.towardsai.net/prompt-injection-attacks-on-large-language-models-bd8062fa1bb7
  • Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun Xia, Tianyi Wu, Zhiwei Xue, Yulin Chen, Kenji Kawaguchi, Jiaheng Zhang, Bryan Hooi, 30 Jan 2025, GuardReasoner: Towards Reasoning-based LLM Safeguards, https://arxiv.org/abs/2501.18492
  • Holistic AI Team, March 6, 2025, Anthropic’s Claude 3.7 Sonnet Jailbreaking & Red Teaming Audit: The Most Secure Model Yet? https://www.holisticai.com/blog/claude-3-7-sonnet-jailbreaking-audit
  • Vishnu Kabir Chhabra, Mohammad Mahdi Khalili, 5 Apr 2025, Towards Understanding and Improving Refusal in Compressed Models via Mechanistic Interpretability, https://arxiv.org/abs/2504.04215
  • Wojciech Zaremba, Evgenia Nitishinskaya, Boaz Barak, Stephanie Lin, Sam Toyer, Yaodong Yu, Rachel Dias, Eric Wallace, Kai Xiao, Johannes Heidecke, Amelia Glaese, 31 Jan 2025, Trading Inference-Time Compute for Adversarial Robustness, https://arxiv.org/abs/2501.18841
  • Hongzhe Du, Weikai Li, Min Cai, Karim Saraipour, Zimin Zhang, Himabindu Lakkaraju, Yizhou Sun, Shichang Zhang, 11 Aug 2025, How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence, https://arxiv.org/abs/2504.02904
  • Haonan Zhang, Dongxia Wang, Yi Liu, Kexin Chen, Jiashui Wang, Xinlei Ying, Long Liu, Wenhai Wang, 15 Aug 2025, ORFuzz: Fuzzing the "Other Side" of LLM Safety -- Testing Over-Refusal, https://arxiv.org/abs/2508.11222
  • Yuan Yuan, Tina Sriskandarajah, Anna-Luisa Brakman, Alec Helyar, Alex Beutel, Andrea Vallone, Saachi Jain, 12 Aug 2025, From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training, https://arxiv.org/abs/2508.09224
  • Ranjie Duan, Jiexi Liu, Xiaojun Jia, Shiji Zhao, Ruoxi Cheng, Fengxiang Wang, Cheng Wei, Yong Xie, Chang Liu, Defeng Li, Yinpeng Dong, Yichi Zhang, Yuefeng Chen, Chongwen Wang, Xingjun Ma, Xingxing Wei, Yang Liu, Hang Su, Jun Zhu, Xinfeng Li, Yitong Sun, Jie Zhang, Jinzhao Hu, Sha Xu, Yitong Yang, Jialing Tao, Hui Xue, 4 Sep 2025, Oyster-I: Beyond Refusal -- Constructive Safety Alignment for Responsible Language Models, https://arxiv.org/abs/2509.01909
  • Md Abdullah Al Mamun, Ihsen Alouani, Nael Abu-Ghazaleh, 28 Aug 2025, Poison Once, Refuse Forever: Weaponizing Alignment for Injecting Bias in LLMs, https://arxiv.org/abs/2508.20333
  • Neel Jain, Aditya Shrivastava, Chenyang Zhu, Daben Liu, Alfy Samuel, Ashwinee Panda, Anoop Kumar, Micah Goldblum, Tom Goldstein, 29 Aug 2025, Refusal Tokens: A Simple Way to Calibrate Refusals in Large Language Models, https://arxiv.org/abs/2412.06748
  • Nirmalendu Prakash, Yeo Wei Jie, Amir Abdullah, Ranjan Satapathy, Erik Cambria, Roy Ka Wei Lee, 7 Sep 2025, Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal, https://arxiv.org/abs/2509.09708
  • Wenbo Pan, Jie Xu, Qiguang Chen, Junhao Dong, Libo Qin, Xinfeng Li, Haining Yu, Xiaohua Jia, 2 Oct 2025, Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks, https://arxiv.org/abs/2510.01782
  • Huizhen Shu, Xuying Li, Zhuo Li, 24 Sep 2025, LatentGuard: Controllable Latent Steering for Robust Refusal of Attacks and Reliable Response Generation, https://arxiv.org/abs/2509.19839
  • Ziheng Cheng, Yixiao Huang, Hui Xu, Somayeh Sojoudi, Xuandong Zhao, Dawn Song, Song Mei, 25 Oct 2025, OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models, https://arxiv.org/abs/2505.21347
  • Sha Luo, Sang Jung Kim, Zening Duan, Kaiping Chen, 27 Oct 2025, Refusal as Silence: Gendered Disparities in Vision-Language Model Responses, https://arxiv.org/abs/2406.08222
  • Ravi Pandya, Madison Bland, Duy P. Nguyen, Changliu Liu, Jaime Fernández Fisac, Andrea Bajcsy, 15 Oct 2025, From Refusal to Recovery: A Control-Theoretic Approach to Generative AI Guardrails, https://arxiv.org/abs/2510.13727
  • Vincent Siu, Nathan W. Henry, Nicholas Crispino, Yang Liu, Dawn Song, Chenguang Wang, 20 Oct 2025, RepIt: Steering Language Models with Concept-Specific Refusal Vectors, https://arxiv.org/abs/2509.13281
  • Eric Hanchen Jiang, Weixuan Ou, Run Liu, Shengyuan Pang, Guancheng Wan, Ranjie Duan, Wei Dong, Kai-Wei Chang, XiaoFeng Wang, Ying Nian Wu, Xinfeng Li, 9 Oct 2025, Energy-Driven Steering: Reducing False Refusals in Large Language Models, https://arxiv.org/abs/2510.08646
  • Aashiq Muhamed, Leonardo F. R. Ribeiro, Markus Dreyer, Virginia Smith, Mona T. Diab, 12 Oct 2025, RefusalBench: Generative Evaluation of Selective Refusal in Grounded Language Models, https://arxiv.org/abs/2510.10390
  • Thijs Willems, Sumbul Khan, Qian Huang, Bradley Camburn, Nachamma Sockalingam, King Wang Poon, 22 Oct 2025, To Use or to Refuse? Re-Centering Student Agency with Generative AI in Engineering Design Education, https://arxiv.org/abs/2510.19342
  • Cheng-Kuang Wu, Zhi Rui Tam, Chieh-Yen Lin, Yun-Nung Chen, Hung-yi Lee, 30 Sep 2025, Answer, Refuse, or Guess? Investigating Risk-Aware Decision Making in Language Models, https://arxiv.org/abs/2503.01332
  • Qingyu Yin, Chak Tou Leong, Linyi Yang, Wenxuan Huang, Wenjie Li, Xiting Wang, Jaehong Yoon, YunXing, XingYu, Jinjin Gu, 7 Oct 2025, Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?, https://arxiv.org/abs/2510.06036
