Aussie AI

AI Safety Research

Last Updated 22 October, 2025

by David Spuler, Ph.D.

Safe and responsible use of AI is an important and all-encompassing goal. Multiple concerns arise in the use of modern AI capabilities, and for the future with more advanced AI systems. This article examines the various research papers on difference AI safety issues.

Types of AI Safety Issues

There are a variety of distinct issue in terms of appropriate use of AI. Some of the categories include:

Bias and fairness
Inaccurate results
Imaginary results ("hallucinations" or "confabulations")
Inappropriate responses (e.g., "toxicity")
Plagiarism

There are some issues that get quite close to being philosophy rather than technology:

Alignment (ensuring AI engines are "aligned" with human goals)
Overrideability/interruptibility
Obedience vs autonomy

There are some overarching issues for AI matters for the government and in the community:

Ethics
Governance
Regulation
Auditing and Enforcement
Privacy
Risk Mitigation

Issues specific to mitigation of AI safety risks include:

Red teaming (testing of safety issues)
Prompt shields
Guardrails
Jailbreak prevention
Refusal modules
Security issues

And since we may rely on AI models in various real-world situations, including dangerous real-time situations like driving a car, there are some practical technological issues ensuring that AI engines operate safely and reliably within their basic operational scope:

Testing and Debugging (simply avoiding coding "bugs" in complex AI engines)
Real-time performance profiling ("de-slugging")
Error Handling (tolerance of internal or external errors)
Code Resilience (handling unexpected inputs or situations reasonably)

Overviews, Surveys, and Reviews

Various authors have reviewed the areas of safety and ethics:

Cath C. Governing artificial intelligence: ethical, legal and technical opportunities and challenges. Philos Trans A Math Phys Eng Sci. 2018 Oct 15;376(2133):20180080. doi: 10.1098/rsta.2018.0080. PMID: 30322996 https://pubmed.ncbi.nlm.nih.gov/30322996/
Hagendorff Thilo. The ethics of AI ethics: an evaluation of guidelines. Minds and Machines. 2020; 30(1):99–120. https://link.springer.com/article/10.1007/s11023-020-09517-8
Jobin Anna, Ienca Marcello, Vayena Effy. The global landscape of AI ethics guidelines. Nature Machine Intell. 2019;(1):389–399. https://www.nature.com/articles/s42256-019-0088-2
Soni N., Sharma E.K., Singh N., Kapoor A. 2019. Impact of Artificial Intelligence on Businesses: from Research, Innovation, Market Deployment to Future Shifts in Business Models”.arXiv:1905.02092. https://arxiv.org/abs/1905.02092

Hallucinations

Hallucinations are plausible-sounding answers that are not correct, and not based on any facts. It appears like the LLM is lying or faking the answer, but it doesn't actually know this. Rather, it is probabilistically trying to come up with the best answer, and sometimes it doesn't have a factual answer, so it can fill in the blanks.

Liangming Pan, Michael Saxon, Wenda Xu, Deepak Nathani, Xinyi Wang, William Yang Wang, May 03 2024, Automatically Correcting Large Language Models: Surveying the Landscape of Diverse Automated Correction Strategies, https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00660/120911
Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang, June 2023, Exposing Attention Glitches with Flip-Flop Language Modeling, https://arxiv.org/abs/2306.00946
Lucas Mearian, 14 Mar 2024, AI hallucination mitigation: two brains are better than one, https://www.computerworld.com/article/1612465/ai-hallucination-mitigation-two-brains-are-better-than-one.html
Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
Bijit Ghosh Feb 2024, Advanced Prompt Engineering for Reducing Hallucination, https://medium.com/@bijit211987/advanced-prompt-engineering-for-reducing-hallucination-bb2c8ce62fc6
Junyi Li, Jie Chen, Ruiyang Ren, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, Ji-Rong Wen, 6 Jan 2024, The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models, https://arxiv.org/abs/2401.03205 Code: https://github.com/RUCAIBox/HaluEval-2.0
Colin Fraser, Apr 18, 2024, Hallucinations, Errors, and Dreams On why modern AI systems produce false outputs and what there is to be done about it, https://medium.com/@colin.fraser/hallucinations-errors-and-dreams-c281a66f3c35
Johnny Li, Saksham Consul, Eda Zhou, James Wong, Naila Farooqui, Yuxin Ye, Nithyashree Manohar, Zhuxiaona Wei, Tian Wu, Ben Echols, Sharon Zhou, Gregory Diamos, 25 Jun 2024, Banishing LLM Hallucinations Requires Rethinking Generalization, https://arxiv.org/abs/2406.17642
Pavan Belagatti, Jul 31, 2024, Semantic Chunking for Enhanced RAG Applications! https://levelup.gitconnected.com/semantic-chunking-for-enhanced-rag-applications-b6bc92942af0
Mintong Kang, Nezihe Merve Gürel, Ning Yu, Dawn Song, Bo Li, July 2024, C-RAG: Certified Generation Risks for Retrieval-Augmented Language Models, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:22963-23000, 2024, https://proceedings.mlr.press/v235/kang24a.html
Mengya Hu, Rui Xu, Deren Lei, Yaxi Li, Mingyu Wang, Emily Ching, Eslam Kamal, Alex Deng, 22 Aug 2024, SLM Meets LLM: Balancing Latency, Interpretability and Consistency in Hallucination Detection, https://arxiv.org/abs/2408.12748
Hao Zhou, Chengming Hu, Ye Yuan, Yufei Cui, Yili Jin, Can Chen, Haolun Wu, Dun Yuan, Li Jiang, Di Wu, Xue Liu, Charlie Zhang, Xianbin Wang, Jiangchuan Liu, 17 May 2024, Large Language Model (LLM) for Telecommunications: A Comprehensive Survey on Principles, Key Techniques, and Opportunities, https://arxiv.org/abs/2405.10825
C Yang, S Fujita, 2024, Adaptive Control of Retrieval-Augmented Generation for LLMs Through Reflective Tags, https://www.preprints.org/manuscript/202408.2152/download/final_file
Michael Wood, Aug 26, 2024, 100% Accurate AI Claimed by Acurai — OpenAI and Anthropic Confirm Acurai’s Discoveries, https://blog.cubed.run/100-accurate-ai-claimed-by-acurai-openai-and-anthropic-confirm-acurais-discoveries-98fce1ddeb5b
James Lee Stakelum, Sep 2024, The End of AI Hallucinations: A Big Breakthrough in Accuracy for AI Application Developers, https://medium.com/@JamesStakelum/the-end-of-ai-hallucinations-a-breakthrough-in-accuracy-for-data-engineers-e67be5cc742a
F. Li, X. zhang and P. Zhang, 2024, Mitigating Hallucination Issues in Small-Parameter LLMs through Inter-Layer Contrastive Decoding, 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 2024, pp. 1-8, doi: 10.1109/IJCNN60899.2024.10650644, https://ieeexplore.ieee.org/abstract/document/10650644
Zhongxiang Sun, Zihua Si, Xiaoxue Zang, Kai Zheng, Yang Song, Xiao Zhang, Jun Xu, 15 Oct 2024, LargePiG: Your Large Language Model is Secretly a Pointer Generator, https://arxiv.org/abs/2410.11366
Garanc Burke, Hilke Schellmann, October 27, 2024, Researchers say an AI-powered transcription tool used in hospitals invents things no one ever said, https://apnews.com/article/ai-artificial-intelligence-health-business-90020cdf5fa16c79ca2e5b6c4c9bbb14
Adi Simhi, Jonathan Herzig, Idan Szpektor, Yonatan Belinkov, 29 Oct 2024, Distinguishing Ignorance from Error in LLM Hallucinations, https://arxiv.org/abs/2410.22071 https://github.com/technion-cs-nlp/hallucination-mitigation
Salvatore Raieli, Nov 2024, What Is The Best Therapy For a Hallucinating AI Patient? Exploring the Art and Science of Prompt Engineering to Cure LLM Hallucinations, https://levelup.gitconnected.com/what-is-the-best-therapy-for-a-hallucinating-ai-patient-acf0cb9b3e00
Vitaly Kukharenko, Nov 2024, Why Do Neural Networks Hallucinate (And What Are Experts Doing About It)? https://pub.towardsai.net/why-do-neural-networks-hallucinate-and-what-are-experts-doing-about-it-7b9342605bf7
Yixiong Fang, Ziran Yang, Zhaorun Chen, Zhuokai Zhao, Jiawei Zhou, 9 Dec 2024, From Uncertainty to Trust: Enhancing Reliability in Vision-Language Models with Uncertainty-Guided Dropout Decoding, https://arxiv.org/abs/2412.06474
Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri, 10 Dec 2024, Granite Guardian, https://arxiv.org/abs/2412.07724 https://github.com/ibm-granite/granite-guardian (Open-sourcing of safety models with many capabilities.)
Lilian Weng, July 7, 2024, Extrinsic Hallucinations in LLMs, https://lilianweng.github.io/posts/2024-07-07-hallucination/
Rhiannon Williams, December 31, 2024, The biggest AI flops of 2024: From chatbots dishing out illegal advice to dodgy AI-generated search results, take a look back over the year’s top AI failures. https://www.technologyreview.com/2024/12/31/1109612/biggest-worst-ai-artificial-intelligence-flops-fails-2024/
Kazi Hasan Ibn Arif, Sajib Acharjee Dip, Khizar Hussain, Lang Zhang, Chris Thomas, 21 Jan 2025, Fixing Imbalanced Attention to Mitigate In-Context Hallucination of Large Vision-Language Model, https://arxiv.org/abs/2501.12206
Huan Ma, Jingdong Chen, Guangyu Wang, Changqing Zhang, 1 Feb 2025, Estimating LLM Uncertainty with Logits, https://arxiv.org/abs/2502.00290
Ningke Li, Yahui Song, Kailong Wang, Yuekang Li, Ling Shi, Yi Liu, Haoyu Wang, 19 Feb 2025, Detecting LLM Fact-conflicting Hallucinations Enhanced by Temporal-logic-based Reasoning, https://arxiv.org/abs/2502.13416
Seongheon Park, Xuefeng Du, Min-Hsuan Yeh, Haobo Wang, Yixuan Li, 1 Mar 2025, How to Steer LLM Latents for Hallucination Detection? https://arxiv.org/abs/2503.01917
Sean Michael Kerner, May 13, 2025, Guardian agents: New approach could reduce AI hallucinations to below 1%, https://venturebeat.com/ai/beyond-detection-why-automatically-correcting-hallucinations-could-transform-enterprise-ai-adoption/
Lei Wang, 12 May 2025, SEReDeEP: Hallucination Detection in Retrieval-Augmented Models via Semantic Entropy and Context-Parameter Fusion, https://arxiv.org/abs/2505.07528
Manuel Cossio, 3 Aug 2025, A comprehensive taxonomy of hallucinations in Large Language Models, https://arxiv.org/abs/2508.01781
Igor Halperin, 13 Aug 2025, Prompt-Response Semantic Divergence Metrics for Faithfulness Hallucination and Misalignment Detection in Large Language Models, https://arxiv.org/abs/2508.10192
Denis Janiak, Jakub Binkowski, Albert Sawczyn, Bogdan Gabrys, Ravid Shwartz-Ziv, Tomasz Kajdanowicz, 13 Aug 2025, The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs, https://arxiv.org/abs/2508.08285
Xi Long, Christy Boscardin, Lauren A. Maggio, Joseph A. Costello, Ralph Gonzales, Rasmyah Hammoudeh, Ki Lai, Yoon Soo Park, Brian C. Gin, 14 Aug 2025, Hallucination vs interpretation: rethinking accuracy and precision in AI-assisted data extraction for knowledge synthesis, https://arxiv.org/abs/2508.09458
Siyuan Liu, Wenjing Liu, Zhiwei Xu, Xin Wang, Bo Chen, Tao Li, 21 Jul 2025, Towards Mitigation of Hallucination for LLM-empowered Agents: Progressive Generalization Bound Exploration and Watchdog Monitor, https://arxiv.org/abs/2507.15903
Zhenliang Zhang, Xinyu Hu, Huixuan Zhang, Junzhe Zhang, Xiaojun Wan, 22 Jul 2025, ICR Probe: Tracking Hidden State Dynamics for Reliable Hallucination Detection in LLMs, https://arxiv.org/abs/2507.16488
Xin Dong, Shichao Dong, Jin Wang, Jing Huang, Li Zhou, Zenghui Sun, Lihua Jing, Jingsong Lan, Xiaoyong Zhu, Bo Zheng, 22 Jul 2025, INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling, https://arxiv.org/abs/2507.05056
Seunghoi Kim and Henry F. J. Tregidgo and Matteo Figini and Chen Jin and Sarang Joshi and Daniel C. Alexander, 24 Jul 2025, Tackling Hallucination from Conditional Models for Medical Image Reconstruction with DynamicDPS, https://arxiv.org/abs/2503.01075
Weihua Zheng, Roy Ka-Wei Lee, Zhengyuan Liu, Kui Wu, AiTi Aw, Bowei Zou, 17 Jul 2025, CCL-XCoT: An Efficient Cross-Lingual Knowledge Transfer Method for Mitigating Hallucination Generation, https://arxiv.org/abs/2507.14239
Jingwei Huang, Kuroush Nezafati, Ismael Villanueva-Miranda, Zifan Gu, Yueshuang Xu, Ann Marie Navar, Tingyi Wanyan, Qin Zhou, Bo Yao, Ruichen Rong, Xiaowei Zhan, Guanghua Xiao, Eric D. Peterson, Donghan M. Yang, Wenqi Shi, Yang Xie, 18 Jul 2025, Large Language Models Powered Multiagent Ensemble for Mitigating Hallucination and Efficient Atrial Fibrillation Annotation of ECG Reports, https://arxiv.org/abs/2410.16543
Ashley Lewis, Michael White, Jing Liu, Toshiaki Koike-Akino, Kieran Parsons, Ye Wang, 21 Jul 2025, Winning Big with Small Models: Knowledge Distillation vs. Self-Training for Reducing Hallucination in Product QA Agents, https://arxiv.org/abs/2502.19545
Quan Shi, Wang Xi, Zenghui Ding, Jianqing Gao, Xianjun Yang, 10 Aug 2025, Hallucination as a Computational Boundary: A Hierarchy of Inevitability and the Oracle Escape, https://arxiv.org/abs/2508.07334
Ming-Kun Xie, Jia-Hao Xiao, Gang Niu, Lei Feng, Zhiqiang Kou, Min-Ling Zhang, and Masashi Sugiyama, 3 Aug 2025, What Makes "Good" Distractors for Object Hallucination Evaluation in Large Vision-Language Models?, https://arxiv.org/abs/2508.06530
Jakob Snel and Seong Joon Oh, 28 Jul 2025, First Hallucination Tokens Are Different from Conditional Ones, https://arxiv.org/abs/2507.20836
Shengyuan Wang, Jie Feng, Tianhui Liu, Dan Pei, Yong Li, 25 Jul 2025, Mitigating Geospatial Knowledge Hallucination in Large Language Models: Benchmarking and Dynamic Factuality Aligning, https://arxiv.org/abs/2507.19586
Baiyu Chen, Wilson Wongso, Xiaoqian Hu, Yue Tan, Flora Salim, 27 Jul 2025, Multi-Stage Verification-Centric Framework for Mitigating Hallucination in Multi-Modal RAG, https://arxiv.org/abs/2507.20136
Joosung Lee, Cheonbok Park, Hwiyeol Jo, Jeonghoon Kim, Joonsuk Park, Kang Min Yoo, 28 Jul 2025, Enhancing Hallucination Detection via Future Context, https://arxiv.org/abs/2507.20546
Esmail Gumaan, 20 Jul 2025, Theoretical Foundations and Mitigation of Hallucination in Large Language Models, https://arxiv.org/abs/2507.22915
Praveenkumar Katwe, Rakesh Chandra, Balabantaray Kali, Prasad Vittala, 30 Jul 2025, Reducing Hallucinations in Summarization via Reinforcement Learning with Entity Hallucination Index, https://arxiv.org/abs/2507.22744
Vijja Wichitwechkarn, Charles Fox, Ruchi Choudhary, 23 Jul 2025, Hallucination Detection and Mitigation with Diffusion in Multi-Variate Time-Series Foundation Models, https://arxiv.org/abs/2508.00881
Xiaoyu Pan, Yang Bai, Ke Zou, Yang Zhou, Jun Zhou, Huazhu Fu, Yih-Chung Tham, Yong Liu, 24 Jul 2025, EH-Benchmark Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow, https://arxiv.org/abs/2507.22929
Zhaochen Wang, Yiwei Wang, Yujun Cai, 3 Aug 2025, Cure or Poison? Embedding Instructions Visually Alters Hallucination in Vision-Language Models, https://arxiv.org/abs/2508.01678
Yijun Feng, 3 Aug 2025, Counterfactual Probing for Hallucination Detection and Mitigation in Large Language Models, https://arxiv.org/abs/2508.01862
Zhaoyi Sun, Wen-Wai Yim, Ozlem Uzuner, Fei Xia, Meliha Yetisgen, 1 Aug 2025, A Scoping Review of Natural Language Processing in Addressing Medically Inaccurate Information: Errors, Misinformation, and Hallucination, https://arxiv.org/abs/2505.00008
Junyoung Lim, Jaewoo Ahn, Gunhee Kim, 5 Aug 2025, ChartCap: Mitigating Hallucination of Dense Chart Captioning, https://arxiv.org/abs/2508.03164
Subhey Sadi Rahman, Md. Adnanul Islam, Md. Mahbub Alam, Musarrat Zeba, Md. Abdur Rahman, Sadia Sultana Chowa, Mohaimenul Azam Khan Raiaan, Sami Azam, 5 Aug 2025, Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Models, https://arxiv.org/abs/2508.03860
Shunqi Mao, Chaoyi Zhang, Weidong Cai, 6 Aug 2025, Through the Magnifying Glass: Adaptive Perception Magnification for Hallucination-Free VLM Decoding, https://arxiv.org/abs/2503.10183
Micha{\l} P. Karpowicz, 6 Aug 2025, On the Fundamental Impossibility of Hallucination Control in Large Language Models, https://arxiv.org/abs/2506.06382
Huaicheng Zhang, Wei Tan, Guangzheng Li, Yixuan Zhang, Hangting Chen, Shun Lei, Chenyu Yang, Zhiyong Wu, Shuai Wang, Qijun Huang, Dong Yu, 7 Aug 2025, Towards Hallucination-Free Music: A Reinforcement Learning Preference Optimization Framework for Reliable Song Generation, https://arxiv.org/abs/2508.05011
Kim Hammar and Tansu Alpcan and Emil C. Lupu, 7 Aug 2025, Incident Response Planning Using a Lightweight Large Language Model with Reduced Hallucination, https://arxiv.org/abs/2508.05188
Marc Pavel, Nenad Petrovic, Lukasz Mazur, Vahid Zolfaghari, Fengjunjie Pan, Alois Knoll, 15 Aug 2025, Hallucination in LLM-Based Code Generation: An Automotive Case Study, https://arxiv.org/abs/2508.11257
Nanxing Hu, Xiaoyue Duan, Jinchao Zhang, Guoliang Kang, 19 Aug 2025, Enhancing Visual Reliance in Text Generation: A Bayesian Perspective on Mitigating Hallucination in Large Vision-Language Models, https://arxiv.org/abs/2505.19498
Huan Ma, Jiadong Pan, Jing Liu, Yan Chen, Joey Tianyi Zhou, Guangyu Wang, Qinghua Hu, Hua Wu, Changqing Zhang, Haifeng Wang, 20 Aug 2025, Semantic Energy: Detecting LLM Hallucination Beyond Entropy, https://arxiv.org/abs/2508.14496
Aman Goel, Daniel Schwartz, Yanjun Qi, 19 Aug 2025, Zero-knowledge LLM hallucination detection and mitigation through fine-grained cross-model consistency, https://arxiv.org/abs/2508.14314
Yupei Yang, Fan Feng, Lin Yang, Wanxi Deng, Lin Qu, Biwei Huang, Shikui Tu, Lei Xu, 20 Aug 2025, DEPTH: Hallucination-Free Relation Extraction via Dependency-Aware Sentence Simplification and Two-tiered Hierarchical Refinement, https://arxiv.org/abs/2508.14391
Nicole Cho, William Watson, Alec Koppel, Sumitra Ganesh, Manuela Veloso, 22 Aug 2025, QueryBandits for Hallucination Mitigation: Exploiting Semantic Features for No-Regret Rewriting, https://arxiv.org/abs/2508.16697
Nicolas Zucchet, J\"org Bornschein, Stephanie Chan, Andrew Lampinen, Razvan Pascanu, Soham De, 24 Jul 2025, How do language models learn facts? Dynamics, curricula and hallucinations, https://arxiv.org/abs/2503.21676
Anindya Bijoy Das, Shahnewaz Karim Sakib and Shibbir Ahmed, 9 Aug 2025, Trustworthy Medical Imaging with Large Language Models: A Study of Hallucinations Across Modalities, https://arxiv.org/abs/2508.07031
Charles O'Neill, Slava Chalnev, Chi Chi Zhao, Max Kirkby, Mudith Jayasekara, 31 Jul 2025, A Single Direction of Truth: An Observer Model's Linear Residual Probe Exposes and Steers Contextual Hallucinations, https://arxiv.org/abs/2507.23221
Zhangcheng Qiang, Kerry Taylor, Weiqing Wang, Jing Jiang, 25 Mar 2025, OAEI-LLM-T: A TBox Benchmark Dataset for Understanding Large Language Model Hallucinations in Ontology Matching, https://arxiv.org/abs/2503.21813
Yudong Zhang, Ruobing Xie, Xingwu Sun, Yiqing Huang, Jiansheng Chen, Zhanhui Kang, Di Wang, Yu Wang, 31 Jul 2025, DHCP: Detecting Hallucinations by Cross-modal Attention Pattern in Large Vision-Language Models, https://arxiv.org/abs/2411.18659
Haonan Ge, Yiwei Wang, Ming-Hsuan Yang, Yujun Cai, 14 Aug 2025, MRFD: Multi-Region Fusion Decoding with Self-Consistency for Mitigating Hallucinations in LVLMs, https://arxiv.org/abs/2508.10264
Likun Tan, Kuan-Wei Huang, Kevin Wu, 28 Jul 2025, FRED: Financial Retrieval-Enhanced Detection and Editing of Hallucinations in Language Models, https://arxiv.org/abs/2507.20930
Neil F. Johnson and Frank Yingjie Huo, 1 Aug 2025, Multispin Physics of AI Tipping Points and Hallucinations, https://arxiv.org/abs/2508.01097
Chenxi Li, Yichen Guo, Benfang Qian, Jinhao You, Kai Tang, Yaosong Du, Zonghao Zhang, and Xiande Huang, 3 Aug 2025, MAP: Mitigating Hallucinations in Large Vision-Language Models with Map-Level Attention Processing, https://arxiv.org/abs/2508.01653
Peizheng Guo, Jingyao Wang, Wenwen Qiang, Huijie Guo, Changwen Zheng, Jiahuan Zhou, Gang Hua, 6 Aug 2025, Hacking Hallucinations of MLLMs with Causal Sufficiency and Necessity, https://arxiv.org/abs/2508.04182
Mengao Zhang, Jiayu Fu, Tanya Warrier, Yuwen Wang, Tianhui Tan, Ke-wei Huang, 7 Aug 2025, FAITH: A Framework for Assessing Intrinsic Tabular Hallucinations in finance, https://arxiv.org/abs/2508.05201
Vibhor Agarwal, Yiqiao Jin, Mohit Chandra, Munmun De Choudhury, Srijan Kumar, Nishanth Sastry, 7 Aug 2025, MedHalu: Hallucinations in Responses to Healthcare Queries by Large Language Models, https://arxiv.org/abs/2409.19492
Chunhua Liu, Hong Yi Lin and Patanamon Thongtanunam, 12 Aug 2025, Hallucinations in Code Change to Natural Language Generation: Prevalence and Evaluation of Detection Metrics, https://arxiv.org/abs/2508.08661
Ashish Seth, Utkarsh Tyagi, Ramaneswaran Selvakumar, Nishit Anand, Sonal Kumar, Sreyan Ghosh, Ramani Duraiswami, Chirag Agarwal, Dinesh Manocha, 18 Aug 2025, EGOILLUSION: Benchmarking Hallucinations in Egocentric Video Understanding, https://arxiv.org/abs/2508.12687
Yuangang Li, Yiqing Shen, Yi Nian, Jiechao Gao, Ziyi Wang, Chenxiao Yu, Shawn Li, Jie Wang, Xiyang Hu, Yue Zhao, 17 Aug 2025, Mitigating Hallucinations in Large Language Models via Causal Reasoning, https://arxiv.org/abs/2508.12495
Wenhao Li, Xiu Su, Jingyi Wu, Feng Yang, Yang Liu, Yi Chen, Shan You, Chang Xu, 19 Aug 2025, Identify, Isolate, and Purge: Mitigating Hallucinations in LVLMs via Self-Evolving Distillation, https://arxiv.org/abs/2507.04680
Anindya Bijoy Das, Shibbir Ahmed and Shahnewaz Karim Sakib, 19 Aug 2025, Hallucinations and Key Information Extraction in Medical Texts: A Comprehensive Assessment of Open-Source Large Language Models, https://arxiv.org/abs/2504.19061
Chenlin Liu, Minghui Fang, Patrick Zhang, Wei Zhou, Jie Gao, Jiqing Han, 21 Aug 2025, Mitigating Hallucinations in LM-Based TTS Models via Distribution Alignment Using GFlowNets, https://arxiv.org/abs/2508.15442
Reilly Haskins and Benjamin Adams, 21 Aug 2025, KEA Explain: Explanations of Hallucinations using Graph Kernel Analysis, https://arxiv.org/abs/2507.03847
Shuzhou Yuan, Zhan Qu, Ashish Yashwanth Kangen, Michael F\"arber, 22 Aug 2025, Can Hallucinations Help? Boosting LLMs for Drug Discovery, https://arxiv.org/abs/2501.13824
Charles Moslonka, Hicham Randrianarivo, Arthur Garnier and Emmanuel Malherbe, 1 Sep 2025, Learned Hallucination Detection in Black-Box LLMs using Token-level Entropy Production Rate, https://arxiv.org/abs/2509.04492
Jiawei Li, Akshayaa Magesh, Venugopal V. Veeravalli, 25 Aug 2025, Principled Detection of Hallucinations in Large Language Models via Multiple Testing, https://arxiv.org/abs/2508.18473
Yiming Huang, Junyan Zhang, Zihao Wang, Biquan Bie, Yunzhong Qiu, Yi R. Fung, Xinlei He, 26 Aug 2025, RePPL: Recalibrating Perplexity by Uncertainty in Semantic Propagation and Language Generation for Explainable QA Hallucination Detection, https://arxiv.org/abs/2505.15386
Supratik Sarkar, Swagatam Das, 26 Aug 2025, Grounding the Ungrounded: A Spectral-Graph Framework for Quantifying Hallucinations in multimodal LLMs, https://arxiv.org/abs/2508.19366
Kehao Miao, Xiaolong Jin, 26 Aug 2025, An Investigation on Group Query Hallucination Attacks, https://arxiv.org/abs/2508.19321
Seongheon Park and Yixuan Li, 27 Aug 2025, GLSim: Detecting Object Hallucinations in LVLMs via Global-Local Similarity, https://arxiv.org/abs/2508.19972
Alberto Compagnoni, Davide Caffagni, Nicholas Moratelli, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara, 27 Aug 2025, Mitigating Hallucinations in Multimodal LLMs via Object-aware Preference Optimization, https://arxiv.org/abs/2508.20181
Tingxuan Xu, Jiarui Feng, Justin Melendez, Kaleigh Roberts, Donghong Cai, Mingfang Zhu, Donald Elbert, Yixin Chen, Randall J. Bateman, 28 Aug 2025, Addressing accuracy and hallucination of LLMs in Alzheimer's disease research through knowledge graphs, https://arxiv.org/abs/2508.21238
Weizhi Gao, Xiaorui Liu, Feiyi Wang, Dan Lu, Junqi Yin, 28 Aug 2025, Decoding Memories: An Efficient Pipeline for Self-Consistency Hallucination Detection, https://arxiv.org/abs/2508.21228
Hao Lu, Jiahao Wang, Yaolun Zhang, Ruohui Wang, Xuanyu Zheng, Yepeng Tang, Dahua Lin, Lewei Lu, 29 Aug 2025, ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding, https://arxiv.org/abs/2508.21496
Junzhe Chen, Tianshu Zhang, Shiyu Huang, Yuwei Niu, Chao Sun, Rongzhou Zhang, Guanyu Zhou, Lijie Wen, Xuming Hu, 31 Aug 2025, OmniDPO: A Preference Optimization Framework to Address Omni-Modal Hallucination, https://arxiv.org/abs/2509.00723
Saad Abdul Ghani, Zizhao Wang, Peter Stone, Xuesu Xiao, 1 Sep 2025, Dyna-LfLH: Learning Agile Navigation in Dynamic Environments from Learned Hallucination, https://arxiv.org/abs/2403.17231
Haoran Huan, Mihir Prabhudesai, Mengning Wu, Shantanu Jaiswal, Deepak Pathak, 3 Sep 2025, Can LLMs Lie? Investigation beyond Hallucination, https://arxiv.org/abs/2509.03518
Qiang Liu, Xinlong Chen, Yue Ding, Bowen Song, Weiqiang Wang, Shu Wu, Liang Wang, 3 Sep 2025, Attention-guided Self-reflection for Zero-shot Hallucination Detection in Large Language Models, https://arxiv.org/abs/2501.09997
Praneet Suresh, Jack Stanley, Sonia Joseph, Luca Scimeca, Danilo Bzdok, 8 Sep 2025, From Noise to Narrative: Tracing the Origins of Hallucinations in Transformers, https://arxiv.org/abs/2509.06938
Xin Tong, Zhi Lin, Jingya Wang, Bo Jin, 8 Sep 2025, HAVE: Head-Adaptive Gating and ValuE Calibration for Hallucination Mitigation in Large Language Models, https://arxiv.org/abs/2509.06596
Jerry Li, Evangelos Papalexakis, 3 Sep 2025, Beyond ROUGE: N-Gram Subspace Features for LLM Hallucination Detection, https://arxiv.org/abs/2509.05360
Kamil Ciosek, Nicol\`o Felicioni, Sina Ghiassian, 8 Sep 2025, Hallucination Detection on a Budget: Efficient Bayesian Estimation of Semantic Entropy, https://arxiv.org/abs/2504.03579
Kishan Maharaj, Vitobha Munigala, Srikanth G. Tamilselvam, Prince Kumar, Sayandeep Sen, Palani Kodeswaran, Abhijit Mishra, Pushpak Bhattacharyya, 6 Sep 2025, ETF: An Entity Tracing Framework for Hallucination Detection in Code Summaries, https://arxiv.org/abs/2410.14748
Masoumeh Zareh, Mohammad Hossein Manshaei, Sayed Jalal Zahabi, and Marwan Krunz, 6 Sep 2025, Modeling Visual Hallucination: A Generative Adversarial Network Framework, https://arxiv.org/abs/2102.08209
OpenAI, September 5, 2025, Why language models hallucinate, https://openai.com/index/why-language-models-hallucinate/ (Many interesting findings, including that some level of hallucinations are inevitable in the next-token decoding method, and also that current LLM evals reward hallucinations, and need to be reworked to fix hallucinations properly by rewarding expressions of uncertainty in results, i.e., when the model admits it doesn't know something instead of making something up.)
Saumya Goswami, Siddharth Kurra, 9 Sep 2025, HALT-RAG: A Task-Adaptable Framework for Hallucination Detection with Calibrated NLI Ensembles and Abstention, https://arxiv.org/abs/2509.07475
Nobin Sarwar, 8 Sep 2025, FilterRAG: Zero-Shot Informed Retrieval-Augmented Generation to Mitigate Hallucinations in VQA, https://arxiv.org/abs/2502.18536
Malavika Suresh, Rahaf Aljundi, Ikechukwu Nkisi-Orji, Nirmalie Wiratunga, 4 Sep 2025, Cross-Layer Attention Probing for Fine-Grained Hallucination Detection, https://arxiv.org/abs/2509.09700
Naveen Lamba, Sanju Tiwari and Manas Gaur, 9 Sep 2025, Investigating Symbolic Triggers of Hallucination in Gemma Models Across HaluEval and TruthfulQA, https://arxiv.org/abs/2509.09715
Ponhvoan Srey, Xiaobao Wu, Anh Tuan Luu, 12 Sep 2025, Unsupervised Hallucination Detection by Inspecting Reasoning Processes, https://arxiv.org/abs/2509.10004
Garry Yang, Zizhe Chen, Man Hon Wong, Haoyu Lei, Yongqiang Chen, Zhenguo Li, Kaiwen Zhou, James Cheng, 11 Sep 2025, MESH -- Understanding Videos Like Human: Measuring Hallucinations in Large Video Models, https://arxiv.org/abs/2509.08538
Humam Kourani, Anton Antonov, Alessandro Berti, Wil M.P. van der Aalst, 18 Sep 2025, Knowledge-Driven Hallucination in Large Language Models: An Empirical Study on Process Modeling, https://arxiv.org/abs/2509.15336
Davide Ettori, Nastaran Darabi, Sina Tayebati, Ranganath Krishnan, Mahesh Subedar, Omesh Tickoo, and Amit Ranjan Trivedi, 19 Sep 2025, EigenTrack: Spectral Activation Feature Tracking for Hallucination and Out-of-Distribution Detection in LLMs and VLMs, https://arxiv.org/abs/2509.15735
Chung-En Johnny Yu, Hsuan-Chih (Neil) Chen, Brian Jalaian, Nathaniel D. Bastian, 18 Sep 2025, ORCA: Agentic Reasoning For Hallucination and Adversarial Robustness in Vision-Language Models, https://arxiv.org/abs/2509.15435
Seongmin Lee, Hsiang Hsu, Chun-Fu Chen, Duen Horng Chau, 15 Sep 2025, Probing LLM Hallucination from Within: Perturbation-Driven Approach via Internal Knowledge, https://arxiv.org/abs/2411.09689
Saad Obaid ul Islam, Anne Lauscher, Goran Glava\v{s}, 16 Sep 2025, How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild, https://arxiv.org/abs/2502.12769
Boris Kovalerchuk, Brent D. Fegley, 13 Sep 2025, LLM Enhancement with Domain Expert Mental Model to Reduce LLM Hallucination with Causal Prompt Engineering, https://arxiv.org/abs/2509.10818
Minh Vu, Brian K. Tran, Syed A. Shah, Geigh Zollicoffer, Nhat Hoang-Xuan, Manish Bhattarai, 12 Sep 2025, HalluField: Detecting LLM Hallucinations via Field-Theoretic Modeling, https://arxiv.org/abs/2509.10753
Junjie Hu, Gang Tu, ShengYu Cheng, Jinxin Li, Jinting Wang, Rui Chen, Zhilong Zhou, Dongbo Shan, 15 Sep 2025, HARP: Hallucination Detection via Reasoning Subspace Projection, https://arxiv.org/abs/2509.11536
Leon Chlon, Ahmed Karim, Maggie Chlon, 14 Sep 2025, Predictable Compression Failures: Why Language Models Actually Hallucinate, https://arxiv.org/abs/2509.11208
Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Chen Xu, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, Shuming Shi, 14 Sep 2025, Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models, https://arxiv.org/abs/2309.01219
Zhenglin Hua, Jinghan He, Zijun Yao, Tianxu Han, Haiyun Guo, Yuheng Jia, Junfeng Fang, 15 Sep 2025, Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation, https://arxiv.org/abs/2505.16146
Hongxiang Zhang, Hao Chen, Muhao Chen, Tianyi Zhang, 15 Sep 2025, Active Layer-Contrastive Decoding Reduces Hallucination in Large Language Model Generation, https://arxiv.org/abs/2505.23657
Yurui Chang, Bochuan Cao, Lu Lin, 13 Sep 2025, Monitoring Decoding: Mitigating Hallucination via Evaluating the Factuality of Partial Response during Generation, https://arxiv.org/abs/2503.03106
Martin Prei{\ss}, 11 Sep 2025, Hallucination Detection with the Internal Layers of LLMs, https://arxiv.org/abs/2509.14254
Zihao Li, Weiwei Yi, Jiahong Chen, 12 Sep 2025, Accuracy Paradox in Large Language Models: Regulating Hallucination Risks in Generative AI, https://arxiv.org/abs/2509.13345
Xiao Zheng, 17 Sep 2025, DSCC-HS: A Dynamic Self-Reinforcing Framework for Hallucination Suppression in Large Language Models, https://arxiv.org/abs/2509.13702
Mahjabin Nahar, Eun-Ju Lee, Jin Won Park, Dongwon Lee, 17 Sep 2025, Catch Me if You Search: When Contextual Web Search Results Affect the Detection of Hallucinations, https://arxiv.org/abs/2504.01153

Security of AI

Research on security issues involving AI and LLMs:

Jason Koebler, June 26, 2024, Researchers Prove Rabbit AI Breach By Sending Email to Us as Admin, https://www.404media.co/researchers-prove-rabbit-ai-breach-by-sending-email-to-us-as-admin/ (Rabbit's API security credentials were hard-coded into the device.)
Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, Yunxin Liu, 8 May 2024 (v2), Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security, https://arxiv.org/abs/2401.05459 https://github.com/MobileLLM/Personal_LLM_Agents_Survey
Michael Nuñez, August 30, 2024, AI is growing faster than companies can secure it, warn industry leaders, https://venturebeat.com/ai/ai-is-growing-faster-than-companies-can-secure-it-warn-industry-leaders/
Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
Huan Yang, Deyu Zhang, Yudong Zhao, Yuanchun Li, Yunxin Liu, 6 Sep 2024, A First Look At Efficient And Secure On-Device LLM Inference Against KV Leakage, https://arxiv.org/abs/2409.04040 (Security issues where KV caches can be data leaks as they may contain encodings of private information.)
Nicholas Carlini, Milad Nasr, 22 Oct 2024, Remote Timing Attacks on Efficient Language Model Inference, https://arxiv.org/abs/2410.17175
Chenchen Gu, Xiang Lisa Li, Rohith Kuditipudi, Percy Liang, Tatsunori Hashimoto, Dec 2024, Timing Attacks on Prompt Caching in Language Model APIs, Stanford CS 191W Senior Project, https://cs191w.stanford.edu/projects/Gu,%20Chenchen_CS191W.pdf (Using timing attacks to detect prefix KV caching, thereby gaining information about other users' prompts.)
Úlfar Erlingsson, 27 Mar 2025, How to Secure Existing C and C++ Software without Memory Safety, https://arxiv.org/pdf/2503.21145 (Examines four risk mitigation techniques for memory safety.)
Anthropic, 13 Aug 2025, Building Safeguards for Claude, https://www.anthropic.com/news/building-safeguards-for-claude
Pallavi Zambare, Venkata Nikhil Thanikella, Nikhil Padmanabh Kottur, Sree Akhil Akula, Ying Liu, 12 Aug 2025, NetMoniAI: An Agentic AI Framework for Network Security & Monitoring, https://arxiv.org/abs/2508.10052
Miles Q. Li and Benjamin C. M. Fung, 13 Aug 2025, Security Concerns for Large Language Models: A Survey, https://arxiv.org/abs/2505.18889
Vita Santa Barletta, Vito Bavaro, Miriana Calvano, Antonio Curci, Antonio Piccinno, Davide Pio Posa, 23 Jul 2025, Enabling Cyber Security Education through Digital Twins and Generative AI, https://arxiv.org/abs/2507.17518
Haibo Wang, Lutfu S.Sua, and Bahram Alidaee, 22 Jul 2025, Enhancing supply chain security with automated machine learning, https://arxiv.org/abs/2406.13166
Lily Stelling, Mick Yang, Rokas Gipi\v{s}kis, Leon Staufer, Ze Shen Chin, Sim\'eon Campos, Ariel Gil, and Michael Chen, 22 Jul 2025, Mapping Industry Practices to the EU AI Act's GPAI Code of Practice Safety and Security Measures, https://arxiv.org/abs/2504.15181
Rui Guo, Avinash Ayalasomayajula, Henian Li, Jingbo Zhou, Sujan Kumar Saha, Farimah Farahmandi, 22 Jul 2025, SVAgent: AI Agent for Hardware Security Verification Assertion, https://arxiv.org/abs/2507.16203
Chang Gong and Zhongwen Li and Xiaoqi Li, 24 Jul 2025, Information Security Based on LLM Approaches: A Review, https://arxiv.org/abs/2507.18215
Pengfei Du, 14 Jul 2025, PRM-Free Security Alignment of Large Models via Red Teaming and Adversarial Training, https://arxiv.org/abs/2507.14202
Eldor Abdukhamidov, Mohammed Abuhamad, Simon S. Woo, Hyoungshick Kim, Tamer Abuhmed, 18 Jul 2025, Breaking the Illusion of Security via Interpretation: Interpretable Vision Transformer Systems under Attack, https://arxiv.org/abs/2507.14248
Zhou Li, Xiang Zhang, Jiawen Lv, Jihao Fan, Haiqiang Chen, Giuseppe Caire, 19 Jul 2025, Collusion-Resilient Hierarchical Secure Aggregation with Heterogeneous Security Constraints, https://arxiv.org/abs/2507.14768
Nidhi Rastogi, Shirid Pant, Devang Dhanuka, Amulya Saxena, Pranjal Mairal, 20 Jul 2025, Too Much to Trust? Measuring the Security and Cognitive Impacts of Explainability in AI-Driven SOCs, https://arxiv.org/abs/2503.02065
Andrew C. Cullen, Paul Montague, Sarah M. Erfani, Benjamin I.P. Rubinstein, 11 Aug 2025, Position: Certified Robustness Does Not (Yet) Imply Model Security, https://arxiv.org/abs/2506.13024
Andy Zou, Maxwell Lin, Eliot Jones, Micha Nowak, Mateusz Dziemian, Nick Winter, Alexander Grattan, Valent Nathanael, Ayla Croft, Xander Davies, Jai Patel, Robert Kirk, Nate Burnikell, Yarin Gal, Dan Hendrycks, J. Zico Kolter, Matt Fredrikson, 28 Jul 2025, Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition, https://arxiv.org/abs/2507.20526
Shen Li, Liuyi Yao, Wujia Niu, Lan Zhang, Yaliang Li, 28 Jul 2025, Security Tensors as a Cross-Modal Bridge: Extending Text-Aligned Safety to Vision in LVLM, https://arxiv.org/abs/2507.20994
Song Son Ha,Florian Foerster,Thomas Robert Doebbert,Tim Kittel,Dominik Merli,Gerd Scholl, 28 Jul 2025, Testbed and Software Architecture for Enhancing Security in Industrial Private 5G Networks, https://arxiv.org/abs/2507.20873
Keerthana Madhavan, Abbas Yazdinejad, Fattane Zarrinkalam, Ali Dehghantanha, 26 Jul 2025, Quantifying Security Vulnerabilities: A Metric-Driven Security Analysis of Gaps in Current AI Standards, https://arxiv.org/abs/2502.08610
Craig Wright, 10 Jul 2025, A Formal Rebuttal of "The Blockchain Trilemma: A Formal Proof of the Inherent Trade-Offs Among Decentralization, Security, and Scalability", https://arxiv.org/abs/2507.21111
Gauri Sharma, Vidhi Kulkarni, Miles King, Ken Huang, 23 Jul 2025, Towards Unifying Quantitative Security Benchmarking for Multi Agent Systems, https://arxiv.org/abs/2507.21146
Muzhi Dai, Shixuan Liu, Zhiyuan Zhao, Junyu Gao, Hao Sun, Xuelong Li, 29 Jul 2025, Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security, https://arxiv.org/abs/2507.22037
Kang Chen, Xiuze Zhou, Yuanguo Lin, Jinhe Su, Yuanhui Yu, Li Shen, Fan Lin, 4 Aug 2025, A Survey on Data Security in Large Language Models, https://arxiv.org/abs/2508.02312
Niklas Pfister, V\'aclav Volhejn, Manuel Knott, Santiago Arias, Julia Bazi\'nska, Mykhailo Bichurin, Alan Commike, Janet Darling, Peter Dienes, Matthew Fiedler, David Haber, Matthias Kraft, Marco Lancini, Max Mathys, Dami\'an Pascual-Ortiz, Jakub Podolak, Adri\`a Romero-L\'opez, Kyriacos Shiarlis, Andreas Signer, Zsolt Terek, Athanasios Theocharis, Daniel Timbrell, Samuel Trautwein, Samuel Watts, Yun-Han Wu, Mateo Rojas-Carulla, 4 Aug 2025, Gandalf the Red: Adaptive Security for LLMs, https://arxiv.org/abs/2501.07927
Nusrat Zahan, Imranur Rahman, Laurie Williams, 2 Aug 2025, Assumptions to Evidence: Evaluating Security Practices Adoption and Their Impact on Outcomes in the npm Ecosystem, https://arxiv.org/abs/2504.14026
Arturo S\'anchez-Matas, Pablo Escribano Ruiz, Daniel D\'iaz-L\'opez, Angel Luis Perales G\'omez, Pantaleone Nespoli, Gregorio Mart\'inez P\'erez, 5 Aug 2025, Simulating Cyberattacks through a Breach Attack Simulation (BAS) Platform empowered by Security Chaos Engineering (SCE), https://arxiv.org/abs/2508.03882
Hammad Atta, Ken Huang, Manish Bhatt, Kamal Ahmed, Muhammad Aziz Ul Haq, Yasir Mehmood, 6 Aug 2025, Logic layer Prompt Control Injection (LPCI): A Novel Security Vulnerability Class in Agentic Systems, https://arxiv.org/abs/2507.10457
Minghao Shao, Nanda Rani, Kimberly Milner, Haoran Xi, Meet Udeshi, Saksham Aggarwal, Venkata Sai Charan Putrevu, Sandeep Kumar Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, Muhammad Shafique, 5 Aug 2025, Towards Effective Offensive Security LLM Agents: Hyperparameter Tuning, LLM as a Judge, and a Lightweight CTF Benchmark, https://arxiv.org/abs/2508.05674
Hiroya Kato, Kentaro Kita, Kento Hasegawa, Seira Hidano, 12 Aug 2025, AI Security Map: Holistic Organization of AI Security Technologies and Impacts on Stakeholders, https://arxiv.org/abs/2508.08583
Aayush Gupta, 12 Aug 2025, Can AI Keep a Secret? Contextual Integrity Verification: A Provable Security Architecture for LLMs, https://arxiv.org/abs/2508.09288
Irash Perera (1), Hiranya Abeyrathne (2), Sanjeewa Malalgoda (2), Arshardh Ifthikar (2) ((1) Department of Computer Science and Engineering, University of Moratuwa, Colombo, Sri Lanka, (2) WSO2, Colombo, Sri Lanka), 14 Aug 2025, Enhancing GraphQL Security by Detecting Malicious Queries Using Large Language Models, Sentence Transformers, and Convolutional Neural Networks, https://arxiv.org/abs/2508.11711
Afrah Gueriani, Hamza Kheddar, Ahmed Cherif Mazari and Mohamed Chahine Ghanem, 17 Aug 2025, A Robust Cross-Domain IDS using BiGRU-LSTM-Attention for Medical and Industrial IoT Security, https://arxiv.org/abs/2508.12470
Yongjian Guo, Puzhuo Liu, Wanlun Ma, Zehang Deng, Xiaogang Zhu, Peng Di, Xi Xiao, Sheng Wen, 18 Aug 2025, Systematic Analysis of MCP Security, https://arxiv.org/abs/2508.12538
Yixuan Yang and Daoyuan Wu and Yufan Chen, 17 Aug 2025, MCPSecBench: A Systematic Security Benchmark and Playground for Testing Model Context Protocols, https://arxiv.org/abs/2508.13220
Daniel M. Jimenez-Gutierrez, Yelizaveta Falkouskaya, Jose L. Hernandez-Ramos, Aris Anagnostopoulos, Ioannis Chatzigiannakis, Andrea Vitaletti, 19 Aug 2025, On the Security and Privacy of Federated Learning: A Survey with Attacks, Defenses, Frameworks, Applications, and Future Directions, https://arxiv.org/abs/2508.13730
Abbas Sabra, Olivier Schmitt and Joseph Tyler, 20 Aug 2025, Assessing the Quality and Security of AI-Generated Code: A Quantitative Analysis, https://arxiv.org/abs/2508.14727
Zhixiang Guo, Siyuan Liang, Aishan Liu, Dacheng Tao, 21 Aug 2025, CopyrightShield: Enhancing Diffusion Model Security against Copyright Infringement Attacks, https://arxiv.org/abs/2412.01528
Akshay Mhatre and Noujoud Nader and Patrick Diehl and Deepti Gupta, 22 Aug 2025, LLM-GUARD: Large Language Model-Based Detection and Repair of Bugs and Security Vulnerabilities in C++ and Python, https://arxiv.org/abs/2508.16419
Anton Ludwig Bonin, Pawel Robert Smolinski, Jacek Winiarski, 22 Aug 2025, Exploring the Impact of Generative Artificial Intelligence on Software Development in the IT Sector: Preliminary Findings on Productivity, Efficiency and Job Security, https://arxiv.org/abs/2508.16811
Keke Lian and Bin Wang, Lei Zhang, Libo Chen, Junjie Wang, Ziming Zhao, Yujiu Yang, Haotong Duan, Haoran Zhao, Shuang Liao, Mingda Guo, Jiazheng Quan, Yilu Zhong, Chenhao He, Zichuan Chen, Jie Wu, Haoling Li, Zhaoxuan Li, Jiongchi Yu, Hui Li and Dong Zhang, 25 Aug 2025, A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code, https://arxiv.org/abs/2508.18106
Matous Kozak, Roshanak Zilouchian Moghaddam, Siva Sivaraman, 23 Aug 2025, When Developer Aid Becomes Security Debt: A Systematic Analysis of Insecure Behaviors in LLM Coding Agents, https://arxiv.org/abs/2507.09329
Ada Chen, Yongjiang Wu, Junyuan Zhang, Jingyu Xiao, Shu Yang, Jen-tse Huang, Kun Wang, Wenxuan Wang, Shuai Wang, 25 Aug 2025, A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron?, https://arxiv.org/abs/2505.10924
Niveen O. Jaffal, Mohammed Alkhanafseh, David Mohaisen, 18 Jul 2025, Large Language Models in Cybersecurity: Applications, Vulnerabilities, and Defense Techniques, https://arxiv.org/abs/2507.13629
Julia Laubmann, Johannes Reschke, 18 Jul 2025, Tackling fake images in cybersecurity -- Interpretation of a StyleGAN and lifting its black-box, https://arxiv.org/abs/2507.13722
Felix H\"arer, 19 Jul 2025, Specification and Evaluation of Multi-Agent LLM Systems -- Prototype and Cybersecurity Applications, https://arxiv.org/abs/2506.10467
Terry Yue Zhuo, Dingmin Wang, Hantian Ding, Varun Kumar, Zijian Wang, 29 Jul 2025, Cyber-Zero: Training Cybersecurity Agents without Runtime, https://arxiv.org/abs/2508.00910
Mehdi Akbari Gurabi, Lasse Nitz, Radu-Mihai Castravet, Roman Matzutt, Avikarsha Mandal, Stefan Decker, 5 Aug 2025, From Legacy to Standard: LLM-Assisted Transformation of Cybersecurity Playbooks into CACAO Format, https://arxiv.org/abs/2508.03342
Md Zesun Ahmed Mia, Malyaban Bal, Sen Lu, George M. Nishibuchi, Suhas Chelian, Srini Vasan, Abhronil Sengupta, 6 Aug 2025, Neuromorphic Cybersecurity with Semi-supervised Lifelong Learning, https://arxiv.org/abs/2508.04610
Daniele Proverbio, Alessio Buscemi, Alessandro Di Stefano, The Anh Han, German Castignani and Pietro Li\`o, 4 Aug 2025, Can LLMs effectively provide game-theoretic-based scenarios for cybersecurity?, https://arxiv.org/abs/2508.05670
Victor Lopez Juarez, 9 Aug 2025, EU Digital Regulation and Guatemala: AI, 5G, and Cybersecurity, https://arxiv.org/abs/2508.08315
Yuksel Aydin, 9 Aug 2025, Cognitive Cybersecurity for Artificial Intelligence: Guardrail Engineering with CCS-7, https://arxiv.org/abs/2508.10033
Aydin Zaboli and Junho Hong, 12 Aug 2025, Generative AI for Cybersecurity of Energy Management Systems: Methods, Challenges, and Future Directions, https://arxiv.org/abs/2508.10044
Nsengiyumva Wilberforce, 2 Sep 2025, A software security review on Uganda's Mobile Money Services: Dr. Jim Spire's tweets sentiment analysis, https://arxiv.org/abs/2509.03545
Ofir Cohen, Gil Ari Agmon, Asaf Shabtai, Rami Puzis, 5 Sep 2025, The Information Security Awareness of Large Language Models, https://arxiv.org/abs/2411.13207
Anders M{\o}lmen H{\o}st and Pierre Lison and Leon Moonen, 25 Aug 2025, A Systematic Approach to Predict the Impact of Cybersecurity Vulnerabilities Using LLMs, https://arxiv.org/abs/2508.18439
Martin Lochner and Keegan Keplinger, 25 Aug 2025, Collaborative Intelligence: Topic Modelling of Large Language Model use in Live Cybersecurity Operations, https://arxiv.org/abs/2508.18488
Afan Ali and Irfanullah Khan, 26 Aug 2025, SkyTrust: Blockchain-Enhanced UAV Security for NTNs with Dynamic Trust and Energy-Aware Consensus, https://arxiv.org/abs/2508.18735
Xavier Cadet, Simona Boboila, Sie Hendrata Dharmawan, Alina Oprea, Peter Chin, 27 Aug 2025, PoolFlip: A Multi-Agent Reinforcement Learning Security Environment for Cyber Defense, https://arxiv.org/abs/2508.19488
Sai Teja Reddy Adapala, Yashwanth Reddy Alugubelly, 22 Aug 2025, The Aegis Protocol: A Foundational Security Framework for Autonomous AI Agents, https://arxiv.org/abs/2508.19267
Michael R Smith, Joe Ingram, 27 Aug 2025, Surveying the Operational Cybersecurity and Supply Chain Threat Landscape when Developing and Deploying AI Systems, https://arxiv.org/abs/2508.20307
Dan Lin, Shunfeng Lu, Ziyan Liu, Jiajing Wu, Junyuan Fang, Kaixin Lin, Bowen Song, Zibin Zheng, 28 Aug 2025, BridgeShield: Enhancing Security for Cross-chain Bridge Applications via Heterogeneous Graph Mining, https://arxiv.org/abs/2508.20517
Guofu Liao, Taotao Wang, Shengli Zhang, Jiqun Zhang, Shi Long, and Dacheng Tao, 29 Aug 2025, zkLoRA: Fine-Tuning Large Language Models with Verifiable Security via Zero-Knowledge Proofs, https://arxiv.org/abs/2508.21393
Georgios Syros, Anshuman Suri, Jacob Ginesin, Cristina Nita-Rotaru, Alina Oprea, 29 Aug 2025, SAGA: A Security Architecture for Governing AI Agentic Systems, https://arxiv.org/abs/2504.21034
Wenxiao Zhang, Xiangrui Kong, Conan Dewitt, Thomas Br\"aunl, Jin B. Hong, 2 Sep 2025, Enhancing Reliability in LLM-Integrated Robotic Systems: A Unified Approach to Security and Safety, https://arxiv.org/abs/2509.02163
Honghui Xu, Kaiyang Li, Wei Chen, Danyang Zheng, Zhiyuan Li, Zhipeng Cai, 2 Sep 2025, A Survey: Towards Privacy and Security in Mobile Large Language Models, https://arxiv.org/abs/2509.02411
Chengshuai Zhao, Riccardo De Maria, Tharindu Kumarage, Kumar Satvik Chaudhary, Garima Agrawal, Yiwen Li, Jongchan Park, Yuli Deng, Ying-Chih Chen, Huan Liu, 3 Sep 2025, CyberBOT: Towards Reliable Cybersecurity Education via Ontology-Grounded Retrieval Augmented Generation, https://arxiv.org/abs/2504.00389
Ayoub Si-ahmed, Mohammed Ali Al-Garadi, Narhimene Boustia, 2 Sep 2025, Explainable Machine Learning-Based Security and Privacy Protection Framework for Internet of Medical Things Systems, https://arxiv.org/abs/2403.09752
Qingyuan Li, Binchang Li, Cuiyun Gao, Shuzheng Gao, and Zongjie Li, 7 Sep 2025, Empirical Study of Code Large Language Models for Binary Security Patch Detection, https://arxiv.org/abs/2509.06052
Safayat Bin Hakim, Muhammad Adil, Alvaro Velasquez, Shouhuai Xu, Houbing Herbert Song, 8 Sep 2025, Neuro-Symbolic AI for Cybersecurity: State of the Art, Challenges, and Opportunities, https://arxiv.org/abs/2509.06921
Guangyu Lei, Tianhao Liang, Yuqi Ping, Xinglin Chen, Longyu Zhou, Junwei Wu, Xiyuan Zhang, Huahao Ding, Xingjian Zhang, Weijie Yuan, Tingting Zhang, Qinyu Zhang, 8 Sep 2025, Enhancing Low-Altitude Airspace Security: MLLM-Enabled UAV Intent Recognition, https://arxiv.org/abs/2509.06312
Gabriele Digregorio and Marco Di Gennaro and Stefano Zanero and Stefano Longari and Michele Carminati, 8 Sep 2025, When Secure Isn't: Assessing the Security of Machine Learning Model Sharing, https://arxiv.org/abs/2509.06703
Nicol\`o Romandini, Carlo Mazzocca, Kai Otsuki, Rebecca Montanari, 8 Sep 2025, SoK: Security and Privacy of AI Agents for Blockchain, https://arxiv.org/abs/2509.07131
Lei Yu, Jingyuan Zhang, Xin Wang, Jiajia Ma, Li Yang, Fengjun Zhang, 12 Sep 2025, SmartCoder-R1: Towards Secure and Explainable Smart Contract Generation with Security-Aware Group Relative Policy Optimization, https://arxiv.org/abs/2509.09942
Evan Li, Tushin Mallick, Evan Rose, William Robertson, Alina Oprea, Cristina Nita-Rotaru, 10 Sep 2025, ACE: A Security Architecture for LLM-Integrated App Systems, https://arxiv.org/abs/2504.20984
Yuzhou Nie, Zhun Wang, Yu Yang, Ruizhe Jiang, Yuheng Tang, Xander Davies, Yarin Gal, Bo Li, Wenbo Guo, Dawn Song, 18 Sep 2025, SeCodePLT: A Unified Platform for Evaluating the Security of Code GenAI, https://arxiv.org/abs/2410.11096
Yuchong Xie, Mingyu Luo, Zesen Liu, Zhixiang Zhang, Kaikai Zhang, Yu Liu, Zongjie Li, Ping Chen, Shuai Wang, Dongdong She, 19 Sep 2025, On the Security of Tool-Invocation Prompts for LLM-Based Agentic Systems: An Empirical Risk Assessment, https://arxiv.org/abs/2509.05755
Sergio Benlloch-Lopez, Miquel Viel-Vazquez, Javier Naranjo-Alcazar, Jordi Grau-Haro and Pedro Zuccarello, 19 Sep 2025, Threat Modeling for Enhancing Security of IoT Audio Classification Devices under a Secure Protocols Framework, https://arxiv.org/abs/2509.14657
Kiho Lee, Jungkon Kim, Doowon Kim, Hyoungshick Kim, 16 Sep 2025, A Systematic Evaluation of Parameter-Efficient Fine-Tuning Methods for the Security of Code LLMs, https://arxiv.org/abs/2509.12649
Magnus Wiik Eckhoff, Peter Marius Flydal, Siem Peters, Martin Eian, Jonas Halvorsen, Vasileios Mavroeidis, Gudmund Grov, 16 Sep 2025, A Graph-Based Approach to Alert Contextualisation in Security Operations Centres, https://arxiv.org/abs/2509.12923
Shaina Raza, Ranjan Sapkota, Manoj Karkee, Christos Emmanouilidis, 15 Sep 2025, TRiSM for Agentic AI: A Review of Trust, Risk, and Security Management in LLM-based Agentic Multi-Agent Systems, https://arxiv.org/abs/2506.04133
Umberto Gon\c{c}alves de Sousa, 2 Sep 2025, LogGuardQ: A Cognitive-Enhanced Reinforcement Learning Framework for Cybersecurity Anomaly Detection in Security Logs, https://arxiv.org/abs/2509.10511
Ali Habibzadeh, Farid Feyzi, and Reza Ebrahimi Atani, 13 Sep 2025, Large Language Models for Security Operations Centers: A Comprehensive Survey, https://arxiv.org/abs/2509.10858
Ambra Demontis, Srishti Gupta, Maura Pintor, Luca Demetrio, Kathrin Grosse, Hsiao-Ying Lin, Chengfang Fang, Battista Biggio, Fabio Roli, 15 Sep 2025, Security of Deep Reinforcement Learning for Autonomous Driving: A Survey, https://arxiv.org/abs/2212.06123
Amena Amro and Manar H. Alalfi, 17 Sep 2025, GitHub's Copilot Code Review: Can AI Spot Security Flaws Before You Commit?, https://arxiv.org/abs/2509.13650
Adel ElZemity, Budi Arief and Shujun Li, 17 Sep 2025, CyberLLMInstruct: A Pseudo-malicious Dataset Revealing Safety-performance Trade-offs in Cyber Security LLM Fine-tuning, https://arxiv.org/abs/2503.09334
Adel ElZemity, Budi Arief and Shujun Li, 17 Sep 2025, Analysing Safety Risks in LLMs Fine-Tuned with Pseudo-Malicious Cyber Security Data, https://arxiv.org/abs/2505.09974
Samuele Pasini, Jinhan Kim, Tommaso Aiello, Rocio Cabrera Lozoya, Antonino Sabetta, Paolo Tonella, 17 Sep 2025, Evaluating and Improving the Robustness of Security Attack Detectors Generated by LLMs, https://arxiv.org/abs/2411.18216

Safety Monitor

A safety monitor is a component that can be added to the LLM deployment.

OpenAI, Moderation: Learn how to build moderation into your AI applications, 2024, https://platform.openai.com/docs/guides/moderation
Azure, 06/13/2024, Content filtering, https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter?tabs=warning%2Cpython
Yu Wang, Xiaogeng Liu, Yu Li, Muhao Chen, Chaowei Xiao, 14 Mar 2024, AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting, https://arxiv.org/abs/2403.09513 Code: https://github.com/rain305f/AdaShield
Jinhwa Kim, Ali Derakhshan, Ian G. Harris, 31 Oct 2023, Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield, https://arxiv.org/abs/2311.00172
Francisco Munguia-Galeano, Zhengxue Zhou, Satheeshkumar Veeramani, Hatem Fakhruldeen, Louis Longley, Rob Clowes and Andrew I. Cooper, 7 Aug 2025, Chemist Eye: A Visual Language Model-Powered System for Safety Monitoring and Robot Decision-Making in Self-Driving Laboratories, https://arxiv.org/abs/2508.05148

General Thoughts on AI Safety

High-level debate and discussions of AI safety issues:

Stephen Hawking, Max Tegmark, Stuart Russell, and Frank Wilczek. April 2014. Transcending complacency on superintelligent machines. http://www.huffingtonpost.com/stephen-hawking/artificial-intelligence_b_5174265.html
S. Alexander. OpenAI’s “Planning for AGI and beyond”. March 2023, https://astralcodexten.substack.com/p/openais-planning-for-agi-and-beyond
N. Bostrom. The vulnerable world hypothesis. Global Policy, 10(4):455–476, 2019. https://doi.org/10.1111/1758-5899.12718
Smitha Milli, Dylan Hadfield-Menell, Anca Dragan, and Stuart Russell. Should robots be obedient? In International Joint Conference on Artificial Intelligence, 2017. https://arxiv.org/abs/1705.09990
Nick Bostrom. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, March 2016, https://www.amazon.com.au/Superintelligence-Professor-Philosophy-Institute-University/dp/0198739834/
Nick Bostrom. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, July 2014 (prior edition), https://www.amazon.com.au/Superintelligence-Dangers-Strategies-Nick-Bostrom-ebook/dp/B00LOOCGB2/
OpenAI, May 2023, Governance of superintelligence, https://openai.com/blog/governance-of-superintelligence
Winfield AFT, Jirotka M. Ethical governance is essential to building trust in robotics and artificial intelligence systems. Philos Trans A Math Phys Eng Sci. 2018 Oct 15;376(2133):20180085. doi: 10.1098/rsta.2018.0085. PMID: 30323000 https://pubmed.ncbi.nlm.nih.gov/30323000/
OpenAI, Feb 2023, How should AI systems behave, and who should decide? https://openai.com/blog/how-should-ai-systems-behave
Stuart Russell. Should we fear supersmart robots? Scientific American, 314(6):58–59, 2016. https://www.scientificamerican.com/article/should-we-fear-supersmart-robots/, https://pubmed.ncbi.nlm.nih.gov/27196844/
A Ramalho, 2017, Will robots rule the (artistic) world? A proposed model for the legal status of creations by artificial intelligence systems, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2987757
Bernd Carsten Stahl, 2023, Embedding responsibility in intelligent systems: from AI ethics to responsible AI ecosystems, Scientific Reports Open Access 18 May 2023, https://doi.org/10.1038/s41598-023-34622-w
McCarthy, John, and Patrick J. Hayes. 1969. Some Philosophical Problems From the Standpoint of Artificial Intelligence, In: Machine Intelligence 4, B. Meltzer and D. Michie (eds.), Edinburgh University Press, 1969, pp. 463-502, Stanford University. http://jmc.stanford.edu/articles/mcchay69.html
Russell, Stuart J. 2019. Human Compatible: Artificial Intelligence and the Problem of Control (Viking-Penguin Random House: London). https://link.springer.com/chapter/10.1007/978-3-030-86144-5_3
Winfield A.F.T., Jirotka M., 2018, Ethical governance is essential to building trust in robotics and artificial intelligence systems. Philos. Trans. R. Soc. A. Math. Phys. Eng. Sci. 2018;376:13. http://www.ncbi.nlm.nih.gov/pmc/articles/pmc6191667/, https://pubmed.ncbi.nlm.nih.gov/30323000/
Thomas Claburn 12 Oct 2023, AI safety guardrails easily thwarted, security study finds, The Register, https://www.theregister.com/2023/10/12/chatbot_defenses_dissolve/
Alibaba Qwen Team, Sep 2023, Qwen Technical Report, https://arxiv.org/pdf/2309.16609.pdf

Government Policy and Regulation

Various governments have examined issues around regulation, and there has also been much debate:

A. Solender and A. Gold. April 2023, Scoop: Schumer lays groundwork for Congress to regulate AI. https://www.axios.com/2023/04/13/congress-regulate-ai-tech
UK Government. National AI strategy. Sep 2021. https://www.gov.uk/government/publications/national-ai-strategy
AI Now Institute, A. Kak, and S. M. West. April 2023, General purpose AI poses serious risks, should not be excluded from the EU’s AI Act. https://ainowinstitute.org/publication/gpai-is-high-risk-should-not-be-excluded-from-eu-ai-act
L. Bertuzzi. March 2023, Leading EU lawmakers propose obligations for general purpose ai. https://www.euractiv.com/section/artificial-intelligence/news/leading-eu-lawmakers-propose-obligations-for-general-purpose-ai
UK Department for Science and Technology. Aug 2023, Policy paper: A pro-innovation approach to AI regulation. https://www.gov.uk/government/publications/ai-regulation-a-pro-innovation-approach/white-paper
White House. May 2023. Fact sheet: Biden-Harris Administration announces new actions to promote responsible AI innovation that protects Americans’ rights and safety. https://www.whitehouse.gov/briefing-room/statements-releases/2023/05/04/fact-sheet-biden-harris-administration-annou nces-new-actions-to-promote-responsible-ai-innovation-that-protects-americans-rights-and-safety
B. Zhang, M. Anderljung, L. Kahn, N. Dreksler, M. C. Horowitz, and A. Dafoe. 2021, Ethics and governance of artificial intelligence: Evidence from a survey of machine learning researchers. arXiv preprint arXiv:2105.02117, https://arxiv.org/abs/2105.02117
ISO/IEC. 2023, ISO/IEC 23894:2023 Information technology — Artificial intelligence — Guidance on risk management. https://www.iso.org/standard/77304.html
NIST, AI Risk Management Framework Concept Paper, 13 December 2021, PDF: https://www.nist.gov/system/files/documents/2021/12/14/AI%20RMF%20Concept%20Paper_13Dec2021_posted.pdf
NIST. 2023, Artificial Intelligence Risk Management Framework (AI RMF 1.0). https://doi.org/10.6028/NIST.AI.100-1, https://www.nist.gov/itl/ai-risk-management-framework
Tathagat Katiyar & Harshitha Chondamma II, Accorian, Feb 2023, UNDERSTANDING AI RMF 1.0 – The Artificial Intelligence Risk Management Framework https://accorian.com/understanding-ai-rmf-1-0-the-artificial-intelligence-risk-management-framework/
E. Yudkowsky, 2023. Pausing AI developments isn’t enough. We need to shut it all down. https://time.com/6266923/ai-eliezer-yudkowsky-open-letter-not-enough
Stephanie Palazzolo, Erin Woo, Aug 2024, Passage of California AI Bill Sends Shivers Across Tech Industry, https://www.theinformation.com/articles/passage-of-california-ai-bill-sends-shivers-across-tech-industry

Auditing and Enforcement

Papers on auditing or enforcement of AI policy:

J. Mökander and L. Floridi. 2022, Operationalising AI governance through ethics-based auditing: An industry case study. AI and Ethics, pages 1–18, https://link.springer.com/article/10.1007/s43681-022-00171-7
J. Mökander, J. Schuett, H. R. Kirk, and L. Floridi. June 2023. Auditing large language models: A three-layered approach. arXiv preprint arXiv:2302.08500. https://arxiv.org/abs/2302.08500
J. Mökander, J. Morley, M. Taddeo, and L. Floridi. Ethics-based auditing of automated decision-making systems: Nature, scope, and limitations. Science and Engineering Ethics, 27(44), 2021. https://arxiv.org/abs/2110.10980

Bias and Fairness

AI engines have shown bias in various ways. The goal is to have them show "fairness" in their results:

Dastin Jeffrey. Oct 2018, Amazon scraps secret AI recruiting tool that showed bias against women. Reuters. https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G
Courtland R., 2018, Bias detectives: the researchers striving to make algorithms fair. Nature. 2018 Jun;558(7710):357-360. doi: 10.1038/d41586-018-05469-3. PMID: 29925973 https://pubmed.ncbi.nlm.nih.gov/29925973/
Caliskan Aylin, Bryson Joanna J., Narayanan Arvind. 2017. Semantics derived automatically from language corpora contain human-like biases. Science. 2017;356:183–186. https://pubmed.ncbi.nlm.nih.gov/28408601/
A Levendowski, 2018, How copyright law can fix artificial intelligence's implicit bias problem, Wash. L. Rev., https://digitalcommons.law.uw.edu/cgi/viewcontent.cgi?article=5042&context=wlr
Hao Karen. 2020. AI researchers say scientific publishers help perpetuate racist algorithms. MIT Technology Review. https://www.technologyreview.com/2020/06/23/1004333/ai-science-publishers-perpetuate-racist-face-recognition/
K Ramesh, A Chavan, S Pandit, 2023, A Comparative Study on the Impact of Model Compression Techniques on Fairness in Language Models, Microsoft Research, https://aclanthology.org/2023.acl-long.878.pdf, https://www.microsoft.com/en-us/research/uploads/prod/2023/07/3687_Paper.pdf
Jwala Dhamala, Varun Kumar, Rahul Gupta, Kai-Wei Chang, Aram Galstyan, Oct 2022, An Analysis of the Effects of Decoding Algorithms on Fairness in Open-Ended Language Generation, https://arxiv.org/abs/2210.03826 (Examines top-p, top-k, and temperature in decoding algorithms from a safety perspective.)
Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang, June 2023, Exposing Attention Glitches with Flip-Flop Language Modeling, https://arxiv.org/abs/2306.00946
Valentin Hofmann, Pratyusha Ria Kalluri, Dan Jurafsky, Sharese King, 1 Mar 2024, Dialect prejudice predicts AI decisions about people's character, employability, and criminality, https://arxiv.org/abs/2403.00742 https://arxiv.org/pdf/2403.00742.pdf
Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
Cem Dilmegani, Jan 10, 2024, The Future of Large Language Models in 2024, https://research.aimultiple.com/future-of-large-language-models/
Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4 , https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
Mayank Vatsa, Anubhooti Jain, Richa Singh, 7 Dec 2023, Adventures of Trustworthy Vision-Language Models: A Survey, https://arxiv.org/abs/2312.04231
Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri, 10 Dec 2024, Granite Guardian, https://arxiv.org/abs/2412.07724 https://github.com/ibm-granite/granite-guardian (Open-sourcing of safety models with many capabilities.)
FZ Subah, Oct 2025, Mitigating and Assessing Bias and Fairness in Large Language Model-Generated Synthetic Tabular Data, Masters Thesis, Department of Engineering, University of Cambridge, https://www.mlmi.eng.cam.ac.uk/files/2023-2024/fzs21_mitigating_2024.pdf
Abdullah Hashmat, Muhammad Arham Mirza, Agha Ali Raza, 13 Aug 2025, PakBBQ: A Culturally Adapted Bias Benchmark for QA, https://arxiv.org/abs/2508.10186
Gustavo Bonil, Simone Hashiguti, Jhessica Silva, Jo\~ao Gondim, Helena Maia, N\'adia Silva, Helio Pedrini, Sandra Avila, 14 Aug 2025, Yet another algorithmic bias: A Discursive Analysis of Large Language Models Reinforcing Dominant Discourses on Gender and Race, https://arxiv.org/abs/2508.10304
Alessio Buscemi, Daniele Proverbio, Alessandro Di Stefano, The Anh Han, German Castignani and Pietro Li\`o, 14 Aug 2025, FAIRGAME: a Framework for AI Agents Bias Recognition using Game Theory, https://arxiv.org/abs/2504.14325
Suhas G Hegde, Shilpy Kaur, Aruna Tiwari, 14 Aug 2025, VectorFit : Adaptive Singular & Bias Vector Fine-Tuning of Pre-trained Foundation Models, https://arxiv.org/abs/2503.19530
Yan Li, Guangyi Chen, Yunlong Deng, Zijian Li, Zeyu Tang, Anpeng Wu, Kun Zhang, 22 Jul 2025, Should Bias Always be Eliminated? A Principled Framework to Use Data Bias for OOD Generation, https://arxiv.org/abs/2507.17001
Shalaka Satheesh, Katrin Klug, Katharina Beckh, H\'ector Allende-Cid, Sebastian Houben, Teena Hassan, 22 Jul 2025, GG-BBQ: German Gender Bias Benchmark for Question Answering, https://arxiv.org/abs/2507.16410
Kristin Gnadt, David Thulke, Simone Kopeinik, Ralf Schl\"uter, 22 Jul 2025, Exploring Gender Bias in Large Language Models: An In-depth Dive into the German Language, https://arxiv.org/abs/2507.16557
Zhenyuan Chen, 21 Jul 2025, Rethinking Inductive Bias in Geographically Neural Network Weighted Regression, https://arxiv.org/abs/2507.09958
Sergio Morales, Robert Claris\'o, Jordi Cabot, 22 Jul 2025, LangBiTe: A Platform for Testing Bias in Large Language Models, https://arxiv.org/abs/2404.18558
Yanbiao Ma, Bowei Liu, Boyuan Gao, Wei Dai, Jiayi Chen, Shuo Li, Andi Zhang, 22 Jul 2025, Revealing Bias Formation in Deep Neural Networks Through the Geometric Mechanisms of Human Visual Decoupling, https://arxiv.org/abs/2502.11809
Brian Liu and Rahul Mazumder, 21 Jul 2025, Randomization Can Reduce Both Bias and Variance: A Case Study in Random Forests, https://arxiv.org/abs/2402.12668
Ali Vardasbi, Gustavo Penha, Claudia Hauff, and Hugues Bouchard, 23 Jul 2025, Adaptive Repetition for Mitigating Position Bias in LLM-Based Ranking, https://arxiv.org/abs/2507.17788
Steven A. Frank, 24 Jul 2025, The Price equation reveals a universal force-metric-bias law of algorithmic learning and natural selection, https://arxiv.org/abs/2507.18549
Bruno Scarone, Alfredo Viola, Ren\'ee J. Miller, Ricardo Baeza-Yates, 24 Jul 2025, A Principled Approach for Data Bias Mitigation, https://arxiv.org/abs/2405.12312
He-Yang Xu, Hongxiang Gao, Yuwen Li, Xiu-Shen Wei and Chengyu Liu, 24 Jul 2025, Masked Autoencoders that Feel the Heart: Unveiling Simplicity Bias for ECG Analyses, https://arxiv.org/abs/2506.22495
Yuen Chen, Vethavikashini Chithrra Raghuram, Justus Mattern, Rada Mihalcea, Zhijing Jin, 24 Jul 2025, Causally Testing Gender Bias in LLMs: A Case Study on Occupational Bias, https://arxiv.org/abs/2212.10678
Yongyi Yang, Hidenori Tanaka, Wei Hu, 17 Jul 2025, Provable Low-Frequency Bias of In-Context Learning of Representations, https://arxiv.org/abs/2507.13540
Yile Yan, Yuqi Zhu, Wentao Xu, 18 Jul 2025, Bias in Decision-Making for AI's Ethical Dilemmas: A Comparative Study of ChatGPT and Claude, https://arxiv.org/abs/2501.10484
Andr\'es Morales-Forero (1), Lili J. Rueda (2), Ronald Herrera (3), Samuel Bassetto (1), Eric Coatanea (4) ((1) Polytechnique Montr\'eal, (2) Universidad El Bosque, (3) Boehringer Ingelheim International GmbH, (4) Tampere University), 10 Jul 2025, Predictive Representativity: Uncovering Racial Bias in AI-based Skin Cancer Detection, https://arxiv.org/abs/2507.14176
Xiaotong Luo, Shengda Zhuo, Min Chen, Lichun Li, Ruizhao Lu, Wenqi Fan, Shuqiang Huang and Yin Tang, 12 Jul 2025, From Bias to Behavior: Learning Bull-Bear Market Dynamics with Contrastive Modeling, https://arxiv.org/abs/2507.14182
Eoghan Cunningham, James Cross, Derek Greene, 16 Jul 2025, Identifying Algorithmic and Domain-Specific Bias in Parliamentary Debate Summarisation, https://arxiv.org/abs/2507.14221
Garud Iyengar, Henry Lam, Tianyu Wang, 21 Jul 2025, Optimizer's Information Criterion: Dissecting and Correcting Bias in Data-Driven Optimization, https://arxiv.org/abs/2306.10081
Evangelia Spiliopoulou, Riccardo Fogliato, Hanna Burnsky, Tamer Soliman, Jie Ma, Graham Horwood, Miguel Ballesteros, 8 Aug 2025, Play Favorites: A Statistical Method to Measure Self-Bias in LLM-as-a-Judge, https://arxiv.org/abs/2508.06709
Falaah Arif Khan, Nivedha Sivakumar, Yinong Oliver Wang, Katherine Metcalf, Cezanne Camacho, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff, 9 Aug 2025, Investigating Intersectional Bias in Large Language Models using Confidence Disparities in Coreference Resolution, https://arxiv.org/abs/2508.07111
Vivek Hruday Kavuri, Vysishtya Karanam, Venkata Jahnavi Venkamsetty, Kriti Madumadukala, Lakshmipathi Balaji Darur, Ponnurangam Kumaraguru, 10 Aug 2025, Freeze and Reveal: Exposing Modality Bias in Vision-Language Models, https://arxiv.org/abs/2508.07432
Vojt\v{e}ch Stan\v{e}k, Karel Srna, Anton Firc, Kamil Malinka, 11 Aug 2025, SCDF: A Speaker Characteristics DeepFake Speech Dataset for Bias Analysis, https://arxiv.org/abs/2508.07944
Xinyi Wu, Yifei Wang, Stefanie Jegelka, Ali Jadbabaie, 9 Aug 2025, On the Emergence of Position Bias in Transformers, https://arxiv.org/abs/2502.01951
Walter Laurito, Benjamin Davis, Peli Grietzer, Tom\'a\v{s} Gaven\v{c}iak, Ada B\"ohm, Jan Kulveit, 11 Aug 2025, AI-AI Bias: large language models favor communications generated by large language models, https://arxiv.org/abs/2407.12856
Dasol Choi, Jihwan Lee, Minjae Lee, Minsuk Kahng, 10 Aug 2025, When Cars Have Stereotypes: Auditing Demographic Bias in Objects from Text-to-Image Models, https://arxiv.org/abs/2508.03483
Anuprabha M, Krishna Gurugubelli and Anil Kumar Vuppala, 11 Aug 2025, Fairness in Dysarthric Speech Synthesis: Understanding Intrinsic Bias in Dysarthric Speech Cloning using F5-TTS, https://arxiv.org/abs/2508.05102
Chao Wu, Zhenyi Wang, Kangxian Xie, Naresh Kumar Devulapally, Vishnu Suresh Lokhande, Mingchen Gao, 28 Jul 2025, Model-Agnostic Gender Bias Control for Text-to-Image Generation via Sparse Autoencoder, https://arxiv.org/abs/2507.20973
Gabriel Recchia, Chatrik Singh Mangat, Jinu Nyachhyon, Mridul Sharma, Callum Canavan, Dylan Epstein-Gross, Muhammed Abdulbari, 17 May 2025, Confirmation bias: A challenge for scalable oversight, https://arxiv.org/abs/2507.19486
Pavel Korshunov, Ketan Kotwal, Christophe Ecabert, Vidit Vidit, Amir Mohammadi, and Sebastien Marcel, 28 Jul 2025, Investigation of Accuracy and Bias in Face Recognition Trained with Synthetic Data, https://arxiv.org/abs/2507.20782
Hoyoung Lee, Junhyuk Seo, Suhwan Park, Junhyeong Lee, Wonbin Ahn, Chanyeol Choi, Alejandro Lopez-Lira, Yongjae Lee, 28 Jul 2025, Your AI, Not Your View: The Bias of LLMs in Investment Analysis, https://arxiv.org/abs/2507.20957
Yooshin Cho, Hanbyel Cho, Janghyeon Lee, HyeongGwon Hong, Jaesung Ahn, Junmo Kim, 27 Jul 2025, Controllable Feature Whitening for Hyperparameter-Free Bias Mitigation, https://arxiv.org/abs/2507.20284
Seoyoung Doh, Hyeon Jeon, Sungbok Shin, Ghulam Jilani Quadri, Nam Wook Kim, Jinwook Seo, 28 Jul 2025, Understanding Bias in Perceiving Dimensionality Reduction Projections, https://arxiv.org/abs/2507.20805
Hitomi Yanaka, Xinqi He, Jie Lu, Namgi Han, Sunjin Oh, Ryoma Kumon, Yuma Matsuoka, Katsuhiko Watabe, Yuko Itatsu, 27 Jul 2025, Intersectional Bias in Japanese Large Language Models from a Contextualized Perspective, https://arxiv.org/abs/2506.12327
Franck Bardol, 17 Jun 2025, ChatGPT Reads Your Tone and Responds Accordingly -- Until It Does Not -- Emotional Framing Induces Bias in LLM Outputs, https://arxiv.org/abs/2507.21083
Zhenyu Pan, Yutong Zhang, Jianshu Zhang, Haoran Lu, Haozheng Luo, Yuwei Han, Philip S. Yu, Manling Li, Han Liu, 30 Jul 2025, FairReason: Balancing Reasoning and Social Bias in MLLMs, https://arxiv.org/abs/2507.23067
Patricia A. Apell\'aniz and Ana Jim\'enez and Borja Arroyo Galende and Juan Parras and Santiago Zazo, 31 Jul 2025, Artificial Inductive Bias for Synthetic Tabular Data Generation in Data-Scarce Scenarios, https://arxiv.org/abs/2407.03080
Utku Ozbulak, Seyed Amir Mousavi, Francesca Tozzi, Niki Rashidian, Wouter Willaert, Wesley De Neve, Joris Vankerschaver, 31 Jul 2025, Revisiting the Evaluation Bias Introduced by Frame Sampling Strategies in Surgical Video Segmentation Using SAM2, https://arxiv.org/abs/2502.20934
Afrozah Nadeem, Mark Dras, and Usman Naseem, 31 Jul 2025, Framing Political Bias in Multilingual LLMs Across Pakistani Languages, https://arxiv.org/abs/2506.00068
Bushra Asseri, Estabrag Abdelaziz, Areej Al-Wabil, 30 Jul 2025, Prompt Engineering Techniques for Mitigating Cultural Bias Against Arabs and Muslims in Large Language Models: A Systematic Review, https://arxiv.org/abs/2506.18199
Simon M\"unker, 31 Jul 2025, Cultural Bias in Large Language Models: Evaluating AI Agents through Moral Questionnaires, https://arxiv.org/abs/2507.10073
Kwesi Cobbina and Tianyi Zhou, 30 Jul 2025, Where to show Demos in Your Prompt: A Positional Bias of In-Context Learning, https://arxiv.org/abs/2507.22887
Adam Block and Cyril Zhang, 31 Jul 2025, EMA Without the Lag: Bias-Corrected Iterate Averaging Schemes, https://arxiv.org/abs/2508.00180
Kangda Wei, Hasnat Md Abdullah, Ruihong Huang, 1 Aug 2025, Mitigating Gender Bias via Fostering Exploratory Thinking in LLMs, https://arxiv.org/abs/2505.17217
Zhenpeng Chen, Xinyue Li, Jie M. Zhang, Weisong Sun, Ying Xiao, Tianlin Li, Yiling Lou, Yang Liu, 5 Aug 2025, Software Fairness Dilemma: Is Bias Mitigation a Zero-Sum Game?, https://arxiv.org/abs/2508.03323
Jiangen He, 2 Aug 2025, Who Gets Cited? Gender- and Majority-Bias in LLM-Driven Reference Selection, https://arxiv.org/abs/2508.02740
Shahed Masoudian, Gustavo Escobedo, Hannah Strauss, Markus Schedl, 5 Aug 2025, Investigating Gender Bias in LLM-Generated Stories via Psychological Stereotypes, https://arxiv.org/abs/2508.03292
Joseph Lee, Tianqi Shang, Jae Young Baik, Duy Duong-Tran, Shu Yang, Lingyao Li, Li Shen, 4 Aug 2025, From Promising Capability to Pervasive Bias: Assessing Large Language Models for Emergency Department Triage, https://arxiv.org/abs/2504.16273
Zhen Zou, Feng Zhao, 5 Aug 2025, FEB-Cache: Frequency-Guided Exposure Bias Reduction for Enhancing Diffusion Transformer Caching, https://arxiv.org/abs/2503.07120
Hamed Ayoobi, Nico Potyka, Anna Rapberger, Francesca Toni, 6 Aug 2025, Argumentative Debates for Transparent Bias Detection [Technical Report], https://arxiv.org/abs/2508.04511
Tiffany Zhu, Iain Weissburg, Kexun Zhang, William Yang Wang, 6 Aug 2025, Human Bias in the Face of AI: Examining Human Judgment Against Text Labeled as AI Generated, https://arxiv.org/abs/2410.03723
Tosin Fadahunsi, Giordano d'Aloisio, Antinisca Di Marco, Federica Sarro, 5 Aug 2025, How Do Generative Models Draw a Software Engineer? A Case Study on Stable Diffusion Bias, https://arxiv.org/abs/2501.09014
Kelsey Doerksen, Yuliya Marchetti, Kevin Bowman, Steven Lu, James Montgomery, Yarin Gal, Freddie Kalaitzis, Kazuyuki Miyazaki, 6 Aug 2025, Leveraging Deep Learning for Physical Model Bias of Global Air Quality Estimates, https://arxiv.org/abs/2508.04886
Menghua Jiang, Yuxia Lin, Baoliang Chen, Haifeng Hu, Yuncheng Jiang, Sijie Mai, 7 Aug 2025, Disentangling Bias by Modeling Intra- and Inter-modal Causal Attention for Multimodal Sentiment Analysis, https://arxiv.org/abs/2508.04999
Jiahao Chen, Bin Qin, Jiangmeng Li, Hao Chen, Bing Su, 8 Aug 2025, Rethinking the Bias of Foundation Model under Long-tailed Distribution, https://arxiv.org/abs/2501.15955
Shivam Dubey, 12 Aug 2025, Activation Steering for Bias Mitigation: An Interpretable Approach to Safer LLMs, https://arxiv.org/abs/2508.09019
Afrozah Nadeem, Mark Dras, Usman Naseem, 12 Aug 2025, Steering Towards Fairness: Mitigating Political Bias in LLMs, https://arxiv.org/abs/2508.08846
Krzysztof Maziarz, Guoqing Liu, Hubert Misztela, Austin Tripp, Junren Li, Aleksei Kornev, Piotr Gai\'nski, Holger Hoefling, Mike Fortunato, Rishi Gupta, Marwin Segler, 12 Aug 2025, Chemist-aligned retrosynthesis by ensembling diverse inductive bias models, https://arxiv.org/abs/2412.05269
Zara Siddique, Irtaza Khalid, Liam D. Turner, Luis Espinosa-Anke, 13 Aug 2025, Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLMs, https://arxiv.org/abs/2503.05371
Jingwei Li, Jing Xu, Zifan Wang, Huishuai Zhang, Jingzhao Zhang, 13 Aug 2025, Understanding Nonlinear Implicit Bias via Region Counts in Input Space, https://arxiv.org/abs/2505.11370
Parker Whitfill, 14 Aug 2025, Note on Selection Bias in Observational Estimates of Algorithmic Progress, https://arxiv.org/abs/2508.11033
Aiswarya Konavoor, Raj Abhijit Dandekar, Rajat Dandekar, Sreedath Panat, 15 Aug 2025, Vision-Language Models display a strong gender bias, https://arxiv.org/abs/2508.11262
Binxu Wang, Cengiz Pehlevan, 14 Aug 2025, An Analytical Theory of Spectral Bias in the Learning Dynamics of Diffusion Models, https://arxiv.org/abs/2503.03206
Keyon Vafa, Peter G. Chang, Ashesh Rambachan, Sendhil Mullainathan, 14 Aug 2025, What Has a Foundation Model Found? Using Inductive Bias to Probe for World Models, https://arxiv.org/abs/2507.06952
Pengcheng Huang, Shuhao Liu, Zhenghao Liu, Yukun Yan, Shuo Wang, Zulong Chen, Tong Xiao, 18 Aug 2025, PC-Sampler: Position-Aware Calibration of Decoding Bias in Masked Diffusion Models, https://arxiv.org/abs/2508.13021
Yuanzhe Hu, Kinshuk Goel, Vlad Killiakov, Yaoqing Yang, 18 Aug 2025, Eigenspectrum Analysis of Neural Networks without Aspect Ratio Bias, https://arxiv.org/abs/2506.06280
Evan Chen, Run-Jun Zhan, Yan-Bai Lin, Hung-Hsuan Chen, 15 Aug 2025, More Women, Same Stereotypes: Unpacking the Gender Bias Paradox in Large Language Models, https://arxiv.org/abs/2503.15904
Koffi Ismael Ouattara, Ioannis Krontiris, Theo Dimitrakos and Frank Kargl, 19 Aug 2025, Assessing Trustworthiness of AI Training Dataset using Subjective Logic -- A Use Case on Bias, https://arxiv.org/abs/2508.13813
Jonathan A. Karr Jr., Benjamin F. Herbst, Ting Hua, Matthew Hauenstein, Georgina Curto, Nitesh V. Chawla, 14 Aug 2025, Combating Homelessness Stigma with LLMs: A New Multi-Modal Dataset for Bias Detection, https://arxiv.org/abs/2508.13187
Hao Zhang and Chen Li and Basura Fernando, 19 Aug 2025, Mitigating Easy Option Bias in Multiple-Choice Question Answering, https://arxiv.org/abs/2508.13428
Dariia Puhach and Amir H. Payberah and \'Eva Sz\'ekely, 19 Aug 2025, Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM, https://arxiv.org/abs/2508.13603
Vinod Kumar Chauhan, Lei Clifton, Achille Sala\"un, Huiqi Yvonne Lu, Kim Branson, Patrick Schwab, Gaurav Nigam, David A. Clifton, 20 Aug 2025, Sample Selection Bias in Machine Learning for Healthcare, https://arxiv.org/abs/2405.07841
Ilja Kuzborskij, Yasin Abbasi Yadkori, 20 Aug 2025, Low-rank bias, weight decay, and model merging in neural networks, https://arxiv.org/abs/2502.17340
Haodi Zhong, Liuxin Zou, Di Wang, Bo Wang, Zhenxing Niu, Quan Wang, 21 Aug 2025, EvoFormer: Learning Dynamic Graph-Level Representations with Structural and Temporal Bias Correction, https://arxiv.org/abs/2508.15378
Cheng Wang, Gelei Deng, Xianglin Yang, Han Qiu, Tianwei Zhang, 21 Aug 2025, When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models, https://arxiv.org/abs/2508.15407
Tuhina Tripathi, Manya Wadhwa, Greg Durrett, Scott Niekum, 21 Aug 2025, Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation, https://arxiv.org/abs/2504.14716
Saumya Roy, 13 Aug 2025, Persuasiveness and Bias in LLM: Investigating the Impact of Persuasiveness and Reinforcement of Bias in Language Models, https://arxiv.org/abs/2508.15798
Xu Pan, Jingxuan Fan, Zidi Xiong, Ely Hahami, Jorin Overwiening, Ziqian Xie, 16 Aug 2025, User-Assistant Bias in LLMs, https://arxiv.org/abs/2508.15815
Srikant Panda, Vishnu Hari, Kalpana Panda, Amit Agarwal, Hitesh Laxmichand Patel, 18 Aug 2025, Who's Asking? Investigating Bias Through the Lens of Disability Framed Queries in LLMs, https://arxiv.org/abs/2508.15831
Tom Jacobs, Chao Zhou, Rebekka Burkholz, 22 Aug 2025, Mirror, Mirror of the Flow: How Does Regularization Shape Implicit Bias?, https://arxiv.org/abs/2504.12883
Gousia Habib, Tausifa Jan Saleem, Ishfaq Ahmad Malik, Brejesh Lall, 21 Aug 2025, LIB-KD: Teaching Inductive Bias for Efficient Vision Transformer Distillation and Compression, https://arxiv.org/abs/2310.00369
Shir Bernstein, David Beste, Daniel Ayzenshteyn, Lea Schonherr, Yisroel Mirsky, 24 Aug 2025, Trust Me, I Know This Function: Hijacking LLM Static Analysis using Bias, https://arxiv.org/abs/2508.17361
Pooja S. B. Rao and Laxminarayen Nagarajan Venkatesan and Mauro Cherubini and Dinesh Babu Jayagopi, 21 Aug 2025, Invisible Filters: Cultural Bias in Hiring Evaluations Using Large Language Models, https://arxiv.org/abs/2508.16673
Viacheslav Yusupov, Danil Maksimov, Ameliia Alaeva, Tatiana Zaitceva, Antipina Anna, Anna Vasileva, Chenlin Liu, Rayuth Chheng, Danil Sazanakov, Andrey Chetvergov, Alina Ermilova, Egor Shvetsov, 23 Aug 2025, Token Homogenization under Positional Bias, https://arxiv.org/abs/2508.17126
Kyra Wilson, Sourojit Ghosh, Aylin Caliskan, 24 Aug 2025, Bias Amplification in Stable Diffusion's Representation of Stigma Through Skin Tones and Their Homogeneity, https://arxiv.org/abs/2508.17465
Xuan-Bac Nguyen, Thanh-Dat Truong, Pawan Sinha, Khoa Luu, 25 Aug 2025, BRAIN: Bias-Mitigation Continual Learning Approach to Vision-Brain Understanding, https://arxiv.org/abs/2508.18187
Federico Marcuzzi, Xuefei Ning, Roy Schwartz, and Iryna Gurevych, 25 Aug 2025, How Quantization Shapes Bias in Large Language Models, https://arxiv.org/abs/2508.18088
Emanuele Zangrando, Piero Deidda, Simone Brugiapaglia, Nicola Guglielmi, Francesco Tudisco, 23 Aug 2025, Provable Emergence of Deep Neural Collapse and Low-Rank Bias in $L^2$-Regularized Nonlinear Networks, https://arxiv.org/abs/2402.03991
Jihwan Oh, Minchan Jeong, Jongwoo Ko, Se-Young Yun, 24 Aug 2025, Understanding Bias Reinforcement in LLM Agents Debate, https://arxiv.org/abs/2503.16814
Jiancong Xiao, Ziniu Li, Xingyu Xie, Emily Getzen, Cong Fang, Qi Long, Weijie J. Su, 25 Aug 2025, On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization, https://arxiv.org/abs/2405.16455
Paul Scherer, Andreas Kirsch, Jake P. Taylor-King, 4 Sep 2025, When three experiments are better than two: Avoiding intractable correlated aleatoric uncertainty by leveraging a novel bias--variance tradeoff, https://arxiv.org/abs/2509.04363
Joseph Jackson, Georgiy Lapin, Jeremy E. Thompson, 4 Sep 2025, Gravity Well Echo Chamber Modeling With An LLM-Based Confirmation Bias Model, https://arxiv.org/abs/2509.03832
Zhengyu Hu, Linxin Song, Jieyu Zhang, Zheyuan Xiao, Tianfu Wang, Zhengyu Chen, Nicholas Jing Yuan, Jianxun Lian, Kaize Ding, Hui Xiong, 4 Sep 2025, Explaining Length Bias in LLM-Based Preference Evaluations, https://arxiv.org/abs/2407.01085
Andrii Dzhoha, Katya Mirylenka, Egor Malykh, Marco-Andrea Buchmann, Francesca Catino, 3 Sep 2025, Short-Form Video Recommendations with Multimodal Embeddings: Addressing Cold-Start and Bias Challenges, https://arxiv.org/abs/2507.19346
Junyu Yan, Feng Chen, Yuyang Xue, Yuning Du, Konstantinos Vilouras, Sotirios A. Tsaftaris, Steven McDonagh, 4 Sep 2025, SWiFT: Soft-Mask Weight Fine-tuning for Bias Mitigation, https://arxiv.org/abs/2508.18826
Yifan Chen, Xiaoou Cheng, Jonathan Niles-Weed, Jonathan Weare, 3 Sep 2025, Convergence of Unadjusted Langevin in High Dimensions: Delocalization of Bias, https://arxiv.org/abs/2408.13115
Martha O. Dimgba, Sharon Oba, Ameeta Agrawal, Philippe J. Giabbanelli, 3 Sep 2025, Mitigation of Gender and Ethnicity Bias in AI-Generated Stories through Model Explanations, https://arxiv.org/abs/2509.04515
Karanbir Singh, Deepak Muppiri, William Ngu, 26 Aug 2025, Bias Mitigation Agent: Optimizing Source Selection for Fair and Balanced Knowledge Retrieval, https://arxiv.org/abs/2508.18724
Jay L. Cunningham, Adinawa Adjagbodjou, Jeffrey Basoah, Jainaba Jawara, Kowe Kadoma, Aaleyah Lewis, 20 Aug 2025, Toward Responsible ASR for African American English Speakers: A Scoping Review of Bias and Equity in Speech Technology, https://arxiv.org/abs/2508.18288
Kwonyoung Kim, Jungin Park, Jiyoung Lee, Dongbo Min, Kwanghoon Sohn, 26 Aug 2025, PointFix: Learning to Fix Domain Bias for Robust Online Stereo Adaptation, https://arxiv.org/abs/2207.13340
Sheryl Mathew and N Harshit, 27 Aug 2025, Counterfactual Reward Model Training for Bias Mitigation in Multimodal Reinforcement Learning, https://arxiv.org/abs/2508.19567
Md Abdullah Al Mamun, Ihsen Alouani, Nael Abu-Ghazaleh, 28 Aug 2025, Poison Once, Refuse Forever: Weaponizing Alignment for Injecting Bias in LLMs, https://arxiv.org/abs/2508.20333
Ruben Solozabal, Velibor Bojkovic, Hilal AlQuabeh, Kentaro Inui, Martin Tak\'a\v{c}, 28 Aug 2025, Uncovering the Spectral Bias in Diagonal State Space Models, https://arxiv.org/abs/2508.20441
Farhad Abtahi, Mehdi Astaraki, Fernando Seoane, 29 Aug 2025, Leveraging Imperfection with MEDLEY A Multi-Model Approach Harnessing Bias in Medical AI, https://arxiv.org/abs/2508.21648
Muskan Saraf, Sajjad Rezvani Boroujeni, Justin Beaudry, Hossein Abedi, Tom Bush, 28 Aug 2025, Quantifying Label-Induced Bias in Large Language Model Self- and Cross-Evaluations, https://arxiv.org/abs/2508.21164
Liulu He, Shenli Zheng, Karwei Sun, Yijiang Liu, Yufei Zhao, Chongkang Tan, Huanrui Yang, Yuan Du, Li Du, 29 Aug 2025, BASE-Q: Bias and Asymmetric Scaling Enhanced Rotational Quantization for Large Language Models, https://arxiv.org/abs/2506.15689
Lucas Mansilla, Rodrigo Echeveste, Camila Gonzalez, Diego H. Milone, Enzo Ferrante, 1 Sep 2025, BM-CL: Bias Mitigation through the lens of Continual Learning, https://arxiv.org/abs/2509.01730
Sanjeeevan Selvaganapathy and Mehwish Nasim, 31 Aug 2025, Confident, Calibrated, or Complicit: Probing the Trade-offs between Safety Alignment and Ideological Bias in Language Models in Detecting Hate Speech, https://arxiv.org/abs/2509.00673
Theodor Stoecker, Samed Bayer, and Ingo Weber, 28 Aug 2025, Bias Mitigation for AI-Feedback Loops in Recommender Systems: A Systematic Literature Review and Taxonomy, https://arxiv.org/abs/2509.00109
Chen Zheng, Zhenyu Zhao, 29 Aug 2025, Algorithm Adaptation Bias in Recommendation System Online Experiments, https://arxiv.org/abs/2509.00199
Ryan Franks, Alexey Miroshnikov, Konstandinos Kotsiopoulos, 2 Sep 2025, Explainable post-training bias mitigation with distribution-based fairness metrics, https://arxiv.org/abs/2504.01223
Abhishek Pasula and Deepak N. Subramani, 2 Sep 2025, Global Climate Model Bias Correction Using Deep Learning, https://arxiv.org/abs/2504.19145
Serra Aksoy, 3 Sep 2025, Systematic Evaluation of Attribution Methods: Eliminating Threshold Bias and Revealing Method-Dependent Performance Patterns, https://arxiv.org/abs/2509.03176
Alissa A. Valentine, Lauren A. Lepow, Lili Chan, Alexander W. Charney, Isotta Landi, 2 Sep 2025, Quantifying Clinician Bias and its Effects on Schizophrenia Diagnosis in the Emergency Department of the Mount Sinai Health System, https://arxiv.org/abs/2509.02651
Jessica M. Lundin, Ada Zhang, Nihal Karim, Hamza Louzan, Victor Wei, David Adelani, Cody Carroll, 5 Sep 2025, The Token Tax: Systematic Bias in Multilingual Tokenization, https://arxiv.org/abs/2509.05486
Jinrui Yang, Xudong Han, Timothy Baldwin, 7 Sep 2025, Benchmarking Gender and Political Bias in Large Language Models, https://arxiv.org/abs/2509.06164
Jinrui Yang, Fan Jiang, Timothy Baldwin, 7 Sep 2025, Language Bias in Information Retrieval: The Nature of the Beast and Mitigation Methods, https://arxiv.org/abs/2509.06195
Vincent C. Brockers, David A. Ehrlich, Viola Priesemann, 8 Sep 2025, Disentangling Interaction and Bias Effects in Opinion Dynamics of Large Language Models, https://arxiv.org/abs/2509.06858
Jinrui Yang, Timothy Baldwin, Trevor Cohn, 3 Nov 2023, Multi-EuP: The Multilingual European Parliament Dataset for Analysis of Bias in Information Retrieval, https://arxiv.org/abs/2311.01870
Viacheslav Sinii, Alexey Gorbatovski, Artem Cherepanov, Boris Shaposhnikov, Nikita Balagansky, Daniil Gavrilov, 8 Sep 2025, Steering LLM Reasoning Through Bias-Only Adaptation, https://arxiv.org/abs/2505.18706
Rushia Harada, Yuken Kimura, Keito Inoshita, 7 Sep 2025, Role-Playing LLM-Based Multi-Agent Support Framework for Detecting and Addressing Family Communication Bias, https://arxiv.org/abs/2507.11210
Amnon Balanov, Tamir Bendory, and Wasim Huleihel, 7 Sep 2025, Confirmation Bias in Gaussian Mixture Models, https://arxiv.org/abs/2408.09718
Qihu Xie, Yuan Li, and Yi Kang, 9 Sep 2025, SBS: Enhancing Parameter-Efficiency of Neural Representations for Neural Networks via Spectral Bias Suppression, https://arxiv.org/abs/2509.07373
Juan Manuel Contreras, 8 Sep 2025, Automated Evaluation of Gender Bias Across 13 Large Multimodal Models, https://arxiv.org/abs/2509.07050
Sai Siddhartha Chary Aylapuram, Veeraraju Elluru, Shivang Agarwal, 9 Sep 2025, Bias-Aware Machine Unlearning: Towards Fairer Vision Models via Controllable Forgetting, https://arxiv.org/abs/2509.07456
Camilo Chac\'on Sartori, Mart\'in Isla Pino, Pedro Pinacho-Davidson, Christian Blum, 5 Sep 2025, LLM-Based Instance-Driven Heuristic Bias In the Context of a Biased Random Key Genetic Algorithm, https://arxiv.org/abs/2509.09707
Zahraa Al Sahili, Ioannis Patras, Matthew Purver, 11 Sep 2025, Data Matters Most: Auditing Social Bias in Contrastive Vision Language Models, https://arxiv.org/abs/2501.13223
Zahraa Al Sahili, Ioannis Patras, Matthew Purver, 11 Sep 2025, Breaking Language Barriers or Reinforcing Bias? A Study of Gender and Racial Disparities in Multilingual Contrastive Vision Language Models, https://arxiv.org/abs/2505.14160
Baichuan Huang, Ananth Balashankar, Amir Aminifar, 19 Sep 2025, BEFT: Bias-Efficient Fine-Tuning of Language Models, https://arxiv.org/abs/2509.15974
Shuo Wang and Renhao Li and Xi Chen and Yulin Yuan and Derek F. Wong and Min Yang, 18 Sep 2025, Exploring the Impact of Personality Traits on LLM Bias and Toxicity, https://arxiv.org/abs/2502.12566
Nikolaos Tsilivis, Eitan Gronich, Gal Vardi, Julia Kempe, 19 Sep 2025, Flavors of Margin: Implicit Bias of Steepest Descent in Homogeneous Neural Networks, https://arxiv.org/abs/2410.22069
Avinash Madasu, Vasudev Lal, Phillip Howard, 19 Sep 2025, Pruning the Paradox: How CLIP's Most Informative Heads Enhance Performance While Amplifying Bias, https://arxiv.org/abs/2503.11103
Xiaoguang Chang, Teng Wang and Changyin Sun, 13 Sep 2025, A Modern Look at Simplicity Bias in Image Classification Tasks, https://arxiv.org/abs/2509.12265
Paul Kr\"oger, Emilio Barkett, 16 Sep 2025, Don't Change My View: Ideological Bias Auditing in Large Language Models, https://arxiv.org/abs/2509.12652
Maximus Powers, Shaina Raza, Alex Chang, Rehana Riaz, Umang Mavani, Harshitha Reddy Jonala, Ansh Tiwari, Hua Wei, 15 Sep 2025, Responsible AI in NLP: GUS-Net Span-Level Bias Detection Dataset and Benchmark for Generalizations, Unfairness, and Stereotypes, https://arxiv.org/abs/2410.08388
Robin Narsingh Ranabhat, Longwei Wang, Amit Kumar Patel, KC santosh, 14 Sep 2025, Promoting Shape Bias in CNNs: Frequency-Based and Contrastive Regularization for Corruption Robustness, https://arxiv.org/abs/2509.11355
Amy Rafferty, Rishi Ramaesh, Ajitha Rajan, 18 Sep 2025, Limitations of Public Chest Radiography Datasets for Artificial Intelligence: Label Quality, Domain Shift, Bias and Evaluation Challenges, https://arxiv.org/abs/2509.15107
Kiana Kiashemshaki, Mohammad Jalili Torkamani, Negin Mahmoudi, Meysam Shirdel Bilehsavar, 17 Sep 2025, Simulating a Bias Mitigation Scenario in Large Language Models, https://arxiv.org/abs/2509.14438
Chiyu Ma, Enpei Zhang, Yilun Zhao, Wenjun Liu, Yaning Jia, Peijun Qing, Lin Shi, Arman Cohan, Yujun Yan, Soroush Vosoughi, 17 Sep 2025, Judging with Many Minds: Do More Perspectives Mean Less Prejudice? On Bias Amplifications and Resistance in Multi-Agent Based LLM-as-Judge, https://arxiv.org/abs/2505.19477
Zoya Hammad, Nii Longdon Sowah, 7 Sep 2025, Evaluating and comparing gender bias across four text-to-image models, https://arxiv.org/abs/2509.08004
Nivedha Sivakumar, Natalie Mackraz, Samira Khorshidi, Krishna Patel, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff, 9 Sep 2025, Bias after Prompting: Persistent Discrimination in Large Language Models, https://arxiv.org/abs/2509.08146
Daniel Lacker and Fuzhong Zhou, 10 Sep 2025, A hierarchical entropy method for the delocalization of bias in high-dimensional Langevin Monte Carlo, https://arxiv.org/abs/2509.08619
Ji Zhang, Xu Luo, Lianli Gao, Difan Zou, Hengtao Shen, Jingkuan Song, 10 Sep 2025, From Channel Bias to Feature Redundancy: Uncovering the "Less is More" Principle in Few-Shot Learning, https://arxiv.org/abs/2310.03843
Xuan Liu, Haoyang Shang, Haojian Jin, 16 Sep 2025, Programmable Cognitive Bias in Social Agents, https://arxiv.org/abs/2509.13588
Sai Suresh Marchala Vasu, Ivaxi Sheth, Hui-Po Wang, Ruta Binkyte, Mario Fritz, 16 Sep 2025, Justice in Judgment: Unveiling (Hidden) Bias in LLM-assisted Peer Reviews, https://arxiv.org/abs/2509.13400
Dingwei Zhang, Dong Zhang, Jinhui Tang, 17 Sep 2025, Mitigating Query Selection Bias in Referring Video Object Segmentation, https://arxiv.org/abs/2509.13722
Mohsinul Kabir, Tasfia Tahsin, Sophia Ananiadou, 17 Sep 2025, From n-gram to Attention: How Model Architectures Learn and Propagate Bias in Language Modeling, https://arxiv.org/abs/2505.12381

Toxicity

Toxicity is the LLM safety issue of ensuring that the AI does not give "toxic" answers to the user. There are many subtypes of this issue, such as ensuring that answers are appropriate, non-aggressive, non-disparaging, not insulting, and generally helpful. The overall tone of AI interactions should be positive rather than negative.

Research papers on LLM toxicity issues:

Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
Cem Dilmegani, Jan 10, 2024, The Future of Large Language Models in 2024, https://research.aimultiple.com/future-of-large-language-models/
Shayne Longpre, Stella Biderman, Alon Albalak, Hailey Schoelkopf, Daniel McDuff, Sayash Kapoor, Kevin Klyman, Kyle Lo, Gabriel Ilharco, Nay San, Maribeth Rauh, Aviya Skowron, Bertie Vidgen, Laura Weidinger, Arvind Narayanan, Victor Sanh, David Adelani, Percy Liang, Rishi Bommasani, Peter Henderson, Sasha Luccioni, Yacine Jernite, Luca Soldaini, 26 Jun 2024 (v2), The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources, https://arxiv.org/abs/2406.16746
Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri, 10 Dec 2024, Granite Guardian, https://arxiv.org/abs/2412.07724 https://github.com/ibm-granite/granite-guardian (Open-sourcing of safety models with many capabilities.)
Jueon Park, Yein Park, Minju Song, Soyon Park, Donghyeon Lee, Seungheun Baek and Jaewoo Kang, 5 Aug 2025, CoTox: Chain-of-Thought-Based Molecular Toxicity Reasoning and Prediction, https://arxiv.org/abs/2508.03159
Axel Delaval, Shujian Yang, Haicheng Wang, Han Qiu, Jialiang Lu, 15 Aug 2025, ToxiFrench: Benchmarking and Enhancing Language Models via CoT Fine-Tuning for French Toxicity Detection, https://arxiv.org/abs/2508.11281
Han Zhang, Fengji Ma, Jiamin Su, Xinyue Yang, Lei Wang, Wen-Cai Ye, Li Liu, 4 Sep 2025, Quantum-Enhanced Multi-Task Learning with Learnable Weighting for Pharmacokinetic and Toxicity Prediction, https://arxiv.org/abs/2509.04601
Guillermo Villate-Castillo, Javier Del Ser, Borja Sanz, 29 Aug 2025, A Collaborative Content Moderation Framework for Toxicity Detection based on Conformalized Estimates of Annotation Disagreement, https://arxiv.org/abs/2411.04090
Naquee Rizwan, Nayandeep Deb, Sarthak Roy, Vishwajeet Singh Solanki, Kiran Garimella, Animesh Mukherjee, 29 Aug 2025, Toxicity Begets Toxicity: Unraveling Conversational Chains in Political Podcasts, https://arxiv.org/abs/2501.12640
Akriti Verma, Shama Islam, Valeh Moghaddam and Adnan Anwar, 31 Aug 2025, Queuing for Civility: Regulating Emotions and Reducing Toxicity in Digital Discourse, https://arxiv.org/abs/2509.00696
Ruoxi Cheng, Yizhong Ding, Shuirong Cao, Ranjie Duan, Xiaoshuang Jia, Shaowei Yuan, Simeng Qin, Zhiqiang Wang, Xiaojun Jia, 30 Aug 2025, PBI-Attack: Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for Toxicity Maximization, https://arxiv.org/abs/2412.05892
Shuo Wang and Renhao Li and Xi Chen and Yulin Yuan and Derek F. Wong and Min Yang, 18 Sep 2025, Exploring the Impact of Personality Traits on LLM Bias and Toxicity, https://arxiv.org/abs/2502.12566
Sudeshna Jana, Manjira Sinha and Tirthankar Dasgupta, 14 Sep 2025, Decoding Plastic Toxicity: An Intelligent Framework for Conflict-Aware Relational Metapath Extraction from Scientific Abstracts, https://arxiv.org/abs/2509.11330
Huy Nghiem, Advik Sachdeva, Hal Daum\'e III, 18 Sep 2025, SMARTER: A Data-efficient Framework to Improve Toxicity Detection with Explanation via Self-augmenting Large Language Models, https://arxiv.org/abs/2509.15174
Gautam Kishore Shahi, Tim A. Majchrzak, 14 Sep 2025, Defining, Understanding, and Detecting Online Toxicity: Challenges and Machine Learning Approaches, https://arxiv.org/abs/2509.14264

Ethics of Responsible AI Research

Ethical issues in AI research and related publication of results:

Partnership on AI. 2021, Managing the risks of AI research: Six Recommendations for Responsible Publication. https://partnershiponai.org/paper/responsible-publication-recommendations
M. Brundage, S. Avin, J. Wang, H. Belfield, G. Krueger, G. Hadfield, H. Khlaaf, J. Yang, H. Toner, R. Fong, T. Maharaj, P. W. Koh, S. Hooker, J. Leung, A. Trask, E. Bluemke, J. Lebensold, C. O’Keefe, M. Koren, T. Ryffel, J. Rubinovitz, T. Besiroglu, F. Carugati, J. Clark, P. Eckersley, S. de Haas, M. Johnson, B. Laurie, A. Ingerman, I. Krawczuk, A. Askell, R. Cammarota, A. Lohn, D. Krueger, C. Stix, P. Henderson, L. Graham, C. Prunkl, B. Martin, E. Seger, N. Zilberman, S. Ó. hÉigeartaigh, F. Kroeger, G. Sastry, R. Kagan, A. Weller, B. Tse, E. Barnes, A. Dafoe, P. Scharre, A. Herbert-Voss, M. Rasser, S. Sodhani, C. Flynn, T. K. Gilbert, L. Dyer, S. Khan, Y. Bengio, and M. Anderljung. Toward trustworthy AI development: Mechanisms for supporting verifiable claims. arXiv preprint arXiv:2004.07213, 2020. https://arxiv.org/abs/2004.07213
R. Crootof. 2019, Artificial intelligence research needs responsible publication norms. https://www.lawfareblog.com/artificial-intelligence-research-needs-responsible-publication-norms
C. Ashurst, S. Barocas, R. Campbell, and D. Raji. Disentangling the components of ethical research in machine learning. In 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 2057–2068, 2022. http://dx.doi.org/10.1145/3531146.3533781, https://www.researchgate.net/publication/361439688_Disentangling_the_Components_of_Ethical_Research_in_Machine_Learning
Herrmann H. What's next for responsible artificial intelligence: a way forward through responsible innovation. Heliyon. 2023 Mar 11;9(3):e14379. doi: 10.1016/j.heliyon.2023.e14379. eCollection 2023 Mar. PMID: 36967876, https://pubmed.ncbi.nlm.nih.gov/36967876/
Ethically governing artificial intelligence in the field of scientific research and innovation. González-Esteban Y Patrici Calvo E. Heliyon. 2022 Feb 16;8(2):e08946. doi: 10.1016/j.heliyon.2022.e08946. eCollection 2022 Feb. PMID: 35243068, https://pubmed.ncbi.nlm.nih.gov/35243068/
Dzobo K, Adotey S, Thomford NE, Dzobo W. Integrating Artificial and Human Intelligence: A Partnership for Responsible Innovation in Biomedical Engineering and Medicine. OMICS. 2020 May;24(5):247-263. doi: 10.1089/omi.2019.0038. Epub 2019 Jul 16. PMID: 31313972, https://pubmed.ncbi.nlm.nih.gov/31313972/
d'Aquin M., Troullinou P., O'Connor N.E., Cullen A., Faller G., Holden L. 2018 AAAI/ACM Conference on AI, Ethics, and Society (AIES ’18) ACM; New York: 2018. Towards an “ethics by design” methodology for AI research projects”; pp. 54–59. https://www.researchgate.net/publication/330297261_Towards_an_Ethics_by_Design_Methodology_for_AI_Research_Projects
Dignum Virginia. 2019. Responsible Artificial Intelligence. How to Develop and Use AI in a Responsible Way. Springer, https://link.springer.com/book/10.1007/978-3-030-30371-6
European Commission. 2012. Responsible Research and Innovation: Europe’s Ability to Respond to Societal Challenges. Brussels. https://op.europa.eu/en/publication-detail/-/publication/2be36f74-b490-409e-bb60-12fd438100fe
Helmore Edward. 2019. Profit over safety? Boeing under fire over 737 Max crashes as families demand answers. Guardian. https://www.theguardian.com/business/2019/jun/17/boeing-737-max-ethiopian-airlines-crash
High-level expert Group on Artificial Intelligence. European Commission; 2019. Ethics Guidelines for Trustworthy AI. Brussels. https://op.europa.eu/en/publication-detail/-/publication/d3988569-0434-11ea-8c1f-01aa75ed71a1
Prates M., Avelar P., Lamb L.C. 2018, On quantifying and understanding the role of ethics in AI research: a historical account of flagship conferences and journals. EPiC Series in Computing. 2018;55:188–201. https://arxiv.org/abs/1809.08328
Castelvecchi D., 2021, Prestigious AI meeting takes steps to improve ethics of research. Nature. 2021 Jan;589(7840):12-13. doi: 10.1038/d41586-020-03611-8. PMID: 33361804, https://pubmed.ncbi.nlm.nih.gov/33361804/
Bouhouita-Guermech S, Gogognon P, Bélisle-Pipon JC. 2023, Specific challenges posed by artificial intelligence in research ethics. Front Artif Intell. 2023 Jul 6;6:1149082. doi: 10.3389/frai.2023.1149082. eCollection 2023. PMID: 37483869 https://pubmed.ncbi.nlm.nih.gov/37483869/
Gibney E., 2020, The battle for ethical AI at the world's biggest machine-learning conference. Nature. 2020 Jan;577(7792):609. doi: 10.1038/d41586-020-00160-y. PMID: 31992885, https://pubmed.ncbi.nlm.nih.gov/31992885/
Sánchez López JD, Cambil Martín J, Villegas Calvo M, Luque Martínez F., 2020. Ethical conflicts between authonomy and deep learning, J Healthc Qual Res. 2020 Jan-Feb;35(1):51-52. doi: 10.1016/j.jhqr.2019.06.009. Epub 2019 Nov 26. PMID: 31784256, https://pubmed.ncbi.nlm.nih.gov/31784256/
Prabhu SP., 2019, Ethical challenges of machine learning and deep learning algorithms. Lancet Oncol. 2019 May;20(5):621-622. doi: 10.1016/S1470-2045(19)30230-X. PMID: 31044701, https://pubmed.ncbi.nlm.nih.gov/31044701/
Dignum V. Ethics in artificial intelligence: introduction to the special issue. Ethics Inf. Technol. 2018;20:1–3. https://link.springer.com/article/10.1007/s10676-018-9450-z
IEEE. 2019. "Ethically Aligned Design: A Vision for Prioritizing Human Well-being With Autonomous and Intelligent Systems [First Edition]." The IEEE Global Initiative on Ethics of Autonomous and Intelligent Systems. https://standards.ieee.org/content/ieee-standards/en/industry-connections/ec/autonomous-systems.html
Stuart Russell, Daniel Dewey, and Max Tegmark. 2015. Research priorities for robust and beneficial artificial intelligence. AI Magazine, 36(4):105–114, 2015. PDF: https://futureoflife.org/data/documents/research_priorities.pdf
Peter Dizikes, December 11, 2023, MIT group releases white papers on governance of AI, MIT News, https://news.mit.edu/2023/mit-group-releases-white-papers-governance-ai-1211
Thomas Mildner, Orla Cooney, Anna-Maria Meck, Marion Bartl, Gian-Luca Savino, Philip R. Doyle, Diego Garaialde, Leigh Clark, John Sloan, Nina Wenig, Rainer Malaka, Jasmin Niess, 26 Jan 2024, Listening to the Voices: Describing Ethical Caveats of Conversational User Interfaces According to Experts and Frequent Users, Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI '24), May 11--16, 2024, Honolulu, HI, USA, https://arxiv.org/abs/2401.14746 https://doi.org/https://doi.org/10.1145/3613904.3642542
Balasubramaniam S. , Vanajaroselin Chirchi, Seifedine Kadry, Moorthy Agoramoorthy, Gururama Senthilvel P., Satheesh Kumar K., and Sivakumar T. A., Oct 2024, The Road Ahead: Emerging Trends, Unresolved Issues, and ConcludingRemarksinGenerativeAI—AComprehensiveReview, International Journal of Intelligent Systems, Volume 2024, Article ID 4013195, 38 pages, https://doi.org/10.1155/2024/4013195 https://www.researchgate.net/profile/Balasubramaniam-s-2/publication/384729387_The_Road_Ahead_Emerging_Trends_Unresolved_Issues_and_Concluding_Remarks_in_Generative_AI-A_Comprehensive_Review/links/6705560cf5eb7108c6e5d261/The-Road-Ahead-Emerging-Trends-Unresolved-Issues-and-Concluding-Remarks-in-Generative-AI-A-Comprehensive-Review.pdf

AI Alignment Research

Alignment is the study of how to ensure that AI engines are "aligned" with the goals and intent of humans.

J. Leike, J. Schulman, and J. Wu. OpenAI, August 2022. Our approach to alignment research. https://openai.com/blog/our-approach-to-alignment-research
OpenAI, July 2023, Introducing Superalignment, https://openai.com/blog/introducing-superalignment
V. Krakovna and R. Shah. 2023, Some high-level thoughts on the DeepMind alignment team’s strategy. https://www.alignmentforum.org/posts/a9SPcZ6GXAg9cNKdi/linkpost-some-high-level-thoughts-on-the-deepmind-alignment
J. Leike. Dec 2022, Why I’m optimistic about our alignment approach. https://aligned.substack.com/p/alignment-optimism
Nate Soares and Benja Fallenstein. Aligning superintelligence with human interests: A technical research agenda. Technical report, Machine Intelligence Research Institute, 2014. https://www.semanticscholar.org/paper/Aligning-Superintelligence-with-Human-Interests%3A-A-Soares-Fallenstein/d8033a314493c8df3791912272ac4b58d3a7b8c2
Jessica Taylor, Eliezer Yudkowsky, Patrick LaVictoire, and Andrew Critch. 2016. Alignment for advanced machine learning systems. Technical report, Machine Intelligence Research Institute, 2016. PDF: https://intelligence.org/files/AlignmentMachineLearning.pdf
Daniel Weld and Oren Etzioni. The first law of robotics (a call to arms). Proceedings of the AAAI Conference on Artificial Intelligence, 12, pages 1042–1047, 1994. https://aaai.org/papers/01042-the-first-law-of-robotics-a-call-to-arms/
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, Mar 2022, Training language models to follow instructions with human feedback, https://arxiv.org/abs/2203.02155 (InstructGPT main paper from OpenAI in 2022.)
Ziniu Li1, Tian Xu, Yushun Zhang, Zhihang Lin, Yang Yu, Ruoyu Sun, Zhi-Quan Luo, 2024, ReMax: ASimple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models, https://openreview.net/pdf?id=Stn8hXkpe6
Aibek Bekbayev, Sungbae Chun, Yerzat Dulat, James Yamazaki, Aug 2023, The Poison of Alignment, https://arxiv.org/abs/2308.13449
Yotam Wolf, Noam Wies, Yoav Levine, and Amnon Shashua. Fundamental limitations of alignment in large language models. arXiv preprint arXiv:2304.11082, 2023. https://arxiv.org/abs/2304.11082
Renze Lou, Kai Zhang, Wenpeng Yin, 25 May 2024 (v8), Large Language Model Instruction Following: A Survey of Progresses and Challenges, https://arxiv.org/abs/2303.10475 Project: https://github.com/RenzeLou/awesome-instruction-learning
Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, Johan Ferret, 22 Jan 2024, WARM: On the Benefits of Weight Averaged Reward Models, https://arxiv.org/abs/2401.12187 (Uses multiple reward models to avoid problems with the LLM "hacking rewards" in unforeseen ways.)
NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared RELU.)
Piotr Wojciech Mirowski, Juliette Love, Kory W. Mathewson, Shakir Mohamed, 3 Jun 2024 (v2), A Robot Walks into a Bar: Can Language Models Serve as Creativity Support Tools for Comedy? An Evaluation of LLMs' Humour Alignment with Comedians, https://arxiv.org/abs/2405.20956 (The unfunny fact that AI is bad at humor.)
Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
Mintong Kang, Nezihe Merve Gürel, Ning Yu, Dawn Song, Bo Li, July 2024, C-RAG: Certified Generation Risks for Retrieval-Augmented Language Models, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:22963-23000, 2024, https://proceedings.mlr.press/v235/kang24a.html
Rohin Shah, Seb Farquhar, Anca Dragan, 21st Aug 2024, AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work, https://www.alignmentforum.org/posts/79BPxvSsjzBkiSyTq/agi-safety-and-alignment-at-google-deepmind-a-summary-of
Shayne Longpre, Stella Biderman, Alon Albalak, Hailey Schoelkopf, Daniel McDuff, Sayash Kapoor, Kevin Klyman, Kyle Lo, Gabriel Ilharco, Nay San, Maribeth Rauh, Aviya Skowron, Bertie Vidgen, Laura Weidinger, Arvind Narayanan, Victor Sanh, David Adelani, Percy Liang, Rishi Bommasani, Peter Henderson, Sasha Luccioni, Yacine Jernite, Luca Soldaini, 26 Jun 2024 (v2), The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources, https://arxiv.org/abs/2406.16746
Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
Hao Zhou, Chengming Hu, Ye Yuan, Yufei Cui, Yili Jin, Can Chen, Haolun Wu, Dun Yuan, Li Jiang, Di Wu, Xue Liu, Charlie Zhang, Xianbin Wang, Jiangchuan Liu, 17 May 2024, Large Language Model (LLM) for Telecommunications: A Comprehensive Survey on Principles, Key Techniques, and Opportunities, https://arxiv.org/abs/2405.10825
Zekun Moore Wang, Shawn Wang, Kang Zhu, Jiaheng Liu, Ke Xu, Jie Fu, Wangchunshu Zhou, Wenhao Huang, 17 Oct 2024, PopAlign: Diversifying Contrasting Patterns for a More Comprehensive Alignment, https://arxiv.org/abs/2410.13785
Mozhi Zhang, Pengyu Wang, Chenkun Tan, Mianqiu Huang, Dong Zhang, Yaqian Zhou, Xipeng Qiu, 18 Oct 2024, MetaAlign: Align Large Language Models with Diverse Preferences during Inference Time, https://arxiv.org/abs/2410.14184
OpenAI, Dec 2024, Deliberative alignment: reasoning enables safer language models. Introducing our new alignment strategy for o-series models, which are directly taught safety specifications and how to reason over them. https://openai.com/index/deliberative-alignment/
Asif Razzaq, December 23, 2024, OpenAI Researchers Propose ‘Deliberative Alignment’: A Training Approach that Teaches LLMs to Explicitly Reason through Safety Specifications before Producing an Answer, https://www.marktechpost.com/2024/12/23/openai-researchers-propose-deliberative-alignment-a-training-approach-that-teaches-llms-to-explicitly-reason-through-safety-specifications-before-producing-an-answer/
Andrea Matarazzo, Riccardo Torlone, 3 Jan 2025, A Survey on Large Language Models with some Insights on their Capabilities and Limitations, https://arxiv.org/abs/2501.04040 (Broad survey with many LLM topics covered from history to architectures to optimizations.)
Tong Xiao, Jingbo Zhu, 16 Jan 2025, Foundations of Large Language Models, https://arxiv.org/abs/2501.09223 (Huge 230 page paper on many topics such as training, prompting, alignment, and long context.)
Zongxi Li, Yang Li, Haoran Xie, S. Joe Qin, 3 Feb 2025, CondAmbigQA: A Benchmark and Dataset for Conditional Ambiguous Question Answering, https://arxiv.org/abs/2502.01523
Y Gong, D Ran, X He, T Cong, A Wang, X Wang, Feb 2025, Safety Misalignment Against Large Language Models, Network and Distributed System Security (NDSS) Symposium 2025, 24-28 February 2025, San Diego, CA, USA, ISBN 979-8-9894372-8-3, https://dx.doi.org/10.14722/ndss.2025.241089 https://www.ndss-symposium.org/wp-content/uploads/2025-1089-paper.pdf
Guiyao Tie, Zeli Zhao, Dingjie Song, Fuyang Wei, Rong Zhou, Yurou Dai, Wen Yin, Zhejian Yang, Jiangyue Yan, Yao Su, Zhenhan Dai, Yifeng Xie, Yihan Cao, Lichao Sun, Pan Zhou, Lifang He, Hechang Chen, Yu Zhang, Qingsong Wen, Tianming Liu, Neil Zhenqiang Gong, Jiliang Tang, Caiming Xiong, Heng Ji, Philip S. Yu, Jianfeng Gao, 8 Mar 2025, A Survey on Post-training of Large Language Models, https://arxiv.org/abs/2503.06072
Ziqi Yin, Hao Wang, Kaito Horio, Daisuke Kawahara, Satoshi Sekine, 14 Oct 2024 (v2), Should We Respect LLMs? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance, https://arxiv.org/abs/2402.14531
Michael Nuñez, July 15, 2025, OpenAI, Google DeepMind and Anthropic sound alarm: ‘We may be losing the ability to understand AI’, https://venturebeat.com/ai/openai-google-deepmind-and-anthropic-sound-alarm-we-may-be-losing-the-ability-to-understand-ai/ (Monitoring the text-based interim "thinking-out-loud" reasoning of models in CoT.)
Tomek Korbak, Mikita Balesni, (and many more authors) July 2025, Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety, https://tomekkorbak.com/cot-monitorability-is-a-fragile-opportunity/cot_monitoring.pdf
Yafu Li, Xuyang Hu, Xiaoye Qu, Linjie Li, Yu Cheng, 22 Jan 2025, Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback, https://arxiv.org/abs/2501.12895 https://github.com/yafuly/TPO
Cameron R. Wolfe, Ph.D., Jun 30, 2025, Reward Models: Modeling human preferences for LLMs in the age of reasoning models, https://cameronrwolfe.substack.com/p/reward-models
Zetian Sun, Dongfang Li, Baotian Hu, 14 Aug 2025, Diversity First, Quality Later: A Two-Stage Assumption for Language Model Alignment, https://arxiv.org/abs/2508.10530
Xinyan Jiang, Lin Zhang, Jiayi Zhang, Qingsong Yang, Guimin Hu, Di Wang, Lijie Hu, 14 Aug 2025, MSRS: Adaptive Multi-Subspace Representation Steering for Attribute Alignment in Large Language Models, https://arxiv.org/abs/2508.10599
Jinhwa Kim, Ian G. Harris, 9 Aug 2025, Context Misleads LLMs: The Role of Context Filtering in Maintaining Safe Alignment of LLMs, https://arxiv.org/abs/2508.10031
Christopher Pinier, Sonia Acu\~na Vargas, Mariia Steeghs-Turchina, Dora Matzke, Claire E. Stevenson, Michael D. Nunez, 12 Aug 2025, Large Language Models Show Signs of Alignment with Human Neurocognition During Abstract Reasoning, https://arxiv.org/abs/2508.10057
Shixiong Xu, Chenghao Zhang, Lubin Fan, Yuan Zhou, Bin Fan, Shiming Xiang, Gaofeng Meng, Jieping Ye, 14 Aug 2025, AddressVLM: Cross-view Alignment Tuning for Image Address Localization using Large Vision-Language Models, https://arxiv.org/abs/2508.10667
Xia Chen, 13 Aug 2025, Dynamical Alignment: A Principle for Adaptive Neural Computation, https://arxiv.org/abs/2508.10064
Yihao Xue, Baharan Mirzasoleiman, 22 Jul 2025, LoRA is All You Need for Safety Alignment of Reasoning LLMs, https://arxiv.org/abs/2507.17075
Haoran Sun, Zekun Zhang, Shaoning Zeng, 23 Jul 2025, An Uncertainty-Driven Adaptive Self-Alignment Framework for Large Language Models, https://arxiv.org/abs/2507.17477
Xiang Li, 21 Jul 2025, Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection, https://arxiv.org/abs/2507.16861
Miguel Carrasco, C\'esar Gonz\'alez-Mart\'in, Jos\'e Aranda, Luis Oliveros, 23 Jul 2025, Vision Transformer attention alignment with human visual perception in aesthetic object evaluation, https://arxiv.org/abs/2507.17616
Yifan Wang, Runjin Chen, Bolian Li, David Cho, Yihe Deng, Ruqi Zhang, Tianlong Chen, Zhangyang Wang, Ananth Grama, Junyuan Hong, 22 Jul 2025, More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment, https://arxiv.org/abs/2504.02193
Tom\'as H\"uttebr\"aucker, Mario Edoardo Pandolfo, Simone Fiorellino, Emilio Calvanese Strinati, Paolo Di Lorenzo, 23 Jul 2025, RIS-aided Latent Space Alignment for Semantic Channel Equalization, https://arxiv.org/abs/2507.16450
Shehzeen Hussain, Paarth Neekhara, Xuesong Yang, Edresson Casanova, Subhankar Ghosh, Mikyas T. Desta, Roy Fejgin, Rafael Valle, Jason Li, 22 Jul 2025, Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance, https://arxiv.org/abs/2502.05236
Songming Zhang, Xue Zhang, Tong Zhang, Bojie Hu, Yufeng Chen, Jinan Xu, 23 Jul 2025, AlignDistil: Token-Level Language Model Alignment as Adaptive Policy Distillation, https://arxiv.org/abs/2503.02832
ZhengXiao He, Jinghao Wen, Huayu Li, Siyuan Tian, Ao Li, 23 Jul 2025, NeuroHD-RA: Neural-distilled Hyperdimensional Model with Rhythm Alignment, https://arxiv.org/abs/2507.14184
Amir Mohammad Izadi, Seyed Mohammad Hadi Hosseini, Soroush Vafaie Tabar, Ali Abdollahi, Armin Saghafian, and Mahdieh Soleymani Baghshah, 22 Jul 2025, Fine-Grained Alignment and Noise Refinement for Compositional Text-to-Image Generation, https://arxiv.org/abs/2503.06506
Andy E. Williams, 18 Jul 2025, The Recursive Coherence Principle: A Formal Constraint on Scalable Intelligence, Alignment, and Reasoning Architecture, https://arxiv.org/abs/2507.15880
Debangshu Banerjee, Kintan Saha, Aditya Gopalan, 21 Jul 2025, Towards Reliable, Uncertainty-Aware Alignment, https://arxiv.org/abs/2507.15906
Mario Edoardo Pandolfo, Simone Fiorellino, Emilio Calvanese Strinati, Paolo Di Lorenzo, 22 Jul 2025, Latent Space Alignment for AI-Native MIMO Semantic Communications, https://arxiv.org/abs/2507.16680
Han Jiang, Dongyao Zhu, Zhihua Wei, Xiaoyuan Yi, Ziang Xiao, Xing Xie, 22 Jul 2025, PICACO: Pluralistic In-Context Value Alignment of LLMs via Total Correlation Optimization, https://arxiv.org/abs/2507.16679
Difei Gu, Yunhe Gao, Yang Zhou, Mu Zhou, Dimitris Metaxas, 22 Jul 2025, RadAlign: Advancing Radiology Report Generation with Vision-Language Concept Alignment, https://arxiv.org/abs/2501.07525
Ziteng Yang, Jingzehua Xu, Yanshu Li, Zepeng Li, Yeqiang Wang, Xinghui Li, 22 Jul 2025, ViP$^2$-CLIP: Visual-Perception Prompting with Unified Alignment for Zero-Shot Anomaly Detection, https://arxiv.org/abs/2505.17692
Tianze Wang, Dongnan Gui, Yifan Hu, Shuhang Lin, Linjun Zhang, 22 Jul 2025, MPO: An Efficient Post-Processing Framework for Mixing Diverse Preference Alignment, https://arxiv.org/abs/2502.18699
Xiandong Zou, Wanyu Lin, Yuchen Li, Pan Zhou, 24 Jul 2025, HPS: Hard Preference Sampling for Human Preference Alignment, https://arxiv.org/abs/2502.14400
Alberto Hern\'andez-Espinosa, Felipe S. Abrah\~ao, Olaf Witkowski, Hector Zenil, 24 Jul 2025, Neurodivergent Influenceability as a Contingent Solution to the AI Alignment Problem, https://arxiv.org/abs/2505.02581
Yuhui Sun (University of Alberta), Xiyao Wang (University of Toronto), Zixi Li (Zhejiang University), Zhenlong Yuan (Institute of Computing Technology, Chinese Academy of Sciences), and Jinman Zhao (University of Toronto), 24 Jul 2025, Multi-Preference Lambda-weighted Listwise DPO for Small-Scale Model Alignment, https://arxiv.org/abs/2506.19780
Bowen Jin, Jinsung Yoon, Zhen Qin, Ziqi Wang, Wei Xiong, Yu Meng, Jiawei Han, Sercan O. Arik, 23 Jul 2025, LLM Alignment as Retriever Optimization: An Information Retrieval Perspective, https://arxiv.org/abs/2502.03699
Jie Xu, Na Zhao, Gang Niu, Masashi Sugiyama, Xiaofeng Zhu, 24 Jul 2025, Robust Multi-View Learning via Representation Fusion of Sample-Level Attention and Alignment of Simulated Perturbation, https://arxiv.org/abs/2503.04151
Charvi Rastogi, Tian Huey Teh, Pushkar Mishra, Roma Patel, Ding Wang, Mark D\'iaz, Alicia Parrish, Aida Mostafazadeh Davani, Zoe Ashwood, Michela Paganini, Vinodkumar Prabhakaran, Verena Rieser, Lora Aroyo, 15 Jul 2025, Whose View of Safety? A Deep DIVE Dataset for Pluralistic Alignment of Text-to-Image Models, https://arxiv.org/abs/2507.13383
Oussama Bouaggad, Natalia Grabar, 18 Jul 2025, Search-Optimized Quantization in Biomedical Ontology Alignment, https://arxiv.org/abs/2507.13742
Shuliang Liu, Qi Zheng, Jesse Jiaxi Xu, Yibo Yan, He Geng, Aiwei Liu, Peijie Jiang, Jia Liu, Yik-Cheung Tam, and Xuming Hu, 18 Jul 2025, VLA-Mark: A cross modal watermark for large vision-language alignment model, https://arxiv.org/abs/2507.14067
Yi Zhang, An Zhang, XiuYu Zhang, Leheng Sheng, Yuxin Chen, Zhenkai Liang, Xiang Wang, 20 Jul 2025, AlphaAlign: Incentivizing Safety Alignment with Extremely Simplified Reinforcement Learning, https://arxiv.org/abs/2507.14987
Pengfei Du, 14 Jul 2025, PRM-Free Security Alignment of Large Models via Red Teaming and Adversarial Training, https://arxiv.org/abs/2507.14202
Wenqian Ye, Guangtao Zheng, Aidong Zhang, 20 Jul 2025, Improving Group Robustness on Spurious Correlation via Evidential Alignment, https://arxiv.org/abs/2506.11347
Anirudh Sundar, Sinead Williamson, Katherine Metcalf, Barry-John Theobald, Skyler Seto, Masha Fedzechkina, 21 Jul 2025, Steering into New Embedding Spaces: Analyzing Cross-Lingual Alignment Induced by Model Interventions in Multilingual Language Models, https://arxiv.org/abs/2502.15639
Noel Teku, Fengwei Tian, Payel Bhattacharjee, Souradip Chakraborty, Amrit Singh Bedi, Ravi Tandon, 9 Aug 2025, PROPS: Progressively Private Self-alignment of Large Language Models, https://arxiv.org/abs/2508.06783
Yuandong Tan, 10 Aug 2025, A Stable and Principled Loss Function for Direct Language Model Alignment, https://arxiv.org/abs/2508.07137
Jia Zhang, Yao Liu, Chen-Xi Zhang, Yi Liu, Yi-Xuan Jin, Lan-Zhe Guo, Yu-Feng Li, 11 Aug 2025, Beyond Single: A Data Selection Principle for LLM Alignment via Fine-Grained Preference Signals, https://arxiv.org/abs/2508.07638
Haowen Wang, Yun Yue, Zhiling Ye, Shuowen Zhang, Lei Fan, Jiaxin Liang, Jiadi Jiang, Cheng Wei, Jingyuan Deng, Xudong Han, Ji Li, Chunxiao Guo, Peng Wei, Jian Wang, Jinjie Gu, 11 Aug 2025, Learning to Align, Aligning to Learn: A Unified Approach for Self-Optimized Alignment, https://arxiv.org/abs/2508.07750
Qiang He, Setareh Maghsudi, 11 Aug 2025, Pareto Multi-Objective Alignment for Language Models, https://arxiv.org/abs/2508.07768
Nicole Lai-Tan and Xiao Gu and Marios G. Philiastides and Fani Deligianni, 11 Aug 2025, Cross-Subject and Cross-Montage EEG Transfer Learning via Individual Tangent Space Alignment and Spatial-Riemannian Feature Fusion, https://arxiv.org/abs/2508.08216
Ben Y. Reis and William La Cava, 8 Aug 2025, Towards Integrated Alignment, https://arxiv.org/abs/2508.06592
Xiaobo Zhang (1 and 2), Congqing He (2), Ying He (1 and 2), Jian Peng (1), Dajie Fu (1), Tien-Ping Tan (2) ((1) School of Information Engineering, Jiangxi Vocational College of Finance & Economics, Jiujiang, China, (2) School of Computer Sciences, Universiti Sains Malaysia, Penang, Malaysia), 9 Aug 2025, ESNERA: Empirical and semantic named entity alignment for named entity dataset merging, https://arxiv.org/abs/2508.06877
Jianting Tang, Yubo Wang, Haoyu Cao, Linli Xu, 9 Aug 2025, BASIC: Boosting Visual Alignment with Intrinsic Refined Embeddings in Multimodal Large Language Models, https://arxiv.org/abs/2508.06895
Yanru Sun, Emadeldeen Eldele, Zongxia Xie, Yucheng Wang, Wenzhe Niu, Qinghua Hu, Chee Keong Kwoh, Min Wu, 10 Aug 2025, Adapting LLMs to Time Series Forecasting via Temporal Heterogeneity Modeling and Semantic Alignment, https://arxiv.org/abs/2508.07195
Gustavo Moreira, Leonardo Ferreira, Carolina Veiga, Maryam Hosseini, Fabio Miranda, 10 Aug 2025, Urbanite: A Dataflow-Based Framework for Human-AI Interactive Alignment in Urban Visual Analytics, https://arxiv.org/abs/2508.07390
Wenze Xu and Chun Wang and Jiazhen Yu and Sheng Chen and Liang Gao and Weihong Deng, 11 Aug 2025, Optimal Transport Regularization for Speech Text Alignment in Spoken Language Models, https://arxiv.org/abs/2508.08131
Kyle Moore, Jesse Roberts, Daryl Watson, 11 Aug 2025, Human-Alignment and Calibration of Inference-Time Uncertainty in Large Language Models, https://arxiv.org/abs/2508.08204
Jie Xiao, Changyuan Fan, Qingnan Ren, Alfred Long, Yuchen Zhang, Rymon Yu, Eric Yang, Lynn Ai, Shaoduo Gan, 9 Aug 2025, Echo: Decoupling Inference and Training for Large-Scale RL Alignment on Heterogeneous Swarms, https://arxiv.org/abs/2508.05387
Haoran Lu, Luyang Fang, Ruidong Zhang, Xinliang Li, Jiazhang Cai, Huimin Cheng, Lin Tang, Ziyu Liu, Zeliang Sun, Tao Wang, Yingchuan Zhang, Arif Hassan Zidan, Jinwen Xu, Jincheng Yu, Meizhi Yu, Hanqi Jiang, Xilin Gong, Weidi Luo, Bolun Sun, Yongkai Chen, Terry Ma, Shushan Wu, Yifan Zhou, Junhao Chen, Haotian Xiang, Jing Zhang, Afrar Jahin, Wei Ruan, Ke Deng, Yi Pan, Peilong Wang, Jiahui Li, Zhengliang Liu, Lu Zhang, Lin Zhao, Wei Liu, Dajiang Zhu, Xin Xing, Fei Dou, Wei Zhang, Chao Huang, Rongjie Liu, Mengrui Zhang, Yiwen Liu, Xiaoxiao Sun, Qin Lu, Zhen Xiang, Wenxuan Zhong, Tianming Liu, Ping Ma, 25 Jul 2025, Alignment and Safety in Large Language Models: Safety Mechanisms, Training Paradigms, and Emerging Challenges, https://arxiv.org/abs/2507.19672
Sarat Chandra Bobbili, Ujwal Dinesha, Dheeraj Narasimha, Srinivas Shakkottai, 26 Jul 2025, PITA: Preference-Guided Inference-Time Alignment for LLM Post-Training, https://arxiv.org/abs/2507.20067
Rachel S.Y. Teo, Laziz U. Abdullaev, Tan M. Nguyen, 27 Jul 2025, The Blessing and Curse of Dimensionality in Safety Alignment, https://arxiv.org/abs/2507.20333
Tiantian Peng, Yuyang Liu, Shuo Yang, Qiuhe Hong, YongHong Tian, 26 Jul 2025, GNSP: Gradient Null Space Projection for Preserving Cross-Modal Alignment in VLMs Continual Learning, https://arxiv.org/abs/2507.19839
Siyu Song, Wentao Liu, Ye Lu, Ruohua Zhang, Tao Liu, Jinze Lv, Xinyun Wang, Aimin Zhou, Fei Tan, Bo Jiang, Hao Hao, 27 Jul 2025, Cultivating Helpful, Personalized, and Creative AI Tutors: A Framework for Pedagogical Alignment using Reinforcement Learning, https://arxiv.org/abs/2507.20335
Rongyao Cai, Ming Jin, Qingsong Wen, Kexin Zhang, 28 Jul 2025, From Entanglement to Alignment: Representation Space Decomposition for Unsupervised Time Series Domain Adaptation, https://arxiv.org/abs/2507.20968
Andr\'e Steingr\"uber, Kevin Baum, 24 Jul 2025, Justifications for Democratizing AI Alignment and Their Prospects, https://arxiv.org/abs/2507.19548
Shuhaib Mehri, Xiaocheng Yang, Takyoung Kim, Gokhan Tur, Shikib Mehri, Dilek Hakkani-T\"ur, 27 Jul 2025, Goal Alignment in LLM-Based User Simulators for Conversational AI, https://arxiv.org/abs/2507.20152
Gabriel Downer, Sean Craven, Damian Ruck, Jake Thomas, 28 Jul 2025, Text2VLM: Adapting Text-Only Datasets to Evaluate Alignment Training in Visual Language Models, https://arxiv.org/abs/2507.20704
Renhang Liu, Chia-Yu Hung, Navonil Majumder, Taylor Gautreaux, Amir Ali Bagherzadeh, Chuan Li, Dorien Herremans, Soujanya Poria, 28 Jul 2025, JAM: A Tiny Flow-based Song Generator with Fine-grained Controllability and Aesthetic Alignment, https://arxiv.org/abs/2507.20880
Hei Shing Cheung and Boya Zhang, 26 Jul 2025, Efficient Vocal-Conditioned Music Generation via Soft Alignment Attention and Latent Diffusion, https://arxiv.org/abs/2507.19991
Jiaming Ji, Kaile Wang, Tianyi Qiu, Boyuan Chen, Jiayi Zhou, Changye Li, Hantao Lou, Juntao Dai, Yunhuai Liu, Yaodong Yang, 27 Jul 2025, Language Models Resist Alignment: Evidence From Data Compression, https://arxiv.org/abs/2406.06144
Madhava Gaikwad (1), Ashwini Ramchandra Doke (2) ((1) Microsoft, (2) Amrita University), 22 Jul 2025, NPO: Learning Alignment and Meta-Alignment through Structured Human Feedback, https://arxiv.org/abs/2507.21131
Lenart Motnikar, Katharina Baum, Alexander Kagan, Sarah Spiekermann-Hoff, 26 Jun 2025, The Value of Gen-AI Conversations: A bottom-up Framework for AI Value Alignment, https://arxiv.org/abs/2507.21091
Aran Nayebi, 29 Jul 2025, Intrinsic Barriers and Practical Pathways for Human-AI Alignment: An Agreement-Based Complexity Analysis, https://arxiv.org/abs/2502.05934
Haipeng Liu, Yuxuan Liu, Ting Long, 31 Jul 2025, Personalized Education with Ranking Alignment Recommendation, https://arxiv.org/abs/2507.23664
Wei Li and Xun Gong and Jiao Li and Xiaobin Sun, 31 Jul 2025, AGA: An adaptive group alignment framework for structured medical cross-modal representation learning, https://arxiv.org/abs/2507.23402
Ananth Balashankar and Ziteng Sun and Jonathan Berant and Jacob Eisenstein and Michael Collins and Adrian Hutter and Jong Lee and Chirag Nagpal and Flavien Prost and Aradhana Sinha and Ananda Theertha Suresh and Ahmad Beirami, 31 Jul 2025, InfAlign: Inference-aware language model alignment, https://arxiv.org/abs/2412.19792
Qun Ma, Xiao Xue, Ming Zhang, Yifan Shen, Zihan Zhao, 30 Jul 2025, An Explainable Emotion Alignment Framework for LLM-Empowered Agent in Metaverse Service Ecosystem, https://arxiv.org/abs/2507.22326
Yixuan Nan, Xixun Lin, Yanmin Shang, Zhuofan Li, Can Zhao and Yanan Cao, 30 Jul 2025, RANA: Robust Active Learning for Noisy Network Alignment, https://arxiv.org/abs/2507.22434
Shaoan Xie, Lingjing Kong, Yujia Zheng, Yu Yao, Zeyu Tang, Eric P. Xing, Guangyi Chen, Kun Zhang, 29 Jul 2025, SmartCLIP: Modular Vision-language Alignment with Identification Guarantees, https://arxiv.org/abs/2507.22264
Junjie Cao, 30 Jul 2025, Adaptive Duration Model for Text Speech Alignment, https://arxiv.org/abs/2507.22612
Yeonjun In, Wonjoong Kim, Sangwu Park, Chanyoung Park, 1 Aug 2025, R1-ACT: Efficient Reasoning Model Safety Alignment by Activating Safety Knowledge, https://arxiv.org/abs/2508.00324
Jens U. Kreber, Joerg Stueckler, 1 Aug 2025, Guiding Diffusion-Based Articulated Object Generation by Partial Point Cloud Alignment and Physical Plausibility Constraints, https://arxiv.org/abs/2508.00558
Amitava Das, Vinija Jain, Aman Chadha, 4 Aug 2025, TRACEALIGN -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs, https://arxiv.org/abs/2508.02063
Istabrak Abbes, Gopeshh Subbaraj, Matthew Riemer, Nizar Islah, Benjamin Therien, Tsuguchika Tabaru, Hiroaki Kingetsu, Sarath Chandar, Irina Rish, 3 Aug 2025, Revisiting Replay and Gradient Alignment for Continual Pre-Training of Large Language Models, https://arxiv.org/abs/2508.01908
Ziyu Zhou, Yiming Huang, Yanyun Wang, Yuankai Wu, James Kwok, Yuxuan Liang, 4 Aug 2025, Revitalizing Canonical Pre-Alignment for Irregular Multivariate Time Series Forecasting, https://arxiv.org/abs/2508.01971
Amitava Das, Abhilekh Borah, Vinija Jain, Aman Chadha, 4 Aug 2025, AlignGuard-LoRA: Alignment-Preserving Fine-Tuning via Fisher-Guided Decomposition and Riemannian-Geodesic Collision Regularization, https://arxiv.org/abs/2508.02079
Yu Lei, Jinbin Bai, Qingyu Shi, Aosong Feng and Kaidong Yu, 2 Aug 2025, Personalized Safety Alignment for Text-to-Image Diffusion Models, https://arxiv.org/abs/2508.01151
Tae Soo Kim, Yoonjoo Lee, Yoonah Park, Jiho Kim, Young-Ho Kim, Juho Kim, 3 Aug 2025, CUPID: Evaluating Personalized and Contextualized Alignment of LLMs from Interactions, https://arxiv.org/abs/2508.01674
Tom S. Juzek, Zina B. Ward, 3 Aug 2025, Word Overuse and Alignment in Large Language Models: The Influence of Learning from Human Feedback, https://arxiv.org/abs/2508.01930
Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang, Yaochu Jin, 4 Aug 2025, ParetoHqD: Fast Offline Multiobjective Alignment of Large Language Models using Pareto High-quality Data, https://arxiv.org/abs/2504.16628
Ivan Zakazov, Mikolaj Boronski, Lorenzo Drudi, Robert West, 4 Aug 2025, Assessing Social Alignment: Do Personality-Prompted Large Language Models Behave Like Humans?, https://arxiv.org/abs/2412.16772
Taibiao Zhao, Xiaobing Chen, and Mingxuan Sun, 1 Aug 2025, Enhancing Time Series Forecasting via Multi-Level Text Alignment with LLMs, https://arxiv.org/abs/2504.07360
Bolian Li, Yifan Wang, Anamika Lochab, Ananth Grama, Ruqi Zhang, 3 Aug 2025, Cascade Reward Sampling for Efficient Decoding-Time Alignment, https://arxiv.org/abs/2406.16306
Amir Aghdam, Vincent Tao Hu, Bj\"orn Ommer, 4 Aug 2025, ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment, https://arxiv.org/abs/2506.22967
Dahun Kim, Anelia Angelova, 3 Aug 2025, Context-Adaptive Multi-Prompt LLM Embedding for Vision-Language Alignment, https://arxiv.org/abs/2508.02762
Hongjun Liu, Chao Yao, Yalan Zhang, Xiaokun wang and Xiaojuan Ban, 5 Aug 2025, Spatial Imputation Drives Cross-Domain Alignment for EEG Classification, https://arxiv.org/abs/2508.03437
Anamika Lochab, Ruqi Zhang, 5 Aug 2025, Energy-Based Reward Models for Robust Language Model Alignment, https://arxiv.org/abs/2504.13134
Wentao Wu, Linqing Chen, Hanmeng Zhong, Weilei Wang, 6 Aug 2025, Large Language Model's Multi-Capability Alignment in Biomedical Domain, https://arxiv.org/abs/2508.04278
Abdul Monaf Chowdhury, Rabeya Akter, Safaeid Hossain Arib, 6 Aug 2025, T3Time: Tri-Modal Time Series Forecasting via Adaptive Multi-Head Alignment and Residual Fusion, https://arxiv.org/abs/2508.04251
Hongxu Chen, Zhen Wang, Taoran Mei, Lin Li, Bowei Zhu, Runshi Li, Long Chen, 6 Aug 2025, Zero-Residual Concept Erasure via Progressive Alignment in Text-to-Image Model, https://arxiv.org/abs/2508.04472
Feifan Song, Bofei Gao, Yifan Song, Yi Liu, Weimin Xiong, Yuyang Song, Tianyu Liu, Guoyin Wang, Houfeng Wang, 6 Aug 2025, P-Aligner: Enabling Pre-Alignment of Language Models via Principled Instruction Synthesis, https://arxiv.org/abs/2508.04626
Wenji Fang, Jing Wang, Yao Lu, Shang Liu, Zhiyao Xie, 6 Aug 2025, GenEDA: Towards Generative Netlist Functional Reasoning via Cross-Modal Circuit Encoder-Decoder Alignment, https://arxiv.org/abs/2504.09485
You Rim Choi, Subeom Park, Seojun Heo, Eunchung Noh, Hyung-Sin Kim, 6 Aug 2025, Let the Void Be Void: Robust Open-Set Semi-Supervised Learning via Selective Non-Alignment, https://arxiv.org/abs/2504.12569
Krzysztof Janowicz and Zilong Liu and Gengchen Mai and Zhangyu Wang and Ivan Majic and Alexandra Fortacz and Grant McKenzie and Song Gao, 7 Aug 2025, Whose Truth? Pluralistic Geo-Alignment for (Agentic) AI, https://arxiv.org/abs/2508.05432
Shruti Saxena, Arijit Khan and Joydeep Chandra, 5 Aug 2025, NAEx: A Plug-and-Play Framework for Explaining Network Alignment, https://arxiv.org/abs/2508.04731
Mason Nakamura, Saaduddin Mahmud, Kyle H. Wray, Hamed Zamani, Shlomo Zilberstein, 7 Aug 2025, Aligning LLMs on a Budget: Inference-Time Alignment with Heuristic Reward Models, https://arxiv.org/abs/2508.05165
Zhongheng Yang, Aijia Sun, Yushang Zhao, Yinuo Yang, Dannier Li, Chengrui Zhou, 7 Aug 2025, RLHF Fine-Tuning of LLMs for Alignment with Implicit User Feedback in Conversational Recommenders, https://arxiv.org/abs/2508.05289
Qinghua Yao, Xiangrui Xu, Zhize Li, 7 Aug 2025, X-VFL: A New Vertical Federated Learning Framework with Cross Completion and Decision Subspace Alignment, https://arxiv.org/abs/2508.05568
Sam Kouteili, Hiren Madhu, George Typaldos, Mark Santolucito, 7 Aug 2025, Embedding Alignment in Code Generation for Audio, https://arxiv.org/abs/2508.05473
Yubin Zhang, Yanhua Huang, Haiming Xu, Mingliang Qi, Chang Wang, Jiarui Jin, Xiangyuan Ren, Xiaodan Wang, Ruiwen Xu, 7 Aug 2025, A Metric for MLLM Alignment in Large-scale Recommendation, https://arxiv.org/abs/2508.04963
Zhiqing Xiao, Haobo Wang, Xu Lu, Wentao Ye, Gang Chen, Junbo Zhao, 7 Aug 2025, SPA++: Generalized Graph Spectral Alignment for Versatile Domain Adaptation, https://arxiv.org/abs/2508.05182
Wei Zeng, Hengshu Zhu, Chuan Qin, Han Wu, Yihang Cheng, Sirui Zhang, Xiaowei Jin, Yinuo Shen, Zhenxing Wang, Feimin Zhong, Hui Xiong, 7 Aug 2025, Multi-level Value Alignment in Agentic AI Systems: Survey and Perspectives, https://arxiv.org/abs/2506.09656
Yifei Xu, Tusher Chakraborty, Emre K{\i}c{\i}man, Bibek Aryal, Eduardo Rodrigues, Srinagesh Sharma, Roberto Estevao, Maria Angels de Luis Balaguer, Jessica Wolk, Rafael Padilha, Leonardo Nunes, Shobana Balakrishnan, Songwu Lu, Ranveer Chandra, 6 Aug 2025, RLTHF: Targeted Human Feedback for LLM Alignment, https://arxiv.org/abs/2502.13417
Shengzhu Yang, Jiawei Du, Shuai Lu, Weihang Zhang, Ningli Wang, Huiqi Li, 8 Aug 2025, CLIPin: A Non-contrastive Plug-in to CLIP for Multimodal Semantic Alignment, https://arxiv.org/abs/2508.06434
Keiyu Nosaka, Yuichi Takano, Akiko Yoshise, 8 Aug 2025, Data Collaboration Analysis with Orthonormal Basis Selection and Alignment, https://arxiv.org/abs/2403.02780
Parker Whitfill, Stewy Slocum, 11 Aug 2025, Beyond Ordinal Preferences: Why Alignment Needs Cardinal Human Feedback, https://arxiv.org/abs/2508.08486
Sviatoslav Lushnei, Dmytro Shumskyi, Severyn Shykula, Ernesto Jimenez-Ruiz, Artur d'Avila Garcez, 11 Aug 2025, Large Language Models as Oracles for Ontology Alignment, https://arxiv.org/abs/2508.08500
Saketh Reddy Vemula, Dipti Mishra Sharma and Parameswari Krishnamurthy, 11 Aug 2025, Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment, https://arxiv.org/abs/2508.08424
Jadie Adams, Brian Hu, Emily Veenhuis, David Joy, Bharadwaj Ravichandran, Aaron Bray, Anthony Hoogs, Arslan Basharat, 11 Aug 2025, Steerable Pluralism: Pluralistic Alignment via Few-Shot Comparative Regression, https://arxiv.org/abs/2508.08509
Birong Pan, Yongqi Li, Weiyu Zhang, Wenpeng Lu, Mayi Xu, Shen Zhou, Yuanyuan Zhu, Ming Zhong, Tieyun Qian, 12 Aug 2025, A Survey on Training-free Alignment of Large Language Models, https://arxiv.org/abs/2508.09016
Sejin Kim, Sundong Kim, 12 Aug 2025, System~2 Reasoning for Human--AI Alignment: Generality and Adaptivity via ARC-AGI, https://arxiv.org/abs/2410.07866
Liang Chen, Xueting Han, Li Shen, Jing Bai, Kam-Fai Wong, 12 Aug 2025, Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning, https://arxiv.org/abs/2506.03850
Yuxin Chen and Chen Tang and Jianglan Wei and Chenran Li and Ran Tian and Xiang Zhang and Wei Zhan and Peter Stone and Masayoshi Tomizuka, 12 Aug 2025, MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention, https://arxiv.org/abs/2406.16258
Yang Fan, 12 Aug 2025, AdEval: Alignment-based Dynamic Evaluation to Mitigate Data Contamination in Large Language Models, https://arxiv.org/abs/2501.13983
Shravan Nayak, Mehar Bhatia, Xiaofeng Zhang, Verena Rieser, Lisa Anne Hendricks, Sjoerd van Steenkiste, Yash Goyal, Karolina Sta\'nczak, Aishwarya Agrawal, 12 Aug 2025, CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics, https://arxiv.org/abs/2506.08835
Yang Zhang, Cunxiang Wang, Lindong Wu, Wenbo Yu, Yidong Wang, Guangsheng Bao, Jie Tang, 13 Aug 2025, UDA: Unsupervised Debiasing Alignment for Pair-wise LLM-as-a-Judge, https://arxiv.org/abs/2508.09724
Mansi, Anastasios Lepipas, Dominika Woszczyk, Yiying Guan, Soteris Demetriou, 12 Aug 2025, Understanding Dementia Speech Alignment with Diffusion-Based Image Generation, https://arxiv.org/abs/2508.09385
Birong Pan, Mayi Xu, Qiankun Pi, Jianhao Chen, Yuanyuan Zhu, Ming Zhong, Tieyun Qian, 13 Aug 2025, NeuronTune: Fine-Grained Neuron Modulation for Balanced Safety-Utility Alignment in LLMs, https://arxiv.org/abs/2508.09473
Peiran Peng, Tingfa Xu, Liqiang Song, Mengqi Zhu, Yuqiang Fang, Jianan Li, 13 Aug 2025, COXNet: Cross-Layer Fusion with Adaptive Alignment and Scale Integration for RGBT Tiny Object Detection, https://arxiv.org/abs/2508.09533
Muneeza Azmat, Momin Abbas, Maysa Malfiza Garcia de Macedo, Marcelo Carpinette Grave, Luan Soares de Souza, Tiago Machado, Rogerio A de Paula, Raya Horesh, Yixin Chen, Heloisa Caroline de Souza Pereira Candello, Rebecka Nordenlow, Aminat Adebiyi, 13 Aug 2025, A Comprehensive Evaluation framework of Alignment Techniques for LLMs, https://arxiv.org/abs/2508.09937
Zichao Hu, Junyi Jessy Li, Arjun Guha, Joydeep Biswas, 12 Aug 2025, Robo-Instruct: Simulator-Augmented Instruction Alignment For Finetuning Code LLMs, https://arxiv.org/abs/2405.20179
Numair Nadeem, Saeed Anwar, Muhammad Hamza Asad, Abdul Bais, 13 Aug 2025, HVL: Semi-Supervised Segmentation leveraging Hierarchical Vision-Language Synergy with Dynamic Text-Spatial Query Alignment, https://arxiv.org/abs/2506.13925
Durgesh Mishra, Rishabh Uikey, 15 Aug 2025, Unified Knowledge Distillation Framework: Fine-Grained Alignment and Geometric Relationship Preservation for Deep Face Recognition, https://arxiv.org/abs/2508.11376
Alessio Galatolo, Luca Alberto Rappuoli, Katie Winkle, Meriem Beloucif, 18 Aug 2025, Beyond Ethical Alignment: Evaluating LLMs as Artificial Moral Assistants, https://arxiv.org/abs/2508.12754
Manning Zhu, Songtao Guo, Pengzhan Zhou, Yansong Ning, Chang Han, Dewen Qiao, 18 Aug 2025, FedSODA: Federated Fine-tuning of LLMs via Similarity Group Pruning and Orchestrated Distillation Alignment, https://arxiv.org/abs/2508.12727
Zhixin Xie, Xurui Song, Jun Luo, 17 Aug 2025, Where to Start Alignment? Diffusion Large Language Model May Demand a Distinct Position, https://arxiv.org/abs/2508.12398
Xuhui Zhan and Tyler Derr, 17 Aug 2025, Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping, https://arxiv.org/abs/2508.12466
Yizhuo Zhang, Heng Wang, Shangbin Feng, Zhaoxuan Tan, Xinyun Liu, Yulia Tsvetkov, 17 Aug 2025, Generalizable LLM Learning of Graph Synthetic Data with Post-training Alignment, https://arxiv.org/abs/2506.00845
Mohammad Jalali, Bahar Dibaei Nia, Farzan Farnia, 16 Aug 2025, Towards an Explainable Comparison and Alignment of Feature Embeddings, https://arxiv.org/abs/2506.06231
Guangfu Hao, Haojie Wen, Liangxuan Guo, Yang Chen, Yanchao Bi, Shan Yu, 18 Aug 2025, Flexible Tool Selection through Low-dimensional Attribute Alignment of Vision and Language, https://arxiv.org/abs/2505.22146
Yang Zhang, Yu Yu, Bo Tang, Yu Zhu, Chuxiong Sun, Wenqiang Wei, Jie Hu, Zipeng Xie, Zhiyu Li, Feiyu Xiong, Edward Chung, 16 Aug 2025, Token-level Accept or Reject: A Micro Alignment Approach for Large Language Models, https://arxiv.org/abs/2505.19743
Jeremy Carleton, Debajoy Mukherjee, Srinivas Shakkottai, Dileep Kalathil, 19 Aug 2025, MAVIS: Multi-Objective Alignment via Value-Guided Inference-Time Search, https://arxiv.org/abs/2508.13415
Zeeshan Ahmed, Frank Seide, Niko Moritz, Ju Lin, Ruiming Xie, Simone Merello, Zhe Liu and Christian Fuegen, 18 Aug 2025, Overcoming Latency Bottlenecks in On-Device Speech Translation: A Cascaded Approach with Alignment-Based Streaming MT, https://arxiv.org/abs/2508.13358
Jinhui Pang, Changqing Lin, Hao Lin, Zhihui Zhang, Long Chen, Weiping Ding, Yu Liu, Xiaoshuai Hao, 19 Aug 2025, MEGA: Second-Order Gradient Alignment for Catastrophic Forgetting Mitigation in GFSCIL, https://arxiv.org/abs/2504.13691
Samir Abdaljalil, Erchin Serpedin, Khalid Qaraqe, Hasan Kurban, 20 Aug 2025, Evaluating Multilingual and Code-Switched Alignment in LLMs via Synthetic Natural Language Inference, https://arxiv.org/abs/2508.14735
Abhigya Verma, Sriram Puttagunta, Seganrasan Subramanian, Sravan Ramachandran, 21 Aug 2025, GRAFT: GRaPH and Table Reasoning for Textual Alignment -- A Benchmark for Structured Instruction Following and Visual Reasoning, https://arxiv.org/abs/2508.15690
Chenlin Liu, Minghui Fang, Patrick Zhang, Wei Zhou, Jie Gao, Jiqing Han, 21 Aug 2025, Mitigating Hallucinations in LM-Based TTS Models via Distribution Alignment Using GFlowNets, https://arxiv.org/abs/2508.15442
Youjia Zhang, Youngeun Kim, Young-Geun Choi, Hongyeob Kim, Huiling Liu, Sungeun Hong, 21 Aug 2025, Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment, https://arxiv.org/abs/2508.15568
J. Koorndijk, 21 Aug 2025, Empirical Evidence for Alignment Faking in a Small LLM and Prompt-Based Mitigation Techniques, https://arxiv.org/abs/2506.21584
Qilong Xing, Zikai Song, Youjia Zhang, Na Feng, Junqing Yu, Wei Yang, 21 Aug 2025, MCA-RG: Enhancing LLMs with Medical Concept Alignment for Radiology Report Generation, https://arxiv.org/abs/2507.06992
Shuyuan Tu, Zhen Xing, Xintong Han, Zhi-Qi Cheng, Qi Dai, Chong Luo, Zuxuan Wu, Yu-Gang Jiang, 20 Jul 2025, StableAnimator++: Overcoming Pose Misalignment and Face Distortion for Human Image Animation, https://arxiv.org/abs/2507.15064
Vince Trencsenyi and Agnieszka Mensfelt and Kostas Stathis, 25 Jul 2025, Hypergames: Modeling Misaligned Perceptions and Nested Beliefs for Multi-agent Systems, https://arxiv.org/abs/2507.19593
Bryce Anderson, Riley Galpin, Tom S. Juzek, 1 Aug 2025, Model Misalignment and Language Change: Traces of AI-Associated Language in Unscripted Spoken English, https://arxiv.org/abs/2508.00238
Fan Bu, Zheng Wang, Siyi Wang and Ziyao Liu, 1 Aug 2025, An Investigation into Value Misalignment in LLM-Generated Texts for Cultural Heritage, https://arxiv.org/abs/2501.02039
Siddhant Panpatil, Hiskias Dingeto, Haon Park, 6 Aug 2025, Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models, https://arxiv.org/abs/2508.04196
David Kacz\'er, Magnus J{\o}rgenv{\aa}g, Clemens Vetter, Lucie Flek, Florian Mai, 8 Aug 2025, In-Training Defenses against Emergent Misalignment in Language Models, https://arxiv.org/abs/2508.06249
Yichao Cai, Yuhang Liu, Erdun Gao, Tianjiao Jiang, Zhen Zhang, Anton van den Hengel, Javen Qinfeng Shi, 7 Aug 2025, On the Value of Cross-Modal Misalignment in Multimodal Representation Learning, https://arxiv.org/abs/2504.10143
Dongyoon Hahm, Taywon Min, Woogyeol Jin, Kimin Lee, 19 Aug 2025, Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation, https://arxiv.org/abs/2508.14031
Igor Halperin, 13 Aug 2025, Prompt-Response Semantic Divergence Metrics for Faithfulness Hallucination and Misalignment Detection in Large Language Models, https://arxiv.org/abs/2508.10192
Zhi Wen Soi, Chenrui Fan, Aditya Shankar, Abele M\u{a}lan, Lydia Y. Chen, 14 Aug 2025, Federated Time Series Generation on Feature and Temporally Misaligned Data, https://arxiv.org/abs/2410.21072
Yue Pei, Hongming Zhang, Chao Gao, Martin M\"uller, Mengxiao Zhu, Hao Sheng, Haogang Zhu, Liang Lin, 22 Aug 2025, Double Check My Desired Return: Transformer with Target Alignment for Offline Reinforcement Learning, https://arxiv.org/abs/2508.16420
Junhao Yin, Haolin Wang, Peng Bao, Ju Xu, Yongliang Wang, 15 Aug 2025, From Clicks to Preference: A Multi-stage Alignment Framework for Generative Query Suggestion in Conversational System, https://arxiv.org/abs/2508.15811
Pi-Wei Chen, Jerry Chun-Wei Lin, Wei-Han Chen, Jia Ji, Zih-Ching Chen, Feng-Hao Yeh, Chao-Chun Chen, 22 Aug 2025, Beyond Human-prompting: Adaptive Prompt Tuning with Semantic Alignment for Anomaly Detection, https://arxiv.org/abs/2508.16157
Xiaoxiong Zhang, Xin Zhou, Zhiwei Zeng, Yongjie Wang, Dusit Niyato, Zhiqi Shen, 22 Aug 2025, EGRA:Toward Enhanced Behavior Graphs and Representation Alignment for Multimodal Recommendation, https://arxiv.org/abs/2508.16170
Zirui Li and Stephan Husung and Haoze Wang, 22 Aug 2025, LLM-Assisted Semantic Alignment and Integration in Collaborative Model-Based Systems Engineering Using SysML v2, https://arxiv.org/abs/2508.16181
Buhua Liu, Shitong Shao, Bao Li, Lichen Bai, Zhiqiang Xu, Haoyi Xiong, James Kwok, Sumi Helal, Zeke Xie, 7 Aug 2025, Alignment of Diffusion Models: Fundamentals, Challenges, and Future, https://arxiv.org/abs/2409.07253
Somnath Banerjee, Sayan Layek, Pratyush Chatterjee, Animesh Mukherjee, Rima Hazra, 22 Aug 2025, Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment, https://arxiv.org/abs/2502.11244
Zeguan Xiao, Yun Chen, Guanhua Chen, Ke Tang, 22 Aug 2025, Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms, https://arxiv.org/abs/2506.09457
Mia Taylor and James Chua and Jan Betley and Johannes Treutlein and Owain Evans, 24 Aug 2025, School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs, https://arxiv.org/abs/2508.17511
Junan Zhang, Xueyao Zhang, Jing Yang, Yuancheng Wang, Fan Fan, Zhizheng Wu, 24 Aug 2025, Multi-Metric Preference Alignment for Generative Speech Restoration, https://arxiv.org/abs/2508.17229
Yang Li, Songlin Yang, Xiaoxuan Han, Wei Wang, Jing Dong, Yueming Lyu, Ziyu Xue, 25 Aug 2025, Instant Preference Alignment for Text-to-Image Diffusion Models, https://arxiv.org/abs/2508.17718
Bin Tan, Wangyao Ge, Yidi Wang, Xin Liu, Jeff Burtoft, Hao Fan, Hui Wang, 25 Aug 2025, PCR-CA: Parallel Codebook Representations with Contrastive Alignment for Multiple-Category App Recommendation, https://arxiv.org/abs/2508.18166
Yaoyao Qian, Jindan Huang, Yuanli Wang, Simon Yu, Kyrie Zhixuan Zhou, Jiayuan Mao, Mingfu Liang, Hanhan Zhou, 23 Aug 2025, WHEN TO ACT, WHEN TO WAIT: Modeling the Intent-Action Alignment Problem in Dialogue, https://arxiv.org/abs/2506.01881
Paul Darm, Annalisa Riccardi, 25 Aug 2025, Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models, https://arxiv.org/abs/2502.05945
Stephanie Palazzolo, Sep 2025, OpenAI’s Models Are Getting Too Smart For Their Human Teachers, https://www.theinformation.com/articles/openais-models-getting-smart-human-teachers (Using human labeling to train AI models is becoming more difficult, as the models begin to surpass humans.)
Cyrus Cousins, Vijay Keswani, Vincent Conitzer, Hoda Heidari, Jana Schaich Borg, Walter Sinnott-Armstrong, 4 Sep 2025, Towards Cognitively-Faithful Decision-Making Models to Improve AI Alignment, https://arxiv.org/abs/2509.04445
Yuqing Huang, Rongyang Zhang, Qimeng Wang, Chengqiang Lu, Yan Gao, Yi Wu, Yao Hu, Xuyang Zhi, Guiquan Liu, Xin Li, Hao Wang, Enhong Chen, 4 Sep 2025, SelfAug: Mitigating Catastrophic Forgetting in Retrieval-Augmented Generation via Distribution Self-Alignment, https://arxiv.org/abs/2509.03934
Ranjie Duan, Jiexi Liu, Xiaojun Jia, Shiji Zhao, Ruoxi Cheng, Fengxiang Wang, Cheng Wei, Yong Xie, Chang Liu, Defeng Li, Yinpeng Dong, Yichi Zhang, Yuefeng Chen, Chongwen Wang, Xingjun Ma, Xingxing Wei, Yang Liu, Hang Su, Jun Zhu, Xinfeng Li, Yitong Sun, Jie Zhang, Jinzhao Hu, Sha Xu, Yitong Yang, Jialing Tao, Hui Xue, 4 Sep 2025, Oyster-I: Beyond Refusal -- Constructive Safety Alignment for Responsible Language Models, https://arxiv.org/abs/2509.01909
Jonathn Chang, Leonhard Piff, Suvadip Sana, Jasmine X. Li, Lionel Levine, 3 Sep 2025, EigenBench: A Comparative Behavioral Measure of Value Alignment, https://arxiv.org/abs/2509.01938
Jusheng Zhang, Yijia Fan, Kaitong Cai, Xiaofei Sun, Keze Wang, 5 Sep 2025, OSC: Cognitive Orchestration through Dynamic Knowledge Alignment in Multi-Agent LLM Collaboration, https://arxiv.org/abs/2509.04876
Gongyue Zhang and Honghai Liu, 5 Sep 2025, Natural Spectral Fusion: p-Exponent Cyclic Scheduling and Early Decision-Boundary Alignment in First-Order Optimization, https://arxiv.org/abs/2509.04713
Wei Chen, Shigui Li, Jiacheng Li, Jian Xu, Zhiqi Lin, Junmei Yang, Delu Zeng, John Paisley, Qibin Zhao, 5 Sep 2025, Any-Step Density Ratio Estimation via Interval-Annealed Secant Alignment, https://arxiv.org/abs/2509.04852
Tiansheng Huang, Gautam Bhattacharya, Pratik Joshi, Josh Kimball, Ling Liu, 5 Sep 2025, Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning, https://arxiv.org/abs/2408.09600
Furong Jia, Lanxin Liu, Ce Hou, Fan Zhang, Xinyan Liu, Yu Liu, 5 Sep 2025, Towards Interpretable Geo-localization: a Concept-Aware Global Image-GPS Alignment Framework, https://arxiv.org/abs/2509.01910
Subrata Biswas, Mohammad Nur Hossain Khan, Bashima Islam, 5 Sep 2025, RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language, https://arxiv.org/abs/2505.17114
Shervin Khalafi, Ignacio Hounie, Dongsheng Ding, Alejandro Ribeiro, 26 Aug 2025, Composition and Alignment of Diffusion Models using Constrained Learning, https://arxiv.org/abs/2508.19104
Nanxi Li, Zhengyue Zhao, Chaowei Xiao, 26 Aug 2025, PRISM: Robust VLM Alignment with Principled Reasoning for Integrated Safety in Multimodality, https://arxiv.org/abs/2508.18649
Trisanth Srinivasan, Santosh Patapati, 27 Aug 2025, Democracy-in-Silico: Institutional Design as Alignment in AI-Governed Polities, https://arxiv.org/abs/2508.19562
Julian Arnold, Niels L\"orch, 27 Aug 2025, Decomposing Behavioral Phase Transitions in LLMs: Order Parameters for Emergent Misalignment, https://arxiv.org/abs/2508.20015
Mingxi Fu, Fanglei Fu, Xitong Ling, Huaitian Yuan, Tian Guan, Yonghong He, Lianghui Zhu, 27 Aug 2025, Multimodal Prototype Alignment for Semi-supervised Pathology Image Segmentation, https://arxiv.org/abs/2508.19574
Chao Huang, Zefeng Zhang, Juewei Yue, Quangang Li, Chuang Zhang, Tingwen Liu, 27 Aug 2025, Safety Alignment Should Be Made More Than Just A Few Attention Heads, https://arxiv.org/abs/2508.19697
Md Abdullah Al Mamun, Ihsen Alouani, Nael Abu-Ghazaleh, 28 Aug 2025, Poison Once, Refuse Forever: Weaponizing Alignment for Injecting Bias in LLMs, https://arxiv.org/abs/2508.20333
Zhibang Yang, Xinke Jiang, Rihong Qiu, Ruiqing Li, Yihang Zhang, Yue Fang, Yongxin Xu, Hongxin Ding, Xu Chu, Junfeng Zhao, Yasha Wang, 28 Aug 2025, DFAMS: Dynamic-flow guided Federated Alignment based Multi-prototype Search, https://arxiv.org/abs/2508.20353
Guillaume Guy, Mihajlo Grbovic, Chun How Tan, Han Zhao, 28 Aug 2025, BiListing: Modality Alignment for Listings, https://arxiv.org/abs/2508.20396
Luozhijie Jin, Zijie Qiu, Jie Liu, Zijie Diao, Lifeng Qiao, Ning Ding, Alex Lamb, Xipeng Qiu, 28 Aug 2025, Inference-Time Alignment Control for Diffusion Models with Reinforcement Learning Guidance, https://arxiv.org/abs/2508.21016
Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, George Turkiyyah, Bernard Ghanem, 28 Aug 2025, Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection, https://arxiv.org/abs/2508.20766
Haoze Wu, Cheng Wang, Wenshuo Zhao, Junxian He, 28 Aug 2025, Model-Task Alignment Drives Distinct RL Outcomes, https://arxiv.org/abs/2508.21188
Ephraiem Sarabamoun, 27 Aug 2025, Ensemble Debates with Local Large Language Models for AI Alignment, https://arxiv.org/abs/2509.00091
Shiqiao Zhou, Holger Sch\"oner, Huanbo Lyu, Edouard Fouch\'e, Shuo Wang, 30 Aug 2025, BALM-TSF: Balanced Multimodal Alignment for LLM-Based Time Series Forecasting, https://arxiv.org/abs/2509.00622
Jinzhou Tang, Jusheng zhang, Sidi Liu, Waikit Xiu, Qinhan Lv, Xiying Li, 29 Aug 2025, Beyond Pixels: Introducing Geometric-Semantic World Priors for Video-based Embodied Models via Spatio-temporal Alignment, https://arxiv.org/abs/2509.00210
Sanjeeevan Selvaganapathy and Mehwish Nasim, 31 Aug 2025, Confident, Calibrated, or Complicit: Probing the Trade-offs between Safety Alignment and Ideological Bias in Language Models in Detecting Hate Speech, https://arxiv.org/abs/2509.00673
Yu Liu, Yanan Cao, Xixun Lin, Yanmin Shang, Shi Wang, Shirui Pan, 1 Sep 2025, Enhancing Large Language Model for Knowledge Graph Completion via Structure-Aware Alignment-Tuning, https://arxiv.org/abs/2509.01166
Hongyu Li, Chaofeng Chen, Xiaoming Li, Guangming Lu, 2 Sep 2025, 2D Gaussian Splatting with Semantic Alignment for Image Inpainting, https://arxiv.org/abs/2509.01964
Antoun Yaacoub, J\'er\^ome Da-Rugna, Zainab Assaghir, 30 Aug 2025, Assessing AI-Generated Questions' Alignment with Cognitive Frameworks in Educational Assessment, https://arxiv.org/abs/2504.14232
Jonathan Rystr{\o}m, Hannah Rose Kirk and Scott Hale, 30 Aug 2025, Multilingual != Multicultural: Evaluating Gaps Between Multilingual Capabilities and Cultural Alignment in LLMs, https://arxiv.org/abs/2502.16534
Dayeon Ki, Rachel Rudinger, Tianyi Zhou, Marine Carpuat, 1 Sep 2025, Multiple LLM Agents Debate for Equitable Cultural Alignment, https://arxiv.org/abs/2505.24671
Ertu\u{g}rul Ke\c{c}eci, M\"ujde G\"uzelkaya, Tufan Kumbasar, 3 Sep 2025, A State Alignment-Centric Approach to Federated System Identification: The FedAlign Framework, https://arxiv.org/abs/2503.12137
Jiahao Qiu, Yifu Lu, Yifan Zeng, Jiacheng Guo, Jiayi Geng, Chenhao Zhu, Xinzhe Juan, Ling Yang, Huazheng Wang, Kaixuan Huang, Yue Wu, Mengdi Wang, 3 Sep 2025, TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling, https://arxiv.org/abs/2410.16033
Madhava Gaikwad, 4 Sep 2025, Murphys Laws of AI Alignment: Why the Gap Always Wins, https://arxiv.org/abs/2509.05381
Chengwei Wu, Li Du, Hanyu Zhao, Yiming Ju, Jiapu Wang, Tengfei Pan, 8 Sep 2025, Accelerate Scaling of LLM Alignment via Quantifying the Coverage and Depth of Instruction Set, https://arxiv.org/abs/2509.06463
Abhijnan Nath, Carine Graff and Nikhil Krishnaswamy, 7 Sep 2025, Let's Roleplay: Examining LLM Alignment in Collaborative Dialogues, https://arxiv.org/abs/2509.05882
Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai, 6 Sep 2025, New Insights into Optimal Alignment of Acoustic and Linguistic Representations for Knowledge Transfer in ASR, https://arxiv.org/abs/2509.05609
Shuai Yuan, Zhibo Zhang, Yuxi Li, Guangdong Bai, Wang Kailong, 8 Sep 2025, Embedding Poisoning: Bypassing Safety Alignment via Embedding Semantic Shift, https://arxiv.org/abs/2509.06338
Sascha Kaltenpoth, Oliver M\"uller, 9 Sep 2025, Getting In Contract with Large Language Models -- An Agency Theory Perspective On Large Language Model Alignment, https://arxiv.org/abs/2509.07642
Xiaomeng Hu, Fei Huang, Chenhan Yuan, Junyang Lin, Tsung-Yi Ho, 1 Sep 2025, CARE: Decoding Time Safety Alignment via Rollback and Introspection Intervention, https://arxiv.org/abs/2509.06982
Neal G. Ravindra, Arijit Sehanobish, 22 Aug 2025, Cross-device Zero-shot Label Transfer via Alignment of Time Series Foundation Model Embeddings, https://arxiv.org/abs/2509.06966
Ji Xie and Trevor Darrell and Luke Zettlemoyer and XuDong Wang, 8 Sep 2025, Reconstruction Alignment Improves Unified Multimodal Models, https://arxiv.org/abs/2509.07295
Andrey Sakhovskiy, Elena Tutubalina, 9 Sep 2025, BALI: Enhancing Biomedical Language Representations through Knowledge Graph and Language Model Alignment, https://arxiv.org/abs/2509.07588
Crispin Cooper, Ana Friedrich, Tommaso Reggiani, Wouter Poortinga, 9 Sep 2025, Individual utilities of life satisfaction reveal inequality aversion unrelated to political alignment, https://arxiv.org/abs/2509.07793
Tianyi Wang, Jianan Fan, Dingxin Zhang, Dongnan Liu, Yong Xia, Heng Huang, Weidong Cai, 9 Sep 2025, MIRROR: Multi-Modal Pathological Self-Supervised Representation Learning via Modality Alignment and Retention, https://arxiv.org/abs/2503.00374
Hasibur Rahman, Smit Desai, 11 Sep 2025, Vibe Check: Understanding the Effects of LLM-Based Conversational Agents' Personality and Alignment on User Perceptions in Goal-Oriented Tasks, https://arxiv.org/abs/2509.09870
Yuexi Du, Lihui Chen, Nicha C. Dvornek, 12 Sep 2025, GLAM: Geometry-Guided Local Alignment for Multi-View VLP in Mammography, https://arxiv.org/abs/2509.10344
Maysam Behmanesh, Erkan Turan, and Maks Ovsjanikov, 11 Sep 2025, Graph Alignment via Dual-Pass Spectral Encoding and Latent Space Communication, https://arxiv.org/abs/2509.09597
Dohun Lee, Hyeonho Jeong, Jiwook Kim, Duygu Ceylan, Jong Chul Ye, 11 Sep 2025, Improving Video Diffusion Transformer Training by Multi-Feature Fusion and Alignment from Self-Supervised Vision Encoders, https://arxiv.org/abs/2509.09547
Oriane Peter and Kate Devlin, 9 Sep 2025, Decentralising LLM Alignment: A Case for Context, Pluralism, and Participation, https://arxiv.org/abs/2509.08858
Bronson Schoen, Evgenia Nitishinskaya, Mikita Balesni, Axel H{\o}jmark, Felix Hofst\"atter, J\'er\'emy Scheurer, Alexander Meinke, Jason Wolfe, Teun van der Weij, Alex Lloyd, Nicholas Goldowsky-Dill, Angela Fan, Andrei Matveiakin, Rusheb Shah, Marcus Williams, Amelia Glaese, Boaz Barak, Wojciech Zaremba, Marius Hobbhahn, 19 Sep 2025, Stress Testing Deliberative Alignment for Anti-Scheming Training, https://arxiv.org/abs/2509.15541
Wenjun Cao, 19 Sep 2025, The Alignment Bottleneck, https://arxiv.org/abs/2509.15932
Maithili Joshi, Palash Nandi, Tanmoy Chakraborty, 19 Sep 2025, SABER: Uncovering Vulnerabilities in Safety Alignment via Cross-Layer Residual Connection, https://arxiv.org/abs/2509.16060
Nomi Yu (1), Md Ferdous Alam (1), A. John Hart (1), and Faez Ahmed (1) ((1) Massachusetts Institute of Technology), 17 Sep 2025, GenCAD-3D: CAD Program Generation using Multimodal Latent Space Alignment and Synthetic Dataset Balancing, https://arxiv.org/abs/2509.15246
Ajsal Shereef Palattuparambil, Thommen George Karimpanal, Santu Rana, 19 Sep 2025, Dynamic Policy Fusion for User Alignment Without Re-Interaction, https://arxiv.org/abs/2409.20016
Sifan Wang, Ananyae Kumar Bhartari, Bowen Li, Paris Perdikaris, 19 Sep 2025, Gradient Alignment in Physics-informed Neural Networks: A Second-Order Optimization Perspective, https://arxiv.org/abs/2502.00604
Tianhao Zhang, Zhecheng Sheng, Zhexiao Lin, Chen Jiang, Dongyeop Kang, 19 Sep 2025, BBScoreV2: Learning Time-Evolution and Latent Alignment from Stochastic Representation, https://arxiv.org/abs/2405.17764
Rashid Mushkani, Hugo Berard, Shin Koseki, 18 Sep 2025, Negotiative Alignment: Embracing Disagreement to Achieve Fairer Outcomes -- Insights from Urban Studies, https://arxiv.org/abs/2503.12613
Jeremias Ferrao, Matthijs van der Lende, Ilija Lichkovski, Clement Neo, 16 Sep 2025, The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features, https://arxiv.org/abs/2509.12934
Denis Janiak, Julia Moska, Dawid Motyka, Karolina Seweryn, Pawe{\l} Walkowiak, Bartosz \.Zuk, Arkadiusz Janz, 16 Sep 2025, Rethinking the Evaluation of Alignment Methods: Insights into Diversity, Generalisation, and Safety, https://arxiv.org/abs/2509.12936
Jinjie Shen, Yaxiong Wang, Lechao Cheng, Nan Pu, Zhun Zhong, 16 Sep 2025, Beyond Artificial Misalignment: Detecting and Grounding Semantic-Coordinated Multimodal Manipulations, https://arxiv.org/abs/2509.12653
Qianqi Lu, Yuxiang Xie, Jing Zhang, Shiwei Zou, Yan Chen, Xidao Luan, 16 Sep 2025, TFANet: Three-Stage Image-Text Feature Alignment Network for Robust Referring Image Segmentation, https://arxiv.org/abs/2509.13070
Yubo Li, Weiyi Song, 16 Sep 2025, Co-Alignment: Rethinking Alignment as Bidirectional Human-AI Cognitive Adaptation, https://arxiv.org/abs/2509.12179
Mohsinul Kabir, Ajwad Abrar, Sophia Ananiadou, 16 Sep 2025, Break the Checkbox: Challenging Closed-Style Evaluations of Cultural Alignment in LLMs, https://arxiv.org/abs/2502.08045
Jie Wu, Haoling Li, Xin Zhang, Jianwen Luo, Yangyu Huang, Ruihang Chu, Yujiu Yang, Scarlett Li, 16 Sep 2025, Teaching Your Models to Understand Code via Focal Preference Alignment, https://arxiv.org/abs/2503.02783
Jing Xiao, Chang You, Zhiyu Chen, 14 Sep 2025, AlignKT: Explicitly Modeling Knowledge State for Knowledge Tracing with Ideal State Alignment, https://arxiv.org/abs/2509.11135
Yining Lu, Zilong Wang, Shiyang Li, Xin Liu, Changlong Yu, Qingyu Yin, Zhan Shi, Zixuan Zhang, Meng Jiang, 14 Sep 2025, Learning to Optimize Multi-Objective Alignment Through Dynamic Reward Weighting, https://arxiv.org/abs/2509.11452
Chentao Cao, Xiaojun Xu, Bo Han, Hang Li, 15 Sep 2025, Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check, https://arxiv.org/abs/2509.11629
Jiayou Zhong, Anudeex Shetty, Chao Jia, Xuanrui Lin, Usman Naseem, 12 Sep 2025, Pluralistic Alignment for Healthcare: A Role-Driven Framework, https://arxiv.org/abs/2509.10685
Hyeongju Kim, Juheon Lee, Jinhyeok Yang, Jacob Morton, 14 Sep 2025, Length-Aware Rotary Position Embedding for Text-Speech Alignment, https://arxiv.org/abs/2509.11084
Etienne Boursier, Nicolas Flammarion, 15 Sep 2025, Early alignment in two-layer networks training is a two-edged sword, https://arxiv.org/abs/2401.10791
Zedian Shao, Hongbin Liu, Jaden Mu, Neil Zhenqiang Gong, 15 Sep 2025, Enhancing Prompt Injection Attacks to LLMs via Poisoning Alignment, https://arxiv.org/abs/2410.14827
Ankur Samanta, Akshayaa Magesh, Youliang Yu, Runzhe Wu, Ayush Jain, Daniel Jiang, Boris Vidolov, Paul Sajda, Yonathan Efroni, Kaveh Hassani, 18 Sep 2025, Internalizing Self-Consistency in Language Models: Multi-Agent Consensus Alignment, https://arxiv.org/abs/2509.15172
Herlock (SeyedAbolfazl) Rahimi, Dionysis Kalogerias, 17 Sep 2025, FedAVOT: Exact Distribution Alignment in Federated Learning via Masked Optimal Transport, https://arxiv.org/abs/2509.14444
Natalie Collina, Surbhi Goel, Aaron Roth, Emily Ryu, Mirah Shi, 18 Sep 2025, Emergent Alignment via Competition, https://arxiv.org/abs/2509.15090
Andr\'es Corrada-Emmanuel, 10 Sep 2025, No-Knowledge Alarms for Misaligned LLMs-as-Judges, https://arxiv.org/abs/2509.08593
Aadit Sengupta, Pratinav Seth, Vinay Kumar Sankarapu, 10 Sep 2025, Interpretability as Alignment: Making Internal Understanding a Design Principle, https://arxiv.org/abs/2509.08592
Katalina Hernandez Delgado, 8 Sep 2025, The Law-Following AI Framework: Legal Foundations and Technical Constraints. Legal Analogues for AI Actorship and technical feasibility of Law Alignment, https://arxiv.org/abs/2509.08009
Hua Shen, Nicholas Clark, Tanushree Mitra, 9 Sep 2025, Mind the Value-Action Gap: Do LLMs Act in Alignment with Their Values?, https://arxiv.org/abs/2501.15463
Vincent Siu, Nicholas Crispino, David Park, Nathan W. Henry, Zhun Wang, Yang Liu, Dawn Song, Chenguang Wang, 16 Sep 2025, SteeringControl: Holistic Evaluation of Alignment Steering in LLMs, https://arxiv.org/abs/2509.13450
Zhanting Zhou and Jinshan Lai and Fengchun Zhang and Zeqin Wu and Fengli Zhang, 17 Sep 2025, FedSSG: Expectation-Gated and History-Aware Drift Alignment for Federated Learning, https://arxiv.org/abs/2509.13895
Yifan Hu, Jie Yang, Tian Zhou, Peiyuan Liu, Yujin Tang, Rong Jin, Liang Sun, 17 Sep 2025, Bridging Past and Future: Distribution-Aware Alignment for Time Series Forecasting, https://arxiv.org/abs/2509.14181
Jack McKinlay, Marina De Vos, Janina A. Hoffmann, Andreas Theodorou, 17 Sep 2025, Understanding the Process of Human-AI Value Alignment, https://arxiv.org/abs/2509.13854
Elena Camuffo, Francesco Barbato, Mete Ozay, Simone Milani, Umberto Michieli, 17 Sep 2025, MOCHA: Multi-modal Objects-aware Cross-arcHitecture Alignment, https://arxiv.org/abs/2509.14001
Puru Vaish, Felix Meister, Tobias Heimann, Christoph Brune, Jelmer M. Wolterink, 17 Sep 2025, Consistent View Alignment Improves Foundation Models for 3D Medical Image Segmentation, https://arxiv.org/abs/2509.13846
Yuu Jinnai, Ukyo Honda, 17 Sep 2025, Annotation-Efficient Language Model Alignment via Diverse and Representative Response Texts, https://arxiv.org/abs/2405.13541
Zhiqiang Yuan, Weitong Chen, Hanlin Wang, Kai Yu, Xin Peng, Yiling Lou, 17 Sep 2025, Semantic Alignment-Enhanced Code Translation via an LLM-Based Multi-Agent System, https://arxiv.org/abs/2409.19894

Trustworthy AI

Trustworthy AI is the practice of ensuring that LLM-based systems are safe and predictable. This involves ensuring not only the safety of the LLM's outputs, such as avoiding bias and toxicity, but also ensuring that the AI infrastructure is resilient and the overall system is reliable. The idea of "Trustworthy AI" has been championed by NVIDIA.

Articles and papers on trustworthy AI:

Leon Derczynski, Christopher Parisien, Nikki Pope, Michael Boone, Nov 2024, NVIDIA Approaches to AI Trust and Safety: Innovation and Tools, https://www.nvidia.com/en-us/on-demand/session/aisummitdc24-sdc1088/?playlistId=playList-c6a9450c-c790-462d-a058-0bacacd5d370
Mayank Vatsa, Anubhooti Jain, Richa Singh, 7 Dec 2023, Adventures of Trustworthy Vision-Language Models: A Survey, https://arxiv.org/abs/2312.04231
Nikki Pope, March 1, 2024, What Is Trustworthy AI? Trustworthy AI is an approach to AI development that prioritizes safety and transparency for the people who interact with it. https://blogs.nvidia.com/blog/what-is-trustworthy-ai/
NVIDIA, Dec 2024 (accessed), Trustworthy AI, https://www.nvidia.com/en-us/ai-data-science/trustworthy-ai/
Phoebe Lee and Kristina Joos, Jan 25, 2024, Advancing Production AI with NVIDIA AI Enterprise, https://developer.nvidia.com/blog/advancing-production-ai-with-nvidia-ai-enterprise/ ("... advances in NVIDIA AI software deliver up to 54% performance gains without a hardware upgrade...")
Yedi Zhang, Yufan Cai, Xinyue Zuo, Xiaokun Luan, Kailong Wang, Zhe Hou, Yifan Zhang, Zhiyuan Wei, Meng Sun, Jun Sun, Jing Sun, Jin Song Dong, 9 Dec 2024, The Fusion of Large Language Models and Formal Methods for Trustworthy AI Agents: A Roadmap, https://arxiv.org/abs/2412.06512
Athanasios Davvetas, Xenia Ziouvelou, Ypatia Dami, Alexis Kaponis, Konstantina Giouvanopoulou, Michael Papademas, 23 Jul 2025, TAI Scan Tool: A RAG-Based Tool With Minimalistic Input for Trustworthy AI Self-Assessment, https://arxiv.org/abs/2507.17514
Ilias Chatzistefanidis, Navid Nikaein, 23 Jul 2025, Symbiotic Agents: A Novel Paradigm for Trustworthy AGI-driven Networks, https://arxiv.org/abs/2507.17695
H M Mohaimanul Islam, Huynh Q. N. Vo, Aditya Rane, 22 Jul 2025, Towards Trustworthy AI: Secure Deepfake Detection using CNNs and Zero-Knowledge Proofs, https://arxiv.org/abs/2507.17010
Tushar Talukder Showrav, Soyabul Islam Lincoln, Md. Kamrul Hasan, 23 Jul 2025, EXGnet: a single-lead explainable-AI guided multiresolution network with train-only quantitative features for trustworthy ECG arrhythmia classification, https://arxiv.org/abs/2506.12404
Yaomin Jiang, Levin Brinkmann, Anne-Marie Nussberger, Ivan Soraperra, Jean-Fran\c{c}ois Bonnefon, Iyad Rahwan, 17 Jul 2025, Humans learn to prefer trustworthy AI over human partners, https://arxiv.org/abs/2507.13524
Nuria Rodr\'iguez-Barroso and Mario Garc\'ia-M\'arquez and M. Victoria Luz\'on and Francisco Herrera, 21 Jul 2025, Challenges of Trustworthy Federated Learning: What's Done, Current Trends and Remaining Work, https://arxiv.org/abs/2507.15796
Mustafa Cavus, Jan N. van Rijn, Przemys{\l}aw Biecek, 19 Jul 2025, Beyond the Single-Best Model: Rashomon Partial Dependence Profile for Trustworthy Explanations in AutoML, https://arxiv.org/abs/2507.14744
Amina Dzafic, Merve Kavut, Ulya Bayram, 19 Jul 2025, Rethinking Suicidal Ideation Detection: A Trustworthy Annotation Framework and Cross-Lingual Model Evaluation, https://arxiv.org/abs/2507.14693
Yi Zhang, Zhen Chen, Chih-Hong Cheng, Wenjie Ruan, Xiaowei Huang, Dezong Zhao, David Flynn, Siddartha Khastgir, Xingyu Zhao, 20 Jul 2025, Trustworthy Text-to-Image Diffusion Models: A Timely and Focused Survey, https://arxiv.org/abs/2409.18214
Anthony Bellotti and Xindi Zhao, 9 Aug 2025, Conformal Prediction and Trustworthy AI, https://arxiv.org/abs/2508.06885
Stephan Rabanser, 11 Aug 2025, Uncertainty-Driven Reliability: Selective Prediction and Trustworthy Deployment in Modern Machine Learning, https://arxiv.org/abs/2508.07556
Anindya Bijoy Das, Shahnewaz Karim Sakib and Shibbir Ahmed, 9 Aug 2025, Trustworthy Medical Imaging with Large Language Models: A Study of Hallucinations Across Modalities, https://arxiv.org/abs/2508.07031
Jesco Talies, Eric Breitbarth, David Melching, 28 Jul 2025, Towards trustworthy AI in materials mechanics through domain-guided attention, https://arxiv.org/abs/2507.20658
Marius Baden, Ahmed Abouelazm, Christian Hubschneider, Yin Wu, Daniel Slieter, and J. Marius Z\"ollner, 27 Jul 2025, TPK: Trustworthy Trajectory Prediction Integrating Prior Knowledge For Interpretability and Kinematic Feasibility, https://arxiv.org/abs/2505.06743
Rob Procter, Mark Rouncefield, 25 Jul 2025, Trustworthy AI: UK Air Traffic Control Revisited, https://arxiv.org/abs/2507.21169
Rui Jiao, Yue Zhang, Jinku Li, 25 Jul 2025, Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes, https://arxiv.org/abs/2507.22940
Xinwei Wu, Haojie Li, Hongyu Liu, Xinyu Ji, Ruohan Li, Yule Chen, Yigeng Zhang, 30 Jul 2025, Uncovering the Fragility of Trustworthy LLMs through Chinese Textual Ambiguity, https://arxiv.org/abs/2507.23121
Xiaojin Zhang, Wei Chen, 30 Jul 2025, Bridging Privacy and Robustness for Trustworthy Machine Learning, https://arxiv.org/abs/2403.16591
Sihang Zeng, Lucas Jing Liu, Jun Wen, Meliha Yetisgen, Ruth Etzioni, Gang Luo, 1 Aug 2025, TrajSurv: Learning Continuous Latent Trajectories from Electronic Health Records for Trustworthy Survival Prediction, https://arxiv.org/abs/2508.00657
Jiazhen Pan, Bailiang Jian, Paul Hager, Yundi Zhang, Che Liu, Friedrike Jungmann, Hongwei Bran Li, Chenyu You, Junde Wu, Jiayuan Zhu, Fenglin Liu, Yuyuan Liu, Niklas Bubeck, Christian Wachinger, Chen (Cherise) Chen, Zhenyu Gong, Cheng Ouyang, Georgios Kaissis, Benedikt Wiestler, Daniel Rueckert, 30 Jul 2025, Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models, https://arxiv.org/abs/2508.00923
James Carzon and Luca Masserano and Joshua D. Ingram and Alex Shen and Antonio Carlos Herling Ribeiro Junior and Tommaso Dorigo and Michele Doro and Joshua S. Speagle and Rafael Izbicki and Ann B. Lee, 4 Aug 2025, Trustworthy scientific inference for inverse problems with generative models, https://arxiv.org/abs/2508.02602
Vinicius Lima, Dzung T. Phan, Jayant Kalagnanam, Dhaval Patel, Nianjun Zhou, 5 Aug 2025, Toward a Trustworthy Optimization Modeling Agent via Verifiable Synthetic Data Generation, https://arxiv.org/abs/2508.03117
Claudiu Leoveanu-Condrei, 5 Aug 2025, A DbC Inspired Neurosymbolic Layer for Trustworthy Agent Design, https://arxiv.org/abs/2508.03665
Anqi Li, Wenwei Jin, Jintao Tong, Pengda Qin, Weijia Li, Guo Lu, 5 Aug 2025, Towards Trustworthy Multimodal Moderation via Policy-Aligned Reasoning and Hierarchical Labeling, https://arxiv.org/abs/2508.03296
Haoran Li and Lihao Mai and Muhao Guo and Jiaqi Wu and Yang Weng and Yannan Sun and Ce Jimmy Liu, 7 Aug 2025, From Imperfect Signals to Trustworthy Structure: Confidence-Aware Inference from Heterogeneous and Reliability-Varying Utility Data, https://arxiv.org/abs/2508.05791
Ahmad Farooq and Kamran Iqbal, 7 Aug 2025, Towards Transparent Ethical AI: A Roadmap for Trustworthy Robotic Systems, https://arxiv.org/abs/2508.05846
Kristian Miok, Blaz \v{S}krlj, Daniela Zaharie, and Marko Robnik \v{S}ikonja, 30 Jul 2025, TT-XAI: Trustworthy Clinical Text Explanations via Keyword Distillation and LLM Reasoning, https://arxiv.org/abs/2508.08273
Mithat Can Ozgun, Jiahuan Pei, Koen Hindriks, Lucia Donatelli, Qingzhi Liu, Xin Sun, Junxiao Wang, 15 Aug 2025, Trustworthy AI Psychotherapy: Multi-Agent LLM Workflow for Counseling and Explainable Mental Disorder Diagnosis, https://arxiv.org/abs/2508.11398
Benjamin Alt, Mareike Picklum, Sorin Arion, Franklin Kenghagho Kenfack and Michael Beetz, 15 Aug 2025, Open, Reproducible and Trustworthy Robot-Based Experiments with Virtual Labs and Digital-Twin-Based Execution Tracing, https://arxiv.org/abs/2508.11406
Zihan Guo, Yuanjian Zhou, Chenyi Wang, Linlin You, Minjie Bian, Weinan Zhang, 19 Aug 2025, BetaWeb: Towards a Blockchain-enabled Trustworthy Agentic Web, https://arxiv.org/abs/2508.13787
Mary Versa Clemens-Sewall, Christopher Cervantes, Emma Rafkin, J. Neil Otte, Tom Magelinski, Libby Lewis, Michelle Liu, Dana Udwin, Monique Kirkman-Bey, 20 Aug 2025, CaTE Data Curation for Trustworthy AI, https://arxiv.org/abs/2508.14741
Wenjie Lin, Jin Wei-Kocsis, 21 Aug 2025, LLM4Sweat: A Trustworthy Large Language Model for Hyperhidrosis Support, https://arxiv.org/abs/2508.15192
Yongwoo Song and Minbyul Jeong and Mujeen Sung, 26 Aug 2025, Trustworthy Agents for Electronic Health Records through Confidence Estimation, https://arxiv.org/abs/2508.19096
William Jurayj, Nils Holzenberger, Benjamin Van Durme, 28 Aug 2025, Enabling Equitable Access to Trustworthy Financial Reasoning, https://arxiv.org/abs/2508.21051
\v{S}imon Kucharsk\'y, Aayush Mishra, Daniel Habermann, Stefan T. Radev, Paul-Christian B\"urkner, 28 Aug 2025, Towards Trustworthy Amortized Bayesian Model Comparison, https://arxiv.org/abs/2508.20614
Daocheng Fu, Jianlong Chen, Renqiu Xia, Zijun Chen, Qi Liu, Yuan Feng, Hongbin Zhou, Renrui Zhang, Shiyang Feng, Peng Gao, Hongyuan Zha, Junchi Yan, Botian Shi, Yu Qiao, Bo Zhang, 29 Aug 2025, TrustGeoGen: Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving, https://arxiv.org/abs/2504.15780
Li Rong Wang, Thomas C. Henderson, Yew Soon Ong, Yih Yng Ng, Xiuyi Fan, 1 Sep 2025, Towards Trustworthy Vital Sign Forecasting: Leveraging Uncertainty for Prediction Intervals, https://arxiv.org/abs/2509.01319
Chaoyu Zhang and Heng Jin and Shanghao Shi and Hexuan Yu and Sydney Johns and Y. Thomas Hou and Wenjing Lou, 30 Aug 2025, Enabling Trustworthy Federated Learning via Remote Attestation for Mitigating Byzantine Threats, https://arxiv.org/abs/2509.00634
Aivin V. Solatorio, 8 Sep 2025, Proof-Carrying Numbers (PCN): A Protocol for Trustworthy Numeric Answers from LLMs via Claim Verification, https://arxiv.org/abs/2509.06902
Teeradaj Racharak, Chaiyong Ragkhitwetsagul, Chommakorn Sontesadisai, Thanwadee Sunetnanta, 8 Sep 2025, Test It Before You Trust It: Applying Software Testing for Trustworthy In-context Learning, https://arxiv.org/abs/2504.18827
Zhuoyue Zhang, Haitong Xu, 19 Sep 2025, Explainable AI for Maritime Autonomous Surface Ships (MASS): Adaptive Interfaces and Trustworthy Human-AI Collaboration, https://arxiv.org/abs/2509.15959
Meryem Malak Dif, Mouhamed Amine Bouchiha, Abdelaziz Amara Korba, Yacine Ghamri-Doudane, 8 Sep 2025, Towards Trustworthy Agentic IoEV: AI Agents for Explainable Cyberthreat Mitigation and State Analytics, https://arxiv.org/abs/2509.12233
Diego Gosmar, Deborah A. Dahl, 18 Sep 2025, Sentinel Agents for Secure and Trustworthy Agentic AI in Multi-Agent Systems, https://arxiv.org/abs/2509.14956
Prathamesh Vasudeo Naik, Naresh Kumar Dintakurthi, Zhanghao Hu, Yue Wang, Robby Qiu, 10 Sep 2025, Co-Investigator AI: The Rise of Agentic AI for Smarter, Trustworthy AML Compliance Narratives, https://arxiv.org/abs/2509.08380

AI Industry Safety Practices

Various papers discuss the practices of the major AI players in the industry, along with issues such as self-governance.

OpenAI, July 2023, Frontier Model Forum, https://openai.com/blog/frontier-model-forum
OpenAI. April 2023, Our approach to AI safety. https://openai.com/blog/our-approach-to-ai-safety
A. M. Barrett, J. Newman, D. Hendrycks, and B. Nonnecke. 2023, UC Berkeley AI Risk-Management Standards Profile for General-Purpose AI Systems (GPAIS) and Foundation Models, https://cltc.berkeley.edu/seeking-input-and-feedback-ai-risk-management-standards-profile-for-increasingly-multi-purpose-or-general-purpose-ai
Meta, 2023, Responsible AI: Driven by our belief that AI should benefit everyone, https://ai.meta.com/responsible-ai/
Google, 2023, AI Governance reviews and operations, https://ai.google/responsibility/ai-governance-operations
Google, 2023, Responsibility: Our Principles, https://ai.google/responsibility/principles/
Google, 2023, How Bard Works | A Responsible Approach to AI, YouTube, https://www.youtube.com/watch?v=vhbkCEnNXcY

Technical Verification and Testing of AI Safety

Testing and evaluation of AI safety issues:

Xiaowei Huang, Marta Kwiatkowska, Sen Wang, and Min Wu. May 2017. Safety verification of deep neural networks. In Computer Aided Verification, pages 3–29, https://arxiv.org/abs/1610.06940
D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath, B. Mann, E. Perez, N. Schiefer, K. Ndousse, A. Jones, S. Bowman, A. Chen, T. Conerly, N. DasSarma, D. Drain, N. Elhage, S. El-Showk, S. Fort, Z. Hatfield-Dodds, T. Henighan, D. Hernandez, T. Hume, J. Jacobson, S. Johnston, S. Kravec, C. Olsson, S. Ringer, E. Tran-Johnson, D. Amodei, T. Brown, N. Joseph, S. McCandlish, C. Olah, J. Kaplan, and J. Clark. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022 https://arxiv.org/abs/2209.07858
K Ramesh, A Chavan, S Pandit, 2023, A Comparative Study on the Impact of Model Compression Techniques on Fairness in Language Models, Microsoft Research, https://aclanthology.org/2023.acl-long.878.pdf, https://www.microsoft.com/en-us/research/uploads/prod/2023/07/3687_Paper.pdf (Rather than testing full models, this analysis examines optimized models due to quantization, pruning or distillation.)
T. Shevlane. Structured access: An emerging paradigm for safe AI deployment. In The Oxford Handbook of AI Governance, 2022, https://arxiv.org/abs/2201.05159
E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, and G. Irving. 2022, Red teaming language models with language models. arXiv preprint arXiv:2202.03286, https://arxiv.org/abs/2202.03286
OpenAI. 2023. Safety best practices. https://platform.openai.com/docs/guides/safety-best-practices
William Saunders, Girish Sastry, Andreas Stuhlmueller, and Owain Evans. Trial without error: Towards safe reinforcement learning via human intervention. arXiv preprint arXiv:1707.05173, 2017. https://arxiv.org/abs/1707.05173
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed, Oct 2023, Mistral 7B, https://arxiv.org/abs/2310.06825, Code: https://mistral.ai/news/announcing-mistral-7b/ (Examines guardrails and testing of the safety of the model against harmful inputs.)

AI Factual Inaccuracy

Research papers on accuracy of AI results include:

M Yuksekgonul, V Chandrasekaran, E Jones, Sep 2023, Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models, https://arxiv.org/pdf/2309.15098.pdf, Code: https://github.com/microsoft/mechanistic-error-probe
Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang, June 2023, Exposing Attention Glitches with Flip-Flop Language Modeling, https://arxiv.org/abs/2306.00946
S Latifi, 2023, Efficient and Dependable Deep Learning Systems Ph.D. Thesis, Computer Science and Engineering, University of Michigan, https://deepblue.lib.umich.edu/bitstream/handle/2027.42/176548/salar_1.pdf?sequence=1
Michael Wood, Aug 26, 2024, 100% Accurate AI Claimed by Acurai — OpenAI and Anthropic Confirm Acurai’s Discoveries, https://blog.cubed.run/100-accurate-ai-claimed-by-acurai-openai-and-anthropic-confirm-acurais-discoveries-98fce1ddeb5b

AI Safety Incidents

Various incidents and accidents related to AI safety issues:

S. McGregor. Nov 2021. Preventing repeated real world AI failures by cataloging incidents: The AI Incident Database. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 15458–15463, https://arxiv.org/abs/2011.08512
Sarah Perez, 2023, Snapchat’s My AI goes rogue, posts to Stories, but Snap confirms it was just a glitch, August 17, 2023, TechCrunch, https://techcrunch.com/2023/08/16/snapchats-my-ai-goes-rogue-posts-to-stories-but-snap-confirms-it-was-just-a-glitch/
Jaime Seidel, 2019, How a ‘confused’ AI May Have Fought Pilots Attempting to Save Boeing 737 MAX8s, News Corp Australia Network, https://www.news.com.au/technology/innovation/inventions/how-a-confused-ai-may-have-fought-pilots-attempting-to-save-boeing-737-max8s/news-story/bf0d102f699905e5aa8d1f6d65f4c27e (A very good example of the need for overrides and interruptibility.)
Zachary Arnold, Helen Toner, July 2021, AI Accidents: An Emerging Threat What Could Happen and What to Do, CSET Policy Brief, https://cset.georgetown.edu/wp-content/uploads/CSET-AI-Accidents-An-Emerging-Threat.pdf
Hern Alex. Apple contractors ‘regularly hear confidential details’ on Siri recordings. Guardian. 2019, https://www.theguardian.com/technology/2019/jul/26/apple-contractors-regularly-hear-confidential-details-on-siri-recordings
Victor Tangermann, Sep 2023, Microsoft Publishes Garbled AI Article Calling Tragically Deceased NBA Player "Useless", Futurism, https://futurism.com/msn-ai-brandon-hunter-useless ("AI should not be writing obituaries.")

Incident Databases: There are various databases that collect information about AI safety incidents.

AI Incident Database, https://incidentdatabase.ai/
Zach Stein-Perlman, SeLo, stepanlos, MvK, July 20, 2023, Incident reporting for AI safety, Effective Altruism Forum, https://forum.effectivealtruism.org/posts/qkK5ejystp8GCJ3vC/incident-reporting-for-ai-safety
AVID, 2023, AI Vulnerability Database: An open-source, extensible knowledge base of AI failures, https://avidml.org/
AIAAIC (AI, Algorithmic, and Automation Incidents and Controversies), 2023, https://www.aiaaic.org/home
MITRE ATLAS™ (Adversarial Threat Landscape for Artificial-Intelligence Systems), https://atlas.mitre.org/
AI Badness: An open catalog of generative AI badness, 2023, https://badness.ai/
David Dao, 2023, Awful AI, https://github.com/daviddao/awful-ai

Medical Ethics and AI

The use of AI in medicine creates some additional ethical issues:

Vollmer S., Mateen B.A., Bohner G., Király F.J., Ghani R., Jonsson P., et al. Machine learning and AI research for patient benefit: 20 critical questions on transparency, replicability, ethics and effectiveness. BMJ. 2018;(368):1–12. https://pubmed.ncbi.nlm.nih.gov/32198138/
Cockerill RG., 2020, Ethics Implications of the Use of Artificial Intelligence in Violence Risk Assessment. J Am Acad Psychiatry Law. 2020 Sep;48(3):345-349. doi: 10.29158/JAAPL.003940-20. Epub 2020 May 14. PMID: 32409300, https://pubmed.ncbi.nlm.nih.gov/32409300/
Barron DS. 2021, Commentary: the ethical challenges of machine learning in psychiatry: a focus on data, diagnosis, and treatment. Psychol Med. 2021 Nov;51(15):2522-2524. doi: 10.1017/S0033291721001008. Epub 2021 May 12. PMID: 33975655, https://pubmed.ncbi.nlm.nih.gov/33975655/
O'Reilly-Shah VN, Gentry KR, Walters AM, Zivot J, Anderson CT, Tighe PJ. 2020, Bias and ethical considerations in machine learning and the automation of perioperative risk assessment. Br J Anaesth. 2020 Dec;125(6):843-846. doi: 10.1016/j.bja.2020.07.040. Epub 2020 Aug 21. PMID: 32838979, https://pubmed.ncbi.nlm.nih.gov/32838979/
Buchlak QD, Esmaili N, Leveque JC, Bennett C, Piccardi M, Farrokhi F., 2020, Ethical thinking machines in surgery and the requirement for clinical leadership. Am J Surg. 2020 Nov;220(5):1372-1374. doi: 10.1016/j.amjsurg.2020.06.073. Epub 2020 Jul 8. PMID: 32723487, https://pubmed.ncbi.nlm.nih.gov/32723487/
Starke G, De Clercq E, Borgwardt S, Elger BS., 2020, Computing schizophrenia: ethical challenges for machine learning in psychiatry. Psychol Med. 2021 Nov;51(15):2515-2521. doi: 10.1017/S0033291720001683. Epub 2020 Jun 15. PMID: 32536358, https://pubmed.ncbi.nlm.nih.gov/32536358/
Jacobson NC, Bentley KH, Walton A, Wang SB, Fortgang RG, Millner AJ, Coombs G 3rd, Rodman AM, Coppersmith DDL., 2020, Ethical dilemmas posed by mobile health and machine learning in psychiatry research. Bull World Health Organ. 2020 Apr 1;98(4):270-276. doi: 10.2471/BLT.19.237107. Epub 2020 Feb 25. PMID: 32284651, https://pubmed.ncbi.nlm.nih.gov/32284651/
Johnson SLJ., 2019, AI, Machine Learning, and Ethics in Health Care. J Leg Med. 2019 Oct-Dec;39(4):427-441. doi: 10.1080/01947648.2019.1690604. PMID: 31940250 https://pubmed.ncbi.nlm.nih.gov/31940250/
Vayena E, Blasimme A, Cohen IG., 2018, Machine learning in medicine: Addressing ethical challenges. PLoS Med. 2018 Nov 6;15(11):e1002689. doi: 10.1371/journal.pmed.1002689. eCollection 2018 Nov. PMID: 30399149, https://pubmed.ncbi.nlm.nih.gov/30399149/
Nabi J., 2018, How Bioethics Can Shape Artificial Intelligence and Machine Learning. Hastings Cent Rep. 2018 Sep;48(5):10-13. doi: 10.1002/hast.895. PMID: 30311202, https://pubmed.ncbi.nlm.nih.gov/30311202/
Char DS, Shah NH, Magnus D., 2018, Implementing Machine Learning in Health Care - Addressing Ethical Challenges. N Engl J Med. 2018 Mar 15;378(11):981-983. doi: 10.1056/NEJMp1714229. PMID: 29539284, https://pubmed.ncbi.nlm.nih.gov/29539284/
Fiske A, Henningsen P, Buyx A., 2019, Your Robot Therapist Will See You Now: Ethical Implications of Embodied Artificial Intelligence in Psychiatry, Psychology, and Psychotherapy. J Med Internet Res. 2019 May 9;21(5):e13216. doi: 10.2196/13216. PMID: 31094356, https://pubmed.ncbi.nlm.nih.gov/31094356/
Beil Michael, Proft Ingo, van Heerden Daniel, Sviri Sigal, van Heerden Peter Vernon. 2019, Ethical considerations about artificial intelligence for prognostication in intensive care. Intensive Care Medicine Experimental. 2019;7:70. http://www.ncbi.nlm.nih.gov/pmc/articles/pmc6904702/, https://pubmed.ncbi.nlm.nih.gov/31823128/
Lasse Benzinger, Frank Ursin, Wolf-Tilo Balke, Tim Kacprowski & Sabine Salloch, 2023, Should Artificial Intelligence be used to support clinical ethical decision-making? A systematic review of reasons BMC Medical Ethics volume 24, Article number: 48 (2023), https://doi.org/10.1186/s12910-023-00929-6
Rachel Dlugatch, Antoniya Georgieva & Angeliki Kerasidou, 2023, Trustworthy artificial intelligence and ethical design: public perceptions of trustworthiness of an AI-based decision-support tool in the context of intrapartum care, BMC Medical Ethics Open Access 20 June 2023, https://doi.org/10.1186/s12910-023-00917-w
Dzobo K, Adotey S, Thomford NE, Dzobo W. Integrating Artificial and Human Intelligence: A Partnership for Responsible Innovation in Biomedical Engineering and Medicine. OMICS. 2020 May;24(5):247-263. doi: 10.1089/omi.2019.0038. Epub 2019 Jul 16. PMID: 31313972, https://pubmed.ncbi.nlm.nih.gov/31313972/
McCradden MD, Joshi S, Mazwi M, Anderson JA., 2020, Ethical limitations of algorithmic fairness solutions in health care machine learning. Lancet Digit Health. 2020 May;2(5):e221-e223. doi: 10.1016/S2589-7500(20)30065-0. PMID: 33328054, https://pubmed.ncbi.nlm.nih.gov/33328054/
Kulikowski CA., 2019, Beginnings of Artificial Intelligence in Medicine (AIM): Computational Artifice Assisting Scientific Inquiry and Clinical Art - with Reflections on Present AIM Challenges. Yearb Med Inform. 2019 Aug;28(1):249-256. doi: 10.1055/s-0039-1677895. Epub 2019 Apr 25. PMID: 31022744, https://pubmed.ncbi.nlm.nih.gov/31022744/
Park S.H., Kim Y.H., Lee J.Y., Yoo S., Kim C.J. Ethical challenges regarding artificial intelligence in medicine from the perspective of scientific editing and peer review. Science Editing. 2019;6:91–98. https://www.semanticscholar.org/paper/Ethical-challenges-regarding-artificial-in-medicine-Park-Kim/7a5b3c84c6f5d16e68eaf17989b0debfd4ba57d0

Data Leakage

Data leakage refers to the AI accidentally causing the leak of data that you'd prefer was kept confidential. The "leak" can actually be caused by the LLM, or by the user, depending on the context. There are various ways this can occur:

Uploading confidential data in AI queries (User data leakage)
Training or fine-tuning data containing proprietary information (Training data leakage)
RAG datastore documents containing proprietary information (RAG data leakage)

In the context of an LLM output leaking, this refers to where internal company IP is accidentally "leaked" to the public by training the AI with documents containing internal information. The AI is not smart enough to note when it shouldn't be reading a document, and anything that goes into the training dataset, or in the RAG datastore, will be shown to users.

User data leakage is where company users are sending proprietary information to a third-party AI engine. In theory, this data is protected by the confidentiality practices of the LLM company. This issue is similar to having company staff emitting confidential information in their Google queries, but the issue is more problematic because AI queries can upload entire documents to be analyzed by the LLM, such as when doing grammar checking with an LLM.

Research papers on data leakage:

Grant Gross, 05 Jun 2024, Unauthorized AI is eating your company data, thanks to your employees, https://www.csoonline.com/article/2138447/unauthorized-ai-is-eating-your-company-data-thanks-to-your-employees.html
Mary K. Pratt, 08 Jul 2024, 10 ways to prevent shadow AI disaster, https://www.cio.com/article/2150142/10-ways-to-prevent-shadow-ai-disaster.html
Rachel Curry, Aug 28 2024, Why companies including JPMorgan and Walmart are opting for internal gen AI assistants after initially restricting usage, https://www.cnbc.com/2024/08/28/why-jpmorgan-and-walmart-are-opting-for-internal-gen-ai-assistants.html
Huan Yang, Deyu Zhang, Yudong Zhao, Yuanchun Li, Yunxin Liu, 6 Sep 2024, A First Look At Efficient And Secure On-Device LLM Inference Against KV Leakage, https://arxiv.org/abs/2409.04040 (Security issues where KV caches can be data leaks as they may contain encodings of private information.)
G Wu, Z Zhang, Y Zhang, W Wang, J Niu, Y Wu, Mar 2025, I Know What You Asked: Prompt Leakage via KV-Cache Sharing in Multi-Tenant LLM Serving https://www.ndss-symposium.org/wp-content/uploads/2025-1772-paper.pdf

Refusal

Refusal refers to the way that an LLM will politely decline to answer an inappropriate question. There are all types of questions that we don't want an LLM to respond to, and this requires training to achieve.

Andy Arditi, Oscar Obeso, Aaquib111, wesg, Neel Nanda, 27th Apr 2024, Refusal in LLMs is mediated by a single direction, LessWrong, https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction
Maxime Labonne June 13, 2024 Uncensor any LLM with abliteration, https://huggingface.co/blog/mlabonne/abliteration
NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared RELU.)
Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri, 26 Jun 2024, WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs, https://arxiv.org/abs/2406.18495
Maksym Andriushchenko, Nicolas Flammarion, 16 Jul 2024, Does Refusal Training in LLMs Generalize to the Past Tense? https://arxiv.org/abs/2407.11969 Code: https://github.com/tml-epfl/llm-past-tense
Kylie Robison, Jul 20, 2024, OpenAI’s latest model will block the ‘ignore all previous instructions’ loophole, https://www.theverge.com/2024/7/19/24201414/openai-chatgpt-gpt-4o-prompt-injection-instruction-hierarchy
Xinyi Hou, Yanjie Zhao, Haoyu Wang, 3 Aug 2024, Voices from the Frontier: A Comprehensive Analysis of the OpenAI Developer Forum, https://arxiv.org/abs/2408.01687
Asir Saadat, Tasmia Binte Sogir, Md Taukir Azam Chowdhury, Syem Aziz, 16 Oct 2024, When Not to Answer: Evaluating Prompts on GPT Models for Effective Abstention in Unanswerable Math Word Problems, https://arxiv.org/abs/2410.13029
Kyle O'Brien, David Majercak, Xavier Fernandes, Richard Edgar, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, Forough Poursabzi-Sangde, 18 Nov 2024, Steering Language Model Refusal with Sparse Autoencoders, https://arxiv.org/abs/2411.11296
Mohit Sewak, Dec 6, 2024, Prompt Injection Attacks on Large Language Models, https://pub.towardsai.net/prompt-injection-attacks-on-large-language-models-bd8062fa1bb7
Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun Xia, Tianyi Wu, Zhiwei Xue, Yulin Chen, Kenji Kawaguchi, Jiaheng Zhang, Bryan Hooi, 30 Jan 2025, GuardReasoner: Towards Reasoning-based LLM Safeguards, https://arxiv.org/abs/2501.18492
Holistic AI Team, March 6, 2025, Anthropic’s Claude 3.7 Sonnet Jailbreaking & Red Teaming Audit: The Most Secure Model Yet? https://www.holisticai.com/blog/claude-3-7-sonnet-jailbreaking-audit
Vishnu Kabir Chhabra, Mohammad Mahdi Khalili, 5 Apr 2025, Towards Understanding and Improving Refusal in Compressed Models via Mechanistic Interpretability, https://arxiv.org/abs/2504.04215
Wojciech Zaremba, Evgenia Nitishinskaya, Boaz Barak, Stephanie Lin, Sam Toyer, Yaodong Yu, Rachel Dias, Eric Wallace, Kai Xiao, Johannes Heidecke, Amelia Glaese, 31 Jan 2025, Trading Inference-Time Compute for Adversarial Robustness, https://arxiv.org/abs/2501.18841
Hongzhe Du, Weikai Li, Min Cai, Karim Saraipour, Zimin Zhang, Himabindu Lakkaraju, Yizhou Sun, Shichang Zhang, 11 Aug 2025, How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence, https://arxiv.org/abs/2504.02904
Haonan Zhang, Dongxia Wang, Yi Liu, Kexin Chen, Jiashui Wang, Xinlei Ying, Long Liu, Wenhai Wang, 15 Aug 2025, ORFuzz: Fuzzing the "Other Side" of LLM Safety -- Testing Over-Refusal, https://arxiv.org/abs/2508.11222
Yuan Yuan, Tina Sriskandarajah, Anna-Luisa Brakman, Alec Helyar, Alex Beutel, Andrea Vallone, Saachi Jain, 12 Aug 2025, From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training, https://arxiv.org/abs/2508.09224
Ranjie Duan, Jiexi Liu, Xiaojun Jia, Shiji Zhao, Ruoxi Cheng, Fengxiang Wang, Cheng Wei, Yong Xie, Chang Liu, Defeng Li, Yinpeng Dong, Yichi Zhang, Yuefeng Chen, Chongwen Wang, Xingjun Ma, Xingxing Wei, Yang Liu, Hang Su, Jun Zhu, Xinfeng Li, Yitong Sun, Jie Zhang, Jinzhao Hu, Sha Xu, Yitong Yang, Jialing Tao, Hui Xue, 4 Sep 2025, Oyster-I: Beyond Refusal -- Constructive Safety Alignment for Responsible Language Models, https://arxiv.org/abs/2509.01909
Md Abdullah Al Mamun, Ihsen Alouani, Nael Abu-Ghazaleh, 28 Aug 2025, Poison Once, Refuse Forever: Weaponizing Alignment for Injecting Bias in LLMs, https://arxiv.org/abs/2508.20333
Neel Jain, Aditya Shrivastava, Chenyang Zhu, Daben Liu, Alfy Samuel, Ashwinee Panda, Anoop Kumar, Micah Goldblum, Tom Goldstein, 29 Aug 2025, Refusal Tokens: A Simple Way to Calibrate Refusals in Large Language Models, https://arxiv.org/abs/2412.06748
Nirmalendu Prakash, Yeo Wei Jie, Amir Abdullah, Ranjan Satapathy, Erik Cambria, Roy Ka Wei Lee, 7 Sep 2025, Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal, https://arxiv.org/abs/2509.09708

Guardrails

Aarushi Kansal, Chapter 4: Guardrails and AI: Building Safe and Controllable Apps, Building Generative AI-Powered Apps: A Hands-on Guide for Developers, Apress, https://www.amazon.com/Building-Generative-AI-Powered-Apps-Hands-ebook/dp/B0CTXXP1S4/
Meta, July 2024 (accessed), Llama: Making safety tools accessible to everyone, https://llama.meta.com/trust-and-safety/
Chip Huyen, Jul 25, 2024, Building A Generative AI Platform, https://huyenchip.com/2024/07/25/genai-platform.html
Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
Marko Zivkovic, Aug 06, 2024, Discovered Apple Intelligence prompts show Apple's attempt at preventing AI disaster, https://appleinsider.com/articles/24/08/06/discovered-apple-intelligence-prompts-show-apples-attempt-at-preventing-ai-disaster
Rachel Curry, Aug 28 2024, Why companies including JPMorgan and Walmart are opting for internal gen AI assistants after initially restricting usage, https://www.cnbc.com/2024/08/28/why-jpmorgan-and-walmart-are-opting-for-internal-gen-ai-assistants.html
Shayne Longpre, Stella Biderman, Alon Albalak, Hailey Schoelkopf, Daniel McDuff, Sayash Kapoor, Kevin Klyman, Kyle Lo, Gabriel Ilharco, Nay San, Maribeth Rauh, Aviya Skowron, Bertie Vidgen, Laura Weidinger, Arvind Narayanan, Victor Sanh, David Adelani, Percy Liang, Rishi Bommasani, Peter Henderson, Sasha Luccioni, Yacine Jernite, Luca Soldaini, 26 Jun 2024 (v2), The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources, https://arxiv.org/abs/2406.16746
Jason Perlow, Nov. 6, 2024, The best open-source AI models: All your free-to-use options explained: Here are the best open-source and free-to-use AI models for text, images, and audio, organized by type, application, and licensing considerations. https://www.zdnet.com/article/the-best-open-source-ai-models-all-your-free-to-use-options-explained/
McKinsey, November 14, 2024, What are AI guardrails? https://www.mckinsey.com/featured-insights/mckinsey-explainers/what-are-ai-guardrails
Aditi Bodhankar, Dec 06, 2024, Content Moderation and Safety Checks with NVIDIA NeMo Guardrails, https://developer.nvidia.com/blog/content-moderation-and-safety-checks-with-nvidia-nemo-guardrails/
Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri, 10 Dec 2024, Granite Guardian, https://arxiv.org/abs/2412.07724 https://github.com/ibm-granite/granite-guardian (Open-sourcing of safety models with many capabilities.)
Rama Akkiraju, Anbang Xu, Deepak Bora, Tan Yu, Lu An, Vishal Seth, Aaditya Shukla, Pritam Gundecha, Hridhay Mehta, Ashwin Jha, Prithvi Raj, Abhinav Balasubramanian, Murali Maram, Guru Muthusamy, Shivakesh Reddy Annepally, Sidney Knowles, Min Du, Nick Burnett, Sean Javiya, Ashok Marannan, Mamta Kumari, Surbhi Jha, Ethan Dereszenski, Anupam Chakraborty, Subhash Ranjan, Amina Terfai, Anoop Surya, Tracey Mercer, Vinodh Kumar Thanigachalam, Tamar Bar, Sanjana Krishnan, Samy Kilaru, Jasmine Jaksic, Nave Algarici, Jacob Liberman, Joey Conway, Sonu Nayyar, Justin Boitano, 10 Jul 2024, FACTS About Building Retrieval Augmented Generation-based Chatbots, NVIDIA Research, https://arxiv.org/abs/2407.07858
Aditi Bodhankar, Jan 16, 2025, How to Safeguard AI Agents for Customer Service with NVIDIA NeMo Guardrails, https://developer.nvidia.com/blog/how-to-safeguard-ai-agents-for-customer-service-with-nvidia-nemo-guardrails/
Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun Xia, Tianyi Wu, Zhiwei Xue, Yulin Chen, Kenji Kawaguchi, Jiaheng Zhang, Bryan Hooi, 30 Jan 2025, GuardReasoner: Towards Reasoning-based LLM Safeguards, https://arxiv.org/abs/2501.18492
Aditi Bodhankar, Mar 03, 2025, Measuring the Effectiveness and Performance of AI Guardrails in Generative AI Applications, https://developer.nvidia.com/blog/measuring-the-effectiveness-and-performance-of-ai-guardrails-in-generative-ai-applications/
Manuel Cossio, 3 Aug 2025, A comprehensive taxonomy of hallucinations in Large Language Models, https://arxiv.org/abs/2508.01781
Anthropic, 13 Aug 2025, Building Safeguards for Claude, https://www.anthropic.com/news/building-safeguards-for-claude
Yuksel Aydin, 9 Aug 2025, Cognitive Cybersecurity for Artificial Intelligence: Guardrail Engineering with CCS-7, https://arxiv.org/abs/2508.10033
Boyuan Zheng, Zeyi Liao, Scott Salisbury, Zeyuan Liu, Michael Lin, Qinyuan Zheng, Zifan Wang, Xiang Deng, Dawn Song, Huan Sun, Yu Su, 18 Jul 2025, WebGuard: Building a Generalizable Guardrail for Web Agents, https://arxiv.org/abs/2507.14293
Cheng-Fu Yang, Thanh Tran, Christos Christodoulopoulos, Weitong Ruan, Rahul Gupta, Kai-Wei Chang, 28 Jul 2025, Customize Multi-modal RAI Guardrails with Precedent-based predictions, https://arxiv.org/abs/2507.20503
Chad DeLuca, Anna Lisa Gentile, Shubhi Asthana, Bing Zhang, Pawan Chowdhary, Kellen Cheng, Basel Shbita, Pengyuan Li, Guang-Jie Ren, Sandeep Gopisetty, 25 Jul 2025, OneShield - the Next Generation of LLM Guardrails, https://arxiv.org/abs/2507.21170
Hannah-Beth Clark, Laura Benton, Emma Searle, Margaux Dowland, Matthew Gregory, Will Gayne and John Roberts, 7 Aug 2025, Building Effective Safety Guardrails in AI Education Tools, https://arxiv.org/abs/2508.05360
Alexander W. Lee, Justin Chan, Michael Fu, Nicolas Kim, Akshay Mehta, Deepti Raghavan, Ugur Cetintemel, 7 Aug 2025, Semantic Integrity Constraints: Declarative Guardrails for AI-Augmented Data Processing Systems, https://arxiv.org/abs/2503.00600
Darpan Aswal and C\'eline Hudelot, 22 Aug 2025, LLMSymGuard: A Symbolic Safety Guardrail Framework Leveraging Interpretable Jailbreak Concepts, https://arxiv.org/abs/2508.16325
Jun Zhuang, Haibo Jin, Ye Zhang, Zhengjian Kang, Wenbin Zhang, Gaby G. Dagher, Haohan Wang, 25 Aug 2025, Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation, https://arxiv.org/abs/2505.18556
Kellen Tan Cheng, Anna Lisa Gentile, Chad DeLuca, Guang-Jie Ren, 25 Aug 2025, Backprompting: Leveraging Synthetic Production Data for Health Advice Guardrails, https://arxiv.org/abs/2508.18384
Victoria R. Li and Yida Chen and Naomi Saphra, 26 Aug 2025, ChatGPT Doesn't Trust Chargers Fans: Guardrail Sensitivity in Context, https://arxiv.org/abs/2407.06866
Monte Hoover, Vatsal Baherwani, Neel Jain, Khalid Saifullah, Joseph Vincent, Chirag Jain, Melissa Kazemi Rad, C. Bayan Bruss, Ashwinee Panda, Tom Goldstein, 2 Sep 2025, DynaGuard: A Dynamic Guardrail Model With User-Defined Policies, https://arxiv.org/abs/2509.02563

Jailbreak

Jailbreaking is the hack of using English to break into a computer system. Actually, it's not so much a violation of the server, but it does refer to a way of getting the LLM to answer questions that its developer probably doesn't want it to. In other words, it's a trick to bypass the "refusal" module of an LLM.

Andy Arditi, Oscar Obeso, Aaquib111, wesg, Neel Nanda, 27th Apr 2024, Refusal in LLMs is mediated by a single direction, LessWrong, https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction
Adva Nakash Peleg, May 30, 2024, An LLM Journey: From POC to Production, https://medium.com/cyberark-engineering/an-llm-journey-from-poc-to-production-6c5ec6a172fb
Yu Wang, Xiaogeng Liu, Yu Li, Muhao Chen, Chaowei Xiao, 14 Mar 2024, AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting, https://arxiv.org/abs/2403.09513 Code: https://github.com/rain305f/AdaShield
Jinhwa Kim, Ali Derakhshan, Ian G. Harris, 31 Oct 2023, Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield, https://arxiv.org/abs/2311.00172
Zixuan Ni, Longhui Wei, Jiacheng Li, Siliang Tang, Yueting Zhuang, Qi Tian, 8 Aug 2023 (v2), Degeneration-Tuning: Using Scrambled Grid shield Unwanted Concepts from Stable Diffusion, https://arxiv.org/abs/2308.02552
Xiao Peng, Tao Liu, Ying Wang, 3 Jun 2024 (v2), Genshin: General Shield for Natural Language Processing with Large Language Models, https://arxiv.org/abs/2405.18741
Ayushi Nirmal, Amrita Bhattacharjee, Paras Sheth, Huan Liu, 8 May 2024 ( v2), Towards Interpretable Hate Speech Detection using Large Language Model-extracted Rationales, https://arxiv.org/abs/2403.12403 Code: https://github.com/AmritaBh/shield
Shweta Sharma, 27 Jun 2024, Microsoft warns of ‘Skeleton Key’ jailbreak affecting many generative AI models, https://www.csoonline.com/article/2507702/microsoft-warns-of-novel-jailbreak-affecting-many-generative-ai-models.html
Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri, 26 Jun 2024, WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs, https://arxiv.org/abs/2406.18495
Maksym Andriushchenko, Nicolas Flammarion, 16 Jul 2024, Does Refusal Training in LLMs Generalize to the Past Tense? https://arxiv.org/abs/2407.11969 Code: https://github.com/tml-epfl/llm-past-tense
Kylie Robison, Jul 20, 2024, OpenAI’s latest model will block the ‘ignore all previous instructions’ loophole, https://www.theverge.com/2024/7/19/24201414/openai-chatgpt-gpt-4o-prompt-injection-instruction-hierarchy
Chip Huyen, Jul 25, 2024, Building A Generative AI Platform, https://huyenchip.com/2024/07/25/genai-platform.html
Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
Ayush RoyChowdhury, Mulong Luo,, Prateek Sahu,, Sarbartha Banerjee, Mohit Tiwari, Aug 2024, ConfusedPilot: Confused Deputy Risks in RAG-based LLMs, https://confusedpilot.info/confused_pilot_new.pdf
Dr. Ashish Bamania, Sep 2024, ‘MathPrompt’ Embarassingly Jailbreaks All LLMs Available On The Market Today. A deep dive into how a novel LLM Jailbreaking technique called ‘MathPrompt’ works, why it is so effective, and why it needs to be patched as soon as possible to prevent harmful LLM content generation, https://bamania-ashish.medium.com/mathprompt-embarassingly-jailbreaks-all-llms-available-on-the-market-today-d749da26c6e8
Y. Bai et al., "Backdoor Attack and Defense on Deep Learning: A Survey," in IEEE Transactions on Computational Social Systems, doi: 10.1109/TCSS.2024.3482723. https://ieeexplore.ieee.org/abstract/document/10744415
Steve Jones, Oct 3, 2024, LLM Prompt Injection: Never send the request to the model. Classify, rewrite and reject, https://blog.metamirror.io/llm-prompt-injection-never-send-the-request-to-the-model-e8017269b96a
Emet Bethany, Mazal Bethany, Juan Arturo Nolazco Flores, Sumit Kumar Jha, Peyman Najafirad, 5 Nov 2024 (v2), Jailbreaking Large Language Models with Symbolic Mathematics, https://arxiv.org/abs/2409.11445
Alwin Peng, Julian Michael, Henry Sleight, Ethan Perez, Mrinank Sharma, 12 Nov 2024, Rapid Response: Mitigating LLM Jailbreaks with a Few Examples, https://arxiv.org/abs/2411.07494
Kyle O'Brien, David Majercak, Xavier Fernandes, Richard Edgar, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, Forough Poursabzi-Sangde, 18 Nov 2024, Steering Language Model Refusal with Sparse Autoencoders, https://arxiv.org/abs/2411.11296
Zachary Coalson, Jeonghyun Woo, Shiyang Chen, Yu Sun, Lishan Yang, Prashant Nair, Bo Fang, Sanghyun Hong, 10 Dec 2024, PrisonBreak: Jailbreaking Large Language Models with Fewer Than Twenty-Five Targeted Bit-flips, https://arxiv.org/abs/2412.07192
Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri, 10 Dec 2024, Granite Guardian, https://arxiv.org/abs/2412.07724 https://github.com/ibm-granite/granite-guardian (Open-sourcing of safety models with many capabilities.)
Mohit Sewak, Dec 6, 2024, Prompt Injection Attacks on Large Language Models, https://pub.towardsai.net/prompt-injection-attacks-on-large-language-models-bd8062fa1bb7
Sicheng Zhu, Brandon Amos, Yuandong Tian, Chuan Guo, Ivan Evtimov, 13 Dec 2024, AdvPrefix: An Objective for Nuanced LLM Jailbreaks, https://arxiv.org/abs/2412.10321
Aditi Bodhankar, Jan 16, 2025, How to Safeguard AI Agents for Customer Service with NVIDIA NeMo Guardrails, https://developer.nvidia.com/blog/how-to-safeguard-ai-agents-for-customer-service-with-nvidia-nemo-guardrails/
Xin Yi, Yue Li, Linlin Wang, Xiaoling Wang, Liang He, 18 Jan 2025, Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks, https://arxiv.org/abs/2501.10639
Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun Xia, Tianyi Wu, Zhiwei Xue, Yulin Chen, Kenji Kawaguchi, Jiaheng Zhang, Bryan Hooi, 30 Jan 2025, GuardReasoner: Towards Reasoning-based LLM Safeguards, https://arxiv.org/abs/2501.18492
Taryn Plumb, February 3, 2025, Anthropic claims new AI security method blocks 95% of jailbreaks, invites red teamers to try, https://venturebeat.com/security/anthropic-claims-new-ai-security-method-blocks-95-of-jailbreaks-invites-red-teamers-to-try/
Holistic AI Team, March 6, 2025, Anthropic’s Claude 3.7 Sonnet Jailbreaking & Red Teaming Audit: The Most Secure Model Yet? https://www.holisticai.com/blog/claude-3-7-sonnet-jailbreaking-audit
Jiacheng Liang, Tanqiu Jiang, Yuhui Wang, Rongyi Zhu, Fenglong Ma, Ting Wang, 16 May 2025, AutoRAN: Weak-to-Strong Jailbreaking of Large Reasoning Models, https://arxiv.org/abs/2505.10846
Wojciech Zaremba, Evgenia Nitishinskaya, Boaz Barak, Stephanie Lin, Sam Toyer, Yaodong Yu, Rachel Dias, Eric Wallace, Kai Xiao, Johannes Heidecke, Amelia Glaese, 31 Jan 2025, Trading Inference-Time Compute for Adversarial Robustness, https://arxiv.org/abs/2501.18841
Manuel Cossio, 3 Aug 2025, A comprehensive taxonomy of hallucinations in Large Language Models, https://arxiv.org/abs/2508.01781
Anthropic, 13 Aug 2025, Building Safeguards for Claude, https://www.anthropic.com/news/building-safeguards-for-claude
Wenpeng Xing, Mohan Li, Chunqiang Hu, Haitao XuNingyu Zhang, Bo Lin, Meng Han, 8 Aug 2025, Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs, https://arxiv.org/abs/2508.10029
Fan Yang, 9 Aug 2025, The Cost of Thinking: Increased Jailbreak Risk in Large Language Models, https://arxiv.org/abs/2508.10032
Xiaoxue Yang, Jaeha Lee, Anna-Katharina Dick, Jasper Timm, Fei Xie, Diogo Cruz, 11 Aug 2025, Multi-Turn Jailbreaks Are Simpler Than They Seem, https://arxiv.org/abs/2508.07646
Xianjun Yang, Liqiang Xiao, Shiyang Li, Faisal Ladhak, Hyokun Yun, Linda Ruth Petzold, Yi Xu, William Yang Wang, 9 Aug 2025, Many-Turn Jailbreaking, https://arxiv.org/abs/2508.06755
Xuancun Lu, Zhengxian Huang, Xinfeng Li, Chi Zhang, Xiaoyu ji, Wenyuan Xu, 11 Aug 2025, POEX: Towards Policy Executable Jailbreak Attacks Against the LLM-based Robots, https://arxiv.org/abs/2412.16633
Tatia Tsmindashvili, Ana Kolkhidashvili, Dachi Kurtskhalia, Nino Maghlakelidze, Elene Mekvabishvili, Guram Dentoshvili, Orkhan Shamilov, Zaal Gachechiladze, Steven Saporta, David Dachi Choladze, 11 Aug 2025, Improving LLM Outputs Against Jailbreak Attacks with Expert Model Integration, https://arxiv.org/abs/2505.17066
Jirui Yang, Zheyu Lin, Zhihui Lu, Yinggui Wang, Lei Wang, Tao Wei, Xin Du, Shuhan Yang, 31 Jul 2025, CEE: An Inference-Time Jailbreak Defense for Embodied Intelligence via Subspace Concept Rotation, https://arxiv.org/abs/2504.13201
Zheng Zhang, Peilin Zhao, Deheng Ye, Hao Wang, 28 Jul 2025, Enhancing Jailbreak Attacks on LLMs via Persona Prompts, https://arxiv.org/abs/2507.22171
Jiecong Wang, Haoran Li, Hao Peng, Ziqian Zeng, Zihao Wang, Haohua Du, Zhengtao Yu, 1 Aug 2025, Activation-Guided Local Editing for Jailbreaking Attacks, https://arxiv.org/abs/2508.00555
Yelim Ahn, Jaejin Lee, 2 Aug 2025, PUZZLED: Jailbreaking LLMs through Word-Based Puzzles, https://arxiv.org/abs/2508.01306
Yik Siu Chan, Narutatsu Ri, Yuxin Xiao, Marzyeh Ghassemi, 2 Aug 2025, Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions, https://arxiv.org/abs/2502.04322
Muyang Zheng, Yuanzhi Yao, Changting Lin, Rui Wang, Caihong Kai, 4 Aug 2025, MIST: Jailbreaking Black-box Large Language Models via Iterative Semantic Tuning, https://arxiv.org/abs/2506.16792
Rui Pu, Chaozhuo Li, Rui Ha, Litian Zhang, Lirong Qiu, Xi Zhang, 5 Aug 2025, Beyond Surface-Level Detection: Towards Cognitive-Driven Defense Against Jailbreak Attacks via Meta-Operations Reasoning, https://arxiv.org/abs/2508.03054
Bodam Kim, Hiskias Dingeto, Taeyoun Kwon, Dasol Choi, DongGeon Lee, Haon Park, JaeHoon Lee, Jongho Shin, 5 Aug 2025, When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs, https://arxiv.org/abs/2508.03365
Giovanni Cherubin, Andrew Paverd, 4 Aug 2025, Highlight & Summarize: RAG without the jailbreaks, https://arxiv.org/abs/2508.02872
Ruofan Wang, Juncheng Li, Yixu Wang, Bo Wang, Xiaosen Wang, Yan Teng, Yingchun Wang, Xingjun Ma, Yu-Gang Jiang, 5 Aug 2025, IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves, https://arxiv.org/abs/2411.00827
Junwoo Ha, Hyunjun Kim, Sangyoon Yu, Haon Park, Ashkan Yousefpour, Yuna Park, Suhyun Kim, 5 Aug 2025, M2S: Multi-turn to Single-turn jailbreak in Red Teaming for LLMs, https://arxiv.org/abs/2503.04856
Thilo Hagendorff, Erik Derner, Nuria Oliver, 4 Aug 2025, Large Reasoning Models Are Autonomous Jailbreak Agents, https://arxiv.org/abs/2508.04039
Xiaohu Li and Yunfeng Ning and Zepeng Bao and Mayi Xu and Jianhao Chen and Tieyun Qian, 6 Aug 2025, CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations, https://arxiv.org/abs/2507.06043
Renmiao Chen, Shiyao Cui, Xuancheng Huang, Chengwei Pan, Victor Shea-Jay Huang, QingLin Zhang, Xuan Ouyang, Zhexin Zhang, Hongning Wang, and Minlie Huang, 7 Aug 2025, JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering, https://arxiv.org/abs/2508.05087
Jesson Wang, Zhanhao Hu, David Wagner, 7 Aug 2025, JULI: Jailbreak Large Language Models by Self-Introspection, https://arxiv.org/abs/2505.11790
Shuang Liang, Zhihao Xu, Jialing Tao, Hui Xue, Xiting Wang, 8 Aug 2025, Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models: A Unified and Accurate Approach, https://arxiv.org/abs/2508.09201
Zuoou Li, Weitong Zhang, Jingyuan Wang, Shuyuan Zhang, Wenjia Bai, Bernhard Kainz, Mengyun Qiao, 11 Aug 2025, Towards Effective MLLM Jailbreaking Through Balanced On-Topicness and OOD-Intensity, https://arxiv.org/abs/2508.09218
Boyuan Chen, Minghao Shao, Abdul Basit, Siddharth Garg, Muhammad Shafique, 13 Aug 2025, MetaCipher: A Time-Persistent and Universal Multi-Agent Framework for Cipher-Based Jailbreak Attacks for LLMs, https://arxiv.org/abs/2506.22557
Ma Teng and Jia Xiaojun and Duan Ranjie and Li Xinfeng and Huang Yihao and Jia Xiaoshuang and Chu Zhixuan and Ren Wenqi, 18 Aug 2025, Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models, https://arxiv.org/abs/2412.05934
Zhipeng Wei, Yuqi Liu, N. Benjamin Erichson, 16 Aug 2025, Emoji Attack: Enhancing Jailbreak Attacks Against Judge LLM Detection, https://arxiv.org/abs/2411.01077
Yangyang Guo and Yangyan Li and Mohan Kankanhalli, 18 Aug 2025, Involuntary Jailbreak, https://arxiv.org/abs/2508.13246
Jiaming Hu, Haoyu Wang, Debarghya Mukherjee, Ioannis Ch. Paschalidis, 19 Aug 2025, CCFC: Core & Core-Full-Core Dual-Track Defense for LLM Jailbreak Protection, https://arxiv.org/abs/2508.14128
Xiangman Li, Xiaodong Wu, Qi Li, Jianbing Ni, and Rongxing Lu, 21 Aug 2025, SafeLLM: Unlearning Harmful Outputs from Large Language Models against Jailbreak Attacks, https://arxiv.org/abs/2508.15182
Darpan Aswal and C\'eline Hudelot, 22 Aug 2025, LLMSymGuard: A Symbolic Safety Guardrail Framework Leveraging Interpretable Jailbreak Concepts, https://arxiv.org/abs/2508.16325
Yu Yan, Sheng Sun, Zhe Wang, Yijun Lin, Zenghao Duan, zhifei zheng, Min Liu, Zhiyi yin, Jianping Zhang, 22 Aug 2025, Confusion is the Final Barrier: Rethinking Jailbreak Evaluation and Investigating the Real Misuse Threat of LLMs, https://arxiv.org/abs/2508.16347
Yu Yan, Sheng Sun, Zenghao Duan, Teli Liu, Min Liu, Zhiyi Yin, Jiangyu Lei, Qi Li, 22 Aug 2025, from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors, https://arxiv.org/abs/2503.00038
Chongwen Zhao, Zhihao Dou, Kaizhu Huang, 25 Aug 2025, Defending against Jailbreak through Early Exit Generation of Large Language Models, https://arxiv.org/abs/2408.11308
Junchen Ding, Jiahao Zhang, Yi Liu, Ziqi Ding, Gelei Deng, Yuekang Li, 25 Aug 2025, TombRaider: Entering the Vault of History to Jailbreak Large Language Models, https://arxiv.org/abs/2501.18628
Salman Rahman, Liwei Jiang, James Shiffer, Genglin Liu, Sheriff Issaka, Md Rizwan Parvez, Hamid Palangi, Kai-Wei Chang, Yejin Choi, Saadia Gabriel, 23 Aug 2025, X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents, https://arxiv.org/abs/2504.13203
Hanjiang Hu, Alexander Robey, Changliu Liu, 25 Aug 2025, Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks, https://arxiv.org/abs/2503.00187
Chuhan Zhang, Ye Zhang, Bowen Shi, Yuyou Gan, Tianyu Du, Shouling Ji, Dazhan Deng, Yingcai Wu, 4 Sep 2025, NeuroBreak: Unveil Internal Jailbreak Mechanisms in Large Language Models, https://arxiv.org/abs/2509.03985
Yakai Li, Jiekang Hu, Weiduan Sang, Luping Ma, Dongsheng Nie, Weijuan Zhang, Aimin Yu, Yi Su, Qingjia Huang, Qihang Zhou, 25 Aug 2025, Prefill-level Jailbreak: A Black-Box Risk Analysis of Large Language Models, https://arxiv.org/abs/2504.21038
Haibo Jin, Ruoxi Chen, Peiyan Zhang, Andy Zhou, Yang Zhang, Haohan Wang, 28 Aug 2025, GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs, https://arxiv.org/abs/2508.20325
Junjie Chu and Mingjie Li and Ziqing Yang and Ye Leng and Chenhao Lin and Chao Shen and Michael Backes and Yun Shen and Yang Zhang, 28 Aug 2025, JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring, https://arxiv.org/abs/2508.20848
Chongwen Zhao and Kaizhu Huang, 1 Sep 2025, Unraveling LLM Jailbreaks Through Safety Knowledge Neurons, https://arxiv.org/abs/2509.01631
Sihao Wu, Gaojie Jin, Wei Huang, Jianhong Wang, Xiaowei Huang, 30 Aug 2025, Activation Steering Meets Preference Optimization: Defense Against Jailbreaks in Vision Language Models, https://arxiv.org/abs/2509.00373
Ruoxi Cheng, Yizhong Ding, Shuirong Cao, Ranjie Duan, Xiaoshuang Jia, Shaowei Yuan, Simeng Qin, Zhiqiang Wang, Xiaojun Jia, 30 Aug 2025, PBI-Attack: Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for Toxicity Maximization, https://arxiv.org/abs/2412.05892
Shei Pern Chua, Thai Zhen Leng, Teh Kai Jun, Xiao Li, Xiaolin Hu, 4 Sep 2025, Between a Rock and a Hard Place: Exploiting Ethical Reasoning to Jailbreak LLMs, https://arxiv.org/abs/2509.05367
Youjia Zheng, Mohammad Zandsalimy, and Shanu Sushmita, 5 Sep 2025, Behind the Mask: Benchmarking Camouflaged Jailbreaks in Large Language Models, https://arxiv.org/abs/2509.05471
Junjie Mu, Zonghao Ying, Zhekui Fan, Zonglei Jing, Yaoyuan Zhang, Zhengmin Yu, Wenxin Zhang, Quanchen Zou, Xiangzheng Zhang, 8 Sep 2025, Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?, https://arxiv.org/abs/2509.06350
Yunhan Zhao, Xiang Zheng, Xingjun Ma, 16 Sep 2025, Defense-to-Attack: Bypassing Weak Defenses Enables Stronger Jailbreaks in Vision-Language Models, https://arxiv.org/abs/2509.12724
Johan Wahr\'eus, Ahmed Hussain, Panos Papadimitratos, 16 Sep 2025, Jailbreaking Large Language Models Through Content Concretization, https://arxiv.org/abs/2509.12937
Seongho Joo, Hyukhun Koh, Kyomin Jung, 13 Sep 2025, Harmful Prompt Laundering: Jailbreaking LLMs with Abductive Styles and Symbolic Encoding, https://arxiv.org/abs/2509.10931
Chentao Cao, Xiaojun Xu, Bo Han, Hang Li, 15 Sep 2025, Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check, https://arxiv.org/abs/2509.11629
Yibo Zhang, Liang Lin, 14 Sep 2025, ENJ: Optimizing Noise with Genetic Algorithms to Jailbreak LSMs, https://arxiv.org/abs/2509.11128
Guorui Chen, Yifan Xia, Xiaojun Jia, Zhijiang Li, Philip Torr, Jindong Gu, 18 Sep 2025, LLM Jailbreak Detection for (Almost) Free!, https://arxiv.org/abs/2509.14558
Hyunjun Kim, Junwoo Ha, Sangyoon Yu, Haon Park, 10 Sep 2025, X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates, https://arxiv.org/abs/2509.08729

Prompt Injection

Prompt injection is a type of LLM "hack" or "jailbreak" involving the insertion of nefarious words into the prompt. A simple example is the jailbreak involving words to the effect of "ignore all previous instructions and do what I say," which was a surprisingly effective method.

Research papers on prompt injection attacks and mitigation include:

Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, Yunxin Liu, 8 May 2024 (v2), Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security, https://arxiv.org/abs/2401.05459 https://github.com/MobileLLM/Personal_LLM_Agents_Survey
Steve Jones, Oct 3, 2024, LLM Prompt Injection: Never send the request to the model. Classify, rewrite and reject, https://blog.metamirror.io/llm-prompt-injection-never-send-the-request-to-the-model-e8017269b96a
Mohit Sewak, Dec 6, 2024, Prompt Injection Attacks on Large Language Models, https://pub.towardsai.net/prompt-injection-attacks-on-large-language-models-bd8062fa1bb7
Kylie Robison, Jul 20, 2024, OpenAI’s latest model will block the ‘ignore all previous instructions’ loophole, https://www.theverge.com/2024/7/19/24201414/openai-chatgpt-gpt-4o-prompt-injection-instruction-hierarchy
Jerry Wang and Fang Yu, 20 Jul 2025, DeRAG: Black-box Adversarial Attacks on Multiple Retrieval-Augmented Generation Applications via Prompt Injection, https://arxiv.org/abs/2507.15042
Sam Johnson, Viet Pham, Thai Le, 20 Jul 2025, Manipulating LLM Web Agents with Indirect Prompt Injection Attack via HTML Accessibility Tree, https://arxiv.org/abs/2507.14799
Tianneng Shi, Kaijie Zhu, Zhun Wang, Yuqi Jia, Will Cai, Weida Liang, Haonan Wang, Hend Alzahrani, Joshua Lu, Kenji Kawaguchi, Basel Alomair, Xuandong Zhao, William Yang Wang, Neil Gong, Wenbo Guo, Dawn Song, 21 Jul 2025, PromptArmor: Simple yet Effective Prompt Injection Defenses, https://arxiv.org/abs/2507.15219
Aleksandr Gashkov, Aleksandr Perevalov, Maria Eltsova, Andreas Both, 18 Jul 2025, SPARQL Query Generation with LLMs: Measuring the Impact of Training Data Memorization and Knowledge Injection, https://arxiv.org/abs/2507.13859
Junhyeong Lee, Joon-Young Kim, Heekyu Kim, Inhyo Lee and Seunghwa Ryu, 21 Jul 2025, IM-Chat: A Multi-agent LLM-based Framework for Knowledge Transfer in Injection Molding Industry, https://arxiv.org/abs/2507.15268
Zhengyun Zhao, Huaiyuan Ying, Yue Zhong, Sheng Yu, 24 Jul 2025, DR.EHR: Dense Retrieval for Electronic Health Record with Knowledge Injection and Synthetic Data, https://arxiv.org/abs/2507.18583
Da-Wei Zhou, Kai-Wen Li, Jingyi Ning, Han-Jia Ye, Lijun Zhang, De-Chuan Zhan, 24 Jul 2025, External Knowledge Injection for CLIP-Based Class-Incremental Learning, https://arxiv.org/abs/2503.08510
Taibiao Zhao, Mingxuan Sun, Hao Wang, Xiaobing Chen, Xiangwei Zhou, 14 Aug 2025, Pruning and Malicious Injection: A Retraining-Free Backdoor Attack on Transformer Models, https://arxiv.org/abs/2508.10243
Francesco Panebianco, Stefano Bonfanti, Francesco Trov\`o, Michele Carminati, 1 Aug 2025, LeakSealer: A Semisupervised Defense for LLMs Against Prompt Injection and Leakage Attacks, https://arxiv.org/abs/2508.00602
Peiran Wang, Yang Liu, Yunfei Lu, Yifeng Cai, Hongbo Chen, Qingyou Yang, Jie Zhang, Jue Hong, Ye Wu, 2 Aug 2025, AgentArmor: Enforcing Program Analysis on Agent Runtime Trace to Defend Against Prompt Injection, https://arxiv.org/abs/2508.01249
Zhiyao Luo, Tingting Zhu, 6 Aug 2025, Are Large Language Models Dynamic Treatment Planners? An In Silico Study from a Prior Knowledge Injection Angle, https://arxiv.org/abs/2508.04755
Thorsten Peinemann, Paula Arnold, Sebastian Berndt, Thomas Eisenbarth, Esfandiar Mohammadi, 7 Aug 2025, Non-omniscient backdoor injection with a single poison sample: Proving the one-poison hypothesis for linear regression and linear classification, https://arxiv.org/abs/2508.05600
Hammad Atta, Ken Huang, Manish Bhatt, Kamal Ahmed, Muhammad Aziz Ul Haq, Yasir Mehmood, 6 Aug 2025, Logic layer Prompt Control Injection (LPCI): A Novel Security Vulnerability Class in Agentic Systems, https://arxiv.org/abs/2507.10457
Kalle Kujanp\"a\"a, Pekka Marttinen, Harri Valpola, Alexander Ilin, 7 Aug 2025, Efficient Knowledge Injection in LLMs via Self-Distillation, https://arxiv.org/abs/2412.14964
Ameya Anjarlekar, Sandeep Pombra, 8 Aug 2025, LLM Unlearning using Gradient Ratio-Based Influence Estimation and Noise Injection, https://arxiv.org/abs/2508.06467
Zhiqiu Zhang, Dongqi Fan, Mingjie Wang, Qiang Tang, Jian Yang, Zili Yi, 13 Aug 2025, Region-to-Region: Enhancing Generative Image Harmonization with Adaptive Regional Injection, https://arxiv.org/abs/2508.09746
Xuyang Guo, Zekai Huang, Zhao Song, Jiahao Zhang, 16 Aug 2025, Too Easily Fooled? Prompt Injection Breaks LLMs on Frustratingly Simple Multiple-Choice Questions, https://arxiv.org/abs/2508.13214
Xudong Wang, Guoming Tang, Junyu Xue, Srinivasan Keshav, Tongxin Li, Chris Ding, 20 Aug 2025, DualNILM: Energy Injection Identification Enabled Disaggregation with Deep Multi-Task Learning, https://arxiv.org/abs/2508.14600
Hengyu An, Jinghuai Zhang, Tianyu Du, Chunyi Zhou, Qingming Li, Tao Lin, Shouling Ji, 21 Aug 2025, IPIGuard: A Novel Tool Dependency Graph-Based Defense Against Indirect Prompt Injection in LLM Agents, https://arxiv.org/abs/2508.15310
Khalil Hennara, Sara Chrouf, Mohamed Motaism Hamed, Zeina Aldallal, Omar Hadid, Safwan AlModhayan, 21 Aug 2025, Kuwain 1.5B: An Arabic SLM via Language Injection, https://arxiv.org/abs/2504.15120
Jiawen Shi, Zenghui Yuan, Yinuo Liu, Yue Huang, Pan Zhou, Lichao Sun, Neil Zhenqiang Gong, 24 Aug 2025, Optimization-based Prompt Injection Attack to LLM-as-a-Judge, https://arxiv.org/abs/2403.17710
Sarthak Choudhary, Divyam Anshumaan, Nils Palumbo, Somesh Jha, 17 Jul 2025, How Not to Detect Prompt Injections with an LLM, https://arxiv.org/abs/2507.05630
Qifeng Tan, Shusen Yang, Xuebin Ren, Yikai Zhang (Xi'an Jiaotong University), 4 Sep 2025, Rethinking Layer-wise Gaussian Noise Injection: Bridging Implicit Objectives and Privacy Budget Allocation, https://arxiv.org/abs/2509.04232
Xilong Wang, John Bloch, Zedian Shao, Yuepeng Hu, Shuyan Zhou, Neil Zhenqiang Gong, 27 Aug 2025, EnvInjection: Environmental Prompt Injection Attack to Multi-modal Web Agents, https://arxiv.org/abs/2505.11717
Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, George Turkiyyah, Bernard Ghanem, 28 Aug 2025, Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection, https://arxiv.org/abs/2508.20766
Amine Lbath, Massih-Reza Amini, Aurelien Delaitre, Vadim Okun, 28 Aug 2025, AI Agentic Vulnerability Injection And Transformation with Optimized Reasoning, https://arxiv.org/abs/2508.20866
Govind Waghmare, Sumedh BG, Sonia Gupta, Srikanta Bedathur, 31 Aug 2025, Efficient Graph Understanding with LLMs via Structured Context Injection, https://arxiv.org/abs/2509.00740
Ting-Chun Liu and Ching-Yu Hsu and Kuan-Yi Lee and Chi-An Fu and Hung-yi Lee, 27 Aug 2025, AEGIS : Automated Co-Evolutionary Framework for Guarding Prompt Injections Schema, https://arxiv.org/abs/2509.00088
Mario U. Gaimann and Miriam Klopotek, 1 Sep 2025, Optimal information injection and transfer mechanisms for active matter reservoir computing, https://arxiv.org/abs/2509.01799
Ishaan Verma, 6 Sep 2025, Decoding Latent Attack Surfaces in LLMs: Prompt Injection via HTML in Web Summarization, https://arxiv.org/abs/2509.05831
Andrew Yeo, Daeseon Choi, 7 Sep 2025, Multimodal Prompt Injection Attacks: Risks and Defenses for Modern LLMs, https://arxiv.org/abs/2509.05883
Mengxue Yang, Chun Yang, Jiaqi Zhu, Jiafan Li, Jingqi Zhang, Yuyang Li, Ying Li, 8 Sep 2025, SLiNT: Structure-aware Language Model with Injection and Contrastive Training for Knowledge Graph Completion, https://arxiv.org/abs/2509.06531
Minghui Li, Hao Zhang, Yechao Zhang, Wei Wan, Shengshan Hu, pei Xiaobing, Jing Wang, 9 Sep 2025, Transferable Direct Prompt Injection via Activation-Guided MCMC Sampling, https://arxiv.org/abs/2509.07617
Janis Keuper, 12 Sep 2025, Prompt Injection Attacks on LLM Generated Reviews of Scientific Publications, https://arxiv.org/abs/2509.10248
Hai-Vy Nguyen, Fabrice Gamboa, Sixin Zhang, Reda Chhaibi, Serge Gratton, Thierry Giaccone, 19 Sep 2025, Training More Robust Classification Model via Discriminative Loss and Gaussian Noise Injection, https://arxiv.org/abs/2405.18499
Jiahao Zhang and Xiaobing Pei and Zhaokun Zhong and Wenqiang Hao and Zhenghao Tang, 16 Sep 2025, JANUS: A Dual-Constraint Generative Framework for Stealthy Node Injection Attacks, https://arxiv.org/abs/2509.13266
Luke Howard, 13 Sep 2025, GoldenTransformer: A Modular Fault Injection Framework for Transformer Robustness Research, https://arxiv.org/abs/2509.10790
Pavan Reddy, Aditya Sanjay Gujral, 6 Sep 2025, EchoLeak: The First Real-World Zero-Click Prompt Injection Exploit in a Production LLM System, https://arxiv.org/abs/2509.10540
Zedian Shao, Hongbin Liu, Jaden Mu, Neil Zhenqiang Gong, 15 Sep 2025, Enhancing Prompt Injection Attacks to LLMs via Poisoning Alignment, https://arxiv.org/abs/2410.14827
Yupei Liu, Yuqi Jia, Jinyuan Jia, Dawn Song, Neil Zhenqiang Gong, 14 Sep 2025, DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks, https://arxiv.org/abs/2504.11358
Gustavo Sandoval, Denys Fenchenko and Junyao Chen, 15 Sep 2025, Early Approaches to Adversarial Fine-Tuning for Prompt Injection Defense: A 2022 Study of GPT-3 and Contemporary Models, https://arxiv.org/abs/2509.14271
S M Asif Hossain, Ruksat Khan Shayoni, Mohd Ruhul Ameen, Akif Islam, M. F. Mridha, Jungpil Shin, 16 Sep 2025, A Multi-Agent LLM Defense Pipeline Against Prompt Injection Attacks, https://arxiv.org/abs/2509.14285
Tongyu Wen, Chenglong Wang, Xiyuan Yang, Haoyu Tang, Yueqi Xie, Lingjuan Lyu, Zhicheng Dou, Fangzhao Wu, 17 Sep 2025, Defending against Indirect Prompt Injection by Instruction Detection, https://arxiv.org/abs/2505.06311

Plagiarism

Plagiarism is an issue for LLMs when they repeat input training verbatim. This is a controversial issue with numerous copyright lawsuits in progress at the moment. Another side to "plagiarism" is detecting when authors or students have used AI in their writing, without attribution.

Research papers on plagiarism issues with AI include:

Ruixiang Tang, Yu-Neng Chuang, Xia Hu, June 2023, The Science of Detecting LLM-Generated Texts, https://arxiv.org/abs/2303.07205
JON CHRISTIAN, 2023, CNET's AI Journalist Appears to Have Committed Extensive Plagiarism, https://futurism.com/cnet-ai-plagiarism
Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
David Gewirtz, Nov. 26, 2024, I tested 9 AI content detectors - and these 2 correctly identified AI text every time, https://www.zdnet.com/article/i-tested-9-ai-content-detectors-and-these-2-correctly-identified-ai-text-every-time/
Guillaume Cabanac, Cyril Labbé, Alexander Magazinov, 12 Jul 2021, Tortured phrases: A dubious writing style emerging in science. Evidence of critical issues affecting established journals, https://arxiv.org/abs/2107.06751 (Detects "tortured phrases" created by pre-AI paraphrasing tools used to avoid plagiarism detectors.)
Eléna Martel, Martin Lentschat, Cyril Labbé, 2 Feb 2024, Detection of tortured phrases in scientific literature, https://arxiv.org/abs/2402.03370
Seonghyeon Go, 10 Sep 2025, Real-world Music Plagiarism Detection With Music Segment Transcription System, https://arxiv.org/abs/2509.08282

AI Detectors

AI detectors are software that is supposed to detect whether a text or image has been created by humans or by LLMs. In practice, these have been a mixed success, being prone to both false positives and false negatives, and their use remains controversial and unclear.

Research papers on AI detectors:

David Gewirtz, Aug. 19, 2024, How do AI checkers actually work? https://www.zdnet.com/article/how-do-ai-checkers-work/
David Gewirtz, Aug. 8, 2024, I tested 7 AI content detectors - they're getting dramatically better at identifying plagiarism, https://www.zdnet.com/article/i-tested-7-ai-content-detectors-theyre-getting-dramatically-better-at-identifying-plagiarism/
Write A Catalyst, Aug 23, 2024, Words and Phrases That Show ChatGPT Generated It, https://medium.com/write-a-catalyst/words-and-phrases-that-show-chatgpt-generated-it-ca7e28ae8e8f
Brian Contreras, September 19, 2024, How Can You Detect AI-Generated Text? This Startup Has Some Compelling Ideas, https://www.inc-aus.com/brian-contreras/how-can-you-detect-ai-generated-text-this-startup-has-some-compelling-ideas.html
Tan Rosado, Sep 9, 2024, 10 Phrases That Scream ‘AI Wrote This!’ — Even When It Didn’t. https://medium.com/write-a-catalyst/10-phrases-that-scream-ai-wrote-this-even-when-it-didn-t-c58f273c9075
David Gewirtz, Nov. 26, 2024, I tested 9 AI content detectors - and these 2 correctly identified AI text every time, https://www.zdnet.com/article/i-tested-9-ai-content-detectors-and-these-2-correctly-identified-ai-text-every-time/
The Medium Newsletter Dec 2024, ChatGPT’s favorite words & punctuation, The Medium Blog, https://blog.medium.com/chatgpts-favorite-words-punctuation-fca042bb6bea
The Medium Blog, Jun 7, 2024, How to become a marine biologist, https://blog.medium.com/how-to-become-a-marine-biologist-ca849217523b
Alex Hern, 16 Apr 2024, TechScape: How cheap, outsourced labour in Africa is shaping AI English, https://www.theguardian.com/technology/2024/apr/16/techscape-ai-gadgest-humane-ai-pin-chatgpt
Jordan Gibbs, Dec 14, 2023, Which Words Does ChatGPT Use the Most? I analyzed 1 million words of ChatGPT output and found the words that ChatGPT overuses most. https://medium.com/@jordan_gibbs/which-words-does-chatgpt-use-the-most-7c9ff02416a8
Asif Iqbal, August 31, 2024, ChatGPT's Top 50 Favorite Words and Phrases, https://www.linkedin.com/pulse/chatgpts-top-50-favorite-words-phrases-asif-iqbal-mba-cmbe-lavpe/
BaggyBoy, 2024, Is an em dash (—) proof of AI manipulation? https://www.reddit.com/r/ChatGPT/comments/1fx12q1/is_an_em_dash_proof_of_ai_manipulation/?rdt=38192
Linda Caroll, Jan 2025, I Don’t Know How To Make You Care What ChatGPT Is Quietly Doing: Over half of the internet is now AI generated text https://medium.com/the-generator/i-dont-know-how-to-make-you-care-what-chatgpt-is-quietly-doing-8177dfcfb486
Maria Cassano, Jan 4, 2025, I’m a Professional Editor and These Phrases Tell Me You Used ChatGPT: AI chatbots were trained on novice writing, and it shows, https://writingcooperative.com/im-a-professional-editor-and-these-phrases-tell-me-you-used-chatgpt-23236708918f
Mingjie Sun, Yida Yin, Zhiqiu Xu, J. Zico Kolter, Zhuang Liu, 17 Feb 2025, Idiosyncrasies in Large Language Models, https://arxiv.org/abs/2502.12150
W Li, Y Lai, S Soni, K Saha, 2025, Emails by LLMs: A Comparison of Language in AI-Generated and Human-Written Emails, Proceedings of the 17th ACM Web Science Conference 2025 (Websci ’25), May 20–24, 2025, New Brunswick, NJ, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3717867.3717872 https://www.researchgate.net/profile/Koustuv-Saha-2/publication/389509862_Emails_by_LLMs_A_Comparison_of_Language_in_AI-Generated_and_Human-Written_Emails/links/67c5cd02461fb56424efccc6/Emails-by-LLMs-A-Comparison-of-Language-in-AI-Generated-and-Human-Written-Emails.pdf
David Gewirtz, April 30, 2025, I tested 10 AI content detectors - and these 5 correctly identified AI text every time: I've been testing AI content detectors for two years now. They're getting more and more reliable, https://www.zdnet.com/article/i-tested-10-ai-content-detectors-and-these-5-correctly-identified-ai-text-every-time/
Shreya Shankar, Jun 16, 2025, Writing in the Age of LLMs: Common Patterns of Bad Writing I See from LLM Tools, https://www.sh-reya.com/blog/ai-writing/ (A good overview of the types of bad writing that comes out of LLMs.)

Privacy

Research on privacy-related risks or concerns:

Matthew Finnegan 14 Jun 2024, Microsoft delays Recall launch amid privacy concerns, ComputerWorld, https://www.computerworld.com/article/2147736/microsoft-delays-recall-launch-amid-privacy-concerns.html
Rohan Goswami 21 June, 2024, Apple Intelligence won’t launch in EU in 2024 due to antitrust regulation, company says, CNBS, https://www.cnbc.com/2024/06/21/apple-ai-europe-dma-macos.html
Dan Peng, Zhihui Fu, Jun Wang, 1 Jul 2024, PocketLLM: Enabling On-Device Fine-Tuning for Personalized LLMs, https://arxiv.org/abs/2407.01031 (Running fine-tuning on a smartphone via a low-memory optimization using a "derivative-free" "zeroth-order" technique called MeZo, with advantages such as privacy.)
Jay Peters, Jul 4, 2024, OpenAI’s ChatGPT Mac app was storing conversations in plain text, https://www.theverge.com/2024/7/3/24191636/openai-chatgpt-mac-app-conversations-plain-text
Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
Y. Zhang, J. Zhang, S. Yue, W. Lu, J. Ren, X. Shen, August 2024, "Mobile Generative AI: Opportunities and Challenges," in IEEE Wireless Communications, vol. 31, no. 4, pp. 58-64, doi: 10.1109/MWC.006.2300576, https://ieeexplore.ieee.org/abstract/document/10628027/
Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, Yunxin Liu, 8 May 2024 (v2), Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security, https://arxiv.org/abs/2401.05459 https://github.com/MobileLLM/Personal_LLM_Agents_Survey
Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
Huan Yang, Deyu Zhang, Yudong Zhao, Yuanchun Li, Yunxin Liu, 6 Sep 2024, A First Look At Efficient And Secure On-Device LLM Inference Against KV Leakage, https://arxiv.org/abs/2409.04040 (Security issues where KV caches can be data leaks as they may contain encodings of private information.)
Apple, Sep 2024, Apple Intelligence comes to iPhone, iPad, and Mac starting next month, https://www.apple.com/newsroom/2024/09/apple-intelligence-comes-to-iphone-ipad-and-mac-starting-next-month/
Donghwan Rho, Taeseong Kim, Minje Park, Jung Woo Kim, Hyunsik Chae, Jung Hee Cheon, Ernest K. Ryu, 3 Oct 2024, Encryption-Friendly LLM Architecture, https://arxiv.org/abs/2410.02486
Jiankun Wei, Abdulrahman Abdulrazzag, Tianchen Zhang, Adel Muursepp, Gururaj Saileshwar, 5 Nov 2024 (v2), Privacy Risks of Speculative Decoding in Large Language Models, https://arxiv.org/abs/2411.01076
G Wu, Z Zhang, Y Zhang, W Wang, J Niu, Y Wu, Mar 2025, I Know What You Asked: Prompt Leakage via KV-Cache Sharing in Multi-Tenant LLM Serving https://www.ndss-symposium.org/wp-content/uploads/2025-1772-paper.pdf
Maimunatu Tunau, Vincent Gbouna Zakka, Zhuangzhuang Dai, 14 Aug 2025, Enhanced Sparse Point Cloud Data Processing for Privacy-aware Human Action Recognition, https://arxiv.org/abs/2508.10469
Feiran Li, Qianqian Xu, Shilong Bao, Boyu Han, Zhiyong Yang, Qingming Huang, 14 Aug 2025, Hybrid Generative Fusion for Efficient and Privacy-Preserving Face Recognition Dataset Generation, https://arxiv.org/abs/2508.10672
Yanzhe Zhang, Diyi Yang, 14 Aug 2025, Searching for Privacy Risks in LLM Agents via Simulation, https://arxiv.org/abs/2508.10880
Quentin Hillebrand, Vorapong Suppakitpaisarn and Tetsuo Shibuya, 14 Aug 2025, Communication Cost Reduction for Subgraph Counting under Local Differential Privacy via Hash Functions, https://arxiv.org/abs/2312.07055
Jessup Byun, Xiaofeng Lin, Joshua Ward, Guang Cheng, 22 Jul 2025, Risk In Context: Benchmarking Privacy Leakage of Foundation Models in Synthetic Tabular Data Generation, https://arxiv.org/abs/2507.17066
Wei Fan, JinYi Yoon, Xiaochang Li, Huajie Shao, and Bo Ji, 23 Jul 2025, P3SL: Personalized Privacy-Preserving Split Learning on Heterogeneous Edge Devices, https://arxiv.org/abs/2507.17228
Na Li and Yansong Gao and Hongsheng Hu and Boyu Kuang and Anmin Fu, 22 Jul 2025, CompLeak: Deep Learning Model Compression Exacerbates Privacy Leakage, https://arxiv.org/abs/2507.16872
Angelo Rodio, Zheng Chen, Erik G. Larsson, 23 Jul 2025, Optimizing Privacy-Utility Trade-off in Decentralized Learning with Generalized Correlated Noise, https://arxiv.org/abs/2501.14644
Mehdi Khalaj, Shahrzad Golestani Najafabadi, Julita Vassileva, 23 Jul 2025, Privacy-Preserving Multimodal News Recommendation through Federated Learning, https://arxiv.org/abs/2507.15460
Harsha Sammangi (Dakota State University), Aditya Jagatha (College of Business and Information Systems, Dakota State University), Giridhar Reddy Bojja (College of Business, Michigan Technological University), Jun Liu (College of Business and I.S, Dakota State University), 29 Apr 2025, Decentralized AI-driven IoT Architecture for Privacy-Preserving and Latency-Optimized Healthcare in Pandemic and Critical Care Scenarios, https://arxiv.org/abs/2507.15859
Dakota Sullivan, Shirley Zhang, Jennica Li, Heather Kirkorian, Bilge Mutlu, Kassem Fawaz, 22 Jul 2025, Benchmarking LLM Privacy Recognition for Social Robot Decision Making, https://arxiv.org/abs/2507.16124
Tanusree Sharma, Yihao Zhou, Visar Berisha, 22 Jul 2025, PRAC3 (Privacy, Reputation, Accountability, Consent, Credit, Compensation): Long Tailed Risks of Voice Actors in AI Data-Economy, https://arxiv.org/abs/2507.16247
Tian Dong, Yan Meng, Shaofeng Li, Guoxing Chen, Zhen Liu, Haojin Zhu, 22 Jul 2025, Depth Gives a False Sense of Privacy: LLM Internal States Inversion, https://arxiv.org/abs/2507.16372
Ryusei Fujimoto, Yugo Nakamura, Yutaka Arakawa, 24 Jul 2025, C-AAE: Compressively Anonymizing Autoencoders for Privacy-Preserving Activity Recognition in Healthcare Sensor Streams, https://arxiv.org/abs/2507.18072
Tevin Atwal, Chan Nam Tieu, Yefeng Yuan, Zhan Shi, Yuhong Liu, Liang Cheng, 24 Jul 2025, Privacy-Preserving Synthetic Review Generation with Diverse Writing Styles Using LLMs, https://arxiv.org/abs/2507.18055
Nikola Pavlovic, Sudeep Salgia, Qing Zhao, 18 Jul 2025, Differential Privacy in Kernelized Contextual Bandits via Random Projections, https://arxiv.org/abs/2507.13639
Daniel Commey, Benjamin Appiah, Griffith S. Klogo, and Garth V. Crosby, 18 Jul 2025, ZKP-FedEval: Verifiable and Privacy-Preserving Federated Evaluation using Zero-Knowledge Proofs, https://arxiv.org/abs/2507.11649
Efe Bozkir and S\"uleyman \"Ozdel and Mengdi Wang and Brendan David-John and Hong Gao and Kevin Butler and Eakta Jain and Enkelejda Kasneci, 18 Jul 2025, Eye-tracked Virtual Reality: A Comprehensive Survey on Methods and Privacy Challenges, https://arxiv.org/abs/2305.14080
Matteo Boglioni and Terrance Liu and Andrew Ilyas and Zhiwei Steven Wu, 21 Jul 2025, Optimizing Canaries for Privacy Auditing with Metagradient Descent, https://arxiv.org/abs/2507.15836
Wenxuan Zeng, Tianshi Xu, Yi Chen, Yifan Zhou, Mingzhe Zhang, Jin Tan, Cheng Hong, Meng Li, 19 Jul 2025, Towards Efficient Privacy-Preserving Machine Learning: A Systematic Review from Protocol, Model, and System Perspectives, https://arxiv.org/abs/2507.14519
Juntao Tan, Lan Zhang, Zhonghao Hu, Kai Yang, Peng Ran, Bo Li, 19 Jul 2025, VMask: Tunable Label Privacy Protection for Vertical Federated Learning via Layer Masking, https://arxiv.org/abs/2507.14629
Khoa Nguyen, Tanveer Khan, Antonis Michalas, 20 Jul 2025, A Privacy-Centric Approach: Scalable and Secure Federated Learning Enabled by Hybrid Homomorphic Encryption, https://arxiv.org/abs/2507.14853
Tanusree Sharma, Yu-Yun Tseng, Lotus Zhang, Ayae Ide, Kelly Avery Mack, Leah Findlater, Danna Gurari, Yang Wang, 19 Jul 2025, "Before, I Asked My Mom, Now I Ask ChatGPT": Visual Privacy Management with Generative AI for Blind and Low-Vision People, https://arxiv.org/abs/2507.00286
Wenkai Li, Liwen Sun, Zhenxiang Guan, Xuhui Zhou, Maarten Sap, 11 Aug 2025, 1-2-3 Check: Enhancing Contextual Privacy in LLM via Multi-Agent Reasoning, https://arxiv.org/abs/2508.07667
Andrey Sidorenko and Paul Tiwald, 8 Aug 2025, Privacy-Preserving Tabular Synthetic Data Generation Using TabularARGN, https://arxiv.org/abs/2508.06647
Yueyang Quan, Chang Wang, Shengjie Zhai, Minghong Fang, Zhuqing Liu, 10 Aug 2025, Enhancing Privacy in Decentralized Min-Max Optimization: A Differentially Private Approach, https://arxiv.org/abs/2508.07505
Chenchen Lin, Xuehe Wang, 11 Aug 2025, Multi-Hop Privacy Propagation for Differentially Private Federated Learning in Social Networks, https://arxiv.org/abs/2508.07676
Juan Zambrano, Cl\'ement Contet, Jairo Gudi\~no, Felipe Garrido-Lucero, Umberto Grandi, Cesar A Hidalgo, 7 Aug 2025, Leveraging LLMs for Privacy-Aware Predictions in Participatory Budgeting, https://arxiv.org/abs/2508.06577
William Zerong Wang and Dongfang Zhao, 9 Aug 2025, Balancing Privacy and Efficiency: Music Information Retrieval via Additive Homomorphic Encryption, https://arxiv.org/abs/2508.07044
Dawood Wasif, Dian Chen, Sindhuja Madabushi, Nithin Alluru, Terrence J. Moore, Jin-Hee Cho, 9 Aug 2025, Empirical Analysis of Privacy-Fairness-Accuracy Trade-offs in Federated Learning: A Step Towards Responsible AI, https://arxiv.org/abs/2503.16233
Xingke Yang and Liang Li and Zhiyi Wan and Sicong Li and Xiaoqi Qi and Jiang Liu and Tomoaki Ohtsuki and Xin Fu and Miao Pan, 9 Aug 2025, PAE MobiLLM: Privacy-Aware and Efficient LLM Fine-Tuning on the Mobile Device via Additive Side-Tuning, https://arxiv.org/abs/2507.01216
Kaveen Hiniduma, Zilinghan Li, Aditya Sinha, Ravi Madduri, Suren Byna, 11 Aug 2025, CADRE: Customizable Assurance of Data Readiness in Privacy-Preserving Federated Learning, https://arxiv.org/abs/2505.23849
Md Rakibul Hasan, Md Zakir Hossain, Aneesh Krishna, Shafin Rahman, Tom Gedeon, 9 Aug 2025, TFMPathy: Tabular Foundation Model for Privacy-Aware, Generalisable Empathy Detection from Videos, https://arxiv.org/abs/2504.10808
Nomaan A. Kherani, Urbashi Mitra, 26 Jul 2025, ModShift: Model Privacy via Designed Shifts, https://arxiv.org/abs/2507.20060
Yaxin Xiao and Qingqing Ye and Li Hu and Huadi Zheng and Haibo Hu and Zi Liang and Haoyang Li and Yijie Jiao, 28 Jul 2025, Reminiscence Attack on Residuals: Exploiting Approximate Machine Unlearning for Privacy, https://arxiv.org/abs/2507.20573
Ivoline Ngong, Swanand Kadhe, Hao Wang, Keerthiram Murugesan, Justin D. Weisz, Amit Dhurandhar, Karthikeyan Natesan Ramamurthy, 28 Jul 2025, Protecting Users From Themselves: Safeguarding Contextual Privacy in Interactions with Conversational Agents, https://arxiv.org/abs/2502.18509
Abdullah Al Siam and Sadequzzaman Shohan, 17 May 2025, Privacy-Preserving AI for Encrypted Medical Imaging: A Framework for Secure Diagnosis and Learning, https://arxiv.org/abs/2507.21060
Chenhao Fang, Yanqing Peng, Rajeev Rao, Matt Sarmiento, Wendy Summer, Arya Pudota, Alex Goncalves, Jordi Mola, Herv\'e Robert, 23 Jul 2025, Privacy Artifact ConnecTor (PACT): Embedding Enterprise Artifacts for Compliance AI Agents, https://arxiv.org/abs/2507.21142
Yuetian Chen, Zhiqi Wang, Nathalie Baracaldo, Swanand Ravindra Kadhe, Lei Yu, 31 Jul 2025, Evaluating the Dynamics of Membership Privacy in Deep Learning, https://arxiv.org/abs/2507.23291
Abhishek Sawaika, Swetang Krishna, Tushar Tomar, Durga Pritam Suggisetti, Aditi Lal, Tanmaya Shrivastav, Nouhaila Innan, Muhammad Shafique, 15 Jul 2025, A Privacy-Preserving Federated Framework with Hybrid Quantum-Enhanced Learning for Financial Fraud Detection, https://arxiv.org/abs/2507.22908
Jiajie He, Yuechun Gu, Keke Chen, 24 Jul 2025, RecPS: Privacy Risk Scoring for Recommender Systems, https://arxiv.org/abs/2507.18365
Shreyansh Pathak, Sonu Shreshtha, Richa Singh, Mayank Vatsa, 29 Jul 2025, Quantum-Inspired Audio Unlearning: Towards Privacy-Preserving Voice Biometrics, https://arxiv.org/abs/2507.22208
Xiaojin Zhang, Wei Chen, 30 Jul 2025, Bridging Privacy and Robustness for Trustworthy Machine Learning, https://arxiv.org/abs/2403.16591
Javier Mu\~noz-Haro, Ruben Tolosana, Ruben Vera-Rodriguez, Aythami Morales, Julian Fierrez, 1 Aug 2025, FakeIDet: Exploring Patches for Privacy-Preserving Fake ID Detection, https://arxiv.org/abs/2504.07761
Tianpei Lu, Bingsheng Zhang, Lekun Peng, Bowen Zheng, Lichun Li, Kui Ren, 3 Aug 2025, Privacy-Preserving Inference for Quantized BERT Models, https://arxiv.org/abs/2508.01636
Runkai Zheng, Vishnu Asutosh Dasu, Yinong Oliver Wang, Haohan Wang, Fernando De la Torre, 3 Aug 2025, Improving Noise Efficiency in Privacy-preserving Dataset Distillation, https://arxiv.org/abs/2508.01749
Jan Schuchardt, Mina Dalirrooyfard, Jed Guzelkabaagac, Anderson Schneider, Yuriy Nevmyvaka, Stephan G\"unnemann, 4 Aug 2025, Privacy Amplification by Structured Subsampling for Deep Differentially Private Time Series Forecasting, https://arxiv.org/abs/2502.02410
Xinwei Liu, Xiaojun Jia, Yuan Xun, Simeng Qin, Xiaochun Cao, 5 Aug 2025, GeoShield: Safeguarding Geolocation Privacy from Vision-Language Models via Adversarial Perturbations, https://arxiv.org/abs/2508.03209
Mengyu Zhang, Zhuotao Liu, Jingwen Huang, Xuanqi Liu, 30 Jul 2025, Agentic Privacy-Preserving Machine Learning, https://arxiv.org/abs/2508.02836
Xin Yang, Omid Ardakanian, 5 Aug 2025, PrivDiffuser: Privacy-Guided Diffusion Model for Data Obfuscation in Sensor Networks, https://arxiv.org/abs/2412.14499
Chongyu Bao, Ruimin Dai, Yangbo Shen, Runyang Jian, Jinghan Zhang, Xiaolan Liu, Kunpeng Liu, 6 Aug 2025, Galaxy: A Cognition-Centered Framework for Proactive, Privacy-Preserving, and Self-Evolving LLM Agents, https://arxiv.org/abs/2508.03991
Dhruv Sarkar, Nishant Pandey, Sayak Ray Chowdhury, 5 Aug 2025, DP-NCB: Privacy Preserving Fair Bandits, https://arxiv.org/abs/2508.03836
Ajesh Koyatan Chathoth, Shuhao Yu, Stephen Lee, 6 Aug 2025, Dynamic User-controllable Privacy-preserving Few-shot Sensing Framework, https://arxiv.org/abs/2508.03989
Haoran Niu and K. Suzanne Barber, 6 Aug 2025, Privacy Risk Predictions Based on Fundamental Understanding of Personal Data and an Evolving Threat Landscape, https://arxiv.org/abs/2508.04542
Yubo Wang and Min Tang and Nuo Shen and Shujie Cui and Weiqing Wang, 20 Jul 2025, Privacy Risks of LLM-Empowered Recommender Systems: An Inversion Attack Perspective, https://arxiv.org/abs/2508.03703
Fardis Nadimi, Payam Abdisarabshali, Kasra Borazjani, Jacob Chakareski, Seyyedali Hosseinalipour, 5 Aug 2025, Multi-Modal Multi-Task Federated Foundation Models for Next-Generation Extended Reality Systems: Towards Privacy-Preserving Distributed Intelligence in AR/VR/MR, https://arxiv.org/abs/2506.05683
Chengxi Li, Ming Xiao, Mikael Skoglund, 6 Aug 2025, Adaptive Coded Federated Learning: Privacy Preservation and Straggler Mitigation, https://arxiv.org/abs/2403.14905
Haotian Ma, Lin Gu, Siyi Wu, Yingying Zhu, 6 Aug 2025, Computation-Efficient and Recognition-Friendly 3D Point Cloud Privacy Protection, https://arxiv.org/abs/2503.15818
Suqing Liu, Xuan Bi, Tianxi Li, 7 Aug 2025, GRAND: Graph Release with Assured Node Differential Privacy, https://arxiv.org/abs/2507.00402
Ce Na, Kai Yang, Dengzhao Fang, Yu Li, Jingtong Gao, Chengcheng Zhu, Jiale Zhang, Xiaobing Sun, Yi Chang, 8 Aug 2025, Graph Federated Learning for Personalized Privacy Recommendation, https://arxiv.org/abs/2508.06208
Alejandro Moreno R., Desale Fentaw, Samuel Palmer, Ra\'ul Salles de Padua, Ninad Dixit, Samuel Mugel, Roman Or\'us, Manuel Radons, Josef Menter, and Ali Abedi, 8 Aug 2025, Synthetic Data Generation and Differential Privacy using Tensor Networks' Matrix Product States (MPS), https://arxiv.org/abs/2508.06251
Junhyeog Yun, Minui Hong, Gunhee Kim, 8 Aug 2025, FedMeNF: Privacy-Preserving Federated Meta-Learning for Neural Fields, https://arxiv.org/abs/2508.06301
Zhihao Yao, Yuxuan Gu, Xiachong Feng, Weitao Ma, Bo Li, Xiaocheng Feng, 8 Aug 2025, Adaptive Backtracking for Privacy Protection in Large Language Models, https://arxiv.org/abs/2508.06087
Yuzhou Nie, Zhun Wang, Ye Yu, Xian Wu, Xuandong Zhao, Wenbo Guo, Dawn Song, 8 Aug 2025, LeakAgent: RL-based Red-teaming Agent for LLM Privacy Leakage, https://arxiv.org/abs/2412.05734
Zane Witherspoon, Thet Mon Aye, YingYing Hao, 12 Aug 2025, Can We Trust AI to Govern AI? Benchmarking LLM Performance on Privacy and AI Governance Exams, https://arxiv.org/abs/2508.09036
Ratun Rahman, 12 Aug 2025, Federated Learning: A Survey on Privacy-Preserving Collaborative Intelligence, https://arxiv.org/abs/2504.17703
Abdolazim Rezaei, Mehdi Sookhak, Mahboobeh Haghparast, 7 Aug 2025, RL-MoE: An Image-Based Privacy Preserving Approach In Intelligent Transportation System, https://arxiv.org/abs/2508.09186
Nick Oh, Giorgos D. Vrakas, Si\^an J. M. Brooke, Sasha Morini\`ere, Toju Duke, 12 Aug 2025, PETLP: A Privacy-by-Design Pipeline for Social Media Data in AI Research, https://arxiv.org/abs/2508.09232
Zhifan Luo, Shuo Shao, Su Zhang, Lijing Zhou, Yuke Hu, Chenxu Zhao, Zhihao Liu, Zhan Qin, 13 Aug 2025, Shadow in the Cache: Unveiling and Mitigating Privacy Risks of KV-cache in LLM Inference, https://arxiv.org/abs/2508.09442
Javier Mu\~noz-Haro and Ruben Tolosana and Ruben Vera-Rodriguez and Aythami Morales and Julian Fierrez, 14 Aug 2025, Privacy-Aware Detection of Fake Identity Documents: Methodology, Benchmark, and Improved Detection Methods (FakeIDet2), https://arxiv.org/abs/2508.11716
Xiaojin Zhang, Mingcong Xu, Yiming Li, Wei Chen, Qiang Yang, 16 Aug 2025, Deciphering the Interplay between Attack and Protection Complexity in Privacy-Preserving Federated Learning, https://arxiv.org/abs/2508.11907
Jinyu Lu, Xinrong Sun, Yunting Tao, Tong Ji, Fanyu Kong, Guoqiang Yang, 18 Aug 2025, Efficient and Verifiable Privacy-Preserving Convolutional Computation for CNN Inference with Untrusted Clouds, https://arxiv.org/abs/2508.12832
Daniel M. Jimenez-Gutierrez, Yelizaveta Falkouskaya, Jose L. Hernandez-Ramos, Aris Anagnostopoulos, Ioannis Chatzigiannakis, Andrea Vitaletti, 19 Aug 2025, On the Security and Privacy of Federated Learning: A Survey with Attacks, Defenses, Frameworks, Applications, and Future Directions, https://arxiv.org/abs/2508.13730
Salman Habib, Remi Chou, Taejoon Kim, 21 Aug 2025, Stabilization of Perturbed Loss Function: Differential Privacy without Gradient Noise, https://arxiv.org/abs/2508.15523
Michael Sun, Tai Vu, Andrew Wang, 12 Aug 2025, Privacy Preserving Inference of Personalized Content for Out of Matrix Users, https://arxiv.org/abs/2508.14905
Ruyi Ding, Tianhong Xu, Xinyi Shen, Aidong Adam Ding, Yunsi Fei, 20 Aug 2025, MoEcho: Exploiting Side-Channel Attacks to Compromise User Privacy in Mixture-of-Experts LLMs, https://arxiv.org/abs/2508.15036
Aishik Mandal, Tanmoy Chakraborty, Iryna Gurevych, 22 Aug 2025, Towards Privacy-aware Mental Health AI Models: Advances, Challenges, and Opportunities, https://arxiv.org/abs/2502.00451
Muhammet Anil Yagiz, Zeynep Sude Cengiz, Polat Goktas, 24 Aug 2025, MetaFed: Advancing Privacy, Performance, and Sustainability in Federated Metaverse Systems, https://arxiv.org/abs/2508.17341
GodsGift Uzor, Hasan Al-Qudah, Ynes Ineza, Abdul Serwadda, 22 Aug 2025, Guarding Your Conversations: Privacy Gatekeepers for Secure Interactions with Cloud-Based AI Models, https://arxiv.org/abs/2508.16765
Jiale Liu, Jiahao Zhang, Suhang Wang, 24 Aug 2025, Exposing Privacy Risks in Graph Retrieval-Augmented Generation, https://arxiv.org/abs/2508.17222
Carlos Soto, 23 Aug 2025, Rao Differential Privacy, https://arxiv.org/abs/2508.17135
Xiaoyu Luo, Qiongxiu Li, 22 Aug 2025, DeMem: Privacy-Enhanced Robust Adversarial Learning via De-Memorization, https://arxiv.org/abs/2412.05767
Nicolas Johansson (1), Tobias Olsson (1), Daniel Nilsson (2), Johan \"Ostman (2), Fazeleh Hoseini (2) ((1) Chalmers University of Technology, (2) AI Sweden), 4 Sep 2025, Privacy Risks in Time Series Forecasting: User- and Record-Level Membership Inference, https://arxiv.org/abs/2509.04169
Qifeng Tan, Shusen Yang, Xuebin Ren, Yikai Zhang (Xi'an Jiaotong University), 4 Sep 2025, Rethinking Layer-wise Gaussian Noise Injection: Bridging Implicit Objectives and Privacy Budget Allocation, https://arxiv.org/abs/2509.04232
Yaohong Yang, Aki Rehn, Sammie Katt, Antti Honkela, Samuel Kaski, 4 Sep 2025, An Interactive Framework for Finding the Optimal Trade-off in Differential Privacy, https://arxiv.org/abs/2509.04290
Shokichi Takakura, Seng Pei Liew, Satoshi Hasegawa, 5 Sep 2025, Optimal Variance and Covariance Estimation under Differential Privacy in the Add-Remove Model and Beyond, https://arxiv.org/abs/2509.04919
Zijian Wang, Wei Tong, Tingxuan Han, Haoyu Chen, Tianling Zhang, Yunlong Mao, Sheng Zhong, 5 Sep 2025, On Evaluating the Poisoning Robustness of Federated Learning under Local Differential Privacy, https://arxiv.org/abs/2509.05265
Francesco Diana, Andr\'e Nusser, Chuan Xu, Giovanni Neglia, 5 Sep 2025, Cutting Through Privacy: A Hyperplane-Based Data Reconstruction Attack in Federated Learning, https://arxiv.org/abs/2505.10264
Jiahao Xu, Rui Hu, Olivera Kotevska, 5 Sep 2025, Optimal Client Sampling in Federated Learning with Client-Level Heterogeneous Differential Privacy, https://arxiv.org/abs/2505.13655
Yang Li, Hanjie Wang, Yuanzheng Li, Jiazheng Li, Zhaoyang Dong, 24 Aug 2025, ZTFed-MAS2S: A Zero-Trust Federated Learning Framework with Verifiable Privacy and Trust-Aware Aggregation for Wind Power Data Imputation, https://arxiv.org/abs/2508.18318
Zhibo Xu, Jianhao Zhu, Jingwen Xu, Changze Lv, Zisu Huang, Xiaohua Wang, Muling Wu, Qi Qian, Xiaoqing Zheng, Xuanjing Huang, 26 Aug 2025, Enhancing Model Privacy in Federated Learning with Random Masking and Quantization, https://arxiv.org/abs/2508.18911
Joshua Lee, Ali Arastehfard, Weiran Liu, Xuegang Ban, Yuan Hong, 26 Aug 2025, SecureV2X: An Efficient and Privacy-Preserving System for Vehicle-to-Everything (V2X) Applications, https://arxiv.org/abs/2508.19115
Yusi Wei, Hande Y. Benson, and Muge Capan, 25 Aug 2025, An Analytical Approach to Privacy and Performance Trade-Offs in Healthcare Data Sharing, https://arxiv.org/abs/2508.18513
Shaojie Bai, Mohammad Sadegh Talebi, Chengcheng Zhao, Peng Cheng, and Jiming Chen, 26 Aug 2025, Secure Reinforcement Learning via Shuffle Privacy Model, https://arxiv.org/abs/2411.11647
Mahdi Haghifam, Adam Smith, Jonathan Ullman, 26 Aug 2025, The Sample Complexity of Membership Inference and Privacy Auditing, https://arxiv.org/abs/2508.19458
Zhan Shi, Yefeng Yuan, Yuhong Liu, Liang Cheng, Yi Fang, 25 Aug 2025, RL-Finetuned LLMs for Privacy-Preserving Synthetic Rewriting, https://arxiv.org/abs/2508.19286
Grzegorz Skorupko, Fotios Avgoustidis, Carlos Mart\'in-Isla, Lidia Garrucho, Dimitri A. Kessler, Esmeralda Ruiz Pujadas, Oliver D\'iaz, Maciej Bobowicz, Katarzyna Gwo\'zdziewicz, Xavier Bargall\'o, Paulius Jaru\v{s}evi\v{c}ius, Richard Osuala, Kaisar Kushibar and Karim Lekadir, 28 Aug 2025, Federated nnU-Net for Privacy-Preserving Medical Image Segmentation, https://arxiv.org/abs/2503.02549
Joshua Ward, Chi-Hua Wang, Guang Cheng, 28 Aug 2025, Privacy Auditing Synthetic Data Release through Local Likelihood Attacks, https://arxiv.org/abs/2508.21146
Tobias Hyrup, Emmanouil Panagiotou, Arjun Roy, Arthur Zimek, Eirini Ntoutsi, Peter Schneider-Kamp, 29 Aug 2025, Achieving Hilbert-Schmidt Independence Under R\'enyi Differential Privacy for Fair and Private Data Generation, https://arxiv.org/abs/2508.21815
Masahiro Hayashitani, Junki Mori, and Isamu Teranishi, 29 Aug 2025, Survey of Privacy Threats and Countermeasures in Federated Learning, https://arxiv.org/abs/2402.00342
Timur Sattarov, Marco Schreyer, Damian Borth, 29 Aug 2025, Federated Diffusion Modeling with Differential Privacy for Tabular Data Synthesis, https://arxiv.org/abs/2412.16083
Rui Zhao, Vladyslav Melnychuk, Jun Zhao, Jesse Wright, Nigel Shadbolt, 1 Sep 2025, An LLM-enabled semantic-centric framework to consume privacy policies, https://arxiv.org/abs/2509.01716
Arun Vignesh Malarkkan, Haoyue Bai, Anjali Kaushik, and Yanjie Fu, 31 Aug 2025, DELTA: Variational Disentangled Learning for Privacy-Preserving Data Reprogramming, https://arxiv.org/abs/2509.00693
Wei Huang, Anda Cheng, Zhao Zhang, Yinggui Wang, 1 Sep 2025, DPF-CM: A Data Processing Framework with Privacy-Preserving Vector Databases for Chinese Medical LLMs Training and Deployment, https://arxiv.org/abs/2509.01354
Yi Yin, Guangquan Zhang, Hua Zuo, and Jie Lu, 2 Sep 2025, Privacy-Utility Trade-off in Data Publication: A Bilevel Optimization Framework with Curvature-Guided Perturbation, https://arxiv.org/abs/2509.02048
Narasimha Raghavan Veeraragavan, Jan Franz Nyg{\aa}rd, 30 Aug 2025, Federated Survival Analysis with Node-Level Differential Privacy: Private Kaplan-Meier Curves, https://arxiv.org/abs/2509.00615
Honghui Xu, Kaiyang Li, Wei Chen, Danyang Zheng, Zhiyuan Li, Zhipeng Cai, 2 Sep 2025, A Survey: Towards Privacy and Security in Mobile Large Language Models, https://arxiv.org/abs/2509.02411
Jianwei Wang, Chengming Shi, Junyao Yang, Haoran Li, Qianli Ma, Huiping Zhuang, Cen Chen and Ziqian Zeng, 31 Aug 2025, RewardDS: Privacy-Preserving Fine-Tuning for Large Language Models via Reward Driven Data Synthesis, https://arxiv.org/abs/2502.18517
Moontaha Nishat Chowdhury, Andr\'e Bauer, Minxuan Zhou, 3 Sep 2025, Efficient Privacy-Preserving Recommendation on Sparse Data using Fully Homomorphic Encryption, https://arxiv.org/abs/2509.03024
Napsu Karmitsa, Antti Airola, Tapio Pahikkala, Tinja Pitk\"am\"aki, 3 Sep 2025, A Comprehensive Guide to Differential Privacy: From Theory to User Expectations, https://arxiv.org/abs/2509.03294
Syomantak Chaudhuri, Thomas A. Courtade, 2 Sep 2025, Managing Correlations in Data and Privacy Demand, https://arxiv.org/abs/2509.02856
Ayoub Si-ahmed, Mohammed Ali Al-Garadi, Narhimene Boustia, 2 Sep 2025, Explainable Machine Learning-Based Security and Privacy Protection Framework for Internet of Medical Things Systems, https://arxiv.org/abs/2403.09752
Cheng Qian, Hainan Zhang, Yongxin Tong, Hong-Wei Zheng, Zhiming Zheng, 8 Sep 2025, HyFedRAG: A Federated Retrieval-Augmented Generation Framework for Heterogeneous and Privacy-Sensitive Data, https://arxiv.org/abs/2509.06444
Ismail Hossain, Sai Puppala, Sajedul Talukder, Md Jahangir Alam, 4 Sep 2025, AI-in-the-Loop: Privacy Preserving Real-Time Scam Detection and Conversational Scambaiting by Leveraging LLMs and Federated Learning, https://arxiv.org/abs/2509.05362
Abdul Rehman, Are D{\ae}hlen, Ilona Heldal, Jerry Chun-wei Lin, 4 Sep 2025, Privacy Preservation and Identity Tracing Prevention in AI-Driven Eye Tracking for Interactive Learning Environments, https://arxiv.org/abs/2509.05376
Jennifer King, Kevin Klyman, Emily Capstick, Tiffany Saade, Victoria Hsieh, 5 Sep 2025, User Privacy and Large Language Models: An Analysis of Frontier Developers' Privacy Policies, https://arxiv.org/abs/2509.05382
Waris Gill, Natalie Isak and Matthew Dressman, 6 Sep 2025, Cross-Service Threat Intelligence in LLM Services using Privacy-Preserving Fingerprints, https://arxiv.org/abs/2509.05608
Ikhlasse Badidi, Nouhaila El Khiyaoui, Aya Riany, Badr Ben Elallid, Amine Abouaomar, 30 Aug 2025, Privacy-Preserving Offloading for Large Language Models in 6G Vehicular Networks, https://arxiv.org/abs/2509.05320
Qin Yang, Nicholas Stout, Meisam Mohammady, Han Wang, Ayesha Samreen, Christopher J Quinn, Yan Yan, Ashish Kundu, Yuan Hong, 8 Sep 2025, PLRV-O: Advancing Differentially Private Deep Learning via Privacy Loss Random Variable Optimization, https://arxiv.org/abs/2509.06264
Wenhan Dong, Chao Lin, Xinlei He, Shengmin Xu, Xinyi Huang, 6 Sep 2025, Privacy-Preserving Federated Learning via Homomorphic Adversarial Networks, https://arxiv.org/abs/2412.01650
Nicol\`o Romandini, Carlo Mazzocca, Kai Otsuki, Rebecca Montanari, 8 Sep 2025, SoK: Security and Privacy of AI Agents for Blockchain, https://arxiv.org/abs/2509.07131
Tom\'as Gonz\'alez, Mateo Dulce-Rubio, Aaditya Ramdas, M\'onica Ribero, 8 Sep 2025, Sequentially Auditing Differential Privacy, https://arxiv.org/abs/2509.07055
Hailong Yang, Renhuo Zhao, Guanjin Wang and Zhaohong Deng, 12 Sep 2025, GAMA: A General Anonymizing Multi-Agent System for Privacy Preservation Enhanced by Domain Rules and Disproof Method, https://arxiv.org/abs/2509.10018
Francisco Javier Esono Nkulu Andong and Qi Min, 12 Sep 2025, Federated Multi-Agent Reinforcement Learning for Privacy-Preserving and Energy-Aware Resource Management in 6G Edge Networks, https://arxiv.org/abs/2509.10163
Nojan Sheybani, Alessandro Pegoraro, Jonathan Knauer, Phillip Rieger, Elissa Mollakuqe, Farinaz Koushanfar, Ahmad-Reza Sadeghi, 11 Sep 2025, ZORRO: Zero-Knowledge Robustness and Privacy for Split Learning (Full Version), https://arxiv.org/abs/2509.09787
Zhanhong Jiang, Md Zahid Hasan, Nastaran Saadati, Aditya Balu, Chao Liu, Soumik Sarkar, 12 Sep 2025, Balancing Utility and Privacy: Dynamically Private SGD with Random Projection, https://arxiv.org/abs/2509.09485
Vincent C. M\"uller, 30 Aug 2025, Deep opacity and AI: A threat to XAI and to privacy protection mechanisms, https://arxiv.org/abs/2509.08835
Honglan Yu, Yibin Wang, Feifei Dai, Dong Liu, Haihui Fan, Xiaoyan Gu, 11 Sep 2025, Towards Confidential and Efficient LLM Inference with Dual Privacy Protection, https://arxiv.org/abs/2509.09091
Honghui Xu, Shiva Shrestha, Wei Chen, Zhiyuan Li, Zhipeng Cai, 11 Sep 2025, DP-FedLoRA: Privacy-Enhanced Federated Fine-Tuning for On-Device Large Language Models, https://arxiv.org/abs/2509.09097
Osama Zafar, Mina Namazi, Yuqiao Xu, Youngjin Yoo, Erman Ayday, 11 Sep 2025, A User-Centric, Privacy-Preserving, and Verifiable Ecosystem for Personal Data Management and Utilization, https://arxiv.org/abs/2506.22606
Pol G. Recasens and \'Ad\'am Horv\'ath and Alberto Gutierrez-Torre and Jordi Torres and Josep Ll.Berral and Bal\'azs Pej\'o, 19 Sep 2025, FRIDA: Free-Rider Detection using Privacy Attacks, https://arxiv.org/abs/2410.05020
Hilda Hadan, Reza Hadi Mogavi, Leah Zhang-Kennedy, Lennart E. Nacke, 18 Sep 2025, Who is Responsible When AI Fails? Mapping Causes, Entities, and Consequences of AI Privacy and Ethical Incidents, https://arxiv.org/abs/2504.01029
Bihao Zhan, Jie Zhou, Junsong Li, Yutao Yang, Shilian Chen, Qianjun Pan, Xin Li, Wen Wu, Xingjiao Wu, Qin Chen, Hang Yan, Liang He, 16 Sep 2025, Forget What's Sensitive, Remember What Matters: Token-Level Differential Privacy in Memory Sculpting for Continual Learning, https://arxiv.org/abs/2509.12958
Binquan Guo, Junteng Cao, Marie Siew, Binbin Chen, Tony Q. S. Quek, Zhu Han, 5 Sep 2025, Accelerating Privacy-Preserving Federated Learning in Large-Scale LEO Satellite Systems, https://arxiv.org/abs/2509.12222
Rodrigo Tertulino, 3 Sep 2025, Privacy-Preserving Personalization in Education: A Federated Recommender System for Student Performance Prediction, https://arxiv.org/abs/2509.10516
Muhammad H. Ashiq, Peter Triantafillou, Hung Yun Tseng, Grigoris G. Chrysos, 15 Sep 2025, Inducing Uncertainty for Test-Time Privacy, https://arxiv.org/abs/2509.11625
Madhava Gaikwad, 10 Sep 2025, AVEC: Bootstrapping Privacy for Local LLMs, https://arxiv.org/abs/2509.10561
Fardin Jalil Piran, Zhiling Chen, Yang Zhang, Qianyu Zhou, Jiong Tang, Farhad Imani, 12 Sep 2025, Privacy-Preserving Decentralized Federated Learning via Explainable Adaptive Differential Privacy, https://arxiv.org/abs/2509.10691
Hyeju Shin, Vincent-Daniel, Kyudan Jung, Seongwon Yun, 13 Sep 2025, Fast Fourier Transform-Based Spectral and Temporal Gradient Filtering for Differential Privacy, https://arxiv.org/abs/2505.04468
Xingchen Wang, Feijie Wu, Chenglin Miao, Tianchun Li, Haoyu Hu, Qiming Cao, Jing Gao, Lu Su, 18 Sep 2025, Towards Privacy-Preserving and Heterogeneity-aware Split Federated Learning via Probabilistic Masking, https://arxiv.org/abs/2509.14603
Nobin Sarwar, Shubhashis Roy Dipta, 16 Sep 2025, FedMentor: Domain-Aware Differential Privacy for Heterogeneous Federated LLMs in Mental Health, https://arxiv.org/abs/2509.14275
Yuntao Du, Zitao Li, Ninghui Li, Bolin Ding, 16 Sep 2025, Beyond Data Privacy: New Privacy Risks for Large Language Models, https://arxiv.org/abs/2509.14278
Vaidehi Patil, Elias Stengel-Eskin, Mohit Bansal, 16 Sep 2025, The Sum Leaks More Than Its Parts: Compositional Privacy Risks and Mitigations in Multi-Agent Collaboration, https://arxiv.org/abs/2509.14284
Ramazan Yener, Guan-Hung Chen, Ece Gumusel, Masooda Bashir, 18 Sep 2025, Can I Trust This Chatbot? Assessing User Privacy in AI-Healthcare Chatbot Applications, https://arxiv.org/abs/2509.14581
Linfeng Luo, Zhiqi Guo, Fengxiao Tang, Zihao Qiu, Ming Zhao, 18 Sep 2025, Federated Hypergraph Learning with Local Differential Privacy: Toward Privacy-Aware Hypergraph Structure Completion, https://arxiv.org/abs/2408.05160
Chih Wei Ling, Chun Hei Michael Shiu, Youqi Wu, Jiande Sun, Cheuk Ting Li, Linqi Song, Weitao Xu, 18 Sep 2025, Communication-Efficient and Privacy-Adaptable Mechanism for Federated Learning, https://arxiv.org/abs/2501.12046
Avais Jan, Qasim Zia, Murray Patterson, 9 Sep 2025, Enhancing Privacy Preservation and Reducing Analysis Time with Federated Transfer Learning in Digital Twins-based Computed Tomography Scan Analysis, https://arxiv.org/abs/2509.08018
Bishnu Bhusal, Manoj Acharya, Ramneet Kaur, Colin Samplawski, Anirban Roy, Adam D. Cobb, Rohit Chadha, Susmit Jha, 17 Sep 2025, Privacy-Aware In-Context Learning for Large Language Models, https://arxiv.org/abs/2509.13625
Zihou Wu (1), Yuecheng Li (1), Tianchi Liao (2), Jian Lou (2), Chuan Chen (1) ((1) School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China (2) School of Software Engineering, Sun Yat-sen University, Zhuhai, China), 17 Sep 2025, ParaAegis: Parallel Protection for Flexible Privacy-preserved Federated Learning, https://arxiv.org/abs/2509.13739
Vijay Kumar Butte, Sujata Butte, 17 Sep 2025, Secure, Scalable and Privacy Aware Data Strategy in Cloud, https://arxiv.org/abs/2509.13627
Ozer Ozturk, Busra Buyuktanir, Gozde Karatas Baydogmus, Kazim Yildiz, 17 Sep 2025, Differential Privacy in Federated Learning: Mitigating Inference Attacks with Randomized Response, https://arxiv.org/abs/2509.13987

More Research on AI Safety

Research papers that cover various other AI safety issues:

J Schuett, N Dreksler, M Anderljung, 2023, Towards best practices in AGI safety and governance: A survey of expert opinion, arXiv preprint, https://arxiv.org/abs/2305.07153
Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A. Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, Shane Legg, Nov 2017, AI Safety Gridworlds, https://arxiv.org/abs/1711.09883
J. Schuett. Risk management in the Artificial Intelligence Act. European Journal of Risk Regulation, pages 1–19, 2023. https://arxiv.org/abs/2212.03109
Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, Dan Mané, July 2016, Concrete Problems in AI Safety, https://arxiv.org/abs/1606.06565
Mark O Riedl and Brent Harrison. 2018. Enter the matrix: A virtual world approach to safely interruptable autonomous systems. arXiv preprint arXiv:1703.10284, 2017 (revised Nov 2018). https://arxiv.org/abs/1703.10284v2
M. Brundage, K. Mayer, T. Eloundou, S. Agarwal, S. Adler, G. Krueger, J. Leike, and P. Mishkin. OpenAI, 2022, Lessons learned on language model safety and misuse. https://openai.com/research/language-model-safety-and-misuse
OpenAI, Feb 2023, Planning for AGI and beyond, https://openai.com/blog/planning-for-agi-and-beyond
Andreas Cebulla, Zygmunt Szpak, Catherine Howell, Genevieve Knight & Sazzad Hussain, 2022, Applying ethics to AI in the workplace: the design of a scorecard for Australian workplace health and safety, Network Research, 13 May 2022, volume 38, pages919–935 (2023) https://link.springer.com/article/10.1007/s00146-022-01460-9
Mohammad Ghavamzadeh, Marek Petrik, and Yinlam Chow. Safe policy improvement by minimizing robust baseline regret. In Advances in Neural Information Processing Systems, pages 2298–2306, 2016. https://arxiv.org/abs/1607.03842v1
Laurent Orseau and Stuart Armstrong. Safely interruptible agents. In Uncertainty in Artificial Intelligence, pages 557–566, 2016. PDF: http://www.auai.org/uai2016/proceedings/papers/68.pdf
Tate Ryan-Mosley, August 14, 2023, AI isn’t great at decoding human emotions. So why are regulators targeting the tech? MIT Technology Review, https://www.technologyreview.com/2023/08/14/1077788/ai-decoding-human-emotions-target-for-regulators/
Maria Korolov, 15 May 2024, 10 things to watch out for with open source gen AI, CIO, https://www.cio.com/article/2104280/10-things-to-watch-out-for-with-open-source-gen-ai.html
Andy Arditi, Oscar Obeso, Aaquib111, wesg, Neel Nanda, 27th Apr 2024, Refusal in LLMs is mediated by a single direction, LessWrong, https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction
Google, Responsible Generative AI Toolkit, Feb 2024, https://ai.google.dev/responsible
Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang, June 2023, Exposing Attention Glitches with Flip-Flop Language Modeling, https://arxiv.org/abs/2306.00946
Jon Christian, Jan 30, 2023, CNET's Article-Writing AI Is Already Publishing Very Dumb Errors, https://futurism.com/cnet-ai-errors
R Dubin, 2023. Disarming Steganography Attacks Inside Neural Network Models, arXiv preprint arXiv:2309.03071, https://arxiv.org/pdf/2309.03071.pdf
Michael O'Neill, Mark Connor, 6 Jul 2023, Amplifying Limitations, Harms and Risks of Large Language Models, https://arxiv.org/abs/2307.04821
Lucas Mearian, 14 Mar 2024, AI hallucination mitigation: two brains are better than one, https://www.computerworld.com/article/1612465/ai-hallucination-mitigation-two-brains-are-better-than-one.html
Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
Laura Manduchi, Kushagra Pandey, Robert Bamler, Ryan Cotterell, Sina Däubener, Sophie Fellenz, Asja Fischer, Thomas Gärtner, Matthias Kirchler, Marius Kloft, Yingzhen Li, Christoph Lippert, Gerard de Melo, Eric Nalisnick, Björn Ommer, Rajesh Ranganath, Maja Rudolph, Karen Ullrich, Guy Van den Broeck, Julia E Vogt, Yixin Wang, Florian Wenzel, Frank Wood, Stephan Mandt, Vincent Fortuin, 28 Feb 2024, On the Challenges and Opportunities in Generative AI, https://arxiv.org/abs/2403.00025
Zhexin Zhang, Yida Lu, Jingyuan Ma, Di Zhang, Rui Li, Pei Ke, Hao Sun, Lei Sha, Zhifang Sui, Hongning Wang, Minlie Huang, 26 Feb 2024, ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors, https://arxiv.org/abs/2402.16444, Code: https://github.com/thu-coai/shieldlm
Peter Dizikes, December 11, 2023, MIT group releases white papers on governance of AI, MIT News, https://news.mit.edu/2023/mit-group-releases-white-papers-governance-ai-1211
MAK Raiaan, MSH Mukta, K Fatema, NM Fahad, 2023 A Review on Large Language Models: Architectures, Applications, Taxonomies, Open Issues and Challenges, https://www.techrxiv.org/articles/preprint/A_Review_on_Large_Language_Models_Architectures_Applications_Taxonomies_Open_Issues_and_Challenges/24171183/1/files/42414054.pdf
Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen3 Ruoxi Jia, Prateek Mittal, Peter Henderson, Oct 2023, Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! https://arxiv.org/abs/2310.03693v1 Code: https://llm-tuning-safety.github.io/
Y Hu, J Setpal, D Zhang, J Zietek, J Lambert, 2023, BoilerBot: A Reliable Task-oriented Chatbot Enhanced with Large Language Models, https://assets.amazon.science/8c/03/80c814a749f58e73a1aeda2ff282/boilerbot-tb2-final-2023.pdf
S Latifi, 2023, Efficient and Dependable Deep Learning Systems Ph.D. Thesis, Computer Science and Engineering, University of Michigan, https://deepblue.lib.umich.edu/bitstream/handle/2027.42/176548/salar_1.pdf?sequence=1
N. Soares. 2023, Comments on OpenAI’s “Planning for AGI and beyond”. https://www.lesswrong.com/posts/uxnjXBwr79uxLkifG
K Ramesh, A Chavan, S Pandit, 2023, A Comparative Study on the Impact of Model Compression Techniques on Fairness in Language Models, Microsoft Research, https://aclanthology.org/2023.acl-long.878.pdf https://www.microsoft.com/en-us/research/uploads/prod/2023/07/3687_Paper.pdf
David Spuler, March 2024, Chapter 43. Overview of AI Research, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
Rithesh Murthy, Liangwei Yang, Juntao Tan, Tulika Manoj Awalgaonkar, Yilun Zhou, Shelby Heinecke, Sachin Desai, Jason Wu, Ran Xu, Sarah Tan, Jianguo Zhang, Zhiwei Liu, Shirley Kokane, Zuxin Liu, Ming Zhu, Huan Wang, Caiming Xiong, Silvio Savarese, 12 Jun 2024, MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases, https://arxiv.org/abs/2406.10290
Shicheng Xu, Liang Pang, Mo Yu, Fandong Meng, Huawei Shen, Xueqi Cheng, Jie Zhou, 12 Jun 2024 (v2), Unsupervised Information Refinement Training of Large Language Models for Retrieval-Augmented Generation, https://arxiv.org/abs/2402.18150 (Analysis about how LLMs can mishandle information retrieved from a datastore and how to make LLMs better at handling RAG information using a specialized training regime.)
OpenAI, Moderation: Learn how to build moderation into your AI applications, 2024, https://platform.openai.com/docs/guides/moderation
Azure, 06/13/2024, Content filtering, https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter?tabs=warning%2Cpython
Yu Wang, Xiaogeng Liu, Yu Li, Muhao Chen, Chaowei Xiao, 14 Mar 2024, AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting, https://arxiv.org/abs/2403.09513 Code: https://github.com/rain305f/AdaShield
Jinhwa Kim, Ali Derakhshan, Ian G. Harris, 31 Oct 2023, Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield, https://arxiv.org/abs/2311.00172
Apple, June 2024, Introducing Apple’s On-Device and Server Foundation Models, https://machinelearning.apple.com/research/introducing-apple-foundation-models (Apple's on-device models feature optimizations including small models, grouped query attention, 2-bit/4-bit quantization including activation quantization, shared embedding/unembedding tensors, small-ish vocabulary size of 49k, an undisclosed efficient KV cache optimization for neural engines, and layer-specific 16-bit LoRA/QLoRA adapters of size "10s of megabytes" for fine-tuned specialized model versions, also sometimes in 2-bit/4-bit, claiming speed rates of 0.6ms/token in prefill, and 30 tokens per second in decoding.)
NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared RELU.)
Frank Chung, June 23, 2024, ‘I need to go outside’: Young people ‘extremely addicted’ as Character.AI explodes, https://www.news.com.au/technology/online/internet/i-need-to-go-outside-young-people-extremely-addicted-as-characterai-explodes/news-story/5780991c61455c680f34b25d5847a341
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, 4 Mar 2022, Training language models to follow instructions with human feedback, https://arxiv.org/abs/2203.02155 (The original 2022 InstructGPT paper from OpenAI.)
Valentina Alto, 2024, Chapter 12: Responsible AI, Building LLM-Powered Applications: Create intelligence apps and agents with large language models, Packt Publishing, https://www.amazon.com/Building-LLM-Apps-Intelligent-Language/dp/1835462316/
Aarushi Kansal, Chapter 4: Guardrails and AI: Building Safe and Controllable Apps, Building Generative AI-Powered Apps: A Hands-on Guide for Developers, Apress, https://www.amazon.com/Building-Generative-AI-Powered-Apps-Hands-ebook/dp/B0CTXXP1S4/
Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri, 26 Jun 2024, WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs, https://arxiv.org/abs/2406.18495
Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
Marko Zivkovic, Aug 06, 2024, Discovered Apple Intelligence prompts show Apple's attempt at preventing AI disaster, https://appleinsider.com/articles/24/08/06/discovered-apple-intelligence-prompts-show-apples-attempt-at-preventing-ai-disaster
Mack DeGeurin, Aug 9, 2024, Researchers worry about AI turning humans into jerks: OpenAI safety researchers think GPT4o could influence 'social norms.', https://www.popsci.com/technology/openai-jerks/
OpenAI, August 8, 2024 GPT-4o System Card, https://openai.com/index/gpt-4o-system-card/
Rohin Shah, Seb Farquhar, Anca Dragan, 21st Aug 2024, AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work, https://www.alignmentforum.org/posts/79BPxvSsjzBkiSyTq/agi-safety-and-alignment-at-google-deepmind-a-summary-of
Shayne Longpre, Stella Biderman, Alon Albalak, Hailey Schoelkopf, Daniel McDuff, Sayash Kapoor, Kevin Klyman, Kyle Lo, Gabriel Ilharco, Nay San, Maribeth Rauh, Aviya Skowron, Bertie Vidgen, Laura Weidinger, Arvind Narayanan, Victor Sanh, David Adelani, Percy Liang, Rishi Bommasani, Peter Henderson, Sasha Luccioni, Yacine Jernite, Luca Soldaini, 26 Jun 2024 (v2), The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources, https://arxiv.org/abs/2406.16746
Thomas Mildner, Orla Cooney, Anna-Maria Meck, Marion Bartl, Gian-Luca Savino, Philip R. Doyle, Diego Garaialde, Leigh Clark, John Sloan, Nina Wenig, Rainer Malaka, Jasmin Niess, 26 Jan 2024, Listening to the Voices: Describing Ethical Caveats of Conversational User Interfaces According to Experts and Frequent Users, Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI '24), May 11--16, 2024, Honolulu, HI, USA, https://arxiv.org/abs/2401.14746 https://doi.org/https://doi.org/10.1145/3613904.3642542
Kyle Wiggers, September 4, 2024, Ilya Sutskever’s startup, Safe Superintelligence, raises $1B, https://techcrunch.com/2024/09/04/ilya-sutskevers-startup-safe-super-intelligence-raises-1b/
Balasubramaniam S. , Vanajaroselin Chirchi, Seifedine Kadry, Moorthy Agoramoorthy, Gururama Senthilvel P., Satheesh Kumar K., and Sivakumar T. A., Oct 2024, The Road Ahead: Emerging Trends, Unresolved Issues, and ConcludingRemarksinGenerativeAI—AComprehensiveReview, International Journal of Intelligent Systems, Volume 2024, Article ID 4013195, 38 pages, https://doi.org/10.1155/2024/4013195 https://www.researchgate.net/profile/Balasubramaniam-s-2/publication/384729387_The_Road_Ahead_Emerging_Trends_Unresolved_Issues_and_Concluding_Remarks_in_Generative_AI-A_Comprehensive_Review/links/6705560cf5eb7108c6e5d261/The-Road-Ahead-Emerging-Trends-Unresolved-Issues-and-Concluding-Remarks-in-Generative-AI-A-Comprehensive-Review.pdf
Xinyi Zeng, Yuying Shang, Yutao Zhu, Jiawei Chen, Yu Tian, 9 Oct 2024, Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level, https://arxiv.org/abs/2410.06809
Michael Nuñez, October 15, 2024, Anthropic just made it harder for AI to go rogue with its updated safety policy, https://venturebeat.com/ai/anthropic-just-made-it-harder-for-ai-to-go-rogue-with-its-updated-safety-policy/
ETO, Apr 2024, The state of global AI safety research, https://eto.tech/blog/state-of-global-ai-safety-research/
Leon Derczynski, Christopher Parisien, Nikki Pope, Michael Boone, Nov 2024, NVIDIA Approaches to AI Trust and Safety: Innovation and Tools, https://www.nvidia.com/en-us/on-demand/session/aisummitdc24-sdc1088/?playlistId=playList-c6a9450c-c790-462d-a058-0bacacd5d370
Y. Bai et al., "Backdoor Attack and Defense on Deep Learning: A Survey," in IEEE Transactions on Computational Social Systems, doi: 10.1109/TCSS.2024.3482723. https://ieeexplore.ieee.org/abstract/document/10744415
OpenAI, November 21, 2024, Advancing red teaming with people and AI, https://openai.com/index/advancing-red-teaming-with-people-and-ai/
Patrick Mineault, Niccolò Zanichelli, Joanne Zichen Peng, Anton Arkhipov, Eli Bingham, Julian Jara-Ettinger, Emily Mackevicius, Adam Marblestone, Marcelo Mattar, Andrew Payne, Sophia Sanborn, Karen Schroeder, Zenna Tavares, Andreas Tolias, 27 Nov 2024, NeuroAI for AI Safety, https://arxiv.org/abs/2411.18526
Maria Korolov and Michael Hill, 03 Dec 2024, 10 most critical LLM vulnerabilities, https://www.csoonline.com/article/575497/owasp-lists-10-most-critical-large-language-model-vulnerabilities.html
Mayank Vatsa, Anubhooti Jain, Richa Singh, 7 Dec 2023, Adventures of Trustworthy Vision-Language Models: A Survey, https://arxiv.org/abs/2312.04231
Yedi Zhang, Yufan Cai, Xinyue Zuo, Xiaokun Luan, Kailong Wang, Zhe Hou, Yifan Zhang, Zhiyuan Wei, Meng Sun, Jun Sun, Jing Sun, Jin Song Dong, 9 Dec 2024, The Fusion of Large Language Models and Formal Methods for Trustworthy AI Agents: A Roadmap, https://arxiv.org/abs/2412.06512
Aditi Bodhankar, Dec 06, 2024, Content Moderation and Safety Checks with NVIDIA NeMo Guardrails, https://developer.nvidia.com/blog/content-moderation-and-safety-checks-with-nvidia-nemo-guardrails/
Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri, 10 Dec 2024, Granite Guardian, https://arxiv.org/abs/2412.07724 https://github.com/ibm-granite/granite-guardian (Open-sourcing of safety models with many capabilities.)
Rhiannon Williams, December 31, 2024, The biggest AI flops of 2024: From chatbots dishing out illegal advice to dodgy AI-generated search results, take a look back over the year’s top AI failures. https://www.technologyreview.com/2024/12/31/1109612/biggest-worst-ai-artificial-intelligence-flops-fails-2024/
James Manyika, Demis Hassabis, Feb 04, 2025, Responsible AI: Our 2024 report and ongoing work, https://blog.google/technology/ai/responsible-ai-2024-report-ongoing-work/
Arjun Kharpal, Feb 6 2025, ‘Dangerous proposition’: Top scientists warn of out-of-control AI, https://www.cnbc.com/2025/02/07/dangerous-proposition-top-scientists-warn-of-out-of-control-ai.html
Vagner Figueredo de Santana, Sara Berger, Tiago Machado, Maysa Malfiza Garcia de Macedo, Cassia Sampaio Sanctos, Lemara Williams, and Zhaoqing Wu. 2025. Can LLMs Recommend More Responsible Prompts? In Proceedings of the 30th International Conference on Intelligent User Interfaces (IUI '25). Association for Computing Machinery, New York, NY, USA, 298–313. https://doi.org/10.1145/3708359.3712137 https://dl.acm.org/doi/full/10.1145/3708359.3712137 https://dl.acm.org/doi/pdf/10.1145/3708359.3712137
Michael Nuñez, July 15, 2025, OpenAI, Google DeepMind and Anthropic sound alarm: ‘We may be losing the ability to understand AI’, https://venturebeat.com/ai/openai-google-deepmind-and-anthropic-sound-alarm-we-may-be-losing-the-ability-to-understand-ai/ (Monitoring the text-based interim "thinking-out-loud" reasoning of models in CoT.)
Tomek Korbak, Mikita Balesni, (and many more authors) July 2025, Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety, https://tomekkorbak.com/cot-monitorability-is-a-fragile-opportunity/cot_monitoring.pdf
Anthropic, 13 Aug 2025, Building Safeguards for Claude, https://www.anthropic.com/news/building-safeguards-for-claude
Gabriel J. Perin, Runjin Chen, Xuxi Chen, Nina S. T. Hirata, Zhangyang Wang, Junyuan Hong, 23 Jul 2025, LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning, https://arxiv.org/abs/2506.15606
Zheng Hui, Yijiang River Dong, Ehsan Shareghi, Nigel Collier, 22 Jul 2025, TRIDENT: Benchmarking LLM Safety in Finance, Medicine, and Law, https://arxiv.org/abs/2507.21134
Haonan Zhang, Dongxia Wang, Yi Liu, Kexin Chen, Jiashui Wang, Xinlei Ying, Long Liu, Wenhai Wang, 15 Aug 2025, ORFuzz: Fuzzing the "Other Side" of LLM Safety -- Testing Over-Refusal, https://arxiv.org/abs/2508.11222
Yuan Gao, Mattia Piccinini, Korbinian Moller, Amr Alanwar, Johannes Betz, 18 Jul 2025, From Words to Collisions: LLM-Guided Evaluation and Adversarial Generation of Safety-Critical Driving Scenarios, https://arxiv.org/abs/2502.02145
Juan Manuel Contreras, 19 Jul 2025, Automated Safety Evaluations Across 20 Large Language Models: The Aymara LLM Risk and Responsibility Matrix, https://arxiv.org/abs/2507.14719
Haoyu Wang and Chris M. Poskitt and Jun Sun and Jiali Wei, 1 Aug 2025, Pro2Guard: Proactive Runtime Enforcement of LLM Agent Safety via Probabilistic Model Checking, https://arxiv.org/abs/2508.00500
Avni Kothari, Patrick Vossler, Jean Digitale, Mohammad Forouzannia, Elise Rosenberg, Michele Lee, Jennee Bryant, Melanie Molina, James Marks, Lucas Zier, Jean Feng, 11 Aug 2025, When the Domain Expert Has No Time and the LLM Developer Has No Clinical Expertise: Real-World Lessons from LLM Co-Design in a Safety-Net Hospital, https://arxiv.org/abs/2508.08504
Bing Han, Feifei Zhao, Dongcheng Zhao, Guobin Shen, Ping Wu, Yu Shi, Yi Zeng, 8 Aug 2025, Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks, https://arxiv.org/abs/2508.09190
Minseon Kim, Jin Myung Kwak, Lama Alssum, Bernard Ghanem, Philip Torr, David Krueger, Fazl Barez, Adel Bibi, 17 Aug 2025, Rethinking Safety in LLM Fine-tuning: An Optimization Perspective, https://arxiv.org/abs/2508.12531
Mingxing Peng, Yuting Xie, Xusen Guo, Ruoyu Yao, Hai Yang, and Jun Ma, 17 Aug 2025, LD-Scene: LLM-Guided Diffusion for Controllable Generation of Adversarial Safety-Critical Driving Scenarios, https://arxiv.org/abs/2505.11247
Yeonjun In, Wonjoong Kim, Sangwu Park, Chanyoung Park, 1 Aug 2025, R1-ACT: Efficient Reasoning Model Safety Alignment by Activating Safety Knowledge, https://arxiv.org/abs/2508.00324
Yifan Wang, Runjin Chen, Bolian Li, David Cho, Yihe Deng, Ruqi Zhang, Tianlong Chen, Zhangyang Wang, Ananth Grama, Junyuan Hong, 22 Jul 2025, More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment, https://arxiv.org/abs/2504.02193
Frederik Pahde, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek, 29 Jul 2025, Ensuring Medical AI Safety: Interpretability-Driven Detection and Mitigation of Spurious Model Behavior and Associated Data, https://arxiv.org/abs/2501.13818
Qianli Ma, Dongrui Liu, Qian Chen, Linfeng Zhang, Jing Shao, 14 Aug 2025, LED-Merging: Mitigating Safety-Utility Conflicts in Model Merging with Location-Election-Disjoint, https://arxiv.org/abs/2502.16770
Raviraj Joshi, Rakesh Paul, Kanishk Singla, Anusha Kamath, Michael Evans, Katherine Luna, Shaona Ghosh, Utkarsh Vaidya, Eileen Long, Sanjay Singh Chauhan, Niranjan Wartikar, 3 Aug 2025, CultureGuard: Towards Culturally-Aware Dataset and Guard Model for Multilingual Safety Applications, https://arxiv.org/abs/2508.01710
Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, Hanxun Huang, Yige Li, Yutao Wu, Jiaming Zhang, Xiang Zheng, Yang Bai, Zuxuan Wu, Xipeng Qiu, Jingfeng Zhang, Yiming Li, Xudong Han, Haonan Li, Jun Sun, Cong Wang, Jindong Gu, Baoyuan Wu, Siheng Chen, Tianwei Zhang, Yang Liu, Mingming Gong, Tongliang Liu, Shirui Pan, Cihang Xie, Tianyu Pang, Yinpeng Dong, Ruoxi Jia, Yang Zhang, Shiqing Ma, Xiangyu Zhang, Neil Gong, Chaowei Xiao, Sarah Erfani, Tim Baldwin, Bo Li, Masashi Sugiyama, Dacheng Tao, James Bailey, Yu-Gang Jiang, 2 Aug 2025, Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety, https://arxiv.org/abs/2502.05206
Francisco Munguia-Galeano, Zhengxue Zhou, Satheeshkumar Veeramani, Hatem Fakhruldeen, Louis Longley, Rob Clowes and Andrew I. Cooper, 7 Aug 2025, Chemist Eye: A Visual Language Model-Powered System for Safety Monitoring and Robot Decision-Making in Self-Driving Laboratories, https://arxiv.org/abs/2508.05148
Chongwen Zhao and Kaizhu Huang, 1 Sep 2025, Unraveling LLM Jailbreaks Through Safety Knowledge Neurons, https://arxiv.org/abs/2509.01631
Wenxiao Zhang, Xiangrui Kong, Conan Dewitt, Thomas Br\"aunl, Jin B. Hong, 2 Sep 2025, Enhancing Reliability in LLM-Integrated Robotic Systems: A Unified Approach to Security and Safety, https://arxiv.org/abs/2509.02163
Taegyeong Lee, Jeonghwa Yoo, Hyoungseo Cho, Soo Yong Kim and Yunho Maeng, 30 Aug 2025, QGuard:Question-based Zero-shot Guard for Multi-modal LLM Safety, https://arxiv.org/abs/2506.12299
Roland Pihlakas, Sruthi Kuriakose, 2 Sep 2025, BioBlue: Notable runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format, https://arxiv.org/abs/2509.02655
Piyush Pant, 10 Sep 2025, Improving LLM Safety and Helpfulness using SFT and DPO: A Study on OPT-350M, https://arxiv.org/abs/2509.09055
Adel ElZemity, Budi Arief and Shujun Li, 17 Sep 2025, CyberLLMInstruct: A Pseudo-malicious Dataset Revealing Safety-performance Trade-offs in Cyber Security LLM Fine-tuning, https://arxiv.org/abs/2503.09334
Dylan Butts, Oct 22 2025, Hundreds of public figures, including Apple co-founder Steve Wozniak and Virgin’s Richard Branson urge AI ‘superintelligence’ ban, https://www.cnbc.com/2025/10/22/800-petition-signatures-apple-steve-wozniak-and-virgin-richard-branson-superintelligence-race.html