Aussie AI
AI Safety Research
-
Last Updated 30 August, 2025
-
by David Spuler, Ph.D.
Safe and responsible use of AI is an important and all-encompassing goal. Multiple concerns arise in the use of modern AI capabilities, and for the future with more advanced AI systems. This article examines the various research papers on difference AI safety issues.
Types of AI Safety Issues
There are a variety of distinct issue in terms of appropriate use of AI. Some of the categories include:
- Bias and fairness
- Inaccurate results
- Imaginary results ("hallucinations" or "confabulations")
- Inappropriate responses (e.g., "toxicity")
- Plagiarism
There are some issues that get quite close to being philosophy rather than technology:
- Alignment (ensuring AI engines are "aligned" with human goals)
- Overrideability/interruptibility
- Obedience vs autonomy
There are some overarching issues for AI matters for the government and in the community:
- Ethics
- Governance
- Regulation
- Auditing and Enforcement
- Privacy
- Risk Mitigation
Issues specific to mitigation of AI safety risks include:
- Red teaming (testing of safety issues)
- Prompt shields
- Guardrails
- Jailbreak prevention
- Refusal modules
- Security issues
And since we may rely on AI models in various real-world situations, including dangerous real-time situations like driving a car, there are some practical technological issues ensuring that AI engines operate safely and reliably within their basic operational scope:
- Testing and Debugging (simply avoiding coding "bugs" in complex AI engines)
- Real-time performance profiling ("de-slugging")
- Error Handling (tolerance of internal or external errors)
- Code Resilience (handling unexpected inputs or situations reasonably)
Overviews, Surveys, and Reviews
Various authors have reviewed the areas of safety and ethics:
- Cath C. Governing artificial intelligence: ethical, legal and technical opportunities and challenges. Philos Trans A Math Phys Eng Sci. 2018 Oct 15;376(2133):20180080. doi: 10.1098/rsta.2018.0080. PMID: 30322996 https://pubmed.ncbi.nlm.nih.gov/30322996/
- Hagendorff Thilo. The ethics of AI ethics: an evaluation of guidelines. Minds and Machines. 2020; 30(1):99–120. https://link.springer.com/article/10.1007/s11023-020-09517-8
- Jobin Anna, Ienca Marcello, Vayena Effy. The global landscape of AI ethics guidelines. Nature Machine Intell. 2019;(1):389–399. https://www.nature.com/articles/s42256-019-0088-2
- Soni N., Sharma E.K., Singh N., Kapoor A. 2019. Impact of Artificial Intelligence on Businesses: from Research, Innovation, Market Deployment to Future Shifts in Business Models”.arXiv:1905.02092. https://arxiv.org/abs/1905.02092
Hallucinations
Hallucinations are plausible-sounding answers that are not correct, and not based on any facts. It appears like the LLM is lying or faking the answer, but it doesn't actually know this. Rather, it is probabilistically trying to come up with the best answer, and sometimes it doesn't have a factual answer, so it can fill in the blanks.
- Liangming Pan, Michael Saxon, Wenda Xu, Deepak Nathani, Xinyi Wang, William Yang Wang, May 03 2024, Automatically Correcting Large Language Models: Surveying the Landscape of Diverse Automated Correction Strategies, https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00660/120911
- Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang, June 2023, Exposing Attention Glitches with Flip-Flop Language Modeling, https://arxiv.org/abs/2306.00946
- Lucas Mearian, 14 Mar 2024, AI hallucination mitigation: two brains are better than one, https://www.computerworld.com/article/1612465/ai-hallucination-mitigation-two-brains-are-better-than-one.html
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
- Bijit Ghosh Feb 2024, Advanced Prompt Engineering for Reducing Hallucination, https://medium.com/@bijit211987/advanced-prompt-engineering-for-reducing-hallucination-bb2c8ce62fc6
- Junyi Li, Jie Chen, Ruiyang Ren, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, Ji-Rong Wen, 6 Jan 2024, The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models, https://arxiv.org/abs/2401.03205 Code: https://github.com/RUCAIBox/HaluEval-2.0
- Colin Fraser, Apr 18, 2024, Hallucinations, Errors, and Dreams On why modern AI systems produce false outputs and what there is to be done about it, https://medium.com/@colin.fraser/hallucinations-errors-and-dreams-c281a66f3c35
- Johnny Li, Saksham Consul, Eda Zhou, James Wong, Naila Farooqui, Yuxin Ye, Nithyashree Manohar, Zhuxiaona Wei, Tian Wu, Ben Echols, Sharon Zhou, Gregory Diamos, 25 Jun 2024, Banishing LLM Hallucinations Requires Rethinking Generalization, https://arxiv.org/abs/2406.17642
- Pavan Belagatti, Jul 31, 2024, Semantic Chunking for Enhanced RAG Applications! https://levelup.gitconnected.com/semantic-chunking-for-enhanced-rag-applications-b6bc92942af0
- Mintong Kang, Nezihe Merve Gürel, Ning Yu, Dawn Song, Bo Li, July 2024, C-RAG: Certified Generation Risks for Retrieval-Augmented Language Models, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:22963-23000, 2024, https://proceedings.mlr.press/v235/kang24a.html
- Mengya Hu, Rui Xu, Deren Lei, Yaxi Li, Mingyu Wang, Emily Ching, Eslam Kamal, Alex Deng, 22 Aug 2024, SLM Meets LLM: Balancing Latency, Interpretability and Consistency in Hallucination Detection, https://arxiv.org/abs/2408.12748
- Hao Zhou, Chengming Hu, Ye Yuan, Yufei Cui, Yili Jin, Can Chen, Haolun Wu, Dun Yuan, Li Jiang, Di Wu, Xue Liu, Charlie Zhang, Xianbin Wang, Jiangchuan Liu, 17 May 2024, Large Language Model (LLM) for Telecommunications: A Comprehensive Survey on Principles, Key Techniques, and Opportunities, https://arxiv.org/abs/2405.10825
- C Yang, S Fujita, 2024, Adaptive Control of Retrieval-Augmented Generation for LLMs Through Reflective Tags, https://www.preprints.org/manuscript/202408.2152/download/final_file
- Michael Wood, Aug 26, 2024, 100% Accurate AI Claimed by Acurai — OpenAI and Anthropic Confirm Acurai’s Discoveries, https://blog.cubed.run/100-accurate-ai-claimed-by-acurai-openai-and-anthropic-confirm-acurais-discoveries-98fce1ddeb5b
- James Lee Stakelum, Sep 2024, The End of AI Hallucinations: A Big Breakthrough in Accuracy for AI Application Developers, https://medium.com/@JamesStakelum/the-end-of-ai-hallucinations-a-breakthrough-in-accuracy-for-data-engineers-e67be5cc742a
- F. Li, X. zhang and P. Zhang, 2024, Mitigating Hallucination Issues in Small-Parameter LLMs through Inter-Layer Contrastive Decoding, 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 2024, pp. 1-8, doi: 10.1109/IJCNN60899.2024.10650644, https://ieeexplore.ieee.org/abstract/document/10650644
- Zhongxiang Sun, Zihua Si, Xiaoxue Zang, Kai Zheng, Yang Song, Xiao Zhang, Jun Xu, 15 Oct 2024, LargePiG: Your Large Language Model is Secretly a Pointer Generator, https://arxiv.org/abs/2410.11366
- Garanc Burke, Hilke Schellmann, October 27, 2024, Researchers say an AI-powered transcription tool used in hospitals invents things no one ever said, https://apnews.com/article/ai-artificial-intelligence-health-business-90020cdf5fa16c79ca2e5b6c4c9bbb14
- Adi Simhi, Jonathan Herzig, Idan Szpektor, Yonatan Belinkov, 29 Oct 2024, Distinguishing Ignorance from Error in LLM Hallucinations, https://arxiv.org/abs/2410.22071 https://github.com/technion-cs-nlp/hallucination-mitigation
- Salvatore Raieli, Nov 2024, What Is The Best Therapy For a Hallucinating AI Patient? Exploring the Art and Science of Prompt Engineering to Cure LLM Hallucinations, https://levelup.gitconnected.com/what-is-the-best-therapy-for-a-hallucinating-ai-patient-acf0cb9b3e00
- Vitaly Kukharenko, Nov 2024, Why Do Neural Networks Hallucinate (And What Are Experts Doing About It)? https://pub.towardsai.net/why-do-neural-networks-hallucinate-and-what-are-experts-doing-about-it-7b9342605bf7
- Yixiong Fang, Ziran Yang, Zhaorun Chen, Zhuokai Zhao, Jiawei Zhou, 9 Dec 2024, From Uncertainty to Trust: Enhancing Reliability in Vision-Language Models with Uncertainty-Guided Dropout Decoding, https://arxiv.org/abs/2412.06474
- Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri, 10 Dec 2024, Granite Guardian, https://arxiv.org/abs/2412.07724 https://github.com/ibm-granite/granite-guardian (Open-sourcing of safety models with many capabilities.)
- Lilian Weng, July 7, 2024, Extrinsic Hallucinations in LLMs, https://lilianweng.github.io/posts/2024-07-07-hallucination/
- Rhiannon Williams, December 31, 2024, The biggest AI flops of 2024: From chatbots dishing out illegal advice to dodgy AI-generated search results, take a look back over the year’s top AI failures. https://www.technologyreview.com/2024/12/31/1109612/biggest-worst-ai-artificial-intelligence-flops-fails-2024/
- Kazi Hasan Ibn Arif, Sajib Acharjee Dip, Khizar Hussain, Lang Zhang, Chris Thomas, 21 Jan 2025, Fixing Imbalanced Attention to Mitigate In-Context Hallucination of Large Vision-Language Model, https://arxiv.org/abs/2501.12206
- Huan Ma, Jingdong Chen, Guangyu Wang, Changqing Zhang, 1 Feb 2025, Estimating LLM Uncertainty with Logits, https://arxiv.org/abs/2502.00290
- Ningke Li, Yahui Song, Kailong Wang, Yuekang Li, Ling Shi, Yi Liu, Haoyu Wang, 19 Feb 2025, Detecting LLM Fact-conflicting Hallucinations Enhanced by Temporal-logic-based Reasoning, https://arxiv.org/abs/2502.13416
- Seongheon Park, Xuefeng Du, Min-Hsuan Yeh, Haobo Wang, Yixuan Li, 1 Mar 2025, How to Steer LLM Latents for Hallucination Detection? https://arxiv.org/abs/2503.01917
- Sean Michael Kerner, May 13, 2025, Guardian agents: New approach could reduce AI hallucinations to below 1%, https://venturebeat.com/ai/beyond-detection-why-automatically-correcting-hallucinations-could-transform-enterprise-ai-adoption/
- Lei Wang, 12 May 2025, SEReDeEP: Hallucination Detection in Retrieval-Augmented Models via Semantic Entropy and Context-Parameter Fusion, https://arxiv.org/abs/2505.07528
- Manuel Cossio, 3 Aug 2025, A comprehensive taxonomy of hallucinations in Large Language Models, https://arxiv.org/abs/2508.01781
- Igor Halperin, 13 Aug 2025, Prompt-Response Semantic Divergence Metrics for Faithfulness Hallucination and Misalignment Detection in Large Language Models, https://arxiv.org/abs/2508.10192
- Denis Janiak, Jakub Binkowski, Albert Sawczyn, Bogdan Gabrys, Ravid Shwartz-Ziv, Tomasz Kajdanowicz, 13 Aug 2025, The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs, https://arxiv.org/abs/2508.08285
- Xi Long, Christy Boscardin, Lauren A. Maggio, Joseph A. Costello, Ralph Gonzales, Rasmyah Hammoudeh, Ki Lai, Yoon Soo Park, Brian C. Gin, 14 Aug 2025, Hallucination vs interpretation: rethinking accuracy and precision in AI-assisted data extraction for knowledge synthesis, https://arxiv.org/abs/2508.09458
- Siyuan Liu, Wenjing Liu, Zhiwei Xu, Xin Wang, Bo Chen, Tao Li, 21 Jul 2025, Towards Mitigation of Hallucination for LLM-empowered Agents: Progressive Generalization Bound Exploration and Watchdog Monitor, https://arxiv.org/abs/2507.15903
- Zhenliang Zhang, Xinyu Hu, Huixuan Zhang, Junzhe Zhang, Xiaojun Wan, 22 Jul 2025, ICR Probe: Tracking Hidden State Dynamics for Reliable Hallucination Detection in LLMs, https://arxiv.org/abs/2507.16488
- Xin Dong, Shichao Dong, Jin Wang, Jing Huang, Li Zhou, Zenghui Sun, Lihua Jing, Jingsong Lan, Xiaoyong Zhu, Bo Zheng, 22 Jul 2025, INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling, https://arxiv.org/abs/2507.05056
- Seunghoi Kim and Henry F. J. Tregidgo and Matteo Figini and Chen Jin and Sarang Joshi and Daniel C. Alexander, 24 Jul 2025, Tackling Hallucination from Conditional Models for Medical Image Reconstruction with DynamicDPS, https://arxiv.org/abs/2503.01075
- Weihua Zheng, Roy Ka-Wei Lee, Zhengyuan Liu, Kui Wu, AiTi Aw, Bowei Zou, 17 Jul 2025, CCL-XCoT: An Efficient Cross-Lingual Knowledge Transfer Method for Mitigating Hallucination Generation, https://arxiv.org/abs/2507.14239
- Jingwei Huang, Kuroush Nezafati, Ismael Villanueva-Miranda, Zifan Gu, Yueshuang Xu, Ann Marie Navar, Tingyi Wanyan, Qin Zhou, Bo Yao, Ruichen Rong, Xiaowei Zhan, Guanghua Xiao, Eric D. Peterson, Donghan M. Yang, Wenqi Shi, Yang Xie, 18 Jul 2025, Large Language Models Powered Multiagent Ensemble for Mitigating Hallucination and Efficient Atrial Fibrillation Annotation of ECG Reports, https://arxiv.org/abs/2410.16543
- Ashley Lewis, Michael White, Jing Liu, Toshiaki Koike-Akino, Kieran Parsons, Ye Wang, 21 Jul 2025, Winning Big with Small Models: Knowledge Distillation vs. Self-Training for Reducing Hallucination in Product QA Agents, https://arxiv.org/abs/2502.19545
- Quan Shi, Wang Xi, Zenghui Ding, Jianqing Gao, Xianjun Yang, 10 Aug 2025, Hallucination as a Computational Boundary: A Hierarchy of Inevitability and the Oracle Escape, https://arxiv.org/abs/2508.07334
- Ming-Kun Xie, Jia-Hao Xiao, Gang Niu, Lei Feng, Zhiqiang Kou, Min-Ling Zhang, and Masashi Sugiyama, 3 Aug 2025, What Makes "Good" Distractors for Object Hallucination Evaluation in Large Vision-Language Models?, https://arxiv.org/abs/2508.06530
- Jakob Snel and Seong Joon Oh, 28 Jul 2025, First Hallucination Tokens Are Different from Conditional Ones, https://arxiv.org/abs/2507.20836
- Shengyuan Wang, Jie Feng, Tianhui Liu, Dan Pei, Yong Li, 25 Jul 2025, Mitigating Geospatial Knowledge Hallucination in Large Language Models: Benchmarking and Dynamic Factuality Aligning, https://arxiv.org/abs/2507.19586
- Baiyu Chen, Wilson Wongso, Xiaoqian Hu, Yue Tan, Flora Salim, 27 Jul 2025, Multi-Stage Verification-Centric Framework for Mitigating Hallucination in Multi-Modal RAG, https://arxiv.org/abs/2507.20136
- Joosung Lee, Cheonbok Park, Hwiyeol Jo, Jeonghoon Kim, Joonsuk Park, Kang Min Yoo, 28 Jul 2025, Enhancing Hallucination Detection via Future Context, https://arxiv.org/abs/2507.20546
- Esmail Gumaan, 20 Jul 2025, Theoretical Foundations and Mitigation of Hallucination in Large Language Models, https://arxiv.org/abs/2507.22915
- Praveenkumar Katwe, Rakesh Chandra, Balabantaray Kali, Prasad Vittala, 30 Jul 2025, Reducing Hallucinations in Summarization via Reinforcement Learning with Entity Hallucination Index, https://arxiv.org/abs/2507.22744
- Vijja Wichitwechkarn, Charles Fox, Ruchi Choudhary, 23 Jul 2025, Hallucination Detection and Mitigation with Diffusion in Multi-Variate Time-Series Foundation Models, https://arxiv.org/abs/2508.00881
- Xiaoyu Pan, Yang Bai, Ke Zou, Yang Zhou, Jun Zhou, Huazhu Fu, Yih-Chung Tham, Yong Liu, 24 Jul 2025, EH-Benchmark Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow, https://arxiv.org/abs/2507.22929
- Zhaochen Wang, Yiwei Wang, Yujun Cai, 3 Aug 2025, Cure or Poison? Embedding Instructions Visually Alters Hallucination in Vision-Language Models, https://arxiv.org/abs/2508.01678
- Yijun Feng, 3 Aug 2025, Counterfactual Probing for Hallucination Detection and Mitigation in Large Language Models, https://arxiv.org/abs/2508.01862
- Zhaoyi Sun, Wen-Wai Yim, Ozlem Uzuner, Fei Xia, Meliha Yetisgen, 1 Aug 2025, A Scoping Review of Natural Language Processing in Addressing Medically Inaccurate Information: Errors, Misinformation, and Hallucination, https://arxiv.org/abs/2505.00008
- Junyoung Lim, Jaewoo Ahn, Gunhee Kim, 5 Aug 2025, ChartCap: Mitigating Hallucination of Dense Chart Captioning, https://arxiv.org/abs/2508.03164
- Subhey Sadi Rahman, Md. Adnanul Islam, Md. Mahbub Alam, Musarrat Zeba, Md. Abdur Rahman, Sadia Sultana Chowa, Mohaimenul Azam Khan Raiaan, Sami Azam, 5 Aug 2025, Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Models, https://arxiv.org/abs/2508.03860
- Shunqi Mao, Chaoyi Zhang, Weidong Cai, 6 Aug 2025, Through the Magnifying Glass: Adaptive Perception Magnification for Hallucination-Free VLM Decoding, https://arxiv.org/abs/2503.10183
- Micha{\l} P. Karpowicz, 6 Aug 2025, On the Fundamental Impossibility of Hallucination Control in Large Language Models, https://arxiv.org/abs/2506.06382
- Huaicheng Zhang, Wei Tan, Guangzheng Li, Yixuan Zhang, Hangting Chen, Shun Lei, Chenyu Yang, Zhiyong Wu, Shuai Wang, Qijun Huang, Dong Yu, 7 Aug 2025, Towards Hallucination-Free Music: A Reinforcement Learning Preference Optimization Framework for Reliable Song Generation, https://arxiv.org/abs/2508.05011
- Kim Hammar and Tansu Alpcan and Emil C. Lupu, 7 Aug 2025, Incident Response Planning Using a Lightweight Large Language Model with Reduced Hallucination, https://arxiv.org/abs/2508.05188
- Marc Pavel, Nenad Petrovic, Lukasz Mazur, Vahid Zolfaghari, Fengjunjie Pan, Alois Knoll, 15 Aug 2025, Hallucination in LLM-Based Code Generation: An Automotive Case Study, https://arxiv.org/abs/2508.11257
- Nanxing Hu, Xiaoyue Duan, Jinchao Zhang, Guoliang Kang, 19 Aug 2025, Enhancing Visual Reliance in Text Generation: A Bayesian Perspective on Mitigating Hallucination in Large Vision-Language Models, https://arxiv.org/abs/2505.19498
- Huan Ma, Jiadong Pan, Jing Liu, Yan Chen, Joey Tianyi Zhou, Guangyu Wang, Qinghua Hu, Hua Wu, Changqing Zhang, Haifeng Wang, 20 Aug 2025, Semantic Energy: Detecting LLM Hallucination Beyond Entropy, https://arxiv.org/abs/2508.14496
- Aman Goel, Daniel Schwartz, Yanjun Qi, 19 Aug 2025, Zero-knowledge LLM hallucination detection and mitigation through fine-grained cross-model consistency, https://arxiv.org/abs/2508.14314
- Yupei Yang, Fan Feng, Lin Yang, Wanxi Deng, Lin Qu, Biwei Huang, Shikui Tu, Lei Xu, 20 Aug 2025, DEPTH: Hallucination-Free Relation Extraction via Dependency-Aware Sentence Simplification and Two-tiered Hierarchical Refinement, https://arxiv.org/abs/2508.14391
- Nicole Cho, William Watson, Alec Koppel, Sumitra Ganesh, Manuela Veloso, 22 Aug 2025, QueryBandits for Hallucination Mitigation: Exploiting Semantic Features for No-Regret Rewriting, https://arxiv.org/abs/2508.16697
- Nicolas Zucchet, J\"org Bornschein, Stephanie Chan, Andrew Lampinen, Razvan Pascanu, Soham De, 24 Jul 2025, How do language models learn facts? Dynamics, curricula and hallucinations, https://arxiv.org/abs/2503.21676
- Anindya Bijoy Das, Shahnewaz Karim Sakib and Shibbir Ahmed, 9 Aug 2025, Trustworthy Medical Imaging with Large Language Models: A Study of Hallucinations Across Modalities, https://arxiv.org/abs/2508.07031
- Charles O'Neill, Slava Chalnev, Chi Chi Zhao, Max Kirkby, Mudith Jayasekara, 31 Jul 2025, A Single Direction of Truth: An Observer Model's Linear Residual Probe Exposes and Steers Contextual Hallucinations, https://arxiv.org/abs/2507.23221
- Zhangcheng Qiang, Kerry Taylor, Weiqing Wang, Jing Jiang, 25 Mar 2025, OAEI-LLM-T: A TBox Benchmark Dataset for Understanding Large Language Model Hallucinations in Ontology Matching, https://arxiv.org/abs/2503.21813
- Yudong Zhang, Ruobing Xie, Xingwu Sun, Yiqing Huang, Jiansheng Chen, Zhanhui Kang, Di Wang, Yu Wang, 31 Jul 2025, DHCP: Detecting Hallucinations by Cross-modal Attention Pattern in Large Vision-Language Models, https://arxiv.org/abs/2411.18659
- Haonan Ge, Yiwei Wang, Ming-Hsuan Yang, Yujun Cai, 14 Aug 2025, MRFD: Multi-Region Fusion Decoding with Self-Consistency for Mitigating Hallucinations in LVLMs, https://arxiv.org/abs/2508.10264
- Likun Tan, Kuan-Wei Huang, Kevin Wu, 28 Jul 2025, FRED: Financial Retrieval-Enhanced Detection and Editing of Hallucinations in Language Models, https://arxiv.org/abs/2507.20930
- Neil F. Johnson and Frank Yingjie Huo, 1 Aug 2025, Multispin Physics of AI Tipping Points and Hallucinations, https://arxiv.org/abs/2508.01097
- Chenxi Li, Yichen Guo, Benfang Qian, Jinhao You, Kai Tang, Yaosong Du, Zonghao Zhang, and Xiande Huang, 3 Aug 2025, MAP: Mitigating Hallucinations in Large Vision-Language Models with Map-Level Attention Processing, https://arxiv.org/abs/2508.01653
- Peizheng Guo, Jingyao Wang, Wenwen Qiang, Huijie Guo, Changwen Zheng, Jiahuan Zhou, Gang Hua, 6 Aug 2025, Hacking Hallucinations of MLLMs with Causal Sufficiency and Necessity, https://arxiv.org/abs/2508.04182
- Mengao Zhang, Jiayu Fu, Tanya Warrier, Yuwen Wang, Tianhui Tan, Ke-wei Huang, 7 Aug 2025, FAITH: A Framework for Assessing Intrinsic Tabular Hallucinations in finance, https://arxiv.org/abs/2508.05201
- Vibhor Agarwal, Yiqiao Jin, Mohit Chandra, Munmun De Choudhury, Srijan Kumar, Nishanth Sastry, 7 Aug 2025, MedHalu: Hallucinations in Responses to Healthcare Queries by Large Language Models, https://arxiv.org/abs/2409.19492
- Chunhua Liu, Hong Yi Lin and Patanamon Thongtanunam, 12 Aug 2025, Hallucinations in Code Change to Natural Language Generation: Prevalence and Evaluation of Detection Metrics, https://arxiv.org/abs/2508.08661
- Ashish Seth, Utkarsh Tyagi, Ramaneswaran Selvakumar, Nishit Anand, Sonal Kumar, Sreyan Ghosh, Ramani Duraiswami, Chirag Agarwal, Dinesh Manocha, 18 Aug 2025, EGOILLUSION: Benchmarking Hallucinations in Egocentric Video Understanding, https://arxiv.org/abs/2508.12687
- Yuangang Li, Yiqing Shen, Yi Nian, Jiechao Gao, Ziyi Wang, Chenxiao Yu, Shawn Li, Jie Wang, Xiyang Hu, Yue Zhao, 17 Aug 2025, Mitigating Hallucinations in Large Language Models via Causal Reasoning, https://arxiv.org/abs/2508.12495
- Wenhao Li, Xiu Su, Jingyi Wu, Feng Yang, Yang Liu, Yi Chen, Shan You, Chang Xu, 19 Aug 2025, Identify, Isolate, and Purge: Mitigating Hallucinations in LVLMs via Self-Evolving Distillation, https://arxiv.org/abs/2507.04680
- Anindya Bijoy Das, Shibbir Ahmed and Shahnewaz Karim Sakib, 19 Aug 2025, Hallucinations and Key Information Extraction in Medical Texts: A Comprehensive Assessment of Open-Source Large Language Models, https://arxiv.org/abs/2504.19061
- Chenlin Liu, Minghui Fang, Patrick Zhang, Wei Zhou, Jie Gao, Jiqing Han, 21 Aug 2025, Mitigating Hallucinations in LM-Based TTS Models via Distribution Alignment Using GFlowNets, https://arxiv.org/abs/2508.15442
- Reilly Haskins and Benjamin Adams, 21 Aug 2025, KEA Explain: Explanations of Hallucinations using Graph Kernel Analysis, https://arxiv.org/abs/2507.03847
- Shuzhou Yuan, Zhan Qu, Ashish Yashwanth Kangen, Michael F\"arber, 22 Aug 2025, Can Hallucinations Help? Boosting LLMs for Drug Discovery, https://arxiv.org/abs/2501.13824
Security of AI
Research on security issues involving AI and LLMs:
- Jason Koebler, June 26, 2024, Researchers Prove Rabbit AI Breach By Sending Email to Us as Admin, https://www.404media.co/researchers-prove-rabbit-ai-breach-by-sending-email-to-us-as-admin/ (Rabbit's API security credentials were hard-coded into the device.)
- Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, Yunxin Liu, 8 May 2024 (v2), Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security, https://arxiv.org/abs/2401.05459 https://github.com/MobileLLM/Personal_LLM_Agents_Survey
- Michael Nuñez, August 30, 2024, AI is growing faster than companies can secure it, warn industry leaders, https://venturebeat.com/ai/ai-is-growing-faster-than-companies-can-secure-it-warn-industry-leaders/
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
- Huan Yang, Deyu Zhang, Yudong Zhao, Yuanchun Li, Yunxin Liu, 6 Sep 2024, A First Look At Efficient And Secure On-Device LLM Inference Against KV Leakage, https://arxiv.org/abs/2409.04040 (Security issues where KV caches can be data leaks as they may contain encodings of private information.)
- Nicholas Carlini, Milad Nasr, 22 Oct 2024, Remote Timing Attacks on Efficient Language Model Inference, https://arxiv.org/abs/2410.17175
- Chenchen Gu, Xiang Lisa Li, Rohith Kuditipudi, Percy Liang, Tatsunori Hashimoto, Dec 2024, Timing Attacks on Prompt Caching in Language Model APIs, Stanford CS 191W Senior Project, https://cs191w.stanford.edu/projects/Gu,%20Chenchen_CS191W.pdf (Using timing attacks to detect prefix KV caching, thereby gaining information about other users' prompts.)
- Úlfar Erlingsson, 27 Mar 2025, How to Secure Existing C and C++ Software without Memory Safety, https://arxiv.org/pdf/2503.21145 (Examines four risk mitigation techniques for memory safety.)
- Anthropic, 13 Aug 2025, Building Safeguards for Claude, https://www.anthropic.com/news/building-safeguards-for-claude
- Pallavi Zambare, Venkata Nikhil Thanikella, Nikhil Padmanabh Kottur, Sree Akhil Akula, Ying Liu, 12 Aug 2025, NetMoniAI: An Agentic AI Framework for Network Security & Monitoring, https://arxiv.org/abs/2508.10052
- Miles Q. Li and Benjamin C. M. Fung, 13 Aug 2025, Security Concerns for Large Language Models: A Survey, https://arxiv.org/abs/2505.18889
- Vita Santa Barletta, Vito Bavaro, Miriana Calvano, Antonio Curci, Antonio Piccinno, Davide Pio Posa, 23 Jul 2025, Enabling Cyber Security Education through Digital Twins and Generative AI, https://arxiv.org/abs/2507.17518
- Haibo Wang, Lutfu S.Sua, and Bahram Alidaee, 22 Jul 2025, Enhancing supply chain security with automated machine learning, https://arxiv.org/abs/2406.13166
- Lily Stelling, Mick Yang, Rokas Gipi\v{s}kis, Leon Staufer, Ze Shen Chin, Sim\'eon Campos, Ariel Gil, and Michael Chen, 22 Jul 2025, Mapping Industry Practices to the EU AI Act's GPAI Code of Practice Safety and Security Measures, https://arxiv.org/abs/2504.15181
- Rui Guo, Avinash Ayalasomayajula, Henian Li, Jingbo Zhou, Sujan Kumar Saha, Farimah Farahmandi, 22 Jul 2025, SVAgent: AI Agent for Hardware Security Verification Assertion, https://arxiv.org/abs/2507.16203
- Chang Gong and Zhongwen Li and Xiaoqi Li, 24 Jul 2025, Information Security Based on LLM Approaches: A Review, https://arxiv.org/abs/2507.18215
- Pengfei Du, 14 Jul 2025, PRM-Free Security Alignment of Large Models via Red Teaming and Adversarial Training, https://arxiv.org/abs/2507.14202
- Eldor Abdukhamidov, Mohammed Abuhamad, Simon S. Woo, Hyoungshick Kim, Tamer Abuhmed, 18 Jul 2025, Breaking the Illusion of Security via Interpretation: Interpretable Vision Transformer Systems under Attack, https://arxiv.org/abs/2507.14248
- Zhou Li, Xiang Zhang, Jiawen Lv, Jihao Fan, Haiqiang Chen, Giuseppe Caire, 19 Jul 2025, Collusion-Resilient Hierarchical Secure Aggregation with Heterogeneous Security Constraints, https://arxiv.org/abs/2507.14768
- Nidhi Rastogi, Shirid Pant, Devang Dhanuka, Amulya Saxena, Pranjal Mairal, 20 Jul 2025, Too Much to Trust? Measuring the Security and Cognitive Impacts of Explainability in AI-Driven SOCs, https://arxiv.org/abs/2503.02065
- Andrew C. Cullen, Paul Montague, Sarah M. Erfani, Benjamin I.P. Rubinstein, 11 Aug 2025, Position: Certified Robustness Does Not (Yet) Imply Model Security, https://arxiv.org/abs/2506.13024
- Andy Zou, Maxwell Lin, Eliot Jones, Micha Nowak, Mateusz Dziemian, Nick Winter, Alexander Grattan, Valent Nathanael, Ayla Croft, Xander Davies, Jai Patel, Robert Kirk, Nate Burnikell, Yarin Gal, Dan Hendrycks, J. Zico Kolter, Matt Fredrikson, 28 Jul 2025, Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition, https://arxiv.org/abs/2507.20526
- Shen Li, Liuyi Yao, Wujia Niu, Lan Zhang, Yaliang Li, 28 Jul 2025, Security Tensors as a Cross-Modal Bridge: Extending Text-Aligned Safety to Vision in LVLM, https://arxiv.org/abs/2507.20994
- Song Son Ha,Florian Foerster,Thomas Robert Doebbert,Tim Kittel,Dominik Merli,Gerd Scholl, 28 Jul 2025, Testbed and Software Architecture for Enhancing Security in Industrial Private 5G Networks, https://arxiv.org/abs/2507.20873
- Keerthana Madhavan, Abbas Yazdinejad, Fattane Zarrinkalam, Ali Dehghantanha, 26 Jul 2025, Quantifying Security Vulnerabilities: A Metric-Driven Security Analysis of Gaps in Current AI Standards, https://arxiv.org/abs/2502.08610
- Craig Wright, 10 Jul 2025, A Formal Rebuttal of "The Blockchain Trilemma: A Formal Proof of the Inherent Trade-Offs Among Decentralization, Security, and Scalability", https://arxiv.org/abs/2507.21111
- Gauri Sharma, Vidhi Kulkarni, Miles King, Ken Huang, 23 Jul 2025, Towards Unifying Quantitative Security Benchmarking for Multi Agent Systems, https://arxiv.org/abs/2507.21146
- Muzhi Dai, Shixuan Liu, Zhiyuan Zhao, Junyu Gao, Hao Sun, Xuelong Li, 29 Jul 2025, Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security, https://arxiv.org/abs/2507.22037
- Kang Chen, Xiuze Zhou, Yuanguo Lin, Jinhe Su, Yuanhui Yu, Li Shen, Fan Lin, 4 Aug 2025, A Survey on Data Security in Large Language Models, https://arxiv.org/abs/2508.02312
- Niklas Pfister, V\'aclav Volhejn, Manuel Knott, Santiago Arias, Julia Bazi\'nska, Mykhailo Bichurin, Alan Commike, Janet Darling, Peter Dienes, Matthew Fiedler, David Haber, Matthias Kraft, Marco Lancini, Max Mathys, Dami\'an Pascual-Ortiz, Jakub Podolak, Adri\`a Romero-L\'opez, Kyriacos Shiarlis, Andreas Signer, Zsolt Terek, Athanasios Theocharis, Daniel Timbrell, Samuel Trautwein, Samuel Watts, Yun-Han Wu, Mateo Rojas-Carulla, 4 Aug 2025, Gandalf the Red: Adaptive Security for LLMs, https://arxiv.org/abs/2501.07927
- Nusrat Zahan, Imranur Rahman, Laurie Williams, 2 Aug 2025, Assumptions to Evidence: Evaluating Security Practices Adoption and Their Impact on Outcomes in the npm Ecosystem, https://arxiv.org/abs/2504.14026
- Arturo S\'anchez-Matas, Pablo Escribano Ruiz, Daniel D\'iaz-L\'opez, Angel Luis Perales G\'omez, Pantaleone Nespoli, Gregorio Mart\'inez P\'erez, 5 Aug 2025, Simulating Cyberattacks through a Breach Attack Simulation (BAS) Platform empowered by Security Chaos Engineering (SCE), https://arxiv.org/abs/2508.03882
- Hammad Atta, Ken Huang, Manish Bhatt, Kamal Ahmed, Muhammad Aziz Ul Haq, Yasir Mehmood, 6 Aug 2025, Logic layer Prompt Control Injection (LPCI): A Novel Security Vulnerability Class in Agentic Systems, https://arxiv.org/abs/2507.10457
- Minghao Shao, Nanda Rani, Kimberly Milner, Haoran Xi, Meet Udeshi, Saksham Aggarwal, Venkata Sai Charan Putrevu, Sandeep Kumar Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, Muhammad Shafique, 5 Aug 2025, Towards Effective Offensive Security LLM Agents: Hyperparameter Tuning, LLM as a Judge, and a Lightweight CTF Benchmark, https://arxiv.org/abs/2508.05674
- Hiroya Kato, Kentaro Kita, Kento Hasegawa, Seira Hidano, 12 Aug 2025, AI Security Map: Holistic Organization of AI Security Technologies and Impacts on Stakeholders, https://arxiv.org/abs/2508.08583
- Aayush Gupta, 12 Aug 2025, Can AI Keep a Secret? Contextual Integrity Verification: A Provable Security Architecture for LLMs, https://arxiv.org/abs/2508.09288
- Irash Perera (1), Hiranya Abeyrathne (2), Sanjeewa Malalgoda (2), Arshardh Ifthikar (2) ((1) Department of Computer Science and Engineering, University of Moratuwa, Colombo, Sri Lanka, (2) WSO2, Colombo, Sri Lanka), 14 Aug 2025, Enhancing GraphQL Security by Detecting Malicious Queries Using Large Language Models, Sentence Transformers, and Convolutional Neural Networks, https://arxiv.org/abs/2508.11711
- Afrah Gueriani, Hamza Kheddar, Ahmed Cherif Mazari and Mohamed Chahine Ghanem, 17 Aug 2025, A Robust Cross-Domain IDS using BiGRU-LSTM-Attention for Medical and Industrial IoT Security, https://arxiv.org/abs/2508.12470
- Yongjian Guo, Puzhuo Liu, Wanlun Ma, Zehang Deng, Xiaogang Zhu, Peng Di, Xi Xiao, Sheng Wen, 18 Aug 2025, Systematic Analysis of MCP Security, https://arxiv.org/abs/2508.12538
- Yixuan Yang and Daoyuan Wu and Yufan Chen, 17 Aug 2025, MCPSecBench: A Systematic Security Benchmark and Playground for Testing Model Context Protocols, https://arxiv.org/abs/2508.13220
- Daniel M. Jimenez-Gutierrez, Yelizaveta Falkouskaya, Jose L. Hernandez-Ramos, Aris Anagnostopoulos, Ioannis Chatzigiannakis, Andrea Vitaletti, 19 Aug 2025, On the Security and Privacy of Federated Learning: A Survey with Attacks, Defenses, Frameworks, Applications, and Future Directions, https://arxiv.org/abs/2508.13730
- Abbas Sabra, Olivier Schmitt and Joseph Tyler, 20 Aug 2025, Assessing the Quality and Security of AI-Generated Code: A Quantitative Analysis, https://arxiv.org/abs/2508.14727
- Zhixiang Guo, Siyuan Liang, Aishan Liu, Dacheng Tao, 21 Aug 2025, CopyrightShield: Enhancing Diffusion Model Security against Copyright Infringement Attacks, https://arxiv.org/abs/2412.01528
- Akshay Mhatre and Noujoud Nader and Patrick Diehl and Deepti Gupta, 22 Aug 2025, LLM-GUARD: Large Language Model-Based Detection and Repair of Bugs and Security Vulnerabilities in C++ and Python, https://arxiv.org/abs/2508.16419
- Anton Ludwig Bonin, Pawel Robert Smolinski, Jacek Winiarski, 22 Aug 2025, Exploring the Impact of Generative Artificial Intelligence on Software Development in the IT Sector: Preliminary Findings on Productivity, Efficiency and Job Security, https://arxiv.org/abs/2508.16811
- Keke Lian and Bin Wang, Lei Zhang, Libo Chen, Junjie Wang, Ziming Zhao, Yujiu Yang, Haotong Duan, Haoran Zhao, Shuang Liao, Mingda Guo, Jiazheng Quan, Yilu Zhong, Chenhao He, Zichuan Chen, Jie Wu, Haoling Li, Zhaoxuan Li, Jiongchi Yu, Hui Li and Dong Zhang, 25 Aug 2025, A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code, https://arxiv.org/abs/2508.18106
- Matous Kozak, Roshanak Zilouchian Moghaddam, Siva Sivaraman, 23 Aug 2025, When Developer Aid Becomes Security Debt: A Systematic Analysis of Insecure Behaviors in LLM Coding Agents, https://arxiv.org/abs/2507.09329
- Ada Chen, Yongjiang Wu, Junyuan Zhang, Jingyu Xiao, Shu Yang, Jen-tse Huang, Kun Wang, Wenxuan Wang, Shuai Wang, 25 Aug 2025, A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron?, https://arxiv.org/abs/2505.10924
- Niveen O. Jaffal, Mohammed Alkhanafseh, David Mohaisen, 18 Jul 2025, Large Language Models in Cybersecurity: Applications, Vulnerabilities, and Defense Techniques, https://arxiv.org/abs/2507.13629
- Julia Laubmann, Johannes Reschke, 18 Jul 2025, Tackling fake images in cybersecurity -- Interpretation of a StyleGAN and lifting its black-box, https://arxiv.org/abs/2507.13722
- Felix H\"arer, 19 Jul 2025, Specification and Evaluation of Multi-Agent LLM Systems -- Prototype and Cybersecurity Applications, https://arxiv.org/abs/2506.10467
- Terry Yue Zhuo, Dingmin Wang, Hantian Ding, Varun Kumar, Zijian Wang, 29 Jul 2025, Cyber-Zero: Training Cybersecurity Agents without Runtime, https://arxiv.org/abs/2508.00910
- Mehdi Akbari Gurabi, Lasse Nitz, Radu-Mihai Castravet, Roman Matzutt, Avikarsha Mandal, Stefan Decker, 5 Aug 2025, From Legacy to Standard: LLM-Assisted Transformation of Cybersecurity Playbooks into CACAO Format, https://arxiv.org/abs/2508.03342
- Md Zesun Ahmed Mia, Malyaban Bal, Sen Lu, George M. Nishibuchi, Suhas Chelian, Srini Vasan, Abhronil Sengupta, 6 Aug 2025, Neuromorphic Cybersecurity with Semi-supervised Lifelong Learning, https://arxiv.org/abs/2508.04610
- Daniele Proverbio, Alessio Buscemi, Alessandro Di Stefano, The Anh Han, German Castignani and Pietro Li\`o, 4 Aug 2025, Can LLMs effectively provide game-theoretic-based scenarios for cybersecurity?, https://arxiv.org/abs/2508.05670
- Victor Lopez Juarez, 9 Aug 2025, EU Digital Regulation and Guatemala: AI, 5G, and Cybersecurity, https://arxiv.org/abs/2508.08315
- Yuksel Aydin, 9 Aug 2025, Cognitive Cybersecurity for Artificial Intelligence: Guardrail Engineering with CCS-7, https://arxiv.org/abs/2508.10033
- Aydin Zaboli and Junho Hong, 12 Aug 2025, Generative AI for Cybersecurity of Energy Management Systems: Methods, Challenges, and Future Directions, https://arxiv.org/abs/2508.10044
Safety Monitor
A safety monitor is a component that can be added to the LLM deployment.
- OpenAI, Moderation: Learn how to build moderation into your AI applications, 2024, https://platform.openai.com/docs/guides/moderation
- Azure, 06/13/2024, Content filtering, https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter?tabs=warning%2Cpython
- Yu Wang, Xiaogeng Liu, Yu Li, Muhao Chen, Chaowei Xiao, 14 Mar 2024, AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting, https://arxiv.org/abs/2403.09513 Code: https://github.com/rain305f/AdaShield
- Jinhwa Kim, Ali Derakhshan, Ian G. Harris, 31 Oct 2023, Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield, https://arxiv.org/abs/2311.00172
- Francisco Munguia-Galeano, Zhengxue Zhou, Satheeshkumar Veeramani, Hatem Fakhruldeen, Louis Longley, Rob Clowes and Andrew I. Cooper, 7 Aug 2025, Chemist Eye: A Visual Language Model-Powered System for Safety Monitoring and Robot Decision-Making in Self-Driving Laboratories, https://arxiv.org/abs/2508.05148
General Thoughts on AI Safety
High-level debate and discussions of AI safety issues:
- Stephen Hawking, Max Tegmark, Stuart Russell, and Frank Wilczek. April 2014. Transcending complacency on superintelligent machines. http://www.huffingtonpost.com/stephen-hawking/artificial-intelligence_b_5174265.html
- S. Alexander. OpenAI’s “Planning for AGI and beyond”. March 2023, https://astralcodexten.substack.com/p/openais-planning-for-agi-and-beyond
- N. Bostrom. The vulnerable world hypothesis. Global Policy, 10(4):455–476, 2019. https://doi.org/10.1111/1758-5899.12718
- Smitha Milli, Dylan Hadfield-Menell, Anca Dragan, and Stuart Russell. Should robots be obedient? In International Joint Conference on Artificial Intelligence, 2017. https://arxiv.org/abs/1705.09990
- Nick Bostrom. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, March 2016, https://www.amazon.com.au/Superintelligence-Professor-Philosophy-Institute-University/dp/0198739834/
- Nick Bostrom. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, July 2014 (prior edition), https://www.amazon.com.au/Superintelligence-Dangers-Strategies-Nick-Bostrom-ebook/dp/B00LOOCGB2/
- OpenAI, May 2023, Governance of superintelligence, https://openai.com/blog/governance-of-superintelligence
- Winfield AFT, Jirotka M. Ethical governance is essential to building trust in robotics and artificial intelligence systems. Philos Trans A Math Phys Eng Sci. 2018 Oct 15;376(2133):20180085. doi: 10.1098/rsta.2018.0085. PMID: 30323000 https://pubmed.ncbi.nlm.nih.gov/30323000/
- OpenAI, Feb 2023, How should AI systems behave, and who should decide? https://openai.com/blog/how-should-ai-systems-behave
- Stuart Russell. Should we fear supersmart robots? Scientific American, 314(6):58–59, 2016. https://www.scientificamerican.com/article/should-we-fear-supersmart-robots/, https://pubmed.ncbi.nlm.nih.gov/27196844/
- A Ramalho, 2017, Will robots rule the (artistic) world? A proposed model for the legal status of creations by artificial intelligence systems, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2987757
- Bernd Carsten Stahl, 2023, Embedding responsibility in intelligent systems: from AI ethics to responsible AI ecosystems, Scientific Reports Open Access 18 May 2023, https://doi.org/10.1038/s41598-023-34622-w
- McCarthy, John, and Patrick J. Hayes. 1969. Some Philosophical Problems From the Standpoint of Artificial Intelligence, In: Machine Intelligence 4, B. Meltzer and D. Michie (eds.), Edinburgh University Press, 1969, pp. 463-502, Stanford University. http://jmc.stanford.edu/articles/mcchay69.html
- Russell, Stuart J. 2019. Human Compatible: Artificial Intelligence and the Problem of Control (Viking-Penguin Random House: London). https://link.springer.com/chapter/10.1007/978-3-030-86144-5_3
- Winfield A.F.T., Jirotka M., 2018, Ethical governance is essential to building trust in robotics and artificial intelligence systems. Philos. Trans. R. Soc. A. Math. Phys. Eng. Sci. 2018;376:13. http://www.ncbi.nlm.nih.gov/pmc/articles/pmc6191667/, https://pubmed.ncbi.nlm.nih.gov/30323000/
- Thomas Claburn 12 Oct 2023, AI safety guardrails easily thwarted, security study finds, The Register, https://www.theregister.com/2023/10/12/chatbot_defenses_dissolve/
- Alibaba Qwen Team, Sep 2023, Qwen Technical Report, https://arxiv.org/pdf/2309.16609.pdf
Government Policy and Regulation
Various governments have examined issues around regulation, and there has also been much debate:
- A. Solender and A. Gold. April 2023, Scoop: Schumer lays groundwork for Congress to regulate AI. https://www.axios.com/2023/04/13/congress-regulate-ai-tech
- UK Government. National AI strategy. Sep 2021. https://www.gov.uk/government/publications/national-ai-strategy
- AI Now Institute, A. Kak, and S. M. West. April 2023, General purpose AI poses serious risks, should not be excluded from the EU’s AI Act. https://ainowinstitute.org/publication/gpai-is-high-risk-should-not-be-excluded-from-eu-ai-act
- L. Bertuzzi. March 2023, Leading EU lawmakers propose obligations for general purpose ai. https://www.euractiv.com/section/artificial-intelligence/news/leading-eu-lawmakers-propose-obligations-for-general-purpose-ai
- UK Department for Science and Technology. Aug 2023, Policy paper: A pro-innovation approach to AI regulation. https://www.gov.uk/government/publications/ai-regulation-a-pro-innovation-approach/white-paper
- White House. May 2023. Fact sheet: Biden-Harris Administration announces new actions to promote responsible AI innovation that protects Americans’ rights and safety. https://www.whitehouse.gov/briefing-room/statements-releases/2023/05/04/fact-sheet-biden-harris-administration-annou nces-new-actions-to-promote-responsible-ai-innovation-that-protects-americans-rights-and-safety
- B. Zhang, M. Anderljung, L. Kahn, N. Dreksler, M. C. Horowitz, and A. Dafoe. 2021, Ethics and governance of artificial intelligence: Evidence from a survey of machine learning researchers. arXiv preprint arXiv:2105.02117, https://arxiv.org/abs/2105.02117
- ISO/IEC. 2023, ISO/IEC 23894:2023 Information technology — Artificial intelligence — Guidance on risk management. https://www.iso.org/standard/77304.html
- NIST, AI Risk Management Framework Concept Paper, 13 December 2021, PDF: https://www.nist.gov/system/files/documents/2021/12/14/AI%20RMF%20Concept%20Paper_13Dec2021_posted.pdf
- NIST. 2023, Artificial Intelligence Risk Management Framework (AI RMF 1.0). https://doi.org/10.6028/NIST.AI.100-1, https://www.nist.gov/itl/ai-risk-management-framework
- Tathagat Katiyar & Harshitha Chondamma II, Accorian, Feb 2023, UNDERSTANDING AI RMF 1.0 – The Artificial Intelligence Risk Management Framework https://accorian.com/understanding-ai-rmf-1-0-the-artificial-intelligence-risk-management-framework/
- E. Yudkowsky, 2023. Pausing AI developments isn’t enough. We need to shut it all down. https://time.com/6266923/ai-eliezer-yudkowsky-open-letter-not-enough
- Stephanie Palazzolo, Erin Woo, Aug 2024, Passage of California AI Bill Sends Shivers Across Tech Industry, https://www.theinformation.com/articles/passage-of-california-ai-bill-sends-shivers-across-tech-industry
Auditing and Enforcement
Papers on auditing or enforcement of AI policy:
- J. Mökander and L. Floridi. 2022, Operationalising AI governance through ethics-based auditing: An industry case study. AI and Ethics, pages 1–18, https://link.springer.com/article/10.1007/s43681-022-00171-7
- J. Mökander, J. Schuett, H. R. Kirk, and L. Floridi. June 2023. Auditing large language models: A three-layered approach. arXiv preprint arXiv:2302.08500. https://arxiv.org/abs/2302.08500
- J. Mökander, J. Morley, M. Taddeo, and L. Floridi. Ethics-based auditing of automated decision-making systems: Nature, scope, and limitations. Science and Engineering Ethics, 27(44), 2021. https://arxiv.org/abs/2110.10980
Bias and Fairness
AI engines have shown bias in various ways. The goal is to have them show "fairness" in their results:
- Dastin Jeffrey. Oct 2018, Amazon scraps secret AI recruiting tool that showed bias against women. Reuters. https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G
- Courtland R., 2018, Bias detectives: the researchers striving to make algorithms fair. Nature. 2018 Jun;558(7710):357-360. doi: 10.1038/d41586-018-05469-3. PMID: 29925973 https://pubmed.ncbi.nlm.nih.gov/29925973/
- Caliskan Aylin, Bryson Joanna J., Narayanan Arvind. 2017. Semantics derived automatically from language corpora contain human-like biases. Science. 2017;356:183–186. https://pubmed.ncbi.nlm.nih.gov/28408601/
- A Levendowski, 2018, How copyright law can fix artificial intelligence's implicit bias problem, Wash. L. Rev., https://digitalcommons.law.uw.edu/cgi/viewcontent.cgi?article=5042&context=wlr
- Hao Karen. 2020. AI researchers say scientific publishers help perpetuate racist algorithms. MIT Technology Review. https://www.technologyreview.com/2020/06/23/1004333/ai-science-publishers-perpetuate-racist-face-recognition/
- K Ramesh, A Chavan, S Pandit, 2023, A Comparative Study on the Impact of Model Compression Techniques on Fairness in Language Models, Microsoft Research, https://aclanthology.org/2023.acl-long.878.pdf, https://www.microsoft.com/en-us/research/uploads/prod/2023/07/3687_Paper.pdf
- Jwala Dhamala, Varun Kumar, Rahul Gupta, Kai-Wei Chang, Aram Galstyan, Oct 2022, An Analysis of the Effects of Decoding Algorithms on Fairness in Open-Ended Language Generation, https://arxiv.org/abs/2210.03826 (Examines top-p, top-k, and temperature in decoding algorithms from a safety perspective.)
- Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang, June 2023, Exposing Attention Glitches with Flip-Flop Language Modeling, https://arxiv.org/abs/2306.00946
- Valentin Hofmann, Pratyusha Ria Kalluri, Dan Jurafsky, Sharese King, 1 Mar 2024, Dialect prejudice predicts AI decisions about people's character, employability, and criminality, https://arxiv.org/abs/2403.00742 https://arxiv.org/pdf/2403.00742.pdf
- Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
- Cem Dilmegani, Jan 10, 2024, The Future of Large Language Models in 2024, https://research.aimultiple.com/future-of-large-language-models/
- Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4 , https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
- Mayank Vatsa, Anubhooti Jain, Richa Singh, 7 Dec 2023, Adventures of Trustworthy Vision-Language Models: A Survey, https://arxiv.org/abs/2312.04231
- Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri, 10 Dec 2024, Granite Guardian, https://arxiv.org/abs/2412.07724 https://github.com/ibm-granite/granite-guardian (Open-sourcing of safety models with many capabilities.)
- FZ Subah, Oct 2025, Mitigating and Assessing Bias and Fairness in Large Language Model-Generated Synthetic Tabular Data, Masters Thesis, Department of Engineering, University of Cambridge, https://www.mlmi.eng.cam.ac.uk/files/2023-2024/fzs21_mitigating_2024.pdf
- Abdullah Hashmat, Muhammad Arham Mirza, Agha Ali Raza, 13 Aug 2025, PakBBQ: A Culturally Adapted Bias Benchmark for QA, https://arxiv.org/abs/2508.10186
- Gustavo Bonil, Simone Hashiguti, Jhessica Silva, Jo\~ao Gondim, Helena Maia, N\'adia Silva, Helio Pedrini, Sandra Avila, 14 Aug 2025, Yet another algorithmic bias: A Discursive Analysis of Large Language Models Reinforcing Dominant Discourses on Gender and Race, https://arxiv.org/abs/2508.10304
- Alessio Buscemi, Daniele Proverbio, Alessandro Di Stefano, The Anh Han, German Castignani and Pietro Li\`o, 14 Aug 2025, FAIRGAME: a Framework for AI Agents Bias Recognition using Game Theory, https://arxiv.org/abs/2504.14325
- Suhas G Hegde, Shilpy Kaur, Aruna Tiwari, 14 Aug 2025, VectorFit : Adaptive Singular & Bias Vector Fine-Tuning of Pre-trained Foundation Models, https://arxiv.org/abs/2503.19530
- Yan Li, Guangyi Chen, Yunlong Deng, Zijian Li, Zeyu Tang, Anpeng Wu, Kun Zhang, 22 Jul 2025, Should Bias Always be Eliminated? A Principled Framework to Use Data Bias for OOD Generation, https://arxiv.org/abs/2507.17001
- Shalaka Satheesh, Katrin Klug, Katharina Beckh, H\'ector Allende-Cid, Sebastian Houben, Teena Hassan, 22 Jul 2025, GG-BBQ: German Gender Bias Benchmark for Question Answering, https://arxiv.org/abs/2507.16410
- Kristin Gnadt, David Thulke, Simone Kopeinik, Ralf Schl\"uter, 22 Jul 2025, Exploring Gender Bias in Large Language Models: An In-depth Dive into the German Language, https://arxiv.org/abs/2507.16557
- Zhenyuan Chen, 21 Jul 2025, Rethinking Inductive Bias in Geographically Neural Network Weighted Regression, https://arxiv.org/abs/2507.09958
- Sergio Morales, Robert Claris\'o, Jordi Cabot, 22 Jul 2025, LangBiTe: A Platform for Testing Bias in Large Language Models, https://arxiv.org/abs/2404.18558
- Yanbiao Ma, Bowei Liu, Boyuan Gao, Wei Dai, Jiayi Chen, Shuo Li, Andi Zhang, 22 Jul 2025, Revealing Bias Formation in Deep Neural Networks Through the Geometric Mechanisms of Human Visual Decoupling, https://arxiv.org/abs/2502.11809
- Brian Liu and Rahul Mazumder, 21 Jul 2025, Randomization Can Reduce Both Bias and Variance: A Case Study in Random Forests, https://arxiv.org/abs/2402.12668
- Ali Vardasbi, Gustavo Penha, Claudia Hauff, and Hugues Bouchard, 23 Jul 2025, Adaptive Repetition for Mitigating Position Bias in LLM-Based Ranking, https://arxiv.org/abs/2507.17788
- Steven A. Frank, 24 Jul 2025, The Price equation reveals a universal force-metric-bias law of algorithmic learning and natural selection, https://arxiv.org/abs/2507.18549
- Bruno Scarone, Alfredo Viola, Ren\'ee J. Miller, Ricardo Baeza-Yates, 24 Jul 2025, A Principled Approach for Data Bias Mitigation, https://arxiv.org/abs/2405.12312
- He-Yang Xu, Hongxiang Gao, Yuwen Li, Xiu-Shen Wei and Chengyu Liu, 24 Jul 2025, Masked Autoencoders that Feel the Heart: Unveiling Simplicity Bias for ECG Analyses, https://arxiv.org/abs/2506.22495
- Yuen Chen, Vethavikashini Chithrra Raghuram, Justus Mattern, Rada Mihalcea, Zhijing Jin, 24 Jul 2025, Causally Testing Gender Bias in LLMs: A Case Study on Occupational Bias, https://arxiv.org/abs/2212.10678
- Yongyi Yang, Hidenori Tanaka, Wei Hu, 17 Jul 2025, Provable Low-Frequency Bias of In-Context Learning of Representations, https://arxiv.org/abs/2507.13540
- Yile Yan, Yuqi Zhu, Wentao Xu, 18 Jul 2025, Bias in Decision-Making for AI's Ethical Dilemmas: A Comparative Study of ChatGPT and Claude, https://arxiv.org/abs/2501.10484
- Andr\'es Morales-Forero (1), Lili J. Rueda (2), Ronald Herrera (3), Samuel Bassetto (1), Eric Coatanea (4) ((1) Polytechnique Montr\'eal, (2) Universidad El Bosque, (3) Boehringer Ingelheim International GmbH, (4) Tampere University), 10 Jul 2025, Predictive Representativity: Uncovering Racial Bias in AI-based Skin Cancer Detection, https://arxiv.org/abs/2507.14176
- Xiaotong Luo, Shengda Zhuo, Min Chen, Lichun Li, Ruizhao Lu, Wenqi Fan, Shuqiang Huang and Yin Tang, 12 Jul 2025, From Bias to Behavior: Learning Bull-Bear Market Dynamics with Contrastive Modeling, https://arxiv.org/abs/2507.14182
- Eoghan Cunningham, James Cross, Derek Greene, 16 Jul 2025, Identifying Algorithmic and Domain-Specific Bias in Parliamentary Debate Summarisation, https://arxiv.org/abs/2507.14221
- Garud Iyengar, Henry Lam, Tianyu Wang, 21 Jul 2025, Optimizer's Information Criterion: Dissecting and Correcting Bias in Data-Driven Optimization, https://arxiv.org/abs/2306.10081
- Evangelia Spiliopoulou, Riccardo Fogliato, Hanna Burnsky, Tamer Soliman, Jie Ma, Graham Horwood, Miguel Ballesteros, 8 Aug 2025, Play Favorites: A Statistical Method to Measure Self-Bias in LLM-as-a-Judge, https://arxiv.org/abs/2508.06709
- Falaah Arif Khan, Nivedha Sivakumar, Yinong Oliver Wang, Katherine Metcalf, Cezanne Camacho, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff, 9 Aug 2025, Investigating Intersectional Bias in Large Language Models using Confidence Disparities in Coreference Resolution, https://arxiv.org/abs/2508.07111
- Vivek Hruday Kavuri, Vysishtya Karanam, Venkata Jahnavi Venkamsetty, Kriti Madumadukala, Lakshmipathi Balaji Darur, Ponnurangam Kumaraguru, 10 Aug 2025, Freeze and Reveal: Exposing Modality Bias in Vision-Language Models, https://arxiv.org/abs/2508.07432
- Vojt\v{e}ch Stan\v{e}k, Karel Srna, Anton Firc, Kamil Malinka, 11 Aug 2025, SCDF: A Speaker Characteristics DeepFake Speech Dataset for Bias Analysis, https://arxiv.org/abs/2508.07944
- Xinyi Wu, Yifei Wang, Stefanie Jegelka, Ali Jadbabaie, 9 Aug 2025, On the Emergence of Position Bias in Transformers, https://arxiv.org/abs/2502.01951
- Walter Laurito, Benjamin Davis, Peli Grietzer, Tom\'a\v{s} Gaven\v{c}iak, Ada B\"ohm, Jan Kulveit, 11 Aug 2025, AI-AI Bias: large language models favor communications generated by large language models, https://arxiv.org/abs/2407.12856
- Dasol Choi, Jihwan Lee, Minjae Lee, Minsuk Kahng, 10 Aug 2025, When Cars Have Stereotypes: Auditing Demographic Bias in Objects from Text-to-Image Models, https://arxiv.org/abs/2508.03483
- Anuprabha M, Krishna Gurugubelli and Anil Kumar Vuppala, 11 Aug 2025, Fairness in Dysarthric Speech Synthesis: Understanding Intrinsic Bias in Dysarthric Speech Cloning using F5-TTS, https://arxiv.org/abs/2508.05102
- Chao Wu, Zhenyi Wang, Kangxian Xie, Naresh Kumar Devulapally, Vishnu Suresh Lokhande, Mingchen Gao, 28 Jul 2025, Model-Agnostic Gender Bias Control for Text-to-Image Generation via Sparse Autoencoder, https://arxiv.org/abs/2507.20973
- Gabriel Recchia, Chatrik Singh Mangat, Jinu Nyachhyon, Mridul Sharma, Callum Canavan, Dylan Epstein-Gross, Muhammed Abdulbari, 17 May 2025, Confirmation bias: A challenge for scalable oversight, https://arxiv.org/abs/2507.19486
- Pavel Korshunov, Ketan Kotwal, Christophe Ecabert, Vidit Vidit, Amir Mohammadi, and Sebastien Marcel, 28 Jul 2025, Investigation of Accuracy and Bias in Face Recognition Trained with Synthetic Data, https://arxiv.org/abs/2507.20782
- Hoyoung Lee, Junhyuk Seo, Suhwan Park, Junhyeong Lee, Wonbin Ahn, Chanyeol Choi, Alejandro Lopez-Lira, Yongjae Lee, 28 Jul 2025, Your AI, Not Your View: The Bias of LLMs in Investment Analysis, https://arxiv.org/abs/2507.20957
- Yooshin Cho, Hanbyel Cho, Janghyeon Lee, HyeongGwon Hong, Jaesung Ahn, Junmo Kim, 27 Jul 2025, Controllable Feature Whitening for Hyperparameter-Free Bias Mitigation, https://arxiv.org/abs/2507.20284
- Seoyoung Doh, Hyeon Jeon, Sungbok Shin, Ghulam Jilani Quadri, Nam Wook Kim, Jinwook Seo, 28 Jul 2025, Understanding Bias in Perceiving Dimensionality Reduction Projections, https://arxiv.org/abs/2507.20805
- Hitomi Yanaka, Xinqi He, Jie Lu, Namgi Han, Sunjin Oh, Ryoma Kumon, Yuma Matsuoka, Katsuhiko Watabe, Yuko Itatsu, 27 Jul 2025, Intersectional Bias in Japanese Large Language Models from a Contextualized Perspective, https://arxiv.org/abs/2506.12327
- Franck Bardol, 17 Jun 2025, ChatGPT Reads Your Tone and Responds Accordingly -- Until It Does Not -- Emotional Framing Induces Bias in LLM Outputs, https://arxiv.org/abs/2507.21083
- Zhenyu Pan, Yutong Zhang, Jianshu Zhang, Haoran Lu, Haozheng Luo, Yuwei Han, Philip S. Yu, Manling Li, Han Liu, 30 Jul 2025, FairReason: Balancing Reasoning and Social Bias in MLLMs, https://arxiv.org/abs/2507.23067
- Patricia A. Apell\'aniz and Ana Jim\'enez and Borja Arroyo Galende and Juan Parras and Santiago Zazo, 31 Jul 2025, Artificial Inductive Bias for Synthetic Tabular Data Generation in Data-Scarce Scenarios, https://arxiv.org/abs/2407.03080
- Utku Ozbulak, Seyed Amir Mousavi, Francesca Tozzi, Niki Rashidian, Wouter Willaert, Wesley De Neve, Joris Vankerschaver, 31 Jul 2025, Revisiting the Evaluation Bias Introduced by Frame Sampling Strategies in Surgical Video Segmentation Using SAM2, https://arxiv.org/abs/2502.20934
- Afrozah Nadeem, Mark Dras, and Usman Naseem, 31 Jul 2025, Framing Political Bias in Multilingual LLMs Across Pakistani Languages, https://arxiv.org/abs/2506.00068
- Bushra Asseri, Estabrag Abdelaziz, Areej Al-Wabil, 30 Jul 2025, Prompt Engineering Techniques for Mitigating Cultural Bias Against Arabs and Muslims in Large Language Models: A Systematic Review, https://arxiv.org/abs/2506.18199
- Simon M\"unker, 31 Jul 2025, Cultural Bias in Large Language Models: Evaluating AI Agents through Moral Questionnaires, https://arxiv.org/abs/2507.10073
- Kwesi Cobbina and Tianyi Zhou, 30 Jul 2025, Where to show Demos in Your Prompt: A Positional Bias of In-Context Learning, https://arxiv.org/abs/2507.22887
- Adam Block and Cyril Zhang, 31 Jul 2025, EMA Without the Lag: Bias-Corrected Iterate Averaging Schemes, https://arxiv.org/abs/2508.00180
- Kangda Wei, Hasnat Md Abdullah, Ruihong Huang, 1 Aug 2025, Mitigating Gender Bias via Fostering Exploratory Thinking in LLMs, https://arxiv.org/abs/2505.17217
- Zhenpeng Chen, Xinyue Li, Jie M. Zhang, Weisong Sun, Ying Xiao, Tianlin Li, Yiling Lou, Yang Liu, 5 Aug 2025, Software Fairness Dilemma: Is Bias Mitigation a Zero-Sum Game?, https://arxiv.org/abs/2508.03323
- Jiangen He, 2 Aug 2025, Who Gets Cited? Gender- and Majority-Bias in LLM-Driven Reference Selection, https://arxiv.org/abs/2508.02740
- Shahed Masoudian, Gustavo Escobedo, Hannah Strauss, Markus Schedl, 5 Aug 2025, Investigating Gender Bias in LLM-Generated Stories via Psychological Stereotypes, https://arxiv.org/abs/2508.03292
- Joseph Lee, Tianqi Shang, Jae Young Baik, Duy Duong-Tran, Shu Yang, Lingyao Li, Li Shen, 4 Aug 2025, From Promising Capability to Pervasive Bias: Assessing Large Language Models for Emergency Department Triage, https://arxiv.org/abs/2504.16273
- Zhen Zou, Feng Zhao, 5 Aug 2025, FEB-Cache: Frequency-Guided Exposure Bias Reduction for Enhancing Diffusion Transformer Caching, https://arxiv.org/abs/2503.07120
- Hamed Ayoobi, Nico Potyka, Anna Rapberger, Francesca Toni, 6 Aug 2025, Argumentative Debates for Transparent Bias Detection [Technical Report], https://arxiv.org/abs/2508.04511
- Tiffany Zhu, Iain Weissburg, Kexun Zhang, William Yang Wang, 6 Aug 2025, Human Bias in the Face of AI: Examining Human Judgment Against Text Labeled as AI Generated, https://arxiv.org/abs/2410.03723
- Tosin Fadahunsi, Giordano d'Aloisio, Antinisca Di Marco, Federica Sarro, 5 Aug 2025, How Do Generative Models Draw a Software Engineer? A Case Study on Stable Diffusion Bias, https://arxiv.org/abs/2501.09014
- Kelsey Doerksen, Yuliya Marchetti, Kevin Bowman, Steven Lu, James Montgomery, Yarin Gal, Freddie Kalaitzis, Kazuyuki Miyazaki, 6 Aug 2025, Leveraging Deep Learning for Physical Model Bias of Global Air Quality Estimates, https://arxiv.org/abs/2508.04886
- Menghua Jiang, Yuxia Lin, Baoliang Chen, Haifeng Hu, Yuncheng Jiang, Sijie Mai, 7 Aug 2025, Disentangling Bias by Modeling Intra- and Inter-modal Causal Attention for Multimodal Sentiment Analysis, https://arxiv.org/abs/2508.04999
- Jiahao Chen, Bin Qin, Jiangmeng Li, Hao Chen, Bing Su, 8 Aug 2025, Rethinking the Bias of Foundation Model under Long-tailed Distribution, https://arxiv.org/abs/2501.15955
- Shivam Dubey, 12 Aug 2025, Activation Steering for Bias Mitigation: An Interpretable Approach to Safer LLMs, https://arxiv.org/abs/2508.09019
- Afrozah Nadeem, Mark Dras, Usman Naseem, 12 Aug 2025, Steering Towards Fairness: Mitigating Political Bias in LLMs, https://arxiv.org/abs/2508.08846
- Krzysztof Maziarz, Guoqing Liu, Hubert Misztela, Austin Tripp, Junren Li, Aleksei Kornev, Piotr Gai\'nski, Holger Hoefling, Mike Fortunato, Rishi Gupta, Marwin Segler, 12 Aug 2025, Chemist-aligned retrosynthesis by ensembling diverse inductive bias models, https://arxiv.org/abs/2412.05269
- Zara Siddique, Irtaza Khalid, Liam D. Turner, Luis Espinosa-Anke, 13 Aug 2025, Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLMs, https://arxiv.org/abs/2503.05371
- Jingwei Li, Jing Xu, Zifan Wang, Huishuai Zhang, Jingzhao Zhang, 13 Aug 2025, Understanding Nonlinear Implicit Bias via Region Counts in Input Space, https://arxiv.org/abs/2505.11370
- Parker Whitfill, 14 Aug 2025, Note on Selection Bias in Observational Estimates of Algorithmic Progress, https://arxiv.org/abs/2508.11033
- Aiswarya Konavoor, Raj Abhijit Dandekar, Rajat Dandekar, Sreedath Panat, 15 Aug 2025, Vision-Language Models display a strong gender bias, https://arxiv.org/abs/2508.11262
- Binxu Wang, Cengiz Pehlevan, 14 Aug 2025, An Analytical Theory of Spectral Bias in the Learning Dynamics of Diffusion Models, https://arxiv.org/abs/2503.03206
- Keyon Vafa, Peter G. Chang, Ashesh Rambachan, Sendhil Mullainathan, 14 Aug 2025, What Has a Foundation Model Found? Using Inductive Bias to Probe for World Models, https://arxiv.org/abs/2507.06952
- Pengcheng Huang, Shuhao Liu, Zhenghao Liu, Yukun Yan, Shuo Wang, Zulong Chen, Tong Xiao, 18 Aug 2025, PC-Sampler: Position-Aware Calibration of Decoding Bias in Masked Diffusion Models, https://arxiv.org/abs/2508.13021
- Yuanzhe Hu, Kinshuk Goel, Vlad Killiakov, Yaoqing Yang, 18 Aug 2025, Eigenspectrum Analysis of Neural Networks without Aspect Ratio Bias, https://arxiv.org/abs/2506.06280
- Evan Chen, Run-Jun Zhan, Yan-Bai Lin, Hung-Hsuan Chen, 15 Aug 2025, More Women, Same Stereotypes: Unpacking the Gender Bias Paradox in Large Language Models, https://arxiv.org/abs/2503.15904
- Koffi Ismael Ouattara, Ioannis Krontiris, Theo Dimitrakos and Frank Kargl, 19 Aug 2025, Assessing Trustworthiness of AI Training Dataset using Subjective Logic -- A Use Case on Bias, https://arxiv.org/abs/2508.13813
- Jonathan A. Karr Jr., Benjamin F. Herbst, Ting Hua, Matthew Hauenstein, Georgina Curto, Nitesh V. Chawla, 14 Aug 2025, Combating Homelessness Stigma with LLMs: A New Multi-Modal Dataset for Bias Detection, https://arxiv.org/abs/2508.13187
- Hao Zhang and Chen Li and Basura Fernando, 19 Aug 2025, Mitigating Easy Option Bias in Multiple-Choice Question Answering, https://arxiv.org/abs/2508.13428
- Dariia Puhach and Amir H. Payberah and \'Eva Sz\'ekely, 19 Aug 2025, Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM, https://arxiv.org/abs/2508.13603
- Vinod Kumar Chauhan, Lei Clifton, Achille Sala\"un, Huiqi Yvonne Lu, Kim Branson, Patrick Schwab, Gaurav Nigam, David A. Clifton, 20 Aug 2025, Sample Selection Bias in Machine Learning for Healthcare, https://arxiv.org/abs/2405.07841
- Ilja Kuzborskij, Yasin Abbasi Yadkori, 20 Aug 2025, Low-rank bias, weight decay, and model merging in neural networks, https://arxiv.org/abs/2502.17340
- Haodi Zhong, Liuxin Zou, Di Wang, Bo Wang, Zhenxing Niu, Quan Wang, 21 Aug 2025, EvoFormer: Learning Dynamic Graph-Level Representations with Structural and Temporal Bias Correction, https://arxiv.org/abs/2508.15378
- Cheng Wang, Gelei Deng, Xianglin Yang, Han Qiu, Tianwei Zhang, 21 Aug 2025, When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models, https://arxiv.org/abs/2508.15407
- Tuhina Tripathi, Manya Wadhwa, Greg Durrett, Scott Niekum, 21 Aug 2025, Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation, https://arxiv.org/abs/2504.14716
- Saumya Roy, 13 Aug 2025, Persuasiveness and Bias in LLM: Investigating the Impact of Persuasiveness and Reinforcement of Bias in Language Models, https://arxiv.org/abs/2508.15798
- Xu Pan, Jingxuan Fan, Zidi Xiong, Ely Hahami, Jorin Overwiening, Ziqian Xie, 16 Aug 2025, User-Assistant Bias in LLMs, https://arxiv.org/abs/2508.15815
- Srikant Panda, Vishnu Hari, Kalpana Panda, Amit Agarwal, Hitesh Laxmichand Patel, 18 Aug 2025, Who's Asking? Investigating Bias Through the Lens of Disability Framed Queries in LLMs, https://arxiv.org/abs/2508.15831
- Tom Jacobs, Chao Zhou, Rebekka Burkholz, 22 Aug 2025, Mirror, Mirror of the Flow: How Does Regularization Shape Implicit Bias?, https://arxiv.org/abs/2504.12883
- Gousia Habib, Tausifa Jan Saleem, Ishfaq Ahmad Malik, Brejesh Lall, 21 Aug 2025, LIB-KD: Teaching Inductive Bias for Efficient Vision Transformer Distillation and Compression, https://arxiv.org/abs/2310.00369
- Shir Bernstein, David Beste, Daniel Ayzenshteyn, Lea Schonherr, Yisroel Mirsky, 24 Aug 2025, Trust Me, I Know This Function: Hijacking LLM Static Analysis using Bias, https://arxiv.org/abs/2508.17361
- Pooja S. B. Rao and Laxminarayen Nagarajan Venkatesan and Mauro Cherubini and Dinesh Babu Jayagopi, 21 Aug 2025, Invisible Filters: Cultural Bias in Hiring Evaluations Using Large Language Models, https://arxiv.org/abs/2508.16673
- Viacheslav Yusupov, Danil Maksimov, Ameliia Alaeva, Tatiana Zaitceva, Antipina Anna, Anna Vasileva, Chenlin Liu, Rayuth Chheng, Danil Sazanakov, Andrey Chetvergov, Alina Ermilova, Egor Shvetsov, 23 Aug 2025, Token Homogenization under Positional Bias, https://arxiv.org/abs/2508.17126
- Kyra Wilson, Sourojit Ghosh, Aylin Caliskan, 24 Aug 2025, Bias Amplification in Stable Diffusion's Representation of Stigma Through Skin Tones and Their Homogeneity, https://arxiv.org/abs/2508.17465
- Xuan-Bac Nguyen, Thanh-Dat Truong, Pawan Sinha, Khoa Luu, 25 Aug 2025, BRAIN: Bias-Mitigation Continual Learning Approach to Vision-Brain Understanding, https://arxiv.org/abs/2508.18187
- Federico Marcuzzi, Xuefei Ning, Roy Schwartz, and Iryna Gurevych, 25 Aug 2025, How Quantization Shapes Bias in Large Language Models, https://arxiv.org/abs/2508.18088
- Emanuele Zangrando, Piero Deidda, Simone Brugiapaglia, Nicola Guglielmi, Francesco Tudisco, 23 Aug 2025, Provable Emergence of Deep Neural Collapse and Low-Rank Bias in $L^2$-Regularized Nonlinear Networks, https://arxiv.org/abs/2402.03991
- Jihwan Oh, Minchan Jeong, Jongwoo Ko, Se-Young Yun, 24 Aug 2025, Understanding Bias Reinforcement in LLM Agents Debate, https://arxiv.org/abs/2503.16814
- Jiancong Xiao, Ziniu Li, Xingyu Xie, Emily Getzen, Cong Fang, Qi Long, Weijie J. Su, 25 Aug 2025, On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization, https://arxiv.org/abs/2405.16455
Toxicity
Toxicity is the LLM safety issue of ensuring that the AI does not give "toxic" answers to the user. There are many subtypes of this issue, such as ensuring that answers are appropriate, non-aggressive, non-disparaging, not insulting, and generally helpful. The overall tone of AI interactions should be positive rather than negative.
Research papers on LLM toxicity issues:
- Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
- Cem Dilmegani, Jan 10, 2024, The Future of Large Language Models in 2024, https://research.aimultiple.com/future-of-large-language-models/
- Shayne Longpre, Stella Biderman, Alon Albalak, Hailey Schoelkopf, Daniel McDuff, Sayash Kapoor, Kevin Klyman, Kyle Lo, Gabriel Ilharco, Nay San, Maribeth Rauh, Aviya Skowron, Bertie Vidgen, Laura Weidinger, Arvind Narayanan, Victor Sanh, David Adelani, Percy Liang, Rishi Bommasani, Peter Henderson, Sasha Luccioni, Yacine Jernite, Luca Soldaini, 26 Jun 2024 (v2), The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources, https://arxiv.org/abs/2406.16746
- Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri, 10 Dec 2024, Granite Guardian, https://arxiv.org/abs/2412.07724 https://github.com/ibm-granite/granite-guardian (Open-sourcing of safety models with many capabilities.)
- Jueon Park, Yein Park, Minju Song, Soyon Park, Donghyeon Lee, Seungheun Baek and Jaewoo Kang, 5 Aug 2025, CoTox: Chain-of-Thought-Based Molecular Toxicity Reasoning and Prediction, https://arxiv.org/abs/2508.03159
- Axel Delaval, Shujian Yang, Haicheng Wang, Han Qiu, Jialiang Lu, 15 Aug 2025, ToxiFrench: Benchmarking and Enhancing Language Models via CoT Fine-Tuning for French Toxicity Detection, https://arxiv.org/abs/2508.11281
Ethics of Responsible AI Research
Ethical issues in AI research and related publication of results:
- Partnership on AI. 2021, Managing the risks of AI research: Six Recommendations for Responsible Publication. https://partnershiponai.org/paper/responsible-publication-recommendations
- M. Brundage, S. Avin, J. Wang, H. Belfield, G. Krueger, G. Hadfield, H. Khlaaf, J. Yang, H. Toner, R. Fong, T. Maharaj, P. W. Koh, S. Hooker, J. Leung, A. Trask, E. Bluemke, J. Lebensold, C. O’Keefe, M. Koren, T. Ryffel, J. Rubinovitz, T. Besiroglu, F. Carugati, J. Clark, P. Eckersley, S. de Haas, M. Johnson, B. Laurie, A. Ingerman, I. Krawczuk, A. Askell, R. Cammarota, A. Lohn, D. Krueger, C. Stix, P. Henderson, L. Graham, C. Prunkl, B. Martin, E. Seger, N. Zilberman, S. Ó. hÉigeartaigh, F. Kroeger, G. Sastry, R. Kagan, A. Weller, B. Tse, E. Barnes, A. Dafoe, P. Scharre, A. Herbert-Voss, M. Rasser, S. Sodhani, C. Flynn, T. K. Gilbert, L. Dyer, S. Khan, Y. Bengio, and M. Anderljung. Toward trustworthy AI development: Mechanisms for supporting verifiable claims. arXiv preprint arXiv:2004.07213, 2020. https://arxiv.org/abs/2004.07213
- R. Crootof. 2019, Artificial intelligence research needs responsible publication norms. https://www.lawfareblog.com/artificial-intelligence-research-needs-responsible-publication-norms
- C. Ashurst, S. Barocas, R. Campbell, and D. Raji. Disentangling the components of ethical research in machine learning. In 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 2057–2068, 2022. http://dx.doi.org/10.1145/3531146.3533781, https://www.researchgate.net/publication/361439688_Disentangling_the_Components_of_Ethical_Research_in_Machine_Learning
- Herrmann H. What's next for responsible artificial intelligence: a way forward through responsible innovation. Heliyon. 2023 Mar 11;9(3):e14379. doi: 10.1016/j.heliyon.2023.e14379. eCollection 2023 Mar. PMID: 36967876, https://pubmed.ncbi.nlm.nih.gov/36967876/
- Ethically governing artificial intelligence in the field of scientific research and innovation. González-Esteban Y Patrici Calvo E. Heliyon. 2022 Feb 16;8(2):e08946. doi: 10.1016/j.heliyon.2022.e08946. eCollection 2022 Feb. PMID: 35243068, https://pubmed.ncbi.nlm.nih.gov/35243068/
- Dzobo K, Adotey S, Thomford NE, Dzobo W. Integrating Artificial and Human Intelligence: A Partnership for Responsible Innovation in Biomedical Engineering and Medicine. OMICS. 2020 May;24(5):247-263. doi: 10.1089/omi.2019.0038. Epub 2019 Jul 16. PMID: 31313972, https://pubmed.ncbi.nlm.nih.gov/31313972/
- d'Aquin M., Troullinou P., O'Connor N.E., Cullen A., Faller G., Holden L. 2018 AAAI/ACM Conference on AI, Ethics, and Society (AIES ’18) ACM; New York: 2018. Towards an “ethics by design” methodology for AI research projects”; pp. 54–59. https://www.researchgate.net/publication/330297261_Towards_an_Ethics_by_Design_Methodology_for_AI_Research_Projects
- Dignum Virginia. 2019. Responsible Artificial Intelligence. How to Develop and Use AI in a Responsible Way. Springer, https://link.springer.com/book/10.1007/978-3-030-30371-6
- European Commission. 2012. Responsible Research and Innovation: Europe’s Ability to Respond to Societal Challenges. Brussels. https://op.europa.eu/en/publication-detail/-/publication/2be36f74-b490-409e-bb60-12fd438100fe
- Helmore Edward. 2019. Profit over safety? Boeing under fire over 737 Max crashes as families demand answers. Guardian. https://www.theguardian.com/business/2019/jun/17/boeing-737-max-ethiopian-airlines-crash
- High-level expert Group on Artificial Intelligence. European Commission; 2019. Ethics Guidelines for Trustworthy AI. Brussels. https://op.europa.eu/en/publication-detail/-/publication/d3988569-0434-11ea-8c1f-01aa75ed71a1
- Prates M., Avelar P., Lamb L.C. 2018, On quantifying and understanding the role of ethics in AI research: a historical account of flagship conferences and journals. EPiC Series in Computing. 2018;55:188–201. https://arxiv.org/abs/1809.08328
- Castelvecchi D., 2021, Prestigious AI meeting takes steps to improve ethics of research. Nature. 2021 Jan;589(7840):12-13. doi: 10.1038/d41586-020-03611-8. PMID: 33361804, https://pubmed.ncbi.nlm.nih.gov/33361804/
- Bouhouita-Guermech S, Gogognon P, Bélisle-Pipon JC. 2023, Specific challenges posed by artificial intelligence in research ethics. Front Artif Intell. 2023 Jul 6;6:1149082. doi: 10.3389/frai.2023.1149082. eCollection 2023. PMID: 37483869 https://pubmed.ncbi.nlm.nih.gov/37483869/
- Gibney E., 2020, The battle for ethical AI at the world's biggest machine-learning conference. Nature. 2020 Jan;577(7792):609. doi: 10.1038/d41586-020-00160-y. PMID: 31992885, https://pubmed.ncbi.nlm.nih.gov/31992885/
- Sánchez López JD, Cambil Martín J, Villegas Calvo M, Luque Martínez F., 2020. Ethical conflicts between authonomy and deep learning, J Healthc Qual Res. 2020 Jan-Feb;35(1):51-52. doi: 10.1016/j.jhqr.2019.06.009. Epub 2019 Nov 26. PMID: 31784256, https://pubmed.ncbi.nlm.nih.gov/31784256/
- Prabhu SP., 2019, Ethical challenges of machine learning and deep learning algorithms. Lancet Oncol. 2019 May;20(5):621-622. doi: 10.1016/S1470-2045(19)30230-X. PMID: 31044701, https://pubmed.ncbi.nlm.nih.gov/31044701/
- Dignum V. Ethics in artificial intelligence: introduction to the special issue. Ethics Inf. Technol. 2018;20:1–3. https://link.springer.com/article/10.1007/s10676-018-9450-z
- IEEE. 2019. "Ethically Aligned Design: A Vision for Prioritizing Human Well-being With Autonomous and Intelligent Systems [First Edition]." The IEEE Global Initiative on Ethics of Autonomous and Intelligent Systems. https://standards.ieee.org/content/ieee-standards/en/industry-connections/ec/autonomous-systems.html
- Stuart Russell, Daniel Dewey, and Max Tegmark. 2015. Research priorities for robust and beneficial artificial intelligence. AI Magazine, 36(4):105–114, 2015. PDF: https://futureoflife.org/data/documents/research_priorities.pdf
- Peter Dizikes, December 11, 2023, MIT group releases white papers on governance of AI, MIT News, https://news.mit.edu/2023/mit-group-releases-white-papers-governance-ai-1211
- Thomas Mildner, Orla Cooney, Anna-Maria Meck, Marion Bartl, Gian-Luca Savino, Philip R. Doyle, Diego Garaialde, Leigh Clark, John Sloan, Nina Wenig, Rainer Malaka, Jasmin Niess, 26 Jan 2024, Listening to the Voices: Describing Ethical Caveats of Conversational User Interfaces According to Experts and Frequent Users, Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI '24), May 11--16, 2024, Honolulu, HI, USA, https://arxiv.org/abs/2401.14746 https://doi.org/https://doi.org/10.1145/3613904.3642542
- Balasubramaniam S. , Vanajaroselin Chirchi, Seifedine Kadry, Moorthy Agoramoorthy, Gururama Senthilvel P., Satheesh Kumar K., and Sivakumar T. A., Oct 2024, The Road Ahead: Emerging Trends, Unresolved Issues, and ConcludingRemarksinGenerativeAI—AComprehensiveReview, International Journal of Intelligent Systems, Volume 2024, Article ID 4013195, 38 pages, https://doi.org/10.1155/2024/4013195 https://www.researchgate.net/profile/Balasubramaniam-s-2/publication/384729387_The_Road_Ahead_Emerging_Trends_Unresolved_Issues_and_Concluding_Remarks_in_Generative_AI-A_Comprehensive_Review/links/6705560cf5eb7108c6e5d261/The-Road-Ahead-Emerging-Trends-Unresolved-Issues-and-Concluding-Remarks-in-Generative-AI-A-Comprehensive-Review.pdf
AI Alignment Research
Alignment is the study of how to ensure that AI engines are "aligned" with the goals and intent of humans.
- J. Leike, J. Schulman, and J. Wu. OpenAI, August 2022. Our approach to alignment research. https://openai.com/blog/our-approach-to-alignment-research
- OpenAI, July 2023, Introducing Superalignment, https://openai.com/blog/introducing-superalignment
- V. Krakovna and R. Shah. 2023, Some high-level thoughts on the DeepMind alignment team’s strategy. https://www.alignmentforum.org/posts/a9SPcZ6GXAg9cNKdi/linkpost-some-high-level-thoughts-on-the-deepmind-alignment
- J. Leike. Dec 2022, Why I’m optimistic about our alignment approach. https://aligned.substack.com/p/alignment-optimism
- Nate Soares and Benja Fallenstein. Aligning superintelligence with human interests: A technical research agenda. Technical report, Machine Intelligence Research Institute, 2014. https://www.semanticscholar.org/paper/Aligning-Superintelligence-with-Human-Interests%3A-A-Soares-Fallenstein/d8033a314493c8df3791912272ac4b58d3a7b8c2
- Jessica Taylor, Eliezer Yudkowsky, Patrick LaVictoire, and Andrew Critch. 2016. Alignment for advanced machine learning systems. Technical report, Machine Intelligence Research Institute, 2016. PDF: https://intelligence.org/files/AlignmentMachineLearning.pdf
- Daniel Weld and Oren Etzioni. The first law of robotics (a call to arms). Proceedings of the AAAI Conference on Artificial Intelligence, 12, pages 1042–1047, 1994. https://aaai.org/papers/01042-the-first-law-of-robotics-a-call-to-arms/
- Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, Mar 2022, Training language models to follow instructions with human feedback, https://arxiv.org/abs/2203.02155 (InstructGPT main paper from OpenAI in 2022.)
- Ziniu Li1, Tian Xu, Yushun Zhang, Zhihang Lin, Yang Yu, Ruoyu Sun, Zhi-Quan Luo, 2024, ReMax: ASimple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models, https://openreview.net/pdf?id=Stn8hXkpe6
- Aibek Bekbayev, Sungbae Chun, Yerzat Dulat, James Yamazaki, Aug 2023, The Poison of Alignment, https://arxiv.org/abs/2308.13449
- Yotam Wolf, Noam Wies, Yoav Levine, and Amnon Shashua. Fundamental limitations of alignment in large language models. arXiv preprint arXiv:2304.11082, 2023. https://arxiv.org/abs/2304.11082
- Renze Lou, Kai Zhang, Wenpeng Yin, 25 May 2024 (v8), Large Language Model Instruction Following: A Survey of Progresses and Challenges, https://arxiv.org/abs/2303.10475 Project: https://github.com/RenzeLou/awesome-instruction-learning
- Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, Johan Ferret, 22 Jan 2024, WARM: On the Benefits of Weight Averaged Reward Models, https://arxiv.org/abs/2401.12187 (Uses multiple reward models to avoid problems with the LLM "hacking rewards" in unforeseen ways.)
- NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared RELU.)
- Piotr Wojciech Mirowski, Juliette Love, Kory W. Mathewson, Shakir Mohamed, 3 Jun 2024 (v2), A Robot Walks into a Bar: Can Language Models Serve as Creativity Support Tools for Comedy? An Evaluation of LLMs' Humour Alignment with Comedians, https://arxiv.org/abs/2405.20956 (The unfunny fact that AI is bad at humor.)
- Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
- Mintong Kang, Nezihe Merve Gürel, Ning Yu, Dawn Song, Bo Li, July 2024, C-RAG: Certified Generation Risks for Retrieval-Augmented Language Models, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:22963-23000, 2024, https://proceedings.mlr.press/v235/kang24a.html
- Rohin Shah, Seb Farquhar, Anca Dragan, 21st Aug 2024, AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work, https://www.alignmentforum.org/posts/79BPxvSsjzBkiSyTq/agi-safety-and-alignment-at-google-deepmind-a-summary-of
- Shayne Longpre, Stella Biderman, Alon Albalak, Hailey Schoelkopf, Daniel McDuff, Sayash Kapoor, Kevin Klyman, Kyle Lo, Gabriel Ilharco, Nay San, Maribeth Rauh, Aviya Skowron, Bertie Vidgen, Laura Weidinger, Arvind Narayanan, Victor Sanh, David Adelani, Percy Liang, Rishi Bommasani, Peter Henderson, Sasha Luccioni, Yacine Jernite, Luca Soldaini, 26 Jun 2024 (v2), The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources, https://arxiv.org/abs/2406.16746
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
- Hao Zhou, Chengming Hu, Ye Yuan, Yufei Cui, Yili Jin, Can Chen, Haolun Wu, Dun Yuan, Li Jiang, Di Wu, Xue Liu, Charlie Zhang, Xianbin Wang, Jiangchuan Liu, 17 May 2024, Large Language Model (LLM) for Telecommunications: A Comprehensive Survey on Principles, Key Techniques, and Opportunities, https://arxiv.org/abs/2405.10825
- Zekun Moore Wang, Shawn Wang, Kang Zhu, Jiaheng Liu, Ke Xu, Jie Fu, Wangchunshu Zhou, Wenhao Huang, 17 Oct 2024, PopAlign: Diversifying Contrasting Patterns for a More Comprehensive Alignment, https://arxiv.org/abs/2410.13785
- Mozhi Zhang, Pengyu Wang, Chenkun Tan, Mianqiu Huang, Dong Zhang, Yaqian Zhou, Xipeng Qiu, 18 Oct 2024, MetaAlign: Align Large Language Models with Diverse Preferences during Inference Time, https://arxiv.org/abs/2410.14184
- OpenAI, Dec 2024, Deliberative alignment: reasoning enables safer language models. Introducing our new alignment strategy for o-series models, which are directly taught safety specifications and how to reason over them. https://openai.com/index/deliberative-alignment/
- Asif Razzaq, December 23, 2024, OpenAI Researchers Propose ‘Deliberative Alignment’: A Training Approach that Teaches LLMs to Explicitly Reason through Safety Specifications before Producing an Answer, https://www.marktechpost.com/2024/12/23/openai-researchers-propose-deliberative-alignment-a-training-approach-that-teaches-llms-to-explicitly-reason-through-safety-specifications-before-producing-an-answer/
- Andrea Matarazzo, Riccardo Torlone, 3 Jan 2025, A Survey on Large Language Models with some Insights on their Capabilities and Limitations, https://arxiv.org/abs/2501.04040 (Broad survey with many LLM topics covered from history to architectures to optimizations.)
- Tong Xiao, Jingbo Zhu, 16 Jan 2025, Foundations of Large Language Models, https://arxiv.org/abs/2501.09223 (Huge 230 page paper on many topics such as training, prompting, alignment, and long context.)
- Zongxi Li, Yang Li, Haoran Xie, S. Joe Qin, 3 Feb 2025, CondAmbigQA: A Benchmark and Dataset for Conditional Ambiguous Question Answering, https://arxiv.org/abs/2502.01523
- Y Gong, D Ran, X He, T Cong, A Wang, X Wang, Feb 2025, Safety Misalignment Against Large Language Models, Network and Distributed System Security (NDSS) Symposium 2025, 24-28 February 2025, San Diego, CA, USA, ISBN 979-8-9894372-8-3, https://dx.doi.org/10.14722/ndss.2025.241089 https://www.ndss-symposium.org/wp-content/uploads/2025-1089-paper.pdf
- Guiyao Tie, Zeli Zhao, Dingjie Song, Fuyang Wei, Rong Zhou, Yurou Dai, Wen Yin, Zhejian Yang, Jiangyue Yan, Yao Su, Zhenhan Dai, Yifeng Xie, Yihan Cao, Lichao Sun, Pan Zhou, Lifang He, Hechang Chen, Yu Zhang, Qingsong Wen, Tianming Liu, Neil Zhenqiang Gong, Jiliang Tang, Caiming Xiong, Heng Ji, Philip S. Yu, Jianfeng Gao, 8 Mar 2025, A Survey on Post-training of Large Language Models, https://arxiv.org/abs/2503.06072
- Ziqi Yin, Hao Wang, Kaito Horio, Daisuke Kawahara, Satoshi Sekine, 14 Oct 2024 (v2), Should We Respect LLMs? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance, https://arxiv.org/abs/2402.14531
- Michael Nuñez, July 15, 2025, OpenAI, Google DeepMind and Anthropic sound alarm: ‘We may be losing the ability to understand AI’, https://venturebeat.com/ai/openai-google-deepmind-and-anthropic-sound-alarm-we-may-be-losing-the-ability-to-understand-ai/ (Monitoring the text-based interim "thinking-out-loud" reasoning of models in CoT.)
- Tomek Korbak, Mikita Balesni, (and many more authors) July 2025, Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety, https://tomekkorbak.com/cot-monitorability-is-a-fragile-opportunity/cot_monitoring.pdf
- Yafu Li, Xuyang Hu, Xiaoye Qu, Linjie Li, Yu Cheng, 22 Jan 2025, Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback, https://arxiv.org/abs/2501.12895 https://github.com/yafuly/TPO
- Cameron R. Wolfe, Ph.D., Jun 30, 2025, Reward Models: Modeling human preferences for LLMs in the age of reasoning models, https://cameronrwolfe.substack.com/p/reward-models
- Zetian Sun, Dongfang Li, Baotian Hu, 14 Aug 2025, Diversity First, Quality Later: A Two-Stage Assumption for Language Model Alignment, https://arxiv.org/abs/2508.10530
- Xinyan Jiang, Lin Zhang, Jiayi Zhang, Qingsong Yang, Guimin Hu, Di Wang, Lijie Hu, 14 Aug 2025, MSRS: Adaptive Multi-Subspace Representation Steering for Attribute Alignment in Large Language Models, https://arxiv.org/abs/2508.10599
- Jinhwa Kim, Ian G. Harris, 9 Aug 2025, Context Misleads LLMs: The Role of Context Filtering in Maintaining Safe Alignment of LLMs, https://arxiv.org/abs/2508.10031
- Christopher Pinier, Sonia Acu\~na Vargas, Mariia Steeghs-Turchina, Dora Matzke, Claire E. Stevenson, Michael D. Nunez, 12 Aug 2025, Large Language Models Show Signs of Alignment with Human Neurocognition During Abstract Reasoning, https://arxiv.org/abs/2508.10057
- Shixiong Xu, Chenghao Zhang, Lubin Fan, Yuan Zhou, Bin Fan, Shiming Xiang, Gaofeng Meng, Jieping Ye, 14 Aug 2025, AddressVLM: Cross-view Alignment Tuning for Image Address Localization using Large Vision-Language Models, https://arxiv.org/abs/2508.10667
- Xia Chen, 13 Aug 2025, Dynamical Alignment: A Principle for Adaptive Neural Computation, https://arxiv.org/abs/2508.10064
- Yihao Xue, Baharan Mirzasoleiman, 22 Jul 2025, LoRA is All You Need for Safety Alignment of Reasoning LLMs, https://arxiv.org/abs/2507.17075
- Haoran Sun, Zekun Zhang, Shaoning Zeng, 23 Jul 2025, An Uncertainty-Driven Adaptive Self-Alignment Framework for Large Language Models, https://arxiv.org/abs/2507.17477
- Xiang Li, 21 Jul 2025, Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection, https://arxiv.org/abs/2507.16861
- Miguel Carrasco, C\'esar Gonz\'alez-Mart\'in, Jos\'e Aranda, Luis Oliveros, 23 Jul 2025, Vision Transformer attention alignment with human visual perception in aesthetic object evaluation, https://arxiv.org/abs/2507.17616
- Yifan Wang, Runjin Chen, Bolian Li, David Cho, Yihe Deng, Ruqi Zhang, Tianlong Chen, Zhangyang Wang, Ananth Grama, Junyuan Hong, 22 Jul 2025, More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment, https://arxiv.org/abs/2504.02193
- Tom\'as H\"uttebr\"aucker, Mario Edoardo Pandolfo, Simone Fiorellino, Emilio Calvanese Strinati, Paolo Di Lorenzo, 23 Jul 2025, RIS-aided Latent Space Alignment for Semantic Channel Equalization, https://arxiv.org/abs/2507.16450
- Shehzeen Hussain, Paarth Neekhara, Xuesong Yang, Edresson Casanova, Subhankar Ghosh, Mikyas T. Desta, Roy Fejgin, Rafael Valle, Jason Li, 22 Jul 2025, Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance, https://arxiv.org/abs/2502.05236
- Songming Zhang, Xue Zhang, Tong Zhang, Bojie Hu, Yufeng Chen, Jinan Xu, 23 Jul 2025, AlignDistil: Token-Level Language Model Alignment as Adaptive Policy Distillation, https://arxiv.org/abs/2503.02832
- ZhengXiao He, Jinghao Wen, Huayu Li, Siyuan Tian, Ao Li, 23 Jul 2025, NeuroHD-RA: Neural-distilled Hyperdimensional Model with Rhythm Alignment, https://arxiv.org/abs/2507.14184
- Amir Mohammad Izadi, Seyed Mohammad Hadi Hosseini, Soroush Vafaie Tabar, Ali Abdollahi, Armin Saghafian, and Mahdieh Soleymani Baghshah, 22 Jul 2025, Fine-Grained Alignment and Noise Refinement for Compositional Text-to-Image Generation, https://arxiv.org/abs/2503.06506
- Andy E. Williams, 18 Jul 2025, The Recursive Coherence Principle: A Formal Constraint on Scalable Intelligence, Alignment, and Reasoning Architecture, https://arxiv.org/abs/2507.15880
- Debangshu Banerjee, Kintan Saha, Aditya Gopalan, 21 Jul 2025, Towards Reliable, Uncertainty-Aware Alignment, https://arxiv.org/abs/2507.15906
- Mario Edoardo Pandolfo, Simone Fiorellino, Emilio Calvanese Strinati, Paolo Di Lorenzo, 22 Jul 2025, Latent Space Alignment for AI-Native MIMO Semantic Communications, https://arxiv.org/abs/2507.16680
- Han Jiang, Dongyao Zhu, Zhihua Wei, Xiaoyuan Yi, Ziang Xiao, Xing Xie, 22 Jul 2025, PICACO: Pluralistic In-Context Value Alignment of LLMs via Total Correlation Optimization, https://arxiv.org/abs/2507.16679
- Difei Gu, Yunhe Gao, Yang Zhou, Mu Zhou, Dimitris Metaxas, 22 Jul 2025, RadAlign: Advancing Radiology Report Generation with Vision-Language Concept Alignment, https://arxiv.org/abs/2501.07525
- Ziteng Yang, Jingzehua Xu, Yanshu Li, Zepeng Li, Yeqiang Wang, Xinghui Li, 22 Jul 2025, ViP$^2$-CLIP: Visual-Perception Prompting with Unified Alignment for Zero-Shot Anomaly Detection, https://arxiv.org/abs/2505.17692
- Tianze Wang, Dongnan Gui, Yifan Hu, Shuhang Lin, Linjun Zhang, 22 Jul 2025, MPO: An Efficient Post-Processing Framework for Mixing Diverse Preference Alignment, https://arxiv.org/abs/2502.18699
- Xiandong Zou, Wanyu Lin, Yuchen Li, Pan Zhou, 24 Jul 2025, HPS: Hard Preference Sampling for Human Preference Alignment, https://arxiv.org/abs/2502.14400
- Alberto Hern\'andez-Espinosa, Felipe S. Abrah\~ao, Olaf Witkowski, Hector Zenil, 24 Jul 2025, Neurodivergent Influenceability as a Contingent Solution to the AI Alignment Problem, https://arxiv.org/abs/2505.02581
- Yuhui Sun (University of Alberta), Xiyao Wang (University of Toronto), Zixi Li (Zhejiang University), Zhenlong Yuan (Institute of Computing Technology, Chinese Academy of Sciences), and Jinman Zhao (University of Toronto), 24 Jul 2025, Multi-Preference Lambda-weighted Listwise DPO for Small-Scale Model Alignment, https://arxiv.org/abs/2506.19780
- Bowen Jin, Jinsung Yoon, Zhen Qin, Ziqi Wang, Wei Xiong, Yu Meng, Jiawei Han, Sercan O. Arik, 23 Jul 2025, LLM Alignment as Retriever Optimization: An Information Retrieval Perspective, https://arxiv.org/abs/2502.03699
- Jie Xu, Na Zhao, Gang Niu, Masashi Sugiyama, Xiaofeng Zhu, 24 Jul 2025, Robust Multi-View Learning via Representation Fusion of Sample-Level Attention and Alignment of Simulated Perturbation, https://arxiv.org/abs/2503.04151
- Charvi Rastogi, Tian Huey Teh, Pushkar Mishra, Roma Patel, Ding Wang, Mark D\'iaz, Alicia Parrish, Aida Mostafazadeh Davani, Zoe Ashwood, Michela Paganini, Vinodkumar Prabhakaran, Verena Rieser, Lora Aroyo, 15 Jul 2025, Whose View of Safety? A Deep DIVE Dataset for Pluralistic Alignment of Text-to-Image Models, https://arxiv.org/abs/2507.13383
- Oussama Bouaggad, Natalia Grabar, 18 Jul 2025, Search-Optimized Quantization in Biomedical Ontology Alignment, https://arxiv.org/abs/2507.13742
- Shuliang Liu, Qi Zheng, Jesse Jiaxi Xu, Yibo Yan, He Geng, Aiwei Liu, Peijie Jiang, Jia Liu, Yik-Cheung Tam, and Xuming Hu, 18 Jul 2025, VLA-Mark: A cross modal watermark for large vision-language alignment model, https://arxiv.org/abs/2507.14067
- Yi Zhang, An Zhang, XiuYu Zhang, Leheng Sheng, Yuxin Chen, Zhenkai Liang, Xiang Wang, 20 Jul 2025, AlphaAlign: Incentivizing Safety Alignment with Extremely Simplified Reinforcement Learning, https://arxiv.org/abs/2507.14987
- Pengfei Du, 14 Jul 2025, PRM-Free Security Alignment of Large Models via Red Teaming and Adversarial Training, https://arxiv.org/abs/2507.14202
- Wenqian Ye, Guangtao Zheng, Aidong Zhang, 20 Jul 2025, Improving Group Robustness on Spurious Correlation via Evidential Alignment, https://arxiv.org/abs/2506.11347
- Anirudh Sundar, Sinead Williamson, Katherine Metcalf, Barry-John Theobald, Skyler Seto, Masha Fedzechkina, 21 Jul 2025, Steering into New Embedding Spaces: Analyzing Cross-Lingual Alignment Induced by Model Interventions in Multilingual Language Models, https://arxiv.org/abs/2502.15639
- Noel Teku, Fengwei Tian, Payel Bhattacharjee, Souradip Chakraborty, Amrit Singh Bedi, Ravi Tandon, 9 Aug 2025, PROPS: Progressively Private Self-alignment of Large Language Models, https://arxiv.org/abs/2508.06783
- Yuandong Tan, 10 Aug 2025, A Stable and Principled Loss Function for Direct Language Model Alignment, https://arxiv.org/abs/2508.07137
- Jia Zhang, Yao Liu, Chen-Xi Zhang, Yi Liu, Yi-Xuan Jin, Lan-Zhe Guo, Yu-Feng Li, 11 Aug 2025, Beyond Single: A Data Selection Principle for LLM Alignment via Fine-Grained Preference Signals, https://arxiv.org/abs/2508.07638
- Haowen Wang, Yun Yue, Zhiling Ye, Shuowen Zhang, Lei Fan, Jiaxin Liang, Jiadi Jiang, Cheng Wei, Jingyuan Deng, Xudong Han, Ji Li, Chunxiao Guo, Peng Wei, Jian Wang, Jinjie Gu, 11 Aug 2025, Learning to Align, Aligning to Learn: A Unified Approach for Self-Optimized Alignment, https://arxiv.org/abs/2508.07750
- Qiang He, Setareh Maghsudi, 11 Aug 2025, Pareto Multi-Objective Alignment for Language Models, https://arxiv.org/abs/2508.07768
- Nicole Lai-Tan and Xiao Gu and Marios G. Philiastides and Fani Deligianni, 11 Aug 2025, Cross-Subject and Cross-Montage EEG Transfer Learning via Individual Tangent Space Alignment and Spatial-Riemannian Feature Fusion, https://arxiv.org/abs/2508.08216
- Ben Y. Reis and William La Cava, 8 Aug 2025, Towards Integrated Alignment, https://arxiv.org/abs/2508.06592
- Xiaobo Zhang (1 and 2), Congqing He (2), Ying He (1 and 2), Jian Peng (1), Dajie Fu (1), Tien-Ping Tan (2) ((1) School of Information Engineering, Jiangxi Vocational College of Finance & Economics, Jiujiang, China, (2) School of Computer Sciences, Universiti Sains Malaysia, Penang, Malaysia), 9 Aug 2025, ESNERA: Empirical and semantic named entity alignment for named entity dataset merging, https://arxiv.org/abs/2508.06877
- Jianting Tang, Yubo Wang, Haoyu Cao, Linli Xu, 9 Aug 2025, BASIC: Boosting Visual Alignment with Intrinsic Refined Embeddings in Multimodal Large Language Models, https://arxiv.org/abs/2508.06895
- Yanru Sun, Emadeldeen Eldele, Zongxia Xie, Yucheng Wang, Wenzhe Niu, Qinghua Hu, Chee Keong Kwoh, Min Wu, 10 Aug 2025, Adapting LLMs to Time Series Forecasting via Temporal Heterogeneity Modeling and Semantic Alignment, https://arxiv.org/abs/2508.07195
- Gustavo Moreira, Leonardo Ferreira, Carolina Veiga, Maryam Hosseini, Fabio Miranda, 10 Aug 2025, Urbanite: A Dataflow-Based Framework for Human-AI Interactive Alignment in Urban Visual Analytics, https://arxiv.org/abs/2508.07390
- Wenze Xu and Chun Wang and Jiazhen Yu and Sheng Chen and Liang Gao and Weihong Deng, 11 Aug 2025, Optimal Transport Regularization for Speech Text Alignment in Spoken Language Models, https://arxiv.org/abs/2508.08131
- Kyle Moore, Jesse Roberts, Daryl Watson, 11 Aug 2025, Human-Alignment and Calibration of Inference-Time Uncertainty in Large Language Models, https://arxiv.org/abs/2508.08204
- Jie Xiao, Changyuan Fan, Qingnan Ren, Alfred Long, Yuchen Zhang, Rymon Yu, Eric Yang, Lynn Ai, Shaoduo Gan, 9 Aug 2025, Echo: Decoupling Inference and Training for Large-Scale RL Alignment on Heterogeneous Swarms, https://arxiv.org/abs/2508.05387
- Haoran Lu, Luyang Fang, Ruidong Zhang, Xinliang Li, Jiazhang Cai, Huimin Cheng, Lin Tang, Ziyu Liu, Zeliang Sun, Tao Wang, Yingchuan Zhang, Arif Hassan Zidan, Jinwen Xu, Jincheng Yu, Meizhi Yu, Hanqi Jiang, Xilin Gong, Weidi Luo, Bolun Sun, Yongkai Chen, Terry Ma, Shushan Wu, Yifan Zhou, Junhao Chen, Haotian Xiang, Jing Zhang, Afrar Jahin, Wei Ruan, Ke Deng, Yi Pan, Peilong Wang, Jiahui Li, Zhengliang Liu, Lu Zhang, Lin Zhao, Wei Liu, Dajiang Zhu, Xin Xing, Fei Dou, Wei Zhang, Chao Huang, Rongjie Liu, Mengrui Zhang, Yiwen Liu, Xiaoxiao Sun, Qin Lu, Zhen Xiang, Wenxuan Zhong, Tianming Liu, Ping Ma, 25 Jul 2025, Alignment and Safety in Large Language Models: Safety Mechanisms, Training Paradigms, and Emerging Challenges, https://arxiv.org/abs/2507.19672
- Sarat Chandra Bobbili, Ujwal Dinesha, Dheeraj Narasimha, Srinivas Shakkottai, 26 Jul 2025, PITA: Preference-Guided Inference-Time Alignment for LLM Post-Training, https://arxiv.org/abs/2507.20067
- Rachel S.Y. Teo, Laziz U. Abdullaev, Tan M. Nguyen, 27 Jul 2025, The Blessing and Curse of Dimensionality in Safety Alignment, https://arxiv.org/abs/2507.20333
- Tiantian Peng, Yuyang Liu, Shuo Yang, Qiuhe Hong, YongHong Tian, 26 Jul 2025, GNSP: Gradient Null Space Projection for Preserving Cross-Modal Alignment in VLMs Continual Learning, https://arxiv.org/abs/2507.19839
- Siyu Song, Wentao Liu, Ye Lu, Ruohua Zhang, Tao Liu, Jinze Lv, Xinyun Wang, Aimin Zhou, Fei Tan, Bo Jiang, Hao Hao, 27 Jul 2025, Cultivating Helpful, Personalized, and Creative AI Tutors: A Framework for Pedagogical Alignment using Reinforcement Learning, https://arxiv.org/abs/2507.20335
- Rongyao Cai, Ming Jin, Qingsong Wen, Kexin Zhang, 28 Jul 2025, From Entanglement to Alignment: Representation Space Decomposition for Unsupervised Time Series Domain Adaptation, https://arxiv.org/abs/2507.20968
- Andr\'e Steingr\"uber, Kevin Baum, 24 Jul 2025, Justifications for Democratizing AI Alignment and Their Prospects, https://arxiv.org/abs/2507.19548
- Shuhaib Mehri, Xiaocheng Yang, Takyoung Kim, Gokhan Tur, Shikib Mehri, Dilek Hakkani-T\"ur, 27 Jul 2025, Goal Alignment in LLM-Based User Simulators for Conversational AI, https://arxiv.org/abs/2507.20152
- Gabriel Downer, Sean Craven, Damian Ruck, Jake Thomas, 28 Jul 2025, Text2VLM: Adapting Text-Only Datasets to Evaluate Alignment Training in Visual Language Models, https://arxiv.org/abs/2507.20704
- Renhang Liu, Chia-Yu Hung, Navonil Majumder, Taylor Gautreaux, Amir Ali Bagherzadeh, Chuan Li, Dorien Herremans, Soujanya Poria, 28 Jul 2025, JAM: A Tiny Flow-based Song Generator with Fine-grained Controllability and Aesthetic Alignment, https://arxiv.org/abs/2507.20880
- Hei Shing Cheung and Boya Zhang, 26 Jul 2025, Efficient Vocal-Conditioned Music Generation via Soft Alignment Attention and Latent Diffusion, https://arxiv.org/abs/2507.19991
- Jiaming Ji, Kaile Wang, Tianyi Qiu, Boyuan Chen, Jiayi Zhou, Changye Li, Hantao Lou, Juntao Dai, Yunhuai Liu, Yaodong Yang, 27 Jul 2025, Language Models Resist Alignment: Evidence From Data Compression, https://arxiv.org/abs/2406.06144
- Madhava Gaikwad (1), Ashwini Ramchandra Doke (2) ((1) Microsoft, (2) Amrita University), 22 Jul 2025, NPO: Learning Alignment and Meta-Alignment through Structured Human Feedback, https://arxiv.org/abs/2507.21131
- Lenart Motnikar, Katharina Baum, Alexander Kagan, Sarah Spiekermann-Hoff, 26 Jun 2025, The Value of Gen-AI Conversations: A bottom-up Framework for AI Value Alignment, https://arxiv.org/abs/2507.21091
- Aran Nayebi, 29 Jul 2025, Intrinsic Barriers and Practical Pathways for Human-AI Alignment: An Agreement-Based Complexity Analysis, https://arxiv.org/abs/2502.05934
- Haipeng Liu, Yuxuan Liu, Ting Long, 31 Jul 2025, Personalized Education with Ranking Alignment Recommendation, https://arxiv.org/abs/2507.23664
- Wei Li and Xun Gong and Jiao Li and Xiaobin Sun, 31 Jul 2025, AGA: An adaptive group alignment framework for structured medical cross-modal representation learning, https://arxiv.org/abs/2507.23402
- Ananth Balashankar and Ziteng Sun and Jonathan Berant and Jacob Eisenstein and Michael Collins and Adrian Hutter and Jong Lee and Chirag Nagpal and Flavien Prost and Aradhana Sinha and Ananda Theertha Suresh and Ahmad Beirami, 31 Jul 2025, InfAlign: Inference-aware language model alignment, https://arxiv.org/abs/2412.19792
- Qun Ma, Xiao Xue, Ming Zhang, Yifan Shen, Zihan Zhao, 30 Jul 2025, An Explainable Emotion Alignment Framework for LLM-Empowered Agent in Metaverse Service Ecosystem, https://arxiv.org/abs/2507.22326
- Yixuan Nan, Xixun Lin, Yanmin Shang, Zhuofan Li, Can Zhao and Yanan Cao, 30 Jul 2025, RANA: Robust Active Learning for Noisy Network Alignment, https://arxiv.org/abs/2507.22434
- Shaoan Xie, Lingjing Kong, Yujia Zheng, Yu Yao, Zeyu Tang, Eric P. Xing, Guangyi Chen, Kun Zhang, 29 Jul 2025, SmartCLIP: Modular Vision-language Alignment with Identification Guarantees, https://arxiv.org/abs/2507.22264
- Junjie Cao, 30 Jul 2025, Adaptive Duration Model for Text Speech Alignment, https://arxiv.org/abs/2507.22612
- Yeonjun In, Wonjoong Kim, Sangwu Park, Chanyoung Park, 1 Aug 2025, R1-ACT: Efficient Reasoning Model Safety Alignment by Activating Safety Knowledge, https://arxiv.org/abs/2508.00324
- Jens U. Kreber, Joerg Stueckler, 1 Aug 2025, Guiding Diffusion-Based Articulated Object Generation by Partial Point Cloud Alignment and Physical Plausibility Constraints, https://arxiv.org/abs/2508.00558
- Amitava Das, Vinija Jain, Aman Chadha, 4 Aug 2025, TRACEALIGN -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs, https://arxiv.org/abs/2508.02063
- Istabrak Abbes, Gopeshh Subbaraj, Matthew Riemer, Nizar Islah, Benjamin Therien, Tsuguchika Tabaru, Hiroaki Kingetsu, Sarath Chandar, Irina Rish, 3 Aug 2025, Revisiting Replay and Gradient Alignment for Continual Pre-Training of Large Language Models, https://arxiv.org/abs/2508.01908
- Ziyu Zhou, Yiming Huang, Yanyun Wang, Yuankai Wu, James Kwok, Yuxuan Liang, 4 Aug 2025, Revitalizing Canonical Pre-Alignment for Irregular Multivariate Time Series Forecasting, https://arxiv.org/abs/2508.01971
- Amitava Das, Abhilekh Borah, Vinija Jain, Aman Chadha, 4 Aug 2025, AlignGuard-LoRA: Alignment-Preserving Fine-Tuning via Fisher-Guided Decomposition and Riemannian-Geodesic Collision Regularization, https://arxiv.org/abs/2508.02079
- Yu Lei, Jinbin Bai, Qingyu Shi, Aosong Feng and Kaidong Yu, 2 Aug 2025, Personalized Safety Alignment for Text-to-Image Diffusion Models, https://arxiv.org/abs/2508.01151
- Tae Soo Kim, Yoonjoo Lee, Yoonah Park, Jiho Kim, Young-Ho Kim, Juho Kim, 3 Aug 2025, CUPID: Evaluating Personalized and Contextualized Alignment of LLMs from Interactions, https://arxiv.org/abs/2508.01674
- Tom S. Juzek, Zina B. Ward, 3 Aug 2025, Word Overuse and Alignment in Large Language Models: The Influence of Learning from Human Feedback, https://arxiv.org/abs/2508.01930
- Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang, Yaochu Jin, 4 Aug 2025, ParetoHqD: Fast Offline Multiobjective Alignment of Large Language Models using Pareto High-quality Data, https://arxiv.org/abs/2504.16628
- Ivan Zakazov, Mikolaj Boronski, Lorenzo Drudi, Robert West, 4 Aug 2025, Assessing Social Alignment: Do Personality-Prompted Large Language Models Behave Like Humans?, https://arxiv.org/abs/2412.16772
- Taibiao Zhao, Xiaobing Chen, and Mingxuan Sun, 1 Aug 2025, Enhancing Time Series Forecasting via Multi-Level Text Alignment with LLMs, https://arxiv.org/abs/2504.07360
- Bolian Li, Yifan Wang, Anamika Lochab, Ananth Grama, Ruqi Zhang, 3 Aug 2025, Cascade Reward Sampling for Efficient Decoding-Time Alignment, https://arxiv.org/abs/2406.16306
- Amir Aghdam, Vincent Tao Hu, Bj\"orn Ommer, 4 Aug 2025, ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment, https://arxiv.org/abs/2506.22967
- Dahun Kim, Anelia Angelova, 3 Aug 2025, Context-Adaptive Multi-Prompt LLM Embedding for Vision-Language Alignment, https://arxiv.org/abs/2508.02762
- Hongjun Liu, Chao Yao, Yalan Zhang, Xiaokun wang and Xiaojuan Ban, 5 Aug 2025, Spatial Imputation Drives Cross-Domain Alignment for EEG Classification, https://arxiv.org/abs/2508.03437
- Anamika Lochab, Ruqi Zhang, 5 Aug 2025, Energy-Based Reward Models for Robust Language Model Alignment, https://arxiv.org/abs/2504.13134
- Wentao Wu, Linqing Chen, Hanmeng Zhong, Weilei Wang, 6 Aug 2025, Large Language Model's Multi-Capability Alignment in Biomedical Domain, https://arxiv.org/abs/2508.04278
- Abdul Monaf Chowdhury, Rabeya Akter, Safaeid Hossain Arib, 6 Aug 2025, T3Time: Tri-Modal Time Series Forecasting via Adaptive Multi-Head Alignment and Residual Fusion, https://arxiv.org/abs/2508.04251
- Hongxu Chen, Zhen Wang, Taoran Mei, Lin Li, Bowei Zhu, Runshi Li, Long Chen, 6 Aug 2025, Zero-Residual Concept Erasure via Progressive Alignment in Text-to-Image Model, https://arxiv.org/abs/2508.04472
- Feifan Song, Bofei Gao, Yifan Song, Yi Liu, Weimin Xiong, Yuyang Song, Tianyu Liu, Guoyin Wang, Houfeng Wang, 6 Aug 2025, P-Aligner: Enabling Pre-Alignment of Language Models via Principled Instruction Synthesis, https://arxiv.org/abs/2508.04626
- Wenji Fang, Jing Wang, Yao Lu, Shang Liu, Zhiyao Xie, 6 Aug 2025, GenEDA: Towards Generative Netlist Functional Reasoning via Cross-Modal Circuit Encoder-Decoder Alignment, https://arxiv.org/abs/2504.09485
- You Rim Choi, Subeom Park, Seojun Heo, Eunchung Noh, Hyung-Sin Kim, 6 Aug 2025, Let the Void Be Void: Robust Open-Set Semi-Supervised Learning via Selective Non-Alignment, https://arxiv.org/abs/2504.12569
- Krzysztof Janowicz and Zilong Liu and Gengchen Mai and Zhangyu Wang and Ivan Majic and Alexandra Fortacz and Grant McKenzie and Song Gao, 7 Aug 2025, Whose Truth? Pluralistic Geo-Alignment for (Agentic) AI, https://arxiv.org/abs/2508.05432
- Shruti Saxena, Arijit Khan and Joydeep Chandra, 5 Aug 2025, NAEx: A Plug-and-Play Framework for Explaining Network Alignment, https://arxiv.org/abs/2508.04731
- Mason Nakamura, Saaduddin Mahmud, Kyle H. Wray, Hamed Zamani, Shlomo Zilberstein, 7 Aug 2025, Aligning LLMs on a Budget: Inference-Time Alignment with Heuristic Reward Models, https://arxiv.org/abs/2508.05165
- Zhongheng Yang, Aijia Sun, Yushang Zhao, Yinuo Yang, Dannier Li, Chengrui Zhou, 7 Aug 2025, RLHF Fine-Tuning of LLMs for Alignment with Implicit User Feedback in Conversational Recommenders, https://arxiv.org/abs/2508.05289
- Qinghua Yao, Xiangrui Xu, Zhize Li, 7 Aug 2025, X-VFL: A New Vertical Federated Learning Framework with Cross Completion and Decision Subspace Alignment, https://arxiv.org/abs/2508.05568
- Sam Kouteili, Hiren Madhu, George Typaldos, Mark Santolucito, 7 Aug 2025, Embedding Alignment in Code Generation for Audio, https://arxiv.org/abs/2508.05473
- Yubin Zhang, Yanhua Huang, Haiming Xu, Mingliang Qi, Chang Wang, Jiarui Jin, Xiangyuan Ren, Xiaodan Wang, Ruiwen Xu, 7 Aug 2025, A Metric for MLLM Alignment in Large-scale Recommendation, https://arxiv.org/abs/2508.04963
- Zhiqing Xiao, Haobo Wang, Xu Lu, Wentao Ye, Gang Chen, Junbo Zhao, 7 Aug 2025, SPA++: Generalized Graph Spectral Alignment for Versatile Domain Adaptation, https://arxiv.org/abs/2508.05182
- Wei Zeng, Hengshu Zhu, Chuan Qin, Han Wu, Yihang Cheng, Sirui Zhang, Xiaowei Jin, Yinuo Shen, Zhenxing Wang, Feimin Zhong, Hui Xiong, 7 Aug 2025, Multi-level Value Alignment in Agentic AI Systems: Survey and Perspectives, https://arxiv.org/abs/2506.09656
- Yifei Xu, Tusher Chakraborty, Emre K{\i}c{\i}man, Bibek Aryal, Eduardo Rodrigues, Srinagesh Sharma, Roberto Estevao, Maria Angels de Luis Balaguer, Jessica Wolk, Rafael Padilha, Leonardo Nunes, Shobana Balakrishnan, Songwu Lu, Ranveer Chandra, 6 Aug 2025, RLTHF: Targeted Human Feedback for LLM Alignment, https://arxiv.org/abs/2502.13417
- Shengzhu Yang, Jiawei Du, Shuai Lu, Weihang Zhang, Ningli Wang, Huiqi Li, 8 Aug 2025, CLIPin: A Non-contrastive Plug-in to CLIP for Multimodal Semantic Alignment, https://arxiv.org/abs/2508.06434
- Keiyu Nosaka, Yuichi Takano, Akiko Yoshise, 8 Aug 2025, Data Collaboration Analysis with Orthonormal Basis Selection and Alignment, https://arxiv.org/abs/2403.02780
- Parker Whitfill, Stewy Slocum, 11 Aug 2025, Beyond Ordinal Preferences: Why Alignment Needs Cardinal Human Feedback, https://arxiv.org/abs/2508.08486
- Sviatoslav Lushnei, Dmytro Shumskyi, Severyn Shykula, Ernesto Jimenez-Ruiz, Artur d'Avila Garcez, 11 Aug 2025, Large Language Models as Oracles for Ontology Alignment, https://arxiv.org/abs/2508.08500
- Saketh Reddy Vemula, Dipti Mishra Sharma and Parameswari Krishnamurthy, 11 Aug 2025, Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment, https://arxiv.org/abs/2508.08424
- Jadie Adams, Brian Hu, Emily Veenhuis, David Joy, Bharadwaj Ravichandran, Aaron Bray, Anthony Hoogs, Arslan Basharat, 11 Aug 2025, Steerable Pluralism: Pluralistic Alignment via Few-Shot Comparative Regression, https://arxiv.org/abs/2508.08509
- Birong Pan, Yongqi Li, Weiyu Zhang, Wenpeng Lu, Mayi Xu, Shen Zhou, Yuanyuan Zhu, Ming Zhong, Tieyun Qian, 12 Aug 2025, A Survey on Training-free Alignment of Large Language Models, https://arxiv.org/abs/2508.09016
- Sejin Kim, Sundong Kim, 12 Aug 2025, System~2 Reasoning for Human--AI Alignment: Generality and Adaptivity via ARC-AGI, https://arxiv.org/abs/2410.07866
- Liang Chen, Xueting Han, Li Shen, Jing Bai, Kam-Fai Wong, 12 Aug 2025, Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning, https://arxiv.org/abs/2506.03850
- Yuxin Chen and Chen Tang and Jianglan Wei and Chenran Li and Ran Tian and Xiang Zhang and Wei Zhan and Peter Stone and Masayoshi Tomizuka, 12 Aug 2025, MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention, https://arxiv.org/abs/2406.16258
- Yang Fan, 12 Aug 2025, AdEval: Alignment-based Dynamic Evaluation to Mitigate Data Contamination in Large Language Models, https://arxiv.org/abs/2501.13983
- Shravan Nayak, Mehar Bhatia, Xiaofeng Zhang, Verena Rieser, Lisa Anne Hendricks, Sjoerd van Steenkiste, Yash Goyal, Karolina Sta\'nczak, Aishwarya Agrawal, 12 Aug 2025, CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics, https://arxiv.org/abs/2506.08835
- Yang Zhang, Cunxiang Wang, Lindong Wu, Wenbo Yu, Yidong Wang, Guangsheng Bao, Jie Tang, 13 Aug 2025, UDA: Unsupervised Debiasing Alignment for Pair-wise LLM-as-a-Judge, https://arxiv.org/abs/2508.09724
- Mansi, Anastasios Lepipas, Dominika Woszczyk, Yiying Guan, Soteris Demetriou, 12 Aug 2025, Understanding Dementia Speech Alignment with Diffusion-Based Image Generation, https://arxiv.org/abs/2508.09385
- Birong Pan, Mayi Xu, Qiankun Pi, Jianhao Chen, Yuanyuan Zhu, Ming Zhong, Tieyun Qian, 13 Aug 2025, NeuronTune: Fine-Grained Neuron Modulation for Balanced Safety-Utility Alignment in LLMs, https://arxiv.org/abs/2508.09473
- Peiran Peng, Tingfa Xu, Liqiang Song, Mengqi Zhu, Yuqiang Fang, Jianan Li, 13 Aug 2025, COXNet: Cross-Layer Fusion with Adaptive Alignment and Scale Integration for RGBT Tiny Object Detection, https://arxiv.org/abs/2508.09533
- Muneeza Azmat, Momin Abbas, Maysa Malfiza Garcia de Macedo, Marcelo Carpinette Grave, Luan Soares de Souza, Tiago Machado, Rogerio A de Paula, Raya Horesh, Yixin Chen, Heloisa Caroline de Souza Pereira Candello, Rebecka Nordenlow, Aminat Adebiyi, 13 Aug 2025, A Comprehensive Evaluation framework of Alignment Techniques for LLMs, https://arxiv.org/abs/2508.09937
- Zichao Hu, Junyi Jessy Li, Arjun Guha, Joydeep Biswas, 12 Aug 2025, Robo-Instruct: Simulator-Augmented Instruction Alignment For Finetuning Code LLMs, https://arxiv.org/abs/2405.20179
- Numair Nadeem, Saeed Anwar, Muhammad Hamza Asad, Abdul Bais, 13 Aug 2025, HVL: Semi-Supervised Segmentation leveraging Hierarchical Vision-Language Synergy with Dynamic Text-Spatial Query Alignment, https://arxiv.org/abs/2506.13925
- Durgesh Mishra, Rishabh Uikey, 15 Aug 2025, Unified Knowledge Distillation Framework: Fine-Grained Alignment and Geometric Relationship Preservation for Deep Face Recognition, https://arxiv.org/abs/2508.11376
- Alessio Galatolo, Luca Alberto Rappuoli, Katie Winkle, Meriem Beloucif, 18 Aug 2025, Beyond Ethical Alignment: Evaluating LLMs as Artificial Moral Assistants, https://arxiv.org/abs/2508.12754
- Manning Zhu, Songtao Guo, Pengzhan Zhou, Yansong Ning, Chang Han, Dewen Qiao, 18 Aug 2025, FedSODA: Federated Fine-tuning of LLMs via Similarity Group Pruning and Orchestrated Distillation Alignment, https://arxiv.org/abs/2508.12727
- Zhixin Xie, Xurui Song, Jun Luo, 17 Aug 2025, Where to Start Alignment? Diffusion Large Language Model May Demand a Distinct Position, https://arxiv.org/abs/2508.12398
- Xuhui Zhan and Tyler Derr, 17 Aug 2025, Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping, https://arxiv.org/abs/2508.12466
- Yizhuo Zhang, Heng Wang, Shangbin Feng, Zhaoxuan Tan, Xinyun Liu, Yulia Tsvetkov, 17 Aug 2025, Generalizable LLM Learning of Graph Synthetic Data with Post-training Alignment, https://arxiv.org/abs/2506.00845
- Mohammad Jalali, Bahar Dibaei Nia, Farzan Farnia, 16 Aug 2025, Towards an Explainable Comparison and Alignment of Feature Embeddings, https://arxiv.org/abs/2506.06231
- Guangfu Hao, Haojie Wen, Liangxuan Guo, Yang Chen, Yanchao Bi, Shan Yu, 18 Aug 2025, Flexible Tool Selection through Low-dimensional Attribute Alignment of Vision and Language, https://arxiv.org/abs/2505.22146
- Yang Zhang, Yu Yu, Bo Tang, Yu Zhu, Chuxiong Sun, Wenqiang Wei, Jie Hu, Zipeng Xie, Zhiyu Li, Feiyu Xiong, Edward Chung, 16 Aug 2025, Token-level Accept or Reject: A Micro Alignment Approach for Large Language Models, https://arxiv.org/abs/2505.19743
- Jeremy Carleton, Debajoy Mukherjee, Srinivas Shakkottai, Dileep Kalathil, 19 Aug 2025, MAVIS: Multi-Objective Alignment via Value-Guided Inference-Time Search, https://arxiv.org/abs/2508.13415
- Zeeshan Ahmed, Frank Seide, Niko Moritz, Ju Lin, Ruiming Xie, Simone Merello, Zhe Liu and Christian Fuegen, 18 Aug 2025, Overcoming Latency Bottlenecks in On-Device Speech Translation: A Cascaded Approach with Alignment-Based Streaming MT, https://arxiv.org/abs/2508.13358
- Jinhui Pang, Changqing Lin, Hao Lin, Zhihui Zhang, Long Chen, Weiping Ding, Yu Liu, Xiaoshuai Hao, 19 Aug 2025, MEGA: Second-Order Gradient Alignment for Catastrophic Forgetting Mitigation in GFSCIL, https://arxiv.org/abs/2504.13691
- Samir Abdaljalil, Erchin Serpedin, Khalid Qaraqe, Hasan Kurban, 20 Aug 2025, Evaluating Multilingual and Code-Switched Alignment in LLMs via Synthetic Natural Language Inference, https://arxiv.org/abs/2508.14735
- Abhigya Verma, Sriram Puttagunta, Seganrasan Subramanian, Sravan Ramachandran, 21 Aug 2025, GRAFT: GRaPH and Table Reasoning for Textual Alignment -- A Benchmark for Structured Instruction Following and Visual Reasoning, https://arxiv.org/abs/2508.15690
- Chenlin Liu, Minghui Fang, Patrick Zhang, Wei Zhou, Jie Gao, Jiqing Han, 21 Aug 2025, Mitigating Hallucinations in LM-Based TTS Models via Distribution Alignment Using GFlowNets, https://arxiv.org/abs/2508.15442
- Youjia Zhang, Youngeun Kim, Young-Geun Choi, Hongyeob Kim, Huiling Liu, Sungeun Hong, 21 Aug 2025, Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment, https://arxiv.org/abs/2508.15568
- J. Koorndijk, 21 Aug 2025, Empirical Evidence for Alignment Faking in a Small LLM and Prompt-Based Mitigation Techniques, https://arxiv.org/abs/2506.21584
- Qilong Xing, Zikai Song, Youjia Zhang, Na Feng, Junqing Yu, Wei Yang, 21 Aug 2025, MCA-RG: Enhancing LLMs with Medical Concept Alignment for Radiology Report Generation, https://arxiv.org/abs/2507.06992
- Shuyuan Tu, Zhen Xing, Xintong Han, Zhi-Qi Cheng, Qi Dai, Chong Luo, Zuxuan Wu, Yu-Gang Jiang, 20 Jul 2025, StableAnimator++: Overcoming Pose Misalignment and Face Distortion for Human Image Animation, https://arxiv.org/abs/2507.15064
- Vince Trencsenyi and Agnieszka Mensfelt and Kostas Stathis, 25 Jul 2025, Hypergames: Modeling Misaligned Perceptions and Nested Beliefs for Multi-agent Systems, https://arxiv.org/abs/2507.19593
- Bryce Anderson, Riley Galpin, Tom S. Juzek, 1 Aug 2025, Model Misalignment and Language Change: Traces of AI-Associated Language in Unscripted Spoken English, https://arxiv.org/abs/2508.00238
- Fan Bu, Zheng Wang, Siyi Wang and Ziyao Liu, 1 Aug 2025, An Investigation into Value Misalignment in LLM-Generated Texts for Cultural Heritage, https://arxiv.org/abs/2501.02039
- Siddhant Panpatil, Hiskias Dingeto, Haon Park, 6 Aug 2025, Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models, https://arxiv.org/abs/2508.04196
- David Kacz\'er, Magnus J{\o}rgenv{\aa}g, Clemens Vetter, Lucie Flek, Florian Mai, 8 Aug 2025, In-Training Defenses against Emergent Misalignment in Language Models, https://arxiv.org/abs/2508.06249
- Yichao Cai, Yuhang Liu, Erdun Gao, Tianjiao Jiang, Zhen Zhang, Anton van den Hengel, Javen Qinfeng Shi, 7 Aug 2025, On the Value of Cross-Modal Misalignment in Multimodal Representation Learning, https://arxiv.org/abs/2504.10143
- Dongyoon Hahm, Taywon Min, Woogyeol Jin, Kimin Lee, 19 Aug 2025, Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation, https://arxiv.org/abs/2508.14031
- Igor Halperin, 13 Aug 2025, Prompt-Response Semantic Divergence Metrics for Faithfulness Hallucination and Misalignment Detection in Large Language Models, https://arxiv.org/abs/2508.10192
- Zhi Wen Soi, Chenrui Fan, Aditya Shankar, Abele M\u{a}lan, Lydia Y. Chen, 14 Aug 2025, Federated Time Series Generation on Feature and Temporally Misaligned Data, https://arxiv.org/abs/2410.21072
- Yue Pei, Hongming Zhang, Chao Gao, Martin M\"uller, Mengxiao Zhu, Hao Sheng, Haogang Zhu, Liang Lin, 22 Aug 2025, Double Check My Desired Return: Transformer with Target Alignment for Offline Reinforcement Learning, https://arxiv.org/abs/2508.16420
- Junhao Yin, Haolin Wang, Peng Bao, Ju Xu, Yongliang Wang, 15 Aug 2025, From Clicks to Preference: A Multi-stage Alignment Framework for Generative Query Suggestion in Conversational System, https://arxiv.org/abs/2508.15811
- Pi-Wei Chen, Jerry Chun-Wei Lin, Wei-Han Chen, Jia Ji, Zih-Ching Chen, Feng-Hao Yeh, Chao-Chun Chen, 22 Aug 2025, Beyond Human-prompting: Adaptive Prompt Tuning with Semantic Alignment for Anomaly Detection, https://arxiv.org/abs/2508.16157
- Xiaoxiong Zhang, Xin Zhou, Zhiwei Zeng, Yongjie Wang, Dusit Niyato, Zhiqi Shen, 22 Aug 2025, EGRA:Toward Enhanced Behavior Graphs and Representation Alignment for Multimodal Recommendation, https://arxiv.org/abs/2508.16170
- Zirui Li and Stephan Husung and Haoze Wang, 22 Aug 2025, LLM-Assisted Semantic Alignment and Integration in Collaborative Model-Based Systems Engineering Using SysML v2, https://arxiv.org/abs/2508.16181
- Buhua Liu, Shitong Shao, Bao Li, Lichen Bai, Zhiqiang Xu, Haoyi Xiong, James Kwok, Sumi Helal, Zeke Xie, 7 Aug 2025, Alignment of Diffusion Models: Fundamentals, Challenges, and Future, https://arxiv.org/abs/2409.07253
- Somnath Banerjee, Sayan Layek, Pratyush Chatterjee, Animesh Mukherjee, Rima Hazra, 22 Aug 2025, Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment, https://arxiv.org/abs/2502.11244
- Zeguan Xiao, Yun Chen, Guanhua Chen, Ke Tang, 22 Aug 2025, Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms, https://arxiv.org/abs/2506.09457
- Mia Taylor and James Chua and Jan Betley and Johannes Treutlein and Owain Evans, 24 Aug 2025, School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs, https://arxiv.org/abs/2508.17511
- Junan Zhang, Xueyao Zhang, Jing Yang, Yuancheng Wang, Fan Fan, Zhizheng Wu, 24 Aug 2025, Multi-Metric Preference Alignment for Generative Speech Restoration, https://arxiv.org/abs/2508.17229
- Yang Li, Songlin Yang, Xiaoxuan Han, Wei Wang, Jing Dong, Yueming Lyu, Ziyu Xue, 25 Aug 2025, Instant Preference Alignment for Text-to-Image Diffusion Models, https://arxiv.org/abs/2508.17718
- Bin Tan, Wangyao Ge, Yidi Wang, Xin Liu, Jeff Burtoft, Hao Fan, Hui Wang, 25 Aug 2025, PCR-CA: Parallel Codebook Representations with Contrastive Alignment for Multiple-Category App Recommendation, https://arxiv.org/abs/2508.18166
- Yaoyao Qian, Jindan Huang, Yuanli Wang, Simon Yu, Kyrie Zhixuan Zhou, Jiayuan Mao, Mingfu Liang, Hanhan Zhou, 23 Aug 2025, WHEN TO ACT, WHEN TO WAIT: Modeling the Intent-Action Alignment Problem in Dialogue, https://arxiv.org/abs/2506.01881
- Paul Darm, Annalisa Riccardi, 25 Aug 2025, Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models, https://arxiv.org/abs/2502.05945
Trustworthy AI
Trustworthy AI is the practice of ensuring that LLM-based systems are safe and predictable. This involves ensuring not only the safety of the LLM's outputs, such as avoiding bias and toxicity, but also ensuring that the AI infrastructure is resilient and the overall system is reliable. The idea of "Trustworthy AI" has been championed by NVIDIA.
Articles and papers on trustworthy AI:
- Leon Derczynski, Christopher Parisien, Nikki Pope, Michael Boone, Nov 2024, NVIDIA Approaches to AI Trust and Safety: Innovation and Tools, https://www.nvidia.com/en-us/on-demand/session/aisummitdc24-sdc1088/?playlistId=playList-c6a9450c-c790-462d-a058-0bacacd5d370
- Mayank Vatsa, Anubhooti Jain, Richa Singh, 7 Dec 2023, Adventures of Trustworthy Vision-Language Models: A Survey, https://arxiv.org/abs/2312.04231
- Nikki Pope, March 1, 2024, What Is Trustworthy AI? Trustworthy AI is an approach to AI development that prioritizes safety and transparency for the people who interact with it. https://blogs.nvidia.com/blog/what-is-trustworthy-ai/
- NVIDIA, Dec 2024 (accessed), Trustworthy AI, https://www.nvidia.com/en-us/ai-data-science/trustworthy-ai/
- Phoebe Lee and Kristina Joos, Jan 25, 2024, Advancing Production AI with NVIDIA AI Enterprise, https://developer.nvidia.com/blog/advancing-production-ai-with-nvidia-ai-enterprise/ ("... advances in NVIDIA AI software deliver up to 54% performance gains without a hardware upgrade...")
- Yedi Zhang, Yufan Cai, Xinyue Zuo, Xiaokun Luan, Kailong Wang, Zhe Hou, Yifan Zhang, Zhiyuan Wei, Meng Sun, Jun Sun, Jing Sun, Jin Song Dong, 9 Dec 2024, The Fusion of Large Language Models and Formal Methods for Trustworthy AI Agents: A Roadmap, https://arxiv.org/abs/2412.06512
- Athanasios Davvetas, Xenia Ziouvelou, Ypatia Dami, Alexis Kaponis, Konstantina Giouvanopoulou, Michael Papademas, 23 Jul 2025, TAI Scan Tool: A RAG-Based Tool With Minimalistic Input for Trustworthy AI Self-Assessment, https://arxiv.org/abs/2507.17514
- Ilias Chatzistefanidis, Navid Nikaein, 23 Jul 2025, Symbiotic Agents: A Novel Paradigm for Trustworthy AGI-driven Networks, https://arxiv.org/abs/2507.17695
- H M Mohaimanul Islam, Huynh Q. N. Vo, Aditya Rane, 22 Jul 2025, Towards Trustworthy AI: Secure Deepfake Detection using CNNs and Zero-Knowledge Proofs, https://arxiv.org/abs/2507.17010
- Tushar Talukder Showrav, Soyabul Islam Lincoln, Md. Kamrul Hasan, 23 Jul 2025, EXGnet: a single-lead explainable-AI guided multiresolution network with train-only quantitative features for trustworthy ECG arrhythmia classification, https://arxiv.org/abs/2506.12404
- Yaomin Jiang, Levin Brinkmann, Anne-Marie Nussberger, Ivan Soraperra, Jean-Fran\c{c}ois Bonnefon, Iyad Rahwan, 17 Jul 2025, Humans learn to prefer trustworthy AI over human partners, https://arxiv.org/abs/2507.13524
- Nuria Rodr\'iguez-Barroso and Mario Garc\'ia-M\'arquez and M. Victoria Luz\'on and Francisco Herrera, 21 Jul 2025, Challenges of Trustworthy Federated Learning: What's Done, Current Trends and Remaining Work, https://arxiv.org/abs/2507.15796
- Mustafa Cavus, Jan N. van Rijn, Przemys{\l}aw Biecek, 19 Jul 2025, Beyond the Single-Best Model: Rashomon Partial Dependence Profile for Trustworthy Explanations in AutoML, https://arxiv.org/abs/2507.14744
- Amina Dzafic, Merve Kavut, Ulya Bayram, 19 Jul 2025, Rethinking Suicidal Ideation Detection: A Trustworthy Annotation Framework and Cross-Lingual Model Evaluation, https://arxiv.org/abs/2507.14693
- Yi Zhang, Zhen Chen, Chih-Hong Cheng, Wenjie Ruan, Xiaowei Huang, Dezong Zhao, David Flynn, Siddartha Khastgir, Xingyu Zhao, 20 Jul 2025, Trustworthy Text-to-Image Diffusion Models: A Timely and Focused Survey, https://arxiv.org/abs/2409.18214
- Anthony Bellotti and Xindi Zhao, 9 Aug 2025, Conformal Prediction and Trustworthy AI, https://arxiv.org/abs/2508.06885
- Stephan Rabanser, 11 Aug 2025, Uncertainty-Driven Reliability: Selective Prediction and Trustworthy Deployment in Modern Machine Learning, https://arxiv.org/abs/2508.07556
- Anindya Bijoy Das, Shahnewaz Karim Sakib and Shibbir Ahmed, 9 Aug 2025, Trustworthy Medical Imaging with Large Language Models: A Study of Hallucinations Across Modalities, https://arxiv.org/abs/2508.07031
- Jesco Talies, Eric Breitbarth, David Melching, 28 Jul 2025, Towards trustworthy AI in materials mechanics through domain-guided attention, https://arxiv.org/abs/2507.20658
- Marius Baden, Ahmed Abouelazm, Christian Hubschneider, Yin Wu, Daniel Slieter, and J. Marius Z\"ollner, 27 Jul 2025, TPK: Trustworthy Trajectory Prediction Integrating Prior Knowledge For Interpretability and Kinematic Feasibility, https://arxiv.org/abs/2505.06743
- Rob Procter, Mark Rouncefield, 25 Jul 2025, Trustworthy AI: UK Air Traffic Control Revisited, https://arxiv.org/abs/2507.21169
- Rui Jiao, Yue Zhang, Jinku Li, 25 Jul 2025, Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes, https://arxiv.org/abs/2507.22940
- Xinwei Wu, Haojie Li, Hongyu Liu, Xinyu Ji, Ruohan Li, Yule Chen, Yigeng Zhang, 30 Jul 2025, Uncovering the Fragility of Trustworthy LLMs through Chinese Textual Ambiguity, https://arxiv.org/abs/2507.23121
- Xiaojin Zhang, Wei Chen, 30 Jul 2025, Bridging Privacy and Robustness for Trustworthy Machine Learning, https://arxiv.org/abs/2403.16591
- Sihang Zeng, Lucas Jing Liu, Jun Wen, Meliha Yetisgen, Ruth Etzioni, Gang Luo, 1 Aug 2025, TrajSurv: Learning Continuous Latent Trajectories from Electronic Health Records for Trustworthy Survival Prediction, https://arxiv.org/abs/2508.00657
- Jiazhen Pan, Bailiang Jian, Paul Hager, Yundi Zhang, Che Liu, Friedrike Jungmann, Hongwei Bran Li, Chenyu You, Junde Wu, Jiayuan Zhu, Fenglin Liu, Yuyuan Liu, Niklas Bubeck, Christian Wachinger, Chen (Cherise) Chen, Zhenyu Gong, Cheng Ouyang, Georgios Kaissis, Benedikt Wiestler, Daniel Rueckert, 30 Jul 2025, Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models, https://arxiv.org/abs/2508.00923
- James Carzon and Luca Masserano and Joshua D. Ingram and Alex Shen and Antonio Carlos Herling Ribeiro Junior and Tommaso Dorigo and Michele Doro and Joshua S. Speagle and Rafael Izbicki and Ann B. Lee, 4 Aug 2025, Trustworthy scientific inference for inverse problems with generative models, https://arxiv.org/abs/2508.02602
- Vinicius Lima, Dzung T. Phan, Jayant Kalagnanam, Dhaval Patel, Nianjun Zhou, 5 Aug 2025, Toward a Trustworthy Optimization Modeling Agent via Verifiable Synthetic Data Generation, https://arxiv.org/abs/2508.03117
- Claudiu Leoveanu-Condrei, 5 Aug 2025, A DbC Inspired Neurosymbolic Layer for Trustworthy Agent Design, https://arxiv.org/abs/2508.03665
- Anqi Li, Wenwei Jin, Jintao Tong, Pengda Qin, Weijia Li, Guo Lu, 5 Aug 2025, Towards Trustworthy Multimodal Moderation via Policy-Aligned Reasoning and Hierarchical Labeling, https://arxiv.org/abs/2508.03296
- Haoran Li and Lihao Mai and Muhao Guo and Jiaqi Wu and Yang Weng and Yannan Sun and Ce Jimmy Liu, 7 Aug 2025, From Imperfect Signals to Trustworthy Structure: Confidence-Aware Inference from Heterogeneous and Reliability-Varying Utility Data, https://arxiv.org/abs/2508.05791
- Ahmad Farooq and Kamran Iqbal, 7 Aug 2025, Towards Transparent Ethical AI: A Roadmap for Trustworthy Robotic Systems, https://arxiv.org/abs/2508.05846
- Kristian Miok, Blaz \v{S}krlj, Daniela Zaharie, and Marko Robnik \v{S}ikonja, 30 Jul 2025, TT-XAI: Trustworthy Clinical Text Explanations via Keyword Distillation and LLM Reasoning, https://arxiv.org/abs/2508.08273
- Mithat Can Ozgun, Jiahuan Pei, Koen Hindriks, Lucia Donatelli, Qingzhi Liu, Xin Sun, Junxiao Wang, 15 Aug 2025, Trustworthy AI Psychotherapy: Multi-Agent LLM Workflow for Counseling and Explainable Mental Disorder Diagnosis, https://arxiv.org/abs/2508.11398
- Benjamin Alt, Mareike Picklum, Sorin Arion, Franklin Kenghagho Kenfack and Michael Beetz, 15 Aug 2025, Open, Reproducible and Trustworthy Robot-Based Experiments with Virtual Labs and Digital-Twin-Based Execution Tracing, https://arxiv.org/abs/2508.11406
- Zihan Guo, Yuanjian Zhou, Chenyi Wang, Linlin You, Minjie Bian, Weinan Zhang, 19 Aug 2025, BetaWeb: Towards a Blockchain-enabled Trustworthy Agentic Web, https://arxiv.org/abs/2508.13787
- Mary Versa Clemens-Sewall, Christopher Cervantes, Emma Rafkin, J. Neil Otte, Tom Magelinski, Libby Lewis, Michelle Liu, Dana Udwin, Monique Kirkman-Bey, 20 Aug 2025, CaTE Data Curation for Trustworthy AI, https://arxiv.org/abs/2508.14741
- Wenjie Lin, Jin Wei-Kocsis, 21 Aug 2025, LLM4Sweat: A Trustworthy Large Language Model for Hyperhidrosis Support, https://arxiv.org/abs/2508.15192
AI Industry Safety Practices
Various papers discuss the practices of the major AI players in the industry, along with issues such as self-governance.
- OpenAI, July 2023, Frontier Model Forum, https://openai.com/blog/frontier-model-forum
- OpenAI. April 2023, Our approach to AI safety. https://openai.com/blog/our-approach-to-ai-safety
- A. M. Barrett, J. Newman, D. Hendrycks, and B. Nonnecke. 2023, UC Berkeley AI Risk-Management Standards Profile for General-Purpose AI Systems (GPAIS) and Foundation Models, https://cltc.berkeley.edu/seeking-input-and-feedback-ai-risk-management-standards-profile-for-increasingly-multi-purpose-or-general-purpose-ai
- Meta, 2023, Responsible AI: Driven by our belief that AI should benefit everyone, https://ai.meta.com/responsible-ai/
- Google, 2023, AI Governance reviews and operations, https://ai.google/responsibility/ai-governance-operations
- Google, 2023, Responsibility: Our Principles, https://ai.google/responsibility/principles/
- Google, 2023, How Bard Works | A Responsible Approach to AI, YouTube, https://www.youtube.com/watch?v=vhbkCEnNXcY
Technical Verification and Testing of AI Safety
Testing and evaluation of AI safety issues:
- Xiaowei Huang, Marta Kwiatkowska, Sen Wang, and Min Wu. May 2017. Safety verification of deep neural networks. In Computer Aided Verification, pages 3–29, https://arxiv.org/abs/1610.06940
- D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath, B. Mann, E. Perez, N. Schiefer, K. Ndousse, A. Jones, S. Bowman, A. Chen, T. Conerly, N. DasSarma, D. Drain, N. Elhage, S. El-Showk, S. Fort, Z. Hatfield-Dodds, T. Henighan, D. Hernandez, T. Hume, J. Jacobson, S. Johnston, S. Kravec, C. Olsson, S. Ringer, E. Tran-Johnson, D. Amodei, T. Brown, N. Joseph, S. McCandlish, C. Olah, J. Kaplan, and J. Clark. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022 https://arxiv.org/abs/2209.07858
- K Ramesh, A Chavan, S Pandit, 2023, A Comparative Study on the Impact of Model Compression Techniques on Fairness in Language Models, Microsoft Research, https://aclanthology.org/2023.acl-long.878.pdf, https://www.microsoft.com/en-us/research/uploads/prod/2023/07/3687_Paper.pdf (Rather than testing full models, this analysis examines optimized models due to quantization, pruning or distillation.)
- T. Shevlane. Structured access: An emerging paradigm for safe AI deployment. In The Oxford Handbook of AI Governance, 2022, https://arxiv.org/abs/2201.05159
- E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, and G. Irving. 2022, Red teaming language models with language models. arXiv preprint arXiv:2202.03286, https://arxiv.org/abs/2202.03286
- OpenAI. 2023. Safety best practices. https://platform.openai.com/docs/guides/safety-best-practices
- William Saunders, Girish Sastry, Andreas Stuhlmueller, and Owain Evans. Trial without error: Towards safe reinforcement learning via human intervention. arXiv preprint arXiv:1707.05173, 2017. https://arxiv.org/abs/1707.05173
- Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed, Oct 2023, Mistral 7B, https://arxiv.org/abs/2310.06825, Code: https://mistral.ai/news/announcing-mistral-7b/ (Examines guardrails and testing of the safety of the model against harmful inputs.)
AI Factual Inaccuracy
Research papers on accuracy of AI results include:
- M Yuksekgonul, V Chandrasekaran, E Jones, Sep 2023, Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models, https://arxiv.org/pdf/2309.15098.pdf, Code: https://github.com/microsoft/mechanistic-error-probe
- Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang, June 2023, Exposing Attention Glitches with Flip-Flop Language Modeling, https://arxiv.org/abs/2306.00946
- S Latifi, 2023, Efficient and Dependable Deep Learning Systems Ph.D. Thesis, Computer Science and Engineering, University of Michigan, https://deepblue.lib.umich.edu/bitstream/handle/2027.42/176548/salar_1.pdf?sequence=1
- Michael Wood, Aug 26, 2024, 100% Accurate AI Claimed by Acurai — OpenAI and Anthropic Confirm Acurai’s Discoveries, https://blog.cubed.run/100-accurate-ai-claimed-by-acurai-openai-and-anthropic-confirm-acurais-discoveries-98fce1ddeb5b
AI Safety Incidents
Various incidents and accidents related to AI safety issues:
- S. McGregor. Nov 2021. Preventing repeated real world AI failures by cataloging incidents: The AI Incident Database. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 15458–15463, https://arxiv.org/abs/2011.08512
- Sarah Perez, 2023, Snapchat’s My AI goes rogue, posts to Stories, but Snap confirms it was just a glitch, August 17, 2023, TechCrunch, https://techcrunch.com/2023/08/16/snapchats-my-ai-goes-rogue-posts-to-stories-but-snap-confirms-it-was-just-a-glitch/
- Jaime Seidel, 2019, How a ‘confused’ AI May Have Fought Pilots Attempting to Save Boeing 737 MAX8s, News Corp Australia Network, https://www.news.com.au/technology/innovation/inventions/how-a-confused-ai-may-have-fought-pilots-attempting-to-save-boeing-737-max8s/news-story/bf0d102f699905e5aa8d1f6d65f4c27e (A very good example of the need for overrides and interruptibility.)
- Zachary Arnold, Helen Toner, July 2021, AI Accidents: An Emerging Threat What Could Happen and What to Do, CSET Policy Brief, https://cset.georgetown.edu/wp-content/uploads/CSET-AI-Accidents-An-Emerging-Threat.pdf
- Hern Alex. Apple contractors ‘regularly hear confidential details’ on Siri recordings. Guardian. 2019, https://www.theguardian.com/technology/2019/jul/26/apple-contractors-regularly-hear-confidential-details-on-siri-recordings
- Victor Tangermann, Sep 2023, Microsoft Publishes Garbled AI Article Calling Tragically Deceased NBA Player "Useless", Futurism, https://futurism.com/msn-ai-brandon-hunter-useless ("AI should not be writing obituaries.")
Incident Databases: There are various databases that collect information about AI safety incidents.
- AI Incident Database, https://incidentdatabase.ai/
- Zach Stein-Perlman, SeLo, stepanlos, MvK, July 20, 2023, Incident reporting for AI safety, Effective Altruism Forum, https://forum.effectivealtruism.org/posts/qkK5ejystp8GCJ3vC/incident-reporting-for-ai-safety
- AVID, 2023, AI Vulnerability Database: An open-source, extensible knowledge base of AI failures, https://avidml.org/
- AIAAIC (AI, Algorithmic, and Automation Incidents and Controversies), 2023, https://www.aiaaic.org/home
- MITRE ATLAS™ (Adversarial Threat Landscape for Artificial-Intelligence Systems), https://atlas.mitre.org/
- AI Badness: An open catalog of generative AI badness, 2023, https://badness.ai/
- David Dao, 2023, Awful AI, https://github.com/daviddao/awful-ai
Medical Ethics and AI
The use of AI in medicine creates some additional ethical issues:
- Vollmer S., Mateen B.A., Bohner G., Király F.J., Ghani R., Jonsson P., et al. Machine learning and AI research for patient benefit: 20 critical questions on transparency, replicability, ethics and effectiveness. BMJ. 2018;(368):1–12. https://pubmed.ncbi.nlm.nih.gov/32198138/
- Cockerill RG., 2020, Ethics Implications of the Use of Artificial Intelligence in Violence Risk Assessment. J Am Acad Psychiatry Law. 2020 Sep;48(3):345-349. doi: 10.29158/JAAPL.003940-20. Epub 2020 May 14. PMID: 32409300, https://pubmed.ncbi.nlm.nih.gov/32409300/
- Barron DS. 2021, Commentary: the ethical challenges of machine learning in psychiatry: a focus on data, diagnosis, and treatment. Psychol Med. 2021 Nov;51(15):2522-2524. doi: 10.1017/S0033291721001008. Epub 2021 May 12. PMID: 33975655, https://pubmed.ncbi.nlm.nih.gov/33975655/
- O'Reilly-Shah VN, Gentry KR, Walters AM, Zivot J, Anderson CT, Tighe PJ. 2020, Bias and ethical considerations in machine learning and the automation of perioperative risk assessment. Br J Anaesth. 2020 Dec;125(6):843-846. doi: 10.1016/j.bja.2020.07.040. Epub 2020 Aug 21. PMID: 32838979, https://pubmed.ncbi.nlm.nih.gov/32838979/
- Buchlak QD, Esmaili N, Leveque JC, Bennett C, Piccardi M, Farrokhi F., 2020, Ethical thinking machines in surgery and the requirement for clinical leadership. Am J Surg. 2020 Nov;220(5):1372-1374. doi: 10.1016/j.amjsurg.2020.06.073. Epub 2020 Jul 8. PMID: 32723487, https://pubmed.ncbi.nlm.nih.gov/32723487/
- Starke G, De Clercq E, Borgwardt S, Elger BS., 2020, Computing schizophrenia: ethical challenges for machine learning in psychiatry. Psychol Med. 2021 Nov;51(15):2515-2521. doi: 10.1017/S0033291720001683. Epub 2020 Jun 15. PMID: 32536358, https://pubmed.ncbi.nlm.nih.gov/32536358/
- Jacobson NC, Bentley KH, Walton A, Wang SB, Fortgang RG, Millner AJ, Coombs G 3rd, Rodman AM, Coppersmith DDL., 2020, Ethical dilemmas posed by mobile health and machine learning in psychiatry research. Bull World Health Organ. 2020 Apr 1;98(4):270-276. doi: 10.2471/BLT.19.237107. Epub 2020 Feb 25. PMID: 32284651, https://pubmed.ncbi.nlm.nih.gov/32284651/
- Johnson SLJ., 2019, AI, Machine Learning, and Ethics in Health Care. J Leg Med. 2019 Oct-Dec;39(4):427-441. doi: 10.1080/01947648.2019.1690604. PMID: 31940250 https://pubmed.ncbi.nlm.nih.gov/31940250/
- Vayena E, Blasimme A, Cohen IG., 2018, Machine learning in medicine: Addressing ethical challenges. PLoS Med. 2018 Nov 6;15(11):e1002689. doi: 10.1371/journal.pmed.1002689. eCollection 2018 Nov. PMID: 30399149, https://pubmed.ncbi.nlm.nih.gov/30399149/
- Nabi J., 2018, How Bioethics Can Shape Artificial Intelligence and Machine Learning. Hastings Cent Rep. 2018 Sep;48(5):10-13. doi: 10.1002/hast.895. PMID: 30311202, https://pubmed.ncbi.nlm.nih.gov/30311202/
- Char DS, Shah NH, Magnus D., 2018, Implementing Machine Learning in Health Care - Addressing Ethical Challenges. N Engl J Med. 2018 Mar 15;378(11):981-983. doi: 10.1056/NEJMp1714229. PMID: 29539284, https://pubmed.ncbi.nlm.nih.gov/29539284/
- Fiske A, Henningsen P, Buyx A., 2019, Your Robot Therapist Will See You Now: Ethical Implications of Embodied Artificial Intelligence in Psychiatry, Psychology, and Psychotherapy. J Med Internet Res. 2019 May 9;21(5):e13216. doi: 10.2196/13216. PMID: 31094356, https://pubmed.ncbi.nlm.nih.gov/31094356/
- Beil Michael, Proft Ingo, van Heerden Daniel, Sviri Sigal, van Heerden Peter Vernon. 2019, Ethical considerations about artificial intelligence for prognostication in intensive care. Intensive Care Medicine Experimental. 2019;7:70. http://www.ncbi.nlm.nih.gov/pmc/articles/pmc6904702/, https://pubmed.ncbi.nlm.nih.gov/31823128/
- Lasse Benzinger, Frank Ursin, Wolf-Tilo Balke, Tim Kacprowski & Sabine Salloch, 2023, Should Artificial Intelligence be used to support clinical ethical decision-making? A systematic review of reasons BMC Medical Ethics volume 24, Article number: 48 (2023), https://doi.org/10.1186/s12910-023-00929-6
- Rachel Dlugatch, Antoniya Georgieva & Angeliki Kerasidou, 2023, Trustworthy artificial intelligence and ethical design: public perceptions of trustworthiness of an AI-based decision-support tool in the context of intrapartum care, BMC Medical Ethics Open Access 20 June 2023, https://doi.org/10.1186/s12910-023-00917-w
- Dzobo K, Adotey S, Thomford NE, Dzobo W. Integrating Artificial and Human Intelligence: A Partnership for Responsible Innovation in Biomedical Engineering and Medicine. OMICS. 2020 May;24(5):247-263. doi: 10.1089/omi.2019.0038. Epub 2019 Jul 16. PMID: 31313972, https://pubmed.ncbi.nlm.nih.gov/31313972/
- McCradden MD, Joshi S, Mazwi M, Anderson JA., 2020, Ethical limitations of algorithmic fairness solutions in health care machine learning. Lancet Digit Health. 2020 May;2(5):e221-e223. doi: 10.1016/S2589-7500(20)30065-0. PMID: 33328054, https://pubmed.ncbi.nlm.nih.gov/33328054/
- Kulikowski CA., 2019, Beginnings of Artificial Intelligence in Medicine (AIM): Computational Artifice Assisting Scientific Inquiry and Clinical Art - with Reflections on Present AIM Challenges. Yearb Med Inform. 2019 Aug;28(1):249-256. doi: 10.1055/s-0039-1677895. Epub 2019 Apr 25. PMID: 31022744, https://pubmed.ncbi.nlm.nih.gov/31022744/
- Park S.H., Kim Y.H., Lee J.Y., Yoo S., Kim C.J. Ethical challenges regarding artificial intelligence in medicine from the perspective of scientific editing and peer review. Science Editing. 2019;6:91–98. https://www.semanticscholar.org/paper/Ethical-challenges-regarding-artificial-in-medicine-Park-Kim/7a5b3c84c6f5d16e68eaf17989b0debfd4ba57d0
Data Leakage
Data leakage refers to the AI accidentally causing the leak of data that you'd prefer was kept confidential. The "leak" can actually be caused by the LLM, or by the user, depending on the context. There are various ways this can occur:
- Uploading confidential data in AI queries (User data leakage)
- Training or fine-tuning data containing proprietary information (Training data leakage)
- RAG datastore documents containing proprietary information (RAG data leakage)
In the context of an LLM output leaking, this refers to where internal company IP is accidentally "leaked" to the public by training the AI with documents containing internal information. The AI is not smart enough to note when it shouldn't be reading a document, and anything that goes into the training dataset, or in the RAG datastore, will be shown to users.
User data leakage is where company users are sending proprietary information to a third-party AI engine. In theory, this data is protected by the confidentiality practices of the LLM company. This issue is similar to having company staff emitting confidential information in their Google queries, but the issue is more problematic because AI queries can upload entire documents to be analyzed by the LLM, such as when doing grammar checking with an LLM.
Research papers on data leakage:
- Grant Gross, 05 Jun 2024, Unauthorized AI is eating your company data, thanks to your employees, https://www.csoonline.com/article/2138447/unauthorized-ai-is-eating-your-company-data-thanks-to-your-employees.html
- Mary K. Pratt, 08 Jul 2024, 10 ways to prevent shadow AI disaster, https://www.cio.com/article/2150142/10-ways-to-prevent-shadow-ai-disaster.html
- Rachel Curry, Aug 28 2024, Why companies including JPMorgan and Walmart are opting for internal gen AI assistants after initially restricting usage, https://www.cnbc.com/2024/08/28/why-jpmorgan-and-walmart-are-opting-for-internal-gen-ai-assistants.html
- Huan Yang, Deyu Zhang, Yudong Zhao, Yuanchun Li, Yunxin Liu, 6 Sep 2024, A First Look At Efficient And Secure On-Device LLM Inference Against KV Leakage, https://arxiv.org/abs/2409.04040 (Security issues where KV caches can be data leaks as they may contain encodings of private information.)
- G Wu, Z Zhang, Y Zhang, W Wang, J Niu, Y Wu, Mar 2025, I Know What You Asked: Prompt Leakage via KV-Cache Sharing in Multi-Tenant LLM Serving https://www.ndss-symposium.org/wp-content/uploads/2025-1772-paper.pdf
Refusal
Refusal refers to the way that an LLM will politely decline to answer an inappropriate question. There are all types of questions that we don't want an LLM to respond to, and this requires training to achieve.
- Andy Arditi, Oscar Obeso, Aaquib111, wesg, Neel Nanda, 27th Apr 2024, Refusal in LLMs is mediated by a single direction, LessWrong, https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction
- Maxime Labonne June 13, 2024 Uncensor any LLM with abliteration, https://huggingface.co/blog/mlabonne/abliteration
- NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared RELU.)
- Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri, 26 Jun 2024, WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs, https://arxiv.org/abs/2406.18495
- Maksym Andriushchenko, Nicolas Flammarion, 16 Jul 2024, Does Refusal Training in LLMs Generalize to the Past Tense? https://arxiv.org/abs/2407.11969 Code: https://github.com/tml-epfl/llm-past-tense
- Kylie Robison, Jul 20, 2024, OpenAI’s latest model will block the ‘ignore all previous instructions’ loophole, https://www.theverge.com/2024/7/19/24201414/openai-chatgpt-gpt-4o-prompt-injection-instruction-hierarchy
- Xinyi Hou, Yanjie Zhao, Haoyu Wang, 3 Aug 2024, Voices from the Frontier: A Comprehensive Analysis of the OpenAI Developer Forum, https://arxiv.org/abs/2408.01687
- Asir Saadat, Tasmia Binte Sogir, Md Taukir Azam Chowdhury, Syem Aziz, 16 Oct 2024, When Not to Answer: Evaluating Prompts on GPT Models for Effective Abstention in Unanswerable Math Word Problems, https://arxiv.org/abs/2410.13029
- Kyle O'Brien, David Majercak, Xavier Fernandes, Richard Edgar, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, Forough Poursabzi-Sangde, 18 Nov 2024, Steering Language Model Refusal with Sparse Autoencoders, https://arxiv.org/abs/2411.11296
- Mohit Sewak, Dec 6, 2024, Prompt Injection Attacks on Large Language Models, https://pub.towardsai.net/prompt-injection-attacks-on-large-language-models-bd8062fa1bb7
- Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun Xia, Tianyi Wu, Zhiwei Xue, Yulin Chen, Kenji Kawaguchi, Jiaheng Zhang, Bryan Hooi, 30 Jan 2025, GuardReasoner: Towards Reasoning-based LLM Safeguards, https://arxiv.org/abs/2501.18492
- Holistic AI Team, March 6, 2025, Anthropic’s Claude 3.7 Sonnet Jailbreaking & Red Teaming Audit: The Most Secure Model Yet? https://www.holisticai.com/blog/claude-3-7-sonnet-jailbreaking-audit
- Vishnu Kabir Chhabra, Mohammad Mahdi Khalili, 5 Apr 2025, Towards Understanding and Improving Refusal in Compressed Models via Mechanistic Interpretability, https://arxiv.org/abs/2504.04215
- Wojciech Zaremba, Evgenia Nitishinskaya, Boaz Barak, Stephanie Lin, Sam Toyer, Yaodong Yu, Rachel Dias, Eric Wallace, Kai Xiao, Johannes Heidecke, Amelia Glaese, 31 Jan 2025, Trading Inference-Time Compute for Adversarial Robustness, https://arxiv.org/abs/2501.18841
- Hongzhe Du, Weikai Li, Min Cai, Karim Saraipour, Zimin Zhang, Himabindu Lakkaraju, Yizhou Sun, Shichang Zhang, 11 Aug 2025, How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence, https://arxiv.org/abs/2504.02904
- Haonan Zhang, Dongxia Wang, Yi Liu, Kexin Chen, Jiashui Wang, Xinlei Ying, Long Liu, Wenhai Wang, 15 Aug 2025, ORFuzz: Fuzzing the "Other Side" of LLM Safety -- Testing Over-Refusal, https://arxiv.org/abs/2508.11222
- Yuan Yuan, Tina Sriskandarajah, Anna-Luisa Brakman, Alec Helyar, Alex Beutel, Andrea Vallone, Saachi Jain, 12 Aug 2025, From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training, https://arxiv.org/abs/2508.09224
Guardrails
- Aarushi Kansal, Chapter 4: Guardrails and AI: Building Safe and Controllable Apps, Building Generative AI-Powered Apps: A Hands-on Guide for Developers, Apress, https://www.amazon.com/Building-Generative-AI-Powered-Apps-Hands-ebook/dp/B0CTXXP1S4/
- Meta, July 2024 (accessed), Llama: Making safety tools accessible to everyone, https://llama.meta.com/trust-and-safety/
- Chip Huyen, Jul 25, 2024, Building A Generative AI Platform, https://huyenchip.com/2024/07/25/genai-platform.html
- Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
- Marko Zivkovic, Aug 06, 2024, Discovered Apple Intelligence prompts show Apple's attempt at preventing AI disaster, https://appleinsider.com/articles/24/08/06/discovered-apple-intelligence-prompts-show-apples-attempt-at-preventing-ai-disaster
- Rachel Curry, Aug 28 2024, Why companies including JPMorgan and Walmart are opting for internal gen AI assistants after initially restricting usage, https://www.cnbc.com/2024/08/28/why-jpmorgan-and-walmart-are-opting-for-internal-gen-ai-assistants.html
- Shayne Longpre, Stella Biderman, Alon Albalak, Hailey Schoelkopf, Daniel McDuff, Sayash Kapoor, Kevin Klyman, Kyle Lo, Gabriel Ilharco, Nay San, Maribeth Rauh, Aviya Skowron, Bertie Vidgen, Laura Weidinger, Arvind Narayanan, Victor Sanh, David Adelani, Percy Liang, Rishi Bommasani, Peter Henderson, Sasha Luccioni, Yacine Jernite, Luca Soldaini, 26 Jun 2024 (v2), The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources, https://arxiv.org/abs/2406.16746
- Jason Perlow, Nov. 6, 2024, The best open-source AI models: All your free-to-use options explained: Here are the best open-source and free-to-use AI models for text, images, and audio, organized by type, application, and licensing considerations. https://www.zdnet.com/article/the-best-open-source-ai-models-all-your-free-to-use-options-explained/
- McKinsey, November 14, 2024, What are AI guardrails? https://www.mckinsey.com/featured-insights/mckinsey-explainers/what-are-ai-guardrails
- Aditi Bodhankar, Dec 06, 2024, Content Moderation and Safety Checks with NVIDIA NeMo Guardrails, https://developer.nvidia.com/blog/content-moderation-and-safety-checks-with-nvidia-nemo-guardrails/
- Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri, 10 Dec 2024, Granite Guardian, https://arxiv.org/abs/2412.07724 https://github.com/ibm-granite/granite-guardian (Open-sourcing of safety models with many capabilities.)
- Rama Akkiraju, Anbang Xu, Deepak Bora, Tan Yu, Lu An, Vishal Seth, Aaditya Shukla, Pritam Gundecha, Hridhay Mehta, Ashwin Jha, Prithvi Raj, Abhinav Balasubramanian, Murali Maram, Guru Muthusamy, Shivakesh Reddy Annepally, Sidney Knowles, Min Du, Nick Burnett, Sean Javiya, Ashok Marannan, Mamta Kumari, Surbhi Jha, Ethan Dereszenski, Anupam Chakraborty, Subhash Ranjan, Amina Terfai, Anoop Surya, Tracey Mercer, Vinodh Kumar Thanigachalam, Tamar Bar, Sanjana Krishnan, Samy Kilaru, Jasmine Jaksic, Nave Algarici, Jacob Liberman, Joey Conway, Sonu Nayyar, Justin Boitano, 10 Jul 2024, FACTS About Building Retrieval Augmented Generation-based Chatbots, NVIDIA Research, https://arxiv.org/abs/2407.07858
- Aditi Bodhankar, Jan 16, 2025, How to Safeguard AI Agents for Customer Service with NVIDIA NeMo Guardrails, https://developer.nvidia.com/blog/how-to-safeguard-ai-agents-for-customer-service-with-nvidia-nemo-guardrails/
- Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun Xia, Tianyi Wu, Zhiwei Xue, Yulin Chen, Kenji Kawaguchi, Jiaheng Zhang, Bryan Hooi, 30 Jan 2025, GuardReasoner: Towards Reasoning-based LLM Safeguards, https://arxiv.org/abs/2501.18492
- Aditi Bodhankar, Mar 03, 2025, Measuring the Effectiveness and Performance of AI Guardrails in Generative AI Applications, https://developer.nvidia.com/blog/measuring-the-effectiveness-and-performance-of-ai-guardrails-in-generative-ai-applications/
- Manuel Cossio, 3 Aug 2025, A comprehensive taxonomy of hallucinations in Large Language Models, https://arxiv.org/abs/2508.01781
- Anthropic, 13 Aug 2025, Building Safeguards for Claude, https://www.anthropic.com/news/building-safeguards-for-claude
- Yuksel Aydin, 9 Aug 2025, Cognitive Cybersecurity for Artificial Intelligence: Guardrail Engineering with CCS-7, https://arxiv.org/abs/2508.10033
- Boyuan Zheng, Zeyi Liao, Scott Salisbury, Zeyuan Liu, Michael Lin, Qinyuan Zheng, Zifan Wang, Xiang Deng, Dawn Song, Huan Sun, Yu Su, 18 Jul 2025, WebGuard: Building a Generalizable Guardrail for Web Agents, https://arxiv.org/abs/2507.14293
- Cheng-Fu Yang, Thanh Tran, Christos Christodoulopoulos, Weitong Ruan, Rahul Gupta, Kai-Wei Chang, 28 Jul 2025, Customize Multi-modal RAI Guardrails with Precedent-based predictions, https://arxiv.org/abs/2507.20503
- Chad DeLuca, Anna Lisa Gentile, Shubhi Asthana, Bing Zhang, Pawan Chowdhary, Kellen Cheng, Basel Shbita, Pengyuan Li, Guang-Jie Ren, Sandeep Gopisetty, 25 Jul 2025, OneShield - the Next Generation of LLM Guardrails, https://arxiv.org/abs/2507.21170
- Hannah-Beth Clark, Laura Benton, Emma Searle, Margaux Dowland, Matthew Gregory, Will Gayne and John Roberts, 7 Aug 2025, Building Effective Safety Guardrails in AI Education Tools, https://arxiv.org/abs/2508.05360
- Alexander W. Lee, Justin Chan, Michael Fu, Nicolas Kim, Akshay Mehta, Deepti Raghavan, Ugur Cetintemel, 7 Aug 2025, Semantic Integrity Constraints: Declarative Guardrails for AI-Augmented Data Processing Systems, https://arxiv.org/abs/2503.00600
- Darpan Aswal and C\'eline Hudelot, 22 Aug 2025, LLMSymGuard: A Symbolic Safety Guardrail Framework Leveraging Interpretable Jailbreak Concepts, https://arxiv.org/abs/2508.16325
- Jun Zhuang, Haibo Jin, Ye Zhang, Zhengjian Kang, Wenbin Zhang, Gaby G. Dagher, Haohan Wang, 25 Aug 2025, Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation, https://arxiv.org/abs/2505.18556
Jailbreak
Jailbreaking is the hack of using English to break into a computer system. Actually, it's not so much a violation of the server, but it does refer to a way of getting the LLM to answer questions that its developer probably doesn't want it to. In other words, it's a trick to bypass the "refusal" module of an LLM.
- Andy Arditi, Oscar Obeso, Aaquib111, wesg, Neel Nanda, 27th Apr 2024, Refusal in LLMs is mediated by a single direction, LessWrong, https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction
- Adva Nakash Peleg, May 30, 2024, An LLM Journey: From POC to Production, https://medium.com/cyberark-engineering/an-llm-journey-from-poc-to-production-6c5ec6a172fb
- Yu Wang, Xiaogeng Liu, Yu Li, Muhao Chen, Chaowei Xiao, 14 Mar 2024, AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting, https://arxiv.org/abs/2403.09513 Code: https://github.com/rain305f/AdaShield
- Jinhwa Kim, Ali Derakhshan, Ian G. Harris, 31 Oct 2023, Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield, https://arxiv.org/abs/2311.00172
- Zixuan Ni, Longhui Wei, Jiacheng Li, Siliang Tang, Yueting Zhuang, Qi Tian, 8 Aug 2023 (v2), Degeneration-Tuning: Using Scrambled Grid shield Unwanted Concepts from Stable Diffusion, https://arxiv.org/abs/2308.02552
- Xiao Peng, Tao Liu, Ying Wang, 3 Jun 2024 (v2), Genshin: General Shield for Natural Language Processing with Large Language Models, https://arxiv.org/abs/2405.18741
- Ayushi Nirmal, Amrita Bhattacharjee, Paras Sheth, Huan Liu, 8 May 2024 ( v2), Towards Interpretable Hate Speech Detection using Large Language Model-extracted Rationales, https://arxiv.org/abs/2403.12403 Code: https://github.com/AmritaBh/shield
- Shweta Sharma, 27 Jun 2024, Microsoft warns of ‘Skeleton Key’ jailbreak affecting many generative AI models, https://www.csoonline.com/article/2507702/microsoft-warns-of-novel-jailbreak-affecting-many-generative-ai-models.html
- Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri, 26 Jun 2024, WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs, https://arxiv.org/abs/2406.18495
- Maksym Andriushchenko, Nicolas Flammarion, 16 Jul 2024, Does Refusal Training in LLMs Generalize to the Past Tense? https://arxiv.org/abs/2407.11969 Code: https://github.com/tml-epfl/llm-past-tense
- Kylie Robison, Jul 20, 2024, OpenAI’s latest model will block the ‘ignore all previous instructions’ loophole, https://www.theverge.com/2024/7/19/24201414/openai-chatgpt-gpt-4o-prompt-injection-instruction-hierarchy
- Chip Huyen, Jul 25, 2024, Building A Generative AI Platform, https://huyenchip.com/2024/07/25/genai-platform.html
- Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
- Ayush RoyChowdhury, Mulong Luo,, Prateek Sahu,, Sarbartha Banerjee, Mohit Tiwari, Aug 2024, ConfusedPilot: Confused Deputy Risks in RAG-based LLMs, https://confusedpilot.info/confused_pilot_new.pdf
- Dr. Ashish Bamania, Sep 2024, ‘MathPrompt’ Embarassingly Jailbreaks All LLMs Available On The Market Today. A deep dive into how a novel LLM Jailbreaking technique called ‘MathPrompt’ works, why it is so effective, and why it needs to be patched as soon as possible to prevent harmful LLM content generation, https://bamania-ashish.medium.com/mathprompt-embarassingly-jailbreaks-all-llms-available-on-the-market-today-d749da26c6e8
- Y. Bai et al., "Backdoor Attack and Defense on Deep Learning: A Survey," in IEEE Transactions on Computational Social Systems, doi: 10.1109/TCSS.2024.3482723. https://ieeexplore.ieee.org/abstract/document/10744415
- Steve Jones, Oct 3, 2024, LLM Prompt Injection: Never send the request to the model. Classify, rewrite and reject, https://blog.metamirror.io/llm-prompt-injection-never-send-the-request-to-the-model-e8017269b96a
- Emet Bethany, Mazal Bethany, Juan Arturo Nolazco Flores, Sumit Kumar Jha, Peyman Najafirad, 5 Nov 2024 (v2), Jailbreaking Large Language Models with Symbolic Mathematics, https://arxiv.org/abs/2409.11445
- Alwin Peng, Julian Michael, Henry Sleight, Ethan Perez, Mrinank Sharma, 12 Nov 2024, Rapid Response: Mitigating LLM Jailbreaks with a Few Examples, https://arxiv.org/abs/2411.07494
- Kyle O'Brien, David Majercak, Xavier Fernandes, Richard Edgar, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, Forough Poursabzi-Sangde, 18 Nov 2024, Steering Language Model Refusal with Sparse Autoencoders, https://arxiv.org/abs/2411.11296
- Zachary Coalson, Jeonghyun Woo, Shiyang Chen, Yu Sun, Lishan Yang, Prashant Nair, Bo Fang, Sanghyun Hong, 10 Dec 2024, PrisonBreak: Jailbreaking Large Language Models with Fewer Than Twenty-Five Targeted Bit-flips, https://arxiv.org/abs/2412.07192
- Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri, 10 Dec 2024, Granite Guardian, https://arxiv.org/abs/2412.07724 https://github.com/ibm-granite/granite-guardian (Open-sourcing of safety models with many capabilities.)
- Mohit Sewak, Dec 6, 2024, Prompt Injection Attacks on Large Language Models, https://pub.towardsai.net/prompt-injection-attacks-on-large-language-models-bd8062fa1bb7
- Sicheng Zhu, Brandon Amos, Yuandong Tian, Chuan Guo, Ivan Evtimov, 13 Dec 2024, AdvPrefix: An Objective for Nuanced LLM Jailbreaks, https://arxiv.org/abs/2412.10321
- Aditi Bodhankar, Jan 16, 2025, How to Safeguard AI Agents for Customer Service with NVIDIA NeMo Guardrails, https://developer.nvidia.com/blog/how-to-safeguard-ai-agents-for-customer-service-with-nvidia-nemo-guardrails/
- Xin Yi, Yue Li, Linlin Wang, Xiaoling Wang, Liang He, 18 Jan 2025, Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks, https://arxiv.org/abs/2501.10639
- Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun Xia, Tianyi Wu, Zhiwei Xue, Yulin Chen, Kenji Kawaguchi, Jiaheng Zhang, Bryan Hooi, 30 Jan 2025, GuardReasoner: Towards Reasoning-based LLM Safeguards, https://arxiv.org/abs/2501.18492
- Taryn Plumb, February 3, 2025, Anthropic claims new AI security method blocks 95% of jailbreaks, invites red teamers to try, https://venturebeat.com/security/anthropic-claims-new-ai-security-method-blocks-95-of-jailbreaks-invites-red-teamers-to-try/
- Holistic AI Team, March 6, 2025, Anthropic’s Claude 3.7 Sonnet Jailbreaking & Red Teaming Audit: The Most Secure Model Yet? https://www.holisticai.com/blog/claude-3-7-sonnet-jailbreaking-audit
- Jiacheng Liang, Tanqiu Jiang, Yuhui Wang, Rongyi Zhu, Fenglong Ma, Ting Wang, 16 May 2025, AutoRAN: Weak-to-Strong Jailbreaking of Large Reasoning Models, https://arxiv.org/abs/2505.10846
- Wojciech Zaremba, Evgenia Nitishinskaya, Boaz Barak, Stephanie Lin, Sam Toyer, Yaodong Yu, Rachel Dias, Eric Wallace, Kai Xiao, Johannes Heidecke, Amelia Glaese, 31 Jan 2025, Trading Inference-Time Compute for Adversarial Robustness, https://arxiv.org/abs/2501.18841
- Manuel Cossio, 3 Aug 2025, A comprehensive taxonomy of hallucinations in Large Language Models, https://arxiv.org/abs/2508.01781
- Anthropic, 13 Aug 2025, Building Safeguards for Claude, https://www.anthropic.com/news/building-safeguards-for-claude
- Wenpeng Xing, Mohan Li, Chunqiang Hu, Haitao XuNingyu Zhang, Bo Lin, Meng Han, 8 Aug 2025, Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs, https://arxiv.org/abs/2508.10029
- Fan Yang, 9 Aug 2025, The Cost of Thinking: Increased Jailbreak Risk in Large Language Models, https://arxiv.org/abs/2508.10032
- Xiaoxue Yang, Jaeha Lee, Anna-Katharina Dick, Jasper Timm, Fei Xie, Diogo Cruz, 11 Aug 2025, Multi-Turn Jailbreaks Are Simpler Than They Seem, https://arxiv.org/abs/2508.07646
- Xianjun Yang, Liqiang Xiao, Shiyang Li, Faisal Ladhak, Hyokun Yun, Linda Ruth Petzold, Yi Xu, William Yang Wang, 9 Aug 2025, Many-Turn Jailbreaking, https://arxiv.org/abs/2508.06755
- Xuancun Lu, Zhengxian Huang, Xinfeng Li, Chi Zhang, Xiaoyu ji, Wenyuan Xu, 11 Aug 2025, POEX: Towards Policy Executable Jailbreak Attacks Against the LLM-based Robots, https://arxiv.org/abs/2412.16633
- Tatia Tsmindashvili, Ana Kolkhidashvili, Dachi Kurtskhalia, Nino Maghlakelidze, Elene Mekvabishvili, Guram Dentoshvili, Orkhan Shamilov, Zaal Gachechiladze, Steven Saporta, David Dachi Choladze, 11 Aug 2025, Improving LLM Outputs Against Jailbreak Attacks with Expert Model Integration, https://arxiv.org/abs/2505.17066
- Jirui Yang, Zheyu Lin, Zhihui Lu, Yinggui Wang, Lei Wang, Tao Wei, Xin Du, Shuhan Yang, 31 Jul 2025, CEE: An Inference-Time Jailbreak Defense for Embodied Intelligence via Subspace Concept Rotation, https://arxiv.org/abs/2504.13201
- Zheng Zhang, Peilin Zhao, Deheng Ye, Hao Wang, 28 Jul 2025, Enhancing Jailbreak Attacks on LLMs via Persona Prompts, https://arxiv.org/abs/2507.22171
- Jiecong Wang, Haoran Li, Hao Peng, Ziqian Zeng, Zihao Wang, Haohua Du, Zhengtao Yu, 1 Aug 2025, Activation-Guided Local Editing for Jailbreaking Attacks, https://arxiv.org/abs/2508.00555
- Yelim Ahn, Jaejin Lee, 2 Aug 2025, PUZZLED: Jailbreaking LLMs through Word-Based Puzzles, https://arxiv.org/abs/2508.01306
- Yik Siu Chan, Narutatsu Ri, Yuxin Xiao, Marzyeh Ghassemi, 2 Aug 2025, Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions, https://arxiv.org/abs/2502.04322
- Muyang Zheng, Yuanzhi Yao, Changting Lin, Rui Wang, Caihong Kai, 4 Aug 2025, MIST: Jailbreaking Black-box Large Language Models via Iterative Semantic Tuning, https://arxiv.org/abs/2506.16792
- Rui Pu, Chaozhuo Li, Rui Ha, Litian Zhang, Lirong Qiu, Xi Zhang, 5 Aug 2025, Beyond Surface-Level Detection: Towards Cognitive-Driven Defense Against Jailbreak Attacks via Meta-Operations Reasoning, https://arxiv.org/abs/2508.03054
- Bodam Kim, Hiskias Dingeto, Taeyoun Kwon, Dasol Choi, DongGeon Lee, Haon Park, JaeHoon Lee, Jongho Shin, 5 Aug 2025, When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs, https://arxiv.org/abs/2508.03365
- Giovanni Cherubin, Andrew Paverd, 4 Aug 2025, Highlight & Summarize: RAG without the jailbreaks, https://arxiv.org/abs/2508.02872
- Ruofan Wang, Juncheng Li, Yixu Wang, Bo Wang, Xiaosen Wang, Yan Teng, Yingchun Wang, Xingjun Ma, Yu-Gang Jiang, 5 Aug 2025, IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves, https://arxiv.org/abs/2411.00827
- Junwoo Ha, Hyunjun Kim, Sangyoon Yu, Haon Park, Ashkan Yousefpour, Yuna Park, Suhyun Kim, 5 Aug 2025, M2S: Multi-turn to Single-turn jailbreak in Red Teaming for LLMs, https://arxiv.org/abs/2503.04856
- Thilo Hagendorff, Erik Derner, Nuria Oliver, 4 Aug 2025, Large Reasoning Models Are Autonomous Jailbreak Agents, https://arxiv.org/abs/2508.04039
- Xiaohu Li and Yunfeng Ning and Zepeng Bao and Mayi Xu and Jianhao Chen and Tieyun Qian, 6 Aug 2025, CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations, https://arxiv.org/abs/2507.06043
- Renmiao Chen, Shiyao Cui, Xuancheng Huang, Chengwei Pan, Victor Shea-Jay Huang, QingLin Zhang, Xuan Ouyang, Zhexin Zhang, Hongning Wang, and Minlie Huang, 7 Aug 2025, JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering, https://arxiv.org/abs/2508.05087
- Jesson Wang, Zhanhao Hu, David Wagner, 7 Aug 2025, JULI: Jailbreak Large Language Models by Self-Introspection, https://arxiv.org/abs/2505.11790
- Shuang Liang, Zhihao Xu, Jialing Tao, Hui Xue, Xiting Wang, 8 Aug 2025, Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models: A Unified and Accurate Approach, https://arxiv.org/abs/2508.09201
- Zuoou Li, Weitong Zhang, Jingyuan Wang, Shuyuan Zhang, Wenjia Bai, Bernhard Kainz, Mengyun Qiao, 11 Aug 2025, Towards Effective MLLM Jailbreaking Through Balanced On-Topicness and OOD-Intensity, https://arxiv.org/abs/2508.09218
- Boyuan Chen, Minghao Shao, Abdul Basit, Siddharth Garg, Muhammad Shafique, 13 Aug 2025, MetaCipher: A Time-Persistent and Universal Multi-Agent Framework for Cipher-Based Jailbreak Attacks for LLMs, https://arxiv.org/abs/2506.22557
- Ma Teng and Jia Xiaojun and Duan Ranjie and Li Xinfeng and Huang Yihao and Jia Xiaoshuang and Chu Zhixuan and Ren Wenqi, 18 Aug 2025, Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models, https://arxiv.org/abs/2412.05934
- Zhipeng Wei, Yuqi Liu, N. Benjamin Erichson, 16 Aug 2025, Emoji Attack: Enhancing Jailbreak Attacks Against Judge LLM Detection, https://arxiv.org/abs/2411.01077
- Yangyang Guo and Yangyan Li and Mohan Kankanhalli, 18 Aug 2025, Involuntary Jailbreak, https://arxiv.org/abs/2508.13246
- Jiaming Hu, Haoyu Wang, Debarghya Mukherjee, Ioannis Ch. Paschalidis, 19 Aug 2025, CCFC: Core & Core-Full-Core Dual-Track Defense for LLM Jailbreak Protection, https://arxiv.org/abs/2508.14128
- Xiangman Li, Xiaodong Wu, Qi Li, Jianbing Ni, and Rongxing Lu, 21 Aug 2025, SafeLLM: Unlearning Harmful Outputs from Large Language Models against Jailbreak Attacks, https://arxiv.org/abs/2508.15182
- Darpan Aswal and C\'eline Hudelot, 22 Aug 2025, LLMSymGuard: A Symbolic Safety Guardrail Framework Leveraging Interpretable Jailbreak Concepts, https://arxiv.org/abs/2508.16325
- Yu Yan, Sheng Sun, Zhe Wang, Yijun Lin, Zenghao Duan, zhifei zheng, Min Liu, Zhiyi yin, Jianping Zhang, 22 Aug 2025, Confusion is the Final Barrier: Rethinking Jailbreak Evaluation and Investigating the Real Misuse Threat of LLMs, https://arxiv.org/abs/2508.16347
- Yu Yan, Sheng Sun, Zenghao Duan, Teli Liu, Min Liu, Zhiyi Yin, Jiangyu Lei, Qi Li, 22 Aug 2025, from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors, https://arxiv.org/abs/2503.00038
- Chongwen Zhao, Zhihao Dou, Kaizhu Huang, 25 Aug 2025, Defending against Jailbreak through Early Exit Generation of Large Language Models, https://arxiv.org/abs/2408.11308
- Junchen Ding, Jiahao Zhang, Yi Liu, Ziqi Ding, Gelei Deng, Yuekang Li, 25 Aug 2025, TombRaider: Entering the Vault of History to Jailbreak Large Language Models, https://arxiv.org/abs/2501.18628
- Salman Rahman, Liwei Jiang, James Shiffer, Genglin Liu, Sheriff Issaka, Md Rizwan Parvez, Hamid Palangi, Kai-Wei Chang, Yejin Choi, Saadia Gabriel, 23 Aug 2025, X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents, https://arxiv.org/abs/2504.13203
- Hanjiang Hu, Alexander Robey, Changliu Liu, 25 Aug 2025, Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks, https://arxiv.org/abs/2503.00187
Prompt Injection
Prompt injection is a type of LLM "hack" or "jailbreak" involving the insertion of nefarious words into the prompt. A simple example is the jailbreak involving words to the effect of "ignore all previous instructions and do what I say," which was a surprisingly effective method.
Research papers on prompt injection attacks and mitigation include:
- Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
- Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, Yunxin Liu, 8 May 2024 (v2), Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security, https://arxiv.org/abs/2401.05459 https://github.com/MobileLLM/Personal_LLM_Agents_Survey
- Steve Jones, Oct 3, 2024, LLM Prompt Injection: Never send the request to the model. Classify, rewrite and reject, https://blog.metamirror.io/llm-prompt-injection-never-send-the-request-to-the-model-e8017269b96a
- Mohit Sewak, Dec 6, 2024, Prompt Injection Attacks on Large Language Models, https://pub.towardsai.net/prompt-injection-attacks-on-large-language-models-bd8062fa1bb7
- Kylie Robison, Jul 20, 2024, OpenAI’s latest model will block the ‘ignore all previous instructions’ loophole, https://www.theverge.com/2024/7/19/24201414/openai-chatgpt-gpt-4o-prompt-injection-instruction-hierarchy
- Jerry Wang and Fang Yu, 20 Jul 2025, DeRAG: Black-box Adversarial Attacks on Multiple Retrieval-Augmented Generation Applications via Prompt Injection, https://arxiv.org/abs/2507.15042
- Sam Johnson, Viet Pham, Thai Le, 20 Jul 2025, Manipulating LLM Web Agents with Indirect Prompt Injection Attack via HTML Accessibility Tree, https://arxiv.org/abs/2507.14799
- Tianneng Shi, Kaijie Zhu, Zhun Wang, Yuqi Jia, Will Cai, Weida Liang, Haonan Wang, Hend Alzahrani, Joshua Lu, Kenji Kawaguchi, Basel Alomair, Xuandong Zhao, William Yang Wang, Neil Gong, Wenbo Guo, Dawn Song, 21 Jul 2025, PromptArmor: Simple yet Effective Prompt Injection Defenses, https://arxiv.org/abs/2507.15219
- Aleksandr Gashkov, Aleksandr Perevalov, Maria Eltsova, Andreas Both, 18 Jul 2025, SPARQL Query Generation with LLMs: Measuring the Impact of Training Data Memorization and Knowledge Injection, https://arxiv.org/abs/2507.13859
- Junhyeong Lee, Joon-Young Kim, Heekyu Kim, Inhyo Lee and Seunghwa Ryu, 21 Jul 2025, IM-Chat: A Multi-agent LLM-based Framework for Knowledge Transfer in Injection Molding Industry, https://arxiv.org/abs/2507.15268
- Zhengyun Zhao, Huaiyuan Ying, Yue Zhong, Sheng Yu, 24 Jul 2025, DR.EHR: Dense Retrieval for Electronic Health Record with Knowledge Injection and Synthetic Data, https://arxiv.org/abs/2507.18583
- Da-Wei Zhou, Kai-Wen Li, Jingyi Ning, Han-Jia Ye, Lijun Zhang, De-Chuan Zhan, 24 Jul 2025, External Knowledge Injection for CLIP-Based Class-Incremental Learning, https://arxiv.org/abs/2503.08510
- Taibiao Zhao, Mingxuan Sun, Hao Wang, Xiaobing Chen, Xiangwei Zhou, 14 Aug 2025, Pruning and Malicious Injection: A Retraining-Free Backdoor Attack on Transformer Models, https://arxiv.org/abs/2508.10243
- Francesco Panebianco, Stefano Bonfanti, Francesco Trov\`o, Michele Carminati, 1 Aug 2025, LeakSealer: A Semisupervised Defense for LLMs Against Prompt Injection and Leakage Attacks, https://arxiv.org/abs/2508.00602
- Peiran Wang, Yang Liu, Yunfei Lu, Yifeng Cai, Hongbo Chen, Qingyou Yang, Jie Zhang, Jue Hong, Ye Wu, 2 Aug 2025, AgentArmor: Enforcing Program Analysis on Agent Runtime Trace to Defend Against Prompt Injection, https://arxiv.org/abs/2508.01249
- Zhiyao Luo, Tingting Zhu, 6 Aug 2025, Are Large Language Models Dynamic Treatment Planners? An In Silico Study from a Prior Knowledge Injection Angle, https://arxiv.org/abs/2508.04755
- Thorsten Peinemann, Paula Arnold, Sebastian Berndt, Thomas Eisenbarth, Esfandiar Mohammadi, 7 Aug 2025, Non-omniscient backdoor injection with a single poison sample: Proving the one-poison hypothesis for linear regression and linear classification, https://arxiv.org/abs/2508.05600
- Hammad Atta, Ken Huang, Manish Bhatt, Kamal Ahmed, Muhammad Aziz Ul Haq, Yasir Mehmood, 6 Aug 2025, Logic layer Prompt Control Injection (LPCI): A Novel Security Vulnerability Class in Agentic Systems, https://arxiv.org/abs/2507.10457
- Kalle Kujanp\"a\"a, Pekka Marttinen, Harri Valpola, Alexander Ilin, 7 Aug 2025, Efficient Knowledge Injection in LLMs via Self-Distillation, https://arxiv.org/abs/2412.14964
- Ameya Anjarlekar, Sandeep Pombra, 8 Aug 2025, LLM Unlearning using Gradient Ratio-Based Influence Estimation and Noise Injection, https://arxiv.org/abs/2508.06467
- Zhiqiu Zhang, Dongqi Fan, Mingjie Wang, Qiang Tang, Jian Yang, Zili Yi, 13 Aug 2025, Region-to-Region: Enhancing Generative Image Harmonization with Adaptive Regional Injection, https://arxiv.org/abs/2508.09746
- Xuyang Guo, Zekai Huang, Zhao Song, Jiahao Zhang, 16 Aug 2025, Too Easily Fooled? Prompt Injection Breaks LLMs on Frustratingly Simple Multiple-Choice Questions, https://arxiv.org/abs/2508.13214
- Xudong Wang, Guoming Tang, Junyu Xue, Srinivasan Keshav, Tongxin Li, Chris Ding, 20 Aug 2025, DualNILM: Energy Injection Identification Enabled Disaggregation with Deep Multi-Task Learning, https://arxiv.org/abs/2508.14600
- Hengyu An, Jinghuai Zhang, Tianyu Du, Chunyi Zhou, Qingming Li, Tao Lin, Shouling Ji, 21 Aug 2025, IPIGuard: A Novel Tool Dependency Graph-Based Defense Against Indirect Prompt Injection in LLM Agents, https://arxiv.org/abs/2508.15310
- Khalil Hennara, Sara Chrouf, Mohamed Motaism Hamed, Zeina Aldallal, Omar Hadid, Safwan AlModhayan, 21 Aug 2025, Kuwain 1.5B: An Arabic SLM via Language Injection, https://arxiv.org/abs/2504.15120
- Jiawen Shi, Zenghui Yuan, Yinuo Liu, Yue Huang, Pan Zhou, Lichao Sun, Neil Zhenqiang Gong, 24 Aug 2025, Optimization-based Prompt Injection Attack to LLM-as-a-Judge, https://arxiv.org/abs/2403.17710
- Sarthak Choudhary, Divyam Anshumaan, Nils Palumbo, Somesh Jha, 17 Jul 2025, How Not to Detect Prompt Injections with an LLM, https://arxiv.org/abs/2507.05630
Plagiarism
Plagiarism is an issue for LLMs when they repeat input training verbatim. This is a controversial issue with numerous copyright lawsuits in progress at the moment. Another side to "plagiarism" is detecting when authors or students have used AI in their writing, without attribution.
Research papers on plagiarism issues with AI include:
- Ruixiang Tang, Yu-Neng Chuang, Xia Hu, June 2023, The Science of Detecting LLM-Generated Texts, https://arxiv.org/abs/2303.07205
- JON CHRISTIAN, 2023, CNET's AI Journalist Appears to Have Committed Extensive Plagiarism, https://futurism.com/cnet-ai-plagiarism
- Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
- David Gewirtz, Nov. 26, 2024, I tested 9 AI content detectors - and these 2 correctly identified AI text every time, https://www.zdnet.com/article/i-tested-9-ai-content-detectors-and-these-2-correctly-identified-ai-text-every-time/
- Guillaume Cabanac, Cyril Labbé, Alexander Magazinov, 12 Jul 2021, Tortured phrases: A dubious writing style emerging in science. Evidence of critical issues affecting established journals, https://arxiv.org/abs/2107.06751 (Detects "tortured phrases" created by pre-AI paraphrasing tools used to avoid plagiarism detectors.)
- Eléna Martel, Martin Lentschat, Cyril Labbé, 2 Feb 2024, Detection of tortured phrases in scientific literature, https://arxiv.org/abs/2402.03370
AI Detectors
AI detectors are software that is supposed to detect whether a text or image has been created by humans or by LLMs. In practice, these have been a mixed success, being prone to both false positives and false negatives, and their use remains controversial and unclear.
Research papers on AI detectors:
- David Gewirtz, Aug. 19, 2024, How do AI checkers actually work? https://www.zdnet.com/article/how-do-ai-checkers-work/
- David Gewirtz, Aug. 8, 2024, I tested 7 AI content detectors - they're getting dramatically better at identifying plagiarism, https://www.zdnet.com/article/i-tested-7-ai-content-detectors-theyre-getting-dramatically-better-at-identifying-plagiarism/
- Write A Catalyst, Aug 23, 2024, Words and Phrases That Show ChatGPT Generated It, https://medium.com/write-a-catalyst/words-and-phrases-that-show-chatgpt-generated-it-ca7e28ae8e8f
- Brian Contreras, September 19, 2024, How Can You Detect AI-Generated Text? This Startup Has Some Compelling Ideas, https://www.inc-aus.com/brian-contreras/how-can-you-detect-ai-generated-text-this-startup-has-some-compelling-ideas.html
- Tan Rosado, Sep 9, 2024, 10 Phrases That Scream ‘AI Wrote This!’ — Even When It Didn’t. https://medium.com/write-a-catalyst/10-phrases-that-scream-ai-wrote-this-even-when-it-didn-t-c58f273c9075
- David Gewirtz, Nov. 26, 2024, I tested 9 AI content detectors - and these 2 correctly identified AI text every time, https://www.zdnet.com/article/i-tested-9-ai-content-detectors-and-these-2-correctly-identified-ai-text-every-time/
- The Medium Newsletter Dec 2024, ChatGPT’s favorite words & punctuation, The Medium Blog, https://blog.medium.com/chatgpts-favorite-words-punctuation-fca042bb6bea
- The Medium Blog, Jun 7, 2024, How to become a marine biologist, https://blog.medium.com/how-to-become-a-marine-biologist-ca849217523b
- Alex Hern, 16 Apr 2024, TechScape: How cheap, outsourced labour in Africa is shaping AI English, https://www.theguardian.com/technology/2024/apr/16/techscape-ai-gadgest-humane-ai-pin-chatgpt
- Jordan Gibbs, Dec 14, 2023, Which Words Does ChatGPT Use the Most? I analyzed 1 million words of ChatGPT output and found the words that ChatGPT overuses most. https://medium.com/@jordan_gibbs/which-words-does-chatgpt-use-the-most-7c9ff02416a8
- Asif Iqbal, August 31, 2024, ChatGPT's Top 50 Favorite Words and Phrases, https://www.linkedin.com/pulse/chatgpts-top-50-favorite-words-phrases-asif-iqbal-mba-cmbe-lavpe/
- BaggyBoy, 2024, Is an em dash (—) proof of AI manipulation? https://www.reddit.com/r/ChatGPT/comments/1fx12q1/is_an_em_dash_proof_of_ai_manipulation/?rdt=38192
- Linda Caroll, Jan 2025, I Don’t Know How To Make You Care What ChatGPT Is Quietly Doing: Over half of the internet is now AI generated text https://medium.com/the-generator/i-dont-know-how-to-make-you-care-what-chatgpt-is-quietly-doing-8177dfcfb486
- Maria Cassano, Jan 4, 2025, I’m a Professional Editor and These Phrases Tell Me You Used ChatGPT: AI chatbots were trained on novice writing, and it shows, https://writingcooperative.com/im-a-professional-editor-and-these-phrases-tell-me-you-used-chatgpt-23236708918f
- Mingjie Sun, Yida Yin, Zhiqiu Xu, J. Zico Kolter, Zhuang Liu, 17 Feb 2025, Idiosyncrasies in Large Language Models, https://arxiv.org/abs/2502.12150
- W Li, Y Lai, S Soni, K Saha, 2025, Emails by LLMs: A Comparison of Language in AI-Generated and Human-Written Emails, Proceedings of the 17th ACM Web Science Conference 2025 (Websci ’25), May 20–24, 2025, New Brunswick, NJ, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3717867.3717872 https://www.researchgate.net/profile/Koustuv-Saha-2/publication/389509862_Emails_by_LLMs_A_Comparison_of_Language_in_AI-Generated_and_Human-Written_Emails/links/67c5cd02461fb56424efccc6/Emails-by-LLMs-A-Comparison-of-Language-in-AI-Generated-and-Human-Written-Emails.pdf
- David Gewirtz, April 30, 2025, I tested 10 AI content detectors - and these 5 correctly identified AI text every time: I've been testing AI content detectors for two years now. They're getting more and more reliable, https://www.zdnet.com/article/i-tested-10-ai-content-detectors-and-these-5-correctly-identified-ai-text-every-time/
- Shreya Shankar, Jun 16, 2025, Writing in the Age of LLMs: Common Patterns of Bad Writing I See from LLM Tools, https://www.sh-reya.com/blog/ai-writing/ (A good overview of the types of bad writing that comes out of LLMs.)
Privacy
Research on privacy-related risks or concerns:
- Matthew Finnegan 14 Jun 2024, Microsoft delays Recall launch amid privacy concerns, ComputerWorld, https://www.computerworld.com/article/2147736/microsoft-delays-recall-launch-amid-privacy-concerns.html
- Rohan Goswami 21 June, 2024, Apple Intelligence won’t launch in EU in 2024 due to antitrust regulation, company says, CNBS, https://www.cnbc.com/2024/06/21/apple-ai-europe-dma-macos.html
- Dan Peng, Zhihui Fu, Jun Wang, 1 Jul 2024, PocketLLM: Enabling On-Device Fine-Tuning for Personalized LLMs, https://arxiv.org/abs/2407.01031 (Running fine-tuning on a smartphone via a low-memory optimization using a "derivative-free" "zeroth-order" technique called MeZo, with advantages such as privacy.)
- Jay Peters, Jul 4, 2024, OpenAI’s ChatGPT Mac app was storing conversations in plain text, https://www.theverge.com/2024/7/3/24191636/openai-chatgpt-mac-app-conversations-plain-text
- Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
- Y. Zhang, J. Zhang, S. Yue, W. Lu, J. Ren, X. Shen, August 2024, "Mobile Generative AI: Opportunities and Challenges," in IEEE Wireless Communications, vol. 31, no. 4, pp. 58-64, doi: 10.1109/MWC.006.2300576, https://ieeexplore.ieee.org/abstract/document/10628027/
- Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, Yunxin Liu, 8 May 2024 (v2), Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security, https://arxiv.org/abs/2401.05459 https://github.com/MobileLLM/Personal_LLM_Agents_Survey
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
- Huan Yang, Deyu Zhang, Yudong Zhao, Yuanchun Li, Yunxin Liu, 6 Sep 2024, A First Look At Efficient And Secure On-Device LLM Inference Against KV Leakage, https://arxiv.org/abs/2409.04040 (Security issues where KV caches can be data leaks as they may contain encodings of private information.)
- Apple, Sep 2024, Apple Intelligence comes to iPhone, iPad, and Mac starting next month, https://www.apple.com/newsroom/2024/09/apple-intelligence-comes-to-iphone-ipad-and-mac-starting-next-month/
- Donghwan Rho, Taeseong Kim, Minje Park, Jung Woo Kim, Hyunsik Chae, Jung Hee Cheon, Ernest K. Ryu, 3 Oct 2024, Encryption-Friendly LLM Architecture, https://arxiv.org/abs/2410.02486
- Jiankun Wei, Abdulrahman Abdulrazzag, Tianchen Zhang, Adel Muursepp, Gururaj Saileshwar, 5 Nov 2024 (v2), Privacy Risks of Speculative Decoding in Large Language Models, https://arxiv.org/abs/2411.01076
- G Wu, Z Zhang, Y Zhang, W Wang, J Niu, Y Wu, Mar 2025, I Know What You Asked: Prompt Leakage via KV-Cache Sharing in Multi-Tenant LLM Serving https://www.ndss-symposium.org/wp-content/uploads/2025-1772-paper.pdf
- Maimunatu Tunau, Vincent Gbouna Zakka, Zhuangzhuang Dai, 14 Aug 2025, Enhanced Sparse Point Cloud Data Processing for Privacy-aware Human Action Recognition, https://arxiv.org/abs/2508.10469
- Feiran Li, Qianqian Xu, Shilong Bao, Boyu Han, Zhiyong Yang, Qingming Huang, 14 Aug 2025, Hybrid Generative Fusion for Efficient and Privacy-Preserving Face Recognition Dataset Generation, https://arxiv.org/abs/2508.10672
- Yanzhe Zhang, Diyi Yang, 14 Aug 2025, Searching for Privacy Risks in LLM Agents via Simulation, https://arxiv.org/abs/2508.10880
- Quentin Hillebrand, Vorapong Suppakitpaisarn and Tetsuo Shibuya, 14 Aug 2025, Communication Cost Reduction for Subgraph Counting under Local Differential Privacy via Hash Functions, https://arxiv.org/abs/2312.07055
- Jessup Byun, Xiaofeng Lin, Joshua Ward, Guang Cheng, 22 Jul 2025, Risk In Context: Benchmarking Privacy Leakage of Foundation Models in Synthetic Tabular Data Generation, https://arxiv.org/abs/2507.17066
- Wei Fan, JinYi Yoon, Xiaochang Li, Huajie Shao, and Bo Ji, 23 Jul 2025, P3SL: Personalized Privacy-Preserving Split Learning on Heterogeneous Edge Devices, https://arxiv.org/abs/2507.17228
- Na Li and Yansong Gao and Hongsheng Hu and Boyu Kuang and Anmin Fu, 22 Jul 2025, CompLeak: Deep Learning Model Compression Exacerbates Privacy Leakage, https://arxiv.org/abs/2507.16872
- Angelo Rodio, Zheng Chen, Erik G. Larsson, 23 Jul 2025, Optimizing Privacy-Utility Trade-off in Decentralized Learning with Generalized Correlated Noise, https://arxiv.org/abs/2501.14644
- Mehdi Khalaj, Shahrzad Golestani Najafabadi, Julita Vassileva, 23 Jul 2025, Privacy-Preserving Multimodal News Recommendation through Federated Learning, https://arxiv.org/abs/2507.15460
- Harsha Sammangi (Dakota State University), Aditya Jagatha (College of Business and Information Systems, Dakota State University), Giridhar Reddy Bojja (College of Business, Michigan Technological University), Jun Liu (College of Business and I.S, Dakota State University), 29 Apr 2025, Decentralized AI-driven IoT Architecture for Privacy-Preserving and Latency-Optimized Healthcare in Pandemic and Critical Care Scenarios, https://arxiv.org/abs/2507.15859
- Dakota Sullivan, Shirley Zhang, Jennica Li, Heather Kirkorian, Bilge Mutlu, Kassem Fawaz, 22 Jul 2025, Benchmarking LLM Privacy Recognition for Social Robot Decision Making, https://arxiv.org/abs/2507.16124
- Tanusree Sharma, Yihao Zhou, Visar Berisha, 22 Jul 2025, PRAC3 (Privacy, Reputation, Accountability, Consent, Credit, Compensation): Long Tailed Risks of Voice Actors in AI Data-Economy, https://arxiv.org/abs/2507.16247
- Tian Dong, Yan Meng, Shaofeng Li, Guoxing Chen, Zhen Liu, Haojin Zhu, 22 Jul 2025, Depth Gives a False Sense of Privacy: LLM Internal States Inversion, https://arxiv.org/abs/2507.16372
- Ryusei Fujimoto, Yugo Nakamura, Yutaka Arakawa, 24 Jul 2025, C-AAE: Compressively Anonymizing Autoencoders for Privacy-Preserving Activity Recognition in Healthcare Sensor Streams, https://arxiv.org/abs/2507.18072
- Tevin Atwal, Chan Nam Tieu, Yefeng Yuan, Zhan Shi, Yuhong Liu, Liang Cheng, 24 Jul 2025, Privacy-Preserving Synthetic Review Generation with Diverse Writing Styles Using LLMs, https://arxiv.org/abs/2507.18055
- Nikola Pavlovic, Sudeep Salgia, Qing Zhao, 18 Jul 2025, Differential Privacy in Kernelized Contextual Bandits via Random Projections, https://arxiv.org/abs/2507.13639
- Daniel Commey, Benjamin Appiah, Griffith S. Klogo, and Garth V. Crosby, 18 Jul 2025, ZKP-FedEval: Verifiable and Privacy-Preserving Federated Evaluation using Zero-Knowledge Proofs, https://arxiv.org/abs/2507.11649
- Efe Bozkir and S\"uleyman \"Ozdel and Mengdi Wang and Brendan David-John and Hong Gao and Kevin Butler and Eakta Jain and Enkelejda Kasneci, 18 Jul 2025, Eye-tracked Virtual Reality: A Comprehensive Survey on Methods and Privacy Challenges, https://arxiv.org/abs/2305.14080
- Matteo Boglioni and Terrance Liu and Andrew Ilyas and Zhiwei Steven Wu, 21 Jul 2025, Optimizing Canaries for Privacy Auditing with Metagradient Descent, https://arxiv.org/abs/2507.15836
- Wenxuan Zeng, Tianshi Xu, Yi Chen, Yifan Zhou, Mingzhe Zhang, Jin Tan, Cheng Hong, Meng Li, 19 Jul 2025, Towards Efficient Privacy-Preserving Machine Learning: A Systematic Review from Protocol, Model, and System Perspectives, https://arxiv.org/abs/2507.14519
- Juntao Tan, Lan Zhang, Zhonghao Hu, Kai Yang, Peng Ran, Bo Li, 19 Jul 2025, VMask: Tunable Label Privacy Protection for Vertical Federated Learning via Layer Masking, https://arxiv.org/abs/2507.14629
- Khoa Nguyen, Tanveer Khan, Antonis Michalas, 20 Jul 2025, A Privacy-Centric Approach: Scalable and Secure Federated Learning Enabled by Hybrid Homomorphic Encryption, https://arxiv.org/abs/2507.14853
- Tanusree Sharma, Yu-Yun Tseng, Lotus Zhang, Ayae Ide, Kelly Avery Mack, Leah Findlater, Danna Gurari, Yang Wang, 19 Jul 2025, "Before, I Asked My Mom, Now I Ask ChatGPT": Visual Privacy Management with Generative AI for Blind and Low-Vision People, https://arxiv.org/abs/2507.00286
- Wenkai Li, Liwen Sun, Zhenxiang Guan, Xuhui Zhou, Maarten Sap, 11 Aug 2025, 1-2-3 Check: Enhancing Contextual Privacy in LLM via Multi-Agent Reasoning, https://arxiv.org/abs/2508.07667
- Andrey Sidorenko and Paul Tiwald, 8 Aug 2025, Privacy-Preserving Tabular Synthetic Data Generation Using TabularARGN, https://arxiv.org/abs/2508.06647
- Yueyang Quan, Chang Wang, Shengjie Zhai, Minghong Fang, Zhuqing Liu, 10 Aug 2025, Enhancing Privacy in Decentralized Min-Max Optimization: A Differentially Private Approach, https://arxiv.org/abs/2508.07505
- Chenchen Lin, Xuehe Wang, 11 Aug 2025, Multi-Hop Privacy Propagation for Differentially Private Federated Learning in Social Networks, https://arxiv.org/abs/2508.07676
- Juan Zambrano, Cl\'ement Contet, Jairo Gudi\~no, Felipe Garrido-Lucero, Umberto Grandi, Cesar A Hidalgo, 7 Aug 2025, Leveraging LLMs for Privacy-Aware Predictions in Participatory Budgeting, https://arxiv.org/abs/2508.06577
- William Zerong Wang and Dongfang Zhao, 9 Aug 2025, Balancing Privacy and Efficiency: Music Information Retrieval via Additive Homomorphic Encryption, https://arxiv.org/abs/2508.07044
- Dawood Wasif, Dian Chen, Sindhuja Madabushi, Nithin Alluru, Terrence J. Moore, Jin-Hee Cho, 9 Aug 2025, Empirical Analysis of Privacy-Fairness-Accuracy Trade-offs in Federated Learning: A Step Towards Responsible AI, https://arxiv.org/abs/2503.16233
- Xingke Yang and Liang Li and Zhiyi Wan and Sicong Li and Xiaoqi Qi and Jiang Liu and Tomoaki Ohtsuki and Xin Fu and Miao Pan, 9 Aug 2025, PAE MobiLLM: Privacy-Aware and Efficient LLM Fine-Tuning on the Mobile Device via Additive Side-Tuning, https://arxiv.org/abs/2507.01216
- Kaveen Hiniduma, Zilinghan Li, Aditya Sinha, Ravi Madduri, Suren Byna, 11 Aug 2025, CADRE: Customizable Assurance of Data Readiness in Privacy-Preserving Federated Learning, https://arxiv.org/abs/2505.23849
- Md Rakibul Hasan, Md Zakir Hossain, Aneesh Krishna, Shafin Rahman, Tom Gedeon, 9 Aug 2025, TFMPathy: Tabular Foundation Model for Privacy-Aware, Generalisable Empathy Detection from Videos, https://arxiv.org/abs/2504.10808
- Nomaan A. Kherani, Urbashi Mitra, 26 Jul 2025, ModShift: Model Privacy via Designed Shifts, https://arxiv.org/abs/2507.20060
- Yaxin Xiao and Qingqing Ye and Li Hu and Huadi Zheng and Haibo Hu and Zi Liang and Haoyang Li and Yijie Jiao, 28 Jul 2025, Reminiscence Attack on Residuals: Exploiting Approximate Machine Unlearning for Privacy, https://arxiv.org/abs/2507.20573
- Ivoline Ngong, Swanand Kadhe, Hao Wang, Keerthiram Murugesan, Justin D. Weisz, Amit Dhurandhar, Karthikeyan Natesan Ramamurthy, 28 Jul 2025, Protecting Users From Themselves: Safeguarding Contextual Privacy in Interactions with Conversational Agents, https://arxiv.org/abs/2502.18509
- Abdullah Al Siam and Sadequzzaman Shohan, 17 May 2025, Privacy-Preserving AI for Encrypted Medical Imaging: A Framework for Secure Diagnosis and Learning, https://arxiv.org/abs/2507.21060
- Chenhao Fang, Yanqing Peng, Rajeev Rao, Matt Sarmiento, Wendy Summer, Arya Pudota, Alex Goncalves, Jordi Mola, Herv\'e Robert, 23 Jul 2025, Privacy Artifact ConnecTor (PACT): Embedding Enterprise Artifacts for Compliance AI Agents, https://arxiv.org/abs/2507.21142
- Yuetian Chen, Zhiqi Wang, Nathalie Baracaldo, Swanand Ravindra Kadhe, Lei Yu, 31 Jul 2025, Evaluating the Dynamics of Membership Privacy in Deep Learning, https://arxiv.org/abs/2507.23291
- Abhishek Sawaika, Swetang Krishna, Tushar Tomar, Durga Pritam Suggisetti, Aditi Lal, Tanmaya Shrivastav, Nouhaila Innan, Muhammad Shafique, 15 Jul 2025, A Privacy-Preserving Federated Framework with Hybrid Quantum-Enhanced Learning for Financial Fraud Detection, https://arxiv.org/abs/2507.22908
- Jiajie He, Yuechun Gu, Keke Chen, 24 Jul 2025, RecPS: Privacy Risk Scoring for Recommender Systems, https://arxiv.org/abs/2507.18365
- Shreyansh Pathak, Sonu Shreshtha, Richa Singh, Mayank Vatsa, 29 Jul 2025, Quantum-Inspired Audio Unlearning: Towards Privacy-Preserving Voice Biometrics, https://arxiv.org/abs/2507.22208
- Xiaojin Zhang, Wei Chen, 30 Jul 2025, Bridging Privacy and Robustness for Trustworthy Machine Learning, https://arxiv.org/abs/2403.16591
- Javier Mu\~noz-Haro, Ruben Tolosana, Ruben Vera-Rodriguez, Aythami Morales, Julian Fierrez, 1 Aug 2025, FakeIDet: Exploring Patches for Privacy-Preserving Fake ID Detection, https://arxiv.org/abs/2504.07761
- Tianpei Lu, Bingsheng Zhang, Lekun Peng, Bowen Zheng, Lichun Li, Kui Ren, 3 Aug 2025, Privacy-Preserving Inference for Quantized BERT Models, https://arxiv.org/abs/2508.01636
- Runkai Zheng, Vishnu Asutosh Dasu, Yinong Oliver Wang, Haohan Wang, Fernando De la Torre, 3 Aug 2025, Improving Noise Efficiency in Privacy-preserving Dataset Distillation, https://arxiv.org/abs/2508.01749
- Jan Schuchardt, Mina Dalirrooyfard, Jed Guzelkabaagac, Anderson Schneider, Yuriy Nevmyvaka, Stephan G\"unnemann, 4 Aug 2025, Privacy Amplification by Structured Subsampling for Deep Differentially Private Time Series Forecasting, https://arxiv.org/abs/2502.02410
- Xinwei Liu, Xiaojun Jia, Yuan Xun, Simeng Qin, Xiaochun Cao, 5 Aug 2025, GeoShield: Safeguarding Geolocation Privacy from Vision-Language Models via Adversarial Perturbations, https://arxiv.org/abs/2508.03209
- Mengyu Zhang, Zhuotao Liu, Jingwen Huang, Xuanqi Liu, 30 Jul 2025, Agentic Privacy-Preserving Machine Learning, https://arxiv.org/abs/2508.02836
- Xin Yang, Omid Ardakanian, 5 Aug 2025, PrivDiffuser: Privacy-Guided Diffusion Model for Data Obfuscation in Sensor Networks, https://arxiv.org/abs/2412.14499
- Chongyu Bao, Ruimin Dai, Yangbo Shen, Runyang Jian, Jinghan Zhang, Xiaolan Liu, Kunpeng Liu, 6 Aug 2025, Galaxy: A Cognition-Centered Framework for Proactive, Privacy-Preserving, and Self-Evolving LLM Agents, https://arxiv.org/abs/2508.03991
- Dhruv Sarkar, Nishant Pandey, Sayak Ray Chowdhury, 5 Aug 2025, DP-NCB: Privacy Preserving Fair Bandits, https://arxiv.org/abs/2508.03836
- Ajesh Koyatan Chathoth, Shuhao Yu, Stephen Lee, 6 Aug 2025, Dynamic User-controllable Privacy-preserving Few-shot Sensing Framework, https://arxiv.org/abs/2508.03989
- Haoran Niu and K. Suzanne Barber, 6 Aug 2025, Privacy Risk Predictions Based on Fundamental Understanding of Personal Data and an Evolving Threat Landscape, https://arxiv.org/abs/2508.04542
- Yubo Wang and Min Tang and Nuo Shen and Shujie Cui and Weiqing Wang, 20 Jul 2025, Privacy Risks of LLM-Empowered Recommender Systems: An Inversion Attack Perspective, https://arxiv.org/abs/2508.03703
- Fardis Nadimi, Payam Abdisarabshali, Kasra Borazjani, Jacob Chakareski, Seyyedali Hosseinalipour, 5 Aug 2025, Multi-Modal Multi-Task Federated Foundation Models for Next-Generation Extended Reality Systems: Towards Privacy-Preserving Distributed Intelligence in AR/VR/MR, https://arxiv.org/abs/2506.05683
- Chengxi Li, Ming Xiao, Mikael Skoglund, 6 Aug 2025, Adaptive Coded Federated Learning: Privacy Preservation and Straggler Mitigation, https://arxiv.org/abs/2403.14905
- Haotian Ma, Lin Gu, Siyi Wu, Yingying Zhu, 6 Aug 2025, Computation-Efficient and Recognition-Friendly 3D Point Cloud Privacy Protection, https://arxiv.org/abs/2503.15818
- Suqing Liu, Xuan Bi, Tianxi Li, 7 Aug 2025, GRAND: Graph Release with Assured Node Differential Privacy, https://arxiv.org/abs/2507.00402
- Ce Na, Kai Yang, Dengzhao Fang, Yu Li, Jingtong Gao, Chengcheng Zhu, Jiale Zhang, Xiaobing Sun, Yi Chang, 8 Aug 2025, Graph Federated Learning for Personalized Privacy Recommendation, https://arxiv.org/abs/2508.06208
- Alejandro Moreno R., Desale Fentaw, Samuel Palmer, Ra\'ul Salles de Padua, Ninad Dixit, Samuel Mugel, Roman Or\'us, Manuel Radons, Josef Menter, and Ali Abedi, 8 Aug 2025, Synthetic Data Generation and Differential Privacy using Tensor Networks' Matrix Product States (MPS), https://arxiv.org/abs/2508.06251
- Junhyeog Yun, Minui Hong, Gunhee Kim, 8 Aug 2025, FedMeNF: Privacy-Preserving Federated Meta-Learning for Neural Fields, https://arxiv.org/abs/2508.06301
- Zhihao Yao, Yuxuan Gu, Xiachong Feng, Weitao Ma, Bo Li, Xiaocheng Feng, 8 Aug 2025, Adaptive Backtracking for Privacy Protection in Large Language Models, https://arxiv.org/abs/2508.06087
- Yuzhou Nie, Zhun Wang, Ye Yu, Xian Wu, Xuandong Zhao, Wenbo Guo, Dawn Song, 8 Aug 2025, LeakAgent: RL-based Red-teaming Agent for LLM Privacy Leakage, https://arxiv.org/abs/2412.05734
- Zane Witherspoon, Thet Mon Aye, YingYing Hao, 12 Aug 2025, Can We Trust AI to Govern AI? Benchmarking LLM Performance on Privacy and AI Governance Exams, https://arxiv.org/abs/2508.09036
- Ratun Rahman, 12 Aug 2025, Federated Learning: A Survey on Privacy-Preserving Collaborative Intelligence, https://arxiv.org/abs/2504.17703
- Abdolazim Rezaei, Mehdi Sookhak, Mahboobeh Haghparast, 7 Aug 2025, RL-MoE: An Image-Based Privacy Preserving Approach In Intelligent Transportation System, https://arxiv.org/abs/2508.09186
- Nick Oh, Giorgos D. Vrakas, Si\^an J. M. Brooke, Sasha Morini\`ere, Toju Duke, 12 Aug 2025, PETLP: A Privacy-by-Design Pipeline for Social Media Data in AI Research, https://arxiv.org/abs/2508.09232
- Zhifan Luo, Shuo Shao, Su Zhang, Lijing Zhou, Yuke Hu, Chenxu Zhao, Zhihao Liu, Zhan Qin, 13 Aug 2025, Shadow in the Cache: Unveiling and Mitigating Privacy Risks of KV-cache in LLM Inference, https://arxiv.org/abs/2508.09442
- Javier Mu\~noz-Haro and Ruben Tolosana and Ruben Vera-Rodriguez and Aythami Morales and Julian Fierrez, 14 Aug 2025, Privacy-Aware Detection of Fake Identity Documents: Methodology, Benchmark, and Improved Detection Methods (FakeIDet2), https://arxiv.org/abs/2508.11716
- Xiaojin Zhang, Mingcong Xu, Yiming Li, Wei Chen, Qiang Yang, 16 Aug 2025, Deciphering the Interplay between Attack and Protection Complexity in Privacy-Preserving Federated Learning, https://arxiv.org/abs/2508.11907
- Jinyu Lu, Xinrong Sun, Yunting Tao, Tong Ji, Fanyu Kong, Guoqiang Yang, 18 Aug 2025, Efficient and Verifiable Privacy-Preserving Convolutional Computation for CNN Inference with Untrusted Clouds, https://arxiv.org/abs/2508.12832
- Daniel M. Jimenez-Gutierrez, Yelizaveta Falkouskaya, Jose L. Hernandez-Ramos, Aris Anagnostopoulos, Ioannis Chatzigiannakis, Andrea Vitaletti, 19 Aug 2025, On the Security and Privacy of Federated Learning: A Survey with Attacks, Defenses, Frameworks, Applications, and Future Directions, https://arxiv.org/abs/2508.13730
- Salman Habib, Remi Chou, Taejoon Kim, 21 Aug 2025, Stabilization of Perturbed Loss Function: Differential Privacy without Gradient Noise, https://arxiv.org/abs/2508.15523
- Michael Sun, Tai Vu, Andrew Wang, 12 Aug 2025, Privacy Preserving Inference of Personalized Content for Out of Matrix Users, https://arxiv.org/abs/2508.14905
- Ruyi Ding, Tianhong Xu, Xinyi Shen, Aidong Adam Ding, Yunsi Fei, 20 Aug 2025, MoEcho: Exploiting Side-Channel Attacks to Compromise User Privacy in Mixture-of-Experts LLMs, https://arxiv.org/abs/2508.15036
- Aishik Mandal, Tanmoy Chakraborty, Iryna Gurevych, 22 Aug 2025, Towards Privacy-aware Mental Health AI Models: Advances, Challenges, and Opportunities, https://arxiv.org/abs/2502.00451
- Muhammet Anil Yagiz, Zeynep Sude Cengiz, Polat Goktas, 24 Aug 2025, MetaFed: Advancing Privacy, Performance, and Sustainability in Federated Metaverse Systems, https://arxiv.org/abs/2508.17341
- GodsGift Uzor, Hasan Al-Qudah, Ynes Ineza, Abdul Serwadda, 22 Aug 2025, Guarding Your Conversations: Privacy Gatekeepers for Secure Interactions with Cloud-Based AI Models, https://arxiv.org/abs/2508.16765
- Jiale Liu, Jiahao Zhang, Suhang Wang, 24 Aug 2025, Exposing Privacy Risks in Graph Retrieval-Augmented Generation, https://arxiv.org/abs/2508.17222
- Carlos Soto, 23 Aug 2025, Rao Differential Privacy, https://arxiv.org/abs/2508.17135
- Xiaoyu Luo, Qiongxiu Li, 22 Aug 2025, DeMem: Privacy-Enhanced Robust Adversarial Learning via De-Memorization, https://arxiv.org/abs/2412.05767
More Research on AI Safety
Research papers that cover various other AI safety issues:
- J Schuett, N Dreksler, M Anderljung, 2023, Towards best practices in AGI safety and governance: A survey of expert opinion, arXiv preprint, https://arxiv.org/abs/2305.07153
- Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A. Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, Shane Legg, Nov 2017, AI Safety Gridworlds, https://arxiv.org/abs/1711.09883
- J. Schuett. Risk management in the Artificial Intelligence Act. European Journal of Risk Regulation, pages 1–19, 2023. https://arxiv.org/abs/2212.03109
- Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, Dan Mané, July 2016, Concrete Problems in AI Safety, https://arxiv.org/abs/1606.06565
- Mark O Riedl and Brent Harrison. 2018. Enter the matrix: A virtual world approach to safely interruptable autonomous systems. arXiv preprint arXiv:1703.10284, 2017 (revised Nov 2018). https://arxiv.org/abs/1703.10284v2
- M. Brundage, K. Mayer, T. Eloundou, S. Agarwal, S. Adler, G. Krueger, J. Leike, and P. Mishkin. OpenAI, 2022, Lessons learned on language model safety and misuse. https://openai.com/research/language-model-safety-and-misuse
- OpenAI, Feb 2023, Planning for AGI and beyond, https://openai.com/blog/planning-for-agi-and-beyond
- Andreas Cebulla, Zygmunt Szpak, Catherine Howell, Genevieve Knight & Sazzad Hussain, 2022, Applying ethics to AI in the workplace: the design of a scorecard for Australian workplace health and safety, Network Research, 13 May 2022, volume 38, pages919–935 (2023) https://link.springer.com/article/10.1007/s00146-022-01460-9
- Mohammad Ghavamzadeh, Marek Petrik, and Yinlam Chow. Safe policy improvement by minimizing robust baseline regret. In Advances in Neural Information Processing Systems, pages 2298–2306, 2016. https://arxiv.org/abs/1607.03842v1
- Laurent Orseau and Stuart Armstrong. Safely interruptible agents. In Uncertainty in Artificial Intelligence, pages 557–566, 2016. PDF: http://www.auai.org/uai2016/proceedings/papers/68.pdf
- Tate Ryan-Mosley, August 14, 2023, AI isn’t great at decoding human emotions. So why are regulators targeting the tech? MIT Technology Review, https://www.technologyreview.com/2023/08/14/1077788/ai-decoding-human-emotions-target-for-regulators/
- Maria Korolov, 15 May 2024, 10 things to watch out for with open source gen AI, CIO, https://www.cio.com/article/2104280/10-things-to-watch-out-for-with-open-source-gen-ai.html
- Andy Arditi, Oscar Obeso, Aaquib111, wesg, Neel Nanda, 27th Apr 2024, Refusal in LLMs is mediated by a single direction, LessWrong, https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction
- Google, Responsible Generative AI Toolkit, Feb 2024, https://ai.google.dev/responsible
- Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang, June 2023, Exposing Attention Glitches with Flip-Flop Language Modeling, https://arxiv.org/abs/2306.00946
- Jon Christian, Jan 30, 2023, CNET's Article-Writing AI Is Already Publishing Very Dumb Errors, https://futurism.com/cnet-ai-errors
- R Dubin, 2023. Disarming Steganography Attacks Inside Neural Network Models, arXiv preprint arXiv:2309.03071, https://arxiv.org/pdf/2309.03071.pdf
- Michael O'Neill, Mark Connor, 6 Jul 2023, Amplifying Limitations, Harms and Risks of Large Language Models, https://arxiv.org/abs/2307.04821
- Lucas Mearian, 14 Mar 2024, AI hallucination mitigation: two brains are better than one, https://www.computerworld.com/article/1612465/ai-hallucination-mitigation-two-brains-are-better-than-one.html
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
- Laura Manduchi, Kushagra Pandey, Robert Bamler, Ryan Cotterell, Sina Däubener, Sophie Fellenz, Asja Fischer, Thomas Gärtner, Matthias Kirchler, Marius Kloft, Yingzhen Li, Christoph Lippert, Gerard de Melo, Eric Nalisnick, Björn Ommer, Rajesh Ranganath, Maja Rudolph, Karen Ullrich, Guy Van den Broeck, Julia E Vogt, Yixin Wang, Florian Wenzel, Frank Wood, Stephan Mandt, Vincent Fortuin, 28 Feb 2024, On the Challenges and Opportunities in Generative AI, https://arxiv.org/abs/2403.00025
- Zhexin Zhang, Yida Lu, Jingyuan Ma, Di Zhang, Rui Li, Pei Ke, Hao Sun, Lei Sha, Zhifang Sui, Hongning Wang, Minlie Huang, 26 Feb 2024, ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors, https://arxiv.org/abs/2402.16444, Code: https://github.com/thu-coai/shieldlm
- Peter Dizikes, December 11, 2023, MIT group releases white papers on governance of AI, MIT News, https://news.mit.edu/2023/mit-group-releases-white-papers-governance-ai-1211
- MAK Raiaan, MSH Mukta, K Fatema, NM Fahad, 2023 A Review on Large Language Models: Architectures, Applications, Taxonomies, Open Issues and Challenges, https://www.techrxiv.org/articles/preprint/A_Review_on_Large_Language_Models_Architectures_Applications_Taxonomies_Open_Issues_and_Challenges/24171183/1/files/42414054.pdf
- Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen3 Ruoxi Jia, Prateek Mittal, Peter Henderson, Oct 2023, Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! https://arxiv.org/abs/2310.03693v1 Code: https://llm-tuning-safety.github.io/
- Y Hu, J Setpal, D Zhang, J Zietek, J Lambert, 2023, BoilerBot: A Reliable Task-oriented Chatbot Enhanced with Large Language Models, https://assets.amazon.science/8c/03/80c814a749f58e73a1aeda2ff282/boilerbot-tb2-final-2023.pdf
- S Latifi, 2023, Efficient and Dependable Deep Learning Systems Ph.D. Thesis, Computer Science and Engineering, University of Michigan, https://deepblue.lib.umich.edu/bitstream/handle/2027.42/176548/salar_1.pdf?sequence=1
- N. Soares. 2023, Comments on OpenAI’s “Planning for AGI and beyond”. https://www.lesswrong.com/posts/uxnjXBwr79uxLkifG
- K Ramesh, A Chavan, S Pandit, 2023, A Comparative Study on the Impact of Model Compression Techniques on Fairness in Language Models, Microsoft Research, https://aclanthology.org/2023.acl-long.878.pdf https://www.microsoft.com/en-us/research/uploads/prod/2023/07/3687_Paper.pdf
- David Spuler, March 2024, Chapter 43. Overview of AI Research, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Rithesh Murthy, Liangwei Yang, Juntao Tan, Tulika Manoj Awalgaonkar, Yilun Zhou, Shelby Heinecke, Sachin Desai, Jason Wu, Ran Xu, Sarah Tan, Jianguo Zhang, Zhiwei Liu, Shirley Kokane, Zuxin Liu, Ming Zhu, Huan Wang, Caiming Xiong, Silvio Savarese, 12 Jun 2024, MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases, https://arxiv.org/abs/2406.10290
- Shicheng Xu, Liang Pang, Mo Yu, Fandong Meng, Huawei Shen, Xueqi Cheng, Jie Zhou, 12 Jun 2024 (v2), Unsupervised Information Refinement Training of Large Language Models for Retrieval-Augmented Generation, https://arxiv.org/abs/2402.18150 (Analysis about how LLMs can mishandle information retrieved from a datastore and how to make LLMs better at handling RAG information using a specialized training regime.)
- OpenAI, Moderation: Learn how to build moderation into your AI applications, 2024, https://platform.openai.com/docs/guides/moderation
- Azure, 06/13/2024, Content filtering, https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter?tabs=warning%2Cpython
- Yu Wang, Xiaogeng Liu, Yu Li, Muhao Chen, Chaowei Xiao, 14 Mar 2024, AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting, https://arxiv.org/abs/2403.09513 Code: https://github.com/rain305f/AdaShield
- Jinhwa Kim, Ali Derakhshan, Ian G. Harris, 31 Oct 2023, Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield, https://arxiv.org/abs/2311.00172
- Apple, June 2024, Introducing Apple’s On-Device and Server Foundation Models, https://machinelearning.apple.com/research/introducing-apple-foundation-models (Apple's on-device models feature optimizations including small models, grouped query attention, 2-bit/4-bit quantization including activation quantization, shared embedding/unembedding tensors, small-ish vocabulary size of 49k, an undisclosed efficient KV cache optimization for neural engines, and layer-specific 16-bit LoRA/QLoRA adapters of size "10s of megabytes" for fine-tuned specialized model versions, also sometimes in 2-bit/4-bit, claiming speed rates of 0.6ms/token in prefill, and 30 tokens per second in decoding.)
- NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared RELU.)
- Frank Chung, June 23, 2024, ‘I need to go outside’: Young people ‘extremely addicted’ as Character.AI explodes, https://www.news.com.au/technology/online/internet/i-need-to-go-outside-young-people-extremely-addicted-as-characterai-explodes/news-story/5780991c61455c680f34b25d5847a341
- Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, 4 Mar 2022, Training language models to follow instructions with human feedback, https://arxiv.org/abs/2203.02155 (The original 2022 InstructGPT paper from OpenAI.)
- Valentina Alto, 2024, Chapter 12: Responsible AI, Building LLM-Powered Applications: Create intelligence apps and agents with large language models, Packt Publishing, https://www.amazon.com/Building-LLM-Apps-Intelligent-Language/dp/1835462316/
- Aarushi Kansal, Chapter 4: Guardrails and AI: Building Safe and Controllable Apps, Building Generative AI-Powered Apps: A Hands-on Guide for Developers, Apress, https://www.amazon.com/Building-Generative-AI-Powered-Apps-Hands-ebook/dp/B0CTXXP1S4/
- Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri, 26 Jun 2024, WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs, https://arxiv.org/abs/2406.18495
- Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, Lina Yao, 6 Jul 2024, AI Safety in Generative AI Large Language Models: A Survey, https://arxiv.org/abs/2407.18369
- Marko Zivkovic, Aug 06, 2024, Discovered Apple Intelligence prompts show Apple's attempt at preventing AI disaster, https://appleinsider.com/articles/24/08/06/discovered-apple-intelligence-prompts-show-apples-attempt-at-preventing-ai-disaster
- Mack DeGeurin, Aug 9, 2024, Researchers worry about AI turning humans into jerks: OpenAI safety researchers think GPT4o could influence 'social norms.', https://www.popsci.com/technology/openai-jerks/
- OpenAI, August 8, 2024 GPT-4o System Card, https://openai.com/index/gpt-4o-system-card/
- Rohin Shah, Seb Farquhar, Anca Dragan, 21st Aug 2024, AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work, https://www.alignmentforum.org/posts/79BPxvSsjzBkiSyTq/agi-safety-and-alignment-at-google-deepmind-a-summary-of
- Shayne Longpre, Stella Biderman, Alon Albalak, Hailey Schoelkopf, Daniel McDuff, Sayash Kapoor, Kevin Klyman, Kyle Lo, Gabriel Ilharco, Nay San, Maribeth Rauh, Aviya Skowron, Bertie Vidgen, Laura Weidinger, Arvind Narayanan, Victor Sanh, David Adelani, Percy Liang, Rishi Bommasani, Peter Henderson, Sasha Luccioni, Yacine Jernite, Luca Soldaini, 26 Jun 2024 (v2), The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources, https://arxiv.org/abs/2406.16746
- Thomas Mildner, Orla Cooney, Anna-Maria Meck, Marion Bartl, Gian-Luca Savino, Philip R. Doyle, Diego Garaialde, Leigh Clark, John Sloan, Nina Wenig, Rainer Malaka, Jasmin Niess, 26 Jan 2024, Listening to the Voices: Describing Ethical Caveats of Conversational User Interfaces According to Experts and Frequent Users, Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI '24), May 11--16, 2024, Honolulu, HI, USA, https://arxiv.org/abs/2401.14746 https://doi.org/https://doi.org/10.1145/3613904.3642542
- Kyle Wiggers, September 4, 2024, Ilya Sutskever’s startup, Safe Superintelligence, raises $1B, https://techcrunch.com/2024/09/04/ilya-sutskevers-startup-safe-super-intelligence-raises-1b/
- Balasubramaniam S. , Vanajaroselin Chirchi, Seifedine Kadry, Moorthy Agoramoorthy, Gururama Senthilvel P., Satheesh Kumar K., and Sivakumar T. A., Oct 2024, The Road Ahead: Emerging Trends, Unresolved Issues, and ConcludingRemarksinGenerativeAI—AComprehensiveReview, International Journal of Intelligent Systems, Volume 2024, Article ID 4013195, 38 pages, https://doi.org/10.1155/2024/4013195 https://www.researchgate.net/profile/Balasubramaniam-s-2/publication/384729387_The_Road_Ahead_Emerging_Trends_Unresolved_Issues_and_Concluding_Remarks_in_Generative_AI-A_Comprehensive_Review/links/6705560cf5eb7108c6e5d261/The-Road-Ahead-Emerging-Trends-Unresolved-Issues-and-Concluding-Remarks-in-Generative-AI-A-Comprehensive-Review.pdf
- Xinyi Zeng, Yuying Shang, Yutao Zhu, Jiawei Chen, Yu Tian, 9 Oct 2024, Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level, https://arxiv.org/abs/2410.06809
- Michael Nuñez, October 15, 2024, Anthropic just made it harder for AI to go rogue with its updated safety policy, https://venturebeat.com/ai/anthropic-just-made-it-harder-for-ai-to-go-rogue-with-its-updated-safety-policy/
- ETO, Apr 2024, The state of global AI safety research, https://eto.tech/blog/state-of-global-ai-safety-research/
- Leon Derczynski, Christopher Parisien, Nikki Pope, Michael Boone, Nov 2024, NVIDIA Approaches to AI Trust and Safety: Innovation and Tools, https://www.nvidia.com/en-us/on-demand/session/aisummitdc24-sdc1088/?playlistId=playList-c6a9450c-c790-462d-a058-0bacacd5d370
- Y. Bai et al., "Backdoor Attack and Defense on Deep Learning: A Survey," in IEEE Transactions on Computational Social Systems, doi: 10.1109/TCSS.2024.3482723. https://ieeexplore.ieee.org/abstract/document/10744415
- OpenAI, November 21, 2024, Advancing red teaming with people and AI, https://openai.com/index/advancing-red-teaming-with-people-and-ai/
- Patrick Mineault, Niccolò Zanichelli, Joanne Zichen Peng, Anton Arkhipov, Eli Bingham, Julian Jara-Ettinger, Emily Mackevicius, Adam Marblestone, Marcelo Mattar, Andrew Payne, Sophia Sanborn, Karen Schroeder, Zenna Tavares, Andreas Tolias, 27 Nov 2024, NeuroAI for AI Safety, https://arxiv.org/abs/2411.18526
- Maria Korolov and Michael Hill, 03 Dec 2024, 10 most critical LLM vulnerabilities, https://www.csoonline.com/article/575497/owasp-lists-10-most-critical-large-language-model-vulnerabilities.html
- Mayank Vatsa, Anubhooti Jain, Richa Singh, 7 Dec 2023, Adventures of Trustworthy Vision-Language Models: A Survey, https://arxiv.org/abs/2312.04231
- Yedi Zhang, Yufan Cai, Xinyue Zuo, Xiaokun Luan, Kailong Wang, Zhe Hou, Yifan Zhang, Zhiyuan Wei, Meng Sun, Jun Sun, Jing Sun, Jin Song Dong, 9 Dec 2024, The Fusion of Large Language Models and Formal Methods for Trustworthy AI Agents: A Roadmap, https://arxiv.org/abs/2412.06512
- Aditi Bodhankar, Dec 06, 2024, Content Moderation and Safety Checks with NVIDIA NeMo Guardrails, https://developer.nvidia.com/blog/content-moderation-and-safety-checks-with-nvidia-nemo-guardrails/
- Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri, 10 Dec 2024, Granite Guardian, https://arxiv.org/abs/2412.07724 https://github.com/ibm-granite/granite-guardian (Open-sourcing of safety models with many capabilities.)
- Rhiannon Williams, December 31, 2024, The biggest AI flops of 2024: From chatbots dishing out illegal advice to dodgy AI-generated search results, take a look back over the year’s top AI failures. https://www.technologyreview.com/2024/12/31/1109612/biggest-worst-ai-artificial-intelligence-flops-fails-2024/
- James Manyika, Demis Hassabis, Feb 04, 2025, Responsible AI: Our 2024 report and ongoing work, https://blog.google/technology/ai/responsible-ai-2024-report-ongoing-work/
- Arjun Kharpal, Feb 6 2025, ‘Dangerous proposition’: Top scientists warn of out-of-control AI, https://www.cnbc.com/2025/02/07/dangerous-proposition-top-scientists-warn-of-out-of-control-ai.html
- Vagner Figueredo de Santana, Sara Berger, Tiago Machado, Maysa Malfiza Garcia de Macedo, Cassia Sampaio Sanctos, Lemara Williams, and Zhaoqing Wu. 2025. Can LLMs Recommend More Responsible Prompts? In Proceedings of the 30th International Conference on Intelligent User Interfaces (IUI '25). Association for Computing Machinery, New York, NY, USA, 298–313. https://doi.org/10.1145/3708359.3712137 https://dl.acm.org/doi/full/10.1145/3708359.3712137 https://dl.acm.org/doi/pdf/10.1145/3708359.3712137
- Michael Nuñez, July 15, 2025, OpenAI, Google DeepMind and Anthropic sound alarm: ‘We may be losing the ability to understand AI’, https://venturebeat.com/ai/openai-google-deepmind-and-anthropic-sound-alarm-we-may-be-losing-the-ability-to-understand-ai/ (Monitoring the text-based interim "thinking-out-loud" reasoning of models in CoT.)
- Tomek Korbak, Mikita Balesni, (and many more authors) July 2025, Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety, https://tomekkorbak.com/cot-monitorability-is-a-fragile-opportunity/cot_monitoring.pdf
- Anthropic, 13 Aug 2025, Building Safeguards for Claude, https://www.anthropic.com/news/building-safeguards-for-claude
- Gabriel J. Perin, Runjin Chen, Xuxi Chen, Nina S. T. Hirata, Zhangyang Wang, Junyuan Hong, 23 Jul 2025, LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning, https://arxiv.org/abs/2506.15606
- Zheng Hui, Yijiang River Dong, Ehsan Shareghi, Nigel Collier, 22 Jul 2025, TRIDENT: Benchmarking LLM Safety in Finance, Medicine, and Law, https://arxiv.org/abs/2507.21134
- Haonan Zhang, Dongxia Wang, Yi Liu, Kexin Chen, Jiashui Wang, Xinlei Ying, Long Liu, Wenhai Wang, 15 Aug 2025, ORFuzz: Fuzzing the "Other Side" of LLM Safety -- Testing Over-Refusal, https://arxiv.org/abs/2508.11222
- Yuan Gao, Mattia Piccinini, Korbinian Moller, Amr Alanwar, Johannes Betz, 18 Jul 2025, From Words to Collisions: LLM-Guided Evaluation and Adversarial Generation of Safety-Critical Driving Scenarios, https://arxiv.org/abs/2502.02145
- Juan Manuel Contreras, 19 Jul 2025, Automated Safety Evaluations Across 20 Large Language Models: The Aymara LLM Risk and Responsibility Matrix, https://arxiv.org/abs/2507.14719
- Haoyu Wang and Chris M. Poskitt and Jun Sun and Jiali Wei, 1 Aug 2025, Pro2Guard: Proactive Runtime Enforcement of LLM Agent Safety via Probabilistic Model Checking, https://arxiv.org/abs/2508.00500
- Avni Kothari, Patrick Vossler, Jean Digitale, Mohammad Forouzannia, Elise Rosenberg, Michele Lee, Jennee Bryant, Melanie Molina, James Marks, Lucas Zier, Jean Feng, 11 Aug 2025, When the Domain Expert Has No Time and the LLM Developer Has No Clinical Expertise: Real-World Lessons from LLM Co-Design in a Safety-Net Hospital, https://arxiv.org/abs/2508.08504
- Bing Han, Feifei Zhao, Dongcheng Zhao, Guobin Shen, Ping Wu, Yu Shi, Yi Zeng, 8 Aug 2025, Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks, https://arxiv.org/abs/2508.09190
- Minseon Kim, Jin Myung Kwak, Lama Alssum, Bernard Ghanem, Philip Torr, David Krueger, Fazl Barez, Adel Bibi, 17 Aug 2025, Rethinking Safety in LLM Fine-tuning: An Optimization Perspective, https://arxiv.org/abs/2508.12531
- Mingxing Peng, Yuting Xie, Xusen Guo, Ruoyu Yao, Hai Yang, and Jun Ma, 17 Aug 2025, LD-Scene: LLM-Guided Diffusion for Controllable Generation of Adversarial Safety-Critical Driving Scenarios, https://arxiv.org/abs/2505.11247
- Yeonjun In, Wonjoong Kim, Sangwu Park, Chanyoung Park, 1 Aug 2025, R1-ACT: Efficient Reasoning Model Safety Alignment by Activating Safety Knowledge, https://arxiv.org/abs/2508.00324
- Yifan Wang, Runjin Chen, Bolian Li, David Cho, Yihe Deng, Ruqi Zhang, Tianlong Chen, Zhangyang Wang, Ananth Grama, Junyuan Hong, 22 Jul 2025, More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment, https://arxiv.org/abs/2504.02193
- Frederik Pahde, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek, 29 Jul 2025, Ensuring Medical AI Safety: Interpretability-Driven Detection and Mitigation of Spurious Model Behavior and Associated Data, https://arxiv.org/abs/2501.13818
- Qianli Ma, Dongrui Liu, Qian Chen, Linfeng Zhang, Jing Shao, 14 Aug 2025, LED-Merging: Mitigating Safety-Utility Conflicts in Model Merging with Location-Election-Disjoint, https://arxiv.org/abs/2502.16770
- Raviraj Joshi, Rakesh Paul, Kanishk Singla, Anusha Kamath, Michael Evans, Katherine Luna, Shaona Ghosh, Utkarsh Vaidya, Eileen Long, Sanjay Singh Chauhan, Niranjan Wartikar, 3 Aug 2025, CultureGuard: Towards Culturally-Aware Dataset and Guard Model for Multilingual Safety Applications, https://arxiv.org/abs/2508.01710
- Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, Hanxun Huang, Yige Li, Yutao Wu, Jiaming Zhang, Xiang Zheng, Yang Bai, Zuxuan Wu, Xipeng Qiu, Jingfeng Zhang, Yiming Li, Xudong Han, Haonan Li, Jun Sun, Cong Wang, Jindong Gu, Baoyuan Wu, Siheng Chen, Tianwei Zhang, Yang Liu, Mingming Gong, Tongliang Liu, Shirui Pan, Cihang Xie, Tianyu Pang, Yinpeng Dong, Ruoxi Jia, Yang Zhang, Shiqing Ma, Xiangyu Zhang, Neil Gong, Chaowei Xiao, Sarah Erfani, Tim Baldwin, Bo Li, Masashi Sugiyama, Dacheng Tao, James Bailey, Yu-Gang Jiang, 2 Aug 2025, Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety, https://arxiv.org/abs/2502.05206
- Francisco Munguia-Galeano, Zhengxue Zhou, Satheeshkumar Veeramani, Hatem Fakhruldeen, Louis Longley, Rob Clowes and Andrew I. Cooper, 7 Aug 2025, Chemist Eye: A Visual Language Model-Powered System for Safety Monitoring and Robot Decision-Making in Self-Driving Laboratories, https://arxiv.org/abs/2508.05148
AI Books from Aussie AI
![]() |
The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
Get your copy from Amazon: The Sweetest Lesson |
![]() |
RAG Optimization: Accurate and Efficient LLM Applications:
new book on RAG architectures:
Get your copy from Amazon: RAG Optimization |
![]() |
Generative AI Applications book:
Get your copy from Amazon: Generative AI Applications |
![]() |
Generative AI programming book:
Get your copy from Amazon: Generative AI in C++ |
![]() |
CUDA C++ Optimization book:
Get your copy from Amazon: CUDA C++ Optimization |
![]() |
CUDA C++ Debugging book:
Get your copy from Amazon: CUDA C++ Debugging |
More AI Research
Read more about: