Aussie AI
LLM GUI Agents
Last Updated 22 October, 2025
by David Spuler, Ph.D.
GUI agents are LLM-based agents that can read and/or manipulate a graphical user interface (GUI). The first GUI agents were read-only, but advanced GUI agents can now also click the mouse or enter keystrokes. In principle, this means that the LLM can launch and control apps on a PC or phone.
These early read-only agents examined what was on the screen as context for a user's query, because it is useful to know which app or window the person is looking at when issuing a query. There were two methods of examining the screen's display:
- Image-based (i.e., using a screen snapshot)
- Internal hierarchy analysis (i.e., examining the internal representation of windows)
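As a rough illustration of the second method, the following minimal sketch flattens a window hierarchy into indented text that could be placed in an LLM's context alongside the user's query. The `Widget` class, `describe`, and `build_prompt` are hypothetical names invented for this example; a real agent would obtain the tree from a platform accessibility or UI-automation interface, which is not modeled here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Widget:
    """Hypothetical node in a window hierarchy (not a real OS API)."""
    role: str
    name: str = ""
    children: List["Widget"] = field(default_factory=list)


def describe(widget: Widget, depth: int = 0) -> List[str]:
    """Flatten the hierarchy into indented text lines, one per widget."""
    label = f"{widget.role}: {widget.name}" if widget.name else widget.role
    lines = ["  " * depth + label]
    for child in widget.children:
        lines.extend(describe(child, depth + 1))
    return lines


def build_prompt(root: Widget, query: str) -> str:
    """Combine the serialized screen with the user's query (assumed prompt layout)."""
    screen = "\n".join(describe(root))
    return f"Screen contents:\n{screen}\n\nUser query: {query}"


# Example: a toy mail window with two child widgets.
example_root = Widget("window", "Mail", [
    Widget("button", "Send"),
    Widget("textbox", "Subject"),
])
example_prompt = build_prompt(example_root, "Which app am I looking at?")
```

The image-based method would instead pass a screenshot to a multimodal model; the text serialization above is only one plausible way to present hierarchy data.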
Recently, more advanced GUI agents have been released that can also take full control of the input devices, such as moving the mouse, clicking, or entering keystrokes. These advanced "computer usage" agents are theoretically capable of doing anything that a human user can do, but automated via an LLM.
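The control flow of such an agent can be sketched as an observe-decide-act loop. The sketch below is an assumption-laden toy: `FakeScreen` is an in-memory stand-in for a real display and input devices, and `scripted_policy` is a fixed plan standing in for the LLM call. None of these names come from a real library.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Action:
    kind: str          # "click", "type", or "done"
    target: str = ""   # name of the widget to act on
    text: str = ""     # keystrokes for "type" actions


@dataclass
class FakeScreen:
    """In-memory stand-in for a real GUI; it records what the agent did."""
    text_fields: Dict[str, str] = field(default_factory=lambda: {"Subject": ""})
    clicked: List[str] = field(default_factory=list)

    def observe(self) -> str:
        # A real agent would return a screenshot or a widget hierarchy here.
        return f"text_fields={self.text_fields} clicked={self.clicked}"

    def apply(self, action: Action) -> None:
        if action.kind == "click":
            self.clicked.append(action.target)
        elif action.kind == "type":
            current = self.text_fields.get(action.target, "")
            self.text_fields[action.target] = current + action.text
        else:
            raise ValueError(f"unsupported action: {action.kind}")


def scripted_policy(observation: str, step: int) -> Action:
    """Stub standing in for the LLM: follows a fixed plan, ignoring the observation."""
    plan = [
        Action("type", "Subject", "Hello"),
        Action("click", "Send"),
        Action("done"),
    ]
    return plan[min(step, len(plan) - 1)]


def run_agent(screen: FakeScreen,
              policy: Callable[[str, int], Action],
              max_steps: int = 10) -> int:
    """Observe, ask the policy for an action, apply it; return the number of actions applied."""
    for step in range(max_steps):
        action = policy(screen.observe(), step)
        if action.kind == "done":
            return step
        screen.apply(action)
    return max_steps
```

The `max_steps` cap is a deliberate design choice in this sketch: an agent with control of input devices should have a bounded loop so that a confused policy cannot keep acting indefinitely.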
Survey Papers on GUI Agents
Recent survey papers on computer usage and GUI agents:
- Chaoyun Zhang, Shilin He, Jiaxu Qian, Bowen Li, Liqun Li, Si Qin, Yu Kang, Minghua Ma, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang, 27 Nov 2024, Large Language Model-Brained GUI Agents: A Survey, https://arxiv.org/abs/2411.18279
- Shuai Wang, Weiwen Liu, Jingxuan Chen, Weinan Gan, Xingshan Zeng, Shuai Yu, Xinlong Hao, Kun Shao, Yasheng Wang, Ruiming Tang, 7 Nov 2024, GUI Agents with Foundation Models: A Comprehensive Survey, https://arxiv.org/abs/2411.04890
- Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, Xintong Li, Jing Shi, Hongjie Chen, Viet Dac Lai, Zhouhang Xie, Sungchul Kim, Ruiyi Zhang, Tong Yu, Mehrab Tanjim, Nesreen K. Ahmed, Puneet Mathur, Seunghyun Yoon, Lina Yao, Branislav Kveton, Thien Huu Nguyen, Trung Bui, Tianyi Zhou, Ryan A. Rossi, Franck Dernoncourt, 18 Dec 2024, GUI Agents: A Survey, https://arxiv.org/abs/2412.13501
- Pascal J. Sager, Benjamin Meyer, Peng Yan, Rebekka von Wartburg-Kottler, Layan Etaiwi, Aref Enayati, Gabriel Nobel, Ahmed Abdulkadir, Benjamin F. Grewe, Thilo Stadelmann, 27 Jan 2025, AI Agents for Computer Use: A Review of Instruction-based Computer Control, GUI Automation, and Operator Assistants, https://arxiv.org/abs/2501.16150
Research on LLM GUI Agents
Research papers on GUI agents:
- Anthropic, 23 Oct 2024, Developing a computer use model, https://www.anthropic.com/news/developing-computer-use
- Anthropic, 23 Oct 2024, Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku, https://www.anthropic.com/news/3-5-models-and-computer-use
- Anirban Ghoshal, 23 Oct 2024, How Anthropic’s new ‘computer use’ ability could further AI automation, https://www.cio.com/article/3583260/how-anthropics-new-computer-use-ability-could-further-ai-automation.html
- Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, Yu Qiao, 30 Oct 2024, OS-ATLAS: A Foundation Action Model for Generalist GUI Agents, https://arxiv.org/abs/2410.23218 https://github.com/OS-Copilot/OS-Atlas
- Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, Zack Hui, 13 Sep 2024 (v2), Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale, https://arxiv.org/abs/2409.08264
- Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, Jiadai Sun, Jiaqi Wang, Junjie Gao, Junjun Shan, Kangning Liu, Shudan Zhang, Shuntian Yao, Siyi Cheng, Wentao Yao, Wenyi Zhao, Xinghan Liu, Xinyi Liu, Xinying Chen, Xinyue Yang, Yang Yang, Yifan Xu, Yu Yang, Yujia Wang, Yulin Xu, Zehan Qi, Yuxiao Dong, Jie Tang, 28 Oct 2024, AutoGLM: Autonomous Foundation Agents for GUIs, https://arxiv.org/abs/2411.00820
- Shirin Ghaffary and Rachel Metz, November 14, 2024, OpenAI Nears Launch of AI Agent Tool to Automate Tasks for Users. The new software, codenamed “Operator,” is set to be released in January. https://www.bloomberg.com/news/articles/2024-11-13/openai-nears-launch-of-ai-agents-to-automate-tasks-for-users
- Siyuan Hu, Mingyu Ouyang, Difei Gao, Mike Zheng Shou, 15 Nov 2024, The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use, https://arxiv.org/abs/2411.10323 https://github.com/showlab/computer_use_ootb
- Mike Elgan, 22 Nov 2024, AI agents are unlike any technology ever, https://www.computerworld.com/article/3608973/ai-agents-are-unlike-any-technology-ever.html
- Show Lab, Nov 2024, ShowUI: ShowUI is a lightweight (2B) vision-language-action model designed for GUI agents. https://huggingface.co/showlab/ShowUI-2B
- Zhuosheng Zhang, Aston Zhang, 7 Jun 2024 (v4), You Only Look at Screens: Multimodal Chain-of-Action Agents, https://arxiv.org/abs/2309.11436
- Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, Yu Su, 7 Oct 2024, Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents, https://arxiv.org/abs/2410.05243
- Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, Zhiyong Wu, 23 Feb 2024 (v2), SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents, https://arxiv.org/abs/2401.10935
- Michael Nuñez, November 29, 2024, AI that clicks for you: Microsoft’s research points to the future of GUI automation, https://venturebeat.com/ai/ai-that-clicks-for-you-microsoft-research-points-to-the-future-of-gui-automation/
- Yiqin Wang, Haoji Zhang, Jingqi Tian, Yansong Tang, 2 Dec 2024, Ponder & Press: Advancing Visual GUI Agent towards General Computer Control, https://arxiv.org/abs/2412.01268 https://invinciblewyq.github.io/ponder-press-page/
- Kyle Wiggers, December 5, 2024, Copilot Vision, Microsoft’s AI tool that can read your screen, launches in preview, https://techcrunch.com/2024/12/05/copilot-vision-microsofts-ai-tool-that-can-read-your-screen-launches-in-preview/
- Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, Jie Tang, 2024, CogAgent: A Visual Language Model for GUI Agents, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 14281-14290, https://openaccess.thecvf.com/content/CVPR2024/html/Hong_CogAgent_A_Visual_Language_Model_for_GUI_Agents_CVPR_2024_paper.html
- Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan, 8 Apr 2024, Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs, https://arxiv.org/abs/2404.05719
- Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, Yiqiang Li, Tianshuo Zhou, Yue Yu, Chujie Gao, Qihui Zhang, Yi Gui, Zhen Li, Yao Wan, Pan Zhou, Jianfeng Gao, Lichao Sun, 16 Jun 2024, GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents, https://arxiv.org/abs/2406.10819
- Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, Yuan Yao, Yankai Lin, Zhiyuan Liu, Maosong Sun, 17 Jun 2024, GUICourse: From General Vision Language Models to Versatile GUI Agents, https://arxiv.org/abs/2406.11317 https://github.com/yiye3/GUICourse
- Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong, 5 Dec 2024, Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction, https://arxiv.org/abs/2412.04454 https://aguvis-project.github.io/
- AskUI, Dec 2024, AskUI Vision Agent: Automate computer tasks in Python, https://github.com/askui/vision-agent
- Huawen Shen, Chang Liu, Gengluo Li, Xinlong Wang, Yu Zhou, Can Ma, Xiangyang Ji, 12 Dec 2024, Falcon-UI: Understanding GUI Before Following User Instructions, https://arxiv.org/abs/2412.09362
- Google, Dec 2024, Project Mariner: A research prototype exploring the future of human-agent interaction, starting with your browser, https://deepmind.google/technologies/project-mariner/
- Lu Wang, Fangkai Yang, Chaoyun Zhang, Junting Lu, Jiaxu Qian, Shilin He, Pu Zhao, Bo Qiao, Ray Huang, Si Qin, Qisheng Su, Jiayi Ye, Yudi Zhang, Jian-Guang Lou, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang, 13 Dec 2024, Large Action Models: From Inception to Implementation, https://arxiv.org/abs/2412.10047 https://github.com/microsoft/UFO/tree/main/dataflow https://microsoft.github.io/UFO/dataflow/overview/
- Zhiqi Ge, Juncheng Li, Xinglei Pang, Minghe Gao, Kaihang Pan, Wang Lin, Hao Fei, Wenqiao Zhang, Siliang Tang, Yueting Zhuang, 13 Dec 2024, Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining, https://arxiv.org/abs/2412.10342
- Jiarun Liu, Jia Hao, Chunhong Zhang, Zheng Hu, 14 Dec 2024, WEPO: Web Element Preference Optimization for LLM-based Web Navigation, https://arxiv.org/abs/2412.10742
- Diego Rivas, D Shin, 19 Dec 2024, User Interface for Efficient Control of Autonomous Agent Tasks, https://www.tdcommons.org/cgi/viewcontent.cgi?article=8835&context=dpubs_series
- Hao Wen, Shizuo Tian, Borislav Pavlov, Wenjie Du, Yixuan Li, Ge Chang, Shanhui Zhao, Jiacheng Liu, Yunxin Liu, Ya-Qin Zhang, Yuanchun Li, 24 Dec 2024, AutoDroid-V2: Boosting SLM-based GUI Agents via Code Generation, https://arxiv.org/abs/2412.18116
- Kangjia Zhao, Jiahui Song, Leigang Sha, Haozhan Shen, Zhi Chen, Tiancheng Zhao, Xiubo Liang, Jianwei Yin, 24 Dec 2024, GUI Testing Arena: A Unified Benchmark for Advancing Autonomous GUI Testing Agent, https://arxiv.org/abs/2412.18426 https://github.com/ZJU-ACES-ISE/ChatUITest
- Xueyu Hu, Tao Xiong, Biao Yi, Zishu Wei, Ruixuan Xiao, Yurun Chen, Jiasheng Ye, Meiling Tao, Xiangxin Zhou, Ziyu Zhao, Yuhuai Li, Shengze Xu, Shawn Wang, Xinchen Xu, Shuofei Qiao, Kun Kuang, Tieyong Zeng, Liang Wang, Jiwei Li, Yuchen Eleanor Jiang, Wangchunshu Zhou, Guoyin Wang, Keting Yin, Zhou Zhao, Hongxia Yang, Fan Wu, Shengyu Zhang, Fei Wu, Dec 2024, OS Agents: A Survey on MLLM-Based Agents for General Computing Devices Use, https://www.preprints.org/manuscript/202412.2294/v1 https://www.preprints.org/frontend/manuscript/3842b6163d82801988adf663ee18b6d5/download_pub
- Gautier Dagan, Frank Keller, Alex Lascarides, 30 Dec 2024, Plancraft: an evaluation dataset for planning with LLM agents, https://arxiv.org/abs/2412.21033
- Yuxiang Chai, Hanhao Li, Jiayu Zhang, Liang Liu, Guozhi Wang, Shuai Ren, Siyuan Huang, Hongsheng Li, 2 Jan 2025, A3: Android Agent Arena for Mobile GUI Agents, https://arxiv.org/abs/2501.01149 https://yuxiangchai.github.io/Android-Agent-Arena/
- Dezhi Ran, Mengzhou Wu, Hao Yu, Yuetong Li, Jun Ren, Yuan Cao, Xia Zeng, Haochuan Lu, Zexin Xu, Mengqian Xu, Ting Su, Liangchao Yao, Ting Xiong, Wei Yang, Yuetang Deng, Assaf Marron, David Harel, Tao Xie, 6 Jan 2025, Beyond Pass or Fail: A Multi-dimensional Benchmark for Mobile UI Navigation, https://arxiv.org/abs/2501.02863
- Yuhang Liu, Pengxiang Li, Zishu Wei, Congkai Xie, Xueyu Hu, Xinchen Xu, Shengyu Zhang, Xiaotian Han, Hongxia Yang, Fei Wu, 8 Jan 2025, InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection, https://arxiv.org/abs/2501.04575
- Taryn Plumb, January 22, 2025, ByteDance’s UI-TARS can take over your computer, outperforms GPT-4o and Claude, https://venturebeat.com/ai/bytedances-ui-tars-can-take-over-your-computer-outperforms-gpt-4o-and-claude/
- Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, Guang Shi, 21 Jan 2025, UI-TARS: Pioneering Automated GUI Interaction with Native Agents, https://arxiv.org/abs/2501.12326
- Maxwell Zeff, January 23, 2025, OpenAI launches Operator, an AI agent that performs tasks autonomously, https://techcrunch.com/2025/01/23/openai-launches-operator-an-ai-agent-that-performs-tasks-autonomously/
- Kyle Wiggers, January 27, 2025, Alibaba’s Qwen team releases AI models that can control PCs and phones, https://techcrunch.com/2025/01/27/alibabas-qwen-team-releases-ai-models-that-can-control-pcs-and-phones/
- Tian Huang, Chun Yu, Weinan Shi, Zijian Peng, David Yang, Weiqi Sun, Yuanchun Shi, February 2025, Prompt2Task: Automating UI Tasks on Smartphones from Textual Prompts, ACM Trans. Comput.-Hum. Interact. (Just Accepted), https://doi.org/10.1145/3716132 https://dl.acm.org/doi/abs/10.1145/3716132
- Qinzhuo Wu, Wei Liu, Jian Luan, Bin Wang, 5 Feb 2025, ReachAgent: Enhancing Mobile Agent via Page Reaching and Operation, https://arxiv.org/abs/2502.02955
- Kunal Singh, Shreyas Singh, Mukund Khanna, 12 Feb 2025, TRISHUL: Towards Region Identification and Screen Hierarchy Understanding for Large VLM based GUI Agents, https://arxiv.org/abs/2502.08226
- Matt Marshall, February 22, 2025, The rise of browser-use agents: Why Convergence’s Proxy is beating OpenAI’s Operator, https://venturebeat.com/ai/the-rise-of-browser-use-agents-why-convergences-proxy-is-beating-openais-operator/
- Frank Landymore, Jan 25, 2025, OpenAI's Agent Has a Problem: Before It Does Anything Important, You Have to Double-Check It Hasn't Screwed Up: Not as hands-off as you might hope, https://futurism.com/openai-asks-permission-important
- Jiani Zheng, Lu Wang, Fangkai Yang, Chaoyun Zhang, Lingrui Mei, Wenjie Yin, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang, 26 Feb 2025, VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model, https://arxiv.org/abs/2502.18906
- Wenjia Jiang, Yangyang Zhuang, Chenxi Song, Xu Yang, Chi Zhang, 4 Mar 2025, AppAgentX: Evolving GUI Agents as Proficient Smartphone Users, https://arxiv.org/abs/2503.02268
- Zongru Wu, Pengzhou Cheng, Zheng Wu, Tianjie Ju, Zhuosheng Zhang, Gongshen Liu, 4 Mar 2025 (v2), Smoothing Grounding and Reasoning for MLLM-Powered GUI Agents with Query-Oriented Pivot Tasks, https://arxiv.org/abs/2503.00401 https://github.com/ZrW00/GUIPivot
- Yuqi Zhou, Shuai Wang, Sunhao Dai, Qinglin Jia, Zhaocheng Du, Zhenhua Dong, Jun Xu, 5 Mar 2025, CHOP: Mobile Operating Assistant with Constrained High-frequency Optimized Subtask Planning, https://arxiv.org/abs/2503.03743
- Asif Razzaq, March 8, 2025, Meet Manus: A New AI Agent from China with Deep Research + Operator + Computer Use + Lovable + Memory, https://www.marktechpost.com/2025/03/08/meet-manus-a-new-ai-agent-from-china-with-deep-research-operator-computer-use-lovable-memory/
- Kyle Wiggers, March 12, 2025, Browser Use, one of the tools powering Manus, is also going viral, https://techcrunch.com/2025/03/12/browser-use-one-of-the-tools-powering-manus-is-also-going-viral/
- Muzhi Zhu, Yuzhuo Tian, Hao Chen, Chunluan Zhou, Qingpei Guo, Yang Liu, Ming Yang, Chunhua Shen, 11 Mar 2025, SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories, https://arxiv.org/abs/2503.08625
- Di Zhao, Longhui Ma, Siwei Wang, Miao Wang, Zhao Lv, 12 Mar 2025, COLA: A Scalable Multi-Agent Framework For Windows UI Task Automation, https://arxiv.org/abs/2503.09263
- Chaoyun Zhang, Shilin He, Liqun Li, Si Qin, Yu Kang, Qingwei Lin, Dongmei Zhang, 14 Mar 2025, API Agents vs. GUI Agents: Divergence and Convergence, https://arxiv.org/abs/2503.11069
- Yibin Xu, Liang Yang, Hao Chen, Hua Wang, Zhi Chen, Yaohua Tang, 14 Mar 2025, DeskVision: Large Scale Desktop Region Captioning for Advanced GUI Agents, https://arxiv.org/abs/2503.11170
- Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Guanjing Xiong, Hongsheng Li, 30 Mar 2025 (v2), UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning, https://arxiv.org/abs/2503.21620
- Gaole Dai, Shiqi Jiang, Ting Cao, Yuanchun Li, Yuqing Yang, Rui Tan, Mo Li, Lili Qiu, 21 Mar 2025 (v2), Advancing Mobile GUI Agents: A Verifier-Driven Approach to Practical Deployment, https://arxiv.org/abs/2503.15937
- Bin Lei, Weitai Kang, Zijian Zhang, Winson Chen, Xi Xie, Shan Zuo, Mimi Xie, Ali Payani, Mingyi Hong, Yan Yan, Caiwen Ding, 16 May 2025, InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction, https://arxiv.org/abs/2505.10887
- Shuning Zhang, Jingruo Chen, Zhiqi Gao, Jiajing Gao, Xin Yi, Hewu Li, 16 May 2025 (v2), Characterizing Unintended Consequences in Human-GUI Agent Collaboration for Web Browsing, https://arxiv.org/abs/2505.09875
- Apoorv Agrawal, May 23, 2025, Why Cars Drive Themselves Before Computers Do: Robocars are ready; robot secretaries aren’t… yet, https://apoorv03.com/p/autonomy
- Yuheng Lu, Qian Yu, Hongru Wang, Zeming Liu, Wei Su, Yanping Liu, Yuhang Guo, Maocheng Liang, Yunhong Wang, Haifeng Wang, 27 May 2025 (v2), TransBench: Breaking Barriers for Transferable Graphical User Interface Agents in Dynamic Digital Environments, https://arxiv.org/abs/2505.17629
- Shuquan Lian, Yuhang Wu, Jia Ma, Yifan Ding, Zihan Song, Bingqi Chen, Xiawu Zheng, Hui Li, 9 Aug 2025, UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding, https://arxiv.org/abs/2507.22025
- Zihan Zheng, Tianle Cui, Chuwen Xie, Jiahui Zhang, Jiahui Pan, Lewei He, Qianglong Chen, 2 Aug 2025, NatureGAIA: Pushing the Frontiers of GUI Agents with a Challenging Benchmark and High-Quality Trajectory Dataset, https://arxiv.org/abs/2508.01330
- Zheng Wu, Pengzhou Cheng, Zongru Wu, Lingzhong Dong, Zhuosheng Zhang, 4 Aug 2025, GEM: Gaussian Embedding Modeling for Out-of-Distribution Detection in GUI Agents, https://arxiv.org/abs/2505.12842
- Chao Hao, Shuai Wang and Kaiwen Zhou, 6 Aug 2025, Uncertainty-Aware GUI Agent: Adaptive Perception through Component Recommendation and Human-in-the-Loop Refinement, https://arxiv.org/abs/2508.04025
- Liang Tang, Shuxian Li, Yuhao Cheng, Yukang Huo, Zhepeng Wang, Yiqiang Yan, Kaer Huang, Yanzhe Jing and Tiaonan Duan, 6 Aug 2025, SEA: Self-Evolution Agent with Step-wise Reward for Computer Use, https://arxiv.org/abs/2508.04037
- Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, Jiaqi Wang, 6 Aug 2025, SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience, https://arxiv.org/abs/2508.04700
- Liujian Tang, Shaokang Dong, Yijia Huang, Minqi Xiang, Hongtao Ruan, Bin Wang, Shuo Li, Zhihui Cao, Hailiang Pang, Heng Kong, He Yang, Mingxu Chai, Zhilin Gao, Xingyu Liu, Yingnan Fu, Jiaming Liu, Tao Gui, Xuanjing Huang, Yu-Gang Jiang, Qi Zhang, Kang Wang, Yunke Zhang, Yuran Wang, 19 Jul 2025, MagicGUI: A Foundational Mobile GUI Agent with Scalable Data Pipeline and Reinforcement Fine-tuning, https://arxiv.org/abs/2508.03700
- Wenkang Han, Zhixiong Zeng, Jing Huang, Shu Jiang, Liming Zheng, Haibo Qiu, Chang Yao, Jingyuan Chen, Lin Ma, 6 Aug 2025, UITron-Speech: Towards Automated GUI Agents Based on Speech Instructions, https://arxiv.org/abs/2506.11127
- Hanyu Lai, Xiao Liu, Yanxiao Zhao, Han Xu, Hanchen Zhang, Bohao Jing, Yanyu Ren, Shuntian Yao, Yuxiao Dong, and Jie Tang, 19 Aug 2025, ComputerRL: Scaling End-to-End Online Reinforcement Learning for Computer Use Agents, https://arxiv.org/abs/2508.14040
- Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, Zhennan Shen, Zhuokai Li, Ryan Li, Xiaochuan Li, Junda Chen, Boyuan Zheng, Peihang Li, Fangyu Lei, Ruisheng Cao, Yeqiao Fu, Dongchan Shin, Martin Shin, Jiarui Hu, Yuyan Wang, Jixuan Chen, Yuxiao Ye, Danyang Zhang, Dikang Du, Hao Hu, Huarong Chen, Zaida Zhou, Haotian Yao, Ziwei Chen, Qizheng Gu, Yipu Wang, Heng Wang, Diyi Yang, Victor Zhong, Flood Sung, Y.Charles, Zhilin Yang, Tao Yu, 14 Aug 2025, OpenCUA: Open Foundations for Computer-Use Agents, https://arxiv.org/abs/2508.09123
- Thong Q. Nguyen, Shubhang Desai, Raja Hasnain Anwar, Firoz Shaik, Vishwas Suryanarayanan, Vishal Chowdhary, 2 Aug 2025, VerificAgent: Domain-Specific Memory Verification for Scalable Oversight of Aligned Computer-Use Agents, https://arxiv.org/abs/2506.02539
- Songqin Nong, Jingxuan Xu, Sheng Zhou, Jianfeng Chen, Xiaoxuan Tang, Tao Jiang, Wenhao Xu, 15 Aug 2025, CRAFT-GUI: Curriculum-Reinforced Agent For GUI Tasks, https://arxiv.org/abs/2508.11360
- Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, Jitong Liao, Qi Zheng, Fei Huang, Jingren Zhou, and Ming Yan, 21 Aug 2025, Mobile-Agent-v3: Foundamental Agents for GUI Automation, https://arxiv.org/abs/2508.15144
- Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, Wanjun Zhong, Yining Ye, Yujia Qin, Yuwen Xiong, Yuxin Song, Zhiyong Wu, Aoyan Li, Bo Li, Chen Dun, Chong Liu, Daoguang Zan, Fuxing Leng, Hanbin Wang, Hao Yu, Haobin Chen, Hongyi Guo, Jing Su, Jingjia Huang, Kai Shen, Kaiyu Shi, Lin Yan, Peiyao Zhao, Pengfei Liu, Qinghao Ye, Renjie Zheng, Shulin Xin, Wayne Xin Zhao, Wen Heng, Wenhao Huang, Wenqian Wang, Xiaobo Qin, Yi Lin, Youbin Wu, Zehui Chen, Zihao Wang, Baoquan Zhong, Xinchun Zhang, Xujing Li, Yuanfan Li, Zhongkai Zhao, Chengquan Jiang, Faming Wu, Haotian Zhou, Jinlin Pang, Li Han, Qi Liu, Qianli Ma, Siyao Liu, Songhua Cai, Wenqi Fu, Xin Liu, Yaohui Wang, Zhi Zhang, Bo Zhou, Guoliang Li, Jiajun Shi, Jiale Yang, et al. (45 additional authors not shown), 5 Sep 2025, UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning, https://arxiv.org/abs/2509.02544
- Yuyang Zhao, Wentao Shi, Fuli Feng, and Xiangnan He, 26 Aug 2025, AppAgent-Pro: A Proactive GUI Agent System for Multidomain Information Integration and User Assistance, https://arxiv.org/abs/2508.18689
- Zeyi Sun, Yuhang Cao, Jianze Liang, Qiushi Sun, Ziyu Liu, Zhixiong Zhang, Yuhang Zang, Xiaoyi Dong, Kai Chen, Dahua Lin, Jiaqi Wang, 27 Aug 2025, CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning, https://arxiv.org/abs/2508.20096
- Jaewoo Ahn, Junseo Kim, Heeseung Yun, Jaehyeon Son, Dongmin Park, Jaewoong Cho, Gunhee Kim, 1 Sep 2025, FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games, https://arxiv.org/abs/2509.01052
- Pengxiang Zhao, Guangyi Liu, Yaozhen Liang, Weiqing He, Zhengxi Lu, Yuehao Huang, Yaxuan Guo, Kexin Zhang, Hao Wang, Liang Liu, Yong Liu, 8 Sep 2025, MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents, https://arxiv.org/abs/2509.06477
- Jungjae Lee, Dongjae Lee, Chihun Choi, Youngmin Im, Jaeyoung Wi, Kihong Heo, Sangeun Oh, Sunjae Lee, Insik Shin, 11 Sep 2025, VeriSafe Agent: Safeguarding Mobile GUI Agent via Logic-based Action Verification, https://arxiv.org/abs/2503.18492
- Musen Lin, Minghao Liu, Taoran Lu, Lichen Yuan, Yiwei Liu, Haonan Xu, Yu Miao, Yuhao Chao, Zhaojian Li, 19 Sep 2025, GUI-ReWalk: Massive Data Generation for GUI Agent via Stochastic Exploration and Intent-Aware Reasoning, https://arxiv.org/abs/2509.15738
- Xianhang Ye, Yiqing Li, Wei Dai, Miancan Liu, Ziyuan Chen, Zhangye Han, Hongbo Min, Jinkui Ren, Xiantao Zhang, Wen Yang, Zhi Jin, 19 Sep 2025, GUI-ARP: Enhancing Grounding with Adaptive Region Perception for GUI Agents, https://arxiv.org/abs/2509.15532
- Shaojie Zhang, Ruoceng Zhang, Pei Fu, Shaokang Wang, Jiahui Yang, Xin Du, Shiqi Cui, Bin Qin, Ying Huang, Zhenbo Luo, Jian Luan, 19 Sep 2025, BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent, https://arxiv.org/abs/2509.15566
- Jingyu Tang, Chaoran Chen, Jiawen Li, Zhiping Zhang, Bingcan Guo, Ibrahim Khalilov, Simret Araya Gebreegziabher, Bingsheng Yao, Dakuo Wang, Yanfang Ye, Tianshi Li, Ziang Xiao, Yaxing Yao, Toby Jia-Jun Li, 12 Sep 2025, Dark Patterns Meet GUI Agents: LLM Agent Susceptibility to Manipulative Interfaces and the Role of Human Oversight, https://arxiv.org/abs/2509.10723
- Zongru Wu, Rui Mao, Zhiyuan Tian, Pengzhou Cheng, Tianjie Ju, Zheng Wu, Lingzhong Dong, Haiyue Sheng, Zhuosheng Zhang, Gongshen Liu, 17 Sep 2025, See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles, https://arxiv.org/abs/2509.13615