Aussie AI
LLM GUI Agents
-
Last Updated 1 April, 2025
-
by David Spuler, Ph.D.
GUI agents are LLM-based agents that can read and/or manipulate a graphical user interface (GUI). The first GUI agents were read-only, but advanced GUI agents can now not only read the screen but also click the mouse or send keystrokes. This means the LLM can launch and control any app on a PC or phone.
Early GUI agents were read-only, examining what was on the screen as context for a user's query. Knowing which app or window the user was viewing when issuing a query provided useful context. There were two methods of examining the screen's display:
- Image-based (i.e., using a screen snapshot)
- Internal hierarchy analysis (i.e., examining the internal representation of windows)
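The hierarchy-based approach can be illustrated with a minimal sketch. The `UINode` type below is a hypothetical stand-in for a platform accessibility API (e.g., a window's element tree); the idea is simply to flatten the internal representation of windows into text that fits in an LLM's context window.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class UINode:
    """Hypothetical stand-in for one element in a window's internal UI hierarchy."""
    role: str
    name: str
    children: List["UINode"] = field(default_factory=list)

def flatten_hierarchy(node: UINode, depth: int = 0) -> str:
    """Serialize the UI tree as indented text suitable for an LLM prompt."""
    lines = [f"{'  ' * depth}{node.role}: {node.name}"]
    for child in node.children:
        lines.append(flatten_hierarchy(child, depth + 1))
    return "\n".join(lines)

# Example: a toy mail window with a button and an inbox list.
window = UINode("window", "Mail", [
    UINode("button", "Compose"),
    UINode("list", "Inbox", [UINode("listitem", "Meeting at 3pm")]),
])
print(flatten_hierarchy(window))
```

The image-based alternative would instead pass a screen snapshot to a multimodal model; the hierarchy approach trades visual fidelity for a compact, structured text representation.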
More recently, advanced GUI agents have been released that also have full control of the input devices, such as moving the mouse, clicking, and entering keystrokes. These advanced "computer use" agents are theoretically capable of doing anything that a human user can do, but automated via an LLM.
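The control loop of such a computer-use agent can be sketched as an observe-decide-act cycle. The sketch below is purely illustrative: `fake_llm_policy` is a hard-coded stand-in for a real multimodal LLM call, and `FakeDesktop` simulates the screen and input devices rather than driving an actual OS.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str      # "click", "type", or "done"
    target: str = ""
    text: str = ""

def fake_llm_policy(screen: str, goal: str) -> Action:
    """Hypothetical stand-in for an LLM that maps a screen observation to an action."""
    if "Search box (empty)" in screen:
        return Action("click", target="Search box")
    if "Search box (focused)" in screen:
        return Action("type", target="Search box", text=goal)
    return Action("done")

class FakeDesktop:
    """Simulated GUI: tracks a single search-box state and logs input events."""
    def __init__(self):
        self.state = "Search box (empty)"
        self.log = []
    def screenshot(self) -> str:
        return self.state
    def click(self, target: str):
        self.log.append(("click", target))
        self.state = "Search box (focused)"
    def type(self, target: str, text: str):
        self.log.append(("type", text))
        self.state = f"Search box contains '{text}'"

def run_agent(desktop: FakeDesktop, goal: str, max_steps: int = 5):
    """Observe-decide-act loop: screenshot -> LLM policy -> input action."""
    for _ in range(max_steps):
        action = fake_llm_policy(desktop.screenshot(), goal)
        if action.kind == "done":
            break
        elif action.kind == "click":
            desktop.click(action.target)
        elif action.kind == "type":
            desktop.type(action.target, action.text)
    return desktop.log

log = run_agent(FakeDesktop(), "GUI agents survey")
# The agent clicks the search box, then types the goal text.
```

A production agent would replace `fake_llm_policy` with a multimodal model call and `FakeDesktop` with real screenshot capture and OS-level mouse/keyboard injection, plus guardrails such as step limits and user confirmation for risky actions.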
Survey Papers on GUI Agents
Recent survey papers on computer use and GUI agents:
- Chaoyun Zhang, Shilin He, Jiaxu Qian, Bowen Li, Liqun Li, Si Qin, Yu Kang, Minghua Ma, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang, 27 Nov 2024, Large Language Model-Brained GUI Agents: A Survey, https://arxiv.org/abs/2411.18279
- Shuai Wang, Weiwen Liu, Jingxuan Chen, Weinan Gan, Xingshan Zeng, Shuai Yu, Xinlong Hao, Kun Shao, Yasheng Wang, Ruiming Tang, 7 Nov 2024, GUI Agents with Foundation Models: A Comprehensive Survey, https://arxiv.org/abs/2411.04890
- Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, Xintong Li, Jing Shi, Hongjie Chen, Viet Dac Lai, Zhouhang Xie, Sungchul Kim, Ruiyi Zhang, Tong Yu, Mehrab Tanjim, Nesreen K. Ahmed, Puneet Mathur, Seunghyun Yoon, Lina Yao, Branislav Kveton, Thien Huu Nguyen, Trung Bui, Tianyi Zhou, Ryan A. Rossi, Franck Dernoncourt, 18 Dec 2024, GUI Agents: A Survey, https://arxiv.org/abs/2412.13501
- Pascal J. Sager, Benjamin Meyer, Peng Yan, Rebekka von Wartburg-Kottler, Layan Etaiwi, Aref Enayati, Gabriel Nobel, Ahmed Abdulkadir, Benjamin F. Grewe, Thilo Stadelmann, 27 Jan 2025, AI Agents for Computer Use: A Review of Instruction-based Computer Control, GUI Automation, and Operator Assistants, https://arxiv.org/abs/2501.16150
Research on LLM GUI Agents
Research papers on GUI agents:
- Anthropic, 23 Oct 2024, Developing a computer use model, https://www.anthropic.com/news/developing-computer-use
- Anthropic, 23 Oct 2024, Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku, https://www.anthropic.com/news/3-5-models-and-computer-use
- Anirban Ghoshal, 23 Oct 2024, How Anthropic’s new ‘computer use’ ability could further AI automation, https://www.cio.com/article/3583260/how-anthropics-new-computer-use-ability-could-further-ai-automation.html
- Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, Yu Qiao, 30 Oct 2024, OS-ATLAS: A Foundation Action Model for Generalist GUI Agents, https://arxiv.org/abs/2410.23218 https://github.com/OS-Copilot/OS-Atlas
- Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, Zack Hui, 13 Sep 2024 (v2), Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale, https://arxiv.org/abs/2409.08264
- Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, Jiadai Sun, Jiaqi Wang, Junjie Gao, Junjun Shan, Kangning Liu, Shudan Zhang, Shuntian Yao, Siyi Cheng, Wentao Yao, Wenyi Zhao, Xinghan Liu, Xinyi Liu, Xinying Chen, Xinyue Yang, Yang Yang, Yifan Xu, Yu Yang, Yujia Wang, Yulin Xu, Zehan Qi, Yuxiao Dong, Jie Tang, 28 Oct 2024, AutoGLM: Autonomous Foundation Agents for GUIs, https://arxiv.org/abs/2411.00820
- Shuai Wang, Weiwen Liu, Jingxuan Chen, Weinan Gan, Xingshan Zeng, Shuai Yu, Xinlong Hao, Kun Shao, Yasheng Wang, Ruiming Tang, 7 Nov 2024, GUI Agents with Foundation Models: A Comprehensive Survey, https://arxiv.org/abs/2411.04890
- Shirin Ghaffary and Rachel Metz November 14, 2024, OpenAI Nears Launch of AI Agent Tool to Automate Tasks for Users. The new software, codenamed “Operator,” is set to be released in January. https://www.bloomberg.com/news/articles/2024-11-13/openai-nears-launch-of-ai-agents-to-automate-tasks-for-users
- Siyuan Hu, Mingyu Ouyang, Difei Gao, Mike Zheng Shou, 15 Nov 2024, The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use, https://arxiv.org/abs/2411.10323 https://github.com/showlab/computer_use_ootb
- Mike Elgan, 22 Nov 2024, AI agents are unlike any technology ever, https://www.computerworld.com/article/3608973/ai-agents-are-unlike-any-technology-ever.html
- Show Lab, Nov 2024, ShowUI: ShowUI is a lightweight (2B) vision-language-action model designed for GUI agents. https://huggingface.co/showlab/ShowUI-2B
- Chaoyun Zhang, Shilin He, Jiaxu Qian, Bowen Li, Liqun Li, Si Qin, Yu Kang, Minghua Ma, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang, 27 Nov 2024, Large Language Model-Brained GUI Agents: A Survey, https://arxiv.org/abs/2411.18279
- Zhuosheng Zhang, Aston Zhang, 7 Jun 2024 (v4), You Only Look at Screens: Multimodal Chain-of-Action Agents, https://arxiv.org/abs/2309.11436
- Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, Yu Su, 7 Oct 2024, Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents, https://arxiv.org/abs/2410.05243
- Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, Zhiyong Wu, 23 Feb 2024 (v2), SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents, https://arxiv.org/abs/2401.10935
- Michael Nuñez, November 29, 2024, AI that clicks for you: Microsoft’s research points to the future of GUI automation, https://venturebeat.com/ai/ai-that-clicks-for-you-microsoft-research-points-to-the-future-of-gui-automation/
- Yiqin Wang, Haoji Zhang, Jingqi Tian, Yansong Tang, 2 Dec 2024, Ponder & Press: Advancing Visual GUI Agent towards General Computer Control, https://arxiv.org/abs/2412.01268 https://invinciblewyq.github.io/ponder-press-page/
- Kyle Wiggers, December 5, 2024, Copilot Vision, Microsoft’s AI tool that can read your screen, launches in preview, https://techcrunch.com/2024/12/05/copilot-vision-microsofts-ai-tool-that-can-read-your-screen-launches-in-preview/
- Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, Jie Tang, 2024, CogAgent: A Visual Language Model for GUI Agents, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 14281-14290, https://openaccess.thecvf.com/content/CVPR2024/html/Hong_CogAgent_A_Visual_Language_Model_for_GUI_Agents_CVPR_2024_paper.html
- Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan, 8 Apr 2024, Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs, https://arxiv.org/abs/2404.05719
- Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, Yiqiang Li, Tianshuo Zhou, Yue Yu, Chujie Gao, Qihui Zhang, Yi Gui, Zhen Li, Yao Wan, Pan Zhou, Jianfeng Gao, Lichao Sun, 16 Jun 2024, GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents, https://arxiv.org/abs/2406.10819
- Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, Yuan Yao, Yankai Lin, Zhiyuan Liu, Maosong Sun, 17 Jun 2024, GUICourse: From General Vision Language Models to Versatile GUI Agents, https://arxiv.org/abs/2406.11317 https://github.com/yiye3/GUICourse
- Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong, 5 Dec 2024, Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction, https://arxiv.org/abs/2412.04454 https://aguvis-project.github.io/
- AskUI, Dec 2024, AskUI Vision Agent: Automate computer tasks in Python, https://github.com/askui/vision-agent
- Huawen Shen, Chang Liu, Gengluo Li, Xinlong Wang, Yu Zhou, Can Ma, Xiangyang Ji, 12 Dec 2024, Falcon-UI: Understanding GUI Before Following User Instructions, https://arxiv.org/abs/2412.09362
- Google, Dec 2024, Project Mariner: A research prototype exploring the future of human-agent interaction, starting with your browser, https://deepmind.google/technologies/project-mariner/
- Lu Wang, Fangkai Yang, Chaoyun Zhang, Junting Lu, Jiaxu Qian, Shilin He, Pu Zhao, Bo Qiao, Ray Huang, Si Qin, Qisheng Su, Jiayi Ye, Yudi Zhang, Jian-Guang Lou, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang, 13 Dec 2024, Large Action Models: From Inception to Implementation, https://arxiv.org/abs/2412.10047 https://github.com/microsoft/UFO/tree/main/dataflow https://microsoft.github.io/UFO/dataflow/overview/
- Zhiqi Ge, Juncheng Li, Xinglei Pang, Minghe Gao, Kaihang Pan, Wang Lin, Hao Fei, Wenqiao Zhang, Siliang Tang, Yueting Zhuang, 13 Dec 2024, Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining, https://arxiv.org/abs/2412.10342
- Jiarun Liu, Jia Hao, Chunhong Zhang, Zheng Hu, 14 Dec 2024, WEPO: Web Element Preference Optimization for LLM-based Web Navigation, https://arxiv.org/abs/2412.10742
- Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, Xintong Li, Jing Shi, Hongjie Chen, Viet Dac Lai, Zhouhang Xie, Sungchul Kim, Ruiyi Zhang, Tong Yu, Mehrab Tanjim, Nesreen K. Ahmed, Puneet Mathur, Seunghyun Yoon, Lina Yao, Branislav Kveton, Thien Huu Nguyen, Trung Bui, Tianyi Zhou, Ryan A. Rossi, Franck Dernoncourt, 18 Dec 2024, GUI Agents: A Survey, https://arxiv.org/abs/2412.13501
- Diego Rivas, D Shin, 19 Dec 2024, User Interface for Efficient Control of Autonomous Agent Tasks, https://www.tdcommons.org/cgi/viewcontent.cgi?article=8835&context=dpubs_series
- Hao Wen, Shizuo Tian, Borislav Pavlov, Wenjie Du, Yixuan Li, Ge Chang, Shanhui Zhao, Jiacheng Liu, Yunxin Liu, Ya-Qin Zhang, Yuanchun Li, 24 Dec 2024, AutoDroid-V2: Boosting SLM-based GUI Agents via Code Generation, https://arxiv.org/abs/2412.18116
- Kangjia Zhao, Jiahui Song, Leigang Sha, Haozhan Shen, Zhi Chen, Tiancheng Zhao, Xiubo Liang, Jianwei Yin, 24 Dec 2024, GUI Testing Arena: A Unified Benchmark for Advancing Autonomous GUI Testing Agent, https://arxiv.org/abs/2412.18426 https://github.com/ZJU-ACES-ISE/ChatUITest
- Xueyu Hu, Tao Xiong, Biao Yi, Zishu Wei, Ruixuan Xiao, Yurun Chen, Jiasheng Ye, Meiling Tao, Xiangxin Zhou, Ziyu Zhao, Yuhuai Li, Shengze Xu, Shawn Wang, Xinchen Xu, Shuofei Qiao, Kun Kuang, Tieyong Zeng, Liang Wang, Jiwei Li, Yuchen Eleanor Jiang, Wangchunshu Zhou, Guoyin Wang, Keting Yin, Zhou Zhao, Hongxia Yang, Fan Wu, Shengyu Zhang, Fei Wu, Dec 2024, OS Agents: A Survey on MLLM-Based Agents for General Computing Devices Use, https://www.preprints.org/manuscript/202412.2294/v1
- Gautier Dagan, Frank Keller, Alex Lascarides, 30 Dec 2024, Plancraft: an evaluation dataset for planning with LLM agents, https://arxiv.org/abs/2412.21033
- Yuxiang Chai, Hanhao Li, Jiayu Zhang, Liang Liu, Guozhi Wang, Shuai Ren, Siyuan Huang, Hongsheng Li, 2 Jan 2025, A3: Android Agent Arena for Mobile GUI Agents, https://arxiv.org/abs/2501.01149 https://yuxiangchai.github.io/Android-Agent-Arena/
- Dezhi Ran, Mengzhou Wu, Hao Yu, Yuetong Li, Jun Ren, Yuan Cao, Xia Zeng, Haochuan Lu, Zexin Xu, Mengqian Xu, Ting Su, Liangchao Yao, Ting Xiong, Wei Yang, Yuetang Deng, Assaf Marron, David Harel, Tao Xie, 6 Jan 2025, Beyond Pass or Fail: A Multi-dimensional Benchmark for Mobile UI Navigation, https://arxiv.org/abs/2501.02863
- Yuhang Liu, Pengxiang Li, Zishu Wei, Congkai Xie, Xueyu Hu, Xinchen Xu, Shengyu Zhang, Xiaotian Han, Hongxia Yang, Fei Wu, 8 Jan 2025, InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection, https://arxiv.org/abs/2501.04575
- Taryn Plumb, January 22, 2025, ByteDance’s UI-TARS can take over your computer, outperforms GPT-4o and Claude, https://venturebeat.com/ai/bytedances-ui-tars-can-take-over-your-computer-outperforms-gpt-4o-and-claude/
- Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, Guang Shi, 21 Jan 2025, UI-TARS: Pioneering Automated GUI Interaction with Native Agents, https://arxiv.org/abs/2501.12326
- Maxwell Zeff, January 23, 2025, OpenAI launches Operator, an AI agent that performs tasks autonomously, https://techcrunch.com/2025/01/23/openai-launches-operator-an-ai-agent-that-performs-tasks-autonomously/
- Pascal J. Sager, Benjamin Meyer, Peng Yan, Rebekka von Wartburg-Kottler, Layan Etaiwi, Aref Enayati, Gabriel Nobel, Ahmed Abdulkadir, Benjamin F. Grewe, Thilo Stadelmann, 27 Jan 2025, AI Agents for Computer Use: A Review of Instruction-based Computer Control, GUI Automation, and Operator Assistants, https://arxiv.org/abs/2501.16150
- Kyle Wiggers, January 27, 2025, Alibaba’s Qwen team releases AI models that can control PCs and phones, https://techcrunch.com/2025/01/27/alibabas-qwen-team-releases-ai-models-that-can-control-pcs-and-phones/
- Tian Huang, Chun Yu, Weinan Shi, Zijian Peng, David Yang, Weiqi Sun, and Yuanchun Shi. 2025. Prompt2Task: Automating UI Tasks on Smartphones from Textual Prompts. ACM Trans. Comput.-Hum. Interact. Just Accepted (February 2025). https://doi.org/10.1145/3716132 https://dl.acm.org/doi/abs/10.1145/3716132
- Qinzhuo Wu, Wei Liu, Jian Luan, Bin Wang, 5 Feb 2025, ReachAgent: Enhancing Mobile Agent via Page Reaching and Operation, https://arxiv.org/abs/2502.02955
- Kunal Singh, Shreyas Singh, Mukund Khanna, 12 Feb 2025, TRISHUL: Towards Region Identification and Screen Hierarchy Understanding for Large VLM based GUI Agents, https://arxiv.org/abs/2502.08226
- Matt Marshall, February 22, 2025, The rise of browser-use agents: Why Convergence’s Proxy is beating OpenAI’s Operator, https://venturebeat.com/ai/the-rise-of-browser-use-agents-why-convergences-proxy-is-beating-openais-operator/
- Frank Landymore, Jan 25, 2025, OpenAI's Agent Has a Problem: Before It Does Anything Important, You Have to Double-Check It Hasn't Screwed Up: Not as hands-off as you might hope, https://futurism.com/openai-asks-permission-important
- Jiani Zheng, Lu Wang, Fangkai Yang, Chaoyun Zhang, Lingrui Mei, Wenjie Yin, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang, 26 Feb 2025, VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model, https://arxiv.org/abs/2502.18906
- Wenjia Jiang, Yangyang Zhuang, Chenxi Song, Xu Yang, Chi Zhang, 4 Mar 2025, AppAgentX: Evolving GUI Agents as Proficient Smartphone Users, https://arxiv.org/abs/2503.02268
- Zongru Wu, Pengzhou Cheng, Zheng Wu, Tianjie Ju, Zhuosheng Zhang, Gongshen Liu, 4 Mar 2025 (v2), Smoothing Grounding and Reasoning for MLLM-Powered GUI Agents with Query-Oriented Pivot Tasks, https://arxiv.org/abs/2503.00401 https://github.com/ZrW00/GUIPivot
- Yuqi Zhou, Shuai Wang, Sunhao Dai, Qinglin Jia, Zhaocheng Du, Zhenhua Dong, Jun Xu, 5 Mar 2025, CHOP: Mobile Operating Assistant with Constrained High-frequency Optimized Subtask Planning, https://arxiv.org/abs/2503.03743
- Asif Razzaq, March 8, 2025, Meet Manus: A New AI Agent from China with Deep Research + Operator + Computer Use + Lovable + Memory, https://www.marktechpost.com/2025/03/08/meet-manus-a-new-ai-agent-from-china-with-deep-research-operator-computer-use-lovable-memory/
- Kyle Wiggers, March 12, 2025, Browser Use, one of the tools powering Manus, is also going viral, https://techcrunch.com/2025/03/12/browser-use-one-of-the-tools-powering-manus-is-also-going-viral/
- Muzhi Zhu, Yuzhuo Tian, Hao Chen, Chunluan Zhou, Qingpei Guo, Yang Liu, Ming Yang, Chunhua Shen, 11 Mar 2025, SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories, https://arxiv.org/abs/2503.08625
- Di Zhao, Longhui Ma, Siwei Wang, Miao Wang, Zhao Lv, 12 Mar 2025, COLA: A Scalable Multi-Agent Framework For Windows UI Task Automation, https://arxiv.org/abs/2503.09263
- Chaoyun Zhang, Shilin He, Liqun Li, Si Qin, Yu Kang, Qingwei Lin, Dongmei Zhang, 14 Mar 2025, API Agents vs. GUI Agents: Divergence and Convergence, https://arxiv.org/abs/2503.11069
- Yibin Xu, Liang Yang, Hao Chen, Hua Wang, Zhi Chen, Yaohua Tang, 14 Mar 2025, DeskVision: Large Scale Desktop Region Captioning for Advanced GUI Agents, https://arxiv.org/abs/2503.11170
- Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Guanjing Xiong, Hongsheng Li, 30 Mar 2025 (v2), UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning, https://arxiv.org/abs/2503.21620
- Gaole Dai, Shiqi Jiang, Ting Cao, Yuanchun Li, Yuqing Yang, Rui Tan, Mo Li, Lili Qiu, 21 Mar 2025 (v2), Advancing Mobile GUI Agents: A Verifier-Driven Approach to Practical Deployment, https://arxiv.org/abs/2503.15937