This survey comprehensively maps how LLM-powered GUI agents automate human-computer interaction through graphical interfaces, and provides a unified framework for their development and evaluation.
https://arxiv.org/abs/2412.13501
🔍 Methods in this Paper:
→ The paper introduces a unified framework that categorizes GUI agents by their perception, reasoning, planning, and acting capabilities (see the sketch after this list).
→ It analyzes different perception interfaces, including accessibility-based, HTML/DOM-based, screen-visual-based, and hybrid approaches.
→ The framework covers both prompt-based and training-based methods for developing GUI agents.
→ It evaluates agents with comprehensive benchmarks spanning static datasets and interactive environments.
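To make the framework concrete, here is a minimal Python sketch of the perceive-reason-plan-act loop it describes. Every class, method, and parameter name below is a hypothetical illustration, not an API from the paper or any specific library:

```python
# Minimal sketch of the perceive -> reason/plan -> act loop the survey
# describes. All names here are hypothetical illustrations, not paper code.
from dataclasses import dataclass, field

@dataclass
class Observation:
    screenshot_png: bytes = b""                     # screen-visual perception
    a11y_tree: dict = field(default_factory=dict)   # accessibility-based perception
    dom_html: str = ""                              # HTML/DOM-based perception (web)

@dataclass
class Action:
    kind: str         # e.g. "click", "type", "scroll", "done"
    target: str = ""  # element id or screen coordinates
    text: str = ""    # payload for "type" actions

def build_prompt(task: str, obs: Observation) -> str:
    """Serialize the task plus a hybrid observation for a prompt-based agent."""
    return (
        f"Task: {task}\n"
        f"Accessibility tree: {obs.a11y_tree}\n"
        f"DOM snippet: {obs.dom_html[:2000]}\n"
        "Respond with the next GUI action."
    )

def run_episode(task: str, env, llm, max_steps: int = 20) -> bool:
    """Perceive, reason/plan, and act until the agent signals completion."""
    for _ in range(max_steps):
        obs = env.observe()                                # perception interface
        action = llm.next_action(build_prompt(task, obs))  # reasoning + planning
        if action.kind == "done":
            return True
        env.execute(action)                                # acting on the GUI
    return False
```

A training-based agent would swap the prompt construction for a fine-tuned policy, while keeping the same observation and action structure.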
-----
⚡ Key Insights:
→ GUI agents require a combination of visual understanding and action planning to effectively interact with interfaces
→ Hybrid perception approaches combining multiple interface types show the most promise for robust performance
→ Privacy and latency remain critical challenges for practical deployment
-----
📊 Results:
→ Current GUI agents achieve 51.1% accuracy on unseen websites
→ GPT-4V based trajectory evaluation shows 85.3% agreement with human judgments (see the sketch after this list)
→ Hybrid interfaces outperform single-mode approaches
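As a rough illustration of the GPT-4V-based trajectory evaluation mentioned above, a VLM-as-judge check might look like the sketch below; the `vlm.complete` call and the prompt wording are assumptions for illustration, not the survey's exact protocol:

```python
# Hypothetical VLM-as-judge trajectory evaluation; `vlm.complete` and the
# prompt wording are illustrative assumptions, not the survey's protocol.
def judge_trajectory(vlm, task: str, screenshots: list[bytes]) -> bool:
    """Ask a vision-language model whether an agent's recorded GUI
    trajectory (ordered screenshots) completed the given task."""
    prompt = (
        f"Task: {task}\n"
        "The attached images are ordered screenshots of an agent "
        "attempting this task.\n"
        "Reply with exactly SUCCESS or FAILURE."
    )
    verdict = vlm.complete(prompt=prompt, images=screenshots)
    return verdict.strip().upper().startswith("SUCCESS")
```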