Open-source alternative to GPT-4V for building reliable GUI automation agents
OS-ATLAS, a foundational GUI action model, enables open-source GUI agents to match commercial VLM performance through cross-platform data synthesis.
The paper also releases the largest open-source cross-platform GUI grounding corpus to date, containing over 13 million GUI elements.
📚 https://arxiv.org/abs/2410.23218
🤖 Original Problem:
Existing GUI agents depend heavily on commercial Vision-Language Models (VLMs) such as GPT-4V. Open-source VLMs perform poorly at GUI grounding and in Out-Of-Distribution (OOD) scenarios, making them a weaker choice for real-world applications.
-----
🛠️ Solution in this Paper:
→ Created OS-ATLAS, a foundational GUI action model with three operating modes: Grounding, Action, and Agent
→ Built the first multi-platform GUI data synthesis toolkit, covering Windows, Linux, macOS, Android, and the web
→ Created largest open-source cross-platform GUI corpus (13M+ elements from 2.3M screenshots)
→ Implemented unified action space during training to resolve naming conflicts across platforms
→ Standardized Basic Actions (click, type, scroll) plus extensible Custom Actions for platform-specific operations (see the sketch after this list)
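
To make the unified action space concrete, here is a minimal Python sketch of the idea: platform-specific action names are normalized to a shared vocabulary of Basic Actions, with everything else kept as Custom Actions. The dataclass, alias table, and helper function below are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Shared vocabulary of Basic Actions used across all platforms.
BASIC_ACTIONS = {"click", "type", "scroll"}

# Hypothetical platform-specific synonyms mapped to canonical names,
# so the same behavior never appears under conflicting labels.
ALIASES = {
    "tap": "click",         # Android
    "left_click": "click",  # desktop
    "input_text": "type",
    "swipe": "scroll",
}

@dataclass
class Action:
    name: str                                   # canonical action name
    target: Optional[Tuple[int, int]] = None    # screen coordinates, if any
    text: Optional[str] = None                  # text payload for "type"
    custom: bool = False                        # True for platform-specific Custom Actions

def normalize(raw_name: str, **kwargs) -> Action:
    """Map a raw, platform-specific action name into the unified space."""
    name = ALIASES.get(raw_name, raw_name)
    if name in BASIC_ACTIONS:
        return Action(name=name, **kwargs)
    # Anything outside the shared vocabulary stays available as a Custom Action.
    return Action(name=name, custom=True, **kwargs)

# Example: an Android "tap" and a desktop "left_click" collapse to the same action.
assert normalize("tap", target=(120, 340)).name == normalize("left_click", target=(120, 340)).name
```
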
-----
💡 Key Insights:
→ Pre-training on comprehensive cross-platform GUI data significantly improves grounding accuracy
→ Unified action space prevents performance degradation from naming conflicts
→ Instruction grounding data, while valuable, isn't critical; referring expression data is sufficient
→ Web-only training doesn't generalize well to other platforms
-----
📊 Results:
→ Achieves 82.47% average grounding accuracy without a planner
→ Reaches 85.14% accuracy with GPT-4 as the planner
→ Outperforms previous SOTA across mobile, desktop and web platforms
→ Achieves a 14.63% success rate on the OSWorld benchmark (vs. a 9.21% baseline)
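
For context, GUI grounding accuracy on benchmarks like ScreenSpot is typically scored by checking whether the model's predicted click point falls inside the ground-truth element's bounding box. The sketch below illustrates that metric; the function name and data layout are illustrative assumptions, not the paper's evaluation code.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def grounding_accuracy(preds: List[Tuple[float, float]], boxes: List[Box]) -> float:
    """Fraction of predicted points that land inside the target element's box."""
    hits = 0
    for (x, y), (x0, y0, x1, y1) in zip(preds, boxes):
        if x0 <= x <= x1 and y0 <= y <= y1:
            hits += 1
    return hits / len(preds) if preds else 0.0

# Example: 2 of 3 predicted points fall inside their target boxes -> ~66.7% accuracy.
print(grounding_accuracy([(10, 10), (50, 50), (200, 5)],
                         [(0, 0, 20, 20), (40, 40, 60, 60), (0, 0, 20, 20)]))
```
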