"OS-ATLAS: A Foundation Action Model for Generalist GUI Agents"

The podcast on this paper was generated with Google's Illuminate.

Open-source alternative to GPT-4V for building reliable GUI automation agents

OS-ATLAS, a foundational GUI action model, enables open-source GUI agents to match commercial VLM performance through cross-platform data synthesis.

The authors release the largest open-source cross-platform GUI grounding corpus to date, containing over 13 million GUI elements.

📚 https://arxiv.org/abs/2410.23218

🤖 Original Problem:

Existing GUI agents depend heavily on commercial Vision-Language Models (VLMs) such as GPT-4V. Open-source VLMs perform poorly at GUI grounding and in out-of-distribution (OOD) scenarios, making them a poor fit for real-world applications.
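
To make the grounding task concrete, here is a minimal sketch with hypothetical names (this is not OS-ATLAS's actual interface): a grounding model takes a screenshot plus a natural-language reference and must return the coordinates of the matching UI element, which an agent can then act on.

```python
# Minimal sketch of the GUI grounding task. All names and the output
# schema here are hypothetical; OS-ATLAS's real interface may differ.

from dataclasses import dataclass

@dataclass
class GroundedElement:
    # (x1, y1, x2, y2) of the element's bounding box, in screen pixels.
    box: tuple[int, int, int, int]

    def center(self) -> tuple[int, int]:
        x1, y1, x2, y2 = self.box
        return (x1 + x2) // 2, (y1 + y2) // 2

def ground(screenshot_png: bytes, reference: str) -> GroundedElement:
    """Map a reference like "the Save button in the toolbar" to the
    on-screen element it denotes. An agent built on this primitive
    would then click at the center of the returned box."""
    raise NotImplementedError  # stands in for a VLM forward pass
```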

-----

🛠️ Solution in this Paper:

→ Created OS-ATLAS, a foundational GUI action model with three operating modes: Grounding, Action, and Agent

→ Built the first multi-platform GUI data synthesis toolkit, covering Windows, Linux, macOS, Android, and the web

→ Created largest open-source cross-platform GUI corpus (13M+ elements from 2.3M screenshots)

→ Implemented a unified action space during training to resolve naming conflicts across platforms

→ Standardized Basic Actions (click, type, scroll) and Custom Actions for extensibility (sketched below)
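
As a rough illustration of such a unified action space, here is a short Python sketch; the schema and helper names are assumptions for illustration, not the paper's exact format:

```python
# Illustrative unified action space: one shared vocabulary of Basic
# Actions across platforms, plus an escape hatch for Custom Actions.

from dataclasses import dataclass, field

@dataclass
class Action:
    name: str                       # e.g. "CLICK", "TYPE", "SCROLL"
    args: dict = field(default_factory=dict)

# Basic Actions: identical name and semantics on every platform, so
# "tap" (mobile) and "left_click" (desktop) no longer collide.
def click(x: int, y: int) -> Action:
    return Action("CLICK", {"x": x, "y": y})

def type_text(text: str) -> Action:
    return Action("TYPE", {"text": text})

def scroll(direction: str, amount: int) -> Action:
    return Action("SCROLL", {"direction": direction, "amount": amount})

# Custom Actions: platform- or task-specific extensions declared per
# environment rather than baked into the shared vocabulary.
def custom(name: str, **kwargs) -> Action:
    return Action(name, kwargs)

# Example episode: ground an element, act on it, then use an extension.
plan = [click(890, 1066), type_text("hello"), custom("OPEN_APP", app="Mail")]
```

Training every platform against one shared schema like this is what avoids the naming-conflict degradation noted in the Key Insights below.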

-----

💡 Key Insights:

→ Pre-training on comprehensive cross-platform GUI data significantly improves grounding accuracy

→ Unified action space prevents performance degradation from naming conflicts

→ Instruction grounding data, while valuable, isn't critical: referring-expression data alone is sufficient (see the example after this list)

→ Web-only training doesn't generalize well to other platforms
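
To illustrate the distinction in that third insight, here are two hypothetical training samples in an assumed JSON-like schema (the paper's actual data format may differ): referring-expression data describes the element itself, while instruction grounding data states a user goal the model must map to an element.

```python
# Referring-expression grounding: the text literally describes the
# target element's appearance and position.
referring_sample = {
    "image": "screenshot_001.png",
    "text": "the blue 'Submit' button at the bottom right",
    "box": [812, 1040, 968, 1092],   # pixel coordinates of the element
}

# Instruction grounding: the text states a goal; the model must infer
# which element fulfills it.
instruction_sample = {
    "image": "screenshot_001.png",
    "text": "send the completed form",
    "box": [812, 1040, 968, 1092],
}
```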

-----

📊 Results:

→ Achieves 82.47% average grounding accuracy without a planner

→ Reaches 85.14% accuracy with GPT-4 as the planner

→ Outperforms previous SOTA across mobile, desktop, and web platforms

→ Shows a 14.63% success rate on the OSWorld benchmark (vs. the 9.21% baseline)