UI-TARS unlocks true GUI automation through screen perception and action.
UI-TARS, a native GUI agent model, automates GUI interaction by perceiving screenshots directly and performing actions end to end. It addresses the limitations of existing agent frameworks that depend on complex, hand-crafted workflows and commercial LLMs.
https://arxiv.org/abs/2501.12326
Original Problem 🤔:
→ Current GUI agents struggle with platform inconsistencies and scalability due to dependence on textual representations like HTML.
→ Agent frameworks, while flexible, rely on handcrafted, expert-defined rules and prompts that hinder scalability and adaptability to evolving interfaces.
→ Native agent models, though conceptually advantageous, are limited by a scarcity of comprehensive training data.
Solution in this Paper 💡:
→ UI-TARS leverages a large-scale, multi-task dataset of GUI screenshots with rich metadata for enhanced perception.
→ It introduces a unified action space, standardizing actions across platforms and improving grounding through large-scale action traces (a small sketch follows this list).
→ It incorporates System-2 reasoning to support deliberate decision-making, integrating GUI knowledge and diverse reasoning patterns.
→ It addresses the data bottleneck through iterative training with reflective online traces, enabling continuous learning and adaptation with minimal human intervention.
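To make the unified action space concrete, here is a minimal Python sketch. The action names, fields, and example coordinates are illustrative assumptions, not the paper's exact schema.

```python
# Minimal sketch of a cross-platform unified action space.
# Action names and fields are illustrative, not UI-TARS's exact schema.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Action:
    """One atomic GUI action, grounded directly in screen coordinates."""
    name: str                                 # "click", "type", "scroll", "hotkey", "finished", ...
    point: Optional[Tuple[int, int]] = None   # target pixel for pointer actions
    text: Optional[str] = None                # content for "type" / keys for "hotkey"

# Atomic actions shared across desktop, web, and mobile.
def click(x, y):  return Action("click", point=(x, y))
def type_text(s): return Action("type", text=s)
def hotkey(keys): return Action("hotkey", text=keys)
FINISHED = Action("finished")

# A compositional multi-step trace: focus a search box, enter a query, submit.
trace = [click(640, 32), type_text("quarterly report"), hotkey("enter"), FINISHED]
```

Because actions are grounded in pixel coordinates rather than HTML or accessibility trees, the same schema applies to any platform that can render a screenshot.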
Key Insights from this Paper 🤔:
→ UI-TARS uses a unified action space with atomic and compositional actions, allowing it to operate across different platforms and execute complex multi-step tasks.
→ A large-scale GUI screenshot dataset is used for training perception, covering diverse tasks like element description, captioning, and QA.
→ The model integrates System-2 reasoning capabilities by incorporating a dataset of 6M GUI tutorials and augmenting action traces with explicit reasoning patterns.
→ Iterative training with reflective online traces further enhances the agent's ability to learn from mistakes and adapt to unforeseen situations, by fine-tuning on corrected and post-reflection trajectories (sketched below).
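The iterative, reflective training loop can be pictured roughly as follows. `rollout`, `filter_traces`, `annotate_reflection`, and `fine_tune` are hypothetical placeholders standing in for the paper's data pipeline, not a real API.

```python
# Rough sketch of iterative training on reflective online traces.
# rollout, filter_traces, annotate_reflection, and fine_tune are hypothetical
# placeholders for the paper's data pipeline, not a real API.

def iterative_training(model, tasks, rounds=3):
    for _ in range(rounds):
        # 1. Run the current model in live environments to collect fresh traces.
        traces = [rollout(model, task) for task in tasks]

        # 2. Keep high-quality trajectories, and pair failed steps with
        #    corrected, post-reflection continuations so errors become signal.
        clean = filter_traces(traces)
        reflective = [annotate_reflection(t) for t in traces if t.has_error]

        # 3. Fine-tune on both and repeat with the improved model,
        #    needing minimal human intervention per round.
        model = fine_tune(model, clean + reflective)
    return model
```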
Results 🚀:
→ UI-TARS achieves SOTA results across 10+ GUI agent benchmarks, surpassing GPT-4o and Claude Computer Use.
→ In OSWorld, UI-TARS-72B reaches 24.6 with 50 steps and 22.7 with 15 steps, exceeding Claude's 22.0 and 14.9.
→ UI-TARS achieves 46.6 on AndroidWorld, outperforming GPT-4o's 34.5.