A3 (Android Agent Arena) is a new evaluation platform for testing mobile GUI agents on real-world apps, offering automated evaluation and broad action-space compatibility.
https://arxiv.org/abs/2501.01149
🎯 Original Problem:
Existing platforms for evaluating mobile GUI agents rely on static-frame tests or a narrow set of app scenarios, so they fail to assess real-world performance effectively.
💡 Solution in this Paper:
→ A3 integrates 21 popular third-party apps and provides 201 real-world tasks spanning operations and information queries
→ The platform implements a flexible action space compatible with any dataset's annotation style (first sketch after this list)
→ Tasks are categorized into operation, single-frame query, and multi-frame query types
→ A3 introduces both task-specific evaluation functions and an LLM-based evaluation system
→ The automated evaluation process cross-validates judgments from GPT-4 and Gemini 1.5 Pro (second sketch after this list)
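A minimal sketch of what such a dataset-agnostic action layer could look like; the class, the AitW-style converter, and the field names are illustrative assumptions, not A3's actual API:

```python
# Minimal sketch of a dataset-agnostic action layer, assuming A3-style unified
# actions. Class and converter names are illustrative, not A3's actual API.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class UnifiedAction:
    """One canonical action, regardless of which dataset's style the agent emits."""
    action_type: str                    # e.g. "click", "type", "swipe", "navigate_back"
    args: dict[str, Any] = field(default_factory=dict)

def from_aitw(record: dict) -> UnifiedAction:
    """Translate an AitW-style annotation into the unified space (hypothetical mapping)."""
    if record["action_type"] == "dual_point":  # AitW encodes taps as touch/lift point pairs
        return UnifiedAction("click", {"x": record["touch"][0], "y": record["touch"][1]})
    return UnifiedAction(record["action_type"], record.get("args", {}))

# An agent trained on one annotation style can then be scored in the same
# harness as agents trained on any other style:
print(from_aitw({"action_type": "dual_point", "touch": [0.42, 0.31], "lift": [0.42, 0.31]}))
```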
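And a minimal sketch of the cross-validated judging step, assuming each judge model returns a binary verdict; `ask_gpt4` and `ask_gemini` are hypothetical API wrappers, not functions from the A3 codebase:

```python
# Minimal sketch of cross-validated LLM evaluation, assuming each judge model
# answers YES/NO. The API-wrapper callables are supplied by the caller.
from typing import Callable, Optional

def judge(task: str, trajectory: str, ask_model: Callable[[str], str]) -> bool:
    """Ask one LLM judge whether the recorded trajectory completed the task."""
    prompt = (
        f"Task: {task}\n"
        f"Agent trajectory:\n{trajectory}\n"
        "Did the agent complete the task? Answer YES or NO."
    )
    return ask_model(prompt).strip().upper().startswith("YES")

def cross_validated_eval(task: str, trajectory: str,
                         ask_gpt4: Callable[[str], str],
                         ask_gemini: Callable[[str], str]) -> Optional[bool]:
    """Accept a verdict only when both judges agree; otherwise defer to review."""
    v1 = judge(task, trajectory, ask_gpt4)
    v2 = judge(task, trajectory, ask_gemini)
    return v1 if v1 == v2 else None  # None = disagreement, flag for manual check
```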
🔑 Key Insights:
→ Static-frame evaluations don't reflect real-world challenges
→ Action history significantly impacts agent performance
→ Self-correction capability is crucial for real-world scenarios
📊 Results:
→ LLM evaluation achieves ~80% accuracy
→ Cross-validation reduces misjudgment to ~3%
→ AppAgent outperforms both fine-tuned models and GPT-4
→ Success rates: 30.8% (easy tasks), 7% (medium tasks), 2% (hard tasks)