AndroidLab, proposed in this paper, unifies training and testing of Android agents across different model types
Systematic benchmark for evaluating both open and closed-source Android agents
https://arxiv.org/abs/2410.24024
🤖 Original Problem:
Research on training and evaluating Android autonomous agents lacks systematic coverage of both open-source and closed-source models, and existing benchmarks are limited in scope and reproducibility.
-----
🛠️ Solution in this Paper:
→ The AndroidLab framework provides a unified environment that supports both LLMs and Large Multimodal Models (LMMs) with the same action space
→ Two operation modes: XML mode for text-only models and Set-of-Mark (SoM) mode for multimodal models (see the sketch after this list)
→ A benchmark with 138 tasks across 9 apps, using predefined Android virtual devices
→ Developed Android Instruction dataset with 10.5k traces and 94.3k steps for model training
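A minimal sketch of what a shared action space with two input modes could look like, assuming a simple textual action format; the `Action` fields and the `parse_model_output` helper are illustrative, not AndroidLab's actual interface:

```python
# Hypothetical sketch of a unified action space shared by XML (text-only)
# and SoM (multimodal) modes; field names and the action format are
# illustrative, not AndroidLab's actual interface.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Action:
    """One device operation an agent can issue in either mode."""
    kind: str                        # "tap", "type", "swipe", "back", "home", "finish"
    element_id: Optional[int] = None # XML mode: index into the UI-tree dump;
                                     # SoM mode: numeric mark drawn on the screenshot
    text: Optional[str] = None       # payload for "type"
    direction: Optional[str] = None  # payload for "swipe"


def parse_model_output(raw: str) -> Action:
    """Parse a model reply such as 'tap(5)' or 'type(3, "coffee")' into an Action."""
    name, _, args = raw.strip().partition("(")
    parts = args.rstrip(")").split(",") if args else []
    if name == "tap":
        return Action("tap", element_id=int(parts[0]))
    if name == "type":
        return Action("type", element_id=int(parts[0]), text=parts[1].strip().strip('"'))
    if name == "swipe":
        return Action("swipe", direction=parts[0].strip())
    return Action(name)  # "back", "home", "finish", ...


print(parse_model_output('type(3, "coffee shop")'))
```

Because both modes emit the same kind of action objects, a single executor driving the virtual device can serve text-only and multimodal agents alike.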
-----
💡 Key Insights:
→ Fine-tuning improves model performance far more than layering complex reasoning frameworks on top of off-the-shelf models
→ Standardized metrics such as Success Rate and Sub-Goal Success Rate enable consistent assessment across models and modes
→ Open-source models can approach closed-source performance through proper fine-tuning
→ Task completion is verified by matching against the UI tree structure, enabling precise assessment (see the sketch after this list)
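An illustrative sketch, under assumed formats, of how UI-tree matching and the two success metrics might be computed; `subgoal_reached`, `metrics`, and the attribute-matching rule are hypothetical stand-ins, not the paper's exact evaluator:

```python
# Illustrative evaluator sketch (not the paper's exact implementation):
# a sub-goal is checked by matching expected attribute values against the
# final UI-tree (XML) dump, and the two metrics aggregate sub-goal outcomes.
import xml.etree.ElementTree as ET


def subgoal_reached(ui_xml: str, expected: dict) -> bool:
    """True if some node in the UI tree carries all expected attribute values."""
    root = ET.fromstring(ui_xml)
    return any(
        all(node.attrib.get(k) == v for k, v in expected.items())
        for node in root.iter()
    )


def metrics(results: list[list[bool]]) -> tuple[float, float]:
    """Success Rate (every sub-goal of a task met) and Sub-Goal Success Rate."""
    sr = sum(all(task) for task in results) / len(results)
    sub_sr = sum(sum(task) for task in results) / sum(len(task) for task in results)
    return sr, sub_sr


dump = '<hierarchy><node text="Alarm" checked="true"/></hierarchy>'
print(subgoal_reached(dump, {"text": "Alarm", "checked": "true"}))  # True
print(metrics([[True, True], [True, False]]))                       # (0.5, 0.75)
```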
-----
📊 Results:
→ Fine-tuning lifted the average success rate of open-source LLMs from 4.59% to 21.50%
→ For open-source LMMs, the average success rate improved from 1.93% to 13.28%
→ GPT-4 achieved over 30% success rate in both XML and SoM modes
→ Best open-source models reached around 5% success rate before fine-tuning