"AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents"

The podcast on this paper is generated with Google's Illuminate.

AndroidLab, proposed in this paper, unifies training and testing of Android agents across different model types

Systematic benchmark for evaluating both open and closed-source Android agents

https://arxiv.org/abs/2410.24024

🤖 Original Problem:

Research on training and evaluating Android autonomous agents lacks a systematic treatment that covers both open-source and closed-source models, and existing benchmarks are limited in scope and reproducibility.

-----

🛠️ Solution in this Paper:

→ AndroidLab framework provides a unified environment supporting both LLMs and Large Multimodal Models (LMMs) with identical action spaces

→ Two operation modes: XML mode for text-only models and Set-of-Mark (SoM) mode for multimodal models (see the sketch after this list)

→ A benchmark with 138 tasks across 9 apps, using predefined Android virtual devices

→ Developed Android Instruction dataset with 10.5k traces and 94.3k steps for model training
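
To make the two operation modes concrete, here is a minimal sketch of how a shared action space can be paired with mode-specific observations. The class and function names below are illustrative assumptions, not AndroidLab's actual API.

```python
# Illustrative sketch only: names and structures are assumptions, not AndroidLab's API.
from dataclasses import dataclass
from typing import Union


@dataclass
class Tap:
    element_id: int           # index of the target UI element


@dataclass
class Swipe:
    direction: str            # "up", "down", "left", or "right"


@dataclass
class TypeText:
    element_id: int
    text: str


# The same action space is exposed to text-only LLMs and multimodal LMMs.
Action = Union[Tap, Swipe, TypeText]


def build_observation(ui_xml: str, screenshot_png: bytes, mode: str) -> dict:
    """Return the observation for the chosen mode.

    "xml": text-only models see a dump of the UI tree.
    "som": multimodal models see a screenshot with numbered marks
           (Set-of-Mark) overlaid on interactive elements.
    """
    if mode == "xml":
        return {"ui_tree": ui_xml}
    if mode == "som":
        return {"image": screenshot_png, "note": "interactive elements carry numbered marks"}
    raise ValueError(f"unknown mode: {mode}")
```

Because both modes emit the same action types, one evaluation harness can score LLMs and LMMs side by side.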

-----

💡 Key Insights:

→ Fine-tuning improves performance far more than wrapping models in complex reasoning frameworks

→ Standardized metrics such as Success Rate and Sub-Goal Success Rate make results comparable across models and modes

→ Open-source models can approach closed-source performance through proper fine-tuning

→ Verifying task completion by matching against the device's final UI tree structure enables precise, automated assessment
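
As a rough illustration of the last two points, the sketch below checks each sub-goal by matching expected attributes against the final UI tree and then aggregates Success Rate and Sub-Goal Success Rate. It is a simplified assumption of how such checks could work, not the paper's evaluation code.

```python
# Illustrative sketch only: task/sub-goal structures are assumptions, not the paper's code.
import xml.etree.ElementTree as ET
from typing import Dict, List


def subgoal_reached(ui_xml: str, expected: Dict[str, str]) -> bool:
    """A sub-goal counts as reached if some node in the final UI tree carries
    all of the expected attribute values (e.g. a newly saved contact's name
    appearing in the Contacts list)."""
    root = ET.fromstring(ui_xml)
    return any(
        all(node.attrib.get(k) == v for k, v in expected.items())
        for node in root.iter()
    )


def evaluate(tasks: List[dict]) -> Dict[str, float]:
    """Each task holds the final 'ui_xml' plus a list of 'subgoals' (expected
    attribute dicts). A task succeeds only if every sub-goal is reached."""
    task_hits = subgoal_hits = subgoal_total = 0
    for task in tasks:
        results = [subgoal_reached(task["ui_xml"], g) for g in task["subgoals"]]
        task_hits += all(results)
        subgoal_hits += sum(results)
        subgoal_total += len(results)
    return {
        "success_rate": task_hits / len(tasks),
        "sub_goal_success_rate": subgoal_hits / subgoal_total,
    }
```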

-----

📊 Results:

→ Fine-tuning lifted the average success rate of open-source LLMs from 4.59% to 21.50%

→ For open-source LMMs, success rates improved from 1.93% to 13.28%

→ GPT-4 achieved over 30% success rate in both XML and SoM modes

→ Best open-source models reached around 5% success rate before fine-tuning
