Two-stage training creates GUI agents that understand and reason about interface interactions naturally.
InfiGUIAgent introduces a two-stage training pipeline for GUI agents that enhances both basic skills and advanced reasoning capabilities through native integration of hierarchical and reflection-based reasoning.
-----
https://arxiv.org/abs/2501.04575
🤖 Original Problem:
Existing GUI agents struggle with multi-step reasoning and rely heavily on textual annotations, limiting their effectiveness in real-world applications.
-----
🛠️ Solution in this Paper:
→ The solution employs a two-stage supervised fine-tuning approach to build robust GUI agents.
→ Stage 1 focuses on fundamental capabilities like GUI understanding and instruction grounding using diverse datasets.
→ Stage 2 integrates hierarchical reasoning and expectation-reflection reasoning using synthesized data.
→ The agent reasons at two levels: a strategic layer that plans the overall task and a tactical layer that selects the next concrete action.
→ A reference-augmented annotation format ties reasoning text to on-screen element locations, strengthening visual-language understanding (see the sketch below).
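To make the hierarchical reasoning layers and the reference-augmented annotation idea concrete, here is a minimal sketch of what a Stage-2 training example might look like. The field names, the <ref> tag syntax, and the coordinate values are illustrative assumptions, not the paper's exact data schema.

```python
# Illustrative Stage-2 training example (assumed layout, not the paper's schema).
training_example = {
    "instruction": "Turn on Wi-Fi in the settings app",
    "screenshot": "settings_home.png",
    "reasoning": {
        # Strategic layer: high-level decomposition of the overall task.
        "strategic": "Open the network settings section, then toggle the Wi-Fi switch.",
        # Tactical layer: the concrete next action, grounded in the current screen
        # via a reference-style annotation that embeds element coordinates in text.
        "tactical": "Tap <ref box='[112, 340, 968, 412]'>Network & internet</ref>.",
    },
    # Executable action derived from the tactical step.
    "action": {"type": "tap", "point": [540, 376]},
}
```

The design point here is that spatial references live inside the text the model learns to generate, which is how grounding can be trained natively rather than through separate textual annotations.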
-----
💡 Key Insights:
→ GUI agents need both strategic planning and tactical execution abilities
→ Native reasoning capabilities can be integrated through carefully structured training data
→ Reflection-based learning improves self-correction and adaptation (a minimal sketch follows below)
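As a rough illustration of expectation-reflection reasoning, the sketch below shows one action-observation cycle. The Agent interface (propose, execute, expectation_met, revise) is a hypothetical stand-in for this example, not the paper's implementation.

```python
# Minimal sketch of an expectation-reflection cycle, assuming a hypothetical
# Agent interface; names and fields are illustrative, not from the paper.
from dataclasses import dataclass

@dataclass
class Step:
    action: str       # e.g. "tap(540, 376)"
    expectation: str  # what the agent expects the next screen to show

def expectation_reflection_step(agent, task: str, screen: str) -> str:
    """One action-observation cycle with reflection-based self-correction."""
    step = agent.propose(task, screen)       # action plus its expected outcome
    new_screen = agent.execute(step.action)  # interact with the GUI

    # Reflection: compare the expected outcome with what actually happened.
    if not agent.expectation_met(step.expectation, new_screen):
        # Self-correction: feed the mismatch back before planning the next step.
        agent.revise(task, step, new_screen)

    return new_screen
```

The point of the pattern is that the agent commits to an expected outcome before acting, so any mismatch with the observed screen becomes a training and correction signal.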
-----
📊 Results:
→ Achieved the highest accuracy of 76.3% on the ScreenSpot benchmark
→ Outperformed larger models like UGround-7B (73.3%) and ShowUI (75.1%)
→ Demonstrated a 0.09 overall success rate on AndroidWorld, surpassing similarly sized models