A new way to make AI learn GUI interactions: plan first, act later.
AutoGLM, proposed in this paper, teaches AI to navigate websites and apps by separating thinking from doing.
https://arxiv.org/abs/2411.00820
🎯 Original Problem:
Foundation models excel at acquiring knowledge but struggle with real-world decision-making, especially in GUI environments. The key challenge is the scarcity of decision-making data in pre-training sets, limiting their ability to learn from dynamic interactions.
-----
🔧 Solution in this Paper:
→ AutoGLM introduces an "intermediate interface" that separates planning from grounding behaviors, allowing independent optimization
→ Implements self-evolving online curriculum reinforcement learning that enables agents to learn from failures progressively
→ Uses comprehensive training techniques: pre-training with weakly-supervised signals, behavior cloning from expert trajectories, and reward modeling
→ Deploys through Qingyan Browser Plugin and Android applications for real-world testing
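The "intermediate interface" idea above can be sketched in a few lines: the planner decides *what* to do in natural language, and a separate grounder decides *where* on the screen to do it, so each can be trained and swapped independently. This is an illustrative toy, not the paper's actual API; all names are hypothetical.

```python
# Toy sketch of a planner/grounder split via an intermediate interface.
# The planner never sees coordinates; the grounder never sees the task.

def plan(task: str, page_text: str) -> str:
    """Planner: map the task + page state to an abstract action intent."""
    if "book a table" in task and "Find a time" in page_text:
        return "click the 'Find a time' button"
    return "scroll down"

def ground(intent: str, elements: dict) -> tuple:
    """Grounder: map an abstract intent to concrete screen coordinates."""
    for label, xy in elements.items():
        if label.lower() in intent.lower():
            return xy
    return (0, 0)  # fallback: no matching element found

# Hypothetical page with two labeled UI elements
elements = {"'Find a time' button": (320, 540), "search box": (160, 80)}
intent = plan("book a table for two", "Welcome! Find a time below.")
print(ground(intent, elements))
```

Because the intent is plain text, the planner can be optimized (or replaced) without retraining the grounder, and vice versa.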
-----
💡 Key Insights:
→ Separating planning and grounding behaviors significantly improves agent performance
→ Self-evolving RL with curriculum learning effectively addresses data scarcity
→ Real-world deployment helps in understanding both benefits and risks of autonomous agents
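The self-evolving curriculum idea can be illustrated with a minimal loop: the agent retries a task level after each failure and only graduates to harder tasks once it succeeds, so failures themselves generate the training signal. This is a deliberately simplified sketch with a toy skill model, not the paper's training code.

```python
# Toy self-evolving curriculum loop: fail -> learn -> retry -> graduate.

def attempt(difficulty: int, skill: int) -> bool:
    """Toy environment: the agent succeeds once its skill meets the level."""
    return skill >= difficulty

def curriculum_train(max_difficulty: int = 5) -> list:
    skill, difficulty, history = 0, 1, []
    while difficulty <= max_difficulty:
        success = attempt(difficulty, skill)
        history.append((difficulty, success))
        if success:
            difficulty += 1   # graduate to a harder task
        else:
            skill += 1        # learn from the failure, retry the same level
    return history

print(curriculum_train())
```

The key property is that the task distribution adapts to the agent: easy levels are left behind once mastered, and hard levels are only attempted after the prerequisite failures have been absorbed.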
-----
📊 Results:
→ Web browsing: 55.2% success rate on VAB-WebArena-Lite (59.1% with second attempt)
→ OpenTable tasks: 96.2% success rate, outperforming GPT-4 (62.6%)
→ Android control: 36.2% success rate on AndroidLab, exceeding both GPT-4 and Claude-3.5