WebRL, proposed in this paper, turns weak open-source LLMs into strong web agents via self-evolving online curriculum reinforcement learning
https://arxiv.org/abs/2411.02337
Original Problem 🤔:
Open-source LLMs perform poorly as web agents compared to proprietary models like GPT-4. The base Llama-3.1-8B achieves only a 4.8% success rate on the WebArena-Lite benchmark, making open LLMs impractical for web automation tasks.
-----
Solution in this Paper 🛠️:
→ Introduces WebRL, a self-evolving online curriculum reinforcement learning framework for training LLM web agents
→ Uses a curriculum that automatically generates new training tasks from the agent's failed attempts, matched to its current skill level (first sketch below)
→ Trains an outcome-supervised reward model (ORM) to judge whether each task attempt succeeded
→ Applies a KL-divergence constraint between the actor and a reference policy to prevent catastrophic forgetting (second sketch below)
→ Keeps an experience replay buffer with perplexity-based filtering to retain and reuse past successful trajectories
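First, a minimal sketch of the self-evolving task-generation step, assuming a text-completion callable `llm` (hypothetical) that proposes new instructions conditioned on a failed task. The prompt wording, the variant count, and the helper name are illustrative, not the paper's exact setup:

```python
from typing import Callable, List

def evolve_tasks(failed_task: str,
                 llm: Callable[[str], str],
                 n_variants: int = 4) -> List[str]:
    """Propose new candidate instructions from a task the agent failed.
    In WebRL, candidates are further filtered (e.g. by the critic's value
    estimate) so difficulty matches the agent's current skill level."""
    prompt = (
        "The web agent failed this task:\n"
        f"{failed_task}\n"
        f"Propose {n_variants} related tasks of similar or slightly lower "
        "difficulty, one per line."
    )
    lines = [line.strip("- ").strip() for line in llm(prompt).splitlines()]
    return [line for line in lines if line][:n_variants]
```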
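Second, a minimal PyTorch sketch of an ORM-scored, KL-regularized policy update in the spirit of WebRL. The function names, the YES/NO verdict protocol, the scalar-advantage form, and the `beta` value are illustrative assumptions, not the paper's exact formulation:

```python
import torch

def orm_reward(judge_verdict: str) -> float:
    # Outcome-supervised reward: map the reward model's binary verdict
    # ("YES" = task completed) to a scalar trajectory reward.
    # The YES/NO protocol here is an assumption for illustration.
    return 1.0 if judge_verdict.strip().upper() == "YES" else 0.0

def kl_constrained_policy_loss(logp_actor: torch.Tensor,
                               logp_ref: torch.Tensor,
                               advantage: torch.Tensor,
                               beta: float = 0.1) -> torch.Tensor:
    """KL-regularized policy-gradient loss (sketch).

    logp_actor: (T,) log-probs of the taken action tokens under the actor
    logp_ref:   (T,) log-probs of the same tokens under the frozen reference
    advantage:  scalar advantage for the trajectory (e.g. reward - baseline)
    beta:       strength of the KL penalty (illustrative value)
    """
    pg_term = -(advantage * logp_actor).mean()         # push up rewarded actions
    kl_term = (logp_actor - logp_ref.detach()).mean()  # stay near the reference
    return pg_term + beta * kl_term
```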
-----
Key Insights 💡:
→ Curriculum learning with dynamic task generation is crucial for continuous improvement
→ KL-divergence constraints effectively prevent policy drift during online learning
→ Filtering replay data by action perplexity balances familiar and challenging examples (see the sketch below)
→ Learning only from successful trajectories avoids error-prone value estimation of intermediate states
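As a rough illustration of that replay filter, the sketch below scores a stored trajectory's actions under the current actor and keeps only trajectories whose perplexity falls in a middle band. The `logp_fn` helper and the `lo`/`hi` bounds are hypothetical; the paper's exact thresholds differ:

```python
import math
from typing import Callable, List, Sequence

def action_perplexity(token_logps: Sequence[float]) -> float:
    """Perplexity of an action sequence from its per-token log-probs."""
    return math.exp(-sum(token_logps) / len(token_logps))

def filter_replay(buffer: List[dict],
                  logp_fn: Callable[[dict], Sequence[float]],
                  lo: float = 1.0,
                  hi: float = 1.5) -> List[dict]:
    """Keep successful trajectories whose perplexity under the *current*
    actor lies in (lo, hi): familiar enough to be learnable, novel enough
    to still carry training signal. Bounds are illustrative.

    logp_fn: hypothetical helper scoring a stored trajectory's action
             tokens under the current actor policy.
    """
    return [traj for traj in buffer
            if lo < action_perplexity(logp_fn(traj)) < hi]
```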
-----
Results 📊:
→ Improves Llama-3.1-8B from 4.8% to 42.4% success rate
→ Boosts GLM-4-9B from 6.1% to 43% success rate
→ Achieves 49.1% with Llama-3.1-70B
→ Outperforms GPT-4-Turbo (17.6%) and AutoWebGLM (18.2%)