Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
AI agents learn web navigation through guided exploration and self-critique, surpassing average human performance on WebShop.
Original Problem:
LLMs struggle with agentic, multi-step reasoning in interactive environments like web navigation. Traditional supervised pre-training on static datasets falls short in enabling autonomous agent capabilities for complex decision-making in dynamic settings.
Solution in this Paper:
• Introduces the Agent Q framework, which combines:
Guided Monte Carlo Tree Search (MCTS) for exploration
A self-critique mechanism that provides intermediate, step-level rewards
Iterative fine-tuning using an off-policy variant of Direct Preference Optimization (DPO)
• Learns from both successful and unsuccessful trajectories (a minimal sketch of the preference-pair idea follows this list)
• Improves generalization on complex, multi-step reasoning tasks
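To make the "learning from both successful and unsuccessful trajectories" point concrete, here is a minimal Python sketch: at each visited state, the actions explored by MCTS are scored by a mix of rollout outcome and self-critique feedback, and the best/worst branches become the chosen/rejected pair for DPO. All names, the alpha mixing weight, and the data shapes are illustrative assumptions, not the authors' code.

```python
from dataclasses import dataclass

@dataclass
class Child:
    action: str
    mcts_value: float      # averaged rollout outcome for this branch
    critique_score: float  # self-critique rating of the intermediate step

def preference_pair(state, children, alpha=0.5):
    """Mix outcome value and self-critique score, then take the
    highest-scoring action as 'chosen' and the lowest as 'rejected'."""
    scored = sorted(
        children,
        key=lambda c: alpha * c.mcts_value + (1 - alpha) * c.critique_score,
        reverse=True,
    )
    best, worst = scored[0], scored[-1]
    return {"state": state, "chosen": best.action, "rejected": worst.action}

# Example: two candidate actions explored from one web-page state.
children = [Child("click('Search')", mcts_value=0.8, critique_score=0.9),
            Child("click('Back')",   mcts_value=0.1, critique_score=0.3)]
print(preference_pair("<webshop search page>", children))
```

Because even a failed episode contains locally better and worse branches, this per-step pairing is what lets unsuccessful trajectories still contribute training signal.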
Key Insights from this Paper:
• Combining search and learning significantly boosts agent performance
• Process-level supervision improves over purely outcome-based training
• Test-time search provides substantial additional performance gains
• Fine-grained credit assignment is crucial for long-horizon tasks (the objective used on those per-step pairs is sketched below)
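The per-step preference pairs are optimized with a DPO-style objective. The snippet below is the textbook DPO loss with a hypothetical beta of 0.1, shown for orientation only; it is not the paper's exact off-policy variant.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: widen the margin by which the policy prefers
    the 'chosen' action over the 'rejected' one, relative to a frozen
    reference policy."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * margin), averaged over the batch of pairs
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Dummy log-probabilities for a batch of four preference pairs.
pc = torch.tensor([-12.0, -9.5, -11.2, -8.7])   # policy, chosen
pr = torch.tensor([-14.1, -10.0, -13.5, -9.9])  # policy, rejected
rc = torch.tensor([-12.5, -9.8, -11.0, -8.9])   # reference, chosen
rr = torch.tensor([-13.0, -10.2, -12.8, -9.5])  # reference, rejected
print(dpo_loss(pc, pr, rc, rr))
```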
Results:
• WebShop: outperforms baselines and average human performance (50.5% vs 50.0% success rate)
• OpenTable (real-world booking):
Improves LLaMA-3 70B zero-shot success rate from 18.6% to 81.7%
With online search, further improves to a 95.4% success rate
• Surpasses GPT-4's performance after one day of autonomous data collection
What are the key components of the Agent Q framework?
The key components of Agent Q are:
Guided Monte Carlo Tree Search (MCTS) for exploration (a selection-rule sketch follows this list)
Self-critique mechanism for intermediate rewards
Iterative fine-tuning using an off-policy variant of Direct Preference Optimization (DPO)
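For the "guided" exploration component, a generic PUCT-style selection rule of the kind used in AlphaZero-style MCTS is sketched below: the policy's prior over candidate actions and the accumulated branch values steer which action to expand next. The exact formula and constants in Agent Q may differ, so treat this as an assumed sketch rather than the paper's implementation.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    action: str
    prior: float               # policy's prior probability for this action
    visits: int = 0
    value_sum: float = 0.0     # accumulated outcome + self-critique values
    children: list = field(default_factory=list)

    @property
    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node, c_puct=1.0):
    """PUCT selection: exploit high-value branches while still exploring
    under-visited ones, weighted by the policy prior."""
    total_visits = sum(child.visits for child in node.children)
    def puct(child):
        exploration = (c_puct * child.prior
                       * math.sqrt(total_visits + 1) / (1 + child.visits))
        return child.q + exploration
    return max(node.children, key=puct)

# Example: choose between two candidate web actions.
root = Node("root", prior=1.0, children=[
    Node("click('Search')", prior=0.7, visits=3, value_sum=2.4),
    Node("type('dates')",   prior=0.3, visits=1, value_sum=0.2),
])
print(select_child(root).action)
```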



