LLMs are starting to navigate websites systematically instead of relying on surface-level search, and we need a way to measure this ability.
WebWalker introduces a benchmark and framework for evaluating LLMs' ability to navigate websites deeply, addressing limitations in current web information retrieval systems.
-----
https://arxiv.org/abs/2501.07572
🔍 Original Problem:
Traditional search engines often retrieve shallow content, limiting LLMs' ability to handle complex information spread across multiple webpage layers.
-----
🛠️ Solution in this Paper:
→ WebWalkerQA benchmark contains 680 queries across 1373 webpages, testing LLMs' ability to traverse websites systematically
→ Implements a multi-agent system with Explorer Agent for navigation and Critic Agent for memory management
→ Uses HTML button data and two-stage funnel annotation combining LLM-based and human verification
→ Covers both single-source and multi-source queries in education, conference, organization, and game domains
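The Explorer/Critic split above can be sketched as a simple traversal loop. This is a minimal illustration, not the paper's implementation: `Page`, `explorer_pick`, and `critic_update` are hypothetical names, and the LLM-driven agents are stubbed with trivial heuristics.

```python
# Hedged sketch of a WebWalker-style Explorer/Critic loop.
# In the paper both agents are LLM-driven; here they are simple stubs.
from dataclasses import dataclass

@dataclass
class Page:
    url: str
    text: str
    buttons: dict  # button label -> target page url

def explorer_pick(page, visited):
    """Explorer Agent (stub): pick the next unvisited button to click."""
    for label, target in page.buttons.items():
        if target not in visited:
            return target
    return None

def critic_update(memory, page, query):
    """Critic Agent (stub): keep only observations relevant to the query."""
    if any(word in page.text.lower() for word in query.lower().split()):
        memory.append(page.text)
    return memory

def webwalk(root_url, pages, query, max_steps=10):
    """Traverse a site vertically, accumulating query-relevant memory."""
    memory, visited, current = [], {root_url}, pages[root_url]
    for _ in range(max_steps):
        memory = critic_update(memory, current, query)
        nxt = explorer_pick(current, visited)
        if nxt is None:
            break
        visited.add(nxt)
        current = pages[nxt]
    return memory
```

On a toy two-page site, `webwalk("/", pages, "paper deadline")` clicks from the home page into a CFP page and returns only the text the Critic judged relevant.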
-----
💡 Key Insights:
→ Even powerful LLMs struggle with deep web navigation, validating WebWalkerQA's challenging nature
→ Combining RAG's horizontal retrieval across sources with WebWalker's vertical exploration within a site improves performance
→ Vertical exploration within pages shows promise as a way to scale inference-time compute
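The horizontal/vertical coordination idea can be illustrated with a tiny sketch: horizontal retrieval ranks candidate pages across sources, while a vertical walker supplies evidence from deep within one site. All names here (`retrieve`, `answer`, the term-overlap scoring) are hypothetical stand-ins, not the paper's actual RAG pipeline.

```python
# Hedged sketch: combine horizontal retrieval (RAG-style ranking across
# sources) with vertical exploration (a WebWalker-style traversal).

def retrieve(query, corpus, k=2):
    """Horizontal (stub): rank pages by naive term overlap with the query."""
    terms = set(query.lower().split())
    scored = sorted(corpus, key=lambda doc: -len(terms & set(doc.lower().split())))
    return scored[:k]

def answer_context(query, corpus, walk):
    """Merge top-k retrieved pages with evidence from a deep walk.

    `walk` is any callable returning a list of query-relevant snippets
    gathered by traversing inside a single site.
    """
    return retrieve(query, corpus) + walk(query)
```

Here the retriever supplies breadth and the walker supplies depth; a downstream LLM would then answer from the merged context.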
-----
📊 Results:
→ Best-performing WebWalker configuration, using GPT-4, achieved 37.5% accuracy
→ Performance decreases with increasing depth and number of required sources
→ Commercial RAG systems achieved maximum 40.73% accuracy on WebWalkerQA