LLMs are starting to navigate websites systematically instead of relying on surface-level search, and we need a way to measure this ability.
WebWalker introduces a benchmark and framework for evaluating LLMs' ability to navigate websites deeply, addressing limitations in current web information retrieval systems.
-----
https://arxiv.org/abs/2501.07572
🔍 Original Problem:
Traditional search engines often retrieve shallow content, limiting LLMs' ability to handle complex information spread across multiple webpage layers.
-----
🛠️ Solution in this Paper:
→ WebWalkerQA benchmark contains 680 queries across 1373 webpages, testing LLMs' ability to traverse websites systematically
→ Implements a multi-agent system with Explorer Agent for navigation and Critic Agent for memory management
→ Uses HTML button data and two-stage funnel annotation combining LLM-based and human verification
→ Covers both single-source and multi-source queries in education, conference, organization, and game domains
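The Explorer/Critic split above can be sketched as a simple traversal loop. This is a minimal illustration, not the paper's implementation: `Page`, `explorer_pick`, and `critic_update` are hypothetical names, and the LLM-driven agents are stubbed with trivial heuristics.

```python
# Hedged sketch of a WebWalker-style Explorer/Critic loop.
# In the paper both agents are LLM-driven; here they are simple stubs.
from dataclasses import dataclass

@dataclass
class Page:
    url: str
    text: str
    buttons: dict  # button label -> target page url

def explorer_pick(page, visited):
    """Explorer Agent (stub): pick the next unvisited button to click."""
    for label, target in page.buttons.items():
        if target not in visited:
            return target
    return None

def critic_update(memory, page, query):
    """Critic Agent (stub): keep only observations relevant to the query."""
    if any(word in page.text.lower() for word in query.lower().split()):
        memory.append(page.text)
    return memory

def webwalk(root_url, pages, query, max_steps=10):
    """Traverse a site vertically, accumulating query-relevant memory."""
    memory, visited, current = [], {root_url}, pages[root_url]
    for _ in range(max_steps):
        memory = critic_update(memory, current, query)
        nxt = explorer_pick(current, visited)
        if nxt is None:
            break
        visited.add(nxt)
        current = pages[nxt]
    return memory
```

On a toy two-page site, `webwalk("/", pages, "paper deadline")` clicks from the home page into a CFP page and returns only the text the Critic judged relevant.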
-----
💡 Key Insights:
→ Even powerful LLMs struggle with deep web navigation, validating WebWalkerQA's challenging nature
→ Combining RAG's horizontal retrieval across sources with WebWalker's vertical exploration within a site improves performance
→ Vertical exploration within pages shows promise as a way to scale inference-time compute
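The horizontal/vertical coordination idea can be illustrated with a tiny sketch: horizontal retrieval ranks candidate pages across sources, while a vertical walker supplies evidence from deep within one site. All names here (`retrieve`, `answer`, the term-overlap scoring) are hypothetical stand-ins, not the paper's actual RAG pipeline.

```python
# Hedged sketch: combine horizontal retrieval (RAG-style ranking across
# sources) with vertical exploration (a WebWalker-style traversal).

def retrieve(query, corpus, k=2):
    """Horizontal (stub): rank pages by naive term overlap with the query."""
    terms = set(query.lower().split())
    scored = sorted(corpus, key=lambda doc: -len(terms & set(doc.lower().split())))
    return scored[:k]

def answer_context(query, corpus, walk):
    """Merge top-k retrieved pages with evidence from a deep walk.

    `walk` is any callable returning a list of query-relevant snippets
    gathered by traversing inside a single site.
    """
    return retrieve(query, corpus) + walk(query)
```

Here the retriever supplies breadth and the walker supplies depth; a downstream LLM would then answer from the merged context.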
-----
📊 Results:
→ Best-performing WebWalker configuration, using GPT-4, achieved 37.5% accuracy
→ Performance decreases with increasing depth and number of required sources
→ Commercial RAG systems achieved maximum 40.73% accuracy on WebWalkerQA