"ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use"

Podcast on this paper generated with Google's Illuminate.

ToolHop shows how LLMs struggle to use multiple tools together, even when given clear instructions.

ToolHop introduces a dataset with 995 queries and 3,912 tools to evaluate how well LLMs can use multiple tools in sequence to solve complex problems.

-----

https://arxiv.org/abs/2501.02506

🤔 Original Problem:

→ Current benchmarks lack reliable ways to test LLMs' ability to use multiple tools together in a logical sequence

→ Existing datasets focus on single-tool scenarios or lack verifiable answers

-----

🛠️ Solution in this Paper:

→ ToolHop uses a query-driven approach that starts with complex questions and builds tools around them

→ Each complex query is broken down into simpler sub-queries that depend on each other

→ Tools are created specifically for each sub-query, ensuring meaningful connections between tools

→ The system includes detailed error messages and feedback mechanisms

→ All tools are locally executable and can be directly tested (see the sketch after this list)
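
To make the query-driven design concrete, here is a minimal sketch (my own illustration, not the paper's actual tool schema or data): two locally executable tools built for dependent sub-queries, where the second hop consumes the first hop's output and each tool raises a detailed error message on bad input.

```python
# Sketch only: hypothetical tools and data, not ToolHop's real schema.

def find_author(book_title: str) -> str:
    """Tool for the sub-query 'Who wrote <book_title>?'."""
    known_books = {"The Hobbit": "J.R.R. Tolkien"}
    if book_title not in known_books:
        # Detailed, actionable error message instead of a silent failure
        raise ValueError(
            f"Unknown book title '{book_title}'. Expected one of: {sorted(known_books)}"
        )
    return known_books[book_title]

def find_birth_year(person_name: str) -> int:
    """Tool for the dependent sub-query 'When was <person_name> born?'."""
    birth_years = {"J.R.R. Tolkien": 1892}
    if person_name not in birth_years:
        raise ValueError(
            f"Unknown person '{person_name}'. Expected one of: {sorted(birth_years)}"
        )
    return birth_years[person_name]

# Multi-hop query: "In what year was the author of The Hobbit born?"
# The second call depends on the first call's output, so the hops
# cannot be answered independently.
author = find_author("The Hobbit")
print(find_birth_year(author))  # 1892
```

Because the tools run locally, a benchmark harness can execute the model's calls directly and check the final answer against a verifiable ground truth.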

-----

💡 Key Insights:

→ Query-driven tool creation leads to more natural and interdependent tool relationships

→ Detailed feedback helps models correct their mistakes during tool use (see the retry-loop sketch after this list)

→ Even advanced models struggle with complex multi-tool scenarios
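
The sketch below illustrates the feedback insight under my own assumptions: `call_model` is a stand-in for a real LLM, and `lookup_population` and the message format are hypothetical, not ToolHop's actual tools. The point is that a failed call's detailed error message is fed back into the conversation so the model can repair its arguments and retry.

```python
# Sketch of an error-feedback loop (illustration, not the paper's implementation).

def lookup_population(city: str) -> int:
    cities = {"Paris": 2_102_650, "Berlin": 3_850_809}
    if city not in cities:
        # Detailed error message the model can act on
        raise ValueError(f"Unknown city '{city}'. Expected one of: {sorted(cities)}")
    return cities[city]

def call_model(history: list[dict]) -> dict:
    # Stand-in for an LLM call: once an error appears in the history,
    # it "corrects" its argument; in practice this would be a model API call.
    if any(m["role"] == "tool" and m["content"].startswith("Error") for m in history):
        return {"city": "Paris"}
    return {"city": "paris"}  # initial, slightly wrong call

def run_with_feedback(max_turns: int = 3) -> int | None:
    history: list[dict] = [{"role": "user", "content": "How many people live in Paris?"}]
    for _ in range(max_turns):
        args = call_model(history)
        try:
            result = lookup_population(**args)
            history.append({"role": "tool", "content": str(result)})
            return result
        except (TypeError, ValueError) as err:
            # Feed the detailed error back so the model can repair the call
            history.append({"role": "tool", "content": f"Error: {err}"})
    return None

print(run_with_feedback())  # 2102650, on the second attempt
```

With only a generic "invalid input" error instead of the detailed message, the retry would have no signal to work from, which is why the feedback design matters.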

-----

📊 Results:

→ The best-performing model, GPT-4o, achieved only 49.04% accuracy

→ Tool use improved model performance by 12.29% on average

→ The GPT family showed a 23.59% improvement through tool usage
