ToolHop shows how LLMs struggle to use multiple tools together, even when given clear instructions.
ToolHop introduces a dataset with 995 queries and 3,912 tools to evaluate how well LLMs can use multiple tools in sequence to solve complex problems.
-----
https://arxiv.org/abs/2501.02506
🤔 Original Problem:
→ Current benchmarks lack reliable ways to test LLMs' ability to use multiple tools together in a logical sequence
→ Existing datasets focus on single-tool scenarios or lack verifiable answers
-----
🛠️ Solution in this Paper:
→ ToolHop uses a query-driven approach that starts with complex questions and builds tools around them
→ Each complex query is broken down into simpler sub-queries that depend on each other
→ Tools are created specifically for each sub-query, ensuring meaningful connections between tools
→ The system includes detailed error messages and feedback mechanisms
→ All tools are locally executable and can be directly tested, as sketched below
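A minimal sketch of what such locally executable, query-driven tools could look like. The tool names, the toy lookup data, and the dict-based error convention here are hypothetical illustrations, not the paper's actual tool definitions: each sub-query gets its own small function, the second tool consumes the first tool's output (the "multi-hop" dependency), and invalid inputs return a detailed error message instead of crashing.

```python
# Hypothetical sketch of two interdependent, locally executable tools.
# Names and data are illustrative; they are not from the ToolHop dataset itself.

def get_album_release_year(album_title: str) -> dict:
    """First hop: look up an album's release year (toy stand-in data)."""
    releases = {"Thriller": 1982, "Abbey Road": 1969}
    if not isinstance(album_title, str) or not album_title.strip():
        return {"error": "album_title must be a non-empty string"}
    if album_title not in releases:
        return {"error": f"Unknown album '{album_title}'; try the exact title"}
    return {"result": releases[album_title]}

def years_between(start_year: int, end_year: int) -> dict:
    """Second hop: consumes outputs of the first tool to answer the full query."""
    if not all(isinstance(y, int) for y in (start_year, end_year)):
        return {"error": "start_year and end_year must both be integers"}
    return {"result": end_year - start_year}

# Multi-hop chain: "How many years after Abbey Road was Thriller released?"
first = get_album_release_year("Abbey Road")
second = get_album_release_year("Thriller")
if "error" not in first and "error" not in second:
    print(years_between(first["result"], second["result"]))  # {'result': 13}
```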
-----
💡 Key Insights:
→ Query-driven tool creation leads to more natural and interdependent tool relationships
→ Detailed feedback helps models correct their mistakes during tool use (see the retry sketch below)
→ Even advanced models struggle with complex multi-tool scenarios
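To make the feedback point concrete, here is a hedged sketch of a retry loop in which a tool's error message is fed back to the model so it can correct its next call. `model_propose_call` is a stand-in for a real LLM call, and the argument values are contrived so the first attempt fails and the second succeeds; this is an illustration of the idea, not the paper's evaluation harness.

```python
# Hypothetical feedback loop: hand any tool error back to the "model",
# which then proposes a corrected call.

def years_between(start_year: int, end_year: int) -> dict:
    """Toy tool that rejects bad input with a descriptive error message."""
    if not all(isinstance(y, int) for y in (start_year, end_year)):
        return {"error": "start_year and end_year must both be integers"}
    return {"result": end_year - start_year}

def model_propose_call(query, feedback):
    """Placeholder for an LLM mapping a query (plus feedback) to tool arguments."""
    if feedback is None:
        return {"start_year": "1969", "end_year": 1982}  # wrong type on purpose
    return {"start_year": 1969, "end_year": 1982}        # corrected after feedback

def run_with_feedback(query, tool, max_turns=3):
    feedback = None
    for _ in range(max_turns):
        args = model_propose_call(query, feedback)
        out = tool(**args)
        if "error" not in out:
            return out
        feedback = out["error"]  # detailed message goes back to the model
    return {"error": "gave up after retries"}

print(run_with_feedback("years between releases", years_between))  # {'result': 13}
```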
-----
📊 Results:
→ The best-performing model, GPT-4o, achieved only 49.04% accuracy
→ Tool use improved model performance by 12.29% on average
→ GPT family showed 23.59% improvement through tool usage