Current benchmarks fail to assess complex function calling in real-world LLM scenarios.
This paper introduces ComplexFuncBench, a benchmark for multi-step, constrained function calls, and ComplexEval, an automatic evaluation framework.
-----
https://arxiv.org/abs/2501.10132
Original Problem 🤔:
→ On their own, LLMs lack real-time and factual knowledge.
→ Function calling lets them fetch that knowledge from external APIs (a minimal sketch of the loop appears after this list).
→ Evaluating complex, multi-step function calls automatically is hard.
→ Existing benchmarks do not reflect real-world complexity.
→ In particular, they omit multi-step calls and parameter value reasoning.
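The sketch below illustrates the basic tool-calling loop the benchmark stresses: the model reads a JSON tool schema, emits a call, and the runtime dispatches it to a backend. The tool name `search_flights`, its schema, and the stub backend are hypothetical examples, not taken from the paper.

```python
import json

# OpenAI-style tool schema advertised to the LLM (hypothetical example).
TOOLS = [{
    "name": "search_flights",
    "description": "Search one-way flights between two airports on a given date.",
    "parameters": {
        "type": "object",
        "properties": {
            "origin": {"type": "string"},
            "destination": {"type": "string"},
            "date": {"type": "string", "description": "YYYY-MM-DD"},
        },
        "required": ["origin", "destination", "date"],
    },
}]

def search_flights(origin: str, destination: str, date: str) -> dict:
    """Stub standing in for a real flight-search API."""
    return {"flights": [{"from": origin, "to": destination, "date": date, "price": 420}]}

def dispatch(tool_call: dict) -> dict:
    """Route a model-emitted call to the matching local function."""
    registry = {"search_flights": search_flights}
    args = json.loads(tool_call["arguments"])
    return registry[tool_call["name"]](**args)

# A call the model might emit after reading TOOLS and the user query.
model_call = {"name": "search_flights",
              "arguments": json.dumps({"origin": "PEK", "destination": "LHR",
                                       "date": "2025-03-01"})}
print(dispatch(model_call))  # the result is fed back to the model as an observation
```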
-----
Solution in this Paper 💡:
→ This paper introduces ComplexFuncBench, a benchmark for complex function calling.
→ Its samples require multi-step and constrained function calls.
→ They also demand long parameter filling and parameter value reasoning.
→ It features 128k-token long-context scenarios.
→ The paper also proposes ComplexEval, an automatic evaluation framework.
→ ComplexEval judges predicted calls with multi-dimensional matching: rule-based, response-based, and LLM-based (a hedged sketch follows this list).
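To make the multi-dimensional matching idea concrete, here is a hedged sketch in the spirit of ComplexEval (not the authors' implementation): a predicted call is accepted if it matches the golden call exactly (rule-based), produces the same API response (response-based), or is judged equivalent by an LLM (LLM-based). The hooks `call_api` and `llm_judge` are assumed callables you would supply.

```python
import json
from typing import Callable

def rule_based_match(pred: dict, gold: dict) -> bool:
    """Exact match on function name and parsed arguments."""
    return (pred["name"] == gold["name"]
            and json.loads(pred["arguments"]) == json.loads(gold["arguments"]))

def response_based_match(pred: dict, gold: dict,
                         call_api: Callable[[dict], dict]) -> bool:
    """Treat two calls as equivalent if the API returns the same response for both."""
    return call_api(pred) == call_api(gold)

def llm_based_match(pred: dict, gold: dict,
                    llm_judge: Callable[[str], str]) -> bool:
    """Ask an LLM judge whether two non-identical calls are semantically equivalent."""
    prompt = (f"Are these two function calls equivalent?\n"
              f"A: {pred}\nB: {gold}\nAnswer yes or no.")
    return llm_judge(prompt).strip().lower().startswith("yes")

def match_call(pred: dict, gold: dict,
               call_api: Callable[[dict], dict],
               llm_judge: Callable[[str], str]) -> bool:
    """Cascade the matchers from cheapest to most expensive."""
    return (rule_based_match(pred, gold)
            or response_based_match(pred, gold, call_api)
            or llm_based_match(pred, gold, llm_judge))
```

Cascading in this order keeps evaluation cheap: most correct calls are caught by exact matching, and the LLM judge is only invoked for the ambiguous remainder.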
-----
Key Insights from this Paper 🧐:
→ Current LLMs show deficiencies in complex function calling.
→ Parameter value errors are a significant error source.
→ Models struggle with constrained parameter reasoning.
→ Extracting parameter values from long contexts remains challenging.
→ Different models exhibit distinct weaknesses across scenarios.
-----
Results 📊:
→ GPT-4o achieves a 60.5% overall success rate on ComplexFuncBench.
→ Qwen2.5-72B achieves a 40.1% overall success rate.
→ Claude-3.5-Sonnet scores 1.84 on completeness and 1.85 on correctness in response evaluation.