"ComplexFuncBench: Exploring Multi-Step and Constrained Function Calling under Long-Context Scenario"

The accompanying podcast was generated with Google's Illuminate.

Current benchmarks fail to assess the complex function calling that real-world LLM use requires.

This paper introduces ComplexFuncBench, a benchmark for multi-step and constrained function calling, together with ComplexEval, an automatic evaluation framework.

-----

https://arxiv.org/abs/2501.10132

Original Problem 🤔:

→ LLMs lack real-time and factual knowledge.

→ Function calling enhances LLMs with external APIs (a minimal sketch follows this list).

→ Evaluating complex function calls is challenging.

→ Existing benchmarks do not cover real-world complexity.

→ They lack multi-step calls and parameter reasoning evaluation.
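
To make the setup concrete: in function calling, the model emits a structured call against a declared API schema instead of answering in free text. Below is a minimal sketch in the common JSON-schema style; the Search_Flights tool and its fields are illustrative assumptions, not APIs from the paper.

```python
# Minimal function-calling sketch. The tool schema follows the JSON-schema
# convention used by common function-calling APIs; the Search_Flights tool
# and its fields are illustrative, not from the paper.
search_flights_tool = {
    "name": "Search_Flights",
    "description": "Search one-way flights between two cities on a date.",
    "parameters": {
        "type": "object",
        "properties": {
            "origin": {"type": "string", "description": "Departure city"},
            "destination": {"type": "string", "description": "Arrival city"},
            "date": {"type": "string", "description": "Date in YYYY-MM-DD"},
        },
        "required": ["origin", "destination", "date"],
    },
}

# Instead of free text, the model emits a structured call like this; in a
# multi-step query, later calls may depend on this call's API response.
model_call = {
    "name": "Search_Flights",
    "arguments": {"origin": "Shanghai", "destination": "Beijing",
                  "date": "2025-01-15"},
}
```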

-----

Solution in this Paper 💡:

→ This paper introduces ComplexFuncBench.

→ It is a benchmark for complex function calling.

→ ComplexFuncBench includes multi-step and constrained calls.

→ It requires filling long parameter values and reasoning about constrained parameter values.

→ It features long-context scenarios of 128k tokens.

→ The paper also proposes ComplexEval.

→ ComplexEval is an automatic evaluation framework.

→ It uses multi-dimensional matching for evaluation.

→ This combines rule-based, response-based, and LLM-based matching (sketched below).
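
A rough sketch of how these three matchers could compose: a predicted call counts as correct if any matcher accepts it. The ordering, helper signatures, and judge prompt below are assumptions for illustration, not the paper's implementation.

```python
# Sketch of multi-dimensional call matching with the three matcher types
# described in the paper: rule-based, response-based, and LLM-based.
# Helper names, signatures, and the judge prompt are illustrative.
from typing import Any, Callable

def rule_match(pred: dict, gold: dict) -> bool:
    # Rule-based: exact match on function name and argument values.
    return pred["name"] == gold["name"] and pred["arguments"] == gold["arguments"]

def response_match(pred: dict, gold: dict, call_api: Callable[[dict], Any]) -> bool:
    # Response-based: different argument surface forms can still be
    # equivalent if the API returns the same result for both calls.
    return call_api(pred) == call_api(gold)

def llm_match(pred: dict, gold: dict, judge: Callable[[str], str]) -> bool:
    # LLM-based: fall back to an LLM judge for semantically equivalent
    # arguments (e.g. "NYC" vs. "New York City").
    prompt = (f"Are these two function calls equivalent?\n"
              f"A: {pred}\nB: {gold}\nAnswer yes or no.")
    return judge(prompt).strip().lower().startswith("yes")

def calls_match(pred: dict, gold: dict, call_api, judge) -> bool:
    # Try the cheapest matcher first; any acceptance counts as a match.
    return (rule_match(pred, gold)
            or response_match(pred, gold, call_api)
            or llm_match(pred, gold, judge))
```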

-----

Key Insights from this Paper 🧐:

→ Current LLMs show deficiencies in complex function calling.

→ Parameter value errors are a significant error source.

→ Models struggle with constrained parameter reasoning (see the example after this list).

→ Long-context parameter extraction is challenging for models.

→ Different models show distinct weaknesses in different scenarios.
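
To make the dominant error type concrete: a parameter value error is a call with the right function and structure but a wrongly reasoned argument. The example below is invented for illustration, not drawn from the paper's data.

```python
# Illustrative parameter value error (example invented for clarity).
# Query fragment: "... check in on 2024-12-01 and stay for three nights."
gold_call = {"name": "Search_Hotels",
             "arguments": {"checkin": "2024-12-01", "checkout": "2024-12-04"}}
pred_call = {"name": "Search_Hotels",
             "arguments": {"checkin": "2024-12-01", "checkout": "2024-12-03"}}
# The function name and structure match, but the model mis-computed the
# constrained checkout date (three nights after check-in), so the call fails.
```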

-----

Results 📊:

→ GPT-4o achieves a 60.5% overall success rate on ComplexFuncBench (metric sketched below).

→ Qwen2.5-72B achieves a 40.1% overall success rate.

→ Claude-3.5-Sonnet achieves a completeness score of 1.84 and a correctness score of 1.85 in response evaluation.
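
Overall success rate here is an all-or-nothing per-sample metric: a sample counts only if its whole multi-step call sequence succeeds. A minimal sketch of that aggregation, assuming boolean per-sample outcomes (the helper is illustrative):

```python
# Minimal sketch: overall success rate as the fraction of benchmark samples
# whose entire multi-step function-call sequence succeeded.
def success_rate(per_sample_success: list[bool]) -> float:
    return sum(per_sample_success) / len(per_sample_success)

print(f"{success_rate([True, False, True, True]):.1%}")  # 75.0%
```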