CallNavi introduces a dataset and benchmarking framework for evaluating how LLMs handle complex API function calling: multi-step routing and parameter generation in real-world scenarios.
https://arxiv.org/abs/2501.05255
Original Problem 🤔:
→ Existing chatbot systems struggle with complex API interactions that require selecting correct APIs from large pools and managing multi-step, nested API calls.
→ Current benchmarks test only basic API calling with limited API options, not reflecting real-world complexity.
Solution in this Paper 💡:
→ CallNavi introduces a dataset with 729 questions across 10 domains to evaluate API function calling.
→ It assesses models on handling over 100 API candidates per task.
→ The framework combines general-purpose LLMs for API selection with fine-tuned models for parameter generation (see the pipeline sketch after this list).
→ It introduces a stability score to measure output consistency across multiple runs (a rough sketch also follows the list).
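A minimal sketch of what such a two-stage pipeline might look like. Everything here is an illustrative assumption, not the paper's implementation: `route_llm` and `param_llm` stand in for a general-purpose router model and a fine-tuned parameter model, and the prompt wording is invented.

```python
import json

def call_api_pipeline(question: str, api_pool: list[dict], route_llm, param_llm) -> dict:
    """Hypothetical two-stage pipeline: a general-purpose LLM routes to an API,
    then a fine-tuned model fills in its parameters. `route_llm` and `param_llm`
    are assumed to be callables that take a prompt string and return a string."""
    # Stage 1: API routing — pick one API out of a large candidate pool.
    catalog = "\n".join(f"- {api['name']}: {api['description']}" for api in api_pool)
    api_name = route_llm(
        f"Question: {question}\nAvailable APIs:\n{catalog}\n"
        "Reply with the single best API name."
    ).strip()

    # Stage 2: parameter generation — fill the chosen API's schema as JSON
    # (the paper finds JSON output works better than YAML for this step).
    schema = next(api for api in api_pool if api["name"] == api_name)
    params = json.loads(param_llm(
        f"Question: {question}\nAPI schema: {json.dumps(schema)}\n"
        "Return the call parameters as a JSON object."
    ))
    return {"api": api_name, "parameters": params}
```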
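The paper's exact stability formula isn't reproduced in this post; one plausible reading, consistency as the fraction of repeated runs that agree with the majority answer, can be sketched as:

```python
from collections import Counter

def stability_score(outputs: list[str]) -> float:
    """Illustrative stability measure (an assumption, not the paper's formula):
    the fraction of runs whose output matches the most common output.
    1.0 means the model answered identically on every run."""
    if not outputs:
        return 0.0
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / len(outputs)

# Example: 5 runs of the same query; 4 agree -> stability 0.8
print(stability_score(["api_a", "api_a", "api_b", "api_a", "api_a"]))
```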
Key Insights 🔍:
→ Commercial models excel at API routing but struggle with parameter generation
→ Smaller open-source models (<10B parameters) show instability in complex tasks
→ JSON format significantly outperforms YAML for API function calling tasks
Results 📊:
→ GPT-4o achieves 91.9% accuracy in API routing and 71.1% in parameter generation
→ The enhanced routing method improves performance on hard tasks by 30%
→ Open LLMs like LLAMA3.1 achieve 86.1% routing accuracy but only 24.5% parameter accuracy