
"CallNavi: A Study and Challenge on Function Calling Routing and Invocation in Large Language Models"

The podcast below on this paper was generated with Google's Illuminate.

CallNavi introduces a novel dataset and benchmarking framework for evaluating how LLMs handle complex API function-calling tasks, including multi-step API routing and parameter generation in real-world scenarios.

https://arxiv.org/abs/2501.05255

Original Problem 🤔:

→ Existing chatbot systems struggle with complex API interactions that require selecting correct APIs from large pools and managing multi-step, nested API calls.

→ Current benchmarks test only basic API calling with limited API options, not reflecting real-world complexity.

CallNavi as introduced in this paper 💡:

→ CallNavi introduces a dataset with 729 questions across 10 domains to evaluate API function calling.

→ It assesses models on handling over 100 API candidates per task.

→ The framework combines general-purpose LLMs for API selection with fine-tuned models for parameter generation.

→ It introduces a novel stability score to measure output consistency across multiple runs.
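The two-stage setup described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the API pool, prompts, and stub model are invented for the example, and the stability score here is one plausible reading (fraction of runs agreeing with the modal answer) — the paper's exact formula may differ.

```python
import json
from collections import Counter

# Hypothetical API catalogue (names and parameters are illustrative,
# not taken from the CallNavi dataset).
API_POOL = [
    {"name": "get_order_status", "params": ["order_id"]},
    {"name": "cancel_order", "params": ["order_id", "reason"]},
    {"name": "track_shipment", "params": ["tracking_number"]},
]

def route(question, api_pool, llm):
    """Stage 1: a general-purpose LLM selects one API from the pool."""
    prompt = (
        "Pick the single best API for the question.\n"
        f"APIs: {json.dumps([a['name'] for a in api_pool])}\n"
        f"Question: {question}\nAnswer with the API name only."
    )
    return llm(prompt).strip()

def fill_params(question, api_name, api_pool, llm):
    """Stage 2: a (possibly fine-tuned) model emits the arguments as JSON."""
    spec = next(a for a in api_pool if a["name"] == api_name)
    prompt = (
        f"Fill parameters {spec['params']} for API {api_name} as JSON.\n"
        f"Question: {question}"
    )
    return json.loads(llm(prompt))

def stability(answers):
    """Fraction of runs that agree with the most common answer."""
    modal_count = Counter(answers).most_common(1)[0][1]
    return modal_count / len(answers)

# Stub model so the sketch runs without a real LLM backend.
def stub_llm(prompt):
    if "API name only" in prompt:
        return "get_order_status"
    return '{"order_id": "A123"}'

question = "Where is my order A123?"
runs = [route(question, API_POOL, stub_llm) for _ in range(5)]
print(runs[0], stability(runs))  # get_order_status 1.0
print(fill_params(question, runs[0], API_POOL, stub_llm))  # {'order_id': 'A123'}
```

Separating routing from parameter generation lets each stage be scored (and swapped out) independently, which is what makes the per-stage accuracy numbers below meaningful.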

Key Insights 🔍:

→ Commercial models excel at API routing but struggle with parameter generation

→ Smaller open-source models (<10B parameters) show instability in complex tasks

→ JSON format significantly outperforms YAML for API function calling tasks
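One practical reason JSON output is easier to work with: it can be strictly validated with a standard parser, while looser formats slip through or fail ambiguously. A small sketch of such a validation step (the `parse_call` helper and the expected `name`/`arguments` shape are assumptions for illustration, not part of the paper):

```python
import json

def parse_call(raw):
    """Strictly parse a model's function-call output as JSON.

    Returns the call dict if it is valid JSON with the expected
    'name' and 'arguments' keys, otherwise None. (Hypothetical
    helper, not from the CallNavi framework.)
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict) or "name" not in call or "arguments" not in call:
        return None
    return call

print(parse_call('{"name": "get_weather", "arguments": {"city": "Paris"}}'))
# {'name': 'get_weather', 'arguments': {'city': 'Paris'}}
print(parse_call("name: get_weather\narguments: {city: Paris}"))
# None — YAML-style output is rejected by the strict JSON parser
```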

Results 📊:

→ GPT-4o achieves 91.9% accuracy in API routing and 71.1% in parameter generation

→ The enhanced routing method improves performance on hard tasks by 30%

→ Open LLMs like Llama 3.1 achieve 86.1% routing accuracy but only 24.5% parameter accuracy
