Benchmarking Agentic Workflow Generation
New benchmark reveals LLMs struggle with complex graph-based workflows, scoring 15% lower than linear tasks.
Simple A-to-B? LLMs nail it. Complex flowcharts? Not so much.
Graph-structured workflow benchmark exposes gaps in LLM planning capabilities, guiding future agent development.
Original Problem 🔍:
Existing workflow evaluation frameworks are not comprehensive: they cover a narrow set of scenarios and mostly linear structures. As a result, they cannot accurately assess how well LLM agents decompose complex tasks into executable workflows.
Solution in this Paper 🛠️:
• Introduces WORFBENCH: A unified workflow generation benchmark
• Covers multi-faceted scenarios: problem-solving, function calling, embodied planning, open-grounded planning
• Models workflows as Directed Acyclic Graphs for complex structures
• Employs strict quality control using Topological Sorting and human evaluation
• Presents WORFEVAL: Evaluation protocol using subsequence and subgraph matching algorithms
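To make the subsequence-matching idea concrete, here is a minimal sketch of an F1 score over two node chains. It assumes nodes are compared by exact string equality and aligned with a longest common subsequence; both choices are simplifications for illustration, and the function names `lcs_length` and `chain_f1` are hypothetical, not WORFEVAL's actual implementation.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def chain_f1(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """F1 of an order-preserving match between predicted and gold chains.

    Assumption: nodes match only when their labels are identical strings.
    """
    if not predicted or not gold:
        return 0.0
    matched = lcs_length(predicted, gold)
    if matched == 0:
        return 0.0
    precision = matched / len(predicted)  # share of predicted nodes matched
    recall = matched / len(gold)          # share of gold nodes recovered
    return 2 * precision * recall / (precision + recall)


# Made-up example chains, not data from the benchmark.
predicted = ["boil water", "add pasta", "make sauce", "serve"]
gold = ["boil water", "make sauce", "add pasta", "drain", "serve"]
score = chain_f1(predicted, gold)
```

Here three nodes can be matched in order, so precision is 3/4 and recall is 3/5. A graph-level score would follow the same precision/recall pattern but count nodes in a matched subgraph instead of a matched subsequence.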
Key Insights from this Paper 💡:
• Significant gap between linear and graph planning capabilities in LLMs
• Graph-structured workflows enhance downstream task performance and efficiency
• Integration of world knowledge crucial for improving LLM agents' planning abilities
Results 📊:
• GPT-4 achieves 67.32% f1_chain and 52.47% f1_graph scores
• Roughly 15-point gap between linear and graph planning (67.32% f1_chain vs. 52.47% f1_graph for GPT-4)
• Workflow-augmented models show 2.5-3.5% improvement in function call accuracy
• Parallel execution reduces task completion time by 18-35%
• Trained open-source models exhibit limited generalization to held-out tasks
🧠 WORFBENCH differs from existing workflow evaluation frameworks by:
• Covering multiple complex scenarios (problem-solving, function calling, embodied planning, open-grounded planning)
• Modeling workflows as Directed Acyclic Graphs to represent complicated serial or parallel structures
• Introducing an intermediary "node chain" structure between tasks and workflow graphs
• Employing rigorous validation using Topological Sorting and human evaluation
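The relationship between a node chain and its workflow graph can be sketched as a topological-order check. The snippet below is an illustration under assumptions, not the benchmark's actual validation code: it treats a workflow as a set of node names plus (prerequisite, dependent) edges, and the helper names are hypothetical. It uses Python's standard `graphlib` module (Python 3.9+).

```python
from graphlib import CycleError, TopologicalSorter

Edge = tuple[str, str]  # (prerequisite, dependent)


def predecessors(nodes: list[str], edges: list[Edge]) -> dict[str, set[str]]:
    """Map each node to the set of nodes that must run before it."""
    preds: dict[str, set[str]] = {n: set() for n in nodes}
    for before, after in edges:
        preds[after].add(before)
    return preds


def is_acyclic(nodes: list[str], edges: list[Edge]) -> bool:
    """True if the workflow graph is a DAG."""
    try:
        TopologicalSorter(predecessors(nodes, edges)).prepare()
    except CycleError:
        return False
    return True


def is_valid_node_chain(chain: list[str], nodes: list[str], edges: list[Edge]) -> bool:
    """True if chain lists every node exactly once and respects every edge."""
    if sorted(chain) != sorted(nodes):
        return False
    position = {n: i for i, n in enumerate(chain)}
    return all(position[a] < position[b] for a, b in edges)


# Made-up diamond-shaped workflow: b and c can run in parallel after a.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
```

With this diamond graph, both `["a", "b", "c", "d"]` and `["a", "c", "b", "d"]` are valid chains, which is why a single DAG can correspond to several acceptable linear orderings.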
Generated workflows can enhance downstream tasks by:
• Serving as structured prior knowledge, improving performance in embodied scenarios
• Functioning as Chain-of-Thought augmentation, assisting in more focused planning
• Enabling parallel execution of subtasks, reducing overall task completion time
• Shortening planning steps by providing purposeful guidance
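The parallel-execution benefit can be illustrated with a small sketch that compares running every subtask one after another against running independent subtasks concurrently. It assumes unlimited workers and known per-node durations; the function name `completion_times` and the example numbers are hypothetical and are not taken from the paper's experiments.

```python
from graphlib import TopologicalSorter


def completion_times(
    durations: dict[str, float],
    edges: list[tuple[str, str]],
) -> tuple[float, float]:
    """Return (sequential_time, parallel_time) for a workflow DAG.

    Assumption: a node starts as soon as all of its prerequisites finish,
    with no limit on how many nodes run at once.
    """
    preds: dict[str, set[str]] = {n: set() for n in durations}
    for before, after in edges:
        preds[after].add(before)

    finish: dict[str, float] = {}
    # static_order() yields each node only after all of its prerequisites.
    for node in TopologicalSorter(preds).static_order():
        start = max((finish[p] for p in preds[node]), default=0.0)
        finish[node] = start + durations[node]

    sequential = sum(durations.values())
    parallel = max(finish.values(), default=0.0)  # critical-path length
    return sequential, parallel


# Made-up workflow: two independent fetches feed one merge step.
durations = {"fetch_a": 2.0, "fetch_b": 3.0, "merge": 1.0}
edges = [("fetch_a", "merge"), ("fetch_b", "merge")]
sequential, parallel = completion_times(durations, edges)
```

In this toy case the two fetches overlap, so the parallel time is set by the longest dependency path rather than by the sum of all steps. A purely linear workflow would show no difference between the two numbers.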