Benchmarking Agentic Workflow Generation

New benchmark reveals LLMs struggle with complex graph-based workflows, scoring 15% lower than linear tasks.

Rohan Paul

Nov 08, 2024

Simple A-to-B? LLMs nail it. Complex flowcharts? Not so much.

Graph-structured workflow benchmark exposes gaps in LLM planning capabilities, guiding future agent development.

Original Problem 🔍:

Existing workflow evaluation frameworks lack comprehensiveness, focusing on limited scenarios and linear structures. They fail to accurately assess LLM agents' ability to decompose complex tasks into executable workflows.

Solution in this Paper 🛠️:

• Introduces WORFBENCH: A unified workflow generation benchmark

• Covers multi-faceted scenarios: problem-solving, function calling, embodied planning, open-grounded planning

• Models workflows as Directed Acyclic Graphs for complex structures

• Employs strict quality control using Topological Sorting and human evaluation

• Presents WORFEVAL: Evaluation protocol using subsequence and subgraph matching algorithms
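To make the DAG modeling and the Topological Sorting check concrete, here is a minimal Python sketch. It is not the paper's implementation: the `topological_order` helper, the edge-list format, and the baking example are assumptions made for illustration. It uses Kahn's algorithm, which produces a complete ordering only when the graph has no cycle, so a `None` result flags a workflow that is not a valid DAG.

```python
from collections import defaultdict, deque


def topological_order(nodes, edges):
    """Return one topological order of the graph, or None if it has a cycle.

    nodes: list of subtask names (hypothetical format)
    edges: list of (before, after) dependency pairs
    """
    indegree = {n: 0 for n in nodes}
    successors = defaultdict(list)
    for before, after in edges:
        successors[before].append(after)
        indegree[after] += 1

    # Start with every subtask that has no unmet dependency.
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    # If some node was never released, the graph contains a cycle.
    return order if len(order) == len(nodes) else None


# Hypothetical workflow: two subtasks can run in parallel before the last one.
nodes = ["find recipe", "buy ingredients", "preheat oven", "bake"]
edges = [
    ("find recipe", "buy ingredients"),
    ("find recipe", "preheat oven"),
    ("buy ingredients", "bake"),
    ("preheat oven", "bake"),
]

print(topological_order(nodes, edges))
print(topological_order(["a", "b"], [("a", "b"), ("b", "a")]))  # cyclic input
```

The first call prints a valid ordering of the four subtasks; the second prints `None` because the two nodes depend on each other.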

Key Insights from this Paper 💡:

• Significant gap between linear and graph planning capabilities in LLMs

• Graph-structured workflows enhance downstream task performance and efficiency

• Integration of world knowledge crucial for improving LLM agents' planning abilities

Results 📊:

• GPT-4 achieves 67.32% f1_chain and 52.47% f1_graph scores

• Roughly 15-percentage-point gap between linear and graph planning (67.32% f1_chain vs. 52.47% f1_graph for GPT-4)

• Workflow-augmented models show 2.5-3.5% improvement in function call accuracy

• Parallel execution reduces task completion time by 18-35%

• Trained open-source models exhibit limited generalization to held-out tasks
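A chain-level F1 score of this kind can be illustrated with a small Python sketch. This is a simplification under stated assumptions, not WORFEVAL itself: nodes are matched by exact string equality, the matched portion is measured with a longest common subsequence, and the names `lcs_length` and `chain_f1` are hypothetical. The last lines reproduce the arithmetic behind the reported gap.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def chain_f1(predicted, gold):
    """F1 between a predicted and a gold node chain (exact-match assumption)."""
    matched = lcs_length(predicted, gold)
    if matched == 0:
        return 0.0
    precision = matched / len(predicted)  # share of predicted nodes that match in order
    recall = matched / len(gold)          # share of gold nodes recovered in order
    return 2 * precision * recall / (precision + recall)


# Hypothetical chains: the prediction swaps two steps and adds an extra one.
gold = ["A", "B", "C", "D"]
predicted = ["A", "C", "B", "D", "E"]
score = chain_f1(predicted, gold)
print(round(score, 4))

# Gap between the reported GPT-4 scores, in percentage points.
gap = round(67.32 - 52.47, 2)
print(gap)
```

Here three nodes match in order, so precision is 3/5 and recall is 3/4. The gap line shows that the "15%" figure is an absolute difference of 14.85 points between the two scores.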

🧠 WORFBENCH differs from existing workflow evaluation frameworks by:

• Covering multiple complex scenarios (problem-solving, function calling, embodied planning, open-grounded planning)

• Modeling workflows as Directed Acyclic Graphs to represent complicated serial or parallel structures

• Introducing an intermediary "node chain" structure between tasks and workflow graphs

• Employing rigorous validation using Topological Sorting and human evaluation
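One way to read the "node chain" idea: a chain is a linear ordering of the subtasks, and it is consistent with a workflow graph when every dependency edge points forward in the chain. The Python sketch below checks that property. The function name and data format are assumptions for illustration, not the benchmark's code.

```python
def chain_respects_graph(chain, edges):
    """Return True if `chain` is a valid topological ordering for `edges`.

    chain: list of node names in proposed execution order (hypothetical format)
    edges: list of (before, after) dependency pairs
    """
    position = {node: i for i, node in enumerate(chain)}
    if len(position) != len(chain):
        return False  # a node appears more than once in the chain
    for before, after in edges:
        if before not in position or after not in position:
            return False  # the graph mentions a node the chain lacks
        if position[before] >= position[after]:
            return False  # a dependency would run after its dependent
    return True


# Hypothetical workflow graph with two parallel branches.
edges = [
    ("find recipe", "buy ingredients"),
    ("find recipe", "preheat oven"),
    ("buy ingredients", "bake"),
    ("preheat oven", "bake"),
]

good_chain = ["find recipe", "preheat oven", "buy ingredients", "bake"]
bad_chain = ["bake", "find recipe", "buy ingredients", "preheat oven"]

print(chain_respects_graph(good_chain, edges))
print(chain_respects_graph(bad_chain, edges))
```

Because the two middle subtasks are independent, more than one chain can be valid for the same graph, which is why a graph carries more structure than any single linear plan.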

Generated workflows can enhance downstream tasks by:

• Serving as structured prior knowledge, improving performance in embodied scenarios

• Functioning as Chain-of-Thought augmentation, assisting in more focused planning

• Enabling parallel execution of subtasks, reducing overall task completion time

• Shortening planning steps by providing purposeful guidance
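The parallel-execution benefit can be sketched in Python by grouping a workflow graph into levels, where every node in a level depends only on nodes from earlier levels. This is a toy illustration under assumptions: the `parallel_levels` helper and the example graph are hypothetical, each subtask is treated as taking one unit of time, and the result is not meant to reproduce the paper's measured speedups.

```python
from collections import defaultdict


def parallel_levels(nodes, edges):
    """Group nodes into levels that could each run in parallel.

    Raises ValueError if the dependencies contain a cycle.
    """
    predecessors = defaultdict(set)
    for before, after in edges:
        predecessors[after].add(before)

    done = set()
    levels = []
    remaining = list(nodes)
    while remaining:
        # A node is runnable once all of its dependencies are finished.
        level = [n for n in remaining if predecessors[n] <= done]
        if not level:
            raise ValueError("cycle detected in workflow graph")
        levels.append(level)
        done.update(level)
        remaining = [n for n in remaining if n not in done]
    return levels


# Hypothetical workflow graph with two independent middle subtasks.
nodes = ["find recipe", "buy ingredients", "preheat oven", "bake"]
edges = [
    ("find recipe", "buy ingredients"),
    ("find recipe", "preheat oven"),
    ("buy ingredients", "bake"),
    ("preheat oven", "bake"),
]

levels = parallel_levels(nodes, edges)
print(levels)
print("serial steps:", len(nodes), "parallel steps:", len(levels))
```

In this example the four subtasks collapse into three levels, because the two middle subtasks do not depend on each other and can run at the same time.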