New benchmark reveals LLMs struggle with complex graph-based workflows, scoring 15% lower than linear tasks.
Benchmarking Agentic Workflow Generation