"WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models"

The podcast on this paper was generated with Google's Illuminate.

WorkflowLLM enables LLMs to handle workflows with 70+ actions, roughly a 10x improvement over current capabilities

An LLM that can orchestrate real-world automation workflows at production scale

https://arxiv.org/abs/2411.05451

Original Problem 🤔:

Current LLMs can only handle small workflows with around 6 actions and simple logical structures. This falls short of real-world needs, where automations in applications like Apple Shortcuts involve 70+ actions and complex branching/looping patterns.
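
To make the gap concrete: the paper represents such workflows as Python code, so a Shortcuts-style automation looks roughly like the sketch below. Everything here is a hypothetical placeholder (the action functions are invented stubs, not APIs from the paper's dataset); the point is the branching and looping that models must sustain across many actions.

```python
# Hypothetical sketch of an automation workflow expressed as Python code,
# mirroring the branching/looping structure of real Apple Shortcuts.
# Every "action" below is an invented stub, not an API from WorkflowBench.

def get_unread_emails(folder):
    # Stub: a real action would call a mail service.
    return [{"subject": "Invoice #1042", "sender": "billing@example.com"},
            {"subject": "Lunch tomorrow?", "sender": "friend@example.com"}]

def get_vip_contacts():
    return {"boss@example.com"}

def save_attachment_to_folder(email, folder):
    print(f"saved '{email['subject']}' to {folder}")

def mark_as_read(email):
    print(f"marked read: {email['subject']}")

def send_notification(text):
    print(f"notification: {text}")

def append_to_note(note, text):
    print(f"{note}: {text}")

def triage_inbox_workflow():
    emails = get_unread_emails(folder="Inbox")          # fetch data
    urgent = []
    for email in emails:                                # loop over items
        if "invoice" in email["subject"].lower():       # conditional branch
            save_attachment_to_folder(email, folder="Finance")
        elif email["sender"] in get_vip_contacts():
            urgent.append(email)
        else:
            mark_as_read(email)
    if urgent:                                          # nested condition
        send_notification(f"{len(urgent)} urgent emails need a reply")
    append_to_note(note="Daily log", text=f"Processed {len(emails)} emails")

if __name__ == "__main__":
    triage_inbox_workflow()
```

Real Shortcuts chain 70+ such actions with nested control flow, which is where current models break down.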

-----

Solution in this Paper 🛠️:

→ Created WorkflowBench, a dataset with 106,763 workflow samples covering 1,503 APIs from 83 applications

→ Collected real workflows from Apple Shortcuts and RoutineHub, converted them to Python code, and added hierarchical thoughts using ChatGPT (see the sketch after this list)

→ Used ChatGPT to generate diverse task queries and expand dataset coverage

→ Trained an annotator model on collected data to generate workflows for new queries

→ Fine-tuned Llama-3.1-8B on this dataset to create WorkflowLlama
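
A minimal sketch of what a single training sample might look like once a collected Shortcut has been converted to Python and annotated with a hierarchical thought. The field names, prompt template, and example content are assumptions for illustration, not the paper's exact format.

```python
# Hypothetical sketch of serializing one WorkflowBench-style sample
# (task query + hierarchical thought + Python workflow code) into a
# prompt/completion pair for supervised fine-tuning. Field names and the
# template are assumptions, not the paper's exact format.

sample = {
    "query": "Every evening, summarize my unread emails and log the summary.",
    "thought": ("Plan:\n"
                "1. Fetch unread emails from the inbox.\n"
                "2. Build a short summary from their subjects.\n"
                "3. Append the summary to a daily note and send a notification."),
    "code": ("emails = get_unread_emails(folder='Inbox')\n"
             "summary = '\\n'.join(e['subject'] for e in emails)\n"
             "append_to_note(note='Daily log', text=summary)\n"
             "send_notification(f'{len(emails)} unread emails summarized')"),
}

def to_training_text(s):
    # Train the model to emit the plan (as comments) before the code,
    # which is the hierarchical-thought idea in its simplest form.
    prompt = f"### Task\n{s['query']}\n\n### Workflow\n"
    plan_comments = "\n".join("# " + line for line in s["thought"].splitlines())
    completion = f"{plan_comments}\n{s['code']}"
    return prompt, completion

prompt, completion = to_training_text(sample)
print(prompt + completion)
```

ChatGPT-generated queries and annotator-model outputs can be serialized the same way, and fine-tuning Llama-3.1-8B on the resulting ~106k pairs is then ordinary supervised training; the heavy lifting is in the data construction.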

-----

Key Insights from this Paper 💡:

→ Data quality and scale are crucial for workflow orchestration capability

→ Three-phase data construction ensures diversity and complexity

→ Hierarchical thought generation improves model understanding

→ Quality confirmation steps maintain dataset integrity

-----

Results 📊:

→ Outperformed all baselines including GPT-4

→ Handled complex workflows with 70+ actions, versus around 6 actions for GPT-4

→ Demonstrated strong generalization to unseen APIs and instructions

→ Achieved 77.5% F1 score on out-of-distribution T-Eval benchmark
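
For context on the metric: F1 here is the harmonic mean of precision and recall between predicted and reference tool calls (the exact matching rules are defined by T-Eval itself). A minimal illustration of the arithmetic:

```python
# Minimal illustration of an F1 score over predicted vs. reference actions.
# T-Eval defines its own matching rules; this only shows the computation.

def f1_score(predicted, reference):
    predicted, reference = set(predicted), set(reference)
    if not predicted or not reference:
        return 0.0
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score(predicted={"get_unread_emails", "append_to_note", "send_notification"},
               reference={"get_unread_emails", "append_to_note"}))  # 0.8
```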
