DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing
DocETL turns messy document processing into a smart puzzle that LLMs can actually solve correctly.
Original Problem 🔍:
Existing LLM-powered data processing frameworks focus on reducing cost rather than improving accuracy on complex document processing tasks. DocETL targets that gap.
Solution in this Paper 🛠️:
• Introduces DocETL, a declarative system for optimizing complex document processing pipelines
• Employs novel rewrite directives for LLM-based operators
• Utilizes an agent-based framework for plan rewriting and evaluation
• Implements an opportunistic optimization strategy for recursive plan refinement
• Offers a two-stage evaluation process for selecting optimal plans
Key Insights from this Paper 💡:
• LLM outputs for complex tasks often require decomposition for accuracy
• Agent-driven optimization can significantly improve pipeline performance
• Task-specific validation is crucial for effective plan evaluation
• Balancing human intuition with automated optimization is essential
Results 📊:
• Police misconduct identification: 1.34x more accurate than baseline
• Polarizing feature analysis: 4.55x more distinct quotes, 4.60x more distinct games referenced
• Declassified document analysis: 1.82x more comprehensive extracted information
• Outperforms LOTUS on the Biodex task: 16% improvement in RP@5, 21% in RP@10
• Achieves an 18% higher F1 score than Palimpzest on a medical schema matching task
🔍 DocETL's architecture includes several key components and innovations:
A declarative YAML-based interface for users to author pipelines with operators specific to the LLM setting.
Novel rewrite directives that guide LLM agents in rewriting query plans. These directives are more abstract than traditional database rewrite rules and can be interpreted by LLMs in the context of specific tasks and data characteristics.
An agent-based framework for plan rewriting and evaluation. Generation agents apply rewrite directives to create diverse candidate plans, while validation agents assess the effectiveness of optimized sub-pipelines.
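To make the generate-then-validate split concrete, here is a minimal Python sketch. It is not DocETL's implementation: the `Plan` class, the two directive functions, and the `toy_judge` scorer are hypothetical stand-ins for what would be LLM agents in the real system. It only shows the control flow: directives propose candidate plans, and a validator scores each one on sample documents.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Plan:
    """A pipeline, reduced here to an ordered tuple of operator names."""
    ops: Tuple[str, ...]


# A directive takes a plan and proposes zero or more rewritten plans.
Directive = Callable[[Plan], List[Plan]]


def decompose_map(plan: Plan) -> List[Plan]:
    """Hypothetical directive: replace the first 'map' with split -> map -> reduce."""
    if "map" not in plan.ops:
        return []
    i = plan.ops.index("map")
    return [Plan(plan.ops[:i] + ("split", "map", "reduce") + plan.ops[i + 1:])]


def add_refinement(plan: Plan) -> List[Plan]:
    """Hypothetical directive: add a 'refine' pass right after the first 'map'."""
    if "map" not in plan.ops:
        return []
    i = plan.ops.index("map")
    return [Plan(plan.ops[:i + 1] + ("refine",) + plan.ops[i + 1:])]


def generate_candidates(plan: Plan, directives: List[Directive]) -> List[Plan]:
    """Generation step: keep the original plan and add every rewrite proposed."""
    candidates = [plan]
    for directive in directives:
        candidates.extend(directive(plan))
    return candidates


def validate(plan: Plan, judge: Callable[[Plan, str], float], sample_docs: List[str]) -> float:
    """Validation step: average the judge's score over a sample of documents."""
    scores = [judge(plan, doc) for doc in sample_docs]
    return sum(scores) / len(scores)


def toy_judge(plan: Plan, doc: str) -> float:
    """Assumption for illustration only: longer documents benefit from a 'split' step."""
    return 1.0 if (len(doc) > 20) == ("split" in plan.ops) else 0.0


candidates = generate_candidates(Plan(("map",)), [decompose_map, add_refinement])
sample_docs = ["a very long report " * 10, "another long transcript " * 10]
scores = {c.ops: validate(c, toy_judge, sample_docs) for c in candidates}
```

In this toy setup only the decomposed plan scores well, because every sample document is long. A real validation agent would judge output quality with task-specific criteria rather than a fixed rule.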
An opportunistic optimization strategy that recursively optimizes new operations created as part of a rewrite, allowing for more refined plans to be explored efficiently.
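The recursive idea can be sketched in a few lines of Python. This is an illustrative simplification, not the paper's algorithm: the `needs_rewrite` and `rewrite` callables, the `max_depth` cutoff, and the operation names are all assumptions. The point is that each operation produced by a rewrite is itself optimized right away, before the optimizer moves on.

```python
from typing import Callable, Dict, List


def optimize(
    op: str,
    needs_rewrite: Callable[[str], bool],
    rewrite: Callable[[str], List[str]],
    depth: int = 0,
    max_depth: int = 3,
) -> List[str]:
    """Return the optimized operation sequence that replaces `op`.

    If the operation is judged good enough (or the depth limit is hit),
    it is kept unchanged. Otherwise it is rewritten, and every newly
    created operation is optimized recursively before being added.
    """
    if depth >= max_depth or not needs_rewrite(op):
        return [op]
    optimized: List[str] = []
    for new_op in rewrite(op):
        optimized.extend(optimize(new_op, needs_rewrite, rewrite, depth + 1, max_depth))
    return optimized


# Hypothetical rewrite table standing in for agent decisions.
rules: Dict[str, List[str]] = {
    "analyze_report": ["split_report", "analyze_chunk", "merge_findings"],
    "analyze_chunk": ["extract_entities", "summarize_chunk"],
}

plan = optimize("analyze_report", lambda op: op in rules, lambda op: rules[op])
shallow_plan = optimize("analyze_report", lambda op: op in rules, lambda op: rules[op], max_depth=1)
```

Here `plan` expands `analyze_chunk` a second time because it was created by the first rewrite, while `shallow_plan` stops after one level. The depth limit is one simple way to bound the search; the source does not specify how DocETL bounds it.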
A two-stage evaluation process for selecting the best plan, combining average ratings and pairwise comparisons of top-rated plans.
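A rough Python sketch of such a two-stage selection follows. The function name `select_plan`, the cutoff `k`, the tie-breaking rule, and the sample ratings are assumptions made for illustration; in the real system both the ratings and the pairwise preferences would come from validation agents.

```python
from itertools import combinations
from statistics import mean
from typing import Callable, Dict, List


def select_plan(
    ratings: Dict[str, List[float]],
    prefer: Callable[[str, str], str],
    k: int = 3,
) -> str:
    """Pick a plan in two stages.

    Stage 1: rank plans by their average rating and keep the top k.
    Stage 2: compare every pair of finalists and count wins; the plan
    with the most wins is selected (ties broken by average rating).
    `prefer(a, b)` must return either a or b.
    """
    avg = {plan: mean(scores) for plan, scores in ratings.items()}
    finalists = sorted(avg, key=lambda p: avg[p], reverse=True)[:k]

    wins = {plan: 0 for plan in finalists}
    for a, b in combinations(finalists, 2):
        wins[prefer(a, b)] += 1

    return max(finalists, key=lambda p: (wins[p], avg[p]))


# Hypothetical per-sample ratings for four candidate plans.
ratings = {
    "baseline": [2, 3, 2],
    "split_then_map": [4, 4, 5],
    "map_with_refine": [4, 4, 4],
    "chain_of_maps": [3, 3, 4],
}


def toy_prefer(a: str, b: str) -> str:
    """Stand-in judge: prefers 'map_with_refine' when present, else the first plan."""
    return "map_with_refine" if "map_with_refine" in (a, b) else a


best = select_plan(ratings, toy_prefer, k=3)
```

In this example `split_then_map` has the highest average, yet `map_with_refine` is selected because it wins both of its head-to-head comparisons. That is the reason for a second stage: averages can hide differences that a direct comparison exposes.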