Great paper addressing the performance gap between open-source and proprietary models.
It proposes a DECOMPOSE, CRITIQUE, AND REFINE (DECRIM) self-correction pipeline, which enhances LLMs’ ability to follow constraints. DECRIM works by decomposing the original instruction into a list of constraints and using a Critic model to decide when and where the LLM’s response needs refinement.
📚 https://arxiv.org/pdf/2410.06458
Original Problem 🔍:
LLMs struggle to follow instructions with multiple constraints, failing to meet at least one constraint in over 21% of real-world user requests. Existing benchmarks rely on synthetic data, which fails to capture real-world complexity.
-----
Solution in this Paper 💡:
• Introduces REALINSTRUCT benchmark using real user requests to AI assistants
• Proposes DECRIM (DECOMPOSE, CRITIQUE, AND REFINE) self-correction pipeline (see the sketch after this list):
- Decomposes instructions into granular constraints
- Uses Critic model to evaluate constraint satisfaction
- Refines output iteratively based on Critic feedback
• Investigates LLM-as-a-Judge for cost-effective constraint satisfaction evaluation
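
To make the pipeline concrete, here is a minimal Python sketch of the decompose → critique → refine loop. The `call_llm()` helper and all prompt wordings are illustrative placeholders, not the paper's actual prompts or models:

```python
# Minimal sketch of the DECRIM loop, assuming a generic chat-completion
# helper. call_llm() and the prompt texts are placeholders for
# illustration, not the authors' exact prompts.

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API call (e.g., Mistral, GPT-4)."""
    raise NotImplementedError


def decrim(instruction: str, max_iters: int = 3) -> str:
    # Decompose: split the instruction into granular constraints.
    constraints = [
        c.strip()
        for c in call_llm(
            "List each individual constraint in this instruction, "
            f"one per line:\n{instruction}"
        ).splitlines()
        if c.strip()
    ]

    response = call_llm(instruction)  # initial attempt

    for _ in range(max_iters):
        # Critique: ask the Critic model which constraints are violated.
        failed = [
            c
            for c in constraints
            if call_llm(
                f"Constraint: {c}\nResponse: {response}\n"
                "Is the constraint satisfied? Answer YES or NO."
            ).strip().upper().startswith("NO")
        ]
        if not failed:
            break  # all constraints satisfied; no refinement needed
        # Refine: regenerate, conditioned on the Critic's feedback.
        response = call_llm(
            f"{instruction}\n\nPrevious response:\n{response}\n\n"
            "It violates these constraints:\n"
            + "\n".join(f"- {c}" for c in failed)
            + "\nRewrite the response so that every constraint is satisfied."
        )
    return response
```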
-----
Key Insights from this Paper 💡:
• Real user requests often contain multiple, complex constraints
• LLMs, including GPT-4, struggle with multi-constrained instructions
• Open-source LLMs can match/surpass GPT-4 with strong feedback in DECRIM
• Critic quality is crucial for DECRIM's success
-----
Results 📊:
• DECRIM improves Mistral's performance by 7.3% on REALINSTRUCT and 8.0% on IFEval with weak feedback
• With strong feedback, DECRIM enables open-source LLMs to outperform GPT-4 on both benchmarks
• GPT-4-Turbo with Chain-of-Thought prompting is a reliable, cost-effective alternative to human evaluation for constraint satisfaction (see the sketch below)
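
For illustration, here is a hedged sketch of per-constraint LLM-as-a-Judge evaluation with chain-of-thought, in the spirit of the paper's GPT-4-Turbo setup; `judge_llm()` and the prompt wording are assumptions, not the paper's exact configuration:

```python
# Hedged sketch of LLM-as-a-Judge constraint evaluation with
# chain-of-thought. judge_llm() and the prompt are assumed for
# illustration, not taken from the paper.

JUDGE_PROMPT = (
    "You are checking whether a response satisfies a single constraint.\n"
    "Constraint: {constraint}\n"
    "Response: {response}\n"
    "Reason step by step, then end with exactly one line: "
    "SATISFIED or VIOLATED."
)


def judge_llm(prompt: str) -> str:
    """Placeholder for a GPT-4-Turbo-style chat-completion call."""
    raise NotImplementedError


def constraint_satisfied(constraint: str, response: str) -> bool:
    verdict = judge_llm(JUDGE_PROMPT.format(constraint=constraint, response=response))
    # Take the final line as the verdict; everything before it is the CoT.
    return verdict.strip().splitlines()[-1].strip().upper() == "SATISFIED"
```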