
LLM Self-Correction with DECRIM: DECOMPOSE, CRITIQUE, AND REFINE for Enhanced Following of Instructions with Multiple Constraints

The podcast on this paper is generated with Google's Illuminate.

Great paper addressing the performance gap between open-source and proprietary models.

It proposes a DECOMPOSE, CRITIQUE, AND REFINE (DECRIM) self-correction pipeline, which enhances LLMs’ ability to follow constraints. DECRIM works by decomposing the original instruction into a list of constraints and using a Critic model to decide when and where the LLM’s response needs refinement.

📚 https://arxiv.org/pdf/2410.06458

Original Problem 🔍:

LLMs struggle to follow instructions with multiple constraints: even state-of-the-art models fail to meet at least one constraint in over 21% of real-world user requests. Existing benchmarks rely on synthetic data and miss this real-world complexity.

-----

Solution in this Paper 💡:

• Introduces REALINSTRUCT benchmark using real user requests to AI assistants

• Proposes DECRIM (DECOMPOSE, CRITIQUE, AND REFINE) self-correction pipeline:

- Decomposes instructions into granular constraints

- Uses Critic model to evaluate constraint satisfaction

- Refines output iteratively based on Critic feedback (a minimal sketch of this loop follows the list)

• Investigates LLM-as-a-Judge for cost-effective constraint satisfaction evaluation
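The core loop is simple enough to sketch. Below is a minimal, illustrative Python version: `complete(prompt)` stands in for any chat-completion call (an open-source model behind an API, for instance), and the prompt wording, parsing, and stopping rule are assumptions for illustration, not the paper's exact templates.

```python
# Minimal sketch of the Decompose, Critique, and Refine loop.
# `complete(prompt)` is a placeholder for any LLM completion call;
# prompts and parsing here are illustrative, not the paper's templates.

def complete(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def decompose(instruction: str) -> list[str]:
    """Ask the LLM to list the instruction's individual constraints."""
    out = complete(
        "List every explicit constraint in the following request, "
        f"one per line:\n\n{instruction}"
    )
    return [line.strip("- ").strip() for line in out.splitlines() if line.strip()]

def critique(instruction: str, response: str, constraint: str) -> tuple[bool, str]:
    """Critic decides whether one granular constraint is satisfied."""
    verdict = complete(
        f"Request: {instruction}\nResponse: {response}\n"
        f"Constraint: {constraint}\n"
        "Is the constraint satisfied? Answer YES or NO, then explain briefly."
    )
    return verdict.lstrip().upper().startswith("YES"), verdict

def decrim(instruction: str, max_rounds: int = 3) -> str:
    response = complete(instruction)
    constraints = decompose(instruction)
    for _ in range(max_rounds):
        feedback = [
            fb for c in constraints
            for ok, fb in [critique(instruction, response, c)] if not ok
        ]
        if not feedback:          # Critic found no violations: stop early
            break
        response = complete(      # Refine using the Critic's feedback
            f"Request: {instruction}\nDraft response: {response}\n"
            "The draft violates these constraints:\n" + "\n".join(feedback)
            + "\nRewrite the response so all constraints are satisfied."
        )
    return response
```

The design point DECRIM banks on: checking one granular constraint at a time is an easier job for the Critic than judging the whole response against the whole instruction at once.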

-----

Key Insights from this Paper 💡:

• Real user requests often contain multiple, complex constraints (e.g., a single request may fix the topic, length, format, and tone at once)

• LLMs, including GPT-4, struggle with multi-constrained instructions

• Open-source LLMs can match or even surpass GPT-4 when DECRIM is run with strong feedback

• Critic quality is crucial for DECRIM's success

-----

Results 📊:

• DECRIM improves Mistral's performance by 7.3% on REALINSTRUCT and 8.0% on IFEval with weak feedback

• With strong feedback, DECRIM enables open-source LLMs to outperform GPT-4 on both benchmarks

• GPT-4-Turbo with Chain-of-Thought prompting is a reliable, cost-effective alternative to human evaluation for constraint satisfaction (sketched below)
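To make the evaluation recipe concrete, here is a hedged sketch of per-constraint LLM-as-a-Judge with Chain-of-Thought; the prompt and verdict parsing are illustrative assumptions, not the paper's exact protocol. `complete` is any text-completion callable (the paper uses GPT-4-Turbo in this role).

```python
# Illustrative LLM-as-a-Judge with Chain-of-Thought for per-constraint
# evaluation; prompt and parsing are assumptions, not the paper's protocol.

from typing import Callable

def judge_constraint(instruction: str, response: str, constraint: str,
                     complete: Callable[[str], str]) -> bool:
    """Ask the judge to reason step by step, then parse its final verdict."""
    verdict = complete(
        f"Instruction: {instruction}\n"
        f"Response: {response}\n"
        f"Constraint: {constraint}\n"
        "Think step by step about whether the response satisfies the "
        "constraint, then end with a final line 'Answer: YES' or 'Answer: NO'."
    )
    last_line = verdict.rstrip().splitlines()[-1]
    return last_line.strip().upper().endswith("YES")

def satisfaction_rate(examples, complete: Callable[[str], str]) -> float:
    """Share of (instruction, response, constraints) triples in which
    every decomposed constraint is judged satisfied."""
    hits = sum(
        all(judge_constraint(ins, res, c, complete) for c in cons)
        for ins, res, cons in examples
    )
    return hits / len(examples)
```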