Great paper addressing the performance gap between open-source and proprietary models.
It proposes a DECOMPOSE, CRITIQUE, AND REFINE (DECRIM) self-correction pipeline, which enhances LLMs’ ability to follow constraints. DECRIM works by decomposing the original instruction into a list of constraints and using a Critic model to decide when and where the LLM’s response needs refinement.
📚 https://arxiv.org/pdf/2410.06458
Original Problem 🔍:
LLMs struggle to follow instructions with multiple constraints, failing to meet at least one constraint in over 21% of real-world user requests. Existing benchmarks rely on synthetic data, which fails to capture real-world complexity.
-----
Solution in this Paper 💡:
• Introduces REALINSTRUCT benchmark using real user requests to AI assistants
• Proposes DECRIM (DECOMPOSE, CRITIQUE, AND REFINE) self-correction pipeline (see the sketch after this list):
- Decomposes instructions into granular constraints
- Uses Critic model to evaluate constraint satisfaction
- Refines output iteratively based on Critic feedback
• Investigates LLM-as-a-Judge for cost-effective constraint satisfaction evaluation
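
To make the pipeline concrete, here is a minimal Python sketch of the decompose → critique → refine loop. The `call_llm()` helper and all prompt wordings are illustrative placeholders, not the paper's actual prompts or models:

```python
# Minimal sketch of the DECRIM loop, assuming a generic chat-completion
# helper. call_llm() and the prompt texts are placeholders for
# illustration, not the authors' exact prompts.

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API call (e.g., Mistral, GPT-4)."""
    raise NotImplementedError


def decrim(instruction: str, max_iters: int = 3) -> str:
    # Decompose: split the instruction into granular constraints.
    constraints = [
        c.strip()
        for c in call_llm(
            "List each individual constraint in this instruction, "
            f"one per line:\n{instruction}"
        ).splitlines()
        if c.strip()
    ]

    response = call_llm(instruction)  # initial attempt

    for _ in range(max_iters):
        # Critique: ask the Critic model which constraints are violated.
        failed = [
            c
            for c in constraints
            if call_llm(
                f"Constraint: {c}\nResponse: {response}\n"
                "Is the constraint satisfied? Answer YES or NO."
            ).strip().upper().startswith("NO")
        ]
        if not failed:
            break  # all constraints satisfied; no refinement needed
        # Refine: regenerate, conditioned on the Critic's feedback.
        response = call_llm(
            f"{instruction}\n\nPrevious response:\n{response}\n\n"
            "It violates these constraints:\n"
            + "\n".join(f"- {c}" for c in failed)
            + "\nRewrite the response so that every constraint is satisfied."
        )
    return response
```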
-----
Key Insights from this Paper 💡:
• Real user requests often contain multiple, complex constraints
• LLMs, including GPT-4, struggle with multi-constrained instructions
• Open-source LLMs can match/surpass GPT-4 with strong feedback in DECRIM
• Critic quality is crucial for DECRIM's success
-----
Results 📊:
• DECRIM improves Mistral's performance by 7.3% on REALINSTRUCT and 8.0% on IFEval with weak feedback
• With strong feedback, DECRIM enables open-source LLMs to outperform GPT-4 on both benchmarks
• GPT-4-Turbo with Chain-of-Thought prompting is a reliable, cost-effective alternative to human evaluation for constraint satisfaction (see the sketch below)
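
For illustration, here is a hedged sketch of per-constraint LLM-as-a-Judge evaluation with chain-of-thought, in the spirit of the paper's GPT-4-Turbo setup; `judge_llm()` and the prompt wording are assumptions, not the paper's exact configuration:

```python
# Hedged sketch of LLM-as-a-Judge constraint evaluation with
# chain-of-thought. judge_llm() and the prompt are assumed for
# illustration, not taken from the paper.

JUDGE_PROMPT = (
    "You are checking whether a response satisfies a single constraint.\n"
    "Constraint: {constraint}\n"
    "Response: {response}\n"
    "Reason step by step, then end with exactly one line: "
    "SATISFIED or VIOLATED."
)


def judge_llm(prompt: str) -> str:
    """Placeholder for a GPT-4-Turbo-style chat-completion call."""
    raise NotImplementedError


def constraint_satisfied(constraint: str, response: str) -> bool:
    verdict = judge_llm(JUDGE_PROMPT.format(constraint=constraint, response=response))
    # Take the final line as the verdict; everything before it is the CoT.
    return verdict.strip().splitlines()[-1].strip().upper() == "SATISFIED"
```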