Great paper from @Meta
Synthetic data and iterative self-improvement are all you need.
No humans needed in the evaluation loop.
This paper introduces a self-improving evaluator that learns to assess LLM outputs without human feedback, using synthetic data and iterative self-training to match top human-supervised models.
-----
📚 https://arxiv.org/pdf/2408.02666
Original Problem 🤔:
Building strong LLM evaluators typically requires extensive human preference data, which is costly and becomes outdated as models improve. Current approaches rely heavily on human annotations, limiting scalability and adaptability.
-----
Solution in this Paper 🔧:
→ The method starts with unlabeled instructions and uses a seed LLM to generate contrasting response pairs, where one is intentionally inferior.
→ It then uses an LLM-as-Judge approach to generate reasoning traces and final judgments for these synthetic pairs.
→ The system filters correct judgments and uses them to train an improved evaluator model.
→ This process repeats iteratively, with each improved evaluator producing better judgments on the synthetic pairs to train the next round (see the sketch after this list).
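Here's a minimal Python sketch of that loop, under stated assumptions: the callables generate_bad_response, judge_pair, and finetune are hypothetical placeholders for the prompting and fine-tuning infrastructure (not from the paper), and the sampling and filtering details are simplified.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Judgment:
    reasoning: str   # chain-of-thought reasoning trace from the judge
    preferred: str   # "A" (the known-better response) or "B"

def self_taught_evaluator(
    prompts: List[str],
    good_responses: List[str],
    seed_model,
    generate_bad_response: Callable,  # hypothetical: prompts the model for a worse answer
    judge_pair: Callable,             # hypothetical: LLM-as-Judge call returning a Judgment
    finetune: Callable,               # hypothetical: fine-tunes on the filtered judgments
    num_iterations: int = 3,
    samples_per_pair: int = 8,
):
    """Iteratively trains an evaluator from synthetic preference pairs, with no human labels."""
    model = seed_model
    for _ in range(num_iterations):
        training_examples = []
        for prompt, good in zip(prompts, good_responses):
            # 1. Build a contrasting pair: ask the current model for a
            #    deliberately inferior response to the same prompt.
            bad = generate_bad_response(model, prompt, good)

            # 2. LLM-as-Judge: sample several reasoning traces + verdicts for the pair.
            judgments = [judge_pair(model, prompt, good, bad)
                         for _ in range(samples_per_pair)]

            # 3. Keep only judgments that pick the known-better response ("A");
            #    these become the synthetic supervised training data.
            training_examples += [(prompt, good, bad, j)
                                  for j in judgments if j.preferred == "A"]

        # 4. Fine-tune on the filtered judgments; the improved model becomes
        #    the judge (and data generator) for the next iteration.
        model = finetune(model, training_examples)
    return model
```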
-----
Key Insights from this Paper 💡:
→ Human preference data isn't necessary for training strong LLM evaluators
→ Synthetic data generation with iterative self-improvement can match human-supervised approaches
→ Different data sources (safety, math, coding) improve performance in their respective domains
-----
Results 📊:
→ Improved RewardBench accuracy from 75.4 to 88.3 (88.7 with majority voting)
→ Outperformed GPT-4 (84.3) and matched top reward models trained with human data
→ Achieved 79.5% agreement with human judgments on MT-Bench using majority voting