LLMs can now judge their own outputs better than we thought, but only after reaching a certain size.
This paper mathematically formalizes how LLMs can verify and improve their own outputs, centered on a metric called the generation-verification gap.
-----
https://arxiv.org/abs/2412.02674
🤔 Original Problem:
→ While LLMs can generate synthetic training data, using this data without verification can harm performance
→ Current verification methods require either expensive human annotation or a stronger model, which isn't always feasible
-----
🛠️ Solution in this Paper:
→ The paper introduces a mathematical framework for self-improvement in which a model verifies its own outputs
→ It defines the "generation-verification gap", a metric for how much the model's own verification improves on its raw generations (roughly formalized below)
→ The framework loops through three components: generating multiple responses, self-verifying those responses, and updating the model on the verified data (see the code sketch after the formula)
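As a rough formalization of that metric (my notation, not necessarily the paper's exact definition): let u(y) be task utility, p the model's generation distribution, and p_v the same distribution reweighted by the model's own verification scores. Then:

```latex
% Generation-verification gap (approximate notation, hedged)
\mathrm{GV\text{-}gap} = \mathbb{E}_{y \sim p_v}[u(y)] - \mathbb{E}_{y \sim p}[u(y)]
% A "relative" version would normalize by baseline generation
% performance, e.g. dividing by E_{y ~ p}[u(y)]; the paper's
% exact normalization may differ.
```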
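And a minimal sketch of the three-component loop, assuming a hypothetical `model` object with `generate`, `verify`, and `finetune` methods (these names are placeholders, not the paper's code):

```python
# Sketch of one self-improvement round: generate -> self-verify -> update.
# All interfaces here are hypothetical placeholders.

def self_improvement_round(model, prompts, n_samples=8, threshold=0.5):
    """Sample candidates, keep the ones the model itself verifies,
    then fine-tune on the surviving prompt-response pairs."""
    verified_data = []
    for prompt in prompts:
        # 1. Generation: sample multiple candidate responses per prompt.
        candidates = [model.generate(prompt) for _ in range(n_samples)]
        # 2. Self-verification: the same model scores each candidate.
        scores = [model.verify(prompt, c) for c in candidates]
        # 3. Filter: keep candidates the model judges as correct.
        verified_data += [(prompt, c) for c, s in zip(candidates, scores)
                          if s >= threshold]
    # 4. Update: fine-tune the model on its own verified generations.
    return model.finetune(verified_data)
```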
-----
💡 Key Insights:
→ Only larger models (above 7B parameters) show meaningful self-improvement capabilities
→ Chain-of-Thought verification performs more reliably than simple multiple-choice verification (both formats are sketched below)
→ The relative generation-verification gap scales linearly with the model's pre-training compute
→ Self-improvement saturates after 2-3 iterations
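To make the verification-format comparison concrete, here is a hedged sketch of the two styles; the prompt wording and the `model.generate` interface are illustrative assumptions, not the paper's actual prompts:

```python
# Two self-verification formats (illustrative prompts, not the paper's).

def mc_verify(model, prompt, candidate):
    # Multiple-choice style: ask for a bare yes/no verdict in one shot.
    q = (f"{prompt}\nProposed answer: {candidate}\n"
         "Is this answer correct? (A) Yes (B) No\nAnswer:")
    return model.generate(q).strip().startswith("(A)")

def cot_verify(model, prompt, candidate):
    # Chain-of-Thought style: reason step by step before the verdict,
    # which the paper found to be the more reliable verifier.
    q = (f"{prompt}\nProposed answer: {candidate}\n"
         "Check the answer step by step, then conclude with "
         "'VERDICT: correct' or 'VERDICT: incorrect'.")
    return "VERDICT: correct" in model.generate(q)
```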
-----
📊 Results:
→ Large models (72B parameters) showed a 200% accuracy improvement on Sudoku tasks
→ The generation-verification gap increases monotonically with pre-training compute
→ Cross-verification shows the gap increases with verifier capability and decreases with generator capability