"Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models"

The podcast on this paper is generated with Google's Illuminate.

LLMs can judge their own outputs better than we thought, but only after reaching a certain size.

This paper mathematically formalizes how LLMs can verify and improve their own outputs, building the analysis around a quantity called the generation-verification gap.

-----

https://arxiv.org/abs/2412.02674

🤔 Original Problem:

→ While LLMs can generate synthetic training data, using this data without verification can harm performance

→ Existing verification methods rely on expensive human annotation or on a stronger external model, neither of which is always available

-----

🛠️ Solution in this Paper:

→ The paper introduces a mathematical framework for self-improvement where models verify their own outputs

→ It defines the "generation-verification gap" metric: the performance gain a model obtains by verifying its own generations over simply generating

→ The framework has three stages: generate multiple responses, self-verify them, and update the model on the verified data (a minimal sketch of this loop follows below)
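
To make the three stages concrete, here is a minimal Python sketch of one self-improvement round. The callables `generate`, `self_verify`, and `finetune` are hypothetical stand-ins for the model's sampling, self-scoring, and fine-tuning steps; the paper does not prescribe these exact interfaces.

```python
import random

def self_improvement_round(generate, self_verify, finetune, prompts,
                           n_samples=8, threshold=0.5):
    """One round of generate -> self-verify -> update."""
    verified = []
    for prompt in prompts:
        # 1. Generation: sample several candidate responses per prompt.
        candidates = [generate(prompt) for _ in range(n_samples)]
        # 2. Self-verification: the same model scores each candidate.
        scored = [(c, self_verify(prompt, c)) for c in candidates]
        # 3. Filter: keep only candidates the model itself accepts.
        verified += [(prompt, c) for c, s in scored if s >= threshold]
    # 4. Update: fine-tune on the self-verified pairs.
    return finetune(verified)

# Toy run with stub callables; the paper finds gains saturate within 2-3 rounds.
generate = lambda p: f"answer-{random.randint(0, 9)} to {p}"
self_verify = lambda p, c: random.random()      # stand-in confidence score
finetune = lambda data: print(f"fine-tuning on {len(data)} verified examples")

for _ in range(3):
    self_improvement_round(generate, self_verify, finetune, ["q1", "q2"])
```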

-----

💡 Key Insights:

→ Only larger models (above 7B parameters) show meaningful self-improvement capabilities

→ Chain-of-Thought verification performs more reliably than simple multiple-choice verification

→ The relative generation-verification gap scales linearly with the model's pre-training compute (one way to compute the gap is sketched after this list)

→ Self-improvement saturates after 2-3 iterations
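
As a rough illustration of the metric, not the paper's exact definition, the sketch below reads self-verification as best-of-n selection: the gap is the utility of the verifier's top-ranked sample minus the average utility of raw samples, and the "relative" gap is assumed here to normalize by the remaining headroom.

```python
from statistics import mean

def generation_verification_gap(samples, scores, is_correct):
    """Gap for one prompt under a best-of-n reading of self-verification."""
    generation_perf = mean(is_correct(s) for s in samples)
    top_pick = max(zip(samples, scores), key=lambda pair: pair[1])[0]
    verification_perf = is_correct(top_pick)
    gap = verification_perf - generation_perf
    # Assumed normalization: relative gap = gap over the remaining headroom.
    relative_gap = gap / max(1.0 - generation_perf, 1e-9)
    return gap, relative_gap

# Example: 4 sampled answers, one correct, and the verifier ranks it highest.
samples = ["A", "B", "C", "D"]
scores = [0.2, 0.9, 0.4, 0.1]
gap, rel = generation_verification_gap(samples, scores, lambda s: float(s == "B"))
print(gap, rel)  # 0.75 1.0
```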

-----

📊 Results:

→ Large models (72B parameters) showed a 200% accuracy improvement on Sudoku tasks

→ The generation-verification gap increases monotonically with pre-training compute

→ Cross-verification shows the gap increases with verifier capability and decreases with generator capability (a toy sweep is sketched below)
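
A toy version of such a cross-verification sweep, assuming `generators` and `verifiers` are dicts of callables and `is_correct` is a ground-truth scorer; none of these names come from the paper.

```python
from itertools import product

def cross_verification_gaps(generators, verifiers, prompts, is_correct, n=8):
    """Sweep every (generator, verifier) pair and record the best-of-n gap."""
    grid = {}
    for (g_name, gen), (v_name, ver) in product(generators.items(),
                                                verifiers.items()):
        gaps = []
        for prompt in prompts:
            samples = [gen(prompt) for _ in range(n)]
            base = sum(is_correct(prompt, s) for s in samples) / n
            picked = max(samples, key=lambda s: ver(prompt, s))
            gaps.append(is_correct(prompt, picked) - base)
        grid[(g_name, v_name)] = sum(gaps) / len(gaps)
    # Reported trend: entries grow down the verifier axis and shrink along
    # the generator axis, since stronger generators leave less headroom.
    return grid
```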
