LLMs can self-improve by learning to trust their own judgment about response quality
This paper introduces "sharpening," a mechanism for improving LLM performance without external data by exploiting the fact that models can often verify responses better than they can generate them. The paper formalizes self-improvement in a statistical framework and proves guarantees for both SFT-based and RLHF-based approaches.
-----
https://arxiv.org/abs/2412.01951
🤔 Original Problem:
→ LLMs often verify responses better than they generate them, yet this generation-verification gap is not systematically exploited for improvement
→ Self-improvement without external feedback appears to violate the data processing inequality, calling its feasibility into question
-----
🔧 Solution in this Paper:
→ Introduces "sharpening" - tilting the model's distribution toward responses to which it assigns higher self-reward
→ Uses the model's own verification ability via self-reward functions such as the sequence log-probability
→ Implements two approaches: SFT-Sharpening filters high-self-reward responses (best-of-N) for fine-tuning, while RLHF-Sharpening optimizes the self-reward directly (see the sketch after this list)
→ Develops a statistical framework that measures efficiency in sample-and-evaluate queries to the base model
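SFT-Sharpening amounts to best-of-N filtering under the model's own self-reward, followed by ordinary fine-tuning on the selected responses. Below is a minimal sketch of that loop; `sample`, `log_prob`, and `finetune` are hypothetical stand-ins for whatever inference/training stack is used, and the length-normalized log-probability is one illustrative choice of self-reward, not necessarily the paper's exact definition.

```python
# Minimal sketch of SFT-Sharpening (best-of-N self-training).
# `sample`, `log_prob`, and `finetune` are hypothetical interfaces,
# not APIs from the paper or any specific library.
from typing import Callable, List, Tuple

def self_reward(log_prob: Callable[[str, str], float], prompt: str, response: str) -> float:
    """Self-reward: the model's own log-probability of its response
    (length-normalized here, one common choice to avoid favoring short outputs)."""
    return log_prob(prompt, response) / max(len(response.split()), 1)

def sft_sharpen(
    prompts: List[str],
    sample: Callable[[str], str],           # draws one response from the base model
    log_prob: Callable[[str, str], float],  # base-model log p(response | prompt)
    finetune: Callable[[List[Tuple[str, str]]], None],
    n: int = 8,                             # best-of-n candidates per prompt
) -> None:
    """Keep each prompt's best-of-n response by self-reward, then fine-tune on the winners."""
    dataset: List[Tuple[str, str]] = []
    for prompt in prompts:
        candidates = [sample(prompt) for _ in range(n)]
        best = max(candidates, key=lambda r: self_reward(log_prob, prompt, r))
        dataset.append((prompt, best))
    # Supervised fine-tuning on the self-selected responses amortizes
    # best-of-n sampling into the model's single-sample behavior.
    finetune(dataset)
```

RLHF-Sharpening would reuse the same self-reward but optimize it directly with an online RL or preference-based objective, which is what lets it explore beyond the responses that best-of-N sampling happens to surface.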
-----
💡 Key Insights:
→ Models contain "hidden knowledge" that is easier to verify than to generate
→ Coverage (the probability mass the base model places on high-quality responses) fundamentally limits self-improvement (formalized below)
→ RLHF-Sharpening can bypass coverage requirements through online exploration
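To make the coverage point concrete, here is a hedged formalization consistent with the summary above; r_self, pi_sharp, and C_cov are illustrative symbols, and the paper's exact definitions and normalizations may differ.

```latex
% A plausible formalization of the summary above; notation may differ from the paper.
\begin{align*}
  % Self-reward: the base model's own log-likelihood of response y to prompt x.
  r_{\mathrm{self}}(y \mid x) &= \log \pi_{\mathrm{base}}(y \mid x), \\
  % Sharpened target: respond with whatever the model itself rates highest.
  \pi_{\mathrm{sharp}}(x) &\in \operatorname*{arg\,max}_{y}\, r_{\mathrm{self}}(y \mid x), \\
  % Coverage: inverse of the mass the base model places on that response.
  C_{\mathrm{cov}} &= \mathbb{E}_{x}\!\left[\frac{1}{\pi_{\mathrm{base}}\!\left(\pi_{\mathrm{sharp}}(x) \mid x\right)}\right].
\end{align*}
```

Intuitively, when C_cov is large the base model rarely samples its own highest-rated responses, which is why the SFT guarantee requires sufficient coverage while RLHF-Sharpening's exploration can sidestep it.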
-----
📊 Results:
→ SFT-Sharpening achieves minimax-optimal sample complexity when the base model has sufficient coverage
→ RLHF-Sharpening improves over SFT-Sharpening by leveraging online exploration
→ Empirical validation shows significant gains over greedy decoding across multiple tasks