Self-Consistency Preference Optimization (SCPO) trains LLMs to solve complex problems by learning from their own consistent answers.
Self-consistency replaces human annotations in training LLMs for reasoning tasks
https://arxiv.org/abs/2411.04109
🎯 Original Problem:
Training LLMs on complex reasoning tasks requires extensive human-annotated data, which is expensive and time-consuming to collect. Existing self-training methods struggle because models cannot reliably judge the correctness of their own reasoning.
-----
🔧 Solution in this Paper:
→ Self-Consistency Preference Optimization (SCPO) trains models without human annotations by leveraging answer consistency across multiple samples
→ For each problem, SCPO generates multiple solutions and creates preference pairs based on answer frequency
→ The method uses a weighted loss function that scales each preference pair by the confidence implied by its vote margin (sketched after this list)
→ SCPO can work in unsupervised, semi-supervised, or supervised settings by combining with gold labels when available
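The mechanics can be pictured in a few lines. Below is a minimal, illustrative Python sketch, not the paper's code: `extract_answer` is an assumed helper, and the DPO-style loss form with its `beta` hyperparameter is an assumption standing in for SCPO's exact weighted objective, which scales each preference pair by its normalized vote margin.

```python
from collections import Counter

import torch
import torch.nn.functional as F


def build_scpo_pair(solutions, extract_answer):
    """Form one preference pair from sampled solutions to a single problem.

    `solutions` is a list of generated chains-of-thought; `extract_answer`
    pulls the final answer from a solution (assumed helper).
    Returns (chosen, rejected, weight) or None if every sample agrees.
    """
    answers = [extract_answer(s) for s in solutions]
    counts = Counter(answers)
    if len(counts) < 2:
        return None  # all samples reach the same answer: no rejected side

    (top_ans, top_votes), (second_ans, second_votes) = counts.most_common(2)

    # Chosen: a solution that reaches the most consistent (most frequent) answer.
    chosen = next(s for s, a in zip(solutions, answers) if a == top_ans)
    # Rejected: a solution that reaches a less frequent answer.
    rejected = next(s for s, a in zip(solutions, answers) if a == second_ans)

    # Confidence weight from the vote margin, normalized by the sample count.
    weight = (top_votes - second_votes) / len(solutions)
    return chosen, rejected, weight


def weighted_preference_loss(logp_chosen, logp_rejected,
                             ref_logp_chosen, ref_logp_rejected,
                             weights, beta=0.1):
    """DPO-style loss where each pair is scaled by its vote-margin weight.

    All arguments are 1-D tensors over a batch of pairs. Illustrative only:
    the paper's exact objective may differ in form.
    """
    logits = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    per_pair = -F.logsigmoid(logits)
    return (weights * per_pair).mean()
```

Pairs with a large vote margin (the model strongly agrees with itself) contribute more to the loss, while near-ties contribute little, which is how consistency stands in for a gold label.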
-----
💡 Key Insights:
→ Models make random mistakes, so incorrect solutions rarely lead to the same wrong answer repeatedly
→ Higher consistency across multiple samples correlates strongly with answer correctness
→ Weighted loss based on vote margins improves training quality
→ SCPO can generate and filter new training problems without requiring correct answers, keeping only those the model answers consistently (see the sketch after this list)
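One way to picture that filtering step: keep a self-generated problem only if the model's sampled answers to it are sufficiently consistent. This is a hypothetical sketch; `sample_fn`, `extract_answer`, the sample count, and the consistency threshold are assumptions, not the paper's exact procedure.

```python
from collections import Counter


def filter_generated_problems(problems, sample_fn, extract_answer,
                              n_samples=8, min_consistency=0.5):
    """Keep self-generated problems whose sampled answers are consistent.

    `sample_fn(problem, n)` returns n sampled solutions (assumed helper).
    Since no gold answer exists, the share of votes for the majority answer
    serves as a proxy for the problem being well-posed and solvable.
    """
    kept = []
    for problem in problems:
        answers = [extract_answer(s) for s in sample_fn(problem, n_samples)]
        top_votes = Counter(answers).most_common(1)[0][1]
        if top_votes / n_samples >= min_consistency:
            kept.append(problem)
    return kept
```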
-----
📊 Results:
→ On GSM8K math problems: 22.74% absolute improvement in zero-shot accuracy without gold labels
→ Matches the performance of fully supervised training to within 1%
→ On ZebraLogic puzzles: Llama-3 8B with SCPO outperforms Llama-3 70B and Claude-3 Haiku
→ Semi-supervised SCPO improves over the supervised baseline by 2.35% on GSM8K