
"Self-Consistency Preference Optimization"

The podcast on this paper is generated with Google's Illuminate.

Self-Consistency Preference Optimization (SCPO) trains LLMs to solve complex problems by learning from their own consistent answers.

Self-consistency replaces human annotations when training LLMs for reasoning tasks.

https://arxiv.org/abs/2411.04109

🎯 Original Problem:

Training LLMs on complex reasoning tasks requires extensive human-annotated data, which is expensive and time-consuming to collect. Existing self-training methods falter because models are poor judges of the correctness of their own reasoning.

-----

🔧 Solution in this Paper:

→ Self-Consistency Preference Optimization (SCPO) trains models without human annotations by leveraging answer consistency across multiple samples

→ For each problem, SCPO generates multiple solutions and creates preference pairs based on answer frequency

→ The method weights each preference pair by its vote margin, so pairs the model is more consistent about contribute more to the loss (see the sketch after this list)

→ SCPO can work in unsupervised, semi-supervised, or supervised settings by combining with gold labels when available
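Below is a minimal Python sketch of this pipeline, assuming a DPO-style objective: several answers are sampled per problem, the most frequent answer becomes the chosen response, a least-frequent one becomes the rejected response, and the pair's loss is scaled by the normalized vote margin. Function names and the exact weighting formula here are illustrative assumptions, not the paper's implementation.

```python
from collections import Counter

import torch.nn.functional as F


def build_scpo_pair(samples):
    """Build one preference pair from k sampled (reasoning, answer) tuples
    produced by the current model for a single problem.

    Returns (chosen, rejected, weight) or None when every sample agrees.
    """
    counts = Counter(answer for _, answer in samples)
    if len(counts) < 2:
        return None  # no disagreement, so no contrastive pair to learn from

    ranked = counts.most_common()
    top_answer, top_votes = ranked[0]    # most consistent answer -> chosen
    bot_answer, bot_votes = ranked[-1]   # least consistent answer -> rejected

    chosen = next(r for r, a in samples if a == top_answer)
    rejected = next(r for r, a in samples if a == bot_answer)

    # Confidence weight: normalized vote margin between chosen and rejected
    # (assumed form of the paper's vote-margin weighting).
    weight = (top_votes - bot_votes) / len(samples)
    return chosen, rejected, weight


def weighted_preference_loss(pi_logp_chosen, pi_logp_rejected,
                             ref_logp_chosen, ref_logp_rejected,
                             weights, beta=0.1):
    """DPO-style loss where each pair's term is scaled by its vote-margin weight."""
    margins = beta * ((pi_logp_chosen - ref_logp_chosen)
                      - (pi_logp_rejected - ref_logp_rejected))
    return -(weights * F.logsigmoid(margins)).mean()
```

Scaling by the vote margin means pairs the model already agrees on strongly push training harder than near-ties, which are more likely to be noise.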

-----

💡 Key Insights:

→ Models' mistakes are largely random, so incorrect solutions rarely converge on the same wrong answer

→ Higher consistency across multiple samples correlates strongly with answer correctness

→ Weighted loss based on vote margins improves training quality

→ SCPO can generate and filter new training problems without requiring correct answers, keeping only problems whose sampled answers are consistent enough (sketched below)
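A sketch of that filtering step, under the assumption that a self-generated problem is kept only when its majority answer reaches a minimum vote share; `sample_answers`, `k`, and `min_share` are hypothetical names and values, not taken from the paper.

```python
from collections import Counter


def filter_generated_problems(problems, sample_answers, k=8, min_share=0.5):
    """Keep only self-generated problems whose k sampled answers agree enough
    to trust the majority answer as a pseudo-label.

    sample_answers(problem, k) is a hypothetical helper that returns k final
    answers from the current model; min_share is an assumed threshold.
    """
    kept = []
    for problem in problems:
        answers = sample_answers(problem, k)
        top_answer, top_votes = Counter(answers).most_common(1)[0]
        if top_votes / k >= min_share:
            kept.append((problem, top_answer))  # pseudo-label = majority vote
    return kept
```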

-----

📊 Results:

→ On GSM8K math problems: 22.74% absolute improvement in zero-shot accuracy without gold labels

→ Matches fully supervised training to within 1% accuracy

→ On ZebraLogic puzzles: Llama-3 8B with SCPO outperforms Llama-3 70B and Claude-3 Haiku

→ Semi-supervised SCPO improves over supervised baseline by 2.35% on GSM8K
