Teaching LLMs patience improves math accuracy by 6.7% with just 5 minutes of training
Simple preference optimization makes LLMs slow down and think better
The paper introduces a simple yet powerful approach to enhance LLM reasoning capabilities through patient, detailed problem-solving. Instead of relying on expensive training data or complex architectures, it uses Direct Preference Optimization (DPO) to teach models to favor thorough reasoning over quick answers.
-----
https://arxiv.org/abs/2411.13082
Original Problem 🤔:
Current LLMs prioritize quick, concise answers due to user preference alignment, leading to oversimplified reasoning and reduced accuracy in complex problem-solving tasks.
-----
Solution in this Paper 🛠️:
→ The method uses GPT-4 to generate detailed, step-by-step solutions as positive samples and pairs them with concise solutions to the same problems as negative samples.
→ It uses Direct Preference Optimization (DPO) to train models to prefer patient, step-by-step reasoning.
→ The training process takes less than 5 minutes on 8 A100 GPUs, using Qwen2-7B-Instruct as the base model.
→ The approach teaches no new knowledge or skills; it simply encourages models to apply their existing reasoning capabilities more thoroughly.
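To make the recipe concrete, here is a minimal sketch of this kind of preference setup using Hugging Face TRL's DPOTrainer. The toy preference pair, hyperparameters, and exact trainer arguments are illustrative assumptions (TRL argument names also vary across versions), not the paper's released configuration.

```python
# Minimal sketch of the "patient reasoning" DPO recipe described above.
# Assumptions: the preference pair and hyperparameters are illustrative
# placeholders; TRL argument names differ across library versions.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

BASE = "Qwen/Qwen2-7B-Instruct"  # base model named in the post

model = AutoModelForCausalLM.from_pretrained(BASE)
ref_model = AutoModelForCausalLM.from_pretrained(BASE)  # frozen reference for DPO
tokenizer = AutoTokenizer.from_pretrained(BASE)

# Preference pairs: "chosen" = detailed, step-by-step solution (e.g. GPT-4 generated),
# "rejected" = concise solution to the same problem. One toy GSM8k-style example shown.
pairs = Dataset.from_dict({
    "prompt": ["Natalia sold clips to 48 friends in April, and half as many in May. "
               "How many clips did she sell altogether?"],
    "chosen": ["Step 1: In April she sold 48 clips.\n"
               "Step 2: In May she sold half as many, so 48 / 2 = 24 clips.\n"
               "Step 3: Altogether she sold 48 + 24 = 72 clips.\nAnswer: 72"],
    "rejected": ["48 + 24 = 72. Answer: 72"],
})

args = DPOConfig(
    output_dir="qwen2-7b-patient-dpo",
    beta=0.1,                       # DPO temperature; a common default, not from the paper
    per_device_train_batch_size=1,
    num_train_epochs=1,
    learning_rate=5e-7,
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=args,
    train_dataset=pairs,
    tokenizer=tokenizer,  # newer TRL releases use `processing_class=` instead
)
trainer.train()
```

With a small, pre-generated preference dataset like this, a single short DPO pass is all that is needed, which is why the whole run fits in a few minutes on 8 A100s.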
-----
Key Insights 💡:
→ More patient reasoning leads to better problem-solving accuracy
→ Simple preference optimization can improve performance without expensive training data
→ Trading inference time for accuracy is worthwhile in complex tasks
-----
Results 📊:
→ 6.7% accuracy improvement on GSM8k benchmark
→ 0.2% increase on MATH dataset
→ Inference time increased from 7.2 to 10.9 seconds per problem
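→ In other words, roughly 3.7 extra seconds (about 51% more time) per problem in exchange for the 6.7-point GSM8k gain.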
-----
Are you into AI and LLMs❓ Join me on X/Twitter with 49K+ others to stay on the bleeding edge of AI every day.
𝕏/🐦 https://x.com/rohanpaul_ai