SLMs can now learn complex reasoning by analyzing the individual steps of their own thought process.
Small Language Models learn to reason better by critiquing their own step-by-step thinking, with no external help or supervision needed.
-----
https://arxiv.org/abs/2412.08393
🤔 Original Problem:
Small Language Models (SLMs) struggle with reasoning tasks compared to LLMs. Current methods for improving them require expensive supervision from humans or more advanced LLMs, which leads to overfitting and poor generalization.
-----
🔧 Solution in this Paper:
→ The paper introduces Self-Iterative Process Feedback (SIPF), in which the SLM learns from its own reasoning attempts
→ A Process Reward Model evaluates the correctness of each intermediate reasoning step, not just the final answer
→ The system samples multiple reasoning paths and uses simulation to determine which steps are correct (see the sketch after this list)
→ It combines this step-level feedback with Odds Ratio Preference Optimization (ORPO) to fine-tune the SLM on both good and bad examples
→ The process repeats iteratively, so the model keeps improving by learning from its own successes and failures
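Below is a minimal sketch of what the step-level labeling could look like, assuming the "simulation" takes the form of Monte Carlo rollouts from each partial reasoning prefix. The helper names (`generate`, `is_correct`), the rollout count, and the threshold are illustrative assumptions, not the paper's actual implementation.

```python
from typing import Callable, List

def label_steps(
    question: str,
    steps: List[str],
    generate: Callable[[str, int], List[str]],   # hypothetical: (prompt, n) -> n sampled continuations
    is_correct: Callable[[str], bool],           # hypothetical: checks a completed solution's final answer
    n_rollouts: int = 8,
    threshold: float = 0.5,
) -> List[bool]:
    """Label each reasoning step as good/bad by simulating continuations from it.

    A step counts as a positive example if enough random continuations
    starting from that prefix still reach a correct final answer.
    """
    labels = []
    prefix = question
    for step in steps:
        prefix = prefix + "\n" + step
        completions = generate(prefix, n_rollouts)
        success_rate = sum(is_correct(prefix + "\n" + c) for c in completions) / n_rollouts
        labels.append(success_rate >= threshold)
    return labels
```

The resulting per-step labels are what separate "good" reasoning prefixes from "bad" ones before the preference-optimization stage.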
-----
💡 Key Insights:
→ Step-by-step feedback is more informative than checking only final answers
→ Self-iteration lets the model keep improving without external supervision
→ Combining process feedback with preference optimization leads to better learning (a loss sketch follows this list)
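As a rough illustration of that last point, here is an ORPO-style objective over (chosen, rejected) reasoning paths. The per-sequence log-probabilities are assumed to be length-normalized, and the weight `lam` is an illustrative value, not one taken from the paper.

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps: torch.Tensor,
              rejected_logps: torch.Tensor,
              chosen_nll: torch.Tensor,
              lam: float = 0.1) -> torch.Tensor:
    """chosen_logps / rejected_logps: length-normalized per-sequence log-probs, shape (batch,).
    chosen_nll: standard SFT negative log-likelihood on the chosen (good) sequences.
    """
    # odds(y|x) = p / (1 - p), computed from the normalized log-probability
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # push the model to prefer chosen over rejected reasoning paths
    ratio_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected).mean()
    return chosen_nll + lam * ratio_loss
```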
-----
📊 Results:
→ Improved Gemma-2B by 12.43 points on GSM8K math word problems
→ Raised code generation success by 3.95 points on the MBPP benchmark
→ Showed better generalization on out-of-domain tasks such as MMLU_Math