"Learning to Reason via Self-Iterative Process Feedback for Small Language Models"

The podcast on this paper was generated with Google's Illuminate.

SLMs can now learn complex reasoning by analyzing the individual steps of their own reasoning process.

Small Language Models learn to reason better by critiquing their own step-by-step thinking process, without needing external help or supervision.

-----

https://arxiv.org/abs/2412.08393

🤔 Original Problem:

Small Language Models (SLMs) lag behind LLMs on reasoning tasks. Existing methods for improving them rely on expensive supervision from humans or stronger LLMs, which tends to cause overfitting and poor generalization.

-----

🔧 Solution in this Paper:

→ The paper introduces Self-Iterative Process Feedback (SIPF), a method in which SLMs learn from their own reasoning attempts

→ A Process Reward Model evaluates the correctness of each reasoning step, not just the final answer

→ The system samples multiple reasoning paths and uses simulation to determine which steps are correct (a rollout-scoring sketch follows after this list)

→ It combines this with Odds Ratio Preference Optimization (ORPO) to fine-tune SLMs on both good and bad examples (an ORPO-style loss sketch also follows below)

→ Training proceeds iteratively, so the model keeps improving as it learns from its own successes and failures
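
Below is a minimal Python sketch of the step-level scoring idea: from each prefix of a sampled reasoning path, several continuations are simulated, and a step is scored by how often those continuations still reach a correct answer. The `rollout` and `is_correct` callables, the rollout count, and the thresholding are illustrative assumptions, not details taken from the paper.

```python
from typing import Callable, List

def estimate_step_scores(
    question: str,
    steps: List[str],                      # one sampled reasoning path, split into steps
    rollout: Callable[[str], str],         # hypothetical: samples a completion from a prefix
    is_correct: Callable[[str], bool],     # hypothetical: checks the final answer / runs tests
    num_rollouts: int = 8,
) -> List[float]:
    """Score each reasoning step by the fraction of simulated continuations
    from its prefix that still end in a correct answer."""
    scores = []
    prefix = question
    for step in steps:
        prefix = prefix + "\n" + step
        hits = sum(is_correct(rollout(prefix)) for _ in range(num_rollouts))
        scores.append(hits / num_rollouts)
    return scores

def first_bad_step(scores: List[float], threshold: float = 0.0) -> int:
    """Index of the first step whose prefix can no longer recover a correct
    answer; steps before it are treated as good, the rest as bad."""
    for i, score in enumerate(scores):
        if score <= threshold:
            return i
    return len(scores)  # every step still leads to a correct answer
```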
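
And here is a minimal sketch of an ORPO-style objective that such labels could feed: paths judged correct serve as preferred examples and those judged incorrect as dispreferred ones, and the loss combines the usual SFT term with a log-odds-ratio penalty. It assumes length-normalized sequence log-probabilities and an illustrative `beta`; this is not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def orpo_loss(logp_chosen: torch.Tensor,
              logp_rejected: torch.Tensor,
              nll_chosen: torch.Tensor,
              beta: float = 0.1) -> torch.Tensor:
    """ORPO-style objective for one batch of preference pairs.

    logp_chosen / logp_rejected: length-normalized log-probabilities of the
        preferred / dispreferred responses under the current policy, shape [batch].
    nll_chosen: token-averaged negative log-likelihood (the SFT term) of the
        preferred responses, shape [batch].
    """
    # log-odds = log(p / (1 - p)), computed stably from log p (log p < 0 here)
    log_odds_chosen = logp_chosen - torch.log1p(-torch.exp(logp_chosen))
    log_odds_rejected = logp_rejected - torch.log1p(-torch.exp(logp_rejected))
    # reward the model when the chosen response's odds exceed the rejected one's
    odds_ratio_term = F.logsigmoid(log_odds_chosen - log_odds_rejected)
    return nll_chosen.mean() - beta * odds_ratio_term.mean()

# illustrative usage with dummy values
if __name__ == "__main__":
    logp_c = torch.tensor([-0.9, -1.2])   # log-probs of preferred responses
    logp_r = torch.tensor([-2.5, -3.0])   # log-probs of dispreferred responses
    nll_c = -logp_c                        # SFT loss is just the chosen NLL here
    print(orpo_loss(logp_c, logp_r, nll_c))
```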

-----

💡 Key Insights:

→ Step-by-step feedback is more valuable than just checking final answers

→ Self-iteration helps models continuously improve without external supervision

→ Combining process feedback with preference optimization leads to better learning

-----

📊 Results:

→ Improved Gemma-2B's performance by 12.43 points on GSM8K math problems

→ Increased code generation success by 3.95 points on the MBPP benchmark

→ Showed better generalization on out-of-domain tasks like MMLU_Math
