Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback
Step-KTO improves mathematical reasoning in LLMs by integrating binary feedback at both stepwise and final-answer levels, ensuring more reliable and interpretable problem-solving.
📌 Step-KTO directly addresses logical inconsistency in LLM reasoning by enforcing correctness at each step. Unlike chain-of-thought prompting, which improves answer accuracy without verifying the reasoning, Step-KTO validates every intermediate step. The process reward model (PRM) flags flawed logic, and training on that signal prevents models from arriving at correct answers through incorrect reasoning (a sketch of the feedback scheme follows below).
📌 The Kahneman-Tversky-inspired risk-sensitive optimization is key. By balancing error avoidance and solution progress, the model refines reasoning instead of just maximizing reward. This prevents unstable training dynamics where models chase final-answer correctness at the cost of logical consistency.
📌 Results show a major leap in reasoning robustness. A 9.8-point jump in Pass@1 on MATH-500 shows Step-KTO improves structured problem-solving, not just final outputs. Gains on AMC23 and AIME24 confirm effectiveness on complex, competition-level math, demonstrating scalability beyond toy datasets.
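To make the feedback scheme concrete, here is a minimal sketch of how stepwise and final-answer binary signals could be collected. This is an illustration, not the authors' code: `prm_score` is a hypothetical stand-in for the learned process reward model, and exact-match checking is assumed as a simple proxy for the outcome reward model.

```python
from typing import Callable, List, Tuple

def collect_binary_feedback(
    steps: List[str],                              # intermediate reasoning steps of one solution
    final_answer: str,
    gold_answer: str,
    prm_score: Callable[[List[str]], float],       # hypothetical PRM: step prefix -> P(step is valid)
    threshold: float = 0.5,
) -> Tuple[List[int], int]:
    """Return per-step binary labels (process feedback) and a final-answer label (outcome feedback)."""
    step_labels = []
    for i in range(len(steps)):
        # The PRM judges each step in the context of the steps preceding it
        p_valid = prm_score(steps[: i + 1])
        step_labels.append(1 if p_valid >= threshold else 0)
    # Assumed ORM proxy: exact match against the reference answer
    outcome_label = 1 if final_answer.strip() == gold_answer.strip() else 0
    return step_labels, outcome_label
```

The key point is that every solution yields two kinds of binary signal: one per reasoning step and one for the final answer, so a correct answer reached through a flawed step still receives negative process feedback.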
---
https://arxiv.org/abs/2501.10799
Original Problem 🤔:
→ LLMs can solve math problems but often produce logically inconsistent reasoning steps, even when final answers are correct.
→ Existing approaches like chain-of-thought prompting improve answer accuracy but do not ensure reasoning validity.
→ Final-answer-based training fails to verify correctness of intermediate steps, reducing trust in model outputs.
---
Solution in this Paper 🔧:
→ Step-KTO introduces a novel training framework that combines process-level and outcome-level binary feedback.
→ The PRM evaluates intermediate reasoning steps, while an outcome reward model (ORM) assesses final-answer correctness.
→ A Kahneman-Tversky-inspired value function integrates both binary signals to progressively refine reasoning quality (see the sketch after this list).
→ Iterative training helps the model correct errors in intermediate steps and maintain final-answer accuracy.
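Here is a minimal KTO-style sketch of the value function, following the prospect-theory form of the original KTO objective rather than the paper's exact equations. `logratio` stands for log pi_theta(y|x) - log pi_ref(y|x), where y can be a step prefix (process feedback) or the full solution (outcome feedback); `desirable` is the binary PRM/ORM label; and `ref_point`, `beta`, `lam_d`, `lam_u` are assumed hyperparameters:

```python
import torch

def step_kto_loss(logratio: torch.Tensor,
                  desirable: torch.Tensor,   # 1 = positive PRM/ORM feedback, 0 = negative
                  ref_point: torch.Tensor,   # reference point (a KL estimate in KTO)
                  beta: float = 0.1,
                  lam_d: float = 1.0,        # weight on desirable examples
                  lam_u: float = 1.0):       # weight on undesirable examples
    """Prospect-theory-style objective: reward gains and penalize losses asymmetrically."""
    # Value of a desirable example rises as the policy upweights it past the reference
    v_gain = torch.sigmoid(beta * (logratio - ref_point))
    # Value of an undesirable example rises as the policy downweights it
    v_loss = torch.sigmoid(beta * (ref_point - logratio))
    per_example = torch.where(desirable.bool(),
                              lam_d * (1.0 - v_gain),
                              lam_u * (1.0 - v_loss))
    return per_example.mean()
```

The asymmetry between `lam_d` and `lam_u` is what makes the objective risk-sensitive: the model can be made more loss-averse on flawed steps than it is reward-seeking on correct ones.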
---
Key Insights from this Paper 💡:
→ Combining stepwise and final-answer feedback improves both reasoning consistency and accuracy.
→ Iterative refinement allows models to learn from past mistakes and improve performance over multiple training rounds (see the training-loop sketch after this list).
→ Risk-sensitive optimization, inspired by Kahneman and Tversky's prospect theory, helps the model balance avoiding errors against making progress.
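As a rough picture of the iterative loop, here is a sketch under assumed names: `sample_solutions` and `step_kto_update` are hypothetical helpers, `prm_score` is the hypothetical PRM from earlier, and `collect_binary_feedback` is the sketch above. None of these are APIs from the paper's codebase.

```python
def iterative_step_kto(model, ref_model, problems, prm_score, num_rounds=3):
    """Hypothetical outer loop: sample, score with PRM/ORM, retrain, repeat."""
    for _ in range(num_rounds):
        batch = []
        for problem in problems:
            # Self-generated solutions from the current model (hypothetical helper)
            for sol in sample_solutions(model, problem):
                step_labels, outcome_label = collect_binary_feedback(
                    sol.steps, sol.final_answer, problem.gold_answer, prm_score
                )
                batch.append((problem, sol, step_labels, outcome_label))
        # Each round trains on fresh feedback, letting the model correct
        # mistakes it made in earlier rounds (hypothetical update step)
        model = step_kto_update(model, ref_model, batch)
    return model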
---
Results 📊:
→ MATH-500: Pass@1 improves from 53.4% (baseline) to 63.2%.
→ AMC23: Accuracy increases from 35.0% to 47.5%, showing gains on competition-level math problems.
→ AIME24: Step-KTO reaches 16.7% Pass@1, outperforming alternative training methods.