Step-KTO fixes a major flaw in LLM reasoning: it trains the model to get the intermediate steps right, not just the final answer.
It improves LLM mathematical reasoning by integrating process-level and outcome-level binary feedback, pushing the model toward correct intermediate steps as well as correct final answers.
---
Paper - https://arxiv.org/abs/2501.10799
Original Problem 🧩:
→ LLMs perform well in mathematical reasoning but often produce correct final answers with flawed intermediate steps.
→ Chain-of-thought prompting and self-consistency methods improve accuracy but do not guarantee coherent stepwise reasoning.
→ Without reliable stepwise logic, LLMs are untrustworthy for critical applications requiring transparent problem-solving.
---
Solution in this Paper 🛠️:
→ Step-KTO introduces a dual-feedback training framework, using process reward models (PRMs) for intermediate steps and outcome reward models (ORMs) for final answers.
→ The PRM assigns binary correctness labels to each reasoning step, guiding the model towards valid intermediate steps.
→ A Kahneman-Tversky-inspired value function, in the spirit of prospect theory's loss aversion, prioritizes error reduction and risk-averse optimization (see the sketch after this list).
→ The iterative training procedure refines model accuracy by progressively improving both process-level and final correctness.
→ Experiments on MATH-500, AMC23, and AIME24 datasets show higher accuracy and stepwise consistency compared to state-of-the-art baselines.
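Conceptually, the combined objective can be sketched as below. This is a minimal PyTorch sketch, assuming a KTO-style logistic value function applied separately to the per-step (PRM) and final-answer (ORM) binary labels; the function names, the alpha weighting, and the reference-point handling (z_ref) are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def kt_value(logratio, label, z_ref, beta=0.1, lam_pos=1.0, lam_neg=1.0):
    """Kahneman-Tversky-inspired value: a rising, saturating value for samples
    labeled correct (label=1) and a mirrored, risk-averse penalty for samples
    labeled incorrect (label=0). logratio = log pi_theta(y|x) - log pi_ref(y|x)."""
    if label == 1:
        return lam_pos * torch.sigmoid(beta * (logratio - z_ref))
    return lam_neg * torch.sigmoid(beta * (z_ref - logratio))

def step_kto_loss(step_logratios, step_labels, outcome_logratio, outcome_label,
                  z_ref, alpha=0.5):
    """Blend process-level and outcome-level binary feedback.

    step_logratios : 1-D tensor, policy/reference log-ratio per reasoning step
    step_labels    : binary PRM labels, 1 = step judged correct
    outcome_*      : log-ratio and binary ORM label for the final answer
    alpha          : hypothetical trade-off between the two feedback signals
    """
    process_values = torch.stack([
        kt_value(lr, int(lbl), z_ref)
        for lr, lbl in zip(step_logratios, step_labels)
    ])
    outcome_value = kt_value(outcome_logratio, int(outcome_label), z_ref)
    # Maximizing value = minimizing (1 - value); both signals must be satisfied
    # for the loss to approach zero.
    return alpha * (1 - process_values).mean() + (1 - alpha) * (1 - outcome_value)
```

The property this tries to capture: a solution with a correct final answer but a flawed intermediate step still incurs a process-level penalty, so the model cannot "luck into" the right answer without coherent reasoning.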
---
Key Insights from this Paper 💡:
→ Stepwise feedback significantly improves logical consistency in LLM-generated solutions.
→ Combining process- and outcome-level feedback ensures both step-by-step coherence and final accuracy.
→ Iterative training with Step-KTO continuously refines mathematical reasoning capabilities over multiple rounds.
---
Results 📊:
→ Pass@1 on MATH-500 improved from 53.4% to 63.2% for the 8B model and from 74.6% to 79.6% for the 70B model.
→ AMC23 Pass@1 increased from 35.0% to 47.5% for 8B and from 40.0% to 70.0% for 70B.
→ The stepwise error rate dropped from 27.3% to 19.9% over training iterations.