Fix one token, fix the whole math solution - that's what this paper discovered.
This paper shows how single tokens can make or break an LLM's reasoning ability.
📌 Paper: "Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability"
This paper introduces a method to improve LLMs' reasoning by identifying and fixing critical tokens that cause incorrect solutions. The approach uses contrastive estimation to detect problematic tokens and incorporates token-level rewards during model alignment, significantly boosting mathematical reasoning performance.
-----
https://arxiv.org/abs/2411.19943
🤔 Original Problem:
LLMs struggle with reasoning tasks despite using alignment techniques like Direct Preference Optimization (DPO). Current methods focus on example-level optimization but miss the impact of individual tokens on reasoning outcomes.
-----
🔧 Solution in this Paper:
→ The paper introduces cDPO, which automatically identifies critical tokens in incorrect reasoning paths using contrastive estimation.
→ It trains two separate models, one on correct and one on incorrect reasoning examples, so each learns the token patterns typical of its data.
→ The method then compares per-token generation likelihoods between these two models to flag problematic tokens (see the sketch after this list).
→ cDPO extends DPO to token-level optimization, using the identified critical tokens as weighted rewards during training.
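Below is a minimal, illustrative sketch of the contrastive-estimation idea: score each token of an incorrect solution by how much more likely a model fine-tuned on incorrect traces finds it than a model fine-tuned on correct traces. The model paths, function name, and exact scoring rule here are assumptions for illustration, not the paper's released code.

```python
# Hedged sketch of contrastive estimation for spotting critical tokens.
# Assumes two hypothetical causal LMs fine-tuned on correct ("pos_model")
# and incorrect ("neg_model") reasoning traces; paths are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
pos_model = AutoModelForCausalLM.from_pretrained("path/to/model-on-correct-traces").eval()
neg_model = AutoModelForCausalLM.from_pretrained("path/to/model-on-incorrect-traces").eval()

def token_criticality(prompt: str, trajectory: str):
    """Score each trajectory token by how much more likely the 'negative'
    model finds it than the 'positive' model (higher gap = more critical)."""
    ids = tokenizer(prompt + trajectory, return_tensors="pt").input_ids
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        pos_logp = pos_model(ids).logits.log_softmax(-1)
        neg_logp = neg_model(ids).logits.log_softmax(-1)
    # Log-likelihood of each next token under each model
    targets = ids[:, 1:]
    pos_ll = pos_logp[:, :-1].gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    neg_ll = neg_logp[:, :-1].gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Large (neg_ll - pos_ll) gap => token looks "critical" to the failure
    scores = (neg_ll - pos_ll)[0, prompt_len - 1:]
    tokens = tokenizer.convert_ids_to_tokens(ids[0, prompt_len:].tolist())
    return list(zip(tokens, scores.tolist()))
```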
-----
💡 Key Insights:
→ Small changes in operators and logical elements can drastically affect reasoning outcomes
→ Forcing models to avoid critical tokens significantly improves solution accuracy
→ Token-level optimization outperforms traditional example-level approaches (a hedged sketch of such a loss follows this list)
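As a rough illustration of token-level optimization, here is a hedged sketch of a DPO-style loss in which per-token log-ratios on the rejected response are scaled by criticality weights. The function name, signature, and weighting scheme are assumptions; the paper's exact cDPO objective may differ.

```python
# Hedged sketch of a token-weighted, DPO-style objective: per-token log-ratios
# between the policy and a frozen reference model are scaled by criticality
# weights before being summed. Illustrative only, not the paper's implementation.
import torch
import torch.nn.functional as F

def token_weighted_dpo_loss(policy_logps_w, ref_logps_w,   # per-token log-probs, chosen response
                            policy_logps_l, ref_logps_l,   # per-token log-probs, rejected response
                            weights_l, beta: float = 0.1):
    """All log-prob inputs are (seq_len,) tensors; weights_l holds
    criticality weights for the rejected-response tokens."""
    # Standard DPO margin on the chosen response (uniform token weights)
    chosen = (policy_logps_w - ref_logps_w).sum()
    # Re-weight rejected tokens by their criticality scores
    rejected = (weights_l * (policy_logps_l - ref_logps_l)).sum()
    return -F.logsigmoid(beta * (chosen - rejected))

# Toy usage with dummy per-token log-probs (negative values) and random weights
T = 12
loss = token_weighted_dpo_loss(
    -torch.rand(T), -torch.rand(T),
    -torch.rand(T), -torch.rand(T),
    weights_l=torch.rand(T),
)
```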
-----
📊 Results:
→ Achieved 90.8% accuracy on GSM8K with Llama-3-70B
→ Improved MATH500 performance by 3.3% over baseline methods
→ Statistical significance with p<0.005 across all benchmarks