Self-improvement framework helps AI models reason about images without human guidance: the model learns to critique its own thinking process
The paper introduces a systematic framework for improving multimodal reasoning in large multimodal models (LMMs) through self-evolving training, enhancing the models' ability to learn from their own outputs without human annotations.
-----
https://arxiv.org/abs/2412.17451
🤔 Original Problem:
Multimodal LLMs struggle with reasoning because annotated data containing detailed thought processes is scarce. Self-evolving training lets models learn from their own outputs, but its behavior in multimodal settings is not yet well understood.
-----
🔧 Solution in this Paper:
→ The paper introduces the M-STAR framework, built on three key components: the training method, the reward model, and prompt variation (a minimal sketch of the resulting training loop follows this list)
→ It proposes continuous self-evolving training that inherits optimizer states between iterations
→ A Process Reward Model (PRM) evaluates the quality of reasoning steps, not just final answers
→ Dynamic temperature adjustment balances exploration and exploitation during training
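To make the loop concrete, here is a minimal Python sketch of one self-evolving round under the recipe above: sample with the current policy, rerank with a PRM, fine-tune on the selected responses, and carry the optimizer state into the next round. The interfaces (`generate_fn`, `prm_score_fn`, `sft_step_fn`, the `Candidate` record) are illustrative assumptions, not the authors' code.

```python
# Sketch of one continuous self-evolving training round (assumed interfaces;
# generate_fn / prm_score_fn / sft_step_fn are hypothetical callables standing
# in for the policy model, the Process Reward Model, and the SFT trainer).
from dataclasses import dataclass

@dataclass
class Candidate:
    question: str
    response: str
    is_correct: bool       # response checked against the ground-truth answer
    prm_score: float = 0.0

def self_evolve_round(questions, generate_fn, prm_score_fn, sft_step_fn,
                      optimizer_state, temperature, samples_per_q=8, top_k=2):
    """One round: sample -> rerank with PRM -> fine-tune on top responses.

    Optimizer state is passed in and returned so the next round continues from
    it ("continuous" training) rather than re-initializing each iteration.
    """
    training_pool = []
    for q in questions:
        # 1. Sample multiple rollouts from the current policy.
        cands = [generate_fn(q, temperature) for _ in range(samples_per_q)]
        # 2. Keep answer-correct rollouts, then use the PRM as a *reranker*:
        #    score each rollout's reasoning and keep only the top-k.
        correct = [c for c in cands if c.is_correct]
        for c in correct:
            c.prm_score = prm_score_fn(c)
        correct.sort(key=lambda c: c.prm_score, reverse=True)
        training_pool.extend(correct[:top_k])
    # 3. Fine-tune on the selected responses, carrying optimizer state forward.
    optimizer_state = sft_step_fn(training_pool, optimizer_state)
    return training_pool, optimizer_state
```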
-----
💡 Key Insights:
→ Continuous training that carries optimizer state forward outperforms iterative approaches that restart training each round
→ PRM works better as a reranker than a verifier for selecting high-quality responses
→ Unlabeled data helps only when the reward signal is near-perfect; otherwise it hurts performance
→ The model's exploration ability shrinks as training progresses, requiring dynamic adjustment (sketched below)
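As one way to operationalize that last insight, the sketch below adjusts the sampling temperature from round to round. It assumes exploration is tracked with Pass@K on a small monitor set; the update rule and constants are illustrative, not taken from the paper.

```python
# Hedged sketch of dynamic temperature adjustment: if exploration (measured
# here, as an assumption, by Pass@K on a small monitor set) drops between
# rounds, raise the sampling temperature; if it recovers, lower it.
def adjust_temperature(temperature, pass_at_k_history,
                       step=0.1, t_min=0.3, t_max=1.6):
    """pass_at_k_history: list of Pass@K values, one per completed round."""
    if len(pass_at_k_history) < 2:
        return temperature                      # not enough signal yet
    prev, curr = pass_at_k_history[-2], pass_at_k_history[-1]
    if curr < prev:                             # exploration is shrinking
        temperature = min(temperature + step, t_max)
    else:                                       # exploration stable/improving
        temperature = max(temperature - step, t_min)
    return temperature

# Example: Pass@K fell from 0.72 to 0.65, so the temperature is nudged up.
print(adjust_temperature(1.0, [0.72, 0.65]))    # -> 1.1
```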
-----
📊 Results:
→ M-STAR improved performance on 5 multimodal reasoning benchmarks
→ Achieved 59.5% accuracy on MathVista with MiniCPM-V-2.5 (8B)
→ Showed consistent improvements across model sizes: 8B, 4B, and 2B