Mixed Preference Optimization (MPO), proposed in this paper, teaches multimodal LLMs to reason step by step.
This paper introduces Mixed Preference Optimization (MPO) to enhance multimodal LLMs' reasoning abilities. The key innovation lies in combining preference optimization with automated data generation to improve Chain-of-Thought performance and reduce distribution shifts during inference.
-----
https://arxiv.org/abs/2411.10442
🤔 Original Problem:
Open-source multimodal LLMs struggle with Chain-of-Thought (CoT) reasoning, often performing worse with CoT than with direct answers because of a distribution shift between training and inference.
-----
🛠️ Solution in this Paper:
→ Introduces an automated preference data construction pipeline that creates the MMPR dataset, containing about 3 million high-quality multimodal reasoning preference samples
→ Develops Mixed Preference Optimization (MPO), which combines three loss functions: a preference loss for learning relative preferences between responses, a quality loss for judging the absolute quality of each response, and a generation loss for learning to generate the preferred response (see the loss sketch after this list)
→ Uses Dropout Next Token Prediction for samples without clear ground truth: a high-quality response is truncated and the model is asked to complete it without the image input, and the image-blind completion serves as the rejected sample
→ For data with clear ground truth, samples multiple solutions and uses correct ones as positive samples and incorrect ones as negative samples (both strategies are sketched below)
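
A minimal sketch of the two data construction strategies, assuming hypothetical `generate` and `extract_answer` helpers (illustrative names, not the paper's released code):

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical signatures: `generate(prompt, image)` samples one response from
# the MLLM; `extract_answer(response)` pulls out the final answer string.
Generate = Callable[[str, Optional[bytes]], str]
ExtractAnswer = Callable[[str], str]

def pairs_with_ground_truth(
    generate: Generate,
    extract_answer: ExtractAnswer,
    image: bytes,
    question: str,
    gt_answer: str,
    n_samples: int = 8,
) -> List[Tuple[str, str, str]]:
    """Correctness-based pairs: sample several CoT solutions and split them
    by whether their final answer matches the ground truth."""
    solutions = [generate(question, image) for _ in range(n_samples)]
    chosen = [s for s in solutions if extract_answer(s) == gt_answer]
    rejected = [s for s in solutions if extract_answer(s) != gt_answer]
    return [(question, c, r) for c in chosen for r in rejected]

def pair_dropout_ntp(
    generate: Generate,
    image: bytes,
    question: str,
    good_response: str,
    keep_ratio: float = 0.5,
) -> Tuple[str, str, str]:
    """Dropout Next Token Prediction pair: truncate a high-quality response
    and let the model finish it WITHOUT the image; the image-blind
    continuation is stitched onto the prefix and used as the rejected
    response."""
    prefix = good_response[: int(len(good_response) * keep_ratio)]
    continuation = generate(question + "\n" + prefix, None)  # no visual input
    return (question, good_response, prefix + continuation)
```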
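
And a PyTorch sketch of how the three loss terms could be combined, assuming the per-response log-probabilities have already been computed; `beta`, `delta`, and the loss weights are placeholders, not the paper's reported settings:

```python
import torch
import torch.nn.functional as F

def mpo_loss(
    policy_chosen_logps: torch.Tensor,    # sum log p_theta(chosen | x), shape (B,)
    policy_rejected_logps: torch.Tensor,  # sum log p_theta(rejected | x), shape (B,)
    ref_chosen_logps: torch.Tensor,       # same quantities under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    chosen_token_nll: torch.Tensor,       # token-averaged NLL of the chosen response, shape (B,)
    beta: float = 0.1,
    delta: float = 0.0,                   # reward shift for the quality term (placeholder)
    w_pref: float = 1.0,
    w_qual: float = 1.0,
    w_gen: float = 1.0,
) -> torch.Tensor:
    """Weighted sum of a DPO-style preference loss, a BCO-style quality loss,
    and an SFT generation loss (weights here are illustrative)."""
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)

    # Preference loss: learn the *relative* preference chosen > rejected.
    pref_loss = -F.logsigmoid(chosen_reward - rejected_reward).mean()

    # Quality loss: judge each response's *absolute* quality against delta.
    qual_loss = 0.5 * (
        -F.logsigmoid(chosen_reward - delta)
        - F.logsigmoid(-(rejected_reward - delta))
    ).mean()

    # Generation loss: keep maximizing the likelihood of the chosen response.
    gen_loss = chosen_token_nll.mean()

    return w_pref * pref_loss + w_qual * qual_loss + w_gen * gen_loss
```

Note that the quality term reuses the same chosen/rejected rewards as the preference term, so it adds essentially no extra forward passes on top of plain DPO.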
-----
💡 Key Insights:
→ Distribution shifts during inference significantly impact CoT performance
→ Combining multiple loss functions improves reasoning capabilities
→ Automated preference data generation can be more efficient than manual annotation
-----
📊 Results:
→ InternVL2-8B-MPO achieves 67.0% accuracy on MathVista, outperforming the base model by 8.7 points
→ Matches the performance of InternVL2-76B, a model roughly 10x larger
→ Sets a new state of the art among open-source models on MathVision with 25.7% accuracy