Mixed Preference Optimization (MPO), proposed in this paper, teaches multimodal LLMs to reason step by step.
This paper introduces Mixed Preference Optimization (MPO) to enhance multimodal LLMs' reasoning abilities. The key innovation lies in combining preference optimization with automated data generation to improve Chain-of-Thought performance and reduce distribution shifts during inference.
-----
https://arxiv.org/abs/2411.10442
🤔 Original Problem:
Open-source multimodal LLMs struggle with Chain-of-Thought (CoT) reasoning, often performing worse with CoT than with direct answers because of a distribution shift between training and inference.
-----
🛠️ Solution in this Paper:
→ Introduces an automated preference data construction pipeline that creates the MMPR dataset, containing about 3 million high-quality multimodal reasoning preference samples
→ Develops Mixed Preference Optimization (MPO), which combines three loss functions: a preference loss for learning relative preferences between responses, a quality loss for judging the absolute quality of each response, and a generation loss for learning to generate the preferred response (see the loss sketch after this list)
→ Uses Dropout Next Token Prediction for samples without clear ground truth: a high-quality response is truncated and the model is asked to complete it without the image input, and the image-blind completion serves as the rejected sample
→ For data with clear ground truth, samples multiple solutions and uses correct ones as positive samples and incorrect ones as negative samples (both strategies are sketched below)
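
A minimal sketch of the two data construction strategies, assuming hypothetical `generate` and `extract_answer` helpers (illustrative names, not the paper's released code):

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical signatures: `generate(prompt, image)` samples one response from
# the MLLM; `extract_answer(response)` pulls out the final answer string.
Generate = Callable[[str, Optional[bytes]], str]
ExtractAnswer = Callable[[str], str]

def pairs_with_ground_truth(
    generate: Generate,
    extract_answer: ExtractAnswer,
    image: bytes,
    question: str,
    gt_answer: str,
    n_samples: int = 8,
) -> List[Tuple[str, str, str]]:
    """Correctness-based pairs: sample several CoT solutions and split them
    by whether their final answer matches the ground truth."""
    solutions = [generate(question, image) for _ in range(n_samples)]
    chosen = [s for s in solutions if extract_answer(s) == gt_answer]
    rejected = [s for s in solutions if extract_answer(s) != gt_answer]
    return [(question, c, r) for c in chosen for r in rejected]

def pair_dropout_ntp(
    generate: Generate,
    image: bytes,
    question: str,
    good_response: str,
    keep_ratio: float = 0.5,
) -> Tuple[str, str, str]:
    """Dropout Next Token Prediction pair: truncate a high-quality response
    and let the model finish it WITHOUT the image; the image-blind
    continuation is stitched onto the prefix and used as the rejected
    response."""
    prefix = good_response[: int(len(good_response) * keep_ratio)]
    continuation = generate(question + "\n" + prefix, None)  # no visual input
    return (question, good_response, prefix + continuation)
```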
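
And a PyTorch sketch of how the three loss terms could be combined, assuming the per-response log-probabilities have already been computed; `beta`, `delta`, and the loss weights are placeholders, not the paper's reported settings:

```python
import torch
import torch.nn.functional as F

def mpo_loss(
    policy_chosen_logps: torch.Tensor,    # sum log p_theta(chosen | x), shape (B,)
    policy_rejected_logps: torch.Tensor,  # sum log p_theta(rejected | x), shape (B,)
    ref_chosen_logps: torch.Tensor,       # same quantities under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    chosen_token_nll: torch.Tensor,       # token-averaged NLL of the chosen response, shape (B,)
    beta: float = 0.1,
    delta: float = 0.0,                   # reward shift for the quality term (placeholder)
    w_pref: float = 1.0,
    w_qual: float = 1.0,
    w_gen: float = 1.0,
) -> torch.Tensor:
    """Weighted sum of a DPO-style preference loss, a BCO-style quality loss,
    and an SFT generation loss (weights here are illustrative)."""
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)

    # Preference loss: learn the *relative* preference chosen > rejected.
    pref_loss = -F.logsigmoid(chosen_reward - rejected_reward).mean()

    # Quality loss: judge each response's *absolute* quality against delta.
    qual_loss = 0.5 * (
        -F.logsigmoid(chosen_reward - delta)
        - F.logsigmoid(-(rejected_reward - delta))
    ).mean()

    # Generation loss: keep maximizing the likelihood of the chosen response.
    gen_loss = chosen_token_nll.mean()

    return w_pref * pref_loss + w_qual * qual_loss + w_gen * gen_loss
```

Note that the quality term reuses the same chosen/rejected rewards as the preference term, so it adds essentially no extra forward passes on top of plain DPO.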
-----
💡 Key Insights:
→ Distribution shifts during inference significantly impact CoT performance
→ Combining multiple loss functions improves reasoning capabilities
→ Automated preference data generation can be more efficient than manual annotation
-----
📊 Results:
→ InternVL2-8B-MPO achieves 67.0% accuracy on MathVista, outperforming the base model by 8.7 points
→ Matches the performance of InternVL2-76B, a model roughly 10x larger
→ Sets a new state of the art among open-source models on MathVision with 25.7% accuracy