URSA enhances multimodal LLMs' mathematical reasoning by introducing high-quality Chain-of-Thought training data and a step-level verifier that improves problem-solving accuracy at test time.
-----
https://arxiv.org/abs/2501.04686
🤔 Original Problem:
→ Current multimodal LLMs struggle with complex mathematical reasoning due to a lack of high-quality Chain-of-Thought (CoT) training data
→ Existing models can't effectively verify their own reasoning steps, leading to inconsistent performance
-----
🔧 Solution in this Paper:
→ Introduces MMathCoT-1M, a synthesized dataset combining CoT distillation, trajectory rewriting, and format unification
→ Develops the URSA-7B model on a Qwen2.5-Math-7B-Instruct language backbone with a hybrid vision encoder (SAM-B + SigLIP-L); a fusion sketch follows this list
→ Creates the DualMath-1.1M dataset for training the URSA-RM-7B verifier via dual-view process supervision
→ Implements Monte Carlo Tree Search-style rollouts to detect errors in individual reasoning steps (second sketch below)
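The hybrid encoder pairs SAM-B's spatially fine features with SigLIP-L's semantically rich features. Below is a minimal PyTorch sketch of one plausible fusion scheme; the module name, feature dimensions, equal token counts from both towers, and simple channel-concatenation fusion are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class HybridVisionEncoder(nn.Module):
    """Illustrative fusion of two frozen vision encoders (assumed design,
    not the paper's exact architecture)."""

    def __init__(self, sam_encoder, siglip_encoder,
                 sam_dim=256, siglip_dim=1024, out_dim=3584):
        super().__init__()
        self.sam = sam_encoder        # e.g., SAM-B image encoder
        self.siglip = siglip_encoder  # e.g., SigLIP-L vision tower
        for enc in (self.sam, self.siglip):
            for p in enc.parameters():
                p.requires_grad = False  # keep both towers frozen
        # Project concatenated features into the LLM embedding space
        # (3584 assumed to match the Qwen2.5-7B hidden size).
        self.proj = nn.Sequential(
            nn.Linear(sam_dim + siglip_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, image):
        # Assumes both encoders emit token sequences of the same length N.
        sam_tokens = self.sam(image)        # (B, N, sam_dim)
        siglip_tokens = self.siglip(image)  # (B, N, siglip_dim)
        fused = torch.cat([sam_tokens, siglip_tokens], dim=-1)
        return self.proj(fused)             # (B, N, out_dim) visual tokens
```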
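The MCTS-style error detection amounts to estimating, for each reasoning-step prefix, how often sampled continuations still reach the correct answer; the first step where that estimate collapses is flagged as the error. A hedged sketch follows, where `sample_completions` and `reaches_answer` are hypothetical helpers standing in for the policy model and answer checker:

```python
def locate_first_error(question, steps, gold_answer,
                       sample_completions, reaches_answer,
                       rollouts=8, threshold=0.0):
    """Monte Carlo estimate of step-level correctness (sketch).

    For each prefix steps[:i], sample `rollouts` completions and count
    how many reach the gold answer. The first step whose success rate
    falls to `threshold` or below is flagged as the first error.
    Returns (per-step labels, index of first bad step or None).
    """
    labels, first_error = [], None
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]
        hits = sum(
            reaches_answer(completion, gold_answer)
            for completion in sample_completions(question, prefix, n=rollouts)
        )
        value = hits / rollouts
        labels.append(value > threshold)  # correct if any rollout succeeds
        if value <= threshold and first_error is None:
            first_error = i - 1           # 0-based index of first bad step
    return labels, first_error
```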
-----
💡 Key Insights:
→ High-quality CoT training data significantly improves mathematical reasoning capabilities
→ Dual-view process supervision helps catch both logical and visual interpretation errors
→ Test-time scaling with verification can boost model performance substantially; see the best-of-N sketch after this list
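Test-time scaling with a verifier is essentially best-of-N selection: sample several CoT solutions, score each with the reward model (here, URSA-RM-7B), and keep the top-scoring one. A minimal sketch, where `generate_solutions` and `score_solution` are assumed stand-ins for the policy model and the verifier:

```python
def best_of_n(question, image, generate_solutions, score_solution, n=4):
    """Sample n candidate CoT solutions and return the one the
    verifier scores highest (best-of-N selection)."""
    candidates = generate_solutions(question, image, n=n)
    return max(candidates, key=lambda sol: score_solution(question, image, sol))
```

With n=4 this corresponds to the 4-sampling-iteration setting reported in the Results below.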
-----
📊 Results:
→ Achieves 90.9% pass rate on MathVista geometry problems with just 4 sampling iterations
→ Surpasses GPT-4V on several metrics while approaching GPT-4o performance
→ Shows a 74.2% relative improvement in pass@64 over single-pass inference on MathVerse (pass@k sketched below)
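For context, pass@k counts a problem as solved if any of k sampled solutions is correct. The standard unbiased estimator over n samples with c correct (from the Codex paper, included here as background rather than code from the URSA paper) is:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    the chance that at least one of k draws without replacement
    from n samples (c of them correct) is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: the 74.2% relative gain on MathVerse means
# pass@64 ≈ 1.742 × the single-inference (pass@1) rate.
```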