"URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics"

Podcast on this paper generated with Google's Illuminate.

URSA enhances multimodal LLMs' mathematical reasoning by introducing high-quality Chain-of-Thought (CoT) training data and a step-level verifier that boosts problem-solving accuracy at test time.

-----

https://arxiv.org/abs/2501.04686

🤔 Original Problem:

→ Current multimodal LLMs struggle with complex mathematical reasoning due to a lack of high-quality Chain-of-Thought (CoT) training data

→ Existing models can't effectively verify their own reasoning steps, leading to inconsistent performance

-----

🔧 Solution in this Paper:

→ Introduces MMathCoT-1M, a synthesized dataset combining CoT distillation, trajectory rewriting, and format unification

→ Develops the URSA-7B model, using Qwen2.5-Math-7B-Instruct as the language backbone and a hybrid vision encoder (SAM-B + SigLIP-L)

→ Creates the DualMath-1.1M dataset to train the URSA-RM-7B verifier with dual-view process supervision

→ Implements Monte Carlo Tree Search-based rollouts to detect erroneous reasoning steps (a simplified sketch follows this list)
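
The error-localization idea can be pictured with a rollout-based sketch: for each step prefix of a candidate solution, sample a few completions and flag the first step whose prefix rarely reaches the correct answer. This is a minimal illustration of the idea, not the paper's implementation; `rollout` and `is_correct` are hypothetical placeholders standing in for a generator model and an answer checker.

```python
from typing import Callable, List


def locate_first_error(
    steps: List[str],                      # candidate chain-of-thought, split into steps
    rollout: Callable[[List[str]], str],   # hypothetical: sample a full completion from a step prefix
    is_correct: Callable[[str], bool],     # hypothetical: check the final answer of a completion
    n_rollouts: int = 8,
    threshold: float = 0.5,
) -> int:
    """Return the index of the first suspect step, or -1 if none is found.

    For each prefix, estimate the success rate of Monte Carlo rollouts;
    the step where this estimate collapses is flagged as the first error.
    """
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]
        successes = sum(is_correct(rollout(prefix)) for _ in range(n_rollouts))
        if successes / n_rollouts < threshold:
            return i - 1   # 0-indexed step where the trajectory likely went wrong
    return -1
```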

-----

💡 Key Insights:

→ High-quality CoT training data significantly improves mathematical reasoning capabilities

→ Dual-view process supervision helps catch both logical and visual interpretation errors

→ Test-time scaling with verification can boost model performance substantially (a minimal best-of-N sketch follows this list)
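
The test-time scaling recipe referenced above is essentially best-of-N selection: sample several candidate solutions and keep the one the verifier scores highest. A minimal sketch, assuming hypothetical `sample_solution` and `score_solution` callables standing in for the generator (e.g. URSA-7B) and the verifier (e.g. URSA-RM-7B):

```python
from typing import Callable, Tuple


def best_of_n(
    question: str,
    sample_solution: Callable[[str], str],        # hypothetical: one sampled CoT solution
    score_solution: Callable[[str, str], float],  # hypothetical: verifier score for (question, solution)
    n: int = 4,
) -> Tuple[str, float]:
    """Draw n candidate solutions and return the highest-scoring one."""
    candidates = [sample_solution(question) for _ in range(n)]
    scored = [(sol, score_solution(question, sol)) for sol in candidates]
    return max(scored, key=lambda pair: pair[1])
```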

-----

📊 Results:

→ Achieves 90.9% pass rate on MathVista geometry problems with just 4 sampling iterations

→ Surpasses GPT-4V on several metrics while approaching GPT-4o performance

→ Shows 74.2% relative improvement in pass@64 compared to single inference on MathVerse (pass@k estimator sketched below)
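
For reference, pass@k (as in pass@64 above) is commonly estimated with the standard unbiased estimator over n sampled solutions of which c are correct; the snippet below shows that estimator, which may differ from the paper's exact evaluation script.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: 1 - C(n-c, k) / C(n, k),
    given n sampled solutions of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example with made-up counts: 64 samples, 12 correct
print(pass_at_k(64, 12, 64))  # pass@64
print(pass_at_k(64, 12, 1))   # pass@1 (single inference)
```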
