First open-source multimodal math pre-training dataset boosts MLLM performance on math reasoning benchmarks such as MathVerse and We-Math.
📚 https://arxiv.org/pdf/2409.12568
What this paper did:
• Created the InfiMM-WebMath-40B dataset:
- 24 million web pages, 85 million image URLs, 40 billion text tokens
- Filtered from CommonCrawl with a multi-stage pipeline (see the filtering sketch after this list)
• Developed InfiMM-Math models:
- Visual encoder: SigLIP (siglip-so400m-patch14-384)
- Vision-to-language connector: 3-layer Perceiver Resampler with 64 latents (connector sketch below)
- LLMs: DeepSeek-Coder (1.3B and 7B versions)
• Three-stage training (stage outline sketched after this list):
1. Modality alignment: 8 million image-text pairs from the DFN-2B dataset
2. Continued pre-training: on the InfiMM-WebMath-40B dataset
3. Instruction fine-tuning: on multiple datasets, including PGPS9K, Geo170k, and TabMWP
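To make the multi-stage CommonCrawl filter concrete, here is a minimal Python sketch. Everything in it (the language filter, the LaTeX-marker regex, the `classifier.predict` interface, the 0.5 threshold) is an illustrative assumption, not the paper's actual pipeline:

```python
import re

# Hypothetical math markers -- illustrative only, not the paper's actual rules.
MATH_MARKERS = re.compile(r"(\\frac|\\sum|\\int|\$\$|\\begin\{equation\}|<math)")

def stage1_language_filter(page: dict) -> bool:
    """Keep English and Chinese pages (the two languages the dataset covers)."""
    return page.get("lang") in {"en", "zh"}

def stage2_rule_filter(page: dict) -> bool:
    """Cheap heuristic pass: keep pages containing LaTeX/MathML markers."""
    return bool(MATH_MARKERS.search(page["html"]))

def stage3_model_filter(page: dict, classifier, threshold: float = 0.5) -> bool:
    """Score surviving pages with a math-content classifier (fastText-style)."""
    return classifier.predict(page["text"]) >= threshold

def filter_corpus(pages, classifier):
    """Chain the stages; each stage only sees survivors of the previous one."""
    for page in pages:
        if (stage1_language_filter(page)
                and stage2_rule_filter(page)
                and stage3_model_filter(page, classifier)):
            yield page
```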
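The connector's job is to compress a variable number of SigLIP patch embeddings into a fixed set of 64 tokens for the LLM. Below is a minimal PyTorch sketch of a Flamingo-style Perceiver Resampler with the stated depth and latent count; the hidden width (1152, SigLIP-SO400M's output size), head count, and FFN ratio are assumptions, and the paper's exact module may differ:

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """A fixed set of learned latents cross-attends to the visual tokens,
    so any number of image patches is compressed to `num_latents` vectors."""

    def __init__(self, dim: int = 1152, num_latents: int = 64,
                 depth: int = 3, num_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "attn": nn.MultiheadAttention(dim, num_heads, batch_first=True),
                "norm_q": nn.LayerNorm(dim),
                "norm_kv": nn.LayerNorm(dim),
                "ffn": nn.Sequential(
                    nn.LayerNorm(dim),
                    nn.Linear(dim, 4 * dim),
                    nn.GELU(),
                    nn.Linear(4 * dim, dim),
                ),
            })
            for _ in range(depth)
        ])

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_patches, dim) from the SigLIP encoder
        b = visual_tokens.size(0)
        x = self.latents.unsqueeze(0).expand(b, -1, -1)   # (b, 64, dim)
        for layer in self.layers:
            q = layer["norm_q"](x)
            kv = layer["norm_kv"](torch.cat([visual_tokens, x], dim=1))
            attn_out, _ = layer["attn"](q, kv, kv, need_weights=False)
            x = x + attn_out
            x = x + layer["ffn"](x)
        return x  # (b, 64, dim), fed to the LLM after projection
```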
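And an illustrative outline of the three-stage recipe as plain Python data. Which modules are frozen at each stage follows common MLLM practice (train only the connector for alignment, unfreeze the LLM afterwards) and is an assumption rather than a detail taken from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class TrainingStage:
    name: str
    data: list
    trainable: list                     # modules receiving gradients
    frozen: list = field(default_factory=list)

STAGES = [
    TrainingStage("1_modality_alignment",
                  data=["DFN-2B image-text pairs (~8M)"],
                  trainable=["perceiver_resampler"],
                  frozen=["siglip_encoder", "deepseek_coder_llm"]),
    TrainingStage("2_continued_pretraining",
                  data=["InfiMM-WebMath-40B"],
                  trainable=["perceiver_resampler", "deepseek_coder_llm"],
                  frozen=["siglip_encoder"]),
    TrainingStage("3_instruction_finetuning",
                  data=["PGPS9K", "Geo170k", "TabMWP"],
                  trainable=["perceiver_resampler", "deepseek_coder_llm"],
                  frozen=["siglip_encoder"]),
]

for stage in STAGES:
    print(f"{stage.name}: train {stage.trainable} on {stage.data}")
```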
-----
Results 📊:
• InfiMM-Math models outperform existing open-source models on benchmarks:
- MathVerse: 34.5 average score (7B model), surpassing GPT-4V in some categories
- We-Math: 20.6 average score (7B model), outperforming larger models like LLaVA-NeXT-110B
• The 1.3B model trained on 40B tokens matches the performance of DeepSeekMath-1.3B trained on 120B tokens
• Substantial gains in multimodal math reasoning over prior open-source MLLMs
-----
Key Insights from this Paper 💡:
• Comprehensive data curation pipeline for high-quality math content extraction
• Effective combination of visual and language models for mathematical reasoning
• Importance of multi-stage training for enhancing MLLM performance