
InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning

This podcast was generated with Google's Illuminate.

The first large-scale open-source multimodal math pre-training dataset boosts MLLM performance on multimodal math reasoning benchmarks such as MathVerse and We-Math.

📚 https://arxiv.org/pdf/2409.12568

This project's contributions:

• Created the InfiMM-WebMath-40B dataset:

- 24 million web pages, 85 million image URLs, 40 billion text tokens

- Filtered from CommonCrawl using a multi-stage pipeline

• Developed InfiMM-Math models:

- Visual encoder: SigLIP (siglip-so400m-patch14-384)

- Vision-to-language connector: 3-layer Perceiver Resampler with 64 latents (see the sketch after this list)

- LLMs: DeepSeek-Coder (1.3B and 7B versions)

• Three-stage training (a hypothetical stage outline also follows this list):

1. Modality alignment: 8 million image-text pairs from the DFN-2B dataset

2. Continued pre-training: the InfiMM-WebMath-40B dataset

3. Instruction fine-tuning: multiple datasets, including PGPS9K, Geo170k, and TABMWP
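A minimal PyTorch sketch (not the authors' code) of the connector described above: a Perceiver Resampler with 64 learned latents and 3 cross-attention layers that compresses SigLIP patch features into a fixed 64-token sequence for the LLM. Hidden size, head count, and FFN width are illustrative assumptions.

```python
import torch
import torch.nn as nn


class PerceiverResamplerLayer(nn.Module):
    """One resampler block: latents cross-attend to image patch features, then an FFN."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm_latents = nn.LayerNorm(dim)
        self.norm_inputs = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, latents: torch.Tensor, patches: torch.Tensor) -> torch.Tensor:
        q, kv = self.norm_latents(latents), self.norm_inputs(patches)
        attn_out, _ = self.cross_attn(q, kv, kv)
        latents = latents + attn_out
        return latents + self.ffn(latents)


class PerceiverResampler(nn.Module):
    """Compresses a variable-length patch sequence into num_latents tokens."""

    def __init__(self, dim: int = 1152, num_latents: int = 64, depth: int = 3):
        super().__init__()
        self.latents = nn.Parameter(0.02 * torch.randn(num_latents, dim))
        self.layers = nn.ModuleList([PerceiverResamplerLayer(dim) for _ in range(depth)])

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, dim) from the SigLIP visual encoder.
        latents = self.latents.unsqueeze(0).expand(patch_features.size(0), -1, -1)
        for layer in self.layers:
            latents = layer(latents, patch_features)
        # In the full model these 64 tokens would be projected into the LLM's
        # embedding space and interleaved with text tokens for DeepSeek-Coder.
        return latents


# e.g., a 384x384 input with 14x14 patches yields a 27x27 grid = 729 patch features.
resampler = PerceiverResampler()
print(resampler(torch.randn(2, 729, 1152)).shape)  # torch.Size([2, 64, 1152])
```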
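And a purely illustrative outline of the three-stage recipe listed above; which modules are trained or frozen at each stage is an assumption here, not taken from the paper.

```python
# Hypothetical outline of the three-stage training recipe.
# The trainable/frozen split per stage is an illustrative assumption.
STAGES = [
    {"stage": 1, "name": "modality alignment",
     "data": "8M image-text pairs from DFN-2B",
     "trainable": ["connector"], "frozen": ["vision_encoder", "llm"]},
    {"stage": 2, "name": "continued pre-training",
     "data": "InfiMM-WebMath-40B",
     "trainable": ["connector", "llm"], "frozen": ["vision_encoder"]},
    {"stage": 3, "name": "instruction fine-tuning",
     "data": "PGPS9K, Geo170k, TABMWP, ...",
     "trainable": ["connector", "llm"], "frozen": ["vision_encoder"]},
]

for s in STAGES:
    print(f"Stage {s['stage']} ({s['name']}): train {s['trainable']} on {s['data']}")
```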

-----

Results 📊:

• InfiMM-Math models outperform existing open-source models on benchmarks:

- MathVerse: 34.5 average score (7B model), surpassing GPT-4V in some categories

- We-Math: 20.6 average score (7B model), outperforming larger models like LLaVA-NeXT-110B

• The 1.3B model trained on 40B tokens matches the performance of DeepSeekMath-1.3B trained on 120B tokens

• Significant improvements in multimodal math reasoning capabilities

-----

Key Insights from this Paper 💡:

• Comprehensive data curation pipeline for high-quality math content extraction (a toy sketch follows this list)

• Effective combination of visual and language models for mathematical reasoning

• Importance of multi-stage training for enhancing MLLM performance
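A toy sketch of the kind of multi-stage filtering that turns CommonCrawl into a math pre-training corpus: cheap rule-based filters first, then a learned math-content classifier. The regexes, thresholds, target languages, and scoring function below are hypothetical stand-ins, not the paper's actual filters.

```python
import re
from dataclasses import dataclass, field

# Crude "does this look like math?" heuristic; a real pipeline would use much
# richer rules plus a trained classifier (this pattern is an illustrative stand-in).
MATH_HINTS = re.compile(r"(\\frac|\\sum|\\int|\$[^$]+\$|\\begin\{equation\}|=|\+)")


@dataclass
class Page:
    url: str
    language: str             # e.g. from a language-ID model
    text: str
    image_urls: list = field(default_factory=list)


def rule_filter(page: Page) -> bool:
    """Stage 1: cheap rules -- keep pages in the target languages with math-like symbols."""
    if page.language not in {"en", "zh"}:           # target languages (assumption)
        return False
    return len(MATH_HINTS.findall(page.text)) >= 3  # threshold is an assumption


def math_score(text: str) -> float:
    """Stage 2 placeholder for a learned math-content classifier."""
    return min(1.0, len(MATH_HINTS.findall(text)) / 5.0)


def pipeline(pages):
    """Keep pages that pass both stages; kept pages contribute interleaved
    text tokens and image URLs to the dataset."""
    for page in pages:
        if rule_filter(page) and math_score(page.text) >= 0.5:  # hypothetical cutoff
            yield page


docs = [Page("https://example.org/notes", "en",
             r"We prove that \sum_{i=1}^{n} i = \frac{n(n+1)}{2} for all n."),
        Page("https://example.org/news", "en", "Local team wins the game.")]
print([p.url for p in pipeline(docs)])  # -> ['https://example.org/notes']
```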
