"Scale-up" is NOT dead. High-quality data is the true key to effective scaling, particularly textbook-level, high-quality knowledge corpora.
The paper collects online instructional videos at scale, extracts keyframes and their corresponding audio transcriptions (text), and from them constructs a high-quality multimodal, interleaved pretraining corpus.
6.5M images interleaved with 0.75B text tokens, distilled from 22,000 hours of instructional video.
The corpus is used to pretrain Vision-Language Models (VLMs), and the takeaway is that VLMs learn better from classroom videos than from random web images. A sketch of what one interleaved sample might look like follows.
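To make the data format concrete, here is a minimal sketch of one interleaved sample in Python. The schema and file names are hypothetical illustrations, not the paper's actual format:

```python
# Hypothetical structure of one interleaved training sample:
# keyframes and refined ASR text appear in their original temporal order,
# so the model sees images and text the way a student would in class.
sample = [
    {"type": "text",  "value": "Today we'll derive the quadratic formula..."},
    {"type": "image", "value": "video_0412/keyframe_0001.jpg"},
    {"type": "text",  "value": "Completing the square gives us..."},
    {"type": "image", "value": "video_0412/keyframe_0002.jpg"},
    {"type": "text",  "value": "x = (-b ± sqrt(b^2 - 4ac)) / 2a"},
]
```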
-----
https://arxiv.org/abs/2501.00958
Original Problem 🤔:
Existing VLM training datasets from web crawls have poor image-text alignment, weak logical connections between images, and low knowledge density. This limits VLMs' ability to learn complex visual reasoning and acquire foundational knowledge.
-----
Solution in this Paper 🛠️:
→ Created a knowledge taxonomy using LLMs covering 6 subjects, 55 courses, and 3,915 knowledge points
→ Systematically collected 159K instructional videos based on this taxonomy
→ Developed a multi-level pipeline to extract and filter content: video-level ASR filtering, clip-level segmentation, and keyframe extraction (a sketch follows this list)
→ Refined ASR transcriptions using LLMs and extracted text/formulas via OCR
→ Organized content chronologically into an interleaved format, maintaining temporal coherence
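Below is a minimal sketch of the keyframe-extraction and chronological-interleaving steps, assuming OpenCV for frame reading and a simple frame-difference heuristic. The thresholds, function names, and ASR segment format are illustrative assumptions, not the paper's exact pipeline:

```python
import cv2
import numpy as np

def extract_keyframes(video_path: str, diff_threshold: float = 30.0,
                      sample_every: int = 30):
    """Keep a frame only when it differs enough from the last kept frame.

    Instructional videos (slides, whiteboards) change slowly, so a mean
    absolute pixel difference is a cheap proxy for "new visual content".
    """
    cap = cv2.VideoCapture(video_path)
    keyframes, last_kept = [], None
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every == 0:  # subsample frames to keep the pass cheap
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if last_kept is None or np.abs(gray.astype(np.int16)
                                           - last_kept).mean() > diff_threshold:
                keyframes.append((idx, frame))
                last_kept = gray.astype(np.int16)
        idx += 1
    cap.release()
    return keyframes

def interleave(keyframes, asr_segments, fps: float = 30.0):
    """Merge keyframes and timestamped ASR text into one chronological stream.

    asr_segments: list of (start_sec, end_sec, refined_text) tuples,
    e.g. the LLM-refined transcriptions described above.
    """
    events = [(i / fps, "image", f) for i, f in keyframes]
    events += [(s, "text", t) for s, _e, t in asr_segments]
    events.sort(key=lambda e: e[0])  # temporal coherence: sort by timestamp
    return [(kind, payload) for _t, kind, payload in events]
```

The int16 cast avoids uint8 wraparound when differencing frames, and subsampling matters at this scale: 22,000 hours of video cannot be scanned frame by frame.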
-----
Key Insights 💡:
→ Video-based datasets provide better sequential coherence than web-crawled data
→ Educational content offers higher knowledge density and better image-text alignment
→ Multi-level filtering is crucial for maintaining data quality
-----
Results 📊:
→ Average gains of +3.2%, +8.3%, +4.0%, and +4.6% in 0-shot to 4-shot settings across 7 benchmarks
→ +20% improvement on ScienceQA compared to MMC4
→ Significant boost in knowledge-intensive tasks like MathVista (+5.3%)