"Scale-up" is NOT dead. High-quality data is the true key to effective scaling, particularly textbook-level, high-quality knowledge corpora.
The paper collects online instructional videos at scale, extracts keyframes and their corresponding audio transcriptions (text), and from them constructs a high-quality multimodal, interleaved pretraining corpus.
6.5M images interleaved with 0.75B text tokens, distilled from 22,000 hours of instructional video.
The corpus is used to pretrain Vision-Language Models (VLMs), and the takeaway is that VLMs learn better from classroom videos than from random web images. A sketch of what one interleaved sample might look like follows.
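To make the data format concrete, here is a minimal sketch of one interleaved sample in Python. The schema and file names are hypothetical illustrations, not the paper's actual format:

```python
# Hypothetical structure of one interleaved training sample:
# keyframes and refined ASR text appear in their original temporal order,
# so the model sees images and text the way a student would in class.
sample = [
    {"type": "text",  "value": "Today we'll derive the quadratic formula..."},
    {"type": "image", "value": "video_0412/keyframe_0001.jpg"},
    {"type": "text",  "value": "Completing the square gives us..."},
    {"type": "image", "value": "video_0412/keyframe_0002.jpg"},
    {"type": "text",  "value": "x = (-b ± sqrt(b^2 - 4ac)) / 2a"},
]
```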
-----
https://arxiv.org/abs/2501.00958
Original Problem 🤔:
Existing VLM training datasets from web crawls have poor image-text alignment, weak logical connections between images, and low knowledge density. This limits VLMs' ability to learn complex visual reasoning and acquire foundational knowledge.
-----
Solution in this Paper 🛠️:
→ Created a knowledge taxonomy using LLMs covering 6 subjects, 55 courses, and 3,915 knowledge points
→ Systematically collected 159K instructional videos based on this taxonomy
→ Developed a multi-level pipeline to extract and filter content: video-level ASR filtering, clip-level segmentation, and keyframe extraction (a sketch follows this list)
→ Refined ASR transcriptions using LLMs and extracted text/formulas via OCR
→ Organized content chronologically into an interleaved format, maintaining temporal coherence
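Below is a minimal sketch of the keyframe-extraction and chronological-interleaving steps, assuming OpenCV for frame reading and a simple frame-difference heuristic. The thresholds, function names, and ASR segment format are illustrative assumptions, not the paper's exact pipeline:

```python
import cv2
import numpy as np

def extract_keyframes(video_path: str, diff_threshold: float = 30.0,
                      sample_every: int = 30):
    """Keep a frame only when it differs enough from the last kept frame.

    Instructional videos (slides, whiteboards) change slowly, so a mean
    absolute pixel difference is a cheap proxy for "new visual content".
    """
    cap = cv2.VideoCapture(video_path)
    keyframes, last_kept = [], None
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every == 0:  # subsample frames to keep the pass cheap
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if last_kept is None or np.abs(gray.astype(np.int16)
                                           - last_kept).mean() > diff_threshold:
                keyframes.append((idx, frame))
                last_kept = gray.astype(np.int16)
        idx += 1
    cap.release()
    return keyframes

def interleave(keyframes, asr_segments, fps: float = 30.0):
    """Merge keyframes and timestamped ASR text into one chronological stream.

    asr_segments: list of (start_sec, end_sec, refined_text) tuples,
    e.g. the LLM-refined transcriptions described above.
    """
    events = [(i / fps, "image", f) for i, f in keyframes]
    events += [(s, "text", t) for s, _e, t in asr_segments]
    events.sort(key=lambda e: e[0])  # temporal coherence: sort by timestamp
    return [(kind, payload) for _t, kind, payload in events]
```

The int16 cast avoids uint8 wraparound when differencing frames, and subsampling matters at this scale: 22,000 hours of video cannot be scanned frame by frame.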
-----
Key Insights 💡:
→ Video-based datasets provide better sequential coherence than web-crawled data
→ Educational content offers higher knowledge density and better image-text alignment
→ Multi-level filtering is crucial for maintaining data quality
-----
Results 📊:
→ Average gains of +3.2%, +8.3%, +4.0%, and +4.6% in 0-shot to 4-shot settings across 7 benchmarks
→ +20% improvement on ScienceQA compared to MMC4
→ Significant boost in knowledge-intensive tasks like MathVista (+5.3%)