"$100K or 100 Days: Trade-offs when Pre-Training with Academic Resources"

This podcast was generated from the paper with Google's Illuminate, Google's platform for creating podcasts from arXiv papers

Academic researchers can now pre-train billion-parameter models using just 4 GPUs through optimized configurations

Strategic GPU optimization enables academics to pre-train LLMs with 3x less compute

📚 https://arxiv.org/abs/2410.23261

🤖 Original Problem:

Academic researchers lack the compute to pre-train LLMs: 85% report zero cloud budget and limited GPU access, and most can use only 1-8 GPUs for days to weeks at a time, which makes model pre-training seem out of reach.

-----

🔧 Solution in this Paper:

→ Created a benchmark to measure pre-training time on academic GPUs

→ Optimized training using "free-lunch" methods (sketched in code after this list):

- Model compilation for GPU optimization

- Custom kernels (FlashAttention, SSM-specific)

- TF32 mode for matrix operations
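
A minimal PyTorch sketch of what these "free-lunch" settings typically look like. This is my own illustration of the techniques named above, not code from the paper; the placeholder model and exact flags are assumptions.

```python
import torch
import torch.nn.functional as F

# TF32 mode: faster float32 matmuls on Ampere/Hopper GPUs at slightly reduced precision.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Placeholder model; a real run would build a full transformer or SSM language model.
model = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True).cuda()

# Model compilation: fuses kernels and removes Python overhead from the training loop.
model = torch.compile(model)

# Custom attention kernel: scaled_dot_product_attention dispatches to FlashAttention
# on supported GPUs when inputs are fp16/bf16 (shape: batch, heads, seq_len, head_dim).
q = k = v = torch.randn(8, 16, 512, 64, device="cuda", dtype=torch.bfloat16)
attn_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Forward pass through the compiled model; TF32 applies to its float32 matmuls.
x = torch.randn(8, 512, 1024, device="cuda")
out = model(x)
```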

→ Implemented memory-saving techniques (sketched in code after this list):

- Activation checkpointing

- Model sharding across GPUs

- Offloading to system RAM
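
A hedged sketch, in PyTorch terms, of how these memory-saving methods are typically wired up: activation checkpointing via torch.utils.checkpoint, sharding plus RAM offload via FSDP. The Block module is a stand-in of my own, not the paper's model, and the FSDP wrapper assumes a distributed process group has already been initialized (e.g. with torchrun).

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, CPUOffload
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """Stand-in transformer block; a real model would pair attention with this MLP."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim),
            torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        # Activation checkpointing: recompute this sub-graph's activations during
        # the backward pass instead of keeping them in GPU memory.
        return x + checkpoint(self.ff, x, use_reentrant=False)

model = torch.nn.Sequential(*[Block() for _ in range(12)]).cuda()

# Model sharding across GPUs, plus offloading sharded parameters to system RAM.
# Requires an initialized process group (e.g. torchrun --nproc_per_node=4 train.py).
sharded_model = FSDP(model, cpu_offload=CPUOffload(offload_params=True))
```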

-----

💡 Key Insights:

→ Most academics (70-80%) use GPUs only for fine-tuning and inference

→ Only 17% attempt pre-training models under 1B parameters

→ Optimal configurations can reduce training time by 3x

→ Memory-saving methods provide up to 71% speedup with multiple GPUs

-----

📊 Results:

→ Pythia-1B can be trained in 18 days on 4 A100 GPUs, versus the original run's 64 GPUs for 3 days (see the GPU-days arithmetic after this list)

→ 4 H100 GPUs ($130k) are more cost-effective than 8 A100s ($160k)

→ Training costs: $800 on A100s vs $600 on H100s

→ 8 H100 GPUs ($250k) provide the fastest training, at 4 days
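
For intuition, the Pythia-1B numbers above translate into raw GPU-time roughly as follows (simple arithmetic on the figures quoted in this post, not an additional result):

```python
# GPU-days comparison for the Pythia-1B figures quoted above.
original_gpu_days = 64 * 3   # original run: 64 A100s for ~3 days -> 192 GPU-days
academic_gpu_days = 4 * 18   # optimized academic run: 4 A100s for 18 days -> 72 GPU-days

reduction = original_gpu_days / academic_gpu_days
print(f"Raw GPU-time reduction: {reduction:.1f}x")  # ~2.7x, in line with the ~3x claim
```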
