Academic researchers can now pre-train billion-parameter models using just 4 GPUs through optimized configurations
Strategic GPU optimization enables academics to pre-train LLMs with 3x less compute
📚 https://arxiv.org/abs/2410.23261
🤖 Original Problem:
Academic researchers lack the compute to pre-train LLMs: 85% of those surveyed have zero cloud budget and limited GPU access, and most can use only 1-8 GPUs for days to weeks at a time, which makes model pre-training seem out of reach.
-----
🔧 Solution in this Paper:
→ Created a benchmark to measure pre-training time on academic GPUs
→ Optimized training with "free-lunch" methods (see the first sketch after this list):
- Model compilation for GPU optimization
- Custom kernels (FlashAttention, SSM-specific)
- TF32 mode for matrix operations
→ Implemented memory-saving techniques (see the second sketch after this list):
- Activation checkpointing
- Model sharding across GPUs
- Offloading to system RAM
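
To make the "free-lunch" group concrete, here is a minimal PyTorch sketch of those settings: TF32 matmuls, model compilation, and an attention path that can dispatch to a FlashAttention kernel. The model size and hyperparameters are placeholder assumptions, not the paper's benchmark code.

```python
import torch
import torch.nn as nn

# TF32 mode: faster matrix multiplies on Ampere+ GPUs with negligible accuracy loss.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# A stand-in transformer block; real pre-training would use a full LLM.
model = nn.TransformerEncoderLayer(
    d_model=2048, nhead=16, dim_feedforward=8192, batch_first=True
).cuda()

# Model compilation: fuse kernels and cut Python overhead.
model = torch.compile(model)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
x = torch.randn(8, 1024, 2048, device="cuda")

# The built-in attention uses scaled_dot_product_attention, which can select a
# FlashAttention kernel; autocast runs the heavy math in bfloat16.
with torch.autocast("cuda", dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()  # dummy objective, just to drive one step
loss.backward()
optimizer.step()
```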
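
And a companion sketch of the memory-saving side: activation checkpointing inside each block, plus FSDP to shard parameters, gradients, and optimizer state across GPUs and park idle shards in system RAM. Again a simplified illustration under assumed sizes, not the paper's implementation; launch with torchrun.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import CPUOffload, FullyShardedDataParallel as FSDP
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def __init__(self, dim: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=16, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        # Activation checkpointing: recompute this block's activations during the
        # backward pass instead of keeping them in GPU memory.
        def inner(x):
            x = x + self.attn(x, x, x, need_weights=False)[0]
            return x + self.mlp(x)
        return checkpoint(inner, x, use_reentrant=False)

# Launch with e.g. `torchrun --nproc_per_node=4 train.py`.
dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = nn.Sequential(*[Block() for _ in range(24)]).cuda()

# Model sharding + offloading: FSDP splits parameters, gradients, and optimizer
# state across the GPUs and offloads parameter shards to system RAM when idle.
model = FSDP(model, cpu_offload=CPUOffload(offload_params=True))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
```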
-----
💡 Key Insights:
→ Most academics (70-80%) use their GPUs only for fine-tuning and inference
→ Only 17% attempt to pre-train models under 1B parameters
→ Optimal configurations can reduce training time by 3x
→ Memory-saving methods provide up to 71% speedup with multiple GPUs
-----
📊 Results:
→ Pythia-1B can be trained in 18 days on 4 A100 GPUs, versus the original run's 64 GPUs for 3 days (72 vs 192 GPU-days, roughly 2.7x less compute)
→ 4 H100 GPUs ($130k) are more cost-effective than 8 A100s ($160k)
→ Training costs: $800 on A100s vs $600 on H100s
→ 8 H100 GPUs ($250k) provide the fastest training, at 4 days
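
Days-to-train figures like these are typically produced by measuring throughput over a few steps and extrapolating across the full token budget. A minimal sketch of that arithmetic, where the 300B-token budget and the 50k tokens/s per-GPU throughput are assumed placeholder numbers, not measurements from the paper:

```python
def days_to_train(token_budget: float, tokens_per_sec_per_gpu: float, num_gpus: int) -> float:
    """Wall-clock days to consume a token budget at a measured per-GPU throughput."""
    seconds = token_budget / (tokens_per_sec_per_gpu * num_gpus)
    return seconds / 86_400  # seconds per day

# Hypothetical example: a 300B-token run at 50k tokens/s per GPU on 4 GPUs.
print(f"{days_to_train(300e9, 50_000, 4):.1f} days")  # -> 17.4 days
```

Varying the GPU count and the measured per-GPU throughput in this formula is what produces the hardware vs wall-clock trade-offs listed above.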