This paper maps hardware-cost sweet spots for training efficient small-scale language models.
The data show that the A100-40GB beats the H100 on cost efficiency for training small language models.
📚 https://arxiv.org/abs/2410.19456
🎯 Original Problem:
The computational bottlenecks of training small-scale LLMs (under 2B parameters) are poorly understood, and no systematic study exists on optimal hardware configurations and training dynamics for these models.
-----
🔧 Solution in this Paper:
• Analyzed training behavior of models up to 2B parameters across:
- GPU types (A100-40GB, A100-80GB, H100-80GB)
- Batch sizes (4 to 64 per device)
- Communication protocols (DDP vs FSDP)
- Attention mechanisms (Vanilla vs FlashAttention)
- GPU counts (1 to 64)
• Used Token/Dollar and Token/Second as key metrics for cost efficiency
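
A minimal sketch of how these two metrics can be computed around a training loop. The GPU hourly prices and the `train_step` callable below are illustrative assumptions, not figures or code from the paper.

```python
import time

# Hypothetical on-demand hourly prices (USD per GPU-hour); the paper's actual
# pricing assumptions may differ.
GPU_HOURLY_PRICE = {
    "A100-40GB": 1.5,   # assumed
    "A100-80GB": 2.0,   # assumed
    "H100-80GB": 4.0,   # assumed
}

def measure_cost_efficiency(train_step, num_steps, tokens_per_step,
                            gpu_type, num_gpus):
    """Run `num_steps` training steps and report Token/Second and Token/Dollar.

    `train_step` is any zero-argument callable that performs one optimizer step;
    `tokens_per_step` = global batch size * sequence length.
    """
    start = time.perf_counter()
    for _ in range(num_steps):
        train_step()
    elapsed = time.perf_counter() - start

    total_tokens = num_steps * tokens_per_step
    tokens_per_second = total_tokens / elapsed

    # Dollar cost = price per GPU-hour * number of GPUs * hours elapsed.
    dollars = GPU_HOURLY_PRICE[gpu_type] * num_gpus * (elapsed / 3600.0)
    tokens_per_dollar = total_tokens / dollars
    return tokens_per_second, tokens_per_dollar
```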
-----
💡 Key Insights:
• FlashAttention yields larger efficiency gains for smaller models (see the sketch after this list)
• A100-40GB is cost-optimal for smaller models
• H100 GPUs are not cost-efficient for training small LLMs
• DDP works better for smaller models due to lower communication overhead
• FSDP outperforms DDP for 2B parameter models with large batch sizes
• Cost efficiency saturates before maximum GPU memory utilization
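
The paper compares vanilla attention against FlashAttention. One way to toggle between the two is PyTorch's scaled_dot_product_attention with an explicit backend choice, as in the sketch below; this is an assumed setup for illustration, not the paper's code (newer PyTorch versions prefer `torch.nn.attention.sdpa_kernel` for the same purpose).

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, use_flash=True):
    """q/k/v are (batch, heads, seq, head_dim) half-precision CUDA tensors.
    With use_flash=True the FlashAttention kernel is requested; otherwise the
    math (vanilla) backend is used for comparison."""
    if use_flash:
        ctx = torch.backends.cuda.sdp_kernel(
            enable_flash=True, enable_math=False, enable_mem_efficient=False)
    else:
        ctx = torch.backends.cuda.sdp_kernel(
            enable_flash=False, enable_math=True, enable_mem_efficient=False)
    with ctx:
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Toy comparison of the two kernels on the same inputs.
q = k = v = torch.randn(4, 16, 2048, 64, device="cuda", dtype=torch.float16)
out_flash = attention(q, k, v, use_flash=True)
out_vanilla = attention(q, k, v, use_flash=False)
```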
-----
📊 Results:
• FlashAttention enables training 1B-2B models with a batch size of 512
• A100-80GB performs best for 1B-2B models with 32+ GPUs
• FSDP with gradient/optimizer state sharding beats full sharding (see the sketch after this list)
• DDP shows 20-30% better Token/Dollar for sub-1B models
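
A minimal sketch of the parallelism choices above: DDP for smaller models, FSDP with gradient/optimizer-state sharding (SHARD_GRAD_OP, ZeRO-2 style) rather than FULL_SHARD for ~2B-parameter models. The `wrap_model` helper and the 1B-parameter threshold are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

def wrap_model(model, num_params):
    """Pick a parallelism wrapper along the lines the paper suggests.
    The 1B cutoff is an assumption for illustration."""
    model = model.cuda()
    if num_params < 1_000_000_000:
        # DDP replicates parameters; only gradients are all-reduced.
        return DDP(model, device_ids=[torch.cuda.current_device()])
    # SHARD_GRAD_OP shards gradients and optimizer state but keeps full
    # parameters resident after forward; FULL_SHARD would also reshard them.
    return FSDP(model, sharding_strategy=ShardingStrategy.SHARD_GRAD_OP)

# Usage (assumes the script was launched with torchrun):
# dist.init_process_group("nccl")
# wrapped = wrap_model(my_model, sum(p.numel() for p in my_model.parameters()))
```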