This paper maps hardware-cost sweet spots for training efficient small-scale language models.
The data show that the A100-40GB beats the H100 on cost efficiency for training small language models.
📚 https://arxiv.org/abs/2410.19456
🎯 Original Problem:
The computational bottlenecks of training small-scale LLMs (under 2B parameters) are poorly understood, and no systematic study exists on optimal hardware configurations and training dynamics for these models.
-----
🔧 Solution in this Paper:
• Analyzed training behavior of models up to 2B parameters across:
- GPU types (A100-40GB, A100-80GB, H100-80GB)
- Batch sizes (4 to 64 per device)
- Communication protocols (DDP vs FSDP)
- Attention mechanisms (Vanilla vs FlashAttention)
- GPU counts (1 to 64)
• Used Token/Dollar and Token/Second as key metrics for cost efficiency
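
A minimal sketch of how these two metrics can be computed around a training loop. The GPU hourly prices and the `train_step` callable below are illustrative assumptions, not figures or code from the paper.

```python
import time

# Hypothetical on-demand hourly prices (USD per GPU-hour); the paper's actual
# pricing assumptions may differ.
GPU_HOURLY_PRICE = {
    "A100-40GB": 1.5,   # assumed
    "A100-80GB": 2.0,   # assumed
    "H100-80GB": 4.0,   # assumed
}

def measure_cost_efficiency(train_step, num_steps, tokens_per_step,
                            gpu_type, num_gpus):
    """Run `num_steps` training steps and report Token/Second and Token/Dollar.

    `train_step` is any zero-argument callable that performs one optimizer step;
    `tokens_per_step` = global batch size * sequence length.
    """
    start = time.perf_counter()
    for _ in range(num_steps):
        train_step()
    elapsed = time.perf_counter() - start

    total_tokens = num_steps * tokens_per_step
    tokens_per_second = total_tokens / elapsed

    # Dollar cost = price per GPU-hour * number of GPUs * hours elapsed.
    dollars = GPU_HOURLY_PRICE[gpu_type] * num_gpus * (elapsed / 3600.0)
    tokens_per_dollar = total_tokens / dollars
    return tokens_per_second, tokens_per_dollar
```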
-----
💡 Key Insights:
• FlashAttention yields larger efficiency gains for smaller models (see the sketch after this list)
• A100-40GB is cost-optimal for smaller models
• H100 GPUs are not cost-efficient for training small LLMs
• DDP works better for smaller models due to lower communication overhead
• FSDP outperforms DDP for 2B parameter models with large batch sizes
• Cost efficiency saturates before maximum GPU memory utilization
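
The paper compares vanilla attention against FlashAttention. One way to toggle between the two is PyTorch's scaled_dot_product_attention with an explicit backend choice, as in the sketch below; this is an assumed setup for illustration, not the paper's code (newer PyTorch versions prefer `torch.nn.attention.sdpa_kernel` for the same purpose).

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, use_flash=True):
    """q/k/v are (batch, heads, seq, head_dim) half-precision CUDA tensors.
    With use_flash=True the FlashAttention kernel is requested; otherwise the
    math (vanilla) backend is used for comparison."""
    if use_flash:
        ctx = torch.backends.cuda.sdp_kernel(
            enable_flash=True, enable_math=False, enable_mem_efficient=False)
    else:
        ctx = torch.backends.cuda.sdp_kernel(
            enable_flash=False, enable_math=True, enable_mem_efficient=False)
    with ctx:
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Toy comparison of the two kernels on the same inputs.
q = k = v = torch.randn(4, 16, 2048, 64, device="cuda", dtype=torch.float16)
out_flash = attention(q, k, v, use_flash=True)
out_vanilla = attention(q, k, v, use_flash=False)
```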
-----
📊 Results:
• FlashAttention enables training 1B-2B models with a batch size of 512
• A100-80GB performs best for 1B-2B models with 32+ GPUs
• FSDP with gradient/optimizer state sharding beats full sharding (see the sketch after this list)
• DDP shows 20-30% better Token/Dollar for sub-1B models
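
A minimal sketch of the parallelism choices above: DDP for smaller models, FSDP with gradient/optimizer-state sharding (SHARD_GRAD_OP, ZeRO-2 style) rather than FULL_SHARD for ~2B-parameter models. The `wrap_model` helper and the 1B-parameter threshold are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

def wrap_model(model, num_params):
    """Pick a parallelism wrapper along the lines the paper suggests.
    The 1B cutoff is an assumption for illustration."""
    model = model.cuda()
    if num_params < 1_000_000_000:
        # DDP replicates parameters; only gradients are all-reduced.
        return DDP(model, device_ids=[torch.cuda.current_device()])
    # SHARD_GRAD_OP shards gradients and optimizer state but keeps full
    # parameters resident after forward; FULL_SHARD would also reshard them.
    return FSDP(model, sharding_strategy=ShardingStrategy.SHARD_GRAD_OP)

# Usage (assumes the script was launched with torchrun):
# dist.init_process_group("nccl")
# wrapped = wrap_model(my_model, sum(p.numel() for p in my_model.parameters()))
```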