Democratize LLM training by connecting distributed GPUs worldwide
Train LLMs across globally distributed, internet-connected GPUs without moving data away from its source.
Photon: First system to enable collaborative LLM training over regular internet connections
https://arxiv.org/abs/2411.02908
🎯 Original Problem:
Training LLMs traditionally requires massive data centers with high-bandwidth interconnects, making it expensive and limiting collaboration. Standard distributed training methods break down over low-bandwidth internet connections.
-----
🔧 Solution in this Paper:
→ Photon enables federated training of LLMs across geographically distributed GPUs connected via regular internet
→ Uses cross-silo Federated Learning to reduce communication overhead by 64x-512x compared to standard distributed training
→ Implements adaptive local parallelism to optimize training based on each client's connectivity
→ Exploits small batch sizes with high learning rates for better model generalization
→ Provides a three-component architecture: Aggregator (central server), LLM Client (local training), and Data Sources (see the sketch after this list)
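A minimal sketch of how one cross-silo federated round could look with this architecture: each LLM Client runs many local optimizer steps on data that never leaves its silo, and the Aggregator averages weights over the internet only once per round. This is not Photon's released code; names like `federated_round` and `local_steps` are illustrative assumptions, and the training loop assumes a Hugging Face-style causal LM whose outputs expose `.loss`.

```python
# Sketch of one cross-silo federated round (illustrative, not Photon's code).
import copy
import itertools

import torch


def local_train(model, loader, steps, lr):
    """LLM Client: many local steps on private data,
    small batches with a comparatively high learning rate."""
    model.train()
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for batch in itertools.islice(itertools.cycle(loader), steps):
        loss = model(**batch).loss  # assumes HF-style outputs with .loss
        loss.backward()
        opt.step()
        opt.zero_grad()
    return model.state_dict()


def aggregate(global_model, client_states, weights):
    """Aggregator: weighted (FedAvg-style) average of client weights."""
    avg = {}
    for key, ref in client_states[0].items():
        if ref.is_floating_point():
            avg[key] = sum(w * s[key] for w, s in zip(weights, client_states))
        else:
            avg[key] = ref  # keep non-float buffers from the first client
    global_model.load_state_dict(avg)
    return global_model


def federated_round(global_model, client_loaders, local_steps=500, lr=4e-4):
    """One round: weights cross the internet once per `local_steps`
    optimizer steps; raw data never leaves each client's Data Source."""
    states, sizes = [], []
    for loader in client_loaders:
        local = copy.deepcopy(global_model)   # pull current global weights
        states.append(local_train(local, loader, local_steps, lr))
        sizes.append(len(loader.dataset))
    weights = [n / sum(sizes) for n in sizes]  # weight clients by data volume
    return aggregate(global_model, states, weights)
```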
-----
💡 Key Insights:
→ Federated Learning can match or exceed centralized training performance for LLMs
→ Small batch sizes with high learning rates work better in federated settings
→ Communication frequency can be drastically reduced without sacrificing model quality (see the back-of-envelope sketch after this list)
→ Data can stay at source while enabling global collaboration
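As a rough illustration of the communication-frequency insight (the payload size and step counts below are assumptions, not figures from the paper): if clients synchronize full model weights only once every H local steps instead of exchanging gradients every step, internet traffic shrinks by roughly a factor of H.

```python
# Back-of-envelope sketch of savings from infrequent synchronization.
# Payload size and step counts are illustrative assumptions.
params = 7e9                 # 7B-parameter model
bytes_per_param = 2          # e.g. bf16 weights
payload_gb = params * bytes_per_param / 1e9   # ~14 GB exchanged per sync

for local_steps in (64, 512):
    # Per-step gradient sync would move `payload_gb` every step;
    # syncing every `local_steps` steps amortizes it by that factor.
    amortized_gb = payload_gb / local_steps
    print(f"sync every {local_steps:>3} steps: "
          f"~{payload_gb:.0f} GB per round, "
          f"~{amortized_gb:.2f} GB amortized per step "
          f"({local_steps}x less traffic than per-step sync)")
```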
-----
📊 Results:
→ Successfully trained 7B-parameter models with 16.9% lower perplexity than centralized training
→ Reduced communication by 64x-512x while achieving 35% better wall-time performance
→ Converged twice as fast as previous methods like DiLoCo
→ Achieved 20% higher throughput than centralized distributed training