
Photon: Federated LLM Pre-Training

The podcast accompanying this paper was generated with Google's Illuminate.

Democratize LLM training by connecting distributed GPUs worldwide

Train LLMs across globally distributed, internet-connected GPUs without moving data from its source.

Photon: First system to enable collaborative LLM training over regular internet connections

https://arxiv.org/abs/2411.02908

🎯 Original Problem:

Training LLMs traditionally requires massive data centers with high-bandwidth interconnects, making it expensive and limiting collaboration. Current distributed training methods cannot work effectively across low-bandwidth internet connections.

-----

🔧 Solution in this Paper:

→ Photon enables federated training of LLMs across geographically distributed GPUs connected via regular internet

→ Uses cross-silo Federated Learning to reduce communication overhead by 64x-512x compared to standard distributed training

→ Implements adaptive local parallelism to optimize training based on each client's connectivity

→ Exploits small batch sizes with high learning rates for better model generalization

→ Provides a three-component architecture: the Aggregator (central server), LLM Clients (local training), and Data Sources (a minimal sketch of one training round follows this list)
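
To make the architecture concrete, below is a minimal sketch of one federated round in this style: the Aggregator broadcasts the model, each LLM Client trains locally on its own Data Source for many steps, and only model parameters are exchanged. The function names (`local_train`, `aggregate`, `federated_round`), the plain SGD optimizer, and the FedAvg-style weighted average are illustrative assumptions, not the paper's actual code or aggregation rule.

```python
# Sketch of a cross-silo federated round (hypothetical names, not Photon's code).
# Each client runs many local steps on its private data; parameters are only
# exchanged once per round, which is where the large communication savings come from.
import copy
import torch
from torch import nn


def local_train(model: nn.Module, data_loader, local_steps: int, lr: float) -> nn.Module:
    """Run `local_steps` optimizer steps on the client's private data (data never leaves the client)."""
    model = copy.deepcopy(model)          # start from the broadcast global model
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    batches = iter(data_loader)
    for _ in range(local_steps):
        x, y = next(batches)              # x: token ids [B, T], y: next-token targets [B, T]
        logits = model(x)                 # [B, T, vocab]
        loss = nn.functional.cross_entropy(logits.flatten(0, 1), y.flatten())
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model


def aggregate(global_model: nn.Module, client_models: list, weights: list) -> None:
    """FedAvg-style aggregation: overwrite the global parameters with a weighted average of client parameters."""
    with torch.no_grad():
        for name, p in global_model.named_parameters():
            p.copy_(sum(w * dict(m.named_parameters())[name]
                        for m, w in zip(client_models, weights)))


def federated_round(global_model, client_loaders, local_steps=64, lr=1e-1):
    """One round: broadcast -> local training on every client -> aggregate."""
    clients = [local_train(global_model, dl, local_steps, lr) for dl in client_loaders]
    weights = [1.0 / len(clients)] * len(clients)   # uniform weights for simplicity
    aggregate(global_model, clients, weights)
    return global_model
```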

-----

💡 Key Insights:

→ Federated Learning can match or exceed centralized training performance for LLMs

→ Small batch sizes with high learning rates work better in federated settings

→ Communication frequency can be drastically reduced without sacrificing model quality (see the back-of-envelope example after this list)

→ Data can stay at source while enabling global collaboration
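
As a rough illustration of the communication-frequency insight, the snippet below assumes a 7B-parameter model exchanged in bf16 (~14 GB per full upload or download) and compares amortized per-step traffic when synchronizing every step versus every 64 or 512 local steps, matching the 64x-512x range above. This is back-of-envelope arithmetic, not the paper's exact communication accounting.

```python
# Illustrative numbers only: how syncing every K local steps amortizes traffic.
params = 7e9                      # 7B-parameter model
bytes_per_param = 2               # bf16
payload_gb = params * bytes_per_param / 1e9   # ~14 GB per full model exchange

for local_steps in (64, 512):
    print(f"sync every {local_steps} steps -> {local_steps}x fewer exchanges, "
          f"~{payload_gb / local_steps:.2f} GB amortized per training step")
```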

-----

📊 Results:

→ Successfully trained 7B parameter models with 16.9% lower perplexity than centralized training

→ Reduced communication by 64x-512x while achieving 35% better wall-time performance

→ Converged twice as fast as previous methods like DiLoCo

→ Achieved 20% higher throughput than centralized distributed training
