
"Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts"

This podcast was generated with Google's Illuminate.

Existing pre-trained time series foundation models lack scale and efficiency, hindering the development of larger, more capable forecasting models for real-world applications.

Time-MoE scales time series forecasting to billion-parameter models, improving accuracy while reducing computational costs, by letting experts sleep until they are needed.

• 23% average MSE reduction in zero-shot forecasting across 6 benchmarks

-----

📚 https://arxiv.org/pdf/2409.16040

Solution in this Paper 🛠️:

• Time-MoE: a decoder-only transformer with mixture-of-experts layers (a minimal sketch follows this list)

• Point-wise tokenization of input time series

• Multi-resolution forecasting heads for flexible prediction horizons

• Pre-training on Time-300B dataset (over 300 billion time points across 9 domains)

• Models scaled up to 2.4 billion parameters (1.1 billion activated)

• Sparse architecture activates only a subset of expert networks for each prediction
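
To make the architecture bullets concrete, here is a minimal, hypothetical PyTorch sketch of the main ideas: each time point becomes one token, a decoder-style block routes every token to its top-k experts instead of a single dense feed-forward network, and separate linear heads produce forecasts at different horizons. Layer sizes, expert counts, and routing details are illustrative assumptions, not the paper's actual configuration.

```python
# Sketch only: point-wise tokens, a sparse MoE feed-forward block, and
# multi-resolution forecasting heads. Hyperparameters are made up.
import torch
import torch.nn as nn


class SparseMoEFFN(nn.Module):
    """Feed-forward block that routes each token to its top-k experts."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        logits = self.router(x)                               # (B, T, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)        # pick k experts per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                       # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out


class TinyTimeMoE(nn.Module):
    """Point-wise tokenization -> decoder-style MoE block -> multi-resolution heads."""

    def __init__(self, d_model=64, n_experts=4, horizons=(1, 8, 32)):
        super().__init__()
        self.embed = nn.Linear(1, d_model)                    # each time point is one token
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.moe = SparseMoEFFN(d_model, 4 * d_model, n_experts)
        # One linear head per forecast horizon (multi-resolution prediction).
        self.heads = nn.ModuleDict({str(h): nn.Linear(d_model, h) for h in horizons})

    def forward(self, series: torch.Tensor, horizon: int) -> torch.Tensor:
        # series: (batch, length) univariate history
        x = self.embed(series.unsqueeze(-1))                  # (B, T, d_model)
        causal = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool), 1)
        h, _ = self.attn(x, x, x, attn_mask=causal)           # causal self-attention
        h = x + h
        h = h + self.moe(h)                                   # sparse MoE feed-forward
        return self.heads[str(horizon)](h[:, -1])             # forecast from last token


model = TinyTimeMoE()
history = torch.randn(2, 128)                                 # two toy series of length 128
print(model(history, horizon=8).shape)                        # torch.Size([2, 8])
```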

-----

Key Insights from this Paper 💡:

• Sparse mixture-of-experts architecture enhances computational efficiency while maintaining high model capacity

• Scaling laws apply to time series forecasting, with larger models and more training data improving performance

• Multi-resolution forecasting enables flexible prediction horizons

• Sparsely activated design allows effective scaling without a significant increase in inference costs (see the back-of-the-envelope sketch below)
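
As a rough illustration of that last point: with top-k routing, the parameters touched per token scale with the number of activated experts, not the total. The expert count, top-k, and shared/expert parameter split below are assumptions chosen only so the totals land near the reported 2.4B total / 1.1B activated; they are not the paper's configuration.

```python
# Back-of-the-envelope view of why sparse activation keeps inference cheap.
def activated_fraction(n_experts: int, top_k: int,
                       expert_params: int, shared_params: int) -> float:
    """Fraction of total parameters touched per token under top-k routing."""
    total = shared_params + n_experts * expert_params
    active = shared_params + top_k * expert_params
    return active / total


# Hypothetical split: 600M shared weights, 8 experts of 225M each, 2 active per token.
# Total = 2.4B, activated ~= 1.05B, close to the reported 2.4B / 1.1B scale.
print(f"{activated_fraction(n_experts=8, top_k=2,
                            expert_params=225_000_000,
                            shared_params=600_000_000):.2f}")
# -> 0.44: well under half of the weights participate in any single prediction
```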

-----

Results 📊:

• Outperforms state-of-the-art models in zero-shot and fine-tuned scenarios

• 25% average MSE reduction in in-distribution forecasting

• Maintains superior performance with reduced computational costs compared to dense models
