Existing pre-trained time series foundation models lack scale and efficiency, hindering the development of larger, more capable forecasting models for real-world applications.
TIME-MOE scales time series forecasting to billion-parameter models, improving accuracy while reducing computational costs by activating only the experts each prediction needs.
• 23% average MSE reduction in zero-shot forecasting across 6 benchmarks
-----
📚 https://arxiv.org/pdf/2409.16040
Solution in this Paper 🛠️:
• TIME-MOE: A decoder-only transformer with mixture-of-experts layers
• Point-wise tokenization of input time series
• Multi-resolution forecasting heads for flexible prediction horizons
• Pre-training on Time-300B dataset (over 300 billion time points across 9 domains)
• Models scaled up to 2.4 billion parameters (1.1 billion activated)
• Sparse architecture activates only a subset of expert networks for each prediction (see the sketch below)
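
The bullets above describe the architecture at a high level; here is a minimal PyTorch sketch of the same ideas: point-wise tokens, a decoder-style block whose feed-forward is a sparse mixture-of-experts, and one output head per forecast resolution. All names and sizes here (`d_model`, `num_experts`, `top_k`, the `horizons` tuple) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a Time-MoE-style forecaster (illustrative; not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoEFeedForward(nn.Module):
    """Feed-forward layer that routes each token to its top-k experts only."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d_model)
        scores = self.gate(x)                             # (batch, seq, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)    # keep only the top-k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e                # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out


class TimeMoEStyleForecaster(nn.Module):
    """Point-wise tokens -> decoder-style block with MoE FFN -> multi-resolution heads."""

    def __init__(self, d_model=64, n_heads=4, d_ff=128, num_experts=8, top_k=2,
                 horizons=(1, 8, 32)):
        super().__init__()
        self.embed = nn.Linear(1, d_model)                # each time point is one token
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.moe = SparseMoEFeedForward(d_model, d_ff, num_experts, top_k)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # One head per forecast resolution: head `h` predicts h future points at once.
        self.heads = nn.ModuleDict({str(h): nn.Linear(d_model, h) for h in horizons})

    def forward(self, series: torch.Tensor, horizon: int) -> torch.Tensor:
        # series: (batch, context_length); the causal mask keeps the model decoder-only.
        x = self.embed(series.unsqueeze(-1))
        t = x.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf"), device=x.device), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=causal)
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.moe(x))
        last = x[:, -1]                                   # forecast from the final position
        return self.heads[str(horizon)](last)             # (batch, horizon)


# Usage: a 96-step context, an 8-step forecast.
model = TimeMoEStyleForecaster()
context = torch.randn(4, 96)
print(model(context, horizon=8).shape)  # torch.Size([4, 8])
```

Per token, only `top_k` of the `num_experts` feed-forward experts run; that routing is the mechanism behind the 2.4-billion-total / 1.1-billion-activated parameter split quoted above.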
-----
Key Insights from this Paper 💡:
• Sparse mixture-of-experts architecture enhances computational efficiency while maintaining high model capacity
• Scaling laws apply to time series forecasting, with larger models and more training data improving performance
• Multi-resolution forecasting heads enable flexible prediction horizons (see the sketch after this list)
• Sparsely activated design allows effective scaling without significant increase in inference costs
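
One way fixed-resolution heads translate into arbitrary horizons is a greedy schedule that covers the target length with the largest heads first. The head sizes and the `plan_horizon` helper below are hypothetical; the sketch only illustrates the flexibility claim, not the paper's exact rollout.

```python
# Greedy horizon planning over fixed-resolution heads (hypothetical helper).
def plan_horizon(target: int, head_sizes=(32, 8, 1)) -> list[int]:
    """Return the sequence of head calls whose outputs concatenate to `target` steps."""
    plan = []
    remaining = target
    for size in sorted(head_sizes, reverse=True):
        while remaining >= size:
            plan.append(size)
            remaining -= size
    return plan


print(plan_horizon(45))  # [32, 8, 1, 1, 1, 1, 1] -> 45 future points in 7 head calls
```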
-----
Results 📊:
• Outperforms state-of-the-art models in zero-shot and fine-tuned scenarios
• 25% average MSE reduction in in-distribution forecasting
• Maintains superior performance with reduced computational costs compared to dense models