Predict model performance on new datasets without extensive training runs
Loss relationships reveal consistent patterns between different data distributions, with simple shifted power laws connecting performance across diverse training datasets.
This paper introduces a method to predict model performance across different datasets by establishing mathematical relationships between losses. It shows how training loss on one dataset can predict performance on another, enabling accurate scaling-law predictions from only a handful of training runs.
-----
https://arxiv.org/abs/2411.12925
🤔 Original Problem:
Scaling laws reliably predict model performance when training and evaluation stay on a single dataset, but it is unclear how those predictions transfer when the training dataset changes, or how training loss maps to downstream task performance.
-----
🔧 Solution in this Paper:
→ The paper introduces "loss-to-loss prediction" - a way to translate scaling laws between datasets.
→ It establishes three key relationships: train-to-train (comparing training losses across datasets), train-to-test (predicting downstream performance), and test-to-test (comparing downstream performance across models).
→ The relationships follow shifted power laws, so accurate fits need only a handful of data points (see the minimal fitting sketch below).
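
Here is a minimal sketch of what fitting such a train-to-train curve could look like in Python, assuming the shifted power law takes the form loss_B ≈ K · (loss_A − E_A)^κ + E_B. The parameterization, the `scipy` workflow, and the toy loss values below are illustrative assumptions, not the paper's code:

```python
import numpy as np
from scipy.optimize import curve_fit

def shifted_power_law(loss_a, K, kappa, E_a, E_b):
    """Map training loss on dataset A to predicted loss on dataset B.
    Assumed form: loss_b = K * (loss_a - E_a)**kappa + E_b, where E_a and
    E_b play the role of irreducible-loss offsets. Clipping keeps the base
    positive while the optimizer explores parameter space."""
    return K * np.clip(loss_a - E_a, 1e-9, None) ** kappa + E_b

# Toy losses for a handful of small models trained on both datasets.
# These values are made up purely to demonstrate the fitting workflow.
loss_a = np.array([3.90, 3.50, 3.20, 3.00, 2.85, 2.74, 2.66, 2.60])
loss_b = np.array([4.30, 3.80, 3.45, 3.20, 3.05, 2.92, 2.83, 2.76])

params, _ = curve_fit(
    shifted_power_law, loss_a, loss_b,
    p0=[1.0, 1.0, 1.5, 1.5],  # start near an identity-like mapping
    maxfev=10_000,
)

# Predict dataset-B loss for a larger model from its dataset-A loss alone.
print(shifted_power_law(2.40, *params))
```

Since the paper reports the same shifted-power-law shape for all three relationship types, the identical fitting recipe would apply to the train-to-test and test-to-test cases as well.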
-----
💡 Key Insights:
→ Different datasets lead to different returns to scale, but compute-optimal model size remains consistent
→ Downstream task performance improves smoothly with scale, with no sudden emergent jumps
→ Including broader training data that covers test domains is crucial for optimal performance
-----
📊 Results:
→ Predictions extrapolate well even at 20x the largest FLOP budget used to fit curves
→ Train-to-train prediction achieves R² of 0.990 compared to baseline 0.961
→ Accurate predictions require only 8 models trained on the new dataset, versus 88 models for the traditional approach
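
To see why so few runs suffice, one can fit the curve on only the cheapest models and score its prediction on held-out larger ones. The sketch below reuses the same illustrative toy values as above (not the paper's data) to compute a held-out R² in that style:

```python
import numpy as np
from scipy.optimize import curve_fit

def shifted_power_law(loss_a, K, kappa, E_a, E_b):
    # Same assumed form as in the earlier sketch.
    return K * np.clip(loss_a - E_a, 1e-9, None) ** kappa + E_b

# Illustrative toy losses for eight models of increasing compute.
loss_a = np.array([3.90, 3.50, 3.20, 3.00, 2.85, 2.74, 2.66, 2.60])
loss_b = np.array([4.30, 3.80, 3.45, 3.20, 3.05, 2.92, 2.83, 2.76])

# Fit on the five cheapest models only; hold out the three largest.
params, _ = curve_fit(shifted_power_law, loss_a[:5], loss_b[:5],
                      p0=[1.0, 1.0, 1.5, 1.5], maxfev=10_000)

# Score extrapolation quality on the held-out models with R^2.
pred = shifted_power_law(loss_a[5:], *params)
ss_res = np.sum((loss_b[5:] - pred) ** 2)
ss_tot = np.sum((loss_b[5:] - loss_b[5:].mean()) ** 2)
print("held-out R^2:", 1 - ss_res / ss_tot)
```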