Predict model performance on new datasets without extensive training runs
Loss relationships reveal consistent patterns between different data distributions, with simple shifted power laws connecting performance across diverse training datasets.
This paper introduces a method to predict model performance across different datasets by establishing mathematical relationships between losses. It shows how training loss on one dataset can predict performance on another, enabling accurate scaling-law predictions from only a handful of training runs.
-----
https://arxiv.org/abs/2411.12925
🤔 Original Problem:
Scaling laws reliably predict model performance when training and evaluation stay on a single dataset, but it is unclear how those predictions transfer when the training dataset changes, or how training loss maps to downstream task performance.
-----
🔧 Solution in this Paper:
→ The paper introduces "loss-to-loss prediction" - a way to translate scaling laws between datasets.
→ It establishes three key relationships: train-to-train (comparing training losses across datasets), train-to-test (predicting downstream performance), and test-to-test (comparing downstream performance across models).
→ The relationships follow shifted power laws, so accurate fits need only a handful of data points (see the minimal fitting sketch below).
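
Here is a minimal sketch of what fitting such a train-to-train curve could look like in Python, assuming the shifted power law takes the form loss_B ≈ K · (loss_A − E_A)^κ + E_B. The parameterization, the `scipy` workflow, and the toy loss values below are illustrative assumptions, not the paper's code:

```python
import numpy as np
from scipy.optimize import curve_fit

def shifted_power_law(loss_a, K, kappa, E_a, E_b):
    """Map training loss on dataset A to predicted loss on dataset B.
    Assumed form: loss_b = K * (loss_a - E_a)**kappa + E_b, where E_a and
    E_b play the role of irreducible-loss offsets. Clipping keeps the base
    positive while the optimizer explores parameter space."""
    return K * np.clip(loss_a - E_a, 1e-9, None) ** kappa + E_b

# Toy losses for a handful of small models trained on both datasets.
# These values are made up purely to demonstrate the fitting workflow.
loss_a = np.array([3.90, 3.50, 3.20, 3.00, 2.85, 2.74, 2.66, 2.60])
loss_b = np.array([4.30, 3.80, 3.45, 3.20, 3.05, 2.92, 2.83, 2.76])

params, _ = curve_fit(
    shifted_power_law, loss_a, loss_b,
    p0=[1.0, 1.0, 1.5, 1.5],  # start near an identity-like mapping
    maxfev=10_000,
)

# Predict dataset-B loss for a larger model from its dataset-A loss alone.
print(shifted_power_law(2.40, *params))
```

Since the paper reports the same shifted-power-law shape for all three relationship types, the identical fitting recipe would apply to the train-to-test and test-to-test cases as well.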
-----
💡 Key Insights:
→ Different datasets lead to different returns to scale, but compute-optimal model size remains consistent
→ Downstream task performance improves smoothly with scale, with no sudden emergent jumps
→ Including broader training data that covers test domains is crucial for optimal performance
-----
📊 Results:
→ Predictions extrapolate well even at 20x the largest FLOP budget used to fit curves
→ Train-to-train prediction achieves R² of 0.990 compared to baseline 0.961
→ Accurate predictions require only 8 models trained on the new dataset, versus 88 models for the traditional approach
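
To see why so few runs suffice, one can fit the curve on only the cheapest models and score its prediction on held-out larger ones. The sketch below reuses the same illustrative toy values as above (not the paper's data) to compute a held-out R² in that style:

```python
import numpy as np
from scipy.optimize import curve_fit

def shifted_power_law(loss_a, K, kappa, E_a, E_b):
    # Same assumed form as in the earlier sketch.
    return K * np.clip(loss_a - E_a, 1e-9, None) ** kappa + E_b

# Illustrative toy losses for eight models of increasing compute.
loss_a = np.array([3.90, 3.50, 3.20, 3.00, 2.85, 2.74, 2.66, 2.60])
loss_b = np.array([4.30, 3.80, 3.45, 3.20, 3.05, 2.92, 2.83, 2.76])

# Fit on the five cheapest models only; hold out the three largest.
params, _ = curve_fit(shifted_power_law, loss_a[:5], loss_b[:5],
                      p0=[1.0, 1.0, 1.5, 1.5], maxfev=10_000)

# Score extrapolation quality on the held-out models with R^2.
pred = shifted_power_law(loss_a[5:], *params)
ss_res = np.sum((loss_b[5:] - pred) ** 2)
ss_tot = np.sum((loss_b[5:] - loss_b[5:].mean()) ** 2)
print("held-out R^2:", 1 - ss_res / ss_tot)
```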