The laws of physics reveal how Next-token Prediction (NTP) models actually learn and why they need so much energy.
Information conservation explains why bigger models need more training data
https://arxiv.org/abs/2411.00660
🎯 Original Problem:
Current auto-regressive models trained with Next-token Prediction (NTP) require massive datasets and compute, yet we lack a fundamental understanding of why this process gives rise to intelligence. Uncovering the physics behind NTP is needed to optimize model training.
-----
🔧 Solution in this Paper:
→ Introduced the First Law of Information Capacity (IC-1), ηN = D(H − L): intelligence emerges as information is transferred from the dataset into model parameters
→ Proposed the Second Law of Information Capacity (IC-2), E0 = ηN(kB·T·ln 2), establishing the minimum energy required for training (kB·T·ln 2 is the Landauer limit per bit)
→ Showed that model training is essentially compression of the dataset's information, with the information capacity η measuring compression efficiency
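As a minimal sketch, IC-1 can be rearranged to estimate a model's information capacity η from training quantities. All numbers below are hypothetical placeholders, not values from the paper:

```python
# First Law of Information Capacity (IC-1): eta * N = D * (H - L)
# Hypothetical example values (not from the paper):
N = 1e10   # model parameters
D = 2e10   # training tokens
H = 3.1    # estimated dataset entropy per token (bits)
L = 3.0    # final training loss per token (bits)

# Rearranged: eta = D * (H - L) / N
eta = D * (H - L) / N  # information capacity (compression efficiency)
print(f"eta = {eta:.3f}")
```

With these placeholder values η comes out near 0.2, inside the 0.115-0.268 range the post reports for current models.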
-----
💡 Key Insights:
→ Model training follows an information conservation law - information is not lost, only transferred from data to parameters
→ Dataset entropy H can be estimated from the initial (untrained) model's loss
→ Information capacity η typically falls between 0.115 and 0.268 for current models
→ Model size N and training-token count D should scale in direct proportion
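The Second Law (IC-2) ties these quantities to a thermodynamic floor on training energy. A minimal sketch, again with hypothetical η and N values:

```python
import math

K_B = 1.380649e-23  # Boltzmann constant, J/K
T = 300.0           # assumed ambient temperature, K

# IC-2: E0 = eta * N * (k_B * T * ln 2), a lower bound on the
# energy needed to write eta*N bits into the model parameters.
eta = 0.2  # hypothetical information capacity
N = 7e9    # hypothetical parameter count

e0 = eta * N * K_B * T * math.log(2)
print(f"E0 >= {e0:.2e} J")
```

The bound lands around 4e-12 J for these values: vanishingly small next to real training budgets, which is the point - current hardware operates far above the thermodynamic minimum.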
-----
📊 Results:
→ Validated the theoretical framework against OpenAI's Scaling Laws
→ Showed compatibility with the Knowledge Capacity Scaling Laws
→ Demonstrated applicability across auto-regressive architectures in general