Pre-training agents exhibit predictable scaling patterns, enabling optimal resource allocation.
Scaling laws in embodied AI follow power laws, as they do for LLMs, but with architecture-dependent coefficients.
Different architectures therefore lead to different optimal scaling strategies for embodied AI.
https://arxiv.org/abs/2411.04434
Original Problem 🤔:
Understanding how scaling laws for pre-training agents and world models compare to those found in LLMs. Prior work showed that bigger models perform better, but offered no precise guidance on optimal model sizing and compute allocation.
-----
Solution in this Paper 🛠️:
→ Analyzed scaling behavior in world modeling and behavior cloning tasks using transformer architectures
→ Used two architectures: tokenized (VQGAN-based) and CNN-based for processing image observations
→ Trained models on 8.6 years of human gameplay data from a complex multiplayer game
→ Focused on pre-training loss rather than downstream performance for cleaner analysis
→ Employed power-law relationships to predict optimal model and dataset sizes for a given compute budget (a minimal allocation sketch follows this list)
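To make the compute-allocation step concrete, here is a minimal sketch of a Chinchilla-style split of a compute budget C into model size N and dataset size D. The C ≈ 6·N·D FLOPs approximation and the equal-constant choice k_N = k_D are illustrative assumptions from the broader scaling-law literature, not values taken from this paper:

```python
# Minimal sketch of Chinchilla-style compute allocation.
# Assumes N_opt ∝ C^a and D_opt ∝ C^b with a + b ≈ 1, plus the common
# approximation C ≈ 6 * N * D for transformer training FLOPs.

def optimal_allocation(compute_flops: float, a: float, b: float):
    """Split a compute budget C into model size N and dataset size D.

    With N = k_N * C**a, D = k_D * C**b and C = 6*N*D, the constants must
    satisfy k_N * k_D = 1/6 when a + b = 1. For illustration we take
    k_N = k_D = (1/6) ** 0.5 (hypothetical; in practice fit from data).
    """
    k = (1 / 6) ** 0.5
    n_opt = k * compute_flops ** a
    d_opt = k * compute_flops ** b
    return n_opt, d_opt

# Example: 1e21 FLOPs under LLM-like coefficients (a ≈ 0.5, b ≈ 0.5)
n, d = optimal_allocation(1e21, a=0.5, b=0.5)
print(f"N_opt ≈ {n:.3g} params, D_opt ≈ {d:.3g} tokens")
```

Real fitted constants also set the data-to-parameters ratio, which the symmetric choice above deliberately ignores; only the exponents a and b are the point here.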
-----
Key Insights 💡:
→ World modeling with 256 tokens/image shows similar scaling coefficients (≈0.5) to LLMs
→ Increasing tokens to 540/image shifts the optimal trade-off toward model size (0.62) and away from dataset size (0.37)
→ Behavior cloning with tokens heavily favors dataset size (0.68) over model size (0.32)
→ CNN-based continuous embeddings shift coefficients back toward model size (0.66); the sketch after this list compares these splits numerically
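What these coefficients mean in practice: if N_opt ∝ C^a and D_opt ∝ C^b, the exponents dictate how each 10× of extra compute should be split between model and data. The sketch below uses the coefficient pairs listed above; the BC-CNN dataset exponent is inferred as 1 − 0.66, since only the model-size coefficient is given.

```python
# How should extra compute be spent? If N_opt ∝ C^a and D_opt ∝ C^b,
# a 10x compute increase multiplies optimal model size by 10**a and
# optimal dataset size by 10**b.

regimes = {
    "World model, 256 tok/img": (0.49, 0.51),  # LLM-like split
    "World model, 540 tok/img": (0.62, 0.37),  # favors model size
    "BC-Token":                 (0.32, 0.68),  # favors dataset size
    "BC-CNN":                   (0.66, 0.34),  # data exponent assumed 1 - a
}

for name, (a, b) in regimes.items():
    print(f"{name}: grow model {10**a:.1f}x, data {10**b:.1f}x per 10x compute")
```

Under BC-Token, a 10× compute increase buys roughly a 2× larger model but ~5× more data; under BC-CNN the split nearly inverts.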
-----
Results 📊:
→ Validated the fitted scaling laws with an 894M-parameter model, which showed good agreement with predictions (a sketch of this fit-and-extrapolate procedure follows the list below)
→ For world modeling, optimal coefficients match those of LLMs: 0.49 for model size, 0.51 for dataset size
→ Behavior cloning shows distinct patterns: BC-Token favors dataset size (0.68), while BC-CNN favors model size (0.66)
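For intuition on how such coefficients are estimated and then checked against a larger run, here is a minimal sketch: fit a line in log-log space over compute-optimal (C, N_opt) frontier points, then extrapolate. The data below are synthetic and purely illustrative; the paper fits frontiers from its actual training runs.

```python
import numpy as np

# Sketch: estimating a model-size scaling exponent like 0.49.
# Fit log(N_opt) = a * log(C) + const over compute-optimal frontier
# points, then extrapolate N_opt to a larger budget to test the fit.

rng = np.random.default_rng(0)
C = np.logspace(18, 22, 9)                                 # compute budgets (FLOPs)
N_opt = 0.4 * C ** 0.49 * rng.lognormal(0, 0.05, C.size)   # synthetic frontier

a, log_k = np.polyfit(np.log(C), np.log(N_opt), 1)
print(f"fitted model-size exponent a ≈ {a:.2f}")           # ≈ 0.49

# Extrapolate to a 10x larger budget, analogous to checking the
# prediction with a single larger validation run.
C_big = 1e23
print(f"predicted N_opt at C = {C_big:.0e}: {np.exp(log_k) * C_big ** a:.3g}")
```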