Pre-training agents exhibit predictable scaling patterns, enabling optimal resource allocation.
Scaling laws in embodied AI follow power laws, as they do for LLMs, but with architecture-dependent coefficients.
Different architectures therefore lead to different optimal scaling strategies for embodied AI.
https://arxiv.org/abs/2411.04434
Original Problem 🤔:
Understanding how scaling laws for pre-training agents and world models compare to those found in LLMs. Prior work showed that bigger models perform better, but offered no precise guidance on optimal model sizing and compute allocation.
-----
Solution in this Paper 🛠️:
→ Analyzed scaling behavior in world modeling and behavior cloning tasks using transformer architectures
→ Used two architectures: tokenized (VQGAN-based) and CNN-based for processing image observations
→ Trained models on 8.6 years of human gameplay data from a complex multiplayer game
→ Focused on pre-training loss rather than downstream performance for cleaner analysis
→ Employed power-law relationships to predict optimal model and dataset sizes for a given compute budget (a minimal allocation sketch follows this list)
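To make the compute-allocation step concrete, here is a minimal sketch of a Chinchilla-style split of a compute budget C into model size N and dataset size D. The C ≈ 6·N·D FLOPs approximation and the equal-constant choice k_N = k_D are illustrative assumptions from the broader scaling-law literature, not values taken from this paper:

```python
# Minimal sketch of Chinchilla-style compute allocation.
# Assumes N_opt ∝ C^a and D_opt ∝ C^b with a + b ≈ 1, plus the common
# approximation C ≈ 6 * N * D for transformer training FLOPs.

def optimal_allocation(compute_flops: float, a: float, b: float):
    """Split a compute budget C into model size N and dataset size D.

    With N = k_N * C**a, D = k_D * C**b and C = 6*N*D, the constants must
    satisfy k_N * k_D = 1/6 when a + b = 1. For illustration we take
    k_N = k_D = (1/6) ** 0.5 (hypothetical; in practice fit from data).
    """
    k = (1 / 6) ** 0.5
    n_opt = k * compute_flops ** a
    d_opt = k * compute_flops ** b
    return n_opt, d_opt

# Example: 1e21 FLOPs under LLM-like coefficients (a ≈ 0.5, b ≈ 0.5)
n, d = optimal_allocation(1e21, a=0.5, b=0.5)
print(f"N_opt ≈ {n:.3g} params, D_opt ≈ {d:.3g} tokens")
```

Real fitted constants also set the data-to-parameters ratio, which the symmetric choice above deliberately ignores; only the exponents a and b are the point here.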
-----
Key Insights 💡:
→ World modeling with 256 tokens/image shows similar scaling coefficients (≈0.5) to LLMs
→ Increasing tokens to 540/image shifts the optimal trade-off toward model size (0.62) and away from dataset size (0.37)
→ Behavior cloning with tokens heavily favors dataset size (0.68) over model size (0.32)
→ CNN-based continuous embeddings shift coefficients back toward model size (0.66); the sketch after this list compares these splits numerically
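What these coefficients mean in practice: if N_opt ∝ C^a and D_opt ∝ C^b, the exponents dictate how each 10× of extra compute should be split between model and data. The sketch below uses the coefficient pairs listed above; the BC-CNN dataset exponent is inferred as 1 − 0.66, since only the model-size coefficient is given.

```python
# How should extra compute be spent? If N_opt ∝ C^a and D_opt ∝ C^b,
# a 10x compute increase multiplies optimal model size by 10**a and
# optimal dataset size by 10**b.

regimes = {
    "World model, 256 tok/img": (0.49, 0.51),  # LLM-like split
    "World model, 540 tok/img": (0.62, 0.37),  # favors model size
    "BC-Token":                 (0.32, 0.68),  # favors dataset size
    "BC-CNN":                   (0.66, 0.34),  # data exponent assumed 1 - a
}

for name, (a, b) in regimes.items():
    print(f"{name}: grow model {10**a:.1f}x, data {10**b:.1f}x per 10x compute")
```

Under BC-Token, a 10× compute increase buys roughly a 2× larger model but ~5× more data; under BC-CNN the split nearly inverts.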
-----
Results 📊:
→ Validated the fitted scaling laws with an 894M-parameter model, which showed good agreement with predictions (a sketch of this fit-and-extrapolate procedure follows the list below)
→ For world modeling, optimal coefficients match those of LLMs: 0.49 for model size, 0.51 for dataset size
→ Behavior cloning shows distinct patterns: BC-Token favors dataset size (0.68), while BC-CNN favors model size (0.66)
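For intuition on how such coefficients are estimated and then checked against a larger run, here is a minimal sketch: fit a line in log-log space over compute-optimal (C, N_opt) frontier points, then extrapolate. The data below are synthetic and purely illustrative; the paper fits frontiers from its actual training runs.

```python
import numpy as np

# Sketch: estimating a model-size scaling exponent like 0.49.
# Fit log(N_opt) = a * log(C) + const over compute-optimal frontier
# points, then extrapolate N_opt to a larger budget to test the fit.

rng = np.random.default_rng(0)
C = np.logspace(18, 22, 9)                                 # compute budgets (FLOPs)
N_opt = 0.4 * C ** 0.49 * rng.lognormal(0, 0.05, C.size)   # synthetic frontier

a, log_k = np.polyfit(np.log(C), np.log(N_opt), 1)
print(f"fitted model-size exponent a ≈ {a:.2f}")           # ≈ 0.49

# Extrapolate to a 10x larger budget, analogous to checking the
# prediction with a single larger validation run.
C_big = 1e23
print(f"predicted N_opt at C = {C_big:.0e}: {np.exp(log_k) * C_big ** a:.3g}")
```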