Scaling RLHF isn't like pretraining: more compute doesn't guarantee better results.
This paper systematically investigates how RLHF for LLMs scales across its key components, including policy and reward model size, data composition, and inference budget.
https://arxiv.org/abs/2412.06000
🤔 Original Problem:
→ RLHF is crucial for aligning LLMs with human intentions, but its scaling properties remain largely unknown, unlike the well-studied scaling behaviors of pretraining and supervised fine-tuning.
-----
🔍 Solution in this Paper:
→ The researchers analyzed over 20 models with varying reward and policy model sizes (9B to 200B parameters) across different dataset sizes.
→ They examined how sampling more responses per prompt during policy training impacts performance (see the sketch after this list).
→ They investigated the relationship between reward model size and policy model effectiveness.
→ They studied data diversity's impact on reward model performance.
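Below is a minimal Python sketch of the knob being varied: sample N responses per prompt, score them with a reward model, and form per-prompt advantages. The `sample_response` and `reward_model` stubs and the group-mean baseline are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the "N responses per prompt" setting studied in the paper.
# sample_response and reward_model are hypothetical stand-ins for the real
# policy decoder and reward model; the group-mean baseline is one common
# choice of advantage estimator, not necessarily the paper's exact one.
import random
from typing import Callable, List

def sample_response(prompt: str) -> str:
    # Placeholder: a real policy LLM would decode a response here.
    return f"response-{random.randint(0, 9)} to {prompt!r}"

def reward_model(prompt: str, response: str) -> float:
    # Placeholder: a real reward model would score (prompt, response).
    return random.gauss(0.0, 1.0)

def collect_rollouts(prompts: List[str],
                     n_responses_per_prompt: int,
                     sampler: Callable[[str], str] = sample_response,
                     scorer: Callable[[str, str], float] = reward_model):
    """For each prompt, sample N responses, score them, and compute
    group-relative advantages (reward minus the per-prompt mean)."""
    rollouts = []
    for prompt in prompts:
        responses = [sampler(prompt) for _ in range(n_responses_per_prompt)]
        rewards = [scorer(prompt, r) for r in responses]
        baseline = sum(rewards) / len(rewards)
        for resp, rew in zip(responses, rewards):
            rollouts.append({"prompt": prompt,
                             "response": resp,
                             "advantage": rew - baseline})
    return rollouts

if __name__ == "__main__":
    batch = collect_rollouts(["Explain RLHF in one sentence."],
                             n_responses_per_prompt=4)
    for item in batch:
        print(round(item["advantage"], 3), item["response"])
```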
-----
💡 Key Insights:
→ Sampling more responses per prompt improves performance at first but plateaus quickly (see the toy illustration after this list)
→ Larger reward models offer modest gains in policy training
→ Larger policy models gain less from RLHF when trained against a fixed-size reward model
→ Performance improves rapidly early in training but shows diminishing returns with additional data
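To build intuition for why gains from extra responses flatten out, here is a toy simulation, which is my illustrative assumption rather than the paper's data: if per-response rewards were i.i.d. Gaussian, the expected best-of-N reward grows only on the order of sqrt(2 ln N), so each doubling of N buys less.

```python
# Toy illustration (not the paper's data): with i.i.d. standard-normal
# per-response rewards, E[best of N] saturates quickly, so each doubling
# of N yields a smaller and smaller improvement.
import random

def expected_best_of_n(n: int, trials: int = 20_000) -> float:
    """Monte Carlo estimate of E[max of n standard-normal reward samples]."""
    return sum(max(random.gauss(0.0, 1.0) for _ in range(n))
               for _ in range(trials)) / trials

if __name__ == "__main__":
    prev = 0.0
    for n in (1, 2, 4, 8, 16, 32):
        est = expected_best_of_n(n)
        print(f"N={n:>2}  E[best reward]≈{est:.3f}  "
              f"gain over previous N: {est - prev:+.3f}")
        prev = est
```

Running it shows the per-doubling gain shrinking steadily, mirroring the plateau reported above.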
-----
📊 Results:
→ Performance gain drops from 4.4% to 1.9% as policy model size grows from 9B to 200B
→ Increasing prompt diversity proves more effective than generating multiple responses per prompt
→ Process supervision reward models perform better on the tasks they target but struggle to generalize (a small sketch of the distinction follows)
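For readers unfamiliar with the distinction, here is a hedged sketch of outcome vs. process supervision. `outcome_rm` and `process_rm` are hypothetical scorers, not the paper's models: an outcome reward model returns one score for the final answer, while a process reward model scores each intermediate reasoning step.

```python
# Sketch of the outcome- vs process-supervision distinction referenced above.
# Both scorers below are hypothetical placeholders, not the paper's models.
from typing import List

def outcome_rm(prompt: str, final_answer: str) -> float:
    # Placeholder: one scalar judging only the final answer.
    return 1.0 if "42" in final_answer else 0.0

def process_rm(prompt: str, steps: List[str]) -> List[float]:
    # Placeholder: one scalar per reasoning step (e.g. "is this step valid?").
    return [0.0 if "guess" in step else 1.0 for step in steps]

if __name__ == "__main__":
    prompt = "What is 6 * 7?"
    steps = ["6 * 7 = 6 * (5 + 2)", "= 30 + 12",
             "random guess: 40", "final answer: 42"]
    # The outcome RM gives full credit despite the flawed third step,
    # while the process RM flags it.
    print("outcome reward:", outcome_rm(prompt, steps[-1]))
    print("process rewards:", process_rm(prompt, steps))
```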