Scaling RLHF isn't like pretraining: more compute doesn't guarantee better results.
This paper systematically investigates how RLHF for LLMs scales across its key components, including policy and reward model size, data composition, and inference budget.
https://arxiv.org/abs/2412.06000
🤔 Original Problem:
→ RLHF is crucial for aligning LLMs with human intentions, but its scaling properties remain largely unknown, unlike the well-studied scaling behaviors of pretraining and supervised fine-tuning.
-----
🔍 Solution in this Paper:
→ The researchers analyzed over 20 models with varying reward and policy model sizes (9B to 200B parameters) across different dataset sizes.
→ They examined how sampling more responses per prompt during policy training impacts performance (see the sketch after this list).
→ They investigated the relationship between reward model size and policy model effectiveness.
→ They studied data diversity's impact on reward model performance.
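Below is a minimal Python sketch of the knob being varied: sample N responses per prompt, score them with a reward model, and form per-prompt advantages. The `sample_response` and `reward_model` stubs and the group-mean baseline are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the "N responses per prompt" setting studied in the paper.
# sample_response and reward_model are hypothetical stand-ins for the real
# policy decoder and reward model; the group-mean baseline is one common
# choice of advantage estimator, not necessarily the paper's exact one.
import random
from typing import Callable, List

def sample_response(prompt: str) -> str:
    # Placeholder: a real policy LLM would decode a response here.
    return f"response-{random.randint(0, 9)} to {prompt!r}"

def reward_model(prompt: str, response: str) -> float:
    # Placeholder: a real reward model would score (prompt, response).
    return random.gauss(0.0, 1.0)

def collect_rollouts(prompts: List[str],
                     n_responses_per_prompt: int,
                     sampler: Callable[[str], str] = sample_response,
                     scorer: Callable[[str, str], float] = reward_model):
    """For each prompt, sample N responses, score them, and compute
    group-relative advantages (reward minus the per-prompt mean)."""
    rollouts = []
    for prompt in prompts:
        responses = [sampler(prompt) for _ in range(n_responses_per_prompt)]
        rewards = [scorer(prompt, r) for r in responses]
        baseline = sum(rewards) / len(rewards)
        for resp, rew in zip(responses, rewards):
            rollouts.append({"prompt": prompt,
                             "response": resp,
                             "advantage": rew - baseline})
    return rollouts

if __name__ == "__main__":
    batch = collect_rollouts(["Explain RLHF in one sentence."],
                             n_responses_per_prompt=4)
    for item in batch:
        print(round(item["advantage"], 3), item["response"])
```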
-----
💡 Key Insights:
→ Sampling more responses per prompt improves performance at first but plateaus quickly (see the toy illustration after this list)
→ Larger reward models offer modest gains in policy training
→ Larger policy models gain less from RLHF when trained against a fixed-size reward model
→ Performance improves rapidly early in training but shows diminishing returns with additional data
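To build intuition for why gains from extra responses flatten out, here is a toy simulation, which is my illustrative assumption rather than the paper's data: if per-response rewards were i.i.d. Gaussian, the expected best-of-N reward grows only on the order of sqrt(2 ln N), so each doubling of N buys less.

```python
# Toy illustration (not the paper's data): with i.i.d. standard-normal
# per-response rewards, E[best of N] saturates quickly, so each doubling
# of N yields a smaller and smaller improvement.
import random

def expected_best_of_n(n: int, trials: int = 20_000) -> float:
    """Monte Carlo estimate of E[max of n standard-normal reward samples]."""
    return sum(max(random.gauss(0.0, 1.0) for _ in range(n))
               for _ in range(trials)) / trials

if __name__ == "__main__":
    prev = 0.0
    for n in (1, 2, 4, 8, 16, 32):
        est = expected_best_of_n(n)
        print(f"N={n:>2}  E[best reward]≈{est:.3f}  "
              f"gain over previous N: {est - prev:+.3f}")
        prev = est
```

Running it shows the per-doubling gain shrinking steadily, mirroring the plateau reported above.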
-----
📊 Results:
→ Performance gain drops from 4.4% to 1.9% as policy model size grows from 9B to 200B
→ Increasing prompt diversity proves more effective than generating multiple responses per prompt
→ Process supervision reward models perform better on the tasks they target but struggle to generalize (a small sketch of the distinction follows)
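For readers unfamiliar with the distinction, here is a hedged sketch of outcome vs. process supervision. `outcome_rm` and `process_rm` are hypothetical scorers, not the paper's models: an outcome reward model returns one score for the final answer, while a process reward model scores each intermediate reasoning step.

```python
# Sketch of the outcome- vs process-supervision distinction referenced above.
# Both scorers below are hypothetical placeholders, not the paper's models.
from typing import List

def outcome_rm(prompt: str, final_answer: str) -> float:
    # Placeholder: one scalar judging only the final answer.
    return 1.0 if "42" in final_answer else 0.0

def process_rm(prompt: str, steps: List[str]) -> List[float]:
    # Placeholder: one scalar per reasoning step (e.g. "is this step valid?").
    return [0.0 if "guess" in step else 1.0 for step in steps]

if __name__ == "__main__":
    prompt = "What is 6 * 7?"
    steps = ["6 * 7 = 6 * (5 + 2)", "= 30 + 12",
             "random guess: 40", "final answer: 42"]
    # The outcome RM gives full credit despite the flawed third step,
    # while the process RM flags it.
    print("outcome reward:", outcome_rm(prompt, steps[-1]))
    print("process rewards:", process_rm(prompt, steps))
```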