Efficient bandit-based approach reduces human annotation costs in LLM training
SEA (Sample-Efficient Alignment) uses Thompson sampling to align LLMs with minimal human feedback
https://arxiv.org/abs/2411.01493
🎯 Original Problem:
Aligning LLMs with human preferences requires massive amounts of human feedback data, which makes alignment expensive and time-consuming; current methods still need extensive human annotations before they align effectively.
-----
🔧 Solution in this Paper:
→ They frame LLM alignment as a contextual dueling bandits problem, where the model learns from pairwise comparisons of responses
This formulation points to two key properties needed for sample-efficient alignment:
→ Online interaction - allowing the agent to act with the latest learned policy and immediately improve from experience
→ Active exploration - strategically selecting actions that lead to maximal policy improvement
→ They introduce SEA (Sample-Efficient Alignment), which implements Thompson sampling with epistemic reward modeling
→ SEA maintains uncertainty-aware reward models and uses policy-guided search for efficient exploration (a minimal sketch of this selection step follows this list)
→ The system works in both online user feedback and crowdsourcing scenarios
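A minimal sketch of the Thompson-sampling selection step, assuming an ensemble of reward heads as the epistemic reward model. The names (`EpistemicRewardModel`, `thompson_sample_pair`), the feature-vector inputs, and the head count are illustrative assumptions, not the paper's exact architecture or code:

```python
import torch
import torch.nn as nn

class EpistemicRewardModel(nn.Module):
    """Ensemble of small reward heads over response features.
    Disagreement across heads stands in for epistemic uncertainty."""

    def __init__(self, feat_dim: int, n_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))
            for _ in range(n_heads)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (n_candidates, feat_dim) -> rewards: (n_heads, n_candidates)
        return torch.stack([h(feats).squeeze(-1) for h in self.heads])


def thompson_sample_pair(rm: EpistemicRewardModel, cand_feats: torch.Tensor):
    """Choose two candidate responses to send out for preference feedback.

    Each response is the argmax under an independently sampled ensemble
    head, so the pair exploits the current belief while still exploring
    wherever the heads disagree.
    """
    with torch.no_grad():
        rewards = rm(cand_feats)                     # (n_heads, n_candidates)
    i, j = torch.randint(0, rewards.shape[0], (2,)).tolist()
    first = int(rewards[i].argmax())
    second = int(rewards[j].argmax())
    if second == first:                              # make sure the pair is distinct
        second = int(rewards[j].topk(2).indices[1])
    return first, second
```

In the method itself the candidates would come from policy-guided search over the latest policy's generations; here they are abstracted into feature vectors.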
-----
💡 Key Insights:
→ Online interaction allows immediate policy improvement from the latest experience
→ Active exploration strategically selects actions for maximal learning
→ Thompson sampling naturally balances exploration vs exploitation (see the toy example after this list)
→ Mixed preference learning combines different alignment approaches
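To make the exploration vs exploitation point concrete, here is a standalone toy example of Thompson sampling on a Bernoulli bandit; it is illustrative only, not SEA's algorithm, and the arm probabilities are made up:

```python
import random

# Thompson sampling on a 3-armed Bernoulli bandit: each arm's unknown win
# probability gets a Beta posterior; sampling from the posterior and acting
# greedily on that sample balances exploration and exploitation automatically.
true_win_prob = [0.3, 0.5, 0.7]   # hidden from the agent
alpha = [1.0, 1.0, 1.0]           # Beta posterior: 1 + observed wins
beta = [1.0, 1.0, 1.0]            # Beta posterior: 1 + observed losses

for _ in range(1000):
    sampled = [random.betavariate(a, b) for a, b in zip(alpha, beta)]
    arm = sampled.index(max(sampled))      # greedy w.r.t. the sampled beliefs
    win = random.random() < true_win_prob[arm]
    alpha[arm] += win                      # posterior update
    beta[arm] += 1 - win

print("posterior means:", [round(a / (a + b), 2) for a, b in zip(alpha, beta)])
```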
-----
📊 Results:
→ Tested across three model scales: 1B, 2.8B, and 6.9B parameters
→ Evaluated with three preference learning algorithms: DPO, IPO, and SLiC (the DPO objective is sketched after this list)
→ Achieved higher win rates than baseline approaches
→ Significantly improved sample efficiency over recent active exploration methods
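For reference, the sketch below shows the standard DPO objective on a batch of preference pairs, since DPO is one of the plug-in preference learners evaluated; the function signature and tensor conventions are illustrative, not the paper's training code:

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """Standard DPO loss.

    Each argument is a 1-D tensor of summed token log-probabilities of the
    chosen / rejected response under the current policy or the frozen
    reference model; beta controls how far the policy may drift from the
    reference.
    """
    margins = beta * ((policy_logp_chosen - ref_logp_chosen)
                      - (policy_logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margins).mean()
```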