Dynamic rewarding allows LLMs to identify and fix their own alignment weaknesses.
DRPO (Dynamic Rewarding with Prompt Optimization) enables LLMs to self-align without expensive tuning: dynamic rewards steer a search over alignment prompts, making alignment more efficient and adaptable.
-----
https://arxiv.org/abs/2411.08733
🤖 Original Problem:
Traditional LLM alignment requires costly training and human annotations, while existing self-alignment methods still need expensive tuning or human oversight.
-----
🔧 Solution in this Paper:
→ DRPO introduces a search-based framework where LLMs optimize their own alignment instructions through iterative self-improvement
→ The core innovation is a dynamic rewarding mechanism that adapts evaluation criteria based on specific queries
→ DRPO uses beam search to optimize the system prompt and the in-context examples separately (see the sketch after this list)
→ For each query, relevant rewards are dynamically selected from a predefined set while maintaining flexibility to propose new ones
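Below is a minimal Python sketch of the kind of loop described here: beam search over system-prompt candidates, each scored by a dynamic reward. The function names (`llm_generate`, `llm_mutate`, `dynamic_reward`), the reward pool, and the random scoring are placeholder assumptions for illustration, not the paper's actual implementation.

```python
import random

# Placeholder stand-ins for LLM calls; a real setup would route these to an
# actual model API. All names, signatures, and scores here are illustrative.
def llm_generate(system_prompt: str, query: str) -> str:
    """Generate a response to `query` conditioned on `system_prompt`."""
    return f"response to '{query}' under a {len(system_prompt)}-char prompt"

def llm_mutate(prompt: str, feedback: str) -> str:
    """Propose a revised prompt from optimizer feedback (dummy edit)."""
    return f"{prompt} | revised per feedback: {feedback[:40]}"

def dynamic_reward(query: str, response: str) -> tuple[float, str]:
    """Stand-in for the LLM judge: pick query-relevant rewards, score the
    response against them, and return feedback for the next revision."""
    pool = ["helpfulness", "safety", "conciseness"]
    relevant = [r for r in pool if random.random() > 0.3] or ["helpfulness"]
    return random.uniform(0, 10), f"improve {', '.join(relevant)}"

def optimize_system_prompt(seed_prompt: str, queries: list[str],
                           beam_width: int = 2, iterations: int = 3) -> str:
    """Beam search over system-prompt candidates scored by dynamic rewards."""
    beam = [seed_prompt]
    for _ in range(iterations):
        # Expand: each beam member proposes a feedback-guided revision of itself.
        candidates = list(beam)
        for prompt in beam:
            _, feedback = dynamic_reward(queries[0], llm_generate(prompt, queries[0]))
            candidates.append(llm_mutate(prompt, feedback))
        # Score every candidate and keep only the top `beam_width` prompts.
        scored = []
        for prompt in candidates:
            scores = [dynamic_reward(q, llm_generate(prompt, q))[0] for q in queries]
            scored.append((sum(scores) / len(scores), prompt))
        beam = [p for _, p in sorted(scored, key=lambda c: c[0], reverse=True)[:beam_width]]
    return beam[0]

if __name__ == "__main__":
    print(optimize_system_prompt("You are a helpful assistant.",
                                 ["How do I reverse a list in Python?"]))
```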
-----
💡 Key Insights:
→ LLMs can achieve effective alignment through lightweight prompting without expensive tuning
→ Dynamic rewards outperform static reward functions by adapting to query context (see the judge-prompt sketch after this list)
→ Quality of in-context examples matters more than quantity
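As a rough illustration of query-dependent rewarding (the reward pool and judge-prompt wording below are assumptions, not text from the paper), a judge prompt can ask the evaluator to first select the criteria that matter for the query at hand, and optionally propose new ones, rather than applying one fixed rubric everywhere:

```python
# Illustrative only: the reward pool and judge-prompt wording are assumptions,
# not taken from the DRPO paper or its released prompts.
REWARD_POOL = {
    "helpfulness": "Does the response directly address the user's request?",
    "factuality": "Are the claims accurate and verifiable?",
    "safety": "Does the response avoid harmful or unethical content?",
    "conciseness": "Is the response free of unnecessary padding?",
}

def build_judge_prompt(query: str, response: str) -> str:
    """Build a judge prompt that selects query-relevant rewards before scoring,
    instead of applying a single static rubric to every query."""
    criteria = "\n".join(f"- {name}: {desc}" for name, desc in REWARD_POOL.items())
    return (
        "You are evaluating an assistant's response.\n"
        f"Candidate reward criteria:\n{criteria}\n"
        "1. Select only the criteria relevant to the query below; propose a new one if none fit.\n"
        "2. Score the response 1-10 on each selected criterion and explain what to improve.\n\n"
        f"Query: {query}\nResponse: {response}"
    )

if __name__ == "__main__":
    print(build_judge_prompt("Summarize the plot of Hamlet.",
                             "Hamlet is a tragedy in which the prince of Denmark..."))
```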
-----
📊 Results:
→ DRPO-enhanced base models outperformed their SFT/RLHF-tuned counterparts across 8 LLMs
→ Using just 2 optimized examples, DRPO achieved better results than methods using 3+ examples
→ Achieved a 4.06 average score on Mistral-7B, compared to 3.66 for the instruction-tuned version