Inverse reinforcement learning (IRL) fine-tuning enhances LLM performance and generation diversity beyond traditional maximum likelihood estimation (MLE) approaches.
https://arxiv.org/abs/2409.01369
Original Problem 🔍:
LLM fine-tuning relies heavily on MLE for next-token prediction, which may not fully exploit the sequential structure of language generation.
-----
Solution in this Paper 🛠️:
• Reformulates inverse soft Q-learning as a temporal-difference-regularized extension of MLE (a rough sketch follows this list)
• Evaluates offline and online IRL algorithms, including IQLearn and GAIL
• Compares IRL approaches to MLE across multiple benchmarks and model sizes
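For intuition, here is a minimal PyTorch sketch of what a temporal-difference-regularized MLE objective could look like when a model's logits are read as soft Q-values. The function name, the squared-TD penalty, and the gamma/td_weight hyperparameters are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def td_regularized_mle_loss(logits, next_logits, target_ids, gamma=0.99, td_weight=0.1):
    """MLE (cross-entropy) loss plus a squared temporal-difference penalty.

    Reads per-token logits as soft Q-values, with V(s) = logsumexp_a Q(s, a),
    and penalizes the residual Q(s, a_t) - gamma * V(s') so value estimates
    stay consistent along the sequence. Illustrative shapes:
      logits, next_logits: [batch, seq, vocab]; target_ids: [batch, seq].
    """
    # Standard next-token MLE term.
    mle = F.cross_entropy(logits.transpose(1, 2), target_ids)

    # Q-value of the ground-truth token and soft value of the next state.
    q_taken = logits.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)  # Q(s, a_t)
    v_next = torch.logsumexp(next_logits, dim=-1)                      # V(s')

    # Squared TD residual used as a regularizer on top of MLE.
    td_penalty = (q_taken - gamma * v_next).pow(2).mean()

    return mle + td_weight * td_penalty
```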
-----
Key Insights from this Paper 💡:
• IRL methods can optimize for the impact of whole sequences rather than individual tokens
• Offline IRL achieves most benefits without expensive online sampling
• IRL-extracted rewards correlate more strongly with task performance metrics (see the sketch after this list)
• IRL approaches consistently increase diversity of model generations
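As a toy illustration of the reward-correlation insight, one could read the logits as soft Q-values, extract a per-sequence reward via r_t = Q(s_t, a_t) - gamma * V(s_{t+1}), and correlate the summed reward with a task metric such as ROUGE-1. The helpers below are hypothetical; the paper does not prescribe this exact procedure or necessarily use a Pearson statistic.

```python
import numpy as np
import torch
from scipy.stats import pearsonr

def extracted_sequence_rewards(logits, next_logits, target_ids, gamma=0.99):
    """Hypothetical per-sequence reward from logits read as soft Q-values:
    r_t = Q(s_t, a_t) - gamma * logsumexp_a Q(s_{t+1}, a), summed over tokens."""
    q_taken = logits.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    v_next = torch.logsumexp(next_logits, dim=-1)
    return (q_taken - gamma * v_next).sum(dim=-1)  # shape: [batch]

def reward_metric_correlation(rewards, metric_scores):
    """Pearson correlation between extracted rewards and a task metric
    (e.g., ROUGE-1) on the same generations; detach tensors before passing."""
    r, _ = pearsonr(np.asarray(rewards), np.asarray(metric_scores))
    return r
```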
-----
Results 📊:
• IRL methods achieved task performance on par with or better than MLE
• Increased diversity of model generations, as measured by Self-BLEU (lower is more diverse; a metric sketch follows this list)
• IQLearn achieved higher performance in low-temperature sampling regimes
• IRL-extracted reward functions showed higher correlation with task metrics (e.g., 0.64 vs -0.05 for ROUGE-1 on TLDR)
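For reference, Self-BLEU scores each generation with BLEU against all other generations as references and averages the result; lower values indicate more diverse outputs. A minimal NLTK-based sketch follows; the n-gram order, tokenization, and smoothing choices are assumptions, not the paper's exact setup.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(generations, max_n=4):
    """Average Self-BLEU over a list of generated strings (needs >= 2 items).
    Lower Self-BLEU means the set of generations is more diverse."""
    smooth = SmoothingFunction().method1
    weights = tuple(1.0 / max_n for _ in range(max_n))
    tokenized = [g.split() for g in generations]
    scores = []
    for i, hyp in enumerate(tokenized):
        refs = tokenized[:i] + tokenized[i + 1:]  # all other generations as references
        scores.append(sentence_bleu(refs, hyp, weights=weights,
                                    smoothing_function=smooth))
    return sum(scores) / len(scores)
```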