Nice paper from @GoogleDeepMind
LLMs can learn optimal exploration strategies through algorithm distillation and inference-time algorithm-guided support.
📚 https://arxiv.org/abs/2410.06238
Original Problem 🔍:
LLMs' ability to make optimal decisions under uncertainty remains understudied, particularly in bandit environments requiring exploration-exploitation balance.
This is crucial as many real-world applications, ranging from personalized recommendations to healthcare interventions, demand that LLMs not only predict but also actively learn to make optimal decisions through exploration.
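For intuition, here's a minimal toy sketch (not from the paper) of why exploration matters in a multi-armed bandit: a purely greedy policy can lock onto a suboptimal arm, while a UCB-style policy keeps trying under-sampled arms. The arm means and horizon below are made up for illustration.

```python
# Toy illustration (assumed setup, not the paper's benchmark): greedy vs. UCB1
# on a 3-armed Bernoulli bandit.
import math
import random

def run(policy, true_means, horizon=1000, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(true_means)    # pulls per arm
    totals = [0.0] * len(true_means)  # summed rewards per arm
    reward_sum = 0.0
    for t in range(1, horizon + 1):
        arm = policy(counts, totals, t)
        r = 1.0 if rng.random() < true_means[arm] else 0.0  # Bernoulli reward
        counts[arm] += 1
        totals[arm] += r
        reward_sum += r
    return reward_sum

def greedy(counts, totals, t):
    # Exploit only: pick the arm with the best empirical mean so far.
    means = [totals[i] / counts[i] if counts[i] else 1.0 for i in range(len(counts))]
    return max(range(len(counts)), key=lambda i: means[i])

def ucb1(counts, totals, t):
    # Balance exploitation (empirical mean) with an exploration bonus
    # that shrinks as an arm is pulled more often.
    def score(i):
        if counts[i] == 0:
            return float("inf")
        return totals[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])
    return max(range(len(counts)), key=score)

true_means = [0.3, 0.5, 0.7]  # hypothetical arm reward probabilities
print("greedy total reward:", run(greedy, true_means))
print("UCB1 total reward:  ", run(ucb1, true_means))
```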
-----
Solution in this Paper 🛠️:
• Introduced BanditBench: A comprehensive suite of multi-armed bandit (MAB) and contextual bandit (CB) environments in natural language
• Proposed two approaches to enhance LLMs for in-context exploration:
- Inference-time algorithm-guided support (see the sketch after this list)
- Algorithm distillation via optimal demonstration data
• Conducted extensive ablation studies on task difficulty, textual representation, and algorithm-guided support
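A hedged sketch of what inference-time algorithm-guided support can look like: instead of dumping raw interaction history into the context, the prompt summarizes per-arm UCB-style statistics (pull count, empirical mean, exploration bonus). The function name, prompt wording, and coefficient below are illustrative assumptions, not the paper's exact format.

```python
# Sketch (assumed prompt format): summarize UCB-style statistics per arm
# and hand the summary to the LLM at inference time.
import math

def algorithm_guided_prompt(counts, totals, t, exploration_coef=2.0):
    lines = [
        "You are choosing an arm in a multi-armed bandit.",
        f"Round {t}. Per-arm summary statistics:",
    ]
    for i, (n, s) in enumerate(zip(counts, totals)):
        if n == 0:
            lines.append(f"  arm {i}: never pulled (high uncertainty)")
            continue
        mean = s / n                                            # exploitation value
        bonus = math.sqrt(exploration_coef * math.log(t) / n)   # exploration bonus
        lines.append(
            f"  arm {i}: pulls={n}, empirical mean={mean:.3f}, "
            f"exploration bonus={bonus:.3f}, UCB={mean + bonus:.3f}"
        )
    lines.append(
        "Pick the arm that best balances exploration and exploitation. "
        "Answer with the arm index only."
    )
    return "\n".join(lines)

# Example: statistics after a few rounds of interaction.
print(algorithm_guided_prompt(counts=[3, 1, 0], totals=[2.0, 0.0, 0.0], t=5))
```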
-----
Key Insights from this Paper 💡:
• LLMs struggle with in-context exploration when relying solely on raw interaction history
• Inference-time support significantly improves performance
• Algorithm distillation via fine-tuning leads to strong generalization across domains
• Smaller models enhanced with these approaches can outperform larger models at exploration
• An optimality gap remains between LLMs and classical optimal algorithms
-----
Results 📊:
• Algorithm-guided support improved win-rate from 27.7% to 32.2% for Gemini-1.5 Flash in MAB
• Oracle behavior fine-tuning achieved 65.6% win-rate for Gemini-1.5 Flash in MAB
• In contextual bandit (CB), algorithm-guided support boosted win-rate from 0% to 46.4% for Gemini-1.5 Flash