In-Context Reinforcement Learning (ICRL) unlocks new learning paradigms for LLMs, enabling adaptation through reward signals alone, without parameter updates.
The paper's algorithm improves performance by spending more test-time compute, and a compute-bound approximation is also proposed.
📚 https://arxiv.org/abs/2410.05362
Original Problem 🤔:
LLMs exhibit in-context supervised learning, but can they perform In-Context Reinforcement Learning (ICRL) without parameter updates?
-----
Solution in this Paper 🧠:
• Proposed an Explorative ICRL algorithm to fix the exploration deficiency of naive ICRL
• Introduced stochasticity into prompt construction by randomly sampling past episodes (see the sketch after this list)
• Filtered out negative-reward examples to simplify in-context reasoning
• Developed Approximate ICRL to cut computational cost while maintaining performance
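A minimal Python sketch of how Explorative ICRL could look, based on the bullets above: the context is rebuilt each episode from a random subset of past positive-reward episodes, and negative-reward episodes are never shown to the model. `llm`, `reward_fn`, and `stream` are illustrative stand-ins, not APIs from the paper.

```python
import random

def explorative_icrl(llm, reward_fn, stream, keep_prob=0.5):
    """Hedged sketch of Explorative ICRL.

    `llm(prompt) -> str` and `reward_fn(x, y) -> float` are assumed
    callables; all names here are illustrative.
    """
    positive_episodes = []  # only episodes that earned a positive reward are kept
    for x in stream:
        # Stochastic context: each stored positive episode is included
        # independently with probability `keep_prob`, injecting exploration.
        context = [ep for ep in positive_episodes if random.random() < keep_prob]
        prompt = "".join(f"Input: {q}\nOutput: {a}\n" for q, a in context)
        prompt += f"Input: {x}\nOutput:"
        y = llm(prompt)              # model prediction for this episode
        if reward_fn(x, y) > 0:      # negative-reward episodes are discarded
            positive_episodes.append((x, y))
    return positive_episodes
```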
-----
Key Insights from this Paper 💡:
• Naive ICRL fails due to lack of exploration and difficulty learning from negative rewards
• LLMs can effectively learn from rewards alone through ICRL
• Stochasticity in context generation and focusing on positive examples are crucial for ICRL success
• Approximate ICRL offers a compute-efficient alternative to Explorative ICRL (a hedged sketch follows this list)
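One plausible reading of the compute-bound approximation, sketched under assumptions: instead of resampling the full context every episode, keep a few persistent contexts and stochastically append new positive episodes to them, so most of the prompt stays fixed and far fewer tokens are reprocessed. The pool size, `add_prob`, and all function names are illustrative, not the paper's exact procedure.

```python
import random

def approximate_icrl(llm, reward_fn, stream, n_contexts=4, add_prob=0.1):
    """Hedged sketch of a compute-bound approximation to Explorative ICRL."""
    contexts = [[] for _ in range(n_contexts)]   # persistent episode buffers
    for x in stream:
        ctx = random.choice(contexts)            # query with one cached context
        prompt = "".join(f"Input: {q}\nOutput: {a}\n" for q, a in ctx)
        prompt += f"Input: {x}\nOutput:"
        y = llm(prompt)
        if reward_fn(x, y) > 0:
            # New positive episodes are appended only with probability
            # `add_prob`, so each context grows slowly and stays reusable.
            for c in contexts:
                if random.random() < add_prob:
                    c.append((x, y))
    return contexts
```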
-----
Results 📊:
• Explorative ICRL significantly outperformed zero-shot and naive ICRL across all tasks
• Banking-77 task: Llama improved from 17.2% zero-shot to 66.0% accuracy with Explorative ICRL
• Approximate ICRL reduced processed tokens by two orders of magnitude compared to Explorative ICRL
• Llama was more robust to the approximation than Phi, requiring a smaller computational budget