LLMs Are In-Context Reinforcement Learners

This podcast was generated with Google's Illuminate.

In-Context Reinforcement Learning (ICRL) unlocks new learning paradigms for LLMs, enabling adaptation through reward signals alone, without parameter updates.

The paper's main algorithm works by increasing test-time compute; a compute-bound approximation of it is also proposed.

📚 https://arxiv.org/abs/2410.05362

Original Problem 🤔:

LLMs exhibit in-context supervised learning, but can they perform In-Context Reinforcement Learning (ICRL) without parameter updates?
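To make the setting concrete, here is a minimal sketch of how a bandit-style ICRL episode might be rendered into the prompt: each past (input, prediction, reward) triplet is appended as context, and the model must improve from the reward signal alone. The template and function names are illustrative assumptions, not the paper's exact prompt format.

```python
def format_episode(text: str, prediction: str, reward: int) -> str:
    """Render one past (input, prediction, reward) triplet as prompt text."""
    return f"Input: {text}\nPrediction: {prediction}\nReward: {reward}\n"


def build_icrl_prompt(past_episodes: list[dict], new_input: str) -> str:
    """Concatenate past episodes, then ask for a prediction on the new input."""
    context = "".join(
        format_episode(ep["text"], ep["prediction"], ep["reward"])
        for ep in past_episodes
    )
    return context + f"Input: {new_input}\nPrediction:"
```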

-----

Solution in this Paper 🧠:

• Proposed Explorative ICRL algorithm to address exploration deficiency

• Introduced stochasticity in prompt construction by randomly sampling past episodes

• Filtered out negative reward examples to simplify prompt reasoning

• Developed Approximate ICRL to reduce computational costs while maintaining performance
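A minimal sketch of the Explorative ICRL loop as described by the bullets above: negative-reward episodes are filtered out, and each remaining positive episode is independently resampled into the prompt for every new query. The `llm_predict` and `get_reward` helpers, the `keep_prob` value, and the reuse of the `build_icrl_prompt` sketch above are assumptions for illustration, not the authors' code.

```python
import random


def explorative_icrl(stream, llm_predict, get_reward, keep_prob=0.5):
    """Sketch of Explorative ICRL: learning from rewards alone, in context.

    `stream` yields task inputs, `llm_predict(prompt)` queries the LLM, and
    `get_reward(x, y)` scores the prediction. `keep_prob` is an assumed
    inclusion probability; the paper tunes this kind of hyperparameter.
    """
    positive_episodes = []  # only positive-reward episodes are kept (filtering)
    for x in stream:
        # Stochastic prompt construction: resample a random subset of past
        # positive episodes for every new query to encourage exploration.
        sampled = [ep for ep in positive_episodes if random.random() < keep_prob]
        prompt = build_icrl_prompt(sampled, x)  # reuses the sketch above
        y = llm_predict(prompt)
        r = get_reward(x, y)
        if r > 0:  # negative-reward episodes never enter future prompts
            positive_episodes.append({"text": x, "prediction": y, "reward": r})
        yield x, y, r
```

Because the context is resampled from scratch on every episode, this procedure processes many more tokens than a fixed prompt would, which is what motivates the Approximate ICRL variant discussed next.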

-----

Key Insights from this Paper 💡:

• Naive ICRL fails due to lack of exploration and difficulty learning from negative rewards

• LLMs can effectively learn from rewards alone through ICRL

• Stochasticity in context generation and focusing on positive examples are crucial for ICRL success

• Approximate ICRL offers a compute-efficient alternative to Explorative ICRL

-----

Results 📊:

• Explorative ICRL significantly outperformed zero-shot and naive ICRL across all tasks

• Banking-77 task: Llama improved from 17.2% zero-shot to 66.0% accuracy with Explorative ICRL

• Approximate ICRL reduced processed tokens by two orders of magnitude compared to Explorative ICRL

• Llama showed more robustness to approximation than Phi, requiring less computational budget
