
Unsupervised-to-Online Reinforcement Learning

This podcast was generated with Google's Illuminate.

Unsupervised offline pre-training enables better online reinforcement learning (RL) adaptation than supervised offline pre-training.

📚 https://arxiv.org/abs/2408.14785

Results 📊:

• Unsupervised-to-online RL (U2O RL) outperforms standard offline-to-online RL and off-policy online RL with offline data

• Matches or exceeds SOTA offline-to-online RL methods

• Significantly outperforms the previous best method (Cal-QL) on challenging AntMaze tasks:

- antmaze-ultra-diverse: 54% vs 5%

- antmaze-ultra-play: 58% vs 13%

🧠 Offline vs Online RL Basics - Offline RL learns from a fixed dataset without environment interaction, while online RL learns through real-time interaction and continuous data collection from the environment.
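A minimal sketch of that distinction, using a hypothetical Gym-style `agent`/`env` interface; none of these names come from the paper, and the loops are deliberately bare-bones.

```python
import random

def offline_rl(agent, dataset, updates=10_000, batch_size=256):
    """Offline RL: learn purely from a fixed dataset, no environment access."""
    for _ in range(updates):
        batch = random.sample(dataset, batch_size)
        agent.update(batch)  # e.g. a conservative / offline RL update rule
    return agent

def online_rl(agent, env, steps=10_000, batch_size=256):
    """Online RL: interleave data collection and learning in real time."""
    replay, obs = [], env.reset()
    for _ in range(steps):
        action = agent.act(obs)
        next_obs, reward, done, _ = env.step(action)
        replay.append((obs, action, reward, next_obs, done))
        obs = env.reset() if done else next_obs
        if len(replay) >= batch_size:
            agent.update(random.sample(replay, batch_size))
    return agent
```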

Problem 🔍:

Offline-to-online reinforcement learning (RL) has limitations: pre-training is tied to a single, reward-labeled domain, the pre-trained model cannot be reused across multiple downstream tasks, and online adaptation is often unstable.

Key Insights from this Paper 💡:

• Unsupervised offline RL pre-training outperforms supervised offline RL for online fine-tuning

• Multi-task unsupervised pre-training learns better representations than task-specific supervised pre-training

• A single unsupervised pre-trained model can be fine-tuned for multiple downstream tasks

Solution in this Paper 🧠:

• Unsupervised-to-online RL (U2O RL) framework (sketched in code after this list):

- Unsupervised offline policy pre-training using HILP or other skill-based methods

- Bridging phase to identify the best skill for the downstream task

- Online fine-tuning of the pre-trained policy

• Reward scale matching technique to bridge intrinsic and extrinsic rewards

• Uses successor-feature-based or goal-conditioned RL methods for skill identification
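A compact sketch of the three U2O RL phases, assuming a skill-conditioned policy π(a | s, z) with state-action features φ learned by an unsupervised method such as HILP. All class and function names here are illustrative, not the paper's code, and the reward-scale matching uses simple range matching as a stand-in for the paper's exact formula.

```python
import numpy as np

# Phase 1 (not shown): unsupervised offline pre-training (e.g. HILP or another
# skill-discovery method) trains pi(a | s, z) and features phi(s, a) on the
# reward-free offline dataset. It is treated as a black box below.

def bridge_phase(phi_fn, offline_batch, rewards):
    """Phase 2: pick the skill latent z* whose linear reward phi(s, a)^T z
    best fits the downstream task reward (least-squares regression)."""
    Phi = np.stack([phi_fn(s, a) for s, a in offline_batch])   # (N, d)
    z_star, *_ = np.linalg.lstsq(Phi, rewards, rcond=None)
    return z_star / (np.linalg.norm(z_star) + 1e-8)            # unit-norm skill

def match_reward_scale(task_rewards, intrinsic_rewards):
    """Rescale extrinsic task rewards toward the range the critic saw during
    pre-training (simplified; the paper's exact rule may differ)."""
    scale = (np.ptp(intrinsic_rewards) + 1e-8) / (np.ptp(task_rewards) + 1e-8)
    return lambda r: scale * r

def finetune_online(agent, env, z_star, reward_map, steps=100_000):
    """Phase 3: fine-tune the pre-trained policy and value networks online,
    conditioning on the fixed skill z* and using the rescaled reward."""
    obs = env.reset()
    for _ in range(steps):
        action = agent.act(obs, z_star)
        next_obs, r, done, _ = env.step(action)
        agent.update(obs, action, reward_map(r), next_obs, done, z_star)
        obs = env.reset() if done else next_obs
    return agent
```

The bridging step is what lets a single reward-free pre-trained model serve many downstream tasks: each new task only requires solving the small regression for its own z* and rescaling its reward before fine-tuning begins.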
