Unsupervised offline pre-training enables better online Reinforcement Learning (RL) adaptation than supervised offline pre-training.
📚 https://arxiv.org/abs/2408.14785
Results 📊:
• Unsupervised-to-online RL (U2O RL) outperforms standard offline-to-online RL and off-policy online RL with offline data
• Matches or exceeds SOTA offline-to-online RL methods
• Significantly outperforms previous best method (Cal-QL) on challenging AntMaze tasks:
- antmaze-ultra-diverse: 54% vs 5%
- antmaze-ultra-play: 58% vs 13%
🧠 Offline vs Online RL Basics - Offline RL learns from a fixed dataset without environment interaction, while online RL learns through real-time interaction and continuous data collection from the environment.
Problem 🔍:
Offline-to-online reinforcement learning (RL) has key limitations: pre-training is tied to a single task's reward (domain-specific), pre-trained models cannot be reused across downstream tasks, and online adaptation is often unstable.
Key Insights from this Paper 💡:
• Unsupervised offline RL pre-training outperforms supervised offline RL for online fine-tuning
• Multi-task unsupervised pre-training learns better representations than task-specific supervised pre-training
• A single unsupervised pre-trained model can be fine-tuned for multiple downstream tasks
Solution in this Paper 🧠:
• Unsupervised-to-online RL (U2O RL) framework:
- Unsupervised offline policy pre-training using HILP or other skill-based methods
- Bridging phase to identify best skill for downstream task
- Online fine-tuning of pre-trained policy
• Reward scale matching technique to bridge intrinsic and extrinsic rewards (sketched below)
• Uses successor feature-based or goal-conditioned RL methods for skill identification (see the sketches after this list)
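For the successor-feature case, the bridging phase boils down to picking the skill vector z whose linear reward prediction best matches the downstream task reward. A minimal sketch, assuming a least-squares fit over a small reward-labeled batch (names like `phi_values` and `infer_task_skill` are my own illustrative choices, not the paper's code):

```python
import numpy as np

def infer_task_skill(phi_values: np.ndarray, rewards: np.ndarray, reg: float = 1e-6) -> np.ndarray:
    """Bridging-phase sketch: solve r ~= phi^T z by regularized least squares.

    phi_values: (N, d) pre-trained features for N labeled transitions.
    rewards:    (N,)   extrinsic task rewards for the same transitions.
    Returns z*, the skill vector to condition the policy on during fine-tuning.
    """
    d = phi_values.shape[1]
    A = phi_values.T @ phi_values + reg * np.eye(d)  # regularized normal equations
    b = phi_values.T @ rewards
    return np.linalg.solve(A, b)

# Usage: freeze z* and fine-tune the skill-conditioned policy pi(a | s, z*) online.
```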
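For reward scale matching, one plausible implementation (assuming mean/std matching; the exact statistics used in the paper may differ) is to standardize the extrinsic reward and rescale it to the intrinsic-reward statistics seen during pre-training, so critic values do not jump in scale when online fine-tuning starts:

```python
import numpy as np

def match_reward_scale(r_ext: np.ndarray, intrinsic_mean: float, intrinsic_std: float) -> np.ndarray:
    """Hedged sketch (function and variable names are mine): rescale extrinsic
    rewards so their mean/std match the intrinsic rewards used in pre-training."""
    ext_mean, ext_std = r_ext.mean(), r_ext.std() + 1e-8
    standardized = (r_ext - ext_mean) / ext_std            # zero mean, unit std
    return standardized * intrinsic_std + intrinsic_mean   # match pre-training stats
```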