Two-phase learning from unlabeled data creates efficient exploration strategies.
Converting messy, unlabeled prior data into reusable skills and relabeled examples for faster online learning.
📚 https://arxiv.org/abs/2410.18076
🎯 Original Problem:
Unlike supervised learning, where pretraining directly improves a model's ability to mimic task-specific data, Reinforcement Learning (RL) has no obvious way to benefit from unlabeled prior data: the trajectories carry no reward labels for the downstream task. The core challenge is to turn this unlabeled trajectory data into an efficient online exploration strategy.
-----
🛠️ Solution in this Paper:
• SUPE (Skills from Unlabeled Prior data for Exploration) leverages unlabeled trajectory data twice:
- First, in an offline phase: a VAE extracts low-level skills from the trajectories
- Then, in the online phase: prior data is transformed into high-level off-policy examples via optimistic reward relabeling
• Key mechanisms (sketched in code below):
- A variational autoencoder extracts reusable low-level skills from the unlabeled data
- An optimistic reward model relabels the prior data so it can drive online learning
- A high-level policy composes the pretrained skills for efficient exploration
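Below is a minimal sketch of the offline phase: a VAE that encodes short (state, action) trajectory segments into a latent skill z and decodes z with a state-conditioned low-level policy that reconstructs the actions. The dimensions, segment length H, network sizes, and KL weight are illustrative assumptions, not the paper's exact choices.

```python
# Minimal sketch of offline skill pretraining (phase 1) with a trajectory-segment VAE.
# STATE_DIM, ACTION_DIM, SKILL_DIM, H, and the KL weight are assumed values for illustration.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, SKILL_DIM, H = 29, 8, 8, 25  # assumed sizes

class SkillVAE(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: maps a length-H (state, action) segment to a latent skill z.
        self.encoder = nn.GRU(STATE_DIM + ACTION_DIM, 256, batch_first=True)
        self.mu_head = nn.Linear(256, SKILL_DIM)
        self.logstd_head = nn.Linear(256, SKILL_DIM)
        # Decoder: low-level policy pi(a | s, z) that reconstructs the segment's actions.
        self.decoder = nn.Sequential(
            nn.Linear(STATE_DIM + SKILL_DIM, 256), nn.ReLU(),
            nn.Linear(256, ACTION_DIM),
        )

    def forward(self, states, actions):
        # states: (B, H, STATE_DIM), actions: (B, H, ACTION_DIM)
        _, h = self.encoder(torch.cat([states, actions], dim=-1))
        mu, logstd = self.mu_head(h[-1]), self.logstd_head(h[-1])
        z = mu + torch.randn_like(mu) * logstd.exp()            # reparameterization trick
        z_tiled = z.unsqueeze(1).expand(-1, H, -1)
        pred_actions = self.decoder(torch.cat([states, z_tiled], dim=-1))
        recon = ((pred_actions - actions) ** 2).mean()          # action reconstruction loss
        kl = 0.5 * (mu ** 2 + (2 * logstd).exp() - 2 * logstd - 1).mean()
        return recon + 0.1 * kl                                  # KL weight is an assumption
```

After pretraining, the decoder can serve as a frozen low-level skill policy, so the online agent only has to choose a skill z every H steps rather than a raw action at every step.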
-----
💡 Key Insights:
• Using the prior data twice compounds the benefits of offline skill learning and online policy optimization
• Pseudo-relabeling with optimistic rewards turns unlabeled transitions into usable off-policy data for online learning (see the sketch after this list)
• Hierarchical decomposition separates task-agnostic skills from task-specific behavior
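The sketch below shows one way the pseudo-relabeling step could look: each unlabeled prior segment receives an optimistic pseudo-reward (a predicted task reward plus a novelty bonus) and a skill label from the pretrained encoder, turning it into a high-level off-policy transition for the online agent. The RND-style bonus, the reward model, and the `vae.encode` helper are assumptions for illustration, not the paper's exact implementation.

```python
# Sketch of online-phase relabeling (phase 2): unlabeled prior segments get an
# optimistic pseudo-reward and a skill label, then join the high-level replay buffer.
# The RND-style novelty bonus and reward model are illustrative assumptions.
import torch
import torch.nn as nn

STATE_DIM, SKILL_DIM = 29, 8  # assumed sizes

reward_model = nn.Sequential(nn.Linear(STATE_DIM, 256), nn.ReLU(), nn.Linear(256, 1))
rnd_target = nn.Sequential(nn.Linear(STATE_DIM, 256), nn.ReLU(), nn.Linear(256, 64))
rnd_predictor = nn.Sequential(nn.Linear(STATE_DIM, 256), nn.ReLU(), nn.Linear(256, 64))
for p in rnd_target.parameters():
    p.requires_grad_(False)  # the RND target network stays fixed

def optimistic_reward(next_states, bonus_scale=1.0):
    """Predicted task reward plus a novelty bonus (larger where data is unfamiliar)."""
    with torch.no_grad():
        r_hat = reward_model(next_states).squeeze(-1)
        novelty = ((rnd_predictor(next_states) - rnd_target(next_states)) ** 2).mean(-1)
    return r_hat + bonus_scale * novelty

def relabel_prior_segment(seg_states, seg_actions, vae):
    """Turn one length-H prior segment into a high-level transition (s_0, z, r, s_H)."""
    with torch.no_grad():
        z = vae.encode(seg_states, seg_actions)   # skill label from pretrained encoder (assumed helper)
        r = optimistic_reward(seg_states[:, -1])  # optimistic pseudo-reward at the end state
    return seg_states[:, 0], z, r, seg_states[:, -1]
```

The optimism term is what makes the unlabeled data useful before the true task reward has been observed: regions poorly covered by the prior data look attractive, which steers the high-level policy toward directed exploration.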
-----
📊 Results:
• Outperforms all baselines across three domains: AntMaze, Kitchen, Visual AntMaze
• 4x faster goal discovery in challenging sparse-reward tasks
• Achieves 75% higher success rates in complex navigation tasks
• Maintains strong performance even with corrupted/limited prior data