"Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration"

The podcast on this paper is generated with Google's Illuminate.

Two-phase learning from unlabeled data creates efficient exploration strategies.

Converting messy prior data into reusable skills and relabeled examples for faster online learning.

📚 https://arxiv.org/abs/2410.18076

🎯 Original Problem:

Unlike supervised learning, where pretraining on prior data directly teaches a model to imitate task-relevant outputs, Reinforcement Learning (RL) has no obvious way to exploit unlabeled prior trajectories: they carry no reward labels for the downstream task. The core challenge is to turn this reward-free trajectory data into an efficient online exploration strategy.

-----

🛠️ Solution in this Paper:

• SUPE (Skills from Unlabeled Prior data for Exploration) leverages unlabeled trajectory data twice:

- First, in an offline phase: a VAE extracts low-level skills from the trajectories

- Then, in the online phase: the prior data is relabeled with optimistic rewards and reused as high-level off-policy examples

• Key mechanisms:

- Variational autoencoder extracts reusable low-level skills from unlabeled data (a minimal sketch follows this list)

- Optimistic reward model relabels prior data for online learning

- High-level policy composes pretrained skills for efficient exploration
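
Below is a minimal sketch of the offline skill-pretraining phase, assuming a trajectory-segment VAE: an encoder maps short (state, action) segments to a latent skill z, and a low-level decoder policy reconstructs actions from (state, z). All dimensions, architecture sizes, and the KL weight here are illustrative assumptions, not the paper's hyperparameters.

```python
# Sketch of skill extraction from unlabeled trajectory segments (offline phase).
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, SKILL_DIM, SEG_LEN = 29, 8, 8, 10  # assumed sizes

class SkillVAE(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: consumes a flattened (state, action) segment, outputs q(z | segment).
        self.encoder = nn.Sequential(
            nn.Linear(SEG_LEN * (STATE_DIM + ACTION_DIM), 256), nn.ReLU(),
            nn.Linear(256, 2 * SKILL_DIM),  # mean and log-variance of z
        )
        # Low-level skill policy: decodes an action from (state, z).
        self.decoder = nn.Sequential(
            nn.Linear(STATE_DIM + SKILL_DIM, 256), nn.ReLU(),
            nn.Linear(256, ACTION_DIM),
        )

    def forward(self, states, actions):
        # states: (B, SEG_LEN, STATE_DIM), actions: (B, SEG_LEN, ACTION_DIM)
        seg = torch.cat([states, actions], dim=-1).flatten(1)
        mean, log_var = self.encoder(seg).chunk(2, dim=-1)
        z = mean + torch.randn_like(mean) * (0.5 * log_var).exp()  # reparameterize
        # Reconstruct every action in the segment from its state and the shared skill z.
        z_tiled = z.unsqueeze(1).expand(-1, SEG_LEN, -1)
        recon = self.decoder(torch.cat([states, z_tiled], dim=-1))
        recon_loss = ((recon - actions) ** 2).mean()
        kl = -0.5 * (1 + log_var - mean ** 2 - log_var.exp()).sum(-1).mean()
        return recon_loss + 0.1 * kl  # 0.1 is an assumed KL weight

# Usage on a random batch standing in for unlabeled prior trajectory segments.
vae = SkillVAE()
opt = torch.optim.Adam(vae.parameters(), lr=3e-4)
states = torch.randn(64, SEG_LEN, STATE_DIM)
actions = torch.randn(64, SEG_LEN, ACTION_DIM)
opt.zero_grad()
loss = vae(states, actions)
loss.backward()
opt.step()
```

During the online phase, the high-level policy then outputs a skill latent z every SEG_LEN steps and the frozen decoder executes it, so exploration happens over temporally extended skills rather than raw actions.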

-----

💡 Key Insights:

• Double utilization of prior data compounds benefits of both skill learning and online optimization

• Pseudo-relabeling prior data with optimistic rewards makes it useful for online learning (see the sketch after this list)

• Hierarchical decomposition helps separate task-agnostic skills from task-specific behaviors
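
Below is a minimal sketch of the optimistic relabeling step, under the assumption that the optimistic reward combines a learned reward estimate with an RND-style novelty bonus (predictor-vs-frozen-target error). The module names (reward_model, rnd_target, rnd_predictor), dimensions, and the bonus scale are hypothetical, not the paper's exact components.

```python
# Sketch of optimistic pseudo-reward relabeling for unlabeled prior states (online phase).
import torch
import torch.nn as nn

STATE_DIM = 29       # assumed state dimension
BONUS_SCALE = 1.0    # assumed weight on the exploration bonus

reward_model = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
rnd_target = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, 64))
rnd_predictor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, 64))
for p in rnd_target.parameters():
    p.requires_grad_(False)  # RND target stays fixed; only the predictor trains online

def optimistic_relabel(prior_states: torch.Tensor) -> torch.Tensor:
    """Assign optimistic pseudo-rewards to unlabeled prior states so they can
    be mixed into the high-level agent's replay buffer."""
    with torch.no_grad():
        reward_estimate = reward_model(prior_states).squeeze(-1)
        novelty = (rnd_predictor(prior_states) - rnd_target(prior_states)).pow(2).mean(-1)
    # Optimism: under-visited (high-novelty) states get inflated pseudo-rewards,
    # steering the online high-level policy toward exploring them.
    return reward_estimate + BONUS_SCALE * novelty

# Usage: relabel a batch of states drawn from the unlabeled prior dataset.
prior_states = torch.randn(128, STATE_DIM)
pseudo_rewards = optimistic_relabel(prior_states)
print(pseudo_rewards.shape)  # torch.Size([128])
```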

-----

📊 Results:

• Outperforms all baselines across three domains: AntMaze, Kitchen, and Visual AntMaze

• 4x faster goal discovery in challenging sparse-reward tasks

• Achieves 75% higher success rates in complex navigation tasks

• Maintains strong performance even with corrupted/limited prior data
