This paper adapts Multimodal LLMs into a unified Generalist Embodied Agent (GEA) that can handle diverse tasks across manipulation, games, UI control, and planning domains.
-----
https://arxiv.org/abs/2412.08442
🤖 Original Problem:
→ Current AI systems excel at specific tasks but struggle to handle diverse embodied tasks such as robot control, game playing, and UI manipulation within a single model
→ Existing solutions lack a unified approach to handling multiple domains and embodiments
-----
🔧 Solution in this Paper:
→ GEA uses a multi-embodiment action tokenizer that converts diverse action spaces into unified token representations using a Residual VQ-VAE with 2 codebooks (see the quantizer sketch after this list)
→ The system employs a two-stage training process: supervised finetuning on 2.2M trajectories followed by online reinforcement learning (a toy two-stage loop is sketched below)
→ Training combines data from multiple domains to enable cross-domain performance benefits
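A minimal sketch of how a residual vector quantizer can map a continuous action to a few discrete tokens, assuming 2 codebooks as stated above. The class name, dimensions (action_dim, codebook_size), and nearest-neighbor lookup are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ResidualActionQuantizer(nn.Module):
    """Residual VQ sketch: each stage quantizes what the previous stage missed."""

    def __init__(self, action_dim=7, codebook_size=512, num_codebooks=2):
        super().__init__()
        # One learnable codebook per quantization stage (2 in the paper).
        self.codebooks = nn.ParameterList(
            [nn.Parameter(torch.randn(codebook_size, action_dim) * 0.1)
             for _ in range(num_codebooks)]
        )

    def encode(self, action):
        """Map continuous actions (batch, action_dim) to discrete token ids."""
        residual = action
        token_ids = []
        for codebook in self.codebooks:
            dists = torch.cdist(residual, codebook)   # distance to every code
            ids = dists.argmin(dim=-1)                # nearest code per action
            residual = residual - codebook[ids]       # quantize-and-subtract
            token_ids.append(ids)
        return torch.stack(token_ids, dim=-1)         # (batch, num_codebooks)

    def decode(self, token_ids):
        """Reconstruct a continuous action by summing the selected codes."""
        return sum(cb[token_ids[:, i]] for i, cb in enumerate(self.codebooks))
```

With 2 codebooks, every continuous action (e.g., a robot arm command) becomes just two token ids, which can be added to the LLM's vocabulary so the same next-token interface covers both text and control.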
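And a self-contained toy sketch of the two-stage recipe: supervised finetuning on demonstration action tokens, then online RL on the agent's own rollouts. The tiny policy, random data, and REINFORCE-style update are stand-ins for the paper's MLLM and PPO training, not the authors' implementation:

```python
import torch
import torch.nn as nn

vocab_size, obs_dim = 512, 32
policy = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, vocab_size))
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-4)

# Stage 1: supervised finetuning on (observation, expert action-token) pairs.
demo_obs = torch.randn(256, obs_dim)                  # stand-in for visual features
demo_actions = torch.randint(0, vocab_size, (256,))   # expert action tokens
for _ in range(100):
    loss = nn.functional.cross_entropy(policy(demo_obs), demo_actions)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

# Stage 2: online RL -- sample actions from the current policy, score them with
# an environment reward, and reinforce high-reward ones (the paper uses PPO).
def toy_reward(actions):
    return (actions % 2 == 0).float()                 # placeholder reward signal

for _ in range(100):
    obs = torch.randn(64, obs_dim)
    dist = torch.distributions.Categorical(logits=policy(obs))
    actions = dist.sample()
    reward = toy_reward(actions)
    pg_loss = -(dist.log_prob(actions) * (reward - reward.mean())).mean()
    optimizer.zero_grad(); pg_loss.backward(); optimizer.step()
```

The point of stage 2 is that the agent learns from its own mistakes in the environment instead of only imitating demonstrations, which is where the robustness gains in the Key Insights come from.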
-----
🎯 Key Insights:
→ Training on combined data from diverse domains provides better generalization
→ Online reinforcement learning significantly improves robustness compared to supervised learning alone
→ Visual encoder initialization has a stronger impact on performance than language model initialization
-----
📊 Results:
→ Achieves 90% success on CALVIN manipulation tasks with unseen instructions
→ Reaches 83% success in Habitat mobile pick tasks in new environments
→ Attains 44% of the expert score on Procgen video games, 20% higher than prior methods