MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents

The podcast on this paper is generated with Google's Illuminate.

Current retrieval methods focus primarily on surface-level similarities of textual or visual cues in trajectories, neglecting whether those trajectories are actually effective for the specific task at hand.

MLLM AS RETRIEVER (MART) solves this by leveraging interactive feedback to retrieve task-effective multimodal trajectories.

https://arxiv.org/abs/2410.03450

Results 📊:

• MART consistently outperforms baselines by over 10% in Success Rate across environments

• AI2-THOR: 40% Success Rate (vs 18-26% for baselines)

• LEGENT: 87% Success Rate (vs 69-75% for other methods)

• Fewer average steps needed to complete tasks in both environments

Solution in this Paper 🛠️:

• MART (MLLM As ReTriever): Fine-tunes MLLM retriever using interactive feedback

• Trajectory Abstraction: Condenses trajectories while preserving key information

• Preference learning: Organizes interactive feedback into preference pairs for fine-tuning

• Bradley-Terry model: Models pairwise trajectory preferences to train the MLLM retriever
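The Bradley-Terry step above can be sketched as a pairwise preference loss: the retriever scores two trajectories, and training maximizes the likelihood that the task-effective one is ranked higher. This is a minimal stdlib sketch, not the paper's implementation; the function name and the scalar-score interface are assumptions for illustration.

```python
import math

def bradley_terry_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood of a pairwise preference (sketch).

    Under the Bradley-Terry model,
        P(preferred > rejected) = sigmoid(score_preferred - score_rejected),
    so minimizing this loss pushes the retriever to score the
    task-effective trajectory above the less effective one.
    """
    margin = score_preferred - score_rejected
    prob_preferred = 1.0 / (1.0 + math.exp(-margin))
    return -math.log(prob_preferred)
```

With equal scores the loss is ln 2 ≈ 0.693 (the model is indifferent); a larger positive margin drives the loss toward zero, which is the signal the fine-tuning exploits.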

-----

Key Insights from this Paper 💡:

• Interactive learning improves trajectory retrieval for embodied agents

• Trajectory Abstraction reduces context window length and removes distracting information

• MLLM retriever can prioritize effective trajectories for unseen tasks

• Combining MLLM capabilities with task-specific effectiveness assessment enhances performance
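To make the Trajectory Abstraction insight concrete: a long trajectory of observation-action steps can be condensed before it enters the retriever's context window. The sketch below is a hypothetical simplification (keep every k-th step plus the final one); the paper's method preserves key information more selectively, so the subsampling rule here is purely illustrative.

```python
def abstract_trajectory(steps: list, keep_every: int = 3) -> list:
    """Condense a trajectory for a limited context window (sketch).

    Keeps every `keep_every`-th step plus the final step, dropping the
    rest to reduce context length and distracting detail. The real
    Trajectory Abstraction selects key information rather than
    subsampling uniformly.
    """
    if not steps:
        return []
    kept = steps[::keep_every]
    # Always retain the terminal step so the outcome is preserved.
    if (len(steps) - 1) % keep_every != 0:
        kept.append(steps[-1])
    return kept
```

For a 10-step trajectory with `keep_every=3`, this keeps steps 0, 3, 6, and 9, shrinking the context by more than half while retaining the start, intermediate milestones, and the outcome.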