A unified framework to benchmark LLMs' ability to understand and execute physical tasks
https://arxiv.org/abs/2410.07166
🎯 Original Problem:
Evaluating LLMs for embodied decision-making lacks standardization across tasks, modules, and metrics. Current evaluations use different domains, purposes, and input-output specifications, making it hard to pinpoint LLMs' limitations.
-----
🔧 Solution in this Paper:
→ Introduces the Embodied Agent Interface, a standardized framework unifying embodied tasks, LLM modules, and evaluation metrics
→ Implements an object-centric state and action representation, using Linear Temporal Logic (LTL) to define task success (see the sketch after this list)
→ Formalizes four key LLM modules: goal interpretation, subgoal decomposition, action sequencing, and transition modeling
→ Provides fine-grained evaluation metrics to identify different error types like hallucination, affordance, and planning errors
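To make the object-centric, LTL-style representation concrete, here is a minimal Python sketch; `Proposition`, `eventually`, `always`, and the example goal are illustrative assumptions for this post, not the paper's actual API or grounding format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Proposition:
    """An object-centric atomic state, e.g. ontop(apple, table)."""
    predicate: str
    args: tuple

def eventually(prop: Proposition):
    """LTL 'finally' operator: the proposition must hold at some future step."""
    return ("F", prop)

def always(prop: Proposition):
    """LTL 'globally' operator: the proposition must hold at every step."""
    return ("G", prop)

# Goal interpretation: natural-language instruction -> grounded LTL goal.
# "Put the apple on the table and keep the fridge closed."
goal = [
    eventually(Proposition("ontop", ("apple.0", "table.1"))),
    always(Proposition("closed", ("fridge.0",))),
]

# Action sequencing / transition modeling would then search for an action
# sequence whose induced state trajectory satisfies every clause in `goal`.
```

Grounding goals, subgoals, and state transitions in a shared propositional form like this is what lets the framework score fine-grained errors (e.g. hallucinated objects or unsatisfied goal clauses) across all four modules.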
-----
💡 Key Insights:
→ Most LLMs struggle with translating natural language into grounded states
→ Trajectory feasibility errors are common (45.2%), with missing steps (19.5%) and additional steps (14.2%) as the leading causes
→ Performance decreases with longer sequences and more complex environments
→ LLMs hallucinate nonexistent objects and actions and exhibit reporting bias
-----
📊 Results:
→ o1-preview outperforms other models (74.9% vs 64.2% for the next-best model on BEHAVIOR)
→ Claude-3.5 Sonnet excels in goal interpretation on BEHAVIOR
→ Mistral Large performs well in action sequencing on VirtualHome