
"Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making"

The podcast on this paper is generated with Google's Illuminate.

A unified framework to benchmark LLMs' ability to understand and execute physical tasks

https://arxiv.org/abs/2410.07166

🎯 Original Problem:

Evaluating LLMs for embodied decision-making lacks standardization across tasks, modules, and metrics. Current evaluations use different domains, purposes, and input-output specifications, making it hard to pinpoint LLMs' limitations.

-----

🔧 Solution in this Paper:

→ Introduces Embodied Agent Interface - a standardized framework unifying embodied tasks, LLM modules, and evaluation metrics

→ Implements an object-centric state and action representation, using Linear Temporal Logic (LTL) to define task success (a minimal sketch follows this list)

→ Formalizes four key LLM modules: goal interpretation, subgoal decomposition, action sequencing, and transition modeling (see the interface sketch after this list)

→ Provides fine-grained evaluation metrics that pinpoint distinct error types, such as hallucination errors, affordance errors, and planning errors
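
To make the LTL idea concrete, here is a minimal Python sketch (all names are hypothetical, not the paper's code): object-centric states as sets of ground predicates, and a tiny LTL-style goal ("eventually" plus "and") checked against a state trajectory.

```python
# Minimal sketch, assuming hypothetical names: object-centric states are
# sets of ground predicates; a goal is a small LTL formula over them.
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

Predicate = Tuple[str, ...]   # e.g., ("inside", "apple", "fridge")
State = FrozenSet[Predicate]  # one simulator snapshot

@dataclass(frozen=True)
class Atom:
    pred: Predicate
    def holds(self, traj: List[State], t: int) -> bool:
        return self.pred in traj[t]

@dataclass(frozen=True)
class And:
    left: object
    right: object
    def holds(self, traj: List[State], t: int) -> bool:
        return self.left.holds(traj, t) and self.right.holds(traj, t)

@dataclass(frozen=True)
class Eventually:  # LTL "F": holds if the subformula holds at some step >= t
    sub: object
    def holds(self, traj: List[State], t: int) -> bool:
        return any(self.sub.holds(traj, k) for k in range(t, len(traj)))

# Task success: eventually the apple is inside the fridge AND the fridge is closed.
goal = Eventually(And(Atom(("inside", "apple", "fridge")),
                      Atom(("closed", "fridge"))))

traj = [
    frozenset({("open", "fridge")}),
    frozenset({("inside", "apple", "fridge"), ("open", "fridge")}),
    frozenset({("inside", "apple", "fridge"), ("closed", "fridge")}),
]
print(goal.holds(traj, 0))  # True: both predicates hold at the final state
```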
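
And a sketch of how the four modules could be typed over those same object-centric representations; the signatures below are illustrative assumptions, not the benchmark's actual API:

```python
# Illustrative signatures (not the paper's API): the four LLM modules,
# phrased as functions over shared object-centric types.
from typing import List

def goal_interpretation(instruction: str, scene_objects: List[str]) -> "Formula":
    """Ground a natural-language task into an LTL goal over scene objects."""
    ...

def subgoal_decomposition(goal: "Formula", state: "State") -> List["Formula"]:
    """Break a goal into an ordered sequence of intermediate state subgoals."""
    ...

def action_sequencing(goal: "Formula", state: "State") -> List[str]:
    """Produce an executable action sequence (a plan) that achieves the goal."""
    ...

def transition_modeling(state: "State", action: str) -> "State":
    """Predict the next state, i.e., the preconditions/effects of an action."""
    ...
```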

-----

💡 Key Insights:

→ Most LLMs struggle with translating natural language into grounded states

→ Trajectory feasibility errors are common (45.2% of all errors), most often missing steps (19.5%) and additional steps (14.2%)

→ Performance decreases with longer sequences and more complex environments

→ LLMs hallucinate nonexistent objects and actions and exhibit reporting bias

-----

📊 Results:

→ o1-preview outperforms other models (74.9% vs. 64.2% for the next-best model on BEHAVIOR)

→ Claude-3.5 Sonnet excels in goal interpretation on BEHAVIOR

→ Mistral Large performs well in action sequencing on VirtualHome
