
"LMAct: A Benchmark for In-Context Imitation Learning with Long Multimodal Demonstrations"

The podcast on this paper is generated with Google's Illuminate.

LMAct shows current LLMs still can't consistently learn to act from examples, even with hundreds of demonstrations

LMAct introduces a benchmark to test if LLMs can learn decision-making from expert demonstrations. It evaluates models' ability to generalize from multimodal examples in contexts up to 1M tokens across games like chess, Atari, and grid worlds.

-----

https://arxiv.org/abs/2412.01441

🤔 Original Problem

Today's LLMs struggle with simple decision-making tasks even when they have solid factual knowledge about how to solve them, often failing to translate that knowledge into effective actions.

-----

🔧 Solution in this Paper

→ Created a benchmark called LMAct to test LLMs' decision-making abilities using up to 512 expert demonstrations provided in-context (see the prompt-construction sketch after this list)

→ Evaluated 6 frontier models (from the Claude 3.5, Gemini 1.5, GPT-4o, and o1 families) on interactive tasks such as chess, tic-tac-toe, and Atari games

→ Used both text and image state representations to test multimodal capabilities

→ Pushed context lengths up to 1M tokens to test long-context reasoning
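
To make the setup concrete, here is a minimal sketch (not the paper's code) of how an in-context imitation prompt could be assembled from expert demonstrations before querying a model. The `Step` format, the character-based context budget, and the final instruction text are illustrative assumptions, not details from LMAct itself.

```python
# Minimal sketch, assuming a simple text rendering of states and actions.
# The episode format and budget handling below are illustrative only.

from dataclasses import dataclass

@dataclass
class Step:
    observation: str  # text rendering of the state (e.g., an ASCII board)
    action: str       # the expert's action (e.g., "e2e4" or "UP")

def build_prompt(demonstrations: list[list[Step]],
                 current_observation: str,
                 max_chars: int = 400_000) -> str:
    """Concatenate as many full expert episodes as fit, then the current state."""
    header = "You are an expert player. Imitate the demonstrated behavior.\n"
    parts = [header]
    used = len(header)
    for episode in demonstrations:
        block = "".join(
            f"Observation:\n{s.observation}\nAction: {s.action}\n" for s in episode
        )
        if used + len(block) > max_chars:  # crude stand-in for a token budget
            break
        parts.append(block)
        used += len(block)
    parts.append(f"Observation:\n{current_observation}\nAction:")
    return "".join(parts)

if __name__ == "__main__":
    demo = [[Step("X . .\n. O .\n. . .", "place (2,2)")]]
    prompt = build_prompt(demo, "X . .\n. O .\n. . X")
    print(prompt)
    # The prompt would then be sent to a frontier model; its reply is parsed
    # as the next action and executed in the environment.
```

In the benchmark itself, observations can also be images rather than text, and the number of in-context episodes is scaled up to 512 to probe long-context imitation.
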

-----

💡 Key Insights

→ Performance is often independent of the number of demonstrations shown

→ Some models show strong in-context learning on specific tasks like grid world navigation

→ Text representations generally work better than image-based ones

-----

📊 Results

→ Only o1-preview reached near-expert performance on crosswords and tic-tac-toe

→ Most models struggled with Atari games, performing below a random baseline

→ Grid-world navigation saw the best results across models

→ Increasing demonstrations beyond 1-2 episodes rarely improved performance
