LMAct shows current LLMs still can't consistently learn to act from examples, even with hundreds of demonstrations
LMAct introduces a benchmark to test if LLMs can learn decision-making from expert demonstrations. It evaluates models' ability to generalize from multimodal examples in contexts up to 1M tokens across games like chess, Atari, and grid worlds.
-----
https://arxiv.org/abs/2412.01441
🤔 Original Problem
Today's LLMs struggle with simple interactive decision-making tasks even when they have the factual knowledge needed to solve them. They often fail to translate that knowledge into effective actions.
-----
🔧 Solution in this Paper
→ Created a benchmark, LMAct, that tests LLMs' decision-making abilities using up to 512 in-context expert demonstrations (see the evaluation-loop sketch after this list)
→ Evaluated 6 frontier models (Claude 3.5 Sonnet, Gemini 1.5 Flash and Pro, GPT-4o, o1-mini, o1-preview) on interactive tasks such as chess, tic-tac-toe, and Atari games
→ Used both text and image state representations to test multimodal capabilities
→ Pushed context lengths up to 1M tokens to test long-context reasoning
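To make this protocol concrete, here is a minimal Python sketch of the in-context imitation setup, assuming a gym-like environment with text observations and a generic `query_model` helper for the LLM API; all names are hypothetical placeholders, not the paper's actual code.

```python
# Minimal sketch of an LMAct-style in-context evaluation loop (hypothetical, not the authors' code).
# Assumes a gym-like `env` with text observations and a `query_model(prompt) -> str` helper
# that calls an LLM API and returns its raw text reply.

def build_prompt(demonstrations, episode_history, observation):
    """Serialize expert demonstrations, the current episode so far,
    and the latest observation into a single prompt."""
    parts = ["You are playing the game shown below. Reply with one legal action.\n"]
    for i, demo in enumerate(demonstrations):          # e.g. 1 to 512 expert episodes
        parts.append(f"--- Expert demonstration {i + 1} ---")
        for obs, action in demo:
            parts.append(f"Observation:\n{obs}\nAction: {action}")
    parts.append("--- Current episode ---")
    for obs, action in episode_history:
        parts.append(f"Observation:\n{obs}\nAction: {action}")
    parts.append(f"Observation:\n{observation}\nAction:")
    return "\n".join(parts)


def run_episode(env, demonstrations, query_model, max_steps=200):
    """Roll out one episode, letting the LLM pick every action from the prompt."""
    observation, history, total_reward = env.reset(), [], 0.0
    for _ in range(max_steps):
        prompt = build_prompt(demonstrations, history, observation)
        action = query_model(prompt).strip()           # model's proposed move
        next_obs, reward, done = env.step(action)
        history.append((observation, action))
        observation, total_reward = next_obs, total_reward + reward
        if done:
            break
    return total_reward
```

With hundreds of demonstration episodes serialized this way, the prompt can grow toward the 1M-token context lengths the benchmark tests.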
-----
💡 Key Insights
→ No model consistently reached expert performance across tasks, even with hundreds of demonstrations in context
→ Performance was often independent of the number of demonstrations shown
→ Some models show strong in-context learning on specific tasks like grid world navigation
→ Text state representations generally work better than image-based ones (both formats illustrated below)
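As a rough illustration of what "text versus image state representations" can look like, the sketch below renders the same tic-tac-toe board both as ASCII text and as a small image to attach to a multimodal prompt; the helper names and drawing details are assumptions, not the benchmark's actual encoders.

```python
# Hypothetical illustration of text vs. image observations (not the paper's encoders).
from PIL import Image, ImageDraw  # Pillow is only needed for the image variant

def board_to_text(board):
    """Serialize a 3x3 board (rows of 'X', 'O', or '.') as plain ASCII text."""
    return "\n".join(" ".join(row) for row in board)

def board_to_image(board, cell=64):
    """Draw the same board as a small RGB image for a multimodal prompt."""
    img = Image.new("RGB", (3 * cell, 3 * cell), "white")
    draw = ImageDraw.Draw(img)
    for r, row in enumerate(board):
        for c, mark in enumerate(row):
            draw.rectangle([c * cell, r * cell, (c + 1) * cell, (r + 1) * cell], outline="black")
            if mark != ".":
                draw.text((c * cell + cell // 3, r * cell + cell // 3), mark, fill="black")
    return img

board = [["X", ".", "O"],
         [".", "X", "."],
         ["O", ".", "."]]
print(board_to_text(board))        # text observation: compact and token-cheap
image_obs = board_to_image(board)  # image observation: attached to the prompt instead
```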
-----
📊 Results
→ Only o1-preview reached near-expert performance on crosswords and tic-tac-toe
→ Most models struggled with Atari games, performing below random baseline
→ Grid-world navigation produced the strongest results across models
→ Increasing demonstrations beyond 1-2 episodes rarely improved performance