Properly testing RL agent memory requires separating short-term from long-term capabilities.
This paper introduces a systematic way to classify and evaluate different types of memory in Reinforcement Learning agents, addressing the lack of standardized testing methods.
-----
https://arxiv.org/abs/2412.06531
🤖 Original Problem:
→ Current RL research lacks clear definitions for different types of agent memory, leading to incorrect evaluations and comparisons
→ The term "memory" has multiple interpretations across different studies, making it difficult to properly assess agent capabilities
-----
🔍 Solution in this Paper:
→ Introduces formal definitions of long-term memory (LTM) and short-term memory (STM) in RL agents
→ Proposes the Memory Decision-Making (Memory DM) framework to evaluate an agent's ability to use past information
→ Develops a standardized methodology for testing memory capabilities by jointly controlling the environment's correlation horizon and the agent's context length
→ Creates a classification distinguishing declarative memory (recalling information within a single environment/episode) from procedural memory (reusing skills across multiple environments/episodes); a minimal sketch of the resulting decision rule follows
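A minimal Python sketch of the STM/LTM decision rule these definitions imply. The function name, argument names, and exact inequality are my paraphrase of the idea, not the paper's notation:

```python
def classify_memory(context_len: int, recall_gap: int) -> str:
    """Classify which memory type a task exercises for a given agent.

    context_len: K, the number of past steps the agent can attend to.
    recall_gap:  steps between an event and the moment its information
                 is needed (the correlation horizon).
    """
    if recall_gap <= context_len:
        # The event still sits inside the agent's context window:
        # success only demonstrates short-term memory.
        return "STM"
    # The event has left the context window, so the agent must carry
    # the information in some external or recurrent state: LTM.
    return "LTM"

# Example: a transformer with a 512-step context asked to recall an
# observation from 900 steps ago is being tested on long-term memory.
print(classify_memory(context_len=512, recall_gap=900))  # -> "LTM"
```

The same task can therefore test either memory type depending on the agent's context length, which is the crux of the paper's methodology.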
-----
⚡ Key Insights:
→ Which memory type a test validates depends on the relationship between the agent's context length (K) and the environment's correlation horizon
→ Proper memory testing requires controlling both the context length and the correlation horizon, not just task difficulty
→ Current evaluations often conflate LTM and STM because the agent's context is long enough to cover the correlations being tested (see the setup sketch below)
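To make this concrete, here is a hedged sketch of how an experimental setup might pin down which capability is actually measured. `MemoryExperiment` and its field names are hypothetical, not the paper's API:

```python
from dataclasses import dataclass

@dataclass
class MemoryExperiment:
    """Hypothetical experiment config: fix the environment's correlation
    horizon, then choose the agent's context length to isolate the
    memory type under test."""
    correlation_horizon: int  # event-to-recall distance in steps
    context_len: int          # K: agent's attention/context window

    def tested_capability(self) -> str:
        if self.context_len >= self.correlation_horizon:
            return "STM"  # events never leave the context window
        return "LTM"      # information must survive beyond the window

# To probe LTM, deliberately set K below the correlation horizon so
# the agent cannot solve the task by attending to raw context alone.
ltm_test = MemoryExperiment(correlation_horizon=1000, context_len=128)
stm_test = MemoryExperiment(correlation_horizon=100, context_len=512)
print(ltm_test.tested_capability(), stm_test.tested_capability())  # LTM STM
```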
-----
📊 Results:
→ Demonstrated that naive testing setups can lead to a 50% performance drop on memory tasks
→ Showed that transformer-based agents achieve a near-100% success rate on STM tasks but fail in true LTM scenarios