AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios
This paper evaluates LLMs' social abilities through complex multi-agent interactions and private information handling
It finds that LLMs struggle with complex social scenarios and high-level growth goals
🤖 Original Problem:
Evaluating social intelligence of LLMs in complex interactions remains challenging. Current benchmarks lack scenario diversity, oversimplify real interactions, and focus only on explicit goal achievement without considering private information handling.
🔧 Solution in this Paper:
• Builds AgentSense: a benchmark with 1,225 diverse social scenarios extracted from scripts using a bottom-up approach
• Uses Dramaturgical Theory to create realistic social interactions
• Evaluates both goal completion and implicit reasoning abilities
• Implements multi-turn conversations between agents with private information
• Measures performance through interviews and multiple-choice questions
• Introduces Profile Sensitivity Index (PSI) to assess stability across different character profiles
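This summary does not give the PSI formula, so the snippet below is only a rough sketch of the idea of "stability across character profiles". It assumes (this is an assumption, not the paper's definition) that sensitivity can be approximated as the standard deviation of a model's scores across profiles for the same scenario template, averaged over templates. The function name and input layout are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List


def profile_sensitivity(scores_by_template: Dict[str, List[float]]) -> float:
    """Illustrative stand-in for a profile-sensitivity score.

    scores_by_template maps a scenario template id to the scores one model
    obtained when the same template was played with different character
    profiles. ASSUMPTION: spread is measured as population standard deviation;
    the paper's actual PSI may be defined differently.
    """
    spreads = [
        pstdev(scores)
        for scores in scores_by_template.values()
        if len(scores) > 1  # a single profile carries no spread information
    ]
    return mean(spreads) if spreads else 0.0


if __name__ == "__main__":
    demo = {
        "template_a": [0.8, 0.8, 0.8],  # identical scores across profiles
        "template_b": [0.5, 0.9],       # scores that shift with the profile
    }
    print(profile_sensitivity(demo))
```

Under this reading, a lower value means the model behaves more consistently when only the character profile changes.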
💡 Key Insights:
• Being a "sender" (actively sharing information) is more challenging than being a "receiver"
• Models handle relationship management and cooperation better than competition
• Even GPT-4 struggles with balancing goal achievement and private information protection
• Social intelligence varies significantly based on character profiles
📊 Results:
• GPT-4 leads overall performance but still needs improvement in private information reasoning
• Qwen2.5-14b shows strong performance in both goal completion and information reasoning
• Llama-2 series models perform poorly, with some improvement in Llama-3 series
• PSI results show that models with higher social intelligence are less sensitive to profile changes
• Models achieve 88.36% goal completion rate and 76.86% information reasoning accuracy
🎠 The key components and methodology of AgentSense
Scenario Construction: Extracts templates from scripts and synthesizes diverse characters to create scenarios
Social Interaction Simulation: Agents interact through multi-turn conversations trying to achieve social goals while protecting private information
Evaluation: Uses interviews and multiple-choice questions to assess goal completion and information reasoning abilities.
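The simulation and evaluation steps above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the `Agent` class, the round-robin turn order, the `scripted` helper (a stand-in for a real LLM call), and `mc_accuracy` are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (speaker name, utterance) pairs


@dataclass
class Agent:
    name: str
    goal: str          # the social goal the agent tries to achieve
    private_info: str  # information the agent should protect
    respond: Callable[["Agent", History], str]  # stand-in for an LLM call


def scripted(lines: List[str]) -> Callable[[Agent, History], str]:
    """Return a responder that replays fixed lines (replaces a real model)."""
    remaining = iter(lines)
    return lambda agent, history: next(remaining, "(stays silent)")


def simulate(agents: List[Agent], max_turns: int = 6) -> History:
    """Run a multi-turn conversation; agents speak in round-robin order."""
    history: History = []
    for turn in range(max_turns):
        speaker = agents[turn % len(agents)]
        history.append((speaker.name, speaker.respond(speaker, history)))
    return history


def mc_accuracy(answers: List[str], gold: List[str]) -> float:
    """Fraction of multiple-choice questions answered correctly."""
    if not gold:
        return 0.0
    return sum(a == g for a, g in zip(answers, gold)) / len(gold)


if __name__ == "__main__":
    alice = Agent("Alice", "learn Bob's plan", "knows the budget",
                  scripted(["Hi Bob.", "What is the plan?"]))
    bob = Agent("Bob", "keep the plan vague", "the plan is cancelled",
                scripted(["Hello.", "Still deciding."]))
    for speaker, text in simulate([alice, bob], max_turns=4):
        print(f"{speaker}: {text}")
```

In a real run, `respond` would prompt a model with the agent's profile, goal, and private information plus the conversation so far, and the multiple-choice answers would come from questioning the agents after the dialogue ends.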
🎯 The types of social goals and scenarios covered
Personal domain (54%): Home, private gatherings, intimate settings
Small society (37%): Schools, workplaces, communities
Large society (9%): Public spaces, online platforms
Social goals are categorized using ERG theory into:
Existence needs: Information exchange
Relatedness needs: Building/maintaining relationships
Growth needs: Cooperation, competition, conflict resolution.
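As a small illustration of the ERG grouping above, the sketch below maps goal types to need categories and aggregates goal completion per category. The goal-type strings and the `completion_by_need` helper are hypothetical labels chosen for this example, not identifiers from the benchmark.

```python
from collections import defaultdict
from enum import Enum
from typing import Dict, Iterable, Tuple


class ERGNeed(Enum):
    EXISTENCE = "existence"
    RELATEDNESS = "relatedness"
    GROWTH = "growth"


# Mapping taken from the categories listed above; key strings are illustrative.
GOAL_TO_NEED: Dict[str, ERGNeed] = {
    "information exchange": ERGNeed.EXISTENCE,
    "relationship building": ERGNeed.RELATEDNESS,
    "relationship maintaining": ERGNeed.RELATEDNESS,
    "cooperation": ERGNeed.GROWTH,
    "competition": ERGNeed.GROWTH,
    "conflict resolution": ERGNeed.GROWTH,
}


def completion_by_need(
    results: Iterable[Tuple[str, bool]]
) -> Dict[ERGNeed, float]:
    """Aggregate (goal_type, achieved) pairs into a completion rate per need."""
    totals = defaultdict(lambda: [0, 0])  # need -> [achieved, attempted]
    for goal_type, achieved in results:
        need = GOAL_TO_NEED[goal_type]
        totals[need][0] += int(achieved)
        totals[need][1] += 1
    return {need: done / count for need, (done, count) in totals.items()}


if __name__ == "__main__":
    demo = [
        ("cooperation", True),
        ("competition", False),
        ("information exchange", True),
    ]
    for need, rate in completion_by_need(demo).items():
        print(need.value, rate)
```

Reporting results per need category like this is one way to surface the pattern noted earlier, where growth goals are harder than the other two groups.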