Foundation models demonstrate human-like affective cognition across diverse emotional reasoning tasks.
LLMs show a sophisticated grasp of emotional dynamics in social situations.
📚 https://arxiv.org/pdf/2409.11733
Original Problem 👀:
Evaluating affective cognition (the understanding of emotions) in foundation models relative to humans is challenging. Existing evaluations lack systematic benchmarking across the different types of affective inferences.
-----
Solution in this Paper 🔬:
• Introduces evaluation framework based on psychological theory of emotions
• Generates 1,280 diverse scenarios exploring relationships between appraisals, emotions, expressions, and outcomes
• Uses a causal template to systematically vary stimuli and test different inferences (see the sketch after this list)
• Compares model performance (GPT-4, Claude-3, Gemini-1.5-Pro) to human judgments across carefully selected conditions
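
To make the causal-template idea concrete, here is a minimal Python sketch: enumerate combinations of appraisal values and fill a text template. The dimension names, template wording, and the `make_scenario` helper are illustrative assumptions, not the paper's actual code.

```python
import itertools
import random

# Hypothetical appraisal dimensions and values; the paper derives its
# dimensions from appraisal theory, and the labels here are illustrative.
APPRAISALS = {
    "goal":         ["achieved", "blocked"],
    "safety":       ["safe", "threatened"],
    "expectedness": ["expected", "surprising"],
    "control":      ["self-caused", "other-caused"],
}

# A toy causal template linking appraisals to an emotion to be inferred.
TEMPLATE = (
    "{name} wanted {goal_desc}. In the end, the goal was {goal}. "
    "The situation was {safety}, {expectedness}, and {control}. "
    "{name} felt {emotion_blank}."
)

def make_scenario(name: str, appraisal: dict) -> str:
    """Fill the template for one combination of appraisal values."""
    return TEMPLATE.format(
        name=name,
        goal_desc="to pass the exam",  # placeholder goal description
        emotion_blank="___",           # the emotion to be inferred
        **appraisal,
    )

# Full cross of appraisal values (2^4 = 16 combinations in this toy
# version); the paper samples 1,280 diverse scenarios.
combos = [dict(zip(APPRAISALS, vals))
          for vals in itertools.product(*APPRAISALS.values())]
for combo in random.sample(combos, 3):
    print(make_scenario("Alex", combo))
```

Varying one appraisal dimension at a time while holding the others fixed is what lets the framework test each type of inference (emotions from appraisals, appraisals from emotions or outcomes, and so on) in isolation.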
-----
Key Insights from this Paper 💡:
• Foundation models match or exceed human-level performance on many affective reasoning tasks
• Models benefit from chain-of-thought prompting, which improves their affective judgments (see the prompt sketch after this list)
• Some appraisal dimensions (e.g., goals) are more salient than others, for both humans and models
• Models can integrate information from outcomes, appraisals, emotions, and facial expressions
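
As a rough illustration of the chain-of-thought manipulation, the sketch below builds a direct prompt and a CoT variant for the emotion-inference task. The wording is an assumption; the paper's actual prompts may differ.

```python
def emotion_inference_prompt(scenario: str, chain_of_thought: bool) -> str:
    """Build a direct or chain-of-thought prompt for emotion inference.
    Prompt wording is illustrative, not the paper's exact text."""
    base = (
        f"Read the following scenario:\n\n{scenario}\n\n"
        "Which emotion is the protagonist most likely feeling?"
    )
    if chain_of_thought:
        # CoT variant: reason over the appraisal dimensions first,
        # then commit to a single answer.
        return base + (
            " First, think step by step about how the protagonist "
            "appraised the situation (goals, safety, expectedness, "
            "control). Then state the single most likely emotion."
        )
    return base + " Answer with a single emotion word."
```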
-----
Results 📊:
• Model-participant agreement matches or exceeds interparticipant agreement on many tasks (see the agreement sketch below)
• "Superhuman" performance on some tasks: e.g., Claude-3 with CoT reaches 78.82% agreement on emotion inference vs. 69.38% for humans
• Chain-of-thought improves performance, e.g., lifting GPT-4's goal-inference agreement from 71.14% to 88.61%
• Models struggle more with safety-appraisal inference (61.07% agreement) than with goal inference (88.61%)
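
For reference, one plausible way to compute agreement numbers like those above: average, per item, the fraction of human participants who gave the same answer as the model, and compare that to how well each participant agrees with the rest. This is a sketch of a natural metric; the paper's exact definition may differ.

```python
def model_participant_agreement(model_answers: list[str],
                                human_answers: list[list[str]]) -> float:
    """Mean over items of the fraction of participants matching the model.
    `human_answers[i]` holds all participants' answers for item i."""
    per_item = [sum(h == m for h in humans) / len(humans)
                for m, humans in zip(model_answers, human_answers)]
    return 100 * sum(per_item) / len(per_item)

def interparticipant_agreement(human_answers: list[list[str]]) -> float:
    """Mean agreement of each participant with every other participant,
    averaged over items: the human baseline models are compared to."""
    n = len(human_answers[0])
    scores = []
    for p in range(n):
        per_item = [
            sum(h == humans[p] for j, h in enumerate(humans) if j != p)
            / (n - 1)
            for humans in human_answers
        ]
        scores.append(sum(per_item) / len(per_item))
    return 100 * sum(scores) / len(scores)

# Toy usage: two items, three participants.
humans = [["joy", "joy", "anger"], ["fear", "fear", "fear"]]
print(model_participant_agreement(["joy", "fear"], humans))  # ~83.3
print(interparticipant_agreement(humans))                    # ~66.7
```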