A framework that sees images and reasons about social relationships like humans do.
SocialGPT combines vision models and LLMs to understand relationships in images without training.
📚 https://arxiv.org/abs/2410.21411
🎯 Original Problem:
Social relation recognition from images has traditionally relied on end-to-end neural networks trained on labeled datasets, which limits both generalizability and interpretability.
-----
🔧 Solution in this Paper:
→ Introduces SocialGPT - a modular framework combining Vision Foundation Models (VFMs) and LLMs
→ Uses SAM for object segmentation and BLIP-2 for generating dense captions about social scenes
→ Implements symbol-based object referencing for clear communication between components
→ Employs a structured SocialPrompt with system, expectation, context, and guidance segments (a code sketch of the pipeline follows this list)
→ Introduces Greedy Segment Prompt Optimization (GSPO) to improve prompt effectiveness
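A minimal sketch of how the perception stage, symbolic referencing, and SocialPrompt assembly might fit together. `run_sam` and `blip2_caption` are hypothetical placeholder wrappers (not the paper's code or any real library API), and the segment texts are illustrative:

```python
# Hypothetical sketch of the SocialGPT perception -> prompt pipeline.
# run_sam / blip2_caption are placeholders, not real library calls.

def run_sam(image):
    """Placeholder: return person/object masks, e.g. from SAM."""
    raise NotImplementedError

def blip2_caption(image, mask, question):
    """Placeholder: BLIP-2-style dense caption for one masked region."""
    raise NotImplementedError

def build_social_story(image):
    """Turn perception outputs into a textual 'story' with symbol references."""
    masks = run_sam(image)
    lines = []
    for i, mask in enumerate(masks):
        # Symbol-based referencing: each entity gets a stable symbol
        # (P1, P2, ... for people; O1, O2, ... for objects) so the LLM
        # can refer to visual entities unambiguously in text.
        symbol = f"P{i + 1}" if mask["category"] == "person" else f"O{i + 1}"
        caption = blip2_caption(image, mask, "Describe this region in detail.")
        lines.append(f"{symbol}: {caption}")
    return "\n".join(lines)

def build_social_prompt(story, pair=("P1", "P2")):
    """Assemble the four SocialPrompt segments (texts here are illustrative)."""
    return "\n\n".join([
        "SYSTEM: You are an expert at reasoning about social relations.",
        "EXPECTATION: Answer with exactly one relationship label.",
        f"CONTEXT:\n{story}",
        f"GUIDANCE: Reason step by step about {pair[0]} and {pair[1]}, "
        "then state the final label.",
    ])
```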
-----
💡 Key Insights:
→ Breaking down visual social reasoning into perception and reasoning phases improves interpretability
→ Symbol-based referencing bridges the gap between visual and textual information effectively
→ Structured prompts with clear segments enhance LLM reasoning capabilities
→ GSPO addresses the long-prompt optimization challenge by greedily optimizing one prompt segment at a time (see the sketch after this list)
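A simplified illustration of the segment-level greedy idea. The paper's GSPO uses gradient information to guide the search; this sketch instead assumes a precomputed candidate pool per segment and a scoring function (e.g. accuracy on a small dev set), so treat it as a conceptual stand-in rather than the paper's exact procedure:

```python
def gspo_greedy(segments, candidates, score):
    """Greedy segment-level prompt search (simplified illustration).

    segments:   list[str], one text block per prompt segment
    candidates: dict[int, list[str]], alternative texts per segment index
    score:      callable(list[str]) -> float, e.g. dev-set accuracy
    """
    best = list(segments)
    best_score = score(best)
    for i in range(len(best)):
        # Optimize one segment at a time, keeping all others fixed.
        for cand in candidates.get(i, []):
            trial = best[:i] + [cand] + best[i + 1:]
            s = score(trial)
            if s > best_score:
                best, best_score = trial, s
    return best, best_score
```

Working segment by segment keeps the search tractable: the full prompt is long, but each greedy step only varies one short segment.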
-----
📊 Results:
→ Achieves 66.7% zero-shot accuracy on the PIPA dataset without any training
→ Outperforms GPT-4V by 7.03% in accuracy
→ GSPO improves performance by 2.53% on PIPA and 1.07% on PISC
→ Shows strong generalization to novel image styles like sketches and cartoons