"SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization"

This podcast on the paper was generated with Google's Illuminate.

A framework that sees images and reasons about social relationships like humans do.

SocialGPT combines vision models and LLMs to understand relationships in images without training.

📚 https://arxiv.org/abs/2410.21411

🎯 Original Problem:

Traditional end-to-end neural networks for social relation recognition from images are trained on labeled datasets and suffer from limited generalizability and interpretability.

-----

🔧 Solution in this Paper:

→ Introduces SocialGPT - a modular framework combining Vision Foundation Models (VFMs) and LLMs

→ Uses SAM for object segmentation and BLIP-2 for generating dense captions about social scenes (see the sketch after this list)

→ Implements symbol-based object referencing for clear communication between components

→ Employs structured SocialPrompt with system rules, expectations, context, and guidance

→ Introduces Greedy Segment Prompt Optimization (GSPO) to improve prompt effectiveness
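
A minimal sketch of the perception phase referenced above, assuming the `segment-anything` and Hugging Face `transformers` packages. The file path, checkpoint name, BLIP-2 variant, and the O1/O2 symbol scheme are illustrative assumptions, not the paper's exact configuration:

```python
# Hedged sketch of SocialGPT's perception phase: SAM proposes regions,
# each region is bound to a text symbol, and BLIP-2 captions the
# symbol's crop so the LLM can reason over a purely textual story.
# Assumes: pip install segment-anything transformers torch pillow
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator
from transformers import Blip2Processor, Blip2ForConditionalGeneration

image = np.array(Image.open("scene.jpg").convert("RGB"))  # placeholder path

# 1. Segmentation: SAM generates masks for everything in the image.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
masks = SamAutomaticMaskGenerator(sam).generate(image)

# 2. Symbol-based referencing: bind each region to a symbol (O1, O2, ...).
#    The exact symbol scheme here is an illustrative assumption.
regions = {f"O{i + 1}": m["bbox"] for i, m in enumerate(masks)}  # XYWH boxes

# 3. Dense captioning: BLIP-2 describes each region and the full scene.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip2 = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

def caption(img: np.ndarray) -> str:
    inputs = processor(images=Image.fromarray(img), return_tensors="pt")
    out = blip2.generate(**inputs, max_new_tokens=40)
    return processor.decode(out[0], skip_special_tokens=True).strip()

story = [f"Overall scene: {caption(image)}"]
for sym, box in regions.items():
    x, y, w, h = map(int, box)
    story.append(f"{sym}: {caption(image[y:y + h, x:x + w])}")

# This symbol-grounded text is what the reasoning-phase LLM receives.
print("\n".join(story))
```

Because every caption is tied to a symbol, the downstream LLM can be asked about the relation between, say, O1 and O2 without ever touching pixels, which is where the framework's interpretability comes from.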

-----

💡 Key Insights:

→ Breaking down visual social reasoning into perception and reasoning phases improves interpretability

→ Symbol-based referencing bridges the gap between visual and textual information effectively

→ Structured prompts with clear segments enhance LLM reasoning capabilities

→ GSPO addresses the long prompt optimization challenge through segment-level optimization
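
A hedged sketch of the GSPO idea as summarized here: the long SocialPrompt is split into its named segments (system, expectation, context, guidance), and each segment is greedily swapped for whichever candidate variant scores best while the other segments stay fixed. The candidate pool and the scoring function are stand-in assumptions; the paper's actual search procedure may differ in detail:

```python
# Illustrative segment-level greedy prompt optimization.
# `candidates` maps each SocialPrompt segment to alternative phrasings
# (hypothetical), and `score` is any accuracy metric evaluated on a
# small labeled tuning set; both are assumptions for illustration.
from typing import Callable, Dict, List

SEGMENTS = ["system", "expectation", "context", "guidance"]

def gspo(prompt: Dict[str, str],
         candidates: Dict[str, List[str]],
         score: Callable[[Dict[str, str]], float]) -> Dict[str, str]:
    """Greedily optimize one prompt segment at a time."""
    best = dict(prompt)
    best_score = score(best)
    for seg in SEGMENTS:
        for variant in candidates.get(seg, []):
            trial = {**best, seg: variant}
            trial_score = score(trial)
            if trial_score > best_score:  # keep only improving variants
                best, best_score = trial, trial_score
    return best
```

Optimizing one segment at a time keeps the search cost linear in the number of candidates per segment, rather than combinatorial over the whole long prompt, which is exactly the long-prompt optimization challenge GSPO targets.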

-----

📊 Results:

→ Achieves 66.7% zero-shot accuracy on the PIPA dataset without any training

→ Outperforms GPT-4V by 7.03% in accuracy

→ GSPO improves performance by 2.53% on PIPA and 1.07% on PISC

→ Shows strong generalization to novel image styles like sketches and cartoons
