This new dataset turns Friends episodes into a training ground for more social AI systems
The Friends-MMC dataset enables AI to understand multi-person conversations by connecting visual context with dialogue, making chatbots more natural in group settings.
-----
https://arxiv.org/abs/2412.17295
🤔 Original Problem:
Current AI dialogue systems struggle with group conversations because they can't connect speakers with their visual presence and context. Most existing datasets cover only two-party chats, or cast speakers as outside observers commenting on an image rather than participants in the scene.
-----
🛠️ Solution in this Paper:
→ Created the Friends-MMC dataset: 24,000+ unique utterances from the Friends TV series, each paired with a video clip, face annotations, and speaker labels.
→ Developed a three-part system: a visual model detects which face is talking, a text model scores who would say each utterance, and an optimizer combines both signals (see the fusion sketch after this list).
→ Used sliding windows over episode transcripts to capture natural conversation flow, with either 5 or 8 turns per dialogue session (segmentation sketched below).
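A minimal sketch of the fusion idea in Python: per-turn scores from a visual talking-face model and a text-based speaker model are blended with a weighted sum, and the best-scoring face is picked. The paper's optimizer assigns speakers jointly across the whole session; this per-turn weighted argmax, the `alpha` weight, and the toy score values are all simplified placeholders, not the paper's method.

```python
import numpy as np

def identify_speakers(visual_scores: np.ndarray,
                      text_scores: np.ndarray,
                      alpha: float = 0.5) -> list:
    """Fuse per-turn visual and textual speaker scores.

    visual_scores: (turns, faces) talking-face probabilities per turn.
    text_scores:   (turns, faces) text-based speaker affinities per turn.
    Returns, for each turn, the index of the face predicted to be speaking.
    """
    fused = alpha * visual_scores + (1 - alpha) * text_scores
    return fused.argmax(axis=1).tolist()

# Toy example: 3 turns, 2 candidate faces on screen.
visual = np.array([[0.9, 0.1], [0.3, 0.7], [0.2, 0.8]])
text = np.array([[0.7, 0.3], [0.6, 0.4], [0.1, 0.9]])
print(identify_speakers(visual, text))  # [0, 1, 1]
```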
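And a sketch of the sliding-window segmentation, assuming a transcript is a list of (speaker, utterance) pairs. Only the 5- and 8-turn window sizes come from the paper; the stride is a hypothetical choice.

```python
from typing import List, Tuple

Turn = Tuple[str, str]  # (speaker, utterance)

def sliding_sessions(transcript: List[Turn],
                     window: int = 5,
                     stride: int = 1) -> List[List[Turn]]:
    """Slice a transcript into overlapping sessions of `window` turns."""
    return [transcript[i:i + window]
            for i in range(0, len(transcript) - window + 1, stride)]

# Toy example: an 8-turn scene yields four 5-turn sessions at stride 1.
scene = [(f"speaker_{i % 3}", f"utterance {i}") for i in range(8)]
print(len(sliding_sessions(scene, window=5)))  # 4
```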
-----
💡 Key Insights:
→ Visual context is crucial but not sufficient: combining it with text analysis improves speaker identification accuracy by 2-5%
→ Adding speaker information significantly improves response prediction accuracy (see the sketch after this list)
→ Traditional multi-modal models struggle with this task, showing the need for specialized architectures
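To illustrate the second insight, a minimal sketch of injecting speaker labels into a response-prediction input, assuming a text model that scores (context, candidate) pairs. The `[SEP]` separator and the `spk: utt` formatting are hypothetical, not the paper's exact input format.

```python
def build_context(turns, with_speakers=True):
    """Flatten dialogue turns into a single encoder input string."""
    if with_speakers:
        return " [SEP] ".join(f"{spk}: {utt}" for spk, utt in turns)
    return " [SEP] ".join(utt for _, utt in turns)

turns = [
    ("Ross", "We were on a break!"),
    ("Rachel", "That is not the point."),
]
print(build_context(turns))
# Ross: We were on a break! [SEP] Rachel: That is not the point.
```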
-----
📊 Results:
→ Achieved 83.21% accuracy in speaker identification with video context
→ Improved response prediction accuracy from 30.69% to 36.89% by adding speaker information
→ Significantly outperformed GPT-4 (66.36%) and other pre-trained models