"Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding"

A podcast on this paper, generated with Google's Illuminate, is embedded below.

This new dataset turns Friends episodes into a training ground for more socially aware AI systems.

The Friends-MMC dataset enables AI to understand multi-person conversations by connecting visual context with dialogue, making chatbots more natural in group settings.

-----

https://arxiv.org/abs/2412.17295

🤔 Original Problem:

Current AI dialogue systems struggle with group conversations because they can't connect speakers with their visual presence and context. Most datasets only handle two-person chats or treat speakers as outsiders commenting on images.

-----

🛠️ Solution in this Paper:

→ Created the Friends-MMC dataset with 24,000+ unique utterances from the Friends TV series, each paired with video clips, detected faces, and speaker labels.

→ Developed a three-part system: a visual model detects the speaking face in each clip, a text model analyzes speaker relationships across turns, and an optimizer combines both signals (see the sketch after this list).

→ Used sliding windows over episode transcripts to capture natural conversation flow, building dialogue sessions of either 5 or 8 turns (a second sketch below illustrates this).
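
To make the combination step concrete, here is a minimal sketch, assuming per-turn visual speaking scores keyed by character identity and a text-model score for how likely two turns share a speaker. The paper combines the two signals with an optimization solver; the brute-force search, function names, and `alpha` weight below are illustrative stand-ins that are only feasible for short 5-8 turn sessions.

```python
import itertools

def identify_speakers(visual_scores, text_sim, alpha=1.0):
    """Pick one candidate speaker per turn by combining two signals.

    visual_scores: list over turns; visual_scores[t] maps a character id
                   to the visual model's "this face is speaking" score.
    text_sim:      text_sim[i][j] is the text model's score that turns
                   i and j were uttered by the same speaker.
    alpha:         weight of the textual signal relative to the visual one.
    """
    n_turns = len(visual_scores)
    candidates = [list(scores.keys()) for scores in visual_scores]

    best_assign, best_score = None, float("-inf")
    # Enumerate every joint assignment of a visible character to each turn.
    for assign in itertools.product(*candidates):
        # Visual term: reward picking faces the visual model thinks are speaking.
        score = sum(visual_scores[t][assign[t]] for t in range(n_turns))
        # Textual term: reward assigning the same character to turn pairs
        # the text model believes share a speaker.
        for i in range(n_turns):
            for j in range(i + 1, n_turns):
                if assign[i] == assign[j]:
                    score += alpha * text_sim[i][j]
        if score > best_score:
            best_assign, best_score = assign, score
    return best_assign
```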

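And a minimal sketch of the sliding-window session construction, assuming the transcript is already a list of (speaker, utterance) pairs; names here are hypothetical, and the real dataset additionally pairs every turn with its video clip and annotated faces.

```python
def build_sessions(transcript, turns_per_session=5):
    """Slice an episode transcript into overlapping dialogue sessions."""
    return [
        transcript[start:start + turns_per_session]
        for start in range(len(transcript) - turns_per_session + 1)
    ]

# Example: 6 turns with a 5-turn window yield 2 overlapping sessions
# (the dataset also uses 8-turn sessions).
transcript = [
    ("Ross", "We were on a break!"),
    ("Rachel", "Oh, here we go."),
    ("Chandler", "Could this BE any more awkward?"),
    ("Monica", "Okay, everybody calm down."),
    ("Joey", "How you doin'?"),
    ("Phoebe", "I wish I could, but I don't want to."),
]
print(len(build_sessions(transcript)))  # -> 2
```
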
-----

💡 Key Insights:

→ Visual context is crucial but not sufficient; combining it with text analysis improves speaker identification accuracy by 2-5%

→ Speaker information significantly improves response prediction accuracy

→ Traditional multi-modal models struggle with this task, showing the need for specialized architectures

-----

📊 Results:

→ Achieved 83.21% accuracy in speaker identification with video context

→ Improved response prediction accuracy from 30.69% to 36.89% by adding speaker information

→ Significantly outperformed GPT-4 (66.36%) and other pre-trained models
