"Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding"

A podcast on this paper, generated with Google's Illuminate, is embedded below.

This new dataset turns Friends episodes into a training ground for more socially aware AI systems.

The Friends-MMC dataset enables AI to understand multi-person conversations by connecting visual context with dialogue, making chatbots more natural in group settings.

-----

https://arxiv.org/abs/2412.17295

🤔 Original Problem:

Current AI dialogue systems struggle with group conversations because they can't connect speakers with their visual presence and context. Most datasets only handle two-person chats or treat speakers as outsiders commenting on images.

-----

🛠️ Solution in this Paper:

→ Created the Friends-MMC dataset with 24,000+ unique utterances from the Friends TV series, each paired with video clips, detected faces, and speaker labels.

→ Developed a three-part system: a visual model detects the speaking face in each clip, a text model analyzes speaker relationships across turns, and an optimizer combines both signals (see the sketch after this list).

→ Used sliding windows over episode transcripts to capture natural conversation flow, building dialogue sessions of either 5 or 8 turns (a second sketch below illustrates this).
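
To make the combination step concrete, here is a minimal sketch, assuming per-turn visual speaking scores keyed by character identity and a text-model score for how likely two turns share a speaker. The paper combines the two signals with an optimization solver; the brute-force search, function names, and `alpha` weight below are illustrative stand-ins that are only feasible for short 5-8 turn sessions.

```python
import itertools

def identify_speakers(visual_scores, text_sim, alpha=1.0):
    """Pick one candidate speaker per turn by combining two signals.

    visual_scores: list over turns; visual_scores[t] maps a character id
                   to the visual model's "this face is speaking" score.
    text_sim:      text_sim[i][j] is the text model's score that turns
                   i and j were uttered by the same speaker.
    alpha:         weight of the textual signal relative to the visual one.
    """
    n_turns = len(visual_scores)
    candidates = [list(scores.keys()) for scores in visual_scores]

    best_assign, best_score = None, float("-inf")
    # Enumerate every joint assignment of a visible character to each turn.
    for assign in itertools.product(*candidates):
        # Visual term: reward picking faces the visual model thinks are speaking.
        score = sum(visual_scores[t][assign[t]] for t in range(n_turns))
        # Textual term: reward assigning the same character to turn pairs
        # the text model believes share a speaker.
        for i in range(n_turns):
            for j in range(i + 1, n_turns):
                if assign[i] == assign[j]:
                    score += alpha * text_sim[i][j]
        if score > best_score:
            best_assign, best_score = assign, score
    return best_assign
```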

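And a minimal sketch of the sliding-window session construction, assuming the transcript is already a list of (speaker, utterance) pairs; names here are hypothetical, and the real dataset additionally pairs every turn with its video clip and annotated faces.

```python
def build_sessions(transcript, turns_per_session=5):
    """Slice an episode transcript into overlapping dialogue sessions."""
    return [
        transcript[start:start + turns_per_session]
        for start in range(len(transcript) - turns_per_session + 1)
    ]

# Example: 6 turns with a 5-turn window yield 2 overlapping sessions
# (the dataset also uses 8-turn sessions).
transcript = [
    ("Ross", "We were on a break!"),
    ("Rachel", "Oh, here we go."),
    ("Chandler", "Could this BE any more awkward?"),
    ("Monica", "Okay, everybody calm down."),
    ("Joey", "How you doin'?"),
    ("Phoebe", "I wish I could, but I don't want to."),
]
print(len(build_sessions(transcript)))  # -> 2
```
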
-----

💡 Key Insights:

→ Visual context is crucial but not sufficient; combining it with text analysis improves speaker identification accuracy by 2-5%

→ Speaker information significantly improves response prediction accuracy

→ Traditional multi-modal models struggle with this task, showing the need for specialized architectures

-----

📊 Results:

→ Achieved 83.21% accuracy in speaker identification with video context

→ Improved response prediction accuracy from 30.69% to 36.89% by adding speaker information

→ Significantly outperformed GPT-4 (66.36%) and other pre-trained models
