Grounding Partially-Defined Events in Multimodal Data

This podcast was generated with Google's Illuminate.

Rohan Paul

Dec 27, 2024

AI graduates from watching full movies to understanding movie trailers. 💡

When AI sees half a story in a video, it can now guess the whole picture. At least, that is what this paper sets out to achieve.

A system that helps AI understand incomplete video stories using text, time, and spatial data clues.

📌 As humans, we can learn about complex current events from just short snippets of video. AI should be able to do the same.

While natural language enables straightforward ways to represent under-specified, partially observable events, visual data does not facilitate analogous methods and, consequently, introduces unique challenges in event understanding.

https://arxiv.org/abs/2410.05267

Original Problem 🔍:

Vision-capable AI agents struggle to model complex events from unstructured video data, especially when events are only partially depicted.

-----

Solution in this Paper 🧩:

• Introduces a multimodal formulation for partially-defined events

• Proposes a three-stage span retrieval task:

1. Text span retrieval

2. Temporal span retrieval

3. Spatial span retrieval

• Develops MultiVENT-G benchmark: 14.5 hours of annotated videos, 1,168 text documents, 22.8K labeled event-centric entities

• Evaluates LLM-driven approaches on MultiVENT-G
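
To make the three span types concrete, here is a minimal Python sketch of how one grounded event entity might be stored. Every class and field name here (`TextSpan`, `TemporalSpan`, `SpatialSpan`, `EventGrounding`, `role`) is a hypothetical chosen for illustration. This is not the MultiVENT-G annotation schema, which this post does not describe.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TextSpan:
    """A character range inside one text document (hypothetical layout)."""
    doc_id: str
    start_char: int  # inclusive
    end_char: int    # exclusive


@dataclass
class TemporalSpan:
    """A time interval inside one video, in seconds (hypothetical layout)."""
    video_id: str
    start_sec: float
    end_sec: float


@dataclass
class SpatialSpan:
    """A bounding box on one video frame (hypothetical layout)."""
    video_id: str
    time_sec: float
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass
class EventGrounding:
    """All evidence collected for one event-centric entity."""
    role: str  # assumed label such as "location"; not taken from the paper
    text: List[TextSpan] = field(default_factory=list)
    temporal: List[TemporalSpan] = field(default_factory=list)
    spatial: List[SpatialSpan] = field(default_factory=list)

    def modalities(self) -> List[str]:
        """Return which of the three span types have any evidence."""
        found = []
        if self.text:
            found.append("text")
        if self.temporal:
            found.append("temporal")
        if self.spatial:
            found.append("spatial")
        return found


# A partially-defined event: evidence exists in text and time, but no box yet.
grounding = EventGrounding(role="location")
grounding.text.append(TextSpan("doc-1", 10, 25))
grounding.temporal.append(TemporalSpan("vid-1", 3.0, 7.5))
print(grounding.modalities())
```

Keeping the three lists separate lets an entity be grounded in only some modalities, which is what "partially-defined" means here.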

-----

Key Insights from this Paper 💡:

• Partially-defined events require reasoning over noisy, multimodal data

• Event modeling systems must handle varying levels of event complexity

• Multimodal data presents unique challenges in event understanding

• LLM-driven approaches show promise in tackling complex event extraction

-----

Results 📊:

• LLMs perform well on text evidence retrieval: GPT-4o achieves 67.2% F1 score

• VideoLLMs adapt to temporal retrieval: TimeChat-Charades reaches 33.36% F1 score

• VLM captioners lag in spatial retrieval: LLaVA 7B+GLIP achieves 27.2% IoU at 0.3 threshold

• OCR systems retrieve ~50% of full relevant strings across languages
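
For readers unfamiliar with the metrics above, the sketch below shows the standard definitions of bounding-box IoU and F1. The paper's exact evaluation protocol is not given in this summary, so `fraction_matched` reflects just one common reading of "IoU at 0.3 threshold" (the share of predicted boxes whose IoU with their paired gold box is at least 0.3). The function names and the one-to-one pairing of predictions with gold boxes are assumptions.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fraction_matched(preds: Sequence[Box], golds: Sequence[Box],
                     threshold: float = 0.3) -> float:
    """Share of predictions whose IoU with the paired gold box meets the threshold.

    Assumes preds[i] is paired with golds[i]; a real benchmark may match differently.
    """
    if not preds:
        return 0.0
    hits = sum(1 for p, g in zip(preds, golds) if box_iou(p, g) >= threshold)
    return hits / len(preds)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Two 2x2 boxes overlapping in a 1x1 corner: intersection 1, union 7.
print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

An IoU threshold of 0.3 is fairly lenient: a predicted box counts as a hit even when it overlaps the gold box only loosely.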
