
"M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework"

The podcast on this paper was generated with Google's Illuminate.

A new benchmark that tests whether AI models can actually read long documents the way humans do

M-LongDoc trains models to understand 200+ page documents by teaching them to focus on the content that matters

https://arxiv.org/abs/2411.06176

🤖 Original Problem:

Current multimodal models struggle to understand super-long documents spanning hundreds of pages. Existing benchmarks focus on short documents and simple extractive questions, which do not reflect real-world challenges.

-----

🔧 Solution in this Paper:

→ Introduced M-LongDoc benchmark with 851 samples featuring documents averaging 210.8 pages across academic, financial, and product domains

→ Developed an automated evaluation framework that uses multiple judge models to score answer correctness on a 1-5 scale (sketched after this list)

→ Created a retrieval-aware tuning approach that mixes relevant content with distractor passages during training (also sketched after this list)

→ Built a training corpus of 10,070 samples for question-answering over multimodal documents
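
A minimal sketch of the multi-judge evaluation idea: several judge models each rate an answer from 1 to 5 and the scores are averaged. The prompt wording, judge interface, and score parsing here are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch: average 1-5 correctness scores from multiple judge models.
# The judge callables and prompt template are assumptions for illustration.
import re
from statistics import mean
from typing import Callable, List

JUDGE_PROMPT = (
    "Question:\n{question}\n\n"
    "Reference evidence:\n{evidence}\n\n"
    "Candidate answer:\n{answer}\n\n"
    "Rate the correctness of the candidate answer from 1 (wrong) to 5 "
    "(fully correct). Reply with a single integer."
)

def judge_answer(
    question: str,
    evidence: str,
    answer: str,
    judges: List[Callable[[str], str]],  # each judge maps a prompt to a text reply
) -> float:
    """Average 1-5 correctness scores across several judge models."""
    prompt = JUDGE_PROMPT.format(question=question, evidence=evidence, answer=answer)
    scores = []
    for judge in judges:
        reply = judge(prompt)
        match = re.search(r"[1-5]", reply)  # take the first digit 1-5 in the reply
        if match:
            scores.append(int(match.group()))
    return mean(scores) if scores else 0.0
```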
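
And a minimal sketch of retrieval-aware training-sample construction: the page known to contain the answer is mixed with retrieved distractor pages so the model learns to focus on relevant content. Field names and the prompt layout are assumptions.

```python
# Sketch: build a training prompt that combines the gold page with distractors.
import random
from typing import Dict, List

def build_training_sample(
    question: str,
    gold_page: str,                # page content known to contain the answer
    retrieved_pages: List[str],    # top-k retrieved pages (may be irrelevant)
    answer: str,
    num_distractors: int = 4,
) -> Dict[str, str]:
    """Mix the gold page with distractor pages into one training prompt."""
    distractors = [p for p in retrieved_pages if p != gold_page][:num_distractors]
    context_pages = distractors + [gold_page]
    random.shuffle(context_pages)  # gold page should not always sit in one position
    context = "\n\n".join(f"[Page {i+1}]\n{p}" for i, p in enumerate(context_pages))
    prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
    return {"prompt": prompt, "target": answer}
```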

-----

💡 Key Insights:

→ Models show a significant modality bias: they perform worse on figure and table questions than on text questions

→ Simply increasing the amount of retrieved context does not improve performance and can even worsen results

→ Automated evaluation framework achieved 88.9% correlation with human judgments

→ Most current models struggle with visual content in long documents

-----

📊 Results:

→ Retrieval-aware tuning achieved a 4.6% relative improvement in answer correctness over baseline models

→ Benchmark documents average 120,988 tokens and 210.8 pages

→ Evaluation framework showed 88.9% correlation with human judgment
