A new benchmark that tests if AI can actually read long documents like humans do
M-LongDoc helps models understand 200+ page documents by learning to focus on what matters
https://arxiv.org/abs/2411.06176
🤖 Original Problem:
Current multimodal models struggle to understand very long documents that span hundreds of pages. Existing benchmarks focus on short documents and simple extractive questions, which do not reflect real-world challenges.
-----
🔧 Solution in this Paper:
→ Introduced M-LongDoc benchmark with 851 samples featuring documents averaging 210.8 pages across academic, financial, and product domains
→ Developed an automated evaluation framework that uses multiple judge models to score answer correctness on a 1-5 scale
→ Created a retrieval-aware tuning approach that mixes relevant content with distractor content during training (a minimal sketch follows this list)
→ Built a training corpus of 10,070 samples for question-answering over multimodal documents
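The retrieval-aware tuning above pairs each question with its relevant pages plus sampled distractor pages, so the model learns to answer from evidence while ignoring noise. Below is a minimal sketch of how such a training sample might be assembled; the `Page` type, the `build_training_sample` helper, the prompt format, and the distractor count are hypothetical stand-ins, not the paper's exact pipeline.

```python
import random
from dataclasses import dataclass

@dataclass
class Page:
    page_id: int
    content: str  # page text, or a caption/placeholder for figures and tables

def build_training_sample(question: str,
                          answer: str,
                          relevant_pages: list[Page],
                          all_pages: list[Page],
                          num_distractors: int = 4) -> dict:
    """Assemble one retrieval-aware training sample: gold (relevant) pages
    are mixed with randomly sampled distractor pages so the model must
    learn to focus on the evidence and ignore irrelevant context."""
    relevant_ids = {p.page_id for p in relevant_pages}
    candidates = [p for p in all_pages if p.page_id not in relevant_ids]
    distractors = random.sample(candidates, k=min(num_distractors, len(candidates)))

    # Shuffle so relevant pages do not always appear in the same position.
    context_pages = relevant_pages + distractors
    random.shuffle(context_pages)

    context = "\n\n".join(f"[Page {p.page_id}] {p.content}" for p in context_pages)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return {"prompt": prompt, "target": answer}
```

In the paper, roughly 10,070 such question-answering samples make up the training corpus; the distractor count and prompt layout shown here are placeholders.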
-----
💡 Key Insights:
→ Models show a significant bias: they perform worse on figure and table questions than on text questions
→ Simply increasing retrieved context doesn't improve performance and can worsen results
→ Automated evaluation framework achieved 88.9% correlation with human judgments (see the scoring sketch after this list)
→ Most current models struggle with visual content in long documents
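The 88.9% figure comes from the automated evaluation framework, which asks several judge models to rate each open-ended answer on a 1-5 correctness scale and aggregates their scores. Here is a minimal sketch of such a multi-judge scoring loop, assuming a generic `Judge` callable (e.g., a thin wrapper around an LLM API); the rubric wording, clamping, and averaging are illustrative assumptions, not the paper's exact prompt or aggregation.

```python
from statistics import mean
from typing import Callable

# A judge is any callable that maps an evaluation prompt to a 1-5 score.
Judge = Callable[[str], float]

RUBRIC = (
    "Rate the answer's correctness against the question and the reference "
    "document content on a scale of 1 (wrong) to 5 (fully correct). "
    "Reply with a single number."
)

def score_answer(question: str, answer: str, evidence: str,
                 judges: list[Judge]) -> float:
    """Average the 1-5 correctness scores returned by multiple judge models."""
    prompt = (
        f"{RUBRIC}\n\nQuestion: {question}\n"
        f"Evidence: {evidence}\nCandidate answer: {answer}\nScore:"
    )
    scores = []
    for judge in judges:
        raw = judge(prompt)
        scores.append(min(5.0, max(1.0, float(raw))))  # clamp to the 1-5 range
    return mean(scores)
```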
-----
📊 Results:
→ Retrieval-aware tuning achieved a 4.6% relative improvement in answer correctness over baseline models
→ Benchmark documents average 120,988 tokens and 210.8 pages
→ Evaluation framework showed 88.9% correlation with human judgment