
"MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents"

A podcast on this paper was generated with Google's Illuminate.

MMDocIR introduces a benchmark for evaluating multi-modal document retrieval systems through page-level and layout-level tasks, with expert-annotated labels for comprehensive assessment.

-----

https://arxiv.org/abs/2501.08828

🔍 Original Problem:

→ Existing benchmarks cannot robustly evaluate multi-modal document retrieval systems, which must handle text, figures, tables, and page layouts in long documents.

→ Existing datasets often omit complete document pages and cover only a narrow range of domains.

-----

💡 Solution in this Paper:

→ The MMDocIR benchmark introduces two retrieval tasks: page-level retrieval (finding the most relevant pages in a document) and layout-level retrieval (localizing relevant elements such as paragraphs, tables, and figures within pages).

→ The dataset comprises 313 lengthy documents across 10 domains with 1,685 expert-annotated questions.

→ A training set with 173,843 bootstrapped questions enhances model development.

→ Two baseline retrievers, DPR-Phi3 and Col-Phi3, build on Phi3-Vision to encode queries and document pages (see the scoring sketch after this list).

→ High-resolution page images are handled by cropping them into 336x336 pixel sub-images before encoding, as sketched below.
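
To make the cropping step concrete, here is a minimal sketch of splitting a high-resolution page into 336x336 sub-images. The non-overlapping grid, white padding, and use of PIL are illustrative assumptions; the paper's exact cropping strategy may differ.

```python
# Sketch: split a high-resolution page image into 336x336 sub-images.
# Assumptions (not from the paper): non-overlapping grid, white padding on
# partial edge tiles, PIL for image handling.
from PIL import Image

TILE = 336  # sub-image size fed to the Phi3-Vision encoder

def crop_into_tiles(page: Image.Image, tile: int = TILE) -> list[Image.Image]:
    """Split a page into tile x tile sub-images, padding the right/bottom edges."""
    w, h = page.size
    tiles = []
    for top in range(0, h, tile):
        for left in range(0, w, tile):
            right, bottom = min(left + tile, w), min(top + tile, h)
            patch = page.crop((left, top, right, bottom))
            canvas = Image.new("RGB", (tile, tile), "white")  # pad partial tiles
            canvas.paste(patch, (0, 0))
            tiles.append(canvas)
    return tiles

# Example: a 1700x2200 scanned page yields a 6x7 grid of 42 sub-images.
```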

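The two retrievers named above differ mainly in how they score a query against a page. Below is a minimal sketch of both schemes, assuming L2-normalized embeddings from some encoder (e.g. Phi3-Vision); the tensor shapes and pooling choice are illustrative, not the paper's exact implementation.

```python
# Sketch: DPR-style single-vector scoring vs. Col-style (ColBERT-like) late interaction.
import numpy as np

def dpr_score(query_vec: np.ndarray, page_vec: np.ndarray) -> float:
    """One pooled embedding per query and per page, scored by a dot product."""
    return float(query_vec @ page_vec)

def col_score(query_tokens: np.ndarray, page_tokens: np.ndarray) -> float:
    """Late interaction (MaxSim): for each query token, take its maximum
    similarity over all page tokens, then sum over query tokens."""
    sim = query_tokens @ page_tokens.T      # (num_query_tokens, num_page_tokens)
    return float(sim.max(axis=1).sum())

# Toy example with unit-norm random embeddings (dim 16).
rng = np.random.default_rng(0)
q_tok = rng.normal(size=(8, 16));  q_tok /= np.linalg.norm(q_tok, axis=1, keepdims=True)
p_tok = rng.normal(size=(50, 16)); p_tok /= np.linalg.norm(p_tok, axis=1, keepdims=True)
print(dpr_score(q_tok.mean(axis=0), p_tok.mean(axis=0)))  # document-level score
print(col_score(q_tok, p_tok))                            # token-level MaxSim score
```

The token-level scheme keeps fine-grained information that mean pooling discards, which is consistent with the finding below that token-level retrievers outperform document-level ones.
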
-----

🎯 Key Insights:

→ Visual retrievers consistently outperform text-based approaches.

→ Multi-vector (token-level) retrievers perform better than single-vector (document-level) retrievers.

→ Retrieval over VLM-generated text (page images described by a vision-language model) achieves significantly better results than retrieval over OCR-extracted text.

→ Layout-level retrieval proves more challenging than page-level retrieval.

-----

📊 Results:

→ Visual retrievers achieve 76.8% accuracy on top-k retrieval tasks.

→ Training on the MMDocIR training set improves retrieval performance by 15%.

→ Text retrievers using VLM-generated text perform 40% better than OCR-text methods.
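
For context on how top-k numbers like those above are typically computed, here is a minimal recall@k sketch. It is a generic illustration, not the paper's evaluation code, and the paper's exact metric definitions may differ.

```python
# Sketch: recall@k — the fraction of queries whose top-k retrieved pages
# contain at least one ground-truth page. Generic illustration only.
import numpy as np

def recall_at_k(scores: np.ndarray, relevant: list[set[int]], k: int) -> float:
    """scores: (num_queries, num_pages) similarity matrix;
    relevant[i]: set of ground-truth page indices for query i."""
    topk = np.argsort(-scores, axis=1)[:, :k]  # k highest-scoring pages per query
    hits = [bool(set(row) & rel) for row, rel in zip(topk, relevant)]
    return float(np.mean(hits))

# Toy usage: 3 queries over 5 pages, with ground-truth pages {0}, {3}, {1}.
scores = np.array([[0.9, 0.1, 0.3, 0.2, 0.0],
                   [0.2, 0.8, 0.1, 0.7, 0.3],
                   [0.1, 0.2, 0.3, 0.4, 0.5]])
print(recall_at_k(scores, [{0}, {3}, {1}], k=2))  # -> 0.666...
```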
