MMDocIR is a benchmark for evaluating multi-modal document retrieval through two tasks, page-level and layout-level retrieval, with expert-annotated labels for comprehensive assessment.
-----
https://arxiv.org/abs/2501.08828
🔍 Original Problem:
→ Existing benchmarks cannot robustly evaluate multi-modal document retrieval systems, which must handle text, figures, tables, and layouts in long documents.
→ Existing datasets lack complete document pages and cover few domains.
-----
💡 Solution in this Paper:
→ MMDocIR benchmark introduces two distinct retrieval tasks: page-level and layout-level retrieval.
→ The dataset comprises 313 lengthy documents across 10 domains with 1,685 expert-annotated questions.
→ A training set of 173,843 bootstrapped questions supports retriever training.
→ Retrieval uses two models built on Phi3-Vision for query and document encoding: DPR-Phi3 (a single pooled embedding per page) and Col-Phi3 (token-level late interaction); see the scoring sketch after this list.
→ High-resolution page images are split into 336x336-pixel sub-images before encoding (see the cropping sketch below).
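
The model names follow standard retrieval conventions: DPR-style retrievers compare one pooled embedding per query and page, while Col-style retrievers keep per-token embeddings and score with MaxSim late interaction. The numpy sketch below illustrates only the two scoring regimes; it is not the authors' implementation, and the shapes, names, and toy embeddings are illustrative (the real encoders are Phi3-Vision).

```python
import numpy as np

def dpr_score(query_vec, page_vec):
    # Document-level (DPR-style): one pooled embedding per query and per page,
    # scored with a single dot product.
    return float(np.dot(query_vec, page_vec))

def col_score(query_toks, page_toks):
    # Token-level (ColBERT-style late interaction): every query token picks its
    # best-matching page token (MaxSim), and the per-token maxima are summed.
    sims = query_toks @ page_toks.T        # (n_query_tokens, n_page_tokens)
    return float(sims.max(axis=1).sum())

# Toy embeddings: 4 query tokens, 20 page tokens, 8 dims, unit-normalized.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8));  q /= np.linalg.norm(q, axis=1, keepdims=True)
p = rng.normal(size=(20, 8)); p /= np.linalg.norm(p, axis=1, keepdims=True)

print(dpr_score(q.mean(axis=0), p.mean(axis=0)))  # single-vector score
print(col_score(q, p))                            # late-interaction score
```

Token-level scoring preserves fine-grained matches that pooling averages away, which is consistent with the finding below that token-level retrievers outperform document-level ones.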
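For the 336x336 cropping step, this summary does not say how page edges are handled; the Pillow sketch below pads the page to the next tile multiple (an assumption) and then cuts a row-major grid. Only the 336x336 crop size comes from the paper.

```python
import math
from PIL import Image

def crop_sub_images(page, tile=336, pad_color=(255, 255, 255)):
    # Pad the page up to the next multiple of the tile size. The padding
    # policy is an assumption; the paper only specifies the 336x336 crops.
    w, h = page.size
    W = math.ceil(w / tile) * tile
    H = math.ceil(h / tile) * tile
    canvas = Image.new("RGB", (W, H), pad_color)
    canvas.paste(page, (0, 0))
    # Cut row-major tile x tile sub-images.
    return [canvas.crop((x, y, x + tile, y + tile))
            for y in range(0, H, tile)
            for x in range(0, W, tile)]

# Example: a 1000x1400 page yields a 3x5 grid of 15 sub-images.
tiles = crop_sub_images(Image.new("RGB", (1000, 1400), "white"))
print(len(tiles), tiles[0].size)  # 15 (336, 336)
```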
-----
🎯 Key Insights:
→ Visual retrievers consistently outperform text-based approaches
→ Token-level retrievers show better performance than document-level ones
→ Text retrievers perform significantly better on VLM-generated text than on OCR-extracted text
→ Layout-level retrieval proves more challenging than page-level retrieval
-----
📊 Results:
→ Visual retrievers achieve 76.8% accuracy on top-k retrieval (see the recall@k sketch after this list)
→ MMDocIR training set improves retrieval performance by 15%
→ Text retrievers run on VLM-generated text outperform OCR-text methods by 40%
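
"Top-k accuracy" here is most naturally read as recall@k: the fraction of questions whose expert-annotated evidence appears among the top-k retrieved pages. A minimal sketch of that computation, using toy IDs rather than MMDocIR data:

```python
def recall_at_k(ranked_ids, gold_ids, k=5):
    # Fraction of questions whose top-k retrieved pages include at least one
    # expert-annotated evidence page.
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids)
               if set(ranked[:k]) & set(gold))
    return hits / len(ranked_ids)

# Toy IDs only (not MMDocIR data): 2 of 3 questions hit in the top 2.
ranked = [[3, 7, 1], [4, 2, 9], [5, 6, 8]]
gold   = [[7], [9], [6]]
print(recall_at_k(ranked, gold, k=2))  # ~0.667
```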