This paper proposes visual-aware document search that actually understands charts and figures, not just text
M3DocRAG: A multi-modal system that answers questions from thousands of documents while preserving visual context
https://arxiv.org/abs/2411.04952
Original Problem 🤔:
Document Visual Question Answering (DocVQA) systems face two critical limitations: they can't process information across multiple documents/pages, and they lose vital visual data like charts when using text extraction tools.
-----
Solution in this Paper 🛠️:
→ M3DocRAG introduces a three-stage framework: document embedding converts pages to visual embeddings using ColPali, page retrieval finds relevant pages using MaxSim scoring, and question answering uses multi-modal LLMs like Qwen2-VL to generate answers
→ The system creates visual embeddings for each page and query in a shared space, enabling efficient retrieval across thousands of documents
→ For faster open-domain search, it implements approximate indexing using inverted file index (IVF), reducing query time from 20s to under 2s across 40K pages
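The MaxSim scoring mentioned above is ColBERT-style late interaction: each page and each query is a *set* of token-level embeddings in a shared space, and a page's score is the sum, over query tokens, of each token's best match among the page tokens. A minimal NumPy sketch (the function names and toy dimensions are my own, not from the paper):

```python
import numpy as np

def maxsim(query_emb: np.ndarray, page_emb: np.ndarray) -> float:
    """Late-interaction (ColBERT-style) score, as used by ColPali.

    query_emb: (n_query_tokens, d), page_emb: (n_page_tokens, d).
    For each query token, take its best dot-product match among the
    page tokens, then sum those maxima over all query tokens.
    """
    sims = query_emb @ page_emb.T          # (n_query_tokens, n_page_tokens)
    return float(sims.max(axis=1).sum())

def retrieve_top_k(query_emb, page_embs, k=4):
    """Score the query against every page; return indices of the k best pages."""
    scores = np.array([maxsim(query_emb, p) for p in page_embs])
    return np.argsort(scores)[::-1][:k]

# Toy example: 3 "pages" of 5 token embeddings each (dim 4), one 3-token query.
rng = np.random.default_rng(0)
pages = [rng.standard_normal((5, 4)) for _ in range(3)]
query = rng.standard_normal((3, 4))
top_pages = retrieve_top_k(query, pages, k=2)
```

The retrieved top-k pages are then passed (as images) to the multi-modal LLM for answer generation, which is how visual context like charts survives all the way to the answering stage.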
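The IVF speedup works by clustering page embeddings into coarse cells and, at query time, scanning only the few cells nearest the query instead of all 40K pages. In practice this is typically done with a library such as FAISS; the toy version below shows the idea in pure NumPy with flat (single-vector) page embeddings. All names, sizes, and the `n_probe` value here are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def build_ivf(vectors, n_lists, n_iters=10, seed=0):
    """Toy IVF index: k-means centroids plus inverted lists
    mapping each centroid to the ids of vectors assigned to it."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_lists, replace=False)].copy()
    for _ in range(n_iters):
        # Assign each vector to its nearest centroid (L2 distance).
        d2 = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        for c in range(n_lists):
            members = vectors[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    d2 = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    lists = {c: np.flatnonzero(assign == c) for c in range(n_lists)}
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, n_probe=2, k=4):
    """Approximate search: probe only the n_probe closest inverted lists,
    then rank just those candidates by inner product with the query."""
    probe = np.argsort(query @ centroids.T)[::-1][:n_probe]
    cand = np.concatenate([lists[c] for c in probe])
    scores = vectors[cand] @ query
    return cand[np.argsort(scores)[::-1][:k]]

# Demo: index 200 random page vectors, probe 3 of 8 cells.
rng = np.random.default_rng(1)
vecs = rng.standard_normal((200, 8))
cents, inv_lists = build_ivf(vecs, n_lists=8)
q = rng.standard_normal(8)
results = ivf_search(q, vecs, cents, inv_lists, n_probe=3, k=5)
```

Because only a fraction of the inverted lists is scanned per query, cost drops roughly in proportion to `n_probe / n_lists`, which is the trade-off behind the reported 20s-to-2s latency improvement.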
-----
Key Insights from this Paper 💡:
→ Visual information preservation is crucial for accurate document understanding, especially for charts and tables
→ Multi-modal retrieval outperforms text-only approaches, particularly for image-heavy content
→ Efficient indexing can dramatically speed up large-scale document search without significant accuracy loss
-----
Results 📊:
→ M3DocRAG with ColPali and Qwen2-VL 7B achieves superior performance across M3DocVQA, MMLongBench-Doc, and MP-DocVQA benchmarks
→ On M3DocVQA, achieves 36.5 F1 score with 4-page context, significantly outperforming text-only baselines
→ Shows 22.6 F1 score on MMLongBench-Doc, setting new state-of-the-art performance