VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
Visual processing in RAG removes text parsing, preserving crucial document layout and visuals
Visual processing in RAG removes text parsing, preserving crucial document layout and visuals
Original Problem ๐:
Traditional RAG systems rely on text-based processing, limiting their ability to handle multi-modal documents effectively.
Solution in this Paper ๐ง :
โข VisRAG: Vision-based Retrieval-augmented Generation
โข Uses VLMs for both retrieval and generation
โข VisRAG-Ret: Dual-encoder paradigm with VLM for query and document encoding
โข VisRAG-Gen: Three approaches for handling multiple retrieved documents
Page Concatenation
Weighted Selection
Multi-image VLMs
Key Insights from this Paper ๐ก:
โข VisRAG eliminates document parsing, preserving visual and layout information
โข Demonstrates better training data efficiency and generalization capability
โข Shows robustness across text-centric and vision-centric documents
โข Outperforms text-based RAG in both retrieval and generation tasks
Results ๐:
โข 25-39% relative improvement in end-to-end performance over text-based RAG
โข Superior retrieval performance compared to SOTA text- and vision-centric retrievers
โข Better training data efficiency: surpasses bge-large (OCR) with only 20k q-d pairs
โข Consistent outperformance across various document types and lengths
๐ VisRAG consistently outperforms text-based RAG systems in both retrieval and generation tasks.
It achieves a 25-39% relative improvement in end-to-end performance over traditional text-based RAG pipelines. VisRAG-Ret shows superior performance in retrieving multi-modal documents compared to state-of-the-art text- and vision-centric retrievers.
๐ผ๏ธ VisRAG-Gen uses three main approaches to handle multiple retrieved documents?
Page Concatenation: Concatenates all retrieved pages into a single image for VLMs that only accept one image.
Weighted Selection: Generates answers for each retrieved page and selects the final answer based on confidence scores.
Multi-image VLMs: Directly processes multiple images for VLMs capable of handling multiple inputs.



