VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

Visual processing in RAG removes text parsing, preserving crucial document layout and visuals

Nov 10, 2024

Visual processing in RAG removes text parsing, preserving crucial document layout and visuals

Original Problem 🔍:

Traditional RAG systems rely on text-based processing, limiting their ability to handle multi-modal documents effectively.

Solution in this Paper 🧠:

• VisRAG: Vision-based Retrieval-augmented Generation

• Uses VLMs for both retrieval and generation

• VisRAG-Ret: Dual-encoder paradigm with VLM for query and document encoding

• VisRAG-Gen: Three approaches for handling multiple retrieved documents

Page Concatenation
Weighted Selection
Multi-image VLMs

Key Insights from this Paper 💡:

• VisRAG eliminates document parsing, preserving visual and layout information

• Demonstrates better training data efficiency and generalization capability

• Shows robustness across text-centric and vision-centric documents

• Outperforms text-based RAG in both retrieval and generation tasks

Results 📊:

• 25-39% relative improvement in end-to-end performance over text-based RAG

• Superior retrieval performance compared to SOTA text- and vision-centric retrievers

• Better training data efficiency: surpasses bge-large (OCR) with only 20k q-d pairs

• Consistent outperformance across various document types and lengths

📊 VisRAG consistently outperforms text-based RAG systems in both retrieval and generation tasks.

It achieves a 25-39% relative improvement in end-to-end performance over traditional text-based RAG pipelines. VisRAG-Ret shows superior performance in retrieving multi-modal documents compared to state-of-the-art text- and vision-centric retrievers.

🖼️ VisRAG-Gen uses three main approaches to handle multiple retrieved documents?

Page Concatenation: Concatenates all retrieved pages into a single image for VLMs that only accept one image.
Weighted Selection: Generates answers for each retrieved page and selects the final answer based on confidence scores.
Multi-image VLMs: Directly processes multiple images for VLMs capable of handling multiple inputs.
Rohan's Bytes is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

Rohan's Bytes

VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

Visual processing in RAG removes text parsing, preserving crucial document layout and visuals

Discussion about this post