RAG systems are only as good as their OCR - and current OCR isn't good enough
OCR errors in PDFs significantly impact RAG systems, hurting both retrieval and answer generation. RAG systems rely heavily on OCR to extract text from PDFs for their knowledge bases, and OCR recognition errors and inconsistent formatting introduce noise that degrades RAG performance. Yet no benchmark exists to systematically evaluate this impact.
This paper introduces OHRBench, a benchmark to evaluate OCR's cascading effects on RAG performance through systematic noise analysis.
https://arxiv.org/abs/2412.02592
-----
🛠️ The OHRBench benchmark:
→ OHRBench introduces 350 PDF documents from 6 real-world domains with complex layouts and multimodal elements
→ It identifies two key OCR noise types: Semantic Noise (recognition errors) and Formatting Noise (inconsistent structure)
→ Creates perturbed datasets with controlled noise levels to analyze impact on RAG components
→ Evaluates multiple OCR solutions including pipeline-based, end-to-end, and vision-language models
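The controlled-noise idea above can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual perturbation code: the confusion set, marker characters, and noise rates are all made up for the example.

```python
import random

def inject_semantic_noise(text: str, rate: float, seed: int = 0) -> str:
    """Simulate OCR recognition errors ("Semantic Noise") by swapping
    visually confusable character sequences (hypothetical confusion set)."""
    confusions = {"l": "1", "O": "0", "rn": "m", "e": "c", "S": "5"}
    rng = random.Random(seed)
    out = text
    for src, dst in confusions.items():
        # Replace each occurrence independently with probability `rate`.
        pieces = out.split(src)
        rebuilt = pieces[0]
        for piece in pieces[1:]:
            rebuilt += (dst if rng.random() < rate else src) + piece
        out = rebuilt
    return out

def inject_formatting_noise(text: str, rate: float, seed: int = 0) -> str:
    """Simulate inconsistent structure ("Formatting Noise") by dropping
    table/heading/formula markers with probability `rate`."""
    rng = random.Random(seed)
    markers = {"|", "#", "$"}
    return "".join(
        ch for ch in text
        if not (ch in markers and rng.random() < rate)
    )
```

Sweeping `rate` from 0 upward yields perturbed copies of the corpus at controlled noise levels, so the retrieval and generation drop can be measured per component.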
-----
💡 Key Insights:
→ Pipeline-based OCR performs best but still shows a 7.5% performance gap vs. ground truth
→ Semantic Noise severely impacts both retrieval and generation across all models
→ Formatting Noise mainly affects multimodal content like formulas and tables
→ Vision-language models show promise when combining image and text inputs
-----
📊 Results:
→ Even the best OCR solutions show a 1.9% drop in Exact Match and a 2.93% drop in F1 score
→ Dense retrievers outperform sparse retrievers as noise increases
→ Formula-related queries see up to 19.4% performance drop with formatting noise
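For reference, Exact Match and token-level F1 are the standard QA answer metrics; a generic sketch (not OHRBench's exact scorer, whose normalization rules may differ):

```python
from collections import Counter

def normalize(s: str) -> list[str]:
    # Lowercase and whitespace-tokenize (minimal normalization).
    return s.lower().split()

def exact_match(pred: str, gold: str) -> float:
    """1.0 iff the normalized prediction equals the normalized answer."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against the gold answer."""
    p, g = normalize(pred), normalize(gold)
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

A percentage-point drop in these scores between ground-truth text and OCR output is how the cascading impact of OCR noise is quantified.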