OmniDocBench introduces a comprehensive benchmark for evaluating document parsing systems across diverse document types with detailed annotations and evaluation metrics.
https://arxiv.org/abs/2412.07626
🛠️ OmniDocBench:
→ Provides a meticulously curated dataset spanning nine document types, including academic papers, textbooks, and slides.
→ The benchmark defines 19 layout category labels and 14 attribute labels for multi-level assessment.
→ It enables flexible evaluation across entire datasets, individual modules, and specific data types.
→ The evaluation framework incorporates both pipeline-based and end-to-end assessment methods.
-----
💡 Key Insights:
→ Pipeline tools outperform general Vision Language Models (VLMs) in document parsing tasks
→ VLMs show better generalization on specialized content like slides and handwritten notes
→ Document parsing performance varies significantly across different languages and layouts
→ Table recognition accuracy drops substantially with rotated content across all models
-----
📊 Results:
→ MinerU achieved the best performance on English pages, with a 0.058 edit distance (lower is better)
→ DocLayout-YOLO demonstrated 48.71% mAP across diverse document types
→ RapidTable reached 82.5% accuracy in table recognition tasks
→ GPT-4o showed an 86.8% CDM (Character Detection Matching) score in formula recognition
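The edit-distance figure above is a length-normalized comparison between predicted and ground-truth text, where 0.0 is a perfect match. A minimal sketch of how such a score can be computed (function names are illustrative, not taken from OmniDocBench's codebase):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner loop over the shorter string
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),    # substitution (0 if chars match)
            ))
        prev = curr
    return prev[-1]

def normalized_edit_distance(pred: str, gt: str) -> float:
    """Scale to [0, 1] by the longer string's length; lower is better."""
    if not pred and not gt:
        return 0.0
    return levenshtein(pred, gt) / max(len(pred), len(gt))
```

A score like 0.058 then means roughly 6% of the characters on a page needed to be edited to match the ground truth.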