
"OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations"

The podcast for this paper was generated with Google's Illuminate.

OmniDocBench introduces a comprehensive benchmark for evaluating document parsing systems across diverse document types, backed by detailed annotations and evaluation metrics.

https://arxiv.org/abs/2412.07626

🛠️ OmniDocBench:

→ Provides a meticulously curated dataset comprising nine diverse document types, including academic papers, textbooks, and slides.

→ The benchmark defines 19 layout category labels and 14 attribute labels to support multi-level assessment.

→ It enables flexible evaluation across entire datasets, individual modules, and specific data types.

→ The evaluation framework incorporates both pipeline-based and end-to-end assessment methods.
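
For context, end-to-end evaluation of this kind typically compares a model's full-page output (e.g., Markdown) against ground truth with a normalized edit distance. The sketch below is illustrative only, not OmniDocBench's actual evaluation code; the function names and sample strings are hypothetical.

```python
# Minimal sketch of an end-to-end document-parsing metric:
# normalized Levenshtein edit distance between a model's Markdown
# output and the ground-truth Markdown (0 = perfect, 1 = worst).
# Illustrative only; not OmniDocBench's actual evaluation code.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row short
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def normalized_edit_distance(pred: str, gt: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length."""
    if not pred and not gt:
        return 0.0
    return levenshtein(pred, gt) / max(len(pred), len(gt))

# Hypothetical usage: compare a parser's page-level Markdown to ground truth.
pred_md = "# Title\nSome parsed paragraph."
gt_md = "# Title\nSome parsed paragraph text."
print(f"normalized edit distance: {normalized_edit_distance(pred_md, gt_md):.3f}")
```

Lower is better on this scale; a value near 0.058, as reported for MinerU in the results below, would mean the parsed output differs from the ground truth in only a few percent of characters.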

-----

💡 Key Insights:

→ Pipeline tools outperform general Vision Language Models (VLMs) in document parsing tasks

→ VLMs show better generalization on specialized content like slides and handwritten notes

→ Document parsing performance varies significantly across different languages and layouts

→ Table recognition accuracy drops substantially with rotated content across all models

-----

📊 Results:

→ MinerU achieved the best performance on English pages, with an edit distance of 0.058 (lower is better)

→ DocLayout-YOLO achieved 48.71% mAP for layout detection across diverse document types

→ RapidTable reached 82.5% accuracy in table recognition tasks

→ GPT-4o achieved an 86.8% CDM (Character Detection Matching) score in formula recognition
