I. Introduction to Document Layout Analysis (DLA)
Document Layout Analysis (DLA) is a critical process within the fields of computer vision and document intelligence, concerned with understanding the spatial arrangement and structure of content within digital document images.1 It serves as a foundational step for transforming unstructured or semi-structured documents, such as scanned pages or born-digital files like PDFs, into machine-readable formats that capture not only the text but also the organizational schema intended by the document's creator.
A. Defining DLA: Physical vs. Logical Layout Analysis
At its core, DLA aims to determine the structure of a document by identifying and categorizing its constituent components or regions of interest.4 This process fundamentally involves two interrelated aspects: physical layout analysis and logical layout analysis.7
Physical Layout Analysis focuses on the geometric arrangement and identification of tangible elements on the page.8 Its primary goal is to decompose the document image into a hierarchy of homogeneous regions based on their visual characteristics.7 These regions correspond to the physical blocks of content, such as paragraphs of text, headings, figures (images or illustrations), tables, lists, mathematical formulas, footnotes, headers, footers, page numbers, and selection marks (like checkboxes).1 This stage essentially answers the questions: What types of content blocks are present, and where are they located on the page?
Logical Layout Analysis, conversely, deals with the functional or semantic roles of the identified physical components and their interrelationships.7 It assigns meaningful labels (e.g., identifying a block of text as a 'title', 'caption', 'author name', or 'abstract') and seeks to understand the intended reading order and hierarchical structure connecting these elements.7 For example, logical analysis determines that a caption is associated with a specific figure or that a sequence of paragraphs forms a coherent section under a particular heading. This stage often builds upon the results of the physical analysis, interpreting the purpose and flow of information within the document.7
While traditionally viewed as distinct, sequential steps, the boundary between physical and logical layout analysis is increasingly blurred in modern DLA systems. Early approaches often treated physical segmentation as a prerequisite for subsequent logical interpretation.7 However, contemporary state-of-the-art models, particularly those leveraging multimodal information (text, vision, layout) or graph-based representations, often perform these tasks jointly or use insights from one to inform the other.5 This integration reflects a deeper understanding that the physical arrangement and the logical function of document elements are intrinsically linked and that a holistic analysis leads to more accurate and meaningful document understanding.8 The ultimate goal is to bridge the gap between the raw visual presentation of a document and its underlying semantic structure.9
B. Core Goals and Applications
The primary objective of DLA is to decompose document images into their constituent structural elements and understand the roles and relationships of these elements.7 By extracting both text and structural information, DLA facilitates a deeper semantic understanding of the document's content, converting unstructured visual information into a structured, machine-interpretable format.8
This structured representation serves as a critical enabler for a wide array of downstream applications and document intelligence tasks.6 The practical and economic value derived from these applications is a major driving force behind advancements in DLA research. Key applications include:
Optical Character Recognition (OCR) Enhancement: DLA helps OCR systems by identifying text regions, separating them from non-textual elements (like images or tables), and determining the correct reading order, leading to more accurate text extraction.6 It can also guide OCR engines to apply different processing strategies to different regions (e.g., table cells vs. paragraphs).
Information Extraction (IE) and Knowledge Extraction: By identifying specific layout components like titles, authors, abstracts, tables, or key-value pairs in forms, DLA provides the necessary structural context for targeted information extraction systems.5 Accurate DLA ensures that IE models operate on meaningful segments of text, improving the precision of downstream tasks like Named Entity Recognition (NER) and relation extraction.32
Document Retrieval and Search: Understanding the layout allows for more sophisticated document retrieval systems that can search not only based on text content but also on structural elements (e.g., finding documents containing tables related to a specific topic) or logical roles (e.g., searching within figure captions).4
Content Categorization and Classification: Layout features (e.g., presence of specific elements like formulas or code blocks, number of columns) can be strong indicators for classifying documents by type (e.g., scientific paper, invoice, resume) or topic.1
Automated Document Processing Workflows: DLA is fundamental to automating workflows in various industries by enabling machines to "read" and process documents like humans. Examples include processing invoices in finance (extracting vendor, amount, date), managing patient records and insurance claims in healthcare, analyzing contracts and court transcripts in the legal sector, and handling forms in government and business.1
Document Conversion: Converting documents from formats like PDF into structured, editable formats (e.g., HTML, XML, Markdown) relies heavily on accurate layout analysis to preserve the original structure and reading order.4 Tools like Docling explicitly leverage DLA models for this purpose.36 (A minimal usage sketch follows this list.)
Accessibility: DLA can improve the experience for users of assistive technologies like screen readers by providing a logical reading order and distinguishing between main content and peripheral elements (headers, footers, page numbers), making complex documents more navigable.10
Handwritten Text Processing: In the context of historical or handwritten documents, DLA is crucial for segmenting text lines and identifying different zones (e.g., main text, marginalia) before transcription or recognition.5
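To make the document-conversion use case concrete, the snippet below is a minimal sketch of converting a PDF to Markdown with Docling, following the converter API described in the Docling documentation; the input path is a placeholder.

```python
# Minimal Docling sketch: the library runs a DLA model internally to recover
# layout elements and reading order, then exports a structured document.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("report.pdf")  # placeholder path; a URL also works

# Headings, tables, lists, and reading order are preserved in the export.
print(result.document.export_to_markdown())
```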
C. Fundamental Challenges in Modern DLA
Despite significant progress, developing robust and accurate DLA systems faces numerous inherent challenges, largely stemming from the sheer diversity and complexity of real-world documents. Overcoming these challenges is essential for unlocking the full potential of automated document understanding.
Layout Diversity and Complexity: Documents exhibit an enormous variety of layouts. Simple documents whose content falls into rectangular, grid-aligned blocks (often termed "Manhattan layouts" 6) contrast sharply with complex multi-column layouts found in newspapers or magazines 34, intricate table-based structures in financial reports or scientific papers 34, and highly variable, often unstructured layouts in forms, webpages, or historical manuscripts.6 Non-Manhattan layouts, which deviate from strict grid structures and often incorporate artistic or aesthetic design choices, pose particular difficulties.28 The type of document (e.g., scientific article, invoice, legal contract, patent, manual) heavily influences its typical layout conventions.8 Accurately defining the boundaries of different elements, especially when they are irregularly shaped or closely spaced, remains a hard problem.32
Element Variability: Layout components themselves vary greatly in size, aspect ratio, and appearance.8 Text can appear in numerous fonts, sizes, colors, and styles (bold, italic).34 Images and tables can span multiple columns or have complex internal structures (e.g., merged cells in tables 34). Models must effectively handle these multi-scale variations.40
Image Quality and Degradation: Documents, especially those digitized via scanning, are often imperfect. Common issues include image noise (salt-and-pepper, Gaussian noise from sensors 2), skew (rotation during scanning 2), low resolution, compression artifacts, perspective distortion (from camera capture), uneven illumination, bleed-through (text from the reverse side showing through), faded ink, stains, smudges, and handwritten annotations or marks.6 Historical documents are particularly prone to severe degradation due to aging.6 State-of-the-art models often exhibit significant performance degradation when faced with such perturbations, highlighting a critical gap between performance on clean benchmark datasets and real-world robustness.42
Data Scarcity and Annotation Cost: Modern deep learning models for DLA are data-hungry, requiring large quantities of accurately labeled training data.19 Creating such datasets through manual annotation is extremely labor-intensive, time-consuming, and expensive.28 This is particularly true for diverse datasets covering many layout types or specialized domains like historical documents or non-Manhattan layouts.28 Furthermore, ensuring high quality and consistency across annotations can be challenging.48 Data scarcity is a major bottleneck, especially in domains with privacy concerns (e.g., medical records, financial statements) where sharing or annotating data is restricted.1
Generalization Across Domains: A significant challenge is the poor generalization ability of many DLA models.1 Models trained extensively on one type of document layout (e.g., scientific articles from PubLayNet) often fail to perform well when applied to documents from different domains or with substantially different layouts (e.g., financial reports or manuals in DocLayNet).21 This domain shift necessitates the development of more robust models, more diverse training datasets, or effective domain adaptation techniques.1 This lack of generalization is arguably one of the most significant hurdles preventing the widespread, reliable deployment of DLA in diverse real-world applications, driving major research efforts in dataset creation (e.g., DocLayNet 39, M6Doc 24), robustness benchmarking (e.g., RoDLA benchmark 43), and adaptation methods (e.g., SFDLA 21).
Logical Structure Complexity: While identifying physical blocks is challenging, inferring the correct logical structure (reading order, hierarchical relationships) is often even harder, especially for complex, non-linear layouts involving multiple columns, sidebars, floating figures, or intricate tables.10 Establishing these relationships correctly is vital for semantic understanding and tasks like document summarization or question answering.
Computational Trade-offs: There is often a trade-off between model accuracy and computational efficiency (speed and memory usage).40 Complex multimodal models that leverage text and vision often achieve higher accuracy but incur significant latency, making them unsuitable for real-time or large-scale processing.40 Faster unimodal (e.g., vision-only) models may lack the accuracy needed for demanding applications.40 Balancing these factors is a critical consideration for practical deployment.
II. Evolution of DLA Methodologies
The approaches to Document Layout Analysis have evolved considerably over the past few decades, mirroring broader trends in computer vision and machine learning while also responding to the specific challenges posed by document structure. This evolution has progressed from heuristic, rule-based systems designed for simple layouts to sophisticated deep learning models capable of handling much greater complexity and diversity.
A. From Rule-Based Systems to Deep Learning
Early research into DLA, dating back to the 1990s, primarily focused on documents with relatively simple, well-structured layouts, often characterized by single or multiple rectangular text columns (Manhattan layouts).6 DLA was frequently treated as a preprocessing step within larger document understanding systems rather than a distinct research problem.6 The methods developed during this era were largely rule-based or heuristic.47
Common techniques included:
Top-Down Approaches: These methods attempt to recursively partition the document page into major regions based on global features like whitespace. The Recursive X-Y Cut (RXYC) algorithm is a classic example, using horizontal and vertical projection profiles (histograms of pixel counts) to identify large gaps corresponding to column or block boundaries.2 (A minimal sketch of this projection-profile idea follows this list.) While relatively fast, these methods rely heavily on assumptions about the layout structure (e.g., rectangular blocks, clear separators) and struggle with skewed documents or non-Manhattan layouts.2
Bottom-Up Approaches: These methods start by identifying primitive elements like pixels or connected components (groups of adjacent black pixels) and iteratively merge them into larger structures like characters, words, text lines, and finally, text blocks.2 The Docstrum algorithm, for instance, analyzes the distribution of angles and distances between nearest-neighbor connected components to identify text lines and estimate skew.2 Other bottom-up techniques included the Run-Length Smearing Algorithm (RLSA), which merges nearby black pixels by applying horizontal and vertical smearing operations.55
Hybrid Approaches: Some methods combined top-down and bottom-up strategies to leverage the strengths of both.8
Other Classical Techniques: Texture-based analysis, using features derived from Gabor filters or spatial autocorrelation, was employed to distinguish between text and non-text regions based on their textural properties.6 Analysis of the background structure was also explored.55
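As a concrete illustration of the projection-profile idea behind Recursive X-Y Cut, the following is a minimal sketch rather than a faithful reimplementation of any published system. It assumes a binarized page supplied as a NumPy array, and the whitespace-gap threshold and recursion depth are illustrative values.

```python
import numpy as np

def find_cut(profile, min_gap):
    """Return the centre of the widest run of empty rows/columns in a projection
    profile, or None if no run is at least min_gap pixels long."""
    best_len, best_center, run = 0, None, 0
    for i, empty in enumerate(profile == 0):
        run = run + 1 if empty else 0
        if run >= min_gap and run > best_len:
            best_len, best_center = run, i - run // 2
    return best_center

def xy_cut(page, y0=0, x0=0, min_gap=20, depth=0, max_depth=8):
    """Recursive X-Y Cut sketch on a binarized page (1 = ink, 0 = background).
    Splits at the widest whitespace gap in the row or column projection profile."""
    h_cut = find_cut(page.sum(axis=1), min_gap)   # row projection -> horizontal split
    v_cut = find_cut(page.sum(axis=0), min_gap)   # column projection -> vertical split
    if depth >= max_depth or (h_cut is None and v_cut is None):
        if page.any():                            # emit a block only if the region contains ink
            return [(x0, y0, x0 + page.shape[1], y0 + page.shape[0])]
        return []
    if h_cut is not None:                         # here we simply try horizontal cuts first
        return (xy_cut(page[:h_cut], y0, x0, min_gap, depth + 1, max_depth)
                + xy_cut(page[h_cut:], y0 + h_cut, x0, min_gap, depth + 1, max_depth))
    return (xy_cut(page[:, :v_cut], y0, x0, min_gap, depth + 1, max_depth)
            + xy_cut(page[:, :v_cut].shape and page[:, v_cut:], y0, x0 + v_cut, min_gap, depth + 1, max_depth))
```

Real implementations add skew correction, smarter gap selection, and stopping criteria; the brittleness of such hand-set thresholds is precisely what motivated the learning-based methods discussed next.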
These early methods, while foundational, faced significant limitations. They were often brittle, performing poorly on documents with complex or irregular layouts, significant noise or degradation, or variations in font styles and sizes.2 Many required manually defined rules or parameters tuned for specific document types.
As the field progressed into the 2000s, statistical machine learning techniques began to be incorporated.6 Approaches based on feature engineering coupled with classifiers like Multi-Layer Perceptrons (MLPs) or Support Vector Machines (SVMs) emerged.6 DLA started to be framed as a pixel-level semantic segmentation task, aiming to classify each pixel as belonging to a specific layout category.47 However, these methods were often still limited by the expressiveness of the handcrafted features used.
The advent of deep learning, particularly Convolutional Neural Networks (CNNs), around 2015 marked a paradigm shift in DLA, as in many other computer vision domains.20 Deep learning models offered the ability to learn hierarchical features directly from raw pixel data, eliminating the need for manual feature engineering and demonstrating superior performance on complex visual tasks.6 This led to DLA being increasingly treated as an object detection or instance/semantic segmentation problem, leveraging architectures developed for natural image analysis.11 The availability of large-scale annotated datasets, such as PubLayNet released in 2019 11, proved crucial in enabling the training and success of these data-hungry deep learning models, significantly accelerating progress in the field.11 This co-evolution highlights how algorithmic advancements (deep learning) and data availability (large benchmarks) mutually reinforce progress in DLA.
B. Key Architectural Shifts: CNNs, Transformers, GNNs, Multimodal Models
The deep learning era in DLA has witnessed several major architectural shifts, driven by the quest for models that can better capture the complex interplay of visual appearance, textual content, and spatial structure inherent in documents.
CNN-Based Methods: CNNs were the workhorses of early deep learning DLA. Architectures like Fully Convolutional Networks (FCNs) and U-Net (the latter originally developed for biomedical image segmentation) were adapted for pixel-level layout segmentation.6 Object detection frameworks like Faster R-CNN and Mask R-CNN, which use CNN backbones (e.g., ResNet, ResNeXt) for feature extraction followed by region proposal and classification/mask prediction heads, became standard approaches for detecting bounding boxes or masks for layout elements.18 (A minimal fine-tuning sketch follows this list.) CNNs excel at learning local visual patterns and hierarchical features directly from image pixels.6 While often surpassed by newer architectures in state-of-the-art systems, CNNs remain widely used as powerful feature extraction backbones within more complex models.
Transformer-Based Methods: More recently, Transformer architectures, initially developed for natural language processing, have become increasingly dominant in DLA.16 Their core self-attention mechanism allows them to model long-range dependencies and global context within an image or sequence, which is highly beneficial for understanding the overall layout structure.8
Vision Transformer (ViT) Backbones: Models like the vanilla ViT and more advanced variants such as the Swin Transformer 59, along with large-scale backbones such as InternImage 45 (built on deformable convolutions rather than attention), are used as powerful feature extractors, often replacing standard CNNs. The Document Image Transformer (DiT) specifically adapted the ViT architecture for documents and demonstrated significant gains through self-supervised pre-training on large document corpora.5
Detection Transformers (DETR Family): DETR (DEtection TRansformer) and its successors like Deformable DETR, DINO, and Mask DINO reformulate object detection as an end-to-end set prediction problem, eliminating the need for hand-designed components like anchors and Non-Max Suppression (NMS).15 These models use a CNN or ViT backbone followed by a Transformer encoder-decoder architecture. DINO, in particular, has shown very strong performance on DLA benchmarks and is frequently used in SOTA systems.45
Unified Architectures: Some recent work aims to create unified Transformer-based models that handle multiple DLA sub-tasks (e.g., object detection, role classification, reading order) within a single end-to-end framework, such as DLAFormer.15
Graph Neural Networks (GNNs): Recognizing that documents possess inherent relational structure, GNN-based approaches model a document page as a graph.18 Typically, text blocks or lines extracted via OCR or from PDF metadata serve as nodes, and edges represent spatial or logical relationships between them.17 Graph Convolutional Networks (GCNs) or Graph Attention Networks (GATs) are then applied to learn node representations that incorporate neighborhood information, effectively capturing the document's structure.17 DLA tasks like classifying text blocks (nodes) or identifying elements belonging to the same segment (edge classification or graph segmentation) can then be performed.12 Models like GLAM 12, Doc-GCN 22, and Paragraph2Graph 19 exemplify this approach. GNNs can often directly leverage the rich metadata available in born-digital PDFs (e.g., precise text coordinates) and tend to be computationally efficient.12
Multimodal Approaches: Given that documents inherently contain multiple modalities – visual appearance, textual content, and spatial layout – many state-of-the-art models explicitly aim to fuse information from these different sources.8 These models typically combine visual features extracted by CNNs or ViTs with textual features derived from OCR output processed by language models (like BERT) or text embeddings, along with layout information encoded via positional embeddings.5 The LayoutLM series (v1, v2, v3) pioneered multimodal pre-training for documents, jointly learning representations of text and layout, sometimes incorporating explicit visual features.5 Other notable multimodal models include VGT, which uses separate Transformer streams for vision and a 2D text grid 21, and UDoc.8 Grid-based methods, like VGT or earlier approaches, represent the document's text and layout on a 2D grid, which is then processed, often in conjunction with visual features.21
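As a concrete example of the detection framing described above, the following sketch adapts torchvision's off-the-shelf Faster R-CNN (ResNet-50 FPN backbone) to predict PubLayNet-style layout classes. The data loading is assumed to exist elsewhere, and the hyperparameters are illustrative, not tuned values from any particular paper.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# The five classes mirror PubLayNet's annotation schema (plus background).
CLASSES = ["__background__", "text", "title", "list", "table", "figure"]

# Start from a COCO-pretrained detector and swap in a new box predictor head.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=len(CLASSES))

optimizer = torch.optim.SGD(model.parameters(), lr=5e-3, momentum=0.9, weight_decay=5e-4)

def train_step(images, targets):
    """One training step; torchvision detection models return a loss dict in train mode.
    targets: list of {"boxes": FloatTensor[N, 4], "labels": LongTensor[N]}."""
    model.train()
    loss_dict = model(images, targets)
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```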
This architectural progression from CNNs to Transformers to GNNs and multimodal fusion reflects an ongoing effort to develop models that can more effectively capture and integrate the diverse signals present in documents. CNNs provide strong local visual feature extraction, Transformers excel at modeling global context and long-range dependencies (both visually and semantically), while GNNs offer a way to explicitly incorporate relational structure. Multimodal models attempt to synthesize all these information sources for a comprehensive understanding.
C. Emerging Paradigms: Self-Supervision, Adaptation, Robustness
Beyond core architectural choices, several emerging paradigms are shaping the cutting edge of DLA research, driven by the need to overcome limitations related to data availability, generalization, and real-world applicability.
Self-Supervised Pre-training: Inspired by successes in NLP and computer vision, self-supervised learning (SSL) has become crucial in DLA.47 By designing pretext tasks that allow models to learn from vast amounts of unlabeled document images (e.g., from repositories like IIT-CDIP or arXiv), SSL reduces the dependency on costly manual annotations.5 Models like DiT use Masked Image Modeling (MIM), where the model learns to predict masked image patches.63 LayoutLM models use objectives like Masked Visual-Language Modeling (MVLM), predicting masked words based on visual and layout context.18 VGT introduces grid-specific pre-training tasks like Masked Grid Language Modeling (MGLM) and Segment Language Modeling (SLM).58 This pre-training equips models with strong, general-purpose representations of document characteristics before fine-tuning on specific downstream DLA tasks.33
Domain Adaptation and Generalization: Addressing the critical challenge of poor generalization across diverse document domains 39 is a major focus.
Source-Free Domain Adaptation (SFDA): This paradigm specifically tackles scenarios where a model pre-trained on a labeled source dataset needs to be adapted to an unlabeled target dataset without access to the original source data.1 This is vital for applications involving sensitive data (e.g., medical, financial) or proprietary models.21 The SFDLA benchmark 21 and frameworks like DLAdapter, which uses a dual-teacher approach involving self-supervised learning mechanisms to align representations across domains, have been proposed to address this.21 While promising, SFDA faces challenges like pseudo-label instability and sensitivity to large domain gaps.21
Synthetic Data Generation: Creating large volumes of synthetic or augmented document images is another strategy to improve robustness and generalization. Techniques range from using engines like LaTeX 28 or generative models like LayoutGAN/LayoutVAE 28 to more sophisticated methods like aesthetic-guided augmentation for non-Manhattan layouts 28 or the Mesh-candidate BestFit algorithm (framing synthesis as 2D bin packing) used to create DocSynth-300K.40 Pre-training or augmenting with diverse synthetic data can significantly improve performance across different document types.40 (A toy placement sketch follows this list.)
Unsupervised Learning: Some approaches attempt to train DLA models entirely without labels. The UnSupDLA framework, for example, uses visual features from a self-supervised model (DINO) combined with graph-based segmentation (Normalized Cuts) to generate initial pseudo-masks for training an object detector.48
Robustness: Recognizing that real-world documents are rarely pristine, there is a growing emphasis on building models that are robust to common perturbations like noise, blur, compression artifacts, and geometric distortions.21 The RoDLA benchmark provides a standardized way to evaluate model performance under various controlled perturbations.42 Architectures like the RoDLA model incorporate specific design choices (e.g., channel attention, average pooling) aimed at extracting perturbation-insensitive features.42
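To illustrate the synthetic-data idea in its simplest form, here is a toy sketch that places non-overlapping layout elements on a blank page by rejection sampling. Real pipelines such as the Mesh-candidate BestFit approach treat placement as 2D bin packing and render realistic content into the boxes; the page size, element sizes, and counts below are purely illustrative.

```python
import random

PAGE_W, PAGE_H = 1240, 1754          # roughly A4 at 150 dpi
SIZES = {"title": (900, 60), "text": (560, 300), "table": (1100, 400), "figure": (560, 420)}

def overlaps(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return not (ax2 <= bx1 or bx2 <= ax1 or ay2 <= by1 or by2 <= ay1)

def synth_page(n_elements=8, max_tries=200):
    """Return a list of (category, (x1, y1, x2, y2)) annotations for one synthetic page."""
    placed = []
    for _ in range(n_elements):
        cat = random.choice(list(SIZES))
        w, h = SIZES[cat]
        for _ in range(max_tries):               # rejection sampling: retry until no overlap
            x1 = random.randint(0, PAGE_W - w)
            y1 = random.randint(0, PAGE_H - h)
            box = (x1, y1, x1 + w, y1 + h)
            if all(not overlaps(box, other) for _, other in placed):
                placed.append((cat, box))
                break
    return placed
```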
The increasing focus on self-supervision, adaptation, and robustness signals a maturation of the DLA field. While achieving high accuracy on clean, in-domain benchmarks was the primary goal of earlier deep learning efforts, the current research landscape reflects a stronger drive towards practical applicability. Addressing the challenges of limited labeled data, domain shifts, and imperfect input quality is now central to pushing the boundaries of DLA and enabling its reliable deployment in diverse, real-world scenarios.
III. State-of-the-Art Model Architectures
The current state-of-the-art in Document Layout Analysis is largely defined by models leveraging deep learning, particularly those based on Transformer architectures. These models vary in their specific design, modality focus (vision-only vs. multimodal), and underlying principles (object detection vs. segmentation vs. graph-based methods).
A. Transformer-Based Object Detectors (DETR, DINO, Mask DINO)
A significant trend in modern DLA is to frame the task as an object detection problem, where the goal is to predict bounding boxes and class labels for layout elements (text blocks, figures, tables, etc.). Transformer-based detectors, originating from the DETR (DEtection TRansformer) model, have become particularly prominent.
Core Idea: These models typically employ a CNN or Vision Transformer backbone to extract image features, followed by a Transformer encoder-decoder architecture. The key innovation of DETR was to treat object detection as a direct set prediction problem, eliminating the need for hand-crafted components like anchor boxes and Non-Max Suppression (NMS) used in earlier detectors like Faster R-CNN.15 The decoder uses learned object queries to probe the encoder's output and predict a set of bounding boxes and class labels, which are then matched to ground truth objects using bipartite matching during training.8 (A minimal sketch of this matching step appears at the end of this subsection.)
DETR and Deformable DETR: While foundational, the original DETR suffered from slow convergence and limitations in detecting small objects.60 Deformable DETR introduced deformable attention mechanisms to address these issues, allowing attention heads to focus on a small set of key sampling points around a reference point, improving efficiency and performance.15 These serve as inspiration or components in some DLA systems.15
DINO (DETR with Improved deNoising anchOr boxes): DINO represents a significant advancement over earlier DETR variants and has emerged as a powerful baseline and component in state-of-the-art DLA systems.45 It introduces several key improvements: contrastive denoising training (helping the model learn box prediction refinement), mixed query selection (improving anchor initialization), and a look-forward-twice scheme for more accurate box prediction.67 DINO demonstrates strong performance on DLA benchmarks like PubLayNet (achieving around 95.5% mAP 8) and DocLayNet (around 79.5% mAP 59), particularly excelling compared to earlier DETR versions in detecting smaller objects.69 Its effectiveness is further highlighted by its use in top-performing ensemble methods, such as the WeLayout system (DINO + YOLO) which won the ICDAR 2023 competition on robust layout segmentation.66 The RoDLA architecture also draws inspiration from DINO.45
Mask DINO: This model extends the DINO architecture to perform instance, panoptic, and semantic segmentation in a unified framework.62 It integrates a pixel decoder module (similar to that used in Mask2Former) with the DINO Transformer decoder, allowing it to predict segmentation masks in addition to bounding boxes.62 While primarily evaluated on natural image segmentation benchmarks like COCO and ADE20K where it achieves state-of-the-art results 62, its unified approach holds potential for pixel-precise document layout analysis tasks that require segmentation masks rather than just bounding boxes.
Strengths and Weaknesses: Transformer-based detectors offer an end-to-end solution, effectively capturing global context through self-attention mechanisms, leading to strong performance. However, they can be computationally intensive to train and deploy. While DINO improved small object detection, extremely small or densely packed elements might still pose challenges. Their performance can also be sensitive to hyperparameter tuning and specific training strategies.
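To make the set-prediction formulation concrete, the following is a minimal sketch of the bipartite (Hungarian) matching step: each ground-truth layout element is assigned to one predicted query by minimizing a combined classification and box cost. The cost terms and weights are simplified; real DETR/DINO implementations also include a generalized IoU term and focal-style classification costs.

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes, cls_w=1.0, box_w=5.0):
    """pred_logits: [Q, C]; pred_boxes: [Q, 4] (cx, cy, w, h, normalized);
    gt_labels: LongTensor[N]; gt_boxes: [N, 4]. Returns matched (query_idx, gt_idx) arrays."""
    prob = pred_logits.softmax(-1)                      # [Q, C]
    cost_cls = -prob[:, gt_labels]                      # [Q, N]: negative prob of the true class
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)   # [Q, N]: L1 distance between boxes
    cost = cls_w * cost_cls + box_w * cost_box
    q_idx, g_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return q_idx, g_idx
```

The matched pairs define the targets for the classification and box-regression losses, while unmatched queries are trained to predict a "no object" class.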
B. Document Image Transformers (DiT) and Vision Transformer Backbones
Another major line of research leverages the power of Vision Transformers (ViTs) directly for document image understanding, often focusing on learning robust visual representations through self-supervised pre-training.
Core Idea: These approaches treat the document page primarily as an image, adapting ViT architectures originally designed for natural images.5 The document image is typically divided into a sequence of non-overlapping patches, which are then processed by Transformer encoder blocks.63
DiT (Document Image Transformer): DiT is a seminal work in this area.63 It adapts the standard ViT architecture and employs self-supervised pre-training using Masked Image Modeling (MIM) on large datasets of unlabeled document images (e.g., the IIT-CDIP Test Collection 1.0, containing millions of pages 63).63 In MIM, a portion of the input image patches is masked, and the model is trained to predict the visual tokens (derived from a discrete VAE, or dVAE, tokenizer) of these masked patches based on the surrounding context.63 DiT comes in base and large sizes, analogous to ViT variants.63 (A schematic sketch of the MIM objective appears at the end of this subsection.)
Performance: DiT, when used as a backbone for downstream tasks like DLA (typically combined with an object detection head like Faster R-CNN or Cascade R-CNN), achieves state-of-the-art results.5 On the PubLayNet benchmark, DiT-Large with a Cascade R-CNN head reached a mAP of 94.9% 60, significantly outperforming models pre-trained on general image datasets like ImageNet, demonstrating the value of domain-specific pre-training for documents.63 It also serves as a strong baseline or component in more complex multimodal systems evaluated on DocLayNet.53
Other Backbones: Besides DiT, other powerful backbones developed for general computer vision are also frequently employed in DLA models, including the ViT variants Swin Transformer 59, DeiT 60, and BEiT 60, as well as the deformable-convolution-based InternImage.45 These backbones are often combined with sophisticated detection heads (e.g., Cascade R-CNN) or Transformer decoders (e.g., in DINO or RoDLA frameworks) to achieve top performance.
Strengths and Weaknesses: These models learn powerful visual representations directly from document pixels, benefiting immensely from large-scale self-supervised pre-training on domain-specific data. However, as primarily vision-based models, they may not inherently leverage the rich textual semantics present in documents unless explicitly integrated into a multimodal framework. Large Transformer models are also computationally demanding.
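The following is a schematic sketch of the Masked Image Modeling objective used to pre-train DiT. The tokenizer, encoder, and prediction head are placeholders standing in for the dVAE tokenizer, the ViT backbone, and its output projection; only the masking-and-prediction logic is shown.

```python
import torch
import torch.nn.functional as F

def mim_loss(image, tokenizer, encoder, head, patch_count=196, mask_ratio=0.4):
    """image: [B, 3, H, W]. tokenizer maps an image to [B, P] discrete visual token ids;
    encoder sees the masked image and returns patch features [B, P, D];
    head maps features to logits [B, P, V] over the visual vocabulary."""
    with torch.no_grad():
        target_tokens = tokenizer(image)                            # [B, P] ground-truth visual tokens
    mask = torch.rand(image.shape[0], patch_count) < mask_ratio     # [B, P], True = masked patch
    features = encoder(image, mask)                                 # encoder only sees unmasked content
    logits = head(features)                                         # [B, P, V]
    # The loss is computed only on the masked positions, as in BEiT/DiT-style MIM.
    return F.cross_entropy(logits[mask], target_tokens[mask])
```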
C. Multimodal Pre-trained Models (LayoutLMv3, VGT, UDoc)
Recognizing that documents are inherently multimodal artifacts containing visual, textual, and spatial information, a significant body of work focuses on developing models that jointly pre-train on these modalities.
Core Idea: These models aim to learn unified representations that capture the complex interactions between text content, its visual appearance (font, style), and its position on the page.5 They are typically pre-trained on massive document corpora using self-supervised objectives and then fine-tuned for specific downstream tasks like DLA.
LayoutLM Series (v1, v2, v3): This influential series of models progressively integrated modalities.
LayoutLMv1 5: Based on BERT, it added 2D position embeddings (encoding the x,y coordinates of token bounding boxes) and, optionally, token-level image embeddings (derived from object detectors like Faster R-CNN). It used a Masked Visual-Language Model (MVLM) objective, predicting masked words based on their textual, positional, and (optionally) visual context.18 (A minimal sketch of such 2D position embeddings appears at the end of this subsection.)
LayoutLMv2 19: Integrated visual information more deeply by feeding visual token embeddings (produced by a CNN visual encoder) into the Transformer alongside the text tokens, so that attention operates jointly over both modalities, and introduced a spatial-aware self-attention mechanism.
LayoutLMv3 1: Further unified the architecture, using a single ViT-style backbone to process both text and image patches. It employed unified text and image masking objectives during pre-training.17 LayoutLMv3 achieves strong performance, reaching 95.1% mAP on PubLayNet with a Cascade R-CNN head 17 and serving as a robust baseline on DocLayNet and other document AI tasks.13
Vision Grid Transformer (VGT): This model adopts a two-stream architecture.21 One stream uses a standard Vision Transformer (ViT) to process the document image. The other stream introduces a novel Grid Transformer (GiT) that processes a 2D grid representation of the document's text and layout information.58 The GiT is pre-trained using specialized objectives: Masked Grid Language Modeling (MGLM) for token-level semantics and Segment Language Modeling (SLM) for segment-level understanding.58 By fusing features from both streams, VGT effectively leverages multimodal information. It achieved state-of-the-art results on PubLayNet (96.2% mAP) upon its release, surpassing previous models.58
UDoc (Unified Pretraining Framework): This framework aims to provide a universal model for various document understanding tasks, including DLA.8 Without going into its architectural details here, it demonstrates strong performance on PubLayNet, achieving around 93.9% mAP.8
Strengths and Weaknesses: Multimodal models effectively capture the synergy between different information sources in documents, often leading to superior performance and generalization within the document domain. Large-scale pre-training is key to their success. However, they are typically architecturally complex and computationally very expensive, both for pre-training and often for inference. Their performance can also be sensitive to the quality of the underlying OCR or text extraction used to obtain the textual modality. Generalization to highly dissimilar, out-of-domain documents might still require adaptation.
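As a minimal sketch of how layout enters these models, the module below combines word embeddings with LayoutLM-style 2D position embeddings of token bounding boxes (coordinates assumed to be normalized to an integer grid). Dimensions are illustrative, and the details differ across LayoutLM versions.

```python
import torch
import torch.nn as nn

class LayoutEmbedding(nn.Module):
    """Word embeddings plus 2D position embeddings of token bounding boxes."""
    def __init__(self, vocab_size=30522, hidden=768, coord_range=1024):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)
        self.x_emb = nn.Embedding(coord_range, hidden)   # shared by x0 and x1
        self.y_emb = nn.Embedding(coord_range, hidden)   # shared by y0 and y1

    def forward(self, token_ids, boxes):
        """token_ids: LongTensor[B, T]; boxes: LongTensor[B, T, 4] with (x0, y0, x1, y1)
        normalized to integers in [0, coord_range - 1]."""
        x0, y0, x1, y1 = boxes.unbind(-1)
        return (self.word_emb(token_ids)
                + self.x_emb(x0) + self.y_emb(y0)
                + self.x_emb(x1) + self.y_emb(y1))
```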
D. Graph Neural Network Approaches (GLAM, Doc-GCN, Paragraph2Graph)
An alternative paradigm leverages Graph Neural Networks (GNNs) to explicitly model the relational structure of documents.
Core Idea: These methods represent a document page as a graph, where nodes typically correspond to primitive layout elements (e.g., text lines or text blocks identified by OCR or PDF parsing) and edges represent relationships between them (e.g., spatial proximity, reading order, containment).12 GNN architectures like Graph Convolutional Networks (GCNs) or Graph Attention Networks (GATs) are then used to learn representations for nodes and/or edges by aggregating information from their local graph neighborhood.17 DLA is then framed as a node classification (assigning roles to text blocks) or edge classification/graph segmentation (grouping nodes into larger layout components) task.12
GLAM (Graph-based Layout Analysis Model): GLAM is designed as a lightweight GNN (around 4 million parameters) that directly utilizes metadata extracted from PDF files (text content and coordinates) to construct the graph.12 It employs a GCN architecture (specifically using Topology Adaptive Graph Convolutional layers, TAGConv 17) for joint node and edge classification. Layout segments are then identified by finding connected components in the graph after removing predicted negative edges.17 (A schematic sketch of this grouping step appears at the end of this subsection.) GLAM achieves competitive performance on the challenging DocLayNet dataset (68.6% mAP), notably outperforming a much larger SOTA vision-based model on several text-based classes (like Title, List-item, Section-header).12 An ensemble of GLAM with a YOLOv5x6 detector achieved a new state-of-the-art mAP of 80.8% on DocLayNet.12 However, its performance on PubLayNet is lower (72.2% mAP 17), partly attributed to its inability to process image-based elements (like figures) directly from PDF metadata and potential misalignment between PDF text boxes and ground truth annotations in that dataset.17 A key advantage of GLAM is its high efficiency in terms of model size and inference speed.12
Doc-GCN: This model uses a heterogeneous graph approach, constructing multiple graphs for each document page, with each graph capturing different aspects of node features (syntactic, semantic, text density, visual appearance).22 GCNs are applied to each graph, features are integrated using pooling, and final classification is done via MLPs.22 Doc-GCN reported state-of-the-art results on PubLayNet, FUNSD, and DocBank at the time of its publication (COLING 2022).22
Paragraph2Graph: This framework proposes a language-independent GNN model designed for adaptability, particularly for scenarios requiring strict separation between layout components.19 It emphasizes modularity in node definition, edge definition, graph sampling, GNN architecture, and task-specific layers.19 It claims competitive results and good generalization across languages, suggesting layout structure diversity is more critical than language itself for its approach.19
Strengths and Weaknesses: GNNs excel at explicitly modeling the relational structure within documents and can efficiently leverage readily available PDF metadata.12 They are often significantly more lightweight and computationally efficient than large Transformer models.12 They can enforce clean separation between components 19 and potentially offer better language independence.19 However, their performance can be highly dependent on the quality of the initial node features derived from OCR or PDF parsing. They may struggle with purely visual elements (like figures or complex table borders) if not combined with strong visual feature extractors.17 While competitive, especially on text-centric tasks or when efficiency is paramount, their overall mAP on diverse benchmarks like DocLayNet generally lags behind the top-performing vision-centric or multimodal Transformer models when used standalone, although ensembles incorporating GNNs can be very effective.12
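The grouping step referenced above can be sketched as follows: text blocks become nodes, a trained edge classifier (here a stand-in callable) scores candidate edges, and the connected components of the surviving edges form the predicted layout segments. This is a simplified illustration of the general approach, not GLAM's actual code.

```python
import networkx as nx

def segments_from_graph(blocks, candidate_edges, edge_scorer, threshold=0.5):
    """blocks: list of dicts (e.g., with 'bbox' and 'text' from PDF metadata);
    candidate_edges: list of (i, j) index pairs (e.g., k-nearest neighbours);
    edge_scorer: callable returning the probability that two blocks belong
    to the same layout segment."""
    g = nx.Graph()
    g.add_nodes_from(range(len(blocks)))
    for i, j in candidate_edges:
        if edge_scorer(blocks[i], blocks[j]) >= threshold:   # keep only "same segment" edges
            g.add_edge(i, j)
    # Each connected component is one predicted layout segment (e.g., a paragraph or list).
    return [sorted(component) for component in nx.connected_components(g)]
```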
E. Specialized and Hybrid Architectures (RoDLA, DLAFormer, Ensembles)
Beyond the major architectural categories, several specialized or hybrid approaches have emerged, often targeting specific challenges like robustness or integrating multiple DLA tasks, and frequently achieving top benchmark performance through combining the strengths of different models.
RoDLA (Robust Document Layout Analyzer): This model is specifically engineered to be robust against the various perturbations commonly found in real-world document images.21 Its architecture builds upon the strong DINO object detection framework but incorporates modifications to enhance robustness.42 Key additions include using the powerful InternImage backbone and integrating channel attention mechanisms alongside self-attention in the Transformer encoder, coupled with average pooling layers.42 These additions are designed to help the model focus on perturbation-insensitive features. RoDLA achieves state-of-the-art scores on robustness benchmarks (PubLayNet-P, DocLayNet-P, M6Doc-P) using the mRD metric and shows significant mAP improvements over baseline models on these perturbed datasets.42 It also demonstrates strong performance on clean datasets, achieving 96.0% mAP on PubLayNet.45
DLAFormer: This model represents an effort towards a unified, end-to-end solution for multiple DLA sub-tasks within a single Transformer framework.15 It tackles graphical object detection (tables, figures), text region detection, logical role classification, and reading order prediction simultaneously by casting them as relation prediction problems.15 Built upon a DETR-inspired encoder-decoder architecture, it uses novel type-wise queries to handle the diverse types of page objects and relationships.15 DLAFormer reported superior performance compared to previous multi-branch or multi-stage approaches on the DocLayNet and Comp-HRDoc benchmarks.15
Hybrid Approaches: Several top-performing systems combine elements from different architectural paradigms. One such approach, presented at ICDAR 2024, utilizes an advanced Transformer-based object detector (like DINO) with a ResNet-50 backbone, but enhances it with a novel query encoding strategy and a hybrid one-to-one/one-to-many matching scheme during training.3 This combination achieved state-of-the-art mAP scores on PubLayNet (97.3%), DocLayNet (81.6%), and the PubTables dataset (98.6%) at the time of publication.3 Another effective hybrid strategy involves using a strong top-down detector (like DINO) for graphical objects (tables, figures) while employing a separate bottom-up model (potentially sharing the same backbone) to detect and classify text regions based on text lines, which can better handle fine-grained text structures and reading order.65 This approach also reported state-of-the-art results on DocLayNet and PubLayNet.65
Ensemble Methods: Combining the predictions of multiple diverse models is a well-established technique for boosting performance, particularly in competitive settings.12 The winning WeLayout system in the ICDAR 2023 Robust Layout Segmentation competition used a sophisticated ensemble of DINO and YOLO models, employing techniques like Weighted Box Fusion and Bayesian optimization to find optimal ensemble parameters, achieving 70.0% mAP on the challenging competition dataset.66 Similarly, the GLAM+YOLOv5x6 ensemble demonstrated a significant boost over standalone models on DocLayNet, reaching 80.8% mAP.12
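Below is a minimal sketch of such box-level ensembling using the open-source ensemble-boxes package's weighted_boxes_fusion, which expects boxes normalized to [0, 1]. The model weights and thresholds are illustrative, not the tuned values used by the WeLayout system.

```python
from ensemble_boxes import weighted_boxes_fusion

def fuse_predictions(dino_pred, yolo_pred, iou_thr=0.55):
    """Each prediction is a (boxes, scores, labels) triple for one page image,
    with boxes as [x1, y1, x2, y2] normalized to [0, 1]."""
    boxes, scores, labels = weighted_boxes_fusion(
        [dino_pred[0], yolo_pred[0]],
        [dino_pred[1], yolo_pred[1]],
        [dino_pred[2], yolo_pred[2]],
        weights=[2, 1],          # illustrative: trust the stronger detector more
        iou_thr=iou_thr,         # boxes above this IoU are fused into one
        skip_box_thr=0.05,       # discard very low-confidence boxes
    )
    return boxes, scores, labels
```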
The prevalence and success of hybrid and ensemble methods underscore the idea that different architectures possess complementary strengths. For instance, Transformer-based detectors like DINO might excel at complex global layouts and graphical objects, while faster detectors like YOLO might be efficient for simpler objects or serve as a diverse component in an ensemble. GNNs bring strengths in modeling explicit text structure and efficiency. Combining these diverse capabilities often leads to higher overall accuracy and robustness than any single model can achieve alone, especially on complex and varied datasets like DocLayNet or competition benchmarks. This suggests the field is still exploring the optimal combination of techniques rather than having settled on a single definitive architecture.
IV. Performance Evaluation and Benchmarking
Evaluating the performance of DLA models requires standardized datasets and metrics to allow for meaningful comparisons between different architectures and approaches. The field relies on several key benchmarks, each with its own characteristics and challenges.
A. Key Benchmark Datasets
Several publicly available datasets serve as standard benchmarks for training and evaluating DLA models:
PubLayNet: This is one of the largest and most widely used datasets for DLA.11 It contains over 360,000 document images derived from scientific articles publicly available on PubMed Central.13 Annotations for typical layout elements (text, title, list, table, figure) were generated automatically by matching PDF content with corresponding XML representations.11 Its large scale was instrumental in training early deep learning models for DLA.11 However, its primary limitation is the lack of layout diversity, as it consists almost exclusively of scientific articles, leading to performance saturation and poor generalization to other document types.39
DocLayNet: Developed specifically to address the diversity limitations of PubLayNet and DocBank.39 DocLayNet contains 80,863 document pages manually annotated with bounding boxes for 11 distinct layout classes: Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, and Title.8 The dataset sources documents from diverse domains, including financial reports, manuals, patents, scientific articles, and legal documents, representing a much wider variability in layouts.21 Consequently, DocLayNet is considered more challenging and a better benchmark for evaluating the generalization capabilities of DLA models for real-world applications.39 Models trained on DocLayNet tend to be more robust and suitable for general-purpose DLA.39
ICDAR Competition Datasets: The International Conference on Document Analysis and Recognition (ICDAR) frequently hosts competitions related to DLA, releasing specialized datasets.88 These often target specific challenges, such as documents with complex layouts (e.g., RDCL2015 88), historical documents, table detection, scientific literature parsing (ICDAR 2021 50), or robust layout segmentation in corporate documents (ICDAR 2023, using a dataset derived from/related to DocLayNet 66). While sometimes smaller in scale than PubLayNet or DocLayNet, these datasets provide valuable benchmarks for specific research problems.
Other Datasets: Several other datasets contribute to DLA research:
DocBank: Contains 500k pages from arXiv, with token-level annotations generated via weak supervision from LaTeX source files; it focuses more on semantic structure labels.39
M6Doc: A large-scale dataset designed for modern DLA, featuring multi-format, multi-type, multi-layout, multi-language, and multi-annotation categories.4
Historical Document Datasets: Including HJDataset (Japanese 5), AnnoPage (Czech/German, non-textual elements 1), and datasets used in specific studies on historical commentaries 93 or Swedish documents.61
Form Understanding: FUNSD (Form Understanding in Noisy Scanned Documents) is often used for evaluating layout analysis in the context of form processing.22
Specialized Datasets: D4LA (Diverse Document Layout Analysis dataset 40), HRDoc/Comp-HRDoc (Hierarchical structure 15), FPD (Fine-grained Page semantic Dataset for non-Manhattan layouts 28), RanLayNet (for domain generalization evaluation 31).
Evaluation Metric: The standard evaluation metric for DLA tasks framed as object detection is the mean Average Precision (mAP) as defined in the COCO object detection challenge.12 This involves calculating the Average Precision (AP) for each object class (e.g., table, figure, title) across a range of Intersection-over-Union (IoU) thresholds, typically from 0.50 to 0.95 with a step of 0.05 (denoted as mAP@[0.50:0.95] or simply mAP). The final mAP score is the mean of the AP values across all classes.50 Performance is often reported both overall and per class to provide more detailed insights.
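For reference, the core quantities behind this metric can be sketched as follows. In practice, DLA papers report numbers computed with COCO-format tooling (e.g., pycocotools); this only illustrates the IoU computation and the threshold range being averaged over.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

# COCO-style thresholds: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = np.arange(0.50, 1.00, 0.05)

def coco_map(ap_table):
    """ap_table: array of shape [num_classes, num_thresholds] holding per-class AP
    at each IoU threshold. mAP@[0.50:0.95] is the mean over classes and thresholds."""
    return float(np.mean(ap_table))
```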
B. Comparative Performance Analysis of SOTA Models
Comparing the performance of different architectures on standard benchmarks reveals the current state-of-the-art and highlights the relative strengths and weaknesses of various approaches.
Note: Performance figures are indicative and depend on specific backbones, training settings, and dataset versions. DocLayNet state of the art evolves rapidly; the figures quoted here reflect the values reported in the cited sources.
Key Observations from Performance Comparison:
Transformer Dominance: The top-performing models overwhelmingly rely on Transformer architectures, whether as backbones (DiT, RoDLA's InternImage), end-to-end detectors (DINO, VGT, Hybrids), or multimodal frameworks (LayoutLMv3, VGT). This underscores the effectiveness of attention mechanisms for capturing the global context and complex dependencies in document layouts.
PubLayNet vs. DocLayNet Gap: There is a consistent and significant performance drop when moving from PubLayNet to DocLayNet (often 10-20 mAP points or more for the same model architecture). This gap starkly illustrates the generalization challenge. The relative homogeneity of PubLayNet allows models to achieve very high scores, potentially by overfitting to specific layout conventions of scientific articles. The diversity of DocLayNet presents a much harder test of a model's ability to handle varied layouts, making it a more realistic benchmark for general-purpose DLA. Progress on DocLayNet is arguably a better indicator of true advancement in the field.
Hybridization and Ensembling: The absolute highest scores on both benchmarks are often achieved by hybrid models that combine different architectural ideas (e.g., Transformer detector + specialized matching 8, or top-down graphical detection + bottom-up text detection 65) or by ensembling diverse strong models (e.g., GLAM+YOLO 12, DINO+YOLO 68). This suggests that current single architectures may have complementary strengths and weaknesses, and combining them effectively pushes the performance envelope.
GNN Performance Profile: GNN-based models like GLAM demonstrate remarkable efficiency but lag significantly behind top Transformer-based models in overall mAP, particularly on PubLayNet where visual elements (figures) are prevalent and GNNs relying solely on PDF text struggle.17 However, they can be highly competitive or even superior on specific text-based categories within more diverse datasets like DocLayNet 12, and their contribution to SOTA ensembles highlights their value.12
C. Robustness Benchmarking Insights
While standard benchmarks evaluate performance on clean, well-formatted document images, real-world applications often encounter documents suffering from various forms of degradation. Recognizing this gap, recent efforts have focused on benchmarking model robustness.
The Need for Robustness Evaluation: As noted previously, many state-of-the-art DLA models, despite achieving high accuracy on datasets like PubLayNet and DocLayNet, experience substantial performance drops when processing perturbed images (e.g., due to scanning noise, blur, compression).42 This lack of robustness limits their practical utility.
The RoDLA Benchmark: To address this, the RoDLA benchmark was introduced as the first large-scale effort specifically designed to evaluate the robustness of DLA models.42 It comprises perturbed versions of three major datasets (PubLayNet, DocLayNet, M6Doc), totaling approximately 450,000 images.42 The benchmark includes a comprehensive taxonomy of 12 common document perturbation types categorized into five groups (spatial transformation, content interference, inconsistency distortion, blur, noise), each applied at three severity levels.42 (An illustrative perturbation sketch appears at the end of this subsection.)
Robustness Metrics: Alongside the benchmark, new metrics were proposed to better quantify robustness: Mean Perturbation Effect (mPE) assesses the impact of perturbations, while Mean Robustness Degradation (mRD) evaluates model robustness more accurately by factoring out baseline performance differences.42
Key Findings: Experiments using the RoDLA benchmark revealed the significant vulnerability of previous SOTA models to perturbations.42 The RoDLA model, specifically designed with robustness-enhancing architectural features (channel attention, pooling), achieved state-of-the-art mRD scores and demonstrated substantial mAP improvements (ranging from +3.8% on PubLayNet-P to +12.1% on M6Doc-P) compared to prior methods on these perturbed datasets.42 This work highlights that robustness is not merely a byproduct of high accuracy on clean data but requires specific architectural considerations and targeted evaluation. The choice of input representation was also identified as playing a crucial role in robustness under different corruption types.37
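To illustrate the kind of perturbations involved, the sketch below applies blur, additive Gaussian noise, or JPEG recompression to a page image (assumed to be in RGB or grayscale mode) at three increasing severity levels using Pillow and NumPy. The severity parameters are made up for illustration; RoDLA defines its own calibrated taxonomy of 12 perturbation types.

```python
import io
import numpy as np
from PIL import Image, ImageFilter

def perturb(page: Image.Image, kind: str, severity: int) -> Image.Image:
    """Apply one perturbation type at severity 1-3 to a document page image."""
    if kind == "blur":
        radius = [1, 2, 4][severity - 1]
        return page.filter(ImageFilter.GaussianBlur(radius))
    if kind == "noise":
        sigma = [8, 16, 32][severity - 1]
        arr = np.asarray(page).astype(np.float32)
        noisy = arr + np.random.normal(0.0, sigma, arr.shape)
        return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))
    if kind == "jpeg":
        quality = [40, 20, 10][severity - 1]
        buf = io.BytesIO()
        page.save(buf, format="JPEG", quality=quality)   # lossy round-trip introduces artifacts
        buf.seek(0)
        return Image.open(buf).copy()
    raise ValueError(f"unknown perturbation: {kind}")
```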
The introduction and adoption of robustness benchmarks represent a critical maturation of the DLA field. Evaluating models solely on pristine benchmark data provides an incomplete picture of their real-world potential. Robustness is emerging as a key differentiator and a necessary attribute for deploying reliable DLA systems in practice. Furthermore, while the standard COCO mAP metric is useful, analyzing performance at specific IoU thresholds or on a per-class basis can offer deeper insights into model behavior. For example, a model might excel at identifying the general location of text blocks (high mAP@0.5) but struggle with precise boundary prediction (lower mAP at higher IoUs), a nuance lost in the overall averaged score.17 Similarly, per-class analysis can reveal specific failure modes, such as GLAM's difficulty with image-based figures.17
V. Specialized Approaches and Future Directions
While significant progress has been made in general DLA, specific challenges remain, particularly concerning highly complex layouts, domain generalization, and practical deployment constraints. Research is actively exploring specialized approaches and charting future directions to address these limitations.
A. Addressing Complex Layouts (Non-Manhattan, Historical Documents)
Generic DLA models, often trained on datasets dominated by structured, modern documents like scientific articles (e.g., PubLayNet), frequently struggle when confronted with fundamentally different layout paradigms.
The Challenge: Non-Manhattan layouts, common in magazines, brochures, and some forms, deliberately break grid structures for aesthetic or functional reasons, featuring overlapping elements, irregular shapes, and complex text flow.6 Historical documents present a distinct set of challenges, including highly variable and often complex layouts (e.g., multi-column commentaries with marginalia 16), archaic scripts, inconsistent typography, significant degradation (fading, bleed-through, damage), and the lack of large annotated datasets.5 These out-of-distribution characteristics often cause generic models to fail.
Specialized Techniques and Models: Addressing these requires tailored solutions. For non-Manhattan layouts, one approach involves synthetic data generation using techniques like aesthetic-guided image augmentation, which attempts to model the principles of graphic design to create more realistic complex layouts.28 Accompanying this, specialized network components like edge embedding networks (e.g., L-E3Net) have been proposed to better capture the fine-grained features of these layouts.28 For historical documents, researchers often rely on fine-tuning existing powerful architectures like Mask R-CNN 61 or evaluating modern models like LayoutLMv3 and YOLOv5 16 on specific historical collections. However, the generalizability problem persists even within historical documents, where models trained on one period or type may not transfer well to others.38 Hybrid approaches combining visual and textual cues are often deemed necessary for interpreting layouts where semantic content defines regions more than graphical features.16
Importance of Specialized Datasets: Progress in these areas heavily relies on the creation of dedicated datasets. Examples include the FPD dataset for non-Manhattan layouts 28, the HJDataset for historical Japanese documents 5, the AnnoPage dataset for historical Czech/German documents focusing on non-textual elements 1, and specific datasets curated for scholarly commentaries.93 The existence of these specialized resources underscores the fact that current SOTA models trained on general benchmarks are often insufficient for these challenging sub-domains.
B. Domain Adaptation and Generalization Strategies
The tendency for DLA models to perform poorly when applied to domains or layout styles different from their training data remains a central obstacle to widespread adoption.1 Several strategies are being pursued to improve model generalization and adaptability.
Source-Free Domain Adaptation (SFDA): This is becoming increasingly important due to practical constraints like data privacy and limited access to source training data.21 SFDA techniques aim to adapt a pre-trained source model to an unlabeled target domain using only the target data.1 Methods like DLAdapter employ self-supervised learning and dual-teacher frameworks to align feature distributions across domains without needing source examples.21 The SFDLA benchmark facilitates research in this area.21 Successful SFDA could significantly broaden the applicability of powerful DLA models trained on public data to private or sensitive enterprise datasets.
Unsupervised Pre-training and Learning: Leveraging massive amounts of unlabeled documents through self-supervised pre-training (as discussed for DiT, LayoutLM, VGT) is a primary strategy for learning generalizable representations.48 Completely unsupervised approaches, like UnSupDLA which generates pseudo-masks from self-supervised visual features 48, attempt to bypass labeled data altogether, though often at the cost of accuracy compared to supervised or self-supervised pre-trained models.
Synthetic Data Augmentation: Generating diverse synthetic documents provides a controllable way to expose models to a wider range of layouts during training, potentially improving generalization.28 The quality and diversity of the synthetic data are crucial for effectiveness.
Training on Diverse Datasets: The most direct approach to improving generalization is to train models on large-scale datasets that inherently capture diverse layouts and domains. The creation of DocLayNet 39 and M6Doc 24 reflects this understanding. Models trained on these more varied datasets demonstrably exhibit better robustness and transferability compared to those trained solely on more homogeneous datasets like PubLayNet.39 Datasets like RanLayNet are explicitly designed to evaluate domain generalization performance.31
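As a brief sketch of using such a benchmark in practice, the snippet below loads DocLayNet through the Hugging Face datasets library; the repository id, split names, and field names are assumptions of this example and should be checked against the dataset card.

```python
from datasets import load_dataset

# Diverse, human-annotated layout benchmark (financial reports, manuals,
# patents, ...); repository id assumed here.
doclaynet = load_dataset("ds4sd/DocLayNet", split="train")

sample = doclaynet[0]
print(sample.keys())   # typically a page image plus per-block boxes and category ids
```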
C. Current Limitations, Open Challenges, and Failure Modes
Despite the advancements, current state-of-the-art DLA models still exhibit several limitations and face open challenges:
Robustness: Models remain sensitive to real-world image perturbations, and performance can degrade significantly under noise, blur, or other distortions.42 Achieving consistent performance across varying image qualities is an ongoing challenge; a perturbation-augmentation sketch follows this list.
Generalization: True "universal" layout analysis remains elusive. Models often require domain-specific fine-tuning or adaptation to perform well on unseen document types or drastically different layouts.1
Complex and Overlapping Layouts: Handling highly irregular, non-Manhattan, or densely packed layouts, as well as accurately segmenting overlapping elements, continues to be difficult for many architectures.16
Logical Structure Understanding: While progress is being made, accurately inferring complex reading orders and hierarchical relationships, especially across page boundaries, requires further research.14 Many models still prioritize physical block detection over deep logical structure parsing.14
Fine-Grained and Small Elements: Detecting very small layout elements consistently (e.g., page numbers, isolated symbols, list bullets) can be challenging, although newer detectors show improvements.8
Data Dependency and Cost: The reliance on large labeled datasets remains a major bottleneck, hindering development for low-resource languages or specialized domains and raising concerns about annotation cost and privacy.21 The quality and potential biases of automatically generated or weakly supervised annotations also need consideration.17
Computational Cost and Efficiency: Many SOTA models, especially large multimodal Transformers, are computationally expensive, posing challenges for training and real-time deployment.40 The trade-off between accuracy and efficiency persists, driving research into lightweight models (like GNNs 12 or optimized CNNs/YOLO variants 40) and techniques like knowledge distillation.80 A generic distillation objective is sketched at the end of this subsection.
Architecture-Specific Weaknesses: Different architectures have inherent limitations. GNNs may struggle with visual complexity or OCR errors if not augmented with visual features.17 Purely vision-based models ignore valuable textual semantics.22 Multimodal models can introduce significant latency.40
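The perturbation sensitivity noted above can be probed, or mitigated during training, with document-style augmentations. The transforms and parameters below are illustrative choices, not the perturbation taxonomy of any cited benchmark.

```python
import torch
from torchvision import transforms

perturb = transforms.RandomChoice([
    transforms.GaussianBlur(kernel_size=5, sigma=(0.5, 2.0)),      # defocus-like blur
    transforms.ColorJitter(brightness=0.4, contrast=0.4),          # uneven illumination
    transforms.RandomRotation(degrees=2),                          # slight scanning skew
    transforms.Lambda(lambda img: torch.clamp(                     # additive sensor noise
        img + 0.05 * torch.randn_like(img), 0.0, 1.0)),
])

def augment_batch(images):
    # Apply one random perturbation to each page tensor of shape (C, H, W).
    return torch.stack([perturb(img) for img in images])
```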
This landscape of limitations reveals a persistent tension. Multimodal models often yield higher accuracy by leveraging richer information 40, but at the cost of complexity and speed. Efficient unimodal or GNN-based approaches 12 may sacrifice performance, particularly on tasks requiring nuanced semantic or visual understanding. Finding architectures that bridge this gap remains an active area of investigation.
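Knowledge distillation, mentioned above as one route to lighter models, can be summarized by its textbook objective: the student matches both the ground-truth labels and the teacher's softened predictions. The formulation below is generic, not the specific loss of any cited system.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature=2.0, alpha=0.5):
    # Soft targets: match the teacher's softened class distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: the usual supervised loss on ground-truth labels.
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1.0 - alpha) * hard
```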
D. Future Research Trajectories
Based on the current state and limitations, several key directions are likely to shape the future of DLA research:
Enhanced Robustness: Continued development of architectures and training methodologies (e.g., adversarial training, improved data augmentation, specific architectural modules like in RoDLA) that are inherently resilient to a wider range of real-world perturbations.42
Improved Generalization and Adaptation: Further exploration of self-supervised and unsupervised learning, few-shot learning 27, domain randomization, and more effective domain adaptation techniques, particularly source-free methods 21, aiming for models that require minimal or no target-domain labels.
Unified Physical and Logical Analysis: Designing truly end-to-end models that seamlessly integrate the prediction of physical layout elements, their logical roles, reading order, and hierarchical relationships, potentially leveraging graph structures to represent these connections explicitly.14
Efficiency and Lightweight Models: Research into more computationally efficient architectures, model compression techniques (like quantization or pruning), and knowledge distillation 80 to enable deployment on edge devices or in high-throughput scenarios, without excessive accuracy loss.12
Leveraging Foundation Models (LLMs/LMMs): Investigating how large language models and large multimodal models can be effectively adapted or instruction-tuned for DLA tasks.76 This includes developing methods to better incorporate fine-grained layout information into these models and potentially combining layout analysis with deeper semantic reasoning capabilities.76
Multi-Page Document Analysis: Extending DLA models to consistently process entire documents, understanding layout structures and relationships that span across multiple pages.14
Advanced Evaluation: Developing more comprehensive evaluation metrics that go beyond bounding-box mAP to assess the quality of logical structure understanding (e.g., reading order accuracy, hierarchy correctness) and to measure robustness more effectively; a sketch of one possible reading-order metric follows this list.
Continued Focus on Specialized Domains: Ongoing efforts to improve performance on particularly challenging domains like historical documents, complex forms, and non-Manhattan layouts, likely requiring domain-specific datasets and techniques.
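One possible reading-order metric of the kind called for above is Kendall's tau between the ground-truth order of layout blocks and the rank the model assigns to the same blocks; this is an illustrative choice rather than a standardized DLA metric, and it assumes predicted and ground-truth blocks have already been matched.

```python
from scipy.stats import kendalltau

def reading_order_agreement(gt_order, pred_order):
    """gt_order / pred_order: the same block ids, each listed in reading sequence."""
    pred_rank = {block_id: i for i, block_id in enumerate(pred_order)}
    # Rank the prediction gives to each block, taken in ground-truth order.
    ranks = [pred_rank[b] for b in gt_order]
    tau, _ = kendalltau(range(len(ranks)), ranks)
    return tau   # 1.0 = identical order, -1.0 = fully reversed

print(reading_order_agreement(["title", "p1", "p2", "fig"],
                              ["title", "p2", "p1", "fig"]))  # ~0.67
```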
VI. Conclusion
Document Layout Analysis has undergone a remarkable transformation, evolving from heuristic methods for simple layouts to sophisticated deep learning systems capable of parsing complex documents. The field is currently characterized by rapid innovation, driven by the critical role DLA plays in unlocking automated document understanding across numerous applications.
A. Dominant Architectural Trends in DLA
Several key architectural trends define the current state-of-the-art:
Transformer Architectures: Transformers, in various forms, are the dominant architectural paradigm. This includes Vision Transformer (ViT) backbones pre-trained on documents (like DiT), end-to-end object detection frameworks derived from DETR (especially DINO and its variants), and multimodal frameworks (like LayoutLMv3 and VGT) that integrate text, vision, and layout using Transformer encoders/decoders. The self-attention mechanism's ability to model global context appears highly effective for capturing layout structures.
Multimodal Integration: Combining visual features with textual content and layout/positional information is a proven strategy for achieving top performance. Models pre-trained jointly on these modalities learn rich, synergistic representations beneficial for DLA. While adding complexity, this approach often leads to better accuracy and semantic understanding.
End-to-End Learning: There is a clear trend towards end-to-end systems, particularly with DETR-based detectors and unified frameworks like DLAFormer. These models simplify the pipeline by reducing reliance on intermediate steps or hand-crafted components like anchor generation or NMS.
Focus on Practical Challenges: Beyond maximizing accuracy on clean benchmarks, research increasingly prioritizes practical considerations. Robustness to image perturbations, generalization across diverse document domains, computational efficiency, and adaptability (especially source-free adaptation for privacy-sensitive data) are now critical areas of investigation and differentiation.
GNNs as an Alternative: Graph Neural Networks represent a distinct and viable alternative, particularly strong for leveraging explicit structure in text-heavy documents or born-digital PDFs, and offering significant advantages in efficiency. While currently lagging behind top Transformers in overall mAP on broad benchmarks, their strengths make them valuable, especially in hybrid systems or resource-constrained settings.
B. Summary of Leading Models and Performance Benchmarks
The state-of-the-art is currently led by models based on Transformer architectures, often employing multimodal pre-training or specialized designs for detection and robustness. As summarized in Table 1 (Section IV.B), models like VGT, hybrid systems based on DINO, RoDLA, LayoutLMv3, and DiT achieve mAP scores in the 95-97% range on the widely used PubLayNet benchmark. However, performance on the more diverse and challenging DocLayNet benchmark is considerably lower, with top models (often hybrid or ensemble approaches) reaching around 80-82% mAP. This performance gap underscores the ongoing challenge of domain generalization. The introduction of robustness benchmarks like RoDLA-P further refines the evaluation landscape, highlighting models like RoDLA that maintain performance under perturbation. PubLayNet drove initial deep learning progress with its scale, while DocLayNet now serves as a crucial testbed for generalization, and robustness benchmarks assess real-world readiness.
C. Concluding Remarks on the Field's Trajectory
The trajectory of DLA clearly points towards more robust, generalizable, and efficient models. The convergence on Transformer-based architectures suggests their suitability for modeling the complex dependencies in documents. However, the significant performance difference between benchmarks like PubLayNet and DocLayNet, coupled with the growing focus on robustness and domain adaptation, indicates that the field is actively grappling with the transition from controlled environments to the complexities of real-world data. No single architecture currently reigns supreme across all criteria (accuracy, robustness, efficiency, generalization); the effectiveness of hybrid and ensemble methods highlights that combining complementary strengths is often necessary to achieve peak performance.
Future advancements will likely involve continued refinement of Transformer and multimodal architectures, breakthroughs in self-supervised learning and domain adaptation to reduce data dependency and improve generalization, the development of unified models that jointly analyze physical and logical structure across multiple pages, and the exploration of how large foundation models can be effectively leveraged. Ultimately, the goal remains the development of DLA systems that can reliably and efficiently parse the structure of any document, regardless of its origin, format, or quality, thereby enabling a vast range of downstream intelligent document processing applications.
Works cited
Document Layout Analysis - CatalyzeX, accessed April 10, 2025, https://www.catalyzex.com/s/Document%20Layout%20Analysis
Document layout analysis - Wikipedia, accessed April 10, 2025, https://en.wikipedia.org/wiki/Document_layout_analysis
[2404.17888] A Hybrid Approach for Document Layout Analysis in Document images - arXiv, accessed April 10, 2025, https://arxiv.org/abs/2404.17888
Document Layout Analysis | Papers With Code, accessed April 10, 2025, https://paperswithcode.com/task/document-layout-analysis/codeless?page=2
Document Layout Analysis | Papers With Code, accessed April 10, 2025, https://paperswithcode.com/task/document-layout-analysis
sci-hub.se, accessed April 10, 2025, https://sci-hub.se/downloads/2019-12-17/2b/binmakhashen2019.pdf
Document Structure and Layout Analysis - AMiner, accessed April 10, 2025, https://static.aminer.org/pdf/PDF/000/348/003/document_page_segmentation_and_layout_analysis_using_soft_ordering.pdf
A Hybrid Approach for Document Layout Analysis in Document images - arXiv, accessed April 10, 2025, https://arxiv.org/html/2404.17888v2
Document layout analysis - Document Intelligence - Azure AI services | Microsoft Learn, accessed April 10, 2025, https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/prebuilt/layout?view=doc-intel-4.0.0
Form Recognizer's document layout analysis model adds new structure insights, accessed April 10, 2025, https://techcommunity.microsoft.com/blog/azure-ai-services-blog/document-layout-analysis-model-by-form-recognizer-adds-new-structure-insights/3642004
Document Layout Analysis with an Enhanced Object Detector, accessed April 10, 2025, https://www.dfki.de/fileadmin/user_upload/import/12500_minouei.pdf
[2308.02051] A Graphical Approach to Document Layout Analysis - arXiv, accessed April 10, 2025, https://arxiv.org/abs/2308.02051
M2Doc: A Multi-Modal Fusion Approach for Document Layout Analysis, accessed April 10, 2025, https://ojs.aaai.org/index.php/AAAI/article/view/28552/29073
UniHDSA: A Unified Relation Prediction Approach for Hierarchical Document Structure Analysis - arXiv, accessed April 10, 2025, https://arxiv.org/html/2503.15893v1
DLAFormer: An End-to-End Transformer For Document Layout Analysis - arXiv, accessed April 10, 2025, https://arxiv.org/html/2405.11757v1
Page Layout Analysis of Text-heavy Historical Documents: a Comparison of Textual and Visual Approaches - Infoscience, accessed April 10, 2025, https://infoscience.epfl.ch/record/299775/files/2022_layout_analysis.pdf
[2308.02051] A Graphical Approach to Document Layout Analysis - ar5iv - arXiv, accessed April 10, 2025, https://ar5iv.labs.arxiv.org/html/2308.02051
LayoutLM: Pre-training of Text and Layout for Document Image Understanding - arXiv, accessed April 10, 2025, https://arxiv.org/pdf/1912.13318
(PDF) PARAGRAPH2GRAPH: A GNN-based framework for layout paragraph analysis, accessed April 10, 2025, https://www.researchgate.net/publication/370227975_PARAGRAPH2GRAPH_A_GNN-based_framework_for_layout_paragraph_analysis
Document Layout Analysis: A Comprehensive Survey | Request PDF - ResearchGate, accessed April 10, 2025, https://www.researchgate.net/publication/336593534_Document_Layout_Analysis_A_Comprehensive_Survey
SFDLA: Source-Free Document Layout Analysis - arXiv, accessed April 10, 2025, https://arxiv.org/html/2503.18742v1
[2208.10970] Doc-GCN: Heterogeneous Graph Convolutional Networks for Document Layout Analysis - ar5iv, accessed April 10, 2025, https://ar5iv.labs.arxiv.org/html/2208.10970
Doc-GCN: Heterogeneous Graph Convolutional Networks for Document Layout Analysis, accessed April 10, 2025, https://aclanthology.org/2022.coling-1.256/
[2305.08719] M6Doc: A Large-Scale Multi-Format, Multi-Type, Multi-Layout, Multi-Language, Multi-Annotation Category Dataset for Modern Document Layout Analysis - arXiv, accessed April 10, 2025, https://arxiv.org/abs/2305.08719
Document Layout Analysis | Papers With Code, accessed April 10, 2025, https://paperswithcode.com/task/document-layout-analysis/latest
[2503.17213] PP-DocLayout: A Unified Document Layout Detection Model to Accelerate Large-Scale Data Construction - arXiv, accessed April 10, 2025, https://arxiv.org/abs/2503.17213
Document Layout Analysis - Papers With Code, accessed April 10, 2025, https://paperswithcode.com/task/document-layout-analysis/codeless?page=3&q=
arXiv:2111.13809v1 [cs.CV] 27 Nov 2021, accessed April 10, 2025, https://arxiv.org/pdf/2111.13809
[2111.13809] Document Layout Analysis with Aesthetic-Guided Image Augmentation - arXiv, accessed April 10, 2025, https://arxiv.org/abs/2111.13809
Document Layout Analysis | Papers With Code, accessed April 10, 2025, https://paperswithcode.com/task/document-layout-analysis?page=4&q=
RanLayNet: A Dataset for Document Layout Detection used for Domain Adaptation and Generalization | AI Research Paper Details - AIModels.fyi, accessed April 10, 2025, https://www.aimodels.fyi/papers/arxiv/ranlaynet-dataset-document-layout-detection-used-domain
Document Layout Analysis for text extraction - python - Stack Overflow, accessed April 10, 2025, https://stackoverflow.com/questions/66473977/document-layout-analysis-for-text-extraction
Vision Grid Transformer for Document Layout Analysis | Request PDF - ResearchGate, accessed April 10, 2025, https://www.researchgate.net/publication/377419088_Vision_Grid_Transformer_for_Document_Layout_Analysis
Document Layout Analysis, a complete guide - Kili Technology, accessed April 10, 2025, https://kili-technology.com/data-labeling/machine-learning/document-layout-analysis-a-complete-guide
A Graphical Approach to Document Layout Analysis - ResearchGate, accessed April 10, 2025, https://www.researchgate.net/publication/372950690_A_Graphical_Approach_to_Document_Layout_Analysis
arXiv:2501.17887v1 [cs.CL] 27 Jan 2025, accessed April 10, 2025, https://arxiv.org/pdf/2501.17887
RoDLA: Benchmarking the Robustness of Document Layout Analysis Models | Request PDF, accessed April 10, 2025, https://www.researchgate.net/publication/384177861_RoDLA_Benchmarking_the_Robustness_of_Document_Layout_Analysis_Models
[2301.10781] Generalizability in Document Layout Analysis for Scientific Article Figure & Caption Extraction - arXiv, accessed April 10, 2025, https://arxiv.org/abs/2301.10781
[2206.01062] DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis, accessed April 10, 2025, https://arxiv.org/abs/2206.01062
DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception | OpenReview, accessed April 10, 2025, https://openreview.net/forum?id=k0X4m9GAQV
DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception - Hugging Face, accessed April 10, 2025, https://huggingface.co/papers/2410.12628
RoDLA: Benchmarking the Robustness of Document Layout Analysis Models - arXiv, accessed April 10, 2025, https://arxiv.org/html/2403.14442v1/
RoDLA: Benchmarking the Robustness of Document Layout Analysis Models - CVPR 2024 Open Access Repository - The Computer Vision Foundation, accessed April 10, 2025, https://openaccess.thecvf.com/content/CVPR2024/html/Chen_RoDLA_Benchmarking_the_Robustness_of_Document_Layout_Analysis_Models_CVPR_2024_paper.html
RoDLA: Benchmarking the Robustness of Document Layout Analysis Models - arXiv, accessed April 10, 2025, https://arxiv.org/abs/2403.14442
RoDLA: Benchmarking the Robustness of Document Layout Analysis Models - arXiv, accessed April 10, 2025, https://arxiv.org/html/2403.14442v1
RoDLA: Benchmarking the Robustness of Document Layout Analysis Models - CVF Open Access, accessed April 10, 2025, https://openaccess.thecvf.com/content/CVPR2024/papers/Chen_RoDLA_Benchmarking_the_Robustness_of_Document_Layout_Analysis_Models_CVPR_2024_paper.pdf
Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction - arXiv, accessed April 10, 2025, https://arxiv.org/html/2410.21169v2
UnSupDLA: Towards Unsupervised Document Layout Analysis - arXiv, accessed April 10, 2025, https://arxiv.org/html/2406.06236v1
UnSupDLA: Towards Unsupervised Document Layout Analysis, accessed April 10, 2025, https://www.dfki.de/fileadmin/user_upload/import/15032_UNSUPDLA.pdf
ICDAR 2021 Scientific Literature Parsing Competition - Oracle Labs, accessed April 10, 2025, https://labs.oracle.com/pls/apex/f?p=LABS:0:2901459229958:APPLICATION_PROCESS=GETDOC_INLINE:::DOC_ID:2140
[2503.18742] SFDLA: Source-Free Document Layout Analysis - arXiv, accessed April 10, 2025, https://arxiv.org/abs/2503.18742
DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis, accessed April 10, 2025, https://www.researchgate.net/publication/361051040_DocLayNet_A_Large_Human-Annotated_Dataset_for_Document-Layout_Analysis
DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception - arXiv, accessed April 10, 2025, https://arxiv.org/html/2410.12628v1
[2410.12628] DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception - arXiv, accessed April 10, 2025, https://arxiv.org/abs/2410.12628
Geometric Layout Analysis Techniques for Document Image Understanding: a Review, accessed April 10, 2025, https://www.researchgate.net/publication/2552084_Geometric_Layout_Analysis_Techniques_for_Document_Image_Understanding_a_Review
[2011.13534] A Survey of Deep Learning Approaches for OCR and ..., accessed April 10, 2025, https://ar5iv.labs.arxiv.org/html/2011.13534
Boosting Document Layout Analysis with Graphic Multi-modal Data Fusion and Spatial Geometric Transformation | OpenReview, accessed April 10, 2025, https://openreview.net/forum?id=kmbU3EdLtS
Vision Grid Transformer for Document Layout Analysis - CVF Open Access, accessed April 10, 2025, https://openaccess.thecvf.com/content/ICCV2023/papers/Da_Vision_Grid_Transformer_for_Document_Layout_Analysis_ICCV_2023_paper.pdf
Graph-based Document Structure Analysis - arXiv, accessed April 10, 2025, https://arxiv.org/html/2502.02501v1
PubLayNet val Benchmark (Document Layout Analysis) | Papers With Code, accessed April 10, 2025, https://paperswithcode.com/sota/document-layout-analysis-on-publaynet-val
Document Layout Analysis for Historical Documents - DiVA portal, accessed April 10, 2025, https://www.diva-portal.org/smash/get/diva2:1612708/FULLTEXT01.pdf
IDEA-Research/MaskDINO: [CVPR 2023] Official implementation of the paper "Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation" - GitHub, accessed April 10, 2025, https://github.com/IDEA-Research/MaskDINO
DiT: Self-supervised Pre-training for Document Image Transformers - Microsoft Research, accessed April 10, 2025, https://www.microsoft.com/en-us/research/articles/dit-self-supervised-pre-training-for-document-image-transformers/
[2203.02378] DiT: Self-supervised Pre-training for Document Image Transformer - arXiv, accessed April 10, 2025, https://arxiv.org/abs/2203.02378
A Hybrid Approach to Document Layout Analysis for Heterogeneous Document Images, accessed April 10, 2025, https://www.researchgate.net/publication/373232830_A_Hybrid_Approach_to_Document_Layout_Analysis_for_Heterogeneous_Document_Images
ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents - ar5iv, accessed April 10, 2025, https://ar5iv.labs.arxiv.org/html/2305.14962
Organized by: ICDAR23 DocLayNet - Leaderboard - EvalAI, accessed April 10, 2025, https://eval.ai/web/challenges/challenge-page/1923/leaderboard/4545/Total
[2305.06553] WeLayout: WeChat Layout Analysis System for the ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents - arXiv, accessed April 10, 2025, https://arxiv.org/abs/2305.06553
arXiv:2305.06553v1 [cs.CV] 11 May 2023, accessed April 10, 2025, https://arxiv.org/pdf/2305.06553
Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction - arXiv, accessed April 10, 2025, https://arxiv.org/html/2410.21169v2?ref=chitika.com
arXiv:2105.05727v4 [cs.CL] 21 Mar 2022 - Kun KUANG, accessed April 10, 2025, https://kunkuang.github.io/papers/ACL21-BertGCN.pdf
Document Layout Analysis | Papers With Code, accessed April 10, 2025, https://paperswithcode.com/task/document-layout-analysis/latest?page=3&q=
[2208.10970] Doc-GCN: Heterogeneous Graph Convolutional Networks for Document Layout Analysis - arXiv, accessed April 10, 2025, https://arxiv.org/abs/2208.10970
ivanstepanovftw/glam: Graph-based Layout Analysis Model - GitHub, accessed April 10, 2025, https://github.com/ivanstepanovftw/glam
[2308.14978] Vision Grid Transformer for Document Layout Analysis - ar5iv - arXiv, accessed April 10, 2025, https://ar5iv.labs.arxiv.org/html/2308.14978
LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding - arXiv, accessed April 10, 2025, https://arxiv.org/html/2404.05225v1
[2308.15517] Document AI: A Comparative Study of Transformer-Based, Graph-Based Models, and Convolutional Neural Networks For Document Layout Analysis - arXiv, accessed April 10, 2025, https://arxiv.org/abs/2308.15517
Doc2Graph: a Task Agnostic Document Understanding Framework based on Graph Neural Networks - UAB, accessed April 10, 2025, http://refbase.cvc.uab.es/files/GBC2022.pdf
Daily Papers - Hugging Face, accessed April 10, 2025, https://huggingface.co/papers?q=table%20detection
EDocNet: Efficient Datasheet Layout Analysis Based on Focus and Global Knowledge Distillation - arXiv, accessed April 10, 2025, https://arxiv.org/pdf/2502.16541
[D] Document layout - recreating the structure : r/MachineLearning - Reddit, accessed April 10, 2025, https://www.reddit.com/r/MachineLearning/comments/174odzs/d_document_layout_recreating_the_structure/
(PDF) Unifying Vision, Text, and Layout for Universal Document Processing - ResearchGate, accessed April 10, 2025, https://www.researchgate.net/publication/366063398_Unifying_Vision_Text_and_Layout_for_Universal_Document_Processing
Detect-Order-Construct: A Tree Construction based Approach for Hierarchical Document Structure Analysis - arXiv, accessed April 10, 2025, https://arxiv.org/html/2401.11874v2
A Hybrid Approach for Document Layout Analysis in Document images - ResearchGate, accessed April 10, 2025, https://www.researchgate.net/publication/380269819_A_Hybrid_Approach_for_Document_Layout_Analysis_in_Document_images
ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents - arXiv, accessed April 10, 2025, https://arxiv.org/abs/2305.14962
Unifying Vision, Text and Layout for Universal Document Processing (CVPR 2023 supplemental material) - CVF Open Access, accessed April 10, 2025, https://openaccess.thecvf.com/content/CVPR2023/supplemental/Tang_Unifying_Vision_Text_CVPR_2023_supplemental.pdf
ds4sd/icdar2023-doclaynet · Datasets at Hugging Face, accessed April 10, 2025, https://huggingface.co/datasets/ds4sd/icdar2023-doclaynet
Document Layout Analysis - state-of-the-art? - Data Science Stack Exchange, accessed April 10, 2025, https://datascience.stackexchange.com/questions/19377/document-layout-analysis-state-of-the-art
Machine Learning Datasets - Papers With Code, accessed April 10, 2025, https://paperswithcode.com/datasets?q=segmentation&v=lst&o=match&task=document-layout-analysis
Daily Papers - Hugging Face, accessed April 10, 2025, https://huggingface.co/papers?q=multi-page%20document%20classification%20datasets
joheras/Lecturas - GitHub, accessed April 10, 2025, https://github.com/joheras/Lecturas
arXiv:2004.08686v1 [cs.CV] 18 Apr 2020 - Harvard University, accessed April 10, 2025, https://scholar.harvard.edu/files/dell/files/2004.08686.pdf
[2212.13924] Page Layout Analysis of Text-heavy Historical Documents: a Comparison of Textual and Visual Approaches - arXiv, accessed April 10, 2025, https://arxiv.org/abs/2212.13924