Table of Contents
Multilingual And Multimodal LLMs for Document Digitization A Literature Review
Overview
Architectural Adaptations for Multilingual and Multimodal Input
Key Challenges in Multilingual Multimodal Processing
Benchmarks and State-of-the-Art Performance
Practical Engineering Considerations
Key Takeaways
Overview
Large language models (LLMs) are being extended to handle multiple languages and data modalities (text, images, tables, speech, etc.) to better support document digitization and analysis. Traditionally, many foundation models have focused on English text, but recent research emphasizes inclusivity across diverse languages and input types (Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages). This review synthesizes the latest (2024–2025) research on system design changes required for multilingual, multimodal LLMs, the challenges encountered (from tokenization to OCR and cross-modal alignment), benchmark performance of state-of-the-art models, and practical engineering considerations for deploying these systems.
Architectural Adaptations for Multilingual and Multimodal Input
Unified Tokenization and Embeddings: Multilingual support often begins with a shared subword tokenizer (e.g. SentencePiece or BPE) covering many scripts. A single vocabulary enables one model to ingest different languages, but if training data is heavily English-centric, the tokenizer may fragment other languages into inefficient byte-level tokens (How does a Language-Specific Tokenizer affect LLMs?). Modern LLMs like LLaMA-2 were ~90% trained on English, causing non-Latin languages to be broken into many small tokens, which limits effective context length and representation quality. One remedy is vocabulary extension: adding language-specific tokens and merge rules to better encode under-represented languages. For example, recent studies show that extending a tokenizer for Korean yields more stable, sensible outputs and lower perplexity on that language. While extending the vocab requires retraining embeddings, it is far cheaper than training a new model from scratch and markedly improves multilingual performance.
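To make the vocabulary-extension idea concrete, here is a minimal sketch using the Hugging Face transformers API. The base model ID and the added Korean subwords are placeholders for illustration; the newly added embedding rows start out untrained and would need continued pre-training on in-language text.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Hypothetical base model and token list -- illustrative only.
base = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Language-specific subwords mined from a Korean corpus (placeholder examples).
new_tokens = ["안녕하세요", "습니다", "데이터"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix so the new token ids have rows; these rows are
# randomly initialized and must be trained on Korean text afterwards.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```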
Multimodal Input Encoders: To handle non-text modalities, LLM architectures incorporate additional encoders or embedding pipelines. A common design is to prepend visual or audio features as special tokens to the text transformer. For document images, one approach is to use a vision transformer to generate visual patch embeddings (plus 2D position coordinates) that are fed into the LLM alongside text tokens (DocLayLLM: An Efficient and Effective Multi-modal Extension of Large Language Models for Text-rich Document Understanding). Liao et al. (2024) follow this strategy in DocLayLLM, seamlessly integrating image patches and layout positions into a language model, which lets the LLM leverage its natural text comprehension while enhancing perception of spatial OCR information. This avoids treating text and layout as separate streams – the unified transformer can attend across both, after minimal adaptation. Another strategy is lightweight modality adapters: for instance, Apple’s FLoRA method attaches small low-rank adapter layers to a pre-trained text LLM to ingest new modalities (Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection - Apple Machine Learning Research). Palaskar et al. (2024) showed that with FLoRA, a text-only model can be augmented to handle audio inputs (for speech detection) with only a fraction of parameters updated, yet matching the performance of full multimodal fine-tuning. This modular approach simplifies adding modalities (audio, video) on top of existing LLMs without costly re-training.
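As a rough illustration of the "visual features as prefix tokens" design described above (not the actual DocLayLLM or FLoRA code), the sketch below projects ViT patch embeddings into the LLM’s hidden size and concatenates them in front of the text embeddings; the dimensions are assumptions.

```python
import torch
import torch.nn as nn

class VisualPrefixFusion(nn.Module):
    """Illustrative fusion: project ViT patch features and prepend them
    to the text token embeddings before the LLM's transformer blocks."""

    def __init__(self, vision_dim=768, llm_dim=4096):
        super().__init__()
        self.projector = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_feats, text_embeds):
        # patch_feats: (batch, n_patches, vision_dim) from a frozen ViT
        # text_embeds: (batch, n_text_tokens, llm_dim) from the LLM's embedding layer
        visual_tokens = self.projector(patch_feats)
        # The LLM then attends jointly over visual and textual positions.
        return torch.cat([visual_tokens, text_embeds], dim=1)

fusion = VisualPrefixFusion()
fused = fusion(torch.randn(2, 196, 768), torch.randn(2, 32, 4096))
print(fused.shape)  # torch.Size([2, 228, 4096])
```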
Layout and Structure Modeling: Documents often contain structured layouts (forms, tables, multi-column text) that pure text models miss. Recent systems explicitly incorporate layout features or tasks into LLM training. LayoutLLM (CVPR 2024) introduced layout-aware instruction tuning, with pre-training tasks at document-level, region-level, and segment-level to teach the model how to utilize spatial structure (e.g. reading order, section boundaries) (LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding). They also use a Layout Chain-of-Thought mechanism, guiding the model to focus on the region relevant to a query before answering. Another approach is found in DocLLM (JPMorgan AI, 2024), which avoids heavy image encoders entirely and instead feeds the model textual content plus each text segment’s bounding-box coordinates (DocLLM: A layout-aware generative language model for multimodal document understanding). By decomposing the transformer’s attention into text vs. spatial sub-matrices, DocLLM achieves cross-modal alignment between content and position without processing raw pixels. This lightweight layout encoding captures document structure while saving computation, illustrating that effective multimodal design doesn’t always require end-to-end image modeling.
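A simplified sketch of layout-aware input in the spirit of DocLLM’s text-plus-coordinates idea: each OCR segment contributes a token embedding plus an embedding of its normalized bounding box. Note that DocLLM itself uses disentangled text/spatial attention rather than the plain summation shown here; the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class LayoutAwareEmbedding(nn.Module):
    """Sketch of adding spatial information to token embeddings: each OCR
    segment's normalized bounding box (x0, y0, x1, y1) is embedded and
    summed with the token embedding."""

    def __init__(self, vocab_size=32000, dim=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)
        self.bbox_proj = nn.Linear(4, dim)

    def forward(self, token_ids, bboxes):
        # token_ids: (batch, seq), bboxes: (batch, seq, 4) normalized to [0, 1]
        return self.token_emb(token_ids) + self.bbox_proj(bboxes)

emb = LayoutAwareEmbedding()
out = emb(torch.randint(0, 32000, (1, 16)), torch.rand(1, 16, 4))
print(out.shape)  # torch.Size([1, 16, 512])
```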
Key Challenges in Multilingual Multimodal Processing
Tokenization & Script Diversity: As noted, a single vocabulary can struggle with diverse scripts. Excessive splitting of words (e.g. into bytes or characters) in low-resource languages leads to longer input sequences and lost semantic context (How does a Language-Specific Tokenizer affect LLMs?). Morphologically rich languages or those without whitespace (Chinese, Thai) are particularly affected. Ensuring the tokenizer respects word boundaries or common morphemes in each language is hard when sharing across 30+ languages. Researchers address this by increasing vocabulary size or using language-specific subtoken additions, but this raises embedding alignment issues (making sure new tokens integrate meaningfully with original ones). Maintaining a balanced training mix is also critical – if one language dominates, others suffer (the curse of multilinguality). Recent multilingual LLMs like Pangea-7B found that performance in each language depends on having the right proportion of English vs. non-English data and on language popularity; under-sampling high-resource languages can help elevate low-resource ones (Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages).
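A quick way to diagnose this fragmentation is to measure tokenizer "fertility" (subword tokens per word) across languages. The sketch below does this with a hypothetical set of sample sentences; whitespace word counting is a crude proxy for scripts like Thai that do not use spaces.

```python
from transformers import AutoTokenizer

# Illustrative sample sentences; in practice use a held-out corpus per language.
samples = {
    "en": "The bank approved the loan application yesterday.",
    "ko": "은행은 어제 대출 신청을 승인했습니다.",
    "th": "ธนาคารอนุมัติคำขอสินเชื่อเมื่อวานนี้",
}

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

for lang, text in samples.items():
    n_tokens = len(tokenizer.tokenize(text))
    n_words = max(len(text.split()), 1)  # crude for non-whitespace scripts
    print(f"{lang}: {n_tokens} tokens, ~{n_tokens / n_words:.1f} tokens/word")
```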
Cross-Lingual Embedding Alignment: Beyond tokenization, the model’s internal representations need to align semantically across languages. If “bank account” in English and its Arabic equivalent end up in very different embedding subspaces, the model cannot easily transfer knowledge or do cross-lingual tasks. Multilingual pre-training on parallel corpora can induce some alignment, but additional techniques are being explored. Li et al. (2024) propose a post-pretraining alignment using translation pairs with a contrastive loss to explicitly pull together embeddings of sentences that mean the same thing in different languages (Align after Pre-train: Improving Multilingual Generative Models with Cross-lingual Alignment | OpenReview). Even using <0.1% of the original training data for such alignment, they significantly improved cross-lingual downstream performance. This indicates that a relatively small intervention can mitigate the isolation of representations for different languages.
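The sketch below shows one way such an alignment objective can look: an InfoNCE-style contrastive loss over mean-pooled sentence representations of translation pairs. It is a generic illustration in the spirit of the approach, not the paper’s exact recipe.

```python
import torch
import torch.nn.functional as F

def translation_contrastive_loss(src_hidden, tgt_hidden, temperature=0.05):
    """InfoNCE-style loss pulling together representations of sentences that
    are translations of each other and pushing apart non-parallel pairs.
    src_hidden / tgt_hidden: (batch, dim) mean-pooled sentence embeddings,
    where row i of each tensor is a translation pair."""
    src = F.normalize(src_hidden, dim=-1)
    tgt = F.normalize(tgt_hidden, dim=-1)
    logits = src @ tgt.T / temperature          # (batch, batch) similarity matrix
    labels = torch.arange(src.size(0), device=src.device)
    # Symmetric cross-entropy: each sentence should match its own translation.
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

loss = translation_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```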
Multimodal Fusion & Alignment: Integrating modalities poses its own alignment challenges. Visual and textual information must be mapped into a common latent space or at least made mutually understandable to the model. A classic solution is contrastive image-text pretraining (exemplified by CLIP), but generative LLMs require deeper fusion than just matching captions to images. Many multimodal LLMs adopt a two-stage architecture: a modality encoder (e.g. CNN or ViT for images, audio encoder for speech) produces embeddings, and an integration layer (often a projector or cross-attention module) feeds those into the language model (MM-LLMs: Recent Advances in MultiModal Large Language Models). Tuning these components jointly is tricky – early layers must learn modality-specific features (pixels vs. phonemes) while later layers align them with text semantics. Some research (e.g. Huang et al., 2023 with AudioGPT) treats speech as another token stream via an intermediate recognition step (Self-Powered LLM Modality Expansion for Large Speech-Text Models), essentially converting audio to text tokens using an ASR model (like Whisper) and then using the LLM as normal. This pipeline simplifies integration but relies on the quality of the speech-to-text component, which may falter for dialects or code-switching. Fully end-to-end speech+text LLMs (like Meta’s SeamlessM4T) jointly learn multiple modalities but need huge training resources. Recent work on adapter fusion (as noted with FLoRA) suggests we can attach audio or vision understanding to a frozen LLM incrementally. Ensuring that the LLM “pays attention” to these new modality tokens appropriately (and is not overwhelmed by the abundant text weights) remains an open challenge.
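The pipeline-style integration mentioned above can be as simple as the following sketch: transcribe audio with an off-the-shelf ASR model and hand the transcript to a text-only LLM. The audio file name and prompt are hypothetical, and any transcription errors propagate downstream.

```python
from transformers import pipeline

# Stage 1: speech -> text with an off-the-shelf ASR model.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
transcript = asr("meeting_recording.wav")["text"]  # hypothetical audio file

# Stage 2: feed the transcript to a text-only LLM as ordinary tokens.
llm = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")
prompt = f"Summarize the following meeting transcript:\n\n{transcript}\n\nSummary:"
print(llm(prompt, max_new_tokens=200)[0]["generated_text"])

# Note: transcription errors (dialects, code-switching) propagate to the LLM,
# which is the main limitation of this modular design.
```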
Optical Character Recognition (OCR) for Low-Resource Languages: Document digitization often starts with OCR to convert images of text into characters. For many languages, especially those with complex scripts or limited training data, OCR quality is a major bottleneck. A 2023 survey highlights open problems in scaling OCR to low-resource languages, from lack of annotated data to diverse font/printing variations (A Concise Survey of OCR for Low-Resource Languages). The creators of Pangea-7B specifically flagged multilingual OCR as particularly challenging in their multimodal LLM system (Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages). They augmented training with 500K synthetic OCR examples across 10 languages (by screenshotting websites in those languages) to improve the model’s ability to read text in images. This did boost OCR accuracy, but non-Latin scripts (Chinese, Japanese, etc.) still lagged significantly behind Latin ones. The results suggest that much more data (and possibly new OCR-specific model components) are needed for equitable performance. Integrating specialized OCR engines into the LLM pipeline is a practical workaround: e.g. use Google’s OCR for Telugu text, then feed the recognized text to the LLM. However, this two-step process can break the end-to-end flow and may not capture layout or font nuances that an integrated multimodal model could. Finding the right balance between end-to-end learning and modular OCR remains an active area – some models like DocLayLLM demonstrate that an LLM augmented with visual tokens can even outperform traditional OCR-based pipelines (DocLayLLM: An Efficient and Effective Multi-modal Extension of Large Language Models for Text-rich Document Understanding), hinting at the potential of tightly-coupled vision-language reasoning.
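Synthetic OCR data of the kind described above can be approximated by rendering in-language text into images, as in this hedged sketch with Pillow (the font path and sentence are placeholders, and complex-script shaping may require Pillow built with libraqm):

```python
from PIL import Image, ImageDraw, ImageFont

def render_ocr_sample(text, font_path, out_path, size=(800, 120)):
    """Render a line of text to an image; the (image, text) pair becomes a
    synthetic OCR training example for the target script."""
    img = Image.new("RGB", size, color="white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(font_path, 36)   # font must cover the script
    draw.text((10, 30), text, fill="black", font=font)
    img.save(out_path)
    return out_path, text

# Placeholder font path and sentence -- swap in real in-language corpora/fonts.
render_ocr_sample("ఇది తెలుగు వాక్యం.", "NotoSansTelugu-Regular.ttf", "sample_0.png")
```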
Handling Diverse Modalities (Tables, Forms, Math, etc.): Beyond plain text and images, real documents include tables, charts, formulas, and other formats. Each requires special treatment. Tables, for instance, carry a 2D grid structure and sometimes calculations; simply reading left-to-right might scramble their meaning. LLMs can struggle with tables if given as linearized text. One solution is to detect tables and convert them to a structured form (JSON or Markdown) before feeding to the model, preserving cell boundaries. Some benchmarks like EXAMS-V explicitly include tables, diagrams, and equations in their multimodal questions (EXAMS-V: A Multi-Discipline Multilingual Multimodal Exam Benchmark for Evaluating Vision Language Models), requiring models to jointly interpret text and visual elements. Current top models (e.g. GPT-4V) still find this difficult. For math equations or scientific charts, a combination of OCR (to get printed math as LaTeX) and domain-specific parsing might be necessary alongside the LLM. In short, each modality demands embedding alignment with text: e.g. a table’s row/column headers need to align with how a question refers to them. Custom encoders (like graph neural nets for tables or LaTeX parsers for math) may be integrated into future LLM systems to handle these seamlessly. So far, most multimodal LLMs handle images and text; handling nested modalities (an image that contains a table with text) is an evolving challenge.
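For the table case, one pragmatic preprocessing step is to serialize a detected table into Markdown so cell boundaries survive linearization. The helper below is a minimal sketch with made-up table content:

```python
def table_to_markdown(rows):
    """Convert a detected table (list of rows, first row = header) into a
    Markdown table so the LLM sees explicit cell boundaries."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(str(c) for c in row) + " |" for row in body]
    return "\n".join(lines)

rows = [["Quarter", "Revenue", "Margin"],
        ["Q1 2024", "$1.2M", "18%"],
        ["Q2 2024", "$1.5M", "21%"]]
prompt = (f"Answer using the table below.\n\n{table_to_markdown(rows)}\n\n"
          "Q: Which quarter had the higher margin?")
print(prompt)
```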
Benchmarks and State-of-the-Art Performance
To evaluate these multilingual, multimodal capabilities, new benchmarks have emerged. PangeaBench (2024) is a suite of 14 datasets covering 47 languages, testing models on image-based tasks in diverse cultural contexts (Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages). On this benchmark, Pangea-7B (a 7B-parameter vision-language model trained on 39 languages) achieves state-of-the-art results – on par with the best open models in English, and substantially better in multilingual settings. Notably, Pangea-7B outperforms other open-source models in tasks requiring cross-lingual understanding and cultural nuance, highlighting the impact of its inclusive training data. This demonstrates that targeted multilingual multimodal training can close the gap with English-centric models, at least on academic benchmarks.
Another comprehensive benchmark, M5 (Multilingual Multicultural Multimodal Benchmark), examines model performance across 8 datasets, 5 task types, and 41 languages (M5 – A Diverse Benchmark to Assess the Performance of Large Multimodal Models Across Multilingual and Multicultural Vision-Language Tasks - ACL Anthology). Schneider and Sitaram (2024) found substantial performance disparities between high-resource and low-resource languages on vision-language tasks. Surprisingly, they also noted that bigger is not always better – larger models did not consistently outperform smaller ones in multilingual tests. This suggests that simply scaling parameters won’t solve multilingual generalization; data diversity and training strategy matter more. M5 also introduced a challenging Visio-Linguistic Outlier Detection task (finding culturally out-of-context elements in images), where all tested models performed near random chance. Such results pinpoint remaining blind spots of current LLMs, especially for culturally specific reasoning that wasn’t covered in training.
For document-specific tasks, standard benchmarks include form understanding (e.g. XFUND for multilingual forms), document QA (DocVQA, InfoVQA), and table QA (WikiTables, ChartQA). On many of these, specialized models are starting to overtake general LLMs. For example, DocLLM’s layout-aware model, after fine-tuning on four core document tasks, outperformed prior state-of-the-art LLMs on 14 out of 16 datasets evaluated (DocLLM: A layout-aware generative language model for multimodal document understanding). It also generalized well to most unseen datasets, indicating robust learning of document structures. In visual document question answering, stepwise reasoning approaches are proving valuable. Zhang et al. (2024) augment a smaller multimodal model with intermediate reasoning steps (using a larger LLM to generate synthetic chain-of-thought data), achieving +5% accuracy on the complex InfoVQA benchmark and +7% on ChartQA relative to direct answering (Read and Think: An Efficient Step-wise Multimodal Language Model for Document Understanding and Reasoning). This demonstrates that forcing the model to “think step-by-step” can yield better comprehension of charts and densely formatted pages.
At the high end, proprietary models like GPT-4V (OpenAI) and Google’s Gemini (multimodal successor to PaLM) currently lead many benchmarks, but even they struggle on the hardest tasks. The EXAMS-V benchmark – 20k multimodal high-school exam questions in 11 languages – stumps these advanced models, with GPT-4 Vision and Gemini underperforming on many questions (EXAMS-V: A Multi-Discipline Multilingual Multimodal Exam Benchmark for Evaluating Vision Language Models). The questions often require combining text, image, and world knowledge in a specific language, illustrating that no model yet fully masters joint multilingual and multimodal reasoning in open domains. We are beginning to see head-to-head evaluations: for instance, Xie et al. (2024) report their PDF-WuKong model (designed for long academic PDFs) outperforms "proprietary products" by 8.6% F1 on a long-document QA task (PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling). This hints that focused research prototypes can sometimes beat general-purpose commercial systems on niche tasks, though the gap may not last long as industry models rapidly incorporate similar ideas.
In summary, benchmarks are evolving to test both breadth (many languages, modalities, and cultures as in M5 and PangeaBench) and depth (complex, multi-hop reasoning as in EXAMS-V). The best open models are closing the performance gap in multilingual settings (Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages), but significant challenges remain evident whenever the input is far from the training domain – e.g. a low-resource language script, a densely formatted scientific table, or an image requiring cultural context to interpret.
Practical Engineering Considerations
Designing and deploying multilingual, multimodal LLM systems for documents involves trade-offs in computational cost, complexity, and integration. Key considerations include:
Computational Cost & Model Size: Supporting dozens of languages and multiple modalities typically increases model size and training data requirements. Vocabulary extension and extra encoder modules add parameters. Training a model like Pangea-7B (multilingual vision-LM) means handling a 6M example instruction corpus across 39 languages (Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages), which is computationally expensive. One way to mitigate costs is to leverage frozen pre-trained components (e.g. a pre-trained ViT for images or Whisper for speech) and only train a bridging layer. This reuse, however, requires careful alignment. Another strategy is Mixture-of-Experts (MoE), where separate expert subnetworks handle different languages or modalities, activating only a subset per input to save computation. While MoE can scale to many languages without blowing up inference cost, it adds system complexity (routing, load-balancing experts) and is an area of ongoing research.
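A minimal sketch of the frozen-components strategy follows: both pre-trained models are frozen and only a small bridging projector is exposed to the optimizer. The specific model IDs are stand-ins (a production system would pair a much stronger LLM with the vision encoder):

```python
import torch.nn as nn
from transformers import CLIPVisionModel, AutoModelForCausalLM

# Pre-trained components stay frozen; only the bridging projector is trained.
vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16")
llm = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for a larger LLM

for p in vision.parameters():
    p.requires_grad = False
for p in llm.parameters():
    p.requires_grad = False

projector = nn.Linear(vision.config.hidden_size, llm.config.hidden_size)

trainable = list(projector.parameters())
print(f"Trainable parameters: {sum(p.numel() for p in trainable):,} (projector only)")
# An optimizer would then be built over `trainable` alone, e.g.
# torch.optim.AdamW(trainable, lr=1e-4)
```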
Inference Efficiency: Multimodal LLMs can be slow at runtime. Processing an image or audio input involves running a hefty encoder (like a ResNet or transformer) before the text generation even begins. If documents have multiple pages or many images, inference latency multiplies. Engineers are exploring sparse computation and retrieval to speed this up. The PDF-WuKong system introduces a sparse sampler that learns to pick only the most relevant parts of a long document (both text paragraphs and figures) to feed into the model (PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling). By filtering irrelevant sections, the model avoids wasting its limited context and compute on the entire 100-page PDF, focusing instead on, say, the one page that likely contains the answer. This kind of smart chunking or content selection dramatically improves efficiency and even accuracy, since the model isn’t distracted by extraneous data. In production document pipelines, a similar approach is to use an external search index: first split a large document into chunks (by page or section), embed them and retrieve top-k chunks relevant to the query, and only feed those into the LLM. This retrieval-augmented strategy is popular to cope with long texts and is naturally language-agnostic (it works as long as your embeddings can handle multilingual text).
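The external-index variant can be sketched in a few lines with a multilingual sentence encoder: chunk the document, embed chunks and query, and keep only the top-k chunks for the LLM prompt. The encoder choice and character-based chunking are assumptions, not a prescribed recipe:

```python
from sentence_transformers import SentenceTransformer, util

def top_k_chunks(document_text, query, k=3, chunk_size=1000):
    """Split a long document into chunks, embed them, and return the k chunks
    most similar to the query -- only these are passed to the LLM."""
    chunks = [document_text[i:i + chunk_size]
              for i in range(0, len(document_text), chunk_size)]
    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    chunk_emb = encoder.encode(chunks, convert_to_tensor=True)
    query_emb = encoder.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, chunk_emb)[0]
    best = scores.topk(min(k, len(chunks))).indices.tolist()
    return [chunks[i] for i in best]

# context = "\n\n".join(top_k_chunks(long_pdf_text, "What is the reported F1 score?"))
# The LLM prompt then contains only `context`, not the full 100-page document.
```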
Integration into Pipelines: Many real-world document processing systems are modular – e.g. scan -> OCR -> translate -> analyze -> summarize. Replacing all modules with one giant multimodal LLM is tempting but may be impractical. Instead, hybrid solutions are used. For example, one can use specialized OCR or ASR tools for each language (since they might be more accurate than a general LLM at raw transcription), then feed the extracted text into an LLM for understanding. This pipeline allows swapping out the OCR component for improvements without retraining the LLM. However, tight integration can yield better results as shown by DocLayLLM, which directly learns from OCR outputs and visual features together, beating systems that do OCR separately (DocLayLLM: An Efficient and Effective Multi-modal Extension of Large Language Models for Text-rich Document Understanding). A practical compromise is to have the LLM call tools on the fly – e.g. an LLM could detect it’s dealing with an image in an unfamiliar script and invoke an OCR API (this is akin to the ReAct/Toolformer paradigm). Such dynamic tool use can combine the strengths of specialist models with the reasoning of LLMs. Engineering these pipelines requires careful orchestration and may involve frameworks for LLM agents.
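A hedged sketch of the modular OCR-then-LLM route, using pytesseract as a stand-in OCR engine (a cloud OCR API could sit behind the same function); the language code, file name, and model ID are illustrative:

```python
import pytesseract
from PIL import Image
from transformers import pipeline

def ocr_then_analyze(image_path, lang="tel", question="Summarize this document."):
    """Modular pipeline: a specialist OCR engine extracts the text, then a
    text LLM reasons over it. Swapping the OCR backend needs no LLM retraining."""
    # Requires a local Tesseract install with the relevant language pack.
    text = pytesseract.image_to_string(Image.open(image_path), lang=lang)
    llm = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")
    prompt = f"Document text (OCR output, may contain errors):\n{text}\n\nTask: {question}"
    return llm(prompt, max_new_tokens=256)[0]["generated_text"]

# print(ocr_then_analyze("telugu_invoice.png"))  # hypothetical scanned page
```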
Memory and Scaling: Serving a multilingual model can be memory-intensive due to the large vocabulary and parameters needed to cover many languages. If a use-case only needs a few languages or one modality, a slimmed-down model might be preferred for speed and cost. Techniques like LoRA (Low-Rank Adaptation) or prompt tuning enable maintaining a single big model but loading small adaptation weights for specific domains or languages on demand. For instance, an AI service might keep an English-only LLM and only activate a multilingual extension component when non-English text is detected. This conditional routing saves time. Additionally, quantization of models (down to 8-bit or 4-bit weights) is often applied to large multimodal LLMs to fit them on GPUs for inference, though one must ensure that quantization doesn’t disproportionately hurt performance on certain languages (which might happen if those languages rely on subtle embedding distinctions that get lost with low precision).
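Conditional routing of this kind might look like the following sketch: detect the input language and attach a language-specific LoRA adapter to a 4-bit-quantized base model via peft. The adapter repositories are hypothetical placeholders:

```python
from langdetect import detect
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# Hypothetical adapter repositories -- one small LoRA per language group.
ADAPTERS = {"ar": "my-org/llm-lora-arabic", "ko": "my-org/llm-lora-korean"}

def load_model_for(text, base_id="meta-llama/Llama-2-7b-hf"):
    """Load a 4-bit quantized base model and, if the detected language has a
    dedicated adapter, attach it; otherwise serve the base model as-is."""
    quant = BitsAndBytesConfig(load_in_4bit=True)
    model = AutoModelForCausalLM.from_pretrained(base_id, quantization_config=quant)
    lang = detect(text)
    if lang in ADAPTERS:
        model = PeftModel.from_pretrained(model, ADAPTERS[lang])
    return model, lang
```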
Evaluation and Monitoring: From an engineering standpoint, supporting multiple languages means expanded testing – one must evaluate the system’s accuracy on each language and modality combination of interest. New benchmarks like M5 (M5 – A Diverse Benchmark to Assess the Performance of Large Multimodal Models Across Multilingual and Multicultural Vision-Language Tasks - ACL Anthology) and EXAMS-V (EXAMS-V: A Multi-Discipline Multilingual Multimodal Exam Benchmark for Evaluating Vision Language Models) are useful guides, but organizations often develop internal test sets (e.g. company documents in various languages) to ensure the system meets specific needs. Monitoring a live system requires logging not just overall success but tracking if certain languages or formats consistently fail. This can inform data collection for the next training cycle (e.g. if the system struggles with Arabic handwriting, gather more of that data). Fairness and bias also come into play: a multilingual model should be checked for any bias in how it handles different scripts or cultures – a known issue since many LLMs inherited skews from predominantly English internet data. Ongoing maintenance is needed to keep performance balanced.
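Per-language monitoring can start as simply as grouping accuracy by language over logged predictions, as in this sketch (the log schema is an assumption):

```python
from collections import defaultdict

def accuracy_by_language(records):
    """records: iterable of dicts like {"lang": "ar", "correct": True}
    (assumed log schema). Returns per-language accuracy so a drop in any
    single script is visible rather than averaged away."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["lang"]] += 1
        hits[r["lang"]] += int(r["correct"])
    return {lang: hits[lang] / totals[lang] for lang in totals}

logs = [{"lang": "en", "correct": True}, {"lang": "ar", "correct": False},
        {"lang": "ar", "correct": True}, {"lang": "en", "correct": True}]
print(accuracy_by_language(logs))  # {'en': 1.0, 'ar': 0.5}
```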
Key Takeaways
Multilingual LLM Design: Requires inclusive tokenization and training data. Using a shared subword vocabulary across languages is common, but additional steps (vocab expansion, alignment objectives) are needed to avoid favoring high-resource languages (How does a Language-Specific Tokenizer affect LLMs?). Properly balanced data and slight architecture tweaks can yield strong cross-lingual performance, as seen with models like Pangea-7B (Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages).
Multimodal Integration: Demands new system components (vision/audio encoders or adapters) and alignment mechanisms. Effective approaches include feeding image patches and layout tokens into the transformer (DocLayLLM: An Efficient and Effective Multi-modal Extension of Large Language Models for Text-rich Document Understanding), or using lightweight adapters to plug modalities into an existing LLM (Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection - Apple Machine Learning Research). Ensuring the model can attend to both textual and visual/spatial context is crucial for document tasks.
Core Challenges: Tokenization of diverse scripts, cross-lingual semantic alignment, and reliable OCR for low-resource languages remain tough problems. Even advanced models see accuracy drop on non-Latin scripts and under-represented languages. Complex layouts (tables, forms) and mixed-modality content (diagrams with text) require the model to reason beyond sequential text, often with specialized training (e.g. layout-aware tuning, chain-of-thought) to guide it.
Performance Trends: Specialized multimodal LLMs are closing the gap with or exceeding general models on document understanding benchmarks (DocLLM: A layout-aware generative language model for multimodal document understanding). However, evaluation suites like M5 and EXAMS-V reveal that no current model excels across all languages and modalities – high-resource languages still greatly outperform low-resource ones, and tasks combining vision, language, and cultural knowledge push models to their limits. There is active research to address these gaps, including using larger diverse training sets and explicit alignment techniques.
Engineering Best Practices: In practice, systems often combine LLMs with traditional tools. Chunking long documents (via retrieval or learned sparse sampling) is essential for efficiency (PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling). Adapters and modular architectures enable adding capabilities (new language or modality) without rebuilding from scratch. Finally, evaluation and iteration are key – a multilingual multimodal system requires continuous tuning to handle new document types, languages, and use cases as they arise, ensuring that the benefits of broad language and modality support are fully realized in real-world deployments.