Table of Contents
Extracting and Processing Graphs and Charts in RAG
Embedding Charts and Graphs into Vector Stores
Challenges in Integrating Graph and Chart Data
Best Practices and Industry Applications
Extracting and Processing Graphs and Charts in RAG
Retrieval-Augmented Generation (RAG) has evolved to handle multimodal content, including graphs and charts embedded in documents. In industry settings (finance, healthcare, scientific publishing), critical information is often conveyed in figures. Recent work in 2024 and 2025 emphasizes techniques to digitize and chunk these visuals for LLMs, integrating them into retrieval pipelines alongside text. This review highlights methods for extracting chart data, embedding visual content into vector stores, challenges of integrating these modalities, and best practices from industry applications.
Document Parsing and Segmentation: The first step is identifying and extracting charts/graphs from documents. Modern document parsing tools can split PDFs into textual sections and images. For example, Azure’s Document Intelligence can isolate text and even OCR any images within a file (Integrating vision into RAG applications | Microsoft Community Hub). Many pipelines now separate images from text content during ingestion (An Easy Introduction to Multimodal Retrieval-Augmented Generation | NVIDIA Technical Blog), so each chart or graph can be processed independently. Research highlights that handling figures is more complex than plain text: “Documents with charts, tables, maths are more complex… Some parsers combine OCR with computer vision and LLMs” (Graph RAG – Orbifold Consulting). This means chart handling often requires a combination of techniques (e.g. detecting text in the image, analyzing visual elements, and using language models to interpret context).
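To make this concrete, the sketch below shows one way the ingestion step could separate page text from embedded images using PyMuPDF; the file path and chunk layout are illustrative placeholders, not a prescribed schema.

```python
# Minimal ingestion sketch: split a PDF into page text and embedded images
# using PyMuPDF (the `fitz` package). Paths and chunk fields are illustrative.
import fitz  # pip install pymupdf

def split_pdf(path: str):
    doc = fitz.open(path)
    text_chunks, image_chunks = [], []
    for page_number, page in enumerate(doc, start=1):
        # Plain text goes to the usual text-chunking pipeline
        text_chunks.append({"page": page_number, "text": page.get_text()})
        # Each embedded image (chart, diagram, photo) becomes its own item
        for img in page.get_images(full=True):
            xref = img[0]
            image = doc.extract_image(xref)  # dict with raw bytes and file extension
            image_chunks.append({
                "page": page_number,
                "bytes": image["image"],
                "ext": image["ext"],
            })
    return text_chunks, image_chunks

text_chunks, image_chunks = split_pdf("quarterly_report.pdf")
```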
OCR and Text Extraction: Charts usually contain textual elements (titles, axis labels, legends, data labels). OCR-based methods are essential to extract this embedded text. Industry OCR services like Amazon Textract or Azure OCR can pull out these strings, which are then included in the metadata or textual representation of the chart. A recent benchmark, CHART-Info 2024, defines multiple sub-tasks needed for full chart understanding: chart text detection and recognition, text role classification (e.g. distinguishing axis labels vs. data labels), axis and legend analysis, and data extraction (Davila et al., 2024). These tasks underscore that beyond raw OCR, understanding a chart involves interpreting the roles of text and the visual structure. In practice, specialized vision-language models are emerging to handle this. For instance, Google’s DePlot model is designed to comprehend charts and plots. NVIDIA demonstrated using DePlot to convert bar chart images into a “linearized” table or text form in a RAG pipeline. By generating a structured textual representation of a chart (essentially reading the chart’s data), the chart can be treated as text for downstream processing. This approach was applied to technical documentation with complex figures, ensuring the key information from charts is extracted and expressed in text. In cases where such specialized models are unavailable, a simpler alternative is to produce an image caption or summary of the chart via a vision-capable model, describing the trends or insights it conveys.
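As a rough sketch of the DePlot step, the snippet below loads the public google/deplot checkpoint through Hugging Face Transformers and asks it to linearize a chart image into a data table; the image filename is a placeholder and the prompt string follows the model card.

```python
# Sketch: convert a chart image into a linearized data table with DePlot
# (a Pix2Struct-based model published on Hugging Face as "google/deplot").
from transformers import Pix2StructProcessor, Pix2StructForConditionalGeneration
from PIL import Image

processor = Pix2StructProcessor.from_pretrained("google/deplot")
model = Pix2StructForConditionalGeneration.from_pretrained("google/deplot")

image = Image.open("revenue_chart.png")  # placeholder chart image
inputs = processor(
    images=image,
    text="Generate underlying data table of the figure below:",
    return_tensors="pt",
)
outputs = model.generate(**inputs, max_new_tokens=512)
linearized_table = processor.decode(outputs[0], skip_special_tokens=True)
print(linearized_table)  # typically rows separated by "<0x0A>", cells by "|"
```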
Chunking and Metadata: Once a chart’s content is extracted (via OCR or model), it becomes its own “chunk” in the RAG pipeline. Best practices include attaching relevant metadata – e.g. the figure caption, source, or a tag indicating this chunk is an image. Some pipelines store the full OCR text or data of the chart but use a summary for the actual embedding, because raw extracted data (like a list of numbers) may not be semantically meaningful for retrieval (An Easy Introduction to Multimodal Retrieval-Augmented Generation | NVIDIA Technical Blog). A 2024 NVIDIA workflow recommends summarizing the linearized chart data and using that summary as the chunk to vectorize, which improved retrieval relevance. Additionally, maintaining references between the chart chunk and its parent document/page helps with citations and downstream usage (Graph RAG – Orbifold Consulting).
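A minimal sketch of such a chart chunk is shown below; the field names are illustrative, and embed_text stands in for whatever text-embedding model the pipeline actually uses.

```python
# Sketch of a chart chunk: embed the summary, keep the raw extracted data and
# provenance as metadata. `embed_text` is a placeholder for the pipeline's
# text-embedding model (OpenAI, sentence-transformers, etc.).
def build_chart_chunk(summary, linearized_table, caption, doc_name, page, figure_id, embed_text):
    return {
        "id": f"{doc_name}-fig-{figure_id}",
        "modality": "image",                  # flag so the retriever knows this chunk is a figure
        "embedding": embed_text(summary),     # the summary is what gets vectorized
        "text": summary,                      # shown to the LLM at generation time
        "metadata": {
            "caption": caption,
            "raw_data": linearized_table,     # full DePlot/OCR output, kept for exact values
            "source": {"document": doc_name, "page": page, "figure": figure_id},
        },
    }
```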
Embedding Charts and Graphs into Vector Stores
Multimodal Embeddings: A core challenge is how to represent a chart or graph in the vector store so that it can be retrieved given a user’s query. One approach is using multimodal embedding models that map images and text into a shared vector space. For example, Microsoft’s Florence model (available via Azure AI Vision) generates 1024-dimensional embeddings for images such that similar content yields vectors close to relevant text queries (Integrating vision into RAG applications | Microsoft Community Hub). Using this, an image of a rising line chart could be retrieved for a query about “increasing trends,” even if the query doesn’t explicitly describe the image. In practice, systems like Azure Cognitive Search allow adding an “imageEmbedding” field alongside text embeddings for each document page. During retrieval, a hybrid search can combine text semantic search with image-vector search to find matches in either modality. This multi-vector approach ensures that a query can surface a chart even if the chart’s textual metadata is sparse, by relying on visual similarity in the embedding space.
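The Florence embeddings are available only through the Azure AI Vision service, so the sketch below uses an open CLIP model (via sentence-transformers) purely as a stand-in to illustrate cross-modal retrieval in a shared vector space; the image filenames and query text are made up.

```python
# Sketch of cross-modal retrieval in a shared embedding space. An open CLIP
# model stands in for the managed Florence API: images and text queries land
# in one vector space, so a text query can match a chart image directly.
from sentence_transformers import SentenceTransformer, util
from PIL import Image

model = SentenceTransformer("clip-ViT-B-32")

chart_images = [Image.open("fig1_line_chart.png"), Image.open("fig2_bar_chart.png")]
image_vectors = model.encode(chart_images, convert_to_tensor=True)

query_vector = model.encode("charts showing increasing revenue trends", convert_to_tensor=True)
scores = util.cos_sim(query_vector, image_vectors)[0]
best = int(scores.argmax())
print(f"Best matching figure: index {best}, score {float(scores[best]):.3f}")
```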
Textual Embeddings of Chart Content: Another technique is to represent the chart via text (as noted earlier) and use a standard text embedding model (like OpenAI’s Ada-002 or similar) on that description. The KX engineering team, for instance, demonstrated this kind of approach for tables – extracting each table and generating a descriptive context, then converting the table to a uniform text format for embedding (Mastering RAG: Precision techniques for table-heavy documents | KX: Vector Database, Time Series And Real Time Analytics). A similar logic can apply to charts: one can create a textual summary of the chart’s data (e.g. “a line chart showing patient heart rate rising from 70 to 90 bpm over 5 minutes”) and embed that. The advantage is that it leverages well-understood text-vector models and can capture the semantics of the chart in natural language. The disadvantage is loss of some precision (the model might not list every data point). In practice, many industry pipelines combine approaches: store the image’s own embedding and a text-derived embedding. For example, Microsoft’s RAG system kept both the text content embedding and an image embedding for each page, enabling queries to hit on either representation.
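A compact sketch of keeping both representations per figure, in the spirit of the setup described above, is shown here; the record layout and the embed_text/embed_image helpers are illustrative placeholders rather than any particular product's schema.

```python
# Sketch: keep both a text-derived vector and an image vector for each figure,
# so either representation can match a query. `embed_text` and `embed_image`
# are placeholders for whichever embedding models the pipeline uses.
def index_figure(figure_id, chart_summary, chart_image, embed_text, embed_image):
    return {
        "id": figure_id,
        "content": chart_summary,                    # e.g. "line chart: heart rate 70 -> 90 bpm over 5 min"
        "textEmbedding": embed_text(chart_summary),  # captures the chart's semantics in language
        "imageEmbedding": embed_image(chart_image),  # captures visual similarity
    }
```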
Vector Index Organization: It’s common to treat each figure (chart) as a separate entry in the vector database, often linked to a caption or figure number. This allows the RAG retriever to return a chart “chunk” similarly to a text chunk. Some advanced retrievers also store modality flags or use separate indexes per modality (text vs image) and then merge results. LangChain’s multi-vector retriever and other frameworks can handle multiple embedding fields per document chunk, as seen in open-source cookbooks for text+image RAG (Multi-Vector Retriever for RAG on tables, text, and images); those 2023 resources laid the groundwork, and the concept carries into 2024 implementations.
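As an illustration of the per-modality indexing idea, the sketch below keeps one FAISS index for text chunks and one for figure embeddings and merges the top hits at query time; the random vectors are placeholders, and a production system would calibrate or normalize scores more carefully.

```python
# Sketch: one FAISS index per modality, with results merged at query time.
# Vectors are L2-normalized so inner product equals cosine similarity.
import numpy as np
import faiss  # pip install faiss-cpu

dim = 512
text_index = faiss.IndexFlatIP(dim)    # text-chunk embeddings
image_index = faiss.IndexFlatIP(dim)   # figure embeddings

text_vecs = np.random.rand(100, dim).astype("float32")   # placeholder embeddings
image_vecs = np.random.rand(20, dim).astype("float32")
faiss.normalize_L2(text_vecs)
faiss.normalize_L2(image_vecs)
text_index.add(text_vecs)
image_index.add(image_vecs)

def hybrid_search(query_vec: np.ndarray, k: int = 5):
    q = query_vec.reshape(1, -1).astype("float32")
    faiss.normalize_L2(q)
    t_scores, t_ids = text_index.search(q, k)
    i_scores, i_ids = image_index.search(q, k)
    hits = [("text", int(i), float(s)) for s, i in zip(t_scores[0], t_ids[0])]
    hits += [("image", int(i), float(s)) for s, i in zip(i_scores[0], i_ids[0])]
    return sorted(hits, key=lambda h: h[2], reverse=True)[:k]

print(hybrid_search(np.random.rand(dim)))
```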
Challenges in Integrating Graph and Chart Data
Semantic Gap and Retrieval Accuracy: Integrating charts introduces a semantic gap – the meaning in a chart must be captured either in an embedding or textual form. If using image embeddings, a challenge noted by practitioners is that visual similarity doesn’t always equate to relevance. For example, an image embedding model might consider a mostly blank chart similar to many queries (due to lack of distinctive features), causing irrelevant retrievals (Integrating vision into RAG applications | Microsoft Community Hub). Pamela Fox at Microsoft observed that embedding every page image naively could surface blank or irrelevant pages as top hits (an image of an empty page might appear “similar” to everything in latent space). Mitigations include filtering out images with little content, or using a captioning model to generate a descriptive text for the image instead of the raw image embedding. There is also the issue that charts with very domain-specific visuals might confuse a general embedding model. A biomedical plot of gene expression might not be well-understood by a generic vision model. In such cases, custom embeddings or fine-tuned models may be needed.
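One possible mitigation is a cheap content filter before embedding, as in the sketch below; the variance and “ink ratio” thresholds are arbitrary and would need tuning on the actual corpus.

```python
# Sketch: skip near-blank or low-information images before embedding them.
# Thresholds are illustrative and should be tuned per document collection.
import numpy as np
from PIL import Image

def is_informative(image_path: str, min_std: float = 15.0, min_ink_ratio: float = 0.01) -> bool:
    gray = np.asarray(Image.open(image_path).convert("L"), dtype=np.float32)
    pixel_std = float(gray.std())            # near zero for blank pages
    ink_ratio = float((gray < 128).mean())   # fraction of dark pixels (text, lines, bars)
    return pixel_std >= min_std and ink_ratio >= min_ink_ratio

for path in ["blank_page.png", "fig3_costs.png"]:
    print(path, "->", "embed" if is_informative(path) else "skip")
```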
LLM Context and Reasoning: Once retrieved, using chart data in generation is non-trivial. Standard LLMs accept text, not images, so either an LLM with vision capability (like GPT-4V or an open-source multimodal LLM) must be used, or a two-stage approach is needed. One approach is to include the chart’s text summary in the prompt (so the LLM only sees text). This works for questions answerable by the summary, but fails if the question requires details only visible in the image (e.g. exact trends or values not fully captured by the summary). The more robust approach is a pipeline: if an image chunk is retrieved, feed the actual image (or its data) into a vision model to get the answer, then incorporate that into the final LLM response. NVIDIA’s 2024 demo implemented this: upon retrieving a relevant chart image, they passed it (with the user’s question) into a visual question answering model, which interpreted the chart (e.g. reading the exact value difference between two bars) and produced an answer snippet (An Easy Introduction to Multimodal Retrieval-Augmented Generation | NVIDIA Technical Blog). That snippet (80% higher performance in their example) was then included as context for the final answer generation. This kind of late-stage fusion ensures accuracy but adds complexity (a vision VQA module is needed alongside the LLM). Alternatively, when using a single multimodal LLM like GPT-4, one can directly provide the image (e.g. via base64 in the prompt, as done in Azure’s implementation) and ask the model to answer using both text and image sources. However, reliance on closed models like GPT-4 may raise data privacy concerns for industry and can be expensive.
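A sketch of this late-stage fusion is shown below, using the public google/pix2struct-chartqa-base checkpoint as an example chart-QA model (not necessarily the model used in the NVIDIA demo); the chunk fields and the call_llm function are placeholders.

```python
# Sketch of late-stage fusion: if a retrieved chunk is a figure, ask a chart-QA
# model about it and splice the snippet into the final prompt.
# "google/pix2struct-chartqa-base" is one public option; `call_llm` is a placeholder.
from transformers import Pix2StructProcessor, Pix2StructForConditionalGeneration
from PIL import Image

processor = Pix2StructProcessor.from_pretrained("google/pix2struct-chartqa-base")
chart_qa = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-chartqa-base")

def answer_from_chart(image_path: str, question: str) -> str:
    inputs = processor(images=Image.open(image_path), text=question, return_tensors="pt")
    out = chart_qa.generate(**inputs, max_new_tokens=64)
    return processor.decode(out[0], skip_special_tokens=True)

def answer(question: str, retrieved_chunks: list, call_llm) -> str:
    context = []
    for chunk in retrieved_chunks:
        if chunk["modality"] == "image":
            # Ask the chart-QA model about the retrieved figure
            snippet = answer_from_chart(chunk["image_path"], question)
            context.append(f"[From {chunk['source']}]: {snippet}")
        else:
            context.append(chunk["text"])
    prompt = "Answer using only this context:\n" + "\n".join(context) + f"\n\nQuestion: {question}"
    return call_llm(prompt)
```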
Accuracy and Limitations: Even with advanced models, understanding charts is not perfect. A study on ChartQA in late 2024 evaluated 19 multimodal LLMs on reading charts and found the average accuracy was only ~39.8%, with the best (GPT-4V) around 69% on “low-level” tasks (like identifying specific correlations or values) (ChartInsights: Evaluating Multimodal Large Language Models for Low-Level Chart Question Answering - ACL Anthology). This indicates current models often misread or overlook fine details in charts. The research introduced improved prompting strategies (like a Chain-of-Charts method to guide the model’s attention) that boosted accuracy to ~84%. For RAG systems, this implies that even if the correct chart is retrieved, the system may need tailored prompts or logic to ensure the LLM extracts the right answer from it. Challenges like varying chart styles, image noise, or unusual layouts can further hinder interpretation. Integrators must anticipate errors – for instance, an LLM might hallucinate a trend if the chart is complex – and possibly include validation (if underlying data is available, cross-check the LLM’s reading).
Computational Overhead: Storing and searching images alongside text increases storage and compute needs. Image embeddings (e.g. 1024-d vectors for each page image) can be heavy at scale. Some industry solutions address this by selectivity – e.g. only embedding pages or figures that contain significant visual information (Integrating vision into RAG applications | Microsoft Community Hub). Likewise, running a vision model at query time for potentially multiple images can be slow. Caching analyses for frequently asked-about charts or using lightweight models can mitigate latency.
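A small sketch of the caching idea: memoize chart analyses keyed by a hash of the image bytes plus the question, so the vision model is not re-run for frequently queried figures. The run_vision_model callable is a placeholder for whatever VQA or chart-reading model is in use.

```python
# Sketch: cache chart analyses by image-content hash so the vision model is
# not re-run when the same figure is asked about repeatedly.
import hashlib

_analysis_cache = {}

def cached_chart_answer(image_bytes: bytes, question: str, run_vision_model) -> str:
    key = (hashlib.sha256(image_bytes).hexdigest(), question)
    if key not in _analysis_cache:
        _analysis_cache[key] = run_vision_model(image_bytes, question)
    return _analysis_cache[key]
```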
Best Practices and Industry Applications
Financial Reports: In finance, RAG systems deal with earnings reports, filings, and presentations that mix narrative text with charts of trends and tables of numbers. Best practices here include treating tables and charts as first-class citizens in the knowledge base. One industry approach is to convert every chart and table into a textual summary during ingestion, so that the LLM can retrieve and quote facts from them reliably. For example, a pipeline for a financial report might extract a revenue trend chart and generate a sentence like “Figure 5: Revenue increased from $10M in Q1 to $15M in Q2”, which is stored as a chunk with a reference to the figure. This ensures queries about “revenue in Q2” retrieve that info. KX’s solution for table-heavy documents combined table markdown with contextual descriptions for robust retrieval (Mastering RAG: Precision techniques for table-heavy documents | KX: Vector Database, Time Series And Real Time Analytics) – a similar enrichment can be applied to charts by including their caption or a brief analysis. Additionally, using multi-modal search can catch questions phrased visually (e.g. “show me any charts of rising costs”), retrieving the actual chart image via vector similarity (Integrating vision into RAG applications | Microsoft Community Hub). Microsoft reports that enabling image-based retrieval was “a great fit for diagram-heavy domains like finance”, allowing users to get answers entirely from charts when needed. For accuracy, financial applications often double-check any numeric values read from charts, since an error can be critical. If possible, storing the raw data behind a chart (from CSV or reports) and linking it to the image can allow the system to use the exact numbers instead of relying on image reading.
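The sketch below illustrates linking a chart chunk to the exact figures behind it, so numeric questions can be answered from the data rather than read off the image; all values and field names are invented for illustration.

```python
# Sketch: link a revenue-chart chunk to the exact figures behind it, so numeric
# answers come from the data rather than from reading the image.
revenue_by_quarter = {"Q1": 10_000_000, "Q2": 15_000_000}   # e.g. parsed from a CSV in the filing

chart_chunk = {
    "id": "10Q-2024-fig-5",
    "text": "Figure 5: Revenue increased from $10M in Q1 to $15M in Q2.",  # summary that gets embedded
    "metadata": {
        "figure": "Figure 5",
        "document": "Q2-2024 10-Q",
        "raw_data": revenue_by_quarter,            # exact values for validation / numeric answers
        "image_path": "figures/10q_2024_fig5.png",
    },
}

def revenue_for(quarter: str) -> int:
    # Answer numeric questions from the linked data, not the image
    return chart_chunk["metadata"]["raw_data"][quarter]

print(revenue_for("Q2"))  # 15000000
```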
Medical Documents: Healthcare documents can contain patient charts (like vital sign trends), medical imagery (X-rays, MRIs), and annotated diagrams. Integrating these into RAG pipelines is an emerging practice. A key practice is to use domain-specific models when available – for instance, a general chart parser might not handle an EKG graph well, but a specialized healthcare AI could. NVIDIA suggests either fine-tuning a single model to handle all image types or using an ensemble of models for different image categories (An Easy Introduction to Multimodal Retrieval-Augmented Generation | NVIDIA Technical Blog). In a medical setting, one could route line-chart type images (like lab results over time) to a chart-reading module, while sending anatomical images to an entirely different analyzer. Maintaining the context is also crucial: a medical chart’s meaning is tied to the patient and measurement. So linking the extracted chart info with patient metadata (patient ID, date) in the vector store is a best practice, to retrieve the correct record for a query like “Show me John Doe’s blood pressure trend in March”. Privacy and compliance are particularly important; if using a service like Bedrock’s multimodal RAG, which now supports images and tables (Amazon Bedrock Knowledge Bases now processes multimodal data - AWS), healthcare providers must ensure the data stays encrypted and within approved systems. In terms of OCR, medical charts often have handwritten annotations or scans, so high-quality OCR (or human-in-the-loop validation) may be needed to avoid critical mistakes.
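Such routing could look like the sketch below, where a placeholder classifier decides which analyzer handles each image; the classifier and analyzer names are hypothetical stand-ins for domain-specific models.

```python
# Sketch: route each image to a suitable analyzer based on its (classified) type.
# `classify_image_type` and the analyzer callables are placeholders for
# domain-specific models (chart reader, radiology model, diagram captioner, ...).
def process_medical_image(image_bytes: bytes, classify_image_type, analyzers: dict) -> str:
    image_type = classify_image_type(image_bytes)   # e.g. "vitals_chart", "xray", "diagram"
    analyzer = analyzers.get(image_type)
    if analyzer is None:
        return "Unsupported image type; flag for human review."
    return analyzer(image_bytes)

# Example wiring (each callable would wrap a real model):
# analyzers = {"vitals_chart": chart_reader, "xray": radiology_model, "diagram": caption_model}
```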
Scientific and Technical Papers: Scientific literature contains numerous graphs and plots that are essential to understanding results. RAG-powered literature assistants (for example, tools to query academic papers) need to handle questions about these figures. A best practice here is to leverage the figure captions and surrounding text heavily. Typically, a well-written paper describes each figure in the caption or body; ensuring the caption is indexed and chunked with the figure can answer many questions without needing complex image processing. However, for questions requiring reading values off a plot (e.g. “According to Figure 2, what is the peak intensity?”), a vision model is needed. Industry solutions like SciNLP assistants have begun to incorporate figure parsing libraries (like pdffigures2) to isolate each figure and then apply a model such as DePlot or an MLLM (multimodal LLM) to generate a textual explanation of the figure. This explanation can be indexed for retrieval. The 2024 NVIDIA example of an AI reading an NVIDIA research blog’s charts is analogous to doing so for a scientific paper: the system successfully answered a performance comparison question by interpreting a bar chart from the document. For scientific use, it’s also recommended to classify the figure type (line graph, scatter plot, diagram, etc.) because certain models perform better on certain types (e.g. a chemistry diagram parser vs. a data plot parser). By 2025, we see early deployments of such systems in enterprise research departments and publishing platforms to enable querying documents beyond just text.
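For the figure-isolation step, a sketch of turning pdffigures2 output into caption-indexed chunks is shown below; the JSON field names (“caption”, “figType”, “name”, “renderURL”) reflect pdffigures2’s typical output but should be verified against the installed version.

```python
# Sketch: index pdffigures2 output so each figure's caption becomes a retrievable
# chunk pointing at the cropped figure image (for VQA at query time).
import json

def figure_chunks_from_pdffigures2(json_path: str, paper_id: str):
    with open(json_path) as f:
        figures = json.load(f)
    chunks = []
    for fig in figures:
        chunks.append({
            "id": f"{paper_id}-{fig.get('name', 'fig')}",
            "text": fig.get("caption", ""),          # the caption is what gets embedded
            "metadata": {
                "paper": paper_id,
                "figure_type": fig.get("figType"),   # e.g. "Figure" or "Table"
                "image_path": fig.get("renderURL"),  # cropped figure image on disk
            },
        })
    return chunks
```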
General Recommendations: Across domains, some universal best practices have emerged:
Multimodal Indexing: Index both text and visual information. Use a unified embedding space if possible for cross-modal search (Integrating vision into RAG applications | Microsoft Community Hub), and/or store separate embeddings with a retriever that can combine them. This hybrid approach yields more complete results.
Contextual Chunking: When chunking documents, keep charts and their explanatory text together. If a chart has a caption or is referred to in the paragraph above, linking those in the vector store (through metadata or even combining them in one chunk) can improve retrieval relevance and provide context for the LLM to understand the image.
Efficient Image Use: Avoid indexing meaningless images (e.g. decorative graphics or blank pages) to reduce noise (Integrating vision into RAG applications | Microsoft Community Hub). Focus on informative charts/graphs. Optionally, generate captions for images and index those rather than raw pixel embeddings if the visual model isn’t reliable.
Leverage VQA at Runtime: For critical applications, incorporate a vision-QA step when an image is retrieved. This ensures the final answer is grounded in what the chart actually shows, not just the description. As shown by industry prototypes, combining an MLLM’s answer from the image with the main LLM’s answer yields accurate and citeable results (An Easy Introduction to Multimodal Retrieval-Augmented Generation | NVIDIA Technical Blog).
Metadata and Source Attribution: Always store the source of the chart (document name, figure number) and include it in the LLM prompt or answer for transparency. AWS’s Bedrock multimodal RAG now even provides source attribution for visual data (Amazon Bedrock Knowledge Bases now processes multimodal data - AWS), which is important for user trust. Microsoft’s approach of stamping the image with its filename and citing that in answers is one way to handle this (Integrating vision into RAG applications | Microsoft Community Hub).
By following these practices, organizations in 2024 and beyond have started to successfully incorporate graphs and charts into their RAG pipelines, making LLMs far more knowledgeable on visual information. This unlocks advanced use-cases like querying financial trends directly from report charts or asking scientific questions that require reading a graph – tasks that pure text models would have missed. While challenges remain (in accuracy and complexity), ongoing research and industry innovation are rapidly closing the gap, making multimodal RAG a practical reality for document intelligence.
References:
NVIDIA (2024), Multimodal RAG pipeline – techniques for chart interpretation and image-text integration (An Easy Introduction to Multimodal Retrieval-Augmented Generation | NVIDIA Technical Blog).
Microsoft Azure Tech Community (2024), Vision in RAG – using multimodal embeddings and GPT-4V to handle diagrams in finance (Integrating vision into RAG applications | Microsoft Community Hub).
AWS Bedrock (Dec 2024), Knowledge Bases multimodal support – announcement of end-to-end RAG on text and images (charts, tables) (Amazon Bedrock Knowledge Bases now processes multimodal data - AWS).
Davila et al. (2024), CHART-Info Dataset – defines OCR and analysis tasks for chart recognition.
Wu et al. (2024), ChartInsights (EMNLP 2024) – evaluation of multimodal LLMs on chart QA, highlighting accuracy limits and improvements (ChartInsights: Evaluating Multimodal Large Language Models for Low-Level Chart Question Answering - ACL Anthology).
Orbifold Consulting (2024), Graph RAG blog – notes on document parsing challenges with charts and combined OCR/CV approaches (Graph RAG – Orbifold Consulting).