How to improve Embedding Accuracy for OpenAI Embedding Models in RAG
Optimized Chunking Strategies for Better Retrieval
Enhancements in Retrieval-Augmented Generation (RAG)
Empirical Findings and Comparison
OpenAI’s embedding models (e.g. text-embedding-ada-002) are widely used in retrieval-augmented generation (RAG) pipelines. Recent research in 2024–2025 highlights that document chunking strategies and RAG pipeline enhancements can significantly boost retrieval accuracy and reduce hallucinations. Below, we summarize cutting-edge methods – from smarter chunking to improved retrieval/generation techniques – with a focus on recent empirical findings.
Optimized Chunking Strategies for Better Retrieval
Chunk Size Trade-offs: The size of text chunks indexed in a vector database has a major impact on retrieval performance. Large chunks preserve more context (improving semantic coherence for embeddings) at the cost of slower search and potential inclusion of irrelevant content, while small chunks boost recall and speed but may fragment context (Searching for Best Practices in Retrieval-Augmented Generation). Finding an optimal chunk length requires balancing relevance and faithfulness (i.e. ensuring answers are grounded in retrieved text rather than hallucinated). Recent studies emphasize that no single fixed chunk size is best for all data – the optimal granularity can vary by domain and task.
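For reference, the fixed-size baseline that these studies compare against can be sketched in a few lines of Python. The word-based length measure and the 512/64 defaults below are illustrative stand-ins (a production pipeline would count tokens with a tokenizer such as tiktoken):

```python
def chunk_fixed(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Naive fixed-size chunking with overlap.

    Larger chunk_size keeps more context per embedding but risks mixing in
    irrelevant content; smaller chunk_size improves recall granularity but
    fragments context. Overlap softens boundary effects at the cost of a
    larger index.
    """
    assert chunk_size > overlap, "chunk_size must exceed overlap"
    words = text.split()          # word count as a rough proxy for token count
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
    return chunks
```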
Structural or Semantic Chunking: Instead of naive paragraph or fixed-length splits, researchers have proposed chunking by document structure or semantics. Jimeno-Yepes et al. (2024) introduce an element-type-based chunking strategy for financial reports, leveraging document layout (sections, tables, etc.) to create semantically coherent chunks. This approach improved retrieval and downstream Q&A accuracy without tuning chunk size. In their FinanceBench experiments, element-based chunks achieved the highest page-level retrieval accuracy (~84.4%) and best QA accuracy, outperforming uniform token-length chunks. Notably, it required half as many chunks as a baseline (≈62k vs 112k), greatly reducing index size and query latency. This result underscores how semantically informed chunking can boost accuracy and efficiency simultaneously.
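Outside the authors' specific pipeline, the same idea can be approximated by splitting at structural boundaries and keeping each element's section title as metadata. A minimal sketch, assuming Markdown-style headings mark the boundaries (real financial filings would need a layout parser):

```python
import re

def chunk_by_structure(markdown_text: str) -> list[dict]:
    """Split a document at heading boundaries so each chunk corresponds to one
    semantically coherent element, carrying its section title as metadata."""
    chunks, current_title, current_lines = [], "Preamble", []

    def flush():
        body = "\n".join(current_lines).strip()
        if body:
            chunks.append({"section": current_title, "text": body})

    for line in markdown_text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)", line)
        if match:
            flush()
            current_title, current_lines = match.group(2), []
        else:
            current_lines.append(line)
    flush()
    return chunks
```

Very long sections may still need a size cap, but each resulting vector now maps to one coherent document element rather than an arbitrary window.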
Dynamic Chunking per Query: Rather than a one-size-fits-all segmentation, dynamic methods adjust chunk granularity based on the query. Zhong et al. (2024) propose Mix-of-Granularity (MoG), which trains a lightweight router to choose the optimal chunk size or “knowledge granularity” for each query (Mix-of-Granularity: Optimize the Chunking Granularity for Retrieval-Augmented Generation). For instance, a broad question might retrieve larger text blocks, whereas a specific query uses fine-grained snippets. They further extend this to MoG-Graph (MoGG), representing documents as graphs of interconnected chunks to retrieve information that may be spread across distant sections. On medical QA benchmarks, MoG/MoGG significantly outperformed a static-chunk baseline (MedRAG), yielding higher answer accuracy. MoGG achieved larger accuracy gains than MoG and proved more sample-efficient, thanks to its flexible graph-based organization of snippets. This adaptive approach shows that varying chunk size on-the-fly can capture relevant context more effectively than any fixed segmentation.
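MoG's router is a trained model; the toy heuristic below is only a stand-in to show where a per-query granularity decision plugs into a retrieval pipeline. The keyword rule, the 256/1024 sizes, and the per-granularity indexes in the usage comment are all hypothetical:

```python
def pick_granularity(query: str) -> int:
    """Toy stand-in for a MoG-style granularity router: broad questions get
    coarse chunks, specific factoid questions get fine-grained snippets.
    A real router is a small trained model scoring each candidate granularity."""
    broad_markers = ("why", "how", "explain", "overview", "compare")
    q = query.lower()
    if any(q.startswith(m) or f" {m} " in q for m in broad_markers):
        return 1024  # coarse: section-level chunks
    return 256       # fine: snippet-level chunks

# Hypothetical usage: one index per granularity, routed per query.
# indexes = {256: fine_index, 1024: coarse_index}
# hits = indexes[pick_granularity(user_query)].search(query_embedding, k=5)
```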
Contextualized “Late” Chunking: Another advance is to preserve more context during embedding by chunking after encoding. Traditional pipelines split text first, then embed each piece independently – which can lose cross-chunk references (e.g. pronouns or entities that span sentences). Late chunking flips this process: feed a long document (or large portion) into a long-context embedding model whole, then produce embeddings for smaller segments by pooling over the model’s output (The Rise and Evolution of RAG in 2024 A Year in Review | RAGFlow). This technique, introduced by Safjan et al. (2024), allows each chunk’s vector to “know” about its broader context. For example, references like “its” or “the city” in one sentence will correctly encode the entity “Berlin” mentioned in a previous sentence – something not possible with naive independent chunks. Late chunking delivered consistently higher similarity between related pieces of text and significantly improved retrieval performance across numerous benchmarks. Importantly, it requires no model fine-tuning; it leverages long embedding windows (~8K tokens) to produce richer chunk embeddings. The trade-off is increased memory/compute per embedding (processing a bigger context at once), but studies show it yields superior results to naive chunking for many tasks.
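One practical caveat: OpenAI's embeddings API returns a single pooled vector per input and does not expose token-level states, so late chunking needs a local long-context embedder whose hidden states you can pool yourself. A minimal sketch with Hugging Face transformers (the model name is just an example of a long-context embedder with a fast tokenizer; any comparable model works):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Example long-context embedding model; substitute any model whose token-level
# hidden states you can access. trust_remote_code is required for this one.
MODEL_NAME = "jinaai/jina-embeddings-v2-base-en"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)

def late_chunk_embeddings(document: str,
                          chunk_char_spans: list[tuple[int, int]]) -> list[torch.Tensor]:
    """Encode the whole document once, then mean-pool the contextualized token
    states that fall inside each chunk's character span ("late" chunking)."""
    enc = tokenizer(document, return_tensors="pt",
                    return_offsets_mapping=True, truncation=True)
    offsets = enc.pop("offset_mapping")[0]                # (num_tokens, 2) char spans
    with torch.no_grad():
        token_states = model(**enc).last_hidden_state[0]  # (num_tokens, hidden_dim)
    vectors = []
    for start, end in chunk_char_spans:
        mask = (offsets[:, 0] >= start) & (offsets[:, 1] <= end)
        if mask.any():                                    # skip spans truncated away
            vectors.append(token_states[mask].mean(dim=0))
    return vectors
```

Each returned vector still represents a small chunk, but it was computed from token states that saw the whole document, which is what lets cross-chunk references resolve.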
Multi-Stage and Meta-Chunking: Other 2024 approaches enrich or post-process chunks to boost embedding quality. These include adding summaries or metadata to each chunk (so-called “context-enriched chunking” (Mix-of-Granularity: Optimize the Chunking Granularity for Retrieval-Augmented Generation)), or using language models to intelligently merge or split chunks at natural boundaries (“meta-chunking” that groups logically connected sentences). Such techniques ensure each vector represents a semantically complete idea, making it easier for embedding search to retrieve the right pieces. Hierarchical strategies (small chunks for initial recall, then retrieve the surrounding larger section) have also proven effective (Searching for Best Practices in Retrieval-Augmented Generation) – this is sometimes called a “small-to-big” or sliding window approach. In sum, the field has converged on providing more context to embeddings – either via smarter chunk definitions or by linking chunks – as key to improving retrieval accuracy.
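The “small-to-big” pattern is straightforward to layer on top of any vector store. In this sketch, the search call and the parent_id metadata field are assumptions about your index's interface, not a specific library's API:

```python
def small_to_big(query_vec, small_index, parent_sections: dict, k: int = 5) -> list[str]:
    """Search fine-grained chunk embeddings for recall, then hand the
    generator each hit's parent section so it sees the surrounding context."""
    hits = small_index.search(query_vec, k=k)   # assumed: returns hits with metadata
    seen, contexts = set(), []
    for hit in hits:
        parent_id = hit["metadata"]["parent_id"]
        if parent_id not in seen:               # de-duplicate hits sharing a parent
            seen.add(parent_id)
            contexts.append(parent_sections[parent_id])
    return contexts
```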
Enhancements in Retrieval-Augmented Generation (RAG)
Better chunking is one pillar of RAG improvement; recent research also targets retrieval and generation stages to increase relevance and reliability:
Semantic Indexing and Hierarchical Search: Building on chunking improvements, Fan et al. (2025) introduce TrustRAG, which indexes documents in a hierarchical manner (TrustRAG: An Information Assistant with Retrieval Augmented Generation). Each chunk is stored along with contextual neighbors or section labels (a form of semantic-enhanced indexing). This ensures the retriever can surface a chunk plus its necessary context, mitigating the risk of retrieving isolated fragments out of context. By supplementing each chunk with broader document structure, the embeddings and index better preserve semantic completeness.
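In practice, this kind of semantic-enhanced indexing largely means storing richer records per chunk. A simplified illustration (the record schema is an assumption for illustration, not TrustRAG's actual index format):

```python
def build_index_records(chunks: list[dict]) -> list[dict]:
    """Attach a section label and neighboring-chunk ids to each record so a
    retrieved chunk can be expanded with its surrounding context at query time."""
    records = []
    for i, chunk in enumerate(chunks):
        records.append({
            "id": i,
            "text": chunk["text"],
            "section": chunk.get("section", ""),
            "prev_id": i - 1 if i > 0 else None,
            "next_id": i + 1 if i < len(chunks) - 1 else None,
        })
    return records
```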
Retriever Filtering and Reranking: Simply retrieving top-k vectors can pull in some irrelevant text. TrustRAG adds a utility-based filter to discard low-quality or off-topic retrieved chunks before generation. Similarly, other work in 2024 recommends a reranking step using a cross-encoder or LLM to re-evaluate and sort retrieved passages by relevance (Searching for Best Practices in Retrieval-Augmented Generation). These additional steps are resource-intensive (involving large models or extra scoring), but they significantly improve the relevance of the context fed into the LLM. Empirical RAG studies show that without such filtering, many retrieved chunks can be irrelevant “noise,” especially if using small chunk sizes. Thus, a hybrid of dense embeddings for recall followed by a precise re-check (or filtering heuristic) yields better final answers.
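A typical reranking pass looks like the sketch below, using a public MS MARCO cross-encoder from sentence-transformers as an example; the score threshold is arbitrary and would need tuning per corpus:

```python
from sentence_transformers import CrossEncoder

# Example public reranker checkpoint; scores are raw relevance logits.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, passages: list[str], top_n: int = 3,
           min_score: float = 0.0) -> list[str]:
    """Jointly score each (query, passage) pair, drop low-scoring 'noise'
    chunks, and keep only the top_n most relevant passages for the prompt."""
    scores = reranker.predict([(query, p) for p in passages])
    ranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)
    return [p for p, s in ranked[:top_n] if s >= min_score]
```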
Query Expansion and Transformation: Improving the query for retrieval is another proven technique. Methods like HyDE (Hypothetical Document Embeddings) generate a pseudo-answer from the query using an LLM, then embed that pseudo-answer to match against documents, often retrieving better-aligned context. Others decompose complex queries into sub-queries, retrieve each part separately, and then combine the results. These strategies were validated in a 2024 “best practices” study by Xiong et al., which found that query rewriting and augmentation can boost recall and downstream accuracy without changes to the embedding model. Such approaches effectively compensate for embedding vocabulary gaps by enriching the query representation.
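A minimal HyDE-style sketch with the OpenAI Python SDK (the model names are illustrative; any chat model and embedding model pair works):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def hyde_query_vector(query: str) -> list[float]:
    """HyDE: draft a hypothetical answer passage with an LLM, then embed that
    passage instead of the raw query to better match document embeddings."""
    draft = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Write a short passage that plausibly answers: {query}",
        }],
    ).choices[0].message.content
    return client.embeddings.create(
        model="text-embedding-3-small",
        input=draft,
    ).data[0].embedding
```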
Improved Generation with Retrieved Context: In the answer synthesis stage, recent improvements aim to make use of retrieved facts more faithfully. TrustRAG, for example, performs fine-grained citation analysis during generation, identifying which sentences in the LLM’s output are likely claims or opinions and ensuring each is backed by a source (TrustRAG: An Information Assistant with Retrieval Augmented Generation). This yields responses with inline citations that are more accurate and easier to verify. Other work has focused on training generation models to better integrate source text – e.g. by additional supervised fine-tuning so the LLM learns to copy factual spans from retrieved passages or to abstain when unsure. While OpenAI’s own models are not fine-tuned by end-users, these techniques highlight the trend of aligning generation closely with retrieved evidence to reduce hallucinations.
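End users of hosted models can approximate this evidence alignment with a post-hoc grounding check. The sketch below is a crude similarity-based proxy (not TrustRAG's citation analysis); it assumes unit-normalized vectors and a user-supplied embed_fn:

```python
import numpy as np

def flag_unsupported_sentences(answer_sentences: list[str],
                               passage_vecs: np.ndarray,
                               embed_fn,
                               threshold: float = 0.75) -> list[str]:
    """Flag answer sentences whose embedding is not close to any retrieved
    passage, as candidates for removal or explicit citation. Assumes
    passage_vecs (n_passages x dim) and embed_fn outputs are unit-normalized."""
    flagged = []
    for sentence in answer_sentences:
        similarity = passage_vecs @ np.asarray(embed_fn(sentence))
        if float(similarity.max()) < threshold:
            flagged.append(sentence)
    return flagged
```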
Resource-Intensive Approaches: Some cutting-edge solutions for maximal accuracy forgo the conventional retriever+embedder paradigm entirely. Qian et al. (2024) present a chunking-free in-context retrieval (CFIC) method that uses a large-context language model to directly find answers in a long document. In CFIC, the full document is encoded (as hidden states) and an autoregressive search procedure extracts the precise answer spans, eliminating the need to chunk the text at all. This approach showed significant gains in evidence precision on open-domain QA, since it avoids chopping up relevant text. The drawback is heavy resource usage: CFIC requires a 32k-token model context and specialized decoding, so it is more computation-intensive than standard vector search. Similarly, using GPT-4 itself as a retriever (via iterative reading or tool use) could improve accuracy, but at high cost. Such solutions underscore a theme in 2024–25 research: for critical applications, investing more compute or model capacity into retrieval can pay off with higher answer correctness.
Empirical Findings and Comparison
Recent empirical studies consistently demonstrate that smarter chunking and RAG refinements yield more accurate and contextually relevant results than baseline methods:
A comparative study in early 2024 showed that document-aware chunking (using structure or semantics) outperforms fixed-size chunking, improving QA accuracy by up to 35% in domain-specific tasks. These methods also reduced the total number of embedding vectors (hence index size) by ~50%, highlighting a win-win for accuracy and efficiency.
Dynamic granularity methods like MoG were found especially beneficial for heterogeneous knowledge sources (e.g. mixed text from manuals, wikis, and databases). By selecting chunk size per query, MoG/MoGG achieved higher average accuracy on medical QA tasks than any single-granularity approach (Mix-of-Granularity: Optimize the Chunking Granularity for Retrieval-Augmented Generation). Notably, MoG’s gains were larger when using smaller LLMs for generation, since those models rely more on retrieved evidence. This suggests adaptive chunking can effectively complement smaller OpenAI models or domain-specific models by feeding them just the right amount of context.
Advanced chunking techniques that inject more context into embeddings have shown strong results on standard retrieval benchmarks. Late chunking, for example, consistently improved recall of relevant passages, as chunks carry contextual cues that naive splits would miss. In practical terms, a long-context embedder with an 8K-token window using late chunking can embed an entire document to produce contextualized vectors for each section, outperforming embeddings of isolated sections in search tasks. The only scenarios where late chunking may not help are those with extremely short, self-contained documents, where simple chunking is already sufficient.
RAG pipeline tuning has been shown to rival massive model scaling in effectiveness. Xiong et al. (2024) found that carefully optimizing each step (chunking, retrieval, reranking, etc.) could achieve high accuracy without using an extremely large generator model (Searching for Best Practices in Retrieval-Augmented Generation). For instance, augmenting queries and adding a rerank stage boosted factual accuracy more than switching from GPT-3.5 to GPT-4 in some cases. This underscores that embedding accuracy is not solely about the model, but also how the data is prepped and used.
In summary, the past year’s research has equipped practitioners with a toolkit to improve OpenAI embedding-based search. Better chunking – whether via structural splits, dynamic granularity, or context-enriched embedding – directly improves the quality of retrieved content by creating more semantically accurate vectors. On top of that, RAG enhancements like smarter retrieval (filters, query expansion) and robust generation (citation checks, fine-tuning) ensure that the LLM’s final answer stays factual and relevant. While some solutions (e.g. chunking-free retrieval or large-context embeddings) are resource-intensive, they illustrate the upper bound of what’s achievable in accuracy. More lightweight techniques, like hierarchical indexing and overlap chunking, offer practical gains that can be applied today with OpenAI’s models.
Going into 2025, the consensus is that no single tweak is a silver bullet. Rather, combining multiple strategies – optimal chunking, advanced retrieval, and guided generation – yields the best results in retrieval-augmented systems (TrustRAG: An Information Assistant with Retrieval Augmented Generation). As new empirical studies emerge, they continue to refine these approaches, but the clear trend is towards maximizing the useful context given to language models while minimizing noise. By following these best practices from recent research, developers can significantly improve embedding-based retrieval accuracy in their RAG applications, leading to more reliable and relevant AI-generated responses.
Sources: Recent arXiv papers and results from 2024–2025, including Jimeno-Yepes et al., Zhong et al. (Mix-of-Granularity: Optimize the Chunking Granularity for Retrieval-Augmented Generation), Safjan et al., Qian et al., Fan et al. (TrustRAG: An Information Assistant with Retrieval Augmented Generation), and Xiong et al. (Searching for Best Practices in Retrieval-Augmented Generation), among others.