Table of Contents
Performance Optimization in Embedding-Based Retrieval
Scalability Challenges and Solutions
Fine-Tuning and Domain Adaptation
Embedding Model Architectures: BERT-Based vs. Beyond
Multimodal Embeddings: Text, Images, and Structured Data
Implementation Insights: Document Chunking and Retrieval (Code Example)
Performance Optimization in Embedding-Based Retrieval
Embedding-based retrieval (e.g. in Retrieval-Augmented Generation) must balance accuracy with speed. Modern systems employ Approximate Nearest Neighbor (ANN) indexes such as HNSW graphs to avoid brute-force search in high dimensions (Operational Advice for Dense and Sparse Retrievers: HNSW, Flat, or Inverted Indexes?). HNSW (Hierarchical Navigable Small-World) provides sub-linear retrieval time but adds memory overhead from its multi-layer graph (Down with the Hierarchy: The ‘H’ in HNSW Stands for “Hubs”). Recent analyses suggest the hierarchical layers may be unnecessary: a flat (single-layer) graph can achieve similar recall and latency to HNSW without the extra complexity. For smaller corpora or prototyping, a flat index (brute force) can be preferable: it guarantees exact results and avoids tuning parameters. Lin (2024) notes that brute-force “flat” search is often best at small scales, providing a stable baseline for evaluating embedding quality. As data scales up, ANN methods become essential: graph-based indexes, inverted-file (IVF) partitions, and product quantization all trade a slight accuracy loss for large speedups. Int8 quantization of embeddings is a common optimization that cuts memory use by roughly 4× with minimal impact on retrieval quality. Another direction is adaptive indexing: building indexes on the fly based on query demand. For example, CrackIVF (2025) starts with near brute-force search and gradually partitions the index as queries arrive, keeping latency low without paying the full indexing cost upfront (Cracking Vector Search Indexes). Their experiments suggest using brute force by default for new or small datasets and switching to ANN only after roughly 10–100 queries have been seen. Overall, performance optimization comes down to choosing the right index type for the scale, compressing embeddings, and caching frequently accessed results, all to reduce latency while maintaining high recall.
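To make these index trade-offs concrete, here is a minimal sketch (assuming FAISS and synthetic random vectors; the dimensions, dataset size, and parameters are illustrative, not drawn from the cited papers) that builds an exact flat index, an HNSW graph, and an int8 scalar-quantized index over the same data:
import numpy as np
import faiss
d, n = 384, 20000 # embedding dimension and corpus size (synthetic)
xb = np.random.rand(n, d).astype("float32")
xq = np.random.rand(5, d).astype("float32")
## Exact baseline: brute-force L2 search, no tuning, full memory footprint
flat = faiss.IndexFlatL2(d)
flat.add(xb)
## Graph-based ANN: sub-linear search at the cost of extra memory for graph links
hnsw = faiss.IndexHNSWFlat(d, 32) # 32 = graph neighbors per node (M)
hnsw.hnsw.efSearch = 64 # higher efSearch -> better recall, slower queries
hnsw.add(xb)
## Int8 scalar quantization: roughly 4x smaller vectors, small recall loss
sq8 = faiss.IndexScalarQuantizer(d, faiss.ScalarQuantizer.QT_8bit)
sq8.train(xb)
sq8.add(xb)
for name, index in [("flat", flat), ("hnsw", hnsw), ("int8", sq8)]:
    D, I = index.search(xq, 10) # top-10 neighbor distances and ids per query
    print(name, I[0][:5])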
Scalability Challenges and Solutions
LLM-based document search often needs to handle very large corpora (millions of documents or more). Storing and searching billions of embeddings requires careful system design. Specialized Vector Database Management Systems (VDBMS) such as Weaviate, Pinecone, and Qdrant have emerged to manage embedding indexes at scale; they support sharding across machines and on-disk indexes to go beyond memory limits. A key challenge is the trade-off between index size and query speed. Graph-based ANN structures (like HNSW) use more memory; in fact, HNSW’s hierarchy can hurt throughput in distributed settings due to synchronization overhead (Down with the Hierarchy: The ‘H’ in HNSW Stands for “Hubs”). For web-scale data, disk-based ANN (e.g. DiskANN) and compact indexes (IVF with product quantization) allow scaling to billions of vectors by sacrificing some recall. Another challenge is dynamic data and multi-tenancy: in real deployments, indexes must handle frequent updates and serve many clients. A naïve approach is one massive index with per-document access filters, or separate indexes per tenant, but the former slows queries while the latter wastes memory (Curator: Efficient Indexing for Multi-Tenant Vector Databases). Recent work on multi-tenant indexing proposes hybrid solutions: Curator (2024) builds a shared index in which each tenant’s data forms a compact subtree of a global clustering tree. This yields query speeds on par with dedicated per-tenant indexes while keeping memory usage as low as a single shared index. In practice, scaling embedding search requires a combination of techniques: distributing indexes across nodes, applying updates asynchronously, and periodically re-indexing to incorporate new documents. The concept of “embedding data lakes” has also arisen, where raw unstructured data is stored alongside its embeddings (Cracking Vector Search Indexes). This lets organizations run semantic search over enormous data lakes, but it demands robust indexing strategies (such as the adaptive CrackIVF) to build and query those indexes efficiently. In summary, scalability is addressed by distributed indexing, memory/disk trade-offs, and novel indexing schemes that preserve performance for large-scale vector search.
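As a concrete example of the memory/recall trade-off, the sketch below (again FAISS with synthetic data; the partition count and code size are illustrative) builds an IVF index with product quantization, which stores compressed codes instead of full vectors:
import numpy as np
import faiss
d, n = 384, 50000
xb = np.random.rand(n, d).astype("float32")
nlist, m = 256, 48 # 256 coarse partitions; 48 PQ sub-vectors (must divide d)
quantizer = faiss.IndexFlatL2(d) # coarse quantizer that assigns vectors to partitions
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8) # 8 bits per sub-vector -> 48 bytes per vector
index.train(xb) # learn partition centroids and PQ codebooks
index.add(xb)
index.nprobe = 16 # probe 16 of the 256 partitions (recall vs. speed knob)
query = np.random.rand(1, d).astype("float32")
distances, ids = index.search(query, 5)
print(ids[0]) # ids of the 5 approximate nearest neighbors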
Fine-Tuning and Domain Adaptation
General-purpose embedding models (often based on LLMs) are trained on broad web text and work well on average. However, domain-specific corpora (finance, legal, scientific, etc.) have specialized vocabulary and semantics that general models may miss. A 2024 study by Tang & Yang introduced a finance embedding benchmark (FinMTEB) and found that even state-of-the-art embeddings dropped sharply in performance on domain data, failing to capture domain-specific patterns (Do we need domain-specific embedding models? An empirical investigation). This suggests a need for domain-specific embedding models, or at least domain-tuned versions of general models. One approach is fine-tuning a pre-trained encoder (e.g. fine-tuning BERT or a sentence transformer on in-domain QA pairs or similarity data). Given the cost of full model fine-tuning, Parameter-Efficient Fine-Tuning (PEFT) methods have become popular. Techniques like LoRA (Low-Rank Adaptation) insert small trainable weight updates into a frozen LLM, allowing it to learn domain knowledge with minimal computational overhead. Such PEFT approaches are offered by cloud platforms to help inject custom data into embeddings. Recent research also shows that the quality of the fine-tuning data is critical: categorizing and filtering training examples can yield better results than naively using all available data. Another strategy for real-world systems is multi-task retriever training. Béchard and Ayala (2025) propose fine-tuning a single compact bi-encoder on a variety of tasks and domains simultaneously (Multi-task retriever fine-tuning for domain-specific and efficient RAG). The resulting model can serve as a universal retriever for many applications, eliminating the need to maintain separate specialized retrievers for each domain. Their instruction-tuned multi-task retriever was shown to adapt to new domains and even to an unseen retrieval task without additional training. In summary, adaptation techniques range from full model fine-tuning to lightweight adapters, and even to training one retriever to rule them all. These enable customizing embeddings for specialized applications, a crucial step for high accuracy in domain-specific LLM deployments.
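To illustrate the LoRA idea in code, here is a minimal sketch (using the Hugging Face transformers and peft libraries; the base model, target modules, and hyperparameters are placeholders rather than settings from the cited work) that wraps a BERT-style encoder with low-rank adapters so only a small fraction of the weights is trained on in-domain data:
from transformers import AutoModel
from peft import LoraConfig, get_peft_model
base = "sentence-transformers/all-MiniLM-L6-v2" # placeholder encoder; any BERT-style model works
encoder = AutoModel.from_pretrained(base)
lora_cfg = LoraConfig(
    r=8, # rank of the low-rank update matrices
    lora_alpha=16, # scaling factor applied to the adapter output
    lora_dropout=0.05,
    target_modules=["query", "value"], # attention projections to adapt (module names vary by model)
)
peft_encoder = get_peft_model(encoder, lora_cfg)
peft_encoder.print_trainable_parameters() # typically well under 1% of the full model
## From here, peft_encoder would be trained with a contrastive objective on in-domain
## query/passage pairs while the frozen base weights stay untouched.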
Embedding Model Architectures: BERT-Based vs. Beyond
Embedding models for retrieval have evolved from traditional encoder-only models to more complex setups. BERT-based dual encoders (e.g. Sentence-BERT, DPR) encode queries and documents into vectors and use dot-product similarity. They were among the first dense retrievers, offering efficient retrieval with a single vector per document. Large Language Models have since opened new possibilities: recent work shows that decoder-based LLMs (GPT-style), despite being generative, can produce excellent embeddings (LLMs are Also Effective Embedding Models: An In-depth Overview). This has led to a paradigm shift where LLMs are repurposed as embedding generators, either via prompt-based methods or by fine-tuning the LLM’s embedding output layers. Compared to smaller BERT encoders, LLM-based embeddings can capture richer semantics (thanks to training on vast data), but they come with higher compute cost and latency. There is also the contrast between dense and sparse embeddings. Sparse models (e.g. SPLADE++) generate high-dimensional sparse vectors (essentially selecting important terms or n-grams), akin to learned inverted indexes, while dense models produce a low-dimensional continuous vector. Extensive evaluations find that neither dense nor sparse has a clear overall advantage: their effectiveness is often comparable, and performance can depend on the dataset (Operational Advice for Dense and Sparse Retrievers: HNSW, Flat, or Inverted Indexes?). This has motivated hybrid retrieval, where systems combine dense and sparse results to cover each other’s gaps. For instance, BM25 or SPLADE can be used alongside a dense model, and merging their candidate lists often yields higher recall (a minimal fusion sketch follows the list below). Another emerging architecture is the multi-vector retriever: instead of a single embedding per document, a model like ColBERT represents a document as a set of token-level embeddings, enabling more fine-grained matching. Such approaches bridge the gap between pure dense retrieval and traditional term matching by allowing a query to find relevant local embeddings within a document. Finally, pooling strategies and long-text embeddings are an active area: models now handle long inputs via chunking or hierarchical encoders, ensuring that lengthy documents (common in digitization projects) can be effectively embedded. In summary, today’s landscape includes:
BERT-based dual encoders – fast, effective for many tasks.
LLM-derived embeddings – leveraging powerful generative models for improved semantic capture.
Sparse/hybrid models – integrating lexical matching signals with embeddings.
Multi-vector and other advanced architectures – aiming for finer matching and better recall in complex queries.
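As a minimal sketch of the hybrid retrieval idea mentioned above, the function below implements reciprocal rank fusion (RRF) over two hypothetical candidate lists, one standing in for a sparse retriever (e.g. BM25 or SPLADE) and one for a dense bi-encoder; the constant k=60 is a common default rather than a tuned value:
def reciprocal_rank_fusion(rankings, k=60):
    # Merge several ranked lists of document ids; a higher fused score means a better match.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
## Hypothetical candidate lists from two retrievers for the same query
sparse_hits = ["doc3", "doc1", "doc7", "doc4"] # e.g. BM25 / SPLADE ranking
dense_hits = ["doc1", "doc9", "doc3", "doc2"] # e.g. dense bi-encoder ranking
fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
print(fused[:3]) # documents favored by both lists (doc1, doc3) rise to the top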
Multimodal Embeddings: Text, Images, and Structured Data
Embedding models are increasingly multimodal, meaning they handle text, images, and other data in a unified way. A common technique for text–image integration is joint contrastive training: models like CLIP learn a shared space for images and captions by pulling matching pairs together and pushing others apart (From Word Vectors to Multimodal Embeddings: Techniques, Applications, and Future Directions For Large Language Models). This enables cross-modal retrieval (e.g. finding an image from a text description, or vice versa) with a single embedding model: a caption and its image have high similarity in CLIP’s embedding space, allowing effective retrieval. Beyond static models, researchers are using Multimodal LLMs (MLLMs) that accept both image and text inputs. Sheng-Chieh Lin et al. (2025) fine-tuned a vision-language LLM as a bi-encoder retriever on a suite of 10 datasets covering 16 tasks (MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs). Their model, MM-Embed, can understand queries that mix text and images (e.g. a question with an attached diagram) and retrieve relevant results across modalities. An interesting finding was that a raw multimodal LLM retriever initially underperformed a dedicated CLIP model on cross-modal search, due to modality bias (the LLM was pre-trained more heavily on text). By introducing modality-aware hard negative mining during training, they mitigated this bias and achieved state-of-the-art results on a challenging benchmark (M-BEIR) covering image/text retrieval tasks. In fact, the unified model even surpassed the previous best text-only retriever on a purely textual benchmark, showing that a well-trained multimodal embedder can excel in single-modality tasks too. This suggests that multimodal training, done carefully, does not have to compromise text understanding and can produce very powerful general-purpose embeddings. Apart from text and images, other data types are being incorporated: for example, researchers embed structured data like tables or knowledge graphs by converting them to text or using graph neural networks, then aligning those embeddings with text. The goal is a single embedding space where a user’s query can find relevant information whether it lives in plain text, an image, or a database row. Such capabilities are crucial for document digitization projects, where information might appear as scanned text, images (figures, diagrams), or metadata. Multimodal embedding models allow an LLM-based system to fetch supporting facts from diverse sources, improving the robustness and versatility of applications like question answering on documents.
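For CLIP-style cross-modal retrieval, a minimal sketch (assuming the sentence-transformers CLIP wrapper; the image file names are placeholders) looks like this:
from sentence_transformers import SentenceTransformer, util
from PIL import Image
clip = SentenceTransformer("clip-ViT-B-32") # CLIP wrapper that embeds images and text into one space
## Placeholder image paths; in a document pipeline these would be extracted figures or scanned pages
image_paths = ["figure1.png", "diagram2.png"]
image_embeddings = clip.encode([Image.open(p) for p in image_paths])
query = "a bar chart of quarterly revenue"
query_embedding = clip.encode([query])
scores = util.cos_sim(query_embedding, image_embeddings) # cosine similarity of the text query to each image
best = scores.argmax().item()
print("Best matching image:", image_paths[best])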
Implementation Insights: Document Chunking and Retrieval (Code Example)
To ground these concepts, here is a simplified example of how document chunking and embedding-based retrieval can be implemented in Python. We use a pre-trained sentence transformer model to embed text chunks, and FAISS (Facebook AI Similarity Search) for efficient vector similarity search. This setup could be part of an LLM-powered document QA system:
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss
## 1. Load an embedding model (BERT-based sentence transformer in this case)
model = SentenceTransformer('all-MiniLM-L6-v2') # a small, fast embedding model
## 2. Example document and query
document_text = "Alice's Adventures in Wonderland is an 1865 novel written by Lewis Carroll..." # etc.
query = "Who is the author of Alice in Wonderland?"
## 3. Chunk the document into smaller segments (to fit model's length limits)
chunk_size = 100 # characters per chunk (for demonstration; in practice use tokens)
chunks = [document_text[i:i+chunk_size] for i in range(0, len(document_text), chunk_size)]
## 4. Compute embeddings for each chunk
chunk_embeddings = model.encode(chunks, normalize_embeddings=True) # shape: (num_chunks, embedding_dim)
## 5. Build a FAISS index for fast nearest-neighbor search on embeddings
index = faiss.IndexFlatL2(chunk_embeddings.shape[1]) # L2 distance; ranks the same as cosine similarity on normalized vectors
index.add(np.array(chunk_embeddings)) # add all chunk vectors to the index
## 6. Embed the query and retrieve top-K similar chunks
query_vec = model.encode([query], normalize_embeddings=True) # shape: (1, embedding_dim)
distances, indices = index.search(np.array(query_vec), k=3) # find 3 nearest chunks
top_chunks = [chunks[i] for i in indices[0]]
print("Query:", query)
print("Top relevant chunk:", top_chunks[0])
In this code:
We split a long document into chunks and encode each chunk into a vector. Chunking ensures that each piece of text is of a manageable length for the model and allows the retrieval system to pinpoint the relevant section of a document (a token-based chunking variant is sketched after this list).
We use an embedding model (all-MiniLM-L6-v2) to get 384-dimensional embeddings for each chunk. In practice, one might use a domain-specific model or a larger model for better accuracy (at the cost of speed).
The embeddings are added to a FAISS index, which supports efficient similarity search (here using L2 distance, which ranks results the same as cosine similarity since the vectors are normalized). This dramatically speeds up retrieval compared to scanning all chunks for each query.
At query time, we encode the user’s query into the same vector space and perform a nearest-neighbor search in the index. The result is the top-K most similar chunks to the query. These chunks are the ones likely containing the answer or relevant information.
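As referenced in the first point above, here is a token-based chunking variant (a sketch assuming the Hugging Face tokenizer behind the same MiniLM model and reusing document_text from the example; the chunk length is illustrative):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
def chunk_by_tokens(text, max_tokens=128):
    # Split text into chunks of at most max_tokens tokens, keeping each chunk below the model's input limit.
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    return [
        tokenizer.decode(token_ids[i:i + max_tokens])
        for i in range(0, len(token_ids), max_tokens)
    ]
token_chunks = chunk_by_tokens(document_text)
print(len(token_chunks), "token-based chunks; first chunk:", token_chunks[0][:80])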
This retrieval pipeline is typically followed by using the retrieved chunk(s) in the LLM’s prompt (for instance, appending the chunk text to the query for the LLM to generate a final answer, an approach known as retrieval-augmented generation). The above example demonstrates the core of a document digitization and QA system: by combining chunking, embeddings, and vector search, even large document collections can be searched semantically with low latency. Real-world systems build on these fundamentals with optimizations discussed earlier (caching embeddings, using HNSW indexes for larger scales, etc.) and handle multi-modal data similarly by using appropriate embedding models (e.g. employing CLIP for image embeddings). The continued research in 2024–2025 is making such LLM-based retrieval systems faster, more scalable, and more accurate than ever before, enabling applications from enterprise document search to robust question-answering on digitized archives.
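For larger collections, the flat index in the example can be swapped for an HNSW graph with a small code change; the sketch below (FAISS, with illustrative parameters, reusing chunk_embeddings and query_vec from above) shows the idea:
import faiss
## Drop-in replacement for the IndexFlatL2 above once the corpus grows large
dim = chunk_embeddings.shape[1]
hnsw_index = faiss.IndexHNSWFlat(dim, 32) # 32 graph neighbors per node (M)
hnsw_index.hnsw.efConstruction = 200 # build-time effort: higher = better graph, slower indexing
hnsw_index.hnsw.efSearch = 64 # query-time effort: higher = better recall, slower search
hnsw_index.add(chunk_embeddings)
distances, indices = hnsw_index.search(query_vec, 3) # same search call as before, now approximate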
Sources: The insights and techniques above are drawn from recent literature, including performance studies of ANN indexes (Operational Advice for Dense and Sparse Retrievers: HNSW, Flat, or Inverted Indexes?), scalable vector database designs (Curator: Efficient Indexing for Multi-Tenant Vector Databases), domain adaptation research (Do we need domain-specific embedding models? An empirical investigation), surveys on embedding model architectures (LLMs are Also Effective Embedding Models: An In-depth Overview), and advances in multimodal retrieval (MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs). These latest findings (2024–2025) highlight the state of the art in using vector embeddings for LLM applications in document processing.