Document digitization and chunking have become a common strategy for augmenting Large Language Models (LLMs) with external knowledge. In this approach, documents are split into smaller chunks, each chunk is converted to a vector embedding, and a vector index is built to enable fast similarity search. These vector indexes allow retrieval of relevant chunks given a query embedding, forming the backbone of retrieval-augmented generation (RAG) pipelines (HERE). Recent research in 2024 and 2025 has focused on improving vector indexing methods along several dimensions: speed, storage efficiency, recall–precision tradeoffs, and ease of integration in LLM pipelines. This review surveys the latest methods and findings, comparing different indexing approaches and analyzing their suitability for real-time inference, batch processing, and offline retrieval scenarios.
Overview of Vector Indexing Approaches
Modern approximate nearest neighbor (ANN) search algorithms underpin most vector databases and indexes. They can be broadly categorized into a few types (Vector Search on Billion-Scale Data Collections); a short code sketch contrasting the main families follows the list:
Brute-Force (Flat Index) – The simplest method stores all chunk embeddings in a list and does a linear scan for queries. This “flat” index guarantees exact results (100% recall) but becomes slow as data scales (HERE). Recent guidance suggests that flat indexes with brute-force search remain viable for smaller corpora or prototyping, even though best practice leans toward ANN indexes.
Graph-Based ANN (e.g. HNSW) – Graph indexes like Hierarchical Navigable Small World (HNSW) have become a de facto standard for fast vector search (SeRF: Segment Graph for Range-Filtering Approximate Nearest Neighbor Search). They organize embeddings as a navigable small-world graph; query traversal quickly finds near neighbors. HNSW is widely adopted in industry and academia – implemented in Lucene and Elasticsearch, and powering vector DBs like Weaviate and Milvus (Zilliz). Graph methods offer excellent speed/accuracy tradeoffs, typically achieving high recall with millisecond latencies on million-scale data.
Tree-Based ANN (e.g. VP trees, Annoy) – Tree structures (like the randomized projection trees in Spotify’s Annoy) partition the vector space to prune search. They are simple to integrate (Annoy provides an easy library) but can be less efficient in very high dimensions or under strict recall requirements, and are often outperformed by graph indexes in recent evaluations. (While 2024/25 research has focused less on tree ANN improvements, they remain a practical option for moderate scales.)
Inverted File and Quantization (IVF, PQ) – Partition-based indexes (originating from vision search) cluster the vector space into buckets (cells). At query time, only a few relevant buckets are searched (Cracking Vector Search Indexes). This inverted file (IVF) approach is often combined with Product Quantization (PQ) or other compression, which stores vectors as compact codes to save memory. The trade-off is some loss in precision due to quantization. A large number of clusters yields faster query times but higher index build cost, whereas fewer clusters (or none, i.e. brute-force) minimize upfront cost but result in slower queries at scale. Recent work highlights this tension: e.g. using 16k clusters gives very fast search but a huge indexing time, while using ~1k clusters or brute-force lets queries start immediately but with longer per-query latency.
Hash-Based ANN (LSH) – Locality-sensitive hashing hashes vectors into buckets such that similar vectors collide. LSH was popular historically for ANN but often requires many hash tables to reach high recall, leading to larger memory usage and slower queries compared to graph-based methods. (Recent literature in 2024/25 has not emphasized LSH for LLM retrieval, as other methods tend to outperform it on high-dimensional text embeddings.)
Hybrid and Learned Indexes – New research is exploring hybrid structures and adaptive indexes. For example, Azizi (2024) proposes ELPIS, a hybrid graph–tree index that clusters the dataset (tree partitioning) and then builds local proximity graphs, combining each approach’s strengths (Vector Search on Billion-Scale Data Collections). This design dramatically reduces indexing memory and time while maintaining efficient query answering, outperforming state-of-the-art graph indexes in throughput on billion-scale data. Another frontier is adaptive indexing: Mageirakos et al. (2025) introduce CrackIVF, an index that builds itself on the fly rather than all upfront (Cracking Vector Search Indexes). CrackIVF begins with near brute-force search and incrementally refines an IVF index as queries arrive, adapting to the query distribution. This yields orders-of-magnitude lower startup cost – the system can answer many queries immediately (albeit slower at first) while gradually converging to the performance of a fully built index. These innovations are particularly relevant for offline corpora or infrequently accessed data, where building millions of indexes in advance is impractical.
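The sketch below builds one index from each of the main families above with Faiss on synthetic stand-in embeddings. The array names (xb, xq), the 768-dimension size, and all parameter values are illustrative assumptions rather than recommendations:

```python
# Minimal sketch (not a benchmark): one index from each family discussed above,
# built with Faiss on synthetic stand-in embeddings.
import numpy as np
import faiss

d = 768
rng = np.random.default_rng(0)
xb = rng.random((100_000, d), dtype=np.float32)   # stand-in chunk embeddings
xq = rng.random((16, d), dtype=np.float32)        # stand-in query embeddings

# 1) Flat (brute-force): exact results, no build cost beyond copying vectors.
flat = faiss.IndexFlatL2(d)
flat.add(xb)

# 2) Graph-based (HNSW): approximate, fast, tunable via efSearch at query time.
hnsw = faiss.IndexHNSWFlat(d, 16)                 # M = 16 links per node
hnsw.hnsw.efConstruction = 200                    # build-time effort
hnsw.add(xb)
hnsw.hnsw.efSearch = 64                           # query-time effort

# 3) Inverted file + product quantization: clustered and compressed storage.
nlist = 1024                                      # number of clusters (cells)
coarse = faiss.IndexFlatL2(d)                     # coarse quantizer for IVF
ivfpq = faiss.IndexIVFPQ(coarse, d, nlist, 16, 8) # 16 sub-vectors x 8 bits = 16 bytes/vector
ivfpq.train(xb)                                   # learn centroids + PQ codebooks
ivfpq.add(xb)
ivfpq.nprobe = 8                                  # clusters probed per query

for name, index in [("flat", flat), ("hnsw", hnsw), ("ivfpq", ivfpq)]:
    D, I = index.search(xq, 10)                   # top-10 neighbor ids per query
    print(name, I[0][:5])
```

The same corpus can be served by any of the three; the choice is a trade-off between build cost, memory, latency, and recall, which the following sections compare.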
Comparison of Approaches: Speed, Memory, and Accuracy
Search Speed: Graph-based indexes like HNSW are known for excellent query speed at scale. They can retrieve nearest neighbors in logarithmic-like time, often just a few milliseconds for million-scale corpora. Lin (2024) found that on “large” corpora (≥1M vectors), a tuned HNSW index achieved query throughput up to 10× higher than a brute-force flat index (HERE), with similar accuracy. For “medium” corpora (100K–1M vectors), HNSW was roughly 2–3× faster than flat search when using cached query embeddings. On small collections (<100K), the speed difference becomes negligible, making brute-force acceptable for tiny databases or prototyping. Tree-based methods (Annoy) also accelerate search over brute-force, but they are typically slower than HNSW at high recall settings, since Annoy may need to examine many tree nodes to reach recall parity with HNSW’s graph navigation. Inverted file (IVF) indexes offer tunable speed: with sufficient partitioning (e.g. hundreds or thousands of clusters), they drastically cut down the candidates to inspect, approaching the speed of HNSW. However, if too few clusters are used (to reduce indexing cost), query speed suffers. CrackIVF specifically shows that starting with a small number of clusters gives quick initial answers, and as it learns and increases partitions, query latency drops to near-optimal levels (Cracking Vector Search Indexes). Hashing schemes (LSH) can answer queries quickly for very high-similarity matches, but for the kind of semantic embeddings used in LLM applications (high dimensional and requiring nuanced nearest-neighbor ranking), LSH often needs many probes to achieve good recall, eroding its speed advantage.
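These relative speeds are easy to spot-check on your own corpus. The rough sketch below continues the Faiss example above (reusing flat, hnsw, and xq); absolute numbers will vary with hardware and corpus size:

```python
# Rough throughput comparison: exact flat scan vs. approximate HNSW search.
# Continues the earlier sketch (flat, hnsw, xq); numbers are hardware-dependent.
import time

def queries_per_second(index, queries, k=10, repeats=5):
    start = time.perf_counter()
    for _ in range(repeats):
        index.search(queries, k)
    elapsed = time.perf_counter() - start
    return repeats * len(queries) / elapsed

print("flat QPS:", round(queries_per_second(flat, xq)))
print("hnsw QPS:", round(queries_per_second(hnsw, xq)))
```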
Storage Efficiency: There is a trade-off between index complexity and memory footprint. A flat index is just the raw vectors (and perhaps an ID list): simple but memory-heavy if the corpus is large (each 768-dim embedding is ~3KB in FP32). Graph indexes like HNSW add links between vectors; this overhead is linear in data size (each node stores M neighbors). For example, HNSW with M=16 stores a few dozen extra links per vector; this adds memory, but for high-dimensional float embeddings the link overhead is typically a small fraction of the raw vector data (a few percent for 768-dimensional FP32 embeddings). Tree indexes have smaller overhead, but they don’t reduce the need to store the full vectors unless combined with quantization. Product Quantization (PQ) is a key technique to shrink storage: vectors are stored in compressed form (e.g. 8 or 16 bytes) instead of full floats. IVF+PQ was famously used to fit billions of vectors into memory in Facebook’s Faiss library. The cost is some loss in precision due to compression. An alternative, lighter approach is int8 quantization of vectors (reducing 32-bit floats to 8-bit integers). Lin (2024) reported that applying int8 quantization yields a 4× reduction in memory and actually increases query speed, with only a minor impact on retrieval quality. The speedup comes from having less data to traverse and better cache utilization, which outweighs the modest extra cost of encoding. Notably, quantizing HNSW indexes gave even larger QPS gains than quantizing flat indexes, with little to no drop in nDCG@10 compared to non-quantized HNSW. This suggests that aggressive compression is very feasible: one can quantize embeddings (and even prune dimensions) to save space while preserving most of their utility for retrieval. In practice, many vector databases (Milvus, Vespa, etc.) offer optional PQ or bit-quantization modes for large deployments.
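As a hedged sketch of the lighter int8 route, Faiss provides an 8-bit scalar quantizer that can be used on its own or underneath an HNSW graph (reusing d, xb, and xq from the earlier sketch; QT_8bit and M = 16 are illustrative choices, and the exact setup in Lin (2024) may differ):

```python
# Sketch: 8-bit scalar quantization in Faiss, standalone and under an HNSW graph.
# Roughly 1 byte per dimension instead of 4 (FP32), i.e. about a 4x smaller index.
import faiss

sq = faiss.IndexScalarQuantizer(d, faiss.ScalarQuantizer.QT_8bit)
sq.train(xb)                 # learns per-dimension value ranges for the 8-bit codes
sq.add(xb)

hnsw_sq = faiss.IndexHNSWSQ(d, faiss.ScalarQuantizer.QT_8bit, 16)   # M = 16
hnsw_sq.train(xb)
hnsw_sq.add(xb)
hnsw_sq.hnsw.efSearch = 64

D, I = hnsw_sq.search(xq, 10)    # compare I against the flat index's results
```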
Recall and Precision Trade-offs: In approximate search, recall refers to how many of the true nearest neighbors (or relevant documents) are found, whereas precision relates to whether the retrieved top results are truly relevant. A well-tuned ANN index should retrieve nearly the same top-k as an exact search, so precision and recall should remain high. Graph methods like HNSW are known to achieve >95% recall of true neighbors in benchmarks while being much faster than brute force. Empirical results confirm that the effectiveness degradation of HNSW is minimal: in Lin’s experiments on retrieval-augmented QA, the nDCG@10 from an HNSW index was usually within a few thousandths of that from an exact flat index (HERE). In fact, in some cases HNSW even slightly exceeded the flat index’s score (within run variance). This shows that for top-k retrieval with typical k (e.g. 5–10), a properly configured ANN index returns essentially the same relevant chunks as exhaustive search. The trade-off comes if one pushes for extreme speed or compression: using fewer neighbors or lower-precision embeddings can start to miss some relevant results, harming recall. LSH tends to have a sharper trade-off: it may retrieve some close vectors very quickly but can miss others entirely (lower recall) unless many hash tables are used. Tree indexes might miss neighbors if the data doesn’t partition cleanly by hyperplanes. In contrast, quantized indexes (int8 or PQ) typically maintain high recall and precision for semantic search. Lin (2024) found quantization’s impact on retrieval metrics to be minor overall, barely shifting nDCG scores in most cases. In summary, modern ANN indexes can be tuned to hit a target recall (even 99%+ of exact), and the precision of retrieved chunks for LLM context is largely preserved. It becomes a matter of choosing the right parameters (e.g. HNSW efSearch, number of IVF probes, etc.) to balance slightly higher recall against slightly more latency. Advanced approaches are also exploring dynamic trade-offs: e.g. SOAR (NeurIPS 2023) introduced redundancy in ScaNN’s index to reduce “failures” (missing a true neighbor) (SOAR: New algorithms for even faster vector search with ScaNN), essentially trading a bit more index size for higher recall at a fixed speed.
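Recall against exact search is straightforward to measure. The sketch below (again reusing flat, hnsw, and xq from the earlier Faiss example) sweeps efSearch and reports recall@10 against the flat index as ground truth:

```python
# Sketch: recall@10 of the HNSW index vs. exact ground truth, swept over efSearch.
k = 10
_, ground_truth = flat.search(xq, k)          # exact top-10 per query

for ef in (16, 32, 64, 128):
    hnsw.hnsw.efSearch = ef
    _, approx = hnsw.search(xq, k)
    hits = sum(len(set(a) & set(g)) for a, g in zip(approx, ground_truth))
    print(f"efSearch={ef:4d}  recall@10={hits / (k * len(xq)):.3f}")
```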
Ease of Integration: The practicality of a vector index in LLM pipelines depends on how easily it integrates with existing tools. Here, HNSW-based indexes have an advantage due to broad adoption. Many off-the-shelf vector databases and libraries use HNSW under the hood, making it almost plug-and-play. For example, OpenAI’s Ada embeddings can be indexed in Weaviate or Milvus with HNSW – as one study notes, Weaviate’s ANN (HNSW) retrieval yields fast search with high accuracy out-of-the-box (HERE). Likewise, popular frameworks like LangChain and LlamaIndex provide connectors to vector stores (Pinecone, Vespa, etc.) that default to HNSW or similar ANN methods. Even traditional search engines have integrated dense vector indexes: Apache Lucene added HNSW in version 9, and Elasticsearch 8.x supports ANN search natively (SeRF: Segment Graph for Range-Filtering Approximate Nearest Neighbor Search). This means an LLM application can leverage existing infrastructure (e.g. add a vector field to an Elastic index) rather than building a custom system. Flat indexes are trivially easy to implement (just a matrix of embeddings), and libraries like FAISS make brute-force search on GPUs or CPUs a single function call. However, flat search becomes impractical beyond a certain data size unless used in a limited setting (small corpora or as a re-ranking step). Tree-based indexes like Annoy or FLANN are available as libraries but are somewhat less common in LLM pipelines today. Hashing techniques require custom integration and are rarely part of high-level frameworks now. IVF/PQ indexing usually requires a library like FAISS or Milvus; integration is moderately easy (these libraries are well documented), but tuning the index (choosing the number of clusters, training PQ codebooks, etc.) adds complexity. In summary, approaches that have matured into open-source tools (HNSW, flat, basic PQ) are the easiest to integrate. Cutting-edge methods like CrackIVF or learned indexes would currently require custom implementation, but they build on standard primitives (e.g. Faiss for CrackIVF (Cracking Vector Search Indexes)), which suggests they could be adopted into mainstream tools in the near future.
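To underline how little machinery a flat index actually needs, here is a minimal NumPy sketch (assuming embeddings are L2-normalized so that the inner product equals cosine similarity; the class and method names are hypothetical):

```python
# Sketch: a "flat index" is essentially a matrix plus a dot product.
import numpy as np

class FlatIndex:
    def __init__(self, embeddings: np.ndarray):
        # (N, d) float32 matrix of L2-normalized chunk embeddings.
        self.embeddings = embeddings

    def search(self, query: np.ndarray, k: int = 10) -> np.ndarray:
        scores = self.embeddings @ query           # cosine similarity per chunk
        top = np.argpartition(-scores, k)[:k]      # unordered top-k candidates
        return top[np.argsort(-scores[top])]       # chunk ids sorted by similarity

# Usage: ids = FlatIndex(chunk_matrix).search(query_vector, k=5)
```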
Suitability for Different Scenarios
Real-Time Inference Settings: In interactive applications (chatbots, live question-answering), low query latency is paramount. Vector indexes that provide fast recall of relevant chunks with minimal delay are preferred. Graph-based ANN (HNSW) is generally the top choice here due to its millisecond-level response times and high recall. It’s no surprise most production RAG systems use HNSW or similar under the hood (SeRF: Segment Graph for Range-Filtering Approximate Nearest Neighbor Search). For example, a search query to a vector DB should ideally complete in <100ms to keep total LLM response time reasonable. HNSW can often fulfill a top-5 or top-10 ANN search in 5–10ms on millions of vectors. Real-time systems also benefit from vector compression (to fit indexes entirely in RAM and cache). Using int8 quantization or optimized data structures ensures the search stays memory-resident and fast. Research confirms that int8-quantized indexes boost QPS and reduce latency significantly while keeping answer quality high (HERE). In contrast, exhaustive search would be too slow if the corpus is large (scaling linearly with N). Only in cases of very small corpora (say a few thousand embeddings) could brute-force be acceptable in real time, or if one has massive parallel hardware (e.g. a GPU searching a few million vectors quickly). Real-time pipelines also demand consistent performance – graph indexes can be tuned (e.g. by setting HNSW’s efSearch parameter) to balance latency and recall deterministically per query. This is harder with methods like LSH, which have more randomness. Thus, for real-time use: HNSW (or a similar ANN) with optional quantization is the go-to, providing an excellent speed/recall balance. Tree indexes or LSH might be used if memory is extremely constrained and approximate results are tolerable, but those are less common. Integration-wise, using a managed vector database (or an open-source one) that slots into an LLM service can simplify deployment.
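A minimal serving-side sketch with the hnswlib library (one common standalone HNSW implementation; the corpus size and parameter values below are illustrative assumptions) shows this per-query knob: set_ef raises recall at the cost of latency.

```python
# Sketch: real-time HNSW serving with hnswlib; ef is the recall/latency dial.
import numpy as np
import hnswlib

d, n = 768, 100_000
data = np.random.rand(n, d).astype(np.float32)     # stand-in chunk embeddings

index = hnswlib.Index(space="cosine", dim=d)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(data, np.arange(n))

index.set_ef(64)                                   # higher ef -> higher recall, more latency
query = np.random.rand(d).astype(np.float32)
labels, distances = index.knn_query(query, k=5)    # ids of the top-5 chunks to feed the LLM
```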
Batch Processing: In batch or offline processing of queries, the constraints differ. Here the system might handle a large number of queries or documents in a pipeline without a human waiting on each query. Throughput (queries per second over the batch) and total processing time matter more than single-query latency. Batch scenarios can tolerate more exhaustive or repeated work per query if it simplifies the pipeline or improves accuracy, since the wall-clock time per query is less critical than overall job time. For example, a nightly job might embed and retrieve information for thousands of tasks. In such cases, it might be feasible to use exact (brute-force) search for better recall, especially on moderate corpus sizes, because the queries can be run in parallel or during off-peak hours. Lin (2024) points out that for prototyping or small query workloads, even on medium-size corpora, the difference between HNSW and flat search in total processing time may be minor. If queries are pre-encoded (cached), a flat scan might take only a couple of seconds versus a second for ANN, which is not a deal-breaker in batch mode. Batch processing can also amortize index construction costs. If you have to run thousands of queries on a new dataset, spending time to build an optimized index may pay off. Methods with heavy preprocessing (HNSW build, large IVF clustering, etc.) become more worthwhile when the index will be hit by many queries. On the flip side, something like CrackIVF is attractive for batch scenarios where you don’t know the query distribution upfront: it allows you to start querying immediately, and the index catches up as the batch runs (Cracking Vector Search Indexes). This could minimize the total time to get through a batch of queries by overlapping indexing with search. Batch settings also allow re-ranking or ensemble retrieval for higher precision: e.g. use a fast ANN index to get 100 candidates, then do a second pass with a slower exact method or an LLM re-ranker on those; a sketch of this two-stage pattern follows below. This two-stage approach is common in LLM pipelines and is computationally palatable in batch mode (since you can distribute the workload). In summary, batch processing is more forgiving: one can mix and match index methods, even brute-force or heavy reranking, as long as the throughput is acceptable. The primary goal is maximizing overall recall and precision for the batch, so often a combination (ANN for initial recall, then refinement) is used.
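A compact sketch of that two-stage pattern, reusing hnsw, xb, and xq from the earlier Faiss example (the function name and candidate count are illustrative; an LLM-based reranker could stand in for the exact second pass):

```python
# Sketch: two-stage retrieval: a wide approximate recall pass, then exact re-scoring.
import numpy as np

def two_stage_search(ann_index, vectors, query, n_candidates=100, k=10):
    # Stage 1: cheap ANN pass to pull a wide candidate pool.
    _, candidates = ann_index.search(query.reshape(1, -1), n_candidates)
    candidates = candidates[0]
    candidates = candidates[candidates >= 0]       # drop any padding ids
    # Stage 2: exact distances on the small pool (an LLM reranker could go here instead).
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]

# Usage: top_ids = two_stage_search(hnsw, xb, xq[0])
```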
Offline Retrieval Systems: By offline retrieval, we refer to systems that are not serving live user queries but rather performing retrieval for analytical tasks, data indexing, or periodic jobs (e.g. building a knowledge base, or an internal search over a static archive). In offline systems, response time is a very low priority; the focus instead is on cost, scalability, and completeness. These scenarios can leverage the most thorough retrieval techniques since time constraints are loose. For instance, if scanning an entire data lake to build an index of embeddings, one might even do an exhaustive similarity join or clustering offline. However, offline systems also deal with the largest scales of data (potentially billions of documents), so storage efficiency and index scalability become critical. A method like DiskANN (disk-based ANN), developed by Microsoft, is designed for this: it stores the graph index partly on SSD, enabling billion-scale collections to be indexed without holding all data in RAM. DiskANN sacrifices some query speed (millisecond-level disk fetches) but is suitable for offline or high-scale scenarios where holding everything in memory is too costly. Another consideration is that offline retrieval might involve specialized queries or filters. For example, one might restrict search to documents of a certain date range or type. Research on range-filtering ANN (e.g. SeRF: Segment Graph, SIGMOD 2024) addresses combining vector search with structured filters efficiently (SeRF: Segment Graph for Range-Filtering Approximate Nearest Neighbor Search). Such advanced indexes could be useful in offline analytic search where complex queries are expected. In terms of accuracy, offline systems can maximize recall by using multiple indexing methods in parallel. For example, one could maintain both a dense vector index and a traditional keyword index (BM25) and union their results for completeness, since speed is secondary; a small fusion sketch follows below. Some experiments (Cahoon et al. 2025) on open-domain QA found that combining dense and sparse retrieval can yield the best of both worlds in terms of answer recall (HERE). Offline contexts also benefit from fully automated index selection. With so much data of different types, one index type may not fit all. Techniques like CrackIVF that can adapt per dataset and workload are promising: Mageirakos et al. show that CrackIVF eventually converges to an index as effective as the best static index, while having several orders of magnitude less initial cost when deploying on new data. This is ideal offline, where you might spin up an index on a new corpus and immediately start using it for analysis, letting it optimize in the background. Overall, offline retrieval systems lean towards maximizing recall and scale: they will use heavier indexing (even brute-force or very exhaustive ANN configurations), aggressive compression (to fit massive corpora), and innovative methods to minimize the operational burden of indexing millions of documents.
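As a sketch of that dense-plus-sparse union, a simple reciprocal rank fusion over two ranked id lists (one from the vector index, one from BM25) is enough to merge them; the function name is hypothetical, and 60 is the conventional RRF constant:

```python
# Sketch: reciprocal rank fusion (RRF) of dense (vector) and sparse (BM25) rankings.
from collections import defaultdict

def rrf_fuse(dense_ids, sparse_ids, k=10, c=60):
    scores = defaultdict(float)
    for ranking in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (c + rank)     # earlier ranks contribute more
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Usage: fused = rrf_fuse(vector_index_results, bm25_results, k=10)
```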
Key Research Highlights 2024–2025
To conclude, we highlight some of the most pertinent recent research contributions on vector indexing for LLM document retrieval:
HNSW vs. Flat Index Trade-offs: Lin (2024) provides extensive empirical guidance on when to use HNSW vs. brute-force indexes for dense retrieval (HERE), including the effects of int8 quantization.
Vector Quantization in Retrieval: The same study by Lin examined int8 quantization in a production search library. The results demonstrated that quantizing embeddings (in both flat and HNSW indexes) can boost QPS by 1.5–2× and cut memory usage to a quarter, for only a very slight drop in metrics like nDCG. This confirms that modern CPUs can leverage vectorized instructions on 8-bit data, making quantization a highly attractive technique for LLM pipelines dealing with memory limits. The guidance is essentially that quantization should be applied whenever memory or speed is at a premium, as the trade-off “cost” in accuracy is minimal in most cases.
Adaptive Indexing (CrackIVF): Mageirakos et al. (2025) tackled the problem of indexing many separate datasets (as in enterprise “embedding data lakes”) where building a static ANN index for each is infeasible. They proposed CrackIVF, which uses a cracking approach from database systems to gradually build an IVF index based on the query workload (Cracking Vector Search Indexes), achieving faster startup and competitive long-term performance.
Hybrid Index Structures (ELPIS): Azizi (VLDB 2024) introduced ELPIS, a novel in-memory ANN index that merges graph-based and tree-based ideas (Vector Search on Billion-Scale Data Collections). By first clustering the data (using a technique called EAPCA) and then building lightweight neighborhood graphs within clusters, ELPIS achieves strong query performance while drastically reducing indexing time and memory compared to pure HNSW. The author reports that ELPIS outperforms state-of-the-art baselines in throughput-optimized settings, meaning it can handle very high query rates efficiently. This reflects a trend toward hybrid approaches that get the best of multiple data structures.
Integration in LLM Workflows: Beyond algorithmic advances, 2024 also saw work on end-to-end retrieval systems with LLMs. For example, Microsoft’s GraphRAG and RAPTOR (2024) explore organizing knowledge for RAG in graph forms, along with the use of powerful rerankers on retrieved chunks. While these focus more on pipeline and re-ranking than on the vector index itself, they underscore that ease of integration and overall system design are active areas. An emerging best practice is to use two-tier retrieval: a fast vector index to get candidate chunks, followed by an LLM-based reranker or reader to ensure precision (Optimizing open-domain question answering with graph-based retrieval augmented generation). This mitigates any small loss in recall from the ANN index by letting the LLM sift relevance in a second stage, ultimately improving answer quality.
In summary, vector indexing methods for LLM document retrieval have matured and diversified. Graph-based ANN indexes (especially HNSW) remain the workhorse due to their speed and accuracy, enhanced by quantization for efficiency. For extremely large or flexible scenarios, new research provides pathways to maintain performance: whether through adaptive indexing that eliminates huge upfront costs, or hybrid structures that scale better. The recall–precision trade-offs of ANN vs exact search are now well-understood – with proper tuning, ANN methods incur only minor recall loss while delivering massive speedups (HERE). This is crucial for real-time LLM applications. Meanwhile, the choice of index can be tailored to the use-case: real-time systems favor fast ANN with high recall; batch processing can mix methods to maximize overall throughput; and offline systems can leverage heavy indexing or novel approaches to handle enormous data volumes. With these tools and findings from 2024–2025 research, practitioners can better select and configure vector indexes to power the next generation of LLM-based applications.
Sources:
Lin, J. (2024). “Operational Advice for Dense and Sparse Retrievers: HNSW, Flat, or Inverted Indexes?” – Explores trade-offs between HNSW and flat indexes (indexing time, query speed, accuracy) for dense retrieval (HERE), including the effects of int8 quantization.
Mageirakos et al. (2025). “Cracking Vector Search Indexes.” – Proposes the CrackIVF adaptive index for RAG on data lakes, which incrementally builds an IVF index during querying (Cracking Vector Search Indexes), achieving faster startup and competitive long-term performance.
Azizi, I. (2024). “Vector Search on Billion-Scale Data Collections.” (VLDB PhD Workshop) – Introduces the ELPIS hybrid index combining tree and graph structures to address indexing scalability on large data, with state-of-the-art throughput results (Vector Search on Billion-Scale Data Collections).
Zuo, C. et al. (2024). “SeRF: Segment Graph for Range-Filtering Approximate Nearest Neighbor Search.” (SIGMOD 2024) – Develops a graph index that supports attribute-range filtering alongside ANN search, compressing multiple HNSW graphs efficiently (SeRF: Segment Graph for Range-Filtering Approximate Nearest Neighbor Search).
Jimeno-Yepes et al. (2024). “Financial Report Chunking for Effective RAG.” – An application of RAG in finance; uses Weaviate (HNSW ANN) as the vector store and discusses chunking strategies (HERE). Illustrates a typical LLM pipeline using chunk embeddings and ANN retrieval.
Cahoon, J. et al. (2025). “Optimizing Open-Domain QA with Graph-Based RAG.” – Benchmarks graph-based retrieval-augmented generation strategies. Emphasizes combining multiple retrieval modes and LLM integration (e.g. GraphRAG vs. hybrid search) for improved QA performance (Optimizing open-domain question answering with graph-based retrieval augmented generation).
Google Research (2023). “SOAR: Improved Indexing for ANN Search.” – NeurIPS 2023 paper introducing an algorithm (SOAR) that adds redundancy to ScaNN’s index, improving recall at a given speed by using orthogonality-amplified residuals (SOAR: New algorithms for even faster vector search with ScaNN). (Represents ongoing improvements to vector search algorithms as data scales.)