Table of Contents
Vector Databases for RAG Literature Review
Introduction
Evaluation Criteria for Vector Databases
Comparison of Leading Vector Databases 2024-2025
Pinecone
Weaviate
Milvus
Qdrant
Chroma
Comparative Insights and Recommendations
Conclusion
Introduction
Retrieval-Augmented Generation (RAG) pipelines rely on vector databases to store and search document embeddings. In a typical RAG workflow, documents are digitized and chunked into passages, each encoded as a high-dimensional vector. A user query is likewise embedded and used to retrieve the most similar chunks from the vector store (HERE). This integration of vector search allows large language models to ground their answers in relevant data, mitigating hallucinations by providing factual context (When Large Language Models Meet Vector Databases: A Survey). The past few years have seen a surge of interest in such Vector Database Management Systems (VDBMS) – with over 20 systems introduced in the last five years (Survey of Vector Database Management Systems) – driven by the need for fast, scalable, and reliable similarity search to support LLMs. Below, we review recent 2024–2025 research (primarily arXiv and other reputable sources) on vector databases in RAG contexts, focusing on four key criteria: retrieval speed, storage efficiency, scalability, and query accuracy.
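To make the workflow concrete, here is a minimal, library-agnostic sketch of the retrieval step: chunk, embed, and return the top-k most similar chunks for a query. The embed function is a toy stand-in for a real embedding model, and in production the in-memory array would be replaced by one of the vector databases reviewed below.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in for a real embedding model (e.g. a sentence-transformer)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

# 1. Chunk documents and embed each chunk.
chunks = ["Paris is the capital of France.", "The Nile is the longest river in Africa."]
index = np.stack([embed(c) for c in chunks])        # shape: (num_chunks, dim)

# 2. Embed the user query and retrieve the top-k most similar chunks.
query_vec = embed("What is the capital of France?")
scores = index @ query_vec                          # cosine similarity (unit-norm vectors)
top_k = np.argsort(-scores)[:2]

# 3. The retrieved chunks become grounding context in the LLM prompt.
context = "\n".join(chunks[i] for i in top_k)
print(context)
```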
Evaluation Criteria for Vector Databases
Retrieval Speed: The end-to-end latency of similarity search queries and throughput (queries per second). Low-latency retrieval is critical for interactive LLM applications. Modern approximate nearest neighbor (ANN) algorithms (e.g. HNSW graphs) enable fast retrieval with high accuracy, often achieving sub-10ms response times on million-scale corpora.
Storage Efficiency: Memory and disk footprint required to store embeddings and indexes. Vector indices can be memory-intensive, especially graph-based indexes, so techniques like product quantization and disk-based storage are used to compress vectors (Survey of Vector Database Management Systems). Efficient storage is vital for scaling to billions of embeddings without exorbitant RAM usage.
Scalability: The ability to handle very large corpora and high query loads by scaling up (more powerful hardware) or out (distributed clusters). Some vector DBs run on a single node (suitable for smaller datasets), while others support sharding across many nodes for virtually unlimited scale. Robust scalability ensures performance remains high even as data grows.
Query Accuracy: The precision/recall of nearest-neighbor search results (how often the true nearest vectors are retrieved). ANN methods trade a tiny drop in accuracy for speed; the best systems maintain >95% recall of true neighbors (Vector Database Benchmarks - Qdrant). In practice, high recall is needed so retrieved chunks are relevant to the query, which in turn improves the fidelity of the generated answers in RAG.
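Recall can be measured directly by comparing an ANN index's results against exact brute-force search on a sample of queries. A minimal sketch with synthetic data (dataset sizes, dimensionality, and k are arbitrary):

```python
import numpy as np

def recall_at_k(approx_ids: np.ndarray, exact_ids: np.ndarray) -> float:
    """Fraction of the true k nearest neighbours that the ANN search also returned."""
    hits = sum(len(set(a) & set(e)) for a, e in zip(approx_ids, exact_ids))
    return hits / exact_ids.size

rng = np.random.default_rng(0)
corpus = rng.standard_normal((5_000, 64)).astype(np.float32)
queries = rng.standard_normal((50, 64)).astype(np.float32)
k = 10

# Ground truth via exact brute-force L2 search (feasible on a small evaluation sample).
exact_ids = np.stack([np.argsort(((corpus - q) ** 2).sum(axis=1))[:k] for q in queries])

# approx_ids would come from the vector database under test; reusing the exact
# results here is only a placeholder and trivially yields recall = 1.0.
print(recall_at_k(exact_ids, exact_ids))
```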
Comparison of Leading Vector Databases 2024-2025
Pinecone
Retrieval Speed: Pinecone is a fully managed cloud vector DB known for low-latency queries. It employs advanced ANN indexes under the hood (proprietary, but likely graph-based or hybrid) to deliver millisecond-level search even at large scale. While specific benchmarks from the research literature are sparse (Pinecone’s implementation is closed-source), it is designed for high throughput and low query latency across distributed infrastructure.
Storage Efficiency: As a managed service, Pinecone handles storage behind the scenes. It reportedly uses a mix of in-memory and disk techniques to balance speed and cost. Details in literature are limited, but Pinecone likely leverages vector compression or quantization to reduce memory footprint when storing billions of embeddings. Users do not directly tune this, but benefit from storage optimizations implemented by the service.
Scalability: Pinecone excels in scalability – it automatically shards and distributes indexes across nodes. It offers a seamless scalable system, where users can index massive corpora without managing servers (Survey of Vector Database Management Systems). This distributed design is similar to systems like Vald, making Pinecone very user-friendly for large-scale deployments. Many organizations choose Pinecone when they require virtually unlimited scale and easy maintenance in production.
Query Accuracy: Pinecone is engineered to preserve high accuracy in ANN searches. It likely uses high-recall index configurations by default, so that results closely match those of exact nearest neighbor search. In practice, Pinecone can achieve ~95–100% recall (depending on how it’s configured) while still maintaining speed. It supports tunable accuracy (e.g. adjusting search parameters) if users need to trade off latency for even higher precision.
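As an illustration of how retrieval typically looks from the client side, here is a hedged sketch assuming the v3-style pinecone Python client with a serverless index; the index name, API key, region, dimensionality, and metadata fields are placeholders rather than recommendations.

```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")                # placeholder credentials

# Create a serverless index sized to the embedding model's dimensionality.
pc.create_index(
    name="rag-chunks",                               # hypothetical index name
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)

index = pc.Index("rag-chunks")
index.upsert(vectors=[("chunk-0", [0.1] * 1536, {"text": "example passage"})])

# Query with an embedded user question; top_k controls how many chunks are returned.
results = index.query(vector=[0.1] * 1536, top_k=5, include_metadata=True)
```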
Weaviate
Retrieval Speed: Weaviate is an open-source vector DB (written in Go) that uses an HNSW graph index by default. It delivers fast retrieval; for instance, in a 1M vector dataset benchmark, Weaviate handled ~1,100 queries/sec with ~5ms average latency (Vector Database Benchmarks - Qdrant). This is only slightly behind the fastest engines. Weaviate’s search performance has improved over time, though one report noted it showed the least improvement among peers in recent tests. Still, it provides interactive-speed queries and integrates well with RAG pipelines (as demonstrated in financial QA tasks using Weaviate for chunk retrieval (HERE)).
Storage Efficiency: By default Weaviate stores all vectors and the HNSW index in memory, which can be memory-intensive for very large datasets. However, Weaviate supports optional Product Quantization (PQ) compression – it can construct HNSW indexes over compressed vectors. This significantly reduces memory usage (with minimal accuracy loss), making Weaviate more storage-efficient for large corpora. The index itself (HNSW) has moderate overhead, which is generally reasonable, but very large databases might require quantization or filtering to control memory growth.
Scalability: Weaviate supports scaling out in a cluster configuration. It allows sharding of data classes across multiple nodes and has a hybrid architecture that combines vector search with symbolic filters. While not a managed service, it can be run distributed in production. Several companies run Weaviate on multi-node setups for datasets on the order of hundreds of millions of vectors. Its architecture provides native support for distributed search (scatter-gather across shards), although managing a cluster requires more effort than a managed solution.
Query Accuracy: Thanks to HNSW, Weaviate achieves high recall. In benchmarks it reached ~97–99% precision/recall at 10 nearest neighbors, indicating that it retrieves nearly all relevant chunks. The ANN algorithm yields fast results without sacrificing much accuracy. Furthermore, Weaviate allows tuning HNSW parameters (M, ef) to adjust the speed-accuracy balance. In summary, Weaviate provides strong query accuracy out of the box, suitable for RAG use cases that demand precise retrieval of supporting passages.
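A sketch of how these knobs are typically exposed, assuming the v3-style weaviate-client Python API against a local instance; the class name, dimensionality, and parameter values are illustrative assumptions, not tuned recommendations.

```python
import weaviate

client = weaviate.Client("http://localhost:8080")    # v3-style client, local instance

# Class with tunable HNSW parameters and optional PQ compression (illustrative values).
client.schema.create_class({
    "class": "DocumentChunk",
    "vectorizer": "none",                             # embeddings are supplied by the caller
    "vectorIndexConfig": {
        "maxConnections": 32,                         # HNSW "M"
        "efConstruction": 128,
        "ef": 128,                                    # query-time speed/accuracy knob
        "pq": {"enabled": True},                      # compress vectors to save RAM
    },
})

# Retrieve the 5 chunks nearest to a query embedding.
results = (
    client.query.get("DocumentChunk", ["text"])
    .with_near_vector({"vector": [0.1] * 1536})
    .with_limit(5)
    .do()
)
```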
Milvus
Retrieval Speed: Milvus (an open-source DB by Zilliz) supports multiple index types (HNSW, IVF, PQ, etc.). Its query speed can vary depending on the index chosen. At one extreme, Milvus can do brute-force (exact) search very quickly using optimized BLAS, but that doesn’t scale past small datasets. For ANN, if using HNSW, its query performance is comparable to other HNSW-based systems. However, one benchmark showed Milvus lagging in search throughput for high-dimensional data: e.g. ~219 QPS with ~393ms latency on 1M 1536-dim embeddings (with HNSW parameters M=16, ef=128). This suggests default configurations may not be tuned for latency. On the other hand, Milvus was extremely fast at indexing new data – it built an index 10× faster than some competitors. In summary, Milvus can retrieve quickly, but achieving top-tier query latency may require careful index selection and tuning.
Storage Efficiency: A strength of Milvus is flexibility in index storage. It can use quantized indexes (IVF-PQ, SQ) to greatly reduce memory usage for embeddings. For example, IVF with Product Quantization compresses vectors into small codes, dramatically saving space at some cost to accuracy (Survey of Vector Database Management Systems). Milvus also offers a disk-based index (SPANN/DiskANN) for very large datasets, storing vectors on SSD while keeping only graphs or centroids in RAM. These options make Milvus highly efficient in storage – users can opt for an IVF-PQ index with lower memory and moderate recall, or HNSW for higher memory and recall. The ability to mix and match indexes means Milvus can be tailored to available hardware resources.
Scalability: Milvus is built with a distributed architecture (Milvus 2.x) – it uses a cluster of components (query nodes, index nodes, etcd, etc.) to manage large workloads. It natively supports sharding and replicas, enabling it to scale to billions of vectors across multiple machines. Many large-scale vector search deployments (in 2024) use Milvus clusters in production. Distributed search is a core feature: the query is broadcast to all shards and partial results are aggregated. This allows Milvus to maintain throughput as data grows. In short, Milvus handles scalability well, albeit with higher operational complexity since it’s self-hosted.
Query Accuracy: Milvus can achieve high accuracy depending on index type. With HNSW or a fine-grained IVF (a large number of centroids plus residual PQ), Milvus can return ~99% recall of nearest neighbors. Its default HNSW settings in one test reached 0.99 precision. However, if using heavy compression (e.g. aggressive PQ), accuracy will drop. Research indicates graph-based approaches (like HNSW) generally surpass quantization-based methods (IVF-PQ) in recall at the cost of more memory (HERE). Thus, for mission-critical accuracy, Milvus users might prefer HNSW or high-precision IVF settings. Milvus gives the user control to pick that accuracy/speed trade-off as needed.
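To make the index trade-off concrete, here is a hedged sketch using the pymilvus Collection API against a local Milvus instance; the collection name, dimensionality, and index parameters are illustrative assumptions, with the higher-recall HNSW alternative shown in a comment.

```python
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

connections.connect(host="localhost", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1536),
]
collection = Collection("rag_chunks", CollectionSchema(fields))

# Memory-frugal option: IVF_PQ trades some recall for a much smaller index.
collection.create_index(
    field_name="embedding",
    index_params={
        "index_type": "IVF_PQ",
        "metric_type": "IP",
        "params": {"nlist": 1024, "m": 64, "nbits": 8},   # m must divide the vector dim
    },
)
# Higher-recall, higher-memory alternative:
# index_params={"index_type": "HNSW", "metric_type": "IP",
#               "params": {"M": 16, "efConstruction": 200}}

collection.load()
results = collection.search(
    data=[[0.1] * 1536],
    anns_field="embedding",
    param={"metric_type": "IP", "params": {"nprobe": 32}},
    limit=5,
)
```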
Qdrant
Retrieval Speed: Qdrant (open-source, in Rust) has distinguished itself with excellent speed. Recent benchmarks (2024) show Qdrant achieving the highest throughput and lowest query latencies among vector DBs in many scenarios (Vector Database Benchmarks - Qdrant). For example, on a 1M dataset (1536-dim embeddings), Qdrant handled ~1,238 queries/sec with ~3.5ms average latency, while maintaining 99% recall. This was the top performance, outperforming similar HNSW-based systems. Qdrant’s efficiency is attributed to its Rust optimizations and data structures. In summary, Qdrant offers state-of-the-art retrieval speed, making it ideal for latency-sensitive RAG applications.
Storage Efficiency: Qdrant uses an HNSW index in memory by default, so its baseline memory usage is comparable to Weaviate or other HNSW implementations. However, the Qdrant team has incorporated techniques like binary vector compression and optimized I/O to improve storage efficiency. While full memory-versus-accuracy benchmarks are still in progress (they indicated a memory consumption benchmark “coming soon”), Qdrant is actively adding support for on-disk indexes and quantization. This means Qdrant can trade some accuracy for a smaller footprint when needed. For now, with default settings, expect memory usage proportional to dataset size (plus HNSW overhead), which is fine up to many millions of vectors but could be heavy at billion scale without compression.
Scalability: Initially, Qdrant was single-node, but it now offers a distributed (cluster) mode to scale out across multiple nodes (released in late 2024). This allows sharding the vector data and parallelizing searches, similar to other distributed VDBMSs. Qdrant’s design, being cloud-native (they also offer a managed Qdrant Cloud), focuses on horizontal scalability while keeping latency low. Early indications are that Qdrant’s cluster mode preserves its speed advantage even as data grows. Additionally, Qdrant integrates well with ecosystem tools (like Azure Cognitive Search using Qdrant under the hood for vector queries (When Large Language Models Meet Vector Databases: A Survey)), showing it can handle enterprise-scale workloads.
Query Accuracy: Qdrant’s HNSW index ensures high recall. In tests it achieved 99% precision (essentially near-exact results) while still being the fastest. It supports tuning search parameters (e.g. ef at query time) to adjust accuracy. By default, Qdrant appears to target very high recall, which is beneficial for RAG (we want the correct supporting chunks). There is no notable accuracy penalty for using Qdrant’s ANN – like others, it can retrieve with “high accuracy” comparable to exact search (HERE). Overall, Qdrant reliably returns relevant neighbors, and its accuracy remains on par with the best vector databases.
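A brief sketch with the qdrant-client Python library against a local instance; the collection name, dimensionality, and the hnsw_ef value are illustrative assumptions.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, SearchParams, VectorParams

client = QdrantClient(url="http://localhost:6333")    # local instance

client.create_collection(
    collection_name="rag_chunks",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

client.upsert(
    collection_name="rag_chunks",
    points=[PointStruct(id=0, vector=[0.1] * 1536, payload={"text": "example passage"})],
)

# hnsw_ef raises recall at some cost in latency; Qdrant's defaults already target high recall.
hits = client.search(
    collection_name="rag_chunks",
    query_vector=[0.1] * 1536,
    limit=5,
    search_params=SearchParams(hnsw_ef=128),
)
```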
Chroma
Retrieval Speed: Chroma is an open-source vector store often used in lightweight RAG setups (especially with LangChain). It is designed for simplicity and runs locally in a Python environment. Chroma’s ANN layer is built on hnswlib, so its retrieval speed on a single machine is decent – it can perform ANN searches in a few milliseconds for moderate dataset sizes. However, its server and API layer are Python-based, so extremely high throughput could be limited by the GIL and API overhead. Chroma is sufficient for prototyping or small-scale use (e.g. thousands to low millions of vectors), delivering interactive speeds, but it may not match the optimized C++/Rust systems under very heavy load.
Storage Efficiency: By default, Chroma persists embeddings and metadata in SQLite (DuckDB/Parquet in older versions) and keeps its hnswlib index in memory. It does not apply vector compression out of the box – embeddings are kept at full precision, which means higher memory usage per vector (e.g. a 1536-dim float vector ≈ 6 KB). For many applications this is fine, but at larger scales memory can become a bottleneck. Chroma’s simplicity trades off some efficiency; it does not yet have built-in distributed storage or automatic vector compression. Users looking to save space might need to compress or reduce the dimensionality of embeddings before insertion.
Scalability: Chroma is a single-node system – it is not designed to be distributed across servers (Survey of Vector Database Management Systems). It works great on a personal machine or a single server, but it cannot natively shard data across multiple machines. This limits its scalability to the constraints of one machine’s RAM and disk. In practice, Chroma is popular for managing small to mid-size corpora in RAG (e.g., a few hundred thousand chunks), but for very large document collections (tens of millions of chunks) one would have to move to a more scalable solution or run multiple manually partitioned Chroma instances.
Query Accuracy: Chroma leverages hnswlib for similarity search, so it can achieve high accuracy. Its default HNSW index yields >99% recall with typical settings (and small collections can simply be searched exhaustively for exact results), at the cost of keeping the index in memory. Thus, accuracy is usually not a concern – Chroma returns near-exact nearest neighbors, and its HNSW parameters can be tuned to balance accuracy and speed, just as with other systems. The main limitation is not accuracy but rather performance at scale.
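For completeness, a minimal sketch with the chromadb Python package; the collection name and toy embeddings are placeholders, and in practice the vectors would come from a real embedding model.

```python
import chromadb

client = chromadb.Client()                 # in-memory; use PersistentClient(path=...) for disk

collection = client.create_collection(name="rag_chunks")

# Store pre-computed embeddings alongside the raw chunk text.
collection.add(
    ids=["chunk-0", "chunk-1"],
    embeddings=[[0.1] * 1536, [0.2] * 1536],
    documents=["Paris is the capital of France.", "The Nile is the longest river in Africa."],
)

# Retrieve the chunks most similar to a query embedding.
results = collection.query(query_embeddings=[[0.1] * 1536], n_results=2)
```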
Comparative Insights and Recommendations
Retrieval Speed: If fast query processing is the top priority, Qdrant stands out as the leading choice, with benchmarks showing it outperforming other solutions in latency and throughput (Vector Database Benchmarks - Qdrant). Its Rust-based engine delivers consistently low query times even with million-scale data. Weaviate and Pinecone are also proven low-latency performers (Weaviate via HNSW, Pinecone via its proprietary indexes), suitable for real-time applications (HERE). Milvus can be fast, but may require tuning to reach the same level. For smaller-scale or development use, Chroma is usually “fast enough,” but for production at scale, a highly optimized engine like Qdrant or Weaviate is recommended.
Storage Efficiency: When memory or disk footprint is the main concern, consider solutions that support vector compression. Milvus offers IVF and PQ indexes to drastically cut down storage needs, making it ideal for very large corpora on limited hardware. Weaviate’s support for PQ-compressed vectors is another advantage if you need to save RAM (Survey of Vector Database Management Systems). If using Qdrant, look into its emerging compression features (e.g. binary quantization) or run it on hardware with fast SSDs to supplement RAM. Pinecone manages storage for you and likely uses its own optimizations, but you may incur costs for large datasets. In scenarios where storage efficiency outweighs raw accuracy, using Milvus with a compressed index (IVF-PQ) is a strong option – it will sacrifice a bit of recall but use significantly less memory (HERE).
Scalability: For massive-scale deployments, Pinecone is often the top recommendation due to its effortless scaling and managed infrastructure – you can index billions of vectors and let Pinecone handle the distribution. Among open-source systems, Milvus and Weaviate have proven distributed modes capable of handling very large data if you have the DevOps resources to manage a cluster. Qdrant’s newer clustering is promising for scale-out as well. If your use case involves web-scale data or high-availability requirements, a distributed vector DB (Pinecone, or a self-hosted Milvus/Weaviate cluster) is the way to go. For smaller-scale (single-node) needs, Chroma or a single instance of Qdrant/Weaviate is simpler and will work just fine – don’t over-engineer scaling if you don’t need it.
Query Accuracy: All modern vector databases can be tuned to achieve high recall. If precision of retrieval is paramount (e.g. in domains where missing a relevant document is unacceptable), consider using HNSW-based systems like Qdrant or Weaviate, which tend to preserve semantic relationships and yield very high recall by default (HERE). In fact, Qdrant and Weaviate both reached ~99% recall in evaluations (Vector Database Benchmarks - Qdrant), meaning their ANN results were almost identical to exact search. Milvus can also attain high accuracy; just avoid overly aggressive compression if recall is critical. When maximum accuracy is needed, you can configure any of these systems with conservative ANN settings (or even brute-force search for smaller data) at the cost of some speed. In summary, for most RAG workflows, the slight differences in accuracy between top vector DBs are negligible – all can return highly relevant chunks – so you might decide based on other factors. Only if you plan to heavily compress vectors to save space will accuracy drop a bit, in which case favor a system that allows hybrid retrieval (e.g. rerank results or adjust ANN parameters).
Conclusion
Choosing the “best” vector database depends on your priority: for sheer speed, Qdrant is a front-runner; for minimal storage use, Milvus (with compression) or Weaviate (with PQ) are excellent; for effortless massive scaling, Pinecone is compelling; and for balanced performance with open-source flexibility, Weaviate and Qdrant are great all-rounders. All these databases have been successfully used in 2024–2025 RAG pipelines to enable quick and accurate retrieval of document chunks (HERE). The research and benchmarks indicate that vector databases have matured to deliver millisecond-level retrieval, efficient indexing, horizontal scalability, and high recall, powering the next generation of LLM applications with relevant knowledge (Survey of Vector Database Management Systems). Future work will continue to refine these systems – improving consistency, hybrid query handling, and testing methodologies (Towards Reliable Vector Database Management Systems: A Software Testing Roadmap for 2030) – but even now, developers can pick a vector store that best fits their needs from a rich landscape of capable solutions.
Sources: Recent literature and benchmarks on vector databases and RAG (2024–2025) (Vector Database Benchmarks - Qdrant), including surveys from arXiv and VLDB that compare design and performance aspects (Survey of Vector Database Management Systems). Each of the databases discussed (Pinecone, Weaviate, Milvus, Qdrant, Chroma) is referenced in contemporary studies or official benchmarks to highlight their strengths and trade-offs. The recommendations above synthesize these findings to guide selection based on speed, memory, scale, and accuracy considerations.