Table of Contents
Introduction
How Vector Databases Work - Architecture and Indexing
Vector Index vs. Vector Database vs. Vector Plugin
Comparison of Key Vector Database Technologies
FAISS (Facebook AI Similarity Search)
Milvus
Weaviate
Pinecone
Recent Research and Trends (2024-2025)
Introduction
Large Language Models (LLMs) excel at generating text but struggle with up-to-date, domain-specific knowledge and can hallucinate facts. Retrieval-augmented generation (RAG) addresses this by feeding LLMs relevant context retrieved from an external knowledge base ([2402.01763] When Large Language Models Meet Vector Databases: A Survey). In practice, documents are digitized, split into manageable chunks of text, encoded into high-dimensional vectors (embeddings), and stored in a vector database. At query time, the user’s question is also embedded as a vector and used to retrieve the most similar document chunks from the vector store. These retrieved chunks (e.g. passages) are provided to the LLM to ground its answer in factual references. This pipeline leverages vector databases (VecDBs) as efficient semantic memory, mitigating LLM limitations like hallucination and outdated knowledge ([2402.01763] When Large Language Models Meet Vector Databases: A Survey). VecDBs offer an efficient way to store and manage the high-dimensional representations needed for semantic search and RAG ([2402.01763] When Large Language Models Meet Vector Databases: A Survey), and they have become integral to modern AI applications such as RAG-based QA systems, knowledge retrieval, and semantic search engines.
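To make the data flow concrete, here is a minimal sketch of the retrieve-then-prompt loop using FAISS as the vector store. The embed() helper and the toy chunks are stand-ins (a real system would call a sentence-embedding model with matching dimensionality), so read this as an illustration of the pipeline rather than a production recipe.

```python
import numpy as np
import faiss  # pip install faiss-cpu

DIM = 384  # assumed embedding dimensionality

def embed(texts):
    """Stand-in for a real embedding model; returns random unit vectors
    so the sketch stays runnable without downloading a model."""
    rng = np.random.default_rng(0)
    vecs = rng.standard_normal((len(texts), DIM)).astype("float32")
    faiss.normalize_L2(vecs)  # unit-normalize so inner product == cosine similarity
    return vecs

# 1. Documents are split into chunks (trivially short ones here).
chunks = ["Paris is the capital of France.",
          "The Louvre is a museum located in Paris."]

# 2. Chunks are embedded and stored in a vector index.
index = faiss.IndexFlatIP(DIM)  # exact inner-product search
index.add(embed(chunks))

# 3. The user question is embedded and the most similar chunks are retrieved.
question = "Which city is the capital of France?"
scores, ids = index.search(embed([question]), 2)
context = "\n".join(chunks[i] for i in ids[0])

# 4. Retrieved chunks are placed in the prompt to ground the LLM's answer.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```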
How Vector Databases Work - Architecture and Indexing
A vector database is a specialized data management system optimized for similarity search in high-dimensional vector spaces. Its core operation is k-nearest neighbor (kNN) search to find vectors most similar to a query vector, typically using cosine or dot-product similarity. The major challenge is that brute-force search scales poorly as data grows. High-dimensional vectors (often hundreds or thousands of dimensions) lack easy partitioning and require expensive distance computations ([2310.14021] Survey of Vector Database Management Systems). Thus, vector DBs rely on advanced indexing methods for Approximate Nearest Neighbor (ANN) search, which trade a tiny amount of accuracy for drastic speedups. Common ANN index approaches include:
Tree-based indexes: e.g. vantage-point or KD-trees, which partition space hierarchically. These work for lower dimensions but degrade as dimensionality grows ([2402.01763] When Large Language Models Meet Vector Databases: A Survey).
Hash-based indexes (LSH): Use locality-sensitive hashing schemes (e.g. random projections, SimHash) to bucket similar vectors. They offer sub-linear search but often require many hash tables to reach high recall ([2402.01763] When Large Language Models Meet Vector Databases: A Survey).
Quantization-based indexes: Use vector quantization to compress and cluster vectors. A prominent example is inverted file (IVF) with product quantization (PQ) ([2402.01763] When Large Language Models Meet Vector Databases: A Survey). Vectors are quantized into discrete codes, and search probes a few nearest cluster centroids (reducing candidates) then refines results with compressed codes ([2402.01763] When Large Language Models Meet Vector Databases: A Survey). This significantly cuts memory and accelerates search at some cost to recall.
Graph-based indexes: Build a proximity graph of vectors (each node links to its nearest neighbors). The Hierarchical Navigable Small World (HNSW) graph is the state of the art, enabling fast greedy search through the graph layers (Great Algorithms Are Not Enough | Pinecone). HNSW yields excellent recall at high throughput, and ANN benchmarks show a large performance advantage over brute-force search ([2402.01763] When Large Language Models Meet Vector Databases: A Survey). It is widely used across vector databases for its strong accuracy-speed balance ([2402.01763] When Large Language Models Meet Vector Databases: A Survey); a FAISS construction sketch follows this list. The downside is complex index construction and more challenging dynamic updates (e.g. deletions require graph maintenance) (Faiss indexes · facebookresearch/faiss Wiki · GitHub).
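To ground these categories, the sketch below builds a brute-force baseline, an IVF-PQ index, and an HNSW index over the same random vectors with FAISS; the parameter values (cluster count, code size, graph degree) are illustrative rather than tuned recommendations.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d, n = 128, 10_000
rng = np.random.default_rng(0)
xb = rng.standard_normal((n, d)).astype("float32")  # database vectors
xq = rng.standard_normal((5, d)).astype("float32")  # query vectors

# Brute-force baseline: exact, but cost grows linearly with the dataset.
flat = faiss.IndexFlatL2(d)
flat.add(xb)

# Quantization-based: inverted file (IVF) + product quantization (PQ).
quantizer = faiss.IndexFlatL2(d)
ivfpq = faiss.IndexIVFPQ(quantizer, d, 256, 16, 8)  # 256 clusters, 16 sub-quantizers x 8 bits
ivfpq.train(xb)    # learns the coarse centroids and PQ codebooks
ivfpq.add(xb)
ivfpq.nprobe = 8   # clusters probed per query: recall vs. speed knob

# Graph-based: HNSW.
hnsw = faiss.IndexHNSWFlat(d, 32)  # 32 = graph degree (M)
hnsw.hnsw.efSearch = 64            # search breadth: recall vs. speed knob
hnsw.add(xb)

for name, index in [("flat", flat), ("ivfpq", ivfpq), ("hnsw", hnsw)]:
    distances, ids = index.search(xq, 5)   # 5 nearest neighbors per query
    print(name, ids[0])
```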
A vector DB’s architecture often combines these indexes with additional system components to handle scale and data management. Many systems partition data into shards to distribute load across nodes, since there is no natural relational partition key for vectors ([2310.14021] Survey of Vector Database Management Systems). They use compression (PQ, PCA, etc.) to cope with large vector sizes ([2310.14021] Survey of Vector Database Management Systems). Some support hybrid queries that combine vector similarity with structured filters (e.g. date or category) ([2310.14021] Survey of Vector Database Management Systems). To enable this, the system may maintain auxiliary indexes for metadata or integrate vector and scalar search in query execution. For example, Weaviate stores both vectors and scalar attributes, allowing queries like “find articles on X in the last 7 days” by first retrieving by vector similarity and then filtering by date (Weaviate Properties Overview | Restackio). Advanced vector DBs also handle streaming data (inserts/deletes) with minimal downtime, using techniques like incremental index updates or background rebuilds. Supporting real-time updates is challenging for certain indexes (e.g. HNSW), but modern implementations provide workarounds (lazy deletions, rebuild triggers, etc.).
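Returning to hybrid queries, the filtering strategy matters: a system can post-filter (run the vector search first, then drop hits that fail the predicate) or pre-filter (restrict the candidate set by metadata before ranking). The toy NumPy sketch below, with brute-force dot products standing in for an ANN index, illustrates the difference.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 64)).astype("float32")  # stored embeddings
days_old = rng.integers(0, 30, size=1000)                    # scalar metadata per vector
query = rng.standard_normal(64).astype("float32")

scores = vectors @ query  # similarity scores (stand-in for an ANN index lookup)

# Post-filtering: retrieve a generous top-k, then apply the metadata predicate.
top = np.argsort(-scores)[:50]
post_filtered = [i for i in top if days_old[i] <= 7][:10]

# Pre-filtering: restrict candidates by metadata first, then rank only those.
candidates = np.where(days_old <= 7)[0]
pre_filtered = candidates[np.argsort(-scores[candidates])[:10]]

print(post_filtered, pre_filtered.tolist())
```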
In terms of storage, some vector DBs keep indexes in memory for speed, while others leverage on-disk indexes for billion-scale datasets. Recent research explores hybrid memory architectures (CPU, GPU, SSD). For example, FusionANNS (2024) proposes a multi-tier CPU/GPU cooperative index with SSD storage to achieve high throughput on billion-scale data using a single GPU ([2409.16576] FusionANNS: An Efficient CPU/GPU Cooperative Processing Architecture for Billion-scale Approximate Nearest Neighbor Search). Overall, the architecture of a vector database is a layered design: a data ingestion layer (for embedding and inserting vectors), an indexing layer (ANN structures for search), and a query execution layer (to combine vector scores with optional filters and ranking). By addressing key obstacles – high dimensionality, computational cost, lack of natural partitions, and hybrid query support – modern vector databases provide fast, scalable, and accurate semantic search on unstructured data ([2310.14021] Survey of Vector Database Management Systems).
Vector Index vs. Vector Database vs. Vector Plugin
It’s important to distinguish a vector index from a vector database. A vector index is the low-level data structure or algorithm that enables ANN search (such as an HNSW graph or IVF-PQ index) ([2402.01763] When Large Language Models Meet Vector Databases: A Survey). It can be seen as one component of the system, focused purely on retrieval efficiency. In contrast, a vector database is a full-fledged database management system built around vector data. A vector DB incorporates one or more indexing algorithms internally, but also provides features like data ingestion APIs, persistence, replication, scaling, security, and query interfaces (e.g. SQL/GraphQL or SDKs). Simply “bolting on” a vector index to an existing DB does not automatically yield a robust vector database (Great Algorithms Are Not Enough | Pinecone). As Pinecone’s engineers note, an existing non-vector DB with a sidecar ANN index may struggle with the memory, compute, and scaling requirements of AI workloads. A true vector DB is purpose-built to meet those needs, often designed for low-latency, high-recall search at scale, live index updates, and easy operations.
Meanwhile, a vector plugin refers to an integration layer that connects LLMs or other applications to a vector database. For example, OpenAI’s ChatGPT Retrieval Plugin is middleware that takes user-provided documents, chunks them, computes embeddings, and stores them in a vector DB, exposing endpoints for query and upsert (ChatGPT Retrieval Plugin - Traffine I/O). The plugin itself doesn’t store data long-term; it relies on a chosen backend (Milvus, Pinecone, etc.) for the actual vector index and database functionality (ChatGPT plugins - OpenAI). In essence, the plugin provides a standardized API and tooling so that an LLM (like ChatGPT) can query the vector database for relevant context. This separation of concerns allows developers to swap out vector DB backends or support multiple databases through the same plugin interface. In summary, the vector index is the algorithmic engine, the vector database is the complete system managing vector data at scale, and the vector plugin is an integration interface enabling external services (like LLMs) to leverage the vector database in applications like RAG.
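For illustration, here is how an application might call a self-hosted retrieval-plugin instance over HTTP. The endpoint path, bearer-token auth, and request/response shapes follow the ChatGPT Retrieval Plugin’s documented /query interface as far as we can tell; treat the field names as assumptions and check the plugin repository for the authoritative schema.

```python
import requests

PLUGIN_URL = "http://localhost:8000"        # assumed local plugin deployment
BEARER_TOKEN = "your-plugin-bearer-token"   # assumed auth token configured for the plugin

response = requests.post(
    f"{PLUGIN_URL}/query",
    headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
    json={"queries": [{"query": "What is our refund policy?", "top_k": 3}]},
)
response.raise_for_status()

# The plugin forwards the query to its configured vector DB backend and returns
# matching chunks (text, metadata, score) that can be pasted into the LLM prompt.
for query_result in response.json()["results"]:
    for chunk in query_result["results"]:
        print(chunk["score"], chunk["text"][:80])
```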
Comparison of Key Vector Database Technologies
FAISS (Facebook AI Similarity Search)
FAISS is an open-source library (C++ with Python bindings) for efficient vector similarity search, originally from Facebook AI Research. It provides a suite of ANN index types (flat brute-force, IVFFlat/IVFPQ for product quantization, HNSW, etc.) and is highly optimized for CPU and GPU execution (Vector Databases in Modern AI Applications). FAISS was one of the first libraries to enable billion-scale vector search on a single machine by leveraging GPUs for massive parallelism. Its strength lies in raw performance and flexibility: developers can choose index types and parameters to balance speed vs. accuracy, and even combine techniques (e.g. using HNSW as the coarse quantizer of an IVF index). FAISS supports batching and can compute results with very high throughput. However, FAISS is not a standalone database service: it is essentially a library. It lacks built-in networking, user management, or distribution across nodes. Using FAISS typically means embedding it in your application or another system. For instance, Milvus v1.0 was built on FAISS as its indexing layer (Milvus: A Purpose-Built Vector Data Management System). The downside is that managing dynamic data can be non-trivial; some FAISS indexes don’t support deletion or incremental updates easily (requiring index rebuilds). FAISS is ideal when you need fast in-memory vector search and you are handling persistence and scaling at the application level. It remains a popular choice to power custom semantic search pipelines and is often the baseline for ANN performance comparisons.
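Because FAISS is a library, the application owns the index lifecycle, including persistence. A minimal sketch of that pattern (the index type, file name, and random data are illustrative):

```python
import numpy as np
import faiss  # pip install faiss-cpu (a separate faiss-gpu build adds GPU support)

d = 128
xb = np.random.default_rng(0).standard_normal((10_000, d)).astype("float32")

# The application chooses the index type and owns its lifecycle.
index = faiss.IndexHNSWFlat(d, 32)  # HNSW graph over raw vectors
index.add(xb)

# Persistence is the application's job: FAISS just serializes to a file.
faiss.write_index(index, "chunks.faiss")
index = faiss.read_index("chunks.faiss")

distances, ids = index.search(xb[:2], 5)  # 5 nearest neighbors for two queries
# Note: with the GPU build, flat/IVF-style indexes can be moved to a GPU,
# while the HNSW index type runs on CPU.
```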
Milvus
Milvus is an open-source vector database designed from the ground up to manage large-scale embedding data. It emerged from the need to handle not only similarity search but also the data management lifecycle (ingestion, updates, filtering, etc.) for AI applications. Milvus 1.0 (SIGMOD 2021) introduced a purpose-built engine using FAISS and other ANN libraries under the hood, adding a gRPC service layer and management features. It supported real-time insertion and deletion of vectors and provided a SQL-like interface. Milvus 2.0 (code-named “Manu”, VLDB 2022) re-architected the system to be cloud-native and distributed across nodes. It uses a cluster of services (coordination via etcd, data nodes, query nodes, index nodes) to enable horizontal scalability and high availability. A key strength of Milvus is its support for dynamic data and hybrid queries: it can ingest streaming data (e.g. millions of new embeddings) while concurrently serving searches, and it allows filtering by metadata fields and even multi-vector queries (where an entity is represented by multiple vectors). For example, a query can ask for “images similar to X and labeled ‘cat’” – Milvus can first apply the label filter and then run the vector search within that subset. It achieves this by storing scalar attributes and coordinating between a vector index and an inverted index for filtering. Milvus supports various index types (HNSW, IVF, etc., some via plugins) and can even utilize GPUs for indexing and search. Its performance benefits from optimizations like minimizing CPU cache misses when scanning vectors. Milvus is known for handling billion-scale data by sharding across nodes and using disk storage for older data if needed. The trade-off is deployment complexity – running a Milvus cluster involves multiple services (though Docker Compose and Kubernetes Helm charts exist). Milvus is well-suited for enterprises needing an open-source, scalable vector DB that integrates with existing data pipelines (it has clients in Python, Java, etc. and can be integrated with LLM frameworks).
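As a sketch of the filtered search described above, the snippet below uses the pymilvus ORM-style client against a local Milvus instance. The schema, index parameters, and the label == "cat" filter expression are illustrative, and the exact API surface varies between pymilvus versions.

```python
import numpy as np
from pymilvus import (connections, Collection, CollectionSchema,
                      FieldSchema, DataType)

connections.connect(host="localhost", port="19530")  # assumes a running Milvus instance

dim = 128
schema = CollectionSchema([
    FieldSchema("id", DataType.INT64, is_primary=True),
    FieldSchema("label", DataType.VARCHAR, max_length=64),
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=dim),
])
collection = Collection("images", schema)

# Insert entities (embeddings would come from a real model, not an RNG).
rng = np.random.default_rng(0)
ids = list(range(1000))
labels = ["cat" if i % 2 == 0 else "dog" for i in ids]
vectors = rng.standard_normal((1000, dim)).astype("float32").tolist()
collection.insert([ids, labels, vectors])

collection.create_index("embedding", {"index_type": "HNSW", "metric_type": "L2",
                                      "params": {"M": 16, "efConstruction": 200}})
collection.load()

# Hybrid query: the scalar filter narrows candidates, the vector search ranks them.
results = collection.search(
    data=[rng.standard_normal(dim).astype("float32").tolist()],
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"ef": 64}},
    limit=5,
    expr='label == "cat"',
    output_fields=["label"],
)
```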
Weaviate
Weaviate is another prominent open-source vector database, implemented in Go, with a strong focus on combining unstructured and structured data. Weaviate represents data as objects that can have both a vector embedding and additional properties (fields). Its default indexing method is a customized HNSW index that supports full CRUD (inserts, updates, deletes) (Vector Indexing - Weaviate). Weaviate’s standout feature is native hybrid search: you can query by vector similarity, by keyword (BM25 full-text search), or a combination of both. For instance, it can find documents semantically similar to a query that also satisfy a structured filter (e.g. a time range or category) in a single query (Weaviate Properties Overview | Restackio). Under the hood, it maintains both a vector index and a per-shard keyword index to support such queries. Weaviate is designed for horizontal scalability via sharding: data is partitioned into classes and shards which can be distributed across nodes, allowing the index to scale beyond the memory of a single machine. This sharding is crucial since HNSW graphs can become memory-hungry for very large datasets; Weaviate mitigates that by splitting data. It also provides consistency and replication controls for fault tolerance. In terms of performance, Weaviate claims sub-100ms query latency even for complex (vector + filter) queries, and it can handle high query volumes by scaling out. Integration-wise, Weaviate offers GraphQL and REST APIs, and has modules that can automatically vectorize data using pre-trained models (for text, images, etc.), making it convenient to set up end-to-end. A possible drawback is that, as an all-in-one system, it requires running the server (or using the managed cloud service) and tuning HNSW parameters for optimal trade-offs. But its ease of use, rich feature set, and strong community support (including LangChain integration) make it a popular choice for semantic search and RAG, especially when semantic similarity must be combined with symbolic filters for more precise results.
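A minimal sketch of such a filtered vector query using the Weaviate Python client’s v3 GraphQL-builder interface (the class name, properties, and where filter are illustrative; the newer v4 client exposes an equivalent collections API):

```python
import weaviate  # pip install "weaviate-client<4" for the v3 interface shown here

client = weaviate.Client("http://localhost:8080")  # assumes a local Weaviate instance

query_vector = [0.1] * 384  # stand-in for a real query embedding

result = (
    client.query
    .get("Article", ["title", "publishedAt"])
    .with_near_vector({"vector": query_vector})  # vector similarity
    .with_where({                                # structured filter in the same query
        "path": ["category"],
        "operator": "Equal",
        "valueText": "science",
    })
    .with_limit(5)
    .do()
)
print(result["data"]["Get"]["Article"])
```

For blended keyword + vector ranking, the same v3 builder also exposes a with_hybrid(query=..., alpha=...) method that mixes BM25 and vector scores.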
Pinecone
Pinecone is a fully managed vector database service, notable for abstracting away all infrastructure and index management. Unlike the open-source options, Pinecone is a proprietary SaaS – users access it via an API while Pinecone handles the backend. Pinecone’s design philosophy centers on production readiness: it was built on the idea that great algorithms alone aren’t enough without operational excellence (Great Algorithms Are Not Enough | Pinecone). Pinecone emphasizes ease of use, flexibility, and performance at scale as its core tenets. In practice, this means developers can get started by simply creating an index through the API, upserting vectors, and querying, without worrying about index types or memory allocation. Pinecone automatically indexes the data using its internal algorithms (which include graph-based methods – Pinecone has hinted at its own optimized graph index on a purpose-built architecture). It handles scaling behind the scenes: as your dataset grows or query load increases, Pinecone can distribute the index and balance queries (the details are hidden, but likely involve sharding and replicas in their cloud). Data persistence, replication, and uptime are managed for you. One of Pinecone’s strengths is fast data refresh and consistency – inserted vectors become searchable within seconds, enabling near-real-time applications. It also supports metadata filtering in queries, though heavy filtering may have performance implications. In terms of accuracy and speed, Pinecone’s indexes can be tuned indirectly via “pods” and index configurations that trade off cost vs. recall. For many standard use cases, Pinecone achieves high recall with low latency out of the box. A 2024 benchmark by Timescale found that a specialized Postgres setup with pgvector could rival Pinecone’s latency at 99% recall (Pgvector vs. Pinecone: Vector Database Comparison | Timescale), a reminder that Pinecone targets a high-recall sweet spot and that self-hosted solutions can be competitive depending on the recall/speed target. Integration with LLMs is straightforward: Pinecone has well-documented Python/JavaScript client libraries and is supported by frameworks like LangChain, making it easy to plug into RAG pipelines. The main drawbacks are cost and vendor lock-in – you pay for the managed convenience and rely on Pinecone’s closed infrastructure. Nevertheless, Pinecone is widely adopted in industry for production AI systems due to its robustness (no ops burden) and its ability to handle “AI-scale” workloads without significant performance tuning.
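A minimal sketch with the Pinecone Python client (v3-style Pinecone class); the index name, dimension, serverless region, and metadata filter are illustrative and depend on your account setup:

```python
from pinecone import Pinecone, ServerlessSpec  # pip install pinecone-client

pc = Pinecone(api_key="YOUR_API_KEY")

# One-time index creation (skip if the index already exists);
# the dimension must match your embedding model.
pc.create_index(name="docs", dimension=384, metric="cosine",
                spec=ServerlessSpec(cloud="aws", region="us-east-1"))

index = pc.Index("docs")

# Upsert vectors with optional metadata; they become queryable within seconds.
index.upsert(vectors=[
    {"id": "chunk-1", "values": [0.01] * 384, "metadata": {"category": "news"}},
    {"id": "chunk-2", "values": [0.02] * 384, "metadata": {"category": "science"}},
])

# Query by vector similarity with a metadata filter.
results = index.query(vector=[0.01] * 384, top_k=3,
                      filter={"category": {"$eq": "news"}},
                      include_metadata=True)
print(results)
```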
Recent Research and Trends (2024-2025)
The vector database field is evolving rapidly, with research in 2024 and 2025 focused on pushing the boundaries of performance, scalability, and intelligent retrieval. On the indexing front, researchers are exploring learned index structures and adaptive algorithms. For example, new methods like LoRANN (NeurIPS 2024) apply low-rank matrix factorization to ANN search (Vector Databases in Modern AI Applications), and other works study balanced clustering and graph optimizations to improve recall/cost trade-offs. Hardware-aware indexes are a major theme: techniques for GPU acceleration, cache-optimized search, and SSD-based indices (e.g. DiskANN, SPANN) are being refined to handle billion-scale data efficiently ([2409.16576] FusionANNS: An Efficient CPU/GPU Cooperative Processing Architecture for Billion-scale Approximate Nearest Neighbor Search). Another active area is hybrid search and filtering – ensuring that adding metadata filters or range queries doesn’t drastically slow down vector search. Approaches like iRangeGraph (2024) extend HNSW-style graphs to handle numeric range constraints alongside similarity search (iRangeGraph: Improvising Range-dedicated Graphs for Range-filtering ...). Moreover, the synergy between LLMs and vector DBs is spurring new ideas. One notable direction is optimizing how documents are chunked for better retrieval. A recent study proposed Mix-of-Granularity, where the chunk size is dynamically chosen per query (small snippets vs. larger passages) via a trained router, improving RAG accuracy by capturing the most relevant context granularity ([2406.00456] Mix-of-Granularity: Optimize the Chunking Granularity for Retrieval-Augmented Generation). This shows that the interface between how we split and index data and how the LLM consumes it is being actively researched.
Comprehensive surveys in 2024 have also catalogued these developments. Jing et al. (2024) survey the intersection of LLMs and vector databases, concluding that tightly integrating VecDBs addresses LLM challenges and forecasting future research on better LLM-VecDB co-design ([2402.01763] When Large Language Models Meet Vector Databases: A Survey). Pan et al. (VLDBJ 2024) survey over 20 recent vector databases, identifying common obstacles and design techniques across systems ([2310.14021] Survey of Vector Database Management Systems). They emphasize that managing vector data at scale requires innovations in storage (quantization, compression), indexing (from randomization to navigable small-world graphs), and query optimization (new operators for hybrid queries and hardware utilization) ([2310.14021] Survey of Vector Database Management Systems). In summary, the latest research underscores that vector databases are a critical piece of AI infrastructure. We can expect continued improvements in their indexing algorithms, closer integration with large models, and more intelligent retrieval methods – all geared toward making knowledge retrieval faster, more accurate, and seamlessly scalable in the era of ever-larger LLMs and ever-growing unstructured data.