Table of Contents
Product Quantization (PQ) Indexing Method
Introduction
PQ Indexing Implementation and Trade-offs
Advancements in PQ Variants (2024-2025)
PQ vs. HNSW and IVF for LLM Retrieval
Benchmarks and Recent Results (2024-2025)
Introduction
Vector search is a key component of Retrieval-Augmented Generation (RAG) systems for LLMs, enabling efficient lookup of relevant document embeddings. As these systems scale to millions or billions of high-dimensional embeddings (often 512–1536 dimensions), memory and speed become critical concerns. Product Quantization (PQ) is a common compression-based indexing technique that addresses this by storing vectors in a compact coded form (4bit-Quantization in Vector-Embedding for RAG). In essence, PQ trades off a small amount of accuracy for major gains in memory footprint and search speed, making it attractive for large-scale LLM document retrieval (Accelerating Vector Search: NVIDIA cuVS IVF-PQ Part 1, Deep Dive | NVIDIA Technical Blog). In the following, we review how PQ indexing works and its trade-offs, recent PQ variant advancements (OPQ, residual PQ, and 2024–2025 innovations), and compare PQ-based indexes with other popular vector search methods like HNSW and IVF in the context of LLM retrieval. We also highlight benchmarks from the latest research to quantify these trade-offs.
PQ Indexing Implementation and Trade-offs
How PQ Works: Product quantization compresses high-dimensional vectors by splitting each vector into multiple low-dimensional sub-vectors and quantizing each sub-vector independently. For each sub-space (e.g. a 16-dimensional slice of a 768-D embedding), a codebook of K cluster centroids is pre-trained (usually via k-means on a sample of the data). Each sub-vector is then replaced by the ID of its nearest centroid, and concatenating these centroid IDs yields a compact code for the full vector. For example, using M sub-vectors with K=256 (1 byte per sub-vector) compresses a 768-D float vector (∼3 KB) down to M bytes – often a 95%+ size reduction. The centroids for each sub-space (the codebooks) are kept in memory to enable distance computations. At query time, the search uses the PQ codes to compute approximate distances – typically via asymmetric distance computation (ADC), in which each query sub-vector is compared against the centroids referenced by the stored codes.
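To make the mechanics concrete, here is a minimal NumPy/scikit-learn sketch of codebook training, encoding, and ADC search. The function names and defaults (M=8, K=256) are ours for illustration; production systems rely on optimized libraries such as FAISS rather than code like this.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_pq(train_vecs, M=8, K=256):
    """Train one k-means codebook per sub-space (assumes d is divisible by M)."""
    d = train_vecs.shape[1]
    ds = d // M  # dimensionality of each sub-vector
    codebooks = []
    for m in range(M):
        sub = train_vecs[:, m * ds:(m + 1) * ds]
        km = KMeans(n_clusters=K, n_init=4).fit(sub)
        codebooks.append(km.cluster_centers_)  # shape (K, ds)
    return codebooks

def pq_encode(vecs, codebooks):
    """Replace each sub-vector with the ID of its nearest centroid (1 byte when K <= 256)."""
    M, ds = len(codebooks), codebooks[0].shape[1]
    codes = np.empty((len(vecs), M), dtype=np.uint8)
    for m, cb in enumerate(codebooks):
        sub = vecs[:, m * ds:(m + 1) * ds]
        dists = ((sub[:, None, :] - cb[None, :, :]) ** 2).sum(-1)  # (N, K)
        codes[:, m] = dists.argmin(axis=1)
    return codes  # M bytes per vector

def adc_search(query, codes, codebooks, topk=10):
    """Asymmetric distance computation: the query stays in float, the database stays coded."""
    M, ds = len(codebooks), codebooks[0].shape[1]
    # Lookup table: squared distance from each query sub-vector to every centroid.
    lut = np.stack([((query[m * ds:(m + 1) * ds] - cb) ** 2).sum(-1)
                    for m, cb in enumerate(codebooks)])       # (M, K)
    approx_d2 = lut[np.arange(M), codes].sum(axis=1)          # table lookups, no decoding
    return np.argsort(approx_d2)[:topk]
```

The key property is that each query needs only an M×K lookup table, after which the approximate distance to any database vector costs M table lookups and additions, with no decoding of the codes.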
Trade-offs: PQ dramatically reduces memory and increases the cache-friendliness of search at the cost of some accuracy. It is a lossy compression – finer quantization (more sub-vectors or larger codebooks) preserves more accuracy but produces longer codes, whereas aggressive compression (fewer sub-vectors or smaller codebooks) saves memory but incurs more quantization error (4bit-Quantization in Vector-Embedding for RAG). In practice, this means there is a tunable balance between index size and recall. For instance, one study found that compressing 100k 1536-D embeddings with a strong PQ setting (32 sub-vectors, 256 centroids each) yielded an extremely compact index but retrieved less than 10% of the true top-10 neighbors found with the original exact vectors. Because PQ encodes vectors approximately, the nearest-neighbor search becomes approximate – a high-recall setting may require scanning more candidates or using hybrid re-ranking (which adds latency). Additionally, building a PQ index has an upfront cost: clustering each sub-space (often via k-means) can be computationally expensive for very large corpora, and the training sample needs to capture the data distribution well to avoid severe accuracy loss. Another implementation consideration is that PQ alone does not accelerate coarse candidate selection – it is usually combined with another index structure (like IVF) to avoid comparing a query against every vector’s code. Nonetheless, once candidates are chosen, distance calculations on compact codes are very fast and memory-light. In summary, PQ indexing’s main appeal is memory efficiency (storing compact codes instead of full vectors) and improved search speed on large datasets, at the expense of some retrieval accuracy and added index-training complexity (Accelerating Vector Search: NVIDIA cuVS IVF-PQ Part 1, Deep Dive | NVIDIA Technical Blog).
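To see how these knobs translate into index size, the snippet below builds FAISS IndexPQ indexes with a few sub-quantizer counts on synthetic data. The M values and random vectors are purely illustrative; real recall depends on the embedding distribution and the training sample.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 768
xb = np.random.rand(100_000, d).astype("float32")  # stand-in for document embeddings
xq = np.random.rand(1_000, d).astype("float32")

for M in (16, 32, 64):                       # more sub-vectors -> longer codes, lower error
    index = faiss.IndexPQ(d, M, 8)           # 8 bits per sub-quantizer (K = 256)
    index.train(xb)                          # k-means per sub-space; the costly offline step
    index.add(xb)
    D, I = index.search(xq, 10)
    print(f"M={M}: {M} bytes/vector vs {4 * d} bytes for the raw float32 vector")
```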
Advancements in PQ Variants (2024-2025)
Over the years, many PQ variants have been proposed to improve quantization accuracy or adapt to new scenarios. We highlight a few important ones, including recent methods from 2024–2025:
Optimized Product Quantization (OPQ): OPQ augments basic PQ by learning an optimal rotation (linear transform) of the vector space before quantization (A Detailed Guide on Indexing Algorithms in Vector Databases). By rotating the axes, OPQ can better balance variance across sub-vectors, which minimizes distortion. This results in less information loss and more “discriminative” codes than standard PQ for the same code length. OPQ is an offline preprocessing step (the rotation matrix is learned from training data) and is widely used (e.g. in Facebook’s FAISS) to boost PQ accuracy without changing the runtime or code length. It is particularly effective when certain dimensions are correlated or have different scales – the rotation spreads information more evenly so that each sub-quantizer captures important variance.
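In FAISS, OPQ is exposed as a pre-transform that can be prepended to a PQ index via the index_factory string syntax. The sketch below uses synthetic data and illustrative parameters, so read it as a configuration example rather than a benchmark.

```python
import numpy as np
import faiss

d = 768
xb = np.random.rand(200_000, d).astype("float32")
xq = np.random.rand(1_000, d).astype("float32")

# Both variants produce 16-byte codes; OPQ additionally learns a rotation during train().
for key in ("PQ16", "OPQ16,PQ16"):
    index = faiss.index_factory(d, key)
    index.train(xb)
    index.add(xb)
    D, I = index.search(xq, 10)
    print(key, "-> 16 bytes/vector; OPQ adds only a learned rotation applied to the query")
```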
Residual and Additive Quantization (RQ, AQ): Instead of quantizing sub-vectors independently, residual quantization (RQ) quantizes the vector in a multi-stage iterative process. The first codebook quantizes the original vector, then a second codebook quantizes the residual error (the difference between the original and the first reconstruction), and so on (HERE). By encoding residuals, RQ typically achieves higher fidelity than one-shot PQ for the same code size, since later codebooks correct the errors of earlier ones. However, classic RQ uses fixed codebooks per stage that do not adapt to earlier quantization choices, which can limit its efficiency. Additive quantization (AQ) is a related approach where the final vector approximation is a sum of multiple codewords (from multiple codebooks) rather than a concatenation of sub-vector codewords. AQ and RQ can achieve very high accuracy, but training them is more complex (greedy or joint optimization) and search is slower (multiple codewords to decode per vector). In practice, researchers sometimes combine PQ and RQ: for example, dividing the vector into blocks and applying RQ within each block (sometimes called residual product quantization). This hybrid yields a balance between parallelism and accuracy – a 2015 study introduced this idea, and it has been revisited in recent work. For instance, Niu et al. (2023) propose Residual Vector Product Quantization (RVPQ), which slices the vector into sub-parts like PQ and then quantizes each part’s residuals iteratively, rather than using one large residual quantizer. RVPQ’s jointly trained multi-level codebooks gave better accuracy than plain PQ on ANN benchmarks while keeping search efficient.
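A two-stage residual quantizer is simple to sketch: the second codebook is trained on the errors left by the first, and reconstruction sums one codeword per stage. The helpers below are a minimal NumPy/scikit-learn illustration with names of our choosing; real RQ/AQ implementations use more stages, beam-search encoding, and joint codebook refinement.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_rq(train_vecs, K=256, stages=2):
    """Train one full-dimensional codebook per stage on the running residual."""
    codebooks, residual = [], train_vecs.copy()
    for _ in range(stages):
        km = KMeans(n_clusters=K, n_init=4).fit(residual)
        codebooks.append(km.cluster_centers_)
        # Subtract this stage's reconstruction; the next stage models what is left.
        residual = residual - km.cluster_centers_[km.labels_]
    return codebooks

def rq_encode(vecs, codebooks):
    codes, residual = [], vecs.copy()
    for cb in codebooks:
        d2 = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        ids = d2.argmin(axis=1)
        codes.append(ids)
        residual = residual - cb[ids]
    return np.stack(codes, axis=1)        # one code per stage

def rq_decode(codes, codebooks):
    # Reconstruction is the SUM of one codeword per stage (unlike PQ's concatenation).
    return sum(cb[codes[:, i]] for i, cb in enumerate(codebooks))
```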
Neural and Differentiable PQ: A recent trend is to train quantization codebooks with neural networks or end-to-end optimization, rather than with vanilla k-means. Unsupervised neural quantization (UNQ) and Deep PQ (DeepQ) are methods that use gradient-based learning (often with a form of straight-through estimator or Gumbel-softmax) to learn codebook embeddings that minimize reconstruction error globally. These methods (e.g. Morozov & Babenko 2019 for UNQ; Zhu et al. 2023 for DeepQ) showed improvements over classic OPQ/PQ by “learning to quantize” with backpropagation. However, they can be tricky to train and may need careful initialization or multiple stages. The latest advance in this vein is QINCo (Residual Quantization with Implicit Neural Codebooks), introduced by Meta AI (Huijben et al., ICML 2024). QINCo is a neural residual quantization approach that conditions each codebook on the previously quantized portion of the vector. In other words, instead of using a fixed codebook at each residual step, QINCo trains a neural network that outputs adaptive codebooks based on what the earlier RQ steps have already encoded. This significantly reduces quantization error – QINCo outperforms prior state-of-the-art quantization methods by large margins. For example, with a 12-byte (96-bit) code budget per vector, QINCo achieved higher recall on standard ANN benchmarks than an existing method using 16-byte codes. In ablation studies, QINCo was shown to benefit from more training data (unlike k-means-based PQ, which saturates) and can be combined with an IVF index for further speed-ups. These results demonstrate that learned PQ schemes can narrow the gap to full-precision accuracy. Going into 2025, we are seeing “differentiable PQ” and other learned quantization techniques become more practical, suggesting that future LLM retrieval systems might train their vector indexes much as they train their models, to maximize retrieval quality.
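To convey only the conditioning idea behind implicit neural codebooks (this is not the QINCo architecture or its training recipe), the hypothetical PyTorch sketch below adapts a residual stage's codebook with a small network that looks at the reconstruction produced by earlier stages, so each sample effectively gets its own codebook.

```python
import torch
import torch.nn as nn

class ConditionalCodebook(nn.Module):
    """Illustrative only: one residual stage whose codebook depends on the partial reconstruction."""
    def __init__(self, dim, K=256, hidden=512):
        super().__init__()
        self.base = nn.Parameter(torch.randn(K, dim) * 0.01)   # stage's base codebook
        # Maps (base codeword, partial reconstruction) -> adapted codeword.
        self.adapt = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, partial_recon):
        # partial_recon: (B, dim); returns a per-sample codebook of shape (B, K, dim).
        B, K = partial_recon.shape[0], self.base.shape[0]
        base = self.base.unsqueeze(0).expand(B, K, -1)
        ctx = partial_recon.unsqueeze(1).expand(B, K, -1)
        return base + self.adapt(torch.cat([base, ctx], dim=-1))

def encode_step(x, partial_recon, stage):
    """Pick the adapted codeword closest to the current residual and update the reconstruction."""
    codebook = stage(partial_recon)                              # (B, K, dim)
    d2 = ((x - partial_recon).unsqueeze(1) - codebook).pow(2).sum(-1)
    ids = d2.argmin(dim=1)                                       # (B,) stage codes
    chosen = codebook[torch.arange(len(x)), ids]
    return ids, partial_recon + chosen
```

Training such a model end to end (minimizing reconstruction error through the discrete assignment) is exactly the difficulty that the neural quantization literature addresses with straight-through or alternating-update schemes.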
Other Notable Variants: Online Product Quantization (not to be confused with OPQ) has been explored to update codebooks on the fly for streaming data, using techniques like codebook refresh with learning and forgetting rates (A Detailed Guide on Indexing Algorithms in Vector Databases). This is useful if the embedding distribution shifts over time. There are also product quantization networks (PQN) that integrate quantization into neural network embeddings (for example, learning a PQ code as the output of an encoder). While these are active research areas, they are less common in LLM document retrieval so far. The main focus in 2024–2025 has been on improving the compression-rate-versus-accuracy trade-off – for instance, a very recent work proposes “quantizer dropout” to allow variable-bitrate PQ (encoding different vectors with a different number of residuals) (arXiv:2412.01762v1 [cs.CV] 2 Dec 2024). Overall, these advancements aim to make PQ more accurate and flexible while retaining efficiency. Techniques like OPQ and RQ are often combined with PQ in real systems (and are available in libraries like FAISS), and new neural approaches like QINCo show that significant accuracy gains are possible even at aggressive compression rates.
PQ vs. HNSW and IVF for LLM Retrieval
When building a vector index for document retrieval, one must choose an ANN method that balances speed, memory usage, and recall. Hierarchical Navigable Small World (HNSW) and Inverted File Index (IVF) based methods (with or without PQ) are among the most popular options. Here we compare them in the context of LLM-scale retrieval:
Hierarchical NSW (Graph Index): HNSW builds a multi-layer graph over all vectors, where edges link close neighbors. At query time, it performs a greedy graph traversal to find nearest neighbors quickly (4bit-Quantization in Vector-Embedding for RAG). HNSW is known for very high recall and low search latency at moderate scales, effectively achieving quality close to brute-force search. However, it is memory-intensive – it stores full vectors plus a connectivity graph. For large corpora (hundreds of millions of embeddings), HNSW can become impractical due to memory: e.g. one analysis showed that storing 1 billion 768-D embeddings with HNSW (including the graph) could require on the order of a terabyte of RAM (HERE). Even with 8-bit quantized vectors, the graph overhead is huge. HNSW index size also grows quickly with the dataset because each node keeps its own set of links (e.g. M=32 or 48 links per node), adding graph overhead on top of the raw vectors. This memory cost is the price of its speed and accuracy. For LLM document retrieval, HNSW is great for smaller knowledge bases or where memory is abundant, as it returns very accurate results quickly. But at billion scale, a pure in-memory HNSW “doesn’t scale well” without compression.
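For reference, a minimal in-memory HNSW setup in FAISS looks like the sketch below; the M, efConstruction, and efSearch values are illustrative, and note that IndexHNSWFlat keeps the full float32 vectors alongside the graph, which is exactly where the memory cost comes from.

```python
import numpy as np
import faiss

d = 768
xb = np.random.rand(500_000, d).astype("float32")
xq = np.random.rand(1_000, d).astype("float32")

index = faiss.IndexHNSWFlat(d, 32)     # 32 links per node, full vectors stored
index.hnsw.efConstruction = 200        # build-time beam width (index quality)
index.hnsw.efSearch = 64               # query-time beam width (recall vs. latency)
index.add(xb)                          # no train() step needed for HNSW-Flat
D, I = index.search(xq, 10)
# Rough memory: N * 4 * d bytes of raw vectors (~1.5 GB here) plus the per-node graph links.
```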
IVF (Inverted File) Index: IVF takes a coarse quantization approach: cluster the dataset into N coarse buckets (e.g. via k-means on the full vectors) and assign each vector to its nearest cluster centroid. The index then consists of the cluster centroids (stored in memory) plus lists of the vectors belonging to each cluster (which can live on disk or in memory). At query time, only the nProbe clusters nearest to the query are searched, drastically narrowing the search space (Towards Understanding Systems Trade-offs in Retrieval-Augmented Generation Model Inference). IVF by itself (sometimes called “IVF-Flat” when storing full vectors) saves computation by avoiding brute-force search, but it does not reduce memory unless combined with compression. Typically, IVF is paired with PQ: the vectors inside each cluster are stored as PQ codes, yielding the classic IVF-PQ approach (Accelerating Vector Search: NVIDIA cuVS IVF-PQ Part 1, Deep Dive | NVIDIA Technical Blog). This two-level indexing (coarse cluster, then fine quantization) is extremely scalable – it was the basis of Facebook’s billion-scale ANN search (FAISS) and is widely used in vector databases. In the context of LLM retrieval, IVF-PQ offers an appealing memory-versus-accuracy trade-off: PQ compression yields huge memory savings, and IVF ensures that only a small fraction of codes is decoded per query. The cost is that some recall is lost compared to HNSW or IVF-Flat (since both the coarse clustering and the PQ quantization introduce error). For example, a recent study characterized the trade-offs: an IVF-PQ index (with 16,384 clusters and 8-bit PQ) used 7.2× less memory than an HNSW index (with scalar-quantized vectors) for a 100M corpus, but reached only ~70% of HNSW’s recall. Increasing nProbe (searching more clusters) can raise IVF-PQ recall at the expense of latency. In practice, IVF-PQ can often be configured to hit 90–95% of the true recall if slightly more clusters are probed or a re-rank step is used. The big win is memory: by compressing a large embedding store, it can fit in GPU memory or on cheaper storage. NVIDIA reports that using IVF-PQ on a 1B-vector, 96-D dataset reduced the data size from 360 GiB (FP32) to ~54 GiB with minimal loss in accuracy (maintaining >95% recall), and even to 24 GiB with a moderate speed hit for the additional compression. Such compression is essential for LLM systems dealing with web-scale knowledge.
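The classic IVF-PQ composition is a few lines in FAISS; the nlist, M, and nprobe values below are illustrative rather than tuned recommendations, with nprobe being the main runtime knob for trading latency against recall.

```python
import numpy as np
import faiss

d = 768
xb = np.random.rand(200_000, d).astype("float32")   # stand-in corpus
xq = np.random.rand(1_000, d).astype("float32")

nlist, M, nbits = 4096, 32, 8                       # 4096 coarse clusters, 32-byte PQ codes
quantizer = faiss.IndexFlatL2(d)                    # coarse centroids stay uncompressed
index = faiss.IndexIVFPQ(quantizer, d, nlist, M, nbits)
index.train(xb)                                     # learns coarse centroids + PQ codebooks
index.add(xb)                                       # each vector stored as a 32-byte code in its list

index.nprobe = 16                                   # probe more clusters -> higher recall, more work
D, I = index.search(xq, 10)
```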
Accuracy and Speed Comparison: HNSW tends to win on raw recall – it can often return nearly exact neighbors given enough memory and compute, while IVF-PQ will miss some neighbors due to quantization. In one 2024 benchmark on a 100M dataset, HNSW achieved ~0.87 recall while IVF-PQ leveled off around 0.61 (with fixed parameters). However, IVF without PQ (or with lighter quantization such as scalar quantization) can close this gap; for instance, IVF with 8-bit scalar quantization achieved recall 0.86 in the same study. In terms of latency, HNSW performs well for single queries thanks to its graph search, but it may degrade when scaling to many concurrent queries or very large datasets (due to random memory accesses across a large graph). IVF-PQ, by contrast, benefits from sequential memory access patterns and can leverage GPUs effectively by processing many distance computations in parallel. For instance, the FAISS library on CPU schedules one thread per query for IVF-PQ, whereas GPU implementations (like RAFT IVF-PQ) can search thousands of queries in a batch. The aforementioned NVIDIA test showed that at large batch sizes (e.g. 10k queries), a GPU IVF-PQ could answer ~180k queries/sec at >95% recall, whereas a multi-threaded HNSW on CPU managed ~60k QPS. That illustrates how IVF-PQ shines in high-throughput settings. On CPU with smaller batches, HNSW can be faster for high-precision search (fewer distance calculations are needed), though if the PQ code length is short, distance computation is very fast too. Notably, the systems trade-off study found that at comparable recall, IVF-PQ had higher tail latencies and lower peak throughput than HNSW on CPU, likely because IVF-PQ needed to scan more points (due to lower recall per probe) to match HNSW’s accuracy. In summary, HNSW vs. IVF-PQ is a memory-versus-accuracy trade-off: HNSW uses a lot of memory to reach high recall easily, whereas IVF-PQ uses very little memory but needs careful tuning (and possibly more computation) to reach high recall.
Hybrid Approaches: Recent solutions try to get the best of both worlds. One approach is to use HNSW on top of PQ codes – e.g. build the HNSW graph over compressed vectors to reduce memory. This works, but graph search over PQ codes can be less accurate (since distances are approximate). Another approach is DiskANN, a system by Microsoft that uses an on-disk graph index with PQ-compressed vectors in RAM. DiskANN essentially stores an HNSW-like graph on SSD and keeps only a small PQ-compressed vector cache in memory (HERE). This allows billion-scale indices to run on a single machine with far less RAM. In one comparison, DiskANN provided comparable performance to in-memory HNSW while cutting memory usage by ~89%. The trade-off is slightly higher latency due to SSD access, but with fast NVMe drives and optimized search (beam search plus re-ranking), the impact is small. For LLM retrieval, DiskANN and similar hybrid indexes enable serving massive corpora (billions of embeddings) cost-effectively, where pure HNSW would be prohibitively expensive. Meanwhile, pure PQ approaches (like IVF-PQ) remain very relevant, especially with hardware acceleration: they offer a straightforward way to compress data and often the lowest-memory solution for a given recall target. Many production systems use a combination: IVF or HNSW for the coarse search, followed by PQ codes for the fine distances (or a re-rank using the original vectors if those can be stored elsewhere).
In practice, selecting an index for an LLM knowledge base depends on the requirements: if maximum accuracy is required and the dataset is moderate, HNSW is a good choice; if the dataset is huge and memory or cost is a concern, IVF-PQ or DiskANN are viable, achieving good recall with far lower resource usage (Towards Understanding Systems Trade-offs in Retrieval-Augmented Generation Model Inference). It’s also common to see hierarchical indexing (IVF) combined with HNSW (for example, first select clusters, then use a local HNSW within each cluster) – highlighting that these methods are complementary. The key is understanding the trade-offs: PQ indexing offers compactness, HNSW offers accuracy, and IVF offers scalability, and modern systems often mix and match these to meet the needs of large-scale LLM retrieval.
Benchmarks and Recent Results (2024-2025)
To quantify the discussion above, we summarize findings from the latest research (all 2024 or newer) on vector indexes for retrieval, focusing on PQ and its variants:
PQ Compression vs Accuracy: Zhang et al. 2025 evaluated PQ in a RAG setup with 100K embeddings. Using a strong compression (32 sub-vectors, codebook size 16 or 256), the top-10 retrieval overlap with the exact (floating-point) baseline was under 10%, highlighting that heavy PQ compression can drastically drop retrieval accuracy (4bit-Quantization in Vector-Embedding for RAG). In the same work, PQ-coded embeddings achieved only ~0.56–0.60 Pearson correlation with ground-truth semantic similarity scores (versus 0.85+ for 8-bit or 4-bit scalar quantization). This underscores the importance of tuning PQ parameters to avoid too much information loss.
HNSW vs IVF-PQ – Memory and Recall: Chandrasekaran et al. 2024 compared popular ANN indexes for RAG on a 100M dataset. An HNSW index (with 8-bit quantized vectors) attained ~87% recall but used large memory, whereas an IVF-PQ index (16k clusters, 256-byte PQ codes) used about 7.2× less memory yet maxed out around 61% recall (Towards Understanding Systems Trade-offs in Retrieval-Augmented Generation Model Inference). An intermediate approach, IVF with scalar quantization (IVF-SQ), reached ~86% recall with ~2.3× memory reduction vs HNSW. This benchmark vividly shows the trade-off continuum: more compression (IVF-PQ) yields huge space savings at a cost to accuracy, while mild compression (IVF-SQ) preserves accuracy closer to the graph-based method.
Latency/Throughput Trade-offs: The same study noted that at batch sizes above 128, the throughput of IVF-PQ leveled around 110 QPS, vs 319 QPS for HNSW in their CPU setting. IVF-PQ also had higher 95th-percentile latency (up to 2.2s in the worst case vs <0.9s for HNSW). This was attributed to IVF-PQ needing more exhaustive scanning to reach high recall. However, on GPU, IVF-PQ can excel. NVIDIA’s 2024 experiments with RAFT IVF-PQ showed that for 100M vectors at recall >95%, a GPU index could answer ~150k queries/sec (batch=10k) whereas a multi-threaded HNSW on CPU managed ~50k QPS. This demonstrates that the hardware and batch scenario can flip the performance outcome – PQ compression is extremely GPU-friendly, whereas graph search favors CPU caches and many threads.
New PQ Variants Performance: Advances like QINCo have pushed PQ accuracy closer to uncompressed vectors. In QINCo’s ICML 2024 paper, using a code size of just 12 bytes per vector, it achieved higher recall on benchmarks than prior methods did with 16 bytes (HERE). For instance, on the Deep1M and BigANN1M datasets, QINCo (12B code) slightly exceeded the recall that UNQ (an earlier neural quantizer) got with 16B codes. This is a remarkable 25% reduction in index size with no loss of accuracy, thanks to better codebook learning. Such improvements mean that future PQ indexes can be both compact and high-accuracy. Similarly, RVPQ (Niu et al. 2023) reported improved recall vs standard PQ for a given code length by leveraging residual quantization in subspaces. We expect upcoming benchmarks (e.g. ANN competitions in 2025) to showcase these smarter quantization schemes outperforming traditional PQ/OPQ in retrieval tasks.
Large-Scale Deployments: Real-world scale tests underscore PQ’s value. NVIDIA’s IVF-PQ index for the DEEP1B dataset compressed 1 billion 96-D vectors from 360 GiB to ~54 GiB – a 6–7× compression – with negligible impact on search accuracy (verified at >0.95 recall) (Accelerating Vector Search: NVIDIA cuVS IVF-PQ Part 1, Deep Dive | NVIDIA Technical Blog). Pushing further to 24 GiB (15× compression) caused about a 1.5× slowdown but was still feasible. This kind of compression enables fitting vast knowledge bases in memory (or even on a single GPU), which is crucial for RAG-based LLM applications. On the other hand, disk-based methods also show promise: Sella 2024 demonstrated that DiskANN (a graph + PQ hybrid) could handle a 50M-vector, 768-D dataset with an 89% reduction in DRAM usage vs HNSW, yet similar query throughput (HERE). These results indicate that memory-bound scaling issues can be solved by PQ and related innovations, allowing LLM systems to retrieve from much larger document collections without sacrificing too much performance.
In summary, the recent literature (2024–2025) paints a clear picture: Product quantization remains a linchpin for efficient vector search at scale, and ongoing research is actively improving its effectiveness. Practically, PQ indexing enables large LLM knowledge bases to be searched with reasonable resources, and newer variants (optimized, residual, neural PQ) are closing the accuracy gap. Meanwhile, alternatives like HNSW and IVF (with or without PQ) offer different sweet spots in the design space. System designers often combine these techniques to balance retrieval accuracy, speed, and cost for their specific LLM-driven applications (Towards Understanding Systems Trade-offs in Retrieval-Augmented Generation Model Inference). The benchmarks confirm there is no one-size-fits-all: if maximum precision is needed, a high-memory HNSW or IVF-Flat index may be worth it; if memory or scale is a bottleneck, PQ-based compression is indispensable. As LLM deployments grow, we anticipate further hybrid approaches and learned quantization methods will define the state-of-the-art in vector search for AI.
Sources: Recent research and technical reports from 2024–2025 have been cited to support this review, including arXiv preprints (4bit-Quantization in Vector-Embedding for RAG), conference papers (HERE), and industry benchmarks (Accelerating Vector Search: NVIDIA cuVS IVF-PQ Part 1, Deep Dive | NVIDIA Technical Blog). These provide the latest insights into PQ indexing and its role in LLM-scale retrieval.