Clustering techniques used in document digitization and chunking for LLMs, focusing on their role in reducing search space
Table of Contents
Clustering for Search Space Reduction in LLM Retrieval
K-Means Clustering
Hierarchical Clustering
Spectral Clustering
Density-Based Clustering
Deep Clustering and LLM-Assisted Methods
When Clustering Falls Short and Mitigations
Alternatives to Clustering for Reducing Search Space
Sources
Clustering for Search Space Reduction in LLM Retrieval
Clustering is a core technique in document digitization and chunking pipelines for large language models (LLMs) because it groups similar content, allowing searches to be confined to a few relevant clusters instead of the entire corpus. In vector databases for Retrieval-Augmented Generation (RAG), for example, partitioning the embedding datastore via k-means clustering (the popular inverted file index, or IVF, approach) means a query only compares against vectors in the closest cluster(s) (TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval). This significantly cuts down comparisons and latency. Akesson and Santos (2024) demonstrate this in Clustered RAG (CRAG): by clustering similar document reviews and summarizing each cluster, they reduced prompt tokens by 46–90% without degrading answer quality (Clustered Retrieved Augmented Generation (CRAG)). Lin et al. (2025) likewise note that IVF pre-clustering “limits the search space to relevant clusters” at runtime, transferring only those clusters to GPU memory for fast retrieval. In essence, clustering exploits the “cluster hypothesis” – relevant pieces of information tend to live together – to avoid exhaustive search.
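To make the IVF idea concrete, here is a minimal sketch using the faiss library (faiss-cpu assumed installed); the embedding dimension, corpus size, and nlist/nprobe values are illustrative, not taken from the systems cited above.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d, n_docs, nlist = 384, 10_000, 128                 # embedding dim, corpus size, number of IVF clusters
xb = np.random.rand(n_docs, d).astype("float32")    # stand-in for chunk embeddings
xq = np.random.rand(1, d).astype("float32")         # stand-in for a query embedding

quantizer = faiss.IndexFlatL2(d)                    # assigns vectors to their nearest k-means centroid
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                                     # runs k-means over the corpus to learn nlist centroids
index.add(xb)                                       # stores each chunk embedding in its centroid's list

index.nprobe = 4                                    # at query time, scan only the 4 closest clusters
distances, ids = index.search(xq, 5)                # top-5 nearest chunks drawn from those clusters only
print(ids)
```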
K-Means Clustering
K-means is one of the most widely used clustering algorithms in document retrieval systems due to its simplicity and efficiency. It partitions embeddings into k clusters of roughly spherical shape by minimizing intra-cluster distance. Many RAG systems apply k-means during indexing; for instance, vector indexes often use k-means centroids as representatives so that a query first finds the nearest centroid(s) and then searches only that partition (A Developer’s Guide to Approximate Nearest Neighbor (ANN) Algorithms | Pinecone). This two-stage search greatly reduces the candidate chunks to consider. CRAG uses k-means to group semantically similar reviews before summarization, and the authors suggest exploring alternative clustering algorithms to further improve results. However, a limitation of k-means is that it assumes convex clusters of similar size. When document embeddings don’t form neat spherical groups or when an item lies near a cluster boundary, k-means can miscluster relevant pieces. This can lead to missing information if a query’s answer spans multiple clusters. A common mitigation is to search multiple clusters: rather than only the top-1 closest cluster, retrieve from the top-N clusters to improve recall at some cost. Lin et al. (2025) address this by concurrently searching any “missed” clusters on CPU to merge with the main results, ensuring no relevant chunk is skipped. In practice, selecting a slightly larger k (more fine-grained clusters) or using overlapping cluster assignments can also alleviate boundary effects.
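The top-N mitigation can be sketched directly with scikit-learn and NumPy; the cluster count, probe count, and random data below are illustrative stand-ins for real chunk embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 256)).astype("float32")   # stand-in chunk embeddings
query = rng.normal(size=(256,)).astype("float32")    # stand-in query embedding

km = KMeans(n_clusters=32, n_init=10, random_state=0).fit(X)

def search_top_n_clusters(query, n_probe=3, k=5):
    """Probe the n_probe nearest clusters instead of only the single closest one."""
    centroid_dists = np.linalg.norm(km.cluster_centers_ - query, axis=1)
    probe = np.argsort(centroid_dists)[:n_probe]             # ids of the clusters to search
    candidates = np.where(np.isin(km.labels_, probe))[0]     # chunks living in those clusters
    cand_dists = np.linalg.norm(X[candidates] - query, axis=1)
    return candidates[np.argsort(cand_dists)[:k]]            # ids of the k best chunks found

print(search_top_n_clusters(query))
```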
Hierarchical Clustering
Hierarchical methods build a tree of clusters at multiple granularity levels, which is very useful for organizing large documents or multi-topic corpora. Recent RAG frameworks store documents in hierarchical structures (e.g. chapters → sections → paragraphs) and perform coarse-to-fine retrieval. Goel and Chandak (2024) introduce HIRO, a hierarchical retrieval method that performs a depth-first search through a document tree, pruning entire branches whose summary embeddings have low similarity to the query (HIRO: Hierarchical Information Retrieval Optimization). By recursively scoring parent nodes and only drilling into relevant branches, HIRO minimizes context passed to the LLM “without informational loss,” yielding a >10% performance gain on NarrativeQA. Hierarchical clustering of chunks can thus reduce search space dramatically – the query ignores whole sections of data that are irrelevant. A similar idea is used in RAPTOR (2024), which recursively clusters and summarizes text into a tree structure for long-document QA. The challenge with hierarchical clustering is choosing cut-offs for pruning; if set too aggressively, relevant leaves might be pruned (a failure mode). To mitigate this, systems often use a threshold on similarity scores and ensure some exploration of sibling nodes. Overall, hierarchical clustering offers a balanced strategy: broad clusters quickly narrow down scope, then finer clusters ensure detailed coverage.
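The pruning idea (not HIRO's exact algorithm) can be sketched as a depth-first traversal that only descends into branches whose summary embedding clears a similarity threshold; the Node structure and threshold value here are assumptions for illustration.

```python
import numpy as np

class Node:
    """One node of a document tree: chapters, sections, and leaf paragraphs."""
    def __init__(self, summary_embedding, children=None, text=None):
        self.summary_embedding = np.asarray(summary_embedding)
        self.children = children or []
        self.text = text                       # leaf nodes carry the actual chunk text

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieve(node, query_emb, threshold=0.3):
    """Depth-first traversal that prunes any branch whose summary is dissimilar to the query."""
    if cosine(node.summary_embedding, query_emb) < threshold:
        return []                              # prune this branch entirely
    if not node.children:
        return [node.text]                     # relevant leaf chunk
    results = []
    for child in node.children:
        results.extend(retrieve(child, query_emb, threshold))
    return results
```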
Spectral Clustering
Spectral clustering treats document chunks as nodes in a graph, using pairwise similarity (e.g. cosine of embeddings) to create a similarity matrix. By computing the eigenvectors of the graph Laplacian built from this matrix, spectral methods can partition the graph into clusters that are not necessarily spherical or equal-size. This flexibility means spectral clustering can capture complex topical structures – clusters of “diverse shapes [and] densities” in text data (Explainable Graph Spectral Clustering of text documents | PLOS One) – which might be missed by centroid-based methods. For instance, a spectral clustering on a citation graph or semantic network could group documents by theme even if their embeddings are not contiguous in vector space. The downside is computational cost: building and diagonalizing an N×N similarity matrix is expensive for large corpora, making spectral clustering less common for real-time LLM retrieval. Moreover, the resulting clusters can be hard to interpret or explain in terms of original content. In practice, spectral clustering may be used on smaller subsets or offline to inform indexing. Recent text clustering surveys note that graph-based methods like spectral clustering “provide a structured approach to understanding the global structure of document relationships” (Recent Advances in Text Documents Clustering). In summary, spectral clustering can yield high-quality groupings (reducing search space by focusing on meaningful subgraphs), but careful tuning and scaling strategies (e.g. clustering in batches or using approximate spectral methods) are needed to apply it at LLM scale.
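As a small illustration, scikit-learn's SpectralClustering can partition chunk embeddings over a k-nearest-neighbor similarity graph rather than a dense N×N matrix; all parameters and the random data below are illustrative.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 128))     # stand-in document-chunk embeddings

# Build a kNN similarity graph internally and partition it via the Laplacian's eigenvectors.
spectral = SpectralClustering(
    n_clusters=10,
    affinity="nearest_neighbors",   # sparse graph affinity instead of a dense similarity matrix
    n_neighbors=15,
    assign_labels="kmeans",
    random_state=0,
)
labels = spectral.fit_predict(embeddings)
print(np.bincount(labels))          # cluster sizes; they need not be balanced or "spherical"
```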
Density-Based Clustering
Density-based algorithms such as DBSCAN and HDBSCAN cluster points based on regions of high density in the embedding space. They do not require specifying the number of clusters a priori, and they naturally label sparse outliers as noise. This makes them attractive for document chunking when one wants to identify tight topical clusters and isolate off-topic or irrelevant chunks. For example, Castillo (2025) demonstrates using HDBSCAN to cluster news article embeddings, automatically discovering topic groupings without preset k (Clustering Documents with OpenAI embeddings, HDBSCAN and UMAP – Dylan Castillo). Such clusters could be used to limit an LLM’s search to dense topic areas while filtering out noise. A benefit of HDBSCAN is robustness to varying cluster shapes – it can find a very irregular cluster of related documents if they form a dense manifold in the vector space. However, in high-dimensional text embeddings, density estimation faces the “curse of dimensionality” (distances tend to homogenize, making it tricky to set the ε or minimum density thresholds). If parameters are not tuned well, a density-based method might put most points in one giant cluster or conversely label too many points as outliers (failing to reduce the search space effectively). Mitigation strategies include dimensionality reduction (e.g. UMAP) before clustering to accentuate true neighborhoods, or using adaptive density thresholds. In practice, density clustering is often used in combination with other methods – for instance, first using k-means to partition broadly, then HDBSCAN within each partition to find fine-grained groups or detect anomalies.
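Following the general recipe from the Castillo post, a minimal sketch might first reduce dimensionality with UMAP and then cluster with HDBSCAN (the umap-learn and hdbscan packages are assumed installed; parameters and data are illustrative).

```python
import numpy as np
import umap        # pip install umap-learn
import hdbscan     # pip install hdbscan

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(2000, 1536))   # stand-in for OpenAI-style chunk embeddings

# Reduce dimensionality first so density estimates are meaningful despite the curse of dimensionality.
reduced = umap.UMAP(
    n_neighbors=15, n_components=5, metric="cosine", random_state=0
).fit_transform(embeddings)

clusterer = hdbscan.HDBSCAN(min_cluster_size=15, min_samples=5)
labels = clusterer.fit_predict(reduced)      # label -1 marks noise / off-topic chunks

print("clusters found:", labels.max() + 1, "| noise points:", int((labels == -1).sum()))
```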
Deep Clustering and LLM-Assisted Methods
Emerging approaches leverage neural networks and even LLMs themselves to improve clustering of document chunks. The idea is to learn representations or refine cluster assignments in ways classical algorithms cannot. Some 2023 methods (IDAS, ClusterLLM) directly incorporate LLM-generated insights: e.g. using an LLM to produce abstractive summaries or to predict semantic relations between sentences, then clustering based on those cues. These showed promise but often required many expensive LLM calls and did not generalize across domains. In 2024, Lin et al. proposed LLMEdgeRefine, a two-stage clustering approach that addresses clustering failures at the “edges.” First, they run k-means to get initial clusters. Then they identify edge points (outliers or ambiguous points near cluster boundaries) and group these via a secondary agglomerative clustering into “super-points” to reduce noise. In the second stage, LLMEdgeRefine uses an LLM’s understanding to softly reassign or remove these edge points based on semantic context. By letting the LLM reconsider borderline cases, the clusters become more semantically coherent. This yielded consistently higher clustering accuracy on text datasets compared to baseline methods. Deep clustering can also involve training neural encoders (e.g. via autoencoders or contrastive learning) to produce embedding spaces where clusters are more well-formed. The general advantage of deep clustering is adaptability – the clustering process can learn from the data or from an LLM’s vast knowledge. The risk is complexity and overfitting: if the LLM or model is not carefully constrained, it might create idiosyncratic clusters that don’t generalize. Nonetheless, techniques like LLMEdgeRefine illustrate that using LLMs in the loop can mitigate classic clustering pitfalls (like outliers and wrong assignments) by injecting high-level semantic judgment into the clustering process.
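As a simplified illustration of the first stage only, the snippet below flags “edge” points whose nearest and second-nearest centroids are almost equally close after k-means; the margin heuristic and the (omitted) LLM reassignment step are assumptions for illustration, not the paper's actual procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def find_edge_points(X, n_clusters=16, margin=1.15):
    """Flag points whose nearest and second-nearest centroids are almost equally close."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    dists = km.transform(X)                       # distances from each point to every centroid
    nearest_two = np.sort(dists, axis=1)[:, :2]
    # A ratio near 1 means the point sits on a cluster boundary and its assignment is ambiguous.
    is_edge = nearest_two[:, 1] / (nearest_two[:, 0] + 1e-9) < margin
    return km.labels_, np.where(is_edge)[0]

# The flagged edge points could then be re-examined (e.g., by prompting an LLM with
# representative texts from the candidate clusters) and softly reassigned or dropped;
# that refinement step is omitted here.
```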
When Clustering Falls Short and Mitigations
Despite their utility, clustering techniques can fail to reduce search space effectively in certain scenarios. A known failure mode is when relevant information for a query is split across clusters. If the retrieval system only looks at one cluster, it may miss critical pieces (lowering recall). This is often due to imperfect clustering boundaries. As one study notes, cluster-partitioned ANN indexes typically need to scan more candidates to reach the same recall as graph-based indexes (A Developer’s Guide to Approximate Nearest Neighbor (ANN) Algorithms | Pinecone), implying some nearest neighbors fall outside the assigned cluster. One remedy is multi-cluster retrieval – querying the top several clusters. TeleRAG’s design, for example, prefetches the likely relevant IVF clusters to GPU but also “ensur[es] complete retrieval” by searching any missed clusters on CPU in parallel (TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval). This hybrid approach protects accuracy at a minor cost. Another issue is cluster imbalance: if one cluster is very large or heterogeneous, searching within it is nearly as hard as searching the whole corpus. This can be mitigated by increasing the number of clusters (making each more fine-grained), or using hierarchical clustering to split large clusters into subclusters. Soft clustering (where documents can belong to multiple clusters or have fuzzy membership) is also a strategy to handle overlap – a document that touches multiple topics could be indexed in all relevant clusters, so a query for either topic still finds it. To address cluster quality problems, some methods use cluster ensembles or re-clustering: run multiple clustering algorithms and intersect results to find stable groupings. We also see fusion of clustering with other signals as a powerful mitigation. Yang (2024) proposes a cluster-based partial retrieval guided by sparse retrieval results (SIGIR '24). In that approach, keywords (BM25 results) help identify which learned clusters likely contain relevant docs, effectively correcting clustering mistakes by cross-referencing textual cues. This kind of dense-sparse fusion can retain strong relevance while still cutting down the search space. In summary, when clustering alone isn’t perfect, combining clusters with alternative retrieval strategies, increasing redundancy (searching a bit beyond the single best cluster), or refining cluster assignments with LLM-in-the-loop adjustments are effective ways to ensure important information isn’t overlooked.
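For instance, the overlapping-assignment idea can be sketched by indexing each chunk under its top-2 nearest centroids; the cluster count and assignment count below are illustrative, and a real system would store the resulting inverted index in its vector database.

```python
import numpy as np
from sklearn.cluster import KMeans

def soft_assign(X, n_clusters=32, n_assignments=2):
    """Index each chunk under its top-n nearest centroids so boundary chunks appear in several clusters."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    dists = km.transform(X)                                   # distances to every centroid
    top = np.argsort(dists, axis=1)[:, :n_assignments]        # shape: (n_chunks, n_assignments)
    # Inverted index: cluster id -> ids of all chunks assigned to it (primary or secondary).
    index = {c: np.where((top == c).any(axis=1))[0] for c in range(n_clusters)}
    return km, index   # a query probing one cluster still sees that cluster's boundary chunks
```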
Alternatives to Clustering for Reducing Search Space
Clustering is not the only game in town. Several other techniques can narrow the search space in document retrieval and chunking:
Approximate Nearest Neighbor (ANN) Graphs: Graph-based indexes like HNSW (Hierarchical Navigable Small World) are a popular alternative to cluster-partitioning. Instead of grouping by centroids, they organize embeddings as a navigable graph. Queries traverse the graph to find nearest neighbors, often examining far fewer points to reach a target recall than cluster-based methods (A Developer’s Guide to Approximate Nearest Neighbor (ANN) Algorithms | Pinecone). Empirically, these graph indexes are among “the fastest algorithms for in-memory vector search.” In practice, ANN graphs can achieve high recall with low latency, effectively reducing search space by pruning paths that are unlikely to lead to close neighbors. Many vector search libraries default to HNSW or similar graph algorithms for this reason (see the sketch after this list).
Semantic Indexing and Filters: Before resorting to full vector search, one can filter or index documents by high-level categories or keywords. For example, Jiang et al. (2025) use a theme-scoped retrieval that classifies queries into topic scopes to “efficiently narrow the search space while maintaining retrieval quality” (RAS: Retrieval-And-Structuring for Knowledge-Intensive LLM Generation). By routing a query to only the subset of documents on that theme, the search load is drastically reduced. Similarly, classic inverted indexes (keyword-based) can quickly pre-filter candidate chunks by lexical cues; this can be combined with an embedding search for precision. Although simpler than clustering, these methods require a predefined taxonomy or metadata. They work well when documents are already labeled or easily separable by context (e.g. search only in the “finance” section for a finance query).
Hash-Based Methods: Locality-Sensitive Hashing (LSH) and related hashing techniques map embeddings to buckets so that similar items land in the same bucket with high probability. This allows sub-linear retrieval by only searching within a few buckets. LSH was traditionally used for high-dimensional data search, but recent reviews note that purely hash-based indexes have been surpassed by graph and clustering methods in performance (A Developer’s Guide to Approximate Nearest Neighbor (ANN) Algorithms | Pinecone). Still, hashing remains an alternative for certain cases (it’s simple and can be combined with clustering: e.g. assign a hash within each cluster to further narrow search).
Cascaded Retrieval and Caching: Multi-stage retrieval pipelines can cut down search space at each stage. For instance, a first stage might use a cheap model or smaller embedding to retrieve a rough set of candidate chunks, and later stages refine this set with stronger models. This cascade ensures only a small fraction of the corpus is ever examined with the expensive LLM or embedding. Additionally, semantic caching has been explored as a way to avoid repeated searches: results of frequent queries (or query embeddings) are cached so that similar new queries can reuse those results. By serving some queries from cache or memory, the system bypasses a full corpus scan. Jin et al. (2024) introduced RAGCache, a multilevel cache for RAG that stores intermediate results and was shown to significantly cut down redundant retrieval work.
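As a brief illustration of the graph-based alternative mentioned above, here is a minimal HNSW sketch using the hnswlib package; the dimensions, data, and build parameters (M, ef_construction, ef) are typical illustrative values, not tuned settings.

```python
import numpy as np
import hnswlib   # pip install hnswlib

dim, n_docs = 384, 10_000
data = np.random.rand(n_docs, dim).astype("float32")   # stand-in chunk embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n_docs, ef_construction=200, M=16)   # graph construction parameters
index.add_items(data, np.arange(n_docs))                            # insert vectors with explicit ids

index.set_ef(50)                                        # query-time recall/speed trade-off
labels, distances = index.knn_query(data[:1], k=5)      # traverse the graph; no cluster scan needed
print(labels)
```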
In conclusion, clustering techniques – from classical k-means and hierarchical clustering to advanced spectral, density-based, and LLM-assisted methods – play a pivotal role in structuring document data for LLM applications. They reduce the search space by grouping related information, thus making retrieval and chunk selection more efficient. Each technique has its strengths (e.g. speed of k-means, structure awareness of spectral, nuance of deep clustering) and failure modes (boundary cases, computational limits, etc.). Modern studies in 2024–2025 have shown not only how clustering boosts retrieval efficiency (TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval), but also how to address its shortcomings via smarter algorithms and hybrid strategies (SIGIR '24). Moreover, when clustering is not suitable, alternative approaches like ANN indexing, thematic filtering, and caching ensure that we can still tame the search space explosion that comes with large document collections. By judiciously choosing and sometimes combining these methods, practitioners can build RAG and chunking pipelines that scale to massive corpora while delivering relevant context to LLMs with high accuracy and low latency.
Sources
Lin et al., “TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval,” arXiv, 2025.
Akesson & Santos, “Clustered Retrieved Augmented Generation (CRAG),” arXiv, 2024.
Goel & Chandak, “HIRO: Hierarchical Information Retrieval Optimization,” arXiv, 2024.
Starosta et al., “Explainable Graph Spectral Clustering of text documents,” PLOS One, 2024.
Castillo, “Clustering Documents with OpenAI Embeddings, HDBSCAN and UMAP,” Blog, updated Feb 2025.
Lin et al., “LLMEdgeRefine: Enhancing Text Clustering with LLM-Based Refinement,” EMNLP 2024.
Pinecone, “A Developer’s Guide to ANN Algorithms,” 2024.
Jiang et al., “Retrieval-And-Structuring (RAS) for Knowledge-Intensive LLM Generation,” arXiv, 2025.
Yang et al., “Cluster-based Partial Dense Retrieval Fused with Sparse Text Retrieval,” SIGIR 2024.
Akheel et al., “Semantic Caching for LLM Applications: A Review,” J. Sci. & Eng. Research, 2024.