Table of Contents
Advanced Search Algorithms for LLMs in Large-Scale Datasets (2024–2025)
Algorithmic Improvements
Vector Search Techniques in LLM Retrieval
Retrieval-Augmented Generation (RAG) Enhancements
Sparse vs. Dense Retrieval Methods
Hybrid Search Strategies (Combining Paradigms)
Infrastructure Optimizations
Indexing Strategies and Data Structures
Hardware Acceleration (GPUs, FPGAs, etc.)
Distributed Search Architectures
Trends, Trade-offs, and Practical Takeaways
Large language models (LLMs) increasingly rely on advanced retrieval to ground their responses in massive external data. Retrieval-Augmented Generation (RAG) – injecting search results into LLM prompts – is a popular technique to enhance accuracy and reduce hallucinations (https://arxiv.org/pdf/2409.06464). The quality and efficiency of these search results are critical (“garbage in, garbage out” for LLM answers (https://arxiv.org/pdf/2409.06464)). Recent research (2024–2025) has focused on improving both algorithmic aspects of search (embedding-based retrieval, hybrid methods) and infrastructure (indexing, hardware acceleration, distribution) to achieve efficient, accurate retrieval at scale. Below is a structured review of key developments, with an emphasis on arXiv papers from 2024/25, their findings, and the trade-offs and trends they highlight.
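To make the RAG loop concrete before diving into the literature, here is a minimal retrieve-then-prompt sketch. The hashed bag-of-words `embed` function and the prompt template are illustrative stand-ins, not a system described in any of the papers below.

```python
# Minimal retrieve-then-prompt sketch of RAG. `embed` is a toy stand-in for a
# real embedding model; the prompt template is illustrative only.
import numpy as np

def embed(text, dim=256):
    """Toy embedding: hashed bag-of-words projection (stand-in for a real encoder)."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

corpus = [
    "The Eiffel Tower is 330 metres tall.",
    "Mount Everest is the highest mountain above sea level.",
    "The Great Wall of China is not visible from orbit without aid.",
]
doc_vecs = np.stack([embed(d) for d in corpus])

def retrieve(query, k=2):
    """Return the k most similar documents by cosine similarity (unit vectors)."""
    scores = doc_vecs @ embed(query)
    return [corpus[i] for i in np.argsort(-scores)[:k]]

query = "How tall is the Eiffel Tower?"
context = "\n".join(retrieve(query))
prompt = (
    "Answer using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
)
# `prompt` would now be sent to the LLM; the retrieved text grounds the answer.
print(prompt)
```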
Algorithmic Improvements
Vector Search Techniques in LLM Retrieval
Vector similarity search uses high-dimensional embeddings to find semantically relevant items. This dense retrieval paradigm has seen refinements in algorithms and indexes for speed/accuracy trade-offs:
Approximate Nearest Neighbor (ANN) Search: Exact brute-force search (flat index) guarantees recall but is slow for large data. ANN methods accelerate queries by examining only a subset of vectors, trading a tiny accuracy loss for huge speedups (BANG: Billion-Scale Approximate Nearest Neighbor Search using a Single GPU). Graph-based ANN indexes (e.g. HNSW) are popular for their strong recall-speed balance (HERE). Best practices often recommend HNSW for large corpora, while flat (brute-force) indexes can be viable for smaller corpora or prototyping. For instance, Lin (2024) shows flat indexes perform nearly as well as HNSW under ~100k documents, but become 2–3× slower once the corpus grows to hundreds of thousands. At multi-million scales, graph indexes significantly outperform flat indexes in query throughput, justifying the higher indexing cost (see the FAISS sketch after this list).
Vector Quantization and Compression: To make vector search more memory-efficient, embeddings can be compressed (e.g. product quantization). Compressed or quantized indexes store smaller vector representations, enabling larger datasets to fit in memory (or on faster GPU memory) at a minor accuracy cost. For example, the BANG system (2024) uses compressed vectors on GPU and keeps the ANN graph on CPU, allowing billion-scale search on a single GPU without running out of memory. BANG’s hybrid GPU–CPU design overlaps computation with data transfer and achieves 40–200× higher throughput than prior methods at 90% recall on billion-item datasets.
Multi-Vector Representations: Instead of a single embedding per document, some techniques use multiple vectors to capture different aspects of content (e.g. ColBERT for passages). While multi-vector methods can improve recall, they increase index size and query cost. Recent work continues to focus on single-vector ANN due to its simplicity, leaving multi-vector indexing as a specialized approach (HERE). The general trend is to first exhaust simpler single-vector ANN optimizations before resorting to multi-vector schemes.
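To make these options concrete, here is a minimal FAISS sketch (assuming the faiss-cpu package and random stand-in embeddings) comparing a flat index, an HNSW graph index, and a compressed IVF-PQ index. The parameter values are illustrative, not tuned recommendations.

```python
# Minimal FAISS sketch: flat (exact), HNSW (graph), and IVF-PQ (compressed)
# indexes over random stand-in embeddings. Parameters are illustrative only.
import numpy as np
import faiss

d = 384                                              # embedding dimension (assumed)
xb = np.random.rand(100_000, d).astype("float32")    # "document" vectors
xq = np.random.rand(10, d).astype("float32")         # "query" vectors

# 1) Flat index: exact search, O(n) distance computations per query.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
D, I = flat.search(xq, 10)

# 2) HNSW graph: slower to build, much faster to query at scale.
hnsw = faiss.IndexHNSWFlat(d, 32)                    # 32 = graph degree (M)
hnsw.hnsw.efConstruction = 200                       # build-time quality knob
hnsw.hnsw.efSearch = 64                              # query-time recall/speed knob
hnsw.add(xb)
D, I = hnsw.search(xq, 10)

# 3) IVF-PQ: coarse clustering plus product quantization to shrink memory.
quantizer = faiss.IndexFlatL2(d)
ivfpq = faiss.IndexIVFPQ(quantizer, d, 1024, 48, 8)  # 1024 lists, 48 subquantizers
ivfpq.train(xb)                                      # learn clusters and codebooks
ivfpq.add(xb)
ivfpq.nprobe = 16                                    # clusters scanned per query
D, I = ivfpq.search(xq, 10)
```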
Retrieval-Augmented Generation (RAG) Enhancements
RAG integrates external retrieval into LLM workflows: the model first retrieves relevant documents and then conditions its generation on this evidence. Research in 2024–2025 has refined RAG to improve both retrieval quality and downstream reasoning:
Core RAG Benefits: By grounding LLM prompts in retrieved text, RAG can supply up-to-date knowledge and reduce hallucinations (HERE). It’s become standard in LLM applications (question answering, assistants) because it mitigates LLMs’ limited internal knowledge. However, RAG’s effectiveness depends on the retriever’s accuracy – poor search results lead to flawed answers. Thus, many recent works target the retrieval component for improvement.
Optimizing Retrieval for RAG: High-quality retrieval in RAG often entails domain-specific indexing or chunking of knowledge, as well as reranking steps. A 2024 hybrid RAG system by Yuan et al. combines multiple optimizations: refining text chunks and tables, using attribute prediction to cut off irrelevant content, extracting knowledge with LLM and knowledge-graph modules, and a final reasoning step that consolidates references (A Hybrid RAG System with Comprehensive Enhancement on Complex Reasoning). These enhancements substantially increased scores and reduced errors on complex reasoning benchmarks (the KDD Cup RAG Challenge) compared to a baseline RAG model. The trend is toward pipeline approaches – integrating retrieval with knowledge extraction and reasoning – to tackle complex queries that require more than just retrieving a passage.
Long Context via Retrieval: Another line of research uses retrieval to extend LLM context length efficiently. RetrievalAttention (Liu et al., 2024) accelerates long-context LLM inference by retrieving only the most relevant past tokens’ key/value vectors instead of attending to the entire context (RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval). They build an ANN index of the LLM’s internal hidden vectors and fetch top relevant keys for each new token’s attention, using an attention-aware ANN algorithm to handle the distribution shift in embeddings. This method achieves near full accuracy while accessing only 1–3% of the context, drastically reducing memory and compute costs (e.g. enabling 128K-token contexts on a single 24GB GPU). Such techniques highlight how external search algorithms can even optimize the internal operations of LLMs.
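The core mechanism can be illustrated with a toy sketch: attention is computed over only the top-k retrieved keys rather than the whole key/value cache. The exact top-k below stands in for the paper’s attention-aware ANN index, so this is a conceptual sketch, not the authors’ implementation.

```python
# Toy numpy sketch of retrieval-based sparse attention: attend only to the
# top-k most relevant cached keys instead of the full context. Exact top-k
# stands in for the ANN index used by RetrievalAttention.
import numpy as np

def topk_attention(q, K, V, k=32):
    """q: (d,) query vector; K, V: (n, d) cached keys/values; k: tokens kept."""
    scores = K @ q                           # similarity of the query to every cached key
    top = np.argpartition(scores, -k)[-k:]   # indices of the k highest-scoring keys
    sel = scores[top] / np.sqrt(K.shape[1])  # scaled scores over the retrieved subset
    w = np.exp(sel - sel.max())
    w /= w.sum()                             # softmax over only the retrieved keys
    return w @ V[top]                        # attention output from ~k of n tokens

# Example: 100k cached tokens, but each decoding step touches only 32 of them.
d, n = 64, 100_000
K = np.random.randn(n, d).astype("float32")
V = np.random.randn(n, d).astype("float32")
q = np.random.randn(d).astype("float32")
out = topk_attention(q, K, V, k=32)          # (d,) approximate attention output
```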
Sparse vs. Dense Retrieval Methods
Two major paradigms in text retrieval are sparse (lexical) methods and dense (vector) methods. Each has advantages, and recent works explore their differences and complementary nature:
Sparse Retrieval: Methods like BM25 use keyword frequencies to score documents. They excel at precise keyword matching and are highly interpretable (we know which terms matched) (Efficient and Effective Retrieval of Dense-Sparse Hybrid Vectors using Graph-based Approximate Nearest Neighbor Search). Sparse indexes (inverted indexes) can quickly retrieve exact matches and have been the backbone of search engines for decades. However, they struggle with vocabulary mismatch – if a query uses different words from the relevant text, sparse methods might miss it. They also cannot easily capture semantic similarity beyond exact terms.
Dense Retrieval: Dense methods encode text into semantic vectors using neural models (e.g. DPR, Sentence-BERT). They handle conceptual similarity and synonyms well, retrieving relevant content even if wording differs. Dense retrievers have surged with LLM and transformer advances, often outperforming sparse retrieval on tasks where semantic match is crucial. The downside is that dense models may miss exact matches for rare keywords or names (since everything is compressed into a vector) and lack the explicit transparency of sparse methods. They also require substantial training data for embedding models and efficient ANN indexes for production use.
Comparative Performance: Jimmy Lin’s 2024 study provides practical guidance on dense vs. sparse trade-offs (HERE). Generally, dense retrieval (with a well-tuned embedding model) can achieve higher recall on natural language queries, but BM25 remains strong for keyword-heavy or out-of-distribution queries. Many systems now use dense retrieval as a first stage and may fall back to sparse for queries where dense fails – or use hybrid methods (below). The research community recognizes that no one-size-fits-all solution exists; effectiveness can vary by corpus type and size. Thus, understanding the query/domain is key in choosing sparse vs. dense or a combination.
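The contrast shows up in a few lines of code. The sketch below assumes the rank_bm25 and sentence-transformers packages; the model name is just one commonly used small encoder, not a specific recommendation. With a vocabulary-mismatch query, BM25 finds no overlapping terms, while the dense embeddings can still surface the semantically related documents.

```python
# Sketch comparing lexical (BM25) and dense scoring on a vocabulary-mismatch query.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = [
    "How to replace a laptop battery",
    "Tips for extending notebook battery life",
    "Best hiking trails near Denver",
]
query = "my portable computer loses power quickly"

# Sparse: BM25 over whitespace tokens; scores depend on exact term overlap,
# so this query (no shared terms) scores zero everywhere.
bm25 = BM25Okapi([d.lower().split() for d in docs])
sparse_scores = bm25.get_scores(query.lower().split())

# Dense: cosine similarity between sentence embeddings, which can capture
# synonymy ("portable computer" ~ "laptop"/"notebook").
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(docs + [query], normalize_embeddings=True)
dense_scores = emb[:-1] @ emb[-1]

for doc, sparse_s, dense_s in zip(docs, sparse_scores, dense_scores):
    print(f"BM25={sparse_s:5.2f}  dense={dense_s:5.2f}  {doc}")
```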
Hybrid Search Strategies (Combining Paradigms)
To get the best of both worlds, hybrid strategies combine multiple retrieval techniques. Two notable hybrid approaches are sparse–dense fusion and LLM-assisted retrieval:
Sparse + Dense Hybrid Retrieval: Merging lexical and semantic signals often yields the highest accuracy. A hybrid retriever might, for example, retrieve candidate documents with both BM25 and a dense model, then merge or rerank the results. This leverages sparse retrieval’s precision and dense retrieval’s semantic recall (Efficient and Effective Retrieval of Dense-Sparse Hybrid Vectors using Graph-based Approximate Nearest Neighbor Search); a minimal fusion sketch follows this list. Recent studies confirm that such hybrids can significantly improve recall and robustness across domains. In fact, hybrid retrieval has become common in RAG-based LLM systems. The naive way to implement this is a two-stage approach: run separate searches (one sparse, one dense) and then combine the results. However, this doubles the indexing and maintenance effort and can miss cases where the true top-k relevant items are split across the two methods. A 2024 work by Zhang et al. identifies key challenges for unified hybrid search: dense and sparse vectors have incompatible distributions and computational characteristics, making a single index hard to build efficiently (Efficient and Effective Retrieval of Dense-Sparse Hybrid Vectors using Graph-based Approximate Nearest Neighbor Search). They propose a unified graph-based ANN for hybrid vectors, using distribution alignment to calibrate sparse vs. dense distances and a two-stage search that first ranks by dense-only scores and then refines with sparse contributions. With additional pruning of less informative sparse dimensions, their method achieves an 8.9×–11.7× throughput boost at equal accuracy compared to prior hybrid search pipelines. This indicates that carefully designed hybrid algorithms can approach the speed of single-method search while reaping the accuracy gains of both.
LLM-Assisted Re-ranking (Neural Hybrid): Another hybrid paradigm uses an LLM itself as part of the search loop. In an LLM-assisted search scheme, a vector search first fetches a set of candidates, and then the LLM re-ranks or filters them using its deeper understanding of context or complex queries (LLM-assisted Vector Similarity Search). Riyadh et al. (2024) demonstrate this two-step approach: initial ANN retrieval yields candidates, then a prompt-based LLM considers the query and candidates to produce a final ranking. The approach shone on complex queries with nuanced conditions (negations, multi-faceted constraints), where pure vector similarity might fail. The LLM re-ranker can understand the context and subtleties (e.g. “find documents about X but not Y”) and thus significantly improve precision on these difficult queries. Critically, this is done without a massive slowdown – vector search narrows the field, so the LLM only processes a handful of results. The study found that LLM-assisted ranking outperforms vector search alone on complex queries while preserving efficiency. This underscores a broader trend of neural hybrid systems: using large models not just to answer questions but to mediate the search process itself (via re-ranking, query reformulation, or even generating sparse+dense representations, as in PromptReps below).
Prompt-Generated Hybrid Representations: Bridging sparse and dense at the representation level, PromptReps (Zhuang et al., 2024) is an innovative zero-shot method where an LLM is prompted to generate both a dense and a sparse representation for each text (PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval). With a clever prompt, the LLM produces a single-word summary (for sparse keyword matching) and a vector embedding (from its hidden state) for a document or query. This yields a hybrid index without any training: the LLM’s outputs serve as the indexed vectors and terms. Remarkably, PromptReps achieved retrieval effectiveness on par with state-of-the-art dense models that require extensive training data, especially when using a larger LLM for generation. It combines the advantages of dense and sparse retrieval implicitly – the hidden-state vector encodes semantics while the chosen word provides a keyword signal. This line of work hints at LLMs being used to build better search indexes on the fly, and opens up research into how prompt-guided indexing can reduce the need for labeled training or complex index engineering.
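For the simple two-stage sparse+dense fusion mentioned at the top of this list (not the unified graph index of Zhang et al. or PromptReps), a common merging recipe is reciprocal rank fusion (RRF). A minimal sketch with hypothetical document IDs:

```python
# Minimal reciprocal rank fusion (RRF) sketch: merge a BM25 ranking and a
# dense-retrieval ranking into one hybrid ranking.
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """rankings: list of ranked doc-id lists (best first). Returns the fused ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)   # standard RRF weight
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top-5 lists from a BM25 index and a dense ANN index.
bm25_top = ["d3", "d7", "d1", "d9", "d4"]
dense_top = ["d1", "d2", "d3", "d8", "d5"]
print(rrf_fuse([bm25_top, dense_top]))   # fused ranking, best first
```

Documents that appear near the top of both lists accumulate the largest fused scores, which is exactly the “best of both signals” behavior hybrid retrieval aims for.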
Infrastructure Optimizations
Indexing Strategies and Data Structures
Efficient indexing underpins large-scale search, turning raw vectors or documents into a structure that can be queried quickly. Recent research touches on both classic data structures and new learned or adaptive indexes for LLM-scale retrieval:
Graph vs. Tree vs. Inverted Indexes: There is renewed analysis of which index types work best for modern workloads. Graph-based indexes (like HNSW) often give the fastest query times for high-dimensional dense vectors, whereas tree-based or inverted indexes excel for lower-dimensional or sparse data. Lin (2024) explicitly compares HNSW graphs to flat (array) indexes for dense retrieval, as well as to traditional inverted indexes for sparse retrieval (HERE). His results show the indexing-time vs. query-speed trade-off clearly: HNSW is slower to build but much faster to query on large corpora, whereas a flat index is quick to set up but slows drastically as data grows. In practice, many systems use a hybrid indexing strategy: e.g. a coarse partitioning (clustering or IVF) to narrow the search, then a fine-grained structure (graph or brute-force) within each partition. This hierarchy (adopted by libraries like FAISS) is effective for scaling to millions of vectors. We also see specialized structures for text: e.g. SPLADE uses an inverted index with learned sparse term weights, effectively bringing neural relevance signals into a classic lexical index. The key trend is recognizing that “no free lunch” exists – index choice depends on corpus size, update frequency, and memory constraints, so researchers now provide empirical guidance rather than one absolute answer.
Dynamic Index Maintenance: Real-world datasets are not static; new documents arrive and old ones are updated. A 2025 study by Harwood et al. investigated ANN methods on dynamic datasets, evaluating the cost of updates along with query speed (Approximate Nearest Neighbour Search on Dynamic Datasets: An Investigation). They found that some structures degrade significantly with frequent updates – notably, k-d tree variants were slower than brute-force search when data was changing rapidly. Graph-based ANN (HNSW) handled insertions much better, maintaining a speedup over brute-force in an online setting. Another method, a variant of the ScaNN algorithm, also performed well for moderate recall targets. This highlights a trade-off: indexes that are optimal for static search may not be optimal when you need to insert or delete vectors on the fly (see the hnswlib sketch after this list). Hierarchical graphs like HNSW offer a good balance of search speed and updateability, whereas tree structures may need complete rebalancing after many inserts. Future indexing strategies are likely to emphasize incremental update costs as a first-class criterion, since LLM knowledge bases (enterprise documents, etc.) are continually evolving.
Learned and Adaptive Indexes: Inspired by learned B-tree indexes, researchers have explored learning-based approaches for vector search indexes. While no dominant learned ANN index has emerged yet, work like Brute-force to Learn (Douze et al., 2024) attempts to learn optimal quantization or clustering for a given dataset. Additionally, alignment techniques (like the distribution alignment in the hybrid search work (Efficient and Effective Retrieval of Dense-Sparse Hybrid Vectors using Graph-based Approximate Nearest Neighbor Search)) can be seen as adapting the index structure to the data’s characteristics. The PromptReps approach above is another example of baking more intelligence into indexing: using an LLM’s linguistic knowledge to create index keys. In practice, organizations combine open-source tools (FAISS, Annoy, Lucene) with task-specific tuning. The trend is toward auto-tuning indexes (choosing parameters or structures based on data/workload analysis) rather than manual one-size-fits-all settings.
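As a concrete illustration of incremental updates, the hnswlib library (one common HNSW implementation; the sizes and parameters below are arbitrary) lets new vectors be inserted into a live index without a rebuild:

```python
# Sketch of incremental updates with hnswlib: build an HNSW index, then stream
# in new vectors and soft-delete old ones without rebuilding from scratch.
import numpy as np
import hnswlib

dim = 128
index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=200_000, ef_construction=200, M=16)
index.set_ef(64)                                   # query-time recall/speed knob

# Initial corpus.
data0 = np.random.rand(100_000, dim).astype("float32")
index.add_items(data0, np.arange(100_000))

# Later: new documents arrive and are inserted into the existing graph.
new = np.random.rand(5_000, dim).astype("float32")
index.add_items(new, np.arange(100_000, 105_000))

index.mark_deleted(42)                             # soft-delete a stale document by label

query = np.random.rand(1, dim).astype("float32")
labels, dists = index.knn_query(query, k=10)       # search reflects inserts/deletes
```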
Hardware Acceleration (GPUs, FPGAs, etc.)
As datasets reach billions of entries, search speed is bound by hardware capabilities. Recent works leverage specialized hardware and novel architectures to boost throughput:
GPU-Accelerated Search: GPUs, with their massive parallelism, are natural for ANN computations. Facebook’s FAISS library introduced GPU indexing, and newer systems build on it (a minimal FAISS GPU sketch follows this list). BANG (2024) is a notable GPU-based ANN engine that breaks the memory barrier by storing the main graph index on CPU and compressed vectors on GPU, communicating efficiently (BANG: Billion-Scale Approximate Nearest Neighbor Search using a Single GPU). It overlaps CPU–GPU execution so that while the GPU computes distances on one batch, the CPU traverses the graph for the next, hiding PCIe latency. BANG achieved huge speedups (dozens of times faster) over prior GPU methods on billion-scale data. This shows that with careful system design, even a single GPU can handle web-scale search. There is also interest in GPU-friendly index structures (like graph algorithms optimized for CUDA) – for example, Groh et al. (2022) proposed a GPU-optimized HNSW variant. Overall, GPUs excel at high-throughput ANN, but one must manage memory limits (hence compression, or sharding data across multiple GPUs).
FPGA and ASIC Acceleration: Some 2024 research aims at ultra-low-latency search using FPGAs. FPGAs can be customized to run ANN graph traversal in hardware, achieving consistently low response times. Falcon (Zhang et al., 2024) is an FPGA-based graph search accelerator combined with a novel traversal algorithm called Delayed-Synchronization Traversal (DST) (Accelerating Graph-based Vector Search via Delayed-Synchronization Traversal). By optimizing the search order and parallelism for hardware, Falcon+DST achieved up to 4.3× lower query latency and 8× better energy efficiency than a CPU ANN search, and even beat a GPU by 19.5× in latency (26.9× in energy) on graph-based ANN tasks. These are dramatic improvements, indicating the potential of specialized silicon for vector search. Falcon also supports various graph types (not just HNSW) and processes many queries in parallel to maximize throughput. On the ASIC side, companies are exploring dedicated chips for recommendation/search (since ANN is common in those domains), and academic projects like ICE (2022) have looked at in-memory computing for similarity search. While GPUs are more accessible, FPGAs and eventually ASICs can push latency down to the microsecond scale, which is valuable for real-time LLM applications where each millisecond counts.
Memory and Storage Optimizations: Hardware advances are also enabling memory-disaggregated or out-of-core search. One example is CXL-ANNS (2023), which uses the CXL memory interconnect to treat multiple machines’ memory as one pool for ANN search, enabling billion-scale indexes to be searched without all data on one node. This kind of architecture lets a search service scale beyond a single machine’s RAM, important for enormous corpora. Another innovation is using fast SSDs for ANN (as in DiskANN), which can hold billions of vectors on NVMe storage and still retrieve with low latency by careful caching and search ordering. Memory-efficient architectures also include smarter caching of popular queries or vectors (so LLM systems can quickly retrieve frequent items without full index scans). In LLM serving, caching recently used embeddings or results in high-speed memory can reduce repeated computation. The overall trend is designing systems that balance memory, compute, and networking – e.g. offloading some ANN work to CPUs or specialized chips while GPUs focus on the LLM itself (RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval). We also see the emergence of retrieval-aware LLM serving frameworks (e.g. Chameleon, 2023) that co-design the LLM inference engine with the retrieval backend, so that hardware resources are used optimally across both tasks.
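For GPU acceleration at a far smaller scale than systems like BANG, FAISS can copy an index onto a GPU in a few lines. The sketch below assumes the faiss-gpu build, an available CUDA device, and random stand-in vectors; it illustrates the basic pattern only, not any of the specialized systems above.

```python
# Sketch of moving a FAISS index onto a GPU for batched distance computation.
import numpy as np
import faiss

d = 128
xb = np.random.rand(1_000_000, d).astype("float32")   # stand-in corpus vectors
xq = np.random.rand(64, d).astype("float32")          # a batch of queries

cpu_index = faiss.IndexFlatL2(d)                      # exact index built on the CPU
cpu_index.add(xb)

res = faiss.StandardGpuResources()                    # GPU memory/stream manager
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index) # copy the index to GPU 0

D, I = gpu_index.search(xq, 10)                       # brute-force search runs on the GPU
```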
Distributed Search Architectures
For truly web-scale search (hundreds of billions of documents), a single server is not enough. Distributed search architecture splits the indexing and query load across many machines. Key considerations and developments include:
Index Sharding and Federation: Large indexes are typically partitioned (sharded) by document ID or by vector-space clustering. A query is broadcast to all shards (or a selected subset for efficiency), and each shard returns its top-k results, which are then merged (a minimal scatter-gather sketch follows this list). This distributed query processing is standard in production search engines and now in vector databases like Milvus or Elasticsearch’s vector search. The challenge is keeping latency low when aggregating results from many nodes. Systems like Elasticsearch and Solr have refined shard coordination over years; for vector search, newer systems employ similar strategies. Resource selection (query routing) is an active area: e.g., querying only the shards most likely to have results, to reduce work. Recent enterprise solutions report using distributed search with technologies like Apache Solr to handle product catalogs, greatly improving recall and user engagement (as noted in some case studies).
Throughput vs Latency Trade-offs: A distributed setup can handle more data and queries in parallel, but network overhead can hurt single-query latency. Research is examining how to minimize this overhead. One approach is pipelining ANN search – e.g., start retrieving from one shard while others are still processing, and using asynchronous merges. Another approach is caching at the edge: if certain queries (or vector queries) are common, cache their results in a distributed fashion to answer without hitting the full index. The 2023 DF-GAS architecture even explores a distributed FPGA cloud service for ANN, aiming to combine low latency and scalability by offloading search to a network of accelerator-equipped nodes (Accelerating Graph-based Vector Search via Delayed-Synchronization Traversal). This indicates a trend of specialized distributed services for similarity search (vector search as a service), where multiple accelerators work in concert on massive indices.
Consistency and Index Sync: In retrieval-augmented LLM systems, maintaining consistency across distributed index shards is important, especially if updates occur (e.g., new data being indexed). Techniques like distributed indexing (simultaneously building index on multiple nodes) or periodic index merging are used. While not the focus of many 2024 papers, it remains a practical concern: ensuring all nodes have the latest data and the global top-k is truly found. Some research from databases (e.g., consensus algorithms for search indices) is being applied to vector DBs to handle this. There is also interest in how distributed search can be made fault-tolerant so that an LLM doesn’t hang if one shard is slow or down – solutions include having replicas of shards and quickly failing over. These system-level optimizations are often described in technical reports or open-source documentation rather than arXiv papers, but they form the backbone that allows the above algorithms to serve real-world applications.
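A toy illustration of the scatter-gather pattern referenced above: the query is sent to every shard, each shard computes a local top-k, and the coordinator merges them into a global top-k. The in-memory numpy “shards” stand in for real index servers; a production system would make the shard calls over the network and in parallel.

```python
# Toy scatter-gather sketch: broadcast a query to several shards, take each
# shard's local top-k, and merge into a global top-k.
import heapq
import numpy as np

def shard_topk(shard_vecs, shard_ids, q, k):
    """Local top-k by inner product for one shard."""
    scores = shard_vecs @ q
    top = np.argpartition(scores, -k)[-k:]
    return [(float(scores[i]), int(shard_ids[i])) for i in top]

def distributed_search(shards, q, k=10):
    """Scatter the query to every shard, gather local results, merge globally."""
    partials = [shard_topk(vecs, ids, q, k) for vecs, ids in shards]
    merged = heapq.nlargest(k, (hit for part in partials for hit in part))
    return merged                      # [(score, doc_id), ...], best first

d = 64
shards, offset = [], 0
for n in (30_000, 30_000, 40_000):     # three unevenly sized shards
    shards.append((np.random.randn(n, d).astype("float32"),
                   np.arange(offset, offset + n)))
    offset += n

query = np.random.randn(d).astype("float32")
print(distributed_search(shards, query))
```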
Trends, Trade-offs, and Practical Takeaways
Trends: The recent literature shows a clear trend toward hybridization – combining methods (sparse + dense, retrieval + LLM reasoning, CPU + GPU, etc.) to overcome individual limitations. Rather than an end-to-end monolithic model, the best systems of 2024 have an ecosystem of components (specialized indexes, rerankers, accelerators, knowledge modules) working together. This modular approach is well-suited to large-scale deployments where different components can be scaled and optimized independently.
Trade-offs: A recurring theme is balancing accuracy vs efficiency vs memory. Dense vectors improve semantic accuracy but need ANN indexing to be efficient; adding sparse signals improves accuracy further but complicates the index. Increasing index size (or using more hardware) can speed up queries but at higher cost. Each improvement often comes with a trade-off: e.g., quantization sacrifices a bit of precision for big memory savings, approximate search sacrifices a bit of recall for speed, distributed search sacrifices some latency to gain scale. Understanding these trade-offs is crucial for practitioners. For example, if an application’s queries are simple and precision-critical, a well-tuned BM25 on a single node might suffice; but for open-domain Q&A with nuanced language, dense or hybrid retrieval with powerful hardware is worth the investment. Recent empirical studies (HERE) serve as valuable guides for making these decisions based on corpus size, query complexity, and update frequency.
Practical Implementations: Many of these research advances are filtering into open-source tools and industry practice. Vector databases (Milvus, Pinecone, Weaviate, etc.) implement ANN algorithms like HNSW or IVF-PQ under the hood, often inspired by papers. Likewise, LLM frameworks for RAG (Haystack, LlamaIndex) allow easy switching between sparse, dense, or hybrid retrieval backends as per the latest methods. Hardware acceleration is also becoming accessible – e.g. NVIDIA’s libraries for vector search, or cloud FPGA services. Engineers deploying LLMs at scale should keep an eye on emerging techniques like retrieval-aware attention or FPGA-as-a-service, as these can dramatically cut costs or latency for large deployments. At the same time, simpler optimizations (sharding, caching, index parameter tuning) continue to provide solid wins and are supported by matured infrastructure.
In summary, the 2024–2025 research landscape for LLM-oriented search is rich with innovations. Algorithmically, we see a synthesis of dense and sparse retrieval, smarter use of LLMs in the loop, and improved understanding of how to retrieve the right knowledge for generation. Infrastructure-wise, there’s a push toward making retrieval blazingly fast and scalable through better indexes, compression, and exploitation of modern hardware (GPUs, FPGAs, high-speed interconnects). The convergence of these advances brings us closer to LLM systems that can reliably and efficiently tap into virtually unlimited external knowledge, delivering accurate results even in enterprise or web-scale settings. Each piece – from algorithm to index to hardware to architecture – contributes to the overarching goal: maximizing knowledge value per query in the era of large-scale AI.
Sources:
Z. Jing et al. (2024). “When Large Language Models Meet Vector Databases: A Survey.” – Overview of LLM challenges and how vector databases (embedding indexes) address them (HERE).
J. Lin (2024). “Operational Advice for Dense and Sparse Retrievers: HNSW, Flat, or Inverted Indexes?” – Empirical comparison of index structures (HNSW vs. flat vs. inverted) and dense vs. sparse retrieval trade-offs (HERE).
M. Riyadh et al. (2024). “LLM-assisted Vector Similarity Search.” – Hybrid two-step retrieval (vector search + LLM re-ranking) to handle complex queries (LLM-assisted Vector Similarity Search).
S. Zhuang et al. (2024). “PromptReps: Prompting LLMs to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval.” – Using LLMs to create hybrid sparse+dense document representations, achieving SOTA zero-shot retrieval performance (PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval).
H. Zhang et al. (2024). “Efficient and Effective Retrieval of Dense-Sparse Hybrid Vectors using Graph-based ANN Search.” – Unified graph index for combined sparse+dense vectors, with distribution alignment and two-stage search, greatly improving hybrid search throughput (Efficient and Effective Retrieval of Dense-Sparse Hybrid Vectors using Graph-based Approximate Nearest Neighbor Search).
B. Harwood et al. (2025). “Approximate Nearest Neighbour Search on Dynamic Datasets: An Investigation.” – Evaluation of ANN methods under frequent updates; shows HNSW advantages for dynamic data and k-d tree weaknesses (Approximate Nearest Neighbour Search on Dynamic Datasets: An Investigation).
D. Liu et al. (2024). “RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval.” – Introduces attention-aware ANN to fetch relevant past context tokens, enabling 128K-token contexts with minimal resources (RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval).
Y. Yuan et al. (2024). “A Hybrid RAG System with Comprehensive Enhancement on Complex Reasoning.” – Describes a complex pipeline combining refined retrieval, knowledge extraction, and reasoning for RAG, yielding superior performance on multi-hop reasoning tasks (A Hybrid RAG System with Comprehensive Enhancement on Complex Reasoning).
A. Zhang et al. (2024). “BANG: Billion-Scale ANN Search using a Single GPU.” – GPU–CPU hybrid design with compressed vectors, overlapping computation and transfer, achieving order-of-magnitude speedups on billion-scale data (BANG: Billion-Scale Approximate Nearest Neighbor Search using a Single GPU).
Y. Zhang et al. (2024). “Accelerating Graph-based Vector Search via Delayed-Synchronization Traversal.” – FPGA-based Falcon accelerator and DST algorithm, dramatically reducing latency and energy for ANN search (Accelerating Graph-based Vector Search via Delayed-Synchronization Traversal). Also discusses the impact of search latency on LLM serving pipelines.
(Additional citations for specific points are embedded inline above.)