Table of Contents
Neural Ranking Fusion Techniques
Multi-Modal and Hybrid Search Merging
Scalability vs. Accuracy Trade-offs
Evaluation Metrics and Benchmarks
Conclusion
In the context of Advanced Search Algorithms in LLMs, if you have search results from multiple methods, how would you merge and homogenize the rankings into a single result set?
Large Language Models (LLMs) often rely on external search results (documents, passages, etc.) to augment their answers. When multiple retrieval methods or modalities are used (e.g. keyword search, neural embeddings, image search, metadata filtering), a key challenge is rank fusion – merging these different result lists into one coherent, relevant ranking. Recent literature (2024–2025) has explored both traditional rank fusion algorithms (like Reciprocal Rank Fusion) and advanced neural approaches to improve retrieval effectiveness in such multi-source settings. This review covers: (1) neural ranking fusion techniques (from RRF to deep learning models), (2) multi-modal search result merging (text, images, metadata, etc.), (3) scalability vs. accuracy trade-offs in large-scale rank fusion, and (4) evaluation metrics & benchmarks used to assess these methods. Key findings from recent papers are highlighted with citations.
Neural Ranking Fusion Techniques
Reciprocal Rank Fusion (RRF): RRF is a simple yet effective rank aggregation method that combines results from multiple ranked lists by summing the reciprocals of a document's rank positions (RAG-Fusion: a New Take on Retrieval-Augmented Generation). For a document d appearing in several result lists, RRF assigns the score score(d) = Σ_i 1/(k + rank_i(d)) (with k a smoothing constant) and produces a fused ranking. Even though RRF is not a neural method, it has been a strong baseline in recent studies due to its robust performance. For example, Rackauckas (2024) notes that RRF “outperforms many other document reranking methods” when used to merge results. Because of its simplicity (no training needed) and solid effectiveness, RRF is widely used to merge outputs of heterogeneous search systems (e.g. lexical and neural search) into a single list (Hybrid search scoring (RRF) - Azure AI Search - Microsoft Learn).
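To make the formula concrete, here is a minimal Python sketch of RRF fusion over several ranked lists. The document IDs and the k=60 default are illustrative assumptions, not values taken from any of the cited papers.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse multiple ranked lists of document IDs with RRF.

    ranked_lists: list of rankings, each ordered best-first.
    k: smoothing constant (60 is a common default).
    Returns document IDs sorted by fused score, best first.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: merge a lexical (BM25) list with a dense-retriever list.
bm25_hits = ["doc3", "doc1", "doc7", "doc2"]
dense_hits = ["doc1", "doc5", "doc3", "doc9"]
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))
# doc1 and doc3 rise to the top because both methods rank them well.
```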
RAG-Fusion (RRF with LLM-generated queries): A novel use of RRF in the LLM context is RAG-Fusion, which integrates reciprocal rank fusion into Retrieval-Augmented Generation (RAG). Instead of relying on a single query, RAG-Fusion generates multiple queries via an LLM and retrieves documents for each, then fuses those results with RRF. This approach was shown to produce more accurate and comprehensive answers in a QA chatbot setting by “contextualizing the original query from various perspectives” through multiple query variants. The fused ranking of documents provides the LLM with diverse yet relevant context, improving answer quality. Zackary Rackauckas (2024) found that RAG-Fusion delivered more comprehensive answers than a standard single-query RAG, although care must be taken as irrelevant generated queries can introduce off-topic results.
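The retrieval step of this idea can be sketched as follows. This is not the paper's code: `llm_generate_variants` and `search` are hypothetical stand-ins for an LLM call and a retrieval backend, and the sketch reuses the `reciprocal_rank_fusion` helper from above.

```python
def rag_fusion_retrieve(original_query, llm_generate_variants, search,
                        n_variants=4, top_k=20):
    """RAG-Fusion style retrieval: multi-query expansion + RRF merging."""
    # 1) Ask an LLM for reformulations of the user query.
    queries = [original_query] + llm_generate_variants(original_query, n=n_variants)
    # 2) Retrieve a ranked list of document IDs for every query variant.
    ranked_lists = [search(q, top_k=top_k) for q in queries]
    # 3) Fuse the lists with RRF; the result is the context handed to the generator LLM.
    return reciprocal_rank_fusion(ranked_lists)
```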
Learning-to-Rank and Neural Fusion Models: Beyond RRF’s fixed formula, researchers have explored trainable models to fuse rankings. Traditional learning-to-rank ensembles (e.g. LambdaMART or neural networks) can learn optimal combinations of multiple ranker outputs. A 2024 CLEF LongEval study by Gründel et al. fused a lexical BM25 ranker, an LLM-based listwise re-ranker (RankZephyr), and a late-interaction dense retriever (ColBERT) using a weighted rank fusion scheme (Neural Re-Ranking and Rank Fusion for Temporal Stability). Their weighted rank fusion achieved higher immediate effectiveness than any single model, improving retrieval accuracy (nDCG) by combining the complementary strengths of lexical and neural methods. However, they observed a trade-off: the fused system was less stable over time, degrading more in longitudinal evaluations than the individual models. This highlights that while learned fusion can boost raw relevance metrics, it may need regular retuning to remain robust. Overall, neural rank fusion methods often outperform static fusion rules in effectiveness, given sufficient training data, but can be sensitive and resource-intensive (Rank Fusion Algorithms - From Simple to Advanced).
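A weighted variant of rank fusion, in the spirit of the scheme above, can be expressed as a small extension of RRF. The sketch below is illustrative only: the per-system weights would normally be tuned on held-out queries, and the values shown are not taken from the LongEval study.

```python
from collections import defaultdict

def weighted_rank_fusion(ranked_lists, weights, k=60):
    """RRF with a tuned (or learned) weight per source system."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Illustrative: give the neural re-ranker twice the influence of BM25.
bm25_list = ["doc3", "doc1", "doc7"]
reranker_list = ["doc1", "doc5", "doc3"]
print(weighted_rank_fusion([bm25_list, reranker_list], weights=[1.0, 2.0]))
```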
Transformer-Based Fusion (Fusion-in-T5): An advanced line of work collapses multi-stage ranking pipelines into a single neural model. Fusion-in-T5 (FiT5) by Yu et al. (2024) is a re-ranking model that unifies various ranking signals (text matching scores, traditional features, and context from other documents) within a single T5 transformer (Fusion-in-T5: Unifying Variant Signals for Simple and Effective Document Ranking with Attention Fusion). Instead of merging rankings post-hoc, FiT5 takes the query, candidate document text, the document’s initial rank features, and even top-ranked neighbors (pseudo-relevance feedback) as input, and uses attention-based fusion to re-score documents. This neural fusion approach significantly improved ranking accuracy on standard benchmarks – e.g. on MS MARCO passage ranking and TREC Deep Learning, FiT5 achieved higher MRR and NDCG than traditional multi-step pipelines. Notably, FiT5 outperformed a cascade of a BERT ranker + feature-based ranker + query expansion, simplifying the architecture while improving effectiveness. This demonstrates the power of a neural model to learn an optimal fusion of ranking signals end-to-end. Importantly, FiT5 maintained efficiency: it increased inference time and GPU memory by only ~4.5% compared to a single T5 ranker, despite replacing an entire multi-model pipeline. This result shows that neural rank fusion can be both effective and scalable with careful design (e.g. attention mechanisms that allow cross-document feature fusion).
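As a purely schematic illustration of the "all signals in one input" idea, the snippet below folds a query, rank features, neighbor context, and document text into a single sequence for a seq-to-seq re-ranker. The field markers and feature names are invented for illustration and do not reproduce FiT5's actual input template.

```python
def build_fusion_input(query, doc_text, initial_rank, features, neighbor_snippets):
    """Serialize heterogeneous ranking signals into one re-ranker input string."""
    feature_str = " ".join(f"{name}:{value}" for name, value in features.items())
    context = " | ".join(neighbor_snippets)  # e.g. snippets of top-ranked documents (PRF-style)
    return (f"query: {query} rank: {initial_rank} features: {feature_str} "
            f"context: {context} document: {doc_text}")

example_input = build_fusion_input(
    query="rank fusion for hybrid search",
    doc_text="Reciprocal rank fusion combines multiple ranked lists ...",
    initial_rank=3,
    features={"bm25": 12.4, "clicks": 87},
    neighbor_snippets=["Hybrid search scoring", "Dense-sparse retrieval"],
)
```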
Multi-Modal and Hybrid Search Merging
Beyond text-only retrieval, modern LLM applications often deal with multi-modal search – for instance, retrieving text documents, images, or using structured metadata filters together. Merging results across modalities or index types presents unique challenges, as the relevance scores may not be directly comparable. Recent research in 2024–2025 has proposed methods for multi-modal rank fusion:
Late Fusion (“Two-Leg” Models): A common approach is to maintain separate models for each modality (e.g. one for text, one for vision) and then combine their outputs. Wei et al. (2024) introduced UniIR, a “two-legs” multi-modal retriever that performs score-level and feature-level fusion of separate text and image retrievers (Joint Fusion and Encoding: Advancing Multimodal Retrieval from the Ground Up). In this framework, text queries and images are encoded separately (e.g. via CLIP or BLIP models), and a fusion network then merges the scores or representations from each modality to produce a final ranking. This kind of late fusion improved over single-modality baselines by leveraging complementary cues – for example, combining visual similarity scores with textual relevance improved overall retrieval of image-text pairs. However, late fusion still treats modalities independently up to the final step, which can miss fine-grained cross-modal interactions.
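A minimal score-level late-fusion sketch is shown below. This is not UniIR's fusion network; the min-max normalization and the mixing weight alpha are assumptions standing in for whatever comparability mechanism and weighting a real system would learn or tune.

```python
import numpy as np

def late_fusion_scores(text_scores, image_scores, alpha=0.5):
    """Score-level late fusion of two modality-specific retrievers.

    Scores are min-max normalized per modality so they are comparable,
    then mixed with weight alpha (chosen on validation data).
    """
    def normalize(scores):
        scores = np.asarray(scores, dtype=float)
        span = scores.max() - scores.min()
        return (scores - scores.min()) / span if span > 0 else np.zeros_like(scores)

    return alpha * normalize(text_scores) + (1 - alpha) * normalize(image_scores)

# Three candidates scored by a text encoder and an image encoder (illustrative numbers).
fused = late_fusion_scores([0.71, 0.42, 0.65], [0.30, 0.55, 0.52], alpha=0.6)
ranking = np.argsort(-fused)  # candidate indices, best first
```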
Early Fusion (“One-Tower” Models): Newer techniques attempt to fuse modalities from the input encoding stage, allowing joint reasoning over text and visual content. Huang et al. (2025) proposed a Joint Fusion Encoder (JFE) that processes multi-modal inputs in a single transformer model for retrieval. Instead of merging at the score level, JFE integrates visual and textual cues within the same embedding space through early cross-attention. This approach yielded notable improvements in retrieval tasks that require true cross-modal understanding, such as queries involving both image and text descriptors. In their experiments, the one-tower JFE significantly outperformed two-tower late-fusion models on complex multi-modal queries, beating them on Recall@1 by a wide margin: JFE attained ~20.1% Recall@1 averaged over multi-modal tasks, versus ~17-18% for prior state-of-the-art late-fusion systems. These gains underscore the effectiveness of early fusion when the query’s intent crosses modalities (e.g. a text query referencing specific visual attributes). The drawback is that a unified model can be harder to train (it needs joint multi-modal data), but it opens up richer interactions that late fusion might overlook.
Hybrid Text Retrieval (Lexical + Neural): A special case of multi-source merging is hybrid search, which combines sparse lexical search (e.g. BM25) with dense vector search (e.g. bi-encoder embeddings). This has been actively studied as it merges “bag-of-words” relevance with semantic matching. Empirically, fusing dense and sparse results yields higher recall and robustness than either alone (Efficient and Effective Retrieval of Dense-Sparse Hybrid Vectors using Graph-based Approximate Nearest Neighbor Search). The simplest approach is again RRF or a weighted score combination of the two separate rankings; Microsoft’s Azure AI Search and Elasticsearch even have built-in support for RRF to blend vector and keyword results (Hybrid search scoring (RRF) - Azure AI Search - Microsoft Learn). However, a 2024 study by Zhang et al. points out that searching separately and then merging can hurt scalability and miss some relevant items. If each method returns only a top-k list, their union might still exclude documents that would be highly ranked in an ideal combined space, due to the lack of overlap between the lists. To address this, Zhang et al. (2024) propose a unified dense-sparse index with a graph-based ANN search that handles hybrid vectors directly. By aligning the distributions of sparse and dense vectors and using a two-stage retrieval (first dense-only, then dense+sparse), they improved hybrid search accuracy by up to 9% while greatly boosting efficiency; a simplified sketch of this staged idea follows below. In summary, hybrid retrieval merging is enhanced either by simple fusion (robust, easy) or by deeper integration (for better scaling). The consensus in recent work is that combining lexical and neural signals leads to better relevance – e.g. LexBoost (Kalamkar et al., 2024) and others show consistent gains in metrics like MAP by blending BM25 with neural neighbor information – but the challenge is doing so efficiently.
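The staged idea can be sketched with brute-force NumPy as follows. This is not the paper's graph-based index: sparse vectors are represented as small dense arrays for simplicity, and `first_stage_k` and `sparse_weight` are assumed tuning knobs.

```python
import numpy as np

def two_stage_hybrid_search(query_dense, query_sparse, doc_dense, doc_sparse,
                            first_stage_k=100, final_k=10, sparse_weight=0.3):
    """Two-stage hybrid retrieval sketch.

    Stage 1: rank all documents by dense similarity only (in practice an ANN index).
    Stage 2: rescore the surviving candidates with a combined dense + sparse score.
    """
    dense_scores = doc_dense @ query_dense                      # cheap first-stage signal
    candidates = np.argsort(-dense_scores)[:first_stage_k]
    combined = (dense_scores[candidates]
                + sparse_weight * (doc_sparse[candidates] @ query_sparse))
    return candidates[np.argsort(-combined)][:final_k]

# Toy corpus: 1000 documents with 64-d dense and 32-d "sparse" vectors.
rng = np.random.default_rng(0)
doc_dense, doc_sparse = rng.normal(size=(1000, 64)), rng.random((1000, 32))
top_docs = two_stage_hybrid_search(rng.normal(size=64), rng.random(32),
                                   doc_dense, doc_sparse)
```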
Notably, beyond text and image, metadata and structured fields can also be fused into rankings. One approach is to treat metadata as additional features in a learning-to-rank model. The Fusion-in-T5 model mentioned earlier is an example: it injected categorical and numeric features (like document quality scores, click counts, etc.) alongside text into the re-ranker’s input (Fusion-in-T5: Unifying Variant Signals for Simple and Effective Document Ranking with Attention Fusion). This kind of multi-feature fusion improved fine-grained ranking decisions (the model can learn, for instance, to prioritize newer documents or penalize duplicates as part of its holistic relevance prediction). Overall, multi-modal and multi-source rank fusion techniques in 2024/25 trend towards unification – either by late-stage ensembles of specialized models, or by early-fusion architectures that combine diverse signals within one neural model.
Scalability vs. Accuracy Trade-offs
Merging search results from multiple methods must balance accuracy improvements with scalability, especially on large corpora. Key considerations from recent research include:
Complexity of Merging Multiple Rankings: Running several retrieval methods in parallel and then fusing results can increase system complexity and latency. Zhang et al. (2024) highlight that the “two-route” hybrid search (dense and sparse handled separately) suffers from poor scalability due to duplicate indexing and search overhead (Efficient and Effective Retrieval of Dense-Sparse Hybrid Vectors using Graph-based Approximate Nearest Neighbor Search). Naively increasing the number of systems (e.g. more query prompts in RAG-Fusion, or multiple modality-specific models) can lead to slower responses and higher resource usage. One must also retrieve sufficiently deep results from each source to avoid missing relevant items in the merged list, which further increases computational cost. In practice, there is a trade-off in how many results to fuse: using only the top 10 from each source is fast but might drop items that would have been 11th in both lists yet 1st when combined, whereas taking the top 100 from each improves recall but adds overhead.
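As a quick illustration of this depth trade-off under RRF with k = 60 (an assumed constant): a document ranked 11th in two lists scores 2/(60 + 11) ≈ 0.028, while a document ranked 1st in only one list scores 1/(60 + 1) ≈ 0.016, so the doubly-retrieved document should win the fused ranking, yet it disappears entirely if each source is truncated at its top 10.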
Efficiency Optimizations: Recent works propose methods to achieve near-optimal accuracy at lower cost. For example, the graph-based hybrid retrieval by Zhang et al. uses distribution alignment and staged retrieval – computing cheap dense similarities first, then gradually incorporating sparse signals – to reduce unnecessary computations. They report a ~9× to 11× throughput increase at equal accuracy compared to a brute-force separate fusion approach. This shows that careful algorithmic design (e.g. unified ANN indices, pruning less-informative features) can greatly improve scalability without sacrificing relevance. Another approach is caching and reusing computations: if multiple rankers use the same initial results (like a first-stage BM25 list), overall cost can be kept modest.
Neural Model Scalability: Advanced neural fusion models (like re-rankers or transformers that consider multiple inputs) raise concerns about runtime cost on large candidate sets. A straightforward ensemble of large models can be prohibitively slow – e.g. the Expando-Mono-Duo system (2021) required pairwise cross-encoder evaluations over n candidates, leading to enormous complexity (Fusion-in-T5: Unifying Variant Signals for Simple and Effective Document Ranking with Attention Fusion). Fusion-in-T5 (2024) addressed this by consolidating stages: instead of two separate re-rankers for features and context, it uses one transformer to handle all signals. Impressively, FiT5 achieved superior ranking accuracy with only a marginal (~4.5%) increase in inference time and memory use compared to a single-stage T5 re-ranker. This indicates that neural rank fusion can scale to large document sets if the architecture is optimized for parallel processing (e.g. attending to multiple documents in one forward pass). The use of global attention in FiT5 allowed it to process 100 candidate passages together efficiently, avoiding the linear cost of re-ranking each document independently.
Scalability of RRF and Simpler Methods: Simpler fusion algorithms like RRF, CombSUM, or Borda Count have essentially linear complexity in the total number of results fused, and they require no model training. This makes them highly scalable for large-scale search, where one can afford to post-process hundreds or thousands of results. For instance, OpenSearch and Elastic have adopted RRF for production hybrid search because it adds minimal latency overhead yet consistently improves recall (Hybrid search scoring (RRF) - Azure AI Search - Microsoft Learn). The flip side is that these methods do not explicitly optimize a particular objective (they are heuristic), so while scalable, they might not reach the absolute best accuracy that a learned model could. Nonetheless, the strong performance of RRF in many studies is a reason it remains a go-to solution when computational budget is a concern (RAG-Fusion: a New Take on Retrieval-Augmented Generation).
In summary, modern rank fusion research emphasizes efficient architectures that retain the gains of combining signals without overwhelming resource usage. This includes unified indexes for multi-modal vectors, single-model fusion re-rankers, and parameter-light ensembles. There is often a sweet spot sought between accuracy and efficiency: e.g. combining a small number of high-quality rankers can yield most of the gains of ensembling without the cost of dozens of models. The 2024 LongEval study even suggests considering temporal stability: too complex a pipeline might become brittle as data evolves (Neural Re-Ranking and Rank Fusion for Temporal Stability). Overall, scalable rank fusion is achieved by reducing redundant computations (merging pipelines, unified retrieval) and leveraging the power of neural networks to make one model do the work of many.
Evaluation Metrics and Benchmarks
To objectively evaluate ranking fusion methods, researchers rely on standard information retrieval metrics and benchmark datasets. The common metrics in 2024–2025 literature include:
Normalized Discounted Cumulative Gain (NDCG): NDCG@K measures the quality of a ranked list by considering the graded relevance of results and discounting lower ranks. It is widely used in web search and document ranking tasks. For example, Yu et al. report NDCG@10 on the TREC Deep Learning 2019/2020 benchmarks to compare their FiT5 model with baselines (Fusion-in-T5: Unifying Variant Signals for Simple and Effective Document Ranking with Attention Fusion). FiT5 achieved an NDCG@10 of 0.776 on TREC DL’19, outperforming a BERT ranker (0.701) and a previous T5-based reranker (0.726). Such improvements in NDCG indicate better ranking of highly relevant documents at the top of the list. In LongEval 2024, nDCG was also used to measure effectiveness over time; interestingly, the nDCG of all systems declined on newer data, highlighting the metric’s role in detecting temporal performance drops (Neural Re-Ranking and Rank Fusion for Temporal Stability).
Mean Reciprocal Rank (MRR): MRR focuses on the rank of the first relevant result. It is popular in QA and passage retrieval evaluations (e.g. MS MARCO) where typically only one or a few answers are relevant. Many papers report MRR@10 for the MS MARCO passage ranking benchmark. In Fusion-in-T5’s results, the model’s MRR@10 on MS MARCO was 0.439, compared to 0.406 for a prior monoT5 re-ranker. An MRR improvement means users, on average, find a relevant answer slightly earlier in the ranked results. MRR is straightforward to interpret for single-answer tasks, and improvements reflect a higher probability that the top answer is correct.
Recall@K: Especially in multi-modal retrieval and first-stage retrieval, Recall at K is crucial. Recall@K is the fraction of all relevant items that appear in the top K results. For instance, Huang et al. (2025) use Recall@1, 5, and 10 to evaluate their multi-modal JFE model (Joint Fusion and Encoding: Advancing Multimodal Retrieval from the Ground Up). JFE’s significant boost to Recall@1 (e.g. 20.1% vs. ~15-18% for others) directly translates to a higher chance that the very top result is relevant in a cross-modal search. Datasets like FashionIQ and COCO retrieval typically report Recall@10 or Recall@5 as primary metrics. In hybrid search studies, Recall@10 or Recall@100 is often examined to ensure that fusing methods increases the coverage of relevant documents (Efficient and Effective Retrieval of Dense-Sparse Hybrid Vectors using Graph-based Approximate Nearest Neighbor Search). For example, a hybrid dense+lexical approach might report that it achieves higher Recall@100 than either method alone, demonstrating that their union finds more relevant items.
Mean Average Precision (MAP) and others: Some works include MAP or Precision@K. MAP (Mean of Average Precision for each query) was mentioned in the context of CLEF LongEval and other fusion research as well. It gives a single-figure summary of precision across recall levels. While not as common as NDCG in recent neural IR papers, MAP@10 or MAP can still appear in evaluation tables (Fusion-in-T5: Unifying Variant Signals for Simple and Effective Document Ranking with Attention Fusion). Additionally, specialized metrics like MRR@100 or success@K might be used in certain benchmarks (e.g. large-scale QA). The choice of metric often aligns with the benchmark: MS MARCO Dev uses MRR@10 by tradition, TREC DL uses NDCG and MAP, and image retrieval tasks use Recall@K.
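For reference, here is a minimal Python sketch of the three metrics used most often above, computed for a single query from a list of graded relevance labels in ranked order. It uses the linear-gain DCG variant; production evaluations typically rely on tooling such as trec_eval or the ir_measures package, which handle cutoffs, ties, and averaging across queries more carefully.

```python
import math

def ndcg_at_k(relevances, k):
    """NDCG@k from graded relevance labels ordered as ranked (linear gain)."""
    def dcg(rels):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def mrr_at_k(relevances, k):
    """Reciprocal rank of the first relevant (rel > 0) result within the top k."""
    for i, rel in enumerate(relevances[:k]):
        if rel > 0:
            return 1.0 / (i + 1)
    return 0.0

def recall_at_k(relevances, k, total_relevant):
    """Fraction of all relevant items that appear in the top k."""
    hits = sum(1 for rel in relevances[:k] if rel > 0)
    return hits / total_relevant if total_relevant > 0 else 0.0

# Graded relevance of one fused ranking (illustrative labels; 4 relevant items exist).
rels = [3, 0, 2, 0, 1, 0, 0, 0, 0, 0]
print(ndcg_at_k(rels, 10), mrr_at_k(rels, 10), recall_at_k(rels, 10, total_relevant=4))
```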
Benchmarks and Datasets: A number of standard benchmarks facilitate comparison of rank fusion methods. The MS MARCO Passage Ranking dataset (roughly 8.8M passages with sparse relevance labels) is a common testbed – FiT5 and LexBoost both used it (LexBoost: Improving Lexical Document Retrieval with Nearest Neighbors). The TREC Deep Learning 2019 & 2020 sets (ad hoc retrieval with graded labels) are often used for NDCG/MAP evaluation of re-rankers. For multi-modal retrieval, datasets like COCO, Flickr30k, or fashion/product search datasets (FashionIQ, Fashion200K) are used to test image-text fusion models (Joint Fusion and Encoding: Advancing Multimodal Retrieval from the Ground Up). These benchmarks provide relevance judgments that allow computing the above metrics. In the context of LLMs, some works create custom evaluations: Rackauckas (2024) manually evaluated answer accuracy, relevance, and comprehensiveness for the RAG-Fusion chatbot (RAG-Fusion: a New Take on Retrieval-Augmented Generation), since the end goal was a correct answer rather than just a document ranking. However, even in such cases, intermediate metrics like recall of supporting documents or answer recall are considered.
It’s worth noting that achieving improvements on these metrics is increasingly challenging. By 2024, baseline neural retrievers are already strong, so a fusion method must demonstrate statistically significant gains. For example, Fusion-in-T5’s +5% NDCG gain over a strong monoT5 was statistically significant (p < 0.05) in a paired test (Fusion-in-T5: Unifying Variant Signals for Simple and Effective Document Ranking with Attention Fusion), emphasizing that the improvement is reliable. Similarly, JFE’s multi-modal recall jump was large enough to be clearly state-of-the-art. Researchers also use significance tests (t-test, permutation test) to ensure that any reported metric gains from fusion are not due to chance.
Summary of Performance: State-of-the-art fusion methods in 2024/2025 have delivered notable improvements. To illustrate, FiT5’s one-model fusion achieved MRR@10 of ~0.44 on MS MARCO and NDCG@10 of ~0.87 on TREC DL 2020, outperforming multi-model pipelines. A hybrid ranker in LongEval (2024) combining BM25, ColBERT, and RankZephyr topped out around nDCG ~0.65 on a fresh web corpus, higher than any single model (Neural Re-Ranking and Rank Fusion for Temporal Stability). For multi-modal retrieval, JFE pushed top-1 recall above 20%, whereas prior CLIP-based methods were in the teens (Joint Fusion and Encoding: Advancing Multimodal Retrieval from the Ground Up). These figures, while specific to their test sets, indicate that rank fusion techniques can measurably advance the state of the art on standard benchmarks, making search results more relevant for LLMs to consume.
Conclusion
In the era of LLMs and heterogeneous data sources, merging search results from multiple methods has become a crucial component of building high-quality retrieval-augmented systems. Neural rank fusion techniques build upon simple yet strong baselines like RRF and have evolved into sophisticated models that can learn to blend lexical, semantic, and even multi-modal signals. Recent papers from 2024–2025 demonstrate that such fusion can significantly boost relevance (higher NDCG, MRR, recall) by leveraging the complementary strengths of different retrieval methods (RAG-Fusion: a New Take on Retrieval-Augmented Generation). At the same time, researchers are mindful of efficiency: approaches like attention-based fusion and unified indexing show promising results in maintaining scalability on large corpora (Fusion-in-T5: Unifying Variant Signals for Simple and Effective Document Ranking with Attention Fusion). There is also a growing recognition of new evaluation dimensions – for example, stability over time (longitudinal robustness) – when deploying fused rankers in dynamic environments.
Overall, the literature indicates that rank fusion is beneficial for accuracy across a variety of scenarios: hybrid text search, multi-modal retrieval, and LLM query augmentation all see improvements from combining evidence. The best method often depends on context: RRF remains a competitive choice for quick hybrid merges, whereas a trained transformer fusion may win out when maximum accuracy is needed. Multi-modal retrieval has particularly embraced early-fusion models for their superior understanding of complex queries (Joint Fusion and Encoding: Advancing Multimodal Retrieval from the Ground Up). As we move forward, one can expect continued research on making neural fusion methods more lightweight and adaptive. The advances of 2024 and 2025 set the stage for retrieval systems that seamlessly integrate multi-source information, delivering ranked results that are both highly relevant and efficiently obtained – a key enabler for powerful, context-aware LLM applications.
References (2024–2025): Recent works that informed this review include Rackauckas (2024) on RAG-Fusion (RAG-Fusion: a New Take on Retrieval-Augmented Generation), Yu et al. (2024) on Fusion-in-T5 (Fusion-in-T5: Unifying Variant Signals for Simple and Effective Document Ranking with Attention Fusion), Huang et al. (2025) on the JFE multi-modal retriever (Joint Fusion and Encoding: Advancing Multimodal Retrieval from the Ground Up), Zhang et al. (2024) on efficient dense-sparse hybrid search (Efficient and Effective Retrieval of Dense-Sparse Hybrid Vectors using Graph-based Approximate Nearest Neighbor Search), and the CLEF LongEval 2024 study on rank fusion stability by Gründel et al. (Neural Re-Ranking and Rank Fusion for Temporal Stability), among others. These studies collectively push the frontier of rank fusion, providing both algorithmic innovations and rigorous evaluations (NDCG, MRR, Recall@K, etc.) to guide the development of next-generation retrieval systems. Each demonstrates how merging multi-faceted search results – when done intelligently – can greatly enhance the information available to LLMs, ultimately leading to more accurate and robust AI systems.