Table of Contents
Comparison of Retrieval Evaluation Metrics
Offline vs. Online Evaluation Methods
Benchmarks and Datasets for Evaluation
In the context of Advanced Search Algorithms in LLMs, if I have a recommendation system, which metric should I use to evaluate it?
Recent arXiv papers (2024–2025) explore how to evaluate document retrieval in recommendation systems that incorporate large language models (LLMs). Key themes include which retrieval metrics to use, how offline evaluations compare to online tests, and what benchmarks/datasets are available. Below, we review the latest findings in these areas.
Comparison of Retrieval Evaluation Metrics
Common Metrics and Their Trade-offs: Modern recommender and retrieval systems use a variety of ranking metrics, each with strengths and limitations. Widely used metrics include Normalized Discounted Cumulative Gain (NDCG@K), Precision@K, Recall@K, Hit Rate@K (a variant of recall), Mean Reciprocal Rank (MRR), and Area Under the ROC Curve (AUC) (A Comprehensive Survey of Evaluation Techniques for Recommendation Systems). These metrics provide different perspectives on retrieval quality; the short code sketch after the definitions below shows how each one is computed:
NDCG@K: Measures ranking quality by rewarding highly relevant items appearing early in the list. It normalizes DCG by the ideal ranking’s DCG, yielding 1.0 for a perfect rank order. NDCG is popular in recent studies and often serves as a primary benchmark for new methods (On (Normalised) Discounted Cumulative Gain as an Off-Policy Evaluation Metric for Top-n Recommendation). Its strength is handling graded relevance (not all hits are equally relevant) and emphasizing top-ranked results. However, recent research has pointed out limitations. Jeunen et al. (2024) show that normalizing DCG can introduce inconsistencies: using NDCG may invert the true performance order of systems even when unnormalized DCG would correlate with actual online rewards. In other words, an algorithm that truly yields higher user reward might get a lower NDCG than a competitor due to the normalization factor. This suggests NDCG, while prevalent, can mislead if used improperly.
Recall@K (Hit Rate@K): Checks what fraction of all relevant documents are retrieved in the top K results. It is crucial for coverage – ensuring the system finds as many relevant items as possible. Recall is especially important in LLM-powered pipelines where an LLM can potentially scan a set of retrieved documents internally. In fact, generative IR research suggests that if the LLM (acting as a reader) is willing to consider many results, a high recall@K might be more valuable than a precision-focused metric like NDCG for the initial retrieval stage (Generative Information Retrieval Evaluation). The trade-off is that recall doesn’t account for ranking order; it treats a relevant item at rank 1 and rank 10 equally, which may not reflect a human user’s experience (though it could be acceptable if an LLM re-ranks or uses all retrieved items).
Precision@K: Measures the proportion of the top K results that are relevant. This emphasizes accuracy of the top recommendations. A high precision means users are likely to see mostly relevant items without wading through irrelevant ones. The trade-off is that precision ignores any relevant items beyond rank K and can be insensitive to the total number of relevant items available. In recommender settings with very few correct items per user query, precision@K can be a strict metric. It’s often used alongside recall to balance completeness vs. exactness.
MRR: Focuses on the rank of the first relevant result. It is the reciprocal of the rank of the first relevant item, averaged over queries. MRR is useful when typically only one item (the top recommendation or answer) truly matters to the user (e.g. first relevant article clicked). Its limitation is that it ignores the presence of other relevant items beyond the first; it’s a single-answer metric.
AUC: Common for evaluating implicit feedback models, AUC measures the probability that a random relevant item is ranked above a random irrelevant item. It considers overall ranking distributions and is threshold-agnostic. While AUC can be useful for binary relevance tasks, it has coarse granularity in the top ranks. Recent work notes that metrics like AUC “do not offer sufficient information for comparing subtle differences between two competitive recommender systems”, which can lead to different outcomes once deployed (RecSys Arena: Pair-wise Recommender System Evaluation with Large Language Models). In other words, two algorithms might have similar AUC, yet differ in how they rank the very top results, causing significant changes in user experience long-term.
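To make these definitions concrete, here is a minimal, self-contained Python sketch. It is not taken from any of the cited papers; the function names and toy data are our own, and it computes Precision@K, Recall@K, Hit Rate@K, MRR, NDCG@K, and AUC for a single ranked list against a set of known relevant items:

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(item in relevant for item in ranked[:k]) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    return sum(item in relevant for item in ranked[:k]) / len(relevant)

def hit_rate_at_k(ranked, relevant, k):
    """1.0 if at least one relevant item appears in the top-k, else 0.0."""
    return float(any(item in relevant for item in ranked[:k]))

def mrr(ranked, relevant):
    """Reciprocal rank of the first relevant item (0.0 if none is retrieved)."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def dcg_at_k(ranked, gains, k):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(gains.get(item, 0.0) / math.log2(rank + 1)
               for rank, item in enumerate(ranked[:k], start=1))

def ndcg_at_k(ranked, gains, k):
    """DCG normalized by the DCG of the ideal (gain-sorted) ranking."""
    ideal = sorted(gains, key=gains.get, reverse=True)
    ideal_dcg = dcg_at_k(ideal, gains, k)
    return dcg_at_k(ranked, gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

def auc(ranked, relevant):
    """Probability that a random relevant item outranks a random irrelevant one."""
    pos = [i for i, item in enumerate(ranked) if item in relevant]
    neg = [i for i, item in enumerate(ranked) if item not in relevant]
    if not pos or not neg:
        return 0.0
    wins = sum(p < n for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: items "a" and "d" are relevant; "a" has a higher graded relevance.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "d"}
gains = {"a": 3.0, "d": 1.0}                      # graded relevance used by NDCG
print(precision_at_k(ranked, relevant, 3))        # 0.33...
print(recall_at_k(ranked, relevant, 3))           # 0.5
print(hit_rate_at_k(ranked, relevant, 3))         # 1.0
print(mrr(ranked, relevant))                      # 1.0
print(ndcg_at_k(ranked, gains, 5))                # ~0.94
print(auc(ranked, relevant))                      # ~0.67
```

In practice these per-query values are averaged over all test queries or users, as the offline evaluation sketch in the next section illustrates.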
Best Metric – Is There One? Based on 2024/2025 literature, NDCG@K remains a favored all-purpose metric for document retrieval and recommendation ranking. Many state-of-the-art models are chiefly compared using NDCG (and Recall) at various cutoffs (Sparse Meets Dense: Unified Generative Recommendations with ...), because NDCG rewards getting highly relevant documents to the top, aligning with the goal of satisfying users quickly. Its graded relevance and position discounting reflect user behavior better than simple accuracy. However, researchers caution that no single metric is perfect. NDCG’s normalization can distort comparisons in some cases, and optimizing solely for NDCG might overlook other factors (diversity, novelty, etc.). Thus, the “best” metric is context-dependent: for first-stage retrieval feeding into an LLM, Recall@K might be most critical (ensure the LLM has the needed info), whereas for final ranked lists shown to users, NDCG@K is often preferred for balancing relevance and rank. The consensus is to use a suite of metrics to get a holistic view, but if one must be chosen, NDCG@K is frequently the leading choice due to its strong track record in correlating with user engagement and its widespread adoption in recent research.
Offline vs. Online Evaluation Methods
Offline Evaluation (Pre-deployment): Offline metrics like those above are computed on historical data—e.g. a held-out test set of user queries and documents with known relevance or past interaction labels. Examples include calculating NDCG or Recall@10 on a test split of a recommendation dataset (Online and Offline Evaluations of Collaborative Filtering and Content Based Recommender Systems). The role of offline evaluation is to benchmark and iterate on models quickly before live deployment. It provides a safe, low-cost proxy for performance: researchers can try many ideas without impacting real users (RecSys Arena: Pair-wise Recommender System Evaluation with Large Language Models). A number of studies emphasize that offline tests are essential for rapid algorithm development and for sanity-checking that a model meets a minimum quality bar (An Evaluation-Driven Approach to Designing LLM Agents: Process and Architecture). For instance, Elahi and Zirak (2024) report using offline ranking metrics (hit-rate@K, NDCG) to tune recommender algorithms prior to any live experiment. Offline metrics are also used in academic settings where online access to users is impossible – they serve as a stand-in for user satisfaction.
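As an illustration of this workflow, the sketch below (our own construction, reusing the recall_at_k and ndcg_at_k helpers from the metric sketch above) averages Recall@10 and NDCG@10 over the users in a held-out test split. The dictionary-based input format is an assumption for the example, not a standard API:

```python
import statistics

def evaluate_offline(recommendations, held_out, k=10):
    """Average Recall@K and NDCG@K over the users in a held-out test split.

    recommendations: dict user_id -> ranked list of item_ids produced by the model
    held_out:        dict user_id -> set of item_ids the user interacted with in the test split
    Assumes recall_at_k and ndcg_at_k from the metric sketch above are in scope.
    """
    recalls, ndcgs = [], []
    for user, ranked in recommendations.items():
        relevant = held_out.get(user, set())
        if not relevant:
            continue                                   # skip users with no test interactions
        gains = {item: 1.0 for item in relevant}       # binary relevance
        recalls.append(recall_at_k(ranked, relevant, k))
        ndcgs.append(ndcg_at_k(ranked, gains, k))
    return {f"Recall@{k}": statistics.mean(recalls),
            f"NDCG@{k}": statistics.mean(ndcgs)}
```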
However, limitations of offline evaluation are well-recognized. Offline metrics rely on historical interaction data, which can be biased or incomplete (e.g. missing feedback for truly good recommendations that were never shown to users). Moreover, user satisfaction is complex and not fully captured by proxy metrics. As one paper notes, “offline evaluation usually cannot fully reflect users’ preferences ... results may not be consistent with online A/B tests”. In other words, a model that looks superior offline (higher NDCG) might perform worse in a real user experiment due to factors like UI differences, novelty effects, or sample bias in the offline data. This gap drives the need for online testing.
Online Evaluation (Deployment & A/B Testing): Online evaluation involves testing the recommender system in a live environment, observing real user behavior. This can be done via A/B tests (randomly showing different algorithms to different user groups) or by monitoring user engagement metrics in production. Common online metrics include click-through rate (CTR), conversion rate, dwell time, user ratings, or any interactive feedback that indicates satisfaction. These are considered the ground truth of performance – indeed, online experiments are often called the “gold standard” for evaluation (On (Normalised) Discounted Cumulative Gain as an Off-Policy Evaluation Metric for Top-n Recommendation) because they directly measure what we ultimately care about: real users finding recommendations useful. For example, in the large-scale system studied by Elahi & Zirak, models were compared through an online A/B test measuring CTR to pick the best recommender for production (Online and Offline Evaluations of Collaborative Filtering and Content Based Recommender Systems). Online evaluation has the advantage of capturing all the complex factors of user experience: novelty, diversity, presentation effects, and longer-term engagement. It can also reveal unforeseen issues (e.g. a recommendation algorithm might have great offline recall but if it recommends very similar items repeatedly, users might get bored – something an engagement metric would catch).
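For instance, a two-proportion z-test is one common way to check whether a CTR difference between two A/B arms is statistically meaningful. The sketch below is only illustrative and uses made-up traffic numbers; production platforms typically rely on dedicated experimentation tooling rather than a hand-rolled test:

```python
import math

def ab_test_ctr(clicks_a, impressions_a, clicks_b, impressions_b):
    """Compare the CTR of two A/B arms with a two-proportion z-test."""
    ctr_a = clicks_a / impressions_a
    ctr_b = clicks_b / impressions_b
    pooled = (clicks_a + clicks_b) / (impressions_a + impressions_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / impressions_a + 1 / impressions_b))
    z = (ctr_b - ctr_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value under the normal approximation
    return ctr_a, ctr_b, z, p_value

# Hypothetical traffic: variant B (new ranker) vs. variant A (baseline).
print(ab_test_ctr(clicks_a=4_800, impressions_a=100_000,
                  clicks_b=5_150, impressions_b=100_000))
```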
The downside of online methods is cost and risk. Running A/B tests is time-consuming and can expose users to suboptimal recommendations. As Wu et al. (2024) put it, online experiments are “risky and time-consuming” (RecSys Arena: Pair-wise Recommender System Evaluation with Large Language Models), so they can’t be the sole method to evaluate every tweak. This is why researchers rely on offline metrics as a filter before promoting models to an online test.
Combining Offline and Online: Recent research encourages a hybrid approach where offline and online evaluations inform each other. Offline metrics are used to pre-screen and diagnose algorithms, while online tests validate real-world impact. An evaluation-driven framework for LLM agents highlights that offline tests establish controlled baselines and catch issues prior to deployment, whereas online evaluation provides continuous feedback under real conditions; together they offer complementary insights (An Evaluation-Driven Approach to Designing LLM Agents: Process and Architecture). In practice, a strong correlation between offline and online results is desired, but not guaranteed. Some 2024 studies have examined this correlation explicitly. For instance, one large-scale analysis found that an un-normalized DCG metric tracked online reward closely, whereas NDCG did not (On (Normalised) Discounted Cumulative Gain as an Off-Policy Evaluation Metric for Top-n Recommendation) – illustrating that even the choice of offline metric affects alignment with online outcomes. When discrepancies arise, techniques like user simulations with LLMs have been proposed to bridge the gap (e.g. using an LLM to simulate user preference for two recommendation lists as in RecSys Arena). Overall, the consensus is that offline evaluations are invaluable for development, but online evaluation is irreplaceable for measuring true user satisfaction, and the best practice is to use both. As one study succinctly noted: offline tests ensure a model is “ready before deployment under controlled conditions,” while online tests “provide continuous performance monitoring in real-world settings”.
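One simple way to quantify this alignment for your own system is to correlate the offline metric with the online lift across several model variants that have all been A/B tested. The sketch below uses scipy's spearmanr on hypothetical, made-up numbers purely for illustration:

```python
from scipy.stats import spearmanr

# Hypothetical per-variant results: offline NDCG@10 on the test split and the
# relative CTR lift (%) each variant achieved in its online A/B test.
offline_ndcg = [0.182, 0.195, 0.201, 0.188, 0.210]
online_ctr_lift = [0.4, 1.1, 1.6, 0.2, 1.9]

rho, p_value = spearmanr(offline_ndcg, online_ctr_lift)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A high rho means the offline metric ranks variants the same way online tests do;
# a low or negative rho is a warning that the offline metric may be misleading.
```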
Benchmarks and Datasets for Evaluation
Common Datasets: A variety of public datasets are used to evaluate document retrieval and recommendation performance, many of which remain the standard testbeds in the LLM era. Table 2 of a recent LLM-enhanced RS survey highlights several widely-used datasets by domain (Large Language Model Enhanced Recommender Systems: Taxonomy, Trend, Application and Future):
Movie Recommendation: MovieLens (movie rating dataset) and Netflix Prize data are classic benchmarks for evaluating how well a system recommends movies to users.
E-commerce: Amazon product datasets (e.g. Amazon product reviews) and Alibaba’s dataset are used to test product recommendation and retrieval of relevant items from large catalogs.
Point-of-Interest: Yelp (business reviews), Foursquare and even Delivery Hero data appear in recent studies to evaluate location or restaurant recommendations.
Video/Short Video: KuaiSAR, a dataset for short-video recommendation (from a platform similar to TikTok), has been used in 2024 research on LLM-based recommenders.
Textual Content: For document-focused recommendations, news and book recommendation sets are important. The MIND dataset (a large-scale news recommendation dataset from Microsoft) is commonly used to benchmark how well models retrieve relevant news articles for users. In the books domain, Goodreads (book reviews/ratings) and BookCrossing datasets serve to evaluate recommendation of written content. Similarly, data from platforms like WeChat articles or personalized job recommendations are emerging for testing LLM-driven recommenders.
These datasets provide a ground truth for offline evaluation: they typically contain a history of user-item interactions (clicks, ratings, etc.) that can be split into training and test sets for reproducible benchmarking. Many 2024 papers report results on a handful of these standard datasets to demonstrate improvements in metrics like NDCG or Recall.
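A common reproducible protocol on these datasets is a leave-one-out split, where each user's most recent interaction is held out for testing. The sketch below is a generic pandas version; the column names (user_id, item_id, timestamp) are assumptions and need to be adapted to the specific dataset's schema, e.g. the MovieLens ratings file uses userId/movieId/timestamp:

```python
import pandas as pd

def leave_one_out_split(interactions: pd.DataFrame):
    """Hold out each user's most recent interaction as the test set.

    Assumes columns named user_id, item_id and timestamp; adjust to the dataset's
    schema. Returns a training DataFrame and a dict user_id -> {held-out item_id},
    matching the held_out format used in the offline evaluation sketch above.
    """
    interactions = interactions.sort_values(["user_id", "timestamp"])
    last = interactions.groupby("user_id").tail(1)     # most recent event per user
    train = interactions.drop(last.index)              # all remaining interactions
    test = {row.user_id: {row.item_id} for row in last.itertuples()}
    return train, test
```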
Benchmarks and Trends: While these legacy datasets are heavily used, the community recognizes the need for new benchmarks tailored to LLM-powered systems. Liu et al. (2024) note that LLM-Enhanced Recommender Systems (LLMERS) is a “newborn direction” with no standardized benchmark yet, and they call for developing comprehensive benchmarks to accelerate research (Large Language Model Enhanced Recommender Systems: Taxonomy, Trend, Application and Future). This means that researchers currently repurpose classic recsys datasets for LLM settings, which may not capture all LLM-specific challenges (e.g. understanding natural language content or long documents). We are beginning to see efforts to curate datasets that include richer textual information or conversational recommendation scenarios to better evaluate LLM-integrated models.
Impact of Dataset Selection: An important insight from recent work is that which dataset you use can greatly influence evaluation outcomes. Different datasets vary in characteristics like sparsity, popularity distribution, item content, and user behavior patterns. A 2024 benchmarking study emphasized that dataset choice significantly affects evaluation conclusions (From Variability to Stability: Advancing RecSys Benchmarking Practices). For example, a complex model might outperform on a sparse dataset but not on a dense one, or a model optimizing diversity might shine on a dataset with broad user interests but do worse on a narrow-interest dataset. Zhao et al. (2022) found that studies typically evaluate on only a small number of datasets, and that preprocessing choices (filters on the data) can alter a model’s measured performance. This has led to a trend towards evaluating on multiple, diverse datasets to ensure robustness. Recent papers often report results across several benchmark sets (e.g. one movie, one e-commerce, one news dataset) to demonstrate consistent gains. Additionally, there’s growing awareness of evaluating beyond accuracy—considering fairness, novelty, and other aspects—which sometimes necessitates new data or augmenting existing sets to test those dimensions.
In summary, the evaluation of document retrieval in LLM-powered recommendation systems is a multi-faceted problem. The community is converging on using established metrics like NDCG and Recall (with careful attention to their limitations) and validating models both offline and online. At the same time, there is an active push to enrich the evaluation ecosystem with better benchmarks and datasets that reflect the nuances of LLM-driven recommendations, ensuring that the next generation of recommender systems is judged against reliable and relevant standards.