ML Interview Q Series: How would you compare your new search system to the current one and track performance metrics?
📚 Browse the full ML Interview series here.
Comprehensive Explanation
One key approach to comparing two search engines is to systematically evaluate them on relevance-based metrics and user-centric behavioral metrics. Evaluations can be done both offline (using a test dataset with known relevance judgments) and online (with real users in a live environment). In practice, you usually combine the two: offline metrics help refine the ranking algorithm early on, while online testing measures real-world user satisfaction and engagement.
Offline testing typically uses a labeled dataset in which each document (or webpage) is assigned a relevance level for given queries. You compare the new ranking algorithm against the existing one based on how well their results align with these known relevance scores. Common metrics include MRR (Mean Reciprocal Rank), Precision@k, Recall@k, and the widely used DCG/NDCG (Discounted Cumulative Gain / Normalized Discounted Cumulative Gain).
When implementing NDCG, each result is assigned a relevance score that is more nuanced than a simple binary label. The DCG formula sums these scores with a log-based discount factor for results at lower ranks. NDCG is simply DCG normalized by the best possible ordering (i.e., ideal ordering). Below is the central formula for NDCG at rank K:
$$
\text{NDCG}(K) = \frac{\text{DCG}(K)}{\text{IDCG}(K)}, \qquad \text{DCG}(K) = \sum_{i=1}^{K} \frac{2^{\text{rel}_i} - 1}{\log_2(i + 1)}
$$

Here NDCG(K) represents the normalized discounted cumulative gain up to rank K. The term rel_i is the graded relevance of the result placed at rank i. The expression 2^(rel_i) - 1 gives more weight to highly relevant items, while log_2(i + 1) discounts the contribution of results at lower positions. The denominator IDCG(K) is the ideal DCG, computed by sorting items in decreasing order of relevance. Dividing DCG(K) by IDCG(K) yields a score between 0 and 1, with higher values indicating better ranking quality.
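As a concrete illustration, here is a minimal sketch of how NDCG@K could be computed from graded relevance labels. The function names and the example labels are hypothetical, not part of any particular library.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain using the (2^rel - 1) / log2(rank + 1) formulation."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the actual ranking divided by the DCG of the ideal (sorted) ranking."""
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

# Graded relevance of the results returned for one query, in ranked order
# (0 = irrelevant, 3 = perfect).
print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=5))
```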
For online evaluation, A/B testing is frequently used. You serve the new engine's results to a randomly selected fraction of user traffic, while the rest continue to see results from the production engine. You then monitor engagement metrics such as click-through rate, dwell time, bounce rate, time to first click, or any explicit user feedback. A carefully designed A/B test ensures each variant is exposed to comparable user populations, controlling for external factors like geography and time of day.
In practice, you must also track more advanced signals. Dwell time (how long a user stays on a result's page before returning) is an indicator of perceived relevance; a short dwell time may reveal dissatisfaction. Another signal is bounce rate: if a large fraction of users quickly leave the site, it suggests the displayed results are not meeting their needs. You could also consider success metrics like how frequently users refine or reformulate queries, or how many users escalate to more advanced searches (like applying advanced filters). Combining these behavioral signals with direct feedback data (like ratings of search results) gives a robust picture of overall effectiveness.
It’s important to run tests long enough to achieve statistical significance. You typically pre-define success metrics (for instance, a 1% improvement in NDCG@10 or an increase in click-through rate) and employ statistical tests to confirm whether differences are due to the new ranking strategy or random noise. Tools like confidence intervals, hypothesis testing (e.g., p-values), and metrics of effect size guide decisions about rolling out the new engine more broadly.
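As an example of such a test, a two-proportion z-test is one simple way to check whether a CTR difference between control and treatment is statistically significant. This is a sketch, assuming per-variant click and impression counts have already been aggregated, and it uses SciPy only for the normal tail probability.

```python
import math
from scipy.stats import norm

def two_proportion_ztest(clicks_a, impressions_a, clicks_b, impressions_b):
    """Two-sided z-test for a difference in click-through rate between two variants."""
    p_a = clicks_a / impressions_a
    p_b = clicks_b / impressions_b
    pooled = (clicks_a + clicks_b) / (impressions_a + impressions_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / impressions_a + 1 / impressions_b))
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))  # two-sided p-value
    return p_a, p_b, z, p_value

# Hypothetical counts: control engine vs. new engine
print(two_proportion_ztest(clicks_a=4_100, impressions_a=100_000,
                           clicks_b=4_350, impressions_b=100_000))
```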
How Do We Decide Which Metrics Matter Most?
This depends on your product or company’s definition of “successful” search. If the user’s goal is factual information retrieval, you might emphasize relevance-based metrics such as NDCG. If you want to maximize user engagement or advertising revenue, you might focus on user-driven behavioral metrics (CTR, dwell time, and so on). Often, these objectives intertwine, requiring multiple metrics to capture the complete picture of user satisfaction.
What About Potential Biases in Offline Labels?
If your labeled data skews toward a particular style of query or set of documents, your offline test may not reflect real-world usage. Further, if the new engine introduces novel results that are not in the original dataset, offline evaluations might undervalue them. This is why real-world user feedback (via A/B testing) is crucial to detect new ranking patterns that might not be visible in a static labeled dataset.
How to Run A/B Tests Without Causing Disruption?
Careful sampling of user traffic is critical. For instance, you can redirect a small but representative fraction (like 1%-5%) to the new engine so that if the new algorithm has issues, it only affects that subset temporarily. You also want to ensure the new system handles the same variety of queries. That means random assignment at the user or session level, so each user’s experience remains consistent throughout a single session.
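One common way to get consistent yet random-looking assignment at the user level is to hash a stable user identifier into a bucket. Below is a minimal sketch; the experiment name and the 5% traffic split are illustrative assumptions.

```python
import hashlib

def assign_variant(user_id: str,
                   experiment: str = "search-ranker-v2",
                   treatment_fraction: float = 0.05) -> str:
    """Deterministically assign a user to 'treatment' or 'control' by hashing their ID.

    The same user always lands in the same bucket, so their experience stays consistent,
    and salting the hash with the experiment name keeps assignments independent
    across concurrent experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000  # pseudo-uniform value in [0, 1)
    return "treatment" if bucket < treatment_fraction else "control"

print(assign_variant("user-12345"))
```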
What Could Go Wrong with A/B Testing?
Statistical pitfalls appear if you stop the experiment too soon; short-term fluctuations in user behavior can then produce false positives. Another challenge is that new or unusual interface changes can cause a novelty effect, where short-term user curiosity inflates the metrics. You must observe user engagement long enough for the novelty to fade. In certain domains, privacy concerns also shape how you collect or store user data.
What if the Offline and Online Results Contradict Each Other?
Sometimes the best offline metrics do not translate directly to improved user satisfaction metrics, because user behavior is influenced by interface design, snippet presentation, and many other factors beyond straightforward document relevance. When this happens, additional analysis is required to see whether the offline dataset or labeling process is outdated or unrepresentative of real usage patterns. You might refine your training or re-check how you define relevance. Ultimately, user behavior in a well-designed A/B test tends to be the gold standard.
How to Handle Edge Cases?
Edge queries, such as extremely ambiguous or extremely rare queries, demand special attention. The average metrics might look solid while your system struggles significantly on these unusual queries. Monitoring tail performance can be part of your final decision before full deployment. You can set up separate metrics or run specialized experiments for these queries.
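A simple way to keep an eye on tail performance is to bucket queries by their historical frequency and report metrics per bucket rather than only in aggregate. This is a sketch, assuming you already have per-query NDCG scores and query counts; the bucket boundaries are illustrative.

```python
from collections import defaultdict

def ndcg_by_frequency_bucket(per_query_ndcg, query_counts):
    """Average NDCG per frequency bucket (head / torso / tail)."""
    def bucket(count):
        if count >= 1_000:
            return "head"
        if count >= 10:
            return "torso"
        return "tail"

    totals, sizes = defaultdict(float), defaultdict(int)
    for query, score in per_query_ndcg.items():
        b = bucket(query_counts.get(query, 0))
        totals[b] += score
        sizes[b] += 1
    return {b: totals[b] / sizes[b] for b in totals}

# Hypothetical per-query scores and 30-day query counts
scores = {"facebook login": 0.98, "rare medical term xyz": 0.41, "buy shoes": 0.87}
counts = {"facebook login": 250_000, "rare medical term xyz": 3, "buy shoes": 12_000}
print(ndcg_by_frequency_bucket(scores, counts))
```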
Why Might We Include Secondary Metrics?
Sometimes your main metric (like overall NDCG@10) could improve while inadvertently harming other essential metrics, like response latency or coverage for more specific queries. Tracking secondary metrics ensures that improvements in one area do not come at the expense of fundamental user experience or system stability. This can include measuring server-side performance, user latency, or simply verifying no part of the platform experiences breakage from the new approach.
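In practice this often turns into a set of guardrail checks that must pass before a primary-metric win counts. A minimal sketch follows; the metric names and thresholds are illustrative assumptions, not recommendations.

```python
def guardrails_pass(metric_changes: dict) -> bool:
    """Return True only if no guardrail metric regresses beyond its allowed tolerance.

    'metric_changes' holds relative changes of treatment vs. control,
    e.g. +0.02 means the treatment value is 2% higher.
    """
    guardrails = {
        "p95_latency_change": 0.05,       # latency may not grow more than 5%
        "zero_result_rate_change": 0.01,  # zero-result rate may not grow more than 1 point
        "error_rate_change": 0.001,       # error rate must stay essentially flat
    }
    return all(metric_changes.get(name, 0.0) <= limit
               for name, limit in guardrails.items())

print(guardrails_pass({"p95_latency_change": 0.03,
                       "zero_result_rate_change": -0.002,
                       "error_rate_change": 0.0}))
```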
What Are the Steps to Final Deployment?
You start with controlled experiments offline, ensuring your model meets baseline performance. Then you move to small-scale online A/B tests with well-defined success criteria and adequate sample size. If metrics improve and remain stable (and you verify that you are not hurting other important metrics), you expand traffic gradually. Eventually, if everything checks out, you do a full rollout to all users.
Could Machine Learning Ranking Introduce Unintended Biases?
Any system trained on historical data might inadvertently carry over existing biases, such as favoring mainstream content or ignoring new sources. Mitigations include analyzing fairness metrics, monitoring for distribution shifts, and re-weighting or adjusting for known biases. Sometimes you want to incorporate real-time feedback that adapts to evolving user behavior, but that risks creating feedback loops where a certain type of content keeps getting exposed.
Could User Satisfaction Metrics Be Manipulated?
Users might exhibit gaming behaviors: for example, artificially inflating clicks to promote certain results. That is why you often look at more robust signals like dwell time or cross-reference metrics (e.g., does the user quickly re-query with the same or related terms?). You can also track anomalies or suspicious patterns. If heavy gaming is suspected, you might discount or remove that data from your evaluation.
How Do You Explain or Justify Your New Engine's Ranking Decisions?
In many domains, it is important to provide interpretability or transparency—explaining, at least to some degree, why a result is ranked highly. A purely black-box approach may cause user trust issues. Techniques like feature attribution or local surrogate models (like LIME or SHAP) can be integrated to give insights about the main factors behind a result’s ranking.
How Would You Summarize Key Steps for This Comparison?
Collect or create offline judgments. Evaluate with classic metrics like MRR, Precision@k, NDCG. Run online experiments via A/B testing. Collect user engagement data such as CTR, dwell time, bounce rate, or reformulation rate. Track potential pitfalls like label bias and novelty effects. Validate significance of improvements. Carefully monitor performance in a partial rollout before final deployment.
Below are additional follow-up questions
How do you incorporate personalization into your search evaluation?
Personalization tailors results to the user’s past behavior, preferences, location, or device. This adds complexity because the results vary not just by query, but also by user profile. A straightforward offline test might fail to capture the nuances of individual user contexts. One potential approach is to collect labeled data that includes user-specific judgments of relevance, though that can be extremely time-consuming to build. Alternatively, you can evaluate personalization in online tests by comparing metrics for segments of users: those with extensive search history, new users with no history, users in specific geolocations, or topic-focused user cohorts.
A crucial pitfall is data sparsity for new or anonymous users. If the system heavily relies on user history to personalize, new users may see suboptimal results. You need a fallback strategy, such as popular or context-specific defaults (e.g., local news). Also, personalization might inadvertently lead to “filter bubbles,” where users rarely see diverse perspectives. Monitoring for diversity or novelty can mitigate this issue. Furthermore, personalization often requires storing or processing personal data, raising privacy concerns and requiring compliance with legal frameworks like GDPR.
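One practical pattern is to compute the same online metric separately for each user segment, so that a gain for heavy users does not mask a regression for new users. Here is a sketch assuming each logged event carries a segment label and a binary success flag; the segment names are hypothetical.

```python
from collections import defaultdict

def success_rate_by_segment(events):
    """Per-segment success rate from (segment, success) events, e.g. satisfied clicks per session."""
    wins, totals = defaultdict(int), defaultdict(int)
    for segment, success in events:
        totals[segment] += 1
        wins[segment] += int(success)
    return {seg: wins[seg] / totals[seg] for seg in totals}

# Hypothetical events: (user segment, did the session end in a satisfied click?)
events = [("new_user", True), ("new_user", False), ("heavy_user", True),
          ("heavy_user", True), ("heavy_user", False)]
print(success_rate_by_segment(events))
```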
How do you handle queries that return zero or near-zero results?
Occasionally, users may issue niche or malformed queries leading to few or no matches. The standard approach might be to return an empty page or a page with minimal content. However, to improve user satisfaction, you often display suggestions like “Did you mean X?” or provide synonyms and partial matches. Metrics to evaluate these improvements include user acceptance of suggested queries, click-through on alternate queries, and user dwell time on those alternate search results.
A pitfall here is that overly aggressive rewriting of user queries might show tangential results that degrade the user experience. For instance, a specialized medical query might be forcibly “corrected” to a more generic term. A real-world challenge is ensuring these suggestions maintain context. In an A/B test, you would track how many zero-result queries become successful queries under the new system. You might also gather user feedback to confirm that the suggestions truly help.
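To quantify this in an experiment, you can track how often a zero-result query is "recovered", meaning the user accepts a suggestion and then clicks a result. The sketch below assumes a hypothetical session-log format with three boolean fields.

```python
def zero_result_recovery_rate(sessions):
    """Fraction of zero-result queries where the user accepted a suggestion and then clicked a result.

    Each session dict is assumed to have: 'had_zero_results', 'accepted_suggestion', 'clicked_after'.
    """
    zero_result = [s for s in sessions if s["had_zero_results"]]
    if not zero_result:
        return 0.0
    recovered = [s for s in zero_result
                 if s["accepted_suggestion"] and s["clicked_after"]]
    return len(recovered) / len(zero_result)

sessions = [
    {"had_zero_results": True,  "accepted_suggestion": True,  "clicked_after": True},
    {"had_zero_results": True,  "accepted_suggestion": False, "clicked_after": False},
    {"had_zero_results": False, "accepted_suggestion": False, "clicked_after": True},
]
print(zero_result_recovery_rate(sessions))
```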
In what ways can search engine performance degrade over time, and how do you monitor it?
Search performance can degrade because of shifting user interests, evolving internet content, or changes in how web pages are structured for SEO. A once-effective ranking model may lose its edge if the training data no longer reflects new content types or if spammers discover weaknesses in the system. You can detect these drifts through continuous monitoring of key performance indicators, like overall click-through rate (CTR), dwell time, or bounce rates. A downward trend signals that your model might be outdated.
Pitfalls can include false alarms if you see short-term dips (for example, during holidays or large-scale events). Another issue is that big changes in external websites (like a new user-generated content platform) can shift user behavior patterns in ways your model never anticipated. Model monitoring systems often trigger alerts when metrics fall below predefined thresholds, prompting reevaluation or retraining. You might also schedule regular audits of relevance for popular or critical queries to detect gradual performance erosion.
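A lightweight version of such monitoring compares a recent window of a metric against a longer-term baseline and alerts on a sustained drop. The window sizes and threshold below are illustrative assumptions.

```python
def drifted(daily_ctr, baseline_days=28, recent_days=7, max_relative_drop=0.05):
    """Flag drift if the recent average CTR falls more than 5% below the trailing baseline."""
    if len(daily_ctr) < baseline_days + recent_days:
        return False  # not enough history yet
    baseline = sum(daily_ctr[-(baseline_days + recent_days):-recent_days]) / baseline_days
    recent = sum(daily_ctr[-recent_days:]) / recent_days
    return (baseline - recent) / baseline > max_relative_drop

# Hypothetical daily CTR series: stable around 0.042, then a gradual decline
ctr_history = [0.042] * 28 + [0.041, 0.040, 0.039, 0.039, 0.038, 0.038, 0.037]
print(drifted(ctr_history))
```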
How would you handle adversarial or malicious user behavior designed to manipulate search results?
Adversarial behavior includes click spam, bot-driven query traffic, or content farms that attempt to artificially boost certain pages. You should cross-check signals—like suspiciously high CTR or extremely short dwell time—to detect anomalies. Machine learning–based anomaly detection can flag unusual patterns from specific IPs or user agents. Then, suspicious clicks might be discounted in ranking calculations.
A critical pitfall is penalizing legitimate small spikes in user interest—like unexpected viral content—by mistake. If your filters are too sensitive, you might bury relevant content. Conversely, if they are too lenient, spammers can dominate. You usually combine real-time user-behavior monitoring with robust content-quality signals to refine spam detection. Another subtlety arises when adversarial actors create pages that mimic genuine content or embed hidden text. Routine indexing and ranking might favor them accidentally. Thus, some search engines maintain specialized manual review teams or develop algorithms that penalize pages with manipulative signals (like keyword stuffing or hidden text). Monitoring user feedback can also help detect manipulated rankings.
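A simple robust z-score check over per-source click rates is one way to surface candidates for manual review. This is a sketch, assuming clicks and impressions have already been aggregated per IP or user; production systems would use far richer features than CTR alone.

```python
import statistics

def suspicious_sources(ctr_by_source, z_threshold=3.5):
    """Flag sources whose CTR is an extreme outlier using a robust (median / MAD) z-score."""
    rates = list(ctr_by_source.values())
    med = statistics.median(rates)
    mad = statistics.median(abs(r - med) for r in rates)  # median absolute deviation
    if mad == 0:
        return []
    return [src for src, r in ctr_by_source.items()
            if 0.6745 * (r - med) / mad > z_threshold]

# Hypothetical per-IP click-through rates; ip_d looks like click spam
ctr_by_source = {"ip_a": 0.04, "ip_b": 0.05, "ip_c": 0.03, "ip_d": 0.92, "ip_e": 0.04}
print(suspicious_sources(ctr_by_source))
```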
Can you discuss the trade-offs between precision and recall in search systems?
Precision is the proportion of returned results that are relevant, while recall is the fraction of all relevant results that the system actually returns. Often, you adjust a threshold or weighting in the ranking system to favor one over the other. For general web search, high precision is typically more desirable because users seldom scan results beyond the first page, so you want the top results to be highly relevant.
However, certain specialized search scenarios (like academic literature search) require high recall to ensure no crucial documents are missed. In these cases, you might tolerate more non-relevant hits. A pitfall is that optimizing for recall alone might produce irrelevant clutter in top positions, decreasing user satisfaction. In practice, search engines usually aim for a balance. During an A/B test, you may track multiple metrics (like NDCG and recall@k) to verify that an improvement in one dimension does not severely degrade the other.
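For reference, here is a minimal sketch of Precision@k and Recall@k over binary relevance labels; the document IDs and labels are illustrative.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k returned results that are relevant."""
    top_k = ranked_ids[:k]
    return sum(1 for doc in top_k if doc in relevant_ids) / k

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant results that appear in the top k."""
    if not relevant_ids:
        return 0.0
    top_k = ranked_ids[:k]
    return sum(1 for doc in top_k if doc in relevant_ids) / len(relevant_ids)

ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}
print(precision_at_k(ranked, relevant, k=5), recall_at_k(ranked, relevant, k=5))
```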
How do you interpret user behavior metrics when tasks vary significantly in complexity?
Different types of queries represent different goals. For example, a navigational query (“Facebook login”) might yield a quick click on the top result, whereas an exploratory query (“best ways to learn deep learning”) might result in multiple clicks, tab switches, or longer dwell times. Evaluating them with a uniform metric could be misleading. You might break queries into categories: navigational, informational, transactional, or exploratory. Then, measure success differently for each category.
A subtle pitfall is that even after segmenting queries by category, there can be significant variation within each type. For instance, "Facebook login" might be typed by someone who needs help signing in or by someone simply wanting the direct link. Similarly, a query about "best ways to learn deep learning" might come from someone doing academic research or someone seeking quick tutorials. Differentiating user intent can be tricky. You may deploy user surveys or track subsequent user actions (e.g., do they refine the query or click related searches?) to glean deeper insight into intent.
How do you address search relevance for domain-specific or vertical search engines?
Vertical searches—for instance, job portals, e-commerce searches, or travel booking—have domain-specific constraints. Relevance might depend on structured attributes: job location, salary range, brand preference, or flight times. Evaluations typically incorporate domain-based metrics, such as “success rate” (did the user find a job to apply to?), or “conversion rate” (did the user purchase something?).
A unique pitfall is partial matching of attributes. For instance, in job search, if a user wants “remote data scientist jobs,” your engine might show a mix of remote or partially remote roles. The user’s satisfaction depends on how well the system interprets and ranks by these structured fields. Another challenge is the dynamic nature of availability: a job might be filled and no longer relevant, or an out-of-stock product might appear. Frequent re-indexing and real-time updates help maintain accurate results. Online experimentation should focus on both short-term user signals (like clicks) and longer-term success signals (like completed purchases or final job applications).
What strategies can be used to manage multilingual or cross-lingual queries effectively?
For a global user base, queries may be in multiple languages, or even in a mix of languages. Handling these queries often involves language detection, tokenization, and possibly machine translation. You might store documents in multiple languages, or rely on a universal embedding that captures cross-lingual semantic meanings. A specialized approach is to create parallel corpora that map content in one language to equivalent content in another.
The pitfalls include false positives in language detection, especially for queries with borrowed words (common in technology or brand names). Another subtlety is that direct machine translation of a user’s query could alter nuances. For instance, certain languages or dialects have phrases that don’t translate neatly. Measuring success in an A/B test might require localizing metrics: do users in different regions or speaking different languages show improved or stable engagement when searching cross-lingually? Also, indexing foreign documents might raise challenges with compliance if certain content must be restricted regionally.
How do you manage continuous integration and deployment for search engine updates without confusing end users?
Modern search engines often operate with rolling updates. A new ranking model might be tested behind feature flags or partial user segments. A crucial step is ensuring backward compatibility so that partial updates don’t break queries relying on older indexing or ranking data. One method is “blue-green deployment,” maintaining two parallel versions of the search stack. You gradually shift traffic from the old version to the new, monitoring key metrics in real time.
Pitfalls include abrupt changes in UI or ranking causing user confusion, especially if these changes roll out too quickly without consistent user experience. Another subtlety is data migration: re-indexing billions of documents can take substantial time. If the old model references features no longer produced by the new pipeline, you must maintain parallel data flows or coordinate your migration carefully. If something fails, you should have a rollback strategy to revert to the stable version quickly.
How do you keep track of short-term spikes in queries caused by sudden events?
User queries can spike in response to breaking news or viral trends, dramatically altering the relevance of certain results. This demands real-time or near-real-time update capabilities in both your indexing pipeline and your ranking algorithm. Some engines incorporate incremental indexing strategies or specialized “hot topic” modules that place high emphasis on time-sensitive content.
A key pitfall is balancing the importance of recency with overall relevance. Over-promoting new content might bury enduringly relevant results. Under-promoting new content leads to user dissatisfaction for fast-changing topics. Another subtlety is that large spikes can skew your typical metrics. For instance, CTR might suddenly change for queries related to a news event, but that does not necessarily reflect broader system performance. Therefore, you might separate ephemeral or event-driven queries when measuring general improvements in your search system.
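One common way to balance recency against overall relevance is to blend a base relevance score with an exponential freshness decay, applying the blend only for queries classified as time-sensitive. The sketch below uses illustrative parameters; the half-life and weighting would need to be tuned per product.

```python
import math

def blended_score(relevance, age_hours, query_is_timely,
                  half_life_hours=6.0, recency_weight=0.3):
    """Blend base relevance with an exponential freshness term for time-sensitive queries only."""
    if not query_is_timely:
        return relevance
    # Freshness is 1.0 for brand-new content and halves every `half_life_hours`.
    freshness = math.exp(-math.log(2) * age_hours / half_life_hours)
    return (1 - recency_weight) * relevance + recency_weight * freshness

# Hypothetical: the same document scored for a breaking-news query vs. an evergreen query
print(blended_score(relevance=0.7, age_hours=2.0, query_is_timely=True))
print(blended_score(relevance=0.7, age_hours=2.0, query_is_timely=False))
```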