ML Case-study Interview Question: Real-Time Marketplace Listing Re-ranking with LLM Two-Tower Models
Case-Study question
A large online marketplace holds over 60 million user-generated listings. Each listing has a title, a description, and categorical and numerical attributes. Users input queries to find relevant listings. The core challenge is to re-rank the top listings, boosting the most relevant ones for the user query in near real-time. The marketplace seeks a scalable solution that uses large language models to learn better query-ad relevance and then serves those models under strict latency and throughput constraints. Propose an end-to-end design: data preparation, model architecture, training approach, serving framework, and evaluation metrics.
Detailed solution
Data Preparation
A broad dataset is built from user interactions. Implicit click patterns label which listings are good or bad matches for a given query. Statistical filtering and example weighting reduce bias from noisy user clicks. Training examples are query-listing pairs labeled with positive or negative feedback. Each example includes textual data (title, description) plus tabular data (categories, numerical signals).
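A minimal sketch of how weighted training examples could be assembled from an interaction log, assuming a hypothetical clicks.parquet file with one row per (query, listing) impression; the column names and the damped weighting rule are illustrative, not the production pipeline.

import pandas as pd

# Hypothetical interaction log: one row per impression.
logs = pd.read_parquet("clicks.parquet")  # columns: query, listing_id, title, clicked

# Aggregate repeated impressions so popular (query, listing) pairs do not dominate.
pairs = (
    logs.groupby(["query", "listing_id", "title"], as_index=False)
        .agg(impressions=("clicked", "size"), clicks=("clicked", "sum"))
)

# Label: positive if the pair was ever clicked; the weight grows with evidence
# but is damped to reduce bias from very frequent listings.
pairs["label"] = (pairs["clicks"] > 0).astype(int)
pairs["weight"] = (1 + pairs["impressions"]).clip(upper=50) ** 0.5

train_examples = pairs[["query", "listing_id", "title", "label", "weight"]]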
Model Architecture
A two-tower bi-encoder structure is used. One tower encodes the listing. The other tower encodes the query. Each tower is formed by a pretrained large language model plus custom layers for categorical and numerical features. Both towers project inputs into a lower-dimensional representation that captures semantic meaning. A scorer head then processes the concatenated representations to produce a click propensity probability.
P(click) = sigma(W * [Embedding_{Ad}, Embedding_{Query}])
Where:
Embedding_{Ad} is the output embedding for the listing tower.
Embedding_{Query} is the output embedding for the query tower.
[a, b] is the concatenation of vectors a and b.
W is a trainable weight matrix.
sigma is the sigmoid function mapping the score to a click probability.
The two encoders share some pretrained weights for textual data but differ in their top MLP layers, which handle the tabular inputs. Joint training ensures the model places a query and a listing close together in embedding space when the listing is a good match for the query.
Training Procedure
Labeled examples are fed to the model in mini-batches. A contrastive objective pushes relevant query-listing embeddings closer together and non-relevant pairs farther apart. The loss can be a cross-entropy or margin-based criterion. Final tuning uses validation sets and ranking metrics such as nDCG.
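A minimal training-loop sketch for the cross-entropy variant, assuming the TwoTowerReRanker interface shown in the snippet below and a DataLoader that yields already-tokenized tensors plus the per-example weights from data preparation; names like train_loader are placeholders.

import torch

def train_epoch(model, train_loader, optimizer, device="cpu"):
    bce = torch.nn.BCELoss(reduction="none")
    model.train()
    for batch in train_loader:
        # Encode both towers, score the pair, and weight the per-example loss.
        ad_vec = model.encode_ad(batch["ad_text"].to(device), batch["ad_tabular"].to(device))
        query_vec = model.encode_query(batch["query_text"].to(device), batch["query_tabular"].to(device))
        prob = model(ad_vec, query_vec).squeeze(-1)          # click propensity in [0, 1]
        loss = bce(prob, batch["label"].float().to(device))  # per-example cross-entropy
        loss = (loss * batch["weight"].to(device)).mean()    # bias-reducing example weights
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()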
Serving Framework
Ads are pre-encoded offline by calling the ad encoder once per listing and storing the resulting vectors in a vector database. At query time, a fast traditional search index retrieves a small candidate set (top k) of listings, and their embeddings are fetched from the vector store. The query is encoded in real time, the query embedding is concatenated with each retrieved ad embedding, and the scorer outputs a new relevance score that can be merged with the original search engine score for the final listing order.
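A sketch of this online path, assuming the two-tower interface from the Python snippet below, a hypothetical search_engine.top_k returning (listing_id, engine_score) candidates, and a hypothetical vector_store.fetch returning the pre-computed ad embeddings as a tensor; the weighted score merge at the end is one simple option.

import torch

def rerank(query_text, query_tabular, model, search_engine, vector_store, k=100, alpha=0.7):
    # Stage 1: cheap lexical retrieval of a small candidate set (hypothetical interface).
    candidates = search_engine.top_k(query_text, k)   # [(listing_id, engine_score), ...]

    # Stage 2: encode the query once, fetch stored ad embeddings, score each pair.
    with torch.no_grad():
        query_vec = model.encode_query(query_text, query_tabular)            # shape (1, d)
        ad_vecs = vector_store.fetch([listing_id for listing_id, _ in candidates])
        probs = model(ad_vecs, query_vec.expand(len(candidates), -1)).squeeze(-1)

    # Merge neural and lexical scores and sort for the final order.
    merged = [
        (listing_id, alpha * float(p) + (1 - alpha) * engine_score)
        for (listing_id, engine_score), p in zip(candidates, probs)
    ]
    return sorted(merged, key=lambda item: item[1], reverse=True)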
Example Python Snippet
import torch

class TwoTowerReRanker(torch.nn.Module):
    def __init__(self, text_encoder, tabular_mlp, final_scorer):
        super().__init__()
        self.text_encoder = text_encoder   # pretrained LLM-style encoder, shared across towers here for simplicity
        self.tabular_mlp = tabular_mlp     # handles categorical/numerical features
        self.final_scorer = final_scorer   # scorer head over the concatenated tower outputs

    def encode_ad(self, ad_text, ad_tabular):
        text_vec = self.text_encoder(ad_text)        # listing text embedding
        tabular_vec = self.tabular_mlp(ad_tabular)   # listing tabular embedding
        return torch.cat([text_vec, tabular_vec], dim=1)

    def encode_query(self, query_text, query_tabular):
        text_vec = self.text_encoder(query_text)
        tabular_vec = self.tabular_mlp(query_tabular)
        return torch.cat([text_vec, tabular_vec], dim=1)

    def forward(self, ad_vec, query_vec):
        combined = torch.cat([ad_vec, query_vec], dim=1)
        score = self.final_scorer(combined)
        return torch.sigmoid(score)                  # click propensity probability
This illustration shows a simplified approach. The text encoder could be a DistilBERT-like model. The tabular MLP handles categorical/numerical features. The final scorer merges the embeddings and outputs a probability.
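Continuing the snippet, a toy instantiation with stand-in modules shows the expected shapes; a real deployment would use a pretrained transformer producing pooled text embeddings, and the dimensions here are arbitrary.

text_encoder = torch.nn.Linear(768, 128)   # stand-in for a DistilBERT-like pooled output
tabular_mlp = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU())
final_scorer = torch.nn.Linear(2 * (128 + 32), 1)

model = TwoTowerReRanker(text_encoder, tabular_mlp, final_scorer)
ad_vec = model.encode_ad(torch.randn(4, 768), torch.randn(4, 16))
query_vec = model.encode_query(torch.randn(4, 768), torch.randn(4, 16))
print(model(ad_vec, query_vec).shape)  # torch.Size([4, 1]) -- one click probability per pair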
Practical Considerations
Fast inference requires hardware acceleration (GPUs or specialized accelerators). Re-ranking the top k listings must happen within a tight latency window (a few tens of milliseconds). Techniques like model quantization or smaller hidden sizes help meet throughput targets.
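One low-effort option is dynamic int8 quantization of the linear layers for CPU serving, sketched below on the toy model above; the actual speedup depends on hardware and on how much of the cost sits in the text encoder, so treat this as an illustration.

import torch

# Replace linear layers with int8 versions; activations stay in floating point.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)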
Evaluation
Relevance is measured with ranking metrics such as nDCG at various ranks. Business metrics (click rate, contact rate) are tracked in A/B testing. Gains in these metrics indicate improved user satisfaction. Gains of around 5-10% are typical when replacing a purely keyword-based ranking.
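A small sketch of nDCG at k for a single query with binary relevance labels in ranked order; in practice a library implementation such as scikit-learn's ndcg_score would be used over many queries.

import math

def ndcg_at_k(relevances, k):
    # relevances: relevance of each result in the order the ranker returned them
    # (for example 1 = clicked/relevant, 0 = not).
    dcg = sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(pos + 2) for pos, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([0, 1, 1, 0, 1], k=5))  # roughly 0.71 for this toy ranking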
How would you handle these follow-up questions?
What if we have highly imbalanced classes where most listings are not clicked?
Undersampling or oversampling can help, but weighting each sample based on feedback frequency is more robust. The label distribution can be balanced by assigning higher loss weight to minority labels or by using specialized losses that focus on hard negatives. A data pipeline that normalizes the ratio of positive vs. negative feedback also improves stability.
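A sketch of the loss-weighting idea using PyTorch's built-in pos_weight, which assumes the model emits raw pre-sigmoid scores; the 20:1 ratio is illustrative.

import torch

# Suppose roughly one positive per twenty negatives in the training stream.
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor([20.0]))

logits = torch.randn(8)                                  # raw scores before the sigmoid
labels = torch.tensor([1., 0., 0., 0., 0., 1., 0., 0.])
loss = criterion(logits, labels)                         # positives weigh ~20x more per example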
How do you ensure the two-tower embeddings do not drift over time if new listings appear continuously?
A regular offline job re-encodes all new listings. Incremental updates keep the vector database synchronized. A partial refresh strategy can embed only new or updated listings. Full re-embeddings happen periodically to prevent drift. Monitoring distribution shifts in textual or tabular data signals when re-training is needed.
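A sketch of such a partial-refresh job, with hypothetical catalog.updated_since and vector_store.upsert interfaces standing in for the real batch pipeline; scheduling and batching details are omitted.

import torch

def refresh_embeddings(model, catalog, vector_store, since):
    # Re-encode only listings created or edited after the last run.
    model.eval()
    with torch.no_grad():
        for batch in catalog.updated_since(since):            # hypothetical iterator of listing batches
            vecs = model.encode_ad(batch["text"], batch["tabular"])
            vector_store.upsert(batch["listing_ids"], vecs)   # hypothetical upsert API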
How do you combine the search engine’s score with the neural model’s score?
A weighted or parameterized combination merges the two scores. The final rank might be alpha * neural_score + (1 - alpha) * engine_score. Tuning alpha can be done in an A/B test. Another approach is to feed the engine’s score as a feature into the neural ranker so that the model learns an optimal combination.
How do you handle diverse queries with ambiguous terms?
A universal text encoder helps capture shared semantic patterns. The model learns from many queries and listings to disambiguate context. Additional user signals, like category filters or user’s past clicks, can enrich the query encoder. If terms are ambiguous, the model can still rank the best possible matches first.
What techniques maintain fast inference during spikes in traffic?
Autoscaling GPU or CPU instances is typical. Batching queries together speeds up matrix operations. Model distillation or quantization reduces compute load. Timeouts prevent slow queries from blocking the system. Caching the query embeddings for repeated queries can also help.
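A sketch of caching repeated query embeddings with a small in-process LRU; encode_query_text is a hypothetical wrapper that tokenizes and runs the query tower on a raw query string.

from functools import lru_cache

@lru_cache(maxsize=50_000)
def cached_query_embedding(query_text: str):
    # Identical query strings during a spike reuse the stored embedding
    # instead of re-running the query tower.
    return encode_query_text(query_text)  # hypothetical tokenization + query-tower call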
How would you troubleshoot a drop in ranking performance after deployment?
Compare offline and online metrics. Examine distribution shifts. Check whether the embeddings for new listings or new user queries differ from training. Investigate data pipeline issues. Re-check hyperparameters. Validate if the combined scores align with business feedback. Retrain or fine-tune with updated data if needed.
How do you guard against adversarial or spam listings?
A classification filter can remove or penalize suspicious listings before re-ranking. The text encoder can learn spam signals from labeled data. Additional rules or anomaly detection track suspicious patterns in listing text or attributes. Monitoring large embedding clusters that represent spam phrases can trigger removal or downranking.
How do you scale beyond 60 million listings?
Index sharding across multiple vector databases is an option. The offline pipeline and real-time cluster both scale horizontally. Large embedding generation is distributed with multiple workers. Summarized representations or a hierarchical approach can reduce dimensionality for faster nearest-neighbor operations.
How does the two-tower approach differ from a cross-encoder approach?
A two-tower approach independently encodes the listing and query, then combines them downstream. This enables offline listing embeddings, saving real-time computation. A cross-encoder applies a single transformer on the combined text, often yielding higher accuracy but at higher inference cost. For large catalogs, the two-tower approach is more practical because of the offline embedding step.
Would you modify the neural ranker architecture if your listings have more complex attributes like images?
A multi-modal extension can incorporate an image encoder tower. The listing tower becomes a fusion of text, tabular, and image embeddings. This requires adjusting the final scorer. Training data must label the alignment of images to queries. GPU memory and inference latency might increase, so strong engineering optimizations are important.
How do you manage personalized re-ranking?
User features (geolocation, past searches, categories visited) can be fed into the query encoder. Learned representations capture user preferences. The system re-ranks listings differently for each user. However, it demands more frequent updates because user behavior changes faster than general text data.
How do you do online experimentation?
A/B testing is standard. A fraction of traffic sees the new neural-based ranking. Another fraction uses the previous approach. Metrics like clicks, contact rates, and time on site are measured. Significance tests confirm whether improvements are real. After success in the experiment, roll out gradually to more traffic.
How would you handle model drift or degradation?
Periodic retraining with fresh data is crucial. Real-time performance signals trigger alerts. Drift can arise from major changes in the listing distribution or new user behaviors. Retraining cycles might be weekly or monthly. A robust pipeline ensures smooth re-deployment of updated models without downtime.
How do you ensure minimal latency from the final ranker step?
Profile each step. Cache repeated queries. Pre-compute embeddings for listings offline. Serve the ranker behind a load-balancer that auto-scales. Optimize model size. Use frameworks like TensorRT or ONNX runtime with half-precision or integer-8 quantization. Monitor 99th percentile latencies carefully.
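A sketch of exporting the scorer path to ONNX so it can run under an optimized runtime, using dummy tensors shaped like the concatenated tower outputs from the toy instantiation above; input names and the opset version are arbitrary choices.

import torch

dummy_ad = torch.randn(1, 160)      # 128 text dims + 32 tabular dims per tower in the toy example
dummy_query = torch.randn(1, 160)
torch.onnx.export(
    model, (dummy_ad, dummy_query), "reranker_scorer.onnx",
    input_names=["ad_vec", "query_vec"], output_names=["click_prob"],
    dynamic_axes={"ad_vec": {0: "batch"}, "query_vec": {0: "batch"}, "click_prob": {0: "batch"}},
    opset_version=17,
)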
Why choose a contrastive learning objective?
Contrastive learning forces relevant query-listing pairs to have closer embeddings and irrelevant ones to stay far apart. This improves ranking tasks where similarity is essential. It also handles implicit feedback better than naive classification. Contrastive objectives can be more sample-efficient for ranking compared to standard classification.
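A sketch of an in-batch-negatives contrastive objective over the tower outputs: each query treats its paired ad as the positive and every other ad in the batch as a negative; the temperature and the L2 normalization are illustrative choices.

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_vecs, ad_vecs, temperature=0.05):
    # Row i of the similarity matrix scores query i against every ad in the batch;
    # the matching ad sits on the diagonal and is the only positive.
    q = F.normalize(query_vecs, dim=1)
    a = F.normalize(ad_vecs, dim=1)
    sims = q @ a.t() / temperature
    targets = torch.arange(q.size(0))
    return F.cross_entropy(sims, targets)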
How is the final score calibrated?
A known approach is temperature scaling or Platt scaling. The raw score from the ranker might not be interpretable as a probability. A small calibration set adjusts the score mapping. Another strategy is to feed the ranker’s output into a logistic regressor that aligns with real click probabilities.
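A sketch of Platt-style calibration with scikit-learn, fitting a logistic regressor on a held-out set of raw ranker scores versus observed clicks; this is one common option rather than the deployed method.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Held-out calibration set: uncalibrated ranker scores and observed 0/1 clicks.
raw_scores = np.array([0.2, 0.9, 0.4, 0.7, 0.1, 0.8]).reshape(-1, 1)
clicks = np.array([0, 1, 0, 1, 0, 1])

calibrator = LogisticRegression()
calibrator.fit(raw_scores, clicks)
calibrated = calibrator.predict_proba(raw_scores)[:, 1]  # calibrated click probabilities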
How would you track coverage of niche queries?
Monitoring tail queries is vital. Inspect recall and rank positions for queries with sparse data. If tail queries degrade, gather specialized examples for retraining. Possibly cluster query embeddings to identify categories of queries needing more robust representation. Weighted sampling of tail queries can improve coverage.
Why not store all listings in the ranker for each request?
That would be computationally infeasible. A multi-million listing set cannot all go through the ranker in real time. A coarse search (like TF-IDF or BM25) is used to retrieve a small subset. The ranker then refines that subset. This two-stage approach balances relevance with efficiency.
How do you embed new ads immediately if your pipeline is batch-based?
A streaming pipeline or micro-batch approach can embed new ads as they arrive. The system might store a default embedding for incomplete items, then re-embed them fully when data is stable. A mini-service can handle ad updates in near real time. This ensures new content is searchable with minimal delays.
Would you segment listings by category and train separate models?
In some cases, specialized models per category yield better performance. However, it complicates maintenance. A single robust model with a shared text encoder usually generalizes well if it sees diverse data. Sub-models can overfit to limited data. A unified approach is often easier to scale and maintain.
How do you handle multiple languages?
A multilingual text encoder can be used. Training data must include multiple languages and relevant user queries. Tokenizers handle different scripts. The approach for embeddings remains similar. The ranker sees more variety in text but still produces universal embeddings. Language detection can route queries to different sub-models if needed.
What if your marketplace has ephemeral listings?
Items can expire quickly. The system re-embeds only active items. The vector store must remove outdated listings. The search index is pruned accordingly. A real-time pipeline ensures only valid items remain. This is crucial if listing churn is high and user experience depends on fresh results.
How do you measure user satisfaction beyond clicks?
Time spent on listing pages or direct user feedback (like a star rating for results) can guide relevance. If users contact sellers, that indicates deeper engagement. Another metric might be eventual transactions or conversions. The final objective typically blends short-term interactions like clicks with longer-term behaviors like purchases.
How do you handle zero-shot queries or brand-new search terms?
A pretrained language model captures generalized semantics. Zero-shot queries can still have meaningful embeddings if the text is in-domain enough. The ranker can approximate relevance based on semantic similarity. If there is a domain shift, a small prompt-tuning approach on the LLM might help adapt to new terms.
How do you avoid collisions where distinct items have similar embeddings?
You can incorporate metadata or categories in the final representation. By concatenating textual and tabular embeddings, the ranker can distinguish items even if textual descriptions overlap. Additional supervised signals about listing uniqueness also help. If collisions persist, a larger dimension space or more training data can separate items better.
How do you productionize the model with minimal risk?
Shadow mode or canary tests can run the new ranker in parallel, capturing logs without affecting users. Any anomalies trigger rollback. Careful versioning of the model and data pipeline ensures reproducibility. Gradual rollout to small user segments validates correctness before the full switch.
How would you handle huge textual fields in listings?
A truncated version of the listing text is often enough. Summaries or highlights can be extracted. If crucial details lie deep in the text, a snippet approach can parse the relevant sections. Some teams choose specialized architectures that handle long sequences, but typical solutions use partial text due to computational constraints.
How do you handle repeated queries in a session?
If a user refines a query, you can store embeddings in a short-term cache to reduce computation. Session context can feed into the query encoder. The ranker can adapt if a user picks an item from a certain category and then modifies the query. This leads to a dynamic re-ranking that reflects user intent shifts.
How do you handle variation in user spelling or typing errors?
The language model is robust to minor spelling errors via subword tokenization. Additional fuzzy matching can be done in the retrieval step. The ranker can correct small variations in spelling through learned embeddings. For heavier misspellings, expansions from a dictionary or a spell-checker can fix queries before encoding.
How do you manage pipeline reproducibility across offline training and online inference?
A consistent data transformation codebase ensures the same text cleaning, tokenization, and feature encoding. Containerizing the environment with the same library versions prevents discrepancies. Regular checks confirm that training embeddings match the same queries or listings encoded at inference time.
How would you justify the cost of large GPU clusters to management?
Monitor direct metrics like increased conversion, more listing contacts, and higher user satisfaction. Show how these correlate with revenue gains. Compare the compute costs with the lift in key business KPIs. If the incremental gains surpass infrastructure spending, it validates the ROI of large-scale ranking.