ML Case-study Interview Question: Targeted Audience Expansion Using User Embeddings and Per-Advertiser Classifiers
Case-Study question
A large online platform wants to refine a system that finds highly relevant new users for targeted advertising. The platform collects rich first-party user behavior signals, including ways users interact with items on the platform. The marketing team has seed user lists (existing users who have already shown valuable engagement), and they want to expand to new, similar users who have a high probability of becoming similarly engaged. Propose how you would build, train, and deploy a machine learning pipeline to accomplish this audience expansion task, and describe how you would evaluate its effectiveness at scale.
Detailed Solution
Traditional regression-based solutions train a separate classifier for each seed list in a sparse feature space. Similarity-based solutions generate user embeddings and find nearest neighbors. Each approach faces unique challenges. Regression-based models capture seed list structure but suffer when seed data is limited. Similarity-based approaches generalize well but may miss certain nuanced patterns specific to individual seed lists.
User Embeddings + Per-Advertiser Classifier
Pre-trained universal user embeddings, derived from user-item interactions, encode behavioral patterns into dense vectors. Feeding these vectors into a multi-layer perceptron (MLP) classifier for each advertiser combines the benefits of both approaches. Embeddings solve data sparsity and speed up convergence, while per-advertiser classifiers learn seed-specific patterns. This combined approach excels on both small and large seed lists.
Weighted Training
Positive samples come from the advertiser’s seed list. Negative samples come from a random pool of other users. The model applies a weighting scheme so seed users with higher engagement have higher impact. This weighting can leverage metrics like impressions or click-through rates. A log transform and min-max scaling keep weights in a reasonable range. The MLP then trains with weighted binary cross-entropy to differentiate which users best match the seed list’s engagement patterns.
The weighted binary cross-entropy loss is:
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} w_i \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
Here:
N is the total number of samples.
w_i is the engagement-based weight for each sample.
y_i is the true label (1 for seed user, 0 for non-seed user).
\hat{y}_i is the predicted probability.
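A minimal sketch of the weight computation described above, assuming raw per-user engagement counts (impressions or clicks) are available; the floor value and the exact metric are illustrative assumptions, not documented settings.

import numpy as np

def engagement_weights(raw_engagement, floor=0.1):
    # Log transform compresses heavy-tailed engagement counts,
    # then min-max scaling maps the weights into [floor, 1.0].
    logged = np.log1p(np.asarray(raw_engagement, dtype=np.float64))
    lo, hi = logged.min(), logged.max()
    if hi == lo:  # all seed users equally engaged
        return np.full_like(logged, 1.0)
    scaled = (logged - lo) / (hi - lo)
    return floor + (1.0 - floor) * scaled

# Example: impressions per seed user; negative samples would typically keep weight 1.0.
w_seed = engagement_weights([3, 120, 45, 8, 900])

These weights become the w_i terms in the loss above.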
Offline Evaluation
Hold out 10% of the seed users for validation. Train with the remaining 90%. Score the entire user base. Compare how many of the held-out 10% appear in the top k ranks. Measure recall@k and precision@k and compare to baselines. The combined approach usually outperforms both separate regression-based and similarity-based solutions across different seed list sizes.
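A sketch of how recall@k and precision@k can be computed on the held-out seed users, assuming the model's scores over the full user base fit in a NumPy array; variable names are illustrative.

import numpy as np

def recall_precision_at_k(scores, user_ids, held_out_ids, k):
    # Rank all users by predicted score and keep the top-k user IDs.
    order = np.argsort(-np.asarray(scores))[:k]
    top_k_ids = set(np.asarray(user_ids)[order].tolist())
    hits = sum(1 for uid in held_out_ids if uid in top_k_ids)
    recall = hits / max(len(held_out_ids), 1)   # share of held-out seeds recovered
    precision = hits / k                        # share of top-k that are held-out seeds
    return recall, precision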
Online Evaluation
Run an A/B test where the control uses a previous hybrid system blending regression-based and similarity-based outputs. The treatment uses the new combined model. Monitor metrics such as user reach, impressions, revenue, hide rate, and clicks. A strong lift in impressions and revenue, together with stable or improved engagement quality, indicates success. Infrastructure becomes simpler because only one model pipeline is maintained.
Implementation Details
Data generation starts with a distributed system that updates user embeddings regularly from large-scale user-item interactions. Training each advertiser’s MLP classifier uses a batch framework (like Spark on Kubernetes) to handle seed lists on the order of 10^5 users and a candidate pool on the order of 10^8 users. Negative sampling keeps the training set balanced. The deployment pipeline stores these trained classifiers. An inference job periodically scores all users to produce an expanded audience list for each advertiser, which then becomes available for targeted advertising.
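A simplified sketch of the negative-sampling step; in production this would run inside the distributed batch job over the ~10^8 candidate pool, so the plain Python below only illustrates the logic, and the 1:1 negative-to-positive ratio is an assumption.

import random

def sample_negatives(seed_ids, candidate_ids, neg_per_pos=1, seed=42):
    # Draw non-seed users uniformly at random so positives and negatives stay balanced.
    rng = random.Random(seed)
    seed_set = set(seed_ids)
    pool = [uid for uid in candidate_ids if uid not in seed_set]
    n_neg = min(len(pool), neg_per_pos * len(seed_set))
    return rng.sample(pool, n_neg)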
Example Code Snippet
import torch
import torch.nn as nn
import torch.optim as optim

class MLPClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(MLPClassifier, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.layers(x)

# Assume embeddings, labels, and weights are already loaded as float tensors:
# X is (num_samples, input_dim), y is (num_samples,) with 0/1 labels, w is (num_samples,)
model = MLPClassifier(input_dim=128, hidden_dim=64)
criterion = nn.BCELoss(reduction='none')          # per-sample loss, weighted below
optimizer = optim.Adam(model.parameters(), lr=0.001)

epochs = 10  # tuned per advertiser in practice
for epoch in range(epochs):
    optimizer.zero_grad()
    y_pred = model(X)
    loss_individual = criterion(y_pred.squeeze(), y)  # per-sample binary cross-entropy
    weighted_loss = (loss_individual * w).mean()      # engagement-weighted average
    weighted_loss.backward()
    optimizer.step()
The code shows a simple MLP. Weighted loss applies to each sample. The actual pipeline would incorporate custom data loaders, logging, and distributed processing.
Potential Improvements
Factorization machines or attention-based architectures can capture interactions more effectively. Contextual factors, such as user recency or content trends, can further refine expansions. Better sampling schemes for negative data or more nuanced weighting can also improve classifier accuracy.
What if the seed list is very small?
When the seed list has few users, regression-based methods risk overfitting. The combined approach mitigates this by relying on powerful universal embeddings. Dense vectors learned from massive global data offer rich signals for cold-start seed lists. The MLP’s parameters adjust only slightly to capture these small but distinct patterns.
How do you handle very large seed lists?
Large seed lists lead to high variance in user behavior. A logistic regression that trains on raw features may exploit the volume but can miss nonlinear relationships. The combined approach includes only up to a certain number (like 200,000) of positive samples from the seed list to keep training feasible, then leverages universal embeddings to capture broader information. The MLP’s hidden layers capture complex correlations among embedding dimensions, leading to stronger performance.
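A brief sketch of capping very large seed lists before training, as mentioned above; the 200,000 cap is the example figure from the text, and uniform subsampling is an assumed strategy.

import random

MAX_POSITIVES = 200_000  # example cap; the exact value is a design choice

def cap_seed_list(seed_ids, cap=MAX_POSITIVES, seed=0):
    # Uniformly subsample oversized seed lists so per-advertiser training stays feasible.
    seed_ids = list(seed_ids)
    if len(seed_ids) <= cap:
        return seed_ids
    return random.Random(seed).sample(seed_ids, cap)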
Why not only use nearest-neighbor search on embeddings?
Pure similarity-based methods quickly find new users similar to a seed cluster. They work well if the seed list is small, but for large or noisy seeds, the approach misses subtle signals that a dedicated classifier can learn. Classifiers also incorporate weighting or specialized loss functions, adjusting to business rules or engagement metrics that a basic nearest-neighbor approach cannot handle.
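For comparison, a similarity-only baseline can be sketched with a nearest-neighbor index over the embeddings. The snippet below assumes Faiss and cosine similarity over L2-normalized float32 vectors; this is an illustrative choice, not the platform's actual stack, and at 10^8 users an approximate index (IVF or HNSW) would replace the flat index.

import faiss
import numpy as np

def expand_by_similarity(user_embeddings, seed_embeddings, k=100):
    # Both inputs must be contiguous float32 arrays; normalization happens in place
    # so that inner product equals cosine similarity.
    dim = user_embeddings.shape[1]
    faiss.normalize_L2(user_embeddings)
    faiss.normalize_L2(seed_embeddings)
    index = faiss.IndexFlatIP(dim)
    index.add(user_embeddings)
    scores, neighbor_ids = index.search(seed_embeddings, k)  # k neighbors per seed user
    return neighbor_ids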
How do you prevent overfitting on each advertiser’s seed list?
Regularization methods like dropout or weight decay on the MLP reduce overfitting. Limiting the number of dense layers can help. Since embeddings are pretrained, the effective dimensionality is controlled. Negative sampling from the wider user base also ensures the model sees a range of examples.
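A minimal sketch of these regularizers applied to the same PyTorch setup as the earlier snippet; the dropout rate and weight-decay value are illustrative.

import torch.nn as nn
import torch.optim as optim

class RegularizedMLP(nn.Module):
    def __init__(self, input_dim=128, hidden_dim=64, dropout=0.3):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),          # dropout on the hidden representation
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.layers(x)

model = RegularizedMLP()
# weight_decay adds L2 regularization on the MLP parameters
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)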
How do you select the weighting scheme?
Picking the right engagement metrics is critical. If the primary goal is revenue, weight seed users by revenue contribution. If the goal is to maximize clicks, weight them by click frequencies. A log transform compresses outliers, and min-max scaling keeps the weights in a balanced range. Empirical evaluation guides which metric or combination works best.
How would you monitor performance in production?
Track impressions, clicks, revenue, hide rate, and additional engagement metrics in real time. Confirm that new expansions bring enough incremental lift. Watch for distribution shifts in user behavior that could degrade model predictions. Regularly retrain or refresh embeddings if the user-item interaction patterns evolve.
How do you improve system scalability?
Distribute training and inference processes. Partition users and seed lists. Use a big data framework for model training and scoring. Cache partial outputs when possible. Exploit approximate nearest-neighbor search if certain fallback or hybrid strategies are needed.
Would you incorporate multi-task learning or advanced architecture?
An MLP with shared layers across advertisers could use multi-task learning to share knowledge. Complex architectures like factorization machines or attention layers can capture feature interactions or temporal context. The best choice depends on data size, complexity, and computational budgets.
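A sketch of the multi-task variant with a shared trunk and one output head per advertiser, assuming a modest set of advertisers trained jointly; this is a possible extension, not the deployed design.

import torch
import torch.nn as nn

class MultiTaskAudienceModel(nn.Module):
    def __init__(self, input_dim=128, hidden_dim=64, num_advertisers=10):
        super().__init__()
        # Shared layers learn general engagement structure across all seed lists.
        self.shared = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        # One lightweight head per advertiser captures seed-specific patterns.
        self.heads = nn.ModuleList(
            nn.Linear(hidden_dim, 1) for _ in range(num_advertisers)
        )

    def forward(self, x, advertiser_idx):
        h = self.shared(x)
        return torch.sigmoid(self.heads[advertiser_idx](h))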