ML Case-study Interview Question: ML-Powered Contact Accuracy Score: Unifying Email and Company Verification
Case-Study question
A large data intelligence platform merged two different systems that each provided a single metric indicating data accuracy. One system used last updated date, and the other used a human-verified vs machine-generated label. These single metrics were sometimes misleading or expensive to maintain. The company chose to unify them by creating a machine learning-based contact accuracy score focusing on email address and company name accuracy. How would you design a solution to generate an accuracy score for each contact, ensuring scalability, strong predictive power, and continuous improvement?
Provide a step-by-step plan. Describe how you would handle ground truth data creation, feature extraction, model selection, model deployment, and ongoing maintenance.
Detailed Solution
Overview
The solution combines multiple signals (recent updates, data sources, verification status, etc.) into a single score for each contact. The main target is whether the email address is valid and whether the contact is associated with the correct company. A random subset of records is labeled as good or bad, and a model then predicts the likelihood that a new record is correct based on relevant features.
Ground Truth Construction
Randomly sample a subset of contacts. Manually check if each email address is valid (bounce tests) and confirm the person’s current company. Label as good if both company and email are valid, or if only the company is valid and email is missing. Label as bad if the company is invalid or the email is invalid. This becomes the training set.
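A minimal sketch of this labeling rule, assuming hypothetical columns company_valid (from manual review) and email_valid (True/False from a bounce test, missing if no email is on file):
import pandas as pd
def label_contact(row):
    # Bad if the company is wrong, or if an email is present but invalid
    if not row["company_valid"]:
        return 0
    if pd.isna(row["email_valid"]):   # email missing: company alone decides
        return 1
    return 1 if row["email_valid"] else 0
sample = pd.DataFrame({
    "company_valid": [True, True, False, True],
    "email_valid":   [True, None, True, False],
})
sample["label"] = sample.apply(label_contact, axis=1)   # -> 1, 1, 0, 0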
Exploratory Data Analysis
Explore each field to see how it correlates with good vs bad. Use statistical tests or data visualization to see which features best separate good from bad. Investigate how last updated date, human verification indicator, and other fields (like phone presence or multiple data sources) correlate with correctness.
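A quick way to run these checks, sketched under the assumption that the labeled sample sits in a DataFrame df with the feature columns used later in this solution:
import pandas as pd
from scipy.stats import chi2_contingency
# Good-rate by whether the record was human-verified
print(df.groupby("verified_flag")["label"].mean())
# Good-rate by age bucket of the record
df["age_bucket"] = pd.cut(df["last_updated_days"], bins=[0, 30, 90, 365, 10000])
print(df.groupby("age_bucket", observed=True)["label"].mean())
# Chi-square test for a categorical feature against the label
chi2, p_value, _, _ = chi2_contingency(pd.crosstab(df["phone_exists"], df["label"]))
print(f"phone_exists vs label: chi2={chi2:.1f}, p={p_value:.4f}")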
Feature Selection
Focus on fields with the highest predictive power (a feature-construction sketch follows this list):
Age of the record or last updated timestamp
Whether the email was machine-generated or user-supplied
Availability of a phone number
Number of distinct data sources feeding the record
Age of any signatures or references
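The sketch below shows how these five features might be derived from raw contact fields; the raw column names (last_updated, email_source, phone, source_ids, signature_date) are assumptions for illustration:
import pandas as pd
def build_features(raw: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    feats = pd.DataFrame(index=raw.index)
    # Age of the record in days
    feats["last_updated_days"] = (as_of - pd.to_datetime(raw["last_updated"])).dt.days
    # 1 if the email came from a user rather than being machine-generated
    feats["verified_flag"] = (raw["email_source"] == "user_supplied").astype(int)
    # 1 if any phone number is on file
    feats["phone_exists"] = raw["phone"].notna().astype(int)
    # 1 if more than one data source feeds the record
    feats["multiple_sources"] = (raw["source_ids"].str.split(",").str.len() > 1).astype(int)
    # Age in days of the most recent signature or reference
    feats["signature_age"] = (as_of - pd.to_datetime(raw["signature_date"])).dt.days
    return feats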
Model Choice
A practical approach is logistic regression. It models the probability that a record is good or bad. A general form is shown below.
p = 1 / (1 + exp(-(beta_0 + beta_1*x_1 + ... + beta_n*x_n)))
Here, p is the probability a record is good, x_1..x_n are features such as last updated date, user verification, phone number presence, etc., and beta_0..beta_n are learned parameters.
After training on the labeled subset, apply the model to all contacts. The output is a probability in the range [0,1]. Map that probability to a final 70-99 range, since clearly poor or outdated contacts are removed or cleaned from the system before scoring.
Sample Python Code
import pandas as pd
from sklearn.linear_model import LogisticRegression
# df contains one row per contact with the engineered features;
# 'label' is 1 for good, 0 for bad (present only for the manually labeled subset)
feature_cols = ['last_updated_days', 'verified_flag', 'phone_exists',
                'multiple_sources', 'signature_age']
labeled = df.dropna(subset=['label'])
# Train on the labeled subset
model = LogisticRegression(max_iter=1000)
model.fit(labeled[feature_cols], labeled['label'])
# Score every contact, labeled or not, with the trained model
df['score_raw'] = model.predict_proba(df[feature_cols])[:, 1]
# Map the [0, 1] probability onto the 70-99 product range
df['contact_accuracy_score'] = 70 + 29 * df['score_raw']
The code selects the relevant features, fits a logistic regression model on the labeled subset, and predicts the probability that each contact is good. It then maps that probability to a 70-99 range for the final contact accuracy score. Whenever contacts are updated, re-run these steps or apply an incremental retraining process.
Validation
To confirm the score is meaningful, randomly sample records, calculate their score, and manually re-verify email and company correctness. Score distributions should correlate with actual correctness rates. Adjust thresholds or modeling parameters if the predictions deviate from observed outcomes.
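One way to compare score distributions with observed correctness is to bucket the scores and measure the good-rate per bucket; the sketch below assumes a re-verified sample, verified, holding the assigned score and a fresh manual label:
import pandas as pd
# verified: DataFrame with 'contact_accuracy_score' and a re-checked 'label' (1 = good)
verified["score_bucket"] = pd.cut(verified["contact_accuracy_score"],
                                  bins=[70, 80, 90, 99], include_lowest=True)
calibration = verified.groupby("score_bucket", observed=True)["label"].agg(["mean", "count"])
print(calibration)   # the observed good-rate should rise with the score bucket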
Maintenance
Continuously sample new records or changed records for manual verification. Retrain periodically using the newly labeled records. Expand or refine features (like phone type or recency of job transitions) to capture more signal. Consider advanced models if logistic regression underperforms.
How would you address these Follow-Up Questions?
1) How do you handle data that changes rapidly?
Monitor frequently updated fields and run incremental retraining. For each incremental batch, gather ground truth labels, update features like last update timestamp, and retrain or fine-tune the model. Apply automated checks (e.g., bounce tests) to high-value records first.
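A minimal sketch of incremental updates, assuming SGDClassifier with a logistic loss so that each freshly labeled batch can be absorbed via partial_fit rather than a full retrain; new_batch is a hypothetical DataFrame of newly verified records and column names follow the earlier example:
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
feature_cols = ['last_updated_days', 'verified_flag', 'phone_exists',
                'multiple_sources', 'signature_age']
scaler = StandardScaler()
model = SGDClassifier(loss='log_loss', random_state=42)
# Initial fit on the existing labeled set
X0 = scaler.fit_transform(labeled[feature_cols])
model.partial_fit(X0, labeled['label'], classes=[0, 1])
# Later: fine-tune on each newly labeled batch without retraining from scratch
X_new = scaler.transform(new_batch[feature_cols])
model.partial_fit(X_new, new_batch['label'])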
2) Why focus on email and company accuracy?
Email and company name are business-critical fields for marketing, sales, and engagement. Invalid emails cause bounces and penalties with mailing services. Incorrect company associations waste resources and lead to lost opportunities.
3) What if you want a separate score for phone numbers?
Repeat the same approach. Define a ground truth for phone correctness. Retrain a similar logistic regression or more advanced model using phone-specific labels. Combine scores or produce multiple accuracy scores (e.g., EmailAccuracyScore and PhoneAccuracyScore).
4) How do you choose between logistic regression and more complex algorithms?
Compare performance metrics (e.g., area under ROC curve) for multiple approaches such as random forest, gradient boosting, or neural networks. Logistic regression is simple and transparent, making it easier to explain. If a complex model demonstrates significantly higher accuracy, weigh that benefit against interpretability, training cost, and data scale.
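A hedged sketch of that comparison, using cross-validated ROC AUC on the labeled features and labels from the training step:
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
X, y = labeled[feature_cols], labeled['label']
candidates = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'random_forest': RandomForestClassifier(n_estimators=200, random_state=42),
    'gradient_boosting': GradientBoostingClassifier(random_state=42),
}
for name, estimator in candidates.items():
    auc = cross_val_score(estimator, X, y, cv=5, scoring='roc_auc')
    print(f"{name}: mean AUC = {auc.mean():.3f} (+/- {auc.std():.3f})")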
5) How do you address label imbalance if most records are good?
Use stratified sampling to preserve class proportions in training data. If good vs bad is highly imbalanced, apply techniques like oversampling bad records or undersampling good records. Experiment with class-weight parameters in the model training function. Evaluate performance carefully on a balanced validation set.
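A sketch of two of these options in scikit-learn, a stratified split plus balanced class weights, again using the labeled features and labels from the training step:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
# Stratified split keeps the good/bad ratio identical in train and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    labeled[feature_cols], labeled['label'],
    test_size=0.2, stratify=labeled['label'], random_state=42)
# class_weight='balanced' up-weights the rarer "bad" class during training
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_train, y_train)
print('validation AUC:', roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))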
6) How do you keep your data pipeline efficient?
Automate data ingestion, cleaning, feature engineering, and model scoring. Cache intermediate outputs to reduce repetitive computations. Use distributed computing frameworks if the dataset is large. Log data changes to trigger partial scoring rather than recomputing everything.
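One simple way to trigger partial scoring is to hash each record's feature values and re-score only the rows whose hash changed since the previous run; previous_hashes below is a hypothetical Series persisted from that run:
import pandas as pd
# Hash the current feature values for every contact
df['feature_hash'] = pd.util.hash_pandas_object(df[feature_cols], index=False)
# Re-score only the rows whose features changed since the last scoring pass
changed = df['feature_hash'] != previous_hashes.reindex(df.index)
df.loc[changed, 'score_raw'] = model.predict_proba(df.loc[changed, feature_cols])[:, 1]
df.loc[changed, 'contact_accuracy_score'] = 70 + 29 * df.loc[changed, 'score_raw']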
7) What if human verification becomes too expensive?
Prioritize high-risk or high-impact subsets for manual checks. Reduce labeling frequency for stable data. Explore more advanced machine learning approaches. Integrate feedback loops from email bounce logs or user responses to refine labels. Focus on ROI: manual checks might be justified for certain high-value segments.
8) Why set a minimum score at 70?
Poor-quality or outdated records are removed before scoring, so even the lowest-scoring records still meet a minimal quality threshold. This scoring strategy is a product decision. If new requirements arise, shift the baseline to different minimum values (like 50) or allow negative scores.
9) How do you ensure generalization to new data?
Include diverse samples during model training. If your user base expands internationally, incorporate those countries in training data. Periodically retrain using new contact profiles, watch for drift (e.g., changing email patterns), and confirm the model’s assumptions still hold.
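One common drift check is the population stability index (PSI) between a feature's distribution at training time and its current distribution; values above roughly 0.2 are often treated as notable drift. A sketch, assuming train_df is the training snapshot and current_df the live data:
import numpy as np
def psi(expected, actual, n_bins=10):
    # Bin edges come from the training-time (expected) distribution
    edges = np.histogram_bin_edges(expected, bins=n_bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + 1e-6
    a_pct = np.histogram(np.clip(actual, edges[0], edges[-1]), bins=edges)[0] / len(actual) + 1e-6
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
drift = psi(train_df['last_updated_days'], current_df['last_updated_days'])
print(f"PSI for last_updated_days: {drift:.3f}")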
10) What improvements would you consider in the future?
Incorporate more features (e.g., role-based emails, auto-detected seniority). Assign different weights to fresh vs older data. Build ensemble models that average multiple algorithms’ outputs. Segment the model by industries or regions if you observe different data patterns.