ML Case-study Interview Question: Classifying Incorrect Address Locations using RoBERTa and Geohashing
Case-Study question
You are given a scenario where a food delivery platform notices many canceled orders because Delivery Partners cannot locate the correct address. The platform captures each address’s text description and a corresponding geo-location from the mobile’s GPS during onboarding. Actual deliveries happen at locations sometimes different from what was captured. Construct a solution that flags addresses where the captured location is likely incorrect. Describe a data acquisition strategy for labeled training data without direct human annotation, propose a suitable model pipeline, and explain how you would handle noisy GPS signals.
Provide the most detailed approach possible, including:
Methods to build a dataset of correct (label-0) and incorrect (label-1) location-text pairs.
Algorithms or models to classify address-location mismatches.
Techniques to validate and improve the classifier’s performance.
Detailed Solution
Dataset Preparation
Collected historical orders with address text and GPS coordinates. For each delivered order, the Delivery Partner’s “delivered” location was stored. Observed that the Delivery Partner might tap “delivered” at various distances from the real drop point. Synthesized two categories:
Label-0 set (correct pairs). Aggregated each address’s multiple delivered pings and local trajectory points. Computed a median coordinate that best approximated the real location. Called this the “synthetic location.” If the synthetic location was within a certain threshold from the original captured location, marked that address as label-0. Spot checks on these confirmed high accuracy.
Label-1 set (incorrect pairs). Direct extraction of incorrect addresses from the delivered data had lower reliability because Delivery Partners might tap “delivered” far from the actual spot or addresses might change over time. Instead, took each label-0 pair, then perturbed the location. Ensured the shift exceeded a certain minimum distance. This guaranteed the resulting location was unrelated to the address text.
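The label-0 construction above can be sketched in a few lines. This is a minimal illustration: the threshold value and the flat meters-per-degree conversion are assumptions for readability, and a real pipeline would use a proper spherical distance.

```python
import statistics

# Rough meters-per-degree at low latitudes; a production pipeline would use a
# geodesic distance instead of this flat-earth shortcut.
METERS_PER_DEG = 111_000
LABEL0_THRESHOLD_M = 75.0  # hypothetical acceptance radius

def synthetic_location(delivered_pings):
    """Per-axis median of delivered (lat, lng) pings for one address."""
    lats = [p[0] for p in delivered_pings]
    lngs = [p[1] for p in delivered_pings]
    return statistics.median(lats), statistics.median(lngs)

def is_label0(captured, delivered_pings, threshold_m=LABEL0_THRESHOLD_M):
    """Mark the address as label-0 if the captured location sits within
    threshold_m meters of the median-based synthetic location."""
    syn = synthetic_location(delivered_pings)
    d_m = METERS_PER_DEG * max(abs(syn[0] - captured[0]),
                               abs(syn[1] - captured[1]))
    return d_m <= threshold_m

# One outlier "delivered" tap does not dominate the median:
pings = [(12.9712, 77.5940), (12.9710, 77.5942),
         (12.9950, 77.6100), (12.9711, 77.5941)]
print(synthetic_location(pings))
```

The median (rather than the mean) is what makes the occasional far-away "delivered" tap mostly harmless.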
Key Distance Formula
$$
d = 2r \arcsin\!\left(\sqrt{\sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1\,\cos\varphi_2\,\sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right)
$$
Here r is Earth’s approximate radius, phi_1 and phi_2 are latitudes in radians, lambda_1 and lambda_2 are longitudes in radians, and d is the spherical (haversine) distance on Earth’s surface. This was used as a check for “captured vs. synthetic” distance and for perturbation thresholds.
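A minimal Python version of this spherical distance check (function name and radius constant are illustrative):

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters

def haversine_m(phi1_deg, lam1_deg, phi2_deg, lam2_deg):
    """Haversine distance in meters between two (lat, lng) points in degrees."""
    phi1, lam1, phi2, lam2 = map(math.radians,
                                 (phi1_deg, lam1_deg, phi2_deg, lam2_deg))
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin((lam2 - lam1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# One degree of longitude at the equator is roughly 111.2 km
print(round(haversine_m(0.0, 0.0, 0.0, 1.0)))
```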
Model Approach
Treated the address text plus a geohash encoding as input. Transformed location coordinates into geohash strings (e.g., a level-8, eight-character geohash) to unify numeric latitude-longitude into a textual format. Appended the geohash to the address text, separated by a special token. Fine-tuned a language model (a scaled-down RoBERTa) to output a binary label: 0 if address text and location match, 1 otherwise.
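The input construction can be sketched as below. The geohash encoder is hand-rolled here so the snippet is self-contained (in practice a library would be used), and the "[SEP]" string is a stand-in for whatever separator token the chosen tokenizer actually uses.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet

def encode_geohash(lat, lng, precision=8):
    """Encode (lat, lng) into a geohash by interleaving longitude/latitude
    bisection bits, emitting one base-32 character per 5 bits."""
    lat_rng, lng_rng = [-90.0, 90.0], [-180.0, 180.0]
    bits, ch, even, out = 0, 0, True, []
    while len(out) < precision:
        rng, val = (lng_rng, lng) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        ch <<= 1
        if val >= mid:
            ch |= 1
            rng[0] = mid   # keep the upper half
        else:
            rng[1] = mid   # keep the lower half
        even = not even
        bits += 1
        if bits == 5:
            out.append(BASE32[ch])
            bits, ch = 0, 0
    return "".join(out)

def build_model_input(address, lat, lng, sep="[SEP]"):
    """Concatenate address text and a level-8 geohash with a separator."""
    return f"{address} {sep} {encode_geohash(lat, lng, 8)}"

print(encode_geohash(57.64911, 10.40744))  # well-known test vector: u4pruydq
```

Because nearby locations share geohash prefixes, the classifier can learn locality from plain subword tokens.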
Pretrained the model in two phases:
Phase 1: Masked Language Model (MLM) on large unlabeled address text combined with geohash segments. This helped the model learn the local geography-related tokens and address semantics.
Phase 2: Fine-tuned the full network weights for classification. Training only the top classification layer while keeping the base frozen was less effective; updating all layers yielded better precision-recall.
Handling Noise
Used the Delivery Partner’s local route as an extended cluster around the delivered location. Incorporated route points that lay near the delivered ping, thus smoothing out random GPS spikes. Ensured that addresses in large complexes or gated communities did not hamper the median-based clustering, though such cases sometimes required higher distance thresholds.
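This smoothing step might look like the sketch below; the 30 m radius and the flat-earth distance approximation are illustrative choices, not the production values.

```python
METERS_PER_DEG = 111_000  # rough conversion, adequate at tens of meters

def denoise_delivered(delivered, route_pings, radius_m=30.0):
    """Average the delivered ping with nearby route pings to damp GPS spikes.

    delivered: (lat, lng); route_pings: list of (lat, lng) from the
    Delivery Partner's local route around the drop point.
    """
    def approx_dist_m(a, b):
        return METERS_PER_DEG * ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

    # Keep only route points close to the delivered ping, then average.
    nearby = [p for p in route_pings if approx_dist_m(delivered, p) <= radius_m]
    cluster = [delivered] + nearby
    lat = sum(p[0] for p in cluster) / len(cluster)
    lng = sum(p[1] for p in cluster) / len(cluster)
    return lat, lng
```

Distant route points (e.g., the approach road) are excluded, so a single GPS spike cannot drag the denoised location away.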
Example Code Snippet
```python
import random

import geopy.distance
import torch
from transformers import RobertaForMaskedLM, RobertaForSequenceClassification


def create_synthetic_label1(lat_lng, min_shift_dist=125.0, var=0.02):
    """Perturb a known-correct (lat, lng) until it lies at least
    min_shift_dist meters away, producing a label-1 (incorrect) pair.

    lat_lng:        tuple (lat, lng) in degrees
    min_shift_dist: minimum shift in meters
    var:            std-dev (in degrees) of the Gaussian noise per coordinate
    """
    while True:
        lat_noise = random.gauss(0, var)
        lng_noise = random.gauss(0, var)
        candidate = (lat_lng[0] + lat_noise, lat_lng[1] + lng_noise)
        dist = geopy.distance.geodesic(lat_lng, candidate).meters
        if dist >= min_shift_dist:
            return candidate


# Phase 1: pretrain with MLM on address+geohash data...
model = RobertaForMaskedLM.from_pretrained('roberta-base')

# Phase 2: switch to a classification head:
classification_model = RobertaForSequenceClassification.from_pretrained(
    'roberta-base', num_labels=2)
# Fine-tune all layers with the labeled dataset...
```
Scalability
Processed thousands of addresses daily. Automated the label generation with minimal human oversight by verifying random subsets. Model inference runs with a straightforward text input pipeline in near real-time.
Follow-up Question 1
How would you ensure high-quality labels if Delivery Partners sometimes tap “delivered” far from the exact spot?
Detailed Answer
Filtered the raw delivered points by checking the Delivery Partner’s nearby route. For each delivered point, gathered other route pings within a small radius. Averaged them to reduce random offsets. Next, aggregated these denoised pings across multiple successful deliveries for the same address. Computed a median location to get a stable “synthetic location.” Compared that with the captured location to decide if it was correct. This overcame outliers from occasional distant tap locations. Added additional threshold checks if the data distribution showed large standard deviations.
Follow-up Question 2
Why not label addresses as incorrect when the captured and synthetic location differ by a large distance?
Detailed Answer
Directly labeling addresses as incorrect if distance exceeded a threshold had issues:
Delivery Partners might mark “delivered” at the apartment gate instead of the door. Repeated gate-side taps yield an artificially large distance.
Users might move to a new city and edit the same address text. Synthetic location would drift.
Noise or single outlier pings could falsely inflate distances.
These inconsistencies generated many inaccurate label-1 entries. Perturbing the high-confidence label-0 addresses was simpler and guaranteed correctness, because shifting a known-correct coordinate always yields a wrong pairing.
Follow-up Question 3
Explain why you do full fine-tuning instead of freezing all RoBERTa layers except the final classification head.
Detailed Answer
Freezing the main layers while training only the classification head performs well for many general language tasks. This address-based classification is domain-specific, containing short text with local context (street names, geohashes). The pretrained model lacks enough representation for these specialized tokens. Full fine-tuning updates the internal embeddings to learn nuances of Indian address formats, location codes, and domain details. Experiments showed that freezing the base layers plateaued in precision and recall. Unlocking all layers enabled the model to learn the specialized language structure faster. This produced substantial improvements in the final F1 score.
Follow-up Question 4
Describe your evaluation criteria once this model is live. How do you handle mismatches between training data distribution and real production data?
Detailed Answer
Monitored precision and recall on fresh samples of canceled orders. Resampled flagged addresses that triggered re-verification and had them validated by a human or cross-checked with additional signals such as repeated edits. Looked for distribution drifts: seasonal changes in addresses, expansions into new regions with unique addressing norms, or new GPS hardware variances. Retrained the model periodically, incorporating new data patterns. Applied incremental monitoring by tracking false positives on newly served areas.
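The live monitoring loop reduces to tracking confusion counts on human-verified samples; a minimal sketch (the counts below are made-up numbers):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision/recall/F1 from confusion counts on verified flagged samples."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# e.g. 80 correctly flagged bad addresses, 20 false alarms, 10 missed:
# precision 0.80, recall ~0.889, F1 ~0.842
print(precision_recall_f1(80, 20, 10))
```

Tracking these per region makes distribution drift visible: a newly launched city with unusual addressing norms shows up as a localized precision drop before it hurts the global numbers.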
Follow-up Question 5
How would you adapt or extend this approach for other languages or country-specific address structures?
Detailed Answer
Retrained the masked language model with local text from the new region. Gathered addresses in that language, combined them with geohash strings as before. Fine-tuned the classification layer with domain-specific training examples. Adjusted tokenizers to handle unique characters or patterns. Verified the numeric geohash approach for that region’s coordinate system. Kept similar perturbation logic for label-1 generation, but validated thresholds to match typical address patterns in that locale. Verified with pilot sets to confirm model alignment with local language and address format.