ML Case-study Interview Question: Using Machine Learning Entity Linking to Disambiguate Names in Leaked Data
Case-Study question
A media organization investigates large whistleblower leaks containing names of individuals possibly engaged in illicit activities. They need a system that identifies and disambiguates these people by linking their names in leaked files to unique entries in a database. They want to reduce time-consuming manual effort by using a pipeline that extracts names, matches them to real-world entities, and helps journalists visualize relationships among these entities. How would you design, build, and evaluate a machine learning solution that can handle ambiguous names, multiple data sources, and incomplete records? Propose a full solution, explain each step in detail, and discuss how you would overcome significant engineering or data challenges.
Detailed solution
Data ingestion and preprocessing
Large-scale data dumps often arrive in unstructured formats. Text extraction from source files is the first step. Many leaks require reading PDF files, emails, or scanned documents. Converting these into plain text is crucial. Optical Character Recognition helps if documents are image-based.
Standard text cleaning removes noise like HTML tags, broken characters, or repeated boilerplate. Journalists integrate external knowledge bases containing entries of interest. Combining multiple data sources requires consistent formatting. This unification stage ensures unique keys for each entry.
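As a rough illustration of the cleaning stage, the sketch below strips HTML tags, decodes entities, and normalizes whitespace; the exact rules would depend on the formats present in a given leak, and the function name is only a placeholder.

import html
import re

def clean_text(raw: str) -> str:
    # Remove HTML tags left over from email or web exports
    text = re.sub(r"<[^>]+>", " ", raw)
    # Decode HTML entities such as &amp; or &nbsp;
    text = html.unescape(text)
    # Collapse runs of whitespace introduced by extraction
    text = re.sub(r"\s+", " ", text)
    return text.strip()

print(clean_text("<p>Meeting with  John&nbsp;Doe</p>"))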
Named Entity Recognition
NER tags all person names in the text. NLP libraries such as spaCy ship pretrained models that handle complex text. After scanning paragraphs, the NER model returns strings tagged as person names, often capturing first and last names. NER occasionally mislabels or misses partial references. Thorough model fine-tuning on relevant training data helps mitigate errors.
Knowledge base construction
Journalists often rely on curated databases listing individuals, organizations, or entities flagged for possible wrongdoing. Merging sources like LittleSis and OpenSanctions can build a large repository. Duplicate entries or stale references require heavy cleaning, merging, and reconciling. A structured database must keep attributes such as name variations, affiliations, and short bios.
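One possible shape for a knowledge-base record, with deduplication keyed on a normalized name; the field names and merge rule below are illustrative, not a prescribed schema.

def normalize_name(name: str) -> str:
    # Lowercase and strip punctuation so "J. Doe" and "j doe" collide
    return " ".join(name.lower().replace(".", " ").split())

records = [
    {"id": "E1", "name": "John Doe", "aliases": ["J. Doe"], "affiliation": "Acme Corp"},
    {"id": "E2", "name": "john doe", "aliases": [], "affiliation": "Acme Corp"},
]

merged = {}
for rec in records:
    key = normalize_name(rec["name"])
    if key in merged:
        # Merge aliases from duplicate entries instead of keeping both records
        merged[key]["aliases"] = sorted(set(merged[key]["aliases"]) | set(rec["aliases"]))
    else:
        merged[key] = rec

print(list(merged.values()))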
Entity linking approach
The system ranks each candidate entry e by a score P(e | m, c), the probability of entity e given the mention m and its surrounding context c. The predicted entity is the candidate with the highest score:

e* = argmax_e P(e | m, c)

Here e is a candidate entry from the knowledge base, m is the mention (the recognized name string in the text), and c is the surrounding context, such as the words around the mention.
String similarity identifies a rough set of candidates from the knowledge base by matching name tokens. More advanced filters consider known aliases or the presence of job titles or geolocations in the context. The entity linker then resolves which candidate is correct by modeling the text context: if the paragraph references politics or business deals, that can shift probabilities toward specific individuals.
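A minimal sketch of the candidate-retrieval step using simple string similarity; a production system would more likely use a search index or alias table, and the threshold here is arbitrary.

from difflib import SequenceMatcher

knowledge_base = {
    "E1": "Adam Smith",
    "E2": "Adam Schmidt",
    "E3": "John Doe",
}

def retrieve_candidates(mention: str, kb: dict, threshold: float = 0.75):
    # Score each knowledge-base name against the mention and keep close matches
    scored = []
    for entity_id, name in kb.items():
        score = SequenceMatcher(None, mention.lower(), name.lower()).ratio()
        if score >= threshold:
            scored.append((entity_id, name, score))
    return sorted(scored, key=lambda x: x[2], reverse=True)

print(retrieve_candidates("Adam Smith", knowledge_base))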
Training data creation
Labeled examples are the backbone of supervised entity linking. Annotators read passages mentioning people and link these to the correct database entry. Ambiguous names like “Adam Smith” require careful context reading. Single-name mentions often lack enough cues, making them harder to map.
A single annotator can only label a certain number of examples per hour. Thousands of examples are needed for robust generalization. Decision rules reduce variance across annotators. A consistent labeling guide ensures uniform tagging of partial names, repeated references, or multiple name variations.
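One way to represent a labeled example, roughly following the annotation format spaCy's entity linker expects (the mention's character offsets mapped to the gold knowledge-base identifier); details vary by spaCy version, and the identifier below is hypothetical.

# One annotated passage: the mention span (character offsets 0-10) is
# linked to the gold knowledge-base identifier with probability 1.0.
annotated_example = {
    "text": "Adam Smith signed the offshore contract in 2015.",
    "entities": [(0, 10, "PERSON")],
    "links": {(0, 10): {"E_adam_smith_politician": 1.0}},
}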
Model training
Entity linkers can be trained using frameworks like spaCy. Training proceeds by presenting the model with mention-context pairs and their correct database link. The model learns textual cues that distinguish close matches. When the context is generic (“the politician said...”), the model may fail to differentiate among candidates with similar backgrounds. Additional context or pre-filters help disambiguation.
If the database has multiple contradictory records, the pipeline struggles. Data deduplication is essential. The model can incorporate short descriptions or additional fields, such as known associates, to refine link choices. Once validated, the trained model can run in production for large-scale mention resolution.
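A sketch of how the knowledge base might be assembled for spaCy's entity linker, assuming spaCy v3.5+ (which exposes InMemoryLookupKB; older versions use KnowledgeBase). The vectors, frequencies, and entity identifiers are dummy values.

import spacy
from spacy.kb import InMemoryLookupKB

nlp = spacy.load("en_core_web_sm")

# Knowledge base holding one vector per entity plus alias prior probabilities
kb = InMemoryLookupKB(vocab=nlp.vocab, entity_vector_length=64)
kb.add_entity(entity="E1", freq=100, entity_vector=[0.0] * 64)
kb.add_entity(entity="E2", freq=20, entity_vector=[0.0] * 64)

# "John Doe" is ambiguous: prior probability is split across both entities
kb.add_alias(alias="John Doe", entities=["E1", "E2"], probabilities=[0.8, 0.2])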
Visualization and analysis
Journalists see the final results through a dashboard or graph-based interface. Connections between individuals, organizations, and relevant documents become visible. This helps them spot suspicious patterns or repeated appearances of certain names across multiple leaks.
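As a rough illustration of the graph view, resolved mentions can be turned into an entity-document graph; networkx is used here as one option, and the document and entity identifiers are made up.

import networkx as nx

# Edges connect resolved entities to the documents they appear in
links = [
    ("E_adam_smith", "doc_0017"),
    ("E_john_doe", "doc_0017"),
    ("E_john_doe", "doc_0231"),
]

graph = nx.Graph()
for entity_id, doc_id in links:
    graph.add_edge(entity_id, doc_id)

# Entities sharing a document show up as common neighbors in the graph
print(list(nx.common_neighbors(graph, "E_adam_smith", "E_john_doe")))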
Python code snippet (simplified prototype)
import spacy

nlp = spacy.load("en_core_web_sm")

def linker_model(mention, context):
    # Placeholder for a custom entity-linking component that returns
    # the most likely knowledge-base identifier for the mention.
    return "UNKNOWN"

text_data = [
    "Some text mentioning John Doe, a known figure in politics.",
    "Another text about John Smith, an economist.",
]
docs = [nlp(text) for text in text_data]

# Link every recognized person mention in each document
for doc in docs:
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            link_result = linker_model(ent.text, doc.text)
            print(f"Mention: {ent.text}, Link: {link_result}")
The code shows a minimal approach. The linker_model function acts as a custom component that returns the most likely database link for a recognized mention.
Follow-up question 1
How would you handle real-world naming inconsistencies, where a single individual appears under different aliases or with varying name formats, such as abbreviated first names?
Answer
Aliases cause mismatches between text mentions and knowledge base entries. The pipeline can store alternative name fields (e.g., full name, common nicknames, transliterations). The entity-linking candidate retrieval process can include checks on known abbreviations or partial matches. If multiple mentions map to the same identifier, the system unifies them. Journalists supply additional cross-references to ensure these aliases represent the same person. If confidence is insufficient, the system flags the entry for manual review.
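A sketch of alias-aware matching, where each record carries alternative name forms and a simple rule handles abbreviated first names; the record layout and matching rule are illustrative only.

def matches_alias(mention: str, record: dict) -> bool:
    mention_tokens = mention.lower().replace(".", "").split()
    names = [record["name"]] + record.get("aliases", [])
    for name in names:
        name_tokens = name.lower().replace(".", "").split()
        if mention_tokens == name_tokens:
            return True
        # Abbreviated first name: "J. Doe" should match "John Doe"
        if (len(mention_tokens) == len(name_tokens)
                and mention_tokens[-1] == name_tokens[-1]
                and all(m == n or (len(m) == 1 and n.startswith(m))
                        for m, n in zip(mention_tokens, name_tokens))):
            return True
    return False

record = {"name": "John Doe", "aliases": ["Johnny Doe"]}
print(matches_alias("J. Doe", record), matches_alias("Jane Doe", record))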
Follow-up question 2
What strategy would you adopt to evaluate the performance of your entity linking approach?
Answer
A holdout set of annotated text passages with known ground truth is key. The model’s predictions are compared to the true identifiers. Precision measures how many predicted links were correct. Recall measures how many correct links were retrieved from all possible references. F1 combines them. False positives occur when the model picks the wrong entry for a mention, while false negatives occur when the mention is missed or left unlinked. Analysis of errors on ambiguous names reveals where the model needs improvement. Segmenting the evaluation by mention length (single name vs full name) or by mention frequency (famous vs obscure individuals) highlights weaknesses.
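A minimal illustration of the evaluation arithmetic on a toy holdout set; the gold and predicted identifiers below are hypothetical.

gold = {("doc1", "Adam Smith"): "E1", ("doc1", "J. Doe"): "E3", ("doc2", "Adam Smith"): "E2"}
pred = {("doc1", "Adam Smith"): "E1", ("doc1", "J. Doe"): "E4"}  # one correct, one wrong, one missed

true_positives = sum(1 for k, v in pred.items() if gold.get(k) == v)
precision = true_positives / len(pred)   # correct links / predicted links
recall = true_positives / len(gold)      # correct links / gold links
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")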
Follow-up question 3
How would you scale the system to handle millions of documents with minimal latency?
Answer
Batch processing with a distributed system is often necessary. Parallel workers run the NER and entity linking steps on different document chunks. A fast indexing layer helps retrieve candidate entries quickly. GPU-based solutions improve inference time for language models. Spark or Ray can coordinate large-scale document pipelines. Caching partial computations avoids repeated string similarity checks. A streaming system might handle new documents in real time, updating indexes incrementally instead of reprocessing everything from scratch.
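A sketch of throughput-oriented inference with spaCy's nlp.pipe, which batches documents and can fan out across worker processes; batch size and process count would be tuned to the hardware, and the document generator is a stand-in for a real document store.

import spacy

nlp = spacy.load("en_core_web_sm")

def iter_documents():
    # In practice this would stream text out of the leak's document store
    yield "Some text mentioning John Doe, a known figure in politics."
    yield "Another text about John Smith, an economist."

# nlp.pipe batches documents and can parallelize across processes
for doc in nlp.pipe(iter_documents(), batch_size=256, n_process=2):
    people = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
    print(people)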
Follow-up question 4
How would you design the system to allow iterative improvements, especially when journalists flag errors in linked entities?
Answer
Journalists can provide feedback whenever they notice a wrong mapping. The system stores these corrections as new training data for future model retraining or fine-tuning. A feedback loop updates the knowledge base by adding or merging relevant aliases. The pipeline’s candidate retrieval adjusts weighting for certain textual cues if recurring mistakes appear. A version-controlled approach ensures the system can be rolled back if a new release produces degraded performance. Periodic re-annotation sessions expand the coverage of ambiguous cases, helping the model learn from real-world usage.
Follow-up question 5
How would you handle a scenario where the knowledge base itself contains outdated or conflicting records?
Answer
Merging data from multiple sources leads to duplicate or contradictory profiles. Automated scripts check for consistent fields, such as birth dates or known affiliations. If conflicts are found, a manual or semi-automated step resolves them. Records with severe contradictions are flagged for expert review. The pipeline logs ambiguities, letting data curators intervene. A continuous cleaning process ensures the knowledge base remains valid. Periodic revalidation catches issues introduced by newly ingested data.
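One possible automated consistency check, comparing a field such as birth date across records that share a normalized name; the record layout is illustrative.

records = [
    {"id": "E1", "name": "John Doe", "birth_date": "1970-03-02"},
    {"id": "E7", "name": "John Doe", "birth_date": "1981-11-19"},
]

by_name = {}
for rec in records:
    by_name.setdefault(rec["name"].lower(), []).append(rec)

for name, group in by_name.items():
    birth_dates = {r["birth_date"] for r in group if r.get("birth_date")}
    if len(birth_dates) > 1:
        # Same name but conflicting birth dates: flag for expert review
        print(f"Conflict for '{name}': {sorted(birth_dates)}")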
Follow-up question 6
How do you address legal or privacy concerns when linking potentially sensitive individuals?
Answer
Data logs must follow strict compliance rules. Redacting protected information is necessary if the leaked data includes personal identifiers like home addresses. Access controls limit who can see the full pipeline. Audit trails show exactly how the model made each linking decision, allowing oversight. Storing only relevant fields prevents accidental exposure of private data. Cryptographic hashing can protect sensitive references in the pipeline. This approach balances investigative needs with privacy obligations.