ML Case-study Interview Question: High-Precision Text Classification Identifies Good First Issues in Public Projects
Case-Study question
A large technology platform launched a feature to help new contributors find easy tasks in public projects. The feature identifies “good first issues” in a project’s issue tracker. The team initially relied on specific labels (like “easy” or “beginner-friendly”) to locate these issues, but only a fraction of projects used them. They then introduced a machine learning classifier that looks at each issue’s title and body to decide whether it is a “good first issue.” They implemented daily data pipelines to keep recommendations fresh.
You are a Senior Data Scientist. Propose a complete approach for building and maintaining this type of recommendation system. Cover the data pipeline, model architecture, labeling strategy, and practical concerns (like balancing precision vs recall). Suggest how to avoid overwhelming new contributors with inaccurate results. Finally, discuss ways to improve the feature and how you would measure success.
Detailed solution
Data ingestion and labeling
The team collected issues from public repositories. They built an automatic labeling scheme (weakly-supervised approach) by:
Marking issues with community labels (like “good first issue,” “beginner-friendly,” and “documentation”) as positive samples.
Including issues that were closed by a pull request from a brand-new contributor or by a minimal code change.
Treating all other issues as negative.
They balanced the dataset by subsampling the huge negative set.
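A minimal sketch of these weak-labeling rules, assuming hypothetical issue and pull-request fields (the names below, such as closing_pr and lines_changed, are illustrative and not the platform’s actual schema):
from dataclasses import dataclass
from typing import List, Optional

BEGINNER_LABELS = {"good first issue", "beginner-friendly", "easy", "documentation"}

@dataclass
class ClosingPR:
    author_is_first_time_contributor: bool
    lines_changed: int

@dataclass
class Issue:
    labels: List[str]
    closing_pr: Optional[ClosingPR] = None

def weak_label(issue: Issue) -> int:
    """Return 1 if the issue looks like a 'good first issue', else 0."""
    if {label.lower() for label in issue.labels} & BEGINNER_LABELS:
        return 1                        # maintainer-applied beginner label
    pr = issue.closing_pr
    if pr is not None:
        if pr.author_is_first_time_contributor:
            return 1                    # closed by a brand-new contributor's PR
        if pr.lines_changed <= 10:      # "minimal change" cutoff is an assumption
            return 1
    return 0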
Feature engineering
They relied only on the issue title and body. They removed template text from bodies because boilerplate headers and instructions had no predictive value. The remaining text was then tokenized and encoded as sequences of word indices.
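A hedged sketch of this preparation step, assuming Keras TextVectorization for tokenization and an illustrative list of template phrases to strip (the real patterns would come from each repository’s own issue templates):
import re
import tensorflow as tf

# Hypothetical boilerplate fragments copied from issue templates.
TEMPLATE_PATTERNS = [r"(?i)## steps to reproduce", r"(?i)## expected behaviou?r"]

def strip_template(body: str) -> str:
    """Remove known template headers so only author-written text remains."""
    for pattern in TEMPLATE_PATTERNS:
        body = re.sub(pattern, " ", body)
    return body

# Turn cleaned text into padded integer sequences for the embedding layers.
vectorize = tf.keras.layers.TextVectorization(max_tokens=5000, output_sequence_length=200)
# vectorize.adapt(cleaned_bodies)   # fit the vocabulary on the training corpus
# X_body = vectorize(cleaned_bodies)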
Model architecture
They experimented with classical algorithms (random forests with tf-idf vectors) and deep neural networks (1D convolutional or recurrent neural networks). The neural networks used separate embedding layers for title and body, then combined these embeddings. Deep networks performed better because they extracted context from word order.
They enforced high precision by raising the decision threshold: an issue’s predicted probability had to exceed a high cutoff before it was suggested. High precision reduced false positives but lowered recall.
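One common way to pick such a cutoff is to sweep the precision-recall curve on a validation set and keep the lowest threshold that still meets a precision target; a sketch using scikit-learn, with the 0.9 target chosen purely for illustration:
import numpy as np
from sklearn.metrics import precision_recall_curve

def pick_threshold(y_true, y_scores, min_precision=0.9):
    """Return the lowest score threshold whose validation precision >= min_precision."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
    # precision/recall have one more entry than thresholds; drop the final point to align.
    for p, r, t in zip(precision[:-1], recall[:-1], thresholds):
        if p >= min_precision:
            return t, p, r
    return 1.0, precision[-1], recall[-1]   # fall back to the strictest cutoff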
Training pipeline
They ran daily tasks that:
Collected new issues.
Automatically assigned weak labels.
Trained or fine-tuned the models.
Generated predictions.
They used scheduling workflows (like Argo) to orchestrate these pipelines.
Ranking
They combined model-based confidence scores and label-based confidence. Issues explicitly labeled by maintainers as “good first issue” were ranked higher than purely ML-detected ones. Older issues got penalized so that fresh issues were more likely to appear.
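A minimal sketch of one way to combine these signals into a sort key; the age-decay constant and the weighting are assumptions, not the production formula:
import math
from datetime import datetime, timezone

def rank_key(model_prob: float, has_maintainer_label: bool, created_at: datetime):
    """Sort key: maintainer-labeled issues first, then by confidence decayed with age."""
    age_days = (datetime.now(timezone.utc) - created_at).days
    freshness = math.exp(-age_days / 30.0)        # decay constant is an assumption
    confidence = 1.0 if has_maintainer_label else model_prob
    return (int(has_maintainer_label), confidence * freshness)

# recommendations = sorted(candidates, reverse=True,
#                          key=lambda c: rank_key(c.prob, c.labeled, c.created_at))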
Maintaining accuracy
They watched user interactions and refined the threshold. They also considered letting maintainers confirm or remove ML-based suggestions. This feedback loop improved overall system precision.
Future improvements
They planned personalized suggestions for users who already contributed to a repository. They wanted to incorporate new signals, such as programming language or known difficulty from prior issues. They also considered weighting the results by the health of the repository or ongoing community engagement.
Example code snippet
import tensorflow as tf
from tensorflow.keras import layers
# Hypothetical data pipeline output
# X_title: tokenized titles
# X_body: tokenized bodies
# y: binary labels
title_input = tf.keras.Input(shape=(None,), name="title_input")
body_input = tf.keras.Input(shape=(None,), name="body_input")
title_embed = layers.Embedding(input_dim=5000, output_dim=64)(title_input)
body_embed = layers.Embedding(input_dim=5000, output_dim=64)(body_input)
# 1D Conv for each
title_conv = layers.Conv1D(filters=32, kernel_size=3, activation='relu')(title_embed)
body_conv = layers.Conv1D(filters=32, kernel_size=3, activation='relu')(body_embed)
title_pool = layers.GlobalMaxPooling1D()(title_conv)
body_pool = layers.GlobalMaxPooling1D()(body_conv)
merged = layers.concatenate([title_pool, body_pool])
dense = layers.Dense(64, activation='relu')(merged)
output = layers.Dense(1, activation='sigmoid')(dense)
model = tf.keras.Model(inputs=[title_input, body_input], outputs=output)
# Track precision and recall as well, since the product relies on a high-precision threshold
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy', tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])
# Train
# model.fit([X_title, X_body], y, epochs=5, batch_size=32, validation_split=0.2)
Explain to interviewers that you would experiment with different architectures, hyperparameters, and textual augmentation to fight class imbalance.
Precision and recall (key formula)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Where TP is the count of true positives, FP is the count of false positives, and FN is the count of false negatives. Balancing these is crucial. They chose a high-precision threshold so that only issues with high confidence surfaced.
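A small worked example with illustrative counts (not real system numbers):
tp, fp, fn = 40, 10, 60               # illustrative validation counts
precision = tp / (tp + fp)            # 40 / 50 = 0.80
recall = tp / (tp + fn)               # 40 / 100 = 0.40
print(precision, recall)              # a high-precision, low-recall operating point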
Metric tracking
They tracked:
Click-through rates on recommended issues.
How often new contributors actually open pull requests.
Maintainer feedback to fine-tune classification thresholds.
Possible follow-up questions
How would you address domain differences across various programming languages?
Many projects differ in language and style. A single model might miss domain-specific jargon or specialized problem statements. Training data from multiple domains helps generalize better. Techniques like domain adaptation or separate specialized models may help. Another approach is dynamic embeddings that adapt to language-specific vocabularies.
How would you handle extremely imbalanced classes?
They used a subsampling approach for negative examples. Additional techniques:
Oversampling positive examples or using synthetic text augmentation.
Applying class-weighting in the loss function.
Using advanced methods like focal loss to focus on harder examples.
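Two of these options sketched against the earlier Keras model; the class-weight formula and focal-loss parameters are illustrative defaults, and BinaryFocalCrossentropy assumes a recent TensorFlow release:
import numpy as np
import tensorflow as tf

def make_class_weight(y: np.ndarray) -> dict:
    """Weight the rare positive class by the negative/positive ratio."""
    n_pos = int(np.sum(y == 1))
    n_neg = int(np.sum(y == 0))
    return {0: 1.0, 1: n_neg / max(n_pos, 1)}

# Option 1: class weighting during training (reusing the earlier model and data).
# model.fit([X_title, X_body], y, epochs=5, batch_size=32,
#           class_weight=make_class_weight(y))

# Option 2: focal loss, which down-weights easy examples so training focuses on hard ones.
# model.compile(optimizer='adam',
#               loss=tf.keras.losses.BinaryFocalCrossentropy(gamma=2.0),
#               metrics=['accuracy'])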
Why not rely solely on maintainers’ labels?
Only a small fraction of repositories consistently label issues for newcomers. Relying solely on these labels limits coverage. Automatically detecting easy issues increases coverage, but requires careful precision tuning to avoid unhelpful suggestions.
How would you let maintainers refine model predictions?
They could provide UI elements on issues to confirm or reject a “good first issue” suggestion. This feedback updates the training set. Daily or weekly retraining integrates that feedback. Over time, false positives drop and coverage improves.
Why is fresh data important?
Issue states change fast. A seemingly easy issue might get resolved within a day. Daily inference ensures that only open issues remain in the recommendation set. Also, retraining with recent data captures new patterns or labeling approaches that maintainers adopt.
What if the model mislabels complex tasks as easy?
Excessive false positives frustrate users. High precision is enforced by increasing the probability threshold. Developer feedback and maintainers’ validations further reduce mislabeling. If the system sees negative engagement signals (no one wants to fix a supposedly “easy” bug), that signals a threshold adjustment.
How would you measure success beyond precision and recall?
User-centered metrics. Look at how many first-time contributors open successful pull requests. Check if contributors remain active in the community. Evaluate user satisfaction via surveys or platform analytics. Growth in successful contributions is a key measure.
Could you incorporate user interests?
Yes. Personalized recommendations consider user profiles, past projects, or languages of interest. That might require a user-level embedding or a separate filtering stage. This personalization aims for higher relevance and engagement.
Any concerns about using minimal code changes as a label proxy?
It is an imperfect heuristic. Sometimes a one-line fix is tricky. Using multiple signals (labels, new contributor merges, minimal changes) is more robust. Checking real usage patterns and maintainers’ feedback refines the labeling process.