ML Case-study Interview Question: Using XGBoost to Intelligently Select Tests in Large Continuous Integration Systems
Case-Study question
You have a massive continuous integration system with around 85,000 unique tests, each covering different modules and features. These tests run on more than 90 build configurations and multiple operating systems. Each day, hundreds of new code changes appear. Running all tests on every change causes an unmanageable load. Propose a machine learning approach to intelligently select which tests to run for each code change. Design your solution to maintain high regression detection while minimizing resource usage and cognitive overhead for developers. Include details on data collection, model architecture, training methodology, metrics for effectiveness, and infrastructure for both hosting and evaluating your predictive service. Outline how you would handle missing test data, intermittent failures, and platform redundancy. Explain how you would prove your approach improves efficiency compared to heuristic-based or exhaustive test selection.
Detailed solution
A large test suite and many build configurations create a huge matrix of potential runs. A naive strategy runs most tests on each change. This approach wastes resources and lengthens feedback loops for developers. A machine learning approach reduces redundant testing by learning which patches are most likely to fail which tests. This includes:
Collecting historical pass/fail data tied to the patches that introduced regressions. Correlating changes in specific source files with failure patterns of certain tests. Using a model that outputs a subset of tests most likely to fail under the new changes.
Historical data may be incomplete because not all tests run on every code change. Periodic "full" runs fill coverage gaps, but this still leaves uncertainty about which patch caused a newly discovered failure. Intermittent (flaky) tests complicate things further: a test may occasionally fail without any related code change that would explain a true regression.
A system of heuristics helps pinpoint root causes for failures. For example, if a patch that caused failing tests is backed out and the tests start passing, that patch likely introduced the regression. Patch metadata is deterministic, because changes are stored in a version control system. The data pipeline aggregates these patch details and merges them with historical test outcomes, building a combined dataset of features such as frequencies of co-modification between source files and tests, or paths shared by changed files and test files.
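A minimal sketch of two such features, assuming a hypothetical history of (changed_files, test_id, outcome) records extracted from the pipeline:
from collections import defaultdict
def path_distance(changed_file, test_file):
    # Count directory components not shared by the two paths; a smaller value
    # means the changed file sits closer to the test in the source tree.
    a, b = changed_file.split("/"), test_file.split("/")
    shared = 0
    for x, y in zip(a[:-1], b[:-1]):
        if x != y:
            break
        shared += 1
    return (len(a) - 1 - shared) + (len(b) - 1 - shared)
def co_failure_counts(history):
    # history: iterable of (changed_files, test_id, outcome) tuples from past pushes.
    # Counts how often each (source file, test) pair appears together in a failure.
    counts = defaultdict(int)
    for changed_files, test_id, outcome in history:
        if outcome == "fail":
            for f in changed_files:
                counts[(f, test_id)] += 1
    return counts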
The training set excludes data from the future to simulate real-time model usage. A chronological split ensures the validation set only contains patches that appear after the training set. This emulates real-world deployment, where the model cannot see future regressions while it trains.
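A sketch of that split, assuming the combined dataset is a pandas DataFrame df of (patch, test) rows with a push_timestamp column:
# Order rows by push time and hold out the most recent ~10% for validation,
# so validation pushes are strictly later than anything the model trains on.
df = df.sort_values("push_timestamp")
cutoff = df["push_timestamp"].iloc[int(len(df) * 0.9)]
df_train = df[df["push_timestamp"] < cutoff]
df_valid = df[df["push_timestamp"] >= cutoff]
Splitting on the timestamp rather than the raw row index keeps every row of a given push on the same side of the boundary.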
An extreme gradient boosting model (XGBoost) is trained on tuples (test, patch). Each tuple is labeled as fail or not fail. The model learns patterns such as how often a test fails when certain files are touched, or how far the changed paths lie from the test directory. This single model generalizes to all tests, instead of training separate classifiers for each test, because knowledge that helps predict failures in one area may transfer to another area.
False positives (selecting a test that does not fail) incur less penalty than false negatives (omitting a failing test). The cost of extra runs is relatively small compared to missing an actual regression. The model is periodically retrained to stay aligned with changing code.
Selecting test configurations further optimizes resource usage. Tests that historically fail only on certain platforms do not need to run on all platforms. A solution combines failure statistics and frequent itemset mining to identify redundant configurations. A mixed-integer programming solver finds the minimal-cost set of configurations that still preserves coverage guarantees for each test.
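A sketch of the configuration-selection step as a set-cover style MIP, using the open-source PuLP library; the cost and coverage inputs here are assumptions about what the failure statistics would provide:
from pulp import LpProblem, LpMinimize, LpVariable, LpBinary, lpSum
def select_configurations(configs, tests, cost, covers):
    # configs: list of build configurations; tests: list of test ids.
    # cost[c]: estimated compute cost of running configuration c.
    # covers[(t, c)]: True if running test t on configuration c would have
    # caught its historical failures (derived from failure statistics).
    prob = LpProblem("config_selection", LpMinimize)
    x = {c: LpVariable(f"use_{c}", cat=LpBinary) for c in configs}
    prob += lpSum(cost[c] * x[c] for c in configs)  # minimize total cost
    for t in tests:
        # every test must stay covered by at least one selected configuration
        prob += lpSum(x[c] for c in configs if covers.get((t, c))) >= 1
    prob.solve()
    return [c for c in configs if x[c].value() == 1]
A greedy set-cover heuristic can stand in for the solver if the configuration matrix grows too large to solve exactly.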
A REST service hosts the model. When a developer pushes changes, the service analyzes modified files and runs the model. Results are cached so multiple requests for the same patch do not trigger re-computation. The integration pipeline queries the service to decide which tests to run on the integration branch. Developers can also query it when choosing tests on private branches.
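A minimal sketch of such a service in Flask; the endpoint path, helper functions, and in-memory cache are assumptions (a production deployment would likely use a shared cache such as Redis):
from flask import Flask, jsonify
app = Flask(__name__)
_cache = {}  # patch_id -> cached selection
@app.route("/push/<patch_id>/schedules", methods=["GET"])
def schedules(patch_id):
    if patch_id not in _cache:
        changed_files = fetch_changed_files(patch_id)        # hypothetical VCS lookup
        features = build_features(changed_files, all_tests)  # hypothetical feature builder
        scores = model.predict_proba(features)[:, 1]         # failure probability per test
        selected = [t for t, s in zip(all_tests, scores) if s >= THRESHOLD]
        _cache[patch_id] = {"patch": patch_id, "tests": selected}
    return jsonify(_cache[patch_id])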
Effectiveness is measured by the regression detection rate (percentage of regressions caught on the offending push) and the compute hours spent per push. The combined metric:
scheduler_effectiveness = 1000 * (regression_detection_rate / hours_per_push)
A higher value signals a better trade-off between coverage and runtime. Shadow schedulers run in parallel and simulate alternative strategies without actually scheduling tests. Data pipelines compare their results to find an optimal approach. Once identified, it becomes the default.
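A sketch of how the metric can rank shadow-scheduler results; the candidate names and numbers below are purely illustrative:
def scheduler_effectiveness(regression_detection_rate, hours_per_push):
    # Higher is better: more regressions caught per compute hour spent on a push.
    return 1000 * regression_detection_rate / hours_per_push
# Simulated results gathered from shadow schedulers: (detection rate, hours per push).
candidates = {
    "run_everything": (0.99, 300.0),
    "heuristic_rules": (0.92, 120.0),
    "ml_selection": (0.95, 60.0),
}
best = max(candidates, key=lambda name: scheduler_effectiveness(*candidates[name]))
print(best)  # the strategy with the best coverage/runtime trade-off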
Example snippet for training XGBoost in Python
import xgboost as xgb
import pandas as pd
# Suppose df_train has columns: ['feature1', 'feature2', 'label'],
# where each row is a (patch, test) pair and label is 1 if the test failed.
X_train = df_train[['feature1', 'feature2']]
y_train = df_train['label']
model = xgb.XGBClassifier(n_estimators=100, max_depth=6, learning_rate=0.1)
model.fit(X_train, y_train)
# Periodically retrain with new data, validating on chronologically later pushes.
This code fits an XGBoost model on training data. Feature engineering includes text-based or path-based signals, statistics of test-file co-failures, or revision-interval features. The prediction step uses the trained model on new patches: it scores (patch, test) pairs and ranks them by failure probability.
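A sketch of that prediction step, assuming df_new holds one row per candidate (patch, test) pair with a test_id column and the same feature columns used in training:
# Score every candidate (patch, test) pair and rank by predicted failure probability.
X_new = df_new[['feature1', 'feature2']]
df_new['fail_prob'] = model.predict_proba(X_new)[:, 1]
ranked = df_new.sort_values('fail_prob', ascending=False)
# Select tests above a confidence threshold; the 0.3 cutoff is an assumption to be tuned.
selected_tests = ranked.loc[ranked['fail_prob'] >= 0.3, 'test_id'].tolist()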
How to handle tricky follow-up questions
How do you ensure the model stays accurate over time?
The codebase evolves. New modules or tests appear, and developers shift focus. A stale model misses fresh patterns. Retrain regularly, such as every few weeks, using new code changes. Validate using a chronologically later set of patches. Monitor performance metrics (false negatives, regression detection rate) and re-trigger training if drift appears.
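A sketch of a drift check that could trigger retraining; the window size and detection floor are assumptions:
def needs_retraining(recent_pushes, detection_floor=0.9, window=500):
    # recent_pushes: list of dicts with 'had_regression' and 'caught' flags,
    # derived from backfilled blame data on recent pushes.
    recent = recent_pushes[-window:]
    regressions = [p for p in recent if p["had_regression"]]
    if not regressions:
        return False
    detection_rate = sum(p["caught"] for p in regressions) / len(regressions)
    return detection_rate < detection_floor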
How do you deal with flaky tests that fail or pass at random?
Distinguish genuine regressions from noise. Track intermittent failure rates historically. If a test flips from pass to fail and then recovers without any related code change, mark it as flaky. The label for that patch-test tuple is uncertain, so weight it lower or run confirmatory retriggers before final labeling. Also record manual classifications from humans who triage failures and mark known flakiness in the system.
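One way to encode that lower confidence is through per-example weights during training; the 0.3 weight and the is_flaky_candidate column are assumptions:
import numpy as np
# Give confirmed regressions full weight and suspected-flaky failures a reduced weight,
# so noisy labels pull less on the learned decision boundary.
weights = np.where(df_train["is_flaky_candidate"], 0.3, 1.0)
model.fit(X_train, y_train, sample_weight=weights)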
What if many new tests are introduced?
Include them in the training pipeline. Initially they have little history, so their feature signals are sparse. Weight signals from established tests more strongly at first. As the new tests accumulate pass/fail data, incorporate them in retraining. The system self-corrects once enough labeled data is available.
How do you control risk of missing critical regressions?
Use shadow scheduling. If the model omits certain tests from the default set, a separate job occasionally runs full coverage. If a missed regression is discovered, the system backfills blame to the correct patch. The patch-test association is updated, and the model learns not to omit that test under similar conditions next time. If you must be absolutely sure for crucial modules, you can force specific tests to run every time, regardless of the model’s prediction.
How do you handle scaling this pipeline in production?
A specialized data pipeline queries historical data for each push, merges it with patch metadata, and submits it to the model. Caching reduces load for frequent re-checks. A queue system processes multiple pushes concurrently, but it must be sized to respond quickly. The model inference itself can be done in parallel or on GPU if throughput becomes a bottleneck. Scheduled offline retraining can be larger-scale, since it is not time-critical.
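A sketch of parallel inference across queued pushes with a worker pool; the pool size and the build_features_for_push helper are assumptions:
from concurrent.futures import ThreadPoolExecutor
def score_push(patch_id):
    # Hypothetical helper assembles features for every candidate test of this push.
    features = build_features_for_push(patch_id)
    return patch_id, model.predict_proba(features)[:, 1]
# Score a batch of pending pushes concurrently.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(pool.map(score_push, pending_patch_ids))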
Why is a single classifier better than training per-test models?
A combined model exploits shared patterns across tests. A common pattern in changed files might predict multiple test failures. Training a separate model for every test multiplies memory use, invites overfitting on sparse per-test data, and generalizes poorly to rarely-run tests. A single model over (patch, test) pairs sees a broader pool of examples and reuses learned signals across different tests.
Why is the chronological split for training and validation so important?
Random splits leak future knowledge into the model. A test that started failing last week might give the model examples of that failure in both training and validation sets. In real life, the model deployed last month would not have seen last week’s failures. The chronological split simulates the real environment.
How do you measure success when you first deploy such a system?
Compare resource usage and regression detection to the old approach. The metric of scheduler_effectiveness quantifies this. If you see reduced total compute hours per push and consistent detection of regressions, that is evidence of success. Shadow schedulers confirm that the new approach outperforms alternatives.
How would you address developer trust in the new system?
Developers may worry about missed regressions. Show them that any omitted tests are periodically covered by full runs or backfill if suspicious failures appear. Provide a user-friendly interface to override the model and add extra tests in edge cases. Log predictions for transparency. Show a dashboard tracking how many regressions the model directly detected and how many it missed.
Would you ever consider adaptive test selection during a single run?
Yes. After partial results are known, a second stage might skip related tests. For example, if two related tests pass, the model might skip a third correlated test. This requires tight coupling between test scheduling and results collection. The complexity is higher but can unlock further savings.
How do you safeguard against cost overruns if the model chooses too many tests?
Refine the threshold for selecting tests. Start with a higher threshold that picks fewer tests. Adjust if the false negative rate is too high. The cost penalty for false positives is smaller than missing regressions, but an extreme number of unnecessary tests also adds cost. Monitor a cost/detection ratio. Shift the threshold to balance these competing concerns.
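A sketch of a threshold sweep on held-out pushes; val_scores, val_labels, the per-test cost, and the candidate thresholds are assumptions:
import numpy as np
def evaluate_threshold(scores, labels, hours_per_test, threshold):
    # scores: predicted failure probabilities; labels: 1 where the test actually failed.
    selected = scores >= threshold
    caught = int(np.sum(selected & (labels == 1)))
    missed = int(np.sum(~selected & (labels == 1)))
    compute_hours = float(np.sum(selected)) * hours_per_test
    return caught, missed, compute_hours
for threshold in np.arange(0.1, 0.9, 0.1):
    caught, missed, hours = evaluate_threshold(val_scores, val_labels, 0.05, threshold)
    print(f"threshold={threshold:.1f} caught={caught} missed={missed} hours={hours:.1f}")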
How do you ensure the final approach is generalizable to other teams or organizations?
Abstract the pipeline and avoid codebase-specific assumptions. Provide hooks for custom heuristics or classification rules. Document the minimal data requirements: pass/fail outcomes over time, patch metadata, stable version control. The same method should work wherever large test suites exist, as long as the new environment also collects patch-test correlations over time.
What if your training data is highly imbalanced (very few failures relative to passes)?
Use techniques like focusing on failure examples, applying class weighting, or oversampling the minority class. You can also emphasize more recent failures. XGBoost handles imbalance well with built-in parameters (scale_pos_weight). Also track recall carefully. You might miss rare but important regressions if the model overfits to passing tests.
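A sketch of the class-weighting option; scale_pos_weight is commonly set to the ratio of negative to positive examples in the training set:
import xgboost as xgb
# Weight the rare failing class more heavily: roughly (#passes) / (#failures).
neg = int((y_train == 0).sum())
pos = int((y_train == 1).sum())
model = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    scale_pos_weight=neg / max(pos, 1),
)
model.fit(X_train, y_train)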