ML Case-study Interview Question: From Diagnostics to Patches: Finetuning Code Models for Automated Repair
Case-Study question
A software development platform logs a high volume of coding events while users build applications. It uses a language server to detect code errors across sessions. Currently, only a small fraction of these detected errors have suggested fixes. How would you build a system that automatically repairs a user's code based on these diagnostics? Describe your approach to designing the data pipelines, synthesizing training data, creating a suitable model architecture and input-output format, handling evaluation, and putting the final model into production.
Detailed Solution
A system for automated code repair starts with data pipelines that gather large amounts of user session logs. Each session log includes operational transformations (OTs) or equivalent structures representing how code evolves over time, and language server diagnostics that specify errors. Reconstruct the file system at the timestamp of each diagnostic. Verify that the reconstructed code matches stored project snapshots and reproduce the same diagnostic by running local analyzers. These checks ensure data correctness.
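A rough sketch of this reconstruction step, assuming a simplified OT log of character-level insert/delete operations and a stored SHA-256 snapshot hash (the field names here are illustrative):

```python
import hashlib
from dataclasses import dataclass

@dataclass
class EditOp:
    """Simplified operational transformation: insert or delete at a character offset."""
    kind: str    # "insert" or "delete"
    offset: int  # character position in the file
    text: str    # text inserted, or the text expected to be removed

def replay_ots(initial: str, ops: list[EditOp]) -> str:
    """Replay edit operations in order to reconstruct the file at a given timestamp."""
    content = initial
    for op in ops:
        if op.kind == "insert":
            content = content[:op.offset] + op.text + content[op.offset:]
        elif op.kind == "delete":
            content = content[:op.offset] + content[op.offset + len(op.text):]
    return content

def matches_snapshot(content: str, snapshot_sha256: str) -> bool:
    """Cross-check the reconstruction against a stored project snapshot hash."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest() == snapshot_sha256
```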
Use code states and diagnostics to build input-output pairs. The input contains the original broken code plus a diagnostic showing the error line. The output is a minimal patch to fix the error. To create these pairs at scale, gather raw data in a distributed environment and filter out stylistic rules or trivial fixes that a language server can already handle. Reconstruct the code file at the timestamp of each diagnostic and then synthesize a short line-level diff showing how the broken code should be fixed. Filter out incomplete or malformed diffs and keep only pairs that a model can learn from.
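A minimal sketch of assembling one training example as a JSONL record; the diagnostic fields (severity, line, message) and the filtering rules are hypothetical placeholders for the real pipeline:

```python
import json

def build_training_pair(broken_code: str, diagnostic: dict, line_diff: str) -> str | None:
    """Assemble one (broken code, diagnostic) -> line-diff example as a JSONL record.

    Returns None when the pair should be filtered out (stylistic rule, empty or
    malformed diff), so the caller can drop it.
    """
    # Skip purely stylistic diagnostics and trivial fixes the language server already handles.
    if diagnostic.get("severity") == "style":
        return None
    if not line_diff.strip():
        return None

    record = {
        "code": broken_code,
        "error_line": diagnostic["line"],
        "message": diagnostic["message"],
        "target_diff": line_diff,
    }
    return json.dumps(record)
```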
Train a specialized code repair model by finetuning an existing code-focused language model. Use a structured schema with sentinel tokens indicating code regions and a consistent line numbering scheme to prevent ambiguity. Maintain a concise template so that the model outputs a line-diff in an easily parsed format. When training, use standard supervised learning with cross-entropy loss, focusing on mapping the broken code and diagnostic to a valid line-diff patch.
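One possible shape for such a schema is sketched below; the sentinel tokens, line-numbering style, and diff commands are illustrative rather than a fixed standard:

```python
def format_input(code: str, error_line: int, message: str) -> str:
    """Render the broken code with explicit line numbers plus the diagnostic,
    wrapped in sentinel tokens so the model can locate each region unambiguously."""
    numbered = "\n".join(f"{i + 1}| {line}" for i, line in enumerate(code.splitlines()))
    return (
        "<code>\n" + numbered + "\n</code>\n"
        f"<diagnostic line={error_line}>\n{message}\n</diagnostic>\n"
        "<patch>\n"
    )

# The supervised target is a compact line-diff the model should emit, e.g.:
#   replace 3: print(f"total={total}")
# The trainer appends this after <patch> and scores it with cross-entropy loss.
```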
Use the decoupled AdamW optimizer and a cosine schedule with warmup for training stability. One common form of cosine decay is

L(t) = \frac{L_{0}}{2}\left(1 + \cos\left(\frac{\pi t}{T}\right)\right)

Here, L(t) is the learning rate at step t, L_{0} is the initial learning rate, and T is the total decay horizon. This gradually lowers the learning rate, avoiding abrupt changes.
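A sketch of linear warmup followed by cosine decay in PyTorch, using the standard LambdaLR pattern; the hyperparameter values in the usage note are placeholders:

```python
import math
import torch

def cosine_with_warmup(optimizer: torch.optim.Optimizer,
                       warmup_steps: int,
                       total_steps: int) -> torch.optim.lr_scheduler.LambdaLR:
    """Linear warmup to the base learning rate, then cosine decay toward zero."""
    def lr_scale(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        progress = min(1.0, progress)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_scale)

# Usage with decoupled AdamW (placeholder hyperparameters):
# optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.1)
# scheduler = cosine_with_warmup(optimizer, warmup_steps=1_000, total_steps=50_000)
# Call scheduler.step() once per optimizer step during training.
```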
Evaluate the system with two types of tests. One is a public benchmark with curated problems, where functional correctness can be measured by running test suites. The other is a real-world benchmark that includes genuine user code in varied conditions, measured by whether the repaired code textually or syntactically matches a known correct solution. This captures practical success.
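For the functional-correctness side, a minimal harness might run each curated problem's test suite in a sandbox and treat a zero exit code as success; the paths and commands below are placeholders:

```python
import subprocess

def passes_tests(repo_dir: str, test_cmd: list[str], timeout_s: int = 120) -> bool:
    """Functional-correctness check: apply the candidate patch in repo_dir beforehand,
    then run the benchmark's test suite and treat exit code 0 as success."""
    try:
        result = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

# Example (placeholder paths): passes_tests("/tmp/problem_42", ["python", "-m", "pytest", "-q"])
```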
Once validated, serve the model in a production environment, perhaps behind an API. Incorporate fallback logic to handle cases where the model fails or where the line-diff cannot be applied. Collect feedback on whether users keep or discard the suggestions. Over time, use these acceptance signals to further refine the model.
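A sketch of the serving-time fallback logic, with the model call and diff application passed in as hypothetical callables:

```python
from typing import Callable, Optional

def repair_with_fallback(code: str,
                         diagnostic: dict,
                         generate_patch: Callable[[str, dict], str],
                         apply_patch: Callable[[str, str], Optional[str]]) -> Optional[str]:
    """Call the repair model behind the API; return the fixed code, or None so the
    caller can fall back to showing only the raw diagnostic."""
    try:
        patch = generate_patch(code, diagnostic)   # model inference, e.g. an RPC
    except Exception:
        return None                                # model unavailable or timed out
    fixed = apply_patch(code, patch)               # None if the line-diff cannot be applied
    if fixed is None or fixed == code:
        return None
    return fixed
```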
How would you confirm the correctness of your data pipeline?
Reconstruct the file system at each diagnostic timestamp by replaying OTs. Cross-check this reconstruction against a stored snapshot. Re-run a local analyzer on the reconstructed code, ensuring it surfaces the same diagnostics seen in the logs. If the diagnostics match, store the code and diagnostic pair. This approach confirms that the pipeline is accurate, because discrepancies would indicate missing or corrupted edits.
How do you generate synthetic diffs without polluting the training data?
Use a strong teacher code model to suggest fixes, but avoid generating entire files from scratch. Start from real broken states to minimize distribution drift between synthetic and real data. Request a short line-diff patch from the teacher model. Filter out malformed or inapplicable patches, keep only those that apply cleanly, and discard any that do not actually resolve the diagnostic. This process produces targeted, realistic fixes while preserving the distribution of user code.
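A sketch of that filtering step, with the diff application and analyzer re-run passed in as hypothetical callables:

```python
from typing import Callable, Optional

def keep_synthetic_pair(broken_code: str,
                        diagnostic_code: str,
                        candidate_diff: str,
                        apply_diff: Callable[[str, str], Optional[str]],
                        run_analyzer: Callable[[str], list[str]]) -> bool:
    """Accept a teacher-model patch only if it applies cleanly and actually
    removes the targeted diagnostic when the analyzer is re-run."""
    patched = apply_diff(broken_code, candidate_diff)
    if patched is None:                  # malformed or inapplicable line-diff
        return False
    remaining = run_analyzer(patched)    # diagnostic codes still present after the fix
    return diagnostic_code not in remaining
```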
Why center on line-based diffs?
It keeps the output short and unambiguous. A model that outputs line indexes and minimal textual changes reduces decoding costs. Line-based diffs also align well with typical developer workflows and make it easier to apply patches inside an editor.
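For illustration, a minimal applier for such diffs might look like the sketch below; the "replace N: ..." / "delete N" command syntax is hypothetical:

```python
import re

def apply_line_diff(code: str, diff: str) -> str | None:
    """Apply a minimal line-diff of the form 'replace N: <new text>' or 'delete N'.
    Returns None when a line index is out of range or a command is unrecognized.
    (For brevity, indices are interpreted against the running buffer.)"""
    lines = code.splitlines()
    for command in diff.strip().splitlines():
        m = re.match(r"replace (\d+): ?(.*)$", command)
        if m:
            idx = int(m.group(1)) - 1
            if not 0 <= idx < len(lines):
                return None
            lines[idx] = m.group(2)
            continue
        m = re.match(r"delete (\d+)$", command)
        if m:
            idx = int(m.group(1)) - 1
            if not 0 <= idx < len(lines):
                return None
            lines.pop(idx)
            continue
        return None  # unrecognized command: reject the whole patch
    return "\n".join(lines)
```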
How would you handle cross-file or more complex fixes?
Expand the input schema with references to multiple files if necessary. For multi-line or multi-file changes, line numbering remains a primary source of truth. Provide a bigger context window so the model sees relevant lines across files. Retain the same training approach—(broken state, diagnostic) to (patch)—but allow the patch to include changes referencing multiple lines across different files.
Why not just rely on a large general-purpose model?
Large models can fix code zero-shot, but they often hallucinate or produce incomplete patches. Instruction finetuning on real (code, diagnostic, patch) data yields more consistent outputs. It enforces a predictable schema and ensures the model’s outputs are directly applicable.
How do you test functional correctness for real user code?
Use curated test sets with known correct solutions whenever possible. In some practical cases, rely on textual or AST-level exact match. Also gather user acceptance data—whether a user reverts the model’s fix or not. Over time, feed this acceptance metric back into training by a reward mechanism or by favoring patterns of fixes that are accepted.
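For the AST-level exact match, one lightweight check (Python-only, insensitive to formatting and comments) could be:

```python
import ast

def ast_equivalent(candidate: str, reference: str) -> bool:
    """Syntactic-match check: parse both programs and compare their dumped ASTs,
    so whitespace and comment differences do not count as mismatches."""
    try:
        return ast.dump(ast.parse(candidate)) == ast.dump(ast.parse(reference))
    except SyntaxError:
        return False
```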
How do you integrate this with your development platform?
Embed a repair function into the environment. When a diagnostic triggers, pass the relevant code, line index, and error message to the model. Receive a short line-diff. Apply it tentatively and prompt the user to accept or reject. Use a server-side microservice for quick calls. Log the feedback and store it for iterative refinement.
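A minimal sketch of such a microservice using Flask; the endpoint path, payload fields, and generate_patch helper are hypothetical:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def generate_patch(code: str, line: int, message: str) -> str | None:
    """Placeholder for the model call (e.g. an RPC to the inference service).
    Returns None when no usable line-diff is produced."""
    return None  # wire up the real model client here

@app.route("/repair", methods=["POST"])
def repair():
    payload = request.get_json(force=True)
    patch = generate_patch(payload["code"], payload["line"], payload["message"])
    if patch is None:
        # Fallback: the editor keeps showing the plain diagnostic.
        return jsonify({"status": "no_fix"}), 200
    # The client applies the diff tentatively and asks the user to accept or reject;
    # that decision is logged for later finetuning rounds.
    return jsonify({"status": "ok", "line_diff": patch}), 200
```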
How do you ensure your model handles out-of-distribution errors?
Add robust filtering and fallback. For known error codes, the model uses line-diffs. If the error is out of scope or the model output is malformed, surface a generic message. Continue collecting user data for those new errors and refine the training set, incrementally broadening coverage.
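A sketch of that routing logic; the diagnostic codes listed are only illustrative examples:

```python
# Diagnostic codes the finetuned model has seen in training (illustrative values).
SUPPORTED_CODES = {"reportUndefinedVariable", "reportMissingImports"}

def route_diagnostic(diagnostic_code: str) -> str:
    """Send in-distribution errors to the repair model; everything else gets a
    generic message while the raw examples are logged as future training data."""
    return "model" if diagnostic_code in SUPPORTED_CODES else "generic_message"
```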
Which deployment optimizations help with latency and cost?
Quantize the model or distill it to a smaller version. Implement efficient attention kernels or flash attention. Cache repeated code contexts when a user’s file changes only slightly. Execute inference on GPUs best suited for real-time requests, or batch predictions if you can tolerate slight delays. Tune your streaming approach so you start returning tokens to the user quickly.
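As one example of context caching, a small LRU cache keyed on the code and diagnostic could look like this sketch:

```python
import hashlib
from collections import OrderedDict

class RepairCache:
    """Small LRU cache keyed on (code, diagnostic), so near-identical requests
    produced by incremental edits skip a second round of inference."""
    def __init__(self, max_items: int = 10_000):
        self.max_items = max_items
        self._store: OrderedDict[str, str] = OrderedDict()

    def _key(self, code: str, diagnostic: str) -> str:
        return hashlib.sha256((code + "\x00" + diagnostic).encode("utf-8")).hexdigest()

    def get(self, code: str, diagnostic: str) -> str | None:
        key = self._key(code, diagnostic)
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            return self._store[key]
        return None

    def put(self, code: str, diagnostic: str, line_diff: str) -> None:
        key = self._key(code, diagnostic)
        self._store[key] = line_diff
        self._store.move_to_end(key)
        if len(self._store) > self.max_items:
            self._store.popitem(last=False)  # evict least recently used entry
```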
How do you manage expansions to other languages?
Maintain the same approach but gather (broken code, diagnostic, line-diff) data for each new language. Transfer learning helps when code tokens overlap across languages. Retain the sentinel-based schema. Only add new language indicators or file extensions as needed. Validate each language separately.
If your model’s initial suggestions are consistently incomplete, how do you refine it?
Collect real user acceptance signals or create carefully targeted synthetic data. Expand your training set with more complex multi-line patches. Adjust your line-diff schema to allow partial line replacements or expansions. Validate that the model sees enough context. If needed, enlarge or switch the base model to handle more tokens.
How do you see future improvements?
Scale the training set with more code states and richer diagnostics across languages. Integrate reinforcement or feedback-based tuning so that your model uses signals from real usage. Track distribution drift carefully, because your user base changes. Try advanced strategies such as referencing entire project structures or hooking into build commands to dynamically run tests.
How do you answer if an interviewer asks: “Are there simpler ways to do code repair?”
Explain that pattern-based or rule-based solutions can fix common syntactic errors quickly, but they do not generalize well. A specialized model that generalizes from real training pairs is more flexible. As user code and error types evolve, the model can adapt by ingesting more data, while rule-based systems often need manual updates.
Why is line-level patching more robust than directly returning entire corrected files?
It is more interpretable to the user. They see exactly which lines changed, and you avoid reintroducing errors outside the fix region. It also saves bandwidth and processing cost. Short diffs are easier to parse and apply, and they reduce possible hallucinations about the rest of the file.
How do you guard against privacy issues in user data?
Anonymize the logs, remove personal or sensitive content, and ensure you have user consent or an opt-out mechanism. Keep only the minimal code snippet and diagnostic lines, discarding extraneous text. Respect data retention policies and comply with relevant regulations.
What if your approach consistently misses certain classes of errors?
Add specialized data for those errors or create new synthetic error-generation scripts. Increase coverage by capturing cross-file dependencies if the errors relate to references in other files. Expand the line-diff approach to accommodate more advanced patterns. Validate carefully in the real environment to confirm better coverage.