ML Interview Q Series: What issues arise when the cost function must handle partially observed labels (e.g., partial annotation or weak supervision), and how could you address them?
Hint: Consider estimating missing labels or using expected values in the loss.
Comprehensive Explanation
When labels are missing or only partially observed, the most immediate challenge is that the training objective cannot be straightforwardly computed in standard supervised learning fashion. A typical loss function expects fully known labels for each training instance, but under partial annotation or weak supervision, we might only have access to partial or incomplete signals about the true target.
This partial availability introduces several interconnected problems. First, models can become biased if they only use the small subset of examples that have labels. Second, ignoring unlabeled parts of the data can reduce effective sample size, leading to underfitting and poor generalization. Third, models may overfit to noisy or partially incorrect labels if the weakly supervised signals are themselves unreliable. Finally, the very definition of a "loss" becomes trickier, as we need a procedure for weighting or imputing labels that are unknown or only weakly specified.
Addressing these issues often involves a method to either (1) infer or estimate missing labels, or (2) treat the missing labels as latent variables and incorporate an expectation-based approach in the loss. One popular approach is to rely on an expectation of the loss over the unknown label distribution. Below is a conceptual expression for an expected loss in the presence of missing labels, which can be central to solutions that integrate partial supervision:
$$\mathcal{L}(\theta) \;=\; \sum_{i=1}^{N} \mathbb{E}_{y_i \sim p(y_i \mid x_i)}\Bigl[\ell\bigl(f_{\theta}(x_i),\, y_i\bigr)\Bigr]$$

Here, f_{theta}(x_i) is the model's prediction function parameterized by theta, ell is a per-example loss such as cross-entropy, and p(y_i | x_i) is the distribution over the unknown or partially observed label for instance i. The outer summation runs over the N training instances. For each instance, we take an expectation of the loss over the unknown or partially observed label. In practice, this expectation can be approximated through various techniques.
In some scenarios, you might have partially observed labels in the form of constraints (for example, we might know a label is from a certain set of possible classes, but not which one exactly). In others, you might have multiple noisy labeling functions, all of which can be combined to give a probabilistic estimate of the label. By modeling this partial knowledge as a distribution over possible labels, you then compute the expected loss for each sample.
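To make this concrete, here is a minimal PyTorch sketch of the expected cross-entropy under a per-example label distribution. The tensor `label_dist` is a hypothetical stand-in for whatever estimate of p(y_i | x_i) your weak-supervision process produces.

```python
import torch
import torch.nn.functional as F

def expected_cross_entropy(logits, label_dist):
    """Expected cross-entropy when the true label is only known as a
    distribution over classes.

    logits:     (batch, num_classes) raw model outputs f_theta(x_i)
    label_dist: (batch, num_classes) estimated p(y_i | x_i); each row sums to 1
    """
    log_probs = F.log_softmax(logits, dim=-1)            # log p_theta(y | x)
    # E_{y ~ label_dist}[ -log p_theta(y | x) ], averaged over the batch
    return -(label_dist * log_probs).sum(dim=-1).mean()

# Toy usage: 3 examples, 4 classes, with varying degrees of label knowledge
logits = torch.randn(3, 4, requires_grad=True)
label_dist = torch.tensor([
    [1.00, 0.00, 0.00, 0.00],   # label fully observed
    [0.50, 0.50, 0.00, 0.00],   # label known to be class 0 or 1
    [0.25, 0.25, 0.25, 0.25],   # label completely unknown
])
expected_cross_entropy(logits, label_dist).backward()
```

When a row of `label_dist` is one-hot, this reduces exactly to standard cross-entropy, so fully observed and partially observed examples can share a single loss term.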
Strategies include iterative Expectation-Maximization (EM) procedures, pseudo-labeling, and introducing latent variables that represent the unknown labels and marginalizing them out of the objective function. EM-like methods repeatedly estimate the most probable labels (E-step) given the current model parameters, then optimize the parameters to best fit those labels (M-step). Alternatively, weak supervision frameworks combine multiple noisy or partial label sources to estimate the underlying true label distribution, producing label estimates that can be used in a standard training pipeline.
From a practical perspective, it is also important to handle the potential biases or errors introduced by these estimation processes. It can help to have validation schemes that can detect if the model is overfitting to mislabeled data, or if certain subsets of partially observed labels degrade performance. Furthermore, domain knowledge can guide how to incorporate constraints or partial information. For instance, if you know that a label must come from a fixed subset of classes, that knowledge can drastically reduce the uncertainty in the missing labels and improve the stability of training.
Potential Follow-up Questions
How can we design a custom cost function to incorporate partial labels in a deep learning framework?
One way is to augment the standard loss function so that it can be computed even when some labels are missing. Instead of a direct label comparison, you can encode the partially known label as a probability distribution across possible classes. Then the cost function involves computing the expectation under that distribution. In a framework like PyTorch or TensorFlow, you might define a custom forward pass that checks if a label is fully known. If not, you compute the expected loss over the set of possible labels. This can involve techniques such as masking or weighting certain terms in the loss so that missing labels are not penalized in the same way as fully observed labels.
Another practical step is to ensure you do not silently ignore examples with missing labels, because that might reduce your effective dataset size. Instead, carefully incorporate them via a partial supervision strategy that can still backpropagate meaningful gradients. This could be done through a mixture of standard cross-entropy for known labels and a special partial-label loss term for unknown labels.
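A minimal sketch of such a mixed loss in PyTorch, assuming your data pipeline supplies a boolean mask `is_known` and a 0/1 candidate-set matrix `candidates` for the uncertain examples (both names are hypothetical):

```python
import torch
import torch.nn.functional as F

def partial_label_loss(logits, labels, candidates, is_known):
    """Standard cross-entropy on fully observed labels plus a candidate-set
    loss on partially observed ones.

    logits:     (batch, num_classes)
    labels:     (batch,) class indices; ignored where is_known is False
    candidates: (batch, num_classes) 0/1 mask of admissible classes
    is_known:   (batch,) bool, True where the exact label is observed
    """
    log_probs = F.log_softmax(logits, dim=-1)
    loss = torch.zeros((), device=logits.device)

    if is_known.any():
        # Exact labels: ordinary negative log-likelihood
        loss = loss + F.nll_loss(log_probs[is_known], labels[is_known])

    if (~is_known).any():
        # Candidate sets: minimize -log sum_{y in candidate set} p_theta(y | x),
        # i.e. push probability mass onto the admissible classes as a group
        cand_log_probs = log_probs[~is_known].masked_fill(
            candidates[~is_known] == 0, float("-inf"))
        loss = loss + (-torch.logsumexp(cand_log_probs, dim=-1)).mean()

    return loss
```

This keeps every example in the batch contributing a gradient, which avoids the silent shrinkage of the effective dataset mentioned above.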
How does an Expectation-Maximization (EM) algorithm help with partial annotation?
In partial annotation scenarios, EM is often used when you can treat the unknown labels as latent variables. During the E-step, you use the current model parameters to infer or “soft-assign” a distribution over these unknown labels. This might be done by running a forward pass with the model to estimate probabilities for each possible class. Then in the M-step, you treat these inferred labels (or distributions) as if they were known, and perform a parameter update that tries to minimize the loss with respect to these newly assigned labels. You alternate between these two steps until convergence.
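A high-level sketch of this alternation, assuming a standard classification model, an optimizer, and two dataloaders (`labeled_loader` yielding `(x, y)` pairs and `unlabeled_loader` yielding only `x`) already exist; all of these names are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def em_style_training(model, optimizer, labeled_loader, unlabeled_loader, rounds=5):
    """EM-flavored loop: the E-step soft-assigns label distributions to the
    unlabeled data with the current model; the M-step fits the parameters to
    the observed labels plus those inferred distributions."""
    for _ in range(rounds):
        # E-step: infer soft labels for the unlabeled examples
        model.eval()
        pseudo = []
        with torch.no_grad():
            for x in unlabeled_loader:
                pseudo.append((x, F.softmax(model(x), dim=-1)))

        # M-step: update parameters against observed and inferred labels
        model.train()
        for x, y in labeled_loader:
            optimizer.zero_grad()
            F.cross_entropy(model(x), y).backward()
            optimizer.step()
        for x, q in pseudo:
            optimizer.zero_grad()
            # expected cross-entropy under the soft labels from the E-step
            loss = -(q * F.log_softmax(model(x), dim=-1)).sum(dim=-1).mean()
            loss.backward()
            optimizer.step()
    return model
```

Caching the E-step outputs in memory is only reasonable for modest datasets; at larger scale you would stream or shard them, which connects to the computational concerns discussed below.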
A major benefit is that EM systematically integrates the uncertainty in missing labels. However, EM can converge to local minima, so initialization and careful hyperparameter tuning are often key. You also need to watch out for overfitting, because if the E-step assigns incorrect labels consistently, the M-step might reinforce that incorrect labeling, leading to a suboptimal solution.
Are there pitfalls or constraints when applying these methods to large-scale data with partial labeling?
One challenge in large-scale settings is the computational overhead. If you are performing an EM-like approach or a complex labeling function that needs iterative refinement, the cost of repeated passes over a massive dataset can be high. Efficient implementations and parallelization strategies become vital.
In addition, partial labels can be extremely noisy. You might have scenarios in which weak supervision signals are contradictory. Careful design of your data pipeline (e.g., weighting certain label sources more than others) can reduce the impact of noise. You also want to maintain a robust validation approach, because you need a way to reliably evaluate whether your approach to handling partial labels is actually improving the model rather than introducing systematic biases.
How do you evaluate performance when the training data are partially labeled?
It is usually necessary to have a cleanly labeled hold-out set, even if it is smaller, so that you can evaluate the final model performance without the confounding effect of missing labels. This fully annotated test or validation set becomes your gold standard for tuning hyperparameters and checking for overfitting. If no fully labeled subset is available, cross-validation techniques become more difficult to set up. In such extreme cases, you might rely on indirect measures of performance (like training stability, agreement across multiple noisy sources, or external tasks that use the trained model’s features).
Another possibility is active learning, where you iteratively ask for ground truth labels on a carefully chosen subset of the unlabeled data. This can help refine your model’s estimates of missing labels and give you a small but accurate validation set for monitoring performance.
Below are additional follow-up questions
What happens if the missing or partial labels are systematically biased toward certain classes?
Systematic bias in missing or partial labels can distort model training. If specific classes or categories are more likely to be missing or inaccurately labeled, the model may become overconfident in the classes that are more frequently observed. In real-world scenarios, this can arise from sampling methods, data collection protocols, or labeling processes that inadvertently emphasize certain outcomes. When a class is chronically underrepresented or missing in your partially labeled dataset, the model may have insufficient exposure and fail to generalize.
A deeper issue appears when this bias manifests differently across subpopulations. For instance, if certain demographic groups are more frequently missing labels, your model’s performance may degrade or become discriminatory in that subgroup. Addressing this requires careful data analysis or building domain knowledge to detect skew in label distribution. Techniques such as reweighting specific classes, collecting additional labels for those underrepresented portions, or incorporating domain constraints can mitigate these imbalances.
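One simple mitigation, sketched below, is to reweight the loss inversely to how often each class is actually observed; `observed_counts` is a hypothetical vector you would estimate from the labeled portion of your data.

```python
import torch
import torch.nn as nn

# Hypothetical counts of how often each class appears among the observed labels
observed_counts = torch.tensor([5000.0, 1200.0, 300.0, 50.0])

# Inverse-frequency weights so chronically under-labeled classes are not drowned out
class_weights = observed_counts.sum() / (len(observed_counts) * observed_counts)

criterion = nn.CrossEntropyLoss(weight=class_weights)
# inside the training loop: loss = criterion(logits, targets)
```

This only corrects for observable frequency skew; if labels are missing not at random with respect to the inputs themselves, you still need additional labeling or domain constraints.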
How do you handle the possibility that weakly supervised signals themselves could have different degrees of reliability?
When using multiple weak or partial supervision sources, each signal could have a distinct reliability or noise level. If you treat all sources equally, the model might overweight highly noisy signals. Conversely, ignoring some weaker signals may discard potentially valuable information. A real-world example is when you have programmatically generated labels that differ in accuracy depending on the data domain.
To address this, you can explicitly model the reliability of each source. One approach is to introduce a learnable parameter or confidence score for each supervision channel. During training, if certain signals consistently match high-confidence predictions, you could dynamically increase their contribution. But if a source is found to be highly contradictory, you can decrease its influence. In practice, you might have an extra “calibration” step that looks at how each source correlates with a small set of ground-truth labels, thereby estimating how trustworthy the source is. Potential pitfalls include overfitting these reliability estimates and ignoring rare but informative signals. Validation and cross-checking are crucial to ensure that reliability modeling is stable.
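A minimal sketch of one learnable reliability weight per source; the tensor `source_label_dists`, stacking each source's proposed label distribution for a batch, is an assumed input produced by your labeling functions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SourceWeighting(nn.Module):
    """Learns one reliability logit per weak-supervision source and combines
    the sources' label distributions into a single soft target per example."""

    def __init__(self, num_sources):
        super().__init__()
        self.reliability_logits = nn.Parameter(torch.zeros(num_sources))

    def forward(self, source_label_dists):
        # source_label_dists: (num_sources, batch, num_classes)
        w = F.softmax(self.reliability_logits, dim=0)              # (num_sources,)
        return (w[:, None, None] * source_label_dists).sum(dim=0)  # (batch, num_classes)

# Usage sketch: soft_targets = SourceWeighting(num_sources=3)(source_label_dists),
# then train with an expected cross-entropy against soft_targets.
```

If a small gold-labeled set is available, the same weights can instead be fit directly against it in the calibration step described above, which is usually more stable than learning them jointly with the model.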
Can transfer learning methods amplify errors from partially labeled data?
Transfer learning typically involves pretraining a model on a large corpus of data, often with more complete labels or more abundant self-supervisory signals, and then fine-tuning on a target dataset that may contain partial labels. If the partial labels in the target domain are heavily biased or noisy, the fine-tuning phase can degrade or “catastrophically forget” the robust features learned during pretraining. This degradation is especially pronounced when the target task is quite different from the pretraining task.
In addition, if the partial labels do not cover the entire label space effectively, the pretrained model may overfit the subset of labels it sees, ignoring other relevant features. The best practice is to carefully select the layers or parameters you want to fine-tune so you do not over-adapt the model to untrustworthy labels. Monitoring training metrics over time can help detect if the model starts diverging from the robust representations gained during pretraining. Another strategy is to freeze large portions of the pretrained network and only learn a small number of parameters on top, thus reducing your reliance on partially labeled data.
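As an illustration of the freezing strategy, the sketch below assumes a recent torchvision and a hypothetical 10-class target task: the pretrained ResNet backbone stays fixed, and only a new classification head is trained on the partially labeled data.

```python
import torch.nn as nn
from torchvision import models

num_target_classes = 10  # hypothetical size of the target label space

# Pretrained backbone whose representations we want to protect
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze the pretrained weights so noisy partial labels cannot distort them
for param in backbone.parameters():
    param.requires_grad = False

# Replace the final layer; the new head's parameters are trainable by default
backbone.fc = nn.Linear(backbone.fc.in_features, num_target_classes)

# Pass only backbone.fc.parameters() to the optimizer when fine-tuning
```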
How does partial labeling impact confidence calibration in model outputs?
Confidence calibration measures how well predicted probabilities match actual likelihoods (i.e., if a model predicts class A with probability 0.8, it should be correct about 80% of the time). With partial labels, the model might learn incorrectly calibrated confidence estimates. For instance, if some labels are unknown or replaced with a distributional guess, the model can end up systematically underestimating or overestimating the confidence for certain classes.
A major risk arises when partial annotations lead to artificially inflated certainty in certain regions of the feature space. The model may have encountered examples marked with partial labels that strongly suggest a single class, even when, in reality, multiple classes are plausible. Such scenarios can arise if partial labels come from “strong but incomplete” heuristics. To handle this, you might apply post-hoc calibration techniques like temperature scaling or isotonic regression, specifically validating them on a small set of fully verified labels. Another strategy is to incorporate uncertainty in the loss function from the outset by encouraging more dispersed predictions when the label is uncertain. Yet, overdoing it can degrade performance on the examples that do have accurate labels.
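Temperature scaling itself fits a single scalar on that small, fully verified calibration set; a minimal sketch, assuming `val_logits` and `val_labels` have already been collected from it:

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, max_iter=200, lr=0.01):
    """Post-hoc temperature scaling fit on a small, fully verified label set."""
    log_t = torch.zeros(1, requires_grad=True)   # optimize log T so T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=lr, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()

# Afterwards: calibrated_probs = F.softmax(test_logits / T, dim=-1)
# with T = fit_temperature(val_logits, val_labels)
```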
How does partial annotation work in structured prediction tasks like sequence labeling or object detection?
In structured prediction tasks, the model often needs to predict multiple outputs simultaneously (e.g., the sequence of tokens in named entity recognition or bounding boxes in object detection). Partial labels in these contexts are more intricate because they could apply to only some parts of the structure. For instance, in sequence labeling, some token labels might be unknown or only weakly indicated, while others are known. Similarly, for object detection, some bounding boxes might be fully annotated, while others are missing or only roughly localized.
One practical approach is to split the structure into two sets of positions: positions with known labels and positions with uncertain labels. You then compute a standard loss on the known positions and an expectation-based or constraint-based loss on the uncertain positions. In object detection, you might treat some bounding boxes as “candidate proposals” generated by a region proposal network and partially supervise them with known bounding boxes or class annotations. A common pitfall is double counting or conflicting supervision if some partial labels overlap or contradict each other. Methods that carefully merge these constraints and handle overlaps—often via specialized matching algorithms—are critical.
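For the sequence-labeling case, a common way to implement the known-position part is to mark unobserved token positions with an ignore index so they contribute no gradient; a minimal sketch is below (the uncertain positions could additionally receive an expectation-based term like the one shown earlier):

```python
import torch
import torch.nn.functional as F

IGNORE = -100  # marker for token positions whose tags are unobserved

def partial_sequence_loss(token_logits, token_labels):
    """Cross-entropy over only the annotated positions of each sequence.

    token_logits: (batch, seq_len, num_tags)
    token_labels: (batch, seq_len) with IGNORE wherever the tag is unknown
    """
    return F.cross_entropy(
        token_logits.reshape(-1, token_logits.size(-1)),
        token_labels.reshape(-1),
        ignore_index=IGNORE,
    )
```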
In real-world production systems, what monitoring or maintenance challenges emerge from using models trained with partial labels?
Models trained with partial labels can drift in unexpected ways once deployed. Because partial label methods often rely on assumptions about label distributions or consistency, shifts in real data distribution can invalidate those assumptions. For example, if a weak supervision rule was dependent on certain textual cues, and user language changes over time, your model’s performance might degrade abruptly without an obvious cause. Monitoring becomes difficult because you might not have fully labeled data to track accuracy in real time.
A practical solution is to implement data and performance drift detection. You might track the distribution of model predictions or certain features and compare them to historical baselines. If these distributions shift significantly, you trigger re-training or data collection for a more up-to-date partial supervision signal. Another challenge is that partial labels might get “corrected” over time (e.g., new data is fully annotated). This requires a pipeline that periodically re-ingests labeled examples and retrains the model to reduce cumulative bias. However, frequent retraining can be computationally expensive, so you need a schedule or active learning routine to decide the best times to update the model.
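A lightweight version of this monitoring compares the live distribution of predicted classes against a stored baseline; the sketch below uses a simple KL-divergence check, where the threshold is a placeholder you would tune on historical data.

```python
import numpy as np

def prediction_drift(baseline_counts, live_counts, threshold=0.1, eps=1e-8):
    """Flag drift when the KL divergence between historical and current
    predicted-class distributions exceeds a tuned threshold."""
    p = baseline_counts / baseline_counts.sum()
    q = live_counts / live_counts.sum()
    kl = float(np.sum(p * np.log((p + eps) / (q + eps))))
    return kl > threshold, kl

# Example: predicted-class counts from last month vs. this week
drifted, score = prediction_drift(np.array([800.0, 150.0, 50.0]),
                                  np.array([500.0, 300.0, 200.0]))
```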
How do you prevent the model from “cheating” by overfitting to partial signals that do not generalize?
When partial labels come from heuristic features or proxy variables, there is a risk that the model will memorize those proxies instead of truly understanding the underlying concept. For example, in a text classification scenario, partial labels might be based on specific keywords, and the model simply learns to identify these keywords without capturing the broader context. This leads to a type of overfitting where the model sees strong correlations that work for training but fail in real-world or test scenarios.
A key preventative measure is to hold out some data that does not follow the partial labeling scheme. If that is not possible, simulate or generate new examples that intentionally break the heuristic to see if the model still performs well. Regularization strategies like dropout or data augmentation can also help reduce overreliance on particular signals. Additionally, interpretability tools—like saliency maps or feature importance methods—can help detect whether the model is focusing solely on the partial signal. If you discover it is indeed “cheating,” you can refine your partial supervision to reduce that effect, for instance by adding multiple diverse labeling signals instead of a single heuristic.
How do you integrate partial labels from multiple stages of a pipeline when different stages have varying uncertainty?
In complex machine learning pipelines, partial labels might be introduced at different stages. For instance, an upstream classifier might provide a partial classification, then a downstream system refines or provides additional weak annotations. Each stage has its own uncertainty and might propagate errors downstream. Integrating these partial labels requires a carefully designed architecture that recognizes how errors compound.
One possibility is a hierarchical approach: use the first-stage outputs as features rather than ground-truth labels. Then, incorporate any subsequent partial labels as additional constraints or evidence. You can explicitly model the probability that each stage’s partial annotation is correct and weigh them accordingly. This can be managed via a Bayesian network or factor graph, where each node’s state depends on both the model’s learned parameters and the partial labels from the previous stage. However, building such a graph can be computationally heavy, especially for large-scale data, and requires domain expertise to define appropriate conditional dependencies. The complexity is further increased if the partial labels contradict each other. Continuous monitoring of each stage’s reliability can help mitigate these issues.
How do you handle tasks where the label space is very large but only a small fraction of possible labels are observed for each example?
When the label space is vast (e.g., large vocabulary classification or multi-label settings) and only a few relevant labels are partially known for each instance, the model can become confused about which labels are not observed due to irrelevance versus which labels are genuinely unknown. This is common in recommendation systems or extreme multi-label classification (e.g., tagging a piece of content with relevant tags out of thousands).
A typical remedy is to model the label assignment process. You might treat the known labels as positive signals and assume that all other labels are “latent” rather than outright negative. Then you can use techniques like negative sampling or importance weighting to limit the portion of the label space the model sees as potential negatives. Another strategy involves building hierarchical label structures so that partial supervision at higher levels of the hierarchy can inform lower levels. Real-world pitfalls occur when many labels are correlated, leading to confusion if only a small subset is annotated. Ensuring that popular or frequent labels are more thoroughly verified can help prevent the model from missing obvious tags or over-predicting rarely labeled tags.
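A minimal sketch of negative sampling for a single example in a large multi-label space; `positive_idx` holds the indices of the observed positive labels, and for brevity the sketch does not filter accidental collisions between sampled negatives and true positives.

```python
import torch
import torch.nn.functional as F

def sampled_multilabel_loss(logits, positive_idx, num_neg=20):
    """Binary cross-entropy over the observed positives plus a random sample
    of negatives, instead of the full (huge) label space.

    logits:       (num_labels,) scores for one example
    positive_idx: LongTensor of indices of labels observed as relevant
    """
    num_labels = logits.size(0)
    # Unobserved labels are treated as latent, so only a small random subset
    # of them is penalized as negatives on any given step
    neg_idx = torch.randint(0, num_labels, (num_neg,))

    pos_loss = F.binary_cross_entropy_with_logits(
        logits[positive_idx], torch.ones_like(positive_idx, dtype=logits.dtype))
    neg_loss = F.binary_cross_entropy_with_logits(
        logits[neg_idx], torch.zeros(num_neg, dtype=logits.dtype))
    return pos_loss + neg_loss
```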
How do you validate hyperparameters in the presence of partially observed labels?
Hyperparameter tuning often relies on validation sets where you measure performance (accuracy, F1, etc.) for various configurations. If labels in the validation set are also partially observed, typical metrics may not be reliable. You might inflate or deflate certain metrics if the labels are incomplete or estimated. This can lead to selecting suboptimal hyperparameters.
A practical approach is to isolate a truly fully labeled validation subset, even if it is smaller. This ensures you have a trusted reference for measuring actual performance. If that is not possible, you might rely on consistency-based checks (like internal consistency of predictions across different subsets of data) or use external tasks as a proxy for validation. Some teams set up a multi-stage validation where they refine partial labels in a small hold-out set, turning them into near-complete labels for parameter tuning. The main pitfall is that if your partial labels in the validation set are not representative, or if you only rely on incomplete signals, you risk choosing hyperparameters that overfit to the noise or partial nature of your labeling scheme.