ML Interview Q Series: In a knowledge distillation setup, how does the “teacher–student” training objective alter the standard cost function, and why might a temperature parameter be introduced?
Hint: Softened probability distributions guide the student more gently than one-hot labels.
Comprehensive Explanation
Knowledge distillation is a technique used to transfer the “dark knowledge” from a larger, more complex teacher model to a smaller student model. Unlike the standard supervised learning approach, where the model is trained using only the ground-truth (often one-hot) labels, knowledge distillation involves incorporating information from the teacher’s output distribution. This approach modifies the cost function to balance between matching the teacher’s “soft” probability distribution and fitting the true labels.
Modified Training Objective
In a purely standard classification setting, a student model would be trained using a cross-entropy loss with respect to the one-hot ground-truth labels. However, in knowledge distillation, the student is also guided by the teacher’s probability distribution predictions, especially when they are “softened” via a temperature parameter. A commonly used distillation loss combines two terms: one term for matching the ground-truth labels and another for matching the teacher’s distribution.
A common form of the combined objective is:
Loss = (1 - alpha) * CE(y_true, p_student) + alpha * T^2 * KL(p_teacher^(T) || p_student^(T))
In this expression, CE is the cross-entropy loss between the one-hot labels y_true and p_student, the student’s predictions. KL is the Kullback–Leibler divergence between the teacher’s softened distribution p_teacher^(T) and the student’s softened distribution p_student^(T). The scalar alpha is a hyperparameter that balances the contribution of the ground-truth term against the teacher-based term, and T is the temperature parameter (the T^2 factor is discussed below).
The standard cross-entropy loss only accounts for correctness with respect to the single ground-truth class, whereas the knowledge distillation term leverages richer signals contained in the teacher’s output probabilities for all classes.
Role of the Temperature Parameter
The temperature parameter (denoted T) “softens” or “sharpens” the probabilities predicted by the teacher and the student. Normally, probabilities from a softmax are computed as exp(logits_i)/sum_j(exp(logits_j)). By dividing logits by T (or multiplying by 1/T) before the softmax, one can control the smoothness of the probability distribution:
When T is large, the output distribution becomes smoother, with no single dominant class probability but rather a spread of probabilities across classes.
When T is 1, we get the model’s normal softmax probabilities (as if no temperature adjustment were used).
When T is less than 1, the distribution becomes more “peaked,” but in knowledge distillation one typically uses T > 1 to soften the probabilities.
These softened distributions provide more nuance about how the teacher ranks the classes. The student can learn not only which class is correct but also how the teacher “perceives” relationships among classes. This can help the student learn from similarities and differences between classes in a more guided way than just seeing a single hard label.
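As a concrete illustration, the following minimal sketch (with arbitrary example logits) shows how dividing the logits by increasing temperatures spreads probability mass across the classes:

import torch
import torch.nn.functional as F

# Illustrative logits for a 3-class problem (values chosen purely for demonstration)
logits = torch.tensor([[4.0, 2.0, 0.5]])

for T in [1.0, 2.0, 5.0]:
    probs = F.softmax(logits / T, dim=1)
    print(f"T={T}: {probs.squeeze().tolist()}")

# At T=1 the first class dominates; at larger T the probabilities flatten,
# making the relative ordering of the remaining classes easier for a student to learn from.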
Why Include a KL Term Instead of Just Cross-Entropy With Teacher’s Outputs
The KL divergence in the distillation term measures how the student’s softened distribution diverges from the teacher’s. In practice, one can equivalently use the cross-entropy of the student distribution with respect to the teacher distribution: since KL(p_teacher || p_student) equals that cross-entropy minus the teacher’s entropy, and the teacher’s entropy is constant with respect to the student’s parameters, both formulations yield the same gradients for the student. The main goal is that the teacher’s distribution influences the student’s output probabilities at each training example; KL divergence is a common choice because it is directly interpretable as a measure of how one distribution diverges from another.
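As a quick sanity check of this equivalence, here is a minimal sketch (with made-up distributions) showing that the KL term differs from the teacher-referenced cross-entropy only by the teacher’s entropy, which does not depend on the student:

import torch
import torch.nn.functional as F

# Made-up fixed teacher distribution and student logits, for illustration only
teacher_probs = torch.tensor([[0.7, 0.2, 0.1]])
student_logits = torch.tensor([[2.0, 1.0, 0.5]])
student_log_probs = F.log_softmax(student_logits, dim=1)

kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
ce = -(teacher_probs * student_log_probs).sum(dim=1).mean()                # CE(p_teacher, p_student)
teacher_entropy = -(teacher_probs * teacher_probs.log()).sum(dim=1).mean()

# KL(p_teacher || p_student) = CE(p_teacher, p_student) - H(p_teacher)
print(torch.allclose(kl, ce - teacher_entropy))  # expected: True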
Temperature-Scaling Implementation Detail
In practice, you first compute the logits from both models. Then you divide those logits by T before applying the softmax. For example, if logits_student is your student’s raw scores, you get:
p_student^(T)[i] = softmax(logits_student[i] / T)
Similarly, for the teacher:
p_teacher^(T)[i] = softmax(logits_teacher[i] / T)
These softened probabilities are plugged into the KL divergence term. Usually, one multiplies the KL term by T^2 because dividing the logits by T scales down the gradients of the soft-target term (roughly by a factor of 1/T^2); the T^2 factor restores its magnitude relative to the hard-label cross-entropy term.
Balancing the Loss Terms
Alpha is a hyperparameter that must be tuned. If alpha is set too low, the student focuses mostly on matching ground-truth labels and may ignore the teacher’s signals. If alpha is set too high, the student might overfit to the teacher’s distribution and disregard the original classification objective. Typical alpha values range between 0.1 and 0.9, but exact tuning is data- and model-dependent.
Example Code Snippet for Knowledge Distillation
import torch
import torch.nn as nn
import torch.optim as optim

class StudentModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(StudentModel, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

def softmax_temperature(logits, T):
    # Softened probabilities: divide the logits by T before applying the softmax
    return nn.functional.softmax(logits / T, dim=1)

def distillation_loss(student_logits, teacher_logits, labels, alpha, T):
    # Standard cross-entropy loss with the true labels
    ce_loss = nn.functional.cross_entropy(student_logits, labels)
    # KL divergence with softened probabilities
    # (log_softmax is used on the student side for numerical stability)
    log_p_student_T = nn.functional.log_softmax(student_logits / T, dim=1)
    p_teacher_T = softmax_temperature(teacher_logits, T).detach()
    kl_loss = nn.functional.kl_div(log_p_student_T, p_teacher_T, reduction="batchmean") * (T**2)
    # Combined loss
    return (1 - alpha) * ce_loss + alpha * kl_loss

# Usage example
student = StudentModel(input_dim=784, hidden_dim=128, output_dim=10)
teacher = StudentModel(input_dim=784, hidden_dim=512, output_dim=10)  # assume an already trained teacher
teacher.eval()  # the teacher stays frozen during distillation
optimizer = optim.Adam(student.parameters(), lr=1e-3)

for data, labels in dataloader:  # suppose we have a dataloader of (inputs, labels)
    optimizer.zero_grad()
    # Teacher forward pass (no gradients needed)
    with torch.no_grad():
        teacher_logits = teacher(data)
    # Student forward pass
    student_logits = student(data)
    # Calculate distillation loss
    loss = distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0)
    loss.backward()
    optimizer.step()
In this code, student and teacher are both neural network models. The teacher is assumed to be pretrained. The student is trained by combining the standard cross-entropy (with the ground-truth labels) and the distillation loss (KL divergence between softened teacher and student distributions). The temperature T, here set to 2.0, is used to obtain smoother probability distributions from logits for both teacher and student.
Potential Follow-Up Questions
How Does Softening the Probability Distribution Specifically Help the Student Model?
Softening ensures that for a given input, the teacher’s predictions do not just focus on the single most likely class. Instead, they reveal relative likelihoods of other classes. This can help the student learn which mistakes are more plausible and which classes are more semantically similar, providing additional gradient signals that pure one-hot labels do not convey. If the teacher is very confident about class A but sees some minor probability for class B, the student learns that B is somewhat similar to A, which can help in generalization.
What Happens If the Temperature Is Set to 1 or a Very Large Value?
At T=1, you are effectively using the teacher’s ordinary softmax outputs. These might be too “peaked,” and the student may not gain much nuanced information. If T is extremely large, the distribution becomes nearly uniform and the class-specific knowledge is diluted. Hence, T must be tuned in practice to find the best balance between these extremes.
Why Not Train the Student Solely on the Teacher’s Outputs?
Exclusively using teacher outputs could cause the student to learn teacher-specific idiosyncrasies and potentially ignore the ground-truth labels. This may also lead to compounding errors if the teacher was biased or not perfectly trained. Including both ground-truth supervision and teacher guidance typically yields better results than relying on just one or the other.
Are There Cases Where Knowledge Distillation Might Fail?
Knowledge distillation can fail if the teacher model is weak or poorly trained. In that case, the teacher’s “dark knowledge” is not truly informative. Additionally, if the student capacity is extremely limited, even the best teacher guidance may not help much. Another subtlety arises when the training distribution used for distillation differs significantly from what the teacher was trained on. Mismatched distributions can degrade the teacher’s quality of outputs and hamper the distillation process.
How Does Knowledge Distillation Differ From Just Using Data Augmentation or Other Regularization Techniques?
Data augmentation and regularization are typically applied to produce robust or more generalizable models by modifying the input data or model parameters. In contrast, knowledge distillation is specifically about transferring insights from a more capable teacher network into a smaller student network. While it can be viewed as a form of regularization for the student (because it reduces overfitting by focusing on softened distributions), it is a distinct approach where the teacher’s learned patterns serve as an extra source of supervision.
How Can This Approach Be Extended or Adapted to Non-classification Tasks?
Knowledge distillation can be adapted to various tasks like semantic segmentation, object detection, and even natural language generation. Instead of matching final softmax distributions, one can match intermediate feature maps, bounding box predictions, or other intermediate representations. The core idea remains to use the teacher’s predictions or representations to guide the student. The exact form of the loss function will change depending on the task’s nature, but the principle of transferring knowledge remains consistent.
What About the T^2 Factor in the Loss Term?
When temperature scaling is used in the KL divergence, the gradient scales differently compared to standard cross-entropy. Multiplying by T^2 corrects for the gradient scaling so that the magnitudes remain appropriate for learning. If you omit the T^2 factor, the gradient scale might not reflect the correct emphasis on the distribution matching. This can lead to suboptimal training or can bias the optimization dynamics.
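A small empirical sketch (with illustrative logits) makes the effect visible: without the T^2 factor, the gradient of the softened KL term shrinks rapidly as T grows, whereas rescaling by T^2 keeps its magnitude roughly comparable across temperatures.

import torch
import torch.nn.functional as F

teacher_logits = torch.tensor([[3.0, 1.0, 0.2]])  # illustrative values

def kd_grad_norm(T, rescale):
    student_logits = torch.tensor([[0.5, 0.2, 0.1]], requires_grad=True)
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    )
    if rescale:
        loss = loss * T**2  # the factor discussed above
    loss.backward()
    return student_logits.grad.norm().item()

for T in [1.0, 2.0, 5.0]:
    print(T, kd_grad_norm(T, rescale=False), kd_grad_norm(T, rescale=True))
# Without rescaling, the gradient norm falls off sharply as T increases;
# with the T^2 factor its scale stays roughly comparable across temperatures.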
When Should We Prefer Knowledge Distillation Over Model Pruning or Quantization for Model Compression?
Model pruning and quantization are techniques to reduce the size and computational demands of a model by removing or reducing the precision of certain parameters. However, these approaches do not always leverage additional information from a bigger teacher model. Knowledge distillation is especially beneficial when you want to preserve the functional performance of the larger model in a smaller architecture, guided by the teacher. In many practical scenarios, a combination of knowledge distillation plus pruning/quantization can be used to achieve further reductions in memory and compute while retaining good performance.
Can We Have Multiple Teachers for a Single Student?
Yes, multi-teacher distillation is possible if you have multiple pretrained models, each potentially specializing in different aspects of the same task. A common approach is to average the logits or the softened output distributions of multiple teachers, or create an ensemble teacher, and then distill that knowledge into a single student. This can sometimes boost performance because the student benefits from the combined wisdom of different teachers. However, it may also complicate the training process and require more careful hyperparameter tuning.
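A minimal sketch of this idea, assuming a list of precomputed logits from already-trained teachers, simply averages the softened teacher distributions into a single ensemble target (per-teacher weights could replace the plain mean):

import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, T=2.0):
    # teacher_logits_list: list of (batch, num_classes) tensors from already-trained teachers
    with torch.no_grad():
        ensemble_target = torch.stack(
            [F.softmax(t / T, dim=1) for t in teacher_logits_list]
        ).mean(dim=0)  # simple unweighted ensemble of softened teacher distributions
    student_log_probs = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(student_log_probs, ensemble_target, reduction="batchmean") * (T**2)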
Below are additional follow-up questions
How does the student model architecture influence distillation outcomes, and can it be tailored for better performance?
The student model's architecture can be carefully chosen to maximize the benefits of knowledge distillation. While many setups involve using a simple smaller model, certain architecture modifications can help incorporate the teacher’s guidance more effectively. For example, adding intermediate “attention transfer” layers can allow partial matching of internal feature maps between teacher and student. This is sometimes referred to as feature-based distillation, where not only logits but also hidden activations guide the student. A subtlety here is that too much architectural complexity might counteract the goal of having a lightweight student. Thus, an important pitfall is striking the balance: if the student is too small or too radically different (e.g., teacher is a CNN, student is a transformer), it may be hard to match distributions effectively.
In real-world deployment, you might choose specialized architectures (like MobileNet for edge devices) and then apply knowledge distillation to further squeeze performance. However, if you tailor the student for a niche hardware environment (e.g., GPUs with limited memory or mobile CPUs), you must confirm that the teacher’s hints remain reliable for that architecture’s representational capabilities. If the teacher’s complexity is too high, it might produce subtle internal representations that a much simpler or drastically different architecture cannot mimic easily. This mismatch is a potential edge case.
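As a sketch of the feature-based variant mentioned above, one can add a small projection layer that maps a student’s intermediate features into the teacher’s feature space and penalizes their mismatch; the tensors student_feat and teacher_feat here are hypothetical intermediate activations, extracted for example with forward hooks:

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureHintLoss(nn.Module):
    # Matches an intermediate student feature to the teacher's (feature/hint distillation)
    def __init__(self, student_dim, teacher_dim):
        super().__init__()
        # Projection that maps student features into the teacher's feature space
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_feat, teacher_feat):
        # student_feat: (batch, student_dim), teacher_feat: (batch, teacher_dim)
        return F.mse_loss(self.proj(student_feat), teacher_feat.detach())

# Hypothetical usage alongside the logit-based loss:
#   total_loss = distillation_loss(...) + beta * hint_loss(student_feat, teacher_feat)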
How do teacher model calibration errors impact the distillation process?
Teacher calibration refers to how well the teacher model’s predicted probabilities align with true likelihoods of classes. A well-calibrated model means if it predicts class X with probability p, class X actually occurs around p fraction of the time. If the teacher is poorly calibrated, it might overestimate certain classes, giving them extremely high probabilities even if they are not necessarily correct. This can negatively affect distillation because the student is learning from imbalanced or skewed probability signals.
One pitfall arises when using a high temperature parameter T to soften a poorly calibrated teacher. This might partially mitigate miscalibration by distributing probabilities more evenly. Yet, if the calibration error is severe, even smoothing might not fix deeper distribution misalignments. In practice, you might want to calibrate the teacher first (e.g., using temperature scaling or other calibration strategies post-training) before generating the teacher’s distributions for knowledge distillation. Otherwise, the student might inherit the teacher’s miscalibration or even amplify it.
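A minimal sketch of post-hoc temperature scaling for calibrating the teacher is shown below; it fits a single scalar temperature on held-out validation logits (assumed to be precomputed) by minimizing the negative log-likelihood:

import torch
import torch.nn.functional as F

def fit_calibration_temperature(val_logits, val_labels, steps=200, lr=0.01):
    # val_logits: (N, num_classes) teacher logits on a held-out set (assumed precomputed)
    log_T = torch.zeros(1, requires_grad=True)  # optimize log T so the temperature stays positive
    optimizer = torch.optim.Adam([log_T], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_T.exp(), val_labels)  # NLL of scaled logits
        loss.backward()
        optimizer.step()
    return log_T.exp().item()

# The fitted temperature is then applied to the teacher's logits before
# (or in combination with) the distillation temperature T.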
What happens if the teacher is trained with different data augmentations or domain distributions than the student?
In many real-world scenarios, the teacher model may have been trained on a broader or different distribution of data (including various augmentations) than the student. This discrepancy might cause the teacher’s output probabilities to be less reflective of how the student’s dataset is distributed. If the student is being trained on narrower data, the teacher’s “soft labels” could guide the student to overfit to irrelevant patterns or to distributions that the student’s environment rarely sees.
One subtle pitfall is domain shift, where the teacher might have encountered images, text, or signals with different characteristics. This can cause mismatched logits. As a result, the student might struggle to reconcile ground-truth labels (based on its own data domain) with teacher signals (based on a slightly different domain). In severe domain shifts, the teacher’s guidance can degrade the student’s performance. Practitioners often mitigate this by selectively distilling only on data that matches or approximates the teacher’s domain, or by re-training/refining the teacher on the new domain if resources permit.
Can knowledge distillation be done in an online or continual learning setup, and what are the challenges?
Yes, there are scenarios called “online knowledge distillation” or “mutual learning,” where the teacher and student are updated simultaneously. In such a setup, multiple models train together, exchanging “teacher-like” distributions in each iteration. The idea is that each model can serve as a teacher for the others at different phases of training.
A major challenge arises when no single model is truly “expert” early on; all are learning, so the distributions may be poor approximations. Another subtlety is catastrophic forgetting if each model constantly updates on new data. The signals might be inconsistent across time, causing instability. To address this, practitioners may keep a fixed teacher (frozen parameters) for a few epochs, then update it in controlled intervals. Alternatively, an exponential moving average teacher can smooth out the teacher’s parameters. Nonetheless, this approach can be complex to orchestrate and requires careful hyperparameter tuning and scheduling to avoid each model simply reinforcing the others’ mistakes.
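A minimal sketch of the exponential moving average teacher mentioned above, assuming the teacher and student share the same architecture: after each student optimizer step, the teacher’s weights are nudged toward the student’s, which smooths the teacher over time without backpropagating through it.

import torch

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    # Move each teacher parameter a small step toward the corresponding student parameter
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)

# Called once per optimizer step; the teacher itself never receives gradients.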
What if there are novel classes in the student training data that the teacher never saw?
When new classes appear in the student’s dataset but the teacher does not have prior knowledge of them, the teacher’s outputs for these classes are typically random or near-zero probabilities. The student might learn incorrect distribution information for those classes unless the ground-truth labels for those new classes dominate the training signal.
A potential solution is partial distillation, where you selectively apply distillation only for the classes the teacher was trained to handle. For novel classes, rely solely on standard cross-entropy with the one-hot ground-truth labels. Alternatively, you can train the teacher on a superset of classes or do incremental teacher updates, but in many practical cases the teacher can’t be retrained. A pitfall is that if you attempt to distill knowledge blindly for classes the teacher doesn’t recognize, you risk skewing the student’s probability distribution away from the correct labels.
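A minimal sketch of the partial-distillation idea, assuming a hypothetical index tensor that maps the teacher’s class set into the student’s larger label space; the novel classes are left entirely to the ordinary cross-entropy term:

import torch
import torch.nn.functional as F

def partial_distillation_loss(student_logits, teacher_logits, known_class_idx, T=2.0):
    # student_logits: (batch, num_student_classes), covering both old and novel classes
    # teacher_logits: (batch, num_teacher_classes), the teacher's own class set
    # known_class_idx: LongTensor mapping each teacher class to its index in the
    #   student's label space (a hypothetical mapping assumed by this sketch)
    s = student_logits[:, known_class_idx]  # restrict the student to teacher-known classes
    return F.kl_div(
        F.log_softmax(s / T, dim=1),
        F.softmax(teacher_logits / T, dim=1).detach(),
        reduction="batchmean",
    ) * (T**2)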
Does teacher–student distillation affect interpretability, and how can we address it?
Neural networks are often criticized for being black boxes, and knowledge distillation can exacerbate this by making the student adopt the teacher’s complex internal logic. The interpretability challenge grows if we lose direct visibility into how the teacher arrived at certain probabilities. Furthermore, in some domains such as medical diagnosis or high-stakes decision-making, an uninterpretable teacher might pass along questionable reasoning patterns to the student.
To mitigate this, researchers investigate interpretable distillation methods that incorporate teacher attention maps, rule-based logic modules, or interpretable layer outputs. For instance, one might attach an attention alignment loss that ensures the student’s activation patterns somewhat mirror the teacher’s, thereby allowing some visibility into the “why” behind the predictions. Another approach is to use local post-hoc explanation methods (like LIME or SHAP) on the student model to assess whether the teacher’s influences lead to coherent explanations.
How can multi-teacher scenarios be handled without just averaging the logits?
Multi-teacher knowledge distillation can become complicated if each teacher has distinct knowledge or domain expertise. A naive approach is simply to average all teachers’ logits or probabilities and then distill from this ensemble. But this can be suboptimal if one teacher is less reliable than the others, or if each teacher specializes in different subsets of classes.
A more nuanced approach might weight each teacher’s output based on factors such as validation accuracy, domain overlap, or teacher reliability. One can also adopt gating mechanisms that choose which teacher’s distribution to follow for a given input domain segment. This prevents the student from being confused by contradictory teacher signals. The pitfall is increased training complexity and the need for an additional mechanism to determine teacher weighting. In some real-world applications with diverse datasets, you might even want to route data samples to the best teacher dynamically, then distill knowledge from that teacher alone.
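As a sketch of weighting rather than plain averaging, the ensemble target below combines softened teacher distributions with per-teacher weights (for instance, derived from each teacher’s validation accuracy, which this sketch simply assumes are given):

import torch
import torch.nn.functional as F

def weighted_teacher_target(teacher_logits_list, teacher_weights, T=2.0):
    # teacher_weights: one non-negative scalar per teacher (e.g., validation accuracy),
    # assumed to be supplied by the caller
    w = torch.tensor(teacher_weights, dtype=torch.float32)
    w = w / w.sum()  # normalize so the target remains a probability distribution
    stacked = torch.stack([F.softmax(t / T, dim=1) for t in teacher_logits_list])
    return (w.view(-1, 1, 1) * stacked).sum(dim=0)  # (batch, num_classes) ensemble target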
When does knowledge distillation beat simply fine-tuning a smaller model on the same data?
If you already have a large, well-trained teacher model, distillation often gives a jump-start by leveraging the teacher’s “dark knowledge.” This knowledge can guide the student toward the teacher’s learned decision boundaries more efficiently than starting a random smaller model from scratch. Distillation can also reduce overfitting to the limited training dataset by emphasizing the teacher’s distribution of predictions.
In contrast, a smaller model independently fine-tuned on the same data might not achieve the same level of performance if it has no extra guiding signal. However, if the data is extremely abundant or the smaller model architecture is well-suited to the domain, fine-tuning alone might suffice. Another subtlety is that teacher-based guidance can be especially powerful in low-data regimes, where the teacher’s generalization helps shape the student’s predictions. In high-data regimes, a carefully tuned smaller model might approach teacher-level accuracy without distillation, making the teacher signal less critical.
In what cases could distillation degrade performance, and how can we diagnose such issues?
Sometimes, knowledge distillation can degrade the student’s performance compared to standard training. This can happen if:
• The teacher is poorly trained or overfits heavily.
• A mismatch exists in the data distributions of teacher and student.
• The student capacity is too low to represent teacher distributions effectively.
• Hyperparameters like alpha or T are set inappropriately, causing the distillation term to overwhelm or conflict with ground-truth signals.
Diagnosing these issues typically involves:
• Checking teacher accuracy and calibration on the student’s dataset.
• Experimenting with different alpha and T values to see whether the student balances teacher guidance with the real labels.
• Verifying that the student architecture has enough capacity to benefit from the teacher’s signals.
• Inspecting class-level performance: if certain classes degrade significantly while others improve, it may indicate teacher biases or domain mismatches.
A real-world scenario might occur if the teacher was trained on a dataset that does not reflect the student’s operating environment. Even though the teacher might have high accuracy on its own domain, it fails to provide meaningful guidance in the new domain. The best remedy is often to retrain or fine-tune the teacher on data that aligns better with the student environment or lower the weight alpha so that the student relies more on ground-truth labels.
What if the teacher’s predicted distribution starkly disagrees with the ground-truth label?
In certain corner cases, a well-trained teacher might label an image with high confidence in class A, whereas the official ground truth says class B. This disagreement can occur due to label noise, domain shift, or errors in the dataset. A naive approach might cause confusion for the student, because the distillation loss tries to push the student toward the teacher’s class A, while the cross-entropy with the label pushes it toward class B.
A typical solution is to maintain a moderate alpha value that balances the teacher’s influence with the ground truth. If the mismatch is systematic (e.g., the dataset has consistent label errors), the teacher’s distribution might actually be more accurate, and you might want to rely on the teacher. Alternatively, if you trust the ground truth more, you lower alpha to minimize teacher influence. In practice, you may want to do a small check: if the teacher and ground truth conflict significantly on many samples, investigate the data quality or consider dropping teacher supervision on those suspicious samples. Stubbornly forcing the student to adhere to an incorrect teacher or incorrect label can degrade learning.
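One simple way to implement the “drop teacher supervision on suspicious samples” idea is to mask the KL term on examples where the teacher’s top-1 prediction contradicts the ground-truth label; the sketch below (a variant of the earlier distillation_loss) does exactly that:

import torch
import torch.nn.functional as F

def selective_distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    # Standard cross-entropy against the ground-truth labels
    ce_loss = F.cross_entropy(student_logits, labels)
    # Mask: 1 where the teacher's top-1 prediction agrees with the label, 0 otherwise
    agree = (teacher_logits.argmax(dim=1) == labels).float()
    kl_per_sample = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1).detach(),
        reduction="none",
    ).sum(dim=1) * (T**2)
    # Average the KL term only over samples where teacher and label agree
    kl_loss = (agree * kl_per_sample).sum() / agree.sum().clamp(min=1.0)
    return (1 - alpha) * ce_loss + alpha * kl_loss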