ML Interview Q Series: Why is the median often chosen as the representative value of a dataset, rather than other central measures?
Comprehensive Explanation
The median is the value at the midpoint of a sorted dataset: it splits the ordered data so that at least half of the values lie at or below it and at least half lie at or above it. It is typically favored as a measure of central tendency when there is concern about the influence of extreme values or when the data is not symmetrically distributed.
Mathematical Representation of the Median
$$
\text{Median} =
\begin{cases}
x_{((n + 1)/2)} & \text{if } n \text{ is odd} \\[6pt]
\dfrac{x_{(n/2)} + x_{(n/2 + 1)}}{2} & \text{if } n \text{ is even}
\end{cases}
$$
In the expression above:
n is the total number of data points in the dataset.
x_{(k)} denotes the k-th smallest value in the sorted order.
When n is odd, the position is (n+1)/2, which is always an integer. When n is even, the median is calculated as the average of the values at positions n/2 and (n/2 + 1).
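As a minimal sketch of this definition, the median can be computed directly from a sorted copy of the data using only the standard library (Python's built-in `statistics.median` follows the same odd/even rule):

```python
def median(values):
    """Median of a non-empty list, following the odd/even rule above."""
    xs = sorted(values)          # x_(1) <= x_(2) <= ... <= x_(n)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:               # odd n: position (n + 1) / 2 in 1-indexed terms
        return xs[mid]
    return (xs[mid - 1] + xs[mid]) / 2   # even n: average of positions n/2 and n/2 + 1

print(median([7, 1, 3]))        # 3
print(median([7, 1, 3, 10]))    # (3 + 7) / 2 = 5.0
```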
Resistance to Outliers
The primary reason to use the median is that it is robust to outliers. Because the median depends only on the ranks of the observations, not their magnitudes, extreme values have a negligible effect on it. This property is especially important when dealing with skewed distributions or situations where extreme values would distort the mean.
Skewed Distributions
When a dataset is heavily skewed, the mean may get pulled in the direction of the tail. By contrast, the median remains near the bulk of the data. As a result, the median can better capture what is “typical” in the presence of significant skew or extreme outliers.
Ease of Interpretation
The median is easy to interpret in real-world scenarios, especially for ordinal data. For instance, in surveys or Likert scales where data might not be numerical in a strict sense but still ordered, the median is a useful summary measure.
Practical Considerations
While the median is robust and easy to interpret, it can be less stable than the mean when drawing repeated samples from the same population. In other words, the median might show higher variance between different samples of the same size. However, in cases where outliers or skew are prominent, this drawback is often outweighed by its robustness.
Follow-Up Questions
How do you handle missing data when computing the median?
Handling missing data can be approached by removing rows with missing values or imputing them. If the amount of missing data is small and randomly distributed, it is common to drop those entries. In other cases, values may be imputed using techniques like regression imputation or simply using median imputation from the rest of the dataset. However, imputing with the median might alter the actual distribution, so it is important to clearly document and justify the approach to missing data, especially for highly skewed datasets or datasets where missingness is not random.
When outliers are present, dropping too many points with missing data could distort the distribution. A careful strategy involves investigating the mechanism behind the missing data (e.g., Missing Completely at Random, Missing at Random, or Missing Not at Random) before deciding how to impute.
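As a small illustration of median imputation, here is a sketch using pandas; the DataFrame and the column name `income` are hypothetical, and the choice between dropping and imputing still depends on the missingness mechanism discussed above:

```python
import pandas as pd
import numpy as np

# Hypothetical data with one missing value and one strong outlier.
df = pd.DataFrame({"income": [32_000, 41_000, np.nan, 38_000, 1_500_000]})

# Option 1: drop rows with missing values (reasonable when few and missing at random).
dropped = df.dropna(subset=["income"])

# Option 2: impute with the median, which is less distorted by the outlier than the mean.
median_value = df["income"].median()              # NaN values are ignored by default
imputed = df.assign(income=df["income"].fillna(median_value))

print(median_value)   # 39500.0
print(imputed)
```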
Suppose your dataset has multiple modes; how does that affect the median?
The presence of multiple modes does not affect the median directly. The median is concerned with the position of data points once they are sorted. Even if the distribution has more than one mode, the median will still be uniquely determined as the middle value(s) in the ordered list. In contrast, mode can be ambiguous in such situations if there are multiple peaks in the distribution. Thus, the median remains stable despite multimodality.
In real-world data analysis, how do you decide whether to use the mean or median?
The primary question to ask is whether the data is skewed or has significant outliers. If the data is roughly symmetric and does not contain large outliers, the mean is often a better measure of central tendency because it leverages the full dataset. On the other hand, if there are strong outliers or a heavily skewed distribution, the median might be a more informative representation of the typical value.
Another factor is interpretability. For instance, in certain fields like housing or income data, the median is more intuitive because these domains can have heavy right-skew distributions where large values pull the mean upward.
Could you illustrate how outliers impact the median’s value compared to the mean?
Imagine a small dataset of house prices: [100,000; 110,000; 120,000; 125,000; 10,000,000]. The mean would be heavily inflated by the 10,000,000 value, whereas the median would remain comparatively low. Here is a quick demonstration:
Mean: (100,000 + 110,000 + 120,000 + 125,000 + 10,000,000) / 5 = 2,091,000
Median: The sorted list is [100,000; 110,000; 120,000; 125,000; 10,000,000], so the median is 120,000.
This stark difference shows how a single outlier exerts a tremendous effect on the mean, whereas the median remains close to the typical house price in this scenario.
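The same comparison, written out as a quick check with the standard library:

```python
from statistics import mean, median

prices = [100_000, 110_000, 120_000, 125_000, 10_000_000]

print(mean(prices))    # 2091000 -- pulled up by the single outlier
print(median(prices))  # 120000  -- stays near the typical price
```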
What are potential pitfalls in using the median in high-dimensional data or large datasets?
In high-dimensional contexts, each dimension can have outliers, and the notion of “central” points might be more complex. Some pitfalls include:
Computation in extremely large datasets: calculating the median requires sorting or a selection algorithm (such as Quickselect), which must be implemented efficiently for massive data.
Multi-dimensional data: In more than one dimension, there is no single “median” that splits the data evenly in all dimensions. Some generalized median concepts exist (such as the per-dimension median computed independently for each coordinate, or the geometric median), but their computation can be non-trivial and their interpretation more complex (a per-dimension example is sketched after this answer).
Loss of information: In some analytics tasks, an aggregate measure alone might not capture the relationships among dimensions. Using the median in isolation may oversimplify the picture in multi-feature problems.
A thorough understanding of data structure, domain, and goals is needed to decide whether the median, or another robust statistic, is the right choice in high-dimensional problems.
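As a small illustration of the per-dimension option mentioned above (NumPy assumed; this is only a sketch, not a full treatment of multivariate medians):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))          # 1000 points in 3 dimensions
X[:10] += 50                            # contaminate a few rows with outliers

# Per-dimension (marginal) median: one value per column.
# Note: this point need not lie "inside" the data cloud in any joint sense.
print(np.median(X, axis=0))

# Column means, by contrast, are visibly pulled by the contaminated rows.
print(X.mean(axis=0))
```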
Below are additional follow-up questions
How does the median compare to a trimmed mean as a robust measure of central tendency?
A trimmed mean (also called a truncated mean) removes a fixed percentage of the largest and smallest values before computing the average. The median, once the data is sorted, effectively discards all but the one or two central values (depending on whether the number of observations is odd or even). Both methods reduce the influence of extreme outliers.
A key difference is that a trimmed mean incorporates more values than the median, so it can sometimes capture more nuance. However, it still requires choosing how much trimming (for instance, 5% or 10%) is appropriate, which can be somewhat arbitrary. By contrast, the median has a fixed “trimming” threshold conceptually set at 50%.
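A quick comparison of the two, assuming SciPy is available (`scipy.stats.trim_mean` trims the stated proportion from each tail; the numbers below are purely illustrative):

```python
import numpy as np
from scipy.stats import trim_mean

data = np.array([2, 3, 3, 4, 4, 5, 5, 6, 7, 250])   # one extreme outlier

print(data.mean())              # 28.9  -- dominated by the outlier
print(trim_mean(data, 0.1))     # 4.625 -- 10% trimmed from each tail removes the outlier
print(np.median(data))          # 4.5
```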
Potential pitfalls and edge cases:
Deciding on the trimming percentage can be subjective. If you trim too much, you might lose important information. If you trim too little, you might still be overly influenced by large outliers.
The median does not rely on a user-defined parameter, which makes it simpler in many cases, especially if you want a measure of central tendency with no tunable assumptions.
In highly skewed data, a trimmed mean might still reflect some skew if the trim is insufficient; the median, in contrast, remains unaffected by the shape of the tails.
How do you handle extremely large streaming data when computing the median?
When data arrives in a continuous stream, it can be impractical to store all observations for a full sort-based median calculation. Approximation algorithms become necessary. Common approaches include:
Maintaining two heaps (a max-heap for the lower half and a min-heap for the upper half), kept balanced so that their sizes differ by at most one. The median is then read from the top of one or both heaps (see the sketch after this list).
Using streaming quantile sketches (such as the Greenwald-Khanna algorithm or t-digest) to maintain approximate quantiles, although these involve trade-offs in accuracy.
Breaking the data stream into manageable batches and computing a “median of medians,” though each batch’s median must be tracked carefully.
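A minimal sketch of the two-heap approach, using Python's `heapq` (which only provides min-heaps, so the lower half stores negated values):

```python
import heapq

class RunningMedian:
    """Track the median of a stream using two balanced heaps."""

    def __init__(self):
        self.lo = []   # max-heap of the lower half (stored as negated values)
        self.hi = []   # min-heap of the upper half

    def add(self, x):
        # Push to the lower half first, then move its largest element up,
        # so every element in lo is <= every element in hi.
        heapq.heappush(self.lo, -x)
        heapq.heappush(self.hi, -heapq.heappop(self.lo))
        # Rebalance so the heap sizes differ by at most one (lo may hold the extra).
        if len(self.hi) > len(self.lo):
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def median(self):
        if len(self.lo) > len(self.hi):
            return -self.lo[0]
        return (-self.lo[0] + self.hi[0]) / 2

rm = RunningMedian()
for x in [5, 1, 9, 2, 7]:
    rm.add(x)
print(rm.median())   # 5
```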
Potential pitfalls and edge cases:
Maintaining balanced heaps can be challenging if the incoming distribution shifts over time. Continuous rebalancing is required.
If the data distribution changes (“concept drift” in streaming contexts), the median you track can become outdated. Methods that adapt to changes are essential.
An approximation algorithm might be acceptable for large-scale operations, but it could lead to inaccuracies in downstream tasks if precision around the median is critical.
How do you estimate a confidence interval or standard error of the median in practice?
You can compute a confidence interval for the median using statistical bootstrapping or other resampling techniques. The general approach, sketched in code below, is:
Draw many resamples (with replacement) from the original dataset.
Compute the median for each resample.
Use the distribution of these bootstrapped medians to find the desired confidence interval (for example, the 2.5th percentile and the 97.5th percentile for a 95% confidence interval).
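A minimal bootstrap sketch with NumPy; the number of resamples, the synthetic data, and the percentile method are illustrative choices rather than the only options:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.lognormal(mean=0.0, sigma=1.0, size=200)   # a skewed sample

n_boot = 5000
boot_medians = np.empty(n_boot)
for i in range(n_boot):
    resample = rng.choice(data, size=data.size, replace=True)  # resample with replacement
    boot_medians[i] = np.median(resample)

lo, hi = np.percentile(boot_medians, [2.5, 97.5])     # 95% percentile interval
print(f"median = {np.median(data):.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```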
Alternatively, for large sample sizes, asymptotic methods exist that approximate the standard error of the median. However, these methods often rely on assumptions like the shape of the underlying distribution.
Potential pitfalls and edge cases:
Bootstrapping can be computationally expensive, especially for large datasets.
If the data is heavily skewed or has strong clustering, bootstrap estimates may require a large number of resamples to be accurate.
Missing data or zero-inflated data distributions might invalidate certain asymptotic assumptions, leading to misestimated intervals.
If your data has both positive and negative outliers, does the median handle them equally well?
Yes, the median is agnostic to the direction of outliers because it cares only about the order of data points, not their magnitude. Whether the extreme values lie on the negative or positive side of the distribution, the median remains focused on the central position.
Potential pitfalls and edge cases:
If outliers account for a large proportion of the data, the concept of “central” changes, and the median might actually shift more than expected. For example, if 40% of the data is extremely high and 60% is moderate, the median could be pulled closer to the moderate cluster.
Mixed distributions or multimodal distributions with strong outlier clusters in both tails can make it difficult to interpret a single median value as representative.
How might the median be used in a time-series context, and what pitfalls arise?
For time-series data, the median can be applied in sliding windows or rolling calculations to produce a robust measure of central tendency at each time interval. This helps reduce the effect of sudden spikes or anomalies in real-time monitoring.
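A small pandas sketch of a rolling median on a synthetic series with one injected spike (the window size is a tuning choice, as noted in the pitfalls below):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ts = pd.Series(rng.normal(10, 1, size=100),
               index=pd.date_range("2024-01-01", periods=100, freq="h"))
ts.iloc[50] = 100            # a single anomalous spike

rolling_median = ts.rolling(window=7, center=True).median()
rolling_mean = ts.rolling(window=7, center=True).mean()

# Around the spike, the rolling mean jumps while the rolling median barely moves.
print(rolling_mean.iloc[48:53].round(2))
print(rolling_median.iloc[48:53].round(2))
```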
Potential pitfalls and edge cases:
A strong seasonal pattern might cause the rolling median to lag or underreact to legitimate shifts in the time-series pattern.
Using too large a window can smooth out meaningful short-term variations, whereas a too-small window may still be sensitive to outliers.
Real-time data might have latencies; the median is not as quickly updated as a simple moving average because finding a median in a stream requires data structures or algorithms that can handle insertion/deletion efficiently.
How does having a high proportion of identical values impact the interpretability of the median?
When a large fraction of values in a dataset are identical, the median often coincides with that value. It can be trivially stable but also less informative about data spread. For example, if 70% of the data points are the same number, the median will almost certainly be that value, offering no real insight into the variability in the remaining 30%.
Potential pitfalls and edge cases:
The median might mask sub-populations in the dataset. If the other 30% of values are widely scattered, the median alone does not communicate that.
If the identical values are very low or very high compared to the rest of the distribution, the resulting median might be misleading in practical interpretations (e.g., poverty data where a large chunk of the population has zero income, while a subset is extremely wealthy).
In aggregating repeated medians from multiple subsets, is there a recommended approach to combine these medians?
If you compute medians separately in multiple subsets and want to combine them, simply averaging those medians is not always representative of the global median. A better approach is often to consolidate the raw data (if available) and compute the overall median. If you only have subset medians and subset sizes, you can estimate an approximate combined median by weighting each subset median by its corresponding size in some advanced interpolation scheme, but this method is not exact.
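A tiny illustration of why a naive average of subset medians can diverge from the global median (synthetic numbers, purely illustrative):

```python
import numpy as np

subset_a = np.array([1, 2, 3, 4, 100])    # small subset with an outlier
subset_b = np.arange(10, 30)              # larger, well-behaved subset

median_a = np.median(subset_a)            # 3.0
median_b = np.median(subset_b)            # 19.5

print((median_a + median_b) / 2)          # 11.25 -- naive average of subset medians
print(np.median(np.concatenate([subset_a, subset_b])))   # 18.0 -- global median
```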
Potential pitfalls and edge cases:
If the subsets vary widely in size or distribution shape, merging subset medians can produce misleading results. For instance, a small subset with an extremely high median might unduly influence a naïve average of medians.
In certain distributed computing frameworks, it might not be feasible to gather raw data in one place. Approximate algorithms (like approximate quantiles) can help, but they introduce errors that must be carefully quantified.
What is the relationship between the median and other quantiles, and does it help in characterizing the distribution?
The median is a special case of a quantile: it is the 0.5 quantile (50th percentile). Quantiles in general split the data into intervals (e.g., quartiles, deciles), and examining several of them (say, the 25th and 75th percentiles) provides additional insight into the spread and shape of the distribution beyond what the median alone offers.
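For example, with NumPy the median is just one of several quantiles that can be read off together (the skewed synthetic sample here is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=1000)   # right-skewed sample

q25, q50, q75 = np.quantile(data, [0.25, 0.50, 0.75])
iqr = q75 - q25                                # interquartile range

print(f"25th: {q25:.2f}  median: {q50:.2f}  75th: {q75:.2f}  IQR: {iqr:.2f}")
```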
Potential pitfalls and edge cases:
Focusing solely on the median can obscure significant variation in the data. Two datasets with the same median can have vastly different overall distributions.
Combining quantile-based insights (like interquartile range) with the median can give a more complete picture, but it increases complexity in reporting and might confuse stakeholders who only expect a single measure of central tendency.
Is the median still appropriate if the data is mostly ordinal but has partial ratio-level attributes?
Ordinal data typically indicates a rank order but does not strictly define equal intervals between levels (e.g., survey responses: “Strongly Disagree,” “Disagree,” “Neutral,” “Agree,” “Strongly Agree”). Ratio-level data, on the other hand, has meaningful zero and intervals (e.g., height, weight, cost). If a dataset mixes these two in the same variable, the interpretation can be tricky. Usually, the variable should be treated as ordinal unless there is a strong justification for the numeric differences representing true intervals.
Potential pitfalls and edge cases:
If you treat ordinal categories as numeric (say, coding them 1, 2, 3, 4, 5), you may unintentionally assume equal spacing. The median might still be valid as a middle category, but the numeric difference between categories is not necessarily meaningful.
If part of the data genuinely represents ratio-like measurements, forcing them into the ordinal scale might lose information. This can lead to confusion about whether the median is capturing a meaningful central rank or an approximate numeric midpoint.