ML Interview Q Series: What are the advantages of using Root Mean Squared Error instead of Mean Absolute Error for evaluating model performance?
Comprehensive Explanation
Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) are among the most commonly used metrics to assess a regression model’s predictive performance. Although both summarize the average magnitude of error between predictions and ground truth labels, RMSE applies a heavier penalty to large errors because it squares them before averaging.
Mathematical Formulas
Below is the key formula for RMSE:

RMSE = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 }
In this formula:
n is the number of data points.
y_i is the actual (observed) value for the i-th data point.
\hat{y}_i is the predicted value for the i-th data point.
The expression (y_i - \hat{y}_i) represents the error for each prediction.
On the other hand, MAE can be expressed as:

MAE = \frac{1}{n} \sum_{i=1}^{n} | y_i - \hat{y}_i |
Here:
The notation is the same as above, except we take the absolute value of the errors.
Why RMSE Might Be Preferred Over MAE
Stronger Emphasis on Large Errors. RMSE squares the deviations before averaging, which magnifies larger errors more than smaller ones. In scenarios where large deviations are especially undesirable (e.g., critical systems where a few big mistakes outweigh many small ones), RMSE provides a stricter penalty, pushing models to minimize large, unexpected discrepancies.
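To see this emphasis numerically, here is a minimal sketch (hypothetical numbers chosen only for contrast) comparing two prediction vectors with the same total absolute error, one spread evenly and one concentrated in a single large miss:

import numpy as np

y_true = np.array([0.0, 0.0, 0.0, 0.0])

# Same total absolute error (4.0), distributed differently
y_even = np.array([1.0, 1.0, 1.0, 1.0])   # four small misses
y_spiky = np.array([4.0, 0.0, 0.0, 0.0])  # one big miss

for name, y_pred in [("even", y_even), ("spiky", y_spiky)]:
    mae = np.mean(np.abs(y_true - y_pred))
    rmse = np.sqrt(np.mean((y_true - y_pred)**2))
    print(f"{name}: MAE = {mae:.2f}, RMSE = {rmse:.2f}")

# even: MAE = 1.00, RMSE = 1.00
# spiky: MAE = 1.00, RMSE = 2.00

Both prediction sets have identical MAE, but RMSE doubles for the concentrated error, which is exactly the stricter penalty described above.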
Continuous Differentiability. MAE involves absolute values, which create a non-differentiable kink at zero error, whereas RMSE (through MSE) is smooth and differentiable with respect to the predicted values. This often makes gradient-based optimization more stable and easier to handle, although modern frameworks can also cope with non-smooth loss functions via subgradients.
Connection to Variance and Gaussian Noise. RMSE is closely related to the standard deviation of the errors. If the underlying data noise is normally distributed, minimizing RMSE (equivalently, MSE) is the same as maximizing the likelihood under that Gaussian assumption.
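To make the Gaussian connection concrete, here is a short derivation sketch, assuming additive noise \epsilon_i \sim \mathcal{N}(0, \sigma^2) with fixed \sigma^2:

\log L = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

The first term and \sigma^2 do not depend on the predictions, so maximizing \log L is equivalent to minimizing \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, i.e., minimizing MSE (and therefore RMSE, since the square root is monotonic).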
Potential Drawbacks of RMSE
Increased Sensitivity to Outliers. Predictions that deviate significantly from the observed values have a disproportionately large effect on RMSE because the errors are squared. This can be problematic if your dataset is prone to outliers or if you do not wish to penalize them heavily.
Less Intuitive Interpretability for Some Use Cases. MAE is often more intuitive because it directly reflects the average absolute difference in the same unit as the target variable. RMSE is also expressed in the target's unit (the square root undoes the squaring), but its value does not correspond to an average deviation: it always lies between the mean and the maximum absolute error, depending on how the errors are distributed, so it is harder to read off directly.
Practical Usage with Python
Below is a small example of how one might compute both RMSE and MAE in Python. This example uses NumPy directly; equivalent helpers exist in libraries such as scikit-learn.
import numpy as np
# Suppose we have some true values and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
# Compute MAE
mae = np.mean(np.abs(y_true - y_pred))
# Compute RMSE
mse = np.mean((y_true - y_pred)**2)
rmse = np.sqrt(mse)
print("MAE =", mae)
print("RMSE =", rmse)
What Happens If We Have Outliers?
RMSE magnifies large errors more than MAE, so if we introduce even a single outlier in y_true or y_pred, the RMSE might jump significantly, whereas the MAE will increase, but not by as large a proportion. This behavior is beneficial when large errors must be heavily penalized but may not be ideal when outliers are simply noise.
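As a sketch of this jump, reuse the arrays from the example above and corrupt one prediction with a hypothetical outlier:

y_pred_outlier = y_pred.copy()
y_pred_outlier[3] = 12.0  # single wild prediction (true value is 7.0)

mae_out = np.mean(np.abs(y_true - y_pred_outlier))
rmse_out = np.sqrt(np.mean((y_true - y_pred_outlier)**2))
print("MAE =", mae_out)    # 1.5   (3x the clean value of 0.5)
print("RMSE =", rmse_out)  # ~2.52 (about 4x the clean value of ~0.61)

A single corrupted prediction triples MAE but roughly quadruples RMSE; with more data points or a wilder outlier, the gap grows even faster.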
Follow-Up Questions
Could we simply take a squared MAE to replicate the idea of squaring errors?
Squaring each individual absolute error reproduces MSE exactly, since |y_i - \hat{y}_i|^2 = (y_i - \hat{y}_i)^2. Squaring the MAE itself, i.e., computing ((1/n) \sum |y_i - \hat{y}_i|)^2, is a different quantity: the square of an average is not the average of the squares, and by Jensen's inequality MAE^2 <= MSE, with equality only when all errors have the same magnitude. Squared MAE also inherits the non-smooth gradient of the absolute value at zero error, whereas MSE remains smooth everywhere. Finally, RMSE (through MSE) is directly tied to statistical properties such as maximum-likelihood estimation under Gaussian noise.
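A tiny numeric sketch of the distinction, using two hypothetical errors:

import numpy as np

errors = np.array([1.0, 3.0])
mae = np.mean(np.abs(errors))  # 2.0
mae_squared = mae**2           # 4.0
mse = np.mean(errors**2)       # 5.0 (average of 1 and 9)
rmse = np.sqrt(mse)            # ~2.236

# MAE^2 (4.0) != MSE (5.0): the square of the average is not
# the average of the squares, consistent with MAE^2 <= MSE.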
In which situations would MAE be preferred despite the popularity of RMSE?
MAE is often chosen if:
The data contains outliers that should not be heavily amplified in the error metric.
You desire a more straightforward interpretation of error in the original scale (without a square root).
The noise distribution is expected to be Laplacian (heavier-tailed than Gaussian) rather than Gaussian; see the likelihood sketch below.
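Mirroring the Gaussian derivation above, the Laplacian case can be sketched as follows: assuming noise \epsilon_i \sim \mathrm{Laplace}(0, b) with density \frac{1}{2b} \exp(-|\epsilon|/b),

\log L = -n \log(2b) - \frac{1}{b} \sum_{i=1}^{n} |y_i - \hat{y}_i|

so maximizing the likelihood is equivalent to minimizing the sum of absolute errors, i.e., MAE.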
How does the derivative of RMSE differ from that of MAE, and how might this affect training?
The derivative of MAE with respect to a prediction is the (negative) sign of the error, which has constant magnitude and is undefined at exactly zero error. In contrast, the derivative of MSE is proportional to the error itself, so it is smooth everywhere and shrinks as predictions improve, making gradient-based parameter updates smoother near the optimum. In practice, modern deep learning frameworks handle both absolute and squared errors effectively (using subgradients for the absolute value), but the smoother gradient of MSE/RMSE can sometimes converge faster in certain models.
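Concretely, the per-example gradients with respect to a single prediction \hat{y} are:

\frac{\partial}{\partial \hat{y}} (y - \hat{y})^2 = -2(y - \hat{y}), \qquad \frac{\partial}{\partial \hat{y}} |y - \hat{y}| = -\operatorname{sign}(y - \hat{y})

The squared-error gradient shrinks as the prediction approaches the target, giving naturally damped updates near the optimum, while the absolute-error gradient keeps constant magnitude and is undefined exactly at y = \hat{y} (frameworks substitute a subgradient there).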
Are there cases where neither RMSE nor MAE is the best choice?
Yes. Depending on the task, Mean Absolute Percentage Error (MAPE) can be used for percentage-based error analysis, and domain-specific criteria such as the maximum error, quantile-based errors, or custom cost functions (e.g., financial risk measures) may provide better insights or align more closely with business objectives.
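For reference, MAPE is commonly defined as:

\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|

Note that it is undefined whenever some y_i = 0 and can behave erratically when true values are close to zero.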
How can we interpret a large difference between RMSE and MAE on the same dataset?
A large gap between RMSE and MAE usually indicates that a few predictions have much larger errors than the rest, because RMSE weights those errors more heavily. Note that RMSE >= MAE always holds, with equality only when every error has the same magnitude, so the ratio RMSE/MAE serves as a quick diagnostic for how dispersed the errors are.
Summary of Key Points
RMSE squares errors, thus penalizing large mistakes more than MAE.
RMSE aligns with minimizing variance under Gaussian noise assumptions.
MAE is more robust to outliers, but can be less smooth for gradient-based optimization.
Practical choice depends on your tolerance for large errors and the noise distribution of your data.