MSE, MAE, R-Squared and Beyond — Regression Metrics That Actually Matter
A scroll-driven deep dive into regression metrics. Understand MSE, RMSE, MAE, MAPE, R-squared, and Adjusted R-squared — when to use each, their gotchas, and how to report results properly.
Model Evaluation
How wrong is your model?
Let’s quantify it.
Classification has accuracy. Regression has a zoo of metrics — MSE, RMSE, MAE, MAPE, R², Adjusted R². Each answers a slightly different question about how far off your predictions are. Pick the wrong one and you’ll optimize for the wrong thing.
It All Starts With Residuals
Definition of a residual
The difference between what actually happened and what the model predicted. It's the fundamental unit of error in regression — every metric is a function of residuals.
eᵢ = yᵢ − ŷᵢ
When the residual is positive, the actual value was higher than predicted — the model under-predicted and guessed too low.
eᵢ > 0 → under-prediction
When the residual is negative, the actual value was lower than predicted — the model over-predicted and guessed too high.
eᵢ < 0 → over-prediction
The goal: make residuals small AND unstructured. If there are patterns in your residuals, the model hasn't captured all the learnable signal in the data.
Your model predicts house price = $300K. Actual price = $350K. What is the residual?
💡 Residual = actual − predicted. What's the sign?
Residual = actual − predicted = $350K − $300K = +$50K. The positive sign tells us the model under-predicted (the actual value was HIGHER than what the model guessed). Every regression metric is a function of these residuals.
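Here's a minimal NumPy sketch of the computation; the array values (including the $300K/$350K house) are made-up illustration data:

```python
import numpy as np

# Actual and predicted values in $K (made-up illustration data)
y_true = np.array([350, 200, 410])   # actual prices
y_pred = np.array([300, 220, 400])   # model predictions

residuals = y_true - y_pred          # eᵢ = yᵢ − ŷᵢ
print(residuals)                     # [ 50 -20  10]: +50 means the model under-predicted by $50K
```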
MSE, RMSE, and MAE
The core regression metrics
Mean Absolute Error — the average absolute residual, reported in original units (e.g., dollars). Treats all errors equally and is robust to outliers, making it the default choice.
MAE = (1/n) Σ|yᵢ − ŷᵢ|
Mean Squared Error — the average squared residual. By squaring, it penalizes large errors much more heavily than small ones. A few big misses will dominate this metric.
MSE = (1/n) Σ(yᵢ − ŷᵢ)²
Root Mean Squared Error — the square root of MSE, bringing it back to original units. Still sensitive to outliers like MSE, but easier to interpret since the units match your target variable.
RMSE = √MSE
MSE is always greater than or equal to MAE squared (Jensen's inequality). This means squaring amplifies large errors — a few big misses inflate MSE far more than MAE.
MSE ≥ MAE²
You're predicting delivery times. Most are off by 5–10 min, but occasionally you're off by 2 hours. Which metric should you optimize?
💡 Think about the COST of errors. Is a 2-hour error 2× as bad or 24× as bad as a 5-min error?
This is a business decision, not a statistical one. If a 2-hour delay means a lost customer (catastrophic cost), MSE is right — it prioritizes fixing the worst predictions. If every extra minute costs the same (e.g., driver idle time), MAE is better — it optimizes for the average case without letting outliers dominate. In practice: MSE for safety-critical, MAE for average-case optimization.
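To see the difference concretely, here's a minimal NumPy sketch with made-up delivery times in minutes, where one prediction misses by two hours:

```python
import numpy as np

# Hypothetical delivery times in minutes: mostly small misses, one 2-hour miss
y_true = np.array([30.0, 45.0, 25.0, 40.0, 35.0])
y_pred = np.array([35.0, 40.0, 30.0, 160.0, 38.0])   # one prediction is off by 120 min

errors = y_true - y_pred
mae  = np.mean(np.abs(errors))   # MAE  = (1/n) Σ|yᵢ − ŷᵢ|
mse  = np.mean(errors ** 2)      # MSE  = (1/n) Σ(yᵢ − ŷᵢ)²
rmse = np.sqrt(mse)              # RMSE = √MSE

print(f"MAE:  {mae:.1f} min")    # ≈ 27.6 min
print(f"RMSE: {rmse:.1f} min")   # ≈ 53.8 min: the single 120-min miss roughly doubles the reported error
```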
R²: The Proportion of Explained Variance
R-squared and Adjusted R-squared
SS_res = Σ(yᵢ − ŷᵢ)² (residual sum of squares: the error the model leaves behind)
SS_tot = Σ(yᵢ − ȳ)² (total sum of squares: the error from always predicting the mean ȳ)
R² = 1 − SS_res / SS_tot
R² = 1.0 → perfect predictions (SS_res = 0)
R² = 0.0 → model = predicting the mean
R² < 0 → model is WORSE than the mean!
Adjusted R² = 1 − [(1 − R²)(n − 1)] / [n − p − 1], where n is the number of observations and p the number of features. It penalizes added features, so it only rises when a new feature genuinely improves the fit.
Model A has R² = 0.85 with 3 features. Model B has R² = 0.86 with 50 features. Which is likely better?
💡 47 extra features for 1% gain — sounds like overfitting...
Going from 3 to 50 features for only 1% R² improvement is a red flag. Those 47 extra features are likely fitting noise (overfitting). Adjusted R² for Model B would be LOWER than its raw R² because it penalizes the extra features. Adjusted R² for Model A would be close to its raw R² since 3 features is lean. In practice, Model A is almost certainly better — it's simpler, less prone to overfitting, and captures nearly the same variance.
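A minimal NumPy sketch of both formulas; the data and the helper name r2_and_adjusted are illustrative, not from any library:

```python
import numpy as np

def r2_and_adjusted(y_true, y_pred, p):
    """R² and Adjusted R², where p is the number of features used by the model."""
    ss_res = np.sum((y_true - y_pred) ** 2)              # SS_res = Σ(yᵢ − ŷᵢ)²
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)     # SS_tot = Σ(yᵢ − ȳ)²
    r2 = 1 - ss_res / ss_tot
    n = len(y_true)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)        # penalizes extra features
    return r2, adj_r2

# Hypothetical data: the same predictions judged as if they came from 3 vs. 50 features
y_true = np.random.default_rng(0).normal(size=200)
y_pred = y_true + np.random.default_rng(1).normal(scale=0.4, size=200)

print(r2_and_adjusted(y_true, y_pred, p=3))    # adjusted R² barely below raw R²
print(r2_and_adjusted(y_true, y_pred, p=50))   # adjusted R² noticeably lower for the same fit
```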
MAPE and Scale-Independent Metrics
Mean Absolute Percentage Error
Average percentage error — scale-independent, so you can compare across different targets. Great for saying 'we're off by 5%' regardless of the scale of the values.
MAPE = (100/n) Σ |yᵢ − ŷᵢ| / |yᵢ|
Undefined when any actual value is zero (division by zero). Also asymmetric: because the denominator is the actual value, under-predictions are bounded at 100% error (for non-negative predictions), but over-predictions can be unbounded.
Symmetric MAPE fixes some of MAPE's issues by dividing by the average of actual and predicted values. Bounded between 0–200%, making it more stable and symmetric across over/under-predictions.
sMAPE = (100/n) Σ |yᵢ − ŷᵢ| / ((|yᵢ| + |ŷᵢ|)/2)
Practical Decision Guide
You're predicting quarterly revenue for companies ranging from $1M to $1B. An MAE of $5M means very different things for small vs. large companies. Which metric handles this?
💡 Which metric measures errors as a FRACTION of the actual value?
MAPE converts errors to percentages relative to the actual value. A $5M error on a $1B company = 0.5% (great). A $5M error on a $10M company = 50% (terrible). This scale-independence is MAPE's main advantage. MAE treats both the same ($5M). MSE treats them the same but also squares them. R² compares to the mean baseline but doesn't distinguish company-level scale. Just watch out for MAPE's pitfall: if any company has $0 revenue, MAPE is undefined.
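Here's a minimal NumPy sketch of both formulas, with made-up revenue figures (in $M) echoing the scenario above:

```python
import numpy as np

def mape(y_true, y_pred):
    # MAPE = (100/n) Σ |yᵢ − ŷᵢ| / |yᵢ|  (undefined if any yᵢ is 0)
    return 100 * np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

def smape(y_true, y_pred):
    # sMAPE = (100/n) Σ |yᵢ − ŷᵢ| / ((|yᵢ| + |ŷᵢ|)/2)
    return 100 * np.mean(np.abs(y_true - y_pred) / ((np.abs(y_true) + np.abs(y_pred)) / 2))

# Hypothetical revenues in $M: a $5M miss on $1B vs. a $5M miss on $10M
y_true = np.array([1000.0, 10.0])
y_pred = np.array([ 995.0, 15.0])

print(mape(y_true, y_pred))    # averages 0.5% and 50% → ~25.25%
print(smape(y_true, y_pred))   # bounded alternative, ~20.25%
```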
🎓 What You Now Know
✓ Residual = actual − predicted — Every regression metric is a function of residuals.
✓ MAE = robust average error — Use by default. Resistant to outliers.
✓ MSE/RMSE = punish large errors — Use when big misses are costly. Sensitive to outliers.
✓ R² = variance explained vs. mean baseline — Adjusted R² for fair model comparison.
✓ MAPE = scale-independent percentage error — Great for cross-scale comparison, but breaks at zero.
No single metric tells the whole story. Report MAE + R² at minimum. Add RMSE if outlier sensitivity matters, MAPE if scale-independence matters. And always compare against the trivial baseline — if predicting the mean does 95% as well, your model isn’t adding much value. 📊
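One way to run that baseline check, sketched in NumPy with made-up numbers:

```python
import numpy as np

# Hypothetical actuals and model predictions
y_true = np.array([210.0, 340.0, 275.0, 400.0, 310.0])
y_pred = np.array([230.0, 320.0, 280.0, 380.0, 305.0])

baseline = np.full_like(y_true, np.mean(y_true))   # trivial model: always predict the mean

mae_model    = np.mean(np.abs(y_true - y_pred))
mae_baseline = np.mean(np.abs(y_true - baseline))

# If this ratio is close to 1, the model adds little over the trivial baseline
print(f"model MAE / baseline MAE = {mae_model / mae_baseline:.2f}")
```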
↗ Keep Learning
Linear Regression — The Foundation of Machine Learning
A scroll-driven visual deep dive into linear regression. From data points to loss functions to gradient descent — understand the building block behind all of ML.
Polynomial Regression — When Lines Aren't Enough
A scroll-driven visual deep dive into polynomial regression. See why straight lines fail, how curves capture nonlinear patterns, and when you're overfitting vs underfitting.
Bias-Variance Tradeoff — The Most Important Concept in ML
A scroll-driven visual deep dive into the bias-variance tradeoff. Learn why every model makes errors, how underfitting and overfitting emerge, and how to balance them.