Lesson 3 - Checking Linear Regression Fit

Checking Whether the Line Actually Fits

In the previous lessons you fit a linear regression model and learned how to read its coefficients. Fitting is the easy part: scikit-learn will happily draw a line through any cloud of points you hand it, whether or not a line is a sensible description of the data. The harder and more important question is the one this lesson answers: is the model any good?

Answering that question well is what separates a careful analyst from someone who just calls .fit() and trusts the output. You will learn to interrogate a model from three angles: the assumptions it quietly relies on, the size of its errors, and the fraction of variation it manages to explain.

By the end of this lesson, you will be able to:

  • Explain the assumptions linear regression makes about its errors
  • Compute residuals and read a residual plot to check those assumptions
  • Calculate and interpret mean squared error (MSE) and root mean squared error (RMSE)
  • Calculate and interpret the R-squared (coefficient of determination) score
  • Combine these diagnostics to decide whether linear regression fits your data

You should be comfortable with basic Python, pandas, and the scikit-learn fit/predict pattern from the earlier lessons. Let’s begin.


Why Errors Are the Thing to Inspect

A linear regression model assumes the outcome can be split into exactly two pieces: a part the predictors explain, and everything left over. Written out, the model for an observation looks like this:

Y=β0+β1X1+β2X2++βpXp+ϵ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon

Here β0 \beta_0 is the intercept, each βj \beta_j is a coefficient, and ϵ \epsilon is the error: the part of the outcome that the predictors genuinely cannot account for. The model is only trustworthy if that leftover error behaves the way the math assumes it does.

We never observe the true errors ϵ \epsilon directly. What we can observe, after fitting, are the residuals, our best estimate of those errors. The residual for observation i i is simply the actual value minus the predicted value:

ri=YiY^i r_i = Y_i - \hat{Y}_i

where Y^i \hat{Y}_i (read “y-hat”) is the model’s prediction for that row. If the residuals look the way well-behaved errors should look, we gain confidence that a straight-line model is appropriate. If they show a clear pattern, that is the data warning us that a line is the wrong shape.

Linear regression assumes two things about the errors, and therefore expects the same of the residuals:

  1. Zero mean. On average the errors cancel out: E[ϵ]=0 E[\epsilon] = 0 . The model is not systematically too high or too low.
  2. Constant variance. The spread of the errors stays the same no matter how large the prediction is: Var[ϵ]=σ2 \mathrm{Var}[\epsilon] = \sigma^2 . This property has a name worth knowing: homoskedasticity. Its opposite, where the spread grows or shrinks with the prediction, is heteroskedasticity.

The rest of this lesson is mostly about checking these two assumptions and then quantifying how well the model does once they hold.


Loading the Data

You will diagnose a model that predicts a car’s price from five physical measurements, using the real automobiles dataset (the classic UCI automobile imports data). It has 159 rows, 26 columns, and no missing values, which keeps the focus squarely on model checking.

import pandas as pd

df = pd.read_csv("automobiles.csv")  # download: https://datatweets.com/datasets/automobiles.csv

print("Shape:", df.shape)
print(df["price"].describe()[["mean", "min", "max"]].round(0))
# Output:
# Shape: (159, 26)
# mean    11446.0
# min      5118.0
# max     35056.0
# Name: price, dtype: float64

Prices range from about $5,118 to $35,056, with a mean near $11,446. Keep that mean in mind: it is the yardstick you will measure every error against later.

A Data Dictionary

You will use five numeric predictors. Here is what each one means, plus the target.

ColumnTypeMeaning
priceintTarget: retail price in US dollars
engine_sizeintEngine displacement in cc
horsepowerintEngine power output
curb_weightintWeight of the car with no occupants, in pounds
widthfloatBody width in inches
highway_mpgintFuel efficiency on the highway (miles per gallon)

These columns were chosen because each plausibly drives price: bigger, heavier, more powerful cars cost more, while thriftier cars (high highway_mpg) tend to be cheaper.

Preparing Features, Splitting, and Scaling

The recipe here matches the multiple-regression model from the previous lesson exactly, so the diagnostics you compute apply to a model you already understand. You select the five features, hold out 25 percent of the rows as a test set, and standardize the features so every coefficient lives on the same scale.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

features = ["engine_size", "horsepower", "curb_weight", "width", "highway_mpg"]
X = df[features]
y = df["price"]

# Hold out 25% for an honest test of fit
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Standardize: fit the scaler on TRAIN only, then apply to both
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LinearRegression()
model.fit(X_train_scaled, y_train)

for name, coef in zip(features, model.coef_):
    print(f"  {name:12s} coef = {coef:8.1f}")
print(f"  intercept = {model.intercept_:.1f}")
# Output:
#   engine_size  coef =   1808.4
#   horsepower   coef =    336.5
#   curb_weight  coef =   1935.4
#   width        coef =   1892.0
#   highway_mpg  coef =     82.6
#   intercept = 11442.5

The golden rule of scaling

Fit the scaler on the training data only, then apply the same transform to the test set. If you fit on the full dataset, information about the test rows leaks into training, and every diagnostic you compute afterward becomes too optimistic. The model must never see the test set before evaluation.

Because the features are standardized, the coefficients are directly comparable: curb_weight and engine_size move price the most, highway_mpg the least. The intercept of 11442.5 is close to the average price, which makes sense for standardized inputs. That is the model under inspection. Now you check whether it deserves your trust.


Computing Residuals

The first diagnostic needs predictions, and predictions are one method call away. You ask the fitted model for its predicted prices, then subtract them from the actual prices to get residuals.

predictions = model.predict(X_test_scaled)
residuals = y_test - predictions

print("First five residuals:")
print(residuals.head().round(0).values)
# Output:
# First five residuals:
# [ -642.  1188.  -311.  2954.  -1377.]

Each residual is a dollar amount: how far above (positive) or below (negative) the true price the model landed for that car. A residual of 2954 means the model under-predicted that car’s price by almost $3,000; a residual of -1377 means it over-predicted by about $1,377.

Checking the Zero-Mean Assumption

The first assumption says the residuals should average out to (approximately) zero, meaning the model is not biased high or low. Checking it is one line.

import numpy as np

residual_mean = np.mean(residuals)
print(f"Mean of residuals: {residual_mean:.2f}")
# Output:
# Mean of residuals: 95.42

Compared to a mean price of about $11,446, a residual mean of roughly $95 is tiny: less than one percent of the typical price. The model is essentially unbiased on the test set.

Why the training residuals always average to zero

For any linear regression fit with an intercept, the mean of the residuals on the training data is mathematically forced to be exactly zero. That is a side effect of how the intercept is chosen, so checking it on the training set is redundant. On the test set, as here, the mean is not guaranteed to be zero, which makes it a genuine check that the model generalizes without bias.


Reading the Residual Plot

Zero mean is necessary but not sufficient. A model can be unbiased on average while still being badly wrong in a structured way, for example over-predicting cheap cars and under-predicting expensive ones in equal measure. To catch that, you need to look at how the residuals are distributed across the range of predictions.

The tool for this is a residual plot: a scatter plot with the model’s predicted values on the x-axis and the residuals on the y-axis. You are looking for one specific thing: a shapeless, horizontal band of points centered on zero, with roughly the same vertical spread everywhere.

import matplotlib.pyplot as plt

plt.scatter(predictions, residuals)
plt.axhline(0, color="black", linewidth=1)
plt.xlabel("Predicted price ($)")
plt.ylabel("Residual ($)")
plt.title("Residuals vs predicted price")
plt.show()

The figure below shows the result for our model.

Residual plot of residuals versus predicted price
A good model leaves residuals scattered randomly around zero.

The points scatter fairly evenly above and below the zero line across most of the price range, with no obvious funnel or curve. That is what a healthy residual plot looks like, and it tells you a straight-line model is a reasonable choice here.

What Violations Look Like

The reason this check matters is that an unhealthy residual plot has a recognizable shape. Train your eye on these three patterns:

GOOD (homoskedastic)      FUNNEL (heteroskedastic)     CURVE (wrong shape)
                                                        
 r |  . . . . . . .        r |        . . :             r |  .          .
   |. . . . . . . .          |     . . : :                |   .       .
 0 |-.-.-.-.-.-.-.-.        0 |--.-:-:-:-:-:--          0 |-----.-.-.-----
   | . . . . . . .           |  . : : :                  |  .         .
   |  . . . . . .            |    . . :                  | .           .
   +---------------          +---------------            +---------------
        Y-hat                     Y-hat                       Y-hat
  • Good. An even horizontal band. The spread is constant and there is no trend. The assumptions hold.
  • Funnel. The spread grows (or shrinks) as predictions get larger. This is heteroskedasticity: the constant-variance assumption is violated, and the model is less reliable for some price ranges than others.
  • Curve. The residuals bend, often arcing below zero in the middle and above at the ends. This signals that the true relationship is non-linear, so a straight line is the wrong shape. You might fix it by adding a squared term or transforming a feature.

This is a qualitative check. You are not computing a number; you are confirming that you do not see the two bad patterns. Our automobiles residual plot passes.

Use predicted values, not a single feature

With one predictor you could plot residuals against that predictor. With several predictors, plot residuals against the model’s predicted values instead. The prediction summarizes all the features at once, so a single residual-versus-prediction plot checks the whole model rather than one variable at a time.


Measuring Error Size: MSE and RMSE

The residual plot tells you whether the shape of the errors is acceptable. It does not tell you how big they are. For that you need a single summary number.

Recall that fitting chose coefficients to minimize the sum of squared errors. We can turn that same quantity into an average, the mean squared error (MSE):

MSE=1ni=1n(YiY^i)2 \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2

MSE is the average of the squared residuals. Squaring keeps every term positive and punishes large misses harder than small ones. The trouble is its units: because we squared dollars, MSE is in dollars squared, which is impossible to interpret directly. The fix is to take the square root, giving the root mean squared error (RMSE):

RMSE=1ni=1n(YiY^i)2 \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}

RMSE is back in dollars, so you can read it as “the typical size of a prediction error.” scikit-learn computes both for you. Always evaluate on the test set, because a model is optimized to minimize error on the training set and will flatter itself there.

from sklearn.metrics import mean_squared_error, mean_absolute_error

test_mse = mean_squared_error(y_test, predictions)
test_rmse = np.sqrt(test_mse)
test_mae = mean_absolute_error(y_test, predictions)

print(f"Test MSE:  {test_mse:,.0f}")
print(f"Test RMSE: ${test_rmse:,.0f}")
print(f"Test MAE:  ${test_mae:,.0f}")
# Output:
# Test MSE:  5,417,000
# Test RMSE: $2,327
# Test MAE:  $1,863

An RMSE of about $2,327 means the model’s price predictions are typically off by a little over two thousand dollars. Against an average price of $11,446, that is roughly a 20 percent typical error: not perfect, but a usable estimate.

RMSE vs MAE

You also printed the mean absolute error (MAE), the average of the absolute residuals:

MAE=1ni=1nYiY^i \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| Y_i - \hat{Y}_i \right|

MAE here is about $1,863, smaller than RMSE. That gap is informative. Because RMSE squares the residuals before averaging, it weights a few large misses heavily, so RMSE is always greater than or equal to MAE. When RMSE sits well above MAE, a handful of outliers are inflating the error; when they are close, the errors are fairly uniform in size. Our $2,327 versus $1,863 says there are a few cars the model misses badly, but no single catastrophe dominates.

Which error metric should you report?

Report RMSE when large errors are especially costly and you want them penalized; report MAE when you want a robust, outlier-resistant sense of the typical miss. They answer slightly different questions, and showing both, as we did, gives a fuller picture than either alone.


Measuring Explained Variance: R-Squared

RMSE tells you the size of the errors in dollars, but it does not tell you whether that size is good relative to how much prices vary in the first place. An RMSE of $2,327 would be excellent if prices ranged over $200,000 and mediocre if they barely moved. The coefficient of determination, written R2 R^2 , captures exactly this idea: the proportion of the variation in the outcome that the model explains.

The reasoning starts from a neat fact. The total variance in the outcome splits into the variance the model captures plus the variance left in the errors:

Var(Y)=Var(Y^)+Var(ϵ) \mathrm{Var}(Y) = \mathrm{Var}(\hat{Y}) + \mathrm{Var}(\epsilon)

From there, R2 R^2 is defined as one minus the share of variance that remains unexplained:

R2=1Var(ϵ)Var(Y)=1i(YiY^i)2i(YiYˉ)2 R^2 = 1 - \frac{\mathrm{Var}(\epsilon)}{\mathrm{Var}(Y)} = 1 - \frac{\sum_i (Y_i - \hat{Y}_i)^2}{\sum_i (Y_i - \bar{Y})^2}

Read it this way:

  • R2=1 R^2 = 1 : the model explains everything; residuals are all zero.
  • R2=0 R^2 = 0 : the model explains nothing; it does no better than always predicting the mean Yˉ \bar{Y} .
  • R2<0 R^2 < 0 : the model is worse than predicting the mean (yes, this can happen on a test set).

scikit-learn provides r2_score, and the fitted model’s .score() method returns the same value.

from sklearn.metrics import r2_score

test_r2 = r2_score(y_test, predictions)
print(f"Test R^2: {test_r2:.3f}")
print(f"Same via .score(): {model.score(X_test_scaled, y_test):.3f}")
# Output:
# Test R^2: 0.793
# Same via .score(): 0.793

An R2 R^2 of 0.793 means the five features together explain about 79 percent of the variation in car prices on data the model has never seen. The remaining 21 percent is driven by things not in our five columns: brand prestige, options, condition, and plain market noise. For five physical measurements, explaining four-fifths of price is a genuinely strong fit.

A high R-squared is not the whole story

R2 R^2 measures explained variance, not correctness of the model’s shape. A curved relationship can still post a respectable R2 R^2 while the residual plot screams that a line is wrong. Always pair R2 R^2 with the residual plot: the number tells you how much variation is captured, the plot tells you whether you captured it the right way.

Seeing the Fit Directly

One last view ties the diagnostics together. If you plot each car’s predicted price against its actual price, perfect predictions would land exactly on the diagonal line Y^=Y \hat{Y} = Y . The tighter the points hug that diagonal, the higher the R2 R^2 .

Scatter plot of predicted price versus actual price with the ideal diagonal line
Predicted versus actual price; points near the diagonal mean accurate predictions.

The points cluster close to the diagonal across the whole price range, which is the visual signature of an R2 R^2 near 0.8. A couple of expensive cars sit a bit off the line, the same outliers that pushed RMSE above MAE earlier. Every diagnostic in this lesson is telling a consistent story.

For completeness, here are the standardized coefficients of the very model you just validated, so you can connect “is it a good fit?” back to “what is it actually using?”

Bar chart of standardized regression coefficients for the five features
Standardized coefficients: curb weight and engine size move price the most.

Putting the Diagnostics Together

Here is the full evaluation workflow condensed into one runnable block. This is the template you will reuse to check any regression model.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# 1. Load and select features
df = pd.read_csv("automobiles.csv")  # download: https://datatweets.com/datasets/automobiles.csv
features = ["engine_size", "horsepower", "curb_weight", "width", "highway_mpg"]
X, y = df[features], df["price"]

# 2. Split, then scale (fit on train only)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 3. Fit
model = LinearRegression().fit(X_train, y_train)

# 4. Diagnose on the TEST set
preds = model.predict(X_test)
residuals = y_test - preds

print(f"Residual mean: ${np.mean(residuals):,.0f}")   # zero-mean check
print(f"RMSE:          ${np.sqrt(mean_squared_error(y_test, preds)):,.0f}")
print(f"MAE:           ${mean_absolute_error(y_test, preds):,.0f}")
print(f"R^2:           {r2_score(y_test, preds):.3f}")
# Output:
# Residual mean: $95
# RMSE:          $2,327
# MAE:           $1,863
# R^2:           0.793

In a dozen lines you loaded real data, fit a model, and ran every check that matters: the residual plot for shape, RMSE and MAE for error size, and R2 R^2 for explained variance. A model that passes all four is one you can defend.


Practice Exercises

Try these before checking the hints.

Exercise 1: Plot Residuals for a Weaker Model

Fit a linear regression using only highway_mpg as the predictor of price (still standardized, same split), then build a residual plot. Compare its shape to the five-feature model’s plot. Does it look as clean?

import pandas as pd
df = pd.read_csv("automobiles.csv")

# Your code here

Hint

Set X = df[["highway_mpg"]], run the same split, scale, and fit. Compute residuals = y_test - model.predict(X_test_scaled) and call plt.scatter(model.predict(X_test_scaled), residuals). With a single weak predictor, expect a wider, less even band of residuals: the visual sign of a poorer fit.

Exercise 2: Compute RMSE and R-squared Yourself

Without using mean_squared_error or r2_score, compute the test RMSE and R2 R^2 for the five-feature model using only NumPy, and confirm they match $2,327 and 0.793.

# Reuse predictions and y_test from the lesson
# Your code here

Hint

RMSE is np.sqrt(np.mean((y_test - predictions) ** 2)). For R2 R^2 , compute the residual sum of squares ss_res = np.sum((y_test - predictions) ** 2) and the total sum of squares ss_tot = np.sum((y_test - y_test.mean()) ** 2), then r2 = 1 - ss_res / ss_tot.

Exercise 3: Compare Train and Test R-squared

Compute R2 R^2 on both the training set and the test set for the five-feature model. Is the training score higher? Explain in one sentence why that is expected.

# Your code here

Hint

Call model.score(X_train_scaled, y_train) and model.score(X_test_scaled, y_test). The training R2 R^2 is usually a little higher because the model was optimized on exactly that data, which is why you must always report the test score as your honest measure of fit.


Summary

You learned how to judge a linear regression model rather than just fit one. Let’s review.

Key Concepts

Residuals and Assumptions

  • A residual is the actual value minus the predicted value, ri=YiY^i r_i = Y_i - \hat{Y}_i , the model’s stand-in for the unobservable error
  • Linear regression assumes the errors have zero mean (no systematic bias) and constant variance (homoskedasticity)
  • A residual plot of residuals versus predictions should show a shapeless horizontal band centered on zero; a funnel signals heteroskedasticity and a curve signals a non-linear relationship

Error Metrics

  • MSE averages the squared residuals; RMSE takes its square root to return to the original units (dollars here)
  • MAE averages the absolute residuals and is more robust to outliers; RMSE above MAE means a few large misses
  • Our model: RMSE about $2,327 and MAE about $1,863 against a mean price of $11,446

R-Squared

  • R2 R^2 is the proportion of outcome variance the model explains, from 1 (perfect) down through 0 (no better than the mean) and even negative on a test set
  • Our model reaches R2=0.793 R^2 = 0.793 , explaining about 79 percent of price variation on unseen data
  • A high R2 R^2 does not prove the model’s shape is right; always read it alongside the residual plot

Discipline

  • Compute every diagnostic on the test set, and fit the scaler on the training set only, to keep the evaluation honest

Why This Matters

Every regression project ends with the same question from a stakeholder: “Can we trust this model?” The diagnostics in this lesson are how you answer responsibly. The residual plot tells you whether a line is even the right shape, RMSE and MAE tell you how large the mistakes are in real-world units, and R2 R^2 tells you how much of the story the model captures. Together they let you move past “the code ran” to “the model is sound,” which is the judgment professional data work is built on.


Next Steps

You can now diagnose a fitted model from every important angle. In the next lesson you will put a model to work end to end, training on real data and generating predictions on a held-out test set.

Continue to Lesson 4 - Applying Linear Regression Models

Train on real data and make predictions on a held-out test set.

Back to Module Overview

Return to the Regression module overview.


Keep Sharpening Your Judgment

Fitting a model is a single line of code; deciding whether to believe it is a skill. The habit you built here, of pairing a visual check with quantitative metrics and always measuring on unseen data, transfers to every model you will ever train, linear or not. Make these four checks reflexive, and you will catch broken models long before they reach a decision-maker’s desk.