Lesson 4 - Applying Linear Regression Models

From Theory to a Working Model

In the previous lessons you learned what a regression line is, how it is fit, and how to read its coefficients. This lesson is where it all comes together. You will take a real dataset, build a model that predicts a car’s price from its physical characteristics, and then ask the only question that really matters: how well does this model do on data it has never seen?

That last part is the heart of applied machine learning. A model that fits its training data perfectly but stumbles on new data is worthless. So you will learn to split your data honestly, scale your features the right way, and measure error on a held-out test set.

By the end of this lesson, you will be able to:

  • Load and explore a real regression dataset and pick useful numeric features
  • Fit a simple linear regression and read its intercept and slope
  • Build a multiple linear regression with several predictors
  • Scale features with StandardScaler so coefficients become comparable
  • Evaluate a model honestly on a test set using R-squared, RMSE, and MAE
  • Explain the difference between training error and test error, and why the gap matters

You should be comfortable with basic Python, pandas, and the idea of a line of best fit. Let’s begin.


The Problem: Predicting Car Prices

Imagine you work for a company that buys and resells used cars. When a new vehicle arrives, you need a fast, defensible estimate of what it is worth. You have a catalog of cars with their measured specifications: engine size, horsepower, weight, dimensions, fuel economy, and more. You also know the price each one sold for.

This is a classic regression problem. The target you want to predict, price, is a continuous number (dollars), not a category. Your job is to learn the relationship between a car’s specifications and its price, then apply that relationship to price new arrivals.

You will use the automobiles dataset, a well-known collection of car specifications. It has 159 rows and 26 columns, with no missing values, which lets you focus on modeling rather than cleaning.

You can download it and load it with pandas.

import pandas as pd

df = pd.read_csv("automobiles.csv")  # download: https://datatweets.com/datasets/automobiles.csv

print("Shape:", df.shape)
# Output: Shape: (159, 26)

Each row is one car. The target column price is an integer dollar amount, and the rest of the columns describe the vehicle.

A Data Dictionary

You will not use all 26 columns. Here are the ones that matter for this lesson, grouped by type.

Column | Type | Meaning
price | int | Target: sale price in US dollars
engine_size | int | Engine displacement in cubic centimeters
horsepower | int | Engine power output
curb_weight | int | Weight of the car in pounds, without passengers
width | float | Width of the body in inches
length | float | Length of the body in inches
wheel_base | float | Distance between front and rear axles
city_mpg, highway_mpg | int | Fuel economy, miles per gallon
bore, stroke | float | Cylinder dimensions
make, body_style, drive_wheels | category | Manufacturer, shape, drivetrain
fuel_type, aspiration | category | Gas vs diesel, standard vs turbo

Take a quick look at the target before modeling. Knowing its range and center keeps your later error numbers in perspective.

print(df["price"].describe()[["mean", "min", "max"]].round(0))
# Output:
# mean    11446.0
# min      5118.0
# max     35056.0
# Name: price, dtype: float64

Prices run from about $5,118 to $35,056, with a mean near $11,446. When you later see a typical prediction error of around $2,300, you can judge it against this spread: roughly 20 percent of the average price, which is a reasonable starting point for a simple linear model.


Starting Simple: One Predictor

The best way to understand a model is to start with the smallest version of it. So before juggling many features, fit a regression with a single predictor: engine_size. It is a natural first guess, since bigger engines tend to appear in more expensive cars.

A simple linear regression models the target as a straight line:

\[ y = \beta_0 + \beta_1 x \]

Here \(y\) is price, \(x\) is engine_size, \(\beta_0\) is the intercept (the predicted price when engine size is zero), and \(\beta_1\) is the slope (how many dollars the price changes for each extra cubic centimeter of engine).

from sklearn.linear_model import LinearRegression

X_simple = df[["engine_size"]]   # a table with one column
y = df["price"]

simple_model = LinearRegression()
simple_model.fit(X_simple, y)

print("intercept:", round(simple_model.intercept_, 1))
print("slope    :", round(simple_model.coef_[0], 2))
# Output:
# intercept: -7914.1
# slope    : 162.38

This gives you the fitted line:

\[ \text{price} = -7914.1 + 162.38 \times \text{engine\_size} \]

The slope says each additional cubic centimeter of engine displacement is associated with about $162 more in price. The intercept is negative, which is not physically meaningful on its own (a car cannot have a negative price or a zero-size engine), but it is the value that makes the line fit best across the real range of engine sizes. Intercepts are often just mathematical anchors rather than literal predictions.
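To make the fitted line concrete, here is a minimal sketch that plugs a few engine sizes into it. The intercept and slope are copied from the output above; the helper name predict_price and the three engine sizes are assumptions made up for illustration.

```python
# Coefficients copied from the fitted simple model in this lesson.
INTERCEPT = -7914.1   # dollars
SLOPE = 162.38        # dollars per unit of engine_size


def predict_price(engine_size: float) -> float:
    """Price predicted by the fitted line for one engine size."""
    return INTERCEPT + SLOPE * engine_size


# Hypothetical engine sizes, chosen only for illustration.
for size in (100, 130, 200):
    print(f"engine_size={size:>3} -> predicted price ${predict_price(size):,.0f}")

# Engine size at which the line crosses zero dollars. Below this value the
# line predicts negative prices, so it should not be trusted there.
break_even = -INTERCEPT / SLOPE
print(f"line crosses $0 at engine_size = {break_even:.1f}")
```

The break-even point falls below the engine sizes you would expect in a real catalog, which is why the negative intercept causes no trouble in practice.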

How good is this one-feature model? A standard summary is the coefficient of determination, or \(R^2\), the fraction of the variation in price that the model explains:

\[ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \]

where \(y_i\) is the actual price, \(\hat{y}_i\) is the predicted price, and \(\bar{y}\) is the mean price. An \(R^2\) of 1 is perfect; an \(R^2\) of 0 means the model does no better than always guessing the average.
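The formula is short enough to compute by hand. The sketch below does so in plain Python on four made-up prices (not rows from the dataset) and confirms that always guessing the mean scores exactly 0. The function name r_squared is an assumption for this example.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - (sum of squared residuals) / (total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    return 1 - ss_res / ss_tot


# Hypothetical prices, not taken from the dataset.
actual = [8000, 10000, 12000, 18000]
good_guess = [8500, 9500, 12500, 17500]                  # each off by $500
mean_guess = [sum(actual) / len(actual)] * len(actual)   # always the average

print(round(r_squared(actual, good_guess), 3))   # close to 1
print(r_squared(actual, mean_guess))             # 0.0
```

A model that does worse than the mean would push the ratio above 1 and make \(R^2\) negative, which can happen on a test set.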

print("R^2:", round(simple_model.score(X_simple, y), 3))
# Output: R^2: 0.708

An \(R^2\) of 0.708 means engine size alone explains about 71 percent of the variation in price. That is a lot of signal from a single column. The scatter plot below shows the data and the fitted line: the upward trend is clear, but there is real spread around it, which is the 29 percent the model does not capture.

Figure: Scatter of price versus engine size with the fitted regression line. Price rises with engine size, and the fitted line captures about 71 percent of the variation.

The remaining spread is your motivation to add more features. A car’s price depends on far more than its engine.


Adding More Predictors

A single feature leaves money on the table. Multiple linear regression lets the model use several predictors at once, each with its own coefficient:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k \]

You will use five numeric features that each describe a different aspect of a car: how big the engine is, how powerful it is, how heavy it is, how wide it is, and how efficient it is.

features = ["engine_size", "horsepower", "curb_weight", "width", "highway_mpg"]

X = df[features]
y = df["price"]

print("Feature matrix shape:", X.shape)
# Output: Feature matrix shape: (159, 5)

Splitting Into Train and Test

Here is the question at the center of machine learning: how do you know the model learned something real, instead of just memorizing these 159 cars? The answer is to hold some data back. You train on one portion and test on a separate portion the model never sees during fitting. If it predicts well on the unseen test cars, you have real evidence that it will hold up on future cars.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,      # hold out 25% for testing
    random_state=42,     # make the split reproducible
)

print("Training cars:", X_train.shape[0])
print("Test cars:    ", X_test.shape[0])
# Output:
# Training cars: 119
# Test cars:     40

The random_state=42 fixes the randomness so you get the same split every run, which makes your results reproducible. Any fixed number works; the point is consistency.
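If you are curious where the 119 and 40 come from, the sketch below mimics the idea in plain Python: shuffle the row indices with a fixed seed, then cut off the test share. The helper split_indices is hypothetical and is not scikit-learn's implementation, so the rows it picks will differ from train_test_split; only the sizes and the reproducibility carry over.

```python
import math
import random


def split_indices(n_rows, test_size=0.25, seed=42):
    """Shuffle row indices with a fixed seed, then cut off the test share."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # same seed -> same order
    n_test = math.ceil(n_rows * test_size)    # 159 * 0.25 = 39.75, rounded up
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(159)
print(len(train_idx), len(test_idx))  # 119 40

# Running it again with the same seed reproduces the split exactly.
print(split_indices(159) == (train_idx, test_idx))  # True
```

Rounding the test share up is what turns 39.75 into 40 test cars and leaves 119 for training.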

Scaling the Features

Look at the feature ranges. Engine size is in the hundreds, horsepower in the tens to low hundreds, curb weight in the thousands, width around 65 inches, and highway mpg in the tens. These wildly different scales make the raw coefficients hard to compare: a coefficient of 50 on curb weight and 50 on width would mean very different things in real terms.

The fix is standardization with StandardScaler, which rescales each feature to have a mean of 0 and a standard deviation of 1. The transform applied to each value \(x\) is:

\[ z = \frac{x - \mu}{\sigma} \]

where \(\mu\) is the feature’s mean and \(\sigma\) is its standard deviation. After scaling, every feature is on the same footing, so a coefficient directly tells you how many dollars the price moves for a one-standard-deviation change in that feature, with the other features held fixed.
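Here is the same transform worked by hand on five made-up curb weights, as a minimal sketch. It uses the population standard deviation (dividing by n), which is also what StandardScaler uses; the values themselves are assumptions for illustration.

```python
import statistics

# Hypothetical curb weights in pounds, for illustration only.
curb_weight = [1900.0, 2300.0, 2500.0, 2900.0, 3400.0]

mu = statistics.fmean(curb_weight)       # feature mean
sigma = statistics.pstdev(curb_weight)   # population std (divide by n)

z = [(x - mu) / sigma for x in curb_weight]

print("mean:", mu, " std:", round(sigma, 1))
print("z-scores:", [round(v, 2) for v in z])
print("mean of z:", round(statistics.fmean(z), 6))
print("std of z: ", round(statistics.pstdev(z), 6))
```

A z-score of 1 now means "one standard deviation heavier than the average car", regardless of whether the original unit was pounds, inches, or miles per gallon.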

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on TRAIN only
X_test_scaled = scaler.transform(X_test)        # apply the SAME transform to test

The golden rule of scaling

Fit the scaler on the training data only, then apply it to both sets. If you fit on the full dataset, information about the test set leaks into training, and your evaluation becomes too optimistic. The same discipline applies to the model: it must never see the test set during fitting.
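A tiny example shows what leakage looks like in numbers. The sketch below uses made-up horsepower values with one unusually powerful test car, and compares scaling statistics learned from the training cars alone with statistics learned from everything. The helpers fit_scaler and transform are hypothetical stand-ins for the scaler's fit and transform steps.

```python
import statistics

# Hypothetical horsepower values; the single test car is unusually powerful.
train = [70.0, 90.0, 110.0, 130.0]
test = [200.0]


def fit_scaler(values):
    """Learn the mean and population std from the given values."""
    return statistics.fmean(values), statistics.pstdev(values)


def transform(values, mu, sigma):
    """Apply already-learned statistics to any set of values."""
    return [(v - mu) / sigma for v in values]


mu_train, sigma_train = fit_scaler(train)          # correct: train only
mu_leaky, sigma_leaky = fit_scaler(train + test)   # leak: test car included

z_honest = transform(test, mu_train, sigma_train)[0]
z_leaky = transform(test, mu_leaky, sigma_leaky)[0]

print(f"train-only stats: mean={mu_train:.0f}, test car z={z_honest:.2f}")
print(f"leaky stats:      mean={mu_leaky:.0f}, test car z={z_leaky:.2f}")
```

With the leaky statistics the test car has already shifted the mean and widened the spread, so it looks far less unusual than it really is from the training data's point of view. If you want the rule enforced automatically, scikit-learn's Pipeline can bundle the scaler and the model so that both are fit on the training data in a single call.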

Fitting the Multiple Regression

With the data split and scaled, fitting is a single call.

model = LinearRegression()
model.fit(X_train_scaled, y_train)

print("intercept:", round(model.intercept_, 1))
for name, coef in zip(features, model.coef_):
    print(f"  {name:12s} coef = {coef:8.1f}")
# Output:
# intercept: 11442.5
#   engine_size  coef =   1808.4
#   horsepower   coef =    336.5
#   curb_weight  coef =   1935.4
#   width        coef =   1892.0
#   highway_mpg  coef =     82.6

Because the features are standardized, these coefficients are directly comparable. The intercept, $11,442, is the predicted price for a perfectly average car (one at the training mean of every feature). With centered features, that prediction is exactly the mean price of the training cars, which is why it sits so close to the overall mean of $11,446 you saw earlier.
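You can check the link between the intercept and the mean price on a toy example. The sketch below fits a one-feature least-squares line by hand on made-up data after standardizing the feature; the numbers are assumptions, and the same reasoning carries over to several standardized features.

```python
import statistics

# Hypothetical training data: engine sizes and prices.
engine_size = [90.0, 110.0, 120.0, 150.0, 180.0]
price = [7000.0, 9500.0, 10000.0, 15500.0, 18000.0]

# Standardize the feature with statistics learned from these same rows.
mu = statistics.fmean(engine_size)
sigma = statistics.pstdev(engine_size)
z = [(x - mu) / sigma for x in engine_size]

# Least-squares slope and intercept for one predictor.
z_bar = statistics.fmean(z)
y_bar = statistics.fmean(price)
slope = (sum((zi - z_bar) * (yi - y_bar) for zi, yi in zip(z, price))
         / sum((zi - z_bar) ** 2 for zi in z))
intercept = y_bar - slope * z_bar   # z_bar is 0 after standardizing

print("mean price:", y_bar)
print("intercept: ", round(intercept, 6))
```

Because the standardized feature averages to zero, the intercept has nothing to correct for and lands on the mean price of the rows used for fitting.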

The three biggest drivers are curb weight ($1,935 per standard deviation), width ($1,892), and engine size ($1,808). In other words, heavier, wider, larger-engined cars command higher prices. Horsepower adds a smaller $336, and highway mpg barely moves the needle at $83 once the other features are accounted for. The bar chart below makes the ranking obvious.

Figure: Bar chart of standardized regression coefficients for the five features. Curb weight, width, and engine size dominate the price.

Why scaling makes coefficients readable

Without scaling, you cannot compare a coefficient on curb weight (measured in pounds) to one on width (measured in inches), because a “one-unit” change means something completely different for each. Standardizing puts every feature in the same units, namely standard deviations, so the size of a coefficient becomes a fair basis for comparing features. Treat it as a rough ranking rather than a final verdict: correlated predictors such as curb weight and width share credit between them.
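To see how the standardized coefficients turn into a prediction, here is a small sketch that rebuilds the fitted equation from the printed output. The intercept and coefficients are copied from the lesson; the predict helper and the example z-scores are assumptions for illustration.

```python
# Intercept and coefficients copied from the fitted model's output above.
INTERCEPT = 11442.5
COEFS = {
    "engine_size": 1808.4,
    "horsepower": 336.5,
    "curb_weight": 1935.4,
    "width": 1892.0,
    "highway_mpg": 82.6,
}


def predict(z_scores):
    """Predict price from standardized features; a missing feature means 0 (average)."""
    return INTERCEPT + sum(COEFS[name] * z_scores.get(name, 0.0) for name in COEFS)


average_car = predict({})                    # every feature at its mean
heavier_car = predict({"curb_weight": 1.0})  # one std heavier, rest average

print(f"average car: ${average_car:,.0f}")
print(f"heavier car: ${heavier_car:,.0f}")

# Rank features by the absolute size of their coefficient.
ranking = sorted(COEFS, key=lambda name: abs(COEFS[name]), reverse=True)
print(ranking)
```

Moving one feature by one standard deviation shifts the prediction by exactly that feature's coefficient, which is what makes the ranking readable.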


Judging the Model Honestly

You now have a fitted model, but you have not yet answered the real question: how well does it predict cars it has never seen? This is where the test set earns its keep.

Training Error Is Too Optimistic

The model’s parameters were chosen specifically to fit the training data, so its error on that same data is biased downward. It always looks better on what it trained on than on fresh data. That is why you never judge a model by its training error alone. You judge it by its test error, computed on the held-out set.

You hope the test error is close to the training error. If it is, the model generalizes. If the test error is far worse, the model has overfit, meaning it memorized quirks of the training cars instead of learning the real pattern.
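An extreme case makes the gap easy to see. The sketch below is not the lesson's linear model: it is a deliberately naive "memorizer" that answers with the price of the closest training car. All cars and prices are made up for illustration.

```python
import math

# Hypothetical (engine_size, price) pairs.
train = [(100, 9000), (120, 12500), (140, 13000), (160, 17500)]
test = [(105, 9800), (155, 15500)]


def memorizer(engine_size):
    """Return the price of the closest training car (pure memorization)."""
    nearest = min(train, key=lambda car: abs(car[0] - engine_size))
    return nearest[1]


def rmse(cars):
    """Root mean squared error of the memorizer on a list of cars."""
    squared = [(price - memorizer(size)) ** 2 for size, price in cars]
    return math.sqrt(sum(squared) / len(squared))


train_rmse = rmse(train)
test_rmse = rmse(test)

print(f"train RMSE: ${train_rmse:,.0f}")
print(f"test RMSE:  ${test_rmse:,.0f}")
```

On its own training cars the memorizer is flawless, yet it misses the unseen cars by well over a thousand dollars. Judged by training error alone it would look like the perfect model, which is exactly the trap the test set protects you from.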

Three Error Metrics

For regression, three metrics are standard. The mean squared error (MSE) averages the squared gaps between predictions and reality:

\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

Squaring punishes big misses heavily, but it also leaves the result in squared dollars, which is hard to interpret. Taking the square root gives the root mean squared error (RMSE), which is back in plain dollars:

\[ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \]

The mean absolute error (MAE) averages the absolute gaps instead, so it is less sensitive to a few large misses:

\[ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \]

Compute all three, plus \(R^2\), on the test set.

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

test_pred = model.predict(X_test_scaled)

test_r2 = r2_score(y_test, test_pred)
test_rmse = np.sqrt(mean_squared_error(y_test, test_pred))
test_mae = mean_absolute_error(y_test, test_pred)

print(f"TEST  R^2  = {test_r2:.3f}")
print(f"TEST  RMSE = ${test_rmse:,.0f}")
print(f"TEST  MAE  = ${test_mae:,.0f}")
# Output:
# TEST  R^2  = 0.793
# TEST  RMSE = $2,327
# TEST  MAE  = $1,863

These numbers tell a clear story. On cars the model never saw, it explains about 79 percent of the variation in price, a real jump over the 71 percent from engine size alone. The typical error is about $2,327 (RMSE) or $1,863 (MAE). Against an average price near $11,446, an error around $2,000 is respectable for a handful of physical measurements.

The RMSE is larger than the MAE, which is expected: RMSE squares the errors, so it gets pulled up by the occasional large miss. When the two diverge a lot, a few outliers are doing the damage.
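You can watch a single large miss pull the two metrics apart with a few made-up numbers. This sketch computes RMSE and MAE by hand in plain Python; the prices and the helper names are assumptions for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error, in the same units as the target."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)


# Hypothetical prices and two sets of predictions.
actual = [10000, 12000, 15000, 20000]
one_big_miss = [11000, 11500, 14000, 24000]   # last car is off by $4,000
no_big_miss = [11000, 11500, 14000, 21000]    # last car is off by $1,000

for name, predicted in [("one big miss", one_big_miss), ("no big miss", no_big_miss)]:
    print(f"{name}: RMSE=${rmse(actual, predicted):,.0f}  MAE=${mae(actual, predicted):,.0f}")
# one big miss: RMSE=$2,136  MAE=$1,625
# no big miss: RMSE=$901  MAE=$875
```

With the $4,000 miss the RMSE sits about $500 above the MAE; once that miss shrinks, the two nearly agree.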

Looking at the Predictions

Numbers are convincing, but a picture is better. The scatter below plots each test car’s predicted price against its actual price. A perfect model would put every point on the diagonal line. The points here cluster tightly around it, which is exactly what an \(R^2\) of 0.79 looks like.

Figure: Predicted vs actual price scatter on the test set. Predictions track actual prices closely (R² = 0.79).

It is just as important to inspect the residuals, the leftover errors (actual minus predicted). A healthy model has residuals that scatter randomly around zero with no pattern. If you saw a curve or a fan shape, it would signal that a straight line is the wrong shape for the data.

Figure: Residuals plotted against predicted price on the test set. Residuals scatter around zero without a strong pattern, a good sign for a linear model.

The residuals form a rough, even band around zero, which means the linear model is capturing the main structure of the data. There is some spread, as you would expect with only five features, but no glaring shape that screams “use a different model.”
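For contrast, here is what an unhealthy residual pattern looks like. The sketch fits a straight line by hand to deliberately curved toy data (y equals x squared), which is an assumption chosen to force the problem, and prints the residuals.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one predictor."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]   # deliberately curved: 0, 1, 4, 9, 16

slope, intercept = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]   # actual minus predicted

print(residuals)  # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The residuals are positive at both ends and negative in the middle. That U shape is the signature of a curve being forced onto a straight line, and it is the kind of pattern the car-price residuals do not show.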


A Glimpse Under the Hood: How Fitting Works

So far you have called .fit() and trusted scikit-learn to find the best line. But what is it actually doing? It is searching for the coefficients that make the MSE as small as possible. One of the most important ways to perform that search is gradient descent, and it is worth seeing in miniature before the next lesson covers it in full.

The idea: start with a guess for the coefficient, measure how the loss changes as you nudge it, and step in the direction that lowers the loss. Repeat until the loss stops falling. To keep the picture simple, standardize both engine_size and price and fit a single weight with one bias term.

import numpy as np

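# Standardize a single predictor and the target.
# ddof=0 uses the population standard deviation (as StandardScaler does),
# so each standardized column has variance exactly 1.
x = (df["engine_size"] - df["engine_size"].mean()) / df["engine_size"].std(ddof=0)
t = (df["price"] - df["price"].mean()) / df["price"].std(ddof=0)
x, t = x.values, t.values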
n = len(x)

w, b = 0.0, 0.0          # start at zero
lr = 0.1                 # learning rate (step size)

for i in range(60):
    pred = w * x + b
    error = pred - t
    # gradients of the MSE with respect to w and b
    grad_w = (2 / n) * np.dot(error, x)
    grad_b = (2 / n) * error.sum()
    w -= lr * grad_w
    b -= lr * grad_b

mse = ((w * x + b - t) ** 2).mean()
print(f"final w = {w:.3f}  b = {b:.3f}  MSE = {mse:.4f}")
# Output:
# final w = 0.841  b = 0.000  MSE = 0.2919

The update rule at the core of this loop is, for each parameter \(\theta\):

\[ \theta \leftarrow \theta - \eta \, \frac{\partial \, \text{MSE}}{\partial \theta} \]

where \(\eta\) is the learning rate. After 60 iterations the weight settles at about 0.841 and the bias at essentially 0. On standardized data, that final weight equals the correlation between engine size and price, and the loss flattens out near 0.292. The curve below shows the MSE dropping fast at first, then leveling off as the model converges.

Figure: Mean squared error decreasing over gradient descent iterations. The loss falls quickly and then flattens as gradient descent converges.
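The claim that the weight converges to the correlation can be checked without the dataset. The sketch below runs the same kind of loop in plain Python on five made-up cars and compares the result with the correlation computed directly. The data, the standardize helper, and the choice of 200 iterations are assumptions for illustration.

```python
import statistics

# Hypothetical engine sizes and prices.
engine_size = [90.0, 110.0, 120.0, 150.0, 180.0]
price = [7000.0, 9500.0, 10000.0, 15500.0, 18000.0]


def standardize(values):
    """Rescale to mean 0 and population standard deviation 1."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


x = standardize(engine_size)
t = standardize(price)
n = len(x)

# For standardized columns, the correlation is the mean of the products.
r = sum(xi * ti for xi, ti in zip(x, t)) / n

w, b, lr = 0.0, 0.0, 0.1
for _ in range(200):
    errors = [w * xi + b - ti for xi, ti in zip(x, t)]
    grad_w = (2 / n) * sum(e * xi for e, xi in zip(errors, x))
    grad_b = (2 / n) * sum(errors)
    w -= lr * grad_w
    b -= lr * grad_b

mse = sum((w * xi + b - ti) ** 2 for xi, ti in zip(x, t)) / n

print(f"correlation r = {r:.4f}")
print(f"final w       = {w:.4f}")
print(f"final MSE     = {mse:.4f}   (1 - r^2 = {1 - r * r:.4f})")
```

The final weight matches the correlation, and the final loss matches one minus the squared correlation, which is the share of variance the line cannot explain.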

The learning rate controls the step size, and it matters enormously. Too small, and the model crawls toward the answer and may not arrive in time. Too large, and it overshoots and bounces around. The comparison below shows three rates: 0.01 is too slow, 0.1 is just right, and 0.6 overshoots and oscillates.

Figure: Loss curves for learning rates 0.01, 0.1, and 0.6. Learning rate matters: too small crawls, too large overshoots, and a good value converges cleanly.
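For this one-weight problem you can work out exactly what each learning rate does. The short derivation below is a sketch under stated assumptions: \(x\) and \(t\) are standardized so that their means are 0 and the mean of \(x^2\) is 1, the bias stays at 0, and \(r\) denotes the mean of \(x_i t_i\) (the correlation).

```latex
% Assumes mean(x) = mean(t) = 0, mean(x^2) = 1, b = 0, and r = mean(x t).
\begin{align*}
\frac{\partial\,\text{MSE}}{\partial w}
  &= \frac{2}{n}\sum_{i=1}^{n} (w x_i - t_i)\,x_i = 2\,(w - r) \\
w_{k+1} &= w_k - 2\eta\,(w_k - r) \\
w_{k+1} - r &= (1 - 2\eta)\,(w_k - r)
\end{align*}
% eta = 0.01: factor  0.98  (slow, steady approach)
% eta = 0.1 : factor  0.80  (fast, no overshoot)
% eta = 0.6 : factor -0.20  (overshoots every step, but the error still shrinks)
% eta > 1   : |factor| > 1  (every step lands farther away: divergence)
```

Each step multiplies the distance to the best weight by \(1 - 2\eta\). A negative factor means the weight jumps past the target and approaches it from alternating sides; the loss only grows once that factor exceeds 1 in absolute value.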

You do not need to hand-code gradient descent in practice, but seeing it demystifies what .fit() does. The next lesson develops this fully.


The Same Answer, a Different Engine

To check that gradient descent really is doing the same job as the exact LinearRegression solver, swap in scikit-learn’s SGDRegressor, which fits by stochastic gradient descent instead of solving the equations directly. Run it on the same scaled features.

from sklearn.linear_model import SGDRegressor

sgd = SGDRegressor(max_iter=2000, random_state=42, eta0=0.01)
sgd.fit(X_train_scaled, y_train)

sgd_r2 = sgd.score(X_test_scaled, y_test)
print(f"OLS  test R^2 = {test_r2:.3f}")
print(f"SGD  test R^2 = {sgd_r2:.3f}")
# Output:
# OLS  test R^2 = 0.793
# SGD  test R^2 = 0.795

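The two methods land on essentially the same test \(R^2\): 0.793 for the exact solver and 0.795 for gradient descent. They take different routes, but they arrive at nearly the same place, because both are minimizing a squared-error loss. By default, SGDRegressor also adds a very small L2 penalty and stops at a tolerance rather than at the exact minimum, which is why the numbers differ in the third decimal. The chart below puts them side by side.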

Figure: Bar chart comparing OLS and SGD test R-squared. Gradient descent (SGD) reaches essentially the same test performance as the exact solver.

This is reassuring: the iterative method you peeked at above is not a toy. It is the same machinery that scales to models far too large to solve exactly.


Communicating Your Results

A model is only useful if you can explain it. When you hand results to a teammate or a manager, skip the math and answer the questions they actually care about. A good summary covers:

  • How much data, split how? 159 cars, 75 percent for training and 25 percent held out for an honest test.
  • What features, and why? Five physical measurements (engine size, horsepower, curb weight, width, highway mpg), each describing a distinct aspect of the car.
  • What did you find? Curb weight, engine size, and width drive price the most; fuel economy barely matters once size is accounted for.
  • How good is it on new data? It explains about 79 percent of price variation on unseen cars, with a typical error near $2,300.
  • Is that good enough? For a quick first estimate on a $5,000-$35,000 range, yes. For a final offer, you would want more features and a more flexible model.

Being transparent about the split, the features, and the test error is what separates a trustworthy result from a number pulled out of thin air.


Practice Exercises

Try these before peeking at the hints.

Exercise 1: Add Two More Features

Extend the multiple regression by adding length and wheel_base to the feature list. Re-split (same random_state=42), re-scale, refit, and report the test \(R^2\) and RMSE. Did the extra features help?

features = ["engine_size", "horsepower", "curb_weight", "width", "highway_mpg",
            "length", "wheel_base"]
# Your code here: split, scale (fit on train only), fit, evaluate

Hint

Reuse the exact pattern from the lesson: train_test_split(X, y, test_size=0.25, random_state=42), then a fresh StandardScaler fit on X_train only, then LinearRegression().fit(...). Compute r2_score and np.sqrt(mean_squared_error(...)) on the test predictions. Adding features that overlap with existing ones (length and width both measure size) usually gives only a small change.

Exercise 2: Compare Two Single-Feature Models

Fit one simple regression on horsepower and another on curb_weight, each predicting price on the full dataset. Print the \(R^2\) of each. Which single feature is the stronger predictor on its own?

# Your code here

Hint

For each feature, build X = df[["horsepower"]] (note the double brackets so it stays a table), then LinearRegression().fit(X, df["price"]) and .score(X, df["price"]). Compare the two \(R^2\) values. Curb weight tends to be one of the strongest single predictors of car price.

Exercise 3: Change the Learning Rate

Take the from-scratch gradient descent loop and run it with three learning rates: 0.01, 0.1, and 0.6. For each, print the final MSE after 60 iterations. What happens at the largest rate?

# Your code here (reuse x, t, n from the lesson)

Hint

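Wrap the loop in a function that takes lr as an argument and returns the final MSE. Call it for each of [0.01, 0.1, 0.6]. At 0.01 the loss is still high (too slow to converge in 60 steps), and at 0.1 it settles near 0.29. At 0.6 every step overshoots the best weight, but on this standardized one-feature problem each overshoot is smaller than the last, so the final MSE still ends up near 0.29. Push the rate well above 1.0 (try 1.1) and the overshoots grow instead, so the loss blows up rather than shrinking.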


Summary

You took linear regression from a formula to a working, evaluated model on real data. Let’s review.

Key Concepts

Building the Model

  • A simple regression uses one predictor: price = -7914.1 + 162.38 * engine_size, with \(R^2 = 0.708\)
  • A multiple regression uses several predictors at once, each with its own coefficient
  • Selecting useful numeric features (engine size, horsepower, curb weight, width, highway mpg) gives the model more signal

Preparing the Data

  • Split into train and test with train_test_split, using a fixed random_state for reproducibility
  • Scale features with StandardScaler, fitting on the training set only to avoid leakage
  • After standardizing, coefficients are directly comparable: curb weight, engine size, and width drive price the most

Evaluating Honestly

  • Training error is too optimistic, so judge a model by its test error
  • \(R^2\) is the fraction of variance explained; RMSE and MAE report typical error in dollars
  • On the test set the model reached \(R^2 = 0.793\), RMSE $2,327, and MAE $1,863
  • Residual plots reveal whether a straight line is the right shape for the data

How Fitting Works

  • Gradient descent searches for coefficients by stepping downhill on the loss; on standardized data the weight converged to about 0.841 with a final MSE of 0.292
  • The learning rate controls step size: too small crawls, too large overshoots
  • SGDRegressor reached essentially the same test \(R^2\) (0.795) as the exact solver (0.793), consistent with both minimizing a squared-error loss

Why This Matters

The discipline you practiced here, splitting honestly, scaling correctly, and reporting test error, is what makes a model trustworthy. The exact algorithm will change as you move to ridge regression, trees, or neural networks, but this evaluation workflow stays the same. A model is only as good as the evidence that it generalizes, and that evidence lives in the test set.

You also got your first look under the hood at gradient descent, the engine that trains not just linear regression but nearly every modern model. The next lesson opens that hood fully.


Next Steps

You can now build, scale, and honestly evaluate a regression model end to end. Next, you will dive into the algorithm that actually fits these models, gradient descent, and understand exactly how it finds the best coefficients.

Continue to Lesson 5 - Understanding Gradient Descent
