Lesson 1 - Introduction to Linear Regression

Welcome to Linear Regression

This lesson introduces linear regression, the model most practitioners reach for first whenever the thing they want to predict is a number. You will learn what a regression actually is, why the best-fit line is defined by minimizing squared error, and how to fit both simple and multiple regressions in scikit-learn, all while predicting the price of real cars.

By the end of this lesson, you will be able to:

  • Explain what regression is and how linear regression models a relationship
  • Identify the intercept, coefficients, and error term of a linear model
  • Fit a simple linear regression with one predictor using scikit-learn
  • Fit a multiple linear regression with several predictors and evaluate it honestly
  • Describe the squared-error cost function and how gradient descent minimizes it

No prior machine learning experience is needed. You should be comfortable with basic Python, pandas, and NumPy. Let’s begin.


What Is Regression?

Imagine you work at a dealership and a trade-in rolls onto the lot. Before you can list it, you need a price. You could guess, but you already have records for hundreds of cars: their engine size, horsepower, weight, fuel economy, and the price each one sold for. The question is simple to state and hard to eyeball: given a car’s measurable characteristics, what is a fair price?

Regression is the technique that answers exactly this kind of question. A regression models the relationship between two sets of variables. On one side are the predictors (also called features): the measurable inputs you think influence the answer, like engine size and weight. On the other side is the outcome (also called the target): the single number you want to predict, here the car’s price.

Mathematically, we write that relationship as a function:

$$Y = f(X) + \epsilon$$

This says the outcome $Y$ splits into two pieces. The first piece, $f(X)$, is the systematic part: how the predictors drive the outcome. The second piece, $\epsilon$, is the error (or noise): everything the predictors cannot explain. The relationship is never perfect, because no set of measurements fully determines a price. The error term fills that gap, and later it becomes the basis for measuring how good a model is.

What Makes It “Linear”

The word “linear” tells you the shape of $f(X)$. Linear regression assumes the outcome is a linear combination of the predictors. With a single predictor, that is just the equation of a line:

$$Y = \beta_0 + \beta_1 X + \epsilon$$

Here $\beta_1$ is the coefficient (the slope). It says how much $Y$ changes when $X$ increases by one unit. If $\beta_1$ is large, small changes in the predictor move the outcome a lot. The term $\beta_0$ is the intercept: the value of $Y$ when $X$ is zero. Together, $\beta_0$ and $\beta_1$ are the parameters of the model. Parameters are not chosen by hand; they are learned from the data.

When a model has exactly one predictor, it is a simple linear regression. Nothing stops you from adding more predictors, which turns it into a multiple linear regression:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon$$

The principle is identical; there are just more slopes to learn.
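
To make the multiple-regression equation concrete, here is a minimal sketch that evaluates its systematic part for a few rows. The intercept, the two slopes, and the three rows are made up for illustration; they are not fitted to the Automobiles data.

```python
import numpy as np

# Hypothetical parameters, chosen by hand purely for illustration.
beta_0 = 1000.0                  # intercept
beta = np.array([50.0, 2.0])     # one slope per predictor

# Three made-up rows; the columns are two predictors, X1 and X2.
X = np.array([
    [100.0, 2000.0],
    [120.0, 2500.0],
    [150.0, 3000.0],
])

# Systematic part of the model: beta_0 + beta_1 * X1 + beta_2 * X2, row by row.
y_hat = beta_0 + X @ beta
print(y_hat)  # [10000. 12000. 14500.]
```

The matrix product `X @ beta` computes every row's weighted sum at once, which is why adding predictors only adds columns and slopes, not new machinery.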

Why Linear Regression Is Worth Learning First

Linear regression is one of the workhorse models of statistics and machine learning, used in business, science, and medicine. It is fast, it has so few parameters that it is less prone to overfitting than more flexible models, and above all it is interpretable: every coefficient has a plain-language meaning you can explain to a non-technical colleague. Even when you eventually move to more complex models, linear regression remains the baseline you measure them against.


The Best-Fit Line

If a line is defined by its slope and intercept, then fitting a regression means choosing the slope and intercept that fit the data best. But “best” needs a precise definition.

For any candidate line, each data point sits some vertical distance away from it. That distance, the gap between the actual value and what the line predicts, is the residual for that point. A good line makes the residuals small across the whole dataset. The best-fit line is the one that makes them as small as possible, in a specific sense we will define in a moment.

Diagram of a best-fit line minimizing residuals
The best-fit line is the one that makes the sum of squared residuals as small as possible.

Notice the vertical bars in the figure. Each one is a residual: the error the line makes on a single point. Fitting the model means shrinking the total size of those bars. We will turn that intuition into a formula, the cost function, shortly. First, let’s get the real data loaded.
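
Before moving on, the idea of residuals can be sketched in a few lines. The four points and the two candidate lines below are made up for illustration; the helper names `residuals` and `sum_sq` are hypothetical.

```python
import numpy as np

# Made-up points, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 8.0])

def residuals(slope, intercept):
    """Vertical gaps between the actual values and the line's predictions."""
    return y - (slope * x + intercept)

def sum_sq(slope, intercept):
    """Sum of squared residuals for one candidate line."""
    return float((residuals(slope, intercept) ** 2).sum())

line_a = sum_sq(slope=1.0, intercept=1.0)   # a plausible-looking guess
line_b = sum_sq(slope=1.9, intercept=0.0)   # a second candidate
print("line A:", round(line_a, 2), " line B:", round(line_b, 2))
```

Line B leaves much shorter bars overall than line A, so by the squared-residual yardstick it is the better fit for these four points.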


Meet the Data

You will predict car prices using the classic Automobiles dataset, drawn from real automobile import records. Each row is one car model, with measurements of its body, engine, and performance, plus the price it sold for. You can download it and load it with pandas.

import pandas as pd

df = pd.read_csv("automobiles.csv")  # download: https://datatweets.com/datasets/automobiles.csv

print("Shape:", df.shape)
# Output: Shape: (159, 26)

The dataset has 159 rows and 26 columns, and conveniently it has no missing values, so you can focus on modeling rather than cleaning. The column you will predict is price, in US dollars.

A Data Dictionary

You will not use all 26 columns. Here are the ones that matter for this lesson:

| Column | Type | Meaning |
|---|---|---|
| price | int | Target: selling price in US dollars |
| engine_size | int | Engine displacement in cubic inches |
| horsepower | int | Engine power output |
| curb_weight | int | Weight of the car in pounds, empty |
| width | float | Width of the car in inches |
| length, wheel_base | float | Other body dimensions in inches |
| city_mpg, highway_mpg | int | Fuel economy, miles per gallon |
| make, body_style, drive_wheels | category | Brand, shape, and drivetrain |
| fuel_type, aspiration, num_of_doors | category | Other categorical descriptors |

Take a quick look at the target and a few predictors.

print(df["price"].describe()[["mean", "min", "max"]].round())
# Output:
# mean    11446.0
# min      5118.0
# max     35056.0
# Name: price, dtype: float64

print(df[["make", "engine_size", "horsepower", "curb_weight", "price"]].head(3))
# Output:
#    make  engine_size  horsepower  curb_weight  price
# 0  audi          109         102         2337  13950
# 1  audi          136         115         2824  17450
# 2  audi          136         110         2844  17710

Prices range from about $5,118 to $35,056, with an average near $11,446. That spread is what your model will try to explain.

Numerical and Categorical Predictors

Predictors come in two flavors. Numerical predictors contain numbers, like engine_size or horsepower, and can go straight into a regression. Categorical predictors indicate group membership, like body_style (sedan, hatchback, wagon) or drive_wheels (fwd, rwd, 4wd). A regression cannot multiply a coefficient by the word “sedan,” so categorical columns must first be converted into numeric dummy variables, one 0/1 column per category. You will work with categoricals in a later lesson; for this one, you will stick to numeric predictors so the focus stays on the model itself.
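
As a preview of what dummy variables look like, here is a minimal sketch using pandas. The four-row table is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Tiny made-up table; the values are illustrative, not rows from the dataset.
cars = pd.DataFrame({
    "body_style": ["sedan", "hatchback", "wagon", "sedan"],
    "horsepower": [102, 68, 88, 110],
})

# One 0/1 column per category; dtype=int gives 0/1 instead of True/False.
encoded = pd.get_dummies(cars, columns=["body_style"], dtype=int)
print(encoded)
```

Each row has a 1 in exactly one of the new columns. When the model also has an intercept, one of the dummy columns is often dropped (for example with `drop_first=True`) because it is redundant given the others.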


Simple Linear Regression

Start with the simplest possible model: predict price from a single predictor, engine_size. Bigger engines tend to cost more, so this is a sensible first relationship to test.

scikit-learn follows a consistent pattern for every model: instantiate, then fit. The features go in a two-dimensional table X (note the double brackets, which keep it a DataFrame), and the target goes in a one-dimensional y.

from sklearn.linear_model import LinearRegression

X = df[["engine_size"]]   # features: a table with one column
y = df["price"]           # target: a single column

model = LinearRegression()
model.fit(X, y)           # learn the best slope and intercept

print("Intercept:", round(model.intercept_, 1))
print("Slope:    ", round(model.coef_[0], 2))
# Output:
# Intercept: -7914.1
# Slope:     162.38

The model learned this line:

$$\widehat{\text{price}} = -7914.1 + 162.38 \times \text{engine\_size}$$

Read the slope in plain English: each additional cubic inch of engine displacement is associated with about $162 more in price. The intercept of -$7,914 is the predicted price for a hypothetical engine size of zero, which has no real-world meaning here; intercepts often do not, and that is fine.
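
A quick arithmetic check of the fitted line, using the intercept and slope printed above. The engine size of 130 is a hypothetical input chosen for illustration, and `predict_price` is a hypothetical helper name.

```python
# Coefficients copied from the fitted simple regression above.
intercept = -7914.1
slope = 162.38

def predict_price(engine_size):
    """Apply the learned line to one engine size."""
    return intercept + slope * engine_size

engine_size = 130  # hypothetical car, for illustration
print(round(predict_price(engine_size), 1))  # 13195.3

# Moving one unit along the x-axis changes the prediction by exactly the slope.
print(round(predict_price(engine_size + 1) - predict_price(engine_size), 2))  # 162.38
```

With a fitted scikit-learn model, `model.predict` performs the same calculation for a whole table of rows.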

How well does this single predictor explain price? The standard score for regression is the coefficient of determination, written $R^2$. It measures the fraction of the variation in the outcome that the model explains, on a scale where 1.0 is perfect and 0.0 is no better than always guessing the mean:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

print("R^2:", round(model.score(X, y), 3))
# Output: R^2: 0.708

An $R^2$ of 0.708 means engine size alone explains roughly 71 percent of the variation in price. That is a strong start for one feature. The scatter plot below shows the cars and the fitted line running through them.

Scatter of engine size vs price with a fitted line
A simple linear regression of car price on engine size.

Each dot is a car. The line is your model. The vertical distance from each dot to the line is that car’s residual, and the fitting procedure chose the one line that minimizes those residuals overall.
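
The coefficient of determination used in this section can be checked by hand. The sketch below uses four made-up values (not from the dataset) and compares the manual formula against scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual values and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = ((y_true - y_pred) ** 2).sum()          # squared residuals: 0.75
ss_tot = ((y_true - y_true.mean()) ** 2).sum()   # squared gaps from the mean: 20.0

r2_manual = 1 - ss_res / ss_tot
print(round(r2_manual, 4), round(r2_score(y_true, y_pred), 4))  # 0.9625 0.9625
```

The denominator is the squared error of always guessing the mean, which is why a score of 0.0 means "no better than the mean."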


The Cost Function

You just trusted scikit-learn to find the “best” line. Now let’s pin down what “best” means, because it is the idea at the heart of almost every machine learning model.

For each car, the model makes a prediction, and the residual is the gap between the true price and that prediction. To judge a whole line, you need to combine all the residuals into one number. The obvious idea, adding them up, fails: a large positive residual and a large negative one would cancel out, making a terrible line look perfect. The fix is to square each residual before summing, so every error counts as positive and large errors are penalized heavily. This gives the sum of squared errors (SSE), the cost function for linear regression:

$$L(\beta_0, \beta_1, \dots, \beta_p) = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 = \sum_{i=1}^{n} \epsilon_i^2$$

A common variant divides by the number of points $n$ to get the mean squared error (MSE), which is easier to compare across datasets of different sizes:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$

“Best-fit line” simply means the values of $\beta_0$ and $\beta_1$ that make this cost as small as possible. This is the method of least squares. Different models use different cost functions, but the motivation never changes: pick the parameters that minimize the cost on your data.
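
The cancellation problem that motivates squaring is easy to see numerically. In this sketch, on made-up data, a flat line that ignores the predictor entirely has raw residuals that sum to zero, yet its squared-error cost is large.

```python
import numpy as np

# Made-up data that lies exactly on the line y = 10 * x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([10.0, 20.0, 30.0, 40.0])

flat_pred = np.full_like(y, y.mean())   # a flat line at the mean; ignores x
good_pred = 10.0 * x                    # passes through every point

results = {}
for name, pred in [("flat", flat_pred), ("good", good_pred)]:
    resid = y - pred
    results[name] = {
        "raw_sum": float(resid.sum()),       # positives and negatives cancel
        "sse": float((resid ** 2).sum()),    # sum of squared errors
        "mse": float((resid ** 2).mean()),   # SSE divided by n
    }
    print(name, results[name])
```

Both lines have a raw residual sum of zero, so that number cannot tell them apart. The SSE can: it is 500 for the flat line and 0 for the line through the points.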

For linear regression there is even a closed-form solution: the minimizing coefficients can be computed directly with linear algebra, with no iteration. scikit-learn’s LinearRegression takes this direct route rather than iterating toward the answer. But many models have no such shortcut, so it is worth seeing the general-purpose method that does work everywhere: gradient descent.
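
A minimal sketch of the direct approach, assuming synthetic data generated on the spot (the true intercept of -8000 and slope of 160 are arbitrary choices). It solves the normal equations with NumPy and cross-checks against `np.linalg.lstsq`. This illustrates the idea only and is not a description of scikit-learn's internal code.

```python
import numpy as np

# Synthetic data with a known linear trend plus noise (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(60, 330, size=200)
y = -8000 + 160 * x + rng.normal(0, 3000, size=200)

# Design matrix: a column of ones for the intercept, then the predictor.
A = np.column_stack([np.ones_like(x), x])

# Normal equations: solve (A^T A) beta = A^T y for beta = [intercept, slope].
beta_normal = np.linalg.solve(A.T @ A, A.T @ y)

# Cross-check with NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print("normal equations:", beta_normal.round(2))
print("lstsq:           ", beta_lstsq.round(2))
```

Both routes return the same intercept and slope, and the slope lands close to the 160 used to generate the data.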


Gradient Descent: Finding the Minimum by Walking Downhill

Picture the cost function as a landscape. For a simple regression with one slope, plotting the cost for every possible slope traces out a bowl shape, high on the sides and low in the middle. The best slope sits at the very bottom. Gradient descent is an algorithm that finds that bottom by repeatedly stepping downhill.

At any point on the bowl, the gradient (the derivative of the cost) tells you the direction of steepest ascent. Step the opposite way and you move toward the minimum. The size of each step is controlled by the learning rate, written $\eta$. The update rule for a single parameter $w$ is:

$$w \leftarrow w - \eta \, \frac{\partial L}{\partial w}$$

For the mean squared error, the gradient with respect to the slope works out to a clean expression, and you can implement the whole loop yourself in a few lines. To make the math behave nicely, first standardize both the predictor and the target (subtract the mean, divide by the standard deviation) so they share a common scale.

import numpy as np

# Standardize engine_size and price to mean 0, std 1
x = (df["engine_size"] - df["engine_size"].mean()) / df["engine_size"].std()
t = (df["price"] - df["price"].mean()) / df["price"].std()
n = len(x)

w, b = 0.0, 0.0      # start the slope and intercept at zero
eta = 0.1            # learning rate
losses = []

for i in range(60):
    pred = w * x + b
    error = pred - t
    losses.append((error ** 2).mean())          # record MSE this step
    w -= eta * (2 / n) * (error * x).sum()       # gradient step for slope
    b -= eta * (2 / n) * error.sum()             # gradient step for intercept

print("Final slope w:", round(w, 3))
print("Final bias  b:", round(b, 3))
print("Final MSE:    ", round(losses[-1], 4))
# Output:
# Final slope w: 0.841
# Final bias  b: 0.0
# Final MSE:     0.29

After 60 iterations the slope settles at about 0.841, the bias at 0, and the MSE bottoms out near 0.290. On standardized data, that learned slope is exactly the correlation between engine size and price, which is a satisfying sanity check. The intercept lands at zero because both variables were centered.
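
The two update lines in the loop above come from differentiating the MSE with respect to each parameter. Using the names from the code (prediction $w x_i + b$, standardized target $t_i$), a short derivation gives:

```latex
\begin{aligned}
\text{MSE}(w, b) &= \frac{1}{n} \sum_{i=1}^{n} \left( w x_i + b - t_i \right)^2 \\
\frac{\partial\, \text{MSE}}{\partial w} &= \frac{2}{n} \sum_{i=1}^{n} \left( w x_i + b - t_i \right) x_i \\
\frac{\partial\, \text{MSE}}{\partial b} &= \frac{2}{n} \sum_{i=1}^{n} \left( w x_i + b - t_i \right)
\end{aligned}
```

The term in parentheses is `error` in the code, so the two derivatives correspond to `(2 / n) * (error * x).sum()` and `(2 / n) * error.sum()`.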

Watching the loss fall each iteration shows the algorithm working. It starts high and drops quickly, then flattens as it approaches the minimum.

print("MSE at iterations 0, 1, 2, 3:", [round(l, 3) for l in losses[:4]])
# Output: MSE at iterations 0, 1, 2, 3: [0.994, 0.742, 0.58, 0.476]

Loss decreasing over gradient descent iterations
The mean squared error falls steeply at first, then levels off as gradient descent nears the minimum.

The Learning Rate Matters

The learning rate is the single most important knob in gradient descent, and getting it wrong breaks training. Run the same loop with three different rates and compare.

for eta in [0.01, 0.1, 0.6]:
    w, b = 0.0, 0.0
    for i in range(60):
        error = (w * x + b) - t
        w -= eta * (2 / n) * (error * x).sum()
        b -= eta * (2 / n) * error.sum()
    print(f"eta={eta:<4}  final w={w:.3f}  final MSE={((w*x+b - t)**2).mean():.4f}")
# Output:
# eta=0.01  final w=0.589  final MSE=0.3533
# eta=0.1   final w=0.841  final MSE=0.2900
# eta=0.6   final w=0.841  final MSE=0.2900

Loss curves for three different learning rates
Too small a learning rate crawls; a good one converges smoothly; too large a one jumps around and can diverge.

The three rates tell a clear story:

  • eta=0.01 is too small. After 60 steps it has only reached a slope of 0.589 and an MSE of 0.353; it is still creeping toward the minimum and needs far more iterations.
  • eta=0.1 is just right. It descends smoothly and lands at the true minimum well within 60 steps.
  • eta=0.6 is aggressive. It reaches the minimum fast, but each step overshoots the bottom and lands on the opposite side of the bowl, so the slope zigzags inward. Push the rate past about 1.0 and the steps overshoot so badly that the loss grows each iteration and the algorithm diverges entirely.

The lesson is to pick a learning rate large enough to make progress but small enough to stay stable.
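
Where the threshold of about 1.0 comes from can be sketched with a little algebra. This assumes the intercept stays at zero (it does here, because both variables are centered), and it introduces two shorthand symbols not used elsewhere in the lesson: $s$ for the mean of $x_i^2$ and $c$ for the mean of $x_i t_i$.

```latex
\begin{aligned}
w_{k+1} &= w_k - 2\eta\,(s\,w_k - c) = (1 - 2\eta s)\,w_k + 2\eta c,
  \qquad s = \frac{1}{n}\sum_{i} x_i^2,\quad c = \frac{1}{n}\sum_{i} x_i t_i \\
w_k - w^\ast &= (1 - 2\eta s)^k\,(w_0 - w^\ast), \qquad w^\ast = \frac{c}{s} \\
\text{convergence} &\iff |1 - 2\eta s| < 1 \iff 0 < \eta < \frac{1}{s}
\end{aligned}
```

Because the data were standardized with the sample standard deviation (the pandas default), $s = (n-1)/n \approx 0.994$ for 159 rows, so the limit is just above 1.0. The multiplier $1 - 2\eta s$ is about 0.98 at eta=0.01 (slow shrinkage), about 0.80 at eta=0.1 (steady shrinkage), and about -0.19 at eta=0.6 (fast shrinkage with a sign flip on every step).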


Multiple Linear Regression

One predictor explained 71 percent of price variation. You can do better by giving the model more to work with. Multiple linear regression uses several predictors at once, each with its own coefficient. You will use five numeric features that all relate to price: engine size, horsepower, curb weight, width, and highway fuel economy.

Splitting and Scaling

Two preparation steps matter here. First, you must evaluate the model on data it never trained on, so split the data into a training set and a test set. The model learns from the training set; the test set is locked away and used only at the end to measure honest performance. Second, the features live on wildly different scales (engine size in the hundreds, width near 70, mpg in the tens), so you standardize them with StandardScaler to put every feature on a common footing. This makes the coefficients directly comparable, since each then describes the effect of a one-standard-deviation change.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

features = ["engine_size", "horsepower", "curb_weight", "width", "highway_mpg"]
X = df[features]
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on TRAIN only
X_test_scaled = scaler.transform(X_test)        # apply the SAME transform to test

The golden rule of scaling

Fit the scaler on the training data only, then apply it to both sets. If you fit on the full dataset, information about the test set leaks into training and your scores become too optimistic. The same discipline applies to the model: it must never see the test set until the final evaluation.
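
One way to make this rule hard to break is to bundle the scaler and the model together. The sketch below uses scikit-learn's `make_pipeline` on synthetic data (generated here for illustration, not the Automobiles dataset); calling `fit` on the pipeline fits the scaler on the training rows only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales, plus a noisy linear target.
rng = np.random.default_rng(42)
X = rng.normal(loc=[120.0, 2500.0], scale=[40.0, 500.0], size=(200, 2))
y = 100 * X[:, 0] + 4 * X[:, 1] + rng.normal(0, 1000, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# fit() scales using training statistics; score() reuses them on the test set.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)
test_r2 = pipe.score(X_test, y_test)

# The scaler inside the pipeline only ever saw the training rows.
scaler = pipe.named_steps["standardscaler"]
print("scaler means:", scaler.mean_.round(1))
print("test R^2:", round(test_r2, 3))
```

The manual two-step version in the lesson does the same thing; the pipeline simply removes the chance of calling `fit_transform` on the wrong data.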

Fitting and Reading the Coefficients

Fitting is the same two lines as before. Now there are five coefficients, one per feature, plus an intercept.

model = LinearRegression()
model.fit(X_train_scaled, y_train)

for name, coef in zip(features, model.coef_):
    print(f"  {name:<12} coef = {coef:8.1f}")
print(f"  intercept = {model.intercept_:.1f}")
# Output:
#   engine_size  coef =   1808.4
#   horsepower   coef =    336.5
#   curb_weight  coef =   1935.4
#   width        coef =   1892.0
#   highway_mpg  coef =     82.6
#   intercept = 11442.5

Because the features are standardized, each coefficient is the dollar change in predicted price for a one-standard-deviation increase in that feature, holding the others fixed. Curb weight (1935), engine size (1808), and width (1892) are the heavy hitters; horsepower contributes less once those are accounted for, and highway mpg barely moves the needle. The intercept of $11,442 is the predicted price for an “average” car, since standardized features are zero at their means.

Bar chart of standardized regression coefficients
Standardized coefficients let you compare features directly: weight, engine size, and width dominate.
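
Standardized coefficients are convenient for comparison, but sometimes a per-unit slope is wanted. A minimal sketch on synthetic data (generated here, not taken from the Automobiles dataset) shows that dividing each standardized coefficient by that feature's standard deviation, which StandardScaler stores in `scale_`, recovers the coefficient of an unscaled fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: two features on different scales, known per-unit slopes.
rng = np.random.default_rng(0)
X = rng.normal(loc=[120.0, 2500.0], scale=[40.0, 500.0], size=(300, 2))
y = 5000 + 100 * X[:, 0] + 4 * X[:, 1] + rng.normal(0, 500, size=300)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

scaled_model = LinearRegression().fit(X_scaled, y)   # coefficients per std dev
raw_model = LinearRegression().fit(X, y)             # coefficients per raw unit

# Convert "per standard deviation" back to "per original unit".
per_unit = scaled_model.coef_ / scaler.scale_
print("per-unit from scaled fit:", per_unit.round(3))
print("per-unit from raw fit:   ", raw_model.coef_.round(3))
```

Scaling changes how the coefficients are expressed, not what the model predicts, so the two fits describe the same plane.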

Evaluating on the Test Set

Now measure performance on the held-out test set with three standard metrics. $R^2$ you have already met. RMSE (root mean squared error) is the typical prediction error in the same units as the target, dollars here. MAE (mean absolute error) is the average absolute miss, less sensitive to large outliers.

import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

preds = model.predict(X_test_scaled)

r2 = r2_score(y_test, preds)
rmse = np.sqrt(mean_squared_error(y_test, preds))
mae = mean_absolute_error(y_test, preds)

print(f"TEST  R^2={r2:.3f}  RMSE=${rmse:,.0f}  MAE=${mae:,.0f}")
# Output: TEST  R^2=0.793  RMSE=$2,327  MAE=$1,863

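The multiple regression explains about 79 percent of price variation on unseen cars. That is higher than the 71 percent from engine size alone, although the earlier figure was measured on the same data the model was fit to, so the two numbers are not strictly comparable. The typical error is about $2,327 by RMSE and $1,863 by MAE. For prices averaging $11,446, missing by roughly $2,000 is a respectable result.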
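
To see why RMSE and MAE can differ, here is a small hand computation on four made-up predictions (not model output), one of which misses badly.

```python
import numpy as np

# Made-up prices and predictions, for illustration only.
y_true = np.array([10000.0, 12000.0, 15000.0, 30000.0])
y_pred = np.array([11000.0, 11000.0, 16000.0, 25000.0])

errors = y_pred - y_true               # 1000, -1000, 1000, -5000

mae = np.abs(errors).mean()            # average absolute miss
rmse = np.sqrt((errors ** 2).mean())   # squaring weights the big miss heavily

print(f"MAE=${mae:,.0f}  RMSE=${rmse:,.0f}")  # MAE=$2,000  RMSE=$2,646
```

The single $5,000 miss pulls RMSE well above MAE. A large gap between the two metrics is a hint that a few predictions are much worse than the rest.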

A good way to see model quality is to plot predicted price against actual price. Points hugging the diagonal line are accurate predictions.

Scatter of predicted versus actual prices
Predicted versus actual price on the test set; points near the diagonal are good predictions.

You should also check the residuals, the prediction errors. If the model is well behaved, residuals scatter randomly around zero with no obvious pattern. A pattern would signal that the linear form is missing something.

Residual plot scattered around zero
Residuals scattered around zero with no clear pattern suggest the linear model fits reasonably.
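
For contrast, this sketch shows what a problematic residual pattern looks like numerically. It fits a straight line to made-up data that is actually curved; the data and the use of `np.polyfit` are illustrative choices, not part of the lesson's pipeline.

```python
import numpy as np

# Made-up data with a truly nonlinear (quadratic) relationship.
x = np.linspace(-3, 3, 61)
y_curved = x ** 2

# Fit a straight line anyway and compute its residuals.
slope, intercept = np.polyfit(x, y_curved, deg=1)
resid = y_curved - (slope * x + intercept)

# The residuals average to zero, yet they are clearly not random:
# positive at both edges and negative in the middle.
print("mean residual:", round(float(resid.mean()), 6))
print("left edge:", round(float(resid[0]), 2),
      " middle:", round(float(resid[30]), 2),
      " right edge:", round(float(resid[-1]), 2))
```

A U-shape like this in a residual plot is the kind of pattern that signals the linear form is missing something, even when the average residual looks fine.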

A Note on Gradient Descent in scikit-learn

You wrote gradient descent by hand to understand it, but scikit-learn ships a production version called SGDRegressor (stochastic gradient descent). It minimizes the same squared-error cost (by default with a small L2 regularization penalty added) and is a great choice when datasets grow too large for the closed-form solution. On the scaled car features, it lands in essentially the same place as ordinary least squares.

from sklearn.linear_model import SGDRegressor

sgd = SGDRegressor(max_iter=2000, random_state=42, eta0=0.01)
sgd.fit(X_train_scaled, y_train)

sgd_r2 = sgd.score(X_test_scaled, y_test)
print(f"OLS  R^2=0.793   SGD R^2={sgd_r2:.3f}")
# Output: OLS  R^2=0.793   SGD R^2=0.795

The two methods reach essentially the same answer (0.793 versus 0.795) by different routes: least squares solves for the minimum directly, while gradient descent walks down to it. Understanding both is what lets you choose the right tool as your problems scale up.

Comparison of SGD and OLS test scores
Gradient descent and the closed-form solution converge to essentially the same model.

Practice Exercises

Now it is your turn. Try these before checking the hints.

Exercise 1: Fit a Different Simple Regression

Build a simple linear regression that predicts price from curb_weight instead of engine_size. Print the slope, intercept, and $R^2$. Does weight explain more or less of the price variation than engine size did?

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("automobiles.csv")
# Your code here

Hint

Set X = df[["curb_weight"]] and y = df["price"], then model.fit(X, y). Read the slope from model.coef_[0], the intercept from model.intercept_, and the fit from model.score(X, y). Curb weight is a strong predictor too, so expect an $R^2$ in a similar range to engine size.

Exercise 2: Add a Predictor to the Multiple Regression

Take the five-feature model from the lesson and add wheel_base as a sixth predictor. Re-split (with random_state=42), re-scale on the training set only, refit, and print the test $R^2$. Did the extra feature help?

# Your code here (reuse the train_test_split and StandardScaler pattern from the lesson)

Hint

Add "wheel_base" to the features list, then repeat the exact split-scale-fit-score pipeline. Compare the new test $R^2$ to the lesson’s 0.793. A small change in either direction is normal; not every added feature improves an honest test score.

Exercise 3: Experiment with the Learning Rate

Modify the from-scratch gradient descent loop to try eta = 1.05 for 60 iterations on the standardized engine_size and price. Print the final slope and MSE. What happens, and why?

import numpy as np
# Standardize x and t as in the lesson, then run the loop with eta = 1.05

Hint

Reuse the standardization and the update rule, changing only the learning rate. With eta = 1.05 the steps overshoot the minimum and the loss grows each iteration: the slope and MSE blow up to huge values instead of settling near 0.841 and 0.29. This is divergence, the failure mode of a learning rate set too high.


Summary

Congratulations! You have built your first regression models, from a single-predictor line to a five-feature model evaluated on unseen cars, and you have seen how gradient descent finds the best fit. Let’s review what you learned.

Key Concepts

What Regression Is

  • Regression models the relationship between predictors (X) and a numeric outcome (y)
  • An outcome splits into a systematic part $f(X)$ and an error term $\epsilon$ the predictors cannot explain
  • Linear regression assumes the outcome is a linear combination of the predictors

Parameters

  • The coefficient (slope) says how much the outcome changes per unit of a predictor
  • The intercept is the predicted outcome when all predictors are zero, and often has no real-world meaning
  • Parameters are learned from data, not chosen by hand

The Cost Function

  • The best-fit line minimizes the sum of squared errors (the cost function)
  • Squaring residuals stops positive and negative errors from canceling and penalizes large misses
  • Gradient descent minimizes the cost by stepping downhill; the learning rate controls step size
  • A learning rate too small crawls; too large overshoots and can diverge

Building and Evaluating Models

  • scikit-learn uses one pattern for every model: instantiate, then .fit()
  • Split into train and test sets so you evaluate on data the model never saw
  • Standardize features (fit the scaler on train only) to compare coefficients and stabilize training
  • Judge regressions with $R^2$ (variance explained) and with RMSE and MAE (typical error in target units)

Why This Matters

Linear regression is the foundation the rest of this module builds on. The habits you practiced here, defining a cost function, splitting data honestly, scaling features, and reading metrics on a held-out set, are exactly the habits every supervised learning project depends on, no matter how complex the model. You also saw that a model fit two completely different ways, closed-form least squares and iterative gradient descent, reaches the same answer, which is the bridge from classic statistics to modern machine learning. Master these basics and the more powerful models ahead become far easier to understand.


Next Steps

You now understand how a linear regression is fit and evaluated. In the next lesson, you will dig into what the slope and intercept actually mean, so you can explain your model’s predictions with confidence.

Continue to Lesson 2 - Interpreting Regression Parameters
