Lesson 1 - Introduction to Linear Regression
Welcome to Linear Regression
This lesson introduces linear regression, the model most practitioners reach for first whenever the thing they want to predict is a number. You will learn what a regression actually is, why the best-fit line is defined by minimizing squared error, and how to fit both simple and multiple regressions in scikit-learn, all while predicting the price of real cars.
By the end of this lesson, you will be able to:
- Explain what regression is and how linear regression models a relationship
- Identify the intercept, coefficients, and error term of a linear model
- Fit a simple linear regression with one predictor using scikit-learn
- Fit a multiple linear regression with several predictors and evaluate it honestly
- Describe the squared-error cost function and how gradient descent minimizes it
No prior machine learning experience is needed. You should be comfortable with basic Python, pandas, and NumPy. Let’s begin.
What Is Regression?
Imagine you work at a dealership and a trade-in rolls onto the lot. Before you can list it, you need a price. You could guess, but you already have records for hundreds of cars: their engine size, horsepower, weight, fuel economy, and the price each one sold for. The question is simple to state and hard to eyeball: given a car’s measurable characteristics, what is a fair price?
Regression is the technique that answers exactly this kind of question. A regression models the relationship between two sets of variables. On one side are the predictors (also called features): the measurable inputs you think influence the answer, like engine size and weight. On the other side is the outcome (also called the target): the single number you want to predict, here the car’s price.
Mathematically, we write that relationship as a function:
\[ y = f(X) + \varepsilon \]
This says the outcome splits into two pieces. The first piece, \(f(X)\), is the systematic part: how the predictors drive the outcome. The second piece, \(\varepsilon\), is the error (or noise): everything the predictors cannot explain. The relationship is never perfect, because no set of measurements fully determines a price. The error term fills that gap, and later it becomes the basis for measuring how good a model is.
What Makes It “Linear”
The word “linear” tells you the shape of \(f\). Linear regression assumes the outcome is a linear combination of the predictors. With a single predictor, that is just the equation of a line:
\[ y = \beta_0 + \beta_1 x \]
Here \(\beta_1\) is the coefficient (the slope). It says how much \(y\) changes when \(x\) increases by one unit. If \(\beta_1\) is large, small changes in the predictor move the outcome a lot. The term \(\beta_0\) is the intercept: the value of \(y\) when \(x\) is zero. Together, \(\beta_0\) and \(\beta_1\) are the parameters of the model. Parameters are not chosen by hand; they are learned from the data.
When a model has exactly one predictor, it is a simple linear regression. Nothing stops you from adding more predictors, which turns it into a multiple linear regression:
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p \]
The principle is identical; there are just more slopes to learn.
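To make the multiple-regression formula concrete, here is a minimal sketch that evaluates it for two made-up cars. The intercept, slopes, and feature values are hypothetical numbers chosen for illustration; they are not fitted to the Automobiles data.

```python
import numpy as np

# Hypothetical parameters (illustrative only, not learned from real data)
intercept = 1000.0
slopes = np.array([50.0, 20.0])      # one slope per predictor

# Two made-up cars: columns are [engine_size, horsepower]
X = np.array([
    [100.0, 90.0],
    [150.0, 120.0],
])

# y_hat = intercept + slope_1 * x_1 + slope_2 * x_2, for every row at once
y_hat = intercept + X @ slopes
print(y_hat)   # [ 7800. 10900.]
```

The matrix product `X @ slopes` is the "linear combination" in the definition: each feature is multiplied by its own slope and the results are summed.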
Why Linear Regression Is Worth Learning First
Linear regression is one of the workhorse models of statistics and machine learning, used in business, science, and medicine. It is fast, its small number of parameters makes it less prone to overfitting than more flexible models, and above all it is interpretable: every coefficient has a plain-language meaning you can explain to a non-technical colleague. Even when you eventually move to more complex models, linear regression remains the baseline you measure them against.
The Best-Fit Line
If a line is defined by its slope and intercept, then fitting a regression means choosing the slope and intercept that fit the data best. But “best” needs a precise definition.
For any candidate line, each data point sits some vertical distance away from it. That distance, the gap between the actual value and what the line predicts, is the residual for that point. A good line makes the residuals small across the whole dataset. The best-fit line is the one that makes them as small as possible, in a specific sense we will define in a moment.
Notice the vertical bars in the figure. Each one is a residual: the error the line makes on a single point. Fitting the model means shrinking the total size of those bars. We will turn that intuition into a formula, the cost function, shortly. First, let’s get the real data loaded.
Meet the Data
You will predict car prices using the classic Automobiles dataset, drawn from real automobile import records. Each row is one car model, with measurements of its body, engine, and performance, plus the price it sold for. You can download it and load it with pandas.
import pandas as pd
df = pd.read_csv("automobiles.csv") # download: https://datatweets.com/datasets/automobiles.csv
print("Shape:", df.shape)
# Output: Shape: (159, 26)
The dataset has 159 rows and 26 columns, and conveniently it has no missing values, so you can focus on modeling rather than cleaning. The column you will predict is price, in US dollars.
A Data Dictionary
You will not use all 26 columns. Here are the ones that matter for this lesson:
| Column | Type | Meaning |
|---|---|---|
| price | int | Target: selling price in US dollars |
| engine_size | int | Engine displacement in cubic inches |
| horsepower | int | Engine power output |
| curb_weight | int | Weight of the car in pounds, empty |
| width | float | Width of the car in inches |
| length, wheel_base | float | Other body dimensions in inches |
| city_mpg, highway_mpg | int | Fuel economy, miles per gallon |
| make, body_style, drive_wheels | category | Brand, shape, and drivetrain |
| fuel_type, aspiration, num_of_doors | category | Other categorical descriptors |
Take a quick look at the target and a few predictors.
print(df["price"].describe()[["mean", "min", "max"]].round())
# Output:
# mean 11446.0
# min 5118.0
# max 35056.0
# Name: price, dtype: float64
print(df[["make", "engine_size", "horsepower", "curb_weight", "price"]].head(3))
# Output:
# make engine_size horsepower curb_weight price
# 0 audi 109 102 2337 13950
# 1 audi 136 115 2824 17450
# 2 audi 136 110 2844 17710
Prices range from about $5,118 to $35,056, with an average near $11,446. That spread is what your model will try to explain.
Numerical and Categorical Predictors
Predictors come in two flavors. Numerical predictors contain numbers, like engine_size or horsepower, and can go straight into a regression. Categorical predictors indicate group membership, like body_style (sedan, hatchback, wagon) or drive_wheels (fwd, rwd, 4wd). A regression cannot multiply a coefficient by the word “sedan,” so categorical columns must first be converted into numeric dummy variables, one 0/1 column per category. You will work with categoricals in a later lesson; for this one, you will stick to numeric predictors so the focus stays on the model itself.
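As a preview of that conversion, here is a minimal sketch using pandas `get_dummies` on a tiny made-up frame. The rows are illustrative and are not taken from automobiles.csv.

```python
import pandas as pd

# Tiny made-up frame; the values are illustrative only
toy = pd.DataFrame({
    "body_style": ["sedan", "hatchback", "wagon", "sedan"],
    "horsepower": [102, 70, 88, 115],
})

# One 0/1 column per category; dtype=int gives 0/1 instead of True/False
dummies = pd.get_dummies(toy, columns=["body_style"], dtype=int)
print(dummies.columns.tolist())
print(dummies["body_style_sedan"].tolist())   # [1, 0, 0, 1]
```

When the model also has an intercept, it is common to drop one of the dummy columns (for example with `drop_first=True`) so the remaining columns are not perfectly collinear with it.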
Simple Linear Regression
Start with the simplest possible model: predict price from a single predictor, engine_size. Bigger engines tend to cost more, so this is a sensible first relationship to test.
scikit-learn follows a consistent pattern for every model: instantiate, then fit. The features go in a two-dimensional table X (note the double brackets, which keep it a DataFrame), and the target goes in a one-dimensional y.
from sklearn.linear_model import LinearRegression
X = df[["engine_size"]] # features: a table with one column
y = df["price"] # target: a single column
model = LinearRegression()
model.fit(X, y) # learn the best slope and intercept
print("Intercept:", round(model.intercept_, 1))
print("Slope: ", round(model.coef_[0], 2))
# Output:
# Intercept: -7914.1
# Slope: 162.38
The model learned this line:
\[ \text{price} = -7914.1 + 162.38 \times \text{engine\_size} \]
Read the slope in plain English: each additional cubic inch of engine displacement is associated with about $162 more in price. The intercept of -$7,914 is the predicted price for a hypothetical engine size of zero, which has no real-world meaning here; intercepts often do not, and that is fine.
How well does this single predictor explain price? The standard score for regression is the coefficient of determination, written \(R^2\). It measures the fraction of the variation in the outcome that the model explains, on a scale where 1.0 is perfect and 0.0 is no better than always guessing the mean:
print("R^2:", round(model.score(X, y), 3))
# Output: R^2: 0.708
An \(R^2\) of 0.708 means engine size alone explains roughly 71 percent of the variation in price. That is a strong start for one feature. The scatter plot below shows the cars and the fitted line running through them.
Each dot is a car. The line is your model. The vertical distance from each dot to the line is that car’s residual, and the fitting procedure chose the one line that minimizes those residuals overall.
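The coefficient of determination can also be computed by hand, which makes the "fraction of variation explained" reading concrete. This is a minimal sketch on made-up numbers; the names `ss_res` and `ss_tot` are chosen here for illustration.

```python
import numpy as np

# Made-up outcomes and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # variation the model leaves unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))   # 0.975

# Always guessing the mean leaves all of the variation unexplained
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = 1 - np.sum((y_true - baseline) ** 2) / ss_tot
print(r2_baseline)    # 0.0
```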
The Cost Function
You just trusted scikit-learn to find the “best” line. Now let’s pin down what “best” means, because it is the idea at the heart of almost every machine learning model.
For each car, the model makes a prediction, and the residual is the gap between the true price and that prediction. To judge a whole line, you need to combine all the residuals into one number. The obvious idea, adding them up, fails: a large positive residual and a large negative one would cancel out, making a terrible line look perfect. The fix is to square each residual before summing, so every error counts as positive and large errors are penalized heavily. Writing \(y_i\) for the true price of car \(i\) and \(\hat{y}_i\) for the model's prediction, this gives the sum of squared errors (SSE), the cost function for linear regression:
\[ \text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]
A common variant divides by the number of points to get the mean squared error (MSE), which is easier to compare across datasets of different sizes:
\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]
“Best-fit line” simply means the values of the intercept \(\beta_0\) and slope \(\beta_1\) that make this cost as small as possible. This is the method of least squares. Different models use different cost functions, but the motivation never changes: pick the parameters that minimize the cost on your data.
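A short sketch can show the cost function at work. It scores two candidate lines on four made-up points and also checks the cancellation problem described earlier. The helper names `sse` and `mse` are hypothetical, defined here only for illustration.

```python
import numpy as np

# Made-up points that lie roughly on y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 9.0])

def sse(intercept, slope):
    """Sum of squared residuals for the line y_hat = intercept + slope * x."""
    residuals = y - (intercept + slope * x)
    return float(np.sum(residuals ** 2))

def mse(intercept, slope):
    """Mean squared error: SSE divided by the number of points."""
    return sse(intercept, slope) / len(x)

print(sse(0.0, 2.0), mse(0.0, 2.0))   # good candidate: 1.0 0.25
print(sse(0.0, 1.0), mse(0.0, 1.0))   # poor candidate: 39.0 9.75

# A flat line through the mean has raw residuals that cancel to zero,
# yet its squared-error cost is far from zero
flat_residuals = y - y.mean()
print(flat_residuals.sum(), float(np.sum(flat_residuals ** 2)))
```

The flat line looks perfect if you only add up raw residuals, which is exactly why the residuals are squared first.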
For linear regression there is even a closed-form formula that solves for the minimizing coefficients directly with linear algebra, which is what scikit-learn’s LinearRegression uses internally. But many models have no such shortcut, so it is worth seeing the general-purpose method that does work everywhere: gradient descent.
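The closed-form idea can be sketched with NumPy on synthetic data. This illustrates the linear algebra only; it is not scikit-learn's internal code, and the data, seed, and variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a known line, y = 3 + 2x, plus a little noise
x = rng.uniform(0, 10, size=200)
y = 3.0 + 2.0 * x + rng.normal(0, 0.5, size=200)

# Design matrix with a column of ones so the intercept is learned too
A = np.column_stack([np.ones_like(x), x])

# Normal equations: solve (A^T A) beta = A^T y
beta_normal = np.linalg.solve(A.T @ A, A.T @ y)

# Least-squares solver, which avoids forming A^T A explicitly
beta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print(beta_normal)   # roughly [3, 2]
print(beta_lstsq)    # same values, different route
```

Both routes return the intercept first and the slope second, and both land close to the true values used to generate the data.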
Gradient Descent: Finding the Minimum by Walking Downhill
Picture the cost function as a landscape. For a simple regression with one slope, plotting the cost for every possible slope traces out a bowl shape, high on the sides and low in the middle. The best slope sits at the very bottom. Gradient descent is an algorithm that finds that bottom by repeatedly stepping downhill.
At any point on the bowl, the gradient (the derivative of the cost) tells you the direction of steepest ascent. Step the opposite way and you move toward the minimum. The size of each step is controlled by the learning rate, written \(\eta\). The update rule for a single parameter \(\theta\), with \(J\) denoting the cost, is:
\[ \theta \leftarrow \theta - \eta \, \frac{\partial J}{\partial \theta} \]
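As a worked case, assume the cost is the mean squared error of a line with slope \(w\) and intercept \(b\) over \(n\) points, with \(x_i\) the predictor and \(t_i\) the target (these symbols are chosen here for illustration). Differentiating with the chain rule gives the two gradients the update rule needs:

```latex
J(w, b) = \frac{1}{n} \sum_{i=1}^{n} (w x_i + b - t_i)^2

\frac{\partial J}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} (w x_i + b - t_i)\, x_i
\qquad
\frac{\partial J}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} (w x_i + b - t_i)
```

Each gradient is just the prediction error, averaged over the data and, for the slope, weighted by the predictor value.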
For the mean squared error with standardized data, the gradient of the slope works out to a clean expression, and you can implement the whole loop yourself in a few lines. To make the math behave nicely, first standardize both the predictor and the target (subtract the mean, divide by the standard deviation) so they share a common scale.
import numpy as np
# Standardize engine_size and price to mean 0, std 1
x = (df["engine_size"] - df["engine_size"].mean()) / df["engine_size"].std()
t = (df["price"] - df["price"].mean()) / df["price"].std()
n = len(x)
w, b = 0.0, 0.0 # start the slope and intercept at zero
eta = 0.1 # learning rate
losses = []
for i in range(60):
    pred = w * x + b
    error = pred - t
    losses.append((error ** 2).mean())        # record MSE this step
    w -= eta * (2 / n) * (error * x).sum()    # gradient step for slope
    b -= eta * (2 / n) * error.sum()          # gradient step for intercept
print("Final slope w:", round(w, 3))
print("Final bias b:", round(b, 3))
print("Final MSE: ", round(losses[-1], 4))
# Output:
# Final slope w: 0.841
# Final bias b: 0.0
# Final MSE: 0.292
After 60 iterations the slope settles at about 0.841, the bias at 0, and the MSE bottoms out near 0.29. On standardized data, that learned slope equals the correlation between engine size and price, which is a satisfying sanity check: 0.841 squared is about 0.708, the \(R^2\) from the simple regression earlier. The intercept lands at zero because both variables were centered.
Watching the loss fall each iteration shows the algorithm working. It starts high and drops quickly, then flattens as it approaches the minimum.
print("MSE at iterations 0, 1, 2, 3:", [round(l, 3) for l in losses[:4]])
# Output: MSE at iterations 0, 1, 2, 3: [0.994, 0.742, 0.58, 0.476]
The Learning Rate Matters
The learning rate is the single most important knob in gradient descent, and getting it wrong breaks training. Run the same loop with three different rates and compare.
for eta in [0.01, 0.1, 0.6]:
    w, b = 0.0, 0.0
    for i in range(60):
        error = (w * x + b) - t
        w -= eta * (2 / n) * (error * x).sum()
        b -= eta * (2 / n) * error.sum()
    print(f"eta={eta:<4} final w={w:.3f} final MSE={((w*x+b - t)**2).mean():.4f}")
# Output:
# eta=0.01 final w=0.589 final MSE=0.3533
# eta=0.1 final w=0.841 final MSE=0.2900
# eta=0.6 final w=0.841 final MSE=0.2900
The three rates tell a clear story:
- eta=0.01 is too small. After 60 steps it has only reached a slope of 0.589 and an MSE of 0.353; it is still creeping toward the minimum and needs far more iterations.
- eta=0.1 is just right. It descends smoothly and lands at the true minimum well within 60 steps.
- eta=0.6 is aggressive. It reaches the minimum fast but takes large, jumpy steps to get there. Push the rate past about 1.0 and the steps overshoot the bottom so badly that the loss grows each iteration and the algorithm diverges entirely.
The lesson is to pick a learning rate large enough to make progress but small enough to stay stable.
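For this one-slope problem the stability limit can be worked out rather than guessed. Each step multiplies the slope's distance from the optimum by (1 - 2 * eta * mean(x²)), so the loop converges only while eta stays below 1 / mean(x²). The sketch below checks this on synthetic standardized data, not the car data; `final_mse` is a hypothetical helper. NumPy's `std` divides by n where pandas divides by n - 1, so the threshold here is exactly 1, while for pandas-standardized data it sits slightly above 1.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic, standardized predictor and target (illustrative only)
x = rng.normal(size=200)
x = (x - x.mean()) / x.std()
t = 0.8 * x + rng.normal(scale=0.5, size=200)
t = (t - t.mean()) / t.std()

def final_mse(eta, steps=60):
    """Run the gradient descent loop and return the MSE after the last step."""
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        error = (w * x + b) - t
        w -= eta * (2 / n) * (error * x).sum()
        b -= eta * (2 / n) * error.sum()
    return ((w * x + b - t) ** 2).mean()

threshold = 1 / (x ** 2).mean()
print("stability threshold:", round(threshold, 3))
print("just below:", final_mse(0.9 * threshold))   # settles at the minimum
print("just above:", final_mse(1.1 * threshold))   # blows up
```

With more than one feature the limit depends on the largest curvature direction of the cost, which is one reason standardizing features makes a single learning rate easier to choose.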
Multiple Linear Regression
One predictor explained 71 percent of price variation. You can do better by giving the model more to work with. Multiple linear regression uses several predictors at once, each with its own coefficient. You will use five numeric features that all relate to price: engine size, horsepower, curb weight, width, and highway fuel economy.
Splitting and Scaling
Two preparation steps matter here. First, you must evaluate the model on data it never trained on, so split the data into a training set and a test set. The model learns from the training set; the test set is locked away and used only at the end to measure honest performance. Second, the features live on wildly different scales (engine size in the hundreds, width near 70, mpg in the tens), so you standardize them with StandardScaler to put every feature on a common footing. This makes the coefficients directly comparable, since each then describes the effect of a one-standard-deviation change.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
features = ["engine_size", "horsepower", "curb_weight", "width", "highway_mpg"]
X = df[features]
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # learn mean/std on TRAIN only
X_test_scaled = scaler.transform(X_test) # apply the SAME transform to test
The golden rule of scaling
Fit the scaler on the training data only, then apply it to both sets. If you fit on the full dataset, information about the test set leaks into training and your scores become too optimistic. The same discipline applies to the model: it must never see the test set until the final evaluation.
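One way to make this rule hard to break is to bundle the scaler and the model into a scikit-learn pipeline. The sketch below uses synthetic data as a stand-in for the car features; the numbers and the seed are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in: two predictors on very different scales
X = np.column_stack([rng.normal(120, 40, 300), rng.normal(65, 2, 300)])
y = 100 * X[:, 0] + 800 * X[:, 1] + rng.normal(0, 500, 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# fit() learns the scaler statistics from the training rows only;
# score() and predict() reuse those same statistics on new rows
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)
print("Test R^2:", round(pipe.score(X_test, y_test), 3))
```

Because the scaler lives inside the pipeline, there is no separate `fit_transform` call that could accidentally be given the full dataset.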
Fitting and Reading the Coefficients
Fitting is the same two lines as before. Now there are five coefficients, one per feature, plus an intercept.
model = LinearRegression()
model.fit(X_train_scaled, y_train)
for name, coef in zip(features, model.coef_):
    print(f" {name:<12} coef = {coef:8.1f}")
print(f" intercept = {model.intercept_:.1f}")
# Output:
# engine_size coef = 1808.4
# horsepower coef = 336.5
# curb_weight coef = 1935.4
# width coef = 1892.0
# highway_mpg coef = 82.6
# intercept = 11442.5
Because the features are standardized, each coefficient is the dollar change in predicted price for a one-standard-deviation increase in that feature, holding the others fixed. Curb weight (1935), width (1892), and engine size (1808) are the heavy hitters; horsepower contributes less once those are accounted for, and highway mpg barely moves the needle. The intercept of $11,442 is the predicted price for an “average” car, since standardized features are zero at their means.
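If you want a coefficient back in "per original unit" terms, you can divide it by that feature's standard deviation, which `StandardScaler` stores in `scale_`. The sketch below shows this on synthetic data with known per-unit effects; the data and seed are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 100 per unit of the first feature, 800 per unit of the second
X = np.column_stack([rng.normal(120, 40, 500), rng.normal(65, 2, 500)])
y = 100 * X[:, 0] + 800 * X[:, 1] + rng.normal(0, 100, 500)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
model = LinearRegression().fit(X_scaled, y)

# Coefficients on standardized features are "per standard deviation"
print("per std-dev:", model.coef_.round(1))

# Dividing by each feature's standard deviation recovers "per original unit"
per_unit = model.coef_ / scaler.scale_
print("per unit:   ", per_unit.round(1))   # roughly [100, 800]
```

The two views answer different questions: standardized coefficients rank features by influence, while per-unit coefficients read naturally in the feature's own units.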
Evaluating on the Test Set
Now measure performance on the held-out test set with three standard metrics. \(R^2\) you have already met. RMSE (root mean squared error) is the typical prediction error in the same units as the target, dollars here. MAE (mean absolute error) is the average absolute miss, less sensitive to large outliers.
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
preds = model.predict(X_test_scaled)
r2 = r2_score(y_test, preds)
rmse = np.sqrt(mean_squared_error(y_test, preds))
mae = mean_absolute_error(y_test, preds)
print(f"TEST R^2={r2:.3f} RMSE=${rmse:,.0f} MAE=${mae:,.0f}")
# Output: TEST R^2=0.793 RMSE=$2,327 MAE=$1,863
The multiple regression explains 79 percent of price variation on unseen cars, compared with 71 percent for engine size alone (that earlier score was measured on the same data the model was fit on, so the comparison is indicative rather than exact). Its RMSE is about $2,327 and its MAE about $1,863. For prices averaging $11,446, missing by roughly $2,000 is a respectable result.
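To see why RMSE and MAE can differ, it helps to compute both by hand on a few made-up predictions where one miss is much larger than the others. The numbers below are illustrative only.

```python
import numpy as np

# Made-up prices and predictions in dollars, for illustration only
y_true = np.array([10000.0, 12000.0, 8000.0, 20000.0])
y_pred = np.array([11000.0, 11000.0, 8000.0, 16000.0])

errors = y_true - y_pred
rmse = np.sqrt(np.mean(errors ** 2))   # squares first, so the $4,000 miss dominates
mae = np.mean(np.abs(errors))          # every dollar of error counts equally

print(f"RMSE=${rmse:,.0f}  MAE=${mae:,.0f}")   # RMSE=$2,121  MAE=$1,500
```

A large gap between RMSE and MAE is a hint that a few predictions are far off, while the rest are close.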
A good way to see model quality is to plot predicted price against actual price. Points hugging the diagonal line are accurate predictions.
You should also check the residuals, the prediction errors. If the model is well behaved, residuals scatter randomly around zero with no obvious pattern. A pattern would signal that the linear form is missing something.
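A numeric version of that check can be sketched without a plot. The example below fits a straight line to synthetic data that is deliberately curved, then averages the residuals by region to expose the pattern. The data, seed, and three-way split are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic data with a curved relationship, to show what a "pattern" looks like
x = np.linspace(0, 10, 100)
y = 2.0 + 0.5 * x ** 2 + rng.normal(0, 1.0, size=100)

# Fit a straight line by least squares (polyfit returns the slope first for deg=1)
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# Least squares with an intercept makes training residuals average to zero,
# so the overall mean cannot reveal a problem
print("mean residual:", abs(residuals.mean()) < 1e-8)

# Averaging residuals over low, middle, and high x exposes the missed curve
low, mid, high = np.array_split(residuals, 3)
print("by region:", round(low.mean(), 2), round(mid.mean(), 2), round(high.mean(), 2))
```

Positive residuals at both ends and negative ones in the middle mean the line sits below the data at the extremes and above it in between, which is the signature of a relationship that is not linear.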
A Note on Gradient Descent in scikit-learn
You wrote gradient descent by hand to understand it, but scikit-learn ships a production version called SGDRegressor (stochastic gradient descent). By default it minimizes a squared-error cost with a small L2 penalty added, and it is a great choice when datasets grow too large for the closed-form solution. On the scaled car features, it lands in essentially the same place as ordinary least squares.
from sklearn.linear_model import SGDRegressor
sgd = SGDRegressor(max_iter=2000, random_state=42, eta0=0.01)
sgd.fit(X_train_scaled, y_train)
sgd_r2 = sgd.score(X_test_scaled, y_test)
print(f"OLS R^2=0.793 SGD R^2={sgd_r2:.3f}")
# Output: OLS R^2=0.793 SGD R^2=0.795
The two methods reach essentially the same answer (0.793 versus 0.795) by different routes: least squares solves for the minimum directly, while gradient descent walks down to it. Understanding both is what lets you choose the right tool as your problems scale up.
Practice Exercises
Now it is your turn. Try these before checking the hints.
Exercise 1: Fit a Different Simple Regression
Build a simple linear regression that predicts price from curb_weight instead of engine_size. Print the slope, intercept, and \(R^2\). Does weight explain more or less of the price variation than engine size did?
import pandas as pd
from sklearn.linear_model import LinearRegression
df = pd.read_csv("automobiles.csv")
# Your code here
Hint
Set X = df[["curb_weight"]] and y = df["price"], then model.fit(X, y). Read the slope from model.coef_[0], the intercept from model.intercept_, and the fit from model.score(X, y). Curb weight is a strong predictor too, so expect an \(R^2\) in a similar range to engine size.
Exercise 2: Add a Predictor to the Multiple Regression
Take the five-feature model from the lesson and add wheel_base as a sixth predictor. Re-split (with random_state=42), re-scale on the training set only, refit, and print the test \(R^2\). Did the extra feature help?
# Your code here (reuse the train_test_split and StandardScaler pattern from the lesson)
Hint
Add "wheel_base" to the features list, then repeat the exact split-scale-fit-score pipeline. Compare the new test \(R^2\) to the lesson’s 0.793. A small change in either direction is normal; not every added feature improves an honest test score.
Exercise 3: Experiment with the Learning Rate
Modify the from-scratch gradient descent loop to try eta = 1.05 for 60 iterations on the standardized engine_size and price. Print the final slope and MSE. What happens, and why?
import numpy as np
# Standardize x and t as in the lesson, then run the loop with eta = 1.05
Hint
Reuse the standardization and the update rule, changing only the learning rate. With eta = 1.05 the steps overshoot the minimum and the loss grows each iteration: the slope and MSE blow up to huge values instead of settling near 0.841 and 0.29. This is divergence, the failure mode of a learning rate set too high.
Summary
Congratulations! You have built your first regression models, from a single-predictor line to a five-feature model evaluated on unseen cars, and you have seen how gradient descent finds the best fit. Let’s review what you learned.
Key Concepts
What Regression Is
- Regression models the relationship between predictors (X) and a numeric outcome (y)
- An outcome splits into a systematic part and an error term the predictors cannot explain
- Linear regression assumes the outcome is a linear combination of the predictors
Parameters
- The coefficient (slope) says how much the outcome changes per unit of a predictor
- The intercept is the predicted outcome when all predictors are zero, and often has no real-world meaning
- Parameters are learned from data, not chosen by hand
The Cost Function
- The best-fit line minimizes the sum of squared errors (the cost function)
- Squaring residuals stops positive and negative errors from canceling and penalizes large misses
- Gradient descent minimizes the cost by stepping downhill; the learning rate controls step size
- A learning rate too small crawls; too large overshoots and can diverge
Building and Evaluating Models
- scikit-learn uses one pattern for every model: instantiate, then .fit()
- Split into train and test sets so you evaluate on data the model never saw
- Standardize features (fit the scaler on train only) to compare coefficients and stabilize training
- Judge regressions with \(R^2\) (variance explained), RMSE, and MAE (typical error in target units)
Why This Matters
Linear regression is the foundation the rest of this module builds on. The habits you practiced here, defining a cost function, splitting data honestly, scaling features, and reading metrics on a held-out set, are exactly the habits every supervised learning project depends on, no matter how complex the model. You also saw that a model fit two completely different ways, closed-form least squares and iterative gradient descent, reaches the same answer, which is the bridge from classic statistics to modern machine learning. Master these basics and the more powerful models ahead become far easier to understand.
Next Steps
You now understand how a linear regression is fit and evaluated. In the next lesson, you will dig into what the slope and intercept actually mean, so you can explain your model’s predictions with confidence.
Continue to Lesson 2 - Interpreting Regression Parameters