Lesson 5 - Understanding Gradient Descent
Why a Model Needs to Search
In the earlier lessons of this module you fit linear regression with a single call to .fit(), and scikit-learn handed you the coefficients. That is convenient, but it hides the most important idea in machine learning: how does a model decide which coefficients are best?
Picture fitting a line to the automobiles data. You want to predict a car’s price from its engine_size. A line has two numbers, a slope and an intercept. There are infinitely many lines you could draw. Most of them are terrible. A few are good. Exactly one is best. The model’s job is to search through all those possibilities and land on the best one. Gradient descent is the search algorithm that does this, and it powers far more than linear regression: it trains neural networks, logistic regression, and most of modern machine learning.
By the end of this lesson, you will be able to:
- Explain why a model needs a cost function and what it measures
- Describe gradient descent as stepping downhill on a cost surface
- Read and interpret the gradient descent update rule and its mathematics
- Explain how the learning rate controls the size of each step and what goes wrong when it is too small or too large
- Distinguish a global minimum from a local minimum and explain why convex cost functions make linear regression easy
You should be comfortable with basic Python, pandas, and the linear regression you saw earlier in this module. No calculus background is required; the small amount you need is explained as you go. Let’s begin.
Setting the Stage with Real Data
Throughout this lesson you will reason about one concrete problem: predicting a car’s price from its engine size. Load the automobiles dataset so the numbers are real, not hypothetical.
import pandas as pd
df = pd.read_csv("automobiles.csv") # download: https://datatweets.com/datasets/automobiles.csv
print("Shape:", df.shape)
print(df["price"].describe()[["mean", "min", "max"]].round(0))
# Output:
# Shape: (159, 26)
# mean 11446.0
# min 5118.0
# max 35056.0
# Name: price, dtype: float64
The dataset has 159 rows and 26 columns, with no missing values. Each row is one car model. You will focus on just two columns for most of this lesson, because two columns are easy to picture, but the same machinery scales to dozens of features.
A Small Data Dictionary
Here are the columns that matter for this lesson and the next few.
| Column | Type | Meaning |
|---|---|---|
| price | int | Target: the car’s price in US dollars (about 5,118 to 35,056) |
| engine_size | int | Engine displacement in cubic inches |
| horsepower | int | Engine power output |
| curb_weight | int | Weight of the car in pounds |
| width | float | Body width in inches |
| highway_mpg | int | Fuel economy on the highway, in miles per gallon |
For now, pull out the single feature and the target you will model.
X = df[["engine_size"]] # one feature, kept as a 2D table
y = df["price"] # the target we want to predict
print(X.shape, y.shape)
# Output: (159, 1) (159,)
Step One: Measuring How Wrong a Line Is
Before a model can search for the best line, it needs a way to score any given line. That score is called a cost function (also a loss function). The cost function takes a candidate set of coefficients and returns a single number: how badly that line fits the data. A perfect fit scores zero; the worse the fit, the higher the cost.
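For linear regression the standard choice is mean squared error (MSE). For each car, you take the difference between the true price and the price the line predicts, square it, and average over all the cars. Writing the line as $\hat{y} = w x + b$, where $w$ is the slope and $b$ is the intercept, the cost is:
$$J(w, b) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$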
Three things are worth noticing. The error is squared, so a prediction that is off by 2,000 dollars is punished four times as hard as one off by 1,000 dollars; large mistakes dominate the cost. Squaring also makes every term positive, so errors above and below the line cannot cancel out. And the result is averaged, so the number does not balloon just because you have more cars.
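A tiny sketch of those three effects, using made-up dollar errors rather than anything from the automobiles data:

```python
# Toy errors in dollars (hypothetical values, not from automobiles.csv)
errors = [1000, -1000, 2000]

squared = [e ** 2 for e in errors]

# Doubling the size of an error quadruples its penalty
ratio = squared[2] / squared[0]

# Raw errors above and below the line cancel out; squared errors cannot
mean_raw = sum(errors[:2]) / 2
mean_squared = sum(squared[:2]) / 2

print("penalty ratio (2000 vs 1000):", ratio)
print("mean of raw errors          :", mean_raw)
print("mean of squared errors      :", mean_squared)
```

The raw errors of +1000 and -1000 average to zero, which would wrongly suggest a perfect fit, while their squares average to one million.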
To make this concrete, score two candidate lines on the real cars and watch the cost function do its job. You do not need any machine learning library for this; the cost function is just arithmetic.
import numpy as np
x = df["engine_size"].to_numpy()
y = df["price"].to_numpy()
def mse(slope, intercept):
    predictions = slope * x + intercept
    return np.mean((y - predictions) ** 2)
# A guess that is clearly too flat, and one closer to the truth
print("flat guess :", round(mse(50, 0)))
print("better guess:", round(mse(162, -7914)))
# Output:
# flat guess : 51743975
# better guess: 10023135
The “better guess” scores a far lower cost than the flat one, which is exactly what you want: the line that hugs the data more closely earns the smaller number. Gradient descent’s whole job is to keep lowering that number until it cannot go any lower.
The fitted line from earlier in this module, found by minimizing exactly this cost, was:
price = -7914.1 + 162.38 * engine_size (R^2 = 0.708)
Below, that best-fit line is drawn through the scatter of real cars. Each vertical gap between a point and the line is one residual, and MSE is the average of those gaps squared.
Now the search problem is sharp. Gradient descent looks for the slope and intercept that make this MSE as small as possible.
Why squared error, not absolute error?
You could measure error with absolute differences instead of squares. Squared error has two advantages that matter for optimization: it is smooth everywhere (it has a well-defined slope at every point, which gradient descent needs), and it produces a single bowl-shaped surface with one clear bottom. Those properties are exactly what make linear regression so easy to train.
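To see the smoothness point in numbers, here is a minimal sketch comparing the slope of each error measure for a single residual. The two helper functions are hypothetical names introduced only for this illustration.

```python
def squared_error_slope(e):
    # Derivative of e**2 with respect to e: shrinks smoothly as e approaches 0
    return 2 * e

def absolute_error_slope(e):
    # Derivative of |e| with respect to e: always +1 or -1, with a kink at 0
    if e == 0:
        raise ValueError("|e| has a kink at 0, so there is no single slope there")
    return 1.0 if e > 0 else -1.0

for e in [100.0, 10.0, 1.0, 0.1]:
    print(f"error={e:6.1f}  squared slope={squared_error_slope(e):6.1f}  "
          f"absolute slope={absolute_error_slope(e):4.1f}")
```

The squared-error slope fades as the error shrinks, which gives a downhill walker a natural signal that it is close to the bottom. The absolute-error slope keeps the same size right up to the kink.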
Step Two: The Cost Surface Is a Landscape
Here is the mental shift that makes gradient descent click. The cost function takes coefficients as input and returns a cost as output. If you imagine sweeping the slope and intercept across all their possible values and plotting the resulting MSE as height, you get a surface, a landscape of hills and valleys.
For mean squared error with a linear model, that landscape has a very special shape: a single smooth bowl. There is exactly one lowest point, and that point corresponds to the best-fitting line. Training the model means finding the bottom of the bowl.
But there is a catch that makes this interesting. The model cannot see the whole landscape at once. It is like standing on a foggy mountainside: you can feel the slope right under your feet, but you cannot see where the valley is. All you can do is sense which way is downhill and take a step in that direction. Repeat enough times and you reach the bottom. That is the entire idea of gradient descent.
Each dot on that curve is one candidate line. A dot high on the wall is a poorly fitting line with large MSE. The dot at the bottom is the best-fitting line. Gradient descent starts somewhere on the wall and walks down, step by step, until it can go no lower.
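One rough way to picture the landscape is to evaluate the cost on a grid of candidate lines. The sketch below uses five made-up points that lie exactly on y = 2x + 1 (an assumption chosen so the bottom of the bowl is known in advance), not the automobiles data.

```python
# Toy data lying exactly on y = 2x + 1 (hypothetical, for illustration only)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

def mse(slope, intercept):
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)

slopes = [i * 0.5 for i in range(9)]            # 0.0, 0.5, ..., 4.0
intercepts = [-1 + i * 0.5 for i in range(9)]   # -1.0, -0.5, ..., 3.0

# The "landscape": one cost value (height) per (slope, intercept) pair
surface = {(s, b): mse(s, b) for s in slopes for b in intercepts}
best = min(surface, key=surface.get)

print("lowest cost on the grid:", surface[best], "at (slope, intercept) =", best)

# One slice through the bowl, with the intercept held at 1.0
for s in slopes:
    print(f"slope={s:3.1f}  cost={mse(s, 1.0):6.2f}")
```

A grid search like this only works for a couple of coefficients. Gradient descent avoids computing the whole surface by probing only the slope under its feet.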
Step Three: The Gradient Points Downhill
To take a step downhill, the algorithm needs to know which direction is downhill. That is what the gradient tells it.
The gradient is the slope of the cost surface. In one dimension it is just the derivative; in many dimensions it is the collection of partial derivatives, one per coefficient. The crucial fact is this: the gradient points in the direction of steepest increase. So to go downhill, you step in the opposite direction, the negative gradient.
For mean squared error, the slope of the cost with respect to a weight works out to a clean formula. If the model is $\hat{y} = w x + b$, then the partial derivative of the cost $J$ with respect to $w$ is:
$$\frac{\partial J}{\partial w} = -\frac{2}{n} \sum_{i=1}^{n} x_i \left( y_i - \hat{y}_i \right)$$
You do not need to derive this by hand; the point is that the gradient is computable from the data and the current predictions. When the predictions are too low, the term $(y_i - \hat{y}_i)$ is positive, so for positive feature values the gradient is negative and the update pushes the weight up. When the predictions are too high, the gradient flips sign and pulls the weight back down. The math automatically corrects in the right direction.
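If you would rather check the formula than derive it, a finite-difference test is one option. The sketch below compares the analytic expression, -(2/n) * sum(x * (y - y_hat)), with a numerical slope measured by nudging the weight slightly. The four data points and the step size h are assumptions made for this illustration.

```python
# Hypothetical toy data, roughly on the line y = 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
n = len(xs)

def cost(w, b):
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / n

def grad_w(w, b):
    # Analytic partial derivative of the MSE with respect to w
    return -(2 / n) * sum(x * (y - (w * x + b)) for x, y in zip(xs, ys))

w, b, h = 0.5, 0.0, 1e-6

# Numerical slope: how much the cost changes when w moves by a tiny amount
numeric = (cost(w + h, b) - cost(w - h, b)) / (2 * h)
analytic = grad_w(w, b)

print(f"numeric  gradient: {numeric:.4f}")
print(f"analytic gradient: {analytic:.4f}")
```

With w = 0.5 the line sits well below the points, so both versions come out negative and the update rule would raise the weight.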
The update rule that uses this gradient is the heart of the whole algorithm. With a learning rate $\alpha$ and cost $J$, each weight is nudged like so:
$$w \leftarrow w - \alpha \, \frac{\partial J}{\partial w}$$
Read it in plain English: new weight equals old weight minus a small step in the uphill direction, which is a small step downhill. The intercept is updated by exactly the same rule with its own partial derivative. Apply this repeatedly and the weights drift toward the bottom of the bowl.
The gradient does double duty
Notice that the gradient encodes two pieces of information at once: a direction (which way is downhill, from its sign) and a magnitude (how steep the slope is, from its size). On a steep wall the gradient is large and steps are big; near the flat bottom the gradient shrinks toward zero and steps get tiny. Gradient descent naturally slows down as it approaches the minimum.
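The shrinking steps are easy to watch on a toy bowl. This sketch uses the one-dimensional cost (w - 3)**2 as a stand-in for an MSE surface; the function, starting point, and learning rate are all assumptions picked for readability.

```python
def grad(w):
    # Slope of the toy bowl J(w) = (w - 3) ** 2, whose minimum is at w = 3
    return 2 * (w - 3)

w, alpha = 0.0, 0.1
step_sizes = []

for i in range(10):
    step = alpha * grad(w)   # sign gives the direction, size gives the stride
    w -= step
    step_sizes.append(abs(step))
    print(f"iter {i + 1:2d}  w={w:.4f}  step size={abs(step):.4f}")
```

The learning rate never changes, yet every stride is shorter than the last because the gradient itself is fading as the walker nears the bottom.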
Watching a Single Step
It helps to see one step happen on the real, standardized data. Below, the feature and target are standardized (the smart preprocessing you will see justified shortly), the weight starts at 0, and you take exactly one gradient step with learning rate 0.1. Watch the cost drop and the weight move toward its eventual home of about 0.841.
from sklearn.preprocessing import StandardScaler
xs = StandardScaler().fit_transform(df[["engine_size"]]).ravel()
ys = StandardScaler().fit_transform(df[["price"]]).ravel()
w, b, alpha, n = 0.0, 0.0, 0.1, len(xs)
cost_before = np.mean((ys - (w * xs + b)) ** 2)
grad_w = -(2 / n) * np.sum(xs * (ys - (w * xs + b))) # the slope of the cost in w
w = w - alpha * grad_w # one step downhill
cost_after = np.mean((ys - (w * xs + b)) ** 2)
print(f"grad_w : {grad_w:.3f}")
print(f"w after 1 step : {w:.3f}")
print(f"cost: {cost_before:.3f} -> {cost_after:.3f}")
# Output:
# grad_w : -1.683
# w after 1 step : 0.168
# cost: 1.000 -> 0.745
The gradient came out negative, so the update pushed the weight up from 0 to 0.168, and the cost fell sharply from 1.000 to 0.745. That is one downhill step. Repeat it sixty times and the weight climbs to 0.841 while the cost bottoms out near 0.292. You will run that full loop in the next lesson.
Step Four: The Learning Rate Controls Your Stride
The update rule has one knob you choose yourself: the learning rate $\alpha$. It scales every step. Pick it well and the algorithm glides to the bottom in a few dozen steps. Pick it badly and training crawls, or blows up entirely.
There are three regimes to understand:
- Too small (for example $\alpha = 0.001$): every step is a baby step. The algorithm does head downhill, but it takes far too many iterations to get there. Training is correct but painfully slow.
- Just right (for example $\alpha = 0.1$): steps are large enough to make real progress, small enough to stay controlled. The cost drops quickly and settles at the minimum.
- Too large: steps overshoot the bottom of the bowl, landing on the opposite wall higher than before. The next step overshoots again, even harder. The cost bounces and can diverge to infinity.
The chart below runs gradient descent on the standardized engine-size-to-price problem with all three learning rates and plots the cost at each iteration. The small rate inches down, the good rate plunges and flattens, and the large rate refuses to settle.
With a well-chosen learning rate, the cost falls steeply at first (you are high on a steep wall, so the gradient is large) and then levels off as you near the bottom (the wall flattens, the gradient shrinks). That smooth, decreasing curve is the signature of healthy training.
You will write the exact code that produces these curves in the next lesson. For now, hold on to the result that this run reaches: with a learning rate of 0.1 over 60 iterations on the standardized data, the algorithm converges to a weight of about 0.841, a bias of essentially 0, and a final MSE of about 0.292.
# What the next lesson's run produces (standardized engine_size -> price):
# lr=0.1 final w=0.841 b=0.000 final MSE=0.2919
That weight of 0.841 is not arbitrary. On standardized data the best-fit slope equals the correlation between the two variables, and gradient descent has discovered it by feel, one downhill step at a time, never being told the answer.
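As a rough preview on toy data (six made-up points, not the automobiles run quoted above), the sketch below standardizes both variables, runs 60 iterations at three learning rates, and compares the result with the correlation. The rate 1.1 is my own choice for the "too large" case: on a single standardized feature with this gradient, each update multiplies the distance to the optimum by (1 - 2 * alpha), so the cost grows once alpha exceeds 1.

```python
import math

# Hypothetical toy data, for illustration only
x_raw = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_raw = [1.2, 1.9, 3.2, 3.8, 5.1, 5.8]

def standardize(values):
    mean = sum(values) / len(values)
    # Population standard deviation, the same convention StandardScaler uses
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

xs, ys = standardize(x_raw), standardize(y_raw)
n = len(xs)

# On standardized data the correlation is just the mean of the products
r = sum(x * y for x, y in zip(xs, ys)) / n

def run(alpha, iterations=60):
    w = b = 0.0
    for _ in range(iterations):
        residuals = [y - (w * x + b) for x, y in zip(xs, ys)]
        grad_w = -(2 / n) * sum(x * e for x, e in zip(xs, residuals))
        grad_b = -(2 / n) * sum(residuals)
        w -= alpha * grad_w
        b -= alpha * grad_b
    cost = sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / n
    return w, cost

results = {alpha: run(alpha) for alpha in (0.001, 0.1, 1.1)}

print(f"correlation r = {r:.4f}")
for alpha, (w, cost) in results.items():
    print(f"alpha={alpha:<6} final w={w:12.4f}  final cost={cost:14.4f}")
```

The middle rate lands on a weight that matches the correlation, the small rate is still far from it after 60 steps, and the large rate ends with a cost above where it started.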
Standardize before gradient descent
Gradient descent is sensitive to the scale of your features. curb_weight runs in the thousands while width is around 65; a single learning rate cannot suit both at once, and the cost surface becomes a long, skewed valley that is slow to descend. Standardizing every feature to mean 0 and standard deviation 1 with StandardScaler reshapes the surface into a round bowl and lets one learning rate work for all features. This is why the run above standardizes both engine_size and price first.
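One way to quantify the "skewed valley" is to compare how sharply the cost curves along each weight. For MSE, the second derivative with respect to a feature's weight is 2 * mean(x**2), so a feature with large values makes its direction far steeper. The sketch below uses two hypothetical columns on very different scales; the names and values are assumptions, not rows from the dataset.

```python
import math

# Hypothetical columns on very different scales (illustration only)
features = {
    "big_scale":   [2000.0, 2500.0, 3000.0, 3500.0],
    "small_scale": [63.0, 65.0, 66.0, 68.0],
}

def standardize(values):
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

def curvature(values):
    # Second derivative of the MSE with respect to this feature's weight
    return 2 * sum(v * v for v in values) / len(values)

raw = {name: curvature(vals) for name, vals in features.items()}
scaled = {name: curvature(standardize(vals)) for name, vals in features.items()}

for name in features:
    print(f"{name:12s} raw curvature={raw[name]:14.1f}  standardized={scaled[name]:.3f}")
```

Before scaling, the two directions differ in steepness by a factor of more than a thousand, so no single step size fits both. After scaling, both curvatures equal 2 and the bowl is round.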
When Does It Stop?
The loop has to end somewhere. There are two common ways to decide when gradient descent is done, and most implementations use both as a safety net.
- A fixed number of iterations. You simply run the update rule a set number of times, like the 60 iterations above. This guarantees the algorithm halts, but you have to pick a number large enough to reach the bottom yet not so large that you waste time spinning at a flat minimum.
- A tolerance on progress. You stop early once a step barely changes the cost, say when the cost improves by less than 0.0001 from one iteration to the next. Because the gradient shrinks near the bottom, the steps get tiny there, so this is a natural signal that you have arrived.
scikit-learn’s SGDRegressor bundles both: max_iter caps the iterations and a tol parameter stops early when progress stalls. Knowing these two stopping rules explains the max_iter=2000 you will pass to it shortly. It is a generous ceiling that almost never gets hit, because the tolerance trips first once the cost flattens out.
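The two stopping rules can be combined in a few lines. This is a minimal sketch on a toy bowl, not scikit-learn's actual implementation; the function name descend and its defaults are assumptions chosen to echo max_iter and tol.

```python
def cost(w):
    # Toy bowl standing in for an MSE surface, minimum at w = 3
    return (w - 3) ** 2

def grad(w):
    return 2 * (w - 3)

def descend(alpha=0.1, max_iter=2000, tol=1e-4):
    w = 0.0
    previous = cost(w)
    for iteration in range(1, max_iter + 1):
        w -= alpha * grad(w)
        current = cost(w)
        # Rule 2: stop early once the improvement falls below the tolerance
        if previous - current < tol:
            return w, iteration, "tolerance"
        previous = current
    # Rule 1: the iteration cap was reached first
    return w, max_iter, "max_iter"

w, iterations, reason = descend()
print(f"stopped by {reason} after {iterations} iterations at w={w:.4f}")

w_capped, iterations_capped, reason_capped = descend(max_iter=5)
print(f"stopped by {reason_capped} after {iterations_capped} iterations at w={w_capped:.4f}")
```

With a generous cap the tolerance ends the run long before 2000 iterations. With a cap of 5 the loop halts while the cost is still falling.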
Step Five: Global vs. Local Minima
There is one more idea that separates the easy case from the hard case: the difference between a global minimum and a local minimum.
A global minimum is the single lowest point of the entire cost surface, the best possible coefficients. A local minimum is a point that is lower than everything immediately around it, a small valley, but not the deepest valley overall. If a cost surface has several valleys, gradient descent can get trapped in a shallow one: it walks downhill, reaches the bottom of a minor dip, finds no downhill direction, and stops, even though a far deeper valley exists elsewhere.
A bumpy cost surface (the hard case)
cost
| * *
| * * * * <- a local minimum traps a
| * * * * * downhill walker here...
| * * * *
| (A) * *
| * *
| (B) <- ...even though (B) is lower
+-----------------------------> coefficient
Where you start on a bumpy surface decides which valley you fall into, which means the answer depends on luck. This is a genuine and famous problem for complicated models like deep neural networks, whose cost surfaces are full of bumps.
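You can reproduce this trap with a made-up bumpy function. The sketch below uses f(x) = x**4 - 3*x**2 + x, an arbitrary two-valley curve chosen for illustration (it is not the cost of any model in this lesson), and starts the same downhill walk from two different places.

```python
def f(x):
    # A made-up bumpy "cost" with two valleys of different depth
    return x ** 4 - 3 * x ** 2 + x

def f_prime(x):
    return 4 * x ** 3 - 6 * x + 1

def descend(start, alpha=0.01, iterations=1000):
    x = start
    for _ in range(iterations):
        x -= alpha * f_prime(x)   # the usual step against the slope
    return x

left = descend(-2.0)    # starts on the left wall
right = descend(2.0)    # starts on the right wall

print(f"start=-2.0 -> x={left:.4f}, cost={f(left):.4f}")
print(f"start=+2.0 -> x={right:.4f}, cost={f(right):.4f}")
```

Both walks stop where the slope is zero, but the one that started on the right settles in the shallower valley and never learns that a deeper one exists.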
Linear regression is spared this trouble entirely. Because mean squared error with a linear model is convex, its surface is a single smooth bowl with exactly one minimum. There are no traps. No matter where you start, downhill always leads to the same bottom, the global minimum. That is why the simple regression line in this module is uniquely determined and why gradient descent on it is so reliable.
You can convince yourself that the bowl really has one bottom by sweeping the slope across many values, fixing the intercept at its best value, and watching the cost trace out a single clean U.
slopes = np.linspace(0, 320, 9)
for s in slopes:
    print(f"slope={s:6.1f} cost={mse(s, -7914):,.0f}")
# Output:
# slope= 0.0 cost=409,131,032
# slope= 40.0 cost=236,717,496
# slope= 80.0 cost=112,742,240
# slope= 120.0 cost= 37,205,262
# slope= 160.0 cost= 10,106,565
# slope= 200.0 cost= 31,446,146
# slope= 240.0 cost=101,224,007
# slope= 280.0 cost=219,440,147
# slope= 320.0 cost=386,094,566
The cost falls, reaches its lowest value near a slope of 160 (right where the fitted slope of 162.38 sits), then climbs again. One descent, one valley, one minimum. Gradient descent cannot get lost on a surface like this.
Convexity is a gift
A function is convex if a straight line between any two points on its surface never dips below the surface itself. In practice it means one bowl, one bottom, no local-minimum traps. Linear regression with squared error is convex, so its best line is guaranteed to exist and to be unique. When you move on to non-convex models later, you will appreciate just how much this guarantee buys you.
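Both halves of that claim can be spot-checked numerically. The sketch below uses three made-up points and holds the intercept at 0 to keep things one-dimensional (both are assumptions for illustration). It first tests the chord definition on a few pairs of weights, then starts gradient descent from three very different places.

```python
# Hypothetical toy data; intercept fixed at 0 so the cost depends on w alone
xs = [-1.0, 0.0, 1.0]
ys = [-0.9, 0.1, 0.8]
n = len(xs)

def cost(w):
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / n

def grad(w):
    return -(2 / n) * sum(x * (y - w * x) for x, y in zip(xs, ys))

def descend(start, alpha=0.1, iterations=200):
    w = start
    for _ in range(iterations):
        w -= alpha * grad(w)
    return w

# Chord check: the curve at the midpoint never rises above the chord's midpoint
pairs = [(-50.0, 50.0), (-3.0, 0.5), (0.0, 10.0)]
chord_ok = all(cost((a + b) / 2) <= (cost(a) + cost(b)) / 2 for a, b in pairs)

# Every starting point should roll to the same bottom
finals = [descend(start) for start in (-50.0, 0.0, 50.0)]

print("chord condition holds on sampled pairs:", chord_ok)
print("final weights from three starts:", [round(w, 6) for w in finals])
```

A handful of sampled pairs is evidence rather than proof, but it matches the guarantee: every start converges to the same weight.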
Putting the Pieces Together
You now have every conceptual ingredient of gradient descent. Here is the whole algorithm in one place, in pseudocode, so you can see how the parts connect before you implement them for real next lesson.
# Gradient descent in pseudocode (you will write the real version next lesson)
#
# 1. Standardize the features and target.
# 2. Initialize the weight and bias (often to 0).
# 3. Choose a learning rate alpha and a number of iterations.
# 4. Repeat for each iteration:
# a. Predict: y_hat = w * x + b
# b. Measure: cost = mean((y - y_hat) ** 2) # the MSE cost function
# c. Slope: grad_w, grad_b = gradient of cost # which way is uphill
# d. Step: w = w - alpha * grad_w # step downhill
# b = b - alpha * grad_b
# 5. Return the final w and b.
Trace this lesson's five ideas through it: the cost function (step b) scores each line, the gradient (step c) finds downhill, the learning rate (step d) sets the stride, the iteration loop repeats the descent, and convexity guarantees you reach the one true bottom. Everything you read above maps onto a single line of this loop.
And here is the payoff worth remembering: scikit-learn’s LinearRegression solves this exact problem with a one-shot formula, but its cousin SGDRegressor solves it with stochastic gradient descent, a variant of the algorithm sketched above that updates the weights after each training example rather than after a full pass over the data. You can put them side by side on the full five-feature automobiles model from earlier in this module.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, SGDRegressor
features = ["engine_size", "horsepower", "curb_weight", "width", "highway_mpg"]
X = df[features]
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train) # fit on train only
X_test = scaler.transform(X_test)
ols = LinearRegression().fit(X_train, y_train)
sgd = SGDRegressor(max_iter=2000, random_state=42, eta0=0.01).fit(X_train, y_train)
print(f"OLS test R^2: {ols.score(X_test, y_test):.3f}")
print(f"SGD test R^2: {sgd.score(X_test, y_test):.3f}")
# Output:
# OLS test R^2: 0.793
# SGD test R^2: 0.795
The two land in almost the same place: ordinary least squares scores a test R-squared of 0.793, and gradient descent via SGDRegressor scores 0.795. Stepping downhill by feel reaches essentially the same answer as the exact closed-form formula. That is gradient descent earning its keep, and it is exactly why the same algorithm scales to models that have no closed-form solution at all.
The chart below makes the agreement visual: the two methods’ test predictions track each other almost perfectly.
Practice Exercises
Work through these on paper or in a notebook before peeking at the hints. They reinforce the intuition without needing the full implementation yet.
Exercise 1: Score Two Lines by Hand
Take three tiny “cars” with engine_size values [100, 150, 200] and true prices [8000, 12000, 16000]. Compute the mean squared error for the line price = 80 * engine_size and for the line price = 90 * engine_size. Which line fits better, and what does the lower MSE tell you about it?
import numpy as np
x = np.array([100, 150, 200])
y = np.array([8000, 12000, 16000])
# Your code here: compute MSE for slope 80 and slope 90
Hint
Predictions are slope * x. The error per car is y - predictions; square those errors and average them with np.mean((y - predictions) ** 2). The line with the smaller MSE sits closer to the points, so it is the better fit. (Slope 80 predicts exactly [8000, 12000, 16000], so its MSE is 0.)
Exercise 2: Predict the Effect of the Learning Rate
Without writing any code, describe what you expect to happen to the loss curve in three runs of gradient descent on the same data: one with learning rate 0.001, one with 0.1, and one with 0.9. Sketch the three curves on the same axes.
Hint
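Think back to the three regimes. A tiny rate (0.001) makes a slowly, gently decreasing curve that has not reached the bottom by the end. A good rate (0.1) plunges then flattens. A large rate (0.9) overshoots the bottom on every step, so the weight zigzags from one wall of the bowl to the other; whether the cost still shrinks or starts to climb depends on how steep the bowl is, and once the rate is too large for that steepness the curve grows instead of settling. The middle rate is the one that settles smoothly at the minimum.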
Exercise 3: Spot the Local Minima
Given the cost values [3, 1, 1, 8, 6, 5, 2, 6, 4, 3] read left to right across a bumpy surface, how many local minima does it have, and which value is the global minimum? Explain why a downhill walker starting in the middle might never find the global minimum.
Hint
A local minimum is any valley: a value lower than the values on either side of it (a flat run counts as one valley, and an endpoint has only one neighbor to beat). Scan for dips: the pair of 1s near the start, the 2 in the middle, and the 3 at the end are all valleys, giving three local minima. The global minimum is the smallest of them, the 1. A walker starting near the 2 would descend into that valley and stop, never crossing the hill to reach the deeper 1. Convex surfaces avoid this entirely because they have only one valley.
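If you want to check your count afterwards, here is one possible scan. The helper local_minima is a hypothetical name, and it follows the convention used in the hint: a flat run counts as a single valley, and an endpoint counts if it is lower than its only neighbor.

```python
costs = [3, 1, 1, 8, 6, 5, 2, 6, 4, 3]

def local_minima(values):
    """Return (start, end) index ranges of valleys, merging flat runs."""
    valleys = []
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[j + 1] == values[i]:
            j += 1   # extend across a run of equal values
        left_higher = i == 0 or values[i - 1] > values[i]
        right_higher = j == len(values) - 1 or values[j + 1] > values[j]
        if left_higher and right_higher:
            valleys.append((i, j))
        i = j + 1
    return valleys

valleys = local_minima(costs)
global_minimum = min(costs)

for start, end in valleys:
    print(f"valley at positions {start}-{end} with cost {costs[start]}")
print("global minimum:", global_minimum)
```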
Summary
You now understand the engine that trains linear regression, not as a black box but as a search for the bottom of a bowl. Let’s review.
Key Concepts
The Cost Function
- A cost function scores how badly a set of coefficients fits the data; a perfect fit scores zero
- Linear regression uses mean squared error: the average of the squared gaps between predictions and truth
- Squaring punishes large errors heavily and keeps the surface smooth and bowl-shaped
Gradient Descent
- The cost function defines a landscape; training means finding its lowest point
- The gradient is the slope of that landscape and points uphill, so you step in the opposite direction
- The update rule nudges each weight a little downhill every iteration
The Learning Rate
- The learning rate scales each step: too small crawls, just right converges fast, too large overshoots and can diverge
- A healthy loss curve drops steeply, then flattens as the steps near the minimum shrink
- Standardize features first so a single learning rate works for all of them
Global vs. Local Minima
- A global minimum is the deepest valley; a local minimum is a shallow dip that can trap a downhill walker
- Mean squared error with a linear model is convex: one bowl, one bottom, no traps, so the best line is unique
Why This Matters
Gradient descent is not a detail of linear regression; it is the optimization algorithm behind most of machine learning, from logistic regression to deep neural networks. Every one of those models defines a cost function and walks downhill to minimize it, using exactly the cost-gradient-learning-rate machinery you learned here. Linear regression is the gentlest possible place to meet these ideas because its surface is a single guaranteed bowl. Master the intuition now, on real cars and real prices, and the same picture will carry you through every harder model that follows.
Next Steps
You understand why gradient descent works. In the next lesson you will write it from scratch in Python, watch the loss fall iteration by iteration, and reproduce the exact convergence numbers you saw quoted above.
Continue to Lesson 6 - Implementing Gradient Descent in Python
Code gradient descent from scratch and watch the loss fall.
Keep Building Your Skills
You just opened the hood on the optimizer that quietly powers nearly every model you will ever train. The next time you call .fit(), you will know what is happening underneath: a patient walk downhill on a cost surface, one gradient-guided step at a time. Carry that picture forward, because every algorithm in the rest of this program rests on it.