Lesson 6 - Implementing Gradient Descent in Python
How Does a Model Actually Learn?
In the earlier lessons you called LinearRegression().fit(X, y) and a fitted line appeared, complete with coefficients and an intercept. That single line of code hid a question worth answering: how did scikit-learn find those numbers? It did not try every possible line. For plain linear regression it solved for them directly with a least-squares routine, but that shortcut exists only for the simplest models. The general-purpose procedure, the one that quietly powers almost all of modern machine learning, from linear models to deep neural networks, is gradient descent.
In this lesson you will open the box. You will define what it means for a line to be “good,” turn that definition into a number you can minimize, and then write the loop that walks downhill toward the best line, one small step at a time. You will do it on the real automobiles dataset you have used throughout this module, so the parameters you compute by hand will line up with the ones scikit-learn gave you.
By the end of this lesson, you will be able to:
- Explain why a model needs a loss function and compute the mean squared error
- Derive and code the gradients of the loss with respect to the weight and bias
- Write a gradient descent loop from scratch that fits a line to real data
- Diagnose convergence by watching the loss curve fall across iterations
- Explain how the learning rate controls whether training crawls, converges, or diverges
You should be comfortable with basic Python, pandas, and NumPy, and you should have seen linear regression at least once. No calculus background is required; we will build up the few derivatives we need from scratch. Let’s begin.
The Data: Predicting Car Prices
You will work with the automobiles dataset, the classic catalog of imported cars where each row describes one model and its specifications. Your job is to predict a car’s price from its characteristics. Download it and load it with pandas.
import pandas as pd
df = pd.read_csv("automobiles.csv") # download: https://datatweets.com/datasets/automobiles.csv
print("Shape:", df.shape)
print(df["price"].describe()[["mean", "min", "max"]].round(0))
# Output:
# Shape: (159, 26)
# mean 11446.0
# min 5118.0
# max 35056.0
# Name: price, dtype: float64
The dataset has 159 rows and 26 columns, with no missing values, which keeps the focus on the algorithm rather than on cleaning. Prices range from about $5,118 to $35,056, with a mean near $11,446.
A Small Data Dictionary
You will only need a couple of numeric columns for this lesson, but here are the most useful ones across the module.
| Column | Type | Meaning |
|---|---|---|
| price | int | Target: list price in US dollars |
| engine_size | int | Engine displacement (a proxy for power) |
| horsepower | int | Rated horsepower |
| curb_weight | int | Weight of the car in pounds |
| width, length, wheel_base | float | Body dimensions in inches |
| city_mpg, highway_mpg | int | Fuel economy |
| make, body_style, fuel_type | category | Manufacturer and configuration |
To keep the math visible, we will train on a single feature: engine_size. A bigger engine usually means a more expensive car, so it is a sensible predictor of price, and using one feature lets us plot every step of the algorithm.
Why Not Just Solve It Directly?
A fair question: for linear regression there is a famous closed-form solution (the normal equation) that computes the best line in one shot, with no iteration at all. So why bother walking downhill?
The answer is generality. That direct formula exists only because linear regression’s loss is simple enough to solve with algebra. The moment you move to logistic regression, neural networks, or almost any modern model, no such formula exists; the only practical way to fit them is to start somewhere, measure the slope of the loss, and step downhill repeatedly. Gradient descent is that universal procedure. Learning it on linear regression, where you can check your answer against the exact solution, is the cleanest possible introduction. Everything you build here transfers unchanged to the harder models later.
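For reference, here is a minimal sketch of what "solving it directly" looks like. It uses synthetic stand-in data rather than the automobiles file, and it calls NumPy's least-squares solver (which solves the same problem as the normal equation, though not by forming that equation explicitly). The names x_demo, y_demo, w_exact, and b_exact are illustrative only.

```python
import numpy as np

# Synthetic stand-in data (NOT the automobiles dataset): y is roughly 3x + 2.
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=50)
y_demo = 3.0 * x_demo + 2.0 + rng.normal(0, 0.5, size=50)

# Design matrix: the feature plus a column of ones so the intercept is fitted too.
A = np.column_stack([x_demo, np.ones_like(x_demo)])

# One call, no iterations and no learning rate.
(w_exact, b_exact), *_ = np.linalg.lstsq(A, y_demo, rcond=None)

print(f"closed-form slope={w_exact:.3f}, intercept={b_exact:.3f}")
```

The recovered slope and intercept should land close to 3 and 2. This is the kind of exact answer the gradient descent loop in this lesson can be checked against.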
x_raw = df["engine_size"]
y_raw = df["price"]
print("engine_size range:", x_raw.min(), "to", x_raw.max())
print("price range: ", y_raw.min(), "to", y_raw.max())
# Output:
# engine_size range: 61 to 326
# price range: 5118 to 35056
Step 1: Measuring How Wrong a Line Is
Linear regression fits a line. With one feature, that line is just
$$\hat{y} = w x + b$$
where $\hat{y}$ is the predicted price, $x$ is the engine size, $w$ is the weight (the slope: how much price rises per unit of engine size), and $b$ is the bias (the intercept: the predicted price when $x = 0$). Different choices of $w$ and $b$ give different lines, and some fit the data far better than others.
To let the computer choose, we need to turn “better fit” into a single number. That number is the loss (also called the cost). The lower the loss, the better the line. For regression, the standard choice is the mean squared error (MSE): take each point’s error (actual minus predicted), square it so positive and negative errors do not cancel, and average over all $n$ points.
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Squaring has two effects worth understanding. It makes every error count as positive, and it punishes large errors much harder than small ones, so the line is pulled toward reducing its worst misses. An MSE of zero would mean the line passes through every point exactly; in practice, our goal is to make MSE as small as possible.
Here is the loss as a function. It takes the true values and the predictions and returns one number.
import numpy as np
def mean_squared_error(y_true, y_predicted):
    # average of the squared gaps between truth and prediction
    return np.mean((y_true - y_predicted) ** 2)
# A quick sanity check with a deliberately bad line: w=0, b=0 (predict 0 for everything)
bad_predictions = 0 * x_raw + 0
print("MSE of the zero line:", round(mean_squared_error(y_raw, bad_predictions), 1))
# Output:
# MSE of the zero line: 175763392.6
The “predict zero for everything” line is terrible, and its enormous MSE says so. Our task is to find the $w$ and $b$ that drive this number down.
Why squared, not absolute?
You could measure error with the absolute difference instead of the square. Squaring is preferred here for two reasons: it penalizes big mistakes more heavily, and it produces a smooth, bowl-shaped loss surface whose slope we can compute exactly with calculus. That smoothness is what makes gradient descent work.
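A tiny numeric comparison makes the "penalizes big mistakes more heavily" point concrete. The two error arrays below are made up for illustration: both have the same average absolute error, but one concentrates it in a single large miss.

```python
import numpy as np

# Two hypothetical sets of prediction errors with the same total absolute error.
spread_out = np.array([1.0, 1.0, 1.0, 1.0])   # four small misses
one_big = np.array([0.0, 0.0, 0.0, 4.0])      # one large miss

for name, errors in [("spread_out", spread_out), ("one_big", one_big)]:
    mae = np.mean(np.abs(errors))   # mean absolute error
    mse = np.mean(errors ** 2)      # mean squared error
    print(f"{name:<10} MAE={mae:.1f}  MSE={mse:.1f}")
# Output:
# spread_out MAE=1.0  MSE=1.0
# one_big    MAE=1.0  MSE=4.0
```

The absolute error cannot tell the two cases apart, while the squared error rates the single large miss four times worse.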
Step 2: Standardizing So the Math Behaves
Engine sizes run from 61 to 326, while prices run into the tens of thousands. When two quantities live on such different scales, the loss surface becomes a long, narrow valley, and gradient descent zig-zags slowly down it. The cure is standardization: rescale each variable to have mean 0 and standard deviation 1.
$$z = \frac{x - \mu}{\sigma}$$
Here $\mu$ is the mean and $\sigma$ is the standard deviation. After standardizing both engine_size and price, both live on the same comfortable scale, and the algorithm converges quickly and predictably.
def standardize(series):
    return (series - series.mean()) / series.std()
x = standardize(x_raw).to_numpy() # standardized engine_size
y = standardize(y_raw).to_numpy() # standardized price
print("x mean/std:", round(x.mean(), 3), round(x.std(), 3))
print("y mean/std:", round(y.mean(), 3), round(y.std(), 3))
# Output:
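# x mean/std: -0.0 0.997
# y mean/std: 0.0 0.997
Both variables now sit near mean 0 with unit spread. The printed spread is not exactly 1 because pandas’ Series.std() divides by n − 1 while NumPy’s .std() divides by n; with 159 rows the ratio is about 0.997. A useful consequence of standardizing: the best-fit weight equals the correlation between the two variables, so we already have a target to check our answer against once training finishes.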
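The claim that the best-fit weight on standardized data equals the correlation can be checked numerically. The sketch below uses synthetic data (not the car dataset) and np.polyfit as a stand-in for an exact least-squares fit; standardize_np is a hypothetical NumPy version of the helper above.

```python
import numpy as np

# Synthetic stand-in data: a noisy linear relationship on arbitrary scales.
rng = np.random.default_rng(1)
x_demo = rng.normal(100, 20, size=200)
y_demo = 150 * x_demo + rng.normal(0, 2000, size=200)

def standardize_np(values):
    # ddof=1 matches the pandas default used by Series.std()
    return (values - values.mean()) / values.std(ddof=1)

xs, ys = standardize_np(x_demo), standardize_np(y_demo)

slope, intercept = np.polyfit(xs, ys, deg=1)   # exact least-squares line
r = np.corrcoef(x_demo, y_demo)[0, 1]          # Pearson correlation of the raw data

print(f"slope={slope:.4f}  correlation={r:.4f}  intercept={intercept:.4f}")
```

The slope and the correlation agree to floating-point precision, and the intercept is zero, which is why a bias near 0 is expected later in the lesson.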
Step 3: Which Way Is Downhill?
We have a loss we want to minimize. Gradient descent minimizes it by repeatedly asking, if I nudge $w$ and $b$ a little, which direction makes the loss go down? The answer comes from the gradient, which is just the slope of the loss surface in each direction. Where the slope is steep, we take a bigger step; where it flattens out near the bottom, the steps shrink automatically.
The loss depends on two variables, so we need the slope in each direction separately. These are the partial derivatives of the MSE with respect to $w$ and $b$. Working them out from the MSE formula (the chain rule does all the work) gives two clean expressions:
$$\frac{\partial\,\text{MSE}}{\partial w} = -\frac{2}{n} \sum_{i=1}^{n} x_i \,(y_i - \hat{y}_i), \qquad \frac{\partial\,\text{MSE}}{\partial b} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)$$
Read them in plain language. The bias gradient is proportional to the average error: if the line sits too low on average, the gradient pushes the intercept up. The weight gradient is the same idea, but each error is weighted by its $x$ value, so points far from the center have more say in the slope. Both are direct, mechanical translations into NumPy.
The minus sign in front of each formula comes from differentiating the error $y - \hat{y}$ with respect to the parameter; it is separate from the minus sign in the update step. In the update we subtract the gradient because the gradient points uphill: a negative gradient means the loss is still decreasing in the positive direction, so we move the parameter up; a positive gradient means the opposite. Following the negative of the gradient is what makes every sufficiently small step head downhill rather than up. You never have to reason about this case by case; the arithmetic handles the direction for you.
def gradients(x, y, w, b):
    n = len(x)
    y_predicted = w * x + b            # current line's predictions
    error = y - y_predicted            # how far off each point is
    dw = -(2 / n) * np.sum(x * error)  # slope of the loss w.r.t. the weight
    db = -(2 / n) * np.sum(error)      # slope of the loss w.r.t. the bias
    return dw, db
# At the starting point w=0, b=0 the line predicts 0 everywhere, so the error is just y
dw0, db0 = gradients(x, y, w=0.0, b=0.0)
print("Initial gradients -> dw:", round(dw0, 3), " db:", round(db0, 3))
# Output:
# Initial gradients -> dw: -1.671 db: -0.000
The negative weight gradient at the start tells us something concrete: increasing $w$ will decrease the loss. That is the nudge gradient descent is about to follow.
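A common way to gain confidence in hand-derived gradients is to compare them with a finite-difference estimate: nudge one parameter by a tiny amount and measure how the loss changes. The sketch below does this on synthetic data; the starting point (w0, b0) and step eps are arbitrary choices made for illustration.

```python
import numpy as np

def mse(x, y, w, b):
    return np.mean((y - (w * x + b)) ** 2)

def gradients(x, y, w, b):
    # Same analytic formulas as in the lesson.
    n = len(x)
    error = y - (w * x + b)
    return -(2 / n) * np.sum(x * error), -(2 / n) * np.sum(error)

# Synthetic stand-in data (not the automobiles dataset).
rng = np.random.default_rng(2)
x_demo = rng.normal(size=100)
y_demo = 0.8 * x_demo + rng.normal(scale=0.5, size=100)

w0, b0, eps = 0.3, -0.2, 1e-6
dw, db = gradients(x_demo, y_demo, w0, b0)

# Central differences: (loss a bit to the right - loss a bit to the left) / width.
dw_num = (mse(x_demo, y_demo, w0 + eps, b0) - mse(x_demo, y_demo, w0 - eps, b0)) / (2 * eps)
db_num = (mse(x_demo, y_demo, w0, b0 + eps) - mse(x_demo, y_demo, w0, b0 - eps)) / (2 * eps)

print(f"dw analytic={dw:.6f}  numeric={dw_num:.6f}")
print(f"db analytic={db:.6f}  numeric={db_num:.6f}")
```

If the analytic and numeric values disagree beyond rounding noise, the derivative (or its sign) was coded incorrectly.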
Step 4: The Gradient Descent Loop
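Now we assemble the pieces. The update rule is the heart of the algorithm. We move each parameter a small step in the opposite direction of its gradient (downhill, not uphill), and the size of that step is set by the learning rate $\alpha$.
$$w \leftarrow w - \alpha \frac{\partial\,\text{MSE}}{\partial w}, \qquad b \leftarrow b - \alpha \frac{\partial\,\text{MSE}}{\partial b}$$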
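To make the update concrete, here is a single step worked by hand. It plugs in the rounded initial gradients reported in Step 3 and the learning rate of 0.1 used in this lesson; the variable names w_next and b_next are illustrative.

```python
learning_rate = 0.1
w, b = 0.0, 0.0
dw, db = -1.671, 0.0   # rounded initial gradients from Step 3

# One step against the gradient: a negative gradient moves the parameter up.
w_next = w - learning_rate * dw
b_next = b - learning_rate * db

print(f"w: {w} -> {w_next:.4f}")
print(f"b: {b} -> {b_next:.4f}")
# Output:
# w: 0.0 -> 0.1671
# b: 0.0 -> 0.0000
```

The weight moves up because its gradient is negative, while the bias stays put because its gradient is zero.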
We repeat this update for a fixed number of iterations, recording the loss each time so we can watch progress. A few design notes on the function below:
- w and b start at zero. On standardized data, zero is a perfectly reasonable starting point.
- learning_rate and iterations are hyperparameters: knobs you set before training, not values the model learns.
- We append the current MSE to a history list every step purely so we can plot convergence afterward.
def gradient_descent(x, y, learning_rate=0.1, iterations=60):
    w, b = 0.0, 0.0   # initial weight and bias
    history = []      # MSE at each iteration, for plotting
    for i in range(iterations):
        y_predicted = w * x + b
        cost = mean_squared_error(y, y_predicted)
        history.append(cost)
        # how far downhill, and in which direction
        dw, db = gradients(x, y, w, b)
        # take one step against the gradient
        w = w - learning_rate * dw
        b = b - learning_rate * db
    return w, b, history
Let’s run it on the standardized car data with a learning rate of 0.1 for 60 iterations.
w, b, history = gradient_descent(x, y, learning_rate=0.1, iterations=60)
print(f"Learned weight (w): {w:.3f}")
print(f"Learned bias (b): {b:.3f}")
print(f"Final MSE: {history[-1]:.4f}")
# Output:
# Learned weight (w): 0.841
# Learned bias (b): 0.000
# Final MSE: 0.2919
The loop converged to a weight of about 0.841, a bias of essentially 0, and a final MSE near 0.2919. Recall the check from Step 2: on standardized data the best-fit weight equals the correlation between engine size and price. A weight of 0.841 says the two are strongly, positively related, exactly what we expected, and exactly what an analytical solution would give. Your from-scratch loop landed on the right answer.
Connecting back to the real coefficients
These numbers are in standardized units. If you reverse the standardization (multiply through by the price standard deviation and divide by the engine-size standard deviation), the slope maps back to roughly 162 dollars of price per unit of engine size, the same coefficient the simple LinearRegression model produced in the earlier lesson. Gradient descent and the closed-form solution agree, because they are minimizing the very same loss.
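The reverse mapping described above can be written out in a few lines. This sketch uses synthetic data whose slope and intercept (160 and -8000) are invented for illustration, and it uses np.polyfit in place of the gradient descent loop so the block stays short; the names w_raw and b_raw are hypothetical.

```python
import numpy as np

# Synthetic stand-in for engine size and price (values are made up).
rng = np.random.default_rng(3)
x_raw_demo = rng.uniform(60, 330, size=150)
y_raw_demo = 160 * x_raw_demo - 8000 + rng.normal(0, 3000, size=150)

x_mean, x_std = x_raw_demo.mean(), x_raw_demo.std(ddof=1)
y_mean, y_std = y_raw_demo.mean(), y_raw_demo.std(ddof=1)
xs = (x_raw_demo - x_mean) / x_std
ys = (y_raw_demo - y_mean) / y_std

# Fit in standardized units (polyfit stands in for the gradient descent loop).
w_std, b_std = np.polyfit(xs, ys, deg=1)

# Undo the standardization: scale the slope, then shift the intercept.
w_raw = w_std * y_std / x_std
b_raw = y_mean + y_std * b_std - w_raw * x_mean

# Compare with a fit made directly on the raw values.
w_direct, b_direct = np.polyfit(x_raw_demo, y_raw_demo, deg=1)
print(f"mapped back: slope={w_raw:.2f}, intercept={b_raw:.1f}")
print(f"direct fit:  slope={w_direct:.2f}, intercept={b_direct:.1f}")
```

Both routes give the same line, which is the sense in which the standardized weight and the raw-unit coefficient describe one model.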
Step 5: Watching the Loss Fall
A single final number hides the story. The real value of recording history is that we can plot the MSE at every iteration and watch the algorithm learn. A healthy run shows the loss dropping fast at first, when the gradients are steep, then leveling off as it approaches the minimum and the steps naturally shrink.
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 5))
plt.plot(range(len(history)), history, color="#2563eb")
plt.xlabel("Iteration")
plt.ylabel("MSE (standardized units)")
plt.title("Gradient descent loss curve")
plt.show()
The figure below shows that loss curve. The first handful of iterations do most of the work, slashing the MSE from 1.0 toward roughly 0.3, after which the curve flattens into a gentle approach to the minimum.
This shape is the signature of a well-behaved training run. When you train more complex models later, this same curve is the first thing you will look at: a smooth, decreasing loss means learning is working, while a curve that flatlines too early, bounces around, or shoots upward signals trouble.
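The same reading of the curve can be roughed out in code. The helper below is a hypothetical sketch, not part of any library: it labels a list of per-iteration losses using simple rules, and the tolerance is an arbitrary choice.

```python
import math

def describe_loss_curve(history, tolerance=1e-6):
    """Hypothetical helper: give a rough label to a list of per-iteration losses."""
    # A non-finite value, or ending above the starting loss, signals trouble.
    if any(not math.isfinite(c) for c in history) or history[-1] > history[0]:
        return "diverging"
    # Still dropping by more than the tolerance on the last step.
    if history[-2] - history[-1] > tolerance:
        return "still improving"
    return "flat"

print(describe_loss_curve([1.0, 0.6, 0.4, 0.3, 0.2919, 0.2919]))  # flat
print(describe_loss_curve([1.0, 0.98, 0.96, 0.94]))               # still improving
print(describe_loss_curve([1.0, 1.4, 2.3, 5.1]))                  # diverging
```

A "flat" label does not by itself say whether the run converged or stalled early, so the plot remains the better diagnostic; the helper only automates the first glance.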
Step 6: The Learning Rate Changes Everything
Of all the hyperparameters, the learning rate is the one that most often makes or breaks a training run. It sets the step size. Too small and the algorithm crawls, needing thousands of iterations to arrive. Too large and it overshoots the minimum, bouncing across the valley and sometimes diverging entirely. The 0.1 we used was a good middle choice; let’s see what happens on either side of it by running the same loop with three different rates.
for rate in [0.01, 0.1, 0.6]:
    _, _, hist = gradient_descent(x, y, learning_rate=rate, iterations=60)
    print(f"lr={rate:<5} MSE after 60 iters: {hist[-1]:.4f}")
# Output:
# lr=0.01 MSE after 60 iters: 0.5430
# lr=0.1 MSE after 60 iters: 0.2919
# lr=0.6 MSE after 60 iters: 0.2919
The numbers tell part of the story, and the next figure tells the rest. With lr=0.01 the steps are tiny, so after 60 iterations the loss is still well above the minimum: the run is simply not finished. With lr=0.1 the loss settles smoothly into the floor near 0.292. With lr=0.6 every step overshoots the best weight and lands on the far side of the valley, so the weight zig-zags toward its final value; each overshoot is smaller than the last, which is why the run still reaches the same floor here. On this standardized data the overshoots start growing once the rate passes roughly 1.0, and at that point the run diverges.
There is no universal best learning rate; it depends on the data and the model. The practical workflow is to try a few values spaced by powers of ten (0.001, 0.01, 0.1) and pick the largest one whose loss curve still falls smoothly. That gives you the fastest convergence that remains stable.
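For this one-feature problem the boundary between "stable" and "diverging" can be computed rather than guessed. On centered data, each update multiplies the weight's distance from its best value by 1 - 2 * rate * mean(x^2); the run converges while that factor stays between -1 and 1. The sketch below checks this on synthetic standardized data (not the car dataset); loss_history is a compact, illustrative copy of the loop above.

```python
import numpy as np

# Synthetic, standardized stand-in data.
rng = np.random.default_rng(4)
x_demo = rng.normal(size=159)
y_demo = 0.84 * x_demo + rng.normal(scale=0.55, size=159)
x_demo = (x_demo - x_demo.mean()) / x_demo.std(ddof=1)
y_demo = (y_demo - y_demo.mean()) / y_demo.std(ddof=1)

def loss_history(rate, iterations=60):
    w, b, losses = 0.0, 0.0, []
    n = len(x_demo)
    for _ in range(iterations):
        error = y_demo - (w * x_demo + b)
        losses.append(np.mean(error ** 2))
        w -= rate * (-(2 / n) * np.sum(x_demo * error))
        b -= rate * (-(2 / n) * np.sum(error))
    return losses

results = {}
for rate in [0.01, 0.1, 0.6, 1.05]:
    losses = loss_history(rate)
    # Per-step multiplier on the weight's distance from its optimum.
    factor = 1 - rate * 2 * np.mean(x_demo ** 2)
    # Did the loss ever go up between consecutive iterations (beyond rounding noise)?
    ever_rose = any(later > earlier + 1e-12 for earlier, later in zip(losses, losses[1:]))
    results[rate] = (factor, ever_rose, losses[-1])
    print(f"lr={rate:<5} factor={factor:+.3f}  loss ever rose: {ever_rose}")
```

A negative factor means each step overshoots; a magnitude above 1 means the overshoots grow. Larger models have no such tidy formula, which is why the practical advice is still to sweep a few rates and look at the curves.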
When training blows up
If your loss curve climbs instead of falling, or prints nan, your learning rate is almost certainly too high: the steps overshoot so badly that each update lands somewhere worse. Lower the rate by a factor of ten and try again. This single habit will save you hours of confusion when you train larger models.
Putting It All Together
Here is the entire algorithm, from raw data to a trained line, condensed into one runnable script. This is the template behind every gradient-based .fit() call.
import numpy as np
import pandas as pd
# 1. Load the real data
df = pd.read_csv("automobiles.csv") # download: https://datatweets.com/datasets/automobiles.csv
# 2. Standardize the single feature and the target
def standardize(s):
    return ((s - s.mean()) / s.std()).to_numpy()
x = standardize(df["engine_size"])
y = standardize(df["price"])
# 3. Loss and its gradients
def mean_squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)
def gradients(x, y, w, b):
    n = len(x)
    error = y - (w * x + b)
    dw = -(2 / n) * np.sum(x * error)
    db = -(2 / n) * np.sum(error)
    return dw, db
# 4. The gradient descent loop
def gradient_descent(x, y, learning_rate=0.1, iterations=60):
    w, b, history = 0.0, 0.0, []
    for _ in range(iterations):
        history.append(mean_squared_error(y, w * x + b))
        dw, db = gradients(x, y, w, b)
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b, history
# 5. Train and report
w, b, history = gradient_descent(x, y, learning_rate=0.1, iterations=60)
print(f"w={w:.3f} b={b:.3f} final MSE={history[-1]:.4f}")
# Output: w=0.841 b=0.000 final MSE=0.2919
In about 30 lines, with no machine learning library at all, you implemented the optimization idea that gradient-based training in scikit-learn, TensorFlow, and PyTorch relies on. The details get more elaborate for deep networks (more parameters, mini-batches, fancier update rules), but the core idea never changes: define a loss, compute its gradient, and step downhill.
Scaling Beyond One Feature
We used a single feature so we could see and plot everything, but nothing about gradient descent is limited to one input. With several features, the line becomes a plane (or a hyperplane), each feature gets its own weight, and the prediction is
$$\hat{y} = w_1 x_1 + w_2 x_2 + \dots + w_m x_m + b$$
The loss is still MSE, and the update rule is identical: each weight steps against its own partial derivative,
$$w_j \leftarrow w_j - \alpha \frac{\partial\,\text{MSE}}{\partial w_j}$$
where $\alpha$ is the learning rate.
In code, the only real change is replacing the scalar weight with a vector and using a matrix multiply for the predictions. The same loop, the same downhill step, just more bookkeeping.
def gradient_descent_multi(X, y, learning_rate=0.1, iterations=200):
    n, m = X.shape
    w = np.zeros(m)   # one weight per feature
    b = 0.0
    history = []
    for _ in range(iterations):
        y_pred = X @ w + b   # predictions for every row at once
        error = y - y_pred
        history.append(np.mean(error ** 2))
        w -= learning_rate * (-(2 / n) * (X.T @ error))   # vector of weight gradients
        b -= learning_rate * (-(2 / n) * np.sum(error))
    return w, b, history
This is the same idea scikit-learn’s gradient-based regressor uses to scale to the five-feature price model you built earlier (engine_size, horsepower, curb_weight, width, highway_mpg). You do not need to run it here; the point is that the algorithm you wrote by hand is the same one, just widened. In the next lesson you will let scikit-learn handle that widening for you and confirm it reaches the same answer.
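If you do want to see the widened loop at work, a quick check is to run it on a small synthetic problem and compare the result with an exact least-squares solve. Everything below (the three features, the weights in true_w, the noise level) is made up for illustration, and the function is repeated so the block runs on its own.

```python
import numpy as np

def gradient_descent_multi(X, y, learning_rate=0.1, iterations=200):
    n, m = X.shape
    w = np.zeros(m)   # one weight per feature
    b = 0.0
    history = []
    for _ in range(iterations):
        error = y - (X @ w + b)
        history.append(np.mean(error ** 2))
        w -= learning_rate * (-(2 / n) * (X.T @ error))
        b -= learning_rate * (-(2 / n) * np.sum(error))
    return w, b, history

# Synthetic data: three roughly standardized features and known weights.
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(300, 3))
true_w = np.array([0.5, -0.3, 0.2])
y_demo = X_demo @ true_w + rng.normal(scale=0.1, size=300)

w_gd, b_gd, history = gradient_descent_multi(X_demo, y_demo, learning_rate=0.1, iterations=500)

# Exact least-squares answer for comparison (ones column carries the intercept).
A = np.column_stack([X_demo, np.ones(len(X_demo))])
solution, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
w_exact, b_exact = solution[:-1], solution[-1]

print("gradient descent:", np.round(w_gd, 4), round(b_gd, 4))
print("least squares:   ", np.round(w_exact, 4), round(b_exact, 4))
```

With well-scaled features the two answers match to several decimal places; on unscaled features the same loop would need a much smaller learning rate and many more iterations.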
Practice Exercises
Try these before peeking at the hints. They build directly on the code above.
Exercise 1: Add a Convergence Check
Our loop always runs the full 60 iterations, even after the loss stops improving. Modify gradient_descent to stop early once the improvement between two consecutive iterations is smaller than a tolerance (say 1e-6). Print how many iterations it actually ran.
# Start from the gradient_descent function above and add early stopping.
Hint
Keep a previous_cost variable, initialized to None. Inside the loop, after computing current_cost, check if previous_cost is not None and abs(previous_cost - current_cost) < tolerance: break. Update previous_cost = current_cost each iteration, and return the loop index so you can see how early it stopped.
Exercise 2: Train on a Different Feature
Repeat the whole process using horsepower instead of engine_size to predict price. Standardize it, run gradient descent, and compare the learned weight. Which feature has the stronger standardized relationship with price?
x = standardize(df["horsepower"])
y = standardize(df["price"])
# Your code here: run gradient_descent and print w
Hint
Reuse the exact same functions; only the input column changes. The learned weight on standardized data is the correlation between the feature and price, so the larger weight (closer to 1) marks the stronger linear relationship. Both engine_size and horsepower are strong predictors of price.
Exercise 3: Find the Breaking Point
The learning rates 0.01, 0.1, and 0.6 all eventually behaved. Push further: run the loop with learning_rate=1.05 and print the loss history. What happens to the MSE, and why?
_, _, hist = gradient_descent(x, y, learning_rate=1.05, iterations=30)
print([round(c, 2) for c in hist])
Hint
With too large a learning rate, each step overshoots the minimum and lands farther from it than where it started, so the loss grows instead of shrinking. You should see the MSE values climbing from one iteration to the next; left running long enough, they overflow to inf and then nan. This is divergence, the failure mode the earlier caution box warned about. Lowering the rate fixes it.
Summary
You just built the optimization algorithm that sits underneath nearly all of machine learning. Let’s review what you learned.
Key Concepts
Loss Functions
- A loss function turns “how good is this model” into a single number to minimize
- Mean squared error averages the squared errors, punishing big mistakes hardest and giving a smooth, bowl-shaped surface
Gradients
- The gradient is the slope of the loss surface; its sign tells you which way is uphill
- The partial derivatives of MSE have intuitive readings: the bias gradient tracks the average error, and the weight gradient weights each error by its feature value
Gradient Descent
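- The update rule steps each parameter against its gradient: $w \leftarrow w - \alpha \, \partial\,\text{MSE} / \partial w$ and $b \leftarrow b - \alpha \, \partial\,\text{MSE} / \partial b$, where $\alpha$ is the learning rate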
- Repeated for many iterations, this walks the parameters downhill to the loss minimum
- On the standardized car data, the loop found w=0.841, b=0.000, with a final MSE of 0.2919, matching the closed-form solution
The Learning Rate
- The learning rate controls step size and is the most important hyperparameter to get right
- Too small crawls, a good value converges smoothly, and too large overshoots or diverges
- A rising or nan loss is the classic symptom of a learning rate that is too high
Why This Matters
Nearly every framework you will use, from scikit-learn to the libraries behind large neural networks, trains at least some of its models with a flavor of gradient descent. Understanding it from scratch demystifies those .fit() calls: you now know that, for such models, fitting means minimizing a loss by following its gradient downhill. When a real model trains too slowly, plateaus, or blows up, you will recognize the cause in the loss curve and know which knob to turn. That diagnostic intuition is worth far more than any single result, and it carries you straight into deep learning.
Next Steps
You implemented gradient descent by hand and watched it converge. Next, you will hand the work back to scikit-learn and use its SGDRegressor, which runs this same idea at scale, and confirm it lands on the same answer your own loop did.
Continue to Lesson 7 - Gradient Descent with Scikit-Learn
Use scikit-learn’s SGDRegressor instead of hand-written loops.