Lesson 6 - Implementing Gradient Descent in Python

How Does a Model Actually Learn?

In the earlier lessons you called LinearRegression().fit(X, y) and a fitted line appeared, complete with coefficients and an intercept. That single line of code hid a question worth answering: how did scikit-learn find those numbers? It did not try every possible line. LinearRegression solves this particular problem directly with a least-squares solver, but that shortcut exists for only a handful of models. The general procedure, the one that quietly powers most of modern machine learning from linear models to deep neural networks, is gradient descent.

In this lesson you will open the box. You will define what it means for a line to be “good,” turn that definition into a number you can minimize, and then write the loop that walks downhill toward the best line, one small step at a time. You will do it on the real automobiles dataset you have used throughout this module, so the parameters you compute by hand will line up with the ones scikit-learn gave you.

By the end of this lesson, you will be able to:

  • Explain why a model needs a loss function and compute the mean squared error
  • Derive and code the gradients of the loss with respect to the weight and bias
  • Write a gradient descent loop from scratch that fits a line to real data
  • Diagnose convergence by watching the loss curve fall across iterations
  • Explain how the learning rate controls whether training crawls, converges, or diverges

You should be comfortable with basic Python, pandas, and NumPy, and you should have seen linear regression at least once. No calculus background is required; we will build up the few derivatives we need from scratch. Let’s begin.


The Data: Predicting Car Prices

You will work with the automobiles dataset, the classic catalog of imported cars where each row describes one model and its specifications. Your job is to predict a car’s price from its characteristics. Download it and load it with pandas.

import pandas as pd

df = pd.read_csv("automobiles.csv")  # download: https://datatweets.com/datasets/automobiles.csv

print("Shape:", df.shape)
print(df["price"].describe()[["mean", "min", "max"]].round(0))
# Output:
# Shape: (159, 26)
# mean    11446.0
# min      5118.0
# max     35056.0
# Name: price, dtype: float64

The dataset has 159 rows and 26 columns, with no missing values, which keeps the focus on the algorithm rather than on cleaning. Prices range from about $5,118 to $35,056, with a mean near $11,446.

A Small Data Dictionary

You will only need a couple of numeric columns for this lesson, but here are the most useful ones across the module.

Column                         Type       Meaning
price                          int        Target: list price in US dollars
engine_size                    int        Engine displacement (a proxy for power)
horsepower                     int        Rated horsepower
curb_weight                    int        Weight of the car in pounds
width, length, wheel_base      float      Body dimensions in inches
city_mpg, highway_mpg          int        Fuel economy in miles per gallon
make, body_style, fuel_type    category   Manufacturer and configuration

To keep the math visible, we will train on a single feature: engine_size. A bigger engine usually means a more expensive car, so it is a sensible predictor of price, and using one feature lets us plot every step of the algorithm.

Why Not Just Solve It Directly?

A fair question: for linear regression there is a famous closed-form solution (the normal equation) that computes the best line in one shot, with no iteration at all. So why bother walking downhill?

The answer is generality. That direct formula exists only because linear regression’s loss is simple enough to solve with algebra. The moment you move to logistic regression, neural networks, or almost any modern model, no such formula exists; the only practical way to fit them is to start somewhere, measure the slope of the loss, and step downhill repeatedly. Gradient descent is that universal procedure. Learning it on linear regression, where you can check your answer against the exact solution, is the cleanest possible introduction. The loop you build here carries over to the harder models later; only the loss and its gradient change.

With that settled, pull out the feature and the target and check their raw ranges.

x_raw = df["engine_size"]
y_raw = df["price"]

print("engine_size range:", x_raw.min(), "to", x_raw.max())
print("price range:      ", y_raw.min(), "to", y_raw.max())
# Output:
# engine_size range: 61 to 326
# price range:       5118 to 35056

Step 1: Measuring How Wrong a Line Is

Linear regression fits a line. With one feature, that line is just

$$\hat{y} = w x + b$$

where $\hat{y}$ is the predicted price, $x$ is the engine size, $w$ is the weight (the slope: how much price rises per unit of engine size), and $b$ is the bias (the intercept: the predicted price when $x = 0$). Different choices of $w$ and $b$ give different lines, and some fit the data far better than others.

To let the computer choose, we need to turn “better fit” into a single number. That number is the loss (also called the cost). The lower the loss, the better the line. For regression, the standard choice is the mean squared error (MSE): take each point’s error (actual minus predicted), square it so positive and negative errors do not cancel, and average over all $n$ points.

$$\text{MSE}(w, b) = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2 = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - (w x_i + b)\bigr)^2$$

Squaring has two effects worth understanding. It makes every error count as positive, and it punishes large errors much harder than small ones, so the line is pulled toward reducing its worst misses. An MSE of zero would mean the line passes through every point exactly; in practice, our goal is to make MSE as small as possible.

Here is the loss as a function. It takes the true values and the predictions and returns one number.

import numpy as np

def mean_squared_error(y_true, y_predicted):
    # average of the squared gaps between truth and prediction
    return np.mean((y_true - y_predicted) ** 2)

# A quick sanity check with a deliberately bad line: w=0, b=0 (predict 0 for everything)
bad_predictions = 0 * x_raw + 0
print("MSE of the zero line:", round(mean_squared_error(y_raw, bad_predictions), 1))
# Output:
# MSE of the zero line: 175763392.6

The “predict zero for everything” line is terrible, and its enormous MSE says so. Our task is to find the $w$ and $b$ that drive this number down.

Why squared, not absolute?

You could measure error with the absolute difference instead of the square. Squaring is preferred here for two reasons: it penalizes big mistakes more heavily, and it produces a smooth, bowl-shaped loss surface whose slope we can compute exactly with calculus. That smoothness is what makes gradient descent work.
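To make the first reason concrete, here is a small sketch comparing how the two measures treat one large miss. The three error values are made up for illustration and do not come from the car data.

```python
import numpy as np

# Hypothetical errors (actual minus predicted): two small misses and one large one
errors = np.array([1.0, -1.0, 10.0])

mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error

# Share of each total that comes from the single large miss
share_abs = np.abs(errors[2]) / np.sum(np.abs(errors))
share_sq = errors[2] ** 2 / np.sum(errors ** 2)

print(f"MAE = {mae:.2f}, large miss contributes {share_abs:.0%}")
print(f"MSE = {mse:.2f}, large miss contributes {share_sq:.0%}")
# MAE = 4.00, large miss contributes 83%
# MSE = 34.00, large miss contributes 98%
```

Under the squared measure the one large miss accounts for almost the entire loss, which is why a line fitted with MSE works hardest to shrink its worst errors.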


Step 2: Standardizing So the Math Behaves

Engine sizes run from 61 to 326, while prices run into the tens of thousands. When two quantities live on such different scales, the loss surface becomes a long, narrow valley, and gradient descent zig-zags slowly down it. The cure is standardization: rescale each variable to have mean 0 and standard deviation 1.

$$z = \frac{x - \mu}{\sigma}$$

Here $\mu$ is the mean and $\sigma$ is the standard deviation. After standardizing both engine_size and price, both live on the same comfortable scale, and the algorithm converges quickly and predictably.

def standardize(series):
    return (series - series.mean()) / series.std()

x = standardize(x_raw).to_numpy()  # standardized engine_size
y = standardize(y_raw).to_numpy()  # standardized price

print("x mean/std:", round(x.mean(), 3), round(x.std(), 3))
print("y mean/std:", round(y.mean(), 3), round(y.std(), 3))
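# Output:
# x mean/std: -0.0 0.997
# y mean/std: 0.0 0.997

Both variables now sit at mean 0 with essentially unit spread. (The printout shows 0.997 rather than exactly 1 because pandas’ .std() divides by n − 1 while NumPy’s .std() divides by n; with 159 rows the gap is harmless.) A useful consequence: on standardized data the best-fit weight $w$ equals the correlation between the two variables, so we already have a target to check our answer against once training finishes.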
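The claim that the best-fit weight equals the correlation can be checked without any training. The sketch below uses synthetic data (generated here, not the car dataset) and the one-feature least-squares slope formula; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for a feature and a target (not the car data)
x_raw = rng.normal(130.0, 40.0, size=200)
y_raw = 150.0 * x_raw + rng.normal(0.0, 4000.0, size=200)

def standardize(a):
    # ddof=1 mirrors the pandas .std() default used in the lesson
    return (a - a.mean()) / a.std(ddof=1)

x = standardize(x_raw)
y = standardize(y_raw)

# Least-squares slope for one centered feature: sum(x*y) / sum(x*x)
slope = np.sum(x * y) / np.sum(x * x)
corr = np.corrcoef(x_raw, y_raw)[0, 1]

print(f"slope on standardized data: {slope:.6f}")
print(f"Pearson correlation:        {corr:.6f}")
```

The two printed values agree to rounding error, so the correlation is a ready-made answer key for the weight that gradient descent should find.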


Step 3: Which Way Is Downhill?

We have a loss we want to minimize. Gradient descent minimizes it by repeatedly asking: if I nudge $w$ and $b$ a little, which direction makes the loss go down? The answer comes from the gradient, which is just the slope of the loss surface in each direction. Where the slope is steep, we take a bigger step; where it flattens out near the bottom, the steps shrink automatically.

The loss depends on two variables, so we need the slope in each direction separately. These are the partial derivatives of the MSE with respect to $w$ and $b$. Working them out from the MSE formula (the chain rule does all the work) gives two clean expressions:

$$\frac{\partial \text{MSE}}{\partial w} = -\frac{2}{n}\sum_{i=1}^{n} x_i\bigl(y_i - \hat{y}_i\bigr) \qquad \frac{\partial \text{MSE}}{\partial b} = -\frac{2}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)$$
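For readers who want the intermediate step, one way to reach these expressions is sketched below. The shorthand $e_i$ for the error of point $i$ is introduced here only for convenience; it is not notation used elsewhere in the lesson.

```latex
\begin{aligned}
\text{MSE}(w, b) &= \frac{1}{n}\sum_{i=1}^{n} e_i^2,
  \qquad e_i = y_i - (w x_i + b) = y_i - \hat{y}_i \\[4pt]
\frac{\partial \text{MSE}}{\partial w}
  &= \frac{1}{n}\sum_{i=1}^{n} 2 e_i \,\frac{\partial e_i}{\partial w}
   = \frac{1}{n}\sum_{i=1}^{n} 2 e_i \,(-x_i)
   = -\frac{2}{n}\sum_{i=1}^{n} x_i\bigl(y_i - \hat{y}_i\bigr) \\[4pt]
\frac{\partial \text{MSE}}{\partial b}
  &= \frac{1}{n}\sum_{i=1}^{n} 2 e_i \,\frac{\partial e_i}{\partial b}
   = \frac{1}{n}\sum_{i=1}^{n} 2 e_i \,(-1)
   = -\frac{2}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)
\end{aligned}
```

The only calculus involved is the power rule (the derivative of $e_i^2$ is $2 e_i$) followed by the chain rule (multiply by the derivative of $e_i$ itself, which is $-x_i$ for the weight and $-1$ for the bias).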

Read them in plain language. The bias gradient is proportional to the average error: if the line sits too low on average, the gradient pushes the intercept up. The weight gradient is the same idea, but each error is weighted by its $x$ value, so points far from the center have more say in the slope. Both translate directly into NumPy.

The minus sign in front of each formula comes from the chain rule: the error $y_i - \hat{y}_i$ shrinks as $w$ or $b$ grows, so differentiating it contributes a factor of $-x_i$ or $-1$. The subtraction in the update step is a separate matter. A gradient always points in the direction of increasing loss, so we step the opposite way: a negative gradient means the loss falls as the parameter grows, so we move the parameter up; a positive gradient means the opposite. Stepping against the gradient, with a small enough step, is what sends every update downhill rather than up. You never have to reason about this case by case; the arithmetic handles the direction for you.

def gradients(x, y, w, b):
    n = len(x)
    y_predicted = w * x + b              # current line's predictions
    error = y - y_predicted              # how far off each point is
    dw = -(2 / n) * np.sum(x * error)    # slope of the loss w.r.t. the weight
    db = -(2 / n) * np.sum(error)        # slope of the loss w.r.t. the bias
    return dw, db

# At the starting point w=0, b=0 the line predicts 0 everywhere, so the error is just y
dw0, db0 = gradients(x, y, w=0.0, b=0.0)
print("Initial gradients -> dw:", round(dw0, 3), " db:", round(db0, 3))
# Output:
# Initial gradients -> dw: -1.671  db: -0.000

The negative weight gradient at the start tells us something concrete: increasing $w$ will decrease the loss. That is the nudge gradient descent is about to follow.
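A common way to gain confidence in hand-derived gradients is to compare them with finite differences: nudge one parameter slightly, measure how the loss changes, and divide by the size of the nudge. The sketch below does this on synthetic data; numerical_gradients is a hypothetical helper written for this check, not part of the lesson's code.

```python
import numpy as np

def mean_squared_error(y_true, y_predicted):
    return np.mean((y_true - y_predicted) ** 2)

def gradients(x, y, w, b):
    # same formulas as the lesson's gradients()
    n = len(x)
    error = y - (w * x + b)
    dw = -(2 / n) * np.sum(x * error)
    db = -(2 / n) * np.sum(error)
    return dw, db

def numerical_gradients(x, y, w, b, eps=1e-6):
    # central differences: nudge one parameter at a time, both ways
    loss = lambda w_, b_: mean_squared_error(y, w_ * x + b_)
    dw = (loss(w + eps, b) - loss(w - eps, b)) / (2 * eps)
    db = (loss(w, b + eps) - loss(w, b - eps)) / (2 * eps)
    return dw, db

# Synthetic data (not the car data)
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 0.8 * x + rng.normal(scale=0.5, size=100)

analytic = gradients(x, y, w=0.3, b=-0.2)
numeric = numerical_gradients(x, y, w=0.3, b=-0.2)
print("analytic:", np.round(analytic, 6))
print("numeric: ", np.round(numeric, 6))
```

If the two lines disagree beyond the last decimal or two, the usual culprit is a sign or a missing factor of 2 in the analytic formula.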


Step 4: The Gradient Descent Loop

Now we assemble the pieces. The update rule is the heart of the algorithm. We move each parameter a small step in the opposite direction of its gradient (downhill, not uphill), and the size of that step is set by the learning rate $\alpha$.

$$w \leftarrow w - \alpha \frac{\partial \text{MSE}}{\partial w} \qquad b \leftarrow b - \alpha \frac{\partial \text{MSE}}{\partial b}$$

We repeat this update for a fixed number of iterations, recording the loss each time so we can watch progress. A few design notes on the function below:

  • w and b start at zero. On standardized data, zero is a perfectly reasonable starting point.
  • learning_rate and iterations are hyperparameters: knobs you set before training, not values the model learns.
  • We append the current MSE to a history list every step purely so we can plot convergence afterward.

def gradient_descent(x, y, learning_rate=0.1, iterations=60):
    w, b = 0.0, 0.0          # initial weight and bias
    history = []             # MSE at each iteration, for plotting

    for i in range(iterations):
        y_predicted = w * x + b
        cost = mean_squared_error(y, y_predicted)
        history.append(cost)

        # how far downhill, and in which direction
        dw, db = gradients(x, y, w, b)

        # take one step against the gradient
        w = w - learning_rate * dw
        b = b - learning_rate * db

    return w, b, history
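Before running the full loop, it may help to trace a single update by hand. The sketch below plugs in the starting weight and the rounded initial gradient printed in Step 3; everything else is plain arithmetic.

```python
# Values from the lesson: start at w = 0 with the initial gradient from Step 3
w = 0.0
dw = -1.671            # initial weight gradient (rounded)
learning_rate = 0.1

# Subtracting a negative gradient moves w up, toward the minimum
w_new = w - learning_rate * dw
print(round(w_new, 4))
# 0.1671
```

One step covers only part of the distance to the final weight; the loop simply repeats this move, with a freshly computed gradient each time.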

Let’s run it on the standardized car data with a learning rate of 0.1 for 60 iterations.

w, b, history = gradient_descent(x, y, learning_rate=0.1, iterations=60)

print(f"Learned weight (w): {w:.3f}")
print(f"Learned bias   (b): {b:.3f}")
print(f"Final MSE:          {history[-1]:.4f}")
# Output:
# Learned weight (w): 0.841
# Learned bias   (b): 0.000
# Final MSE:          0.2919

The loop converged to a weight of about 0.841, a bias of essentially 0, and a final MSE near 0.2919. Recall the check from Step 2: on standardized data the best-fit weight equals the correlation between engine size and price. A weight of 0.841 says the two are strongly, positively related, exactly what we expected, and exactly what an analytical solution would give. Your from-scratch loop landed on the right answer.

Connecting back to the real coefficients

These numbers are in standardized units. If you reverse the standardization (multiply through by the price standard deviation and divide by the engine-size standard deviation), the slope maps back to roughly 162 dollars of price per unit of engine size, the same coefficient the simple LinearRegression model produced in the earlier lesson. Gradient descent and the closed-form solution agree, because they are minimizing the very same loss.
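The unit conversion described above can be sketched as follows. The data is synthetic (generated with a known slope of 150, not the car dataset), and the closed-form slope stands in for the weight that gradient descent would learn.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-ins (not the car data), built with a true slope of 150
x_raw = rng.normal(130.0, 40.0, size=200)
y_raw = 150.0 * x_raw - 8000.0 + rng.normal(0.0, 3000.0, size=200)

sx, sy = x_raw.std(ddof=1), y_raw.std(ddof=1)
x = (x_raw - x_raw.mean()) / sx
y = (y_raw - y_raw.mean()) / sy

# Best-fit weight in standardized units (closed form for one feature)
w_std = np.sum(x * y) / np.sum(x * x)

# Reverse the standardization to get raw-unit coefficients
slope_raw = w_std * sy / sx
intercept_raw = y_raw.mean() - slope_raw * x_raw.mean()

# Reference: a direct least-squares fit on the raw data
slope_ref, intercept_ref = np.polyfit(x_raw, y_raw, deg=1)

print(f"recovered slope:     {slope_raw:.3f}   direct fit: {slope_ref:.3f}")
print(f"recovered intercept: {intercept_raw:.1f}   direct fit: {intercept_ref:.1f}")
```

The recovered and direct values match, which is the same agreement the lesson reports between the hand-written loop and LinearRegression.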


Step 5: Watching the Loss Fall

A single final number hides the story. The real value of recording history is that we can plot the MSE at every iteration and watch the algorithm learn. A healthy run shows the loss dropping fast at first, when the gradients are steep, then leveling off as it approaches the minimum and the steps naturally shrink.

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
plt.plot(range(len(history)), history, color="#2563eb")
plt.xlabel("Iteration")
plt.ylabel("MSE (standardized units)")
plt.title("Gradient descent loss curve")
plt.show()

The figure below shows that loss curve. The first handful of iterations do most of the work, slashing the MSE from 1.0 toward roughly 0.3, after which the curve flattens into a gentle approach to the minimum.

[Figure: line chart of MSE decreasing across gradient descent iterations. The loss falls steadily as gradient descent updates the parameters.]

This shape is the signature of a well-behaved training run. When you train more complex models later, this same curve is the first thing you will look at: a smooth, decreasing loss means learning is working, while a curve that flatlines too early, bounces around, or shoots upward signals trouble.
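The same diagnosis can be roughed out in code when a plot is not convenient. describe_loss_curve below is a hypothetical helper, not part of any library; its thresholds are arbitrary, and it assumes the history has at least two entries.

```python
import numpy as np

def describe_loss_curve(history, flat_tol=1e-6):
    # Hypothetical helper: classify a loss history with three crude checks
    history = np.asarray(history, dtype=float)
    if not np.all(np.isfinite(history)):
        return "diverged (inf or nan in the history)"
    steps = np.diff(history)        # change in loss between consecutive iterations
    if np.any(steps > 0):
        return "unstable (the loss rose at least once)"
    if abs(steps[-1]) < flat_tol:
        return "converged (the loss has flattened)"
    return "still improving (consider more iterations)"

# Made-up histories, one per case
print(describe_loss_curve([1.0, 0.5, 0.3, 0.3]))
print(describe_loss_curve([1.0, 0.9, 0.8, 0.7]))
print(describe_loss_curve([1.0, 1.4, 2.1, 3.5]))
print(describe_loss_curve([1.0, 5.0, float("inf"), float("nan")]))
```

A check like this is no substitute for looking at the curve, but it is handy inside scripts that train many models unattended.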


Step 6: The Learning Rate Changes Everything

Of all the hyperparameters, the learning rate $\alpha$ is the one that most often makes or breaks a training run. It sets the step size. Too small and the algorithm crawls, needing thousands of iterations to arrive. Too large and it overshoots the minimum, bouncing across the valley and sometimes diverging entirely. The 0.1 we used was a good middle choice; let’s see what happens on either side of it by running the same loop with three different rates.

for rate in [0.01, 0.1, 0.6]:
    _, _, hist = gradient_descent(x, y, learning_rate=rate, iterations=60)
    print(f"lr={rate:<5}  MSE after 60 iters: {hist[-1]:.4f}")
# Output:
# lr=0.01   MSE after 60 iters: 0.5430
# lr=0.1    MSE after 60 iters: 0.2919
# lr=0.6    MSE after 60 iters: 0.2919

The numbers tell part of the story, and the next figure tells the rest. With lr=0.01 the steps are tiny, so after 60 iterations the loss is still well above the minimum: the run is simply not finished. With lr=0.1 the loss settles smoothly into the floor near 0.292. With lr=0.6 every step overshoots: the weight jumps past its best value, lands on the other side, then jumps back. Each overshoot is smaller than the last, so the run still reaches the same floor. Keep raising the rate and the overshoots stop shrinking. On this standardized one-feature problem that happens once the rate passes roughly 1.0, at which point every step lands farther from the minimum than the one before and the loss diverges.

[Figure: loss curves for three different learning rates. The learning rate matters: too small crawls, too big overshoots.]

There is no universal best learning rate; it depends on the data and the model. The practical workflow is to try a few values spaced by powers of ten (0.001, 0.01, 0.1) and pick the largest one whose loss curve still falls smoothly. That gives you the fastest convergence that remains stable.
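That workflow can be sketched as a small sweep. The data here is synthetic, the candidate rates are an arbitrary ladder spaced by powers of ten, and "falls smoothly" is approximated by "finite and never rising by more than rounding noise"; all three are assumptions made for illustration.

```python
import numpy as np

def gradient_descent(x, y, learning_rate, iterations=60):
    # same loop as the lesson, condensed
    w, b, history = 0.0, 0.0, []
    n = len(x)
    for _ in range(iterations):
        error = y - (w * x + b)
        history.append(np.mean(error ** 2))
        w -= learning_rate * (-(2 / n) * np.sum(x * error))
        b -= learning_rate * (-(2 / n) * np.sum(error))
    return w, b, history

# Synthetic standardized data (not the car data)
rng = np.random.default_rng(3)
x = rng.normal(size=150)
y = 0.8 * x + rng.normal(scale=0.6, size=150)
x = (x - x.mean()) / x.std()
y = (y - y.mean()) / y.std()

best_rate = None
for rate in [0.003, 0.03, 0.3, 3.0]:
    _, _, hist = gradient_descent(x, y, learning_rate=rate)
    hist = np.asarray(hist)
    # small tolerance so rounding noise at the minimum is not counted as a rise
    smooth = np.all(np.isfinite(hist)) and np.all(np.diff(hist) <= 1e-12)
    print(f"lr={rate:<6} final MSE={hist[-1]:.4g}  smooth={smooth}")
    if smooth:
        best_rate = rate    # rates are ascending, so this keeps the largest

print("largest smoothly decreasing rate:", best_rate)
```

On this data the sweep settles on 0.3 and rejects 3.0, whose loss grows at every step. On a real problem it is still worth plotting the curves for the top one or two candidates before committing.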

When training blows up

If your loss curve climbs instead of falling, or prints nan, your learning rate is almost certainly too high: the steps overshoot so badly that each update lands somewhere worse. Lower the rate by a factor of ten and try again. This single habit will save you hours of confusion when you train larger models.
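The "lower it by a factor of ten and try again" habit can also be automated. train_with_backoff below is a hypothetical wrapper written for this illustration, run on synthetic data; it restarts from scratch with a smaller rate as soon as the loss rises or stops being finite.

```python
import numpy as np

def train_with_backoff(x, y, learning_rate, iterations=60, max_retries=5):
    # Hypothetical wrapper, not part of the lesson's code
    n = len(x)
    for attempt in range(max_retries + 1):
        w, b, previous_cost = 0.0, 0.0, np.inf
        stable = True
        for _ in range(iterations):
            error = y - (w * x + b)
            cost = np.mean(error ** 2)
            # a rising or non-finite loss is the symptom of too large a rate
            if not np.isfinite(cost) or cost > previous_cost + 1e-12:
                stable = False
                break
            previous_cost = cost
            w -= learning_rate * (-(2 / n) * np.sum(x * error))
            b -= learning_rate * (-(2 / n) * np.sum(error))
        if stable:
            return w, b, learning_rate
        print(f"lr={learning_rate} was unstable, retrying with lr={learning_rate / 10}")
        learning_rate /= 10
    raise RuntimeError("no stable learning rate found")

# Synthetic standardized data (not the car data)
rng = np.random.default_rng(5)
x = rng.normal(size=150)
y = 0.8 * x + rng.normal(scale=0.6, size=150)
x = (x - x.mean()) / x.std()
y = (y - y.mean()) / y.std()

w, b, used_rate = train_with_backoff(x, y, learning_rate=5.0)
print(f"settled on lr={used_rate}, w={w:.3f}")
```

Starting from a deliberately excessive rate of 5.0, the wrapper backs off once and trains cleanly at 0.5.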


Putting It All Together

Here is the entire algorithm, from raw data to a trained line, condensed into one runnable script. This is the template behind gradient-based training; LinearRegression().fit() reaches the same minimum with a direct solver.

import numpy as np
import pandas as pd

# 1. Load the real data
df = pd.read_csv("automobiles.csv")  # download: https://datatweets.com/datasets/automobiles.csv

# 2. Standardize the single feature and the target
def standardize(s):
    return ((s - s.mean()) / s.std()).to_numpy()

x = standardize(df["engine_size"])
y = standardize(df["price"])

# 3. Loss and its gradients
def mean_squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def gradients(x, y, w, b):
    n = len(x)
    error = y - (w * x + b)
    dw = -(2 / n) * np.sum(x * error)
    db = -(2 / n) * np.sum(error)
    return dw, db

# 4. The gradient descent loop
def gradient_descent(x, y, learning_rate=0.1, iterations=60):
    w, b, history = 0.0, 0.0, []
    for _ in range(iterations):
        history.append(mean_squared_error(y, w * x + b))
        dw, db = gradients(x, y, w, b)
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b, history

# 5. Train and report
w, b, history = gradient_descent(x, y, learning_rate=0.1, iterations=60)
print(f"w={w:.3f}  b={b:.3f}  final MSE={history[-1]:.4f}")
# Output: w=0.841  b=0.000  final MSE=0.2919

In about 30 lines, with no machine learning library at all, you implemented the core optimization idea that scikit-learn’s gradient-based estimators, TensorFlow, and PyTorch all build on. The details get more elaborate for deep networks (more parameters, mini-batches, fancier update rules), but the core idea never changes: define a loss, compute its gradient, and step downhill.


Scaling Beyond One Feature

We used a single feature so we could see and plot everything, but nothing about gradient descent is limited to one input. With several features, the line becomes a plane (or a hyperplane), each feature gets its own weight, and the prediction is

$$\hat{y} = b + w_1 x_1 + w_2 x_2 + \cdots + w_m x_m$$

The loss is still MSE, and the update rule is identical: each weight $w_j$ steps against its own partial derivative,

$$w_j \leftarrow w_j - \alpha \frac{\partial \text{MSE}}{\partial w_j} \qquad \frac{\partial \text{MSE}}{\partial w_j} = -\frac{2}{n}\sum_{i=1}^{n} x_{ij}\bigl(y_i - \hat{y}_i\bigr)$$

In code, the only real change is replacing the scalar weight with a vector and using a matrix multiply for the predictions. The same loop, the same downhill step, just more bookkeeping.

def gradient_descent_multi(X, y, learning_rate=0.1, iterations=200):
    n, m = X.shape
    w = np.zeros(m)          # one weight per feature
    b = 0.0
    history = []
    for _ in range(iterations):
        y_pred = X @ w + b               # predictions for every row at once
        error = y - y_pred
        history.append(np.mean(error ** 2))
        w -= learning_rate * (-(2 / n) * (X.T @ error))   # vector of weight gradients
        b -= learning_rate * (-(2 / n) * np.sum(error))
    return w, b, history

This is the same idea that lets scikit-learn’s gradient-based regressor handle the five-feature price model you built earlier (engine_size, horsepower, curb_weight, width, highway_mpg). You do not need to run it here; the point is that the algorithm you wrote by hand is the same one, just widened. In the next lesson you will let scikit-learn handle that widening for you and confirm it reaches the same answer.
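If you would like to try the widened loop anyway, the sketch below runs it on synthetic data with three made-up features (not the car columns) and compares the result with a direct least-squares solve from NumPy.

```python
import numpy as np

def gradient_descent_multi(X, y, learning_rate=0.1, iterations=200):
    # copied from the lesson
    n, m = X.shape
    w = np.zeros(m)
    b = 0.0
    history = []
    for _ in range(iterations):
        error = y - (X @ w + b)
        history.append(np.mean(error ** 2))
        w -= learning_rate * (-(2 / n) * (X.T @ error))
        b -= learning_rate * (-(2 / n) * np.sum(error))
    return w, b, history

# Synthetic data: three hypothetical features and a noisy linear target
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.3, 0.2]) + rng.normal(scale=0.1, size=200)

# Standardize each column and the target, as in the single-feature case
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = (y - y.mean()) / y.std()

w, b, history = gradient_descent_multi(X, y)

# Reference: direct least squares with an explicit intercept column
A = np.column_stack([X, np.ones(len(X))])
solution, *_ = np.linalg.lstsq(A, y, rcond=None)

print("gradient descent weights:", np.round(w, 4))
print("least-squares weights:   ", np.round(solution[:-1], 4))
```

The two sets of weights agree closely here because the synthetic features are nearly uncorrelated; strongly correlated features generally need more iterations to reach the same agreement.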


Practice Exercises

Try these before peeking at the hints. They build directly on the code above.

Exercise 1: Add a Convergence Check

Our loop always runs the full 60 iterations, even after the loss stops improving. Modify gradient_descent to stop early once the improvement between two consecutive iterations is smaller than a tolerance (say 1e-6). Print how many iterations it actually ran.

# Start from the gradient_descent function above and add early stopping.

Hint

Keep a previous_cost variable, initialized to None. Inside the loop, after computing current_cost, check if previous_cost is not None and abs(previous_cost - current_cost) < tolerance: break. Update previous_cost = current_cost each iteration, and return the loop index so you can see how early it stopped.

Exercise 2: Train on a Different Feature

Repeat the whole process using horsepower instead of engine_size to predict price. Standardize it, run gradient descent, and compare the learned weight. Which feature has the stronger standardized relationship with price?

x = standardize(df["horsepower"])
y = standardize(df["price"])
# Your code here: run gradient_descent and print w

Hint

Reuse the exact same functions; only the input column changes. The learned weight on standardized data is the correlation between the feature and price, so the larger weight (closer to 1) marks the stronger linear relationship. Both engine_size and horsepower are strong predictors of price.

Exercise 3: Find the Breaking Point

The learning rates 0.01, 0.1, and 0.6 all eventually behaved. Push further: run the loop with learning_rate=1.05 and print the loss history. What happens to the MSE, and why?

_, _, hist = gradient_descent(x, y, learning_rate=1.05, iterations=30)
print([round(c, 2) for c in hist])

Hint

With too large a learning rate, each step lands farther from the minimum than the point it started from, so the loss grows instead of shrinking. You should see the MSE values climb at every iteration; left running long enough, they would overflow to inf and then nan. This is divergence, the failure mode described under “When training blows up.” Lowering the rate fixes it.


Summary

You just built the optimization algorithm that sits underneath nearly all of machine learning. Let’s review what you learned.

Key Concepts

Loss Functions

  • A loss function turns “how good is this model” into a single number to minimize
  • Mean squared error averages the squared errors, punishing big mistakes hardest and giving a smooth, bowl-shaped surface

Gradients

  • The gradient is the slope of the loss surface; its sign tells you which way is uphill
  • The partial derivatives of MSE have intuitive readings: the bias gradient tracks the average error, and the weight gradient weights each error by its feature value

Gradient Descent

  • The update rule steps each parameter against its gradient: $w \leftarrow w - \alpha\,\partial\text{MSE}/\partial w$
  • Repeated for many iterations, this walks the parameters downhill to the loss minimum
  • On the standardized car data, the loop found w=0.841, b=0.000, with a final MSE of 0.2919, matching the closed-form solution

The Learning Rate

  • The learning rate $\alpha$ controls step size and is the most important hyperparameter to get right
  • Too small crawls, a good value converges smoothly, and too large overshoots or diverges
  • A rising or nan loss is the classic symptom of a learning rate that is too high

Why This Matters

Nearly every framework you will use, from scikit-learn’s gradient-based estimators to the libraries behind large neural networks, trains models with some flavor of gradient descent. Understanding it from scratch demystifies the .fit() call for those models: fitting means minimizing a loss, most often by following its gradient downhill. When a real model trains too slowly, plateaus, or blows up, you will recognize the cause in the loss curve and know which knob to turn. That diagnostic intuition is worth far more than any single result, and it carries you straight into deep learning.


Next Steps

You implemented gradient descent by hand and watched it converge. Next, you will hand the work back to scikit-learn and use its SGDRegressor, which runs this same idea at scale, and confirm it lands on the same answer your own loop did.

Continue to Lesson 7 - Gradient Descent with Scikit-Learn
