Lesson 7 - Gradient Descent with Scikit-Learn

From Hand-Rolled Loops to Production Tools

In the previous lessons you built gradient descent from scratch. You watched a loss curve fall, tuned a learning rate by hand, and saw a weight converge on the same slope ordinary least squares would have found. That hand-coded version was perfect for understanding the idea. It is not what you reach for on real projects.

This lesson bridges the gap. You will learn why the plain gradient descent loop you wrote starts to creak as datasets grow, meet the practical fix called stochastic gradient descent, and then hand the whole job to scikit-learn’s SGDRegressor. Along the way you will predict car prices from real engine and body measurements, and you will prove to yourself that this optimizer lands on essentially the same answer as the exact least-squares solution.

By the end of this lesson, you will be able to:

  • Explain the two ways plain (batch) gradient descent slows down as data and features grow
  • Describe how stochastic gradient descent fixes those problems by sampling the data
  • Connect the from-scratch update rule you wrote to the knobs SGDRegressor exposes
  • Train SGDRegressor on real, scaled features and read its coefficients
  • Compare an SGD fit against ordinary least squares and confirm they agree

You should already be comfortable with the gradient descent update rule, train_test_split, and StandardScaler from earlier lessons. Let’s begin.


Why Plain Gradient Descent Struggles

Recall the loop you wrote. On every single iteration, it looked at all the data to compute one update. That is fine for 159 cars and one feature. It becomes a problem in two directions, and both matter the moment you leave a textbook.

Problem 1: More features means more ground to cover

Each feature you add is another dimension the optimizer must navigate. With one predictor, the loss surface is a simple 2D bowl, and rolling downhill is easy. Add a fifth feature, a fiftieth, a ten-thousandth, and that bowl becomes a high-dimensional landscape. There is vastly more ground to search, so reaching the bottom takes more iterations, and each iteration is more expensive.

Problem 2: Every step rereads the whole dataset

Look again at the cost of one update in the from-scratch version. To compute the gradient for the weight, you summed an error term across every row:

\[ \frac{\partial \text{MSE}}{\partial w} = -\frac{2}{n} \sum_{i=1}^{n} x_i \left( y_i - \hat{y}_i \right) \]

That sum runs over all \(n\) rows for a single step. With 159 rows it is instant. With ten million rows, one step touches ten million records, and you might need hundreds of steps. This full-dataset-per-step approach is called batch gradient descent, and on large data it is painfully slow.
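To make the per-step cost concrete, here is a minimal sketch of one batch gradient computation for a single-feature model. The data is synthetic and the names (`batch_gradient_w`, `x`, `y`) are illustrative assumptions, not part of the lesson's dataset or code.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
x = rng.normal(size=n)                        # one synthetic feature
y = 3.0 * x + rng.normal(scale=0.5, size=n)   # synthetic target, true slope 3

def batch_gradient_w(w, b, x, y):
    """Gradient of MSE with respect to w, computed from ALL rows."""
    y_hat = w * x + b
    return -2.0 / len(x) * np.sum(x * (y - y_hat))

# One gradient evaluation reads every row once.
grad = batch_gradient_w(w=0.0, b=0.0, x=x, y=y)
print(grad)  # negative here, so raising w from 0 lowers the loss
```

The `np.sum` over the whole array is the part that grows with the dataset: every additional row adds work to every single step.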

A second, subtler issue

More features also tend to create more local minima and flat regions in the loss surface. A simple linear-regression loss is a smooth bowl with one minimum, so this is not a concern here. But for the more complex models you will meet later, an optimizer that only ever takes perfectly smooth, averaged steps can get stuck. A little randomness, which is exactly what stochastic gradient descent adds, helps escape those traps.


The Fix: Stochastic Gradient Descent

The idea behind stochastic gradient descent (SGD) is almost embarrassingly simple. Instead of summing the gradient over the entire dataset before each update, you estimate the gradient from just one randomly chosen sample (or a small batch) and update immediately.

Each individual step is now noisier, because one sample is a rough estimate of the true gradient. But each step is also enormously cheaper, and you get to take many more of them in the same amount of time. The noise even helps: it nudges the optimizer out of flat spots. Averaged over many tiny, cheap, slightly-random steps, SGD marches downhill just like batch gradient descent, only far faster on large data.

Compare the two update strategies side by side:

BATCH gradient descent          STOCHASTIC gradient descent
------------------------        ------------------------------
Read ALL n rows                 Read ONE random row
Compute exact gradient          Compute a noisy gradient estimate
Update weights once             Update weights once
Repeat                          Repeat (many more times, cheaply)

Slow per step, smooth path      Fast per step, jittery path

The update rule itself is unchanged. You still nudge each weight against the gradient, scaled by the learning rate \(\eta\):

\[ w \leftarrow w - \eta \, \frac{\partial \text{MSE}}{\partial w} \]

The only difference is how many rows go into estimating that gradient. That single change is what lets gradient descent scale to data that would never fit in a single batched computation.
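The single-sample idea can be sketched in a few lines. This is an illustrative loop on synthetic data, not scikit-learn's implementation; the step count, the learning rate, and the variable names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
x = rng.normal(size=n)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=n)  # true slope 3, intercept 1

w, b, eta = 0.0, 0.0, 0.01
for step in range(5_000):
    i = rng.integers(n)                 # pick ONE random row
    error = y[i] - (w * x[i] + b)       # residual for that row only
    w -= eta * (-2.0 * x[i] * error)    # noisy estimate of dMSE/dw
    b -= eta * (-2.0 * error)           # noisy estimate of dMSE/db

print(round(w, 2), round(b, 2))  # should land near the true slope and intercept
```

Each update touches a single row, so its cost does not depend on how large the dataset is. The price is the jitter: the final values hover around the optimum rather than sitting exactly on it.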


The Dataset: Predicting Car Prices

You will put SGD to work on the same dataset you have used throughout this module: the classic automobiles dataset of imported cars, where each row is one car model described by its engine, body, and performance specs. Your job is to predict price in US dollars.

Load it with pandas.

import pandas as pd

df = pd.read_csv("automobiles.csv")  # download: https://datatweets.com/datasets/automobiles.csv

print("Shape:", df.shape)
print(df["price"].describe()[["mean", "min", "max"]].round(0))
# Output:
# Shape: (159, 26)
# mean    11446.0
# min      5118.0
# max     35056.0
# Name: price, dtype: float64

There are 159 cars and 26 columns, with no missing values to clean up. Prices run from about $5,118 for the cheapest economy car to $35,056 for the priciest model, averaging around $11,446.

A Data Dictionary

You will not need every column. Here are the numeric ones that carry the most signal about price; the model in this lesson uses five of them:

Column                   Meaning
-----------------------  ------------------------------------------------
engine_size              Engine displacement in cubic centimeters
horsepower               Engine power output
curb_weight              Weight of the car in pounds, ready to drive
width                    Body width in inches
length                   Body length in inches
wheel_base               Distance between front and rear axles
city_mpg, highway_mpg    Fuel economy in miles per gallon
price                    Target: manufacturer’s price in US dollars

Bigger engines, more horsepower, and heavier, wider bodies all tend to push the price up; better fuel economy tends to come with cheaper, smaller cars. Those are exactly the relationships a linear model can capture.

Preparing the Features

You will predict price from five strong numeric features. Two preparation steps carry over from earlier lessons and matter even more for gradient descent.

First, split the data so you can judge the model on cars it never trained on. Second, scale the features. Gradient descent is especially sensitive to feature scale: when curb_weight runs into the thousands and width sits around 65, the raw gradients are wildly different in magnitude, and the optimizer zig-zags. Standardizing every feature to mean 0 and standard deviation 1 puts them on equal footing.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

features = ["engine_size", "horsepower", "curb_weight", "width", "highway_mpg"]
X = df[features]
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)   # learn mean/std on TRAIN only
X_test = scaler.transform(X_test)         # apply the SAME transform to test

print("Train cars:", X_train.shape[0], "| Test cars:", X_test.shape[0])
# Output: Train cars: 119 | Test cars: 40

The golden rule still applies

Fit the scaler on the training data only, then apply it to both sets. If you fit on the full dataset, information from the test cars leaks into training and your reported error becomes too optimistic. This rule never changes, no matter which optimizer trains the model.


A Baseline: Ordinary Least Squares

Before you reach for SGD, fit the model you already trust. LinearRegression solves for the best-fit coefficients exactly, using linear algebra rather than iteration. It is the gold standard to measure SGD against, because it finds the provably optimal least-squares line in one shot.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error, mean_absolute_error

ols = LinearRegression()
ols.fit(X_train, y_train)

for name, coef in zip(features, ols.coef_):
    print(f"  {name:<12} coef = {coef:8.1f}")
print(f"  intercept = {ols.intercept_:.1f}")
# Output:
#   engine_size  coef =   1808.4
#   horsepower   coef =    336.5
#   curb_weight  coef =   1935.4
#   width        coef =   1892.0
#   highway_mpg  coef =     82.6
#   intercept = 11442.5

Because the features are standardized, these coefficients are directly comparable: each tells you how many dollars the predicted price moves for a one-standard-deviation change in that feature, holding the others fixed. curb_weight (1935.4), width (1892.0), and engine_size (1808.4) are the heaviest hitters, all within a few hundred dollars of one another. highway_mpg barely moves the needle once size and power are accounted for. The intercept, 11442.5, is the predicted price for an average car, which lines up nicely with the dataset’s $11,446 mean.
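If you ever need the effect per raw unit (dollars per pound, say) rather than per standard deviation, dividing each coefficient by the scaler's stored standard deviation converts it back. The sketch below illustrates this on synthetic two-feature data; the feature scales and the true slopes (4 and 300) are made-up assumptions, not values from the automobiles dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic features on very different scales (think weight and width)
X_raw = rng.normal(loc=[2500.0, 65.0], scale=[500.0, 2.0], size=(200, 2))
y = 4.0 * X_raw[:, 0] + 300.0 * X_raw[:, 1] + rng.normal(scale=100.0, size=200)

scaler = StandardScaler()
X_std = scaler.fit_transform(X_raw)
model = LinearRegression().fit(X_std, y)

# model.coef_ is "dollars per one standard deviation";
# dividing by scaler.scale_ gives "dollars per raw unit".
raw_coefs = model.coef_ / scaler.scale_
print(model.coef_.round(1), raw_coefs.round(2))
```

The conversion works because a standardized feature is the raw feature minus its mean, divided by its standard deviation, so the division simply undoes that rescaling.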

The chart below ranks those standardized coefficients so the pecking order is obvious at a glance.

[Figure: Bar chart of standardized regression coefficients for car price. Curb weight, engine size, and width dominate the price prediction; fuel economy contributes little once they are accounted for.]

Now measure how well this baseline does on the held-out test cars. You will use three standard regression metrics: the coefficient of determination \(R^2\), the root mean squared error (RMSE), and the mean absolute error (MAE).

\[ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \]

\(R^2\) is the fraction of price variation the model explains; 1.0 is perfect and 0 is no better than always guessing the mean. RMSE and MAE report the typical error back in dollars.

ols_pred = ols.predict(X_test)

print(f"  R^2  = {ols.score(X_test, y_test):.3f}")
print(f"  RMSE = ${root_mean_squared_error(y_test, ols_pred):,.0f}")
print(f"  MAE  = ${mean_absolute_error(y_test, ols_pred):,.0f}")
# Output:
#   R^2  = 0.793
#   RMSE = $2,327
#   MAE  = $1,863

An \(R^2\) of 0.793 means the five features explain about 79 percent of the variation in car prices. The typical prediction lands within roughly $1,863 (MAE) of the true price, and RMSE, which punishes the occasional large miss more harshly, is $2,327. For a five-column linear model on raw specs, that is a solid fit, and it is the bar SGD now has to clear.
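The three metrics are short enough to compute by hand, which is a useful check on what the library functions report. The sketch below uses four made-up prices and predictions (an assumption for illustration, not rows from the dataset) and follows the definitions given above.

```python
import numpy as np

# Four hypothetical cars: actual prices and a model's predictions
y_true = np.array([10_000.0, 12_000.0, 8_000.0, 15_000.0])
y_pred = np.array([11_000.0, 11_500.0, 8_500.0, 14_000.0])

residuals = y_true - y_pred
ss_res = np.sum(residuals ** 2)                  # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean

r2 = 1.0 - ss_res / ss_tot
rmse = np.sqrt(np.mean(residuals ** 2))          # squares first, so big misses weigh more
mae = np.mean(np.abs(residuals))                 # plain average miss
print(round(r2, 3), round(rmse, 1), round(mae, 1))
```

RMSE can never be smaller than MAE, and the gap between them widens when a few errors are much larger than the rest.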

You can see the quality of the fit directly by plotting each test car’s predicted price against its actual price. A perfect model would put every point on the diagonal.

[Figure: Scatter plot of predicted versus actual car prices clustered along the diagonal. Predicted prices track actual prices closely along the diagonal, with the expected scatter for a linear model.]

Connecting the From-Scratch Loop to SGDRegressor

Before switching optimizers, it helps to remember what your hand-coded gradient descent actually produced, because it is the same machinery SGDRegressor runs internally.

In the previous lesson you standardized a single feature, engine_size, and the target price, then ran gradient descent with a learning rate of 0.1 for 60 iterations. The loss fell steadily and the weight converged.

# Recap from the from-scratch lesson (one standardized feature)
# learning rate = 0.1, 60 iterations
# Output:
#   final weight w = 0.841
#   final bias   b = 0.000
#   final MSE      = 0.292

On standardized data the slope settles at 0.841, which is exactly the correlation between engine_size and price, and the bias settles at 0 because standardizing centers both variables. The loss curve you plotted told the whole convergence story.
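The link between the slope and the correlation is easy to verify numerically. The sketch below uses synthetic data (the true slope of 2 and the sample size are arbitrary assumptions) and also checks that the minimum MSE on standardized data equals one minus the squared correlation, which is consistent with the recap's 0.841 and 0.292 up to rounding.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(size=500)

# Standardize both variables (mean 0, standard deviation 1)
x_std = (x - x.mean()) / x.std()
y_std = (y - y.mean()) / y.std()

slope = np.sum(x_std * y_std) / np.sum(x_std ** 2)  # least-squares slope, no intercept needed
r = np.corrcoef(x, y)[0, 1]                          # Pearson correlation
mse = np.mean((y_std - slope * x_std) ** 2)          # loss at the optimum

print(round(slope, 3), round(r, 3), round(mse, 3), round(1 - r ** 2, 3))
```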

[Figure: Gradient descent loss curve falling and flattening over iterations. The mean squared error drops fast in the first iterations, then flattens as the weight nears its optimum.]

You also saw that the learning rate makes or breaks the whole process. Too small and the descent crawls; too large and it overshoots the minimum and may never settle.

[Figure: Loss curves for three learning rates showing slow, good, and overshooting behavior. A rate of 0.01 converges too slowly, 0.1 is well tuned, and 0.6 overshoots and bounces.]
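The three behaviours can be reproduced with a small batch gradient descent loop. This is a sketch on synthetic standardized data, with a hypothetical helper `descend`; it tracks the gap between the current weight and the exact optimum over the first ten iterations.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
x = (x - x.mean()) / x.std()             # standardized feature
y = 0.8 * x + rng.normal(scale=0.5, size=200)
y = y - y.mean()                         # centered target, so no bias term is needed

def descend(eta, n_iter=10):
    """Batch gradient descent on w alone; returns w after each iteration."""
    w, path = 0.0, []
    for _ in range(n_iter):
        grad = -2.0 * np.mean(x * (y - w * x))
        w -= eta * grad
        path.append(w)
    return np.array(path)

w_best = np.mean(x * y)                  # exact optimum when x is standardized
gaps = {eta: descend(eta) - w_best for eta in (0.01, 0.1, 0.6)}
for eta, gap in gaps.items():
    print(eta, np.round(gap[:4], 3))     # distance from the optimum, first 4 steps
```

With 0.01 the gap shrinks slowly and keeps one sign, with 0.1 it shrinks quickly, and with 0.6 it flips sign on every step because each update jumps past the minimum. On this one-feature problem the bouncing still dies out; pushing the rate higher would make it grow instead.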

Keep that picture in mind, because SGDRegressor exposes the very same knobs under different names. Here is the translation:

From-scratch concept             SGDRegressor parameter
-------------------------------  ---------------------------------------
Cost function (MSE)              loss="squared_error" (the default)
Learning rate \(\eta\)           eta0, plus a learning_rate schedule
Number of passes over the data   max_iter
When to stop early               tol

The one genuinely new idea is learning_rate. In your loop the rate was a fixed constant. scikit-learn, by default, shrinks the learning rate as training proceeds, taking big steps early and small careful steps near the minimum. That is learning_rate="invscaling". You can pin it to a constant with learning_rate="constant" if you want to mirror your hand-coded behavior exactly.
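To see how quickly the rate shrinks, you can evaluate the schedule yourself. The scikit-learn documentation describes "invscaling" as eta0 divided by t raised to the power power_t, where t counts updates and power_t defaults to 0.25 for SGDRegressor. The helper `invscaling` below is a hypothetical stand-in written for illustration, not a scikit-learn function.

```python
eta0, power_t = 0.01, 0.25   # documented SGDRegressor defaults (assumed here)

def invscaling(t, eta0=eta0, power_t=power_t):
    """Learning rate after t updates under the inverse-scaling schedule."""
    return eta0 / t ** power_t

rates = [invscaling(t) for t in (1, 16, 10_000)]
print(rates)   # roughly 0.01, 0.005, 0.001
```

The decay is gentle: it takes ten thousand updates for the rate to fall by a factor of ten, so early steps stay bold while later steps become progressively more careful.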


Training SGDRegressor

Now train the stochastic optimizer on the full five-feature problem, using the exact same scaled train and test sets you prepared for the OLS baseline. The interface is identical to every other scikit-learn model: instantiate, .fit(), .predict().

from sklearn.linear_model import SGDRegressor

sgd = SGDRegressor(max_iter=2000, random_state=42, eta0=0.01)
sgd.fit(X_train, y_train)

sgd_r2 = sgd.score(X_test, y_test)
print(f"  SGD test R^2 = {sgd_r2:.3f}")
# Output:
#   SGD test R^2 = 0.795

The first two hyperparameters do exactly what their from-scratch counterparts did, and the third makes the run repeatable:

  • max_iter=2000 lets the optimizer take up to 2,000 passes over the data, plenty for it to settle.
  • eta0=0.01 sets the initial learning rate, the \(\eta\) from your update rule.
  • random_state=42 fixes the randomness of the data shuffling between passes, so your run is reproducible.

That is the entire model. No loop, no manual gradient, no convergence check that you have to write yourself, just a single .fit() call standing in for the dozens of lines you wrote by hand.
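Although you no longer write the convergence check, you can still see what it did. After fitting, SGDRegressor exposes `n_iter_`, the number of passes it actually ran before the stopping rule kicked in. The sketch below uses synthetic data with three features and made-up true coefficients, so the numbers will differ from the car-price model.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))                        # already on a standard scale
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=300)

sgd = SGDRegressor(max_iter=2000, tol=1e-3, eta0=0.01, random_state=42)
sgd.fit(X, y)

print("passes actually run:", sgd.n_iter_)   # typically far fewer than max_iter
print("coefficients:", sgd.coef_.round(2))
```

Treat max_iter as a ceiling rather than a target: if `n_iter_` ever equals max_iter, the optimizer ran out of budget before meeting the tol criterion.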


SGD Versus OLS: Do They Agree?

Here is the moment of truth. LinearRegression solved for the exact optimum with linear algebra. SGDRegressor crept toward the minimum with thousands of small, noisy steps. If gradient descent works the way the theory promises, the two should land in almost the same place.

print(f"  OLS R^2 = {ols.score(X_test, y_test):.3f}")
print(f"  SGD R^2 = {sgd.score(X_test, y_test):.3f}")
# Output:
#   OLS R^2 = 0.793
#   SGD R^2 = 0.795

They match. OLS scores 0.793 on the test cars and SGD scores 0.795, a difference well within the noise of a 40-car test set. The iterative optimizer found essentially the same line as the exact algebraic solver. The bar chart makes the agreement unmistakable.

[Figure: Bar chart comparing OLS and SGD test R-squared. SGDRegressor reaches essentially the same fit as ordinary least squares.]

This is the payoff of the whole module. For a small dataset like this one, you would simply use LinearRegression, because the exact solution is fast and free. But that exact solution relies on inverting a matrix, which becomes expensive, and eventually impossible, as the number of rows and features explodes. SGD never inverts anything. It only ever does cheap, incremental updates, so it keeps working on data far too large for the exact method, and it is the same optimizer that trains neural networks on millions of examples. Seeing it reproduce the least-squares answer on a problem you can verify is exactly how you build trust in it for problems you cannot.

When to choose which

Reach for LinearRegression when the data fits comfortably in memory and you want the exact answer with zero tuning. Reach for SGDRegressor when the dataset is huge, when it arrives as a stream you cannot hold all at once, or when you want the same training style you will later use for larger models. On a problem this size, either one is correct, and they agree.
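For the streaming case, SGDRegressor offers `partial_fit`, which updates the model from one chunk of data at a time. The sketch below simulates a stream with synthetic chunks; the chunk size, the number of chunks, and the true weights are assumptions made for the example, and in practice each chunk would already need to be scaled.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
true_w = np.array([1.5, -2.0])           # hypothetical weights behind the stream
sgd = SGDRegressor(eta0=0.01, random_state=42)

# Pretend each chunk arrives from a stream and is discarded after use.
for _ in range(200):
    X_chunk = rng.normal(size=(50, 2))
    y_chunk = X_chunk @ true_w + rng.normal(scale=0.1, size=50)
    sgd.partial_fit(X_chunk, y_chunk)    # one pass over this chunk only

print(sgd.coef_.round(2))                # should approach true_w
```

The full dataset never exists in memory at once, which is something an exact one-shot solver cannot offer.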


Putting It All Together

Here is the complete pipeline, from raw CSV to a head-to-head comparison, condensed into one runnable script.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, SGDRegressor

# 1. Load
df = pd.read_csv("automobiles.csv")  # download: https://datatweets.com/datasets/automobiles.csv

# 2. Prepare: choose features, split, scale (fit on train only)
features = ["engine_size", "horsepower", "curb_weight", "width", "highway_mpg"]
X, y = df[features], df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 3. Exact solution
ols = LinearRegression().fit(X_train, y_train)

# 4. Stochastic gradient descent
sgd = SGDRegressor(max_iter=2000, random_state=42, eta0=0.01).fit(X_train, y_train)

# 5. Compare
print(f"OLS R^2 = {ols.score(X_test, y_test):.3f}")
print(f"SGD R^2 = {sgd.score(X_test, y_test):.3f}")
# Output:
# OLS R^2 = 0.793
# SGD R^2 = 0.795

In about twenty lines you loaded real car data, scaled it, trained both the exact and the iterative optimizer, and confirmed they agree. That is gradient descent, scaled up and made practical.
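One optional refinement is to bundle the scaler and the optimizer into a single scikit-learn pipeline, so the scaler is always fit on whatever data you pass to .fit() and reapplied automatically at prediction time. The sketch below shows the pattern on synthetic data (the feature scales and coefficients are invented), since the lesson itself scales the arrays by hand.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=[2500.0, 65.0], scale=[500.0, 2.0], size=(400, 2))
y = 4.0 * X[:, 0] + 300.0 * X[:, 1] + rng.normal(scale=200.0, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Scaling and SGD travel together; .fit() learns the scaler on the training rows only.
model = make_pipeline(
    StandardScaler(),
    SGDRegressor(max_iter=2000, eta0=0.01, random_state=42),
)
model.fit(X_train, y_train)

test_r2 = model.score(X_test, y_test)    # raw test rows are scaled inside the pipeline
print(round(test_r2, 3))
```

Because the pipeline handles the transform itself, there is no way to accidentally fit the scaler on the test set or forget to scale new data before predicting.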


Practice Exercises

Try these before checking the hints. Reuse the scaled X_train, X_test, y_train, y_test from the lesson.

Exercise 1: Pin the Learning Rate to a Constant

The lesson used scikit-learn’s default shrinking learning-rate schedule. Train a new SGDRegressor with learning_rate="constant" and eta0=0.01 so the rate never changes, mirroring your from-scratch loop. Compare its test \(R^2\) to the 0.795 from the lesson.

from sklearn.linear_model import SGDRegressor

# Your code here

Hint

Instantiate SGDRegressor(learning_rate="constant", eta0=0.01, max_iter=2000, random_state=42), then .fit(X_train, y_train) and .score(X_test, y_test). You should land very close to the lesson’s result; a constant rate works fine here because the features are standardized.

Exercise 2: Read the SGD Coefficients

The OLS coefficients ranked curb_weight, width, and engine_size as the biggest drivers of price. Print sgd.coef_ alongside the feature names and check whether SGD agrees on the ranking.

# Your code here (sgd is the trained model from the lesson)

Hint

Loop with for name, coef in zip(features, sgd.coef_): print(name, round(coef, 1)). Because SGD found essentially the same line as OLS, its coefficients should be close to the OLS values (curb_weight, width, and engine_size largest, highway_mpg smallest).

Exercise 3: Watch a Bad Learning Rate

From the from-scratch lesson you know a learning rate that is too large makes gradient descent overshoot. Train an SGDRegressor with learning_rate="constant" and a deliberately large eta0=1.0, then print its test \(R^2\). What happens to the fit?

# Your code here

Hint

Use SGDRegressor(learning_rate="constant", eta0=1.0, max_iter=2000, random_state=42). A large constant rate makes the updates overshoot the minimum, so the \(R^2\) drops well below 0.79 (and may even go negative). This is the same instability you saw with the 0.6 learning-rate curve earlier in the module.


Summary

You connected the gradient descent you built by hand to the optimizer professionals actually deploy, and you proved on real data that it works.

Key Concepts

Why plain gradient descent struggles

  • Batch gradient descent sums the gradient over every row before each update, which is slow on large data
  • Adding features expands the search space and can introduce flat regions and local minima

Stochastic gradient descent

  • SGD estimates the gradient from one random sample (or small batch) and updates immediately
  • Each step is noisier but far cheaper, so SGD takes many more steps and scales to huge data
  • The update rule \(w \leftarrow w - \eta \, \partial \text{MSE} / \partial w\) is unchanged; only the gradient estimate differs

Using SGDRegressor

  • loss="squared_error" is the MSE cost; eta0 is the initial learning rate; max_iter is the maximum number of passes; tol controls early stopping
  • learning_rate="invscaling" shrinks the rate over time by default; "constant" mirrors a fixed-rate loop
  • The interface is the standard scikit-learn pattern: instantiate, .fit(), .predict(), .score()

SGD versus OLS

  • LinearRegression finds the exact least-squares solution; SGDRegressor approximates it iteratively
  • On the car-price data they agree: OLS \(R^2 = 0.793\) and SGD \(R^2 = 0.795\)
  • Standardizing features is essential for gradient descent so no feature’s scale dominates the updates

Why This Matters

On 159 cars you would just use LinearRegression, and the comparison here is exactly how you build confidence in SGD for the cases where you cannot. The exact least-squares solution relies on a matrix inversion that grows too expensive, and eventually infeasible, as data scales. Stochastic gradient descent does only cheap incremental updates, so it keeps working on data that would never fit a one-shot solver, and it is the very optimizer that trains the large models you will meet later. Watching it reproduce the answer you can verify is what earns it your trust on the answers you cannot.


Next Steps

You have a full regression toolkit now: linear regression, its interpretation and diagnostics, the gradient descent that trains it, and the scikit-learn tools that scale it. Time to put all of it to work on a brand-new, messier dataset.

Continue to Lesson 8 - Guided Project: Predicting Insurance Costs

Put it all together on a real medical-cost dataset.



Keep Building Your Skills

You just closed the loop between theory and practice: the optimizer you wrote line by line is the same one that, dressed in scikit-learn’s interface, trains models on data far beyond what you could fit in a single batch. Whenever you see SGDRegressor, SGDClassifier, or the training of a neural network, picture that loss curve falling step by step. The machinery is identical, and now you understand it from the inside out.