Lesson 7 - Gradient Descent with Scikit-Learn
From Hand-Rolled Loops to Production Tools
In the previous lessons you built gradient descent from scratch. You watched a loss curve fall, tuned a learning rate by hand, and saw a weight converge on the same slope ordinary least squares would have found. That hand-coded version was perfect for understanding the idea. It is not what you reach for on real projects.
This lesson bridges the gap. You will learn why the plain gradient descent loop you wrote starts to creak as datasets grow, meet the practical fix called stochastic gradient descent, and then hand the whole job to scikit-learn’s SGDRegressor. Along the way you will predict car prices from real engine and body measurements, and you will prove to yourself that this optimizer lands on essentially the same answer as the exact least-squares solution.
By the end of this lesson, you will be able to:
- Explain the two ways plain (batch) gradient descent slows down as data and features grow
- Describe how stochastic gradient descent fixes those problems by sampling the data
- Connect the from-scratch update rule you wrote to the knobs SGDRegressor exposes
- Train SGDRegressor on real, scaled features and read its coefficients
- Compare an SGD fit against ordinary least squares and confirm they agree
You should already be comfortable with the gradient descent update rule, train_test_split, and StandardScaler from earlier lessons. Let’s begin.
Why Plain Gradient Descent Struggles
Recall the loop you wrote. On every single iteration, it looked at all the data to compute one update. That is fine for 159 cars and one feature. It becomes a problem in two directions, and both matter the moment you leave a textbook.
Problem 1: More features means more ground to cover
Each feature you add is another dimension the optimizer must navigate. With one predictor, the loss surface is a simple 2D bowl, and rolling downhill is easy. Add a fifth feature, a fiftieth, a ten-thousandth, and that bowl becomes a high-dimensional landscape. There is vastly more ground to search, so reaching the bottom takes more iterations, and each iteration is more expensive.
Problem 2: Every step rereads the whole dataset
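Look again at the cost of one update in the from-scratch version. To compute the gradient for the weight, you summed an error term across every row. For a one-feature model $\hat{y} = w x + b$ trained on mean squared error, that gradient has the form below (your earlier loop may have folded the constant factor differently):

$$\frac{\partial\,\text{MSE}}{\partial w} = \frac{2}{n}\sum_{i=1}^{n}\bigl(w x_i + b - y_i\bigr)\,x_i$$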
That sum runs over all rows for a single step. With 159 rows it is instant. With ten million rows, one step touches ten million records, and you might need hundreds of steps. This full-dataset-per-step approach is called batch gradient descent, and on large data it is painfully slow.
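To make that per-step cost concrete, here is a minimal sketch, assuming a one-feature model and an MSE loss. The function name `batch_gradient` and the toy numbers are illustrative and not taken from the earlier lesson. Notice that the loop must visit every row before a single update can happen.

```python
def batch_gradient(xs, ys, w, b):
    """Return (dMSE/dw, dMSE/db) for y_hat = w*x + b, using every row."""
    n = len(xs)
    grad_w = 0.0
    grad_b = 0.0
    for x, y in zip(xs, ys):          # one pass over ALL n rows
        error = (w * x + b) - y
        grad_w += 2 * error * x / n
        grad_b += 2 * error / n
    return grad_w, grad_b


# Toy data lying exactly on y = 2x + 1 (made-up values)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

print(batch_gradient(xs, ys, w=0.0, b=0.0))  # far from the minimum: large gradient
print(batch_gradient(xs, ys, w=2.0, b=1.0))  # at the minimum: both components are zero
```

With four rows the loop is trivial. With ten million rows, that same `for` loop runs ten million times per step.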
A second, subtler issue
More features also tend to create more local minima and flat regions in the loss surface. A simple linear-regression loss is a smooth bowl with one minimum, so this is not a concern here. But for the more complex models you will meet later, an optimizer that only ever takes perfectly smooth, averaged steps can get stuck. A little randomness, which is exactly what stochastic gradient descent adds, helps escape those traps.
The Fix: Stochastic Gradient Descent
The idea behind stochastic gradient descent (SGD) is almost embarrassingly simple. Instead of summing the gradient over the entire dataset before each update, you estimate the gradient from just one randomly chosen sample (or a small batch) and update immediately.
Each individual step is now noisier, because one sample is a rough estimate of the true gradient. But each step is also enormously cheaper, and you get to take many more of them in the same amount of time. The noise even helps: it nudges the optimizer out of flat spots. Averaged over many tiny, cheap, slightly-random steps, SGD marches downhill just like batch gradient descent, only far faster on large data.
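A small sketch can show why one row is a "rough estimate" of the true gradient. It assumes a one-feature model with squared error; the helper `sample_gradient_w` and the toy data are hypothetical. The batch gradient is simply the average of the per-row gradients, so a single row is sometimes too big and sometimes too small, but it points the right way on average.

```python
import random


def sample_gradient_w(x, y, w, b):
    """Gradient of ONE row's squared error with respect to w."""
    return 2 * ((w * x + b) - y) * x


# Toy data on y = 2x + 1 (made-up values), evaluated at w = 0, b = 0
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = 0.0, 0.0

per_row = [sample_gradient_w(x, y, w, b) for x, y in zip(xs, ys)]
full_gradient = sum(per_row) / len(per_row)   # what batch gradient descent uses

rng = random.Random(0)
i = rng.randrange(len(xs))                    # pick one row at random

print("per-row gradients:     ", per_row)
print("full (batch) gradient: ", full_gradient)
print("one random row's guess:", per_row[i])
```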
Compare the two update strategies side by side:
| Batch gradient descent | Stochastic gradient descent |
|---|---|
| Read ALL n rows | Read ONE random row |
| Compute exact gradient | Compute a noisy gradient estimate |
| Update weights once | Update weights once |
| Repeat | Repeat (many more times, cheaply) |
| Slow per step, smooth path | Fast per step, jittery path |

The update rule itself is unchanged. You still nudge each weight against the gradient, scaled by the learning rate $\eta$:

$$w \leftarrow w - \eta\,\frac{\partial\,\text{MSE}}{\partial w}$$
The only difference is how many rows go into estimating that gradient. That single change is what lets gradient descent scale to data that would never fit in a single batched computation.
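Here is a minimal from-scratch sketch of that idea, assuming a one-feature model and synthetic data generated from y = 2x + 1 plus a little noise. All names and values are illustrative. It visits rows in a freshly shuffled order each pass, which is a common variant of picking one random row at a time.

```python
import random

rng = random.Random(0)

# Synthetic data (an assumption for illustration): y = 2x + 1 plus small noise
xs = [rng.uniform(-1.0, 1.0) for _ in range(200)]
ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.1) for x in xs]

w, b = 0.0, 0.0
learning_rate = 0.05

for epoch in range(20):
    order = list(range(len(xs)))
    rng.shuffle(order)                 # visit rows in a fresh random order
    for i in order:
        error = (w * xs[i] + b) - ys[i]
        # Gradient estimated from ONE row, followed by an immediate update
        w -= learning_rate * 2 * error * xs[i]
        b -= learning_rate * 2 * error

print(f"w = {w:.2f}, b = {b:.2f}")  # should land near the true 2 and 1
```

Each inner-loop iteration touches a single row, so 20 passes over 200 rows produce 4,000 cheap updates instead of 20 expensive ones.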
The Dataset: Predicting Car Prices
You will put SGD to work on the same dataset you have used throughout this module: the classic automobiles dataset of imported cars, where each row is one car model described by its engine, body, and performance specs. Your job is to predict price in US dollars.
Load it with pandas.
```python
import pandas as pd

df = pd.read_csv("automobiles.csv")  # download: https://datatweets.com/datasets/automobiles.csv
print("Shape:", df.shape)
print(df["price"].describe()[["mean", "min", "max"]].round(0))
# Output:
# Shape: (159, 26)
# mean 11446.0
# min 5118.0
# max 35056.0
# Name: price, dtype: float64
```

There are 159 cars and 26 columns, with no missing values to clean up. Prices run from about $5,118 for the cheapest economy car to $35,056 for the priciest model, averaging around $11,446.
A Data Dictionary
You will not need every column. Here are the numeric ones that carry the most signal about price; the model in this lesson uses a subset of them:

| Column | Meaning |
|---|---|
| engine_size | Engine displacement in cubic inches |
| horsepower | Engine power output |
| curb_weight | Weight of the car in pounds, ready to drive |
| width | Body width in inches |
| length | Body length in inches |
| wheel_base | Distance between front and rear axles |
| city_mpg, highway_mpg | Fuel economy in miles per gallon |
| price | Target: manufacturer’s price in US dollars |
Bigger engines, more horsepower, and heavier, wider bodies all tend to push the price up; better fuel economy tends to come with cheaper, smaller cars. Those are exactly the relationships a linear model can capture.
Preparing the Features
You will predict price from five strong numeric features. Two preparation steps carry over from earlier lessons and matter even more for gradient descent.
First, split the data so you can judge the model on cars it never trained on. Second, scale the features. Gradient descent is especially sensitive to feature scale: when curb_weight runs into the thousands and width sits around 65, the raw gradients are wildly different in magnitude, and the optimizer zig-zags. Standardizing every feature to mean 0 and standard deviation 1 puts them on equal footing.
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

features = ["engine_size", "horsepower", "curb_weight", "width", "highway_mpg"]
X = df[features]
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # learn mean/std on TRAIN only
X_test = scaler.transform(X_test)        # apply the SAME transform to test

print("Train cars:", X_train.shape[0], "| Test cars:", X_test.shape[0])
# Output: Train cars: 119 | Test cars: 40
```

The golden rule still applies
Fit the scaler on the training data only, then apply it to both sets. If you fit on the full dataset, information from the test cars leaks into training and your reported error becomes too optimistic. This rule never changes, no matter which optimizer trains the model.
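The rule can be spelled out by hand in a few lines. This is a sketch with made-up curb weights, not values from the dataset. It uses the population standard deviation, which is what StandardScaler computes, and it learns the mean and standard deviation from the training rows only.

```python
import statistics

# Made-up curb weights in pounds, already split into train and test
train_weights = [2000.0, 2500.0, 3000.0, 3500.0]
test_weights = [2750.0, 4000.0]

# Learn the statistics from the TRAINING rows only
mean = statistics.fmean(train_weights)
std = statistics.pstdev(train_weights)   # population std (divides by n)

# Apply the same mean and std to both sets
train_scaled = [(x - mean) / std for x in train_weights]
test_scaled = [(x - mean) / std for x in test_weights]

print("train mean/std:", mean, round(std, 1))
print("train scaled:  ", [round(z, 2) for z in train_scaled])
print("test scaled:   ", [round(z, 2) for z in test_scaled])
```

The scaled training values have mean 0 and standard deviation 1 by construction. The test values generally do not, and that is expected: they are measured against the training set's yardstick.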
A Baseline: Ordinary Least Squares
Before you reach for SGD, fit the model you already trust. LinearRegression solves for the best-fit coefficients exactly, using linear algebra rather than iteration. It is the gold standard to measure SGD against, because it finds the provably optimal least-squares line in one shot.
```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error, mean_absolute_error

ols = LinearRegression()
ols.fit(X_train, y_train)

for name, coef in zip(features, ols.coef_):
    print(f" {name:<12} coef = {coef:8.1f}")
print(f" intercept = {ols.intercept_:.1f}")
# Output:
# engine_size coef = 1808.4
# horsepower coef = 336.5
# curb_weight coef = 1935.4
# width coef = 1892.0
# highway_mpg coef = 82.6
# intercept = 11442.5
```

Because the features are standardized, these coefficients are directly comparable: each tells you how many dollars the predicted price moves for a one-standard-deviation change in that feature, holding the others fixed. curb_weight (1935.4) and engine_size (1808.4) are the heaviest hitters, with width close behind. highway_mpg barely moves the needle once size and power are accounted for. The intercept, 11442.5, is the predicted price for an average car, which lines up nicely with the dataset’s $11,446 mean.
The chart below ranks those standardized coefficients so the pecking order is obvious at a glance.
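If you ever need a coefficient in raw units again (dollars per unit of the original feature rather than per standard deviation), the two are related by the feature's standard deviation. The sketch below checks that on made-up numbers for a single feature. The helper `ols_slope` and the five hypothetical cars are illustrative only.

```python
import statistics

# Made-up numbers: engine size versus price for five hypothetical cars
engine_size = [90.0, 110.0, 130.0, 150.0, 170.0]
price = [7000.0, 9500.0, 11000.0, 14500.0, 16000.0]


def ols_slope(xs, ys):
    """Closed-form least-squares slope for a single feature."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


mean_x = statistics.fmean(engine_size)
std_x = statistics.pstdev(engine_size)
scaled = [(x - mean_x) / std_x for x in engine_size]

slope_raw = ols_slope(engine_size, price)   # dollars per raw unit
slope_std = ols_slope(scaled, price)        # dollars per standard deviation

print(round(slope_raw, 1), round(slope_std, 1), round(slope_raw * std_x, 1))
```

The last two printed numbers match: the standardized slope equals the raw slope multiplied by the feature's standard deviation.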
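Now measure how well this baseline does on the held-out test cars. You will use three standard regression metrics: the coefficient of determination $R^2$, the root mean squared error (RMSE), and the mean absolute error (MAE).

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}, \qquad \text{RMSE} = \sqrt{\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2}, \qquad \text{MAE} = \frac{1}{n}\sum_i \lvert y_i - \hat{y}_i \rvert$$

$R^2$ is the fraction of price variation the model explains; 1.0 is perfect and 0 is no better than always guessing the mean. RMSE and MAE report the typical error back in dollars.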
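All three metrics are short enough to compute by hand, which is a useful sanity check on what the library functions return. The sketch below uses four made-up prices and predictions, not values from the car dataset.

```python
import math

# Made-up prices in dollars for four hypothetical test cars
y_true = [10000.0, 12000.0, 8000.0, 15000.0]
y_pred = [11000.0, 11500.0, 9000.0, 14000.0]

n = len(y_true)
residuals = [t - p for t, p in zip(y_true, y_pred)]
mean_true = sum(y_true) / n

ss_res = sum(r ** 2 for r in residuals)              # unexplained variation
ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total variation

r2 = 1 - ss_res / ss_tot
rmse = math.sqrt(ss_res / n)
mae = sum(abs(r) for r in residuals) / n

print(f"R^2 = {r2:.3f}, RMSE = ${rmse:,.0f}, MAE = ${mae:,.0f}")
```

RMSE comes out larger than MAE here because squaring gives the three $1,000 misses more weight than the single $500 miss.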
```python
ols_pred = ols.predict(X_test)
print(f" R^2 = {ols.score(X_test, y_test):.3f}")
print(f" RMSE = ${root_mean_squared_error(y_test, ols_pred):,.0f}")
print(f" MAE = ${mean_absolute_error(y_test, ols_pred):,.0f}")
# Output:
# R^2 = 0.793
# RMSE = $2,327
# MAE = $1,863
```

An $R^2$ of 0.793 means the five features explain about 79 percent of the variation in car prices. The typical prediction lands within roughly $1,863 (MAE) of the true price, and RMSE, which punishes the occasional large miss more harshly, is $2,327. For a five-column linear model on raw specs, that is a solid fit, and it is the bar SGD now has to clear.
You can see the quality of the fit directly by plotting each test car’s predicted price against its actual price. A perfect model would put every point on the diagonal.
Connecting the From-Scratch Loop to SGDRegressor
Before switching optimizers, it helps to remember what your hand-coded gradient descent actually produced, because it is the same machinery SGDRegressor runs internally.
In the previous lesson you standardized a single feature, engine_size, and the target price, then ran gradient descent with a learning rate of 0.1 for 60 iterations. The loss fell steadily and the weight converged.
```python
# Recap from the from-scratch lesson (one standardized feature)
# learning rate = 0.1, 60 iterations
# Output:
# final weight w = 0.841
# final bias b = 0.000
# final MSE = 0.292
```

On standardized data the slope settles at 0.841, which is exactly the correlation between engine_size and price, and the bias settles at 0 because standardizing centers both variables. The loss curve you plotted told the whole convergence story.
You also saw that the learning rate makes or breaks the whole process. Too small and the descent crawls; too large and it overshoots the minimum and may never settle.
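Both observations can be reproduced in a few lines. The sketch below is an illustration on made-up data, not the lesson's car data, and the helper names are hypothetical. It runs batch gradient descent on a standardized feature and target with three learning rates and compares the final weight with the correlation.

```python
import statistics


def standardize(values):
    """Rescale to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def gradient_descent(xs, ys, learning_rate, iterations):
    """Batch gradient descent on MSE for y_hat = w*x + b."""
    w, b, n = 0.0, 0.0, len(xs)
    for _ in range(iterations):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        w -= learning_rate * 2 * sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * 2 * sum(errors) / n
    return w, b


# Made-up values for five hypothetical cars
x = standardize([90.0, 110.0, 130.0, 150.0, 170.0])
y = standardize([7000.0, 9500.0, 11000.0, 14500.0, 16000.0])

# For standardized data, the Pearson correlation is the mean of the products
correlation = sum(a * c for a, c in zip(x, y)) / len(x)

for lr in (0.001, 0.1, 1.1):
    w, b = gradient_descent(x, y, learning_rate=lr, iterations=60)
    print(f"lr={lr}: w={w:.3f}  (correlation={correlation:.3f})")
```

With this form of the update, the tiny rate has barely moved after 60 iterations, the middle rate matches the correlation, and the large rate overshoots further on every step until the weight blows up.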
Keep that picture in mind, because SGDRegressor exposes the very same knobs under different names. Here is the translation:
| From-scratch concept | SGDRegressor parameter |
|---|---|
| Cost function (MSE) | loss="squared_error" (the default) |
| Learning rate | eta0, plus a learning_rate schedule |
| Number of passes over the data | max_iter |
| When to stop early | tol |
The one genuinely new idea is learning_rate. In your loop the rate was a fixed constant. scikit-learn, by default, shrinks the learning rate as training proceeds, taking big steps early and small careful steps near the minimum. That is learning_rate="invscaling". You can pin it to a constant with learning_rate="constant" if you want to mirror your hand-coded behavior exactly.
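scikit-learn's documentation describes the invscaling schedule as eta = eta0 / t^power_t, where t counts updates and power_t defaults to 0.25 for SGDRegressor. The sketch below simply evaluates that formula so you can see how quickly the rate shrinks. The helper name `invscaling` is made up, and the library's internal bookkeeping of t is not reproduced here.

```python
def invscaling(eta0, t, power_t=0.25):
    """Learning rate at update step t under an inverse-scaling schedule."""
    return eta0 / (t ** power_t)


for t in (1, 16, 10_000):
    print(f"t={t:>6}: eta = {invscaling(0.01, t):.4f}")
```

Starting from eta0=0.01, the rate has halved by the 16th update and is ten times smaller by the 10,000th, which is the "big steps early, careful steps late" behavior described above.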
Training SGDRegressor
Now train the stochastic optimizer on the full five-feature problem, using the exact same scaled train and test sets you prepared for the OLS baseline. The interface is identical to every other scikit-learn model: instantiate, .fit(), .predict().
```python
from sklearn.linear_model import SGDRegressor

sgd = SGDRegressor(max_iter=2000, random_state=42, eta0=0.01)
sgd.fit(X_train, y_train)
sgd_r2 = sgd.score(X_test, y_test)
print(f" SGD test R^2 = {sgd_r2:.3f}")
# Output:
# SGD test R^2 = 0.795
```

The three hyperparameters map onto ideas you already know:
- max_iter=2000 lets the optimizer take up to 2,000 passes over the data, plenty for it to settle.
- eta0=0.01 sets the initial learning rate, the same step size that scales the gradient in your update rule.
- random_state=42 fixes the randomness of the data shuffling, so your run is reproducible.
That is the entire model. No loop, no manual gradient, no convergence check that you have to write yourself, just a single .fit() call standing in for the dozens of lines you wrote by hand.
SGD Versus OLS: Do They Agree?
Here is the moment of truth. LinearRegression solved for the exact optimum with linear algebra. SGDRegressor crept toward the minimum with thousands of small, noisy steps. If gradient descent works the way the theory promises, the two should land in almost the same place.
```python
print(f" OLS R^2 = {ols.score(X_test, y_test):.3f}")
print(f" SGD R^2 = {sgd.score(X_test, y_test):.3f}")
# Output:
# OLS R^2 = 0.793
# SGD R^2 = 0.795
```

They match. OLS scores 0.793 on the test cars and SGD scores 0.795, a difference well within the noise of a 40-car test set. The iterative optimizer found essentially the same line as the exact algebraic solver. The bar chart makes the agreement unmistakable.
This is the payoff of the whole module. For a small dataset like this one, you would simply use LinearRegression, because the exact solution is fast and free. But that exact solution relies on inverting a matrix, which becomes expensive, and eventually impossible, as the number of rows and features explodes. SGD never inverts anything. It only ever does cheap, incremental updates, so it keeps working on data far too large for the exact method, and it is the same optimizer that trains neural networks on millions of examples. Seeing it reproduce the least-squares answer on a problem you can verify is exactly how you build trust in it for problems you cannot.
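To see what "the exact solution" involves, here is a sketch of the textbook normal equations in NumPy on synthetic data. The data and variable names are assumptions for illustration, and scikit-learn's own solver may use a different, more numerically robust routine. The key point is that the method has to form and solve a system built from the whole feature matrix at once.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, made-up data: 200 rows, 3 standardized-looking features
X = rng.normal(size=(200, 3))
true_coef = np.array([3.0, -2.0, 0.5])
y = X @ true_coef + 10.0 + rng.normal(scale=0.1, size=200)

# Append a column of ones so the intercept is learned as one more weight
X1 = np.column_stack([X, np.ones(len(X))])

# Normal equations: solve (X^T X) beta = X^T y in one shot
beta = np.linalg.solve(X1.T @ X1, X1.T @ y)

print("coefficients:", beta[:3].round(2))
print("intercept:   ", beta[3].round(2))
```

Building `X1.T @ X1` touches every row and grows with the square of the number of features, and solving the system grows with its cube. That is cheap for 3 features and painful for many thousands.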
When to choose which
Reach for LinearRegression when the data fits comfortably in memory and you want the exact answer with zero tuning. Reach for SGDRegressor when the dataset is huge, when it arrives as a stream you cannot hold all at once, or when you want the same training style you will later use for larger models. On a problem this size, either one is correct, and they agree.
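For the streaming case, SGDRegressor offers a partial_fit method that updates the model from one chunk at a time. The sketch below simulates a stream with synthetic chunks. The chunk sizes, coefficients, and noise level are made up for illustration, and the features are drawn already standardized so no scaler is needed.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
true_coef = np.array([3.0, -2.0])

model = SGDRegressor(eta0=0.01, random_state=42)

# Pretend the data arrives in 20 chunks of 500 rows (synthetic, made-up)
for _ in range(20):
    X_chunk = rng.normal(size=(500, 2))
    y_chunk = X_chunk @ true_coef + 10.0 + rng.normal(scale=0.1, size=500)
    model.partial_fit(X_chunk, y_chunk)   # one pass over this chunk only

print("coefficients:", model.coef_.round(2))
print("intercept:   ", model.intercept_.round(2))
```

At no point does the full 10,000-row dataset exist in memory, yet the fitted coefficients should end up close to the values used to generate the data. With real streamed data you would also need a scaling strategy that does not require seeing everything up front.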
Putting It All Together
Here is the complete pipeline, from raw CSV to a head-to-head comparison, condensed into one runnable script.
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, SGDRegressor

# 1. Load
df = pd.read_csv("automobiles.csv")  # download: https://datatweets.com/datasets/automobiles.csv

# 2. Prepare: choose features, split, scale (fit on train only)
features = ["engine_size", "horsepower", "curb_weight", "width", "highway_mpg"]
X, y = df[features], df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 3. Exact solution
ols = LinearRegression().fit(X_train, y_train)

# 4. Stochastic gradient descent
sgd = SGDRegressor(max_iter=2000, random_state=42, eta0=0.01).fit(X_train, y_train)

# 5. Compare
print(f"OLS R^2 = {ols.score(X_test, y_test):.3f}")
print(f"SGD R^2 = {sgd.score(X_test, y_test):.3f}")
# Output:
# OLS R^2 = 0.793
# SGD R^2 = 0.795
```

In about twenty lines you loaded real car data, scaled it, trained both the exact and the iterative optimizer, and confirmed they agree. That is gradient descent, scaled up and made practical.
Practice Exercises
Try these before checking the hints. Reuse the scaled X_train, X_test, y_train, y_test from the lesson.
Exercise 1: Pin the Learning Rate to a Constant
The lesson used scikit-learn’s default shrinking learning-rate schedule. Train a new SGDRegressor with learning_rate="constant" and eta0=0.01 so the rate never changes, mirroring your from-scratch loop. Compare its test $R^2$ to the 0.795 from the lesson.

```python
from sklearn.linear_model import SGDRegressor

# Your code here
```

Hint
Instantiate SGDRegressor(learning_rate="constant", eta0=0.01, max_iter=2000, random_state=42), then .fit(X_train, y_train) and .score(X_test, y_test). You should land very close to the lesson’s result; a constant rate works fine here because the features are standardized.
Exercise 2: Read the SGD Coefficients
The OLS coefficients ranked curb_weight and engine_size as the biggest drivers of price. Print sgd.coef_ alongside the feature names and check whether SGD agrees on the ranking.
```python
# Your code here (sgd is the trained model from the lesson)
```

Hint
Loop with for name, coef in zip(features, sgd.coef_): print(name, round(coef, 1)). Because SGD found essentially the same line as OLS, its coefficients should be close to the OLS values (curb_weight and engine_size largest, highway_mpg smallest).
Exercise 3: Watch a Bad Learning Rate
From the from-scratch lesson you know a learning rate that is too large makes gradient descent overshoot. Train an SGDRegressor with learning_rate="constant" and a deliberately large eta0=1.0, then print its test $R^2$. What happens to the fit?

```python
# Your code here
```

Hint
Use SGDRegressor(learning_rate="constant", eta0=1.0, max_iter=2000, random_state=42). A large constant rate makes the updates overshoot the minimum, so the $R^2$ drops well below 0.79 (and may even go negative). This is the same instability you saw with the 0.6 learning-rate curve earlier in the module.
Summary
You connected the gradient descent you built by hand to the optimizer professionals actually deploy, and you proved on real data that it works.
Key Concepts
Why plain gradient descent struggles
- Batch gradient descent sums the gradient over every row before each update, which is slow on large data
- Adding features expands the search space and can introduce flat regions and local minima
Stochastic gradient descent
- SGD estimates the gradient from one random sample (or small batch) and updates immediately
- Each step is noisier but far cheaper, so SGD takes many more steps and scales to huge data
- The update rule is unchanged; only the gradient estimate differs
Using SGDRegressor
- loss="squared_error" is the MSE cost; eta0 is the initial learning rate; max_iter is the number of passes; tol controls early stopping
- learning_rate="invscaling" shrinks the rate over time by default; "constant" mirrors a fixed-rate loop
- The interface is the standard scikit-learn pattern: instantiate, .fit(), .predict(), .score()
SGD versus OLS
- LinearRegression finds the exact least-squares solution; SGDRegressor approximates it iteratively
- On the car-price data they agree: OLS $R^2 = 0.793$ and SGD $R^2 = 0.795$
- Standardizing features is essential for gradient descent so no feature’s scale dominates the updates
Why This Matters
On 159 cars you would just use LinearRegression, and the comparison here is exactly how you build confidence in SGD for the cases where you cannot. The exact least-squares solution relies on a matrix inversion that grows too expensive, and eventually infeasible, as data scales. Stochastic gradient descent does only cheap incremental updates, so it keeps working on data that would never fit a one-shot solver, and it is the very optimizer that trains the large models you will meet later. Watching it reproduce the answer you can verify is what earns it your trust on the answers you cannot.
Next Steps
You have a full regression toolkit now: linear regression, its interpretation and diagnostics, the gradient descent that trains it, and the scikit-learn tools that scale it. Time to put all of it to work on a brand-new, messier dataset.
Continue to Lesson 8 - Guided Project: Predicting Insurance Costs
Put it all together on a real medical-cost dataset.
Keep Building Your Skills
You just closed the loop between theory and practice: the optimizer you wrote line by line is the same one that, dressed in scikit-learn’s interface, trains models on data far beyond what you could fit in a single batch. Whenever you see SGDRegressor, SGDClassifier, or the training of a neural network, picture that loss curve falling step by step. The machinery is identical, and now you understand it from the inside out.