Lesson 2 - How Gradient Boosting Works
Welcome to How Gradient Boosting Works
In Lesson 1, From Trees to Boosting, you saw the big idea: instead of training one large model, boosting trains a sequence of small models where each one focuses on the mistakes of the ones before it. This lesson makes that idea concrete. The analytics team at Northwind Analytics wants to predict home values, and you are going to build the engine that does it, one line of code at a time, using the real California Housing dataset from the 1990 census. The target is MedHouseVal, the median house value (in units of $100,000) for each block group.
Rather than call a library and trust the result, you will construct a gradient booster by hand: start from a single constant prediction, then repeatedly fit a shallow DecisionTreeRegressor to the errors your model is still making, and add a small fraction of each tree’s output to a running prediction. By the end you will have watched the training error fall from 1.156 down to 0.556, and you will have proven your loop is doing the same thing as scikit-learn’s GradientBoostingRegressor. XGBoost comes later, in Module 2; here the goal is to understand the mechanics so nothing about boosting feels like magic.
By the end of this lesson, you will be able to:
- Explain the additive model and why it starts from the mean of the target
- Compute the residuals a model leaves behind and describe why fitting them reduces the error
- Describe how the learning rate trades step size against the number of trees, and why small steps generalize better
- Build a working gradient booster from scratch with a loop of DecisionTreeRegressor models
- Cross-check a hand-built booster against scikit-learn’s GradientBoostingRegressor and explain any small differences
You should be comfortable with basic Python, NumPy, and the scikit-learn workflow (train/test split, fitting a model, scoring it). If you have finished the Machine Learning Foundations module and Lesson 1 of this course, you are ready. Let’s begin.
The Additive Model
Gradient boosting builds its prediction as a sum of models. It does not try to get the answer right in one shot. Instead it starts with a rough guess and then adds a series of small corrections, each one a shallow tree. Written out, the model after \(m\) rounds is:
\[ F_m(x) = F_{m-1}(x) + \nu \, h_m(x) \]
Here \(F_{m-1}(x)\) is the prediction you already have, \(h_m(x)\) is the new tree you are about to add, and \(\nu\) (the Greek letter nu) is the learning rate, a small number like 0.1 that shrinks each tree’s contribution. Reading the equation aloud: the new prediction is the old prediction plus a small step in the direction the new tree suggests.
Every additive model needs a starting point, \(F_0\). For squared-error regression the best possible constant prediction, the single number that minimizes the mean squared error before you have added any trees, is simply the mean of the target. So the booster begins by predicting the average house value for every block group, and every tree after that only has to explain how each block differs from the average.
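Unrolling the update round by round makes the "sum of models" explicit. The block below is a sketch that writes \(F_0\) for the starting constant, \(h_m\) for the tree added in round \(m\), and \(\nu\) for the learning rate; the total number of rounds \(M\) is a symbol introduced here for illustration.

```latex
\begin{aligned}
F_1(x) &= F_0 + \nu\, h_1(x) \\
F_2(x) &= F_0 + \nu\, h_1(x) + \nu\, h_2(x) \\
&\;\;\vdots \\
F_M(x) &= F_0 + \nu \sum_{m=1}^{M} h_m(x)
\end{aligned}
```

Read this way, the trained model is nothing more than one constant plus a list of trees, each scaled by the same learning rate.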
Let’s load the data, split it, and find that starting constant.
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
data = fetch_california_housing()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print("Train rows:", X_train.shape[0], "| Features:", X_train.shape[1])
print("Target (MedHouseVal) train mean:", round(float(y_train.mean()), 4))
# Output:
# Train rows: 16512 | Features: 8
# Target (MedHouseVal) train mean: 2.0719
The training set has 16,512 block groups and 8 numeric features. The mean target is 2.0719, meaning the average block group has a median home value of about $207,000. That number, 2.0719, is our \(F_0\): the prediction the booster makes for every block before it has learned anything specific.
Why the mean is the right starting point
For squared-error loss, the constant \(c\) that minimizes \(\sum_i (y_i - c)^2\) is exactly the mean of \(y\). You can prove it by taking the derivative with respect to \(c\), setting it to zero, and solving. That is why gradient boosting with squared-error loss initializes with the target mean: it is the best you can do with zero trees, so every tree afterward starts from the strongest possible constant baseline.
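The derivation takes only a few lines. As a sketch, write \(c\) for the constant prediction and \(n\) for the number of training rows (both symbols are chosen here for illustration):

```latex
\begin{aligned}
L(c) &= \sum_{i=1}^{n} (y_i - c)^2 \\
\frac{dL}{dc} &= -2 \sum_{i=1}^{n} (y_i - c) = 0 \\
\sum_{i=1}^{n} y_i &= n\,c
  \quad\Longrightarrow\quad
  c = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y}
\end{aligned}
```

The second derivative is \(2n > 0\), so this stationary point is a minimum rather than a maximum.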
How Good Is the Starting Point Alone?
Before adding any trees, measure how far off the constant model is. We will use RMSE (root mean squared error), which is in the same units as the target, so an RMSE of 1.0 means being off by about $100,000 on a typical block.
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
data = fetch_california_housing()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
F0 = y_train.mean()  # the constant starting model
constant_pred = np.full_like(y_train, F0)  # same prediction for every block
rmse0 = np.sqrt(mean_squared_error(y_train, constant_pred))
print("F0 (constant):", round(float(F0), 4))
print("Baseline train RMSE:", round(float(rmse0), 4))
# Output:
# F0 (constant): 2.0719
# Baseline train RMSE: 1.1562
Predicting the mean for every block gives a training RMSE of 1.1562, off by roughly $116,000 per block. That is our baseline to beat. Every tree we add should push this number down.
Fitting Residuals
Here is the heart of gradient boosting. After the current model makes its predictions, each block group has a leftover error called the residual:
\[ r_i = y_i - F_{m-1}(x_i) \]
The residual is what the model still gets wrong. A block whose true value is 3.5 while the model predicts 2.07 has a residual of \(3.5 - 2.07 = 1.43\): the model is under-predicting by 1.43 and needs to be nudged upward. A block the model over-predicts has a negative residual and needs to be nudged down.
The key insight is that for squared-error loss, the residual is exactly the direction that reduces the loss fastest. (That is where the gradient in gradient boosting comes from: the negative gradient of the squared-error loss \(\tfrac{1}{2}(y_i - F(x_i))^2\), written with the conventional factor of one half, with respect to the prediction \(F(x_i)\) is precisely \(y_i - F(x_i)\).) So instead of guessing how to improve, the booster trains its next tree to predict the residuals. Wherever the tree says “this block’s residual is large and positive,” adding that tree’s output pushes the prediction toward the truth.
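The gradient claim is easy to sanity-check numerically. The sketch below assumes the loss is written with the conventional factor of one half, and the function names are hypothetical helpers introduced for this illustration. It estimates the slope of the loss with a central difference and compares its negative to the residual from the 3.5 versus 2.07 example.

```python
def squared_error_loss(y, pred):
    # Squared-error loss for one block, with the conventional 1/2 factor
    return 0.5 * (y - pred) ** 2


def numerical_negative_gradient(y, pred, eps=1e-6):
    # Central-difference estimate of d(loss)/d(pred), with the sign flipped
    slope = (squared_error_loss(y, pred + eps)
             - squared_error_loss(y, pred - eps)) / (2 * eps)
    return -slope


y_true, current_pred = 3.5, 2.07           # numbers from the example in the text
residual = y_true - current_pred           # what the model still gets wrong
neg_grad = numerical_negative_gradient(y_true, current_pred)

print(f"residual          = {residual:.4f}")
print(f"negative gradient = {neg_grad:.4f}")
```

The two printed values agree, which is the sense in which fitting residuals and following the gradient are the same thing for this loss. With the direction settled, the procedure reduces to fitting a tree to the residuals and taking a small step toward them, over and over.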
That is the whole loop:
- Compute the residuals of the current model: \(r_i = y_i - F_{m-1}(x_i)\).
- Fit a shallow tree \(h_m\) to those residuals.
- Update the model: \(F_m(x) = F_{m-1}(x) + \nu \, h_m(x)\).
- Repeat.
Each pass leaves a smaller residual for the next tree to chip away at. The figure below shows this staged process: the residuals start large (red), shrink as trees are added (orange), and end tiny (green).
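To watch the residuals shrink without any library in the way, here is a minimal sketch on six made-up points. The fit_stump helper is a hypothetical one-split regressor standing in for a shallow DecisionTreeRegressor, and the toy data and the step size of 0.5 are assumptions chosen only to keep the run short.

```python
import numpy as np


def fit_stump(x, r):
    """Fit a one-split regressor to targets r on a single feature x."""
    best = None
    for t in np.unique(x)[:-1]:  # thresholds that leave both sides non-empty
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    # The fitted "tree" predicts the mean residual on each side of the split
    return lambda q: np.where(q <= t, left_value, right_value)


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # toy feature (made up)
y = np.array([1.0, 1.2, 0.9, 3.1, 2.9, 3.0])  # toy target (made up)
nu = 0.5                                       # learning rate for this toy run

F = np.full_like(y, y.mean())                  # start from the mean
rmse_history = [float(np.sqrt(np.mean((y - F) ** 2)))]
for _ in range(5):
    residual = y - F                           # 1. compute the residuals
    h = fit_stump(x, residual)                 # 2. fit a small model to them
    F = F + nu * h(x)                          # 3. take a shrunken step
    rmse_history.append(float(np.sqrt(np.mean((y - F) ** 2))))

print([round(v, 4) for v in rmse_history])
```

Because each leaf predicts the mean residual of the points it contains, a step with a learning rate between 0 and 1 cannot increase the squared error, so the printed RMSE values never go up from one round to the next.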
The Learning Rate
You may have noticed the \(\nu\) sitting in front of every tree. Why not add the full tree and take the biggest possible step toward the truth each round? Because a booster that lunges at the training data learns its noise as eagerly as its signal, and overfits. The learning rate, also called shrinkage, deliberately takes a small step, typically 0.1, so that no single tree can dominate the model. Progress is slower per tree, but the final model generalizes better because it is built from many small, cautious corrections rather than a few reckless ones.
There is a direct trade-off: a smaller learning rate needs more trees to reach the same training fit, but those extra trees buy you a smoother, better-generalizing model. The best way to see this is to compare a large step against a small step, both given plenty of trees (500 here) so the small-step model has time to catch up.
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
data = fetch_california_housing()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
def boost(lr, n_estimators, max_depth=3):
    init = y_train.mean()  # both predictions start from the training mean
    F_tr = np.full_like(y_train, init)
    F_te = np.full_like(y_test, init)
    for _ in range(n_estimators):
        residual = y_train - F_tr  # errors the current model still makes
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X_train, residual)
        F_tr = F_tr + lr * tree.predict(X_train)
        F_te = F_te + lr * tree.predict(X_test)
    train_rmse = np.sqrt(mean_squared_error(y_train, F_tr))
    test_rmse = np.sqrt(mean_squared_error(y_test, F_te))
    return train_rmse, test_rmse
for lr in (1.0, 0.1):
    tr, te = boost(lr, 500)
    print(f"lr={lr:<4} n=500 train RMSE={tr:.4f} test RMSE={te:.4f}")
# Output:
# lr=1.0  n=500 train RMSE=0.2527 test RMSE=0.5477
# lr=0.1  n=500 train RMSE=0.4137 test RMSE=0.4856
Look carefully at the two rows. The large step (\(\nu = 1.0\)) drives the training RMSE way down to 0.2527, but its test RMSE is a worse 0.5477. That gap between a great training score and a mediocre test score is the signature of overfitting: the model memorized the training blocks. The small step (\(\nu = 0.1\)) fits the training data less aggressively (0.4137) yet scores a better 0.4856 on held-out data, with a much smaller train-to-test gap. Small steps plus more trees win.
Learning rate and number of trees move together
The learning rate and the number of trees are two dials for the same thing: how much total correction the model applies. Halving the learning rate roughly doubles the trees you need to reach the same fit. The practitioner’s rule of thumb is to fix a small learning rate (0.05 to 0.1), add trees until validation error stops improving, and let early stopping decide when to quit. You will automate exactly that in a later module.
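The "halve the rate, double the trees" rule can be checked with a little arithmetic. The sketch below rests on an idealized assumption that real shallow trees do not satisfy: every tree reproduces the current residuals exactly, so each round multiplies the leftover error by (1 - learning rate). The rounds_needed function is a hypothetical helper written for this illustration.

```python
import math


def rounds_needed(learning_rate, remaining_fraction):
    """Rounds until the residual shrinks to remaining_fraction of its start.

    Idealized assumption: each tree fits the residuals perfectly, so after
    m rounds the residual is (1 - learning_rate) ** m times the original.
    """
    return math.ceil(math.log(remaining_fraction) / math.log(1.0 - learning_rate))


for lr in (0.1, 0.05, 0.025):
    print(f"lr={lr:<5} rounds to reach 1% of the starting residual: "
          f"{rounds_needed(lr, 0.01)}")
```

Under that assumption the counts come out to 44, 90, and 182 rounds, so each halving of the learning rate roughly doubles the number of trees. Real trees capture only part of the residual each round, so actual counts are higher, but the proportional relationship is the useful takeaway.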
Building a Booster by Hand
Now assemble the pieces into a real, working gradient booster. The loop is short because the idea is simple: initialize the prediction to the training mean, then for each of n_estimators rounds, fit a depth-3 tree to the current residuals and add learning_rate * tree.predict(X) to the running prediction. We print the training RMSE after each of the first ten iterations so you can watch it fall.
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
data = fetch_california_housing()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
learning_rate = 0.1
n_estimators = 50
max_depth = 3
F_train = np.full_like(y_train, y_train.mean())  # start from the mean
trees = []
for m in range(1, n_estimators + 1):
    residual = y_train - F_train  # what the model still gets wrong
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    tree.fit(X_train, residual)  # learn the residuals
    F_train = F_train + learning_rate * tree.predict(X_train)
    trees.append(tree)
    rmse = np.sqrt(mean_squared_error(y_train, F_train))
    if m <= 10:
        print(f"iter {m:>2} train RMSE = {rmse:.4f}")
# Output:
# iter  1 train RMSE = 1.0955
# iter  2 train RMSE = 1.0432
# iter  3 train RMSE = 0.9968
# iter  4 train RMSE = 0.9556
# iter  5 train RMSE = 0.9199
# iter  6 train RMSE = 0.8885
# iter  7 train RMSE = 0.8619
# iter  8 train RMSE = 0.8390
# iter  9 train RMSE = 0.8167
# iter 10 train RMSE = 0.7986
That is gradient boosting, in about fifteen lines. The RMSE starts at 1.0955 after the first tree, already below the 1.1562 baseline, and drops every single round: 1.0432, 0.9968, 0.9556, and onward. Each tree is fitted not to the house values themselves but to the shrinking residuals the previous trees left behind, so every round the model has less error to explain and the curve bends downward smoothly.
Notice we kept every tree in a trees list. That list is the trained model: to predict on new data you start from the same constant and add up learning_rate * tree.predict(x) for every tree. Let’s run the full 50 rounds and score both the training set and the held-out test set that way.
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
data = fetch_california_housing()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
learning_rate, n_estimators, max_depth = 0.1, 50, 3
init = y_train.mean()
F_train = np.full_like(y_train, init)
trees = []
for m in range(n_estimators):
    residual = y_train - F_train
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    tree.fit(X_train, residual)
    F_train = F_train + learning_rate * tree.predict(X_train)
    trees.append(tree)
# apply the SAME trees to the test set, starting from the same constant
F_test = np.full_like(y_test, init)
for tree in trees:
    F_test = F_test + learning_rate * tree.predict(X_test)
train_rmse = np.sqrt(mean_squared_error(y_train, F_train))
test_rmse = np.sqrt(mean_squared_error(y_test, F_test))
print("After 50 trees:")
print(" train RMSE:", round(float(train_rmse), 4))
print(" test RMSE:", round(float(test_rmse), 4))
# Output:
# After 50 trees:
#  train RMSE: 0.5559
#  test RMSE: 0.5799
Fifty shallow trees cut the training RMSE from 1.1562 all the way to 0.5559, and the test RMSE of 0.5799 is nearly as good, telling us the model generalizes rather than memorizes. Fifty small trees, none deeper than three levels, together beat what any single tree of that depth could do, and they did it just by chasing residuals.
Always start the test prediction from the same constant
A common bug when hand-rolling a booster is scoring the test set without initializing it to the training mean, or re-fitting trees on the test data. The trees were trained on the training residuals; to predict on new data you reuse those exact trees and start from the same \(F_0\). The model is the constant plus the fixed list of trees, nothing more.
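One way to make that bug harder to write is to put prediction in a single function that always takes the constant and the tree list together. The sketch below is one possible shape for such a helper. The name predict_boosted and the ConstantTree stand-in are hypothetical, introduced here only so the example runs without training anything; in the lesson's code you would pass the fitted DecisionTreeRegressor objects from the trees list instead.

```python
import numpy as np


def predict_boosted(init, trees, learning_rate, X):
    """Constant plus the shrunken sum of every tree's prediction."""
    X = np.asarray(X)
    pred = np.full(X.shape[0], init, dtype=float)   # start from the same constant
    for tree in trees:
        pred += learning_rate * tree.predict(X)     # reuse the fitted trees as-is
    return pred


class ConstantTree:
    """Hypothetical stand-in for a fitted tree: predicts one value everywhere."""

    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return np.full(np.asarray(X).shape[0], self.value, dtype=float)


stand_in_trees = [ConstantTree(1.0), ConstantTree(-0.5), ConstantTree(0.25)]
X_new = np.zeros((2, 8))                            # two made-up rows, 8 features
pred = predict_boosted(2.0, stand_in_trees, 0.1, X_new)
print(pred)
```

Each row comes out as 2.0 + 0.1 * (1.0 - 0.5 + 0.25) = 2.075. Calling the same function for both the training and the test rows guarantees that both start from the training mean and use the identical list of trees.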
Cross-Checking with Scikit-Learn
A from-scratch model is only convincing if it matches a trusted implementation. Scikit-learn’s GradientBoostingRegressor does the same thing our loop does: it initializes with the mean and adds shallow trees fitted to residuals. Fit it with matching settings, n_estimators=50, learning_rate=0.1, max_depth=3, and compare.
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
data = fetch_california_housing()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
gbr = GradientBoostingRegressor(
    n_estimators=50, learning_rate=0.1, max_depth=3, random_state=0
)
gbr.fit(X_train, y_train)
train_rmse = np.sqrt(mean_squared_error(y_train, gbr.predict(X_train)))
test_rmse = np.sqrt(mean_squared_error(y_test, gbr.predict(X_test)))
print("sklearn GradientBoostingRegressor (50 trees, lr=0.1, depth=3):")
print(" train RMSE:", round(float(train_rmse), 4))
print(" test RMSE:", round(float(test_rmse), 4))
# Output:
# sklearn GradientBoostingRegressor (50 trees, lr=0.1, depth=3):
#  train RMSE: 0.5559
#  test RMSE: 0.5798
The library reports a training RMSE of 0.5559, identical to our hand-built 0.5559, and a test RMSE of 0.5798 against our 0.5799, a difference of one ten-thousandth. That near-perfect agreement is the payoff: the loop you wrote is genuinely what a production gradient booster does under the hood.
Why the numbers can differ slightly (and why they barely do here)
For plain squared-error loss, fitting a tree to the residuals is the exact gradient step, so our loop and scikit-learn’s regressor agree to within rounding. The tiny test-set gap comes from small internal differences, such as how each implementation handles floating-point accumulation and tie-breaking when a tree searches for splits. On other loss functions (for example the robust Huber loss), scikit-learn adjusts each leaf’s value with an extra optimization step that a bare residual-fitting loop skips, and once you turn on features like row subsampling, each tree is fitted on a random subset of the rows, so the two would diverge more noticeably. For the squared-error case here, expect them to land right on top of each other.
Practice Exercises
Try these before checking the hints. They reinforce the additive model, residual fitting, and the learning-rate trade-off.
Exercise 1: Track the Residuals Shrinking
Modify the hand-built loop so that, in addition to the RMSE, it also prints the mean absolute residual (np.abs(residual).mean()) at each of the first five iterations. Confirm that the typical leftover error gets smaller every round.
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
data = fetch_california_housing()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Your code here
Hint
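Inside the loop, right after residual = y_train - F_train, compute mad = np.abs(residual).mean() and print it alongside the iteration number. Because each tree fits the residuals, the mean absolute residual on the training set should fall round after round over the first several iterations. Strictly speaking, only the squared error is guaranteed not to increase at each step, so treat the steady decline in the absolute error as something to confirm from the printout.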
Exercise 2: Change the Tree Depth
Rerun the 50-tree booster with max_depth=1 (decision stumps) and then with max_depth=4, keeping learning_rate=0.1. Print the test RMSE for each. Which depth generalizes best on this dataset, and why might deeper trees start to hurt?
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
# Your code here (reuse a boost() function that takes max_depth)
Hint
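Wrap the loop in a function boost(max_depth) that returns the test RMSE, then call it for depths 1 and 4. Deeper trees fit more of each residual per round, so they can overfit faster, and a moderate depth like 2 or 3 is often a good balance. Keep in mind, though, that the lesson’s 500-tree run reached a lower test RMSE (0.4856) than the 50-tree run (0.5799), so with only 50 trees the model still has room to improve and extra depth may not hurt yet; let the printed numbers decide. Compare against the depth-3 test RMSE of 0.5799 from the lesson.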
Exercise 3: Confirm the Additive Prediction
Take the list of 50 trained trees from the by-hand booster and predict the value of the very first test block manually: start with y_train.mean(), then add 0.1 * tree.predict(X_test[:1]) for every tree in the list. Confirm your total equals F_test[0] from the lesson’s full-run block.
import numpy as np
# assume `trees`, `X_test`, and `y_train` exist from the by-hand booster
# Your code here
Hint
Initialize pred = y_train.mean(), then loop for tree in trees: pred += 0.1 * tree.predict(X_test[:1])[0]. This reconstructs \(F_{50}(x)\) for a single block, and it should match the first entry of the F_test array the lesson computed.
Summary
You built a gradient booster from nothing but a constant and a loop of shallow trees, and you proved it matches a production library. Let’s review the key ideas.
Key Concepts
The Additive Model
- Gradient boosting predicts with a sum of models: \(F_m(x) = F_{m-1}(x) + \nu \, h_m(x)\)
- It starts from a constant \(F_0\), the mean of the target, which is the best zero-tree prediction for squared-error loss
- On California Housing the starting mean was 2.0719, giving a baseline training RMSE of 1.1562
Fitting Residuals
- The residual \(r_i = y_i - F_{m-1}(x_i)\) is what the current model still gets wrong
- For squared-error loss the residual is the negative gradient of the loss, so fitting it reduces error fastest
- Each new tree is trained on the residuals, and adding it shrinks the leftover error for the next tree
The Learning Rate
- The learning rate (shrinkage) scales down each tree’s contribution, usually to 0.1
- Large steps drive training error low but overfit; small steps generalize better with more trees
- With 500 trees, \(\nu = 0.1\) reached a test RMSE of 0.4856 versus 0.5477 for \(\nu = 1.0\)
Building and Checking a Booster
- The loop: start from the mean, then repeatedly fit a depth-3 tree to the residuals and add learning_rate * tree.predict(X)
- Fifty trees cut training RMSE from 1.1562 to 0.5559 and reached a test RMSE of 0.5799
- Scikit-learn’s GradientBoostingRegressor with matching settings gave 0.5559 / 0.5798, confirming the loop is correct
Why This Matters
Every boosting library you are likely to use (scikit-learn’s GradientBoostingRegressor, XGBoost, LightGBM, CatBoost) runs the same core loop you just wrote: initialize with a constant, fit a weak learner to the residuals (the negative gradient), and take a shrunken step. The differences between those libraries are refinements on top of this core, such as faster split-finding, regularization, and clever handling of the leaf values, but the skeleton is the same. Because you built that skeleton yourself and watched every number move, the advanced tools in the coming modules will read as sensible upgrades rather than opaque black boxes.
Just as important, you met the central tension of boosting head-on: it is a method powerful enough to overfit, and the learning rate is your main defense. Knowing why a small step with many trees beats a big step with few will make you far more effective when you start tuning real models, because you will understand what the knobs actually do instead of turning them at random.
Next Steps
You can now build and reason about a gradient booster for regression. In the next lesson, you will adapt the same additive, residual-chasing idea to classification, where the target is a category and the loss is no longer plain squared error.
Lesson 3: Gradient Boosting for Classification
Extend the additive model to categorical targets using log-loss and predicted probabilities.
Continue Building Your Skills
You have done something most people who use boosting never do: you built one by hand and confirmed it against a trusted library, number for number. The loop you wrote (start from the mean, fit the residuals, take a small step, repeat) is the beating heart of every gradient boosting tool in existence. Run the by-hand booster again, try nudging the learning rate or the number of trees, and watch the training and test RMSE respond. Once that feels natural, you are ready to see how the very same idea handles classification, and then how XGBoost makes it fast and robust enough for production.