Lesson 5 - Guided Project: Build a Gradient Booster from Scratch

Welcome to the Guided Project

Over the last four lessons you took apart gradient boosting piece by piece: how it grew out of decision trees, how it builds an additive model that fits residuals one tree at a time, how the same idea handles classification through log-odds and pseudo-residuals, and how those pseudo-residuals are really just negative gradients of a loss function. In this guided project you put all of it back together into something that actually runs. You will build a complete gradient boosting regressor from scratch, as a small reusable Python class, and then hold it up against scikit-learn’s battle-tested implementation to prove it works.

The running example for this course is Northwind Analytics, a fictional consultancy whose data team is standardizing on boosted trees. Before they trust a library, they want to understand one, so your job is to hand them a booster whose every line you can explain. You will work on the real California Housing dataset, predicting the median house value of a neighborhood, and you will measure every step in real numbers so there is nowhere to hide. By the end, a from-scratch loop of about a dozen lines will match a mature library to within a rounding error.

By the end of this project, you will be able to:

  • Establish a defensible baseline so that every later improvement is measurable
  • Implement a ScratchGradientBoostingRegressor class with fit and predict methods that boost shallow trees on residuals
  • Explain why accumulating learning_rate * tree.predict(X) reconstructs the additive model from the earlier lessons
  • Validate a from-scratch model against GradientBoostingRegressor and reason about why the two agree but are not bit-for-bit identical
  • Trade off learning_rate against n_estimators and read a learning curve to judge where the gains flatten out

This is a capstone for Module 1, so you should already be comfortable with the additive-model view of boosting, residuals as the thing each tree fits, and the basic scikit-learn workflow (train/test split, fit, predict, score). Let’s build.


Stage 1: Set Up the Problem and a Baseline

You cannot tell whether a model is good without something to compare it to. So before writing a single line of boosting, you will load the data, split it honestly, and measure the dumbest possible model: one that ignores every feature and always predicts the average house value. That number is your floor. Any real model has to beat it, and how much it beats it tells you how much the features and the algorithm are actually worth.

The California Housing dataset ships inside scikit-learn, drawn from the 1990 California census. Each row is a block group (a small neighborhood), and the target, MedHouseVal, is the median house value there in units of $100,000. The first call to fetch_california_housing downloads and caches the data locally, so later runs are instant.

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

data = fetch_california_housing()   # cached after the first download
X, y = data.data, data.target

print("Feature names:", data.feature_names)
print("X shape:", X.shape)
# Output:
# Feature names: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
# X shape: (20640, 8)

There are 20,640 neighborhoods and 8 numeric features, from median income (MedInc) to average occupancy (AveOccup) to latitude and longitude. Now split off a test set you will never touch during training, using random_state=42 so the split is reproducible.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Train rows:", X_train.shape[0])
print("Test rows: ", X_test.shape[0])
# Output:
# Train rows: 16512
# Test rows:  4128

With the data split, build the baseline. The simplest honest model predicts a single constant for every neighborhood: the mean target of the training set. This is exactly the value gradient boosting itself starts from before it adds any trees, which makes it the perfect reference point. You measure error with root mean squared error (RMSE), in the same $100,000 units as the target:

RMSE=1ni=1n(yiy^i)2 \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
baseline_pred = y_train.mean()
print("Baseline prediction (mean of y_train):", round(float(baseline_pred), 4))

baseline_rmse = np.sqrt(
    mean_squared_error(y_test, np.full_like(y_test, baseline_pred))
)
print(f"Baseline test RMSE: {baseline_rmse:.4f}")
# Output:
# Baseline prediction (mean of y_train): 2.0719
# Baseline test RMSE: 1.1449

The baseline always guesses about 2.07 (that is, $207,000) and lands a test RMSE of 1.1449, meaning it is off by roughly $114,000 per neighborhood on average. That is your floor. Everything you build from here has to beat 1.1449, and the whole point of gradient boosting is to beat it by a lot.

Why start from the mean

Predicting the mean is not an arbitrary baseline. For squared-error loss, the constant that minimizes error over the training set is the mean, which is exactly why your booster will initialize there in Stage 2. Starting from the best possible constant means the very first tree only has to explain what the mean cannot, and every tree after it chips away at what remains.


Stage 2: Implement the Scratch Gradient Booster

Now the heart of the project. You will write a class that captures the additive model from Lesson 2 directly in code. The recipe is short and you already know every piece of it:

  1. Initialize the prediction for every training row to the mean of y (your baseline).
  2. Repeat n_estimators times: compute the current residuals (ycurrent prediction y - \text{current prediction} ), fit a shallow DecisionTreeRegressor to those residuals, and add a shrunken slice of its predictions back into the running total.
  3. Store each tree so predict can replay the same additive sum on new data.

The shrinkage is the learning_rate. Instead of trusting each tree fully, you add only learning_rate * tree.predict(X), so no single tree can overcorrect. This is the additive model

FM(x)=F0(x)+ηm=1Mhm(x) F_M(x) = F_0(x) + \eta \sum_{m=1}^{M} h_m(x)

where F0 F_0 is the initial mean, η \eta is the learning rate, and each hm h_m is a tree fit to the residuals left by the trees before it. Here is the whole thing as a reusable class.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

class ScratchGradientBoostingRegressor:
    def __init__(self, n_estimators=200, learning_rate=0.1, max_depth=3):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.trees = []
        self.init_ = None

    def fit(self, X, y):
        # Step 1: start every prediction at the training mean
        self.init_ = float(np.mean(y))
        current = np.full(len(y), self.init_, dtype=float)
        self.trees = []

        # Step 2: fit each new tree to the residuals left by the last
        for _ in range(self.n_estimators):
            residuals = y - current
            tree = DecisionTreeRegressor(
                max_depth=self.max_depth, random_state=42
            )
            tree.fit(X, residuals)
            current += self.learning_rate * tree.predict(X)
            self.trees.append(tree)
        return self

    def predict(self, X):
        # Replay the additive sum: mean + learning_rate * every tree
        pred = np.full(X.shape[0], self.init_, dtype=float)
        for tree in self.trees:
            pred += self.learning_rate * tree.predict(X)
        return pred

Notice how faithfully the code mirrors the math. init_ is F0 F_0 . Each pass through the loop builds one hm h_m by fitting the residuals, and current += self.learning_rate * tree.predict(X) is the running additive update. The predict method does nothing clever: it just starts from the same mean and re-adds every stored tree’s shrunken prediction. Train it and score it on the held-out test set.

model = ScratchGradientBoostingRegressor(
    n_estimators=200, learning_rate=0.1, max_depth=3
)
model.fit(X_train, y_train)

scratch_pred = model.predict(X_test)
scratch_rmse = np.sqrt(mean_squared_error(y_test, scratch_pred))
scratch_r2 = r2_score(y_test, scratch_pred)

print(f"Scratch test RMSE: {scratch_rmse:.4f}")
print(f"Scratch test R2:   {scratch_r2:.4f}")
# Output:
# Scratch test RMSE: 0.5112
# Scratch test R2:   0.8006

Look at what a dozen lines bought you. The baseline was 1.1449; your from-scratch booster lands a test RMSE of 0.5112, less than half the error, and an R2 R^2 of 0.8006, meaning it explains about 80 percent of the variation in house values. Two hundred shallow trees, each individually weak (depth 3 can barely tell neighborhoods apart), combine into a genuinely strong model. That is boosting: a crowd of humble learners, each fixing the last one’s mistakes.

The figure below shows how that error came down as trees were added, one boosting round at a time.

Line chart titled Learning Curve: Test RMSE versus Number of Trees. A blue curve starts near a test RMSE of 1.09 with a single tree, drops steeply through 0.92 at 5 trees, 0.81 at 10 trees, and 0.66 at 25 trees, then flattens through 0.58 at 50 trees, 0.54 at 100, 0.52 at 150, and 0.511 at 200 trees. A red dashed horizontal line marks the predict-the-mean baseline RMSE of 1.145 near the top.
The scratch booster's test RMSE falls fast at first, then flattens: early trees capture the big patterns, later trees make smaller and smaller corrections as the residuals shrink.

The shape of that curve is the story of boosting in one picture. The first handful of trees slash the error because there is a lot of obvious structure to explain. As the residuals get smaller, each new tree has less left to fix, so the curve bends and flattens toward 0.511. This is also a preview of the diminishing returns you will quantify in Stage 4.


Stage 3: Validate Against scikit-learn

A from-scratch model is only convincing if it agrees with a trusted reference. scikit-learn’s GradientBoostingRegressor implements the same algorithm you just wrote, so if you give it matching hyperparameters (n_estimators=200, learning_rate=0.1, max_depth=3), the two should land in nearly the same place. This is the single best way to check that your understanding, and your code, are correct.

from sklearn.ensemble import GradientBoostingRegressor

sk = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
)
sk.fit(X_train, y_train)

sk_pred = sk.predict(X_test)
sk_rmse = np.sqrt(mean_squared_error(y_test, sk_pred))
sk_r2 = r2_score(y_test, sk_pred)

print(f"sklearn test RMSE: {sk_rmse:.4f}")
print(f"sklearn test R2:   {sk_r2:.4f}")
print(f"Abs RMSE difference: {abs(scratch_rmse - sk_rmse):.4f}")
# Output:
# sklearn test RMSE: 0.5114
# sklearn test R2:   0.8004
# Abs RMSE difference: 0.0002

Your scratch booster scored 0.5112; scikit-learn scores 0.5114. The two differ by 0.0002 in RMSE, well under a tenth of a percent. For all practical purposes they are the same model, and you built one of them by hand. That agreement is the payoff of this whole project: gradient boosting is not a black box, it is exactly the residual-fitting loop you wrote.

So why aren’t they identical? A few small implementation details differ, none of which change the algorithm:

  • The initial prediction. You initialize with the plain mean of y. scikit-learn also starts from the optimal constant for squared error, which is the mean, so this part matches, but its internal init estimator and bookkeeping differ slightly in floating-point detail.
  • Leaf values. A standard regression tree fit to residuals already puts the mean residual in each leaf, which is the correct squared-error step size. scikit-learn folds this into its own tree builder rather than calling DecisionTreeRegressor the way you do, so tie-breaking and split thresholds can differ marginally.
  • Numerical bookkeeping. Different summation orders and internal dtypes produce tiny floating-point differences that accumulate over 200 rounds into a 0.0002 gap.

None of these is a bug. They are the difference between a teaching implementation and a production one, and a 0.0002 RMSE gap is exactly the kind of agreement that says “you got the algorithm right.”

Cross-checking is how you earn trust in your own code

Whenever you implement an algorithm from scratch, find a reference implementation and make them agree on the same inputs. If your version and a mature library land within rounding distance, you have strong evidence your code is correct. If they diverge, the gap is a precise clue about which step you got wrong. This habit will save you far more time than it costs.


Stage 4: The Learning-Rate vs. n_estimators Trade-off

You have one working booster, but you chose its learning_rate and n_estimators more or less by convention. These two knobs are deeply linked, and understanding their trade-off is what separates tuning by feel from tuning with intent.

The learning_rate (η \eta ) controls how much of each tree you trust. A large learning rate takes big, confident steps and reaches a low error in only a few trees, but those big steps can overshoot and lock in mistakes. A small learning rate takes cautious steps, so each tree corrects only a sliver of the residual, which usually generalizes better, but it needs many more trees to get there. The rule of thumb is: lower the learning rate, and raise n_estimators to compensate. Let’s watch that play out with four settings.

settings = [
    (0.5, 50),
    (0.1, 200),
    (0.05, 400),
    (0.01, 400),
]

for lr, n in settings:
    m = ScratchGradientBoostingRegressor(
        n_estimators=n, learning_rate=lr, max_depth=3
    )
    m.fit(X_train, y_train)
    p = m.predict(X_test)
    rmse = float(round(np.sqrt(mean_squared_error(y_test, p)), 4))
    print(f"lr={lr}, n_estimators={n}  ->  test RMSE {rmse}")
# Output:
# lr=0.5, n_estimators=50  ->  test RMSE 0.5168
# lr=0.1, n_estimators=200  ->  test RMSE 0.5112
# lr=0.05, n_estimators=400  ->  test RMSE 0.5097
# lr=0.01, n_estimators=400  ->  test RMSE 0.603

Collect those into a table to read the pattern:

learning_raten_estimatorsTest RMSERead
0.50500.5168Fast but coarse: big steps, few trees, slightly worse
0.102000.5112The balanced default from Stage 2
0.054000.5097Small steps + more trees: best of the four
0.014000.6030Steps too small: 400 trees is not enough to arrive

The middle two rows tell the intended story. Dropping the learning rate from 0.5 to 0.05 while raising n_estimators improves the test RMSE from 0.5168 to 0.5097, the best result here. Smaller, more careful steps generalize better, as long as you give the model enough trees to finish the descent.

The last row is the crucial caveat, “up to a point.” At learning_rate=0.01 the steps are so tiny that 400 trees only get you to 0.6030, worse than everything else, not because the setting is bad but because the model has not finished learning: it is underfit, still crawling toward its minimum. Give it a few thousand trees and it would likely match or beat the others. That is the real trade-off: a smaller learning rate is a better destination but a slower journey, and you must budget enough trees to actually arrive.


Practice Exercises

Now it is your turn. Treat these as real extensions, run each one, and see the numbers move before you check the hint.

Exercise 1: Track the Training Error Every Few Rounds

Add a small change to fit that records the training RMSE every 25 trees, so you can watch the model learn. Print the checkpoints and confirm the error falls monotonically on the training set.

# Modify fit to append np.sqrt(mean_squared_error(y, current))
# to a list every 25 iterations, then print it after training.

Hint

Inside the loop, after current += self.learning_rate * tree.predict(X), add if (i + 1) % 25 == 0: (where i is a loop counter from enumerate(range(self.n_estimators))) and append round(float(np.sqrt(mean_squared_error(y, current))), 4) to a self.train_rmse_ list. Training RMSE should decrease at every checkpoint, unlike test RMSE, which eventually flattens.

Exercise 2: Vary the Tree Depth

You used max_depth=3 throughout. Loop over depths [1, 2, 3, 4, 6], train a fresh ScratchGradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=depth) for each, and print the test RMSE. Which depth wins, and where do deeper trees start to hurt?

for depth in [1, 2, 3, 4, 6]:
    # train and print test RMSE
    pass

Hint

Depth-1 trees (stumps) can only ask one question each, so they underfit; deeper trees are stronger individual learners but boosting many strong learners can overfit. Build the model inside the loop, call .fit(X_train, y_train), and print round(float(np.sqrt(mean_squared_error(y_test, m.predict(X_test)))), 4). Expect the sweet spot to sit around depth 3 to 4 for this dataset.

Exercise 3: Add an early_stopping Check

Give fit an optional validation set and stop adding trees once the validation RMSE has not improved for 10 rounds. This is exactly how libraries avoid wasting trees. Report how many trees your model actually used.

def fit(self, X, y, X_val=None, y_val=None, patience=10):
    # after each tree, score X_val and break if no improvement
    ...

Hint

Track best_rmse and a rounds_without_improvement counter. After each tree, compute the validation RMSE with the trees fit so far; if it beats best_rmse, update it and reset the counter, otherwise increment the counter. When the counter reaches patience, break out of the loop. Store len(self.trees) so you can report the stopping point, it will usually be well under n_estimators.


Summary

Congratulations! You built a complete gradient boosting regressor from scratch, validated it against a production library, and used it to explore the algorithm’s central trade-off, all on a real dataset. Let’s review what you did.

Key Concepts

Baselines Make Improvement Measurable

  • The predict-the-mean baseline scored a test RMSE of 1.1449, the floor every real model must beat
  • Gradient boosting starts from exactly this constant, because the mean is the optimal constant for squared-error loss

Boosting Is an Additive Model of Residuals

  • ScratchGradientBoostingRegressor initializes at the mean, then fits each shallow tree to the current residuals and adds learning_rate * tree.predict(X) to a running total
  • predict simply replays that additive sum: init_ plus every stored tree’s shrunken prediction
  • The result, a test RMSE of 0.5112 and R2 R^2 of 0.8006, more than halved the baseline error

Cross-Validation Against a Reference

  • scikit-learn’s GradientBoostingRegressor with matching hyperparameters scored 0.5114, a difference of just 0.0002
  • The tiny gap comes from implementation details (init handling, leaf-value computation, floating-point order), not from a different algorithm

Learning Rate vs. n_estimators

  • A smaller learning_rate with more trees generalizes better (0.05 / 400 trees gave the best RMSE, 0.5097)
  • But too small a rate underfits without enough trees: 0.01 / 400 trees stalled at 0.6030, still descending

Why This Matters

There is a specific kind of confidence that comes from having built a thing yourself. Gradient boosting powers a large share of the winning models in tabular machine learning, and it is easy to treat the library call as magic. You no longer have to. You know that GradientBoostingRegressor is, at its core, the residual-fitting loop you wrote in a dozen lines, and you proved it by matching the library to within 0.0002 RMSE. When a boosted model behaves strangely in production, that mental model is what lets you reason about it instead of guessing.

Just as important, you practiced the discipline that makes the rest of this course pay off: establish a baseline first, measure every change in real held-out numbers, and cross-check your work against a trusted reference. Those habits are what turn “I ran a model” into “I understand this model,” and they are exactly what the Northwind Analytics team, and any team you join, needs from a practitioner.


Next Steps

You have a working booster and a rock-solid mental model of how it learns. From here, the course moves from your hand-built version to the industrial-strength one: XGBoost, which takes the same core idea and adds regularization, clever second-order optimization, and the engineering that made it the default choice for tabular data.

Module 2: XGBoost in Depth

Move from a from-scratch booster to the real thing: XGBoost, its objective, and its hyperparameters.

Back to Course Overview

Review the full Gradient Boosting & XGBoost course.


Continue Building Your Skills

You just closed Module 1 the best way possible, not by reading about gradient boosting, but by building one whose every line you can defend and then proving it right against a trusted library. That arc, baseline first, build carefully, validate honestly, tune with intent, is the same one you will follow for XGBoost, LightGBM, and every boosted model you meet after this course. The library will get more sophisticated, but the intuition you earned here, a crowd of humble trees each correcting the last one’s mistakes, is the idea underneath all of it, and now it is genuinely yours.

Sponsor

Keep DATATWEETS free. Help fund practical data, AI, and engineering lessons for learners worldwide.

Buy Me a Coffee at ko-fi.com