Lesson 1 - From Trees to Boosting

Welcome to From Trees to Boosting

Welcome to the first lesson of Gradient Boosting and XGBoost. This course picks up exactly where decision trees and random forests leave off. You already know how a single tree learns yes/no rules, and how a random forest grows hundreds of trees and lets them vote. In this lesson you will meet a different and remarkably powerful idea, boosting, which uses the very same decision trees but assembles them in a completely different way.

To keep things concrete, you will work alongside Northwind Analytics, a small data team asked to predict median house values across California districts. Their first instinct is the tree-based toolkit they already trust. So we will start there, fit a single tree and a random forest on the real California Housing dataset, see exactly where each one lands, and then introduce the model that this whole course is about: gradient boosting. Every number you see below was produced by running the code for real, not typed in by hand.

By the end of this lesson, you will be able to:

  • Explain why a single decision tree has high variance and tends to overfit
  • Describe how random forests use bagging to reduce variance by averaging many independent trees
  • Contrast bagging (parallel, independent trees) with boosting (sequential, dependent trees)
  • Explain why bagging mainly reduces variance while boosting mainly reduces bias
  • Fit and compare a decision tree, a random forest, and a gradient boosting model on real data using scikit-learn

You should be comfortable with basic Python, scikit-learn’s fit/predict pattern, and the idea of a train/test split. No boosting knowledge is assumed. Let’s begin.


Recap: One Tree Overfits

A single decision tree is easy to read, but left unchecked it is a fragile predictor. Grown to full depth, a tree keeps splitting until its leaves are almost pure, carving the training data into tiny regions that memorize noise as if it were signal. The result is a model with high variance: it fits the training set beautifully, yet a small change in the data can produce a very different tree, and its accuracy on unseen data suffers.

Let’s watch this happen on the California Housing data. Each row is a California district, the features describe income, house age, occupancy, and location, and the target MedHouseVal is the median house value in that district (in units of 100,000 dollars). Northwind fits one unconstrained regression tree and measures it on a held-out test set.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

tree = DecisionTreeRegressor(random_state=42)
tree.fit(X_train, y_train)

train_rmse = np.sqrt(mean_squared_error(y_train, tree.predict(X_train)))
test_rmse = np.sqrt(mean_squared_error(y_test, tree.predict(X_test)))
test_r2 = r2_score(y_test, tree.predict(X_test))

print("train RMSE:", float(round(train_rmse, 4)))
print("test  RMSE:", float(round(test_rmse, 4)))
print("test  R2  :", float(round(test_r2, 4)))

Output:

train RMSE: 0.0
test  RMSE: 0.7069
test  R2  : 0.6187

Look at the gap. The tree’s training RMSE is essentially 0.0, meaning it predicts the training districts almost perfectly, yet its test RMSE jumps to 0.7069. That gap between a flawless training score and a mediocre test score is the signature of overfitting. The single tree memorized the training set and generalized poorly, explaining only about 62 percent of the variance in unseen districts (R^2 = 0.6187).
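The other half of the high-variance claim, that a small change in the data can produce a very different tree, can be checked as well. The sketch below is not part of Northwind’s experiment: it uses a small synthetic dataset (an assumption made so the example runs without downloading anything) and fits two full-depth trees on two resamples of the same rows.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption, not the lesson's dataset):
# a smooth signal plus noise.
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=300)

# Two resamples of the same rows, each drawn with replacement.
idx_a = rng.integers(0, 300, size=300)
idx_b = rng.integers(0, 300, size=300)

tree_a = DecisionTreeRegressor(random_state=0).fit(X[idx_a], y[idx_a])
tree_b = DecisionTreeRegressor(random_state=0).fit(X[idx_b], y[idx_b])

# Ask both trees for predictions on the same grid of inputs.
grid = np.linspace(0, 10, 200).reshape(-1, 1)
disagreement = np.mean(np.abs(tree_a.predict(grid) - tree_b.predict(grid)))
print("mean absolute disagreement:", round(float(disagreement), 3))
```

Both trees saw nearly the same data, yet they disagree on the same inputs. That disagreement is the variance the rest of this lesson sets out to reduce.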

What RMSE and R2 mean here

RMSE (root mean squared error) is the typical size of a prediction’s miss, in the same units as the target. Since MedHouseVal is in units of 100,000 dollars, an RMSE of 0.7069 means the tree is off by roughly 70,000 dollars on a typical district. Lower is better. R^2 is the fraction of the target’s variance the model explains: 1.0 is perfect, 0.0 is no better than always guessing the mean, and a model that does worse than the mean scores below zero. Higher is better. We will use this same pair of numbers to compare every model in this lesson.
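To make the two definitions concrete, here is a minimal sketch that computes both metrics by hand with NumPy. The four target values and predictions are made up for illustration and are not taken from the California Housing data.

```python
import numpy as np

# Toy values in units of 100,000 dollars (made up for illustration).
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 5.0])

errors = y_true - y_pred

# RMSE: square the misses, average them, take the square root.
rmse = np.sqrt(np.mean(errors ** 2))

# R^2: one minus (squared error of the model / squared error of the mean).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("RMSE:", round(float(rmse), 4))  # 0.6124
print("R^2 :", round(float(r2), 4))    # 0.7
```

An RMSE of 0.6124 in these units corresponds to a typical miss of about 61,000 dollars, and an R^2 of 0.7 means the predictions remove 70 percent of the squared error left by always guessing the mean.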


Recap: Random Forests Reduce Variance with Bagging

How do you tame a high-variance model? The random forest’s answer is disarmingly simple: build many trees and average them. The averaging is what helps. Individual full-depth trees each overfit in their own noisy way, but their errors are only partly correlated, so when you average their predictions much of the random wobble cancels out while the real signal survives. The averaged prediction is far more stable than any single tree.

The technique that makes those trees different enough to be worth averaging is called bagging, short for bootstrap aggregating. It has two ingredients:

  • Bootstrap sampling. Each tree is trained on its own random sample of the rows, drawn with replacement from the training set. Every tree therefore sees a slightly different dataset and grows a slightly different shape.
  • Aggregating. To predict, every tree makes its own estimate and the forest averages them (for regression) or takes a majority vote (for classification).

Random forests add one extra twist: at each split, a tree may only consider a random subset of the features. This de-correlates the trees even further, so their errors overlap less and the averaging works even better. Crucially, the trees are built independently of one another: they can be trained in parallel, and none of them knows or cares what the others are doing. Order does not matter.
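The two ingredients of bagging can be sketched by hand in a few lines. This is a simplified illustration on synthetic data (an assumption, chosen so it runs offline), not how RandomForestRegressor is implemented internally. It also leaves out the random feature subsets, since the toy data has a single feature.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data: a smooth signal plus noise.
X = rng.uniform(0, 10, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=400)
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


n_trees = 50
all_preds = []
for _ in range(n_trees):
    # Bootstrap sampling: draw row indices with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeRegressor(random_state=0).fit(X_train[idx], y_train[idx])
    all_preds.append(tree.predict(X_test))

all_preds = np.array(all_preds)

# Aggregating: average the trees' predictions for each test row.
bagged_pred = all_preds.mean(axis=0)

single_rmses = [rmse(y_test, p) for p in all_preds]
mean_single_rmse = float(np.mean(single_rmses))
bagged_rmse = rmse(y_test, bagged_pred)

print("average single-tree RMSE:", round(mean_single_rmse, 4))
print("bagged RMSE             :", round(bagged_rmse, 4))
```

Each pass through the loop is independent of the others, so the iterations could run in any order or at the same time. The averaged prediction can never have a higher RMSE than the average of the individual trees’ RMSEs, and in practice it is usually well below it.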

Let’s see what averaging buys Northwind on the same data.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

forest = RandomForestRegressor(random_state=42)
forest.fit(X_train, y_train)

test_rmse = np.sqrt(mean_squared_error(y_test, forest.predict(X_test)))
test_r2 = r2_score(y_test, forest.predict(X_test))

print("forest test RMSE:", float(round(test_rmse, 4)))
print("forest test R2  :", float(round(test_r2, 4)))

Output:

forest test RMSE: 0.5057
forest test R2  : 0.8049

The forest’s test RMSE drops from the single tree’s 0.7069 to 0.5057, and R^2 climbs from 0.6187 to 0.8049. Nothing about the individual trees changed, they still overfit on their own, but averaging a hundred of them washed out most of the variance. This is the core promise of bagging: many independent high-variance models, averaged, become one low-variance model.


The New Idea: Boosting

Bagging attacks the problem sideways: it accepts that each tree overfits and cancels the noise by averaging. Boosting attacks it head-on with a completely different strategy. Instead of training many trees independently and averaging them, boosting trains trees sequentially, where each new tree is built to correct the errors the previous trees made.

Here is the intuition. Start with a first tree that makes a rough prediction for every district. It will get some districts badly wrong; those misses are the residuals (how far off each prediction was). Now train a second tree, but not on the original target. Train it to predict those leftover residuals, so it learns precisely where the first tree fell short. Add the second tree’s corrections on top of the first tree’s predictions, and the combined model is a little better. Compute the new, smaller residuals, and train a third tree on those. Repeat this many times, and each tree chips away at the errors that remain, gradually pulling the combined prediction closer to the truth.
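The first two steps of that story can be written out directly. The sketch below uses synthetic data and two shallow trees (both assumptions made for illustration). It shows the residual idea only, not scikit-learn’s full gradient boosting algorithm.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data: a smooth signal plus noise.
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

# Tree 1 makes a rough prediction of the target itself.
tree1 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
pred1 = tree1.predict(X)

# Residuals: how far off tree 1 was on each row.
residuals = y - pred1

# Tree 2 is trained on the residuals, not on the original target.
tree2 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)

# The combined model is tree 1 plus tree 2's correction.
pred2 = pred1 + tree2.predict(X)

mse1 = float(np.mean((y - pred1) ** 2))
mse2 = float(np.mean((y - pred2) ** 2))
print("training MSE after tree 1:", round(mse1, 4))
print("training MSE after tree 2:", round(mse2, 4))
```

The second tree never sees the original target. It only sees what the first tree left unexplained, and adding its output on top lowers the training error.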

This is why boosting is described as reducing bias. A single shallow tree is a weak, biased model that underfits, but by stacking many of them, each one patching the mistakes of the ensemble so far, boosting builds a strong model out of weak parts. Where bagging says “average away the variance,” boosting says “sequentially correct the bias.”

There is one more knob that makes boosting safe: the learning rate. Rather than adding each new tree’s full correction, boosting adds only a small fraction of it (say 10 percent). Taking many small, cautious steps instead of a few large ones keeps the model from overreacting to any single tree and is a big part of why boosting generalizes so well. You will study this trade-off closely in the next module; for now, just know that boosting deliberately learns slowly.
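Here is a minimal sketch of the full loop with a learning rate, again on synthetic data. Starting from the mean of the target and using depth-2 trees are assumptions made for this illustration. The real GradientBoostingRegressor handles more details than this loop does.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data: a smooth signal plus noise.
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

learning_rate = 0.1  # add only 10 percent of each tree's correction
n_trees = 100

# Start from a constant guess: the mean of the target.
prediction = np.full_like(y, y.mean())
train_mse = []

for _ in range(n_trees):
    # Fit a shallow tree to whatever the ensemble still gets wrong.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)

    # Take a small step in the direction of the tree's correction.
    prediction = prediction + learning_rate * tree.predict(X)
    train_mse.append(float(np.mean((y - prediction) ** 2)))

print("training MSE after   1 tree :", round(train_mse[0], 4))
print("training MSE after  10 trees:", round(train_mse[9], 4))
print("training MSE after 100 trees:", round(train_mse[-1], 4))
```

Because every step is scaled down, no single tree can move the prediction very far, and the training error shrinks gradually over many small corrections.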

Weak learners, strong ensemble

The trees inside a boosting model are usually kept deliberately shallow, often just a few levels deep. A shallow tree on its own is a “weak learner”: it captures only a coarse slice of the pattern and underfits. Boosting’s insight is that you do not need each tree to be good; you need each tree to fix a specific slice of what is still wrong. Hundreds of weak, shallow trees, each nudging the prediction in the right direction, add up to a single strong model. This is the exact idea gradient boosting and XGBoost are built on.


Bagging vs. Boosting at a Glance

The two methods use the same building block, a decision tree, but assemble it in opposite ways. The figure below contrasts them side by side.

Figure: side-by-side diagram. On the left, bagging: one training dataset fans out to three independent trees trained in parallel, whose votes are averaged, lowering variance and with order not mattering. On the right, boosting: Tree 1 makes a rough prediction leaving residual errors, Tree 2 is trained to fix those errors leaving smaller residuals, Tree 3 fixes what remains, and all trees are added together, lowering bias with order mattering.
Caption: Bagging trains trees in parallel and averages them to cut variance; boosting trains trees in sequence, each fixing the last one's errors, to cut bias.

It helps to line the two approaches up point by point:

  • Training order. Bagging trains all trees in parallel and independently; boosting trains them sequentially, each depending on the ones before it.
  • What each tree sees. In bagging, each tree learns the original target from a bootstrap sample of the rows. In boosting, each tree learns the residual errors still left by the trees already built.
  • How predictions combine. Bagging averages equal votes. Boosting adds trees together, each scaled by a small learning rate.
  • What it fixes. Bagging mainly reduces variance (it stabilizes an overfitting model). Boosting mainly reduces bias (it strengthens an underfitting model).
  • Does order matter? For bagging, no, the trees are interchangeable. For boosting, yes: the trees must be trained in sequence, because tree 5 is fit to the errors left by trees 1 through 4 and only makes sense as a correction on top of them.
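The parallel versus sequential contrast shows up directly in scikit-learn’s parameters. As a small illustration, the forest accepts n_jobs to spread its trees across CPU cores, while GradientBoostingRegressor has no such parameter.

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# n_jobs=-1 asks the forest to use all available cores for its trees.
forest = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=42)

# The boosting model builds its trees one after another.
booster = GradientBoostingRegressor(n_estimators=100, random_state=42)

print("forest has n_jobs :", "n_jobs" in forest.get_params())   # True
print("booster has n_jobs:", "n_jobs" in booster.get_params())  # False
```

This is a consequence of the training order: independent trees can be built at the same time, while each boosting tree has to wait for the residuals left by the trees before it.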

Neither approach is universally better, but boosting, when tuned carefully, frequently produces the most accurate models on structured, tabular data like Northwind’s, which is exactly why it dominates in practice.


The Experiment: Tree vs. Forest vs. Gradient Boosting

Talk is cheap, so let’s settle it with a real head-to-head on the California Housing data. Northwind fits all three models with the same train/test split and the same random_state=42, then compares test RMSE and R^2. Gradient boosting is available in scikit-learn as GradientBoostingRegressor, no extra library required.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

models = {
    "Decision Tree":     DecisionTreeRegressor(random_state=42),
    "Random Forest":     RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    r2 = r2_score(y_test, pred)
    print(f"{name:18s} RMSE={float(round(rmse, 4)):<8} R2={float(round(r2, 4))}")

Output:

Decision Tree      RMSE=0.7069   R2=0.6187
Random Forest      RMSE=0.5057   R2=0.8049
Gradient Boosting  RMSE=0.5422   R2=0.7756

Read the table from top to bottom. The single decision tree is the weakest, at RMSE 0.7069. Both ensembles crush it: the random forest reaches 0.5057 and gradient boosting reaches 0.5422. Straight out of the box, with no tuning at all, gradient boosting is already competitive with the forest and explains about 78 percent of the variance, a night-and-day improvement over the lone tree. And this is boosting’s starting point, not its ceiling.

To make that last claim concrete, remember that boosting adds trees one at a time. We can freeze the model after every tree and check the test RMSE using staged_predict, watching the sequential corrections take effect.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

gbr = GradientBoostingRegressor(n_estimators=300, random_state=42)
gbr.fit(X_train, y_train)

checkpoints = {1, 5, 25, 100, 300}
for i, pred in enumerate(gbr.staged_predict(X_test), start=1):
    if i in checkpoints:
        rmse = np.sqrt(mean_squared_error(y_test, pred))
        print(f"after {i:3d} trees: test RMSE = {float(round(rmse, 4))}")

Output:

after   1 trees: test RMSE = 1.0872
after   5 trees: test RMSE = 0.9205
after  25 trees: test RMSE = 0.6605
after 100 trees: test RMSE = 0.5422
after 300 trees: test RMSE = 0.4984

This is boosting laid bare. With just one tree the model is terrible (RMSE 1.0872, worse than the lone deep tree), because that first tree is deliberately shallow and only a small fraction of its correction is added (scikit-learn’s default learning rate is 0.1). But each additional tree corrects the errors left by the ones before it, and the test RMSE falls step after step: 0.9205, then 0.6605, then 0.5422, and by 300 trees it reaches 0.4984, now beating the random forest’s 0.5057. The sequential, error-correcting process you read about is not a metaphor; it is literally the curve of these numbers going down.
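The same staged_predict loop can also report which tree count gave the lowest error, rather than checking a handful of checkpoints. The sketch below uses synthetic data (an assumption, so it runs without a download). In practice the tree count would be chosen on a separate validation split rather than on the test set.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data with two features.
X = rng.uniform(0, 10, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=500)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

gbr = GradientBoostingRegressor(n_estimators=200, random_state=42)
gbr.fit(X_train, y_train)

# One RMSE per stage: after 1 tree, after 2 trees, and so on.
staged_rmse = [
    float(np.sqrt(mean_squared_error(y_test, pred)))
    for pred in gbr.staged_predict(X_test)
]

# argmin is 0-based, tree counts are 1-based.
best_n = int(np.argmin(staged_rmse)) + 1
print("lowest RMSE:", round(staged_rmse[best_n - 1], 4), "at", best_n, "trees")
```

The list has one entry per tree, so plotting it gives the full error curve instead of five sampled points.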

Boosting rewards tuning

Notice the forest was strong immediately and barely needs tuning, while boosting started merely competitive but pulled ahead once we gave it more trees. That is the personality of these two methods: bagging is robust and forgiving, boosting is a precision instrument that gets better the more thoughtfully you set its knobs (number of trees, learning rate, tree depth). This course is largely about turning those knobs well, first with scikit-learn’s gradient boosting and then with XGBoost.
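The three knobs named above map directly onto constructor arguments of GradientBoostingRegressor. The sketch below prints scikit-learn’s defaults and builds a second model with different settings. The specific values 300, 0.05, and 4 are hypothetical, chosen only to show the syntax, and are not recommendations from this lesson.

```python
from sklearn.ensemble import GradientBoostingRegressor

# Inspect the defaults used by the out-of-the-box model.
default_params = GradientBoostingRegressor().get_params()
print("n_estimators :", default_params["n_estimators"])   # 100
print("learning_rate:", default_params["learning_rate"])  # 0.1
print("max_depth    :", default_params["max_depth"])      # 3

# A model with every knob set explicitly (illustrative values only).
tuned = GradientBoostingRegressor(
    n_estimators=300,    # number of trees added in sequence
    learning_rate=0.05,  # fraction of each tree's correction that is kept
    max_depth=4,         # depth of each individual tree
    random_state=42,
)
```

The defaults explain the earlier results: the out-of-the-box model stopped at 100 shallow trees, which is why giving it 300 trees was enough to change the ranking.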


Practice Exercises

Try these before checking the hints. Each one reinforces a key contrast from this lesson.

Exercise 1: Confirm the Single Tree Overfits

Fit a DecisionTreeRegressor(random_state=42) on the California Housing training set and print both its training RMSE and its test RMSE. Explain in one sentence what the size of the gap tells you.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Your code here

Hint

Fit the tree, then compute RMSE separately on X_train/y_train and on X_test/y_test using np.sqrt(mean_squared_error(...)). You should see a training RMSE of about 0.0 and a test RMSE of about 0.7069. A near-zero training error beside a much larger test error is the textbook fingerprint of overfitting: the tree memorized the training rows.

Exercise 2: Does the Number of Trees Help Bagging or Boosting More?

Fit a RandomForestRegressor and a GradientBoostingRegressor (both random_state=42) with n_estimators=10 and again with n_estimators=300. Print each model’s test RMSE and note which method improved more when you added trees.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Your code here

Hint

Build each model with n_estimators=10, fit, predict, and record RMSE; then repeat with n_estimators=300. The forest is already strong with few trees and improves only modestly, while boosting starts weaker but gains much more from the extra trees (dropping to roughly 0.4984). More trees mainly sharpen boosting, because each added tree corrects remaining errors, whereas the forest is only averaging in more trees of the same kind, which brings diminishing returns.

Exercise 3: Watch the Residuals Shrink

Use staged_predict on a fitted GradientBoostingRegressor(n_estimators=100, random_state=42) to print the test RMSE after 1, 10, and 100 trees. Confirm the error decreases as trees are added, and explain why in terms of what each new tree is trained to do.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Your code here

Hint

Loop over enumerate(gbr.staged_predict(X_test), start=1) and print the RMSE only when the tree count is 1, 10, or 100. The RMSE should fall steadily (around 1.0872 after 1 tree and 0.5422 after 100). It shrinks because each new tree is trained on the residuals, the errors still left over, so every tree specifically targets and reduces what the ensemble is still getting wrong.


Summary

You have connected everything you knew about trees and forests to the central idea of this course. Let’s review.

Key Concepts

A single tree overfits

  • A full-depth decision tree has high variance: near-perfect training error but weak test error
  • On California Housing, one tree scored train RMSE ~0.0 versus test RMSE 0.7069 (R^2 = 0.6187), the classic overfitting gap

Bagging (random forests) reduces variance

  • Trains many trees in parallel and independently, each on a bootstrap sample, then averages their predictions
  • Averaging cancels the independent noise of many overfitting trees, cutting variance
  • The forest reached RMSE 0.5057 and R^2 = 0.8049, far better than the lone tree; order of the trees does not matter

Boosting reduces bias

  • Trains trees sequentially, where each new (usually shallow) tree is fit to the residual errors left by the trees so far
  • Predictions are added together, each scaled by a small learning rate, so the model learns in cautious steps
  • It turns many weak learners into one strong model; training order matters, because each tree depends on the ones before it

The head-to-head result

  • Out of the box: Decision Tree RMSE 0.7069, Random Forest RMSE 0.5057, Gradient Boosting RMSE 0.5422
  • With 300 trees, boosting’s test RMSE fell 1.0872 → 0.9205 → 0.6605 → 0.5422 → 0.4984, overtaking the forest
  • Bagging is robust with little tuning; boosting is a precision tool that improves the more carefully it is tuned

Why This Matters

Boosting is not a niche trick; it is the engine behind the models that repeatedly win on tabular data, from the GradientBoostingRegressor you just used to the XGBoost you will master later in this course. Understanding why it works, that it corrects errors sequentially to attack bias, rather than averaging independent models to attack variance, is what will let you reason about its behavior instead of guessing. When a later lesson asks you to raise the learning rate, cap the tree depth, or add more estimators, you will know which lever you are pulling and what it does to the bias-variance balance.

Just as important, you saw the honest picture: after its very first tree, the boosting model was worse than a plain tree, and only the sequential accumulation made it excellent. That is the mindset this course builds. You will not treat these models as magic boxes, but as a disciplined, step-by-step correction process whose every knob you can explain and tune with intent.


Next Steps

You now understand the difference between bagging and boosting, and you have seen gradient boosting, once given 300 trees, beat both a single tree and a random forest on real data. Next, you will open up that sequential process and see, step by step, exactly how gradient boosting fits each tree to the residuals.

Lesson 2: How Gradient Boosting Works

Go inside the sequential loop and see how each tree is fit to the residuals of the ensemble so far.


Continue Building Your Skills

You have laid the conceptual foundation for the entire course. The single distinction at the heart of this lesson, that bagging averages independent trees to fight variance while boosting stacks dependent trees to fight bias, will echo through every remaining module, all the way to XGBoost. Before moving on, rerun the comparison experiment yourself and change one thing at a time: raise the forest’s tree count, then raise the boosting model’s, and watch how differently the two respond. Building that intuition by hand now will make everything that follows, from tuning learning rates to reading XGBoost’s output, feel like a natural next step rather than a leap.
