Lesson 3 - Gradient Boosting for Classification

Welcome to Gradient Boosting for Classification

In Lesson 2 you saw exactly how gradient boosting works for regression: it starts from a simple baseline prediction, then adds one small tree after another, each tree correcting the errors of the ensemble so far. Every tree fit the leftover residual, and the residuals were plain numbers, the gap between the true value and the current prediction. That story is clean because in regression the thing you predict and the thing you correct live in the same space: dollars, degrees, or counts.

This lesson keeps the same engine but changes the target. The analytics team at Northwind Analytics now wants to flag which customers and prospects are likely high earners, so marketing can tailor its outreach. Concretely, we will predict whether a person’s income is above 50,000 dollars a year using the real Adult Income dataset drawn from census records. The prediction is no longer a number on an open-ended scale; it is a probability between 0 and 1. That single change ripples through the whole method, and understanding how it ripples is what this lesson is about.

By the end of this lesson, you will be able to:

  • Explain why a boosting classifier works in log-odds space instead of directly on probabilities
  • Use the sigmoid function to turn an accumulated log-odds into a valid probability
  • Describe log loss and why it is the natural loss for a probability prediction
  • Interpret the classifier’s pseudo-residual as the gap $y - p$ between the true label and the predicted probability
  • Train a GradientBoostingClassifier on the Adult Income data and read its accuracy, ROC AUC, and individual predicted probabilities

You should be comfortable with the regression version of gradient boosting from Lesson 2 and the basic scikit-learn workflow of a train/test split, fit, and scoring. Let’s begin.


Why Classification Needs a Different Setup

Suppose we tried to reuse the regression recipe unchanged. The baseline would guess some starting value, each tree would add a correction, and we would sum everything to get a prediction. The problem shows up immediately: a probability must stay between 0 and 1, but a sum of tree outputs can be any real number at all. Add a few confident trees and your “probability” sails past 1; subtract a few and it drops below 0. Nothing in the additive machinery keeps the total inside the legal range.
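To make the failure concrete, here is a minimal sketch that adds corrections directly onto a probability. The starting value and the three corrections are made-up numbers chosen only for illustration; they are not output from any model.

```python
# Naively add "tree corrections" straight onto a probability.
# All numbers here are hypothetical, chosen only to show the problem.
start = 0.24
corrections = [0.35, 0.30, 0.25]

total = start
for c in corrections:
    total += c
    status = "valid" if 0.0 <= total <= 1.0 else "NOT a valid probability"
    print(f"after adding {c:+.2f}: {total:.2f}  ({status})")
```

The first two additions stay inside the legal range, but the third pushes the total above 1, and nothing in plain addition could have prevented it.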

So we do not add up probabilities directly. Instead, gradient boosting for classification does its adding in a different space, called log-odds (also known as the logit), which has no boundaries. The log-odds of a probability $p$ is

$$\text{log-odds}(p) = \log\!\left(\frac{p}{1 - p}\right)$$

The quantity $p / (1 - p)$ is the odds: how many times more likely the event is than its complement. Taking the log stretches those odds onto the entire number line. A probability of 0.5 maps to log-odds 0. Probabilities above 0.5 map to positive log-odds, probabilities below 0.5 map to negative log-odds, and as $p$ approaches 0 or 1 the log-odds runs off to minus or plus infinity. Because log-odds is unbounded, we can safely add tree after tree to it exactly the way we did in regression, with no risk of leaving a valid range.

This is the key mental shift for the whole lesson: the ensemble accumulates in log-odds space, and we convert to a probability only at the very end. The model builds a running total

$$F(x) = F_0 + \eta\, h_1(x) + \eta\, h_2(x) + \cdots + \eta\, h_M(x)$$

where $F_0$ is the baseline log-odds, each $h_m$ is a small tree, and $\eta$ is the learning rate. That looks identical to the regression sum from Lesson 2, and that is the point: the additive structure is unchanged. Only the meaning of the number, and one final conversion step, are new.
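The running total can be sketched in a few lines for a single person. The three tree outputs below are hypothetical values invented for illustration, and the learning rate of 0.1 is assumed because it is scikit-learn's default; only the base log-odds comes from the Adult class balance of 0.2393.

```python
import math

F0 = math.log(0.2393 / (1 - 0.2393))   # base log-odds from the class balance
eta = 0.1                               # learning rate (scikit-learn's default)
tree_outputs = [1.2, 0.8, -0.3]         # hypothetical h_1(x), h_2(x), h_3(x) for one person

F = F0
for m, h in enumerate(tree_outputs, start=1):
    F += eta * h                        # each tree adds a scaled correction
    print(f"after tree {m}: F(x) = {F:+.4f}")
```

Each tree moves the total by only a fraction of its raw output, which is why many small trees are needed to shift a prediction far from the baseline.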


The Sigmoid: From Log-Odds Back to a Probability

Accumulating in log-odds space is only useful if we can get back to a probability when it is time to make a prediction. The function that does this is the sigmoid, also called the logistic function. It is the exact inverse of the log-odds transform, and it takes any real number $F(x)$ and squashes it into the interval $(0, 1)$:

$$p = \frac{1}{1 + e^{-F(x)}}$$

Feed it a large positive log-odds and it returns a probability near 1; feed it a large negative one and it returns a probability near 0; feed it 0 and it returns exactly 0.5. No matter how many trees the ensemble adds, and no matter how large the accumulated $F(x)$ grows, the sigmoid always hands back a legitimate probability. That is why the boundlessness of log-odds space is a feature, not a bug: the sigmoid is the safety valve that guarantees a valid output.

The figure below traces the full flow for a single person: the model sums a base log-odds and the contributions of its trees into an accumulated $F(x)$, then the sigmoid maps that total into a probability.

A three-stage flow diagram. On the left, a blue box accumulates a base log-odds of minus 1.16 plus tree contributions of plus 0.71 and plus 0.58 into a total F of x equal to 0.85. An arrow leads to a purple box in the middle showing an S-shaped sigmoid curve with the formula p equals one over one plus e to the minus F, mapping the input F of x up to a point on the curve. A final arrow leads to a green box on the right showing the resulting probability 0.70 labeled probability that income exceeds 50K.
Trees add up in unbounded log-odds space; the sigmoid squashes the accumulated total into a valid 0-to-1 probability at the end.
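As a quick check of the squashing behavior, the sketch below evaluates the sigmoid at a few log-odds values, including the accumulated total of 0.85 from the figure. The helper name `sigmoid` and the sign-based branching are choices made for this example: the branch avoids the overflow that `math.exp` raises in Python when the textbook formula is given a very large negative input.

```python
import math

def sigmoid(f):
    """Map a log-odds value f to a probability.

    Branching on the sign keeps math.exp from overflowing when
    f is a very large negative number.
    """
    if f >= 0:
        return 1.0 / (1.0 + math.exp(-f))
    z = math.exp(f)          # f < 0, so z lies between 0 and 1
    return z / (1.0 + z)

for f in [-1000.0, -5.0, 0.0, 0.85, 5.0, 1000.0]:
    print(f"F(x) = {f:+8.2f}  ->  p = {sigmoid(f):.4f}")
```

An input of 0.85 lands at roughly 0.70, matching the figure. At the extremes the result rounds to 0.0 or 1.0 in floating point, but it never leaves the valid range.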

We can see the two transforms undo each other with a tiny experiment. The Adult data has about 23.9 percent high earners, so the baseline probability is 0.2393. Convert it to log-odds and back, and we should land exactly where we started.

import math

p = 0.2393                       # fraction of people who earn >50K
log_odds = math.log(p / (1 - p))
print("Base log-odds:", round(log_odds, 4))

# Turn it back into a probability with the sigmoid
prob = 1 / (1 + math.exp(-log_odds))
print("Sigmoid of base log-odds:", round(prob, 4))
# Output:
# Base log-odds: -1.1565
# Sigmoid of base log-odds: 0.2393

The base log-odds is a negative number, which makes sense: high earners are the minority, so the odds of being one are below even, and the log of odds below 1 is negative. Push that value back through the sigmoid and we recover 0.2393, the class balance we started from. This round trip is happening constantly inside the model.

The initial guess is just the class balance

Before a single tree is added, a boosting classifier has to guess something. Just as the regression version started from the mean of the target, the classification version starts from the base log-odds implied by the class balance. For the Adult data that is $\log(0.2393 / 0.7607) \approx -1.16$, which the sigmoid turns back into a 0.2393 probability of high income for everyone. Every tree the model adds nudges this baseline up or down for people whose features suggest they differ from the average.
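The same starting point can be computed from raw class counts instead of a rounded rate. The sketch below uses the Adult class counts (37,155 people at or below 50K and 11,687 above); the variable names are illustrative.

```python
import math

n_low, n_high = 37155, 11687               # class counts for the Adult data
base_rate = n_high / (n_low + n_high)
base_log_odds = math.log(n_high / n_low)   # same as log(p / (1 - p))

print("Base rate:    ", round(base_rate, 4))
print("Base log-odds:", round(base_log_odds, 4))
```

Because $p / (1 - p)$ is just the ratio of the two counts, the log-odds can be read off without computing the probability first. The result is about -1.16, with any difference in the last decimal place coming from rounding the rate to 0.2393.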


Log Loss: Scoring a Probability

In regression, Lesson 2 measured error with squared error, the squared gap between prediction and truth. That does not fit here, because our prediction is a probability and our truth is a 0 or a 1. We need a loss that rewards confident correct probabilities and punishes confident wrong ones. That loss is log loss, also called binary cross-entropy.

For a single observation with true label $y \in \{0, 1\}$ and predicted probability $p$, log loss is

$$L(y, p) = -\big[\, y \log(p) + (1 - y)\log(1 - p) \,\big]$$

Only one of the two terms is ever active. When the true label is $y = 1$, the expression collapses to $-\log(p)$: the loss is small when $p$ is near 1 and grows without bound as $p$ approaches 0. When the true label is $y = 0$, it collapses to $-\log(1 - p)$: small when $p$ is near 0, exploding as $p$ approaches 1. In both cases the message is the same. Being confidently right costs almost nothing, being unsure costs a moderate amount, and being confidently wrong is penalized very heavily. That steep penalty for confident mistakes is exactly what pushes the model to produce well-calibrated probabilities rather than reckless ones.
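As a reference point, the sketch below computes the mean log loss of a do-nothing model that predicts the base rate of 0.2393 for every person. It uses the Adult class counts (11,687 high earners and 37,155 others); the point is only that a useful classifier should score below this number on average.

```python
import math

p0 = 0.2393                       # base probability predicted for everyone
n_high, n_low = 11687, 37155      # class counts in the Adult data

# Every high earner (y = 1) contributes -log(p0);
# every low earner (y = 0) contributes -log(1 - p0).
total_loss = n_high * -math.log(p0) + n_low * -math.log(1 - p0)
mean_loss = total_loss / (n_high + n_low)
print("Baseline mean log loss:", round(mean_loss, 4))
```

The result is roughly 0.55. Each high earner costs much more than each low earner under this constant guess, because 0.2393 is far from a label of 1 and close to a label of 0.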

Log loss is the objective the whole ensemble is trying to minimize, the same role squared error played in the regression lesson. And just as squared error made the residual come out to $y - \hat{y}$, log loss makes the classifier’s residual come out to a strikingly simple form, which is the next piece of the puzzle.


The Pseudo-Residual: The Gap Between Label and Probability

In regression, each new tree fit the residual, the leftover error the ensemble had not yet explained. Classification keeps that idea but gives it a more precise name: the pseudo-residual. It is called pseudo because it is not a raw error in the target’s units; it is the direction that reduces the loss fastest, which is why the general treatment in Lesson 4 frames it as a gradient. For now, the intuition is all you need, and for log loss the pseudo-residual works out to something you can read at a glance:

$$r = y - p$$

That is it: the true label minus the predicted probability. If a person truly earns more than 50K ($y = 1$) but the model currently gives them only $p = 0.3$, the pseudo-residual is $1 - 0.3 = 0.7$, a strong positive signal telling the next tree to push this person’s log-odds up. If a person earns 50K or less ($y = 0$) but the model gives them $p = 0.8$, the pseudo-residual is $0 - 0.8 = -0.8$, a strong negative signal to push their log-odds down. When the probability already matches the label, the pseudo-residual is near zero and the model leaves that person alone.

So the loop is the same shape as regression, with probabilities standing in for raw predictions: compute everyone’s current probability, take the pseudo-residual $y - p$, fit a small tree to those pseudo-residuals, add its (scaled) output to the accumulated log-odds, and repeat. Here is what those pseudo-residuals look like at the very start, when the model still predicts the base probability of 0.2393 for everyone.

# Pseudo-residuals for a few people, using the base probability 0.2393
# as the starting prediction (before any tree is added).
p0 = 0.2393
people = [
    ("high earner, model still guessing base", 1, p0),
    ("low earner, model still guessing base",  0, p0),
]
for label, y_true, p in people:
    residual = y_true - p
    print(f"{label:42s} y={y_true}  p={p:.4f}  y-p={residual:+.4f}")
# Output:
# high earner, model still guessing base     y=1  p=0.2393  y-p=+0.7607
# low earner, model still guessing base      y=0  p=0.2393  y-p=-0.2393

Notice the asymmetry. The true high earner carries a large positive pseudo-residual of +0.76, because the base guess badly underrates them, so the first tree has a strong incentive to raise their log-odds. The true low earner carries a smaller negative pseudo-residual of -0.24, because the base guess of 0.24 was already close to their label of 0. The trees spend most of their effort on the people the current model gets most wrong, which is exactly the behavior that makes boosting so effective.
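To see the whole loop run end to end, here is a deliberately tiny sketch in plain Python. Everything in it is an assumption made for illustration: the eight-point toy dataset, the learning rate of 0.5, the helper names, and the use of one-split trees whose leaf value is simply the mean pseudo-residual in that leaf. Production libraries typically refine the leaf values further, so treat this as a picture of the loop, not a reimplementation of scikit-learn.

```python
import math

def sigmoid(f):
    return 1.0 / (1.0 + math.exp(-f))

def mean_log_loss(y, F):
    """Average log loss, given labels y and accumulated log-odds F."""
    total = 0.0
    for yi, fi in zip(y, F):
        p = sigmoid(fi)
        total += -(yi * math.log(p) + (1 - yi) * math.log(1 - p))
    return total / len(y)

def fit_stump(x, r):
    """Best one-split tree on x for residuals r, judged by squared error.
    Returns (threshold, left_value, right_value)."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - lv) ** 2 for ri in left) + sum((ri - rv) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]

# Hypothetical toy data: one feature, binary label
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 1, 0, 1, 1, 1]
eta = 0.5

p_bar = sum(y) / len(y)
F = [math.log(p_bar / (1 - p_bar))] * len(x)    # start from the base log-odds
losses = [mean_log_loss(y, F)]

for m in range(10):
    p = [sigmoid(f) for f in F]                 # current probabilities
    r = [yi - pi for yi, pi in zip(y, p)]       # pseudo-residuals y - p
    t, lv, rv = fit_stump(x, r)                 # small tree fit to them
    F = [f + eta * (lv if xi <= t else rv) for f, xi in zip(F, x)]
    losses.append(mean_log_loss(y, F))

print("log loss before:", round(losses[0], 4))
print("log loss after: ", round(losses[-1], 4))
```

The log loss falls as trees are added, even though no tree ever sees a probability directly: each one only chases the current gap between label and probability.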

Why y minus p is so convenient

It is not a coincidence that log loss produces such a clean pseudo-residual. When you pair log loss with the log-odds representation and work out the gradient, the messy-looking derivatives cancel and you are left with the plain difference $y - p$. This elegance is one reason the sigmoid-plus-log-loss pairing is the standard setup for binary classification across logistic regression, neural networks, and gradient boosting alike. You will derive this result properly in Lesson 4; here, just trust that the gap between the label and the probability is the signal each tree chases.
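If you would rather check than trust, a numerical experiment is enough for now. The sketch below treats log loss as a function of the log-odds, estimates its slope with a finite difference, and compares the negative slope to $y - p$. The test points and helper names are arbitrary choices for this example.

```python
import math

def sigmoid(f):
    return 1.0 / (1.0 + math.exp(-f))

def loss_from_log_odds(y, f):
    """Log loss written as a function of the log-odds f."""
    p = sigmoid(f)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

eps = 1e-6
checks = []
for y, f in [(1, -1.1565), (0, -1.1565), (1, 2.0), (0, 0.5)]:
    # Central finite difference approximates dL/df
    grad = (loss_from_log_odds(y, f + eps) - loss_from_log_odds(y, f - eps)) / (2 * eps)
    checks.append((-grad, y - sigmoid(f)))
    print(f"y={y}  f={f:+.4f}  -dL/df={-grad:+.5f}  y-p={y - sigmoid(f):+.5f}")
```

The two columns agree to several decimal places at every test point, which is what the claim predicts: the direction that lowers the loss fastest is the gap between label and probability.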


A Real Experiment on Adult Income

Enough theory. Let’s put a GradientBoostingClassifier to work on the real Adult Income dataset and see whether Northwind can actually predict high earners. The data comes from OpenML through scikit-learn’s fetch_openml, which downloads it on the first call and caches it locally, so later calls load quickly.

import pandas as pd
from sklearn.datasets import fetch_openml

adult = fetch_openml("adult", version=2, as_frame=True)
X_all = adult.data
y_raw = adult.target

print("Rows, columns:", X_all.shape)
print("Target labels:", y_raw.unique().tolist())
# Output:
# Rows, columns: (48842, 14)
# Target labels: ['<=50K', '>50K']

There are 48,842 people and 14 features, and the target is the string ">50K" or "<=50K". Several of the columns are categorical text (occupation, marital status, and so on). Handling categorical features properly is a topic in its own right and gets full treatment in Module 3, so to keep this lesson focused on the classification mechanics we will train on five numeric columns only: age, years of education, hours worked per week, and capital gains and losses. We also turn the target into a clean 0/1 label, where 1 means high income.

import pandas as pd
from sklearn.datasets import fetch_openml

adult = fetch_openml("adult", version=2, as_frame=True)
X_all = adult.data
y_raw = adult.target

features = ["age", "education-num", "hours-per-week",
            "capital-gain", "capital-loss"]
X = X_all[features].copy()
y = (y_raw == ">50K").astype(int)

print("Feature columns:", list(X.columns))
print("Positive rate (>50K):", round(float(y.mean()), 4))
print(y.value_counts().to_string())
# Output:
# Feature columns: ['age', 'education-num', 'hours-per-week', 'capital-gain', 'capital-loss']
# Positive rate (>50K): 0.2393
# class
# 0    37155
# 1    11687

Just under 24 percent of people are high earners, confirming the imbalanced base rate we used for the log-odds calculation earlier. Now we split the data, train the classifier with random_state=42 for reproducibility, and evaluate it on the held-out test set. We report two numbers: accuracy, the fraction of correct labels, and ROC AUC, which measures how well the predicted probabilities rank high earners above low earners regardless of any single threshold.

import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

adult = fetch_openml("adult", version=2, as_frame=True)
features = ["age", "education-num", "hours-per-week",
            "capital-gain", "capital-loss"]
X = adult.data[features].copy()
y = (adult.target == ">50K").astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)

pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]

print("Test accuracy:", round(accuracy_score(y_test, pred), 4))
print("Test ROC AUC: ", round(roc_auc_score(y_test, proba), 4))
# Output:
# Test accuracy: 0.8466
# Test ROC AUC:  0.8713

The classifier reaches 84.7 percent test accuracy and a ROC AUC of 0.871, using only five numeric features and every default setting. Put the accuracy in context: because only 23.9 percent of people are high earners, a lazy model that always guessed “not high income” would already be right about 76 percent of the time. The 84.7 percent here is a genuine, meaningful lift over that baseline. And the AUC of 0.871 tells us the probabilities are well ordered: pick a random high earner and a random low earner, and the model gives the high earner a larger probability about 87 percent of the time. Adding the categorical features in a later module pushes these numbers higher still.
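Both reference numbers in that paragraph can be reproduced by hand. The sketch below computes the always-guess-low accuracy from the Adult class counts, then computes AUC directly from its ranking definition on a handful of hypothetical scores. The function name `pairwise_auc` and the toy scores are invented for this example; for real work `roc_auc_score` is the tool to use.

```python
# Majority-class baseline for the Adult data
n_low, n_high = 37155, 11687
baseline_acc = n_low / (n_low + n_high)
print("Always-guess-low accuracy:", round(baseline_acc, 4))

def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs in which the positive
    gets the higher score; ties count as half a win."""
    wins = 0.0
    for s_pos in pos_scores:
        for s_neg in neg_scores:
            if s_pos > s_neg:
                wins += 1.0
            elif s_pos == s_neg:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical predicted probabilities for a handful of people
pos = [0.9, 0.6, 0.4]          # true high earners
neg = [0.7, 0.3, 0.2, 0.1]     # true low earners
auc = pairwise_auc(pos, neg)
print("Pairwise AUC:", round(auc, 4))
# Output:
# Always-guess-low accuracy: 0.7607
# Pairwise AUC: 0.8333
```

Ten of the twelve possible pairs are ordered correctly in the toy data, giving 0.8333. Notice that the score depends only on the ordering of the probabilities, not on where a cutoff is placed.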

Reading Individual Probabilities

Accuracy is a summary over thousands of people, but the real payoff of a probability model is what it says about one person. Because the model works through the sigmoid, predict_proba hands us a probability for each individual, not just a yes/no label. Here are the first five people in the test set alongside the probability the model assigns them and their actual outcome.

import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

adult = fetch_openml("adult", version=2, as_frame=True)
features = ["age", "education-num", "hours-per-week",
            "capital-gain", "capital-loss"]
X = adult.data[features].copy()
y = (adult.target == ">50K").astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]

# Look at the first five people in the test set
sample = X_test.head(5).copy()
sample["prob_>50K"] = [float(round(p, 4)) for p in proba[:5]]
sample["actual"] = y_test.head(5).values
print(sample.to_string())
# Output:
#        age  education-num  hours-per-week  capital-gain  capital-loss  prob_>50K  actual
# 40342   54              9              40             0             0     0.2183       0
# 47680   28              9              20             0             0     0.0373       0
# 524     53             10              50             0             0     0.4035       0
# 8508    58              9              16          3137             0     0.0491       0
# 31692   47             12              40             0             0     0.3344       1

Read a few of these. The 28-year-old working 20 hours a week with 9 years of education gets a very low 0.037 probability of high income, and indeed earns 50K or less; the model is confidently and correctly cautious. The 53-year-old with more education working 50-hour weeks gets a much higher 0.404, reflecting features that lean toward high income even though this particular person did not clear the bar. These are not just labels; they are graded degrees of belief, and that is precisely what makes a probability model useful for a team like Northwind that wants to rank and prioritize prospects rather than merely bucket them.
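The difference between ranking and bucketing shows up even in these five rows. The sketch below copies the five probabilities from the output above, sorts them from most to least likely, and also applies a 0.5 cutoff, which is assumed here to be the rule behind the hard labels in the binary case.

```python
# Probabilities copied from the five test-set rows shown above
rows = {40342: 0.2183, 47680: 0.0373, 524: 0.4035, 8508: 0.0491, 31692: 0.3344}

# Rank prospects from most to least likely to be high earners
ranked = sorted(rows.items(), key=lambda item: item[1], reverse=True)

labels = {}
for idx, p in ranked:
    labels[idx] = int(p >= 0.5)      # hard label at a 0.5 cutoff
    print(f"row {idx:>5}  prob={p:.4f}  label at 0.5 cutoff={labels[idx]}")
```

At a 0.5 cutoff all five people receive the same label of 0, including the one who actually is a high earner. The ranking still separates them, and it places that person second from the top.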


Practice Exercises

Try these before checking the hints. They reinforce the log-odds, sigmoid, and pseudo-residual ideas, plus the training workflow.

Exercise 1: Convert Probabilities to Log-Odds and Back

Write two small functions, to_log_odds(p) and to_prob(f), that implement the log-odds transform and the sigmoid. Convert the probabilities 0.1, 0.5, and 0.9 to log-odds, then convert each result back, and confirm you recover the original probability.

import math

# Your code here

Hint

Use to_log_odds(p) = math.log(p / (1 - p)) and to_prob(f) = 1 / (1 + math.exp(-f)). A probability of 0.5 should map to log-odds 0, 0.9 to about +2.20, and 0.1 to about -2.20. Feeding each log-odds back through to_prob should return the original probability to within rounding.

Exercise 2: Compute Log Loss for Confident Predictions

Using $L(y, p) = -[\,y\log(p) + (1-y)\log(1-p)\,]$, compute the log loss for four cases: a true positive predicted at $p = 0.9$, a true positive predicted at $p = 0.1$, a true negative at $p = 0.1$, and a true negative at $p = 0.9$. Print all four and note which are penalized most.

import math

def log_loss_one(y, p):
    # Your code here
    pass

Hint

Return -(y * math.log(p) + (1 - y) * math.log(1 - p)). The two confidently correct cases (true positive at 0.9, true negative at 0.1) give a small loss of about 0.105 each. The two confidently wrong cases (true positive at 0.1, true negative at 0.9) give a large loss of about 2.303 each, which is why the model works so hard to avoid confident mistakes.

Exercise 3: Retrain with Fewer Trees

Copy the training block from the lesson, but set n_estimators=20 on the GradientBoostingClassifier so it adds only 20 trees instead of the default 100. Print the test accuracy and ROC AUC and compare them to the full model’s 0.8466 and 0.8713.

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Your code here

Hint

Build the classifier with GradientBoostingClassifier(n_estimators=20, random_state=42) and reuse the same features, split, and metrics. With only 20 trees the scores dip slightly below the 100-tree model, illustrating that each added tree contributes another small correction in log-odds space, and that more trees generally means a better fit up to a point.


Summary

You have taken the gradient boosting engine from Lesson 2 and pointed it at a classification problem, predicting who earns more than 50K on the real Adult Income data. The mechanics of adding small trees stayed the same; what changed was the space the model adds in and the way it scores itself.

Key Concepts

Working in Log-Odds Space

  • A probability must stay in $[0, 1]$, but a sum of tree outputs can be any real number, so the model does not add probabilities directly
  • Instead it accumulates in log-odds space, $\log(p / (1-p))$, which is unbounded and safe to add to
  • The ensemble builds a running total $F(x) = F_0 + \eta h_1(x) + \cdots + \eta h_M(x)$, exactly the additive form from regression

The Sigmoid

  • The sigmoid $p = 1 / (1 + e^{-F(x)})$ maps any accumulated log-odds back into a valid probability
  • It is the inverse of the log-odds transform, so converting a probability to log-odds and back recovers the original value
  • The model’s initial guess is the base log-odds implied by the class balance ($\approx -1.16$ for Adult, which is a 0.2393 probability)

Log Loss and the Pseudo-Residual

  • The classifier minimizes log loss (binary cross-entropy), which heavily penalizes confident wrong predictions
  • Each tree fits the pseudo-residual $y - p$: the gap between the true label and the current predicted probability
  • People the model gets most wrong carry the largest pseudo-residuals, so the trees focus their effort where it matters most

The Real Result

  • A default GradientBoostingClassifier on five numeric features reached 0.8466 test accuracy and 0.8713 ROC AUC
  • That beats the ~76 percent always-guess-low baseline, and the per-person probabilities let Northwind rank individuals, not just label them

Why This Matters

Classification is where most real business decisions live: will this customer churn, is this transaction fraud, does this prospect earn more than 50K? Being able to run gradient boosting on those problems, and to trust the probabilities it produces, is one of the most directly useful skills in applied machine learning. Just as importantly, you now understand why the classifier is built the way it is. The log-odds detour is not an arbitrary complication; it is the elegant fix that lets an additive model produce bounded probabilities. And the pseudo-residual $y - p$ is not a new invention but the same “fit the leftover error” idea you already knew, reappearing in a form tailored to probabilities. Seeing that continuity means the leap from regression to classification is far smaller than it first looks, and it sets you up perfectly for the general theory of loss functions and gradients in the next lesson.


Next Steps

You can now train and interpret a gradient boosting classifier and explain every piece of its setup. In the next lesson, you will step back and see the unifying theory: how any differentiable loss function defines a pseudo-residual, so that regression and classification are just two instances of one general algorithm.

Lesson 4: Loss Functions and Pseudo-Residuals

See how any differentiable loss defines a pseudo-residual, unifying regression and classification under one gradient boosting framework.


Continue Building Your Skills

You have crossed the bridge from regression to classification without changing the engine underneath, only the space it drives in and the yardstick it measures against. Take a moment to trace the flow one more time: the model sums small trees into an accumulated log-odds, the sigmoid squashes that total into a probability, log loss scores it, and the pseudo-residual $y - p$ tells the next tree where to push. Once that loop feels as natural as the regression version, you are ready to see the single theory that contains them both, and to understand why gradient boosting works for almost any loss you can write down.
