Lesson 4 - Loss Functions and Pseudo-Residuals
Welcome to Loss Functions and Pseudo-Residuals
In the last two lessons you watched gradient boosting improve step by step: fit a tree to the errors, add it to the model, look at the new errors, fit another tree, and repeat. For regression you fit each tree to the leftover residual $y - F(x)$, and for classification you fit each tree to the difference between the label and the predicted probability. Those looked like two separate tricks. They are not. They are the same trick, seen through two different loss functions.
This lesson reveals the general principle behind everything Lessons 2 and 3 showed. At each step, gradient boosting does not fit the next tree to a hand-picked notion of “error.” It fits the tree to the pseudo-residual: the negative gradient of your chosen loss function with respect to the current prediction. Choose squared error and the pseudo-residual turns out to be the ordinary residual. Choose log loss and it turns out to be label-minus-probability. This single idea is what puts the word gradient in gradient boosting, and it is the mechanism underlying XGBoost and the other boosting libraries covered in the next module.
By the end of this lesson, you will be able to:
- State the pseudo-residual as the negative gradient of the loss and explain why boosting fits each tree to it
- Derive and compute the squared-error pseudo-residual and see that it equals the ordinary residual
- Derive and compute the absolute-error pseudo-residual and explain why $\operatorname{sign}(y - F)$ makes boosting robust to outliers
- Derive and compute the binary log-loss pseudo-residual $y - p$ with $p = \sigma(F)$, connecting it back to Lesson 3
- Explain how choosing a loss lets you tune what the model cares about, and why every boosting library rests on this
You should be comfortable with the boosting loop from Lessons 2 and 3, basic numpy, and the idea of a derivative as a slope. No heavy calculus is required; we will compute every gradient numerically and print the result. Let’s begin.
The Key Generalization: Fit the Negative Gradient
Here is the central definition of the whole lesson. Suppose the model built so far is $F(x)$, and we measure its mistakes with a loss function $L(y, F(x))$ that is small when the prediction is close to the true value and large when it is far off. The next tree is trained to predict the pseudo-residual $r_i$ for each observation $i$:
$$r_i = -\frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)}$$
Read that slowly, because it is the heart of gradient boosting. The pseudo-residual is the negative gradient of the loss with respect to the current prediction, evaluated at where the model stands right now. It answers a precise question: if I could nudge this one prediction, which direction and how hard should I push to reduce the loss fastest? That direction of steepest descent, computed separately for every observation, becomes the target the next tree tries to match.
This reframes boosting as gradient descent, but performed in an unusual space. Ordinary gradient descent adjusts a fixed set of model parameters. Gradient boosting instead treats each observation’s prediction as the thing to adjust, computes the negative gradient of the loss at each of those predictions, and then fits a tree to approximate that whole vector of gradients. Adding the tree (scaled by a learning rate) takes one downhill step for the entire dataset at once. The residuals you fit in earlier lessons were never special; they were simply the negative gradient of one particular loss.
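The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the course's reference implementation: the helper names `fit_stump` and `boost`, the synthetic data, and the use of a one-split stump in place of a full tree are all made up for this example. The point is that the loss enters only through the `neg_gradient` function.

```python
import numpy as np

def fit_stump(x, targets):
    """Least-squares one-split stump on a single feature (hypothetical helper)."""
    best = None
    for threshold in np.unique(x)[:-1]:        # skip the max so both sides are non-empty
        left = x <= threshold
        left_val, right_val = targets[left].mean(), targets[~left].mean()
        sse = np.sum((targets - np.where(left, left_val, right_val)) ** 2)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_val, right_val)
    _, threshold, left_val, right_val = best
    return lambda x_new: np.where(x_new <= threshold, left_val, right_val)

def boost(x, y, neg_gradient, n_rounds=30, learning_rate=0.1):
    """Gradient descent on predictions: fit each stump to the pseudo-residuals."""
    F = np.full(len(y), y.mean())              # start from a constant model
    for _ in range(n_rounds):
        r = neg_gradient(y, F)                 # one pseudo-residual per observation
        stump = fit_stump(x, r)                # approximate the whole gradient vector
        F = F + learning_rate * stump(x)       # one downhill step for every prediction
    return F

# Synthetic data, invented for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

def squared_error_neg_gradient(y, F):
    return y - F                               # -dL/dF for L = 1/2 (y - F)^2

F_final = boost(x, y, squared_error_neg_gradient)

mse_start = np.mean((y - y.mean()) ** 2)
mse_end = np.mean((y - F_final) ** 2)
print("training MSE before boosting:", round(float(mse_start), 4))
print("training MSE after boosting: ", round(float(mse_end), 4))
```

Swapping in a different loss means passing a different `neg_gradient`; nothing inside `boost` has to change.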
Why the word gradient earns its place
“Boosting” is the strategy of building a strong model from many weak ones, each correcting the last. “Gradient” is the recipe for what each weak learner should correct: the negative gradient of a differentiable loss. Because the recipe only needs a gradient, you can swap in any differentiable loss and the machinery is unchanged. That interchangeability is the whole reason a single library such as XGBoost can handle regression, classification, ranking, and custom objectives with the same core algorithm.
To make this concrete, Northwind Analytics predicts many different things: revenue per order, days until a shipment arrives, whether a customer will churn. Each of these wants a different notion of “good.” Revenue has a few enormous orders that should not dominate; delivery time is fine with squared error; churn needs calibrated probabilities. In the rest of this lesson we work through three losses and compute their pseudo-residuals numerically, so you can see exactly how the target the next tree chases changes with the loss you pick.
Loss 1: Squared Error Recovers the Ordinary Residual
Start with the loss you have implicitly been using all along. Squared error, written with a convenient factor of one half, is
$$L(y, F) = \tfrac{1}{2}(y - F)^2$$
Differentiate with respect to the prediction $F$. The one half cancels the two that comes down from the square, and the inner derivative of $(y - F)$ with respect to $F$ is $-1$:
$$\frac{\partial L}{\partial F} = \tfrac{1}{2} \cdot 2\,(y - F) \cdot (-1) = -(y - F)$$
The negative gradient is exactly $y - F$, the plain residual. This is why Lesson 2 worked: fitting each tree to “the amount we were off by” was secretly fitting the negative gradient of squared-error loss. The factor of one half is a convention that makes this come out clean; drop it and you would carry a constant 2 that the learning rate absorbs anyway.
Let’s confirm it numerically on a small, real slice of data. We take the first six block groups from the California Housing dataset (the target is median house value in units of $100,000) and imagine a model whose current prediction is just the mean of those targets.
import numpy as np
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
y = housing.target[:6] # median house value, in $100,000s
F = np.full_like(y, y.mean()) # current model: just predict the mean
print("y (actual): ", np.round(y, 4))
print("F (current): ", np.round(F, 4))
# Output:
# y (actual): [4.526 3.585 3.521 3.413 3.422 2.697]
# F (current): [3.5273 3.5273 3.5273 3.5273 3.5273 3.5273]
Now compute the squared-error pseudo-residual, which the derivation says should just be y - F.
import numpy as np
from sklearn.datasets import fetch_california_housing
y = fetch_california_housing().target[:6]
F = np.full_like(y, y.mean())
# L = 1/2 (y - F)^2 -> pseudo-residual = -dL/dF = y - F
pseudo_residual = y - F
print("squared-error pseudo-residuals:", np.round(pseudo_residual, 4))
# Output:
# squared-error pseudo-residuals: [ 0.9987 0.0577 -0.0063 -0.1143 -0.1053 -0.8303]
The first block group sits about 0.9987 (roughly $99,870) above the current mean prediction, so its pseudo-residual is a large positive number, telling the next tree to push that prediction up. The last block group sits 0.8303 below, so its pseudo-residual is a large negative number, pulling that prediction down. Every value is simply how far off we were, signed. Squared error and the ordinary residual are one and the same.
Loss 2: Absolute Error and the Sign Trick
Now change the loss. Absolute error measures the size of the miss without squaring it:
$$L(y, F) = |y - F|$$
Its derivative with respect to $F$ is not a smooth line; it is a constant slope whose sign flips at zero. Wherever the prediction is too low ($F < y$), the slope of the loss is $-1$; wherever it is too high ($F > y$), the slope is $+1$. Negating that gradient gives a pseudo-residual that is the sign of the residual:
$$-\frac{\partial L}{\partial F} = \operatorname{sign}(y - F)$$
This is a strikingly different signal from squared error. Squared error’s pseudo-residual grows without bound as the miss grows; a point that is off by 40 produces a pseudo-residual of 40 and screams for attention. Absolute error’s pseudo-residual is always just $+1$ or $-1$: every point says only “I am too high” or “I am too low,” never “by a lot.” Compute it on the same six block groups.
import numpy as np
from sklearn.datasets import fetch_california_housing
y = fetch_california_housing().target[:6]
F = np.full_like(y, y.mean())
# L = |y - F| -> pseudo-residual = -dL/dF = sign(y - F)
pseudo_residual = np.sign(y - F)
print("absolute-error pseudo-residuals:", np.round(pseudo_residual, 4))
# Output:
# absolute-error pseudo-residuals: [ 1. 1. -1. -1. -1. -1.]
Compare this vector to the squared-error one. The first two block groups are above the mean, so they get $+1$; the last four are below, so they get $-1$. The magnitude of each miss has been thrown away entirely. That looks like a loss of information, and it is, but it buys something valuable: robustness to outliers. A single wildly wrong point cannot dominate the next tree, because its vote is capped at 1.
To see why that matters, imagine a data-entry slip: someone typed 45.26 instead of 4.526 for the first block group. Under squared error, that one corrupted point now produces a gigantic pseudo-residual and hijacks the next tree, which will spend its whole capacity chasing the bad value. Under absolute error, the same point contributes a pseudo-residual of exactly 1.0, no louder than any honest point.
import numpy as np
# same six blocks, but the first value is a data-entry error (45.26, not 4.526)
y = np.array([45.26, 3.585, 3.521, 3.413, 3.422, 2.697])
F = np.full_like(y, 3.5273)
squared_res = y - F
absolute_res = np.sign(y - F)
print("outlier squared-error residual: ", round(float(squared_res[0]), 4))
print("outlier absolute-error residual:", round(float(absolute_res[0]), 4))
# Output:
# outlier squared-error residual: 41.7327
# outlier absolute-error residual: 1.0
The corrupted point produces a squared-error pseudo-residual of 41.7327, dozens of times larger than any real signal in the data, versus a tame 1.0 under absolute error. This is the practical reason to reach for absolute error (or its smooth cousin, the Huber loss) when Northwind’s revenue data contains a handful of enormous orders and you do not want a single big client steering the entire model.
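The Huber loss mentioned above sits between the two: it behaves like squared error for small misses and like absolute error for large ones, so its pseudo-residual is the ordinary residual clipped to a band. The sketch below applies it to the same corrupted six-value vector. The function name `huber_pseudo_residual` and the threshold `delta=1.0` are illustrative assumptions, not values from the lesson.

```python
import numpy as np

def huber_pseudo_residual(y, F, delta=1.0):
    """Negative gradient of the Huber loss: the residual, clipped to [-delta, delta]."""
    return np.clip(y - F, -delta, delta)

# Same six blocks as above, with the data-entry error in the first position
y = np.array([45.26, 3.585, 3.521, 3.413, 3.422, 2.697])
F = np.full_like(y, 3.5273)

r_huber = huber_pseudo_residual(y, F, delta=1.0)   # delta chosen only for illustration
print("Huber pseudo-residuals:", np.round(r_huber, 4))
```

The corrupted point is capped at `delta`, while the honest points keep their exact residuals, so small misses still carry magnitude information that the pure sign would discard.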
The pseudo-residual controls what dominates learning
Because the next tree is fit to the pseudo-residuals, whichever points have the largest pseudo-residuals get the most attention. Squared error makes far-off points shout, which is great when big misses genuinely matter and dangerous when they are just noise or bad data. Absolute error mutes that shouting to a whisper. Choosing the loss is therefore not a cosmetic decision; it directly decides which observations steer your model.
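One rough way to put a number on "who is shouting" is to ask what fraction of the total absolute pseudo-residual belongs to the corrupted point. This is only a crude proxy for influence on the next tree, and the `shares` dictionary is just a name chosen for this sketch.

```python
import numpy as np

# Same six blocks as above, with the data-entry error in the first position
y = np.array([45.26, 3.585, 3.521, 3.413, 3.422, 2.697])
F = np.full_like(y, 3.5273)

candidates = {
    "squared error": y - F,            # pseudo-residual = residual
    "absolute error": np.sign(y - F),  # pseudo-residual = sign of residual
}

shares = {}
for name, r in candidates.items():
    # fraction of the total |pseudo-residual| carried by the outlier
    shares[name] = float(np.abs(r[0]) / np.abs(r).sum())
    print(f"{name}: outlier share = {shares[name]:.3f}")
# squared error: outlier share = 0.974
# absolute error: outlier share = 0.167
```

Under squared error the one bad point accounts for nearly all of the signal handed to the next tree; under absolute error it carries exactly one sixth, the same as every other point.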
Loss 3: Log Loss Recovers Label Minus Probability
The third loss is the one behind Lesson 3’s classification boosting. For a binary label $y \in \{0, 1\}$, the model outputs a raw score $F$ (a log-odds), which we squash into a probability with the sigmoid function $p = \sigma(F) = 1 / (1 + e^{-F})$. The log loss (also called binary cross-entropy) is
$$L(y, F) = -\big[\, y \log p + (1 - y) \log(1 - p) \,\big]$$
Working out the derivative with respect to the raw score $F$ takes a little chain rule, but the result is famously clean. The messy pieces cancel because the sigmoid’s own derivative is $\sigma'(F) = p\,(1 - p)$, and the negative gradient collapses to
$$-\frac{\partial L}{\partial F} = y - p$$
The pseudo-residual is simply the label minus the predicted probability. This is exactly the quantity Lesson 3 fit each tree to, and now you can see it was never an ad hoc choice: it is the negative gradient of log loss with respect to the raw score. Notice that the gradient is taken with respect to the score $F$, not the probability directly, which is why boosting works in the unbounded log-odds space and only maps to a probability at the end.
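For readers who want the chain-rule steps spelled out, here is one way to write them. The notation is an assumption of this sketch: $y$ is the 0/1 label, $F$ is the raw score, and $p = \sigma(F)$ is the sigmoid of that score.

```latex
\begin{aligned}
L &= -\big[\, y \log p + (1 - y)\log(1 - p) \,\big], \qquad p = \sigma(F) \\[4pt]
\frac{\partial L}{\partial p}
  &= -\left[ \frac{y}{p} - \frac{1 - y}{1 - p} \right]
   = -\frac{y - p}{p\,(1 - p)} \\[4pt]
\frac{\partial p}{\partial F} &= p\,(1 - p) \\[4pt]
\frac{\partial L}{\partial F}
  &= \frac{\partial L}{\partial p}\cdot\frac{\partial p}{\partial F}
   = -\frac{y - p}{p\,(1 - p)} \cdot p\,(1 - p)
   = -(y - p) \\[4pt]
-\frac{\partial L}{\partial F} &= y - p
\end{aligned}
```

The factor $p(1 - p)$ from the sigmoid's derivative cancels the denominator exactly, which is why the final expression is so simple.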
Let’s compute it on a tiny, hand-made set of six labels with some current raw scores, the kind of intermediate state a booster would be in partway through training.
import numpy as np
labels = np.array([0, 0, 1, 1, 1, 0])
F = np.array([-1.0, 0.5, 0.2, 2.0, -0.5, 1.0]) # current raw scores (log-odds)
p = 1 / (1 + np.exp(-F)) # sigmoid -> probabilities
pseudo_residual = labels - p # -dL/dF for log loss
print("p = sigmoid(F): ", np.round(p, 4))
print("log-loss pseudo-residuals:", np.round(pseudo_residual, 4))
# Output:
# p = sigmoid(F): [0.2689 0.6225 0.5498 0.8808 0.3775 0.7311]
# log-loss pseudo-residuals: [-0.2689 -0.6225 0.4502 0.1192 0.6225 -0.7311]
Read a few of these. The second observation has label 0 but the model currently gives it probability 0.6225, on the wrong side of 0.5, so its pseudo-residual is a clearly negative -0.6225, pushing the score down. The fourth observation has label 1 and the model already predicts 0.8808, nearly right, so its pseudo-residual is a small 0.1192; there is little left to correct. Just like a residual, the pseudo-residual is large where the model is badly wrong and near zero where it is nearly right, but here it is automatically bounded between $-1$ and $+1$, which keeps classification boosting stable.
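If you would rather not take the calculus on trust, a central finite difference on the log loss should reproduce the same numbers. The sketch below reuses the six labels and raw scores from above; the step size `h = 1e-5` and the helper name `log_loss` are choices made for this illustration.

```python
import numpy as np

def log_loss(y, F):
    """Binary log loss as a function of the raw score F (log-odds)."""
    p = 1 / (1 + np.exp(-F))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

labels = np.array([0, 0, 1, 1, 1, 0])
F = np.array([-1.0, 0.5, 0.2, 2.0, -0.5, 1.0])
h = 1e-5

# Negative of the central-difference slope, one value per observation
numeric = -(log_loss(labels, F + h) - log_loss(labels, F - h)) / (2 * h)
# Closed form from the derivation: label minus probability
analytic = labels - 1 / (1 + np.exp(-F))

print("numeric: ", np.round(numeric, 4))
print("analytic:", np.round(analytic, 4))
print("max abs difference:", np.max(np.abs(numeric - analytic)))
```

The two vectors should agree to many decimal places, with any leftover difference coming from floating-point rounding in the finite difference.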
The figure below draws all three losses on the top row and their pseudo-residuals (negative gradients) on the bottom row, so you can see at a glance how the same downhill idea produces three very different targets.
Why This Framing Matters
Seeing residuals as negative gradients is not just an elegant reframing; it is the reason gradient boosting is so flexible in practice. The learning machinery (grow a tree to match a target vector, scale it by the learning rate, add it to the model) never changes. The only thing that changes when you swap losses is the target vector, the pseudo-residual, that you hand the tree. That clean separation is what lets a single library expose one algorithm and dozens of objectives.
That flexibility maps directly onto real modeling choices. When Northwind predicts delivery time and the misses are roughly symmetric with no crazy outliers, squared error is the natural default and its pseudo-residual is the familiar residual. When it predicts revenue per order, where a few gigantic orders would otherwise dominate, absolute error (or Huber) trims their influence by capping the pseudo-residual. When it predicts churn and needs trustworthy probabilities to prioritize retention offers, log loss delivers calibrated outputs and a pseudo-residual of label-minus-probability. You pick the loss to match what “good” means for the task, and the pseudo-residual formula does the rest.
Most importantly, this is precisely the abstraction that XGBoost, LightGBM, and CatBoost are built on. Every one of them asks only that your loss provide a gradient (and, for the second-order methods you will meet next module, a second derivative too). Once you understand that a tree is always fit to the negative gradient of the loss, those libraries stop being black boxes: their objective="reg:squarederror" or objective="binary:logistic" settings are nothing more than a choice of which pseudo-residual to compute. You already know what happens under the hood.
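To make "provide a gradient and a second derivative" concrete, here is a sketch of what such a pair looks like for the two built-in objectives named above. The function names are hypothetical and this is plain numpy, not any library's actual callback signature; consult the library documentation for that. One detail worth noticing: the quantities below are derivatives of the loss itself, so the gradient is the pseudo-residual with its sign flipped.

```python
import numpy as np

def squared_error_grad_hess(y, F):
    """First and second derivatives of 1/2 (y - F)^2 with respect to F."""
    grad = F - y                      # minus the pseudo-residual (y - F)
    hess = np.ones_like(F)            # curvature is constant
    return grad, hess

def log_loss_grad_hess(y, F):
    """First and second derivatives of binary log loss with respect to the raw score F."""
    p = 1 / (1 + np.exp(-F))
    grad = p - y                      # minus the pseudo-residual (y - p)
    hess = p * (1 - p)                # largest when the model is least certain
    return grad, hess

labels = np.array([0, 0, 1, 1, 1, 0])
scores = np.array([-1.0, 0.5, 0.2, 2.0, -0.5, 1.0])

grad, hess = log_loss_grad_hess(labels, scores)
print("gradient:         ", np.round(grad, 4))
print("second derivative:", np.round(hess, 4))
print("pseudo-residual:  ", np.round(-grad, 4))
```

Negating the gradient recovers the label-minus-probability values computed earlier in the lesson.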
Any differentiable loss is fair game
The only real requirement is that the loss be differentiable in the prediction, at least almost everywhere, so that a gradient exists to fit; an isolated kink, such as absolute error’s at zero, is handled by picking a value there by convention. That opens the door to custom objectives: quantile loss to predict a delivery-time upper bound Northwind can safely promise, Poisson loss for count data like units sold, or a business-specific asymmetric loss that punishes under-forecasting revenue more than over-forecasting. Write down the loss, take its negative gradient, and gradient boosting can optimize it, no new algorithm required.
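As one example of a custom objective, the quantile (pinball) loss at level $q$ charges $q$ per unit of under-prediction and $1 - q$ per unit of over-prediction, so its negative gradient is $q$ when the prediction is too low and $q - 1$ when it is too high. The sketch below is illustrative only: the function name, the choice `q=0.9`, and the convention used at an exact tie are assumptions, and the six target values are the rounded housing values printed earlier.

```python
import numpy as np

def quantile_pseudo_residual(y, F, q=0.9):
    """Negative gradient of the pinball loss at level q.

    Returns q where the prediction is too low (y > F) and q - 1 otherwise.
    An exact tie (y == F) is assigned q - 1 by convention in this sketch.
    """
    return np.where(y > F, q, q - 1.0)

# The six housing targets from earlier in the lesson, rounded as printed
y = np.array([4.526, 3.585, 3.521, 3.413, 3.422, 2.697])
F = np.full_like(y, y.mean())

r_q = quantile_pseudo_residual(y, F, q=0.9)
print("quantile (q=0.9) pseudo-residuals:", np.round(r_q, 2))
```

With `q=0.9`, points above the current prediction pull upward nine times harder than points below it pull downward, which is what drags the fitted model toward the 90th percentile rather than the center.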
Practice Exercises
Try these before checking the hints. They reinforce the link between a loss, its gradient, and the pseudo-residual the next tree chases.
Exercise 1: Confirm the Squared-Error Gradient Numerically
Instead of trusting the calculus, verify that $y - F$ really is the negative gradient of $\tfrac{1}{2}(y - F)^2$. Pick a single point, say y = 3.0 and F = 2.0, and approximate the derivative with a tiny finite difference: $\big(L(y, F + h) - L(y, F - h)\big) / (2h)$ for a small h like 1e-5. Negate it and check it lands on 1.0.
def loss(y, F):
    return 0.5 * (y - F) ** 2
y, F, h = 3.0, 2.0, 1e-5
# Your code here
Hint
Compute grad = (loss(y, F + h) - loss(y, F - h)) / (2 * h), then pseudo = -grad. You should get a number extremely close to 1.0, which equals y - F = 3.0 - 2.0. This finite-difference check is exactly how you would sanity-test the gradient of a custom loss before handing it to a boosting library.
Exercise 2: Robustness Head to Head
Build a target vector y = np.array([2.0, 2.1, 1.9, 2.0, 20.0]) where the last value is an outlier, with a constant current prediction F equal to the median of y. Compute both the squared-error pseudo-residual (y - F) and the absolute-error pseudo-residual (np.sign(y - F)), and print the pseudo-residual of the outlier under each. Which loss lets the outlier dominate?
import numpy as np
y = np.array([2.0, 2.1, 1.9, 2.0, 20.0])
F = np.full_like(y, np.median(y))
# Your code here
Hint
Print (y - F)[-1] and np.sign(y - F)[-1]. The squared-error pseudo-residual for the outlier is 18.0, dwarfing the residuals of the honest points (at most 0.1 in magnitude), so it dominates the next tree. The absolute-error pseudo-residual is just 1.0, no larger than any other point's (the two points sitting exactly on the median get 0), so the outlier cannot take over. This is the outlier-robustness argument in numbers.
Exercise 3: Log-Loss Pseudo-Residual and Confidence
Using labels = np.array([1, 1, 0, 0]) and raw scores F = np.array([3.0, -3.0, 3.0, -3.0]), compute p = 1 / (1 + np.exp(-F)) and the pseudo-residual labels - p. Identify which observation has the pseudo-residual closest to zero and which has the largest magnitude, and explain what that says about the model’s mistakes.
import numpy as np
labels = np.array([1, 1, 0, 0])
F = np.array([3.0, -3.0, 3.0, -3.0])
# Your code here
Hint
The first point (label 1, score 3.0, so p around 0.95) is confidently correct, giving a small pseudo-residual near 0.05; the fourth point (label 0, score -3.0) mirrors it at about -0.05. The third point (label 0, score 3.0, so p around 0.95) is confidently wrong, giving a large negative pseudo-residual near -0.95; the second point (label 1, score -3.0) mirrors it at about +0.95. The pseudo-residual is smallest where the model is already right and largest where it is confidently wrong, which is precisely where the next tree should focus.
Summary
You have reached the conceptual core of gradient boosting. The “errors” you fit trees to in earlier lessons were never arbitrary: each was the negative gradient of a specific loss function with respect to the current prediction, a quantity called the pseudo-residual. Choosing the loss chooses the pseudo-residual, and the pseudo-residual is the only thing about boosting that changes from one problem to the next.
Key Concepts
The Pseudo-Residual
- At each step, gradient boosting fits the next tree to the pseudo-residual
- It is the negative gradient of the loss with respect to the current prediction, the direction of steepest descent for each observation
- This makes boosting a form of gradient descent performed on predictions rather than on parameters
Squared Error
- $L = \tfrac{1}{2}(y - F)^2$ has pseudo-residual $y - F$, the ordinary residual from Lesson 2
- Its pseudo-residual grows without bound as the miss grows, so large misses dominate; on the six housing blocks it ran from 0.9987 down to -0.8303
Absolute Error
- $L = |y - F|$ has pseudo-residual $\operatorname{sign}(y - F)$, always $+1$ or $-1$
- Capping the magnitude makes boosting robust to outliers; a corrupted point produced 41.7327 under squared error but only 1.0 under absolute error
Log Loss (Binary)
- $L = -[\,y \log p + (1 - y)\log(1 - p)\,]$ with $p = \sigma(F)$ has pseudo-residual $y - p$, recovering Lesson 3
- Bounded between $-1$ and $+1$, it is small where the model is right and large where it is confidently wrong
Why It Matters
- Swapping the loss only swaps the pseudo-residual; the tree-fitting machinery is unchanged
- Any differentiable loss works, which is exactly the abstraction XGBoost and other libraries are built on
Why This Matters
This lesson is the hinge between the hand-built boosters of Module 1 and the industrial libraries ahead. Once you internalize that a boosted tree is always fit to the negative gradient of a loss, the alphabet soup of objective settings in XGBoost, LightGBM, and CatBoost turns into a single, familiar decision: which loss matches what “good” means for this task? Regression with symmetric errors wants squared error; revenue with wild outliers wants absolute or Huber loss; probability-hungry classification wants log loss. You are no longer memorizing recipes but choosing a gradient.
It also demystifies the marketing around these tools. When a library boasts that it supports “custom objectives,” you now know that means little more than “bring your own differentiable loss, hand over its gradient (and, for second-order libraries, its second derivative), and we will fit trees to it.” That understanding will let you read the next module’s XGBoost material not as a bag of tricks but as the same gradient-descent-on-predictions idea you just implemented by hand, made fast and regularized.
Next Steps
You now understand the general principle that unifies every gradient booster: fit each tree to the negative gradient of your chosen loss. In the guided project, you will put all of Module 1 together and build a working gradient booster from scratch, computing pseudo-residuals in a real training loop.
Guided Project: Build a Gradient Booster from Scratch
Assemble the full Module 1 loop into a working booster that computes pseudo-residuals and adds trees one at a time.
Continue Building Your Skills
You have just learned the single idea that turns a pile of boosting tricks into one coherent algorithm. Fitting residuals, fitting label-minus-probability, chasing signs for robustness: these are not three methods but three faces of one rule, which is to fit the negative gradient of the loss. Hold onto that sentence. When the next module hands you XGBoost with its second-order gradients, regularization, and blazing speed, you will recognize the beating heart underneath as exactly the pseudo-residual you computed here by hand. Master this framing now, and every boosting library you touch for the rest of your career will make sense from the inside out.