Lesson 2 - Inside the XGBoost Objective
Welcome to Inside the XGBoost Objective
In Module 1 you built a gradient booster by hand. Each tree was fit to the pseudo-residual, the negative gradient of the loss, and added to the model with a learning rate. That booster works, and it captures the true spirit of boosting. But if you fit it and a real XGBoost model to the same Northwind Analytics data, XGBoost will usually win: it needs fewer trees, resists overfitting better, and trains far faster. The reason is not a better tree-growing trick bolted onto the same idea. It is a genuinely different objective and a smarter way of approximating it.
This lesson opens that objective up. XGBoost does two things Module 1’s booster did not. First, it adds a regularization term directly to the objective, so the algorithm is not just chasing training loss but explicitly paying a price for complex trees. Second, it uses a second-order (Newton) approximation of the loss, meaning it uses not only the gradient but also the hessian, the second derivative, of the loss at each point. From those two ingredients falls out a beautiful closed form for the best value to put in each leaf, a score that measures how “good” a leaf is, and the exact gain XGBoost uses to decide where to split and whether a split is worth keeping. We will derive each piece, then verify the numbers against a real one-tree XGBoost fit so you can trust that the math and the library agree to the decimal.
By the end of this lesson, you will be able to:
- Write down XGBoost’s regularized objective and explain the roles of $\gamma$ and $\lambda$
- Expand the loss to second order and identify the gradient and hessian of each observation
- Derive the optimal leaf weight by minimizing the per-leaf objective
- Compute the leaf similarity score and the split gain that XGBoost maximizes
- Explain how $\gamma$ sets a minimum gain and prunes splits, and verify every quantity against a real xgboost fit
You should be comfortable with the pseudo-residual idea from Module 1, basic numpy, and reading a derivative and a second derivative. Every formula in this lesson is checked numerically, and the final section confirms the hand math against the real library. Let’s begin.
The Regularized Objective
Module 1’s booster only ever tried to shrink the training loss. XGBoost minimizes a larger quantity that adds a penalty for model complexity. Written across all $K$ trees in the ensemble, the objective is
$$\text{Obj} = \sum_{i=1}^{n} L(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$
The first sum is the familiar training loss over all observations. The second sum is new: for every tree $f_k$ in the model, XGBoost adds a regularization term that grows with how complex that tree is. For a single tree $f$ with $T$ leaves and leaf weights $w_1, \dots, w_T$, the penalty is
$$\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$
Read the two pieces separately. The $\gamma T$ term charges a flat cost for each leaf the tree has, so a bushier tree pays more; this is what discourages needless splits and, as we will see, powers pruning. The $\tfrac{1}{2}\lambda \sum_j w_j^2$ term is an L2 penalty on the leaf weights, exactly like ridge regression, pulling every leaf value toward zero so no single leaf can make an extreme prediction. The $\tfrac{1}{2}$ is a convention that makes the derivative clean, and $\lambda$ (called reg_lambda in the library) controls how hard the shrinkage pulls. Neither penalty exists in a plain gradient booster, and together they are the first reason XGBoost generalizes so well.
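As a small illustration of how the two penalty pieces add up, the sketch below evaluates the penalty for a tree given only its leaf weights. The function name `tree_penalty` and both example trees are hypothetical, invented for illustration; nothing here is part of the xgboost API.

```python
def tree_penalty(leaf_weights, gamma, reg_lambda):
    """Omega(f) = gamma * T + 0.5 * lambda * sum(w_j^2) for one tree."""
    n_leaves = len(leaf_weights)                      # T, the number of leaves
    l2_term = 0.5 * reg_lambda * sum(w * w for w in leaf_weights)
    return gamma * n_leaves + l2_term

# Hypothetical two-leaf and four-leaf trees with made-up weights.
small_tree = [2.0, -1.0]
bushy_tree = [2.0, -1.0, 0.5, 3.0]

for name, weights in [("small", small_tree), ("bushy", bushy_tree)]:
    print(name, tree_penalty(weights, gamma=1.0, reg_lambda=1.0))
```

The bushier tree pays more on both counts: two extra leaves raise the flat charge, and the extra weights add to the L2 term.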
Regularization lives in the objective, not as an afterthought
In many models regularization is a separate step you tune around the edges. In XGBoost, $\gamma$ and $\lambda$ sit inside the objective the tree-builder optimizes. That means every leaf weight and every split decision is already made with the penalty in view, rather than fitting an unpenalized tree and shrinking it later. This is why you will see $\lambda$ appear directly in the leaf-weight formula below, not as a post-processing knob.
Second-Order Approximation: Gradients and Hessians
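When XGBoost adds the next tree $f_t$, the new prediction for observation $i$ becomes $\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$. Plugging that into the loss and expanding with a second-order Taylor series around the current prediction $\hat{y}_i^{(t-1)}$ gives, for each observation,
$$L\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) \approx L\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2$$
Two new quantities appear. The gradient $g_i$ is the first derivative of the loss with respect to the current prediction, and the hessian $h_i$ is the second derivative:
$$g_i = \frac{\partial L\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial \hat{y}_i^{(t-1)}}, \qquad h_i = \frac{\partial^2 L\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial \big(\hat{y}_i^{(t-1)}\big)^2}$$
Module 1 used only the gradient, which is why it is a first-order method, the boosting analogue of plain gradient descent. XGBoost keeps the second-order term too, which makes it the boosting analogue of Newton’s method. The hessian tells XGBoost how curved the loss is at each point, so it can take a better-sized step instead of relying on a hand-chosen learning rate alone. For squared-error loss $L = \tfrac{1}{2}(y_i - \hat{y}_i)^2$, the two derivatives are especially simple:
$$g_i = \hat{y}_i - y_i, \qquad h_i = 1$$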
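The gradient is the (signed) prediction error, and the hessian is exactly $1$ for every point. Let’s build a tiny Northwind dataset and compute these directly. Imagine six orders with a units-sold target, and a starting prediction of 0.5, the value older XGBoost releases used as the default base_score. Recent releases estimate base_score from the data unless you set it, which is why the library check later in this lesson sets it explicitly.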
import numpy as np
# Six Northwind orders; target is units sold.
y = np.array([10.0, 12.0, 14.0, 30.0, 32.0, 34.0])
F = np.full_like(y, 0.5) # current prediction = base_score default (0.5)
# Squared-error loss L = 1/2 (y - yhat)^2 -> g = yhat - y, h = 1
g = F - y
h = np.ones_like(y)
print("g_i (gradient):", [round(float(v), 4) for v in g])
print("h_i (hessian): ", [round(float(v), 4) for v in h])
# Output:
# g_i (gradient): [-9.5, -11.5, -13.5, -29.5, -31.5, -33.5]
# h_i (hessian): [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
Each gradient is 0.5 - y, a large negative number because the base prediction of 0.5 sits far below every real units figure, telling the tree those predictions must rise. Every hessian is 1.0, as the derivation promised. These two vectors, $g$ and $h$, are the only things about the loss that the rest of the algorithm ever sees. Everything below is built purely from sums of $g_i$ and $h_i$.
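One way to see why these two numbers are enough: for squared error the second-order expansion is not merely an approximation. It reproduces the loss exactly, because the loss is itself quadratic in the prediction. The sketch below checks that for the first order; the value of `step` is an arbitrary candidate tree output chosen for illustration.

```python
def sq_loss(y, yhat):
    """Squared-error loss with the 1/2 convention used in this lesson."""
    return 0.5 * (y - yhat) ** 2

y_i, F_i = 10.0, 0.5          # first order and its starting prediction
g_i = F_i - y_i               # gradient of the loss at F_i
h_i = 1.0                     # hessian of the loss at F_i
step = 3.0                    # arbitrary candidate tree output f_t(x_i)

exact = sq_loss(y_i, F_i + step)
taylor = sq_loss(y_i, F_i) + g_i * step + 0.5 * h_i * step ** 2

print("exact loss after step :", exact)
print("second-order expansion:", taylor)
```

For losses that are not quadratic, such as log loss, the two numbers would differ slightly and the expansion becomes a genuine approximation.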
The Optimal Leaf Weight
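Now put the tree structure into the picture. A tree assigns each observation to exactly one leaf, and every observation in leaf $j$ receives the same predicted value, the leaf weight $w_j$. Writing $I_j$ for the set of observations in leaf $j$, substituting the second-order expansion into the regularized objective, and dropping the constant terms (they do not depend on the new tree), the part of the objective that leaf $j$ contributes is
$$\text{obj}_j = \Big(\sum_{i \in I_j} g_i\Big) w_j + \tfrac{1}{2}\Big(\sum_{i \in I_j} h_i + \lambda\Big) w_j^2$$
The flat per-leaf charge $\gamma$ does not involve $w_j$, so it is set aside here and reappears in the gain. Notice the $\lambda$ that appears next to the summed hessians: it comes straight from the L2 penalty in $\Omega$. To keep the notation clean, define the leaf’s summed gradient and summed hessian as
$$G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i$$
The per-leaf objective is now a simple upward parabola in $w_j$: $G_j w_j + \tfrac{1}{2}(H_j + \lambda) w_j^2$. Minimize it by setting the derivative with respect to $w_j$ to zero, $G_j + (H_j + \lambda) w_j = 0$, which gives the optimal leaf weight
$$w_j^* = -\frac{G_j}{H_j + \lambda}$$
This is a closed form. XGBoost does not search for the best leaf value; it computes it in one division. The $\lambda$ in the denominator is the shrinkage: raise it and every leaf weight moves toward zero. Let’s compute the optimal weight for the case where all six orders land in a single leaf.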
import numpy as np
y = np.array([10.0, 12.0, 14.0, 30.0, 32.0, 34.0])
F = np.full_like(y, 0.5)
g = F - y
h = np.ones_like(y)
reg_lambda = 1.0
G = g.sum()
H = h.sum()
w_star = -G / (H + reg_lambda)
print("G =", round(float(G), 4), " H =", round(float(H), 4))
print("optimal leaf weight w* = -G/(H+lambda) =", round(float(w_star), 6))
# Output:
# G = -129.0 H = 6.0
# optimal leaf weight w* = -G/(H+lambda) = 18.428571
With every order in one leaf, the best single prediction to add on top of the 0.5 base is 18.4286, so the leaf would push predictions up to roughly 18.93, near the middle of the units values. Try raising reg_lambda and you will watch this weight shrink toward zero: at $\lambda = 10$ it drops to about 8.06, and at $\lambda = 100$ to about 1.22. That is L2 regularization acting directly through the denominator, exactly as the formula predicts.
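If the closed form feels too convenient, a brute-force scan is a quick sanity check. The sketch below evaluates the per-leaf objective on a grid of candidate weights and compares the best grid point with the one-division answer. The grid range and spacing are arbitrary choices for illustration; this is not how the library works, since it uses the closed form directly.

```python
G, H, reg_lambda = -129.0, 6.0, 1.0

def leaf_objective(w):
    """Per-leaf objective G*w + 0.5*(H + lambda)*w^2."""
    return G * w + 0.5 * (H + reg_lambda) * w * w

w_star = -G / (H + reg_lambda)

# Brute-force scan over candidate weights from 0.00 to 40.00 in steps of 0.01.
candidates = [k / 100.0 for k in range(0, 4001)]
w_grid = min(candidates, key=leaf_objective)

print("closed form w* :", round(w_star, 4))
print("best grid value:", w_grid)
print("objective at w*:", round(leaf_objective(w_star), 4))
```

The grid winner should land within half a grid step of the closed-form weight, which is all the scan can resolve.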
The Similarity Score: How Good Is a Leaf?
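We have the best weight for a leaf. How good is that leaf, in objective terms? Substitute $w_j^*$ back into the per-leaf objective $G_j w_j + \tfrac{1}{2}(H_j + \lambda) w_j^2$. After the algebra, the minimized value collapses to a single clean expression:
$$\text{obj}_j^* = -\frac{1}{2}\,\frac{G_j^2}{H_j + \lambda}$$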
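The algebra behind that expression is short enough to spell out. Using the summed gradient $G_j$, the summed hessian $H_j$, and the optimal weight $w_j^* = -G_j/(H_j+\lambda)$ defined earlier:

```latex
\begin{aligned}
\text{obj}_j(w_j^*)
  &= G_j\left(-\frac{G_j}{H_j+\lambda}\right)
     + \frac{1}{2}(H_j+\lambda)\left(-\frac{G_j}{H_j+\lambda}\right)^{2} \\
  &= -\frac{G_j^{2}}{H_j+\lambda} + \frac{1}{2}\,\frac{G_j^{2}}{H_j+\lambda} \\
  &= -\frac{1}{2}\,\frac{G_j^{2}}{H_j+\lambda}
\end{aligned}
```

The linear term contributes a full $-G_j^2/(H_j+\lambda)$ and the quadratic term gives back half of it, which is where the factor of one half comes from.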
This is the smallest (most negative, hence best) contribution leaf $j$ can make to the objective. Because a more negative objective is better, XGBoost thinks of the bracketed quantity, stripped of the $-\tfrac{1}{2}$, as a positive-is-better similarity score (also called the leaf quality or structure score):
$$\text{Similarity}_j = \frac{G_j^2}{H_j + \lambda}$$
The name “similarity” is apt: the numerator is large when the residuals in the leaf all point the same way and reinforce each other, and small when they cancel out. A leaf full of observations that agree on which direction the prediction should move earns a high similarity score; a leaf mixing positive and negative residuals earns a low one, because the summed gradient nearly cancels. Compute the similarity of our single all-six-rows leaf.
import numpy as np
y = np.array([10.0, 12.0, 14.0, 30.0, 32.0, 34.0])
F = np.full_like(y, 0.5)
g = F - y
h = np.ones_like(y)
reg_lambda = 1.0
G = g.sum()
H = h.sum()
similarity = G**2 / (H + reg_lambda)
print("root similarity = G^2/(H+lambda) =", round(float(similarity), 6))
# Output:
# root similarity = G^2/(H+lambda) = 2377.285714
The root leaf, holding all six orders, scores 2377.2857. On its own that number means little; its whole purpose is to be compared against the similarity of the two leaves we would get by splitting it. That comparison is the gain.
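To make the cancellation effect behind the name “similarity” concrete, here is a minimal sketch comparing two hypothetical three-row leaves under squared error, so every hessian is 1. The gradient values are invented for illustration and are not taken from the Northwind orders.

```python
def similarity(gradients, hessians, reg_lambda=1.0):
    """Leaf similarity G^2 / (H + lambda) from per-row gradients and hessians."""
    G = sum(gradients)
    H = sum(hessians)
    return G * G / (H + reg_lambda)

ones = [1.0, 1.0, 1.0]                 # squared-error hessians
agreeing = [-5.0, -6.0, -7.0]          # all rows want the prediction to rise
conflicting = [-5.0, 6.0, -1.0]        # rows pull in opposite directions

print("agreeing leaf   :", similarity(agreeing, ones))
print("conflicting leaf:", similarity(conflicting, ones))
```

The conflicting leaf has gradients of comparable size, but they sum to zero, so its similarity vanishes while the agreeing leaf scores well.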
The Gain: Choosing and Pruning Splits
Every candidate split takes a parent leaf and divides its observations into a left child and a right child. XGBoost decides whether the split is worthwhile by asking: does splitting raise the total similarity, and by enough to justify the cost of adding a leaf? Writing $G_L, H_L$ for the left child’s sums, $G_R, H_R$ for the right child’s, and noting the parent’s sums are $G_L + G_R$ and $H_L + H_R$, the gain of the split is
$$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma$$
In words: gain is one half of (left similarity plus right similarity minus parent similarity), minus the per-leaf penalty $\gamma$. The bracket measures how much cleaner the two children are than the single parent; the $\gamma$ is the toll for the extra leaf. XGBoost evaluates this for every candidate split point and picks the one with the largest gain. Crucially, it keeps a split only if its gain is positive. If the best available split has gain $\le 0$, that branch is not grown; the two children would not improve the penalized objective enough to pay for themselves. This is exactly how $\gamma$ (the gamma hyperparameter) prunes: raising $\gamma$ raises the bar every split must clear, so fewer splits survive and trees stay smaller.
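The search over candidate split points can be sketched in a few lines. The code below assumes the six orders are already sorted by the splitting feature, so each candidate cut sends the first k rows left and the rest right. It is a simplified illustration of the idea, not the library's implementation, and the helper names are hypothetical.

```python
y = [10.0, 12.0, 14.0, 30.0, 32.0, 34.0]
g = [0.5 - yi for yi in y]             # squared-error gradients at base 0.5
h = [1.0] * len(y)                     # squared-error hessians
reg_lambda, gamma = 1.0, 0.0

def sim(G, H):
    """Similarity score G^2 / (H + lambda)."""
    return G * G / (H + reg_lambda)

def split_gain(k):
    """Gain of sending the first k rows left and the rest right."""
    GL, HL = sum(g[:k]), sum(h[:k])
    GR, HR = sum(g[k:]), sum(h[k:])
    return 0.5 * (sim(GL, HL) + sim(GR, HR) - sim(GL + GR, HL + HR)) - gamma

gains = {k: split_gain(k) for k in range(1, len(y))}
best_k = max(gains, key=gains.get)

for k, value in gains.items():
    print(f"first {k} rows left: gain = {value:.4f}")
print("best cut keeps", best_k, "rows on the left; kept =", gains[best_k] > 0)
```

Note that some cuts score a negative gain even with gamma at zero: the $\lambda$ in each child's denominator means a split has to separate the gradients meaningfully before it pays for itself.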
Let’s evaluate a real split on our six orders. The natural cut separates the three low-units orders from the three high-units orders.
import numpy as np
y = np.array([10.0, 12.0, 14.0, 30.0, 32.0, 34.0])
F = np.full_like(y, 0.5)
g = F - y
h = np.ones_like(y)
reg_lambda = 1.0
gamma = 0.0
# Left = three low-units orders, Right = three high-units orders
GL, HL = g[0:3].sum(), h[0:3].sum()
GR, HR = g[3:6].sum(), h[3:6].sum()
sim_L = GL**2 / (HL + reg_lambda)
sim_R = GR**2 / (HR + reg_lambda)
sim_parent = (GL + GR)**2 / (HL + HR + reg_lambda)
gain = 0.5 * (sim_L + sim_R - sim_parent) - gamma
wL = -GL / (HL + reg_lambda)
wR = -GR / (HR + reg_lambda)
print("GL =", round(float(GL), 4), " GR =", round(float(GR), 4))
print("left weight wL* =", round(float(wL), 4))
print("right weight wR* =", round(float(wR), 4))
print("sim_L =", round(float(sim_L), 4), " sim_R =", round(float(sim_R), 4), " sim_parent =", round(float(sim_parent), 4))
print("Gain =", round(float(gain), 6))
# Output:
# GL = -34.5 GR = -94.5
# left weight wL* = 8.625
# right weight wR* = 23.625
# sim_L = 297.5625 sim_R = 2232.5625 sim_parent = 2377.2857
# Gain = 76.419643
The split earns a gain of 76.42, comfortably positive, so XGBoost keeps it. The two children have leaf weights 8.625 (the low-units orders) and 23.625 (the high-units orders), and their combined similarity (297.56 + 2232.56 = 2530.12) exceeds the parent’s 2377.29, which is what makes the gain positive. For contrast, consider an “interleaved” split that scatters low and high orders on both sides, for example {12, 30, 34} against {10, 14, 32}. Every gradient here is negative, so nothing cancels outright, but each child ends up looking much like a smaller copy of the parent. The children’s combined similarity (about 2130.13) falls below the parent’s 2377.29, so the gain comes out negative (about -123.58), a split XGBoost would refuse to make. The figure below traces the whole computation for the good split.
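The contrast between a clean split and an interleaved one is easy to check numerically. The sketch below scores an arbitrary assignment of rows to the left child; the particular interleaved grouping, {12, 30, 34} against {10, 14, 32}, is one example chosen for illustration, and `gain_for` is a hypothetical helper.

```python
y = [10.0, 12.0, 14.0, 30.0, 32.0, 34.0]
g = [0.5 - yi for yi in y]             # squared-error gradients at base 0.5
reg_lambda, gamma = 1.0, 0.0

def sim(G, H):
    """Similarity score G^2 / (H + lambda)."""
    return G * G / (H + reg_lambda)

def gain_for(left_rows):
    """Gain when the given row indices go left and all other rows go right."""
    right_rows = [i for i in range(len(y)) if i not in left_rows]
    # Squared error: each hessian is 1, so H is just the row count.
    GL, HL = sum(g[i] for i in left_rows), float(len(left_rows))
    GR, HR = sum(g[i] for i in right_rows), float(len(right_rows))
    return 0.5 * (sim(GL, HL) + sim(GR, HR) - sim(GL + GR, HL + HR)) - gamma

clean = gain_for([0, 1, 2])            # {10, 12, 14} vs {30, 32, 34}
interleaved = gain_for([1, 3, 5])      # {12, 30, 34} vs {10, 14, 32}

print("clean split gain      :", round(clean, 4))
print("interleaved split gain:", round(interleaved, 4))
```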
Verifying against a real XGBoost fit
Hand math is only convincing if the library agrees. Let’s fit a genuine one-tree XGBoost model on the same six orders, with the same reg_lambda, gamma, and base_score, and read off its leaf values with get_dump. To make the leaf weights appear directly (rather than shrunk by a learning rate), we set eta=1.0.
import numpy as np
import xgboost as xgb
y = np.array([10.0, 12.0, 14.0, 30.0, 32.0, 34.0])
X = np.arange(6).reshape(-1, 1).astype(float) # one feature, ordered low to high units
dtrain = xgb.DMatrix(X, label=y)
params = {
    "max_depth": 1, # a single split: root -> two leaves
    "eta": 1.0, # learning rate 1, so the leaf weight is applied in full
    "reg_lambda": 1.0, # lambda in our formulas
    "gamma": 0.0, # gamma in our formulas
    "base_score": 0.5, # starting prediction F
    "min_child_weight": 0,
    "objective": "reg:squarederror",
    "tree_method": "exact",
}
bst = xgb.train(params, dtrain, num_boost_round=1)
print(bst.get_dump(with_stats=True)[0])
pred = bst.predict(dtrain)
print("predictions:", [round(float(v), 4) for v in pred])
print("base_score + wL =", round(float(0.5 + 8.625), 4),
" base_score + wR =", round(float(0.5 + 23.625), 4))
# Output:
# 0:[f0<2.5] yes=1,no=2,missing=1,gain=152.839355,cover=6
# 1:leaf=8.625,cover=3
# 2:leaf=23.625,cover=3
#
# predictions: [9.125, 9.125, 9.125, 24.125, 24.125, 24.125]
# base_score + wL = 9.125 base_score + wR = 24.125
Everything lines up. XGBoost split at f0 < 2.5, exactly separating the three low-units orders from the three high-units orders, and its leaf values are 8.625 and 23.625, matching our hand-computed $w_L^*$ and $w_R^*$ to the digit. The predictions are those leaf weights added to the base_score of 0.5, giving 9.125 and 24.125, which is why understanding base_score as the starting prediction matters. One honest caveat about the reported gain=152.84: XGBoost prints the gain without the outer $\tfrac{1}{2}$ factor, and indeed $2 \times 76.4196 \approx 152.839$. The value the library compares against gamma internally is this un-halved bracket; the $\tfrac{1}{2}$ is a constant that does not change which split wins, so both conventions rank splits identically. It does rescale the threshold, though: taken at face value, this split would survive in the library until gamma reaches about 152.84 rather than 76.42.
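The factor-of-two relationship can be checked mechanically without rerunning the fit. The sketch below parses the dump text shown in the output above with regular expressions and compares it with the hand-computed values. The dump string is copied from that output, and the regex patterns are an illustrative assumption about its layout rather than a documented format guarantee.

```python
import re

# Dump text copied from the output above.
dump = """0:[f0<2.5] yes=1,no=2,missing=1,gain=152.839355,cover=6
    1:leaf=8.625,cover=3
    2:leaf=23.625,cover=3
"""

lib_gain = float(re.search(r"gain=([-\d.]+)", dump).group(1))
lib_leaves = [float(v) for v in re.findall(r"leaf=([-\d.]+)", dump)]

hand_gain = 76.419643                  # halved gain from the hand calculation
hand_leaves = [8.625, 23.625]          # hand-computed wL*, wR*

print("library gain      :", lib_gain)
print("2 x hand gain     :", round(2 * hand_gain, 6))
print("leaf values match :", lib_leaves == hand_leaves)
```

The two gain figures agree to about four decimal places; the small remaining difference is consistent with the library storing the gain in single precision.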
Why the hessian carries the row count here
Look at cover=6 on the root and cover=3 on each leaf in the dump. For squared-error loss every hessian is 1, so $H_j$ is simply the number of rows in the leaf, which is exactly what “cover” reports. For other losses the hessian is not 1 (for log loss it is $p_i(1 - p_i)$), and cover becomes the summed hessian rather than a plain count. The hyperparameter min_child_weight is a floor on this cover, which is why on non-squared losses it behaves like “minimum total confidence in a leaf” rather than “minimum rows.”
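A short sketch shows how differently the same floor behaves under the two losses. The probabilities and the floor of 1.0 below are made-up values for illustration, and `cover` is a hypothetical helper that simply sums hessians as described above.

```python
def cover(hessians):
    """Cover of a leaf: the sum of its rows' hessians."""
    return sum(hessians)

# Squared error: every hessian is 1, so cover is the row count.
squared_error_h = [1.0, 1.0, 1.0, 1.0]

# Log loss: hessian is p * (1 - p); these probabilities are made up.
probs = [0.6, 0.8, 0.95, 0.99]
log_loss_h = [p * (1.0 - p) for p in probs]

min_child_weight = 1.0                 # illustrative floor on cover
for name, hs in [("squared error", squared_error_h), ("log loss", log_loss_h)]:
    c = cover(hs)
    print(f"{name}: cover = {c:.4f}, passes floor = {c >= min_child_weight}")
```

Four rows clear the floor easily under squared error, but the same four rows fall short under log loss because confident predictions carry very little hessian.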
Practice Exercises
Try these before checking the hints. They exercise the leaf-weight, similarity, and gain formulas you just derived.
Exercise 1: Watch Regularization Shrink the Leaf Weight
Using the six orders’ summed gradient G = -129.0 and summed hessian H = 6.0, compute the optimal root leaf weight for reg_lambda values of 0.0, 1.0, 10.0, and 100.0. What happens to the weight as $\lambda$ grows, and why?
G, H = -129.0, 6.0
for lam in [0.0, 1.0, 10.0, 100.0]:
    # Your code here
    pass
Hint
Compute w = -G / (H + lam) inside the loop. You should get about 21.5, 18.43, 8.06, and 1.22. As $\lambda$ grows, the denominator grows, so the weight shrinks steadily toward zero. That is the L2 penalty from $\lambda$ acting directly through the leaf-weight formula, damping how extreme any single leaf’s prediction can be.
Exercise 2: Find the Gamma That Prunes a Split
The good split on the six orders has an (un-penalized) similarity rise of 2530.125 - 2377.2857, giving a halved gain of 76.42 when gamma = 0. Treating gamma as the only thing you change, find the smallest $\gamma$ that makes the split’s gain non-positive so XGBoost would prune it. Print the gain for gamma values of 0, 50, and 100.
half_rise = 76.419643 # 1/2 (sim_L + sim_R - sim_parent)
for gamma in [0.0, 50.0, 100.0]:
    gain = half_rise - gamma
    # Your code here: print gamma and gain, note when it goes non-positive
Hint
Gain is half_rise - gamma, so it turns non-positive once gamma reaches about 76.42. At gamma = 50 the gain is still positive (26.42, kept); at gamma = 100 it is negative (-23.58, pruned). This is precisely how the gamma hyperparameter controls tree size: it is the minimum gain a split must produce to be allowed, so larger gamma means more aggressive pruning and simpler trees. These thresholds follow the lesson’s halved formula; because the library compares gamma against the un-halved gain (152.84 here), the same split would need a correspondingly larger gamma to be pruned in a real fit.
Exercise 3: Gradients and Hessians for Log Loss
Squared-error loss has the simplest possible hessian, $h_i = 1$. Log loss does not. For a binary label $y_i \in \{0, 1\}$ and predicted probability $p_i$, the log-loss gradient and hessian with respect to the raw score are $g_i = p_i - y_i$ and $h_i = p_i(1 - p_i)$. For three points that all have label y = 1 and predicted probabilities p = [0.5, 0.9, 0.99], compute g and h and note how the hessian changes as the model grows confident.
import numpy as np
y = 1.0
p = np.array([0.5, 0.9, 0.99])
# Your code here: g = p - y ; h = p * (1 - p)
Hint
You get gradients [-0.5, -0.1, -0.01] and hessians [0.25, 0.09, 0.0099]. As the predicted probability approaches the correct label, both the gradient and the hessian shrink toward zero. The vanishing hessian is why confident, correct points contribute almost nothing to a leaf’s cover under log loss, and it is exactly the curvature information that a first-order booster (Module 1) throws away but XGBoost’s Newton step uses.
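If you would rather not take the log-loss formulas on faith, they can be checked with finite differences. The sketch below differentiates the loss numerically with respect to the raw score and compares the result with $p - y$ and $p(1 - p)$. The helper names and the step size `eps` are illustrative choices.

```python
import math

def log_loss(y, z):
    """Binary log loss as a function of the raw score z (log-odds)."""
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

def numeric_grad_hess(y, z, eps=1e-4):
    """Central finite differences for dL/dz and d2L/dz2."""
    up, mid, down = log_loss(y, z + eps), log_loss(y, z), log_loss(y, z - eps)
    return (up - down) / (2 * eps), (up - 2 * mid + down) / (eps * eps)

y = 1.0
for p in [0.5, 0.9, 0.99]:
    z = math.log(p / (1.0 - p))        # raw score that yields probability p
    g_num, h_num = numeric_grad_hess(y, z)
    print(f"p={p}: g={g_num:.4f} (formula {p - y:.4f}), "
          f"h={h_num:.4f} (formula {p * (1 - p):.4f})")
```

The derivatives are taken with respect to the raw score, not the probability, which is the same convention the exercise uses.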
Summary
You have seen precisely what makes XGBoost different from the hand-built booster of Module 1. It optimizes a regularized objective, and it approximates that objective to second order, using both the gradient and the hessian of the loss. From those two choices, a closed-form leaf weight, a leaf similarity score, and a split gain all fall out, and a real XGBoost fit matches them exactly.
Key Concepts
Regularized Objective
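- XGBoost minimizes $\sum_i L(y_i, \hat{y}_i) + \sum_k \Omega(f_k)$ with $\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_j w_j^2$
- $\gamma T$ charges a flat cost per leaf (controls pruning); $\tfrac{1}{2}\lambda \sum_j w_j^2$ is an L2 penalty on leaf weights (controls shrinkage)
Gradients and Hessians
- A second-order Taylor expansion introduces $g_i = \partial L / \partial \hat{y}_i$ and $h_i = \partial^2 L / \partial \hat{y}_i^2$
- For squared error, $g_i = \hat{y}_i - y_i$ and $h_i = 1$; the algorithm sees the loss only through sums of $g_i$ and $h_i$
Optimal Leaf Weight
- $w_j^* = -G_j / (H_j + \lambda)$, a closed form, not a search; on the six orders the two leaves gave 8.625 and 23.625
- Raising $\lambda$ shrinks every weight toward zero through the denominator
Similarity and Gain
- Leaf quality is the similarity score $G_j^2 / (H_j + \lambda)$; the root leaf scored 2377.29
- $\text{Gain} = \tfrac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma$; the good split scored 76.42
- A split is kept only if its gain is positive, so $\gamma$ sets the minimum gain and prunes
Verified against the library
- A one-tree xgboost fit with reg_lambda=1, gamma=0, base_score=0.5 produced leaf values 8.625 and 23.625, matching the hand math; predictions are those weights plus the base score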
Why This Matters
Every hyperparameter you will meet in the next lesson connects back to a symbol in this objective. reg_lambda is the $\lambda$ in the leaf-weight denominator; gamma is the $\gamma$ that a split’s gain must beat; max_depth and min_child_weight bound how the gain-maximizing search is allowed to grow trees. Because you derived the formulas and watched the library reproduce them to the decimal, those knobs are no longer mysterious dials. When Northwind’s model overfits, you will know that raising reg_lambda shrinks leaf weights and raising gamma refuses low-gain splits, because you have seen both effects in the algebra and in the numbers.
This objective is also the shared foundation of much of the gradient-boosting family. LightGBM and CatBoost build on the same gradient and hessian statistics and on closely related split scoring; they differ mainly in how they search for splits and in their default settings, not in the basic idea of what a split is worth. Master this one derivation and you understand the scoring engine at the heart of modern boosting libraries.
Next Steps
You now understand the objective XGBoost optimizes and the exact formulas it uses to weight leaves and score splits. In the next lesson you will turn each symbol in that objective into a hyperparameter and learn how to set the learning rate, tree depth, min_child_weight, gamma, and the L1 and L2 penalties to control a real model.
Lesson 3: Core Hyperparameters
Map each symbol in the objective to a tunable knob: learning rate, max_depth, min_child_weight, gamma, and the L1 and L2 penalties.
Continue Building Your Skills
You have just derived the mathematical core of XGBoost and watched a real fit confirm it to the decimal. The regularized objective, the gradient and hessian, the closed-form leaf weight $w_j^* = -G_j / (H_j + \lambda)$, the similarity score $G_j^2 / (H_j + \lambda)$, and the gain that chooses and prunes splits are not five separate tricks; they are one objective seen from five angles. Hold onto the picture in the figure: sum the gradients and hessians, turn them into a similarity, and let the rise in similarity decide the split. Every hyperparameter, every “why did it overfit,” and every rival library in the rest of this course maps back to that one idea. Learn it once here, from the math and the numbers together, and XGBoost will never again feel like a black box.