Lesson 2 - Inside the XGBoost Objective

Welcome to Inside the XGBoost Objective

In Module 1 you built a gradient booster by hand. Each tree was fit to the pseudo-residual, the negative gradient of the loss, and added to the model with a learning rate. That booster works, and it captures the true spirit of boosting. But if you fit it and a real XGBoost model to the same Northwind Analytics data, XGBoost will usually win: it needs fewer trees, resists overfitting better, and trains far faster. The reason is not a better tree-growing trick bolted onto the same idea. It is a genuinely different objective and a smarter way of approximating it.

This lesson opens that objective up. XGBoost does two things Module 1’s booster did not. First, it adds a regularization term directly to the objective, so the algorithm is not just chasing training loss but explicitly paying a price for complex trees. Second, it uses a second-order (Newton) approximation of the loss, meaning it uses not only the gradient but also the hessian, the second derivative, of the loss at each point. From those two ingredients falls out a beautiful closed form for the best value to put in each leaf, a score that measures how “good” a leaf is, and the exact gain XGBoost uses to decide where to split and whether a split is worth keeping. We will derive each piece, then verify the numbers against a real one-tree XGBoost fit so you can trust that the math and the library agree to the decimal.

By the end of this lesson, you will be able to:

Write down XGBoost’s regularized objective and explain the roles of $\gamma$ and $\lambda$
Expand the loss to second order and identify the gradient $g_i$ and hessian $h_i$ of each observation
Derive the optimal leaf weight $w_j^* = -G_j / (H_j + \lambda)$ by minimizing the per-leaf objective
Compute the leaf similarity score $G_j^2 / (H_j + \lambda)$ and the split gain that XGBoost maximizes
Explain how $\gamma$ sets a minimum gain and prunes splits, and verify every quantity against a real xgboost fit

You should be comfortable with the pseudo-residual idea from Module 1, basic numpy, and reading a derivative and a second derivative. Every formula in this lesson is checked numerically, and the final section confirms the hand math against the real library. Let’s begin.

The Regularized Objective

Module 1’s booster only ever tried to shrink the training loss. XGBoost minimizes a larger quantity that adds a penalty for model complexity. Written across all $K$ trees in the ensemble, the objective is

\text{obj} = \sum_{i} L(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)

The first sum is the familiar training loss over all observations. The second sum is new: for every tree $f_k$ in the model, XGBoost adds a regularization term $\Omega$ that grows with how complex that tree is. For a single tree with $T$ leaves and leaf weights $w_1, \dots, w_T$, the penalty is

\Omega(f) = \gamma\, T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2

Read the two pieces separately. The term $\gamma\, T$ charges a flat cost $\gamma$ for each leaf the tree has, so a bushier tree pays more; this is what discourages needless splits and, as we will see, powers pruning. The term $\tfrac{1}{2}\lambda \sum_j w_j^2$ is an L2 penalty on the leaf weights, exactly like ridge regression, pulling every leaf value toward zero so no single leaf can make an extreme prediction. The $\tfrac{1}{2}$ is a convention that makes the derivative clean, and $\lambda$ (called reg_lambda in the library) controls how hard the shrinkage pulls. Neither penalty exists in a plain gradient booster, and together they are the first reason XGBoost generalizes so well.
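To make the penalty concrete, here is a minimal sketch that evaluates $\Omega(f)$ for a single tree. The helper name omega and the three leaf weights are made up for illustration; only the formula comes from the lesson.

```python
import numpy as np

def omega(leaf_weights, gamma, reg_lambda):
    """Complexity penalty gamma*T + 0.5*lambda*sum(w_j^2) for one tree."""
    w = np.asarray(leaf_weights, dtype=float)
    n_leaves = w.size                       # T in the formula
    return float(gamma * n_leaves + 0.5 * reg_lambda * np.sum(w ** 2))

# Hypothetical three-leaf tree (weights chosen only for illustration).
weights = [2.0, -1.0, 0.5]
penalty = omega(weights, gamma=1.0, reg_lambda=1.0)
print("Omega =", penalty)
# Output:
# Omega = 5.625
```

Here the leaf count contributes 3.0 and the squared weights contribute 2.625, so a tree with the same weights but more leaves, or the same leaves but larger weights, would pay more.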

Regularization lives in the objective, not as an afterthought

In many models regularization is a separate step you tune around the edges. In XGBoost, $\gamma$ and $\lambda$ sit inside the objective the tree-builder optimizes. That means every leaf weight and every split decision is already made with the penalty in view, rather than fitting an unpenalized tree and shrinking it later. This is why you will see $\lambda$ appear directly in the leaf-weight formula below, not as a post-processing knob.

Second-Order Approximation: Gradients and Hessians

When XGBoost adds the next tree $f_t$, the new prediction for observation $i$ becomes $\hat{y}_i^{(t-1)} + f_t(x_i)$. Plugging that into the loss and expanding with a second-order Taylor series around the current prediction $\hat{y}_i^{(t-1)}$ gives, for each observation,

L\!\left(y_i,\, \hat{y}_i^{(t-1)} + f_t(x_i)\right) \approx L\!\left(y_i,\, \hat{y}_i^{(t-1)}\right) + g_i\, f_t(x_i) + \tfrac{1}{2} h_i\, f_t(x_i)^2

Two new quantities appear. The gradient is the first derivative of the loss with respect to the current prediction, and the hessian is the second derivative:

g_i = \frac{\partial L(y_i, \hat{y})}{\partial \hat{y}} \Bigg|_{\hat{y} = \hat{y}_i^{(t-1)}}, \qquad h_i = \frac{\partial^2 L(y_i, \hat{y})}{\partial \hat{y}^2} \Bigg|_{\hat{y} = \hat{y}_i^{(t-1)}}

Module 1 used only the gradient, which is why it is a first-order method, the boosting analogue of plain gradient descent. XGBoost keeps the second-order term too, which makes it the boosting analogue of Newton’s method. The hessian tells XGBoost how curved the loss is at each point, so it can take a better-sized step instead of relying on a hand-chosen learning rate alone. For squared-error loss $L = \tfrac{1}{2}(y - \hat{y})^2$, the two derivatives are especially simple:

g_i = \frac{\partial}{\partial \hat{y}} \tfrac{1}{2}(y_i - \hat{y})^2 \Bigg|_{\hat{y} = \hat{y}_i^{(t-1)}} = \hat{y}_i^{(t-1)} - y_i, \qquad h_i = \frac{\partial}{\partial \hat{y}} (\hat{y} - y_i) = 1

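The gradient is the (signed) prediction error, and the hessian is exactly $1$ for every point. Let’s build a tiny Northwind dataset and compute these directly. Imagine six orders with a units-sold target, and a starting prediction of 0.5, XGBoost’s long-standing default base_score. (Recent XGBoost releases estimate base_score from the labels unless you set it explicitly, so the fit later in this lesson passes 0.5 by hand.)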

import numpy as np

# Six Northwind orders; target is units sold.
y = np.array([10.0, 12.0, 14.0, 30.0, 32.0, 34.0])
F = np.full_like(y, 0.5)   # current prediction = base_score, fixed at 0.5

# Squared-error loss L = 1/2 (y - yhat)^2  ->  g = yhat - y,  h = 1
g = F - y
h = np.ones_like(y)

print("g_i (gradient):", [round(float(v), 4) for v in g])
print("h_i (hessian): ", [round(float(v), 4) for v in h])
# Output:
# g_i (gradient): [-9.5, -11.5, -13.5, -29.5, -31.5, -33.5]
# h_i (hessian):  [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

Each gradient is 0.5 - y, a large negative number because the base prediction of 0.5 sits far below every real units figure, telling the tree those predictions must rise. Every hessian is 1.0, as the derivation promised. These two vectors, $g$ and $h$, are the only things about the loss that the rest of the algorithm ever sees. Everything below is built purely from sums of $g$ and $h$.
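One consequence worth checking: because squared error is exactly quadratic, its second-order expansion is not an approximation at all. The sketch below compares the true loss after a step with the Taylor estimate built from $g$ and $h$. The step of 3.0 is an arbitrary value picked for illustration, and loss is a helper name introduced here.

```python
import numpy as np

y = np.array([10.0, 12.0, 14.0, 30.0, 32.0, 34.0])
F = np.full_like(y, 0.5)        # current prediction
g = F - y                       # gradient of 1/2 (y - yhat)^2
h = np.ones_like(y)             # hessian of 1/2 (y - yhat)^2

def loss(y_true, y_pred):
    """Squared-error loss with the 1/2 convention used in this lesson."""
    return 0.5 * (y_true - y_pred) ** 2

step = 3.0                      # hypothetical tree output f_t(x_i), same for every row
exact = loss(y, F + step)
taylor = loss(y, F) + g * step + 0.5 * h * step ** 2

print("match:", bool(np.allclose(exact, taylor)))
# Output:
# match: True
```

For a loss that is not quadratic, such as log loss, the two sides would agree only approximately, and the agreement would degrade as the step grows.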

The Optimal Leaf Weight

Now put the tree structure into the picture. A tree assigns each observation to exactly one leaf, and every observation in leaf $j$ receives the same predicted value, the leaf weight $w_j$. Substituting the second-order expansion into the regularized objective, dropping the constant $L(y_i, \hat{y}_i^{(t-1)})$ terms (they do not depend on the new tree), and setting aside the $\gamma T$ term for now (it depends only on the number of leaves and returns in the gain), the part of the objective that leaf $j$ contributes is

\text{obj}_j = \left(\sum_{i \in j} g_i\right) w_j + \tfrac{1}{2}\left(\sum_{i \in j} h_i + \lambda\right) w_j^2

Notice the $\lambda$ that appears next to the summed hessians: it comes straight from the L2 penalty $\tfrac{1}{2}\lambda w_j^2$ in $\Omega$. To keep the notation clean, define the leaf’s summed gradient and summed hessian as

G_j = \sum_{i \in j} g_i, \qquad H_j = \sum_{i \in j} h_i

The per-leaf objective is now a simple upward-opening parabola in $w_j$: $\text{obj}_j = G_j\, w_j + \tfrac{1}{2}(H_j + \lambda)\, w_j^2$. Minimize it by setting the derivative with respect to $w_j$ to zero, $G_j + (H_j + \lambda)\, w_j = 0$, which gives the optimal leaf weight

w_j^* = -\frac{\sum_{i \in j} g_i}{\sum_{i \in j} h_i + \lambda} = -\frac{G_j}{H_j + \lambda}

This is a closed form. XGBoost does not search for the best leaf value; it computes it in one division. The $\lambda$ in the denominator is the shrinkage: raise it and every leaf weight moves toward zero. Let’s compute the optimal weight for the case where all six orders land in a single leaf.

import numpy as np

y = np.array([10.0, 12.0, 14.0, 30.0, 32.0, 34.0])
F = np.full_like(y, 0.5)
g = F - y
h = np.ones_like(y)

reg_lambda = 1.0

G = g.sum()
H = h.sum()
w_star = -G / (H + reg_lambda)

print("G =", round(float(G), 4), " H =", round(float(H), 4))
print("optimal leaf weight w* = -G/(H+lambda) =", round(float(w_star), 6))
# Output:
# G = -129.0  H = 6.0
# optimal leaf weight w* = -G/(H+lambda) = 18.428571

With every order in one leaf, the best single prediction to add on top of the 0.5 base is 18.4286, so the leaf would push predictions up to roughly 18.93. That sits a little below the mean units value of 22, which is the shrinkage from $\lambda = 1$ at work: with $\lambda = 0$ the weight would be 21.5 and the prediction exactly 22. Try raising reg_lambda and you will watch this weight shrink toward zero: at $\lambda = 10$ it drops to about 8.06, and at $\lambda = 100$ to about 1.22. That is L2 regularization acting directly through the denominator, exactly as the formula predicts.
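If you would like to convince yourself that the closed form really is the minimizer, a brute-force check is easy to sketch. The grid range and resolution below are arbitrary choices for illustration, not anything XGBoost does internally, and leaf_objective is a helper name introduced here.

```python
import numpy as np

G, H, reg_lambda = -129.0, 6.0, 1.0

def leaf_objective(w):
    """Per-leaf objective G*w + 0.5*(H + lambda)*w^2."""
    return G * w + 0.5 * (H + reg_lambda) * w ** 2

w_closed = -G / (H + reg_lambda)

# Brute force: evaluate the parabola on a fine grid and take the lowest point.
grid = np.linspace(0.0, 40.0, 400001)      # spacing of 1e-4
w_grid = float(grid[np.argmin(leaf_objective(grid))])

print("closed form:", round(w_closed, 4))
print("grid search:", round(w_grid, 4))
# Output:
# closed form: 18.4286
# grid search: 18.4286
```

The grid search needs hundreds of thousands of evaluations to match what the formula gives in one division, which is the practical payoff of the second-order approximation.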

The Similarity Score: How Good Is a Leaf?

We have the best weight for a leaf. How good is that leaf, in objective terms? Substitute $w_j^*$ back into the per-leaf objective $\text{obj}_j = G_j w_j + \tfrac{1}{2}(H_j + \lambda) w_j^2$. After the algebra, the minimized value collapses to a single clean expression:

\text{obj}_j^* = -\tfrac{1}{2}\,\frac{G_j^2}{H_j + \lambda}
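The intermediate algebra is short enough to write out. Substituting $w_j^* = -G_j/(H_j+\lambda)$ into the linear and quadratic terms in turn:

```latex
\begin{aligned}
\text{obj}_j^* &= G_j\left(-\frac{G_j}{H_j+\lambda}\right) + \tfrac{1}{2}(H_j+\lambda)\left(\frac{G_j}{H_j+\lambda}\right)^2 \\
&= -\frac{G_j^2}{H_j+\lambda} + \tfrac{1}{2}\,\frac{G_j^2}{H_j+\lambda} \\
&= -\tfrac{1}{2}\,\frac{G_j^2}{H_j+\lambda}
\end{aligned}
```

The linear term contributes a full $-G_j^2/(H_j+\lambda)$ and the quadratic term gives back half of it, which is where the $-\tfrac{1}{2}$ comes from.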

This is the smallest (most negative, hence best) contribution leaf $j$ can make to the objective. Because a more negative objective is better, it is convenient to drop the $-\tfrac{1}{2}$ and work with the remaining fraction as a positive-is-better similarity score (also called the leaf quality or structure score):

\text{similarity}_j = \frac{G_j^2}{H_j + \lambda}

The name “similarity” is apt: the numerator $G_j^2 = \left(\sum_{i \in j} g_i\right)^2$ is large when the residuals in the leaf all point the same way and reinforce each other, and small when they cancel out. A leaf full of observations that agree on which direction the prediction should move earns a high similarity score; a leaf mixing positive and negative residuals earns a low one, because the summed gradient nearly cancels. Compute the similarity of our single all-six-rows leaf.

import numpy as np

y = np.array([10.0, 12.0, 14.0, 30.0, 32.0, 34.0])
F = np.full_like(y, 0.5)
g = F - y
h = np.ones_like(y)
reg_lambda = 1.0

G = g.sum()
H = h.sum()
similarity = G**2 / (H + reg_lambda)

print("root similarity = G^2/(H+lambda) =", round(float(similarity), 6))
# Output:
# root similarity = G^2/(H+lambda) = 2377.285714

The root leaf, holding all six orders, scores 2377.2857. On its own that number means little; its whole purpose is to be compared against the similarity of the two leaves we would get by splitting it. That comparison is the gain.
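To see the cancellation effect in isolation, compare two small leaves with the same number of rows. The gradient values here are invented for illustration and are not part of the Northwind data; similarity is a helper name introduced for this sketch.

```python
import numpy as np

def similarity(g, h, reg_lambda=1.0):
    """Leaf similarity score G^2 / (H + lambda)."""
    return float(g.sum() ** 2 / (h.sum() + reg_lambda))

h = np.ones(3)                              # squared error: every hessian is 1

g_aligned = np.array([-3.0, -2.0, -4.0])    # all residuals point the same way
g_mixed = np.array([-3.0, 2.0, 1.0])        # residuals cancel: G = 0

print("aligned leaf:", similarity(g_aligned, h))
print("mixed leaf:", similarity(g_mixed, h))
# Output:
# aligned leaf: 20.25
# mixed leaf: 0.0
```

Both leaves hold three rows with residuals of similar size, yet only the aligned one earns a score, because the score depends on the summed gradient rather than on the individual magnitudes.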

The Gain: Choosing and Pruning Splits

Every candidate split takes a parent leaf and divides its observations into a left child and a right child. XGBoost decides whether the split is worthwhile by asking: does splitting raise the total similarity, and by enough to justify the $\gamma$ cost of adding a leaf? Writing $G_L, H_L$ for the left child’s sums, $G_R, H_R$ for the right child’s, and noting the parent’s sums are $G_L + G_R$ and $H_L + H_R$, the gain of the split is

\text{Gain} = \tfrac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma

In words: gain is one half of (left similarity plus right similarity minus parent similarity), minus the per-leaf penalty $\gamma$. The bracket measures how much cleaner the two children are than the single parent; the $\gamma$ is the toll for the extra leaf. XGBoost evaluates this for every candidate split point and picks the one with the largest gain. Crucially, it keeps a split only if its gain is positive. If the best available split has gain $\le 0$, that branch is not grown; the two children would not improve the penalized objective enough to pay for themselves. This is exactly how $\gamma$ (the gamma hyperparameter) prunes: raising $\gamma$ raises the bar every split must clear, so fewer splits survive and trees stay smaller.

Let’s evaluate a real split on our six orders. The natural cut separates the three low-units orders from the three high-units orders.

import numpy as np

y = np.array([10.0, 12.0, 14.0, 30.0, 32.0, 34.0])
F = np.full_like(y, 0.5)
g = F - y
h = np.ones_like(y)
reg_lambda = 1.0
gamma = 0.0

# Left = three low-units orders, Right = three high-units orders
GL, HL = g[0:3].sum(), h[0:3].sum()
GR, HR = g[3:6].sum(), h[3:6].sum()

sim_L = GL**2 / (HL + reg_lambda)
sim_R = GR**2 / (HR + reg_lambda)
sim_parent = (GL + GR)**2 / (HL + HR + reg_lambda)

gain = 0.5 * (sim_L + sim_R - sim_parent) - gamma

wL = -GL / (HL + reg_lambda)
wR = -GR / (HR + reg_lambda)

print("GL =", round(float(GL), 4), " GR =", round(float(GR), 4))
print("left weight  wL* =", round(float(wL), 4))
print("right weight wR* =", round(float(wR), 4))
print("sim_L =", round(float(sim_L), 4), " sim_R =", round(float(sim_R), 4), " sim_parent =", round(float(sim_parent), 4))
print("Gain =", round(float(gain), 6))
# Output:
# GL = -34.5  GR = -94.5
# left weight  wL* = 8.625
# right weight wR* = 23.625
# sim_L = 297.5625  sim_R = 2232.5625  sim_parent = 2377.2857
# Gain = 76.419643

The split earns a gain of 76.42, comfortably positive, so XGBoost keeps it. The two children have leaf weights 8.625 (the low-units orders) and 23.625 (the high-units orders), and their combined similarity (297.56 + 2232.56 = 2530.12) exceeds the parent’s 2377.29, which is what makes the gain positive. For contrast, consider an “interleaved” split that scatters low and high orders on both sides, for example the first, third, and fifth orders against the second, fourth, and sixth. Each child then looks much like the parent, so splitting buys almost no extra fit, while each child’s denominator carries its own $\lambda$. The children’s combined similarity (742.56 + 1387.56 = 2130.12) falls below the parent’s, and the gain comes out negative (about -123.58), a split XGBoost would refuse to make. The figure below traces the whole computation for the good split.

Figure: A tree-split diagram. A purple parent node holding all six orders shows summed gradient G equals minus 129 and summed hessian H equals 6, with similarity G squared over H plus lambda equal to 2377.29. Two arrows lead down, one labeled units less than 20 to a blue left leaf and one labeled units at least 20 to a green right leaf. The left leaf shows G L equals minus 34.5, H L equals 3, optimal weight 8.625, and similarity 297.56. The right leaf shows G R equals minus 94.5, H R equals 3, optimal weight 23.625, and similarity 2232.56. An orange box at the bottom writes Gain equals one half times the bracket of 297.56 plus 2232.56 minus 2377.29, then minus gamma, equal to 76.42, positive so the split is kept.

Caption: The split scored end to end: each node's gradients and hessians sum to G and H, the similarity score G squared over H plus lambda measures how aligned a node's residuals are, and the gain (half the rise in similarity, minus gamma) decides whether the split survives.
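The search over candidate split points can be sketched for this one-feature toy case: try every cut between consecutive rows, score each with the gain formula, and keep the best. This is a simplified illustration of the greedy idea, not XGBoost's actual implementation, and split_gain is a helper name introduced here. The last lines score one possible interleaved split, alternating rows, which is an assumption about the interleaving meant in the text.

```python
import numpy as np

y = np.array([10.0, 12.0, 14.0, 30.0, 32.0, 34.0])
g = 0.5 - y                     # gradients at base_score = 0.5
h = np.ones_like(y)             # squared-error hessians
reg_lambda, gamma = 1.0, 0.0

def split_gain(left, reg_lambda, gamma):
    """Gain of sending rows where `left` is True to the left child."""
    GL, HL = g[left].sum(), h[left].sum()
    GR, HR = g[~left].sum(), h[~left].sum()
    sim = lambda G, H: G ** 2 / (H + reg_lambda)
    return 0.5 * (sim(GL, HL) + sim(GR, HR) - sim(GL + GR, HL + HR)) - gamma

rows = np.arange(len(y))
# Candidate k puts rows 0..k-1 in the left child and the rest in the right.
gains = {k: float(split_gain(rows < k, reg_lambda, gamma)) for k in range(1, len(y))}
for k, gain in gains.items():
    print(f"cut before row {k}: gain = {gain:.4f}")

best_k = max(gains, key=gains.get)
print("best cut: before row", best_k)

# One interleaved split: rows 0, 2, 4 on the left, rows 1, 3, 5 on the right.
interleaved = float(split_gain(rows % 2 == 0, reg_lambda, gamma))
print(f"interleaved gain = {interleaved:.4f}")
# Output:
# cut before row 1: gain = 23.9405
# cut before row 2: gain = 51.2571
# cut before row 3: gain = 76.4196
# cut before row 4: gain = -74.8762
# cut before row 5: gain = -148.0595
# best cut: before row 3
# interleaved gain = -123.5804
```

Notice that some cuts score a negative gain even at $\gamma = 0$. With $\lambda = 1$, each child pays its own $\lambda$ in the denominator, and for a poorly placed cut that cost outweighs the improvement in fit.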

Verifying against a real XGBoost fit

Hand math is only convincing if the library agrees. Let’s fit a genuine one-tree XGBoost model on the same six orders, with the same reg_lambda, gamma, and base_score, and read off its leaf values with get_dump. To make the leaf weights appear directly (rather than shrunk by a learning rate), we set eta=1.0.

import numpy as np
import xgboost as xgb

y = np.array([10.0, 12.0, 14.0, 30.0, 32.0, 34.0])
X = np.arange(6).reshape(-1, 1).astype(float)   # one feature, ordered low to high units

dtrain = xgb.DMatrix(X, label=y)
params = {
    "max_depth": 1,          # a single split: root -> two leaves
    "eta": 1.0,              # learning rate 1, so the leaf weight is applied in full
    "reg_lambda": 1.0,       # lambda in our formulas
    "gamma": 0.0,            # gamma in our formulas
    "base_score": 0.5,       # starting prediction F
    "min_child_weight": 0,
    "objective": "reg:squarederror",
    "tree_method": "exact",
}
bst = xgb.train(params, dtrain, num_boost_round=1)

print(bst.get_dump(with_stats=True)[0])
pred = bst.predict(dtrain)
print("predictions:", [round(float(v), 4) for v in pred])
print("base_score + wL =", round(float(0.5 + 8.625), 4),
      "  base_score + wR =", round(float(0.5 + 23.625), 4))
# Output:
# 0:[f0<2.5] yes=1,no=2,missing=1,gain=152.839355,cover=6
# 	1:leaf=8.625,cover=3
# 	2:leaf=23.625,cover=3
#
# predictions: [9.125, 9.125, 9.125, 24.125, 24.125, 24.125]
# base_score + wL = 9.125   base_score + wR = 24.125

Everything lines up. XGBoost split at f0 < 2.5, exactly separating the three low-units orders from the three high-units orders, and its leaf values are 8.625 and 23.625, matching our hand-computed $w_L^*$ and $w_R^*$ to the digit. The predictions are those leaf weights added to the base_score of 0.5, giving 9.125 and 24.125, which is why understanding base_score as the starting prediction $F$ matters. One honest caveat about the reported gain=152.84: XGBoost prints the gain without the outer $\tfrac{1}{2}$ factor, and indeed $2 \times 76.42 = 152.84$. The value the library compares against gamma internally is this un-halved bracket. The $\tfrac{1}{2}$ is a constant that does not change which split wins, so both conventions rank splits identically, but it does change the scale of the threshold: a library gamma of $2c$ prunes the same splits as $\gamma = c$ in the formula above.
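The factor-of-two bookkeeping is easy to reproduce without the library. The sketch below recomputes the un-halved bracket and compares it with the gain printed in the dump; the tolerance is loose on the assumption that the printed statistic carries single-precision rounding. It also shows what the leaf values would look like under a smaller learning rate, since the lesson set eta=1.0 precisely so the weights would not be scaled; the 0.3 used here is just an example value.

```python
import numpy as np

y = np.array([10.0, 12.0, 14.0, 30.0, 32.0, 34.0])
g = 0.5 - y
h = np.ones_like(y)
reg_lambda = 1.0

def sim(G, H):
    """Similarity score G^2 / (H + lambda)."""
    return G ** 2 / (H + reg_lambda)

GL, HL = g[:3].sum(), h[:3].sum()
GR, HR = g[3:].sum(), h[3:].sum()

bracket = float(sim(GL, HL) + sim(GR, HR) - sim(GL + GR, HL + HR))  # no 1/2, no gamma
dump_gain = 152.839355                                              # copied from the dump

print("un-halved bracket:", round(bracket, 4))
print("half of it:", round(bracket / 2, 4))
print("agrees with dump:", abs(bracket - dump_gain) < 1e-3)

eta = 0.3   # example learning rate, not the value used in the fit
print("scaled leaves:", round(eta * 8.625, 4), round(eta * 23.625, 4))
# Output:
# un-halved bracket: 152.8393
# half of it: 76.4196
# agrees with dump: True
# scaled leaves: 2.5875 7.0875
```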

Why the hessian carries the row count here

Look at cover=6 on the root and cover=3 on each leaf in the dump. For squared-error loss every hessian is 1, so $H_j$ is simply the number of rows in the leaf, which is exactly what “cover” reports. For other losses the hessian is not 1 (for log loss it is $p(1-p)$), and cover becomes the summed hessian rather than a plain count. The hyperparameter min_child_weight is a floor on this cover, which is why on non-squared losses it behaves like “minimum summed curvature in a leaf” rather than “minimum rows”: rows the model is still unsure about count for more than rows it already predicts confidently.
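A small sketch makes the distinction concrete. The predicted probabilities and the min_child_weight value below are invented for illustration; the point is only that the same three rows produce a different cover under each loss.

```python
import numpy as np

p = np.array([0.5, 0.8, 0.95])      # hypothetical predicted probabilities in one leaf

cover_squared_error = float(np.ones_like(p).sum())   # h_i = 1      -> cover is the row count
cover_log_loss = float((p * (1.0 - p)).sum())        # h_i = p(1-p) -> cover is summed curvature

min_child_weight = 1.0              # example floor on the summed hessian
print("squared-error cover:", cover_squared_error)
print("log-loss cover:", round(cover_log_loss, 4))
print("log-loss leaf allowed:", cover_log_loss >= min_child_weight)
# Output:
# squared-error cover: 3.0
# log-loss cover: 0.4575
# log-loss leaf allowed: False
```

Under squared error these three rows would clear a floor of 1.0 easily; under log loss the same rows fall short, because confident predictions add very little hessian.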

Practice Exercises

Try these before checking the hints. They exercise the leaf-weight, similarity, and gain formulas you just derived.

Exercise 1: Watch Regularization Shrink the Leaf Weight

Using the six orders’ summed gradient G = -129.0 and summed hessian H = 6.0, compute the optimal root leaf weight $w^* = -G/(H + \lambda)$ for reg_lambda values of 0.0, 1.0, 10.0, and 100.0. What happens to the weight as $\lambda$ grows, and why?

G, H = -129.0, 6.0
for lam in [0.0, 1.0, 10.0, 100.0]:
    # Your code here
    pass

Hint

Compute w = -G / (H + lam) inside the loop. You should get about 21.5, 18.43, 8.06, and 1.22. As $\lambda$ grows, the denominator grows, so the weight shrinks steadily toward zero. That is the L2 penalty from $\Omega$ acting directly through the leaf-weight formula, damping how extreme any single leaf’s prediction can be.

Exercise 2: Find the Gamma That Prunes a Split

The good split on the six orders has an (un-penalized) similarity rise of 2530.125 - 2377.2857, giving a halved gain of 76.42 when gamma = 0. Treating gamma as the only thing you change, find the smallest $\gamma$ that makes the split’s gain non-positive so XGBoost would prune it. Print the gain for gamma values of 0, 50, and 100.

half_rise = 76.419643   # 1/2 (sim_L + sim_R - sim_parent)
for gamma in [0.0, 50.0, 100.0]:
    gain = half_rise - gamma
    # Your code here: print gamma and gain, note when it goes non-positive

Hint

Gain is half_rise - gamma, so it turns non-positive once gamma reaches about 76.42: the gain is zero there and negative beyond, which makes 76.42 the smallest pruning value under this lesson’s formula. At gamma = 50 the gain is still positive (26.42, kept); at gamma = 100 it is negative (-23.58, pruned). This is precisely how the gamma hyperparameter controls tree size: it is the minimum gain a split must produce to be allowed, so larger gamma means more aggressive pruning and simpler trees.

Exercise 3: Gradients and Hessians for Log Loss

Squared-error loss has the simplest possible hessian, $h_i = 1$. Log loss does not. For a binary label $y$ and predicted probability $p = \sigma(F)$, the log-loss gradient and hessian with respect to the raw score are $g = p - y$ and $h = p(1 - p)$. For three points that all have label y = 1 and predicted probabilities p = [0.5, 0.9, 0.99], compute g and h and note how the hessian changes as the model grows confident.

import numpy as np

y = 1.0
p = np.array([0.5, 0.9, 0.99])
# Your code here: g = p - y ; h = p * (1 - p)

Hint

You get gradients [-0.5, -0.1, -0.01] and hessians [0.25, 0.09, 0.0099]. As the predicted probability approaches the correct label, both the gradient and the hessian shrink toward zero. The vanishing hessian is why confident, correct points contribute almost nothing to a leaf’s cover under log loss, and it is exactly the curvature information that a first-order booster (Module 1) throws away but XGBoost’s Newton step uses.

Summary

You have seen precisely what makes XGBoost different from the hand-built booster of Module 1. It optimizes a regularized objective, and it approximates that objective to second order, using both the gradient and the hessian of the loss. From those two choices, a closed-form leaf weight, a leaf similarity score, and a split gain all fall out, and a real XGBoost fit matches them exactly.

Key Concepts

Regularized Objective

XGBoost minimizes $\sum_i L(y_i, \hat{y}_i) + \sum_k \Omega(f_k)$ with $\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_j w_j^2$
$\gamma$ charges a flat cost per leaf (controls pruning); $\lambda$ is an L2 penalty on leaf weights (controls shrinkage)

Gradients and Hessians

A second-order Taylor expansion introduces $g_i = \partial_{\hat{y}} L$ and $h_i = \partial^2_{\hat{y}} L$
For squared error, $g_i = \hat{y}_i - y_i$ and $h_i = 1$; the algorithm sees the loss only through sums of $g$ and $h$

Optimal Leaf Weight

$w_j^* = -G_j / (H_j + \lambda)$, a closed form, not a search; on the six orders the two leaves gave 8.625 and 23.625
Raising $\lambda$ shrinks every weight toward zero through the denominator

Similarity and Gain

Leaf quality is the similarity score $G_j^2 / (H_j + \lambda)$; the root leaf scored 2377.29
$\text{Gain} = \tfrac{1}{2}[\, G_L^2/(H_L+\lambda) + G_R^2/(H_R+\lambda) - (G_L+G_R)^2/(H_L+H_R+\lambda)\,] - \gamma$; the good split scored 76.42
A split is kept only if its gain is positive, so $\gamma$ sets the minimum gain and prunes

Verified against the library

A one-tree xgboost fit with reg_lambda=1, gamma=0, base_score=0.5 produced leaf values 8.625 and 23.625, matching the hand math; predictions are those weights plus the base score

Why This Matters

Every hyperparameter you will meet in the next lesson connects back to a symbol in this objective. reg_lambda is the $\lambda$ in the leaf-weight denominator; gamma is the $\gamma$ that a split’s gain must beat; max_depth and min_child_weight bound how the gain-maximizing search is allowed to grow trees. Because you derived the formulas and watched the library reproduce them to the decimal, those knobs are no longer mysterious dials. When Northwind’s model overfits, you will know that raising reg_lambda shrinks leaf weights and raising gamma refuses low-gain splits, because you have seen both effects in the algebra and in the numbers.

This objective is also common ground for much of the gradient-boosting family. LightGBM scores splits from the same gradient and hessian sums, and CatBoost builds on closely related gradient statistics; the libraries differ mainly in how they search for splits and structure trees rather than in the basic idea of what a split is worth. Master this one derivation and you understand the scoring idea at the heart of modern boosting libraries.

Next Steps

You now understand the objective XGBoost optimizes and the exact formulas it uses to weight leaves and score splits. In the next lesson you will turn each symbol in that objective into a hyperparameter and learn how to set the learning rate, tree depth, min_child_weight, gamma, and the L1 and L2 penalties to control a real model.

Lesson 3: Core Hyperparameters

Map each symbol in the objective to a tunable knob: learning rate, max_depth, min_child_weight, gamma, and the L1 and L2 penalties.

Back to Module Overview

Return to the XGBoost in Depth module overview

Continue Building Your Skills

You have just derived the mathematical core of XGBoost and watched a real fit confirm it to the decimal. The regularized objective, the gradient and hessian, the closed-form leaf weight $-G/(H+\lambda)$, the similarity score $G^2/(H+\lambda)$, and the gain that chooses and prunes splits are not five separate tricks; they are one objective seen from five angles. Hold onto the picture in the figure: sum the gradients and hessians, turn them into a similarity, and let the rise in similarity decide the split. Every hyperparameter, every “why did it overfit,” and every rival library in the rest of this course maps back to that one idea. Learn it once here, from the math and the numbers together, and XGBoost will never again feel like a black box.
