Lesson 3 - Core Hyperparameters

Welcome to Core Hyperparameters

In the last lesson you looked under XGBoost’s hood and saw exactly how it fits each tree: it uses the gradient and hessian of the loss to score candidate splits, keeps a split only when its gain beats a threshold, and adds each new tree scaled by a shrinkage factor. That machinery does not run on autopilot. A handful of numbers you pass in decide how many trees get built, how big each step is, how deep each tree grows, and how eager XGBoost is to carve out a new split. Those numbers are the model’s hyperparameters, and setting them well is most of the work of getting XGBoost to perform.

Northwind Analytics is back on the California Housing problem, predicting MedHouseVal, the median house value (in units of 100,000 dollars) for each California district. In Lesson 1 they saw an untuned boosting model land around a test RMSE of 0.5, competitive but not spectacular. Their next job is to tune it, and to tune anything you first have to understand what each knob does. In this lesson you will turn five knobs one at a time and measure the effect on real train and test RMSE, so that by the end you can look at a set of hyperparameters and predict, roughly, whether the model will underfit, overfit, or land in the sweet spot. Every number below came from running the code for real.

By the end of this lesson, you will be able to:

Explain what n_estimators, learning_rate, max_depth, min_child_weight, and gamma each control
Read the gap between train and test RMSE as a direct measure of overfitting
Trade learning_rate against n_estimators to learn slowly but well
Use max_depth to set model complexity and recognize when depth starts to overfit
Apply min_child_weight and gamma as complexity brakes, connecting each to the hessian and gain ideas from Lesson 2

You should be comfortable with the scikit-learn fit/predict pattern and the boosting intuition from Module 1. We will use XGBoost’s scikit-learn wrapper, xgb.XGBRegressor, throughout. Let’s begin.

The One Number to Watch: The Train-Test Gap

Before touching a single knob, fix the habit that makes all of this readable. For every model you will print two RMSE numbers: one on the training data the model learned from, and one on a held-out test set it never saw. The training RMSE tells you how well the model fit; the test RMSE tells you how well it will generalize. The distance between them, the gap, is your overfitting gauge.

A small gap with a high test RMSE means the model is too simple: it underfits, missing signal even on data it trained on.
A small gap with a low test RMSE is the goal: the model learned real patterns that carry over to new data.
A large gap (tiny train RMSE, much bigger test RMSE) means the model memorized the training set and overfit.
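
As a rough sketch, the three readings above can be written as a small helper. The function name diagnose and both thresholds (target_rmse and max_gap) are assumptions chosen for illustration, not values XGBoost or this lesson prescribes. The RMSE pairs are taken from the max_depth and min_child_weight sweeps later in this lesson.

```python
def diagnose(train_rmse, test_rmse, target_rmse=0.47, max_gap=0.20):
    """Classify a fit from its train/test RMSE pair.

    target_rmse and max_gap are illustrative thresholds; choose your own
    for the problem at hand.
    """
    gap = test_rmse - train_rmse
    if gap > max_gap:
        return "overfit"    # large gap: memorized the training set
    if test_rmse > target_rmse:
        return "underfit"   # small gap but test error still high
    return "good fit"       # small gap and low test error

print(diagnose(0.5227, 0.5478))  # max_depth=2 row
print(diagnose(0.3236, 0.4492))  # min_child_weight=50 row
print(diagnose(0.0818, 0.4664))  # max_depth=10 row
```

With these thresholds the three calls print underfit, good fit, and overfit in that order. Moving either threshold changes the labels, which is why the raw train and test numbers stay the primary evidence.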

Every sweep in this lesson prints both numbers so you can watch the gap open and close as you turn each knob. Here is the shared setup all the experiments reuse.

import warnings
warnings.filterwarnings("ignore")
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import numpy as np

data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

def rmse(model, X, y):
    return float(np.sqrt(mean_squared_error(y, model.predict(X))))

print("train rows:", X_train.shape[0], " test rows:", X_test.shape[0])

train rows: 16512  test rows: 4128

With 16,512 training districts and 4,128 test districts locked in behind random_state=42, every model below is judged on exactly the same split.

n_estimators: How Many Trees to Build

n_estimators is the number of boosting rounds, one tree per round. Because each tree corrects the errors left by the ones before it, adding trees can only lower the training error, or leave it unchanged. The interesting question is what happens on the test set: at first more trees keep helping, but past a point the extra trees start fitting noise and generalization stops improving (and can slowly worsen). Let’s sweep it with everything else held fixed.

import warnings
warnings.filterwarnings("ignore")
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import numpy as np

data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

def rmse(model, X, y):
    return float(np.sqrt(mean_squared_error(y, model.predict(X))))

for n in [10, 50, 100, 300, 1000]:
    m = xgb.XGBRegressor(n_estimators=n, learning_rate=0.3,
                         max_depth=6, random_state=42)
    m.fit(X_train, y_train)
    tr = round(rmse(m, X_train, y_train), 4)
    te = round(rmse(m, X_test, y_test), 4)
    print(f"n_estimators={n:5d}  train={tr:<8} test={te}")

n_estimators=   10  train=0.491    test=0.5453
n_estimators=   50  train=0.342    test=0.4728
n_estimators=  100  train=0.2757   test=0.4626
n_estimators=  300  train=0.1464   test=0.4599
n_estimators= 1000  train=0.0251   test=0.4601

Read the two columns separately. The train RMSE falls relentlessly, from 0.491 all the way down to 0.0251 at 1,000 trees, because every added tree fits the training residuals a little more. The test RMSE, though, tells the real story: it drops sharply through the first hundred trees (0.5453 to 0.4626), inches down to a floor around 0.46 by 300 trees, and then stops improving, even ticking up from 0.4599 to 0.4601 as you go from 300 to 1,000 trees. Those last 700 trees drove the training error to near zero while doing nothing for generalization. That is the widening gap of overfitting in slow motion. More trees are not free: past the point where test error flattens, you are only buying complexity and compute.
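
One way to make “past the point where test error flattens” concrete is to pick the smallest tree count whose test RMSE is within some tolerance of the best value in the sweep. The sketch below does that on the numbers printed above; the helper name smallest_good_enough and the tolerance values are assumptions for illustration.

```python
# (n_estimators, test RMSE) pairs copied from the sweep output above.
sweep = [(10, 0.5453), (50, 0.4728), (100, 0.4626), (300, 0.4599), (1000, 0.4601)]

def smallest_good_enough(results, tolerance):
    """Return the smallest tree count whose test RMSE is within
    `tolerance` of the best test RMSE in the sweep."""
    best = min(score for _, score in results)
    for n_trees, score in sorted(results):
        if score <= best + tolerance:
            return n_trees

print(smallest_good_enough(sweep, tolerance=0.005))  # 100
print(smallest_good_enough(sweep, tolerance=0.001))  # 300
```

A looser tolerance accepts 100 trees and a tighter one asks for 300; neither setting ever justifies 1,000.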

n_estimators is not a set-and-forget knob

There is no universal “right” number of trees, because the best count depends on the other knobs, above all learning_rate. A model that learns in small steps needs more trees to reach the same fit than one that takes big steps. That is exactly why the next section pairs these two parameters together: you almost never tune n_estimators in isolation.

learning_rate: How Big Each Step Is

learning_rate (also written as eta) is the shrinkage factor from Lesson 2. Before adding a new tree’s predictions to the ensemble, XGBoost multiplies them by this fraction. A small learning_rate means each tree nudges the prediction only a little, so the model learns cautiously; a large one means each tree makes a bold correction. Hold the tree count fixed at 100 and watch what different step sizes do.

import warnings
warnings.filterwarnings("ignore")
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import numpy as np

data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

def rmse(model, X, y):
    return float(np.sqrt(mean_squared_error(y, model.predict(X))))

for lr in [0.01, 0.05, 0.1, 0.3, 1.0]:
    m = xgb.XGBRegressor(n_estimators=100, learning_rate=lr,
                         max_depth=6, random_state=42)
    m.fit(X_train, y_train)
    tr = round(rmse(m, X_train, y_train), 4)
    te = round(rmse(m, X_test, y_test), 4)
    print(f"learning_rate={lr:<5}  train={tr:<8} test={te}")

learning_rate=0.01   train=0.6994   test=0.7209
learning_rate=0.05   train=0.4311   test=0.5033
learning_rate=0.1    train=0.3689   test=0.4739
learning_rate=0.3    train=0.2757   test=0.4626
learning_rate=1.0    train=0.1853   test=0.5801

This sweep has a clear U-shape in the test column. At learning_rate=0.01 the steps are so timid that 100 trees are nowhere near enough to fit the data; both train and test RMSE are high (0.6994 and 0.7209). This is underfitting from learning too slowly with too few trees. At the other extreme, learning_rate=1.0 takes reckless full-size steps: the training RMSE drops to 0.1853, but the test RMSE balloons to 0.5801, the worst generalization in the table. That is overfitting from steps so large the model overreacts to each tree. The sweet spot here sits in the middle, around 0.3, at test RMSE 0.4626.

The catch is that the “best” learning_rate is entangled with n_estimators. A smaller step size is not worse; it just needs more trees to travel the same distance. Let’s prove it by letting the low-rate models compensate with more trees.

import warnings
warnings.filterwarnings("ignore")
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import numpy as np

data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

def rmse(model, X, y):
    return float(np.sqrt(mean_squared_error(y, model.predict(X))))

for lr, n in [(0.3, 100), (0.1, 100), (0.1, 300), (0.05, 300), (0.05, 600)]:
    m = xgb.XGBRegressor(n_estimators=n, learning_rate=lr,
                         max_depth=6, random_state=42)
    m.fit(X_train, y_train)
    tr = round(rmse(m, X_train, y_train), 4)
    te = round(rmse(m, X_test, y_test), 4)
    print(f"learning_rate={lr:<5} n_estimators={n:<5} train={tr:<8} test={te}")

learning_rate=0.3   n_estimators=100   train=0.2757   test=0.4626
learning_rate=0.1   n_estimators=100   train=0.3689   test=0.4739
learning_rate=0.1   n_estimators=300   train=0.263    test=0.4498
learning_rate=0.05  n_estimators=300   train=0.3322   test=0.4649
learning_rate=0.05  n_estimators=600   train=0.263    test=0.4516

Look at rows two and three. At learning_rate=0.1 with only 100 trees the model is a touch underfit (test 0.4739), but give it 300 trees and it reaches 0.4498, the best test RMSE in this table, beating the faster learning_rate=0.3 model. The same pattern repeats at 0.05: 300 trees is not enough (0.4649), but 600 trees recovers most of the ground (0.4516). The lesson is a rule of thumb you will use constantly: lower the learning rate and raise the tree count together. Small, patient steps generalize better than big greedy ones, as long as you give the model enough rounds to get where it is going.
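
A back-of-the-envelope sketch shows why a smaller rate needs more rounds. It assumes an idealized booster in which every tree predicts the current residual perfectly, so only the shrinkage limits progress; after n rounds the residual is (1 - learning_rate)^n of its starting size. Real trees are imperfect, so the counts below illustrate the scaling only and are not predictions for California Housing. The function name rounds_needed is hypothetical.

```python
import math

def rounds_needed(learning_rate, remaining_fraction=0.01):
    """Rounds until the residual shrinks to `remaining_fraction` of its
    starting size, assuming each tree fits the current residual exactly."""
    if learning_rate >= 1.0:
        return 1  # a full-size step removes the whole residual at once
    # After n rounds the residual is (1 - learning_rate) ** n of the original.
    return math.ceil(math.log(remaining_fraction) / math.log(1.0 - learning_rate))

for lr in [0.3, 0.1, 0.05]:
    print(f"learning_rate={lr:<5} rounds to reach 1% residual: {rounds_needed(lr)}")
```

Under this idealization the three rates need 13, 44, and 90 rounds respectively: cutting the rate by a factor of three roughly triples the rounds, which mirrors the trade the table above shows on real data.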

The standard tuning move

Practitioners often fix a smallish learning_rate (0.05 to 0.1 is a common starting range), then use early stopping or a validation curve to find how many trees that rate needs. Speed of training pushes you toward a larger rate; quality of the final model pushes you toward a smaller rate with more trees. You are always trading one against the other, never setting either alone.
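
The idea behind early stopping can be sketched in plain Python: keep adding rounds while the validation score improves, and stop once it has failed to improve for a set number of rounds (the patience). This is not XGBoost’s own early-stopping API. The validation curve is a synthetic formula made up for illustration, and the names synthetic_val_rmse and early_stop are hypothetical.

```python
import math

def synthetic_val_rmse(round_idx):
    # Made-up curve: fast early improvement, then a slow upward drift.
    return 0.45 + 0.30 * math.exp(-round_idx / 40) + 0.0001 * round_idx

def early_stop(curve, max_rounds, patience):
    """Return (best_round, best_score, stopped_at) using a patience rule."""
    best_round, best_score = 0, float("inf")
    for r in range(1, max_rounds + 1):
        score = curve(r)
        if score < best_score:
            best_round, best_score = r, score   # new best: remember it
        elif r - best_round >= patience:
            return best_round, best_score, r    # no improvement for `patience` rounds
    return best_round, best_score, max_rounds

best_round, best_score, stopped_at = early_stop(synthetic_val_rmse, max_rounds=1000, patience=20)
print("best round:", best_round, " stopped at:", stopped_at)
```

The rule spends only `patience` extra rounds past the best one instead of running the full budget, which is how a small learning_rate can be paired with a large n_estimators cap without paying for every tree.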

max_depth: How Complex Each Tree Is

max_depth caps how many levels deep each tree can grow. A shallow tree (depth 2) can only combine a couple of features before making a prediction, so it captures broad trends but misses fine interactions. A deep tree (depth 10) can carve the feature space into as many as 1,024 tiny regions (2^10 leaves), capturing intricate interactions and memorizing noise right along with them. This is the single most direct lever on model complexity, and its sweep is the clearest picture of overfitting in the whole lesson. Hold 200 trees and learning_rate=0.1, and vary only the depth.

import warnings
warnings.filterwarnings("ignore")
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import numpy as np

data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

def rmse(model, X, y):
    return float(np.sqrt(mean_squared_error(y, model.predict(X))))

for d in [2, 4, 6, 8, 10]:
    m = xgb.XGBRegressor(n_estimators=200, learning_rate=0.1,
                         max_depth=d, random_state=42)
    m.fit(X_train, y_train)
    tr = rmse(m, X_train, y_train)
    te = rmse(m, X_test, y_test)
    # Compute the gap from the unrounded values; round only for display.
    gap = round(te - tr, 4)
    print(f"max_depth={d:2d}  train={round(tr, 4):<8} test={round(te, 4):<8} gap={gap}")

max_depth= 2  train=0.5227   test=0.5478   gap=0.0252
max_depth= 4  train=0.4203   test=0.4831   gap=0.0628
max_depth= 6  train=0.3045   test=0.4564   gap=0.1519
max_depth= 8  train=0.1778   test=0.4541   gap=0.2763
max_depth=10  train=0.0818   test=0.4664   gap=0.3846

This is the table to memorize. Trace the gap column as depth grows: 0.0252, 0.0628, 0.1519, 0.2763, 0.3846. It more than doubles on each of the first two steps and keeps widening after that, ending about fifteen times larger than where it started. At max_depth=2 the model is almost balanced (train and test within 0.025 of each other) but a bit underfit, with the worst test RMSE in the table. As depth increases, the training RMSE plunges toward 0.08, because deep trees can fit almost anything, but the test RMSE stops cooperating: it improves only until depth 8 (0.4541) and then gets worse at depth 10 (0.4664) even as training error keeps dropping. That divergence, train still falling while test turns back up, is the textbook signature of overfitting. The best generalization sits at depth 6 to 8; beyond that you are paying in variance for a model that only looks better on paper. The figure makes the widening gap impossible to miss.

Figure: Train and test RMSE versus max_depth (2, 4, 6, 8, 10). As max_depth grows, training RMSE (blue) keeps falling but test RMSE (orange) flattens and then rises. The widening vertical gap between the curves is overfitting made visible.

For most tabular problems, XGBoost’s trees are kept shallow on purpose, typically max_depth between 3 and 8. Depth is your first-choice complexity dial: raise it when the model underfits (small gap, high test error), lower it when the model overfits (large gap). But depth is a blunt instrument, all-or-nothing per level. The next two knobs let you restrain complexity more surgically, from inside each tree.
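
To see why depth is such a coarse dial, it helps to count how many leaves a single tree can have at each depth. The sketch below assumes a fully grown binary tree, which is an upper bound: in practice the gain threshold and min_child_weight stop many branches early. The helper name max_leaves is hypothetical.

```python
def max_leaves(depth):
    """Upper bound on leaves in a binary tree of the given depth."""
    return 2 ** depth  # each extra level can double the number of leaves

for depth in [2, 4, 6, 8, 10]:
    print(f"max_depth={depth:2d}  at most {max_leaves(depth):4d} leaves per tree")
```

Each step of two in the sweep multiplies the possible leaf count by four, from 4 leaves at depth 2 to 1,024 at depth 10, so one notch of max_depth changes capacity far more abruptly than a small change in min_child_weight or gamma.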

min_child_weight: How Much Evidence a Leaf Needs

min_child_weight sets the minimum sum of hessians required in a child node for a split to be allowed. Recall from Lesson 2 that the hessian is the second derivative of the loss for each training row; for the squared-error objective used here, every row contributes a hessian of 1, so the sum of hessians in a node is essentially just the number of training rows that land there. Setting min_child_weight higher therefore says: “do not create a leaf unless enough training examples support it.” That blocks the tiny, hyper-specific leaves that memorize individual districts, making splits more conservative. Let’s take a deliberately overfit model (max_depth=8, gap 0.2763 from above) and dial this knob up.

import warnings
warnings.filterwarnings("ignore")
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import numpy as np

data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

def rmse(model, X, y):
    return float(np.sqrt(mean_squared_error(y, model.predict(X))))

for mcw in [1, 5, 10, 50, 100]:
    m = xgb.XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=8,
                         min_child_weight=mcw, random_state=42)
    m.fit(X_train, y_train)
    tr = rmse(m, X_train, y_train)
    te = rmse(m, X_test, y_test)
    # Compute the gap from the unrounded values; round only for display.
    gap = round(te - tr, 4)
    print(f"min_child_weight={mcw:4d}  train={round(tr, 4):<8} test={round(te, 4):<8} gap={gap}")

min_child_weight=   1  train=0.1778   test=0.4541   gap=0.2763
min_child_weight=   5  train=0.2007   test=0.4567   gap=0.2561
min_child_weight=  10  train=0.2289   test=0.4557   gap=0.2268
min_child_weight=  50  train=0.3236   test=0.4492   gap=0.1256
min_child_weight= 100  train=0.3669   test=0.4541   gap=0.0871

Watch the gap column shrink steadily: 0.2763, 0.2561, 0.2268, 0.1256, 0.0871. As min_child_weight rises, the training RMSE climbs (from 0.1778 to 0.3669) because the model is no longer allowed to build the tiny leaves that let it memorize; it is being forced to be more conservative. Crucially, the test RMSE barely moves and even improves slightly, reaching its best value of 0.4492 at min_child_weight=50 before the constraint starts to bite too hard at 100. This is regularization working exactly as intended: it gave up a big chunk of training-set accuracy (which was fake, just memorization) in exchange for a much smaller, healthier gap and slightly better generalization. When a model overfits and you do not want to sacrifice depth, raising min_child_weight is one of the cleanest ways to rein it in.

Why the hessian, not a raw count?

For plain squared-error regression the sum of hessians equals the row count, so min_child_weight reads like a simple “minimum samples per leaf.” But XGBoost defines it in hessian terms so the same knob works for any objective. In classification, for instance, confident predictions have small hessians and uncertain ones have large hessians, so min_child_weight ends up requiring a leaf to hold enough genuinely informative examples, not just enough rows. Defining it through the hessian is what makes this one parameter behave sensibly across every loss function XGBoost supports.
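
A small sketch makes the difference concrete. It assumes the standard second derivatives: 1 per row for squared error, and p(1 - p) per row for logistic loss with predicted probability p. The helper names and the example nodes are made up for illustration.

```python
def logistic_hessian(probability):
    # Second derivative of log loss with respect to the raw score.
    return probability * (1.0 - probability)

def child_allowed(hessians, min_child_weight):
    # A child node is acceptable only if its hessian sum reaches the minimum.
    return sum(hessians) >= min_child_weight

regression_node = [1.0] * 10                   # squared error: every row counts as 1
confident_node = [logistic_hessian(0.99)] * 10  # ten rows the model is already sure about
uncertain_node = [logistic_hessian(0.5)] * 10   # ten rows the model is unsure about

print(child_allowed(regression_node, min_child_weight=5))  # sum is 10.0
print(child_allowed(confident_node, min_child_weight=1))   # sum is about 0.099
print(child_allowed(uncertain_node, min_child_weight=1))   # sum is 2.5
```

All three nodes hold ten rows, yet only the confident classification node fails the check, because its rows carry almost no hessian weight.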

gamma: How Much Gain a Split Must Earn

gamma (also called min_split_loss) sets the minimum gain a split must produce to be kept. In Lesson 2 you saw that XGBoost scores every candidate split with a gain formula and only makes the split if the gain is positive, that is, if splitting reduces the loss more than leaving the node whole. gamma raises that bar from zero to a value you choose: a split now has to improve the objective by at least gamma to survive, otherwise it is pruned away. Higher gamma means fewer splits, shallower effective trees, and a simpler model. Same overfit starting point (max_depth=8), now sweeping gamma.

import warnings
warnings.filterwarnings("ignore")
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import numpy as np

data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

def rmse(model, X, y):
    return float(np.sqrt(mean_squared_error(y, model.predict(X))))

for g in [0, 0.5, 1, 5, 20]:
    m = xgb.XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=8,
                         gamma=g, random_state=42)
    m.fit(X_train, y_train)
    tr = rmse(m, X_train, y_train)
    te = rmse(m, X_test, y_test)
    # Compute the gap from the unrounded values; round only for display.
    gap = round(te - tr, 4)
    print(f"gamma={g:<5}  train={round(tr, 4):<8} test={round(te, 4):<8} gap={gap}")

gamma=0      train=0.1778   test=0.4541   gap=0.2763
gamma=0.5    train=0.3238   test=0.4768   gap=0.153
gamma=1      train=0.3549   test=0.4823   gap=0.1274
gamma=5      train=0.4391   test=0.5067   gap=0.0676
gamma=20     train=0.52     test=0.5506   gap=0.0305

Once again the gap column collapses as the knob rises: 0.2763, 0.153, 0.1274, 0.0676, 0.0305. By demanding that every split earn at least gamma in gain, higher values prune away the marginal splits, and the training RMSE climbs from 0.1778 toward 0.52 as the trees are forced to stay coarse. Notice the contrast with min_child_weight, though: here the test RMSE gets steadily worse, from 0.4541 up to 0.5506, rather than holding steady. On this particular dataset gamma is a heavier hammer: it simplified the model past the point that helped, so at these settings the plain max_depth=8 model already generalized better than any pruned version. That is a useful reminder: a regularizer can absolutely over-correct. The value of gamma is that it gives you a gain-based brake, one grounded directly in Lesson 2’s split score, and on noisier datasets where many splits are spurious it can be exactly the tool that closes an overfitting gap without hurting test error. As with every knob here, you confirm it by watching the two RMSE columns, not by assuming.
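
The gain test that gamma applies can be sketched with the textbook second-order split score, where G and H are the gradient and hessian sums on each side of a split. This assumes the usual form with an L2 term reg_lambda in the denominator; Lesson 2’s notation may differ, and the library’s internal scaling of the gain is not reproduced here. The function names and the example sums are illustrative.

```python
def leaf_score(grad_sum, hess_sum, reg_lambda):
    # Quality of a single leaf: G^2 / (H + lambda).
    return grad_sum ** 2 / (hess_sum + reg_lambda)

def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0):
    """Loss reduction from splitting a node into a left and right child."""
    parent = leaf_score(g_left + g_right, h_left + h_right, reg_lambda)
    children = (leaf_score(g_left, h_left, reg_lambda)
                + leaf_score(g_right, h_right, reg_lambda))
    return 0.5 * (children - parent)

def keep_split(gain, gamma):
    # The split survives only if it earns more than gamma.
    return gain > gamma

strong = split_gain(-3.0, 3.0, 3.0, 3.0)  # children pull hard in opposite directions
weak = split_gain(-0.5, 3.0, 0.5, 3.0)    # children barely differ

for gamma in [0, 1, 5]:
    print(f"gamma={gamma}: strong kept={keep_split(strong, gamma)}, weak kept={keep_split(weak, gamma)}")
```

Here the strong split has gain 2.25 and the weak one 0.0625, so gamma=1 prunes only the weak split while gamma=5 prunes both, the same over-correction the sweep above shows at large gamma.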

Practice Exercises

Try each before opening the hint. Reuse the shared setup (the fetch_california_housing load, the random_state=42 split, and the rmse helper) in every one.

Exercise 1: Find Where More Trees Stop Helping

Using learning_rate=0.1 and max_depth=4, fit XGBRegressor models with n_estimators in [50, 100, 200, 400, 800]. Print train and test RMSE for each and identify the tree count beyond which test RMSE stops meaningfully improving.

import warnings
warnings.filterwarnings("ignore")
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import numpy as np

data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

def rmse(model, X, y):
    return float(np.sqrt(mean_squared_error(y, model.predict(X))))

# Your code here

Hint

Loop over the tree counts, build xgb.XGBRegressor(n_estimators=n, learning_rate=0.1, max_depth=4, random_state=42), fit, and print round(rmse(m, X_train, y_train), 4) and round(rmse(m, X_test, y_test), 4). You will see training RMSE keep falling while test RMSE flattens out; the point where the test column stops dropping by more than a hair is your practical n_estimators. Adding trees past there only grows the train-test gap.

Exercise 2: Rescue an Underfit Model by Lowering the Learning Rate

Build a baseline with learning_rate=0.3, n_estimators=50, max_depth=4 and record its test RMSE. Then build a second model with learning_rate=0.05 but n_estimators=300 (same depth) and compare. Which generalizes better, and what did trading a smaller step for more trees buy you?

# Your code here (reuse the setup and rmse helper from above)

Hint

Fit both models and print their test RMSE side by side. The smaller learning_rate with far more trees should reach a lower or comparable test RMSE, because patient, shrunken steps generalize better as long as you supply enough rounds to converge. This is the “lower the rate, raise the tree count together” rule in action.

Exercise 3: Close an Overfitting Gap Two Different Ways

Start from an overfit model: max_depth=10, n_estimators=200, learning_rate=0.1. Print its train-test gap. Then try to shrink the gap two ways, once by setting min_child_weight=50 and once by setting gamma=1, printing the gap for each. Which lever preserved test RMSE better on this dataset?

# Your code here (reuse the setup and rmse helper from above)

Hint

Compute gap = round(test_rmse - train_rmse, 4) for the baseline (about 0.3846) and for each regularized model. Both min_child_weight and gamma will raise training RMSE and shrink the gap, but on California Housing min_child_weight=50 actually improves test RMSE (down to about 0.4482 from the baseline 0.4664) while gamma=1 shrinks the gap yet pushes test RMSE the wrong way (up to about 0.4843). The takeaway: closing the gap is only good if test RMSE does not get worse in the process, so always judge a regularizer by the test column, not the gap alone.

Summary

You turned five of XGBoost’s most important knobs one at a time and measured, on real California Housing data, exactly what each one does to train and test RMSE. Let’s consolidate.

Key Concepts

The train-test gap is your overfitting gauge

Always print RMSE on both the training set and a held-out test set
A large gap (tiny train error, big test error) means overfitting; a small gap with high error means underfitting

n_estimators sets the number of trees

More trees always lower training error but test error bottoms out and then flattens
Test RMSE fell to a floor near 0.46 by 300 trees, and 1,000 trees added nothing but overfitting

learning_rate (eta) sets the step size

Too small underfits with a fixed budget (0.01 gave test RMSE 0.7209); too large overfits (1.0 gave 0.5801)
Lower the rate and raise n_estimators together: learning_rate=0.1 with 300 trees reached the best 0.4498

max_depth sets tree complexity

The train-test gap widened from 0.0252 at depth 2 to 0.3846 at depth 10
Test RMSE improved only through depth 8 (0.4541) and then worsened; keep trees shallow (often 3 to 8)

min_child_weight requires enough evidence per leaf

It is the minimum sum of hessians in a child; for squared error that is essentially a minimum row count
Raising it made splits conservative, shrinking the gap from 0.2763 to 0.0871 while holding test RMSE steady (best 0.4492 at 50)

gamma requires enough gain per split

It is the minimum gain (from Lesson 2’s split score) a split must earn to survive; higher values prune more splits
It shrank the gap from 0.2763 to 0.0305 but here degraded test RMSE, a reminder that a regularizer can over-correct

Why This Matters

These five knobs are the vocabulary of every XGBoost tuning session you will ever run. Once you can read a train-test gap and know which lever changes it and in which direction, tuning stops being trial-and-error and becomes diagnosis: a large gap says “add a complexity brake” (lower max_depth, raise min_child_weight or gamma), while a small gap with high error says “let the model learn more” (deeper trees, more estimators, a rate-and-trees rebalance). You also saw that not every brake helps equally: min_child_weight protected test RMSE on this data while gamma cost some, which is exactly why you measure rather than assume.

You have also now connected the abstractions of Lesson 2 to concrete controls: the hessian you studied is the currency min_child_weight counts, and the gain formula you derived is the score gamma thresholds. Everything in XGBoost’s tuning surface traces back to the objective. In the next lesson you will add the remaining, and equally powerful, family of controls: regularization penalties and randomized sampling.

Next Steps

You can now set the five core knobs that decide how many trees XGBoost builds, how big each step is, and how complex each tree is allowed to get. Next you will meet the second half of the tuning toolkit: L1/L2 regularization penalties and the row/column sampling that gives XGBoost its stochastic edge.

Lesson 4: Regularization and Sampling

Add reg_lambda, reg_alpha, subsample, and colsample_bytree to fight overfitting with penalties and randomness.

Continue Building Your Skills

The habit worth taking from this lesson is not a lookup table of “good” hyperparameter values; those depend entirely on your data. It is the diagnostic loop: fit a model, print train and test RMSE, read the gap, and turn the one knob that moves it the direction you need. Rerun these sweeps yourself and change the fixed settings: try the max_depth sweep at a lower learning_rate, or push min_child_weight even higher, and watch how the gap responds. That reflex, reasoning from the train-test gap to a specific knob, is what separates confident tuning from guessing, and it will carry straight into the regularization and sampling controls you tackle next.
