Lesson 1 - Early Stopping and Evaluation Sets
Welcome to Early Stopping and Evaluation Sets
Welcome to Module 3. In Module 2 you learned what XGBoost is and turned its main knobs by hand: n_estimators, learning_rate, max_depth, and the regularization terms. You saw how sensitive the model is to how many trees you grow. Too few and it underfits; too many and it slowly memorizes the training set. But you were still doing the one thing this module exists to stop: guessing a good value for n_estimators and hoping.
Our running team, Northwind Analytics, felt this directly. Their California Housing model kept improving as they added trees, so they kept raising n_estimators, retraining from scratch each time to check whether the extra trees still helped. That is slow, wasteful, and easy to get wrong. This module is about training models that are robust without that manual fiddling, and the first tool is the most useful one you will meet all module: early stopping. Instead of picking a tree count, you set a large cap, show XGBoost a validation set, and let it stop on its own the moment more trees stop helping. Every number in this lesson came from running the code for real on the actual California Housing data.
By the end of this lesson, you will be able to:
- Explain why a fixed n_estimators is a guess, and how early stopping replaces the guess with a data-driven decision
- Create a proper three-way split (train / validation / test) and explain the distinct job of each set
- Configure early stopping in the scikit-learn API by passing early_stopping_rounds and eval_metric to the constructor and an eval_set to fit
- Read back best_iteration and best_score, and understand that prediction uses the best model, not the last one
- Do the same with the native API (xgb.train with evals and early_stopping_rounds)
- Avoid the classic mistake of early-stopping on the test set, and reason about the early_stopping_rounds patience value
You should be comfortable with fitting an XGBRegressor, the train/test split, and the hyperparameters from Module 2. Let’s begin.
The Problem: Guessing n_estimators Is Wasteful
Every boosting model faces the same tension. Each new tree corrects the ensemble’s remaining errors, so on the training data the error only ever goes down as you add trees. But the training error is not what you care about. You care about error on data the model has never seen, and that curve behaves differently: it drops fast at first, then flattens, and eventually can creep back up as the extra trees start fitting noise that does not generalize. This is overfitting.
So the “right” number of trees is a real quantity, not a matter of taste, and it depends on the data, the learning rate, and the tree depth. Set n_estimators too low and you stop before the model has learned the signal (underfitting). Set it too high and you waste training time and start memorizing (overfitting). The old workaround, retraining at 100 trees, then 300, then 1000, and eyeballing which was best, is exactly the manual guessing we want to eliminate. It is slow, and because each run is separate, it is easy to miss the real sweet spot.
Early stopping turns this into a single automatic decision. The idea is simple: set n_estimators deliberately high (a ceiling you are sure is more than enough), and while XGBoost adds trees one by one, have it watch the error on a held-out validation set. As long as that validation error keeps improving, keep going. Once it fails to improve for a set number of rounds in a row, early_stopping_rounds, stop and keep the best model seen so far. You get the right tree count without ever picking it yourself.
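The stopping rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not XGBoost's implementation: the early_stop function and the curve values are hypothetical, and it assumes lower is better and that any strict improvement resets the patience counter.

```python
def early_stop(val_errors, patience):
    """Return (best_round, best_error, rounds_run) for per-round validation errors.

    Rounds are zero-based. Training halts once `patience` consecutive rounds
    fail to beat the best error seen so far.
    """
    best_round, best_error = 0, float("inf")
    for round_idx, err in enumerate(val_errors):
        if err < best_error:
            # New best: remember it, which also resets the patience window.
            best_round, best_error = round_idx, err
        elif round_idx - best_round >= patience:
            # `patience` rounds in a row without improvement: stop here.
            return best_round, best_error, round_idx + 1
    # The ceiling was reached before patience ran out.
    return best_round, best_error, len(val_errors)


# Hypothetical validation curve: improves, bottoms out at round 4, then drifts up.
curve = [0.90, 0.70, 0.60, 0.55, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57]
best_round, best_error, rounds_run = early_stop(curve, patience=3)
print(best_round, best_error, rounds_run)
```

With a patience of 3, the best round is 4 and training runs three further rounds before giving up, so 8 of the 10 available rounds are used. The lesson's patience of 50 works the same way, only with a longer window.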
Training error is not the referee
You can never use training error to decide when to stop, because adding trees almost always lowers training error, right up to the point of memorizing the data. Early stopping needs a separate set the model does not train on, so that a lack of improvement genuinely means “more trees are no longer helping on unseen data.” That separate set is your validation set, and it is a different set from the test set you report your final number on.
How Early Stopping Works
There are four pieces to configure, and once you see them together the mechanics are easy:
- A large n_estimators. This is now just a ceiling. Set it high enough that you are confident the true best tree count is below it (we use 2000). Early stopping will almost always stop before reaching it.
- An eval_set. This is the held-out validation data XGBoost checks after every tree. In the scikit-learn API you pass it to fit as a list of (X, y) tuples.
- An eval_metric. The score to watch, "rmse" for our regression. XGBoost computes this metric on the eval_set after each round.
- early_stopping_rounds. The patience. If the validation metric does not improve for this many rounds in a row, training halts. We use 50: “if 50 straight trees fail to beat the best validation score, quit.”
When training stops, XGBoost remembers the best iteration, the boosting round at which the validation metric was best, not the last round it happened to reach. You read it back with model.best_iteration and the score there with model.best_score. The index is zero-based, so a best_iteration of k means the best model contains k + 1 trees. Crucially, when you later call predict on the scikit-learn model, XGBoost uses the trees up to and including that best iteration automatically, so you get the best model, not the slightly-overfit final one.
xgboost 3.x: early stopping is set on the constructor
In older XGBoost you passed early_stopping_rounds to fit. That argument was deprecated and then removed during the 2.x releases, so in xgboost 3.x (this course uses 3.3.0) you configure early_stopping_rounds and eval_metric on the constructor (XGBRegressor(...)) and pass only the eval_set to fit. Passing early_stopping_rounds to fit will not work in this version. The native xgb.train, by contrast, still takes early_stopping_rounds as a call argument, as you will see below.
A Real Run on California Housing
Before we can early-stop, we need a three-way split. So far the course has used train and test. Early stopping forces a third set, because the validation data steers training (it decides when to stop), which means it is no longer neutral, and you must not report your final accuracy on it. The clean setup is:
- Train (60%): the trees are fit on this.
- Validation (20%): early stopping watches this to decide the tree count.
- Test (20%): untouched until the very end, used only for the one honest final number.
We build it with two calls to train_test_split, both with random_state=42: first peel off 40% as a temporary block, then halve that block into validation and test.
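Before running the split, the arithmetic behind those two calls can be checked by hand. The sketch below assumes the dataset's 20,640 rows (the sum of the three sizes this lesson prints) and uses integer division, which is safe here only because both cuts divide evenly. The variable names are illustrative.

```python
n_rows = 20640  # California Housing row count (12384 + 4128 + 4128)

# First call: test_size=0.4 sends 40% of all rows to the temporary block.
n_temp = n_rows * 4 // 10
n_train = n_rows - n_temp

# Second call: test_size=0.5 halves the temporary block.
n_test = n_temp // 2
n_val = n_temp - n_test

print(n_train, n_val, n_test)
print(n_train / n_rows, n_val / n_rows, n_test / n_rows)
```

Taking 50% of a 40% block gives 20% of the original data, which is how two chained calls produce the 60 / 20 / 20 split.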
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
data = fetch_california_housing()
# First split off 40% for a temp block, then halve it into val and test
X_train, X_temp, y_train, y_temp = train_test_split(
    data.data, data.target, test_size=0.4, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)
print("train size:", X_train.shape[0])
print("val size :", X_val.shape[0])
print("test size :", X_test.shape[0])

train size: 12384
val size : 4128
test size : 4128

Now the model. We set n_estimators=2000 as a ceiling, a modest learning_rate=0.05 (a smaller step means more trees are useful, which makes early stopping’s job more interesting), max_depth=4, and turn on early stopping with early_stopping_rounds=50 and eval_metric="rmse" on the constructor. Then we call fit, handing it the validation set as the eval_set.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
data = fetch_california_housing()
X_train, X_temp, y_train, y_temp = train_test_split(
    data.data, data.target, test_size=0.4, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)
model = xgb.XGBRegressor(
    n_estimators=2000,         # a high ceiling, not a target
    learning_rate=0.05,
    max_depth=4,
    early_stopping_rounds=50,  # patience: stop after 50 rounds with no gain
    eval_metric="rmse",        # the validation metric to watch
    random_state=42,
)
# The validation set drives early stopping; verbose=False keeps the log quiet
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print("best_iteration :", model.best_iteration)
print("best_score (val RMSE):", float(round(model.best_score, 4)))
# predict() uses the model up to best_iteration automatically
pred = model.predict(X_test)
test_rmse = np.sqrt(mean_squared_error(y_test, pred))
print("final TEST RMSE :", float(round(test_rmse, 4)))

best_iteration : 1614
best_score (val RMSE): 0.4632
final TEST RMSE : 0.4601

Read those numbers carefully, because they are the whole point of the lesson. We allowed up to 2000 trees, but the best validation score came at iteration 1614: once 50 further rounds had gone by without beating the validation RMSE of 0.4632 reached there, XGBoost quit and kept the model from the point where validation error bottomed out. We never chose 1614; the validation set did. And on the untouched test set, the model reaches an honest RMSE of 0.4601, our one final number.
The figure below shows what happened during training. The gray training curve keeps sliding down as trees pile up, exactly as warned, while the blue validation curve drops steeply, then flattens out. The green marker is where early stopping locked in the best model. Every trained tree past that point was still lowering training error while doing essentially nothing for validation error, which is the visual signature of the overfitting early stopping saves you from.
The Same Thing in the Native API
The native xgb.train interface supports early stopping too, and it is worth seeing because so much competition code is written this way. Here early_stopping_rounds is an argument to xgb.train (not the constructor), the validation data goes in the evals list as (DMatrix, name) tuples, and eval_metric lives in the params dict. After training, bst.best_iteration holds the answer, and passing iteration_range=(0, bst.best_iteration + 1) to bst.predict pins prediction to exactly those trees.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
data = fetch_california_housing()
X_train, X_temp, y_train, y_temp = train_test_split(
    data.data, data.target, test_size=0.4, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)
dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_val, label=y_val)
dtest = xgb.DMatrix(X_test, label=y_test)
params = {
    "objective": "reg:squarederror",
    "learning_rate": 0.05,
    "max_depth": 4,
    "eval_metric": "rmse",
    "seed": 42,
}
bst = xgb.train(
    params,
    dtrain,
    num_boost_round=2000,       # the ceiling
    evals=[(dvalid, "valid")],  # what early stopping watches
    early_stopping_rounds=50,
    verbose_eval=False,
)
print("native best_iteration:", bst.best_iteration)
print("native best_score :", float(round(bst.best_score, 4)))
pred = bst.predict(dtest)  # no iteration_range passed; see the note after the output
test_rmse = np.sqrt(mean_squared_error(y_test, pred))
print("native TEST RMSE :", float(round(test_rmse, 4)))

native best_iteration: 1614
native best_score : 0.4632
native TEST RMSE : 0.4598

Same story from the other door: best iteration 1614, best validation RMSE 0.4632, and a test RMSE of 0.4598, within 0.0003 of the scikit-learn run. The two APIs configure early stopping in slightly different places (constructor vs. call argument), but they run the same stopping logic and land on the identical tree count. One caution about prediction: XGBoost's documentation describes the native Booster.predict as using the full model unless you pass iteration_range=(0, bst.best_iteration + 1), which would account for a small gap between the two test RMSEs.
Caveats: Where Early Stopping Bites Back
Early stopping is one of the highest-value tricks in the whole toolkit, but two things trip people up.
Never early-stop on the test set. It is tempting to skip the third split and just pass your test set as the eval_set. Do not. The moment your test set influences when training stops, it has shaped the model, and your final test RMSE is no longer an honest estimate of performance on new data: it is optimistic, because you tuned the tree count to that exact set. The validation set exists precisely to absorb that influence and keep the test set pristine. Report your final number on data that touched nothing in the training decision.
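A small simulation makes the optimism concrete. It is a toy model under loud assumptions: the "true" error curve is invented, and each measured validation or test error is that curve plus independent Gaussian noise, which is cruder than how real curves behave. It compares the test error you would report when the stopping round is chosen on validation with the one you would report when the test set picks its own best round.

```python
import random
import statistics

random.seed(0)


def noisy_curve(true_curve, noise_sd):
    """One measured error curve: the true curve plus independent noise per round."""
    return [value + random.gauss(0.0, noise_sd) for value in true_curve]


def argmin(values):
    """Index of the smallest value (first occurrence)."""
    return min(range(len(values)), key=values.__getitem__)


# Hypothetical true generalization error: falls quickly, then flattens near 0.46.
true_curve = [0.46 + 0.5 * (0.97 ** i) for i in range(300)]

reported_honest, reported_optimistic = [], []
for _ in range(500):
    val = noisy_curve(true_curve, noise_sd=0.01)
    test = noisy_curve(true_curve, noise_sd=0.01)
    reported_honest.append(test[argmin(val)])       # stop chosen on validation
    reported_optimistic.append(test[argmin(test)])  # stop chosen on the test set

mean_honest = statistics.mean(reported_honest)
mean_optimistic = statistics.mean(reported_optimistic)
print("mean test error, stop chosen on validation:", round(mean_honest, 4))
print("mean test error, stop chosen on test :", round(mean_optimistic, 4))
```

In this toy setup the true error never drops below 0.46, yet the test-selected number averages below it, because picking the minimum of a noisy curve also picks favorable noise. The validation-selected number carries no such bias.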
The patience (early_stopping_rounds) is itself a choice. Set it too small and you may stop on a temporary plateau, quitting during a flat stretch that would have resumed improving a few dozen trees later. Set it too large and you waste time training trees you will ultimately discard. A patience of 50 at a learning rate of 0.05 is a sensible starting point: small learning rates improve in tiny steps, so they need more patience before you conclude the improvement has truly stalled. There is no universal best value, but the principle is steady: the smaller the learning rate, the more patience you generally want.
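The plateau risk is easy to see on an invented curve. The sketch below uses a hypothetical validation curve that improves, sits on a flat stretch for 20 rounds, then improves again before drifting up. The helper function is illustrative and is not XGBoost's own code.

```python
def best_round_with_patience(val_errors, patience):
    """Return (best_round, best_error) under a simple patience rule (zero-based rounds)."""
    best_round, best_error = 0, float("inf")
    for round_idx, err in enumerate(val_errors):
        if err < best_error:
            best_round, best_error = round_idx, err
        elif round_idx - best_round >= patience:
            break  # patience exhausted
    return best_round, best_error


# Hypothetical validation curve with a temporary plateau.
curve = (
    [1.00 - 0.01 * i for i in range(31)]          # rounds 0-30: falls to about 0.70
    + [0.71] * 20                                 # rounds 31-50: flat stretch, no gain
    + [0.70 - 0.005 * i for i in range(1, 31)]    # rounds 51-80: falls to about 0.55
    + [0.55 + 0.002 * i for i in range(1, 41)]    # rounds 81-120: creeps back up
)

for patience in (10, 50):
    best_round, best_error = best_round_with_patience(curve, patience)
    print(patience, best_round, round(best_error, 3))
```

A patience of 10 gives up on the plateau and settles for round 30, while a patience of 50 rides it out and finds the lower error at round 80.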
One more practical note: early stopping pairs beautifully with a small learning rate. A tiny learning_rate needs many trees to converge, which is exactly the situation where guessing n_estimators is hardest, and exactly where letting the validation set decide pays off most. That is why the standard recipe is “small learning rate, large n_estimators ceiling, early stopping on,” rather than hand-tuning the tree count.
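A back-of-the-envelope model shows why a smaller step needs more trees. It is a deliberately idealized assumption, not how real trees behave: suppose every round removes exactly a learning_rate fraction of the remaining error, so the residual after n rounds is (1 - learning_rate)**n. The function name and the 1% tolerance are illustrative.

```python
import math


def rounds_to_converge(learning_rate, tolerance=0.01):
    """Rounds until the idealized residual (1 - learning_rate)**n falls below tolerance."""
    return math.ceil(math.log(tolerance) / math.log(1.0 - learning_rate))


for lr in (0.2, 0.05, 0.02):
    print(lr, rounds_to_converge(lr))
```

Real boosting rounds remove far less than this ideal, which is why the lesson's run needed well over a thousand trees at 0.05. The direction of the effect is the same, though: cutting the learning rate multiplies the useful tree count, and a fixed n_estimators guess becomes harder to get right.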
Practice Exercises
Try each before opening its hint. They reinforce the three-way split, the API, and the patience knob.
Exercise 1: Build the Split and Fit With Early Stopping
Create the train/validation/test split exactly as in the lesson (two train_test_split calls, random_state=42), fit an XGBRegressor(n_estimators=2000, learning_rate=0.05, max_depth=4, early_stopping_rounds=50, eval_metric="rmse", random_state=42) with the validation set as the eval_set, and print best_iteration, best_score, and the final test RMSE.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
data = fetch_california_housing()
# Your code here

Hint
Split in two steps: X_train, X_temp, y_train, y_temp = train_test_split(data.data, data.target, test_size=0.4, random_state=42), then X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42). Put early_stopping_rounds=50 and eval_metric="rmse" on the constructor, and call model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False). You should see best_iteration 1614, best_score 0.4632, and test RMSE 0.4601.
Exercise 2: Prove the Test Set Was Never the Referee
Using the model from Exercise 1, compute the validation RMSE and the test RMSE separately with mean_squared_error. Confirm the validation RMSE matches best_score (0.4632), which shows early stopping was judged on the validation set, and note that the test RMSE (0.4601) is a genuinely separate number.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
data = fetch_california_housing()
# Your code here (reuse the Exercise 1 split and model)

Hint
After fitting, do val_rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val))) and test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test))). The validation RMSE prints as 0.4632, the same value as model.best_score, confirming that is the set early stopping optimized. The test RMSE prints as 0.4601: close, but a distinct, untouched measurement. If you had instead early-stopped on the test set, that 0.4601 would be an optimistic number you could not trust.
Exercise 3: Feel the Patience Knob
Keeping everything else fixed, fit three models with early_stopping_rounds set to 10, 50, and 200. Print each model’s best_iteration and test RMSE, and notice how a very small patience can stop earlier while a larger patience explores further.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
data = fetch_california_housing()
X_train, X_temp, y_train, y_temp = train_test_split(
    data.data, data.target, test_size=0.4, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)
# Your code here

Hint
Loop over [10, 50, 200], building a fresh XGBRegressor(n_estimators=2000, learning_rate=0.05, max_depth=4, early_stopping_rounds=patience, eval_metric="rmse", random_state=42) each time and fitting with the same eval_set. Print model.best_iteration and the test RMSE for each. A tiny patience like 10 risks quitting on a flat stretch and stopping earlier; a larger patience like 200 lets training run further before concluding the validation metric has stalled. The final test RMSE stays close across all three because prediction always uses the best iteration, but the stopping point moves.
Summary
You replaced a guess with a decision. Instead of choosing n_estimators by hand and retraining to check, you set a ceiling and let a held-out validation set tell XGBoost exactly when to stop. Let’s review.
Key Concepts
Why early stopping exists
- Training error falls almost forever as you add trees, so it cannot tell you when to stop
- Validation error drops, flattens, and can rise again; its low point is the right tree count
- Early stopping finds that point automatically instead of you guessing n_estimators
The four pieces
- A large n_estimators (a ceiling, 2000 here), an eval_set (the validation data), an eval_metric ("rmse"), and early_stopping_rounds (the patience, 50)
- XGBoost keeps the best iteration, and predict uses that best model, not the last one
The real run (California Housing, three-way split of 12384 / 4128 / 4128)
- With a 2000-tree cap, early stopping settled on best iteration 1614 on its own
- Best validation RMSE 0.4632; final, honest test RMSE 0.4601 (scikit-learn API)
- The native xgb.train path landed on the same best iteration, 1614, with test RMSE 0.4598
Two rules that keep it honest
- Early-stop on a validation set, never the test set, or your final number is optimistic
- The early_stopping_rounds patience is a real choice; smaller learning rates generally want more patience
Why This Matters
Early stopping is the single most useful habit in applied gradient boosting, and now you own it end to end. It removes the most error-prone hyperparameter, the tree count, from your manual to-do list and hands it to the data, which is exactly where that decision belongs. Just as important, you learned the discipline that makes it trustworthy: the three-way split. Keeping a validation set to steer training and a test set that never influences a single decision is what separates a real performance estimate from a flattering one, and that discipline carries into everything you do next, not just boosting.
From here, the module builds on this foundation. You will fold the validation idea into full cross-validation so your tree-count decision does not hinge on one lucky split, then use these robust models to handle missing data and read what the model actually learned.
Next Steps
You can now let XGBoost pick its own tree count. Next you will make that decision even more reliable by cross-validating it, using XGBoost’s built-in xgb.cv so the best iteration is chosen across several folds instead of a single validation split.
Lesson 2: Cross-Validation with xgb.cv
Choose the tree count across multiple folds with XGBoost's built-in cross-validation instead of relying on one validation split.
Continue Building Your Skills
Before moving on, rerun the lesson’s model yourself and change one thing at a time. Drop the learning_rate to 0.02 and watch best_iteration climb (a smaller step needs more trees). Raise it to 0.2 and watch the model stop much earlier. Try max_depth=6 and see how the best tree count shifts. Each experiment builds the same instinct: early stopping is not one setting but a partnership between the learning rate, the tree ceiling, and the validation set, and getting a feel for that partnership now will make cross-validation in Lesson 2 feel like a natural extension rather than a new idea.