Lesson 2 - Cross-Validation with xgb.cv
Welcome to Cross-Validation with xgb.cv
In Lesson 1 you learned to hand XGBoost a validation set and let early stopping halt training the moment that set stopped improving. That solved one problem: you no longer have to guess how many trees to grow. But it quietly introduced another. Your entire verdict about the model, both the tree count and the RMSE you reported, now rests on a single split of the data. Change which rows happen to fall into that validation set and the numbers move, sometimes by more than the improvement you were trying to measure.
Northwind Analytics ran into exactly this. Two analysts each held out 20 percent of the California Housing districts to check the same XGBoost settings, got test RMSEs a couple of hundredths apart, and spent an afternoon arguing about which configuration was “better” when the difference was pure luck of the draw. The fix is cross-validation: instead of trusting one split, you rotate through several, train on each, and average the results. XGBoost ships its own tool for this, xgb.cv, and it does something scikit-learn’s cross-validation can’t do out of the box: it gives you the full per-round history across folds, so cross-validation and early stopping cooperate to pick the tree count and the honest error in one call.
Every number in this lesson was produced by running the code on the real California Housing dataset with random_state=42.
By the end of this lesson, you will be able to:
- Explain why a single train/validation split gives a noisy estimate, and how k-fold cross-validation averages over folds to steady it
- Run 5-fold cross-validation with xgb.cv and read its returned DataFrame of train-rmse-mean, train-rmse-std, test-rmse-mean, and test-rmse-std
- Combine xgb.cv with early_stopping_rounds so the number of boosting rounds is chosen automatically
- Retrain a final model on all the training data at that round and evaluate it once on a held-out test set
- Contrast xgb.cv with scikit-learn’s cross_val_score, and report a result as mean and standard deviation rather than a single lucky number
You should be comfortable with early stopping from Lesson 1 and remember k-fold cross-validation from the Machine Learning course. Let’s begin.
Why One Split Isn’t Enough
A train/test split hands you exactly one number, and it is tempting to treat that number as the accuracy of your model. It is not. It is the accuracy on one particular slice of held-out rows. Some slices happen to contain more of the easy, typical districts; others catch more of the unusual, hard-to-predict ones. The model didn’t change, but the scoreboard did.
Let’s make the noise visible. Here is the same XGBoost model, with identical hyperparameters, evaluated five times. The only thing that changes between runs is random_state in train_test_split, which reshuffles which rows land in the test set.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
data = fetch_california_housing()
for rs in [1, 2, 3, 4, 5]:
    X_tr, X_te, y_tr, y_te = train_test_split(
        data.data, data.target, test_size=0.2, random_state=rs
    )
    model = xgb.XGBRegressor(
        n_estimators=300, learning_rate=0.1, max_depth=4, random_state=42
    )
    model.fit(X_tr, y_tr)
    rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    print(f"split random_state={rs}: test RMSE = {float(round(rmse, 4))}")

split random_state=1: test RMSE = 0.4704
split random_state=2: test RMSE = 0.4715
split random_state=3: test RMSE = 0.455
split random_state=4: test RMSE = 0.4615
split random_state=5: test RMSE = 0.4673

The RMSE ranges from 0.4550 to 0.4715, a spread of about 0.0165, for a model that never changed. In MedHouseVal units that swing is worth roughly 1,650 dollars of apparent error per district, entirely manufactured by the split. If a tuning experiment claims to shave 0.01 off your RMSE, but a different split can move it 0.0165, you cannot tell the real improvement from the noise. That is the whole problem, and cross-validation is the standard answer to it.
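The spread arithmetic can be checked directly from the five RMSEs printed above. This is a minimal sketch; the only assumption is that MedHouseVal is expressed in units of 100,000 dollars, which is what turns 0.0165 into roughly 1,650 dollars.

```python
# The five test RMSEs printed above, one per split random_state.
rmses = [0.4704, 0.4715, 0.4550, 0.4615, 0.4673]

# Spread = worst split minus best split, for a model that never changed.
spread = max(rmses) - min(rmses)

# Assumption: MedHouseVal is expressed in units of 100,000 dollars.
DOLLARS_PER_UNIT = 100_000
spread_dollars = spread * DOLLARS_PER_UNIT

print(f"spread in RMSE   : {spread:.4f}")
print(f"spread in dollars: {spread_dollars:,.0f}")

# A claimed tuning gain smaller than the spread cannot be told apart from noise.
claimed_gain = 0.01
gain_exceeds_noise = claimed_gain > spread
print("claimed gain exceeds split noise:", gain_exceeds_noise)
```

Any tuning gain you want to claim has to clear that spread before a single split can support it.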
k-Fold Cross-Validation, Briefly Recalled
You met k-fold cross-validation in the Machine Learning course; here is the one-paragraph refresher. You split the training data into $k$ equal parts, called folds. Then you train $k$ times. Each time, one fold is held out as the validation set and the other $k-1$ folds are used for training. Every row gets to be in the validation set exactly once, and every row gets used for training $k-1$ times. You end up with $k$ validation scores $s_1, \dots, s_k$, and you summarize them by their mean and standard deviation:

$$\bar{s} = \frac{1}{k}\sum_{i=1}^{k} s_i, \qquad \sigma = \sqrt{\frac{1}{k}\sum_{i=1}^{k}\left(s_i - \bar{s}\right)^2}$$

The mean is your best single estimate of how the model will do on unseen data, and it is far steadier than any one split because it averages five verdicts instead of trusting one. The standard deviation is just as important: it tells you how much that estimate wobbles from fold to fold. A low $\sigma$ means the model performs consistently no matter how the data is carved up; a high $\sigma$ is a warning that your headline number is fragile. With $k = 5$, the common default, each fold holds 20 percent of the data, mirroring a standard train/test split, but now you use all five.
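To make the bookkeeping concrete, here is a small standard-library sketch of the fold rotation. The helper k_fold_indices and the five fold scores are hypothetical, chosen only for illustration; real tools such as xgb.cv also shuffle the rows first, which this sketch skips. The standard deviation is the population form (dividing by the number of folds), the same convention NumPy's default .std() uses.

```python
import statistics


def k_fold_indices(n_rows, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds (no shuffling)."""
    # Spread any remainder over the first few folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = [i for i in range(n_rows) if i < start or i >= stop]
        yield train_idx, val_idx
        start = stop


n_rows, k = 10, 5
val_counts = [0] * n_rows
train_counts = [0] * n_rows
for train_idx, val_idx in k_fold_indices(n_rows, k):
    for i in val_idx:
        val_counts[i] += 1
    for i in train_idx:
        train_counts[i] += 1

print("times each row was validated :", val_counts)
print("times each row was trained on:", train_counts)

# Hypothetical validation RMSEs, one per fold.
fold_scores = [0.46, 0.47, 0.45, 0.48, 0.46]
mean_score = statistics.fmean(fold_scores)
std_score = statistics.pstdev(fold_scores)  # population std: divides by k
print(f"mean = {mean_score:.4f}, std = {std_score:.4f}")
```

Each row is validated once and trained on four times, which is exactly the rotation described above.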
Cross-Validating XGBoost with xgb.cv
XGBoost’s native API includes xgb.cv, which runs the whole k-fold procedure for you and, crucially, reports the score at every boosting round. You pack the training data into a DMatrix, describe the model with a params dictionary, and call xgb.cv with nfold=5. Because early stopping needs somewhere to stop, we also pass num_boost_round=2000 as a generous ceiling and early_stopping_rounds=50, exactly the mechanism from Lesson 1, now running inside every fold at once.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
# xgb.cv cross-validates on the TRAINING data only; the test set stays untouched
dtrain = xgb.DMatrix(X_train, label=y_train)
params = {
    "objective": "reg:squarederror",
    "max_depth": 4,
    "eta": 0.1,  # eta is the native name for learning_rate
    "seed": 42,
}
cv_results = xgb.cv(
    params,
    dtrain,
    num_boost_round=2000,  # a generous ceiling; early stopping trims it
    nfold=5,
    metrics="rmse",
    early_stopping_rounds=50,
    seed=42,
)
print("type :", type(cv_results).__name__)
print("columns :", list(cv_results.columns))
print("shape :", cv_results.shape)
print()
print("first 3 rounds:")
print(cv_results.head(3).round(4).to_string())

type : DataFrame
columns : ['train-rmse-mean', 'train-rmse-std', 'test-rmse-mean', 'test-rmse-std']
shape : (1055, 4)

first 3 rounds:
   train-rmse-mean  train-rmse-std  test-rmse-mean  test-rmse-std
0           1.0897          0.0031          1.0906         0.0123
1           1.0313          0.0033          1.0335         0.0126
2           0.9807          0.0035          0.9839         0.0127

xgb.cv returns a pandas DataFrame with one row per boosting round. Each row holds four numbers: the mean and standard deviation of the training RMSE across the five folds, and the mean and standard deviation of the validation (“test”) RMSE across the five folds. Round 0 is the model with a single tree; each row below it is the ensemble one tree larger. Watch test-rmse-mean fall down the rows as boosting corrects its own errors, exactly the residual-fixing loop from Module 1, now measured five ways at once.
Now look at the shape: 1055 rows, not 2000. That is early stopping at work. The full history would have been 2000 rounds, but the cross-validated test-rmse-mean reached its best value at round 1055 and then failed to improve for 50 rounds in a row, so xgb.cv cut the run short and returned only the rounds up to the best one. In other words, when early stopping triggers, the number of rows tells you the best round. Let’s read off the final row, which is that best round.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
dtrain = xgb.DMatrix(X_train, label=y_train)
params = {"objective": "reg:squarederror", "max_depth": 4, "eta": 0.1, "seed": 42}
cv_results = xgb.cv(
    params, dtrain, num_boost_round=2000, nfold=5,
    metrics="rmse", early_stopping_rounds=50, seed=42,
)
best_round = len(cv_results) # rows == best number of boosting rounds
best = cv_results.iloc[-1] # the last (best) round, as a clean pandas row
print("best round (num_boost_round):", best_round)
print("last two rounds:")
print(cv_results.tail(2).round(4).to_string())
print()
print("CV test-rmse-mean:", float(round(best["test-rmse-mean"], 4)))
print("CV test-rmse-std :", float(round(best["test-rmse-std"], 4)))
print("CV train-rmse-mean:", float(round(best["train-rmse-mean"], 4)))

best round (num_boost_round): 1055
last two rounds:
      train-rmse-mean  train-rmse-std  test-rmse-mean  test-rmse-std
1053           0.2726          0.0014          0.4591         0.0112
1054           0.2725          0.0014          0.4591         0.0112

CV test-rmse-mean: 0.4591
CV test-rmse-std : 0.0112
CV train-rmse-mean: 0.2725

There is your honest, cross-validated verdict in one line: at 1055 rounds, XGBoost’s cross-validated test RMSE is 0.4591, plus or minus 0.0112 across the five folds. Two things earn your attention. First, that number came from averaging five separate held-out evaluations, so it is far more trustworthy than any single split from the section above. Second, the standard deviation of 0.0112 is small relative to the mean, telling you the model performs consistently no matter how the training rows are carved up. Notice too the gap between train-rmse-mean (0.2725) and test-rmse-mean (0.4591): the model fits training data much closer than validation data, the ordinary signature of a flexible model, and cross-validation is what lets you see and trust that gap.
One call did two jobs
xgb.cv handled model selection and evaluation together. Early stopping inside the folds chose the tree count (1055) so you didn’t have to guess it, and averaging across folds produced the error estimate (0.4591 plus or minus 0.0112) that tells you how good that choice is. This is the payoff of the native tool: because it keeps the full per-round history for every fold, the round that is best on average and the cross-validated score at that round fall out of the same run. A plain scikit-learn cross-validation, which only returns one score per fold, cannot pick the tree count for you this way.
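The two jobs can be sketched in plain Python. This is a simplified illustration of the idea, not XGBoost's actual implementation: the function cv_best_round and the three short fold histories are hypothetical. Per-round scores are averaged across folds, and the run stops once the mean has gone a set number of rounds without improving.

```python
import statistics


def cv_best_round(fold_histories, patience):
    """Average per-round scores across folds and stop after `patience`
    rounds without improvement. Returns (best_round, trimmed_history)."""
    n_rounds = len(fold_histories[0])
    history = []  # one (mean, std) pair per boosting round
    best_score = float("inf")
    best_index = -1
    for r in range(n_rounds):
        scores = [fold[r] for fold in fold_histories]
        history.append((statistics.fmean(scores), statistics.pstdev(scores)))
        if history[-1][0] < best_score:
            best_score, best_index = history[-1][0], r
        elif r - best_index >= patience:
            break  # no improvement for `patience` rounds in a row
    # Keep only the rounds up to and including the best one.
    return best_index + 1, history[: best_index + 1]


# Hypothetical validation RMSE per round for three folds.
fold_histories = [
    [1.00, 0.80, 0.65, 0.56, 0.52, 0.53, 0.54, 0.55],
    [1.02, 0.82, 0.66, 0.58, 0.55, 0.54, 0.56, 0.57],
    [0.98, 0.79, 0.64, 0.57, 0.52, 0.54, 0.55, 0.56],
]

best_round, trimmed = cv_best_round(fold_histories, patience=2)
best_mean, best_std = trimmed[-1]
print("best round:", best_round)
print(f"CV score at best round: {best_mean:.4f} +/- {best_std:.4f}")
```

The tree count and the score at that count come out of the same pass over the history, which is only possible because every round of every fold was recorded.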
From Cross-Validation to a Final Model
Cross-validation gave you two things: a trustworthy estimate (RMSE about 0.4591) and a good setting (num_boost_round = 1055). It did not give you a model to deploy, because each of the five CV models saw only four folds of the training data. The standard final step is to retrain once on all the training data at the chosen round, then evaluate that model a single time on the held-out test set you have kept untouched since the split. That test set is the honesty check: it confirms the cross-validated estimate holds up on data the CV procedure never touched.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
# best_round came from xgb.cv above; we hard-code the value it found
best_round = 1055
# Retrain on ALL the training data at the cross-validated best round
final_model = xgb.XGBRegressor(
    n_estimators=best_round,
    learning_rate=0.1,
    max_depth=4,
    random_state=42,
)
final_model.fit(X_train, y_train)
pred = final_model.predict(X_test)
test_rmse = np.sqrt(mean_squared_error(y_test, pred))
test_r2 = r2_score(y_test, pred)
print("held-out test RMSE:", float(round(test_rmse, 4)))
print("held-out test R2 :", float(round(test_r2, 4)))

held-out test RMSE: 0.4469
held-out test R2 : 0.8476

On the untouched test set the final model reaches an RMSE of 0.4469 and an $R^2$ of 0.8476, explaining about 85 percent of the variance in unseen districts. That test RMSE is even a touch better than the cross-validated mean of 0.4591, which makes sense: the final model trained on the full training set (all five folds’ worth of rows), whereas each CV model trained on only four-fifths of it, so the final model had a little more data to learn from. The important point is that the held-out number lands within the neighborhood the cross-validation predicted, comfortably inside a standard deviation or two. Cross-validation set the expectation; the held-out test confirmed it. That agreement is what lets Northwind report the result with a straight face.
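The "standard deviation or two" claim is quick to verify from the numbers reported in this lesson. The sketch below expresses the gap between the cross-validated mean and the held-out RMSE in units of the fold-to-fold standard deviation; the two-standard-deviation threshold is an informal rule of thumb, not a formal statistical test.

```python
# Numbers reported in this lesson.
cv_mean, cv_std = 0.4591, 0.0112
test_rmse = 0.4469

# How far the held-out result sits from the CV mean, in units of CV std.
gap = cv_mean - test_rmse
gap_in_stds = gap / cv_std

# Informal rule of thumb (an assumption, not a formal test).
within_two_stds = abs(gap_in_stds) <= 2

print(f"gap          : {gap:.4f}")
print(f"gap in stds  : {gap_in_stds:.2f}")
print("within 2 stds:", within_two_stds)
```

The gap is about 0.0122 RMSE, or about 1.09 standard deviations, which agrees with the reading above.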
Which round do you train the final model on?
Because xgb.cv trims its output to the best round, len(cv_results) is the tree count you want, and you pass it straight into the final model as n_estimators (native API: num_boost_round). You are deliberately turning off early stopping for this final fit: there is no validation set to stop against and you already know the round you want, so you simply grow exactly that many trees on all the training data.
xgb.cv vs. scikit-learn’s cross_val_score
If you have used scikit-learn, you know cross_val_score, and it works perfectly well on an XGBRegressor because the scikit-learn API plugs straight into it. It is worth seeing side by side so you know when to reach for which. The catch is that cross_val_score needs a fixed model, including a fixed tree count, so you must decide n_estimators yourself up front rather than letting cross-validation discover it.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, cross_val_score
data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
# A FIXED model: we must pick n_estimators ourselves (300 here)
model = xgb.XGBRegressor(
    n_estimators=300, learning_rate=0.1, max_depth=4, random_state=42
)
# neg_root_mean_squared_error is negated (higher is better), so flip the sign
scores = -cross_val_score(
    model, X_train, y_train, cv=5, scoring="neg_root_mean_squared_error"
)
print("fold RMSEs:", [float(round(s, 4)) for s in scores])
print("mean RMSE :", float(round(scores.mean(), 4)))
print("std RMSE :", float(round(scores.std(), 4)))

fold RMSEs: [0.4654, 0.4738, 0.4723, 0.4691, 0.4803]
mean RMSE : 0.4722
std RMSE : 0.005

cross_val_score returns exactly what its name says: one score per fold, here five RMSEs that you summarize as 0.4722 plus or minus 0.0050. That is a clean, correct cross-validated estimate, and if you already know your tree count it is the shortest path to one. But notice what it cannot tell you: it has no idea whether 300 trees was a good choice, because it only ever saw the final score of each fold, not the round-by-round history. To use it for choosing n_estimators you would have to loop it over many candidate counts yourself, refitting from scratch each time.
xgb.cv is the better fit when the tree count is one of the things you are trying to decide, which, with early stopping, it almost always is. Because it retains test-rmse-mean for every round, a single call both selects the round and scores it. Use cross_val_score for a quick cross-validated check of a fully-specified model, or when you want XGBoost to sit inside a larger scikit-learn pipeline; reach for xgb.cv when you want cross-validation and early stopping to choose the tree count together, as you did above.
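A rough count of trees grown shows why looping cross_val_score over tree counts is the expensive route. This is back-of-the-envelope arithmetic under stated assumptions: the candidate grid is hypothetical, and the xgb.cv figure assumes each fold grows trees until the 50-round patience runs out after the best round of 1055.

```python
nfold = 5

# Hypothetical grid of tree counts to try with cross_val_score.
candidate_counts = [100, 300, 500, 1000, 1500, 2000]

# Every candidate is refit from scratch in every fold.
trees_grid_loop = nfold * sum(candidate_counts)

# Assumption: with early stopping, each fold grows trees up to the best
# round plus the patience window, then stops.
best_round, patience = 1055, 50
trees_xgb_cv = nfold * (best_round + patience)

print("trees grown, cross_val_score loop:", trees_grid_loop)
print("trees grown, single xgb.cv call  :", trees_xgb_cv)
```

Even then, the loop only evaluates six tree counts, while the single xgb.cv run scores every round along the way.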
Report the spread, always
Whichever tool you use, report the mean and the standard deviation, never the mean alone. “RMSE 0.4591 plus or minus 0.0112” and “RMSE 0.4722 plus or minus 0.0050” are honest claims; a bare “RMSE 0.4591” hides how much that estimate could have moved. The two tools here even used different tree counts (1055 vs 300) and so land on slightly different means, precisely the kind of difference the standard deviation helps you weigh instead of over-reading.
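A small formatting helper makes the habit hard to skip. The function format_result is a hypothetical convenience, and the ratio at the end is an informal way to weigh a difference against the spread, not a significance test.

```python
def format_result(label, mean, std):
    """Format a cross-validated score as mean +/- standard deviation."""
    return f"{label}: RMSE {mean:.4f} +/- {std:.4f}"


# (mean, std) pairs reported in this lesson.
results = {
    "xgb.cv, 1055 rounds": (0.4591, 0.0112),
    "cross_val_score, 300 trees": (0.4722, 0.0050),
}

for label, (mean, std) in results.items():
    print(format_result(label, mean, std))

# Informal comparison (an assumption, not a formal test): express the
# difference between the two means in units of the larger fold-to-fold std.
(mean_a, std_a), (mean_b, std_b) = results.values()
difference = abs(mean_a - mean_b)
ratio = difference / max(std_a, std_b)
print(f"difference = {difference:.4f}, or {ratio:.2f}x the larger std")
```

A difference of about one standard deviation is worth noting but not worth an afternoon of arguing.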
Practice Exercises
Try each one before opening its hint. They walk you through running xgb.cv, reading its output, and building the final model.
Exercise 1: Run xgb.cv and Read the Best Round
Build a DMatrix from the California Housing training split (random_state=42), then run xgb.cv with params = {"objective": "reg:squarederror", "max_depth": 4, "eta": 0.1, "seed": 42}, num_boost_round=2000, nfold=5, metrics="rmse", early_stopping_rounds=50, and seed=42. Print the number of rows (the best round) and the test-rmse-mean and test-rmse-std of the final row.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
# Your code here

Hint
Make dtrain = xgb.DMatrix(X_train, label=y_train) and call cv_results = xgb.cv(params, dtrain, num_boost_round=2000, nfold=5, metrics="rmse", early_stopping_rounds=50, seed=42). Then best_round = len(cv_results) and best = cv_results.iloc[-1]. Printing best["test-rmse-mean"] and best["test-rmse-std"] directly from the pandas row gives clean floats. You should see best_round equal to 1055 and a CV test RMSE of 0.4591 plus or minus 0.0112.
Exercise 2: Train the Final Model and Evaluate on the Held-Out Test Set
Using the best round from Exercise 1 (1055), fit an XGBRegressor(n_estimators=1055, learning_rate=0.1, max_depth=4, random_state=42) on all of X_train, predict on X_test, and print the test RMSE and $R^2$. Compare the test RMSE to the cross-validated mean from Exercise 1.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
# Your code here

Hint
Construct the model with n_estimators=1055, call .fit(X_train, y_train), then pred = model.predict(X_test). Compute RMSE with np.sqrt(mean_squared_error(y_test, pred)) and $R^2$ with r2_score(y_test, pred). You should get a test RMSE of 0.4469 and $R^2$ of 0.8476. It lands a little below the CV mean of 0.4591 because the final model trained on all five folds’ rows, not just four-fifths of them, and it sits comfortably within a standard deviation or two of the CV estimate.
Exercise 3: Compare Two Learning Rates with xgb.cv
Run xgb.cv twice with the same settings as Exercise 1 but with eta set to 0.05 and then 0.1. For each, print the best round (number of rows) and the final test-rmse-mean. Note how the slower learning rate changes the number of rounds early stopping settles on.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
dtrain = xgb.DMatrix(X_train, label=y_train)
# Your code here

Hint
Loop over [0.05, 0.1], rebuilding params with each eta and calling xgb.cv(params, dtrain, num_boost_round=2000, nfold=5, metrics="rmse", early_stopping_rounds=50, seed=42) each time. Print len(cv_results) and cv_results["test-rmse-mean"].iloc[-1]. At eta=0.05 the run reaches the 2000-round ceiling without early stopping triggering (its test-rmse-mean is still creeping down at 0.4573), a signal that 2000 trees was not enough headroom for such a slow rate; at eta=0.1 early stopping trims to 1055 rounds with a test-rmse-mean of 0.4591. The lesson: the best round depends heavily on the learning rate, which is exactly why letting cross-validation choose it beats hard-coding a tree count.
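One way to catch the ceiling case automatically is to compare the number of rows xgb.cv returned with the num_boost_round you asked for. The helper describe_cv_run below is hypothetical and uses the row counts from this exercise. It relies on the behavior described in this lesson: the output is trimmed to the best round only when early stopping actually triggers.

```python
def describe_cv_run(n_rows, num_boost_round):
    """Say whether early stopping triggered, given len(cv_results)."""
    if n_rows < num_boost_round:
        return f"early stopping triggered: best round = {n_rows}"
    return (
        f"hit the ceiling of {num_boost_round} rounds: "
        "raise num_boost_round and rerun"
    )


# Row counts reported in this exercise, keyed by eta.
rows_by_eta = {0.05: 2000, 0.1: 1055}

for eta, n_rows in rows_by_eta.items():
    print(f"eta={eta}: {describe_cv_run(n_rows, num_boost_round=2000)}")
```

When the ceiling is hit, treat len(cv_results) as a lower bound on the best round, not as the answer.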
Summary
You replaced a single noisy split with cross-validation and let xgb.cv pick the tree count and the honest error in one call. Let’s review.
Key Concepts
Why one split is not enough
- A train/test split reports accuracy on one slice of held-out rows, and that number moves when the slice changes
- On California Housing the same XGBoost model scored anywhere from 0.4550 to 0.4715 RMSE across five different splits, a spread that can swamp a real tuning gain
k-fold cross-validation
- Split the training data into $k$ folds, train $k$ times with each fold held out once, and summarize the $k$ scores by their mean and standard deviation
- The mean is a steadier estimate than any single split; the standard deviation tells you how fragile that estimate is
Cross-validating with xgb.cv
- xgb.cv returns a pandas DataFrame with one row per boosting round and columns train-rmse-mean, train-rmse-std, test-rmse-mean, test-rmse-std
- With early_stopping_rounds it trims the output to the best round, so len(cv_results) is the tree count and the last row is the cross-validated score there
- On California Housing it settled on 1055 rounds with a CV test RMSE of 0.4591 plus or minus 0.0112
From CV to a deployable model
- Retrain once on all the training data at the chosen round, then evaluate a single time on the held-out test set
- The final model reached 0.4469 RMSE and an $R^2$ of 0.8476 on the untouched test set, confirming the cross-validated estimate
xgb.cv vs. cross_val_score
- cross_val_score gives one score per fold for a fixed model (here 0.4722 plus or minus 0.0050 at 300 trees) and slots into scikit-learn pipelines
- xgb.cv keeps the per-round history, so it can select the tree count and score it in a single call, ideal when you want early stopping and cross-validation to cooperate
Why This Matters
Cross-validation is the difference between a number you hope is right and a number you can defend. When you tell a stakeholder your model’s error is “0.4591 plus or minus 0.0112,” you are making a claim you have checked five ways and quantified the uncertainty of, not reporting whatever the first split happened to hand you. That habit, reporting the spread alongside the mean, is what separates honest evaluation from lucky evaluation, and it is exactly the discipline you will lean on when you start comparing many hyperparameter settings in the tuning module: without cross-validation you cannot tell a real improvement from the noise of the split.
Just as important, you saw xgb.cv do two jobs at once. It is not merely a scoring tool; because it retains the full per-round history across folds, it lets cross-validation and early stopping cooperate to choose the tree count for you, then reports the honest error at that choice in the same breath. That is a capability the plain scikit-learn workflow cannot match without extra loops, and it is why the native tool earns its place in your toolkit.
Next Steps
You can now cross-validate an XGBoost model and report an honest, spread-aware error. Next you will confront a problem cross-validation alone cannot fix: imbalanced data, where one class vastly outnumbers the other and plain accuracy becomes misleading. You will see how XGBoost’s scale_pos_weight and the right evaluation metrics keep training honest when the classes are lopsided.
Lesson 3: Handling Imbalanced Data
Train XGBoost when one class dwarfs the other, using scale_pos_weight and metrics that survive class imbalance.
Continue Building Your Skills
Before moving on, rerun xgb.cv yourself and change one thing at a time. Bump max_depth from 4 to 6 and watch both the best round and the test-rmse-std shift; deeper trees fit faster but often widen the fold-to-fold spread. Try nfold=10 instead of 5 and compare test-rmse-std: each validation fold is now half the size, so individual fold scores tend to be noisier even though the mean averages over more of them. Each experiment reinforces the same instinct: never trust a single number when a handful of folds can tell you how much that number really wobbles. Carry that instinct into the tuning module, where you will run cross-validation dozens of times to separate settings that genuinely help from settings that only looked good on one lucky split.