Lesson 5 - Guided Project: Tuning XGBoost on Real Data
Welcome to the Guided Project
Across Module 2 you took XGBoost apart. You met its two APIs and its regularized objective, you saw how gradients and hessians drive each split’s gain, and you learned what every important knob does: the structural ones that shape each tree (max_depth, min_child_weight, gamma) and the pair that governs how fast the ensemble learns (learning_rate with n_estimators), then the regularization and sampling ones that fight overfitting (reg_lambda, reg_alpha, subsample, colsample_bytree). Knowing what each knob does is one thing. Turning them, in the right order, on real data, to make a model that is measurably better, is the skill this project builds.
The running example for this course is Northwind Analytics, a fictional consultancy whose data team has adopted XGBoost as its default for tabular problems. A junior analyst has trained a model with library defaults and asked, reasonably, “is this as good as it gets?” Your job is to answer that with evidence: start from an honest baseline, tune with intent on data the model has never scored, and report the final number on a test set you keep sealed until the very end. You will work on the real California Housing dataset, predicting the median house value of a neighborhood, and every number you report will be real and reproducible. The honest headline, which you will earn step by step, is that careful tuning takes the test RMSE from 0.4626 down to 0.4419 and, just as importantly, narrows the gap between training and test error.
By the end of this project, you will be able to:
- Establish a defensible XGBoost baseline and keep a test set sealed so every later gain is measured honestly
- Tune the structural hyperparameters from Lesson 3 (max_depth, min_child_weight, and a learning_rate/n_estimators pair) against a held-out validation set
- Add the regularization and sampling knobs from Lesson 4 (reg_lambda, subsample, colsample_bytree) to improve generalization, not just fit
- Lock in a tuned configuration, retrain on the full training set, and quantify the gain against the baseline on the sealed test set
- Read a train-test gap as evidence of over- or under-fitting and explain why tuning gains on strong defaults are often modest but real
This is the capstone for Module 2, so you should already be comfortable with the two XGBoost APIs, what each hyperparameter controls, and the basic scikit-learn workflow (train/test split, fit, predict, score). Let’s tune.
Stage 1: Fit a Baseline and Seal the Test Set
You cannot claim tuning helped unless you know exactly what you improved on. So the first move is not to tune anything. It is to fit XGBoost with sensible default-ish settings, measure it on a test set, and write that number down. That number is the thing to beat, and the discipline of setting it before you touch a single knob is what keeps a tuning project honest.
The California Housing dataset is available through scikit-learn’s fetch_california_housing loader and is drawn from the 1990 California census. Each row is a block group (a small neighborhood), and the target, MedHouseVal, is the median house value there in units of $100,000. The first call to fetch_california_housing downloads and caches the data locally, so later runs are instant.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
data = fetch_california_housing() # cached after the first download
X, y = data.data, data.target
print("Feature names:", data.feature_names)
print("X shape:", X.shape)
# Output:
# Feature names: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
# X shape: (20640, 8)
There are 20,640 neighborhoods and 8 numeric features, from median income (MedInc) to average occupancy (AveOccup) to latitude and longitude. Now for the split that makes the whole project trustworthy. You will carve the data into three pieces: a test set you seal now and do not score until Stage 4, and, inside the training data, a sub-train and a validation set that you will use to compare tuning candidates. The test set never influences a single tuning decision, which is exactly why the final number on it means something.
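Before running the split, it can help to work out the sizes it should produce. The sketch below assumes the held-out share is rounded up and the training share takes the remainder; split_sizes is a hypothetical helper written for this illustration, not a library function.

```python
import math

def split_sizes(n_rows, test_percent):
    """Return (train_rows, test_rows) for a percentage hold-out.

    Assumption: the held-out share is rounded up and the rest is kept
    for training. Illustrative helper only, not a library call.
    """
    test_rows = math.ceil(n_rows * test_percent / 100)
    return n_rows - test_rows, test_rows

total_rows = 20640  # California Housing block groups

# First cut: seal 20 percent as the test set.
train_rows, test_rows = split_sizes(total_rows, 20)

# Second cut: hold out 20 percent of the TRAINING rows for validation.
sub_train_rows, val_rows = split_sizes(train_rows, 20)

print("Full training rows:", train_rows)
print("Sealed test rows:  ", test_rows)
print("Tuning sub-train:  ", sub_train_rows)
print("Tuning validation: ", val_rows)
```

These counts should agree with the row counts the real split prints, which is a quick sanity check that no rows were lost or double-counted.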
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Carve a tuning validation set out of the TRAINING data only.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42)
print("Full training rows: ", X_train.shape[0])
print(" Tuning sub-train: ", X_tr.shape[0])
print(" Tuning validation: ", X_val.shape[0])
print("Test rows (untouched):", X_test.shape[0])
# Output:
# Full training rows: 16512
# Tuning sub-train: 13209
# Tuning validation: 3303
# Test rows (untouched): 4128
With the split in place, fit the baseline. You will use XGBRegressor with a modest, explicit n_estimators=100 and otherwise library defaults, trained on the full training set, and score it on the sealed test set. RMSE is measured in the same $100,000 units as the target:
def rmse(y_true, y_pred):
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))
baseline = xgb.XGBRegressor(n_estimators=100, random_state=42)
baseline.fit(X_train, y_train)
baseline_rmse = rmse(y_test, baseline.predict(X_test))
baseline_r2 = float(r2_score(y_test, baseline.predict(X_test)))
print(f"Baseline test RMSE: {baseline_rmse:.4f}")
print(f"Baseline test R2: {baseline_r2:.4f}")
# Output:
# Baseline test RMSE: 0.4626
# Baseline test R2: 0.8367
Out of the box, XGBoost lands a test RMSE of 0.4626 and an R² of 0.8367, already explaining about 84 percent of the variation in house values. That is a strong starting point, which is exactly what makes this a realistic tuning exercise: the interesting question is not whether you can beat a broken baseline, but whether careful tuning can squeeze a real, measurable gain out of an already-good model. 0.4626 is the number every later stage has to beat.
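Because the target is in units of $100,000, an RMSE converts to dollars by simple multiplication. The sketch below is a dependency-free version of the RMSE calculation on a few made-up values (the toy numbers are illustrative, not drawn from the dataset), followed by the conversion of the reported 0.4626. The names rmse_plain and UNIT_DOLLARS are introduced here for illustration.

```python
import math

def rmse_plain(y_true, y_pred):
    """Root mean squared error without NumPy or scikit-learn."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up median house values in units of $100,000 (illustrative only).
toy_true = [1.0, 2.0, 3.0, 4.0]
toy_pred = [1.5, 2.0, 2.5, 4.0]
toy_rmse = rmse_plain(toy_true, toy_pred)

# Convert the baseline's reported test RMSE into dollars.
UNIT_DOLLARS = 100_000
baseline_rmse_reported = 0.4626
baseline_rmse_dollars = baseline_rmse_reported * UNIT_DOLLARS

print(f"Toy RMSE: {toy_rmse:.4f}")
print(f"Baseline RMSE in dollars: ${baseline_rmse_dollars:,.0f}")
```

Read this way, the baseline’s typical error is on the order of $46,000 per neighborhood, which gives the later four-decimal improvements a concrete scale.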
Why hold out a separate validation set
If you tune by repeatedly checking the test set, the test set quietly stops being a fair judge: you start, however unconsciously, fitting your choices to it. A held-out validation set gives you a surface to compare candidates on, while the test set stays sealed as a single honest verdict at the end. This is the same reason libraries offer early stopping on a validation set rather than the test set, a pattern you will meet again in Module 3.
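A small simulation can make this concrete. It assumes, purely for illustration, that twenty candidate models all have the same true error and that each measured score is that true error plus random noise; the noise level, candidate count, and all names are made up.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

TRUE_RMSE = 0.47     # assumed: every candidate is equally good
NOISE_SD = 0.01      # assumed measurement noise per score
N_CANDIDATES = 20

def noisy_score():
    """One noisy measurement of a candidate's error on some data set."""
    return TRUE_RMSE + rng.gauss(0.0, NOISE_SD)

# Score every candidate on the SAME data set and keep the best-looking one.
selection_scores = [noisy_score() for _ in range(N_CANDIDATES)]
winner_score_on_selection_set = min(selection_scores)
mean_selection_score = sum(selection_scores) / N_CANDIDATES

# Score only the winner, once, on data that played no part in choosing it.
winner_score_on_sealed_set = noisy_score()

print(f"Winner on the set used to choose it: {winner_score_on_selection_set:.4f}")
print(f"Average candidate on that set:       {mean_selection_score:.4f}")
print(f"Winner on a sealed set:              {winner_score_on_sealed_set:.4f}")
print(f"True error of every candidate:       {TRUE_RMSE:.4f}")
```

The winner’s score on the set used to choose it is the minimum of twenty noisy draws, so it will almost always look better than the true 0.47 even though no candidate is actually better. The single score on the sealed set carries no such selection effect, which is why the test set is only scored once.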
Stage 2: Tune the Structural Hyperparameters
Now you start turning knobs, beginning with the ones from Lesson 3 that shape the trees themselves. Three matter most here. max_depth sets how deep each tree can grow, and therefore how complex an interaction a single tree can capture. min_child_weight sets how much evidence a split needs before it is allowed, acting as a brake on splits that fit noise. And learning_rate and n_estimators move together: a smaller learning rate takes more cautious steps, so it needs more trees to arrive, which is why you tune them as a pair rather than one at a time.
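The link between learning_rate and n_estimators can be illustrated with a deliberately idealized toy: assume every tree predicts the current residual perfectly, so each boosting round removes a fraction learning_rate of whatever error remains. Real trees are far from perfect, so the absolute counts below say nothing about California Housing; only the ratio between them is the point. rounds_to_shrink is a hypothetical helper.

```python
def rounds_to_shrink(learning_rate, target_fraction=0.01):
    """Count rounds until the remaining residual falls to target_fraction.

    Toy assumption: each round removes exactly `learning_rate` of the
    residual that is left, i.e. remaining *= (1 - learning_rate).
    """
    remaining = 1.0
    rounds = 0
    while remaining > target_fraction:
        remaining *= 1.0 - learning_rate
        rounds += 1
    return rounds

for lr in (0.1, 0.05):
    print(f"learning_rate={lr}: {rounds_to_shrink(lr)} rounds to reach 1% residual")
```

In this toy, halving the learning rate from 0.1 to 0.05 takes the round count from 44 to 90, roughly double. That is the reasoning behind pairing 0.1 with 300 trees and 0.05 with 600 rather than varying one while holding the other fixed.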
You will search a small grid of these settings, training each candidate on the sub-train and scoring it on the validation set. Keeping the grid modest (18 fits here) is deliberate: tuning is about learning the shape of the response, not about brute-forcing every combination.
import itertools
max_depths = [3, 4, 6]
min_child_weights = [1, 3, 5]
lr_n_pairs = [(0.1, 300), (0.05, 600)]
structural_results = []
for md, mcw, (lr, n) in itertools.product(
        max_depths, min_child_weights, lr_n_pairs):
    model = xgb.XGBRegressor(
        n_estimators=n, learning_rate=lr, max_depth=md,
        min_child_weight=mcw, random_state=42)
    model.fit(X_tr, y_tr)  # fit on sub-train only
    val_rmse = rmse(y_val, model.predict(X_val))  # score on validation
    structural_results.append((round(val_rmse, 4), md, mcw, lr, n))
structural_results.sort()
print(f"Fits evaluated: {len(structural_results)}")
print("Top 3 (val RMSE, max_depth, min_child_weight, learning_rate, n_estimators):")
for r in structural_results[:3]:
    print(" ", r)
# Output:
# Fits evaluated: 18
# Top 3 (val RMSE, max_depth, min_child_weight, learning_rate, n_estimators):
# (0.4702, 6, 5, 0.05, 600)
# (0.4716, 6, 5, 0.1, 300)
# (0.4742, 6, 1, 0.1, 300)
Eighteen fits, sorted best-first. The pattern is clear before you even pick a winner: every one of the top three uses max_depth=6, the deepest setting in the grid (and XGBoost’s default depth), so on this dataset the shallower trees of depth 3 and 4 leave accuracy on the table. The best combination pairs that depth with a cautious learning_rate=0.05 over n_estimators=600 and a min_child_weight=5 that keeps splits honest. Pull those settings out into named variables so the later stages can build on them.
best_val, best_md, best_mcw, best_lr, best_n = structural_results[0]
print("Best structural settings:")
print(f" max_depth={best_md}, min_child_weight={best_mcw}, "
f"learning_rate={best_lr}, n_estimators={best_n}")
print(f" validation RMSE: {best_val:.4f}")
# Output:
# Best structural settings:
# max_depth=6, min_child_weight=5, learning_rate=0.05, n_estimators=600
# validation RMSE: 0.4702
A quick note on reading these numbers: the best validation RMSE here is 0.4702, which looks worse than the 0.4626 baseline, but that is not an apples-to-apples comparison. The baseline was scored on the test set after training on all 16,512 training rows; these candidates are scored on the validation set after training on only the 13,209 sub-train rows. The validation number is only meaningful relative to other candidates on the same validation set, which is exactly how you are using it. The true comparison against the baseline waits for Stage 4, when the winning configuration is retrained on the full training set and scored on the sealed test set.
Stage 3: Add Regularization and Sampling
You now have a good tree shape, but a good fit is not the whole story. A model can score well and still be overfit, leaning too hard on the training data and generalizing worse than it could. The Lesson 4 knobs exist precisely for this: reg_lambda penalizes large leaf weights in the objective, while subsample and colsample_bytree show each tree only a random slice of the rows and columns, which injects the kind of randomness that curbs overfitting. Layer a small grid of these on top of the structural winner from Stage 2 and see whether they help.
reg_lambdas = [1, 5, 10]
subsamples = [0.8, 1.0]
colsample_bytrees = [0.8, 1.0]
reg_results = []
for rl, ss, cs in itertools.product(
        reg_lambdas, subsamples, colsample_bytrees):
    model = xgb.XGBRegressor(
        n_estimators=best_n, learning_rate=best_lr, max_depth=best_md,
        min_child_weight=best_mcw, reg_lambda=rl,
        subsample=ss, colsample_bytree=cs, random_state=42)
    model.fit(X_tr, y_tr)
    val_rmse = rmse(y_val, model.predict(X_val))
    reg_results.append((round(val_rmse, 4), rl, ss, cs))
reg_results.sort()
print(f"Fits evaluated: {len(reg_results)}")
print("Top 3 (val RMSE, reg_lambda, subsample, colsample_bytree):")
for r in reg_results[:3]:
    print(" ", r)
best_reg_val, best_rl, best_ss, best_cs = reg_results[0]
print(f"Best regularized validation RMSE: {best_reg_val:.4f}")
# Output:
# Fits evaluated: 12
# Top 3 (val RMSE, reg_lambda, subsample, colsample_bytree):
# (0.4689, 10, 0.8, 0.8)
# (0.4702, 1, 1.0, 1.0)
# (0.4703, 5, 0.8, 0.8)
# Best regularized validation RMSE: 0.4689
Two things jump out. First, the second-place row, reg_lambda=1, subsample=1.0, colsample_bytree=1.0, is exactly the un-regularized structural model from Stage 2 (those are the defaults), and it scores the same 0.4702. That is your control. Second, the winner, reg_lambda=10 with 80 percent row and column sampling, nudges the validation RMSE down to 0.4689. A small improvement, but the more important effect is what regularization does to the gap between training and validation error. Measure it directly.
def train_val_gap(**extra):
    m = xgb.XGBRegressor(
        n_estimators=best_n, learning_rate=best_lr, max_depth=best_md,
        min_child_weight=best_mcw, random_state=42, **extra)
    m.fit(X_tr, y_tr)
    tr = rmse(y_tr, m.predict(X_tr))
    va = rmse(y_val, m.predict(X_val))
    return tr, va
s_tr, s_va = train_val_gap() # structural only
r_tr, r_va = train_val_gap(
    reg_lambda=best_rl, subsample=best_ss, colsample_bytree=best_cs)
print(f"Structural only : train {s_tr:.4f} val {s_va:.4f} gap {s_va - s_tr:.4f}")
print(f"+ regularization: train {r_tr:.4f} val {r_va:.4f} gap {r_va - r_tr:.4f}")
# Output:
# Structural only : train 0.2536 val 0.4702 gap 0.2166
# + regularization: train 0.2880 val 0.4689 gap 0.1809
This is the real payoff of Stage 3. The structural-only model fits the training data hard (train RMSE 0.2536) but leaves a wide 0.2166 gap to validation, a classic sign that some of that tight fit is memorization. Adding regularization deliberately loosens the training fit (train RMSE rises to 0.2880) while validation error edges down, shrinking the gap to 0.1809. The model got a little worse at reciting the training set and a little better at the job that matters. That trade, worse on train and better on validation, is the signature of regularization doing exactly what it is designed to do.
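Why does a larger reg_lambda loosen the training fit? One way to see it is through the optimal leaf weight from the regularized objective, w* = -G / (H + λ), where G and H are the sums of gradients and hessians in the leaf. The sketch below assumes squared-error loss, where each row’s gradient is (prediction - target) and each hessian is 1; the leaf contents are made up and leaf_weight is a hypothetical helper, not part of the xgboost API.

```python
def leaf_weight(residuals, reg_lambda):
    """Optimal leaf weight w* = -G / (H + lambda) under squared error.

    Assumptions: gradient per row = prediction - target = -residual,
    hessian per row = 1. The learning rate is left out for clarity.
    """
    grad_sum = -sum(residuals)          # G
    hess_sum = float(len(residuals))    # H
    return -grad_sum / (hess_sum + reg_lambda)

# A made-up leaf holding five rows with these residuals (target - prediction).
leaf_residuals = [0.6, 0.4, 0.5, 0.7, 0.3]  # plain mean is 0.5

for lam in (0, 1, 5, 10):
    print(f"reg_lambda={lam:>2}: leaf weight = {leaf_weight(leaf_residuals, lam):.4f}")
```

With reg_lambda=10 this five-row leaf moves only a third as far as the unregularized mean would take it. A leaf holding hundreds of rows would barely notice the same penalty, so the shrinkage bites hardest where the evidence is thinnest, which is where memorization tends to live.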
Stage 4: Lock In the Tuned Model and Measure the Gain
Every tuning decision is made. The structural winner from Stage 2 and the regularization winner from Stage 3 combine into one configuration. Now you do the thing you have been saving: retrain that configuration on the full training set (all 16,512 rows, sub-train plus validation) and score it, once, on the test set you sealed back in Stage 1. First, retrain the Stage 2 structural-only config the same way, so you can see how much each phase contributed on the test set, not just on validation.
# Structural-only config (Stage 2 winner), retrained on the full training set.
structural_model = xgb.XGBRegressor(
    n_estimators=best_n, learning_rate=best_lr, max_depth=best_md,
    min_child_weight=best_mcw, random_state=42)
structural_model.fit(X_train, y_train)
struct_test = rmse(y_test, structural_model.predict(X_test))
struct_train = rmse(y_train, structural_model.predict(X_train))
print(f"Stage 2 structural-only test RMSE {struct_test:.4f} "
      f"train {struct_train:.4f} gap {struct_test - struct_train:.4f}")
# Output:
# Stage 2 structural-only test RMSE 0.4522 train 0.2738 gap 0.1784
final_model = xgb.XGBRegressor(
    n_estimators=best_n, learning_rate=best_lr, max_depth=best_md,
    min_child_weight=best_mcw, reg_lambda=best_rl,
    subsample=best_ss, colsample_bytree=best_cs, random_state=42)
final_model.fit(X_train, y_train) # retrain on the FULL training set
final_rmse = rmse(y_test, final_model.predict(X_test))
final_r2 = float(r2_score(y_test, final_model.predict(X_test)))
final_train = rmse(y_train, final_model.predict(X_train))
base_train = rmse(y_train, baseline.predict(X_train))
print(f"Baseline train {base_train:.4f} test {baseline_rmse:.4f} "
      f"gap {baseline_rmse - base_train:.4f}")
print(f"Final tuned model train {final_train:.4f} test {final_rmse:.4f} "
      f"gap {final_rmse - final_train:.4f}")
print()
print("Final tuned configuration:")
print(f" n_estimators={best_n}, learning_rate={best_lr}, max_depth={best_md},")
print(f" min_child_weight={best_mcw}, reg_lambda={best_rl}, "
      f"subsample={best_ss}, colsample_bytree={best_cs}")
print()
print(f"Baseline test RMSE: {baseline_rmse:.4f} -> Tuned test RMSE: {final_rmse:.4f}")
print(f"Baseline test R2: {baseline_r2:.4f} -> Tuned test R2: {final_r2:.4f}")
gain = baseline_rmse - final_rmse
print(f"RMSE improvement: {gain:.4f} ({gain / baseline_rmse * 100:.1f}% lower error)")
# Output:
# Baseline train 0.2757 test 0.4626 gap 0.1869
# Final tuned model train 0.3025 test 0.4419 gap 0.1394
#
# Final tuned configuration:
# n_estimators=600, learning_rate=0.05, max_depth=6,
# min_child_weight=5, reg_lambda=10, subsample=0.8, colsample_bytree=0.8
#
# Baseline test RMSE: 0.4626 -> Tuned test RMSE: 0.4419
# Baseline test R2: 0.8367 -> Tuned test R2: 0.8510
# RMSE improvement: 0.0208 (4.5% lower error)
There is the answer to the junior analyst’s question. Tuning took the test RMSE from 0.4626 to 0.4419, a real 0.0208 improvement, or 4.5 percent lower error, and pushed R² from 0.8367 to 0.8510. Read the three stages as a progression on the sealed test set: the baseline sat at 0.4626, structural tuning alone brought it to 0.4522, and adding regularization finished the job at 0.4419. The figure below shows that march down, bar by bar.
Now be honest about the size of that win, because honesty is the point of this project. A 4.5 percent error reduction is not the tenfold leap you sometimes see in tutorials, and that is expected: XGBoost’s defaults are genuinely good, so a strong baseline leaves less headroom than a weak one. But look past the RMSE at the gap column. The baseline overfit with a 0.1869 train-test gap; the tuned model narrowed that to 0.1394 while also scoring lower on test. You did not just shave error, you built a model that leans less on its training data, which is the kind of improvement that holds up when the data shifts. On a real project, a reliable 4.5 percent that generalizes better is a result worth shipping.
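The progression can be checked by hand from the four-decimal figures reported in the outputs. Because the inputs here are already rounded, the last digit can differ slightly from the printed results, which were computed from unrounded values. The dictionary name reported is introduced here for illustration.

```python
# Reported results copied from the stage outputs: (train RMSE, test RMSE).
reported = {
    "baseline":        (0.2757, 0.4626),
    "structural only": (0.2738, 0.4522),
    "final tuned":     (0.3025, 0.4419),
}

baseline_test = reported["baseline"][1]
previous_test = None
for name, (train_rmse, test_rmse) in reported.items():
    gap = test_rmse - train_rmse
    line = f"{name:<16} test {test_rmse:.4f}  gap {gap:.4f}"
    if previous_test is not None:
        # How much this stage improved on the one before it.
        line += f"  step gain {previous_test - test_rmse:.4f}"
    print(line)
    previous_test = test_rmse

total_gain = baseline_test - reported["final tuned"][1]
percent_lower = total_gain / baseline_test * 100
print(f"Total gain {total_gain:.4f} ({percent_lower:.1f}% lower error)")
```

Laid out this way, the two tuning phases contribute almost equally on the test set (about 0.010 each), and the regularization step is the one responsible for most of the narrowing in the train-test gap.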
Modest, real gains are the norm on strong baselines
It is tempting to judge tuning by how dramatic the improvement looks. Resist that. On a well-defaulted library like XGBoost, single-digit-percent gains that also improve generalization are a normal, good outcome, and chasing bigger numbers often means overfitting your hyperparameters to one particular split. The professional move is to report the honest gain, note the narrower train-test gap, and stop, rather than torture the validation set for another 0.001.
Practice Exercises
Now it is your turn. Treat these as real extensions, run each one, and read the numbers before you check the hint.
Exercise 1: Add gamma to the Structural Search
Lesson 3 introduced gamma, the minimum loss reduction required to make a split. Extend the Stage 2 grid with gamma values [0, 0.1, 0.3] (keep the rest of the grid the same) and rerun the validation search. Does a nonzero gamma reach the top of the leaderboard, and does it change the winning max_depth?
# Add a gammas = [0, 0.1, 0.3] loop level to the Stage 2 itertools.product,
# pass gamma=g into XGBRegressor, and re-sort structural_results.
Hint
Add gamma=g to the constructor and include gammas as a fourth axis in itertools.product. Because gamma is another brake on splitting, much like min_child_weight, expect small effects here: the strongest settings will still favor max_depth=6, and a modest gamma may tie or barely beat gamma=0. Note that the number of fits triples (len(structural_results) grows from 18 to 54), so keep the grid small enough that it still runs in a reasonable time.
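To see how gamma and min_child_weight act as brakes, here is a sketch of the split-gain rule from the regularized objective, assuming squared-error loss (hessian 1 per row, so a child’s hessian sum is simply its row count). The gradient sums are made up, and score, split_gain, and split_allowed are hypothetical helpers written for this illustration, not part of the xgboost API.

```python
def score(grad_sum, hess_sum, reg_lambda):
    """Structure score G^2 / (H + lambda) for one node."""
    return grad_sum ** 2 / (hess_sum + reg_lambda)

def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Gain = 0.5 * (left + right - parent) - gamma."""
    parent = score(g_left + g_right, h_left + h_right, reg_lambda)
    children = (score(g_left, h_left, reg_lambda)
                + score(g_right, h_right, reg_lambda))
    return 0.5 * (children - parent) - gamma

def split_allowed(g_left, h_left, g_right, h_right,
                  reg_lambda=1.0, gamma=0.0, min_child_weight=1.0):
    """Keep a split only if both children carry enough hessian weight
    and the gain, net of gamma, is positive."""
    if min(h_left, h_right) < min_child_weight:
        return False
    return split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma) > 0

# Made-up candidate split: 4 rows go left, 40 rows go right.
candidate = dict(g_left=-2.0, h_left=4.0, g_right=3.0, h_right=40.0)

print("gain, gamma=0:  ", round(split_gain(**candidate), 4))
print("gain, gamma=0.3:", round(split_gain(**candidate, gamma=0.3), 4))
print("allowed, min_child_weight=1:", split_allowed(**candidate))
print("allowed, min_child_weight=5:", split_allowed(**candidate, min_child_weight=5))
```

The same candidate split is accepted with min_child_weight=1 but rejected with min_child_weight=5 because its left child holds only four rows, while gamma simply subtracts a fixed toll from every split’s gain. Both brakes remove marginal splits, which is why adding gamma on top of an already-tuned min_child_weight tends to move the leaderboard only slightly.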
Exercise 2: Use RandomizedSearchCV Instead of a Manual Grid
Replace the Stage 2 manual loop with scikit-learn’s RandomizedSearchCV, sampling n_iter=15 combinations from distributions over max_depth, min_child_weight, learning_rate, and reg_lambda, with cv=3, scoring="neg_root_mean_squared_error", and random_state=42. Compare its best cross-validated score and chosen settings to your hand-built winner.
from sklearn.model_selection import RandomizedSearchCV
param_dist = {
"max_depth": [3, 4, 6, 8],
"min_child_weight": [1, 3, 5],
"learning_rate": [0.03, 0.05, 0.1],
"reg_lambda": [1, 5, 10],
}
# search = RandomizedSearchCV(xgb.XGBRegressor(n_estimators=400, random_state=42), ...)
Hint
Fit the search on X_train, y_train (it does its own internal cross-validation, so no manual validation split is needed). Read search.best_params_ and -search.best_score_ for the best RMSE. Keep n_iter small (15) and cv=3 so it finishes quickly, since each iteration trains three models. Expect settings in the same neighborhood as your manual winner, favoring deeper trees and a smaller learning rate; the exact combination may differ slightly because the search space and validation scheme are different.
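To get a feel for what n_iter=15 buys, the sketch below counts the full grid implied by param_dist and draws 15 combinations from it with Python’s random module. This is only a stand-in for illustration: it is not scikit-learn’s sampler, so the combinations it picks will not match what RandomizedSearchCV draws with the same seed.

```python
import itertools
import random

param_dist = {
    "max_depth": [3, 4, 6, 8],
    "min_child_weight": [1, 3, 5],
    "learning_rate": [0.03, 0.05, 0.1],
    "reg_lambda": [1, 5, 10],
}

# Every combination an exhaustive grid search would have to evaluate.
names = list(param_dist)
full_grid = [dict(zip(names, values))
             for values in itertools.product(*param_dist.values())]

# Draw 15 distinct combinations (stand-in sampler, seed chosen arbitrarily).
n_iter, cv_folds = 15, 3
rng = random.Random(42)
sampled = rng.sample(full_grid, n_iter)

print("Combinations in the full grid:       ", len(full_grid))
print("CV fits for an exhaustive search:    ", len(full_grid) * cv_folds)
print("CV fits for the randomized search:   ", n_iter * cv_folds)
print("First sampled candidate:", sampled[0])
```

The randomized search covers 15 of 108 combinations for 45 cross-validation fits instead of 324, which is why it finishes quickly and also why its winner can differ a little from run to run or from your manual grid.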
Exercise 3: Retune n_estimators with Early Stopping
Fixing n_estimators by hand is wasteful. Refit the final configuration with a large n_estimators=2000 but pass early_stopping_rounds=50 and an eval_set of (X_val, y_val), so XGBoost stops adding trees once validation error stops improving. Report model.best_iteration and check whether the early-stopped model matches or beats your fixed 600-tree test RMSE.
model = xgb.XGBRegressor(
    n_estimators=2000, learning_rate=0.05, max_depth=6,
    min_child_weight=5, reg_lambda=10, subsample=0.8,
    colsample_bytree=0.8, early_stopping_rounds=50, random_state=42)
# model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
Hint
In xgboost 3.x, early_stopping_rounds is a constructor argument and eval_set is passed to fit. After fitting, model.best_iteration gives the zero-based index of the best boosting round, which will often be well under 2000. Then score on the test set as before. Early stopping is the principled way to set n_estimators, and you will meet it properly in Module 3, so this is a preview of a habit worth building.
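The stopping rule itself is simple enough to sketch in plain Python. The function below scans a list of per-round validation errors and stops once patience rounds pass without a new best; the error curve is made up, and early_stop_round is a hypothetical helper that mimics the idea rather than XGBoost’s internal code.

```python
def early_stop_round(val_errors, patience):
    """Return (best_round, rounds_trained).

    best_round is the zero-based index of the lowest validation error
    seen before stopping; rounds_trained is a count of rounds run.
    """
    best_round, best_error = 0, float("inf")
    for round_index, error in enumerate(val_errors):
        if error < best_error:
            best_round, best_error = round_index, error
        elif round_index - best_round >= patience:
            return best_round, round_index + 1  # stop: no recent improvement
    return best_round, len(val_errors)          # patience never ran out

# Made-up validation curve: improves, bottoms out at round 4, then drifts up.
curve = [0.60, 0.52, 0.49, 0.475, 0.470, 0.471, 0.472, 0.474, 0.476, 0.480]

best_round, rounds_trained = early_stop_round(curve, patience=3)
print("Best round (zero-based):", best_round)
print("Rounds actually trained:", rounds_trained)
```

On this toy curve the best round is index 4 and training halts after 8 rounds, three rounds past the minimum. The validation set is doing the work a hand-picked n_estimators would otherwise have to guess at.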
Summary
Congratulations! You tuned XGBoost end to end on a real dataset, moving from a strong baseline to a measurably better, better-generalizing model, without ever letting the test set influence a decision. Let’s review what you did.
Key Concepts
Seal the Test Set, Tune on Validation
- You split the data into a sealed test set and, inside the training data, a sub-train and validation set, so every tuning choice was judged on data the final verdict never saw
- The baseline XGBRegressor scored a test RMSE of 0.4626 (R² 0.8367), the honest number every later stage had to beat
Structural Tuning Delivers About Half the Gain
- A search over max_depth, min_child_weight, and learning_rate/n_estimators pairs across 18 validation fits picked max_depth=6, min_child_weight=5, learning_rate=0.05, n_estimators=600
- Retrained on the full training set, this structural-only model reached a test RMSE of 0.4522, about half of the total gain (0.0104 of 0.0208)
Regularization Improves Generalization, Not Just Fit
- Adding reg_lambda=10, subsample=0.8, and colsample_bytree=0.8 narrowed the train-to-validation gap from 0.2166 to 0.1809 by loosening the training fit while lowering validation error
- This is the signature of regularization: a little worse on train, a little better on the data that matters
Quantify the Gain Honestly
- The final tuned model scored a test RMSE of 0.4419 (R² 0.8510), a real 4.5 percent error reduction over the baseline
- Just as important, the train-test gap shrank from 0.1869 to 0.1394: the tuned model leans less on its training data
Why This Matters
Most real machine-learning work does not look like the dramatic before-and-after of a tutorial. It looks like this: a good default model, a careful search for a real improvement, and the discipline to report the honest size of that improvement rather than a cherry-picked one. You just practiced the entire loop, baseline first, tune on held-out data, layer in regularization for generalization, and measure once on a sealed test set, on data with all the messiness of a real census. That workflow transfers directly to any tabular problem you will face.
The subtler lesson is what you learned to value. It would have been easy to celebrate the RMSE dropping and stop there. Instead you watched the train-test gap and understood that a model which memorizes less is often worth more than one that merely scores a hair lower, because it is the one that survives contact with new data. That instinct, reading generalization and not just error, is exactly what the Northwind Analytics team, and any team you join, needs from someone who tunes models for a living.
Next Steps
You can now take XGBoost from defaults to a tuned, well-generalizing model with an honest measurement to back it up. Next, the course moves from tuning knobs to training robustly: how to let the model decide how many trees it needs, how to validate reliably, and how to handle the messy realities of imbalance, missing values, and categorical features.
Module 3: Training Robust Models
Learn to train XGBoost reliably: early stopping, cross-validation, imbalance, and missing or categorical data.
Continue Building Your Skills
You just closed Module 2 the way real practitioners work: not by trusting defaults and not by chasing a flashy number, but by setting an honest baseline, tuning with intent on data the verdict never saw, and reporting a real 4.5 percent gain alongside a narrower train-test gap. Every knob you turned here, from max_depth to reg_lambda, you first met one lesson at a time in Module 2, and now you have felt what each one does to a live model on real data. That combination, knowing what a hyperparameter means and having watched it move a genuine metric, is what turns XGBoost from a black box into an instrument you can play. Carry that habit forward, because the models only get more capable from here.