Lesson 5 - Guided Project: A Robust Training Pipeline
Welcome to the Guided Project
Across Module 3 you learned to train XGBoost like a professional rather than a tutorial. You let a validation set decide how many trees to grow with early stopping (Lesson 1), you measured performance reliably with cross-validation through xgb.cv (Lesson 2), you stopped letting an imbalanced target fool your accuracy score by reaching for scale_pos_weight and the metrics that actually matter (Lesson 3), and you fed XGBoost real-world data with missing values and categorical features using its native enable_categorical support (Lesson 4). Four techniques, four lessons. In the real world they never arrive one at a time. A single messy dataset hands you all four at once, and your job is to build one pipeline that handles every one of them together.
The running example for this course is Northwind Analytics, a fictional consultancy whose data team ships tabular models for clients. A client has handed them a census-style dataset and a blunt question: can you predict which people earn more than $50,000 a year? The catch is the one every practitioner knows well: only about a quarter of people in the data are high earners, several columns have missing entries, and more than half the features are text categories like occupation and marital status. A naive model will score a flattering accuracy and quietly miss many of the high earners it was built to find. Your job is to build a model that does not.
You will work on the real Adult Income dataset (48,842 rows from the 1994 US census, fetched from OpenML), and every number you report will be real and reproducible. The honest headline, which you will earn stage by stage, is that a robust pipeline lifts positive-class recall from 0.6506 to 0.8512 while nudging ROC AUC up from 0.9183 to 0.9290, turning a model that finds two-thirds of high earners into one that finds six in seven.
By the end of this project, you will be able to:
- Load a real, messy dataset and diagnose its class balance, missing values, and categorical columns before training anything
- Feed categoricals and missing values straight into XGBoost with enable_categorical=True and tree_method="hist", no manual encoding or imputation
- Use early stopping and xgb.cv together to size an ensemble instead of guessing n_estimators
- Set scale_pos_weight and choose the right evaluation metric so an imbalanced target does not hide poor minority-class recall
- Combine all four Module 3 techniques into one configuration and quantify the honest precision/recall trade-off against a naive baseline
This is the capstone for Module 3, so you should already be comfortable with early stopping, xgb.cv, scale_pos_weight, and native categorical handling in isolation. Here you wire them into a single pipeline. Let’s build.
Stage 1: Load, Inspect, and Set a Baseline
You cannot fix problems you have not measured. So the first stage is pure diagnosis: load the data, see how imbalanced it is, find the missing values and categorical columns, and fit one honest baseline that ignores all of it. That baseline is the thing every later stage has to beat, and its weakness will tell you exactly what to fix.
The Adult Income dataset comes from OpenML through scikit-learn’s fetch_openml. The first call downloads and caches it locally, so later runs are instant. Passing as_frame=True gives you a pandas DataFrame with real dtypes, which is exactly what XGBoost’s native categorical support wants. The target is the string >50K or <=50K; you turn it into a 1/0 label where 1 means high earner.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, roc_auc_score, recall_score,
                             precision_score, f1_score, average_precision_score)
adult = fetch_openml("adult", version=2, as_frame=True) # cached after first download
X = adult.data.copy()
y = (adult.target == ">50K").astype(int)
print("X shape:", X.shape)
print("Positive rate (>50K):", round(float(y.mean()), 4))
cat_cols = [c for c in X.columns if str(X[c].dtype) in ("object", "category")]
print("Categorical columns:", cat_cols)
miss = X.isna().sum()
print("Columns with missing values:")
print(miss[miss > 0])
# Output:
# X shape: (48842, 14)
# Positive rate (>50K): 0.2393
# Categorical columns: ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
# Columns with missing values:
# workclass 2799
# occupation 2809
# native-country 857
# dtype: int64
There is the diagnosis in three numbers. The dataset has 48,842 rows and 14 features. Only 23.93 percent are high earners, so the classes are roughly 3-to-1 imbalanced, the exact situation from Lesson 3 where plain accuracy lies. Eight of the fourteen columns are categorical text, and three columns carry genuine missing values (workclass, occupation, and native-country). You are not going to impute those missing values or one-hot-encode those categories by hand. Instead, you convert the categorical columns to pandas category dtype and let XGBoost handle both problems natively, exactly as in Lesson 4.
for c in cat_cols:
    X[c] = X[c].astype("category")
# Stratified 60/20/20 split into train / validation / test.
# stratify=y keeps the ~24% positive rate identical in every split.
X_trainfull, X_test, y_trainfull, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_trainfull, y_trainfull, test_size=0.25, random_state=42, stratify=y_trainfull)
print("train / valid / test rows:", X_train.shape[0], X_valid.shape[0], X_test.shape[0])
print("test positive rate:", round(float(y_test.mean()), 4))
# Output:
# train / valid / test rows: 29304 9769 9769
# test positive rate: 0.2393
Note that test_size=0.25 applied to the 80 percent that remained after the first split carves out another 20 percent of the whole, giving a clean 60/20/20 split: 29,304 training rows, 9,769 for validation, and 9,769 sealed for the final test. Because you passed stratify=y both times, the test set’s positive rate is 0.2393, identical to the full data, so nothing about the imbalance was distorted by the split.
Now fit the baseline. It uses enable_categorical=True and tree_method="hist" so it can read the categorical columns and the NaNs, but it does nothing about the imbalance and simply grows a fixed 300 trees. A small helper reports the metrics that matter at a decision threshold of 0.5.
def report(model, X_te, y_te, label):
    proba = model.predict_proba(X_te)[:, 1]
    pred = (proba >= 0.5).astype(int)
    acc = accuracy_score(y_te, pred)
    auc = roc_auc_score(y_te, proba)
    rec = recall_score(y_te, pred)
    prec = precision_score(y_te, pred)
    f1 = f1_score(y_te, pred)
    print(f"{label}: acc={acc:.4f} auc={auc:.4f} "
          f"recall={rec:.4f} prec={prec:.4f} f1={f1:.4f}")
    return dict(acc=acc, auc=auc, rec=rec, prec=prec, f1=f1)
baseline = xgb.XGBClassifier(
    n_estimators=300, tree_method="hist",
    enable_categorical=True, random_state=42)
baseline.fit(X_train, y_train)
base = report(baseline, X_test, y_test, "Baseline")
# Output:
# Baseline: acc=0.8650 auc=0.9183 recall=0.6506 prec=0.7519 f1=0.6975
Read those numbers carefully, because they are the whole reason this project exists. The baseline looks great if you only glance at accuracy (0.8650) and ROC AUC (0.9183). But look at the recall on the positive class: 0.6506. The model correctly identifies only 65 percent of the actual high earners, missing more than a third of them. That is the imbalance trap from Lesson 3 in the flesh: because high earners are the minority, a model can score 86 percent accuracy while quietly failing at the one job it was hired for. Recall is the weak spot, and lifting it, without wrecking everything else, is what the next three stages are for.
Why recall is the metric to watch here
In an imbalanced problem, accuracy is dominated by the majority class. With only 24 percent positives, a model that never predicts “high earner” would still score 76 percent accuracy while catching zero of them. ROC AUC is more honest because it ranks positives against negatives across all thresholds, which is why the baseline’s 0.9183 AUC looks fine. But AUC does not tell you what happens at your actual decision threshold. Positive-class recall does: it is the fraction of true high earners your deployed model actually flags. When the cost of missing a positive is real, recall is the number that keeps you honest.
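The accuracy trap described above can be checked with a few lines of arithmetic. The sketch below builds a toy label vector with roughly the 24 percent positive rate quoted in this lesson and scores a "model" that never predicts the positive class. The 1,000-row size and the always-negative predictor are illustrative assumptions, not part of the project data.

```python
# Toy labels: 239 positives out of 1,000 rows, mirroring the ~24% rate above.
# The row count is an illustrative assumption, not the real test set.
y_true = [1] * 239 + [0] * 761

# A "model" that never predicts the positive class.
y_pred = [0] * len(y_true)

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)   # share of all rows predicted correctly
recall = tp / (tp + fn)              # share of true positives that were found

print(f"accuracy={accuracy:.3f} recall={recall:.3f}")
```

Accuracy comes out at 0.761 while recall is exactly zero, which is why a high accuracy on its own says little about the minority class.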
Stage 2: Add Robustness with Early Stopping and Cross-Validation
The baseline grew exactly 300 trees because you told it to. That is a guess, and a guess is not robust: too few trees underfit, too many waste compute and can overfit. Module 3 gave you two principled ways to let the data decide how many trees to grow. Early stopping (Lesson 1) watches a validation metric and halts once it stops improving. Cross-validation with xgb.cv (Lesson 2) does the same across five folds, giving a more stable estimate of both the ideal round count and the metric you can expect. Use both and see whether they agree.
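The patience rule behind early stopping can be sketched in plain Python. This is a simplified illustration of the idea, not XGBoost's internal implementation: the function name `early_stop` and the validation scores are made up for the example, and the metric is assumed to be higher-is-better.

```python
def early_stop(scores, patience):
    """Return (best_round, stopped_round) for a higher-is-better metric.

    Simplified illustration of patience-based stopping; it is not
    XGBoost's internal implementation.
    """
    best_round, best_score = 0, float("-inf")
    for rnd, score in enumerate(scores):
        if score > best_score:
            best_round, best_score = rnd, score
        elif rnd - best_round >= patience:
            return best_round, rnd  # no improvement for `patience` rounds
    return best_round, len(scores) - 1  # ran out of rounds before stopping


# Hypothetical validation AUC per boosting round (made-up numbers).
val_auc = [0.80, 0.85, 0.87, 0.86, 0.869, 0.868, 0.90]
best, stopped = early_stop(val_auc, patience=3)
print("best round:", best, "stopped at round:", stopped)
```

With a patience of 3, training halts at round 5 and keeps round 2 as the best, so the later 0.90 is never seen. That is the cost of a short patience window, and one reason the lesson uses a generous 50 rounds.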
Start with early stopping on the validation set. You give it a large ceiling of 2,000 trees, a gentler learning_rate=0.05, and ask it to stop after 50 rounds with no ROC AUC improvement on X_valid.
es_model = xgb.XGBClassifier(
    n_estimators=2000, learning_rate=0.05, tree_method="hist",
    enable_categorical=True, eval_metric="auc",
    early_stopping_rounds=50, random_state=42)
es_model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
print("early stopping best_iteration:", es_model.best_iteration)
print("best validation AUC:", round(float(es_model.best_score), 4))
# Output:
# early stopping best_iteration: 314
# best validation AUC: 0.9314
Left to its own judgment, the model settled on 314 trees, close to the 300 you guessed for the baseline but chosen for a reason rather than by hand, and it reached a validation AUC of 0.9314. Now cross-check that number with xgb.cv, which uses the low-level DMatrix API. Build the DMatrix on the combined train-plus-validation data (X_trainfull) so the folds see as much data as possible, and run five-fold CV with the same early-stopping logic.
dtrain = xgb.DMatrix(X_trainfull, label=y_trainfull, enable_categorical=True)
params = dict(max_depth=6, eta=0.05, objective="binary:logistic",
              tree_method="hist", eval_metric="auc")
cvres = xgb.cv(params, dtrain, num_boost_round=2000, nfold=5,
               early_stopping_rounds=50, seed=42)
print("xgb.cv best round:", len(cvres))
print("xgb.cv mean test AUC:", round(float(cvres["test-auc-mean"].iloc[-1]), 4))
# Output:
# xgb.cv best round: 258
# xgb.cv mean test AUC: 0.9286
The two methods land close together, which is exactly what you want to see. Early stopping on a single validation split picked 314 rounds; five-fold cross-validation picked 258 and estimated a mean held-out AUC of 0.9286, right in line with the 0.9314 from the single split. When your round-sizing methods converge like this, you can trust the ensemble size instead of guessing it. You now have a robustly-sized model, but notice what has not changed: nothing here touched the imbalance. The validation AUC looks healthy, but you have not yet asked the model to care about finding the minority class. That is Stage 3.
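To see why averaging across folds gives a steadier round count than a single split, here is a minimal sketch of the aggregation step. The three fold curves are invented for illustration (the run above used five folds and up to 2,000 rounds), and the selection rule shown, picking the round with the highest mean score, is a simplification of what xgb.cv does with early stopping.

```python
# Hypothetical per-round AUC for three folds (made-up numbers, not from the run above).
fold_curves = [
    [0.90, 0.92, 0.93, 0.92],
    [0.89, 0.91, 0.92, 0.93],
    [0.91, 0.93, 0.93, 0.92],
]

n_rounds = len(fold_curves[0])

# Average the folds round by round, as the test-auc-mean column does.
mean_by_round = [
    sum(fold[r] for fold in fold_curves) / len(fold_curves)
    for r in range(n_rounds)
]

best_round = max(range(n_rounds), key=lambda r: mean_by_round[r])
print("mean AUC per round:", [round(m, 4) for m in mean_by_round])
print("best round (0-indexed):", best_round)
```

The second fold on its own would have preferred the last round, but the mean across folds peaks one round earlier. A single validation split is effectively one of these curves, which is why its chosen round count can differ somewhat from the cross-validated one.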
Stage 3: Handle the Imbalance with scale_pos_weight
Stage 2 made the ensemble the right size; it did not make the model care about the 24 percent of rows that matter most. XGBoost, left alone, optimizes overall log-loss, and with a 3-to-1 class ratio the cheapest way to lower that loss is to lean toward the majority “low earner” class. That is precisely why the baseline’s recall was a weak 0.6506. Lesson 3’s fix is scale_pos_weight, which multiplies the gradient contribution of the positive class so each high earner counts more. The standard setting is the ratio of negatives to positives.
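The effect of the weight on the gradient can be sketched directly. The function below is an illustration of weighted logistic loss under the assumption that positive rows are simply multiplied by scale_pos_weight; the name `logistic_grad` is hypothetical and this is not XGBoost's source code.

```python
import math


def logistic_grad(margin, label, scale_pos_weight=1.0):
    """Gradient of weighted log-loss with respect to the raw margin.

    Positive rows are weighted by scale_pos_weight; negative rows keep weight 1.
    A sketch of the idea, not XGBoost's source code.
    """
    p = 1.0 / (1.0 + math.exp(-margin))          # predicted probability
    weight = scale_pos_weight if label == 1 else 1.0
    return weight * (p - label)


# At margin 0 the predicted probability is 0.5 for every row.
g_neg = logistic_grad(0.0, label=0, scale_pos_weight=3.1791)
g_pos_plain = logistic_grad(0.0, label=1)
g_pos_weighted = logistic_grad(0.0, label=1, scale_pos_weight=3.1791)
print(g_neg, g_pos_plain, g_pos_weighted)
```

An unweighted positive and a negative pull with equal strength in opposite directions, so three negatives outvote one positive. With the weight set to the negative-to-positive ratio, each positive pulls about three times as hard, which roughly balances the two classes in the total gradient.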
You also switch the early-stopping metric to aucpr (area under the precision-recall curve), which ignores true negatives and therefore focuses squarely on how well the model ranks the minority class, something plain accuracy cannot do.
neg = int((y_train == 0).sum())
pos = int((y_train == 1).sum())
spw = neg / pos
print("scale_pos_weight = neg / pos =", round(spw, 4))
imb_model = xgb.XGBClassifier(
    n_estimators=2000, learning_rate=0.05, tree_method="hist",
    enable_categorical=True, eval_metric="aucpr",
    scale_pos_weight=spw, early_stopping_rounds=50, random_state=42)
imb_model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
print("best_iteration:", imb_model.best_iteration)
imb = report(imb_model, X_test, y_test, "Imbalance-handled")
ap = average_precision_score(y_test, imb_model.predict_proba(X_test)[:, 1])
print("average precision:", round(float(ap), 4))
# Output:
# scale_pos_weight = neg / pos = 3.1791
# best_iteration: 269
# Imbalance-handled: acc=0.8357 auc=0.9297 recall=0.8546 prec=0.6123 f1=0.7134
# average precision: 0.8335
This is the pivot of the whole project. With scale_pos_weight=3.1791, positive-class recall jumps from 0.6506 to 0.8546, a 20-point leap. The model now catches roughly 85 percent of true high earners instead of 65 percent. ROC AUC even edges up to 0.9297, and the F1 score improves from 0.6975 to 0.7134. But be honest about the cost, because there is always a cost: precision fell from 0.7519 to 0.6123, and overall accuracy dipped from 0.8650 to 0.8357. By telling the model to care more about catching positives, you made it flag more people as high earners, so more of those flags are wrong.
That is the precision/recall trade-off, and it is not a bug to be fixed but a dial to be set. Which way you turn it depends on the cost of each error. If a missed high earner (a false negative) is expensive, say you are screening for people who qualify for a benefit and must not be overlooked, the Stage 3 model is clearly better despite its lower precision. If a false alarm (a false positive) is the expensive error, say every flag triggers a costly manual review, the baseline's higher precision may serve you better. The right choice is a business decision, but the professional discipline is to see the trade rather than let a single accuracy number hide it.
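One way to make "the cost of each error" concrete is to convert the reported rates into expected counts per 1,000 people and attach a price to each mistake. The sketch below uses the prevalence, recall, and precision printed in this lesson; the helper names and the cost values are hypothetical, chosen only to show how the preferred model flips with the cost ratio.

```python
def per_thousand(prevalence, recall, precision):
    """Convert rates into expected (TP, FN, FP) counts per 1,000 people."""
    positives = 1000 * prevalence
    tp = recall * positives
    fn = positives - tp
    fp = tp * (1 - precision) / precision  # from precision = tp / (tp + fp)
    return tp, fn, fp


def total_cost(counts, cost_fn, cost_fp):
    """Total error cost given a price per false negative and per false positive."""
    _, fn, fp = counts
    return fn * cost_fn + fp * cost_fp


# Rates taken from the Baseline and Imbalance-handled outputs in this lesson.
base_counts = per_thousand(0.2393, recall=0.6506, precision=0.7519)
stage3_counts = per_thousand(0.2393, recall=0.8546, precision=0.6123)

# Hypothetical cost ratios (false negative : false positive).
for cost_fn, cost_fp in [(1, 1), (5, 1)]:
    b = total_cost(base_counts, cost_fn, cost_fp)
    s = total_cost(stage3_counts, cost_fn, cost_fp)
    print(f"FN cost {cost_fn}, FP cost {cost_fp}: baseline={b:.1f} stage3={s:.1f}")
```

Under these assumed prices, the baseline is cheaper when both errors cost the same, and the Stage 3 model is cheaper once a missed high earner costs five times as much as a false alarm.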
Stage 4: Combine Everything into One Robust Model
You have proven each technique in isolation. Now assemble them into the single configuration you would actually ship. The final model combines all four Module 3 lessons at once: native categorical and missing-value handling (enable_categorical, tree_method="hist"), early stopping to size the ensemble (early_stopping_rounds on the validation set), imbalance handling (scale_pos_weight with the aucpr metric), and, for a little extra generalization, the row and column subsampling you met back in Module 2. Then you retrain and measure once on the sealed test set.
final = xgb.XGBClassifier(
    n_estimators=2000, learning_rate=0.05, max_depth=6,
    tree_method="hist", enable_categorical=True,   # categoricals + missing values
    eval_metric="aucpr", scale_pos_weight=spw,     # imbalance handling
    subsample=0.8, colsample_bytree=0.8,           # a little regularization
    early_stopping_rounds=50, random_state=42)     # let the data size the ensemble
final.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
print("final best_iteration:", final.best_iteration)
fin = report(final, X_test, y_test, "Final robust")
ap_final = average_precision_score(y_test, final.predict_proba(X_test)[:, 1])
print("final average precision:", round(float(ap_final), 4))
# Output:
# final best_iteration: 224
# Final robust: acc=0.8363 auc=0.9290 recall=0.8512 prec=0.6140 f1=0.7134
# final average precision: 0.832
Early stopping settled the final model at 224 trees, and it delivers what the whole pipeline was built for. Line it up against the Stage 1 baseline, metric by metric, to see exactly what the robustness bought.
print(f"Baseline    : acc={base['acc']:.4f} auc={base['auc']:.4f} "
      f"recall={base['rec']:.4f} prec={base['prec']:.4f} f1={base['f1']:.4f}")
print(f"Final robust: acc={fin['acc']:.4f} auc={fin['auc']:.4f} "
      f"recall={fin['rec']:.4f} prec={fin['prec']:.4f} f1={fin['f1']:.4f}")
print()
print(f"Positive-class recall: {base['rec']:.4f} -> {fin['rec']:.4f} "
      f"(+{fin['rec'] - base['rec']:.4f})")
print(f"ROC AUC: {base['auc']:.4f} -> {fin['auc']:.4f} "
      f"(+{fin['auc'] - base['auc']:.4f})")
# Output:
# Baseline : acc=0.8650 auc=0.9183 recall=0.6506 prec=0.7519 f1=0.6975
# Final robust: acc=0.8363 auc=0.9290 recall=0.8512 prec=0.6140 f1=0.7134
#
# Positive-class recall: 0.6506 -> 0.8512 (+0.2006)
# ROC AUC: 0.9183 -> 0.9290 (+0.0107)
There is the answer to the client’s question. The robust pipeline lifts positive-class recall from 0.6506 to 0.8512, a full +0.2006, so the model that once found two-thirds of high earners now finds roughly six in seven. ROC AUC rose from 0.9183 to 0.9290, and F1 improved from 0.6975 to 0.7134. Because AUC does not depend on any threshold, its rise indicates that part of the gain is better ranking, not only a shift in where the 0.5 cut falls. The figure below traces recall and AUC across the stages.
Now be honest about the trade, because that honesty is the point. The final model is not better on every metric: accuracy fell from 0.8650 to 0.8363 and precision from 0.7519 to 0.6140. If your client only ever looked at accuracy, they would call this a worse model, and they would be missing the story entirely. What you built is a model that catches the high earners it was hired to catch, and you can state precisely what that cost: about 14 more false alarms per 100 flags, in exchange for finding 20 more true positives per 100 actual high earners. On an imbalanced problem where the minority class is the whole point, that is the model worth shipping, and you can defend every number.
The pipeline, not any single knob, is the deliverable
Notice that no single technique did all the work. Native categorical handling let you use the data at all without brittle encoding. Early stopping and xgb.cv sized the ensemble so you were not guessing. scale_pos_weight and the aucpr metric drove the recall gain. Subsampling added a little generalization. On a real dataset you rarely get to apply these one at a time and pick a winner; you apply them together, and the robustness comes from the combination. That is why Module 3 built them separately and this project wires them into one configuration: the deliverable is the pipeline, not any individual knob.
Practice Exercises
Now it is your turn. Treat these as real extensions, run each one, and read the numbers before you check the hint.
Exercise 1: Tune the Decision Threshold
The report helper classifies at a fixed 0.5 probability threshold, but with scale_pos_weight inflating positive scores, 0.5 may not be optimal. Take the final model’s predicted probabilities on the validation set, sweep thresholds from 0.3 to 0.7, and find the one that maximizes F1. Then apply that threshold to the test set and compare recall, precision, and F1 to the 0.5 defaults.
val_proba = final.predict_proba(X_valid)[:, 1]
# for t in np.linspace(0.3, 0.7, 41):
#     pred = (val_proba >= t).astype(int)
#     compute f1_score(y_valid, pred) and track the best t
Hint
Pick the best threshold on the validation set, then apply it once to the test set. Never tune the threshold on the test set itself, for the same reason you keep the test set sealed. Because scale_pos_weight pushes probabilities upward, the F1-optimal threshold will likely sit above 0.5, which raises precision back up while giving back some recall. This is the threshold-tuning counterpart to changing scale_pos_weight: two different dials that both move the precision/recall balance.
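The mechanics of the sweep can be rehearsed on toy data before running it on the real model. The sketch below uses ten invented validation labels and scores, plus a hand-written `f1_at` helper (a hypothetical name) in place of scikit-learn, so it shows the procedure only and says nothing about the numbers you should expect on Adult Income.

```python
def f1_at(threshold, y_true, proba):
    """F1 score for the positive class at a given probability threshold."""
    pred = [1 if p >= threshold else 0 for p in proba]
    tp = sum(1 for t, q in zip(y_true, pred) if t == 1 and q == 1)
    fp = sum(1 for t, q in zip(y_true, pred) if t == 0 and q == 1)
    fn = sum(1 for t, q in zip(y_true, pred) if t == 1 and q == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


# Toy validation labels and scores (made up for illustration).
y_val = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
p_val = [0.10, 0.20, 0.35, 0.45, 0.50, 0.55, 0.62, 0.70, 0.58, 0.90]

# 41 thresholds from 0.30 to 0.70 in steps of 0.01.
thresholds = [round(0.30 + 0.01 * i, 2) for i in range(41)]

# max() keeps the first threshold that reaches the highest F1.
best_t = max(thresholds, key=lambda t: f1_at(t, y_val, p_val))
print("best threshold:", best_t, "F1:", round(f1_at(best_t, y_val, p_val), 4))
```

On this toy data the sweep settles on 0.59, the lowest threshold that excludes every negative while still catching three of the four positives. Once a threshold is chosen this way on validation data, it would be applied a single time to the test set.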
Exercise 2: Compare scale_pos_weight Values
You used the textbook scale_pos_weight = neg / pos = 3.1791, but it is a dial, not a law. Retrain the final configuration with scale_pos_weight set to 1 (off), 2, 3.1791, and 5, and report each model’s test recall and precision. Plot or tabulate how recall rises and precision falls as the weight grows.
for spw_try in [1, 2, 3.1791, 5]:
    m = xgb.XGBClassifier(
        n_estimators=2000, learning_rate=0.05, max_depth=6,
        tree_method="hist", enable_categorical=True, eval_metric="aucpr",
        scale_pos_weight=spw_try, subsample=0.8, colsample_bytree=0.8,
        early_stopping_rounds=50, random_state=42)
    # m.fit(...); report(...)
Hint
Expect a broadly monotone trade: as scale_pos_weight climbs from 1 to 5, recall climbs and precision falls, because a heavier positive weight makes the model flag ever more people as high earners. Small wobbles are possible, since early stopping can pick a different tree count for each weight. There is no single “correct” value, only the value that matches the relative cost of a false negative versus a false positive for your use case. Seeing the whole curve, rather than trusting neg / pos blindly, is what lets you set the dial deliberately.
Exercise 3: Prove the Categorical Handling Earns Its Place
Native categorical handling is doing quiet work in every stage. Quantify it: build a version of X where the eight categorical columns are label-encoded to plain integers without enable_categorical (so XGBoost treats them as ordered numbers), retrain the final configuration on that, and compare its test AUC and recall to your native-categorical model.
X_enc = X.copy()
for c in cat_cols:
    X_enc[c] = X_enc[c].cat.codes  # integer codes, -1 for NaN
# re-split X_enc the same way, then fit XGBClassifier WITHOUT enable_categorical
Hint
Reuse random_state=42 and the same stratified split so the only thing that changes is how categoricals are represented. Treating an unordered category like occupation as an ordered integer forces XGBoost to split on an arbitrary numeric ordering, which usually costs a little AUC and recall. The gap is often modest on this dataset (XGBoost is robust), but measuring it turns “native handling is better” from a claim into a number you can defend.
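The "arbitrary numeric ordering" point can be made concrete by counting the groupings a single split can express. The sketch below compares threshold splits on integer codes with partition-style splits on categories. The four category labels are illustrative, and the comparison assumes partition-style categorical splits; XGBoost's actual split search uses its own heuristics rather than enumerating every subset.

```python
from itertools import combinations

categories = ["clerical", "sales", "tech", "exec"]  # illustrative labels

# Integer codes allow only "code <= k" splits: contiguous prefixes of the ordering.
threshold_splits = {
    frozenset(categories[: k + 1]) for k in range(len(categories) - 1)
}

# A partition-style categorical split may send any non-empty proper subset left.
# Fixing the first category on the left avoids counting mirror-image splits twice.
first, rest = categories[0], categories[1:]
partition_splits = {
    frozenset((first,) + combo)
    for r in range(len(rest))
    for combo in combinations(rest, r)
}

print(len(threshold_splits), "threshold splits vs", len(partition_splits), "partitions")

# Grouping the first and last category together needs a non-contiguous set.
target = frozenset({"clerical", "exec"})
print("reachable by one threshold split:", target in threshold_splits)
print("reachable by one partition split:", target in partition_splits)
```

With four categories a single threshold can express 3 groupings while a partition can express 7, and the gap widens quickly as cardinality grows. Deeper trees can still carve out non-contiguous groups from integer codes using several splits, which is one reason the measured gap is often modest.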
Summary
Congratulations! You built a complete, robust XGBoost training pipeline on a real, messy, imbalanced dataset, combining every technique from Module 3 into one configuration you could actually ship. Let’s review what you did.
Key Concepts
Diagnose Before You Train
- The Adult Income data was 23.93 percent positive, had missing values in three columns, and eight categorical columns, all diagnosed in Stage 1 before a single model was tuned
- The baseline XGBClassifier scored a flattering 0.8650 accuracy and 0.9183 AUC but a weak 0.6506 positive-class recall, exposing the imbalance trap
Size the Ensemble with Early Stopping and CV
- Early stopping on a validation set chose 314 trees at a 0.9314 validation AUC; five-fold xgb.cv independently chose 258 rounds at a 0.9286 mean AUC
- When two round-sizing methods agree, you can trust the ensemble size instead of guessing n_estimators
Fix Imbalance and Own the Trade-off
- Setting scale_pos_weight = neg / pos = 3.1791 and the aucpr metric lifted recall from 0.6506 to 0.8546, at the honest cost of precision falling from 0.7519 to 0.6123
- The precision/recall trade-off is a dial to set by business cost, not a bug to hide behind accuracy
Combine Everything into One Robust Model
- The final model wired together native categorical/missing handling, early stopping (224 trees), scale_pos_weight, and subsampling
- It lifted positive-class recall from 0.6506 to 0.8512 and ROC AUC from 0.9183 to 0.9290, while you stated plainly the precision and accuracy it gave back
Why This Matters
Real datasets do not hand you one problem at a time. The Adult Income data was imbalanced and missing values and full of categorical text, all at once, which is the normal condition of tabular data in the wild. Module 3 taught you each technique in isolation so you could understand it; this project taught you the harder and more valuable skill of composing them into a single pipeline where the robustness comes from the combination. That is what production model training actually looks like.
The subtler lesson is what you learned to report. A less careful analyst would have shipped the baseline, quoted its 86 percent accuracy, and never noticed it missed a third of the high earners. You learned to look past the headline metric to the one that matches the job, to lift it deliberately, and to state the exact price you paid in precision and accuracy. That honesty, being able to say “here is what improved, here is what it cost, and here is why the trade is worth it”, is what separates a model that demos well from a model a team can trust in production.
Next Steps
You can now build a robust XGBoost pipeline end to end on messy, imbalanced, real-world data. But a model that scores well is only half the job: you still need to understand why it predicts what it does, tune it efficiently, and get it into production. That is exactly where the course goes next.
Module 4: Interpretation, Tuning & Deployment
Open the black box with feature importance and SHAP, tune with Optuna, and ship a saved model.
Back to Course Overview
Review the full Gradient Boosting & XGBoost course.
Continue Building Your Skills
You just closed Module 3 the way robust models are actually built: not by trusting a default and quoting its accuracy, but by diagnosing the data first, sizing the ensemble with evidence, fixing the imbalance on purpose, and combining every technique into one configuration whose trade-offs you can defend line by line. Each tool you used here, early stopping, xgb.cv, scale_pos_weight, native categorical handling, you first met alone in this module, and now you have felt how they reinforce each other on a single real dataset. That is the leap from knowing what a technique does to knowing how to compose techniques under real-world pressure, and it is exactly the instinct the Northwind Analytics team, and any team you join, needs from someone who trains models for a living. The models only get more powerful from here, and now you know how to make them trustworthy too.