Lesson 1 - Guided Project: An Honest, Tuned, Explained Model
Welcome to the Guided Project
This is the last lesson of the course, and it is unlike every guided project that came before it. Each module closed with a project that exercised that module’s skills. This one exercises all four modules at once, on a single real dataset, in the order you would actually work through a problem in the field: understand the data, train something robust, tune it properly, explain it honestly, and report metrics you can defend.
The running example for this course is Northwind Analytics, a fictional consultancy whose data team ships tabular models for clients. A client has handed them a census dataset and one blunt question: can you predict which people earn more than $50,000 a year, and can you show your work? That second clause is the whole point of this capstone. Anyone can call .fit() and quote an accuracy number. The job you have been training for is to build a model whose every number, from the class balance to the final recall, traces back to a deliberate, verified decision.
You will work on the real Adult Income dataset (48,842 rows from the 1994 US census, fetched from OpenML). It is imbalanced (only about 24 percent are high earners), it has genuine categorical columns and real missing values, and a careless model will score a flattering accuracy while quietly missing a third of the people it was built to find. Over four stages you will fix that, and every number you report will be real and reproducible. The honest headline you will earn: a naive baseline that finds 65 percent of high earners becomes a tuned, explained model that finds 86 percent of them, with ROC AUC rising from 0.9183 to 0.9299 and no metric hidden along the way.
By the end of this project, you will be able to:
- Take one real, messy dataset from raw load to a documented, honestly evaluated model using skills from all four modules
- Diagnose class balance, categorical columns, and missing values, then feed them straight into XGBoost with enable_categorical and tree_method="hist"
- Size and rebalance a model with early stopping, xgb.cv, and scale_pos_weight, then tune it with a small Optuna study
- Explain the finished model with SHAP, both a global ranking and one exact, additive local explanation in the classifier’s log-odds space
- Report final metrics honestly against the baseline, naming precisely what improved and what it cost
This is the capstone, so it assumes you are comfortable with everything the four modules taught. Here you wire it all into one build. Let’s go.
Stage 1: Explore and Prepare the Data
You cannot fix what you have not measured, so the first stage is pure diagnosis and setup. The Adult Income dataset comes from OpenML through scikit-learn’s fetch_openml. The first call downloads and caches it, so later runs are instant, and as_frame=True returns a pandas DataFrame with real dtypes, exactly what XGBoost’s native categorical support wants. The target is the string >50K or <=50K, which you turn into a 1/0 label where 1 means high earner.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, roc_auc_score, recall_score,
                             precision_score, f1_score, average_precision_score)
adult = fetch_openml("adult", version=2, as_frame=True) # cached after first download
X = adult.data.copy()
y = (adult.target == ">50K").astype(int)
print("X shape:", X.shape)
print("Positive rate (>50K):", round(float(y.mean()), 4))
cat_cols = [c for c in X.columns if str(X[c].dtype) in ("object", "category")]
num_cols = [c for c in X.columns if c not in cat_cols]
print("Categorical columns:", cat_cols)
print("Numeric columns:", num_cols)
miss = X.isna().sum()
print("Columns with missing values:")
print(miss[miss > 0])
# Output:
# X shape: (48842, 14)
# Positive rate (>50K): 0.2393
# Categorical columns: ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
# Numeric columns: ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
# Columns with missing values:
# workclass 2799
# occupation 2809
# native-country 857
# dtype: int64
There is the whole problem in a few numbers. The dataset has 48,842 rows and 14 features. Only 23.93 percent are high earners, so the classes are roughly 3-to-1 imbalanced, the exact situation where plain accuracy lies. Eight columns are categorical text (occupation, marital status, and so on), six are numeric, and three columns carry genuine missing values (workclass, occupation, native-country). You are not going to hand-impute those blanks or one-hot-encode those categories. Instead you convert the categorical columns to pandas category dtype and let XGBoost handle both natively, exactly as Module 3 taught. Then you carve out a stratified 60/20/20 split so the imbalance is preserved in every part.
for c in cat_cols:
    X[c] = X[c].astype("category")
# Stratified 60/20/20 split: train (fit) / validation (tune & early-stop) / test (sealed)
X_trainfull, X_test, y_trainfull, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_trainfull, y_trainfull, test_size=0.25, random_state=42, stratify=y_trainfull)
print("train / valid / test rows:", X_train.shape[0], X_valid.shape[0], X_test.shape[0])
print("test positive rate:", round(float(y_test.mean()), 4))
# Output:
# train / valid / test rows: 29304 9769 9769
# test positive rate: 0.2393
A test_size=0.25 on the 80 percent left after the first split carves out another 20 percent of the whole, giving a clean 60/20/20: 29,304 training rows, 9,769 for validation, and 9,769 sealed away for the final test. Because you stratified both splits, the test set’s positive rate is 0.2393, identical to the full data. Now establish the baseline every later stage must beat. It uses enable_categorical=True and tree_method="hist" so it can read the categoricals and the NaNs, but it does nothing about the imbalance and simply grows a fixed 300 trees. A small report helper prints the six metrics that matter at a 0.5 threshold.
def report(model, X_te, y_te, label):
    proba = model.predict_proba(X_te)[:, 1]
    pred = (proba >= 0.5).astype(int)
    acc = accuracy_score(y_te, pred)
    auc = roc_auc_score(y_te, proba)
    rec = recall_score(y_te, pred)
    prec = precision_score(y_te, pred)
    f1 = f1_score(y_te, pred)
    ap = average_precision_score(y_te, proba)
    print(f"{label}: acc={acc:.4f} auc={auc:.4f} recall={rec:.4f} "
          f"prec={prec:.4f} f1={f1:.4f} ap={ap:.4f}")
    return dict(acc=acc, auc=auc, rec=rec, prec=prec, f1=f1, ap=ap)
baseline = xgb.XGBClassifier(
    n_estimators=300, tree_method="hist",
    enable_categorical=True, random_state=42)
baseline.fit(X_train, y_train)
base = report(baseline, X_test, y_test, "Baseline")
# Output:
# Baseline: acc=0.8650 auc=0.9183 recall=0.6506 prec=0.7519 f1=0.6975 ap=0.8100
Read those numbers slowly, because they are the reason the next three stages exist. The baseline looks strong if you only glance at accuracy (0.8650) and ROC AUC (0.9183). But look at the positive-class recall: 0.6506. The model correctly flags only 65 percent of the actual high earners and misses more than a third of them. That is the imbalance trap in the flesh: because high earners are the minority, a model can score 86 percent accuracy while failing at the one job it was hired for. Recall is the weak spot, and lifting it, without wrecking the rest, is the whole arc of this capstone.
Why recall is the number to watch here
On an imbalanced problem, accuracy is dominated by the majority class. With only 24 percent positives, a model that never predicts “high earner” would still score 76 percent accuracy while catching zero of them. ROC AUC is more honest because it ranks positives against negatives across all thresholds, which is why the baseline’s 0.9183 already looks fine. But AUC does not tell you what happens at your actual decision threshold; positive-class recall does. When missing a positive is costly, recall keeps you honest, and it is the metric this whole build is organized around lifting.
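The arithmetic behind that claim can be sketched in a few lines. This is an illustration only: the 10,000-row size is an arbitrary assumption, and the labels are built directly from the lesson's 23.93 percent positive rate rather than loaded from the dataset.

```python
# Illustration only: an "always predict <=50K" classifier on labels with the
# lesson's 23.93 percent positive rate. The 10,000-row size is an arbitrary
# choice for this sketch, not the real dataset.
n_rows = 10_000
n_pos = round(n_rows * 0.2393)            # 2,393 high earners
y_true = [1] * n_pos + [0] * (n_rows - n_pos)
y_pred = [0] * n_rows                     # never predicts "high earner"

correct = sum(t == p for t, p in zip(y_true, y_pred))
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

accuracy = correct / n_rows
recall = true_pos / n_pos

print(f"accuracy={accuracy:.4f} recall={recall:.4f}")
# accuracy=0.7607 recall=0.0000
```

A model that does no work at all scores about 76 percent accuracy, which is why the baseline's 0.8650 is less impressive than it first looks and why recall is tracked alongside it.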
Stage 2: Train a Robust Model
The baseline grew exactly 300 trees because you told it to, and it ignored the imbalance entirely. Module 3 fixes both. Early stopping lets a validation metric decide how many trees to grow instead of you guessing. scale_pos_weight rebalances the objective so each of the scarce high earners counts more. The standard weight is the ratio of negatives to positives:
$$\text{scale\_pos\_weight} = \frac{\text{number of negative training rows}}{\text{number of positive training rows}}$$
Set both at once: a large tree ceiling with a gentle learning_rate=0.05, early stopping on the validation set’s AUC, and the imbalance weight.
neg = int((y_train == 0).sum())
pos = int((y_train == 1).sum())
spw = neg / pos
print("negatives / positives:", neg, "/", pos)
print("scale_pos_weight = neg / pos =", round(spw, 4))
robust = xgb.XGBClassifier(
    n_estimators=2000, learning_rate=0.05, tree_method="hist",
    enable_categorical=True, eval_metric="auc",
    scale_pos_weight=spw, early_stopping_rounds=50, random_state=42)
robust.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
print("early stopping best_iteration:", robust.best_iteration)
print("best validation AUC:", round(float(robust.best_score), 4))
rob = report(robust, X_test, y_test, "Robust (Stage 2)")
# Output:
# negatives / positives: 22292 / 7012
# scale_pos_weight = neg / pos = 3.1791
# early stopping best_iteration: 282
# best validation AUC: 0.9313
# Robust (Stage 2): acc=0.8357 auc=0.9297 recall=0.8520 prec=0.6127 f1=0.7128 ap=0.8336
This is the pivot of the project. Left to size itself, the model stopped at 282 trees and reached a validation AUC of 0.9313. But the headline is the recall: with scale_pos_weight=3.1791, positive-class recall jumps from 0.6506 to 0.8520, a 20-point leap, so the model now catches roughly six in seven high earners instead of two in three. ROC AUC edges up to 0.9297 and F1 improves from 0.6975 to 0.7128. Be honest about the cost, because there always is one: precision fell from 0.7519 to 0.6127 and accuracy dipped from 0.8650 to 0.8357. By telling the model to care more about catching positives, you made it flag more people, so more of those flags are wrong. That is the precision/recall trade-off, a dial to set by business cost, not a bug to hide.
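One way to see why the weight makes the model flag more people is to look at the simplest possible predictor, a single constant probability. The sketch below assumes scale_pos_weight acts as a multiplier on each positive row's contribution to the log-loss; the helper name best_constant_probability is hypothetical, and the counts are the training counts printed above.

```python
neg, pos = 22292, 7012            # training counts printed above
spw = neg / pos                   # scale_pos_weight used in this stage

def best_constant_probability(weight):
    """Constant p that minimizes the weighted log-loss
    -(weight * pos * log(p) + neg * log(1 - p)).
    Setting the derivative to zero gives the closed form below."""
    return weight * pos / (weight * pos + neg)

print(round(best_constant_probability(1.0), 4))   # 0.2393
print(round(best_constant_probability(spw), 4))   # 0.5
```

Without the weight, the best starting guess is the 23.93 percent base rate, well below the 0.5 threshold. With the weight set to neg / pos, both classes carry equal total weight and the starting guess sits right at 0.5, so less evidence is needed to push a row over the line. That is consistent with recall rising and precision falling in the output above.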
Before trusting that single validation split, cross-check the ensemble size with five-fold xgb.cv on the combined train-plus-validation data, using the low-level DMatrix API and the same imbalance weight.
dtrain = xgb.DMatrix(X_trainfull, label=y_trainfull, enable_categorical=True)
params_cv = dict(max_depth=6, eta=0.05, objective="binary:logistic",
                 tree_method="hist", eval_metric="auc", scale_pos_weight=spw)
cvres = xgb.cv(params_cv, dtrain, num_boost_round=2000, nfold=5,
               early_stopping_rounds=50, seed=42)
print("xgb.cv best round:", len(cvres))
print("xgb.cv mean test AUC:", round(float(cvres["test-auc-mean"].iloc[-1]), 4))
# Output:
# xgb.cv best round: 252
# xgb.cv mean test AUC: 0.9288
The two round-sizing methods agree closely, which is exactly what you want to see. Early stopping on the single validation split picked 282 rounds at a 0.9313 AUC; five-fold cross-validation independently picked 252 and estimated a mean held-out AUC of 0.9288. When your methods converge like this, you can trust the ensemble size instead of guessing n_estimators. You now have a robust, imbalance-aware model. What you have not done is optimize its other hyperparameters; they are still library defaults. That is Stage 3.
Stage 3: Tune with Optuna
Stage 2 gave you a solid, imbalance-aware model on default settings for max_depth, subsample, regularization, and the rest. Module 4 taught you to stop hand-guessing those and let Optuna search intelligently, using a Tree-structured Parzen Estimator (TPE) that learns from every trial to steer the next one. You will run a small 25-trial study that tunes seven hyperparameters, keeps scale_pos_weight fixed at the imbalance ratio, uses early stopping inside every trial, and optimizes validation ROC AUC. Seed the sampler for reproducibility and quiet its per-trial logging so the output stays clean.
import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)
def objective(trial):
    params = {
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-3, 10.0, log=True),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 10),
        "gamma": trial.suggest_float("gamma", 1e-3, 5.0, log=True),
    }
    model = xgb.XGBClassifier(
        n_estimators=2000, tree_method="hist", enable_categorical=True,
        eval_metric="auc", scale_pos_weight=spw,
        early_stopping_rounds=50, random_state=42, **params)
    model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
    return roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])
sampler = optuna.samplers.TPESampler(seed=42)
study = optuna.create_study(direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=25)
print("Trials run:", len(study.trials))
print("Best validation AUC:", round(float(study.best_value), 4))
print("Best params:")
for k, v in study.best_params.items():
    print(f" {k}: {round(float(v), 4) if isinstance(v, float) else v}")
# Output:
# Trials run: 25
# Best validation AUC: 0.9317
# Best params:
# max_depth: 6
# learning_rate: 0.0492
# subsample: 0.7474
# colsample_bytree: 0.6276
# reg_lambda: 0.2647
# min_child_weight: 5
# gamma: 0.4301
Twenty-five trials in, study.best_value is 0.9317, a hair above Stage 2’s 0.9313 on the same validation set, and study.best_params names the configuration that got there: a max_depth of 6, a small learning_rate near 0.049, roughly three-quarters row subsampling with two-thirds column subsampling, and a light touch of reg_lambda and gamma regularization. You did not guess any of it; TPE discovered it by learning from its own history over just 25 fits, not the tens of thousands a full grid would demand. Now train the final model with those winning parameters, still with early stopping and the imbalance weight, and score it on the sealed test set.
final = xgb.XGBClassifier(
    n_estimators=2000, tree_method="hist", enable_categorical=True,
    eval_metric="auc", scale_pos_weight=spw,
    early_stopping_rounds=50, random_state=42, **study.best_params)
final.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
print("final best_iteration:", final.best_iteration)
fin = report(final, X_test, y_test, "Final tuned")
# Output:
# final best_iteration: 191
# Final tuned: acc=0.8325 auc=0.9299 recall=0.8563 prec=0.6063 f1=0.7099 ap=0.8336
Early stopping settled the final model at 191 trees. On the sealed test set it scores recall 0.8563, ROC AUC 0.9299, and average precision 0.8336. Against the Stage 2 robust model that is a marginal gain in ROC AUC (0.9299 versus 0.9297) and recall (0.8563 versus 0.8520), with average precision unchanged at 0.8336 and F1 slightly lower (0.7099 versus 0.7128). Tuning did not work a miracle here, and it usually will not on top of an already-solid imbalance-aware model, but it held the recall gain and matched or slightly improved ranking quality with no manual grid to babysit. You now have the model you would ship. Before you ship it, you have to be able to explain it.
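To put the 25-trial study in perspective against an exhaustive grid over the same seven hyperparameters, here is a rough count. The grid resolution of five candidate values per hyperparameter is an assumption made for this sketch; the lesson never defines a grid.

```python
# Hypothetical grid: 5 candidate values for each of the seven tuned
# hyperparameters. The resolution (5) is an assumption for illustration.
hyperparameters = ["max_depth", "learning_rate", "subsample",
                   "colsample_bytree", "reg_lambda", "min_child_weight", "gamma"]
values_per_parameter = 5

grid_fits = values_per_parameter ** len(hyperparameters)
tpe_trials = 25  # the study size used in this lesson

print("full grid fits:", grid_fits)                  # full grid fits: 78125
print("grid / TPE ratio:", grid_fits // tpe_trials)  # grid / TPE ratio: 3125
```

Even a coarse grid multiplies out to tens of thousands of fits, each with its own early-stopped training run, which is the cost the seeded TPE search avoids.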
AUC is the right thing to optimize, but not the only thing to report
The study maximized validation ROC AUC because AUC is threshold-independent: it measures how well the model ranks high earners above low earners, which is the ranking quality you want tuning to chase. But AUC alone would let you forget the imbalance, which is why scale_pos_weight stays fixed outside the search and why the final report below still quotes recall, precision, and average precision at the actual decision threshold. Optimize the ranking; report the decision. Conflating the two is how a model that tunes well ends up disappointing in production.
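The split between "optimize the ranking" and "report the decision" can be seen on a toy example. The labels and scores below are hand-made for illustration, and the two helper functions are minimal stand-ins written for this sketch, not scikit-learn's implementations.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.
    Ties count as half a correct pair."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def recall_at(y_true, scores, threshold=0.5):
    """Share of true positives scoring at or above the threshold."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    return sum(s >= threshold for s in pos) / len(pos)

# Hand-made toy labels and scores, not taken from the lesson's model.
y = [0, 0, 1, 0, 1, 1, 0, 1]
p = [0.1, 0.3, 0.4, 0.55, 0.6, 0.7, 0.2, 0.9]
p_cubed = [s ** 3 for s in p]   # monotonic rescaling: same ranking, lower values

print(roc_auc(y, p), roc_auc(y, p_cubed))        # 0.9375 0.9375
print(recall_at(y, p), recall_at(y, p_cubed))    # 0.75 0.25
```

Cubing the scores leaves every pairwise ordering intact, so AUC does not move, yet recall at the 0.5 threshold drops from 0.75 to 0.25. Two models with identical AUC can behave very differently at the decision threshold, which is why the report carries both kinds of metric.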
Stage 4: Explain and Evaluate
A model you cannot explain is only half finished. Module 4’s SHAP gives honest, additive explanations for a tree model, both globally (which features drive the model overall) and locally (why one specific prediction came out the way it did). One subtlety matters for a classifier: TreeExplainer works in the model’s margin (log-odds) space, not probability space. The base value and the SHAP contributions add up to the raw margin, which the sigmoid then squashes into a probability:
$$\text{margin} = \text{base value} + \sum_{j=1}^{14} \phi_j, \qquad P(>50\text{K}) = \sigma(\text{margin}) = \frac{1}{1 + e^{-\text{margin}}}$$
Here $\phi_j$ is the SHAP contribution of feature $j$ for the row being explained.
Start with the global ranking: the mean absolute SHAP value per feature, computed across all 9,769 test rows.
import shap
explainer = shap.TreeExplainer(final)
shap_values = explainer.shap_values(X_test)
print("SHAP array shape:", shap_values.shape)
print("base value (margin):", round(float(explainer.expected_value), 4))
feat_names = list(X_test.columns)
mean_abs = np.abs(shap_values).mean(axis=0)
order = np.argsort(mean_abs)[::-1]
print("Global importance (mean |SHAP|, log-odds units):")
for j in order:
    print(f" {feat_names[j]:16s} {round(float(mean_abs[j]), 4)}")
# Output:
# SHAP array shape: (9769, 14)
# base value (margin): 0.0252
# Global importance (mean |SHAP|, log-odds units):
# marital-status 0.8349
# age 0.6208
# relationship 0.4661
# capital-gain 0.4582
# occupation 0.4537
# hours-per-week 0.3041
# education 0.2496
# education-num 0.2092
# capital-loss 0.1463
# sex 0.1243
# fnlwgt 0.0687
# workclass 0.0676
# race 0.0476
# native-country 0.0435
The ranking tells a coherent story about who earns more. marital-status (0.8349) moves predictions the most, followed by age (0.6208), relationship (0.4661), capital-gain (0.4582), and occupation (0.4537). That marriage, age, and occupation dominate a 1994 income model is exactly the kind of sensible, defensible pattern you want to see before trusting a model, and notice that several of the top drivers are the categorical columns you fed in natively, no encoding required. Now prove the explanations are exact, then read one.
base_margin = float(explainer.expected_value)
margins = final.predict(X_test, output_margin=True)
allmatch = np.allclose(base_margin + shap_values.sum(axis=1), margins, atol=1e-3)
print("additivity holds for all test rows (margin space):", bool(allmatch))
proba_test = final.predict_proba(X_test)[:, 1]
i = int(np.argmax(proba_test))
print(f"\nExample person: test row {i}")
print(f" predicted P(>50K): {round(float(proba_test[i]), 4)} actual label: {int(y_test.iloc[i])}")
print(f" base margin (log-odds): {round(base_margin, 4)}")
print(f" sum SHAP: {round(float(shap_values[i].sum()), 4)}")
print(f" base+sum: {round(float(base_margin + shap_values[i].sum()), 4)} model margin: {round(float(margins[i]), 4)}")
row_vals = X_test.iloc[i]
contribs = sorted(zip(feat_names, shap_values[i]), key=lambda t: abs(t[1]), reverse=True)
for name, s in contribs[:6]:
    direction = "up " if s > 0 else "down"
    print(f" {name:16s} = {str(row_vals[name]):>16} SHAP {round(float(s), 4):>+8} pushes {direction}")
# Output:
# additivity holds for all test rows (margin space): True
#
# Example person: test row 3930
# predicted P(>50K): 0.9991 actual label: 1
# base margin (log-odds): 0.0252
# sum SHAP: 6.9584
# base+sum: 6.9836 model margin: 6.9836
# capital-gain = 15024 SHAP +4.2082 pushes up
# occupation = Exec-managerial SHAP +0.5257 pushes up
# age = 53 SHAP +0.3865 pushes up
# education = Masters SHAP +0.3817 pushes up
# marital-status = Married-civ-spouse SHAP +0.3481 pushes up
# hours-per-week = 55 SHAP +0.3094 pushes up
The np.allclose line confirms the additivity property holds for all 9,769 test rows at once: base value plus the SHAP contributions equals the model’s raw margin to within the 1e-3 tolerance. Then take the most confidently-predicted high earner, test row 3930, whom the model flags as >50K with probability 0.9991 (and who truly is one). Its explanation is airtight: start at the base margin 0.0252, add the fourteen signed contributions summing to 6.9584, and you land at 6.9836, precisely the model’s margin. Reading the top contributions in plain English: this person is predicted a near-certain high earner mainly because of a large capital-gain of 15,024 (by far the strongest force, +4.21 in log-odds), an executive-managerial occupation, being 53 years old, holding a Master’s degree, being married, and working 55 hours a week. No global bar chart could have told you capital gains dominated this particular prediction; that is the local resolution SHAP buys, and it is exactly what you hand a stakeholder who asks “why this person?”
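The step from log-odds to the printed probability is worth checking by hand. The sketch below uses only the rounded values from the output above; the sigmoid helper is written for this example rather than taken from any library.

```python
import math

def sigmoid(margin):
    """Map a log-odds margin to a probability."""
    return 1.0 / (1.0 + math.exp(-margin))

# Values copied from the printed output above (rounded to four decimals).
base_margin = 0.0252
shap_sum = 6.9584

margin = base_margin + shap_sum
print(round(margin, 4))                 # 6.9836
print(round(sigmoid(base_margin), 4))   # 0.5063
print(round(sigmoid(margin), 4))        # 0.9991
```

The base margin alone corresponds to a probability of about 0.51, and the fourteen contributions together move this person to 0.9991. Because the sigmoid is nonlinear, the individual SHAP values add up in log-odds, not in probability, so read a contribution such as +4.21 as a shift in log-odds.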
Finally, line the finished model up against the Stage 1 baseline, metric by metric, on the sealed test set.
print(f"Baseline : acc={base['acc']:.4f} auc={base['auc']:.4f} recall={base['rec']:.4f} prec={base['prec']:.4f} f1={base['f1']:.4f} ap={base['ap']:.4f}")
print(f"Final tuned: acc={fin['acc']:.4f} auc={fin['auc']:.4f} recall={fin['rec']:.4f} prec={fin['prec']:.4f} f1={fin['f1']:.4f} ap={fin['ap']:.4f}")
print()
print(f"recall: {base['rec']:.4f} -> {fin['rec']:.4f} ({fin['rec']-base['rec']:+.4f})")
print(f"roc auc: {base['auc']:.4f} -> {fin['auc']:.4f} ({fin['auc']-base['auc']:+.4f})")
print(f"avg prec: {base['ap']:.4f} -> {fin['ap']:.4f} ({fin['ap']-base['ap']:+.4f})")
print(f"f1: {base['f1']:.4f} -> {fin['f1']:.4f} ({fin['f1']-base['f1']:+.4f})")
# Output:
# Baseline : acc=0.8650 auc=0.9183 recall=0.6506 prec=0.7519 f1=0.6975 ap=0.8100
# Final tuned: acc=0.8325 auc=0.9299 recall=0.8563 prec=0.6063 f1=0.7099 ap=0.8336
#
# recall: 0.6506 -> 0.8563 (+0.2057)
# roc auc: 0.9183 -> 0.9299 (+0.0117)
# avg prec: 0.8100 -> 0.8336 (+0.0236)
# f1: 0.6975 -> 0.7099 (+0.0124)
There is the answer to the client’s question, stated honestly. The finished model lifts positive-class recall from 0.6506 to 0.8563 (a full +0.2057), ROC AUC from 0.9183 to 0.9299, average precision from 0.8100 to 0.8336, and F1 from 0.6975 to 0.7099. It is not better on every metric: accuracy fell from 0.8650 to 0.8325 and precision from 0.7519 to 0.6063. If the client looked only at accuracy, they would call this a worse model and miss the entire story. What you built finds 86 percent of the high earners instead of 65 percent, and you can state exactly what that cost in precision and defend why the trade is worth it on a problem where the minority class is the whole point.
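A quick consistency check on the scorecard, as a sketch: F1 is the harmonic mean of precision and recall, so it can be recomputed from the other two columns. The inputs are the four-decimal values printed above, so a discrepancy in the last digit is expected; the helper name f1_from is hypothetical.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) copied from the scorecard above.
scorecard = {
    "Baseline":    (0.7519, 0.6506, 0.6975),
    "Final tuned": (0.6063, 0.8563, 0.7099),
}

for name, (prec, rec, reported) in scorecard.items():
    recomputed = f1_from(prec, rec)
    # 1 - precision is the share of "high earner" flags that are wrong.
    print(f"{name}: recomputed F1={recomputed:.4f} reported F1={reported:.4f} "
          f"wrong flags={1 - prec:.1%}")
```

The recomputed values match the reported ones to within rounding, and the last column restates the precision cost in plain terms: the share of flags that are wrong rises from roughly a quarter to nearly four in ten.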
Every number here traces to a decision you made
Nothing in that final scorecard is an accident. The recall gain traces to scale_pos_weight, chosen because Stage 1 measured a 24 percent positive rate. The ensemble sizes (282, 252, 191 trees) were decided by early stopping and xgb.cv, not guessed. The hyperparameters came from a seeded 25-trial Optuna study you can rerun. The explanation of row 3930 is exact to three decimals because SHAP’s additivity is a proven property, not a heuristic. That traceable chain, from a diagnosed problem to a defended number, is the real deliverable of this course, more than any single library call.
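Additivity is easy to verify by brute force on a small case. The sketch below computes exact Shapley values for a three-feature toy function by averaging marginal contributions over every feature ordering. The toy model, the all-zero background row, and the instance are all assumptions for illustration, and this enumeration is not the TreeSHAP algorithm that shap.TreeExplainer uses on the real model.

```python
from itertools import permutations

# Toy "model" in margin space with an interaction term. Purely illustrative;
# it is not the lesson's XGBoost model.
def model(x):
    return 0.5 * x["a"] + 2.0 * x["b"] + 1.5 * x["a"] * x["c"]

background = {"a": 0.0, "b": 0.0, "c": 0.0}   # reference input (assumed)
instance = {"a": 1.0, "b": 2.0, "c": 3.0}     # row being explained
features = list(instance)

def value(present):
    """Model output with `present` features taken from the instance and
    the rest taken from the background."""
    x = {f: (instance[f] if f in present else background[f]) for f in features}
    return model(x)

# Exact Shapley value: average marginal contribution over all orderings.
orderings = list(permutations(features))
shapley = {f: 0.0 for f in features}
for order in orderings:
    present = set()
    for f in order:
        before = value(present)
        present.add(f)
        shapley[f] += (value(present) - before) / len(orderings)

base = value(set())
print("base:", base)
print("contributions:", shapley)
print("base + sum:", base + sum(shapley.values()), "model:", model(instance))
```

Within each ordering the marginal contributions telescope from the base value to the full prediction, so their average must sum to the same gap. The interaction term is split evenly between a and c, which is also why a feature's SHAP value can reflect effects it shares with other features.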
The Final Report
Here is the whole build written up the way Northwind Analytics would hand it to the client, tracing each stage back to the module that taught it.
Problem. Adult Income, 48,842 rows, 14 features, predict who earns >50K. The target is imbalanced at a 23.93 percent positive rate; three columns have missing values and eight are categorical text.
Boosting Foundations (Module 1). The reason any of this works is the additive, gradient-fitting idea from Module 1: every tree in the ensemble fits the negative gradient of the log-loss, and shrunken trees accumulate into a strong classifier. That principle underlies the baseline XGBClassifier, which scored 0.8650 accuracy and 0.9183 ROC AUC but a weak 0.6506 positive-class recall, exposing the imbalance trap before any tuning began.
XGBoost in Depth (Module 2). The library itself, its regularized objective and its hyperparameters, is what Module 2 opened up. Every knob later tuned (max_depth, subsample, colsample_bytree, reg_lambda, gamma, min_child_weight) and the hist tree method and native categorical support that let the model read the data at all come straight from that module. The final model used max_depth 6, subsample 0.7474, colsample_bytree 0.6276, reg_lambda 0.2647, and gamma 0.4301.
Training Robust Models (Module 3). Module 3 turned the baseline into a model you can trust on messy data. Early stopping sized the ensemble at 282 trees (validation AUC 0.9313), and xgb.cv independently agreed at 252 rounds (mean AUC 0.9288). scale_pos_weight = 3.1791 rebalanced the objective and delivered the headline: recall leapt from 0.6506 to 0.8520, at the honest cost of precision falling from 0.7519 to 0.6127. Native enable_categorical handling fed eight categorical columns and three columns of missing values in with no manual work.
Interpretation, Tuning & Deployment (Module 4). Module 4 finished and justified the model. A seeded 25-trial Optuna study raised validation AUC to 0.9317 and produced the final configuration, trained to 191 trees for a test ROC AUC of 0.9299 and recall of 0.8563. SHAP then ranked the global drivers (marital-status 0.8349, age 0.6208, relationship 0.4661, capital-gain 0.4582, occupation 0.4537) and explained one prediction exactly, decomposing row 3930’s near-certain high-earner call (P = 0.9991) into a dominant capital-gain contribution of +4.21 log-odds plus occupation, age, education, and marital status, with additivity verified across all 9,769 test rows.
Bottom line. Baseline to final, on data the model never saw: recall 0.6506 to 0.8563 (+0.2057), ROC AUC 0.9183 to 0.9299, average precision 0.8100 to 0.8336, at a stated cost of accuracy (0.8650 to 0.8325) and precision (0.7519 to 0.6063). A model that found two-thirds of high earners now finds six in seven, and every number above is reproducible from the code in this lesson.
Practice Exercises
These are the last exercises of the course, so they are reflective rather than mechanical. There is no single “correct” cell to run; the goal is to reason about the whole build the way you would defend it to a client or a reviewer. Think each one through, then open the hint.
Exercise 1: What would you flag as the biggest risk in this model?
You are about to hand this model to the Northwind client for a real decision. Looking back across all four stages, what is the single risk you would flag most loudly, and what evidence from the lesson supports it?
Hint
Strong candidates: the model was trained on 1994 census data, so its learned relationships (marital status and relationship as top income drivers, the specific dollar meaning of capital-gain) may not transfer to any present-day population, an issue no metric in the lesson can catch. Another is the precision drop to 0.6063: at the 0.5 threshold nearly four in ten of the model’s “high earner” flags are wrong, which matters enormously if a false positive is expensive. A third is fairness: sex and race are features, and a model that keys income predictions off them may be inappropriate or unlawful for real decisions regardless of its accuracy. The best answer names a risk the metrics cannot see and explains why the honest scorecard alone is not a green light to deploy.
Exercise 2: Why report both precision and recall, and average precision too?
The final report quotes recall, precision, F1, ROC AUC, and average precision rather than settling on one headline number. Why insist on carrying several metrics through a report instead of picking the best-looking one?
Hint
Any single metric can be gamed or can hide the failure that matters. Accuracy (0.8325) looks worse than the baseline’s 0.8650 while the model is genuinely better at the real task, which is why quoting accuracy alone would mislead in the opposite direction. Recall and precision move in opposition (recall up to 0.8563, precision down to 0.6063), so reporting only one lets you flatter the model by hiding its cost, exactly the dishonesty this course has warned against. ROC AUC and average precision are threshold-independent summaries of ranking quality, and average precision is the more informative of the two on an imbalanced problem because it focuses on the minority class. Carrying all of them forces the trade-off into the open where a reader can weigh it against the actual cost of each error type, which is a decision only the client can make, not the modeler.
Exercise 3: What would you do differently for a real deployment?
This lesson produced one model and one report. If Northwind put this model into production and asked you to keep it healthy, what from across the whole course would you add beyond what you did here?
Hint
Several threads from the course extend naturally. From Module 3, you would monitor the class balance and the metrics over time and re-run xgb.cv-style checks as fresh data arrives, to catch drift before it hurts. From Module 4’s deployment lesson, you would save the model as a versioned artifact with its exact preprocessing (the category dtype mapping, the feature order) and wrap it in a documented prediction function, so the served model matches the one you evaluated. You would calibrate or tune the decision threshold to the client’s real cost of false positives versus false negatives rather than leaving it at 0.5, and you would keep SHAP in the loop so every production prediction can be explained on demand and audited for fairness. A real deployment treats the single report you wrote as the start of an ongoing process, not the finish line.
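Tuning the threshold to the client's costs can be sketched as follows. Everything here is an assumption for illustration: the scores are synthetic rather than the lesson's model output, and the 5-to-1 cost ratio is hypothetical. Real costs have to come from the client, and the threshold should be chosen on validation data, never on the sealed test set.

```python
import random

random.seed(1)  # fixed seed so the synthetic scores are repeatable

# Synthetic validation scores (NOT the lesson's model output).
y_valid = [1] * 240 + [0] * 760
scores = ([min(1.0, max(0.0, random.gauss(0.7, 0.2))) for _ in range(240)]
          + [min(1.0, max(0.0, random.gauss(0.3, 0.2))) for _ in range(760)])

# Hypothetical business costs: a missed high earner is assumed to cost
# five times as much as a wrong flag.
COST_FALSE_NEGATIVE = 5.0
COST_FALSE_POSITIVE = 1.0

def total_cost(threshold):
    """Total cost of all errors when rows scoring >= threshold are flagged."""
    fn = sum(s < threshold and t == 1 for s, t in zip(scores, y_valid))
    fp = sum(s >= threshold and t == 0 for s, t in zip(scores, y_valid))
    return fn * COST_FALSE_NEGATIVE + fp * COST_FALSE_POSITIVE

candidates = [i / 100 for i in range(5, 100, 5)]   # 0.05, 0.10, ..., 0.95
best_threshold = min(candidates, key=total_cost)

print("cost at 0.50:", total_cost(0.5))
print("best threshold:", best_threshold, "cost:", total_cost(best_threshold))
```

Because 0.5 is one of the candidates, the chosen threshold can never cost more than the default on this data. Swapping in the client's real cost ratio is the only change needed to move the operating point.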
Summary
Congratulations, you have finished the course. In this capstone you took one real, messy, imbalanced dataset from a raw download all the way to a tuned, explained, honestly evaluated model, using skills from every one of the four modules at once, and you defended every number you reported.
Key Concepts
Explore and prepare (Stage 1)
- Adult Income is 23.93 percent positive, with eight categorical columns and three columns of missing values, all diagnosed before any modeling
- The naive baseline scored a flattering 0.8650 accuracy and 0.9183 AUC but a weak 0.6506 recall, exposing the imbalance trap
Train robustly (Stage 2)
- Early stopping sized the ensemble at 282 trees; five-fold xgb.cv independently agreed at 252
- scale_pos_weight = 3.1791 lifted recall from 0.6506 to 0.8520 at the honest cost of precision falling to 0.6127
Tune (Stage 3)
- A seeded 25-trial Optuna study raised validation AUC to 0.9317 and chose max_depth 6, learning_rate 0.0492, and light regularization
- The tuned final model (191 trees) scored 0.9299 test AUC and 0.8563 recall
Explain and evaluate (Stage 4)
- SHAP ranked marital-status (0.8349), age, relationship, capital-gain, and occupation as the top drivers, and its additivity held for all 9,769 test rows in log-odds space
- Baseline to final: recall +0.2057, ROC AUC +0.0117, average precision +0.0236, with the precision and accuracy cost stated plainly
Why This Matters
Every module in this course taught its skills in isolation so you could understand them: how boosting fits gradients, how XGBoost scores a split, how to handle imbalance, how SHAP attributes a prediction. Real problems never arrive one skill at a time. This capstone was the harder and more valuable exercise of composing all of them on a single dataset, in sequence, where the quality of the result comes from the combination rather than any one technique. That is what building a model actually looks like once you leave the tutorial behind.
The deeper lesson is the one the whole course has been driving at: a model is only as trustworthy as the chain of verified decisions behind it. You did not quote an accuracy and move on. You diagnosed the imbalance, chose the weight that answered it, let the data size the ensemble, tuned with a reproducible search, explained the result feature by feature, and reported the honest trade-off with nothing hidden. Anyone can train a model that demos well. What separates a professional is the ability to say, for every number, here is why it is that value, here is what it cost, and here is why you can trust it. You can do that now.
Next Steps
Review the full Gradient Boosting & XGBoost course, all four modules, from foundations through this capstone.
Continue Building Your Skills
You began this course not knowing what a gradient boosting model really did, and you end it having taken one from a raw census file to a tuned, explained, honestly reported classifier whose every number you can defend. Along the way you built a booster from scratch to understand the mechanics, took XGBoost apart to see its regularized objective and its knobs, learned to train it reliably on imbalanced and messy data, and learned to explain, tune, and prepare it for production. This capstone wove all of it into a single build. The discipline you practiced here (diagnose before you train, let the data decide what you would otherwise guess, and report the honest trade rather than the flattering headline) is exactly what the Northwind Analytics team, and any team you join, needs from someone who ships tabular models. The libraries will keep improving and the datasets will keep changing, but that discipline is yours now, and it is what will make the models you build next ones that people can actually trust.