Training Robust Models on DATATWEETS

https://datatweets.com/courses/gradient-boosting/training-robust-models/
Copyright (c) 2025 Datatweets
Sun, 05 Jul 2026 09:00:00 +0200

Lesson 1 - Early Stopping and Evaluation Sets
https://datatweets.com/courses/gradient-boosting/training-robust-models/lesson-1-early-stopping-and-evaluation-sets/
Sun, 05 Jul 2026 09:00:00 +0200

Open Module 3 by handing the n_estimators decision to your data. You will set a deliberately large tree cap (2000), hand XGBoost a validation set with an eval_metric, and let early_stopping_rounds halt training once validation RMSE stops improving. On the real California Housing data with a train/validation/test split, the model stops at best_iteration 1614 with a best validation RMSE of 0.4632 and a final test RMSE of 0.4601, instead of blindly running all 2000 trees. You will do this with both the scikit-learn API (early stopping on the constructor) and the native xgb.train API, and learn why you must never early-stop on your test set.

Lesson 2 - Cross-Validation with xgb.cv
https://datatweets.com/courses/gradient-boosting/training-robust-models/lesson-2-cross-validation-with-xgb-cv/
Sun, 05 Jul 2026 09:00:00 +0200

A single train/validation split is noisy: on California Housing the same XGBoost model swings from 0.4550 to 0.4715 test RMSE just by changing which rows land in the test set. This lesson replaces that one number with 5-fold cross-validation using XGBoost’s native xgb.cv, which returns a per-round history of train and test RMSE across folds. Combined with early_stopping_rounds it automatically settles on 1055 boosting rounds, where the cross-validated test RMSE is 0.4591 plus or minus 0.0112 across the five folds.
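
The round-selection idea behind early stopping on a cross-validated history can be illustrated with a small pure-Python sketch. This is not XGBoost's internal implementation: the function name best_round, the patience argument (standing in for early_stopping_rounds), and the toy RMSE values are all hypothetical, chosen only to show how a patience rule picks the round with the best fold-averaged score.

```python
from statistics import mean


def best_round(fold_histories, patience):
    """Pick the boosting round with the lowest mean validation RMSE.

    fold_histories: non-empty list of per-fold lists, one RMSE per round
    (lower is better). Scanning stops once `patience` consecutive rounds
    have passed without a new best mean score.
    """
    n_rounds = min(len(history) for history in fold_histories)
    best_idx, best_score = 0, float("inf")
    for i in range(n_rounds):
        # Average the metric across folds for this round.
        score = mean(history[i] for history in fold_histories)
        if score < best_score:
            best_idx, best_score = i, score
        elif i - best_idx >= patience:
            break  # no improvement for `patience` rounds: stop scanning
    return best_idx, best_score


# Hypothetical two-fold RMSE histories (not taken from the lesson).
folds = [
    [0.60, 0.50, 0.48, 0.49, 0.50, 0.51, 0.40],
    [0.62, 0.52, 0.46, 0.49, 0.50, 0.51, 0.40],
]
idx, score = best_round(folds, patience=3)
print(f"best round {idx}, mean RMSE {score:.2f}")  # best round 2, mean RMSE 0.47
```

Note that the last round in the toy data has the lowest score of all, yet the rule never reaches it: stopping after a fixed number of non-improving rounds trades a possible later gain for a bounded training budget.
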
Retraining on all the training data at that round reaches 0.4469 RMSE on the untouched held-out test set.

Lesson 3 - Handling Imbalanced Data
https://datatweets.com/courses/gradient-boosting/training-robust-models/lesson-3-handling-imbalanced-data/
Sun, 05 Jul 2026 09:00:00 +0200

When only about 24 percent of people earn more than 50K, a default XGBClassifier scores a comfortable 84.95 percent accuracy yet catches barely half of the actual high earners (recall 0.50). This lesson shows why accuracy hides that failure, why ROC AUC (0.878), average precision (0.755), and per-class recall tell the real story, and how setting scale_pos_weight to the negative-to-positive ratio (about 3.18) raises positive-class recall from 0.5004 to 0.7772, at a precision cost from 0.79 to 0.53. You will fit every model for real and read the honest trade-off.

Lesson 4 - Missing Values and Categorical Features
https://datatweets.com/courses/gradient-boosting/training-robust-models/lesson-4-missing-values-and-categorical-features/
Sun, 05 Jul 2026 09:00:00 +0200

Most models force you to fill in missing values and one-hot encode categories before training. XGBoost requires neither. In this lesson you inject 20 percent missing values into a numeric feature and watch XGBoost train and predict without a single imputer (test AUC 0.8781 versus 0.8794 with no gaps), then train directly on Adult Income’s real category columns with enable_categorical=True.
Native categorical splits reach test AUC 0.9300 from just 14 columns, matching one-hot encoding’s 0.9308 while keeping the feature count at 14 instead of exploding it to 105.

Lesson 5 - Guided Project: A Robust Training Pipeline
https://datatweets.com/courses/gradient-boosting/training-robust-models/lesson-5-guided-project-a-robust-training-pipeline/
Sun, 05 Jul 2026 09:00:00 +0200

Bring all of Module 3 together on the real Adult Income dataset, an imbalanced classification problem with a ~24 percent positive rate, genuine missing values, and eight categorical columns. You establish a default XGBClassifier baseline that scores 0.9183 test ROC AUC but only 0.6506 recall on the >50K minority class, then add early stopping and xgb.cv to size the ensemble, set scale_pos_weight to rebalance the objective, and combine everything into one final robust model. The result lifts positive-class recall from 0.6506 to 0.8512 and nudges ROC AUC from 0.9183 to 0.9290, an honest trade that costs some precision but delivers a model that actually finds the minority class.
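
The negative-to-positive ratio used for scale_pos_weight is simple arithmetic, and a minimal sketch may make it concrete. The helper name and the toy labels below are hypothetical (the labels are not the Adult Income data), so the printed value only approximates the roughly 3.18 quoted for the real class counts.

```python
def scale_pos_weight(labels):
    """Return the negative-to-positive count ratio for binary 0/1 labels."""
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0:
        raise ValueError("no positive examples; the ratio is undefined")
    return negatives / positives


# Hypothetical toy labels with a 24 percent positive rate.
toy_labels = [1] * 24 + [0] * 76
weight = scale_pos_weight(toy_labels)
print(round(weight, 2))  # 3.17
```

The resulting number would typically be passed as the scale_pos_weight parameter of XGBClassifier, and it should be computed from the training labels only so that the held-out data does not influence the setting.
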