Lesson 4 - Regularization and Sampling

Welcome to Regularization and Sampling

In the previous lesson, Northwind Analytics tuned the structural hyperparameters that shape each tree: n_estimators, learning_rate, max_depth, min_child_weight, and gamma. Those knobs decide how many trees you grow, how fast the model learns, and how deep and confident each tree is allowed to get. They are your first line of defense against overfitting.

This lesson covers the second line of defense: the knobs that fight overfitting directly, either by penalizing complexity inside the objective or by injecting randomness into training. There are two families. Regularization (reg_lambda and reg_alpha) adds a penalty on the size of the leaf weights, pulling them toward zero so no single leaf can make a wildly confident prediction. Sampling (subsample and colsample_bytree) trains each tree on a random subset of the rows and features, so the trees disagree a little and their errors cancel out.

Northwind’s model from earlier is accurate but greedy: grown deep and long, it nearly memorizes the training districts. That makes it the perfect patient for this lesson. You will start from a deliberately overfit XGBoost model and watch each of these four knobs pull the training and test errors back together. Every number below was produced by running the code for real on the California Housing dataset.

By the end of this lesson, you will be able to:

  • Explain how reg_lambda (L2) shrinks leaf weights through the term $\lambda$ in XGBoost’s objective
  • Explain how reg_alpha (L1) can drive some leaf weights to exactly zero
  • Use subsample to train each boosting round on a random fraction of the rows (stochastic gradient boosting)
  • Use colsample_bytree to give each tree a random subset of the features, decorrelating the trees
  • Read a real train/test RMSE sweep and recognize each knob closing the overfitting gap

You should be comfortable with xgb.XGBRegressor, the train/test split, and the objective from Lesson 2 where the leaf weight was $w^* = -G/(H+\lambda)$. Let’s begin.


The Overfit Starting Point

To see regularization work, you first need something to regularize. A well-tuned model has little gap left to close, so instead Northwind builds a purposely over-eager model: many trees, grown deep, learning at a normal rate. This is the anti-pattern from Lesson 3 pushed on purpose, and it gives every knob in this lesson a visible gap to attack.

import warnings
warnings.filterwarnings("ignore")
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb

data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# A deliberately overfit model: many deep trees, no extra regularization
model = xgb.XGBRegressor(
    n_estimators=400, max_depth=8, learning_rate=0.1, random_state=42
)
model.fit(X_train, y_train)

train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

print("train RMSE:", float(round(train_rmse, 4)))
print("test  RMSE:", float(round(test_rmse, 4)))
print("gap       :", float(round(test_rmse - train_rmse, 4)))

Output:
train RMSE: 0.1043
test  RMSE: 0.4523
gap       : 0.348

There is the overfitting signature again. A training RMSE of 0.1043 means the model predicts the training districts almost perfectly, while its test RMSE sits far higher at 0.4523. That gap of 0.348 measures how much of the model’s apparent accuracy fails to carry over to unseen data, and it is exactly what the four knobs in this lesson are designed to close. Keep these three numbers in mind; every experiment below is measured against them.

Two ways to fight overfitting

Structural knobs (Lesson 3) fight overfitting by making each tree simpler: fewer, shallower, more cautious trees. The knobs in this lesson fight it two other ways. Regularization leaves the tree structure alone but shrinks the values it predicts, so even a deep tree can’t shout. Sampling leaves the values alone but hides part of the data from each tree, so the trees can’t all overfit the same noise. In practice you combine all three families; this lesson isolates each one so you can see exactly what it does.


reg_lambda: L2 Regularization on Leaf Weights

Recall the objective from Lesson 2. When XGBoost decides what value a leaf should predict, it does not just average the residuals. It computes the optimal leaf weight from the summed gradients $G$ and summed hessians $H$ of the rows that fall into that leaf, tempered by a regularization term $\lambda$:

$$w^* = -\frac{G}{H + \lambda}$$

Look at what $\lambda$ does in that denominator. With $\lambda = 0$, the leaf weight is $-G/H$, the unregularized optimum. As you increase $\lambda$, the denominator grows, so the magnitude of $w^*$ shrinks toward zero. Every leaf’s prediction is pulled in, smaller and more conservative. The same $\lambda$ also appears in the similarity score that XGBoost uses to rank splits, so a larger $\lambda$ makes the model more reluctant to carve out splits that rely on a few points. This knob is exposed as reg_lambda (the native parameter name, lambda, is a reserved word in Python, so the scikit-learn API spells it reg_lambda). Its default is 1.
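To make the shrinkage concrete, here is a minimal sketch in plain Python. It is an illustration of the formula, not XGBoost's own code. It assumes squared-error loss, where each row's gradient is prediction minus target and each hessian is 1; the function name leaf_weight and the three toy gradient values are hypothetical.

```python
def leaf_weight(gradients, hessians, reg_lambda):
    """Optimal leaf weight w* = -G / (H + lambda) for a single leaf."""
    G = sum(gradients)   # summed gradients of the rows in the leaf
    H = sum(hessians)    # summed hessians of the rows in the leaf
    return -G / (H + reg_lambda)

# Toy leaf holding three rows (made-up values for illustration).
# Under squared error: gradient = prediction - target, hessian = 1.
gradients = [-0.9, -1.1, -1.0]   # the current model under-predicts all three rows
hessians = [1.0, 1.0, 1.0]

weights = {lam: leaf_weight(gradients, hessians, lam) for lam in [0, 1, 10, 100]}
for lam, w in weights.items():
    print(f"reg_lambda={lam:<4} leaf weight={w:.4f}")
```

With reg_lambda=0 the leaf predicts the plain average correction for its rows; every larger value divides the same summed gradient by a bigger denominator, so the weight moves steadily toward zero without ever reaching it.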

Let’s sweep it on the overfit model and watch the leaf weights get reined in.

import warnings
warnings.filterwarnings("ignore")
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb

data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

for lam in [0, 1, 10, 100, 1000]:
    model = xgb.XGBRegressor(
        n_estimators=400, max_depth=8, learning_rate=0.1,
        reg_lambda=lam, random_state=42
    )
    model.fit(X_train, y_train)
    tr = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
    te = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(f"reg_lambda={lam:<5} train={float(round(tr,4)):<8} "
          f"test={float(round(te,4)):<8} gap={float(round(te-tr,4))}")

Output:
reg_lambda=0     train=0.084    test=0.453    gap=0.369
reg_lambda=1     train=0.1043   test=0.4523   gap=0.348
reg_lambda=10    train=0.1641   test=0.4464   gap=0.2823
reg_lambda=100   train=0.2638   test=0.4374   gap=0.1736
reg_lambda=1000  train=0.3665   test=0.4494   gap=0.0829

Read this table top to bottom and the story is clear. As reg_lambda climbs, the training RMSE steadily rises (0.084 to 0.3665), because the penalty forbids the model from fitting the training data as tightly. That is the point: you are trading away a little training accuracy on purpose. Meanwhile the gap collapses from 0.369 all the way down to 0.0829. The overfitting is being squeezed out.

The test RMSE, the number that actually matters, traces a gentle U. It improves from 0.4523 at the default down to its best value of 0.4374 at reg_lambda=100, then starts to creep back up at reg_lambda=1000 because by then the penalty is so heavy the model is beginning to underfit. This is the universal shape of a regularization sweep: too little and you overfit, too much and you underfit, with a sweet spot in between that only tuning can find.


reg_alpha: L1 Regularization on Leaf Weights

reg_lambda is an L2 penalty: it penalizes the square of each leaf weight, which shrinks large weights hard but never quite forces them to zero. reg_alpha is the L1 counterpart: it penalizes the absolute value of each leaf weight. The practical difference is famous from linear models. L1 has a “corner” at zero that can push small weights exactly to zero, producing sparsity: some leaves are effectively switched off and contribute nothing. Where L2 gently deflates every weight, L1 zeroes out the weakest ones entirely. Its default is 0 (off).
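One way to see where the exact zeros come from is the closed-form leaf weight when both penalties are active. The sketch below assumes the soft-thresholding form $w^* = -\mathrm{sign}(G)\,\max(|G| - \alpha, 0)/(H + \lambda)$. Treat it as an illustration of the idea, not a copy of XGBoost's source; the function name and the G and H values are hypothetical.

```python
def leaf_weight_l1_l2(G, H, reg_alpha, reg_lambda):
    """Leaf weight with an L1 threshold on G and an L2 term in the denominator."""
    if abs(G) <= reg_alpha:
        return 0.0                      # summed gradient too small: leaf is switched off
    # Move G toward zero by alpha, keeping its sign.
    shrunk = G - reg_alpha if G > 0 else G + reg_alpha
    return -shrunk / (H + reg_lambda)

# Hypothetical leaves: one with a strong summed gradient, one with a weak one.
strong_leaf = leaf_weight_l1_l2(G=-12.0, H=5.0, reg_alpha=2.0, reg_lambda=1.0)
weak_leaf = leaf_weight_l1_l2(G=-1.5, H=5.0, reg_alpha=2.0, reg_lambda=1.0)
l2_only = leaf_weight_l1_l2(G=-1.5, H=5.0, reg_alpha=0.0, reg_lambda=1.0)

print("strong leaf, L1 + L2:", round(strong_leaf, 4))
print("weak leaf,   L1 + L2:", weak_leaf)
print("weak leaf,   L2 only:", l2_only)
```

The weak leaf keeps a small nonzero weight under L2 alone, but once reg_alpha exceeds the magnitude of its summed gradient the weight is exactly zero, while the strong leaf is only trimmed.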

Let’s sweep reg_alpha on the same overfit model.

import warnings
warnings.filterwarnings("ignore")
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb

data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

for a in [0, 0.1, 1, 10, 100]:
    model = xgb.XGBRegressor(
        n_estimators=400, max_depth=8, learning_rate=0.1,
        reg_alpha=a, random_state=42
    )
    model.fit(X_train, y_train)
    tr = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
    te = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(f"reg_alpha={a:<5} train={float(round(tr,4)):<8} "
          f"test={float(round(te,4)):<8} gap={float(round(te-tr,4))}")

Output:
reg_alpha=0     train=0.1043   test=0.4523   gap=0.348
reg_alpha=0.1   train=0.1064   test=0.4548   gap=0.3484
reg_alpha=1     train=0.1348   test=0.447    gap=0.3122
reg_alpha=10    train=0.3061   test=0.4448   gap=0.1387
reg_alpha=100   train=0.4948   test=0.5282   gap=0.0334

The same trade-off appears, and reg_alpha bites hard once it gets large. Small values do little: reg_alpha=0.1 is practically indistinguishable from the baseline, and reg_alpha=1 trims the gap only to 0.3122. By reg_alpha=10 the training RMSE has jumped to 0.3061, the gap has fallen to 0.1387, and the test RMSE has improved to 0.4448, its best in this sweep. Push to reg_alpha=100 and the model over-corrects: with so many leaf weights zeroed out, the training RMSE balloons to 0.4948 and the test RMSE gets worse at 0.5282. That is underfitting caused by too aggressive an L1 penalty. Notice that the tiny final gap of 0.0334 is not a good sign here: the train and test errors met because both got bad. A shrinking gap is only worth having when the test error shrinks with it.
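The difference between judging by the gap and judging by the test error can be checked in a few lines. The sketch below simply re-reads the reg_alpha sweep results printed above (the numbers are copied from that output, not recomputed) and picks a winner under each rule; the variable names are illustrative.

```python
# (reg_alpha, train RMSE, test RMSE), copied from the sweep output above
sweep = [
    (0,   0.1043, 0.4523),
    (0.1, 0.1064, 0.4548),
    (1,   0.1348, 0.4470),
    (10,  0.3061, 0.4448),
    (100, 0.4948, 0.5282),
]

# Rule 1: pick the setting with the smallest train/test gap.
smallest_gap = min(sweep, key=lambda row: row[2] - row[1])
# Rule 2: pick the setting with the lowest test RMSE.
lowest_test = min(sweep, key=lambda row: row[2])

print("smallest gap -> reg_alpha =", smallest_gap[0], "test RMSE =", smallest_gap[2])
print("lowest test  -> reg_alpha =", lowest_test[0], "test RMSE =", lowest_test[2])
```

Selecting on the gap alone lands on reg_alpha=100, the worst test RMSE in the sweep, while selecting on test RMSE lands on reg_alpha=10.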

reg_alpha and reg_lambda are complementary, not rivals

You do not have to choose one. XGBoost applies both penalties at once, and they do different jobs: reg_lambda (L2) smoothly shrinks all leaf weights, while reg_alpha (L1) can prune the weakest weights to exactly zero. A common tuning strategy is to lean on reg_lambda as your default workhorse and add a touch of reg_alpha when you suspect many leaves are contributing noise. The right pair of values is found by searching, which is exactly what Lesson 5’s guided project and the tuning tools in Module 4 are for.


subsample: Row Sampling per Boosting Round

The next two knobs fight overfitting a completely different way: instead of penalizing the model, they hide data from it. subsample sets the fraction of the training rows that each boosting round is allowed to use, drawn at random. With subsample=0.5, every tree is grown on a fresh random half of the data. Because each tree sees a different slice, the trees vary more, and combining varied trees cancels out some of the variance. It is the same insight behind bagging, now bolted onto boosting. This variant even has a name: stochastic gradient boosting. The default is 1.0 (use every row).
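As a rough picture of what "a fresh random half" means, the sketch below draws row indices for a few boosting rounds using only the standard library. It assumes fixed-size sampling without replacement; XGBoost's internal sampler may select rows differently, so this illustrates the idea rather than the implementation. The helper name sample_rows and the ten-row toy dataset are hypothetical.

```python
import random

def sample_rows(n_rows, subsample, rng):
    """Pick a random subset of row indices for one boosting round."""
    k = max(1, int(n_rows * subsample))          # how many rows this round may see
    return sorted(rng.sample(range(n_rows), k))  # drawn without replacement

rng = random.Random(42)   # seeded so the draws are repeatable
n_rows = 10               # toy dataset size

# Each round redraws its own subset, so consecutive trees see different rows.
rounds = [sample_rows(n_rows, subsample=0.5, rng=rng) for _ in range(3)]
for i, rows in enumerate(rounds):
    print(f"round {i}: rows {rows}")
```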

Let’s sweep it.

import warnings
warnings.filterwarnings("ignore")
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb

data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

for s in [1.0, 0.8, 0.5]:
    model = xgb.XGBRegressor(
        n_estimators=400, max_depth=8, learning_rate=0.1,
        subsample=s, random_state=42
    )
    model.fit(X_train, y_train)
    tr = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
    te = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(f"subsample={s:<4} train={float(round(tr,4)):<8} "
          f"test={float(round(te,4)):<8} gap={float(round(te-tr,4))}")

Output:
subsample=1.0  train=0.1043   test=0.4523   gap=0.348
subsample=0.8  train=0.0922   test=0.4492   gap=0.357
subsample=0.5  train=0.1088   test=0.4552   gap=0.3464

The effect here is subtle, and it is worth being honest about why. Moving to subsample=0.8 nudges the test RMSE from 0.4523 to 0.4492, a small improvement, while subsample=0.5 slightly overshoots to 0.4552. A plausible reading is that this fairly clean dataset leaves little noise-driven variance for subsampling to remove, so at 0.5 the cost of showing each tree less data outweighs the benefit. On noisier or higher-variance datasets, where trees are more prone to chase spurious patterns, row subsampling typically pays off much more. The lesson is not “always subsample,” it is “subsample is a variance knob whose best value depends on the data, so tune it rather than trusting the default.”


colsample_bytree: Feature Sampling per Tree

colsample_bytree is the column analogue of subsample. Instead of hiding rows, it hides features: each tree is built using only a random subset of the columns. With colsample_bytree=0.5 and eight features, every tree gets to consider only four of them, chosen at random. This decorrelates the trees. If one feature is very strong, without column sampling every tree would split on it first and the trees would look alike; forcing some trees to ignore it makes them explore other features and disagree more, which again helps the ensemble average out its errors.
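The same idea can be sketched for columns. The snippet below draws four of the eight California Housing feature names for each of a few trees, assuming simple sampling without replacement; how XGBoost picks and rounds the subset internally is not modeled here, and the helper name sample_features is hypothetical.

```python
import random

# The eight California Housing feature names, as exposed by scikit-learn.
feature_names = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
                 "Population", "AveOccup", "Latitude", "Longitude"]

def sample_features(names, colsample_bytree, rng):
    """Pick the random subset of features one tree is allowed to split on."""
    k = max(1, int(len(names) * colsample_bytree))
    return rng.sample(names, k)   # drawn without replacement

rng = random.Random(0)            # seeded so the draws are repeatable
trees = [sample_features(feature_names, 0.5, rng) for _ in range(4)]

for i, cols in enumerate(trees):
    note = "" if "MedInc" in cols else "  (must do without MedInc)"
    print(f"tree {i}: {cols}{note}")
```

Any tree whose draw leaves out the strongest feature has to build its splits from the remaining columns, which is where the decorrelation comes from.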

import warnings
warnings.filterwarnings("ignore")
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb

data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

for c in [1.0, 0.7, 0.5]:
    model = xgb.XGBRegressor(
        n_estimators=400, max_depth=8, learning_rate=0.1,
        colsample_bytree=c, random_state=42
    )
    model.fit(X_train, y_train)
    tr = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
    te = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(f"colsample_bytree={c:<4} train={float(round(tr,4)):<8} "
          f"test={float(round(te,4)):<8} gap={float(round(te-tr,4))}")

Output:
colsample_bytree=1.0  train=0.1043   test=0.4523   gap=0.348
colsample_bytree=0.7  train=0.1018   test=0.4512   gap=0.3494
colsample_bytree=0.5  train=0.103    test=0.4598   gap=0.3568

Again the effect is modest on this dataset, and for the same reason. California Housing has only eight features, several of which (median income above all) are strongly predictive, so hiding half of them from each tree removes real signal rather than just noise. colsample_bytree=0.7 gives a tiny improvement to 0.4512, while dropping to 0.5 hurts. Column sampling shines on wide datasets with dozens or hundreds of features, many of them weak or correlated, exactly the setting where decorrelating the trees prevents them from all leaning on the same handful of columns. As always, the right fraction is a tuning decision.

The figure below shows how subsample and colsample_bytree each carve a random slice out of the data for a tree.

[Figure: two side-by-side panels over a grid representing the training data. Left panel, subsample = 0.5: four of eight horizontal rows are highlighted in green and fed into a tree, with a note that the next boosting round redraws the rows. Right panel, colsample_bytree = 0.5: three of six vertical columns are highlighted in orange and fed into a tree, with a note that the next tree redraws the columns.]
subsample picks a random fraction of the rows for each boosting round; colsample_bytree picks a random subset of the features for each tree. Both make the trees differ so their errors cancel.

colsample_bylevel and colsample_bynode too

colsample_bytree is the coarsest of three column-sampling knobs. XGBoost also offers colsample_bylevel, which resamples the feature subset at each depth level of a tree, and colsample_bynode, which resamples at every single split. They multiply together: setting all three to 0.5 means each split considers roughly $0.5 \times 0.5 \times 0.5 = 0.125$ of the features. Most of the time colsample_bytree alone is enough; reach for the finer-grained versions only on very wide datasets where you want even more decorrelation between splits.
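The multiplication is easy to check, and it shows why stacking all three knobs is aggressive on a narrow dataset. This is only the arithmetic; how XGBoost rounds the resulting feature count at each stage is not modeled, and the function name is hypothetical.

```python
def effective_column_fraction(bytree, bylevel, bynode):
    """Approximate share of features a single split gets to consider."""
    return bytree * bylevel * bynode

n_features = 8   # California Housing has eight features
fraction = effective_column_fraction(0.5, 0.5, 0.5)

print(f"fraction per split: {fraction}")
print(f"roughly {fraction * n_features:g} of {n_features} features per split")
```

On eight features that works out to about one candidate feature per split, which is why the finer-grained knobs are better suited to wide datasets.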


Practice Exercises

Try these before checking the hints. Each one reruns a small piece of the experiments above so the effect is in your own hands.

Exercise 1: Find reg_lambda’s Sweet Spot

Using the overfit base model (n_estimators=400, max_depth=8, learning_rate=0.1, random_state=42), fit it with reg_lambda set to 10, 100, and 300. Print the test RMSE for each and report which value gives the lowest test error.

import warnings
warnings.filterwarnings("ignore")
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb

data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Your code here

Hint

Loop over [10, 100, 300], build xgb.XGBRegressor(..., reg_lambda=lam), fit, and print np.sqrt(mean_squared_error(y_test, model.predict(X_test))) cast to a clean float. You should see reg_lambda=100 give the best test RMSE, around 0.4374, matching the U-shape from the lesson, with 300 slightly worse as the penalty starts to underfit.

Exercise 2: Watch the Train-Test Gap Close

Fit the overfit base model twice: once with no extra regularization (reg_lambda=1, the default) and once with reg_lambda=100. For each, print the training RMSE, the test RMSE, and the gap between them. Explain in one sentence why the training RMSE goes up when you regularize.

import warnings
warnings.filterwarnings("ignore")
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb

data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Your code here

Hint

Compute both RMSEs for each model and subtract. The default should give train 0.1043, test 0.4523, gap 0.348; reg_lambda=100 should give train 0.2638, test 0.4374, gap 0.1736. The training RMSE rises because the $\lambda$ term in $w^* = -G/(H+\lambda)$ shrinks every leaf weight, so the model is no longer allowed to fit the training points as tightly, which is exactly what stops it from memorizing noise.

Exercise 3: Combine Regularization and Sampling

Build one model that uses several knobs at once: reg_lambda=10, subsample=0.8, and colsample_bytree=0.7, on top of the overfit base. Print its train and test RMSE and compare the gap to the default-settings baseline’s gap of 0.348.

import warnings
warnings.filterwarnings("ignore")
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb

data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Your code here

Hint

Pass all three arguments into a single xgb.XGBRegressor(...). You should get roughly train 0.1574, test 0.4417, gap 0.2843, a clear improvement on the baseline’s 0.348 gap. The knobs stack: modest amounts of L2 regularization plus row and column sampling together close more of the gap than the mild sampling did alone, without pushing any single knob to an extreme.


Summary

You have met the second family of XGBoost hyperparameters, the ones that attack overfitting directly rather than by reshaping the trees. Let’s review.

Key Concepts

Regularization shrinks leaf weights

  • reg_lambda (L2) enters the objective through $\lambda$ in $w^* = -G/(H+\lambda)$; a larger $\lambda$ inflates the denominator and pulls every leaf weight toward zero
  • On the overfit model, reg_lambda raised training RMSE from 0.084 to 0.3665 while shrinking the gap from 0.369 to 0.0829, with the best test RMSE of 0.4374 at reg_lambda=100
  • reg_alpha (L1) penalizes the absolute value of leaf weights and can drive some to exactly zero (sparsity); reg_alpha=10 gave test RMSE 0.4448, but reg_alpha=100 underfit at 0.5282

Sampling injects randomness

  • subsample trains each boosting round on a random fraction of the rows (stochastic gradient boosting); subsample=0.8 slightly improved test RMSE to 0.4492
  • colsample_bytree gives each tree a random subset of the features, decorrelating the trees; colsample_bytree=0.7 nudged test RMSE to 0.4512
  • Both effects were modest on the eight-feature California Housing data but matter far more on noisy or wide datasets; colsample_bylevel and colsample_bynode offer even finer-grained column sampling

Reading a regularization sweep

  • Strengthening the regularization penalties raises training error on purpose; the goal is a lower test error, not merely a smaller gap
  • Test error traces a U-shape as each knob is strengthened: too little overfits, too much underfits, and the sweet spot is found by tuning

Why This Matters

Regularization and sampling are what make XGBoost robust rather than just powerful. A boosting model with enough deep trees can memorize almost any training set; these four knobs are how you stop it, and understanding which mechanism each one uses lets you reach for the right tool. When your training error is tiny but your test error is stuck, you now know to shrink the leaf weights with reg_lambda or reg_alpha; when your trees look suspiciously alike and all lean on one feature, you know to decorrelate them with subsample and colsample_bytree.

Just as important, you saw the honest picture: a smaller train-test gap is only good news when the test error falls with it, and the right value for every knob depends on the dataset. That is why these values are normally found by search rather than set by hand. In the next lesson you will put all of Module 2’s hyperparameters together and search for a good combination systematically, turning a baseline XGBoost model into a measurably better one on real data.


Next Steps

You can now name every knob XGBoost gives you to fight overfitting and explain what each one does to the model. The natural next step is to stop tuning them one at a time and search for a strong combination on a real problem.

Guided Project: Tuning XGBoost on Real Data

Put every hyperparameter from this module together and tune XGBoost from a baseline to a measurably better model.



Continue Building Your Skills

You have completed the tour of XGBoost’s hyperparameters, from the structural knobs of Lesson 3 to the regularization and sampling knobs here. The single habit worth carrying forward is the one every sweep in this lesson rewarded: change one knob, measure train and test error, and judge the result by whether the test error improves, not by whether the gap looks small. Rerun the reg_lambda sweep yourself and try extending it: add a value between 100 and 1000, or combine a mid-range reg_lambda with subsample=0.8, and watch how the U-shaped curve shifts. Building that intuition by hand now is exactly what will let the systematic tuning in the next lesson feel like a natural extension rather than a black box.
