Lesson 3 - Hyperparameter Tuning with Optuna
Welcome to Hyperparameter Tuning with Optuna
In Module 2’s guided project, Northwind Analytics tuned XGBoost the way most people first learn to: with grid search and random search. Those got the job done, but they left a bad taste. Grid search made the team enumerate every combination by hand and watch the number of fits explode the moment they added a fourth or fifth hyperparameter. Random search was faster but felt like throwing darts in the dark, since each new setting was drawn without any memory of how the earlier ones scored.
In this lesson you will replace that guesswork with a smarter approach: Optuna, a modern hyperparameter-optimization library whose default sampler uses Bayesian optimization (specifically a Tree-structured Parzen Estimator, or TPE) to learn from every trial it has already run and steer the next one toward promising settings. You will tune XGBoost on the real California Housing data using a proper three-way split, tune only on the validation set, and report the winner exactly once on a held-out test set. You will also compare that tuned model honestly against a sensible default-XGBoost baseline. Every number below was produced by running the code for real with a seeded sampler, so your own run should reproduce them as long as your library versions match; small differences across XGBoost or Optuna releases are possible.
By the end of this lesson, you will be able to:
- Explain why grid search scales badly (combinatorial explosion) and why random search, though better, is still blind to past results
- Describe what Optuna’s default TPE / Bayesian sampler does differently: it uses the history of trials to propose the next one
- Assemble the four building blocks of an Optuna study: a study, an objective function, trial.suggest_* calls that define the search space, and study.optimize
- Run a real 30-trial study over six XGBoost hyperparameters and read study.best_params and study.best_value
- Refit the winning configuration and report its test RMSE, comparing it fairly against a default-XGBoost baseline
You should be comfortable with the XGBoost scikit-learn API, RMSE, and the idea of a train/validation/test split. Let’s begin.
Why Grid and Random Search Run Out of Road
Grid search is the most literal way to tune a model: you list a handful of candidate values for each hyperparameter and try every combination. It is exhaustive and easy to reason about, but it has a fatal scaling problem. If you give it 6 values for max_depth, 6 for learning_rate, 5 for subsample, 5 for colsample_bytree, 5 for reg_lambda, and 5 for min_child_weight, you are asking for $6 \times 6 \times 5 \times 5 \times 5 \times 5 = 22{,}500$ model fits. That is the combinatorial explosion: the cost grows as the product of the per-parameter counts, so each hyperparameter you add multiplies the work. Northwind cannot afford 22,500 XGBoost fits just to feel around.
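To make the arithmetic concrete, here is a minimal sketch that multiplies the per-parameter counts from the paragraph above. The dictionary name grid_sizes and the extra seventh parameter (gamma, with 5 values) are illustrative assumptions, not part of the lesson's search.

```python
import math

# Number of candidate values per hyperparameter, as listed in the text above
grid_sizes = {
    "max_depth": 6,
    "learning_rate": 6,
    "subsample": 5,
    "colsample_bytree": 5,
    "reg_lambda": 5,
    "min_child_weight": 5,
}

# A full grid fits one model per combination, so the cost is the product
total_fits = math.prod(grid_sizes.values())
print("full grid fits:", total_fits)

# Hypothetical: adding a seventh parameter with 5 values multiplies the cost by 5
total_fits_with_seventh = total_fits * 5
print("with a seventh parameter:", total_fits_with_seventh)
```

One more five-value parameter turns 22,500 fits into 112,500, which is why the grid stops being practical so quickly.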
Random search fixes the cost by sampling a fixed budget of random combinations instead of the full grid, and it is genuinely better: it explores each hyperparameter at many distinct values rather than being trapped on a coarse lattice. But it has its own blind spot. Every draw is independent, so random search never learns. If trials 1 through 10 all quietly agree that a tiny learning_rate is hurting you, trial 11 is just as likely to pick a tiny learning_rate again. It has no memory.
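The "no memory" point can be shown with a small sketch. The random_search helper, the two toy objectives, and the sampling ranges below are hypothetical stand-ins (no model is trained); the only thing the example demonstrates is that the proposed configurations do not depend on the scores that came back.

```python
import random

def random_search(objective, n_trials, seed):
    """Draw n_trials independent configurations and score each one."""
    rng = random.Random(seed)
    history = []
    for _ in range(n_trials):
        config = {
            "max_depth": rng.randint(3, 10),
            # Illustrative log-scale draw, roughly 0.01 to 0.32
            "learning_rate": 10 ** rng.uniform(-2.0, -0.5),
        }
        # The score is recorded but never used to choose the next draw
        history.append((config, objective(config)))
    return history

# Two toy objectives with opposite preferences (hypothetical, for illustration)
def prefers_small_lr(config):
    return config["learning_rate"]

def prefers_large_lr(config):
    return -config["learning_rate"]

run_a = random_search(prefers_small_lr, n_trials=11, seed=42)
run_b = random_search(prefers_large_lr, n_trials=11, seed=42)

configs_a = [config for config, _ in run_a]
configs_b = [config for config, _ in run_b]
same_configs = configs_a == configs_b
print("identical proposals despite opposite feedback:", same_configs)
```

Both runs propose exactly the same eleven configurations even though the feedback points in opposite directions, which is the blind spot a history-aware sampler is meant to remove.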
Bayesian optimization is the idea that closes that gap. It keeps a running model of how the objective (here, validation RMSE) depends on the hyperparameters, updates that model after every trial, and uses it to propose the next configuration where improvement looks most likely. Optuna’s default sampler is a Tree-structured Parzen Estimator (TPE), one practical flavor of this idea: it separates the settings that produced good scores from those that produced bad ones and samples the next trial from the region that looks good. In short, grid search is exhaustive but explosive, random search is cheap but forgetful, and TPE is cheap and it remembers.
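The good-versus-bad split can be sketched in a few lines of NumPy. This is a deliberately simplified, one-dimensional toy and not Optuna's implementation: the function names, the gamma fraction, the bandwidth, the candidate count, and the quadratic toy objective are all assumptions chosen for illustration.

```python
import numpy as np

def kde(points, x, bandwidth):
    """Kernel density estimate: average of Gaussian bumps centred on `points`."""
    z = (x[:, None] - points[None, :]) / bandwidth
    return np.exp(-0.5 * z**2).mean(axis=1) / (bandwidth * np.sqrt(2.0 * np.pi))

def propose_next(xs, scores, rng, gamma=0.25, n_candidates=24, bandwidth=0.3):
    """Toy TPE-style step for a single parameter (lower score is better)."""
    order = np.argsort(scores)
    n_good = max(1, int(np.ceil(gamma * len(xs))))
    good = xs[order[:n_good]]   # best-scoring fraction of the history
    bad = xs[order[n_good:]]    # everything else
    # Draw candidates near the good points ...
    candidates = rng.choice(good, size=n_candidates) + rng.normal(
        0.0, bandwidth, size=n_candidates
    )
    # ... and keep the one that looks most "good" relative to "bad"
    ratio = kde(good, candidates, bandwidth) / (kde(bad, candidates, bandwidth) + 1e-12)
    return float(candidates[np.argmax(ratio)])

# Hypothetical history: 21 past trials of a toy objective with its optimum at 2.1
xs = np.linspace(-5.0, 5.0, 21)
scores = (xs - 2.1) ** 2

next_x = propose_next(xs, scores, np.random.default_rng(0))
print("next proposal:", round(next_x, 3))
```

The proposal lands inside the region where past trials scored well instead of anywhere in the full range from -5 to 5, which is the behavior that separates this family of samplers from random search.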
Fewer trials, not zero thinking
TPE does not read your mind. It still needs a well-chosen search space (sensible ranges, log scale where appropriate) and enough trials to build up a useful history; Optuna's TPESampler draws its first 10 trials at random by default (n_startup_trials=10) before it starts exploiting what it has learned. What it buys you is efficiency: it typically reaches a good configuration in far fewer trials than random search would need, because it stops re-testing regions that have already looked bad. You are trading a smarter search strategy for a smaller compute bill.
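To see why "log scale where appropriate" matters, the sketch below compares uniform sampling with log-uniform sampling over the range 0.01 to 0.3. It uses plain NumPy rather than Optuna, and the 0.03 threshold and sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
low, high, n_draws = 0.01, 0.3, 100_000

# Uniform on the raw scale
uniform_draws = rng.uniform(low, high, size=n_draws)

# Uniform on the log scale, then mapped back (the idea behind log=True)
log_draws = np.exp(rng.uniform(np.log(low), np.log(high), size=n_draws))

# Share of draws that land in the small-value region below 0.03
share_uniform = float(np.mean(uniform_draws < 0.03))
share_log = float(np.mean(log_draws < 0.03))
print("uniform share below 0.03:", round(share_uniform, 3))
print("log-scale share below 0.03:", round(share_log, 3))
```

In expectation, about 7% of uniform draws fall below 0.03, against about 32% of log-scale draws, so a log scale spends far more of the budget distinguishing 0.01 from 0.03.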
The Four Building Blocks of an Optuna Study
An Optuna run has exactly four moving parts, and once you can name them the code reads itself:
- A study is the whole optimization session. You create it with optuna.create_study(direction="minimize") because we are minimizing RMSE (use "maximize" for a score like accuracy or $R^2$).
- An objective function is the thing you are optimizing. It takes a single trial argument, trains a model with that trial’s proposed settings, and returns the number to minimize (here, validation RMSE).
- trial.suggest_* calls, made inside the objective, define the search space one hyperparameter at a time. trial.suggest_int("max_depth", 3, 10) says “pick an integer between 3 and 10”; trial.suggest_float("learning_rate", 0.01, 0.3, log=True) says “pick a float between 0.01 and 0.3, searched on a log scale.”
- study.optimize(objective, n_trials=30) runs the loop: it calls your objective 30 times, and on each call the sampler uses everything it has learned so far to pick the next set of suggest_* values.
Before we optimize anything, we need data and a fair way to score. We use a three-way split: a training set the models learn from, a validation set the study scores each trial on, and a test set we lock away and touch only at the very end. Tuning on the validation set and reporting on the test set is what keeps us honest: if we tuned and reported on the same data, we would be choosing the settings that best fit that data by luck and calling it real. We also fit a plain default-XGBoost baseline now, so we know the number to beat.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
data = fetch_california_housing()
# First split off a held-out TEST set we never touch during tuning
X_temp, X_test, y_temp, y_test = train_test_split(
data.data, data.target, test_size=0.2, random_state=42
)
# Then split the rest into TRAIN (fit models on this) and VALID (score trials on this)
X_train, X_valid, y_train, y_valid = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=42
)
print("train:", X_train.shape[0], "valid:", X_valid.shape[0], "test:", X_test.shape[0])
# A sensible default-XGBoost baseline to beat
baseline = xgb.XGBRegressor(n_estimators=400, random_state=42)
baseline.fit(X_train, y_train)
base_valid_rmse = np.sqrt(mean_squared_error(y_valid, baseline.predict(X_valid)))
print("baseline validation RMSE:", float(round(base_valid_rmse, 4)))

train: 12384 valid: 4128 test: 4128
baseline validation RMSE: 0.4798

We now have 12,384 training rows to fit on, 4,128 validation rows to score each trial, and 4,128 test rows sealed off for the finale. The default XGBoost model scores a validation RMSE of 0.4798. That is our line in the sand: the tuning is only worth it if Optuna can beat it.
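The two test_size values are easy to misread, so here is the split arithmetic on its own. It is a sketch in plain integers rather than a call to scikit-learn; the row count of 20,640 is the size of the California Housing dataset, and the variable names are illustrative.

```python
n_rows = 20_640  # rows in the California Housing dataset

# First split: test_size=0.2 holds out one fifth of all rows
n_test = n_rows // 5
n_temp = n_rows - n_test

# Second split: test_size=0.25 takes one quarter of what is left
n_valid = n_temp // 4
n_train = n_temp - n_valid

print("train:", n_train, "valid:", n_valid, "test:", n_test)
print("fractions:", n_train / n_rows, n_valid / n_rows, n_test / n_rows)
```

Taking 25% of the remaining 80% is what yields a 60/20/20 split overall, matching the row counts printed above.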
Running a Real Optuna Study
Here is the whole search. Read the objective first: it proposes six hyperparameters with trial.suggest_*, trains an XGBRegressor on the training set, and returns the validation RMSE. Notice the deliberate design choices in the ranges. learning_rate and reg_lambda use log=True because they matter most across orders of magnitude (0.01 versus 0.1 is a bigger deal than 0.2 versus 0.29), so a log scale spends trials where they count. max_depth and min_child_weight control tree complexity, while subsample and colsample_bytree add the randomness that fights overfitting. We seed the sampler with TPESampler(seed=42) so the run is reproducible, and quiet Optuna’s per-trial logging so the output stays clean.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import xgboost as xgb
import optuna
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
optuna.logging.set_verbosity(optuna.logging.WARNING)
data = fetch_california_housing()
X_temp, X_test, y_temp, y_test = train_test_split(
data.data, data.target, test_size=0.2, random_state=42
)
X_train, X_valid, y_train, y_valid = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=42
)
def objective(trial):
    params = {
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-3, 10.0, log=True),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 10),
    }
    model = xgb.XGBRegressor(n_estimators=400, random_state=42, **params)
    model.fit(X_train, y_train)
    pred = model.predict(X_valid)
    return float(np.sqrt(mean_squared_error(y_valid, pred)))
sampler = optuna.samplers.TPESampler(seed=42)
study = optuna.create_study(direction="minimize", sampler=sampler)
study.optimize(objective, n_trials=30)
print("Trials run:", len(study.trials))
print("Best validation RMSE:", float(round(study.best_value, 4)))
print("Best params:")
for k, v in study.best_params.items():
    print(f" {k}: {round(float(v), 4) if isinstance(v, float) else v}")

Trials run: 30
Best validation RMSE: 0.4535
Best params:
max_depth: 9
learning_rate: 0.0352
subsample: 0.9271
colsample_bytree: 0.6613
reg_lambda: 0.0878
min_child_weight: 4

Thirty trials in, study.best_value is 0.4535, comfortably below the baseline’s 0.4798 on the same validation set, and study.best_params names the configuration that got there: a deep tree (max_depth 9) with a small learning_rate of about 0.035, a healthy amount of row and column subsampling, and a light dose of L2 regularization. You did not have to guess any of that. The sampler discovered it by learning from its own history, testing 30 configurations rather than the tens of thousands a full grid would have demanded.
The figure below traces the best validation RMSE so far across the 30 trials. It does not fall in a smooth line, and that is exactly what a real Bayesian search looks like: an early warm-up phase where it explores, then sudden drops when the accumulated history points it at a genuinely better region, then a plateau once it has converged.
Pruning: stopping hopeless trials early
Optuna has a second efficiency lever we have not used here: pruning. When a model is trained iteratively (like XGBoost’s boosting rounds), you can report the intermediate validation score after each round with trial.report(...) and call trial.should_prune(); a pruner such as optuna.pruners.MedianPruner then stops trials whose intermediate score is worse than the median of earlier trials at the same round (for RMSE, that means a higher error). Pruning does not pick the settings; it just stops wasting compute finishing trials that are clearly going nowhere, so your budget buys more promising configurations. On a fast dataset like this one it is optional, but on large models it can cut total tuning time dramatically.
Crowning the Winner on the Test Set
A study’s best_value is a validation score, and validation scores are slightly optimistic by construction: we picked those settings because they looked best on that particular validation set. The honest question is how the winner does on data that played no role in choosing it. So we refit both the tuned configuration and the default baseline on the combined train-plus-validation data (now that tuning is done, there is no reason to hold the validation rows out of training) and judge each one exactly once on the sealed test set.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import xgboost as xgb
import optuna
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
optuna.logging.set_verbosity(optuna.logging.WARNING)
data = fetch_california_housing()
X_temp, X_test, y_temp, y_test = train_test_split(
data.data, data.target, test_size=0.2, random_state=42
)
X_train, X_valid, y_train, y_valid = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=42
)
def objective(trial):
    params = {
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-3, 10.0, log=True),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 10),
    }
    model = xgb.XGBRegressor(n_estimators=400, random_state=42, **params)
    model.fit(X_train, y_train)
    return float(np.sqrt(mean_squared_error(y_valid, model.predict(X_valid))))
sampler = optuna.samplers.TPESampler(seed=42)
study = optuna.create_study(direction="minimize", sampler=sampler)
study.optimize(objective, n_trials=30)
# Refit both models on TRAIN + VALID, then judge once on the untouched TEST set
X_fit, y_fit = X_temp, y_temp
baseline = xgb.XGBRegressor(n_estimators=400, random_state=42)
baseline.fit(X_fit, y_fit)
base_test = np.sqrt(mean_squared_error(y_test, baseline.predict(X_test)))
tuned = xgb.XGBRegressor(n_estimators=400, random_state=42, **study.best_params)
tuned.fit(X_fit, y_fit)
tuned_test = np.sqrt(mean_squared_error(y_test, tuned.predict(X_test)))
print("Default baseline test RMSE:", float(round(base_test, 4)))
print("Optuna-tuned test RMSE:", float(round(tuned_test, 4)))
print("Improvement:", float(round(base_test - tuned_test, 4)))

Default baseline test RMSE: 0.4584
Optuna-tuned test RMSE: 0.442
Improvement: 0.0164

On the untouched test set, the tuned model scores 0.4420 against the default baseline’s 0.4584, an improvement of 0.0164 in RMSE. In MedHouseVal units (each unit is 100,000 dollars), that is roughly 1,640 dollars less error on a typical district. It is worth being honest about the size of that gain: California Housing is a clean, well-behaved dataset where sensible XGBoost defaults are already strong, so the tuning ceiling is low. The point is not that Optuna performed a miracle here; it is that it found a real, held-out-confirmed improvement over a good baseline in just 30 trials, with no manual grid to babysit. On messier, higher-variance data (more features, more noise), the same 30-trial search often buys a larger margin.
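The conversion from RMSE to dollars is a one-liner, shown here with the two test scores reported above. The variable names are illustrative, and the relative-gain line is an extra way to read the same difference.

```python
baseline_test_rmse = 0.4584
tuned_test_rmse = 0.4420
dollars_per_unit = 100_000  # MedHouseVal is expressed in units of 100,000 dollars

# Absolute gain in RMSE, then the same gain in dollars
improvement = round(baseline_test_rmse - tuned_test_rmse, 4)
improvement_dollars = round(improvement * dollars_per_unit)

# The same gain as a percentage of the baseline error
relative_gain_pct = round(100 * improvement / baseline_test_rmse, 1)

print("RMSE improvement:", improvement)
print("in dollars:", improvement_dollars)
print("relative to baseline (%):", relative_gain_pct)
```

Seen as a share of the baseline error, the gain is about 3.6%, which is consistent with calling it real but modest.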
Why the test number is the only one that counts
Notice the tuned model’s validation RMSE (0.4535) and its test RMSE (0.4420) are close but not identical, and here the test score is actually a touch better. That is plausible rather than suspicious: the final model was refit on the larger train-plus-validation set, and any single split carries some noise. Neither number is “the answer” on its own: the validation score is the one you optimized against, so it is mildly optimistic as an estimate of true performance, and the test score is your single unbiased read on unseen data. Always quote the test number when you report a tuned model to anyone, and never go back and re-tune against the test set once you have looked at it, or it stops being a fair test.
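Why picking the best of many validation scores is optimistic can be shown with a small simulation. Everything in it is an assumption made for illustration: 30 configurations that are all equally good (true RMSE 0.45), independent measurement noise of 0.01 on both validation and test scores, and 2,000 repetitions. No model is trained.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rmse, noise_sd = 0.45, 0.01   # hypothetical values
n_configs, n_repeats = 30, 2000

# Each row is one simulated tuning run: 30 noisy validation and test scores
valid_scores = true_rmse + rng.normal(0.0, noise_sd, size=(n_repeats, n_configs))
test_scores = true_rmse + rng.normal(0.0, noise_sd, size=(n_repeats, n_configs))

# In every run, "tune" by picking the configuration with the lowest validation score
winners = valid_scores.argmin(axis=1)
rows = np.arange(n_repeats)

mean_winner_valid = float(valid_scores[rows, winners].mean())
mean_winner_test = float(test_scores[rows, winners].mean())

print("true RMSE:", true_rmse)
print("winner's average validation RMSE:", round(mean_winner_valid, 4))
print("winner's average test RMSE:", round(mean_winner_test, 4))
```

The winner's validation score averages well below the true value because it was selected for being low, while its test score averages close to the truth because the test set played no part in the selection.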
Practice Exercises
Try each one before opening its hint. They reinforce the study workflow, the search space, and the honest test-set report.
Exercise 1: Run the Study and Read the Winner
Build the three-way split exactly as in the lesson, define the objective over the six hyperparameters, run a 30-trial study with TPESampler(seed=42), and print study.best_value and study.best_params. Confirm you reproduce a best validation RMSE of 0.4535.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import xgboost as xgb
import optuna
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
optuna.logging.set_verbosity(optuna.logging.WARNING)
# Your code here
Hint
The objective must take a single trial argument, call trial.suggest_int / trial.suggest_float to build the params dict, fit xgb.XGBRegressor(n_estimators=400, random_state=42, **params) on X_train, and return the validation RMSE as a float. Create the study with optuna.create_study(direction="minimize", sampler=optuna.samplers.TPESampler(seed=42)) and call study.optimize(objective, n_trials=30). With the seeded sampler you should see study.best_value round to 0.4535 and max_depth land at 9.
Exercise 2: Confirm the Winner on the Test Set
Take study.best_params from Exercise 1, refit an XGBRegressor on the combined train+validation data (X_temp, y_temp), and print its test RMSE alongside a default XGBRegressor(n_estimators=400, random_state=42) refit the same way. State in one sentence how much tuning bought.
# Continue from Exercise 1's `study`, X_temp, y_temp, X_test, y_test
# Your code here
Hint
Fit xgb.XGBRegressor(n_estimators=400, random_state=42, **study.best_params) on X_temp, y_temp, predict on X_test, and take np.sqrt(mean_squared_error(...)). Do the same for a default XGBRegressor(n_estimators=400, random_state=42). You should see roughly 0.4420 (tuned) versus 0.4584 (baseline), an improvement of about 0.0164 RMSE, or around 1,640 dollars per district. Report the test number, not the validation one.
Exercise 3: Does the Budget Matter? Compare 10 vs 30 Trials
Run the same study twice, once with n_trials=10 and once with n_trials=30, both with a fresh TPESampler(seed=42), and print each study’s best_value. Note whether the extra 20 trials found a better configuration.
# Reuse the objective and split from Exercise 1
# Your code here
Hint
Create two separate studies (each needs its own new TPESampler(seed=42) and its own create_study call) and call optimize with n_trials=10 on one and n_trials=30 on the other. Compare the two best_values. You will see the 10-trial study stop at a higher (worse) RMSE and the 30-trial study reach 0.4535. Because both samplers share the same seed, the first 10 trials should be identical in the two studies, so the longer run can only match or beat the shorter one; and since TPESampler draws its first 10 trials at random by default, the 10-trial study never gets past the warm-up into the phase where history guides the search. The lesson: budget enough trials for the sampler to actually learn, but watch for the point where extra trials stop helping.
Summary
You replaced blind search with a sampler that learns, and you confirmed its winner the honest way. Let’s review.
Key Concepts
Why smarter search
- Grid search tries every combination, so its cost is the product of the per-parameter counts: it explodes combinatorially as you add hyperparameters
- Random search is cheaper but has no memory, so it keeps re-testing regions that already scored badly
- Bayesian optimization / TPE (Optuna’s default) models how the objective depends on the hyperparameters and proposes the next trial where improvement looks likely, reaching good settings in far fewer trials
The Optuna workflow
- Four building blocks: a study, an objective(trial) that returns the value to minimize, trial.suggest_* calls that define the search space, and study.optimize to run the loop
- Use log=True for parameters that span orders of magnitude (learning_rate, reg_lambda); seed the sampler (TPESampler(seed=42)) for reproducibility
- Pruning is an optional second lever that stops clearly hopeless trials early to stretch your compute budget
Tuning honestly
- Use a three-way split: tune on validation, and report the winner once on a held-out test set
- Here, 30 trials found best validation RMSE 0.4535 (max_depth 9, learning_rate 0.0352, subsample 0.9271, colsample_bytree 0.6613, reg_lambda 0.0878, min_child_weight 4)
- Refit on train+validation, the tuned model scored test RMSE 0.4420 versus the default baseline’s 0.4584, a real but modest gain on this clean dataset
Why This Matters
Hyperparameter tuning is where a lot of real modeling effort goes, and doing it well is as much about discipline as about the search algorithm. Optuna gives you an efficient search, but the habits around it are what make the result trustworthy: a sensible search space, log scales where they belong, a big enough trial budget, a seeded sampler you can reproduce, and above all a held-out test set you look at only once. Get those right and you can quote a tuned model’s performance to a stakeholder and defend the number.
Just as important, you now know why the smart search is smart. TPE is not magic; it is a sampler that keeps score and spends its trials where the history says improvement lives. That framing lets you reach for it deliberately, expand the search space when you have compute to spare, add pruning when trials get expensive, and always close the loop on unseen data.
Next Steps
You have a tuned model and an honest test-set number to stand behind. Next you will learn to save that model so you never have to retrain it, serve its predictions, and compare XGBoost against the other major gradient-boosting libraries so you can pick the right tool for the job.
Lesson 4: Saving, Serving, and Comparing Libraries
Persist your tuned XGBoost model, serve its predictions, and see how XGBoost stacks up against the other gradient-boosting libraries.
Continue Building Your Skills
Before moving on, make the search your own. Rerun the study with a wider or narrower range on one hyperparameter and watch how best_params shifts; add a seventh parameter with a new trial.suggest_* line and see whether it helps; or wire up a MedianPruner and time how much faster the study finishes. Tuning is a skill you build by feel, and the more studies you run on data whose numbers you trust, the sharper your instinct for sensible ranges and adequate budgets will get, exactly the instinct that makes the deployment work in Lesson 4 land on a model you are confident in.