Lesson 2 - Model Selection

Welcome to Model Selection

In the last lesson you engineered features to give your model better raw material to learn from. But feature engineering answers only half the question. Once your data is in good shape, you still have to decide which algorithm to train on it. That decision is called model selection, and it is the focus of this lesson.

You will put five very different algorithms on the same data, evaluate them with the same metric, and read the results like a practitioner. Just as importantly, you will learn that picking a model is never only about the highest score: accuracy, interpretability, and speed all pull in different directions.

By the end of this lesson, you will be able to:

  • Explain what model selection is and why comparing model families matters
  • Train linear regression, k-nearest neighbors, a decision tree, a random forest, and gradient boosting using one shared scikit-learn interface
  • Evaluate regression models with the R² score on a held-out test set
  • Compare five model families fairly on the same data
  • Reason about the trade-offs between accuracy, interpretability, and speed when choosing a model

You should be comfortable with the machine learning workflow, the train/test split, and basic scikit-learn usage from earlier lessons. Let’s begin.


What Model Selection Really Means

When people first hear “model selection,” they often picture tuning a single algorithm’s settings. That is part of the story, but it is a story for a later lesson. Here, model selection means something broader and more fundamental: choosing which family of algorithms to use in the first place.

A linear regression and a random forest are not two versions of the same idea. They make completely different assumptions about how the world works. Linear regression assumes the target is a straight-line combination of your features. A decision tree assumes the target can be carved up into rectangular regions. K-nearest neighbors assumes that similar inputs produce similar outputs. None of these assumptions is universally correct, and the only honest way to find out which one fits your data is to try several and compare.

This is one of the most reassuring facts in machine learning: you do not have to guess the best algorithm from first principles. You can let the data tell you.

Why Not Just Use the Fanciest Model?

It is tempting to reach for the most powerful algorithm you know and stop there. But “powerful” is not the same as “best for this problem.” A more flexible model can capture more complex patterns, but it can also chase noise, take longer to train, and become impossible to explain to a stakeholder. The right choice depends on your data, your constraints, and what you need the model for. By the end of this lesson you will see all three of those forces at work.

No free lunch

There is a famous result in machine learning called the No Free Lunch theorem. Informally, it says that no single algorithm is best across all possible problems. Averaged over every conceivable dataset, every algorithm performs the same. That is why we compare candidates on our data instead of trusting a universal ranking.
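
The idea is easier to believe once you watch a ranking flip. The sketch below is an illustration rather than a proof of the theorem: it assumes two made-up targets built from one synthetic feature (a straight line and a sine wave) and fits the same two models to each. The data, the tree depth, and all variable names are assumptions chosen for the demo.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(2000, 1))      # one synthetic feature
noise = rng.normal(0, 0.1, size=2000)

# Two made-up targets built from the same feature
targets = {
    "straight line": 3 * x[:, 0] + noise,
    "sine wave": np.sin(3 * x[:, 0]) + noise,
}

scores = {}
for name, y in targets.items():
    x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.25, random_state=0)
    linear = LinearRegression().fit(x_tr, y_tr)
    tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(x_tr, y_tr)
    scores[name] = {
        "linear": linear.score(x_te, y_te),
        "tree": tree.score(x_te, y_te),
    }
    print(f"{name:<14} linear R2 = {scores[name]['linear']:.3f}   "
          f"tree R2 = {scores[name]['tree']:.3f}")
```

On the straight-line target the linear model should come out ahead, and on the sine wave the tree should. Neither model is better in general; each is better on the data that matches its assumption.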


The Power of a Shared Interface

The reason model comparison is practical at all is a quiet piece of design genius in scikit-learn: every model speaks the same language. Whether you reach for a linear model or a gradient-boosted ensemble, the steps are identical.

model = SomeAlgorithm(...)   # 1. instantiate, optionally setting hyperparameters
model.fit(X_train, y_train)  # 2. train
model.score(X_test, y_test)  # 3. evaluate
model.predict(X_new)         # 4. predict

Because the interface never changes, swapping one algorithm for another is usually a one-line edit. That means you can write a single evaluation loop and run every candidate through it under identical conditions. Identical conditions are the whole point: if you scored one model on a different split or a different metric than another, the comparison would be meaningless.
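
As a minimal sketch of that one-line swap, the example below wraps fit and score in a small helper and passes two different estimators through it. It runs on synthetic data from make_regression, and the helper name evaluate and the `_demo` variable names are assumptions for illustration, not part of the lesson's dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

def evaluate(model):
    """Fit on the training split and return the test-set score."""
    model.fit(X_tr, y_tr)
    return model.score(X_te, y_te)

# Swapping algorithms changes only the constructor call
linear_score = evaluate(LinearRegression())
knn_score = evaluate(KNeighborsRegressor(n_neighbors=5))
print(f"Linear R2 = {linear_score:.3f}")
print(f"KNN    R2 = {knn_score:.3f}")
```

Because evaluate never looks inside the model, both candidates are guaranteed to see the same split and the same metric.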


The Data: California Housing

You will work with the real California Housing dataset, where each row describes a census block group in California and the target is the median house value in that block. This is a regression problem: the target is a continuous number, not a category, so the algorithms and metric below are all chosen with regression in mind.

import numpy as np
import pandas as pd

# download: https://datatweets.com/datasets/california_housing.csv
housing = pd.read_csv("california_housing.csv").dropna()

print("Shape:", housing.shape)
# Output: Shape: (20433, 10)

After dropping the small number of rows with missing values, you are left with 20,433 block groups and 10 columns. Nine of those columns are numeric features describing each block (median income, house age, average rooms, location, and so on), and one column is the target you want to predict.

target = housing["median_house_value"]
print("Target summary (median_house_value):")
print("  mean:", target.mean().round())   # rounded to the nearest dollar
print("  min: ", target.min())
print("  max: ", target.max())
# Output:
# Target summary (median_house_value):
#   mean: 206864.0
#   min:  14999.0
#   max:  500001.0

Median house values range from about $15,000 up to a capped $500,001, with a mean near $207,000. Predicting that number well is the job, and the question of this lesson is: which algorithm does it best?

Setting Up the Comparison

To compare models fairly, you fix everything except the algorithm. You build the same feature matrix and target, make the same train/test split, and (because some algorithms care about feature scale and others do not) standardize the features. Standardizing leaves the scale-insensitive models (linear regression and the tree-based models) essentially unaffected, so applying it uniformly keeps the comparison clean.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Features: every numeric column except the target
X = housing.drop(columns="median_house_value").select_dtypes(include="number")
y = housing["median_house_value"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)   # fit on TRAIN only
X_test = scaler.transform(X_test)         # apply the same transform to TEST

print("Training rows:", X_train.shape[0])
print("Test rows:    ", X_test.shape[0])
# Output:
# Training rows: 15324
# Test rows:     5109

Notice the discipline carried over from earlier lessons: the scaler is fit on the training data only, then applied to both sets. Every model below sees exactly this split and this scaling, so any difference in score comes from the algorithm itself and nothing else.
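
To see what “fit on the training data only” means in practice, here is a small sketch on made-up features (the random data and the `demo_` names are assumptions). It checks that the scaler's learned means are the training means, and shows that the test set, transformed with those same statistics, does not come out centered at exactly zero.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))   # made-up features

X_tr, X_te = train_test_split(X_demo, test_size=0.25, random_state=0)

demo_scaler = StandardScaler().fit(X_tr)       # statistics come from TRAIN only
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)      # reuses the training statistics

print("Learned means match the training means:",
      np.allclose(demo_scaler.mean_, X_tr.mean(axis=0)))
print("Train column means after scaling:", X_tr_scaled.mean(axis=0).round(3))
print("Test column means after scaling: ", X_te_scaled.mean(axis=0).round(3))
```

The small offsets in the test means are expected and harmless. They are the sign that no information from the test set leaked into the preprocessing.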


Measuring Regression Performance: the R² Score

For classification you measured accuracy. Accuracy makes no sense for regression, because predicting $201,432 when the true value is $200,000 is not simply “right” or “wrong.” Instead, the standard score for regression is the coefficient of determination, written R².

R² answers the question: how much of the variation in the target does the model explain? It is defined as:

R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}

The numerator is the squared error your model makes. The denominator is the squared error of a trivial baseline that always predicts the mean of the target, \bar{y}. So R² compares your model against that naive baseline:

  • R² = 1.0 means perfect predictions.
  • R² = 0.0 means your model is no better than always guessing the average.
  • R² < 0 means your model is worse than guessing the average.

A higher R² is better, and in scikit-learn the .score() method returns it automatically for regression models. That gives you a single, consistent number to rank every candidate.
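
The definition is short enough to compute by hand. The sketch below uses four made-up true values and predictions (an assumption for illustration), evaluates the formula term by term with NumPy, and compares the result with scikit-learn's r2_score. It also confirms that always predicting the mean scores exactly zero.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true values and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # numerator: the model's squared error
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # denominator: the mean baseline's squared error
r2_manual = 1 - ss_res / ss_tot

# Always predicting the mean of y_true gives exactly 0
r2_mean_baseline = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print(f"manual R2   = {r2_manual:.3f}")
print(f"sklearn R2  = {r2_score(y_true, y_pred):.3f}")
print(f"baseline R2 = {r2_mean_baseline:.3f}")
# Output:
# manual R2   = 0.925
# sklearn R2  = 0.925
# baseline R2 = 0.000
```

Here the model's squared error is 1.5 and the baseline's is 20, so R² = 1 - 1.5/20 = 0.925. Note that the mean in the denominator is taken over the true values being scored, which for .score(X_test, y_test) means the test targets.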

R² is not a percentage of accuracy

It is common to read R² = 0.74 as “74 percent of the variance explained,” which is fine. Just resist the urge to call it “74 percent accurate.” Accuracy is a classification idea about right-versus-wrong labels; R² is about how much of the spread in a continuous target your model accounts for.


Meet the Five Model Families

You will compare five algorithms that represent genuinely different ways of learning. You do not need to master the internals of each one here; later lessons go deeper. For now, focus on what assumption each one makes and how to train it with the shared interface.

1. Linear Regression

Linear regression draws the best straight-line relationship between the features and the target. It assumes that each feature contributes a fixed amount to the prediction, added together. It is fast, and its coefficients are easy to read, which makes it the most interpretable model in the lineup.

from sklearn.linear_model import LinearRegression

linear = LinearRegression()
linear.fit(X_train, y_train)

print(f"Linear  R2 = {linear.score(X_test, y_test):.3f}")
# Output: Linear  R2 = 0.647

2. K-Nearest Neighbors

KNN makes no assumption about the shape of the relationship at all. To predict a new block group, it finds the k most similar block groups in the training data and averages their values. Here you consult the 10 nearest neighbors. Because “similar” is measured by distance, KNN is exactly the kind of model that depends on the feature scaling you applied earlier.

from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=10)
knn.fit(X_train, y_train)

print(f"KNN     R2 = {knn.score(X_test, y_test):.3f}")
# Output: KNN     R2 = 0.697
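
To make KNN's dependence on scaling concrete, here is a hedged sketch on a toy dataset (not the housing data). It assumes two made-up features that matter equally, one measured on a 0 to 1 scale and one on a 0 to 1000 scale, and compares KNN on the raw features with KNN behind a StandardScaler. All names and numbers are assumptions for the demo.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
small = rng.uniform(0, 1, size=2000)        # feature on a 0-1 scale
large = rng.uniform(0, 1000, size=2000)     # feature on a 0-1000 scale
X_toy = np.column_stack([small, large])
# Both features matter equally once their units are accounted for
y_toy = 5 * small + 5 * (large / 1000) + rng.normal(0, 0.1, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

raw_knn = KNeighborsRegressor(n_neighbors=10).fit(X_tr, y_tr)
scaled_knn = make_pipeline(
    StandardScaler(), KNeighborsRegressor(n_neighbors=10)
).fit(X_tr, y_tr)

raw_score = raw_knn.score(X_te, y_te)
scaled_score = scaled_knn.score(X_te, y_te)
print(f"KNN on raw features:    R2 = {raw_score:.3f}")
print(f"KNN on scaled features: R2 = {scaled_score:.3f}")
```

Without scaling, distances are dominated by the large-valued feature, so the neighbors are “similar” in only one of the two dimensions that matter. After standardization both features contribute to the distance, and the score should rise noticeably.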

3. Decision Tree

A decision tree splits the data into smaller and smaller groups by asking yes/no questions about the features (“is median income above this threshold?”). Each split tries to make the groups more homogeneous in their target value. A tree can capture nonlinear patterns that linear regression cannot, and a shallow tree is wonderfully easy to explain. You cap the depth at 10 to keep it from growing arbitrarily complex.

from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor(max_depth=10, random_state=42)
tree.fit(X_train, y_train)

print(f"Tree    R2 = {tree.score(X_test, y_test):.3f}")
# Output: Tree    R2 = 0.654

4. Random Forest

A single tree is prone to overfitting. A random forest trains many trees, each on a random resample of the training rows and, optionally, a random subset of the features at each split, then averages their predictions. This averaging cancels out the quirks of individual trees and usually produces a much stronger model. Here you grow 200 trees.

from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(n_estimators=200, random_state=42)
forest.fit(X_train, y_train)

print(f"Forest  R2 = {forest.score(X_test, y_test):.3f}")
# Output: Forest  R2 = 0.740

5. Gradient Boosting

Gradient boosting is another ensemble of trees, but it builds them sequentially instead of independently. Each new tree focuses on correcting the errors the previous trees made. Done carefully, this produces some of the strongest models available for tabular data.

from sklearn.ensemble import GradientBoostingRegressor

gbm = GradientBoostingRegressor(random_state=42)
gbm.fit(X_train, y_train)

print(f"GBM     R2 = {gbm.score(X_test, y_test):.3f}")
# Output: GBM     R2 = 0.729

Notice that the three steps that matter (instantiate, fit, score) are identical across all five algorithms. Only the class name and its hyperparameters change. That is the shared interface earning its keep.


Comparing the Models Fairly

Running five separate code blocks is fine for learning, but in practice you would loop over the candidates so that every one is evaluated under exactly the same conditions. A dictionary of named models makes this clean.

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

models = {
    "Linear":            LinearRegression(),
    "KNN (k=10)":        KNeighborsRegressor(n_neighbors=10),
    "Decision Tree":     DecisionTreeRegressor(max_depth=10, random_state=42),
    "Random Forest":     RandomForestRegressor(n_estimators=200, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = model.score(X_test, y_test)
    print(f"{name:<18} R2 = {results[name]:.3f}")
# Output:
# Linear             R2 = 0.647
# KNN (k=10)         R2 = 0.697
# Decision Tree      R2 = 0.654
# Random Forest      R2 = 0.740
# Gradient Boosting  R2 = 0.729

That single loop is the heart of model selection. Every model received the same training data, the same test data, and was judged by the same metric. The numbers are now directly comparable.

A picture makes the ranking obvious at a glance.

Figure: horizontal bar chart of test R² for the five model families on the same California Housing split. The tree ensembles lead; linear regression trails.

Reading the Results

A few patterns jump out, and they are worth pausing on because they generalize far beyond this dataset.

  • The tree ensembles win. Random forest (0.740) and gradient boosting (0.729) clearly outperform everything else. Both combine many trees, and that combination captures the nonlinear, interacting structure in housing data that a single model misses.
  • A single tree is mediocre. The lone decision tree scores 0.654, barely above linear regression. On its own, one tree is a weak learner; the magic of forests and boosting is in combining many of them.
  • KNN is surprisingly competitive. At 0.697, plain nearest-neighbors beats both the single tree and the linear model. House values really do cluster geographically and by income, so “find similar blocks and average them” is a reasonable strategy here.
  • Linear regression is the floor, not a failure. Its 0.647 is the lowest score, but it still explains nearly two-thirds of the variation in house values. As a fast, transparent baseline, that is genuinely useful information.

One split is not the final word

These scores come from a single train/test split. Shuffle the data differently and the exact numbers will shift, and a close race could even change order. Before you commit to a model based on small differences, you should confirm the ranking holds across multiple splits. That is precisely what cross-validation does, and it is the subject of the next lesson.
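
You can watch this wobble directly. The sketch below is an assumption-laden illustration on synthetic data from make_regression, not the housing data: it keeps the model fixed and changes only the random_state of the split, then prints the test score each time.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=400, n_features=5, noise=25.0, random_state=0)

split_scores = []
for seed in range(5):
    # Only the split changes; the model and its settings stay fixed
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.25, random_state=seed
    )
    model = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_tr, y_tr)
    split_scores.append(model.score(X_te, y_te))
    print(f"split seed {seed}: R2 = {split_scores[-1]:.3f}")

print(f"spread (max - min): {max(split_scores) - min(split_scores):.3f}")
```

If the spread across splits is as large as the gap between two models, a single split cannot tell you which of them is better.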


Accuracy Is Not the Only Criterion

If model selection were only about the highest R², this lesson would end here: pick the random forest. But real projects almost never optimize a single number. Three forces are usually in tension, and the best model is the one that balances them for your situation.

Accuracy

The most obvious axis. Higher R² (or whatever metric fits your problem) means better predictions. The ensembles win this round. If raw predictive performance is all that matters, they are the natural choice.

Interpretability

Can you explain why the model made a given prediction? Linear regression is the champion here: each coefficient tells you exactly how much a feature moves the prediction. A shallow decision tree is also readable: you can literally trace the path of yes/no questions. A random forest of 200 trees, by contrast, is a black box. In domains like lending, healthcare, or anything facing a regulator, an explainable model that scores slightly lower can be worth far more than an opaque one that scores higher.
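
Reading a linear model's coefficients looks like this in practice. The sketch below assumes three hypothetical, already-standardized features and a made-up ground truth in which the first pushes the target up, the second pulls it down, and the third has no effect. The feature names and data are assumptions for illustration, not the lesson's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
feature_names = ["income", "house_age", "rooms"]    # hypothetical names
X_demo = rng.normal(size=(500, 3))                  # already on a standard scale
# Made-up ground truth: income raises the target, house_age lowers it, rooms does nothing
y_demo = 5.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0, 0.5, size=500)

demo_linear = LinearRegression().fit(X_demo, y_demo)
for name, coef in zip(feature_names, demo_linear.coef_):
    print(f"{name:<10} {coef:+.2f}")
```

The fitted coefficients should land close to +5, -2, and 0. When features are standardized, as in this lesson, each coefficient reads as the change in the prediction for a one-standard-deviation increase in that feature, holding the others fixed.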

Speed and Cost

How long does the model take to train, and how fast does it predict? Linear regression trains almost instantly and predicts in microseconds. KNN trains almost instantly but is slow to predict, because the work of searching the training set for neighbors happens at prediction time. Random forests and gradient boosting are the most expensive to train and the heaviest to store. If you are retraining hourly on millions of rows, or running on a tiny device, that cost matters.
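
If you want numbers rather than adjectives, you can time the two phases yourself. The sketch below is a rough measurement on synthetic data (an assumption), using a hypothetical helper named time_model. Exact timings depend on your machine, so no expected output is shown.

```python
import time

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=5000, n_features=8, noise=10.0, random_state=0)

def time_model(model):
    """Return (fit_seconds, predict_seconds) for one model on the demo data."""
    start = time.perf_counter()
    model.fit(X_demo, y_demo)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    model.predict(X_demo)
    predict_seconds = time.perf_counter() - start
    return fit_seconds, predict_seconds

timings = {
    "Linear": time_model(LinearRegression()),
    "KNN (k=10)": time_model(KNeighborsRegressor(n_neighbors=10)),
}
for name, (fit_s, predict_s) in timings.items():
    print(f"{name:<12} fit = {fit_s:.4f}s   predict = {predict_s:.4f}s")
```

A single run like this is noisy. For a decision that matters, repeat the measurement several times and time prediction on batches shaped like your real workload.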

                  Accuracy   Interpretability   Speed
Linear              low           high           high
KNN                 medium        medium         slow predict
Decision Tree       medium        high           high
Random Forest       high          low            medium
Gradient Boosting   high          low            slow train

There is no row in that table that wins every column. That is the entire point. Model selection is the act of choosing where on these trade-offs your project should sit.

Start simple, then earn complexity

A practical habit: always train a simple model like linear regression first. It is fast, it gives you a baseline R², and it tells you whether a fancier model is even worth the trouble. If a 200-tree forest barely beats your one-line linear model, the simpler model may be the smarter choice. Add complexity only when it earns its keep.


Practice Exercises

Now it is your turn. Try these before checking the hints. Reuse the X_train, X_test, y_train, and y_test from the lesson.

Exercise 1: Tune the Number of Neighbors

The lesson used n_neighbors=10 for KNN. Train a KNeighborsRegressor for k in [3, 10, 25, 50] and print the test R² for each. Which value of k gives the best score?

from sklearn.neighbors import KNeighborsRegressor

# Your code here

Hint

Loop over the list of k values. Inside the loop, create KNeighborsRegressor(n_neighbors=k), call .fit(X_train, y_train), then print .score(X_test, y_test). Very small k reacts to noise; very large k blurs over real structure, so expect the best score somewhere in the middle.

Exercise 2: Grow More Trees in the Forest

The random forest used 200 trees. Train two more forests with n_estimators=50 and n_estimators=400 (keep random_state=42) and compare their test R² to the 200-tree result of 0.740. Does adding trees keep improving the score?

from sklearn.ensemble import RandomForestRegressor

# Your code here

Hint

Instantiate RandomForestRegressor(n_estimators=50, random_state=42) and again with 400, fitting and scoring each. You will notice the score climbs quickly at first and then flattens out: beyond a point, more trees mostly cost you time without adding accuracy.

Exercise 3: Rank the Models with a Loop

Build the same dictionary of five models from the lesson, run them all through one evaluation loop, and print them sorted from highest R² to lowest. Confirm the order matches the chart.

# Build a dict of named models, fit each, score each, then sort.
# Your code here

Hint

Store results in a dictionary, then sort with sorted(results.items(), key=lambda kv: kv[1], reverse=True). The top two should be Random Forest (0.740) and Gradient Boosting (0.729), with Linear (0.647) at the bottom.


Summary

You compared five genuinely different algorithms on the same data and learned to read the results like a practitioner. Let’s review what you learned.

Key Concepts

What Model Selection Is

  • Model selection is choosing which family of algorithm to use, separate from tuning a single model’s settings
  • Different algorithms make different assumptions, so the only honest way to choose is to compare candidates on your own data
  • The No Free Lunch idea: no algorithm is best for every problem

The Shared Interface

  • Every scikit-learn model uses the same pattern: instantiate, .fit(), .score(), .predict()
  • Because the interface is identical, you can evaluate every candidate in one loop under identical conditions
  • A fair comparison means the same split, the same scaling, and the same metric for every model

Measuring Regression

  • Regression uses the R² score, not accuracy
  • R² = 1 is perfect, R² = 0 is no better than predicting the mean, and R² < 0 is worse than the mean
  • .score() returns R² automatically for regression models

The Five Families

  • Linear regression (R² 0.647): fast, interpretable, assumes straight-line relationships
  • KNN (R² 0.697): averages similar points, depends on feature scaling
  • Decision tree (R² 0.654): splits data with yes/no questions, weak on its own
  • Random forest (R² 0.740): averages many independent trees, strongest here
  • Gradient boosting (R² 0.729): builds trees sequentially to fix prior errors

Choosing a Model

  • Accuracy, interpretability, and speed pull in different directions
  • The most accurate model is not always the right one; regulated or low-latency settings often favor simpler, explainable models
  • Start with a simple baseline and add complexity only when it earns its keep

Why This Matters

Model selection is where machine learning stops being a recipe and starts being judgment. Anyone can call .fit() on a random forest, but knowing whether a random forest is the right tool, and being able to defend that choice against an interpretable baseline, is what separates a practitioner from a button-pusher.

The single-split comparison you ran here is the honest first step, but it is only a first step. Those five scores came from one particular shuffle of the data, and small gaps between models could vanish or reverse under a different split. To trust a ranking, you need to see it hold up repeatedly. That is exactly the gap the next lesson closes.


Next Steps

You now know how to put model families on a level playing field and weigh accuracy against interpretability and speed. Next, you will make these comparisons far more reliable by evaluating each model across many splits instead of just one.

Continue to Lesson 3 - Cross-Validation

Move beyond a single train/test split and evaluate models reliably across many folds.


Keep Building Your Skills

You have learned to treat algorithm choice as an experiment rather than a guess: line the candidates up, evaluate them fairly, and weigh the trade-offs that matter for your project. That habit, comparing on equal terms and refusing to optimize a single number blindly, will serve you in every machine learning problem you ever tackle. In the next lesson you will give these comparisons the statistical backbone they deserve with cross-validation.