Lesson 3 - Cross-Validation

Welcome to Cross-Validation

In the last lesson you compared several models on a single test set and picked the best one. That works, but it hides a quiet problem: the number you trust to make that decision comes from just one random split of your data. In this lesson you will see why that single number is noisier than it looks, and how k-fold cross-validation gives you a more honest, lower-variance estimate of how a model really performs.

By the end of this lesson, you will be able to:

  • Explain why a single train/test split produces a noisy estimate of model performance
  • Describe how k-fold cross-validation splits data into folds and reuses each fold for both training and testing
  • Run cross-validation in scikit-learn with KFold and cross_val_score
  • Interpret the mean and standard deviation of fold scores as an estimate plus its uncertainty
  • Reason about the bias-variance trade-off of the estimate and choose a sensible number of folds

You should be comfortable with the train/test split, fitting a scikit-learn model, and the R^2 metric from the previous lessons. Let’s begin.


The Problem With a Single Split

Every model you have trained so far ended the same way: split the data once, train on one part, score on the other, and report that score. It feels rigorous because the model never saw the test rows during training. But step back and ask a sharper question: how much should you trust that one number?

Recall what a train/test split actually does. It shuffles the rows and slices off a chunk, usually 20 or 25 percent, to hold out. The word that matters is shuffles. The split is random, which means a different random seed gives you a different test set, and a different test set gives you a different score.

That randomness can bite you in two ways.

First, the split can simply be lucky or unlucky. By chance, the test set might contain rows that happen to be easy for your model, inflating the score. Another draw might collect harder rows and deflate it. You only ever see one of those draws, so you have no way to tell whether you got a representative sample or an outlier.

Second, a few extreme observations can swing the result. Imagine a dataset with one unusual record, an outlier you cannot reasonably drop. In any single split that record lands either in the training set or in the test set. If it lands in training, it can distort what the model learns. If it lands in testing, the model produces one wildly wrong prediction for it, and that single error can drag the whole test metric down. Either way, one row tilts your conclusion.
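To make the second effect concrete, here is a minimal sketch in plain Python. The numbers are invented for illustration (they are not from the housing data), and the helper r2 is a hand-written version of the usual R^2 formula rather than a library call. It scores the same hypothetical predictions twice: once on ten ordinary rows, and once with a single extreme row added to the test set.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical test set: ten rows the model predicts well (each off by 0.5).
y_true = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
y_pred = [t + 0.5 for t in y_true]
clean = r2(y_true, y_pred)

# Same rows plus one unusual record: true value 100, but the model predicts 30.
y_true_out = y_true + [100]
y_pred_out = y_pred + [30]
with_outlier = r2(y_true_out, y_pred_out)

print(f"R2 without the outlier: {clean:.3f}")
print(f"R2 with the outlier:    {with_outlier:.3f}")
```

In this toy case the score falls from above 0.99 to roughly 0.22, even though ten of the eleven predictions are unchanged. Whether a real split suffers this depends entirely on where the unusual row happens to land.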

The deeper issue is that you are judging the model on a metric that itself varies. You want to know how well the model generalizes, but the single split gives you a moving target dressed up as a fixed answer.

Seeing the Noise for Yourself

The cleanest way to feel this is to change only the random seed and watch the score move. You will use the real California Housing dataset, where each row describes a block of houses, and predict the median house value from a handful of numeric features.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# download: https://datatweets.com/datasets/california_housing.csv
housing = pd.read_csv("california_housing.csv")

print("Shape:", housing.shape)
# Output: Shape: (20433, 10)

feature_cols = [
    "median_income", "housing_median_age", "total_rooms",
    "total_bedrooms", "population", "households",
    "latitude", "longitude",
]
X = housing[feature_cols]
y = housing["median_house_value"]

Now train the same model under five different splits, changing nothing but random_state, and collect the R^2 on each test set.

scores = []
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

for seed, s in enumerate(scores):
    print(f"seed={seed}  test R2={s:.3f}")
# Output:
# seed=0  test R2=0.730
# seed=1  test R2=0.699
# seed=2  test R2=0.752
# seed=3  test R2=0.758
# seed=4  test R2=0.733

Look at that spread. Nothing about the model, the features, or the data changed between runs. The only thing that moved was which rows happened to land in the test set, and yet the score swings from about 0.70 to about 0.76. If you had run the experiment once and reported a single number, you might have claimed 0.70 or 0.76 with equal confidence, and you would have been roughly six points apart depending purely on luck.

One score is a sample, not the truth

A single test score is one draw from a distribution of possible scores. Reporting it as the performance of your model quietly ignores how different it could have been. Before you commit to a model, you should know not just how well it scores, but how much that score wobbles.

So the goal becomes clear. Instead of one estimate from one split, you want several estimates from several splits, then summarize them. That is exactly what cross-validation does, in a systematic way.


The Idea of Folds

The loop above already hints at the fix: run multiple splits and average. But repeated random splits are wasteful because their test sets overlap. Two random test sets can share many of the same rows, and some rows might never be tested at all. A more principled approach divides the data into clean, non-overlapping groups.
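The overlap is easy to measure. The sketch below uses only the standard library and a hypothetical dataset of 10,000 row positions (an assumption for illustration). It draws five independent 20 percent test sets, as the seed loop did, and counts how often each row is tested. If the draws are independent, a row misses all five test sets with probability 0.8^5, or about 33 percent.

```python
import random

n_rows = 10_000          # hypothetical dataset size
test_size = n_rows // 5  # a 20 percent hold-out, as in the seed loop
times_tested = [0] * n_rows

for seed in range(5):
    rng = random.Random(seed)
    # Draw one random test set, without replacement, for this seed.
    for row in rng.sample(range(n_rows), test_size):
        times_tested[row] += 1

never = sum(1 for c in times_tested if c == 0)
repeated = sum(1 for c in times_tested if c >= 2)
print(f"never tested:          {never / n_rows:.1%}")
print(f"tested more than once: {repeated / n_rows:.1%}")
```

Roughly a third of the rows never influence any score, while about a quarter are counted two or more times.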

These groups are called folds. A fold is one roughly equal, mutually exclusive slice of the dataset. If you cut the data into five folds, each fold holds about one fifth of the rows (sizes can differ by one row when the count does not divide evenly), and no row appears in two folds.

The trick is what you do with them. You take turns: one fold is set aside as the test set, the other folds become the training set, and you score the model. Then you rotate. A different fold becomes the test set, the rest become training, and you score again. You keep rotating until every fold has served as the test set exactly once.

The diagram below shows this rotation for five folds. In each row, the shaded fold is the one being tested while the others train the model.

                 fold 1   fold 2   fold 3   fold 4   fold 5
  round 1   ->   [ TEST ] [train ] [train ] [train ] [train ]   -> score 1
  round 2   ->   [train ] [ TEST ] [train ] [train ] [train ]   -> score 2
  round 3   ->   [train ] [train ] [ TEST ] [train ] [train ]   -> score 3
  round 4   ->   [train ] [train ] [train ] [ TEST ] [train ]   -> score 4
  round 5   ->   [train ] [train ] [train ] [train ] [ TEST ]   -> score 5

Two things make this better than five random seeds. Every row is tested exactly once, so the whole dataset contributes to the evaluation without overlap. And because the test folds never overlap, the scores are cleaner samples of how the model behaves across different slices of data.
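The rotation in the diagram can be sketched in a few lines of plain Python. The function name kfold_indices is hypothetical, and this version cuts contiguous folds without shuffling, so it illustrates the idea rather than reproducing any library's implementation.

```python
def kfold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, non-overlapping folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, test_idx
        start += size

folds = list(kfold_indices(n_rows=10, k=5))
for round_no, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"round {round_no}: test={test_idx} train={train_idx}")
```

Collecting the test indices from all rounds gives back every row position exactly once, which is the property that random seeds could not guarantee.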

This procedure is k-fold cross-validation, where k is the number of folds. Setting k = 5 gives you five scores; setting k = 10 gives you ten. The smallest meaningful case, k = 2, splits the data in half, trains on each half, and tests on the other, giving you two scores to average.

Cross-validation is for estimating, not for the final model

Cross-validation trains a fresh model on each round and throws it away after scoring. Its job is to estimate how well a given configuration generalizes, not to produce the model you ship. Once you have used cross-validation to choose a model or its settings, you retrain that choice on all your data for deployment.


Cross-Validation in scikit-learn

You could implement the rotation by hand with a loop, but scikit-learn does it for you in two pieces: KFold, which defines how the folds are cut, and cross_val_score, which runs the whole train-and-score rotation and hands back the array of scores.

Start with KFold. It is a small object that knows how to split your data into k folds.

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)

Three arguments matter here. n_splits is your k, the number of folds. shuffle=True randomizes the row order before cutting the folds, which matters when the data has any ordering you do not want to leak into the folds. And random_state=42 fixes that shuffle so your folds are reproducible; any fixed number works.
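To see what those arguments produce, you can ask a KFold object for its index arrays directly through its split method. The sketch below uses ten hypothetical row positions instead of the housing data, and the names rows and kf_demo are placeholders chosen for this example.

```python
import numpy as np
from sklearn.model_selection import KFold

rows = np.arange(10)  # ten hypothetical row positions, 0 through 9
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

test_folds = []
for round_no, (train_idx, test_idx) in enumerate(kf_demo.split(rows), start=1):
    test_folds.append(test_idx)
    print(f"round {round_no}: test rows {test_idx}, train rows {train_idx}")

# With an integer random_state, calling split again repeats the same folds.
again = [test_idx for _, test_idx in kf_demo.split(rows)]
```

Each round holds out two of the ten rows, the held-out rows never repeat across rounds, and the second call to split returns the same folds as the first.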

Now hand the model, the data, and the KFold plan to cross_val_score. It performs the full rotation: fit on the training folds, score on the held-out fold, repeat, and return one score per fold.

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=42)

cv_scores = cross_val_score(model, X, y, cv=kf, scoring="r2")

print("Fold scores:", cv_scores.round(3))
# Output: Fold scores: [0.73  0.699 0.752 0.758 0.733]

That single call did everything: it cut the data into five folds, trained five separate random forests, and scored each one on the fold it had not seen. The result is an array of five R^2 values, one per fold.
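If you want to convince yourself of what the call does, you can write the rotation by hand and compare. The sketch below is an illustration under stated assumptions: it uses synthetic data from make_regression and a LinearRegression model so that it runs quickly and deterministically, and the names X_demo, y_demo, model_demo, and kf_demo are placeholders, not part of the lesson's dataset.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption; not the housing dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

model_demo = LinearRegression()
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

manual_scores = []
for train_idx, test_idx in kf_demo.split(X_demo):
    fold_model = clone(model_demo)  # fresh, unfitted copy for this round
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # For a regressor, .score returns R^2 on the held-out fold.
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

library_scores = cross_val_score(model_demo, X_demo, y_demo, cv=kf_demo, scoring="r2")
print("manual :", np.round(manual_scores, 3))
print("library:", library_scores.round(3))
print("match  :", np.allclose(manual_scores, library_scores))
```

The two arrays agree, which is a useful sanity check: the library call is the same fit-and-score loop, just shorter and harder to get wrong.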

A few arguments are worth knowing. The first three, model, X, and y, can be passed positionally, in that order; the remaining ones, such as cv and scoring, are passed by name. The model does not need to be fitted beforehand; cross_val_score fits a fresh copy on each fold. The cv argument controls the splitting; you can pass a KFold object as above, or simply pass an integer like cv=5, which for a regressor cuts five consecutive folds without shuffling. The scoring argument names the metric. Every scikit-learn model has a default score (R^2 for regressors, accuracy for classifiers), but stating scoring="r2" explicitly makes your intent unambiguous.
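These defaults can be checked rather than taken on faith. The following sketch again relies on synthetic make_regression data and a LinearRegression model (both assumptions made for speed; the variable names are placeholders). It compares an integer cv with an unshuffled KFold, compares the default metric with scoring="r2", and confirms that the model object you pass in is left unfitted.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption; not the housing dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
model_demo = LinearRegression()

by_int = cross_val_score(model_demo, X_demo, y_demo, cv=5, scoring="r2")
by_kfold = cross_val_score(model_demo, X_demo, y_demo, cv=KFold(n_splits=5), scoring="r2")
by_default_metric = cross_val_score(model_demo, X_demo, y_demo, cv=5)

same_folds = np.allclose(by_int, by_kfold)             # cv=5 behaves like unshuffled KFold here
same_metric = np.allclose(by_int, by_default_metric)   # a regressor's default score is R^2
still_unfitted = not hasattr(model_demo, "coef_")      # only the internal copies were fitted

print(same_folds, same_metric, still_unfitted)
```

Because an integer cv does not shuffle, data that is sorted in some meaningful way is usually better served by an explicit KFold with shuffle=True, as in the lesson.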

Summarizing the Folds

Five numbers are more honest than one, but you still need to communicate a single headline result. You do that with two summaries: the mean tells you the typical performance, and the standard deviation tells you how much it varies from fold to fold.

import numpy as np

print(f"Mean R2: {cv_scores.mean():.3f}")
print(f"Std  R2: {cv_scores.std():.3f}")
# Output:
# Mean R2: 0.734
# Std  R2: 0.021

Read these together as an estimate with an uncertainty: the random forest scores about 0.734, give or take 0.021 across folds. The mean is your best single guess at how the model generalizes, and the standard deviation tells you how confident you can be in that guess. A small spread, as here, means the model is stable across different slices of the data. A large spread would be a warning that performance depends heavily on which rows it sees.
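One detail is worth knowing when you report the spread. NumPy's .std() divides by the number of scores by default, while the sample standard deviation divides by one less, and with only five folds the two differ noticeably. The sketch below recomputes both from the rounded fold scores printed earlier, using the standard library's statistics module so the arithmetic is easy to follow.

```python
from statistics import fmean, pstdev, stdev

fold_scores = [0.730, 0.699, 0.752, 0.758, 0.733]  # rounded scores printed in the lesson

mean_r2 = fmean(fold_scores)
pop_std = pstdev(fold_scores)    # divides by k, like NumPy's default .std()
sample_std = stdev(fold_scores)  # divides by k - 1, like .std(ddof=1)

print(f"mean       : {mean_r2:.3f}")
print(f"std (k)    : {pop_std:.3f}")
print(f"std (k - 1): {sample_std:.3f}")
# Output:
# mean       : 0.734
# std (k)    : 0.021
# std (k - 1): 0.023
```

Either convention is defensible; what matters is saying which one you used, so that a reader comparing 0.021 with 0.023 knows they describe the same five scores.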

A picture makes the point immediately. Each bar below is one fold’s R^2, and the dashed line marks their mean.

[Figure: bar chart of R^2 for each of the five cross-validation folds, with a dashed line at the mean. Caption: Five-fold cross-validation gives one score per fold; the mean is a steadier estimate than any single split.]

Notice how the fold scores echo the noisy single-split experiment from earlier, the same low of about 0.70 and high of about 0.76. The difference is that now you are looking at all of them at once and summarizing them deliberately, rather than seeing one of them by accident and mistaking it for the truth.

Why the Mean Is More Trustworthy

Why is averaging five scores better than reporting one? Because averaging cancels out much of the luck. When you take one split, any quirk of that particular test set (an unlucky cluster of hard rows, an outlier landing in the wrong place) lands entirely on your single score. When you average across folds, a fold that got unlucky is balanced by a fold that got lucky. The high-variance individual results smooth into a steadier central estimate.

This is the same reason a poll of many voters beats asking one person. Each fold is a small, imperfect measurement; their average is far more reliable than any one of them alone. That is the core value of cross-validation: it turns a noisy single number into an estimate you can actually stand behind, plus an honest measure of how much it might move.
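The polling analogy can be simulated directly. The sketch below is an idealized illustration, not a model of real folds: it assumes each score is a hypothetical true value of 0.73 plus independent Gaussian noise with standard deviation 0.02, and compares reporting one score with reporting the mean of five. Real fold scores share most of their training rows and are therefore not independent, so the reduction in practice is smaller than this idealized case suggests.

```python
import random
from statistics import fmean, pstdev

rng = random.Random(0)
true_score, noise_sd, k = 0.73, 0.02, 5  # hypothetical values for illustration

def noisy_score():
    """One measurement: the true score plus independent Gaussian noise."""
    return rng.gauss(true_score, noise_sd)

# Report a single score 10,000 times, then the mean of k scores 10,000 times.
single_reports = [noisy_score() for _ in range(10_000)]
averaged_reports = [fmean(noisy_score() for _ in range(k)) for _ in range(10_000)]

single_spread = pstdev(single_reports)
averaged_spread = pstdev(averaged_reports)
print(f"spread of single scores : {single_spread:.4f}")
print(f"spread of {k}-score means : {averaged_spread:.4f}")
```

Under the independence assumption the spread of the averages shrinks by about the square root of k, so five scores cut the noise to a little under half.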

Use cross-validation by default

Whenever you are comparing models or tuning settings, prefer cross-validation over a single split. It costs more compute, since you fit the model k times instead of once, but the decisions you make from a cross-validated estimate are far less likely to be the result of a lucky or unlucky draw.


Choosing the Number of Folds

If five folds are good, are a hundred folds better? Not necessarily. Choosing k is a genuine trade-off, and understanding it keeps you from picking a number blindly.

Push k to its logical extreme and you reach leave-one-out cross-validation (LOOCV), where k = n, the number of rows. Each observation becomes its own fold: you train on all rows but one, test on that single row, and repeat for every row. With 20,433 rows that means fitting the model 20,433 times, which is enormously expensive. Worse, scoring a model on a single observation produces a wildly unstable per-fold result, since one row is the noisiest possible test set.
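For a sense of what leave-one-out looks like in code, here is a small sketch using scikit-learn's LeaveOneOut splitter. It deliberately uses a tiny synthetic dataset of 30 rows and a fast LinearRegression (both assumptions, chosen so that one fit per row stays cheap). Because R^2 is not well defined on a single test row, the sketch scores each fold with absolute error instead.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic stand-in (an assumption), so that n model fits stay cheap.
X_small, y_small = make_regression(n_samples=30, n_features=3, noise=5.0, random_state=0)

loo = LeaveOneOut()
n_fits = loo.get_n_splits(X_small)  # one fold per row

# scikit-learn scorers follow "higher is better", so the error comes back negated.
abs_errors = -cross_val_score(
    LinearRegression(), X_small, y_small, cv=loo, scoring="neg_mean_absolute_error"
)
print(f"{n_fits} fits, mean absolute error {abs_errors.mean():.2f}")
print(f"smallest and largest single-row error: {abs_errors.min():.2f}, {abs_errors.max():.2f}")
```

The per-row errors typically range widely, which is the instability described above, and the number of fits grows in step with the number of rows.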

At the other extreme, k = 2 trains on only half the data each time. That is cheap, but each model sees so little data that its estimate is systematically pessimistic; the model would do better if it could train on more. This is bias in the estimate: with few folds, your reported performance tends to understate what the model could achieve with all the data.

So the two extremes pull in opposite directions, and this is one face of the bias-variance trade-off applied to the estimate itself:

  • Few folds (small k): each model trains on less data, so the estimate is biased toward looking worse than reality, but the handful of scores are fairly stable.
  • Many folds (large k): each model trains on nearly all the data, so the estimate is less biased, but the per-fold scores swing more wildly because each test fold is tiny.
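The arithmetic behind both bullets is worth seeing for this dataset. The sketch below uses a hypothetical helper, fold_sizes, that splits a row count into k folds differing by at most one row, and prints how large each test fold and training set would be for the 20,433 rows reported earlier.

```python
def fold_sizes(n_rows, k):
    """Test-fold sizes when n_rows are cut into k folds that differ by at most one row."""
    base, extra = divmod(n_rows, k)
    return [base + 1] * extra + [base] * (k - extra)

n_rows = 20_433  # row count reported for the housing data
for k in (2, 5, 10, n_rows):
    sizes = fold_sizes(n_rows, k)
    smallest_train = n_rows - max(sizes)
    print(
        f"k={k:>6}: test fold ~{min(sizes):>6} rows, "
        f"training set ~{smallest_train:>6} rows "
        f"({smallest_train / n_rows:.0%} of the data), {k} fits"
    )
```

Going from k = 2 to k = 5 raises the training share from about half to about 80 percent, while going from k = 10 to leave-one-out adds only the last 10 percent at the cost of thousands of extra fits.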

You want to balance the two. You want low bias, so the estimate reflects the model’s true ability, and you want manageable variance, so the estimate is stable and the computation is affordable.

In practice, simulations and long experience have converged on a clear recommendation: k = 5 or k = 10 strikes that balance for most problems. Five folds are cheaper; ten folds reduce bias a little more. Beyond that, the gains shrink while the compute cost climbs.

# A common, well-balanced default
cv_scores = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=42), scoring="r2"
)

Whichever you choose, the most important habit is to state your k explicitly when you report results. A score of “0.734 mean R^2 under 5-fold cross-validation” is reproducible; a bare “0.734” is not, because someone else cannot tell how you measured it.

The estimate’s variance, not the model’s

The bias-variance trade-off appears in several places in machine learning. The version here is specifically about the test-error estimate: how biased and how variable your cross-validated number is, depending on k. It is a different question from whether the model itself underfits or overfits, which you will study with regularization in the next lesson.


Putting It All Together

Here is the complete cross-validation workflow in one runnable script, the pattern you will reuse whenever you need an honest read on a model.

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestRegressor

# 1. Load the data
# download: https://datatweets.com/datasets/california_housing.csv
housing = pd.read_csv("california_housing.csv")

feature_cols = [
    "median_income", "housing_median_age", "total_rooms",
    "total_bedrooms", "population", "households",
    "latitude", "longitude",
]
X = housing[feature_cols]
y = housing["median_house_value"]

# 2. Define the model and the folding plan
model = RandomForestRegressor(n_estimators=100, random_state=42)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# 3. Run cross-validation
cv_scores = cross_val_score(model, X, y, cv=kf, scoring="r2")

# 4. Summarize: estimate plus uncertainty
print("Fold scores:", cv_scores.round(3))
print(f"Mean R2: {cv_scores.mean():.3f}")
print(f"Std  R2: {cv_scores.std():.3f}")
# Output:
# Fold scores: [0.73  0.699 0.752 0.758 0.733]
# Mean R2: 0.734
# Std  R2: 0.021

In a dozen lines you replaced a single, luck-dependent score with a stable estimate and a measure of its spread. That is the discipline that makes model comparisons and tuning decisions trustworthy.


Practice Exercises

Now it is your turn. Try these before checking the hints.

Exercise 1: Watch the Single-Split Noise

Train the same RandomForestRegressor under three different train/test splits using random_state values 10, 20, and 30, and print the test R^2 for each. Confirm that the score moves even though nothing else changes.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

housing = pd.read_csv("california_housing.csv")  # download: https://datatweets.com/datasets/california_housing.csv
# Your code here

Hint

Loop over for seed in [10, 20, 30]:, call train_test_split(X, y, test_size=0.2, random_state=seed), fit a RandomForestRegressor(n_estimators=100, random_state=42), and print model.score(X_test, y_test). The point is to see the number wobble across seeds, which is exactly the noise cross-validation smooths out.

Exercise 2: Run 10-Fold Cross-Validation

Repeat the lesson’s cross-validation, but use 10 folds instead of 5. Print the mean and standard deviation of the fold scores, and compare them to the 5-fold result of mean 0.734 and std 0.021.

from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestRegressor

# Your code here (reuse X and y from the lesson)

Hint

Build KFold(n_splits=10, shuffle=True, random_state=42), pass it as the cv argument to cross_val_score(model, X, y, cv=kf, scoring="r2"), then print cv_scores.mean() and cv_scores.std(). The mean should stay close to the 5-fold value; note whether the per-fold spread changes, since each fold now tests fewer rows.

Exercise 3: Report an Estimate With Its Uncertainty

Take the array of 5-fold scores and print a single sentence that reports the model’s performance as a mean plus or minus one standard deviation, rounded to three decimals, for example R2: 0.734 +/- 0.021.

import numpy as np
# Your code here (reuse cv_scores from the lesson)

Hint

Use an f-string: print(f"R2: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}"). Reporting both numbers together is the habit that separates an honest result from a single number that hides how much it could have varied.


Summary

Congratulations! You have learned how to turn a noisy single score into a stable, honest estimate of model performance. Let’s review what you covered.

Key Concepts

Why One Split Is Not Enough

  • A train/test split is random, so a different seed gives a different test set and a different score
  • A single score is one draw from a distribution; it can be inflated or deflated by luck or by a single outlier landing in the wrong set
  • Reporting one number hides how much that number could have varied

K-Fold Cross-Validation

  • Split the data into k equal, non-overlapping folds
  • Rotate so each fold is the test set exactly once while the others train the model, giving k scores
  • Every row is tested exactly once, and the test folds never overlap

Running It in scikit-learn

  • KFold(n_splits=k, shuffle=True, random_state=42) defines how folds are cut
  • cross_val_score(model, X, y, cv=kf, scoring="r2") runs the full rotation and returns one score per fold
  • The model is not fitted in advance; a fresh copy is trained on each fold

Summarizing the Result

  • The mean of the fold scores is your best estimate of generalization performance
  • The standard deviation measures how much that estimate varies, so report performance as a mean plus or minus a spread
  • Here the random forest scored a mean R^2 of 0.734 with a standard deviation of 0.021 across five folds

Choosing k

  • Few folds train on less data and bias the estimate downward; many folds reduce that bias but make per-fold scores swing more
  • This is the bias-variance trade-off of the estimate itself
  • k = 5 or k = 10 balances the two for most problems; always state your chosen k when reporting

Why This Matters

Every decision you make in machine learning (which model to use, which features to keep, which settings to tune) rests on comparing performance numbers. If those numbers come from a single split, you are comparing measurements that each carry several points of random noise, and you can easily pick the wrong option because it got a lucky draw. Cross-validation removes most of that noise and, just as importantly, tells you how much noise remains.

That second part is the quiet superpower. Knowing that a model scores “0.734 plus or minus 0.021” lets you judge whether a rival model that scores 0.745 is genuinely better or just within the noise. As you move into regularization and more complex models in the coming lessons, cross-validation is the measuring stick you will use to tell real improvements from random luck.
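A comparison of that kind might look like the sketch below. It is an illustration under stated assumptions: the data is synthetic (make_regression), the two rivals are a LinearRegression and a Ridge model with an arbitrarily chosen alpha, and the rule of comparing the gap in means with the fold-to-fold spread is a rough heuristic rather than a formal statistical test. The one design choice that matters is passing the same KFold object to both calls, so both models are scored on identical folds.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption; not the housing dataset).
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)  # same folds for both models

scores_a = cross_val_score(LinearRegression(), X_demo, y_demo, cv=kf_demo, scoring="r2")
scores_b = cross_val_score(Ridge(alpha=10.0), X_demo, y_demo, cv=kf_demo, scoring="r2")

gap = scores_b.mean() - scores_a.mean()
spread = max(scores_a.std(), scores_b.std())

print(f"model A: {scores_a.mean():.3f} +/- {scores_a.std():.3f}")
print(f"model B: {scores_b.mean():.3f} +/- {scores_b.std():.3f}")
if abs(gap) <= spread:
    print("The gap is within the fold-to-fold spread; treat the models as roughly tied.")
else:
    print("The gap exceeds the fold-to-fold spread; the difference may be real.")
```

Because the folds are shared, you can also subtract the two arrays and look at the per-fold differences, which removes the variation that comes from some folds simply being harder than others.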


Next Steps

You can now estimate model performance honestly and quantify how much it varies. In the next lesson, you will use that measuring stick to fight overfitting directly, with regularization techniques that keep models from memorizing their training data.

Continue to Lesson 4 - Regularization

Learn how Ridge and Lasso penalize complexity to prevent overfitting, tuned with cross-validation.


Keep Building Your Skills

You have added one of the most important habits in applied machine learning to your toolkit: never trusting a single number when you can afford several. Cross-validation costs a little more compute, but it buys you confidence that your conclusions are about the model and not about a lucky split. Carry this discipline into every comparison and tuning experiment ahead, and your results will hold up when it matters most.