Lesson 3 - Evaluating and Optimizing Decision Trees

Welcome to Evaluating and Optimizing Decision Trees

In the last lesson you trained a DecisionTreeClassifier and watched it reach a solid accuracy on the Adult Income dataset. But there was a quiet danger hiding in that result. Left to its own devices, a decision tree will keep splitting until it has carved the training data into perfectly pure groups, memorizing every quirk and accident along the way. A tree like that looks brilliant on the data it was trained on and falls apart on data it has never seen. This lesson is about spotting that failure, measuring it, and fixing it.

By the end of this lesson, you will be able to:

  • Explain what overfitting is and why decision trees are especially prone to it
  • Measure the gap between training accuracy and test accuracy and read it as a warning sign
  • Sweep max_depth from shallow to deep and plot how train and test accuracy diverge
  • Choose a depth that generalizes well instead of one that just memorizes
  • Prune a tree with the three most important hyperparameters: max_depth, min_samples_leaf, and min_samples_split

You should be comfortable with the decision tree workflow from the previous two lessons, including train_test_split, fitting a classifier, and scoring it with .score(). Let’s begin.


A Tree That Memorizes

Recall how a decision tree grows. At each node it searches every feature for the split that best separates the classes, then repeats the process on each child node. If you never tell it to stop, it keeps going until every leaf is pure, meaning every observation that lands in that leaf shares the same label, or until no split can separate the rows that remain.
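To make the split search concrete, here is a minimal sketch of how a single candidate split might be scored. It assumes Gini impurity as the criterion (the default for scikit-learn's DecisionTreeClassifier). The helper names gini and best_threshold and the toy ages and labels are made up for illustration; this is not scikit-learn's actual implementation.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure group, 0.5 for a 50/50 binary mix."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Try a threshold between each pair of adjacent distinct values and
    return (threshold, weighted_impurity) for the lowest-impurity split."""
    pairs = sorted(zip(values, labels))
    best = (None, gini(labels))  # baseline: impurity of the unsplit node
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical values cannot be separated
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for val, lab in pairs if val <= threshold]
        right = [lab for val, lab in pairs if val > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Hypothetical toy node: ages and whether each person earns more than 50K
ages = [22, 25, 31, 38, 45, 52]
earns_over_50k = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_threshold(ages, earns_over_50k)
print(threshold, impurity)  # 34.5 0.0
```

A real tree repeats this search over every feature at every node, which is why an unconstrained tree can keep finding splits until its leaves are pure.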

A fully grown tree can almost always achieve perfect, or near-perfect, accuracy on its training data. That should make you suspicious rather than happy. The tree has not necessarily learned the pattern that separates high earners from low earners; it may have simply built a private lookup table for the exact rows it was shown. The technical name for this is overfitting: the model adapts so tightly to the training data that it captures noise instead of signal, and its performance on new data suffers.

This is the central tension in machine learning. You do not actually care how well a model does on data you already have the answers for. You care how well it does on data you have not seen yet. A model that memorizes is like a student who memorizes the answer key instead of learning the subject: flawless on the practice exam, lost on the real one.

Why trees overfit so easily

Many algorithms have a fixed shape, but a decision tree’s complexity grows with the data. Given enough depth, it can fit almost any training set perfectly, including its random noise; the only rows it cannot separate are those with identical features but different labels. That flexibility is exactly what makes trees powerful and exactly what makes them dangerous if you do not rein them in. Controlling that growth is the whole job of this lesson.
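One way to see how far that flexibility goes is to hand a tree labels that contain no pattern at all. The sketch below uses synthetic data, an assumption made purely for illustration (it is not the Adult Income set), with labels generated as coin flips. All variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data: 5 continuous features and labels that are pure coin flips,
# so there is no real pattern to learn.
X_noise = rng.normal(size=(2000, 5))
y_noise = rng.integers(0, 2, size=2000)

Xn_train, Xn_test, yn_train, yn_test = train_test_split(
    X_noise, y_noise, test_size=0.25, random_state=0
)

noise_tree = DecisionTreeClassifier(random_state=0)  # no depth limit
noise_tree.fit(Xn_train, yn_train)

noise_train_acc = noise_tree.score(Xn_train, yn_train)
noise_test_acc = noise_tree.score(Xn_test, yn_test)
print("train accuracy:", noise_train_acc)           # 1.0: every row memorized
print("test accuracy: ", round(noise_test_acc, 3))  # roughly a coin flip
```

The tree reaches perfect training accuracy on labels that are noise by construction, while its test accuracy hovers around chance. A perfect training score therefore says nothing about whether a real pattern was learned.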

Setting Up the Data

You will work again with the real Adult Income dataset, where the goal is to predict whether a person earns more than 50,000 dollars per year from census features such as age, education, marital status, and weekly hours. Load it and prepare the features and target exactly as before.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# download: https://datatweets.com/datasets/adult_income.csv
df = pd.read_csv("adult_income.csv")

print("Shape:", df.shape)
# Output: Shape: (30162, 12)

# Target: 1 if the person earns more than 50K, else 0
y = (df["income"] == ">50K").astype(int)

# Features: one-hot encode the categorical columns so the tree gets numbers
X = pd.get_dummies(df.drop(columns=["income"]))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print("Training rows:", X_train.shape[0])
print("Test rows:    ", X_test.shape[0])
# Output:
# Training rows: 22621
# Test rows:     7541

The dataset has 30,162 rows and 12 columns. After one-hot encoding the categorical features, you split off 25 percent of the rows as a test set the tree will never see during training. That held-out set is your only honest measure of how the model will behave in the real world.


Measuring the Gap

The clearest way to detect overfitting is to score the same model twice: once on the data it trained on, and once on the data that was held back from it. If the two numbers are close, the model is generalizing. If training accuracy is far above test accuracy, the model is memorizing.

Let’s make the problem visible by training a tree with no depth limit at all, the default behavior, and comparing it to a deliberately shallow one.

# An unpruned tree: grows until every leaf is pure
deep_tree = DecisionTreeClassifier(random_state=42)
deep_tree.fit(X_train, y_train)

print("Unpruned tree")
print("  train accuracy:", round(deep_tree.score(X_train, y_train), 3))
print("  test accuracy: ", round(deep_tree.score(X_test, y_test), 3))
# Output:
# Unpruned tree
#   train accuracy: 0.999
#   test accuracy:  0.812

Look at that gap. The unpruned tree scores almost perfectly on the training data (0.999) but only 0.812 on the test data. Nearly nineteen points of accuracy evaporate the moment the tree faces unseen rows. That spread between train and test is the signature of overfitting, and it is the single most useful diagnostic you have.

Now compare a shallow tree that is forced to stop early.

# A pruned tree: capped at depth 5
shallow_tree = DecisionTreeClassifier(max_depth=5, random_state=42)
shallow_tree.fit(X_train, y_train)

print("Shallow tree (max_depth=5)")
print("  train accuracy:", round(shallow_tree.score(X_train, y_train), 3))
print("  test accuracy: ", round(shallow_tree.score(X_test, y_test), 3))
# Output:
# Shallow tree (max_depth=5)
#   train accuracy: 0.848
#   test accuracy:  0.846

The shallow tree scores lower on the training data, 0.848 instead of 0.999, but its test accuracy of 0.846 is much higher than the deep tree’s 0.812, and the train and test numbers now sit almost on top of each other. By giving up the ability to memorize, the tree gained the ability to generalize. This trade is the heart of model tuning.
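The comparison can be written down as simple arithmetic. This sketch only reuses the four accuracies reported above; the dictionary layout and the name gaps are illustrative choices.

```python
# Accuracies reported in this lesson for the two trees
results = {
    "unpruned": {"train": 0.999, "test": 0.812},
    "max_depth=5": {"train": 0.848, "test": 0.846},
}

gaps = {}
for name, acc in results.items():
    # Gap = how much accuracy is lost when moving from seen to unseen rows
    gaps[name] = round(acc["train"] - acc["test"], 3)
    print(f"{name:12s} train={acc['train']:.3f} test={acc['test']:.3f} gap={gaps[name]:.3f}")

# gaps: 0.187 for the unpruned tree, 0.002 for the shallow one
```

The unpruned tree's gap is roughly ninety times larger than the shallow tree's, even though its training score looks far more impressive.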

Train accuracy is not the goal

It is tempting to celebrate when training accuracy climbs toward 100 percent. Resist it. A high training score with a much lower test score is not success; it is overfitting wearing a disguise. Always judge a model by its performance on data it did not train on, and watch the gap between the two.


The Bias-Variance Trade-off

Why does a shallow tree generalize better than a deep one? It comes down to a balance every model must strike.

A very shallow tree is too simple. With only a few splits it cannot capture the real structure in the data, so it makes systematic errors on both the training set and the test set. This is called underfitting, or high bias: the model is too rigid to learn the pattern.

A very deep tree is too complex. It captures the real pattern but also chases every random fluctuation in the training data, so it does brilliantly on training rows and poorly on new ones. This is overfitting, or high variance: the model is so flexible that it changes wildly depending on the exact rows it happened to see.

   high bias                  just right                 high variance
  (underfitting)                                          (overfitting)
  -------------              -------------               -------------
  tree too shallow           good depth                  tree too deep
  train acc: low             train acc: good             train acc: ~perfect
  test acc:  low             test acc:  high             test acc:  drops
  small gap                  small gap                   large gap

The sweet spot is in the middle: a tree complex enough to learn the signal but constrained enough to ignore the noise. Your job when tuning a tree is to find that middle point, and the most direct lever for doing so is the tree’s depth.
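The "variance" half of the trade-off can be probed directly: train the same kind of tree on two different random samples of rows and count how often the two models disagree on the same unseen points. The sketch below is one rough way to do that, using synthetic data from make_classification as a stand-in (an assumption; it is not the Adult Income set). The helper name disagreement is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (flip_y)
X_bv, y_bv = make_classification(
    n_samples=4000, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_pool, y_pool = X_bv[:3000], y_bv[:3000]  # rows available for training
X_probe = X_bv[3000:]                      # unseen rows used only for predictions

def disagreement(max_depth, seed=0):
    """Fit two trees on different random halves of the pool and return the
    fraction of probe rows on which their predictions differ."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(2):
        rows = rng.choice(len(X_pool), size=1500, replace=False)
        model = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        model.fit(X_pool[rows], y_pool[rows])
        preds.append(model.predict(X_probe))
    return float(np.mean(preds[0] != preds[1]))

shallow_dis = disagreement(max_depth=2)
deep_dis = disagreement(max_depth=None)  # None means no depth limit
print("max_depth=2    disagreement:", round(shallow_dis, 3))
print("max_depth=None disagreement:", round(deep_dis, 3))
```

You would typically expect the unlimited tree to disagree with its twin more often than the shallow one does, because each deep tree has memorized the quirks of its own sample. The exact numbers depend on the data and the seed.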


Finding the Best Depth

Rather than guessing, you can sweep max_depth across a range of values and watch what happens to both training and test accuracy. This single experiment makes the entire overfitting story concrete.

depths = range(1, 21)
train_scores = []
test_scores = []

for d in depths:
    tree = DecisionTreeClassifier(max_depth=d, random_state=42)
    tree.fit(X_train, y_train)
    train_scores.append(tree.score(X_train, y_train))
    test_scores.append(tree.score(X_test, y_test))

# Find the depth with the highest test accuracy
best_idx = test_scores.index(max(test_scores))
best_depth = depths[best_idx]  # a range can be indexed directly

print("Best depth:", best_depth)
print("  test accuracy: ", round(test_scores[best_idx], 3))
print("  train accuracy:", round(train_scores[best_idx], 3))
# Output:
# Best depth: 10
#   test accuracy:  0.853
#   train accuracy: 0.870

The best test accuracy, 0.853, arrives at max_depth=10. At that depth the training accuracy is 0.870, only slightly above the test score, which tells you the tree is learning the genuine pattern without memorizing. That small, healthy gap is what you are aiming for.

Now look at what happens if you let the tree keep growing.

# depths starts at 1, so the scores for depth 20 sit at index 19 (the last entry)
print("At depth 20:")
print("  train accuracy:", round(train_scores[19], 3))
print("  test accuracy: ", round(test_scores[19], 3))
# Output:
# At depth 20:
#   train accuracy: 0.930
#   test accuracy:  0.830

By depth 20 the training accuracy has climbed to 0.930, but the test accuracy has fallen to 0.830, below the peak it reached at depth 10. The extra depth bought more memorization and less generalization. The widening distance between the two curves is overfitting happening in slow motion, and a picture makes it unmistakable.

Figure: Training and test accuracy plotted against max_depth from 1 to 20. Training accuracy keeps climbing with depth while test accuracy peaks around depth 10 and then falls, opening the overfitting gap.

Read the chart from left to right. On the far left, both curves are low: the tree is too shallow to learn much, the classic underfitting corner. As depth increases, both curves rise together because the tree is learning real structure. Around depth 10 the test curve flattens and turns down, while the training curve keeps marching upward toward 1.0. Everything to the right of that turning point is the tree trading away generalization for memorization.
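A chart like this one can be drawn with a few lines of matplotlib. The sketch below assumes matplotlib is installed; the function name plot_depth_sweep and its styling choices are hypothetical, and it is not necessarily how the original figure was produced. The example call uses only the three depths whose scores are quoted in this lesson.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs without a display
import matplotlib.pyplot as plt

def plot_depth_sweep(depths, train_scores, test_scores, path=None):
    """Plot train and test accuracy against max_depth.
    If path is given, also save the figure to that file."""
    fig, ax = plt.subplots(figsize=(7, 4))
    ax.plot(depths, train_scores, marker="o", label="train accuracy")
    ax.plot(depths, test_scores, marker="o", label="test accuracy")
    ax.set_xlabel("max_depth")
    ax.set_ylabel("accuracy")
    ax.set_title("Accuracy vs. tree depth")
    ax.legend()
    if path is not None:
        fig.savefig(path, dpi=120, bbox_inches="tight")
    return fig, ax

# Only the three depths quoted in this lesson; pass the full depths,
# train_scores, and test_scores lists from the sweep to draw all 20 points.
fig, ax = plot_depth_sweep(
    depths=[5, 10, 20],
    train_scores=[0.848, 0.870, 0.930],
    test_scores=[0.846, 0.853, 0.830],
)
```

Passing path="depth_sweep.png" (a hypothetical file name) would write the chart to disk. In a notebook you can drop the Agg line and display the figure inline.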

Do not tune endlessly against the test set

Sweeping max_depth and picking the value that maximizes test accuracy is a great way to learn, but be aware of a subtle trap: if you keep tweaking settings to chase the highest test score, you slowly start overfitting to the test set itself. In the next lesson you will meet cross-validation, a more disciplined way to choose hyperparameters that does not lean so heavily on a single test split.
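Until then, one simple hedge is to carve a third slice out of the data: a validation set used for choosing settings, so the test set is touched only once at the end. This is a plain hold-out validation split, not cross-validation. The sketch uses synthetic stand-in data (an assumption), and the 60/20/20 proportions and all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; with the lesson's data you would split X and y instead
X_all, y_all = make_classification(
    n_samples=4000, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)

# 60% for fitting, 20% for validation, 20% for the final test
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_all, y_all, test_size=0.4, random_state=0, stratify=y_all
)
X_val, X_final, y_val, y_final = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=0, stratify=y_hold
)

# Choose the depth using the validation set only
val_scores = {}
for d in range(1, 21):
    candidate = DecisionTreeClassifier(max_depth=d, random_state=0).fit(X_fit, y_fit)
    val_scores[d] = candidate.score(X_val, y_val)

chosen_depth = max(val_scores, key=val_scores.get)  # first depth with the top score

# Score on the test set exactly once, after the choice is made
final_tree = DecisionTreeClassifier(max_depth=chosen_depth, random_state=0).fit(X_fit, y_fit)
print("chosen depth:", chosen_depth)
print("test accuracy (used once):", round(final_tree.score(X_final, y_final), 3))
```

Because the test rows played no part in picking the depth, the final score stays an honest estimate.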


Pruning Hyperparameters

Limiting depth is the most intuitive way to stop a tree from overgrowing, but it is a blunt instrument: it cuts every branch off at the same level, even branches that still had useful splits left. scikit-learn gives you finer tools. The three you will reach for most often are pre-pruning parameters, meaning they stop the tree from growing in the first place rather than trimming it afterward.

max_depth

You have already met this one. max_depth caps how many levels of questions the tree is allowed to ask. A small value forces simplicity and guards hard against overfitting; a large value (or None, the default) lets the tree grow freely. It is the first knob to reach for, and the scikit-learn documentation suggests starting with max_depth=3 to get a feel for how the tree fits your data, then increasing the depth from there.

tree = DecisionTreeClassifier(max_depth=10, random_state=42)
tree.fit(X_train, y_train)
print("max_depth=10 test accuracy:", round(tree.score(X_test, y_test), 3))
# Output: max_depth=10 test accuracy: 0.853

min_samples_leaf

min_samples_leaf sets the minimum number of training observations a leaf is allowed to contain. With the default of 1, the tree happily creates leaves describing a single person, which is almost always memorization rather than a real rule. Raising this floor forces every leaf to summarize a meaningful group, which smooths the model and curbs overfitting.

tree = DecisionTreeClassifier(min_samples_leaf=50, random_state=42)
tree.fit(X_train, y_train)
print("min_samples_leaf=50")
print("  train accuracy:", round(tree.score(X_train, y_train), 3))
print("  test accuracy: ", round(tree.score(X_test, y_test), 3))
# Output:
# min_samples_leaf=50
#   train accuracy: 0.864
#   test accuracy:  0.851

Notice how requiring at least 50 observations per leaf pulls the training accuracy down from the unpruned 0.999 to a sober 0.864, while the test accuracy climbs to 0.851 and the gap nearly closes. The tree can no longer build leaves around individual rows, so it is forced to find rules that apply to crowds.
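You can check the effect on the tree's structure as well as on its accuracy. The sketch below reads leaf sizes from a fitted tree's tree_ attribute and compares the default with min_samples_leaf=50. It runs on synthetic stand-in data (an assumption; swap in X_train and y_train to inspect the lesson's trees), and the helper name leaf_sizes is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)

def leaf_sizes(fitted_tree):
    """Return the number of training rows that landed in each leaf."""
    t = fitted_tree.tree_
    is_leaf = t.children_left == -1  # leaf nodes have no children
    return t.n_node_samples[is_leaf]

default_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
floored_tree = DecisionTreeClassifier(min_samples_leaf=50, random_state=0).fit(X_demo, y_demo)

for name, model in [("min_samples_leaf=1 ", default_tree),
                    ("min_samples_leaf=50", floored_tree)]:
    sizes = leaf_sizes(model)
    print(name, "| leaves:", model.get_n_leaves(),
          "| depth:", model.get_depth(),
          "| smallest leaf:", sizes.min())
```

With the floor in place, no leaf can hold fewer than 50 training rows, so the tree ends up with far fewer leaves than the unconstrained one.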

min_samples_split

min_samples_split controls the other end of the same idea: it sets the minimum number of observations a node must hold before the tree is even allowed to split it. The default is 2, which means the tree will try to split almost anything. Raising it stops the tree from carving up tiny groups whose apparent patterns are mostly noise, which again reduces overfitting.

tree = DecisionTreeClassifier(min_samples_split=100, random_state=42)
tree.fit(X_train, y_train)
print("min_samples_split=100")
print("  train accuracy:", round(tree.score(X_train, y_train), 3))
print("  test accuracy: ", round(tree.score(X_test, y_test), 3))
# Output:
# min_samples_split=100
#   train accuracy: 0.872
#   test accuracy:  0.852

The effect is similar in spirit to min_samples_leaf: by refusing to split small nodes, the tree stays shallower and more general where the data is thin.
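The two sample-count rules are easy to confuse, so here is a small sketch of how they differ: min_samples_split looks at the size of the node before splitting, while min_samples_leaf looks at the size of each child after splitting. The function split_allowed is a hypothetical helper that checks one proposed split; it does not reproduce scikit-learn's internal search, which would go on to consider other split points.

```python
def split_allowed(n_node, n_left, min_samples_split=2, min_samples_leaf=1):
    """Check whether one proposed split passes both sample-count rules.

    n_node: rows in the node being split
    n_left: rows the split would send to the left child
    """
    n_right = n_node - n_left
    big_enough_to_split = n_node >= min_samples_split
    children_big_enough = min(n_left, n_right) >= min_samples_leaf
    return big_enough_to_split and children_big_enough

# A node of 120 rows where the proposed split would peel off just 3 rows
print(split_allowed(120, 3))                         # True: the defaults allow it
print(split_allowed(120, 3, min_samples_split=100))  # True: the node is big enough
print(split_allowed(120, 3, min_samples_leaf=50))    # False: a 3-row child is too small
print(split_allowed(80, 40, min_samples_split=100))  # False: the node is below 100 rows
```

Note the second case: min_samples_split=100 on its own still permits a leaf of only 3 rows, which is one reason the two parameters are often set together.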

Combining the Knobs

These parameters are not mutually exclusive. In practice you set several at once, letting them reinforce each other to produce a compact, well-behaved tree.

tree = DecisionTreeClassifier(
    max_depth=10,
    min_samples_leaf=20,
    min_samples_split=50,
    random_state=42,
)
tree.fit(X_train, y_train)
print("Combined pruning")
print("  train accuracy:", round(tree.score(X_train, y_train), 3))
print("  test accuracy: ", round(tree.score(X_test, y_test), 3))
# Output:
# Combined pruning
#   train accuracy: 0.864
#   test accuracy:  0.853

The combined tree lands at a test accuracy of 0.853, matching the best depth-only result, but with a smaller train/test gap, which means it is a more trustworthy model. Each parameter is attacking overfitting from a slightly different angle: depth limits the overall height, while the sample-count rules prevent the tree from building flimsy branches in sparse corners of the data.

Pre-pruning versus post-pruning

The parameters in this lesson are all forms of pre-pruning: they stop the tree from growing too far in the first place. There is also post-pruning, where you grow a full tree and then trim back the branches that add little value. Both aim at the same goal, a simpler tree that generalizes, and pre-pruning is the faster and more common starting point.
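For reference, scikit-learn's form of post-pruning is minimal cost-complexity pruning, controlled by the ccp_alpha parameter. The sketch below grows a full tree, asks for the sequence of alpha values that would prune it step by step, and refits with one of them. It uses synthetic stand-in data (an assumption), and picking the alpha from the middle of the path is purely for illustration; it is not a recommended selection rule.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; swap in X_train and y_train for the lesson's data
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)

# Grow the full tree, then compute the alphas at which branches get pruned
full_tree = DecisionTreeClassifier(random_state=0).fit(X_syn, y_syn)
path = full_tree.cost_complexity_pruning_path(X_syn, y_syn)

# Take an alpha from the middle of the path for illustration only;
# max(..., 0.0) guards against tiny negative values from rounding.
alpha = max(float(path.ccp_alphas[len(path.ccp_alphas) // 2]), 0.0)
pruned_tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_syn, y_syn)

print("leaves before pruning:", full_tree.get_n_leaves())
print("leaves after pruning: ", pruned_tree.get_n_leaves())
```

Larger alpha values prune more aggressively. In practice the value would be chosen with held-out data rather than by position in the path.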


Practice Exercises

Now it is your turn. Work through these before checking the hints. Assume X_train, X_test, y_train, and y_test are already prepared from the Adult Income dataset as shown earlier in the lesson.

Exercise 1: Measure the Overfitting Gap

Train an unpruned DecisionTreeClassifier (no depth limit, random_state=42) and print both its training and test accuracy. Then compute and print the difference between them. How large is the gap, and what does it tell you?

from sklearn.tree import DecisionTreeClassifier

# Your code here

Hint

Instantiate DecisionTreeClassifier(random_state=42), call .fit(X_train, y_train), then compute .score(X_train, y_train) and .score(X_test, y_test). Subtract the test score from the train score. You should see a training accuracy near 0.999 and a test accuracy near 0.812, a gap of almost 0.19, which is a textbook sign of overfitting.

Exercise 2: Sweep a Few Depths

Loop over the depths 3, 5, 10, 15, and 20. For each depth, train a tree and print the depth alongside its test accuracy. Which depth gives the highest test accuracy?

# Your code here

Hint

Inside a for d in [3, 5, 10, 15, 20]: loop, build DecisionTreeClassifier(max_depth=d, random_state=42), fit it, and print d with tree.score(X_test, y_test). The peak is at max_depth=10 with a test accuracy of about 0.853; depth 20 should be noticeably lower at about 0.830.

Exercise 3: Tune the Leaf Size

Compare two trees with no depth limit: one with the default min_samples_leaf=1 and one with min_samples_leaf=50. Print the train and test accuracy for each. How does raising the leaf size change the gap between training and test accuracy?

# Your code here

Hint

Train DecisionTreeClassifier(random_state=42) and DecisionTreeClassifier(min_samples_leaf=50, random_state=42), scoring each on both sets. The default tree overfits (train near 0.999, test near 0.812), while the min_samples_leaf=50 tree has a much smaller gap (train near 0.864, test near 0.851). Forcing larger leaves trades a little training accuracy for much better generalization.


Summary

Excellent work! You have learned to diagnose the most common failure of decision trees and to fix it with a handful of well-chosen settings. Let’s review what you covered.

Key Concepts

Overfitting

  • An unconstrained decision tree grows until every leaf is pure, memorizing the training data instead of learning the underlying pattern
  • Overfitting means the model fits the training data so closely that it performs poorly on unseen data
  • Underfitting is the opposite: a model too simple to capture the real pattern, scoring poorly everywhere

Diagnosing the Problem

  • Score the model on both the training set and the test set; a large gap between them signals overfitting
  • The unpruned tree on Adult Income scored 0.999 on train but only 0.812 on test, a clear overfitting gap
  • A shallow tree gave up training accuracy but generalized far better, with train and test scores nearly equal

The Bias-Variance Trade-off

  • Too simple means high bias and underfitting; too complex means high variance and overfitting
  • The goal is the middle ground: complex enough to learn the signal, simple enough to ignore the noise

Finding the Best Depth

  • Sweeping max_depth from 1 to 20 showed training accuracy rising steadily while test accuracy peaked and then fell
  • Test accuracy peaked at max_depth=10 (test 0.853, train 0.870); at max_depth=20 train rose to 0.930 but test dropped to 0.830

Pruning Hyperparameters

  • max_depth caps how many levels the tree can grow
  • min_samples_leaf sets the minimum observations in a leaf, preventing leaves built around single rows
  • min_samples_split sets the minimum observations a node needs before it can be split
  • These are all pre-pruning controls, and combining them produces a compact tree with a small train/test gap

Why This Matters

Overfitting is not a quirk of decision trees; it is a central challenge across machine learning. Every model you ever build will face the same tension between fitting the data you have and generalizing to the data you do not. Decision trees just make the problem vivid, because you can literally watch the train and test curves pull apart as the tree deepens.

The habit you practiced here, always comparing training and test performance and treating a large gap as a red flag, is one of the most valuable instincts a practitioner can develop. It is also what motivates the next two big ideas in this module. Cross-validation gives you a more reliable way to choose hyperparameters than leaning on a single test split, and ensemble methods like random forests tame overfitting by combining many trees instead of trusting one. You are now ready for both.


Next Steps

You can now spot overfitting, measure it, and prune it away. The next lesson builds directly on this foundation, showing you how to evaluate models more reliably and how combining many trees produces a far stronger predictor than any single tree.

Continue to Lesson 4 - Cross-Validation and Ensemble Methods

Evaluate models reliably with cross-validation and boost accuracy by combining many trees into a random forest.



Keep Building Your Skills

You have learned to look past a flattering training score and ask the only question that matters: how will this model do on data it has never seen? That single discipline, watching the gap between train and test and pruning until it narrows to a small, healthy margin, will serve you in every project you take on. Decision trees made the lesson concrete, but the instinct is universal. Carry it forward, and the more powerful models in the next lessons will feel like natural extensions of the same idea.