Lesson 1 - Foundations of Decision Trees

Welcome to Decision Trees

This lesson introduces the decision tree, one of the most intuitive and widely used models in machine learning. You will learn what a tree is made of, how it asks a sequence of yes/no questions to reach a prediction, and how it decides which question to ask first. Along the way you will meet two measures of “messiness” called Gini impurity and entropy, and you will read a real tree trained on the Adult Income dataset.

By the end of this lesson, you will be able to:

  • Describe the parts of a decision tree: nodes, splits, branches, and leaves
  • Explain how a tree classifies a new observation by following yes/no questions
  • Measure how mixed a group of labels is using Gini impurity and entropy
  • Explain how a tree chooses the best split by reducing impurity
  • Read a small depth-3 decision tree and trace a prediction through it

You should be comfortable with basic Python, pandas, and the idea of features and a target from the foundations module. No prior knowledge of trees is needed. Let’s begin.


What Is a Decision Tree?

Imagine you are trying to guess whether a person earns more than 50,000 dollars a year. You cannot see their bank statement, but you can ask questions. You might start with: “Are they married?” If yes, the odds of a high income go up. Then: “Did they finish more years of education?” If yes, the odds rise again. Each question narrows things down until you feel confident enough to make a guess.

That is exactly how a decision tree works. It is a model that learns a sequence of yes/no questions about the features, and follows them down to a final prediction. Because the questions are simple and the path is easy to trace, a trained tree is one of the few machine learning models you can read like a flowchart and explain to someone with no technical background.

Trees handle two kinds of problems. When the target is a category (does this person earn more than 50K, yes or no?), the tree is a classification tree. When the target is a number (what salary will this person earn?), it is a regression tree. This lesson focuses on classification, since the Adult Income target is a yes/no label.

Why trees are so popular

Most machine learning models are black boxes: they give an answer but cannot tell you why. A decision tree is different. Every prediction is the result of a readable chain of questions, so you can point at the exact rules that led to it. That transparency makes trees a favorite for explaining results to non-technical stakeholders, and it makes them the foundation for the powerful ensemble methods you will meet later in this module.

The Parts of a Tree

Decision trees are drawn upside down, with the starting point at the top and the answers at the bottom. The diagram has a small vocabulary that you will use constantly:

                 [ Root node ]            <- the first question
                 /          \
        True (left)        False (right)
            /                    \
    [ Internal node ]        [ Leaf ]      <- a final prediction
        /        \
   [ Leaf ]    [ Leaf ]
  • A node holds a question, also called a split or a threshold, such as education_num <= 12.5. It divides the data into two groups.
  • The root node is the single node at the very top, where every prediction begins.
  • An internal node is any node below the root that still asks a question and splits the data further.
  • A leaf (also called a terminal node) asks nothing. It is the end of a path and holds the final prediction.
  • A branch is an arrow connecting a node to one of its two children. By a near-universal convention, a True answer goes left and a False answer goes right.

There is one more pair of words worth knowing. A node that splits into two children is a parent, and the two nodes it produces are its children. The root is always a parent and never a child. A leaf is always a child and never a parent. Internal nodes are both.

How a Tree Makes a Prediction

To classify a new person, you start at the root and answer its question using that person’s features. If the answer is True you move down the left branch; if False you move right. You repeat this at each internal node until you land on a leaf, and the leaf gives you the prediction.

Two different people can both be predicted “high income” yet arrive at different leaves, because they took different paths and matched the high-income pattern for different reasons. This is part of what makes trees so readable: the path itself is an explanation.


The Goal: Pure Groups

If a tree is just a chain of questions, what makes one question better than another? The answer is the single idea that drives everything a tree does: a good split produces groups that are more homogeneous than the group it started with.

Homogeneous means the labels inside a group agree with each other. A group of 50 people who all earn more than 50K is perfectly homogeneous. A group that is half high-earners and half low-earners is as mixed as it gets. The tree’s whole job is to keep asking questions that sort a mixed group into purer and purer subgroups, until a group is pure enough to become a leaf and make a confident prediction.

A leaf containing only one class is called a pure leaf. Once a group is pure, there is nothing left to gain by splitting it further, so the tree stops there. To turn this intuition into something a computer can optimize, we need a number that measures how mixed a group is. That number is called impurity, and there are two common ways to compute it.


Measuring Impurity: Gini

The first measure is Gini impurity. It answers a clever question: if you reached into a group, pulled out one observation at random, and then guessed its label by drawing a second random label from the same group, how often would you be wrong? The more mixed the group, the more often you would misclassify, so a higher Gini means a messier group.

The formula sums over the k k classes in the group, where pi p_i is the fraction of observations belonging to class i i :

Gini=1i=1kpi2 \text{Gini} = 1 - \sum_{i=1}^{k} p_i^{\,2}

The best possible value is 0, which happens when a group contains only one class. The value rises as the group becomes more mixed. Here are three quick examples for a two-class problem (earns more than 50K vs. not):

  • All one class (10 high-earners, 0 low): 1(10/10)2=0 1 - (10/10)^2 = 0 . Perfectly pure.
  • Even split (5 high, 5 low): 1(5/10)2(5/10)2=0.5 1 - (5/10)^2 - (5/10)^2 = 0.5 . Maximum impurity for two classes.
  • Lopsided (9 high, 1 low): 1(9/10)2(1/10)2=0.18 1 - (9/10)^2 - (1/10)^2 = 0.18 . Mostly pure, low impurity.

You can compute Gini directly from the counts of each class.

def gini(pos, neg):
    total = pos + neg
    p_pos = pos / total
    p_neg = neg / total
    return 1 - p_pos**2 - p_neg**2

print(round(gini(10, 0), 3))   # pure group
print(round(gini(5, 5), 3))    # even split
print(round(gini(9, 1), 3))    # mostly one class
# Output:
# 0.0
# 0.5
# 0.18

Measuring Impurity: Entropy

The second measure comes from information theory and is called entropy. It captures the same idea, the amount of disorder in a group, but uses logarithms. Entropy is highest when the classes are evenly mixed (maximum uncertainty about which label you will draw) and zero when the group is pure.

The formula sums over the k k classes, again with pi p_i the fraction in class i i , using a logarithm of base 2:

Entropy=i=1kpilog2(pi) \text{Entropy} = - \sum_{i=1}^{k} p_i \log_{2}(p_i)

For a two-class problem, entropy ranges from 0 (pure) to 1 (a perfect 50/50 mix). By convention, a class with pi=0 p_i = 0 contributes 0 to the sum, since log2(0) \log_2(0) is undefined but the term vanishes in the limit.

from math import log2

def entropy(pos, neg):
    total = pos + neg
    out = 0
    for count in (pos, neg):
        if count > 0:               # skip the 0-probability term
            p = count / total
            out -= p * log2(p)
    return out

print(round(entropy(10, 0), 3))   # pure group
print(round(entropy(5, 5), 3))    # even split
print(round(entropy(9, 1), 3))    # mostly one class
# Output:
# 0.0
# 1.0
# 0.469

Notice the pattern: both measures are 0 for a pure group and peak for an even split. Gini peaks at 0.5 and entropy peaks at 1.0, but the shape of the two curves is almost identical, and they almost always agree on which split is best. The chart below plots both against the fraction of positive observations in a binary group.

Gini and entropy impurity curves plotted against class proportion, both peaking at an even 50/50 split and reaching zero for pure groups
Gini and entropy follow the same arch: zero for a pure group and maximum at an even split.

Gini or entropy: does it matter?

In practice the two criteria produce nearly identical trees, so the choice rarely changes your results. Gini is the default in scikit-learn because it avoids the logarithm and is slightly faster to compute. Reach for entropy only if you have a specific reason to; otherwise the default is a fine choice.


Choosing the Best Split

A single impurity number describes one group. But a split produces two groups, the True child and the False child, and they usually have different sizes. To score a split, the tree combines the two children’s impurities into a weighted impurity, giving each child a weight equal to its share of the observations:

Weighted Gini=ntruenparentGinitrue+nfalsenparentGinifalse \text{Weighted Gini} = \frac{n_{\text{true}}}{n_{\text{parent}}}\,\text{Gini}_{\text{true}} + \frac{n_{\text{false}}}{n_{\text{parent}}}\,\text{Gini}_{\text{false}}

The tree tries this for every candidate threshold on every feature and keeps the one with the lowest weighted impurity. That threshold becomes the node’s question. The same logic works for entropy, except trees usually phrase it as information gain, the drop in entropy from parent to children, and pick the split with the highest gain. Lowest impurity and highest information gain point at the same winning split.

Where Do the Candidate Thresholds Come From?

A tree does not try every conceivable number. For each numeric feature it sorts the unique values and considers the midpoint between each consecutive pair. If a feature has values [10, 12, 13, 16], the candidate thresholds are 11.0, 12.5, and 14.5. This keeps the search finite and fast while still covering every place the data could be cleanly divided.

Let’s make this concrete with a tiny example. Suppose a parent group has 6 high-earners and 4 low-earners, and a candidate split sends 5 high-earners with 1 low-earner to the left, and 1 high-earner with 3 low-earners to the right.

def gini(pos, neg):
    total = pos + neg
    return 1 - (pos / total)**2 - (neg / total)**2

parent_gini = gini(6, 4)                 # 10 observations total
left_gini = gini(5, 1)                   # 6 observations
right_gini = gini(1, 3)                  # 4 observations

weighted = (6 / 10) * left_gini + (4 / 10) * right_gini

print("parent  :", round(parent_gini, 3))
print("weighted:", round(weighted, 3))
# Output:
# parent  : 0.48
# weighted: 0.317

The split cut the impurity from 0.48 down to 0.317, so it made the groups meaningfully purer. If a competing threshold scored a lower weighted Gini, the tree would pick that one instead. Repeat this greedily, level after level, and you have grown a decision tree.

Trees are greedy

A tree chooses the best split at each node without looking ahead to how it affects later splits. This greedy strategy is fast and usually works well, but it does not guarantee the globally best tree. It also means trees can keep splitting until every leaf is pure, which leads to overfitting. Controlling how deep a tree grows is a central topic, and you will tackle it directly in a later lesson.


A Real Dataset: Adult Income

Toy examples are useful for the math, but a tree shows its value on real data. You will work with the Adult Income dataset, drawn from census records, where the task is to predict whether a person earns more than 50,000 dollars a year based on attributes like age, education, marital status, and hours worked per week.

import pandas as pd

# download: https://datatweets.com/datasets/adult_income.csv
df = pd.read_csv("adult_income.csv")

print("Shape:", df.shape)
# Output: Shape: (30162, 12)

The dataset has 30,162 rows and 12 columns. Each row is one person, and the income column is the target: the string ">50K" or "<=50K". Before training anything, look at how the target is distributed, because a lopsided target changes how you should read accuracy.

print(df["income"].value_counts())
# Output:
# income
# <=50K    22654
# >50K      7508
# Name: count, dtype: int64

rate = (df["income"] == ">50K").mean()
print("rate >50K:", round(rate, 3))
# Output: rate >50K: 0.249

Only about 24.9 percent of people earn more than 50K, so the dataset is imbalanced: the low-income class outnumbers the high-income class roughly three to one. A bar chart makes the gap clear.

Bar chart showing about 25 percent of people earn more than 50K and about 75 percent earn 50K or less
The Adult Income target is imbalanced: only about a quarter of people earn more than 50K.

Imbalance and the impurity of the whole dataset

Because the split is roughly 75/25, the Gini impurity of the full dataset is 10.75120.24920.374 1 - 0.751^2 - 0.249^2 \approx 0.374 , not the 0.5 you would see in a balanced set. A model that always guessed “earns 50K or less” would already be right about 75 percent of the time. That baseline is exactly why you must look at the target balance first: it tells you how much better than a trivial guess your tree actually needs to be.


Reading a Real Tree

To see the parts and the splits come together, here is a small tree trained on just four features, marital_status, education_num, capital_gain, and hours_per_week, and limited to a depth of 3 so it stays readable. Restricting the depth keeps the tree shallow on purpose: a full-depth tree on real data has hundreds of nodes and is impossible to read by eye.

A readable depth-3 decision tree predicting income, splitting first on marital status, then on education and capital gain, ending in leaves labeled high or low income
A depth-3 tree on four features: shallow enough to read, yet already capturing strong income patterns.

Trace a path through it the way the tree would. Start at the root, which asks about marital status: married people overwhelmingly fall on one side, unmarried people on the other, because marital status turns out to be the most informative single question in this data. From there, follow the education and capital-gain questions down to a leaf, where the majority class in that leaf becomes the prediction. Notice that some leaves are nearly pure (almost everyone high or almost everyone low) while others are still mixed, because a depth-3 tree simply runs out of questions before it can fully separate every group.

Even this tiny tree is surprisingly capable. Trained on the four features above and capped at depth 3, it correctly classifies about 80.3 percent of the training observations.

# Reproduced from the depth-3 tree shown above (4 features, max_depth=3)
train_accuracy = 0.803
print(f"Depth-3 training accuracy: {train_accuracy:.3f}")
# Output: Depth-3 training accuracy: 0.803

Hold this number up against the imbalance you found earlier. Always guessing “50K or less” would score about 75 percent, so the tree’s 80.3 percent is a real improvement built from just three layers of yes/no questions. You will see how to grow, visualize, and properly evaluate such trees with scikit-learn in the next lesson; for now, the goal is simply to read one and understand why each split is there.

Training accuracy is not the final word

The 80.3 percent here is measured on the same data the tree learned from, so it is an optimistic, in-sample number. The honest test of any model is how it performs on data it has never seen, which is why later lessons split the data into training and test sets before judging a tree. Treat this training accuracy as a sanity check that the splits are capturing something real, not as the model’s true score.


Practice Exercises

Try these before checking the hints. They reinforce the impurity math and the structure of a tree.

Exercise 1: Compute Gini for the Whole Dataset

Using the counts from the Adult Income target (22654 people earning 50K or less and 7508 earning more than 50K), write a function that returns the Gini impurity of the full dataset and print the result rounded to three decimals.

import pandas as pd
df = pd.read_csv("adult_income.csv")  # download: https://datatweets.com/datasets/adult_income.csv

# Your code here

Hint

Get the two probabilities with p_high = (df["income"] == ">50K").mean() and p_low = 1 - p_high. Then Gini is 1 - p_high**2 - p_low**2. You should get about 0.374, matching the imbalance callout above.

Exercise 2: Score a Candidate Split

A parent group has 8 high-earners and 2 low-earners. A candidate split sends 6 high-earners and 0 low-earners left, and 2 high-earners and 2 low-earners right. Compute the weighted Gini of this split and decide whether it improved on the parent’s impurity.

def gini(pos, neg):
    total = pos + neg
    return 1 - (pos / total)**2 - (neg / total)**2

# Your code here

Hint

The parent has 10 observations, so the left child weight is 6/10 and the right child weight is 4/10. Compute gini(6, 0) and gini(2, 2), multiply each by its weight, and add. The weighted Gini is 0.2, lower than the parent’s 0.32, so yes, the split helped.

Exercise 3: Compare Gini and Entropy

Write both a gini and an entropy function for a two-class group, then evaluate both on a 70/30 group (7 of one class, 3 of the other). Confirm that both report a moderate, non-zero impurity.

from math import log2

# Your code here

Hint

For entropy, loop over the two counts, skip any count of 0, and accumulate -p * log2(p). On a 70/30 group you should get Gini around 0.42 and entropy around 0.881. Both are well below their maximums (0.5 and 1.0), reflecting that the group leans toward one class.


Summary

You now understand what a decision tree is, what it is made of, and how it decides where to split. Let’s review the key ideas.

Key Concepts

Tree Structure

  • A tree is a sequence of yes/no questions running from a root node down to leaves
  • Nodes hold a split (a threshold); leaves hold a final prediction and ask nothing
  • Branches follow the convention True goes left, False goes right
  • A node that splits is a parent; the two groups it creates are its children

Making Predictions

  • To classify an observation, start at the root and follow the answers down to a leaf
  • The leaf’s majority class becomes the prediction
  • Two observations can reach the same prediction by different paths, and the path is itself an explanation

Impurity

  • A tree aims to produce homogeneous (pure) groups, where the labels agree
  • Gini impurity is the chance of misclassifying a randomly drawn label; it ranges from 0 (pure) to 0.5 (even split, two classes)
  • Entropy measures disorder; it ranges from 0 (pure) to 1 (even split, two classes)
  • Gini and entropy follow nearly the same curve and usually pick the same split

Choosing Splits

  • The tree scores a split by its weighted impurity, weighting each child by its share of observations
  • It tries every candidate threshold (midpoints between sorted feature values) and keeps the one with the lowest weighted impurity, equivalently the highest information gain
  • The process is greedy: best split now, repeated recursively, with no look-ahead

The Adult Income Data

  • 30,162 rows, target income of ">50K" or "<=50K"
  • Only about 24.9 percent earn more than 50K, so the data is imbalanced
  • A readable depth-3 tree on four features reaches about 80.3 percent training accuracy, beating the ~75 percent always-guess-low baseline

Why This Matters

Every advanced tree-based method you will learn, from a single tuned decision tree to a random forest of hundreds of them, is built on exactly the ideas in this lesson. A forest is just many trees, and each of those trees grows by the same impurity-reducing splits you traced here by hand. Understanding what a split is, why one threshold beats another, and how impurity drives the whole process means that when later lessons hand the work to scikit-learn, you will know precisely what the library is doing on your behalf, and why its choices make sense.

Just as importantly, you saw the warning signs early. A tree is greedy and will happily split until every leaf is pure, which is the road to overfitting, and training accuracy on imbalanced data can flatter a weak model. Keeping these in mind from the start will make you a far more careful modeler than someone who only knows how to call .fit().


Next Steps

You can now read a decision tree and explain how it learns. In the next lesson, you will stop computing splits by hand and let scikit-learn build, train, and visualize a full tree on the Adult Income data in just a few lines.

Continue to Lesson 2 - Building Decision Trees with Scikit-Learn

Train, visualize, and evaluate a real decision tree with the scikit-learn library.

Back to Module Overview

Return to the Trees and Ensembles module overview.


Keep Building Your Skills

You have laid the foundation for everything that follows in this module. The vocabulary of nodes, splits, and leaves, and the idea that good splits reduce impurity, will come back in every lesson, all the way through random forests. Take a moment to trace the depth-3 tree one more time and convince yourself you can predict where any person would land. Once that feels natural, you are ready to let the computer build trees for you, and to start asking the deeper questions of how big a tree should be and how well it truly generalizes.