Lesson 1 - Foundations of Decision Trees
Welcome to Decision Trees
This lesson introduces the decision tree, one of the most intuitive and widely used models in machine learning. You will learn what a tree is made of, how it asks a sequence of yes/no questions to reach a prediction, and how it decides which question to ask first. Along the way you will meet two measures of “messiness” called Gini impurity and entropy, and you will read a real tree trained on the Adult Income dataset.
By the end of this lesson, you will be able to:
- Describe the parts of a decision tree: nodes, splits, branches, and leaves
- Explain how a tree classifies a new observation by following yes/no questions
- Measure how mixed a group of labels is using Gini impurity and entropy
- Explain how a tree chooses the best split by reducing impurity
- Read a small depth-3 decision tree and trace a prediction through it
You should be comfortable with basic Python, pandas, and the idea of features and a target from the foundations module. No prior knowledge of trees is needed. Let’s begin.
What Is a Decision Tree?
Imagine you are trying to guess whether a person earns more than 50,000 dollars a year. You cannot see their bank statement, but you can ask questions. You might start with: “Are they married?” If yes, the odds of a high income go up. Then: “Did they finish more years of education?” If yes, the odds rise again. Each question narrows things down until you feel confident enough to make a guess.
That is exactly how a decision tree works. It is a model that learns a sequence of yes/no questions about the features, and follows them down to a final prediction. Because the questions are simple and the path is easy to trace, a trained tree is one of the few machine learning models you can read like a flowchart and explain to someone with no technical background.
Trees handle two kinds of problems. When the target is a category (does this person earn more than 50K, yes or no?), the tree is a classification tree. When the target is a number (what salary will this person earn?), it is a regression tree. This lesson focuses on classification, since the Adult Income target is a yes/no label.
Why trees are so popular
Most machine learning models are black boxes: they give an answer but cannot tell you why. A decision tree is different. Every prediction is the result of a readable chain of questions, so you can point at the exact rules that led to it. That transparency makes trees a favorite for explaining results to non-technical stakeholders, and it makes them the foundation for the powerful ensemble methods you will meet later in this module.
The Parts of a Tree
Decision trees are drawn upside down, with the starting point at the top and the answers at the bottom. The diagram has a small vocabulary that you will use constantly:
[ Root node ] <- the first question
/ \
True (left) False (right)
/ \
[ Internal node ] [ Leaf ] <- a final prediction
/ \
[ Leaf ] [ Leaf ]- A node holds a question, also called a split or a threshold, such as
education_num <= 12.5. It divides the data into two groups. - The root node is the single node at the very top, where every prediction begins.
- An internal node is any node below the root that still asks a question and splits the data further.
- A leaf (also called a terminal node) asks nothing. It is the end of a path and holds the final prediction.
- A branch is an arrow connecting a node to one of its two children. By a near-universal convention, a True answer goes left and a False answer goes right.
There is one more pair of words worth knowing. A node that splits into two children is a parent, and the two nodes it produces are its children. The root is always a parent and never a child. A leaf is always a child and never a parent. Internal nodes are both.
How a Tree Makes a Prediction
To classify a new person, you start at the root and answer its question using that person’s features. If the answer is True you move down the left branch; if False you move right. You repeat this at each internal node until you land on a leaf, and the leaf gives you the prediction.
Two different people can both be predicted “high income” yet arrive at different leaves, because they took different paths and matched the high-income pattern for different reasons. This is part of what makes trees so readable: the path itself is an explanation.
The Goal: Pure Groups
If a tree is just a chain of questions, what makes one question better than another? The answer is the single idea that drives everything a tree does: a good split produces groups that are more homogeneous than the group it started with.
Homogeneous means the labels inside a group agree with each other. A group of 50 people who all earn more than 50K is perfectly homogeneous. A group that is half high-earners and half low-earners is as mixed as it gets. The tree’s whole job is to keep asking questions that sort a mixed group into purer and purer subgroups, until a group is pure enough to become a leaf and make a confident prediction.
A leaf containing only one class is called a pure leaf. Once a group is pure, there is nothing left to gain by splitting it further, so the tree stops there. To turn this intuition into something a computer can optimize, we need a number that measures how mixed a group is. That number is called impurity, and there are two common ways to compute it.
Measuring Impurity: Gini
The first measure is Gini impurity. It answers a clever question: if you reached into a group, pulled out one observation at random, and then guessed its label by drawing a second random label from the same group, how often would you be wrong? The more mixed the group, the more often you would misclassify, so a higher Gini means a messier group.
The formula sums over the classes in the group, where is the fraction of observations belonging to class :
The best possible value is 0, which happens when a group contains only one class. The value rises as the group becomes more mixed. Here are three quick examples for a two-class problem (earns more than 50K vs. not):
- All one class (10 high-earners, 0 low): . Perfectly pure.
- Even split (5 high, 5 low): . Maximum impurity for two classes.
- Lopsided (9 high, 1 low): . Mostly pure, low impurity.
You can compute Gini directly from the counts of each class.
def gini(pos, neg):
total = pos + neg
p_pos = pos / total
p_neg = neg / total
return 1 - p_pos**2 - p_neg**2
print(round(gini(10, 0), 3)) # pure group
print(round(gini(5, 5), 3)) # even split
print(round(gini(9, 1), 3)) # mostly one class
# Output:
# 0.0
# 0.5
# 0.18Measuring Impurity: Entropy
The second measure comes from information theory and is called entropy. It captures the same idea, the amount of disorder in a group, but uses logarithms. Entropy is highest when the classes are evenly mixed (maximum uncertainty about which label you will draw) and zero when the group is pure.
The formula sums over the classes, again with the fraction in class , using a logarithm of base 2:
For a two-class problem, entropy ranges from 0 (pure) to 1 (a perfect 50/50 mix). By convention, a class with contributes 0 to the sum, since is undefined but the term vanishes in the limit.
from math import log2
def entropy(pos, neg):
total = pos + neg
out = 0
for count in (pos, neg):
if count > 0: # skip the 0-probability term
p = count / total
out -= p * log2(p)
return out
print(round(entropy(10, 0), 3)) # pure group
print(round(entropy(5, 5), 3)) # even split
print(round(entropy(9, 1), 3)) # mostly one class
# Output:
# 0.0
# 1.0
# 0.469Notice the pattern: both measures are 0 for a pure group and peak for an even split. Gini peaks at 0.5 and entropy peaks at 1.0, but the shape of the two curves is almost identical, and they almost always agree on which split is best. The chart below plots both against the fraction of positive observations in a binary group.
Gini or entropy: does it matter?
In practice the two criteria produce nearly identical trees, so the choice rarely changes your results. Gini is the default in scikit-learn because it avoids the logarithm and is slightly faster to compute. Reach for entropy only if you have a specific reason to; otherwise the default is a fine choice.
Choosing the Best Split
A single impurity number describes one group. But a split produces two groups, the True child and the False child, and they usually have different sizes. To score a split, the tree combines the two children’s impurities into a weighted impurity, giving each child a weight equal to its share of the observations:
The tree tries this for every candidate threshold on every feature and keeps the one with the lowest weighted impurity. That threshold becomes the node’s question. The same logic works for entropy, except trees usually phrase it as information gain, the drop in entropy from parent to children, and pick the split with the highest gain. Lowest impurity and highest information gain point at the same winning split.
Where Do the Candidate Thresholds Come From?
A tree does not try every conceivable number. For each numeric feature it sorts the unique values and considers the midpoint between each consecutive pair. If a feature has values [10, 12, 13, 16], the candidate thresholds are 11.0, 12.5, and 14.5. This keeps the search finite and fast while still covering every place the data could be cleanly divided.
Let’s make this concrete with a tiny example. Suppose a parent group has 6 high-earners and 4 low-earners, and a candidate split sends 5 high-earners with 1 low-earner to the left, and 1 high-earner with 3 low-earners to the right.
def gini(pos, neg):
total = pos + neg
return 1 - (pos / total)**2 - (neg / total)**2
parent_gini = gini(6, 4) # 10 observations total
left_gini = gini(5, 1) # 6 observations
right_gini = gini(1, 3) # 4 observations
weighted = (6 / 10) * left_gini + (4 / 10) * right_gini
print("parent :", round(parent_gini, 3))
print("weighted:", round(weighted, 3))
# Output:
# parent : 0.48
# weighted: 0.317The split cut the impurity from 0.48 down to 0.317, so it made the groups meaningfully purer. If a competing threshold scored a lower weighted Gini, the tree would pick that one instead. Repeat this greedily, level after level, and you have grown a decision tree.
Trees are greedy
A tree chooses the best split at each node without looking ahead to how it affects later splits. This greedy strategy is fast and usually works well, but it does not guarantee the globally best tree. It also means trees can keep splitting until every leaf is pure, which leads to overfitting. Controlling how deep a tree grows is a central topic, and you will tackle it directly in a later lesson.
A Real Dataset: Adult Income
Toy examples are useful for the math, but a tree shows its value on real data. You will work with the Adult Income dataset, drawn from census records, where the task is to predict whether a person earns more than 50,000 dollars a year based on attributes like age, education, marital status, and hours worked per week.
import pandas as pd
# download: https://datatweets.com/datasets/adult_income.csv
df = pd.read_csv("adult_income.csv")
print("Shape:", df.shape)
# Output: Shape: (30162, 12)The dataset has 30,162 rows and 12 columns. Each row is one person, and the income column is the target: the string ">50K" or "<=50K". Before training anything, look at how the target is distributed, because a lopsided target changes how you should read accuracy.
print(df["income"].value_counts())
# Output:
# income
# <=50K 22654
# >50K 7508
# Name: count, dtype: int64
rate = (df["income"] == ">50K").mean()
print("rate >50K:", round(rate, 3))
# Output: rate >50K: 0.249Only about 24.9 percent of people earn more than 50K, so the dataset is imbalanced: the low-income class outnumbers the high-income class roughly three to one. A bar chart makes the gap clear.
Imbalance and the impurity of the whole dataset
Because the split is roughly 75/25, the Gini impurity of the full dataset is , not the 0.5 you would see in a balanced set. A model that always guessed “earns 50K or less” would already be right about 75 percent of the time. That baseline is exactly why you must look at the target balance first: it tells you how much better than a trivial guess your tree actually needs to be.
Reading a Real Tree
To see the parts and the splits come together, here is a small tree trained on just four features, marital_status, education_num, capital_gain, and hours_per_week, and limited to a depth of 3 so it stays readable. Restricting the depth keeps the tree shallow on purpose: a full-depth tree on real data has hundreds of nodes and is impossible to read by eye.
Trace a path through it the way the tree would. Start at the root, which asks about marital status: married people overwhelmingly fall on one side, unmarried people on the other, because marital status turns out to be the most informative single question in this data. From there, follow the education and capital-gain questions down to a leaf, where the majority class in that leaf becomes the prediction. Notice that some leaves are nearly pure (almost everyone high or almost everyone low) while others are still mixed, because a depth-3 tree simply runs out of questions before it can fully separate every group.
Even this tiny tree is surprisingly capable. Trained on the four features above and capped at depth 3, it correctly classifies about 80.3 percent of the training observations.
# Reproduced from the depth-3 tree shown above (4 features, max_depth=3)
train_accuracy = 0.803
print(f"Depth-3 training accuracy: {train_accuracy:.3f}")
# Output: Depth-3 training accuracy: 0.803Hold this number up against the imbalance you found earlier. Always guessing “50K or less” would score about 75 percent, so the tree’s 80.3 percent is a real improvement built from just three layers of yes/no questions. You will see how to grow, visualize, and properly evaluate such trees with scikit-learn in the next lesson; for now, the goal is simply to read one and understand why each split is there.
Training accuracy is not the final word
The 80.3 percent here is measured on the same data the tree learned from, so it is an optimistic, in-sample number. The honest test of any model is how it performs on data it has never seen, which is why later lessons split the data into training and test sets before judging a tree. Treat this training accuracy as a sanity check that the splits are capturing something real, not as the model’s true score.
Practice Exercises
Try these before checking the hints. They reinforce the impurity math and the structure of a tree.
Exercise 1: Compute Gini for the Whole Dataset
Using the counts from the Adult Income target (22654 people earning 50K or less and 7508 earning more than 50K), write a function that returns the Gini impurity of the full dataset and print the result rounded to three decimals.
import pandas as pd
df = pd.read_csv("adult_income.csv") # download: https://datatweets.com/datasets/adult_income.csv
# Your code hereHint
Get the two probabilities with p_high = (df["income"] == ">50K").mean() and p_low = 1 - p_high. Then Gini is 1 - p_high**2 - p_low**2. You should get about 0.374, matching the imbalance callout above.
Exercise 2: Score a Candidate Split
A parent group has 8 high-earners and 2 low-earners. A candidate split sends 6 high-earners and 0 low-earners left, and 2 high-earners and 2 low-earners right. Compute the weighted Gini of this split and decide whether it improved on the parent’s impurity.
def gini(pos, neg):
total = pos + neg
return 1 - (pos / total)**2 - (neg / total)**2
# Your code hereHint
The parent has 10 observations, so the left child weight is 6/10 and the right child weight is 4/10. Compute gini(6, 0) and gini(2, 2), multiply each by its weight, and add. The weighted Gini is 0.2, lower than the parent’s 0.32, so yes, the split helped.
Exercise 3: Compare Gini and Entropy
Write both a gini and an entropy function for a two-class group, then evaluate both on a 70/30 group (7 of one class, 3 of the other). Confirm that both report a moderate, non-zero impurity.
from math import log2
# Your code hereHint
For entropy, loop over the two counts, skip any count of 0, and accumulate -p * log2(p). On a 70/30 group you should get Gini around 0.42 and entropy around 0.881. Both are well below their maximums (0.5 and 1.0), reflecting that the group leans toward one class.
Summary
You now understand what a decision tree is, what it is made of, and how it decides where to split. Let’s review the key ideas.
Key Concepts
Tree Structure
- A tree is a sequence of yes/no questions running from a root node down to leaves
- Nodes hold a split (a threshold); leaves hold a final prediction and ask nothing
- Branches follow the convention True goes left, False goes right
- A node that splits is a parent; the two groups it creates are its children
Making Predictions
- To classify an observation, start at the root and follow the answers down to a leaf
- The leaf’s majority class becomes the prediction
- Two observations can reach the same prediction by different paths, and the path is itself an explanation
Impurity
- A tree aims to produce homogeneous (pure) groups, where the labels agree
- Gini impurity is the chance of misclassifying a randomly drawn label; it ranges from 0 (pure) to 0.5 (even split, two classes)
- Entropy measures disorder; it ranges from 0 (pure) to 1 (even split, two classes)
- Gini and entropy follow nearly the same curve and usually pick the same split
Choosing Splits
- The tree scores a split by its weighted impurity, weighting each child by its share of observations
- It tries every candidate threshold (midpoints between sorted feature values) and keeps the one with the lowest weighted impurity, equivalently the highest information gain
- The process is greedy: best split now, repeated recursively, with no look-ahead
The Adult Income Data
- 30,162 rows, target
incomeof">50K"or"<=50K" - Only about 24.9 percent earn more than 50K, so the data is imbalanced
- A readable depth-3 tree on four features reaches about 80.3 percent training accuracy, beating the ~75 percent always-guess-low baseline
Why This Matters
Every advanced tree-based method you will learn, from a single tuned decision tree to a random forest of hundreds of them, is built on exactly the ideas in this lesson. A forest is just many trees, and each of those trees grows by the same impurity-reducing splits you traced here by hand. Understanding what a split is, why one threshold beats another, and how impurity drives the whole process means that when later lessons hand the work to scikit-learn, you will know precisely what the library is doing on your behalf, and why its choices make sense.
Just as importantly, you saw the warning signs early. A tree is greedy and will happily split until every leaf is pure, which is the road to overfitting, and training accuracy on imbalanced data can flatter a weak model. Keeping these in mind from the start will make you a far more careful modeler than someone who only knows how to call .fit().
Next Steps
You can now read a decision tree and explain how it learns. In the next lesson, you will stop computing splits by hand and let scikit-learn build, train, and visualize a full tree on the Adult Income data in just a few lines.
Continue to Lesson 2 - Building Decision Trees with Scikit-Learn
Train, visualize, and evaluate a real decision tree with the scikit-learn library.
Back to Module Overview
Return to the Trees and Ensembles module overview.
Keep Building Your Skills
You have laid the foundation for everything that follows in this module. The vocabulary of nodes, splits, and leaves, and the idea that good splits reduce impurity, will come back in every lesson, all the way through random forests. Take a moment to trace the depth-3 tree one more time and convince yourself you can predict where any person would land. Once that feels natural, you are ready to let the computer build trees for you, and to start asking the deeper questions of how big a tree should be and how well it truly generalizes.