Lesson 5 - Boolean Indexing and Data Filtering
Filtering Data Like a Pro
You can now create arrays, perform calculations, and select specific rows and columns. This lesson teaches you Boolean indexing—a powerful technique for filtering data based on conditions. This is how you find exactly the data you need.
By the end of this lesson, you will be able to:
- Create Boolean arrays using comparison operators
- Filter 1D arrays to extract values meeting specific criteria
- Filter 2D arrays based on column values
- Combine multiple conditions using logical operators (AND, OR, NOT)
- Count how many values meet certain criteria
- Apply these techniques to real data analysis problems
Boolean indexing is one of the most powerful features in NumPy. It transforms data analysis from tedious manual searching into elegant, expressive queries.
Creating Boolean Arrays
Comparison Operators
Comparison operators create Boolean arrays—arrays containing only True and False values. These arrays act as masks that tell NumPy which elements to include or exclude.
import numpy as np
ages = np.array([25, 30, 15, 45, 18, 62, 35, 12, 28])
print("Ages:", ages)
Now apply comparison operators:
# Greater than or equal (>=)
adults = ages >= 18
print("Adults (>= 18):")
print(adults)
# Output: [ True True False True True True True False True]
Visual representation:
ages: [25, 30, 15, 45, 18, 62, 35, 12, 28]
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
>=18? [T, T, F, T, T, T, T, F, T]
More comparisons:
# Greater than (>)
seniors = ages > 60
print("\nSeniors (> 60):")
print(seniors)
# Output: [False False False False False True False False False]
# Equal to (==)
exactly_25 = ages == 25
print("\nExactly 25 years old:")
print(exactly_25)
# Output: [ True False False False False False False False False]
# Less than (<)
minors = ages < 18
print("\nMinors (< 18):")
print(minors)
# Output: [False False True False False False False True False]
All Comparison Operators
NumPy supports six comparison operators. The examples below apply four of them to a set of scores:
scores = np.array([85, 92, 78, 88, 95, 72, 89, 91])
print("Scores:", scores)
print("\nA grades (>= 90):", scores >= 90)
# Output: [False True False False True False False True]
print("B or better (>= 80):", scores >= 80)
# Output: [ True True False True True False True True]
print("Failed (< 60):", scores < 60)
# Output: [False False False False False False False False]
print("Exactly 88:", scores == 88)
# Output: [False False False True False False False False]
print("Not 88:", scores != 88)
# Output: [ True True True False True True True True]
The six operators are:
- > greater than
- < less than
- >= greater than or equal
- <= less than or equal
- == equal to
- != not equal to
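The scores examples above do not exercise > or <=. A minimal sketch of both on the same scores array follows; the variable names above_88 and at_most_88 are illustrative and not part of the lesson.

```python
import numpy as np

scores = np.array([85, 92, 78, 88, 95, 72, 89, 91])

# Strictly greater than: 88 itself is excluded
above_88 = scores > 88
print("Above 88:", above_88)
# Output: Above 88: [False  True False False  True False  True  True]

# Less than or equal: 88 itself is included
at_most_88 = scores <= 88
print("At most 88:", at_most_88)
# Output: At most 88: [ True False  True  True False  True False False]

# The result is an ordinary NumPy array whose dtype is bool
print(at_most_88.dtype)
# Output: bool
```

Every score falls into exactly one of the two groups, so the two masks are exact opposites of each other.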
Boolean Indexing for 1D Arrays
Filtering with Boolean Masks
Use a Boolean array as an index to extract only the values where the array is True:
ages = np.array([25, 30, 15, 45, 18, 62, 35, 12, 28])
# Create Boolean mask
adult_mask = ages >= 18
print("Boolean mask (>= 18):")
print(adult_mask)
# Output: [ True True False True True True True False True]
# Use mask to filter
adult_ages = ages[adult_mask]
print("\nAdult ages only:")
print(adult_ages)
# Output: [25 30 45 18 62 35 28]
Visual representation:
ages: [25, 30, 15, 45, 18, 62, 35, 12, 28]
mask: [ T, T, F, T, T, T, T, F, T]
↓ ↓ ↓ ↓ ↓ ↓ ↓
result: [25, 30, 45, 18, 62, 35, 28]
Only positions where mask is True are included.
Shortcut: Combine in One Line
You typically combine the comparison and filtering in one line:
# Adult ages (one line)
adult_ages = ages[ages >= 18]
print("Adult ages (one line):")
print(adult_ages)
# Output: [25 30 45 18 62 35 28]
# Minor ages
minor_ages = ages[ages < 18]
print("\nMinor ages:")
print(minor_ages)
# Output: [15 12]
This is more concise and easier to read.
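A named mask is still worth keeping when the same condition has to be applied to a second array of the same length. The sketch below assumes a hypothetical names array with one label per entry in ages; it is not part of the lesson data.

```python
import numpy as np

ages = np.array([25, 30, 15, 45, 18, 62, 35, 12, 28])
# Hypothetical labels, one per entry in ages (illustration only)
names = np.array(["Ana", "Ben", "Cy", "Dee", "Eli", "Fay", "Gus", "Hal", "Ivy"])

# Build the mask once, then reuse it on both arrays
is_minor = ages < 18
print(names[is_minor])
# Output: ['Cy' 'Hal']
print(ages[is_minor])
# Output: [15 12]
```

The mask and the array being indexed must have the same length, which is why this works for names and ages here.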
Counting Matches
To count how many values meet a condition, use .sum() on the Boolean array. True counts as 1, False counts as 0:
ages = np.array([25, 30, 15, 45, 18, 62, 35, 12, 28])
# Count how many adults
num_adults = (ages >= 18).sum()
print(f"Number of adults: {num_adults}")
# Output: Number of adults: 7
# Alternative: count from filtered array
num_adults = ages[ages >= 18].shape[0]
print(f"Number of adults (alt): {num_adults}")
# Output: Number of adults (alt): 7
# Count seniors
num_seniors = (ages > 60).sum()
print(f"Number of seniors: {num_seniors}")
# Output: Number of seniors: 1
Practical Example: Filter Sales Data
Let’s analyze daily sales:
# Daily sales for a month
daily_sales = np.array([120, 135, 98, 145, 110, 175, 190, 88, 165, 142,
                        130, 155, 95, 180, 125, 168, 152, 138, 115, 195,
                        185, 162, 105, 148, 172, 133, 158, 140, 112, 178])
print(f"Total days: {len(daily_sales)}")
print(f"Average daily sales: {daily_sales.mean():.1f}")
Output:
Total days: 30
Average daily sales: 143.8
Find high-performing days:
# Find high-performing days (> 150)
high_days = daily_sales[daily_sales > 150]
print(f"\nHigh-performing days (> 150): {len(high_days)} days")
print(f"Sales on those days: {high_days}")
print(f"Average on high days: {high_days.mean():.1f}")
Output:
High-performing days (> 150): 13 days
Sales on those days: [175 190 165 155 180 168 152 195 185 162 172 158 178]
Average on high days: 171.9
Find low-performing days:
# Find low-performing days (< 100)
low_days = daily_sales[daily_sales < 100]
print(f"\nLow-performing days (< 100): {len(low_days)} days")
print(f"Sales on those days: {low_days}")
Output:
Low-performing days (< 100): 3 days
Sales on those days: [98 88 95]
This analysis immediately identifies problem areas that need attention.
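The filtered values show how low sales were, but not on which days. One way to recover the day numbers is np.flatnonzero, which returns the positions where a mask is True. This sketch assumes the first entry of daily_sales is day 1; the lesson does not state how days are numbered.

```python
import numpy as np

daily_sales = np.array([120, 135, 98, 145, 110, 175, 190, 88, 165, 142,
                        130, 155, 95, 180, 125, 168, 152, 138, 115, 195,
                        185, 162, 105, 148, 172, 133, 158, 140, 112, 178])

# Positions (0-based) where the mask is True
low_positions = np.flatnonzero(daily_sales < 100)

# Assumption: the first entry is day 1, so shift each position by one
low_day_numbers = low_positions + 1
print("Low-performing day numbers:", low_day_numbers)
# Output: Low-performing day numbers: [ 3  8 13]
```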
Boolean Indexing for 2D Arrays
Filter Rows Based on Column Value
With 2D arrays, you typically filter entire rows based on values in a specific column:
# Student data: ID, Math, Physics, Chemistry
students = np.array([
    [101, 85, 90, 88],
    [102, 92, 85, 91],
    [103, 78, 82, 80],
    [104, 88, 87, 92],
    [105, 95, 89, 93],
    [106, 72, 75, 70],
    [107, 89, 91, 87]
])
print("Student Data:")
print(students)
print(f"Shape: {students.shape}")
Find students with high math scores:
# Extract Math column
math_scores = students[:, 1]
print("Math scores:", math_scores)
# Output: [85 92 78 88 95 72 89]
# Create Boolean mask
high_math_mask = math_scores >= 90
print("High math mask (>= 90):", high_math_mask)
# Output: [False True False False True False False]
# Filter rows
high_math_students = students[high_math_mask]
print("\nStudents with Math >= 90:")
print(high_math_students)
Output:
Students with Math >= 90:
[[102 92 85 91]
[105 95 89 93]]
Shortcut: One-Line Filtering
Combine everything in one line:
# Students with Math >= 90 (one line)
high_math = students[students[:, 1] >= 90]
print("High Math students (one line):")
print(high_math)
print(f"\nNumber of students: {high_math.shape[0]}")
Output:
High Math students (one line):
[[102 92 85 91]
[105 95 89 93]]
Number of students: 2
Select Specific Columns from Filtered Rows
You can filter rows and then select specific columns:
# Students with Physics >= 85, show ID and Physics only
high_physics = students[students[:, 2] >= 85]
id_and_physics = high_physics[:, [0, 2]]
print("Students with Physics >= 85 (ID and Physics only):")
print(id_and_physics)
Output:
Students with Physics >= 85 (ID and Physics only):
[[101 90]
[102 85]
[104 87]
[105 89]
[107 91]]
Another example:
# Students with Chemistry < 75, show all their scores
low_chem = students[students[:, 3] < 75]
print("Students with Chemistry < 75:")
print(low_chem)
Output:
Students with Chemistry < 75:
[[106 72 75 70]]
Practical Example: Sales Analysis
# Sales data: Day, Product_A, Product_B, Product_C
sales = np.array([
    [1, 120, 135, 98],
    [2, 135, 142, 105],
    [3, 150, 138, 112],
    [4, 145, 155, 108],
    [5, 160, 148, 118],
    [6, 175, 162, 125],
    [7, 190, 170, 132]
])
print("Weekly Sales:")
print(sales)
Analyze product performance:
# Days when Product A sold > 150
strong_a_days = sales[sales[:, 1] > 150]
print("Days when Product A > 150:")
print(strong_a_days)
# Output:
# [[5 160 148 118]
# [6 175 162 125]
# [7 190 170 132]]
# Days when Product C sold < 110
weak_c_days = sales[sales[:, 3] < 110]
print("\nDays when Product C < 110:")
print(weak_c_days)
# Output:
# [[1 120 135 98]
# [2 135 142 105]
# [4 145 155 108]]
Combining Multiple Conditions
Logical Operators
You can combine multiple conditions using logical operators:
- & (AND): Both conditions must be True
- | (OR): At least one condition must be True
- ~ (NOT): Reverses True/False
Important
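Always use parentheses around each condition when combining them! Python evaluates & and | before comparison operators such as >= and <, so without parentheses the expression is grouped differently from what you intended and typically raises an error.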
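To see why the parentheses matter, here is a minimal sketch using the ages array from this lesson. It relies on standard Python operator precedence, where & binds more tightly than comparisons such as >= and <=; the variable name in_range is illustrative.

```python
import numpy as np

ages = np.array([25, 30, 15, 45, 18, 62, 35, 12, 28, 55])

# With parentheses: each comparison runs first, then & combines the masks
in_range = ages[(ages >= 20) & (ages <= 40)]
print(in_range)
# Output: [25 30 35 28]

# Without parentheses, Python evaluates 20 & ages first, which turns the
# expression into a chained comparison on arrays and raises a ValueError
try:
    ages[ages >= 20 & ages <= 40]
except ValueError as err:
    print("ValueError:", err)
```

The error message itself comes from NumPy and may vary between versions, so it is not reproduced here.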
ages = np.array([25, 30, 15, 45, 18, 62, 35, 12, 28, 55])
print("Ages:", ages)
AND Operator (&)
Both conditions must be True:
# Ages between 20 and 40 (inclusive)
middle_aged = ages[(ages >= 20) & (ages <= 40)]
print("Ages 20-40:", middle_aged)
# Output: [25 30 35 28]
# Count them
count = ((ages >= 20) & (ages <= 40)).sum()
print(f"Number of people aged 20-40: {count}")
# Output: 4
Visual representation:
ages: [25, 30, 15, 45, 18, 62, 35, 12, 28, 55]
>=20: [ T, T, F, T, F, T, T, F, T, T]
<=40: [ T, T, T, F, T, F, T, T, T, F]
Both True: [ T, T, F, F, F, F, T, F, T, F]
↓ ↓ ↓ ↓
result: [25, 30, 35, 28]
OR Operator (|)
At least one condition must be True:
# Young (< 20) OR senior (> 60)
young_or_senior = ages[(ages < 20) | (ages > 60)]
print("Young or Senior:", young_or_senior)
# Output: [15 18 62 12]
# Count them
count = ((ages < 20) | (ages > 60)).sum()
print(f"Number of young or seniors: {count}")
# Output: 4
NOT Operator (~)
Reverses the condition:
# NOT adults (same as minors)
not_adults = ages[~(ages >= 18)]
print("Not adults (< 18):", not_adults)
# Output: [15 12]
# NOT in range 20-40
not_middle = ages[~((ages >= 20) & (ages <= 40))]
print("Not in 20-40 range:", not_middle)
# Output: [15 45 18 62 12 55]
Complex Conditions
Combine multiple criteria:
scores = np.array([85, 92, 78, 88, 95, 72, 89, 91, 65, 82])
# B range: 80-89
b_grades = scores[(scores >= 80) & (scores < 90)]
print("B grades (80-89):", b_grades)
# Output: [85 88 89 82]
# A or F: >= 90 or < 60
extreme_grades = scores[(scores >= 90) | (scores < 60)]
print("Extreme grades (A or F):", extreme_grades)
# Output: [92 95 91]
# Not passing: < 60
failing = scores[scores < 60]
print("Failing grades:", failing)
# Output: []
Combining Conditions in 2D Arrays
# Student data: ID, Math, Physics, Chemistry
students = np.array([
    [101, 85, 90, 88],
    [102, 92, 85, 91],
    [103, 78, 82, 80],
    [104, 88, 87, 92],
    [105, 95, 89, 93],
    [106, 72, 75, 70],
    [107, 89, 91, 87]
])
print("Student Data:")
print(students)
Find students strong in multiple subjects:
# Students with Math >= 85 AND Physics >= 85
strong_both = students[(students[:, 1] >= 85) & (students[:, 2] >= 85)]
print("Strong in both Math and Physics:")
print(strong_both)
Output:
Strong in both Math and Physics:
[[101 85 90 88]
[102 92 85 91]
[104 88 87 92]
[105 95 89 93]
[107 89 91 87]]
Find students who excel in at least one subject:
# Students with Math >= 90 OR Chemistry >= 90
excel_in_one = students[(students[:, 1] >= 90) | (students[:, 3] >= 90)]
print("Excel in Math OR Chemistry:")
print(excel_in_one)
Output:
Excel in Math OR Chemistry:
[[102 92 85 91]
[104 88 87 92]
[105 95 89 93]]
Find struggling students:
# Students with ANY score < 75
struggling = students[(students[:, 1] < 75) | (students[:, 2] < 75) | (students[:, 3] < 75)]
print("Students with any score < 75:")
print(struggling)
Output:
Students with any score < 75:
[[106 72 75 70]]
Advanced Real-World Example
# Sales data: Day, Revenue, Units_Sold, Customers
sales = np.array([
    [1, 1200, 45, 28],
    [2, 1350, 52, 32],
    [3, 980, 38, 22],
    [4, 1450, 55, 35],
    [5, 1100, 42, 26],
    [6, 1750, 65, 42],
    [7, 1900, 70, 45]
])
print("Sales Data:")
print(sales)
Identify high-performing days:
# High-performing: Revenue > 1400 AND Customers > 30
high_days = sales[(sales[:, 1] > 1400) & (sales[:, 3] > 30)]
print("High-performing days (Revenue > 1400 AND Customers > 30):")
print(high_days)
print(f"Number of high days: {high_days.shape[0]}")
Output:
High-performing days (Revenue > 1400 AND Customers > 30):
[[4 1450 55 35]
[6 1750 65 42]
[7 1900 70 45]]
Number of high days: 3
Identify alert days:
# Alert days: Low revenue (< 1000) OR few customers (< 25)
alert_days = sales[(sales[:, 1] < 1000) | (sales[:, 3] < 25)]
print("\nAlert days (Revenue < 1000 OR Customers < 25):")
print(alert_days)
Output:
Alert days (Revenue < 1000 OR Customers < 25):
[[3 980 38 22]]
Find medium-range days:
# Medium range: Units between 40 and 60
medium_days = sales[(sales[:, 2] >= 40) & (sales[:, 2] <= 60)]
print("\nMedium sales days (40-60 units):")
print(medium_days)
Output:
Medium sales days (40-60 units):
[[1 1200 45 28]
[2 1350 52 32]
[4 1450 55 35]
[5 1100 42 26]]
Practice Exercises
Apply Boolean indexing to solve these problems.
Exercise 1: Basic Filtering
Find all temperatures above 30°C:
temperatures = np.array([25, 28, 32, 30, 35, 27, 31, 29, 33, 26])
# Your code here:
# 1. Filter temperatures > 30
# 2. Count how many
Exercise 2: 2D Filtering
Find products with price greater than 50:
# Product data: ID, Price, Stock
products = np.array([
    [1, 45, 120],
    [2, 65, 85],
    [3, 30, 200],
    [4, 75, 55],
    [5, 40, 150]
])
# Your code here:
# Filter products with price (column 1) > 50
Hint
Use products[products[:, 1] > 50] to filter based on column 1.
Exercise 3: Multiple Conditions
Find students with scores between 80 and 90 (inclusive):
exam_scores = np.array([75, 85, 92, 78, 88, 95, 82, 71, 89, 93])
# Your code here:
# 1. Filter scores in range 80-90
# 2. Count how many
Summary
Boolean indexing is a powerful data filtering technique. Let’s review what you learned.
Key Concepts
Boolean Arrays
- Created with comparison operators: >, <, >=, <=, ==, !=
- Contain True/False values
- Act as masks for filtering
Boolean Indexing (1D)
- Filter with condition: array[array > 50]
- Returns only values where condition is True
- Count matches: (array > 50).sum()
Boolean Indexing (2D)
- Filter rows: array[array[:, col] > value]
- Returns entire rows where condition is True
- Can select specific columns from filtered rows
Combining Conditions
- AND: (cond1) & (cond2) - both must be True
- OR: (cond1) | (cond2) - at least one must be True
- NOT: ~(condition) - reverses True/False
- Always use parentheses around each condition
Key Patterns Reference
# 1D filtering
array[array > 50] # Values > 50
(array > 50).sum() # Count > 50
array[array == value] # Exact matches
# 2D filtering
array[array[:, 1] > 50] # Rows where column 1 > 50
array[array[:, 0] == value] # Rows where column 0 equals value
# Combining conditions
array[(array > 20) & (array < 40)] # AND
array[(array < 10) | (array > 90)] # OR
array[~(array >= 50)] # NOT
# Count matches
((array > 20) & (array < 40)).sum()
Important Reminders
- Always use parentheses around each condition when combining
- & for AND, | for OR, ~ for NOT
- Do not use Python’s and, or, not with arrays
- Boolean indexing returns a copy, not a view
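The last two reminders can be checked directly. The sketch below reuses the ages array from this lesson; the variable name adults is illustrative.

```python
import numpy as np

ages = np.array([25, 30, 15, 45, 18, 62, 35, 12, 28])

# Boolean indexing returns a copy: changing the result leaves ages untouched
adults = ages[ages >= 18]
adults[0] = 99
print(adults[0], ages[0])
# Output: 99 25

# Python's `and` asks each array for a single True/False value, which
# NumPy refuses to guess for arrays with more than one element
try:
    (ages >= 18) and (ages < 60)
except ValueError:
    print("Use & instead of and")
# Output: Use & instead of and
```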
Why This Matters
Boolean indexing enables you to:
- Find data that meets specific criteria
- Identify outliers or anomalies
- Segment data into categories
- Filter datasets for focused analysis
- Answer complex questions about your data
This is how professional data analysts query datasets. Instead of manually searching through thousands of rows, you express your question as a Boolean condition and let NumPy do the work instantly.
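As a small illustration of that contrast, the sketch below answers the same question twice: once with a manual loop and once with a Boolean condition. It uses only the first ten values of the daily sales data from this lesson, and the names high_loop and high_mask are illustrative.

```python
import numpy as np

# First ten values of the lesson's daily sales data
daily_sales = np.array([120, 135, 98, 145, 110, 175, 190, 88, 165, 142])

# Manual search: check each value in a Python loop
high_loop = []
for value in daily_sales:
    if value > 150:
        high_loop.append(int(value))

# Boolean indexing: the same question as a single expression
high_mask = daily_sales[daily_sales > 150]

print(high_loop)
# Output: [175, 190, 165]
print(high_mask)
# Output: [175 190 165]
```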
Next Steps
You now have powerful filtering capabilities. In the next lesson, you will learn to modify data—changing values, cleaning datasets, and transforming arrays for analysis.
Master Data Filtering
Boolean indexing transforms how you work with data. Instead of writing complex loops to find matching values, you express your criteria as simple conditions. This technique is essential for data cleaning, analysis, and exploration.
Combined with the other NumPy skills you have learned, Boolean indexing makes you capable of handling real-world data analysis tasks efficiently and elegantly!