Lesson 5 - Boolean Indexing and Data Filtering

Filtering Data Like a Pro

You can now create arrays, perform calculations, and select specific rows and columns. This lesson teaches you Boolean indexing—a powerful technique for filtering data based on conditions. This is how you find exactly the data you need.

By the end of this lesson, you will be able to:

  • Create Boolean arrays using comparison operators
  • Filter 1D arrays to extract values meeting specific criteria
  • Filter 2D arrays based on column values
  • Combine multiple conditions using logical operators (AND, OR, NOT)
  • Count how many values meet certain criteria
  • Apply these techniques to real data analysis problems

Boolean indexing is one of the most useful features in NumPy. It replaces manual searching and explicit loops with short expressions that state the condition directly.


Creating Boolean Arrays

Comparison Operators

Comparison operators create Boolean arrays—arrays containing only True and False values. These arrays act as masks that tell NumPy which elements to include or exclude.

import numpy as np

ages = np.array([25, 30, 15, 45, 18, 62, 35, 12, 28])

print("Ages:", ages)

Now apply comparison operators:

# Greater than or equal (>=)
adults = ages >= 18
print("Adults (>= 18):")
print(adults)
# Output: [ True  True False  True  True  True  True False  True]

Visual representation:

ages:   [25, 30, 15, 45, 18, 62, 35, 12, 28]
         ↓   ↓   ↓   ↓   ↓   ↓   ↓   ↓   ↓
>=18?   [T,  T,  F,  T,  T,  T,  T,  F,  T]

More comparisons:

# Greater than (>)
seniors = ages > 60
print("\nSeniors (> 60):")
print(seniors)
# Output: [False False False False False  True False False False]

# Equal to (==)
exactly_25 = ages == 25
print("\nExactly 25 years old:")
print(exactly_25)
# Output: [ True False False False False False False False False]

# Less than (<)
minors = ages < 18
print("\nMinors (< 18):")
print(minors)
# Output: [False False  True False False False False  True False]

All Comparison Operators

The following example applies several comparison operators to a set of test scores. The full list of six operators follows it:

scores = np.array([85, 92, 78, 88, 95, 72, 89, 91])

print("Scores:", scores)
print("\nA grades (>= 90):", scores >= 90)
# Output: [False  True False False  True False False  True]

print("B or better (>= 80):", scores >= 80)
# Output: [ True  True False  True  True False  True  True]

print("Failed (< 60):", scores < 60)
# Output: [False False False False False False False False]

print("Exactly 88:", scores == 88)
# Output: [False False False  True False False False False]

print("Not 88:", scores != 88)
# Output: [ True  True  True False  True  True  True  True]

The six operators are:

  • > greater than
  • < less than
  • >= greater than or equal
  • <= less than or equal
  • == equal to
  • != not equal to
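The scores example above does not exercise > or <=. As a minimal sketch using the same scores array, the following shows both, and also checks that the result of a comparison is an ordinary NumPy array of dtype bool with the same shape as the input:

```python
import numpy as np

scores = np.array([85, 92, 78, 88, 95, 72, 89, 91])

# Strictly greater than: a score of exactly 88 is excluded
above_88 = scores > 88
print("Above 88:", above_88)

# Less than or equal: a score of exactly 88 is included
at_most_88 = scores <= 88
print("At most 88:", at_most_88)

# The comparison result is a Boolean array shaped like the input
print(above_88.dtype, above_88.shape)
```

Because > and <= are exact opposites, every position that is True in one array is False in the other.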

Boolean Indexing for 1D Arrays

Filtering with Boolean Masks

Use a Boolean array as an index to extract only the values where the array is True:

ages = np.array([25, 30, 15, 45, 18, 62, 35, 12, 28])

# Create Boolean mask
adult_mask = ages >= 18
print("Boolean mask (>= 18):")
print(adult_mask)
# Output: [ True  True False  True  True  True  True False  True]

# Use mask to filter
adult_ages = ages[adult_mask]
print("\nAdult ages only:")
print(adult_ages)
# Output: [25 30 45 18 62 35 28]

Visual representation:

ages:       [25, 30, 15, 45, 18, 62, 35, 12, 28]
mask:       [ T,  T,  F,  T,  T,  T,  T,  F,  T]
                ↓   ↓       ↓   ↓   ↓   ↓       ↓
result:     [25, 30,     45, 18, 62, 35,     28]

Only positions where mask is True are included
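A mask only works if it lines up with the array position by position. The sketch below assumes a recent NumPy version, where a mask of the wrong length is rejected with an IndexError; bad_mask is a hypothetical hand-written mask used only for illustration:

```python
import numpy as np

ages = np.array([25, 30, 15, 45, 18, 62, 35, 12, 28])

# A mask built from the array itself always has the matching shape
good_mask = ages >= 18
print(good_mask.shape == ages.shape)

# A hand-written mask with only 3 entries does not line up with 9 ages
bad_mask = np.array([True, False, True])
try:
    ages[bad_mask]
    mismatch_error = None
except IndexError as exc:
    mismatch_error = exc
    print("IndexError:", exc)
```

Building the mask from a comparison on the same array, as this lesson does throughout, avoids the problem entirely.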

Shortcut: Combine in One Line

You typically combine the comparison and filtering in one line:

# Adult ages (one line)
adult_ages = ages[ages >= 18]
print("Adult ages (one line):")
print(adult_ages)
# Output: [25 30 45 18 62 35 28]

# Minor ages
minor_ages = ages[ages < 18]
print("\nMinor ages:")
print(minor_ages)
# Output: [15 12]

This is more concise and easier to read.

Counting Matches

To count how many values meet a condition, use .sum() on the Boolean array. True counts as 1, False counts as 0:

ages = np.array([25, 30, 15, 45, 18, 62, 35, 12, 28])

# Count how many adults
num_adults = (ages >= 18).sum()
print(f"Number of adults: {num_adults}")
# Output: Number of adults: 7

# Alternative: count from filtered array
num_adults = ages[ages >= 18].shape[0]
print(f"Number of adults (alt): {num_adults}")
# Output: Number of adults (alt): 7

# Count seniors
num_seniors = (ages > 60).sum()
print(f"Number of seniors: {num_seniors}")
# Output: Number of seniors: 1
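Two related patterns may also be useful here. This sketch uses np.count_nonzero as another way to count True values, and .mean() on the mask to get the share of values that match, since the mean of 1s and 0s is the fraction of 1s:

```python
import numpy as np

ages = np.array([25, 30, 15, 45, 18, 62, 35, 12, 28])
is_adult = ages >= 18

# Count the True entries in the mask
num_adults = np.count_nonzero(is_adult)

# Fraction of entries that are True
share_adults = is_adult.mean()

print(f"Adults: {num_adults} of {ages.size}")
print(f"Share of adults: {share_adults:.1%}")
```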

Practical Example: Filter Sales Data

Let’s analyze daily sales:

# Daily sales for a month
daily_sales = np.array([120, 135, 98, 145, 110, 175, 190, 88, 165, 142,
                        130, 155, 95, 180, 125, 168, 152, 138, 115, 195,
                        185, 162, 105, 148, 172, 133, 158, 140, 112, 178])

print(f"Total days: {len(daily_sales)}")
print(f"Average daily sales: {daily_sales.mean():.1f}")

Output:

Total days: 30
Average daily sales: 143.8

Find high-performing days:

# Find high-performing days (> 150)
high_days = daily_sales[daily_sales > 150]
print(f"\nHigh-performing days (> 150): {len(high_days)} days")
print(f"Sales on those days: {high_days}")
print(f"Average on high days: {high_days.mean():.1f}")

Output:

High-performing days (> 150): 13 days
Sales on those days: [175 190 165 155 180 168 152 195 185 162 172 158 178]
Average on high days: 171.9

Find low-performing days:

# Find low-performing days (< 100)
low_days = daily_sales[daily_sales < 100]
print(f"\nLow-performing days (< 100): {len(low_days)} days")
print(f"Sales on those days: {low_days}")

Output:

Low-performing days (< 100): 3 days
Sales on those days: [98 88 95]

This analysis immediately identifies problem areas that need attention.
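Filtering tells you what the low values were, but not which days they fell on. One possible approach, sketched here with np.flatnonzero, is to recover the positions where the condition is True. The conversion to day numbers assumes the first entry in the array is day 1:

```python
import numpy as np

daily_sales = np.array([120, 135, 98, 145, 110, 175, 190, 88, 165, 142,
                        130, 155, 95, 180, 125, 168, 152, 138, 115, 195,
                        185, 162, 105, 148, 172, 133, 158, 140, 112, 178])

# 0-based positions where sales were below 100
low_positions = np.flatnonzero(daily_sales < 100)

# Convert positions to 1-based day numbers (assumes entry 0 is day 1)
low_day_numbers = low_positions + 1

print("Low-performing day numbers:", low_day_numbers)
print("Sales on those days:", daily_sales[low_positions])
```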


Boolean Indexing for 2D Arrays

Filter Rows Based on Column Value

With 2D arrays, you typically filter entire rows based on values in a specific column:

# Student data: ID, Math, Physics, Chemistry
students = np.array([
    [101, 85, 90, 88],
    [102, 92, 85, 91],
    [103, 78, 82, 80],
    [104, 88, 87, 92],
    [105, 95, 89, 93],
    [106, 72, 75, 70],
    [107, 89, 91, 87]
])

print("Student Data:")
print(students)
print(f"Shape: {students.shape}")

Find students with high math scores:

# Extract Math column
math_scores = students[:, 1]
print("Math scores:", math_scores)
# Output: [85 92 78 88 95 72 89]

# Create Boolean mask
high_math_mask = math_scores >= 90
print("High math mask (>= 90):", high_math_mask)
# Output: [False  True False False  True False False]

# Filter rows
high_math_students = students[high_math_mask]
print("\nStudents with Math >= 90:")
print(high_math_students)

Output:

Students with Math >= 90:
[[102  92  85  91]
 [105  95  89  93]]

Shortcut: One-Line Filtering

Combine everything in one line:

# Students with Math >= 90 (one line)
high_math = students[students[:, 1] >= 90]
print("High Math students (one line):")
print(high_math)

print(f"\nNumber of students: {high_math.shape[0]}")

Output:

High Math students (one line):
[[102  92  85  91]
 [105  95  89  93]]

Number of students: 2

Select Specific Columns from Filtered Rows

You can filter rows and then select specific columns:

# Students with Physics >= 85, show ID and Physics only
high_physics = students[students[:, 2] >= 85]
id_and_physics = high_physics[:, [0, 2]]

print("Students with Physics >= 85 (ID and Physics only):")
print(id_and_physics)

Output:

Students with Physics >= 85 (ID and Physics only):
[[101  90]
 [102  85]
 [104  87]
 [105  89]
 [107  91]]

Another example:

# Students with Chemistry < 75, show all their scores
low_chem = students[students[:, 3] < 75]
print("Students with Chemistry < 75:")
print(low_chem)

Output:

Students with Chemistry < 75:
[[106  72  75  70]]
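The two-step filter-then-select pattern can also be written as a single expression. The sketch below shows two ways that should give the same result: chaining the two index operations, and using np.ix_ to apply a row mask and a column list together. The name physics_mask is introduced here for illustration:

```python
import numpy as np

# Student data: ID, Math, Physics, Chemistry
students = np.array([
    [101, 85, 90, 88],
    [102, 92, 85, 91],
    [103, 78, 82, 80],
    [104, 88, 87, 92],
    [105, 95, 89, 93],
    [106, 72, 75, 70],
    [107, 89, 91, 87]
])

physics_mask = students[:, 2] >= 85

# Chained: filter rows first, then pick the ID and Physics columns
chained = students[physics_mask][:, [0, 2]]

# np.ix_: apply the row mask and the column list in one index
combined = students[np.ix_(physics_mask, [0, 2])]

print(chained)
print(np.array_equal(chained, combined))
```

The chained form is usually easier to read; np.ix_ is an option when you want a single indexing step.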

Practical Example: Sales Analysis

# Sales data: Day, Product_A, Product_B, Product_C
sales = np.array([
    [1, 120, 135, 98],
    [2, 135, 142, 105],
    [3, 150, 138, 112],
    [4, 145, 155, 108],
    [5, 160, 148, 118],
    [6, 175, 162, 125],
    [7, 190, 170, 132]
])

print("Weekly Sales:")
print(sales)

Analyze product performance:

# Days when Product A sold > 150
strong_a_days = sales[sales[:, 1] > 150]
print("Days when Product A > 150:")
print(strong_a_days)

# Output:
# [[  5 160 148 118]
#  [  6 175 162 125]
#  [  7 190 170 132]]

# Days when Product C sold < 110
weak_c_days = sales[sales[:, 3] < 110]
print("\nDays when Product C < 110:")
print(weak_c_days)

# Output:
# [[  1 120 135  98]
#  [  2 135 142 105]
#  [  4 145 155 108]]

Combining Multiple Conditions

Logical Operators

You can combine multiple conditions using logical operators:

  • & (AND): Both conditions must be True
  • | (OR): At least one condition must be True
  • ~ (NOT): Reverses True/False

Important

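Always use parentheses around each condition when combining them! In Python, & and | bind more tightly than comparison operators such as >= and <, so without parentheses the expression is grouped incorrectly and typically raises an error.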

ages = np.array([25, 30, 15, 45, 18, 62, 35, 12, 28, 55])

print("Ages:", ages)
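To see why the parentheses matter, here is a small sketch comparing the two spellings. Without parentheses, Python evaluates 20 & ages first, which turns the expression into a chained comparison that needs a single True/False value and cannot get one from an array:

```python
import numpy as np

ages = np.array([25, 30, 15, 45, 18, 62, 35, 12, 28, 55])

# With parentheses: each comparison runs first, then & combines the masks
in_range = (ages >= 20) & (ages <= 40)
print(ages[in_range])

# Without parentheses, Python reads this as: ages >= (20 & ages) <= 40
try:
    ages >= 20 & ages <= 40
    precedence_error = None
except ValueError as exc:
    precedence_error = exc
    print("ValueError:", exc)
```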

AND Operator (&)

Both conditions must be True:

# Ages between 20 and 40 (inclusive)
middle_aged = ages[(ages >= 20) & (ages <= 40)]
print("Ages 20-40:", middle_aged)
# Output: [25 30 35 28]

# Count them
count = ((ages >= 20) & (ages <= 40)).sum()
print(f"Number of people aged 20-40: {count}")
# Output: 4

Visual representation:

ages:        [25, 30, 15, 45, 18, 62, 35, 12, 28, 55]
>=20:        [ T,  T,  F,  T,  F,  T,  T,  F,  T,  T]
<=40:        [ T,  T,  T,  F,  T,  F,  T,  T,  T,  F]
Both True:   [ T,  T,  F,  F,  F,  F,  T,  F,  T,  F]
                ↓   ↓                   ↓       ↓
result:      [25, 30,                 35,     28]

OR Operator (|)

At least one condition must be True:

# Young (< 20) OR senior (> 60)
young_or_senior = ages[(ages < 20) | (ages > 60)]
print("Young or Senior:", young_or_senior)
# Output: [15 18 62 12]

# Count them
count = ((ages < 20) | (ages > 60)).sum()
print(f"Number of young or seniors: {count}")
# Output: 4

NOT Operator (~)

Reverses the condition:

# NOT adults (same as minors)
not_adults = ages[~(ages >= 18)]
print("Not adults (< 18):", not_adults)
# Output: [15 12]

# NOT in range 20-40
not_middle = ages[~((ages >= 20) & (ages <= 40))]
print("Not in 20-40 range:", not_middle)
# Output: [15 45 18 62 12 55]
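A negated AND can always be rewritten as an OR of the opposite conditions (De Morgan's law), which some readers find easier to follow. This short sketch checks that both spellings select the same ages:

```python
import numpy as np

ages = np.array([25, 30, 15, 45, 18, 62, 35, 12, 28, 55])

# NOT (at least 20 AND at most 40)
negated = ~((ages >= 20) & (ages <= 40))

# Equivalent: below 20 OR above 40
rewritten = (ages < 20) | (ages > 40)

print(np.array_equal(negated, rewritten))
print(ages[rewritten])
```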

Complex Conditions

Combine multiple criteria:

scores = np.array([85, 92, 78, 88, 95, 72, 89, 91, 65, 82])

# B range: 80-89
b_grades = scores[(scores >= 80) & (scores < 90)]
print("B grades (80-89):", b_grades)
# Output: [85 88 89 82]

# A or F: >= 90 or < 60
extreme_grades = scores[(scores >= 90) | (scores < 60)]
print("Extreme grades (A or F):", extreme_grades)
# Output: [92 95 91]

# Not passing: < 60
failing = scores[scores < 60]
print("Failing grades:", failing)
# Output: []
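As the failing-grades example shows, a filter can legitimately return an empty array. A minimal sketch of one way to handle that case is to check .size before computing statistics on the result. The name failing_average is hypothetical:

```python
import numpy as np

scores = np.array([85, 92, 78, 88, 95, 72, 89, 91, 65, 82])
failing = scores[scores < 60]

# An empty result is still a valid array; it just has size 0
print(failing.size)

# Guard aggregate calculations so an empty selection is reported explicitly
if failing.size > 0:
    failing_average = failing.mean()
else:
    failing_average = None
    print("No failing grades to average")
```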

Combining Conditions in 2D Arrays

# Student data: ID, Math, Physics, Chemistry
students = np.array([
    [101, 85, 90, 88],
    [102, 92, 85, 91],
    [103, 78, 82, 80],
    [104, 88, 87, 92],
    [105, 95, 89, 93],
    [106, 72, 75, 70],
    [107, 89, 91, 87]
])

print("Student Data:")
print(students)

Find students strong in multiple subjects:

# Students with Math >= 85 AND Physics >= 85
strong_both = students[(students[:, 1] >= 85) & (students[:, 2] >= 85)]
print("Strong in both Math and Physics:")
print(strong_both)

Output:

Strong in both Math and Physics:
[[101  85  90  88]
 [102  92  85  91]
 [104  88  87  92]
 [105  95  89  93]
 [107  89  91  87]]

Find students who excel in at least one subject:

# Students with Math >= 90 OR Chemistry >= 90
excel_in_one = students[(students[:, 1] >= 90) | (students[:, 3] >= 90)]
print("Excel in Math OR Chemistry:")
print(excel_in_one)

Output:

Excel in Math OR Chemistry:
[[102  92  85  91]
 [104  88  87  92]
 [105  95  89  93]]

Find struggling students:

# Students with ANY score < 75
struggling = students[(students[:, 1] < 75) | (students[:, 2] < 75) | (students[:, 3] < 75)]
print("Students with any score < 75:")
print(struggling)

Output:

Students with any score < 75:
[[106  72  75  70]]
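Chaining one condition per column gets long as the number of columns grows. An alternative sketch, assuming every column after the ID holds a score, compares all score columns at once and then reduces each row with .any(axis=1) or .all(axis=1):

```python
import numpy as np

# Student data: ID, Math, Physics, Chemistry
students = np.array([
    [101, 85, 90, 88],
    [102, 92, 85, 91],
    [103, 78, 82, 80],
    [104, 88, 87, 92],
    [105, 95, 89, 93],
    [106, 72, 75, 70],
    [107, 89, 91, 87]
])

# Compare every score column at once (column 0 is the ID, so skip it)
low_score_grid = students[:, 1:] < 75

# True for each row with at least one low score
any_low = low_score_grid.any(axis=1)
struggling = students[any_low]
print(struggling)

# Counterpart: rows where every score is at least 85
all_strong = students[(students[:, 1:] >= 85).all(axis=1)]
print(all_strong[:, 0])
```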

Advanced Real-World Example

# Sales data: Day, Revenue, Units_Sold, Customers
sales = np.array([
    [1, 1200, 45, 28],
    [2, 1350, 52, 32],
    [3, 980, 38, 22],
    [4, 1450, 55, 35],
    [5, 1100, 42, 26],
    [6, 1750, 65, 42],
    [7, 1900, 70, 45]
])

print("Sales Data:")
print(sales)

Identify high-performing days:

# High-performing: Revenue > 1400 AND Customers > 30
high_days = sales[(sales[:, 1] > 1400) & (sales[:, 3] > 30)]
print("High-performing days (Revenue > 1400 AND Customers > 30):")
print(high_days)
print(f"Number of high days: {high_days.shape[0]}")

Output:

High-performing days (Revenue > 1400 AND Customers > 30):
[[   4 1450   55   35]
 [   6 1750   65   42]
 [   7 1900   70   45]]
Number of high days: 3
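Once the rows are identified, a common follow-up is to summarize one column for just those rows. The sketch below passes a row mask and a column index in the same index expression to pull out the revenue on high-performing days. The names high_mask and high_revenue are introduced here for illustration:

```python
import numpy as np

# Sales data: Day, Revenue, Units_Sold, Customers
sales = np.array([
    [1, 1200, 45, 28],
    [2, 1350, 52, 32],
    [3, 980, 38, 22],
    [4, 1450, 55, 35],
    [5, 1100, 42, 26],
    [6, 1750, 65, 42],
    [7, 1900, 70, 45]
])

high_mask = (sales[:, 1] > 1400) & (sales[:, 3] > 30)

# Row mask plus column index: revenue values for the matching rows only
high_revenue = sales[high_mask, 1]
total_revenue = sales[:, 1].sum()

print("Revenue on high days:", high_revenue)
print("Total on high days:", high_revenue.sum())
print(f"Share of weekly revenue: {high_revenue.sum() / total_revenue:.1%}")
```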

Identify alert days:

# Alert days: Low revenue (< 1000) OR few customers (< 25)
alert_days = sales[(sales[:, 1] < 1000) | (sales[:, 3] < 25)]
print("\nAlert days (Revenue < 1000 OR Customers < 25):")
print(alert_days)

Output:

Alert days (Revenue < 1000 OR Customers < 25):
[[  3 980  38  22]]

Find medium-range days:

# Medium range: Units between 40 and 60
medium_days = sales[(sales[:, 2] >= 40) & (sales[:, 2] <= 60)]
print("\nMedium sales days (40-60 units):")
print(medium_days)

Output:

Medium sales days (40-60 units):
[[   1 1200   45   28]
 [   2 1350   52   32]
 [   4 1450   55   35]
 [   5 1100   42   26]]

Practice Exercises

Apply Boolean indexing to solve these problems.

Exercise 1: Basic Filtering

Find all temperatures above 30°C:

temperatures = np.array([25, 28, 32, 30, 35, 27, 31, 29, 33, 26])

# Your code here:
# 1. Filter temperatures > 30
# 2. Count how many

Exercise 2: 2D Filtering

Find products with price greater than 50:

# Product data: ID, Price, Stock
products = np.array([
    [1, 45, 120],
    [2, 65, 85],
    [3, 30, 200],
    [4, 75, 55],
    [5, 40, 150]
])

# Your code here:
# Filter products with price (column 1) > 50

Hint

Use products[products[:, 1] > 50] to filter based on column 1.

Exercise 3: Multiple Conditions

Find students with scores between 80 and 90 (inclusive):

exam_scores = np.array([75, 85, 92, 78, 88, 95, 82, 71, 89, 93])

# Your code here:
# 1. Filter scores in range 80-90
# 2. Count how many

Summary

Boolean indexing is a powerful data filtering technique. Let’s review what you learned.

Key Concepts

Boolean Arrays

  • Created with comparison operators: >, <, >=, <=, ==, !=
  • Contain True/False values
  • Act as masks for filtering

Boolean Indexing (1D)

  • Filter with condition: array[array > 50]
  • Returns only values where condition is True
  • Count matches: (array > 50).sum()

Boolean Indexing (2D)

  • Filter rows: array[array[:, col] > value]
  • Returns entire rows where condition is True
  • Can select specific columns from filtered rows

Combining Conditions

  • AND: (cond1) & (cond2) - both must be True
  • OR: (cond1) | (cond2) - at least one must be True
  • NOT: ~(condition) - reverses True/False
  • Always use parentheses around each condition

Key Patterns Reference

# 1D filtering
array[array > 50]              # Values > 50
(array > 50).sum()             # Count > 50
array[array == value]          # Exact matches

# 2D filtering
array[array[:, 1] > 50]        # Rows where column 1 > 50
array[array[:, 0] == value]    # Rows where column 0 equals value

# Combining conditions
array[(array > 20) & (array < 40)]     # AND
array[(array < 10) | (array > 90)]     # OR
array[~(array >= 50)]                  # NOT

# Count matches
((array > 20) & (array < 40)).sum()

Important Reminders

  • Always use parentheses around each condition when combining
  • & for AND, | for OR, ~ for NOT
  • Do not use Python’s and, or, not with arrays
  • Boolean indexing returns a copy, not a view
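The copy-versus-view reminder is easy to check for yourself. In this sketch, changing the filtered result leaves the original array untouched, and np.shares_memory confirms that a Boolean-indexed result does not share data with its source, while a basic slice does:

```python
import numpy as np

ages = np.array([25, 30, 15, 45, 18, 62, 35, 12, 28])

# Boolean indexing produces a new array (a copy)
adults = ages[ages >= 18]
adults[0] = 99
print(ages[0])  # the original first value is unchanged

# Basic slicing produces a view onto the same data
first_three = ages[:3]

print(np.shares_memory(ages, adults))
print(np.shares_memory(ages, first_three))
```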

Why This Matters

Boolean indexing enables you to:

  • Find data that meets specific criteria
  • Identify outliers or anomalies
  • Segment data into categories
  • Filter datasets for focused analysis
  • Answer complex questions about your data

This is how professional data analysts query datasets. Instead of manually searching through thousands of rows, you express your question as a Boolean condition and let NumPy do the work instantly.


Next Steps

You now have powerful filtering capabilities. In the next lesson, you will learn to modify data—changing values, cleaning datasets, and transforming arrays for analysis.


Master Data Filtering

Boolean indexing transforms how you work with data. Instead of writing complex loops to find matching values, you express your criteria as simple conditions. This technique is essential for data cleaning, analysis, and exploration.

Combined with the other NumPy skills you have learned, Boolean indexing makes you capable of handling real-world data analysis tasks efficiently and elegantly!