Lesson 7 - Boolean Filtering in Pandas

Powerful Data Filtering

You learned Boolean indexing in NumPy. Now you will apply these same concepts to DataFrames—filtering rows based on conditions using labeled columns.

By the end of this lesson, you will be able to:

  • Create Boolean masks on DataFrame columns
  • Filter rows based on single conditions
  • Combine multiple conditions with & (AND), | (OR), ~ (NOT)
  • Use .isin() for membership testing
  • Filter and select specific columns simultaneously
  • Count rows that meet conditions

Boolean filtering is the foundation of data analysis—selecting exactly the rows you need based on your criteria. Let’s master it.


Understanding Boolean Filtering

Boolean filtering works exactly like NumPy arrays:

  1. Create a Boolean mask (True/False for each row)
  2. Use the mask to filter the DataFrame

The difference: Instead of numeric positions, you reference columns by name.

Let’s create sample data:

import pandas as pd
import numpy as np

# Create companies dataset
companies = pd.DataFrame({
    'company': ['TechCorp Global', 'FreshMart Inc', 'AutoDrive Motors', 'FirstBank Holdings', 'PowerGen Energy',
                'MediPharm Solutions', 'RetailHub Ltd', 'SkyWings Airlines', 'SteelCore Industries', 'NetLink Telecom',
                'LogiTrans Services', 'SolarNova Energy', 'SecureLife Insurance'],
    'sector': ['Technology', 'Food', 'Automotive', 'Financials', 'Energy',
               'Healthcare', 'Retail', 'Transportation', 'Materials', 'Technology',
               'Transportation', 'Energy', 'Financials'],
    'revenues': [125000, 89000, 156000, 234000, 178000,
                 98000, 112000, 187000, 145000, 165000,
                 92000, 156000, 201000],
    'profits': [12000, 8500, -3000, 45000, 23000,
                15000, 9800, 21000, 18000, 28000,
                -5000, 31000, 38000],
    'employees': [1200, 890, 2300, 5600, 3400,
                  2100, 4500, 8900, 3200, 6700,
                  1800, 980, 4200],
    'country': ['USA', 'USA', 'USA', 'UK', 'Germany',
                'USA', 'UK', 'Germany', 'USA', 'UK',
                'Germany', 'UK', 'USA']
})

print(f"Shape: {companies.shape}")
companies.head()

Single Condition Filtering

Filter DataFrame rows based on one condition.

Creating a Boolean Mask

# Create Boolean mask: Which companies are from USA?
is_iran = companies['country'] == 'USA'

print("Boolean mask:")
print(is_iran)
print(f"\nType: {type(is_iran)}")

Output:

Boolean mask:
0      True
1      True
2      True
3     False
4     False
...
Name: country, dtype: bool

Type: <class 'pandas.core.series.Series'>

A Series of True/False values, one for each row.

Applying the Mask

# Apply the mask to filter DataFrame
iran_companies = companies[is_iran]

print(f"Iranian companies: {iran_companies.shape[0]}")
iran_companies

Only rows where the mask is True are kept.

Shorthand Syntax

# More common: combine creation and application
iran_companies = companies[companies['country'] == 'USA']

print("Iranian companies (shorthand):")
print(iran_companies[['company', 'country', 'revenues']])

This is the typical way to filter—create and apply the mask in one step.

More Examples

# Companies with revenues > $150,000
high_revenue = companies[companies['revenues'] > 150000]

print(f"High revenue companies: {high_revenue.shape[0]}")
high_revenue[['company', 'revenues']]
# Unprofitable companies (negative profits)
unprofitable = companies[companies['profits'] < 0]

print("Unprofitable companies:")
unprofitable[['company', 'profits']]
# Technology sector companies
tech_companies = companies[companies['sector'] == 'Technology']

print(f"Technology companies: {tech_companies.shape[0]}")
tech_companies[['company', 'sector']]

Counting Matches

# Count how many meet the condition
num_iran = (companies['country'] == 'USA').sum()
num_profitable = (companies['profits'] > 0).sum()

print(f"Iranian companies: {num_iran}")
print(f"Profitable companies: {num_profitable}")

Output:

Iranian companies: 6
Profitable companies: 11

True counts as 1, False counts as 0, so .sum() gives the count.


Multiple Conditions

Combine conditions using logical operators:

  • & - AND (both conditions must be True)
  • | - OR (at least one condition must be True)
  • ~ - NOT (negate/reverse the condition)

Important Syntax Rule

Each condition must be in parentheses () when combining with &, |, or ~!

AND Operator (&)

# Iranian companies that are profitable
iran_profitable = companies[
    (companies['country'] == 'USA') &
    (companies['profits'] > 0)
]

print("Iranian AND profitable companies:")
iran_profitable[['company', 'country', 'profits']]

Both conditions must be True.

# Large profitable companies (revenues > $100k AND profits > $20k)
large_profitable = companies[
    (companies['revenues'] > 100000) &
    (companies['profits'] > 20000)
]

print("Large and highly profitable:")
large_profitable[['company', 'revenues', 'profits']]

OR Operator (|)

# Technology OR Energy companies
tech_or_energy = companies[
    (companies['sector'] == 'Technology') |
    (companies['sector'] == 'Energy')
]

print("Technology OR Energy companies:")
tech_or_energy[['company', 'sector']]

At least one condition must be True.

NOT Operator (~)

# Companies NOT from USA
not_iran = companies[~(companies['country'] == 'USA')]

print("Non-Iranian companies:")
not_iran[['company', 'country']]

The ~ reverses the Boolean mask.

Combining Multiple Operators

# Complex: UK companies in Technology or Energy with profits > $10k
uae_tech_energy_profitable = companies[
    (companies['country'] == 'UK') &
    ((companies['sector'] == 'Technology') | (companies['sector'] == 'Energy')) &
    (companies['profits'] > 10000)
]

print("UK tech/energy companies with profits > $10k:")
uae_tech_energy_profitable[['company', 'country', 'sector', 'profits']]

Notice the extra parentheses around the OR condition.

Common Mistake

# WRONG - Missing parentheses
# df[df['col1'] > 5 & df['col2'] < 10]  # ERROR!

# CORRECT - Each condition in parentheses
df[(df['col1'] > 5) & (df['col2'] < 10)]

Without parentheses, Python’s operator precedence causes errors.


Filtering with .loc[]

Combine row filtering with column selection using .loc[].

Filter Rows, Select Columns

# Iranian companies: show only company, revenues, and profits
iran_financials = companies.loc[
    companies['country'] == 'USA',
    ['company', 'revenues', 'profits']
]

print("Iranian companies financials:")
iran_financials

First argument: row filter (Boolean mask) Second argument: columns to select

# Technology companies: show sector, revenues, employees
tech_subset = companies.loc[
    companies['sector'] == 'Technology',
    ['sector', 'revenues', 'employees']
]

print("Technology companies subset:")
tech_subset

Multiple Conditions with Column Selection

# Profitable companies with > 2000 employees: show key metrics
large_profitable_subset = companies.loc[
    (companies['profits'] > 0) & (companies['employees'] > 2000),
    ['company', 'sector', 'profits', 'employees']
]

print("Large profitable companies - key metrics:")
large_profitable_subset

Helper Methods for Filtering

Pandas provides convenient methods for common filtering patterns.

.isin() for Multiple Values

# Companies from UK or Germany
gulf_countries = companies[companies['country'].isin(['UK', 'Germany'])]

print("Companies from UK or Germany:")
gulf_countries[['company', 'country']]

Much cleaner than:

# Equivalent but verbose
companies[(companies['country'] == 'UK') | (companies['country'] == 'Germany')]

NOT isin()

# Companies NOT in Technology or Retail
not_tech_retail = companies[~companies['sector'].isin(['Technology', 'Retail'])]

print("Companies not in Tech or Retail:")
not_tech_retail[['company', 'sector']]

.between() for Ranges

# Companies with revenues between $100k and $180k
mid_revenue = companies[companies['revenues'].between(100000, 180000)]

print("Mid-revenue companies ($100k-$180k):")
mid_revenue[['company', 'revenues']]

Cleaner than:

# Equivalent but more verbose
companies[(companies['revenues'] >= 100000) & (companies['revenues'] <= 180000)]

Useful Filtering Methods

MethodDescriptionExample
.isin([list])Value is in listdf[df['col'].isin(['A', 'B'])]
~.isin([list])Value NOT in listdf[~df['col'].isin(['A', 'B'])]
.between(a, b)Value in range [a, b]df[df['col'].between(10, 20)]
.str.contains()String contains patterndf[df['col'].str.contains('tech')]
.str.startswith()String starts withdf[df['col'].str.startswith('A')]

Practice Exercises

Apply Boolean filtering with these exercises.

Exercise 1: Simple Filtering

  1. Find all companies with more than 3000 employees
  2. Find all companies in the Financials sector
  3. Find all companies with revenues less than $100,000
# Your code here

Exercise 2: Multiple Conditions

  1. Find Iranian companies with profits > $10,000
  2. Find companies in Technology OR Healthcare sectors
  3. Find companies with revenues > $150k AND employees < 5000
# Your code here

Hint

Remember: Each condition needs parentheses when using & or |!

Exercise 3: Using .loc[] with Filtering

  1. For UK companies, show only company, sector, and profits
  2. For unprofitable companies, show company and profits columns
  3. For companies with > 4000 employees, show company, employees, and revenues
# Your code here

Exercise 4: Challenge

  1. Find companies from UK or Germany that are profitable
  2. Find companies NOT in USA with revenues between $90k and $200k
  3. Count how many companies have negative profits
# Your code here

Summary

You now master Boolean filtering in pandas. Let’s review the key concepts.

Key Concepts

Boolean Filtering Process

  • Create Boolean mask with comparison
  • Apply mask to DataFrame
  • Works exactly like NumPy arrays

Single Conditions

  • Use comparison operators: ==, !=, >, <, >=, <=
  • Returns True/False for each row
  • Filter: df[condition]

Multiple Conditions

  • & for AND (both True)
  • | for OR (at least one True)
  • ~ for NOT (reverse)
  • Each condition needs parentheses!

Helper Methods

  • .isin() for multiple values
  • .between() for ranges
  • Cleaner than complex conditions

Combine with Selection

  • Use .loc[condition, columns]
  • Filter rows and select columns together

Syntax Reference

# Simple filter
df[df['col'] > 100]

# AND condition
df[(df['col1'] > 100) & (df['col2'] == 'A')]

# OR condition
df[(df['col1'] > 100) | (df['col2'] > 200)]

# NOT condition
df[~(df['col'] == 'A')]

# Filter + select columns
df.loc[df['col'] > 100, ['col1', 'col2']]

# Multiple values
df[df['col'].isin(['A', 'B', 'C'])]

# Range
df[df['col'].between(10, 20)]

# Count matches
(df['col'] > 100).sum()

Comparison Operators

OperatorMeaningExample
==Equaldf['col'] == 5
!=Not equaldf['col'] != 5
>Greater thandf['col'] > 5
<Less thandf['col'] < 5
>=Greater or equaldf['col'] >= 5
<=Less or equaldf['col'] <= 5

Logical Operators

OperatorMeaningExample
&AND (both True)(cond1) & (cond2)
``OR (one True)
~NOT (reverse)~(condition)

Important Reminders

  • Parentheses required: Always wrap each condition when using &, |, ~
  • Column names: Reference columns by name, not position
  • Boolean Series: Result of comparison is a Series of True/False
  • Count with sum: True=1, False=0, so .sum() counts matches
  • Use .loc[]: Combine filtering with column selection
  • Helper methods: .isin() and .between() simplify common patterns

Next Steps

You can now filter DataFrames based on any conditions. In the next lesson, you will learn to sort data and find top/bottom values—essential for ranking and organizing results.

Continue to Lesson 8 - Sorting and Ranking

Learn to order data and identify top performers

Back to Lesson 6 - DateTime Fundamentals

Review working with dates and times


Master Data Selection

Boolean filtering is your most-used data selection technique. Combined with NumPy’s vectorization and pandas’ labeled columns, you can express complex filtering logic clearly and efficiently.

Use Boolean filtering to ask questions of your data and get precise answers!