Lesson 7 - Boolean Filtering in Pandas
Powerful Data Filtering
You learned Boolean indexing in NumPy. Now you will apply these same concepts to DataFrames—filtering rows based on conditions using labeled columns.
By the end of this lesson, you will be able to:
- Create Boolean masks on DataFrame columns
- Filter rows based on single conditions
- Combine multiple conditions with
&(AND),|(OR),~(NOT) - Use
.isin()for membership testing - Filter and select specific columns simultaneously
- Count rows that meet conditions
Boolean filtering is the foundation of data analysis—selecting exactly the rows you need based on your criteria. Let’s master it.
Understanding Boolean Filtering
Boolean filtering works exactly like NumPy arrays:
- Create a Boolean mask (True/False for each row)
- Use the mask to filter the DataFrame
The difference: Instead of numeric positions, you reference columns by name.
Let’s create sample data:
import pandas as pd
import numpy as np
# Create companies dataset
companies = pd.DataFrame({
'company': ['TechCorp Global', 'FreshMart Inc', 'AutoDrive Motors', 'FirstBank Holdings', 'PowerGen Energy',
'MediPharm Solutions', 'RetailHub Ltd', 'SkyWings Airlines', 'SteelCore Industries', 'NetLink Telecom',
'LogiTrans Services', 'SolarNova Energy', 'SecureLife Insurance'],
'sector': ['Technology', 'Food', 'Automotive', 'Financials', 'Energy',
'Healthcare', 'Retail', 'Transportation', 'Materials', 'Technology',
'Transportation', 'Energy', 'Financials'],
'revenues': [125000, 89000, 156000, 234000, 178000,
98000, 112000, 187000, 145000, 165000,
92000, 156000, 201000],
'profits': [12000, 8500, -3000, 45000, 23000,
15000, 9800, 21000, 18000, 28000,
-5000, 31000, 38000],
'employees': [1200, 890, 2300, 5600, 3400,
2100, 4500, 8900, 3200, 6700,
1800, 980, 4200],
'country': ['USA', 'USA', 'USA', 'UK', 'Germany',
'USA', 'UK', 'Germany', 'USA', 'UK',
'Germany', 'UK', 'USA']
})
print(f"Shape: {companies.shape}")
companies.head()Single Condition Filtering
Filter DataFrame rows based on one condition.
Creating a Boolean Mask
# Create Boolean mask: Which companies are from USA?
is_iran = companies['country'] == 'USA'
print("Boolean mask:")
print(is_iran)
print(f"\nType: {type(is_iran)}")Output:
Boolean mask:
0 True
1 True
2 True
3 False
4 False
...
Name: country, dtype: bool
Type: <class 'pandas.core.series.Series'>A Series of True/False values, one for each row.
Applying the Mask
# Apply the mask to filter DataFrame
iran_companies = companies[is_iran]
print(f"Iranian companies: {iran_companies.shape[0]}")
iran_companiesOnly rows where the mask is True are kept.
Shorthand Syntax
# More common: combine creation and application
iran_companies = companies[companies['country'] == 'USA']
print("Iranian companies (shorthand):")
print(iran_companies[['company', 'country', 'revenues']])This is the typical way to filter—create and apply the mask in one step.
More Examples
# Companies with revenues > $150,000
high_revenue = companies[companies['revenues'] > 150000]
print(f"High revenue companies: {high_revenue.shape[0]}")
high_revenue[['company', 'revenues']]# Unprofitable companies (negative profits)
unprofitable = companies[companies['profits'] < 0]
print("Unprofitable companies:")
unprofitable[['company', 'profits']]# Technology sector companies
tech_companies = companies[companies['sector'] == 'Technology']
print(f"Technology companies: {tech_companies.shape[0]}")
tech_companies[['company', 'sector']]Counting Matches
# Count how many meet the condition
num_iran = (companies['country'] == 'USA').sum()
num_profitable = (companies['profits'] > 0).sum()
print(f"Iranian companies: {num_iran}")
print(f"Profitable companies: {num_profitable}")Output:
Iranian companies: 6
Profitable companies: 11True counts as 1, False counts as 0, so .sum() gives the count.
Multiple Conditions
Combine conditions using logical operators:
&- AND (both conditions must be True)|- OR (at least one condition must be True)~- NOT (negate/reverse the condition)
Important Syntax Rule
Each condition must be in parentheses () when combining with &, |, or ~!
AND Operator (&)
# Iranian companies that are profitable
iran_profitable = companies[
(companies['country'] == 'USA') &
(companies['profits'] > 0)
]
print("Iranian AND profitable companies:")
iran_profitable[['company', 'country', 'profits']]Both conditions must be True.
# Large profitable companies (revenues > $100k AND profits > $20k)
large_profitable = companies[
(companies['revenues'] > 100000) &
(companies['profits'] > 20000)
]
print("Large and highly profitable:")
large_profitable[['company', 'revenues', 'profits']]OR Operator (|)
# Technology OR Energy companies
tech_or_energy = companies[
(companies['sector'] == 'Technology') |
(companies['sector'] == 'Energy')
]
print("Technology OR Energy companies:")
tech_or_energy[['company', 'sector']]At least one condition must be True.
NOT Operator (~)
# Companies NOT from USA
not_iran = companies[~(companies['country'] == 'USA')]
print("Non-Iranian companies:")
not_iran[['company', 'country']]The ~ reverses the Boolean mask.
Combining Multiple Operators
# Complex: UK companies in Technology or Energy with profits > $10k
uae_tech_energy_profitable = companies[
(companies['country'] == 'UK') &
((companies['sector'] == 'Technology') | (companies['sector'] == 'Energy')) &
(companies['profits'] > 10000)
]
print("UK tech/energy companies with profits > $10k:")
uae_tech_energy_profitable[['company', 'country', 'sector', 'profits']]Notice the extra parentheses around the OR condition.
Common Mistake
# WRONG - Missing parentheses
# df[df['col1'] > 5 & df['col2'] < 10] # ERROR!
# CORRECT - Each condition in parentheses
df[(df['col1'] > 5) & (df['col2'] < 10)]Without parentheses, Python’s operator precedence causes errors.
Filtering with .loc[]
Combine row filtering with column selection using .loc[].
Filter Rows, Select Columns
# Iranian companies: show only company, revenues, and profits
iran_financials = companies.loc[
companies['country'] == 'USA',
['company', 'revenues', 'profits']
]
print("Iranian companies financials:")
iran_financialsFirst argument: row filter (Boolean mask) Second argument: columns to select
# Technology companies: show sector, revenues, employees
tech_subset = companies.loc[
companies['sector'] == 'Technology',
['sector', 'revenues', 'employees']
]
print("Technology companies subset:")
tech_subsetMultiple Conditions with Column Selection
# Profitable companies with > 2000 employees: show key metrics
large_profitable_subset = companies.loc[
(companies['profits'] > 0) & (companies['employees'] > 2000),
['company', 'sector', 'profits', 'employees']
]
print("Large profitable companies - key metrics:")
large_profitable_subsetHelper Methods for Filtering
Pandas provides convenient methods for common filtering patterns.
.isin() for Multiple Values
# Companies from UK or Germany
gulf_countries = companies[companies['country'].isin(['UK', 'Germany'])]
print("Companies from UK or Germany:")
gulf_countries[['company', 'country']]Much cleaner than:
# Equivalent but verbose
companies[(companies['country'] == 'UK') | (companies['country'] == 'Germany')]NOT isin()
# Companies NOT in Technology or Retail
not_tech_retail = companies[~companies['sector'].isin(['Technology', 'Retail'])]
print("Companies not in Tech or Retail:")
not_tech_retail[['company', 'sector']].between() for Ranges
# Companies with revenues between $100k and $180k
mid_revenue = companies[companies['revenues'].between(100000, 180000)]
print("Mid-revenue companies ($100k-$180k):")
mid_revenue[['company', 'revenues']]Cleaner than:
# Equivalent but more verbose
companies[(companies['revenues'] >= 100000) & (companies['revenues'] <= 180000)]Useful Filtering Methods
| Method | Description | Example |
|---|---|---|
.isin([list]) | Value is in list | df[df['col'].isin(['A', 'B'])] |
~.isin([list]) | Value NOT in list | df[~df['col'].isin(['A', 'B'])] |
.between(a, b) | Value in range [a, b] | df[df['col'].between(10, 20)] |
.str.contains() | String contains pattern | df[df['col'].str.contains('tech')] |
.str.startswith() | String starts with | df[df['col'].str.startswith('A')] |
Practice Exercises
Apply Boolean filtering with these exercises.
Exercise 1: Simple Filtering
- Find all companies with more than 3000 employees
- Find all companies in the Financials sector
- Find all companies with revenues less than $100,000
# Your code hereExercise 2: Multiple Conditions
- Find Iranian companies with profits > $10,000
- Find companies in Technology OR Healthcare sectors
- Find companies with revenues > $150k AND employees < 5000
# Your code hereHint
Remember: Each condition needs parentheses when using & or |!
Exercise 3: Using .loc[] with Filtering
- For UK companies, show only company, sector, and profits
- For unprofitable companies, show company and profits columns
- For companies with > 4000 employees, show company, employees, and revenues
# Your code hereExercise 4: Challenge
- Find companies from UK or Germany that are profitable
- Find companies NOT in USA with revenues between $90k and $200k
- Count how many companies have negative profits
# Your code hereSummary
You now master Boolean filtering in pandas. Let’s review the key concepts.
Key Concepts
Boolean Filtering Process
- Create Boolean mask with comparison
- Apply mask to DataFrame
- Works exactly like NumPy arrays
Single Conditions
- Use comparison operators:
==,!=,>,<,>=,<= - Returns True/False for each row
- Filter:
df[condition]
Multiple Conditions
&for AND (both True)|for OR (at least one True)~for NOT (reverse)- Each condition needs parentheses!
Helper Methods
.isin()for multiple values.between()for ranges- Cleaner than complex conditions
Combine with Selection
- Use
.loc[condition, columns] - Filter rows and select columns together
Syntax Reference
# Simple filter
df[df['col'] > 100]
# AND condition
df[(df['col1'] > 100) & (df['col2'] == 'A')]
# OR condition
df[(df['col1'] > 100) | (df['col2'] > 200)]
# NOT condition
df[~(df['col'] == 'A')]
# Filter + select columns
df.loc[df['col'] > 100, ['col1', 'col2']]
# Multiple values
df[df['col'].isin(['A', 'B', 'C'])]
# Range
df[df['col'].between(10, 20)]
# Count matches
(df['col'] > 100).sum()Comparison Operators
| Operator | Meaning | Example |
|---|---|---|
== | Equal | df['col'] == 5 |
!= | Not equal | df['col'] != 5 |
> | Greater than | df['col'] > 5 |
< | Less than | df['col'] < 5 |
>= | Greater or equal | df['col'] >= 5 |
<= | Less or equal | df['col'] <= 5 |
Logical Operators
| Operator | Meaning | Example |
|---|---|---|
& | AND (both True) | (cond1) & (cond2) |
| ` | ` | OR (one True) |
~ | NOT (reverse) | ~(condition) |
Important Reminders
- Parentheses required: Always wrap each condition when using
&,|,~ - Column names: Reference columns by name, not position
- Boolean Series: Result of comparison is a Series of True/False
- Count with sum: True=1, False=0, so
.sum()counts matches - Use .loc[]: Combine filtering with column selection
- Helper methods:
.isin()and.between()simplify common patterns
Next Steps
You can now filter DataFrames based on any conditions. In the next lesson, you will learn to sort data and find top/bottom values—essential for ranking and organizing results.
Continue to Lesson 8 - Sorting and Ranking
Learn to order data and identify top performers
Back to Lesson 6 - DateTime Fundamentals
Review working with dates and times
Master Data Selection
Boolean filtering is your most-used data selection technique. Combined with NumPy’s vectorization and pandas’ labeled columns, you can express complex filtering logic clearly and efficiently.
Use Boolean filtering to ask questions of your data and get precise answers!