Lesson 11 - Handling Missing Data

Dealing with Missing Values

Real-world data is messy. Missing values appear in almost every dataset you will encounter. In this lesson, you will learn to detect, remove, and fill missing data—essential skills for data cleaning and preparation.

By the end of this lesson, you will be able to:

  • Detect missing values with .isnull() and .isna()
  • Count and calculate percentages of missing data
  • Remove missing data strategically with .dropna()
  • Fill missing values using various strategies with .fillna()
  • Choose appropriate methods for different data types
  • Understand when to drop vs. fill missing data

Missing data is a reality in all datasets. Surveys have unanswered questions, sensors fail, records are incomplete. Your job is to handle missing data appropriately without introducing bias or losing important information.


Understanding Missing Data

In pandas, missing values are represented as NaN (Not a Number) or None. Both indicate the absence of a value.
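As a quick illustration (a minimal sketch separate from the lesson's dataset), pd.isna() treats both NaN and None as missing, and None placed in a numeric column is stored as NaN:

```python
import pandas as pd

# pd.isna() treats both NaN and None as missing
print(pd.isna(float('nan')))   # True
print(pd.isna(None))           # True
print(pd.isna(42))             # False

# None in a numeric Series is stored as NaN, which forces a float dtype
s = pd.Series([1, None, 3])
print(s.dtype)                 # float64
```

This is also why a column of whole numbers with missing entries shows up as float64 rather than int64.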

Creating Sample Data with Missing Values

import pandas as pd
import numpy as np

# Create dataset with missing values
companies = pd.DataFrame({
    'company': ['TechCorp Global', 'FreshMart Inc', 'AutoDrive Motors', 'FirstBank Holdings', 'PowerGen Energy',
                'MediPharm Solutions', 'RetailHub Ltd', 'SkyWings Airlines', 'SteelCore Industries', 'NetLink Telecom'],
    'sector': ['Technology', 'Food', 'Automotive', None, 'Energy',
               'Healthcare', 'Retail', 'Transportation', 'Materials', 'Technology'],
    'revenues': [125000, 89000, None, 234000, 178000,
                 98000, 112000, 187000, 145000, None],
    'profits': [12000, 8500, -3000, 45000, None,
                15000, 9800, 21000, None, 28000],
    'employees': [1200, 890, 2300, 5600, 3400,
                  None, 4500, 8900, 3200, 6700],
    'country': ['USA', 'USA', 'USA', 'UK', None,
                'USA', 'UK', 'Germany', 'USA', 'UK']
})

print("Dataset with missing values:")
companies

Notice the None and NaN values scattered throughout the dataset.


Detecting Missing Values

The first step in handling missing data is finding where it exists.

Creating Boolean Masks

# Check for missing values (returns True/False)
is_missing = companies.isnull()

print("Missing value mask (True = missing):")
print(is_missing)

Output shows True where values are missing, False where they exist.

# .isna() is an alias for .isnull() (same thing)
print("Using .isna():")
print(companies.isna().head())

Both methods work identically—use whichever you prefer.
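You can confirm this yourself: the two masks are element-for-element identical (a quick sanity check on a small throwaway frame):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, None], 'b': [None, 2]})

# .isnull() and .isna() produce the exact same boolean mask
same = df.isnull().equals(df.isna())
print(same)   # True
```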

Counting Missing Values

# Count missing values per column
missing_count = companies.isnull().sum()

print("Missing values per column:")
print(missing_count)

Output:

company        0
sector         1
revenues       2
profits        2
employees      1
country        1
dtype: int64

Percentage of Missing Data

# Calculate percentage missing
missing_pct = (companies.isnull().sum() / len(companies)) * 100

print("Missing values percentage:")
print(missing_pct.round(1))

Output:

company        0.0
sector        10.0
revenues      20.0
profits       20.0
employees     10.0
country       10.0
dtype: float64

20% of revenue and profit data is missing!

Missing Data Summary

# Create comprehensive summary
missing_summary = pd.DataFrame({
    'column': companies.columns,
    'missing_count': companies.isnull().sum().values,
    'missing_pct': missing_pct.values
})

print("Missing data summary:")
print(missing_summary)

This gives a complete overview of data quality issues.

Finding Rows with Missing Values

# Check which rows have ANY missing values
rows_with_missing = companies.isnull().any(axis=1)

print("Rows with at least one missing value:")
print(companies[rows_with_missing])

Shows all rows that contain at least one missing value.

Checking Specific Columns

# Check if specific column has missing values
has_missing_revenues = companies['revenues'].isnull().any()
has_missing_company = companies['company'].isnull().any()

print(f"Revenues column has missing: {has_missing_revenues}")
print(f"Company column has missing: {has_missing_company}")

Output:

Revenues column has missing: True
Company column has missing: False

Removing Missing Data

Use .dropna() to remove rows or columns with missing values.

Drop Rows with Any Missing Values

# Drop rows with ANY missing values
companies_clean = companies.dropna()

print(f"Original: {len(companies)} rows")
print(f"After dropna(): {len(companies_clean)} rows")
print("\nRemaining data:")
print(companies_clean)

Output:

Original: 10 rows
After dropna(): 6 rows

This removed 4 rows that had missing values—40% of the data!

Data Loss Warning

Dropping all rows with any missing values can remove significant portions of your dataset. Use this approach carefully!
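Before dropping anything, it is worth measuring how much data you would lose. One way (a sketch using a small hypothetical frame) is to take the mean of the row-wise missing mask:

```python
import pandas as pd

df = pd.DataFrame({
    'x': [1, None, 3, 4, None],
    'y': [10, 20, None, 40, 50],
})

# Fraction of rows dropna() would remove: mean of the "any missing" mask
rows_lost = df.isnull().any(axis=1).mean()
print(f"dropna() would remove {rows_lost:.0%} of rows")   # 60%
```

If that fraction is large, filling is usually the safer choice.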

Drop Rows with Specific Column Missing

# Drop rows where revenues is missing
companies_rev_clean = companies.dropna(subset=['revenues'])

print(f"Original: {len(companies)} rows")
print(f"After dropping rows with missing revenues: {len(companies_rev_clean)} rows")

Output:

Original: 10 rows
After dropping rows with missing revenues: 8 rows

Only removed rows missing the revenues column.

Drop Rows with Multiple Columns Missing

# Drop rows where revenues OR profits is missing
companies_multi_clean = companies.dropna(subset=['revenues', 'profits'])

print(f"Original: {len(companies)} rows")
print(f"After dropping rows missing revenues OR profits: {len(companies_multi_clean)} rows")

Drop Only Completely Empty Rows

# Drop rows only if ALL values are missing
companies_all_clean = companies.dropna(how='all')

print(f"After dropna(how='all'): {len(companies_all_clean)} rows")

The how='all' parameter only drops rows where every value is NaN.

Drop Columns with Missing Values

# Drop columns with any missing values
companies_no_missing_cols = companies.dropna(axis=1)

print("Columns without any missing values:")
print(companies_no_missing_cols.columns.tolist())

Output:

Columns without any missing values:
['company']

Only the 'company' column has no missing values!

Threshold for Dropping

# Keep rows with at least N non-null values
companies_thresh = companies.dropna(thresh=5)

print(f"Original: {len(companies)} rows")
print(f"After keeping rows with at least 5 non-null values: {len(companies_thresh)} rows")

The thresh parameter specifies minimum non-null values required.


Filling Missing Values

Use .fillna() to replace missing values with specific values or calculated statistics.

Fill with Specific Value

# Fill text columns with 'Unknown'
companies_filled = companies.copy()
companies_filled['sector'] = companies_filled['sector'].fillna('Unknown')
companies_filled['country'] = companies_filled['country'].fillna('Unknown')

print("After filling text columns with 'Unknown':")
print(companies_filled[['company', 'sector', 'country']])

Output shows 'Unknown' where values were missing.

Fill with Mean

# Fill numeric columns with mean
companies_mean = companies.copy()
mean_revenue = companies['revenues'].mean()
companies_mean['revenues'] = companies_mean['revenues'].fillna(mean_revenue)

print(f"Mean revenue: ${mean_revenue:,.0f}")
print("\nRevenues after filling with mean:")
print(companies_mean[['company', 'revenues']])

Useful for normally distributed data without outliers.

Fill with Median

# Fill with median (better for skewed data)
companies_median = companies.copy()
median_employees = companies['employees'].median()
companies_median['employees'] = companies_median['employees'].fillna(median_employees)

print(f"Median employees: {median_employees:,.0f}")
print("\nEmployees after filling with median:")
print(companies_median[['company', 'employees']])

Median is more robust to outliers than mean.

Fill with Zero

# Fill with 0 (common for counts or amounts)
companies_zero = companies.copy()
companies_zero['profits'] = companies_zero['profits'].fillna(0)

print("Profits after filling with 0:")
print(companies_zero[['company', 'profits']])

Appropriate when missing means “none” or “zero.”

Fill Different Columns with Different Values

# Fill multiple columns with different strategies
companies_multi_fill = companies.copy()

fill_values = {
    'sector': 'Unknown',
    'revenues': companies['revenues'].median(),
    'profits': 0,
    'employees': companies['employees'].mean(),
    'country': 'Unknown'
}

companies_multi_fill = companies_multi_fill.fillna(fill_values)

print("After filling with different strategies:")
print(companies_multi_fill)
print("\nMissing values remaining:")
print(companies_multi_fill.isnull().sum())

This applies an appropriate filling method to each column.

Forward Fill

# Forward fill - use previous value
test_data = pd.DataFrame({
    'value': [10, None, None, 20, None, 30]
})

print("Original:")
print(test_data)

print("\nAfter forward fill (ffill):")
print(test_data.fillna(method='ffill'))

Output:

   value
0   10.0
1   10.0  ← Filled with previous value
2   10.0  ← Filled with previous value
3   20.0
4   20.0  ← Filled with previous value
5   30.0

Backward Fill

# Backward fill - use next value
print("After backward fill (bfill):")
print(test_data.bfill())   # fillna(method='bfill') is deprecated in modern pandas

Output:

   value
0   10.0
1   20.0  ← Filled with next value
2   20.0  ← Filled with next value
3   20.0
4   30.0  ← Filled with next value
5   30.0
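Both fills accept a limit parameter that caps how many consecutive missing values get filled; anything beyond the limit stays NaN. A small sketch with a hypothetical series:

```python
import pandas as pd

gap = pd.Series([10, None, None, None, 50])

# limit=1 fills at most one consecutive gap; the rest of the run stays NaN
filled = gap.ffill(limit=1)
print(filled.tolist())   # [10.0, 10.0, nan, nan, 50.0]
```

This is handy in time series where a short sensor dropout is worth patching but a long outage should stay visible.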

Fill Based on Groups

# Fill missing profits with sector average
companies_group = companies.copy()

# Calculate sector average profits
sector_avg_profit = companies.groupby('sector')['profits'].transform('mean')

# Fill missing with sector average
companies_group['profits'] = companies_group['profits'].fillna(sector_avg_profit)

print("Profits filled with sector average:")
print(companies_group[['company', 'sector', 'profits']])

This uses domain knowledge—companies in the same sector likely have similar profit patterns. One caveat: if a sector has no other non-missing profits (or the sector itself is missing), the group mean is NaN and the value stays unfilled. In this dataset, the Energy and Materials sectors each contain only one company, so their missing profits remain; follow up with an overall-mean fill if needed.


Practice Exercises

Apply missing data techniques with these exercises.

Exercise 1: Detecting Missing Data

  1. Find how many total missing values exist in the entire DataFrame
  2. Find which column has the highest percentage of missing values
  3. Count how many rows are missing both 'revenues' and 'profits'
# Your code here

Exercise 2: Removing Missing Data

  1. Create a version that drops rows missing 'revenues'
  2. Create a version that keeps only rows with ALL values present
  3. Create a version that keeps rows with at least 4 non-null values
# Your code here

Exercise 3: Filling Missing Data

Create a cleaned version with this strategy:

  1. Fill missing 'sector' with 'Unspecified'
  2. Fill missing 'revenues' with median revenue
  3. Fill missing 'profits' with mean profit
  4. Fill missing 'employees' with 0
  5. Fill missing 'country' with 'International'

Verify no missing values remain.

# Your code here

Hint

Use a dictionary with .fillna() to fill multiple columns at once with different values!

Exercise 4: Challenge - Smart Filling Strategy

Create a sophisticated cleaning approach:

  1. For companies in the same sector, fill missing 'revenues' with the sector's median revenue
  2. If sector is also missing, use overall median
  3. Compare the difference between using sector-based filling vs. overall median
# Your code here

Summary

You now have the core tools for handling missing data. Let's review the key concepts.

Key Concepts

Detecting Missing Data

  • Use .isnull() or .isna() to find missing values
  • Count missing with .sum()
  • Calculate percentages to understand extent

Removing Missing Data

  • .dropna() removes rows/columns with missing values
  • Use subset to check specific columns only
  • Use thresh to require minimum non-null values
  • Use how='all' to drop only completely empty rows

Filling Missing Data

  • .fillna() replaces missing values
  • Fill with constants, statistics, or group-based values
  • Forward fill and backward fill for time series
  • Different strategies for different column types

Syntax Reference

# Detection
df.isnull()                    # Returns True/False mask
df.isnull().sum()              # Count per column
df.isnull().any()              # Check if any missing
df.isnull().sum().sum()        # Total missing in DataFrame

# Removal
df.dropna()                    # Drop rows with any missing
df.dropna(subset=['col'])      # Drop if specific column missing
df.dropna(how='all')           # Drop only if all values missing
df.dropna(thresh=n)            # Keep rows with at least n non-null
df.dropna(axis=1)              # Drop columns with missing

# Filling
df.fillna(value)               # Fill with specific value
df.fillna({'col1': val1, 'col2': val2})  # Different values per column
df['col'].fillna(df['col'].mean())       # Fill with mean
df['col'].fillna(df['col'].median())     # Fill with median
df.ffill()                     # Forward fill
df.bfill()                     # Backward fill

Decision Guide

Should you DROP or FILL?

Drop when:

  • Very few rows affected (< 5%)
  • Missing data is random
  • You have plenty of data
  • A critical column is missing, making the row unusable

Fill when:

  • Many rows affected
  • Missing follows a pattern
  • Limited data available
  • Missing has meaning (e.g., 0)

Filling Strategies by Data Type:

Data Type           Strategy                   Method
Categorical         'Unknown' or most common   fillna('Unknown') or mode
Numeric (normal)    Mean                       fillna(mean())
Numeric (skewed)    Median                     fillna(median())
Counts/Amounts      Zero                       fillna(0)
Time series         Forward/backward fill      ffill() / bfill()
Group-based         Group statistic            groupby().transform()

Important Reminders

  • Always investigate WHY data is missing - Could indicate data quality issues
  • Document decisions - Record what you filled and why
  • Create flags - Add boolean column to track which were filled
  • Compare before/after - Always check statistics after cleaning
  • Keep original - Maintain backup of original data
  • Context matters - Use domain knowledge to choose strategies
  • Group-based filling - Use sector/category averages when appropriate

Next Steps

You can now detect, remove, and fill missing data appropriately. In the next lesson, you will learn data type conversion—ensuring columns have the correct types for analysis.

Continue to Lesson 12 - Data Type Conversion

Learn to convert between data types and handle type errors

Back to Lesson 10 - Apply, Map, and Transform

Review advanced transformation techniques


Master Data Quality

Missing data is inevitable in real-world datasets. Your ability to handle it appropriately—detecting, understanding, and addressing it—is fundamental to data analysis.

Use these techniques to prepare clean, analysis-ready datasets!