Lesson 11 - Handling Missing Data

Dealing with Missing Values

Real-world data is messy. Missing values appear in almost every dataset you will encounter. In this lesson, you will learn to detect, remove, and fill missing data—essential skills for data cleaning and preparation.

By the end of this lesson, you will be able to:

  • Detect missing values with .isnull() and .isna()
  • Count and calculate percentages of missing data
  • Remove missing data strategically with .dropna()
  • Fill missing values using various strategies with .fillna()
  • Choose appropriate methods for different data types
  • Understand when to drop vs. fill missing data

Missing data is a reality in all datasets. Surveys have unanswered questions, sensors fail, records are incomplete. Your job is to handle missing data appropriately without introducing bias or losing important information.


Understanding Missing Data

In pandas, missing values are represented as NaN (Not a Number) or None. Both indicate the absence of a value.
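As a quick illustration (a minimal sketch separate from the lesson's dataset), pd.isna() treats both NaN and None as missing, and None placed in a numeric column is stored as NaN:

```python
import pandas as pd

# pd.isna() treats both NaN and None as missing
print(pd.isna(float('nan')))   # True
print(pd.isna(None))           # True
print(pd.isna(42))             # False

# None in a numeric Series is stored as NaN, which forces a float dtype
s = pd.Series([1, None, 3])
print(s.dtype)                 # float64
```

This is also why a column of whole numbers with missing entries shows up as float64 rather than int64.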

Creating Sample Data with Missing Values

import pandas as pd
import numpy as np

# Create dataset with missing values
companies = pd.DataFrame({
    'company': ['TechCorp Global', 'FreshMart Inc', 'AutoDrive Motors', 'FirstBank Holdings', 'PowerGen Energy',
                'MediPharm Solutions', 'RetailHub Ltd', 'SkyWings Airlines', 'SteelCore Industries', 'NetLink Telecom'],
    'sector': ['Technology', 'Food', 'Automotive', None, 'Energy',
               'Healthcare', 'Retail', 'Transportation', 'Materials', 'Technology'],
    'revenues': [125000, 89000, None, 234000, 178000,
                 98000, 112000, 187000, 145000, None],
    'profits': [12000, 8500, -3000, 45000, None,
                15000, 9800, 21000, None, 28000],
    'employees': [1200, 890, 2300, 5600, 3400,
                  None, 4500, 8900, 3200, 6700],
    'country': ['USA', 'USA', 'USA', 'UK', None,
                'USA', 'UK', 'Germany', 'USA', 'UK']
})

print("Dataset with missing values:")
companies

Notice the None and NaN values scattered throughout the dataset.


Detecting Missing Values

The first step in handling missing data is finding where it exists.

Creating Boolean Masks

# Check for missing values (returns True/False)
is_missing = companies.isnull()

print("Missing value mask (True = missing):")
print(is_missing)

Output shows True where values are missing, False where they exist.

# .isna() is an alias for .isnull() (same thing)
print("Using .isna():")
print(companies.isna().head())

Both methods work identically—use whichever you prefer.
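You can confirm this yourself: the two masks are element-for-element identical (a quick sanity check on a small throwaway frame):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, None], 'b': [None, 2]})

# .isnull() and .isna() produce the exact same boolean mask
same = df.isnull().equals(df.isna())
print(same)   # True
```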

Counting Missing Values

# Count missing values per column
missing_count = companies.isnull().sum()

print("Missing values per column:")
print(missing_count)

Output:

company        0
sector         1
revenues       2
profits        2
employees      1
country        1
dtype: int64

Percentage of Missing Data

# Calculate percentage missing
missing_pct = (companies.isnull().sum() / len(companies)) * 100

print("Missing values percentage:")
print(missing_pct.round(1))

Output:

company        0.0
sector        10.0
revenues      20.0
profits       20.0
employees     10.0
country       10.0
dtype: float64

20% of revenue and profit data is missing!

Missing Data Summary

# Create comprehensive summary
missing_summary = pd.DataFrame({
    'column': companies.columns,
    'missing_count': companies.isnull().sum().values,
    'missing_pct': missing_pct.values
})

print("Missing data summary:")
print(missing_summary)

This gives a complete overview of data quality issues.

Finding Rows with Missing Values

# Check which rows have ANY missing values
rows_with_missing = companies.isnull().any(axis=1)

print("Rows with at least one missing value:")
print(companies[rows_with_missing])

Shows all rows that contain at least one missing value.

Checking Specific Columns

# Check if specific column has missing values
has_missing_revenues = companies['revenues'].isnull().any()
has_missing_company = companies['company'].isnull().any()

print(f"Revenues column has missing: {has_missing_revenues}")
print(f"Company column has missing: {has_missing_company}")

Output:

Revenues column has missing: True
Company column has missing: False

Removing Missing Data

Use .dropna() to remove rows or columns with missing values.

Drop Rows with Any Missing Values

# Drop rows with ANY missing values
companies_clean = companies.dropna()

print(f"Original: {len(companies)} rows")
print(f"After dropna(): {len(companies_clean)} rows")
print("\nRemaining data:")
print(companies_clean)

Output:

Original: 10 rows
After dropna(): 6 rows

This removed 4 rows that had missing values—40% of the data!

Data Loss Warning

Dropping all rows with any missing values can remove significant portions of your dataset. Use this approach carefully!
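Before dropping anything, it is worth measuring how much data you would lose. One way (a sketch using a small hypothetical frame) is to take the mean of the row-wise missing mask:

```python
import pandas as pd

df = pd.DataFrame({
    'x': [1, None, 3, 4, None],
    'y': [10, 20, None, 40, 50],
})

# Fraction of rows dropna() would remove: mean of the "any missing" mask
rows_lost = df.isnull().any(axis=1).mean()
print(f"dropna() would remove {rows_lost:.0%} of rows")   # 60%
```

If that fraction is large, filling is usually the safer choice.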

Drop Rows with Specific Column Missing

# Drop rows where revenues is missing
companies_rev_clean = companies.dropna(subset=['revenues'])

print(f"Original: {len(companies)} rows")
print(f"After dropping rows with missing revenues: {len(companies_rev_clean)} rows")

Output:

Original: 10 rows
After dropping rows with missing revenues: 8 rows

Only removed rows missing the revenues column.

Drop Rows with Multiple Columns Missing

# Drop rows where revenues OR profits is missing
companies_multi_clean = companies.dropna(subset=['revenues', 'profits'])

print(f"Original: {len(companies)} rows")
print(f"After dropping rows missing revenues OR profits: {len(companies_multi_clean)} rows")

Drop Only Completely Empty Rows

# Drop rows only if ALL values are missing
companies_all_clean = companies.dropna(how='all')

print(f"After dropna(how='all'): {len(companies_all_clean)} rows")

The how='all' parameter only drops rows where every value is NaN.

Drop Columns with Missing Values

# Drop columns with any missing values
companies_no_missing_cols = companies.dropna(axis=1)

print("Columns without any missing values:")
print(companies_no_missing_cols.columns.tolist())

Output:

Columns without any missing values:
['company']

Only the 'company' column has no missing values!

Threshold for Dropping

# Keep rows with at least N non-null values
companies_thresh = companies.dropna(thresh=5)

print(f"Original: {len(companies)} rows")
print(f"After keeping rows with at least 5 non-null values: {len(companies_thresh)} rows")

The thresh parameter specifies minimum non-null values required.


Filling Missing Values

Use .fillna() to replace missing values with specific values or calculated statistics.

Fill with Specific Value

# Fill text columns with 'Unknown'
companies_filled = companies.copy()
companies_filled['sector'] = companies_filled['sector'].fillna('Unknown')
companies_filled['country'] = companies_filled['country'].fillna('Unknown')

print("After filling text columns with 'Unknown':")
print(companies_filled[['company', 'sector', 'country']])

Output shows 'Unknown' where values were missing.

Fill with Mean

# Fill numeric columns with mean
companies_mean = companies.copy()
mean_revenue = companies['revenues'].mean()
companies_mean['revenues'] = companies_mean['revenues'].fillna(mean_revenue)

print(f"Mean revenue: ${mean_revenue:,.0f}")
print("\nRevenues after filling with mean:")
print(companies_mean[['company', 'revenues']])

Useful for normally distributed data without outliers.

Fill with Median

# Fill with median (better for skewed data)
companies_median = companies.copy()
median_employees = companies['employees'].median()
companies_median['employees'] = companies_median['employees'].fillna(median_employees)

print(f"Median employees: {median_employees:,.0f}")
print("\nEmployees after filling with median:")
print(companies_median[['company', 'employees']])

Median is more robust to outliers than mean.

Fill with Zero

# Fill with 0 (common for counts or amounts)
companies_zero = companies.copy()
companies_zero['profits'] = companies_zero['profits'].fillna(0)

print("Profits after filling with 0:")
print(companies_zero[['company', 'profits']])

Appropriate when missing means “none” or “zero.”

Fill Different Columns with Different Values

# Fill multiple columns with different strategies
companies_multi_fill = companies.copy()

fill_values = {
    'sector': 'Unknown',
    'revenues': companies['revenues'].median(),
    'profits': 0,
    'employees': companies['employees'].mean(),
    'country': 'Unknown'
}

companies_multi_fill = companies_multi_fill.fillna(fill_values)

print("After filling with different strategies:")
print(companies_multi_fill)
print("\nMissing values remaining:")
print(companies_multi_fill.isnull().sum())

This applies an appropriate filling method to each column.

Forward Fill

# Forward fill - use previous value
test_data = pd.DataFrame({
    'value': [10, None, None, 20, None, 30]
})

print("Original:")
print(test_data)

print("\nAfter forward fill (ffill):")
print(test_data.fillna(method='ffill'))

Output:

   value
0   10.0
1   10.0  ← Filled with previous value
2   10.0  ← Filled with previous value
3   20.0
4   20.0  ← Filled with previous value
5   30.0

Backward Fill

# Backward fill - use next value
print("After backward fill (bfill):")
print(test_data.bfill())   # fillna(method='bfill') is deprecated in modern pandas

Output:

   value
0   10.0
1   20.0  ← Filled with next value
2   20.0  ← Filled with next value
3   20.0
4   30.0  ← Filled with next value
5   30.0
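Both fills accept a limit parameter that caps how many consecutive missing values get filled; anything beyond the limit stays NaN. A small sketch with a hypothetical series:

```python
import pandas as pd

gap = pd.Series([10, None, None, None, 50])

# limit=1 fills at most one consecutive gap; the rest of the run stays NaN
filled = gap.ffill(limit=1)
print(filled.tolist())   # [10.0, 10.0, nan, nan, 50.0]
```

This is handy in time series where a short sensor dropout is worth patching but a long outage should stay visible.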

Fill Based on Groups

# Fill missing profits with sector average
companies_group = companies.copy()

# Calculate sector average profits
sector_avg_profit = companies.groupby('sector')['profits'].transform('mean')

# Fill missing with sector average
companies_group['profits'] = companies_group['profits'].fillna(sector_avg_profit)

print("Profits filled with sector average:")
print(companies_group[['company', 'sector', 'profits']])

This uses domain knowledge—companies in the same sector likely have similar profit patterns. One caveat: if a sector has no other non-missing profits (or the sector itself is missing), the group mean is NaN and the value stays unfilled. In this dataset, the Energy and Materials sectors each contain only one company, so their missing profits remain; follow up with an overall-mean fill if needed.


Practice Exercises

Apply missing data techniques with these exercises.

Exercise 1: Detecting Missing Data

  1. Find how many total missing values exist in the entire DataFrame
  2. Find which column has the highest percentage of missing values
  3. Count how many rows are missing both 'revenues' and 'profits'
# Your code here

Exercise 2: Removing Missing Data

  1. Create a version that drops rows missing 'revenues'
  2. Create a version that keeps only rows with ALL values present
  3. Create a version that keeps rows with at least 4 non-null values
# Your code here

Exercise 3: Filling Missing Data

Create a cleaned version with this strategy:

  1. Fill missing 'sector' with 'Unspecified'
  2. Fill missing 'revenues' with median revenue
  3. Fill missing 'profits' with mean profit
  4. Fill missing 'employees' with 0
  5. Fill missing 'country' with 'International'

Verify no missing values remain.

# Your code here

Hint

Use a dictionary with .fillna() to fill multiple columns at once with different values!

Exercise 4: Challenge - Smart Filling Strategy

Create a sophisticated cleaning approach:

  1. For companies in the same sector, fill missing 'revenues' with the sector's median revenue
  2. If sector is also missing, use overall median
  3. Compare the difference between using sector-based filling vs. overall median
# Your code here

Summary

You now have the core tools for handling missing data. Let's review the key concepts.

Key Concepts

Detecting Missing Data

  • Use .isnull() or .isna() to find missing values
  • Count missing with .sum()
  • Calculate percentages to understand extent

Removing Missing Data

  • .dropna() removes rows/columns with missing values
  • Use subset to check specific columns only
  • Use thresh to require minimum non-null values
  • Use how='all' to drop only completely empty rows

Filling Missing Data

  • .fillna() replaces missing values
  • Fill with constants, statistics, or group-based values
  • Forward fill and backward fill for time series
  • Different strategies for different column types

Syntax Reference

# Detection
df.isnull()                    # Returns True/False mask
df.isnull().sum()              # Count per column
df.isnull().any()              # Check if any missing
df.isnull().sum().sum()        # Total missing in DataFrame

# Removal
df.dropna()                    # Drop rows with any missing
df.dropna(subset=['col'])      # Drop if specific column missing
df.dropna(how='all')           # Drop only if all values missing
df.dropna(thresh=n)            # Keep rows with at least n non-null
df.dropna(axis=1)              # Drop columns with missing

# Filling
df.fillna(value)               # Fill with specific value
df.fillna({'col1': val1, 'col2': val2})  # Different values per column
df['col'].fillna(df['col'].mean())       # Fill with mean
df['col'].fillna(df['col'].median())     # Fill with median
df.ffill()                     # Forward fill
df.bfill()                     # Backward fill

Decision Guide

Should you DROP or FILL?

Drop when:

  • Very few rows affected (< 5%)
  • Missing data is random
  • You have plenty of data
  • A critical column is missing, making the row unusable

Fill when:

  • Many rows affected
  • Missing follows a pattern
  • Limited data available
  • Missing has meaning (e.g., 0)

Filling Strategies by Data Type:

Data Type           Strategy                   Method
Categorical         'Unknown' or most common   fillna('Unknown') or mode
Numeric (normal)    Mean                       fillna(mean())
Numeric (skewed)    Median                     fillna(median())
Counts/Amounts      Zero                       fillna(0)
Time series         Forward/backward fill      ffill() / bfill()
Group-based         Group statistic            groupby().transform()

Important Reminders

  • Always investigate WHY data is missing - Could indicate data quality issues
  • Document decisions - Record what you filled and why
  • Create flags - Add boolean column to track which were filled
  • Compare before/after - Always check statistics after cleaning
  • Keep original - Maintain backup of original data
  • Context matters - Use domain knowledge to choose strategies
  • Group-based filling - Use sector/category averages when appropriate

Next Steps

You can now detect, remove, and fill missing data appropriately. In the next lesson, you will learn data type conversion—ensuring columns have the correct types for analysis.

Continue to Lesson 12 - Data Type Conversion

Learn to convert between data types and handle type errors

Back to Lesson 10 - Apply, Map, and Transform

Review advanced transformation techniques


Master Data Quality

Missing data is inevitable in real-world datasets. Your ability to handle it appropriately—detecting, understanding, and addressing it—is fundamental to data analysis.

Use these techniques to prepare clean, analysis-ready datasets!