Lesson 13 - Removing Duplicates and Handling Outliers
Ensuring Data Quality
Duplicate records and outliers compromise data quality. Now you will learn to find duplicates, remove them appropriately, detect outliers using statistical methods, and decide how to handle extreme values.
By the end of this lesson, you will be able to:
- Find duplicate rows with .duplicated()
- Remove duplicates with .drop_duplicates()
- Handle partial duplicates based on specific columns
- Detect outliers using the IQR method
- Detect outliers using the z-score method
- Decide when to remove, cap, or keep outliers
Duplicates and outliers distort analysis—duplicate records inflate counts, extreme values skew averages. Handling them correctly is essential for reliable insights.
Understanding Duplicates and Outliers
Duplicates occur when the same data appears multiple times—data entry errors, system bugs, or legitimate repeated transactions.
Outliers are extreme values that differ significantly from other observations—errors, rare events, or genuinely unusual cases.
Sample Data
import pandas as pd
import numpy as np
# Create dataset with duplicates and outliers
companies = pd.DataFrame({
'company': ['TechCorp Global', 'FreshMart Inc', 'TechCorp Global', 'AutoDrive Motors', 'FreshMart Inc',
'PowerGen Energy', 'MediPharm Solutions', 'AutoDrive Motors', 'TechCorp Global', 'NetLink Telecom'],
'sector': ['Technology', 'Food', 'Technology', 'Automotive', 'Food',
'Energy', 'Healthcare', 'Automotive', 'Technology', 'Technology'],
'revenues': [125000, 89000, 125000, 156000, 89000,
178000, 98000, 156000, 125000, 5000000], # Last one is outlier!
'profits': [12000, 8500, 12000, -3000, 8500,
23000, 15000, -3000, 12000, 450000], # Outlier here too
'employees': [1200, 890, 1200, 2300, 890,
3400, 2100, 2300, 1200, 6700]
})
print("Dataset with duplicates and outliers:")
companies
Notice that ‘TechCorp Global’ and ‘FreshMart Inc’ appear multiple times, and ‘NetLink Telecom’ has suspiciously high revenues.
Finding Duplicates
Use .duplicated() to identify duplicate rows.
Check for Duplicates
# Check for duplicate rows (all columns must match)
is_duplicate = companies.duplicated()
print("Duplicate mask (True = duplicate):")
print(is_duplicate)
print(f"\nNumber of duplicates: {is_duplicate.sum()}")Output:
Duplicate mask (True = duplicate):
0 False ← TechCorp Global (first occurrence)
1 False ← FreshMart Inc (first occurrence)
2 True ← TechCorp Global (duplicate!)
3 False ← AutoDrive Motors (first occurrence)
4 True ← FreshMart Inc (duplicate!)
5 False
6 False
7 True ← AutoDrive Motors (duplicate!)
8 True ← TechCorp Global (duplicate!)
9 False
dtype: bool
Number of duplicates: 4
By default, .duplicated() marks only the second and later occurrences of a row as True; the first occurrence stays False.
Show Duplicate Rows
# Show only duplicate rows
duplicates = companies[companies.duplicated()]
print("Duplicate rows:")
print(duplicates)
Show All Rows That Have Duplicates
# Show ALL rows involved in duplication (including first)
all_duplicates = companies[companies.duplicated(keep=False)]
print("All duplicate rows (including first occurrence):")
print(all_duplicates.sort_values('company'))
The output shows all ‘TechCorp Global’, ‘FreshMart Inc’, and ‘AutoDrive Motors’ rows, both the originals and their duplicates.
Check Duplicates on Specific Columns
# Check duplicates based on company name only
company_name_duplicates = companies.duplicated(subset=['company'])
print("Duplicate company names:")
print(companies[company_name_duplicates][['company', 'sector', 'revenues']])
This finds rows with duplicate company names, even if other columns differ.
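If you just want to see how often each name repeats, .value_counts() gives a quick summary. A small sketch on the same data:
# Count how many times each company name appears
name_counts = companies['company'].value_counts()
print(name_counts)
# Names appearing more than once are the same ones flagged above
print(name_counts[name_counts > 1])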
Removing Duplicates
Use .drop_duplicates() to remove duplicate rows.
Remove All Duplicates
# Remove duplicate rows (keeps first occurrence)
companies_unique = companies.drop_duplicates()
print(f"Original: {len(companies)} rows")
print(f"After removing duplicates: {len(companies_unique)} rows")
print("\nUnique data:")
print(companies_unique)
Output:
Original: 10 rows
After removing duplicates: 6 rows
That removed 4 duplicate rows!
Keep Last Occurrence Instead
# Keep last occurrence instead of first
companies_keep_last = companies.drop_duplicates(keep='last')
print("Keeping last occurrence:")
print(companies_keep_last)
The keep parameter controls which occurrence to keep:
- 'first' (default): Keep the first occurrence
- 'last': Keep the last occurrence
- False: Remove all duplicates, including the first occurrence (sketched below)
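A quick sketch of the keep=False option on the sample data:
# Drop every row involved in duplication; no occurrence is kept
companies_no_repeats = companies.drop_duplicates(keep=False)
print(f"Rows left after dropping all repeated records: {len(companies_no_repeats)}")
# Here that leaves only the companies that never repeat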
Remove Based on Specific Columns
# Remove duplicates based on company name and sector only
companies_unique_names = companies.drop_duplicates(subset=['company', 'sector'])
print(f"Original: {len(companies)} rows")
print(f"After removing duplicate company-sector combinations: {len(companies_unique_names)} rows")This is useful when you care about specific column combinations, not all columns.
Detecting Outliers
Outliers are extreme values that differ significantly from the rest.
Visual Inspection with Statistics
# First, work with unique data
companies_clean = companies.drop_duplicates()
# Look at statistics
print("Revenue statistics:")
print(companies_clean['revenues'].describe())
Output:
count 6.000000
mean 960166.666667 ← Mean is very high!
std 1957392.234876 ← Huge standard deviation
min 89000.000000
25% 111500.000000
50% 140000.000000
75% 167000.000000
max 5000000.000000 ← WAY higher than others!
The max is dramatically different—likely an outlier.
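The gap between the mean and the median is itself a quick outlier signal, since the median barely moves when one value is extreme. A small sketch:
# A large gap between mean and median hints at skew from extreme values
print(f"Mean revenue:   {companies_clean['revenues'].mean():,.0f}")
print(f"Median revenue: {companies_clean['revenues'].median():,.0f}")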
Sort to Find Extremes
# Sort to visualize outliers
print("Companies sorted by revenue:")
print(companies_clean[['company', 'revenues']].sort_values('revenues', ascending=False))
Output:
company revenues
9 NetLink Telecom 5000000 ← Outlier!
5 PowerGen Energy 178000
3 AutoDrive Motors 156000
0 TechCorp Global 125000
6 MediPharm Solutions 98000
1 FreshMart Inc 89000
NetLink Telecom reports $5 million while the others range from $89k to $178k!
IQR Method for Outliers
The Interquartile Range (IQR) method is a standard statistical approach.
Understanding IQR
IQR = Q3 - Q1 (75th percentile - 25th percentile)
Outlier boundaries:
- Lower bound = Q1 - 1.5 × IQR
- Upper bound = Q3 + 1.5 × IQR
Values outside these bounds are outliers.
Calculate IQR Boundaries
# Calculate IQR
Q1 = companies_clean['revenues'].quantile(0.25)
Q3 = companies_clean['revenues'].quantile(0.75)
IQR = Q3 - Q1
# Define outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(f"Q1 (25th percentile): ${Q1:,.0f}")
print(f"Q3 (75th percentile): ${Q3:,.0f}")
print(f"IQR: ${IQR:,.0f}")
print(f"\nOutlier boundaries:")
print(f"Lower: ${lower_bound:,.0f}")
print(f"Upper: ${upper_bound:,.0f}")Output:
Q1 (25th percentile): $111,500
Q3 (75th percentile): $167,000
IQR: $55,500
Outlier boundaries:
Lower: $28,250
Upper: $250,250
Any revenue below $28,250 or above $250,250 is flagged as an outlier.
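pandas also offers .between(), which expresses the same bounds check a little more compactly. A small sketch using the bounds computed above:
# Equivalent bounds check with .between(); ~ flips it to mark outliers
within = companies_clean['revenues'].between(lower_bound, upper_bound)
print(companies_clean[~within][['company', 'revenues']])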
Find Outliers
# Identify outliers
outliers = companies_clean[(companies_clean['revenues'] < lower_bound) |
(companies_clean['revenues'] > upper_bound)]
print("Revenue outliers:")
print(outliers[['company', 'revenues', 'profits']])
Output:
company revenues profits
9 NetLink Telecom 5000000 450000
NetLink Telecom is indeed an outlier!
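Since the same quartile arithmetic repeats for every column you check, it can help to wrap it in a small reusable function. The sketch below is illustrative; the name iqr_outliers and the choice to return a boolean mask are assumptions, not part of the lesson.
def iqr_outliers(series, k=1.5):
    """Return a boolean mask marking values outside Q1 - k*IQR .. Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)
# Reuse the same rule on another column
print(companies_clean[iqr_outliers(companies_clean['profits'])])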
Z-Score Method for Outliers
The z-score measures how many standard deviations a value is from the mean.
Rule of thumb: |z-score| > 3 indicates an outlier (for normally distributed data, about 99.7% of values fall within 3 standard deviations of the mean).
Calculate Z-Scores
# Calculate z-scores
mean_revenue = companies_clean['revenues'].mean()
std_revenue = companies_clean['revenues'].std()
companies_clean_copy = companies_clean.copy()
companies_clean_copy['revenue_zscore'] = (companies_clean_copy['revenues'] - mean_revenue) / std_revenue
print("Revenue z-scores:")
print(companies_clean_copy[['company', 'revenues', 'revenue_zscore']].sort_values('revenue_zscore', ascending=False))
Output:
company revenues revenue_zscore
9 NetLink Telecom 5000000 2.063918 ← High z-score
5 PowerGen Energy 178000 -0.399439
3 AutoDrive Motors 156000 -0.410689
0 TechCorp Global 125000 -0.426807
6 MediPharm Solutions 98000 -0.440368
1 FreshMart Inc 89000 -0.445022
Find Extreme Z-Scores
# Outliers typically have |z-score| > 3
outliers_zscore = companies_clean_copy[abs(companies_clean_copy['revenue_zscore']) > 3]
print("Outliers (|z-score| > 3):")
print(outliers_zscore[['company', 'revenues', 'revenue_zscore']])
With a threshold of 3, no outliers are found: NetLink Telecom's z-score is only about 2.06, because on such a small sample the outlier itself inflates the mean and standard deviation used to compute the scores. You could lower the threshold to 2 to catch it.
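A short sketch of that lower cutoff, reusing the z-scores computed above (the choice of 2 is a judgment call, not a fixed rule):
# With a cutoff of 2, the extreme revenue is flagged
outliers_z2 = companies_clean_copy[companies_clean_copy['revenue_zscore'].abs() > 2]
print(outliers_z2[['company', 'revenues', 'revenue_zscore']])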
Handling Outliers
Once identified, you have three main options for handling outliers.
Option 1: Remove Outliers
Appropriate when outliers are likely errors.
# Remove outliers
companies_no_outliers = companies_clean[(companies_clean['revenues'] >= lower_bound) &
(companies_clean['revenues'] <= upper_bound)]
print(f"Original: {len(companies_clean)} rows")
print(f"After removing outliers: {len(companies_no_outliers)} rows")
print("\nRevenue statistics without outliers:")
print(companies_no_outliers['revenues'].describe())
Output:
Original: 6 rows
After removing outliers: 5 rows
Revenue statistics without outliers:
count 5.000000
mean 129200.000000 ← Much more reasonable!
std 32926.829268
min 89000.000000
max 178000.000000
The mean changed from $960k to $129k, a huge impact!
Option 2: Cap Outliers
Use .clip() to limit extreme values to boundaries.
# Cap outliers to boundaries
companies_capped = companies_clean.copy()
companies_capped['revenues'] = companies_capped['revenues'].clip(lower=lower_bound, upper=upper_bound)
print("Before capping:")
print(companies_clean[['company', 'revenues']].sort_values('revenues', ascending=False).head())
print("\nAfter capping:")
print(companies_capped[['company', 'revenues']].sort_values('revenues', ascending=False).head())
Output:
Before capping:
company revenues
9 NetLink Telecom 5000000 ← Extreme outlier
5 PowerGen Energy 178000
After capping:
company revenues
9 NetLink Telecom 250250 ← Capped to upper bound
5 PowerGen Energy 178000
This preserves the row while reducing the outlier's impact.
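As an aside not covered in this lesson, a related variant caps values at percentiles of the data itself (often called winsorizing) instead of IQR bounds; the 5th/95th percentile choice below is just an illustration:
# Cap at the 5th and 95th percentiles instead of IQR-based bounds
low_p = companies_clean['revenues'].quantile(0.05)
high_p = companies_clean['revenues'].quantile(0.95)
companies_winsorized = companies_clean.copy()
companies_winsorized['revenues'] = companies_winsorized['revenues'].clip(lower=low_p, upper=high_p)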
Option 3: Flag Outliers
Create a boolean column to mark outliers for investigation.
# Flag outliers
companies_flagged = companies_clean.copy()
companies_flagged['is_outlier'] = ((companies_flagged['revenues'] < lower_bound) |
(companies_flagged['revenues'] > upper_bound))
print("Flagged data:")
print(companies_flagged[['company', 'revenues', 'is_outlier']])
print(f"\nNumber of outliers: {companies_flagged['is_outlier'].sum()}")This keeps all data while highlighting unusual cases for further review.
Practice Exercises
Apply duplicate and outlier techniques with these exercises.
Exercise 1: Finding Duplicates
Given:
sales = pd.DataFrame({
'product': ['Laptop', 'Mouse', 'Laptop', 'Keyboard', 'Mouse'],
'region': ['North', 'South', 'North', 'East', 'South'],
'sales': [1200, 300, 1200, 450, 300]
})
- Find how many duplicate rows exist
- Show all duplicate rows (including first occurrence)
- Remove duplicates keeping the first occurrence
# Your code here
Exercise 2: Detecting Outliers
Given:
scores = pd.DataFrame({
'student': ['Ali', 'Sara', 'Reza', 'Maryam', 'Hassan', 'Zahra', 'Fatima'],
'score': [85, 92, 88, 15, 90, 87, 91] # 15 is an outlier!
})
- Use the IQR method to find outliers
- Identify which students have outlier scores
- Calculate what the mean score would be without outliers
# Your code here
Hint
Calculate Q1, Q3, and IQR, then check which values fall outside Q1 - 1.5×IQR or Q3 + 1.5×IQR!
Exercise 3: Complete Cleaning
Given employee data with duplicates and outliers:
employees = pd.DataFrame({
'name': ['Ali', 'Sara', 'Ali', 'Reza', 'Maryam', 'Hassan'],
'department': ['IT', 'HR', 'IT', 'Sales', 'IT', 'Sales'],
'salary': [50000, 45000, 50000, 48000, 52000, 250000]  # Hassan's salary is an outlier
})
Tasks:
- Remove duplicate rows
- Detect salary outliers using IQR method
- Compare mean salary with vs without outliers
- Create final clean dataset (no duplicates, outliers capped)
# Your code here
Summary
You now know how to find duplicates and detect outliers. Let's review the key concepts.
Key Concepts
Duplicates
- Use .duplicated() to find duplicate rows
- Use .drop_duplicates() to remove them
- Check specific columns with the subset parameter
- Control which occurrence to keep with the keep parameter
Outliers
- Extreme values that differ significantly
- IQR method: Standard statistical approach
- Z-score method: Measures standard deviations from mean
- Three options: remove, cap, or flag
Decision Making
- Investigate before removing
- Consider domain knowledge
- Document decisions
- Compare statistics before/after
Syntax Reference
# Duplicates
df.duplicated() # Check for duplicates
df.duplicated(subset=['col']) # Check specific columns
df.duplicated(keep=False) # Mark all occurrences, including the first
df.drop_duplicates() # Remove duplicates
df.drop_duplicates(subset=['col']) # Based on specific columns
df.drop_duplicates(keep='last') # Keep last occurrence
# IQR Method
Q1 = df['col'].quantile(0.25)
Q3 = df['col'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = df[(df['col'] < lower) | (df['col'] > upper)]
# Z-Score Method
mean = df['col'].mean()
std = df['col'].std()
z_scores = (df['col'] - mean) / std
outliers = df[abs(z_scores) > 3]
# Handling Outliers
df_clean = df[(df['col'] >= lower) & (df['col'] <= upper)] # Remove
df['col'] = df['col'].clip(lower=lower, upper=upper) # Cap
df['is_outlier'] = (df['col'] < lower) | (df['col'] > upper) # Flag
When to Remove vs Keep
Remove duplicates when:
- Exact duplicates with no meaning
- Data entry errors
- System-generated duplicates
Keep duplicates when:
- Legitimate repeated transactions
- Same person, different events
- Time-based duplicates are meaningful
Remove outliers when:
- Clearly errors or typos
- Data entry mistakes
- Impossible values
Keep outliers when:
- Representing real phenomena
- Valid extreme cases
- Rare but legitimate events
Best Practices
- Always investigate before removing data
- Document decisions - note what you removed and why
- Compare statistics before and after cleaning
- Keep original data - maintain backup
- Use domain knowledge - understand what’s reasonable
- Flag instead of remove when uncertain
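To tie these practices together, here is a hedged sketch of a small cleaning routine that keeps the original data untouched, flags rather than removes, and reports before/after statistics. The function name clean_companies and the returned report dictionary are illustrative choices, not part of the lesson.
def clean_companies(raw):
    """Return a cleaned copy plus a short report; the original frame is untouched."""
    report = {'rows_in': len(raw)}
    df = raw.drop_duplicates().copy()  # remove exact duplicates
    report['duplicates_removed'] = report['rows_in'] - len(df)
    # IQR bounds on revenues, as in the lesson
    q1, q3 = df['revenues'].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    df['is_outlier'] = (df['revenues'] < lower) | (df['revenues'] > upper)  # flag, don't drop
    report['outliers_flagged'] = int(df['is_outlier'].sum())
    # Compare statistics with and without the flagged rows
    report['mean_with_outliers'] = df['revenues'].mean()
    report['mean_without_outliers'] = df.loc[~df['is_outlier'], 'revenues'].mean()
    return df, report

cleaned, report = clean_companies(companies)
print(report)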
Next Steps
You can now identify and handle duplicates and outliers. In the next lesson, you will learn string operations—cleaning text data, extracting information, and searching patterns.
Continue to Lesson 14 - String Operations
Learn to clean and manipulate text data efficiently
Back to Lesson 12 - Data Type Conversion
Review converting between data types
Master Data Quality
Duplicates and outliers compromise analysis quality. Your ability to detect and handle them appropriately—removing errors while preserving legitimate data—is essential for reliable insights.
Use these techniques to ensure your data is clean and trustworthy!