Lesson 13 - Removing Duplicates and Handling Outliers
Ensuring Data Quality
Duplicate records and outliers compromise data quality. Now you will learn to find duplicates, remove them appropriately, detect outliers using statistical methods, and decide how to handle extreme values.
By the end of this lesson, you will be able to:
- Find duplicate rows with .duplicated()
- Remove duplicates with .drop_duplicates()
- Handle partial duplicates based on specific columns
- Detect outliers using the IQR method
- Detect outliers using the z-score method
- Decide when to remove, cap, or keep outliers
Duplicates and outliers distort analysis—duplicate records inflate counts, extreme values skew averages. Handling them correctly is essential for reliable insights.
Understanding Duplicates and Outliers
Duplicates occur when the same data appears multiple times—data entry errors, system bugs, or legitimate repeated transactions.
Outliers are extreme values that differ significantly from other observations—errors, rare events, or genuinely unusual cases.
Sample Data
import pandas as pd
import numpy as np
# Create dataset with duplicates and outliers
companies = pd.DataFrame({
'company': ['TechCorp Global', 'FreshMart Inc', 'TechCorp Global', 'AutoDrive Motors', 'FreshMart Inc',
'PowerGen Energy', 'MediPharm Solutions', 'AutoDrive Motors', 'TechCorp Global', 'NetLink Telecom'],
'sector': ['Technology', 'Food', 'Technology', 'Automotive', 'Food',
'Energy', 'Healthcare', 'Automotive', 'Technology', 'Technology'],
'revenues': [125000, 89000, 125000, 156000, 89000,
178000, 98000, 156000, 125000, 5000000], # Last one is outlier!
'profits': [12000, 8500, 12000, -3000, 8500,
23000, 15000, -3000, 12000, 450000], # Outlier here too
'employees': [1200, 890, 1200, 2300, 890,
3400, 2100, 2300, 1200, 6700]
})
print("Dataset with duplicates and outliers:")
companies
Notice that ‘TechCorp Global’ and ‘FreshMart Inc’ appear multiple times, and ‘NetLink Telecom’ has suspiciously high revenues.
Finding Duplicates
Use .duplicated() to identify duplicate rows.
Check for Duplicates
# Check for duplicate rows (all columns must match)
is_duplicate = companies.duplicated()
print("Duplicate mask (True = duplicate):")
print(is_duplicate)
print(f"\nNumber of duplicates: {is_duplicate.sum()}")Output:
Duplicate mask (True = duplicate):
0 False ← TechCorp Global (first occurrence)
1 False ← FreshMart Inc (first occurrence)
2 True ← TechCorp Global (duplicate!)
3 False ← AutoDrive Motors (first occurrence)
4 True ← FreshMart Inc (duplicate!)
5 False
6 False
7 True ← AutoDrive Motors (duplicate!)
8 True ← TechCorp Global (duplicate!)
9 False
dtype: bool
Number of duplicates: 4
By default, .duplicated() marks only the second and later occurrences of a row as True; the first occurrence stays False.
Show Duplicate Rows
# Show only duplicate rows
duplicates = companies[companies.duplicated()]
print("Duplicate rows:")
print(duplicates)
Show All Rows That Have Duplicates
# Show ALL rows involved in duplication (including first)
all_duplicates = companies[companies.duplicated(keep=False)]
print("All duplicate rows (including first occurrence):")
print(all_duplicates.sort_values('company'))
The output shows all ‘TechCorp Global’, ‘FreshMart Inc’, and ‘AutoDrive Motors’ rows, both the originals and their duplicates.
Check Duplicates on Specific Columns
# Check duplicates based on company name only
company_name_duplicates = companies.duplicated(subset=['company'])
print("Duplicate company names:")
print(companies[company_name_duplicates][['company', 'sector', 'revenues']])
This finds rows with duplicate company names, even if other columns differ.
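If you just want to see how often each name repeats, .value_counts() gives a quick summary. A small sketch on the same data:
# Count how many times each company name appears
name_counts = companies['company'].value_counts()
print(name_counts)
# Names appearing more than once are the same ones flagged above
print(name_counts[name_counts > 1])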
Removing Duplicates
Use .drop_duplicates() to remove duplicate rows.
Remove All Duplicates
# Remove duplicate rows (keeps first occurrence)
companies_unique = companies.drop_duplicates()
print(f"Original: {len(companies)} rows")
print(f"After removing duplicates: {len(companies_unique)} rows")
print("\nUnique data:")
print(companies_unique)
Output:
Original: 10 rows
After removing duplicates: 6 rows
That removed 4 duplicate rows!
Keep Last Occurrence Instead
# Keep last occurrence instead of first
companies_keep_last = companies.drop_duplicates(keep='last')
print("Keeping last occurrence:")
print(companies_keep_last)
The keep parameter controls which occurrence to keep:
- 'first' (default): Keep the first occurrence
- 'last': Keep the last occurrence
- False: Remove all duplicates, including the first occurrence (sketched below)
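A quick sketch of the keep=False option on the sample data:
# Drop every row involved in duplication; no occurrence is kept
companies_no_repeats = companies.drop_duplicates(keep=False)
print(f"Rows left after dropping all repeated records: {len(companies_no_repeats)}")
# Here that leaves only the companies that never repeat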
Remove Based on Specific Columns
# Remove duplicates based on company name and sector only
companies_unique_names = companies.drop_duplicates(subset=['company', 'sector'])
print(f"Original: {len(companies)} rows")
print(f"After removing duplicate company-sector combinations: {len(companies_unique_names)} rows")This is useful when you care about specific column combinations, not all columns.
Detecting Outliers
Outliers are extreme values that differ significantly from the rest.
Visual Inspection with Statistics
# First, work with unique data
companies_clean = companies.drop_duplicates()
# Look at statistics
print("Revenue statistics:")
print(companies_clean['revenues'].describe())
Output:
count 6.000000
mean 960166.666667 ← Mean is very high!
std 1957392.234876 ← Huge standard deviation
min 89000.000000
25% 111500.000000
50% 140000.000000
75% 167000.000000
max 5000000.000000 ← WAY higher than others!
The max is dramatically different—likely an outlier.
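The gap between the mean and the median is itself a quick outlier signal, since the median barely moves when one value is extreme. A small sketch:
# A large gap between mean and median hints at skew from extreme values
print(f"Mean revenue:   {companies_clean['revenues'].mean():,.0f}")
print(f"Median revenue: {companies_clean['revenues'].median():,.0f}")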
Sort to Find Extremes
# Sort to visualize outliers
print("Companies sorted by revenue:")
print(companies_clean[['company', 'revenues']].sort_values('revenues', ascending=False))
Output:
company revenues
9 NetLink Telecom 5000000 ← Outlier!
5 PowerGen Energy 178000
3 AutoDrive Motors 156000
0 TechCorp Global 125000
6 MediPharm Solutions 98000
1 FreshMart Inc 89000
NetLink Telecom reports $5 million while the others range from $89k to $178k!
IQR Method for Outliers
The Interquartile Range (IQR) method is a standard statistical approach.
Understanding IQR
IQR = Q3 - Q1 (75th percentile - 25th percentile)
Outlier boundaries:
- Lower bound = Q1 - 1.5 × IQR
- Upper bound = Q3 + 1.5 × IQR
Values outside these bounds are outliers.
Calculate IQR Boundaries
# Calculate IQR
Q1 = companies_clean['revenues'].quantile(0.25)
Q3 = companies_clean['revenues'].quantile(0.75)
IQR = Q3 - Q1
# Define outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(f"Q1 (25th percentile): ${Q1:,.0f}")
print(f"Q3 (75th percentile): ${Q3:,.0f}")
print(f"IQR: ${IQR:,.0f}")
print(f"\nOutlier boundaries:")
print(f"Lower: ${lower_bound:,.0f}")
print(f"Upper: ${upper_bound:,.0f}")Output:
Q1 (25th percentile): $111,500
Q3 (75th percentile): $167,000
IQR: $55,500
Outlier boundaries:
Lower: $28,250
Upper: $250,250
Any revenue below $28,250 or above $250,250 is flagged as an outlier.
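pandas also offers .between(), which expresses the same bounds check a little more compactly. A small sketch using the bounds computed above:
# Equivalent bounds check with .between(); ~ flips it to mark outliers
within = companies_clean['revenues'].between(lower_bound, upper_bound)
print(companies_clean[~within][['company', 'revenues']])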
Find Outliers
# Identify outliers
outliers = companies_clean[(companies_clean['revenues'] < lower_bound) |
(companies_clean['revenues'] > upper_bound)]
print("Revenue outliers:")
print(outliers[['company', 'revenues', 'profits']])
Output:
company revenues profits
9 NetLink Telecom 5000000 450000
NetLink Telecom is indeed an outlier!
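Since the same quartile arithmetic repeats for every column you check, it can help to wrap it in a small reusable function. The sketch below is illustrative; the name iqr_outliers and the choice to return a boolean mask are assumptions, not part of the lesson.
def iqr_outliers(series, k=1.5):
    """Return a boolean mask marking values outside Q1 - k*IQR .. Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)
# Reuse the same rule on another column
print(companies_clean[iqr_outliers(companies_clean['profits'])])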
Z-Score Method for Outliers
The z-score measures how many standard deviations a value is from the mean.
Rule of thumb: |z-score| > 3 indicates an outlier (for normally distributed data, about 99.7% of values fall within 3 standard deviations of the mean).
Calculate Z-Scores
# Calculate z-scores
mean_revenue = companies_clean['revenues'].mean()
std_revenue = companies_clean['revenues'].std()
companies_clean_copy = companies_clean.copy()
companies_clean_copy['revenue_zscore'] = (companies_clean_copy['revenues'] - mean_revenue) / std_revenue
print("Revenue z-scores:")
print(companies_clean_copy[['company', 'revenues', 'revenue_zscore']].sort_values('revenue_zscore', ascending=False))
Output:
company revenues revenue_zscore
9 NetLink Telecom 5000000 2.063918 ← High z-score
5 PowerGen Energy 178000 -0.399439
3 AutoDrive Motors 156000 -0.410689
0 TechCorp Global 125000 -0.426807
6 MediPharm Solutions 98000 -0.440368
1 FreshMart Inc 89000 -0.445022
Find Extreme Z-Scores
# Outliers typically have |z-score| > 3
outliers_zscore = companies_clean_copy[abs(companies_clean_copy['revenue_zscore']) > 3]
print("Outliers (|z-score| > 3):")
print(outliers_zscore[['company', 'revenues', 'revenue_zscore']])
With a threshold of 3, no outliers are found: NetLink Telecom's z-score is only about 2.06, because on such a small sample the outlier itself inflates the mean and standard deviation used to compute the scores. You could lower the threshold to 2 to catch it.
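A short sketch of that lower cutoff, reusing the z-scores computed above (the choice of 2 is a judgment call, not a fixed rule):
# With a cutoff of 2, the extreme revenue is flagged
outliers_z2 = companies_clean_copy[companies_clean_copy['revenue_zscore'].abs() > 2]
print(outliers_z2[['company', 'revenues', 'revenue_zscore']])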
Handling Outliers
Once identified, you have three main options for handling outliers.
Option 1: Remove Outliers
Appropriate when outliers are likely errors.
# Remove outliers
companies_no_outliers = companies_clean[(companies_clean['revenues'] >= lower_bound) &
(companies_clean['revenues'] <= upper_bound)]
print(f"Original: {len(companies_clean)} rows")
print(f"After removing outliers: {len(companies_no_outliers)} rows")
print("\nRevenue statistics without outliers:")
print(companies_no_outliers['revenues'].describe())
Output:
Original: 6 rows
After removing outliers: 5 rows
Revenue statistics without outliers:
count 5.000000
mean 129200.000000 ← Much more reasonable!
std 32926.829268
min 89000.000000
max 178000.000000
The mean changed from $960k to $129k, a huge impact!
Option 2: Cap Outliers
Use .clip() to limit extreme values to boundaries.
# Cap outliers to boundaries
companies_capped = companies_clean.copy()
companies_capped['revenues'] = companies_capped['revenues'].clip(lower=lower_bound, upper=upper_bound)
print("Before capping:")
print(companies_clean[['company', 'revenues']].sort_values('revenues', ascending=False).head())
print("\nAfter capping:")
print(companies_capped[['company', 'revenues']].sort_values('revenues', ascending=False).head())
Output:
Before capping:
company revenues
9 NetLink Telecom 5000000 ← Extreme outlier
5 PowerGen Energy 178000
After capping:
company revenues
9 NetLink Telecom 250250 ← Capped to upper bound
5 PowerGen Energy 178000
This preserves the row while reducing the outlier's impact.
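As an aside not covered in this lesson, a related variant caps values at percentiles of the data itself (often called winsorizing) instead of IQR bounds; the 5th/95th percentile choice below is just an illustration:
# Cap at the 5th and 95th percentiles instead of IQR-based bounds
low_p = companies_clean['revenues'].quantile(0.05)
high_p = companies_clean['revenues'].quantile(0.95)
companies_winsorized = companies_clean.copy()
companies_winsorized['revenues'] = companies_winsorized['revenues'].clip(lower=low_p, upper=high_p)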
Option 3: Flag Outliers
Create a boolean column to mark outliers for investigation.
# Flag outliers
companies_flagged = companies_clean.copy()
companies_flagged['is_outlier'] = ((companies_flagged['revenues'] < lower_bound) |
(companies_flagged['revenues'] > upper_bound))
print("Flagged data:")
print(companies_flagged[['company', 'revenues', 'is_outlier']])
print(f"\nNumber of outliers: {companies_flagged['is_outlier'].sum()}")This keeps all data while highlighting unusual cases for further review.
Practice Exercises
Apply duplicate and outlier techniques with these exercises.
Exercise 1: Finding Duplicates
Given:
sales = pd.DataFrame({
'product': ['Laptop', 'Mouse', 'Laptop', 'Keyboard', 'Mouse'],
'region': ['North', 'South', 'North', 'East', 'South'],
'sales': [1200, 300, 1200, 450, 300]
})
- Find how many duplicate rows exist
- Show all duplicate rows (including first occurrence)
- Remove duplicates keeping the first occurrence
# Your code here
Exercise 2: Detecting Outliers
Given:
scores = pd.DataFrame({
'student': ['Ali', 'Sara', 'Reza', 'Maryam', 'Hassan', 'Zahra', 'Fatima'],
'score': [85, 92, 88, 15, 90, 87, 91] # 15 is an outlier!
})
- Use the IQR method to find outliers
- Identify which students have outlier scores
- Calculate what the mean score would be without outliers
# Your code here
Hint
Calculate Q1, Q3, and IQR, then check which values fall outside Q1 - 1.5×IQR or Q3 + 1.5×IQR!
Exercise 3: Complete Cleaning
Given employee data with duplicates and outliers:
employees = pd.DataFrame({
'name': ['Ali', 'Sara', 'Ali', 'Reza', 'Maryam', 'Hassan'],
'department': ['IT', 'HR', 'IT', 'Sales', 'IT', 'Sales'],
'salary': [50000, 45000, 50000, 48000, 52000, 250000]  # Hassan's salary is an outlier
})
Tasks:
- Remove duplicate rows
- Detect salary outliers using IQR method
- Compare mean salary with vs without outliers
- Create final clean dataset (no duplicates, outliers capped)
# Your code here
Summary
You now know how to find duplicates and detect outliers. Let's review the key concepts.
Key Concepts
Duplicates
- Use .duplicated() to find duplicate rows
- Use .drop_duplicates() to remove them
- Check specific columns with the subset parameter
- Control which occurrence to keep with the keep parameter
Outliers
- Extreme values that differ significantly
- IQR method: Standard statistical approach
- Z-score method: Measures standard deviations from mean
- Three options: remove, cap, or flag
Decision Making
- Investigate before removing
- Consider domain knowledge
- Document decisions
- Compare statistics before/after
Syntax Reference
# Duplicates
df.duplicated() # Check for duplicates
df.duplicated(subset=['col']) # Check specific columns
df.duplicated(keep=False) # Mark all occurrences, including the first
df.drop_duplicates() # Remove duplicates
df.drop_duplicates(subset=['col']) # Based on specific columns
df.drop_duplicates(keep='last') # Keep last occurrence
# IQR Method
Q1 = df['col'].quantile(0.25)
Q3 = df['col'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = df[(df['col'] < lower) | (df['col'] > upper)]
# Z-Score Method
mean = df['col'].mean()
std = df['col'].std()
z_scores = (df['col'] - mean) / std
outliers = df[abs(z_scores) > 3]
# Handling Outliers
df_clean = df[(df['col'] >= lower) & (df['col'] <= upper)] # Remove
df['col'] = df['col'].clip(lower=lower, upper=upper) # Cap
df['is_outlier'] = (df['col'] < lower) | (df['col'] > upper) # Flag
When to Remove vs Keep
Remove duplicates when:
- Exact duplicates with no meaning
- Data entry errors
- System-generated duplicates
Keep duplicates when:
- Legitimate repeated transactions
- Same person, different events
- Time-based duplicates are meaningful
Remove outliers when:
- Clearly errors or typos
- Data entry mistakes
- Impossible values
Keep outliers when:
- Representing real phenomena
- Valid extreme cases
- Rare but legitimate events
Best Practices
- Always investigate before removing data
- Document decisions - note what you removed and why
- Compare statistics before and after cleaning
- Keep original data - maintain backup
- Use domain knowledge - understand what’s reasonable
- Flag instead of remove when uncertain
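To tie these practices together, here is a hedged sketch of a small cleaning routine that keeps the original data untouched, flags rather than removes, and reports before/after statistics. The function name clean_companies and the returned report dictionary are illustrative choices, not part of the lesson.
def clean_companies(raw):
    """Return a cleaned copy plus a short report; the original frame is untouched."""
    report = {'rows_in': len(raw)}
    df = raw.drop_duplicates().copy()  # remove exact duplicates
    report['duplicates_removed'] = report['rows_in'] - len(df)
    # IQR bounds on revenues, as in the lesson
    q1, q3 = df['revenues'].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    df['is_outlier'] = (df['revenues'] < lower) | (df['revenues'] > upper)  # flag, don't drop
    report['outliers_flagged'] = int(df['is_outlier'].sum())
    # Compare statistics with and without the flagged rows
    report['mean_with_outliers'] = df['revenues'].mean()
    report['mean_without_outliers'] = df.loc[~df['is_outlier'], 'revenues'].mean()
    return df, report

cleaned, report = clean_companies(companies)
print(report)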
Next Steps
You can now identify and handle duplicates and outliers. In the next lesson, you will learn string operations—cleaning text data, extracting information, and searching patterns.
Continue to Lesson 14 - String Operations
Learn to clean and manipulate text data efficiently
Back to Lesson 12 - Data Type Conversion
Review converting between data types
Master Data Quality
Duplicates and outliers compromise analysis quality. Your ability to detect and handle them appropriately—removing errors while preserving legitimate data—is essential for reliable insights.
Use these techniques to ensure your data is clean and trustworthy!