Lesson 12 - Comparing Distributions
Beyond Single Distributions
You can create histograms to visualize one distribution. Now you will learn to compare multiple distributions to identify differences between groups, conditions, or time periods.
By the end of this lesson, you will be able to:
- Overlay multiple histograms for comparison
- Create side-by-side distribution plots
- Use transparency effectively
- Calculate and compare distribution statistics
- Identify significant differences between groups
- Choose appropriate comparison techniques
Comparing distributions answers: “How do these groups differ in their data patterns?”
Why Compare Distributions?
Real-World Questions
Distribution comparisons help answer:
- Do bike rentals differ between weekdays and weekends?
- How does temperature distribution vary across seasons?
- Are casual and registered users different in their usage patterns?
- Has the rental distribution changed from 2011 to 2012?
Key insight: Comparing distributions reveals differences that summary statistics (mean, median) alone might miss.
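A tiny illustration of this point, using two made-up samples (they are not taken from the bike dataset): both have the same mean and median, yet their spreads are very different, which only shows up when you look beyond those two numbers.

```python
import numpy as np

# Two small made-up samples, chosen so the mean and median match.
narrow = np.array([4, 5, 5, 5, 6])
wide = np.array([1, 5, 5, 5, 9])

for name, values in [("narrow", narrow), ("wide", wide)]:
    print(f"{name:<8s} mean={values.mean():.1f} "
          f"median={np.median(values):.1f} "
          f"std={values.std(ddof=1):.2f} "
          f"range={values.min()}-{values.max()}")
```

A histogram of each sample would make the difference visible at a glance, even though the mean and median columns are identical.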
Overlapping Histograms
Basic Overlay
import pandas as pd
import matplotlib.pyplot as plt
bikes_hour = pd.read_csv('hour.csv')
# Separate data by working day status
working_rentals = bikes_hour[bikes_hour['workingday'] == 1]['cnt']
weekend_rentals = bikes_hour[bikes_hour['workingday'] == 0]['cnt']
# Create overlapping histograms
plt.figure(figsize=(10, 6))
plt.hist(working_rentals, bins=40, alpha=0.6, label='Working Days',
color='blue', edgecolor='darkblue')
plt.hist(weekend_rentals, bins=40, alpha=0.6, label='Weekends/Holidays',
color='red', edgecolor='darkred')
plt.xlabel('Hourly Bike Rentals')
plt.ylabel('Frequency')
plt.title('Rental Distribution: Working Days vs Weekends')
plt.legend()
plt.grid(True, alpha=0.3, axis='y')
plt.show()
Key technique: Use alpha=0.6 (transparency) so overlapping areas are visible.
What it shows:
- Both distributions are right-skewed
- Working days have more hours with moderate rentals (100-250)
- Weekends and holidays show a flatter distribution
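One detail worth checking in an overlay: when each plt.hist call receives bins=40, matplotlib derives the bin edges from that group's own minimum and maximum, so the bars of the two groups may not line up exactly. A minimal sketch of one way to share bin edges follows. It uses synthetic data as a stand-in for the rental series, and the names group_a, group_b, and shared_edges are illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for the two rental series (assumption: in the
# lesson these would be the working-day and weekend 'cnt' columns).
rng = np.random.default_rng(0)
group_a = rng.gamma(shape=2.0, scale=100.0, size=1200)
group_b = rng.gamma(shape=2.0, scale=85.0, size=500)

# Compute one set of 40 bins (41 edges) from the combined data.
combined = np.concatenate([group_a, group_b])
shared_edges = np.histogram_bin_edges(combined, bins=40)

# Pass the same edges to both calls so the bars align bin for bin.
plt.figure(figsize=(10, 6))
plt.hist(group_a, bins=shared_edges, alpha=0.6, label='Group A')
plt.hist(group_b, bins=shared_edges, alpha=0.6, label='Group B')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.legend()
plt.show()
```

With shared edges, a taller bar in one group always refers to the same value range as the bar behind it.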
Statistical Comparison
Calculate Summary Statistics
import pandas as pd
import numpy as np
bikes_hour = pd.read_csv('hour.csv')
working_rentals = bikes_hour[bikes_hour['workingday'] == 1]['cnt']
weekend_rentals = bikes_hour[bikes_hour['workingday'] == 0]['cnt']
print("DISTRIBUTION COMPARISON: Working Days vs Weekends")
print("=" * 60)
print(f"{'Metric':<20s} {'Working Days':>15s} {'Weekends':>15s}")
print("-" * 60)
print(f"{'Count':<20s} {len(working_rentals):>15,} {len(weekend_rentals):>15,}")
print(f"{'Mean':<20s} {working_rentals.mean():>15.1f} {weekend_rentals.mean():>15.1f}")
print(f"{'Median':<20s} {working_rentals.median():>15.1f} {weekend_rentals.median():>15.1f}")
print(f"{'Std Deviation':<20s} {working_rentals.std():>15.1f} {weekend_rentals.std():>15.1f}")
print(f"{'Min':<20s} {working_rentals.min():>15,} {weekend_rentals.min():>15,}")
print(f"{'Max':<20s} {working_rentals.max():>15,} {weekend_rentals.max():>15,}")
print(f"{'25th Percentile':<20s} {working_rentals.quantile(0.25):>15.1f} {weekend_rentals.quantile(0.25):>15.1f}")
print(f"{'75th Percentile':<20s} {working_rentals.quantile(0.75):>15.1f} {weekend_rentals.quantile(0.75):>15.1f}")
# Calculate difference
diff_mean = working_rentals.mean() - weekend_rentals.mean()
diff_pct = (diff_mean / weekend_rentals.mean()) * 100
print("\nDifference:")
print(f" Working days average {diff_pct:+.1f}% more rentals than weekends")
Output:
DISTRIBUTION COMPARISON: Working Days vs Weekends
============================================================
Metric                  Working Days        Weekends
------------------------------------------------------------
Count                         12,044           5,335
Mean                           195.2           173.3
Median                         157.0           118.0
Std Deviation                  186.7           167.0
Min                                1               1
Max                              977             948
25th Percentile                 47.0            32.0
75th Percentile                291.0           261.0

Difference:
 Working days average +12.7% more rentals than weekends
Normalized Density Comparison
Using Density for Fair Comparison
When groups have different sample sizes, use density histograms:
import pandas as pd
import matplotlib.pyplot as plt
bikes_hour = pd.read_csv('hour.csv')
working_rentals = bikes_hour[bikes_hour['workingday'] == 1]['cnt']
weekend_rentals = bikes_hour[bikes_hour['workingday'] == 0]['cnt']
plt.figure(figsize=(10, 6))
plt.hist(working_rentals, bins=40, density=True, alpha=0.6,
label=f'Working Days (n={len(working_rentals):,})',
color='blue', edgecolor='darkblue')
plt.hist(weekend_rentals, bins=40, density=True, alpha=0.6,
label=f'Weekends (n={len(weekend_rentals):,})',
color='red', edgecolor='darkred')
plt.xlabel('Hourly Bike Rentals')
plt.ylabel('Density')
plt.title('Normalized Rental Distribution Comparison')
plt.legend()
plt.grid(True, alpha=0.3, axis='y')
plt.show()
Why density?
- Working days: 12,044 hours
- Weekends: 5,335 hours
- Different counts make raw frequencies hard to compare
- Density rescales each histogram so that its bars have a total area of 1, which makes the shapes comparable
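To make the normalization concrete: with density=True, each bar height is the bin count divided by (number of values × bin width), so the bar areas sum to 1 regardless of sample size. The sketch below checks this with NumPy on synthetic data; the group sizes and the exponential shape are assumptions chosen only for illustration.

```python
import numpy as np

# Synthetic groups with the same shape but very different sizes.
rng = np.random.default_rng(42)
large_group = rng.exponential(scale=180.0, size=12000)
small_group = rng.exponential(scale=180.0, size=5000)

# Shared bin edges covering both groups.
edges = np.histogram_bin_edges(
    np.concatenate([large_group, small_group]), bins=40)
widths = np.diff(edges)

areas = {}
for name, values in [("large", large_group), ("small", small_group)]:
    counts, _ = np.histogram(values, bins=edges)
    density, _ = np.histogram(values, bins=edges, density=True)
    # Area of a histogram = sum of (bar height * bar width).
    areas[name] = np.sum(density * widths)
    print(f"{name}: tallest bar count={counts.max():,}  "
          f"tallest bar density={density.max():.5f}  "
          f"total area={areas[name]:.3f}")
```

The raw counts differ by roughly the ratio of the sample sizes, while the density heights land on a common scale.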
Seasonal Comparisons
Four-Way Distribution Comparison
import pandas as pd
import matplotlib.pyplot as plt
bikes = pd.read_csv('day.csv')
# Separate by season
seasons = {
1: ('Spring', 'lightgreen'),
2: ('Summer', 'gold'),
3: ('Fall', 'orange'),
4: ('Winter', 'lightblue')
}
plt.figure(figsize=(12, 7))
for season_num, (season_name, color) in seasons.items():
    season_data = bikes[bikes['season'] == season_num]['cnt']
    plt.hist(season_data, bins=25, alpha=0.5, label=season_name,
             color=color, edgecolor='black')
plt.xlabel('Daily Bike Rentals')
plt.ylabel('Frequency')
plt.title('Bike Rental Distribution by Season')
plt.legend()
plt.grid(True, alpha=0.3, axis='y')
plt.show()
# Statistics by season
print("\nSEASONAL DISTRIBUTION STATISTICS")
print("=" * 70)
print(f"{'Season':<12s} {'Days':>6s} {'Mean':>8s} {'Median':>8s} {'Std':>8s} {'Range'}")
print("-" * 70)
for season_num, (season_name, _) in seasons.items():
    data = bikes[bikes['season'] == season_num]['cnt']
    print(f"{season_name:<12s} {len(data):>6,} {data.mean():>8.0f} {data.median():>8.0f} "
          f"{data.std():>8.0f} {data.min():>5.0f}-{data.max():<5.0f}")
Output:
SEASONAL DISTRIBUTION STATISTICS
======================================================================
Season         Days     Mean   Median      Std Range
----------------------------------------------------------------------
Spring          181     2604     2470     1132   431-6946
Summer          184     4992     5200     1417  1476-8227
Fall            188     5644     5722     1494  1061-8714
Winter          178     4728     4850     1593   605-8555
Insights:
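- Fall has the highest average and peak rentals
- Spring has the lowest rentals. Its standard deviation (1132) is the smallest in absolute terms, but relative to its mean (1132 / 2604 ≈ 43%) it is the most variable season
- Summer and Winter have similar averages despite their temperature differences
- Only Spring has a mean above its median, which points to right skew. In Summer, Fall, and Winter the median slightly exceeds the mean, which suggests mild left skew instead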
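Per-group statistics like these can also be computed in one step with groupby. The following is a minimal sketch on a synthetic stand-in for day.csv (only the season and cnt columns are assumed, and the numbers are generated, not real). It adds the coefficient of variation (std / mean) as a measure of relative variability and the mean-minus-median gap as a rough skew indicator.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for day.csv: 90 generated days per season.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    'season': np.repeat([1, 2, 3, 4], 90),
    'cnt': np.concatenate([
        rng.normal(2600, 1100, 90),
        rng.normal(5000, 1400, 90),
        rng.normal(5600, 1500, 90),
        rng.normal(4700, 1600, 90),
    ]).clip(min=0).round(),
})

# One row per season with the basic statistics.
summary = demo.groupby('season')['cnt'].agg(['count', 'mean', 'median', 'std'])

# Relative variability: standard deviation as a fraction of the mean.
summary['cv'] = summary['std'] / summary['mean']

# Positive values hint at right skew, negative values at left skew.
summary['mean_minus_median'] = summary['mean'] - summary['median']

print(summary.round(2))
```

Comparing the cv column rather than std alone avoids calling a low-volume group "stable" just because its numbers are small.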
Side-by-Side Subplots
Clear Visual Separation
import pandas as pd
import matplotlib.pyplot as plt
bikes_hour = pd.read_csv('hour.csv')
working_rentals = bikes_hour[bikes_hour['workingday'] == 1]['cnt']
weekend_rentals = bikes_hour[bikes_hour['workingday'] == 0]['cnt']
# Create side-by-side subplots
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Working days
axes[0].hist(working_rentals, bins=40, color='blue', alpha=0.7, edgecolor='black')
axes[0].axvline(working_rentals.mean(), color='red', linestyle='--',
linewidth=2, label=f'Mean={working_rentals.mean():.0f}')
axes[0].set_xlabel('Hourly Rentals')
axes[0].set_ylabel('Frequency')
axes[0].set_title(f'Working Days (n={len(working_rentals):,})')
axes[0].legend()
axes[0].grid(True, alpha=0.3, axis='y')
# Weekends
axes[1].hist(weekend_rentals, bins=40, color='red', alpha=0.7, edgecolor='black')
axes[1].axvline(weekend_rentals.mean(), color='blue', linestyle='--',
linewidth=2, label=f'Mean={weekend_rentals.mean():.0f}')
axes[1].set_xlabel('Hourly Rentals')
axes[1].set_ylabel('Frequency')
axes[1].set_title(f'Weekends/Holidays (n={len(weekend_rentals):,})')
axes[1].legend()
axes[1].grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
Advantages:
- No overlap: each distribution is clearly visible
- Shapes are easier to compare, provided the panels use the same axis ranges and bins
- The same color can be reused in every panel without confusion
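By default, each subplot picks its own axis limits, which can make two panels look more alike (or more different) than they are. A hedged sketch of one way to keep the panels comparable: share the axes with sharex and sharey, reuse one set of bin edges, and plot densities. The data here is synthetic and the group names are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for two groups of different sizes.
rng = np.random.default_rng(7)
group_a = rng.gamma(2.0, 100.0, size=1200)
group_b = rng.gamma(2.0, 85.0, size=500)

# One set of bin edges for both panels.
edges = np.histogram_bin_edges(np.concatenate([group_a, group_b]), bins=40)

# sharex/sharey force identical axis limits on both panels.
fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharex=True, sharey=True)
for ax, values, title in zip(axes, [group_a, group_b], ['Group A', 'Group B']):
    ax.hist(values, bins=edges, density=True, alpha=0.7, edgecolor='black')
    ax.axvline(values.mean(), color='red', linestyle='--', linewidth=2,
               label=f'Mean={values.mean():.0f}')
    ax.set_xlabel('Value')
    ax.set_title(f'{title} (n={len(values):,})')
    ax.legend()
axes[0].set_ylabel('Density')

plt.tight_layout()
plt.show()
```

Density is used here because a shared y-axis with raw counts would flatten the smaller group.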
Comparing User Types
Casual vs Registered Distribution
import pandas as pd
import matplotlib.pyplot as plt
bikes_hour = pd.read_csv('hour.csv')
casual = bikes_hour['casual']
registered = bikes_hour['registered']
fig, axes = plt.subplots(2, 1, figsize=(12, 10))
# Casual users
axes[0].hist(casual, bins=50, color='coral', alpha=0.7, edgecolor='black')
axes[0].axvline(casual.mean(), color='darkred', linestyle='--',
linewidth=2, label=f'Mean={casual.mean():.0f}')
axes[0].axvline(casual.median(), color='darkgreen', linestyle='--',
linewidth=2, label=f'Median={casual.median():.0f}')
axes[0].set_xlabel('Hourly Rentals')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Casual Users Distribution')
axes[0].legend()
axes[0].grid(True, alpha=0.3, axis='y')
# Registered users
axes[1].hist(registered, bins=50, color='steelblue', alpha=0.7, edgecolor='black')
axes[1].axvline(registered.mean(), color='darkred', linestyle='--',
linewidth=2, label=f'Mean={registered.mean():.0f}')
axes[1].axvline(registered.median(), color='darkgreen', linestyle='--',
linewidth=2, label=f'Median={registered.median():.0f}')
axes[1].set_xlabel('Hourly Rentals')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Registered Users Distribution')
axes[1].legend()
axes[1].grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
# Statistics
print("\nUSER TYPE COMPARISON")
print("=" * 60)
print(f"{'Metric':<20s} {'Casual':>15s} {'Registered':>15s}")
print("-" * 60)
print(f"{'Mean':<20s} {casual.mean():>15.1f} {registered.mean():>15.1f}")
print(f"{'Median':<20s} {casual.median():>15.1f} {registered.median():>15.1f}")
print(f"{'Std Dev':<20s} {casual.std():>15.1f} {registered.std():>15.1f}")
print(f"{'Max':<20s} {casual.max():>15,} {registered.max():>15,}")
ratio = registered.mean() / casual.mean()
print(f"\nRegistered users average {ratio:.1f}x as many rentals as casual users")
Output:
USER TYPE COMPARISON
============================================================
Metric                        Casual      Registered
------------------------------------------------------------
Mean                            36.0           153.8
Median                          17.0           118.0
Std Dev                         50.0           152.0
Max                              367             886

Registered users average 4.3x as many rentals as casual users
Key findings:
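- Registered users dominate the system (their hourly average is about 4.3x the casual average)
- The casual distribution is more right-skewed: its mean (36.0) is about 2.1x its median (17.0), versus about 1.3x for registered users (153.8 vs 118.0)
- Registered usage is more consistent in relative terms: the standard deviation is about 1.0x the mean for registered users, versus about 1.4x for casual users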
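Skew can be quantified directly instead of being read off the plot. The sketch below uses pandas' Series.skew() together with the mean-to-median ratio on synthetic, right-skewed columns; the column names mirror the dataset, but the values are generated for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed columns standing in for the real counts.
rng = np.random.default_rng(3)
demo = pd.DataFrame({
    'casual': rng.exponential(scale=36.0, size=5000).round(),
    'registered': rng.gamma(shape=1.5, scale=100.0, size=5000).round(),
})

for column in ['casual', 'registered']:
    s = demo[column]
    # skew() > 0 means a longer right tail; a mean well above the
    # median points the same way.
    print(f"{column:<12s} skew={s.skew():.2f}  "
          f"mean/median={s.mean() / s.median():.2f}  "
          f"std/mean={s.std() / s.mean():.2f}")
```

On the real data, replacing demo with bikes_hour should give the same three indicators for the actual casual and registered columns.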
Year-over-Year Comparison
Growth Analysis
import pandas as pd
import matplotlib.pyplot as plt
bikes = pd.read_csv('day.csv')
year_2011 = bikes[bikes['yr'] == 0]['cnt']
year_2012 = bikes[bikes['yr'] == 1]['cnt']
plt.figure(figsize=(12, 6))
plt.hist(year_2011, bins=30, alpha=0.6, label='2011',
color='lightcoral', edgecolor='darkred')
plt.hist(year_2012, bins=30, alpha=0.6, label='2012',
color='lightblue', edgecolor='darkblue')
plt.xlabel('Daily Bike Rentals')
plt.ylabel('Frequency')
plt.title('Bike Sharing Growth: 2011 vs 2012')
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3, axis='y')
# Add statistics text box
stats_text = f"""2011: Mean = {year_2011.mean():.0f}
2012: Mean = {year_2012.mean():.0f}
Growth: +{((year_2012.mean() - year_2011.mean())/year_2011.mean()*100):.1f}%"""
plt.text(0.98, 0.97, stats_text, transform=plt.gca().transAxes,
fontsize=11, verticalalignment='top', horizontalalignment='right',
bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))
plt.show()
Analysis:
- The 2012 distribution is shifted right (higher rentals overall)
- The 2012 peak is higher (more consistent high-rental days)
- The growth suggests system maturation and wider adoption
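A shift "to the right" can be checked by comparing percentiles rather than means alone: if the whole distribution moved, the low, middle, and high percentiles should all increase. A minimal sketch on synthetic yearly data (the two arrays are generated stand-ins, not the real 2011 and 2012 counts):

```python
import numpy as np

# Synthetic stand-ins for two years of daily totals.
rng = np.random.default_rng(5)
year_a = rng.normal(3400, 1300, size=365).clip(min=0)
year_b = rng.normal(5600, 1700, size=366).clip(min=0)

percentiles = [10, 25, 50, 75, 90]
q_a = np.percentile(year_a, percentiles)
q_b = np.percentile(year_b, percentiles)

# Compare each percentile between the two years.
print(f"{'Percentile':<12s} {'Year A':>10s} {'Year B':>10s} {'Change':>10s}")
for p, a, b in zip(percentiles, q_a, q_b):
    print(f"{p:<12d} {a:>10.0f} {b:>10.0f} {(b - a) / a * 100:>+9.1f}%")
```

If only the upper percentiles had grown, the change would point to a few exceptional days rather than a general shift.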
Summary
You learned to compare multiple distributions effectively:
- Overlay histograms with transparency for direct comparison
- Density histograms normalize different sample sizes
- Side-by-side subplots avoid overlap confusion
- Statistical comparison quantifies differences (mean, median, std)
- Multiple groups can be compared (seasons, user types, years)
- Visual cues (colors, labels, lines) enhance clarity
- Choose technique based on number of groups and complexity
Next Steps: In the next lesson, you will learn pandas plotting methods for quick visualizations directly from DataFrames.
Practice: Compare temperature distributions across the four seasons. Which season has the most consistent temperatures (lowest variability)?