Lesson 12 - Comparing Distributions

Beyond Single Distributions

You can create histograms to visualize one distribution. Now you will learn to compare multiple distributions to identify differences between groups, conditions, or time periods.

By the end of this lesson, you will be able to:

  • Overlay multiple histograms for comparison
  • Create side-by-side distribution plots
  • Use transparency effectively
  • Calculate and compare distribution statistics
  • Identify meaningful differences between groups
  • Choose appropriate comparison techniques

Comparing distributions answers: “How do these groups differ in their data patterns?”


Why Compare Distributions?

Real-World Questions

Distribution comparisons help answer:

  • Do bike rentals differ between weekdays and weekends?
  • How does temperature distribution vary across seasons?
  • Are casual and registered users different in their usage patterns?
  • Has the rental distribution changed from 2011 to 2012?

Key insight: Comparing distributions reveals differences that summary statistics (mean, median) alone might miss.
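
As a small illustration of that insight, the sketch below builds two made-up samples (they are not taken from the bike data) that share the same mean and median but have very different shapes. Only the spread, or a histogram, tells them apart.

```python
import statistics

# Two made-up samples (assumption: illustrative values, not bike rental data).
tight = [98, 99, 100, 101, 102] * 20    # values clustered near 100
split = [60] * 50 + [140] * 50          # two clusters far from 100

for name, sample in [("tight", tight), ("split", split)]:
    print(f"{name:<6s} mean={statistics.mean(sample):6.1f} "
          f"median={statistics.median(sample):6.1f} "
          f"std={statistics.pstdev(sample):6.1f}")
```

Both rows report the same center, so a table of means and medians alone would suggest the two groups are alike.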


Overlapping Histograms

Basic Overlay

import pandas as pd
import matplotlib.pyplot as plt

bikes_hour = pd.read_csv('hour.csv')

# Separate data by working day status (workingday == 0 covers weekends and holidays)
working_rentals = bikes_hour[bikes_hour['workingday'] == 1]['cnt']
weekend_rentals = bikes_hour[bikes_hour['workingday'] == 0]['cnt']

# Create overlapping histograms
plt.figure(figsize=(10, 6))
plt.hist(working_rentals, bins=40, alpha=0.6, label='Working Days',
         color='blue', edgecolor='darkblue')
plt.hist(weekend_rentals, bins=40, alpha=0.6, label='Weekends/Holidays',
         color='red', edgecolor='darkred')

plt.xlabel('Hourly Bike Rentals')
plt.ylabel('Frequency')
plt.title('Rental Distribution: Working Days vs Weekends')
plt.legend()
plt.grid(True, alpha=0.3, axis='y')
plt.show()

Key technique: Set alpha=0.6 (partial transparency) so that bars from both groups remain visible where they overlap.

What it shows:

  • Both distributions are right-skewed
  • Working days have more hours with moderate rentals (100-250)
  • Weekends show a flatter distribution
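
One caveat about the overlay: when bins is an integer, each plt.hist call derives its bin edges from the range of its own data, so the two sets of bars may not cover exactly the same intervals. A minimal sketch of one way to align them is to compute shared edges with np.histogram_bin_edges and pass them to both calls. The data here is synthetic stand-in data (an assumption), not the real hour.csv values.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in data (assumption): skewed synthetic counts, not the real hour.csv values.
rng = np.random.default_rng(0)
working_rentals = rng.gamma(shape=1.2, scale=160, size=12000).round() + 1
weekend_rentals = rng.gamma(shape=1.1, scale=155, size=5300).round() + 1

# One set of 40 bins spanning both groups, so every bar covers the same interval.
combined = np.concatenate([working_rentals, weekend_rentals])
shared_edges = np.histogram_bin_edges(combined, bins=40)

plt.figure(figsize=(10, 6))
plt.hist(working_rentals, bins=shared_edges, alpha=0.6,
         label='Working Days', color='blue')
plt.hist(weekend_rentals, bins=shared_edges, alpha=0.6,
         label='Weekends/Holidays', color='red')
plt.xlabel('Hourly Bike Rentals')
plt.ylabel('Frequency')
plt.legend()
plt.show()
```

With the real data, the same two lines that build combined and shared_edges can be applied to the working_rentals and weekend_rentals Series from the example above.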

Statistical Comparison

Calculate Summary Statistics

import pandas as pd
import numpy as np

bikes_hour = pd.read_csv('hour.csv')

working_rentals = bikes_hour[bikes_hour['workingday'] == 1]['cnt']
weekend_rentals = bikes_hour[bikes_hour['workingday'] == 0]['cnt']

print("DISTRIBUTION COMPARISON: Working Days vs Weekends")
print("=" * 60)
print(f"{'Metric':<20s} {'Working Days':>15s} {'Weekends':>15s}")
print("-" * 60)
print(f"{'Count':<20s} {len(working_rentals):>15,} {len(weekend_rentals):>15,}")
print(f"{'Mean':<20s} {working_rentals.mean():>15.1f} {weekend_rentals.mean():>15.1f}")
print(f"{'Median':<20s} {working_rentals.median():>15.1f} {weekend_rentals.median():>15.1f}")
print(f"{'Std Deviation':<20s} {working_rentals.std():>15.1f} {weekend_rentals.std():>15.1f}")
print(f"{'Min':<20s} {working_rentals.min():>15,} {weekend_rentals.min():>15,}")
print(f"{'Max':<20s} {working_rentals.max():>15,} {weekend_rentals.max():>15,}")
print(f"{'25th Percentile':<20s} {working_rentals.quantile(0.25):>15.1f} {weekend_rentals.quantile(0.25):>15.1f}")
print(f"{'75th Percentile':<20s} {working_rentals.quantile(0.75):>15.1f} {weekend_rentals.quantile(0.75):>15.1f}")

# Calculate difference
diff_mean = working_rentals.mean() - weekend_rentals.mean()
diff_pct = (diff_mean / weekend_rentals.mean()) * 100

print("\nDifference:")
print(f"  Working days average {diff_pct:+.1f}% more rentals than weekends")

Output:

DISTRIBUTION COMPARISON: Working Days vs Weekends
============================================================
Metric               Working Days        Weekends
------------------------------------------------------------
Count                      12,044           5,335
Mean                        195.2           173.3
Median                      157.0           118.0
Std Deviation               186.7           167.0
Min                             1               1
Max                           977             948
25th Percentile              47.0            32.0
75th Percentile             291.0           261.0

Difference:
  Working days average +12.7% more rentals than weekends
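
The hand-formatted table above can also be produced more compactly with groupby and describe. The sketch below uses a small synthetic DataFrame as a stand-in for hour.csv (an assumption, so the printed numbers will not match the output above); only the column names workingday and cnt are taken from the lesson.

```python
import numpy as np
import pandas as pd

# Stand-in for hour.csv (assumption): synthetic rows with the same two column names.
rng = np.random.default_rng(1)
bikes_hour = pd.DataFrame({
    'workingday': rng.integers(0, 2, size=2000),
    'cnt': rng.gamma(shape=1.2, scale=150, size=2000).round().astype(int) + 1,
})

# One row per group: count, mean, std, min, quartiles, max.
summary = bikes_hour.groupby('workingday')['cnt'].describe()
print(summary.round(1))

# Percentage difference in means, relative to the non-working-day group.
means = summary['mean']
diff_pct = (means.loc[1] - means.loc[0]) / means.loc[0] * 100
print(f"Working days vs non-working days: {diff_pct:+.1f}% difference in mean")
```

The manual version gives full control over layout, while describe is convenient for a first look at many groups.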

Normalized Density Comparison

Using Density for Fair Comparison

When groups have different sample sizes, use density histograms:

import pandas as pd
import matplotlib.pyplot as plt

bikes_hour = pd.read_csv('hour.csv')

working_rentals = bikes_hour[bikes_hour['workingday'] == 1]['cnt']
weekend_rentals = bikes_hour[bikes_hour['workingday'] == 0]['cnt']

plt.figure(figsize=(10, 6))
plt.hist(working_rentals, bins=40, density=True, alpha=0.6,
         label=f'Working Days (n={len(working_rentals):,})',
         color='blue', edgecolor='darkblue')
plt.hist(weekend_rentals, bins=40, density=True, alpha=0.6,
         label=f'Weekends (n={len(weekend_rentals):,})',
         color='red', edgecolor='darkred')

plt.xlabel('Hourly Bike Rentals')
plt.ylabel('Density')
plt.title('Normalized Rental Distribution Comparison')
plt.legend()
plt.grid(True, alpha=0.3, axis='y')
plt.show()

Why density?

  • Working days: 12,044 hours
  • Weekends: 5,335 hours
  • Different counts make raw frequencies hard to compare
  • density=True rescales each histogram so its total bar area equals 1, which makes the shapes comparable
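
To see what the normalization does numerically, the sketch below draws two synthetic groups of different sizes from the same distribution (stand-in data, an assumption) and compares raw counts with densities using np.histogram. The raw bar heights differ with sample size, while the density bars always enclose a total area of 1.

```python
import numpy as np

# Stand-in data (assumption): same distribution, different sample sizes.
rng = np.random.default_rng(2)
large_group = rng.gamma(shape=1.2, scale=160, size=12000)
small_group = rng.gamma(shape=1.2, scale=160, size=5000)

# Shared bin edges so both groups are measured on the same intervals.
edges = np.histogram_bin_edges(np.concatenate([large_group, small_group]), bins=40)
widths = np.diff(edges)

areas = {}
for name, data in [("large", large_group), ("small", small_group)]:
    counts, _ = np.histogram(data, bins=edges)
    density, _ = np.histogram(data, bins=edges, density=True)
    areas[name] = float(np.sum(density * widths))   # bar height x bar width, summed
    print(f"{name}: tallest bar count={int(counts.max()):>5d}, "
          f"tallest bar density={density.max():.5f}, area={areas[name]:.3f}")
```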

Seasonal Comparisons

Four-Way Distribution Comparison

import pandas as pd
import matplotlib.pyplot as plt

bikes = pd.read_csv('day.csv')

# Map season codes to display names and plot colors
seasons = {
    1: ('Spring', 'lightgreen'),
    2: ('Summer', 'gold'),
    3: ('Fall', 'orange'),
    4: ('Winter', 'lightblue')
}

plt.figure(figsize=(12, 7))

for season_num, (season_name, color) in seasons.items():
    season_data = bikes[bikes['season'] == season_num]['cnt']
    plt.hist(season_data, bins=25, alpha=0.5, label=season_name,
             color=color, edgecolor='black')

plt.xlabel('Daily Bike Rentals')
plt.ylabel('Frequency')
plt.title('Bike Rental Distribution by Season')
plt.legend()
plt.grid(True, alpha=0.3, axis='y')
plt.show()

# Statistics by season
print("\nSEASONAL DISTRIBUTION STATISTICS")
print("=" * 70)
print(f"{'Season':<12s} {'Days':>6s} {'Mean':>8s} {'Median':>8s} {'Std':>8s} {'Range'}")
print("-" * 70)

for season_num, (season_name, _) in seasons.items():
    data = bikes[bikes['season'] == season_num]['cnt']
    print(f"{season_name:<12s} {len(data):>6,} {data.mean():>8.0f} {data.median():>8.0f} "
          f"{data.std():>8.0f} {data.min():>5.0f}-{data.max():<5.0f}")

Output:

SEASONAL DISTRIBUTION STATISTICS
======================================================================
Season       Days     Mean   Median      Std Range
----------------------------------------------------------------------
Spring        181     2604     2470     1132   431-6946
Summer        184     4992     5200     1417  1476-8227
Fall          188     5644     5722     1494  1061-8714
Winter        178     4728     4850     1593   605-8555

Insights:

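  • Fall has the highest average (5644) and the highest peak (8714)
  • Spring has the lowest rentals; its standard deviation (1132) is the smallest in absolute terms but the largest relative to its mean
  • Summer and Winter have similar averages (4992 vs 4728) despite temperature differences
  • Only Spring has a mean above its median, which suggests right skew; in the other seasons the mean sits slightly below the median, which suggests mild left skew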
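
When group means differ as much as they do across seasons, comparing raw standard deviations can be misleading. One common alternative is the coefficient of variation (std divided by mean). The sketch below computes it from the values in the seasonal statistics table above and also checks whether each mean lies above or below its median as a rough skew indicator.

```python
# (mean, median, std) copied from the seasonal statistics table above.
season_stats = {
    'Spring': (2604, 2470, 1132),
    'Summer': (4992, 5200, 1417),
    'Fall':   (5644, 5722, 1494),
    'Winter': (4728, 4850, 1593),
}

cv = {}
for season, (mean, median, std) in season_stats.items():
    cv[season] = std / mean    # coefficient of variation: spread relative to the mean
    side = 'above' if mean > median else 'below'
    print(f"{season:<8s} CV={cv[season]:.2f}  mean is {side} the median")
```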

Side-by-Side Subplots

Clear Visual Separation

import pandas as pd
import matplotlib.pyplot as plt

bikes_hour = pd.read_csv('hour.csv')

working_rentals = bikes_hour[bikes_hour['workingday'] == 1]['cnt']
weekend_rentals = bikes_hour[bikes_hour['workingday'] == 0]['cnt']

# Create side-by-side subplots
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Working days
axes[0].hist(working_rentals, bins=40, color='blue', alpha=0.7, edgecolor='black')
axes[0].axvline(working_rentals.mean(), color='red', linestyle='--',
                linewidth=2, label=f'Mean={working_rentals.mean():.0f}')
axes[0].set_xlabel('Hourly Rentals')
axes[0].set_ylabel('Frequency')
axes[0].set_title(f'Working Days (n={len(working_rentals):,})')
axes[0].legend()
axes[0].grid(True, alpha=0.3, axis='y')

# Weekends
axes[1].hist(weekend_rentals, bins=40, color='red', alpha=0.7, edgecolor='black')
axes[1].axvline(weekend_rentals.mean(), color='blue', linestyle='--',
                linewidth=2, label=f'Mean={weekend_rentals.mean():.0f}')
axes[1].set_xlabel('Hourly Rentals')
axes[1].set_ylabel('Frequency')
axes[1].set_title(f'Weekends/Holidays (n={len(weekend_rentals):,})')
axes[1].legend()
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

Advantages:

  • No overlap, so each distribution is clearly visible
  • Easier to compare shapes
  • Can use same colors without confusion
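
Side-by-side panels are only easy to compare when their axes match. By default plt.subplots gives each panel its own axis limits, so bars of the same height can represent different values. A minimal sketch, using synthetic stand-in data (an assumption) and the sharex and sharey options of plt.subplots, with density=True so the unequal group sizes do not dominate:

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in data (assumption): synthetic counts, not the real hour.csv values.
rng = np.random.default_rng(3)
working_rentals = rng.gamma(shape=1.2, scale=160, size=12000)
weekend_rentals = rng.gamma(shape=1.1, scale=155, size=5300)

# sharex/sharey force both panels onto identical axis limits.
fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharex=True, sharey=True)

axes[0].hist(working_rentals, bins=40, density=True, color='blue', alpha=0.7)
axes[0].set_title('Working Days')
axes[0].set_ylabel('Density')

axes[1].hist(weekend_rentals, bins=40, density=True, color='red', alpha=0.7)
axes[1].set_title('Weekends/Holidays')

for ax in axes:
    ax.set_xlabel('Hourly Rentals')

plt.tight_layout()
plt.show()
```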

Comparing User Types

Casual vs Registered Distribution

import pandas as pd
import matplotlib.pyplot as plt

bikes_hour = pd.read_csv('hour.csv')

casual = bikes_hour['casual']
registered = bikes_hour['registered']

fig, axes = plt.subplots(2, 1, figsize=(12, 10))

# Casual users
axes[0].hist(casual, bins=50, color='coral', alpha=0.7, edgecolor='black')
axes[0].axvline(casual.mean(), color='darkred', linestyle='--',
                linewidth=2, label=f'Mean={casual.mean():.0f}')
axes[0].axvline(casual.median(), color='darkgreen', linestyle='--',
                linewidth=2, label=f'Median={casual.median():.0f}')
axes[0].set_xlabel('Hourly Rentals')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Casual Users Distribution')
axes[0].legend()
axes[0].grid(True, alpha=0.3, axis='y')

# Registered users
axes[1].hist(registered, bins=50, color='steelblue', alpha=0.7, edgecolor='black')
axes[1].axvline(registered.mean(), color='darkred', linestyle='--',
                linewidth=2, label=f'Mean={registered.mean():.0f}')
axes[1].axvline(registered.median(), color='darkgreen', linestyle='--',
                linewidth=2, label=f'Median={registered.median():.0f}')
axes[1].set_xlabel('Hourly Rentals')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Registered Users Distribution')
axes[1].legend()
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

# Statistics
print("\nUSER TYPE COMPARISON")
print("=" * 60)
print(f"{'Metric':<20s} {'Casual':>15s} {'Registered':>15s}")
print("-" * 60)
print(f"{'Mean':<20s} {casual.mean():>15.1f} {registered.mean():>15.1f}")
print(f"{'Median':<20s} {casual.median():>15.1f} {registered.median():>15.1f}")
print(f"{'Std Dev':<20s} {casual.std():>15.1f} {registered.std():>15.1f}")
print(f"{'Max':<20s} {casual.max():>15,} {registered.max():>15,}")

ratio = registered.mean() / casual.mean()
print(f"\nRegistered users average {ratio:.1f}x more rentals than casual users")

Output:

USER TYPE COMPARISON
============================================================
Metric                     Casual      Registered
------------------------------------------------------------
Mean                         36.0           153.8
Median                       17.0           118.0
Std Dev                      50.0           152.0
Max                         367             886

Registered users average 4.3x more rentals than casual users

Key findings:

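  • Registered users dominate the system (their hourly average is 4.3x the casual average)
  • The casual distribution is more strongly right-skewed: its median (17) is less than half its mean (36)
  • Registered usage is more consistent relative to its mean (std 152.0 on a mean of 153.8, versus 50.0 on 36.0 for casual)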
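
The skew comparison can be put on a rough numeric footing with Pearson's second skewness coefficient, 3 * (mean - median) / std. This is one simple summary-based measure, not the only definition of skewness. The sketch below applies it to the values in the user type comparison table above.

```python
# (mean, median, std) copied from the user type comparison table above.
user_stats = {
    'Casual':     (36.0, 17.0, 50.0),
    'Registered': (153.8, 118.0, 152.0),
}

skew = {}
for user_type, (mean, median, std) in user_stats.items():
    # Pearson's second skewness coefficient: positive values indicate right skew.
    skew[user_type] = 3 * (mean - median) / std
    cv = std / mean    # spread relative to the mean
    print(f"{user_type:<12s} skewness={skew[user_type]:.2f}  CV={cv:.2f}")
```

Both coefficients are positive, and the casual one is larger, which matches what the two histograms show.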

Year-over-Year Comparison

Growth Analysis

import pandas as pd
import matplotlib.pyplot as plt

bikes = pd.read_csv('day.csv')

year_2011 = bikes[bikes['yr'] == 0]['cnt']
year_2012 = bikes[bikes['yr'] == 1]['cnt']

plt.figure(figsize=(12, 6))
plt.hist(year_2011, bins=30, alpha=0.6, label='2011',
         color='lightcoral', edgecolor='darkred')
plt.hist(year_2012, bins=30, alpha=0.6, label='2012',
         color='lightblue', edgecolor='darkblue')

plt.xlabel('Daily Bike Rentals')
plt.ylabel('Frequency')
plt.title('Bike Sharing Growth: 2011 vs 2012')
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3, axis='y')

# Add statistics text box
stats_text = f"""2011: Mean = {year_2011.mean():.0f}
2012: Mean = {year_2012.mean():.0f}
Growth: {((year_2012.mean() - year_2011.mean())/year_2011.mean()*100):+.1f}%"""

plt.text(0.98, 0.97, stats_text, transform=plt.gca().transAxes,
         fontsize=11, verticalalignment='top', horizontalalignment='right',
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

plt.show()

Analysis:

  • The 2012 distribution is shifted to the right (higher rentals overall)
  • The 2012 peak sits at a higher rental count, so high-rental days are more common
  • The growth is consistent with wider adoption as the system matured
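
A shift in the mean does not show whether the whole distribution moved or only its upper tail. Comparing quartiles per year is one way to check. The sketch below uses synthetic stand-in data (an assumption, with hypothetical means and spreads chosen only for illustration) and the yr and cnt column names from day.csv.

```python
import numpy as np
import pandas as pd

# Stand-in for day.csv (assumption): hypothetical daily totals for two years.
rng = np.random.default_rng(4)
bikes = pd.DataFrame({
    'yr': np.repeat([0, 1], [365, 366]),
    'cnt': np.concatenate([
        rng.normal(3400, 1300, size=365),
        rng.normal(5600, 1700, size=366),
    ]).clip(min=0).round(),
})

# Quartiles per year: rows are years, columns are the 25th, 50th, 75th percentiles.
quartiles = bikes.groupby('yr')['cnt'].quantile([0.25, 0.5, 0.75]).unstack()
quartiles.index = ['2011', '2012']

shift = quartiles.loc['2012'] - quartiles.loc['2011']
print(quartiles.round(0))
print("\nShift (2012 - 2011):")
print(shift.round(0))
```

If all three quartiles move by a similar amount, the distribution shifted as a whole; if only the 75th percentile moves, the growth is concentrated in high-rental days.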

Summary

You learned to compare multiple distributions effectively:

  • Overlay histograms with transparency for direct comparison
  • Density histograms normalize different sample sizes
  • Side-by-side subplots avoid overlap confusion
  • Statistical comparison quantifies differences (mean, median, std)
  • Multiple groups can be compared (seasons, user types, years)
  • Visual cues (colors, labels, lines) enhance clarity
  • Choose technique based on number of groups and complexity

Next Steps: In the next lesson, you will learn pandas plotting methods for quick visualizations directly from DataFrames.

Practice: Compare temperature distributions across the four seasons. Which season has the most consistent temperatures (lowest variability)?