Lesson 9 - Understanding Distributions
From Individual Points to Overall Patterns
You have learned to plot relationships between variables. Now you will learn about distributions—how individual data values spread across a range. Understanding distributions is essential for data analysis.
By the end of this lesson, you will be able to:
- Understand what a distribution is
- Identify distribution shapes (normal, skewed, uniform)
- Recognize measures of central tendency (mean, median)
- Understand spread (range, standard deviation, IQR)
- Interpret distribution patterns in real data
Distributions answer the question: “How are my data values spread out?”
What is a Distribution?
The Concept
A distribution shows how frequently different values appear in your dataset.
Example: Hourly bike rentals
Instead of asking “How many rentals at 3pm on Tuesday?”, distributions ask:
- What is the most common rental count?
- What is the range from lowest to highest?
- Are most values clustered together or spread out?
- Is the pattern symmetric or lopsided?
Distribution vs Individual Values
Individual values:
Hour 0: 47 rentals
Hour 1: 33 rentals
Hour 2: 13 rentals
...
Hour 17: 461 rentals
Hour 18: 525 rentals
Distribution view:
Rentals 0-100: ████████ (8 hours)
Rentals 100-200: ████ (4 hours)
Rentals 200-300: ██████ (6 hours)
Rentals 300-400: ███ (3 hours)
Rentals 400-500: ██ (2 hours)
Rentals 500-600: █ (1 hour)
The distribution shows the overall pattern: the 0-100 bin is the largest single group (8 hours), 18 of the 24 hours fall below 300 rentals, and only a few hours reach very high counts.
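The step from individual values to a distribution view can be sketched in a few lines. The example below is only an illustration: it uses the five hourly values quoted above rather than the full dataset, and `bin_label` is a hypothetical helper introduced here, not part of any library.

```python
from collections import Counter

# The five hourly values quoted above (a subset, for illustration only)
rentals = [47, 33, 13, 461, 525]

def bin_label(value, width=100):
    """Return a label such as '0-100' for the bin that contains value."""
    low = (value // width) * width
    return f"{low}-{low + width}"

# Count how many values land in each 100-rental bin
bin_counts = Counter(bin_label(v) for v in rentals)

# Print the bins in ascending order with a simple text bar
for label, count in sorted(bin_counts.items(), key=lambda item: int(item[0].split("-")[0])):
    print(f"Rentals {label}: {'█' * count} ({count} hour(s))")
```

Bins with no values are simply absent from the output, which is why a full histogram usually fixes the bin edges first and then counts.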
Common Distribution Shapes
Normal (Bell Curve) Distribution
Shape: Symmetric bell curve, most values near center
Frequency
│        ****
│      ********
│    ************
│  ****************
│********************
└────────────────────→ Values
 Low       Mid     High
Examples: Heights, test scores, measurement errors
Characteristics:
- Mean = Median (center)
- Symmetric around center
- Rare extreme values on both sides
Right-Skewed Distribution
Shape: Long tail on the right side
Frequency
│****
│*******
│**********
│***********
│*************----
└────────────────────→ Values
 Many low       Few high
Examples: Income, house prices, website visits
Characteristics:
- Mean > Median (pulled right by extreme values)
- Most values on the left (low end)
- Few very high values stretch the tail right
Left-Skewed Distribution
Shape: Long tail on the left side
Frequency
│                 ****
│            *********
│         ************
│     ****************
│----*****************
└────────────────────→ Values
 Few low      Many high
Examples: Age at retirement, test scores on easy exams
Characteristics:
- Mean < Median (pulled left)
- Most values on the right (high end)
- Few very low values stretch tail left
Uniform Distribution
Shape: All values equally likely
Frequency
│********************
│********************
│********************
│********************
│********************
└────────────────────→ Values
Examples: Random number generation, fair dice rolls
Characteristics:
- No clear peak
- Equal frequency across range
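The mean-versus-median characteristics listed for each shape can be checked on synthetic data. The sketch below uses Python's standard `random` and `statistics` modules. The distribution parameters and the `describe_shape` helper are arbitrary choices made for illustration, and the left-skewed sample is built by mirroring an exponential sample, which is just one convenient construction.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable
n = 10_000

# Synthetic samples; all parameters are arbitrary illustrative choices
samples = {
    "normal": [random.gauss(100, 15) for _ in range(n)],
    "right-skewed": [random.expovariate(1 / 50) for _ in range(n)],
    "left-skewed": [100 - random.expovariate(1 / 10) for _ in range(n)],
    "uniform": [random.uniform(0, 100) for _ in range(n)],
}

def describe_shape(values):
    """Return (mean, median) for a list of numbers."""
    return statistics.mean(values), statistics.median(values)

summary = {name: describe_shape(values) for name, values in samples.items()}

for name, (mean, median) in summary.items():
    print(f"{name:>12}: mean={mean:7.2f}  median={median:7.2f}  "
          f"mean-median={mean - median:+.2f}")
```

With samples this large, the normal and uniform rows should show a difference close to zero, while the skewed rows should show the mean pulled toward the tail.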
Analyzing Bike Rental Distribution
We will use the hourly bike rental data (hour.csv) to examine distribution patterns.
Load Hourly Data
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Load hourly bike data
bikes_hour = pd.read_csv('hour.csv')
# Look at rental counts
print("Bike Rental Statistics:")
print(bikes_hour['cnt'].describe())
Bike Rental Statistics:
count 17379.000000
mean 189.463088
std 181.387599
min 1.000000
25% 42.000000
50% 145.000000
75% 284.000000
max 977.000000
Name: cnt, dtype: float64
Key insights:
- Mean = 189: Average hourly rentals
- Median = 145: Middle value (50th percentile)
- Mean > Median: Suggests right skew (some very high rental hours)
- Range: 1 to 977 rentals
- Spread: Standard deviation of 181 shows high variability
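The "Mean > Median" observation can be turned into a single number. One common rule of thumb is Pearson's second skewness coefficient, 3 × (mean − median) / standard deviation. The sketch below plugs in the values from the `describe()` output above. The formula is a rough indicator, not the moment-based skewness that statistical libraries usually report.

```python
# Values copied from the describe() output above
mean = 189.463088
median = 145.0
std = 181.387599

# Pearson's second skewness coefficient: 3 * (mean - median) / std
pearson_skew = 3 * (mean - median) / std

print(f"Pearson median skewness: {pearson_skew:.2f}")  # about 0.74

# Positive values point to a right tail, negative values to a left tail
if pearson_skew > 0:
    print("Positive: consistent with right skew")
elif pearson_skew < 0:
    print("Negative: consistent with left skew")
else:
    print("Zero: mean and median coincide")
```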
Visualizing with Histograms
What is a Histogram?
A histogram divides data into bins and counts how many values fall in each bin.
import pandas as pd
import matplotlib.pyplot as plt
bikes_hour = pd.read_csv('hour.csv')
# Create histogram
plt.figure(figsize=(10, 6))
plt.hist(bikes_hour['cnt'], bins=30, edgecolor='black', alpha=0.7)
plt.xlabel('Hourly Bike Rentals')
plt.ylabel('Frequency (Number of Hours)')
plt.title('Distribution of Hourly Bike Rentals')
plt.grid(True, alpha=0.3)
plt.show()
What the histogram shows:
- Peak at the low end: The tallest bars sit in the lowest bins, consistent with a 25th percentile of only 42 rentals
- Right skew: Long tail extending toward high values
- Few extreme values: Very few hours with 700+ rentals
- Shape: Not normal—asymmetric with right tail
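To make the binning step concrete, here is a minimal sketch of equal-width binning in plain Python. It approximates what passing an integer `bins` value asks for (equal-width intervals between the minimum and maximum), but it is not matplotlib's implementation. The `equal_width_histogram` helper and the sample values are made up for illustration.

```python
def equal_width_histogram(values, bins):
    """Count values in `bins` equal-width intervals between min and max.

    Assumes the values are not all identical (the bin width would be zero).
    """
    low, high = min(values), max(values)
    width = (high - low) / bins
    edges = [low + i * width for i in range(bins + 1)]
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin, so clamp the index
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts, edges

data = [1, 2, 3, 10, 11, 30]  # made-up values
counts, edges = equal_width_histogram(data, bins=3)

# Print each bin as a range with its count
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.2f} to {right:6.2f}: {count}")
```

Changing `bins` changes the picture: fewer bins smooth the shape, more bins reveal detail but also noise, so it is worth trying a few values before drawing conclusions.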
Interpreting Distribution Features
Central Tendency
Where is the “center” of the data?
import pandas as pd
import numpy as np
bikes_hour = pd.read_csv('hour.csv')
mean_rentals = bikes_hour['cnt'].mean()
median_rentals = bikes_hour['cnt'].median()
mode_rentals = bikes_hour['cnt'].mode()[0]
print(f"Mean: {mean_rentals:.1f} rentals")
print(f"Median: {median_rentals:.1f} rentals")
print(f"Mode: {mode_rentals} rentals (most frequent)")
Mean: 189.5 rentals
Median: 145.0 rentals
Mode: 1 rentals (most frequent)
Interpretation:
- Mean (189) > Median (145): Confirms right skew
- High-rental hours pull the mean up
- Median better represents “typical” hour
- Mode (1) represents overnight hours with minimal activity
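Why the median is the better "typical" value under skew is easy to see on a tiny example. The sketch below uses made-up rental counts (not values from hour.csv) and adds a single extreme hour to show how far each measure moves.

```python
import statistics

typical_hours = [40, 45, 50, 55, 60]     # made-up rental counts
with_extreme = typical_hours + [900]     # the same hours plus one extreme hour

results = {}
for label, values in [("without extreme", typical_hours),
                      ("with extreme", with_extreme)]:
    results[label] = (statistics.mean(values), statistics.median(values))
    mean, median = results[label]
    print(f"{label:>16}: mean={mean:.1f}, median={median:.1f}")
```

One extreme value moves the mean from 50.0 to about 191.7, while the median only shifts from 50.0 to 52.5. The same mechanism keeps the median of the rental data well below its mean.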
Spread (Variability)
How spread out are the values?
import pandas as pd
import numpy as np
bikes_hour = pd.read_csv('hour.csv')
min_val = bikes_hour['cnt'].min()
max_val = bikes_hour['cnt'].max()
range_val = max_val - min_val
std_val = bikes_hour['cnt'].std()
print(f"Minimum: {min_val}")
print(f"Maximum: {max_val}")
print(f"Range: {range_val}")
print(f"Std Deviation: {std_val:.1f}")
print(f"Coefficient of Variation: {(std_val/bikes_hour['cnt'].mean())*100:.1f}%")
Minimum: 1
Maximum: 977
Range: 976
Std Deviation: 181.4
Coefficient of Variation: 95.8%
Interpretation:
- Large range (976): Huge difference between quietest and busiest hours
- High std dev (181): Values vary widely from mean
- CV = 96%: Very high variability relative to mean
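The spread measures above can be worked through by hand on a small made-up sample. The sketch below uses Python's `statistics` module. One relevant detail is that pandas' `.std()` divides by n − 1 by default (the sample standard deviation), which corresponds to `statistics.stdev` rather than `statistics.pstdev`. The quartiles use the `"inclusive"` method, which interpolates linearly between data points. I believe this matches the default behaviour of pandas' `quantile`, but treat that as an assumption.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data, not from hour.csv

mean = statistics.mean(values)               # 5
value_range = max(values) - min(values)      # 9 - 2 = 7
sample_var = statistics.variance(values)     # sum of squared deviations / (n - 1)
sample_std = statistics.stdev(values)        # square root of the sample variance
population_std = statistics.pstdev(values)   # divides by n instead of n - 1
cv_percent = sample_std / mean * 100         # spread relative to the mean

# Quartiles with linear interpolation between data points
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

print(f"Range: {value_range}")
print(f"Sample variance: {sample_var:.3f}")
print(f"Sample std: {sample_std:.3f} (population std: {population_std:.3f})")
print(f"Coefficient of variation: {cv_percent:.1f}%")
print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
```

The IQR covers only the middle half of the data, so it is much less sensitive to a few extreme hours than the range or the standard deviation.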
Patterns by Hour of Day
Distribution Changes Throughout Day
import pandas as pd
import matplotlib.pyplot as plt
bikes_hour = pd.read_csv('hour.csv')
# Calculate mean rentals by hour
hourly_avg = bikes_hour.groupby('hr')['cnt'].mean()
# Plot
plt.figure(figsize=(12, 6))
plt.bar(hourly_avg.index, hourly_avg.values, edgecolor='black', alpha=0.7)
plt.xlabel('Hour of Day')
plt.ylabel('Average Bike Rentals')
plt.title('Average Bike Rentals by Hour')
plt.grid(True, alpha=0.3, axis='y')
plt.xticks(range(0, 24))
plt.show()
Pattern observed:
- Overnight (0-5am): Very low rentals (< 50)
- Morning peak (7-8am): Commute to work (~350)
- Midday (10am-3pm): Moderate steady usage (~200)
- Evening peak (5-6pm): Highest usage (~450)
- Late evening (8pm-midnight): Declining activity
This is a bimodal pattern across the day: two peaks, one for the morning commute and one for the evening commute.
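Counting peaks can also be done in code rather than by eye. The sketch below is a simple local-maximum check, not a robust peak detector. The hourly averages are hypothetical numbers shaped like the pattern described above (they are not taken from hour.csv), and `find_peaks` is a helper defined here for illustration.

```python
# Hypothetical hourly averages (index = hour of day); NOT values from hour.csv
hourly_avg = [50, 30, 20, 10, 8, 20, 80, 210, 350, 220, 175, 190,
              200, 200, 200, 210, 310, 460, 430, 310, 230, 170, 130, 90]

def find_peaks(values):
    """Return indices whose value is strictly greater than both neighbours."""
    return [i for i in range(1, len(values) - 1)
            if values[i] > values[i - 1] and values[i] > values[i + 1]]

peaks = find_peaks(hourly_avg)

print(f"Peak hours: {peaks}")
print("Bimodal pattern" if len(peaks) == 2 else f"{len(peaks)} peak(s) found")
```

On real data, small bumps would also count as local maxima, so a practical version would probably smooth the series first or require a minimum height difference.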
Comparing Distributions
Weekday vs Weekend
import pandas as pd
import matplotlib.pyplot as plt
bikes_hour = pd.read_csv('hour.csv')
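# Separate working days from non-working days
# Note: workingday == 0 covers weekends and holidays, so the "weekend" group also includes holidays
weekday_data = bikes_hour[bikes_hour['workingday'] == 1]['cnt']
weekend_data = bikes_hour[bikes_hour['workingday'] == 0]['cnt']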
# Print statistics
print("Weekday Rentals:")
print(f" Mean: {weekday_data.mean():.1f}")
print(f" Median: {weekday_data.median():.1f}")
print(f" Std: {weekday_data.std():.1f}")
print("\nWeekend Rentals:")
print(f" Mean: {weekend_data.mean():.1f}")
print(f" Median: {weekend_data.median():.1f}")
print(f" Std: {weekend_data.std():.1f}")
Weekday Rentals:
Mean: 195.2
Median: 157.0
Std: 186.7
Weekend Rentals:
Mean: 173.3
Median: 118.0
Std: 167.0
Key differences:
- Weekdays have higher average: Commuters drive up usage
- Weekdays more variable: Commute peaks create wider spread
- Both right-skewed: Mean > Median in both cases
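When comparing more than two groups, filtering each group by hand gets repetitive. A possible alternative is a single `groupby` with several aggregations. The sketch below runs on a tiny made-up table that borrows the `workingday` and `cnt` column names. The numbers are invented and will not match the real dataset.

```python
import pandas as pd

# Tiny made-up table with the same column names as hour.csv
df = pd.DataFrame({
    "workingday": [1, 1, 1, 1, 0, 0, 0, 0],
    "cnt": [100, 200, 300, 400, 50, 100, 150, 500],
})

# One row per group, one column per statistic
summary = df.groupby("workingday")["cnt"].agg(["mean", "median", "std"])

# A positive gap between mean and median hints at right skew within the group
summary["mean_minus_median"] = summary["mean"] - summary["median"]

print(summary.round(1))
```

The same call should work on the full data by replacing `df` with `bikes_hour`, and the grouping column can be swapped for any other categorical column.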
Practical Analysis
Complete Distribution Report
import pandas as pd
import numpy as np
bikes_hour = pd.read_csv('hour.csv')
print("HOURLY BIKE RENTAL DISTRIBUTION ANALYSIS")
print("=" * 70)
# Overall statistics
print("\n1. Overall Distribution:")
print("-" * 70)
print(f" Total hours observed: {len(bikes_hour):,}")
print(f" Mean rentals: {bikes_hour['cnt'].mean():.1f}")
print(f" Median rentals: {bikes_hour['cnt'].median():.1f}")
print(f" Std deviation: {bikes_hour['cnt'].std():.1f}")
print(f" Range: {bikes_hour['cnt'].min()} to {bikes_hour['cnt'].max()}")
# Shape analysis
mean_val = bikes_hour['cnt'].mean()
median_val = bikes_hour['cnt'].median()
skew_indicator = "Right-skewed" if mean_val > median_val else "Left-skewed" if mean_val < median_val else "Symmetric"
print(f" Distribution shape: {skew_indicator}")
# Quartiles
q25 = bikes_hour['cnt'].quantile(0.25)
q75 = bikes_hour['cnt'].quantile(0.75)
iqr = q75 - q25
print(f"\n2. Quartile Analysis:")
print("-" * 70)
print(f" 25th percentile (Q1): {q25:.1f}")
print(f" 50th percentile (Median): {median_val:.1f}")
print(f" 75th percentile (Q3): {q75:.1f}")
print(f" Interquartile range: {iqr:.1f}")
print(f" 50% of hours have: {q25:.0f} to {q75:.0f} rentals")
# Peak hours
print(f"\n3. Distribution Peaks:")
print("-" * 70)
hourly_mean = bikes_hour.groupby('hr')['cnt'].mean()  # compute once, reuse below
peak_hour = hourly_mean.idxmax()
peak_avg = hourly_mean.max()
low_hour = hourly_mean.idxmin()
low_avg = hourly_mean.min()
print(f" Peak hour: {peak_hour}:00 (avg {peak_avg:.0f} rentals)")
print(f" Lowest hour: {low_hour}:00 (avg {low_avg:.0f} rentals)")
print(f" Peak to low ratio: {peak_avg/low_avg:.1f}x difference")
HOURLY BIKE RENTAL DISTRIBUTION ANALYSIS
======================================================================
1. Overall Distribution:
----------------------------------------------------------------------
Total hours observed: 17,379
Mean rentals: 189.5
Median rentals: 145.0
Std deviation: 181.4
Range: 1 to 977
Distribution shape: Right-skewed
2. Quartile Analysis:
----------------------------------------------------------------------
25th percentile (Q1): 42.0
50th percentile (Median): 145.0
75th percentile (Q3): 284.0
Interquartile range: 242.0
50% of hours have: 42 to 284 rentals
3. Distribution Peaks:
----------------------------------------------------------------------
Peak hour: 17:00 (avg 461 rentals)
Lowest hour: 4:00 (avg 8 rentals)
Peak to low ratio: 58.2x difference
Summary
You learned the fundamentals of distributions:
- Distribution shows how data values spread across a range
- Common shapes: Normal (symmetric), right-skewed, left-skewed, uniform
- Central tendency: Mean, median, mode locate the “center”
- Spread: Range, standard deviation, IQR measure variability
- Skewness: Mean > Median indicates right skew
- Histograms visualize distributions with binned frequencies
- Context matters: Distributions can vary by time, conditions, groups
Next Steps: In the next lesson, you will learn to create bar plots for comparing categories and groups.
Practice: Analyze the distribution of temperature (temp) or humidity (hum) in the bike dataset. What shape do these distributions have?