Lesson 9 - Understanding Distributions

From Individual Points to Overall Patterns

You have learned to plot relationships between variables. Now you will learn about distributions: how individual data values spread across a range. The shape of a distribution tells you which summary statistics describe the data well, so it is usually one of the first things to check in an analysis.

By the end of this lesson, you will be able to:

  • Understand what a distribution is
  • Identify distribution shapes (normal, skewed, uniform)
  • Recognize measures of central tendency (mean, median)
  • Understand spread (range, standard deviation, interquartile range)
  • Interpret distribution patterns in real data

Distributions answer the question: “How are my data values spread out?”


What is a Distribution?

The Concept

A distribution shows how frequently different values appear in your dataset.

Example: Hourly bike rentals

Instead of asking “How many rentals at 3pm on Tuesday?”, a distribution helps you answer questions such as:

  • What is the most common rental count?
  • What is the range from lowest to highest?
  • Are most values clustered together or spread out?
  • Is the pattern symmetric or lopsided?

Distribution vs Individual Values

Individual values:

Hour 0:  47 rentals
Hour 1:  33 rentals
Hour 2:  13 rentals
...
Hour 17: 461 rentals
Hour 18: 525 rentals

Distribution view:

Rentals 0-100:    ████████ (8 hours)
Rentals 100-200:  ████ (4 hours)
Rentals 200-300:  ██████ (6 hours)
Rentals 300-400:  ███ (3 hours)
Rentals 400-500:  ██ (2 hours)
Rentals 500-600:  █ (1 hour)

The distribution shows the overall pattern: the most common bin is the lowest one (0-100 rentals, 8 of the 24 hours), another 10 hours fall between 100 and 300 rentals, and only 3 hours exceed 400 rentals.
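
To make the step from individual values to a distribution view concrete, the sketch below counts how many hours fall into each 100-rental bin. The 24 hourly values are hypothetical: five of them (47, 33, 13, 461, 525) come from the listing above, and the rest are invented so that the counts match the distribution view.

```python
import numpy as np

# Hypothetical rentals for hours 0-23 (illustrative values, not real data)
hourly_rentals = np.array([
    47, 33, 13, 8, 5, 21, 76, 310, 480, 260, 150, 180,
    240, 230, 210, 250, 330, 461, 525, 350, 270, 190, 120, 90,
])

# Bin edges 0, 100, ..., 600 define six bins of width 100
bin_edges = np.arange(0, 700, 100)
counts, _ = np.histogram(hourly_rentals, bins=bin_edges)

# Print a text histogram, one bar character per hour
for low, high, n in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"Rentals {low}-{high}: {'█' * int(n)} ({n} hours)")
```

The individual values are discarded in this view; only the count per bin remains, which is what makes the overall pattern visible.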


Common Distribution Shapes

Normal (Bell Curve) Distribution

Shape: Symmetric bell curve, most values near center

       Frequency
          │        ****
          │      ********
          │    ************
          │  ****************
          │********************
          └────────────────────→ Values
              Low  Mid  High

Examples: Heights, test scores, measurement errors

Characteristics:

  • Mean = Median (center)
  • Symmetric around center
  • Rare extreme values on both sides
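
These characteristics can be checked on simulated data. The sketch below draws values from a normal distribution with NumPy; the mean of 170 and standard deviation of 10 are illustrative assumptions (for example, heights in cm), not values from the bike dataset.

```python
import numpy as np

# Draw 10,000 values from a normal distribution (seeded for repeatability)
rng = np.random.default_rng(seed=0)
heights = rng.normal(loc=170, scale=10, size=10_000)

mean_h = heights.mean()
median_h = np.median(heights)

# Fraction of values within one standard deviation of the mean
within_one_std = np.mean(np.abs(heights - mean_h) <= heights.std())

print(f"Mean:   {mean_h:.1f}")
print(f"Median: {median_h:.1f}")
print(f"Within 1 std of mean: {within_one_std:.0%}")
```

The mean and median should land very close together, and roughly 68% of the values should fall within one standard deviation of the mean, as expected for a normal distribution.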

Right-Skewed Distribution

Shape: Long tail on the right side

       Frequency
          │****
          │*******
          │**********
          │***********
          │*************----
          └────────────────────→ Values
          Many low     Few high

Examples: Income, house prices, website visits

Characteristics:

  • Mean > Median (pulled right by extreme values)
  • Most values on the left (low end)
  • Few very high values stretch the tail right
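
A small worked example shows how a single extreme value pulls the mean to the right while the median barely moves. The income figures are hypothetical and chosen only for illustration.

```python
from statistics import mean, median

# Hypothetical annual incomes in thousands; one very high earner forms the right tail
incomes = [30, 32, 35, 38, 40, 42, 45, 300]

print(f"Mean:   {mean(incomes):.2f}")    # pulled up by the single 300
print(f"Median: {median(incomes):.2f}")  # stays near the bulk of the data

# Dropping the extreme value moves the mean far more than the median
trimmed = incomes[:-1]
print(f"Mean without 300:   {mean(trimmed):.2f}")
print(f"Median without 300: {median(trimmed):.2f}")
```

Here the mean is 70.25 while the median is 39, and removing the one extreme value brings the mean back down to about 37.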

Left-Skewed Distribution

Shape: Long tail on the left side

       Frequency
          │              ****
          │         *********
          │      ************
          │  ****************
          │----*****************
          └────────────────────→ Values
          Few low    Many high

Examples: Age at retirement, test scores on easy exams

Characteristics:

  • Mean < Median (pulled left)
  • Most values on the right (high end)
  • Few very low values stretch tail left
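
The direction of skew can also be measured directly. The sketch below uses hypothetical scores from an easy exam and the pandas `Series.skew()` method, which returns a negative number when the longer tail is on the left.

```python
import pandas as pd

# Hypothetical scores on an easy exam: most students score high, a few score low
scores = pd.Series([40, 55, 78, 85, 88, 90, 92, 94, 95, 96, 97, 98])

print(f"Mean:     {scores.mean():.1f}")
print(f"Median:   {scores.median():.1f}")
# Sample skewness: negative for a left tail, positive for a right tail
print(f"Skewness: {scores.skew():.2f}")
```

The mean (84.0) sits below the median (91.0), and the skewness is negative, so both indicators agree on a left skew.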

Uniform Distribution

Shape: All values equally likely

       Frequency
          │********************
          │********************
          │********************
          │********************
          │********************
          └────────────────────→ Values

Examples: Random number generation, fair dice rolls

Characteristics:

  • No clear peak
  • Equal frequency across range
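
As a rough illustration of a uniform distribution, the sketch below simulates fair die rolls and counts how often each face appears. The number of rolls and the seed are arbitrary choices.

```python
import numpy as np

# Simulate 60,000 rolls of a fair six-sided die (seeded for repeatability)
rng = np.random.default_rng(seed=0)
rolls = rng.integers(1, 7, size=60_000)  # upper bound 7 is exclusive

# Count how often each face 1-6 appears; index 0 of bincount is unused
face_counts = np.bincount(rolls, minlength=7)[1:]

for face, n in enumerate(face_counts, start=1):
    print(f"Face {face}: {n} rolls ({n / rolls.size:.1%})")
```

Each face should appear in close to one sixth of the rolls. The counts are not exactly equal because of random variation, but no face forms a clear peak.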

Analyzing Bike Rental Distribution

We will use the hourly bike rental data (hour.csv) to examine distribution patterns.

Load Hourly Data

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Load hourly bike data
bikes_hour = pd.read_csv('hour.csv')

# Look at rental counts
print("Bike Rental Statistics:")
print(bikes_hour['cnt'].describe())

Output:

Bike Rental Statistics:
count    17379.000000
mean       189.463088
std        181.387599
min          1.000000
25%         42.000000
50%        145.000000
75%        284.000000
max        977.000000
Name: cnt, dtype: float64

Key insights:

  • Mean = 189: Average hourly rentals
  • Median = 145: Middle value (50th percentile)
  • Mean > Median: Suggests right skew (some very high rental hours)
  • Range: 1 to 977 rentals
  • Spread: Standard deviation of 181 shows high variability

Visualizing with Histograms

What is a Histogram?

A histogram divides data into bins and counts how many values fall in each bin.

import pandas as pd
import matplotlib.pyplot as plt

bikes_hour = pd.read_csv('hour.csv')

# Create histogram
plt.figure(figsize=(10, 6))
plt.hist(bikes_hour['cnt'], bins=30, edgecolor='black', alpha=0.7)
plt.xlabel('Hourly Bike Rentals')
plt.ylabel('Frequency (Number of Hours)')
plt.title('Distribution of Hourly Bike Rentals')
plt.grid(True, alpha=0.3)
plt.show()

What the histogram shows:

  • Peak at the low end: The tallest bars are in the lowest bins, consistent with a quarter of all hours having 42 or fewer rentals
  • Right skew: Long tail extending toward high values
  • Few extreme values: Very few hours with 700+ rentals
  • Shape: Not normal; asymmetric with a long right tail
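
To see what `bins=30` means in practice, the sketch below calls `np.histogram`, the NumPy function that `plt.hist` uses to bin the data. As a tiny stand-in dataset it uses the five-number summary of `cnt` reported by `describe()` earlier, so the bin edges match those of the full data, which has the same minimum and maximum.

```python
import numpy as np

# Min, Q1, median, Q3, max of 'cnt' from describe(), used as a tiny stand-in dataset
values = np.array([1, 42, 145, 284, 977])

# With an integer bins argument, the range min..max is split into equal-width bins
counts, edges = np.histogram(values, bins=30)
bin_width = edges[1] - edges[0]

print(f"Number of bins:  {len(counts)}")
print(f"Number of edges: {len(edges)}")
print(f"Bin width:       {bin_width:.1f} rentals")
print(f"First bin:       {edges[0]:.1f} to {edges[1]:.1f}")
```

Thirty bins over the range 1 to 977 gives a bin width of about 32.5 rentals. Fewer bins produce a smoother but coarser picture; more bins show more detail but also more noise.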

Interpreting Distribution Features

Central Tendency

Where is the “center” of the data?

import pandas as pd
import numpy as np

bikes_hour = pd.read_csv('hour.csv')

mean_rentals = bikes_hour['cnt'].mean()
median_rentals = bikes_hour['cnt'].median()
mode_rentals = bikes_hour['cnt'].mode()[0]

print(f"Mean:   {mean_rentals:.1f} rentals")
print(f"Median: {median_rentals:.1f} rentals")
print(f"Mode:   {mode_rentals} rentals (most frequent)")

Output:

Mean:   189.5 rentals
Median: 145.0 rentals
Mode:   1 rentals (most frequent)

Interpretation:

  • Mean (189) > Median (145): Confirms right skew
  • High-rental hours pull the mean up
  • Median better represents “typical” hour
  • Mode (1) represents overnight hours with minimal activity
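
One detail about `mode()[0]` is worth knowing: pandas returns every value that ties for most frequent, sorted in ascending order, so indexing with `[0]` selects the smallest of them. The sketch below demonstrates this with a small hypothetical series.

```python
import pandas as pd

# Hypothetical rental counts in which 5 and 12 both appear three times
counts = pd.Series([12, 5, 12, 7, 5, 9, 12, 5])

modes = counts.mode()    # all tied values, sorted ascending
print(modes.tolist())    # both modes
print(modes[0])          # the smallest of the tied modes
```

When a dataset has several tied modes, reporting only `mode()[0]` hides the others, so it can be worth printing the full result first.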

Spread (Variability)

How spread out are the values?

import pandas as pd
import numpy as np

bikes_hour = pd.read_csv('hour.csv')

min_val = bikes_hour['cnt'].min()
max_val = bikes_hour['cnt'].max()
range_val = max_val - min_val
std_val = bikes_hour['cnt'].std()

print(f"Minimum:         {min_val}")
print(f"Maximum:         {max_val}")
print(f"Range:           {range_val}")
print(f"Std Deviation:   {std_val:.1f}")
print(f"Coefficient of Variation: {(std_val/bikes_hour['cnt'].mean())*100:.1f}%")

Output:

Minimum:         1
Maximum:         977
Range:           976
Std Deviation:   181.4
Coefficient of Variation: 95.7%

Interpretation:

  • Large range (976): Huge difference between quietest and busiest hours
  • High std dev (181): Values vary widely from mean
  • CV = 96%: Very high variability relative to mean
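
The standard deviation reported by pandas is the sample standard deviation (it divides by n - 1), while NumPy's `np.std` divides by n by default. The sketch below compares the two on a small illustrative dataset and computes the coefficient of variation as standard deviation divided by mean.

```python
import numpy as np
import pandas as pd

values = [2, 4, 4, 4, 5, 5, 7, 9]  # small illustrative dataset, mean = 5
series = pd.Series(values)

sample_std = series.std()        # pandas default: ddof=1 (divide by n - 1)
population_std = np.std(values)  # NumPy default: ddof=0 (divide by n)
cv_percent = sample_std / series.mean() * 100

print(f"Sample std (pandas):    {sample_std:.3f}")
print(f"Population std (NumPy): {population_std:.3f}")
print(f"Coefficient of variation: {cv_percent:.1f}%")
```

For a dataset as large as the hourly bike data the two versions are nearly identical, but on small samples the difference is noticeable.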

Patterns by Hour of Day

Distribution Changes Throughout Day

import pandas as pd
import matplotlib.pyplot as plt

bikes_hour = pd.read_csv('hour.csv')

# Calculate mean rentals by hour
hourly_avg = bikes_hour.groupby('hr')['cnt'].mean()

# Plot
plt.figure(figsize=(12, 6))
plt.bar(hourly_avg.index, hourly_avg.values, edgecolor='black', alpha=0.7)
plt.xlabel('Hour of Day')
plt.ylabel('Average Bike Rentals')
plt.title('Average Bike Rentals by Hour')
plt.grid(True, alpha=0.3, axis='y')
plt.xticks(range(0, 24))
plt.show()

Pattern observed:

  • Overnight (0-5am): Very low rentals (< 50)
  • Morning peak (7-8am): Commute to work (~350)
  • Midday (10am-3pm): Moderate steady usage (~200)
  • Evening peak (5-6pm): Highest usage (~450)
  • Late evening (8pm-midnight): Declining activity

This is a bimodal pattern across the day: two peaks, one for the morning commute and one for the evening commute.
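
The two peaks can also be located programmatically. The sketch below uses hypothetical hourly averages, invented to follow the pattern described above, and marks an hour as a peak when its average is higher than both neighbouring hours.

```python
# Hypothetical average rentals for hours 0-23 (illustrative, not computed from hour.csv)
hourly_avg = [50, 33, 23, 12, 8, 20, 76, 212, 350, 219, 190, 200,
              205, 210, 215, 250, 312, 461, 425, 312, 226, 172, 131, 88]

# An hour is a local peak if its average exceeds both neighbouring hours
peaks = [
    hour
    for hour in range(1, len(hourly_avg) - 1)
    if hourly_avg[hour - 1] < hourly_avg[hour] > hourly_avg[hour + 1]
]

print(f"Peak hours: {peaks}")
```

With these values the peaks fall at hours 8 and 17. Real data is noisier, so small bumps (for example around lunchtime) may also register as local peaks unless the series is smoothed first.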


Comparing Distributions

Weekday vs Weekend

import pandas as pd
import matplotlib.pyplot as plt

bikes_hour = pd.read_csv('hour.csv')

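# Split on 'workingday': 1 = working day, 0 = weekend or holiday.
# The "weekend" group below therefore also includes holidays.
weekday_data = bikes_hour.loc[bikes_hour['workingday'] == 1, 'cnt']
weekend_data = bikes_hour.loc[bikes_hour['workingday'] == 0, 'cnt']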

# Print statistics
print("Weekday Rentals:")
print(f"  Mean:   {weekday_data.mean():.1f}")
print(f"  Median: {weekday_data.median():.1f}")
print(f"  Std:    {weekday_data.std():.1f}")

print("\nWeekend Rentals:")
print(f"  Mean:   {weekend_data.mean():.1f}")
print(f"  Median: {weekend_data.median():.1f}")
print(f"  Std:    {weekend_data.std():.1f}")

Output:

Weekday Rentals:
  Mean:   195.2
  Median: 157.0
  Std:    186.7

Weekend Rentals:
  Mean:   173.3
  Median: 118.0
  Std:    167.0

Key differences:

  • Weekdays have higher average: Commuters drive up usage
  • Weekdays more variable: Commute peaks create wider spread
  • Both right-skewed: Mean > Median in both cases
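
The same comparison can be written more compactly with `groupby` and `agg`, which produce one row per group and one column per statistic. The sketch below runs on a tiny hypothetical DataFrame that reuses the `workingday` and `cnt` column names from hour.csv.

```python
import pandas as pd

# Tiny hypothetical dataset with the same column names as hour.csv
df = pd.DataFrame({
    'workingday': [1, 1, 1, 1, 0, 0, 0, 0],
    'cnt':        [10, 200, 400, 30, 50, 120, 180, 90],
})

# One row per group, one column per statistic
summary = df.groupby('workingday')['cnt'].agg(['mean', 'median', 'std'])
print(summary.round(1))
```

This form scales to any number of groups, for example grouping by a season or weather column instead of `workingday`.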

Practical Analysis

Complete Distribution Report

import pandas as pd
import numpy as np

bikes_hour = pd.read_csv('hour.csv')

print("HOURLY BIKE RENTAL DISTRIBUTION ANALYSIS")
print("=" * 70)

# Overall statistics
print("\n1. Overall Distribution:")
print("-" * 70)
print(f"  Total hours observed:     {len(bikes_hour):,}")
print(f"  Mean rentals:             {bikes_hour['cnt'].mean():.1f}")
print(f"  Median rentals:           {bikes_hour['cnt'].median():.1f}")
print(f"  Std deviation:            {bikes_hour['cnt'].std():.1f}")
print(f"  Range:                    {bikes_hour['cnt'].min()} to {bikes_hour['cnt'].max()}")

# Shape analysis: compare mean and median as a rough skew indicator
mean_val = bikes_hour['cnt'].mean()
median_val = bikes_hour['cnt'].median()
if mean_val > median_val:
    skew_indicator = "Right-skewed"
elif mean_val < median_val:
    skew_indicator = "Left-skewed"
else:
    skew_indicator = "Symmetric"
print(f"  Distribution shape:       {skew_indicator}")

# Quartiles
q25 = bikes_hour['cnt'].quantile(0.25)
q75 = bikes_hour['cnt'].quantile(0.75)
iqr = q75 - q25

print(f"\n2. Quartile Analysis:")
print("-" * 70)
print(f"  25th percentile (Q1):     {q25:.1f}")
print(f"  50th percentile (Median): {median_val:.1f}")
print(f"  75th percentile (Q3):     {q75:.1f}")
print(f"  Interquartile range:      {iqr:.1f}")
print(f"  50% of hours have:        {q25:.0f} to {q75:.0f} rentals")

# Peak hours (compute the hourly means once and reuse them)
print(f"\n3. Distribution Peaks:")
print("-" * 70)
hourly_mean = bikes_hour.groupby('hr')['cnt'].mean()
peak_hour = hourly_mean.idxmax()
peak_avg = hourly_mean.max()
low_hour = hourly_mean.idxmin()
low_avg = hourly_mean.min()

print(f"  Peak hour:                {peak_hour}:00 (avg {peak_avg:.0f} rentals)")
print(f"  Lowest hour:              {low_hour}:00 (avg {low_avg:.0f} rentals)")
print(f"  Peak to low ratio:        {peak_avg/low_avg:.1f}x difference")

Output:

HOURLY BIKE RENTAL DISTRIBUTION ANALYSIS
======================================================================

1. Overall Distribution:
----------------------------------------------------------------------
  Total hours observed:     17,379
  Mean rentals:             189.5
  Median rentals:           145.0
  Std deviation:            181.4
  Range:                    1 to 977
  Distribution shape:       Right-skewed

2. Quartile Analysis:
----------------------------------------------------------------------
  25th percentile (Q1):     42.0
  50th percentile (Median): 145.0
  75th percentile (Q3):     284.0
  Interquartile range:      242.0
  50% of hours have:        42 to 284 rentals

3. Distribution Peaks:
----------------------------------------------------------------------
  Peak hour:                17:00 (avg 461 rentals)
  Lowest hour:              4:00 (avg 8 rentals)
  Peak to low ratio:        58.2x difference
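
The statement that 50% of hours lie between Q1 and Q3 follows from how quartiles are defined, and it holds for any distribution shape. The sketch below checks this on synthetic right-skewed data; the exponential distribution and its scale of 190 are assumptions chosen only to resemble skewed rental counts.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data standing in for rental counts
rng = np.random.default_rng(seed=0)
sample = pd.Series(rng.exponential(scale=190, size=10_000))

q25 = sample.quantile(0.25)
q75 = sample.quantile(0.75)
iqr = q75 - q25

# Share of values that lie between Q1 and Q3 (inclusive)
share_inside = sample.between(q25, q75).mean()

print(f"Q1:  {q25:.1f}")
print(f"Q3:  {q75:.1f}")
print(f"IQR: {iqr:.1f}")
print(f"Share of values between Q1 and Q3: {share_inside:.1%}")
```

Because the IQR ignores the lowest and highest quarters of the data, it is much less sensitive to extreme values than the range or the standard deviation.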

Summary

You learned the fundamentals of distributions:

  • A distribution shows how data values spread across a range
  • Common shapes: Normal (symmetric), right-skewed, left-skewed, uniform
  • Central tendency: Mean, median, mode locate the “center”
  • Spread: Range, standard deviation, IQR measure variability
  • Skewness: Mean > Median suggests right skew; Mean < Median suggests left skew
  • Histograms visualize distributions with binned frequencies
  • Context matters: Distributions can vary by time, conditions, groups

Next Steps: In the next lesson, you will learn to create bar plots for comparing categories and groups.

Practice: Analyze the distribution of temperature (temp) or humidity (hum) in the bike dataset. What shape do these distributions have?