Lesson 11 - Creating Histograms

From Categories to Continuous Distributions

You learned bar plots for comparing categories. Now you will learn histograms—specialized charts that show how continuous data is distributed across a range of values.

By the end of this lesson, you will be able to:

  • Create histograms with plt.hist()
  • Understand and control bins
  • Choose appropriate bin counts
  • Interpret histogram shapes
  • Customize histogram appearance
  • Analyze frequency distributions

Histograms answer: “How many data points fall within each range of values?”


Histogram vs Bar Plot

Key Differences

Bar Plot:

  • Compares distinct categories (days, products, countries)
  • Each bar is a separate category
  • Gaps between bars show categories are discrete
  • Order usually doesn’t matter (can be alphabetical, by value, etc.)

Histogram:

  • Shows distribution of continuous data (temperature, price, age)
  • Each bar represents a range (bin)
  • No gaps—data is continuous
  • Order matters (low to high values)

Visual Comparison

Bar Plot (Categories)          Histogram (Continuous)
      ┌───┐                         ┌───┬───┬───┐
      │   │     ┌───┐               │   │   │   │
  ┌───┤   ├───┬─┤   │           ┌───┤   │   │   ├───┐
  │ A │ B │ C │D│ E │           │ 0 │10 │20 │30 │40 │
  └───┴───┴───┴─┴───┘           └───┴───┴───┴───┴───┘
   Discrete categories           Continuous ranges

Creating Basic Histograms

Bike Rental Distribution

import pandas as pd
import matplotlib.pyplot as plt

# Load hourly data
bikes_hour = pd.read_csv('hour.csv')

# Create histogram
plt.figure(figsize=(10, 6))
plt.hist(bikes_hour['cnt'], bins=30, edgecolor='black', alpha=0.7)
plt.xlabel('Hourly Bike Rentals')
plt.ylabel('Frequency (Number of Hours)')
plt.title('Distribution of Hourly Bike Rentals')
plt.grid(True, alpha=0.3, axis='y')
plt.show()

What the histogram shows:

  • X-axis: Rental counts (the data values)
  • Y-axis: Frequency (how many hours fall within each range of counts)
  • Bars: Height indicates how common each range is
  • Shape: Right-skewed (most hours have low-moderate rentals)
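
The numbers behind the bars can also be read directly, because the histogram call returns them. The sketch below is only an illustration: it uses synthetic right-skewed data (demo_counts, a hypothetical stand-in for bikes_hour['cnt']) and a non-interactive backend so it runs without hour.csv or a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Assumption: synthetic right-skewed values standing in for bikes_hour['cnt']
rng = np.random.default_rng(42)
demo_counts = rng.exponential(scale=180, size=1000)

fig, ax = plt.subplots(figsize=(10, 6))
# hist() returns the bar heights, the bin edges, and the drawn bar objects
counts, edges, patches = ax.hist(demo_counts, bins=30, edgecolor='black', alpha=0.7)

print(f"Number of bars:  {len(counts)}")
print(f"Number of edges: {len(edges)}")  # always one more edge than bars

# Locate the tallest bar and report the range of values it covers
tallest = int(counts.argmax())
print(f"Most common range: {edges[tallest]:.0f} to {edges[tallest + 1]:.0f}")
plt.close(fig)
```

The heights add up to the number of data points, which is a quick sanity check that no value was dropped.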

Understanding Bins

What are Bins?

Bins divide the data range into intervals. Each bar represents one bin.

Example: For data from 0 to 100 with 10 bins:

Bin 1:  0-10    (13 values) ███
Bin 2: 10-20    (25 values) ██████
Bin 3: 20-30    (42 values) ██████████
Bin 4: 30-40    (38 values) █████████
Bin 5: 40-50    (19 values) ████
...

Each bin counts how many values fall in that range.
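
To make the counting concrete, here is a minimal sketch using np.histogram on a small made-up list of values (the numbers are invented for illustration and are not from the bike data). It prints each bin's range, its count, and a text bar.

```python
import numpy as np

# Made-up values, chosen only to illustrate how binning works
values = np.array([3, 7, 12, 15, 18, 22, 25, 25, 31, 38, 44, 50])

# Five equal-width bins covering 0 to 50
counts, edges = np.histogram(values, bins=5, range=(0, 50))

for left, right, count in zip(edges[:-1], edges[1:], counts):
    bar = '█' * int(count)
    print(f"{left:4.0f} to {right:4.0f}: {int(count):2d} {bar}")
```

Each bin includes its left edge and excludes its right edge, except the last bin, which also includes its right edge. That is why the value 50 is counted in the 40 to 50 bin.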

Controlling Bin Count

import pandas as pd
import matplotlib.pyplot as plt

bikes_hour = pd.read_csv('hour.csv')

# Create figure with different bin counts
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
bin_counts = [10, 20, 50, 100]

for ax, bins in zip(axes.flat, bin_counts):
    ax.hist(bikes_hour['cnt'], bins=bins, edgecolor='black', alpha=0.7)
    ax.set_xlabel('Hourly Bike Rentals')
    ax.set_ylabel('Frequency')
    ax.set_title(f'Histogram with {bins} Bins')
    ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

Observations:

  • Too few bins (10): Loses detail, overly smooth
  • Reasonable (20 or 50): Shows a clear distribution pattern
  • Too many bins (100): Noisy, hard to see the overall shape

Choosing the Right Bin Count

Guidelines

Sturges’ Rule (automatic):

import numpy as np

n = len(bikes_hour['cnt'])
bins = int(np.ceil(np.log2(n)) + 1)
print(f"Sturges' rule suggests {bins} bins for {n} data points")
Sturges' rule suggests 16 bins for 17379 data points

Square Root Rule:

bins = int(np.ceil(np.sqrt(n)))
print(f"Square root rule suggests {bins} bins")
Square root rule suggests 132 bins

Practical approach:

  • Start with bins='auto' (matplotlib's own default is a fixed 10 bins)
  • Experiment with 20-50 bins
  • Choose what shows pattern clearest
  • Avoid extremes (too few or too many)
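
NumPy ships several of these rules as named estimators, so the hand-written formulas can be cross-checked. The sketch below assumes synthetic data with the same number of points as the hourly dataset (demo is a hypothetical stand-in, not the real rental counts) and compares the manual formulas with np.histogram_bin_edges.

```python
import numpy as np

# Assumption: synthetic right-skewed data with the same size as the hourly dataset
rng = np.random.default_rng(0)
demo = rng.exponential(scale=180.0, size=17379)
n = demo.size

# Hand-written rules, as in the lesson
sturges_manual = int(np.ceil(np.log2(n)) + 1)
sqrt_manual = int(np.ceil(np.sqrt(n)))

# The same rules (plus two others) as named NumPy estimators
bin_counts = {
    rule: len(np.histogram_bin_edges(demo, bins=rule)) - 1
    for rule in ('sturges', 'sqrt', 'fd', 'auto')
}

print(f"Manual Sturges: {sturges_manual}, manual square root: {sqrt_manual}")
for rule, count in bin_counts.items():
    print(f"bins='{rule}': {count} bins")
```

Sturges and the square root rule depend only on the number of points, while 'fd' (Freedman-Diaconis) and 'auto' also look at the spread of the values, so their results change from dataset to dataset.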

Using Auto Bins

import pandas as pd
import matplotlib.pyplot as plt

bikes_hour = pd.read_csv('hour.csv')

# Let matplotlib choose bins automatically
plt.figure(figsize=(10, 6))
plt.hist(bikes_hour['cnt'], bins='auto', edgecolor='black', alpha=0.7, color='steelblue')
plt.xlabel('Hourly Bike Rentals')
plt.ylabel('Frequency')
plt.title('Histogram with Automatic Binning')
plt.grid(True, alpha=0.3, axis='y')
plt.show()

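bins='auto' defers the choice to NumPy, which picks a bin width from the data by combining the Sturges and Freedman-Diaconis estimators, so the bin count adapts to both the sample size and the spread of the values.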


Customizing Histogram Appearance

Colors and Transparency

import pandas as pd
import matplotlib.pyplot as plt

bikes_hour = pd.read_csv('hour.csv')

plt.figure(figsize=(10, 6))
plt.hist(bikes_hour['cnt'], bins=40, color='coral', alpha=0.6,
         edgecolor='darkred', linewidth=1.2)
plt.xlabel('Hourly Bike Rentals')
plt.ylabel('Frequency')
plt.title('Customized Histogram')
plt.grid(True, alpha=0.3, axis='y')
plt.show()

Parameters:

  • color='coral': Fill color
  • alpha=0.6: Transparency
  • edgecolor='darkred': Border color
  • linewidth=1.2: Border thickness

Adding Statistical Lines

import pandas as pd
import matplotlib.pyplot as plt

bikes_hour = pd.read_csv('hour.csv')

# Calculate statistics
mean_val = bikes_hour['cnt'].mean()
median_val = bikes_hour['cnt'].median()

# Create histogram
plt.figure(figsize=(10, 6))
plt.hist(bikes_hour['cnt'], bins=40, color='skyblue', alpha=0.7, edgecolor='black')

# Add mean and median lines
plt.axvline(mean_val, color='red', linestyle='--', linewidth=2, label=f'Mean = {mean_val:.0f}')
plt.axvline(median_val, color='green', linestyle='--', linewidth=2, label=f'Median = {median_val:.0f}')

plt.xlabel('Hourly Bike Rentals')
plt.ylabel('Frequency')
plt.title('Bike Rental Distribution with Central Tendency')
plt.legend()
plt.grid(True, alpha=0.3, axis='y')
plt.show()

Insight: The mean (about 189) exceeds the median (about 142), which is consistent with a right-skewed distribution: a long right tail pulls the mean upward.
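
Comparing mean and median is a quick heuristic. A skewness coefficient gives a numeric measure of the same idea. The sketch below computes it by hand on synthetic right-skewed data (demo is a hypothetical stand-in for the rental counts); positive values indicate a longer right tail.

```python
import numpy as np

# Assumption: synthetic right-skewed data standing in for bikes_hour['cnt']
rng = np.random.default_rng(7)
demo = rng.exponential(scale=180.0, size=10_000)

mean_val = demo.mean()
median_val = np.median(demo)

# Skewness: the average cubed z-score (positive = long right tail)
z_scores = (demo - mean_val) / demo.std()
skewness = np.mean(z_scores ** 3)

print(f"Mean:     {mean_val:.1f}")
print(f"Median:   {median_val:.1f}")
print(f"Skewness: {skewness:.2f}")
```

On a pandas Series, the skew() method reports a closely related (bias-adjusted) value, so the hand calculation is mainly useful for seeing what the number means.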


Analyzing Distribution Shapes

Identifying Patterns

import pandas as pd
import matplotlib.pyplot as plt

bikes_hour = pd.read_csv('hour.csv')

# Analyze temperature distribution
plt.figure(figsize=(10, 6))
plt.hist(bikes_hour['temp'], bins=30, color='orange', alpha=0.7, edgecolor='black')
plt.xlabel('Normalized Temperature')
plt.ylabel('Frequency')
plt.title('Temperature Distribution')
plt.grid(True, alpha=0.3, axis='y')
plt.show()

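# Statistics: treat mean and median as equal when they differ by
# less than 5% of the standard deviation (an arbitrary tolerance)
temp_mean = bikes_hour['temp'].mean()
temp_median = bikes_hour['temp'].median()
tolerance = 0.05 * bikes_hour['temp'].std()
if abs(temp_mean - temp_median) <= tolerance:
    skew = 'Symmetric'
elif temp_mean > temp_median:
    skew = 'Right'
else:
    skew = 'Left'

print("Temperature Distribution:")
print(f"  Mean:   {temp_mean:.3f}")
print(f"  Median: {temp_median:.3f}")
print(f"  Skew:   {skew}")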
Temperature Distribution:
  Mean:   0.497
  Median: 0.498
  Skew:   Symmetric

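Interpretation: Temperature is nearly symmetric (mean and median almost coincide), unlike the right-skewed rental counts. Symmetry alone does not make a distribution normal, so check the histogram's shape for multiple peaks or flat regions.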
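
A nearly equal mean and median says little about the overall shape. The sketch below builds a deliberately two-peaked synthetic sample (invented for illustration, not the temperature column) whose mean and median almost match even though it is clearly not bell-shaped.

```python
import numpy as np

# Assumption: synthetic two-peaked data, symmetric around 0.5
rng = np.random.default_rng(0)
cool = rng.normal(loc=0.3, scale=0.08, size=5000)
warm = rng.normal(loc=0.7, scale=0.08, size=5000)
demo = np.concatenate([cool, warm])

mean_val = demo.mean()
median_val = np.median(demo)
print(f"Mean: {mean_val:.3f}, Median: {median_val:.3f}")

# Ten bins from 0 to 1: the middle bins are far emptier than the two peaks
counts, edges = np.histogram(demo, bins=10, range=(0, 1))
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {int(count)}")
```

The summary statistics suggest symmetry, but only the bin counts (or the plotted histogram) reveal the dip in the middle.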


Comparing Distributions

Overlaying Histograms

import pandas as pd
import matplotlib.pyplot as plt

bikes_hour = pd.read_csv('hour.csv')

# Separate working days from weekends and holidays (workingday == 0 covers both)
working_day = bikes_hour[bikes_hour['workingday'] == 1]['cnt']
weekend = bikes_hour[bikes_hour['workingday'] == 0]['cnt']

plt.figure(figsize=(10, 6))
plt.hist(working_day, bins=30, alpha=0.6, label='Working Day', color='blue', edgecolor='black')
plt.hist(weekend, bins=30, alpha=0.6, label='Weekend/Holiday', color='red', edgecolor='black')
plt.xlabel('Hourly Bike Rentals')
plt.ylabel('Frequency')
plt.title('Rental Distribution: Working Days vs Weekends/Holidays')
plt.legend()
plt.grid(True, alpha=0.3, axis='y')
plt.show()

Observations:

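  • Both distributions are right-skewed
  • Working days have higher peak around 150-200
  • Weekend distribution is slightly flatter (more spread)
  • There are more working-day hours than weekend/holiday hours, so the working-day bars are taller overall in a raw-frequency plot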
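
Two details affect how fairly overlaid histograms can be compared: passing bins=30 to each call computes separate bin edges for each group, and groups of different sizes produce bars of different heights even when their shapes match. The sketch below shows one way to address both, using shared edges and density scaling. It uses synthetic groups (group_a and group_b are hypothetical stand-ins for the working-day and weekend series).

```python
import numpy as np

# Assumption: synthetic groups of different sizes standing in for the two series
rng = np.random.default_rng(3)
group_a = rng.exponential(scale=190.0, size=12_000)  # e.g. working-day hours
group_b = rng.exponential(scale=180.0, size=5_000)   # e.g. weekend/holiday hours

# Binning each group separately gives each its own edges
_, edges_a = np.histogram(group_a, bins=30)
_, edges_b = np.histogram(group_b, bins=30)
print("Separate edges match:", np.allclose(edges_a, edges_b))

# Compute one set of edges from the pooled data and reuse it for both groups
pooled = np.concatenate([group_a, group_b])
shared_edges = np.histogram_bin_edges(pooled, bins=30)

# density=True rescales each group so its bars have total area 1
dens_a, _ = np.histogram(group_a, bins=shared_edges, density=True)
dens_b, _ = np.histogram(group_b, bins=shared_edges, density=True)

widths = np.diff(shared_edges)
print("Area A:", round(float(np.sum(dens_a * widths)), 6))
print("Area B:", round(float(np.sum(dens_b * widths)), 6))
```

The same shared_edges array can be passed as the bins argument of plt.hist, together with density=True, to draw the overlaid version.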

Frequency vs Density

Understanding Density Histograms

Frequency histogram: Y-axis shows count (number of observations)

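Density histogram: Y-axis shows probability density, scaled so the total area of the bars equals 1.0. Each bar's height is count / (total count × bin width), so the heights themselves are not proportions.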

import pandas as pd
import matplotlib.pyplot as plt

bikes_hour = pd.read_csv('hour.csv')

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Frequency histogram
axes[0].hist(bikes_hour['cnt'], bins=30, color='steelblue', alpha=0.7, edgecolor='black')
axes[0].set_xlabel('Hourly Bike Rentals')
axes[0].set_ylabel('Frequency (Count)')
axes[0].set_title('Frequency Histogram')
axes[0].grid(True, alpha=0.3, axis='y')

# Density histogram
axes[1].hist(bikes_hour['cnt'], bins=30, density=True, color='coral', alpha=0.7, edgecolor='black')
axes[1].set_xlabel('Hourly Bike Rentals')
axes[1].set_ylabel('Density')
axes[1].set_title('Density Histogram')
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

When to use density:

  • Comparing distributions with different sample sizes
  • Overlaying probability distributions
  • Statistical analysis requiring probabilities
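
The relationship between counts, proportions, and density can be checked numerically. The sketch below uses synthetic data (demo is a hypothetical stand-in for the rental counts) and reproduces the density values by hand from the counts and bin widths.

```python
import numpy as np

# Assumption: synthetic right-skewed data standing in for bikes_hour['cnt']
rng = np.random.default_rng(11)
demo = rng.exponential(scale=180.0, size=2_000)

counts, edges = np.histogram(demo, bins=30)
density, _ = np.histogram(demo, bins=30, density=True)
widths = np.diff(edges)

# Density: counts divided by (total count * bin width)
manual_density = counts / (counts.sum() * widths)

# Proportions: counts divided by total count (these sum to 1)
proportions = counts / counts.sum()

print("Density matches manual formula:", np.allclose(density, manual_density))
print(f"Total area of density bars: {np.sum(density * widths):.3f}")
print(f"Sum of proportions:         {proportions.sum():.3f}")
print(f"Tallest density bar:        {density.max():.5f}")
print(f"Tallest proportion:         {proportions.max():.3f}")
```

Because the bins here are much wider than 1, the density values are far smaller than the proportions. If proportions are what you want on the Y-axis, one option is to pass weights (each value weighted by 1/n) instead of density=True.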

Practical Analysis

Hourly Rental Pattern Analysis

import pandas as pd
import matplotlib.pyplot as plt

bikes_hour = pd.read_csv('hour.csv')

# Analyze rentals during different hour ranges
hour_ranges = {
    'Night (0-6)': bikes_hour[bikes_hour['hr'] < 6]['cnt'],
    'Morning (6-12)': bikes_hour[(bikes_hour['hr'] >= 6) & (bikes_hour['hr'] < 12)]['cnt'],
    'Afternoon (12-18)': bikes_hour[(bikes_hour['hr'] >= 12) & (bikes_hour['hr'] < 18)]['cnt'],
    'Evening (18-24)': bikes_hour[bikes_hour['hr'] >= 18]['cnt']
}

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
colors = ['darkblue', 'gold', 'orange', 'purple']

for ax, (period, data), color in zip(axes.flat, hour_ranges.items(), colors):
    ax.hist(data, bins=30, color=color, alpha=0.7, edgecolor='black')
    ax.set_xlabel('Bike Rentals')
    ax.set_ylabel('Frequency')
    ax.set_title(f'{period}\nMean: {data.mean():.0f}, Median: {data.median():.0f}')
    ax.grid(True, alpha=0.3, axis='y')
    ax.axvline(data.mean(), color='red', linestyle='--', linewidth=2, alpha=0.7)

plt.tight_layout()
plt.show()

# Summary statistics
print("\nRENTAL DISTRIBUTION BY TIME OF DAY")
print("=" * 60)
for period, data in hour_ranges.items():
    print(f"\n{period}:")
    print(f"  Mean:   {data.mean():6.1f}")
    print(f"  Median: {data.median():6.1f}")
    print(f"  Std:    {data.std():6.1f}")
    print(f"  Range:  {data.min():.0f} to {data.max():.0f}")
RENTAL DISTRIBUTION BY TIME OF DAY
============================================================

Night (0-6):
  Mean:     18.9
  Median:   15.0
  Std:      22.4
  Range:  1 to 200

Morning (6-12):
  Mean:    277.5
  Median:  231.0
  Std:     180.9
  Range:  2 to 977

Afternoon (12-18):
  Mean:    304.6
  Median:  287.0
  Std:     167.1
  Range:  3 to 945

Evening (18-24):
  Mean:    215.4
  Median:  192.0
  Std:     157.1
  Range:  2 to 757
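
The four boolean filters can also be expressed with pd.cut, which assigns every row to a labeled time-of-day interval in one step. This is a sketch of an alternative, not the lesson's method, and it runs on a small synthetic table (demo, with invented hr and cnt values) rather than hour.csv.

```python
import numpy as np
import pandas as pd

# Assumption: synthetic table with the same column names as hour.csv
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    'hr': np.tile(np.arange(24), 50),               # 50 synthetic days
    'cnt': rng.poisson(lam=100, size=24 * 50),      # invented rental counts
})

edges = [0, 6, 12, 18, 24]
labels = ['Night (0-6)', 'Morning (6-12)', 'Afternoon (12-18)', 'Evening (18-24)']

# right=False makes each interval include its left edge: [0, 6), [6, 12), ...
demo['period'] = pd.cut(demo['hr'], bins=edges, right=False, labels=labels)

summary = (
    demo.groupby('period', observed=True)['cnt']
        .agg(['count', 'mean', 'median', 'std'])
)
print(summary)
```

With right=False, hour 6 falls in the morning group and hour 23 in the evening group, matching the boundaries used in the filters above.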

Summary

You learned to create and interpret histograms:

  • plt.hist() creates histograms for continuous data
  • Bins divide data into ranges; choose 20-50 for most cases
  • bins='auto' lets matplotlib choose appropriate binning
  • Histogram shape reveals distribution pattern (normal, skewed, etc.)
  • Customization: color, alpha, edgecolor, linewidth
  • Statistical lines: add mean, median with plt.axvline()
  • Density histograms rescale the bars so their total area is 1, instead of showing counts
  • Histograms ≠ bar plots: continuous vs categorical data

Next Steps: In the next lesson, you will learn to compare multiple distributions side-by-side and analyze differences between groups.

Practice: Create histograms for different weather variables (humidity, windspeed). What distribution shapes do you observe?