Lesson 11 - Creating Histograms

From Categories to Continuous Distributions

You learned bar plots for comparing categories. Now you will learn histograms—specialized charts that show how continuous data is distributed across a range of values.

By the end of this lesson, you will be able to:

Create histograms with plt.hist()
Understand and control bins
Choose appropriate bin counts
Interpret histogram shapes
Customize histogram appearance
Analyze frequency distributions

Histograms answer: “How many data points fall within each range of values?”

Histogram vs Bar Plot

Key Differences

Bar Plot:

Compares distinct categories (days, products, countries)
Each bar is a separate category
Gaps between bars show categories are discrete
Order usually doesn’t matter (can be alphabetical, by value, etc.)

Histogram:

Shows distribution of continuous data (temperature, price, age)
Each bar represents a range (bin)
No gaps—data is continuous
Order matters (low to high values)

Visual Comparison

Bar Plot (Categories)          Histogram (Continuous)
      ┌───┐                         ┌───┬───┬───┐
      │   │     ┌───┐               │   │   │   │
  ┌───┤   ├───┬─┤   │           ┌───┤   │   │   ├───┐
  │ A │ B │ C │D│ E │           │ 0 │10 │20 │30 │40 │
  └───┴───┴───┴─┴───┘           └───┴───┴───┴───┴───┘
   Discrete categories           Continuous ranges
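
As a rough sketch of the distinction above, the snippet below counts a categorical variable by label and a continuous variable by range. The sample values and the bin edges are invented for illustration and do not come from the lesson's dataset.

```python
from collections import Counter

import numpy as np

# Hypothetical sample data (not from hour.csv)
weekdays = ["Mon", "Tue", "Mon", "Wed", "Tue", "Mon"]          # categorical
temperatures = np.array([12.5, 18.1, 21.7, 9.3, 25.0, 19.9])   # continuous

# Bar-plot style: one count per distinct label
category_counts = Counter(weekdays)
print(category_counts)

# Histogram style: one count per value range (bin)
bin_edges = [0, 10, 20, 30]
range_counts, _ = np.histogram(temperatures, bins=bin_edges)
print(range_counts)
```

The labels in the first count have no inherent order, while the ranges in the second are ordered from low to high, which is why histogram bars sit side by side.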

Creating Basic Histograms

Bike Rental Distribution

import pandas as pd
import matplotlib.pyplot as plt

# Load hourly data
bikes_hour = pd.read_csv('hour.csv')

# Create histogram
plt.figure(figsize=(10, 6))
plt.hist(bikes_hour['cnt'], bins=30, edgecolor='black', alpha=0.7)
plt.xlabel('Hourly Bike Rentals')
plt.ylabel('Frequency (Number of Hours)')
plt.title('Distribution of Hourly Bike Rentals')
plt.grid(True, alpha=0.3, axis='y')
plt.show()

What the histogram shows:

X-axis: Rental counts (the data values)
Y-axis: Frequency (how many hours fall in each range of counts)
Bars: Height indicates how common each range is
Shape: Right-skewed (most hours have low-moderate rentals)
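
The numbers behind the bars are also available in code, because the histogram call returns the bin counts and bin edges. The sketch below uses synthetic right-skewed data as a stand-in for the cnt column (an assumption, so it runs without hour.csv) and a non-interactive backend so no window is needed.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, right-skewed stand-in for hourly rental counts
rng = np.random.default_rng(0)
rentals = rng.gamma(shape=1.5, scale=120, size=1000)

fig, ax = plt.subplots()
# hist returns the bar heights, the bin edges, and the drawn bar objects
counts, edges, patches = ax.hist(rentals, bins=30)

tallest = int(np.argmax(counts))
print(f"{len(counts)} bins, {len(edges)} edges")
print(f"Most common range: {edges[tallest]:.0f} to {edges[tallest + 1]:.0f}")
print(f"Total counted: {int(counts.sum())}")
plt.close(fig)
```

There is always one more edge than there are bins, and the counts add up to the number of data points.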

Understanding Bins

What are Bins?

Bins divide the data range into intervals. Each bar represents one bin.

Example: For data from 0 to 100 with 10 bins:

Bin 1:  0-10    (13 values) ███
Bin 2: 10-20    (25 values) ██████
Bin 3: 20-30    (42 values) ██████████
Bin 4: 30-40    (38 values) █████████
Bin 5: 40-50    (19 values) ████
...

Each bin counts how many values fall in that range.
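
To make the counting concrete, here is a small sketch using NumPy's np.histogram, which follows the same binning convention as plt.hist. The values are made up. Note what happens to values that sit exactly on an edge: 10 goes into the 10-20 bin, and only the last bin includes its upper edge.

```python
import numpy as np

# Made-up values, several of them exactly on bin edges
values = np.array([0, 5, 10, 10, 15, 20, 99, 100])
edges = np.arange(0, 101, 10)  # 0, 10, ..., 100 gives 10 bins

counts, _ = np.histogram(values, bins=edges)
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right}: {count}")
```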

Controlling Bin Count

import pandas as pd
import matplotlib.pyplot as plt

bikes_hour = pd.read_csv('hour.csv')

# Create figure with different bin counts
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
bin_counts = [10, 20, 50, 100]

for ax, bins in zip(axes.flat, bin_counts):
    ax.hist(bikes_hour['cnt'], bins=bins, edgecolor='black', alpha=0.7)
    ax.set_xlabel('Hourly Bike Rentals')
    ax.set_ylabel('Frequency')
    ax.set_title(f'Histogram with {bins} Bins')
    ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

Observations:

Too few bins (10): Loses detail, overly smooth
Just right (20-30): Shows clear distribution pattern
Too many bins (100): Noisy, hard to see overall shape
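
The trade-off can also be seen numerically. This sketch, again on synthetic right-skewed data rather than hour.csv, records the bin width, the tallest bar, and the number of empty bins for each bin count used above.

```python
import numpy as np

# Synthetic stand-in for hourly rental counts
rng = np.random.default_rng(42)
rentals = rng.gamma(shape=1.5, scale=120, size=2000)

summary = {}
for n_bins in [10, 20, 50, 100]:
    counts, edges = np.histogram(rentals, bins=n_bins)
    summary[n_bins] = {
        "bin_width": float(edges[1] - edges[0]),
        "max_count": int(counts.max()),
        "empty_bins": int((counts == 0).sum()),
    }
    print(n_bins, summary[n_bins])
```

As the bin count grows, each bin covers a narrower range and holds fewer points, so random variation between neighbouring bars becomes more visible.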

Choosing the Right Bin Count

Guidelines

Sturges’ Rule (automatic):

import numpy as np

n = len(bikes_hour['cnt'])
bins = int(np.ceil(np.log2(n)) + 1)
print(f"Sturges' rule suggests {bins} bins for {n} data points")

Sturges' rule suggests 16 bins for 17379 data points

Square Root Rule:

bins = int(np.ceil(np.sqrt(n)))
print(f"Square root rule suggests {bins} bins")

Square root rule suggests 132 bins
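
Both rules are also built into NumPy as named bin estimators, so they do not have to be computed by hand. The sketch below compares the manual formulas with np.histogram_bin_edges on synthetic data of the same length as the hourly dataset; the data values themselves are an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 17379  # same number of rows as the hourly data in this lesson
data = rng.gamma(shape=1.5, scale=120, size=n)  # synthetic values

# Manual formulas
manual_sturges = int(np.ceil(np.log2(n) + 1))
manual_sqrt = int(np.ceil(np.sqrt(n)))

# NumPy's named estimators return bin edges; bins = edges - 1
numpy_sturges = len(np.histogram_bin_edges(data, bins="sturges")) - 1
numpy_sqrt = len(np.histogram_bin_edges(data, bins="sqrt")) - 1

print(f"Sturges: manual={manual_sturges}, numpy={numpy_sturges}")
print(f"Sqrt:    manual={manual_sqrt}, numpy={numpy_sqrt}")
```

The same strings can be passed straight to plt.hist, for example bins='sturges' or bins='sqrt'.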

Practical approach:

Start with bins='auto' (matplotlib's own default is a fixed 10 bins)
Experiment with 20-50 bins
Choose what shows pattern clearest
Avoid extremes (too few or too many)

Using Auto Bins

import pandas as pd
import matplotlib.pyplot as plt

bikes_hour = pd.read_csv('hour.csv')

# Let matplotlib choose bins automatically
plt.figure(figsize=(10, 6))
plt.hist(bikes_hour['cnt'], bins='auto', edgecolor='black', alpha=0.7, color='steelblue')
plt.xlabel('Hourly Bike Rentals')
plt.ylabel('Frequency')
plt.title('Histogram with Automatic Binning')
plt.grid(True, alpha=0.3, axis='y')
plt.show()

bins='auto' lets matplotlib choose a reasonable bin count based on the data. It hands the decision to NumPy's 'auto' bin estimator.
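
For a look at what 'auto' resolves to, the sketch below asks NumPy directly. According to the NumPy documentation, 'auto' combines the Sturges and Freedman-Diaconis ('fd') estimators and keeps the narrower bin width, which means the larger bin count. The data here is synthetic, not hour.csv.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.gamma(shape=1.5, scale=120, size=5000)  # synthetic stand-in


def n_bins(rule):
    """Number of bins NumPy would use for the given named rule."""
    return len(np.histogram_bin_edges(data, bins=rule)) - 1


bins_fd = n_bins("fd")
bins_sturges = n_bins("sturges")
bins_auto = n_bins("auto")

print(f"fd={bins_fd}, sturges={bins_sturges}, auto={bins_auto}")
```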

Customizing Histogram Appearance

Colors and Transparency

import pandas as pd
import matplotlib.pyplot as plt

bikes_hour = pd.read_csv('hour.csv')

plt.figure(figsize=(10, 6))
plt.hist(bikes_hour['cnt'], bins=40, color='coral', alpha=0.6,
         edgecolor='darkred', linewidth=1.2)
plt.xlabel('Hourly Bike Rentals')
plt.ylabel('Frequency')
plt.title('Customized Histogram')
plt.grid(True, alpha=0.3, axis='y')
plt.show()

Parameters:

color='coral': Fill color
alpha=0.6: Transparency
edgecolor='darkred': Border color
linewidth=1.2: Border thickness

Adding Statistical Lines

import pandas as pd
import matplotlib.pyplot as plt

bikes_hour = pd.read_csv('hour.csv')

# Calculate statistics
mean_val = bikes_hour['cnt'].mean()
median_val = bikes_hour['cnt'].median()

# Create histogram
plt.figure(figsize=(10, 6))
plt.hist(bikes_hour['cnt'], bins=40, color='skyblue', alpha=0.7, edgecolor='black')

# Add mean and median lines
plt.axvline(mean_val, color='red', linestyle='--', linewidth=2, label=f'Mean = {mean_val:.0f}')
plt.axvline(median_val, color='green', linestyle='--', linewidth=2, label=f'Median = {median_val:.0f}')

plt.xlabel('Hourly Bike Rentals')
plt.ylabel('Frequency')
plt.title('Bike Rental Distribution with Central Tendency')
plt.legend()
plt.grid(True, alpha=0.3, axis='y')
plt.show()

Insight: Mean (189) > Median (145) is consistent with a right-skewed distribution: a long tail of high-rental hours pulls the mean above the median.
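
Comparing mean and median is a quick heuristic rather than a formal test. A minimal sketch, using synthetic right-skewed data instead of hour.csv, that pairs the heuristic with the sample skewness coefficient (third central moment divided by the cubed standard deviation):

```python
import numpy as np

rng = np.random.default_rng(7)
rentals = rng.gamma(shape=1.5, scale=120, size=10_000)  # synthetic, right-skewed

mean_val = rentals.mean()
median_val = np.median(rentals)

# Sample skewness: positive values indicate a longer right tail
centered = rentals - mean_val
skewness = (centered**3).mean() / (centered**2).mean() ** 1.5

print(f"Mean:     {mean_val:.1f}")
print(f"Median:   {median_val:.1f}")
print(f"Skewness: {skewness:.2f}")
```

With real data, bikes_hour['cnt'].skew() in pandas gives a closely related (bias-adjusted) value.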

Analyzing Distribution Shapes

Identifying Patterns

import pandas as pd
import matplotlib.pyplot as plt

bikes_hour = pd.read_csv('hour.csv')

# Analyze temperature distribution
plt.figure(figsize=(10, 6))
plt.hist(bikes_hour['temp'], bins=30, color='orange', alpha=0.7, edgecolor='black')
plt.xlabel('Normalized Temperature')
plt.ylabel('Frequency')
plt.title('Temperature Distribution')
plt.grid(True, alpha=0.3, axis='y')
plt.show()

# Statistics
temp_mean = bikes_hour['temp'].mean()
temp_median = bikes_hour['temp'].median()
tolerance = 0.01  # differences smaller than this are treated as symmetric

if abs(temp_mean - temp_median) < tolerance:
    skew = 'Symmetric'
elif temp_mean > temp_median:
    skew = 'Right'
else:
    skew = 'Left'

print("Temperature Distribution:")
print(f"  Mean:   {temp_mean:.3f}")
print(f"  Median: {temp_median:.3f}")
print(f"  Skew:   {skew}")

Temperature Distribution:
  Mean:   0.497
  Median: 0.498
  Skew:   Symmetric

Interpretation: Temperature is nearly symmetric by this measure, unlike the right-skewed rental counts. Symmetry alone does not make a distribution normal, so check the histogram's shape as well.

Comparing Distributions

Overlaying Histograms

import pandas as pd
import matplotlib.pyplot as plt

bikes_hour = pd.read_csv('hour.csv')

# Separate working days from weekends and holidays (workingday == 0)
working_day = bikes_hour[bikes_hour['workingday'] == 1]['cnt']
weekend = bikes_hour[bikes_hour['workingday'] == 0]['cnt']

plt.figure(figsize=(10, 6))
plt.hist(working_day, bins=30, alpha=0.6, label='Working Day', color='blue', edgecolor='black')
plt.hist(weekend, bins=30, alpha=0.6, label='Weekend/Holiday', color='red', edgecolor='black')
plt.xlabel('Hourly Bike Rentals')
plt.ylabel('Frequency')
plt.title('Rental Distribution: Working Days vs Weekends')
plt.legend()
plt.grid(True, alpha=0.3, axis='y')
plt.show()

Observations:

Both distributions are right-skewed
Working days have higher peak around 150-200
Weekend distribution is slightly flatter (more spread)
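
One caveat when overlaying: passing bins=30 to each call computes the bin edges separately for each group, so the two sets of bars may not line up. A possible remedy, sketched here on synthetic data with hypothetical group sizes, is to compute one set of edges from the combined data and reuse it.

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic stand-ins for the two groups (sizes are arbitrary)
working_day = rng.gamma(shape=1.5, scale=130, size=3000)
weekend = rng.gamma(shape=1.5, scale=110, size=1400)

# Edges computed separately generally differ between groups
edges_a = np.histogram_bin_edges(working_day, bins=30)
edges_b = np.histogram_bin_edges(weekend, bins=30)
print("Separate edges match:", np.allclose(edges_a, edges_b))

# Shared edges computed once from the combined data
shared_edges = np.histogram_bin_edges(np.concatenate([working_day, weekend]), bins=30)
working_counts, _ = np.histogram(working_day, bins=shared_edges)
weekend_counts, _ = np.histogram(weekend, bins=shared_edges)

# In matplotlib, pass the same array to both calls, e.g.
# plt.hist(working_day, bins=shared_edges, alpha=0.6, label='Working Day')
# plt.hist(weekend, bins=shared_edges, alpha=0.6, label='Weekend/Holiday')
```

Because the groups have different sizes, adding density=True to both calls makes the shapes comparable as well.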

Frequency vs Density

Understanding Density Histograms

Frequency histogram: Y-axis shows count (number of observations)

Density histogram: Y-axis shows density, scaled so that the bar areas (height × bin width) sum to 1.0. Bar heights are not proportions unless every bin has width 1.

import pandas as pd
import matplotlib.pyplot as plt

bikes_hour = pd.read_csv('hour.csv')

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Frequency histogram
axes[0].hist(bikes_hour['cnt'], bins=30, color='steelblue', alpha=0.7, edgecolor='black')
axes[0].set_xlabel('Hourly Bike Rentals')
axes[0].set_ylabel('Frequency (Count)')
axes[0].set_title('Frequency Histogram')
axes[0].grid(True, alpha=0.3, axis='y')

# Density histogram
axes[1].hist(bikes_hour['cnt'], bins=30, density=True, color='coral', alpha=0.7, edgecolor='black')
axes[1].set_xlabel('Hourly Bike Rentals')
axes[1].set_ylabel('Density')
axes[1].set_title('Density Histogram')
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

When to use density:

Comparing distributions with different sample sizes
Overlaying probability distributions
Statistical analysis requiring probabilities
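
A quick numerical check of what density=True does, using NumPy on synthetic data (not hour.csv): each height equals the count divided by the total count times the bin width, so the areas sum to 1. If per-bin proportions are wanted instead, they can be computed from the counts.

```python
import numpy as np

rng = np.random.default_rng(5)
rentals = rng.gamma(shape=1.5, scale=120, size=4000)  # synthetic stand-in

counts, edges = np.histogram(rentals, bins=30)
density, _ = np.histogram(rentals, bins=30, density=True)
widths = np.diff(edges)

# Area under the density histogram
total_area = (density * widths).sum()

# Density reproduced by hand, and plain proportions for comparison
manual_density = counts / (counts.sum() * widths)
proportions = counts / counts.sum()

print(f"Total area:         {total_area:.3f}")
print(f"Sum of proportions: {proportions.sum():.3f}")
print(f"Sum of densities:   {density.sum():.5f}")
```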

Practical Analysis

Hourly Rental Pattern Analysis

import pandas as pd
import matplotlib.pyplot as plt

bikes_hour = pd.read_csv('hour.csv')

# Analyze rentals during different hour ranges
hour_ranges = {
    'Night (0-6)': bikes_hour[bikes_hour['hr'] < 6]['cnt'],
    'Morning (6-12)': bikes_hour[(bikes_hour['hr'] >= 6) & (bikes_hour['hr'] < 12)]['cnt'],
    'Afternoon (12-18)': bikes_hour[(bikes_hour['hr'] >= 12) & (bikes_hour['hr'] < 18)]['cnt'],
    'Evening (18-24)': bikes_hour[bikes_hour['hr'] >= 18]['cnt']
}

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
colors = ['darkblue', 'gold', 'orange', 'purple']

for ax, (period, data), color in zip(axes.flat, hour_ranges.items(), colors):
    ax.hist(data, bins=30, color=color, alpha=0.7, edgecolor='black')
    ax.set_xlabel('Bike Rentals')
    ax.set_ylabel('Frequency')
    ax.set_title(f'{period}\nMean: {data.mean():.0f}, Median: {data.median():.0f}')
    ax.grid(True, alpha=0.3, axis='y')
    ax.axvline(data.mean(), color='red', linestyle='--', linewidth=2, alpha=0.7)

plt.tight_layout()
plt.show()

# Summary statistics
print("\nRENTAL DISTRIBUTION BY TIME OF DAY")
print("=" * 60)
for period, data in hour_ranges.items():
    print(f"\n{period}:")
    print(f"  Mean:   {data.mean():6.1f}")
    print(f"  Median: {data.median():6.1f}")
    print(f"  Std:    {data.std():6.1f}")
    print(f"  Range:  {data.min():.0f} to {data.max():.0f}")

RENTAL DISTRIBUTION BY TIME OF DAY
============================================================

Night (0-6):
  Mean:     18.9
  Median:   15.0
  Std:      22.4
  Range:  1 to 200

Morning (6-12):
  Mean:    277.5
  Median:  231.0
  Std:     180.9
  Range:  2 to 977

Afternoon (12-18):
  Mean:    304.6
  Median:  287.0
  Std:     167.1
  Range:  3 to 945

Evening (18-24):
  Mean:    215.4
  Median:  192.0
  Std:     157.1
  Range:  2 to 757
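
The four boolean masks can also be replaced by a single binning step. The sketch below is an alternative under stated assumptions, not the lesson's method: it uses pd.cut with right=False, so each period includes its starting hour and excludes its ending hour, and it runs on a synthetic table with the same hr and cnt column names.

```python
import numpy as np
import pandas as pd

# Synthetic table with the same column names as hour.csv
rng = np.random.default_rng(11)
df = pd.DataFrame({
    "hr": np.tile(np.arange(24), 50),
    "cnt": rng.poisson(lam=150, size=24 * 50),
})

labels = ["Night (0-6)", "Morning (6-12)", "Afternoon (12-18)", "Evening (18-24)"]
# right=False makes the intervals [0, 6), [6, 12), [12, 18), [18, 24)
df["period"] = pd.cut(df["hr"], bins=[0, 6, 12, 18, 24], right=False, labels=labels)

summary = df.groupby("period", observed=True)["cnt"].agg(["count", "mean", "median", "std"])
print(summary.round(1))
```

Each period's values can then be plotted with df.loc[df['period'] == label, 'cnt'] in place of the dictionary entries.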

Summary

You learned to create and interpret histograms:

plt.hist() creates histograms for continuous data
Bins divide data into ranges; choose 20-50 for most cases
bins='auto' lets matplotlib choose appropriate binning
Histogram shape reveals distribution pattern (normal, skewed, etc.)
Customization: color, alpha, edgecolor, linewidth
Statistical lines: add mean, median with plt.axvline()
Density histograms rescale the bars so their total area is 1, instead of showing counts
Histograms ≠ bar plots: continuous vs categorical data

Next Steps: In the next lesson, you will learn to compare multiple distributions side-by-side and analyze differences between groups.

Practice: Create histograms for different weather variables (humidity, windspeed). What distribution shapes do you observe?
