Lesson 11 - Creating Histograms
From Categories to Continuous Distributions
You learned bar plots for comparing categories. Now you will learn histograms—specialized charts that show how continuous data is distributed across a range of values.
By the end of this lesson, you will be able to:
- Create histograms with plt.hist()
- Understand and control bins
- Choose appropriate bin counts
- Interpret histogram shapes
- Customize histogram appearance
- Analyze frequency distributions
Histograms answer: “How many data points fall within each range of values?”
Histogram vs Bar Plot
Key Differences
Bar Plot:
- Compares distinct categories (days, products, countries)
- Each bar is a separate category
- Gaps between bars show categories are discrete
- Order usually doesn’t matter (can be alphabetical, by value, etc.)
Histogram:
- Shows distribution of continuous data (temperature, price, age)
- Each bar represents a range (bin)
- No gaps—data is continuous
- Order matters (low to high values)
Visual Comparison
Bar Plot (Categories)         Histogram (Continuous)
    ┌───┐                         ┌───┬───┬───┐
    │   │     ┌───┐               │   │   │   │
┌───┤   ├───┬─┤   │           ┌───┤   │   │   ├───┐
│ A │ B │ C │D│ E │           │ 0 │10 │20 │30 │40 │
└───┴───┴───┴─┴───┘           └───┴───┴───┴───┴───┘
Discrete categories           Continuous ranges
Creating Basic Histograms
Bike Rental Distribution
import pandas as pd
import matplotlib.pyplot as plt
# Load hourly data
bikes_hour = pd.read_csv('hour.csv')
# Create histogram
plt.figure(figsize=(10, 6))
plt.hist(bikes_hour['cnt'], bins=30, edgecolor='black', alpha=0.7)
plt.xlabel('Hourly Bike Rentals')
plt.ylabel('Frequency (Number of Hours)')
plt.title('Distribution of Hourly Bike Rentals')
plt.grid(True, alpha=0.3, axis='y')
plt.show()
What the histogram shows:
- X-axis: Rental counts (the data values)
- Y-axis: Frequency (how many hours had that count)
- Bars: Height indicates how common each range is
- Shape: Right-skewed (most hours have low-moderate rentals)
Understanding Bins
What are Bins?
Bins divide the data range into intervals. Each bar represents one bin.
Example: For data from 0 to 100 with 10 bins:
Bin 1: 0-10 (13 values) ███
Bin 2: 10-20 (25 values) ██████
Bin 3: 20-30 (42 values) ██████████
Bin 4: 30-40 (38 values) █████████
Bin 5: 40-50 (19 values) ████
...
Each bin counts how many values fall in that range.
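To make the counting concrete, here is a minimal sketch using np.histogram, the NumPy function that plt.hist() relies on for its binning. The sample values are made up for illustration and do not come from hour.csv. Note that each bin includes its left edge but not its right edge, except the last bin, which includes both.

```python
import numpy as np

# Made-up sample values between 0 and 100 (illustrative only)
values = np.array([3, 10, 10, 25, 47, 50, 99, 100])

# 10 equal-width bins covering the range 0 to 100
counts, edges = np.histogram(values, bins=10, range=(0, 100))

# Each bin is described by a pair of neighbouring edges
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.0f} to {right:5.0f}: {count}")
```

The two values equal to 10 are counted in the 10 to 20 bin, not the 0 to 10 bin, while the value 100 is counted in the final 90 to 100 bin.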
Controlling Bin Count
import pandas as pd
import matplotlib.pyplot as plt
bikes_hour = pd.read_csv('hour.csv')
# Create figure with different bin counts
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
bin_counts = [10, 20, 50, 100]
for ax, bins in zip(axes.flat, bin_counts):
    ax.hist(bikes_hour['cnt'], bins=bins, edgecolor='black', alpha=0.7)
    ax.set_xlabel('Hourly Bike Rentals')
    ax.set_ylabel('Frequency')
    ax.set_title(f'Histogram with {bins} Bins')
    ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
Observations:
- Too few bins (10): Loses detail, overly smooth
- Just right (20-30): Shows clear distribution pattern
- Too many bins (100): Noisy, hard to see overall shape
Choosing the Right Bin Count
Guidelines
Sturges’ Rule (automatic):
import numpy as np
n = len(bikes_hour['cnt'])
bins = int(np.ceil(np.log2(n)) + 1)
print(f"Sturges' rule suggests {bins} bins for {n} data points")
Sturges' rule suggests 16 bins for 17379 data points
Square Root Rule:
bins = int(np.ceil(np.sqrt(n)))
print(f"Square root rule suggests {bins} bins")
Square root rule suggests 132 bins
Practical approach:
- Start with bins='auto' (matplotlib's own default is a fixed 10 bins)
- Experiment with 20-50 bins
- Choose what shows pattern clearest
- Avoid extremes (too few or too many)
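The rules above can be compared side by side. The sketch below is one possible way to do it: suggest_bin_counts is a hypothetical helper, and the data is a synthetic right-skewed sample standing in for the cnt column, so only the Sturges and square root results (which depend on the sample size alone) carry over to the real dataset.

```python
import numpy as np

def suggest_bin_counts(data):
    """Return bin counts suggested by a few common rules (hypothetical helper)."""
    data = np.asarray(data)
    n = data.size
    return {
        "sturges": int(np.ceil(np.log2(n))) + 1,
        "sqrt": int(np.ceil(np.sqrt(n))),
        # NumPy's Freedman-Diaconis estimator; there is one more edge than bins
        "fd": len(np.histogram_bin_edges(data, bins="fd")) - 1,
    }

# Synthetic right-skewed data with the same length as hour.csv (an assumption)
rng = np.random.default_rng(0)
sample = rng.exponential(scale=180, size=17379)

suggestions = suggest_bin_counts(sample)
for rule, k in suggestions.items():
    print(f"{rule:>8}: {k} bins")
```

Treat the suggestions as starting points: plot a few of them and keep the one that shows the overall shape most clearly.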
Using Auto Bins
import pandas as pd
import matplotlib.pyplot as plt
bikes_hour = pd.read_csv('hour.csv')
# Let matplotlib choose bins automatically
plt.figure(figsize=(10, 6))
plt.hist(bikes_hour['cnt'], bins='auto', edgecolor='black', alpha=0.7, color='steelblue')
plt.xlabel('Hourly Bike Rentals')
plt.ylabel('Frequency')
plt.title('Histogram with Automatic Binning')
plt.grid(True, alpha=0.3, axis='y')
plt.show()
bins='auto' lets matplotlib choose a reasonable bin count based on the data.
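To see what 'auto' actually picks, you can ask NumPy directly, since matplotlib passes the string through to NumPy's binning. As documented for np.histogram_bin_edges, 'auto' uses the narrower of the Sturges and Freedman-Diaconis bin widths, which means it ends up with the larger of the two bin counts. The sketch below checks this on synthetic data; n_bins is a hypothetical helper and the gamma-distributed sample is an assumption, not the real rental data.

```python
import numpy as np

# Synthetic right-skewed data standing in for bikes_hour['cnt'] (an assumption)
rng = np.random.default_rng(42)
sample = rng.gamma(shape=1.2, scale=150, size=5000)

def n_bins(data, rule):
    # histogram_bin_edges returns edges, so there is one fewer bin than edges
    return len(np.histogram_bin_edges(data, bins=rule)) - 1

n_auto = n_bins(sample, "auto")
n_sturges = n_bins(sample, "sturges")
n_fd = n_bins(sample, "fd")

print(f"auto: {n_auto}, sturges: {n_sturges}, fd: {n_fd}")
```

If the automatic choice looks too fine or too coarse for your data, pass an explicit integer to bins instead.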
Customizing Histogram Appearance
Colors and Transparency
import pandas as pd
import matplotlib.pyplot as plt
bikes_hour = pd.read_csv('hour.csv')
plt.figure(figsize=(10, 6))
plt.hist(bikes_hour['cnt'], bins=40, color='coral', alpha=0.6,
         edgecolor='darkred', linewidth=1.2)
plt.xlabel('Hourly Bike Rentals')
plt.ylabel('Frequency')
plt.title('Customized Histogram')
plt.grid(True, alpha=0.3, axis='y')
plt.show()
Parameters:
- color='coral': Fill color
- alpha=0.6: Transparency
- edgecolor='darkred': Border color
- linewidth=1.2: Border thickness
Adding Statistical Lines
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
bikes_hour = pd.read_csv('hour.csv')
# Calculate statistics
mean_val = bikes_hour['cnt'].mean()
median_val = bikes_hour['cnt'].median()
# Create histogram
plt.figure(figsize=(10, 6))
plt.hist(bikes_hour['cnt'], bins=40, color='skyblue', alpha=0.7, edgecolor='black')
# Add mean and median lines
plt.axvline(mean_val, color='red', linestyle='--', linewidth=2, label=f'Mean = {mean_val:.0f}')
plt.axvline(median_val, color='green', linestyle='--', linewidth=2, label=f'Median = {median_val:.0f}')
plt.xlabel('Hourly Bike Rentals')
plt.ylabel('Frequency')
plt.title('Bike Rental Distribution with Central Tendency')
plt.legend()
plt.grid(True, alpha=0.3, axis='y')
plt.show()
Insight: The mean (189) exceeds the median (145), which is consistent with a right-skewed distribution: a small number of very busy hours pulls the mean upward.
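Comparing mean and median is a quick heuristic. If you want a single number for the asymmetry, pandas provides Series.skew(), which returns the sample skewness (positive for a longer right tail, negative for a longer left tail). The sketch below uses a synthetic exponential sample as a stand-in for the rental counts, so the printed values are not the real dataset's.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed counts standing in for bikes_hour['cnt'] (an assumption)
rng = np.random.default_rng(1)
counts = pd.Series(rng.exponential(scale=190, size=10_000))

mean_val = counts.mean()
median_val = counts.median()
skewness = counts.skew()  # sample skewness: > 0 means a longer right tail

print(f"Mean:     {mean_val:.1f}")
print(f"Median:   {median_val:.1f}")
print(f"Skewness: {skewness:.2f}")
```

On the real data you would call bikes_hour['cnt'].skew() in the same way.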
Analyzing Distribution Shapes
Identifying Patterns
import pandas as pd
import matplotlib.pyplot as plt
bikes_hour = pd.read_csv('hour.csv')
# Analyze temperature distribution
plt.figure(figsize=(10, 6))
plt.hist(bikes_hour['temp'], bins=30, color='orange', alpha=0.7, edgecolor='black')
plt.xlabel('Normalized Temperature')
plt.ylabel('Frequency')
plt.title('Temperature Distribution')
plt.grid(True, alpha=0.3, axis='y')
plt.show()
# Statistics
temp_mean = bikes_hour['temp'].mean()
temp_median = bikes_hour['temp'].median()
diff = temp_mean - temp_median
# 0.01 is an arbitrary tolerance; exact float equality would almost never hold
skew = 'Right' if diff > 0.01 else 'Left' if diff < -0.01 else 'Symmetric'
print("Temperature Distribution:")
print(f" Mean: {temp_mean:.3f}")
print(f" Median: {temp_median:.3f}")
print(f" Skew: {skew}")
Temperature Distribution:
 Mean: 0.497
 Median: 0.498
 Skew: Symmetric
Interpretation: The temperature distribution is nearly symmetric (mean and median almost coincide), unlike the right-skewed rental counts.
Comparing Distributions
Overlaying Histograms
import pandas as pd
import matplotlib.pyplot as plt
bikes_hour = pd.read_csv('hour.csv')
# Separate working days from weekends and holidays (workingday == 0)
working_day = bikes_hour[bikes_hour['workingday'] == 1]['cnt']
weekend = bikes_hour[bikes_hour['workingday'] == 0]['cnt']
plt.figure(figsize=(10, 6))
plt.hist(working_day, bins=30, alpha=0.6, label='Working Day', color='blue', edgecolor='black')
plt.hist(weekend, bins=30, alpha=0.6, label='Weekend/Holiday', color='red', edgecolor='black')
plt.xlabel('Hourly Bike Rentals')
plt.ylabel('Frequency')
plt.title('Rental Distribution: Working Days vs Weekends')
plt.legend()
plt.grid(True, alpha=0.3, axis='y')
plt.show()
Observations:
- Both distributions are right-skewed
- Working days have higher peak around 150-200
- Weekend distribution is slightly flatter (more spread)
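One caveat when overlaying: each plt.hist() call with bins=30 computes its own 30 bins from that group's minimum and maximum, so the two sets of bars may not line up. A possible remedy, sketched below with synthetic stand-ins for the two groups, is to compute one shared set of edges first and reuse it. The example also uses density=True so that the larger group does not dominate purely because it has more rows.

```python
import numpy as np

# Synthetic stand-ins for the two groups (assumptions, not the real data)
rng = np.random.default_rng(7)
working_day = rng.exponential(scale=190, size=12_000)
weekend = rng.exponential(scale=180, size=5_500)

# One set of 30 bins spanning both groups
edges = np.histogram_bin_edges(np.concatenate([working_day, weekend]), bins=30)

# density=True rescales each group so unequal sample sizes are comparable
wd_density, wd_edges = np.histogram(working_day, bins=edges, density=True)
we_density, we_edges = np.histogram(weekend, bins=edges, density=True)

# Both histograms now share identical bin boundaries
print(np.array_equal(wd_edges, we_edges))
```

In a plot, the same idea means passing bins=edges (and optionally density=True) to both plt.hist() calls instead of bins=30.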
Frequency vs Density
Understanding Density Histograms
Frequency histogram: Y-axis shows the count (number of observations in each bin)
Density histogram: Y-axis shows probability density, that is, each count divided by (total observations × bin width), so the bar areas sum to 1.0
import pandas as pd
import matplotlib.pyplot as plt
bikes_hour = pd.read_csv('hour.csv')
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Frequency histogram
axes[0].hist(bikes_hour['cnt'], bins=30, color='steelblue', alpha=0.7, edgecolor='black')
axes[0].set_xlabel('Hourly Bike Rentals')
axes[0].set_ylabel('Frequency (Count)')
axes[0].set_title('Frequency Histogram')
axes[0].grid(True, alpha=0.3, axis='y')
# Density histogram
axes[1].hist(bikes_hour['cnt'], bins=30, density=True, color='coral', alpha=0.7, edgecolor='black')
axes[1].set_xlabel('Hourly Bike Rentals')
axes[1].set_ylabel('Density (Proportion)')
axes[1].set_title('Density Histogram')
axes[1].grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
When to use density:
- Comparing distributions with different sample sizes
- Overlaying probability distributions
- Statistical analysis requiring probabilities
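The relationship between counts and density can be checked numerically. The sketch below uses np.histogram on a synthetic sample (an assumption, not the rental data) and confirms that density equals count divided by (total observations × bin width), and that the bar areas add up to 1.

```python
import numpy as np

# Synthetic right-skewed sample (an assumption, for illustration only)
rng = np.random.default_rng(3)
sample = rng.exponential(scale=190, size=2_000)

counts, edges = np.histogram(sample, bins=30)
density, _ = np.histogram(sample, bins=30, density=True)
widths = np.diff(edges)

# Density is the count divided by (total observations * bin width)
manual_density = counts / (counts.sum() * widths)

print(np.allclose(density, manual_density))
print(np.isclose((density * widths).sum(), 1.0))
```

Because density depends on bin width, the bar heights themselves are not proportions; only height times width gives the share of observations in a bin.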
Practical Analysis
Hourly Rental Pattern Analysis
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
bikes_hour = pd.read_csv('hour.csv')
# Analyze rentals during different hour ranges
hour_ranges = {
    'Night (0-6)': bikes_hour[bikes_hour['hr'] < 6]['cnt'],
    'Morning (6-12)': bikes_hour[(bikes_hour['hr'] >= 6) & (bikes_hour['hr'] < 12)]['cnt'],
    'Afternoon (12-18)': bikes_hour[(bikes_hour['hr'] >= 12) & (bikes_hour['hr'] < 18)]['cnt'],
    'Evening (18-24)': bikes_hour[bikes_hour['hr'] >= 18]['cnt']
}
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
colors = ['darkblue', 'gold', 'orange', 'purple']
for ax, (period, data), color in zip(axes.flat, hour_ranges.items(), colors):
    ax.hist(data, bins=30, color=color, alpha=0.7, edgecolor='black')
    ax.set_xlabel('Bike Rentals')
    ax.set_ylabel('Frequency')
    ax.set_title(f'{period}\nMean: {data.mean():.0f}, Median: {data.median():.0f}')
    ax.grid(True, alpha=0.3, axis='y')
    ax.axvline(data.mean(), color='red', linestyle='--', linewidth=2, alpha=0.7)
plt.tight_layout()
plt.show()
# Summary statistics
print("\nRENTAL DISTRIBUTION BY TIME OF DAY")
print("=" * 60)
for period, data in hour_ranges.items():
    print(f"\n{period}:")
    print(f" Mean: {data.mean():6.1f}")
    print(f" Median: {data.median():6.1f}")
    print(f" Std: {data.std():6.1f}")
    print(f" Range: {data.min():.0f} to {data.max():.0f}")
RENTAL DISTRIBUTION BY TIME OF DAY
============================================================
Night (0-6):
Mean: 18.9
Median: 15.0
Std: 22.4
Range: 1 to 200
Morning (6-12):
Mean: 277.5
Median: 231.0
Std: 180.9
Range: 2 to 977
Afternoon (12-18):
Mean: 304.6
Median: 287.0
Std: 167.1
Range: 3 to 945
Evening (18-24):
Mean: 215.4
Median: 192.0
Std: 157.1
Range: 2 to 757
Summary
You learned to create and interpret histograms:
- plt.hist() creates histograms for continuous data
- Bins divide data into ranges; choose 20-50 for most cases
- bins='auto' lets matplotlib choose appropriate binning
- Histogram shape reveals distribution pattern (normal, skewed, etc.)
- Customization: color, alpha, edgecolor, linewidth
- Statistical lines: add mean, median with plt.axvline()
- Density histograms rescale bar heights so the total bar area equals 1
- Histograms ≠ bar plots: continuous vs categorical data
Next Steps: In the next lesson, you will learn to compare multiple distributions side-by-side and analyze differences between groups.
Practice: Create histograms for different weather variables (humidity, windspeed). What distribution shapes do you observe?