Lesson 6 - Creating Scatter Plots
What is a Scatter Plot?
You have used line plots to show how one variable changes over time. Now you will learn about scatter plots, a tool for exploring the relationship between two variables.
By the end of this lesson, you will be able to:
- Create scatter plots using plt.scatter()
- Understand when to use scatter plots vs line plots
- Read and interpret scatter plot patterns
- Identify positive, negative, and no relationship patterns
- Customize scatter plot appearance
- Use scatter plots to test hypotheses about data relationships
Scatter plots help answer questions like: Does temperature affect bike rentals? Does price affect demand?
Line Plots vs Scatter Plots
Line Plots
Purpose: Show how one variable changes over time
Time → Value
Structure:
- X-axis: Time (dates, months, hours)
- Y-axis: Some measurement (rentals, sales, temperature)
- Points connected by lines
- Shows trends over time
Example questions:
- How do bike rentals change throughout the year?
- What is the hourly pattern of website traffic?
- How has temperature varied over the past decade?
Scatter Plots
Purpose: Show the relationship between two variables
Variable 1 ↔ Variable 2
Structure:
- X-axis: First variable (temperature, price, age)
- Y-axis: Second variable (rentals, sales, income)
- Points not connected by lines
- Each point = one observation
- Shows relationships between variables
Example questions:
- Does temperature affect bike rentals?
- Does advertising spending increase sales?
- Does study time improve test scores?
Creating Your First Scatter Plot
Basic Scatter Plot
import pandas as pd
import matplotlib.pyplot as plt
bikes = pd.read_csv('day.csv')
# Create scatter plot: temperature vs rentals
plt.scatter(bikes['temp'], bikes['cnt'])
plt.xlabel('Normalized Temperature')
plt.ylabel('Number of Bike Rentals')
plt.title('Temperature vs Bike Rentals')
plt.show()
Output: Each point represents one day. The x-position shows temperature and the y-position shows rentals.
Syntax:
plt.scatter(x_values, y_values)
- x_values: First variable (independent variable)
- y_values: Second variable (dependent variable)
- Each matching pair creates one point
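A minimal sketch of how the pairing works, using a small made-up dataset (the values are illustrative and not taken from day.csv). The first x value is paired with the first y value, the second with the second, and so on, so five pairs give five points.

```python
import matplotlib.pyplot as plt

# Made-up values for illustration only (not from day.csv)
temperatures = [0.2, 0.35, 0.5, 0.65, 0.8]   # x values
rentals = [1500, 2800, 4100, 5200, 6000]     # y values

# Both sequences must have the same length: one (x, y) pair per point
points = plt.scatter(temperatures, rentals)
plt.xlabel('Normalized Temperature')
plt.ylabel('Number of Bike Rentals')
plt.title('Five Pairs, Five Points')
plt.show()

# scatter() returns a collection; its offsets are the plotted (x, y) pairs
print(len(points.get_offsets()))  # prints 5
```

If the two sequences have different lengths, matplotlib raises an error instead of drawing the plot.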
Understanding Normalized Temperature
In the bike dataset, temp is normalized to a 0-1 scale:
- 0.0 = Coldest temperature
- 0.5 = Medium temperature
- 1.0 = Hottest temperature
You don’t need exact degrees. The pattern is what matters!
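As a rough illustration of what a 0-1 scale means, the sketch below applies min-max scaling to a handful of made-up Celsius readings. It assumes the scale was produced this way; the exact constants used to prepare the bike dataset are not given in this lesson.

```python
# Made-up Celsius readings for illustration (not from day.csv)
celsius = [-5.0, 3.0, 12.0, 21.0, 30.0]

t_min = min(celsius)
t_max = max(celsius)

# Min-max scaling: the coldest value maps to 0.0, the hottest to 1.0
normalized = [(t - t_min) / (t_max - t_min) for t in celsius]

for t, n in zip(celsius, normalized):
    print(f'{t:6.1f} °C -> {n:.2f}')
```

Scaling keeps the readings in the same order, which is why the pattern in a scatter plot looks the same whether the axis is in degrees or on a 0-1 scale.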
Reading Scatter Plot Patterns
When you look at a scatter plot, you look for patterns in how points are arranged.
Pattern 1: Positive Relationship
Appearance: Points slope upward from left to right ↗
Meaning: As X increases, Y increases
Visual:
High Y │            ●
       │         ●  ●
       │      ●  ●
       │   ●  ●
       │●  ●
 Low Y └──────────────
        Low X    High X
Real examples:
- Temperature ↑ → Bike rentals ↑
- Advertising ↑ → Sales ↑
- Study time ↑ → Test scores ↑
Pattern 2: Negative Relationship
Appearance: Points slope downward from left to right ↘
Meaning: As X increases, Y decreases
Visual:
High Y │●
       │  ●  ●
       │     ●  ●
       │        ●  ●
       │           ●
 Low Y └──────────────
        Low X    High X
Real examples:
- Price ↑ → Demand ↓
- Distance ↑ → Signal strength ↓
- Temperature ↑ → Heating costs ↓
Pattern 3: No Relationship
Appearance: Points randomly scattered (no clear slope)
Meaning: X and Y show no apparent relationship (knowing X tells you little about Y)
Visual:
High Y │  ●       ●
       │      ●      ●
       │●        ●
       │    ●       ●
       │        ●
 Low Y └──────────────
        Low X    High X
Real examples:
- Shoe size vs IQ (no connection)
- Day of week vs temperature
- Random unrelated variables
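The three patterns can be reproduced with synthetic data, which makes it easier to recognize them later in real data. This is a sketch on made-up numbers: y is deliberately built from x (or independently of it), so the pattern in each panel is known in advance. It uses NumPy, which this lesson has not otherwise needed.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)   # fixed seed so the figure is repeatable
x = rng.uniform(0, 1, size=100)
noise = rng.normal(0, 0.1, size=100)

# Made-up data: each y is constructed so its pattern is known in advance
y_positive = x + noise                  # rises as x rises
y_negative = 1 - x + noise              # falls as x rises
y_none = rng.uniform(0, 1, size=100)    # generated independently of x

panels = [
    (y_positive, 'Positive Relationship'),
    (y_negative, 'Negative Relationship'),
    (y_none, 'No Relationship'),
]

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, (y, label) in zip(axes, panels):
    ax.scatter(x, y, alpha=0.6, s=30)
    ax.set_title(label)
    ax.set_xlabel('X')
    ax.set_ylabel('Y')

plt.tight_layout()
plt.show()
```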
Testing Hypotheses with Scatter Plots
Hypothesis 1: Temperature and Rentals
Question: Does temperature affect bike rentals?
import pandas as pd
import matplotlib.pyplot as plt
bikes = pd.read_csv('day.csv')
plt.figure(figsize=(10, 6))
plt.scatter(bikes['temp'], bikes['cnt'], alpha=0.6, s=40)
plt.xlabel('Normalized Temperature', fontsize=12)
plt.ylabel('Daily Bike Rentals', fontsize=12)
plt.title('Temperature vs Bike Rentals', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.show()
Answer: Clear upward pattern → Positive relationship → Warmer days have more rentals! ✅
Hypothesis 2: Humidity and Rentals
Question: Does humidity affect bike rentals?
import pandas as pd
import matplotlib.pyplot as plt
bikes = pd.read_csv('day.csv')
plt.figure(figsize=(10, 6))
plt.scatter(bikes['hum'], bikes['cnt'], alpha=0.6, s=40, color='blue')
plt.xlabel('Normalized Humidity', fontsize=12)
plt.ylabel('Daily Bike Rentals', fontsize=12)
plt.title('Humidity vs Bike Rentals', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.show()
Answer: Slight downward pattern → Weak negative relationship → More humidity, slightly fewer rentals.
Hypothesis 3: Wind Speed and Rentals
Question: Does wind speed affect bike rentals?
import pandas as pd
import matplotlib.pyplot as plt
bikes = pd.read_csv('day.csv')
plt.figure(figsize=(10, 6))
plt.scatter(bikes['windspeed'], bikes['cnt'], alpha=0.6, s=40, color='green')
plt.xlabel('Normalized Wind Speed', fontsize=12)
plt.ylabel('Daily Bike Rentals', fontsize=12)
plt.title('Wind Speed vs Bike Rentals', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.show()
Answer: Weak downward pattern → Weak negative relationship → Windier days discourage biking slightly.
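The three hypothesis plots differ only in the column and labels, so the shared steps can be wrapped in a small function. The sketch below is one possible way to do that; `plot_relationship` is a hypothetical helper name, and the tiny table stands in for day.csv with invented values.

```python
import pandas as pd
import matplotlib.pyplot as plt

def plot_relationship(data, x_col, y_col, x_label, y_label, color='steelblue'):
    """Scatter one column against another, using the styling from this lesson."""
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.scatter(data[x_col], data[y_col], alpha=0.6, s=40, color=color)
    ax.set_xlabel(x_label, fontsize=12)
    ax.set_ylabel(y_label, fontsize=12)
    ax.set_title(f'{x_label} vs {y_label}', fontsize=14, fontweight='bold')
    ax.grid(True, alpha=0.3)
    return ax

# Tiny made-up table standing in for day.csv (same column names, invented values)
sample = pd.DataFrame({
    'temp': [0.2, 0.4, 0.6, 0.8],
    'hum': [0.8, 0.6, 0.7, 0.5],
    'windspeed': [0.30, 0.20, 0.25, 0.10],
    'cnt': [1200, 3100, 4800, 6100],
})

ax = plot_relationship(sample, 'temp', 'cnt',
                       'Normalized Temperature', 'Daily Bike Rentals')
plt.show()
```

With the real dataset, passing `bikes` in place of `sample` and changing the column name to 'hum' or 'windspeed' would produce the other two plots.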
Customizing Scatter Plots
Point Size
Control point size with the s parameter:
import pandas as pd
import matplotlib.pyplot as plt
bikes = pd.read_csv('day.csv')
plt.figure(figsize=(10, 6))
# Small points
plt.scatter(bikes['temp'], bikes['cnt'], s=10, alpha=0.6)
plt.xlabel('Temperature')
plt.ylabel('Bike Rentals')
plt.title('Small Points (s=10)')
plt.show()
Common sizes:
- s=10 → Small points (good for many data points)
- s=50 → Medium points (default-ish)
- s=100 → Large points (good for small datasets)
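One detail worth knowing: in matplotlib, `s` is the marker area in points squared, not the diameter, so doubling `s` does not double the width of a point. The sketch below draws the three sizes listed above side by side on made-up data.

```python
import matplotlib.pyplot as plt

# Made-up values for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 7, 9]
sizes = [10, 50, 100]

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, size in zip(axes, sizes):
    # s is an area (points squared), so s=100 looks about 3x as wide as s=10
    ax.scatter(x, y, s=size, alpha=0.6)
    ax.set_title(f's={size}')

plt.tight_layout()
plt.show()
```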
Point Color
import pandas as pd
import matplotlib.pyplot as plt
bikes = pd.read_csv('day.csv')
plt.figure(figsize=(10, 6))
plt.scatter(bikes['temp'], bikes['cnt'], s=40, color='red', alpha=0.6)
plt.xlabel('Temperature')
plt.ylabel('Bike Rentals')
plt.title('Red Points')
plt.show()
Common colors: 'blue', 'red', 'green', 'orange', 'purple', 'black', 'gray'
Transparency (Alpha)
Control transparency with alpha (0 to 1):
import pandas as pd
import matplotlib.pyplot as plt
bikes = pd.read_csv('day.csv')
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# Low alpha (very transparent)
axes[0].scatter(bikes['temp'], bikes['cnt'], alpha=0.2, s=50)
axes[0].set_title('alpha=0.2 (very transparent)')
# Medium transparency
axes[1].scatter(bikes['temp'], bikes['cnt'], alpha=0.6, s=50)
axes[1].set_title('alpha=0.6 (medium)')
# Maximum alpha (fully opaque, no transparency)
axes[2].scatter(bikes['temp'], bikes['cnt'], alpha=1.0, s=50)
axes[2].set_title('alpha=1.0 (opaque)')
for ax in axes:
    ax.set_xlabel('Temperature')
    ax.set_ylabel('Bike Rentals')
    ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Recommendation: Use alpha=0.5 to alpha=0.7 for overlapping points.
Edge Color
Add borders to points:
import pandas as pd
import matplotlib.pyplot as plt
bikes = pd.read_csv('day.csv')
plt.figure(figsize=(10, 6))
plt.scatter(bikes['temp'], bikes['cnt'],
            s=60,
            color='skyblue',
            edgecolor='black',
            linewidth=0.5,
            alpha=0.7)
plt.xlabel('Temperature')
plt.ylabel('Bike Rentals')
plt.title('Points with Black Edges')
plt.grid(True, alpha=0.3)
plt.show()
Multiple Scatter Plots
Comparing Different Variables
import pandas as pd
import matplotlib.pyplot as plt
bikes = pd.read_csv('day.csv')
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Temperature
axes[0, 0].scatter(bikes['temp'], bikes['cnt'], alpha=0.5, s=30, color='red')
axes[0, 0].set_xlabel('Temperature')
axes[0, 0].set_ylabel('Bike Rentals')
axes[0, 0].set_title('Temperature vs Rentals')
axes[0, 0].grid(True, alpha=0.3)
# Humidity
axes[0, 1].scatter(bikes['hum'], bikes['cnt'], alpha=0.5, s=30, color='blue')
axes[0, 1].set_xlabel('Humidity')
axes[0, 1].set_ylabel('Bike Rentals')
axes[0, 1].set_title('Humidity vs Rentals')
axes[0, 1].grid(True, alpha=0.3)
# Wind Speed
axes[1, 0].scatter(bikes['windspeed'], bikes['cnt'], alpha=0.5, s=30, color='green')
axes[1, 0].set_xlabel('Wind Speed')
axes[1, 0].set_ylabel('Bike Rentals')
axes[1, 0].set_title('Wind Speed vs Rentals')
axes[1, 0].grid(True, alpha=0.3)
# Feels-Like Temperature
axes[1, 1].scatter(bikes['atemp'], bikes['cnt'], alpha=0.5, s=30, color='orange')
axes[1, 1].set_xlabel('Feels-Like Temperature')
axes[1, 1].set_ylabel('Bike Rentals')
axes[1, 1].set_title('Feels-Like Temp vs Rentals')
axes[1, 1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Comparison: Temperature has the strongest positive relationship. Wind and humidity show weak negative relationships.
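Because the four panels repeat the same steps, the grid can also be built with a loop. The sketch below is one way to write it, run on a small made-up table in place of day.csv (same column names, invented values).

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up rows standing in for day.csv (same column names, invented values)
sample = pd.DataFrame({
    'temp': [0.2, 0.4, 0.6, 0.8],
    'atemp': [0.22, 0.41, 0.58, 0.75],
    'hum': [0.8, 0.6, 0.7, 0.5],
    'windspeed': [0.30, 0.20, 0.25, 0.10],
    'cnt': [1200, 3100, 4800, 6100],
})

# (column, axis label, color) for each panel, in reading order
panels = [
    ('temp', 'Temperature', 'red'),
    ('hum', 'Humidity', 'blue'),
    ('windspeed', 'Wind Speed', 'green'),
    ('atemp', 'Feels-Like Temperature', 'orange'),
]

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# axes.flat walks the 2x2 grid row by row, so it pairs up with the list above
for ax, (column, label, color) in zip(axes.flat, panels):
    ax.scatter(sample[column], sample['cnt'], alpha=0.5, s=30, color=color)
    ax.set_xlabel(label)
    ax.set_ylabel('Bike Rentals')
    ax.set_title(f'{label} vs Rentals')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```

Adding another variable then means adding one entry to `panels` and adjusting the grid shape, rather than copying five more lines.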
Practical Interpretation
Strong vs Weak Relationships
Strong Relationship:
- Points form a tight pattern (close to an imaginary line)
- Easy to see the trend
- X is a good predictor of Y
High Y │       ●●
       │    ●●●
       │  ●●●
       │●●●
 Low Y └──────────
Weak Relationship:
- Points are widely scattered
- Trend exists but with lots of variation
- X is a poor predictor of Y
High Y │    ●   ●
       │  ●   ● ●
       │●  ●   ●
       │  ●  ●
 Low Y └──────────
Example Analysis
import pandas as pd
import matplotlib.pyplot as plt
bikes = pd.read_csv('day.csv')
plt.figure(figsize=(10, 6))
plt.scatter(bikes['temp'], bikes['cnt'], alpha=0.5, s=40, color='steelblue')
plt.xlabel('Normalized Temperature', fontsize=12)
plt.ylabel('Daily Bike Rentals', fontsize=12)
plt.title('Temperature vs Bike Rentals Analysis', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
# Add text annotation
plt.text(0.05, 8000,
         'Strong Positive Relationship:\n- Upward slope\n- Points fairly tight\n- Temperature is a good predictor',
         fontsize=11,
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))
plt.show()
Common Mistakes to Avoid
Mistake 1: Connecting Points with Lines
Wrong:
# DON'T do this for scatter plots
plt.plot(bikes['temp'], bikes['cnt'])  # This connects points!
Right:
# Use scatter for relationships
plt.scatter(bikes['temp'], bikes['cnt'])
Why: Lines imply order/sequence. Scatter plots show relationships, not sequences.
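To see the difference, the sketch below draws the same made-up observations both ways. The rows are in no particular order, as is usual when the data is not sorted by the x variable.

```python
import matplotlib.pyplot as plt

# Made-up observations in no particular order (invented values)
temp = [0.6, 0.2, 0.8, 0.4, 0.7, 0.3]
rentals = [4800, 1500, 6100, 3000, 5500, 2200]

fig, (left, right) = plt.subplots(1, 2, figsize=(12, 4))

# plot() joins the points in row order, so the line zigzags back and forth
left.plot(temp, rentals)
left.set_title('plt.plot: misleading zigzag')

# scatter() draws each observation on its own
right.scatter(temp, rentals)
right.set_title('plt.scatter: relationship is visible')

plt.tight_layout()
plt.show()
```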
Mistake 2: Wrong Variable on X-axis
Convention:
- X-axis: Independent variable (the “cause”)
- Y-axis: Dependent variable (the “effect”)
Examples:
- Temperature (X) affects Rentals (Y)
- Price (X) affects Demand (Y)
- Study time (X) affects Test score (Y)
Mistake 3: Ignoring Outliers
import pandas as pd
import matplotlib.pyplot as plt
bikes = pd.read_csv('day.csv')
plt.figure(figsize=(10, 6))
plt.scatter(bikes['temp'], bikes['cnt'], alpha=0.6, s=40)
plt.xlabel('Temperature')
plt.ylabel('Bike Rentals')
plt.title('Look for Outliers!')
plt.grid(True, alpha=0.3)
# Highlight potential outlier
outlier = bikes[bikes['cnt'] < 500]
plt.scatter(outlier['temp'], outlier['cnt'], color='red', s=100, edgecolor='black', linewidth=2)
plt.show()
Outliers are points far from the pattern. They might be:
- Data errors
- Special events (holidays, extreme weather)
- Important insights
Always investigate outliers!
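A first step in investigating is to look at the full rows behind the unusual points. The sketch below uses the same threshold as the highlighted points above, applied to a small made-up table; the `day` column is a hypothetical label added for illustration.

```python
import pandas as pd

# Made-up rows standing in for day.csv; 'day' is a hypothetical label column
sample = pd.DataFrame({
    'day': ['Day 1', 'Day 2', 'Day 3', 'Day 4', 'Day 5'],
    'temp': [0.45, 0.48, 0.44, 0.50, 0.47],
    'cnt': [4200, 4500, 300, 4700, 4400],
})

# Same threshold as the highlighted points above
outliers = sample[sample['cnt'] < 500]

# Print the full rows so they can be checked against other information
print(outliers)
```

Here Day 3 has a typical temperature but very few rentals, which is the kind of row worth checking for a data error or a special event.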
Summary
You learned to create and interpret scatter plots:
- Scatter plots show relationships between two variables
- Syntax: plt.scatter(x, y)
- Patterns:
  - Upward slope = Positive relationship (X↑ Y↑)
  - Downward slope = Negative relationship (X↑ Y↓)
  - Random scatter = No relationship
- Customization: Control size (s), color, transparency (alpha), edges
- Use cases: Testing hypotheses, exploring data relationships, comparing variables
- Interpretation: Look for pattern strength (tight vs scattered)
Next Steps: In the next lesson, you will learn to measure relationships mathematically using correlation coefficients.
Practice: Create a scatter plot showing humidity vs bike rentals. What pattern do you see? Is it positive, negative, or no relationship?