Lesson 6 - Creating Scatter Plots

What is a Scatter Plot?

You have used line plots to show how one variable changes over time. Now you will learn to use scatter plots, a tool for exploring the relationship between two variables.

By the end of this lesson, you will be able to:

  • Create scatter plots using plt.scatter()
  • Understand when to use scatter plots vs line plots
  • Read and interpret scatter plot patterns
  • Identify positive, negative, and no relationship patterns
  • Customize scatter plot appearance
  • Use scatter plots to test hypotheses about data relationships

Scatter plots help answer questions like: Does temperature affect bike rentals? Does price affect demand?


Line Plots vs Scatter Plots

Line Plots

Purpose: Show how one variable changes over time

Time → Value

Structure:

  • X-axis: Time (dates, months, hours)
  • Y-axis: Some measurement (rentals, sales, temperature)
  • Points connected by lines
  • Shows trends over time

Example questions:

  • How do bike rentals change throughout the year?
  • What is the hourly pattern of website traffic?
  • How has temperature varied over the past decade?

Scatter Plots

Purpose: Show the relationship between two variables

Variable 1 ↔ Variable 2

Structure:

  • X-axis: First variable (temperature, price, age)
  • Y-axis: Second variable (rentals, sales, income)
  • Points not connected by lines
  • Each point = one observation
  • Shows relationships between variables

Example questions:

  • Does temperature affect bike rentals?
  • Does advertising spending increase sales?
  • Does study time improve test scores?

Creating Your First Scatter Plot

Basic Scatter Plot

import pandas as pd
import matplotlib.pyplot as plt

bikes = pd.read_csv('day.csv')

# Create scatter plot: temperature vs rentals
plt.scatter(bikes['temp'], bikes['cnt'])
plt.xlabel('Normalized Temperature')
plt.ylabel('Number of Bike Rentals')
plt.title('Temperature vs Bike Rentals')
plt.show()

Output: Each point represents one day. X-position shows temperature, Y-position shows rentals.

Syntax:

plt.scatter(x_values, y_values)
  • x_values: First variable (independent variable)
  • y_values: Second variable (dependent variable)
  • Each matching pair creates one point
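To make the pairing concrete, here is a minimal sketch that uses two short, made-up lists instead of the bike dataset. The names hours_studied and test_scores are illustrative assumptions only.

```python
import matplotlib.pyplot as plt

# Made-up values for illustration only (not from day.csv)
hours_studied = [1, 2, 3, 4, 5]
test_scores = [52, 60, 63, 71, 78]

fig, ax = plt.subplots()
points = ax.scatter(hours_studied, test_scores)
ax.set_xlabel('Hours Studied')
ax.set_ylabel('Test Score')
ax.set_title('One Point per (x, y) Pair')

# The first x value pairs with the first y value, and so on
n_points = len(points.get_offsets())
print(n_points)  # 5

plt.show()
```

Both inputs need the same number of values. If the lengths differ, matplotlib raises a ValueError instead of drawing the plot.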

Understanding Normalized Temperature

In the bike dataset, temp is normalized to a 0-1 scale:

  • 0.0 = Coldest temperature
  • 0.5 = Medium temperature
  • 1.0 = Hottest temperature

You don’t need exact degrees. The pattern is what matters!
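If you are curious how values end up on a 0-1 scale, the sketch below shows one common approach, min-max scaling, on made-up Celsius readings. This is an assumption for illustration: the bike dataset's own documentation describes the exact scaling it used, which may differ from this generic formula.

```python
def min_max_normalize(values):
    """Rescale values so the smallest becomes 0.0 and the largest becomes 1.0."""
    lowest = min(values)
    highest = max(values)
    return [(value - lowest) / (highest - lowest) for value in values]


# Made-up temperatures in Celsius, for illustration only
temps_celsius = [-5.0, 10.0, 17.5, 25.0, 35.0]

normalized = min_max_normalize(temps_celsius)
print(normalized)  # [0.0, 0.375, 0.5625, 0.75, 1.0]
```

The order of the values is preserved, which is why the pattern in a scatter plot looks the same whether you plot degrees or normalized values.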


Reading Scatter Plot Patterns

When you read a scatter plot, look for patterns in how the points are arranged.

Pattern 1: Positive Relationship

Appearance: Points slope upward from left to right ↗

Meaning: As X increases, Y increases

Visual:

High Y │        ●
       │      ●   ●
       │    ●   ●
       │  ●   ●
       │●   ●
Low Y  └──────────────
      Low X   High X

Real examples:

  • Temperature ↑ → Bike rentals ↑
  • Advertising ↑ → Sales ↑
  • Study time ↑ → Test scores ↑

Pattern 2: Negative Relationship

Appearance: Points slope downward from left to right ↘

Meaning: As X increases, Y decreases

Visual:

High Y │●
       │  ●   ●
       │    ●   ●
       │      ●   ●
       │        ●
Low Y  └──────────────
      Low X   High X

Real examples:

  • Price ↑ → Demand ↓
  • Distance ↑ → Signal strength ↓
  • Temperature ↑ → Heating costs ↓

Pattern 3: No Relationship

Appearance: Points randomly scattered (no clear slope)

Meaning: Knowing X tells you little or nothing about Y

Visual:

High Y │  ●     ●
       │    ●     ●
       │●     ●
       │     ●  ●
       │  ●
Low Y  └──────────────
      Low X   High X

Real examples:

  • Shoe size vs IQ (no connection)
  • Day of week vs temperature
  • Random unrelated variables
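The three patterns can also be seen side by side with synthetic data. The sketch below generates random numbers with NumPy (an assumption; NumPy is not used elsewhere in this lesson) and is not based on the bike dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data with a fixed seed so the picture is repeatable
rng = np.random.default_rng(seed=0)
x = rng.uniform(0, 1, size=200)
noise = rng.normal(0, 0.1, size=200)

y_positive = x + noise                 # Y rises as X rises
y_negative = 1 - x + noise             # Y falls as X rises
y_none = rng.uniform(0, 1, size=200)   # Y ignores X

panels = [
    ('Positive Relationship', y_positive),
    ('Negative Relationship', y_negative),
    ('No Relationship', y_none),
]

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, (title, y) in zip(axes, panels):
    ax.scatter(x, y, alpha=0.6, s=20)
    ax.set_title(title)
    ax.set_xlabel('X')
    ax.set_ylabel('Y')

plt.tight_layout()
plt.show()
```

Try raising the noise level (the 0.1 above) and watch the first two patterns become harder to see.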

Testing Hypotheses with Scatter Plots

Hypothesis 1: Temperature and Rentals

Question: Does temperature affect bike rentals?

import pandas as pd
import matplotlib.pyplot as plt

bikes = pd.read_csv('day.csv')

plt.figure(figsize=(10, 6))
plt.scatter(bikes['temp'], bikes['cnt'], alpha=0.6, s=40)
plt.xlabel('Normalized Temperature', fontsize=12)
plt.ylabel('Daily Bike Rentals', fontsize=12)
plt.title('Temperature vs Bike Rentals', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.show()

Answer: Clear upward pattern → Positive relationship → Warmer days have more rentals! ✅

Hypothesis 2: Humidity and Rentals

Question: Does humidity affect bike rentals?

import pandas as pd
import matplotlib.pyplot as plt

bikes = pd.read_csv('day.csv')

plt.figure(figsize=(10, 6))
plt.scatter(bikes['hum'], bikes['cnt'], alpha=0.6, s=40, color='blue')
plt.xlabel('Normalized Humidity', fontsize=12)
plt.ylabel('Daily Bike Rentals', fontsize=12)
plt.title('Humidity vs Bike Rentals', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.show()

Answer: Slight downward pattern → Weak negative relationship → More humidity, slightly fewer rentals.

Hypothesis 3: Wind Speed and Rentals

Question: Does wind speed affect bike rentals?

import pandas as pd
import matplotlib.pyplot as plt

bikes = pd.read_csv('day.csv')

plt.figure(figsize=(10, 6))
plt.scatter(bikes['windspeed'], bikes['cnt'], alpha=0.6, s=40, color='green')
plt.xlabel('Normalized Wind Speed', fontsize=12)
plt.ylabel('Daily Bike Rentals', fontsize=12)
plt.title('Wind Speed vs Bike Rentals', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.show()

Answer: Weak downward pattern → Weak negative relationship → Windier days discourage biking slightly.
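The three hypothesis plots above repeat almost the same code. One possible way to reduce the repetition is a small helper function. In the sketch below, plot_against_rentals is a hypothetical name, and the tiny DataFrame is made-up data standing in for day.csv so the example runs on its own.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for pd.read_csv('day.csv'); these values are made up for illustration
bikes = pd.DataFrame({
    'temp': [0.20, 0.35, 0.50, 0.65, 0.80],
    'hum': [0.80, 0.60, 0.70, 0.50, 0.55],
    'cnt': [1500, 3200, 4500, 6000, 7100],
})


def plot_against_rentals(data, column, label, color='steelblue'):
    """Scatter one column against daily rentals (hypothetical helper)."""
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.scatter(data[column], data['cnt'], alpha=0.6, s=40, color=color)
    ax.set_xlabel(label, fontsize=12)
    ax.set_ylabel('Daily Bike Rentals', fontsize=12)
    ax.set_title(f'{label} vs Bike Rentals', fontsize=14, fontweight='bold')
    ax.grid(True, alpha=0.3)
    return ax


temp_ax = plot_against_rentals(bikes, 'temp', 'Normalized Temperature')
hum_ax = plot_against_rentals(bikes, 'hum', 'Normalized Humidity', color='blue')
plt.show()
```

With the real dataset loaded, each new hypothesis becomes a one-line call, which makes it easier to compare variables consistently.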


Customizing Scatter Plots

Point Size

Control point size with the s parameter, which sets the marker area in points squared:

import pandas as pd
import matplotlib.pyplot as plt

bikes = pd.read_csv('day.csv')

plt.figure(figsize=(10, 6))

# Small points
plt.scatter(bikes['temp'], bikes['cnt'], s=10, alpha=0.6)

plt.xlabel('Temperature')
plt.ylabel('Bike Rentals')
plt.title('Small Points (s=10)')
plt.show()

Common sizes:

  • s=10 → Small points (good for many data points)
  • s=50 → Medium points (close to matplotlib's default of 36)
  • s=100 → Large points (good for small datasets)
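To compare sizes directly, the sketch below draws four points in one call by passing a list to s. The positions are made up. Because s describes marker area, a point's width grows with the square root of s, so s=400 looks twice as wide as s=100, not four times.

```python
import matplotlib.pyplot as plt

sizes = [10, 50, 100, 400]
x_positions = [1, 2, 3, 4]   # made-up positions, one per size
y_positions = [1, 1, 1, 1]

fig, ax = plt.subplots(figsize=(8, 3))
points = ax.scatter(x_positions, y_positions, s=sizes, alpha=0.6)
ax.set_xlim(0, 5)
ax.set_title('s = 10, 50, 100, 400')

# s is an area, so the visible width scales with the square root of s
widths = [size ** 0.5 for size in sizes]
print(widths[2], widths[3])  # 10.0 20.0

plt.show()
```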

Point Color

import pandas as pd
import matplotlib.pyplot as plt

bikes = pd.read_csv('day.csv')

plt.figure(figsize=(10, 6))
plt.scatter(bikes['temp'], bikes['cnt'], s=40, color='red', alpha=0.6)
plt.xlabel('Temperature')
plt.ylabel('Bike Rentals')
plt.title('Red Points')
plt.show()

Common colors: 'blue', 'red', 'green', 'orange', 'purple', 'black', 'gray'

Transparency (Alpha)

Control transparency with alpha (0 to 1):

import pandas as pd
import matplotlib.pyplot as plt

bikes = pd.read_csv('day.csv')

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# High transparency (faint points)
axes[0].scatter(bikes['temp'], bikes['cnt'], alpha=0.2, s=50)
axes[0].set_title('alpha=0.2 (very transparent)')

# Medium transparency
axes[1].scatter(bikes['temp'], bikes['cnt'], alpha=0.6, s=50)
axes[1].set_title('alpha=0.6 (medium)')

# No transparency (fully opaque)
axes[2].scatter(bikes['temp'], bikes['cnt'], alpha=1.0, s=50)
axes[2].set_title('alpha=1.0 (opaque)')

for ax in axes:
    ax.set_xlabel('Temperature')
    ax.set_ylabel('Bike Rentals')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Recommendation: Use alpha=0.5 to alpha=0.7 for overlapping points.

Edge Color

Add borders to points:

import pandas as pd
import matplotlib.pyplot as plt

bikes = pd.read_csv('day.csv')

plt.figure(figsize=(10, 6))
plt.scatter(bikes['temp'], bikes['cnt'],
            s=60,
            color='skyblue',
            edgecolor='black',
            linewidth=0.5,
            alpha=0.7)
plt.xlabel('Temperature')
plt.ylabel('Bike Rentals')
plt.title('Points with Black Edges')
plt.grid(True, alpha=0.3)
plt.show()

Multiple Scatter Plots

Comparing Different Variables

import pandas as pd
import matplotlib.pyplot as plt

bikes = pd.read_csv('day.csv')

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Temperature
axes[0, 0].scatter(bikes['temp'], bikes['cnt'], alpha=0.5, s=30, color='red')
axes[0, 0].set_xlabel('Temperature')
axes[0, 0].set_ylabel('Bike Rentals')
axes[0, 0].set_title('Temperature vs Rentals')
axes[0, 0].grid(True, alpha=0.3)

# Humidity
axes[0, 1].scatter(bikes['hum'], bikes['cnt'], alpha=0.5, s=30, color='blue')
axes[0, 1].set_xlabel('Humidity')
axes[0, 1].set_ylabel('Bike Rentals')
axes[0, 1].set_title('Humidity vs Rentals')
axes[0, 1].grid(True, alpha=0.3)

# Wind Speed
axes[1, 0].scatter(bikes['windspeed'], bikes['cnt'], alpha=0.5, s=30, color='green')
axes[1, 0].set_xlabel('Wind Speed')
axes[1, 0].set_ylabel('Bike Rentals')
axes[1, 0].set_title('Wind Speed vs Rentals')
axes[1, 0].grid(True, alpha=0.3)

# Feels-Like Temperature
axes[1, 1].scatter(bikes['atemp'], bikes['cnt'], alpha=0.5, s=30, color='orange')
axes[1, 1].set_xlabel('Feels-Like Temperature')
axes[1, 1].set_ylabel('Bike Rentals')
axes[1, 1].set_title('Feels-Like Temp vs Rentals')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Comparison: The two temperature measures show the strongest positive relationships. Wind speed and humidity show weak negative relationships.


Practical Interpretation

Strong vs Weak Relationships

Strong Relationship:

  • Points form a tight pattern (close to an imaginary line)
  • Easy to see the trend
  • X is a good predictor of Y

High Y │      ●●
       │    ●●●
       │  ●●●
       │●●●
Low Y  └──────────

Weak Relationship:

  • Points are widely scattered
  • Trend exists but with lots of variation
  • X is a poor predictor of Y

High Y │    ●   ●
       │  ●   ●   ●
       │●   ●   ●
       │  ●   ●
Low Y  └──────────
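The difference between strong and weak can be reproduced with synthetic data. In the sketch below, both panels follow the same underlying trend (y = 2x), and only the amount of random noise differs. The data is generated with NumPy and is not from the bike dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data with a fixed seed so the picture is repeatable
rng = np.random.default_rng(seed=1)
x = rng.uniform(0, 1, size=150)
y_strong = 2 * x + rng.normal(0, 0.1, size=150)   # little noise: tight pattern
y_weak = 2 * x + rng.normal(0, 0.8, size=150)     # lots of noise: wide scatter

fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharey=True)
axes[0].scatter(x, y_strong, alpha=0.6, s=30)
axes[0].set_title('Strong Relationship')
axes[1].scatter(x, y_weak, alpha=0.6, s=30)
axes[1].set_title('Weak Relationship')
for ax in axes:
    ax.set_xlabel('X')
    ax.grid(True, alpha=0.3)
axes[0].set_ylabel('Y')

# How far points sit from the trend line y = 2x, on average
spread_strong = np.std(y_strong - 2 * x)
spread_weak = np.std(y_weak - 2 * x)
print(spread_strong < spread_weak)  # True

plt.tight_layout()
plt.show()
```

The trend is identical in both panels. What changes is how much a single X value tells you about Y.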

Example Analysis

import pandas as pd
import matplotlib.pyplot as plt

bikes = pd.read_csv('day.csv')

plt.figure(figsize=(10, 6))
plt.scatter(bikes['temp'], bikes['cnt'], alpha=0.5, s=40, color='steelblue')
plt.xlabel('Normalized Temperature', fontsize=12)
plt.ylabel('Daily Bike Rentals', fontsize=12)
plt.title('Temperature vs Bike Rentals Analysis', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)

# Add text annotation
plt.text(0.05, 8000,
         'Strong Positive Relationship:\n- Upward slope\n- Points fairly tight\n- Temperature is a good predictor',
         fontsize=11,
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

plt.show()

Common Mistakes to Avoid

Mistake 1: Connecting Points with Lines

Wrong:

# DON'T do this for scatter plots
plt.plot(bikes['temp'], bikes['cnt'])  # This connects points!

Right:

# Use scatter for relationships
plt.scatter(bikes['temp'], bikes['cnt'])

Why: Lines imply order/sequence. Scatter plots show relationships, not sequences.
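The sketch below shows the problem on a handful of made-up values. The rows are in date order, not sorted by temperature, so plt.plot draws a zigzag that jumps back and forth across the chart.

```python
import matplotlib.pyplot as plt

# Made-up values in row (date) order, not sorted by temperature
temps = [0.30, 0.70, 0.20, 0.90, 0.50]
rentals = [3000, 6000, 2000, 6500, 4500]

fig, (ax_line, ax_scatter) = plt.subplots(1, 2, figsize=(12, 4))

# plot() joins the points in the order they appear in the data
ax_line.plot(temps, rentals)
ax_line.set_title('plot(): misleading zigzag')

# scatter() draws each observation on its own
ax_scatter.scatter(temps, rentals)
ax_scatter.set_title('scatter(): one point per day')

for ax in (ax_line, ax_scatter):
    ax.set_xlabel('Temperature')
    ax.set_ylabel('Bike Rentals')

plt.tight_layout()
plt.show()
```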

Mistake 2: Wrong Variable on X-axis

Convention:

  • X-axis: Independent variable (the “cause”)
  • Y-axis: Dependent variable (the “effect”)

Examples:

  • Temperature (X) affects Rentals (Y)
  • Price (X) affects Demand (Y)
  • Study time (X) affects Test score (Y)

Mistake 3: Ignoring Outliers

import pandas as pd
import matplotlib.pyplot as plt

bikes = pd.read_csv('day.csv')

plt.figure(figsize=(10, 6))
plt.scatter(bikes['temp'], bikes['cnt'], alpha=0.6, s=40)
plt.xlabel('Temperature')
plt.ylabel('Bike Rentals')
plt.title('Look for Outliers!')
plt.grid(True, alpha=0.3)

# Highlight potential outliers: days with fewer than 500 rentals
outliers = bikes[bikes['cnt'] < 500]
plt.scatter(outliers['temp'], outliers['cnt'], color='red', s=100, edgecolor='black', linewidth=2)

plt.show()

Outliers are points far from the pattern. They might be:

  • Data errors
  • Special events (holidays, extreme weather)
  • Important insights

Always investigate outliers!
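A simple way to start investigating is to print the rows behind the unusual points. The sketch below uses a small made-up DataFrame in place of day.csv and assumes the date column is named dteday. Check the column names in your own file before relying on it.

```python
import pandas as pd

# Stand-in for pd.read_csv('day.csv'); dates and values are made up
bikes = pd.DataFrame({
    'dteday': ['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04'],
    'temp': [0.45, 0.44, 0.47, 0.50],
    'cnt': [4200, 310, 5100, 4800],
})

# Same threshold as the plot above: fewer than 500 rentals
low_days = bikes[bikes['cnt'] < 500]

# Show which days these are, so you can look up what happened
print(low_days[['dteday', 'temp', 'cnt']])
```

Once you know the dates, you can check other columns in those rows, or outside sources, for an explanation such as a storm or a holiday.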


Summary

You learned to create and interpret scatter plots:

  • Scatter plots show relationships between two variables
  • Syntax: plt.scatter(x, y)
  • Patterns:
    • Upward slope = Positive relationship (X↑ Y↑)
    • Downward slope = Negative relationship (X↑ Y↓)
    • Random scatter = No relationship
  • Customization: Control size (s), color, transparency (alpha), edges
  • Use cases: Testing hypotheses, exploring data relationships, comparing variables
  • Interpretation: Look for pattern strength (tight vs scattered)

Next Steps: In the next lesson, you will learn to measure relationships mathematically using correlation coefficients.

Practice: Create a scatter plot showing humidity vs bike rentals. What pattern do you see? Is it positive, negative, or no relationship?