Lesson 5 - Scatter Plots Basics
Exploring Relationships Between Variables
You can visualize trends over time with line plots. Now you will learn scatter plots—the essential tool for exploring relationships between two variables.
By the end of this lesson, you will be able to:
- Create scatter plots using plt.scatter()
- Understand when to use scatter plots vs line plots
- Read and interpret scatter plot patterns
- Identify positive, negative, and no relationship patterns
- Use scatter plots to test hypotheses about data relationships
- Customize scatter plot appearance
Scatter plots reveal whether variables are related, helping you test hypotheses and discover insights.
What is a Scatter Plot?
Line Plots vs Scatter Plots
In previous lessons, we used line plots to show how ONE variable changes over time:
- X-axis: Time (dates, months, years)
- Y-axis: Some value (rentals, sales, temperature)
- Shows trends and patterns over time
But what if we want to explore the relationship between TWO variables?
Examples:
- Does temperature affect bike rentals?
- Does advertising spending increase sales?
- Does study time improve test scores?
- Does price affect demand?
For this, we use scatter plots.
Scatter Plot Definition
A graph where each point represents one observation with:
- X-coordinate: Value of first variable
- Y-coordinate: Value of second variable
Unlike line plots that connect points in order, scatter plots show all points simultaneously to reveal patterns in their relationship.
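To make the difference concrete, here is a minimal sketch that draws the same five observations twice, once with a line plot and once with a scatter plot. The study-hours and test-score values are invented for illustration (they are not from the bike sharing dataset), and the side-by-side layout uses plt.subplots(), which returns axes objects whose methods mirror the plt functions.

```python
import matplotlib.pyplot as plt

# Hypothetical observations, listed in no particular order (not from day.csv)
study_hours = [4, 1, 3, 5, 2]
test_scores = [78, 52, 70, 88, 61]

fig, (left, right) = plt.subplots(1, 2, figsize=(9, 4))

# Line plot: connects the points in the order they appear in the lists
left.plot(study_hours, test_scores)
left.set_title('Line plot (connects points in list order)')

# Scatter plot: draws one unconnected marker per observation
points = right.scatter(study_hours, test_scores)
right.set_title('Scatter plot (one marker per observation)')

for ax in (left, right):
    ax.set_xlabel('Study Hours')
    ax.set_ylabel('Test Score')

plt.tight_layout()
plt.show()
```

Because the observations are not sorted, the line plot zigzags back and forth, while the scatter plot shows the upward pattern directly.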
Creating Your First Scatter Plot
Let’s test a hypothesis: Does temperature affect bike rentals?
We will use plt.scatter() instead of plt.plot():
import pandas as pd
import matplotlib.pyplot as plt
# Load bike sharing data
bikes = pd.read_csv('day.csv')
# Create scatter plot: temperature vs rentals
plt.scatter(bikes['temp'], bikes['cnt'])
plt.title('Temperature vs Bike Rentals')
plt.xlabel('Normalized Temperature')
plt.ylabel('Number of Rentals')
plt.show()
Understanding the Syntax
plt.scatter(x_values, y_values)
- x_values: First variable (temperature)
- y_values: Second variable (rentals)
- Each point represents one day in the dataset
Note about temperature: The temp column is “normalized” (0 to 1 scale). The exact values do not matter—what matters is the pattern.
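The sketch below illustrates why the scale does not change the pattern. It rescales a handful of made-up Celsius readings to the 0 to 1 range using min-max scaling, which is one common normalization method and an assumption here; the dataset's exact formula may differ. The rental counts are also invented.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical readings (not from day.csv): temperature in Celsius and rentals
temp_celsius = np.array([2.0, 8.0, 14.0, 20.0, 26.0, 32.0])
rentals = np.array([1200, 2100, 3400, 4600, 5200, 5900])

# Min-max scaling: the smallest value maps to 0, the largest maps to 1.
# This is an assumed method; the dataset's exact formula may differ.
temp_range = temp_celsius.max() - temp_celsius.min()
temp_normalized = (temp_celsius - temp_celsius.min()) / temp_range

fig, (left, right) = plt.subplots(1, 2, figsize=(9, 4))
left.scatter(temp_celsius, rentals)
left.set_xlabel('Temperature (°C)')
right.scatter(temp_normalized, rentals)
right.set_xlabel('Normalized Temperature')
for ax in (left, right):
    ax.set_ylabel('Number of Rentals')

plt.tight_layout()
plt.show()
```

Only the numbers on the X-axis differ between the two panels. The arrangement of the points, and therefore the relationship you read from it, is the same.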
What Do You See?
Looking at the scatter plot:
- Points slope upward from left to right
- Higher temperature corresponds to more rentals
- Clear positive relationship
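- Consistent with the hypothesis: warmer days tend to have more rentals (a scatter plot shows an association; on its own it does not prove that temperature causes the change)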
Reading Scatter Plot Patterns
When you look at a scatter plot, you are looking for patterns in how the points are arranged.
Pattern 1: Positive Relationship
Points slope upward from left to right
- As X increases, Y increases
- Variables move together in the same direction
- Example: Temperature ↑ → Bike rentals ↑
Y
|         *
|      * *
|   * *
|*
+--------> X
Pattern 2: Negative Relationship
Points slope downward from left to right
- As X increases, Y decreases
- Variables move in opposite directions
- Example: Price ↑ → Demand ↓
Y
|*
|   * *
|      * *
|         *
+--------> X
Pattern 3: No Relationship
Points are randomly scattered
- No clear upward or downward pattern
- X and Y are unrelated
- Example: Shoe size vs IQ (no connection)
Y
|  *      *
|      *    *
| *     *
|    *     *
+--------> X
Strength of Relationship
Beyond direction, consider how tightly clustered the points are:
- Strong relationship: Points form a tight pattern
- Weak relationship: Points are more scattered
- No relationship: Points are completely random
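Direction and strength are easier to tell apart when you see them side by side. The following sketch generates synthetic data (random numbers with a fixed seed, not real measurements) for three cases: a strong positive relationship, a weak negative relationship, and no relationship. The slopes and noise levels are arbitrary choices made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)  # fixed seed so the figure is reproducible
x = rng.uniform(0, 1, 200)

# Synthetic Y values (assumed for illustration only)
strong_positive = 2 * x + rng.normal(0, 0.1, 200)  # little noise: tight upward band
weak_negative = -2 * x + rng.normal(0, 1.0, 200)   # lots of noise: loose downward cloud
no_relationship = rng.normal(0, 1.0, 200)          # Y is generated without using X

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
panels = [('Strong positive', strong_positive),
          ('Weak negative', weak_negative),
          ('No relationship', no_relationship)]
for ax, (title, y) in zip(axes, panels):
    ax.scatter(x, y, alpha=0.5)
    ax.set_title(title)
    ax.set_xlabel('X')
    ax.set_ylabel('Y')

plt.tight_layout()
plt.show()
```

The only difference between a strong and a weak relationship in this sketch is the amount of random noise added to Y: the more noise, the wider the cloud around the underlying slope.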
Testing Another Hypothesis
Let’s test: Does humidity affect bike rentals?
The hum column contains normalized humidity (0 = very dry, 1 = very humid).
import pandas as pd
import matplotlib.pyplot as plt
bikes = pd.read_csv('day.csv')
# Humidity vs rentals
plt.scatter(bikes['hum'], bikes['cnt'], alpha=0.5) # alpha makes points transparent
plt.title('Humidity vs Bike Rentals')
plt.xlabel('Normalized Humidity')
plt.ylabel('Number of Rentals')
plt.show()
What Pattern Do You See?
- Slight downward slope = Weak negative relationship
- High humidity (muggy, uncomfortable) → Slightly fewer rentals
- But the relationship is much weaker than temperature
- Points are more scattered
New Parameter: alpha
- alpha=0.5 makes points semi-transparent (50% opacity)
- Helps see overlapping points
- Values from 0 (invisible) to 1 (solid)
- Essential when you have many data points
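To see what alpha buys you, the sketch below plots the same 2,000 synthetic points twice: once with the default opaque markers and once with alpha=0.2. The data is random (fixed seed) and deliberately piled up around the center so that many points overlap; it is not from the bike sharing dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=1)  # fixed seed so the figure is reproducible
# 2,000 synthetic points clustered around (0.5, 0.5); not real data
x = rng.normal(0.5, 0.1, 2000)
y = rng.normal(0.5, 0.1, 2000)

fig, (left, right) = plt.subplots(1, 2, figsize=(9, 4))

solid = left.scatter(x, y)              # default: fully opaque markers
left.set_title('Default (opaque)')

faded = right.scatter(x, y, alpha=0.2)  # 20% opacity: dense areas appear darker
right.set_title('alpha=0.2')

plt.tight_layout()
plt.show()
```

In the opaque panel the center is a solid blob, so you cannot tell how many points are stacked there. With transparency, darker regions indicate where more points overlap.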
Customizing Scatter Plots
Like line plots, you can customize scatter plots with colors, sizes, and more:
import pandas as pd
import matplotlib.pyplot as plt
bikes = pd.read_csv('day.csv')
# Customized scatter plot
plt.scatter(bikes['temp'], bikes['cnt'],
            color='coral',        # Point color
            alpha=0.6,            # Transparency
            s=30,                 # Marker area in points squared (default is 36)
            edgecolors='black',   # Border color
            linewidths=0.5)       # Border width
plt.title('Temperature vs Bike Rentals (Customized)', fontsize=14)
plt.xlabel('Normalized Temperature', fontsize=12)
plt.ylabel('Number of Rentals', fontsize=12)
plt.grid(True, alpha=0.3) # Add subtle grid
plt.show()
Customization Options
| Parameter | Purpose | Example Values |
|---|---|---|
| color= | Point color | 'red', 'blue', '#FF5733' |
| alpha= | Transparency | 0.0 to 1.0 |
| s= | Size (points squared) | 20 (small), 100 (large) |
| edgecolors= | Border color | 'black', 'white' |
| linewidths= | Border thickness | 0.5, 1.0, 2.0 |
| marker= | Shape | 'o' (circle), 's' (square), '^' (triangle) |
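The marker= parameter is handy when one figure holds more than one group of points. Here is a small sketch that calls plt.scatter() twice, once per group, with a different marker and color each time. The weekday and weekend values are invented for illustration, and plt.legend() is used to label the groups.

```python
import matplotlib.pyplot as plt

# Hypothetical measurements for two groups (values invented for illustration)
weekday_temp = [0.2, 0.4, 0.6, 0.8]
weekday_rentals = [2000, 3500, 5000, 6000]
weekend_temp = [0.3, 0.5, 0.7]
weekend_rentals = [1500, 3000, 4800]

# Each call draws one group; marker= and color= keep the groups distinguishable
weekday_points = plt.scatter(weekday_temp, weekday_rentals,
                             marker='o', color='blue', s=60, label='Weekday')
weekend_points = plt.scatter(weekend_temp, weekend_rentals,
                             marker='^', color='#FF5733', s=60,
                             edgecolors='black', linewidths=1.0, label='Weekend')

plt.title('Markers Distinguish Groups')
plt.xlabel('Normalized Temperature')
plt.ylabel('Number of Rentals')
plt.legend()
plt.show()
```

Using a different shape as well as a different color keeps the groups readable even when the figure is printed in grayscale.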
When to Customize
- Too many overlapping points → use alpha
- Hard to see points → increase s
- Want to emphasize → add borders with edgecolors
- Large dataset → reduce size and increase transparency
Practice Exercises
Apply scatter plot techniques to explore data relationships.
Exercise 1: Wind Speed vs Rentals
Does wind speed affect bike rentals?
The windspeed column contains normalized wind speed (higher = windier).
Task:
- Create a scatter plot with windspeed on X-axis, rentals on Y-axis
- Add appropriate title and labels
- Use transparency (alpha=0.5)
- What pattern do you see?
- Does wind affect rentals? If yes, how?
# Your code here
Solution
import pandas as pd
import matplotlib.pyplot as plt
bikes = pd.read_csv('day.csv')
plt.scatter(bikes['windspeed'], bikes['cnt'], alpha=0.5, color='teal')
plt.title('Wind Speed vs Bike Rentals')
plt.xlabel('Normalized Wind Speed')
plt.ylabel('Number of Rentals')
plt.show()
print("\nPattern: Weak negative relationship")
print("Higher wind speed → Slightly fewer rentals")
print("But the effect is much weaker than temperature!")
Exercise 2: Casual vs Registered Users
Are casual and registered users correlated?
Hypothesis: On days when casual users rent more, registered users also rent more.
Task:
- Create scatter plot: casual (X) vs registered (Y)
- Use blue color with alpha=0.4
- Add title and labels
- What pattern appears?
- What does this tell you?
# Your code here
Solution
import pandas as pd
import matplotlib.pyplot as plt
bikes = pd.read_csv('day.csv')
plt.scatter(bikes['casual'], bikes['registered'],
alpha=0.4, color='blue', s=40)
plt.title('Casual Users vs Registered Users')
plt.xlabel('Casual Rentals')
plt.ylabel('Registered Rentals')
plt.show()
print("\nPattern: Positive relationship")
print("Days with high casual rentals also tend to have high registered rentals")
print("\nInsight:")
print("Both groups respond to the same factors (weather, season, etc.)")
print("Good weather increases BOTH casual and registered usage")
Exercise 3: Finding No Relationship
Not all variables are related! Let’s explore.
The dataset has a holiday column (0 = regular day, 1 = holiday).
Task:
- Create scatter plot: holiday (X) vs rentals (Y)
- What pattern do you see?
- Why might this not show a clear relationship?
Hint: This is tricky because X only has two values (0 and 1)!
# Your code here
Solution
import pandas as pd
import matplotlib.pyplot as plt
bikes = pd.read_csv('day.csv')
plt.scatter(bikes['holiday'], bikes['cnt'],
alpha=0.5, color='purple', s=50)
plt.title('Holiday vs Bike Rentals')
plt.xlabel('Holiday (0=No, 1=Yes)')
plt.ylabel('Number of Rentals')
plt.xticks([0, 1], ['Regular Day', 'Holiday']) # Better labels
plt.show()
print("\nPattern: Two vertical columns")
print("This happens when X is categorical (only 2 values)")
print("\nInsight:")
print("Holidays have a wide range of rentals (some high, some low)")
print("Regular days also have a wide range")
print("Holiday status alone doesn't strongly predict rentals")
print("\n(A bar chart or box plot would work better here!)")
Summary
You can now create and interpret scatter plots. Let’s review the key concepts.
Key Concepts
Scatter Plots
- Show relationships between TWO variables
- X-axis: First variable
- Y-axis: Second variable
- Each point: One observation
Syntax
plt.scatter(x_data, y_data)
- Simple and similar to plt.plot()
Three Main Patterns
- Positive relationship: Both increase together (upward slope)
- Negative relationship: One increases, other decreases (downward slope)
- No relationship: Random scatter, no pattern
Relationship Strength
- Strong: Points tightly clustered around pattern
- Weak: Points loosely scattered
- None: Completely random
When to Use Scatter Plots
- Exploring relationships between variables
- Testing hypotheses about correlations
- Both variables are continuous (numbers on a scale)
- NOT for showing trends over time (use line plots)
- NOT for categorical data (use bar charts)
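When X is a category with only a few values, a box plot is one alternative that summarizes each group instead of stacking points in columns. The sketch below uses plt.boxplot() on two made-up lists of daily rental counts (hypothetical values, not taken from day.csv), one list per category.

```python
import matplotlib.pyplot as plt

# Hypothetical daily rental counts grouped by a two-value category (not from day.csv)
regular_day_rentals = [3200, 4500, 5100, 4800, 3900, 6100, 4300]
holiday_rentals = [2100, 3800, 5600, 3000]

# One box per category: the line inside each box marks the median
box = plt.boxplot([regular_day_rentals, holiday_rentals])

# Boxes sit at X positions 1 and 2 by default, so label those positions
plt.xticks([1, 2], ['Regular Day', 'Holiday'])
plt.title('Rentals by Day Type')
plt.ylabel('Number of Rentals')
plt.show()
```

Each box shows the middle half of the values in its group, which makes it easier to compare typical rentals between categories than reading two vertical columns of overlapping points.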
Customization
- Use alpha when points overlap
- Adjust s (size) for better visibility
- Add edgecolors for emphasis
- Use color to make points stand out
Syntax Reference
# Basic scatter plot
plt.scatter(x, y)
# Customized scatter plot
plt.scatter(x, y,
            color='red',
            alpha=0.5,
            s=50,
            edgecolors='black',
            linewidths=0.5,
            marker='o')
plt.title('Title')
plt.xlabel('X Label')
plt.ylabel('Y Label')
plt.show()
Patterns Equal Insights
- Strong upward pattern → Strong positive relationship
- Weak pattern → Weak or no relationship
- No pattern → Variables are unrelated
- Tight clustering → Strong relationship
- Wide scatter → Weak relationship
Next Steps
You can now see relationships visually using scatter plots. But can you measure how strong a relationship is?
Questions:
- Is temperature’s effect on rentals “strong” or “weak”?
- Which weather variable has the strongest effect?
- Can we put a number on relationship strength?
In the next lesson, you will learn to customize scatter plots further with marker sizes, colors, and colorbars to encode additional dimensions.
Discover Relationships in Your Data
Scatter plots transform hypothesis testing from guesswork to visual analysis. You can now explore whether variables are related and understand the nature of their relationships.
Use scatter plots to test hypotheses and discover insights!