Lesson 7 - Understanding Correlation
Beyond Visual Patterns
You can create scatter plots to see relationships visually. Now you will learn to measure those relationships mathematically using correlation coefficients.
By the end of this lesson, you will be able to:
- Understand what correlation means
- Calculate Pearson correlation coefficients
- Interpret correlation values (-1 to +1 scale)
- Distinguish between positive, negative, and no correlation
- Use correlation to quantify relationship strength
- Combine scatter plots with correlation metrics
Correlation is one of the most important concepts in data analysis. It helps you move from “I see a pattern” to “I can measure that pattern.”
What is Correlation?
The Concept
Correlation measures how two variables move together:
- When one goes up, does the other go up? (positive correlation)
- When one goes up, does the other go down? (negative correlation)
- Do they move independently? (no correlation)
Real-World Example
Think about bike rentals and temperature:
- On hot days, more people rent bikes → positive correlation
- On rainy days (high humidity), fewer people rent bikes → negative correlation
- Day of the week and temperature? → probably no correlation
Visual Correlation
Scatter plots show correlation patterns:
[Figure: three scatter plot panels titled Strong Positive, Strong Negative, and No Correlation]
The tighter the points follow a line, the stronger the correlation.
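To make the three patterns concrete, here is a minimal sketch that generates synthetic data for each case and computes r with np.corrcoef(). The data, the seed, and the names (patterns, results) are made up for illustration and are not part of the bike dataset.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable
x = rng.uniform(0, 10, size=200)

# Three synthetic y variables (illustrative only, not from the bike data)
patterns = {
    "Strong positive": 2 * x + rng.normal(0, 1, size=200),   # rises with x
    "Strong negative": -2 * x + rng.normal(0, 1, size=200),  # falls as x rises
    "No correlation": rng.normal(0, 1, size=200),            # generated without x
}

results = {}
for label, y in patterns.items():
    results[label] = np.corrcoef(x, y)[0, 1]
    print(f"{label:16s} r = {results[label]:+.2f}")
```

The first two values should land close to +1 and -1, and the third close to 0. The exact numbers depend on the seed.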
The Correlation Coefficient
What is r?
The Pearson correlation coefficient (r) is a number between -1 and +1 that measures the strength and direction of the linear relationship between two variables.
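For reference, r can be computed directly from its textbook definition: the sum of the products of each variable's deviations from its mean, divided by the square root of the product of the summed squared deviations. The sketch below uses a hypothetical helper named pearson_r and two small made-up arrays, and compares the result with np.corrcoef().

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r from the definition (hypothetical helper for illustration)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Made-up example data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

print(round(pearson_r(x, y), 4))            # 0.7746
print(round(np.corrcoef(x, y)[0, 1], 4))    # 0.7746
```

In practice you would call np.corrcoef() rather than write this by hand. The manual version is only meant to show what the number summarizes.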
Scale interpretation:
-1.0  ←   -0.7  ←   -0.3  ←    0.0   →  +0.3   →  +0.7   →  +1.0
  │         │         │         │         │         │         │
Perfect   Strong    Weak      No        Weak      Strong    Perfect
Negative  Negative  Negative  Corr      Positive  Positive  Positive
Interpretation Guidelines
Strength of correlation:
- 0.0 to ±0.3: Weak or negligible
- ±0.3 to ±0.7: Moderate
- ±0.7 to ±1.0: Strong
- ±1.0: Perfect (all points on a line)
Sign meaning:
- Positive (+): Variables move in the same direction
- Negative (-): Variables move in opposite directions
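The guidelines above can be sketched as a small helper that turns a value of r into a label. The function name describe_correlation is hypothetical, and because the listed ranges share their endpoints, the handling of exactly 0.3 and 0.7 is an assumption made here (each goes to the stronger category).

```python
def describe_correlation(r):
    """Return a label such as 'moderate positive' for a correlation value r."""
    size = abs(r)
    # Assumption: boundary values 0.3 and 0.7 go to the stronger category
    if size >= 0.7:
        strength = "strong"
    elif size >= 0.3:
        strength = "moderate"
    else:
        strength = "weak"

    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        return "no correlation"
    return f"{strength} {direction}"

for value in (0.628, -0.235, -0.100):
    print(f"r = {value:+.3f}: {describe_correlation(value)}")
```

These cutoffs are rules of thumb. What counts as a strong correlation varies between fields, so treat the labels as a reading aid rather than a fixed standard.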
Calculating Correlation with NumPy
NumPy provides np.corrcoef() to calculate correlation coefficients.
Load the Bike Data
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Load bike sharing data
bikes = pd.read_csv('day.csv')
# Check first few rows
print(bikes[['temp', 'cnt']].head())
Output:
       temp   cnt
0  0.344167   985
1  0.363478   801
2  0.196364  1349
3  0.200000  1562
4  0.226957  1600
Calculate Correlation
# Calculate correlation between temperature and bike rentals
correlation_matrix = np.corrcoef(bikes['temp'], bikes['cnt'])
print("Correlation Matrix:")
print(correlation_matrix)
Output:
Correlation Matrix:
[[1.         0.62758859]
 [0.62758859 1.        ]]
Understanding the Output
The correlation matrix is 2×2:
       temp    cnt
temp [ 1.00   0.63 ]
cnt  [ 0.63   1.00 ]
- Diagonal values = 1.00: Each variable correlates perfectly with itself
- Off-diagonal = 0.63: Correlation between temp and cnt
Extract just the correlation value:
# Get the correlation coefficient (position [0,1] or [1,0])
r = correlation_matrix[0, 1]
print(f"Correlation coefficient: {r:.4f}")
Output:
Correlation coefficient: 0.6276
Interpretation: r = 0.63 indicates a moderate positive correlation. As temperature increases, bike rentals tend to increase.
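The structure of the matrix is the same for any pair of variables. As a quick check that does not need day.csv, the sketch below uses two short made-up arrays (hours and score are hypothetical names) to confirm the shape, the diagonal of ones, and the symmetry that makes positions [0, 1] and [1, 0] interchangeable.

```python
import numpy as np

# Made-up example data (not from the bike dataset)
hours = np.array([1, 2, 3, 4, 5, 6])
score = np.array([52, 55, 61, 60, 68, 71])

m = np.corrcoef(hours, score)

print(m.shape)                        # (2, 2)
print(np.allclose(np.diag(m), 1.0))   # True: each variable against itself
print(np.isclose(m[0, 1], m[1, 0]))   # True: the matrix is symmetric

# Swapping the argument order does not change r
r_swapped = np.corrcoef(score, hours)[0, 1]
print(np.isclose(m[0, 1], r_swapped))  # True
```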
Visualizing Correlation
Combine scatter plots with correlation values for complete analysis.
Temperature vs Bike Rentals
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
bikes = pd.read_csv('day.csv')
# Calculate correlation
r = np.corrcoef(bikes['temp'], bikes['cnt'])[0, 1]
# Create scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(bikes['temp'], bikes['cnt'], alpha=0.5)
plt.xlabel('Normalized Temperature')
plt.ylabel('Bike Rentals')
plt.title(f'Temperature vs Bike Rentals (r = {r:.3f})')
plt.grid(True, alpha=0.3)
plt.show()
The scatter plot shows:
- Clear upward trend (positive relationship)
- Points are somewhat spread out (not perfect correlation)
- The r = 0.628 confirms moderate strength
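The x-axis in the plot uses normalized temperature. One property worth knowing is that r does not depend on units: a positive linear rescaling of either variable leaves r unchanged, and flipping the sign of one variable flips the sign of r. The sketch below checks this on synthetic stand-in data. The values and the rescaling 40 * t + 2 are arbitrary assumptions, not the normalization actually used in the dataset.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Synthetic stand-ins for a normalized temperature and a rental count
temp_norm = rng.uniform(0, 1, size=100)
rentals = 1000 + 5000 * temp_norm + rng.normal(0, 800, size=100)

# Hypothetical rescaling to some other unit (positive slope, arbitrary offset)
temp_rescaled = 40 * temp_norm + 2

r_norm = np.corrcoef(temp_norm, rentals)[0, 1]
r_rescaled = np.corrcoef(temp_rescaled, rentals)[0, 1]
r_flipped = np.corrcoef(-temp_norm, rentals)[0, 1]

print(np.isclose(r_norm, r_rescaled))   # True: same r after rescaling
print(np.isclose(r_flipped, -r_norm))   # True: sign flips, size stays
```

This is why the correlation can be read the same way whether temperature is stored normalized or in degrees.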
Exploring Different Correlations
Let’s examine multiple relationships in the bike data.
Temperature vs Rentals (Positive)
import pandas as pd
import numpy as np
bikes = pd.read_csv('day.csv')
# Temperature vs rentals
r_temp = np.corrcoef(bikes['temp'], bikes['cnt'])[0, 1]
print(f"Temperature vs Rentals: r = {r_temp:.3f}")
Output:
Temperature vs Rentals: r = 0.628
Interpretation: Moderate positive correlation. Warmer days tend to have more rentals.
Humidity vs Rentals (Negative)
# Humidity vs rentals
r_hum = np.corrcoef(bikes['hum'], bikes['cnt'])[0, 1]
print(f"Humidity vs Rentals: r = {r_hum:.3f}")
Output:
Humidity vs Rentals: r = -0.100
Interpretation: Weak negative correlation. Rentals tend to be slightly lower on more humid days.
Windspeed vs Rentals
# Windspeed vs rentals
r_wind = np.corrcoef(bikes['windspeed'], bikes['cnt'])[0, 1]
print(f"Windspeed vs Rentals: r = {r_wind:.3f}")
Output:
Windspeed vs Rentals: r = -0.235
Interpretation: Weak negative correlation. Rentals tend to be somewhat lower on windier days.
Correlation vs Causation
Critical Understanding
Correlation does NOT prove causation.
Just because two variables correlate does not mean one causes the other.
Example
Ice cream sales and drowning deaths are positively correlated. Does eating ice cream cause drowning?
NO. Both are caused by a third factor: hot weather.
- Hot weather → more people swim → more drownings
- Hot weather → more people buy ice cream
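The role of the third factor can be illustrated with a small simulation. In the sketch below, temperature drives both outcomes and neither outcome affects the other, yet the two still correlate. All numbers are invented, and drowning_risk is a made-up continuous index rather than real data.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n_days = 5000  # many simulated days so the estimates are stable

# Hot weather is the common cause (all coefficients are invented)
temperature = rng.uniform(5, 35, size=n_days)
ice_cream = 20 * temperature + rng.normal(0, 60, size=n_days)
drowning_risk = 0.1 * temperature + rng.normal(0, 0.8, size=n_days)

# Across all days the two outcomes correlate, although neither affects the other
r_all = np.corrcoef(ice_cream, drowning_risk)[0, 1]

# Among days with nearly the same temperature, the correlation mostly disappears
similar = (temperature >= 24) & (temperature <= 26)
r_similar = np.corrcoef(ice_cream[similar], drowning_risk[similar])[0, 1]

print(f"All days:               r = {r_all:.2f}")
print(f"Days between 24 and 26: r = {r_similar:.2f}")
```

Comparing days with similar temperature is a rough way of holding the third factor fixed. When the correlation shrinks toward zero under that restriction, the original correlation was largely produced by the shared cause.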
In Our Data
Temperature and bike rentals correlate. Does temperature cause more rentals?
Likely yes, but other factors matter:
- Day of week (working day vs weekend)
- Season (summer habits vs winter)
- Weather (clear vs rainy)
- Holidays and events
Correlation identifies relationships. Determining causation requires more analysis and domain knowledge.
Practical Analysis
Complete Correlation Analysis
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
bikes = pd.read_csv('day.csv')
# Analyze multiple correlations
variables = ['temp', 'atemp', 'hum', 'windspeed']
correlations = []
for var in variables:
    r = np.corrcoef(bikes[var], bikes['cnt'])[0, 1]
    correlations.append((var, r))
# Sort by absolute correlation strength
correlations.sort(key=lambda x: abs(x[1]), reverse=True)
print("Correlations with Bike Rentals (sorted by strength):")
print("-" * 50)
for var, r in correlations:
    strength = "Strong" if abs(r) > 0.7 else "Moderate" if abs(r) > 0.3 else "Weak"
    direction = "positive" if r > 0 else "negative"
    print(f"{var:12s}: r = {r:6.3f} ({strength} {direction})")
Output:
Correlations with Bike Rentals (sorted by strength):
--------------------------------------------------
atemp       : r =  0.631 (Moderate positive)
temp        : r =  0.628 (Moderate positive)
windspeed   : r = -0.235 (Weak negative)
hum         : r = -0.100 (Weak negative)
Key insights:
- Temperature (both actual and feels-like) has the strongest correlation with rentals
- Humidity and wind speed show weak negative correlations with rentals
- No variable reaches the “strong” threshold (|r| > 0.7), suggesting that bike rentals depend on multiple factors
Summary
You learned to quantify relationships using correlation:
- Correlation coefficient (r) measures linear relationships from -1 to +1
- Positive correlation: Variables move together
- Negative correlation: Variables move in opposite directions
- Strength: |r| > 0.7 strong, 0.3-0.7 moderate, < 0.3 weak
- Calculate with NumPy: np.corrcoef(x, y)[0, 1]
- Correlation ≠ Causation: Association does not prove cause-and-effect
Next Steps: In the next lesson, you will compare multiple correlations simultaneously and learn techniques to identify the strongest relationships in your data.
Practice: Calculate correlations between different weather variables in the bike dataset. Which pairs have the strongest relationships?
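One possible starting point for the practice task is to loop over every pair of variables. The sketch below runs on synthetic stand-in columns so that it works without day.csv. The values are invented, and the close link between temp and atemp is built into the fake data as an assumption. To use the real data, replace the dictionary values with columns such as bikes['temp'].

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(seed=7)
n = 300

# Synthetic stand-ins; with the real data use bikes['temp'], bikes['hum'], ...
temp = rng.uniform(0, 1, size=n)
columns = {
    "temp": temp,
    "atemp": temp + rng.normal(0, 0.03, size=n),  # assumed to track temp closely
    "hum": rng.uniform(0, 1, size=n),
    "windspeed": rng.uniform(0, 0.5, size=n),
}

# Correlation for every unordered pair of variables
pairs = []
for a, b in combinations(columns, 2):
    r = np.corrcoef(columns[a], columns[b])[0, 1]
    pairs.append((a, b, r))

# Strongest relationships first, regardless of sign
pairs.sort(key=lambda p: abs(p[2]), reverse=True)
for a, b, r in pairs:
    print(f"{a:10s} vs {b:10s}: r = {r:+.3f}")
```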