Lesson 8 - Comparing Correlations
Finding the Strongest Relationships
You can already calculate the correlation between two variables. Now you will learn to compare many correlations at once to identify which relationships matter most in your dataset.
By the end of this lesson, you will be able to:
- Calculate correlation matrices for multiple variables
- Compare correlation strengths across variables
- Identify the strongest predictors
- Create visual correlation comparisons
- Understand correlation matrix structure
When analyzing real data, you often have many potential relationships. Comparing correlations helps you focus on what matters most.
The Correlation Matrix
What is a Correlation Matrix?
A correlation matrix shows correlations between all pairs of variables in a dataset.
For 4 variables (A, B, C, D), the matrix looks like:

       A      B      C      D
A [  1.00   0.45   0.82  -0.23 ]
B [  0.45   1.00   0.61   0.15 ]
C [  0.82   0.61   1.00  -0.18 ]
D [ -0.23   0.15  -0.18   1.00 ]

Key properties:
- Diagonal = 1.00: Each variable correlates perfectly with itself
- Symmetric: Value at [A,B] equals [B,A]
- Range: All values between -1 and +1
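These three properties can be checked directly. The sketch below builds a small correlation matrix in plain Python from made-up data; the `data` values and the `pearson_r` helper are illustrative assumptions, not part of the lesson's dataset.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: three variables, five observations each
data = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [2.0, 1.0, 4.0, 3.0, 5.0],
    "C": [5.0, 3.0, 4.0, 1.0, 2.0],
}
names = list(data)

# matrix[row][col] holds r between the row variable and the column variable
matrix = {row: {col: pearson_r(data[row], data[col]) for col in names} for row in names}

for row in names:
    # Diagonal: every variable correlates perfectly with itself
    assert math.isclose(matrix[row][row], 1.0)
    for col in names:
        # Symmetric: r(row, col) equals r(col, row)
        assert math.isclose(matrix[row][col], matrix[col][row])
        # Range: every value lies in [-1, 1] (small tolerance for float rounding)
        assert abs(matrix[row][col]) <= 1.0 + 1e-12

for row in names:
    print(row, [round(matrix[row][col], 2) for col in names])
```

Because the matrix is symmetric, you only need to read one triangle of it; the other half repeats the same values.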
Creating Correlation Matrices
Using Pandas corr()
The easiest way to create a correlation matrix is with the pandas .corr() method.
import pandas as pd
import numpy as np
# Load bike data
bikes = pd.read_csv('day.csv')
# Select numeric columns of interest
columns = ['temp', 'atemp', 'hum', 'windspeed', 'casual', 'registered', 'cnt']
bike_subset = bikes[columns]
# Calculate correlation matrix
correlation_matrix = bike_subset.corr()
print(correlation_matrix)

                 temp      atemp        hum  windspeed     casual  registered        cnt
temp         1.000000   0.991466  -0.064685  -0.017842   0.467097    0.631548   0.627494
atemp        0.991466   1.000000  -0.043536  -0.057473   0.462225    0.631214   0.631213
hum         -0.064685  -0.043536   1.000000  -0.318607  -0.084524   -0.106732  -0.100659
windspeed   -0.017842  -0.057473  -0.318607   1.000000  -0.147121   -0.251104  -0.234545
casual       0.467097   0.462225  -0.084524  -0.147121   1.000000    0.497250   0.690414
registered   0.631548   0.631214  -0.106732  -0.251104   0.497250    1.000000   0.950665
cnt          0.627494   0.631213  -0.100659  -0.234545   0.690414    0.950665   1.000000

Each cell shows the correlation between its row variable and its column variable.
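Real datasets often mix numeric columns with non-numeric ones, such as a date stored as text, and correlation is only defined for the numeric ones. As a minimal sketch, assuming a made-up DataFrame (the `date`, `temp`, and `cnt` values here are invented for illustration), you can let pandas pick the numeric columns for you with `select_dtypes` instead of listing them by hand:

```python
import pandas as pd

# Hypothetical stand-in for a real dataset: two numeric columns plus a text date column
df = pd.DataFrame({
    "date": ["2011-01-01", "2011-01-02", "2011-01-03", "2011-01-04", "2011-01-05"],
    "temp": [0.30, 0.35, 0.20, 0.25, 0.40],
    "cnt": [980, 1100, 800, 900, 1250],
})

# Keep only numeric columns so .corr() never sees the text column
numeric_df = df.select_dtypes(include="number")
corr_matrix = numeric_df.corr()

print(corr_matrix.round(3))
```

Listing columns explicitly, as the lesson does, is still the better choice when only some numeric columns are meaningful to compare.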
Interpreting the Matrix
Reading Correlation Values
Let’s extract specific insights from the matrix:
import pandas as pd
bikes = pd.read_csv('day.csv')
columns = ['temp', 'atemp', 'hum', 'windspeed', 'cnt']
corr_matrix = bikes[columns].corr()
# What correlates most with total rentals (cnt)?
print("Correlations with Total Rentals (cnt):")
print(corr_matrix['cnt'].sort_values(ascending=False))

Correlations with Total Rentals (cnt):
cnt          1.000000
atemp        0.631213
temp         0.627494
windspeed   -0.234545
hum         -0.100659
Name: cnt, dtype: float64

Key insights:
- atemp (feels-like temperature): r = 0.631 - strongest predictor
- temp (actual temperature): r = 0.627 - nearly as strong
- windspeed: r = -0.235 - weak negative effect
- hum (humidity): r = -0.101 - very weak negative effect
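Sorting by raw value puts strong negative correlations at the bottom of the list, even though they can matter as much as strong positive ones. A small sketch, using invented correlation values and hypothetical feature names, shows how sorting by absolute value ranks variables by strength instead:

```python
import pandas as pd

# Hypothetical correlations with an outcome variable (values chosen for illustration)
r_with_outcome = pd.Series(
    {"feature_a": 0.30, "feature_b": -0.75, "feature_c": 0.10, "feature_d": -0.45}
)

# Sorting raw values pushes strong negative correlations to the bottom
by_value = r_with_outcome.sort_values(ascending=False)

# Sorting by |r| ranks variables by strength, regardless of direction
by_strength = r_with_outcome.sort_values(key=abs, ascending=False)

print(by_value.index.tolist())
print(by_strength.index.tolist())
```

In this toy example `feature_b` is last by value but first by strength, which is the ranking you usually want when looking for predictors.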
Temperature Variables are Highly Correlated
# Check correlation between temp and atemp
temp_atemp_corr = corr_matrix.loc['temp', 'atemp']
print(f"Correlation between temp and atemp: {temp_atemp_corr:.4f}")

Correlation between temp and atemp: 0.9915

Interpretation: Actual temperature and feels-like temperature are nearly perfectly correlated (r = 0.99). This makes sense: they measure essentially the same thing.
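With more variables, near-duplicates like these are easier to find programmatically. The following is one possible sketch on synthetic data, where `x2` is deliberately built as a noisy copy of `x1`; the column names and the 0.9 cutoff are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: x2 is x1 plus small noise, so the two are nearly redundant
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.05, size=200),
    "x3": rng.normal(size=200),
})

corr = df.corr().abs()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair is checked once
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(upper_mask)

# Hypothetical cutoff: flag the later column of any pair with |r| > 0.9
threshold = 0.9
redundant = [col for col in upper.columns if (upper[col] > threshold).any()]
print(redundant)
```

Which column of a redundant pair to keep is a judgment call; this sketch simply flags the one that appears later in the DataFrame.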
Identifying Strong Relationships
Filter for Strong Correlations
import pandas as pd
import numpy as np
bikes = pd.read_csv('day.csv')
columns = ['temp', 'atemp', 'hum', 'windspeed', 'casual', 'registered', 'cnt']
corr_matrix = bikes[columns].corr()
# Find all correlations stronger than 0.5
print("Strong Correlations (|r| > 0.5):")
print("-" * 60)
for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):  # Upper triangle only
        var1 = corr_matrix.columns[i]
        var2 = corr_matrix.columns[j]
        r = corr_matrix.iloc[i, j]
        if abs(r) > 0.5:
            print(f"{var1:12s} <-> {var2:12s}: r = {r:6.3f}")

Strong Correlations (|r| > 0.5):
------------------------------------------------------------
temp         <-> atemp       : r =  0.991
temp         <-> registered  : r =  0.632
temp         <-> cnt         : r =  0.627
atemp        <-> registered  : r =  0.631
atemp        <-> cnt         : r =  0.631
casual       <-> cnt         : r =  0.690
registered   <-> cnt         : r =  0.951

Key findings:
- temp ↔ atemp (r = 0.99): Redundant; they measure essentially the same thing
- registered ↔ cnt (r = 0.95): Very strong; registered users dominate the total count. This is partly by construction, because cnt is the sum of casual and registered
- casual ↔ cnt (r = 0.69): Moderate; casual users contribute to the total but less than registered users
- Temperature ↔ rentals (r ≈ 0.63): Moderate; temperature is important but not the only factor
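The nested index loops above can also be written with `itertools.combinations`, which yields each unordered pair of columns exactly once. This is an alternative sketch on synthetic data (the columns `a`, `b`, `c` are invented, with `b` built to depend on `a`), and it additionally sorts the strong pairs so the strongest comes first:

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic data with one built-in strong relationship (b depends on a)
rng = np.random.default_rng(1)
a = rng.normal(size=300)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.5, size=300),
    "c": rng.normal(size=300),
})
corr = df.corr()

# combinations() yields each unordered pair once, which covers the upper triangle
pairs = [(v1, v2, corr.loc[v1, v2]) for v1, v2 in combinations(corr.columns, 2)]

# Keep strong pairs and list the strongest first
strong = sorted(
    (p for p in pairs if abs(p[2]) > 0.5),
    key=lambda p: abs(p[2]),
    reverse=True,
)
for v1, v2, r in strong:
    print(f"{v1:12s} <-> {v2:12s}: r = {r:6.3f}")
```

Both versions visit the same pairs; the choice between them is mostly a matter of readability.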
Visualizing Correlations
Bar Chart Comparison
Compare how different variables correlate with rentals:
import pandas as pd
import matplotlib.pyplot as plt
bikes = pd.read_csv('day.csv')
columns = ['temp', 'atemp', 'hum', 'windspeed']
corr_matrix = bikes[columns + ['cnt']].corr()
# Get correlations with cnt, exclude cnt itself
cnt_correlations = corr_matrix['cnt'].drop('cnt').sort_values()
# Create bar chart
plt.figure(figsize=(10, 6))
colors = ['red' if x < 0 else 'green' for x in cnt_correlations]
plt.barh(cnt_correlations.index, cnt_correlations.values, color=colors, alpha=0.7)
plt.xlabel('Correlation with Bike Rentals')
plt.title('Weather Variables vs Bike Rentals')
plt.axvline(x=0, color='black', linestyle='-', linewidth=0.5)
plt.grid(True, alpha=0.3, axis='x')
plt.show()

The bar chart clearly shows:
- Temperature variables (positive green bars) have the strongest relationship with rentals (r ≈ 0.63)
- Humidity and windspeed (negative red bars) are weakly associated with fewer rentals
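A bar chart compares every variable against one outcome. To see all pairwise correlations at once, a heatmap of the full matrix is a common companion plot. The sketch below is one way to draw it with matplotlib's `imshow` on synthetic data; the column names, colormap, and figure size are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data standing in for a real dataset
rng = np.random.default_rng(2)
base = rng.normal(size=200)
df = pd.DataFrame({
    "x": base,
    "y": base + rng.normal(scale=0.5, size=200),
    "z": -base + rng.normal(scale=1.0, size=200),
})
corr = df.corr()

fig, ax = plt.subplots(figsize=(5, 4))
# Fix the color scale to [-1, 1] so r = 0 always maps to the middle color
image = ax.imshow(corr.values, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)

# Write each r value into its cell
for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")

fig.colorbar(image, ax=ax, label="Correlation (r)")
ax.set_title("Correlation heatmap (synthetic data)")
```

In an interactive session, drop the `matplotlib.use("Agg")` line and call `plt.show()`, or save the figure with `fig.savefig(...)`.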
Comparing User Types
Casual vs Registered Users
Do weather factors affect casual and registered users differently?
import pandas as pd
bikes = pd.read_csv('day.csv')
# Calculate correlations for each user type
weather_vars = ['temp', 'hum', 'windspeed']
print("Weather Impact on Different User Types:")
print("-" * 60)
print(f"{'Variable':<12s} {'Casual':>10s} {'Registered':>12s} {'Difference':>12s}")
print("-" * 60)
for var in weather_vars:
    r_casual = bikes[var].corr(bikes['casual'])
    r_registered = bikes[var].corr(bikes['registered'])
    diff = r_casual - r_registered
    print(f"{var:<12s} {r_casual:>10.3f} {r_registered:>12.3f} {diff:>12.3f}")

Weather Impact on Different User Types:
------------------------------------------------------------
Variable         Casual   Registered   Difference
------------------------------------------------------------
temp              0.467        0.632       -0.165
hum              -0.085       -0.107        0.022
windspeed        -0.147       -0.251        0.104

Insights:
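- Temperature: Registered rentals (r = 0.632) track temperature more closely than casual rentals (r = 0.467)
- Humidity: Similar weak negative correlation for both groups
- Windspeed: Somewhat stronger negative correlation for registered users (r = -0.251 vs. -0.147)
Why? This may seem surprising, since registered users are often assumed to commute regardless of weather while casual users ride recreationally when conditions are pleasant. Keep in mind that r measures how consistently two variables move together, not how large the effect is. Casual rentals may also respond to other factors (such as weekends and holidays), which adds scatter and lowers r even if warm weather matters a great deal to casual riders.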
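When comparing groups, remember that a lower r does not necessarily mean a smaller effect. In the synthetic sketch below, both groups respond to `temp` and `wind` with exactly the same slopes and only the amount of noise differs, yet their correlations differ clearly. The data and the names `group_a` and `group_b` are invented for illustration. The sketch also uses `DataFrame.corrwith` to build the comparison table without an explicit loop.

```python
import numpy as np
import pandas as pd

# Synthetic data: same slopes for both groups, but group_b is much noisier
rng = np.random.default_rng(3)
temp = rng.uniform(0, 1, size=300)
wind = rng.uniform(0, 1, size=300)
df = pd.DataFrame({
    "temp": temp,
    "wind": wind,
    "group_a": 100 * temp - 20 * wind + rng.normal(scale=5, size=300),
    "group_b": 100 * temp - 20 * wind + rng.normal(scale=60, size=300),
})

predictors = ["temp", "wind"]
# corrwith() correlates every predictor column with one target Series
comparison = pd.DataFrame({
    "group_a": df[predictors].corrwith(df["group_a"]),
    "group_b": df[predictors].corrwith(df["group_b"]),
})
comparison["difference"] = comparison["group_a"] - comparison["group_b"]
print(comparison.round(3))
```

If you want to compare how strongly each group responds, rather than how tightly it tracks the predictor, a regression slope is usually the more suitable measure.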
Seasonal Correlation Changes
Do Correlations Vary by Season?
import pandas as pd
bikes = pd.read_csv('day.csv')
# Calculate correlation for each season
seasons = {1: 'Spring', 2: 'Summer', 3: 'Fall', 4: 'Winter'}
print("Temperature-Rental Correlation by Season:")
print("-" * 50)
for season_num, season_name in seasons.items():
    season_data = bikes[bikes['season'] == season_num]
    r = season_data['temp'].corr(season_data['cnt'])
    print(f"{season_name:10s}: r = {r:.3f}")

Temperature-Rental Correlation by Season:
--------------------------------------------------
Spring    : r = 0.436
Summer    : r = 0.239
Fall      : r = 0.407
Winter    : r = 0.467

Interpretation:
- Winter: Strongest correlation (r = 0.467) - temperature matters most in cold months
- Summer: Weakest correlation (r = 0.239) - already warm, temperature less critical
- Spring/Fall: Moderate correlation - transitional seasons
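This shows that context matters: the same relationship can be stronger or weaker under different conditions. Every within-season value is also lower than the overall r = 0.627, likely in part because each season covers a narrower temperature range than the full year.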
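The per-season loop generalizes to any grouping column. As a sketch, assuming synthetic data with a hypothetical `group` column in which `x` drives `y` strongly in one group and weakly in the other, you can iterate over `groupby` to get one correlation per group:

```python
import numpy as np
import pandas as pd

# Synthetic data: x drives y strongly in group "low" and weakly in group "high"
rng = np.random.default_rng(4)
n = 200
x_low = rng.uniform(0, 1, size=n)
x_high = rng.uniform(0, 1, size=n)
df = pd.DataFrame({
    "group": ["low"] * n + ["high"] * n,
    "x": np.concatenate([x_low, x_high]),
    "y": np.concatenate([
        10 * x_low + rng.normal(scale=1, size=n),
        2 * x_high + rng.normal(scale=3, size=n),
    ]),
})

# One correlation per group; iterating over groupby yields (label, sub-DataFrame) pairs
r_by_group = pd.Series(
    {label: sub["x"].corr(sub["y"]) for label, sub in df.groupby("group")}
)
r_overall = df["x"].corr(df["y"])

print(r_by_group.round(3))
print(f"overall: r = {r_overall:.3f}")
```

Comparing the per-group values with the overall value is a quick check on whether a single correlation summarizes the whole dataset fairly.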
Practical Analysis
Complete Correlation Report
import pandas as pd
import numpy as np
bikes = pd.read_csv('day.csv')
# Select variables for analysis
weather_vars = ['temp', 'atemp', 'hum', 'windspeed']
outcome_vars = ['casual', 'registered', 'cnt']
print("CORRELATION ANALYSIS REPORT")
print("=" * 70)
print("\n1. Strongest Weather Predictors:")
print("-" * 70)
# Find strongest predictor for total rentals
correlations = []
for var in weather_vars:
    r = bikes[var].corr(bikes['cnt'])
    correlations.append((var, r))
correlations.sort(key=lambda x: abs(x[1]), reverse=True)
for var, r in correlations:
    # Cutoffs are a rule of thumb: |r| > 0.7 strong, |r| > 0.3 moderate, otherwise weak
    strength = "Strong" if abs(r) > 0.7 else "Moderate" if abs(r) > 0.3 else "Weak"
    print(f" {var:12s}: r = {r:7.3f} ({strength})")
print("\n2. User Type Differences:")
print("-" * 70)
best_var = correlations[0][0] # Variable with strongest overall correlation
r_casual = bikes[best_var].corr(bikes['casual'])
r_registered = bikes[best_var].corr(bikes['registered'])
print(f" {best_var} effect on casual users: r = {r_casual:.3f}")
print(f" {best_var} effect on registered users: r = {r_registered:.3f}")
print(f" Registered users are more affected by {(r_registered - r_casual):.3f}")
print("\n3. Redundant Variables:")
print("-" * 70)
r_temp_atemp = bikes['temp'].corr(bikes['atemp'])
print(f" temp <-> atemp correlation: r = {r_temp_atemp:.3f}")
print(f" These variables are {r_temp_atemp*100:.1f}% correlated (nearly identical)")
print(f" Recommendation: Use only one in predictive models")

CORRELATION ANALYSIS REPORT
======================================================================

1. Strongest Weather Predictors:
----------------------------------------------------------------------
 atemp       : r =   0.631 (Moderate)
 temp        : r =   0.627 (Moderate)
 windspeed   : r =  -0.235 (Weak)
 hum         : r =  -0.101 (Weak)

2. User Type Differences:
----------------------------------------------------------------------
 atemp effect on casual users: r = 0.462
 atemp effect on registered users: r = 0.631
 Registered users are more affected by 0.169

3. Redundant Variables:
----------------------------------------------------------------------
 temp <-> atemp correlation: r = 0.991
 These variables are 99.1% correlated (nearly identical)
 Recommendation: Use only one in predictive models

Summary
You learned to compare multiple correlations simultaneously:
- Correlation matrix shows all pairwise correlations in one table
- pandas .corr() creates correlation matrices easily
- Comparison reveals which variables are the strongest predictors
- Context matters: Correlations can vary by season, user type, etc.
- Redundancy: Highly correlated predictors (|r| > 0.9) carry largely the same information
- Visual comparison: Bar charts help identify strongest relationships
Next Steps: In the next lesson, you will learn about distributions—how data spreads across different values—using histograms and distribution plots.
Practice: Calculate correlations between all weather variables (temp, hum, windspeed, weathersit) and analyze which factors work together or independently.