Lesson 8 - Comparing Correlations

Finding the Strongest Relationships

You can already calculate the correlation between two variables. In this lesson you will compare several correlations at once to identify which relationships matter most in your dataset.

By the end of this lesson, you will be able to:

  • Calculate correlation matrices for multiple variables
  • Compare correlation strengths across variables
  • Identify the strongest predictors
  • Create visual correlation comparisons
  • Understand correlation matrix structure

When analyzing real data, you often have many potential relationships. Comparing correlations helps you focus on what matters most.


The Correlation Matrix

What is a Correlation Matrix?

A correlation matrix shows correlations between all pairs of variables in a dataset.

For 4 variables (A, B, C, D), the matrix looks like:

         A      B      C      D
    A [ 1.00   0.45   0.82  -0.23 ]
    B [ 0.45   1.00   0.61   0.15 ]
    C [ 0.82   0.61   1.00  -0.18 ]
    D [-0.23   0.15  -0.18   1.00 ]

Key properties:

  • Diagonal = 1.00: Each variable correlates perfectly with itself
  • Symmetric: Value at [A,B] equals [B,A]
  • Range: All values between -1 and +1
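
These three properties can be checked directly on any matrix that .corr() returns. The sketch below uses a small synthetic DataFrame; the column names A to D and the random values are made up for illustration and are not taken from the bike dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: four random columns, with C built to depend on A
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["A", "B", "C", "D"])
df["C"] = df["A"] + 0.5 * df["C"]

corr = df.corr()
values = corr.to_numpy()

# Diagonal: every variable correlates perfectly with itself
diagonal_is_one = bool(np.allclose(np.diag(values), 1.0))

# Symmetry: the value at [A, B] equals the value at [B, A]
is_symmetric = bool(np.allclose(values, values.T))

# Range: every entry lies between -1 and +1 (small tolerance for rounding)
in_range = bool(((values >= -1 - 1e-12) & (values <= 1 + 1e-12)).all())

print(diagonal_is_one, is_symmetric, in_range)
```

Because of the symmetry, only the values above the diagonal are distinct: n(n-1)/2 of them, which is 6 for 4 variables.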

Creating Correlation Matrices

Using Pandas corr()

The easiest way to create a correlation matrix is the pandas .corr() method, which computes Pearson correlations by default.

import pandas as pd
import numpy as np

# Load bike data
bikes = pd.read_csv('day.csv')

# Select numeric columns of interest
columns = ['temp', 'atemp', 'hum', 'windspeed', 'casual', 'registered', 'cnt']
bike_subset = bikes[columns]

# Calculate correlation matrix
correlation_matrix = bike_subset.corr()

print(correlation_matrix)
                temp     atemp       hum  windspeed    casual  registered       cnt
temp        1.000000  0.991466 -0.064685  -0.017842  0.467097    0.631548  0.627494
atemp       0.991466  1.000000 -0.043536  -0.057473  0.462225    0.631214  0.631213
hum        -0.064685 -0.043536  1.000000  -0.318607 -0.084524   -0.106732 -0.100659
windspeed  -0.017842 -0.057473 -0.318607   1.000000 -0.147121   -0.251104 -0.234545
casual      0.467097  0.462225 -0.084524  -0.147121  1.000000    0.497250  0.690414
registered  0.631548  0.631214 -0.106732  -0.251104  0.497250    1.000000  0.950665
cnt         0.627494  0.631213 -0.100659  -0.234545  0.690414    0.950665  1.000000

Each cell shows the correlation between row and column variables.
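
To make the meaning of a single cell concrete, the sketch below looks up one cell and compares it with the pairwise correlation computed two other ways. The data is synthetic (a hypothetical stand-in that reuses the column names temp, hum, and cnt), so the numbers will not match the bike dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the bike data (synthetic values, same column names)
rng = np.random.default_rng(1)
temp = rng.uniform(0.1, 0.9, size=200)
df = pd.DataFrame({
    "temp": temp,
    "hum": rng.uniform(0.3, 0.9, size=200),
    "cnt": 6000 * temp + rng.normal(0, 800, size=200),
})

corr_matrix = df.corr()

# Three ways to get the same number
from_matrix = corr_matrix.loc["temp", "cnt"]            # row and column by name
from_series = df["temp"].corr(df["cnt"])                # pairwise Pearson correlation
from_numpy = np.corrcoef(df["temp"], df["cnt"])[0, 1]   # NumPy equivalent

print(f"{from_matrix:.6f} {from_series:.6f} {from_numpy:.6f}")
```

All three values agree up to floating-point rounding, so the matrix is simply a table of the pairwise correlations you already know how to compute.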


Interpreting the Matrix

Reading Correlation Values

Let’s extract specific insights from the matrix:

import pandas as pd

bikes = pd.read_csv('day.csv')
columns = ['temp', 'atemp', 'hum', 'windspeed', 'cnt']
corr_matrix = bikes[columns].corr()

# What correlates most with total rentals (cnt)?
print("Correlations with Total Rentals (cnt):")
print(corr_matrix['cnt'].sort_values(ascending=False))
Correlations with Total Rentals (cnt):
cnt          1.000000
atemp        0.631213
temp         0.627494
windspeed   -0.234545
hum         -0.100659
Name: cnt, dtype: float64

Key insights:

  • atemp (feels-like temperature): r = 0.631 - strongest linear relationship with rentals
  • temp (actual temperature): r = 0.627 - nearly as strong
  • windspeed: r = -0.235 - weak negative relationship
  • hum (humidity): r = -0.101 - very weak negative relationship
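
Sorting by the raw value happens to rank these four variables by strength, but a strong negative correlation (say r = -0.8) would land at the bottom of such a list. A minimal sketch of ranking by absolute value instead, using the four correlations printed above:

```python
import pandas as pd

# Correlations with cnt, copied from the printed output
cnt_corr = pd.Series(
    {"atemp": 0.631213, "temp": 0.627494, "windspeed": -0.234545, "hum": -0.100659},
    name="cnt",
)

# Order by strength (absolute value) while keeping the sign visible
order = cnt_corr.abs().sort_values(ascending=False).index
ranked = cnt_corr.reindex(order)

print(ranked)
```

Ranking this way keeps the sign in the result, so you can still see the direction of each relationship.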

Temperature Variables are Highly Correlated

# Check correlation between temp and atemp
temp_atemp_corr = corr_matrix.loc['temp', 'atemp']
print(f"Correlation between temp and atemp: {temp_atemp_corr:.4f}")
Correlation between temp and atemp: 0.9915

Interpretation: Actual temperature and feels-like temperature are nearly perfectly correlated (r = 0.99). This makes sense because feels-like temperature is driven mostly by actual temperature, so the two columns carry almost the same information.


Identifying Strong Relationships

Filter for Strong Correlations

import pandas as pd
import numpy as np

bikes = pd.read_csv('day.csv')
columns = ['temp', 'atemp', 'hum', 'windspeed', 'casual', 'registered', 'cnt']
corr_matrix = bikes[columns].corr()

# Find all correlations stronger than 0.5
print("Strong Correlations (|r| > 0.5):")
print("-" * 60)

for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):  # Upper triangle only
        var1 = corr_matrix.columns[i]
        var2 = corr_matrix.columns[j]
        r = corr_matrix.iloc[i, j]

        if abs(r) > 0.5:
            print(f"{var1:12s} <-> {var2:12s}: r = {r:6.3f}")
Strong Correlations (|r| > 0.5):
------------------------------------------------------------
temp         <-> atemp       : r =  0.991
temp         <-> registered  : r =  0.632
temp         <-> cnt         : r =  0.627
atemp        <-> registered  : r =  0.631
atemp        <-> cnt         : r =  0.631
casual       <-> cnt         : r =  0.690
registered   <-> cnt         : r =  0.951

Key findings:

  1. temp ↔ atemp (r = 0.99): Redundant. They measure nearly the same thing
  2. registered ↔ cnt (r = 0.95): Very strong. cnt is the sum of casual and registered, and registered users make up most of it
  3. casual ↔ cnt (r = 0.69): Moderate. Casual users are also part of cnt but contribute less than registered users
  4. Temperature ↔ rentals (r ≈ 0.63): Moderate. Temperature is important but not the only factor
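
The nested loop can also be written without explicit indexing. The sketch below is one possible alternative, not the lesson's required approach: strong_pairs is a hypothetical helper name, and the demo data is synthetic, with a total column built as the sum of two parts to mimic how cnt relates to casual and registered.

```python
import numpy as np
import pandas as pd

def strong_pairs(corr, threshold=0.5):
    """Return variable pairs with |r| above threshold, strongest first."""
    names = corr.columns.to_numpy()
    # Positions of the upper triangle, excluding the diagonal (k=1)
    i, j = np.triu_indices(len(names), k=1)
    pairs = pd.DataFrame({
        "var1": names[i],
        "var2": names[j],
        "r": corr.to_numpy()[i, j],
    })
    strong = pairs[pairs["r"].abs() > threshold]
    # Sort by strength, not by signed value
    order = strong["r"].abs().sort_values(ascending=False).index
    return strong.loc[order].reset_index(drop=True)

# Hypothetical synthetic data: total is the sum of a small and a large part
rng = np.random.default_rng(2)
part_a = rng.normal(100, 10, size=300)
part_b = rng.normal(400, 60, size=300)
df = pd.DataFrame({
    "part_a": part_a,
    "part_b": part_b,
    "total": part_a + part_b,
    "noise": rng.normal(size=300),
})

result = strong_pairs(df.corr())
print(result)
```

In this synthetic example the larger, more variable part correlates strongly with the total simply because it is a component of it. The same part-whole effect is worth keeping in mind when reading the registered ↔ cnt and casual ↔ cnt values.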

Visualizing Correlations

Bar Chart Comparison

Compare how different variables correlate with rentals:

import pandas as pd
import matplotlib.pyplot as plt

bikes = pd.read_csv('day.csv')
columns = ['temp', 'atemp', 'hum', 'windspeed']
corr_matrix = bikes[columns + ['cnt']].corr()

# Get correlations with cnt, dropping cnt's correlation with itself by name
cnt_correlations = corr_matrix['cnt'].drop('cnt').sort_values()

# Create bar chart
plt.figure(figsize=(10, 6))
colors = ['red' if x < 0 else 'green' for x in cnt_correlations]
plt.barh(cnt_correlations.index, cnt_correlations.values, color=colors, alpha=0.7)
plt.xlabel('Correlation with Bike Rentals')
plt.title('Weather Variables vs Bike Rentals')
plt.axvline(x=0, color='black', linestyle='-', linewidth=0.5)
plt.grid(True, alpha=0.3, axis='x')
plt.show()

The bar chart shows:

  • Temperature variables (positive green bars) have a moderate positive correlation with rentals (r ≈ 0.63)
  • Humidity and windspeed (negative red bars) have weak negative correlations with rentals

Comparing User Types

Casual vs Registered Users

Do weather factors affect casual and registered users differently?

import pandas as pd

bikes = pd.read_csv('day.csv')

# Calculate correlations for each user type
weather_vars = ['temp', 'hum', 'windspeed']

print("Weather Impact on Different User Types:")
print("-" * 60)
print(f"{'Variable':<12s} {'Casual':>10s} {'Registered':>12s} {'Difference':>12s}")
print("-" * 60)

for var in weather_vars:
    r_casual = bikes[var].corr(bikes['casual'])
    r_registered = bikes[var].corr(bikes['registered'])
    diff = r_casual - r_registered

    print(f"{var:<12s} {r_casual:>10.3f} {r_registered:>12.3f} {diff:>12.3f}")
Weather Impact on Different User Types:
------------------------------------------------------------
Variable        Casual   Registered   Difference
------------------------------------------------------------
temp             0.467        0.632       -0.165
hum             -0.085       -0.107        0.022
windspeed       -0.147       -0.251        0.104

Insights:

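  • Temperature: Rentals by registered users track temperature more closely (r = 0.632) than rentals by casual users (r = 0.467)
  • Humidity: Similar weak negative correlation for both groups
  • Windspeed: Stronger negative correlation for registered users (r = -0.251 vs. -0.147)

Why? A larger r means rentals follow the variable more consistently, not that each extra degree of warmth adds more riders. One plausible reading is that registered ridership is large and regular, so it follows the seasonal temperature cycle closely, while casual ridership also swings with weekends and holidays, which adds scatter and lowers r.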
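
When comparing groups, keep in mind that r measures how tightly points follow a line, not how steep the line is. The sketch below illustrates this with synthetic data: the group names steady and noisy and all the numbers are assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two rider groups respond to temperature with the SAME slope,
# but the second group has much more day-to-day scatter
rng = np.random.default_rng(3)
temp = rng.uniform(0.1, 0.9, size=500)
df = pd.DataFrame({
    "temp": temp,
    "steady": 1000 * temp + rng.normal(0, 100, size=500),
    "noisy": 1000 * temp + rng.normal(0, 400, size=500),
})

for group in ["steady", "noisy"]:
    r = df["temp"].corr(df[group])
    # Slope of the least-squares line: rentals gained per unit of temperature
    slope = np.polyfit(df["temp"], df[group], deg=1)[0]
    print(f"{group:<8s} r = {r:.3f}   slope = {slope:.0f}")
```

Both groups gain about the same number of rentals per unit of temperature, yet the noisier group has a clearly lower r. A difference in r between user types therefore does not by itself show that one group reacts more strongly.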


Seasonal Correlation Changes

Do Correlations Vary by Season?

import pandas as pd

bikes = pd.read_csv('day.csv')

# Calculate correlation for each season
seasons = {1: 'Spring', 2: 'Summer', 3: 'Fall', 4: 'Winter'}

print("Temperature-Rental Correlation by Season:")
print("-" * 50)

for season_num, season_name in seasons.items():
    season_data = bikes[bikes['season'] == season_num]
    r = season_data['temp'].corr(season_data['cnt'])
    print(f"{season_name:10s}: r = {r:.3f}")
Temperature-Rental Correlation by Season:
--------------------------------------------------
Spring    : r = 0.436
Summer    : r = 0.239
Fall      : r = 0.407
Winter    : r = 0.467

Interpretation:

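  • Winter: Strongest correlation (r = 0.467) - temperature matters most in cold months
  • Summer: Weakest correlation (r = 0.239) - it is already warm, so temperature is less critical
  • Spring/Fall: Moderate correlation (r = 0.436 and r = 0.407) - transitional seasons

Every seasonal value is below the overall r = 0.627, partly because each season spans a narrower range of temperatures than the full year.

This shows that context matters: the same relationship can be stronger or weaker depending on conditions.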
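
The per-season loop can also be expressed with groupby. The sketch below uses synthetic data (bikes_demo and its seasonal temperature means are hypothetical, not the bike dataset) and also compares the within-season correlations with the correlation over all rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data: each season has its own typical temperature,
# and rentals follow temperature with the same slope all year
rng = np.random.default_rng(4)
season_means = {1: 0.2, 2: 0.5, 3: 0.7, 4: 0.4}
frames = []
for season, mean_temp in season_means.items():
    temp = rng.normal(mean_temp, 0.05, size=200)
    cnt = 5000 * temp + rng.normal(0, 500, size=200)
    frames.append(pd.DataFrame({"season": season, "temp": temp, "cnt": cnt}))
bikes_demo = pd.concat(frames, ignore_index=True)

# One correlation per season, without an explicit loop
by_season = bikes_demo.groupby("season")[["temp", "cnt"]].apply(
    lambda g: g["temp"].corr(g["cnt"])
)
pooled = bikes_demo["temp"].corr(bikes_demo["cnt"])

print(by_season.round(3))
print(f"All seasons pooled: r = {pooled:.3f}")
```

In this synthetic case the underlying relationship is identical in every season, yet each within-season r is lower than the pooled r because a single season covers only a narrow slice of the temperature range. Differences between subgroup correlations can reflect the range of the data as well as the strength of the relationship.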


Practical Analysis

Complete Correlation Report

import pandas as pd
import numpy as np

bikes = pd.read_csv('day.csv')

# Select variables for analysis
weather_vars = ['temp', 'atemp', 'hum', 'windspeed']
outcome_vars = ['casual', 'registered', 'cnt']

print("CORRELATION ANALYSIS REPORT")
print("=" * 70)
print("\n1. Strongest Weather Predictors:")
print("-" * 70)

# Find strongest predictor for total rentals
correlations = []
for var in weather_vars:
    r = bikes[var].corr(bikes['cnt'])
    correlations.append((var, r))

correlations.sort(key=lambda x: abs(x[1]), reverse=True)

for var, r in correlations:
    strength = "Strong" if abs(r) > 0.7 else "Moderate" if abs(r) > 0.3 else "Weak"
    print(f"  {var:12s}: r = {r:7.3f}  ({strength})")

print("\n2. User Type Differences:")
print("-" * 70)

best_var = correlations[0][0]  # Variable with strongest overall correlation
r_casual = bikes[best_var].corr(bikes['casual'])
r_registered = bikes[best_var].corr(bikes['registered'])

print(f"  {best_var} effect on casual users:     r = {r_casual:.3f}")
print(f"  {best_var} effect on registered users: r = {r_registered:.3f}")
print(f"  Difference (registered - casual):  {(r_registered - r_casual):.3f}")

print("\n3. Redundant Variables:")
print("-" * 70)

r_temp_atemp = bikes['temp'].corr(bikes['atemp'])
print(f"  temp <-> atemp correlation: r = {r_temp_atemp:.3f}")
print(f"  Shared variance: r^2 = {r_temp_atemp**2:.3f} (nearly identical)")
print("  Recommendation: Use only one in predictive models")
CORRELATION ANALYSIS REPORT
======================================================================

1. Strongest Weather Predictors:
----------------------------------------------------------------------
  atemp       : r =   0.631  (Moderate)
  temp        : r =   0.627  (Moderate)
  windspeed   : r =  -0.235  (Weak)
  hum         : r =  -0.101  (Weak)

2. User Type Differences:
----------------------------------------------------------------------
  atemp effect on casual users:     r = 0.462
  atemp effect on registered users: r = 0.631
  Difference (registered - casual):  0.169

3. Redundant Variables:
----------------------------------------------------------------------
  temp <-> atemp correlation: r = 0.991
  Shared variance: r^2 = 0.983 (nearly identical)
  Recommendation: Use only one in predictive models
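
One way to act on that recommendation is to flag predictors that correlate very strongly with a column that comes earlier in the table. The sketch below is an illustration under stated assumptions: redundant_columns is a hypothetical helper, the 0.9 threshold is a convention rather than a rule, and the data is synthetic.

```python
import numpy as np
import pandas as pd

def redundant_columns(df, threshold=0.9):
    """List columns that correlate above threshold with an earlier column."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is examined once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Hypothetical data: atemp is temp plus a little noise, hum is unrelated
rng = np.random.default_rng(5)
temp = rng.uniform(0.1, 0.9, size=300)
predictors = pd.DataFrame({
    "temp": temp,
    "atemp": temp + rng.normal(0, 0.02, size=300),
    "hum": rng.uniform(0.3, 0.9, size=300),
})

to_drop = redundant_columns(predictors)
reduced = predictors.drop(columns=to_drop)
print(to_drop, list(reduced.columns))
```

This version always keeps the earlier column of a redundant pair. Which of the two to keep is a modeling choice, so treat the result as a suggestion to review rather than an automatic decision.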

Summary

You learned to compare multiple correlations simultaneously:

  • Correlation matrix shows all pairwise correlations in one table
  • pandas .corr() creates correlation matrices easily
  • Comparison reveals which variables are strongest predictors
  • Context matters: Correlations can vary by season, user type, etc.
  • Redundancy: Highly correlated predictors (r > 0.9) carry nearly the same information
  • Visual comparison: Bar charts help identify strongest relationships

Next Steps: In the next lesson, you will learn about distributions (how data spreads across different values) using histograms and distribution plots.

Practice: Calculate correlations between all weather variables (temp, hum, windspeed, weathersit) and analyze which factors work together or independently.