Lesson 7 - Understanding Correlation

Beyond Visual Patterns

You can already create scatter plots to see relationships visually. In this lesson you will learn to measure those relationships numerically using correlation coefficients.

By the end of this lesson, you will be able to:

  • Understand what correlation means
  • Calculate Pearson correlation coefficients
  • Interpret correlation values (-1 to +1 scale)
  • Distinguish between positive, negative, and no correlation
  • Use correlation to quantify relationship strength
  • Combine scatter plots with correlation metrics

Correlation is one of the most important concepts in data analysis. It helps you move from “I see a pattern” to “I can measure that pattern.”


What is Correlation?

The Concept

Correlation measures how two variables move together:

  • When one goes up, does the other go up? (positive correlation)
  • When one goes up, does the other go down? (negative correlation)
  • Does one change with no consistent pattern in the other? (no correlation)

Real-World Example

Think about bike rentals and temperature:

  • On hot days, more people rent bikes → positive correlation between temperature and rentals
  • On humid, rainy days, fewer people rent bikes → negative correlation between humidity and rentals
  • Day of the week and temperature? → probably no correlation

Visual Correlation

Scatter plots show correlation patterns:

Strong Negative          Strong Positive         No Correlation
    •                        •                       •
      •                    •                           •
        •                •                         •
          •            •                       •
            •        •                           •

The more tightly the points cluster around a straight line, the stronger the correlation.
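
The three patterns can be reproduced with synthetic data. The sketch below is illustrative only: the data are randomly generated (not taken from the bike dataset), the variable names are made up for this example, and it relies on np.corrcoef(), which is introduced later in this lesson.

```python
import numpy as np

# Synthetic data for illustration only (not the bike dataset)
rng = np.random.default_rng(seed=0)
x = np.linspace(0, 10, 200)
noise = rng.normal(0, 1, size=x.size)

y_rising = 2 * x + noise                       # tends to go up as x goes up
y_falling = -2 * x + noise                     # tends to go down as x goes up
y_unrelated = rng.normal(0, 1, size=x.size)    # generated without using x at all

# np.corrcoef returns a 2x2 matrix; [0, 1] is the correlation between the two inputs
r_rising = np.corrcoef(x, y_rising)[0, 1]
r_falling = np.corrcoef(x, y_falling)[0, 1]
r_unrelated = np.corrcoef(x, y_unrelated)[0, 1]

print(f"Rising pattern:    r = {r_rising:.2f}")
print(f"Falling pattern:   r = {r_falling:.2f}")
print(f"Unrelated pattern: r = {r_unrelated:.2f}")
```

Under these assumptions, the rising and falling patterns should give values close to +1 and -1, while the unrelated pattern should give a value near 0.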


The Correlation Coefficient

What is r?

The Pearson correlation coefficient (r) is a number between -1 and +1 that measures the strength and direction of the linear relationship between two variables.
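
To make the definition concrete, r can be computed by hand: multiply each variable's deviations from its mean, sum the products, and divide by the square root of the product of the two sums of squared deviations. The sketch below assumes a hypothetical helper named pearson_r and a small made-up sample (not rows from the bike dataset), and compares the manual result with np.corrcoef().

```python
import numpy as np

def pearson_r(x, y):
    """Hypothetical helper: Pearson r computed directly from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()   # deviations of x from its mean
    dy = y - y.mean()   # deviations of y from its mean
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Made-up values for illustration only
temperature = [0.20, 0.35, 0.50, 0.65, 0.80]
rentals = [1200, 2100, 3900, 4800, 5100]

r_manual = pearson_r(temperature, rentals)
r_numpy = np.corrcoef(temperature, rentals)[0, 1]

print(f"Manual calculation: {r_manual:.4f}")
print(f"np.corrcoef:        {r_numpy:.4f}")
```

The two values should agree up to floating-point rounding, which is a useful check that np.corrcoef() computes the Pearson coefficient.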

Scale interpretation:

-1.0  ←  -0.7  ←  -0.3  ←  0.0  →  +0.3  →  +0.7  →  +1.0
 │         │         │       │       │        │         │
Perfect  Strong    Weak    No     Weak    Strong   Perfect
Negative Negative  Negative Corr  Positive Positive Positive

Interpretation Guidelines

Strength of correlation:

  • 0.0 to ±0.3: Weak or negligible
  • ±0.3 to ±0.7: Moderate
  • ±0.7 to ±1.0: Strong
  • ±1.0: Perfect (all points on a line)

Sign meaning:

  • Positive (+): Variables move in the same direction
  • Negative (-): Variables move in opposite directions
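
These guidelines can be turned into a small helper. The function name describe_correlation is hypothetical, and the cutoffs are the rule-of-thumb values listed above rather than a universal standard. Values exactly at 0.3 or 0.7 are placed in the lower band here, matching the comparisons used in the analysis code later in this lesson.

```python
import math

def describe_correlation(r):
    """Hypothetical helper: label r using this lesson's rule-of-thumb cutoffs."""
    if math.isnan(r):
        raise ValueError("r is NaN, so it cannot be interpreted")
    if r == 0:
        return "no correlation"

    size = abs(r)
    if size == 1.0:
        strength = "perfect"
    elif size > 0.7:
        strength = "strong"
    elif size > 0.3:
        strength = "moderate"
    else:
        strength = "weak"

    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"

for value in (0.85, 0.45, -0.2, -1.0):
    print(f"r = {value:5.2f}: {describe_correlation(value)}")
```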

Calculating Correlation with NumPy

NumPy provides np.corrcoef() to calculate correlation coefficients.

Load the Bike Data

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Load bike sharing data
bikes = pd.read_csv('day.csv')

# Check first few rows
print(bikes[['temp', 'cnt']].head())
       temp   cnt
0  0.344167   985
1  0.363478   801
2  0.196364  1349
3  0.200000  1562
4  0.226957  1600

Calculate Correlation

# Calculate correlation between temperature and bike rentals
correlation_matrix = np.corrcoef(bikes['temp'], bikes['cnt'])

print("Correlation Matrix:")
print(correlation_matrix)
Correlation Matrix:
[[1.         0.62758859]
 [0.62758859 1.        ]]

Understanding the Output

The correlation matrix is 2×2:

              temp      cnt
temp      [  1.00    0.63  ]
cnt       [  0.63    1.00  ]

  • Diagonal values = 1.00: Each variable correlates perfectly with itself
  • Off-diagonal = 0.63: Correlation between temp and cnt

Extract just the correlation value:

# Get the correlation coefficient (position [0,1] or [1,0])
r = correlation_matrix[0, 1]
print(f"Correlation coefficient: {r:.4f}")
Correlation coefficient: 0.6276

Interpretation: r = 0.63 indicates a moderate positive correlation. As temperature increases, bike rentals tend to increase.
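
As an aside, pandas offers the same calculation through Series.corr(), which uses the Pearson method by default. The sketch below uses small made-up values (not rows from day.csv) to suggest that the two approaches agree on complete data, and that they differ when a value is missing: np.corrcoef() returns nan, while Series.corr() drops the incomplete pair first.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only (not rows from day.csv)
sample = pd.DataFrame({
    "temp": [0.20, 0.35, 0.50, 0.65, 0.80],
    "cnt": [1200.0, 2100.0, 3900.0, 4800.0, 5100.0],
})

r_numpy = np.corrcoef(sample["temp"], sample["cnt"])[0, 1]
r_pandas = sample["temp"].corr(sample["cnt"])  # Pearson by default
print(f"NumPy:  {r_numpy:.4f}")
print(f"pandas: {r_pandas:.4f}")

# Same data with one missing rental count
sample_gap = pd.DataFrame({
    "temp": [0.20, 0.35, 0.50, 0.65, 0.80],
    "cnt": [1200.0, 2100.0, np.nan, 4800.0, 5100.0],
})

r_numpy_gap = np.corrcoef(sample_gap["temp"], sample_gap["cnt"])[0, 1]
r_pandas_gap = sample_gap["temp"].corr(sample_gap["cnt"])
print(f"NumPy with a missing value:  {r_numpy_gap}")
print(f"pandas with a missing value: {r_pandas_gap:.4f}")
```

If a column might contain missing values, it is worth checking for them before trusting a nan result from np.corrcoef().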


Visualizing Correlation

Combine scatter plots with correlation values for complete analysis.

Temperature vs Bike Rentals

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

bikes = pd.read_csv('day.csv')

# Calculate correlation
r = np.corrcoef(bikes['temp'], bikes['cnt'])[0, 1]

# Create scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(bikes['temp'], bikes['cnt'], alpha=0.5)
plt.xlabel('Normalized Temperature')
plt.ylabel('Bike Rentals')
plt.title(f'Temperature vs Bike Rentals (r = {r:.3f})')
plt.grid(True, alpha=0.3)
plt.show()

The scatter plot shows:

  • Clear upward trend (positive relationship)
  • Points are somewhat spread out (not perfect correlation)
  • The value r = 0.628 is consistent with a relationship of moderate strength

Exploring Different Correlations

Let’s examine multiple relationships in the bike data.

Temperature vs Rentals (Positive)

import pandas as pd
import numpy as np

bikes = pd.read_csv('day.csv')

# Temperature vs rentals
r_temp = np.corrcoef(bikes['temp'], bikes['cnt'])[0, 1]
print(f"Temperature vs Rentals: r = {r_temp:.3f}")
Temperature vs Rentals: r = 0.628

Interpretation: Moderate positive correlation. Warmer weather encourages more bike usage.

Humidity vs Rentals (Negative)

# Humidity vs rentals
r_hum = np.corrcoef(bikes['hum'], bikes['cnt'])[0, 1]
print(f"Humidity vs Rentals: r = {r_hum:.3f}")
Humidity vs Rentals: r = -0.100

Interpretation: Weak negative correlation. Rentals tend to be slightly lower on more humid days.

Windspeed vs Rentals

# Windspeed vs rentals
r_wind = np.corrcoef(bikes['windspeed'], bikes['cnt'])[0, 1]
print(f"Windspeed vs Rentals: r = {r_wind:.3f}")
Windspeed vs Rentals: r = -0.235

Interpretation: Weak negative correlation. Rentals tend to be somewhat lower on windier days.


Correlation vs Causation

Critical Understanding

Correlation does NOT prove causation.

Just because two variables correlate does not mean one causes the other.

Example

Ice cream sales and drowning deaths are positively correlated. Does eating ice cream cause drowning?

NO. Both are caused by a third factor: hot weather.

  • Hot weather → more people swim → more drownings
  • Hot weather → more people buy ice cream

In Our Data

Temperature and bike rentals correlate. Does temperature cause more rentals?

Likely yes, but other factors matter:

  • Day of week (working day vs weekend)
  • Season (summer habits vs winter)
  • Weather (clear vs rainy)
  • Holidays and events

Correlation identifies relationships. Determining causation requires more analysis and domain knowledge.


Practical Analysis

Complete Correlation Analysis

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

bikes = pd.read_csv('day.csv')

# Analyze multiple correlations
variables = ['temp', 'atemp', 'hum', 'windspeed']
correlations = []

for var in variables:
    r = np.corrcoef(bikes[var], bikes['cnt'])[0, 1]
    correlations.append((var, r))

# Sort by absolute correlation strength
correlations.sort(key=lambda x: abs(x[1]), reverse=True)

print("Correlations with Bike Rentals (sorted by strength):")
print("-" * 50)
for var, r in correlations:
    strength = "Strong" if abs(r) > 0.7 else "Moderate" if abs(r) > 0.3 else "Weak"
    direction = "positive" if r > 0 else "negative"
    print(f"{var:12s}: r = {r:6.3f}  ({strength} {direction})")
Correlations with Bike Rentals (sorted by strength):
--------------------------------------------------
atemp       : r =  0.631  (Moderate positive)
temp        : r =  0.628  (Moderate positive)
windspeed   : r = -0.235  (Weak negative)
hum         : r = -0.100  (Weak negative)

Key insights:

  • The two temperature variables (actual temp and feels-like atemp) have the strongest correlations with rentals
  • Humidity and windspeed show weak negative correlations with rentals
  • None reaches the “strong” threshold (|r| > 0.7), suggesting that bike rentals depend on multiple factors

Summary

You learned to quantify relationships using correlation:

  • Correlation coefficient (r) measures linear relationships from -1 to +1
  • Positive correlation: Variables move together
  • Negative correlation: Variables move in opposite directions
  • Strength: |r| > 0.7 strong, 0.3-0.7 moderate, < 0.3 weak
  • Calculate with NumPy: np.corrcoef(x, y)[0, 1]
  • Correlation ≠ Causation: Association does not prove cause-and-effect

Next Steps: In the next lesson, you will compare multiple correlations simultaneously and learn techniques to identify the strongest relationships in your data.

Practice: Calculate correlations between different weather variables in the bike dataset. Which pairs have the strongest relationships?