Lesson 18 - Merging and Joining DataFrames

SQL-Style Joins in Pandas

You can concatenate DataFrames by stacking. Now you will learn Merging and Joining—combining DataFrames based on matching key values, similar to SQL database joins.

By the end of this lesson, you will be able to:

  • Understand different types of joins (inner, left, right, outer)
  • Merge DataFrames using .merge()
  • Join DataFrames on single and multiple columns
  • Use .join() for index-based joining
  • Add indicator columns to track merge sources
  • Handle duplicate column names

Merging is essential for combining related datasets, enriching data with additional information, and linking tables together. Let’s master this powerful technique.


Why Merge DataFrames?

Merging (or joining) combines DataFrames based on common column values, similar to SQL joins.

Common use cases:

  • Combine customer data from different systems
  • Add product details to sales transactions
  • Enrich datasets with additional information
  • Link related tables (orders + customers, employees + departments)

Merge vs Concatenate:

  • Concatenate: Stack DataFrames (rows/columns)
  • Merge: Join based on matching key values

Inner Join

Keep only rows where the key exists in both DataFrames.

Creating Sample DataFrames

import pandas as pd
import numpy as np

# Employee data
employees = pd.DataFrame({
    'employee_id': [101, 102, 103, 104, 105],
    'name': ['James', 'Maria', 'David', 'Anna', 'Michael'],
    'dept_id': [1, 2, 1, 3, 2],
    'salary': [50000, 60000, 55000, 65000, 58000]
})

# Department data
departments = pd.DataFrame({
    'dept_id': [1, 2, 3, 4],
    'dept_name': ['Engineering', 'Sales', 'Marketing', 'HR'],
    'location': ['New York', 'London', 'Paris', 'Berlin']
})

print("Employees:")
print(employees)
print("\nDepartments:")
print(departments)

Inner Join - Keep Only Matching Rows

# Inner join on dept_id
inner_result = pd.merge(employees, departments, on='dept_id', how='inner')

print("Inner Join Result:")
print(inner_result)
print(f"\nShape: {inner_result.shape}")

Output:

   employee_id    name  dept_id  salary    dept_name  location
0          101   James        1   50000  Engineering  New York
1          103   David        1   55000  Engineering  New York
2          102   Maria        2   60000        Sales    London
3          105 Michael        2   58000        Sales    London
4          104    Anna        3   65000    Marketing     Paris

Notice:

  • All 5 employees matched (dept_id 1, 2, 3 exist in both)
  • HR department (dept_id 4) not included (no employees)
  • Result has columns from both DataFrames

Left, Right, and Outer Joins

Different join types keep different combinations of rows.

Left Join - Keep All Left Rows

Left join keeps all rows from left DataFrame, fills NaN for missing right values.

# Add employee without department
employees_extended = pd.DataFrame({
    'employee_id': [101, 102, 103, 104, 105, 106],
    'name': ['James', 'Maria', 'David', 'Anna', 'Michael', 'Sophie'],
    'dept_id': [1, 2, 1, 3, 2, 99],  # 99 doesn't exist in departments
    'salary': [50000, 60000, 55000, 65000, 58000, 52000]
})

# Left join
left_result = pd.merge(employees_extended, departments, on='dept_id', how='left')

print("Left Join Result:")
print(left_result)

Output:

   employee_id    name  dept_id  salary    dept_name  location
0          101   James        1   50000  Engineering  New York
1          102   Maria        2   60000        Sales    London
2          103   David        1   55000  Engineering  New York
3          104    Anna        3   65000    Marketing     Paris
4          105 Michael        2   58000        Sales    London
5          106  Sophie       99   52000          NaN       NaN  ← Missing dept

Key point: Sophie (dept_id=99) is included, but dept_name and location are NaN.

Right Join - Keep All Right Rows

Right join keeps all rows from right DataFrame.

# Right join (keep all departments)
right_result = pd.merge(employees, departments, on='dept_id', how='right')

print("Right Join Result:")
print(right_result)

Output:

   employee_id    name  dept_id   salary    dept_name  location
0        101.0   James        1  50000.0  Engineering  New York
1        103.0   David        1  55000.0  Engineering  New York
2        102.0   Maria        2  60000.0        Sales    London
3        105.0 Michael        2  58000.0        Sales    London
4        104.0    Anna        3  65000.0    Marketing     Paris
5          NaN     NaN        4      NaN           HR    Berlin  ← No employees

Key point: HR department (dept_id=4) is included even though no employees work there (NaN values).

Outer Join - Keep All Rows from Both

Outer join (full join) keeps all rows from both DataFrames, fills NaN where no match.

# Outer join
outer_result = pd.merge(employees_extended, departments, on='dept_id', how='outer')

print("Outer Join Result:")
print(outer_result)
print(f"\nShape: {outer_result.shape}")

Output:

   employee_id    name  dept_id   salary    dept_name  location
0        101.0   James        1  50000.0  Engineering  New York
1        103.0   David        1  55000.0  Engineering  New York
2        102.0   Maria        2  60000.0        Sales    London
3        105.0 Michael        2  58000.0        Sales    London
4        104.0    Anna        3  65000.0    Marketing     Paris
5        106.0  Sophie       99  52000.0          NaN       NaN  ← No dept
6          NaN     NaN        4      NaN           HR    Berlin  ← No emp

Result includes:

  • Sophie (dept_id=99, no matching department)
  • HR department (dept_id=4, no employees)
  • All matching rows

Understanding Join Types

Visual comparison of join types:

Join TypeSQL EquivalentDescriptionUse Case
InnerINNER JOINOnly matching rowsCommon analysis
LeftLEFT JOINAll left + matching rightKeep all customers, add order details
RightRIGHT JOINAll right + matching leftKeep all products, add sales data
OuterFULL OUTER JOINAll rows from bothComplete data audit

Visual Example:

Left DF:        Right DF:
A, B, C         B, C, D

Inner:  B, C           (intersection)
Left:   A, B, C        (all left)
Right:  B, C, D        (all right)
Outer:  A, B, C, D     (union)

Merging on Multiple Columns

Sometimes you need to match on more than one column.

Join on Multiple Keys

# Sales data with region and product
sales = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'product': ['Laptop', 'Mouse', 'Laptop', 'Mouse'],
    'quantity': [15, 120, 20, 150]
})

# Price data with region and product
prices = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'East'],
    'product': ['Laptop', 'Mouse', 'Laptop', 'Mouse', 'Laptop'],
    'price': [1200, 30, 1200, 30, 1200]
})

# Merge on both region AND product
sales_with_prices = pd.merge(sales, prices, on=['region', 'product'], how='left')

# Calculate revenue
sales_with_prices['revenue'] = sales_with_prices['quantity'] * sales_with_prices['price']

print("Sales with Prices and Revenue:")
print(sales_with_prices)

Key point: Both region AND product must match for the join to work.

Different Column Names

When key columns have different names, use left_on and right_on.

# Orders with customer_id
orders = pd.DataFrame({
    'order_id': [1001, 1002, 1003],
    'customer_id': [501, 502, 503],
    'amount': [1500, 2000, 1200]
})

# Customers with id (different column name)
customers = pd.DataFrame({
    'id': [501, 502, 503, 504],
    'name': ['James', 'Maria', 'David', 'Anna'],
    'city': ['New York', 'London', 'Paris', 'Berlin']
})

# Merge with different column names
orders_with_customers = pd.merge(orders, customers, left_on='customer_id', right_on='id', how='inner')

print("Orders with Customer Details:")
print(orders_with_customers)

# Drop duplicate id column
orders_with_customers = orders_with_customers.drop('id', axis=1)
print("\nCleaned (removed duplicate 'id' column):")
print(orders_with_customers)

Index-Based Joining

Use .join() for efficient joining on indexes.

Join on Index with .join()

# Create DataFrames with matching indexes
employee_info = pd.DataFrame({
    'name': ['James', 'Maria', 'David'],
    'age': [30, 28, 35]
}, index=[101, 102, 103])

employee_salary = pd.DataFrame({
    'salary': [50000, 60000, 55000],
    'bonus': [5000, 6000, 5500]
}, index=[101, 102, 103])

# Join on index
complete_employee = employee_info.join(employee_salary)

print("Employee Info:")
print(employee_info)
print("\nEmployee Salary:")
print(employee_salary)
print("\nJoined (on index):")
print(complete_employee)

Merge on Index

You can also use .merge() with index parameters.

# Merge on left index and right column
result = pd.merge(employee_info,
                  pd.DataFrame({'emp_id': [101, 102, 103], 'rating': [4.5, 4.8, 4.2]}),
                  left_index=True,
                  right_on='emp_id')

print("Merged on Index:")
print(result)

Indicator and Suffixes

Track merge sources and handle duplicate column names.

Indicator Column

Use indicator=True to see which DataFrame each row came from.

# Create data with some non-matching rows
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'D'], 'value2': [4, 5, 6]})

# Outer merge with indicator
merged = pd.merge(df1, df2, on='key', how='outer', indicator=True)

print("Merge with Indicator:")
print(merged)
print("\nIndicator counts:")
print(merged['_merge'].value_counts())

Output:

  key  value1  value2      _merge
0   A     1.0     NaN   left_only
1   B     2.0     4.0        both
2   C     3.0     5.0        both
3   D     NaN     6.0  right_only

Indicator values:

  • left_only: Row only in left DataFrame
  • right_only: Row only in right DataFrame
  • both: Row in both DataFrames

Handling Duplicate Column Names

When both DataFrames have columns with same names, use suffixes.

# Both DataFrames have 'score' column
midterm = pd.DataFrame({
    'student': ['James', 'Maria', 'David'],
    'score': [85, 90, 78]
})

final = pd.DataFrame({
    'student': ['James', 'Maria', 'David'],
    'score': [88, 92, 82]
})

# Merge with custom suffixes
all_scores = pd.merge(midterm, final, on='student', suffixes=('_midterm', '_final'))

# Calculate average
all_scores['average'] = (all_scores['score_midterm'] + all_scores['score_final']) / 2

print("All Scores:")
print(all_scores)

Output:

  student  score_midterm  score_final  average
0   James             85           88     86.5
1   Maria             90           92     91.0
2   David             78           82     80.0

Practice Exercises

Apply merging operations with these exercises.

Exercise 1: Customer Orders Analysis

  1. Create customers DataFrame (id, name, city)
  2. Create orders DataFrame (order_id, customer_id, amount)
  3. Perform left join to keep all customers
  4. Find customers who haven’t placed orders
# Your code here

Hint

Customers without orders will have NaN in the order_id column after a left join!

Exercise 2: Product Inventory

  1. Create products DataFrame (product_id, name, category)
  2. Create inventory DataFrame (product_id, warehouse, quantity)
  3. Inner join to see products in stock
  4. Calculate total quantity per product
# Your code here

Exercise 3: Student Grades from Multiple Exams

  1. Create exam1 DataFrame (student, math_score)
  2. Create exam2 DataFrame (student, science_score)
  3. Outer join to include all students
  4. Calculate average score (handling NaN)
# Your code here

Summary

You now master merging and joining DataFrames. Let’s review the key concepts.

Key Concepts

Basic Merge Syntax

  • Use pd.merge() to join DataFrames
  • Specify join key with on parameter
  • Choose join type with how parameter
  • Default is inner join

Join Types

  • Inner: Only matching rows (intersection)
  • Left: All left rows, matching right
  • Right: All right rows, matching left
  • Outer: All rows from both (union)

Multiple Keys

  • Pass list to on for multiple columns
  • Use left_on/right_on for different names
  • All specified columns must match

Index Joining

  • Use .join() for index-based joins
  • More efficient than merge for index operations
  • Can specify join type with how parameter

Syntax Reference

# Basic merge (inner join by default)
pd.merge(df1, df2, on='key')

# Specify join type
pd.merge(df1, df2, on='key', how='inner')  # Only matches
pd.merge(df1, df2, on='key', how='left')   # All left
pd.merge(df1, df2, on='key', how='right')  # All right
pd.merge(df1, df2, on='key', how='outer')  # All rows

# Multiple columns
pd.merge(df1, df2, on=['key1', 'key2'])

# Different column names
pd.merge(df1, df2, left_on='id', right_on='customer_id')

# Index-based joining
df1.join(df2)
pd.merge(df1, df2, left_index=True, right_index=True)

# With indicator
pd.merge(df1, df2, on='key', how='outer', indicator=True)

# Custom suffixes
pd.merge(df1, df2, on='key', suffixes=('_left', '_right'))

Join Types Summary

TypeKeepsUse When
innerOnly matchesDefault, most common
leftAll left rowsKeep all from first DF
rightAll right rowsKeep all from second DF
outerAll rowsComplete data audit

Common Patterns

Pattern 1: Add Details to Transactions

# Add product info to sales
sales_detailed = pd.merge(sales, products, on='product_id', how='left')

Pattern 2: Find Missing Relationships

# Find customers without orders
result = pd.merge(customers, orders, left_on='id', right_on='customer_id', how='left')
no_orders = result[result['order_id'].isna()]

Pattern 3: Combine Multiple Sources

# Merge three DataFrames
step1 = pd.merge(df1, df2, on='key')
final = pd.merge(step1, df3, on='key')

Merge vs Join vs Concat

MethodPurposeKey
merge()Join on columnsColumn values
join()Join on indexIndex values
concat()Stack DataFramesNo matching

Important Reminders

  • Check shapes: Always verify row counts before and after merge
  • Use indicator: indicator=True helps verify merge results
  • Handle NaN: Outer/left/right joins create missing values
  • Drop duplicates: Use drop() to remove duplicate columns from left_on/right_on
  • Data types: Ensure key columns have matching data types
  • Performance: Index-based joins are faster than column-based

Best Practices

  1. Always check DataFrame shapes before and after merge
  2. Use indicator=True to verify merge results
  3. Handle NaN values appropriately after outer/left/right joins
  4. Drop duplicate columns when using left_on/right_on
  5. Validate key columns have matching data types
  6. Consider using index-based joins for better performance

Next Steps

You can now merge and join DataFrames using SQL-style operations. In the next lesson, you will learn Multi-Index and Hierarchical Data—working with DataFrames that have multiple index levels.

Continue to Lesson 19 - MultiIndex and Hierarchical Data

Learn to work with multi-level indexes for complex data structures

Back to Lesson 17 - Concatenating DataFrames

Review stacking DataFrames vertically and horizontally


Master Data Relationships

Merging is the database analyst’s essential skill—now you can link related datasets just like SQL joins.

Use merging to enrich your data and uncover relationships between tables!