Lesson 18 - Merging and Joining DataFrames

SQL-Style Joins in Pandas

You can concatenate DataFrames by stacking. Now you will learn Merging and Joining—combining DataFrames based on matching key values, similar to SQL database joins.

By the end of this lesson, you will be able to:

Understand different types of joins (inner, left, right, outer)
Merge DataFrames using .merge()
Join DataFrames on single and multiple columns
Use .join() for index-based joining
Add indicator columns to track merge sources
Handle duplicate column names

Merging is essential for combining related datasets, enriching data with additional information, and linking tables together. Let’s master this powerful technique.

Why Merge DataFrames?

Merging (or joining) combines DataFrames based on common column values, similar to SQL joins.

Common use cases:

Combine customer data from different systems
Add product details to sales transactions
Enrich datasets with additional information
Link related tables (orders + customers, employees + departments)

Merge vs Concatenate:

Concatenate: Stack DataFrames (rows/columns)
Merge: Join based on matching key values

Inner Join

Keep only rows where the key exists in both DataFrames.

Creating Sample DataFrames

import pandas as pd
import numpy as np

# Employee data
employees = pd.DataFrame({
    'employee_id': [101, 102, 103, 104, 105],
    'name': ['James', 'Maria', 'David', 'Anna', 'Michael'],
    'dept_id': [1, 2, 1, 3, 2],
    'salary': [50000, 60000, 55000, 65000, 58000]
})

# Department data
departments = pd.DataFrame({
    'dept_id': [1, 2, 3, 4],
    'dept_name': ['Engineering', 'Sales', 'Marketing', 'HR'],
    'location': ['New York', 'London', 'Paris', 'Berlin']
})

print("Employees:")
print(employees)
print("\nDepartments:")
print(departments)

Inner Join - Keep Only Matching Rows

# Inner join on dept_id
inner_result = pd.merge(employees, departments, on='dept_id', how='inner')

print("Inner Join Result:")
print(inner_result)
print(f"\nShape: {inner_result.shape}")

Output:

   employee_id    name  dept_id  salary    dept_name  location
0          101   James        1   50000  Engineering  New York
1          103   David        1   55000  Engineering  New York
2          102   Maria        2   60000        Sales    London
3          105 Michael        2   58000        Sales    London
4          104    Anna        3   65000    Marketing     Paris

Notice:

All 5 employees matched (dept_id 1, 2, 3 exist in both)
HR department (dept_id 4) not included (no employees)
Result has columns from both DataFrames

Left, Right, and Outer Joins

Different join types keep different combinations of rows.

Left Join - Keep All Left Rows

Left join keeps all rows from left DataFrame, fills NaN for missing right values.

# Add employee without department
employees_extended = pd.DataFrame({
    'employee_id': [101, 102, 103, 104, 105, 106],
    'name': ['James', 'Maria', 'David', 'Anna', 'Michael', 'Sophie'],
    'dept_id': [1, 2, 1, 3, 2, 99],  # 99 doesn't exist in departments
    'salary': [50000, 60000, 55000, 65000, 58000, 52000]
})

# Left join
left_result = pd.merge(employees_extended, departments, on='dept_id', how='left')

print("Left Join Result:")
print(left_result)

Output:

   employee_id    name  dept_id  salary    dept_name  location
0          101   James        1   50000  Engineering  New York
1          102   Maria        2   60000        Sales    London
2          103   David        1   55000  Engineering  New York
3          104    Anna        3   65000    Marketing     Paris
4          105 Michael        2   58000        Sales    London
5          106  Sophie       99   52000          NaN       NaN  ← Missing dept

Key point: Sophie (dept_id=99) is included, but dept_name and location are NaN.

Right Join - Keep All Right Rows

Right join keeps all rows from right DataFrame.

# Right join (keep all departments)
right_result = pd.merge(employees, departments, on='dept_id', how='right')

print("Right Join Result:")
print(right_result)

Output:

   employee_id    name  dept_id   salary    dept_name  location
0        101.0   James        1  50000.0  Engineering  New York
1        103.0   David        1  55000.0  Engineering  New York
2        102.0   Maria        2  60000.0        Sales    London
3        105.0 Michael        2  58000.0        Sales    London
4        104.0    Anna        3  65000.0    Marketing     Paris
5          NaN     NaN        4      NaN           HR    Berlin  ← No employees

Key point: HR department (dept_id=4) is included even though no employees work there (NaN values).

Outer Join - Keep All Rows from Both

Outer join (full join) keeps all rows from both DataFrames, fills NaN where no match.

# Outer join
outer_result = pd.merge(employees_extended, departments, on='dept_id', how='outer')

print("Outer Join Result:")
print(outer_result)
print(f"\nShape: {outer_result.shape}")

Output:

   employee_id    name  dept_id   salary    dept_name  location
0        101.0   James        1  50000.0  Engineering  New York
1        103.0   David        1  55000.0  Engineering  New York
2        102.0   Maria        2  60000.0        Sales    London
3        105.0 Michael        2  58000.0        Sales    London
4        104.0    Anna        3  65000.0    Marketing     Paris
5        106.0  Sophie       99  52000.0          NaN       NaN  ← No dept
6          NaN     NaN        4      NaN           HR    Berlin  ← No emp

Result includes:

Sophie (dept_id=99, no matching department)
HR department (dept_id=4, no employees)
All matching rows

Understanding Join Types

Visual comparison of join types:

Join Type	SQL Equivalent	Description	Use Case
Inner	`INNER JOIN`	Only matching rows	Common analysis
Left	`LEFT JOIN`	All left + matching right	Keep all customers, add order details
Right	`RIGHT JOIN`	All right + matching left	Keep all products, add sales data
Outer	`FULL OUTER JOIN`	All rows from both	Complete data audit

Visual Example:

Left DF:        Right DF:
A, B, C         B, C, D

Inner:  B, C           (intersection)
Left:   A, B, C        (all left)
Right:  B, C, D        (all right)
Outer:  A, B, C, D     (union)

Merging on Multiple Columns

Sometimes you need to match on more than one column.

Join on Multiple Keys

# Sales data with region and product
sales = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'product': ['Laptop', 'Mouse', 'Laptop', 'Mouse'],
    'quantity': [15, 120, 20, 150]
})

# Price data with region and product
prices = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'East'],
    'product': ['Laptop', 'Mouse', 'Laptop', 'Mouse', 'Laptop'],
    'price': [1200, 30, 1200, 30, 1200]
})

# Merge on both region AND product
sales_with_prices = pd.merge(sales, prices, on=['region', 'product'], how='left')

# Calculate revenue
sales_with_prices['revenue'] = sales_with_prices['quantity'] * sales_with_prices['price']

print("Sales with Prices and Revenue:")
print(sales_with_prices)

Key point: Both region AND product must match for the join to work.

Different Column Names

When key columns have different names, use left_on and right_on.

# Orders with customer_id
orders = pd.DataFrame({
    'order_id': [1001, 1002, 1003],
    'customer_id': [501, 502, 503],
    'amount': [1500, 2000, 1200]
})

# Customers with id (different column name)
customers = pd.DataFrame({
    'id': [501, 502, 503, 504],
    'name': ['James', 'Maria', 'David', 'Anna'],
    'city': ['New York', 'London', 'Paris', 'Berlin']
})

# Merge with different column names
orders_with_customers = pd.merge(orders, customers, left_on='customer_id', right_on='id', how='inner')

print("Orders with Customer Details:")
print(orders_with_customers)

# Drop duplicate id column
orders_with_customers = orders_with_customers.drop('id', axis=1)
print("\nCleaned (removed duplicate 'id' column):")
print(orders_with_customers)

Index-Based Joining

Use .join() for efficient joining on indexes.

Join on Index with .join()

# Create DataFrames with matching indexes
employee_info = pd.DataFrame({
    'name': ['James', 'Maria', 'David'],
    'age': [30, 28, 35]
}, index=[101, 102, 103])

employee_salary = pd.DataFrame({
    'salary': [50000, 60000, 55000],
    'bonus': [5000, 6000, 5500]
}, index=[101, 102, 103])

# Join on index
complete_employee = employee_info.join(employee_salary)

print("Employee Info:")
print(employee_info)
print("\nEmployee Salary:")
print(employee_salary)
print("\nJoined (on index):")
print(complete_employee)

Merge on Index

You can also use .merge() with index parameters.

# Merge on left index and right column
result = pd.merge(employee_info,
                  pd.DataFrame({'emp_id': [101, 102, 103], 'rating': [4.5, 4.8, 4.2]}),
                  left_index=True,
                  right_on='emp_id')

print("Merged on Index:")
print(result)

Indicator and Suffixes

Track merge sources and handle duplicate column names.

Indicator Column

Use indicator=True to see which DataFrame each row came from.

# Create data with some non-matching rows
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'D'], 'value2': [4, 5, 6]})

# Outer merge with indicator
merged = pd.merge(df1, df2, on='key', how='outer', indicator=True)

print("Merge with Indicator:")
print(merged)
print("\nIndicator counts:")
print(merged['_merge'].value_counts())

Output:

  key  value1  value2      _merge
0   A     1.0     NaN   left_only
1   B     2.0     4.0        both
2   C     3.0     5.0        both
3   D     NaN     6.0  right_only

Indicator values:

left_only: Row only in left DataFrame
right_only: Row only in right DataFrame
both: Row in both DataFrames

Handling Duplicate Column Names

When both DataFrames have columns with same names, use suffixes.

# Both DataFrames have 'score' column
midterm = pd.DataFrame({
    'student': ['James', 'Maria', 'David'],
    'score': [85, 90, 78]
})

final = pd.DataFrame({
    'student': ['James', 'Maria', 'David'],
    'score': [88, 92, 82]
})

# Merge with custom suffixes
all_scores = pd.merge(midterm, final, on='student', suffixes=('_midterm', '_final'))

# Calculate average
all_scores['average'] = (all_scores['score_midterm'] + all_scores['score_final']) / 2

print("All Scores:")
print(all_scores)

Output:

  student  score_midterm  score_final  average
0   James             85           88     86.5
1   Maria             90           92     91.0
2   David             78           82     80.0

Practice Exercises

Apply merging operations with these exercises.

Exercise 1: Customer Orders Analysis

Create customers DataFrame (id, name, city)
Create orders DataFrame (order_id, customer_id, amount)
Perform left join to keep all customers
Find customers who haven’t placed orders

# Your code here

Hint

Customers without orders will have NaN in the order_id column after a left join!

Exercise 2: Product Inventory

Create products DataFrame (product_id, name, category)
Create inventory DataFrame (product_id, warehouse, quantity)
Inner join to see products in stock
Calculate total quantity per product

# Your code here

Exercise 3: Student Grades from Multiple Exams

Create exam1 DataFrame (student, math_score)
Create exam2 DataFrame (student, science_score)
Outer join to include all students
Calculate average score (handling NaN)

# Your code here

Summary

You now master merging and joining DataFrames. Let’s review the key concepts.

Key Concepts

Basic Merge Syntax

Use pd.merge() to join DataFrames
Specify join key with on parameter
Choose join type with how parameter
Default is inner join

Join Types

Inner: Only matching rows (intersection)
Left: All left rows, matching right
Right: All right rows, matching left
Outer: All rows from both (union)

Multiple Keys

Pass list to on for multiple columns
Use left_on/right_on for different names
All specified columns must match

Index Joining

Use .join() for index-based joins
More efficient than merge for index operations
Can specify join type with how parameter

Syntax Reference

# Basic merge (inner join by default)
pd.merge(df1, df2, on='key')

# Specify join type
pd.merge(df1, df2, on='key', how='inner')  # Only matches
pd.merge(df1, df2, on='key', how='left')   # All left
pd.merge(df1, df2, on='key', how='right')  # All right
pd.merge(df1, df2, on='key', how='outer')  # All rows

# Multiple columns
pd.merge(df1, df2, on=['key1', 'key2'])

# Different column names
pd.merge(df1, df2, left_on='id', right_on='customer_id')

# Index-based joining
df1.join(df2)
pd.merge(df1, df2, left_index=True, right_index=True)

# With indicator
pd.merge(df1, df2, on='key', how='outer', indicator=True)

# Custom suffixes
pd.merge(df1, df2, on='key', suffixes=('_left', '_right'))

Join Types Summary

Type	Keeps	Use When
inner	Only matches	Default, most common
left	All left rows	Keep all from first DF
right	All right rows	Keep all from second DF
outer	All rows	Complete data audit

Common Patterns

Pattern 1: Add Details to Transactions

# Add product info to sales
sales_detailed = pd.merge(sales, products, on='product_id', how='left')

Pattern 2: Find Missing Relationships

# Find customers without orders
result = pd.merge(customers, orders, left_on='id', right_on='customer_id', how='left')
no_orders = result[result['order_id'].isna()]

Pattern 3: Combine Multiple Sources

# Merge three DataFrames
step1 = pd.merge(df1, df2, on='key')
final = pd.merge(step1, df3, on='key')

Merge vs Join vs Concat

Method	Purpose	Key
merge()	Join on columns	Column values
join()	Join on index	Index values
concat()	Stack DataFrames	No matching

Important Reminders

Check shapes: Always verify row counts before and after merge
Use indicator: indicator=True helps verify merge results
Handle NaN: Outer/left/right joins create missing values
Drop duplicates: Use drop() to remove duplicate columns from left_on/right_on
Data types: Ensure key columns have matching data types
Performance: Index-based joins are faster than column-based

Best Practices

Always check DataFrame shapes before and after merge
Use indicator=True to verify merge results
Handle NaN values appropriately after outer/left/right joins
Drop duplicate columns when using left_on/right_on
Validate key columns have matching data types
Consider using index-based joins for better performance

Next Steps

You can now merge and join DataFrames using SQL-style operations. In the next lesson, you will learn Multi-Index and Hierarchical Data—working with DataFrames that have multiple index levels.

Continue to Lesson 19 - MultiIndex and Hierarchical Data

Learn to work with multi-level indexes for complex data structures

Back to Lesson 17 - Concatenating DataFrames

Review stacking DataFrames vertically and horizontally

Master Data Relationships

Merging is the database analyst’s essential skill—now you can link related datasets just like SQL joins.

Use merging to enrich your data and uncover relationships between tables!

Lesson 17 - Concatenating DataFrames

Lesson 19 - MultiIndex and Hierarchical Data

Courses

DATATWEETS

Title here

Lesson 18 - Merging and Joining DataFrames

SQL-Style Joins in Pandas

Why Merge DataFrames?

Inner Join

Creating Sample DataFrames

Inner Join - Keep Only Matching Rows

Left, Right, and Outer Joins

Left Join - Keep All Left Rows

Right Join - Keep All Right Rows

Outer Join - Keep All Rows from Both

Understanding Join Types

Merging on Multiple Columns

Join on Multiple Keys

Different Column Names

Index-Based Joining

Join on Index with .join()

Merge on Index

Indicator and Suffixes

Indicator Column

Handling Duplicate Column Names

Practice Exercises

Exercise 1: Customer Orders Analysis

Exercise 2: Product Inventory

Exercise 3: Student Grades from Multiple Exams

Summary

Key Concepts

Syntax Reference

Join Types Summary

Common Patterns

Merge vs Join vs Concat

Important Reminders

Best Practices

Next Steps

Continue to Lesson 19 - MultiIndex and Hierarchical Data

Back to Lesson 17 - Concatenating DataFrames

Master Data Relationships

Lesson 18 - Merging and Joining DataFrames

SQL-Style Joins in Pandas#

Why Merge DataFrames?#

Inner Join#

Creating Sample DataFrames#

Inner Join - Keep Only Matching Rows#

Left, Right, and Outer Joins#

Left Join - Keep All Left Rows#

Right Join - Keep All Right Rows#

Outer Join - Keep All Rows from Both#

Understanding Join Types#

Merging on Multiple Columns#

Join on Multiple Keys#

Different Column Names#

Index-Based Joining#

Join on Index with .join()#

Merge on Index#

Indicator and Suffixes#

Indicator Column#

Handling Duplicate Column Names#

Practice Exercises#

Exercise 1: Customer Orders Analysis#

Exercise 2: Product Inventory#

Exercise 3: Student Grades from Multiple Exams#

Summary#

Key Concepts#

Syntax Reference#

Join Types Summary#

Common Patterns#

Merge vs Join vs Concat#

Important Reminders#

Best Practices#

Next Steps#

Continue to Lesson 19 - MultiIndex and Hierarchical Data

Back to Lesson 17 - Concatenating DataFrames

Master Data Relationships#

SQL-Style Joins in Pandas

Why Merge DataFrames?

Inner Join

Creating Sample DataFrames

Inner Join - Keep Only Matching Rows

Left, Right, and Outer Joins

Left Join - Keep All Left Rows

Right Join - Keep All Right Rows

Outer Join - Keep All Rows from Both

Understanding Join Types

Merging on Multiple Columns

Join on Multiple Keys

Different Column Names

Index-Based Joining

Join on Index with .join()

Merge on Index

Indicator and Suffixes

Indicator Column

Handling Duplicate Column Names

Practice Exercises

Exercise 1: Customer Orders Analysis

Exercise 2: Product Inventory

Exercise 3: Student Grades from Multiple Exams

Summary

Key Concepts

Syntax Reference

Join Types Summary

Common Patterns

Merge vs Join vs Concat

Important Reminders

Best Practices

Next Steps

Master Data Relationships