Lesson 18 - Merging and Joining DataFrames
SQL-Style Joins in Pandas
You can concatenate DataFrames by stacking. Now you will learn Merging and Joining—combining DataFrames based on matching key values, similar to SQL database joins.
By the end of this lesson, you will be able to:
- Understand different types of joins (inner, left, right, outer)
- Merge DataFrames using
.merge() - Join DataFrames on single and multiple columns
- Use
.join()for index-based joining - Add indicator columns to track merge sources
- Handle duplicate column names
Merging is essential for combining related datasets, enriching data with additional information, and linking tables together. Let’s master this powerful technique.
Why Merge DataFrames?
Merging (or joining) combines DataFrames based on common column values, similar to SQL joins.
Common use cases:
- Combine customer data from different systems
- Add product details to sales transactions
- Enrich datasets with additional information
- Link related tables (orders + customers, employees + departments)
Merge vs Concatenate:
- Concatenate: Stack DataFrames (rows/columns)
- Merge: Join based on matching key values
Inner Join
Keep only rows where the key exists in both DataFrames.
Creating Sample DataFrames
import pandas as pd
import numpy as np
# Employee data
employees = pd.DataFrame({
'employee_id': [101, 102, 103, 104, 105],
'name': ['James', 'Maria', 'David', 'Anna', 'Michael'],
'dept_id': [1, 2, 1, 3, 2],
'salary': [50000, 60000, 55000, 65000, 58000]
})
# Department data
departments = pd.DataFrame({
'dept_id': [1, 2, 3, 4],
'dept_name': ['Engineering', 'Sales', 'Marketing', 'HR'],
'location': ['New York', 'London', 'Paris', 'Berlin']
})
print("Employees:")
print(employees)
print("\nDepartments:")
print(departments)Inner Join - Keep Only Matching Rows
# Inner join on dept_id
inner_result = pd.merge(employees, departments, on='dept_id', how='inner')
print("Inner Join Result:")
print(inner_result)
print(f"\nShape: {inner_result.shape}")Output:
employee_id name dept_id salary dept_name location
0 101 James 1 50000 Engineering New York
1 103 David 1 55000 Engineering New York
2 102 Maria 2 60000 Sales London
3 105 Michael 2 58000 Sales London
4 104 Anna 3 65000 Marketing ParisNotice:
- All 5 employees matched (dept_id 1, 2, 3 exist in both)
- HR department (dept_id 4) not included (no employees)
- Result has columns from both DataFrames
Left, Right, and Outer Joins
Different join types keep different combinations of rows.
Left Join - Keep All Left Rows
Left join keeps all rows from left DataFrame, fills NaN for missing right values.
# Add employee without department
employees_extended = pd.DataFrame({
'employee_id': [101, 102, 103, 104, 105, 106],
'name': ['James', 'Maria', 'David', 'Anna', 'Michael', 'Sophie'],
'dept_id': [1, 2, 1, 3, 2, 99], # 99 doesn't exist in departments
'salary': [50000, 60000, 55000, 65000, 58000, 52000]
})
# Left join
left_result = pd.merge(employees_extended, departments, on='dept_id', how='left')
print("Left Join Result:")
print(left_result)Output:
employee_id name dept_id salary dept_name location
0 101 James 1 50000 Engineering New York
1 102 Maria 2 60000 Sales London
2 103 David 1 55000 Engineering New York
3 104 Anna 3 65000 Marketing Paris
4 105 Michael 2 58000 Sales London
5 106 Sophie 99 52000 NaN NaN ← Missing deptKey point: Sophie (dept_id=99) is included, but dept_name and location are NaN.
Right Join - Keep All Right Rows
Right join keeps all rows from right DataFrame.
# Right join (keep all departments)
right_result = pd.merge(employees, departments, on='dept_id', how='right')
print("Right Join Result:")
print(right_result)Output:
employee_id name dept_id salary dept_name location
0 101.0 James 1 50000.0 Engineering New York
1 103.0 David 1 55000.0 Engineering New York
2 102.0 Maria 2 60000.0 Sales London
3 105.0 Michael 2 58000.0 Sales London
4 104.0 Anna 3 65000.0 Marketing Paris
5 NaN NaN 4 NaN HR Berlin ← No employeesKey point: HR department (dept_id=4) is included even though no employees work there (NaN values).
Outer Join - Keep All Rows from Both
Outer join (full join) keeps all rows from both DataFrames, fills NaN where no match.
# Outer join
outer_result = pd.merge(employees_extended, departments, on='dept_id', how='outer')
print("Outer Join Result:")
print(outer_result)
print(f"\nShape: {outer_result.shape}")Output:
employee_id name dept_id salary dept_name location
0 101.0 James 1 50000.0 Engineering New York
1 103.0 David 1 55000.0 Engineering New York
2 102.0 Maria 2 60000.0 Sales London
3 105.0 Michael 2 58000.0 Sales London
4 104.0 Anna 3 65000.0 Marketing Paris
5 106.0 Sophie 99 52000.0 NaN NaN ← No dept
6 NaN NaN 4 NaN HR Berlin ← No empResult includes:
- Sophie (dept_id=99, no matching department)
- HR department (dept_id=4, no employees)
- All matching rows
Understanding Join Types
Visual comparison of join types:
| Join Type | SQL Equivalent | Description | Use Case |
|---|---|---|---|
| Inner | INNER JOIN | Only matching rows | Common analysis |
| Left | LEFT JOIN | All left + matching right | Keep all customers, add order details |
| Right | RIGHT JOIN | All right + matching left | Keep all products, add sales data |
| Outer | FULL OUTER JOIN | All rows from both | Complete data audit |
Visual Example:
Left DF: Right DF:
A, B, C B, C, D
Inner: B, C (intersection)
Left: A, B, C (all left)
Right: B, C, D (all right)
Outer: A, B, C, D (union)Merging on Multiple Columns
Sometimes you need to match on more than one column.
Join on Multiple Keys
# Sales data with region and product
sales = pd.DataFrame({
'region': ['North', 'North', 'South', 'South'],
'product': ['Laptop', 'Mouse', 'Laptop', 'Mouse'],
'quantity': [15, 120, 20, 150]
})
# Price data with region and product
prices = pd.DataFrame({
'region': ['North', 'North', 'South', 'South', 'East'],
'product': ['Laptop', 'Mouse', 'Laptop', 'Mouse', 'Laptop'],
'price': [1200, 30, 1200, 30, 1200]
})
# Merge on both region AND product
sales_with_prices = pd.merge(sales, prices, on=['region', 'product'], how='left')
# Calculate revenue
sales_with_prices['revenue'] = sales_with_prices['quantity'] * sales_with_prices['price']
print("Sales with Prices and Revenue:")
print(sales_with_prices)Key point: Both region AND product must match for the join to work.
Different Column Names
When key columns have different names, use left_on and right_on.
# Orders with customer_id
orders = pd.DataFrame({
'order_id': [1001, 1002, 1003],
'customer_id': [501, 502, 503],
'amount': [1500, 2000, 1200]
})
# Customers with id (different column name)
customers = pd.DataFrame({
'id': [501, 502, 503, 504],
'name': ['James', 'Maria', 'David', 'Anna'],
'city': ['New York', 'London', 'Paris', 'Berlin']
})
# Merge with different column names
orders_with_customers = pd.merge(orders, customers, left_on='customer_id', right_on='id', how='inner')
print("Orders with Customer Details:")
print(orders_with_customers)
# Drop duplicate id column
orders_with_customers = orders_with_customers.drop('id', axis=1)
print("\nCleaned (removed duplicate 'id' column):")
print(orders_with_customers)Index-Based Joining
Use .join() for efficient joining on indexes.
Join on Index with .join()
# Create DataFrames with matching indexes
employee_info = pd.DataFrame({
'name': ['James', 'Maria', 'David'],
'age': [30, 28, 35]
}, index=[101, 102, 103])
employee_salary = pd.DataFrame({
'salary': [50000, 60000, 55000],
'bonus': [5000, 6000, 5500]
}, index=[101, 102, 103])
# Join on index
complete_employee = employee_info.join(employee_salary)
print("Employee Info:")
print(employee_info)
print("\nEmployee Salary:")
print(employee_salary)
print("\nJoined (on index):")
print(complete_employee)Merge on Index
You can also use .merge() with index parameters.
# Merge on left index and right column
result = pd.merge(employee_info,
pd.DataFrame({'emp_id': [101, 102, 103], 'rating': [4.5, 4.8, 4.2]}),
left_index=True,
right_on='emp_id')
print("Merged on Index:")
print(result)Indicator and Suffixes
Track merge sources and handle duplicate column names.
Indicator Column
Use indicator=True to see which DataFrame each row came from.
# Create data with some non-matching rows
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'D'], 'value2': [4, 5, 6]})
# Outer merge with indicator
merged = pd.merge(df1, df2, on='key', how='outer', indicator=True)
print("Merge with Indicator:")
print(merged)
print("\nIndicator counts:")
print(merged['_merge'].value_counts())Output:
key value1 value2 _merge
0 A 1.0 NaN left_only
1 B 2.0 4.0 both
2 C 3.0 5.0 both
3 D NaN 6.0 right_onlyIndicator values:
left_only: Row only in left DataFrameright_only: Row only in right DataFrameboth: Row in both DataFrames
Handling Duplicate Column Names
When both DataFrames have columns with same names, use suffixes.
# Both DataFrames have 'score' column
midterm = pd.DataFrame({
'student': ['James', 'Maria', 'David'],
'score': [85, 90, 78]
})
final = pd.DataFrame({
'student': ['James', 'Maria', 'David'],
'score': [88, 92, 82]
})
# Merge with custom suffixes
all_scores = pd.merge(midterm, final, on='student', suffixes=('_midterm', '_final'))
# Calculate average
all_scores['average'] = (all_scores['score_midterm'] + all_scores['score_final']) / 2
print("All Scores:")
print(all_scores)Output:
student score_midterm score_final average
0 James 85 88 86.5
1 Maria 90 92 91.0
2 David 78 82 80.0Practice Exercises
Apply merging operations with these exercises.
Exercise 1: Customer Orders Analysis
- Create customers DataFrame (id, name, city)
- Create orders DataFrame (order_id, customer_id, amount)
- Perform left join to keep all customers
- Find customers who haven’t placed orders
# Your code hereHint
Customers without orders will have NaN in the order_id column after a left join!
Exercise 2: Product Inventory
- Create products DataFrame (product_id, name, category)
- Create inventory DataFrame (product_id, warehouse, quantity)
- Inner join to see products in stock
- Calculate total quantity per product
# Your code hereExercise 3: Student Grades from Multiple Exams
- Create exam1 DataFrame (student, math_score)
- Create exam2 DataFrame (student, science_score)
- Outer join to include all students
- Calculate average score (handling NaN)
# Your code hereSummary
You now master merging and joining DataFrames. Let’s review the key concepts.
Key Concepts
Basic Merge Syntax
- Use
pd.merge()to join DataFrames - Specify join key with
onparameter - Choose join type with
howparameter - Default is inner join
Join Types
- Inner: Only matching rows (intersection)
- Left: All left rows, matching right
- Right: All right rows, matching left
- Outer: All rows from both (union)
Multiple Keys
- Pass list to
onfor multiple columns - Use
left_on/right_onfor different names - All specified columns must match
Index Joining
- Use
.join()for index-based joins - More efficient than merge for index operations
- Can specify join type with
howparameter
Syntax Reference
# Basic merge (inner join by default)
pd.merge(df1, df2, on='key')
# Specify join type
pd.merge(df1, df2, on='key', how='inner') # Only matches
pd.merge(df1, df2, on='key', how='left') # All left
pd.merge(df1, df2, on='key', how='right') # All right
pd.merge(df1, df2, on='key', how='outer') # All rows
# Multiple columns
pd.merge(df1, df2, on=['key1', 'key2'])
# Different column names
pd.merge(df1, df2, left_on='id', right_on='customer_id')
# Index-based joining
df1.join(df2)
pd.merge(df1, df2, left_index=True, right_index=True)
# With indicator
pd.merge(df1, df2, on='key', how='outer', indicator=True)
# Custom suffixes
pd.merge(df1, df2, on='key', suffixes=('_left', '_right'))Join Types Summary
| Type | Keeps | Use When |
|---|---|---|
| inner | Only matches | Default, most common |
| left | All left rows | Keep all from first DF |
| right | All right rows | Keep all from second DF |
| outer | All rows | Complete data audit |
Common Patterns
Pattern 1: Add Details to Transactions
# Add product info to sales
sales_detailed = pd.merge(sales, products, on='product_id', how='left')Pattern 2: Find Missing Relationships
# Find customers without orders
result = pd.merge(customers, orders, left_on='id', right_on='customer_id', how='left')
no_orders = result[result['order_id'].isna()]Pattern 3: Combine Multiple Sources
# Merge three DataFrames
step1 = pd.merge(df1, df2, on='key')
final = pd.merge(step1, df3, on='key')Merge vs Join vs Concat
| Method | Purpose | Key |
|---|---|---|
| merge() | Join on columns | Column values |
| join() | Join on index | Index values |
| concat() | Stack DataFrames | No matching |
Important Reminders
- Check shapes: Always verify row counts before and after merge
- Use indicator:
indicator=Truehelps verify merge results - Handle NaN: Outer/left/right joins create missing values
- Drop duplicates: Use
drop()to remove duplicate columns fromleft_on/right_on - Data types: Ensure key columns have matching data types
- Performance: Index-based joins are faster than column-based
Best Practices
- Always check DataFrame shapes before and after merge
- Use
indicator=Trueto verify merge results - Handle NaN values appropriately after outer/left/right joins
- Drop duplicate columns when using
left_on/right_on - Validate key columns have matching data types
- Consider using index-based joins for better performance
Next Steps
You can now merge and join DataFrames using SQL-style operations. In the next lesson, you will learn Multi-Index and Hierarchical Data—working with DataFrames that have multiple index levels.
Continue to Lesson 19 - MultiIndex and Hierarchical Data
Learn to work with multi-level indexes for complex data structures
Back to Lesson 17 - Concatenating DataFrames
Review stacking DataFrames vertically and horizontally
Master Data Relationships
Merging is the database analyst’s essential skill—now you can link related datasets just like SQL joins.
Use merging to enrich your data and uncover relationships between tables!