Lesson 17 - Concatenating DataFrames
Combining Multiple DataFrames
You can create pivot tables to reshape data. Now you will learn Concatenation—combining multiple DataFrames by stacking them together.
By the end of this lesson, you will be able to:
- Concatenate DataFrames vertically (stacking rows)
- Concatenate DataFrames horizontally (adding columns)
- Handle indexes during concatenation
- Use keys to create multi-level indexes
- Understand when to use concat vs merge
- Handle mismatched columns
Concatenation is essential when combining data from multiple files, time periods, or sources. Let’s master this fundamental skill.
Why Concatenate?
Concatenation combines DataFrames by stacking them:
- Vertical (axis=0): Stack rows on top of each other
- Horizontal (axis=1): Place columns side by side
Common use cases:
- Combining data from multiple months/years
- Merging data from multiple CSV files
- Adding new measurements to existing data
- Combining results from different data sources
Vertical Concatenation
Stack DataFrames on top of each other (add rows).
Creating Sample DataFrames
import pandas as pd
import numpy as np
# Q1 sales
q1_sales = pd.DataFrame({
'month': ['Jan', 'Feb', 'Mar'],
'revenue': [50000, 55000, 52000],
'expenses': [30000, 32000, 31000]
})
# Q2 sales
q2_sales = pd.DataFrame({
'month': ['Apr', 'May', 'Jun'],
'revenue': [60000, 58000, 65000],
'expenses': [35000, 34000, 37000]
})
print("Q1 Sales:")
print(q1_sales)
print("\nQ2 Sales:")
print(q2_sales)Stack Rows with pd.concat()
# Concatenate vertically (default axis=0)
h1_sales = pd.concat([q1_sales, q2_sales], axis=0)
print("Combined H1 Sales (vertical):")
print(h1_sales)
print(f"\nShape: {h1_sales.shape}")Output:
month revenue expenses
0 Jan 50000 30000
1 Feb 55000 32000
2 Mar 52000 31000
0 Apr 60000 35000 ← Index repeats!
1 May 58000 34000
2 Jun 65000 37000Notice: The index is preserved (0, 1, 2, 0, 1, 2). This might cause issues!
Reset Index After Concatenation
# Concatenate and reset index
h1_sales_clean = pd.concat([q1_sales, q2_sales], axis=0, ignore_index=True)
print("With reset index:")
print(h1_sales_clean)Output:
month revenue expenses
0 Jan 50000 30000
1 Feb 55000 32000
2 Mar 52000 31000
3 Apr 60000 35000 ← Clean sequential index
4 May 58000 34000
5 Jun 65000 37000Much better! Use ignore_index=True for clean sequential numbering.
Horizontal Concatenation
Place DataFrames side by side (add columns).
Creating DataFrames with Different Columns
# Revenue data
revenue_data = pd.DataFrame({
'month': ['Jan', 'Feb', 'Mar'],
'revenue': [50000, 55000, 52000]
})
# Employee data
employee_data = pd.DataFrame({
'month': ['Jan', 'Feb', 'Mar'],
'employees': [25, 27, 26]
})
print("Revenue:")
print(revenue_data)
print("\nEmployees:")
print(employee_data)Add Columns with axis=1
# Concatenate horizontally
combined = pd.concat([revenue_data, employee_data], axis=1)
print("Combined (horizontal):")
print(combined)Output:
month revenue month employees
0 Jan 50000 Jan 25
1 Feb 55000 Feb 27
2 Mar 52000 Mar 26Notice: The ‘month’ column appears twice! This happens when both DataFrames have the same column.
Remove Duplicate Columns
# Better approach: Drop month from second DataFrame before concatenating
employee_data_clean = employee_data[['employees']]
combined_clean = pd.concat([revenue_data, employee_data_clean], axis=1)
print("Without duplicate 'month':")
print(combined_clean)Output:
month revenue employees
0 Jan 50000 25
1 Feb 55000 27
2 Mar 52000 26Using Keys for Multi-Index
Add labels to track which DataFrame each row came from.
Add Keys to Track Data Sources
# Concatenate with keys
sales_with_keys = pd.concat([q1_sales, q2_sales], keys=['Q1', 'Q2'])
print("With keys (multi-index):")
print(sales_with_keys)Output:
month revenue expenses
Q1 0 Jan 50000 30000
1 Feb 55000 32000
2 Mar 52000 31000
Q2 0 Apr 60000 35000
1 May 58000 34000
2 Jun 65000 37000Now you have a hierarchical index showing which quarter each row belongs to!
Select Data by Key
# Select Q1 data
q1_only = sales_with_keys.loc['Q1']
print("Q1 data only:")
print(q1_only)Output:
month revenue expenses
0 Jan 50000 30000
1 Feb 55000 32000
2 Mar 52000 31000Handling Mismatched Columns
What happens when DataFrames have different columns?
Concatenate with Different Columns
# DataFrame 1: Has 'bonus' column
df1 = pd.DataFrame({
'name': ['James', 'Maria'],
'salary': [50000, 60000],
'bonus': [5000, 6000]
})
# DataFrame 2: Has 'department' column instead
df2 = pd.DataFrame({
'name': ['David', 'Anna'],
'salary': [55000, 65000],
'department': ['IT', 'Sales']
})
# Concatenate (outer join by default)
employees = pd.concat([df1, df2], ignore_index=True)
print("Concatenated with mismatched columns:")
print(employees)Output:
name salary bonus department
0 James 50000 5000.0 NaN
1 Maria 60000 6000.0 NaN
2 David 55000 NaN IT
3 Anna 65000 NaN SalesResult: Missing values (NaN) are filled where columns don’t exist. This is the default behavior (join='outer').
Inner Join - Keep Only Common Columns
# Concatenate with inner join
employees_inner = pd.concat([df1, df2], join='inner', ignore_index=True)
print("Inner join (only common columns):")
print(employees_inner)Output:
name salary
0 James 50000
1 Maria 60000
2 David 55000
3 Anna 65000Only ’name’ and ‘salary’ columns are kept (columns that exist in both DataFrames).
Concatenating Multiple DataFrames
Combine more than two DataFrames at once.
Stack Multiple DataFrames
# Create quarterly data
q1 = pd.DataFrame({'month': ['Jan', 'Feb', 'Mar'], 'sales': [100, 110, 105]})
q2 = pd.DataFrame({'month': ['Apr', 'May', 'Jun'], 'sales': [120, 115, 125]})
q3 = pd.DataFrame({'month': ['Jul', 'Aug', 'Sep'], 'sales': [130, 135, 128]})
q4 = pd.DataFrame({'month': ['Oct', 'Nov', 'Dec'], 'sales': [140, 145, 150]})
# Concatenate all at once
yearly_sales = pd.concat([q1, q2, q3, q4], ignore_index=True)
print("Full Year Sales:")
print(yearly_sales)
print(f"\nTotal annual sales: {yearly_sales['sales'].sum()}")Practice Exercises
Apply concatenation operations with these exercises.
Exercise 1: Combine Yearly Sales
- Create sales data for 2023 and 2024 (same columns)
- Concatenate them vertically
- Add a ‘year’ column to track which year each row is from
- Calculate total sales for each year
# Your code hereHint
Use keys=[‘2023’, ‘2024’] to add year labels, then reset_index() to convert to a regular column!
Exercise 2: Combine Department Data
- Create IT department employees (name, salary)
- Create Sales department employees (name, commission)
- Concatenate them vertically
- Fill missing values appropriately
# Your code hereExercise 3: Combine Monthly Files
Given monthly data files:
jan = pd.DataFrame({'product': ['A', 'B', 'C'], 'sales': [100, 150, 120]})
feb = pd.DataFrame({'product': ['A', 'B', 'C'], 'sales': [110, 160, 115]})
mar = pd.DataFrame({'product': ['A', 'B', 'C'], 'sales': [105, 155, 125]})- Concatenate all three months
- Add month labels using keys
- Calculate total sales per product across all months
# Your code hereSummary
You now master DataFrame concatenation. Let’s review the key concepts.
Key Concepts
Vertical Concatenation (Stacking Rows)
- Default behavior (
axis=0) - Combines DataFrames with same columns
- Use
ignore_index=Truefor clean sequential index - Perfect for combining time periods
Horizontal Concatenation (Adding Columns)
- Use
axis=1to place side by side - Watch for duplicate column names
- Rows are aligned by index
- Perfect for adding new features
Keys for Tracking Sources
- Add labels with
keysparameter - Creates multi-level index
- Easy to select data by source
- Useful for combining different categories
Handling Mismatched Columns
- Outer join (default): Keep all columns, fill NaN
- Inner join: Keep only common columns
- Choose based on analysis needs
Syntax Reference
# Vertical concatenation (stack rows)
pd.concat([df1, df2], axis=0)
# With reset index
pd.concat([df1, df2], axis=0, ignore_index=True)
# With keys for tracking
pd.concat([df1, df2], keys=['A', 'B'])
# Horizontal concatenation (add columns)
pd.concat([df1, df2], axis=1)
# Inner join (only common columns)
pd.concat([df1, df2], join='inner')
# Outer join (all columns, default)
pd.concat([df1, df2], join='outer')
# Multiple DataFrames
pd.concat([df1, df2, df3, df4], ignore_index=True)Common Parameters
| Parameter | Purpose | Example |
|---|---|---|
axis | 0=vertical, 1=horizontal | axis=0 (default) |
ignore_index | Reset index to 0, 1, 2… | ignore_index=True |
keys | Add multi-level index | keys=['Q1', 'Q2'] |
join | How to handle columns | join='inner' or 'outer' |
names | Name index levels | names=['quarter', 'row'] |
Concat vs Merge
Use pd.concat() when:
- Stacking similar DataFrames (same columns)
- Combining time periods (monthly to yearly)
- Adding new measurements (new rows or columns)
- Simple stacking operation
Use pd.merge() when:
- Joining on key columns (like SQL joins)
- Combining related tables (customers + orders)
- Need different join types (left, right, inner, outer)
- DataFrames have different structures
Common Patterns
Pattern 1: Load Multiple Files
# Load multiple CSV files and combine
jan = pd.read_csv('jan.csv')
feb = pd.read_csv('feb.csv')
mar = pd.read_csv('mar.csv')
# Combine into single DataFrame
q1 = pd.concat([jan, feb, mar], ignore_index=True)Pattern 2: Add Year Labels
# Concatenate with year tracking
all_data = pd.concat([df_2023, df_2024], keys=['2023', '2024'])
all_data = all_data.reset_index(level=0)
all_data.rename(columns={'level_0': 'year'}, inplace=True)Pattern 3: Combine Features
# Add new columns from different sources
features1 = df[['id', 'feature_a', 'feature_b']]
features2 = df[['feature_c', 'feature_d']]
all_features = pd.concat([features1, features2], axis=1)Important Reminders
- Index matters: Use
ignore_index=Trueto avoid duplicate indexes - Column names: Watch for duplicate columns in horizontal concat
- Alignment: Horizontal concat aligns by index
- Performance: Concatenating many small DataFrames in a loop is slow—collect them in a list first, then concat once
- Memory: Large concatenations can use significant memory
Next Steps
You can now combine multiple DataFrames through concatenation. In the next lesson, you will learn Merging and Joining—SQL-style joins for combining DataFrames based on key columns.
Continue to Lesson 18 - Merging and Joining
Learn SQL-style joins for combining related DataFrames
Back to Lesson 16 - Pivot Tables
Review creating Excel-style pivot tables
Master DataFrame Combination
Concatenation is your tool for stacking DataFrames together—essential for combining data from multiple sources or time periods.
Use concatenation to build comprehensive datasets from multiple sources!