Lesson 17 - Concatenating DataFrames

Combining Multiple DataFrames

You can create pivot tables to reshape data. Now you will learn Concatenation—combining multiple DataFrames by stacking them together.

By the end of this lesson, you will be able to:

Concatenate DataFrames vertically (stacking rows)
Concatenate DataFrames horizontally (adding columns)
Handle indexes during concatenation
Use keys to create multi-level indexes
Understand when to use concat vs merge
Handle mismatched columns

Concatenation is essential when combining data from multiple files, time periods, or sources. Let’s master this fundamental skill.

Why Concatenate?

Concatenation combines DataFrames by stacking them:

Vertical (axis=0): Stack rows on top of each other
Horizontal (axis=1): Place columns side by side

Common use cases:

Combining data from multiple months/years
Merging data from multiple CSV files
Adding new measurements to existing data
Combining results from different data sources

Vertical Concatenation

Stack DataFrames on top of each other (add rows).

Creating Sample DataFrames

import pandas as pd
import numpy as np

# Q1 sales
q1_sales = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Mar'],
    'revenue': [50000, 55000, 52000],
    'expenses': [30000, 32000, 31000]
})

# Q2 sales
q2_sales = pd.DataFrame({
    'month': ['Apr', 'May', 'Jun'],
    'revenue': [60000, 58000, 65000],
    'expenses': [35000, 34000, 37000]
})

print("Q1 Sales:")
print(q1_sales)
print("\nQ2 Sales:")
print(q2_sales)

Stack Rows with pd.concat()

# Concatenate vertically (default axis=0)
h1_sales = pd.concat([q1_sales, q2_sales], axis=0)

print("Combined H1 Sales (vertical):")
print(h1_sales)
print(f"\nShape: {h1_sales.shape}")

Output:

  month  revenue  expenses
0   Jan    50000     30000
1   Feb    55000     32000
2   Mar    52000     31000
0   Apr    60000     35000  ← Index repeats!
1   May    58000     34000
2   Jun    65000     37000

Notice: The index is preserved (0, 1, 2, 0, 1, 2). This might cause issues!

Reset Index After Concatenation

# Concatenate and reset index
h1_sales_clean = pd.concat([q1_sales, q2_sales], axis=0, ignore_index=True)

print("With reset index:")
print(h1_sales_clean)

Output:

  month  revenue  expenses
0   Jan    50000     30000
1   Feb    55000     32000
2   Mar    52000     31000
3   Apr    60000     35000  ← Clean sequential index
4   May    58000     34000
5   Jun    65000     37000

Much better! Use ignore_index=True for clean sequential numbering.

Horizontal Concatenation

Place DataFrames side by side (add columns).

Creating DataFrames with Different Columns

# Revenue data
revenue_data = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Mar'],
    'revenue': [50000, 55000, 52000]
})

# Employee data
employee_data = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Mar'],
    'employees': [25, 27, 26]
})

print("Revenue:")
print(revenue_data)
print("\nEmployees:")
print(employee_data)

Add Columns with axis=1

# Concatenate horizontally
combined = pd.concat([revenue_data, employee_data], axis=1)

print("Combined (horizontal):")
print(combined)

Output:

  month  revenue month  employees
0   Jan    50000   Jan         25
1   Feb    55000   Feb         27
2   Mar    52000   Mar         26

Notice: The ‘month’ column appears twice! This happens when both DataFrames have the same column.

Remove Duplicate Columns

# Better approach: Drop month from second DataFrame before concatenating
employee_data_clean = employee_data[['employees']]
combined_clean = pd.concat([revenue_data, employee_data_clean], axis=1)

print("Without duplicate 'month':")
print(combined_clean)

Output:

  month  revenue  employees
0   Jan    50000         25
1   Feb    55000         27
2   Mar    52000         26

Using Keys for Multi-Index

Add labels to track which DataFrame each row came from.

Add Keys to Track Data Sources

# Concatenate with keys
sales_with_keys = pd.concat([q1_sales, q2_sales], keys=['Q1', 'Q2'])

print("With keys (multi-index):")
print(sales_with_keys)

Output:

      month  revenue  expenses
Q1 0    Jan    50000     30000
   1    Feb    55000     32000
   2    Mar    52000     31000
Q2 0    Apr    60000     35000
   1    May    58000     34000
   2    Jun    65000     37000

Now you have a hierarchical index showing which quarter each row belongs to!

Select Data by Key

# Select Q1 data
q1_only = sales_with_keys.loc['Q1']
print("Q1 data only:")
print(q1_only)

Output:

  month  revenue  expenses
0   Jan    50000     30000
1   Feb    55000     32000
2   Mar    52000     31000

Handling Mismatched Columns

What happens when DataFrames have different columns?

Concatenate with Different Columns

# DataFrame 1: Has 'bonus' column
df1 = pd.DataFrame({
    'name': ['James', 'Maria'],
    'salary': [50000, 60000],
    'bonus': [5000, 6000]
})

# DataFrame 2: Has 'department' column instead
df2 = pd.DataFrame({
    'name': ['David', 'Anna'],
    'salary': [55000, 65000],
    'department': ['IT', 'Sales']
})

# Concatenate (outer join by default)
employees = pd.concat([df1, df2], ignore_index=True)

print("Concatenated with mismatched columns:")
print(employees)

Output:

    name  salary   bonus department
0  James   50000  5000.0        NaN
1  Maria   60000  6000.0        NaN
2  David   55000     NaN         IT
3   Anna   65000     NaN      Sales

Result: Missing values (NaN) are filled where columns don’t exist. This is the default behavior (join='outer').

Inner Join - Keep Only Common Columns

# Concatenate with inner join
employees_inner = pd.concat([df1, df2], join='inner', ignore_index=True)

print("Inner join (only common columns):")
print(employees_inner)

Output:

    name  salary
0  James   50000
1  Maria   60000
2  David   55000
3   Anna   65000

Only ’name’ and ‘salary’ columns are kept (columns that exist in both DataFrames).

Concatenating Multiple DataFrames

Combine more than two DataFrames at once.

Stack Multiple DataFrames

# Create quarterly data
q1 = pd.DataFrame({'month': ['Jan', 'Feb', 'Mar'], 'sales': [100, 110, 105]})
q2 = pd.DataFrame({'month': ['Apr', 'May', 'Jun'], 'sales': [120, 115, 125]})
q3 = pd.DataFrame({'month': ['Jul', 'Aug', 'Sep'], 'sales': [130, 135, 128]})
q4 = pd.DataFrame({'month': ['Oct', 'Nov', 'Dec'], 'sales': [140, 145, 150]})

# Concatenate all at once
yearly_sales = pd.concat([q1, q2, q3, q4], ignore_index=True)

print("Full Year Sales:")
print(yearly_sales)
print(f"\nTotal annual sales: {yearly_sales['sales'].sum()}")

Practice Exercises

Apply concatenation operations with these exercises.

Exercise 1: Combine Yearly Sales

Create sales data for 2023 and 2024 (same columns)
Concatenate them vertically
Add a ‘year’ column to track which year each row is from
Calculate total sales for each year

# Your code here

Hint

Use keys=[‘2023’, ‘2024’] to add year labels, then reset_index() to convert to a regular column!

Exercise 2: Combine Department Data

Create IT department employees (name, salary)
Create Sales department employees (name, commission)
Concatenate them vertically
Fill missing values appropriately

# Your code here

Exercise 3: Combine Monthly Files

Given monthly data files:

jan = pd.DataFrame({'product': ['A', 'B', 'C'], 'sales': [100, 150, 120]})
feb = pd.DataFrame({'product': ['A', 'B', 'C'], 'sales': [110, 160, 115]})
mar = pd.DataFrame({'product': ['A', 'B', 'C'], 'sales': [105, 155, 125]})

Concatenate all three months
Add month labels using keys
Calculate total sales per product across all months

# Your code here

Summary

You now master DataFrame concatenation. Let’s review the key concepts.

Key Concepts

Vertical Concatenation (Stacking Rows)

Default behavior (axis=0)
Combines DataFrames with same columns
Use ignore_index=True for clean sequential index
Perfect for combining time periods

Horizontal Concatenation (Adding Columns)

Use axis=1 to place side by side
Watch for duplicate column names
Rows are aligned by index
Perfect for adding new features

Keys for Tracking Sources

Add labels with keys parameter
Creates multi-level index
Easy to select data by source
Useful for combining different categories

Handling Mismatched Columns

Outer join (default): Keep all columns, fill NaN
Inner join: Keep only common columns
Choose based on analysis needs

Syntax Reference

# Vertical concatenation (stack rows)
pd.concat([df1, df2], axis=0)

# With reset index
pd.concat([df1, df2], axis=0, ignore_index=True)

# With keys for tracking
pd.concat([df1, df2], keys=['A', 'B'])

# Horizontal concatenation (add columns)
pd.concat([df1, df2], axis=1)

# Inner join (only common columns)
pd.concat([df1, df2], join='inner')

# Outer join (all columns, default)
pd.concat([df1, df2], join='outer')

# Multiple DataFrames
pd.concat([df1, df2, df3, df4], ignore_index=True)

Common Parameters

Parameter	Purpose	Example
`axis`	0=vertical, 1=horizontal	`axis=0` (default)
`ignore_index`	Reset index to 0, 1, 2…	`ignore_index=True`
`keys`	Add multi-level index	`keys=['Q1', 'Q2']`
`join`	How to handle columns	`join='inner'` or `'outer'`
`names`	Name index levels	`names=['quarter', 'row']`

Concat vs Merge

Use pd.concat() when:

Stacking similar DataFrames (same columns)
Combining time periods (monthly to yearly)
Adding new measurements (new rows or columns)
Simple stacking operation

Use pd.merge() when:

Joining on key columns (like SQL joins)
Combining related tables (customers + orders)
Need different join types (left, right, inner, outer)
DataFrames have different structures

Common Patterns

Pattern 1: Load Multiple Files

# Load multiple CSV files and combine
jan = pd.read_csv('jan.csv')
feb = pd.read_csv('feb.csv')
mar = pd.read_csv('mar.csv')

# Combine into single DataFrame
q1 = pd.concat([jan, feb, mar], ignore_index=True)

Pattern 2: Add Year Labels

# Concatenate with year tracking
all_data = pd.concat([df_2023, df_2024], keys=['2023', '2024'])
all_data = all_data.reset_index(level=0)
all_data.rename(columns={'level_0': 'year'}, inplace=True)

Pattern 3: Combine Features

# Add new columns from different sources
features1 = df[['id', 'feature_a', 'feature_b']]
features2 = df[['feature_c', 'feature_d']]
all_features = pd.concat([features1, features2], axis=1)

Important Reminders

Index matters: Use ignore_index=True to avoid duplicate indexes
Column names: Watch for duplicate columns in horizontal concat
Alignment: Horizontal concat aligns by index
Performance: Concatenating many small DataFrames in a loop is slow—collect them in a list first, then concat once
Memory: Large concatenations can use significant memory

Next Steps

You can now combine multiple DataFrames through concatenation. In the next lesson, you will learn Merging and Joining—SQL-style joins for combining DataFrames based on key columns.

Continue to Lesson 18 - Merging and Joining

Learn SQL-style joins for combining related DataFrames

Back to Lesson 16 - Pivot Tables

Review creating Excel-style pivot tables

Master DataFrame Combination

Concatenation is your tool for stacking DataFrames together—essential for combining data from multiple sources or time periods.

Use concatenation to build comprehensive datasets from multiple sources!

Lesson 16 - Pivot Tables

Lesson 18 - Merging and Joining DataFrames

Courses

DATATWEETS

Title here

Lesson 17 - Concatenating DataFrames

Combining Multiple DataFrames

Why Concatenate?

Vertical Concatenation

Creating Sample DataFrames

Stack Rows with pd.concat()

Reset Index After Concatenation

Horizontal Concatenation

Creating DataFrames with Different Columns

Add Columns with axis=1

Remove Duplicate Columns

Using Keys for Multi-Index

Add Keys to Track Data Sources

Select Data by Key

Handling Mismatched Columns

Concatenate with Different Columns

Inner Join - Keep Only Common Columns

Concatenating Multiple DataFrames

Stack Multiple DataFrames

Practice Exercises

Exercise 1: Combine Yearly Sales

Exercise 2: Combine Department Data

Exercise 3: Combine Monthly Files

Summary

Key Concepts

Syntax Reference

Common Parameters

Concat vs Merge

Common Patterns

Important Reminders

Next Steps

Continue to Lesson 18 - Merging and Joining

Back to Lesson 16 - Pivot Tables

Master DataFrame Combination

Lesson 17 - Concatenating DataFrames

Combining Multiple DataFrames#

Why Concatenate?#

Vertical Concatenation#

Creating Sample DataFrames#

Stack Rows with pd.concat()#

Reset Index After Concatenation#

Horizontal Concatenation#

Creating DataFrames with Different Columns#

Add Columns with axis=1#

Remove Duplicate Columns#

Using Keys for Multi-Index#

Add Keys to Track Data Sources#

Select Data by Key#

Handling Mismatched Columns#

Concatenate with Different Columns#

Inner Join - Keep Only Common Columns#

Concatenating Multiple DataFrames#

Stack Multiple DataFrames#

Practice Exercises#

Exercise 1: Combine Yearly Sales#

Exercise 2: Combine Department Data#

Exercise 3: Combine Monthly Files#

Summary#

Key Concepts#

Syntax Reference#

Common Parameters#

Concat vs Merge#

Common Patterns#

Important Reminders#

Next Steps#

Continue to Lesson 18 - Merging and Joining

Back to Lesson 16 - Pivot Tables

Master DataFrame Combination#

Combining Multiple DataFrames

Why Concatenate?

Vertical Concatenation

Creating Sample DataFrames

Stack Rows with pd.concat()

Reset Index After Concatenation

Horizontal Concatenation

Creating DataFrames with Different Columns

Add Columns with axis=1

Remove Duplicate Columns

Using Keys for Multi-Index

Add Keys to Track Data Sources

Select Data by Key

Handling Mismatched Columns

Concatenate with Different Columns

Inner Join - Keep Only Common Columns

Concatenating Multiple DataFrames

Stack Multiple DataFrames

Practice Exercises

Exercise 1: Combine Yearly Sales

Exercise 2: Combine Department Data

Exercise 3: Combine Monthly Files

Summary

Key Concepts

Syntax Reference

Common Parameters

Concat vs Merge

Common Patterns

Important Reminders

Next Steps

Master DataFrame Combination