Lesson 17 - Concatenating DataFrames

Combining Multiple DataFrames

You can create pivot tables to reshape data. Now you will learn Concatenation—combining multiple DataFrames by stacking them together.

By the end of this lesson, you will be able to:

  • Concatenate DataFrames vertically (stacking rows)
  • Concatenate DataFrames horizontally (adding columns)
  • Handle indexes during concatenation
  • Use keys to create multi-level indexes
  • Understand when to use concat vs merge
  • Handle mismatched columns

Concatenation is essential when combining data from multiple files, time periods, or sources. Let’s master this fundamental skill.


Why Concatenate?

Concatenation combines DataFrames by stacking them:

  • Vertical (axis=0): Stack rows on top of each other
  • Horizontal (axis=1): Place columns side by side

Common use cases:

  • Combining data from multiple months/years
  • Merging data from multiple CSV files
  • Adding new measurements to existing data
  • Combining results from different data sources

Vertical Concatenation

Stack DataFrames on top of each other (add rows).

Creating Sample DataFrames

import pandas as pd
import numpy as np

# Q1 sales
q1_sales = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Mar'],
    'revenue': [50000, 55000, 52000],
    'expenses': [30000, 32000, 31000]
})

# Q2 sales
q2_sales = pd.DataFrame({
    'month': ['Apr', 'May', 'Jun'],
    'revenue': [60000, 58000, 65000],
    'expenses': [35000, 34000, 37000]
})

print("Q1 Sales:")
print(q1_sales)
print("\nQ2 Sales:")
print(q2_sales)

Stack Rows with pd.concat()

# Concatenate vertically (default axis=0)
h1_sales = pd.concat([q1_sales, q2_sales], axis=0)

print("Combined H1 Sales (vertical):")
print(h1_sales)
print(f"\nShape: {h1_sales.shape}")

Output:

  month  revenue  expenses
0   Jan    50000     30000
1   Feb    55000     32000
2   Mar    52000     31000
0   Apr    60000     35000  ← Index repeats!
1   May    58000     34000
2   Jun    65000     37000

Notice: The index is preserved (0, 1, 2, 0, 1, 2). This might cause issues!

Reset Index After Concatenation

# Concatenate and reset index
h1_sales_clean = pd.concat([q1_sales, q2_sales], axis=0, ignore_index=True)

print("With reset index:")
print(h1_sales_clean)

Output:

  month  revenue  expenses
0   Jan    50000     30000
1   Feb    55000     32000
2   Mar    52000     31000
3   Apr    60000     35000  ← Clean sequential index
4   May    58000     34000
5   Jun    65000     37000

Much better! Use ignore_index=True for clean sequential numbering.


Horizontal Concatenation

Place DataFrames side by side (add columns).

Creating DataFrames with Different Columns

# Revenue data
revenue_data = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Mar'],
    'revenue': [50000, 55000, 52000]
})

# Employee data
employee_data = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Mar'],
    'employees': [25, 27, 26]
})

print("Revenue:")
print(revenue_data)
print("\nEmployees:")
print(employee_data)

Add Columns with axis=1

# Concatenate horizontally
combined = pd.concat([revenue_data, employee_data], axis=1)

print("Combined (horizontal):")
print(combined)

Output:

  month  revenue month  employees
0   Jan    50000   Jan         25
1   Feb    55000   Feb         27
2   Mar    52000   Mar         26

Notice: The ‘month’ column appears twice! This happens when both DataFrames have the same column.

Remove Duplicate Columns

# Better approach: Drop month from second DataFrame before concatenating
employee_data_clean = employee_data[['employees']]
combined_clean = pd.concat([revenue_data, employee_data_clean], axis=1)

print("Without duplicate 'month':")
print(combined_clean)

Output:

  month  revenue  employees
0   Jan    50000         25
1   Feb    55000         27
2   Mar    52000         26

Using Keys for Multi-Index

Add labels to track which DataFrame each row came from.

Add Keys to Track Data Sources

# Concatenate with keys
sales_with_keys = pd.concat([q1_sales, q2_sales], keys=['Q1', 'Q2'])

print("With keys (multi-index):")
print(sales_with_keys)

Output:

      month  revenue  expenses
Q1 0    Jan    50000     30000
   1    Feb    55000     32000
   2    Mar    52000     31000
Q2 0    Apr    60000     35000
   1    May    58000     34000
   2    Jun    65000     37000

Now you have a hierarchical index showing which quarter each row belongs to!

Select Data by Key

# Select Q1 data
q1_only = sales_with_keys.loc['Q1']
print("Q1 data only:")
print(q1_only)

Output:

  month  revenue  expenses
0   Jan    50000     30000
1   Feb    55000     32000
2   Mar    52000     31000

Handling Mismatched Columns

What happens when DataFrames have different columns?

Concatenate with Different Columns

# DataFrame 1: Has 'bonus' column
df1 = pd.DataFrame({
    'name': ['James', 'Maria'],
    'salary': [50000, 60000],
    'bonus': [5000, 6000]
})

# DataFrame 2: Has 'department' column instead
df2 = pd.DataFrame({
    'name': ['David', 'Anna'],
    'salary': [55000, 65000],
    'department': ['IT', 'Sales']
})

# Concatenate (outer join by default)
employees = pd.concat([df1, df2], ignore_index=True)

print("Concatenated with mismatched columns:")
print(employees)

Output:

    name  salary   bonus department
0  James   50000  5000.0        NaN
1  Maria   60000  6000.0        NaN
2  David   55000     NaN         IT
3   Anna   65000     NaN      Sales

Result: Missing values (NaN) are filled where columns don’t exist. This is the default behavior (join='outer').

Inner Join - Keep Only Common Columns

# Concatenate with inner join
employees_inner = pd.concat([df1, df2], join='inner', ignore_index=True)

print("Inner join (only common columns):")
print(employees_inner)

Output:

    name  salary
0  James   50000
1  Maria   60000
2  David   55000
3   Anna   65000

Only ’name’ and ‘salary’ columns are kept (columns that exist in both DataFrames).


Concatenating Multiple DataFrames

Combine more than two DataFrames at once.

Stack Multiple DataFrames

# Create quarterly data
q1 = pd.DataFrame({'month': ['Jan', 'Feb', 'Mar'], 'sales': [100, 110, 105]})
q2 = pd.DataFrame({'month': ['Apr', 'May', 'Jun'], 'sales': [120, 115, 125]})
q3 = pd.DataFrame({'month': ['Jul', 'Aug', 'Sep'], 'sales': [130, 135, 128]})
q4 = pd.DataFrame({'month': ['Oct', 'Nov', 'Dec'], 'sales': [140, 145, 150]})

# Concatenate all at once
yearly_sales = pd.concat([q1, q2, q3, q4], ignore_index=True)

print("Full Year Sales:")
print(yearly_sales)
print(f"\nTotal annual sales: {yearly_sales['sales'].sum()}")

Practice Exercises

Apply concatenation operations with these exercises.

Exercise 1: Combine Yearly Sales

  1. Create sales data for 2023 and 2024 (same columns)
  2. Concatenate them vertically
  3. Add a ‘year’ column to track which year each row is from
  4. Calculate total sales for each year
# Your code here

Hint

Use keys=[‘2023’, ‘2024’] to add year labels, then reset_index() to convert to a regular column!

Exercise 2: Combine Department Data

  1. Create IT department employees (name, salary)
  2. Create Sales department employees (name, commission)
  3. Concatenate them vertically
  4. Fill missing values appropriately
# Your code here

Exercise 3: Combine Monthly Files

Given monthly data files:

jan = pd.DataFrame({'product': ['A', 'B', 'C'], 'sales': [100, 150, 120]})
feb = pd.DataFrame({'product': ['A', 'B', 'C'], 'sales': [110, 160, 115]})
mar = pd.DataFrame({'product': ['A', 'B', 'C'], 'sales': [105, 155, 125]})
  1. Concatenate all three months
  2. Add month labels using keys
  3. Calculate total sales per product across all months
# Your code here

Summary

You now master DataFrame concatenation. Let’s review the key concepts.

Key Concepts

Vertical Concatenation (Stacking Rows)

  • Default behavior (axis=0)
  • Combines DataFrames with same columns
  • Use ignore_index=True for clean sequential index
  • Perfect for combining time periods

Horizontal Concatenation (Adding Columns)

  • Use axis=1 to place side by side
  • Watch for duplicate column names
  • Rows are aligned by index
  • Perfect for adding new features

Keys for Tracking Sources

  • Add labels with keys parameter
  • Creates multi-level index
  • Easy to select data by source
  • Useful for combining different categories

Handling Mismatched Columns

  • Outer join (default): Keep all columns, fill NaN
  • Inner join: Keep only common columns
  • Choose based on analysis needs

Syntax Reference

# Vertical concatenation (stack rows)
pd.concat([df1, df2], axis=0)

# With reset index
pd.concat([df1, df2], axis=0, ignore_index=True)

# With keys for tracking
pd.concat([df1, df2], keys=['A', 'B'])

# Horizontal concatenation (add columns)
pd.concat([df1, df2], axis=1)

# Inner join (only common columns)
pd.concat([df1, df2], join='inner')

# Outer join (all columns, default)
pd.concat([df1, df2], join='outer')

# Multiple DataFrames
pd.concat([df1, df2, df3, df4], ignore_index=True)

Common Parameters

ParameterPurposeExample
axis0=vertical, 1=horizontalaxis=0 (default)
ignore_indexReset index to 0, 1, 2…ignore_index=True
keysAdd multi-level indexkeys=['Q1', 'Q2']
joinHow to handle columnsjoin='inner' or 'outer'
namesName index levelsnames=['quarter', 'row']

Concat vs Merge

Use pd.concat() when:

  • Stacking similar DataFrames (same columns)
  • Combining time periods (monthly to yearly)
  • Adding new measurements (new rows or columns)
  • Simple stacking operation

Use pd.merge() when:

  • Joining on key columns (like SQL joins)
  • Combining related tables (customers + orders)
  • Need different join types (left, right, inner, outer)
  • DataFrames have different structures

Common Patterns

Pattern 1: Load Multiple Files

# Load multiple CSV files and combine
jan = pd.read_csv('jan.csv')
feb = pd.read_csv('feb.csv')
mar = pd.read_csv('mar.csv')

# Combine into single DataFrame
q1 = pd.concat([jan, feb, mar], ignore_index=True)

Pattern 2: Add Year Labels

# Concatenate with year tracking
all_data = pd.concat([df_2023, df_2024], keys=['2023', '2024'])
all_data = all_data.reset_index(level=0)
all_data.rename(columns={'level_0': 'year'}, inplace=True)

Pattern 3: Combine Features

# Add new columns from different sources
features1 = df[['id', 'feature_a', 'feature_b']]
features2 = df[['feature_c', 'feature_d']]
all_features = pd.concat([features1, features2], axis=1)

Important Reminders

  • Index matters: Use ignore_index=True to avoid duplicate indexes
  • Column names: Watch for duplicate columns in horizontal concat
  • Alignment: Horizontal concat aligns by index
  • Performance: Concatenating many small DataFrames in a loop is slow—collect them in a list first, then concat once
  • Memory: Large concatenations can use significant memory

Next Steps

You can now combine multiple DataFrames through concatenation. In the next lesson, you will learn Merging and Joining—SQL-style joins for combining DataFrames based on key columns.

Continue to Lesson 18 - Merging and Joining

Learn SQL-style joins for combining related DataFrames

Back to Lesson 16 - Pivot Tables

Review creating Excel-style pivot tables


Master DataFrame Combination

Concatenation is your tool for stacking DataFrames together—essential for combining data from multiple sources or time periods.

Use concatenation to build comprehensive datasets from multiple sources!