Lesson 19 - MultiIndex and Hierarchical Data
Working with Hierarchical Indexes
You can merge related DataFrames. Now you will learn MultiIndex—using hierarchical indexes to store and manipulate higher-dimensional data in a 2D DataFrame.
By the end of this lesson, you will be able to:
- Create multi-level indexes with
set_index() - Select data from multi-indexed DataFrames
- Use
.stack()and.unstack()to reshape data - Reset and manipulate hierarchical indexes
- Understand when multi-index is useful
- Aggregate data by specific index levels
Multi-index enables elegant representation of complex data hierarchies like Country → City → Store or Year → Quarter → Month. Let’s master this advanced technique.
Why MultiIndex?
Multi-index (hierarchical indexing) allows you to:
- Store higher-dimensional data in 2D DataFrame
- Group data by multiple levels (e.g., Country → City → Store)
- Perform efficient operations on grouped data
- Reshape data between wide and long formats
Common use cases:
- Time series with multiple locations
- Sales data by Region → City → Product
- Scientific data with multiple experimental factors
- Panel data (multiple metrics over time for different entities)
Sample Data
Let’s create sales data across different regions and products:
import pandas as pd
import numpy as np
# Sales data
sales = pd.DataFrame({
'region': ['North', 'North', 'North', 'South', 'South', 'South',
'East', 'East', 'East', 'West', 'West', 'West'],
'city': ['New York', 'New York', 'New York', 'London', 'London', 'London',
'Paris', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Berlin'],
'product': ['Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Mouse', 'Keyboard',
'Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Mouse', 'Keyboard'],
'quantity': [15, 120, 45, 20, 150, 60, 12, 90, 30, 18, 110, 40],
'revenue': [18000, 3600, 3600, 24000, 4500, 4800, 14400, 2700, 2400, 21600, 3300, 3200]
})
print("Original Sales Data:")
print(sales)
print(f"\nShape: {sales.shape}")This dataset has 12 records across multiple regions, cities, and products—perfect for demonstrating hierarchical indexing.
Creating Multi-Index
Transform regular columns into hierarchical index levels.
Create Multi-Index with set_index()
# Create multi-index: Region → City → Product
sales_multi = sales.set_index(['region', 'city', 'product'])
print("Multi-Indexed DataFrame:")
print(sales_multi)
print(f"\nIndex levels: {sales_multi.index.names}")
print(f"Number of levels: {sales_multi.index.nlevels}")Output:
quantity revenue
region city product
North New York Laptop 15 18000
Mouse 120 3600
Keyboard 45 3600
South London Laptop 20 24000
Mouse 150 4500
Keyboard 60 4800
East Paris Laptop 12 14400
Mouse 90 2700
Keyboard 30 2400
West Berlin Laptop 18 21600
Mouse 110 3300
Keyboard 40 3200Notice the hierarchical structure:
- Level 0: region
- Level 1: city
- Level 2: product
Selecting from Multi-Index
Navigate hierarchical data using .loc[] with different approaches.
Select by Single Level
# Select all data for North region
north_sales = sales_multi.loc['North']
print("North Region Sales:")
print(north_sales)Output:
quantity revenue
city product
New York Laptop 15 18000
Mouse 120 3600
Keyboard 45 3600Select by Multiple Levels
Use tuples to select specific combinations:
# Select North → New York
north_newyork = sales_multi.loc[('North', 'New York')]
print("North Region, New York City:")
print(north_newyork)Output:
quantity revenue
product
Laptop 15 18000
Mouse 120 3600
Keyboard 45 3600Select Specific Item
# Select specific product: North → New York → Laptop
specific = sales_multi.loc[('North', 'New York', 'Laptop')]
print("North → New York → Laptop:")
print(specific)
print(f"\nType: {type(specific)}")Output:
quantity 15
revenue 18000
Name: (North, New York, Laptop), dtype: int64Slice Across Levels
Use IndexSlice for cleaner syntax when slicing:
# Select all Laptops across all regions/cities
idx = pd.IndexSlice
all_laptops = sales_multi.loc[idx[:, :, 'Laptop'], :]
print("All Laptop Sales:")
print(all_laptops)Output:
quantity revenue
region city product
North New York Laptop 15 18000
South London Laptop 20 24000
East Paris Laptop 12 14400
West Berlin Laptop 18 21600Aggregating Multi-Index Data
Perform calculations across specific index levels.
Aggregate by Level
# Total revenue by region (level 0)
revenue_by_region = sales_multi.groupby(level=0)['revenue'].sum()
print("Revenue by Region:")
print(revenue_by_region)
print(f"\nBest region: {revenue_by_region.idxmax()} (${revenue_by_region.max():,})")Output:
region
East 19500
North 25200
South 33300
West 28100
Name: revenue, dtype: int64
Best region: South ($33,300)Aggregate by Multiple Levels
# Total quantity by product (level 2)
quantity_by_product = sales_multi.groupby(level=2)['quantity'].sum().sort_values(ascending=False)
print("Quantity by Product:")
print(quantity_by_product)Output:
product
Mouse 470
Keyboard 175
Laptop 65
Name: quantity, dtype: int64Aggregate by Multiple Levels Simultaneously
# Revenue by region and product
revenue_by_region_product = sales_multi.groupby(level=[0, 2])['revenue'].sum()
print("Revenue by Region and Product:")
print(revenue_by_region_product)Output:
region product
East Keyboard 2400
Laptop 14400
Mouse 2700
North Keyboard 3600
Laptop 18000
Mouse 3600
South Keyboard 4800
Laptop 24000
Mouse 4500
West Keyboard 3200
Laptop 21600
Mouse 3300
Name: revenue, dtype: int64Stack and Unstack
Reshape data by moving index levels to columns and vice versa.
Unstack - Move Index to Columns
.unstack() pivots the innermost index level to columns:
# Unstack product level (innermost)
unstacked = sales_multi.unstack()
print("Unstacked (Products as Columns):")
print(unstacked)Output:
quantity revenue
product Keyboard Laptop Mouse Keyboard Laptop Mouse
region city
East Paris 30 12 90 2400 14400 2700
North New York 45 15 120 3600 18000 3600
South London 60 20 150 4800 24000 4500
West Berlin 40 18 110 3200 21600 3300This creates a wider DataFrame with products as column headers.
Unstack Specific Level
# Unstack region level
unstacked_region = sales_multi.unstack(level='region')
print("Unstacked by Region:")
print(unstacked_region.head())Stack - Move Columns to Index
.stack() does the opposite—moves column level to index:
# Stack back to original shape
stacked_back = unstacked.stack()
print("Stacked Back:")
print(stacked_back.head())Practical Use - Creating Pivot-like Views
# Revenue by Region (rows) and Product (columns)
revenue_pivot = sales_multi['revenue'].unstack(level='product')
print("Revenue: Regions vs Products:")
print(revenue_pivot)Output:
product Keyboard Laptop Mouse
region city
East Paris 2400 14400.0 2700
North New York 3600 18000.0 3600
South London 4800 24000.0 4500
West Berlin 3200 21600.0 3300This creates a pivot table-like view without using pivot_table()!
Resetting Index
Convert multi-index back to regular columns.
Reset All Index Levels
# Reset index to make it a regular DataFrame
flat_df = sales_multi.reset_index()
print("Flattened DataFrame:")
print(flat_df)Output:
region city product quantity revenue
0 North New York Laptop 15 18000
1 North New York Mouse 120 3600
2 North New York Keyboard 45 3600
...Reset Specific Level
# Reset only the product level
partial_reset = sales_multi.reset_index(level='product')
print("Reset Product Level Only:")
print(partial_reset.head())Practice Exercises
Apply multi-index operations with these exercises.
Exercise 1: Student Performance Multi-Index
Create and analyze student performance across subjects and semesters:
students = pd.DataFrame({
'student': ['James', 'James', 'James', 'Maria', 'Maria', 'Maria', 'David', 'David', 'David'],
'semester': ['Fall', 'Fall', 'Spring', 'Fall', 'Fall', 'Spring', 'Fall', 'Fall', 'Spring'],
'subject': ['Math', 'Physics', 'Math', 'Math', 'Physics', 'Math', 'Math', 'Physics', 'Math'],
'score': [85, 78, 92, 90, 88, 95, 76, 82, 80]
})- Set multi-index on student and semester
- Find James’s scores for all semesters
- Calculate average score by student
- Unstack to show subjects as columns
# Your code hereHint
Use .loc['James'] to select all data for James across all index levels!
Exercise 2: Temperature Data Reshaping
Work with temperature data across cities and seasons:
temps = pd.DataFrame({
'city': ['New York', 'New York', 'New York', 'London', 'London', 'London', 'Paris', 'Paris', 'Paris'],
'season': ['Winter', 'Spring', 'Summer', 'Winter', 'Spring', 'Summer', 'Winter', 'Spring', 'Summer'],
'temp_celsius': [8, 18, 35, 20, 30, 42, 5, 15, 28]
})- Create multi-index on city and season
- Use unstack to create a wide format (seasons as columns)
- Find the warmest city in Summer
- Calculate average temperature per city
# Your code hereSummary
You now master multi-index and hierarchical data. Let’s review the key concepts.
Key Concepts
Creating Multi-Index
- Use
set_index()with multiple columns - Creates hierarchical index structure
- Each level can be referenced by number or name
- Enables elegant representation of complex hierarchies
Selecting Data
- Single level:
df.loc['level0_value'] - Multiple levels:
df.loc[('level0', 'level1')] - Slicing with IndexSlice:
df.loc[idx[:, :, 'value'], :] - Returns DataFrame or Series depending on selection
Aggregating by Level
- Group by level number:
df.groupby(level=0) - Group by level name:
df.groupby(level='name') - Multiple levels:
df.groupby(level=[0, 2])
Reshaping
- Unstack: Index → Columns (creates wide format)
- Stack: Columns → Index (creates long format)
- Specify level to control which level moves
- Useful for pivot-like transformations
Syntax Reference
# Create multi-index
df.set_index(['col1', 'col2', 'col3'])
# Select by levels
df.loc['level0']
df.loc[('level0', 'level1')]
df.loc[('level0', 'level1', 'level2')]
# IndexSlice for complex slicing
idx = pd.IndexSlice
df.loc[idx[:, :, 'value'], :]
# Aggregate by level
df.groupby(level=0).sum()
df.groupby(level='name').mean()
df.groupby(level=[0, 2]).sum()
# Stack and unstack
df.unstack() # Innermost to columns
df.unstack(level=0) # Specific level
df.unstack(level='name') # By name
df.stack() # Columns to index
# Reset index
df.reset_index() # All levels
df.reset_index(level=0) # Specific level
df.reset_index(level='name')When to Use Multi-Index?
Good Use Cases:
- Time series data with multiple locations
- Data with natural hierarchy (Country → State → City)
- Results of GroupBy operations (keep the multi-index)
- Panel data (multiple metrics over time for entities)
- Scientific experiments with multiple factors
Avoid When:
- Simple, flat data structure is sufficient
- Frequent need to reset index for operations
- Sharing data with users unfamiliar with hierarchical indexing
- Performance is critical (multi-index has overhead)
Important Reminders
- Tuples for selection: Use
df.loc[('A', 'B')]for multiple levels - IndexSlice: Use
pd.IndexSlicefor complex slicing with:syntax - Unstack creates NaN: Missing combinations become NaN in wide format
- Level parameter: Can use numbers (0, 1, 2) or names
- GroupBy creates multi-index: Many operations naturally create it
- Reset when needed: Use
reset_index()to flatten for export
Common Patterns
Pattern 1: Create and Aggregate
df_multi = df.set_index(['category', 'subcategory'])
result = df_multi.groupby(level=0).sum()Pattern 2: Reshape for Analysis
df_multi = df.set_index(['date', 'metric'])
wide = df_multi.unstack(level='metric')Pattern 3: Select and Compute
idx = pd.IndexSlice
subset = df_multi.loc[idx[:, 'SpecificValue'], :]
stats = subset.groupby(level=0).mean()Next Steps
You can now work with hierarchical indexes for complex data structures. In the next lesson, you will learn Window Functions and Rolling Operations—calculating moving averages, cumulative sums, and time-based window calculations.
Continue to Lesson 20 - Window Functions and Rolling
Learn rolling averages, cumulative sums, and window calculations
Back to Lesson 18 - Merging and Joining
Review SQL-style joins for combining DataFrames
Master Hierarchical Data
Multi-index unlocks powerful hierarchical data representations—essential for complex analyses and elegant data structures.
Use multi-index to organize and analyze multi-dimensional data efficiently!