Lesson 19 - MultiIndex and Hierarchical Data

Working with Hierarchical Indexes

You can merge related DataFrames. Now you will learn MultiIndex—using hierarchical indexes to store and manipulate higher-dimensional data in a 2D DataFrame.

By the end of this lesson, you will be able to:

  • Create multi-level indexes with set_index()
  • Select data from multi-indexed DataFrames
  • Use .stack() and .unstack() to reshape data
  • Reset and manipulate hierarchical indexes
  • Understand when multi-index is useful
  • Aggregate data by specific index levels

Multi-index enables elegant representation of complex data hierarchies like Country → City → Store or Year → Quarter → Month. Let’s master this advanced technique.


Why MultiIndex?

Multi-index (hierarchical indexing) allows you to:

  • Store higher-dimensional data in 2D DataFrame
  • Group data by multiple levels (e.g., Country → City → Store)
  • Perform efficient operations on grouped data
  • Reshape data between wide and long formats

Common use cases:

  • Time series with multiple locations
  • Sales data by Region → City → Product
  • Scientific data with multiple experimental factors
  • Panel data (multiple metrics over time for different entities)

Sample Data

Let’s create sales data across different regions and products:

import pandas as pd
import numpy as np

# Sales data
sales = pd.DataFrame({
    'region': ['North', 'North', 'North', 'South', 'South', 'South',
               'East', 'East', 'East', 'West', 'West', 'West'],
    'city': ['New York', 'New York', 'New York', 'London', 'London', 'London',
             'Paris', 'Paris', 'Paris', 'Berlin', 'Berlin', 'Berlin'],
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Mouse', 'Keyboard',
                'Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Mouse', 'Keyboard'],
    'quantity': [15, 120, 45, 20, 150, 60, 12, 90, 30, 18, 110, 40],
    'revenue': [18000, 3600, 3600, 24000, 4500, 4800, 14400, 2700, 2400, 21600, 3300, 3200]
})

print("Original Sales Data:")
print(sales)
print(f"\nShape: {sales.shape}")

This dataset has 12 records across multiple regions, cities, and products—perfect for demonstrating hierarchical indexing.


Creating Multi-Index

Transform regular columns into hierarchical index levels.

Create Multi-Index with set_index()

# Create multi-index: Region → City → Product
sales_multi = sales.set_index(['region', 'city', 'product'])

print("Multi-Indexed DataFrame:")
print(sales_multi)
print(f"\nIndex levels: {sales_multi.index.names}")
print(f"Number of levels: {sales_multi.index.nlevels}")

Output:

                                    quantity  revenue
region city     product
North  New York Laptop                   15    18000
                Mouse                   120     3600
                Keyboard                 45     3600
South  London   Laptop                   20    24000
                Mouse                   150     4500
                Keyboard                 60     4800
East   Paris    Laptop                   12    14400
                Mouse                    90     2700
                Keyboard                 30     2400
West   Berlin   Laptop                   18    21600
                Mouse                   110     3300
                Keyboard                 40     3200

Notice the hierarchical structure:

  • Level 0: region
  • Level 1: city
  • Level 2: product

Selecting from Multi-Index

Navigate hierarchical data using .loc[] with different approaches.

Select by Single Level

# Select all data for North region
north_sales = sales_multi.loc['North']

print("North Region Sales:")
print(north_sales)

Output:

                         quantity  revenue
city     product
New York Laptop                15    18000
         Mouse                120     3600
         Keyboard              45     3600

Select by Multiple Levels

Use tuples to select specific combinations:

# Select North → New York
north_newyork = sales_multi.loc[('North', 'New York')]

print("North Region, New York City:")
print(north_newyork)

Output:

          quantity  revenue
product
Laptop          15    18000
Mouse          120     3600
Keyboard        45     3600

Select Specific Item

# Select specific product: North → New York → Laptop
specific = sales_multi.loc[('North', 'New York', 'Laptop')]

print("North → New York → Laptop:")
print(specific)
print(f"\nType: {type(specific)}")

Output:

quantity     15
revenue   18000
Name: (North, New York, Laptop), dtype: int64

Slice Across Levels

Use IndexSlice for cleaner syntax when slicing:

# Select all Laptops across all regions/cities
idx = pd.IndexSlice
all_laptops = sales_multi.loc[idx[:, :, 'Laptop'], :]

print("All Laptop Sales:")
print(all_laptops)

Output:

                               quantity  revenue
region city     product
North  New York Laptop               15    18000
South  London   Laptop               20    24000
East   Paris    Laptop               12    14400
West   Berlin   Laptop               18    21600

Aggregating Multi-Index Data

Perform calculations across specific index levels.

Aggregate by Level

# Total revenue by region (level 0)
revenue_by_region = sales_multi.groupby(level=0)['revenue'].sum()

print("Revenue by Region:")
print(revenue_by_region)
print(f"\nBest region: {revenue_by_region.idxmax()} (${revenue_by_region.max():,})")

Output:

region
East     19500
North    25200
South    33300
West     28100
Name: revenue, dtype: int64

Best region: South ($33,300)

Aggregate by Multiple Levels

# Total quantity by product (level 2)
quantity_by_product = sales_multi.groupby(level=2)['quantity'].sum().sort_values(ascending=False)

print("Quantity by Product:")
print(quantity_by_product)

Output:

product
Mouse       470
Keyboard    175
Laptop       65
Name: quantity, dtype: int64

Aggregate by Multiple Levels Simultaneously

# Revenue by region and product
revenue_by_region_product = sales_multi.groupby(level=[0, 2])['revenue'].sum()

print("Revenue by Region and Product:")
print(revenue_by_region_product)

Output:

region  product
East    Keyboard     2400
        Laptop      14400
        Mouse        2700
North   Keyboard     3600
        Laptop      18000
        Mouse        3600
South   Keyboard     4800
        Laptop      24000
        Mouse        4500
West    Keyboard     3200
        Laptop      21600
        Mouse        3300
Name: revenue, dtype: int64

Stack and Unstack

Reshape data by moving index levels to columns and vice versa.

Unstack - Move Index to Columns

.unstack() pivots the innermost index level to columns:

# Unstack product level (innermost)
unstacked = sales_multi.unstack()

print("Unstacked (Products as Columns):")
print(unstacked)

Output:

                quantity                       revenue
product          Keyboard Laptop Mouse Keyboard  Laptop  Mouse
region city
East   Paris           30     12    90     2400   14400   2700
North  New York        45     15   120     3600   18000   3600
South  London          60     20   150     4800   24000   4500
West   Berlin          40     18   110     3200   21600   3300

This creates a wider DataFrame with products as column headers.

Unstack Specific Level

# Unstack region level
unstacked_region = sales_multi.unstack(level='region')

print("Unstacked by Region:")
print(unstacked_region.head())

Stack - Move Columns to Index

.stack() does the opposite—moves column level to index:

# Stack back to original shape
stacked_back = unstacked.stack()

print("Stacked Back:")
print(stacked_back.head())

Practical Use - Creating Pivot-like Views

# Revenue by Region (rows) and Product (columns)
revenue_pivot = sales_multi['revenue'].unstack(level='product')

print("Revenue: Regions vs Products:")
print(revenue_pivot)

Output:

product        Keyboard   Laptop  Mouse
region city
East   Paris       2400  14400.0   2700
North  New York    3600  18000.0   3600
South  London      4800  24000.0   4500
West   Berlin      3200  21600.0   3300

This creates a pivot table-like view without using pivot_table()!


Resetting Index

Convert multi-index back to regular columns.

Reset All Index Levels

# Reset index to make it a regular DataFrame
flat_df = sales_multi.reset_index()

print("Flattened DataFrame:")
print(flat_df)

Output:

  region      city   product  quantity  revenue
0  North  New York    Laptop        15    18000
1  North  New York     Mouse       120     3600
2  North  New York  Keyboard        45     3600
...

Reset Specific Level

# Reset only the product level
partial_reset = sales_multi.reset_index(level='product')

print("Reset Product Level Only:")
print(partial_reset.head())

Practice Exercises

Apply multi-index operations with these exercises.

Exercise 1: Student Performance Multi-Index

Create and analyze student performance across subjects and semesters:

students = pd.DataFrame({
    'student': ['James', 'James', 'James', 'Maria', 'Maria', 'Maria', 'David', 'David', 'David'],
    'semester': ['Fall', 'Fall', 'Spring', 'Fall', 'Fall', 'Spring', 'Fall', 'Fall', 'Spring'],
    'subject': ['Math', 'Physics', 'Math', 'Math', 'Physics', 'Math', 'Math', 'Physics', 'Math'],
    'score': [85, 78, 92, 90, 88, 95, 76, 82, 80]
})
  1. Set multi-index on student and semester
  2. Find James’s scores for all semesters
  3. Calculate average score by student
  4. Unstack to show subjects as columns
# Your code here

Hint

Use .loc['James'] to select all data for James across all index levels!

Exercise 2: Temperature Data Reshaping

Work with temperature data across cities and seasons:

temps = pd.DataFrame({
    'city': ['New York', 'New York', 'New York', 'London', 'London', 'London', 'Paris', 'Paris', 'Paris'],
    'season': ['Winter', 'Spring', 'Summer', 'Winter', 'Spring', 'Summer', 'Winter', 'Spring', 'Summer'],
    'temp_celsius': [8, 18, 35, 20, 30, 42, 5, 15, 28]
})
  1. Create multi-index on city and season
  2. Use unstack to create a wide format (seasons as columns)
  3. Find the warmest city in Summer
  4. Calculate average temperature per city
# Your code here

Summary

You now master multi-index and hierarchical data. Let’s review the key concepts.

Key Concepts

Creating Multi-Index

  • Use set_index() with multiple columns
  • Creates hierarchical index structure
  • Each level can be referenced by number or name
  • Enables elegant representation of complex hierarchies

Selecting Data

  • Single level: df.loc['level0_value']
  • Multiple levels: df.loc[('level0', 'level1')]
  • Slicing with IndexSlice: df.loc[idx[:, :, 'value'], :]
  • Returns DataFrame or Series depending on selection

Aggregating by Level

  • Group by level number: df.groupby(level=0)
  • Group by level name: df.groupby(level='name')
  • Multiple levels: df.groupby(level=[0, 2])

Reshaping

  • Unstack: Index → Columns (creates wide format)
  • Stack: Columns → Index (creates long format)
  • Specify level to control which level moves
  • Useful for pivot-like transformations

Syntax Reference

# Create multi-index
df.set_index(['col1', 'col2', 'col3'])

# Select by levels
df.loc['level0']
df.loc[('level0', 'level1')]
df.loc[('level0', 'level1', 'level2')]

# IndexSlice for complex slicing
idx = pd.IndexSlice
df.loc[idx[:, :, 'value'], :]

# Aggregate by level
df.groupby(level=0).sum()
df.groupby(level='name').mean()
df.groupby(level=[0, 2]).sum()

# Stack and unstack
df.unstack()              # Innermost to columns
df.unstack(level=0)       # Specific level
df.unstack(level='name')  # By name
df.stack()                # Columns to index

# Reset index
df.reset_index()          # All levels
df.reset_index(level=0)   # Specific level
df.reset_index(level='name')

When to Use Multi-Index?

Good Use Cases:

  • Time series data with multiple locations
  • Data with natural hierarchy (Country → State → City)
  • Results of GroupBy operations (keep the multi-index)
  • Panel data (multiple metrics over time for entities)
  • Scientific experiments with multiple factors

Avoid When:

  • Simple, flat data structure is sufficient
  • Frequent need to reset index for operations
  • Sharing data with users unfamiliar with hierarchical indexing
  • Performance is critical (multi-index has overhead)

Important Reminders

  • Tuples for selection: Use df.loc[('A', 'B')] for multiple levels
  • IndexSlice: Use pd.IndexSlice for complex slicing with : syntax
  • Unstack creates NaN: Missing combinations become NaN in wide format
  • Level parameter: Can use numbers (0, 1, 2) or names
  • GroupBy creates multi-index: Many operations naturally create it
  • Reset when needed: Use reset_index() to flatten for export

Common Patterns

Pattern 1: Create and Aggregate

df_multi = df.set_index(['category', 'subcategory'])
result = df_multi.groupby(level=0).sum()

Pattern 2: Reshape for Analysis

df_multi = df.set_index(['date', 'metric'])
wide = df_multi.unstack(level='metric')

Pattern 3: Select and Compute

idx = pd.IndexSlice
subset = df_multi.loc[idx[:, 'SpecificValue'], :]
stats = subset.groupby(level=0).mean()

Next Steps

You can now work with hierarchical indexes for complex data structures. In the next lesson, you will learn Window Functions and Rolling Operations—calculating moving averages, cumulative sums, and time-based window calculations.

Continue to Lesson 20 - Window Functions and Rolling

Learn rolling averages, cumulative sums, and window calculations

Back to Lesson 18 - Merging and Joining

Review SQL-style joins for combining DataFrames


Master Hierarchical Data

Multi-index unlocks powerful hierarchical data representations—essential for complex analyses and elegant data structures.

Use multi-index to organize and analyze multi-dimensional data efficiently!