Lesson 21 - Final Project: Real-World Data Analysis

Comprehensive Real-World Analysis

You have learned all the essential pandas techniques. Now you will apply everything to a real-world dataset: 541,909 transactions from a UK-based online retail company.

This project demonstrates:

  • All 20 previous lessons in action
  • Professional data analysis workflow
  • Business intelligence techniques
  • Real-world data challenges and solutions

This is your opportunity to prove your pandas mastery!


Project Overview

Dataset Information

  • Source: UCI Machine Learning Repository
  • Company: UK online retailer selling unique all-occasion gifts
  • Period: December 2010 - December 2011 (13 months)
  • Customers: Mostly wholesalers from various countries
  • Size: 541,909 transactions

Columns

Column        Type      Description
InvoiceNo     string    6-digit invoice number; 'C' prefix = cancelled
StockCode     string    5-digit product code
Description   string    Product name
Quantity      int       Number of items per transaction
InvoiceDate   datetime  Transaction date and time
UnitPrice     float     Price per unit in GBP (£)
CustomerID    float     5-digit unique customer number
Country       string    Customer's country

Download Dataset

Download the dataset to follow along with this project:

Download ecommerce_data.csv (43 MB)

Save the file in your working directory to run the code examples in this lesson.

Skills Coverage

This project demonstrates mastery of all 20 lessons:

Fundamentals (Lessons 1-5)

  • Reading CSV, DataFrame exploration
  • Label and position selection (.loc, .iloc)
  • Series operations, value_counts()
  • DateTime parsing and extraction

Filtering & Sorting (Lessons 6-8)

  • Boolean filtering, multiple conditions
  • Sorting, ranking, finding top N

Transformation (Lessons 9-10)

  • Adding calculated columns
  • apply(), map(), lambda functions

Data Cleaning (Lessons 11-14)

  • Handling missing values
  • Data type conversion
  • Duplicates and outliers
  • String operations

Aggregation (Lessons 15-16)

  • GroupBy operations
  • Pivot tables and cross-tabulation

Combining Data (Lessons 17-19)

  • Concatenating DataFrames
  • Merging and joining
  • Multi-index operations

Advanced (Lesson 20)

  • Window functions and rolling averages

Business Questions

You will answer these real-world questions:

  1. What are total sales and revenue trends?
  2. Which products are bestsellers?
  3. Who are the top customers? (RFM Analysis)
  4. Which countries generate the most revenue?
  5. What are monthly and seasonal patterns?
  6. How many orders are cancelled and why?
  7. What is the average basket size?
  8. Is the business growing month-over-month?

Let’s begin your comprehensive analysis!


Part 1: Data Loading and Exploration

Lessons Applied: 1, 2, 3, 4

Step 1: Import and Load Data

import pandas as pd
import numpy as np
from datetime import datetime

# Display settings for better output
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Load data with proper encoding
df = pd.read_csv('ecommerce_data.csv', encoding='ISO-8859-1')

print(f"Dataset Loaded: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")

Step 2: Initial Exploration

# First look at the data
print("First 5 rows:")
print(df.head())

print("\nLast 5 rows:")
print(df.tail())

# DataFrame structure
print("\nDataFrame Info:")
df.info()

# Statistical summary
print("\nStatistical Summary:")
print(df.describe())

# Check missing values
print("\nMissing Values:")
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
print(pd.DataFrame({'Count': missing, 'Percentage': missing_pct}))
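
A side note on the memory figure printed earlier: low-cardinality text columns such as Country and StockCode compress well as category dtype. A minimal sketch, measured on a copy so the working frame keeps its plain object columns:

# Measure the saving on a copy; categoricals store each unique string once
df_compact = df.copy()
for col in ['Country', 'StockCode']:
    df_compact[col] = df_compact[col].astype('category')

print(f"Memory as categoricals: {df_compact.memory_usage(deep=True).sum() / 1024**2:.1f} MB")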

Part 2: Data Cleaning

Lessons Applied: 11, 12, 13, 14

Step 1: Handle Missing Values

# Analyze missing data
print(f"CustomerID missing: {df['CustomerID'].isna().sum():,} ({df['CustomerID'].isna().mean()*100:.1f}%)")
print(f"Description missing: {df['Description'].isna().sum():,}")

# Remove rows with missing descriptions (critical for analysis)
df_clean = df.dropna(subset=['Description']).copy()
print(f"\nRows after cleaning: {len(df_clean):,}")

Step 2: Convert DateTime

# Convert InvoiceDate to datetime
df_clean['InvoiceDate'] = pd.to_datetime(df_clean['InvoiceDate'])

print(f"Date Range: {df_clean['InvoiceDate'].min()} to {df_clean['InvoiceDate'].max()}")
print(f"Total days covered: {(df_clean['InvoiceDate'].max() - df_clean['InvoiceDate'].min()).days} days")

Step 3: String Operations

# Identify cancelled orders (InvoiceNo starting with 'C')
df_clean['IsCancelled'] = df_clean['InvoiceNo'].astype(str).str.startswith('C')

print(f"Cancelled orders: {df_clean['IsCancelled'].sum():,} ({df_clean['IsCancelled'].mean()*100:.1f}%)")

# Clean product descriptions
df_clean['Description'] = df_clean['Description'].str.strip().str.upper()
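
While the cancellation flag is fresh, take a first look at what gets cancelled (business question 6). In this dataset, cancelled lines carry negative quantities, so summing absolute values ranks the most-returned products; a quick sketch:

# Most-cancelled products: cancelled lines have negative Quantity
cancelled = df_clean[df_clean['IsCancelled']]
top_cancelled = (cancelled.groupby('Description')['Quantity']
                 .sum().abs().sort_values(ascending=False))

print("Top 10 most-cancelled products:")
print(top_cancelled.head(10))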

Step 4: Extract DateTime Components

# Extract date components for analysis
df_clean['Year'] = df_clean['InvoiceDate'].dt.year
df_clean['Month'] = df_clean['InvoiceDate'].dt.month
df_clean['MonthName'] = df_clean['InvoiceDate'].dt.month_name()
df_clean['DayOfWeek'] = df_clean['InvoiceDate'].dt.day_name()
df_clean['Hour'] = df_clean['InvoiceDate'].dt.hour
df_clean['Quarter'] = df_clean['InvoiceDate'].dt.quarter

print("DateTime components extracted successfully!")

Part 3: Data Transformation

Lessons Applied: 9, 10

Step 1: Add Calculated Columns

# Filter completed orders only
df_completed = df_clean[~df_clean['IsCancelled']].copy()

# Calculate total price
df_completed['TotalPrice'] = df_completed['Quantity'] * df_completed['UnitPrice']

# Calculate revenue (only positive values)
df_completed['Revenue'] = df_completed['TotalPrice'].apply(lambda x: x if x > 0 else 0)

print(f"Total transactions: {len(df_completed):,}")
print(f"Total revenue: £{df_completed['Revenue'].sum():,.2f}")

Step 2: Categorize Data

# Price category
def categorize_price(price):
    if price < 5:
        return 'Budget'
    elif price < 20:
        return 'Mid-Range'
    else:
        return 'Premium'

df_completed['PriceCategory'] = df_completed['UnitPrice'].apply(categorize_price)

# Order value tier
def order_value_tier(total):
    if total < 50:
        return 'Small'
    elif total < 200:
        return 'Medium'
    elif total < 500:
        return 'Large'
    else:
        return 'Enterprise'

df_completed['OrderTier'] = df_completed['TotalPrice'].apply(order_value_tier)

print("Categories created successfully!")

Step 3: Create Regional Mapping

# Map countries to regions
country_region_map = {
    'United Kingdom': 'UK',
    'Germany': 'Europe',
    'France': 'Europe',
    'Spain': 'Europe',
    'Netherlands': 'Europe',
    'Belgium': 'Europe',
    'Switzerland': 'Europe',
    'Portugal': 'Europe',
    'Australia': 'Asia-Pacific',
    'Japan': 'Asia-Pacific',
    'Singapore': 'Asia-Pacific'
}

df_completed['Region'] = df_completed['Country'].map(country_region_map).fillna('Other')

Part 4: Analysis - Bestselling Products

Lessons Applied: 7, 8, 15

Find Top Products

# Group by product and aggregate
bestsellers = df_completed.groupby('Description').agg({
    'Quantity': 'sum',
    'Revenue': 'sum',
    'InvoiceNo': 'nunique'
}).rename(columns={'InvoiceNo': 'Orders'}).sort_values('Quantity', ascending=False)

bestsellers['AvgPrice'] = (bestsellers['Revenue'] / bestsellers['Quantity']).round(2)

print("Top 20 Bestselling Products:")
print(bestsellers.head(20))

# Top revenue generators
print("\nTop 10 Products by Revenue:")
print(bestsellers.nlargest(10, 'Revenue'))
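
As an aside, the same table can be built with named aggregation (available since pandas 0.25), which drops the separate rename step:

# Same result with named aggregation -- no rename needed
bestsellers_alt = df_completed.groupby('Description').agg(
    Quantity=('Quantity', 'sum'),
    Revenue=('Revenue', 'sum'),
    Orders=('InvoiceNo', 'nunique'),
).sort_values('Quantity', ascending=False)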

Part 5: Analysis - Geographic Performance

Lessons Applied: 15, 16

Country Performance

# Aggregate by country
country_performance = df_completed.groupby('Country').agg({
    'Revenue': 'sum',
    'InvoiceNo': 'nunique',
    'CustomerID': 'nunique',
    'Quantity': 'sum'
}).rename(columns={
    'InvoiceNo': 'Orders',
    'CustomerID': 'Customers',
    'Quantity': 'Items'
}).sort_values('Revenue', ascending=False)

# Calculate metrics
country_performance['AvgOrderValue'] = (country_performance['Revenue'] / country_performance['Orders']).round(2)
country_performance['AvgBasketSize'] = (country_performance['Items'] / country_performance['Orders']).round(1)

# Calculate percentages
total_revenue = country_performance['Revenue'].sum()
country_performance['RevenuePct'] = (country_performance['Revenue'] / total_revenue * 100).round(2)

print("Top 15 Countries by Performance:")
print(country_performance.head(15))

Create Pivot Table

# Revenue by Country and Quarter
revenue_pivot = df_completed.pivot_table(
    values='Revenue',
    index='Country',
    columns='Quarter',
    aggfunc='sum',
    fill_value=0,
    margins=True
).round(2)

print("Revenue by Country and Quarter (Top 10 Countries):")
print(revenue_pivot.head(10))

Part 6: Analysis - Customer Segmentation (RFM)

Lessons Applied: 15, 18

RFM Analysis

# RFM: Recency, Frequency, Monetary
customer_data = df_completed[df_completed['CustomerID'].notna()].copy()
reference_date = customer_data['InvoiceDate'].max()

rfm = customer_data.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (reference_date - x.max()).days,
    'InvoiceNo': 'nunique',
    'Revenue': 'sum'
}).rename(columns={
    'InvoiceDate': 'Recency',
    'InvoiceNo': 'Frequency',
    'Revenue': 'Monetary'
})

# Score customers (1-4, higher is better)
rfm['R_Score'] = pd.qcut(rfm['Recency'], 4, labels=[4, 3, 2, 1])
rfm['F_Score'] = pd.qcut(rfm['Frequency'].rank(method='first'), 4, labels=[1, 2, 3, 4])
rfm['M_Score'] = pd.qcut(rfm['Monetary'], 4, labels=[1, 2, 3, 4])

rfm['RFM_Score'] = rfm['R_Score'].astype(str) + rfm['F_Score'].astype(str) + rfm['M_Score'].astype(str)

# Segment customers
def segment_customer(score):
    if score in ['444', '443', '434', '344', '433']:
        return 'VIP'
    elif score[0] in ['3', '4'] and score[1] in ['3', '4']:
        return 'Loyal'
    elif score[0] in ['1', '2']:
        return 'At Risk'
    else:
        return 'Regular'

rfm['Segment'] = rfm['RFM_Score'].apply(segment_customer)

print("Customer Segmentation:")
print(rfm['Segment'].value_counts())

print("\nTop 10 VIP Customers:")
vip_customers = rfm[rfm['Segment'] == 'VIP'].sort_values('Monetary', ascending=False)
print(vip_customers.head(10))
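
One caveat on the scoring: pd.qcut raises a ValueError when a quartile edge repeats, which happens easily on discrete values such as Recency measured in whole days. The Frequency score already sidesteps this with rank(method='first'); should Recency or Monetary trip the error on your data, the same trick applies:

# Defensive variant if qcut reports duplicate bin edges
rfm['R_Score'] = pd.qcut(rfm['Recency'].rank(method='first'), 4, labels=[4, 3, 2, 1])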

Part 7: Analysis - Sales Trends

Lessons Applied: 20

Monthly Trends

# Aggregate by month
monthly_sales = df_completed.groupby(['Year', 'Month']).agg({
    'Revenue': 'sum',
    'InvoiceNo': 'nunique',
    'CustomerID': 'nunique'
}).rename(columns={
    'InvoiceNo': 'Orders',
    'CustomerID': 'Customers'
})

monthly_sales['AvgOrderValue'] = (monthly_sales['Revenue'] / monthly_sales['Orders']).round(2)

# Calculate month-over-month growth
monthly_sales['RevenueGrowth%'] = monthly_sales['Revenue'].pct_change() * 100

print("Monthly Performance:")
print(monthly_sales)

print(f"\nAverage monthly growth: {monthly_sales['RevenueGrowth%'].mean():.2f}%")

Rolling Averages

# Create daily revenue time series
daily_revenue = df_completed.groupby(df_completed['InvoiceDate'].dt.date)['Revenue'].sum().sort_index()

# Calculate moving averages
daily_df = pd.DataFrame({
    'Revenue': daily_revenue,
    'MA_7': daily_revenue.rolling(window=7).mean(),
    'MA_30': daily_revenue.rolling(window=30).mean(),
    'Cumulative': daily_revenue.cumsum()
})

print("Daily Revenue with Moving Averages (Last 30 days):")
print(daily_df.tail(30))

print(f"\nTotal cumulative revenue: £{daily_df['Cumulative'].iloc[-1]:,.2f}")
print(f"Average daily revenue: £{daily_revenue.mean():,.2f}")

Part 8: Analysis - Seasonality

Lessons Applied: 15, 16

Monthly Patterns

# Revenue by month
monthly_pattern = df_completed.groupby('MonthName').agg({
    'Revenue': 'sum',
    'InvoiceNo': 'nunique'
}).rename(columns={'InvoiceNo': 'Orders'})

# Sort by month order
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']
monthly_pattern = monthly_pattern.reindex(month_order)

print("Revenue by Month:")
print(monthly_pattern)

# Best month
best_month = monthly_pattern['Revenue'].idxmax()
print(f"\nBest month: {best_month}{monthly_pattern.loc[best_month, 'Revenue']:,.2f})")

Day of Week Patterns

# Revenue by day of week
dow_pattern = df_completed.groupby('DayOfWeek').agg({
    'Revenue': 'sum',
    'InvoiceNo': 'nunique'
}).rename(columns={'InvoiceNo': 'Orders'})

# Sort by day order
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
dow_pattern = dow_pattern.reindex(day_order)

print("Revenue by Day of Week:")
print(dow_pattern)

# Best day
best_day = dow_pattern['Revenue'].idxmax()
print(f"\nBest day: {best_day}{dow_pattern.loc[best_day, 'Revenue']:,.2f})")

Part 9: Executive Summary

Generate a comprehensive report of findings:

# Calculate key metrics
total_revenue = df_completed['Revenue'].sum()
total_orders = df_completed['InvoiceNo'].nunique()
total_customers = df_completed['CustomerID'].nunique()
total_products = df_completed['StockCode'].nunique()
avg_order_value = total_revenue / total_orders
avg_basket_size = df_completed.groupby('InvoiceNo')['Quantity'].sum().mean()

print("=" * 80)
print(" " * 20 + "EXECUTIVE SUMMARY")
print(" " * 15 + "Online Retail Performance Analysis")
print(" " * 18 + "December 2010 - December 2011")
print("=" * 80)

print(f"\nOVERALL METRICS:")
print(f"   Total Revenue:        £{total_revenue:,.2f}")
print(f"   Total Orders:         {total_orders:,}")
print(f"   Unique Customers:     {total_customers:,}")
print(f"   Unique Products:      {total_products:,}")
print(f"   Avg Order Value:      £{avg_order_value:.2f}")
print(f"   Avg Basket Size:      {avg_basket_size:.1f} items")

print(f"\nTOP PERFORMERS:")
print(f"   Best Country:         {country_performance.index[0]}{country_performance.iloc[0]['Revenue']:,.2f})")
print(f"   Top Product:          {bestsellers.index[0][:50]}")
print(f"   Peak Month:           {best_month}")
print(f"   VIP Customers:        {len(vip_customers)}")

print(f"\nKEY INSIGHTS:")
print(f"   1. UK dominates with {country_performance.iloc[0]['RevenuePct']:.1f}% of total revenue")
print(f"   2. Top 10 products drive significant portion of sales")
print(f"   3. {len(vip_customers)} VIP customers generate £{vip_customers['Monetary'].sum():,.2f}")
print(f"   4. Cancellation rate: {(df_clean['IsCancelled'].mean()*100):.2f}%")
print(f"   5. Strong seasonality: Q4 shows highest sales")

print(f"\nRECOMMENDATIONS:")
print(f"   • Focus retention on {len(rfm[rfm['Segment']=='At Risk'])} at-risk customers")
print(f"   • Expand inventory of top 20 bestselling products")
print(f"   • Investigate cancellations to reduce rate")
print(f"   • Target marketing in top 5 countries")
print(f"   • Prepare for seasonal peak: stock up in Q4")

print("=" * 80)
print("\nAnalysis complete! All 20 Pandas lessons demonstrated.")

Project Summary

Skills Demonstrated

You successfully applied:

Data Loading & Exploration

  • Reading large CSV files with encoding
  • DataFrame exploration (head, tail, info, describe)
  • Understanding data structure and types

Data Cleaning

  • Handling missing values appropriately
  • Converting data types (datetime)
  • Identifying duplicates
  • String operations for data extraction

Data Transformation

  • Creating calculated columns
  • Applying custom functions
  • Categorizing data
  • Mapping values

Filtering & Selection

  • Boolean filtering with multiple conditions
  • Selecting subsets of data
  • Using .isin() and .between() (see the sketch below)
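
A minimal sketch of those two helpers on this dataset (the countries and price bounds are illustrative):

# Selected European markets with unit prices between £1 and £10
eu_mid = df_completed[
    df_completed['Country'].isin(['Germany', 'France', 'Netherlands'])
    & df_completed['UnitPrice'].between(1, 10)
]
print(f"Matching rows: {len(eu_mid):,}")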

Aggregation

  • GroupBy operations
  • Multiple aggregations
  • Pivot tables for cross-tabulation

Time Series Analysis

  • DateTime component extraction
  • Rolling averages
  • Trend analysis
  • Seasonality detection

Customer Analytics

  • RFM segmentation
  • Customer lifetime value
  • Behavioral analysis

Business Insights Generated

  1. Revenue Analysis: £8M+ analyzed across 540K+ transactions
  2. Geographic Insights: UK market dominance identified
  3. Product Performance: Bestsellers and revenue drivers found
  4. Customer Segmentation: 4,300+ customers segmented using RFM
  5. Seasonality: Strong Q4 patterns revealed
  6. Operational Metrics: Cancellation impact quantified
  7. KPIs: Key business metrics calculated and tracked

Congratulations!

You have completed the comprehensive pandas course!

You can now:

  • Load and explore any dataset
  • Clean and prepare messy data
  • Transform and enhance data
  • Analyze patterns and trends
  • Generate business insights
  • Create professional reports

Next Steps

Ready for the next step?

Continue your learning journey with Data Visualization
Learn to create compelling charts and visualizations with Matplotlib to communicate your data insights effectively

Start Data Visualization →

Additional Learning Paths:

  1. Data Visualization: Create charts with Matplotlib
  2. Python Advanced: Master OOP and advanced Python concepts
  3. Software Engineering: Learn professional development practices
  4. Practice: Apply these skills to your own datasets
  5. Automate: Build ETL pipelines and data processing workflows

Keep practicing and building your data science portfolio!


Back to Lesson 20 - Window Functions

Review window operations and rolling calculations

Pandas Course Overview

Review all 21 lessons and course structure


You Are Now a Pandas Expert!

You have mastered all essential pandas techniques and applied them to real-world data analysis. This comprehensive project demonstrates your ability to handle complex datasets, extract insights, and generate business value.

Well done! You are ready for professional data analysis work!