Lesson 21 - Final Project: Real-World Data Analysis

Comprehensive Real-World Analysis

You have learned all the essential pandas techniques. Now you will apply everything to a real-world dataset: 541,909 transactions from a UK-based online retail company.

This project demonstrates:

  • All 20 previous lessons in action
  • Professional data analysis workflow
  • Business intelligence techniques
  • Real-world data challenges and solutions

This is your opportunity to prove your pandas mastery!


Project Overview

Dataset Information

  • Source: UCI Machine Learning Repository
  • Company: UK online retailer selling unique all-occasion gifts
  • Period: December 2010 - December 2011 (13 months)
  • Customers: Mostly wholesalers from various countries
  • Size: 541,909 transactions

Columns

Column        Type      Description
InvoiceNo     string    6-digit invoice number; 'C' prefix = cancelled
StockCode     string    5-digit product code
Description   string    Product name
Quantity      int       Number of items per transaction
InvoiceDate   datetime  Transaction date and time
UnitPrice     float     Price per unit in GBP (£)
CustomerID    float     5-digit unique customer number
Country       string    Customer's country

Download Dataset

Download the dataset to follow along with this project:

Download ecommerce_data.csv (43 MB)

Save the file in your working directory to run the code examples in this lesson.

Skills Coverage

This project demonstrates mastery of all 20 lessons:

Fundamentals (Lessons 1-5)

  • Reading CSV, DataFrame exploration
  • Label and position selection (.loc, .iloc)
  • Series operations, value_counts()
  • DateTime parsing and extraction

Filtering & Sorting (Lessons 6-8)

  • Boolean filtering, multiple conditions
  • Sorting, ranking, finding top N

Transformation (Lessons 9-10)

  • Adding calculated columns
  • apply(), map(), lambda functions

Data Cleaning (Lessons 11-14)

  • Handling missing values
  • Data type conversion
  • Duplicates and outliers
  • String operations

Aggregation (Lessons 15-16)

  • GroupBy operations
  • Pivot tables and cross-tabulation

Combining Data (Lessons 17-19)

  • Concatenating DataFrames
  • Merging and joining
  • Multi-index operations

Advanced (Lesson 20)

  • Window functions and rolling averages

Business Questions

You will answer these real-world questions:

  1. What are total sales and revenue trends?
  2. Which products are bestsellers?
  3. Who are the top customers? (RFM Analysis)
  4. Which countries generate the most revenue?
  5. What are monthly and seasonal patterns?
  6. How many orders are cancelled and why?
  7. What is the average basket size?
  8. Is the business growing month-over-month?

Let’s begin your comprehensive analysis!


Part 1: Data Loading and Exploration

Lessons Applied: 1, 2, 3, 4

Step 1: Import and Load Data

import pandas as pd
import numpy as np
from datetime import datetime

# Display settings for better output
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Load data with proper encoding
df = pd.read_csv('ecommerce_data.csv', encoding='ISO-8859-1')

print(f"Dataset Loaded: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")

Step 2: Initial Exploration

# First look at the data
print("First 5 rows:")
print(df.head())

print("\nLast 5 rows:")
print(df.tail())

# DataFrame structure
print("\nDataFrame Info:")
df.info()

# Statistical summary
print("\nStatistical Summary:")
print(df.describe())

# Check missing values
print("\nMissing Values:")
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
print(pd.DataFrame({'Count': missing, 'Percentage': missing_pct}))
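
A side note on the memory figure printed earlier: low-cardinality text columns such as Country and StockCode compress well as category dtype. A minimal sketch, measured on a copy so the working frame keeps its plain object columns:

# Measure the saving on a copy; categoricals store each unique string once
df_compact = df.copy()
for col in ['Country', 'StockCode']:
    df_compact[col] = df_compact[col].astype('category')

print(f"Memory as categoricals: {df_compact.memory_usage(deep=True).sum() / 1024**2:.1f} MB")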

Part 2: Data Cleaning

Lessons Applied: 11, 12, 13, 14

Step 1: Handle Missing Values

# Analyze missing data
print(f"CustomerID missing: {df['CustomerID'].isna().sum():,} ({df['CustomerID'].isna().mean()*100:.1f}%)")
print(f"Description missing: {df['Description'].isna().sum():,}")

# Remove rows with missing descriptions (critical for analysis)
df_clean = df.dropna(subset=['Description']).copy()
print(f"\nRows after cleaning: {len(df_clean):,}")

Step 2: Convert DateTime

# Convert InvoiceDate to datetime
df_clean['InvoiceDate'] = pd.to_datetime(df_clean['InvoiceDate'])

print(f"Date Range: {df_clean['InvoiceDate'].min()} to {df_clean['InvoiceDate'].max()}")
print(f"Total days covered: {(df_clean['InvoiceDate'].max() - df_clean['InvoiceDate'].min()).days} days")

Step 3: String Operations

# Identify cancelled orders (InvoiceNo starting with 'C')
df_clean['IsCancelled'] = df_clean['InvoiceNo'].astype(str).str.startswith('C')

print(f"Cancelled orders: {df_clean['IsCancelled'].sum():,} ({df_clean['IsCancelled'].mean()*100:.1f}%)")

# Clean product descriptions
df_clean['Description'] = df_clean['Description'].str.strip().str.upper()
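
While the cancellation flag is fresh, take a first look at what gets cancelled (business question 6). In this dataset, cancelled lines carry negative quantities, so summing absolute values ranks the most-returned products; a quick sketch:

# Most-cancelled products: cancelled lines have negative Quantity
cancelled = df_clean[df_clean['IsCancelled']]
top_cancelled = (cancelled.groupby('Description')['Quantity']
                 .sum().abs().sort_values(ascending=False))

print("Top 10 most-cancelled products:")
print(top_cancelled.head(10))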

Step 4: Extract DateTime Components

# Extract date components for analysis
df_clean['Year'] = df_clean['InvoiceDate'].dt.year
df_clean['Month'] = df_clean['InvoiceDate'].dt.month
df_clean['MonthName'] = df_clean['InvoiceDate'].dt.month_name()
df_clean['DayOfWeek'] = df_clean['InvoiceDate'].dt.day_name()
df_clean['Hour'] = df_clean['InvoiceDate'].dt.hour
df_clean['Quarter'] = df_clean['InvoiceDate'].dt.quarter

print("DateTime components extracted successfully!")

Part 3: Data Transformation

Lessons Applied: 9, 10

Step 1: Add Calculated Columns

# Filter completed orders only
df_completed = df_clean[~df_clean['IsCancelled']].copy()

# Calculate total price
df_completed['TotalPrice'] = df_completed['Quantity'] * df_completed['UnitPrice']

# Calculate revenue (only positive values)
df_completed['Revenue'] = df_completed['TotalPrice'].apply(lambda x: x if x > 0 else 0)

print(f"Total transactions: {len(df_completed):,}")
print(f"Total revenue: £{df_completed['Revenue'].sum():,.2f}")

Step 2: Categorize Data

# Price category
def categorize_price(price):
    if price < 5:
        return 'Budget'
    elif price < 20:
        return 'Mid-Range'
    else:
        return 'Premium'

df_completed['PriceCategory'] = df_completed['UnitPrice'].apply(categorize_price)

# Order value tier
def order_value_tier(total):
    if total < 50:
        return 'Small'
    elif total < 200:
        return 'Medium'
    elif total < 500:
        return 'Large'
    else:
        return 'Enterprise'

df_completed['OrderTier'] = df_completed['TotalPrice'].apply(order_value_tier)

print("Categories created successfully!")

Step 3: Create Regional Mapping

# Map countries to regions
country_region_map = {
    'United Kingdom': 'UK',
    'Germany': 'Europe',
    'France': 'Europe',
    'Spain': 'Europe',
    'Netherlands': 'Europe',
    'Belgium': 'Europe',
    'Switzerland': 'Europe',
    'Portugal': 'Europe',
    'Australia': 'Asia-Pacific',
    'Japan': 'Asia-Pacific',
    'Singapore': 'Asia-Pacific'
}

df_completed['Region'] = df_completed['Country'].map(country_region_map).fillna('Other')

Part 4: Analysis - Bestselling Products

Lessons Applied: 7, 8, 15

Find Top Products

# Group by product and aggregate
bestsellers = df_completed.groupby('Description').agg({
    'Quantity': 'sum',
    'Revenue': 'sum',
    'InvoiceNo': 'nunique'
}).rename(columns={'InvoiceNo': 'Orders'}).sort_values('Quantity', ascending=False)

bestsellers['AvgPrice'] = (bestsellers['Revenue'] / bestsellers['Quantity']).round(2)

print("Top 20 Bestselling Products:")
print(bestsellers.head(20))

# Top revenue generators
print("\nTop 10 Products by Revenue:")
print(bestsellers.nlargest(10, 'Revenue'))
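
As an aside, the same table can be built with named aggregation (available since pandas 0.25), which drops the separate rename step:

# Same result with named aggregation -- no rename needed
bestsellers_alt = df_completed.groupby('Description').agg(
    Quantity=('Quantity', 'sum'),
    Revenue=('Revenue', 'sum'),
    Orders=('InvoiceNo', 'nunique'),
).sort_values('Quantity', ascending=False)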

Part 5: Analysis - Geographic Performance

Lessons Applied: 15, 16

Country Performance

# Aggregate by country
country_performance = df_completed.groupby('Country').agg({
    'Revenue': 'sum',
    'InvoiceNo': 'nunique',
    'CustomerID': 'nunique',
    'Quantity': 'sum'
}).rename(columns={
    'InvoiceNo': 'Orders',
    'CustomerID': 'Customers',
    'Quantity': 'Items'
}).sort_values('Revenue', ascending=False)

# Calculate metrics
country_performance['AvgOrderValue'] = (country_performance['Revenue'] / country_performance['Orders']).round(2)
country_performance['AvgBasketSize'] = (country_performance['Items'] / country_performance['Orders']).round(1)

# Calculate percentages
total_revenue = country_performance['Revenue'].sum()
country_performance['RevenuePct'] = (country_performance['Revenue'] / total_revenue * 100).round(2)

print("Top 15 Countries by Performance:")
print(country_performance.head(15))

Create Pivot Table

# Revenue by Country and Quarter
revenue_pivot = df_completed.pivot_table(
    values='Revenue',
    index='Country',
    columns='Quarter',
    aggfunc='sum',
    fill_value=0,
    margins=True
).round(2)

print("Revenue by Country and Quarter (Top 10 Countries):")
print(revenue_pivot.head(10))

Part 6: Analysis - Customer Segmentation (RFM)

Lessons Applied: 15, 18

RFM Analysis

# RFM: Recency, Frequency, Monetary
customer_data = df_completed[df_completed['CustomerID'].notna()].copy()
reference_date = customer_data['InvoiceDate'].max()

rfm = customer_data.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (reference_date - x.max()).days,
    'InvoiceNo': 'nunique',
    'Revenue': 'sum'
}).rename(columns={
    'InvoiceDate': 'Recency',
    'InvoiceNo': 'Frequency',
    'Revenue': 'Monetary'
})

# Score customers (1-4, higher is better)
rfm['R_Score'] = pd.qcut(rfm['Recency'], 4, labels=[4, 3, 2, 1])
rfm['F_Score'] = pd.qcut(rfm['Frequency'].rank(method='first'), 4, labels=[1, 2, 3, 4])
rfm['M_Score'] = pd.qcut(rfm['Monetary'], 4, labels=[1, 2, 3, 4])

rfm['RFM_Score'] = rfm['R_Score'].astype(str) + rfm['F_Score'].astype(str) + rfm['M_Score'].astype(str)

# Segment customers
def segment_customer(score):
    if score in ['444', '443', '434', '344', '433']:
        return 'VIP'
    elif score[0] in ['3', '4'] and score[1] in ['3', '4']:
        return 'Loyal'
    elif score[0] in ['1', '2']:
        return 'At Risk'
    else:
        return 'Regular'

rfm['Segment'] = rfm['RFM_Score'].apply(segment_customer)

print("Customer Segmentation:")
print(rfm['Segment'].value_counts())

print("\nTop 10 VIP Customers:")
vip_customers = rfm[rfm['Segment'] == 'VIP'].sort_values('Monetary', ascending=False)
print(vip_customers.head(10))
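
One caveat on the scoring: pd.qcut raises a ValueError when a quartile edge repeats, which happens easily on discrete values such as Recency measured in whole days. The Frequency score already sidesteps this with rank(method='first'); should Recency or Monetary trip the error on your data, the same trick applies:

# Defensive variant if qcut reports duplicate bin edges
rfm['R_Score'] = pd.qcut(rfm['Recency'].rank(method='first'), 4, labels=[4, 3, 2, 1])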

Part 7: Analysis - Sales Trends

Lessons Applied: 20

Monthly Trends

# Aggregate by month
monthly_sales = df_completed.groupby(['Year', 'Month']).agg({
    'Revenue': 'sum',
    'InvoiceNo': 'nunique',
    'CustomerID': 'nunique'
}).rename(columns={
    'InvoiceNo': 'Orders',
    'CustomerID': 'Customers'
})

monthly_sales['AvgOrderValue'] = (monthly_sales['Revenue'] / monthly_sales['Orders']).round(2)

# Calculate month-over-month growth
monthly_sales['RevenueGrowth%'] = monthly_sales['Revenue'].pct_change() * 100

print("Monthly Performance:")
print(monthly_sales)

print(f"\nAverage monthly growth: {monthly_sales['RevenueGrowth%'].mean():.2f}%")

Rolling Averages

# Create daily revenue time series
daily_revenue = df_completed.groupby(df_completed['InvoiceDate'].dt.date)['Revenue'].sum().sort_index()

# Calculate moving averages
daily_df = pd.DataFrame({
    'Revenue': daily_revenue,
    'MA_7': daily_revenue.rolling(window=7).mean(),
    'MA_30': daily_revenue.rolling(window=30).mean(),
    'Cumulative': daily_revenue.cumsum()
})

print("Daily Revenue with Moving Averages (Last 30 days):")
print(daily_df.tail(30))

print(f"\nTotal cumulative revenue: £{daily_df['Cumulative'].iloc[-1]:,.2f}")
print(f"Average daily revenue: £{daily_revenue.mean():,.2f}")

Part 8: Analysis - Seasonality

Lessons Applied: 15, 16

Monthly Patterns

# Revenue by month
monthly_pattern = df_completed.groupby('MonthName').agg({
    'Revenue': 'sum',
    'InvoiceNo': 'nunique'
}).rename(columns={'InvoiceNo': 'Orders'})

# Sort by month order
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']
monthly_pattern = monthly_pattern.reindex(month_order)

print("Revenue by Month:")
print(monthly_pattern)

# Best month
best_month = monthly_pattern['Revenue'].idxmax()
print(f"\nBest month: {best_month}{monthly_pattern.loc[best_month, 'Revenue']:,.2f})")

Day of Week Patterns

# Revenue by day of week
dow_pattern = df_completed.groupby('DayOfWeek').agg({
    'Revenue': 'sum',
    'InvoiceNo': 'nunique'
}).rename(columns={'InvoiceNo': 'Orders'})

# Sort by day order
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
dow_pattern = dow_pattern.reindex(day_order)

print("Revenue by Day of Week:")
print(dow_pattern)

# Best day
best_day = dow_pattern['Revenue'].idxmax()
print(f"\nBest day: {best_day}{dow_pattern.loc[best_day, 'Revenue']:,.2f})")

Part 9: Executive Summary

Generate a comprehensive report of findings:

# Calculate key metrics
total_revenue = df_completed['Revenue'].sum()
total_orders = df_completed['InvoiceNo'].nunique()
total_customers = df_completed['CustomerID'].nunique()
total_products = df_completed['StockCode'].nunique()
avg_order_value = total_revenue / total_orders
avg_basket_size = df_completed.groupby('InvoiceNo')['Quantity'].sum().mean()

print("=" * 80)
print(" " * 20 + "EXECUTIVE SUMMARY")
print(" " * 15 + "Online Retail Performance Analysis")
print(" " * 18 + "December 2010 - December 2011")
print("=" * 80)

print(f"\nOVERALL METRICS:")
print(f"   Total Revenue:        £{total_revenue:,.2f}")
print(f"   Total Orders:         {total_orders:,}")
print(f"   Unique Customers:     {total_customers:,}")
print(f"   Unique Products:      {total_products:,}")
print(f"   Avg Order Value:      £{avg_order_value:.2f}")
print(f"   Avg Basket Size:      {avg_basket_size:.1f} items")

print(f"\nTOP PERFORMERS:")
print(f"   Best Country:         {country_performance.index[0]}{country_performance.iloc[0]['Revenue']:,.2f})")
print(f"   Top Product:          {bestsellers.index[0][:50]}")
print(f"   Peak Month:           {best_month}")
print(f"   VIP Customers:        {len(vip_customers)}")

print(f"\nKEY INSIGHTS:")
print(f"   1. UK dominates with {country_performance.iloc[0]['RevenuePct']:.1f}% of total revenue")
print(f"   2. Top 10 products drive significant portion of sales")
print(f"   3. {len(vip_customers)} VIP customers generate £{vip_customers['Monetary'].sum():,.2f}")
print(f"   4. Cancellation rate: {(df_clean['IsCancelled'].mean()*100):.2f}%")
print(f"   5. Strong seasonality: Q4 shows highest sales")

print(f"\nRECOMMENDATIONS:")
print(f"   • Focus retention on {len(rfm[rfm['Segment']=='At Risk'])} at-risk customers")
print(f"   • Expand inventory of top 20 bestselling products")
print(f"   • Investigate cancellations to reduce rate")
print(f"   • Target marketing in top 5 countries")
print(f"   • Prepare for seasonal peak: stock up in Q4")

print("=" * 80)
print("\nAnalysis complete! All 20 Pandas lessons demonstrated.")

Project Summary

Skills Demonstrated

You successfully applied:

Data Loading & Exploration

  • Reading large CSV files with encoding
  • DataFrame exploration (head, tail, info, describe)
  • Understanding data structure and types

Data Cleaning

  • Handling missing values appropriately
  • Converting data types (datetime)
  • Identifying duplicates
  • String operations for data extraction

Data Transformation

  • Creating calculated columns
  • Applying custom functions
  • Categorizing data
  • Mapping values

Filtering & Selection

  • Boolean filtering with multiple conditions
  • Selecting subsets of data
  • Using .isin() and .between() (see the sketch below)
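
A minimal sketch of those two helpers on this dataset (the countries and price bounds are illustrative):

# Selected European markets with unit prices between £1 and £10
eu_mid = df_completed[
    df_completed['Country'].isin(['Germany', 'France', 'Netherlands'])
    & df_completed['UnitPrice'].between(1, 10)
]
print(f"Matching rows: {len(eu_mid):,}")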

Aggregation

  • GroupBy operations
  • Multiple aggregations
  • Pivot tables for cross-tabulation

Time Series Analysis

  • DateTime component extraction
  • Rolling averages
  • Trend analysis
  • Seasonality detection

Customer Analytics

  • RFM segmentation
  • Customer lifetime value
  • Behavioral analysis

Business Insights Generated

  1. Revenue Analysis: £8M+ analyzed across 540K+ transactions
  2. Geographic Insights: UK market dominance identified
  3. Product Performance: Bestsellers and revenue drivers found
  4. Customer Segmentation: 4,300+ customers segmented using RFM
  5. Seasonality: Strong Q4 patterns revealed
  6. Operational Metrics: Cancellation impact quantified
  7. KPIs: Key business metrics calculated and tracked

Congratulations!

You have completed the comprehensive pandas course!

You can now:

  • Load and explore any dataset
  • Clean and prepare messy data
  • Transform and enhance data
  • Analyze patterns and trends
  • Generate business insights
  • Create professional reports

Next Steps

Ready for the next step?

Continue your learning journey with Data Visualization
Learn to create compelling charts and visualizations with Matplotlib to communicate your data insights effectively

Start Data Visualization →

Additional Learning Paths:

  1. Data Visualization: Create charts with Matplotlib
  2. Python Advanced: Master OOP and advanced Python concepts
  3. Software Engineering: Learn professional development practices
  4. Practice: Apply these skills to your own datasets
  5. Automate: Build ETL pipelines and data processing workflows

Keep practicing and building your data science portfolio!


Back to Lesson 20 - Window Functions

Review window operations and rolling calculations

Pandas Course Overview

Review all 21 lessons and course structure


You Are Now a Pandas Expert!

You have mastered all essential pandas techniques and applied them to real-world data analysis. This comprehensive project demonstrates your ability to handle complex datasets, extract insights, and generate business value.

Well done! You are ready for professional data analysis work!