Lesson 21 - Final Project: Real-World Data Analysis
Comprehensive Real-World Analysis
You have learned all essential pandas techniques. Now you will apply everything to a real-world dataset—analyzing 541,909 transactions from a UK-based online retail company.
This project demonstrates:
- All 20 previous lessons in action
- Professional data analysis workflow
- Business intelligence techniques
- Real-world data challenges and solutions
This is your opportunity to prove your pandas mastery!
Project Overview
Dataset Information
- Source: UCI Machine Learning Repository
- Company: UK online retailer selling unique all-occasion gifts
- Period: December 2010 - December 2011 (13 months)
- Customers: Mostly wholesalers from various countries
- Size: 541,909 transactions
Columns
| Column | Type | Description |
|---|---|---|
| InvoiceNo | string | 6-digit invoice number; ‘C’ prefix = cancelled |
| StockCode | string | 5-digit product code |
| Description | string | Product name |
| Quantity | int | Number of items per transaction |
| InvoiceDate | datetime | Transaction date and time |
| UnitPrice | float | Price per unit in GBP (£) |
| CustomerID | float | 5-digit customer number (float because missing values force the column to float) |
| Country | string | Customer’s country |
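Part 1 below loads the file with default types and converts InvoiceDate afterwards. If you prefer to enforce this schema at load time, here is a minimal sketch (assuming the same ecommerce_data.csv file in your working directory):

```python
import pandas as pd

# Optional: declare column types up front so InvoiceNo and StockCode
# are never misread as numbers, and parse dates during the load.
df = pd.read_csv(
    'ecommerce_data.csv',
    encoding='ISO-8859-1',
    dtype={'InvoiceNo': str, 'StockCode': str, 'CustomerID': 'float64'},
    parse_dates=['InvoiceDate'],
)
print(df.dtypes)
```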
Download Dataset
Download the dataset to follow along with this project:
Download ecommerce_data.csv (43 MB)
Save the file in your working directory to run the code examples in this lesson.
Alternative sources:
- Original source: UCI Machine Learning Repository (see the loading sketch below)
- Dataset contains 541,909 transactions from December 2010 to December 2011
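If you work from the original UCI file instead, note that it is distributed as an Excel workbook. A minimal loading sketch, assuming the openpyxl engine is installed and that the UCI download URL below is still live:

```python
import pandas as pd

# The original "Online Retail" dataset is an .xlsx file on the UCI repository.
# URL assumed here; adjust if the repository reorganizes its files.
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx'
df = pd.read_excel(url)  # requires the openpyxl package

# Re-save locally as CSV; the file will be UTF-8, so the encoding
# argument used in Part 1 can then be dropped.
df.to_csv('ecommerce_data.csv', index=False)
```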
Skills Coverage
This project demonstrates mastery of all 20 lessons:
Fundamentals (Lessons 1-5)
- Reading CSV, DataFrame exploration
- Label and position selection (.loc, .iloc)
- Series operations, value_counts()
- DateTime parsing and extraction
Filtering & Sorting (Lessons 6-8)
- Boolean filtering, multiple conditions
- Sorting, ranking, finding top N
Transformation (Lessons 9-10)
- Adding calculated columns
- apply(), map(), lambda functions
Data Cleaning (Lessons 11-14)
- Handling missing values
- Data type conversion
- Duplicates and outliers
- String operations
Aggregation (Lessons 15-16)
- GroupBy operations
- Pivot tables and cross-tabulation
Combining Data (Lessons 17-19)
- Concatenating DataFrames
- Merging and joining
- Multi-index operations
Advanced (Lesson 20)
- Window functions and rolling averages
Business Questions
You will answer these real-world questions:
- What are total sales and revenue trends?
- Which products are bestsellers?
- Who are the top customers? (RFM Analysis)
- Which countries generate the most revenue?
- What are monthly and seasonal patterns?
- How many orders are cancelled and why?
- What is the average basket size?
- Is the business growing month-over-month?
Let’s begin your comprehensive analysis!
Part 1: Data Loading and Exploration
Lessons Applied: 1, 2, 3, 4
Step 1: Import and Load Data
```python
import pandas as pd
import numpy as np
from datetime import datetime

# Display settings for better output
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Load data with proper encoding
df = pd.read_csv('ecommerce_data.csv', encoding='ISO-8859-1')
print(f"Dataset Loaded: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
```
Step 2: Initial Exploration
```python
# First look at the data
print("First 5 rows:")
print(df.head())

print("\nLast 5 rows:")
print(df.tail())

# DataFrame structure
print("\nDataFrame Info:")
df.info()

# Statistical summary
print("\nStatistical Summary:")
print(df.describe())

# Check missing values
print("\nMissing Values:")
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
print(pd.DataFrame({'Count': missing, 'Percentage': missing_pct}))
```
Part 2: Data Cleaning
Lessons Applied: 11, 12, 13, 14
Step 1: Handle Missing Values
```python
# Analyze missing data
print(f"CustomerID missing: {df['CustomerID'].isna().sum():,} ({df['CustomerID'].isna().mean()*100:.1f}%)")
print(f"Description missing: {df['Description'].isna().sum():,}")

# Remove rows with missing descriptions (critical for analysis)
df_clean = df.dropna(subset=['Description']).copy()
print(f"\nRows after cleaning: {len(df_clean):,}")
```
Step 2: Convert DateTime
```python
# Convert InvoiceDate to datetime
df_clean['InvoiceDate'] = pd.to_datetime(df_clean['InvoiceDate'])
print(f"Date Range: {df_clean['InvoiceDate'].min()} to {df_clean['InvoiceDate'].max()}")
print(f"Total days covered: {(df_clean['InvoiceDate'].max() - df_clean['InvoiceDate'].min()).days} days")
```
Step 3: String Operations
```python
# Identify cancelled orders (InvoiceNo starting with 'C')
df_clean['IsCancelled'] = df_clean['InvoiceNo'].astype(str).str.startswith('C')
print(f"Cancelled orders: {df_clean['IsCancelled'].sum():,} ({df_clean['IsCancelled'].mean()*100:.1f}%)")

# Clean product descriptions
df_clean['Description'] = df_clean['Description'].str.strip().str.upper()
```
Step 4: Extract DateTime Components
```python
# Extract date components for analysis
df_clean['Year'] = df_clean['InvoiceDate'].dt.year
df_clean['Month'] = df_clean['InvoiceDate'].dt.month
df_clean['MonthName'] = df_clean['InvoiceDate'].dt.month_name()
df_clean['DayOfWeek'] = df_clean['InvoiceDate'].dt.day_name()
df_clean['Hour'] = df_clean['InvoiceDate'].dt.hour
df_clean['Quarter'] = df_clean['InvoiceDate'].dt.quarter
print("DateTime components extracted successfully!")
```
Part 3: Data Transformation
Lessons Applied: 9, 10
Step 1: Add Calculated Columns
```python
# Filter completed orders only
df_completed = df_clean[~df_clean['IsCancelled']].copy()

# Calculate total price
df_completed['TotalPrice'] = df_completed['Quantity'] * df_completed['UnitPrice']

# Calculate revenue (only positive values)
df_completed['Revenue'] = df_completed['TotalPrice'].apply(lambda x: x if x > 0 else 0)
print(f"Total transactions: {len(df_completed):,}")
print(f"Total revenue: £{df_completed['Revenue'].sum():,.2f}")
```
Step 2: Categorize Data
```python
# Price category
def categorize_price(price):
    if price < 5:
        return 'Budget'
    elif price < 20:
        return 'Mid-Range'
    else:
        return 'Premium'

df_completed['PriceCategory'] = df_completed['UnitPrice'].apply(categorize_price)

# Order value tier
def order_value_tier(total):
    if total < 50:
        return 'Small'
    elif total < 200:
        return 'Medium'
    elif total < 500:
        return 'Large'
    else:
        return 'Enterprise'

df_completed['OrderTier'] = df_completed['TotalPrice'].apply(order_value_tier)
print("Categories created successfully!")
```
Step 3: Create Regional Mapping
```python
# Map countries to regions
country_region_map = {
    'United Kingdom': 'UK',
    'Germany': 'Europe',
    'France': 'Europe',
    'Spain': 'Europe',
    'Netherlands': 'Europe',
    'Belgium': 'Europe',
    'Switzerland': 'Europe',
    'Portugal': 'Europe',
    'Australia': 'Asia-Pacific',
    'Japan': 'Asia-Pacific',
    'Singapore': 'Asia-Pacific'
}
df_completed['Region'] = df_completed['Country'].map(country_region_map).fillna('Other')
```
Part 4: Analysis - Bestselling Products
Lessons Applied: 7, 8, 15
Find Top Products
```python
# Group by product and aggregate
bestsellers = df_completed.groupby('Description').agg({
    'Quantity': 'sum',
    'Revenue': 'sum',
    'InvoiceNo': 'nunique'
}).rename(columns={'InvoiceNo': 'Orders'}).sort_values('Quantity', ascending=False)

bestsellers['AvgPrice'] = (bestsellers['Revenue'] / bestsellers['Quantity']).round(2)
print("Top 20 Bestselling Products:")
print(bestsellers.head(20))

# Top revenue generators
print("\nTop 10 Products by Revenue:")
print(bestsellers.nlargest(10, 'Revenue'))
```
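The dict-style agg above works, but pandas (0.25+) also supports named aggregation, which produces the renamed columns in one step. A sketch of the same bestsellers table:

```python
# Same aggregation with named-aggregation syntax: the rename happens inline
bestsellers_alt = (
    df_completed.groupby('Description')
    .agg(Quantity=('Quantity', 'sum'),
         Revenue=('Revenue', 'sum'),
         Orders=('InvoiceNo', 'nunique'))
    .sort_values('Quantity', ascending=False)
)
print(bestsellers_alt.head())
```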
Part 5: Analysis - Geographic Performance
Lessons Applied: 15, 16
Country Performance
```python
# Aggregate by country
country_performance = df_completed.groupby('Country').agg({
    'Revenue': 'sum',
    'InvoiceNo': 'nunique',
    'CustomerID': 'nunique',
    'Quantity': 'sum'
}).rename(columns={
    'InvoiceNo': 'Orders',
    'CustomerID': 'Customers',
    'Quantity': 'Items'
}).sort_values('Revenue', ascending=False)

# Calculate metrics
country_performance['AvgOrderValue'] = (country_performance['Revenue'] / country_performance['Orders']).round(2)
country_performance['AvgBasketSize'] = (country_performance['Items'] / country_performance['Orders']).round(1)

# Calculate percentages
total_revenue = country_performance['Revenue'].sum()
country_performance['RevenuePct'] = (country_performance['Revenue'] / total_revenue * 100).round(2)
print("Top 15 Countries by Performance:")
print(country_performance.head(15))
```
Create Pivot Table
```python
# Revenue by Country and Quarter
revenue_pivot = df_completed.pivot_table(
    values='Revenue',
    index='Country',
    columns='Quarter',
    aggfunc='sum',
    fill_value=0,
    margins=True
).round(2)
print("Revenue by Country and Quarter (Top 10 Countries):")
print(revenue_pivot.head(10))
```
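Lesson 16 also covered pd.crosstab. As a complement to the pivot table, a sketch counting invoices by Region and OrderTier (both columns created in Part 3):

```python
# Cross-tabulation: frequency counts with marginal totals
region_tier = pd.crosstab(
    df_completed['Region'],
    df_completed['OrderTier'],
    margins=True,
)
print("Transactions by Region and Order Tier:")
print(region_tier)
```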
Part 6: Analysis - Customer Segmentation (RFM)
Lessons Applied: 15, 18
RFM Analysis
```python
# RFM: Recency, Frequency, Monetary
customer_data = df_completed[df_completed['CustomerID'].notna()].copy()
reference_date = customer_data['InvoiceDate'].max()

rfm = customer_data.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (reference_date - x.max()).days,
    'InvoiceNo': 'nunique',
    'Revenue': 'sum'
}).rename(columns={
    'InvoiceDate': 'Recency',
    'InvoiceNo': 'Frequency',
    'Revenue': 'Monetary'
})

# Score customers (1-4, higher is better)
rfm['R_Score'] = pd.qcut(rfm['Recency'], 4, labels=[4, 3, 2, 1])
rfm['F_Score'] = pd.qcut(rfm['Frequency'].rank(method='first'), 4, labels=[1, 2, 3, 4])
rfm['M_Score'] = pd.qcut(rfm['Monetary'], 4, labels=[1, 2, 3, 4])
rfm['RFM_Score'] = rfm['R_Score'].astype(str) + rfm['F_Score'].astype(str) + rfm['M_Score'].astype(str)

# Segment customers
def segment_customer(score):
    if score in ['444', '443', '434', '344', '433']:
        return 'VIP'
    elif score[0] in ['3', '4'] and score[1] in ['3', '4']:
        return 'Loyal'
    elif score[0] in ['1', '2']:
        return 'At Risk'
    else:
        return 'Regular'

rfm['Segment'] = rfm['RFM_Score'].apply(segment_customer)
print("Customer Segmentation:")
print(rfm['Segment'].value_counts())

print("\nTop 10 VIP Customers:")
vip_customers = rfm[rfm['Segment'] == 'VIP'].sort_values('Monetary', ascending=False)
print(vip_customers.head(10))
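```

To sanity-check the segments, it helps to profile each one. A sketch of average RFM values and total revenue per segment:

```python
# Profile each segment: average behaviour plus revenue contribution
segment_profile = rfm.groupby('Segment').agg(
    AvgRecency=('Recency', 'mean'),
    AvgFrequency=('Frequency', 'mean'),
    AvgMonetary=('Monetary', 'mean'),
    TotalRevenue=('Monetary', 'sum'),
    Customers=('Monetary', 'size'),
).round(1)
print(segment_profile.sort_values('TotalRevenue', ascending=False))
```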
Part 7: Analysis - Time Series and Trends
Lessons Applied: 20
Monthly Trends
```python
# Aggregate by month
monthly_sales = df_completed.groupby(['Year', 'Month']).agg({
    'Revenue': 'sum',
    'InvoiceNo': 'nunique',
    'CustomerID': 'nunique'
}).rename(columns={
    'InvoiceNo': 'Orders',
    'CustomerID': 'Customers'
})
monthly_sales['AvgOrderValue'] = (monthly_sales['Revenue'] / monthly_sales['Orders']).round(2)

# Calculate month-over-month growth
monthly_sales['RevenueGrowth%'] = monthly_sales['Revenue'].pct_change() * 100
print("Monthly Performance:")
print(monthly_sales)
print(f"\nAverage monthly growth: {monthly_sales['RevenueGrowth%'].mean():.2f}%")
```
Rolling Averages
```python
# Create daily revenue time series
daily_revenue = df_completed.groupby(df_completed['InvoiceDate'].dt.date)['Revenue'].sum().sort_index()

# Calculate moving averages
daily_df = pd.DataFrame({
    'Revenue': daily_revenue,
    'MA_7': daily_revenue.rolling(window=7).mean(),
    'MA_30': daily_revenue.rolling(window=30).mean(),
    'Cumulative': daily_revenue.cumsum()
})
print("Daily Revenue with Moving Averages (Last 30 days):")
print(daily_df.tail(30))
print(f"\nTotal cumulative revenue: £{daily_df['Cumulative'].iloc[-1]:,.2f}")
print(f"Average daily revenue: £{daily_revenue.mean():,.2f}")
```
Part 8: Analysis - Seasonality
Lessons Applied: 15, 16
Monthly Patterns
```python
# Revenue by month
monthly_pattern = df_completed.groupby('MonthName').agg({
    'Revenue': 'sum',
    'InvoiceNo': 'nunique'
}).rename(columns={'InvoiceNo': 'Orders'})

# Sort by month order
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']
monthly_pattern = monthly_pattern.reindex(month_order)
print("Revenue by Month:")
print(monthly_pattern)

# Best month
best_month = monthly_pattern['Revenue'].idxmax()
print(f"\nBest month: {best_month} (£{monthly_pattern.loc[best_month, 'Revenue']:,.2f})")
```
Day of Week Patterns
```python
# Revenue by day of week
dow_pattern = df_completed.groupby('DayOfWeek').agg({
    'Revenue': 'sum',
    'InvoiceNo': 'nunique'
}).rename(columns={'InvoiceNo': 'Orders'})

# Sort by day order
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
dow_pattern = dow_pattern.reindex(day_order)
print("Revenue by Day of Week:")
print(dow_pattern)

# Best day
best_day = dow_pattern['Revenue'].idxmax()
print(f"\nBest day: {best_day} (£{dow_pattern.loc[best_day, 'Revenue']:,.2f})")
```
Part 9: Executive Summary
Generate a comprehensive report of the findings:
```python
# Calculate key metrics
total_revenue = df_completed['Revenue'].sum()
total_orders = df_completed['InvoiceNo'].nunique()
total_customers = df_completed['CustomerID'].nunique()
total_products = df_completed['StockCode'].nunique()
avg_order_value = total_revenue / total_orders
avg_basket_size = df_completed.groupby('InvoiceNo')['Quantity'].sum().mean()

print("=" * 80)
print(" " * 20 + "EXECUTIVE SUMMARY")
print(" " * 15 + "Online Retail Performance Analysis")
print(" " * 18 + "December 2010 - December 2011")
print("=" * 80)
print("\nOVERALL METRICS:")
print(f"  Total Revenue: £{total_revenue:,.2f}")
print(f"  Total Orders: {total_orders:,}")
print(f"  Unique Customers: {total_customers:,}")
print(f"  Unique Products: {total_products:,}")
print(f"  Avg Order Value: £{avg_order_value:.2f}")
print(f"  Avg Basket Size: {avg_basket_size:.1f} items")
print("\nTOP PERFORMERS:")
print(f"  Best Country: {country_performance.index[0]} (£{country_performance.iloc[0]['Revenue']:,.2f})")
print(f"  Top Product: {bestsellers.index[0][:50]}")
print(f"  Peak Month: {best_month}")
print(f"  VIP Customers: {len(vip_customers)}")
print("\nKEY INSIGHTS:")
print(f"  1. UK dominates with {country_performance.iloc[0]['RevenuePct']:.1f}% of total revenue")
print("  2. Top 10 products drive a significant portion of sales")
print(f"  3. {len(vip_customers)} VIP customers generate £{vip_customers['Monetary'].sum():,.2f}")
print(f"  4. Cancellation rate: {(df_clean['IsCancelled'].mean()*100):.2f}%")
print("  5. Strong seasonality: Q4 shows highest sales")
print("\nRECOMMENDATIONS:")
print(f"  • Focus retention on {len(rfm[rfm['Segment']=='At Risk'])} at-risk customers")
print("  • Expand inventory of top 20 bestselling products")
print("  • Investigate cancellations to reduce rate")
print("  • Target marketing in top 5 countries")
print("  • Prepare for seasonal peak: stock up in Q4")
print("=" * 80)
print("\nAnalysis complete! All 20 Pandas lessons demonstrated.")
```
Project Summary
Skills Demonstrated
You successfully applied:
Data Loading & Exploration
- Reading large CSV files with encoding
- DataFrame exploration (head, tail, info, describe)
- Understanding data structure and types
Data Cleaning
- Handling missing values appropriately
- Converting data types (datetime)
- Identifying duplicates
- String operations for data extraction
Data Transformation
- Creating calculated columns
- Applying custom functions
- Categorizing data
- Mapping values
Filtering & Selection
- Boolean filtering with multiple conditions
- Selecting subsets of data
- Using .isin() and .between() (see the sketch below)
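For reference, a minimal sketch of those two selectors on this dataset (column names as used throughout the project):

```python
# .isin(): keep only orders from selected countries
eu_core = df_completed[df_completed['Country'].isin(['Germany', 'France', 'Spain'])]

# .between(): keep mid-priced items (both bounds inclusive by default)
mid_priced = df_completed[df_completed['UnitPrice'].between(5, 20)]
print(len(eu_core), len(mid_priced))
```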
Aggregation
- GroupBy operations
- Multiple aggregations
- Pivot tables for cross-tabulation
Time Series Analysis
- DateTime component extraction
- Rolling averages
- Trend analysis
- Seasonality detection
Customer Analytics
- RFM segmentation
- Customer lifetime value
- Behavioral analysis
Business Insights Generated
- Revenue Analysis: £8M+ analyzed across 540K+ transactions
- Geographic Insights: UK market dominance identified
- Product Performance: Bestsellers and revenue drivers found
- Customer Segmentation: 4,300+ customers segmented using RFM
- Seasonality: Strong Q4 patterns revealed
- Operational Metrics: Cancellation impact quantified
- KPIs: Key business metrics calculated and tracked
Congratulations!
You have completed the comprehensive pandas course!
You can now:
- Load and explore any dataset
- Clean and prepare messy data
- Transform and enhance data
- Analyze patterns and trends
- Generate business insights
- Create professional reports
Next Steps
Ready for the next step? Continue your learning journey with Data Visualization: learn to create compelling charts with Matplotlib to communicate your data insights effectively.
Additional Learning Paths:
- Data Visualization: Create charts with Matplotlib
- Python Advanced: Master OOP and advanced Python concepts
- Software Engineering: Learn professional development practices
- Practice: Apply these skills to your own datasets
- Automate: Build ETL pipelines and data processing workflows
Keep practicing and building your data science portfolio!
You Are Now a Pandas Expert!
You have mastered all essential pandas techniques and applied them to real-world data analysis. This comprehensive project demonstrates your ability to handle complex datasets, extract insights, and generate business value.
Well done! You are ready for professional data analysis work!