Lesson 12 - Data Type Conversion and Cleaning
Converting Data Types
Real data arrives with wrong types—numbers stored as text, dates as strings, booleans as words. Now you will learn to convert data types, handle conversion errors, and prepare data for analysis.
By the end of this lesson, you will be able to:
- Check and understand data types in DataFrames
- Convert between numeric types safely
- Parse dates from strings
- Convert text to boolean values
- Use category type for memory optimization
- Handle conversion errors gracefully
Wrong data types prevent operations—you cannot calculate the average of string numbers or filter dates stored as text. Type conversion is essential for analysis.
Understanding Data Types
Pandas uses specific types for different kinds of data.
Common Pandas Data Types
- object: Text (strings) or mixed types
- int64: Integer numbers
- float64: Decimal numbers
- bool: True/False
- datetime64: Dates and times
- category: Categorical data (limited unique values)
Sample Data with Type Issues
import pandas as pd
import numpy as np
# Create dataset with incorrect types
companies = pd.DataFrame({
'company': ['TechCorp Global', 'FreshMart Inc', 'AutoDrive Motors', 'FirstBank Holdings', 'PowerGen Energy'],
'revenues': ['125000', '89000', '156000', '234000', '178000'], # Stored as strings!
'profits': ['12000', '8500', '-3000', '45000', 'N/A'], # Mix of numbers and text
'employees': [1200, 890, 2300, 5600, 3400],
'founded': ['1990', '2005', '1985', '2010', '1998'], # Years as strings
'established_date': ['2020-01-15', '2019-06-20', '2018-03-10', '2021-11-05', '2017-09-25'],
'is_public': ['Yes', 'No', 'Yes', 'Yes', 'No'], # Should be boolean
'rating': ['4.5', '3.8', '4.2', '4.7', '3.9'] # Numbers as strings
})
print("Dataset with type issues:")
print(companies)
print("\nData types:")
print(companies.dtypes)Output shows most numeric columns are ‘object’ type!
Checking Data Types
# Get detailed info
print("DataFrame info:")
companies.info()Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 company 5 non-null object
1 revenues 5 non-null object ← Should be int/float!
2 profits 5 non-null object ← Should be int/float!
3 employees 5 non-null int64 ✓ Correct
4 founded 5 non-null object ← Should be int!
5 established_date 5 non-null object ← Should be datetime!
6 is_public 5 non-null object ← Should be bool!
7 rating 5 non-null object ← Should be float!Problem with Wrong Types
# Try to calculate sum on string numbers
try:
total_revenue = companies['revenues'].sum()
print(f"Total revenue: {total_revenue}")
except TypeError as e:
print(f"Error: {e}")
print("\nActual result (string concatenation):")
print(companies['revenues'].sum()) # Concatenates strings!Output: '125000890001560002340000178000' - String concatenation, not addition!
Converting Numeric Types
Convert string numbers to actual numeric types.
Using .astype()
# Convert string to integer
companies['revenues'] = companies['revenues'].astype(int)
print("After converting revenues to int:")
print(companies['revenues'].dtype)
print("\nNow we can calculate:")
print(f"Total revenue: ${companies['revenues'].sum():,}")
print(f"Average revenue: ${companies['revenues'].mean():,.0f}")Output:
After converting revenues to int:
int64
Now we can calculate:
Total revenue: $782,000
Average revenue: $156,400Handling Conversion Errors
# Problem: .astype() fails with non-numeric values
try:
companies['profits'] = companies['profits'].astype(int)
except ValueError as e:
print(f"Error: {e}")
print("\nThe 'N/A' value cannot be converted to integer!")Using pd.to_numeric() with Error Handling
# Solution: Use pd.to_numeric() with errors='coerce'
companies['profits'] = pd.to_numeric(companies['profits'], errors='coerce')
print("After converting with errors='coerce':")
print(companies['profits'])
print(f"\nType: {companies['profits'].dtype}")
print("\nNote: 'N/A' became NaN (missing value)")Output:
0 12000.0
1 8500.0
2 -3000.0
3 45000.0
4 NaN ← 'N/A' converted to NaN
Name: profits, dtype: float64Error Handling Options
# Different error handling strategies
test_data = pd.Series(['100', '200', 'invalid', '400'])
print("Original:")
print(test_data)
# errors='coerce' - convert invalid to NaN
print("\nWith errors='coerce':")
print(pd.to_numeric(test_data, errors='coerce'))
# errors='ignore' - return original if any error
print("\nWith errors='ignore':")
print(pd.to_numeric(test_data, errors='ignore'))Output:
With errors='coerce':
0 100.0
1 200.0
2 NaN ← Invalid value
3 400.0
With errors='ignore':
0 100
1 200
2 invalid ← Unchanged
3 400Float to Int Conversion
# Convert float to int (careful - drops decimals!)
companies['rating'] = companies['rating'].astype(float)
print("Rating as float:")
print(companies['rating'])
# Convert to int (truncates decimals)
companies['rating_int'] = companies['rating'].astype(int)
print("\nRating as int (decimals truncated):")
print(companies[['rating', 'rating_int']])Output:
rating rating_int
0 4.5 4 ← Decimal lost
1 3.8 3
2 4.2 4
3 4.7 4
4 3.9 3Rounding Before Converting
# Round first, then convert
companies['rating_rounded'] = companies['rating'].round().astype(int)
print("Rating rounded then converted:")
print(companies[['rating', 'rating_int', 'rating_rounded']])Output:
rating rating_int rating_rounded
0 4.5 4 4 ← Rounds to nearest
1 3.8 3 4
2 4.2 4 4
3 4.7 4 5
4 3.9 3 4Working with Datetime
Convert string dates to proper datetime objects.
Basic Date Conversion
# Convert string to datetime
companies['established_date'] = pd.to_datetime(companies['established_date'])
print("After converting to datetime:")
print(companies['established_date'])
print(f"\nType: {companies['established_date'].dtype}")Output:
0 2020-01-15
1 2019-06-20
2 2018-03-10
3 2021-11-05
4 2017-09-25
Name: established_date, dtype: datetime64[ns]Extracting Date Components
# Extract components
companies['year'] = companies['established_date'].dt.year
companies['month'] = companies['established_date'].dt.month
companies['day_of_week'] = companies['established_date'].dt.day_name()
print("Extracted date components:")
print(companies[['established_date', 'year', 'month', 'day_of_week']])Calculating Time Differences
# Calculate years in operation
today = pd.Timestamp('2024-01-01')
companies['days_operating'] = (today - companies['established_date']).dt.days
companies['years_operating'] = (companies['days_operating'] / 365.25).round(1)
print("Years in operation:")
print(companies[['company', 'established_date', 'years_operating']])Parsing Different Date Formats
# Pandas automatically parses many formats
date_examples = pd.Series([
'2020-01-15', # ISO format
'15/01/2020', # Day/Month/Year
'Jan 15, 2020', # Month name
'2020.01.15' # Dot separator
])
parsed_dates = pd.to_datetime(date_examples)
print("All parsed to same format:")
print(parsed_dates)Boolean and Category Types
Convert text to boolean and use categories for efficiency.
Converting to Boolean
# Map Yes/No to True/False
companies['is_public_bool'] = companies['is_public'].map({'Yes': True, 'No': False})
print("Converted to boolean:")
print(companies[['company', 'is_public', 'is_public_bool']])
print(f"\nType: {companies['is_public_bool'].dtype}")Output:
company is_public is_public_bool
0 TechCorp Global Yes True
1 FreshMart Inc No False
2 AutoDrive Motors Yes True
3 FirstBank Holdings Yes True
4 PowerGen Energy No False
Type: boolUsing Boolean for Filtering
# Filter public companies
public_companies = companies[companies['is_public_bool']]
print("Public companies:")
print(public_companies[['company', 'is_public_bool']])Converting to Category Type
# Add sector column for demonstration
companies['sector'] = ['Technology', 'Food', 'Automotive', 'Financials', 'Energy']
print("Before category conversion:")
print(f"Memory usage: {companies['sector'].memory_usage(deep=True)} bytes")
print(f"Type: {companies['sector'].dtype}")
# Convert to category
companies['sector'] = companies['sector'].astype('category')
print("\nAfter category conversion:")
print(f"Memory usage: {companies['sector'].memory_usage(deep=True)} bytes")
print(f"Type: {companies['sector'].dtype}")
print(f"\nCategories: {companies['sector'].cat.categories.tolist()}")Category type saves memory for columns with repeated values!
When to Use Category Type
Use category type for:
- Columns with limited unique values (gender, status, region)
- Categorical data that repeats often
- Ordered categories (ratings: Poor, Good, Excellent)
# Ordered categories
performance = pd.Series(['Good', 'Excellent', 'Poor', 'Good', 'Excellent'])
performance_cat = pd.Categorical(
performance,
categories=['Poor', 'Good', 'Excellent'],
ordered=True
)
print("Ordered categories:")
print(performance_cat)
print("\nCan compare:")
print(performance_cat > 'Good')Output:
[Good, Excellent, Poor, Good, Excellent]
Categories (3, object): ['Poor' < 'Good' < 'Excellent']
Can compare:
[False, True, False, False, True]Batch Type Conversion
# Convert multiple columns at once
type_conversions = {
'founded': 'int',
'employees': 'int'
}
companies = companies.astype(type_conversions)
print("After batch conversion:")
print(companies.dtypes)Practice Exercises
Apply type conversion with these exercises.
Exercise 1: Numeric Conversions
Given this data:
sales = pd.DataFrame({
'product': ['A', 'B', 'C', 'D'],
'price': ['25.50', '30.00', 'N/A', '45.75'],
'quantity': ['100', '150', '200', '75']
})- Convert ‘price’ to float, handling the ‘N/A’ value
- Convert ‘quantity’ to integer
- Calculate total sales (price × quantity) for non-null prices
# Your code hereExercise 2: Datetime Operations
Given:
orders = pd.DataFrame({
'order_id': [1, 2, 3, 4],
'order_date': ['2023-01-15', '2023-02-20', '2023-03-10', '2023-04-05'],
'delivery_date': ['2023-01-20', '2023-02-25', '2023-03-15', '2023-04-10']
})- Convert both date columns to datetime
- Calculate delivery time in days
- Extract the month name from order_date
- Find orders placed on a Monday
# Your code hereHint
Use .dt.day_name() to get the day of the week!
Exercise 3: Boolean and Category
Given:
employees = pd.DataFrame({
'name': ['Ali', 'Sara', 'Reza', 'Maryam'],
'department': ['IT', 'HR', 'IT', 'Sales'],
'is_manager': ['yes', 'no', 'yes', 'no'],
'status': ['Active', 'Active', 'On Leave', 'Active']
})- Convert ‘is_manager’ to boolean (yes→True, no→False)
- Convert ‘department’ and ‘status’ to category type
- Count how many managers there are
- Compare memory usage before and after category conversion
# Your code hereSummary
You now master data type conversion. Let’s review the key concepts.
Key Concepts
Checking Types
- Use
.dtypesand.info()to inspect types - Wrong types prevent operations
- Type checking is the first data quality step
Numeric Conversion
.astype()for simple conversionpd.to_numeric()for error handlingerrors='coerce'converts invalid to NaN
Datetime Conversion
pd.to_datetime()parses dates- Extract components with
.dtaccessor - Calculate time differences for analysis
Boolean and Category
- Map text to boolean with dictionary
- Category type saves memory
- Use categories for repeated values
Syntax Reference
# Check types
df.dtypes # Show all types
df.info() # Detailed info
df['col'].dtype # Specific column
# Convert numeric
df['col'].astype(int) # Convert to integer
df['col'].astype(float) # Convert to float
pd.to_numeric(df['col'], errors='coerce') # Handle errors
# Convert datetime
pd.to_datetime(df['col']) # Parse dates
df['date'].dt.year # Extract year
df['date'].dt.month # Extract month
df['date'].dt.day_name() # Get day name
# Boolean
df['col'].map({'yes': True, 'no': False})
# Category
df['col'].astype('category') # Convert to categoryType Conversion Guide
| From | To | Method | Notes |
|---|---|---|---|
| String numbers | int/float | pd.to_numeric() | Use errors='coerce' |
| String dates | datetime | pd.to_datetime() | Auto-detects formats |
| Yes/No text | boolean | .map() | Create mapping dict |
| Repeated values | category | .astype('category') | Saves memory |
| Float | int | .astype(int) | Truncates decimals |
| Any type | string | .astype(str) | Always works |
Error Handling
# errors='coerce' - invalid becomes NaN (recommended)
pd.to_numeric(series, errors='coerce')
# errors='ignore' - return original if error
pd.to_numeric(series, errors='ignore')
# errors='raise' - raise error (default)
pd.to_numeric(series, errors='raise')Best Practices
- Always check types after loading data
- Use appropriate types - saves memory, enables operations
- Handle errors - use
errors='coerce'for dirty data - Category for repetition - countries, sectors, statuses
- Parse dates early - enables datetime operations
- Document conversions - note what you changed and why
Next Steps
You can now convert data types appropriately. In the next lesson, you will learn to find and handle duplicates and outliers—ensuring data quality and removing problematic values.
Continue to Lesson 13 - Duplicates and Outliers
Learn to find duplicates and detect unusual values
Back to Lesson 11 - Handling Missing Data
Review detecting and filling missing values
Master Type Conversion
Correct data types are fundamental to analysis. You cannot perform calculations on text, filter dates stored as strings, or optimize memory without proper types.
Use type conversion to prepare data for successful analysis!