Lesson 12 - Data Type Conversion and Cleaning

Converting Data Types

Real data arrives with wrong types—numbers stored as text, dates as strings, booleans as words. Now you will learn to convert data types, handle conversion errors, and prepare data for analysis.

By the end of this lesson, you will be able to:

  • Check and understand data types in DataFrames
  • Convert between numeric types safely
  • Parse dates from strings
  • Convert text to boolean values
  • Use category type for memory optimization
  • Handle conversion errors gracefully

Wrong data types prevent operations—you cannot calculate the average of string numbers or filter dates stored as text. Type conversion is essential for analysis.


Understanding Data Types

Pandas uses specific types for different kinds of data.

Common Pandas Data Types

  • object: Text (strings) or mixed types
  • int64: Integer numbers
  • float64: Decimal numbers
  • bool: True/False
  • datetime64: Dates and times
  • category: Categorical data (limited unique values)
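
A quick way to see these dtypes side by side (a minimal sketch; the column names here are just examples):

import pandas as pd

types_demo = pd.DataFrame({
    'text': ['a', 'b'],                                       # object
    'integers': [1, 2],                                       # int64
    'decimals': [1.5, 2.5],                                   # float64
    'flags': [True, False],                                   # bool
    'dates': pd.to_datetime(['2024-01-01', '2024-02-01']),    # datetime64[ns]
    'labels': pd.Categorical(['x', 'y'])                      # category
})
print(types_demo.dtypes)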

Sample Data with Type Issues

import pandas as pd
import numpy as np

# Create dataset with incorrect types
companies = pd.DataFrame({
    'company': ['TechCorp Global', 'FreshMart Inc', 'AutoDrive Motors', 'FirstBank Holdings', 'PowerGen Energy'],
    'revenues': ['125000', '89000', '156000', '234000', '178000'],  # Stored as strings!
    'profits': ['12000', '8500', '-3000', '45000', 'N/A'],  # Mix of numbers and text
    'employees': [1200, 890, 2300, 5600, 3400],
    'founded': ['1990', '2005', '1985', '2010', '1998'],  # Years as strings
    'established_date': ['2020-01-15', '2019-06-20', '2018-03-10', '2021-11-05', '2017-09-25'],
    'is_public': ['Yes', 'No', 'Yes', 'Yes', 'No'],  # Should be boolean
    'rating': ['4.5', '3.8', '4.2', '4.7', '3.9']  # Numbers as strings
})

print("Dataset with type issues:")
print(companies)
print("\nData types:")
print(companies.dtypes)

Output shows most numeric columns are ‘object’ type!

Checking Data Types

# Get detailed info
print("DataFrame info:")
companies.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   company            5 non-null      object
 1   revenues           5 non-null      object  ← Should be int/float!
 2   profits            5 non-null      object  ← Should be int/float!
 3   employees          5 non-null      int64   ✓ Correct
 4   founded            5 non-null      object  ← Should be int!
 5   established_date   5 non-null      object  ← Should be datetime!
 6   is_public          5 non-null      object  ← Should be bool!
 7   rating             5 non-null      object  ← Should be float!

Problem with Wrong Types

# Summing string "numbers" concatenates them instead of adding
print("Sum of string revenues:")
print(companies['revenues'].sum())  # One long concatenated string!

# Aggregations that need real numbers fail outright
try:
    print(companies['revenues'].mean())
except TypeError as e:
    print(f"Error: {e}")

Output: '12500089000156000234000178000' - string concatenation, not addition! And .mean() raises a TypeError instead of computing an average.


Converting Numeric Types

Convert string numbers to actual numeric types.

Using .astype()

# Convert string to integer
companies['revenues'] = companies['revenues'].astype(int)

print("After converting revenues to int:")
print(companies['revenues'].dtype)
print("\nNow we can calculate:")
print(f"Total revenue: ${companies['revenues'].sum():,}")
print(f"Average revenue: ${companies['revenues'].mean():,.0f}")

Output:

After converting revenues to int:
int64

Now we can calculate:
Total revenue: $782,000
Average revenue: $156,400

Handling Conversion Errors

# Problem: .astype() fails with non-numeric values
try:
    companies['profits'] = companies['profits'].astype(int)
except ValueError as e:
    print(f"Error: {e}")
    print("\nThe 'N/A' value cannot be converted to integer!")

Using pd.to_numeric() with Error Handling

# Solution: Use pd.to_numeric() with errors='coerce'
companies['profits'] = pd.to_numeric(companies['profits'], errors='coerce')

print("After converting with errors='coerce':")
print(companies['profits'])
print(f"\nType: {companies['profits'].dtype}")
print("\nNote: 'N/A' became NaN (missing value)")

Output:

0    12000.0
1     8500.0
2    -3000.0
3    45000.0
4       NaN   ← 'N/A' converted to NaN
Name: profits, dtype: float64
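
Note that the column became float64: NaN forces a float dtype. If you need whole numbers alongside missing values, pandas also offers the nullable integer dtype 'Int64' (capital I); a small sketch:

# Nullable integer dtype stores whole numbers and keeps the missing value as <NA>
profits_int = companies['profits'].astype('Int64')
print(profits_int)
print(profits_int.dtype)  # Int64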

Error Handling Options

# Different error handling strategies
test_data = pd.Series(['100', '200', 'invalid', '400'])

print("Original:")
print(test_data)

# errors='coerce' - convert invalid to NaN
print("\nWith errors='coerce':")
print(pd.to_numeric(test_data, errors='coerce'))

# errors='ignore' - return original if any error (deprecated in newer pandas versions)
print("\nWith errors='ignore':")
print(pd.to_numeric(test_data, errors='ignore'))

Output:

With errors='coerce':
0    100.0
1    200.0
2      NaN  ← Invalid value
3    400.0

With errors='ignore':
0        100
1        200
2    invalid  ← Unchanged
3        400

Float to Int Conversion

# Convert float to int (careful - drops decimals!)
companies['rating'] = companies['rating'].astype(float)
print("Rating as float:")
print(companies['rating'])

# Convert to int (truncates decimals)
companies['rating_int'] = companies['rating'].astype(int)
print("\nRating as int (decimals truncated):")
print(companies[['rating', 'rating_int']])

Output:

   rating  rating_int
0     4.5           4  ← Decimal lost
1     3.8           3
2     4.2           4
3     4.7           4
4     3.9           3

Rounding Before Converting

# Round first, then convert
companies['rating_rounded'] = companies['rating'].round().astype(int)

print("Rating rounded then converted:")
print(companies[['rating', 'rating_int', 'rating_rounded']])

Output:

   rating  rating_int  rating_rounded
0     4.5           4               4  ← 4.5 rounds to 4 (half-to-even)
1     3.8           3               4
2     4.2           4               4
3     4.7           4               5
4     3.9           3               4

Working with Datetime

Convert string dates to proper datetime objects.

Basic Date Conversion

# Convert string to datetime
companies['established_date'] = pd.to_datetime(companies['established_date'])

print("After converting to datetime:")
print(companies['established_date'])
print(f"\nType: {companies['established_date'].dtype}")

Output:

0   2020-01-15
1   2019-06-20
2   2018-03-10
3   2021-11-05
4   2017-09-25
Name: established_date, dtype: datetime64[ns]

Extracting Date Components

# Extract components
companies['year'] = companies['established_date'].dt.year
companies['month'] = companies['established_date'].dt.month
companies['day_of_week'] = companies['established_date'].dt.day_name()

print("Extracted date components:")
print(companies[['established_date', 'year', 'month', 'day_of_week']])

Calculating Time Differences

# Calculate years in operation
today = pd.Timestamp('2024-01-01')
companies['days_operating'] = (today - companies['established_date']).dt.days
companies['years_operating'] = (companies['days_operating'] / 365.25).round(1)

print("Years in operation:")
print(companies[['company', 'established_date', 'years_operating']])
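
Once the column is a datetime, you can also filter rows by date directly (a small sketch using the same companies data):

# Companies established in 2019 or later
recent = companies[companies['established_date'] >= '2019-01-01']
print(recent[['company', 'established_date']])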

Parsing Different Date Formats

# Pandas can parse many common date layouts; when a single Series mixes
# several layouts, newer pandas (2.x) needs format='mixed'
date_examples = pd.Series([
    '2020-01-15',        # ISO format
    '15/01/2020',        # Day/Month/Year
    'Jan 15, 2020',      # Month name
    '2020.01.15'         # Dot separator
])

parsed_dates = pd.to_datetime(date_examples, format='mixed', dayfirst=True)
print("All parsed to the same datetime type:")
print(parsed_dates)
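
When every value shares one known layout, it is often safer (and faster) to state the format explicitly rather than rely on auto-detection; a minimal sketch:

# Explicit format string: %d = day, %m = month, %Y = four-digit year
uk_dates = pd.Series(['15/01/2020', '20/06/2019', '10/03/2018'])
print(pd.to_datetime(uk_dates, format='%d/%m/%Y'))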

Boolean and Category Types

Convert text to boolean and use categories for efficiency.

Converting to Boolean

# Map Yes/No to True/False
companies['is_public_bool'] = companies['is_public'].map({'Yes': True, 'No': False})

print("Converted to boolean:")
print(companies[['company', 'is_public', 'is_public_bool']])
print(f"\nType: {companies['is_public_bool'].dtype}")

Output:

              company is_public  is_public_bool
0     TechCorp Global       Yes            True
1       FreshMart Inc        No           False
2    AutoDrive Motors       Yes            True
3  FirstBank Holdings       Yes            True
4     PowerGen Energy        No           False

Type: bool

Using Boolean for Filtering

# Filter public companies
public_companies = companies[companies['is_public_bool']]
print("Public companies:")
print(public_companies[['company', 'is_public_bool']])
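
Boolean columns also aggregate naturally, because True counts as 1 and False as 0 (a small sketch):

# Count and share of public companies
print(f"Public companies: {companies['is_public_bool'].sum()}")
print(f"Share public: {companies['is_public_bool'].mean():.0%}")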

Converting to Category Type

# Add sector column for demonstration
companies['sector'] = ['Technology', 'Food', 'Automotive', 'Financials', 'Energy']

print("Before category conversion:")
print(f"Memory usage: {companies['sector'].memory_usage(deep=True)} bytes")
print(f"Type: {companies['sector'].dtype}")

# Convert to category
companies['sector'] = companies['sector'].astype('category')

print("\nAfter category conversion:")
print(f"Memory usage: {companies['sector'].memory_usage(deep=True)} bytes")
print(f"Type: {companies['sector'].dtype}")
print(f"\nCategories: {companies['sector'].cat.categories.tolist()}")

Category type saves memory for columns with repeated values!
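
The savings grow with repetition: with only five rows and five unique sectors, the category overhead can even make the column slightly larger. A small sketch with a larger, repetitive column (exact byte counts vary by pandas version):

# 100,000 rows drawn from only four distinct sector names
big_sector = pd.Series(['Technology', 'Food', 'Automotive', 'Energy'] * 25_000)

print(f"object:   {big_sector.memory_usage(deep=True):,} bytes")
print(f"category: {big_sector.astype('category').memory_usage(deep=True):,} bytes")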

When to Use Category Type

Use category type for:

  • Columns with limited unique values (gender, status, region)
  • Categorical data that repeats often
  • Ordered categories (ratings: Poor, Good, Excellent)

# Ordered categories
performance = pd.Series(['Good', 'Excellent', 'Poor', 'Good', 'Excellent'])
performance_cat = pd.Categorical(
    performance,
    categories=['Poor', 'Good', 'Excellent'],
    ordered=True
)

print("Ordered categories:")
print(performance_cat)
print("\nCan compare:")
print(performance_cat > 'Good')

Output:

[Good, Excellent, Poor, Good, Excellent]
Categories (3, object): ['Poor' < 'Good' < 'Excellent']

Can compare:
[False  True False False  True]
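
Ordered categories also sort by their defined order rather than alphabetically (a small sketch):

# Sorting respects Poor < Good < Excellent, not alphabetical order
print(pd.Series(performance_cat).sort_values())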

Batch Type Conversion

# Convert multiple columns at once
type_conversions = {
    'founded': 'int',
    'employees': 'int'
}

companies = companies.astype(type_conversions)

print("After batch conversion:")
print(companies.dtypes)
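
If your data comes from a CSV file, you can often set types at load time instead of converting afterwards (a sketch using a hypothetical file name 'companies.csv'):

# dtype= fixes column types on read; parse_dates= converts date columns
loaded = pd.read_csv(
    'companies.csv',                      # hypothetical file
    dtype={'employees': 'int64', 'sector': 'category'},
    parse_dates=['established_date']
)
print(loaded.dtypes)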

Practice Exercises

Apply type conversion with these exercises.

Exercise 1: Numeric Conversions

Given this data:

sales = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D'],
    'price': ['25.50', '30.00', 'N/A', '45.75'],
    'quantity': ['100', '150', '200', '75']
})

  1. Convert ‘price’ to float, handling the ‘N/A’ value
  2. Convert ‘quantity’ to integer
  3. Calculate total sales (price × quantity) for non-null prices

# Your code here

Exercise 2: Datetime Operations

Given:

orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'order_date': ['2023-01-15', '2023-02-20', '2023-03-10', '2023-04-05'],
    'delivery_date': ['2023-01-20', '2023-02-25', '2023-03-15', '2023-04-10']
})

  1. Convert both date columns to datetime
  2. Calculate delivery time in days
  3. Extract the month name from order_date
  4. Find orders placed on a Monday

# Your code here

Hint

Use .dt.day_name() to get the day of the week!

Exercise 3: Boolean and Category

Given:

employees = pd.DataFrame({
    'name': ['Ali', 'Sara', 'Reza', 'Maryam'],
    'department': ['IT', 'HR', 'IT', 'Sales'],
    'is_manager': ['yes', 'no', 'yes', 'no'],
    'status': ['Active', 'Active', 'On Leave', 'Active']
})

  1. Convert ‘is_manager’ to boolean (yes→True, no→False)
  2. Convert ‘department’ and ‘status’ to category type
  3. Count how many managers there are
  4. Compare memory usage before and after category conversion

# Your code here

Summary

You now master data type conversion. Let’s review the key concepts.

Key Concepts

Checking Types

  • Use .dtypes and .info() to inspect types
  • Wrong types prevent operations
  • Type checking is the first data quality step

Numeric Conversion

  • .astype() for simple conversion
  • pd.to_numeric() for error handling
  • errors='coerce' converts invalid to NaN

Datetime Conversion

  • pd.to_datetime() parses dates
  • Extract components with .dt accessor
  • Calculate time differences for analysis

Boolean and Category

  • Map text to boolean with dictionary
  • Category type saves memory
  • Use categories for repeated values

Syntax Reference

# Check types
df.dtypes                      # Show all types
df.info()                      # Detailed info
df['col'].dtype                # Specific column

# Convert numeric
df['col'].astype(int)          # Convert to integer
df['col'].astype(float)        # Convert to float
pd.to_numeric(df['col'], errors='coerce')  # Handle errors

# Convert datetime
pd.to_datetime(df['col'])      # Parse dates
df['date'].dt.year             # Extract year
df['date'].dt.month            # Extract month
df['date'].dt.day_name()       # Get day name

# Boolean
df['col'].map({'yes': True, 'no': False})

# Category
df['col'].astype('category')   # Convert to category

Type Conversion Guide

From              To           Method                  Notes
String numbers    int/float    pd.to_numeric()         Use errors='coerce'
String dates      datetime     pd.to_datetime()        Auto-detects formats
Yes/No text       boolean      .map()                  Create mapping dict
Repeated values   category     .astype('category')     Saves memory
Float             int          .astype(int)            Truncates decimals
Any type          string       .astype(str)            Always works

Error Handling

# errors='coerce' - invalid becomes NaN (recommended)
pd.to_numeric(series, errors='coerce')

# errors='ignore' - return original if error (deprecated in newer pandas versions)
pd.to_numeric(series, errors='ignore')

# errors='raise' - raise error (default)
pd.to_numeric(series, errors='raise')

Best Practices

  1. Always check types after loading data
  2. Use appropriate types - saves memory, enables operations
  3. Handle errors - use errors='coerce' for dirty data
  4. Category for repetition - countries, sectors, statuses
  5. Parse dates early - enables datetime operations
  6. Document conversions - note what you changed and why
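
One way to tie these practices together is a single, documented cleaning function (a sketch; it assumes pandas is imported as pd and uses the column names from this lesson's example data):

def clean_company_types(df):
    """Apply and document all type conversions for the companies data."""
    df = df.copy()
    # Dirty numeric columns: invalid entries (e.g. 'N/A') become NaN
    for col in ['revenues', 'profits', 'rating']:
        df[col] = pd.to_numeric(df[col], errors='coerce')
    # Whole-number column stored as strings
    df['founded'] = df['founded'].astype(int)
    # Dates and booleans
    df['established_date'] = pd.to_datetime(df['established_date'])
    df['is_public'] = df['is_public'].map({'Yes': True, 'No': False})
    return df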

Next Steps

You can now convert data types appropriately. In the next lesson, you will learn to find and handle duplicates and outliers—ensuring data quality and removing problematic values.

Continue to Lesson 13 - Duplicates and Outliers

Learn to find duplicates and detect unusual values

Back to Lesson 11 - Handling Missing Data

Review detecting and filling missing values


Master Type Conversion

Correct data types are fundamental to analysis. You cannot perform calculations on text, filter dates stored as strings, or optimize memory without proper types.

Use type conversion to prepare data for successful analysis!