Lesson 12 - Data Type Conversion and Cleaning

Converting Data Types

Real data arrives with wrong types—numbers stored as text, dates as strings, booleans as words. Now you will learn to convert data types, handle conversion errors, and prepare data for analysis.

By the end of this lesson, you will be able to:

  • Check and understand data types in DataFrames
  • Convert between numeric types safely
  • Parse dates from strings
  • Convert text to boolean values
  • Use category type for memory optimization
  • Handle conversion errors gracefully

Wrong data types prevent operations—you cannot calculate the average of string numbers or filter dates stored as text. Type conversion is essential for analysis.


Understanding Data Types

Pandas uses specific types for different kinds of data.

Common Pandas Data Types

  • object: Text (strings) or mixed types
  • int64: Integer numbers
  • float64: Decimal numbers
  • bool: True/False
  • datetime64: Dates and times
  • category: Categorical data (limited unique values)
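
A quick way to see these dtypes side by side (a minimal sketch; the column names here are just examples):

import pandas as pd

types_demo = pd.DataFrame({
    'text': ['a', 'b'],                                       # object
    'integers': [1, 2],                                       # int64
    'decimals': [1.5, 2.5],                                   # float64
    'flags': [True, False],                                   # bool
    'dates': pd.to_datetime(['2024-01-01', '2024-02-01']),    # datetime64[ns]
    'labels': pd.Categorical(['x', 'y'])                      # category
})
print(types_demo.dtypes)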

Sample Data with Type Issues

import pandas as pd
import numpy as np

# Create dataset with incorrect types
companies = pd.DataFrame({
    'company': ['TechCorp Global', 'FreshMart Inc', 'AutoDrive Motors', 'FirstBank Holdings', 'PowerGen Energy'],
    'revenues': ['125000', '89000', '156000', '234000', '178000'],  # Stored as strings!
    'profits': ['12000', '8500', '-3000', '45000', 'N/A'],  # Mix of numbers and text
    'employees': [1200, 890, 2300, 5600, 3400],
    'founded': ['1990', '2005', '1985', '2010', '1998'],  # Years as strings
    'established_date': ['2020-01-15', '2019-06-20', '2018-03-10', '2021-11-05', '2017-09-25'],
    'is_public': ['Yes', 'No', 'Yes', 'Yes', 'No'],  # Should be boolean
    'rating': ['4.5', '3.8', '4.2', '4.7', '3.9']  # Numbers as strings
})

print("Dataset with type issues:")
print(companies)
print("\nData types:")
print(companies.dtypes)

Output shows most numeric columns are ‘object’ type!

Checking Data Types

# Get detailed info
print("DataFrame info:")
companies.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   company            5 non-null      object
 1   revenues           5 non-null      object  ← Should be int/float!
 2   profits            5 non-null      object  ← Should be int/float!
 3   employees          5 non-null      int64   ✓ Correct
 4   founded            5 non-null      object  ← Should be int!
 5   established_date   5 non-null      object  ← Should be datetime!
 6   is_public          5 non-null      object  ← Should be bool!
 7   rating             5 non-null      object  ← Should be float!

Problem with Wrong Types

# Summing string "numbers" concatenates them instead of adding
print("Sum of string revenues:")
print(companies['revenues'].sum())  # One long concatenated string!

# Aggregations that need real numbers fail outright
try:
    print(companies['revenues'].mean())
except TypeError as e:
    print(f"Error: {e}")

Output: '12500089000156000234000178000' - string concatenation, not addition! And .mean() raises a TypeError instead of computing an average.


Converting Numeric Types

Convert string numbers to actual numeric types.

Using .astype()

# Convert string to integer
companies['revenues'] = companies['revenues'].astype(int)

print("After converting revenues to int:")
print(companies['revenues'].dtype)
print("\nNow we can calculate:")
print(f"Total revenue: ${companies['revenues'].sum():,}")
print(f"Average revenue: ${companies['revenues'].mean():,.0f}")

Output:

After converting revenues to int:
int64

Now we can calculate:
Total revenue: $782,000
Average revenue: $156,400

Handling Conversion Errors

# Problem: .astype() fails with non-numeric values
try:
    companies['profits'] = companies['profits'].astype(int)
except ValueError as e:
    print(f"Error: {e}")
    print("\nThe 'N/A' value cannot be converted to integer!")

Using pd.to_numeric() with Error Handling

# Solution: Use pd.to_numeric() with errors='coerce'
companies['profits'] = pd.to_numeric(companies['profits'], errors='coerce')

print("After converting with errors='coerce':")
print(companies['profits'])
print(f"\nType: {companies['profits'].dtype}")
print("\nNote: 'N/A' became NaN (missing value)")

Output:

0    12000.0
1     8500.0
2    -3000.0
3    45000.0
4       NaN   ← 'N/A' converted to NaN
Name: profits, dtype: float64
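
Note that the column became float64: NaN forces a float dtype. If you need whole numbers alongside missing values, pandas also offers the nullable integer dtype 'Int64' (capital I); a small sketch:

# Nullable integer dtype stores whole numbers and keeps the missing value as <NA>
profits_int = companies['profits'].astype('Int64')
print(profits_int)
print(profits_int.dtype)  # Int64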

Error Handling Options

# Different error handling strategies
test_data = pd.Series(['100', '200', 'invalid', '400'])

print("Original:")
print(test_data)

# errors='coerce' - convert invalid to NaN
print("\nWith errors='coerce':")
print(pd.to_numeric(test_data, errors='coerce'))

# errors='ignore' - return original if any error (deprecated in newer pandas versions)
print("\nWith errors='ignore':")
print(pd.to_numeric(test_data, errors='ignore'))

Output:

With errors='coerce':
0    100.0
1    200.0
2      NaN  ← Invalid value
3    400.0

With errors='ignore':
0        100
1        200
2    invalid  ← Unchanged
3        400

Float to Int Conversion

# Convert float to int (careful - drops decimals!)
companies['rating'] = companies['rating'].astype(float)
print("Rating as float:")
print(companies['rating'])

# Convert to int (truncates decimals)
companies['rating_int'] = companies['rating'].astype(int)
print("\nRating as int (decimals truncated):")
print(companies[['rating', 'rating_int']])

Output:

   rating  rating_int
0     4.5           4  ← Decimal lost
1     3.8           3
2     4.2           4
3     4.7           4
4     3.9           3

Rounding Before Converting

# Round first, then convert
companies['rating_rounded'] = companies['rating'].round().astype(int)

print("Rating rounded then converted:")
print(companies[['rating', 'rating_int', 'rating_rounded']])

Output:

   rating  rating_int  rating_rounded
0     4.5           4               4  ← 4.5 rounds to 4 (half-to-even)
1     3.8           3               4
2     4.2           4               4
3     4.7           4               5
4     3.9           3               4

Working with Datetime

Convert string dates to proper datetime objects.

Basic Date Conversion

# Convert string to datetime
companies['established_date'] = pd.to_datetime(companies['established_date'])

print("After converting to datetime:")
print(companies['established_date'])
print(f"\nType: {companies['established_date'].dtype}")

Output:

0   2020-01-15
1   2019-06-20
2   2018-03-10
3   2021-11-05
4   2017-09-25
Name: established_date, dtype: datetime64[ns]

Extracting Date Components

# Extract components
companies['year'] = companies['established_date'].dt.year
companies['month'] = companies['established_date'].dt.month
companies['day_of_week'] = companies['established_date'].dt.day_name()

print("Extracted date components:")
print(companies[['established_date', 'year', 'month', 'day_of_week']])

Calculating Time Differences

# Calculate years in operation
today = pd.Timestamp('2024-01-01')
companies['days_operating'] = (today - companies['established_date']).dt.days
companies['years_operating'] = (companies['days_operating'] / 365.25).round(1)

print("Years in operation:")
print(companies[['company', 'established_date', 'years_operating']])
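
Once the column is a datetime, you can also filter rows by date directly (a small sketch using the same companies data):

# Companies established in 2019 or later
recent = companies[companies['established_date'] >= '2019-01-01']
print(recent[['company', 'established_date']])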

Parsing Different Date Formats

# Pandas can parse many common date layouts; when a single Series mixes
# several layouts, newer pandas (2.x) needs format='mixed'
date_examples = pd.Series([
    '2020-01-15',        # ISO format
    '15/01/2020',        # Day/Month/Year
    'Jan 15, 2020',      # Month name
    '2020.01.15'         # Dot separator
])

parsed_dates = pd.to_datetime(date_examples, format='mixed', dayfirst=True)
print("All parsed to the same datetime type:")
print(parsed_dates)
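
When every value shares one known layout, it is often safer (and faster) to state the format explicitly rather than rely on auto-detection; a minimal sketch:

# Explicit format string: %d = day, %m = month, %Y = four-digit year
uk_dates = pd.Series(['15/01/2020', '20/06/2019', '10/03/2018'])
print(pd.to_datetime(uk_dates, format='%d/%m/%Y'))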

Boolean and Category Types

Convert text to boolean and use categories for efficiency.

Converting to Boolean

# Map Yes/No to True/False
companies['is_public_bool'] = companies['is_public'].map({'Yes': True, 'No': False})

print("Converted to boolean:")
print(companies[['company', 'is_public', 'is_public_bool']])
print(f"\nType: {companies['is_public_bool'].dtype}")

Output:

              company is_public  is_public_bool
0     TechCorp Global       Yes            True
1       FreshMart Inc        No           False
2    AutoDrive Motors       Yes            True
3  FirstBank Holdings       Yes            True
4     PowerGen Energy        No           False

Type: bool

Using Boolean for Filtering

# Filter public companies
public_companies = companies[companies['is_public_bool']]
print("Public companies:")
print(public_companies[['company', 'is_public_bool']])
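
Boolean columns also aggregate naturally, because True counts as 1 and False as 0 (a small sketch):

# Count and share of public companies
print(f"Public companies: {companies['is_public_bool'].sum()}")
print(f"Share public: {companies['is_public_bool'].mean():.0%}")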

Converting to Category Type

# Add sector column for demonstration
companies['sector'] = ['Technology', 'Food', 'Automotive', 'Financials', 'Energy']

print("Before category conversion:")
print(f"Memory usage: {companies['sector'].memory_usage(deep=True)} bytes")
print(f"Type: {companies['sector'].dtype}")

# Convert to category
companies['sector'] = companies['sector'].astype('category')

print("\nAfter category conversion:")
print(f"Memory usage: {companies['sector'].memory_usage(deep=True)} bytes")
print(f"Type: {companies['sector'].dtype}")
print(f"\nCategories: {companies['sector'].cat.categories.tolist()}")

Category type saves memory for columns with repeated values!
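
The savings grow with repetition: with only five rows and five unique sectors, the category overhead can even make the column slightly larger. A small sketch with a larger, repetitive column (exact byte counts vary by pandas version):

# 100,000 rows drawn from only four distinct sector names
big_sector = pd.Series(['Technology', 'Food', 'Automotive', 'Energy'] * 25_000)

print(f"object:   {big_sector.memory_usage(deep=True):,} bytes")
print(f"category: {big_sector.astype('category').memory_usage(deep=True):,} bytes")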

When to Use Category Type

Use category type for:

  • Columns with limited unique values (gender, status, region)
  • Categorical data that repeats often
  • Ordered categories (ratings: Poor, Good, Excellent)

# Ordered categories
performance = pd.Series(['Good', 'Excellent', 'Poor', 'Good', 'Excellent'])
performance_cat = pd.Categorical(
    performance,
    categories=['Poor', 'Good', 'Excellent'],
    ordered=True
)

print("Ordered categories:")
print(performance_cat)
print("\nCan compare:")
print(performance_cat > 'Good')

Output:

[Good, Excellent, Poor, Good, Excellent]
Categories (3, object): ['Poor' < 'Good' < 'Excellent']

Can compare:
[False  True False False  True]
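
Ordered categories also sort by their defined order rather than alphabetically (a small sketch):

# Sorting respects Poor < Good < Excellent, not alphabetical order
print(pd.Series(performance_cat).sort_values())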

Batch Type Conversion

# Convert multiple columns at once
type_conversions = {
    'founded': 'int',
    'employees': 'int'
}

companies = companies.astype(type_conversions)

print("After batch conversion:")
print(companies.dtypes)
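
If your data comes from a CSV file, you can often set types at load time instead of converting afterwards (a sketch using a hypothetical file name 'companies.csv'):

# dtype= fixes column types on read; parse_dates= converts date columns
loaded = pd.read_csv(
    'companies.csv',                      # hypothetical file
    dtype={'employees': 'int64', 'sector': 'category'},
    parse_dates=['established_date']
)
print(loaded.dtypes)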

Practice Exercises

Apply type conversion with these exercises.

Exercise 1: Numeric Conversions

Given this data:

sales = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D'],
    'price': ['25.50', '30.00', 'N/A', '45.75'],
    'quantity': ['100', '150', '200', '75']
})

  1. Convert ‘price’ to float, handling the ‘N/A’ value
  2. Convert ‘quantity’ to integer
  3. Calculate total sales (price × quantity) for non-null prices

# Your code here

Exercise 2: Datetime Operations

Given:

orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'order_date': ['2023-01-15', '2023-02-20', '2023-03-10', '2023-04-05'],
    'delivery_date': ['2023-01-20', '2023-02-25', '2023-03-15', '2023-04-10']
})

  1. Convert both date columns to datetime
  2. Calculate delivery time in days
  3. Extract the month name from order_date
  4. Find orders placed on a Monday

# Your code here

Hint

Use .dt.day_name() to get the day of the week!

Exercise 3: Boolean and Category

Given:

employees = pd.DataFrame({
    'name': ['Ali', 'Sara', 'Reza', 'Maryam'],
    'department': ['IT', 'HR', 'IT', 'Sales'],
    'is_manager': ['yes', 'no', 'yes', 'no'],
    'status': ['Active', 'Active', 'On Leave', 'Active']
})

  1. Convert ‘is_manager’ to boolean (yes→True, no→False)
  2. Convert ‘department’ and ‘status’ to category type
  3. Count how many managers there are
  4. Compare memory usage before and after category conversion

# Your code here

Summary

You now master data type conversion. Let’s review the key concepts.

Key Concepts

Checking Types

  • Use .dtypes and .info() to inspect types
  • Wrong types prevent operations
  • Type checking is the first data quality step

Numeric Conversion

  • .astype() for simple conversion
  • pd.to_numeric() for error handling
  • errors='coerce' converts invalid to NaN

Datetime Conversion

  • pd.to_datetime() parses dates
  • Extract components with .dt accessor
  • Calculate time differences for analysis

Boolean and Category

  • Map text to boolean with dictionary
  • Category type saves memory
  • Use categories for repeated values

Syntax Reference

# Check types
df.dtypes                      # Show all types
df.info()                      # Detailed info
df['col'].dtype                # Specific column

# Convert numeric
df['col'].astype(int)          # Convert to integer
df['col'].astype(float)        # Convert to float
pd.to_numeric(df['col'], errors='coerce')  # Handle errors

# Convert datetime
pd.to_datetime(df['col'])      # Parse dates
df['date'].dt.year             # Extract year
df['date'].dt.month            # Extract month
df['date'].dt.day_name()       # Get day name

# Boolean
df['col'].map({'yes': True, 'no': False})

# Category
df['col'].astype('category')   # Convert to category

Type Conversion Guide

From              To           Method                  Notes
String numbers    int/float    pd.to_numeric()         Use errors='coerce'
String dates      datetime     pd.to_datetime()        Auto-detects formats
Yes/No text       boolean      .map()                  Create mapping dict
Repeated values   category     .astype('category')     Saves memory
Float             int          .astype(int)            Truncates decimals
Any type          string       .astype(str)            Always works

Error Handling

# errors='coerce' - invalid becomes NaN (recommended)
pd.to_numeric(series, errors='coerce')

# errors='ignore' - return original if error (deprecated in newer pandas versions)
pd.to_numeric(series, errors='ignore')

# errors='raise' - raise error (default)
pd.to_numeric(series, errors='raise')

Best Practices

  1. Always check types after loading data
  2. Use appropriate types - saves memory, enables operations
  3. Handle errors - use errors='coerce' for dirty data
  4. Category for repetition - countries, sectors, statuses
  5. Parse dates early - enables datetime operations
  6. Document conversions - note what you changed and why
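
One way to tie these practices together is a single, documented cleaning function (a sketch; it assumes pandas is imported as pd and uses the column names from this lesson's example data):

def clean_company_types(df):
    """Apply and document all type conversions for the companies data."""
    df = df.copy()
    # Dirty numeric columns: invalid entries (e.g. 'N/A') become NaN
    for col in ['revenues', 'profits', 'rating']:
        df[col] = pd.to_numeric(df[col], errors='coerce')
    # Whole-number column stored as strings
    df['founded'] = df['founded'].astype(int)
    # Dates and booleans
    df['established_date'] = pd.to_datetime(df['established_date'])
    df['is_public'] = df['is_public'].map({'Yes': True, 'No': False})
    return df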

Next Steps

You can now convert data types appropriately. In the next lesson, you will learn to find and handle duplicates and outliers—ensuring data quality and removing problematic values.

Continue to Lesson 13 - Duplicates and Outliers

Learn to find duplicates and detect unusual values

Back to Lesson 11 - Handling Missing Data

Review detecting and filling missing values


Master Type Conversion

Correct data types are fundamental to analysis. You cannot perform calculations on text, filter dates stored as strings, or optimize memory without proper types.

Use type conversion to prepare data for successful analysis!