Lesson 14 - Working with String Data

Efficient Text Data Operations

Text data needs cleaning—inconsistent capitalization, whitespace, special characters, mixed formats. Now you will learn vectorized string operations that apply to entire columns at once.

By the end of this lesson, you will be able to:

Use .str accessor for string methods
Clean and format text data
Extract information from strings
Search and filter text efficiently
Apply basic regex patterns

String operations in pandas are vectorized—they work on entire Series at once, much faster than loops. Let’s master these essential text manipulation skills.

Understanding the .str Accessor

The .str accessor unlocks string methods for pandas Series. It works like Python string methods, but operates on entire columns at once.

Think of it this way: instead of looping through each value and applying a string method, you apply the method to the whole column in one operation. This is both faster and cleaner code.

# Without .str (slow, manual loop)
cleaned = []
for value in df['name']:
    cleaned.append(value.lower())
df['name_lower'] = cleaned

# With .str (fast, vectorized)
df['name_lower'] = df['name'].str.lower()

The .str accessor makes the second approach possible—one line, much faster.

Section 1: Basic String Operations

Case Conversion

Converting text case is essential for standardizing data. Different users enter data differently—some use ALL CAPS, some use lowercase, some use Title Case.

import pandas as pd

companies = pd.DataFrame({
    'company': ['TechCorp Global Solutions', 'SARA FOODS CO.', 'maryam bank',
                'PowerGen Energy Corp', '  RetailHub Ltd  ']
})

# Convert to lowercase
companies['company_lower'] = companies['company'].str.lower()

# Convert to uppercase
companies['company_upper'] = companies['company'].str.upper()

# Title case (capitalize first letter of each word)
companies['company_title'] = companies['company'].str.title()

print(companies[['company', 'company_lower', 'company_upper', 'company_title']])

Output shows all three versions side by side:

            company          company_lower          company_upper          company_title
0  TechCorp Global Solutions  ali tech solutions  ALI TECH SOLUTIONS  TechCorp Global Solutions
1    SARA FOODS CO.    sara foods co.    SARA FOODS CO.    FreshMart Inc Co.
2      maryam bank      maryam bank      MARYAM BANK      FirstBank Holdings

Use .lower() when you need consistency for comparisons or matching. Use .title() for display names that should look professional.

Removing Whitespace

Whitespace at the start or end of strings causes problems when matching or comparing data. Notice the spaces around “RetailHub Ltd” in the example above.

# Strip whitespace from both ends
companies['company_clean'] = companies['company'].str.strip()

# Before strip
print(repr(companies.loc[4, 'company']))
# Output: '  RetailHub Ltd  '  (notice the spaces)

# After strip
print(repr(companies.loc[4, 'company_clean']))
# Output: 'RetailHub Ltd'  (spaces removed)

The .strip() method removes spaces, tabs, and newlines from both ends. Use .lstrip() to remove only from the left, or .rstrip() to remove only from the right.

String Length

Finding string length helps identify data quality issues—names that are too short (possibly incomplete) or too long (possibly incorrect).

# Get length of each string
companies['name_length'] = companies['company'].str.len()

# Find longest and shortest names
print(companies[['company', 'name_length']].sort_values('name_length', ascending=False))

Output shows names sorted by length:

            company  name_length
0  TechCorp Global Solutions           18
1    SARA FOODS CO.           15
3  PowerGen Energy Corp           18

Use length to validate data or find outliers.

Replacing Text

Replace text to standardize abbreviations, fix typos, or clean data.

# Replace abbreviations with full words
companies['company_replaced'] = companies['company'].str.replace('Ltd.', 'Limited')
companies['company_replaced'] = companies['company_replaced'].str.replace('Inc', 'Incorporated')
companies['company_replaced'] = companies['company_replaced'].str.replace('Corp', 'Corporation')

print(companies[['company', 'company_replaced']])

The .replace() method finds exact matches and replaces them. For pattern matching, use regex=True parameter.

Section 2: Extracting and Splitting Strings

Splitting Strings

Split strings to extract parts of the data. Common use: separating first names from last names, or splitting email addresses.

# Split by space and get first word
companies['owner_name'] = companies['company'].str.split().str[0]

print(companies[['company', 'owner_name']])

Output:

            company   owner_name
0  TechCorp Global Solutions         Ali
1    SARA FOODS CO.        SARA
2      maryam bank      maryam

The .split() method returns a list of words. Use .str[0] to get the first word, .str[1] for the second, and so on.

Splitting into Multiple Columns

Sometimes you want to split one column into several columns. Use expand=True:

# Create email column
companies['email'] = ['[email protected]', '[email protected]', '[email protected]']

# Split email into username and domain
email_parts = companies['email'].str.split('@', expand=True)
email_parts.columns = ['email_user', 'email_domain']

# Add to original DataFrame
companies = pd.concat([companies, email_parts], axis=1)

print(companies[['email', 'email_user', 'email_domain']])

Output:

                email  email_user     email_domain
0     [email protected]        info       alitech.ir
1  [email protected]     contact   sarafoods.com
2  [email protected]     support   maryambank.qa

Now you have the email split into its component parts for further analysis.

Extracting with Indexing

You can also extract parts by position:

# Get first 3 characters as company code
companies['company_code'] = companies['company'].str[:3].str.upper()

print(companies[['company', 'company_code']])

Output:

            company company_code
0  TechCorp Global Solutions          ALI
1    SARA FOODS CO.          SAR
2      maryam bank          MAR

Slicing works just like Python string slicing: str[:3] gets first 3 characters, str[-2:] gets last 2 characters.

Extracting with Split

Extract parts after splitting:

# Get domain extension from email
companies['domain_extension'] = companies['email_domain'].str.split('.').str[-1]

print(companies[['email_domain', 'domain_extension']])

Output:

     email_domain domain_extension
0       alitech.ir               ir
1   sarafoods.com              com
2   maryambank.qa               qa

The .str[-1] gets the last element after splitting by period.

Section 3: Searching and Filtering Text

Checking for Substrings

Find rows where text contains specific words:

# Find companies with 'Tech' in name
has_tech = companies['company'].str.contains('Tech', case=False)

print("Companies with 'Tech' in name:")
print(companies[has_tech][['company']])

Output:

            company
0  TechCorp Global Solutions

The case=False parameter makes the search case-insensitive, so it finds “Tech”, “tech”, “TECH”, etc.

Multiple Search Terms

Use the pipe symbol | for OR logic:

# Find companies with Tech, Energy, or Bank
has_keywords = companies['company'].str.contains('Tech|Energy|Bank', case=False)

print(companies[has_keywords][['company']])

This finds any row containing at least one of these words.

Starts With and Ends With

Check if strings start or end with specific text:

# Companies starting with 'A'
starts_with_a = companies['company'].str.startswith('A', na=False)

print("Companies starting with 'A':")
print(companies[starts_with_a][['company']])

# Companies ending with specific suffixes
ends_with_corp = companies['company'].str.endswith(('Corp', 'Ltd.', 'Inc'), na=False)

print("Companies ending with Corp, Ltd., or Inc:")
print(companies[ends_with_corp][['company']])

The na=False parameter handles missing values safely.

Regular Expression Search

Use regex patterns for complex searches:

# Find Iranian emails (ending with .ir)
iranian_emails = companies['email'].str.contains('.ir$', regex=True)

print("Companies with .ir email:")
print(companies[iranian_emails][['company', 'email']])

The $ in the regex means “end of string”, so it only matches emails ending with .ir.

Section 4: Cleaning and Formatting

Cleaning Phone Numbers

Remove all non-numeric characters from phone numbers:

companies['phone'] = ['+98-21-1234567', '+971 4 555 1234', '(974) 44-55-6677']

# Remove everything except digits
companies['phone_clean'] = companies['phone'].str.replace(r'[^0-9]', '', regex=True)

print(companies[['phone', 'phone_clean']])

Output:

            phone    phone_clean
0  +98-21-1234567  98211234567
1  +971 4 555 1234 9714555123
2  (974) 44-55-6677 97444556677

The regex pattern [^0-9] means “anything that is NOT a digit”. Replacing with empty string removes all non-digits.

Removing Special Characters

Clean company names by removing punctuation:

# Remove everything except letters, numbers, and spaces
companies['company_alphanumeric'] = companies['company'].str.replace(r'[^a-zA-Z0-9\s]', '', regex=True)

print(companies[['company', 'company_alphanumeric']])

Output:

            company company_alphanumeric
0  TechCorp Global Solutions  TechCorp Global Solutions
1    SARA FOODS CO.       SARA FOODS CO
2      maryam bank          maryam bank

Notice periods and other punctuation are removed.

Normalizing Whitespace

Replace multiple spaces with single space:

companies['description'] = ['Leading    tech    provider', 'Food  service  company']

# Replace multiple spaces with single space
companies['description_clean'] = companies['description'].str.replace(r'\s+', ' ', regex=True).str.strip()

print(companies[['description', 'description_clean']])

Output:

                  description        description_clean
0  Leading    tech    provider  Leading tech provider
1   Food  service  company     Food service company

The regex \s+ means “one or more whitespace characters”.

Extracting Numbers from Text

Pull numbers out of text using regex:

test_data = pd.Series(['Founded in 1990', 'Established 2005', 'Since 1985', 'New company'])

# Extract 4-digit years
years = test_data.str.extract(r'(\d{4})', expand=False)

print(pd.DataFrame({'text': test_data, 'year': years}))

Output:

              text  year
0   Founded in 1990  1990
1  Established 2005  2005
2      Since 1985  1985
3     New company   NaN

The regex \d{4} means “exactly 4 digits”. Missing values become NaN.

Counting Words

Count words in descriptions:

# Count words by splitting on whitespace
word_count = companies['description'].str.split().str.len()

companies['description_words'] = word_count

print(companies[['description', 'description_words']])

This splits each description into words and counts how many words result.

Common Patterns and Use Cases

Email Validation Pattern

# Basic email pattern check
is_valid_email = companies['email'].str.match(r'^[\w\.-]+@[\w\.-]+\.\w+$')

URL Standardization

def standardize_url(url):
    url = url.lower().strip()
    if url.startswith('http'):
        return url
    elif url.startswith('www.'):
        return 'https://' + url
    else:
        return 'https://www.' + url

companies['website_clean'] = companies['website'].apply(standardize_url)

Data Quality Scoring

Create a score based on data completeness:

score = 0
score += companies['phone'].notna().astype(int) * 25  # Has phone
score += companies['email'].str.islower().astype(int) * 25  # Email is lowercase
score += companies['website'].notna().astype(int) * 25  # Has website
score += companies['description'].str.len().gt(20).astype(int) * 25  # Description > 20 chars

companies['data_quality_score'] = score

Summary

You have completed an extensive lesson on pandas string operations! Here is what you learned:

Key Concepts

The .str Accessor

Enables vectorized string operations
Works on entire Series at once
Much faster than Python loops

Basic Operations

Case conversion: .str.lower(), .str.upper(), .str.title()
Whitespace: .str.strip(), .str.lstrip(), .str.rstrip()
Length: .str.len()
Replace: .str.replace()

Extracting and Splitting

Split: .str.split(), access parts with .str[index]
Slice: .str[:n] for first n characters
Extract with regex: .str.extract()

Searching and Filtering

Contains: .str.contains()
Starts and ends: .str.startswith(), .str.endswith()
Use for filtering DataFrames

Cleaning and Formatting

Remove special characters with regex
Normalize whitespace
Extract numbers from text
Standardize formats

String operations are essential for real-world data cleaning. The .str accessor makes text manipulation fast and efficient.

Next Steps

You can now clean and manipulate text data efficiently. In the next lesson, you will learn GroupBy and aggregation—the most powerful pandas feature for category-based analysis.

Continue to Lesson 15 - GroupBy & Aggregation

Learn to group data and calculate statistics per category

Back to Lesson 13 - Duplicates & Outliers

Review finding and handling data quality issues

Master Text Data Manipulation

String operations are essential for cleaning real-world data. Use the .str accessor to transform text efficiently without writing loops!

Lesson 13 - Removing Duplicates and Handling Outliers

Lesson 15 - GroupBy and Aggregation

Courses

DATATWEETS

Title here

Lesson 14 - Working with String Data

Efficient Text Data Operations

Understanding the .str Accessor

Section 1: Basic String Operations

Case Conversion

Removing Whitespace

String Length

Replacing Text

Section 2: Extracting and Splitting Strings

Splitting Strings

Splitting into Multiple Columns

Extracting with Indexing

Extracting with Split

Section 3: Searching and Filtering Text

Checking for Substrings

Multiple Search Terms

Starts With and Ends With

Regular Expression Search

Section 4: Cleaning and Formatting

Cleaning Phone Numbers

Removing Special Characters

Normalizing Whitespace

Extracting Numbers from Text

Counting Words

Common Patterns and Use Cases

Email Validation Pattern

URL Standardization

Data Quality Scoring

Summary

Key Concepts

Next Steps

Continue to Lesson 15 - GroupBy & Aggregation

Back to Lesson 13 - Duplicates & Outliers

Master Text Data Manipulation

Lesson 14 - Working with String Data

Efficient Text Data Operations#

Understanding the .str Accessor#

Section 1: Basic String Operations#

Case Conversion#

Removing Whitespace#

String Length#

Replacing Text#

Section 2: Extracting and Splitting Strings#

Splitting Strings#

Splitting into Multiple Columns#

Extracting with Indexing#

Extracting with Split#

Section 3: Searching and Filtering Text#

Checking for Substrings#

Multiple Search Terms#

Starts With and Ends With#

Regular Expression Search#

Section 4: Cleaning and Formatting#

Cleaning Phone Numbers#

Removing Special Characters#

Normalizing Whitespace#

Extracting Numbers from Text#

Counting Words#

Common Patterns and Use Cases#

Email Validation Pattern#

URL Standardization#

Data Quality Scoring#

Summary#

Key Concepts#

Next Steps#

Continue to Lesson 15 - GroupBy & Aggregation

Back to Lesson 13 - Duplicates & Outliers

Master Text Data Manipulation#

Efficient Text Data Operations

Understanding the .str Accessor

Section 1: Basic String Operations

Case Conversion

Removing Whitespace

String Length

Replacing Text

Section 2: Extracting and Splitting Strings

Splitting Strings

Splitting into Multiple Columns

Extracting with Indexing

Extracting with Split

Section 3: Searching and Filtering Text

Checking for Substrings

Multiple Search Terms

Starts With and Ends With

Regular Expression Search

Section 4: Cleaning and Formatting

Cleaning Phone Numbers

Removing Special Characters

Normalizing Whitespace

Extracting Numbers from Text

Counting Words

Common Patterns and Use Cases

Email Validation Pattern

URL Standardization

Data Quality Scoring

Summary

Key Concepts

Next Steps

Master Text Data Manipulation