Lesson 14 - Working with String Data

Efficient Text Data Operations

Text data needs cleaning—inconsistent capitalization, whitespace, special characters, mixed formats. Now you will learn vectorized string operations that apply to entire columns at once.

By the end of this lesson, you will be able to:

  • Use .str accessor for string methods
  • Clean and format text data
  • Extract information from strings
  • Search and filter text efficiently
  • Apply basic regex patterns

String operations in pandas are vectorized—they work on entire Series at once, much faster than loops. Let’s master these essential text manipulation skills.


Understanding the .str Accessor

The .str accessor unlocks string methods for pandas Series. It works like Python string methods, but operates on entire columns at once.

Think of it this way: instead of looping through each value and applying a string method, you apply the method to the whole column in one operation. This is both faster and cleaner code.

# Without .str (slow, manual loop)
cleaned = []
for value in df['name']:
    cleaned.append(value.lower())
df['name_lower'] = cleaned

# With .str (fast, vectorized)
df['name_lower'] = df['name'].str.lower()

The .str accessor makes the second approach possible—one line, much faster.


Section 1: Basic String Operations

Case Conversion

Converting text case is essential for standardizing data. Different users enter data differently—some use ALL CAPS, some use lowercase, some use Title Case.

import pandas as pd

companies = pd.DataFrame({
    'company': ['TechCorp Global Solutions', 'SARA FOODS CO.', 'maryam bank',
                'PowerGen Energy Corp', '  RetailHub Ltd  ']
})

# Convert to lowercase
companies['company_lower'] = companies['company'].str.lower()

# Convert to uppercase
companies['company_upper'] = companies['company'].str.upper()

# Title case (capitalize first letter of each word)
companies['company_title'] = companies['company'].str.title()

print(companies[['company', 'company_lower', 'company_upper', 'company_title']])

Output shows all three versions side by side:

            company          company_lower          company_upper          company_title
0  TechCorp Global Solutions  ali tech solutions  ALI TECH SOLUTIONS  TechCorp Global Solutions
1    SARA FOODS CO.    sara foods co.    SARA FOODS CO.    FreshMart Inc Co.
2      maryam bank      maryam bank      MARYAM BANK      FirstBank Holdings

Use .lower() when you need consistency for comparisons or matching. Use .title() for display names that should look professional.

Removing Whitespace

Whitespace at the start or end of strings causes problems when matching or comparing data. Notice the spaces around “RetailHub Ltd” in the example above.

# Strip whitespace from both ends
companies['company_clean'] = companies['company'].str.strip()

# Before strip
print(repr(companies.loc[4, 'company']))
# Output: '  RetailHub Ltd  '  (notice the spaces)

# After strip
print(repr(companies.loc[4, 'company_clean']))
# Output: 'RetailHub Ltd'  (spaces removed)

The .strip() method removes spaces, tabs, and newlines from both ends. Use .lstrip() to remove only from the left, or .rstrip() to remove only from the right.

String Length

Finding string length helps identify data quality issues—names that are too short (possibly incomplete) or too long (possibly incorrect).

# Get length of each string
companies['name_length'] = companies['company'].str.len()

# Find longest and shortest names
print(companies[['company', 'name_length']].sort_values('name_length', ascending=False))

Output shows names sorted by length:

            company  name_length
0  TechCorp Global Solutions           18
1    SARA FOODS CO.           15
3  PowerGen Energy Corp           18

Use length to validate data or find outliers.

Replacing Text

Replace text to standardize abbreviations, fix typos, or clean data.

# Replace abbreviations with full words
companies['company_replaced'] = companies['company'].str.replace('Ltd.', 'Limited')
companies['company_replaced'] = companies['company_replaced'].str.replace('Inc', 'Incorporated')
companies['company_replaced'] = companies['company_replaced'].str.replace('Corp', 'Corporation')

print(companies[['company', 'company_replaced']])

The .replace() method finds exact matches and replaces them. For pattern matching, use regex=True parameter.


Section 2: Extracting and Splitting Strings

Splitting Strings

Split strings to extract parts of the data. Common use: separating first names from last names, or splitting email addresses.

# Split by space and get first word
companies['owner_name'] = companies['company'].str.split().str[0]

print(companies[['company', 'owner_name']])

Output:

            company   owner_name
0  TechCorp Global Solutions         Ali
1    SARA FOODS CO.        SARA
2      maryam bank      maryam

The .split() method returns a list of words. Use .str[0] to get the first word, .str[1] for the second, and so on.

Splitting into Multiple Columns

Sometimes you want to split one column into several columns. Use expand=True:

# Create email column
companies['email'] = ['[email protected]', '[email protected]', '[email protected]']

# Split email into username and domain
email_parts = companies['email'].str.split('@', expand=True)
email_parts.columns = ['email_user', 'email_domain']

# Add to original DataFrame
companies = pd.concat([companies, email_parts], axis=1)

print(companies[['email', 'email_user', 'email_domain']])

Output:

                email  email_user     email_domain
0     [email protected]        info       alitech.ir
1  [email protected]     contact   sarafoods.com
2  [email protected]     support   maryambank.qa

Now you have the email split into its component parts for further analysis.

Extracting with Indexing

You can also extract parts by position:

# Get first 3 characters as company code
companies['company_code'] = companies['company'].str[:3].str.upper()

print(companies[['company', 'company_code']])

Output:

            company company_code
0  TechCorp Global Solutions          ALI
1    SARA FOODS CO.          SAR
2      maryam bank          MAR

Slicing works just like Python string slicing: str[:3] gets first 3 characters, str[-2:] gets last 2 characters.

Extracting with Split

Extract parts after splitting:

# Get domain extension from email
companies['domain_extension'] = companies['email_domain'].str.split('.').str[-1]

print(companies[['email_domain', 'domain_extension']])

Output:

     email_domain domain_extension
0       alitech.ir               ir
1   sarafoods.com              com
2   maryambank.qa               qa

The .str[-1] gets the last element after splitting by period.


Section 3: Searching and Filtering Text

Checking for Substrings

Find rows where text contains specific words:

# Find companies with 'Tech' in name
has_tech = companies['company'].str.contains('Tech', case=False)

print("Companies with 'Tech' in name:")
print(companies[has_tech][['company']])

Output:

            company
0  TechCorp Global Solutions

The case=False parameter makes the search case-insensitive, so it finds “Tech”, “tech”, “TECH”, etc.

Multiple Search Terms

Use the pipe symbol | for OR logic:

# Find companies with Tech, Energy, or Bank
has_keywords = companies['company'].str.contains('Tech|Energy|Bank', case=False)

print(companies[has_keywords][['company']])

This finds any row containing at least one of these words.

Starts With and Ends With

Check if strings start or end with specific text:

# Companies starting with 'A'
starts_with_a = companies['company'].str.startswith('A', na=False)

print("Companies starting with 'A':")
print(companies[starts_with_a][['company']])

# Companies ending with specific suffixes
ends_with_corp = companies['company'].str.endswith(('Corp', 'Ltd.', 'Inc'), na=False)

print("Companies ending with Corp, Ltd., or Inc:")
print(companies[ends_with_corp][['company']])

The na=False parameter handles missing values safely.

Use regex patterns for complex searches:

# Find Iranian emails (ending with .ir)
iranian_emails = companies['email'].str.contains('.ir$', regex=True)

print("Companies with .ir email:")
print(companies[iranian_emails][['company', 'email']])

The $ in the regex means “end of string”, so it only matches emails ending with .ir.


Section 4: Cleaning and Formatting

Cleaning Phone Numbers

Remove all non-numeric characters from phone numbers:

companies['phone'] = ['+98-21-1234567', '+971 4 555 1234', '(974) 44-55-6677']

# Remove everything except digits
companies['phone_clean'] = companies['phone'].str.replace(r'[^0-9]', '', regex=True)

print(companies[['phone', 'phone_clean']])

Output:

            phone    phone_clean
0  +98-21-1234567  98211234567
1  +971 4 555 1234 9714555123
2  (974) 44-55-6677 97444556677

The regex pattern [^0-9] means “anything that is NOT a digit”. Replacing with empty string removes all non-digits.

Removing Special Characters

Clean company names by removing punctuation:

# Remove everything except letters, numbers, and spaces
companies['company_alphanumeric'] = companies['company'].str.replace(r'[^a-zA-Z0-9\s]', '', regex=True)

print(companies[['company', 'company_alphanumeric']])

Output:

            company company_alphanumeric
0  TechCorp Global Solutions  TechCorp Global Solutions
1    SARA FOODS CO.       SARA FOODS CO
2      maryam bank          maryam bank

Notice periods and other punctuation are removed.

Normalizing Whitespace

Replace multiple spaces with single space:

companies['description'] = ['Leading    tech    provider', 'Food  service  company']

# Replace multiple spaces with single space
companies['description_clean'] = companies['description'].str.replace(r'\s+', ' ', regex=True).str.strip()

print(companies[['description', 'description_clean']])

Output:

                  description        description_clean
0  Leading    tech    provider  Leading tech provider
1   Food  service  company     Food service company

The regex \s+ means “one or more whitespace characters”.

Extracting Numbers from Text

Pull numbers out of text using regex:

test_data = pd.Series(['Founded in 1990', 'Established 2005', 'Since 1985', 'New company'])

# Extract 4-digit years
years = test_data.str.extract(r'(\d{4})', expand=False)

print(pd.DataFrame({'text': test_data, 'year': years}))

Output:

              text  year
0   Founded in 1990  1990
1  Established 2005  2005
2      Since 1985  1985
3     New company   NaN

The regex \d{4} means “exactly 4 digits”. Missing values become NaN.

Counting Words

Count words in descriptions:

# Count words by splitting on whitespace
word_count = companies['description'].str.split().str.len()

companies['description_words'] = word_count

print(companies[['description', 'description_words']])

This splits each description into words and counts how many words result.


Common Patterns and Use Cases

Email Validation Pattern

# Basic email pattern check
is_valid_email = companies['email'].str.match(r'^[\w\.-]+@[\w\.-]+\.\w+$')

URL Standardization

def standardize_url(url):
    url = url.lower().strip()
    if url.startswith('http'):
        return url
    elif url.startswith('www.'):
        return 'https://' + url
    else:
        return 'https://www.' + url

companies['website_clean'] = companies['website'].apply(standardize_url)

Data Quality Scoring

Create a score based on data completeness:

score = 0
score += companies['phone'].notna().astype(int) * 25  # Has phone
score += companies['email'].str.islower().astype(int) * 25  # Email is lowercase
score += companies['website'].notna().astype(int) * 25  # Has website
score += companies['description'].str.len().gt(20).astype(int) * 25  # Description > 20 chars

companies['data_quality_score'] = score

Summary

You have completed an extensive lesson on pandas string operations! Here is what you learned:

Key Concepts

The .str Accessor

  • Enables vectorized string operations
  • Works on entire Series at once
  • Much faster than Python loops

Basic Operations

  • Case conversion: .str.lower(), .str.upper(), .str.title()
  • Whitespace: .str.strip(), .str.lstrip(), .str.rstrip()
  • Length: .str.len()
  • Replace: .str.replace()

Extracting and Splitting

  • Split: .str.split(), access parts with .str[index]
  • Slice: .str[:n] for first n characters
  • Extract with regex: .str.extract()

Searching and Filtering

  • Contains: .str.contains()
  • Starts and ends: .str.startswith(), .str.endswith()
  • Use for filtering DataFrames

Cleaning and Formatting

  • Remove special characters with regex
  • Normalize whitespace
  • Extract numbers from text
  • Standardize formats

String operations are essential for real-world data cleaning. The .str accessor makes text manipulation fast and efficient.


Next Steps

You can now clean and manipulate text data efficiently. In the next lesson, you will learn GroupBy and aggregation—the most powerful pandas feature for category-based analysis.

Continue to Lesson 15 - GroupBy & Aggregation

Learn to group data and calculate statistics per category

Back to Lesson 13 - Duplicates & Outliers

Review finding and handling data quality issues


Master Text Data Manipulation

String operations are essential for cleaning real-world data. Use the .str accessor to transform text efficiently without writing loops!