Lesson 14 - Working with String Data
Efficient Text Data Operations
Text data needs cleaning—inconsistent capitalization, whitespace, special characters, mixed formats. Now you will learn vectorized string operations that apply to entire columns at once.
By the end of this lesson, you will be able to:
- Use
.straccessor for string methods - Clean and format text data
- Extract information from strings
- Search and filter text efficiently
- Apply basic regex patterns
String operations in pandas are vectorized—they work on entire Series at once, much faster than loops. Let’s master these essential text manipulation skills.
Understanding the .str Accessor
The .str accessor unlocks string methods for pandas Series. It works like Python string methods, but operates on entire columns at once.
Think of it this way: instead of looping through each value and applying a string method, you apply the method to the whole column in one operation. This is both faster and cleaner code.
# Without .str (slow, manual loop)
cleaned = []
for value in df['name']:
cleaned.append(value.lower())
df['name_lower'] = cleaned
# With .str (fast, vectorized)
df['name_lower'] = df['name'].str.lower()The .str accessor makes the second approach possible—one line, much faster.
Section 1: Basic String Operations
Case Conversion
Converting text case is essential for standardizing data. Different users enter data differently—some use ALL CAPS, some use lowercase, some use Title Case.
import pandas as pd
companies = pd.DataFrame({
'company': ['TechCorp Global Solutions', 'SARA FOODS CO.', 'maryam bank',
'PowerGen Energy Corp', ' RetailHub Ltd ']
})
# Convert to lowercase
companies['company_lower'] = companies['company'].str.lower()
# Convert to uppercase
companies['company_upper'] = companies['company'].str.upper()
# Title case (capitalize first letter of each word)
companies['company_title'] = companies['company'].str.title()
print(companies[['company', 'company_lower', 'company_upper', 'company_title']])Output shows all three versions side by side:
company company_lower company_upper company_title
0 TechCorp Global Solutions ali tech solutions ALI TECH SOLUTIONS TechCorp Global Solutions
1 SARA FOODS CO. sara foods co. SARA FOODS CO. FreshMart Inc Co.
2 maryam bank maryam bank MARYAM BANK FirstBank HoldingsUse .lower() when you need consistency for comparisons or matching. Use .title() for display names that should look professional.
Removing Whitespace
Whitespace at the start or end of strings causes problems when matching or comparing data. Notice the spaces around “RetailHub Ltd” in the example above.
# Strip whitespace from both ends
companies['company_clean'] = companies['company'].str.strip()
# Before strip
print(repr(companies.loc[4, 'company']))
# Output: ' RetailHub Ltd ' (notice the spaces)
# After strip
print(repr(companies.loc[4, 'company_clean']))
# Output: 'RetailHub Ltd' (spaces removed)The .strip() method removes spaces, tabs, and newlines from both ends. Use .lstrip() to remove only from the left, or .rstrip() to remove only from the right.
String Length
Finding string length helps identify data quality issues—names that are too short (possibly incomplete) or too long (possibly incorrect).
# Get length of each string
companies['name_length'] = companies['company'].str.len()
# Find longest and shortest names
print(companies[['company', 'name_length']].sort_values('name_length', ascending=False))Output shows names sorted by length:
company name_length
0 TechCorp Global Solutions 18
1 SARA FOODS CO. 15
3 PowerGen Energy Corp 18Use length to validate data or find outliers.
Replacing Text
Replace text to standardize abbreviations, fix typos, or clean data.
# Replace abbreviations with full words
companies['company_replaced'] = companies['company'].str.replace('Ltd.', 'Limited')
companies['company_replaced'] = companies['company_replaced'].str.replace('Inc', 'Incorporated')
companies['company_replaced'] = companies['company_replaced'].str.replace('Corp', 'Corporation')
print(companies[['company', 'company_replaced']])The .replace() method finds exact matches and replaces them. For pattern matching, use regex=True parameter.
Section 2: Extracting and Splitting Strings
Splitting Strings
Split strings to extract parts of the data. Common use: separating first names from last names, or splitting email addresses.
# Split by space and get first word
companies['owner_name'] = companies['company'].str.split().str[0]
print(companies[['company', 'owner_name']])Output:
company owner_name
0 TechCorp Global Solutions Ali
1 SARA FOODS CO. SARA
2 maryam bank maryamThe .split() method returns a list of words. Use .str[0] to get the first word, .str[1] for the second, and so on.
Splitting into Multiple Columns
Sometimes you want to split one column into several columns. Use expand=True:
# Create email column
companies['email'] = ['[email protected]', '[email protected]', '[email protected]']
# Split email into username and domain
email_parts = companies['email'].str.split('@', expand=True)
email_parts.columns = ['email_user', 'email_domain']
# Add to original DataFrame
companies = pd.concat([companies, email_parts], axis=1)
print(companies[['email', 'email_user', 'email_domain']])Output:
email email_user email_domain
0 [email protected] info alitech.ir
1 [email protected] contact sarafoods.com
2 [email protected] support maryambank.qaNow you have the email split into its component parts for further analysis.
Extracting with Indexing
You can also extract parts by position:
# Get first 3 characters as company code
companies['company_code'] = companies['company'].str[:3].str.upper()
print(companies[['company', 'company_code']])Output:
company company_code
0 TechCorp Global Solutions ALI
1 SARA FOODS CO. SAR
2 maryam bank MARSlicing works just like Python string slicing: str[:3] gets first 3 characters, str[-2:] gets last 2 characters.
Extracting with Split
Extract parts after splitting:
# Get domain extension from email
companies['domain_extension'] = companies['email_domain'].str.split('.').str[-1]
print(companies[['email_domain', 'domain_extension']])Output:
email_domain domain_extension
0 alitech.ir ir
1 sarafoods.com com
2 maryambank.qa qaThe .str[-1] gets the last element after splitting by period.
Section 3: Searching and Filtering Text
Checking for Substrings
Find rows where text contains specific words:
# Find companies with 'Tech' in name
has_tech = companies['company'].str.contains('Tech', case=False)
print("Companies with 'Tech' in name:")
print(companies[has_tech][['company']])Output:
company
0 TechCorp Global SolutionsThe case=False parameter makes the search case-insensitive, so it finds “Tech”, “tech”, “TECH”, etc.
Multiple Search Terms
Use the pipe symbol | for OR logic:
# Find companies with Tech, Energy, or Bank
has_keywords = companies['company'].str.contains('Tech|Energy|Bank', case=False)
print(companies[has_keywords][['company']])This finds any row containing at least one of these words.
Starts With and Ends With
Check if strings start or end with specific text:
# Companies starting with 'A'
starts_with_a = companies['company'].str.startswith('A', na=False)
print("Companies starting with 'A':")
print(companies[starts_with_a][['company']])
# Companies ending with specific suffixes
ends_with_corp = companies['company'].str.endswith(('Corp', 'Ltd.', 'Inc'), na=False)
print("Companies ending with Corp, Ltd., or Inc:")
print(companies[ends_with_corp][['company']])The na=False parameter handles missing values safely.
Regular Expression Search
Use regex patterns for complex searches:
# Find Iranian emails (ending with .ir)
iranian_emails = companies['email'].str.contains('.ir$', regex=True)
print("Companies with .ir email:")
print(companies[iranian_emails][['company', 'email']])The $ in the regex means “end of string”, so it only matches emails ending with .ir.
Section 4: Cleaning and Formatting
Cleaning Phone Numbers
Remove all non-numeric characters from phone numbers:
companies['phone'] = ['+98-21-1234567', '+971 4 555 1234', '(974) 44-55-6677']
# Remove everything except digits
companies['phone_clean'] = companies['phone'].str.replace(r'[^0-9]', '', regex=True)
print(companies[['phone', 'phone_clean']])Output:
phone phone_clean
0 +98-21-1234567 98211234567
1 +971 4 555 1234 9714555123
2 (974) 44-55-6677 97444556677The regex pattern [^0-9] means “anything that is NOT a digit”. Replacing with empty string removes all non-digits.
Removing Special Characters
Clean company names by removing punctuation:
# Remove everything except letters, numbers, and spaces
companies['company_alphanumeric'] = companies['company'].str.replace(r'[^a-zA-Z0-9\s]', '', regex=True)
print(companies[['company', 'company_alphanumeric']])Output:
company company_alphanumeric
0 TechCorp Global Solutions TechCorp Global Solutions
1 SARA FOODS CO. SARA FOODS CO
2 maryam bank maryam bankNotice periods and other punctuation are removed.
Normalizing Whitespace
Replace multiple spaces with single space:
companies['description'] = ['Leading tech provider', 'Food service company']
# Replace multiple spaces with single space
companies['description_clean'] = companies['description'].str.replace(r'\s+', ' ', regex=True).str.strip()
print(companies[['description', 'description_clean']])Output:
description description_clean
0 Leading tech provider Leading tech provider
1 Food service company Food service companyThe regex \s+ means “one or more whitespace characters”.
Extracting Numbers from Text
Pull numbers out of text using regex:
test_data = pd.Series(['Founded in 1990', 'Established 2005', 'Since 1985', 'New company'])
# Extract 4-digit years
years = test_data.str.extract(r'(\d{4})', expand=False)
print(pd.DataFrame({'text': test_data, 'year': years}))Output:
text year
0 Founded in 1990 1990
1 Established 2005 2005
2 Since 1985 1985
3 New company NaNThe regex \d{4} means “exactly 4 digits”. Missing values become NaN.
Counting Words
Count words in descriptions:
# Count words by splitting on whitespace
word_count = companies['description'].str.split().str.len()
companies['description_words'] = word_count
print(companies[['description', 'description_words']])This splits each description into words and counts how many words result.
Common Patterns and Use Cases
Email Validation Pattern
# Basic email pattern check
is_valid_email = companies['email'].str.match(r'^[\w\.-]+@[\w\.-]+\.\w+$')URL Standardization
def standardize_url(url):
url = url.lower().strip()
if url.startswith('http'):
return url
elif url.startswith('www.'):
return 'https://' + url
else:
return 'https://www.' + url
companies['website_clean'] = companies['website'].apply(standardize_url)Data Quality Scoring
Create a score based on data completeness:
score = 0
score += companies['phone'].notna().astype(int) * 25 # Has phone
score += companies['email'].str.islower().astype(int) * 25 # Email is lowercase
score += companies['website'].notna().astype(int) * 25 # Has website
score += companies['description'].str.len().gt(20).astype(int) * 25 # Description > 20 chars
companies['data_quality_score'] = scoreSummary
You have completed an extensive lesson on pandas string operations! Here is what you learned:
Key Concepts
The .str Accessor
- Enables vectorized string operations
- Works on entire Series at once
- Much faster than Python loops
Basic Operations
- Case conversion:
.str.lower(),.str.upper(),.str.title() - Whitespace:
.str.strip(),.str.lstrip(),.str.rstrip() - Length:
.str.len() - Replace:
.str.replace()
Extracting and Splitting
- Split:
.str.split(), access parts with.str[index] - Slice:
.str[:n]for first n characters - Extract with regex:
.str.extract()
Searching and Filtering
- Contains:
.str.contains() - Starts and ends:
.str.startswith(),.str.endswith() - Use for filtering DataFrames
Cleaning and Formatting
- Remove special characters with regex
- Normalize whitespace
- Extract numbers from text
- Standardize formats
String operations are essential for real-world data cleaning. The .str accessor makes text manipulation fast and efficient.
Next Steps
You can now clean and manipulate text data efficiently. In the next lesson, you will learn GroupBy and aggregation—the most powerful pandas feature for category-based analysis.
Continue to Lesson 15 - GroupBy & Aggregation
Learn to group data and calculate statistics per category
Back to Lesson 13 - Duplicates & Outliers
Review finding and handling data quality issues
Master Text Data Manipulation
String operations are essential for cleaning real-world data. Use the .str accessor to transform text efficiently without writing loops!