Lesson 2 - DataFrames and Reading Data
From Concepts to Real Data
In the previous lesson, you learned pandas fundamentals—Series and DataFrames. Now you will learn to work with real data from files, which is where data analysis actually begins.
By the end of this lesson, you will be able to:
- Read CSV files into pandas DataFrames
- Write DataFrames back to CSV files
- Load data from JSON files
- Save results to Excel files
- Handle character encoding issues
- Use common parameters to customize file reading
Most real-world data comes in files. This lesson teaches you how to load, explore, and save that data.
Reading CSV Files
CSV (Comma-Separated Values) is the most common format for tabular data. Excel exports to CSV, databases export to CSV, and most data downloads come as CSV files.
Loading a CSV File
The pd.read_csv() function loads CSV files into DataFrames:
import pandas as pd
# Read a CSV file
df = pd.read_csv('employees.csv')
print("Data loaded successfully!")
print(f"Type: {type(df)}")
print(df)Let’s create a sample CSV to practice with:
# Create sample CSV content
csv_content = """name,age,city,salary
Ali,28,Tehran,50000
Sara,32,Isfahan,65000
Reza,25,Shiraz,45000
Maryam,30,Tehran,70000
Hassan,35,Mashhad,60000"""
# Write to file
with open('employees.csv', 'w') as f:
f.write(csv_content)
print("Created employees.csv")Now read it:
# Read the CSV file
df = pd.read_csv('employees.csv')
print(df)Output:
name age city salary
0 Ali 28 Tehran 50000
1 Sara 32 Isfahan 65000
2 Reza 25 Shiraz 45000
3 Maryam 30 Tehran 70000
4 Hassan 35 Mashhad 60000Pandas automatically:
- Detects the delimiter (comma)
- Uses the first row as column names
- Creates a numeric index (0, 1, 2, …)
- Infers data types for each column
Common read_csv() Parameters
Customize how CSV files are loaded:
Different Delimiter:
# Create CSV with semicolon delimiter
with open('employees_semicolon.csv', 'w') as f:
f.write("name;age;city\nAli;28;Tehran\nSara;32;Isfahan")
# Read with semicolon delimiter
df_semi = pd.read_csv('employees_semicolon.csv', sep=';')
print(df_semi)Load Specific Columns:
# Read only name and salary columns
df_subset = pd.read_csv('employees.csv', usecols=['name', 'salary'])
print("Only name and salary columns:")
print(df_subset)Output:
Only name and salary columns:
name salary
0 Ali 50000
1 Sara 65000
2 Reza 45000
3 Maryam 70000
4 Hassan 60000Set Index Column:
# Use 'name' column as the index
df_indexed = pd.read_csv('employees.csv', index_col='name')
print("With 'name' as index:")
print(df_indexed)Output:
With 'name' as index:
age city salary
name
Ali 28 Tehran 50000
Sara 32 Isfahan 65000
Reza 25 Shiraz 45000
Maryam 30 Tehran 70000
Hassan 35 Mashhad 60000Now you can access rows by name directly!
Writing CSV Files
After analyzing data, you often need to save results. Use .to_csv() to write DataFrames to CSV files.
Basic CSV Writing
# Add a new column
df['bonus'] = df['salary'] * 0.1
# Save to CSV (without index)
df.to_csv('employees_with_bonus.csv', index=False)
print("Saved to employees_with_bonus.csv")Always Use index=False
Unless you specifically need the index saved, use index=False to avoid an extra unnamed column in your CSV file.
Verify the save worked:
# Read it back to verify
df_loaded = pd.read_csv('employees_with_bonus.csv')
print("Loaded back:")
print(df_loaded)Custom Writing Parameters
Control what gets saved and how:
Save Specific Columns:
# Save only name and salary columns
df.to_csv('employees_names_salaries.csv',
columns=['name', 'salary'],
index=False)
print("Saved only name and salary columns")Different Delimiter:
# Save with tab delimiter
df.to_csv('employees_tab.csv', sep='\t', index=False)
print("Saved with tab delimiter")Reading JSON Files
JSON (JavaScript Object Notation) is common for web APIs and nested data structures.
Loading JSON Data
import json
# Create sample JSON
products = [
{"name": "Laptop", "price": 1200, "stock": 15},
{"name": "Mouse", "price": 25, "stock": 50},
{"name": "Keyboard", "price": 75, "stock": 30}
]
with open('products.json', 'w') as f:
json.dump(products, f, indent=2)
print("Created products.json")Read JSON into DataFrame:
# Read JSON into DataFrame
df_json = pd.read_json('products.json')
print("JSON loaded into DataFrame:")
print(df_json)Output:
JSON loaded into DataFrame:
name price stock
0 Laptop 1200 15
1 Mouse 25 50
2 Keyboard 75 30Writing to JSON
# Add total value column
df_json['total_value'] = df_json['price'] * df_json['stock']
# Save to JSON
df_json.to_json('products_with_value.json',
orient='records',
indent=2)
print("Saved to products_with_value.json")The orient='records' parameter creates a list of dictionaries, and indent=2 makes it readable.
Reading Excel Files
Excel files are common in business environments. Pandas can read and write Excel files.
Requires openpyxl
Install the openpyxl library first: pip install openpyxl
Loading Excel Files
# Create sample data
sales_data = pd.DataFrame({
'month': ['Jan', 'Feb', 'Mar', 'Apr'],
'revenue': [50000, 55000, 52000, 60000],
'expenses': [30000, 32000, 31000, 35000]
})
# Save to Excel
sales_data.to_excel('sales.xlsx', index=False, sheet_name='Q1')
print("Created sales.xlsx")
# Read Excel file
df_excel = pd.read_excel('sales.xlsx', sheet_name='Q1')
print("\nLoaded from Excel:")
print(df_excel)Output:
Loaded from Excel:
month revenue expenses
0 Jan 50000 30000
1 Feb 55000 32000
2 Mar 52000 31000
3 Apr 60000 35000You can read specific sheets by name or by index (0-based).
Handling Encoding Issues
Character encoding problems occur frequently, especially with non-English text.
Understanding Encoding
Different languages and systems use different character encodings. If you see strange characters or errors when loading a file, you likely have an encoding mismatch.
Common encodings:
- UTF-8: Universal standard (use this first)
- latin-1: Western European languages
- cp1256: Arabic and Persian (Windows)
- iso-8859-1: Western European (older systems)
Reading Files with Specific Encoding
# Create file with non-English text
data_content = """name,city
Ali,Tehran
Sara,Isfahan
Reza,Shiraz"""
# Save with UTF-8 encoding
with open('data.csv', 'w', encoding='utf-8') as f:
f.write(data_content)
# Read with explicit encoding
df_data = pd.read_csv('data.csv', encoding='utf-8')
print("Data loaded:")
print(df_data)Troubleshooting Encoding Errors
If you get a UnicodeDecodeError, try these encodings:
# Try different encodings until one works
encodings_to_try = ['utf-8', 'latin-1', 'cp1256', 'iso-8859-1']
for enc in encodings_to_try:
try:
df = pd.read_csv('problematic_file.csv', encoding=enc)
print(f"Success with encoding: {enc}")
break
except UnicodeDecodeError:
print(f"Failed with: {enc}")
continueEncoding Best Practice
Always save your files with UTF-8 encoding to avoid problems. When loading files from others, try UTF-8 first, then latin-1 if that fails.
Common read_csv() Parameters Reference
Here are the most useful parameters:
| Parameter | Purpose | Example |
|---|---|---|
sep | Column delimiter | sep=';' or sep='\t' |
usecols | Load specific columns only | usecols=['name', 'age'] |
index_col | Set index column | index_col='id' |
encoding | Character encoding | encoding='utf-8' |
header | Row number for column names | header=0 (default) |
names | Provide column names | names=['col1', 'col2'] |
skiprows | Skip rows | skiprows=5 |
nrows | Read only n rows | nrows=1000 |
na_values | Additional NA values | na_values=['N/A', 'missing'] |
Examples:
# Read tab-delimited file
df = pd.read_csv('data.txt', sep='\t')
# Read only first 1000 rows
df = pd.read_csv('large_file.csv', nrows=1000)
# Skip first 3 rows
df = pd.read_csv('data.csv', skiprows=3)
# Provide custom column names
df = pd.read_csv('data.csv',
names=['name', 'age', 'salary'],
header=0)Practice Exercises
Apply what you learned with these exercises.
Exercise 1: Student Data CSV
- Create a CSV file with student data (name, subject, score)
- Read it into a DataFrame
- Add a ‘grade’ column (A if score >= 90, B if >= 80, else C)
- Save to a new CSV file
# Your code hereHint
Use a lambda function with .apply() or np.where() to create the grade column.
Exercise 2: Product Inventory JSON
- Create a list of products with name, quantity, and price
- Save as a JSON file
- Read back into DataFrame
- Calculate total inventory value
- Save result to CSV
# Your code hereExercise 3: Encoding Practice
- Create a CSV file with your native language text
- Save it with UTF-8 encoding
- Read it back and verify it displays correctly
- Try reading with a wrong encoding to see the error
# Your code hereSummary
You now know how to load and save data in pandas. Let’s review the key concepts.
Key Concepts
Reading CSV Files
# Basic
df = pd.read_csv('file.csv')
# With common parameters
df = pd.read_csv('file.csv',
sep=';', # Custom delimiter
usecols=['col1', 'col2'], # Specific columns
index_col='name', # Set index column
encoding='utf-8') # Handle encodingWriting CSV Files
# Basic (always use index=False)
df.to_csv('output.csv', index=False)
# With parameters
df.to_csv('output.csv',
index=False, # Don't save index
columns=['col1', 'col2'], # Specific columns
sep='\t') # Tab delimiterReading/Writing JSON
# Read JSON
df = pd.read_json('file.json')
# Write JSON
df.to_json('output.json',
orient='records', # Format as list of records
indent=2) # Pretty printReading/Writing Excel
# Read Excel (requires openpyxl)
df = pd.read_excel('file.xlsx', sheet_name='Sheet1')
# Write Excel
df.to_excel('output.xlsx',
sheet_name='Data',
index=False)Important Reminders
- Always start with file I/O: You cannot analyze data you cannot load
- CSV is most common: Master
pd.read_csv()anddf.to_csv() - Use
index=Falsewhen saving to avoid extra index columns - Specify encoding for non-English text
- JSON for nested data: Common in web APIs
- Excel needs openpyxl: Install it separately
Common Workflow
# 1. Load data
df = pd.read_csv('input.csv')
# 2. Analyze and transform
df['new_column'] = df['old_column'] * 2
# 3. Save results
df.to_csv('output.csv', index=False)This three-step pattern appears in almost every data analysis project.
Next Steps
You can now load data from files into pandas. In the next lesson, you will learn to select specific rows and columns using label-based indexing with .loc[].
Continue to Lesson 3 - Selecting with .loc[]
Master label-based selection to access exactly the data you need
Back to Lesson 1 - Introduction and Series
Review pandas fundamentals and Series objects
Master Data Loading
You have learned the essential skill of loading data into pandas. This is where all analysis begins. With CSV, JSON, and Excel support, you can now work with data from virtually any source.
In the next lesson, you will learn precise data selection techniques to extract exactly the information you need for analysis!