Lesson 2 - DataFrames and Reading Data

From Concepts to Real Data

In the previous lesson, you learned pandas fundamentals—Series and DataFrames. Now you will learn to work with real data from files, which is where data analysis actually begins.

By the end of this lesson, you will be able to:

Read CSV files into pandas DataFrames
Write DataFrames back to CSV files
Load data from JSON files
Save results to Excel files
Handle character encoding issues
Use common parameters to customize file reading

Most real-world data comes in files. This lesson teaches you how to load, explore, and save that data.

Reading CSV Files

CSV (Comma-Separated Values) is the most common format for tabular data. Excel exports to CSV, databases export to CSV, and most data downloads come as CSV files.

Loading a CSV File

The pd.read_csv() function loads CSV files into DataFrames:

import pandas as pd

# Read a CSV file
df = pd.read_csv('employees.csv')

print("Data loaded successfully!")
print(f"Type: {type(df)}")
print(df)

Let’s create a sample CSV to practice with:

# Create sample CSV content
csv_content = """name,age,city,salary
Ali,28,Tehran,50000
Sara,32,Isfahan,65000
Reza,25,Shiraz,45000
Maryam,30,Tehran,70000
Hassan,35,Mashhad,60000"""

# Write to file
with open('employees.csv', 'w') as f:
    f.write(csv_content)

print("Created employees.csv")

Now read it:

# Read the CSV file
df = pd.read_csv('employees.csv')

print(df)

Output:

     name  age     city  salary
0     Ali   28   Tehran   50000
1    Sara   32  Isfahan   65000
2    Reza   25   Shiraz   45000
3  Maryam   30   Tehran   70000
4  Hassan   35  Mashhad   60000

Pandas automatically:

Detects the delimiter (comma)
Uses the first row as column names
Creates a numeric index (0, 1, 2, …)
Infers data types for each column

Common read_csv() Parameters

Customize how CSV files are loaded:

Different Delimiter:

# Create CSV with semicolon delimiter
with open('employees_semicolon.csv', 'w') as f:
    f.write("name;age;city\nAli;28;Tehran\nSara;32;Isfahan")

# Read with semicolon delimiter
df_semi = pd.read_csv('employees_semicolon.csv', sep=';')
print(df_semi)

Load Specific Columns:

# Read only name and salary columns
df_subset = pd.read_csv('employees.csv', usecols=['name', 'salary'])
print("Only name and salary columns:")
print(df_subset)

Output:

Only name and salary columns:
     name  salary
0     Ali   50000
1    Sara   65000
2    Reza   45000
3  Maryam   70000
4  Hassan   60000

Set Index Column:

# Use 'name' column as the index
df_indexed = pd.read_csv('employees.csv', index_col='name')
print("With 'name' as index:")
print(df_indexed)

Output:

With 'name' as index:
        age     city  salary
name
Ali      28   Tehran   50000
Sara     32  Isfahan   65000
Reza     25   Shiraz   45000
Maryam   30   Tehran   70000
Hassan   35  Mashhad   60000

Now you can access rows by name directly!

Writing CSV Files

After analyzing data, you often need to save results. Use .to_csv() to write DataFrames to CSV files.

Basic CSV Writing

# Add a new column
df['bonus'] = df['salary'] * 0.1

# Save to CSV (without index)
df.to_csv('employees_with_bonus.csv', index=False)

print("Saved to employees_with_bonus.csv")

Always Use index=False

Unless you specifically need the index saved, use index=False to avoid an extra unnamed column in your CSV file.

Verify the save worked:

# Read it back to verify
df_loaded = pd.read_csv('employees_with_bonus.csv')
print("Loaded back:")
print(df_loaded)

Custom Writing Parameters

Control what gets saved and how:

Save Specific Columns:

# Save only name and salary columns
df.to_csv('employees_names_salaries.csv',
          columns=['name', 'salary'],
          index=False)
print("Saved only name and salary columns")

Different Delimiter:

# Save with tab delimiter
df.to_csv('employees_tab.csv', sep='\t', index=False)
print("Saved with tab delimiter")

Reading JSON Files

JSON (JavaScript Object Notation) is common for web APIs and nested data structures.

Loading JSON Data

import json

# Create sample JSON
products = [
    {"name": "Laptop", "price": 1200, "stock": 15},
    {"name": "Mouse", "price": 25, "stock": 50},
    {"name": "Keyboard", "price": 75, "stock": 30}
]

with open('products.json', 'w') as f:
    json.dump(products, f, indent=2)

print("Created products.json")

Read JSON into DataFrame:

# Read JSON into DataFrame
df_json = pd.read_json('products.json')

print("JSON loaded into DataFrame:")
print(df_json)

Output:

JSON loaded into DataFrame:
       name  price  stock
0    Laptop   1200     15
1     Mouse     25     50
2  Keyboard     75     30

Writing to JSON

# Add total value column
df_json['total_value'] = df_json['price'] * df_json['stock']

# Save to JSON
df_json.to_json('products_with_value.json',
                orient='records',
                indent=2)

print("Saved to products_with_value.json")

The orient='records' parameter creates a list of dictionaries, and indent=2 makes it readable.

Reading Excel Files

Excel files are common in business environments. Pandas can read and write Excel files.

Requires openpyxl

Install the openpyxl library first: pip install openpyxl

Loading Excel Files

# Create sample data
sales_data = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Mar', 'Apr'],
    'revenue': [50000, 55000, 52000, 60000],
    'expenses': [30000, 32000, 31000, 35000]
})

# Save to Excel
sales_data.to_excel('sales.xlsx', index=False, sheet_name='Q1')
print("Created sales.xlsx")

# Read Excel file
df_excel = pd.read_excel('sales.xlsx', sheet_name='Q1')
print("\nLoaded from Excel:")
print(df_excel)

Output:

Loaded from Excel:
  month  revenue  expenses
0   Jan    50000     30000
1   Feb    55000     32000
2   Mar    52000     31000
3   Apr    60000     35000

You can read specific sheets by name or by index (0-based).

Handling Encoding Issues

Character encoding problems occur frequently, especially with non-English text.

Understanding Encoding

Different languages and systems use different character encodings. If you see strange characters or errors when loading a file, you likely have an encoding mismatch.

Common encodings:

UTF-8: Universal standard (use this first)
latin-1: Western European languages
cp1256: Arabic and Persian (Windows)
iso-8859-1: Western European (older systems)

Reading Files with Specific Encoding

# Create file with non-English text
data_content = """name,city
Ali,Tehran
Sara,Isfahan
Reza,Shiraz"""

# Save with UTF-8 encoding
with open('data.csv', 'w', encoding='utf-8') as f:
    f.write(data_content)

# Read with explicit encoding
df_data = pd.read_csv('data.csv', encoding='utf-8')
print("Data loaded:")
print(df_data)

Troubleshooting Encoding Errors

If you get a UnicodeDecodeError, try these encodings:

# Try different encodings until one works
encodings_to_try = ['utf-8', 'latin-1', 'cp1256', 'iso-8859-1']

for enc in encodings_to_try:
    try:
        df = pd.read_csv('problematic_file.csv', encoding=enc)
        print(f"Success with encoding: {enc}")
        break
    except UnicodeDecodeError:
        print(f"Failed with: {enc}")
        continue

Encoding Best Practice

Always save your files with UTF-8 encoding to avoid problems. When loading files from others, try UTF-8 first, then latin-1 if that fails.

Common read_csv() Parameters Reference

Here are the most useful parameters:

Parameter	Purpose	Example
`sep`	Column delimiter	`sep=';'` or `sep='\t'`
`usecols`	Load specific columns only	`usecols=['name', 'age']`
`index_col`	Set index column	`index_col='id'`
`encoding`	Character encoding	`encoding='utf-8'`
`header`	Row number for column names	`header=0` (default)
`names`	Provide column names	`names=['col1', 'col2']`
`skiprows`	Skip rows	`skiprows=5`
`nrows`	Read only n rows	`nrows=1000`
`na_values`	Additional NA values	`na_values=['N/A', 'missing']`

Examples:

# Read tab-delimited file
df = pd.read_csv('data.txt', sep='\t')

# Read only first 1000 rows
df = pd.read_csv('large_file.csv', nrows=1000)

# Skip first 3 rows
df = pd.read_csv('data.csv', skiprows=3)

# Provide custom column names
df = pd.read_csv('data.csv',
                 names=['name', 'age', 'salary'],
                 header=0)

Practice Exercises

Apply what you learned with these exercises.

Exercise 1: Student Data CSV

Create a CSV file with student data (name, subject, score)
Read it into a DataFrame
Add a ‘grade’ column (A if score >= 90, B if >= 80, else C)
Save to a new CSV file

# Your code here

Hint

Use a lambda function with .apply() or np.where() to create the grade column.

Exercise 2: Product Inventory JSON

Create a list of products with name, quantity, and price
Save as a JSON file
Read back into DataFrame
Calculate total inventory value
Save result to CSV

# Your code here

Exercise 3: Encoding Practice

Create a CSV file with your native language text
Save it with UTF-8 encoding
Read it back and verify it displays correctly
Try reading with a wrong encoding to see the error

# Your code here

Summary

You now know how to load and save data in pandas. Let’s review the key concepts.

Key Concepts

Reading CSV Files

# Basic
df = pd.read_csv('file.csv')

# With common parameters
df = pd.read_csv('file.csv',
                 sep=';',                    # Custom delimiter
                 usecols=['col1', 'col2'],   # Specific columns
                 index_col='name',           # Set index column
                 encoding='utf-8')           # Handle encoding

Writing CSV Files

# Basic (always use index=False)
df.to_csv('output.csv', index=False)

# With parameters
df.to_csv('output.csv',
          index=False,                  # Don't save index
          columns=['col1', 'col2'],     # Specific columns
          sep='\t')                     # Tab delimiter

Reading/Writing JSON

# Read JSON
df = pd.read_json('file.json')

# Write JSON
df.to_json('output.json',
           orient='records',    # Format as list of records
           indent=2)            # Pretty print

Reading/Writing Excel

# Read Excel (requires openpyxl)
df = pd.read_excel('file.xlsx', sheet_name='Sheet1')

# Write Excel
df.to_excel('output.xlsx',
            sheet_name='Data',
            index=False)

Important Reminders

Always start with file I/O: You cannot analyze data you cannot load
CSV is most common: Master pd.read_csv() and df.to_csv()
Use index=False when saving to avoid extra index columns
Specify encoding for non-English text
JSON for nested data: Common in web APIs
Excel needs openpyxl: Install it separately

Common Workflow

# 1. Load data
df = pd.read_csv('input.csv')

# 2. Analyze and transform
df['new_column'] = df['old_column'] * 2

# 3. Save results
df.to_csv('output.csv', index=False)

This three-step pattern appears in almost every data analysis project.

Next Steps

You can now load data from files into pandas. In the next lesson, you will learn to select specific rows and columns using label-based indexing with .loc[].

Continue to Lesson 3 - Selecting with .loc[]

Master label-based selection to access exactly the data you need

Back to Lesson 1 - Introduction and Series

Review pandas fundamentals and Series objects

Master Data Loading

You have learned the essential skill of loading data into pandas. This is where all analysis begins. With CSV, JSON, and Excel support, you can now work with data from virtually any source.

In the next lesson, you will learn precise data selection techniques to extract exactly the information you need for analysis!

Lesson 1 - Introduction to Pandas and Series

Lesson 3 - Selecting Data with .loc[]

Courses

DATATWEETS

Title here

Lesson 2 - DataFrames and Reading Data

From Concepts to Real Data

Reading CSV Files

Loading a CSV File

Common read_csv() Parameters

Writing CSV Files

Basic CSV Writing

Custom Writing Parameters

Reading JSON Files

Loading JSON Data

Writing to JSON

Reading Excel Files

Loading Excel Files

Handling Encoding Issues

Understanding Encoding

Reading Files with Specific Encoding

Troubleshooting Encoding Errors

Common read_csv() Parameters Reference

Practice Exercises

Exercise 1: Student Data CSV

Exercise 2: Product Inventory JSON

Exercise 3: Encoding Practice

Summary

Key Concepts

Important Reminders

Common Workflow

Next Steps

Continue to Lesson 3 - Selecting with .loc[]

Back to Lesson 1 - Introduction and Series

Master Data Loading

Lesson 2 - DataFrames and Reading Data

From Concepts to Real Data#

Reading CSV Files#

Loading a CSV File#

Common read_csv() Parameters#

Writing CSV Files#

Basic CSV Writing#

Custom Writing Parameters#

Reading JSON Files#

Loading JSON Data#

Writing to JSON#

Reading Excel Files#

Loading Excel Files#

Handling Encoding Issues#

Understanding Encoding#

Reading Files with Specific Encoding#

Troubleshooting Encoding Errors#

Common read_csv() Parameters Reference#

Practice Exercises#

Exercise 1: Student Data CSV#

Exercise 2: Product Inventory JSON#

Exercise 3: Encoding Practice#

Summary#

Key Concepts#

Important Reminders#

Common Workflow#

Next Steps#

Continue to Lesson 3 - Selecting with .loc[]

Back to Lesson 1 - Introduction and Series

Master Data Loading#

From Concepts to Real Data

Reading CSV Files

Loading a CSV File

Common read_csv() Parameters

Writing CSV Files

Basic CSV Writing

Custom Writing Parameters

Reading JSON Files

Loading JSON Data

Writing to JSON

Reading Excel Files

Loading Excel Files

Handling Encoding Issues

Understanding Encoding

Reading Files with Specific Encoding

Troubleshooting Encoding Errors

Common read_csv() Parameters Reference

Practice Exercises

Exercise 1: Student Data CSV

Exercise 2: Product Inventory JSON

Exercise 3: Encoding Practice

Summary

Key Concepts

Important Reminders

Common Workflow

Next Steps

Master Data Loading