Lesson 2 - DataFrames and Reading Data

From Concepts to Real Data

In the previous lesson, you learned pandas fundamentals—Series and DataFrames. Now you will learn to work with real data from files, which is where data analysis actually begins.

By the end of this lesson, you will be able to:

  • Read CSV files into pandas DataFrames
  • Write DataFrames back to CSV files
  • Load data from JSON files
  • Save results to Excel files
  • Handle character encoding issues
  • Use common parameters to customize file reading

Most real-world data comes in files. This lesson teaches you how to load, explore, and save that data.


Reading CSV Files

CSV (Comma-Separated Values) is the most common format for tabular data. Excel exports to CSV, databases export to CSV, and most data downloads come as CSV files.

Loading a CSV File

The pd.read_csv() function loads CSV files into DataFrames:

import pandas as pd

# Read a CSV file
df = pd.read_csv('employees.csv')

print("Data loaded successfully!")
print(f"Type: {type(df)}")
print(df)

Let’s create a sample CSV to practice with:

# Create sample CSV content
csv_content = """name,age,city,salary
Ali,28,Tehran,50000
Sara,32,Isfahan,65000
Reza,25,Shiraz,45000
Maryam,30,Tehran,70000
Hassan,35,Mashhad,60000"""

# Write to file
with open('employees.csv', 'w') as f:
    f.write(csv_content)

print("Created employees.csv")

Now read it:

# Read the CSV file
df = pd.read_csv('employees.csv')

print(df)

Output:

     name  age     city  salary
0     Ali   28   Tehran   50000
1    Sara   32  Isfahan   65000
2    Reza   25   Shiraz   45000
3  Maryam   30   Tehran   70000
4  Hassan   35  Mashhad   60000

Pandas automatically:

  • Detects the delimiter (comma)
  • Uses the first row as column names
  • Creates a numeric index (0, 1, 2, …)
  • Infers data types for each column

Common read_csv() Parameters

Customize how CSV files are loaded:

Different Delimiter:

# Create CSV with semicolon delimiter
with open('employees_semicolon.csv', 'w') as f:
    f.write("name;age;city\nAli;28;Tehran\nSara;32;Isfahan")

# Read with semicolon delimiter
df_semi = pd.read_csv('employees_semicolon.csv', sep=';')
print(df_semi)

Load Specific Columns:

# Read only name and salary columns
df_subset = pd.read_csv('employees.csv', usecols=['name', 'salary'])
print("Only name and salary columns:")
print(df_subset)

Output:

Only name and salary columns:
     name  salary
0     Ali   50000
1    Sara   65000
2    Reza   45000
3  Maryam   70000
4  Hassan   60000

Set Index Column:

# Use 'name' column as the index
df_indexed = pd.read_csv('employees.csv', index_col='name')
print("With 'name' as index:")
print(df_indexed)

Output:

With 'name' as index:
        age     city  salary
name
Ali      28   Tehran   50000
Sara     32  Isfahan   65000
Reza     25   Shiraz   45000
Maryam   30   Tehran   70000
Hassan   35  Mashhad   60000

Now you can access rows by name directly!


Writing CSV Files

After analyzing data, you often need to save results. Use .to_csv() to write DataFrames to CSV files.

Basic CSV Writing

# Add a new column
df['bonus'] = df['salary'] * 0.1

# Save to CSV (without index)
df.to_csv('employees_with_bonus.csv', index=False)

print("Saved to employees_with_bonus.csv")

Always Use index=False

Unless you specifically need the index saved, use index=False to avoid an extra unnamed column in your CSV file.

Verify the save worked:

# Read it back to verify
df_loaded = pd.read_csv('employees_with_bonus.csv')
print("Loaded back:")
print(df_loaded)

Custom Writing Parameters

Control what gets saved and how:

Save Specific Columns:

# Save only name and salary columns
df.to_csv('employees_names_salaries.csv',
          columns=['name', 'salary'],
          index=False)
print("Saved only name and salary columns")

Different Delimiter:

# Save with tab delimiter
df.to_csv('employees_tab.csv', sep='\t', index=False)
print("Saved with tab delimiter")

Reading JSON Files

JSON (JavaScript Object Notation) is common for web APIs and nested data structures.

Loading JSON Data

import json

# Create sample JSON
products = [
    {"name": "Laptop", "price": 1200, "stock": 15},
    {"name": "Mouse", "price": 25, "stock": 50},
    {"name": "Keyboard", "price": 75, "stock": 30}
]

with open('products.json', 'w') as f:
    json.dump(products, f, indent=2)

print("Created products.json")

Read JSON into DataFrame:

# Read JSON into DataFrame
df_json = pd.read_json('products.json')

print("JSON loaded into DataFrame:")
print(df_json)

Output:

JSON loaded into DataFrame:
       name  price  stock
0    Laptop   1200     15
1     Mouse     25     50
2  Keyboard     75     30

Writing to JSON

# Add total value column
df_json['total_value'] = df_json['price'] * df_json['stock']

# Save to JSON
df_json.to_json('products_with_value.json',
                orient='records',
                indent=2)

print("Saved to products_with_value.json")

The orient='records' parameter creates a list of dictionaries, and indent=2 makes it readable.


Reading Excel Files

Excel files are common in business environments. Pandas can read and write Excel files.

Requires openpyxl

Install the openpyxl library first: pip install openpyxl

Loading Excel Files

# Create sample data
sales_data = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Mar', 'Apr'],
    'revenue': [50000, 55000, 52000, 60000],
    'expenses': [30000, 32000, 31000, 35000]
})

# Save to Excel
sales_data.to_excel('sales.xlsx', index=False, sheet_name='Q1')
print("Created sales.xlsx")

# Read Excel file
df_excel = pd.read_excel('sales.xlsx', sheet_name='Q1')
print("\nLoaded from Excel:")
print(df_excel)

Output:

Loaded from Excel:
  month  revenue  expenses
0   Jan    50000     30000
1   Feb    55000     32000
2   Mar    52000     31000
3   Apr    60000     35000

You can read specific sheets by name or by index (0-based).


Handling Encoding Issues

Character encoding problems occur frequently, especially with non-English text.

Understanding Encoding

Different languages and systems use different character encodings. If you see strange characters or errors when loading a file, you likely have an encoding mismatch.

Common encodings:

  • UTF-8: Universal standard (use this first)
  • latin-1: Western European languages
  • cp1256: Arabic and Persian (Windows)
  • iso-8859-1: Western European (older systems)

Reading Files with Specific Encoding

# Create file with non-English text
data_content = """name,city
Ali,Tehran
Sara,Isfahan
Reza,Shiraz"""

# Save with UTF-8 encoding
with open('data.csv', 'w', encoding='utf-8') as f:
    f.write(data_content)

# Read with explicit encoding
df_data = pd.read_csv('data.csv', encoding='utf-8')
print("Data loaded:")
print(df_data)

Troubleshooting Encoding Errors

If you get a UnicodeDecodeError, try these encodings:

# Try different encodings until one works
encodings_to_try = ['utf-8', 'latin-1', 'cp1256', 'iso-8859-1']

for enc in encodings_to_try:
    try:
        df = pd.read_csv('problematic_file.csv', encoding=enc)
        print(f"Success with encoding: {enc}")
        break
    except UnicodeDecodeError:
        print(f"Failed with: {enc}")
        continue

Encoding Best Practice

Always save your files with UTF-8 encoding to avoid problems. When loading files from others, try UTF-8 first, then latin-1 if that fails.


Common read_csv() Parameters Reference

Here are the most useful parameters:

ParameterPurposeExample
sepColumn delimitersep=';' or sep='\t'
usecolsLoad specific columns onlyusecols=['name', 'age']
index_colSet index columnindex_col='id'
encodingCharacter encodingencoding='utf-8'
headerRow number for column namesheader=0 (default)
namesProvide column namesnames=['col1', 'col2']
skiprowsSkip rowsskiprows=5
nrowsRead only n rowsnrows=1000
na_valuesAdditional NA valuesna_values=['N/A', 'missing']

Examples:

# Read tab-delimited file
df = pd.read_csv('data.txt', sep='\t')

# Read only first 1000 rows
df = pd.read_csv('large_file.csv', nrows=1000)

# Skip first 3 rows
df = pd.read_csv('data.csv', skiprows=3)

# Provide custom column names
df = pd.read_csv('data.csv',
                 names=['name', 'age', 'salary'],
                 header=0)

Practice Exercises

Apply what you learned with these exercises.

Exercise 1: Student Data CSV

  1. Create a CSV file with student data (name, subject, score)
  2. Read it into a DataFrame
  3. Add a ‘grade’ column (A if score >= 90, B if >= 80, else C)
  4. Save to a new CSV file
# Your code here

Hint

Use a lambda function with .apply() or np.where() to create the grade column.

Exercise 2: Product Inventory JSON

  1. Create a list of products with name, quantity, and price
  2. Save as a JSON file
  3. Read back into DataFrame
  4. Calculate total inventory value
  5. Save result to CSV
# Your code here

Exercise 3: Encoding Practice

  1. Create a CSV file with your native language text
  2. Save it with UTF-8 encoding
  3. Read it back and verify it displays correctly
  4. Try reading with a wrong encoding to see the error
# Your code here

Summary

You now know how to load and save data in pandas. Let’s review the key concepts.

Key Concepts

Reading CSV Files

# Basic
df = pd.read_csv('file.csv')

# With common parameters
df = pd.read_csv('file.csv',
                 sep=';',                    # Custom delimiter
                 usecols=['col1', 'col2'],   # Specific columns
                 index_col='name',           # Set index column
                 encoding='utf-8')           # Handle encoding

Writing CSV Files

# Basic (always use index=False)
df.to_csv('output.csv', index=False)

# With parameters
df.to_csv('output.csv',
          index=False,                  # Don't save index
          columns=['col1', 'col2'],     # Specific columns
          sep='\t')                     # Tab delimiter

Reading/Writing JSON

# Read JSON
df = pd.read_json('file.json')

# Write JSON
df.to_json('output.json',
           orient='records',    # Format as list of records
           indent=2)            # Pretty print

Reading/Writing Excel

# Read Excel (requires openpyxl)
df = pd.read_excel('file.xlsx', sheet_name='Sheet1')

# Write Excel
df.to_excel('output.xlsx',
            sheet_name='Data',
            index=False)

Important Reminders

  1. Always start with file I/O: You cannot analyze data you cannot load
  2. CSV is most common: Master pd.read_csv() and df.to_csv()
  3. Use index=False when saving to avoid extra index columns
  4. Specify encoding for non-English text
  5. JSON for nested data: Common in web APIs
  6. Excel needs openpyxl: Install it separately

Common Workflow

# 1. Load data
df = pd.read_csv('input.csv')

# 2. Analyze and transform
df['new_column'] = df['old_column'] * 2

# 3. Save results
df.to_csv('output.csv', index=False)

This three-step pattern appears in almost every data analysis project.


Next Steps

You can now load data from files into pandas. In the next lesson, you will learn to select specific rows and columns using label-based indexing with .loc[].

Continue to Lesson 3 - Selecting with .loc[]

Master label-based selection to access exactly the data you need

Back to Lesson 1 - Introduction and Series

Review pandas fundamentals and Series objects


Master Data Loading

You have learned the essential skill of loading data into pandas. This is where all analysis begins. With CSV, JSON, and Excel support, you can now work with data from virtually any source.

In the next lesson, you will learn precise data selection techniques to extract exactly the information you need for analysis!