Lesson 1 - Introduction to Pandas and Series

Welcome to Pandas

You have mastered NumPy and can work with arrays efficiently. Now you are ready for pandas, one of the most widely used Python libraries for data analysis. Pandas builds on NumPy to provide labeled data structures that make working with real-world datasets intuitive and productive.

By the end of this lesson, you will be able to:

  • Understand what pandas is and why it matters
  • Explain the advantages of pandas over NumPy
  • Create and manipulate Series (one-dimensional labeled arrays)
  • Access Series elements by label and position
  • Perform vectorized operations on Series
  • Create simple DataFrames from dictionaries

Let’s begin your pandas journey.


The Problem with Pure NumPy

Imagine you need to work with employee data:

James  28    50000    New York
Maria  32    65000    London
David  25    45000    Paris

NumPy challenges:

  1. No column names: Is column 2 age or salary? You must remember the structure
  2. Single data type: An array has one dtype, so mixing strings (names) and numbers (age) forces every value into a string or generic object type
  3. No row labels: Cannot say “get James’s salary,” only “get row 0, column 2”
  4. Low-level operations: Need many lines of code for simple analytical tasks

These limitations make plain NumPy arrays a poor fit for typical tabular data analysis workflows.
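
The single-data-type limitation can be seen in a short sketch built from the first employee row above. The variable name row is only illustrative.

```python
import numpy as np

# One employee row with a name, an age, and a salary.
# NumPy must pick a single dtype that can hold every value,
# so the numbers are converted to strings.
row = np.array(['James', 28, 50000])

print(row.dtype.kind)   # 'U' indicates a Unicode string dtype
print(row[1])           # the age is now the text '28', not a number
```

Because the age is stored as text, arithmetic such as row[1] + 1 no longer works without converting the value back to a number first.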


Enter Pandas

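The name pandas is derived from “panel data,” an econometrics term for multidimensional structured datasets, and is also a play on “Python data analysis.” Built on top of NumPy, pandas adds the features you need for real data analysis.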

What pandas provides:

  • Column names: Call columns by name—df['salary'] instead of arr[:, 2]
  • Row labels: Access data by meaningful identifiers
  • Mixed data types: Strings, numbers, dates, categories in one table
  • High-level methods: Powerful operations in single lines of code
  • File I/O: Read and write CSV, Excel, JSON, SQL databases easily

Key insight: Most of what you learned in NumPy transfers to pandas. Pandas stores much of its data in NumPy arrays, so vectorized operations, Boolean masks, and aggregations work in familiar ways.

Import Pandas

The standard convention is to import pandas as pd:

import pandas as pd
import numpy as np

print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

You will use pd to access all pandas functions and classes.


Series: The Building Block

A Series is a one-dimensional labeled array. Think of it as:

  • A single column from a spreadsheet
  • A Python dictionary with ordered keys
  • A NumPy array with an index (labels)

Series are the foundation of pandas. Each column in a DataFrame is a Series.

Creating a Series

You can create a Series from a Python list:

# Create a Series from a list (automatic index: 0, 1, 2, ...)
ages = pd.Series([28, 32, 25, 30, 35])

print("Simple Series:")
print(ages)

Output:

Simple Series:
0    28
1    32
2    25
3    30
4    35
dtype: int64

The left column shows the index (automatically 0, 1, 2, …), and the right column shows the values.

Creating a Series with Custom Labels

The real power comes from custom index labels:

# Create Series with custom labels (index)
ages = pd.Series(
    [28, 32, 25, 30, 35],
    index=['James', 'Maria', 'David', 'Anna', 'Michael']
)

print("Series with custom index:")
print(ages)

Output:

Series with custom index:
James      28
Maria      32
David      25
Anna       30
Michael    35
dtype: int64

Now you can access data by name instead of position!

Creating a Series from a Dictionary

Dictionaries naturally map to Series—keys become the index:

# Create Series from a dictionary (keys become index)
salaries = pd.Series({
    'James': 50000,
    'Maria': 65000,
    'David': 45000,
    'Anna': 70000,
    'Michael': 60000
})

print("Series from dictionary:")
print(salaries)

Output:

Series from dictionary:
James      50000
Maria      65000
David      45000
Anna       70000
Michael    60000
dtype: int64
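
A Series can also be built from a NumPy array, which connects directly to what you already know. The following is a minimal sketch; the names age_array and ages_from_array are illustrative.

```python
import numpy as np
import pandas as pd

# Start from a plain NumPy array and attach labels when building the Series
age_array = np.array([28, 32, 25])
ages_from_array = pd.Series(age_array, index=['James', 'Maria', 'David'])

print(ages_from_array)
print(ages_from_array['Maria'])   # label access works the same way
```

The index list must have the same length as the array, otherwise pandas raises an error.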

Series Attributes

Series have several useful attributes:

print(f"Values (NumPy array): {salaries.values}")
print(f"Index (labels): {salaries.index.tolist()}")
print(f"Data type: {salaries.dtype}")
print(f"Shape: {salaries.shape}")
print(f"Size: {salaries.size}")
print(f"Name: {salaries.name}")

Output:

Values (NumPy array): [50000 65000 45000 70000 60000]
Index (labels): ['James', 'Maria', 'David', 'Anna', 'Michael']
Data type: int64
Shape: (5,)
Size: 5
Name: None

You can give a Series a descriptive name:

salaries.name = 'Monthly Salary'
print(salaries)

Output:

James      50000
Maria      65000
David      45000
Anna       70000
Michael    60000
Name: Monthly Salary, dtype: int64

The name appears at the bottom and helps document what the Series represents.


Accessing Series Elements

Series support two ways of accessing elements:

  1. By label (index name) — unique to pandas
  2. By position (like NumPy arrays)

Access by Label

This is the pandas way—use meaningful names:

# Access by label
print(f"James's salary: ${salaries['James']:,}")
print(f"Maria's salary: ${salaries['Maria']:,}")

Output:

James's salary: $50,000
Maria's salary: $65,000

Access multiple labels using a list:

# Access multiple labels
print("James and David's salaries:")
print(salaries[['James', 'David']])

Output:

James and David's salaries:
James    50000
David    45000
Name: Monthly Salary, dtype: int64

Access by Position

You can also use integer positions with .iloc[]:

# Access by position (like NumPy)
print(f"First salary: ${salaries.iloc[0]:,}")
print(f"Last salary: ${salaries.iloc[-1]:,}")

Output:

First salary: $50,000
Last salary: $60,000

Slice by position:

# Slice by position
print("First 3 salaries:")
print(salaries.iloc[0:3])

Output:

First 3 salaries:
James    50000
Maria    65000
David    45000
Name: Monthly Salary, dtype: int64

When to Use Which

Use label-based indexing (series['label']) when you know the label. Use position-based indexing (series.iloc[0]) when you care about position rather than identity.
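
The distinction matters most when the labels are themselves integers. The sketch below uses a hypothetical Series named stock whose integer labels do not match the positions. It also uses .loc[], the explicit label-based counterpart to .iloc[].

```python
import pandas as pd

# Hypothetical Series whose labels are integers that differ from positions
stock = pd.Series([5, 8, 12], index=[10, 20, 30])

print(stock.loc[20])    # label 20 -> 8
print(stock.iloc[1])    # position 1 -> 8
print(stock.iloc[0])    # position 0 -> 5 (there is no label 0 here)
```

Writing .loc[] for labels and .iloc[] for positions keeps the intent clear even when labels and positions could be confused.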


Basic Operations on Series

Series support vectorized operations just like NumPy arrays:

Mathematical Operations

# Mathematical operations (vectorized!)
bonuses = salaries * 0.1  # 10% bonus

print("10% bonuses:")
print(bonuses)

Output:

10% bonuses:
James      5000.0
Maria      6500.0
David      4500.0
Anna       7000.0
Michael    6000.0
Name: Monthly Salary, dtype: float64

Combining Series

Series align by index automatically when combined:

# Combine Series (element-wise, aligned by index)
total_compensation = salaries + bonuses
total_compensation.name = 'Total Compensation'

print(total_compensation)

Output:

James      55000.0
Maria      71500.0
David      49500.0
Anna       77000.0
Michael    66000.0
Name: Total Compensation, dtype: float64
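
In the example above both Series share the same labels in the same order, so alignment is easy to miss. A sketch with two hypothetical Series, q1 and q2, whose labels differ shows what alignment does:

```python
import pandas as pd

# Hypothetical quarterly figures: different order, and not the same people
q1 = pd.Series({'James': 100, 'Maria': 120, 'David': 90})
q2 = pd.Series({'David': 95, 'James': 110, 'Anna': 130})

# Values are matched by label, not by position.
# Labels present in only one Series produce NaN.
total = q1 + q2
print(total)

# The add() method with fill_value=0 treats a missing label as 0 instead
total_filled = q1.add(q2, fill_value=0)
print(total_filled)
```

James and David are summed correctly despite the different order, while Maria and Anna become NaN in the first result because each appears in only one Series.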

Aggregation Methods

Calculate statistics across the entire Series:

print(f"Average salary: ${salaries.mean():,.2f}")
print(f"Median salary: ${salaries.median():,.2f}")
print(f"Highest salary: ${salaries.max():,}")
print(f"Lowest salary: ${salaries.min():,}")
print(f"Total payroll: ${salaries.sum():,}")

Output:

Average salary: $58,000.00
Median salary: $60,000.00
Highest salary: $70,000
Lowest salary: $45,000
Total payroll: $290,000

Boolean Filtering

Filter Series using Boolean conditions:

# Boolean filtering (very powerful!)
high_earners = salaries[salaries >= 60000]

print("Employees earning $60,000 or more:")
print(high_earners)

Output:

Employees earning $60,000 or more:
Maria      65000
Anna       70000
Michael    60000
Name: Monthly Salary, dtype: int64

This Boolean indexing works exactly like NumPy, but with labels!
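
Conditions can be combined as in NumPy. The sketch below rebuilds the salaries Series so that it runs on its own; the name mid_range is illustrative.

```python
import pandas as pd

salaries = pd.Series({
    'James': 50000,
    'Maria': 65000,
    'David': 45000,
    'Anna': 70000,
    'Michael': 60000
})

# Wrap each condition in parentheses; & means "and", | means "or"
mid_range = salaries[(salaries >= 50000) & (salaries < 70000)]
print(mid_range)
```

The parentheses are required because & binds more tightly than the comparison operators.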


Introduction to DataFrames

A DataFrame is a collection of Series. Think of it as:

  • Multiple Series with the same index
  • A spreadsheet with row and column labels
  • A SQL table

While a Series is one-dimensional, a DataFrame is two-dimensional.

Creating a DataFrame from a Dictionary

The most common way to create a DataFrame is from a dictionary where keys become column names:

# Create DataFrame from dictionary
# Keys = column names, Values = lists of data
employees = pd.DataFrame({
    'name': ['James', 'Maria', 'David', 'Anna', 'Michael'],
    'age': [28, 32, 25, 30, 35],
    'salary': [50000, 65000, 45000, 70000, 60000],
    'city': ['New York', 'London', 'Paris', 'New York', 'Berlin']
})

print("DataFrame created!")
print(f"Type: {type(employees)}")
print()
print(employees)

Output:

DataFrame created!
Type: <class 'pandas.core.frame.DataFrame'>

      name  age  salary      city
0    James   28   50000  New York
1    Maria   32   65000    London
2    David   25   45000     Paris
3     Anna   30   70000  New York
4  Michael   35   60000    Berlin

Notice:

  • Rows have a default numeric index (0, 1, 2, 3, 4)
  • Columns have names (name, age, salary, city)
  • Different columns have different data types
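
The per-column data types can be inspected with the dtypes attribute. This sketch rebuilds the employees DataFrame so that it runs on its own. How the string columns are reported depends on the pandas version (commonly object), so no output is shown.

```python
import pandas as pd

employees = pd.DataFrame({
    'name': ['James', 'Maria', 'David', 'Anna', 'Michael'],
    'age': [28, 32, 25, 30, 35],
    'salary': [50000, 65000, 45000, 70000, 60000],
    'city': ['New York', 'London', 'Paris', 'New York', 'Berlin']
})

# dtypes returns a Series: column names as the index, data types as values
column_types = employees.dtypes
print(column_types)
```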

DataFrame Attributes

print(f"Shape (rows, columns): {employees.shape}")
print(f"Number of rows: {employees.shape[0]}")
print(f"Number of columns: {employees.shape[1]}")
print(f"Column names: {employees.columns.tolist()}")
print(f"Index: {employees.index.tolist()}")

Output:

Shape (rows, columns): (5, 4)
Number of rows: 5
Number of columns: 4
Column names: ['name', 'age', 'salary', 'city']
Index: [0, 1, 2, 3, 4]

Extracting Columns from DataFrames

Single Column Extraction

Extract one column using single brackets—returns a Series:

# Extract a single column → Returns a Series
names = employees['name']

print(f"Type: {type(names)}")
print(names)

Output:

Type: <class 'pandas.core.series.Series'>
0      James
1      Maria
2      David
3       Anna
4    Michael
Name: name, dtype: object

Multiple Column Extraction

Extract multiple columns using double brackets—returns a DataFrame:

# Extract multiple columns → Returns a DataFrame
subset = employees[['name', 'salary']]

print(f"Type: {type(subset)}")
print(subset)

Output:

Type: <class 'pandas.core.frame.DataFrame'>
      name  salary
0    James   50000
1    Maria   65000
2    David   45000
3     Anna   70000
4  Michael   60000

Key Rule

  • Single brackets df['column'] → Series (1D)
  • Double brackets df[['col1', 'col2']] → DataFrame (2D)

Basic DataFrame Operations

Adding New Columns

Create new columns from calculations:

# Add a new column (using Series operations!)
employees['bonus'] = employees['salary'] * 0.1

print("DataFrame with bonus column:")
print(employees)

Output:

DataFrame with bonus column:
      name  age  salary      city   bonus
0    James   28   50000  New York  5000.0
1    Maria   32   65000    London  6500.0
2    David   25   45000     Paris  4500.0
3     Anna   30   70000  New York  7000.0
4  Michael   35   60000    Berlin  6000.0

Statistical Summary

Get a statistical overview of numeric columns:

print("Numeric columns summary:")
print(employees.describe())

Output:

Numeric columns summary:
             age        salary        bonus
count   5.000000      5.000000     5.000000
mean   30.000000  58000.000000  5800.000000
std     3.807887  10368.220677  1036.822068
min    25.000000  45000.000000  4500.000000
25%    28.000000  50000.000000  5000.000000
50%    30.000000  60000.000000  6000.000000
75%    32.000000  65000.000000  6500.000000
max    35.000000  70000.000000  7000.000000

This provides count, mean, standard deviation, min, quartiles, and max for each numeric column.
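
The std row is the sample standard deviation (dividing by n - 1), which differs from NumPy's default of dividing by n. A small sketch comparing the two on the salary values, with illustrative variable names:

```python
import numpy as np
import pandas as pd

salary = pd.Series([50000, 65000, 45000, 70000, 60000])

# pandas default: sample standard deviation (ddof=1)
sample_std = salary.std()

# NumPy default: population standard deviation (ddof=0)
population_std = np.std(salary.to_numpy())

print(f"Sample std (pandas default): {sample_std:.2f}")
print(f"Population std (NumPy default): {population_std:.2f}")
```

Passing ddof=0 to salary.std() or ddof=1 to np.std() makes the two results agree.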


Series vs DataFrame: Visual Comparison

Series (1D)

Index    Values
James    50000
Maria    65000
David    45000

A Series has one index and one set of values.

DataFrame (2D)

Index    name     age    salary
0        James    28     50000
1        Maria    32     65000
2        David    25     45000

A DataFrame has one index (rows) and multiple columns. Each column is itself a Series.

Code Examples

# Create a Series
temperatures = pd.Series(
    [28, 32, 30, 29, 31],
    index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri'],
    name='Temperature (°C)'
)

print("Series:")
print(temperatures)
print(f"\nShape: {temperatures.shape}")
print(f"Dimensions: {temperatures.ndim}D")

Output:

Series:
Mon    28
Tue    32
Wed    30
Thu    29
Fri    31
Name: Temperature (°C), dtype: int64

Shape: (5,)
Dimensions: 1D

DataFrame example:

# Create a DataFrame with multiple weather metrics
weather = pd.DataFrame({
    'temperature': [28, 32, 30, 29, 31],
    'humidity': [65, 70, 68, 72, 66],
    'wind_speed': [15, 12, 18, 10, 14]
}, index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri'])

print("DataFrame:")
print(weather)
print(f"\nShape: {weather.shape}")
print(f"Dimensions: {weather.ndim}D")

Output:

DataFrame:
     temperature  humidity  wind_speed
Mon           28        65          15
Tue           32        70          12
Wed           30        68          18
Thu           29        72          10
Fri           31        66          14

Shape: (5, 3)
Dimensions: 2D

Practice Exercises

Apply what you have learned with these exercises.

Exercise 1: Create a Series

Create a Series of test scores for 5 students:

  • James: 85
  • Maria: 92
  • David: 78
  • Anna: 95
  • Michael: 88

Then:

  1. Print the Series with the name “Math Test Scores”
  2. Calculate the average score
  3. Find students who scored above 85

# Your code here

Solution

scores = pd.Series(
    [85, 92, 78, 95, 88],
    index=['James', 'Maria', 'David', 'Anna', 'Michael'],
    name='Math Test Scores'
)

# 1. Print the Series
print(scores)

# 2. Average score
print(f"\nAverage score: {scores.mean():.2f}")

# 3. Students above 85
print("\nStudents who scored above 85:")
print(scores[scores > 85])

Exercise 2: Create a DataFrame

Create a DataFrame with product information:

  • Products: Laptop, Mouse, Keyboard, Monitor
  • Prices: 1200, 25, 75, 300
  • Stock: 10, 50, 30, 15

Then:

  1. Display the DataFrame
  2. Add a column ‘total_value’ (price × stock)
  3. Extract just the ‘product’ and ‘total_value’ columns

# Your code here
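
One possible solution is sketched below. The variable names products and summary and the column names price and stock are choices made here. The exercise itself only fixes the names product and total_value.

```python
import pandas as pd

products = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'price': [1200, 25, 75, 300],
    'stock': [10, 50, 30, 15]
})

# 1. Display the DataFrame
print(products)

# 2. Add total_value as price × stock (vectorized, row by row)
products['total_value'] = products['price'] * products['stock']

# 3. Extract two columns; double brackets return a DataFrame
summary = products[['product', 'total_value']]
print(summary)
```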

Exercise 3: Series Operations

Create two Series for monthly sales in two cities:

  • New York: [100, 120, 115, 130, 125]
  • London: [90, 95, 100, 105, 110]
  • Months: Jan, Feb, Mar, Apr, May

Then:

  1. Calculate total sales for both cities combined
  2. Find the difference (New York - London) for each month
  3. Determine which city had higher sales in March

# Your code here
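
One possible solution is sketched below. It reads “total sales for both cities combined” as the per-month sum of the two Series and also prints the overall total. All variable names are illustrative.

```python
import pandas as pd

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May']
new_york = pd.Series([100, 120, 115, 130, 125], index=months, name='New York')
london = pd.Series([90, 95, 100, 105, 110], index=months, name='London')

# 1. Combined sales per month (aligned by month label) and overall
combined = new_york + london
print(combined)
print(f"Overall total: {combined.sum()}")

# 2. Difference (New York - London) for each month
difference = new_york - london
print(difference)

# 3. Which city had higher sales in March
march_winner = 'New York' if new_york['Mar'] > london['Mar'] else 'London'
print(f"Higher sales in March: {march_winner}")
```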

Summary

Congratulations! You have taken your first steps into pandas. Let’s review what you learned.

Key Concepts

Pandas Purpose

  • Builds on NumPy, adding labels and mixed data types
  • Essential tool for data analysis in Python
  • Handles real-world data better than NumPy arrays
  • Provides high-level operations for common tasks

Series (1D)

  • One-dimensional labeled array
  • Like a single column or Python dictionary
  • Created from lists, dictionaries, or NumPy arrays
  • Access by label or position
  • Supports vectorized operations and aggregations

DataFrame (2D)

  • Collection of Series sharing the same index
  • Like a spreadsheet or SQL table
  • Created from a dictionary of lists
  • Each column is a Series
  • Can have mixed data types across columns

Key Syntax Reference

# Create Series
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'], name='My Series')

# Create DataFrame
df = pd.DataFrame({
    'col1': [1, 2, 3],
    'col2': ['a', 'b', 'c']
})

# Access Series element
s['a']          # By label
s.iloc[0]       # By position

# Extract DataFrame column
df['col1']             # Series (1D)
df[['col1', 'col2']]   # DataFrame (2D)

# Series operations
s.mean()        # Average
s.max()         # Maximum
s[s > 5]        # Boolean filtering

Series vs DataFrame Comparison

Feature         Series           DataFrame
Dimensions      1D               2D
Shape           (n,)             (rows, cols)
Analogous to    Single column    Spreadsheet
Index           One index        Row index
Columns         Has a name       Multiple column names
Create from     List, dict       Dict of lists
Data types      One dtype        Different dtype per column

NumPy to Pandas Translation

NumPy                 Pandas
1D array              Series
2D array              DataFrame
Numeric index only    Named index (labels)
arr[5]                s['label'] or s.iloc[5]
Single dtype          Mixed dtypes per column
No column names       Named columns
arr[arr > 5]          s[s > 5] (with labels!)

Important Reminders

  • Pandas builds on NumPy: Everything you learned about vectorization applies
  • Labels are powerful: They make code self-documenting
  • Single vs double brackets: Remember the difference for DataFrame column selection
  • Index alignment: When combining Series, pandas aligns by index automatically

Next Steps

You now understand pandas fundamentals and can work with Series and basic DataFrames. In the next lesson, you will learn to load real datasets from files and explore DataFrames in depth.



Begin Your Pandas Journey

You have learned the foundation of pandas—Series and DataFrames. These labeled data structures will transform how you work with data. Every pandas operation builds on these fundamentals.

In the next lesson, you will load real datasets and learn the essential methods for exploring and understanding data. Keep building your skills!