Lesson 1 - Introduction to Pandas and Series
Welcome to Pandas
You have mastered NumPy and can work with arrays efficiently. Now you are ready for pandas, the most widely used Python library for tabular data analysis. Pandas builds on NumPy to provide labeled data structures that make working with real-world datasets intuitive and productive.
By the end of this lesson, you will be able to:
- Understand what pandas is and why it matters
- Explain the advantages of pandas over NumPy
- Create and manipulate Series (one-dimensional labeled arrays)
- Access Series elements by label and position
- Perform vectorized operations on Series
- Create simple DataFrames from dictionaries
Let’s begin your pandas journey.
The Problem with Pure NumPy
Imagine you need to work with employee data:
James 28 50000 New York
Maria 32 65000 London
David 25 45000 Paris
NumPy challenges:
- No column names: Is column 2 age or salary? You must remember the structure
- Single data type: Cannot mix strings (names) and numbers (age) in one array
- No row labels: Cannot say “get James’s salary,” only “get row 0, column 2”
- Low-level operations: Need many lines of code for simple analytical tasks
These limitations make plain NumPy arrays awkward for typical tabular data analysis workflows.
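To make the single-data-type limitation concrete, here is a minimal sketch that packs the three rows above into one NumPy array. The variable names are illustrative and not part of the lesson. NumPy converts every element to a string, so the numbers need an explicit conversion before any arithmetic.

```python
import numpy as np

# Hypothetical array holding the three employee rows shown above.
employees = np.array([
    ["James", 28, 50000, "New York"],
    ["Maria", 32, 65000, "London"],
    ["David", 25, 45000, "Paris"],
])

# Mixing text and numbers forces every element to become a string.
print(employees.dtype.kind)  # 'U' stands for a Unicode string dtype
print(employees[0, 2])       # the salary is now text, not a number

# Arithmetic needs an explicit conversion back to integers.
salaries = employees[:, 2].astype(int)
print(salaries.mean())
```

Note also that the salary column is reached with the bare position 2, which is exactly the "you must remember the structure" problem described above.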
Enter Pandas
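The name pandas derives from “panel data”, an econometrics term for multidimensional structured datasets, and is also a play on “Python data analysis”. Built on top of NumPy, pandas adds the features you need for real data analysis.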
What pandas provides:
- Column names: Refer to columns by name, using df['salary'] instead of arr[:, 2]
- Row labels: Access data by meaningful identifiers
- Mixed data types: Strings, numbers, dates, categories in one table
- High-level methods: Powerful operations in single lines of code
- File I/O: Read and write CSV, Excel, JSON, SQL databases easily
Key insight: Everything you learned in NumPy transfers to pandas. Pandas uses NumPy arrays internally, so your NumPy skills enhance your pandas work.
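As a small sketch of how the two libraries fit together, the example below moves between a Series and a NumPy array. The data and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data; the names are assumptions for this sketch.
prices = pd.Series([4.0, 9.0, 16.0], index=["a", "b", "c"])

# to_numpy() exposes the values as a plain NumPy array.
raw = prices.to_numpy()
print(type(raw))

# NumPy functions accept a Series and keep its labels in the result.
roots = np.sqrt(prices)
print(roots)
```

The result of np.sqrt is still a Series with the labels a, b, and c, so you can mix NumPy functions into pandas code without losing the index.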
Import Pandas
The standard convention is to import pandas as pd:
import pandas as pd
import numpy as np
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
You will use pd to access all pandas functions and classes.
Series: The Building Block
A Series is a one-dimensional labeled array. Think of it as:
- A single column from a spreadsheet
- A Python dictionary with ordered keys
- A NumPy array with an index (labels)
Series are the foundation of pandas. Each column in a DataFrame is a Series.
Creating a Series
You can create a Series from a Python list:
# Create a Series from a list (automatic index: 0, 1, 2, ...)
ages = pd.Series([28, 32, 25, 30, 35])
print("Simple Series:")
print(ages)
Output:
Simple Series:
0 28
1 32
2 25
3 30
4 35
dtype: int64
The left column shows the index (automatically 0, 1, 2, …), and the right column shows the values.
Creating a Series with Custom Labels
The real power comes from custom index labels:
# Create Series with custom labels (index)
ages = pd.Series(
[28, 32, 25, 30, 35],
index=['James', 'Maria', 'David', 'Anna', 'Michael']
)
print("Series with custom index:")
print(ages)
Output:
Series with custom index:
James 28
Maria 32
David 25
Anna 30
Michael 35
dtype: int64
Now you can access data by name instead of position!
Creating a Series from a Dictionary
Dictionaries naturally map to Series—keys become the index:
# Create Series from a dictionary (keys become index)
salaries = pd.Series({
'James': 50000,
'Maria': 65000,
'David': 45000,
'Anna': 70000,
'Michael': 60000
})
print("Series from dictionary:")
print(salaries)
Output:
Series from dictionary:
James 50000
Maria 65000
David 45000
Anna 70000
Michael 60000
dtype: int64
Series Attributes
Series have several useful attributes:
print(f"Values (NumPy array): {salaries.values}")
print(f"Index (labels): {salaries.index.tolist()}")
print(f"Data type: {salaries.dtype}")
print(f"Shape: {salaries.shape}")
print(f"Size: {salaries.size}")
print(f"Name: {salaries.name}")
Output:
Values (NumPy array): [50000 65000 45000 70000 60000]
Index (labels): ['James', 'Maria', 'David', 'Anna', 'Michael']
Data type: int64
Shape: (5,)
Size: 5
Name: None
You can give a Series a descriptive name:
salaries.name = 'Monthly Salary'
print(salaries)
Output:
James 50000
Maria 65000
David 45000
Anna 70000
Michael 60000
Name: Monthly Salary, dtype: int64
The name appears at the bottom and helps document what the Series represents.
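The name is more than a printed note. As a brief sketch with a shortened version of the salary data, converting a named Series to a one-column DataFrame with to_frame() reuses the name as the column label.

```python
import pandas as pd

# Shortened version of the lesson's salary data.
salaries = pd.Series(
    {"James": 50000, "Maria": 65000},
    name="Monthly Salary",
)

# to_frame() converts the Series into a one-column DataFrame;
# the Series name is used as the column label.
table = salaries.to_frame()
print(table.columns.tolist())
```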
Accessing Series Elements
Series support two ways of accessing elements:
- By label (index name) — unique to pandas
- By position (like NumPy arrays)
Access by Label
This is the pandas way—use meaningful names:
# Access by label
print(f"James's salary: ${salaries['James']:,}")
print(f"Maria's salary: ${salaries['Maria']:,}")
Output:
James's salary: $50,000
Maria's salary: $65,000
Access multiple labels using a list:
# Access multiple labels
print("James and David's salaries:")
print(salaries[['James', 'David']])
Output:
James and David's salaries:
James 50000
David 45000
Name: Monthly Salary, dtype: int64
Access by Position
You can also use integer positions with .iloc[]:
# Access by position (like NumPy)
print(f"First salary: ${salaries.iloc[0]:,}")
print(f"Last salary: ${salaries.iloc[-1]:,}")
Output:
First salary: $50,000
Last salary: $60,000
Slice by position:
# Slice by position
print("First 3 salaries:")
print(salaries.iloc[0:3])
Output:
First 3 salaries:
James 50000
Maria 65000
David 45000
Name: Monthly Salary, dtype: int64
When to Use Which
Use label-based indexing (series['label']) when you know the label. Use position-based indexing (series.iloc[0]) when you care about position rather than identity.
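The distinction matters most when the labels are themselves integers. The sketch below uses a made-up Series and the explicit .loc[] accessor, the label-based counterpart of .iloc[] that this lesson does not otherwise cover, to show that a label and a position with the same number can point at different rows.

```python
import pandas as pd

# Hypothetical Series whose labels are integers in reverse order.
stock = pd.Series([10, 20, 30], index=[2, 1, 0])

by_label = stock.loc[0]      # the row labeled 0, which is the last row
by_position = stock.iloc[0]  # the first row, whatever its label

print(by_label, by_position)
```

Spelling out .loc[] or .iloc[] keeps the intent unambiguous in cases like this.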
Basic Operations on Series
Series support vectorized operations just like NumPy arrays:
Mathematical Operations
# Mathematical operations (vectorized!)
bonuses = salaries * 0.1 # 10% bonus
print("10% bonuses:")
print(bonuses)
Output:
10% bonuses:
James 5000.0
Maria 6500.0
David 4500.0
Anna 7000.0
Michael 6000.0
Name: Monthly Salary, dtype: float64
Combining Series
Series align by index automatically when combined:
# Combine Series (element-wise, aligned by index)
total_compensation = salaries + bonuses
total_compensation.name = 'Total Compensation'
print(total_compensation)
Output:
James 55000.0
Maria 71500.0
David 49500.0
Anna 77000.0
Michael 66000.0
Name: Total Compensation, dtype: float64
Aggregation Methods
Calculate statistics across the entire Series:
print(f"Average salary: ${salaries.mean():,.2f}")
print(f"Median salary: ${salaries.median():,.2f}")
print(f"Highest salary: ${salaries.max():,}")
print(f"Lowest salary: ${salaries.min():,}")
print(f"Total payroll: ${salaries.sum():,}")
Output:
Average salary: $58,000.00
Median salary: $60,000.00
Highest salary: $70,000
Lowest salary: $45,000
Total payroll: $290,000
Boolean Filtering
Filter Series using Boolean conditions:
# Boolean filtering (very powerful!)
high_earners = salaries[salaries >= 60000]
print("Employees earning $60,000 or more:")
print(high_earners)
Output:
Employees earning $60,000 or more:
Maria 65000
Anna 70000
Michael 60000
Name: Monthly Salary, dtype: int64
This Boolean indexing works exactly like NumPy, but with labels!
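As in NumPy, conditions can be combined with & (and) and | (or), with each condition wrapped in parentheses. The following sketch reuses the lesson's salary figures; the salary band itself is an arbitrary choice for illustration.

```python
import pandas as pd

salaries = pd.Series({
    "James": 50000, "Maria": 65000, "David": 45000,
    "Anna": 70000, "Michael": 60000,
})

# Each comparison yields a Boolean Series with the same labels.
mask = (salaries >= 50000) & (salaries < 70000)
print(mask)

# Use the combined mask to keep only the matching rows.
mid_range = salaries[mask]
print(mid_range)
```

Printing the mask first is a handy way to check a condition before you filter with it.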
Introduction to DataFrames
A DataFrame is a collection of Series. Think of it as:
- Multiple Series with the same index
- A spreadsheet with row and column labels
- A SQL table
While a Series is one-dimensional, a DataFrame is two-dimensional.
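To see the "multiple Series with the same index" idea directly, here is a minimal sketch that builds a DataFrame from two labeled Series. The variable name staff is illustrative.

```python
import pandas as pd

ages = pd.Series({"James": 28, "Maria": 32, "David": 25})
salaries = pd.Series({"James": 50000, "Maria": 65000, "David": 45000})

# A dictionary of Series becomes a DataFrame; keys become column names.
staff = pd.DataFrame({"age": ages, "salary": salaries})
print(staff)

# Every column shares the DataFrame's row index.
print(staff["age"].index.equals(staff.index))
```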
Creating a DataFrame from a Dictionary
The most common way to create a DataFrame is from a dictionary where keys become column names:
# Create DataFrame from dictionary
# Keys = column names, Values = lists of data
employees = pd.DataFrame({
'name': ['James', 'Maria', 'David', 'Anna', 'Michael'],
'age': [28, 32, 25, 30, 35],
'salary': [50000, 65000, 45000, 70000, 60000],
'city': ['New York', 'London', 'Paris', 'New York', 'Berlin']
})
print("DataFrame created!")
print(f"Type: {type(employees)}")
print()
print(employees)
Output:
DataFrame created!
Type: <class 'pandas.core.frame.DataFrame'>
name age salary city
0 James 28 50000 New York
1 Maria 32 65000 London
2 David 25 45000 Paris
3 Anna 30 70000 New York
4 Michael 35 60000 Berlin
Notice:
- Rows have a numeric index (0, 1, 2, 3, 4)
- Columns have names (name, age, salary, city)
- Different columns have different data types
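You can check the per-column data types with the dtypes attribute. The small DataFrame below is a made-up example (the bonus column is added only to show a float type). Text columns are typically reported as object, though the exact name depends on the pandas version.

```python
import pandas as pd

# Illustrative DataFrame with text, integer, and float columns.
employees = pd.DataFrame({
    "name": ["James", "Maria"],
    "age": [28, 32],
    "bonus": [5000.0, 6500.0],
})

# dtypes returns a Series: column names are the index, data types the values.
print(employees.dtypes)
```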
DataFrame Attributes
print(f"Shape (rows, columns): {employees.shape}")
print(f"Number of rows: {employees.shape[0]}")
print(f"Number of columns: {employees.shape[1]}")
print(f"Column names: {employees.columns.tolist()}")
print(f"Index: {employees.index.tolist()}")
Output:
Shape (rows, columns): (5, 4)
Number of rows: 5
Number of columns: 4
Column names: ['name', 'age', 'salary', 'city']
Index: [0, 1, 2, 3, 4]
Extracting Columns from DataFrames
Single Column Extraction
Extract one column using single brackets—returns a Series:
# Extract a single column → Returns a Series
names = employees['name']
print(f"Type: {type(names)}")
print(names)
Output:
Type: <class 'pandas.core.series.Series'>
0 James
1 Maria
2 David
3 Anna
4 Michael
Name: name, dtype: object
Multiple Column Extraction
Extract multiple columns using double brackets—returns a DataFrame:
# Extract multiple columns → Returns a DataFrame
subset = employees[['name', 'salary']]
print(f"Type: {type(subset)}")
print(subset)
Output:
Type: <class 'pandas.core.frame.DataFrame'>
name salary
0 James 50000
1 Maria 65000
2 David 45000
3 Anna 70000
4 Michael 60000
Key Rule
- Single brackets df['column'] → Series (1D)
- Double brackets df[['col1', 'col2']] → DataFrame (2D)
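The rule depends on the brackets, not on how many columns you ask for. As a quick sketch with a throwaway DataFrame, double brackets around a single name still return a DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})

as_series = df["col1"]   # single brackets
as_frame = df[["col1"]]  # double brackets around one name

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The Series has shape (3,), while the one-column DataFrame has shape (3, 1).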
Basic DataFrame Operations
Adding New Columns
Create new columns from calculations:
# Add a new column (using Series operations!)
employees['bonus'] = employees['salary'] * 0.1
print("DataFrame with bonus column:")
print(employees)
Output:
DataFrame with bonus column:
name age salary city bonus
0 James 28 50000 New York 5000.0
1 Maria 32 65000 London 6500.0
2 David 25 45000 Paris 4500.0
3 Anna 30 70000 New York 7000.0
4 Michael 35 60000 Berlin 6000.0
Statistical Summary
Get a statistical overview of numeric columns:
print("Numeric columns summary:")
print(employees.describe())
Output:
Numeric columns summary:
age salary bonus
count 5.000000 5.000000 5.000000
mean 30.000000 58000.000000 5800.000000
std 3.807887 10368.220677 1036.822068
min 25.000000 45000.000000 4500.000000
25% 28.000000 50000.000000 5000.000000
50% 30.000000 60000.000000 6000.000000
75% 32.000000 65000.000000 6500.000000
max 35.000000 70000.000000 7000.000000
This provides count, mean, standard deviation, min, quartiles, and max for each numeric column.
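The summary returned by describe() is itself a DataFrame, so individual statistics can be pulled out of it. The sketch below rebuilds the numeric columns and uses .loc[row, column], a label-based accessor not otherwise covered in this lesson, to read single values.

```python
import pandas as pd

employees = pd.DataFrame({
    "age": [28, 32, 25, 30, 35],
    "salary": [50000, 65000, 45000, 70000, 60000],
})

summary = employees.describe()

# The summary is itself a DataFrame: statistics are row labels,
# so .loc[row, column] picks out a single value.
mean_salary = summary.loc["mean", "salary"]
oldest = summary.loc["max", "age"]
print(mean_salary, oldest)
```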
Series vs DataFrame: Visual Comparison
Series (1D)
Index Values
James 50000
Maria 65000
David 45000
A Series has one index and one set of values.
DataFrame (2D)
Index name age salary
0 James 28 50000
1 Maria 32 65000
2 David 25 45000
A DataFrame has one index (rows) and multiple columns. Each column is itself a Series.
Code Examples
# Create a Series
temperatures = pd.Series(
[28, 32, 30, 29, 31],
index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri'],
name='Temperature (°C)'
)
print("Series:")
print(temperatures)
print(f"\nShape: {temperatures.shape}")
print(f"Dimensions: {temperatures.ndim}D")
Output:
Series:
Mon 28
Tue 32
Wed 30
Thu 29
Fri 31
Name: Temperature (°C), dtype: int64
Shape: (5,)
Dimensions: 1D
DataFrame example:
# Create a DataFrame with multiple weather metrics
weather = pd.DataFrame({
'temperature': [28, 32, 30, 29, 31],
'humidity': [65, 70, 68, 72, 66],
'wind_speed': [15, 12, 18, 10, 14]
}, index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri'])
print("DataFrame:")
print(weather)
print(f"\nShape: {weather.shape}")
print(f"Dimensions: {weather.ndim}D")
Output:
DataFrame:
temperature humidity wind_speed
Mon 28 65 15
Tue 32 70 12
Wed 30 68 18
Thu 29 72 10
Fri 31 66 14
Shape: (5, 3)
Dimensions: 2D
Practice Exercises
Apply what you have learned with these exercises.
Exercise 1: Create a Series
Create a Series of test scores for 5 students:
- James: 85
- Maria: 92
- David: 78
- Anna: 95
- Michael: 88
Then:
- Print the Series with the name “Math Test Scores”
- Calculate the average score
- Find students who scored above 85
# Your code here
Solution
scores = pd.Series(
[85, 92, 78, 95, 88],
index=['James', 'Maria', 'David', 'Anna', 'Michael'],
name='Math Test Scores'
)
# 1. Print the Series
print(scores)
# 2. Average score
print(f"\nAverage score: {scores.mean():.2f}")
# 3. Students above 85
print("\nStudents who scored above 85:")
print(scores[scores > 85])
Exercise 2: Create a DataFrame
Create a DataFrame with product information:
- Products: Laptop, Mouse, Keyboard, Monitor
- Prices: 1200, 25, 75, 300
- Stock: 10, 50, 30, 15
Then:
- Display the DataFrame
- Add a column 'total_value' (price × stock)
- Extract just the 'product' and 'total_value' columns
# Your code here
Exercise 3: Series Operations
Create two Series for monthly sales in two cities:
- New York: [100, 120, 115, 130, 125]
- London: [90, 95, 100, 105, 110]
- Months: Jan, Feb, Mar, Apr, May
Then:
- Calculate total sales for both cities combined
- Find the difference (New York - London) for each month
- Determine which city had higher sales in March
# Your code here
Summary
Congratulations! You have taken your first steps into pandas. Let’s review what you learned.
Key Concepts
Pandas Purpose
- Builds on NumPy, adding labels and mixed data types
- Essential tool for data analysis in Python
- Handles real-world data better than NumPy arrays
- Provides high-level operations for common tasks
Series (1D)
- One-dimensional labeled array
- Like a single column or Python dictionary
- Created from lists, dictionaries, or NumPy arrays
- Access by label or position
- Supports vectorized operations and aggregations
DataFrame (2D)
- Collection of Series sharing the same index
- Like a spreadsheet or SQL table
- Created from dictionary of lists
- Each column is a Series
- Can have mixed data types across columns
Key Syntax Reference
# Create Series
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'], name='My Series')
# Create DataFrame
df = pd.DataFrame({
'col1': [1, 2, 3],
'col2': ['a', 'b', 'c']
})
# Access Series element
s['a'] # By label
s.iloc[0] # By position
# Extract DataFrame column
df['col1'] # Series (1D)
df[['col1', 'col2']] # DataFrame (2D)
# Series operations
s.mean() # Average
s.max() # Maximum
s[s > 5] # Boolean filtering
Series vs DataFrame Comparison
| Feature | Series | DataFrame |
|---|---|---|
| Dimensions | 1D | 2D |
| Shape | (n,) | (rows, cols) |
| Analogous to | Single column | Spreadsheet |
| Index | One index | Row index |
| Columns | Has a name | Multiple column names |
| Create from | List, dict | Dict of lists |
| Data types | One dtype | Different dtype per column |
NumPy to Pandas Translation
| NumPy | Pandas |
|---|---|
| 1D array | Series |
| 2D array | DataFrame |
| Numeric index only | Named index (labels) |
| arr[5] | s['label'] or s.iloc[5] |
| Single dtype | Mixed dtypes per column |
| No column names | Named columns |
| arr[arr > 5] | s[s > 5] (with labels!) |
Important Reminders
- Pandas builds on NumPy: Everything you learned about vectorization applies
- Labels are powerful: They make code self-documenting
- Single vs double brackets: Remember the difference for DataFrame column selection
- Index alignment: When combining Series, pandas aligns by index automatically
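The index-alignment reminder is easiest to appreciate when the two Series do not line up. In this sketch the quarterly figures and names are invented for illustration: the labels appear in different orders, and each Series has one label the other lacks.

```python
import pandas as pd

# Hypothetical sales figures with partly different labels.
q1 = pd.Series({"James": 100, "Maria": 120, "David": 90})
q2 = pd.Series({"Maria": 80, "James": 110, "Anna": 95})

# Values are matched by label, not by position.
total = q1 + q2
print(total)

# Labels present in only one Series produce NaN (a missing value).
print(total.isna())
```

James and Maria are summed correctly despite the different ordering, while David and Anna come out as NaN because each appears in only one Series.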
Next Steps
You now understand pandas fundamentals and can work with Series and basic DataFrames. In the next lesson, you will learn to load real datasets from files and explore DataFrames in depth.
Continue to Lesson 2 - DataFrames and Reading Data
Learn to load CSV files and explore DataFrames with pandas methods
Begin Your Pandas Journey
You have learned the foundation of pandas—Series and DataFrames. These labeled data structures will transform how you work with data. Every pandas operation builds on these fundamentals.
In the next lesson, you will load real datasets and learn the essential methods for exploring and understanding data. Keep building your skills!