Lesson 2 - 2D Arrays and Working with CSV Data

Moving Beyond One Dimension

In the previous lesson, you learned about one-dimensional arrays—simple lists of numbers. Most real-world data, however, comes in tables with rows and columns. This lesson introduces you to two-dimensional arrays and shows you how to load actual data from CSV files.

By the end of this lesson, you will be able to:

  • Create and understand two-dimensional arrays (matrices)
  • Use the shape property to understand data dimensions
  • Load CSV files into NumPy arrays
  • Handle header rows when loading data
  • Understand and identify NaN (Not a Number) values
  • Explore real datasets to understand their structure

Real data analytics begins here. Let’s get started.


Understanding Two-Dimensional Arrays

A two-dimensional (2D) array is an array with rows and columns. Think of it like a spreadsheet or a table. This structure is fundamental to data analytics because most datasets are organized this way.

What is a Matrix?

A 2D array is also called a matrix. Each element has two coordinates: its row position and its column position.

Real-world examples of 2D data:

  • Student scores: Rows represent students, columns represent subjects
  • Sales data: Rows represent days, columns represent products
  • Temperature data: Rows represent cities, columns represent months
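As a quick preview (selecting rows and columns is covered in detail in the next lesson), here is a minimal sketch of how a (row, column) pair of coordinates identifies one element:

```python
import numpy as np

# A small 2D array: rows are students, columns are subjects
scores = np.array([[85, 90, 88],
                   [92, 85, 91]])

# Element at row 1, column 2: the second student's third subject
print(scores[1, 2])  # 91
```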

Creating a 2D Array

Here is how you create a basic 2D array:

import numpy as np

# Create a 2D array (3 rows by 3 columns)
data_2d = np.array([[1, 2, 3],
                    [4, 5, 6],
                    [7, 8, 9]])

print("2D Array:")
print(data_2d)

Output:

2D Array:
[[1 2 3]
 [4 5 6]
 [7 8 9]]

Notice the structure:

       Column 0  Column 1  Column 2
Row 0     1         2         3
Row 1     4         5         6
Row 2     7         8         9

Each row is a list enclosed in brackets, and all rows are enclosed together in outer brackets.

Understanding Shape in 2D

The .shape property becomes even more important with 2D arrays:

# Shape tells us: (rows, columns)
print("Shape:", data_2d.shape)
# Output: (3, 3)

# This means 3 rows and 3 columns

# Total number of elements
print("Size:", data_2d.size)
# Output: 9 (3 × 3 = 9)

# Data type
print("Data type:", data_2d.dtype)
# Output: int64

The shape is always reported as (rows, columns). This order matters and follows the standard row-first convention of matrix mathematics.
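With a square array like the one above, you cannot tell which number counts rows and which counts columns. A quick sketch with a non-square array makes the (rows, columns) order visible:

```python
import numpy as np

# 2 rows, 3 columns: shape is (2, 3), not (3, 2)
rect = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(rect.shape)  # (2, 3)
```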

Real Example: Student Scores

Let’s create something more realistic:

# Student scores: 4 students × 3 subjects (Math, Physics, Chemistry)
scores = np.array([
    [85, 90, 88],  # Student 1
    [92, 85, 91],  # Student 2
    [78, 82, 80],  # Student 3
    [88, 87, 92]   # Student 4
])

print("Student Scores:")
print(scores)
print(f"\nShape: {scores.shape} (4 students, 3 subjects)")

Output:

Student Scores:
[[85 90 88]
 [92 85 91]
 [78 82 80]
 [88 87 92]]

Shape: (4, 3) (4 students, 3 subjects)

Visual representation:

           Math  Physics  Chemistry
Student 1   85     90        88
Student 2   92     85        91
Student 3   78     82        80
Student 4   88     87        92

Creating 2D Arrays with Specific Patterns

NumPy provides functions to create 2D arrays filled with zeros or ones:

# 2D array of zeros (3 rows × 4 columns)
zeros_2d = np.zeros((3, 4))
print("Zeros (3×4):")
print(zeros_2d)

# Output:
# [[0. 0. 0. 0.]
#  [0. 0. 0. 0.]
#  [0. 0. 0. 0.]]

# 2D array of ones (2 rows × 5 columns)
ones_2d = np.ones((2, 5))
print("\nOnes (2×5):")
print(ones_2d)

# Output:
# [[1. 1. 1. 1. 1.]
#  [1. 1. 1. 1. 1.]]

Notice that for 2D arrays, you pass a tuple (rows, columns) to specify the shape. The double parentheses are required: np.zeros((3, 4)).


Loading CSV Files

CSV (Comma-Separated Values) is the most common format for sharing data. Excel files, database exports, and data downloads often come as CSV files. NumPy can load CSV files directly into arrays.

Why CSV Files Matter

CSV files are:

  • Universal: Nearly every tool can read and write them
  • Simple: Plain text format that humans can read
  • Portable: Work across different operating systems and programs
  • Common: Most data you download will be in CSV format

Being able to load CSV files is essential for working with real data.

Creating a Sample CSV File

Let’s create a CSV file to practice with:

import csv

sample_data = [
    ['name', 'math', 'physics', 'chemistry'],  # Header
    ['Student1', '85', '90', '88'],
    ['Student2', '92', '85', '91'],
    ['Student3', '78', '82', '80'],
    ['Student4', '88', '87', '92'],
    ['Student5', '95', '89', '93']
]

with open('students.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(sample_data)

print("Created students.csv")

This creates a file that looks like:

name,math,physics,chemistry
Student1,85,90,88
Student2,92,85,91
Student3,78,82,80
Student4,88,87,92
Student5,95,89,93

Loading CSV with np.genfromtxt()

NumPy provides the genfromtxt() function to load CSV files:

# Load CSV file
data = np.genfromtxt('students.csv', delimiter=',')

print("Loaded data:")
print(data)
print(f"\nShape: {data.shape}")

Output:

Loaded data:
[[nan nan nan nan]
 [nan 85. 90. 88.]
 [nan 92. 85. 91.]
 [nan 78. 82. 80.]
 [nan 88. 87. 92.]
 [nan 95. 89. 93.]]

Shape: (6, 4)

Wait—what are those nan values? Let’s understand this problem.


Understanding and Handling NaN Values

What is NaN?

NaN stands for “Not a Number.” It represents missing or invalid data. NumPy cannot convert text like “name” or “Student1” into numbers, so it inserts NaN instead.

# Example of an array with NaN
data_with_nan = np.array([10, 20, np.nan, 40, 50])

print("Array with NaN:", data_with_nan)
# Output: [10. 20. nan 40. 50.]

Where NaN Comes From

You encounter NaN values in three common situations:

  1. Loading CSV with text columns: Text cannot be converted to numbers
  2. Missing data in files: Empty cells become NaN
  3. Invalid calculations: Operations like 0/0 or infinity minus infinity produce NaN
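Situation 3 is easy to demonstrate. NumPy warns about the invalid operation (the exact warning text may vary by version) but returns NaN rather than crashing:

```python
import numpy as np

# 0/0 is mathematically undefined, so the result is NaN
result = np.float64(0) / np.float64(0)
print(result)  # nan

# Infinity minus infinity is also undefined
print(np.inf - np.inf)  # nan

# NaN is never equal to anything, including itself
print(result == result)  # False
```

The last line shows a famous quirk: NaN compares unequal even to itself, which is why you need np.isnan() to detect it.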

The Header Row Problem

When we loaded the CSV file, the header row contains text (“name”, “math”, “physics”, “chemistry”). NumPy tries to convert everything to numbers, but text cannot be converted, so it becomes NaN.

The first column also contains NaN because the student names are text.

Solution: Skip the Header Row

The solution is to tell NumPy to skip the header row:

# Load CSV and skip the header row
data = np.genfromtxt('students.csv', delimiter=',', skip_header=1)

print("Loaded data (skipped header):")
print(data)
print(f"\nShape: {data.shape}")

Output:

Loaded data (skipped header):
[[nan 85. 90. 88.]
 [nan 92. 85. 91.]
 [nan 78. 82. 80.]
 [nan 88. 87. 92.]
 [nan 95. 89. 93.]]

Shape: (5, 4)

Better! The header row is gone, but the first column still has NaN because the names column contains text.

Important

The first column still has NaN values because student names cannot be converted to numbers. For now, we will focus on numeric columns only. Later, when you learn pandas, you will discover better ways to handle mixed data types.
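One way to avoid the text column entirely is the usecols argument of genfromtxt, which takes the column indices to keep (columns are numbered from 0, so the three score columns here are 1, 2, and 3). The sketch below recreates students.csv so it runs on its own:

```python
import csv
import numpy as np

# Recreate students.csv from earlier in the lesson
rows = [['name', 'math', 'physics', 'chemistry'],
        ['Student1', '85', '90', '88'],
        ['Student2', '92', '85', '91'],
        ['Student3', '78', '82', '80'],
        ['Student4', '88', '87', '92'],
        ['Student5', '95', '89', '93']]
with open('students.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)

# usecols keeps only the listed column indices (0-based),
# so the text column of names is never loaded
scores_only = np.genfromtxt('students.csv', delimiter=',',
                            skip_header=1, usecols=(1, 2, 3))

print(scores_only)
print("Shape:", scores_only.shape)  # (5, 3)
```

The result contains no NaN values at all, because only numeric cells were read.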

Loading Numeric-Only CSV Files

For practice, let’s create a CSV file with only numbers:

# Create numeric-only CSV (no names)
numeric_data = [
    ['math', 'physics', 'chemistry'],  # Header
    [85, 90, 88],
    [92, 85, 91],
    [78, 82, 80],
    [88, 87, 92],
    [95, 89, 93]
]

with open('scores_numeric.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(numeric_data)

# Load it with skip_header
scores = np.genfromtxt('scores_numeric.csv', delimiter=',', skip_header=1)

print("Clean numeric data:")
print(scores)
print(f"\nShape: {scores.shape} (5 students, 3 subjects)")

Output:

Clean numeric data:
[[85. 90. 88.]
 [92. 85. 91.]
 [78. 82. 80.]
 [88. 87. 92.]
 [95. 89. 93.]]

Shape: (5, 3) (5 students, 3 subjects)

Perfect! No NaN values because every cell contains a number.

Dealing with Missing Data

Sometimes CSV files have missing values:

# Create CSV with missing values
data_with_missing = [
    ['student', 'score1', 'score2', 'score3'],
    ['Student1', '85', '90', '88'],
    ['Student2', '92', '', '91'],      # Missing score2
    ['Student3', '78', '82', ''],      # Missing score3
    ['Student4', '', '87', '92']       # Missing score1
]

with open('scores_missing.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(data_with_missing)

# Load it
data_missing = np.genfromtxt('scores_missing.csv', delimiter=',', skip_header=1)

print("Data with missing values:")
print(data_missing)

Output:

Data with missing values:
[[nan 85. 90. 88.]
 [nan 92. nan 91.]
 [nan 78. 82. nan]
 [nan nan 87. 92.]]

The empty cells become NaN. This is actually useful—NaN marks where data is missing so you can handle it appropriately.
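Be aware that NaN is contagious: any arithmetic that touches a NaN produces NaN, so a single missing value can blank out an entire result. A quick sketch (np.nansum is one of NumPy's NaN-aware functions that skip missing values):

```python
import numpy as np

values = np.array([10.0, 20.0, np.nan, 40.0])

# Any calculation that touches NaN yields NaN
print(np.nan + 5)    # nan
print(values.sum())  # nan

# NaN-aware functions such as np.nansum ignore the missing values
print(np.nansum(values))  # 70.0
```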

Checking for NaN Values

You can detect and count NaN values:

# Check if array has any NaN values
has_nan = np.isnan(data_missing).any()
print(f"Has NaN values? {has_nan}")
# Output: True

# Count NaN values
nan_count = np.isnan(data_missing).sum()
print(f"Number of NaN values: {nan_count}")
# Output: 7

The np.isnan() function returns a boolean array (True where NaN exists, False otherwise). The .any() method checks if any value is True. The .sum() method counts how many True values exist.
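Here is what that boolean array looks like on a small example:

```python
import numpy as np

small = np.array([1.0, np.nan, 3.0, np.nan])
mask = np.isnan(small)

print(mask)        # [False  True False  True]
print(mask.any())  # True
print(mask.sum())  # 2
```

Counting works because True behaves as 1 and False as 0 when summed.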

Best Practices for Loading CSV Files

Follow these guidelines when loading CSV files with NumPy:

  1. Always use skip_header=1 if your file has a header row
  2. Keep numeric data in separate columns from text data
  3. Check for NaN values after loading
  4. Remember: pandas (which you will learn later) handles mixed data types much better

For now, focus on numeric data and always skip header rows.


Exploring Real Datasets

Let’s work with a more realistic dataset to practice what you have learned.

Creating a Realistic Sales Dataset

# Monthly sales data for 4 products across 12 months
sales_data = [
    ['month', 'product_a', 'product_b', 'product_c', 'product_d'],
    [1, 120, 135, 98, 110],   # January
    [2, 135, 142, 105, 118],  # February
    [3, 150, 138, 112, 125],  # March
    [4, 145, 155, 108, 130],  # April
    [5, 160, 148, 118, 135],  # May
    [6, 175, 162, 125, 142],  # June
    [7, 190, 170, 132, 150],  # July
    [8, 185, 165, 128, 145],  # August
    [9, 170, 158, 120, 138],  # September
    [10, 155, 150, 115, 132], # October
    [11, 140, 145, 110, 128], # November
    [12, 165, 160, 122, 140]  # December
]

with open('monthly_sales.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(sales_data)

print("Created monthly_sales.csv")

Loading the Sales Data

# Load the sales data
sales = np.genfromtxt('monthly_sales.csv', delimiter=',', skip_header=1)

print("Monthly Sales Data:")
print(sales)

Output:

Monthly Sales Data:
[[  1. 120. 135.  98. 110.]
 [  2. 135. 142. 105. 118.]
 [  3. 150. 138. 112. 125.]
 [  4. 145. 155. 108. 130.]
 [  5. 160. 148. 118. 135.]
 [  6. 175. 162. 125. 142.]
 [  7. 190. 170. 132. 150.]
 [  8. 185. 165. 128. 145.]
 [  9. 170. 158. 120. 138.]
 [ 10. 155. 150. 115. 132.]
 [ 11. 140. 145. 110. 128.]
 [ 12. 165. 160. 122. 140.]]

Exploring Dataset Properties

Always start by understanding the shape and structure of your data:

# Dataset dimensions
print(f"Shape: {sales.shape}")
# Output: (12, 5)

print(f"Total values: {sales.size}")
# Output: 60 (12 × 5)

print(f"Data type: {sales.dtype}")
# Output: float64

Understanding:

  • 12 rows: One for each month
  • 5 columns: Month number plus 4 products
  • 60 total values: 12 months × 5 columns

Viewing Parts of the Dataset

For large datasets, you often want to see just the beginning or end:

# View first 5 rows
print("First 5 months:")
print(sales[:5])

# Output:
# [[  1. 120. 135.  98. 110.]
#  [  2. 135. 142. 105. 118.]
#  [  3. 150. 138. 112. 125.]
#  [  4. 145. 155. 108. 130.]
#  [  5. 160. 148. 118. 135.]]

# View last 3 rows
print("Last 3 months:")
print(sales[-3:])

# Output:
# [[ 10. 155. 150. 115. 132.]
#  [ 11. 140. 145. 110. 128.]
#  [ 12. 165. 160. 122. 140.]]

This technique helps you quickly check if the data loaded correctly without printing thousands of rows.

Understanding Column Structure

When you skip the header row, you need to remember what each column represents:

# Column meanings (remember: we skipped the header)
# Column 0: Month (1-12)
# Column 1: Product A sales
# Column 2: Product B sales
# Column 3: Product C sales
# Column 4: Product D sales

print("Dataset structure:")
print(f"Rows (months): {sales.shape[0]}")
print(f"Columns (month + products): {sales.shape[1]}")

Visual representation:

        Month  Prod_A  Prod_B  Prod_C  Prod_D
Row 0     1     120     135      98     110
Row 1     2     135     142     105     118
Row 2     3     150     138     112     125
...
Row 11   12     165     160     122     140

In the next lesson, you will learn how to extract specific rows and columns from this data.


Practice Exercises

Time to apply what you have learned with hands-on exercises.

Exercise 1: Create a 2D Array

Create a 2D array representing temperatures in 3 cities over 4 days. Each row should represent a city, and each column should represent a day.

# Your code here
# Create a 3×4 array with temperature values
# Print the array and its shape

Hint

Use np.array() with nested lists. Each inner list is a row.

Exercise 2: Load and Explore CSV

Create a CSV file with product prices and load it into NumPy.

# Your code here
# 1. Create CSV with 5 products and 3 price columns (cost, sale_price, discount)
# 2. Load it with np.genfromtxt()
# 3. Print shape and first 3 rows

Hint

Remember to use skip_header=1 and delimiter=',' when loading.

Exercise 3: Check Dataset Properties

Load the monthly_sales.csv file you created earlier and answer:

  • How many rows does it have?
  • How many columns does it have?
  • What is its data type?

# Your code here

Summary

Congratulations! You now understand two-dimensional arrays and can load real data from files. Let’s review the key concepts.

Key Concepts

Two-Dimensional Arrays

  • 2D arrays have rows and columns (like spreadsheets)
  • Create with nested lists: np.array([[row1], [row2]])
  • Create filled arrays: np.zeros((rows, cols)) or np.ones((rows, cols))
  • Shape format: (rows, columns)

Array Shape

  • .shape returns a tuple with (rows, columns)
  • First number is always rows, second is always columns
  • .size returns total elements (rows × columns)

Loading CSV Files

  • Use np.genfromtxt(filename, delimiter=',', skip_header=1)
  • delimiter=',' tells NumPy that commas separate values
  • skip_header=1 skips the first row (usually column names)
  • Always check the shape after loading

NaN Values

  • NaN means “Not a Number”
  • Appears when text cannot be converted to numbers
  • Appears in place of missing data
  • Check with np.isnan() function
  • Count with np.isnan(array).sum()

Dataset Exploration

  • Always check shape, size, and dtype first
  • View first few rows: array[:5]
  • View last few rows: array[-3:]
  • Understand what each row and column represents

Key Functions Reference

np.array([[...], [...]])              # Create 2D array
np.zeros((rows, cols))                # 2D array of zeros
np.ones((rows, cols))                 # 2D array of ones
np.genfromtxt(file, delimiter, skip_header)  # Load CSV
.shape                                # Get (rows, columns)
np.isnan(array)                       # Check for NaN values
np.isnan(array).sum()                 # Count NaN values

Why This Matters

Most real data comes in tabular form—rows and columns. CSV files are the universal format for sharing data. The skills you learned in this lesson are fundamental to every data analytics project you will encounter.

Being able to load, inspect, and understand the structure of 2D arrays prepares you for the more advanced operations coming in the next lessons.


Next Steps

You can now create 2D arrays and load real data from CSV files. In the next lesson, you will learn how to select specific rows and columns from your data—a critical skill for data analysis.

Continue to Lesson 3 - Selecting and Slicing Data

Learn to extract specific rows, columns, and subsets from 2D arrays

Back to Lesson 1 - NumPy Essentials

Review 1D arrays and NumPy fundamentals


Keep Building Your Skills

You are making excellent progress! Two-dimensional arrays and CSV files are the foundation of real data analytics. In the next lesson, you will learn to extract specific pieces of data from these arrays, bringing you one step closer to performing actual analysis.

The combination of loading real data and selecting specific subsets will enable you to answer real business questions and uncover insights hidden in datasets!