Lesson 6 - Modifying Data and Assignment

Transforming Your Data

You can now create, analyze, and filter NumPy arrays. This final lesson teaches you how to modify arrays—updating values, cleaning data, and transforming datasets for analysis.

By the end of this lesson, you will be able to:

  • Assign new values to specific array elements
  • Update entire rows or columns at once
  • Modify data using Boolean indexing for targeted updates
  • Clean datasets by fixing invalid values
  • Add new calculated columns to arrays
  • Understand the difference between views and copies
  • Apply these techniques to real data preparation tasks

Data modification is essential for cleaning and preparing datasets before analysis. Let’s master these techniques.


Basic Assignment Operations

Assigning to Single Elements

You can change individual array elements by assigning new values:

import numpy as np

scores = np.array([85, 92, 78, 88, 95])

print("Original scores:")
print(scores)
# Output: [85 92 78 88 95]

Change specific elements:

# Change first score
scores[0] = 90

print("After changing first score to 90:")
print(scores)
# Output: [90 92 78 88 95]

# Change last score
scores[-1] = 100

print("After changing last score to 100:")
print(scores)
# Output: [ 90  92  78  88 100]
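
One detail worth knowing before assigning: an array keeps its dtype, so a float assigned into an integer array is truncated rather than stored as written. The sketch below illustrates this with a hypothetical `demo_scores` array that is not part of the lesson data:

```python
import numpy as np

# Hypothetical integer array used only for this illustration
demo_scores = np.array([85, 92, 78])

# Assigning a float into an integer array silently drops the fraction
demo_scores[0] = 90.7
print(demo_scores)
# Output: [90 92 78]

# Converting to float first preserves the decimal part
demo_float = demo_scores.astype(float)
demo_float[0] = 90.7
print(demo_float)
# Output: [90.7 92.  78. ]
```

If your data may contain decimals, converting with .astype(float) before assigning is usually the safer choice.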

Assigning to Multiple Elements

Update multiple elements at once using slicing:

prices = np.array([10, 15, 20, 25, 30])

print("Original prices:")
print(prices)
# Output: [10 15 20 25 30]

Set multiple elements to the same value:

# Update first 3 prices to same value
prices[0:3] = 12

print("After setting first 3 to 12:")
print(prices)
# Output: [12 12 12 25 30]

Set multiple elements to different values:

# Update first 3 prices to different values
prices[0:3] = [8, 10, 12]

print("After setting to different values:")
print(prices)
# Output: [ 8 10 12 25 30]
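
When assigning a sequence to a slice, the number of values must match the number of selected elements (a single scalar is the exception, because it is broadcast to every slot). A minimal sketch of what happens otherwise, using a hypothetical `demo_prices` array:

```python
import numpy as np

# Hypothetical array for illustration
demo_prices = np.array([10, 15, 20, 25, 30])

# A scalar is broadcast to every selected element
demo_prices[0:3] = 12

# A sequence must match the slice length (3 here)
try:
    demo_prices[0:3] = [8, 10]  # only 2 values for 3 slots
except ValueError as err:
    print("Assignment failed:", err)

# The failed assignment leaves the array as it was
print(demo_prices)
# Output: [12 12 12 25 30]
```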

Assigning in 2D Arrays

Assignment works with 2D arrays too:

# Student scores: Math, Physics, Chemistry
students = np.array([
    [85, 90, 88],
    [92, 85, 91],
    [78, 82, 80]
])

print("Original student scores:")
print(students)

Update individual elements:

# Change single element (student 0, subject 0)
students[0, 0] = 87

print("After changing [0, 0] to 87:")
print(students)
# Output:
# [[87 90 88]
#  [92 85 91]
#  [78 82 80]]

Update entire rows:

# Update entire first row (all scores for student 0)
students[0] = [90, 92, 89]

print("After updating first student's scores:")
print(students)
# Output:
# [[90 92 89]
#  [92 85 91]
#  [78 82 80]]

Update entire columns:

# Update entire Math column (column 0)
students[:, 0] = [95, 93, 85]

print("After updating Math column:")
print(students)
# Output:
# [[95 92 89]
#  [93 85 91]
#  [85 82 80]]
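
Rows and columns also accept a single scalar or in-place arithmetic, which saves listing every value. The following sketch uses a hypothetical `demo` table rather than the `students` array above:

```python
import numpy as np

# Hypothetical scores table: rows are students, columns are subjects
demo = np.array([
    [85, 90, 88],
    [92, 85, 91],
    [78, 82, 80]
])

# A scalar fills the whole column
demo[:, 2] = 0

# In-place arithmetic updates a whole row at once
demo[1] += 2

print(demo)
# Output:
# [[85 90  0]
#  [94 87  2]
#  [78 82  0]]
```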

Views vs Copies: Important Difference

Critical Concept

NumPy slices create views, not copies. When you modify a slice, you modify the original array. This is different from Python lists, where slicing creates a new, independent list.

# NumPy arrays: slices are VIEWS (modify original)
original = np.array([1, 2, 3, 4, 5])
slice_view = original[1:4]

print("Original array:", original)
print("Slice:", slice_view)
# Output: Original array: [1 2 3 4 5]
# Output: Slice: [2 3 4]

Modify the slice:

# Modify slice
slice_view[0] = 99

print("\nAfter modifying slice:")
print("Slice:", slice_view)
print("Original (also changed!):", original)
# Output: Slice: [99  3  4]
# Output: Original (also changed!): [ 1 99  3  4  5]

The original array changed! To prevent this, use .copy():

# Use .copy() to create independent copy
original = np.array([1, 2, 3, 4, 5])
slice_copy = original[1:4].copy()

slice_copy[0] = 99

print("\nWith .copy():")
print("Slice copy:", slice_copy)
print("Original (unchanged):", original)
# Output: Slice copy: [99  3  4]
# Output: Original (unchanged): [1 2 3 4 5]
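
To see the contrast with Python lists directly, and to check whether two arrays share data, the sketch below may help. It uses np.shares_memory; the variable names are illustrative only:

```python
import numpy as np

# Python list: slicing makes a new list, so the original is unaffected
py_list = [1, 2, 3, 4, 5]
py_slice = py_list[1:4]
py_slice[0] = 99
print(py_list)
# Output: [1, 2, 3, 4, 5]

# NumPy array: a basic slice shares memory with the original
arr = np.array([1, 2, 3, 4, 5])
print(np.shares_memory(arr, arr[1:4]))
# Output: True

# A copy does not share memory
print(np.shares_memory(arr, arr[1:4].copy()))
# Output: False
```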

Assignment with Boolean Indexing

Replacing Values That Meet Conditions

Boolean indexing enables targeted updates—change only values that meet specific criteria:

ages = np.array([25, 30, -5, 45, 18, -2, 35, 150, 28])

print("Original ages (with errors):")
print(ages)
# Output: [ 25  30  -5  45  18  -2  35 150  28]

Fix invalid negative ages:

# Fix negative ages (set to 0)
ages[ages < 0] = 0

print("After fixing negatives:")
print(ages)
# Output: [ 25  30   0  45  18   0  35 150  28]

Cap unrealistic ages:

# Cap unrealistic ages (> 100 → 100)
ages[ages > 100] = 100

print("After capping at 100:")
print(ages)
# Output: [ 25  30   0  45  18   0  35 100  28]
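
The two steps above (raise negatives to 0, cap at 100) can also be written as one call to np.clip. This is an alternative sketch, not the approach the lesson uses; note that np.clip returns a new array instead of modifying the input:

```python
import numpy as np

raw_ages = np.array([25, 30, -5, 45, 18, -2, 35, 150, 28])

# Limit every value to the range [0, 100]; raw_ages itself is left untouched
clipped = np.clip(raw_ages, 0, 100)

print(clipped)
# Output: [ 25  30   0  45  18   0  35 100  28]
```

Boolean assignment remains the more flexible tool when the replacement value is not simply the boundary itself.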

Practical Example: Data Cleaning

Sensor data often contains error codes that need cleaning:

# Temperature data with sensor errors
temps = np.array([25, 28, -999, 32, 30, -999, 27, 31, 29, -999])

print("Temperature readings (-999 = sensor error):")
print(temps)
# Output: [  25   28 -999   32   30 -999   27   31   29 -999]

Replace errors with NaN:

# Convert to float to support NaN
temps = temps.astype(float)

# Replace errors with NaN
temps[temps == -999] = np.nan

print("\nAfter replacing -999 with NaN:")
print(temps)
# Output: [25. 28. nan 32. 30. nan 27. 31. 29. nan]

Calculate statistics ignoring NaN:

# Calculate mean (ignoring NaN)
mean_temp = np.nanmean(temps)
print(f"\nMean temperature (ignoring NaN): {mean_temp:.1f}°C")
# Output: Mean temperature (ignoring NaN): 28.9°C

# Replace NaN with mean
temps[np.isnan(temps)] = mean_temp

print("\nAfter filling NaN with mean (rounded to 1 decimal for display):")
print(np.round(temps, 1))
# Output: [25.  28.  28.9 32.  30.  28.9 27.  31.  29.  28.9]

This is a common data cleaning pattern: identify invalid values, mark them as NaN, calculate statistics without them, then fill missing values.
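
The pattern can be wrapped in a small reusable function. The sketch below is one possible way to do it; `fill_sentinel_with_mean` is a hypothetical helper name, not a NumPy function, and it assumes at least one valid reading exists:

```python
import numpy as np

def fill_sentinel_with_mean(values, sentinel):
    """Return a float copy with `sentinel` entries replaced by the mean of the rest.

    Hypothetical helper for illustration; not part of NumPy.
    """
    # astype(float) returns a new array, so the input is not modified;
    # float is needed because integer arrays cannot store NaN
    cleaned = np.asarray(values).astype(float)
    cleaned[cleaned == sentinel] = np.nan             # mark invalid readings
    cleaned[np.isnan(cleaned)] = np.nanmean(cleaned)  # fill with mean of valid ones
    return cleaned

readings = np.array([25, 28, -999, 32, 30, -999, 27, 31, 29, -999])
filled = fill_sentinel_with_mean(readings, -999)

print(np.round(filled, 1))
# Output: [25.  28.  28.9 32.  30.  28.9 27.  31.  29.  28.9]
```

Because the function works on a copy, the raw readings stay available for auditing.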

Applying Different Updates Based on Ranges

Sometimes you need to apply different updates to different subsets:

# Sales performance bonuses
sales = np.array([120, 135, 98, 145, 110, 175, 190, 88])

print("Sales amounts:")
print(sales)
# Output: [120 135  98 145 110 175 190  88]

Calculate tiered bonuses:

# Create a separate bonus array filled with zeros (sales itself is not modified)
bonuses = np.zeros(len(sales))

# Low sales (< 100): 5% bonus
low_mask = sales < 100
bonuses[low_mask] = sales[low_mask] * 0.05

# Medium sales (100-150): 10% bonus
medium_mask = (sales >= 100) & (sales <= 150)
bonuses[medium_mask] = sales[medium_mask] * 0.10

# High sales (> 150): 15% bonus
high_mask = sales > 150
bonuses[high_mask] = sales[high_mask] * 0.15

print("\nSales and bonuses:")
for sale, bonus in zip(sales, bonuses):
    print(f"Sales: ${sale:3d} → Bonus: ${bonus:5.2f}")

Output:

Sales and bonuses:
Sales: $120 → Bonus: $12.00
Sales: $135 → Bonus: $13.50
Sales: $ 98 → Bonus: $ 4.90
Sales: $145 → Bonus: $14.50
Sales: $110 → Bonus: $11.00
Sales: $175 → Bonus: $26.25
Sales: $190 → Bonus: $28.50
Sales: $ 88 → Bonus: $ 4.40
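
As an alternative to three separate masks, np.select can pick a rate per element in a single call. This is a sketch of a different technique than the one shown above, with illustrative variable names; it should give the same bonuses:

```python
import numpy as np

sales_demo = np.array([120, 135, 98, 145, 110, 175, 190, 88])

# Conditions are checked in order; the first one that matches wins,
# so the second condition effectively means 100 <= sales <= 150
conditions = [sales_demo < 100, sales_demo <= 150]
rates = [0.05, 0.10]

# Elements matching no condition (sales > 150) get the default rate
bonus_rate = np.select(conditions, rates, default=0.15)
bonuses_demo = sales_demo * bonus_rate

for sale, bonus in zip(sales_demo, bonuses_demo):
    print(f"Sales: ${sale:3d} -> Bonus: ${bonus:5.2f}")
```

The mask-based version is easier to extend when each tier needs a different kind of update, while np.select keeps simple tier tables compact.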

Conditional Assignment in 2D Arrays

Update Rows Based on Conditions

Apply updates to specific rows based on column values:

# Student data: ID, Math, Physics, Chemistry
students = np.array([
    [101, 85, 90, 88],
    [102, 92, 85, 91],
    [103, 55, 58, 52],  # Struggling student
    [104, 88, 87, 92],
    [105, 48, 45, 50]   # Struggling student
])

print("Original student data:")
print(students)

Add bonus points to struggling students:

# Add 5 bonus points to Math for students who scored < 60
low_math = students[:, 1] < 60
students[low_math, 1] = students[low_math, 1] + 5

print("After adding bonus to low Math scores:")
print(students)
# Output:
# [[101  85  90  88]
#  [102  92  85  91]
#  [103  60  58  52]  ← Math increased from 55 to 60
#  [104  88  87  92]
#  [105  53  45  50]] ← Math increased from 48 to 53
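
A row mask can be combined with a column slice to update several columns for the selected rows at once. The sketch below uses a hypothetical `table` array and adds 5 points to every subject for students with a low Math score:

```python
import numpy as np

# Hypothetical table: ID, Math, Physics, Chemistry
table = np.array([
    [101, 85, 90, 88],
    [102, 92, 85, 91],
    [103, 55, 58, 52],
    [105, 48, 45, 50]
])

# Rows where Math (column 1) is below 60
low_math = table[:, 1] < 60

# Add 5 points to every subject column (1 onward) for those rows only
table[low_math, 1:] += 5

print(table)
# Output:
# [[101  85  90  88]
#  [102  92  85  91]
#  [103  60  63  57]
#  [105  53  50  55]]
```

The mask is computed before the update, so it still reflects the original Math scores even though row 103 now reaches 60.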

Update Specific Columns for All Rows

Apply updates to entire columns:

# Sales data: Product, Units, Price
products = np.array([
    [1, 120, 25],
    [2, 85, 40],
    [3, 200, 15],
    [4, 55, 60]
])

print("Original product data:")
print(products)

Increase all prices:

# Convert to float first: an integer array would truncate 27.5 to 27
products = products.astype(float)

# Increase all prices by 10%
products[:, 2] = products[:, 2] * 1.1

print("After 10% price increase:")
print(products)
# Output:
# [[  1.  120.   27.5]
#  [  2.   85.   44. ]
#  [  3.  200.   16.5]
#  [  4.   55.   66. ]]

# Round prices to the nearest integer
# (np.round sends exact halves to the nearest even number, so 16.5 becomes 16)
products[:, 2] = np.round(products[:, 2])

print("After rounding prices:")
print(products.astype(int))
# Output:
# [[  1 120  28]
#  [  2  85  44]
#  [  3 200  16]
#  [  4  55  66]]

Cap Values in Specific Columns

Prevent unrealistic values:

# Trip data: Distance (km), Time (min), Speed (km/h)
trips = np.array([
    [5.2, 12, 26],
    [15.8, 8, 118],   # Unrealistic speed
    [8.5, 15, 34],
    [22.4, 10, 134],  # Unrealistic speed
    [12.1, 18, 40]
])

print("Trip data (with unrealistic speeds):")
print(trips)

Cap speeds at reasonable maximum:

# Cap speeds at 100 km/h
trips[trips[:, 2] > 100, 2] = 100

print("\nAfter capping speeds at 100:")
print(trips)
# Output:
# [[  5.2  12.   26. ]
#  [ 15.8   8.  100. ]  ← Capped from 118
#  [  8.5  15.   34. ]
#  [ 22.4  10.  100. ]  ← Capped from 134
#  [ 12.1  18.   40. ]]
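
For a simple upper limit, np.minimum offers an equivalent way to cap a column without building a mask. A short sketch on a hypothetical `trips_demo` array:

```python
import numpy as np

# Hypothetical trips: Distance (km), Time (min), Speed (km/h)
trips_demo = np.array([
    [5.2, 12, 26],
    [15.8, 8, 118],
    [22.4, 10, 134]
])

# Element-wise minimum of each speed and the cap of 100
trips_demo[:, 2] = np.minimum(trips_demo[:, 2], 100)

print(trips_demo[:, 2])
# Output: [ 26. 100. 100.]
```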

Adding New Columns

Using np.column_stack()

Add new columns to existing arrays:

# Student scores: Math, Physics
scores = np.array([
    [85, 90],
    [92, 85],
    [78, 82]
])

print("Original scores (Math, Physics):")
print(scores)
print(f"Shape: {scores.shape}")
# Output: Shape: (3, 2)

Add a Chemistry column:

# Add Chemistry scores
chemistry = np.array([88, 91, 80])

scores_with_chem = np.column_stack((scores, chemistry))

print("\nAfter adding Chemistry:")
print(scores_with_chem)
print(f"Shape: {scores_with_chem.shape}")
# Output:
# [[85 90 88]
#  [92 85 91]
#  [78 82 80]]
# Shape: (3, 3)

Using np.concatenate()

Alternative method using concatenate:

# Convert 1D array to column
chemistry_col = chemistry.reshape(-1, 1)

scores_concat = np.concatenate((scores, chemistry_col), axis=1)

print("Using np.concatenate:")
print(scores_concat)
print(f"Shape: {scores_concat.shape}")

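Both methods produce the same result. np.column_stack() accepts the 1D array directly, while np.concatenate() needs it reshaped into a column first. Use whichever feels more natural.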
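
The reshape step matters: np.concatenate() rejects inputs with different numbers of dimensions. The sketch below, with illustrative `base` and `extra` arrays, shows the failure and the fix:

```python
import numpy as np

base = np.array([[85, 90], [92, 85], [78, 82]])   # shape (3, 2)
extra = np.array([88, 91, 80])                    # shape (3,)

# column_stack treats a 1D array as a column automatically
print(np.column_stack((base, extra)).shape)
# Output: (3, 3)

# concatenate requires the same number of dimensions, so this fails
try:
    np.concatenate((base, extra), axis=1)
except ValueError as err:
    print("concatenate failed:", err)

# Reshaping to (3, 1) first makes it work
print(np.concatenate((base, extra.reshape(-1, 1)), axis=1).shape)
# Output: (3, 3)
```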

Adding Calculated Columns

Create new columns from calculations:

# Sales data: Units, Price
sales = np.array([
    [120, 25],
    [85, 40],
    [200, 15]
])

print("Sales data (Units, Price):")
print(sales)

Calculate revenue:

# Calculate Revenue = Units × Price
revenue = sales[:, 0] * sales[:, 1]

print("\nRevenue:")
print(revenue)
# Output: [3000 3400 3000]

# Add Revenue as new column
sales_with_revenue = np.column_stack((sales, revenue))

print("\nSales with Revenue (Units, Price, Revenue):")
print(sales_with_revenue)
# Output:
# [[ 120   25 3000]
#  [  85   40 3400]
#  [ 200   15 3000]]

Complete Data Enhancement Example

Build a comprehensive dataset with multiple calculated columns:

# Product data: ID, Units_Sold, Unit_Price
products = np.array([
    [1, 120, 25],
    [2, 85, 40],
    [3, 200, 15],
    [4, 55, 60]
])

print("Original product data:")
print(products)

Add multiple calculated columns:

# Calculate Revenue
revenue = products[:, 1] * products[:, 2]

# Calculate Tax (9%)
tax = revenue * 0.09

# Calculate Total (Revenue + Tax)
total = revenue + tax

print("\nCalculated columns:")
print(f"Revenue: {revenue}")
print(f"Tax: {tax}")
print(f"Total: {total}")

Combine everything:

# Add all new columns
enhanced_products = np.column_stack((products, revenue, tax, total))

print("\nEnhanced product data:")
print("ID | Units | Price | Revenue | Tax    | Total")
print("=" * 55)
for row in enhanced_products:
    print(f"{int(row[0]):2d} | {int(row[1]):5d} | {int(row[2]):5d} | {row[3]:7.0f} | {row[4]:6.2f} | {row[5]:7.2f}")

Output:

Enhanced product data:
ID | Units | Price | Revenue | Tax    | Total
=======================================================
 1 |   120 |    25 |    3000 | 270.00 | 3270.00
 2 |    85 |    40 |    3400 | 306.00 | 3706.00
 3 |   200 |    15 |    3000 | 270.00 | 3270.00
 4 |    55 |    60 |    3300 | 297.00 | 3597.00
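
The int() conversions in the print loop are needed because a NumPy array has a single dtype: once a float column such as tax is stacked in, the integer columns become floats too. A small sketch with hypothetical arrays shows the effect:

```python
import numpy as np

ids_units = np.array([[1, 120], [2, 85]])   # integer columns
tax_demo = np.array([270.0, 306.0])         # float column

combined = np.column_stack((ids_units, tax_demo))

# dtype.kind is 'i' for signed integers and 'f' for floats
print(ids_units.dtype.kind, combined.dtype.kind)
# Output: i f

print(combined)
# Output:
# [[  1. 120. 270.]
#  [  2.  85. 306.]]
```

When columns need to keep different types, pandas DataFrames are usually a better fit than a single NumPy array.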

Practice Exercises

Apply data modification techniques to these exercises.

Exercise 1: Clean Invalid Data

Replace all negative values with 0:

measurements = np.array([25, -5, 32, 30, -8, 27, 31, -2, 29])

# Your code here:
# 1. Replace negative values with 0
# 2. Print cleaned array

Exercise 2: Apply Conditional Discount

Reduce prices greater than 50 by 10%:

prices = np.array([45, 65, 30, 75, 40, 80, 55])

# Your code here:
# 1. Apply 10% discount to prices > 50
# 2. Print updated prices

Hint

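Convert to float first with prices = prices.astype(float), otherwise the discounted values are truncated to integers. Then use Boolean indexing: prices[prices > 50] = prices[prices > 50] * 0.9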

Exercise 3: Add Calculated Column

Calculate and add total score for each student:

# Student scores: Math, Physics, Chemistry
students = np.array([
    [85, 90, 88],
    [92, 85, 91],
    [78, 82, 80]
])

# Your code here:
# 1. Calculate total for each student (sum across row)
# 2. Add as new column using np.column_stack()
# 3. Print result

Summary

You now know how to modify NumPy arrays effectively. Let’s review the key concepts.

Key Concepts

Basic Assignment

  • Single element: array[0] = 100
  • Multiple elements: array[0:3] = [1, 2, 3]
  • 2D single: array[0, 1] = 50
  • 2D row: array[0] = [1, 2, 3]
  • 2D column: array[:, 0] = [10, 20, 30]

Boolean Assignment

  • Replace matching values: array[array < 0] = 0
  • Cap values: array[array > 100] = 100
  • Clean data: replace errors with NaN or mean
  • Apply tiered updates based on value ranges

2D Conditional Assignment

  • Update column conditionally: array[mask, col] = new_value
  • Modify all rows in column: array[:, col] = values
  • Cap specific columns: array[array[:, col] > 100, col] = 100

Adding Columns

  • np.column_stack((array, new_col)) adds columns
  • np.concatenate((array, new_col), axis=1) alternative method
  • Add calculated columns (revenue, totals, averages)

Key Patterns Reference

# Basic assignment
array[index] = value
array[start:end] = values
array[:, col] = values

# Boolean assignment
array[array < 0] = 0                    # Fix negatives
array[array > 100] = 100                # Cap values
array[np.isnan(array)] = mean_value     # Fill NaN

# 2D assignment
array[:, col] = array[:, col] * 1.1     # Update column (use a float array, or decimals are truncated)
array[mask, col] = value                # Conditional update

# Add columns
new_array = np.column_stack((array, new_col))

Important Reminders

  • Views vs Copies: Slices are views. Use .copy() for independent copies
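  • Boolean selection (selected = array[mask]) returns a copy, but assigning through a mask (array[mask] = value) modifies the original in place
  • Use .astype(float) before assigning NaN values; integer arrays cannot store NaN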
  • Reshape 1D to column: array.reshape(-1, 1)
  • Always verify shapes when combining arrays
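
The difference between selecting with a mask and assigning through a mask is easy to confuse, so here is a minimal sketch with an illustrative `data` array:

```python
import numpy as np

data = np.array([1, 5, 2, 8])

# Selecting with a mask returns a copy: changing it leaves data alone
selected = data[data > 2]
selected[0] = -1
print(data)
# Output: [1 5 2 8]

# Assigning through a mask writes into the original array
data[data > 2] = 0
print(data)
# Output: [1 0 2 0]
```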

Why This Matters

Data modification enables you to:

  • Clean invalid or missing values
  • Transform data for analysis
  • Calculate derived metrics
  • Prepare datasets for visualization
  • Fix data quality issues
  • Apply business rules to data

These skills are essential for real-world data analytics where raw data rarely arrives in perfect condition.


Conclusion: NumPy Fundamentals Complete

Congratulations! You have completed the NumPy Fundamentals module. You now possess a comprehensive set of skills for numerical computing and data manipulation:

What You Mastered:

  • Creating and understanding NumPy arrays
  • Loading real data from CSV files
  • Selecting specific rows, columns, and subsets
  • Performing vectorized calculations efficiently
  • Filtering data with Boolean indexing
  • Modifying and transforming datasets

Your Next Steps:

Continue to Pandas Data Analysis

Build on NumPy to work with labeled, tabular data using pandas DataFrames

Back to Module Overview

Review the complete NumPy Fundamentals module


You Are Now a NumPy Practitioner

The skills you learned in this module form the foundation of the entire Python data science ecosystem. NumPy arrays underpin pandas, scikit-learn, TensorFlow, and virtually every data science library you will encounter.

You can now:

  • Work with numerical data efficiently
  • Perform calculations on entire datasets without loops
  • Query and filter data using Boolean conditions
  • Clean and transform real-world datasets
  • Prepare data for analysis and visualization

These capabilities make you ready for advanced data analytics. Continue your journey with pandas, where you will apply these NumPy concepts to labeled, tabular data with even more powerful features!