Lesson 8 - Generators and Memory-Efficient Processing

Introduction

In the previous lesson, we learned about iterators. Creating iterators with classes requires writing __iter__ and __next__ methods, which can be verbose. Python provides a simpler way: generators.

Generators are a special type of iterator created using functions with the yield keyword. They’re:

  • Easier to write than class-based iterators
  • Memory-efficient (values generated on-demand)
  • Perfect for processing large datasets
  • Capable of representing infinite sequences

In this lesson, we’ll explore:

  • Generator functions and the yield keyword
  • Generator expressions
  • Sending values into generators
  • Using generators for data pipelines
  • The itertools module for generator utilities

Generator Functions

A generator function looks like a normal function but uses yield instead of return:

def count_up(start, end):
    """Generator that counts from start to end"""
    current = start
    while current <= end:
        yield current
        current += 1

# Create a generator
counter = count_up(1, 5)

# It's an iterator!
print(next(counter))  # 1
print(next(counter))  # 2

# Use in a for loop
for num in count_up(10, 13):
    print(num)

Output:

1
2
10
11
12
13

How yield Works

When a function contains yield:

  1. Calling the function returns a generator object (the function body doesn’t run yet)
  2. Calling next() on the generator runs the body up to the first yield
  3. The value after yield is returned to the caller
  4. The function’s state (local variables and current position) is frozen
  5. The next next() call resumes execution right where it left off

def simple_generator():
    print("Starting")
    yield 1
    print("Between yields")
    yield 2
    print("Ending")
    yield 3

gen = simple_generator()
print("Generator created")

print(next(gen))
print(next(gen))
print(next(gen))

Output:

Generator created
Starting
1
Between yields
2
Ending
3

Comparing Class Iterator vs Generator

Here’s the same functionality implemented first as a class-based iterator, then as a generator:

# Class-based iterator (verbose)
class CounterIterator:
    def __init__(self, start, end):
        self.current = start
        self.end = end

    def __iter__(self):
        return self

    def __next__(self):
        if self.current > self.end:
            raise StopIteration
        value = self.current
        self.current += 1
        return value

# Generator (simple!)
def counter_generator(start, end):
    current = start
    while current <= end:
        yield current
        current += 1

# Both work the same way
for num in CounterIterator(1, 3):
    print(num, end=" ")

print()

for num in counter_generator(1, 3):
    print(num, end=" ")

Output:

1 2 3
1 2 3

Generators are much more concise!
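
One caveat applies to both: like any iterator, a generator is exhausted after one full pass, so create a new one to iterate again:

gen = counter_generator(1, 3)
print(list(gen))  # [1, 2, 3]
print(list(gen))  # [] - already exhausted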

Practical Example: Reading Large Files

Generators are perfect for processing large files line-by-line without loading everything into memory:

def read_books(filename):
    """Generator that yields book info from a CSV file"""
    try:
        with open(filename, 'r') as f:
            # Skip header (a default of None avoids an error on an empty file)
            next(f, None)

            for line in f:
                line = line.strip()
                if line:
                    parts = line.split(',')
                    yield {
                        'title': parts[0],
                        'author': parts[1],
                        'price': float(parts[2])
                    }
    except FileNotFoundError:
        print(f"File {filename} not found")

# Process books one at a time (memory-efficient)
for book in read_books('books.csv'):
    if book['price'] < 40:
        print(f"{book['title']} by {book['author']} - ${book['price']}")

This reads one line at a time, never loading the entire file into memory.
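
Splitting on commas by hand breaks on quoted fields. For real CSV files, the standard csv module (used again later in this lesson) is more robust. Here's a sketch of a hypothetical read_books_csv under the same assumptions (a books.csv with a title,author,price header):

import csv

def read_books_csv(filename):
    """Generator that yields one book dict per CSV row"""
    try:
        with open(filename, newline='') as f:
            for row in csv.DictReader(f):
                row['price'] = float(row['price'])
                yield row
    except FileNotFoundError:
        print(f"File {filename} not found")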

Generator Expressions

Generator expressions look like list comprehensions but use parentheses and produce generators instead of lists:

# List comprehension - creates entire list in memory
squares_list = [x**2 for x in range(1000000)]

# Generator expression - generates values on-demand
squares_gen = (x**2 for x in range(1000000))

# Generator is much more memory-efficient
import sys
print(f"List size: {sys.getsizeof(squares_list)} bytes")
print(f"Generator size: {sys.getsizeof(squares_gen)} bytes")

# Both work in for loops
for num in (x**2 for x in range(5)):
    print(num, end=" ")

Output (exact sizes vary by Python version):

List size: 8000056 bytes
Generator size: 112 bytes
0 1 4 9 16
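
A generator expression can also be passed directly to any function that consumes an iterable; when it is the sole argument, the extra parentheses can be dropped:

# Sums one square at a time - no million-element list is ever built
total = sum(x**2 for x in range(1000000))
print(total)  # 333332833333500000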

Practical Use: Filtering Books

books = [
    {'title': 'Python Basics', 'price': 29.99},
    {'title': 'Python Advanced', 'price': 49.99},
    {'title': 'Data Science 101', 'price': 39.99},
    {'title': 'Machine Learning', 'price': 59.99},
]

# Generator expression for affordable books
affordable = (book for book in books if book['price'] < 40)

# Memory-efficient iteration
for book in affordable:
    print(f"{book['title']}: ${book['price']}")

Output:

Python Basics: $29.99
Data Science 101: $39.99
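
A related idiom combines a generator expression with next() and a default value to find the first match without scanning the whole list:

first_affordable = next((book for book in books if book['price'] < 40), None)
if first_affordable:
    print(f"First match: {first_affordable['title']}")  # First match: Python Basics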

Chaining Generators

Generators can be chained to create data processing pipelines:

def read_books():
    """Simulate reading books from a database"""
    books = [
        {'title': 'Python Basics', 'author': 'John Doe', 'price': 29.99, 'rating': 4.5},
        {'title': 'Python Advanced', 'author': 'Jane Smith', 'price': 49.99, 'rating': 4.8},
        {'title': 'Data Science 101', 'author': 'Bob Johnson', 'price': 39.99, 'rating': 4.2},
        {'title': 'Machine Learning', 'author': 'Alice Williams', 'price': 59.99, 'rating': 4.9},
        {'title': 'Web Development', 'author': 'Charlie Brown', 'price': 34.99, 'rating': 3.8},
    ]
    for book in books:
        yield book

def filter_by_price(books, max_price):
    """Filter books by maximum price"""
    for book in books:
        if book['price'] <= max_price:
            yield book

def filter_by_rating(books, min_rating):
    """Filter books by minimum rating"""
    for book in books:
        if book['rating'] >= min_rating:
            yield book

def format_book(books):
    """Format book information"""
    for book in books:
        yield f"'{book['title']}' by {book['author']} - ${book['price']} (⭐ {book['rating']})"

# Create a processing pipeline
all_books = read_books()
affordable_books = filter_by_price(all_books, 50)
highly_rated = filter_by_rating(affordable_books, 4.3)
formatted = format_book(highly_rated)

# Process the pipeline
print("Affordable, highly-rated books:")
for book_info in formatted:
    print(f"  {book_info}")

Output:

Affordable, highly-rated books:
  'Python Basics' by John Doe - $29.99 (⭐ 4.5)
  'Python Advanced' by Jane Smith - $49.99 (⭐ 4.8)

Each generator processes one item at a time, making this extremely memory-efficient even for millions of books.
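
The same pipeline can be written with generator expressions, each stacking lazily on the previous one just like the functions above:

affordable = (b for b in read_books() if b['price'] <= 50)
highly_rated = (b for b in affordable if b['rating'] >= 4.3)
for b in highly_rated:
    print(b['title'])  # Python Basics, Python Advanced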

Infinite Generators

Generators can produce infinite sequences:

def infinite_counter(start=0):
    """Count forever"""
    num = start
    while True:
        yield num
        num += 1

# Use with break to avoid infinite loop
counter = infinite_counter(1)
for num in counter:
    if num > 5:
        break
    print(num)

Output:

1
2
3
4
5
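
itertools.islice (covered near the end of this lesson) offers a cleaner way to take a bounded slice of an infinite generator, with no explicit break:

import itertools

for num in itertools.islice(infinite_counter(1), 5):
    print(num, end=" ")  # 1 2 3 4 5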

Practical: Infinite Book ID Generator

def book_id_generator(prefix="BOOK"):
    """Generate unique book IDs forever"""
    counter = 1
    while True:
        yield f"{prefix}-{counter:05d}"
        counter += 1

# Use it to assign IDs to books
id_gen = book_id_generator("LIB")

books = ["Python Basics", "Python Advanced", "Data Science 101"]

for book in books:
    book_id = next(id_gen)
    print(f"{book_id}: {book}")

Output:

LIB-00001: Python Basics
LIB-00002: Python Advanced
LIB-00003: Data Science 101

Sending Values into Generators

Generators can receive values using the send() method:

def running_average():
    """Calculate running average"""
    total = 0
    count = 0
    average = None

    while True:
        # Receive a value
        value = yield average
        if value is not None:
            total += value
            count += 1
            average = total / count

# Use the generator
avg = running_average()
next(avg)  # Prime the generator

print(avg.send(100))  # 100.0
print(avg.send(150))  # 125.0
print(avg.send(200))  # 150.0
print(avg.send(50))   # 125.0

Output:

100.0
125.0
150.0
125.0
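
Priming is required because send() can only deliver a value to a generator that is already paused at a yield. Sending None is equivalent to calling next(), so either form primes the generator:

avg2 = running_average()
avg2.send(None)       # same as next(avg2): runs to the first yield
print(avg2.send(10))  # 10.0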

Generator Methods: send(), throw(), close()

Generators have special methods for advanced control:

def book_processor():
    """Process books with error handling"""
    books_processed = 0

    try:
        while True:
            book = yield books_processed
            if book:
                print(f"Processing: {book}")
                books_processed += 1
    except GeneratorExit:
        print(f"Processed {books_processed} books total")
    except Exception as e:
        print(f"Error: {e}")
        yield -1

processor = book_processor()
next(processor)  # Prime it

print(processor.send("Python Basics"))
print(processor.send("Python Advanced"))

# Close the generator
processor.close()
# processor.send("Another book")  # Would raise StopIteration

Output:

Processing: Python Basics
1
Processing: Python Advanced
2
Processed 2 books total
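
The throw() method isn’t demonstrated above: it raises an exception inside the generator at the paused yield, where the generator’s own except blocks can handle it. A minimal sketch with a fresh book_processor (the previous one is closed):

processor = book_processor()
next(processor)  # Prime it
processor.send("Python Basics")                 # Processing: Python Basics
print(processor.throw(ValueError("bad data")))  # Error: bad data, then -1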

Using yield from

The yield from expression delegates to another generator:

def read_programming_books():
    """Yield programming books"""
    yield "Python Basics"
    yield "JavaScript Intro"
    yield "Go Programming"

def read_data_books():
    """Yield data science books"""
    yield "Data Science 101"
    yield "Statistics Fundamentals"
    yield "Machine Learning"

def read_all_books():
    """Yield all books using yield from"""
    yield from read_programming_books()
    yield from read_data_books()

print("All books:")
for book in read_all_books():
    print(f"  - {book}")

Output:

All books:
  - Python Basics
  - JavaScript Intro
  - Go Programming
  - Data Science 101
  - Statistics Fundamentals
  - Machine Learning

This is cleaner than manually looping and yielding each item.
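
yield from also captures the sub-generator’s return value, something a manual loop cannot do. A small sketch of that behavior:

def yield_books(books):
    for book in books:
        yield book
    return len(books)  # becomes the value of the yield from expression

def catalog():
    count = yield from yield_books(["Python Basics", "Go Programming"])
    print(f"Delegated {count} books")

for book in catalog():
    print(f"  - {book}")
# Output:
#   - Python Basics
#   - Go Programming
# Delegated 2 books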

Practical Example: Data Processing Pipeline

Let’s build a realistic data processing pipeline using generators:

import csv
from io import StringIO

# Simulated CSV data
csv_data = """title,author,price,pages
Python Crash Course,Eric Matthes,39.99,544
Automate the Boring Stuff,Al Sweigart,29.99,504
Learning Python,Mark Lutz,64.99,1648
Python Basics,John Doe,24.99,320
Data Science 101,Jane Smith,44.99,420"""

def read_csv_data(csv_text):
    """Generator: Read CSV data"""
    reader = csv.DictReader(StringIO(csv_text))
    for row in reader:
        yield row

def parse_numeric_fields(books):
    """Generator: Convert numeric strings to numbers"""
    for book in books:
        book['price'] = float(book['price'])
        book['pages'] = int(book['pages'])
        yield book

def calculate_price_per_page(books):
    """Generator: Add price per page field"""
    for book in books:
        book['price_per_page'] = book['price'] / book['pages']
        yield book

def filter_good_deals(books, max_price_per_page=0.08):
    """Generator: Filter books with good price per page ratio"""
    for book in books:
        if book['price_per_page'] <= max_price_per_page:
            yield book

def format_output(books):
    """Generator: Format books for display"""
    for book in books:
        yield (
            f"'{book['title']}' by {book['author']}\n"
            f"  ${book['price']:.2f} for {book['pages']} pages "
            f"(${book['price_per_page']:.4f}/page)"
        )

# Build the pipeline
pipeline = read_csv_data(csv_data)
pipeline = parse_numeric_fields(pipeline)
pipeline = calculate_price_per_page(pipeline)
pipeline = filter_good_deals(pipeline, max_price_per_page=0.10)
pipeline = format_output(pipeline)

# Execute the pipeline
print("Best Value Books:")
for book_info in pipeline:
    print(book_info)

Output:

Best Value Books:
'Python Crash Course' by Eric Matthes
  $39.99 for 544 pages ($0.0735/page)
'Automate the Boring Stuff' by Al Sweigart
  $29.99 for 504 pages ($0.0595/page)
'Learning Python' by Mark Lutz
  $64.99 for 1648 pages ($0.0394/page)
'Python Basics' by John Doe
  $24.99 for 320 pages ($0.0781/page)

Each stage processes one book at a time—very memory-efficient even for millions of books!

The itertools Module

Python’s itertools module provides many powerful generator utilities:

import itertools

books = ["Python Basics", "Python Advanced", "Data Science 101"]
prices = [29.99, 49.99, 39.99]

# chain - combine iterables
all_items = itertools.chain(books, ["Statistics", "ML"])
print("Chained:", list(all_items))

# zip_longest - like zip, but pads a shorter iterable with fillvalue
for book, price in itertools.zip_longest(books, prices, fillvalue=0):
    print(f"{book}: ${price}")

# cycle - repeat infinitely
print("\nCycle (first 5):")
for i, book in enumerate(itertools.cycle(books)):
    if i >= 5:
        break
    print(f"  {book}")

# islice - slice an iterable
print("\nFirst 2 books:")
for book in itertools.islice(books, 2):
    print(f"  {book}")

# count - infinite counter
print("\nWith IDs:")
for book_id, book in zip(itertools.count(1), books):
    print(f"  {book_id}. {book}")

Output:

Chained: ['Python Basics', 'Python Advanced', 'Data Science 101', 'Statistics', 'ML']
Python Basics: $29.99
Python Advanced: $49.99
Data Science 101: $39.99

Cycle (first 5):
  Python Basics
  Python Advanced
  Data Science 101
  Python Basics
  Python Advanced

First 2 books:
  Python Basics
  Python Advanced

With IDs:
  1. Python Basics
  2. Python Advanced
  3. Data Science 101
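
Two more helpers worth knowing are takewhile and dropwhile, which split an iterable at the first element that fails a condition (most useful on sorted data):

prices_sorted = [24.99, 29.99, 39.99, 49.99, 64.99]
print(list(itertools.takewhile(lambda p: p < 40, prices_sorted)))  # [24.99, 29.99, 39.99]
print(list(itertools.dropwhile(lambda p: p < 40, prices_sorted)))  # [49.99, 64.99]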

Summary

In this lesson, we learned about generators:

  • Generator functions use yield instead of return
  • yield pauses execution and returns a value
  • Generators are iterators but much easier to write
  • Generator expressions are like list comprehensions but memory-efficient
  • Generators can be chained to create data pipelines
  • Generators can be infinite
  • send(), throw(), and close() provide advanced control
  • yield from delegates to another generator
  • itertools module provides powerful generator utilities

Generators are essential for:

  • Processing large files or datasets
  • Creating memory-efficient data pipelines
  • Working with infinite sequences
  • Lazy evaluation (compute only when needed)

When to use generators:

  • Processing large amounts of data
  • Data doesn’t need to be in memory all at once
  • Creating sequences (finite or infinite)
  • Building data processing pipelines

In the next lesson, we’ll learn about context managers and the with statement—tools for managing resources safely.