Python 101
IDIOMATIC PYTHON

Iterators and Generators

How for loops work under the hood, and how to write your own.

SECTION 01

The iterator protocol

An iterable is anything that knows how to produce an iterator from itself, by implementing __iter__. An iterator is anything that produces values one at a time, by implementing __next__ (iterators also implement __iter__, returning themselves, which is why an iterator works anywhere an iterable is expected). When the iterator runs out, __next__ raises StopIteration.

A for loop is built on these two methods. It calls __iter__ on the iterable to get an iterator, then calls __next__ repeatedly, binding each value to the loop variable. When StopIteration shows up, the loop exits cleanly.
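Spelled out by hand, a for loop is roughly equivalent to this while loop (the for statement catches the StopIteration for you):

python
items = ["a", "b", "c"]

# What `for x in items: print(x)` does under the hood.
it = iter(items)           # calls items.__iter__()
while True:
    try:
        x = next(it)       # calls it.__next__()
    except StopIteration:  # raised when the iterator is exhausted
        break
    print(x)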

This is why a list, a string, a dict, a file, and a custom class can all be used in the same for loop: they all implement the same two methods. The protocol is the machinery behind nearly all iteration in Python; tuple unpacking, comprehensions, and builtins like sum and zip run on it too.

python
nums = [10, 20, 30]
it = iter(nums)
next(it)   # 10
next(it)   # 20
next(it)   # 30
# next(it) -> raises StopIteration
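Implementing the protocol by hand looks like this. A minimal sketch: a Countdown class (an illustrative name, not anything built in) that counts down from n to 1.

python
class Countdown:
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return self          # an iterator returns itself

    def __next__(self):
        if self.n <= 0:
            raise StopIteration
        self.n -= 1
        return self.n + 1

list(Countdown(3))   # [3, 2, 1]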
SECTION 02

yield and pausing

A function with yield somewhere in its body is not a regular function. Calling it returns a generator object: an iterator that runs the body lazily, pausing at each yield and resuming on the next call to next. None of the body runs until that first next call.

This turns out to be a tidy way to write iterators. Instead of implementing __iter__ and __next__ by hand, you write the function as if you were just printing values, and replace the prints with yield. The generator handles the protocol for you.
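You can watch both the laziness and the protocol by driving a tiny generator manually (greet here is just an illustrative example):

python
def greet():
    print("running up to the first yield")
    yield "hello"
    print("resumed after the first yield")
    yield "world"

g = greet()       # creates the generator; no body code runs yet
iter(g) is g      # True: a generator is its own iterator
next(g)           # prints "running up to the first yield", returns "hello"
next(g)           # prints "resumed after the first yield", returns "world"
# next(g) -> raises StopIteration, like any exhausted iterator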

The pause-and-resume model also lets you express infinite sequences (yield x forever in a while True) without blowing memory. Whoever consumes the generator decides when to stop.

python
def count_up_to(n):
    i = 0
    while i < n:
        yield i
        i += 1

list(count_up_to(3))   # [0, 1, 2]
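The infinite version only changes the loop condition; the consumer decides when to stop. A sketch using itertools.islice as the consumer (count_from is an illustrative name; itertools.count is the stdlib equivalent):

python
from itertools import islice

def count_from(start=0):
    i = start
    while True:     # never ends on its own
        yield i
        i += 1

list(islice(count_from(10), 5))   # [10, 11, 12, 13, 14]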
SECTION 03

Pipelines with generators

Generators compose. Each one reads from an input iterable and yields transformed values. Chaining several together produces a pipeline where one item travels through multiple stages, lazily, before reaching the consumer.

The canonical example is a streaming text processor: one generator yields lines from a file, another filters out comments, a third parses each line, a fourth aggregates results. At any moment only one line is in flight, regardless of how big the file is.

The big shift in mindset is from "read all the data, transform all of it, then process all of it" to "set up the stages, then ask the last one for values". The pipeline does the work on demand, which is exactly what large-data scripts need.

python
def lines(path):
    with open(path) as f:
        for line in f:
            yield line.rstrip()

def not_comments(seq):
    for s in seq:
        if not s.startswith("#"):
            yield s

for s in not_comments(lines("config.txt")):
    print(s)
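To complete the four-stage pipeline described above, here is a sketch of a parse stage and an aggregating consumer, reusing lines and not_comments from the block above and assuming a simple key=value config format (the format and the settings name are just for illustration):

python
def parse(seq):
    for s in seq:
        # Assumes "key=value" lines; adapt to your real format.
        key, _, value = s.partition("=")
        yield key.strip(), value.strip()

def settings(path):
    # dict() is the consumer: it pulls items through all three
    # generator stages one at a time, on demand.
    return dict(parse(not_comments(lines(path))))

# settings("config.txt") -> e.g. {"host": "example.com", "port": "8080"}

At no point does the whole file live in memory; only the line currently moving through the stages does.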