Lesson 3 - Speeding Up Queries with Indexes

Welcome to Indexes

In the last lesson you watched Postgres scan all 86,000 rows of homeless_by_coc just to find a handful that matched a filter. That sequential scan is the price of having no shortcut. An index is that shortcut.

As a data engineer, much of your job is making the database fast for everyone else — analysts, dashboards, and applications that query the same tables over and over. The queries a company runs tend to be similar, so you can tune the database specifically for them. The single most effective tuning tool is the index, and in this lesson you will build one and prove its impact with real timings.

By the end of this lesson, you will be able to:

Recognize an Index Scan and understand why primary keys are fast
Create an index with CREATE INDEX and measure the speedup
Explain why an index turns a full scan into a roughly logarithmic lookup
List existing indexes from pg_indexes and drop indexes you no longer need
Judge when an index helps a query and when it does not

You will keep working with the HUD database. Let’s see what an index does.

Data for this lesson

Engine: PostgreSQL. Dataset: homeless-by-CoC U.S. homelessness counts.

Tables used: homeless_by_coc, state_info

Real-world source: HUD — PIT and HIC Data Since 2007 (the HUD homelessness point-in-time counts this homeless_by_coc data is based on). Load it into your own PostgreSQL instance and run the queries in psql or a GUI client.

The Index Scan

In Lesson 2, every SELECT produced a Seq Scan. But not every query does. Filter homeless_by_coc on its id column and the plan changes:

import psycopg2
import json

conn = psycopg2.connect(dbname='hud', user='hud_admin', password='hud123')
cur = conn.cursor()

cur.execute("""
    EXPLAIN (ANALYZE, FORMAT json)
    SELECT * FROM homeless_by_coc
    WHERE id = 10;
""")
query_plan = cur.fetchone()
print(json.dumps(query_plan, indent=2))

The plan now shows a new node type:

[
  [
    {
      "Plan": {
        "Node Type": "Index Scan",
        "Scan Direction": "Forward",
        "Index Name": "homeless_by_coc_pkey",
        "Relation Name": "homeless_by_coc",
        "Startup Cost": 0.29,
        "Total Cost": 8.31,
        "Plan Rows": 1,
        "Actual Total Time": 0.04,
        "Actual Rows": 1,
        "Index Cond": "(id = 10)"
      },
      "Planning Time": 3.051,
      "Execution Time": 0.081
    }
  ]
]

This is an Index Scan, not a Seq Scan. The Index Name is homeless_by_coc_pkey — pkey is short for primary key. When the table was created, the id column was declared a PRIMARY KEY, and Postgres automatically built a B-tree index on it.

Why an Index Is Fast

A B-tree index makes two things efficient:

Finding a row with a given value of the indexed column (find the row where id = 10).
Finding all rows where the indexed column falls in a range (find rows where id is between 42 and 78).

Think of it like a book. A sequential scan is reading the entire book to find one fact. An index is the table of contents: you look up the topic and jump straight to the page. You do not need to know how a B-tree works internally yet — only that it lets Postgres jump near the answer instead of reading everything.

Measuring the Difference

To feel the difference, compare fetching one row by its indexed id against fetching it by a non-indexed text column, coc_name. Here is a real row sitting in the middle of the table:

id	year	state	coc_number	coc_name	measures	count
42704	2013-01-01	PA	PA-505	Chester County CoC	Chronically Homeless Individuals	72

Run EXPLAIN ANALYZE on both queries and read the Execution Time. Note how it is nested two lists deep in the JSON, so you reach it with [0][0]["Execution Time"]:

import psycopg2

conn = psycopg2.connect(dbname='hud', user='hud_admin', password='hud123')
cur = conn.cursor()

cur.execute("""
    EXPLAIN (ANALYZE, FORMAT json)
    SELECT * FROM homeless_by_coc
    WHERE coc_name = 'Chester County CoC'
    LIMIT 1;
""")
coc_name_plan = cur.fetchone()
print(coc_name_plan[0][0]["Execution Time"])

cur.execute("""
    EXPLAIN (ANALYZE, FORMAT json)
    SELECT * FROM homeless_by_coc
    WHERE id = 42704;
""")
id_plan = cur.fetchone()
print(id_plan[0][0]["Execution Time"])

The times come back roughly like this (exact numbers vary between runs):

0.408   # coc_name query (Seq Scan)
0.038   # id query (Index Scan — much faster)

The id query is about ten times faster. The coc_name column has no index, so Postgres must scan rows looking for a match; the id column has its primary-key index, so Postgres jumps straight to the row.

How Much Faster? The Logarithm

When you study B-trees in depth later, you will learn that an Index Scan inspects roughly $\log_2(N)$ rows to find a match, where $N$ is the number of rows. A Seq Scan inspects all $N$ .

The gap is enormous. For a table of one million rows, a Seq Scan checks 1,000,000 rows, but $\log_2(1{,}000{,}000) \approx 19.9$ — an index needs to inspect only about 20. For our homeless_by_coc table with about 86,530 rows, $\log_2(86{,}530) \approx 17$ . The index looks at around 17 rows where the sequential scan might look at all 86,530.

These Are Rough Estimates

The $\log_2(N)$ figure is a simplification. Postgres uses more sophisticated internal mechanics to estimate index costs, which is why the cost formula for an Index Scan is far more complex than the sequential-scan formula from Lesson 2. The takeaway is the order of magnitude, not the exact count.

Creating Your Own Index

Primary keys get an index automatically, but you rarely query by primary key in real analysis. Fortunately you can create an index on any column with the CREATE INDEX command:

CREATE INDEX index_name ON table_name(column_name);

The index name must be unique across the whole database, not just the table — Postgres errors if the name already exists. Create one for the coc_name column:

import psycopg2

conn = psycopg2.connect(dbname='hud', user='hud_admin', password='hud123')
cur = conn.cursor()

cur.execute("""
    CREATE INDEX coc_name_index ON homeless_by_coc(coc_name);
""")
conn.commit()
conn.close()

Note the conn.commit() — creating an index is a real change to the database, so you must commit it.

Confirming the Speedup

Re-run the coc_name lookup from before, now that the index exists:

import psycopg2

conn = psycopg2.connect(dbname='hud', user='hud_admin', password='hud123')
cur = conn.cursor()

cur.execute("""
    EXPLAIN (ANALYZE, FORMAT json)
    SELECT * FROM homeless_by_coc
    WHERE coc_name = 'Chester County CoC'
    LIMIT 1;
""")
coc_name_plan = cur.fetchone()
print(coc_name_plan[0][0]["Execution Time"])

The runtime drops from about 0.4 ms to about 0.1 ms — roughly four times faster. The query that used to demand a full scan now uses your new index.

Indexes Are Not Free

Each index is stored in its own file on disk, so it costs space. Worse, every INSERT, UPDATE, and DELETE must update every affected index, which slows down writes. Create indexes for the queries that matter, not for every column.

Managing Indexes

Production databases grow and change. You need to know which indexes exist so you can confirm the important queries are covered and remove indexes that are no longer used.

Listing Indexes

Postgres records indexes in the pg_indexes view. Its key columns are:

Name	Description
schemaname	Schema containing the table and index
tablename	Table the index is on
indexname	Name of the index
indexdef	A reconstructed `CREATE INDEX` command

The indexdef is especially handy — it shows the exact definition, including which kind of index it is and which column it covers. List the indexes on homeless_by_coc:

import psycopg2

conn = psycopg2.connect(dbname='hud', user='hud_admin', password='hud123')
cur = conn.cursor()

cur.execute("""
    SELECT * FROM pg_indexes
    WHERE tablename = 'homeless_by_coc';
""")
indexes = cur.fetchall()
for index in indexes:
    print(index)

You get two indexes: the coc_name_index you just created, and homeless_by_coc_pkey, the one Postgres made for the primary key.

Kinds of Indexes

By default CREATE INDEX builds a btree (B-tree) index, but Postgres supports six kinds: btree, hash, gist, spgist, gin, and brin. The two most common are:

btree — the default, good for range queries (and equality too). It is the default precisely because it accelerates the most common queries.
hash — specialized for equality (WHERE col = value); it can pinpoint a matching row extremely quickly.

You will meet more of these in the next lesson.

Dropping Indexes

When an index wastes space or is no longer needed, remove it with DROP INDEX:

DROP INDEX example_name;

If the index might not exist, add IF EXISTS so Postgres does nothing instead of raising an error:

DROP INDEX IF EXISTS example_name;

Drop the index you created:

import psycopg2

conn = psycopg2.connect(dbname='hud', user='hud_admin', password='hud123')
cur = conn.cursor()

cur.execute("""
    DROP INDEX IF EXISTS coc_name_index;
""")
conn.commit()
conn.close()

When an Index Does Not Help

An index is not a guaranteed win. Consider the join from Lesson 2:

SELECT homeless_by_coc.state, homeless_by_coc.coc_number,
       homeless_by_coc.coc_name, state_info.name
FROM homeless_by_coc, state_info
WHERE homeless_by_coc.state = state_info.postal;

The two columns in the WHERE clause are candidates for indexing. But state_info has only 50 rows — far too small to benefit from an index, since scanning 50 rows is faster than consulting an index. That leaves homeless_by_coc.state. So you might create state_index on it and re-time the query. Running the experiment five times gives:

Without index (ms)	With index (ms)
100.695	99.184
96.114	104.214
107.321	95.108
99.674	162.588
96.727	98.039

There is no clear winner — the index sometimes makes things slower. If you inspect both plans, they are identical: Postgres decided not to use state_index at all.

Why the Index Was Ignored

Count the rows the join actually returns:

SELECT COUNT(*)
FROM homeless_by_coc, state_info
WHERE homeless_by_coc.state = state_info.postal;

The table has about 86,531 rows, and the join returns 85,450 of them — nearly the whole table. An index quickly finds where a matching row is, but it still has to retrieve every result. When the result is almost the entire table, that pointer-chasing buys nothing, so Postgres correctly skips the index. This is the general rule: if a query returns a large fraction of the table, an index probably will not help.

When It Does Help

Now restrict the join to a selective filter — rows where count > 5000:

SELECT homeless_by_coc.state, homeless_by_coc.coc_number,
       homeless_by_coc.coc_name, state_info.name
FROM homeless_by_coc
INNER JOIN state_info
ON homeless_by_coc.state = state_info.postal
WHERE homeless_by_coc.count > 5000;

B-trees excel at ranges, so an index on count should let Postgres grab just the count > 5000 rows instead of scanning and comparing every row. Run the before-and-after:

import psycopg2

conn = psycopg2.connect(dbname='hud', user='hud_admin', password='hud123')
cur = conn.cursor()

cur.execute("""
    EXPLAIN (ANALYZE, FORMAT json)
    SELECT homeless_by_coc.state, homeless_by_coc.coc_number,
           homeless_by_coc.coc_name, state_info.name
    FROM homeless_by_coc
    INNER JOIN state_info
    ON homeless_by_coc.state = state_info.postal
    WHERE homeless_by_coc.count > 5000;
""")
no_index_plan = cur.fetchone()
print(no_index_plan[0][0]["Execution Time"])

cur.execute("CREATE INDEX count_index ON homeless_by_coc(count)")

cur.execute("""
    EXPLAIN (ANALYZE, FORMAT json)
    SELECT homeless_by_coc.state, homeless_by_coc.coc_number,
           homeless_by_coc.coc_name, state_info.name
    FROM homeless_by_coc
    INNER JOIN state_info
    ON homeless_by_coc.state = state_info.postal
    WHERE homeless_by_coc.count > 5000;
""")
index_plan = cur.fetchone()
print(index_plan[0][0]["Execution Time"])

conn.commit()

This time the index on count delivers a large speedup, because the filter is selective: only a small fraction of rows have count > 5000, so the index lets Postgres skip the rest.

Practice Exercises

Exercise 1: Index a Filter Column

Create an index named year_index on the year column of homeless_by_coc. Then run EXPLAIN ANALYZE on SELECT * FROM homeless_by_coc WHERE year = '2007-01-01' and check whether Postgres uses it. Drop the index when you are done.

Hint

Use CREATE INDEX year_index ON homeless_by_coc(year); and remember to conn.commit(). Whether Postgres uses it depends on how many rows match that year — a selective filter is more likely to trigger an Index Scan.

Exercise 2: Inspect Existing Indexes

Query pg_indexes for every index on homeless_by_coc and print just the indexname and indexdef columns. Identify which index belongs to the primary key.

Hint

Select specific columns: SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'homeless_by_coc';. The primary-key index is named with the _pkey suffix.

Exercise 3: Predict When an Index Helps

Without running anything, predict whether an index on state will speed up SELECT * FROM homeless_by_coc WHERE state = 'CA' versus SELECT COUNT(*) FROM homeless_by_coc. Explain your reasoning, then test it.

Hint

The state = 'CA' filter returns a small fraction of rows, so an index can help. The bare COUNT(*) touches every row, so an index on state is irrelevant — there is no filter to narrow.

Summary

You built your first index and measured a four-fold speedup, then learned that the win is not automatic. An index transforms a full scan into a near-logarithmic lookup, but only pays off when a query is selective — when it returns a small slice of the table. For broad queries that touch most rows, Postgres will rightly ignore the index, and you saw it do exactly that on a join.

Key Concepts

Index Scan — a plan node that uses an index to jump near matching rows instead of scanning the whole table.
B-tree index — the default index type; supports fast equality and range lookups, inspecting roughly $\log_2(N)$ rows instead of all $N$ .
CREATE INDEX — builds an index on a column; the name must be unique across the database, and the change must be committed.
pg_indexes — a view listing every index, including the indexdef that reconstructs its CREATE INDEX command.
Index trade-offs — indexes cost disk space and slow down writes, and they help only when a query returns a small fraction of the table.

Why This Matters

Indexes are the first and most powerful lever a data engineer pulls when queries get slow, and knowing when not to add one is just as valuable as knowing when to. An unnecessary index quietly taxes every write and wastes storage while delivering nothing. The discipline you practiced here — measure first, add the index, measure again, and accept the verdict the plan gives you — is exactly how performance tuning is done in production. Next you will go beyond single-column indexes to the multi-column, expression, and partial indexes that solve harder query patterns.

Next Steps

Continue to Lesson 4 - Advanced Indexing Strategies

Create multi-column indexes, choose the right index type, and order index columns for maximum benefit.

Back to Module Overview

Return to the Optimizing PostgreSQL module overview

Continue Building Your Skills

You now hold the most important tool in query optimization, and just as importantly, the judgment to know when to use it. Keep the habit of benchmarking every index you create — the plan never lies. In the next lesson you will combine columns, expressions, and conditions into more powerful indexes that target exactly the queries your team runs most.

Lesson 2 - Reading Query Plans with EXPLAIN

Lesson 4 - Advanced Indexing Strategies

Courses

DATATWEETS

Title here

Lesson 3 - Speeding Up Queries with Indexes

Welcome to Indexes

The Index Scan

Why an Index Is Fast

Measuring the Difference

How Much Faster? The Logarithm

Creating Your Own Index

Confirming the Speedup

Managing Indexes

Listing Indexes

Kinds of Indexes

Dropping Indexes

When an Index Does Not Help

Why the Index Was Ignored

When It Does Help

Practice Exercises

Exercise 1: Index a Filter Column

Exercise 2: Inspect Existing Indexes

Exercise 3: Predict When an Index Helps

Summary

Key Concepts

Why This Matters

Next Steps

Continue to Lesson 4 - Advanced Indexing Strategies

Back to Module Overview

Continue Building Your Skills

Lesson 3 - Speeding Up Queries with Indexes

Welcome to Indexes#

The Index Scan#

Why an Index Is Fast#

Measuring the Difference#

How Much Faster? The Logarithm#

Creating Your Own Index#

Confirming the Speedup#

Managing Indexes#

Listing Indexes#

Kinds of Indexes#

Dropping Indexes#

When an Index Does Not Help#

Why the Index Was Ignored#

When It Does Help#

Practice Exercises#

Exercise 1: Index a Filter Column#

Exercise 2: Inspect Existing Indexes#

Exercise 3: Predict When an Index Helps#

Summary#

Key Concepts#

Why This Matters#

Next Steps#

Continue to Lesson 4 - Advanced Indexing Strategies

Back to Module Overview

Continue Building Your Skills#

Welcome to Indexes

The Index Scan

Why an Index Is Fast

Measuring the Difference

How Much Faster? The Logarithm

Creating Your Own Index

Confirming the Speedup

Managing Indexes

Listing Indexes

Kinds of Indexes

Dropping Indexes

When an Index Does Not Help

Why the Index Was Ignored

When It Does Help

Practice Exercises

Exercise 1: Index a Filter Column

Exercise 2: Inspect Existing Indexes

Exercise 3: Predict When an Index Helps

Summary

Key Concepts

Why This Matters

Next Steps

Continue Building Your Skills