Lesson 3 - Speeding Up Queries with Indexes
Welcome to Indexes
In the last lesson you watched Postgres scan all 86,000 rows of homeless_by_coc just to find a handful that matched a filter. That sequential scan is the price of having no shortcut. An index is that shortcut.
As a data engineer, much of your job is making the database fast for everyone else — analysts, dashboards, and applications that query the same tables over and over. The queries a company runs tend to be similar, so you can tune the database specifically for them. The single most effective tuning tool is the index, and in this lesson you will build one and prove its impact with real timings.
By the end of this lesson, you will be able to:
- Recognize an
Index Scanand understand why primary keys are fast - Create an index with
CREATE INDEXand measure the speedup - Explain why an index turns a full scan into a roughly logarithmic lookup
- List existing indexes from
pg_indexesand drop indexes you no longer need - Judge when an index helps a query and when it does not
You will keep working with the HUD database. Let’s see what an index does.
Data for this lesson
Engine: PostgreSQL. Dataset: homeless-by-CoC U.S. homelessness counts.
Tables used: homeless_by_coc, state_info
Real-world source: HUD — PIT and HIC Data Since 2007 (the HUD homelessness point-in-time counts this homeless_by_coc data is based on). Load it into your own PostgreSQL instance and run the queries in psql or a GUI client.
The Index Scan
In Lesson 2, every SELECT produced a Seq Scan. But not every query does. Filter homeless_by_coc on its id column and the plan changes:
import psycopg2
import json
conn = psycopg2.connect(dbname='hud', user='hud_admin', password='hud123')
cur = conn.cursor()
cur.execute("""
EXPLAIN (ANALYZE, FORMAT json)
SELECT * FROM homeless_by_coc
WHERE id = 10;
""")
query_plan = cur.fetchone()
print(json.dumps(query_plan, indent=2))The plan now shows a new node type:
[
[
{
"Plan": {
"Node Type": "Index Scan",
"Scan Direction": "Forward",
"Index Name": "homeless_by_coc_pkey",
"Relation Name": "homeless_by_coc",
"Startup Cost": 0.29,
"Total Cost": 8.31,
"Plan Rows": 1,
"Actual Total Time": 0.04,
"Actual Rows": 1,
"Index Cond": "(id = 10)"
},
"Planning Time": 3.051,
"Execution Time": 0.081
}
]
]This is an Index Scan, not a Seq Scan. The Index Name is homeless_by_coc_pkey — pkey is short for primary key. When the table was created, the id column was declared a PRIMARY KEY, and Postgres automatically built a B-tree index on it.
Why an Index Is Fast
A B-tree index makes two things efficient:
- Finding a row with a given value of the indexed column (find the row where
id = 10). - Finding all rows where the indexed column falls in a range (find rows where
idis between42and78).
Think of it like a book. A sequential scan is reading the entire book to find one fact. An index is the table of contents: you look up the topic and jump straight to the page. You do not need to know how a B-tree works internally yet — only that it lets Postgres jump near the answer instead of reading everything.
Measuring the Difference
To feel the difference, compare fetching one row by its indexed id against fetching it by a non-indexed text column, coc_name. Here is a real row sitting in the middle of the table:
| id | year | state | coc_number | coc_name | measures | count |
|---|---|---|---|---|---|---|
| 42704 | 2013-01-01 | PA | PA-505 | Chester County CoC | Chronically Homeless Individuals | 72 |
Run EXPLAIN ANALYZE on both queries and read the Execution Time. Note how it is nested two lists deep in the JSON, so you reach it with [0][0]["Execution Time"]:
import psycopg2
conn = psycopg2.connect(dbname='hud', user='hud_admin', password='hud123')
cur = conn.cursor()
cur.execute("""
EXPLAIN (ANALYZE, FORMAT json)
SELECT * FROM homeless_by_coc
WHERE coc_name = 'Chester County CoC'
LIMIT 1;
""")
coc_name_plan = cur.fetchone()
print(coc_name_plan[0][0]["Execution Time"])
cur.execute("""
EXPLAIN (ANALYZE, FORMAT json)
SELECT * FROM homeless_by_coc
WHERE id = 42704;
""")
id_plan = cur.fetchone()
print(id_plan[0][0]["Execution Time"])The times come back roughly like this (exact numbers vary between runs):
0.408 # coc_name query (Seq Scan)
0.038 # id query (Index Scan — much faster)The id query is about ten times faster. The coc_name column has no index, so Postgres must scan rows looking for a match; the id column has its primary-key index, so Postgres jumps straight to the row.
How Much Faster? The Logarithm
When you study B-trees in depth later, you will learn that an Index Scan inspects roughly rows to find a match, where is the number of rows. A Seq Scan inspects all .
The gap is enormous. For a table of one million rows, a Seq Scan checks 1,000,000 rows, but — an index needs to inspect only about 20. For our homeless_by_coc table with about 86,530 rows, . The index looks at around 17 rows where the sequential scan might look at all 86,530.
These Are Rough Estimates
The figure is a simplification. Postgres uses more sophisticated internal mechanics to estimate index costs, which is why the cost formula for an Index Scan is far more complex than the sequential-scan formula from Lesson 2. The takeaway is the order of magnitude, not the exact count.
Creating Your Own Index
Primary keys get an index automatically, but you rarely query by primary key in real analysis. Fortunately you can create an index on any column with the CREATE INDEX command:
CREATE INDEX index_name ON table_name(column_name);The index name must be unique across the whole database, not just the table — Postgres errors if the name already exists. Create one for the coc_name column:
import psycopg2
conn = psycopg2.connect(dbname='hud', user='hud_admin', password='hud123')
cur = conn.cursor()
cur.execute("""
CREATE INDEX coc_name_index ON homeless_by_coc(coc_name);
""")
conn.commit()
conn.close()Note the conn.commit() — creating an index is a real change to the database, so you must commit it.
Confirming the Speedup
Re-run the coc_name lookup from before, now that the index exists:
import psycopg2
conn = psycopg2.connect(dbname='hud', user='hud_admin', password='hud123')
cur = conn.cursor()
cur.execute("""
EXPLAIN (ANALYZE, FORMAT json)
SELECT * FROM homeless_by_coc
WHERE coc_name = 'Chester County CoC'
LIMIT 1;
""")
coc_name_plan = cur.fetchone()
print(coc_name_plan[0][0]["Execution Time"])The runtime drops from about 0.4 ms to about 0.1 ms — roughly four times faster. The query that used to demand a full scan now uses your new index.
Indexes Are Not Free
Each index is stored in its own file on disk, so it costs space. Worse, every INSERT, UPDATE, and DELETE must update every affected index, which slows down writes. Create indexes for the queries that matter, not for every column.
Managing Indexes
Production databases grow and change. You need to know which indexes exist so you can confirm the important queries are covered and remove indexes that are no longer used.
Listing Indexes
Postgres records indexes in the pg_indexes view. Its key columns are:
| Name | Description |
|---|---|
| schemaname | Schema containing the table and index |
| tablename | Table the index is on |
| indexname | Name of the index |
| indexdef | A reconstructed CREATE INDEX command |
The indexdef is especially handy — it shows the exact definition, including which kind of index it is and which column it covers. List the indexes on homeless_by_coc:
import psycopg2
conn = psycopg2.connect(dbname='hud', user='hud_admin', password='hud123')
cur = conn.cursor()
cur.execute("""
SELECT * FROM pg_indexes
WHERE tablename = 'homeless_by_coc';
""")
indexes = cur.fetchall()
for index in indexes:
print(index)You get two indexes: the coc_name_index you just created, and homeless_by_coc_pkey, the one Postgres made for the primary key.
Kinds of Indexes
By default CREATE INDEX builds a btree (B-tree) index, but Postgres supports six kinds: btree, hash, gist, spgist, gin, and brin. The two most common are:
btree— the default, good for range queries (and equality too). It is the default precisely because it accelerates the most common queries.hash— specialized for equality (WHERE col = value); it can pinpoint a matching row extremely quickly.
You will meet more of these in the next lesson.
Dropping Indexes
When an index wastes space or is no longer needed, remove it with DROP INDEX:
DROP INDEX example_name;If the index might not exist, add IF EXISTS so Postgres does nothing instead of raising an error:
DROP INDEX IF EXISTS example_name;Drop the index you created:
import psycopg2
conn = psycopg2.connect(dbname='hud', user='hud_admin', password='hud123')
cur = conn.cursor()
cur.execute("""
DROP INDEX IF EXISTS coc_name_index;
""")
conn.commit()
conn.close()When an Index Does Not Help
An index is not a guaranteed win. Consider the join from Lesson 2:
SELECT homeless_by_coc.state, homeless_by_coc.coc_number,
homeless_by_coc.coc_name, state_info.name
FROM homeless_by_coc, state_info
WHERE homeless_by_coc.state = state_info.postal;The two columns in the WHERE clause are candidates for indexing. But state_info has only 50 rows — far too small to benefit from an index, since scanning 50 rows is faster than consulting an index. That leaves homeless_by_coc.state. So you might create state_index on it and re-time the query. Running the experiment five times gives:
| Without index (ms) | With index (ms) |
|---|---|
| 100.695 | 99.184 |
| 96.114 | 104.214 |
| 107.321 | 95.108 |
| 99.674 | 162.588 |
| 96.727 | 98.039 |
There is no clear winner — the index sometimes makes things slower. If you inspect both plans, they are identical: Postgres decided not to use state_index at all.
Why the Index Was Ignored
Count the rows the join actually returns:
SELECT COUNT(*)
FROM homeless_by_coc, state_info
WHERE homeless_by_coc.state = state_info.postal;85450The table has about 86,531 rows, and the join returns 85,450 of them — nearly the whole table. An index quickly finds where a matching row is, but it still has to retrieve every result. When the result is almost the entire table, that pointer-chasing buys nothing, so Postgres correctly skips the index. This is the general rule: if a query returns a large fraction of the table, an index probably will not help.
When It Does Help
Now restrict the join to a selective filter — rows where count > 5000:
SELECT homeless_by_coc.state, homeless_by_coc.coc_number,
homeless_by_coc.coc_name, state_info.name
FROM homeless_by_coc
INNER JOIN state_info
ON homeless_by_coc.state = state_info.postal
WHERE homeless_by_coc.count > 5000;B-trees excel at ranges, so an index on count should let Postgres grab just the count > 5000 rows instead of scanning and comparing every row. Run the before-and-after:
import psycopg2
conn = psycopg2.connect(dbname='hud', user='hud_admin', password='hud123')
cur = conn.cursor()
cur.execute("""
EXPLAIN (ANALYZE, FORMAT json)
SELECT homeless_by_coc.state, homeless_by_coc.coc_number,
homeless_by_coc.coc_name, state_info.name
FROM homeless_by_coc
INNER JOIN state_info
ON homeless_by_coc.state = state_info.postal
WHERE homeless_by_coc.count > 5000;
""")
no_index_plan = cur.fetchone()
print(no_index_plan[0][0]["Execution Time"])
cur.execute("CREATE INDEX count_index ON homeless_by_coc(count)")
cur.execute("""
EXPLAIN (ANALYZE, FORMAT json)
SELECT homeless_by_coc.state, homeless_by_coc.coc_number,
homeless_by_coc.coc_name, state_info.name
FROM homeless_by_coc
INNER JOIN state_info
ON homeless_by_coc.state = state_info.postal
WHERE homeless_by_coc.count > 5000;
""")
index_plan = cur.fetchone()
print(index_plan[0][0]["Execution Time"])
conn.commit()This time the index on count delivers a large speedup, because the filter is selective: only a small fraction of rows have count > 5000, so the index lets Postgres skip the rest.
Practice Exercises
Exercise 1: Index a Filter Column
Create an index named year_index on the year column of homeless_by_coc. Then run EXPLAIN ANALYZE on SELECT * FROM homeless_by_coc WHERE year = '2007-01-01' and check whether Postgres uses it. Drop the index when you are done.
Hint
Use CREATE INDEX year_index ON homeless_by_coc(year); and remember to conn.commit(). Whether Postgres uses it depends on how many rows match that year — a selective filter is more likely to trigger an Index Scan.
Exercise 2: Inspect Existing Indexes
Query pg_indexes for every index on homeless_by_coc and print just the indexname and indexdef columns. Identify which index belongs to the primary key.
Hint
Select specific columns: SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'homeless_by_coc';. The primary-key index is named with the _pkey suffix.
Exercise 3: Predict When an Index Helps
Without running anything, predict whether an index on state will speed up SELECT * FROM homeless_by_coc WHERE state = 'CA' versus SELECT COUNT(*) FROM homeless_by_coc. Explain your reasoning, then test it.
Hint
The state = 'CA' filter returns a small fraction of rows, so an index can help. The bare COUNT(*) touches every row, so an index on state is irrelevant — there is no filter to narrow.
Summary
You built your first index and measured a four-fold speedup, then learned that the win is not automatic. An index transforms a full scan into a near-logarithmic lookup, but only pays off when a query is selective — when it returns a small slice of the table. For broad queries that touch most rows, Postgres will rightly ignore the index, and you saw it do exactly that on a join.
Key Concepts
- Index Scan — a plan node that uses an index to jump near matching rows instead of scanning the whole table.
- B-tree index — the default index type; supports fast equality and range lookups, inspecting roughly rows instead of all .
CREATE INDEX— builds an index on a column; the name must be unique across the database, and the change must be committed.pg_indexes— a view listing every index, including theindexdefthat reconstructs itsCREATE INDEXcommand.- Index trade-offs — indexes cost disk space and slow down writes, and they help only when a query returns a small fraction of the table.
Why This Matters
Indexes are the first and most powerful lever a data engineer pulls when queries get slow, and knowing when not to add one is just as valuable as knowing when to. An unnecessary index quietly taxes every write and wastes storage while delivering nothing. The discipline you practiced here — measure first, add the index, measure again, and accept the verdict the plan gives you — is exactly how performance tuning is done in production. Next you will go beyond single-column indexes to the multi-column, expression, and partial indexes that solve harder query patterns.
Next Steps
Continue to Lesson 4 - Advanced Indexing Strategies
Create multi-column indexes, choose the right index type, and order index columns for maximum benefit.
Back to Module Overview
Return to the Optimizing PostgreSQL module overview
Continue Building Your Skills
You now hold the most important tool in query optimization, and just as importantly, the judgment to know when to use it. Keep the habit of benchmarking every index you create — the plan never lies. In the next lesson you will combine columns, expressions, and conditions into more powerful indexes that target exactly the queries your team runs most.