Lesson 4 - Advanced Indexing Strategies

Welcome to Advanced Indexing

A single-column index speeds up queries that filter on one column. But real queries rarely stop at one filter. Analysts ask for the rows in a given state and a given year, for case-insensitive text matches, and for slices that ignore the boring rows entirely. To serve those patterns efficiently, you need more than the basic index.

In this lesson you expand your indexing toolkit. You will learn how Postgres decides between scan strategies, build indexes that span multiple columns, create indexes on expressions and on subsets of rows, and — crucially — understand why the order of columns in an index changes which queries it can accelerate.

By the end of this lesson, you will be able to:

  • Read Bitmap Index Scan and Bitmap Heap Scan nodes and explain heap pages
  • Create multi-column indexes and order their columns correctly
  • Explain why an index helps a query only on a leading prefix of its columns
  • Create expression indexes (for example, case-insensitive search) and partial indexes
  • Combine these techniques to drive a real query below one millisecond

You will continue with the HUD database. Let’s start by adding a second filter.

Data for this lesson

Engine: PostgreSQL. Dataset: homeless-by-CoC U.S. homelessness counts.

Tables used: homeless_by_coc, state_info

Real-world source: HUD — PIT and HIC Data Since 2007 (the HUD homelessness point-in-time counts this homeless_by_coc data is based on). Load it into your own PostgreSQL instance and run the queries in psql or a GUI client.


Querying with Multiple Filters

A common need is to filter on two columns at once. Suppose you want every Florida row from after 2010:

SELECT * FROM homeless_by_coc
WHERE state = 'FL' AND year > '2010-01-01';

Create a single index on state and explain a similar two-filter query to see what plan Postgres chooses:

import json
import psycopg2

conn = psycopg2.connect("dbname=hud user=hud_admin password=hud_pwd")
cur = conn.cursor()
cur.execute("CREATE INDEX state_index ON homeless_by_coc(state);")
conn.commit()
cur.execute("""
    EXPLAIN (FORMAT json)
    SELECT state FROM homeless_by_coc
    WHERE state = 'CA' AND year < '2008-01-01';
""")
query_plan = cur.fetchone()
print(json.dumps(query_plan[0], indent=2))

The plan introduces two scan types you have not seen:

[
  {
    "Plan": {
      "Node Type": "Bitmap Heap Scan",
      "Relation Name": "homeless_by_coc",
      "Startup Cost": 11.7,
      "Total Cost": 903.46,
      "Plan Rows": 144,
      "Recheck Cond": "(state = 'CA'::bpchar)",
      "Filter": "(year < '2008-01-01'::date)",
      "Plans": [
        {
          "Node Type": "Bitmap Index Scan",
          "Index Name": "state_index",
          "Total Cost": 11.67,
          "Plan Rows": 433,
          "Index Cond": "(state = 'CA'::bpchar)"
        }
      ]
    }
  }
]

Instead of a plain Index Scan, you got a Bitmap Index Scan feeding a Bitmap Heap Scan. To understand why, you need to know how Postgres stores rows on disk.


Heap Pages and Bitmap Scans

The rows of a table are stored in fixed-size blocks called heap pages. You can picture a table as a stack of pages, each holding a group of rows. Heap pages live on disk, and loading one into memory is a costly operation.

Two Ways to Use an Index

A plain Index Scan works row by row: for each matching row, it asks the index which page and position the row is at, loads that page, and reads the row. If a query returns many rows, this means loading pages over and over — sometimes loading the same page multiple times. That is exactly why, as you saw in Lesson 3, indexes can perform poorly for large result sets.

The bitmap strategy avoids the repeated loads:

  • Bitmap Index Scan — uses the index to identify all the heap pages that contain matching rows.
  • Bitmap Heap Scan — reads each of those pages once, fully, and rechecks the conditions on the rows it finds.

The advantage is that every page is loaded only once. The disadvantage is that each page is read in full, so Postgres inspects some rows that do not match — that is why the Bitmap Heap Scan node carries a Recheck Cond. The planner weighs this trade-off for each query. When matching rows are concentrated in a few pages, the bitmap approach is very efficient; when nearly all rows match, Postgres skips the index and does a Seq Scan instead.


Multi-Column Indexes

A single-column index on state could only help with the state filter; the year filter still required scanning. You can do better with a multi-column index that covers both columns. The syntax adds the second column inside the parentheses, separated by a comma:

CREATE INDEX col1_col2_index ON table_name(col1_name, col2_name);

For the Florida-and-year query pattern, the columns in the WHERE clause are state and year, so index both:

import psycopg2

conn = psycopg2.connect("dbname=hud user=hud_admin password=hud_pwd")
cur = conn.cursor()

cur.execute("""
    EXPLAIN (ANALYZE, FORMAT json)
    SELECT * FROM homeless_by_coc
    WHERE state='CA' AND year < '2008-01-01';
""")
plan_single_index = cur.fetchone()
print(plan_single_index[0][0]["Execution Time"])

cur.execute("CREATE INDEX state_year_index ON homeless_by_coc(state, year);")
conn.commit()

cur.execute("""
    EXPLAIN (ANALYZE, FORMAT json)
    SELECT * FROM homeless_by_coc
    WHERE state='CA' AND year < '2008-01-01';
""")
plan_multi_index = cur.fetchone()
print(plan_multi_index[0][0]["Execution Time"])

The multi-column index produces a significant speedup over the single-column one, because Postgres can use the index to narrow on state and year together rather than filtering year afterward.

How Multi-Column Indexes Work

A multi-column index keeps its values sorted, first by the leading column and then, within each value of the leading column, by the next. Imagine a small table with columns id, letter, and number, indexed on (letter, number). The index sorts the pairs by letter, and for each letter, sorts the number values. Each entry maps back to the row’s id.

That layout makes a query like this fast:

SELECT * FROM table
WHERE letter = 'B' AND number = 2;

Postgres jumps to the letter = 'B' block, then — because the numbers within that block are sorted — quickly finds number = 2. This is binary search: to find a value in sorted data, you check the middle, discard the half that cannot contain it, and repeat. Like finding page 100 in a book, you open near the middle, see page 250, and know to look left. A B-tree is the data structure that makes this work while still supporting inserts and deletes.


Why Column Order Matters

The order of columns in a multi-column index is not cosmetic — it determines which queries the index can accelerate. With an index on (letter, number):

  • A query filtering on letter alone can use the index, because letter is sorted globally.
  • A query filtering on number alone cannot, because number is only sorted within each letter, not across all of them.
SELECT * FROM table
WHERE number = 1;

The rows with number = 1 are scattered throughout the index — there is no contiguous block to jump to — so the index does not help.

The Prefix Rule

This generalizes. Postgres allows up to 32 columns in one index:

CREATE INDEX multi_column_index ON table_name(col1, col2, col3, ..., colN);

An index on (col1, col2, col3, col4) can accelerate queries filtering on:

  • col1
  • col1 and col2
  • col1, col2, and col3
  • col1, col2, col3, and col4

In other words, any leading prefix of the index columns. It cannot help a query on col1, col3, and col4 that skips col2, because col3 is not sorted relative to col1 alone. You can confirm Postgres ignores an index when the prefix is broken:

import psycopg2
import json

conn = psycopg2.connect("dbname=hud user=hud_admin password=hud_pwd")
cur = conn.cursor()
cur.execute("CREATE INDEX state_count_year_index ON homeless_by_coc(state, count, year);")
conn.commit()
cur.execute("""
    EXPLAIN (ANALYZE, FORMAT json)
    SELECT * FROM homeless_by_coc
    WHERE year > '2011-01-01' AND count > 5000;
""")
query_plan = cur.fetchone()
print(json.dumps(query_plan, indent=2))

The query filters on year and count but not state, the leading column. Since the prefix is broken at the very start, Postgres falls back to a Seq Scan and ignores state_count_year_index entirely.

Put the Most Selective, Most-Filtered Column First

Because only leading prefixes are usable, the first column should be one your queries almost always filter on. A multi-column index is most valuable when its leading columns match the columns your common queries constrain.


Indexes on Expressions

Sometimes you filter not on a column’s raw value but on a transformation of it. The classic case is case-insensitive text search:

SELECT * FROM state_info
WHERE LOWER(state) = 'alabama';

Without help, Postgres scans every row, applies LOWER() to state, and compares. A plain index on state will not help, because the index stores 'Alabama' as-is and does not know it equals 'alabama'.

The solution is an expression index, which stores the result of applying a function:

CREATE INDEX state_lower_case_index ON state_info(LOWER(state));

Now any query using the same expression can use the index. Create one for case-insensitive search on the measures column:

import psycopg2

conn = psycopg2.connect("dbname=hud user=hud_admin password=hud_pwd")
cur = conn.cursor()
cur.execute("CREATE INDEX measures_index ON homeless_by_coc(LOWER(measures));")
conn.commit()
cur.execute("""
    SELECT * FROM homeless_by_coc
    WHERE LOWER(measures) = 'total homeless';
""")
total_homeless = cur.fetchall()

Expression indexes have a cost: the expression must be computed for every row on INSERT and UPDATE. The good news is it is not recomputed during a lookup — only at write time. So reserve expression indexes for tables you read far more than you write.


Partial Indexes

You can also index only a subset of rows with a partial index, by adding a WHERE clause to CREATE INDEX. A common use is to exclude NULL values:

CREATE INDEX not_null_index ON table_name(col_name) WHERE col_name IS NOT NULL;

The condition can even be on a different column than the one indexed:

CREATE INDEX state_partial_index ON homeless_by_coc(state) WHERE year > '2011-01-01';

Why Partial Indexes Help

There are two strong motivations. First, you can avoid indexing very frequent values. Since a query for a common value will not use the index anyway (large result sets skip the index, as you saw in Lesson 3), there is no point storing those rows in the index — leaving them out shrinks the index and speeds up the queries that do use it.

Second, partial indexes target a constant filter. If many queries pin a column to one value:

SELECT * FROM table_name
WHERE column_name = constant_value;

you can build an index covering only those rows:

CREATE INDEX partial_index_name ON table_name(any_column_name) WHERE column_name = constant_value;

In the HUD data, rows with count = 0 are special — they mark entries with no records. If your queries focus on those, create a partial index for them:

import psycopg2

conn = psycopg2.connect("dbname=hud user=hud_admin password=hud_pwd")
cur = conn.cursor()
cur.execute("CREATE INDEX partial_state_index ON homeless_by_coc(state) WHERE count = 0;")
conn.commit()
cur.execute("""
    SELECT * FROM homeless_by_coc
    WHERE state = 'CA' AND count = 0;
""")
ca_zero_count = cur.fetchall()

Putting It All Together

Combine everything into one index tuned for a real query. The HUD team runs this parameterized query constantly, passing state and measures as arguments and always pinning count = 0:

def query_state_and_measure_zero_count(state, measures):
    conn = psycopg2.connect(dbname='hud', user='hud_admin', password='eRqg123EEkl')
    cur = conn.cursor()
    cur.execute("""
        SELECT hbc.year, si.name, hbc.count
        FROM homeless_by_coc AS hbc
        INNER JOIN state_info AS si
        ON hbc.state = si.postal
        WHERE hbc.state = %s AND LOWER(hbc.measures) = %s AND hbc.count = 0;
    """, (state, measures))
    return cur.fetchall()

Three observations drive the index design: state is filtered on equality, measures is filtered case-insensitively with LOWER(), and count = 0 is a constant. That points to a multi-column expression index — on state and LOWER(measures) — restricted to the count = 0 rows with a partial condition. Note that expressions and ordinary columns can be mixed in a multi-column index:

import psycopg2
import json

conn = psycopg2.connect("dbname=hud user=hud_admin password=hud_pwd")
cur = conn.cursor()
cur.execute("""
    CREATE INDEX state_year_measures_idx
    ON homeless_by_coc(state, LOWER(measures))
    WHERE count = 0;
""")
conn.commit()
cur.execute("""
    EXPLAIN (ANALYZE, FORMAT json)
    SELECT hbc.year, si.name, hbc.count
    FROM homeless_by_coc AS hbc
    INNER JOIN state_info AS si
    ON hbc.state = si.postal
    WHERE hbc.state = 'CA' AND hbc.count = 0 AND LOWER(hbc.measures) = 'total homeless';
""")
print(json.dumps(cur.fetchone(), indent=2))

With this single, carefully designed index, the query’s Execution Time drops below one millisecond. The index matches the query exactly: it leads with the always-filtered state, includes the LOWER(measures) expression so the case-insensitive match can use it, and excludes every row where count is not zero.


Practice Exercises

Exercise 1: Order Two Columns Two Ways

Create idx_a on homeless_by_coc(year, state) and idx_b on homeless_by_coc(state, year). For a query filtering only on state, predict which index can be used, then confirm with EXPLAIN. Drop both when done.

Hint

Only the index whose leading column is state can serve a state-only filter. The other index has state in second position, where it is not globally sorted.

Exercise 2: Build a Case-Insensitive Index

Create an expression index on LOWER(coc_name) for homeless_by_coc, then run a query filtering WHERE LOWER(coc_name) = 'chester county coc'. Use EXPLAIN ANALYZE to confirm the index is used.

Hint

The query must apply the same expression the index uses — LOWER(coc_name) — or Postgres cannot match it to the expression index.

Exercise 3: Design a Partial Index

The HUD team frequently queries only rows where count > 0. Create a partial index on year restricted to count > 0, then explain a query selecting rows where year > '2011-01-01' AND count > 0. Does Postgres use it?

Hint

Use CREATE INDEX partial_year_index ON homeless_by_coc(year) WHERE count > 0;. Whether Postgres uses it still depends on how selective the year filter is within the count > 0 subset.


Summary

You moved well beyond the single-column index. You learned how Postgres reads rows in heap pages and why that produces bitmap scans, how multi-column indexes sort their columns in order, and why only a leading prefix of those columns is usable. You then created expression indexes for transformed values and partial indexes for row subsets, and combined all of it into one index that drove a real query under a millisecond.

Key Concepts

  • Heap pages — fixed-size disk blocks holding table rows; loading a page is costly, which shapes the planner’s choices.
  • Bitmap Index Scan / Bitmap Heap Scan — a two-step strategy that uses the index to find all relevant pages, then reads each page once and rechecks conditions.
  • Multi-column index — indexes several columns sorted in order; helps queries that filter on a leading prefix of those columns, so column order matters.
  • Expression index — indexes the result of a function like LOWER(col); computed at write time, reused at read time.
  • Partial index — indexes only rows matching a WHERE condition; shrinks the index and targets specific, common query patterns.

Why This Matters

The difference between a database that scales and one that crawls often comes down to whether its indexes match its workload. A senior data engineer does not sprinkle indexes randomly — they study the queries the business actually runs and craft indexes whose columns, expressions, and conditions mirror those queries precisely. The capstone index in this lesson is exactly that craft in miniature: every choice traced back to a clause in the query. Master this and you can make almost any read-heavy workload fast. Next you will handle the other side of a healthy database: cleaning up the space that destructive queries leave behind.


Next Steps

Continue to Lesson 5 - Vacuuming, Transactions, and ACID

Reclaim space with VACUUM, group work into transaction blocks, and understand the ACID guarantees behind reliable databases.

Back to Module Overview

Return to the Optimizing PostgreSQL module overview


Continue Building Your Skills

You can now design indexes that fit a query like a key fits a lock. Keep matching index structure to query structure — leading columns to the filters that always run, expressions to the transformations you query by, partial conditions to the rows you care about. In the final lesson you will learn why deleting data does not immediately free space, and how VACUUM keeps your tables lean and your queries fast.