Lesson 5 - Vacuuming, Transactions, and ACID
On this page
- Welcome to Vacuuming
- Destructive Queries
- ACID and Dead Rows
- Counting Dead Rows
- Vacuuming Dead Rows
- Transaction IDs and Nonremovable Rows
- Updating Statistics with VACUUM ANALYZE
- Reclaiming Space with VACUUM FULL
- Autovacuum and When to Vacuum Manually
- Practice Exercises
- Summary
- Next Steps
- Continue Building Your Skills
Welcome to Vacuuming
Here is a puzzle. You accidentally DELETE every row from a table, then calmly reload the data from its CSV file. The row count is identical to before, the data validates perfectly — and yet your queries are now noticeably slower. Nothing about the data changed, so why did performance drop?
The answer reveals something surprising about how Postgres handles destructive queries, and it connects directly to the guarantees that make databases trustworthy under heavy, concurrent use. In this lesson you will learn why deleted rows do not actually disappear, why that is a feature and not a bug, and how to clean them up with the VACUUM command.
By the end of this lesson, you will be able to:
- Explain why
DELETEmarks rows as dead instead of removing them immediately - Connect dead rows to the four ACID properties and multi-version control
- Count dead rows and reclaim their space with
VACUUM - Use
VACUUM ANALYZEto refresh planner statistics andVACUUM FULLto return space to the server - Decide when to vacuum manually versus rely on autovacuum
You will work one last time with the HUD database. Let’s start by deleting some data.
Data for this lesson
Engine: PostgreSQL. Dataset: homeless-by-CoC U.S. homelessness counts.
Tables used: homeless_by_coc
Real-world source: HUD — PIT and HIC Data Since 2007 (the HUD homelessness point-in-time counts this homeless_by_coc data is based on). Load it into your own PostgreSQL instance and run the queries in psql or a GUI client.
Destructive Queries
A destructive query alters or removes table data — UPDATE and DELETE are the prime examples. Recall the DELETE syntax, with or without a filter:
DELETE FROM table_name;
DELETE FROM table_name
WHERE column_name = value;Clear out homeless_by_coc and confirm it worked:
import psycopg2
conn = psycopg2.connect(dbname="hud", user="hud_admin", password="hud123")
cur = conn.cursor()
cur.execute("SELECT * FROM homeless_by_coc;")
num_rows_before = len(cur.fetchall())
cur.execute("DELETE FROM homeless_by_coc;")
cur.execute("SELECT * FROM homeless_by_coc;")
num_rows_after = len(cur.fetchall())
print(num_rows_before)
print(num_rows_after)This prints 86530 then 0 — the table looks empty. But appearances are misleading.
ACID and Dead Rows
When you run DELETE, Postgres does not actually erase the rows. It marks them as dead, meaning they will eventually be removed. It is like dragging a file to the trash: gone from view, but still on disk until something empties the bin.
Why work this way? Because Postgres, unlike SQLite, lets many users read and write the same database at the same time. To keep things sane, it groups work into transactions that obey four properties known by the acronym ACID:
- Atomicity — if any part of a transaction fails, the whole transaction fails.
- Consistency — a transaction moves the database from one valid state to another.
- Isolation — concurrent changes are applied as if they happened sequentially.
- Durability — once committed, a change survives crashes, power loss, and the like.
Why Immediate Deletion Would Break Isolation
Imagine two users on an accounts table. Mary runs a COUNT of all users and then asks for their names. Bill, at the same moment, deletes the user Steve Johnson. If Bill’s delete physically removed the row between Mary’s two queries, Mary’s count and her name list would disagree — she would count one more user than she could name. That inconsistency violates isolation: concurrent changes leaked into the middle of Mary’s transaction.
Postgres prevents this with multi-version control. Instead of erasing a row, it marks the row with version information so it knows which transactions should still see it. Mary’s transaction continues to see the row Bill “deleted,” so both of her queries agree. Only after no transaction needs the old version can the dead row truly be removed. That is why a DELETE leaves dead rows behind — they are the price of correct, concurrent behavior.
Counting Dead Rows
Dead rows protect consistency, but if they linger they hurt performance. Any scan over a table with many dead rows runs slower, because the table is physically larger — comparable to opening a 1 GB file instead of a 1 MB one.
Postgres tracks dead rows in the pg_stat_all_tables view, whose n_dead_tup column holds the estimated number of dead rows:
| Column | Type | Description |
|---|---|---|
| n_dead_tup | bigint | Estimated number of dead rows |
Check how many dead rows your DELETE produced:
import psycopg2
conn = psycopg2.connect(dbname="hud", user="hud_admin", password="hud123")
cur = conn.cursor()
cur.execute("""
SELECT n_dead_tup
FROM pg_catalog.pg_stat_all_tables
WHERE relname='homeless_by_coc';
""")
homeless_dead_rows = cur.fetchone()[0]
print(homeless_dead_rows)The deleted rows now show up as dead — exactly the rows weighing your queries down.
Vacuuming Dead Rows
To remove dead rows and reclaim the space they occupy, run the VACUUM command. Vacuum one table by name, or omit the name to vacuum every table the current user can access:
VACUUM homeless_by_coc;
VACUUM;Its documented job is to “reclaim table storage space occupied by dead tuples.” Two practical notes: VACUUM cannot run inside a transaction block, so you must set conn.autocommit = True first; and the VERBOSE option reports what it did through the conn.notices attribute:
import psycopg2
conn = psycopg2.connect(dbname="hud", user="hud_admin", password="hud123")
cur = conn.cursor()
conn.autocommit = True
cur.execute("VACUUM VERBOSE homeless_by_coc;")
for notice in conn.notices:
print(notice)The verbose output is detailed:
INFO: vacuuming "public.homeless_by_coc"
INFO: scanned index "homeless_by_coc_pkey" to remove 86530 row versions
INFO: "homeless_by_coc": removed 86530 row versions in 1280 pages
INFO: index "homeless_by_coc_pkey" now contains 0 row versions in 239 pages
INFO: "homeless_by_coc": found 86530 removable, 0 nonremovable row versions in 1280 out of 1280 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 557
INFO: "homeless_by_coc": truncated 1280 to 0 pagesAll 86,530 dead rows were removed, along with their entries in the homeless_by_coc_pkey index. The line “0 nonremovable row versions” means nothing was blocking removal this time.
Transaction IDs and Nonremovable Rows
Sometimes VACUUM cannot remove a dead row yet. If another transaction started before the row was deleted and has not finished, removing the row would change that transaction’s view — violating ACID. Postgres tracks this with two hidden columns on every table:
xmin— the transaction ID that inserted the row.xmax— the transaction ID that deleted the row.
These columns always exist but are hidden unless you ask for them:
SELECT xmin, xmax FROM table_name;Postgres compares these IDs against the IDs of running transactions to decide whether a row is safe to remove. Insert a fresh row, then vacuum and inspect xmin:
import psycopg2
row = (1, '2007-01-01', 'AK', 'AK-500', 'Anchorage CoC', 'Chronically Homeless Individuals', 224)
conn = psycopg2.connect(dbname='hud', user='hud_admin', password='hud123')
conn.autocommit = True
cur = conn.cursor()
cur.execute("INSERT INTO homeless_by_coc VALUES (%s, %s, %s, %s, %s, %s, %s);", row)
cur.execute("VACUUM VERBOSE homeless_by_coc;")
for line in conn.notices:
print(line)
cur.execute("SELECT xmin FROM homeless_by_coc;")
xmin = cur.fetchone()
print(xmin)This time the verbose output reports one nonremovable row:
INFO: "homeless_by_coc": found 86530 removable, 1 nonremovable row versions in 1280 out of 1280 pagesThe nonremovable row is the live one you just inserted — it is not dead, so VACUUM leaves it alone.
Updating Statistics with VACUUM ANALYZE
In Lesson 2 you saw that EXPLAIN’s row estimates can be wrong because the planner statistics drift out of date. VACUUM can fix that too. The ANALYZE option updates the pg_stats data with fresh cost, row-count, and width figures for the table:
VACUUM ANALYZE table_name;Watch the plan’s estimates change before and after:
import psycopg2
conn = psycopg2.connect(dbname="hud", user="hud_admin", password="hud123")
conn.autocommit = True
cur = conn.cursor()
cur.execute("EXPLAIN SELECT * FROM homeless_by_coc;")
plan_before = cur.fetchall()
cur.execute("VACUUM ANALYZE homeless_by_coc;")
cur.execute("EXPLAIN SELECT * FROM homeless_by_coc;")
plan_after = cur.fetchall()
print(plan_before)
print(plan_after)After VACUUM ANALYZE, the plan reflects the table’s true current size, so the planner makes better decisions.
VACUUM ANALYZE Is Not EXPLAIN ANALYZE
These two are different despite the shared word. EXPLAIN ANALYZE runs a query to measure its real time. VACUUM ANALYZE cleans a table and refreshes its statistics. Do not confuse them.
Reclaiming Space with VACUUM FULL
A plain VACUUM frees space, but the freed space stays reserved for that table — it is held in reserve for future inserts, not returned to the operating system. The FULL option returns the space to the whole server so anything can use it:
VACUUM FULL table_name;VACUUM FULL is powerful but risky. It takes an exclusive lock on the table for its entire duration: no INSERT, UPDATE, or DELETE can run against the table, and SELECTs slow to a crawl. In the trash analogy, a plain VACUUM empties the bin but keeps the freed disk space for that one folder; VACUUM FULL is a hard delete that returns the space to the whole disk.
Measure the effect using pg_total_relation_size, made readable with pg_size_pretty:
import psycopg2
conn = psycopg2.connect(dbname="hud", user="hud_admin", password="hud123")
conn.autocommit = True
cur = conn.cursor()
cur.execute("SELECT * FROM homeless_by_coc;")
print(cur.fetchall())
cur.execute("SELECT pg_size_pretty(pg_total_relation_size('homeless_by_coc'));")
space_before = cur.fetchone()
cur.execute("VACUUM FULL homeless_by_coc;")
cur.execute("SELECT pg_size_pretty(pg_total_relation_size('homeless_by_coc'));")
space_after = cur.fetchone()
print(space_before)
print(space_after)Even though the table is empty, space_before shows it still occupies meaningful disk space (held in reserve), and space_after shows that space returned to the server. Because of the exclusive lock, do not run VACUUM FULL on a schedule — save it for tables with a high ratio of dead rows to live rows.
Autovacuum and When to Vacuum Manually
Postgres ships with autovacuum, an optional but highly recommended feature that vacuums your tables automatically, removing dead rows and refreshing statistics on a schedule. Check whether it is on:
SELECT setting FROM pg_settings
WHERE name = 'autovacuum';In modern Postgres, autovacuum is on by default and needs no setup. So when should you vacuum by hand? Use these questions as a guide:
- Running normal analysis with no big deletes and few new rows? Leave it to autovacuum.
- Just deleted a lot of data and about to run heavy analysis? Run
VACUUMorVACUUM ANALYZEto optimize first. - Tables growing out of control with little free disk space left? Disable queries and run
VACUUM FULLto reclaim space.
To see when a table was last vacuumed, query pg_stat_user_tables:
import psycopg2
conn = psycopg2.connect(dbname="hud", user="hud_admin", password="hud123")
conn.autocommit = True
cur = conn.cursor()
cur.execute("VACUUM homeless_by_coc;")
import time
time.sleep(1)
cur.execute("""
SELECT last_vacuum, last_autovacuum FROM pg_stat_user_tables
WHERE relname = 'homeless_by_coc';
""")
timestamps = cur.fetchone()
print(timestamps)Statistics are not updated instantly — by default the refresh happens roughly every 500 milliseconds — so the time.sleep(1) gives Postgres a moment to record the vacuum before you read the timestamps.
Practice Exercises
Exercise 1: Create and Count Dead Rows
Delete all rows where state = 'CA' from homeless_by_coc, then query n_dead_tup from pg_stat_all_tables to see how many dead rows the delete produced.
Hint
Use DELETE FROM homeless_by_coc WHERE state = 'CA';, then SELECT n_dead_tup FROM pg_catalog.pg_stat_all_tables WHERE relname = 'homeless_by_coc';. Take the first element of the fetched tuple.
Exercise 2: Vacuum and Verify
Set conn.autocommit = True, run VACUUM VERBOSE on homeless_by_coc, and print every line from conn.notices. Find the line reporting how many removable row versions were cleaned up.
Hint
VACUUM cannot run in a transaction block, so conn.autocommit = True must come before the VACUUM. Look for the “found N removable” line in the notices.
Exercise 3: Measure Reclaimed Space
Record pg_total_relation_size for a table, run VACUUM FULL, and record the size again. Wrap both sizes in pg_size_pretty and report how much space was returned to the server.
Hint
SELECT pg_size_pretty(pg_total_relation_size('homeless_by_coc')); before and after the VACUUM FULL. The difference is the space returned to the whole server, not just reserved for the table.
Summary
You solved the opening puzzle: deleting and reloading data leaves dead rows behind, inflating the table and slowing scans. You learned that this behavior is a deliberate consequence of ACID and multi-version control — dead rows preserve a consistent view for concurrent transactions. Then you cleaned them up: VACUUM reclaims space within a table, VACUUM ANALYZE also refreshes planner statistics, and VACUUM FULL returns space to the server at the cost of an exclusive lock. Finally, you saw that autovacuum handles most of this automatically, leaving manual vacuuming for big deletes and runaway growth.
Key Concepts
- Dead rows — rows marked deleted but not yet removed; they keep transactions consistent but slow down scans until cleaned up.
- ACID — Atomicity, Consistency, Isolation, Durability; the four guarantees that make transactions reliable, and the reason
DELETEdoes not erase rows immediately. VACUUM— reclaims space from dead rows; needsconn.autocommit = Trueand cannot run inside a transaction block.VACUUM ANALYZE— vacuums and updates planner statistics so cost estimates stay accurate.VACUUM FULL— returns space to the whole server but takes an exclusive lock; reserve it for tables with many dead rows.- Autovacuum — the built-in feature that vacuums tables automatically; on by default in modern Postgres.
Why This Matters
A database that is fast on day one can degrade quietly over months as deletes and updates accumulate dead rows. Keeping tables healthy is part of a data engineer’s ongoing stewardship, not a one-time task — and understanding why dead rows exist (the ACID guarantees) is what lets you reason about concurrency, locking, and consistency across every database you will ever touch. With internals, query plans, indexing, and vacuuming together, you now have a complete toolkit for diagnosing and fixing slow PostgreSQL databases.
Next Steps
You have finished the Optimizing PostgreSQL module. You can now read a database from the inside, diagnose slow queries with EXPLAIN ANALYZE, design indexes that match your workload, and keep tables lean with VACUUM. The next module takes you beyond a single relational engine into the production data ecosystem — cloud warehouses, NoSQL, and document databases.
Continue to Production Database Tools
Run SQL in the cloud with Snowflake, learn when NoSQL beats SQL, and build document databases with MongoDB.
Back to Module Overview
Return to the Optimizing PostgreSQL module overview
Continue Building Your Skills
You have built genuine database performance expertise — the kind that turns a frustrating, timing-out report into a query that returns instantly. These skills are exactly what separates someone who can write SQL from someone who can run a database at scale. Carry the habits forward: measure before you change, match indexes to real queries, and keep your tables clean. Now step into the wider world of production data tools.