Lesson 1 - Exploring PostgreSQL Internals
Welcome to PostgreSQL Internals
Before you can make a database faster, you have to understand what is inside it. In this lesson you take on the role of a data engineer who has just been handed a Postgres database from the U.S. Department of Housing and Urban Development (HUD). Nobody has documented its structure for you. There is no schema diagram, no data dictionary — just a connection string and a job to do.
The good news is that Postgres documents itself. Every Postgres database keeps a set of internal tables that describe its own structure: which tables exist, what columns they have, what types those columns are, and much more. In this lesson you will learn to read those internal tables and assemble your own description of the database from scratch.
By the end of this lesson, you will be able to:
- Connect to a Postgres database from Python with
psycopg2 - Query the internal catalog tables to list a database’s tables
- Distinguish user-created tables from Postgres’ own system tables using schemas
- Map numeric type codes to human-readable type names
- Build a JSON description of a database’s structure, including row counts
You only need a working knowledge of SQL and a little Python. Let’s open up Postgres and see what it knows about itself.
Data for this lesson
Engine: PostgreSQL. Dataset: homeless-by-CoC — U.S. homelessness counts by Continuum of Care, loaded from CSV.
Tables used: homeless_by_coc
Real-world source: HUD — PIT and HIC Data Since 2007 (the HUD homelessness point-in-time counts this homeless_by_coc data is based on). Load it into your own PostgreSQL instance and run the queries in psql or a GUI client.
Connecting to the Database
Throughout this module you will talk to Postgres from Python using the psycopg2 library, the most widely used Postgres adapter for Python. A connection needs the database name, a user, and a password.
Opening and Closing a Connection
The HUD database is named hud, and you connect as the user hud_admin:
import psycopg2
conn = psycopg2.connect("dbname=hud user=hud_admin password=hud_pwd")
conn.close()The connect() function returns a connection object, which you assign to conn. When you are finished, you call conn.close() to release the connection. Every example in this module opens a connection like this; only the work in between changes.
To actually run queries you need a cursor, which you create from the connection. The cursor is what executes SQL and holds the results:
import psycopg2
conn = psycopg2.connect("dbname=hud user=hud_admin password=hud_pwd")
cur = conn.cursor()With conn and cur ready, you can start asking Postgres questions.
The Internal Catalog Tables
Every Postgres engine maintains a set of internal tables to manage its entire structure. These live in two documented groups: the information_schema and the system catalogs (in the pg_catalog group). Together they hold all the information about the data, table names, and types stored in the database. When you previously used cursor.description to inspect a result, that information was being pulled from these internal tables.
Listing All Tables
The internal table that lists every table in the database is named pg_tables, located in the pg_catalog schema. It has several columns; the first three are the ones you care about right now:
| Name | Type | Description |
|---|---|---|
| schemaname | name | Name of schema containing table |
| tablename | name | Name of table |
| tableowner | name | Name of table’s owner |
To get every table name in alphabetical order, query the tablename column:
import psycopg2
conn = psycopg2.connect("dbname=hud user=hud_admin password=hud_pwd")
cur = conn.cursor()
cur.execute("""
SELECT tablename FROM pg_catalog.pg_tables
ORDER BY tablename;
""")
table_names = cur.fetchall()
print(len(table_names))
for name in table_names:
print(name)When you run this, you get an overwhelming list — dozens of tables, most of them with names like pg_aggregate, pg_attribute, and pg_class. That is not what you wanted. Almost all of those tables are Postgres’ own internal bookkeeping, not the HUD data. You need a way to separate the two.
Working with Schemas
The reason the list was overwhelming is that it included Postgres’ internal tables. The fix comes from understanding schemas.
Schemas Are Folders for Tables
A schema is a namespace that separates database tables into isolated groups — think of it as a folder for tables. When a Postgres database is created, three schemas are instantiated automatically:
pg_catalogholds the system catalog tables.information_schemaholds the information schema tables.publicis the default schema for user-created tables.
Every time you run a CREATE TABLE command without specifying a schema, Postgres places that table in public. So unless someone created a special schema for them, the HUD tables live in public.
Filtering Out the Internal Tables
Because pg_tables exposes the schema of each table in its schemaname column, you can exclude the two internal schemas with a WHERE clause:
import psycopg2
conn = psycopg2.connect("dbname=hud user=hud_admin password=hud_pwd")
cur = conn.cursor()
cur.execute("""
SELECT tablename FROM pg_catalog.pg_tables
WHERE schemaname != 'pg_catalog' AND schemaname != 'information_schema'
ORDER BY tablename;
""")
table_names = cur.fetchall()
print(len(table_names))
for table_name in table_names:
name = table_name[0]
print(name)Now the result is short and meaningful. There are only three user-created tables in the HUD database:
homeless_by_coc
state_household_incomes
state_infoThese are the tables you will work with for the rest of the module. The homeless_by_coc table records the number of homeless residents for every state, city, and county in the United States.
Note
information_schema tables follow no obvious prefix pattern, but pg_catalog tables almost always begin with pg_. Filtering by schemaname is far more reliable than trying to match name prefixes.
Describing the Columns
Now that you know the three table names, you can inspect each table’s columns. One way is to run a SELECT and then read the cursor’s description attribute, which holds metadata about each column. But you want to do this inside a loop, with the table name changing on each iteration — and that introduces a subtlety.
Why You Cannot Parameterize a Table Name Normally
Your instinct might be to pass the table name as a query parameter:
table_name = "state_info"
cur.execute("SELECT * FROM %s LIMIT 0;", (table_name,))This raises an error:
ProgrammingError: syntax error at or near "'state_info'"
LINE 1: SELECT * FROM 'state_info' LIMIT 0
^The problem is that cursor.execute() wrapped the Python string "state_info" in single quotes, producing the SQL string 'state_info'. That quoting is correct behavior for a value (a column value or filter), but a table name must not be quoted that way.
Using AsIs for Identifiers
The solution is the AsIs class from psycopg2.extensions. It tells psycopg2 to insert the value as a raw SQL fragment instead of quoting it as a string:
from psycopg2.extensions import AsIs
table_name = "state_info"
cur.execute("SELECT * FROM %s LIMIT 0;", (AsIs(table_name),))The LIMIT 0 returns no rows — you do not want the data, only the column metadata. With AsIs, you can loop over every table and collect its description:
import psycopg2
from psycopg2.extensions import AsIs
conn = psycopg2.connect("dbname=hud user=hud_admin password=hud_pwd")
cur = conn.cursor()
# table_names from the previous step
col_descriptions = {}
for table_name in table_names:
cur.execute("SELECT * FROM %s LIMIT 0;", (AsIs(table_name),))
col_descriptions[table_name] = cur.description
conn.close()Now col_descriptions maps each table name to a list of Column objects describing its columns.
Never Use String Formatting for Queries
Do not build queries with str.format() or f-strings. That practice opens the door to SQL injection, even in code you think only you will run. Use parameters for values and AsIs for identifiers — make safe querying an automatic habit.
From Type Codes to Readable Types
When you inspect the description of homeless_by_coc (stored in col_descriptions['homeless_by_coc']), each column reports a numeric type_code instead of a friendly name:
| name | type_code |
|---|---|
| id | 23 |
| year | 1082 |
| state | 1042 |
| coc_number | 1042 |
| coc_name | 1043 |
| measures | 1043 |
| count | 23 |
A code like 1082 is not something you want to memorize. Postgres stores the mapping from code to name in another catalog table, pg_type, where the oid column holds the type code and the typname column holds the readable name.
Building a Type Map
You can pull the entire mapping into a Python dictionary in one query:
import psycopg2
conn = psycopg2.connect("dbname=hud user=hud_admin password=hud_pwd")
cur = conn.cursor()
cur.execute("SELECT oid, typname FROM pg_catalog.pg_type;")
type_mappings = { int(oid): typname for oid, typname in cur.fetchall() }
print(type_mappings[1082])The dictionary looks like this:
type_mappings = {
16: 'bool',
17: 'bytea',
18: 'char',
23: 'int4',
1042: 'bpchar',
1043: 'varchar',
1082: 'date',
...
}Printing type_mappings[1082] gives date, confirming that the year column really is a date. Note the conversion int(oid) — the oid comes back in a form that you cast to a plain integer so it matches the type_code values from the cursor description.
Assembling a Readable Description
You now have two pieces: col_descriptions (the columns of each table) and type_mappings (codes to names). Combine them into a single, human-readable structure.
Describing Each Column
Each Column object from the cursor description exposes name, type_code, and internal_size. A dictionary comprehension turns those into clean records:
readable_description = {}
for table_name in col_descriptions:
readable_description[table_name] = {
"columns": [
{
"name": col.name,
"type": type_mappings[col.type_code],
"internal_size": col.internal_size
}
for col in col_descriptions[table_name]
]
}The entry for homeless_by_coc now reads cleanly:
[
{'name': 'id', 'type': 'int4', 'internal_size': 4},
{'name': 'year', 'type': 'date', 'internal_size': 4},
{'name': 'state', 'type': 'bpchar', 'internal_size': 2},
{'name': 'coc_number', 'type': 'bpchar', 'internal_size': 8},
{'name': 'coc_name', 'type': 'varchar', 'internal_size': 128},
{'name': 'measures', 'type': 'varchar', 'internal_size': 64},
{'name': 'count', 'type': 'int4', 'internal_size': 4}
]Adding Row Counts
A description is more useful if it also tells you how large each table is. The COUNT(*) aggregate gives the number of rows, and AsIs again lets you plug in the table name:
import psycopg2
from psycopg2.extensions import AsIs
conn = psycopg2.connect("dbname=hud user=hud_admin password=hud_pwd")
cur = conn.cursor()
# readable_description from the previous step
for table_name in readable_description.keys():
cur.execute("SELECT COUNT(*) FROM %s;", [AsIs(table_name)])
readable_description[table_name]["number_of_rows"] = cur.fetchone()[0]
print(readable_description)For homeless_by_coc, the count comes back as 86529 rows — a table large enough that query performance will genuinely matter in the lessons ahead.
Making It Readable with JSON
Printing the dictionary directly produces an unreadable wall of text on one line:
{'homeless_by_coc': {'columns': [{'name': 'id', 'type': 'int4', 'internal_size': 4}, {'name': 'year', 'type': 'date', 'internal_size': 4}, ...], 'number_of_rows': 86529}}The cleanest fix is to convert the dictionary to JSON, a standard text format for representing key-value data. Python’s built-in json library does this with json.dumps(), and the indent argument controls the spacing:
import json
json_str = json.dumps(readable_description, indent=2)
print(json_str)With indent=2, the output is laid out hierarchically and easy to scan:
{
"homeless_by_coc": {
"columns": [
{
"name": "id",
"type": "int4",
"internal_size": 4
},
...
],
"number_of_rows": 86529
}
}You can also go the other way — turn a JSON string back into a Python dictionary with json.loads():
import json
json_string = '{"int": 1, "list": [1, 2, 3], "dictionary": {"k": 1}}'
dictionary = json.loads(json_string)
print(dictionary["list"])
# [1, 2, 3]One rule to remember: JSON requires double quotes for strings. A textual representation using single quotes is valid Python but invalid JSON, and json.loads() will reject it.
Practice Exercises
Exercise 1: List the Public Tables a Different Way
The information_schema also tracks tables. Write a query against information_schema.tables that selects table_name for rows where table_schema = 'public', ordered by name. Confirm you get the same three tables as the pg_tables approach.
Hint
The column is table_schema (not schemaname) in information_schema.tables, and the value you want is 'public'. Use single quotes around the value in the WHERE clause.
Exercise 2: Look Up a Type Code
Using the type_mappings dictionary you built, find the readable name for the type code 1043. Then look up the code for 23. Which columns in homeless_by_coc use each?
Hint
type_mappings[1043] returns 'varchar' and type_mappings[23] returns 'int4'. Cross-reference these against the type code table for homeless_by_coc.
Exercise 3: Count Rows in Every Table
Loop over the three HUD table names, run SELECT COUNT(*) on each using AsIs, and print the table name alongside its row count. Which table is the largest, and which is the smallest?
Hint
cursor.fetchone() returns a tuple, so take the first element with [0] to get the count itself. state_info is the smallest — keep that fact in mind for later lessons on indexing.
Summary
You started this lesson with an undocumented database and ended it with a complete, readable description built entirely from Postgres’ own internal tables. Along the way you learned how Postgres organizes its metadata into schemas, how to safely insert identifiers into queries, and how to translate Postgres’ internal type codes into names you can read.
Key Concepts
- Internal catalog tables — Postgres stores metadata about itself in the
pg_catalogandinformation_schemaschemas;pg_tableslists tables andpg_typemaps type codes to names. - Schema — a namespace that groups tables; new databases get
pg_catalog,information_schema, andpublic, and user tables default topublic. AsIs— apsycopg2.extensionsclass that inserts a value as raw SQL, letting you parameterize identifiers like table names that must not be quoted.type_code— the numeric identifier Postgres gives each column type;pg_typemaps it to a readabletypname.- JSON — a text format for key-value data;
json.dumps()serializes a dictionary (useindentfor readability) andjson.loads()parses JSON back into a dictionary.
Why This Matters
In real data engineering work, you are constantly handed databases you did not design. Being able to interrogate a database directly — to list its tables, inspect its columns, and measure its size without relying on someone else’s documentation — is a core skill. It is also the foundation for everything that follows: you cannot optimize what you cannot see. Now that you know the structure of the HUD database, you are ready to watch how Postgres runs queries against it.
Next Steps
Continue to Lesson 2 - Reading Query Plans with EXPLAIN
Debug slow queries by measuring estimated and actual execution times with EXPLAIN and EXPLAIN ANALYZE.
Back to Module Overview
Return to the Optimizing PostgreSQL module overview
Continue Building Your Skills
You can now read a Postgres database from the inside out. That habit of inspecting before acting will serve you in every performance investigation you ever run. In the next lesson you will turn from what is in the database to how Postgres works with it — and you will meet the single most important tool for understanding query performance: EXPLAIN.