Lesson 1 - Exploring PostgreSQL Internals

Welcome to PostgreSQL Internals

Before you can make a database faster, you have to understand what is inside it. In this lesson you take on the role of a data engineer who has just been handed a Postgres database from the U.S. Department of Housing and Urban Development (HUD). Nobody has documented its structure for you. There is no schema diagram, no data dictionary — just a connection string and a job to do.

The good news is that Postgres documents itself. Every Postgres database keeps a set of internal tables that describe its own structure: which tables exist, what columns they have, what types those columns are, and much more. In this lesson you will learn to read those internal tables and assemble your own description of the database from scratch.

By the end of this lesson, you will be able to:

Connect to a Postgres database from Python with psycopg2
Query the internal catalog tables to list a database’s tables
Distinguish user-created tables from Postgres’ own system tables using schemas
Map numeric type codes to human-readable type names
Build a JSON description of a database’s structure, including row counts

You only need a working knowledge of SQL and a little Python. Let’s open up Postgres and see what it knows about itself.

Data for this lesson

Engine: PostgreSQL. Dataset: homeless-by-CoC — U.S. homelessness counts by Continuum of Care, loaded from CSV.

Tables used: homeless_by_coc

Real-world source: HUD — PIT and HIC Data Since 2007 (the HUD homelessness point-in-time counts this homeless_by_coc data is based on). Load it into your own PostgreSQL instance and run the queries in psql or a GUI client.

Connecting to the Database

Throughout this module you will talk to Postgres from Python using the psycopg2 library, the most widely used Postgres adapter for Python. A connection needs the database name, a user, and a password.

Opening and Closing a Connection

The HUD database is named hud, and you connect as the user hud_admin:

import psycopg2

conn = psycopg2.connect("dbname=hud user=hud_admin password=hud_pwd")
conn.close()

The connect() function returns a connection object, which you assign to conn. When you are finished, you call conn.close() to release the connection. Every example in this module opens a connection like this; only the work in between changes.

To actually run queries you need a cursor, which you create from the connection. The cursor is what executes SQL and holds the results:

import psycopg2

conn = psycopg2.connect("dbname=hud user=hud_admin password=hud_pwd")
cur = conn.cursor()

With conn and cur ready, you can start asking Postgres questions.

The Internal Catalog Tables

Every Postgres engine maintains a set of internal tables to manage its entire structure. These live in two documented groups: the information_schema and the system catalogs (in the pg_catalog group). Together they hold all the information about the data, table names, and types stored in the database. When you previously used cursor.description to inspect a result, that information was being pulled from these internal tables.

Listing All Tables

The internal table that lists every table in the database is named pg_tables, located in the pg_catalog schema. It has several columns; the first three are the ones you care about right now:

Name	Type	Description
schemaname	name	Name of schema containing table
tablename	name	Name of table
tableowner	name	Name of table’s owner

To get every table name in alphabetical order, query the tablename column:

import psycopg2

conn = psycopg2.connect("dbname=hud user=hud_admin password=hud_pwd")
cur = conn.cursor()
cur.execute("""
    SELECT tablename FROM pg_catalog.pg_tables
    ORDER BY tablename;
""")
table_names = cur.fetchall()
print(len(table_names))
for name in table_names:
    print(name)

When you run this, you get an overwhelming list — dozens of tables, most of them with names like pg_aggregate, pg_attribute, and pg_class. That is not what you wanted. Almost all of those tables are Postgres’ own internal bookkeeping, not the HUD data. You need a way to separate the two.

Working with Schemas

The reason the list was overwhelming is that it included Postgres’ internal tables. The fix comes from understanding schemas.

Schemas Are Folders for Tables

A schema is a namespace that separates database tables into isolated groups — think of it as a folder for tables. When a Postgres database is created, three schemas are instantiated automatically:

pg_catalog holds the system catalog tables.
information_schema holds the information schema tables.
public is the default schema for user-created tables.

Every time you run a CREATE TABLE command without specifying a schema, Postgres places that table in public. So unless someone created a special schema for them, the HUD tables live in public.

Filtering Out the Internal Tables

Because pg_tables exposes the schema of each table in its schemaname column, you can exclude the two internal schemas with a WHERE clause:

import psycopg2

conn = psycopg2.connect("dbname=hud user=hud_admin password=hud_pwd")
cur = conn.cursor()
cur.execute("""
    SELECT tablename FROM pg_catalog.pg_tables
    WHERE schemaname != 'pg_catalog' AND schemaname != 'information_schema'
    ORDER BY tablename;
""")
table_names = cur.fetchall()
print(len(table_names))
for table_name in table_names:
    name = table_name[0]
    print(name)

Now the result is short and meaningful. There are only three user-created tables in the HUD database:

homeless_by_coc
state_household_incomes
state_info

These are the tables you will work with for the rest of the module. The homeless_by_coc table records the number of homeless residents for every state, city, and county in the United States.

Note

information_schema tables follow no obvious prefix pattern, but pg_catalog tables almost always begin with pg_. Filtering by schemaname is far more reliable than trying to match name prefixes.

Describing the Columns

Now that you know the three table names, you can inspect each table’s columns. One way is to run a SELECT and then read the cursor’s description attribute, which holds metadata about each column. But you want to do this inside a loop, with the table name changing on each iteration — and that introduces a subtlety.

Why You Cannot Parameterize a Table Name Normally

Your instinct might be to pass the table name as a query parameter:

table_name = "state_info"
cur.execute("SELECT * FROM %s LIMIT 0;", (table_name,))

This raises an error:

ProgrammingError: syntax error at or near "'state_info'"
LINE 1: SELECT * FROM 'state_info' LIMIT 0
                      ^

The problem is that cursor.execute() wrapped the Python string "state_info" in single quotes, producing the SQL string 'state_info'. That quoting is correct behavior for a value (a column value or filter), but a table name must not be quoted that way.

Using AsIs for Identifiers

The solution is the AsIs class from psycopg2.extensions. It tells psycopg2 to insert the value as a raw SQL fragment instead of quoting it as a string:

from psycopg2.extensions import AsIs

table_name = "state_info"
cur.execute("SELECT * FROM %s LIMIT 0;", (AsIs(table_name),))

The LIMIT 0 returns no rows — you do not want the data, only the column metadata. With AsIs, you can loop over every table and collect its description:

import psycopg2
from psycopg2.extensions import AsIs

conn = psycopg2.connect("dbname=hud user=hud_admin password=hud_pwd")
cur = conn.cursor()
# table_names from the previous step

col_descriptions = {}
for table_name in table_names:
    cur.execute("SELECT * FROM %s LIMIT 0;", (AsIs(table_name),))
    col_descriptions[table_name] = cur.description

conn.close()

Now col_descriptions maps each table name to a list of Column objects describing its columns.

Never Use String Formatting for Queries

Do not build queries with str.format() or f-strings. That practice opens the door to SQL injection, even in code you think only you will run. Use parameters for values and AsIs for identifiers — make safe querying an automatic habit.

From Type Codes to Readable Types

When you inspect the description of homeless_by_coc (stored in col_descriptions['homeless_by_coc']), each column reports a numeric type_code instead of a friendly name:

name	type_code
id	23
year	1082
state	1042
coc_number	1042
coc_name	1043
measures	1043
count	23

A code like 1082 is not something you want to memorize. Postgres stores the mapping from code to name in another catalog table, pg_type, where the oid column holds the type code and the typname column holds the readable name.

Building a Type Map

You can pull the entire mapping into a Python dictionary in one query:

import psycopg2

conn = psycopg2.connect("dbname=hud user=hud_admin password=hud_pwd")
cur = conn.cursor()
cur.execute("SELECT oid, typname FROM pg_catalog.pg_type;")
type_mappings = { int(oid): typname for oid, typname in cur.fetchall() }
print(type_mappings[1082])

The dictionary looks like this:

type_mappings = {
    16: 'bool',
    17: 'bytea',
    18: 'char',
    23: 'int4',
    1042: 'bpchar',
    1043: 'varchar',
    1082: 'date',
    ...
}

Printing type_mappings[1082] gives date, confirming that the year column really is a date. Note the conversion int(oid) — the oid comes back in a form that you cast to a plain integer so it matches the type_code values from the cursor description.

Assembling a Readable Description

You now have two pieces: col_descriptions (the columns of each table) and type_mappings (codes to names). Combine them into a single, human-readable structure.

Describing Each Column

Each Column object from the cursor description exposes name, type_code, and internal_size. A dictionary comprehension turns those into clean records:

readable_description = {}
for table_name in col_descriptions:
    readable_description[table_name] = {
        "columns": [
            {
                "name": col.name,
                "type": type_mappings[col.type_code],
                "internal_size": col.internal_size
            }
            for col in col_descriptions[table_name]
        ]
    }

The entry for homeless_by_coc now reads cleanly:

[
  {'name': 'id', 'type': 'int4', 'internal_size': 4},
  {'name': 'year', 'type': 'date', 'internal_size': 4},
  {'name': 'state', 'type': 'bpchar', 'internal_size': 2},
  {'name': 'coc_number', 'type': 'bpchar', 'internal_size': 8},
  {'name': 'coc_name', 'type': 'varchar', 'internal_size': 128},
  {'name': 'measures', 'type': 'varchar', 'internal_size': 64},
  {'name': 'count', 'type': 'int4', 'internal_size': 4}
]

Adding Row Counts

A description is more useful if it also tells you how large each table is. The COUNT(*) aggregate gives the number of rows, and AsIs again lets you plug in the table name:

import psycopg2
from psycopg2.extensions import AsIs

conn = psycopg2.connect("dbname=hud user=hud_admin password=hud_pwd")
cur = conn.cursor()
# readable_description from the previous step

for table_name in readable_description.keys():
    cur.execute("SELECT COUNT(*) FROM %s;", [AsIs(table_name)])
    readable_description[table_name]["number_of_rows"] = cur.fetchone()[0]

print(readable_description)

For homeless_by_coc, the count comes back as 86529 rows — a table large enough that query performance will genuinely matter in the lessons ahead.

Making It Readable with JSON

Printing the dictionary directly produces an unreadable wall of text on one line:

{'homeless_by_coc': {'columns': [{'name': 'id', 'type': 'int4', 'internal_size': 4}, {'name': 'year', 'type': 'date', 'internal_size': 4}, ...], 'number_of_rows': 86529}}

The cleanest fix is to convert the dictionary to JSON, a standard text format for representing key-value data. Python’s built-in json library does this with json.dumps(), and the indent argument controls the spacing:

import json

json_str = json.dumps(readable_description, indent=2)
print(json_str)

With indent=2, the output is laid out hierarchically and easy to scan:

{
  "homeless_by_coc": {
    "columns": [
      {
        "name": "id",
        "type": "int4",
        "internal_size": 4
      },
      ...
    ],
    "number_of_rows": 86529
  }
}

You can also go the other way — turn a JSON string back into a Python dictionary with json.loads():

import json

json_string = '{"int": 1, "list": [1, 2, 3], "dictionary": {"k": 1}}'
dictionary = json.loads(json_string)
print(dictionary["list"])
# [1, 2, 3]

One rule to remember: JSON requires double quotes for strings. A textual representation using single quotes is valid Python but invalid JSON, and json.loads() will reject it.

Practice Exercises

Exercise 1: List the Public Tables a Different Way

The information_schema also tracks tables. Write a query against information_schema.tables that selects table_name for rows where table_schema = 'public', ordered by name. Confirm you get the same three tables as the pg_tables approach.

Hint

The column is table_schema (not schemaname) in information_schema.tables, and the value you want is 'public'. Use single quotes around the value in the WHERE clause.

Exercise 2: Look Up a Type Code

Using the type_mappings dictionary you built, find the readable name for the type code 1043. Then look up the code for 23. Which columns in homeless_by_coc use each?

Hint

type_mappings[1043] returns 'varchar' and type_mappings[23] returns 'int4'. Cross-reference these against the type code table for homeless_by_coc.

Exercise 3: Count Rows in Every Table

Loop over the three HUD table names, run SELECT COUNT(*) on each using AsIs, and print the table name alongside its row count. Which table is the largest, and which is the smallest?

Hint

cursor.fetchone() returns a tuple, so take the first element with [0] to get the count itself. state_info is the smallest — keep that fact in mind for later lessons on indexing.

Summary

You started this lesson with an undocumented database and ended it with a complete, readable description built entirely from Postgres’ own internal tables. Along the way you learned how Postgres organizes its metadata into schemas, how to safely insert identifiers into queries, and how to translate Postgres’ internal type codes into names you can read.

Key Concepts

Internal catalog tables — Postgres stores metadata about itself in the pg_catalog and information_schema schemas; pg_tables lists tables and pg_type maps type codes to names.
Schema — a namespace that groups tables; new databases get pg_catalog, information_schema, and public, and user tables default to public.
AsIs — a psycopg2.extensions class that inserts a value as raw SQL, letting you parameterize identifiers like table names that must not be quoted.
type_code — the numeric identifier Postgres gives each column type; pg_type maps it to a readable typname.
JSON — a text format for key-value data; json.dumps() serializes a dictionary (use indent for readability) and json.loads() parses JSON back into a dictionary.

Why This Matters

In real data engineering work, you are constantly handed databases you did not design. Being able to interrogate a database directly — to list its tables, inspect its columns, and measure its size without relying on someone else’s documentation — is a core skill. It is also the foundation for everything that follows: you cannot optimize what you cannot see. Now that you know the structure of the HUD database, you are ready to watch how Postgres runs queries against it.

Next Steps

Continue to Lesson 2 - Reading Query Plans with EXPLAIN

Debug slow queries by measuring estimated and actual execution times with EXPLAIN and EXPLAIN ANALYZE.

Back to Module Overview

Return to the Optimizing PostgreSQL module overview

Continue Building Your Skills

You can now read a Postgres database from the inside out. That habit of inspecting before acting will serve you in every performance investigation you ever run. In the next lesson you will turn from what is in the database to how Postgres works with it — and you will meet the single most important tool for understanding query performance: EXPLAIN.

Lesson 20 - Window Functions and Rolling Operations

Lesson 21 - Final Project: Real-World Data Analysis

Courses

DATATWEETS

Title here

Lesson 1 - Exploring PostgreSQL Internals

Welcome to PostgreSQL Internals

Connecting to the Database

Opening and Closing a Connection

The Internal Catalog Tables

Listing All Tables

Working with Schemas

Schemas Are Folders for Tables

Filtering Out the Internal Tables

Describing the Columns

Why You Cannot Parameterize a Table Name Normally

Using AsIs for Identifiers

From Type Codes to Readable Types

Building a Type Map

Assembling a Readable Description

Describing Each Column

Adding Row Counts

Making It Readable with JSON

Practice Exercises

Exercise 1: List the Public Tables a Different Way

Exercise 2: Look Up a Type Code

Exercise 3: Count Rows in Every Table

Summary

Key Concepts

Why This Matters

Next Steps

Continue to Lesson 2 - Reading Query Plans with EXPLAIN

Back to Module Overview

Continue Building Your Skills

Lesson 1 - Exploring PostgreSQL Internals

Welcome to PostgreSQL Internals#

Connecting to the Database#

Opening and Closing a Connection#

The Internal Catalog Tables#

Listing All Tables#

Working with Schemas#

Schemas Are Folders for Tables#

Filtering Out the Internal Tables#

Describing the Columns#

Why You Cannot Parameterize a Table Name Normally#

Using AsIs for Identifiers#

From Type Codes to Readable Types#

Building a Type Map#

Assembling a Readable Description#

Describing Each Column#

Adding Row Counts#

Making It Readable with JSON#

Practice Exercises#

Exercise 1: List the Public Tables a Different Way#

Exercise 2: Look Up a Type Code#

Exercise 3: Count Rows in Every Table#

Summary#

Key Concepts#

Why This Matters#

Next Steps#

Continue to Lesson 2 - Reading Query Plans with EXPLAIN

Back to Module Overview

Continue Building Your Skills#

Welcome to PostgreSQL Internals

Connecting to the Database

Opening and Closing a Connection

The Internal Catalog Tables

Listing All Tables

Working with Schemas

Schemas Are Folders for Tables

Filtering Out the Internal Tables

Describing the Columns

Why You Cannot Parameterize a Table Name Normally

Using AsIs for Identifiers

From Type Codes to Readable Types

Building a Type Map

Assembling a Readable Description

Describing Each Column

Adding Row Counts

Making It Readable with JSON

Practice Exercises

Exercise 1: List the Public Tables a Different Way

Exercise 2: Look Up a Type Code

Exercise 3: Count Rows in Every Table

Summary

Key Concepts

Why This Matters

Next Steps

Continue Building Your Skills