Lesson 1 - Introduction to PostgreSQL
Welcome to PostgreSQL
You already know how to write SQL against SQLite. In this lesson you take a big step up: you meet PostgreSQL, the open-source database engine that powers a huge share of the world’s data platforms, and you connect to it from Python.
The shift from SQLite to Postgres is the shift from analyst to engineer. Instead of a single file on your laptop, you will be talking to a database server that many people can use at the same time. That changes how you connect, how you save your work, and how you think about safety.
By the end of this lesson, you will be able to:
- Explain how PostgreSQL differs from SQLite and why data engineers choose it
- Connect to a Postgres database from Python using the
psycopg2library - Create a table and insert data with parameterized queries
- Explain what a SQL transaction is and why you must commit your changes
- Bulk-load a CSV file into a Postgres table
A little SQL and a little Python are all you need. Let’s begin.
Data for this lesson
Engine: PostgreSQL. Dataset: users — a 10,000-row generated sample (user_accounts.csv).
Tables used: users, accounts
Download: user_accounts.csv — a 10,000-row generated (not real) sample with the schema above. Load it into your PostgreSQL instance (the lesson walks through the import) and query it in psql or a GUI client. Prefer your own? Regenerate with any fake-data tool such as Python’s Faker.
From SQLite to PostgreSQL
SQLite is one of the most common database engines in the world. All of your data lives in a single file, which makes it portable and incredibly easy to use. Because SQLite supports most of the SQL commands you will find in larger engines, learning SQLite means you already know the basics of every other SQL database.
But SQLite has a hard limit: only one process can write to the database at a time. That makes it difficult to share with many people and services. The job of a data engineer is to open an organization’s data to a wide group of analysts, data scientists, and applications. You could hand the SQLite file to every interested user — but what happens when 10,000 people want the data at once? SQLite does not scale to that.
The client-server model
To serve that many users, you need a different kind of engine. PostgreSQL (often just called Postgres) is an open-source relational database management system (RDBMS) built as a server rather than a single file.
As a server, Postgres accepts connections from many clients at once. Each client can run queries like SELECT, INSERT, or any other SQL statement, which makes the data accessible to many people simultaneously. Imagine two users, Rose and Bruno: both can connect to the same Postgres server, and both can query the same databases without stepping on each other.
This client-server model is exactly what solves the central data-engineering challenge of making data available to everyone who needs it.
What this means for you
Because Postgres is a server, connecting to it is a little different from opening a SQLite file. You will specify which database you want and which user you are. Postgres uses multiple users and databases to improve both security and the organization of your data — without those values, the server has no idea who is connecting or where, so it refuses.
Most of the SQL commands you already know — SELECT, CREATE, INSERT — work the same in Postgres. When a command behaves differently, the official PostgreSQL documentation is your friend.
Connecting with psycopg2
To talk to a Postgres server from Python, you use the open-source psycopg2 library. If you have used Python’s built-in sqlite3 library before, psycopg2 will feel familiar — it plays the same role for Postgres.
Opening a connection
You connect with the psycopg2.connect() function, passing it a connection string that names the database and the user:
import psycopg2
conn = psycopg2.connect("dbname=datatweets user=datatweets")Here the connection string "dbname=datatweets user=datatweets" says: connect to the database named datatweets as the user named datatweets. The function returns a connection object that lets the client interact with the server until the connection is closed.
When you are finished, you close the connection with the connection.close() method:
conn.close()Once a connection is closed, you can no longer run commands through it.
Cursors execute your queries
The connection object does not run queries directly. Instead, you create a cursor object by calling connection.cursor(), and you run SQL through the cursor’s execute() method:
import psycopg2
conn = psycopg2.connect("dbname=datatweets user=datatweets")
cur = conn.cursor()
cur.execute("SELECT * FROM users;")The string you pass to execute() is called a query string.
Fetching results
cursor.execute() does not return the results directly — it returns None on success, or raises an error. To read the rows, you either loop over the cursor or call one of two fetch methods:
cur.execute("SELECT * FROM users;")
for row in cur:
print(row)cursor.fetchone()returns the first result, orNoneif there are no rows.cursor.fetchall()returns a list of every row, or an empty list[]if there are none.
cur.execute("SELECT * FROM users;")
one_result = cur.fetchone()
all_results = cur.fetchall()The data you will load in this module is a set of randomly generated (not real) user accounts from a file named user_accounts.csv. The first few rows look like this:
id | email | name | address
---+--------------------------+----------------+-----------------------------------------------------
0 | [email protected] | Anna Carter | 27183 Craig Shore Suite 886 New Benjamin TN 92858
1 | [email protected] | Joseph Kirby | 3594 Fox Ford Apt. 192 West Kristen GA 22838-8977
2 | [email protected] | Larry Cain | 58208 Cook Bypass West Benjaminfurt OH 25179Creating Your First Table
Creating a table in Postgres uses the same CREATE TABLE statement you know from SQLite. For each column you specify its name and its data type:
CREATE TABLE employees (
id integer PRIMARY KEY,
first_name text,
last_name text
);The PRIMARY KEY keyword marks id as the table’s primary key. In Postgres, every table needs at least one set of columns that holds a unique value per row, and the primary key fills that role.
From Python, you wrap the statement in cursor.execute(). Using a triple-quoted string keeps the query readable across multiple lines:
import psycopg2
conn = psycopg2.connect("dbname=datatweets user=datatweets")
cur = conn.cursor()
cur.execute("""
CREATE TABLE users (
id integer PRIMARY KEY,
email text,
name text,
address text
);
""")The users table has four columns and two data types: an integer for the IDs and text for everything else. Postgres, like every relational database, is type-sensitive — you must declare a type for every column.
Why the table might not be there
Here is a surprise. If you create that table, close the connection, then reconnect and run SELECT * FROM users;, you get an error:
ProgrammingError: relation "users" does not exist
LINE 1: SELECT * FROM users;
^This is not a bug. It is your first encounter with transactions.
Transactions and the Commit Method
Postgres groups your queries into transactions and only saves the changes when you explicitly ask it to. This is different from SQLite, where every data-modifying query took effect immediately.
Why transactions exist
Imagine a bank with an accounts table:
id | name | balance
---+------+--------
1 | Jim | 100
2 | Sue | 200Sue gives Jim 100 dollars. You model that with two updates:
UPDATE accounts SET balance = 200 WHERE name = 'Jim';
UPDATE accounts SET balance = 100 WHERE name = 'Sue';Now suppose the connection fails after the first update but before the second. Jim is credited his 100 dollars, but nothing is ever removed from Sue. The bank just lost money:
id | name | balance
---+------+--------
1 | Jim | 200
2 | Sue | 200Transactions prevent exactly this. All the queries in a transaction block run together. If any query in the block fails, the whole block fails and no changes are made at all.
Committing your work
To tell Postgres “apply all the queries in this transaction,” you call connection.commit():
conn.commit()Whenever you open a connection in psycopg2, a new transaction begins automatically. Every query you run is grouped together until you commit. Committing is like saving a document in a text editor: your edits are not really written until you save.
If you change your mind, connection.rollback() discards the transaction. If you call neither commit nor rollback before closing, the transaction stays pending and your changes simply vanish — which is exactly why the users table seemed to disappear earlier.
So the corrected version commits before closing:
import psycopg2
conn = psycopg2.connect("dbname=datatweets user=datatweets")
cur = conn.cursor()
cur.execute("""
CREATE TABLE users (
id integer PRIMARY KEY,
email text,
name text,
address text
);
""")
conn.commit()
conn.close()Different views of the same data
Until you commit, your connection sees your changes as if they were saved, but other users see the old data. Two connections can have completely different views of the database. The moment you commit, everyone sees the same thing. This is the price — and the safety — of a multi-user server.
Inserting and Bulk-Loading Data
With a table in place, you load rows with the INSERT command.
The safe way to insert
A tempting but wrong approach is to build the query string yourself with Python formatting:
insert_string = "INSERT INTO users VALUES ({0}, {1}, {2}, {3});".format(
1, "[email protected]", "John", "123, Fake Street")This breaks. The result looks like this:
INSERT INTO users VALUES (1, [email protected], John, 123, Fake Street);Two things went wrong: the comma inside the address split it into two values (so Postgres sees five values, not four), and the text values are missing their single quotes.
The recommended way is to let psycopg2 substitute the values for you. Pass the query string with %s placeholders as the first argument, and a sequence of values as the second:
import psycopg2
conn = psycopg2.connect("dbname=datatweets user=datatweets")
cur = conn.cursor()
cur.execute(
"INSERT INTO users VALUES (%s, %s, %s, %s);",
(1, "[email protected]", "John", "123, Fake Street")
)
conn.commit()The %s placeholders behave like {} in Python formatting, but psycopg2 handles the quoting and escaping safely. A couple of rules to remember:
- The second argument must always be a sequence, even for a single value:
cur.execute("INSERT INTO foo VALUES (%s)", "bar") # WRONG
cur.execute("INSERT INTO foo VALUES (%s)", ("bar",)) # correct
cur.execute("INSERT INTO foo VALUES (%s)", ["bar"]) # correct- Do not quote the placeholder yourself —
psycopg2does it:
cur.execute("INSERT INTO numbers VALUES ('%s')", (10,)) # WRONG
cur.execute("INSERT INTO numbers VALUES (%s)", (10,)) # correctLoading a whole CSV file
The user_accounts.csv file holds 10,000 users — far too many to insert by hand. Combine psycopg2 with Python’s built-in csv module to loop over every row:
import psycopg2
import csv
conn = psycopg2.connect("dbname=datatweets user=datatweets")
cur = conn.cursor()
with open("user_accounts.csv", "r") as file:
next(file) # skip the header row
reader = csv.reader(file)
for row in reader:
cur.execute("INSERT INTO users VALUES (%s, %s, %s, %s);", row)
conn.commit()
conn.close()Because each row from the CSV reader is already a list, you can pass it straight in as the second argument — no need to break out the individual values. And as always, you commit at the end so the inserts are saved.
The General Postgres Workflow
Pulling it all together, here is the workflow you will use again and again with psycopg2:
- Connect to a database with
psycopg2.connect(). - Create a cursor with
connection.cursor(). - Execute one or more SQL queries with
cursor.execute(). - Commit your changes with
connection.commit(). - Close the connection with
connection.close().
You can run as many queries as you like between commits. Everything between two commits forms a single transaction block that succeeds or fails as a unit.
Practice Exercises
Exercise 1: Connect and disconnect
Write a script that imports psycopg2, connects to a database named datatweets as the user datatweets, stores the connection in a variable named conn, and then closes it.
# Your code hereHint
Your connection string should be "dbname=datatweets user=datatweets". Remember to call conn.close() at the end.
Exercise 2: Create a table and commit
Connect to datatweets, create a cursor, and create a table named products with columns id (integer, primary key), name (text), and price (integer). Commit your changes and close the connection.
# Your code hereHint
If you forget to call conn.commit(), the table will not actually be saved — even though no error appears.
Exercise 3: Insert safely
Insert one product into your products table using a parameterized query with %s placeholders. Then commit.
# Your code hereHint
Pass the query string as the first argument to cursor.execute() and a tuple of values as the second — never build the string with .format().
Summary
You connected to a PostgreSQL server, created a table, inserted data safely, and learned why your changes are not saved until you commit. You also saw why Postgres exists at all: it is a multi-user server, and that single fact shapes connections, transactions, and safety.
Key Concepts
- PostgreSQL (Postgres) — an open-source relational database server that handles many simultaneous client connections, unlike SQLite’s single-file, single-writer model.
- psycopg2 — the Python library used to connect to and query a Postgres server.
- Connection string — the text (e.g.
"dbname=datatweets user=datatweets") that tells Postgres which database and user to connect as. - Cursor — the object, created from a connection, that executes SQL and fetches results.
- Transaction — a group of queries that succeed or fail together; changes become permanent only when you call
connection.commit().
Why This Matters
Almost every production data platform you will ever touch runs on a database server, and Postgres is one of the most popular choices. Understanding the connection-cursor-commit cycle is the foundation for everything a data engineer does: loading pipelines, serving analysts, and keeping data consistent when many people use it at once. The transaction safety you learned here is what stands between a working bank and a lost balance.
Next Steps
Continue to Lesson 2 - Designing Tables and Choosing Data Types
Pick the right data types for your data and modify column types as requirements change
Back to Module Overview
Return to the PostgreSQL for Data Engineers module overview
Continue Building Your Skills
You have made the leap from a single-file database to a real database server, and you now know the rhythm every data engineer follows: connect, execute, commit, close. In the next lesson you will sharpen the most engineering-focused skill of all — choosing data types that save space and protect your data from bad input. Keep going!