Lesson 4 - Loading and Copying Data
Welcome to Bulk Data Loading
Inserting one row at a time is fine for a handful of records, but data engineers move data by the thousands and millions. In this lesson you will learn the faster, more robust ways to load CSV files into Postgres, to export tables back out to files, and to copy data between tables — sometimes without leaving SQL at all.
You will keep working with the IGN game reviews dataset, whose date columns have been merged into a single release_date.
By the end of this lesson, you will be able to:
- Explain how
cursor.mogrify()converts Python values into safe SQL - Insert many rows with a single
INSERTstatement - Bulk-load CSV files with
copy_fromandcopy_expert - Export a Postgres table to a CSV file
- Copy data between tables using
INSERT…SELECT
You will use the psycopg2 workflow from earlier lessons. Let’s load some data.
Data for this lesson
Engine: PostgreSQL. Dataset: IGN game reviews loaded from CSV.
Tables used: ign_reviews, users, users_copy, emails
Real-world source: Kaggle — 20 Years of Games / IGN reviews. Load it into your own PostgreSQL instance (the lesson walks through the COPY import) and run the queries in psql or a GUI client.
Behind the Scenes: mogrify
When you pass values to cursor.execute() with %s placeholders, psycopg2 safely converts your Python objects into Postgres values. The method that does this conversion is cursor.mogrify(), and you can call it directly to see the result without touching the database:
from datetime import date
mogrified_values = cur.mogrify("(%s, %s, %s, %s, %s, %s, %s, %s, %s)",
(52499790661213,
'Amazing',
'LittleBigPlanet PS Vita',
'/games/littlebigplanet-vita/vita-98907',
'PlayStation Vita',
9.0,
'Platformer',
'y',
date(2012, 12, 9))
)
print(mogrified_values)b"(52499790661213, 'Amazing', 'LittleBigPlanet PS Vita', '/games/littlebigplanet-vita/vita-98907', 'PlayStation Vita', 9.0, 'Platformer', 'y', '2012-12-09'::date)"Notice how mogrify quoted the strings and turned the Python date into a proper Postgres '2012-12-09'::date. This is exactly the substitution execute() performs internally.
It returns bytes, not a string
Look closely at the output: it starts with b, meaning it is a bytes object, not a string. To use it as text, decode it with the connection’s encoding:
mogrified_string = mogrified_values.decode(conn.encoding)
print(mogrified_string)(52499790661213, 'Amazing', 'LittleBigPlanet PS Vita', '/games/littlebigplanet-vita/vita-98907', 'PlayStation Vita', 9.0, 'Platformer', 'y', '2012-12-09'::date)The connection.encoding attribute tells you which encoding the connection uses, which depends on factors like your operating system.
Inserting Many Rows at Once
Issuing one INSERT per row sends one query to the server per row — slow when there are thousands. Postgres lets you insert many rows in a single statement by comma-separating the value tuples:
INSERT INTO example_table VALUES (1, 'hello', 'world'), (2, 'data', 'quest'), (3, 'love', 'learning');Using mogrify, you can build that big statement programmatically. Mogrify each row, decode it, join the pieces with commas, and run one INSERT:
import csv
import psycopg2
conn = psycopg2.connect("dbname=datatweets user=datatweets")
cur = conn.cursor()
with open("ign.csv", "r") as f:
next(f) # skip header
reader = csv.reader(f)
rows = [row for row in reader]
mogrified_rows = [
cur.mogrify("(%s, %s, %s, %s, %s, %s, %s, %s, %s)", row)
for row in rows
]
decoded_rows = [row.decode(conn.encoding) for row in mogrified_rows]
insert_string = ",".join(decoded_rows)
cur.execute("INSERT INTO ign_reviews VALUES " + insert_string + ";")
conn.commit()
conn.close()This is much faster than a loop of inserts. But it still has drawbacks: you write a lot of code, you have to deal with encodings, and the entire query string must fit in memory at once before it is sent over the network. For truly large files, there is a better tool.
The COPY Methods
psycopg2 provides dedicated methods for loading files, built on Postgres’s COPY command.
copy_from for simple delimited files
The cursor.copy_from() method loads a text file straight into a table. It takes a file object, a table name, and a separator (default is the TAB character \t). For a file where & separates columns:
cur = conn.cursor()
with open("users.txt", "r") as f:
next(f) # skip the header row
cur.copy_from(f, "users", sep='&')This is simple, but it only works when a single delimiter character cleanly separates columns.
copy_expert for real CSV files
CSV files often contain commas inside values, such as an address like '100, Fake street'. A plain delimiter split would mistake that inner comma for a column boundary. The more robust cursor.copy_expert() method takes a full COPY statement plus a file, and it understands real CSV rules:
conn = psycopg2.connect("dbname=datatweets user=datatweets")
cur = conn.cursor()
with open("ign.csv", "r") as f:
cur.copy_expert("COPY ign_reviews FROM STDIN WITH CSV HEADER;", f)
conn.commit()
conn.close()Each part of the COPY statement matters:
ign_reviews— the target table.FROM STDIN— read the data from the connection (standard input), not a file on the server.WITH— what follows are options.CSV— interpret the input as CSV, so commas inside quoted fields are not treated as separators.HEADER— skip the first row, so you no longer neednext(f).
You can chain options separated by spaces. For a semicolon-delimited file, add DELIMITER ';':
COPY example_table FROM STDIN WITH CSV DELIMITER ';';Which method is fastest?
If you time the three approaches — many single INSERTs, one big mogrified INSERT, and copy_expert — using timeit, copy_expert is consistently the fastest and most memory-efficient for large CSV files. But speed is not everything: if you need to transform each row in Python before loading, the INSERT approaches give you that flexibility. Choose the method that fits the job.
Exporting a Table to a File
copy_expert works in both directions. To write a table out to a file, use COPY table_name TO STDOUT. Just as STDIN means “data comes in over the connection,” STDOUT means “data goes out over the connection” to your client, rather than to a file on the server:
import psycopg2
conn = psycopg2.connect("dbname=datatweets user=datatweets")
cur = conn.cursor()
with open("ign_copy.csv", "w") as f:
cur.copy_expert("COPY ign_reviews TO STDOUT WITH CSV HEADER;", f)
conn.close()Note the "w" mode on open() because you are writing, and the CSV HEADER options so the export includes a header row.
Copying Data Between Tables
Often you need to copy data from one table to another. There are two ways.
Through a temporary file
You can export a table to a temporary CSV and then load that file into a new table. This reuses the two skills you just learned:
import psycopg2
create_string = """
CREATE TABLE ign_reviews_copy (
id bigint PRIMARY KEY,
score_phrase evaluation_enum,
title varchar(200),
url varchar(200),
platform platform_enum,
score decimal(3, 1),
genre genre_enum,
editors_choice boolean,
release_date date
);
"""
conn = psycopg2.connect("dbname=datatweets user=datatweets")
cur = conn.cursor()
# export the source table to a temporary file
with open("temp.csv", "w") as f:
cur.copy_expert("COPY ign_reviews TO STDOUT WITH CSV HEADER;", f)
# create the destination table
cur.execute(create_string)
# load the temporary file into the new table
with open("temp.csv", "r") as f:
cur.copy_expert("COPY ign_reviews_copy FROM STDIN WITH CSV HEADER;", f)
conn.commit()
conn.close()This works well for tables up to around a million rows, but it writes the whole table to your local disk — wasteful for very large tables.
Purely in SQL with INSERT … SELECT
For large tables, do the copy entirely on the server and skip your local disk altogether. Select the columns you want and insert them into the destination table:
INSERT INTO users_copy (id, email, name, address) SELECT * FROM users;The real power here is flexibility: you can copy only some columns, or even transform the data on the way:
INSERT INTO emails (id, email) SELECT id, email FROM users;For example, to build a slimmed-down copy of the reviews with just three columns:
cur.execute("""
CREATE TABLE ign_restricted (
id bigint PRIMARY KEY,
title varchar(200),
release_date date
);
""")
cur.execute("""
INSERT INTO ign_restricted (id, title, release_date)
SELECT id, title, release_date FROM ign_reviews;
""")
conn.commit()Because everything runs on the Postgres server, this scales to tables far too large to fit on your machine.
Practice Exercises
Exercise 1: Read mogrify output
Without running it, predict what cur.mogrify("(%s, %s)", (5, "it's here")) returns. Pay attention to how the apostrophe in the string is handled.
Hint
mogrify escapes single quotes inside strings and returns a bytes object (prefixed with b).
Exercise 2: Load a CSV with copy_expert
Write the code to load a file named products.csv (which has a header row) into a table named products using cursor.copy_expert(). Commit and close afterward.
# Your code hereHint
Use the statement COPY products FROM STDIN WITH CSV HEADER; so the header row is skipped automatically.
Exercise 3: Copy in pure SQL
Write a single INSERT … SELECT statement that copies just the id and title columns from ign_reviews into an existing table ign_titles(id, title).
-- Your code hereHint
Name the destination columns explicitly: INSERT INTO ign_titles (id, title) SELECT id, title FROM ign_reviews;.
Summary
You learned how psycopg2 safely converts Python values with mogrify, how to batch many rows into one INSERT, how to bulk-load and export CSV files with the COPY methods, and how to copy data between tables either through a file or entirely in SQL.
Key Concepts
- mogrify — the
cursormethod that converts Python values into safely quoted SQL; returns a bytes object you decode withconn.encoding. - copy_expert — the fastest, most robust way to load or export CSV files, using a
COPY … STDIN/STDOUTstatement. - STDIN / STDOUT — direct the data over the client connection rather than to a file on the server.
- CSV and HEADER options — make
COPYrespect CSV quoting rules and skip the header row. - INSERT … SELECT — copy data between tables entirely on the server, scaling beyond your local disk.
Why This Matters
Loading data is the daily work of a data pipeline, and the difference between a one-row-at-a-time loop and a copy_expert call can be the difference between minutes and hours. Knowing when to transform in Python versus when to push the whole operation onto the Postgres server is exactly the kind of judgment that makes a pipeline both fast and maintainable.
Next Steps
Continue to Lesson 5 - Managing Users and Databases
Create databases and users and apply security best practices when granting access
Back to Module Overview
Return to the PostgreSQL for Data Engineers module overview
Continue Building Your Skills
You can now move data into, out of, and between Postgres tables at scale. The last piece of the engineering puzzle is the human side: deciding who gets to touch that data and what they are allowed to do. In the next lesson you will manage users, groups, and databases with security in mind. See you there.