Lesson 4 - Loading and Copying Data

Welcome to Bulk Data Loading

Inserting one row at a time is fine for a handful of records, but data engineers move data by the thousands and millions. In this lesson you will learn the faster, more robust ways to load CSV files into Postgres, to export tables back out to files, and to copy data between tables — sometimes without leaving SQL at all.

You will keep working with the IGN game reviews dataset, whose date columns have been merged into a single release_date.

By the end of this lesson, you will be able to:

  • Explain how cursor.mogrify() converts Python values into safe SQL
  • Insert many rows with a single INSERT statement
  • Bulk-load CSV files with copy_from and copy_expert
  • Export a Postgres table to a CSV file
  • Copy data between tables using INSERTSELECT

You will use the psycopg2 workflow from earlier lessons. Let’s load some data.

Data for this lesson

Engine: PostgreSQL. Dataset: IGN game reviews loaded from CSV.

Tables used: ign_reviews, users, users_copy, emails

Real-world source: Kaggle — 20 Years of Games / IGN reviews. Load it into your own PostgreSQL instance (the lesson walks through the COPY import) and run the queries in psql or a GUI client.


Behind the Scenes: mogrify

When you pass values to cursor.execute() with %s placeholders, psycopg2 safely converts your Python objects into Postgres values. The method that does this conversion is cursor.mogrify(), and you can call it directly to see the result without touching the database:

from datetime import date

mogrified_values = cur.mogrify("(%s, %s, %s, %s, %s, %s, %s, %s, %s)",
    (52499790661213,
     'Amazing',
     'LittleBigPlanet PS Vita',
     '/games/littlebigplanet-vita/vita-98907',
     'PlayStation Vita',
     9.0,
     'Platformer',
     'y',
     date(2012, 12, 9))
)
print(mogrified_values)
b"(52499790661213, 'Amazing', 'LittleBigPlanet PS Vita', '/games/littlebigplanet-vita/vita-98907', 'PlayStation Vita', 9.0, 'Platformer', 'y', '2012-12-09'::date)"

Notice how mogrify quoted the strings and turned the Python date into a proper Postgres '2012-12-09'::date. This is exactly the substitution execute() performs internally.

It returns bytes, not a string

Look closely at the output: it starts with b, meaning it is a bytes object, not a string. To use it as text, decode it with the connection’s encoding:

mogrified_string = mogrified_values.decode(conn.encoding)
print(mogrified_string)
(52499790661213, 'Amazing', 'LittleBigPlanet PS Vita', '/games/littlebigplanet-vita/vita-98907', 'PlayStation Vita', 9.0, 'Platformer', 'y', '2012-12-09'::date)

The connection.encoding attribute tells you which encoding the connection uses, which depends on factors like your operating system.


Inserting Many Rows at Once

Issuing one INSERT per row sends one query to the server per row — slow when there are thousands. Postgres lets you insert many rows in a single statement by comma-separating the value tuples:

INSERT INTO example_table VALUES (1, 'hello', 'world'), (2, 'data', 'quest'), (3, 'love', 'learning');

Using mogrify, you can build that big statement programmatically. Mogrify each row, decode it, join the pieces with commas, and run one INSERT:

import csv
import psycopg2

conn = psycopg2.connect("dbname=datatweets user=datatweets")
cur = conn.cursor()

with open("ign.csv", "r") as f:
    next(f)  # skip header
    reader = csv.reader(f)
    rows = [row for row in reader]
    mogrified_rows = [
        cur.mogrify("(%s, %s, %s, %s, %s, %s, %s, %s, %s)", row)
        for row in rows
    ]
    decoded_rows = [row.decode(conn.encoding) for row in mogrified_rows]
    insert_string = ",".join(decoded_rows)
    cur.execute("INSERT INTO ign_reviews VALUES " + insert_string + ";")
    conn.commit()
    conn.close()

This is much faster than a loop of inserts. But it still has drawbacks: you write a lot of code, you have to deal with encodings, and the entire query string must fit in memory at once before it is sent over the network. For truly large files, there is a better tool.


The COPY Methods

psycopg2 provides dedicated methods for loading files, built on Postgres’s COPY command.

copy_from for simple delimited files

The cursor.copy_from() method loads a text file straight into a table. It takes a file object, a table name, and a separator (default is the TAB character \t). For a file where & separates columns:

cur = conn.cursor()
with open("users.txt", "r") as f:
    next(f)  # skip the header row
    cur.copy_from(f, "users", sep='&')

This is simple, but it only works when a single delimiter character cleanly separates columns.

copy_expert for real CSV files

CSV files often contain commas inside values, such as an address like '100, Fake street'. A plain delimiter split would mistake that inner comma for a column boundary. The more robust cursor.copy_expert() method takes a full COPY statement plus a file, and it understands real CSV rules:

conn = psycopg2.connect("dbname=datatweets user=datatweets")
cur = conn.cursor()
with open("ign.csv", "r") as f:
    cur.copy_expert("COPY ign_reviews FROM STDIN WITH CSV HEADER;", f)
    conn.commit()
    conn.close()

Each part of the COPY statement matters:

  • ign_reviews — the target table.
  • FROM STDIN — read the data from the connection (standard input), not a file on the server.
  • WITH — what follows are options.
  • CSV — interpret the input as CSV, so commas inside quoted fields are not treated as separators.
  • HEADER — skip the first row, so you no longer need next(f).

You can chain options separated by spaces. For a semicolon-delimited file, add DELIMITER ';':

COPY example_table FROM STDIN WITH CSV DELIMITER ';';

Which method is fastest?

If you time the three approaches — many single INSERTs, one big mogrified INSERT, and copy_expert — using timeit, copy_expert is consistently the fastest and most memory-efficient for large CSV files. But speed is not everything: if you need to transform each row in Python before loading, the INSERT approaches give you that flexibility. Choose the method that fits the job.


Exporting a Table to a File

copy_expert works in both directions. To write a table out to a file, use COPY table_name TO STDOUT. Just as STDIN means “data comes in over the connection,” STDOUT means “data goes out over the connection” to your client, rather than to a file on the server:

import psycopg2
conn = psycopg2.connect("dbname=datatweets user=datatweets")
cur = conn.cursor()
with open("ign_copy.csv", "w") as f:
    cur.copy_expert("COPY ign_reviews TO STDOUT WITH CSV HEADER;", f)
    conn.close()

Note the "w" mode on open() because you are writing, and the CSV HEADER options so the export includes a header row.


Copying Data Between Tables

Often you need to copy data from one table to another. There are two ways.

Through a temporary file

You can export a table to a temporary CSV and then load that file into a new table. This reuses the two skills you just learned:

import psycopg2

create_string = """
CREATE TABLE ign_reviews_copy (
    id bigint PRIMARY KEY,
    score_phrase evaluation_enum,
    title varchar(200),
    url varchar(200),
    platform platform_enum,
    score decimal(3, 1),
    genre genre_enum,
    editors_choice boolean,
    release_date date
);
"""

conn = psycopg2.connect("dbname=datatweets user=datatweets")
cur = conn.cursor()

# export the source table to a temporary file
with open("temp.csv", "w") as f:
    cur.copy_expert("COPY ign_reviews TO STDOUT WITH CSV HEADER;", f)

# create the destination table
cur.execute(create_string)

# load the temporary file into the new table
with open("temp.csv", "r") as f:
    cur.copy_expert("COPY ign_reviews_copy FROM STDIN WITH CSV HEADER;", f)

conn.commit()
conn.close()

This works well for tables up to around a million rows, but it writes the whole table to your local disk — wasteful for very large tables.

Purely in SQL with INSERT … SELECT

For large tables, do the copy entirely on the server and skip your local disk altogether. Select the columns you want and insert them into the destination table:

INSERT INTO users_copy (id, email, name, address) SELECT * FROM users;

The real power here is flexibility: you can copy only some columns, or even transform the data on the way:

INSERT INTO emails (id, email) SELECT id, email FROM users;

For example, to build a slimmed-down copy of the reviews with just three columns:

cur.execute("""
    CREATE TABLE ign_restricted (
        id bigint PRIMARY KEY,
        title varchar(200),
        release_date date
    );
""")
cur.execute("""
    INSERT INTO ign_restricted (id, title, release_date)
    SELECT id, title, release_date FROM ign_reviews;
""")
conn.commit()

Because everything runs on the Postgres server, this scales to tables far too large to fit on your machine.


Practice Exercises

Exercise 1: Read mogrify output

Without running it, predict what cur.mogrify("(%s, %s)", (5, "it's here")) returns. Pay attention to how the apostrophe in the string is handled.

Hint

mogrify escapes single quotes inside strings and returns a bytes object (prefixed with b).

Exercise 2: Load a CSV with copy_expert

Write the code to load a file named products.csv (which has a header row) into a table named products using cursor.copy_expert(). Commit and close afterward.

# Your code here

Hint

Use the statement COPY products FROM STDIN WITH CSV HEADER; so the header row is skipped automatically.

Exercise 3: Copy in pure SQL

Write a single INSERT … SELECT statement that copies just the id and title columns from ign_reviews into an existing table ign_titles(id, title).

-- Your code here

Hint

Name the destination columns explicitly: INSERT INTO ign_titles (id, title) SELECT id, title FROM ign_reviews;.


Summary

You learned how psycopg2 safely converts Python values with mogrify, how to batch many rows into one INSERT, how to bulk-load and export CSV files with the COPY methods, and how to copy data between tables either through a file or entirely in SQL.

Key Concepts

  • mogrify — the cursor method that converts Python values into safely quoted SQL; returns a bytes object you decode with conn.encoding.
  • copy_expert — the fastest, most robust way to load or export CSV files, using a COPY … STDIN/STDOUT statement.
  • STDIN / STDOUT — direct the data over the client connection rather than to a file on the server.
  • CSV and HEADER options — make COPY respect CSV quoting rules and skip the header row.
  • INSERT … SELECT — copy data between tables entirely on the server, scaling beyond your local disk.

Why This Matters

Loading data is the daily work of a data pipeline, and the difference between a one-row-at-a-time loop and a copy_expert call can be the difference between minutes and hours. Knowing when to transform in Python versus when to push the whole operation onto the Postgres server is exactly the kind of judgment that makes a pipeline both fast and maintainable.


Next Steps

Continue to Lesson 5 - Managing Users and Databases

Create databases and users and apply security best practices when granting access

Back to Module Overview

Return to the PostgreSQL for Data Engineers module overview


Continue Building Your Skills

You can now move data into, out of, and between Postgres tables at scale. The last piece of the engineering puzzle is the human side: deciding who gets to touch that data and what they are allowed to do. In the next lesson you will manage users, groups, and databases with security in mind. See you there.