Lesson 2 - Designing Tables and Choosing Data Types

Welcome to Data Type Design

Choosing the right data type for each column is one of the most important things a data engineer does. The right types save space on the server — which means faster reads and writes — and they catch bad data the moment it tries to enter your table. The wrong types waste disk, let corrupt values slip in, and make your queries harder to write.

In this lesson you will work with a real dataset of game reviews from IGN, and you will reshape a poorly typed table into a well-designed one, one column at a time.

By the end of this lesson, you will be able to:

Inspect the data types Postgres is using for a table’s columns
Choose the correct numeric type and understand integer ranges
Limit string columns with varchar(n) and create enumerated types
Use boolean and date types to store data compactly and query it easily
Add, drop, and alter columns as your requirements change

You will need Lesson 1’s psycopg2 workflow. Let’s design some tables.

Data for this lesson

Engine: PostgreSQL. Dataset: IGN game reviews — used to design tables and choose data types.

Tables used: ign_reviews

Real-world source: Kaggle — 20 Years of Games / IGN reviews. Load it into your own PostgreSQL instance (the lesson walks through the COPY import) and run the queries in psql or a GUI client.

When the Data Type Is Wrong

Here is the IGN dataset you will reshape. It introduces several types beyond the text and integer you have seen:

id             | score_phrase | title                   | platform         | score | genre      | editors_choice | release_year | release_month | release_day
---------------+--------------+-------------------------+------------------+-------+------------+----------------+--------------+---------------+------------
52499790661213 | Amazing      | LittleBigPlanet PS Vita | PlayStation Vita | 9.0   | Platformer | y              | 2012         | 9             | 12
582057716445789| Great        | Splice: Tree of Life    | iPad             | 8.5   | Puzzle     | n              | 2012         | 9             | 12
108760805829117| Great        | NHL 13                  | Xbox 360         | 8.5   | Sports     | n              | 2012         | 9             | 11

Suppose an ign_reviews table already exists, but its id column was created as a plain integer. When you try to insert the first row, you hit an error:

import psycopg2
conn = psycopg2.connect("dbname=datatweets user=datatweets")
cur = conn.cursor()
cur.execute(query_string)  # inserts the first game

NumericValueOutOfRange: integer out of range

The culprit is the id value 52499790661213 — it is too big for an integer. To fix it, you first need to see what types the table is using.

Inspecting a Table’s Types

You can discover a table’s data types by running a query and inspecting the cursor’s description property. To avoid pulling any actual rows, ask for zero rows with LIMIT 0:

cur.execute("SELECT * FROM ign_reviews LIMIT 0;")
print(cur.description)

The description is a tuple of Column objects, each with a name and a type_code:

Column(name='id', type_code=23)

Those numeric type codes are hard to read on their own. Organized into a table, the ign_reviews columns look like this:

name           | type_code
---------------+----------
id             | 23
score_phrase   | 25
title          | 25
url            | 25
platform       | 25
score          | 700
genre          | 25
editors_choice | 23
release_year   | 23
release_month  | 23
release_day    | 23

Turning codes into type names

Postgres stores the meaning of every type code in an internal table called pg_catalog.pg_type. Its oid column matches the type_code, and its typname column gives the readable name. So to look up code 23:

cur.execute("SELECT typname FROM pg_catalog.pg_type WHERE oid = 23;")
type_name_23 = cur.fetchone()[0]
print(type_name_23)

int4

Looking up codes 25 and 700 the same way reveals the full picture:

Column name | id   | score_phrase | title | url  | platform | score  | genre | editors_choice | release_year | release_month | release_day
------------+------+--------------+-------+------+----------+--------+-------+----------------+--------------+---------------+------------
Type name   | int4 | text         | text  | text | text     | float4 | text  | text           | int4         | int4          | int4

So 23 is int4 (a 4-byte integer), 25 is text, and 700 is float4 (a 4-byte float).

Choosing Numeric Types

Now you know id is an int4. What range does that cover? Here are the numeric types Postgres supports:

Internal | Name             | Storage  | Range
---------+------------------+----------+--------------------------------------------------
int2     | smallint         | 2 bytes  | -32768 to 32767
int4     | integer          | 4 bytes  | -2147483648 to 2147483647
int8     | bigint           | 8 bytes  | -9223372036854775808 to 9223372036854775807
numeric  | decimal/numeric  | variable | user-specified precision, exact
float4   | real             | 4 bytes  | 6 decimal digits precision
float8   | double precision | 8 bytes  | 15 decimal digits precision
int4     | serial           | 4 bytes  | 1 to 2147483647 (autoincrementing)
int8     | bigserial        | 8 bytes  | 1 to 9223372036854775807 (autoincrementing)

An int4 tops out at 2147483647, far below 52499790661213 — that is why the insert failed. You can confirm how many bits the value needs:

id_first_row = 52499790661213
print(id_first_row.bit_length())

It needs 46 bits, more than the 32 bits integer allows. A bigint (8 bytes = 64 bits) is large enough.

Altering a column’s type

To change a column’s type, use ALTER TABLE with ALTER COLUMN:

ALTER TABLE table_name
ALTER COLUMN column_name TYPE new_type;

So fixing the id column looks like this:

cur.execute("""
    ALTER TABLE ign_reviews
    ALTER COLUMN id TYPE bigint;
""")
conn.commit()

Precise decimals

The score column holds values like 9.0, 8.5, and 6.5. Right now it is float4 (real), which allows six decimal digits of precision — far more than scores ever need. Scores have at most two digits before the decimal point and one after, so the decimal type is a better fit.

The decimal type takes two parameters: precision (total digits) and scale (digits after the decimal point). The number 23.5141 has precision 6 and scale 4. For scores, precision 3 and scale 1 are enough:

cur.execute("""
    ALTER TABLE ign_reviews
    ALTER COLUMN score TYPE DECIMAL(3, 1);
""")
conn.commit()

In Postgres, numeric and decimal are the same type — you can use whichever name you prefer.

String Types: text, varchar, and Enums

The text type stores strings of any length, which gives you total flexibility — and that is not always good. If a primary_color column is supposed to hold values like blue or black, text would happily accept a nonsense value like smurgabourgabacko. That is corrupted data: unexpected values that should never have entered the table.

Limiting length with varchar

Postgres offers three string types:

Name                              | Description
----------------------------------+------------------------------
character varying(n), varchar(n)  | variable-length with limit
character(n), char(n)             | fixed-length, blank padded
text                              | variable, unlimited length

The difference between char(n) and varchar(n) matters: char(10) pads black out to ten characters with trailing spaces (black␣␣␣␣␣), wasting space on every row, while varchar(10) stores only the five characters it needs. The rule of thumb in Postgres: use text when sizes are uncertain, and varchar(n) when you know a sensible maximum.

To find a good maximum for score_phrase, you can read the CSV and collect the distinct values:

import csv
with open('ign.csv', 'r') as f:
    next(f)
    reader = csv.reader(f)
    unique_words = set()
    for row in reader:
        unique_words.add(row[1])
print(unique_words)

{'Great', 'Mediocre', 'Bad', 'Good', 'Awful', 'Okay', 'Masterpiece', 'Amazing', 'Unbearable', 'Disaster', 'Painful'}

The longest phrase, Masterpiece, is 11 characters, so varchar(11) is a tight fit:

cur.execute("""
    ALTER TABLE ign_reviews
    ALTER COLUMN score_phrase TYPE varchar(11);
""")
conn.commit()

Even tighter: enumerated types

The score_phrase column is special — it only ever holds one of eleven specific values. With varchar(11), a string like "DataTweets" would still be accepted even though it is not a valid score phrase. An enumerated type fixes that by allowing only a predefined set of values, and storing each as a compact 4-byte index in an internal pg_enum table.

You create the type, then cast the column to it:

cur.execute("""
    CREATE TYPE evaluation_enum AS ENUM (
        'Great', 'Mediocre', 'Bad', 'Good', 'Awful', 'Okay',
        'Masterpiece', 'Amazing', 'Unbearable', 'Disaster', 'Painful'
    );
""")
cur.execute("""
    ALTER TABLE ign_reviews
    ALTER COLUMN score_phrase TYPE evaluation_enum
    USING score_phrase::evaluation_enum;
""")
conn.commit()

The USING column_name::enum_type_name clause tells Postgres how to interpret the existing values as the new type. You needed it here because you are switching between different kinds of types (text to enum); you did not need it earlier when converting between two numeric types or two string types.

The platform and genre columns are also good enum candidates — a platform is a real system, and a genre comes from a fixed list — so you would create platform_enum and genre_enum the same way. The title and url columns, on the other hand, are essentially free text, so varchar(200) is the right choice for them.

Boolean and Date Types

Boolean

The editors_choice column holds only y or n. You might reach for varchar(1) or an enum, but Postgres has a purpose-built boolean type. It is one byte, like varchar(1), but far easier to query. Postgres accepts many spellings as boolean input, including the y/n already in the data:

True  | False
------+------
TRUE  | FALSE
t     | f
true  | false
y     | n
yes   | no
on    | off
1     | 0

Cast the column to boolean:

cur.execute("""
    ALTER TABLE ign_reviews
    ALTER COLUMN editors_choice TYPE boolean
    USING editors_choice::boolean;
""")
conn.commit()

Date

The dataset splits release dates across three columns: release_year, release_month, and release_day. To find every game released after April 2011, you would need a clumsy query:

SELECT * FROM ign_reviews
WHERE (release_year > 2011) OR (release_year = 2011 AND release_month > 4);

A single date column makes that trivial:

SELECT * FROM ign_reviews WHERE release_date > '2011-04-01';

You add and drop columns with ALTER TABLE:

cur.execute("ALTER TABLE ign_reviews ADD COLUMN release_date date;")
cur.execute("ALTER TABLE ign_reviews DROP COLUMN release_year;")
cur.execute("ALTER TABLE ign_reviews DROP COLUMN release_month;")
cur.execute("ALTER TABLE ign_reviews DROP COLUMN release_day;")
conn.commit()

Loading data into the redesigned table

When you finally load ign.csv, you combine the three date columns into a single Python datetime.date before inserting:

import datetime
import psycopg2
import csv

conn = psycopg2.connect("dbname=datatweets user=datatweets")
cur = conn.cursor()

with open('ign.csv', 'r') as file:
    next(file)  # skip header
    reader = csv.reader(file)
    for row in reader:
        year = int(row[8])
        month = int(row[9])
        day = int(row[10])
        date = datetime.date(year, month, day)
        row = row[:-3]      # drop the three date columns
        row.append(date)    # append the combined date
        cur.execute(
            "INSERT INTO ign_reviews VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s);",
            row
        )

conn.commit()
conn.close()

Practice Exercises

Exercise 1: Inspect a table

Connect to datatweets, run SELECT * FROM ign_reviews LIMIT 0;, and print cur.description. Identify the type code of the id column.

# Your code here

Hint

LIMIT 0 returns no rows but still gives you the full column description — an efficient way to inspect types.

Exercise 2: Choose a numeric type

A column will store population counts up to about 8 billion. Which numeric type should you use, and why is integer not enough?

Hint

Compare 8,000,000,000 against the integer maximum of 2,147,483,647 and the bigint maximum.

Exercise 3: Create an enum

Write the SQL to create an enumerated type named day_enum that allows only the seven days of the week, then alter a hypothetical day_of_week text column to use it.

-- Your code here

Hint

Remember the USING day_of_week::day_enum clause when switching from text to an enum type.

Summary

You learned to inspect a table’s types, diagnose an out-of-range error, and reshape every column of a real dataset into a precise, space-efficient type — numeric, string, enumerated, boolean, and date.

Key Concepts

Data type — the declared kind of value a column holds; the right type saves space and rejects bad data.
bigint — an 8-byte integer for values too large for integer (which tops out near 2.1 billion).
decimal(precision, scale) — an exact decimal type where you control total digits and digits after the point.
varchar(n) vs. text — varchar(n) caps length and saves space when you know a maximum; text is unlimited.
Enumerated type — a custom type that accepts only a fixed set of values, stored compactly as 4-byte indices.

Why This Matters

Data modeling is the quiet, high-leverage work of data engineering. Every byte you save per row is multiplied across millions of rows, and every type constraint you add is a wall that bad data cannot climb. When an analyst trusts that a column is always a valid date or one of eleven score phrases, they write simpler, faster, more correct queries — and that trust starts with the types you choose here.

Next Steps

Continue to Lesson 3 - Prepared Statements and Preventing SQL Injection

Write safe parameterized queries and understand how SQL injection attacks work

Back to Module Overview

Return to the PostgreSQL for Data Engineers module overview

Continue Building Your Skills

You now design tables the way an engineer does: deliberately, with every column typed to fit its data. In the next lesson you will turn your attention to safety — how the way you pass values into a query can either protect your database or hand the keys to an attacker. See you there.

Lesson 11 - Handling Missing Data

Lesson 12 - Data Type Conversion and Cleaning

Courses

DATATWEETS

Title here

Lesson 2 - Designing Tables and Choosing Data Types

Welcome to Data Type Design

When the Data Type Is Wrong

Inspecting a Table’s Types

Turning codes into type names

Choosing Numeric Types

Altering a column’s type

Precise decimals

String Types: text, varchar, and Enums

Limiting length with varchar

Even tighter: enumerated types

Boolean and Date Types

Boolean

Date

Loading data into the redesigned table

Practice Exercises

Exercise 1: Inspect a table

Exercise 2: Choose a numeric type

Exercise 3: Create an enum

Summary

Key Concepts

Why This Matters

Next Steps

Continue to Lesson 3 - Prepared Statements and Preventing SQL Injection

Back to Module Overview

Continue Building Your Skills

Lesson 2 - Designing Tables and Choosing Data Types

Welcome to Data Type Design#

When the Data Type Is Wrong#

Inspecting a Table’s Types#

Turning codes into type names#

Choosing Numeric Types#

Altering a column’s type#

Precise decimals#

String Types: text, varchar, and Enums#

Limiting length with varchar#

Even tighter: enumerated types#

Boolean and Date Types#

Boolean#

Date#

Loading data into the redesigned table#

Practice Exercises#

Exercise 1: Inspect a table#

Exercise 2: Choose a numeric type#

Exercise 3: Create an enum#

Summary#

Key Concepts#

Why This Matters#

Next Steps#

Continue to Lesson 3 - Prepared Statements and Preventing SQL Injection

Back to Module Overview

Continue Building Your Skills#

Welcome to Data Type Design

When the Data Type Is Wrong

Inspecting a Table’s Types

Turning codes into type names

Choosing Numeric Types

Altering a column’s type

Precise decimals

String Types: text, varchar, and Enums

Limiting length with varchar

Even tighter: enumerated types

Boolean and Date Types

Boolean

Date

Loading data into the redesigned table

Practice Exercises

Exercise 1: Inspect a table

Exercise 2: Choose a numeric type

Exercise 3: Create an enum

Summary

Key Concepts

Why This Matters

Next Steps

Continue Building Your Skills