Lesson 2 - Designing Tables and Choosing Data Types
Welcome to Data Type Design
Choosing the right data type for each column is one of the most important things a data engineer does. The right types save space on the server — which means faster reads and writes — and they catch bad data the moment it tries to enter your table. The wrong types waste disk, let corrupt values slip in, and make your queries harder to write.
In this lesson you will work with a real dataset of game reviews from IGN, and you will reshape a poorly typed table into a well-designed one, one column at a time.
By the end of this lesson, you will be able to:
- Inspect the data types Postgres is using for a table’s columns
- Choose the correct numeric type and understand integer ranges
- Limit string columns with
varchar(n)and create enumerated types - Use
booleananddatetypes to store data compactly and query it easily - Add, drop, and alter columns as your requirements change
You will need Lesson 1’s psycopg2 workflow. Let’s design some tables.
Data for this lesson
Engine: PostgreSQL. Dataset: IGN game reviews — used to design tables and choose data types.
Tables used: ign_reviews
Real-world source: Kaggle — 20 Years of Games / IGN reviews. Load it into your own PostgreSQL instance (the lesson walks through the COPY import) and run the queries in psql or a GUI client.
When the Data Type Is Wrong
Here is the IGN dataset you will reshape. It introduces several types beyond the text and integer you have seen:
id | score_phrase | title | platform | score | genre | editors_choice | release_year | release_month | release_day
---------------+--------------+-------------------------+------------------+-------+------------+----------------+--------------+---------------+------------
52499790661213 | Amazing | LittleBigPlanet PS Vita | PlayStation Vita | 9.0 | Platformer | y | 2012 | 9 | 12
582057716445789| Great | Splice: Tree of Life | iPad | 8.5 | Puzzle | n | 2012 | 9 | 12
108760805829117| Great | NHL 13 | Xbox 360 | 8.5 | Sports | n | 2012 | 9 | 11Suppose an ign_reviews table already exists, but its id column was created as a plain integer. When you try to insert the first row, you hit an error:
import psycopg2
conn = psycopg2.connect("dbname=datatweets user=datatweets")
cur = conn.cursor()
cur.execute(query_string) # inserts the first gameNumericValueOutOfRange: integer out of rangeThe culprit is the id value 52499790661213 — it is too big for an integer. To fix it, you first need to see what types the table is using.
Inspecting a Table’s Types
You can discover a table’s data types by running a query and inspecting the cursor’s description property. To avoid pulling any actual rows, ask for zero rows with LIMIT 0:
cur.execute("SELECT * FROM ign_reviews LIMIT 0;")
print(cur.description)The description is a tuple of Column objects, each with a name and a type_code:
Column(name='id', type_code=23)Those numeric type codes are hard to read on their own. Organized into a table, the ign_reviews columns look like this:
name | type_code
---------------+----------
id | 23
score_phrase | 25
title | 25
url | 25
platform | 25
score | 700
genre | 25
editors_choice | 23
release_year | 23
release_month | 23
release_day | 23Turning codes into type names
Postgres stores the meaning of every type code in an internal table called pg_catalog.pg_type. Its oid column matches the type_code, and its typname column gives the readable name. So to look up code 23:
cur.execute("SELECT typname FROM pg_catalog.pg_type WHERE oid = 23;")
type_name_23 = cur.fetchone()[0]
print(type_name_23)int4Looking up codes 25 and 700 the same way reveals the full picture:
Column name | id | score_phrase | title | url | platform | score | genre | editors_choice | release_year | release_month | release_day
------------+------+--------------+-------+------+----------+--------+-------+----------------+--------------+---------------+------------
Type name | int4 | text | text | text | text | float4 | text | text | int4 | int4 | int4So 23 is int4 (a 4-byte integer), 25 is text, and 700 is float4 (a 4-byte float).
Choosing Numeric Types
Now you know id is an int4. What range does that cover? Here are the numeric types Postgres supports:
Internal | Name | Storage | Range
---------+------------------+----------+--------------------------------------------------
int2 | smallint | 2 bytes | -32768 to 32767
int4 | integer | 4 bytes | -2147483648 to 2147483647
int8 | bigint | 8 bytes | -9223372036854775808 to 9223372036854775807
numeric | decimal/numeric | variable | user-specified precision, exact
float4 | real | 4 bytes | 6 decimal digits precision
float8 | double precision | 8 bytes | 15 decimal digits precision
int4 | serial | 4 bytes | 1 to 2147483647 (autoincrementing)
int8 | bigserial | 8 bytes | 1 to 9223372036854775807 (autoincrementing)An int4 tops out at 2147483647, far below 52499790661213 — that is why the insert failed. You can confirm how many bits the value needs:
id_first_row = 52499790661213
print(id_first_row.bit_length())46It needs 46 bits, more than the 32 bits integer allows. A bigint (8 bytes = 64 bits) is large enough.
Altering a column’s type
To change a column’s type, use ALTER TABLE with ALTER COLUMN:
ALTER TABLE table_name
ALTER COLUMN column_name TYPE new_type;So fixing the id column looks like this:
cur.execute("""
ALTER TABLE ign_reviews
ALTER COLUMN id TYPE bigint;
""")
conn.commit()Precise decimals
The score column holds values like 9.0, 8.5, and 6.5. Right now it is float4 (real), which allows six decimal digits of precision — far more than scores ever need. Scores have at most two digits before the decimal point and one after, so the decimal type is a better fit.
The decimal type takes two parameters: precision (total digits) and scale (digits after the decimal point). The number 23.5141 has precision 6 and scale 4. For scores, precision 3 and scale 1 are enough:
cur.execute("""
ALTER TABLE ign_reviews
ALTER COLUMN score TYPE DECIMAL(3, 1);
""")
conn.commit()In Postgres, numeric and decimal are the same type — you can use whichever name you prefer.
String Types: text, varchar, and Enums
The text type stores strings of any length, which gives you total flexibility — and that is not always good. If a primary_color column is supposed to hold values like blue or black, text would happily accept a nonsense value like smurgabourgabacko. That is corrupted data: unexpected values that should never have entered the table.
Limiting length with varchar
Postgres offers three string types:
Name | Description
----------------------------------+------------------------------
character varying(n), varchar(n) | variable-length with limit
character(n), char(n) | fixed-length, blank padded
text | variable, unlimited lengthThe difference between char(n) and varchar(n) matters: char(10) pads black out to ten characters with trailing spaces (black␣␣␣␣␣), wasting space on every row, while varchar(10) stores only the five characters it needs. The rule of thumb in Postgres: use text when sizes are uncertain, and varchar(n) when you know a sensible maximum.
To find a good maximum for score_phrase, you can read the CSV and collect the distinct values:
import csv
with open('ign.csv', 'r') as f:
next(f)
reader = csv.reader(f)
unique_words = set()
for row in reader:
unique_words.add(row[1])
print(unique_words){'Great', 'Mediocre', 'Bad', 'Good', 'Awful', 'Okay', 'Masterpiece', 'Amazing', 'Unbearable', 'Disaster', 'Painful'}The longest phrase, Masterpiece, is 11 characters, so varchar(11) is a tight fit:
cur.execute("""
ALTER TABLE ign_reviews
ALTER COLUMN score_phrase TYPE varchar(11);
""")
conn.commit()Even tighter: enumerated types
The score_phrase column is special — it only ever holds one of eleven specific values. With varchar(11), a string like "DataTweets" would still be accepted even though it is not a valid score phrase. An enumerated type fixes that by allowing only a predefined set of values, and storing each as a compact 4-byte index in an internal pg_enum table.
You create the type, then cast the column to it:
cur.execute("""
CREATE TYPE evaluation_enum AS ENUM (
'Great', 'Mediocre', 'Bad', 'Good', 'Awful', 'Okay',
'Masterpiece', 'Amazing', 'Unbearable', 'Disaster', 'Painful'
);
""")
cur.execute("""
ALTER TABLE ign_reviews
ALTER COLUMN score_phrase TYPE evaluation_enum
USING score_phrase::evaluation_enum;
""")
conn.commit()The USING column_name::enum_type_name clause tells Postgres how to interpret the existing values as the new type. You needed it here because you are switching between different kinds of types (text to enum); you did not need it earlier when converting between two numeric types or two string types.
The platform and genre columns are also good enum candidates — a platform is a real system, and a genre comes from a fixed list — so you would create platform_enum and genre_enum the same way. The title and url columns, on the other hand, are essentially free text, so varchar(200) is the right choice for them.
Boolean and Date Types
Boolean
The editors_choice column holds only y or n. You might reach for varchar(1) or an enum, but Postgres has a purpose-built boolean type. It is one byte, like varchar(1), but far easier to query. Postgres accepts many spellings as boolean input, including the y/n already in the data:
True | False
------+------
TRUE | FALSE
t | f
true | false
y | n
yes | no
on | off
1 | 0Cast the column to boolean:
cur.execute("""
ALTER TABLE ign_reviews
ALTER COLUMN editors_choice TYPE boolean
USING editors_choice::boolean;
""")
conn.commit()Date
The dataset splits release dates across three columns: release_year, release_month, and release_day. To find every game released after April 2011, you would need a clumsy query:
SELECT * FROM ign_reviews
WHERE (release_year > 2011) OR (release_year = 2011 AND release_month > 4);A single date column makes that trivial:
SELECT * FROM ign_reviews WHERE release_date > '2011-04-01';You add and drop columns with ALTER TABLE:
cur.execute("ALTER TABLE ign_reviews ADD COLUMN release_date date;")
cur.execute("ALTER TABLE ign_reviews DROP COLUMN release_year;")
cur.execute("ALTER TABLE ign_reviews DROP COLUMN release_month;")
cur.execute("ALTER TABLE ign_reviews DROP COLUMN release_day;")
conn.commit()Loading data into the redesigned table
When you finally load ign.csv, you combine the three date columns into a single Python datetime.date before inserting:
import datetime
import psycopg2
import csv
conn = psycopg2.connect("dbname=datatweets user=datatweets")
cur = conn.cursor()
with open('ign.csv', 'r') as file:
next(file) # skip header
reader = csv.reader(file)
for row in reader:
year = int(row[8])
month = int(row[9])
day = int(row[10])
date = datetime.date(year, month, day)
row = row[:-3] # drop the three date columns
row.append(date) # append the combined date
cur.execute(
"INSERT INTO ign_reviews VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s);",
row
)
conn.commit()
conn.close()Practice Exercises
Exercise 1: Inspect a table
Connect to datatweets, run SELECT * FROM ign_reviews LIMIT 0;, and print cur.description. Identify the type code of the id column.
# Your code hereHint
LIMIT 0 returns no rows but still gives you the full column description — an efficient way to inspect types.
Exercise 2: Choose a numeric type
A column will store population counts up to about 8 billion. Which numeric type should you use, and why is integer not enough?
Hint
Compare 8,000,000,000 against the integer maximum of 2,147,483,647 and the bigint maximum.
Exercise 3: Create an enum
Write the SQL to create an enumerated type named day_enum that allows only the seven days of the week, then alter a hypothetical day_of_week text column to use it.
-- Your code hereHint
Remember the USING day_of_week::day_enum clause when switching from text to an enum type.
Summary
You learned to inspect a table’s types, diagnose an out-of-range error, and reshape every column of a real dataset into a precise, space-efficient type — numeric, string, enumerated, boolean, and date.
Key Concepts
- Data type — the declared kind of value a column holds; the right type saves space and rejects bad data.
- bigint — an 8-byte integer for values too large for
integer(which tops out near 2.1 billion). - decimal(precision, scale) — an exact decimal type where you control total digits and digits after the point.
- varchar(n) vs. text —
varchar(n)caps length and saves space when you know a maximum;textis unlimited. - Enumerated type — a custom type that accepts only a fixed set of values, stored compactly as 4-byte indices.
Why This Matters
Data modeling is the quiet, high-leverage work of data engineering. Every byte you save per row is multiplied across millions of rows, and every type constraint you add is a wall that bad data cannot climb. When an analyst trusts that a column is always a valid date or one of eleven score phrases, they write simpler, faster, more correct queries — and that trust starts with the types you choose here.
Next Steps
Continue to Lesson 3 - Prepared Statements and Preventing SQL Injection
Write safe parameterized queries and understand how SQL injection attacks work
Back to Module Overview
Return to the PostgreSQL for Data Engineers module overview
Continue Building Your Skills
You now design tables the way an engineer does: deliberately, with every column typed to fit its data. In the next lesson you will turn your attention to safety — how the way you pass values into a query can either protect your database or hand the keys to an attacker. See you there.