Lesson 7 - Guided Project: Building a Crime Reports Database
On this page
- Welcome to the Capstone
- The Dataset and the Goal
- Step 1: Create the Database and Schema
- Step 2: Read the Headers and a Sample Row
- Step 3: Build an Auxiliary Function
- Step 4: Find the Maximum Length
- Step 5: Create the Table
- Step 6: Load the Data
- Step 7: Revoke Public Privileges
- Step 8: Create the User Groups
- Step 9: Create the Users
- Step 10: Test Everything
- Practice Exercises
- Summary
- Next Steps
- Continue Building Your Skills
Welcome to the Capstone
This guided project pulls together everything you have learned in this module. You will build a complete database for storing real crime reports from Boston, from an empty server to a secured, loaded, production-style system.
Along the way you will:
- Create a database and a schema
- Profile a real dataset to choose the right data types
- Build and load a table with appropriate types
- Secure it with
readonlyandreadwritegroups and users
By the end of this project, you will be able to:
- Design a database from scratch for a real dataset
- Choose data types by analyzing the data itself
- Load CSV data efficiently into a typed table
- Apply least-privilege user management end to end
This is your chance to do the full job of a data engineer. Let’s build it.
Data for this lesson
Engine: PostgreSQL. Dataset: Boston crime reports — real CSV data you load to build a crimes table.
Tables used: crimes
Real-world source: Analyze Boston — Crime Incident Reports. Load boston.csv into your own PostgreSQL instance (the lesson walks through it) and run the queries in psql or a GUI client.
The Dataset and the Goal
The data lives in a file named boston.csv. Its first rows look like this:
incident_number | offense_code | description | date | day_of_the_week | lat | long
----------------+--------------+----------------------+------------+-----------------+-------------+-------------
1 | 619 | LARCENY ALL OTHERS | 2018-09-02 | Sunday | 42.35779134 | -71.13937053
2 | 1402 | VANDALISM | 2018-08-21 | Tuesday | 42.30682138 | -71.06030035
3 | 3410 | TOWED MOTOR VEHICLE | 2018-09-03 | Monday | 42.34658879 | -71.07242943
4 | 3114 | INVESTIGATE PROPERTY | 2018-09-03 | Monday | 42.33418175 | -71.07866441Each row is one crime: an identifier, a numeric offense code, a text description, the date and day of the week, and the latitude/longitude where it happened.
Your goal: create a database named crime_db containing a table boston_crimes (inside a schema named crimes) with well-chosen data types, then load boston.csv into it and set up readonly and readwrite groups, each with one user. This mirrors how a real analytics database is provisioned.
Step 1: Create the Database and Schema
Start by creating the crime_db database. Recall from Lesson 5 that creating a database cannot happen inside a transaction, so you must set autocommit = True first. Since crime_db does not exist yet, connect to the existing datatweets database to create it:
import psycopg2
conn = psycopg2.connect("dbname=datatweets user=datatweets")
conn.autocommit = True
cur = conn.cursor()
cur.execute("CREATE DATABASE crime_db;")
conn.close()Now reconnect to the new database and create a schema named crimes to keep your tables organized. Keep this connection open for the rest of the project:
conn = psycopg2.connect("dbname=crime_db user=datatweets")
cur = conn.cursor()
cur.execute("CREATE SCHEMA crimes;")
conn.commit()You now have an empty crime_db with a crimes schema, ready for a table.
Step 2: Read the Headers and a Sample Row
Before creating a table, gather information about the data so you can choose the right types. Read the header row and the first data row from boston.csv:
import csv
with open("boston.csv") as file:
reader = csv.reader(file)
rows = list(reader)
col_headers = rows[0]
first_row = rows[1]
print(col_headers)
print(first_row)['incident_number', 'offense_code', 'description', 'date', 'day_of_the_week', 'lat', 'long']
['1', '619', 'LARCENY ALL OTHERS', '2018-09-02', 'Sunday', '42.35779134', '-71.13937053']Keeping these handy means you never have to guess a column’s name or contents while designing the table.
Step 3: Build an Auxiliary Function
To choose types wisely, you want two facts about each column: how many distinct values it has, and how long its longest value is. A column with few distinct values is a good candidate for an enumerated type; the longest value tells you the right size for a varchar.
Write a helper, get_col_set(), that returns the set of distinct values in a column:
def get_col_set(csv_filename, col_index):
values = set()
with open(csv_filename) as f:
next(f) # skip header
reader = csv.reader(f)
for row in reader:
values.add(row[col_index])
return valuesNow count the distinct values in each of the seven columns:
for i in range(len(col_headers)):
values = get_col_set("boston.csv", i)
print(col_headers[i], len(values))A typical result looks like this:
incident_number 298329
offense_code 219
description 239
date 1177
day_of_the_week 7
lat 18177
long 18177The standouts are day_of_the_week (only 7 distinct values) and offense_code (a few hundred) — both strong enumerated-type candidates. The description column has many values, so a sized varchar fits better.
Step 4: Find the Maximum Length
For the text columns you need the longest value. The day_of_the_week column obviously maxes out at Wednesday (9 characters), so focus on description. Reuse the helper:
descriptions = get_col_set("boston.csv", 2)
max_len = max(len(value) for value in descriptions)
print(max_len)58The longest description is 58 characters, so a varchar(100) gives comfortable headroom without wasting space.
Step 5: Create the Table
Now design crimes.boston_crimes using everything you learned about types in Lesson 2. Here is a sound set of choices, with reasoning:
incident_number— the primary key; the IDs fit in aninteger.offense_code— small integer codes;smallint(range up to 32767) is plenty, butintegeris also fine.description— text up to 58 characters; usevarchar(100).date— a real calendar date; usedate.day_of_the_week— exactly seven values; a perfect enumerated type.latandlong— geographic coordinates needing several decimal places;decimalwith generous precision.
Create the enumerated type first, then the table inside the schema:
cur.execute("""
CREATE TYPE weekday_enum AS ENUM (
'Monday', 'Tuesday', 'Wednesday', 'Thursday',
'Friday', 'Saturday', 'Sunday'
);
""")
cur.execute("""
CREATE TABLE crimes.boston_crimes (
incident_number integer PRIMARY KEY,
offense_code smallint,
description varchar(100),
date date,
day_of_the_week weekday_enum,
lat decimal(11, 8),
long decimal(11, 8)
);
""")
conn.commit()The decimal(11, 8) type holds coordinates like 42.35779134 exactly — up to 11 total digits with 8 after the decimal point.
Step 6: Load the Data
With the table in place, load boston.csv. The cursor.copy_expert() method from Lesson 4 is the fastest and most robust choice, and the HEADER option skips the header row for you:
with open("boston.csv") as f:
cur.copy_expert(
"COPY crimes.boston_crimes FROM STDIN WITH CSV HEADER;", f
)
conn.commit()Confirm the load by counting the rows:
cur.execute("SELECT COUNT(*) FROM crimes.boston_crimes;")
print(cur.fetchone())(298329,)All the crime reports are now in your typed table.
Step 7: Revoke Public Privileges
Now secure the database, starting with the least privilege principle. Before creating groups, make sure nothing is inherited from the built-in public group on the public schema or the database:
cur.execute("REVOKE ALL ON SCHEMA public FROM public;")
cur.execute("REVOKE ALL ON DATABASE crime_db FROM public;")
conn.commit()This guarantees that any access you grant from here on is intentional.
Step 8: Create the User Groups
Create the two standard groups. Each needs permission to connect to the database and to use the crimes schema, plus its own table privileges:
cur.execute("CREATE GROUP readonly NOLOGIN;")
cur.execute("CREATE GROUP readwrite NOLOGIN;")
# both groups can connect and use the schema
cur.execute("GRANT CONNECT ON DATABASE crime_db TO readonly;")
cur.execute("GRANT CONNECT ON DATABASE crime_db TO readwrite;")
cur.execute("GRANT USAGE ON SCHEMA crimes TO readonly;")
cur.execute("GRANT USAGE ON SCHEMA crimes TO readwrite;")
# group-specific table privileges
cur.execute("GRANT SELECT ON ALL TABLES IN SCHEMA crimes TO readonly;")
cur.execute("""
GRANT SELECT, INSERT, DELETE, UPDATE
ON ALL TABLES IN SCHEMA crimes TO readwrite;
""")
conn.commit()The readonly group can only SELECT; the readwrite group can read and modify data but cannot drop tables.
Step 9: Create the Users
Finally, create one user per group and assign each to its group:
cur.execute("CREATE USER data_analyst WITH PASSWORD 'secret1';")
cur.execute("GRANT readonly TO data_analyst;")
cur.execute("CREATE USER data_scientist WITH PASSWORD 'secret2';")
cur.execute("GRANT readwrite TO data_scientist;")
conn.commit()data_analyst inherits read-only access; data_scientist inherits read-write access. Adding more analysts or scientists later is now just one GRANT each.
Step 10: Test Everything
A good engineer always verifies the setup. Query Postgres’s internal tables to confirm the groups have the privileges you expect. For example, check the readwrite group’s table privileges:
cur.execute("""
SELECT grantee, privilege_type
FROM information_schema.table_privileges
WHERE grantee = 'readwrite';
""")
for row in cur.fetchall():
print(row)('readwrite', 'INSERT')
('readwrite', 'SELECT')
('readwrite', 'UPDATE')
('readwrite', 'DELETE')You can run the same check for readonly (which should show only SELECT) and inspect the pg_roles table to confirm the users and groups exist. When the privileges match your intentions, the database is ready for use.
Practice Exercises
Exercise 1: Justify a type choice
The offense_code column was given the smallint type. Explain why smallint is sufficient here and when you would have needed integer instead.
Hint
smallint covers values from -32768 to 32767; check whether any offense code exceeds that range.
Exercise 2: Add a third user
Write the SQL to add a new read-only analyst named intern with password temp123 and assign them to the readonly group.
-- Your code hereHint
Create the user, then GRANT readonly TO intern; — no individual privileges needed because the group already has them.
Exercise 3: Verify readonly
Write a query against information_schema.table_privileges that confirms the readonly group has only the SELECT privilege.
-- Your code hereHint
Filter with WHERE grantee = 'readonly' and look at the privilege_type column.
Summary
You built a complete crime reports database end to end: you created a database and schema, profiled a real dataset to choose precise data types, built and bulk-loaded a typed table, and secured it with readonly and readwrite groups and users following the least privilege principle. That is the full data-engineering provisioning workflow in one project.
Key Concepts
- Profiling before typing — counting distinct values and measuring lengths tells you when to use enums versus sized
varchar. - Schema-organized tables — creating tables inside a named schema (
crimes.boston_crimes) keeps a database tidy. - copy_expert loading — the fastest, most robust way to load a large CSV into a typed table.
- Group-based security —
readonlyandreadwritegroups withCONNECT,USAGE, and table privileges let you manage many users at scale. - Testing the setup — querying
information_schema.table_privilegesandpg_rolesverifies that access matches your intent.
Why This Matters
This is the work a data engineer is hired to do: take a raw file and turn it into a reliable, well-typed, properly secured database that a whole team can trust. Every decision you made here — type choices, schema layout, group privileges — is one you will make again and again in real systems. You have now done it from start to finish.
Next Steps
Continue to Optimizing PostgreSQL
Explore PostgreSQL internals and learn to make your databases faster and more efficient
Back to Module Overview
Return to the PostgreSQL for Data Engineers module overview
Continue Building Your Skills
Congratulations — you have built a real database from an empty server to a secured, loaded system. You now connect to Postgres, design typed tables, write injection-proof queries, load data at scale, and manage users like an engineer. The natural next step is performance: in the Optimizing PostgreSQL course you will look under the hood at how Postgres runs your queries and learn to make them fast. Keep going!