Lesson 7 - Guided Project: Building a Crime Reports Database

Welcome to the Capstone

This guided project pulls together everything you have learned in this module. You will build a complete database for storing real crime reports from Boston, from an empty server to a secured, loaded, production-style system.

Along the way you will:

  • Create a database and a schema
  • Profile a real dataset to choose the right data types
  • Build and load a table with appropriate types
  • Secure it with readonly and readwrite groups and users

By the end of this project, you will be able to:

  • Design a database from scratch for a real dataset
  • Choose data types by analyzing the data itself
  • Load CSV data efficiently into a typed table
  • Apply least-privilege user management end to end

This is your chance to do the full job of a data engineer. Let’s build it.

Data for this lesson

Engine: PostgreSQL. Dataset: Boston crime reports — real CSV data you load to build a crimes table.

Tables used: crimes

Real-world source: Analyze Boston — Crime Incident Reports. Load boston.csv into your own PostgreSQL instance (the lesson walks through it) and run the queries in psql or a GUI client.


The Dataset and the Goal

The data lives in a file named boston.csv. Its first rows look like this:

incident_number | offense_code | description          | date       | day_of_the_week | lat         | long
----------------+--------------+----------------------+------------+-----------------+-------------+-------------
1               | 619          | LARCENY ALL OTHERS   | 2018-09-02 | Sunday          | 42.35779134 | -71.13937053
2               | 1402         | VANDALISM            | 2018-08-21 | Tuesday         | 42.30682138 | -71.06030035
3               | 3410         | TOWED MOTOR VEHICLE  | 2018-09-03 | Monday          | 42.34658879 | -71.07242943
4               | 3114         | INVESTIGATE PROPERTY | 2018-09-03 | Monday          | 42.33418175 | -71.07866441

Each row is one crime: an identifier, a numeric offense code, a text description, the date and day of the week, and the latitude/longitude where it happened.

Your goal: create a database named crime_db containing a table boston_crimes (inside a schema named crimes) with well-chosen data types, then load boston.csv into it and set up readonly and readwrite groups, each with one user. This mirrors how a real analytics database is provisioned.


Step 1: Create the Database and Schema

Start by creating the crime_db database. Recall from Lesson 5 that creating a database cannot happen inside a transaction, so you must set autocommit = True first. Since crime_db does not exist yet, connect to the existing datatweets database to create it:

import psycopg2

conn = psycopg2.connect("dbname=datatweets user=datatweets")
conn.autocommit = True
cur = conn.cursor()
cur.execute("CREATE DATABASE crime_db;")
conn.close()

Now reconnect to the new database and create a schema named crimes to keep your tables organized. Keep this connection open for the rest of the project:

conn = psycopg2.connect("dbname=crime_db user=datatweets")
cur = conn.cursor()
cur.execute("CREATE SCHEMA crimes;")
conn.commit()

You now have an empty crime_db with a crimes schema, ready for a table.


Step 2: Read the Headers and a Sample Row

Before creating a table, gather information about the data so you can choose the right types. Read the header row and the first data row from boston.csv:

import csv

with open("boston.csv") as file:
    reader = csv.reader(file)
    rows = list(reader)
    col_headers = rows[0]
    first_row = rows[1]

print(col_headers)
print(first_row)
['incident_number', 'offense_code', 'description', 'date', 'day_of_the_week', 'lat', 'long']
['1', '619', 'LARCENY ALL OTHERS', '2018-09-02', 'Sunday', '42.35779134', '-71.13937053']

Keeping these handy means you never have to guess a column’s name or contents while designing the table.


Step 3: Build an Auxiliary Function

To choose types wisely, you want two facts about each column: how many distinct values it has, and how long its longest value is. A column with few distinct values is a good candidate for an enumerated type; the longest value tells you the right size for a varchar.

Write a helper, get_col_set(), that returns the set of distinct values in a column:

def get_col_set(csv_filename, col_index):
    values = set()
    with open(csv_filename) as f:
        next(f)  # skip header
        reader = csv.reader(f)
        for row in reader:
            values.add(row[col_index])
    return values

Now count the distinct values in each of the seven columns:

for i in range(len(col_headers)):
    values = get_col_set("boston.csv", i)
    print(col_headers[i], len(values))

A typical result looks like this:

incident_number 298329
offense_code 219
description 239
date 1177
day_of_the_week 7
lat 18177
long 18177

The standouts are day_of_the_week (only 7 distinct values) and offense_code (a few hundred) — both strong enumerated-type candidates. The description column has many values, so a sized varchar fits better.


Step 4: Find the Maximum Length

For the text columns you need the longest value. The day_of_the_week column obviously maxes out at Wednesday (9 characters), so focus on description. Reuse the helper:

descriptions = get_col_set("boston.csv", 2)
max_len = max(len(value) for value in descriptions)
print(max_len)
58

The longest description is 58 characters, so a varchar(100) gives comfortable headroom without wasting space.


Step 5: Create the Table

Now design crimes.boston_crimes using everything you learned about types in Lesson 2. Here is a sound set of choices, with reasoning:

  • incident_number — the primary key; the IDs fit in an integer.
  • offense_code — small integer codes; smallint (range up to 32767) is plenty, but integer is also fine.
  • description — text up to 58 characters; use varchar(100).
  • date — a real calendar date; use date.
  • day_of_the_week — exactly seven values; a perfect enumerated type.
  • lat and long — geographic coordinates needing several decimal places; decimal with generous precision.

Create the enumerated type first, then the table inside the schema:

cur.execute("""
    CREATE TYPE weekday_enum AS ENUM (
        'Monday', 'Tuesday', 'Wednesday', 'Thursday',
        'Friday', 'Saturday', 'Sunday'
    );
""")

cur.execute("""
    CREATE TABLE crimes.boston_crimes (
        incident_number integer PRIMARY KEY,
        offense_code smallint,
        description varchar(100),
        date date,
        day_of_the_week weekday_enum,
        lat decimal(11, 8),
        long decimal(11, 8)
    );
""")
conn.commit()

The decimal(11, 8) type holds coordinates like 42.35779134 exactly — up to 11 total digits with 8 after the decimal point.


Step 6: Load the Data

With the table in place, load boston.csv. The cursor.copy_expert() method from Lesson 4 is the fastest and most robust choice, and the HEADER option skips the header row for you:

with open("boston.csv") as f:
    cur.copy_expert(
        "COPY crimes.boston_crimes FROM STDIN WITH CSV HEADER;", f
    )
conn.commit()

Confirm the load by counting the rows:

cur.execute("SELECT COUNT(*) FROM crimes.boston_crimes;")
print(cur.fetchone())
(298329,)

All the crime reports are now in your typed table.


Step 7: Revoke Public Privileges

Now secure the database, starting with the least privilege principle. Before creating groups, make sure nothing is inherited from the built-in public group on the public schema or the database:

cur.execute("REVOKE ALL ON SCHEMA public FROM public;")
cur.execute("REVOKE ALL ON DATABASE crime_db FROM public;")
conn.commit()

This guarantees that any access you grant from here on is intentional.


Step 8: Create the User Groups

Create the two standard groups. Each needs permission to connect to the database and to use the crimes schema, plus its own table privileges:

cur.execute("CREATE GROUP readonly NOLOGIN;")
cur.execute("CREATE GROUP readwrite NOLOGIN;")

# both groups can connect and use the schema
cur.execute("GRANT CONNECT ON DATABASE crime_db TO readonly;")
cur.execute("GRANT CONNECT ON DATABASE crime_db TO readwrite;")
cur.execute("GRANT USAGE ON SCHEMA crimes TO readonly;")
cur.execute("GRANT USAGE ON SCHEMA crimes TO readwrite;")

# group-specific table privileges
cur.execute("GRANT SELECT ON ALL TABLES IN SCHEMA crimes TO readonly;")
cur.execute("""
    GRANT SELECT, INSERT, DELETE, UPDATE
    ON ALL TABLES IN SCHEMA crimes TO readwrite;
""")
conn.commit()

The readonly group can only SELECT; the readwrite group can read and modify data but cannot drop tables.


Step 9: Create the Users

Finally, create one user per group and assign each to its group:

cur.execute("CREATE USER data_analyst WITH PASSWORD 'secret1';")
cur.execute("GRANT readonly TO data_analyst;")

cur.execute("CREATE USER data_scientist WITH PASSWORD 'secret2';")
cur.execute("GRANT readwrite TO data_scientist;")
conn.commit()

data_analyst inherits read-only access; data_scientist inherits read-write access. Adding more analysts or scientists later is now just one GRANT each.


Step 10: Test Everything

A good engineer always verifies the setup. Query Postgres’s internal tables to confirm the groups have the privileges you expect. For example, check the readwrite group’s table privileges:

cur.execute("""
    SELECT grantee, privilege_type
    FROM information_schema.table_privileges
    WHERE grantee = 'readwrite';
""")
for row in cur.fetchall():
    print(row)
('readwrite', 'INSERT')
('readwrite', 'SELECT')
('readwrite', 'UPDATE')
('readwrite', 'DELETE')

You can run the same check for readonly (which should show only SELECT) and inspect the pg_roles table to confirm the users and groups exist. When the privileges match your intentions, the database is ready for use.


Practice Exercises

Exercise 1: Justify a type choice

The offense_code column was given the smallint type. Explain why smallint is sufficient here and when you would have needed integer instead.

Hint

smallint covers values from -32768 to 32767; check whether any offense code exceeds that range.

Exercise 2: Add a third user

Write the SQL to add a new read-only analyst named intern with password temp123 and assign them to the readonly group.

-- Your code here

Hint

Create the user, then GRANT readonly TO intern; — no individual privileges needed because the group already has them.

Exercise 3: Verify readonly

Write a query against information_schema.table_privileges that confirms the readonly group has only the SELECT privilege.

-- Your code here

Hint

Filter with WHERE grantee = 'readonly' and look at the privilege_type column.


Summary

You built a complete crime reports database end to end: you created a database and schema, profiled a real dataset to choose precise data types, built and bulk-loaded a typed table, and secured it with readonly and readwrite groups and users following the least privilege principle. That is the full data-engineering provisioning workflow in one project.

Key Concepts

  • Profiling before typing — counting distinct values and measuring lengths tells you when to use enums versus sized varchar.
  • Schema-organized tables — creating tables inside a named schema (crimes.boston_crimes) keeps a database tidy.
  • copy_expert loading — the fastest, most robust way to load a large CSV into a typed table.
  • Group-based securityreadonly and readwrite groups with CONNECT, USAGE, and table privileges let you manage many users at scale.
  • Testing the setup — querying information_schema.table_privileges and pg_roles verifies that access matches your intent.

Why This Matters

This is the work a data engineer is hired to do: take a raw file and turn it into a reliable, well-typed, properly secured database that a whole team can trust. Every decision you made here — type choices, schema layout, group privileges — is one you will make again and again in real systems. You have now done it from start to finish.


Next Steps

Continue to Optimizing PostgreSQL

Explore PostgreSQL internals and learn to make your databases faster and more efficient

Back to Module Overview

Return to the PostgreSQL for Data Engineers module overview


Continue Building Your Skills

Congratulations — you have built a real database from an empty server to a secured, loaded system. You now connect to Postgres, design typed tables, write injection-proof queries, load data at scale, and manage users like an engineer. The natural next step is performance: in the Optimizing PostgreSQL course you will look under the hood at how Postgres runs your queries and learn to make them fast. Keep going!