Lesson 6 - Distribution and Percentile Functions
Welcome to Distribution Functions
This final lesson on window functions enters the world of statistics. Distribution functions tell you where a value sits in a distribution — is this trip in the top 5% by duration? Is this product in the first revenue quartile? And percentile functions run the other direction: given a percentage, what value sits at that point — what is the median, or the 75th percentile?
These functions power statistical analysis across marketing, finance, healthcare, and more. Along the way you will also learn the WINDOW clause, a tidy way to avoid repeating the same OVER() definition.
By the end of this lesson, you will be able to:
- Use
CUME_DIST()andPERCENT_RANK()to measure a value’s relative standing - Explain the difference between the two and when to use each
- Use
PERCENTILE_CONT()andPERCENTILE_DISC()withWITHIN GROUPto find percentiles - Simplify repeated window definitions with the
WINDOWclause
This lesson uses the trips table from Lesson 4 and the phone-sales tables. Let’s get started.
Data for this lesson
Engine: PostgreSQL — the examples print small sample tables inline.
Tables used: trips, tbl_bikeshare, phone_sales_by_month, phone_sales_quantity_by_month, employees
Want to run them yourself? Download window_practice.db and use any SQLite client (SQLite supports window functions too), or recreate the tables in PostgreSQL with psql.
Rank Distribution Functions
There are two rank distribution functions: CUME_DIST() and PERCENT_RANK(). Both compute a row’s relative standing in the window and return a ratio between 0 and 1. Both take no argument and require an ORDER BY inside OVER(); PARTITION BY is optional.
CUME_DIST()
CUME_DIST() (cumulative distribution) returns the proportion of rows whose value is less than or equal to the current row’s value. Formally, for ascending order:
SELECT start_date, bike_number, duration,
ROUND(CUME_DIST() OVER(ORDER BY duration)::numeric, 2) AS cume_dist
FROM trips;start_date | bike_number | duration | cume_dist
--------------------+-------------+----------+----------
2017-10-05 08:08:00 | W20184 | 202 | 0.08
2017-10-01 05:01:00 | W00143 | 650 | 0.17
2017-10-04 04:58:00 | W23052 | 797 | 0.33
2017-10-03 09:34:00 | W20095 | 797 | 0.33
2017-10-04 05:21:00 | W22051 | 918 | 0.58
2017-10-04 06:07:00 | W23268 | 918 | 0.58
2017-10-04 08:30:00 | W22517 | 918 | 0.58
2017-10-02 03:30:00 | W21096 | 987 | 0.67
2017-10-01 03:08:00 | W23272 | 1253 | 0.75
2017-10-03 12:00:00 | W22965 | 1390 | 0.83
2017-10-01 05:01:00 | W23254 | 1476 | 0.92
2017-10-05 08:08:00 | W00895 | 1482 | 1.00Read the first row: a duration of 202 has cume_dist = 0.08, meaning about 8% of trips are at or below 202 (it is the shortest of twelve, so ). The two trips tied at 797 both show 0.33 — about 33% of durations are at or below 797. The longest trip always reaches exactly 1.00.
The first value is never 0, because the current row is always counted in the “less than or equal” set.
Casting before ROUND()
CUME_DIST() and PERCENT_RANK() return a double precision value. To use ROUND(..., 2), cast the result to numeric first, as in ROUND(CUME_DIST() OVER(...)::numeric, 2). Without the cast, ROUND with two arguments will not apply.
PERCENT_RANK()
PERCENT_RANK() looks similar but measures the relative rank of a value, considering only the rows strictly below it. The formula is:
SELECT start_date, bike_number, duration,
ROUND(PERCENT_RANK() OVER(ORDER BY duration ASC)::numeric, 2) AS percent_rank
FROM trips;start_date | bike_number | duration | percent_rank
--------------------+-------------+----------+-------------
2017-10-05 08:08:00 | W20184 | 202 | 0.00
2017-10-01 05:01:00 | W00143 | 650 | 0.09
2017-10-04 04:58:00 | W23052 | 797 | 0.18
2017-10-03 09:34:00 | W20095 | 797 | 0.18
2017-10-04 05:21:00 | W22051 | 918 | 0.36
2017-10-04 06:07:00 | W23268 | 918 | 0.36
2017-10-04 08:30:00 | W22517 | 918 | 0.36
2017-10-02 03:30:00 | W21096 | 987 | 0.64
2017-10-01 03:08:00 | W23272 | 1253 | 0.73
2017-10-03 12:00:00 | W22965 | 1390 | 0.82
2017-10-01 05:01:00 | W23254 | 1476 | 0.91
2017-10-05 08:08:00 | W00895 | 1482 | 1.00The key difference: the first row is always 0 with PERCENT_RANK(), because nothing ranks below it. With CUME_DIST() the first row is greater than 0, because it counts itself. In short, CUME_DIST() includes the current value (≤), while PERCENT_RANK() excludes it (strictly <).
PERCENT_RANK() makes it easy to assign quartiles. Ordering revenue descending so that 0 = best performance:
SELECT *,
PERCENT_RANK() OVER(PARTITION BY brand ORDER BY quantity*unit_price DESC) AS percent_rank,
CASE
WHEN PERCENT_RANK() OVER(PARTITION BY brand ORDER BY quantity*unit_price DESC) < 0.25 THEN '1st'
WHEN PERCENT_RANK() OVER(PARTITION BY brand ORDER BY quantity*unit_price DESC) < 0.50 THEN '2nd'
WHEN PERCENT_RANK() OVER(PARTITION BY brand ORDER BY quantity*unit_price DESC) < 0.75 THEN '3rd'
ELSE '4th'
END AS revenue_quartiles
FROM phone_sales_by_month;Because the ordering is descending, a PERCENT_RANK near 0 means top performance, and the first quartile (< 0.25) captures the top 25% of months.
Inverse Distribution (Percentile) Functions
Rank distribution functions answer “what percentile is this value?” The inverse distribution functions answer the reverse: “what value sits at this percentile?” They are PERCENTILE_CONT() and PERCENTILE_DISC().
First, the intuition. For the sorted list 1, 4, 5, 41, 100, 2100, 79000 (seven values), the median (50th percentile) is the middle value, 41. But with an even count like 1, 4, 5, 41, 100, 2100 (six values), there is no single middle. The two middle values are 5 and 41, and the interpolated median is their average, . That choice — return an existing value, or interpolate between two — is exactly the difference between the two functions.
The WITHIN GROUP Clause
Percentile functions are ordered-set aggregate functions, not true window functions. They use a WITHIN GROUP clause to specify ordering, and they return a single value per group:
function_name(percent) WITHIN GROUP (ORDER BY column_expression [ASC | DESC])PERCENTILE_CONT() vs. PERCENTILE_DISC()
Using phone_sales_quantity_by_month (twelve rows of quantity), here is the median both ways.
PERCENTILE_CONT() (continuous) interpolates:
SELECT PERCENTILE_CONT(0.50) WITHIN GROUP(ORDER BY quantity) AS "Median of Quantity"
FROM phone_sales_quantity_by_month;Median of Quantity
------------------
89.5With twelve rows there is no single middle, so PERCENTILE_CONT() averages the two central values to get 89.5 — a value that may not exist in the data.
PERCENTILE_DISC() (discrete) returns an actual value from the data:
SELECT PERCENTILE_DISC(0.50) WITHIN GROUP(ORDER BY quantity) AS "Median of Quantity"
FROM phone_sales_quantity_by_month;Median of Quantity
------------------
89PERCENTILE_DISC() returns the closest existing value at or below the requested percentile — here 89. Use PERCENTILE_CONT() when an interpolated statistic is fine (true median, smooth percentiles) and PERCENTILE_DISC() when you need a value that actually appears in the data.
Percentile functions in WHERE
Because PERCENTILE_CONT() and PERCENTILE_DISC() are ordered-set aggregates rather than window functions, you can reference them in a WHERE clause — but only by computing them in a subquery or CTE first and comparing against the result. For example, filter rows where quantity >= (SELECT PERCENTILE_DISC(0.75) WITHIN GROUP(ORDER BY quantity) FROM ...).
The WINDOW Clause
When several window functions share the same OVER() definition, repeating it is noisy and error-prone. The WINDOW clause lets you name a window once and reuse it.
WINDOW window_name AS (
[PARTITION BY ...]
[ORDER BY ...]
[frame clause]
)It goes after GROUP BY and HAVING and before ORDER BY. Instead of repeating OVER(PARTITION BY department), you define it once and refer to it by name:
SELECT *,
AVG(salary) OVER win1
FROM employees
WINDOW win1 AS (PARTITION BY department);This is purely a readability and maintenance win — the result is identical to writing the full OVER() clause inline. With multiple window functions sharing a definition, it can dramatically shorten a query and guarantee they all use the exact same window.
Practice Exercises
Use the trips, tbl_bikeshare, phone_sales_quantity_by_month, and phone_sales_by_month tables.
Exercise 1: Fast Riders by Day
Using tbl_bikeshare, label a trip a ‘Fast Rider’ if its CUME_DIST() of duration (per day, ordered descending) is at least 0.95 — the top 5% by duration each day. Return the fast riders, ordered by start_date and duration.
Hint
In a CTE, compute CUME_DIST() OVER(PARTITION BY EXTRACT(day FROM start_date) ORDER BY duration DESC) and use a CASE to tag rows >= 0.95 as 'Fast Rider'. Filter the outer query to keep only those rows.
Exercise 2: Top-Performing Models per Month
Using phone_sales_by_month, find each month’s top-performing models — those whose revenue percentile rank falls in the first quartile (<= 0.25). Return sales_date, brand, model, and total_sales.
Hint
Group by sales_date, brand, model and compute SUM(quantity*unit_price) AS total_sales plus PERCENT_RANK() OVER(PARTITION BY sales_date ORDER BY SUM(quantity*unit_price) DESC) in a CTE. Filter the outer query where that rank is <= 0.25.
Exercise 3: 75th Percentile Months
Using phone_sales_quantity_by_month, return the rows where quantity is at or above the 75th percentile of all quantities.
Hint
Compute the threshold in a subquery: PERCENTILE_DISC(0.75) WITHIN GROUP(ORDER BY quantity). In the outer query, filter WHERE quantity >= (subquery).
Summary
You learned the statistical side of window functions. CUME_DIST() and PERCENT_RANK() measure where a value falls in a distribution, differing only in whether they count the current value. PERCENTILE_CONT() and PERCENTILE_DISC() run in reverse, returning the value at a given percentile — interpolated or discrete. And the WINDOW clause keeps queries clean by naming a window once. With this, you have covered all four families of window functions.
Key Concepts
CUME_DIST()— proportion of rows with a value ≤ the current value; the first row is never 0.PERCENT_RANK()—(rank - 1) / (rows - 1); the first row is always 0.PERCENTILE_CONT(p)— the interpolated value at percentile p; may not exist in the data.PERCENTILE_DISC(p)— an actual data value at or below percentile p; usesWITHIN GROUP.WINDOWclause — names a reusable window definition to avoid repeatingOVER().
Why This Matters
Percentiles and distributions are the language of analytics: top-quartile customers, p95 latency, median order value, and “is this an outlier?” all rely on these functions. Knowing the subtle gap between CUME_DIST() and PERCENT_RANK(), and between continuous and discrete percentiles, lets you compute statistics correctly rather than approximately. The WINDOW clause, meanwhile, makes complex multi-window queries readable — a small habit that pays off in every large query you write.
Next Steps
Continue to Lesson 7 - Guided Project: Sales Analytics for Northwind Traders
Combine window functions and CTEs to build a sales analytics report from the classic Northwind database.
Back to Module Overview
Return to the Window Functions module overview
Continue Building Your Skills
You now command all four families of window functions. Try answering a real distribution question on your own data: what is the median order value, which customers are in the top decile, where does today’s metric fall versus history? Running CUME_DIST() next to PERCENT_RANK() on the same column is a great way to lock in the difference. Next, you will put everything together in a full guided project.