Lesson 6 - Distribution and Percentile Functions

Welcome to Distribution Functions

This final lesson on window functions enters the world of statistics. Distribution functions tell you where a value sits in a distribution — is this trip in the top 5% by duration? Is this product in the first revenue quartile? And percentile functions run the other direction: given a percentage, what value sits at that point — what is the median, or the 75th percentile?

These functions power statistical analysis across marketing, finance, healthcare, and more. Along the way you will also learn the WINDOW clause, a tidy way to avoid repeating the same OVER() definition.

By the end of this lesson, you will be able to:

Use CUME_DIST() and PERCENT_RANK() to measure a value’s relative standing
Explain the difference between the two and when to use each
Use PERCENTILE_CONT() and PERCENTILE_DISC() with WITHIN GROUP to find percentiles
Simplify repeated window definitions with the WINDOW clause

This lesson uses the trips table from Lesson 4 and the phone-sales tables. Let’s get started.

Data for this lesson

Engine: PostgreSQL — the examples print small sample tables inline.

Tables used: trips, tbl_bikeshare, phone_sales_by_month, phone_sales_quantity_by_month, employees

Want to run them yourself? Download window_practice.db and use any SQLite client (SQLite supports window functions too), or recreate the tables in PostgreSQL with psql.

Rank Distribution Functions

There are two rank distribution functions: CUME_DIST() and PERCENT_RANK(). Both compute a row’s relative standing in the window and return a ratio between 0 and 1. Both take no argument and require an ORDER BY inside OVER(); PARTITION BY is optional.

CUME_DIST()

CUME_DIST() (cumulative distribution) returns the proportion of rows whose value is less than or equal to the current row’s value. Formally, for ascending order:

\text{CUME\_DIST} = \frac{\text{number of rows with value} \le \text{current value}}{\text{total rows in the partition}}

SELECT start_date, bike_number, duration,
       ROUND(CUME_DIST() OVER(ORDER BY duration)::numeric, 2) AS cume_dist
  FROM trips;

start_date          | bike_number | duration | cume_dist
--------------------+-------------+----------+----------
2017-10-05 08:08:00 | W20184      | 202      | 0.08
2017-10-01 05:01:00 | W00143      | 650      | 0.17
2017-10-04 04:58:00 | W23052      | 797      | 0.33
2017-10-03 09:34:00 | W20095      | 797      | 0.33
2017-10-04 05:21:00 | W22051      | 918      | 0.58
2017-10-04 06:07:00 | W23268      | 918      | 0.58
2017-10-04 08:30:00 | W22517      | 918      | 0.58
2017-10-02 03:30:00 | W21096      | 987      | 0.67
2017-10-01 03:08:00 | W23272      | 1253     | 0.75
2017-10-03 12:00:00 | W22965      | 1390     | 0.83
2017-10-01 05:01:00 | W23254      | 1476     | 0.92
2017-10-05 08:08:00 | W00895      | 1482     | 1.00

Read the first row: a duration of 202 has cume_dist = 0.08, meaning about 8% of trips are at or below 202 (it is the shortest of twelve, so $1/12 \approx 0.08$ ). The two trips tied at 797 both show 0.33 — about 33% of durations are at or below 797. The longest trip always reaches exactly 1.00.

The first value is never 0, because the current row is always counted in the “less than or equal” set.

Casting before ROUND()

CUME_DIST() and PERCENT_RANK() return a double precision value. To use ROUND(..., 2), cast the result to numeric first, as in ROUND(CUME_DIST() OVER(...)::numeric, 2). Without the cast, ROUND with two arguments will not apply.

PERCENT_RANK()

PERCENT_RANK() looks similar but measures the relative rank of a value, considering only the rows strictly below it. The formula is:

\text{PERCENT\_RANK} = \frac{\text{rank} - 1}{\text{number of rows} - 1}

SELECT start_date, bike_number, duration,
       ROUND(PERCENT_RANK() OVER(ORDER BY duration ASC)::numeric, 2) AS percent_rank
  FROM trips;

start_date          | bike_number | duration | percent_rank
--------------------+-------------+----------+-------------
2017-10-05 08:08:00 | W20184      | 202      | 0.00
2017-10-01 05:01:00 | W00143      | 650      | 0.09
2017-10-04 04:58:00 | W23052      | 797      | 0.18
2017-10-03 09:34:00 | W20095      | 797      | 0.18
2017-10-04 05:21:00 | W22051      | 918      | 0.36
2017-10-04 06:07:00 | W23268      | 918      | 0.36
2017-10-04 08:30:00 | W22517      | 918      | 0.36
2017-10-02 03:30:00 | W21096      | 987      | 0.64
2017-10-01 03:08:00 | W23272      | 1253     | 0.73
2017-10-03 12:00:00 | W22965      | 1390     | 0.82
2017-10-01 05:01:00 | W23254      | 1476     | 0.91
2017-10-05 08:08:00 | W00895      | 1482     | 1.00

The key difference: the first row is always 0 with PERCENT_RANK(), because nothing ranks below it. With CUME_DIST() the first row is greater than 0, because it counts itself. In short, CUME_DIST() includes the current value (≤), while PERCENT_RANK() excludes it (strictly <).

PERCENT_RANK() makes it easy to assign quartiles. Ordering revenue descending so that 0 = best performance:

SELECT *,
       PERCENT_RANK() OVER(PARTITION BY brand ORDER BY quantity*unit_price DESC) AS percent_rank,
       CASE
           WHEN PERCENT_RANK() OVER(PARTITION BY brand ORDER BY quantity*unit_price DESC) < 0.25 THEN '1st'
           WHEN PERCENT_RANK() OVER(PARTITION BY brand ORDER BY quantity*unit_price DESC) < 0.50 THEN '2nd'
           WHEN PERCENT_RANK() OVER(PARTITION BY brand ORDER BY quantity*unit_price DESC) < 0.75 THEN '3rd'
           ELSE '4th'
       END AS revenue_quartiles
  FROM phone_sales_by_month;

Because the ordering is descending, a PERCENT_RANK near 0 means top performance, and the first quartile (< 0.25) captures the top 25% of months.

Inverse Distribution (Percentile) Functions

Rank distribution functions answer “what percentile is this value?” The inverse distribution functions answer the reverse: “what value sits at this percentile?” They are PERCENTILE_CONT() and PERCENTILE_DISC().

First, the intuition. For the sorted list 1, 4, 5, 41, 100, 2100, 79000 (seven values), the median (50th percentile) is the middle value, 41. But with an even count like 1, 4, 5, 41, 100, 2100 (six values), there is no single middle. The two middle values are 5 and 41, and the interpolated median is their average, $(5 + 41)/2 = 23$ . That choice — return an existing value, or interpolate between two — is exactly the difference between the two functions.

The WITHIN GROUP Clause

Percentile functions are ordered-set aggregate functions, not true window functions. They use a WITHIN GROUP clause to specify ordering, and they return a single value per group:

function_name(percent) WITHIN GROUP (ORDER BY column_expression [ASC | DESC])

PERCENTILE_CONT() vs. PERCENTILE_DISC()

Using phone_sales_quantity_by_month (twelve rows of quantity), here is the median both ways.

PERCENTILE_CONT() (continuous) interpolates:

SELECT PERCENTILE_CONT(0.50) WITHIN GROUP(ORDER BY quantity) AS "Median of Quantity"
  FROM phone_sales_quantity_by_month;

Median of Quantity
------------------
89.5

With twelve rows there is no single middle, so PERCENTILE_CONT() averages the two central values to get 89.5 — a value that may not exist in the data.

PERCENTILE_DISC() (discrete) returns an actual value from the data:

SELECT PERCENTILE_DISC(0.50) WITHIN GROUP(ORDER BY quantity) AS "Median of Quantity"
  FROM phone_sales_quantity_by_month;

Median of Quantity
------------------
89

PERCENTILE_DISC() returns the closest existing value at or below the requested percentile — here 89. Use PERCENTILE_CONT() when an interpolated statistic is fine (true median, smooth percentiles) and PERCENTILE_DISC() when you need a value that actually appears in the data.

Percentile functions in WHERE

Because PERCENTILE_CONT() and PERCENTILE_DISC() are ordered-set aggregates rather than window functions, you can reference them in a WHERE clause — but only by computing them in a subquery or CTE first and comparing against the result. For example, filter rows where quantity >= (SELECT PERCENTILE_DISC(0.75) WITHIN GROUP(ORDER BY quantity) FROM ...).

The WINDOW Clause

When several window functions share the same OVER() definition, repeating it is noisy and error-prone. The WINDOW clause lets you name a window once and reuse it.

WINDOW window_name AS (
    [PARTITION BY ...]
    [ORDER BY ...]
    [frame clause]
)

It goes after GROUP BY and HAVING and before ORDER BY. Instead of repeating OVER(PARTITION BY department), you define it once and refer to it by name:

SELECT *,
       AVG(salary) OVER win1
  FROM employees
 WINDOW win1 AS (PARTITION BY department);

This is purely a readability and maintenance win — the result is identical to writing the full OVER() clause inline. With multiple window functions sharing a definition, it can dramatically shorten a query and guarantee they all use the exact same window.

Practice Exercises

Use the trips, tbl_bikeshare, phone_sales_quantity_by_month, and phone_sales_by_month tables.

Exercise 1: Fast Riders by Day

Using tbl_bikeshare, label a trip a ‘Fast Rider’ if its CUME_DIST() of duration (per day, ordered descending) is at least 0.95 — the top 5% by duration each day. Return the fast riders, ordered by start_date and duration.

Hint

In a CTE, compute CUME_DIST() OVER(PARTITION BY EXTRACT(day FROM start_date) ORDER BY duration DESC) and use a CASE to tag rows >= 0.95 as 'Fast Rider'. Filter the outer query to keep only those rows.

Exercise 2: Top-Performing Models per Month

Using phone_sales_by_month, find each month’s top-performing models — those whose revenue percentile rank falls in the first quartile (<= 0.25). Return sales_date, brand, model, and total_sales.

Hint

Group by sales_date, brand, model and compute SUM(quantity*unit_price) AS total_sales plus PERCENT_RANK() OVER(PARTITION BY sales_date ORDER BY SUM(quantity*unit_price) DESC) in a CTE. Filter the outer query where that rank is <= 0.25.

Exercise 3: 75th Percentile Months

Using phone_sales_quantity_by_month, return the rows where quantity is at or above the 75th percentile of all quantities.

Hint

Compute the threshold in a subquery: PERCENTILE_DISC(0.75) WITHIN GROUP(ORDER BY quantity). In the outer query, filter WHERE quantity >= (subquery).

Summary

You learned the statistical side of window functions. CUME_DIST() and PERCENT_RANK() measure where a value falls in a distribution, differing only in whether they count the current value. PERCENTILE_CONT() and PERCENTILE_DISC() run in reverse, returning the value at a given percentile — interpolated or discrete. And the WINDOW clause keeps queries clean by naming a window once. With this, you have covered all four families of window functions.

Key Concepts

CUME_DIST() — proportion of rows with a value ≤ the current value; the first row is never 0.
PERCENT_RANK() — (rank - 1) / (rows - 1); the first row is always 0.
PERCENTILE_CONT(p) — the interpolated value at percentile p; may not exist in the data.
PERCENTILE_DISC(p) — an actual data value at or below percentile p; uses WITHIN GROUP.
WINDOW clause — names a reusable window definition to avoid repeating OVER().

Why This Matters

Percentiles and distributions are the language of analytics: top-quartile customers, p95 latency, median order value, and “is this an outlier?” all rely on these functions. Knowing the subtle gap between CUME_DIST() and PERCENT_RANK(), and between continuous and discrete percentiles, lets you compute statistics correctly rather than approximately. The WINDOW clause, meanwhile, makes complex multi-window queries readable — a small habit that pays off in every large query you write.

Next Steps

Continue to Lesson 7 - Guided Project: Sales Analytics for Northwind Traders

Combine window functions and CTEs to build a sales analytics report from the classic Northwind database.

Back to Module Overview

Return to the Window Functions module overview

Continue Building Your Skills

You now command all four families of window functions. Try answering a real distribution question on your own data: what is the median order value, which customers are in the top decile, where does today’s metric fall versus history? Running CUME_DIST() next to PERCENT_RANK() on the same column is a great way to lock in the difference. Next, you will put everything together in a full guided project.

Lesson 5 - Boolean Indexing and Data Filtering

Lesson 6 - Modifying Data and Assignment

Courses

DATATWEETS

Title here

Lesson 6 - Distribution and Percentile Functions

Welcome to Distribution Functions

Rank Distribution Functions

CUME_DIST()

PERCENT_RANK()

Inverse Distribution (Percentile) Functions

The WITHIN GROUP Clause

PERCENTILE_CONT() vs. PERCENTILE_DISC()

The WINDOW Clause

Practice Exercises

Exercise 1: Fast Riders by Day

Exercise 2: Top-Performing Models per Month

Exercise 3: 75th Percentile Months

Summary

Key Concepts

Why This Matters

Next Steps

Continue to Lesson 7 - Guided Project: Sales Analytics for Northwind Traders

Back to Module Overview

Continue Building Your Skills

Lesson 6 - Distribution and Percentile Functions

Welcome to Distribution Functions#

Rank Distribution Functions#

CUME_DIST()#

PERCENT_RANK()#

Inverse Distribution (Percentile) Functions#

The WITHIN GROUP Clause#

PERCENTILE_CONT() vs. PERCENTILE_DISC()#

The WINDOW Clause#

Practice Exercises#

Exercise 1: Fast Riders by Day#

Exercise 2: Top-Performing Models per Month#

Exercise 3: 75th Percentile Months#

Summary#

Key Concepts#

Why This Matters#

Next Steps#

Continue to Lesson 7 - Guided Project: Sales Analytics for Northwind Traders

Back to Module Overview

Continue Building Your Skills#

Welcome to Distribution Functions

Rank Distribution Functions

CUME_DIST()

PERCENT_RANK()

Inverse Distribution (Percentile) Functions

The WITHIN GROUP Clause

PERCENTILE_CONT() vs. PERCENTILE_DISC()

The WINDOW Clause

Practice Exercises

Exercise 1: Fast Riders by Day

Exercise 2: Top-Performing Models per Month

Exercise 3: 75th Percentile Months

Summary

Key Concepts

Why This Matters

Next Steps

Continue Building Your Skills