Lesson 3 - Grouping Data with GROUP BY

Welcome to GROUP BY

So far, every aggregate query you have written treated the whole table — or one filtered slice of it — as a single group, producing one summary row. But a question like “how many invoices come from each country?” needs a separate answer per country. You could write one filtered query per country, but with 24 billing countries that is 24 queries. There is a far better way.

The GROUP BY clause splits your rows into groups and computes the aggregate once for each group, all in a single query. It is the single most important clause in this module.

By the end of this lesson, you will be able to:

  • Explain what the GROUP BY clause does and when to use it
  • Compute aggregate statistics for each group in one query
  • Combine GROUP BY with WHERE to summarize filtered groups
  • Order grouped results with ORDER BY, including by aggregated columns
  • Place GROUP BY correctly in SQL’s order of clauses and order of execution

You will continue with the Chinook database, using the invoice and invoice_line tables. Let’s begin.

Data for this lesson

Database: chinook.db — a digital music store (SQLite).

Tables used: invoice, invoice_line

Open it in any SQLite client — DB Browser for SQLite, the sqlite3 CLI, or Python’s sqlite3 — to run every query yourself.


The Problem GROUP BY Solves

Here are the first five rows of the invoice table, with more of its columns shown:

invoice_idcustomer_idbilling_citybilling_statebilling_countrytotal
12StuttgartGermany1.98
24OsloNorway3.96
38BrusselsBelgium5.94
414EdmontonABCanada8.91
523BostonMAUSA13.86

To count the U.S. invoices, you already know how to use WHERE:

SELECT COUNT(*) AS num_row
  FROM invoice
 WHERE billing_country = 'USA';
num_row
-------
91

To also count Canada’s and France’s invoices, you would change the condition and run the query again — once per country. With 24 countries in the table, that approach simply does not scale. GROUP BY lets you get all 24 counts at once.


Counting Rows by Group

The GROUP BY clause tells SQL which column to split the rows on. Imagine a small illustrative example table holding five invoices — three billed to USA, one to Canada, and one to France. Grouping by billing_country and counting gives:

SELECT COUNT(*) AS num_row
  FROM example
 GROUP BY billing_country;
num_row
-------
1
1
3

There are three counts — one per country — but you cannot tell which is which. The fix is to add the grouping column to the SELECT list so each row is labeled:

SELECT billing_country, COUNT(*) AS num_row
  FROM example
 GROUP BY billing_country;
billing_country | num_row
--------------- | -------
Canada          | 1
France          | 1
USA             | 3

Now run the same idea on the full invoice table to count invoices for every country at once:

SELECT billing_country, COUNT(*) AS num_row
  FROM invoice
 GROUP BY billing_country;

One query, one row per country. That is the whole point of GROUP BY.

Always SELECT the grouping column

If you GROUP BY a column but forget to put it in the SELECT list, you will get the right counts but no labels. As a rule, the column you group by belongs in your SELECT so each result row identifies its group.


How Grouping Works

Without GROUP BY, an aggregate treats the entire table as one big group. With GROUP BY, SQL works in three steps:

  1. Split the rows into groups based on the distinct values of the grouping column.
  2. Apply the aggregate function to each group separately.
  3. Combine the per-group results into one output table — one row per group.

Summary statistics per group

Any aggregate works with GROUP BY, not just COUNT. Consider the invoice_line table, which lists every individual track sold. Here are its first rows:

invoice_line_idinvoice_idtrack_idunit_pricequantity
1120.991
2140.991
3260.991
4280.991
52100.991

Each row is one line on a sale. If you add up unit_price * quantity for every line of a given invoice, you should recover that invoice’s total from the invoice table. Group by invoice_id and sum the line totals:

SELECT invoice_id, SUM(unit_price * quantity) AS total
  FROM invoice_line
 GROUP BY invoice_id
 LIMIT 5;
invoice_id | total
---------- | -----
1          | 1.98
2          | 3.96
3          | 5.94
4          | 8.91
5          | 13.86

Compare these to the total column of the invoice table — they match exactly. Grouping the line items reproduces each invoice’s stored total. Notice you can aggregate a computed column (unit_price * quantity) per group, just as you did over a whole table in Lesson 2.


Where GROUP BY Fits

Adding a new clause means knowing where it goes. SQL is strict about clause order, and getting it right prevents errors.

Order of clauses (how you write them)

Before this lesson you knew:

SELECT > FROM > WHERE > ORDER BY > LIMIT

GROUP BY slots in after WHERE and before ORDER BY:

SELECT > FROM > WHERE > GROUP BY > ORDER BY > LIMIT

You must write the clauses in this order or SQL will report a syntax error.

Order of execution (how SQL runs them)

The order in which SQL executes clauses differs from the order you write them. In particular, although SELECT is written first, SQL runs it after grouping:

FROM > WHERE > GROUP BY > SELECT > ORDER BY > LIMIT

Two facts follow from this and are worth memorizing:

  • SQL filters rows with WHERE before grouping them.
  • SQL groups rows before ordering the result and limiting how many rows appear.

This is why WHERE conditions apply to raw rows, while ORDER BY can sort by an aggregated column — by the time ordering runs, the aggregates already exist.


Grouping with a Condition

Because WHERE runs before GROUP BY, you can filter rows first and then group only what survives. Recall the per-country counts:

SELECT billing_country, COUNT(*) AS num_row
  FROM invoice
 GROUP BY billing_country;

Now restrict to invoices of 1010 dollars or less by adding WHERE before GROUP BY:

SELECT billing_country, COUNT(*) AS num_row
  FROM invoice
 WHERE total <= 10
 GROUP BY billing_country;

For the five-row example, the USA’s count drops from 3 to 1, because two of its invoices exceed 1010 dollars and were filtered out before grouping.

A practical example: U.S. sales by state

Let’s compute, for U.S. invoices only, the number of invoices and average sale per state:

SELECT billing_state, COUNT(*) AS num_row, AVG(total) AS avg_sale
  FROM invoice
 WHERE billing_country = 'USA'
 GROUP BY billing_state;
billing_state | num_row | avg_sale
------------- | ------- | --------
AZ            | 7       | 5.374286
CA            | 21      | 5.517143
FL            | 7       | 5.660000
IL            | 7       | 6.231429
MA            | 7       | 5.374286
NV            | 7       | 5.374286
NY            | 7       | 5.374286
TX            | 7       | 6.802857
UT            | 7       | 6.231429
WA            | 7       | 5.660000
WI            | 7       | 6.088571

The WHERE clause kept only U.S. rows; GROUP BY then produced one summary line per state.


Ordering Grouped Results

A grouped result is still a table, so you can sort it with ORDER BY. The interesting part is that you can sort by an aggregated column — because ordering runs after grouping.

To find the states with the most customers, sort the previous query by num_row descending:

SELECT billing_state, COUNT(*) AS num_row, AVG(total) AS avg_sale
  FROM invoice
 WHERE billing_country = 'USA'
 GROUP BY billing_state
 ORDER BY num_row DESC;
billing_state | num_row | avg_sale
------------- | ------- | --------
CA            | 21      | 5.517143
WI            | 7       | 6.088571
WA            | 7       | 5.660000
UT            | 7       | 6.231429
TX            | 7       | 6.802857
NY            | 7       | 5.374286
...           | ...     | ...

California leads clearly with 21 invoices. But the remaining states all tie at 7, so the order among them is arbitrary. You can break the tie with a second sort key — here, average sale descending — and keep just the top three with LIMIT:

SELECT billing_state, COUNT(*) AS num_row, AVG(total) AS avg_sale
  FROM invoice
 WHERE billing_country = 'USA'
 GROUP BY billing_state
 ORDER BY num_row DESC, avg_sale DESC
 LIMIT 3;
billing_state | num_row | avg_sale
------------- | ------- | --------
CA            | 21      | 5.517143
TX            | 7       | 6.802857
IL            | 7       | 6.231429

ORDER BY num_row DESC, avg_sale DESC sorts first by count, then by average within ties. This works precisely because SQL executes ORDER BY after GROUP BY, so the aggregated columns already exist when sorting happens.


A Complete Grouped Query

You can now combine WHERE, GROUP BY, several aggregates, ORDER BY, and LIMIT in one query. Suppose you want the top three sales cities across France and Canada, with a full set of statistics for each:

SELECT billing_city,
       COUNT(*) AS num_row,
       SUM(total) AS overall_sale,
       MIN(total) AS min_sale,
       AVG(total) AS avg_sale,
       MAX(total) AS max_sale
  FROM invoice
 WHERE billing_country = 'Canada'
    OR billing_country = 'France'
 GROUP BY billing_city
 ORDER BY overall_sale DESC, num_row DESC
 LIMIT 3;
billing_city | num_row | overall_sale | min_sale | avg_sale | max_sale
------------ | ------- | ------------ | -------- | -------- | --------
Paris        | 14      | 77.24        | 0.99     | 5.517143 | 13.86
Dijon        | 7       | 40.62        | 0.99     | 5.802857 | 16.86
Bordeaux     | 7       | 39.62        | 0.99     | 5.660000 | 13.86

Read top to bottom, this query: filters to two countries, groups by city, computes five statistics per city, sorts by total sales (then by count to break ties), and keeps the top three. Paris leads with the most sales and the highest overall purchase amount; in Dijon, the average customer spends the most of the three.


Practice Exercises

Work these against the Chinook database.

Exercise 1: Invoices per country

Write a query on the invoice table that returns each billing_country and the number of invoices for it as num_invoices, sorted from most invoices to fewest.

-- Your code here

Hint

GROUP BY billing_country, select billing_country and COUNT(*) AS num_invoices, then add ORDER BY num_invoices DESC.

Exercise 2: Best-selling tracks

Using the invoice_line table, write a query that returns each track_id, the number of line items for it as num_row, and the total revenue as overall_sale (computed as SUM(unit_price * quantity)). Order by overall_sale descending, then num_row descending, and show only the top five.

-- Your code here

Hint

Group by track_id. Use SUM(unit_price * quantity) AS overall_sale, then ORDER BY overall_sale DESC, num_row DESC LIMIT 5.

Exercise 3: Average sale per U.S. state

Write a query that, for U.S. invoices only, returns each billing_state and its average sale as avg_sale, sorted from highest average to lowest.

-- Your code here

Hint

Filter with WHERE billing_country = 'USA' before the GROUP BY billing_state, then ORDER BY avg_sale DESC.


Summary

You learned the GROUP BY clause — the tool that turns a single summary into one summary per group. You saw how grouping works in three steps, how to label groups by selecting the grouping column, and how WHERE and ORDER BY cooperate with grouping thanks to SQL’s order of execution.

Key Concepts

  • GROUP BY — splits rows into groups by the distinct values of a column, then applies the aggregate to each group.
  • Three-step grouping — SQL splits rows into groups, applies the aggregate per group, then combines the results.
  • Labeling groups — include the grouping column in SELECT so each result row identifies its group.
  • Order of clausesSELECT > FROM > WHERE > GROUP BY > ORDER BY > LIMIT.
  • Order of executionFROM > WHERE > GROUP BY > SELECT > ORDER BY > LIMIT; WHERE filters before grouping, ORDER BY sorts after grouping (so you can sort by aggregates).

Why This Matters

GROUP BY is where SQL becomes an analytics language. “Revenue by month,” “users by plan,” “errors by service” — every one of those phrases maps directly to a GROUP BY query. Once you can group, filter, sort, and limit in a single statement, you can answer the comparative questions that drive almost every real decision.


Next Steps

You can now group by one column. Next you will group by several columns at once and learn the HAVING clause — the right way to filter on an aggregated value.

Continue to Lesson 4 - Grouping by Multiple Columns and Filtering Groups

Group by several criteria at once and filter aggregated results with HAVING

Back to Module Overview

Return to the Summarizing Data with SQL module overview


Continue Building Your Skills

Grouping is a turning point. You have gone from describing a whole table to comparing every slice of it side by side, all in one query. Keep practicing on the Chinook tables — try grouping invoices by year, customer, or country and see what stories the numbers tell.