Lesson 4 - Grouping by Multiple Columns and Filtering Groups

Welcome to Multi-Column Grouping and HAVING

You can already group by a single column and compute a summary for each group. In this final lesson of the module you push grouping further in two ways. First, you will group by several columns at once, creating finer breakdowns like “sales by country and state.” Second, you will meet the HAVING clause — the correct way to filter groups based on an aggregated value, something WHERE cannot do.

By the end of this lesson, you will be able to:

  • Group rows by two or more columns to create combined groups
  • Explain why non-aggregate, non-grouping columns should be avoided in aggregate queries
  • Filter aggregated results with the HAVING clause
  • Distinguish HAVING from WHERE and place each correctly
  • Combine WHERE and HAVING in the same query

You will keep working with the Chinook invoice table. Let’s begin.

Data for this lesson

Database: chinook.db — a digital music store (SQLite).

Tables used: invoice

Open it in any SQLite client — DB Browser for SQLite, the sqlite3 CLI, or Python’s sqlite3 — to run every query yourself.


Grouping by Several Columns

When you group by one column, you get one group per distinct value. When you group by several columns, you get one group per distinct combination of those columns’ values.

Consider this tiny illustration table:

col1col2
A1
A1
A1
A2
B2
B2
C1

col1 has distinct values A, B, C; col2 has 1 and 2. Grouping by both columns means each group is a unique pair: the possible pairs are A1, A2, B1, B2, C1, C2 — but only the combinations that actually occur appear in the result. Here B1 and C2 never appear in the data, so they are not in the output. You would get groups for A1, A2, B2, and C1.

Country and city in the invoice table

To count invoices and average the sale per country and city, list both columns after GROUP BY, separated by a comma:

SELECT billing_country, billing_city, COUNT(*) AS num_row, AVG(total) AS avg_sale
  FROM invoice
 GROUP BY billing_country, billing_city;
billing_country | billing_city | num_row | avg_sale
--------------- | ------------ | ------- | --------
Argentina       | Buenos Aires | 7       | 5.374286
Australia       | Sidney       | 7       | 5.374286
Austria         | Vienne       | 7       | 6.088571
Belgium         | Brussels     | 7       | 5.374286
...             | ...          | ...     | ...

Each row is now one country-city combination. You can group by billing_country and billing_state the same way — just swap the second column:

SELECT billing_country, billing_state, COUNT(*) AS num_row, AVG(total) AS avg_sale
  FROM invoice
 GROUP BY billing_country, billing_state;

A Caution: Non-Aggregate Columns

There is a subtle trap worth knowing about. Look back at what you can safely put in SELECT:

  • Aggregate functions over the whole table return a single summary row:

    SELECT COUNT(*) AS num_row, AVG(total) AS avg_sale
      FROM invoice;
  • Aggregate functions with GROUP BY return one row per group, and the SELECT lists the grouping columns plus the aggregates:

    SELECT billing_country, billing_state, COUNT(*) AS num_row, AVG(total) AS avg_sale
      FROM invoice
     GROUP BY billing_country, billing_state;

Both are standard and behave identically in every database. But what happens if you add a column that is neither aggregated nor in the GROUP BY?

SELECT billing_country, billing_state, billing_city, COUNT(*) AS num_row, AVG(total) AS avg_sale
  FROM invoice
 GROUP BY billing_country, billing_state
 LIMIT 3;
billing_country | billing_state | billing_city | num_row | avg_sale
--------------- | ------------- | ------------ | ------- | --------
Argentina       |               | Buenos Aires | 7       | 5.374
Australia       | NSW           | Sidney       | 7       | 5.374
Austria         |               | Vienne       | 7       | 6.089

Each group may contain several cities, yet only one city appears. SQLite simply picked one value arbitrarily. (Countries with no state recorded show a blank billing_state.)

Avoid non-aggregate columns in grouped queries

Picking an arbitrary value for a non-grouped column is a flexibility specific to SQLite. Other database systems (PostgreSQL, MySQL in strict mode, SQL Server) will raise an error instead. To write queries that work everywhere, only SELECT columns that are either in the GROUP BY or wrapped in an aggregate function.


Refining Multi-Column Groups

All the clauses you learned in Lesson 3 still work when grouping by several columns. You can sort the result with ORDER BY and filter input rows with WHERE.

Order by num_row descending to find the biggest country-state groups:

SELECT billing_country, billing_state, COUNT(*) AS num_row, AVG(total) AS avg_sale
  FROM invoice
 GROUP BY billing_country, billing_state
 ORDER BY num_row DESC
 LIMIT 3;
billing_country | billing_state | num_row | avg_sale
--------------- | ------------- | ------- | --------
France          |               | 35      | 5.574286
Germany         |               | 28      | 5.588571
Brazil          | SP            | 21      | 5.469524

Add a WHERE condition to focus on U.S. states:

SELECT billing_country, billing_state, COUNT(*) AS num_row, AVG(total) AS avg_sale
  FROM invoice
 WHERE billing_country = 'USA'
 GROUP BY billing_country, billing_state
 ORDER BY num_row DESC
 LIMIT 3;
billing_country | billing_state | num_row | avg_sale
--------------- | ------------- | ------- | --------
USA             | CA            | 21      | 5.517143
USA             | AZ            | 7       | 5.374286
USA             | FL            | 7       | 5.660000

This works because WHERE filters raw rows before grouping. But that timing creates a problem when your condition is about an aggregated value.


Why WHERE Cannot Filter Aggregates

Suppose you want only the groups with more than 20 invoices. It is tempting to write:

SELECT billing_country, billing_state, COUNT(*) AS num_row, AVG(total) AS avg_sale
  FROM invoice
 WHERE COUNT(*) > 20
 GROUP BY billing_country, billing_state;

Running it produces an error:

OperationalError: misuse of aggregate: COUNT()

The reason is execution order. Recall:

FROM > WHERE > GROUP BY > SELECT > ORDER BY > LIMIT

WHERE runs before GROUP BY. At that point the rows have not been grouped yet, so COUNT(*) for each group does not exist. You cannot filter on a number that has not been computed. WHERE is for raw-row conditions only.


Filtering Groups with HAVING

SQL solves this with the HAVING clause. HAVING is like WHERE, except it runs after grouping, so it can filter on aggregated values. To keep only groups with more than 20 invoices, replace WHERE with HAVING and move it after GROUP BY:

SELECT billing_country, billing_state, COUNT(*) AS num_row, AVG(total) AS avg_sale
  FROM invoice
 GROUP BY billing_country, billing_state
HAVING COUNT(*) > 20;
billing_country | billing_state | num_row | avg_sale
--------------- | ------------- | ------- | --------
Brazil          | SP            | 21      | 5.469524
France          |               | 35      | 5.574286
Germany         |               | 28      | 5.588571
USA             | CA            | 21      | 5.517143
United Kingdom  |               | 21      | 5.374286

No error this time — by the time HAVING runs, each group’s COUNT(*) already exists.

Where HAVING fits

HAVING slots in right after GROUP BY. The full order of clauses is now:

SELECT > FROM > WHERE > GROUP BY > HAVING > ORDER BY > LIMIT

And the order of execution:

FROM > WHERE > GROUP BY > HAVING > SELECT > ORDER BY > LIMIT

Two things to remember:

  • WHERE filters rows before grouping; HAVING filters groups after grouping.
  • HAVING is always used together with GROUP BY and is placed immediately after it.

Filtering on an aggregate you do not display

A nice property of HAVING is that the aggregate it filters on does not need to appear in SELECT — because SQL computes the statistics before deciding what to display. This query shows the same five groups as before but drops the num_row column, filtering on a count it never shows:

SELECT billing_country, billing_state, AVG(total) AS avg_sale
  FROM invoice
 GROUP BY billing_country, billing_state
HAVING COUNT(*) > 20;

You can also filter on a different aggregate. To keep only the country-state groups whose average sale is below 66 dollars, and report their min and max:

SELECT billing_country, billing_state, MIN(total) AS min_sale, MAX(total) AS max_sale
  FROM invoice
 GROUP BY billing_country, billing_state
HAVING AVG(total) < 6;
billing_country | billing_state | min_sale | max_sale
--------------- | ------------- | -------- | --------
Argentina       |               | 0.99     | 13.86
Australia       | NSW           | 0.99     | 13.86
Austria         |               | 0.99     | 18.86
Belgium         |               | 0.99     | 13.86
...             | ...           | ...      | ...

Combining WHERE and HAVING

WHERE and HAVING are not rivals — they do different jobs and often appear together. WHERE filters raw rows before grouping; HAVING filters groups after.

In the result above, some billing_state values are blank — those are countries with no state recorded. Since billing_state is a plain (non-aggregated) column, you filter it with WHERE, while still filtering the group’s average with HAVING:

SELECT billing_country, billing_state, MIN(total) AS min_sale, MAX(total) AS max_sale
  FROM invoice
 WHERE billing_state <> ''
 GROUP BY billing_country, billing_state
HAVING AVG(total) < 6;

The <> operator means not equal to. Read in execution order, this query: drops rows with no state (WHERE), groups the rest by country and state (GROUP BY), keeps only groups whose average sale is under 66 dollars (HAVING), and reports each group’s minimum and maximum sale.

WHERE vs. HAVING — a simple rule

If your condition is about a single row’s value (like billing_state <> ''), use WHERE. If your condition is about a group’s summary (like AVG(total) < 6 or COUNT(*) > 20), use HAVING.


Practice Exercises

Work these against the Chinook invoice table.

Exercise 1: Sales by country and city

Write a query that returns each billing_country, billing_city, the number of invoices as num_row, and the average sale as avg_sale, grouped by both country and city.

-- Your code here

Hint

List both columns after GROUP BY, separated by a comma, and include both in your SELECT alongside COUNT(*) and AVG(total).

Exercise 2: Busy groups only

Write a query that groups invoices by billing_country and billing_state and returns only the groups with more than 30 invoices, showing the country, state, and the count as num_row.

-- Your code here

Hint

Filter the count with HAVING COUNT(*) > 30, placed after GROUP BY. WHERE cannot filter an aggregate.

Exercise 3: Low-average groups, states only

Write a query that, for groups of country and state where the state is not blank, returns the country, state, and average sale as avg_sale, keeping only the groups whose average sale is greater than 66 dollars.

-- Your code here

Hint

Use WHERE billing_state <> '' for the row-level condition and HAVING AVG(total) > 6 for the group-level condition. WHERE comes before GROUP BY; HAVING comes after it.


Summary

You completed the module by grouping on several columns at once and by learning HAVING, the clause that filters groups on their aggregated values. You also saw why mixing non-aggregate, non-grouping columns into a grouped query is unsafe, and how WHERE and HAVING divide the work of filtering.

Key Concepts

  • Multi-column GROUP BY — grouping by several columns creates one group per distinct combination of their values; only combinations that exist in the data appear.
  • Non-aggregate columns — only SELECT columns that are in the GROUP BY or wrapped in an aggregate; other columns behave unpredictably across databases.
  • HAVING — filters groups after aggregation, so it can test aggregated values like COUNT(*) > 20.
  • WHERE vs. HAVINGWHERE filters raw rows before grouping; HAVING filters groups after grouping.
  • Final clause orderSELECT > FROM > WHERE > GROUP BY > HAVING > ORDER BY > LIMIT.
  • Final execution orderFROM > WHERE > GROUP BY > HAVING > SELECT > ORDER BY > LIMIT.

Why This Matters

Multi-column grouping and HAVING are what separate a casual query writer from a confident analyst. “Revenue by region and product line, but only for categories that cleared a sales threshold” is a single statement once you know these tools. The WHERE-versus-HAVING distinction in particular trips up many beginners, and understanding it through execution order means you will reason about queries correctly rather than guessing.


Next Steps

You can now summarize data from a single table in almost any way a report could ask for. The natural next step is to bring multiple tables together — connecting invoices to customers, tracks to albums, and more — using joins.

Continue to Combining Tables with Joins

Connect related tables together so you can summarize and analyze across your whole database

Back to Module Overview

Return to the Summarizing Data with SQL module overview


Continue Building Your Skills

You have completed Summarizing Data with SQL. You can total a column, build full summary tables, group by one or many columns, and filter groups with HAVING. These are the everyday tools of data analysis, and you now own all of them. Carry that momentum into joins, where your summaries will start spanning the whole database.