Lesson 1 - Scalar Subqueries

Welcome to Scalar Subqueries

Up to now, every query you have written has had a single job: pull rows from one place, maybe join, filter, and aggregate them, and return a result. But many real questions need the answer to one query in order to ask another. How does each country’s sales compare to the total? Which invoices are above the company-wide average? These questions are awkward — or impossible — to answer in a single flat query.

The tool that solves this is the subquery: a query nested inside another query. In this lesson you will meet the simplest and most common kind, the scalar subquery, which returns exactly one value you can drop into a calculation or comparison.

By the end of this lesson, you will be able to:

  • Explain what a subquery is and why subqueries matter
  • Write a scalar subquery that returns a single value
  • Use a scalar subquery inside a SELECT calculation
  • Filter rows with a scalar subquery in WHERE and HAVING
  • Follow SQL style conventions for writing readable subqueries

You should already be comfortable with SELECT, GROUP BY, and aggregate functions like COUNT, SUM, and AVG. Let’s begin.

Data for this lesson

Database: chinook.db — a digital music store (SQLite).

Tables used: invoice

Open it in any SQLite client — DB Browser for SQLite, the sqlite3 CLI, or Python’s sqlite3 — to run every query yourself.


What Is a Subquery?

A subquery is a query nested inside another query. The query on the inside is called the inner query (or simply the subquery); the query that contains it is the outer query.

The main reason subqueries exist is that information in a relational database is rarely stored in one table. It is spread across several tables to keep the data clean and non-redundant. Subqueries — like joins — let you pull those pieces back together to answer a single question.

We will work with the Chinook database throughout this module. Chinook models a fictional digital music store (think of an iTunes-style shop) with tables for customers, invoices, tracks, albums, genres, and employees.

Subqueries Are Classified by What They Return

SQL subqueries come in three flavors, classified by their output:

  • A scalar subquery returns a single value.
  • A multi-row subquery returns several values in one column (a list).
  • A multi-column subquery returns a table (several rows and columns).

This lesson focuses entirely on scalar subqueries. You will meet the other two types in the lessons that follow.

How Subqueries Execute

In SQL’s execution order, the inner query runs first, and its result is then used by the outer query. The one rule you must never break: always enclose a subquery in parentheses (). That is how SQL knows where the inner query begins and ends.


Your First Scalar Subquery

Let’s answer a real question: what percentage of all sales is made in each country? This means dividing the number of invoices per country by the total number of invoices across the whole table. There is no clean way to do this in one flat query — we need a subquery.

Building It Step by Step

First, count the sales per country from the invoice table:

SELECT billing_country, COUNT(*) AS row_tally
  FROM invoice
 GROUP BY billing_country
 ORDER BY row_tally DESC
 LIMIT 5;

Output:

billing_countryrow_tally
USA91
Canada56
France35
Brazil35
Germany28

Next, find the total number of sales — a single number:

SELECT COUNT(*)
  FROM invoice;

This returns 412. So the United States accounts for more than 22% of sales (91 * 100.0 / 412). We could hardcode 412 into a query:

SELECT billing_country,
       ROUND(COUNT(*) * 100.0 / 412, 2) AS sales_prop
  FROM invoice
 GROUP BY billing_country
 ORDER BY sales_prop DESC
 LIMIT 5;

Output:

billing_countrysales_prop
USA22.09
Canada13.59
France8.5
Brazil8.5
Germany6.8

Notice the * 100.0 (not * 100): because COUNT(*) and 412 are integers, multiplying by 100.0 forces SQL to do decimal division instead of integer division. ROUND(..., 2) then trims the result to two decimal places.

Replacing the Hardcoded Value with a Subquery

Hardcoding 412 is fragile — the moment a new invoice is added, the number is wrong. Instead, replace it with the query that computes it, wrapped in parentheses:

SELECT billing_country,
       ROUND(COUNT(*) * 100.0 /
             (SELECT COUNT(*)
                FROM invoice), 2) AS sales_prop
  FROM invoice
 GROUP BY billing_country
 ORDER BY sales_prop DESC
 LIMIT 5;

The inner query (SELECT COUNT(*) FROM invoice) returns the single value 614, and the outer query uses it in the division. The result is identical to the hardcoded version — but now it stays correct as the data changes. That is the power of a scalar subquery: one value, computed on the fly, used anywhere a value can go.


Using Scalar Subqueries to Filter Rows

Because a scalar subquery returns one value, you can use it anywhere you would use a literal value — including the WHERE clause to filter rows.

Filtering with WHERE

Suppose you want to count invoices whose total is greater than the average invoice total. You do not know the average in advance, so compute it with a subquery:

SELECT COUNT(*) AS rows_tally
  FROM invoice
 WHERE total > (SELECT AVG(total)
                  FROM invoice);

The inner query computes the average of the total column. The outer query then counts every invoice whose total exceeds that benchmark. The comparison total > (subquery) works because the subquery resolves to a single number.

Filtering Against a Computed Threshold

Subqueries can do arithmetic too. Here we keep only invoices whose total exceeds 75% of the single largest invoice:

SELECT *
  FROM invoice
 WHERE total > (SELECT MAX(total) * 0.75
                  FROM invoice);

The subquery finds the maximum total, multiplies it by 0.75, and returns that one value. The outer query filters against it. This is a clean way to express “the top spenders” without first running a separate query to find the maximum.


Filtering Aggregated Results with HAVING

When you filter the rows of a query, you use WHERE. When you filter aggregated results, you use HAVING. Scalar subqueries work in HAVING exactly as they do in WHERE.

A Motivating Question

Here is the total spend per customer, keeping only customers who spent at least 40 dollars:

SELECT customer_id, SUM(total) AS customer_total
  FROM invoice
 GROUP BY customer_id
HAVING SUM(total) >= 40;

Output:

customer_idcustomer_total
540.62
649.62
742.62
2443.62
2542.62
2647.62
2843.62
3743.62
4340.62
4441.62

(showing 10 of 14)

Customer 6 has the highest total spend. But which customers have a higher average purchase than customer 5? To answer that, you need customer 5’s average first, then compare every other customer against it — a perfect job for a subquery.

Comparing Aggregates with a Subquery

SELECT customer_id, AVG(total) AS customer_avg
  FROM invoice
 GROUP BY customer_id
HAVING AVG(total) > (SELECT AVG(total)
                       FROM invoice
                      WHERE customer_id = 5);

The subquery computes customer 5’s average purchase amount — a single value. The outer query groups all invoices by customer and keeps only those whose average exceeds that benchmark. You have filtered an aggregate against another aggregate, all in one statement.


Style: Writing Readable Subqueries

Subqueries can get hard to read fast, so follow the same style you use for any query:

  • Always wrap the subquery in parentheses. This is required, not optional.
  • Uppercase keywords like SELECT, FROM, WHERE, and AVG.
  • Align clauses to the river — the vertical line of spacing before each keyword — and indent the subquery to the right of the value it produces.
  • Put each clause on its own line.
  • End only the outer query with a semicolon; the inner query has no trailing semicolon.

Compare a cramped version with a clean one. Both run, but only one is readable:

SELECT COUNT(*) AS rows_tally FROM invoice WHERE total > (SELECT AVG(total) FROM invoice);
SELECT COUNT(*) AS rows_tally
  FROM invoice
 WHERE total > (SELECT AVG(total)
                  FROM invoice);

When a teammate (or future you) reads the second version, the structure of the query is obvious at a glance.


Practice Exercises

Try each of these against the Chinook invoice and customer tables before peeking at the hints.

Exercise 1: Customer Share of Sales

Write a query that returns the top five customers by their percentage of total sales — the number of invoices per customer_id over the total number of invoices — rounded to two decimal places. Use a scalar subquery for the total.

Hint

Start from the country-share query. Swap billing_country for customer_id, keep the (SELECT COUNT(*) FROM invoice) subquery, and remember * 100.0 for decimal division.

Exercise 2: Each Country’s Revenue Share

Write a query that returns each country’s share of total revenue (use SUM(total), not COUNT(*)), ordered from largest to smallest, limited to five rows. Name the column country_share.

Hint

Your subquery is (SELECT SUM(total) FROM invoice). Divide each country’s SUM(total) by it. Group by billing_country and order by country_share DESC.

Exercise 3: Above-Average Spenders

Write a query that returns every customer whose average purchase amount is greater than the overall average purchase amount across all invoices.

Hint

Group by customer_id, compute AVG(total), and use HAVING AVG(total) > (SELECT AVG(total) FROM invoice).


Summary

You learned that a subquery is a query nested inside another, that the inner query runs first, and that scalar subqueries return exactly one value you can use anywhere a value is allowed. You used them to compute proportions in SELECT, to filter rows in WHERE, and to filter aggregates in HAVING.

Key Concepts

  • Subquery — a query nested inside another query, always enclosed in parentheses ().
  • Inner query — the nested subquery; it executes first.
  • Outer query — the query that contains the subquery and uses its result.
  • Scalar subquery — a subquery that returns a single value, usable in SELECT, WHERE, and HAVING.
  • Decimal division — multiply by 100.0 (not 100) so SQL divides in decimals instead of integers.

Why This Matters

Comparing each row against a computed benchmark — the total, the average, a percentage of the maximum — is one of the most common things analysts do. Scalar subqueries let you express that comparison in a single, self-updating query instead of running two separate queries and copying numbers by hand. Every dashboard metric that reads “share of total” or “above average” is, under the hood, a scalar subquery.


Next Steps

You can now compute and compare against a single value. Next, you will return sets of values from a subquery to power IN, ANY, and ALL comparisons.

Continue to Lesson 2 - Multi-Row and Multi-Column Subqueries

Return sets of rows and columns from a subquery to power IN, ANY, and ALL comparisons

Back to Module Overview

Return to the Subqueries, CTEs, and Views module overview


Continue Building Your Skills

You just unlocked a new level of SQL: queries that depend on the answers to other queries. The scalar subquery is your first taste of composing logic, and you will lean on it constantly. Keep practicing the “compare each row to a computed benchmark” pattern — it shows up everywhere from reporting to data validation.