Lesson 2 - From SQL Results to pandas DataFrames

Welcome to SQL Results in pandas

In the last lesson you queried a SQLite database and got back a list of tuples. That works, but tuples make for clumsy analysis — there are no column names, and every calculation means looping and indexing by position. In this lesson you fix that by routing your query results straight into a pandas DataFrame, the table-shaped data structure at the heart of data analysis in Python.

A DataFrame gives you named columns, vectorized operations, and one-line plotting. Once your SQL results land in a DataFrame, the full power of pandas — filtering, grouping, sorting, charting — is available to you. SQLite is excellent at storing and querying data, but it cannot draw a chart; pandas and matplotlib can, so the natural move is to query in SQL and analyze in Python.

By the end of this lesson, you will be able to:

  • Connect to a SQLite database and run a query with pd.read_sql_query()
  • Receive query results directly as a labeled pandas DataFrame
  • Write readable multi-line SQL queries inside Python
  • Build a DataFrame manually from cursor results with pd.DataFrame()
  • Join tables in SQL and visualize the results with matplotlib

You should be comfortable with the connect-cursor-execute-fetch-close workflow from Lesson 1. Let’s begin.

Data for this lesson

Database: world_population.db — country population data (SQLite).

Tables used: population

Open it in any SQLite client — DB Browser for SQLite, the sqlite3 CLI, or Python’s sqlite3 — to run every query yourself.


The Dataset

In this lesson you will work with the World Population Database, a public dataset describing the population of countries (and World Bank country groupings) around the world. It is stored in a SQLite file named world_population.db. The table, population, has one row per country per year, with three columns:

  • CountryName — the country’s full name
  • Year — the year of the measurement
  • Population — the country’s population that year

The data covers the years 1960 through 2024. Alongside individual countries, the table also contains a few World Bank aggregates such as World, Arab World, and income groups — keep that in mind when you sort by population.

Your goal across the lesson is concrete: pull population-by-year data into Python and chart it. Because SQLite has no plotting of its own, you will query the data, hand it to pandas, and draw the chart there.


Querying Straight into a DataFrame

In Lesson 1 you fetched rows manually. pandas offers a more direct route with pd.read_sql_query(), which runs a SQL query and returns the results as a DataFrame in a single call. It takes two arguments: the query string and an open connection.

A first query

import pandas as pd
import sqlite3

# Open the connection
conn = sqlite3.connect("world_population.db")

# Run the query and read the results into a DataFrame
query = "SELECT CountryName, Population FROM population WHERE Year = 2020 LIMIT 10;"
results = pd.read_sql_query(query, conn)

print(results)

# Always close the connection when finished
conn.close()

Output:

                   CountryName  Population
0                        Aruba      108587
1  Africa Eastern and Southern   694446100
2                  Afghanistan    39068979
3   Africa Western and Central   474569351
4                       Angola    33451132
5                      Albania     2528480
6                      Andorra       77380
7                   Arab World   453723239
8         United Arab Emirates     9401038
9                    Argentina    45191965

Look at what you got back. Instead of a list of bare tuples, results is a DataFrame with named columns (CountryName, Population) and a numeric index down the left side. Every pandas tool now applies. For example:

print(type(results))
print(results.shape)
print(results["Population"].max())

Output:

<class 'pandas.core.frame.DataFrame'>
(10, 2)
694446100

That largest value belongs to Africa Eastern and Southern, a World Bank regional aggregate rather than a single country — a reminder that this dataset mixes individual countries with grouped totals.

Why this is better than tuples

With a list of tuples you would write row[1] and have to remember that index 1 meant population. With a DataFrame you write results["Population"] and the code explains itself. You also get free access to .head(), .describe(), .sort_values(), .groupby(), and hundreds of other methods — none of which exist on a plain list.

Note

The same pd.read_sql_query() call works against many databases, not just SQLite — you simply pass a connection created by a different library (for example, a PostgreSQL connection). The pandas side of your code stays the same.


Writing Readable Queries with Multi-line Strings

Short queries fit on one line, but real queries have several clauses and quickly become hard to read. Python’s multi-line strings — text wrapped in triple quotes (""" or ''') — let you spread a query across lines with indentation, so each clause stands on its own:

query = """
    SELECT CountryName, Year, Population
      FROM population
     WHERE CountryName = 'United States';
"""

This is exactly the same string as a one-line version with \n characters in it; SQLite ignores the extra whitespace. The payoff is readability — you can scan the SELECT, FROM, and WHERE clauses at a glance.

Putting it to work:

import pandas as pd
import sqlite3

conn = sqlite3.connect("world_population.db")

query = """
    SELECT CountryName, Year, Population
      FROM population
     WHERE CountryName = 'United States';
"""
results = pd.read_sql_query(query, conn)

print(results.head(10))

conn.close()

Output:

     CountryName  Year  Population
0  United States  1960   180671000
1  United States  1961   183691000
2  United States  1962   186538000
3  United States  1963   189242000
4  United States  1964   191889000
5  United States  1965   194303000
6  United States  1966   196560000
7  United States  1967   198712000
8  United States  1968   200706000
9  United States  1969   202677000

Notice results.head(10) shows just the first ten rows — a quick way to peek at a DataFrame without printing the whole thing.

Tip

Although SQLite does not require a semicolon at the end of a query, ending with one is good practice and keeps your queries portable to databases that do expect it.


Building a DataFrame from Cursor Results

pd.read_sql_query() is the easiest path, but it is worth knowing how to build a DataFrame yourself from the raw cursor results you learned to fetch in Lesson 1. This is useful when you have already fetched rows, or when you want full control over column names.

The constructor pd.DataFrame() accepts a list of tuples as rows; you supply the column names separately with the columns argument:

import pandas as pd
import sqlite3

conn = sqlite3.connect("world_population.db")
cursor = conn.cursor()

# Fetch rows the Lesson 1 way: a list of tuples
cursor.execute("SELECT CountryName, Population FROM population WHERE Year = 2020 LIMIT 5;")
rows = cursor.fetchall()
print("Raw rows:", rows)

# Turn the tuples into a labeled DataFrame
df = pd.DataFrame(rows, columns=["CountryName", "Population"])
print(df)

conn.close()

Output:

Raw rows: [('Aruba', 108587), ('Africa Eastern and Southern', 694446100), ('Afghanistan', 39068979), ('Africa Western and Central', 474569351), ('Angola', 33451132)]
                   CountryName  Population
0                        Aruba      108587
1  Africa Eastern and Southern   694446100
2                  Afghanistan    39068979
3   Africa Western and Central   474569351
4                       Angola    33451132

The same five tuples that were awkward to work with on their own become a tidy, labeled table.

Pulling column names from the cursor

You can even ask the cursor for the column names instead of typing them by hand. After running a query, the cursor’s description attribute holds metadata about each column, with the name in the first position:

cursor.execute("SELECT CountryName, Year, Population FROM population LIMIT 5;")
rows = cursor.fetchall()

# Extract column names from cursor.description
column_names = [col[0] for col in cursor.description]
print("Columns:", column_names)

df = pd.DataFrame(rows, columns=column_names)
print(df)

Output:

Columns: ['CountryName', 'Year', 'Population']
  CountryName  Year  Population
0       Aruba  1960       54922
1       Aruba  1961       55578
2       Aruba  1962       56320
3       Aruba  1963       57002
4       Aruba  1964       57619

This is essentially what pd.read_sql_query() does for you under the hood — it runs the query, fetches the rows, and reads the column names from cursor.description.


Visualizing the Results

Once your data is in a DataFrame, plotting it is straightforward with matplotlib. Since SQLite cannot draw charts, this is the real payoff of bringing data into Python.

A bar chart of the largest countries

Let’s query the ten most populous countries in 2022 and chart them. Because the table also stores World Bank aggregates like World and Arab World, we list the individual countries we care about in an IN (...) filter so only real countries reach the chart:

import sqlite3
import pandas as pd
import matplotlib.pyplot as plt

conn = sqlite3.connect("world_population.db")

query = """
    SELECT CountryName, Population
      FROM population
     WHERE Year = 2022
       AND CountryName IN ('India', 'China', 'United States', 'Indonesia',
                           'Pakistan', 'Nigeria', 'Brazil', 'Bangladesh',
                           'Russian Federation', 'Mexico', 'Japan', 'Ethiopia')
     ORDER BY Population DESC
     LIMIT 10;
"""
data = pd.read_sql_query(query, conn)
conn.close()

print(data)

# Build a bar chart from the DataFrame columns
plt.bar(data["CountryName"], data["Population"])
plt.xticks(rotation=45, ha="right")
plt.xlabel("Country")
plt.ylabel("Population")
plt.title("Top 10 Countries by Population in 2022")
plt.show()

Output:

          CountryName  Population
0               India  1425423212
1               China  1412175000
2       United States   334017321
3           Indonesia   278830529
4            Pakistan   243700667
5             Nigeria   223150896
6              Brazil   210306415
7          Bangladesh   169384897
8  Russian Federation   144236933
9              Mexico   128613117

The query sorts countries by population in descending order and keeps the top ten, so data arrives ready to plot. plt.bar() draws the bars, plt.xticks(rotation=45, ha="right") angles the country labels so they do not overlap, the xlabel/ylabel/title calls label the chart, and plt.show() displays it.

Charting one country over time

You can reproduce the lesson’s original task — population by year for a single country — the same way. Query the rows, then plot Year against Population:

import sqlite3
import pandas as pd
import matplotlib.pyplot as plt

conn = sqlite3.connect("world_population.db")

query = """
    SELECT Year, Population
      FROM population
     WHERE CountryName = 'United States';
"""
data = pd.read_sql_query(query, conn)
conn.close()

plt.bar(data["Year"], data["Population"])
plt.xlabel("Year")
plt.ylabel("Population")
plt.title("Population of the United States by Year")
plt.show()

Because the data is already a DataFrame, data["Year"] and data["Population"] plug directly into plt.bar(). The result is a column chart showing steady population growth across the decades.


Putting It All Together: Joining Tables

Real questions often compare rows that live in different parts of a table. A self-join treats one table as if it were two — here you join population to itself so that each country’s 2010 row sits next to its 2020 row, letting you compute the population change over the decade in a single query. Because the SQL JOIN happens inside the database, pandas receives a single, already-combined result set.

import sqlite3
import pandas as pd
import matplotlib.pyplot as plt

conn = sqlite3.connect("world_population.db")

query = """
    SELECT old.CountryName,
           new.Population - old.Population AS PopChange
      FROM population old
      JOIN population new ON old.CountryName = new.CountryName
     WHERE old.Year = 2010 AND new.Year = 2020
       AND old.CountryName IN ('India', 'China', 'United States', 'Indonesia',
                               'Pakistan', 'Nigeria', 'Brazil', 'Bangladesh',
                               'Russian Federation', 'Mexico', 'Japan', 'Ethiopia',
                               'Philippines', 'Egypt, Arab Rep.')
     ORDER BY PopChange DESC
     LIMIT 10;
"""
df = pd.read_sql_query(query, conn)
conn.close()

print(df.head())

# Horizontal bar chart: CountryName on the y-axis, population change on the x-axis
plt.barh(df["CountryName"], df["PopChange"])
plt.title("Top 10 Countries by Population Change from 2010 to 2020")
plt.xlabel("Population Change")
plt.ylabel("Country")
plt.show()

Output:

     CountryName  PopChange
0          India  159136131
1          China   73395000
2        Nigeria   47353295
3       Pakistan   35762699
4      Indonesia   28509544

A few things to notice in the query:

  • The JOIN ... ON clause links the table to a second copy of itself, matching rows on the shared CountryName. The aliases old and new keep the two copies straight.
  • WHERE old.Year = 2010 AND new.Year = 2020 pins each side of the join to one year, so every matched pair spans exactly one decade.
  • new.Population - old.Population computes the change, and the alias PopChange names that computed column, so it appears as df["PopChange"] in pandas.
  • The IN (...) filter keeps the chart to individual countries, and ORDER BY ... DESC LIMIT 10 keeps the ten largest.

Caution

SQL keywords and column names are case-insensitive inside the database, but pandas column labels are case-sensitive. If your query aliases a column PopChange, you must write df["PopChange"]df["popchange"] would raise a KeyError.

This single workflow — join in SQL, load with pandas, chart with matplotlib — is the everyday rhythm of data analysis that draws on a database.


Practice Exercises

Try these against world_population.db. Assume the population table described above (columns CountryName, Year, Population).

Exercise 1: Query into a DataFrame

Connect to world_population.db and use pd.read_sql_query() to retrieve the CountryName and Population of all countries in the year 2020, limited to 10 rows. Print the DataFrame and close the connection.

# Your code here

Hint

Filter with WHERE Year = 2020 and trim with LIMIT 10. Pass the query string and conn to pd.read_sql_query(), then call conn.close() at the end.

Exercise 2: A Readable Multi-line Query

Write a SELECT query as a multi-line triple-quoted string that returns CountryName, Year, and Population for your own country across all years. Read it into a DataFrame and print the first ten rows with .head(10).

# Your code here

Hint

Wrap the query in """ so each clause sits on its own line. Use WHERE CountryName = '...' to filter, matching the exact name in the database.

Exercise 3: Build a DataFrame by Hand

Use a cursor to fetch the CountryName and Population for the year 2022 (limit 5) as a list of tuples, then turn that list into a DataFrame with pd.DataFrame(), giving it the column names CountryName and Population.

# Your code here

Hint

After cursor.execute(...), call fetchall() to get the tuples, then pd.DataFrame(rows, columns=["CountryName", "Population"]).

Exercise 4: Chart Population by Year

Query the population of your country for every year, read it into a DataFrame, and use plt.bar() to draw a column chart with Year on the x-axis and Population on the y-axis. Add axis labels and a title.

# Your code here

Hint

After loading the DataFrame, pass data["Year"] and data["Population"] to plt.bar(), then call plt.xlabel(), plt.ylabel(), plt.title(), and finish with plt.show().


Summary

You have moved from raw tuples to fully labeled pandas DataFrames. The fastest route is pd.read_sql_query(query, conn), which runs a SQL query and hands back a DataFrame in one step. When you have already fetched rows yourself, pd.DataFrame(rows, columns=[...]) builds the same structure, and cursor.description can supply the column names automatically. Once the data is in a DataFrame, you used matplotlib to turn database queries into charts, and you used a self-join in SQL so that pandas received a single combined result.

Key Concepts

  • pandas DataFrame — a labeled, table-shaped data structure that makes filtering, grouping, and plotting easy.
  • pd.read_sql_query(query, conn) — runs a SQL query and returns the results directly as a DataFrame.
  • Multi-line strings — triple-quoted strings that let you write readable, multi-clause SQL inside Python.
  • pd.DataFrame(rows, columns=[...]) — builds a DataFrame from a list of tuples with column names you specify.
  • cursor.description — metadata about a query’s columns; the column name is the first item of each entry.
  • JOIN ... ON — combines rows from two tables (or a self-join of one table) in the database before pandas ever sees them.
  • matplotlib — the plotting library that turns DataFrame columns into bar and column charts.

Why This Matters

Databases store the data; pandas and matplotlib make sense of it. By querying in SQL and loading straight into a DataFrame, you let each tool do what it does best — the database filters and joins efficiently over large tables, and pandas handles the analysis and visualization. This combination is the daily workflow of analysts, data scientists, and engineers everywhere, and it scales from a tiny SQLite file on your laptop to a production database serving an entire company.


Next Steps

You can now query a database from Python and analyze the results with pandas. The natural next step is to work with a full production-grade database. PostgreSQL is the open-source database that powers countless real systems, and learning it opens the door to data engineering.

Continue to PostgreSQL for Data Engineers

Step up to a production-grade database and start building data engineering skills with PostgreSQL

Back to Module Overview

Return to the Querying Databases from Python module overview


Continue Building Your Skills

You have closed the loop between SQL and Python: query in the database, analyze in pandas, visualize with matplotlib. Practice this workflow on your own database files and datasets — pick a table, ask a question, write the query, and chart the answer. The more you repeat the cycle, the more naturally you will reach for a database whenever real data work begins.