← All articles
PythonData Analysis

Pandas and JSON: A Practical Guide to Nested Data and Large Files

Real JSON is rarely a flat table. This guide builds a tree-to-table mental model for pandas' JSON tools, then works through json_normalize on nested records and reading JSON Lines files with lines=True and chunksize so a file far bigger than memory still processes cleanly.

A CSV file is already a table — rows and columns, nothing to decide. JSON almost never arrives that way. An order from an e-commerce API comes with a customer object nested inside it and a list of line items nested inside that, and before you can plot or group anything, you have to answer a question CSV never asks: how do I turn this shape into a table at all? (If you haven’t seen basic JSON parsing yet — json.load on a small file — our post on reading files in Python covers that first; this post picks up exactly where it leaves off, with pandas and data too large to treat as one object.)

There’s a second question people skip just as often: what happens when the JSON is a file too large to load as one object in memory? This is where people reach for pd.read_json on a multi-gigabyte file, watch their machine grind to a halt, and assume pandas “isn’t built for JSON.” It is — you’re just missing two tools: json_normalize for nested structure, and JSON Lines with chunked reading for scale. This guide builds the mental model first, then works through both on data you can reproduce yourself.

The Mental Model: A Tree, Not a Table

A DataFrame is a table: fixed columns, one flat value per cell. JSON is a tree: objects can contain objects, and objects can contain lists of more objects, nested as deep as whoever wrote the API felt like going. Every tool in this post exists to answer one question, over and over: given this tree, which nodes become columns, and which nodes become rows?

  1. A scalar value (a string, a number, a date) at the top level of a record becomes a column, unchanged.
  2. A nested object (a dict inside a dict) becomes a group of columns, one per key, joined by a dot — customer with a name inside it becomes a column called customer.name.
  3. A nested list of objects (a list of dicts) can’t become a column at all — a cell can’t hold multiple rows. It has to become new rows, with the surrounding record’s fields repeated onto each one.
Diagram showing a nested JSON tree being flattened into a DataFrame: a top-level order id becomes a plain column, a nested customer object becomes dot-notation columns like customer.name, and a nested list of item objects becomes multiple rows with the order's own fields repeated onto each one.

Keep this in mind and pd.json_normalize’s two main arguments stop looking arbitrary: meta says which of the record’s own fields to carry along, and record_path says which nested list to explode into rows. Everything else in this post is just applying that idea.

A Small Order Export You Can Reproduce

Say you run a small independent bookshop and your storefront platform exports each order as a JSON object: an order ID and timestamp at the top, a nested customer object, and a nested list of items, since one order can contain several books. Six orders, hand-written so you can see the exact shape before any pandas is involved:

import json
from pathlib import Path

orders = [
    {
        "order_id": "ORD-1042",
        "ordered_at": "2026-06-18T14:32:00",
        "status": "shipped",
        "customer": {"name": "Freya Lindqvist", "email": "[email protected]", "city": "Utrecht"},
        "items": [
            {"sku": "BK-021", "title": "The Quiet Atlas", "qty": 2, "unit_price": 12.5},
            {"sku": "BK-114", "title": "Salt and Static", "qty": 1, "unit_price": 22.0},
        ],
    },
    {
        "order_id": "ORD-1043",
        "ordered_at": "2026-06-18T15:05:00",
        "status": "shipped",
        "customer": {"name": "Amir Kassab", "email": "[email protected]", "city": "Leuven"},
        "items": [{"sku": "BK-057", "title": "Nine Small Rivers", "qty": 1, "unit_price": 18.0}],
    },
    # ... 4 more orders, same shape
]

work_dir = Path("shopfiles")
work_dir.mkdir(exist_ok=True)
orders_path = work_dir / "orders.json"
orders_path.write_text(json.dumps(orders, indent=2), encoding="utf-8")
print("wrote", orders_path.name, "with", len(orders), "orders")
wrote orders.json with 6 orders

Data: an original, hand-written example (no external dataset — see the note near the JSON Lines section below for why). Each of the 6 orders has the same shape: three flat fields, one nested object, one nested list of 1-3 items — a small but genuinely nested tree, which is exactly what makes json_normalize worth using instead of hand-rolled loops.

Loading JSON the Straightforward Way: pd.read_json

The obvious first move is pd.read_json, and it works, in the sense that it doesn’t error:

naive = pd.read_json(orders_path)
print(naive.dtypes)
print()
print("naive.loc[0, 'customer'] ->", naive.loc[0, "customer"])
print("naive.loc[0, 'items']    ->", naive.loc[0, "items"])
order_id                 str
ordered_at    datetime64[us]
status                   str
customer              object
items                 object
dtype: object

naive.loc[0, 'customer'] -> {'name': 'Freya Lindqvist', 'email': '[email protected]', 'city': 'Utrecht'}
naive.loc[0, 'items']    -> [{'sku': 'BK-021', 'title': 'The Quiet Atlas', 'qty': 2, 'unit_price': 12.5}, {'sku': 'BK-114', 'title': 'Salt and Static', 'qty': 1, 'unit_price': 22.0}]

pd.read_json gives you one row per order, but it doesn’t flatten anything. customer and items both come back with dtype: object, which in this case means “a plain Python dict” and “a plain Python list of dicts,” sitting inside a DataFrame cell exactly as they were in the JSON. You could write a loop to pull customer["name"] out of every row yourself — but that’s what json_normalize is for.

Flattening the Top Level with json_normalize

pd.json_normalize takes the parsed JSON (a list of dicts, not a file path) and flattens nested objects into dot-notation columns automatically:

with open(orders_path, encoding="utf-8") as f:
    raw_orders = json.load(f)

orders_flat = pd.json_normalize(raw_orders)
print(orders_flat.columns.tolist())
print()
print(orders_flat[["order_id", "customer.name", "customer.city", "status"]])
['order_id', 'ordered_at', 'status', 'items', 'customer.name', 'customer.email', 'customer.city']

   order_id    customer.name customer.city      status
0  ORD-1042  Freya Lindqvist       Utrecht     shipped
1  ORD-1043      Amir Kassab        Leuven     shipped
2  ORD-1044        Ines Faro       Utrecht  processing
3  ORD-1045    Marek Wysocki         Ghent   cancelled
4  ORD-1046  Freya Lindqvist       Utrecht     shipped
5  ORD-1047   Sanne de Groot        Leuven     shipped

Notice customer disappeared and became three separate columns — customer.name, customer.email, customer.city — exactly rule 2 from the mental model. But items is still there, untouched, still a list of dicts in every cell. json_normalize flattens nested objects by default; it leaves nested lists alone until you tell it what to do with them, which is rule 3.

Flattening a List-of-Dicts Field with record_path and meta

To turn items into real rows, pass record_path="items" — the nested list to explode — and meta=[...] — which of the surrounding order’s fields to repeat onto every resulting row, including reaching into the nested customer object with a ["customer", "name"]-style path:

line_items = pd.json_normalize(
    raw_orders,
    record_path="items",
    meta=["order_id", "ordered_at", "status", ["customer", "name"], ["customer", "city"]],
)
print(line_items.shape)
print()
display_cols = ["order_id", "sku", "title", "qty", "unit_price", "customer.name", "customer.city"]
print(line_items[display_cols].head(6).to_string())
(10, 9)

   order_id     sku              title  qty  unit_price    customer.name customer.city
0  ORD-1042  BK-021    The Quiet Atlas    2       12.50  Freya Lindqvist       Utrecht
1  ORD-1042  BK-114    Salt and Static    1       22.00  Freya Lindqvist       Utrecht
2  ORD-1043  BK-057  Nine Small Rivers    1       18.00      Amir Kassab        Leuven
3  ORD-1044  BK-021    The Quiet Atlas    1       12.50        Ines Faro       Utrecht
4  ORD-1044  BK-088     Loud Furniture    1       15.75        Ines Faro       Utrecht
5  ORD-1044  BK-114    Salt and Static    1       22.00        Ines Faro       Utrecht

Six orders became 10 rows — one per line item — and every row still knows which order and which customer it belongs to, because meta repeated those fields onto each exploded row. This is now an ordinary flat DataFrame, so ordinary pandas works on it directly:

line_items["line_total"] = (line_items["qty"] * line_items["unit_price"]).round(2)
revenue_by_city = line_items.groupby("customer.city")["line_total"].sum().sort_values(ascending=False)
print(revenue_by_city)
customer.city
Utrecht    144.5
Leuven      48.5
Ghent       36.0
Name: line_total, dtype: float64

That’s a per-city revenue summary computed on data that started life three levels deep inside a JSON tree, with no manual loop. The official json_normalize and read_json documentation covers additional options like sep (to change the . in column names) and errors (for records missing a meta field), if you need them. (The outputs in this post come from pandas 3.0 — everything shown also works on pandas 2.x.)

Data Too Big to Load as One Array: JSON Lines

Everything above assumed the whole file fits comfortably in memory as one parsed object — six orders, no problem. But the bookshop’s site also logs every page view, cart add, and purchase as an event, and that log grows continuously. Shipping it as one giant JSON array ([{...}, {...}, ..., {...}]) has two real problems: you can’t append a new event without re-parsing and re-writing the entire array to keep the brackets and commas valid, and you can’t read record 1 without technically parsing through to the closing ].

JSON Lines (.jsonl, sometimes .ndjson) solves both problems with one rule: one complete, independent JSON object per line, no enclosing brackets, no commas between records. A new event is just a line appended to the end of the file. A reader can process record 1 the moment it hits the first newline, without knowing or caring how many more lines follow.

For a real demonstration of scale, this generates a seeded synthetic event log — the teaching point here is the technique for handling a file too big to load naively, not a specific real dataset, so a generated file serves it better than searching for one:

import numpy as np

rng = np.random.default_rng(42)
n_events = 6000
event_types = np.array(["page_view", "add_to_cart", "purchase"])
event_probs = [0.72, 0.20, 0.08]
skus = np.array(["BK-021", "BK-057", "BK-088", "BK-114", "BK-142", "BK-190"])
unit_prices = {"BK-021": 12.5, "BK-057": 18.0, "BK-088": 15.75, "BK-114": 22.0, "BK-142": 9.9, "BK-190": 27.5}

events_path = work_dir / "site_events.jsonl"
with open(events_path, "w", encoding="utf-8") as f:
    for i in range(n_events):
        etype = str(rng.choice(event_types, p=event_probs))
        sku = str(rng.choice(skus))
        record = {"event_id": i + 1, "session_id": f"S-{rng.integers(1000, 1999)}", "event_type": etype, "sku": sku}
        if etype in ("add_to_cart", "purchase"):
            qty = int(rng.integers(1, 3))
            record["qty"] = qty
            record["value"] = round(qty * unit_prices[sku], 2)
        f.write(json.dumps(record) + "\n")

size_kb = events_path.stat().st_size / 1024
with open(events_path, encoding="utf-8") as f:
    line_count = sum(1 for _ in f)
print(f"wrote {events_path.name}: {line_count} lines, {size_kb:.1f} KB")
wrote site_events.jsonl: 6000 lines, 550.9 KB

{"event_id": 1, "session_id": "S-1438", "event_type": "add_to_cart", "sku": "BK-114", "qty": 1, "value": 22.0}
{"event_id": 2, "session_id": "S-1201", "event_type": "page_view", "sku": "BK-190"}
{"event_id": 3, "session_id": "S-1735", "event_type": "purchase", "sku": "BK-021", "qty": 2, "value": 25.0}

6,000 lines on disk, and notice line 2 has fewer keys than lines 1 and 3 — page_view events have no qty or value. That’s a normal, valid JSON Lines file: each line is independently valid JSON, and lines don’t need identical shapes, unlike a CSV’s fixed columns.

Reading JSON Lines with lines=True

Pass lines=True and pd.read_json reads the whole file, one JSON object per row:

events = pd.read_json(events_path, lines=True)
print(events.shape)
print()
print(events.dtypes)
(6000, 6)

event_id        int64
session_id        str
event_type        str
sku               str
qty           float64
value         float64
dtype: object

All 6,000 rows loaded fine here because the file is small enough — but lines=True alone still loads the entire file into one DataFrame before handing it back, no better than the plain JSON array from before for a file too big for memory. The fix is the next section. qty and value came back as float64, not int64 — pandas had to accommodate the missing values on page_view rows, and NaN forces a float column even where every present value is a whole number.

Reading in Pieces with chunksize

Add chunksize=, and pd.read_json(..., lines=True, chunksize=N) stops returning a DataFrame and instead returns an iterator — something you loop over, where each step hands you the next N rows as their own small DataFrame and then lets them be garbage-collected. The file itself is read incrementally as you iterate, not all at once up front. Aggregating a result across the whole file means accumulating into a running total as you go, not stitching the pieces back into one big DataFrame afterward:

chunk_size = 500
n_chunks = 0
total_rows_seen = 0
purchase_value_by_sku = {}
event_type_counts = {"page_view": 0, "add_to_cart": 0, "purchase": 0}

for chunk in pd.read_json(events_path, lines=True, chunksize=chunk_size):
    n_chunks += 1
    total_rows_seen += len(chunk)

    counts = chunk["event_type"].value_counts()
    for etype, c in counts.items():
        event_type_counts[etype] = event_type_counts.get(etype, 0) + int(c)

    purchases = chunk[chunk["event_type"] == "purchase"]
    if len(purchases):
        by_sku = purchases.groupby("sku")["value"].sum()
        for sku, v in by_sku.items():
            purchase_value_by_sku[sku] = purchase_value_by_sku.get(sku, 0) + float(v)

print(f"processed {n_chunks} chunks, {total_rows_seen} rows total, "
      f"largest single chunk in memory at once: {chunk_size} rows")
print()
print("event_type_counts:", event_type_counts)
print()
purchase_totals = pd.Series(purchase_value_by_sku).sort_values(ascending=False).round(2)
print(purchase_totals)
processed 12 chunks, 6000 rows total, largest single chunk in memory at once: 500 rows

event_type_counts: {'page_view': 4344, 'add_to_cart': 1173, 'purchase': 483}

BK-114    3102.00
BK-057    2286.00
BK-088    2031.75
BK-190    1897.50
BK-021    1650.00
BK-142    1287.00
dtype: float64

Twelve chunks of 500 rows each, and at no point did the program hold more than 500 rows of site_events.jsonl at once — only two small running totals (event_type_counts and purchase_value_by_sku), which stay tiny no matter how long the file gets. The final numbers match a direct, whole-file computation exactly, which is worth checking once so you trust the pattern:

direct_check = events[events["event_type"] == "purchase"].groupby("sku")["value"].sum().sort_values(ascending=False).round(2)
print("matches direct computation:", purchase_totals.equals(direct_check))
matches direct computation: True

Same answer, two very different memory profiles. On this 6,000-line file the difference doesn’t matter. On a file with 60 million lines, it’s the difference between a script that finishes and one that OOMs.

Gotchas Worth Knowing

json_normalize on deeply nested data can quietly produce a very wide DataFrame. By default it flattens every level it finds — a real API response nested five or six levels deep can turn into a DataFrame with hundreds of columns, most of them barely used. The max_level argument caps how deep it goes:

level0 = pd.json_normalize(raw_orders, max_level=0)
print(level0.columns.tolist())
print(level0.loc[0, "customer"])
['order_id', 'ordered_at', 'status', 'customer', 'items']
{'name': 'Freya Lindqvist', 'email': '[email protected]', 'city': 'Utrecht'}

With max_level=0, customer stays a single object column instead of exploding into three — compare that to the unlimited-depth call earlier, which produced customer.name, customer.email, and customer.city as separate columns. On real nested data, start with a shallow max_level to see the shape before deciding how far to flatten.

lines=True requires a genuine one-object-per-line file — not a JSON array that happens to contain newlines. A pretty-printed array (json.dumps(data, indent=2)) has newlines all over it, but it’s still one big array, not one object per line, and lines=True will fail trying to parse each individual line as a complete JSON value:

try:
    pd.read_json(pretty_array_path, lines=True)
except ValueError as e:
    print(f"{type(e).__name__}: {e}")
ValueError: Expected object or value

If you see that error, check the file’s actual format before assuming lines=True is broken — nine times out of ten, the file is a regular indented JSON array, and the fix is lines=False (the default), not a different flag.

Concatenating every chunk back into one DataFrame defeats the entire point of chunking. It’s tempting to write pd.concat(pd.read_json(path, lines=True, chunksize=N), ignore_index=True) because it’s short — but that forces every chunk to exist in memory simultaneously to build the concatenated result, which is exactly the full-file load chunksize was meant to avoid:

reassembled = pd.concat(pd.read_json(events_path, lines=True, chunksize=chunk_size), ignore_index=True)
print("reassembled shape:", reassembled.shape)
reassembled shape: (6000, 6)

That line works, and on this toy file it’s harmless. But if the whole reason you reached for chunksize was a file too big to hold in memory as one DataFrame, pd.concat on all the chunks silently reconstructs the exact thing you were avoiding. If you truly need memory efficiency, the aggregation has to stay incremental — running sums, counts, or a groupby you merge chunk by chunk — the way the chunksize loop above does it.

Wrapping Up

Pandas’ JSON tools map directly onto the tree-to-table mental model:

  • pd.read_json → one row per top-level record, nested objects and lists left as raw Python values in the cells
  • pd.json_normalize → flattens nested objects into dot-notation columns automatically (max_level controls how deep)
  • record_path + meta → explodes a nested list of objects into its own rows, repeating the parent record’s fields onto each one
  • lines=True → reads a JSON Lines file, one object per line, no enclosing array
  • chunksize= → turns that read into an iterator, so you process and aggregate a file far larger than memory in fixed-size pieces

Once you’re comfortable turning nested JSON into a flat table, the rest is ordinary pandas — the DataFrames and Reading Data lessons in our free Python for Data Analytics course cover read_json alongside CSV and Excel loading as part of a fuller data-ingestion workflow, and our guide to pandas GroupBy picks up right where the line-item aggregation in this post leaves off.

More from the blog