Lesson 3 - Summarization Memory

Welcome to Summarization Memory

Truncation keeps the list small by throwing the oldest turns away. That works when only the recent turns matter — but it’s lossy. The moment a detail from early in the conversation becomes relevant again (“what was that budget you mentioned at the start?”), it’s simply gone. Summarization takes a different bargain: instead of discarding the old turns, it compresses them. When the history gets long, you replace the old turns with one short summary of them and keep the recent turns in full. The list stays small, but the gist of the whole conversation survives.

By the end of this lesson, you will be able to:

  • Explain how summarization compresses old turns into a single summary message
  • Write a compact_history function that folds everything except the last few turns into one summary
  • Describe where the summary text comes from — a cheap model call
  • Choose between truncation and summarization for a given situation

This builds on the truncation idea from Lesson 2 — same goal, different trade-off. Let’s begin.


The Idea: Compress the Old, Keep the Recent

Truncation and summarization are solving the same problem — a messages list that grows past what you want to carry — but they make opposite choices about the old turns.

  • Truncation drops the oldest turns entirely. Fast and free, but the detail is gone forever.
  • Summarization replaces the oldest turns with a short paragraph that captures what they contained. The detail is compressed, not erased.

The structure after summarization looks like this: the original task message stays at the front (it anchors the whole run), then a single summary message stands in for all the middle turns, then the most recent turns are kept verbatim because they’re what the agent is actively working with. So a sprawling history collapses into “here’s the task → here’s the gist of everything that happened → here are the last few turns in full.”

This is the same idea that long-running agent frameworks call compaction: when the context fills up, fold the old history into a compact summary and continue. You’re building the small, hand-rolled version of it here.


The Compaction Function

Here’s a compact_history function that does exactly that — keep the first message, summarize the middle, keep the last keep_last turns:

def compact_history(messages, summarize, keep_last=2):
    """Keep the first (task) message and the last `keep_last` messages, and fold
    everything in between into one summary message. `summarize` is your LLM call:
    it takes a list of messages and returns a string."""
    if keep_last < 1:
        # messages[-0:] would select the whole list, so reject 0 explicitly.
        raise ValueError("keep_last must be at least 1")
    if len(messages) <= keep_last + 1:
        return messages                                # nothing in the middle to compress
    old, recent = messages[1:-keep_last], messages[-keep_last:]
    summary_text = summarize(old)                      # an LLM call in real code
    summary_msg = {"role": "user", "content": f"[Summary of earlier conversation: {summary_text}]"}
    return [messages[0], summary_msg] + recent

Walk through it. The guard at the top says: if the list is short enough that there’s nothing in the middle to compress, leave it alone. Otherwise, slice the list into two pieces — old is everything from the second message up to (but not including) the last keep_last turns, and recent is those last keep_last turns. The old slice gets handed to summarize, which returns a short summary string. That string becomes one new user message, and the function returns the original task message, then the summary, then the recent turns — in that order.

Here is a verified run on a 9-message conversation, using a stand-in summarizer so the result is deterministic (the compaction logic is real; only the summary text is mocked):

original: 9 -> compacted: 4
  [user] Plan a trip to Japan.
  [user] [Summary of earlier conversation: user wants a Japan trip; 6 earlier turns covered dates and budget]
  [assistant] turn 7
  [user] turn 8

Read that output against the function. The original task message, “Plan a trip to Japan.”, stays at the front. The 6 middle turns are gone, replaced by the single [Summary of earlier conversation: ...] message. The last 2 turns (keep_last=2) are kept exactly as they were. Nine messages became four, and the gist of the dropped turns lives on in the summary instead of vanishing. (The summary wording above is fixed only because the summarizer was mocked. With a real model call the wording will vary from run to run; only the structure stays the same.)
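
The script behind that run is not shown, so here is a minimal sketch of one way to reproduce it. The fake_summarize helper and the placeholder "turn N" contents are assumptions made for illustration; only the slicing logic mirrors the function above.

```python
def compact_history(messages, summarize, keep_last=2):
    """Keep the task message and the last `keep_last` messages; summarize the middle."""
    if len(messages) <= keep_last + 1:
        return messages
    old, recent = messages[1:-keep_last], messages[-keep_last:]
    summary_msg = {"role": "user",
                   "content": f"[Summary of earlier conversation: {summarize(old)}]"}
    return [messages[0], summary_msg] + recent


def fake_summarize(old_messages):
    # Hypothetical stand-in for the model call: fixed wording plus a count,
    # so the output is deterministic.
    return f"user wants a Japan trip; {len(old_messages)} earlier turns covered dates and budget"


# Nine messages: the task, then alternating assistant/user placeholder turns 1..8.
messages = [{"role": "user", "content": "Plan a trip to Japan."}]
for i in range(1, 9):
    role = "assistant" if i % 2 == 1 else "user"
    messages.append({"role": role, "content": f"turn {i}"})

compacted = compact_history(messages, fake_summarize, keep_last=2)

print(f"original: {len(messages)} -> compacted: {len(compacted)}")
for m in compacted:
    print(f"  [{m['role']}] {m['content']}")
```

Swapping fake_summarize for a real model-backed function changes only the text inside the summary message; the four-message shape stays the same.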


Where the Summary Comes From

The compact_history function above takes summarize as an argument because the summary text doesn’t come from string manipulation — it comes from a model. You hand the old turns to a model and ask it to write a few sentences capturing what mattered. There’s no reason to spend your best, most expensive model on this; a small, cheap one is exactly right for “compress this into a few sentences.” Here’s the production shape of that call, using a fast model like claude-haiku-4-5:

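import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def summarize(old_messages):
    # Flatten the old turns into one text block, labelled by role so the
    # summarizer can tell who said what.
    text = "\n".join(f'{m["role"]}: {m["content"]}' for m in old_messages)
    resp = client.messages.create(
        model="claude-haiku-4-5", max_tokens=200,
        system="Summarize this conversation so far in 2-3 sentences, keeping key facts and decisions.",
        messages=[{"role": "user", "content": text}],
    )
    return resp.content[0].text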

It flattens the old messages into one block of text, sends it to a cheap model with a system prompt that asks for a tight 2-3 sentence summary, and returns the model’s reply. That returned string is what compact_history wraps into the summary message. In the verified run above this call was mocked so the output would be stable, but in a real agent this is the function you’d plug in — and because it’s just another messages.create call, it costs real tokens and real latency every time you compact.
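
Because every compaction costs a model call, it can help to compact only once the list passes some size rather than on every turn. The sketch below is one possible way to gate it. The maybe_compact wrapper, the max_messages threshold, and the counting stand-in are all hypothetical names chosen for illustration, and the threshold value is arbitrary.

```python
def maybe_compact(messages, compact, max_messages=12):
    """Leave `messages` alone until it exceeds `max_messages`, then compact it.
    `max_messages` is an illustrative threshold, not a recommended value."""
    if len(messages) <= max_messages:
        return messages           # cheap path: no model call
    return compact(messages)      # expensive path: triggers one summarize call


stats = {"compactions": 0}

def counting_compact(messages):
    # Hypothetical stand-in for compact_history with keep_last=2:
    # task message + one summary message + the last two messages.
    stats["compactions"] += 1
    summary = {"role": "user", "content": "[Summary of earlier conversation: ...]"}
    return [messages[0], summary] + messages[-2:]


history = [{"role": "user", "content": "task"}]
for i in range(1, 31):
    role = "assistant" if i % 2 == 1 else "user"
    history.append({"role": role, "content": f"turn {i}"})
    history = maybe_compact(history, counting_compact, max_messages=12)

print(len(history), stats["compactions"])
```

With these numbers the loop appends 30 messages but compacts only three times, so the summarizer would be called 3 times rather than 30. In a real agent a token count is probably a better trigger than a message count.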


Truncate vs Summarize: When to Use Each

Both keep the list bounded; they differ in what you pay and what you keep.

  • Truncation is free and instant — it’s just a slice. But it’s lossy: anything you drop is unrecoverable, and the agent will act as if those turns never happened.
  • Summarization preserves far more information, because the gist of the dropped turns survives. But it costs an extra model call (money and latency) every time you compact, and a summary is compression — it necessarily loses specifics. “We discussed dates and budget” doesn’t tell you the agent picked April 12th and $3,000.

So the choice is about whether early context still matters:

  • Reach for truncation when only the recent turns are relevant and you want it cheap — short chats, quick lookups, tasks where the old turns are genuinely done with.
  • Reach for summarization on long tasks where early context still shapes the work — multi-step research, a planning session where the original constraints must persist, anything where forgetting the beginning would break the result.

Many real agents combine the two: keep recent turns verbatim, summarize the older ones, and only truncate the summary itself if even that grows too large.
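
As a rough sketch of that combined approach, the variant below summarizes the middle, keeps the tail verbatim, and truncates the summary only if it exceeds a cap. The function name compact_with_cap, the max_summary_chars parameter, and the verbose_summarize stand-in are assumptions for illustration; a character cap is a crude proxy for a token budget.

```python
def compact_with_cap(messages, summarize, keep_last=2, max_summary_chars=400):
    """Summarize the middle, keep the tail verbatim, and hard-cap the summary length.
    `max_summary_chars` is an illustrative limit; a token count would be more precise."""
    if keep_last < 1 or len(messages) <= keep_last + 1:
        return messages
    old, recent = messages[1:-keep_last], messages[-keep_last:]
    summary_text = summarize(old)
    if len(summary_text) > max_summary_chars:
        # Last resort: truncate the summary itself and mark the cut.
        summary_text = summary_text[:max_summary_chars].rstrip() + " ..."
    summary_msg = {"role": "user",
                   "content": f"[Summary of earlier conversation: {summary_text}]"}
    return [messages[0], summary_msg] + recent


def verbose_summarize(old_messages):
    # Hypothetical stand-in that returns an overly long summary.
    return "x" * 1000


msgs = [{"role": "user", "content": "task"}]
msgs += [{"role": "assistant", "content": f"turn {i}"} for i in range(1, 9)]

capped = compact_with_cap(msgs, verbose_summarize, keep_last=2, max_summary_chars=50)
print(len(capped), len(capped[1]["content"]))
```

On a second compaction the previous summary message sits at index 1, so it is folded into the new summary along with the newer turns. That is how a summary can keep growing, and why a cap can be worth having.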

You pay a model call for information density

Summarization buys you information density — the gist of a long conversation in a fraction of the tokens — at the cost of an extra model call every time you compact. That’s a good trade on long tasks where early context still matters, and a wasteful one when only the recent turns do. When in doubt: if dropping the old turns outright would change the answer, summarize; if it wouldn’t, truncate.


Practice Exercises

Exercise 1: Which messages get summarized?

Given a 10-message conversation and compact_history(messages, summarize, keep_last=3), which messages end up summarized, which are kept verbatim, and what’s the final length of the returned list?

Hint

messages[0] (the task) is kept as-is, the last 3 messages (keep_last=3) are kept verbatim, and the 6 in between (messages[1:-3]) are folded into one summary message. The returned list is: task + 1 summary + 3 recent = 5 messages.
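
To check the index arithmetic yourself, a small sketch using integers as stand-ins for the ten messages (an assumption made purely so the slices are easy to read):

```python
keep_last = 3
messages = list(range(10))          # stand-ins: message i is just the integer i

task = messages[0]
old = messages[1:-keep_last]        # folded into one summary
recent = messages[-keep_last:]      # kept verbatim

print("task:", task)
print("summarized:", old)
print("kept verbatim:", recent)
print("final length:", 1 + 1 + len(recent))
```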

Exercise 2: Why a cheap model for the summary?

A teammate wants to use your most powerful, most expensive model for the summarize call “to get the best summary.” Why is a small, fast model like claude-haiku-4-5 the better default here?

Hint

Summarizing a conversation into 2-3 sentences is an easy, well-scoped task that small models handle well — and you run it every time you compact, so it’s on the hot path for cost and latency. Spending a top-tier model on it pays a premium for capability the task doesn’t need. Reserve the expensive model for the agent’s actual reasoning, not for housekeeping.

Exercise 3: Truncate or summarize?

For each, pick truncation or summarization: (a) a customer-support chat answering one quick question; (b) a multi-hour research agent where the user’s original goal and constraints from turn one must still guide step fifty; (c) a long debugging session where early error messages might still be relevant.

Hint

(a) Truncate: only the recent turns matter for a one-off question, and it’s free. (b) Summarize: the early goal and constraints still shape the work, so dropping them would derail the agent. (c) Summarize: early errors carry detail you may need again, and a summary preserves the gist rather than erasing it, though exact error text can still be lost in compression. The rule: if forgetting the beginning would change the outcome, summarize.


Summary

Truncation discards old turns; summarization compresses them instead. When the history gets long, you replace the old turns with one short summary message — keeping the original task message at the front and the most recent turns verbatim — so the list stays small while the gist of the whole conversation survives. The compact_history function does this by slicing the list into old and recent, handing old to a summarize function, and rebuilding the list as task + summary + recent. That summarize function is a real model call — a cheap one like claude-haiku-4-5 is the right default, since the task is easy and runs every time you compact. Compared with truncation, summarization preserves far more information but costs an extra model call and still loses specifics. Truncate when only the recent turns matter; summarize when early context still shapes the work. This is the same idea long-running agents call compaction.

Key Concepts

  • Summarization compresses, truncation discards — the old turns become a short summary instead of vanishing.
  • Compaction structure — keep the task message, fold the middle into one summary, keep the last keep_last turns verbatim.
  • The summary is a model call — use a cheap model (e.g. claude-haiku-4-5) to write 2-3 sentences; it costs tokens and latency every compaction.
  • The trade-off — summarization preserves more but costs a call and loses specifics; truncation is free but lossy.

Why This Matters

On a long-running agent, you can’t carry the whole transcript forever — but you also can’t always afford to forget the beginning. Summarization is how you keep a long task coherent without blowing the context budget: the agent remembers that you set a $3,000 budget and an April trip, even after the turns that established them have scrolled out of full detail. Knowing when to reach for it instead of plain truncation — and knowing it costs a model call you should run on a cheap model — is the difference between an agent that loses the plot halfway through and one that stays on track from turn one to turn fifty. Next, you’ll give the agent memory that survives past the session entirely.


Next Steps

Continue to Lesson 4 - Long-Term Memory with a Vector Store

Persist durable facts beyond a single run and pull back only the relevant ones when you need them.


Continue Building Your Skills

You can now keep a long conversation inside the context window without throwing its history away — folding the old turns into a compact summary while keeping the recent ones in full, and knowing when that trade beats plain truncation. Next you’ll tackle the other memory problem from Lesson 1: making facts survive past the session itself, with a long-term store you can write to and search.