Lesson 2 - Managing Context: Truncation and Budgets

Welcome to Managing Context: Truncation and Budgets

In the last lesson you saw the first problem with relying on the messages list as memory: it grows without bound. Every turn appends more, you resend the whole thing, and eventually the conversation no longer fits the context window — or the cost and latency climb past what you want to pay. This lesson tackles that head-on with the simplest tool available: truncation. You decide how much history to keep, drop the rest, and watch a token budget so you know when to trim. You’ll also learn the one accurate way to measure how big your prompt really is.

By the end of this lesson, you will be able to:

  • Name the three strategies for keeping context inside the window, and when each fits
  • Write a truncate_history function that keeps the first message and the most recent turns
  • Reason about what truncation throws away — and when that’s fine
  • Measure prompt size accurately with Claude’s count_tokens and manage a token budget
  • Avoid the classic mistake of splitting a tool_use from its matching tool_result

Let’s start with the lay of the land.


Three Ways to Stay Inside the Window

There are three standard strategies for keeping a growing history from overflowing, and they trade off the same way: how much detail you keep against how much work it takes.

Three ways to keep the context from overflowing. On the left, a growing history stack overflows past a dashed 'context limit' line. Three strategies: 1. Truncate — drop the oldest turns, keep the first message and the most recent ones; simplest and cheap, but you lose the dropped detail. 2. Summarize — compress old turns into one short summary note, keep recent turns in full; keeps the gist small but costs an extra model call. 3. Retrieve — keep facts in long-term memory and pull only the relevant ones into context; scales past any single session but needs a vector store. A caption notes real agents combine these: truncate or summarize the running history, and retrieve durable facts on demand.
Three strategies for staying within the context window. This lesson covers truncation; summarization is next, and retrieval comes with long-term memory.
  • Truncate — drop the oldest turns and keep the most recent ones (plus, usually, the very first message). It’s the cheapest option: pure list slicing, no extra model call. The cost is that dropped detail is gone for good.
  • Summarize — compress the old turns into one short summary note and keep the recent turns in full. You keep the gist of what was dropped, at the price of an extra model call to write the summary. That’s the next lesson.
  • Retrieve — store durable facts in a long-term memory and pull only the relevant ones into context when needed. This scales past any single session, but needs a vector store to do the pulling. That’s the long-term-memory lessons later in this module.

Real agents combine these: they truncate or summarize the running transcript and retrieve durable facts on demand. But the foundation — and the simplest thing that works — is truncation, so that’s where we begin.


Truncation: Keep the First Message and the Recent Turns

The idea is straightforward. When the history gets too long, throw away the middle and keep the two parts that matter most: the first message (the original task — the thing the agent is actually trying to do) and the most recent turns (where the conversation is right now). Here’s the verified function:

def truncate_history(messages, keep_last=4):
    """Keep the first message (the original task) + the last `keep_last`."""
    if len(messages) <= keep_last + 1:
        return messages
    return [messages[0]] + messages[-keep_last:]

Two things to notice. The guard — if len(messages) <= keep_last + 1 — means we don’t bother truncating until there’s actually something to drop; a short conversation passes through untouched. And the return is just two slices glued together: messages[0] (the first message) plus messages[-keep_last:] (the last keep_last of them). No model call, no API, no cleverness — just list arithmetic.

Run it on a conversation of 9 messages (the first message plus 8 turns):

original messages: 9 -> kept: 5
kept contents: ['Plan a trip to Japan.', 'turn 5', 'turn 6', 'turn 7', 'turn 8']

Look at what survived: 'Plan a trip to Japan.' (the original task) and turn 5 through turn 8 (the four most recent). Turns 1 through 4 — the middle of the conversation — are gone.

Why keep the first message? Because it’s almost always the task statement. If you drop it, the model can lose track of what it’s even doing halfway through a long run. Keeping it anchors every future turn to the original goal.

What you lose, and when that’s fine. Truncation is lossy: whatever was in those dropped middle turns is gone, and the agent can no longer refer back to it. That sounds bad, but for a great many agents it’s perfectly fine — because the recent context is what matters. If the agent is iterating toward a goal, the last few turns usually carry everything it needs to take the next step, and the early back-and-forth is no longer relevant. Truncation is the right call when old detail genuinely stops mattering. It’s the wrong call when the agent needs to remember something specific from far back — and that’s exactly the case summarization (next lesson) and retrieval (later) are for.


Token Budgets: Measure, Then Decide What to Trim

truncate_history takes a keep_last count, but how do you choose it? The honest answer is that you’re really managing a token budget: you decide how many tokens of history you’re willing to send each call, and you trim until you’re under it. To do that well, you need to know how big your prompt actually is.

A rough rule of thumb (a few characters per token) is fine for building intuition, but it’s an estimate, and it’s wrong often enough that you shouldn’t budget against it. The accurate way is to ask Claude directly, using the token-counting endpoint:

count = client.messages.count_tokens(model="claude-haiku-4-5", messages=messages)
print(count.input_tokens)   # the exact prompt size, so you can decide what to trim

count_tokens runs your messages through Claude’s actual tokenizer — the same one the model uses — and returns .input_tokens, the exact number of input tokens that prompt would cost. That’s the number to budget against. The pattern in practice: measure with count_tokens, and if you’re over budget, call truncate_history (with a smaller keep_last, or in a loop) until .input_tokens drops below your limit.

One caution: do not reach for tiktoken to count Claude tokens. tiktoken is OpenAI’s tokenizer; it splits text differently and will give you the wrong number for Claude — typically undercounting, and worse on code or non-English text. count_tokens is the only accurate measurement for a Claude prompt.

The practical caveat: don’t split tool pairs. When the conversation involves tool use, the history contains paired blocks — a tool_use (the model asking to call a tool) and its matching tool_result (what the tool returned). The Claude API requires these to stay together: every tool_use must be followed by its tool_result. If your truncation slices the list in a spot that keeps one half and drops the other, the next API call fails. The fix is exactly what truncate_history already does: truncate by whole turns, not by individual messages. When you keep “the last four turns” rather than “the last four list entries,” each kept turn carries its tool pairs intact, and you never orphan half a pair.

Truncate by whole turns, and never orphan a tool pair

Two rules keep truncation safe. First, measure with count_tokens, not tiktokencount_tokens uses Claude’s own tokenizer and returns the exact .input_tokens; tiktoken is OpenAI’s tokenizer and will report the wrong number for a Claude prompt. Second, never separate a tool_use block from its matching tool_result — the API needs the pair, and dropping one half causes an error. Slicing by whole turns (as truncate_history does) keeps every pair together automatically.


Practice Exercises

Exercise 1: Why keep the first message?

truncate_history always keeps messages[0] even when it drops everything else from the middle. Why is that first message worth a permanent spot in the budget?

Hint

The first message is almost always the original task — the thing the agent is trying to accomplish (e.g. 'Plan a trip to Japan.'). If you drop it, the model can lose track of its goal partway through a long run. Keeping it anchors every later turn to the original request, even after the middle of the conversation has been trimmed away.

Exercise 2: What does truncation cost you?

A teammate wants to use truncate_history on an agent that, deep into a session, sometimes needs to recall a specific detail the user mentioned near the very beginning. Is plain truncation a good fit? What does it throw away?

Hint

Truncation is lossy: the dropped middle turns are gone, and the agent can no longer refer to them. That’s fine when only recent context matters (most goal-directed loops), but it’s a poor fit here — the teammate needs a specific old detail, which truncation may have already discarded. This is the case for summarization (keep the gist of old turns) or retrieval (store and look up durable facts), covered in the next lessons.

Exercise 3: Estimate or measure?

You’re deciding how many turns to keep so the prompt stays under a token budget. One approach: multiply your character count by a rough tokens-per-character guess. Another: call client.messages.count_tokens(...). Which should you budget against, and why not tiktoken?

Hint

Budget against count_tokens. The rough guess is fine for intuition but is only an estimate; count_tokens runs your messages through Claude’s actual tokenizer and returns the exact .input_tokens — the real prompt size. Don’t use tiktoken: it’s OpenAI’s tokenizer and splits text differently, so it reports the wrong count for a Claude prompt (often undercounting, and worse on code or non-English text).


Summary

The first problem with the messages list — unbounded growth — has three standard fixes: truncate, summarize, and retrieve, in increasing order of how much detail they preserve and effort they take. This lesson covered the simplest: truncation. The verified truncate_history keeps the first message (the original task) and the most recent keep_last turns, dropping the middle with nothing more than list slicing. It’s cheap and effective when recent context is what matters; its cost is that dropped detail is gone for good. To decide how much to keep, you manage a token budget — and the accurate way to measure prompt size is Claude’s count_tokens endpoint, which returns .input_tokens from Claude’s own tokenizer (never tiktoken, which is a different tokenizer and will be wrong). Finally, when a conversation uses tools, truncate by whole turns so you never split a tool_use from its matching tool_result — a split pair causes an API error.

Key Concepts

  • Three strategies — truncate (cheapest, lossy), summarize (keeps the gist, costs a call), retrieve (scales, needs a store).
  • truncate_history — keep messages[0] + the last keep_last turns; drop the middle.
  • What truncation loses — dropped detail is unrecoverable; fine when only recent context matters.
  • count_tokens — the accurate way to measure a Claude prompt; returns .input_tokens. Not tiktoken.
  • Keep tool pairs together — truncate by whole turns so a tool_use is never orphaned from its tool_result.

Why This Matters

“My agent fell over deep into a long session” almost always traces back to an unmanaged messages list. Truncation is the first, simplest line of defense, and it covers a surprising number of real agents — the ones where the recent turns carry everything that matters. Pairing it with an accurate token count means you trim deliberately against a budget instead of guessing, and respecting tool pairs means your trimming never breaks the next API call. When truncation throws away too much, you’ll reach for the next tool — which is exactly where the following lesson goes.


Next Steps

Continue to Lesson 3 - Summarization Memory

When dropping old turns loses too much, compress them into a short summary instead — keeping the gist while staying inside the window.

Back to Module Overview

Return to the Memory and State module overview


Continue Building Your Skills

You can now keep a growing messages list inside the context window the simplest way there is: keep the first message and the most recent turns, measure the prompt accurately with count_tokens, and trim by whole turns against a budget. Next you’ll handle the case truncation can’t — when the old turns hold something worth keeping — by summarizing them instead of discarding them.