Lesson 7 - Guided Project: A Command-Line Assistant

Welcome to the Guided Project

You’ll build a small but complete command-line chat assistant that ties together everything in this module — a system persona, multi-turn memory, streaming output, and live token/cost tracking. By the end you’ll have a single file, assistant.py, that you can run in a terminal and talk to like a real chat app, watching each reply type out word by word and seeing exactly what every turn costs.

Nothing here is new theory. Each step reuses one idea from Lessons 2 through 6 and snaps it into place: the client and system prompt from Lesson 3, the messages history from Lesson 4, the usage numbers from Lesson 5, and streaming from Lesson 6. The project is where those pieces stop being separate exercises and become one program.

By the end of this lesson, you will be able to:

  • Give an assistant a consistent personality with a system prompt
  • Keep a conversation’s memory across turns with a messages history list
  • Stream replies so they appear live, and capture the full text as they arrive
  • Track tokens and running cost per turn, and wrap it all in a clean REPL loop

You only need the Anthropic SDK, your API key in an environment variable, and the five earlier lessons. Let’s build it.


Step 1: Project Setup and a Persona

Start a fresh file called assistant.py. The first job is to create the client and decide who the assistant is.

As in every lesson, the client reads your key from the ANTHROPIC_API_KEY environment variable — you never put the key in the code. If you haven’t set it for this terminal session yet:

export ANTHROPIC_API_KEY="your-key-here"
pip install anthropic
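
A missing or blank key is easier to diagnose if you check for it before the first request. Here is a minimal sketch, assuming a hypothetical helper named require_api_key (it is not part of the SDK) that accepts any mapping, so it can be tried with a plain dict:

```python
import os


def require_api_key(env=os.environ, name="ANTHROPIC_API_KEY"):
    """Return the key from env, or raise a readable error if it is unset or blank.

    Hypothetical helper for illustration; the SDK does not provide it.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set. Export it before running assistant.py.")
    return key


# Demo with plain dicts so the example runs without a real key.
print(require_api_key({"ANTHROPIC_API_KEY": "your-key-here"}))
try:
    require_api_key({})
except RuntimeError as err:
    print("Error:", err)
```

In assistant.py you could call require_api_key() once, before creating the client, so a forgotten export fails with a clear message.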

Now the top of assistant.py. We create the client, pin the inexpensive claude-haiku-4-5 model, and write a system prompt — the persona string that shapes every reply:

import anthropic

# The client reads your key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()

MODEL = "claude-haiku-4-5"

SYSTEM = (
    "You are Aria, a concise command-line assistant. "
    "Answer in plain text suitable for a terminal: no markdown headers, "
    "no tables, and keep replies to a few short sentences unless asked for more. "
    "If you are unsure of a fact, say so plainly."
)

That SYSTEM string is doing real work. Because the assistant lives in a terminal, we tell it to skip markdown and keep answers short. Without that instruction, the model tends to produce headers, bold text, and long bulleted lists that look messy in a plain console. A quick check that the persona takes hold:

resp = client.messages.create(
    model=MODEL,
    max_tokens=200,
    system=SYSTEM,
    messages=[{"role": "user", "content": "In one sentence, who are you?"}],
)
print(resp.content[0].text)

I'm Aria, a command-line assistant designed to give you quick, practical answers in plain text.

The model has adopted the name and the tone we asked for. That one SYSTEM string is the difference between a generic chatbot and your assistant. (We’ll drop this throwaway test once the real loop is in place.)

Why the persona lives in system, not messages

The system prompt sets durable behavior that applies to the whole conversation — name, tone, formatting rules. It is not a turn in the chat, so it never appears in the messages list. Put instructions about how to behave in system, and put the actual conversation in messages. You met this split in Lesson 3; the project is where it pays off.
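
The split can be made concrete by assembling the request arguments in one place. This is only a sketch: build_request is a hypothetical helper, not an SDK function, and it only prepares the keyword arguments without calling the API.

```python
def build_request(system, history, user_text, model="claude-haiku-4-5", max_tokens=200):
    """Assemble keyword arguments for client.messages.create (hypothetical helper)."""
    messages = history + [{"role": "user", "content": user_text}]
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": system,      # durable behavior: name, tone, formatting rules
        "messages": messages,  # the conversation itself: user and assistant turns only
    }


request = build_request("You are Aria, a concise command-line assistant.", [], "Who are you?")
print(sorted(request))
print([m["role"] for m in request["messages"]])
```

With a client in hand, client.messages.create(**request) would send it. The persona travels in system on every call, while messages holds only the turns.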


Step 2: A Conversation That Remembers

A single messages.create call has no memory — the model only knows what you send it. To hold a real conversation, you keep a growing list of turns and send the whole list every time. That list is the assistant’s memory.

The rule from Lesson 4: every turn is a dictionary with a role ("user" or "assistant") and content. After each reply, you append both the user’s turn and the model’s answer, so the next request carries the full history.

messages = []

messages.append({"role": "user", "content": "My favorite language is Python."})
reply = client.messages.create(
    model=MODEL, max_tokens=200, system=SYSTEM, messages=messages,
)
answer = reply.content[0].text
messages.append({"role": "assistant", "content": answer})
print("Aria:", answer)

messages.append({"role": "user", "content": "What did I just say my favorite language was?"})
reply = client.messages.create(
    model=MODEL, max_tokens=200, system=SYSTEM, messages=messages,
)
answer = reply.content[0].text
messages.append({"role": "assistant", "content": answer})
print("Aria:", answer)

print("Turns stored:", len(messages))

Aria: Nice choice. Python's clean syntax and broad ecosystem make it great for everything from scripting to data science to web development. What are you working on with it?
Aria: Python.
Turns stored: 4

The second answer — “Python.” — is the whole point. The model had no built-in memory of the first message; it answered correctly only because we sent the earlier turns along with the new question. Print the list length and you’ll see it holds 4 turns: two from you, two from Aria.

Append the assistant’s turn too

The most common memory bug is appending only the user’s messages and forgetting the assistant’s replies. If you skip the {"role": "assistant", ...} append, the model never sees its own past answers, and the conversation forgets half of itself. Append both, every turn.
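
One way to catch this bug early is a small check on the history list. The sketch below assumes a hypothetical helper, roles_alternate, that only inspects the role of each turn; it never calls the API.

```python
def roles_alternate(messages):
    """True if the history starts with a user turn and roles strictly alternate."""
    expected = "user"
    for message in messages:
        if message["role"] != expected:
            return False
        expected = "assistant" if expected == "user" else "user"
    return True


# A healthy history: every user turn is followed by the assistant's reply.
healthy = [
    {"role": "user", "content": "My favorite language is Python."},
    {"role": "assistant", "content": "Nice choice."},
    {"role": "user", "content": "What did I just say my favorite language was?"},
]
print(roles_alternate(healthy))

# The bug described above: the assistant's reply was never appended.
buggy = [
    {"role": "user", "content": "My favorite language is Python."},
    {"role": "user", "content": "What did I just say my favorite language was?"},
]
print(roles_alternate(buggy))
```

During development, calling a check like this just before each request flags a forgotten append immediately.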


Step 3: Stream the Reply So It Types Out Live

Right now each reply appears all at once after a pause. Lesson 6 showed the fix: streaming, where you receive the answer in small chunks as the model generates it and print each chunk immediately.

We switch from client.messages.create(...) to client.messages.stream(...), used as a context manager. Iterating stream.text_stream gives you the text deltas; we print each one with flush=True so it shows up instantly, and we also build up reply_text so we still have the complete answer to store in our history afterward.

messages = [{"role": "user", "content": "What is a variable, in one sentence?"}]

print("Aria> ", end="", flush=True)
reply_text = ""
with client.messages.stream(
    model=MODEL, max_tokens=200, system=SYSTEM, messages=messages,
) as stream:
    for chunk in stream.text_stream:
        print(chunk, end="", flush=True)
        reply_text += chunk
    final = stream.get_final_message()
print()  # newline after the streamed reply

print("Captured length:", len(reply_text))
print("Output tokens:", final.usage.output_tokens)

Aria> A variable is a named container that stores a value you can use and change throughout your program.
Captured length: 99
Output tokens: 22

Two things happen at once here. On screen, the sentence appears word by word as it’s generated. In memory, every chunk is also concatenated into reply_text, so when the stream finishes we have the full answer — 99 characters — ready to append to messages. And stream.get_final_message() hands back the complete message object, including the usage numbers we’ll need in the next step. Streaming costs you nothing extra; it only changes when the text arrives.
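
The print-and-capture pattern can be tried without an API call. In this sketch, fake_text_stream is a hypothetical stand-in for stream.text_stream, and its chunk boundaries are invented for illustration; real chunks may split the text differently.

```python
def fake_text_stream():
    """Stand-in for stream.text_stream: yields text chunks one at a time."""
    yield from ["A variable ", "is a named ", "container ", "that stores a value."]


def print_and_capture(chunks):
    """Print each chunk as it arrives and return the complete text."""
    parts = []
    for chunk in chunks:
        print(chunk, end="", flush=True)  # show the chunk immediately
        parts.append(chunk)               # keep it for the history
    print()  # newline after the streamed reply
    return "".join(parts)


reply_text = print_and_capture(fake_text_stream())
print("Captured length:", len(reply_text))
```

Collecting chunks in a list and joining once at the end is a stylistic alternative to reply_text += chunk; both produce the same string.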


Step 4: Track Tokens and Running Cost Per Turn

You learned in Lesson 5 that every response carries a usage object with input_tokens and output_tokens, and that you’re billed per token. For claude-haiku-4-5 the price is $1 per million input tokens and $5 per million output tokens. Turning that into a per-turn cost is just arithmetic.

Define the price of a single token once, then read usage off the final message and multiply:

# Haiku 4.5 pricing, in dollars per single token.
INPUT_COST_PER_TOKEN = 1.00 / 1_000_000   # $1 per million input tokens
OUTPUT_COST_PER_TOKEN = 5.00 / 1_000_000  # $5 per million output tokens

messages = [{"role": "user", "content": "Give me two tips for naming variables."}]

with client.messages.stream(
    model=MODEL, max_tokens=300, system=SYSTEM, messages=messages,
) as stream:
    for chunk in stream.text_stream:
        pass
    final = stream.get_final_message()

usage = final.usage
turn_cost = (
    usage.input_tokens * INPUT_COST_PER_TOKEN
    + usage.output_tokens * OUTPUT_COST_PER_TOKEN
)
print(f"This turn: {usage.input_tokens} in + {usage.output_tokens} out "
      f"= ${turn_cost:.6f}")

This turn: 28 in + 101 out = $0.000533

Half a thousandth of a dollar for a full question and answer. Notice the shape of the cost: in this turn output tokens dominate, because they cost five times as much per token and there are more of them (101 against 28). That balance shifts as a conversation grows and the resent history inflates the input count. This is the number we’ll surface after every turn so you can watch the session total climb in real time and see immediately why a chatty assistant costs more than a terse one.
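
To see how that total splits between input and output, the arithmetic for this turn can be reproduced without an API call. This sketch wraps the calculation in a hypothetical function, turn_cost, and plugs in the 28 and 101 token counts from the output above.

```python
# Haiku 4.5 pricing from this lesson, in dollars per single token.
INPUT_COST_PER_TOKEN = 1.00 / 1_000_000
OUTPUT_COST_PER_TOKEN = 5.00 / 1_000_000


def turn_cost(input_tokens, output_tokens):
    """Dollar cost of one turn at the prices above."""
    return input_tokens * INPUT_COST_PER_TOKEN + output_tokens * OUTPUT_COST_PER_TOKEN


cost = turn_cost(28, 101)
input_part = 28 * INPUT_COST_PER_TOKEN
output_part = 101 * OUTPUT_COST_PER_TOKEN

print(f"input  ${input_part:.6f}")
print(f"output ${output_part:.6f}")
print(f"total  ${cost:.6f}")
print(f"output share of this turn: {output_part / cost:.0%}")
```

The output part is 101 × $0.000005 = $0.000505 of the $0.000533 total, which is why trimming reply length moves the bill more than trimming a short question.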

Why input tokens grow as you chat

Because you resend the whole messages history each turn (Step 2), the input token count rises as the conversation gets longer — turn five pays to re-read turns one through four. That’s the cost of memory, and it’s exactly why later modules care about trimming and summarizing history instead of sending everything forever.
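
A rough model makes the growth visible. The sketch below assumes invented token counts (a 40-token system prompt, 20-token questions, 50-token replies) and ignores any per-message overhead the API may add, so treat the numbers as an illustration rather than a prediction.

```python
def input_tokens_per_turn(system_tokens, turn_sizes):
    """Estimate the input tokens billed on each turn when the full history is resent.

    turn_sizes is a list of (user_tokens, reply_tokens) pairs, one per turn.
    """
    billed = []
    history = 0  # tokens from all earlier turns, resent every time
    for user_tokens, reply_tokens in turn_sizes:
        billed.append(system_tokens + history + user_tokens)
        history += user_tokens + reply_tokens
    return billed


# Hypothetical conversation: five turns of identical size.
per_turn = input_tokens_per_turn(40, [(20, 50)] * 5)
print(per_turn)
print("Total input tokens billed:", sum(per_turn))
```

Each turn adds the same 70 tokens to the history, yet the billed input rises every turn, so the session total grows faster than the conversation itself.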


Step 5: Wrap It in a REPL Loop

The final piece is the loop. A REPL — read, evaluate, print, loop — reads a line from you, sends it to the model, prints the streamed reply, and repeats. We use while True and give the user two commands: /exit to quit cleanly with a session total, and /reset to wipe the conversation and start fresh.

Each iteration does the full cycle: read input, handle commands, append the user turn, stream the reply while capturing it, append the assistant turn, then compute and print the cost. The running total lives in total_cost, accumulated across every turn.

def main():
    messages = []        # the running conversation history
    total_cost = 0.0     # accumulated dollars across the whole session

    print("Aria is ready. Type your message, or /exit to quit, /reset to start over.\n")

    while True:
        user_input = input("you> ").strip()

        if not user_input:
            continue
        if user_input == "/exit":
            print(f"\nGoodbye. This session cost ${total_cost:.4f}.")
            break
        if user_input == "/reset":
            messages = []
            print("(conversation cleared)\n")
            continue

        messages.append({"role": "user", "content": user_input})
        # ... stream the reply, append it, print the cost ...

Handling /exit and /reset before touching messages matters: a command is not a turn in the conversation, so we deal with it and continue (or break) before any API call. Skipping empty input keeps stray Enter-presses from sending blank requests you’d still pay for. Replace the placeholder comment with the streaming code from Step 3 and the cost code from Step 4, and the assistant is complete. Here is the whole program.
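
The command handling can also be pulled into small functions that are easy to test on their own. This is one possible sketch, not the lesson’s code: classify_input and read_line are hypothetical names, and treating Ctrl-D or Ctrl-C as /exit is an assumption about what a user would want.

```python
def classify_input(line):
    """Map a raw input line to 'skip', 'exit', 'reset', or 'message'."""
    text = line.strip()
    if not text:
        return "skip"
    if text == "/exit":
        return "exit"
    if text == "/reset":
        return "reset"
    return "message"


def read_line(prompt="you> ", reader=input):
    """Read one line; treat Ctrl-D (EOFError) or Ctrl-C (KeyboardInterrupt) as /exit."""
    try:
        return reader(prompt)
    except (EOFError, KeyboardInterrupt):
        return "/exit"


def closed_stdin(prompt):
    """Test double that behaves like input() when the user presses Ctrl-D."""
    raise EOFError


for sample in ["", "   ", "/exit", "/reset", "What is a list?"]:
    print(repr(sample), "->", classify_input(sample))
print(read_line(reader=closed_stdin))
```

Passing the reader as a parameter lets the loop be exercised without a keyboard; in the real program the default, input, is used.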


The Complete assistant.py

import anthropic

# The client reads your key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()

MODEL = "claude-haiku-4-5"

# Haiku 4.5 pricing, in dollars per single token.
INPUT_COST_PER_TOKEN = 1.00 / 1_000_000   # $1 per million input tokens
OUTPUT_COST_PER_TOKEN = 5.00 / 1_000_000  # $5 per million output tokens

SYSTEM = (
    "You are Aria, a concise command-line assistant. "
    "Answer in plain text suitable for a terminal: no markdown headers, "
    "no tables, and keep replies to a few short sentences unless asked for more. "
    "If you are unsure of a fact, say so plainly."
)


def main():
    messages = []        # the running conversation history
    total_cost = 0.0     # accumulated dollars across the whole session

    print("Aria is ready. Type your message, or /exit to quit, /reset to start over.\n")

    while True:
        user_input = input("you> ").strip()

        if not user_input:
            continue
        if user_input == "/exit":
            print(f"\nGoodbye. This session cost ${total_cost:.4f}.")
            break
        if user_input == "/reset":
            messages = []
            print("(conversation cleared)\n")
            continue

        # Add the user's turn to the history.
        messages.append({"role": "user", "content": user_input})

        # Stream the reply so it types out live, and capture the full text.
        print("Aria> ", end="", flush=True)
        reply_text = ""
        with client.messages.stream(
            model=MODEL,
            max_tokens=400,
            system=SYSTEM,
            messages=messages,
        ) as stream:
            for chunk in stream.text_stream:
                print(chunk, end="", flush=True)
                reply_text += chunk
            final = stream.get_final_message()
        print()  # newline after the streamed reply

        # Record the assistant's turn so the next request remembers it.
        messages.append({"role": "assistant", "content": reply_text})

        # Track tokens and running cost for this turn.
        usage = final.usage
        turn_cost = (
            usage.input_tokens * INPUT_COST_PER_TOKEN
            + usage.output_tokens * OUTPUT_COST_PER_TOKEN
        )
        total_cost += turn_cost
        print(
            f"  [{usage.input_tokens} in + {usage.output_tokens} out tokens"
            f" | turn ${turn_cost:.5f} | session ${total_cost:.5f}]\n"
        )


if __name__ == "__main__":
    main()

Run it with python assistant.py and have a conversation. Here’s a real session — your text after you>, Aria’s streamed reply after Aria>, and the token/cost readout after each turn:

Aria is ready. Type your message, or /exit to quit, /reset to start over.

you> My name is Sam and I am learning Python.
Aria> Hi Sam! That's great that you're learning Python. Feel free to ask me any
questions about it—whether it's about syntax, concepts, debugging, or working
through problems. I'm here to help.
  [73 in + 46 out tokens | turn $0.00030 | session $0.00030]

you> What is a list comprehension? Keep it short.
Aria> A list comprehension is a concise way to create a new list by applying an
expression to each item in an existing sequence, optionally filtering items with
a condition. For example, [x*2 for x in range(5)] creates [0, 2, 4, 6, 8].
  [133 in + 72 out tokens | turn $0.00049 | session $0.00080]

you> What did I say my name was?
Aria> Your name is Sam.
  [216 in + 8 out tokens | turn $0.00026 | session $0.00105]

you> /exit

Goodbye. This session cost $0.0011.

Read the token counts down the page and the whole module is visible at once. The input count climbs every turn — 73, then 133, then 216 — because each request resends the growing history; that’s memory in action. The third reply, “Your name is Sam.”, proves the memory works: the model only knew that because turn one was still in the messages list. And the entire three-turn conversation cost about a tenth of a cent. You just built and ran a real LLM application.
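
The readout lines can be checked by hand or with a few lines of code. This sketch feeds the three token pairs from the session above into a hypothetical function, session_readout, using the same prices and formatting as assistant.py.

```python
# Haiku 4.5 pricing from this lesson, in dollars per single token.
INPUT_COST_PER_TOKEN = 1.00 / 1_000_000
OUTPUT_COST_PER_TOKEN = 5.00 / 1_000_000


def session_readout(turns):
    """Return the per-turn readout lines and the session total for (in, out) pairs."""
    total = 0.0
    lines = []
    for input_tokens, output_tokens in turns:
        cost = (
            input_tokens * INPUT_COST_PER_TOKEN
            + output_tokens * OUTPUT_COST_PER_TOKEN
        )
        total += cost
        lines.append(
            f"[{input_tokens} in + {output_tokens} out tokens"
            f" | turn ${cost:.5f} | session ${total:.5f}]"
        )
    return lines, total


lines, total = session_readout([(73, 46), (133, 72), (216, 8)])
for line in lines:
    print(line)
print(f"Session total: ${total:.4f}")
```

The third turn is the cheapest even though it has the largest input count, because its eight output tokens are billed at the higher rate on very few tokens.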


Take It Further

The assistant works; now make it yours. Each of these is a small change to the file you just wrote:

  • Swap the model. Change MODEL to a more capable model for harder questions, and update INPUT_COST_PER_TOKEN / OUTPUT_COST_PER_TOKEN to that model’s pricing. Watch how the per-turn cost changes — and decide for yourself when the extra capability is worth it.
  • Add a /tokens command. Alongside /exit and /reset, handle a /tokens command that prints the running total_cost and the number of turns stored in messages, without sending anything to the model.
  • Save the transcript. On /exit, write the messages list to a file (for example with json.dump) so you can reload a past conversation or review what was said. This is the first step toward an assistant that remembers across sessions, not just within one.
  • Trim the history. Once a conversation gets long, the input cost grows with it. Keep only the last N turns in messages before each request and see how the input token count stops climbing — a preview of the context-management techniques in later modules.
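
Two of these ideas, trimming and saving, can be sketched as small helpers. The names trim_history, save_transcript, and load_transcript are hypothetical, and the trimming rule (keep the last few messages, but always begin on a user turn) is an assumption about how the history should look, not the only reasonable policy.

```python
import json
import os
import tempfile


def trim_history(messages, max_messages):
    """Keep at most the last max_messages entries, starting on a user turn."""
    if max_messages <= 0:
        return []
    trimmed = messages[-max_messages:]
    # Drop a leading assistant turn so the history still begins with the user.
    while trimmed and trimmed[0]["role"] != "user":
        trimmed = trimmed[1:]
    return trimmed


def save_transcript(messages, path):
    """Write the messages list to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(messages, handle, indent=2)


def load_transcript(path):
    """Read a messages list back from a JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


history = [
    {"role": "user", "content": "My name is Sam."},
    {"role": "assistant", "content": "Hi Sam!"},
    {"role": "user", "content": "What is a list comprehension?"},
    {"role": "assistant", "content": "A concise way to build a list."},
    {"role": "user", "content": "What did I say my name was?"},
]

recent = trim_history(history, 3)
print([m["role"] for m in recent])

# Round-trip through a temporary file so the demo leaves nothing behind.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "transcript.json")
    save_transcript(history, path)
    restored = load_transcript(path)
print(restored == history)
```

Note the trade-off in the demo: after trimming to three messages, the turn where Sam gave his name is gone, so the model could no longer answer the last question. Lower input cost is paid for with shorter memory.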

Summary

You assembled a complete command-line assistant from the parts you built across this module. A system prompt gave it a consistent persona; a messages history list gave it memory across turns; client.messages.stream made replies type out live while you captured the full text; and the usage numbers plus Haiku’s $1/$5 pricing let you show the cost of every turn and a running session total. A while True REPL loop with /exit and /reset tied it into a program you can actually use.

Key Concepts

  • System persona — a system prompt that fixes the assistant’s name, tone, and formatting for the whole conversation.
  • Conversation memory — a messages list that grows with every user and assistant turn and is resent on each request.
  • Streaming capture — printing stream.text_stream chunks live while concatenating them into the full reply to store.
  • Per-turn cost — usage.input_tokens and usage.output_tokens multiplied by the model’s per-token price, accumulated across the session.
  • REPL loop — a read-evaluate-print loop with command handling (/exit, /reset) that turns the API calls into an interactive app.

Why This Matters

Almost every LLM product you’ll ever build is this loop with more around it: a persona, a memory of the conversation, a streamed response, and a meter on the cost. Retrieval systems, agents, and tools all bolt onto this same skeleton. Having written it once — and watched the token counts and dollars move with your own eyes — you now understand what those bigger systems are doing underneath, and what they cost to run.


Next Steps

Continue to Module 2 - Prompt Engineering (next in the course)

Move from making calls to making them count — designing prompts that get reliable, well-shaped answers out of any model.


Continue Building Your Skills

You’ve built your first real LLM app — a working assistant you wrote line by line, that remembers, streams, and tells you what it costs. That’s a genuine milestone: everything in this module now lives in one program you understand top to bottom. Next you’ll sharpen the part that matters most for getting good answers out of any model — the prompt itself. Onward to prompt engineering.