Context engineering is the discipline of deciding exactly which tokens enter the LLM’s context window on each inference call.
A naive agent appends every message and every tool output to a growing list. This works for short sessions. For longer sessions it fails in two ways: the context eventually exceeds the model's token limit outright, and well before that point, cost and latency grow while the model's attention to the relevant parts of the context degrades.
Context engineering replaces unbounded accumulation with deliberate curation.
Every model has a token limit. Think of it as a budget:
| Model | Token limit |
|---|---|
| GPT-5 | 128,000 |
| GPT-5-mini | 128,000 |
| GPT-4-turbo | 128,000 |
| GPT-4 | 8,192 |
| GPT-3.5-turbo | 16,385 |
Your context on each call consumes some of that budget across:
- the system prompt
- conversation history (user and assistant messages)
- tool definitions and tool call results
- retrieved content (papers, entities, workflows, prior summaries)
The remaining budget is available for the model’s response. Good context engineering keeps each component proportional and prunes aggressively when you approach the limit.
`calculate_context_usage`: This function estimates how much of the context budget is used and returns a percentage. It is called inside the agent harness on each iteration to trigger summarisation when the context gets too large.
Token estimation: dividing character count by 4 gives a reasonable approximation for English text (on average ~4 characters per token). For production, use `tiktoken` for exact counts. The approximation is sufficient for threshold-based decisions.
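As a quick sanity check of the heuristic, you can compare it against an exact count (a minimal sketch; which encoding matches your model is up to you, and `cl100k_base` here is an assumption):

```python
import tiktoken

text = "Context engineering decides which tokens enter the window on each call."
approx_tokens = len(text) // 4                    # the 4-chars-per-token heuristic
encoding = tiktoken.get_encoding("cl100k_base")   # pick the encoding for your model
exact_tokens = len(encoding.encode(text))
print(approx_tokens, exact_tokens)                # compare heuristic vs exact
```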
Complete solution:
```python
# MODEL_TOKEN_LIMITS is defined earlier in the notebook; repeated here (values
# match the table above) so the snippet runs standalone.
MODEL_TOKEN_LIMITS = {
    "gpt-5": 128_000, "gpt-5-mini": 128_000, "gpt-4-turbo": 128_000,
    "gpt-4": 8_192, "gpt-3.5-turbo": 16_385,
}

def calculate_context_usage(context: str, model: str = "gpt-5-mini") -> dict:
    """Calculate context window usage as a percentage."""
    estimated_tokens = len(context) // 4  # ~4 characters per token for English text
    max_tokens = MODEL_TOKEN_LIMITS.get(model, 128_000)  # fallback for unknown models
    percentage = (estimated_tokens / max_tokens) * 100
    return {"tokens": estimated_tokens, "max": max_tokens, "percent": round(percentage, 1)}
```
Key names matter. The agent harness reads `usage["tokens"]`, `usage["max"]`, and `usage["percent"]` directly. If your dict uses different key names (such as `estimated_tokens` or `percent_used`), the agent will raise a `KeyError` at runtime.
Usage in the agent harness:
```python
usage = calculate_context_usage(context, model="gpt-5")
if usage["percent"] > 80:
    # Trigger summarisation before the next LLM call
    context, summaries = offload_to_summary(context, memory_manager, llm_client)
```
The code cells following the TODO contain `summarise_context_window` and `offload_to_summary`. These are provided complete. Read them to understand the pattern.
`summarise_context_window` calls the LLM to compress a block of context into a summary, stores the summary in `SUMMARY_MEMORY`, and returns a reference ID.
`offload_to_summary` wraps the above: if context usage exceeds a threshold (default 80%), it replaces the verbose context with a compact summary reference. The agent can expand the reference later using the `expand_summary` tool.
This is conversation compaction — the same technique used by production agent frameworks to handle long-running sessions.
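A minimal sketch of that pattern, following the descriptions above (`memory_manager.store` and `llm_client.complete` are hypothetical stand-ins for the workshop's actual APIs):

```python
import uuid

def summarise_context_window(content: str, memory_manager, llm_client) -> str:
    """Compress a context block, store the summary, return a reference ID."""
    summary = llm_client.complete(f"Summarise for later retrieval:\n{content[:3000]}")
    summary_id = str(uuid.uuid4())
    memory_manager.store("SUMMARY_MEMORY", summary_id, summary)  # hypothetical API
    return summary_id

def offload_to_summary(context: str, memory_manager, llm_client, threshold: float = 80.0):
    """Replace verbose context with a compact summary reference when usage is high."""
    usage = calculate_context_usage(context)
    if usage["percent"] <= threshold:
        return context, []                # under budget: leave the context as-is
    summary_id = summarise_context_window(context, memory_manager, llm_client)
    return f"[SUMMARY REF: {summary_id}]", [summary_id]
```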
When context is offloaded, the full content is not lost — it is stored in Oracle's `SUMMARY_MEMORY` table. An `expand_summary` tool is registered in the toolbox, allowing the agent to retrieve the full content on demand:
```
Agent context:  "... [SUMMARY REF: abc-123] ..."
Agent decides:  I need the detail from that summary
Agent calls:    expand_summary(summary_id="abc-123")
Oracle returns: full original content
```
This is the key advantage of using a database for memory: the content is always retrievable, never truly discarded.
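A sketch of what the tool handler could look like (the column names and SQL schema here are assumptions beyond the `SUMMARY_MEMORY` name given above, and `connection` is assumed to be a python-oracledb connection):

```python
def expand_summary(summary_id: str, connection) -> str:
    """Retrieve the full original content behind a summary reference."""
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT original_content FROM summary_memory WHERE summary_id = :id",
            id=summary_id,
        )
        row = cursor.fetchone()
    return row[0] if row else f"No summary found for id {summary_id}"
```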
The plotting cell at the end of Part 6 shows context window usage across agent iterations. After completing the workshop, compare the chart for the memory-engineered agent (Part 6 Step 1) versus the naive agent (Part 6 Step 2). The difference will be visible as a flat line versus a continuously growing line.
`calculate_context_usage` returns `None` — Your function is missing the `return` statement. Make sure you return the dict.
`KeyError` on `MODEL_TOKEN_LIMITS` — Check the model name string. The dict uses specific keys like `"gpt-5"`. The function defaults to 128,000 for unknown models, so this should not raise — check whether you accidentally removed the `.get()` default.
This is the most open-ended TODO in the workshop. The prompt you write directly determines the quality of what gets stored in summary memory — and therefore what the agent can recall later.
What makes a good summarisation prompt: a poor prompt produces a vague paragraph; a good prompt produces a structured, faithful, retrievable snapshot. The key constraints are:
- preserve the user's goal and constraints, the facts already established, important entities, and open questions
- bound the output (a handful of short bullet points, not free-form prose)
- stay faithful to the source and add no new facts
Complete solution:
```python
summary_prompt = f"""
You are compressing an AI agent context window for later retrieval.
The content may include conversation memory, retrieved papers, entities, workflows, and prior summaries.
Produce a compact summary that preserves:
- user goal and constraints
- key facts/findings already established
- important entities (paper titles, arXiv IDs, authors)
- unresolved questions and next actions
Output 4-7 short bullet points.
Be faithful to the source, and do not add new facts.
Context window content:
{content[:3000]}
""".strip()
```
Why `content[:3000]`? At the point summarisation is triggered, the context may already be large. Truncating to 3,000 characters prevents the summarisation call itself from exceeding the model's token limit — you are compressing because the context is large, so you cannot afford to send all of it.
What happens after the prompt: The function makes two LLM calls — one to produce the bullet-point summary, and one to generate a short label (max 12 words) used as the summary's description in Oracle. The description is what appears in `read_summary_context`, so the agent can decide whether to expand a summary without fetching its full content.
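In sketch form (again with `llm_client.complete` as a hypothetical stand-in; the workshop's actual label prompt is not reproduced here):

```python
# First call: the bullet-point summary from the prompt above.
summary = llm_client.complete(summary_prompt)

# Second call: a short label (max 12 words) used as the description in Oracle.
description = llm_client.complete(
    f"Describe the following summary in at most 12 words:\n{summary}"
)
# summary and description are stored together in SUMMARY_MEMORY;
# read_summary_context surfaces only the description.
```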