Context engineering supersedes prompt engineering for long-horizon agents because it owns the full token lifecycle — from the first system prompt token to the last compacted summary — not just how instructions are worded. Every major model, regardless of context window size, degrades when that window fills with the wrong tokens, and the degradation is measurable, consistent, and avoidable with the right strategies.
Andrej Karpathy popularised the term in June 2025, and by September Anthropic had published a formal engineering framework alongside two new platform primitives: context editing and a memory tool. Cognition AI — makers of Devin — called it “effectively the #1 job of engineers building AI agents.” The term caught on because it captured something prompt engineering never did: agents are not one-shot chatbots. They run for hundreds of turns, accumulate tool outputs, and can fail in ways that have nothing to do with how well the original instructions were written.
This playbook covers the empirical evidence for context degradation, a taxonomy of four agent-specific failure modes, and the four engineering levers that address them — with concrete token budget allocations, compaction decision rules, and multi-agent isolation patterns you can apply to production systems today. All data points are sourced from Anthropic, LangChain, Chroma, Cognition, and independent research; fabricated metrics are explicitly excluded.
- 01 Context engineering is not prompt engineering. Prompt engineering asks how to write effective instructions; context engineering asks what tokens should occupy the window at every moment of a multi-turn agent run, including what to evict, compress, or delegate to external storage.
- 02 Context rot is real and affects every major model. Chroma's empirical research demonstrates performance degrades as input token count increases across Claude Sonnet 4, GPT-4.1, Qwen3-32B, and Gemini 2.5 Flash, even on intentionally controlled tasks. Million-token windows do not solve the problem; they shift it.
- 03 Four failure modes, four levers, mapped one-to-one. Context Poisoning, Distraction, Confusion, and Clash each have primary remediations in the write/select/compress/isolate framework. The novel contribution here is pairing these two taxonomies into a single remediation matrix.
- 04 Compaction is the critical lever for long-horizon runs. Anthropic's internal evaluations show context editing alone delivers a 29% performance lift, and combining it with a memory tool reaches 39%. In a 100-turn web search eval, context editing reduced token consumption by 84% while enabling workflows that would otherwise fail due to context exhaustion.
- 05 Multi-agent architecture is a context isolation strategy. Anthropic's multi-agent research system uses subagents with isolated context windows, each returning 1,000–2,000 token condensed summaries. Token usage explains 80% of performance variance on BrowseComp. The architecture outperformed single-agent Claude Opus 4 by 90.2% on their internal research evaluation.
01 — What It Is
A discipline for the full token lifecycle, not just instructions.
The Anthropic Applied AI Team published their canonical definition on September 29, 2025: context engineering is “the set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference, including all the other information that may land there outside of the prompts.” Their framing is that effective agent development requires thinking in context — considering the holistic state available to the model at any given time and what potential behaviors that state might yield, not just whether the system prompt is well-written.
Prompt engineering addresses how to write effective prompts, particularly system prompts. Context engineering owns everything else: what retrieved documents to include and when to drop them, which tool outputs to keep versus clear, when to summarize a growing message history, how to route subtasks to isolated subagents, and how to maintain persistent notes outside the window entirely. The distinction matters because production agents fail at the context layer far more often than at the prompt-writing layer.
Andrej Karpathy gave the discipline its most-cited definition in a June 25, 2025 post: “the delicate art and science of filling the context window with just the right information for the next step.” His enumeration of what belongs in the window (task descriptions, few-shot examples, RAG, related multimodal data, tools, state, history, and compacting) is the closest thing to a canonical component list. The post accumulated over 14,000 likes and 2,000 retweets, signalling that the term resonated well beyond academic circles.
"Context engineering is the delicate art and science of filling the context window with just the right information for the next step. Too little or of the wrong form and the LLM doesn't have the right context for optimal performance. Too much or of the wrong form and costs might go up and performance might come down." — Andrej Karpathy, co-founder OpenAI / independent researcher, X/Twitter, Jun 25, 2025
02 — Empirical Evidence
Context rot: performance degrades as tokens accumulate.
Chroma's published research, titled “Context Rot,” is the most rigorous independent dataset on this problem. The researchers extended the standard Needle in a Haystack (NIAH) benchmark — which tests direct lexical matching — with semantic matching tasks and a conversational QA evaluation called LongMemEval, designed to isolate context-length effects from task difficulty. Their conclusion: performance degrades as input token count increases across all major models they tested, including Claude Sonnet 4, GPT-4.1, Qwen3-32B, and Gemini 2.5 Flash. Million-token context windows extend where degradation occurs; they do not eliminate it.
The mechanism is architectural. LLM attention complexity scales quadratically — every token must attend to every other token, creating n² pairwise relationships. As context grows, the model's capacity to represent these relationships gets stretched. Anthropic's engineering blog notes a second factor: models develop attention patterns from training-data distributions where shorter sequences are far more common than longer ones, meaning fewer specialized parameters exist for long-tail context-wide dependencies. Position encoding interpolation extends sequence length, but it introduces degradation in token-position understanding.
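To make the quadratic growth concrete, here is a minimal sketch that counts query-key pairs under the simplifying assumption of full (dense) self-attention. The helper name `attention_pairs` and the chosen window sizes are illustrative, and real models may use attention variants that change the constants.

```python
def attention_pairs(n_tokens: int) -> int:
    """Query-key pairs in dense self-attention: every token attends to every token."""
    return n_tokens * n_tokens


if __name__ == "__main__":
    baseline = attention_pairs(8_000)
    for n in (8_000, 32_000, 200_000, 1_000_000):
        pairs = attention_pairs(n)
        # Growing the window by a factor k multiplies the pair count by k squared.
        print(f"{n:>9,} tokens: {pairs:.2e} pairs ({pairs // baseline:,}x the 8k case)")
```

A 4x longer window carries 16x the pairwise relationships, which is why filling a large window is not a neutral act.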
A Databricks study found that correctness for retrieval tasks began to fall around the 32k token mark for Llama 3.1 405B, and fell earlier for smaller models. The practical implication: smaller models hit their distraction ceiling well before filling their context windows, meaning raw context length is a poor proxy for usable context capacity. Chroma's note that NIAH “only tests direct lexical matching and does not represent the flexible, semantically-oriented tasks real agents perform” is the key methodological critique — models that score near-perfect on NIAH can still degrade significantly on real semantic retrieval from long contexts.
03 — Failure Taxonomy
Four ways agents break under context pressure.
Drew Breunig documented four distinct context failure modes from public agent research, including Gemini's own technical report and the Berkeley Function-Calling Leaderboard. These are not theoretical: each has been observed in production agents or published evaluations. The matrix below maps each failure mode against its primary remediation lever from the write/select/compress/isolate framework, a detection signal, and an example agent that demonstrated it.
| Failure Mode | When it occurs | Primary lever | Detection signal | Observed in |
|---|---|---|---|---|
| Context Poisoning | Hallucination enters context and compounds across turns | Compress — evict stale/wrong context | Agent pursues impossible or irrelevant goals; loop escalation | Gemini 2.5 Pokémon agent |
| Context Distraction | Long context causes over-reliance on history vs. novel reasoning (onset: ~100k tokens for Gemini) | Compress — summarize message history | Agent repeats prior actions instead of adapting; low novelty in outputs | Gemini 2.5 technical report |
| Context Confusion | Superfluous tools or docs overwhelm the model; Llama 3.1 8B failed with 46 tools, succeeded with 19 | Select — just-in-time tool and doc loading | Wrong tool called; irrelevant doc cited in output | Berkeley Function-Calling Leaderboard; GeoEngine benchmark |
| Context Clash | Conflicting information gathered across turns; multi-turn sharding caused avg 39% performance drop (o3: 98.1 → 64.1) | Isolate — subagent partitioning; Write — structured notes | Contradictory answers across tool calls; hedging escalation | Microsoft/Salesforce study (arXiv:2505.06120) |
Sources: Drew Breunig (dbreunig.com, Jun 22, 2025), Gemini 2.5 technical report, Berkeley Function-Calling Leaderboard, arXiv:2505.06120 (Microsoft/Salesforce, May 2025)
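The matrix above can also be kept as data so that an alerting or triage script can look up the lever for an observed failure. The sketch below is one possible encoding; the names `Remediation`, `REMEDIATION_MATRIX`, and `levers_for` are hypothetical, and the detection signals are abbreviated from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Remediation:
    levers: tuple[str, ...]      # primary lever(s) from write/select/compress/isolate
    detection_signal: str        # abbreviated from the matrix above


REMEDIATION_MATRIX = {
    "poisoning": Remediation(("compress",), "agent pursues impossible or irrelevant goals"),
    "distraction": Remediation(("compress",), "agent repeats prior actions instead of adapting"),
    "confusion": Remediation(("select",), "wrong tool called or irrelevant doc cited"),
    "clash": Remediation(("isolate", "write"), "contradictory answers across tool calls"),
}


def levers_for(failure_mode: str) -> tuple[str, ...]:
    """Return the primary lever(s) for a failure mode; raises KeyError if unknown."""
    return REMEDIATION_MATRIX[failure_mode.lower()].levers
```

For example, `levers_for("Clash")` returns both the isolate and write levers, matching the last row of the table.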
The Gemini 2.5 technical report named context poisoning explicitly, describing scenarios in a Pokémon-playing experiment where the agent's “goals, summary” section became “poisoned with misinformation about the game state, causing the agent to pursue impossible or irrelevant goals.” The distraction pattern emerged at beyond 100k tokens, where the agent showed “a tendency toward favoring repeating actions from its vast history rather than synthesizing novel plans.” Both failures appear in the same agent because they are different expressions of the same root cause: unmanaged context accumulation.
The Microsoft/Salesforce sharding result is particularly striking because it affects frontier-tier models. When prompt information is distributed across multi-turn exchanges — as agents do by design — OpenAI's o3 dropped from 98.1 to 64.1, an absolute decline of 34 points. The average across all models was a 39% performance drop. Context Clash is not a small-model problem.
04 — Engineering Framework
Write, select, compress, isolate.
LangChain formalised four context engineering strategies in their July 2, 2025 blog post. These are not competing approaches — they are layered. Most production agents need all four operating simultaneously at different points in the context lifecycle.
Write — save outside the window
Save context outside the active window so it persists across context resets. Claude Code uses CLAUDE.md files loaded upfront. Anthropic's memory tool enables CRUD operations on file-based storage. Structured notes let a long-running agent keep precise tallies across resets; Anthropic's example is a Claude agent playing Pokémon that records 'for the last 1,234 steps I've been training Pokémon in Route 1.'
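As a rough illustration of the Write lever, the sketch below persists structured notes to a JSON file so they survive a context reset. `NoteStore` and its methods are hypothetical names for this example; this is not the interface of Anthropic's memory tool, only a local stand-in for the same idea.

```python
import json
from pathlib import Path


class NoteStore:
    """Hypothetical file-backed notes that live outside the context window."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def read(self) -> dict:
        # An absent file simply means no notes have been written yet.
        if not self.path.exists():
            return {}
        return json.loads(self.path.read_text(encoding="utf-8"))

    def write(self, key: str, value) -> None:
        notes = self.read()
        notes[key] = value
        self.path.write_text(json.dumps(notes, indent=2), encoding="utf-8")

    def delete(self, key: str) -> None:
        notes = self.read()
        notes.pop(key, None)  # deleting a missing key is a no-op
        self.path.write_text(json.dumps(notes, indent=2), encoding="utf-8")
```

After a reset, the agent would call `read()` and place only the relevant notes back into the window, rather than replaying the full history.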
Select — pull in just-in-time
Rather than loading all context upfront, maintain lightweight identifiers (file paths, stored queries, web links) and load data dynamically at the point it's needed. Claude Code uses glob and grep for just-in-time file access. Applying RAG to tool descriptions — not just documents — can improve tool selection accuracy by approximately 3× according to research cited by LangChain.
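The sketch below illustrates just-in-time tool selection: only the top-scoring tool descriptions are placed in the window for a given task. It uses plain word overlap as a stand-in for the semantic search the text describes, so treat the scoring as an assumption; `select_tools` and the sample tool names are hypothetical.

```python
def select_tools(task: str, tools: dict[str, str], k: int = 3) -> list[str]:
    """Return up to k tool names whose descriptions share words with the task.

    Word overlap is a deliberately crude proxy for embedding similarity.
    """
    task_words = set(task.lower().split())

    def score(name: str) -> int:
        return len(task_words & set(tools[name].lower().split()))

    ranked = sorted(tools, key=score, reverse=True)
    # Drop tools with no overlap at all so irrelevant definitions never load.
    return [name for name in ranked[:k] if score(name) > 0]


if __name__ == "__main__":
    catalog = {
        "search_web": "search the web for pages",
        "read_file": "read a file from disk",
        "send_email": "send an email message",
    }
    print(select_tools("read the config file", catalog, k=1))
```

In a real system the scoring function would be swapped for vector search over tool descriptions; the selection-then-load shape stays the same.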
Compress — retain only signal
Summarize growing message history and discard raw tool outputs once their essential information has been captured. Anthropic's context editing automatically clears stale tool calls and results as agents approach token limits. Claude Code's auto-compact triggers at 95% context window usage. Cognition uses a fine-tuned compaction model because off-the-shelf summarization does not reliably preserve key decisions.
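A minimal local sketch of clearing stale tool results follows. It assumes a simple message format (dicts with `role` and `content` keys) and a hypothetical `keep_last` parameter; it illustrates the idea only and is not the behaviour or interface of Anthropic's context editing.

```python
PLACEHOLDER = "[tool result cleared]"


def clear_stale_tool_results(messages: list[dict], keep_last: int = 2) -> list[dict]:
    """Replace all but the most recent `keep_last` tool results with a placeholder.

    Returns a new list; the input messages are not mutated.
    """
    tool_indexes = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    if keep_last > 0:
        stale = set(tool_indexes[:-keep_last])
    else:
        stale = set(tool_indexes)
    return [
        {**m, "content": PLACEHOLDER} if i in stale else m
        for i, m in enumerate(messages)
    ]
```

Keeping a placeholder rather than deleting the message preserves the turn structure, so the model can still see that a tool call happened even though its raw output is gone.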
Isolate — split across agents or environments
Split context across agents, environments, or tools so no single context accumulates everything. Anthropic's multi-agent research system uses subagents with isolated context windows, each returning condensed summaries of 1,000–2,000 tokens to the lead agent. LangGraph's Bigtool library applies semantic search over tool descriptions to pre-select only relevant tools for each task.
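The isolation pattern can be sketched as a lead loop that only ever sees each subagent's condensed summary. Everything here is an assumption for illustration: `run_isolated` is a hypothetical helper, the subagent is any callable that takes a task and returns text, and the summary cap uses a rough 4-characters-per-token estimate to approximate a 2,000-token ceiling.

```python
from typing import Callable

MAX_SUMMARY_CHARS = 8_000  # assumed: ~2,000 tokens at roughly 4 characters per token


def run_isolated(subtasks: list[str], run_subagent: Callable[[str], str]) -> list[str]:
    """Run each subtask in its own subagent and keep only capped summaries."""
    lead_context: list[str] = []
    for task in subtasks:
        # The subagent's working context stays inside this call;
        # only its returned summary reaches the lead agent.
        summary = run_subagent(task)
        lead_context.append(f"[{task}] {summary[:MAX_SUMMARY_CHARS]}")
    return lead_context
```

Truncation is a blunt cap; in practice the subagent would be prompted to write a summary within budget, with the cap acting as a safety net.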
The interaction between these levers is where most production gains come from. Compression alone (Lever 03) tackles distraction and reduces token spend, but does nothing about context confusion from too many tools (Lever 02 problem) or context clash from multi-turn information accumulation (Lever 04 problem). Teams that apply only compaction — the most visible lever — often still hit confusion and clash failures. The playbook sections below cover each lever operationally.
05 — Compaction Deep Dive
When to trigger, what to keep, what to drop.
Compaction is the primary lever for long-horizon agents — runs that span hundreds of turns, hours of wall-clock time, or workflows that would otherwise exhaust the context window entirely. Anthropic published the most precise production data on compaction effectiveness available: in a 100-turn web search evaluation, context editing reduced token consumption by 84% while enabling agents to complete workflows that would have otherwise failed due to context exhaustion. Context editing alone delivered a 29% performance improvement; combining it with the memory tool reached 39%.
The key implementation question is not whether to compact — it is what the compaction model should preserve. Cognition found that off-the-shelf summarization does not reliably preserve key decisions for long-running agent tasks, requiring a fine-tuned compaction model for production reliability. Anthropic's guidance specifies the preservation targets explicitly.
What to preserve in compaction
- Architectural decisions — which approach was chosen and why, so the agent does not re-evaluate resolved tradeoffs.
- Unresolved bugs and blockers — the current state of the problem, not the history of how it was diagnosed.
- Implementation context — what was built, tested, and in what state it was left.
- Current objectives and sub-goals — what the agent is trying to accomplish in this run.
What to discard in compaction
- Raw tool outputs that have been processed — once the agent has extracted the relevant signal, the full output is noise.
- Intermediate reasoning traces that led nowhere — rejected hypotheses can be noted as “tried X, failed because Y” rather than preserved in full.
- Redundant confirmations and status checks — tool calls that confirmed expected state without changing behavior.
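The preserve and discard lists above can be sketched as a filter over tagged history entries. The entry schema (`kind`, `text`, `tried`, `reason`) and the kind labels are assumptions made for this example; a production compactor would more likely be a model call than a rule-based filter.

```python
# Kinds that survive compaction, mirroring the "preserve" list above.
PRESERVE = {"decision", "open_bug", "implementation_state", "objective"}


def compact(entries: list[dict]) -> list[dict]:
    """Keep preserved kinds, shrink dead ends to one line, drop everything else."""
    kept = []
    for entry in entries:
        if entry["kind"] in PRESERVE:
            kept.append(entry)
        elif entry["kind"] == "dead_end":
            # Rejected hypotheses become a single "tried X, failed because Y" note.
            note = f"tried {entry['tried']}, failed because {entry['reason']}"
            kept.append({"kind": "dead_end", "text": note})
        # Raw tool outputs, status checks, and other kinds are discarded.
    return kept
```

The dead-end note is kept deliberately: it is cheap in tokens and stops the agent from retrying an approach it has already ruled out.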
Trigger thresholds
Claude Code's auto-compact triggers at 95% of context window usage. For programmatic agents, Anthropic's context editing API launched on September 29, 2025 alongside Claude Sonnet 4.5 — it automatically clears stale tool calls and results rather than requiring manual compaction logic. The practical difference: Claude Code's auto-compact is a product-layer feature for interactive coding runs; context editing is the platform-level API primitive for automated workflows and is the right choice for production agent pipelines that need fine-grained control over what gets cleared.
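For agents that implement their own trigger, a threshold check is a one-liner. The sketch below assumes you can measure tokens in use and know the window size; `should_compact` is a hypothetical helper, and the 0.95 default simply mirrors the Claude Code figure quoted above rather than a recommendation for every workload.

```python
def should_compact(used_tokens: int, window_tokens: int, threshold: float = 0.95) -> bool:
    """True once usage reaches the given fraction of the context window."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    return used_tokens >= threshold * window_tokens


if __name__ == "__main__":
    for used in (120_000, 189_999, 190_000):
        print(used, should_compact(used, window_tokens=200_000))
```

A lower threshold leaves more headroom for the compaction call itself, which also consumes tokens.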
Performance lift
Combining Anthropic's context editing API with the memory tool delivered a 39% improvement over baseline on an agentic search evaluation set. Context editing alone was 29%.
100-turn web search eval
Context editing reduced token consumption by 84% in Anthropic's 100-turn web search evaluation, while enabling completion of workflows that would otherwise fail from context exhaustion.
Claude Code threshold
Claude Code's auto-compact triggers at 95% context window usage, summarizing the full agent trajectory and preserving architectural decisions, unresolved bugs, and implementation details.
06 — Budget Allocation
Token budget by agent archetype.
Most context engineering write-ups discuss token management abstractly. The missing piece is concrete allocation by agent type. The table below gives recommended ceilings for each component slot, assuming a 200k total token budget at steady state. Adjust proportionally for different window sizes; the ratios are more stable than the absolute numbers. Sources: Anthropic blog (compaction targets), LangChain blog (write/select/compress/isolate framework), Anthropic multi-agent research system (subagent token patterns), Cognition blog (compaction model guidance).
| Agent Type | System prompt | Tool definitions | RAG / retrieved docs | Message history | Compaction strategy |
|---|---|---|---|---|---|
| Research agent (breadth-first) | ~2k–4k | ~5k (select via RAG) | ~30k per subagent | ~10k condensed summaries | Isolate to subagents; 1k–2k summary per subagent returned to lead |
| Coding agent (file-heavy) | ~4k–8k | ~3k (glob/grep JIT) | ~40k (files on demand) | ~20k before compact | Auto-compact at 95%; preserve architectural decisions + unresolved bugs |
| Customer support agent (conversational) | ~3k–5k | ~4k (select per intent) | ~15k (KB chunks) | ~8k rolling window | Rolling summarize + write persistent customer state outside window |
| Data analysis agent (tool-heavy) | ~2k–3k | ~8k (schema + tool defs) | ~20k (query results) | ~15k before compact | Clear raw query results once extracted; compress to finding summaries |
| Long-running automation (hours+) | ~3k–5k | ~4k (JIT select) | ~10k (rolling) | ~8k structured notes | Fine-tuned compaction model (Cognition pattern); write checkpoints to external storage every N steps |
Budget guidance derived from: Anthropic Engineering Blog (Sep 29, 2025), Anthropic Multi-Agent Research System (Jun 13, 2025), LangChain Blog (Jul 2, 2025), Cognition AI Blog (2025). Assumes 200k total token budget; ratios are more stable than absolutes.
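Scaling the table to a different window is simple proportional arithmetic. The sketch below takes the upper ceilings from the coding-agent row and rescales them; the dictionary keys and the `scale_budget` helper are illustrative names, and the 128k target is just an example window size.

```python
# Upper ceilings from the coding-agent row, for a 200k reference budget.
CODING_AGENT_200K = {
    "system_prompt": 8_000,
    "tool_definitions": 3_000,
    "retrieved_docs": 40_000,
    "message_history": 20_000,
}


def scale_budget(budget: dict[str, int], window_tokens: int,
                 reference_window: int = 200_000) -> dict[str, int]:
    """Scale each slot ceiling in proportion to the window size (integer math)."""
    return {slot: tokens * window_tokens // reference_window
            for slot, tokens in budget.items()}


if __name__ == "__main__":
    scaled = scale_budget(CODING_AGENT_200K, window_tokens=128_000)
    print(scaled)
    print("unallocated headroom:", 128_000 - sum(scaled.values()))
```

Note that the slot ceilings sum to well under the window: the remainder is working headroom for the current turn's tool outputs and the model's own generation.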
[Figure: Token budget allocation across context slots, steady-state agent. Source: Anthropic Engineering Blog + LangChain Blog (2025). Steady-state guidance for a 200k context budget.]
07 — Multi-Agent Architecture
Isolation as a context strategy.
Anthropic's multi-agent research system, which uses Claude Opus 4 as a lead agent and Claude Sonnet 4 as subagents, is the most detailed published case study of the Isolate lever at scale. The key finding: token usage explains 80% of performance variance on their BrowseComp evaluation, with number of tool calls and model choice as the other two explanatory factors. On that evaluation, this points to context management, more than model selection, as the primary performance lever.
The architecture works by giving each subagent its own isolated context window. Subagents return condensed summaries — typically 1,000–2,000 tokens — to the lead agent rather than their full context. The lead agent's context thus accumulates distilled insights, not raw research trails. This prevents the context clash pattern where conflicting or duplicated information gathered across parallel workstreams merges into a single confused context.
Anthropic published scaling rules for effort allocation based on task complexity: simple fact-finding requires just 1 agent with 3–10 tool calls; direct comparisons need 2–4 subagents with 10–15 calls each; complex research may use more than 10 subagents with clearly divided responsibilities. Parallel tool calling — where the lead agent spawns 3–5 subagents simultaneously rather than sequentially — reportedly cut research time by up to 90% for complex queries.
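The scaling rules in the paragraph above can be written as a small lookup. The complexity labels (`simple`, `comparison`, `complex`) and the `effort_plan` helper are names invented for this sketch; the ranges are the ones quoted in the text, with `None` marking a bound the source does not give.

```python
def effort_plan(complexity: str) -> dict:
    """Map a task-complexity label to the agent and tool-call ranges quoted above."""
    plans = {
        # Simple fact-finding: 1 agent, 3-10 tool calls.
        "simple": {"agents": (1, 1), "tool_calls_per_agent": (3, 10)},
        # Direct comparisons: 2-4 subagents, 10-15 calls each.
        "comparison": {"agents": (2, 4), "tool_calls_per_agent": (10, 15)},
        # Complex research: "more than 10" subagents; no upper bound or call count given.
        "complex": {"agents": (10, None), "tool_calls_per_agent": None},
    }
    if complexity not in plans:
        raise ValueError(f"unknown complexity label: {complexity!r}")
    return plans[complexity]
```

Classifying a task into one of these labels is itself a judgment call the lead agent has to make; the lookup only fixes what happens after that decision.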
One operational finding worth singling out: a tool-testing agent built by Anthropic rewrote flawed MCP tool descriptions, resulting in a 40% decrease in task completion time for future agents using those improved descriptions. Tool description quality is a context engineering lever — poorly written tool descriptions consume tokens explaining themselves and still confuse the model. For teams operating production AI transformation programs, auditing tool descriptions for precision is a high-ROI, low-effort improvement.
Single-agent, direct tool calls
1 agent, 3–10 tool calls. Adding subagents here is overhead that adds latency and token cost without capability benefit. Apply Write + Select levers only — load docs JIT, persist notes if multi-session.
2–4 subagents, parallel
Breadth-first queries with multiple independent directions. Each subagent handles one direction, returns a 1k–2k token condensed summary to the lead. Prevents context clash from merging conflicting research trails into one window.
10+ subagents, divided responsibilities
Large-scale research with many independent directions. Anthropic's scaling rule: more than 10 subagents with clearly divided responsibilities. Lead agent assembles distilled summaries. Token usage now the primary performance driver — track it in observability tooling.
Compaction + write + isolate combined
Runs spanning hundreds of turns or hours. Combine all four levers: Write (structured notes + external checkpoints), Select (JIT tool loading), Compress (fine-tuned compaction model — Cognition pattern), Isolate (subagents for parallel workstreams). Use dedicated agent observability to track context health across the full run.
08 — System Prompt Altitude
The Goldilocks zone for instruction specificity.
Anthropic identifies two system prompt failure modes that bracket the optimal range. Over-specification produces brittle if-else hardcoded logic that breaks when the real world deviates from the anticipated case. Under-specification provides vague, high-level guidance that falsely assumes shared context: the model is left to infer what the developer meant rather than executing clear heuristics. The optimal zone is “specific enough to guide behavior effectively, yet flexible enough to provide strong heuristics.”
Structuring the system prompt for scannability matters both for model comprehension and for token efficiency. Anthropic recommends organizing into distinct sections with XML tagging or Markdown headers — for example, <background_information>, <instructions>, ## Tool guidance, ## Output description. The guiding principle across all context components is “the smallest possible set of high-signal tokens that maximize the likelihood of the desired outcome.”
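One way to keep a system prompt sectioned is to assemble it from named parts. The sketch below uses the tag and header names from the paragraph above; the `build_system_prompt` helper and its parameters are hypothetical, and the mix of XML tags and Markdown headers simply mirrors the example given rather than a required layout.

```python
def build_system_prompt(background: str, instructions: str,
                        tool_guidance: str, output_description: str) -> str:
    """Assemble a sectioned system prompt from four short, high-signal parts."""
    sections = [
        f"<background_information>\n{background.strip()}\n</background_information>",
        f"<instructions>\n{instructions.strip()}\n</instructions>",
        f"## Tool guidance\n{tool_guidance.strip()}",
        f"## Output description\n{output_description.strip()}",
    ]
    # A blank line between sections keeps each one visually distinct.
    return "\n\n".join(sections)
```

Keeping each part as a separate argument also makes it easy to measure and trim sections individually.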
This principle has an underappreciated implication for prompt engineering pattern libraries: a template optimized for single-turn chatbot interactions is often the wrong shape for agent system prompts. Chatbot prompts tend to be exhaustive to handle diverse one-shot queries; agent system prompts should be concise and defer to dynamically-selected context for the specifics. The distinction is between “answer every possible question” and “guide a sustained, tool-using process.”
For teams with large tool inventories, the tool description quality review deserves a dedicated pass. Anthropic's tool-testing agent found that rewriting flawed tool descriptions cut task completion time by 40% for subsequent agents. Poorly written tool descriptions that require the model to infer intent are a hidden context drain — they consume both tokens and model attention on orientation rather than execution. The connection to agent observability and evals is direct: tracing token consumption by context component surfaces which tool descriptions are the worst offenders.
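Per-component token attribution can be approximated with a short report like the one below. The 4-characters-per-token estimate is a crude assumption standing in for a real tokenizer, and `estimate_tokens`, `token_report`, and the component names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token. Swap in a real tokenizer.
    return max(1, len(text) // 4)


def token_report(components: dict[str, str]) -> list[tuple[str, int, float]]:
    """Return (component, estimated tokens, share of total), largest first."""
    counts = {name: estimate_tokens(text) for name, text in components.items()}
    total = sum(counts.values())
    rows = [(name, n, n / total) for name, n in counts.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    sample = {
        "system_prompt": "a" * 400,
        "tool:search": "b" * 1200,
        "history": "c" * 400,
    }
    for name, tokens, share in token_report(sample):
        print(f"{name:<14} {tokens:>5} tokens  {share:.0%}")
```

Listing each tool description as its own component is what surfaces the worst offenders: a single verbose description shows up at the top of the report.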
Finally, context engineering connects to agent memory systems at the Write lever. The memory tool, structured note-taking, and external storage are the mechanisms that allow agents to persist knowledge outside the window — turning multi-hour runs from brittle stateless sessions into coherent, resumable workflows. The combination of all four levers operating together is what distinguishes production-grade agent infrastructure from demo-grade scaffolding.
Token discipline is the real agent reliability lever — not model choice.
Context engineering is the discipline that separates agents that work in demos from agents that work in production. The empirical evidence from Chroma, Anthropic, Microsoft, and Salesforce consistently points to the same finding: token accumulation is the primary failure mode for long-horizon agents, and it affects frontier models as much as smaller ones. A model that scores 98.1 on a clean prompt evaluation can drop to 64.1 when that same information is distributed across a multi-turn agent run the way real agents operate.
The four-lever framework (write, select, compress, isolate) gives builders a complete vocabulary for addressing all four failure modes. Each lever targets specific failure patterns: just-in-time selection prevents context confusion from tool overload, compaction addresses context distraction from history accumulation and evicts poisoned state before it compounds, and subagent isolation, backed by structured external notes, prevents context clash from information merging. Anthropic's published results show what the levers deliver: a 90.2% improvement over single-agent Claude Opus 4 from the multi-agent architecture on its internal research evaluation, and an 84% reduction in token consumption from context editing in its 100-turn web search evaluation.
The forward direction is clearer than the current state of tooling suggests. Compaction models will improve; context editing APIs will become standard infrastructure; observability tooling will surface per-component token attribution automatically. The teams that operationalize context engineering now — before it becomes table stakes — will have a year's head start on the agents their competitors are still trying to get out of demo mode.