Context Engineering for AI Coding Agents: 9 Techniques for 2026
Your AI coding agent is forgetting things. Halfway through a refactor, it loses track of the file it was just editing. You asked it to "follow the project conventions" two hours ago — it doesn't anymore. It re-reads the same 400-line file for the third time this session because nothing stuck. Your token bill looks like a phone plan from 2007.
This is not a model problem. It is a context problem. And in 2026, fixing it has its own name: context engineering. Anthropic's Applied AI team formalized the term in September 2025, calling it "the set of strategies for curating and maintaining the optimal set of tokens during LLM inference." The shift matters because agents — unlike chat — cannot be re-prompted at every step of a 15-step refactor. They need a persistent, carefully curated information environment.
This guide gives you nine techniques that actually work — each with a concrete tool, a measurable token or accuracy win, and a rule for when to reach for it. No vibes, no philosophy. Just the mechanics of feeding an agent less and getting more.
📋 What You'll Need
- An AI coding agent — Claude Code, Cursor, GitHub Copilot, Gemini CLI, or Aider all work
- A project repo where you're already losing context (any real codebase qualifies)
- Basic familiarity with MCP — the Model Context Protocol standard. If you're new, start with MCP Servers Explained
- A willingness to measure tokens before and after. You cannot engineer what you don't count.
🧠 What Context Engineering Actually Is
Prompt engineering asked: how do I phrase this request? Context engineering asks: what does the agent need to know to succeed, and what can I strip out?
The two are not the same. A perfect prompt inside a bloated context window still produces mediocre output. A mediocre prompt inside a surgically curated context often produces great output. Anthropic's formal definition reframes the job: find "the smallest possible set of high-signal tokens that maximize the likelihood of some desired outcome." Thoughtworks' Birgitta Böckeler, writing for Martin Fowler's site in February 2026, put it more bluntly: context engineering is curating what the model sees so that you get a better result.
The discipline became necessary because agents got real. A 15-step refactor with tool calls, file reads, shell output, and subagent hand-offs can easily balloon to 100,000 tokens before the model has done any actual thinking. At that point you are no longer engineering prompts. You are engineering an information environment.
📉 Why Context Rot Breaks Agents
Before the techniques, understand the enemy. Context rot is the measurable performance degradation LLMs experience as input length grows. Chroma's 2025 study tested 18 frontier models — GPT-4.1, Claude Opus 4, Gemini 2.5, and others — and found that every single one degrades as context fills. It is not a bug of one model. It is a property of the transformer.
Three mechanisms compound:
- Lost in the middle. Models attend well to the start and end of the context, poorly to the middle. In multi-document QA with 20 documents, moving the relevant document from position 1 to positions 5–15 drops accuracy by over 30%.
- Attention dilution. Transformer attention is quadratic. At 100K tokens that is ten billion pairwise relationships competing for signal.
- Distractor interference. Semantically similar but irrelevant content actively misleads the model. The more you dump in, the more decoys you create.
Every technique below is, at its core, a way to fight one of these three mechanisms. Either the high-signal tokens get in, or the low-signal tokens get out.
🛠 The 9 Techniques
1. Write an AGENTS.md or CLAUDE.md onboarding file
The highest-leverage single move. One markdown file at your project root that the agent reads at every session start. Include the architectural decisions it should not re-derive: "we use PostgreSQL not SQLite," "components are PascalCase, utilities are camelCase," "tests live in tests/ and run with pytest," "never commit to main directly." This is the onboarding doc you would give a new hire, minus the coffee-machine directions.
AGENTS.md — the open standard originated by OpenAI with Sourcegraph, Google, and Cursor in mid-2025, now governed by the Linux Foundation's Agentic AI Foundation since December 2025 — has cross-tool support across Claude Code, Cursor, Copilot, Gemini CLI, Windsurf, Aider, Zed, Warp, and RooCode. Over 60,000 public repos already ship one. Claude Code also reads CLAUDE.md; use AGENTS.md if you want one file for every tool, CLAUDE.md if you want Claude-specific overrides on top.
```markdown
# AGENTS.md

## Stack
- Python 3.11, Django 5, PostgreSQL 16
- Pytest for tests, Ruff for lint

## Conventions
- Function-based views (not class-based)
- Use `transaction.atomic()` around multi-write operations
- Never log PII; use `log.info_safe()`

## Commands
- Run tests: `pytest -x`
- Lint: `ruff check .`
- Migrate: `python manage.py migrate --settings=app.settings.local_dev`
```
Token math: a 400-token AGENTS.md that prevents five "what stack do you use?" clarification turns saves roughly 2,000 tokens per session. Over a week, that is real money. Deep dive: The CLAUDE.md Standard.
2. Package repeatable playbooks as Skills
Skills are folders with a SKILL.md that Claude (and a growing list of other agents) load progressively — the name and one-line description enter context at session start, the full body only loads when the skill triggers. That is progressive disclosure as a first-class primitive.
Use a skill when you have a repeatable playbook — "generate a fundesk article," "run a security review," "scaffold a new Django app." The skill encapsulates the full procedure, example outputs, and lifecycle hooks. It stays dormant until needed.
The ecosystem is real: forrestchang/andrej-karpathy-skills crossed 78,000 stars in April 2026 (adding 35,000 in a single week). Addy Osmani's production-grade skills pack sits at 18,000. Vercel's skills.sh directory indexes public skills across every major agent.
Token math: a skill with a 2,500-token full body and a 60-token header delays ~98% of its cost until it actually fires. Five skills registered against a project means roughly 300 tokens at session start instead of 12,500. For the full playbook: Claude Skills Explained.
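Progressive disclosure is easy to sketch outside any particular agent: register only each skill's header at session start, and load the full body from disk when the skill fires. A minimal Python sketch; the `Skill` class and loader here are hypothetical illustrations, not Claude's actual SKILL.md machinery:

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Skill:
    """Hypothetical skill record: header stays in context, body loads lazily."""
    name: str
    description: str  # one line; this is all the model sees at session start
    path: Path        # full SKILL.md body, loaded only on trigger

    def header(self) -> str:
        return f"- {self.name}: {self.description}"

    def load_body(self) -> str:
        # Only called when the skill actually fires
        return self.path.read_text()

def session_start_context(skills: list[Skill]) -> str:
    # ~60 tokens per skill header instead of each 2,500-token body
    return "Available skills:\n" + "\n".join(s.header() for s in skills)
```

The parent loop calls `load_body()` only after a trigger matches, which is exactly the 300-vs-12,500-token difference described above.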
3. Retrieve code just-in-time with semantic MCP search
Feeding an agent your entire 200k-line codebase is not context engineering. It is context arson. Just-in-time retrieval — the agent asks for code when it needs it, gets only the relevant chunks — is the sane default.
The open-source pattern worth copying is zilliztech/claude-context, an MCP server that uses hybrid search (BM25 lexical + dense vector embeddings via Milvus) with AST-based chunking and Merkle-tree incremental indexing. It claims roughly 40% token reduction at equivalent retrieval quality.
```shell
# Install claude-context as an MCP server in Claude Code
claude mcp add claude-context -- npx -y @zilliz/claude-context-mcp
```

Configure retrieval scope in your settings:

```json
{
  "mcpServers": {
    "claude-context": {
      "command": "npx",
      "args": ["-y", "@zilliz/claude-context-mcp"],
      "env": { "OPENAI_API_KEY": "...", "MILVUS_URL": "..." }
    }
  }
}
```
Once indexed, the agent calls search_code("where do we validate JWTs") and gets five relevant chunks instead of a 12-file file-tree dump. The principle generalizes: expose data through narrow retrieval tools, not broad dumps.
4. Sandbox tool output before it hits context
Tool output is the silent context killer. One Playwright snapshot is 56 KB. Twenty GitHub issues are 59 KB. Over a 30-minute session, up to 40% of your context window can be consumed by raw tool data the agent looked at once and never needed again.
The pattern — popularized by mksglu/context-mode (9,100 stars, April 2026) — is to run tool calls in a subprocess, keep the raw output in a sandbox, and only let a concise summary into the conversation. Context-mode's benchmark: a 56.2 KB Playwright snapshot becomes 299 bytes entering context — a 99% reduction. Twenty GitHub issues at 58.9 KB compress to 1.1 KB — 98%.
You can implement a crude version yourself with a small shell wrapper:
```shell
# Instead of dumping curl output into context
curl -s https://api.example.com/issues > /tmp/issues.json
# Let the agent query the file without loading it
jq '[.[] | {id, title, state}] | length' /tmp/issues.json
```
The agent sees the count, not the 58 KB of JSON. If it needs a specific issue, it queries jq again. The raw data never enters the transformer's attention.
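The same sandboxing idea fits in a few lines of Python: run the tool, park the raw output in a temp file, and return only a short summary plus a handle the agent can query later. This is a minimal sketch of the pattern, not context-mode's implementation:

```python
import subprocess
import tempfile

def sandboxed_tool_call(cmd: list[str]) -> dict:
    """Run a tool, keep raw output on disk, return only a summary to context."""
    raw = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    # Park the full output outside the context window
    f = tempfile.NamedTemporaryFile("w", suffix=".out", delete=False)
    f.write(raw)
    f.close()
    # Only this small dict enters the conversation
    return {
        "handle": f.name,       # agent can query this file later
        "bytes": len(raw.encode()),
        "preview": raw[:200],   # first 200 chars, not 56 KB
    }
```

A 56 KB snapshot becomes a three-field dict; if the agent needs more, it greps the file at `handle` rather than re-reading the whole payload.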
5. Compact sessions with indexed external memory
Long sessions inevitably hit the context ceiling. The naive fix — let the tool compact automatically — throws away useful history along with the noise. The better pattern: indexed external memory that retrieves past state on demand.
thedotmack/claude-mem (66,000 stars as of April 2026) implements this with five lifecycle hooks (SessionStart, UserPromptSubmit, PostToolUse, Stop, SessionEnd), a SQLite store for structured observations, and a Chroma vector DB on port 37777 for hybrid semantic search. It exposes a three-layer retrieval:
- A compact index of past sessions (50–100 tokens per result)
- A chronological timeline around a hit
- Full observations fetched only for filtered IDs (500–1,000 tokens each)
That three-step filter is the whole trick — it cuts token use by roughly 10× compared with stuffing prior sessions back into context.
```shell
# One-command install, auto-detects your agent
npx claude-mem install
```
After install, restart the agent. Past-session context appears automatically and stays out of the way until the model decides it wants it.
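The three-layer retrieval generalizes beyond claude-mem. In this sketch the in-memory `store` dict and substring matching are hypothetical stand-ins for its SQLite and Chroma backends:

```python
def search_memory(store: dict[int, dict], query: str, limit: int = 5) -> list[dict]:
    """Layer 1: compact index entries (50-100 tokens each), never full bodies."""
    hits = [
        {"id": oid, "summary": obs["summary"], "ts": obs["ts"]}
        for oid, obs in store.items()
        if query.lower() in obs["summary"].lower()  # stand-in for vector search
    ]
    return hits[:limit]

def timeline(store: dict[int, dict], hit_id: int, window: int = 2) -> list[dict]:
    """Layer 2: chronological neighbors around a hit, still summaries only."""
    ordered = sorted(store, key=lambda oid: store[oid]["ts"])
    i = ordered.index(hit_id)
    return [{"id": oid, "summary": store[oid]["summary"]}
            for oid in ordered[max(0, i - window): i + window + 1]]

def fetch_full(store: dict[int, dict], ids: list[int]) -> list[str]:
    """Layer 3: full observations (500-1,000 tokens) only for filtered IDs."""
    return [store[i]["body"] for i in ids]
```

The agent pays for layer 3 only after layers 1 and 2 have narrowed the candidates, which is where the ~10× saving comes from.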
6. Fan out with sub-agents to isolate context
A single agent trying to simultaneously read 30 files, run tests, update a migration, and review the diff will blow its own context. Sub-agents fix this by isolating work in separate processes — each with its own context budget — and returning a condensed summary to the parent.
Anthropic explicitly recommends this pattern: "specialized agents handle focused tasks, returning condensed summaries to main agents." Claude Code's Task tool, Cursor's background agents, and the Claude Agent SDK all support the primitive.
Use sub-agents when:
- The work is parallelizable — running three independent searches at once
- The work is context-heavy but produces a small answer — "review this 40-file diff and return a three-bullet summary"
- The work is isolable — the sub-agent does not need to see the parent's full state
Do not use sub-agents when the work is inherently sequential and the parent needs every intermediate result. You will just move the context bloat around.
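The fan-out shape is simple to sketch with a thread pool. Here `run_subagent` is a hypothetical stand-in for spawning a real isolated agent session (Claude Code's Task tool, for example); only the condensed summaries return to the parent:

```python
from concurrent.futures import ThreadPoolExecutor

def run_subagent(task: str) -> str:
    """Stand-in for an isolated agent session with its own context budget.

    A real sub-agent might burn 50K tokens reading files, but only this
    small return value ever enters the parent's context.
    """
    return f"summary({task})"

def fan_out(tasks: list[str]) -> list[str]:
    # Parallelizable, isolable work: each result stays small even if the
    # sub-agent's own context filled up getting there.
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        return list(pool.map(run_subagent, tasks))
```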
7. Ask the agent to write code instead of reading it
The highest-leverage trick in the entire field. Instead of having the agent read fifty files and summarize patterns, have it write a script that processes the files and logs only the result.
Which is bigger — reading fifty 200-line files (~100,000 tokens) or running grep -rn "deprecated" --include='*.py' | wc -l and reading "12"? The grep output enters context. The files do not.
A real example. Instead of asking "which of our 40 API endpoints use the legacy auth middleware?" and letting the agent read all 40 files:
```shell
# Agent writes and runs this
grep -l "legacy_auth" app/views/*.py | head -20
```
Context-mode's docs call this pattern code-first analysis: "Instead of reading 50 files into context, agents write scripts that process data and log only results — replacing ten tool calls with one, saving 100× context." The pattern generalizes to SQL (SELECT COUNT(*) ... beats SELECT *), to file globbing, and to log analysis. When in doubt, have the agent compute, not read.
8. Set an explicit context budget and compact on trigger
Treat context like a memory allocator. Define a budget, monitor fill, and compact on a trigger — not when the ceiling is already in flames.
A simple rule that holds up in practice:
| Context fill | Action |
|---|---|
| 0–40% | Keep going, no action needed |
| 40–60% | 🟡 Start favoring just-in-time retrieval over dumps |
| 60–75% | 🟠 Run /compact with a specific directive ("keep architectural decisions, drop tool outputs") |
| 75%+ | 🔴 Hand off to a fresh session with a summary; do not push further |
The Chroma research backs this up: past ~50% fullness, models favor recent tokens over middle or early tokens. Past 75%, accuracy drops hard. Compacting proactively — not reactively — is what keeps an agent coherent across an eight-hour session.
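The table reduces to a few lines of threshold logic. A sketch: in practice the token counts come from your agent's usage reporting, and the action names here are illustrative:

```python
def context_action(used_tokens: int, window_tokens: int) -> str:
    """Map context fill to the compaction policy from the table above."""
    fill = used_tokens / window_tokens
    if fill < 0.40:
        return "continue"              # keep going, no action needed
    if fill < 0.60:
        return "prefer-jit-retrieval"  # stop dumping whole files
    if fill < 0.75:
        return "compact"               # /compact with a specific directive
    return "handoff"                   # fresh session with a summary
```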
9. Curate a few canonical examples instead of many exhaustive ones
Few-shot examples are context engineering's oldest technique and its most frequently misused. The mistake: piling in a dozen examples in the hope that the model will "learn the pattern." The fix, per Anthropic's own guidance: include "diverse, canonical examples rather than exhaustive edge cases."
Three canonical examples that together cover the pattern beat twenty examples that exhaustively cover the edge cases. The twenty-example version wastes tokens, activates distractor interference, and often produces worse output. The three-example version gives the model what it needs to generalize and nothing it doesn't.
A practical test: can you describe what each example teaches in one sentence? If two examples teach the same lesson, cut one. Good few-shot is a tight taxonomy, not a large sample.
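The one-sentence test can even be mechanized: tag each few-shot example with the lesson it teaches, then drop any example whose lesson is already covered. A toy sketch:

```python
def curate(examples: list[dict]) -> list[dict]:
    """Keep the first example per lesson; cut duplicates that teach nothing new."""
    seen: set[str] = set()
    kept = []
    for ex in examples:
        if ex["lesson"] not in seen:
            seen.add(ex["lesson"])
            kept.append(ex)
    return kept
```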
🆚 Context Engineering vs Prompt Engineering vs RAG vs Spec-Driven
These four disciplines overlap and get confused constantly. The clean separation:
| Discipline | What it controls | When you reach for it |
|---|---|---|
| Prompt Engineering | The phrasing of a single request | One-shot tasks, chat interactions, fixed outputs |
| Context Engineering | The entire information environment (system prompt, tools, memory, retrieval, compaction) | Multi-step agents, long sessions, tool-heavy workflows |
| RAG | External document retrieval feeding a prompt | Q&A over large knowledge bases, citations, grounded generation |
| Spec-Driven Development | The upstream specification that generates code | Feature scaffolding, repeatable implementation patterns |
Context engineering is the superset when you are working with agents. RAG is one retrieval technique inside context engineering. Prompt engineering is what you still do at the leaves — inside a single tool call or inside a sub-agent task. They are not competitors. They stack.
If you have an existing prompt engineering practice, context engineering is the next discipline on top — not a replacement.
🎯 When to Reach for Each Technique
A quick decision matrix. The nine techniques are not a checklist; they are a toolkit. Pull the right one for the right failure mode.
| Symptom | Technique to try first |
|---|---|
| Agent re-asks your stack/conventions every session | 🥇 1. AGENTS.md / CLAUDE.md |
| You have a repeatable workflow that keeps entering every prompt | 2. Skills with progressive disclosure |
| Agent reads whole files when it only needs one function | 3. MCP semantic retrieval |
| Tool outputs are eating 30%+ of your context | 4. Tool-output sandboxing |
| Agent "forgets" what happened two hours ago | 5. Indexed session memory |
| One agent is juggling too many concurrent concerns | 6. Sub-agent fan-out |
| Agent reads 40 files to answer a counting question | 7. Code-first analysis |
| Session hits context ceiling mid-task | 8. Budget + directed compaction |
| Few-shot prompts have gotten bloated | 9. Canonical example curation |
Start with techniques 1 and 4 for any project — AGENTS.md plus tool-output sandboxing covers the majority of waste with the least investment. Layer on techniques 3 and 5 when your codebase grows past ~10,000 lines. Reach for 6 and 7 when individual sessions regularly exceed an hour.
⚠️ Common Mistakes
A few failure modes show up often enough to call out:
- Treating bigger context windows as a free upgrade. A 1M-token window does not repeal context rot; it just raises the ceiling at which the rot becomes obvious. Budget fill percentage, not absolute tokens.
- Confusing AGENTS.md with documentation. AGENTS.md is for the agent, not the human. Keep it procedural and decision-oriented. The `README.md` can stay verbose.
- Using sub-agents for sequential work. If step 2 needs every detail from step 1, a sub-agent adds overhead without isolating anything. Keep it in the parent context.
- Letting tool outputs auto-compact. The model has no idea which output was the 59 KB GitHub issue dump and which was the critical one-line error. Sandbox at the tool layer so it never has to choose.
- Skipping measurement. If you are not watching token counts, you are not doing context engineering. You are hoping. The name for hoping is "prompt engineering with extra steps."
🚀 What's Next
- 📘 Write your first AGENTS.md today using the template in technique 1 — the highest-ROI single move in this list
- 🧰 Install `claude-mem` or `context-mode` on one repo and measure tokens before and after for a week
- 📝 Audit an existing long-running agent session — which of the nine failure modes is eating the most context?
- 🔗 Learn the primitive underneath most of these techniques: the Model Context Protocol explained
- 🧠 Go deeper on packaging reusable context into loadable units: Claude Skills Explained
Related reading: The CLAUDE.md Standard: How Project Instructions Are Shaping AI Workflows · Prompt Engineering for Code: Get Better Results from AI Coding Tools