Context Engineering for AI Coding Agents: 9 Techniques for 2026

Your AI coding agent is forgetting things. Halfway through a refactor, it loses track of the file it was just editing. You asked it to "follow the project conventions" two hours ago — it doesn't anymore. It re-reads the same 400-line file for the third time this session because nothing stuck. Your token bill looks like a phone plan from 2007.

This is not a model problem. It is a context problem. And in 2026, fixing it has its own name: context engineering. Anthropic's Applied AI team formalized the term in September 2025, calling it "the set of strategies for curating and maintaining the optimal set of tokens during LLM inference." The shift matters because agents — unlike chat — cannot be re-prompted at every step of a 15-step refactor. They need a persistent, carefully curated information environment.

This guide gives you nine techniques that actually work — each with a concrete tool, a measurable token or accuracy win, and a rule for when to reach for it. No vibes, no philosophy. Just the mechanics of feeding an agent less and getting more.


📋 What You'll Need

  • An AI coding agent — Claude Code, Cursor, GitHub Copilot, Gemini CLI, or Aider all work
  • A project repo where you're already losing context (any real codebase qualifies)
  • Basic familiarity with MCP — the Model Context Protocol standard. If you're new, start with MCP Servers Explained
  • A willingness to measure tokens before and after. You cannot engineer what you don't count.

🧠 What Context Engineering Actually Is

Prompt engineering asked: how do I phrase this request? Context engineering asks: what does the agent need to know to succeed, and what can I strip out?

The two are not the same. A perfect prompt inside a bloated context window still produces mediocre output. A mediocre prompt inside a surgically curated context often produces great output. Anthropic's formal definition reframes the job: find "the smallest possible set of high-signal tokens that maximize the likelihood of some desired outcome." Thoughtworks' Birgitta Böckeler, writing for Martin Fowler's site in February 2026, put it more bluntly: context engineering is curating what the model sees so that you get a better result.

The discipline became necessary because agents got real. A 15-step refactor with tool calls, file reads, shell output, and subagent hand-offs can easily balloon to 100,000 tokens before the model has done any actual thinking. At that point you are no longer engineering prompts. You are engineering an information environment.

Tip: A clean mental model: the prompt is what you say. The context is everything the model reads — system instructions, tools, tool outputs, file contents, retrieved snippets, and prior messages. Context engineering treats all of that as a budget.
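The budget framing is easy to operationalize. A minimal sketch, assuming the rough chars÷4 heuristic for English-plus-code (a real budget should use the model's own tokenizer):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English and code.
    Real budgets should use the model's own tokenizer."""
    return max(1, len(text) // 4)

def context_fill(parts: list[str], window: int = 200_000) -> float:
    """Fraction of the context window consumed by all parts combined."""
    used = sum(estimate_tokens(p) for p in parts)
    return used / window

system_prompt = "x" * 4_000    # ~1,000 tokens
file_dump = "y" * 400_000      # ~100,000 tokens
print(f"fill: {context_fill([system_prompt, file_dump]):.0%}")
```

Run something like this over every piece you are about to feed the agent, and "context as a budget" stops being a metaphor.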

📉 Why Context Rot Breaks Agents

Before the techniques, understand the enemy. Context rot is the measurable performance degradation LLMs experience as input length grows. Chroma's 2025 study tested 18 frontier models — GPT-4.1, Claude Opus 4, Gemini 2.5, and others — and found that every single one degrades as context fills. It is not a bug of one model. It is a property of the transformer.

Three mechanisms compound:

  • Lost in the middle. Models attend well to the start and end of the context, poorly to the middle. In multi-document QA with 20 documents, moving the relevant document from position 1 to positions 5–15 drops accuracy by over 30%.
  • Attention dilution. Transformer attention is quadratic. At 100K tokens that is ten billion pairwise relationships competing for signal.
  • Distractor interference. Semantically similar but irrelevant content actively misleads the model. The more you dump in, the more decoys you create.

Warning: The intuition "bigger context window = better results" is wrong above about 50% fullness. Past that point, Chroma found that context degrades by distance from the end — the model favors recent tokens, then middle, then early. Your carefully placed system prompt at position zero is the part the model forgets first.

Every technique below is, at its core, a way to fight one of these three mechanisms. Either the high-signal tokens get in, or the low-signal tokens get out.


🛠 The 9 Techniques

1. Write an AGENTS.md or CLAUDE.md onboarding file

The highest-leverage single move. One markdown file at your project root that the agent reads at every session start. Include the architectural decisions it should not re-derive: "we use PostgreSQL not SQLite," "components are PascalCase, utilities are camelCase," "tests live in tests/ and run with pytest," "never commit to main directly." This is the onboarding doc you would give a new hire, minus the coffee-machine directions.

AGENTS.md — the open standard originated by OpenAI with Sourcegraph, Google, and Cursor in mid-2025, now governed by the Linux Foundation's Agentic AI Foundation since December 2025 — has cross-tool support across Claude Code, Cursor, Copilot, Gemini CLI, Windsurf, Aider, Zed, Warp, and RooCode. Over 60,000 public repos already ship one. Claude Code also reads CLAUDE.md; use AGENTS.md if you want one file for every tool, CLAUDE.md if you want Claude-specific overrides on top.

# AGENTS.md

## Stack
- Python 3.11, Django 5, PostgreSQL 16
- Pytest for tests, Ruff for lint

## Conventions
- Function-based views (not class-based)
- Use `transaction.atomic()` around multi-write operations
- Never log PII; use `log.info_safe()`

## Commands
- Run tests: `pytest -x`
- Lint: `ruff check .`
- Migrate: `python manage.py migrate --settings=app.settings.local_dev`

Token math: a 400-token AGENTS.md that prevents five "what stack do you use?" clarification turns saves roughly 2,000 tokens per session. Over a week, that is real money. Deep dive: The CLAUDE.md Standard.


2. Package repeatable playbooks as Skills

Skills are folders with a SKILL.md that Claude (and a growing list of other agents) load progressively — the name and one-line description enter context at session start, the full body only loads when the skill triggers. That is progressive disclosure as a first-class primitive.

Use a skill when you have a repeatable playbook — "generate a fundesk article," "run a security review," "scaffold a new Django app." The skill encapsulates the full procedure, example outputs, and lifecycle hooks. It stays dormant until needed.
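A minimal skill layout might look like this — a sketch with a hypothetical skill name and procedure; only the frontmatter's name and description enter context at session start, while the body below stays dormant:

```markdown
---
name: django-app-scaffold
description: Scaffold a new Django app following project conventions
---

# Django app scaffold

1. Run `python manage.py startapp <name>`
2. Register the app in `INSTALLED_APPS`
3. Create `tests/` with a smoke test and run `pytest -x`
```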

The ecosystem is real: forrestchang/andrej-karpathy-skills crossed 78,000 stars in April 2026 (adding 35,000 in a single week). Addy Osmani's production-grade skills pack sits at 18,000. Vercel's skills.sh directory indexes public skills across every major agent.

Token math: a skill with a 2,500-token full body and a 60-token header delays ~98% of its cost until it actually fires. Five skills registered against a project means roughly 300 tokens at session start instead of 12,500. For the full playbook: Claude Skills Explained.


3. Retrieve code just-in-time with MCP semantic search

Feeding an agent your entire 200k-line codebase is not context engineering. It is context arson. Just-in-time retrieval — the agent asks for code when it needs it, gets only the relevant chunks — is the sane default.

The open-source pattern worth copying is zilliztech/claude-context, an MCP server that uses hybrid search (BM25 lexical + dense vector embeddings via Milvus) with AST-based chunking and Merkle-tree incremental indexing. It claims roughly 40% token reduction at equivalent retrieval quality.

# Install claude-context as an MCP server in Claude Code
claude mcp add claude-context -- npx -y @zilliz/claude-context-mcp

# Configure retrieval scope in your settings
{
  "mcpServers": {
    "claude-context": {
      "command": "npx",
      "args": ["-y", "@zilliz/claude-context-mcp"],
      "env": { "OPENAI_API_KEY": "...", "MILVUS_URL": "..." }
    }
  }
}

Once indexed, the agent calls search_code("where do we validate JWTs") and gets five relevant chunks instead of a 12-file file-tree dump. The principle generalizes: expose data through narrow retrieval tools, not broad dumps.
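The narrow-tool principle does not require a vector database to demonstrate. Below is a toy, purely lexical `search_code` (the function name echoes the example above; claude-context's real hybrid BM25-plus-embedding search is far more capable). It returns one short snippet per matching file, never whole files:

```python
import re
from pathlib import Path

def search_code(query: str, root: str = ".", top_k: int = 5) -> list[str]:
    """Toy lexical retrieval: rank files by query-term overlap and
    return one short snippet per file, never whole files."""
    terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    scored = []
    for path in Path(root).rglob("*.py"):
        text = path.read_text(errors="ignore")
        score = len(terms & set(re.findall(r"[a-z0-9]+", text.lower())))
        if score:
            # snippet = first line containing any query term
            for line in text.splitlines():
                if terms & set(re.findall(r"[a-z0-9]+", line.lower())):
                    scored.append((score, f"{path.name}: {line.strip()}"))
                    break
    scored.sort(reverse=True)
    return [snippet for _, snippet in scored[:top_k]]
```

Even this crude version changes the economics: the agent gets five one-line hits instead of five whole files.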


4. Sandbox tool output before it hits context

Tool output is the silent context killer. One Playwright snapshot is 56 KB. Twenty GitHub issues are 59 KB. Over a 30-minute session, up to 40% of your context window can be consumed by raw tool data the agent looked at once and never needed again.

The pattern — popularized by mksglu/context-mode (9,100 stars, April 2026) — is to run tool calls in a subprocess, keep the raw output in a sandbox, and only let a concise summary into the conversation. Context-mode's benchmark: a 56.2 KB Playwright snapshot becomes 299 bytes entering context — a 99% reduction. Twenty GitHub issues at 58.9 KB compress to 1.1 KB — 98%.

You can implement a crude version yourself with a small shell wrapper:

# Instead of dumping curl output into context
curl -s https://api.example.com/issues > /tmp/issues.json

# Let the agent query the file without loading it
jq '[.[] | {id, title, state}] | length' /tmp/issues.json

The agent sees the count, not the 58 KB of JSON. If it needs a specific issue, it queries jq again. The raw data never enters the transformer's attention.
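The same idea as a small Python wrapper — a sketch of the pattern, not context-mode's implementation. Raw output lands in a sandbox directory, and only a one-line summary is handed back for the conversation:

```python
import json
import tempfile
from pathlib import Path

SANDBOX = Path(tempfile.mkdtemp(prefix="tool-sandbox-"))

def sandboxed(tool_name: str, raw_output: str) -> str:
    """Store raw tool output on disk; return only a tiny summary for context."""
    path = SANDBOX / f"{tool_name}.out"
    path.write_text(raw_output)
    return f"[{tool_name}] {len(raw_output):,} bytes saved to {path.name}"

# ~15 KB of JSON stays on disk; one short line enters the conversation
issues = json.dumps([{"id": i, "title": f"Issue {i}"} for i in range(500)])
summary = sandboxed("github-issues", issues)
print(summary)
```

If the agent later needs detail, it queries the sandbox file (with `jq`, `grep`, or a follow-up script) rather than reloading the raw dump.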


5. Compact sessions with indexed external memory

Long sessions inevitably hit the context ceiling. The naive fix — let the tool compact automatically — throws away useful history along with the noise. The better pattern: indexed external memory that retrieves past state on demand.

thedotmack/claude-mem (66,000 stars as of April 2026) implements this with five lifecycle hooks (SessionStart, UserPromptSubmit, PostToolUse, Stop, SessionEnd), a SQLite store for structured observations, and a Chroma vector DB on port 37777 for hybrid semantic search. It exposes a three-layer retrieval:

  1. A compact index of past sessions (50–100 tokens per result)
  2. A chronological timeline around a hit
  3. Full observations fetched only for filtered IDs (500–1,000 tokens each)

That three-step filter is the whole trick — and it saves roughly 10× tokens compared with stuffing prior sessions back into context.
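The three layers can be sketched with plain dictionaries — illustrative data and scoring, not claude-mem's actual code:

```python
# Sketch of three-layer retrieval over stored session observations.
observations = {
    1: {"summary": "Chose PostgreSQL over SQLite", "full": "full 500-1,000 token detail"},
    2: {"summary": "Refactored auth middleware",   "full": "full 500-1,000 token detail"},
    3: {"summary": "Fixed flaky payment test",     "full": "full 500-1,000 token detail"},
}

def layer1_index(query: str) -> list[int]:
    """Layer 1: cheap index scan -- only 50-100 token summaries are touched."""
    return [oid for oid, o in observations.items()
            if any(w in o["summary"].lower() for w in query.lower().split())]

def layer2_timeline(hit: int, radius: int = 1) -> list[int]:
    """Layer 2: chronological neighbors around a hit."""
    return [oid for oid in observations if abs(oid - hit) <= radius]

def layer3_full(ids: list[int]) -> list[str]:
    """Layer 3: full observations, fetched only for the filtered IDs."""
    return [observations[i]["full"] for i in ids]

hits = layer1_index("auth")
context = layer3_full(layer2_timeline(hits[0]))
```

Only the final `context` list pays the full per-observation token cost; everything upstream is filtering over cheap summaries.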

# One-command install, auto-detects your agent
npx claude-mem install

After install, restart the agent. Past-session context appears automatically and stays out of the way until the model decides it wants it.


6. Fan out with sub-agents to isolate context

A single agent trying to simultaneously read 30 files, run tests, update a migration, and review the diff will blow its own context. Sub-agents fix this by isolating work in separate processes — each with its own context budget — and returning a condensed summary to the parent.

Anthropic explicitly recommends this pattern: "specialized agents handle focused tasks, returning condensed summaries to main agents." Claude Code's Task tool, Cursor's background agents, and the Claude Agent SDK all support the primitive.

Use sub-agents when:

  • The work is parallelizable — running three independent searches at once
  • The work is context-heavy but produces a small answer — "review this 40-file diff and return a three-bullet summary"
  • The work is isolable — the sub-agent does not need to see the parent's full state

Do not use sub-agents when the work is inherently sequential and the parent needs every intermediate result. You will just move the context bloat around.
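The fan-out shape, sketched with a thread pool standing in for real sub-agent processes: each worker keeps its heavy intermediate state to itself and returns one condensed line to the parent.

```python
from concurrent.futures import ThreadPoolExecutor

def sub_agent(task: str) -> str:
    """Stand-in for a real sub-agent: does context-heavy work in isolation,
    returns only a condensed summary to the parent."""
    heavy_state = f"...thousands of tokens of intermediate work on {task!r}..."
    return f"{task}: done ({len(heavy_state)} chars of state discarded)"

tasks = ["search usages of legacy_auth", "run test suite", "review migration diff"]
with ThreadPoolExecutor(max_workers=3) as pool:
    summaries = list(pool.map(sub_agent, tasks))

# Only the three one-line summaries re-enter the parent's context.
```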


7. Ask the agent to write code instead of reading it

One of the highest-leverage tricks in the field. Instead of having the agent read fifty files and summarize patterns, have it write a script that processes the files and logs only the result.

Which is bigger — reading fifty 200-line files (~100,000 tokens) or running grep -rn "deprecated" --include='*.py' | wc -l and reading "12"? The grep output enters context. The files do not.

A real example. Instead of asking "which of our 40 API endpoints use the legacy auth middleware?" and letting the agent read all 40 files:

# Agent writes and runs this
grep -l "legacy_auth" app/views/*.py | head -20

Context-mode's docs call this pattern code-first analysis: "Instead of reading 50 files into context, agents write scripts that process data and log only results — replacing ten tool calls with one, saving 100× context." The pattern generalizes to SQL (SELECT COUNT(*) ... beats SELECT *), to file globbing, and to log analysis. When in doubt, have the agent compute, not read.
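The same compute-not-read pattern in Python — a hypothetical helper that touches every file but lets only a single integer enter context:

```python
from pathlib import Path

def count_matches(root: str, needle: str, glob: str = "*.py") -> int:
    """Scan every matching file; return only a number.
    The file contents never need to enter the agent's context."""
    return sum(path.read_text(errors="ignore").count(needle)
               for path in Path(root).rglob(glob))
```

An agent that runs `count_matches("app", "legacy_auth")` reads one integer; an agent that opens the files reads everything.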


8. Set an explicit context budget and compact on trigger

Treat context like a memory allocator. Define a budget, monitor fill, and compact on a trigger — not when the ceiling is already in flames.

A simple rule that holds up in practice:

| Context fill | Action |
| --- | --- |
| 0–40% | Keep going, no action needed |
| 40–60% | 🟡 Start favoring just-in-time retrieval over dumps |
| 60–75% | 🟠 Run /compact with a specific directive ("keep architectural decisions, drop tool outputs") |
| 75%+ | 🔴 Hand off to a fresh session with a summary; do not push further |

The Chroma research backs this up: past ~50% fullness, models favor recent tokens over middle or early tokens. Past 75%, accuracy drops hard. Compacting proactively — not reactively — is what keeps an agent coherent across an eight-hour session.

Important: When you compact, give the model a directive about what to keep. "Compact this conversation" produces generic summaries. "Compact, preserving all architectural decisions and file paths, dropping tool outputs and error traces" produces useful summaries. Context engineering is directive, not passive.
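The thresholds are simple enough to encode directly. A sketch that maps fill fraction to the action from the table above:

```python
def budget_action(fill: float) -> str:
    """Map a context-fill fraction (0.0-1.0) to the recommended action."""
    if fill < 0.40:
        return "keep going"
    if fill < 0.60:
        return "favor just-in-time retrieval over dumps"
    if fill < 0.75:
        return "run /compact with a specific directive"
    return "hand off to a fresh session with a summary"
```

Pair this with a token estimator and you have a crude but working allocator: check fill after every tool call, act on the first threshold crossed.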

9. Curate a few canonical examples instead of many exhaustive ones

Few-shot examples are context engineering's oldest technique and its most frequently misused. The mistake: piling in a dozen examples in the hope that the model will "learn the pattern." The fix, per Anthropic's own guidance: include "diverse, canonical examples rather than exhaustive edge cases."

Three canonical examples that together cover the pattern beat twenty examples that exhaustively cover the edge cases. The twenty-example version wastes tokens, activates distractor interference, and often produces worse output. The three-example version gives the model what it needs to generalize and nothing it doesn't.

A practical test: can you describe what each example teaches in one sentence? If two examples teach the same lesson, cut one. Good few-shot is a tight taxonomy, not a large sample.
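The one-sentence test can even be mechanized. A sketch with hypothetical lesson tags that flags examples teaching a duplicate lesson:

```python
examples = [
    {"lesson": "error responses use the envelope format", "shot": "..."},
    {"lesson": "pagination uses cursor tokens",           "shot": "..."},
    {"lesson": "error responses use the envelope format", "shot": "..."},  # duplicate
]

def redundant_lessons(examples: list[dict]) -> list[str]:
    """Return lessons taught by more than one example -- candidates to cut."""
    seen, dupes = set(), []
    for ex in examples:
        if ex["lesson"] in seen:
            dupes.append(ex["lesson"])
        seen.add(ex["lesson"])
    return dupes
```

If `redundant_lessons` returns anything, your few-shot set is a sample, not a taxonomy.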


🆚 Context Engineering vs Prompt Engineering vs RAG vs Spec-Driven

These four disciplines overlap and get confused constantly. The clean separation:

| Discipline | What it controls | When you reach for it |
| --- | --- | --- |
| Prompt Engineering | The phrasing of a single request | One-shot tasks, chat interactions, fixed outputs |
| Context Engineering | The entire information environment (system prompt, tools, memory, retrieval, compaction) | Multi-step agents, long sessions, tool-heavy workflows |
| RAG | External document retrieval feeding a prompt | Q&A over large knowledge bases, citations, grounded generation |
| Spec-Driven Development | The upstream specification that generates code | Feature scaffolding, repeatable implementation patterns |

Context engineering is the superset when you are working with agents. RAG is one retrieval technique inside context engineering. Prompt engineering is what you still do at the leaves — inside a single tool call or inside a sub-agent task. They are not competitors. They stack.

If you have an existing prompt engineering practice, context engineering is the next discipline on top — not a replacement.


🎯 When to Reach for Each Technique

A quick decision matrix. The nine techniques are not a checklist; they are a toolkit. Pull the right one for the right failure mode.

| Symptom | Technique to try first |
| --- | --- |
| Agent re-asks your stack/conventions every session | 🥇 1. AGENTS.md / CLAUDE.md |
| You have a repeatable workflow that keeps entering every prompt | 2. Skills with progressive disclosure |
| Agent reads whole files when it only needs one function | 3. MCP semantic retrieval |
| Tool outputs are eating 30%+ of your context | 4. Tool-output sandboxing |
| Agent "forgets" what happened two hours ago | 5. Indexed session memory |
| One agent is juggling too many concurrent concerns | 6. Sub-agent fan-out |
| Agent reads 40 files to answer a counting question | 7. Code-first analysis |
| Session hits context ceiling mid-task | 8. Budget + directed compaction |
| Few-shot prompts have gotten bloated | 9. Canonical example curation |

Start with techniques 1 and 4 for any project — AGENTS.md plus tool-output sandboxing covers the majority of waste with the least investment. Layer on techniques 3 and 5 when your codebase grows past ~10,000 lines. Reach for 6 and 7 when individual sessions regularly exceed an hour.


⚠️ Common Mistakes

A few failure modes show up often enough to call out:

  • Treating bigger context windows as a free upgrade. A 1M-token window does not repeal context rot; it just raises the ceiling at which the rot becomes obvious. Budget fill percentage, not absolute tokens.
  • Confusing AGENTS.md with documentation. AGENTS.md is for the agent, not the human. Keep it procedural and decision-oriented. The README.md can stay verbose.
  • Using sub-agents for sequential work. If step 2 needs every detail from step 1, a sub-agent adds overhead without isolating anything. Keep it in the parent context.
  • Letting tool outputs auto-compact. The model has no idea which output was the 59 KB GitHub issue dump and which was the critical one-line error. Sandbox at the tool layer so it never has to choose.
  • Skipping measurement. If you are not watching token counts, you are not doing context engineering. You are hoping. The name for hoping is "prompt engineering with extra steps."

🚀 What's Next

  • 📘 Write your first AGENTS.md today using the template in technique 1 — the highest-ROI single move in this list
  • 🧰 Install claude-mem or context-mode on one repo and measure tokens before and after for a week
  • 📝 Audit an existing long-running agent session — which of the nine failure modes is eating the most context?
  • 🔗 Learn the primitive underneath most of these techniques: the Model Context Protocol explained
  • 🧠 Go deeper on packaging reusable context into loadable units: Claude Skills Explained

Related reading: The CLAUDE.md Standard: How Project Instructions Are Shaping AI Workflows · Prompt Engineering for Code: Get Better Results from AI Coding Tools




