You typed "add user authentication" and hit enter. The agent shipped 600 lines. It built session auth; you wanted JWTs. It stored passwords in plaintext "for now." It edited three files you never wanted it near. The model isn't dumb — it did exactly what a vague spec told it to.
The biggest lever on AI coding output isn't the model or the IDE. It's the spec: what you write down before the agent touches code. GitHub studied 2,500+ agent instruction files and found one root cause again and again — most fail because they're too vague.
This is the writing side of spec-driven development, and it's tool-agnostic. It works the same in Claude Code, Cursor, or Copilot. If you want the GitHub Spec Kit CLI specifically, that's a separate guide. This one teaches the spec itself.
📋 What You'll Need
- Any agentic coding tool — Claude Code, Cursor, Copilot agent mode, Gemini CLI, or Codex
- A Markdown editor — your spec lives in a
.mdfile, not a chat box - A real feature — something you'd actually ship this week
- Twenty minutes — long enough to write a tight spec, short enough to not over-build it
No paid tools needed. A good spec is just well-structured text.
🧠 Why Vague Specs Break Agents
Humans fill gaps with judgment. You'd never store passwords in plaintext just because the ticket forgot to say so. Agents have no such instinct. They fill every blank with the most likely-looking code — which may not be the code you wanted.
That's the trap: AI is great at filling blanks. So each blank you leave becomes a decision the model makes for you, quietly, and the result looks finished before it's right.
There's a second catch. Cramming more rules into one prompt doesn't help adherence — it hurts it. Researchers call it the "curse of instructions": even top models start dropping requirements when you pile on too many at once. So the goal isn't a longer spec. It's a clearer one.
The win is real and measured. Teams that plan before generating report success on hard tasks jumping from about one-third to two-thirds. The spec is where that plan gets written down.
🏗️ What an Executable Spec Contains
A handy frame is the SCOPE model — five things every feature spec should pin down.
| Letter | Covers | What it kills |
|---|---|---|
| S — Structure & Stack | Versions, layout, dependencies, naming | Agent inventing its own architecture |
| C — Constraints | Always / ask-first / never rules | Agent touching things it shouldn't |
| O — Outcomes | Testable, measurable "done" | "Looks done" ≠ correct |
| P — Phases | Ordered, ~15–30 min steps | Agent doing everything at once, badly |
| E — Examples | Input/output samples, code snippets | Vague prose the model reinterprets |
Two rules do most of the work:
- Pin the stack. Write "React 18 with TypeScript, Vite, and Tailwind," not "a React project." Otherwise the agent guesses — toward whatever was most common in its training data, not your repo.
- Show, don't describe. One real code snippet beats a page of style rules. Don't explain your error-handling convention in prose; paste a function that follows it.
Write for parsing, not prose. Models read well-delimited text far better than flowing paragraphs:
# Spec: <feature name>
## Objective
One sentence: what this does and why.
## Stack
- Frontend: React 18, TypeScript, Vite, Tailwind
- Backend: Node 20, Express, PostgreSQL 16, Prisma
## Requirements
- [ ] Concrete, behavioral requirement
- [ ] Another one
## Acceptance Criteria
Given/When/Then, or input→output tables.
## Boundaries
✅ Always / ⚠️ Ask first / 🚫 Never
## Out of Scope
What you are NOT building.
That ## Out of Scope line is the quiet hero. Telling the agent what not to build stops it from "improving" three things you never asked about.
✅ Acceptance Criteria the Agent Can Check
This is where most specs fail. "The dashboard should feel responsive and clean" means nothing to an agent — there's nothing to check. Criteria must be checkable: numbers, explicit behaviors, testable conditions.
| ❌ Vague | ✅ Checkable |
|---|---|
| "Make it fast" | "Interactive in under 2s on a 3G throttle profile" |
| "Show key metrics" | "4 cards in a 2×2 grid: total users, active (7 days), revenue (USD, comma-formatted), churn (% to one decimal)" |
| "Handle errors gracefully" | "On API 500, show a retry banner and log to Sentry; never blank the page" |
| "Secure the endpoint" | "Reject requests without a valid JWT with HTTP 401; rate-limit to 100 req/min per IP" |
Notice the banned words: handle, support, ensure, responsive, clean, fast. Each one hands the agent a decision.
The most reliable format is Given/When/Then — it forces you to name the trigger and the result:
## Acceptance Criteria
- Given a logged-out user, When they hit /dashboard,
Then redirect to /login with ?next=/dashboard.
- Given an empty result set, When the table renders,
Then show the empty-state component, not a spinner.
- Given a 500 from /api/metrics, When the page loads,
Then show the retry banner and keep prior data visible.
Then flip the order: have the agent write tests from these criteria first, and code until they pass. Agents are good at grinding to green. This is the heart of test-driven development for AI agents — the tests become the executable half of your spec.
🚦 The Three-Tier Boundary System
A flat wall of "don't do this" gets ignored — it's just more rules fighting for attention. Give the agent a decision framework instead: always, ask first, never.
## Boundaries
✅ Always (no approval needed)
- Run the test suite before proposing a commit
- Follow existing naming conventions in the touched file
- Add types for every new function
⚠️ Ask first (needs a human)
- Schema changes or migrations
- Adding a dependency
- Anything in /infra or CI config
🚫 Never (hard stops)
- Commit secrets, API keys, or .env contents
- Delete or rewrite tests to make them pass
- Force-push or rewrite git history
In GitHub's study, "never commit secrets" was the most common helpful rule — and "never delete tests to make them pass" is the one that stops an agent from gaming its own criteria. Put standing rules in your CLAUDE.md file; put task-specific ones in the spec.
📐 Good Spec vs. Bad Spec
Same feature — "let users export their data" — written both ways.
The bad spec, the kind people type into the chat box:
Add a data export feature so users can download their stuff.
Should support common formats and be secure.
"Their stuff," "common formats," and "be secure" are three blank checks the agent cashes however it likes.
The good spec:
# Spec: User Data Export
## Objective
Let a logged-in user download all their own records as one
file, on demand, no admin involved.
## Stack
- Node 20, Express, Prisma, PostgreSQL 16
- Reuse the existing BullMQ queue in src/jobs/
## Requirements
- [ ] POST /api/export queues a job for the current user
- [ ] Includes profile, projects, comments — NOT other users'
data, NOT soft-deleted rows
- [ ] Formats: JSON (default), CSV via ?format=csv
- [ ] Email a signed download link when ready; expires in 24h
## Acceptance Criteria
- Given user A exports, When the file generates,
Then it contains zero rows from any other user.
- Given ?format=csv, Then the file is valid RFC-4180 CSV
with a header row.
- Given a link older than 24h, When opened, Then return 410.
## Boundaries
- ✅ Always: run `npm test` before commit; reuse the queue
- ⚠️ Ask first: new dependency; schema change
- 🚫 Never: include other users' data; expose unsigned links
## Out of Scope
- Scheduled exports, admin exports, PDF format
Same feature. One leaves a dozen decisions to a model that's never met your users. The other leaves none. It's longer, but every line removes a decision instead of adding noise — and that, not length, is the point.
🐛 Common Mistakes
| Problem | Fix |
|---|---|
| Vague verbs: "handle," "support," "ensure" | State the exact behavior and its trigger |
| Subjective words: "clean," "intuitive," "fast" | Use a number: "under 2s," "≤ 3 clicks" |
| No edge cases | Spec the empty, error, and loading states |
| Strategy in the spec | Cut market analysis — the agent can't run it |
| One giant 10k-word spec | Split into 30–60 line specs, one per feature |
| Over-specifying how | Set the what and limits; let it pick the how |
| Spec rots after day one | Update it as decisions get made |
That last row matters most. A spec is a living document. When the agent changes the data model or you cut a feature mid-build, edit the spec to match and commit it next to the code. The moment it drifts from reality it stops being the source of truth, and the agent goes back to guessing — the exact mess you wrote it to avoid.
⏱️ Start Small, Expand With the Agent
Don't write a 10,000-word spec upfront — that's the curse of instructions waiting to happen. Loop instead:
- Write a brief — objective plus a few core requirements. 15–60 lines, not pages.
- Let the agent expand it while it reads your real code.
- Make it interview you — tell it to list every ambiguity and ask before proposing anything.
- Tighten until there's no room to misread.
- Then let it write code.
Most tools have a mode for steps 2–4. In Claude Code, hit Shift+Tab for Plan Mode — the agent goes read-only and drafts a plan without touching files. Cursor's Plan Mode does the same.
> [Plan Mode]
> Read the auth code in src/auth/. I want refresh-token
> rotation per SPEC.md. Before writing anything, list every
> ambiguity in the spec and ask me about each one.
That last sentence is the trick. A planning step the agent can't question is just a slower way to ship the wrong thing — forcing it to interview you surfaces the gaps while they're cheap. Same idea as context engineering: feed the model what it needs for this step, not everything you know.
🚀 What's Next
- 📝 Make criteria executable — Test-Driven Development with AI Agents turns acceptance criteria into tests.
- ⚙️ Automate the loop — the GitHub Spec Kit guide runs
specify → plan → tasksfor you. - 📂 Lock in standing rules — move them into a CLAUDE.md file.
- 🧭 See the bigger shift — from vibe coding to agentic engineering on why specs are now the core artifact.
Pair this with the GitHub Spec Kit guide and Test-Driven Development with AI Agents to turn a tight spec into an automated, verifiable build loop.