How to Write Specs for AI Coding Agents That Actually Ship

You typed "add user authentication" and hit enter. The agent shipped 600 lines. It built session auth; you wanted JWTs. It stored passwords in plaintext "for now." It edited three files you never wanted it near. The model isn't dumb — it did exactly what a vague spec told it to.

The biggest lever on AI coding output isn't the model or the IDE. It's the spec: what you write down before the agent touches code. GitHub studied 2,500+ agent instruction files and found one root cause again and again — most fail because they're too vague.

This is the writing side of spec-driven development, and it's tool-agnostic. It works the same in Claude Code, Cursor, or Copilot. If you want the GitHub Spec Kit CLI specifically, that's a separate guide. This one teaches the spec itself.


📋 What You'll Need

  • Any agentic coding tool — Claude Code, Cursor, Copilot agent mode, Gemini CLI, or Codex
  • A Markdown editor — your spec lives in a .md file, not a chat box
  • A real feature — something you'd actually ship this week
  • Twenty minutes — long enough to write a tight spec, short enough to not over-build it

No paid tools needed. A good spec is just well-structured text.


🧠 Why Vague Specs Break Agents

Humans fill gaps with judgment. You'd never store passwords in plaintext just because the ticket forgot to say so. Agents have no such instinct. They fill every blank with the most likely-looking code — which may not be the code you wanted.

That's the trap: AI is great at filling blanks. So each blank you leave becomes a decision the model makes for you, quietly, and the result looks finished before it's right.

There's a second catch. Cramming more rules into one prompt doesn't help adherence — it hurts it. Researchers call it the "curse of instructions": even top models start dropping requirements when you pile on too many at once. So the goal isn't a longer spec. It's a clearer one.

Important: A spec's job is to remove decisions from the agent, not bury it in context. Every line either constrains behavior or it's noise crowding out the lines that do.

The win is real and measured. Teams that plan before generating report success on hard tasks jumping from about one-third to two-thirds. The spec is where that plan gets written down.


🏗️ What an Executable Spec Contains

A handy frame is the SCOPE model — five things every feature spec should pin down.

Letter Covers What it kills
S — Structure & Stack Versions, layout, dependencies, naming Agent inventing its own architecture
C — Constraints Always / ask-first / never rules Agent touching things it shouldn't
O — Outcomes Testable, measurable "done" "Looks done" ≠ correct
P — Phases Ordered, ~15–30 min steps Agent doing everything at once, badly
E — Examples Input/output samples, code snippets Vague prose the model reinterprets

Two rules do most of the work:

  • Pin the stack. Write "React 18 with TypeScript, Vite, and Tailwind," not "a React project." Otherwise the agent guesses — toward whatever was most common in its training data, not your repo.
  • Show, don't describe. One real code snippet beats a page of style rules. Don't explain your error-handling convention in prose; paste a function that follows it.

Write for parsing, not prose. Models read well-delimited text far better than flowing paragraphs:

# Spec: <feature name>

## Objective
One sentence: what this does and why.

## Stack
- Frontend: React 18, TypeScript, Vite, Tailwind
- Backend: Node 20, Express, PostgreSQL 16, Prisma

## Requirements
- [ ] Concrete, behavioral requirement
- [ ] Another one

## Acceptance Criteria
Given/When/Then, or input→output tables.

## Boundaries
✅ Always / ⚠️ Ask first / 🚫 Never

## Out of Scope
What you are NOT building.

That ## Out of Scope line is the quiet hero. Telling the agent what not to build stops it from "improving" three things you never asked about.


✅ Acceptance Criteria the Agent Can Check

This is where most specs fail. "The dashboard should feel responsive and clean" means nothing to an agent — there's nothing to check. Criteria must be checkable: numbers, explicit behaviors, testable conditions.

❌ Vague ✅ Checkable
"Make it fast" "Interactive in under 2s on a 3G throttle profile"
"Show key metrics" "4 cards in a 2×2 grid: total users, active (7 days), revenue (USD, comma-formatted), churn (% to one decimal)"
"Handle errors gracefully" "On API 500, show a retry banner and log to Sentry; never blank the page"
"Secure the endpoint" "Reject requests without a valid JWT with HTTP 401; rate-limit to 100 req/min per IP"

Notice the banned words: handle, support, ensure, responsive, clean, fast. Each one hands the agent a decision.

The most reliable format is Given/When/Then — it forces you to name the trigger and the result:

## Acceptance Criteria

- Given a logged-out user, When they hit /dashboard,
  Then redirect to /login with ?next=/dashboard.
- Given an empty result set, When the table renders,
  Then show the empty-state component, not a spinner.
- Given a 500 from /api/metrics, When the page loads,
  Then show the retry banner and keep prior data visible.

Then flip the order: have the agent write tests from these criteria first, and code until they pass. Agents are good at grinding to green. This is the heart of test-driven development for AI agents — the tests become the executable half of your spec.

Tip: If you can't write a test for a requirement, it isn't specific enough yet. Rewrite it until it's checkable.

🚦 The Three-Tier Boundary System

A flat wall of "don't do this" gets ignored — it's just more rules fighting for attention. Give the agent a decision framework instead: always, ask first, never.

## Boundaries

✅ Always (no approval needed)
- Run the test suite before proposing a commit
- Follow existing naming conventions in the touched file
- Add types for every new function

⚠️ Ask first (needs a human)
- Schema changes or migrations
- Adding a dependency
- Anything in /infra or CI config

🚫 Never (hard stops)
- Commit secrets, API keys, or .env contents
- Delete or rewrite tests to make them pass
- Force-push or rewrite git history

In GitHub's study, "never commit secrets" was the most common helpful rule — and "never delete tests to make them pass" is the one that stops an agent from gaming its own criteria. Put standing rules in your CLAUDE.md file; put task-specific ones in the spec.


📐 Good Spec vs. Bad Spec

Same feature — "let users export their data" — written both ways.

The bad spec, the kind people type into the chat box:

Add a data export feature so users can download their stuff.
Should support common formats and be secure.

"Their stuff," "common formats," and "be secure" are three blank checks the agent cashes however it likes.

The good spec:

# Spec: User Data Export

## Objective
Let a logged-in user download all their own records as one
file, on demand, no admin involved.

## Stack
- Node 20, Express, Prisma, PostgreSQL 16
- Reuse the existing BullMQ queue in src/jobs/

## Requirements
- [ ] POST /api/export queues a job for the current user
- [ ] Includes profile, projects, comments — NOT other users'
      data, NOT soft-deleted rows
- [ ] Formats: JSON (default), CSV via ?format=csv
- [ ] Email a signed download link when ready; expires in 24h

## Acceptance Criteria
- Given user A exports, When the file generates,
  Then it contains zero rows from any other user.
- Given ?format=csv, Then the file is valid RFC-4180 CSV
  with a header row.
- Given a link older than 24h, When opened, Then return 410.

## Boundaries
- ✅ Always: run `npm test` before commit; reuse the queue
- ⚠️ Ask first: new dependency; schema change
- 🚫 Never: include other users' data; expose unsigned links

## Out of Scope
- Scheduled exports, admin exports, PDF format

Same feature. One leaves a dozen decisions to a model that's never met your users. The other leaves none. It's longer, but every line removes a decision instead of adding noise — and that, not length, is the point.


🐛 Common Mistakes

Problem Fix
Vague verbs: "handle," "support," "ensure" State the exact behavior and its trigger
Subjective words: "clean," "intuitive," "fast" Use a number: "under 2s," "≤ 3 clicks"
No edge cases Spec the empty, error, and loading states
Strategy in the spec Cut market analysis — the agent can't run it
One giant 10k-word spec Split into 30–60 line specs, one per feature
Over-specifying how Set the what and limits; let it pick the how
Spec rots after day one Update it as decisions get made

That last row matters most. A spec is a living document. When the agent changes the data model or you cut a feature mid-build, edit the spec to match and commit it next to the code. The moment it drifts from reality it stops being the source of truth, and the agent goes back to guessing — the exact mess you wrote it to avoid.

Warning: Over-specifying backfires too. Dictate every line and the agent either ignores the spec or follows it so literally it adds needless complexity. Set the what and the limits; trust it on the how.

⏱️ Start Small, Expand With the Agent

Don't write a 10,000-word spec upfront — that's the curse of instructions waiting to happen. Loop instead:

  1. Write a brief — objective plus a few core requirements. 15–60 lines, not pages.
  2. Let the agent expand it while it reads your real code.
  3. Make it interview you — tell it to list every ambiguity and ask before proposing anything.
  4. Tighten until there's no room to misread.
  5. Then let it write code.

Most tools have a mode for steps 2–4. In Claude Code, hit Shift+Tab for Plan Mode — the agent goes read-only and drafts a plan without touching files. Cursor's Plan Mode does the same.

> [Plan Mode]
> Read the auth code in src/auth/. I want refresh-token
> rotation per SPEC.md. Before writing anything, list every
> ambiguity in the spec and ask me about each one.

That last sentence is the trick. A planning step the agent can't question is just a slower way to ship the wrong thing — forcing it to interview you surfaces the gaps while they're cheap. Same idea as context engineering: feed the model what it needs for this step, not everything you know.


🚀 What's Next

Pair this with the GitHub Spec Kit guide and Test-Driven Development with AI Agents to turn a tight spec into an automated, verifiable build loop.


Share Your Thoughts

Read More....
AI Automation for Small Business: Where to Start in 2026
AI Coding Agents Compared: Cursor vs Copilot vs Claude Code vs Windsurf in 2026
AI Coding Agents and Security Risks: What You Need to Know
AI Pair Programming: The Productivity Guide for 2026
AI SRE Agents Explained: Platform Comparison and Pilot Guide for 2026
AI-Assisted Code Review: Tools and Workflows for 2026
Browse all AI-Assisted Engineering articles

Stay Ahead

Only insights that save you time or money. No fluff, ever.

Stay Ahead

Only insights that save you time or money. No fluff.