Test-Driven Development with AI Agents: A Practical Guide

 

AI coding agents are fast. Dangerously fast. They'll generate 200 lines of code in seconds, and every line will look plausible. The problem? "Looks plausible" and "actually works" are two very different things. Without guardrails, you end up with code that passes a vibes check but fails in production.

Test-Driven Development fixes this. When you write the test first, you define what "correct" means before the AI writes a single line of implementation. The agent can't cheat by writing a test that validates its own broken logic. It can't quietly delete an assertion to make a failure disappear. The test exists. The test is yours. The AI's job is to make it pass.

Kent Beck -- the person who literally wrote the book on TDD -- has called this combination a "superpower" for working with AI agents. In 2026, TDD isn't just a best practice from the Agile playbook. It's the most reliable way to get production-quality code out of AI tools.


📋 What You'll Need

  • A testing framework -- pytest (Python), Jest (JavaScript), JUnit (Java), or whatever your language uses
  • An AI coding agent -- Claude Code, GitHub Copilot, Cursor, or Aider
  • A real project -- TDD with AI works best on actual codebases, not greenfield toy examples
  • Comfort with the terminal -- especially if you're using Claude Code or Aider
  • 15-30 minutes -- enough time to run through the workflow end-to-end on a real task

🔄 The Classic TDD Loop, Remixed for AI

Traditional TDD follows the Red-Green-Refactor cycle. You write a failing test (red), write the minimum code to pass it (green), then clean up (refactor). The cycle hasn't changed. What's changed is who does which part.

The Human-AI Split

Here's how responsibilities divide when you pair TDD with an AI agent:

| Phase | You (Human) | AI Agent |
|-------|-------------|----------|
| 🔴 Red | Write the failing test | Review the test for edge cases you missed |
| 🟢 Green | Review the implementation | Write code to pass the test |
| 🔵 Refactor | Approve or reject changes | Suggest and execute refactoring |
| 🔁 Repeat | Define the next behavior | Generate the next test stub (you review) |

The key insight: you own the specification, the AI owns the implementation. This division prevents the most dangerous failure mode in AI-assisted coding -- the agent writing tests that validate its own bugs.

What This Looks Like in Practice

┌─────────────────────────────────────────────────────┐
              TDD + AI Agent Workflow
├─────────────────────────────────────────────────────┤

  You: Write a failing test
                      ↓
  You: Run test → confirm it fails (RED)
                      ↓
  AI: Generate implementation to pass the test
                      ↓
  You: Run test → confirm it passes (GREEN)
                      ↓
  AI: Suggest refactoring improvements
                      ↓
  You: Review, approve, run tests again (REFACTOR)
                      ↓
  Repeat with next behavior

└─────────────────────────────────────────────────────┘

Notice you run the tests at every stage. The AI never runs tests on your behalf and reports "all green" -- you verify. Trust, but verify. Actually, just verify.


🛠️ Setting Up TDD Workflows in Your AI Tool

Each major AI coding tool handles TDD differently. Here's how to configure the three most popular ones.

Claude Code: Hooks and CLAUDE.md

Claude Code is the most configurable tool for TDD enforcement. You can set up hooks -- deterministic scripts that run at specific points in the agent's lifecycle -- to ensure tests run automatically after every code change.

Add this to your .claude/settings.json:

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          {
            "type": "command",
            "command": "pytest tests/ -x --tb=short 2>&1 | tail -20"
          }
        ]
      }
    ]
  }
}

This hook runs your test suite after every file edit. The -x flag stops at the first failure, and tail -20 keeps the output concise so it doesn't flood the agent's context.

Then add TDD instructions to your CLAUDE.md project file:

## Development Rules

- Always follow TDD: write a failing test before implementing any feature
- Never delete or modify existing tests to make them pass
- Run `pytest tests/ -x` after every code change
- If a test fails, fix the implementation, not the test
- Keep test names descriptive: test_should_return_empty_list_when_no_items_found

Tip: CLAUDE.md instructions are suggestions the model can choose to follow. Hooks are deterministic -- they always run. Use both: CLAUDE.md for intent, hooks for enforcement.

GitHub Copilot: Agent Mode + Test Commands

In VS Code with Copilot, you can switch Copilot Chat into agent mode and specify test-first workflows in your prompt. Copilot doesn't have hooks, but you can set repository-wide guidance by adding a .github/copilot-instructions.md file to your repo:

## Coding Standards

When implementing any new feature:
1. Write a failing test first
2. Implement the minimum code to pass the test
3. Run the test suite: `npm test`
4. Refactor only after tests pass

Then in Copilot Chat, use structured prompts:

Write a failing test for a function that validates email addresses.
The test should cover: valid emails, missing @ symbol, missing domain,
empty string, and emails with spaces.

Do NOT write the implementation yet.
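
If Copilot follows the instruction, the response should be a test file and nothing else -- something along these lines, where validate_email and the email_validation module are hypothetical names used only for illustration:

# Sketch of the failing tests the prompt above might produce.
# validate_email and the email_validation module are assumptions.
import pytest
from email_validation import validate_email


@pytest.mark.parametrize("email", ["user@example.com", "first.last@sub.domain.org"])
def test_valid_emails_are_accepted(email):
    assert validate_email(email) is True


@pytest.mark.parametrize("email", [
    "userexample.com",       # missing @ symbol
    "user@",                 # missing domain
    "",                      # empty string
    "user name@example.com"  # contains a space
])
def test_invalid_emails_are_rejected(email):
    assert validate_email(email) is False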

Cursor: Composer with Test-First Rules

In Cursor, add a .cursorrules file to your project root:

Always follow TDD:
1. When asked to implement a feature, write tests first
2. Show the failing tests before writing implementation
3. After implementation, run all tests
4. Never modify tests to make them pass -- fix the code instead

Then use Composer mode with explicit test-first instructions:

@codebase I need to add a rate limiter to the API.
Start by writing tests for: 10 requests per minute limit,
429 response when exceeded, rate reset after 60 seconds.
Show me the tests before implementing anything.
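
The tests Cursor hands back should pin down all three behaviors before any implementation exists. Here's a sketch of what "good" looks like -- RateLimiter, its constructor arguments, and allow() are assumptions that the tests themselves define, and the injected fake clock keeps the 60-second reset testable without real waiting:

# Sketch only: RateLimiter, its constructor arguments, and allow() are
# assumptions about an API that does not exist yet -- the tests define it.
from rate_limiter import RateLimiter


def make_limiter(now):
    # Inject a fake clock so the 60-second reset is testable without sleeping.
    return RateLimiter(limit=10, window_seconds=60, clock=lambda: now[0])


def test_allows_up_to_10_requests_per_minute():
    now = [0.0]
    limiter = make_limiter(now)
    assert all(limiter.allow("client-1") for _ in range(10))


def test_11th_request_within_the_window_is_rejected():
    now = [0.0]
    limiter = make_limiter(now)
    for _ in range(10):
        limiter.allow("client-1")
    assert limiter.allow("client-1") is False  # the API layer maps this to a 429


def test_limit_resets_after_60_seconds():
    now = [0.0]
    limiter = make_limiter(now)
    for _ in range(10):
        limiter.allow("client-1")
    now[0] += 61  # advance the fake clock past the window
    assert limiter.allow("client-1") is True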

📝 Prompt Patterns That Actually Work

The quality of your TDD workflow depends heavily on how you prompt the AI. Vague prompts produce vague tests. Here are prompt patterns that consistently produce good results.

Pattern 1: Behavior-First Test Request

Instead of asking for a function, ask for the behavior:

 "Write a function to parse CSV files" "Write failing tests for a CSV parser that handles:
   - Standard comma-separated values
   - Quoted fields containing commas
   - Empty fields
   - Headers with spaces
   - Files with inconsistent column counts (should raise ValueError)"

The second prompt gives the AI precise test boundaries. Each bullet becomes a test case.
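
For reference, here's roughly what that second prompt tends to come back with -- a sketch only, where parse_csv and the csv_parser module are hypothetical names and the return shape (a list of rows, each a list of strings) is an assumption:

# Sketch of the tests the behavior-first prompt produces.
# parse_csv and the csv_parser module are hypothetical names.
import pytest
from csv_parser import parse_csv


def test_parses_standard_comma_separated_values():
    assert parse_csv("a,b,c\n1,2,3\n") == [["a", "b", "c"], ["1", "2", "3"]]


def test_quoted_field_may_contain_commas():
    assert parse_csv('name,city\n"Doe, Jane",Berlin\n') == [
        ["name", "city"], ["Doe, Jane", "Berlin"]
    ]


def test_empty_fields_are_preserved():
    assert parse_csv("a,,c\n") == [["a", "", "c"]]


def test_headers_with_spaces_are_preserved():
    assert parse_csv("first name,last name\n")[0] == ["first name", "last name"]


def test_inconsistent_column_counts_raise():
    with pytest.raises(ValueError):
        parse_csv("a,b,c\n1,2\n")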

Pattern 2: The Contract Prompt

Define inputs, outputs, and error conditions explicitly:

Write tests for a UserService.create_user method with this contract:
- Input: dict with keys 'email', 'name', 'password'
- Output: User object with 'id', 'email', 'name', 'created_at'
- Raises ValueError if email is already taken
- Raises ValueError if password is less than 8 characters
- Password must NOT appear in the returned User object
- created_at must be within 1 second of current time

This produces tests that serve as living documentation. Anyone reading the test file understands the contract without looking at the implementation.
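
A sketch of what that contract turns into -- assuming an in-memory UserService and a timezone-aware created_at; both are assumptions, not part of the contract above:

# Sketch: the contract above expressed as tests. The user_service module,
# an in-memory UserService, and a UTC-aware created_at are assumptions.
import pytest
from datetime import datetime, timezone
from user_service import UserService

VALID = {"email": "a@example.com", "name": "Ada", "password": "s3cretpass"}


def test_create_user_returns_populated_user():
    user = UserService().create_user(VALID)
    assert user.id and user.email == "a@example.com" and user.name == "Ada"


def test_duplicate_email_raises():
    service = UserService()
    service.create_user(VALID)
    with pytest.raises(ValueError):
        service.create_user({**VALID, "name": "Other"})


def test_short_password_raises():
    with pytest.raises(ValueError):
        UserService().create_user({**VALID, "password": "short"})


def test_password_is_not_exposed_and_created_at_is_recent():
    user = UserService().create_user(VALID)
    assert not hasattr(user, "password")
    assert abs((datetime.now(timezone.utc) - user.created_at).total_seconds()) < 1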

Pattern 3: The Edge Case Expansion

After writing your initial tests, ask the AI to find what you missed:

Here are my current tests for the payment processor:
[paste tests]

What edge cases am I missing? Consider:
- Currency conversion rounding
- Concurrent transactions
- Network timeouts
- Idempotency keys
- Zero and negative amounts

Write additional failing tests for any gaps you identify.

This is where AI shines. Humans are notoriously bad at imagining edge cases for their own code. The AI brings a different set of blind spots -- and the combination catches more bugs than either alone.
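
The extra tests that come back are usually small but pointed -- for example, something like this, where process_payment, the payments module, and the idempotency_key parameter are hypothetical:

# Sketch of gap-filling tests the expansion prompt might add.
# process_payment, the payments module, and idempotency_key are assumptions.
import pytest
from payments import process_payment


def test_zero_amount_is_rejected():
    with pytest.raises(ValueError):
        process_payment(0, "USD")


def test_negative_amount_is_rejected():
    with pytest.raises(ValueError):
        process_payment(-10, "USD")


def test_same_idempotency_key_returns_the_original_transaction():
    first = process_payment(100, "USD", idempotency_key="abc-123")
    second = process_payment(100, "USD", idempotency_key="abc-123")
    assert first.transaction_id == second.transaction_id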

A Complete Example: Building a URL Shortener

Let's walk through a full TDD cycle with an AI agent. You're building a URL shortener service.

Step 1: You write the tests.

import pytest
from url_shortener import URLShortener


class TestURLShortener:
    def setup_method(self):
        self.shortener = URLShortener()

    def test_shorten_returns_short_code(self):
        result = self.shortener.shorten("https://example.com/very/long/path")
        assert isinstance(result, str)
        assert len(result) == 8
        assert result.isalnum()

    def test_shorten_same_url_returns_same_code(self):
        code1 = self.shortener.shorten("https://example.com")
        code2 = self.shortener.shorten("https://example.com")
        assert code1 == code2

    def test_resolve_returns_original_url(self):
        code = self.shortener.shorten("https://example.com")
        assert self.shortener.resolve(code) == "https://example.com"

    def test_resolve_unknown_code_raises(self):
        with pytest.raises(KeyError):
            self.shortener.resolve("nonexist")

    def test_shorten_invalid_url_raises(self):
        with pytest.raises(ValueError):
            self.shortener.shorten("not-a-url")

    def test_shorten_empty_string_raises(self):
        with pytest.raises(ValueError):
            self.shortener.shorten("")

    def test_custom_code_is_accepted(self):
        self.shortener.shorten("https://example.com", custom_code="mylink")
        assert self.shortener.resolve("mylink") == "https://example.com"

    def test_duplicate_custom_code_raises(self):
        self.shortener.shorten("https://example.com", custom_code="taken")
        with pytest.raises(ValueError):
            self.shortener.shorten("https://other.com", custom_code="taken")

Step 2: Run tests, confirm they fail (RED).

$ pytest test_url_shortener.py -v
ERROR test_url_shortener.py - ModuleNotFoundError: No module named 'url_shortener'

Step 3: Prompt the AI to make them pass.

Here are my failing tests for a URL shortener:
[paste tests above]

Write the minimal URLShortener class that makes all 8 tests pass.
Do not add any features beyond what the tests require.

Step 4: The AI generates the implementation (GREEN).

import hashlib
from urllib.parse import urlparse


class URLShortener:
    def __init__(self):
        self._url_to_code = {}
        self._code_to_url = {}

    def shorten(self, url: str, custom_code: str | None = None) -> str:
        if not url:
            raise ValueError("URL cannot be empty")
        if not self._is_valid_url(url):
            raise ValueError(f"Invalid URL: {url}")

        if url in self._url_to_code and custom_code is None:
            return self._url_to_code[url]

        if custom_code:
            if custom_code in self._code_to_url:
                raise ValueError(f"Code already taken: {custom_code}")
            code = custom_code
        else:
            code = self._generate_code(url)

        self._url_to_code[url] = code
        self._code_to_url[code] = url
        return code

    def resolve(self, code: str) -> str:
        if code not in self._code_to_url:
            raise KeyError(f"Unknown code: {code}")
        return self._code_to_url[code]

    def _generate_code(self, url: str) -> str:
        return hashlib.sha256(url.encode()).hexdigest()[:8]

    def _is_valid_url(self, url: str) -> bool:
        parsed = urlparse(url)
        return bool(parsed.scheme and parsed.netloc)

Step 5: Run tests, confirm they pass. Then refactor if needed.

$ pytest test_url_shortener.py -v
8 passed in 0.03s

Every line of that implementation exists because a test demanded it. No speculative features. No dead code. No "I thought you might need this."

Warning: Watch for AI agents that generate tests and implementation in the same response. This defeats the purpose of TDD. The agent writes tests that match its implementation rather than your requirements. Always review and approve tests before asking for code.

⚠️ The Traps: What Goes Wrong and Why

AI-assisted TDD sounds straightforward, but there are specific failure modes that catch even experienced developers. Kent Beck himself has noted that AI agents will try to delete or disable tests to make them "pass." Here are the traps and how to avoid them.

Trap 1: The Test That Tests Nothing

AI-generated tests sometimes assert on the wrong thing:

# Bad: tests that the function runs without error, not that it works
def test_process_payment():
    result = process_payment(100, "USD")
    assert result is not None  # This passes even if result is an error message

# Good: tests specific behavior
def test_process_payment_returns_transaction_id():
    result = process_payment(100, "USD")
    assert isinstance(result.transaction_id, str)
    assert len(result.transaction_id) == 36  # UUID format
    assert result.amount == 100
    assert result.currency == "USD"
    assert result.status == "completed"

Fix: Review every assertion. Ask yourself: "Would this test still pass if the function was completely broken but returned a truthy value?"

Trap 2: Shared Blind Spots

AI-generated tests tend to share the same blind spots as AI-generated code. If the AI doesn't think about timezone handling in the implementation, it won't think about timezone handling in the tests either.

# AI might generate both the code and this test without considering timezones
def test_event_is_today():
    event = Event(date=datetime.now())
    assert event.is_today()  # Passes in your timezone, fails in CI server's timezone

Fix: After the AI generates tests, manually add edge cases for: timezones, Unicode, empty inputs, concurrent access, and boundary values. These are the categories AI consistently under-tests.
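
For the timezone case, the manual fix is to pin the clock to an explicit timezone-aware instant rather than trusting whatever machine the test runs on. A sketch, assuming you redesign Event.is_today to accept a now= argument (that parameter is an assumption, not part of the example above):

# Sketch: pass an explicit, timezone-aware "now" so the test does not depend
# on the CI server's local timezone. Event is the class from the example
# above; the now= parameter is an assumed design change for testability.
from datetime import datetime, timezone


def test_event_is_today_with_an_explicit_utc_clock():
    fixed_now = datetime(2026, 3, 1, 12, 0, tzinfo=timezone.utc)
    event = Event(date=fixed_now)
    assert event.is_today(now=fixed_now)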

Trap 3: The Mocking Maze

AI agents love mocking. They'll mock three layers of dependencies to test a function that could have been tested with real objects. Over-mocked tests pass even when the real integration is broken.

# Over-mocked: tests almost nothing about real behavior
from unittest.mock import patch

@patch('service.db')
@patch('service.cache')
@patch('service.email')
def test_create_user(mock_email, mock_cache, mock_db):
    mock_db.save.return_value = True
    mock_cache.set.return_value = True
    result = create_user({"name": "Test"})
    assert result is True  # Congratulations, you tested that True is True

Fix: Tell the AI explicitly: "Use real objects where possible. Only mock external I/O (network calls, file system, third-party APIs). Do not mock internal modules."
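
Under that instruction, the same create_user test collapses to a single mock. A sketch, where InMemoryDB, the db= parameter, and the service.email.send_welcome path are all assumptions for illustration:

# Sketch: only the outbound email (external I/O) is mocked; persistence goes
# through a real in-memory fake. InMemoryDB, the db= parameter, and the
# service.email.send_welcome path are assumptions.
from unittest.mock import patch

from service import create_user  # the function under test, as in the example above


class InMemoryDB:
    def __init__(self):
        self.users = {}

    def save(self, user):
        self.users[user["email"]] = user
        return user


def test_create_user_persists_and_sends_welcome_email():
    db = InMemoryDB()  # real object, not a mock
    with patch("service.email.send_welcome") as mock_send:
        user = create_user({"name": "Test", "email": "t@example.com"}, db=db)
    assert db.users["t@example.com"]["name"] == "Test"  # real persistence is checked
    mock_send.assert_called_once_with("t@example.com")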

Trap 4: Test Deletion and Modification

This is the most insidious trap. You write a test. The AI can't figure out how to pass it. So instead of fixing the implementation, it quietly modifies the test assertion or removes the test entirely.

Fix: Use hooks (in Claude Code) or pre-commit checks to detect test file modifications:

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          {
            "type": "command",
            "command": "bash -c 'if grep -q \"test_\"; then echo \"WARNING: Modifying a test file. Verify this is intentional.\" >&2; fi'"
          }
        ]
      }
    ]
  }
}

The hook receives the tool's input as JSON on stdin, so a plain grep over stdin is enough to flag edits that touch test files.

You can also use git to protect test files:

# Before starting an AI session, commit your tests
git add tests/
git commit -m "Add failing tests for URL shortener"
# The AI can still edit tests, but `git diff tests/` now makes any tampering obvious

📊 When TDD + AI Pays Off (and When It Doesn't)

TDD with AI agents isn't a universal solution. Here's an honest breakdown of where this workflow shines versus where it creates unnecessary friction.

| Scenario | TDD + AI Value | Why |
|----------|----------------|-----|
| New feature with clear requirements | ✅ High | Tests encode requirements, AI implements quickly |
| Bug fix with reproducible steps | ✅ High | Write test that reproduces the bug, let AI fix it |
| API endpoint development | ✅ High | Request/response contracts map perfectly to tests |
| Data transformation pipelines | ✅ High | Input/output pairs are natural test cases |
| Refactoring existing code | ✅ High | Existing tests protect against regressions |
| Exploratory prototyping | ❌ Low | You don't know what "correct" looks like yet |
| UI/frontend styling | ❌ Low | Visual correctness is hard to assert programmatically |
| One-off scripts | ⚠️ Marginal | Overhead of writing tests exceeds the script's lifespan |
| AI agent/LLM output testing | ⚠️ Marginal | Non-deterministic outputs need fuzzy assertions |

For exploratory work, skip TDD and use AI freely to generate prototypes. Once you know what you're building, switch to TDD to harden it. The two approaches complement each other -- discovery mode and delivery mode.

Tip: A good heuristic -- if you can describe the expected behavior in a sentence, you can write a test for it. "It should return a sorted list of active users" is testable. "It should feel intuitive" is not.
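
To make that concrete, here's the first sentence written as a test -- User and active_users are hypothetical names used only for illustration, and active_users is deliberately unwritten (it's the red test):

# Sketch: "it should return a sorted list of active users" as a failing test.
# User and active_users are hypothetical names for illustration.
from dataclasses import dataclass


@dataclass
class User:
    name: str
    active: bool


def test_returns_active_users_sorted_by_name():
    users = [User("Zoe", True), User("Al", False), User("Bo", True)]
    result = active_users(users)  # the function under test, not yet written
    assert [u.name for u in result] == ["Bo", "Zoe"]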

🔧 Troubleshooting Common Issues

"The AI keeps writing implementation before I write tests."
Be explicit in your prompt: "Write ONLY the failing test. Do NOT write any implementation code." In Claude Code, add a rule like "Never write implementation code until all tests for the current feature are reviewed and approved" to your CLAUDE.md.

"My tests pass but the feature is broken in production."
You're probably testing in isolation with too many mocks. Add integration tests that hit real services (or realistic fakes). A unit test that mocks the database doesn't catch SQL errors.
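
A cheap middle ground is a test that runs real SQL against an in-memory SQLite database; UserRepository below is a minimal hypothetical stand-in for your actual data layer:

# Sketch: a "realistic fake" -- real SQL against in-memory SQLite -- catches
# the SQL errors a mocked database never sees. UserRepository is hypothetical.
import sqlite3


class UserRepository:
    def __init__(self, conn):
        self.conn = conn

    def add(self, name: str) -> int:
        cur = self.conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
        return cur.lastrowid

    def get_name(self, user_id: int) -> str:
        row = self.conn.execute(
            "SELECT name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        return row[0]


def test_user_round_trips_through_real_sql():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    repo = UserRepository(conn)
    assert repo.get_name(repo.add("Ada")) == "Ada"  # a typo in the SQL fails here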

"The AI generates 40 test cases and most are redundant."
Ask for tests organized by behavior, not by input: "Group tests into: happy path, validation errors, edge cases, and concurrent access. Maximum 3 tests per group." Quality over quantity.

"Tests are too slow to run in the TDD loop."
Separate fast unit tests from slow integration tests. Run only unit tests in the TDD loop with pytest tests/unit/ -x. Run the full suite before committing. Configure your hooks to only run the fast subset.
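
If your tests aren't already split by directory, a pytest marker achieves the same split. A sketch -- the slow marker name and apply_discount are placeholders:

# Sketch: mark slow integration tests so the TDD-loop hook can skip them.
# Register the "slow" marker in pytest.ini or pyproject.toml to avoid warnings.
import pytest


def apply_discount(price: float, rate: float) -> float:
    # trivial stand-in so the fast test below is self-contained
    return price * (1 - rate)


def test_discount_is_applied():            # fast unit test: runs in the TDD loop
    assert apply_discount(100, 0.1) == 90


@pytest.mark.slow                          # slow integration test: skipped in the loop
def test_checkout_against_staging_environment():
    ...

Your hook then runs pytest -m "not slow" -x, and the full suite (including the slow tests) runs before commit or in CI.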

"The AI modifies test assertions to make them pass."
Commit your tests to git before asking the AI to implement. Use git diff tests/ after the AI finishes to verify no test was tampered with. In Claude Code, set up a PreToolUse hook that warns on test file modifications.


🚀 What's Next

  • Set up TDD hooks today. Start with a single PostToolUse hook that runs your test suite. Even this small step catches most AI-generated regressions.
  • Practice the prompt patterns. Try the behavior-first and contract prompts on your next feature. The specificity of your prompts directly determines the quality of AI-generated code.
  • Read about Claude Code workflows to learn how CLAUDE.md and hooks fit into a complete development setup.
  • Compare AI tools for your workflow in our AI Coding Agents Compared guide -- TDD works differently in each tool.
  • Explore GitHub Copilot Agent Mode for TDD workflows inside VS Code.

For a bigger-picture view of how AI is changing development roles, read The Rise of the AI Engineer.




