From Vibe Coding to Agentic Engineering: Karpathy's Software 3.0 Discipline

In February 2025, Andrej Karpathy coined the term vibe coding in a throwaway tweet: "fully give in to the vibes, embrace exponentials, and forget that the code even exists." A year later, at Sequoia Ascent 2026, he gave the serious version of the same practice a new name: agentic engineering.

The shift isn't cosmetic. Vibe coding raised the floor: anyone could build something. Agentic engineering raises the ceiling, and it demands a discipline that vibe coding deliberately abandons. This guide breaks down Karpathy's Software 3.0 framing, the maturity ladder underneath it, and how to figure out which rung your team actually belongs on.


📋 What You'll Need

  • A working AI coding tool: Cursor, Claude Code, Copilot, Aider, or similar
  • One real codebase to evaluate against (not a toy project; the discipline only matters when correctness does)
  • Honest answers to: who reviews the diffs, who owns the bugs, who gets paged at 3am
  • 15 minutes to self-assess against the maturity ladder below

🧠 Software 1.0 → 2.0 → 3.0: The Quick Recap

In his June 2025 YC AI Startup School talk, Karpathy introduced three software paradigms that now coexist in modern systems.

| Paradigm | Programmer writes | Computer runs | Example |
|---|---|---|---|
| Software 1.0 | Source code (Python, C++, etc.) | Compiled instructions | A Django view |
| Software 2.0 | Datasets + objectives | Trained neural network weights | An image classifier |
| Software 3.0 | Prompts, examples, context | An LLM interpreting English | A coding agent fixing a bug |

In Software 3.0, the unit of programming shifts from a function to a paragraph. The context window is the program. The LLM is the interpreter. Your repo is no longer code alone; it's code, weights, prompts, tool definitions, and humans reviewing output.

💡 Why this matters: Software 3.0 isn't replacing 1.0 or 2.0. It's a third layer that sits on top, and the engineer's new job is deciding which layer should handle which part of the work.
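
To make the three-layer picture concrete, here's a minimal sketch (not from Karpathy's talk) of the same task expressed in each paradigm. The `spam_model` and `llm` objects are hypothetical stand-ins, not any particular library:

```python
# Software 1.0: the programmer writes explicit logic; the computer executes it.
def is_spam_v1(email: str) -> bool:
    return "limited time offer" in email.lower()

# Software 2.0: the programmer curates data and an objective; the "program" is learned weights.
def is_spam_v2(email: str, spam_model) -> bool:
    return spam_model.predict(email) > 0.5  # spam_model is a hypothetical trained classifier

# Software 3.0: the programmer writes English; an LLM interprets it at runtime.
def is_spam_v3(email: str, llm) -> bool:
    prompt = f"Answer yes or no: is the following email spam?\n\n{email}"
    return llm.complete(prompt).strip().lower().startswith("yes")  # llm is a hypothetical client
```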

🌀 What Vibe Coding Actually Is (and What It Isn't)

Vibe coding is the workflow where you describe what you want, an AI writes it, and you ship without reading the code carefully. It's powerful for prototypes, personal tools, throwaway scripts, and learning.

It collapses the moment that code touches real users, real data, or real money. Karpathy is blunt about why: "You can outsource your thinking, but you can't outsource your understanding."

Vibe coding's failure modes are predictable:

  • Security vulnerabilities sneak in because nobody read the diff
  • The codebase becomes "bloated, copy-pasted, awkwardly abstracted, brittle"
  • Bug fixes break unrelated features because there's no mental model
  • Onboarding a new engineer is impossible because nobody understands the code, including you

If you want the longer explainer of where vibe coding came from and where it actually shines, see Vibe Coding Explained.


๐Ÿ› ๏ธ Agentic Engineering: Karpathy's Definition

At Sequoia Ascent 2026, Karpathy defined agentic engineering as:

๐ŸŽฏ "The professional discipline of coordinating fallible agents while preserving correctness, security, taste, and maintainability."

Three words in that definition do all the work.

"Discipline" โ€” it's a practice with rules, not a vibe. Specs, reviews, tests, evals.

"Fallible" โ€” agents are spiky. Karpathy: "It can refactor a 100,000-line codebase or find zero-day vulnerabilities, yet tells me to walk to the car wash." Treat them as powerful interns with savant patches, not senior engineers.

"Preserving" โ€” the quality bar doesn't drop because AI wrote the code. Karpathy: "You are not allowed to introduce vulnerabilities because of vibe coding. You are still responsible for your software, just as before."

Floor vs. Ceiling

The cleanest way to remember the distinction:

| | Vibe Coding | Agentic Engineering |
|---|---|---|
| Goal | Raise the floor | Raise the ceiling |
| Audience | Anyone with an idea | Working engineers |
| Output | Prototypes, demos, scripts | Production systems |
| Code review | ⚠️ Optional | ✅ Mandatory |
| Tests / CI | ⚠️ Optional | ✅ Control surface |
| Spec | ❌ Skipped | ✅ Source of truth |
| Failure cost | Low | Real (security, downtime, data) |
| Engineer responsibility | Low | Unchanged from pre-AI era |

Karpathy's prediction: the agentic-engineering ceiling is far higher than the old "10x engineer" myth ever was. Practitioners who internalize the discipline will outperform by orders of magnitude, but only if they actually internalize it.


📈 The Agentic Engineering Maturity Ladder

The most useful operationalization of Karpathy's framing is Swarmia's five-level autonomy ladder. Use it to honestly assess where your team is, and where you should aim.

| Level | Name | Human role | Agent role | Best for |
|---|---|---|---|---|
| L1 | 🪶 Assistive | Accept/reject every suggestion | Single-file autocomplete | Learning a codebase |
| L2 | 💬 Conversational | Steer continuously, review every multi-file edit | Navigates repo, edits across files | Ambiguous tasks, architecture |
| L3 | 🎯 Task Agent | Assign scoped work, review the PR | Plans → codes → tests → opens PR | One-sentence specs, dependency bumps |
| L4 | 🤖 Autonomous Teammate | Set objectives, review outputs | Picks work from a backlog continuously | Flaky tests, doc drift, recurring chores |
| L5 | 🌊 Agentic Avalanche | Set trust boundaries at the orchestrator | Multi-agent swarm, parallel subagents | Large, parallelizable refactors |

The counter-intuitive rule: higher isn't always better

A senior engineer running L1 autocomplete on a sensitive payments file is being responsible. A junior dev pointing an L4 swarm at production infrastructure is being reckless. Match the level to the task, not to your ambition.

Anthropic's own internal teams keep "80–100% active human oversight" at L2 for most work. Most professional teams should plateau at L2–L3, reach for L4 only on narrow recurring chores, and treat L5 as experimental.

โš ๏ธ Common mistake: Teams adopt Cursor or Claude Code, immediately try to operate at L4 ("just build the feature"), and ship bugs they don't understand. The skill isn't unlocking higher levels โ€” it's choosing the right level on purpose.

🧱 The Six Practices That Define the Discipline

Karpathy and the broader 2026 agentic-engineering community converge on roughly the same playbook. None of it is glamorous; all of it is what separates the discipline from the vibe.

1. Spec-driven workflow

Detailed specifications become the source of truth that agents implement against. Not a one-line prompt, but an artifact you'd be willing to hand to a contractor. See Spec-Driven Development with GitHub Spec Kit for the working pattern.
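
As a rough illustration (this is not the Spec Kit format), a spec detailed enough to hand to an agent looks closer to this than to a one-line prompt:

```python
# An illustrative spec structure that a team might hand to an L3 task agent.
# The fields and task are invented for the sketch; the point is the level of detail.
spec = {
    "title": "Rate-limit the password-reset endpoint",
    "motivation": "Credential-stuffing attempts observed in access logs.",
    "behavior": [
        "Max 5 reset requests per email address per hour.",
        "Return HTTP 429 with a Retry-After header when the limit is hit.",
        "Limits use a rolling window, not a fixed hourly boundary.",
    ],
    "non_goals": ["No CAPTCHA", "No changes to the reset email template"],
    "acceptance_tests": [
        "6th request within an hour returns 429.",
        "Requests from a different email address are unaffected.",
    ],
    "files_likely_touched": ["auth/views.py", "auth/ratelimit.py", "tests/test_reset.py"],
}
```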

2. Context as the program

In Software 3.0 the context window is your program. Curating which files, docs, screenshots, and prior decisions the agent sees matters more than the prompt itself. See Context Engineering for AI Coding Agents for nine techniques.
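
A minimal sketch of what deliberate curation can look like in practice; `run_agent` and the file paths are hypothetical placeholders, and real tools each have their own way of attaching context:

```python
# Deliberate context assembly before invoking an agent: the context IS the program.
from pathlib import Path

def build_context(task: str, files: list[str], decisions: list[str]) -> str:
    """Assemble exactly what the agent should see: the task, curated files, prior decisions."""
    parts = [f"## Task\n{task}"]
    parts.append("## Prior decisions\n" + "\n".join(f"- {d}" for d in decisions))
    for f in files:
        parts.append(f"## {f}\n{Path(f).read_text()}")  # curated files only, not the whole repo
    return "\n\n".join(parts)

# context = build_context(
#     task="Add a rolling-window rate limit to the password-reset endpoint (see spec).",
#     files=["auth/views.py", "auth/ratelimit.py"],
#     decisions=["We use Redis for counters.", "429 responses must include Retry-After."],
# )
# run_agent(context)  # hypothetical call into whatever coding agent you use
```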

3. Tests and CI as the control surface

Agentic engineering treats automated tests, linting, and CI as the safety net that makes agent output trustworthy. Vibe coding treats them as optional. This is the single biggest cultural divide between the two.
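
For instance, the acceptance criteria from the earlier spec sketch can become tests the agent's PR must pass before anyone even reviews it. `RollingWindowLimiter` is a hypothetical module the agent is asked to produce, not an existing library:

```python
# Tests as the control surface: the spec's acceptance criteria as executable checks.
from auth.ratelimit import RollingWindowLimiter  # hypothetical module named in the spec

def test_sixth_request_within_an_hour_is_rejected():
    limiter = RollingWindowLimiter(limit=5, window_seconds=3600)
    for _ in range(5):
        assert limiter.allow("user@example.com")
    assert not limiter.allow("user@example.com")

def test_other_emails_are_unaffected():
    limiter = RollingWindowLimiter(limit=5, window_seconds=3600)
    for _ in range(5):
        limiter.allow("user@example.com")
    assert limiter.allow("someone-else@example.com")
```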

4. Mandatory diff review

Every agent-generated PR gets read by a human who could have written it themselves. Karpathy's framing: the agent is your intern, not your replacement. You sign off on the code that ships under your name.

5. Adversarial testing

Karpathy's Twitter-clone-for-agents example: build the system, then have agents simulate adversarial activity against it. If the agent wrote the auth, another agent should try to break it.
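
A hedged sketch of what that can look like in code; the brief and the `attacker_llm` client are hypothetical, and real red-team setups vary widely:

```python
# Illustrative red-team pass: a second agent probes what the first agent built.
ATTACK_BRIEF = (
    "You are a security red-teamer. Target: the password-reset flow described below.\n"
    "Attempt auth bypass, rate-limit evasion, and token reuse. For each finding,\n"
    "output a reproducible request sequence."
)

def red_team(target_description: str, attacker_llm) -> list[str]:
    """Ask an attacker agent for concrete exploit attempts to replay against staging."""
    response = attacker_llm.complete(ATTACK_BRIEF + "\n\n" + target_description)
    return [line for line in response.splitlines() if line.strip()]

# Each attempt gets replayed against a staging deployment; anything that succeeds
# becomes a regression test before the fix ships.
```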

6. Permission and isolation boundaries

What the agent can read, write, run, and reach over the network is part of the architecture, not an afterthought. The whole topic deserves its own deep dive; see AI Coding Agents and Security Risks.
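
As a rough illustration of treating the boundary as architecture, the policy can be written down as data and checked before any agent action. The paths and allow-lists here are assumptions, and every agent runtime has its own configuration format:

```python
# An illustrative permission boundary expressed as data (not any specific tool's config).
AGENT_POLICY = {
    "read":    ["src/", "tests/", "docs/"],
    "write":   ["src/", "tests/"],
    "run":     ["pytest", "ruff", "mypy"],
    "network": [],                        # no outbound network access by default
    "deny":    [".env", "secrets/", "infra/prod/"],
}

def is_write_allowed(path: str) -> bool:
    """Check a proposed file edit against the policy before applying it."""
    if any(path.startswith(d) for d in AGENT_POLICY["deny"]):
        return False
    return any(path.startswith(w) for w in AGENT_POLICY["write"])

assert is_write_allowed("src/auth/ratelimit.py")
assert not is_write_allowed("secrets/api_keys.json")
```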


🔄 What Stays Human (And Always Will)

Karpathy is precise about which parts of engineering you cannot delegate to an agent, even at L5:

  • Aesthetics and taste: what "good" looks like for this product
  • System-level tradeoffs: storage vs. compute, latency vs. cost, simplicity vs. flexibility
  • Security boundaries: what data crosses which trust line
  • Knowing when the model is out-of-distribution: when the agent is bullshitting confidently
  • Choosing what to build: in Karpathy's words, "I am becoming the bottleneck of even knowing what we are trying to build, why it is worth doing, and how to direct my agents."

The skills that matter in 2026 are the skills that were always load-bearing; they're just no longer hidden underneath the typing.


🚦 How to Move Up the Ladder Without Falling Off

A practical migration path for a team currently doing ad-hoc vibe coding on production code:

| From | Move | Why |
|---|---|---|
| Vibe coding at L3+ | Drop to L2 immediately | Stop the bleeding; rebuild trust through reviewed diffs |
| L2 with no tests | Invest in CI before L3 | L3+ is unsafe without a control surface |
| L2 with tests | Pilot L3 on one scoped task type | Dependency bumps and bugs with a clear repro are the safe entry |
| L3 working well | Add specs as the L3 input | Specs are what unlock reliable handoff |
| L3 across the team | Try L4 on recurring chores only | Flaky tests, doc updates; never feature work |
| L4 on chores | Treat L5 as a research project | Multi-agent orchestration is still bleeding edge in 2026 |

💡 Heuristic: If you can't write the one-sentence spec the agent will execute against, you're not ready to be at L3. Drop to L2 and keep steering.

🩺 Troubleshooting the Transition

| Problem | Likely cause | Fix |
|---|---|---|
| Agent "finishes" but the code is wrong | Spec was vague; you're at L3 with L1 inputs | Write a real spec; run at L2 until specs improve |
| PRs pass tests but break in prod | Tests don't cover the surface that matters | Treat test coverage as a prerequisite, not a metric |
| Reviews take longer than just writing it | Agent generated 2000 lines when 200 would do | Constrain scope in the prompt; reject sprawling diffs (see the sketch below) |
| Security regression introduced | Permission boundaries weren't defined | Define what the agent can read/write/run before the task |
| Team velocity dropped after adopting agents | Operating one level too high for the work | Move down a level on the ladder |
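
For the "reject sprawling diffs" fix, a small pre-merge guard is one hedged option. It shells out to git, and the 400-line threshold is an arbitrary assumption to tune per repo:

```python
# Minimal pre-merge guard that rejects sprawling agent diffs.
import subprocess
import sys

MAX_CHANGED_LINES = 400  # arbitrary threshold; tune per repo

def changed_lines(base: str = "origin/main") -> int:
    """Count added + deleted lines in the current branch relative to base."""
    out = subprocess.run(
        ["git", "diff", "--numstat", base],
        capture_output=True, text=True, check=True,
    ).stdout
    total = 0
    for line in out.splitlines():
        added, deleted, _path = line.split("\t", 2)
        if added != "-":                      # binary files report "-"
            total += int(added) + int(deleted)
    return total

if __name__ == "__main__":
    n = changed_lines()
    if n > MAX_CHANGED_LINES:
        sys.exit(f"Diff is {n} changed lines (limit {MAX_CHANGED_LINES}); split the PR.")
    print(f"Diff size OK: {n} changed lines.")
```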

💰 The Cost Math (Roughly)

Across 2026 reporting, the agentic-engineering cost curve looks like this:

| Tier | Tools | Approx. monthly cost / engineer | Typical level |
|---|---|---|---|
| 🆓 Hobby | Free Copilot tier, Aider + own API key | $0–20 | L1–L2 |
| Pro | Cursor Pro, Claude Pro, Copilot Pro | $20–40 | L2–L3 |
| Team | Cursor Business, Claude Code, Copilot Business | $40–100 | L3 |
| Heavy agentic | Claude Code + multiple parallel agents + API spend | $200–600 | L3–L4 |
| Multi-agent R&D | Custom orchestrators, large API budgets | $1000+ | L4–L5 |

The Anthropic 2026 Agentic Coding Trends Report notes that Claude Code adoption in the workplace grew 6x in under 12 months; most of that growth is at L2–L3, not the science-fiction tier.


🚀 What's Next

  • 🧭 Self-assess which level your team actually operates at; most teams overestimate by one rung
  • 📝 Write one real spec this week and hand it to an L3 agent, then observe the gap between intent and output
  • 🧪 Audit your test coverage before promoting any agent above L2; tests are the agentic safety net
  • 🔐 Define permission boundaries for any agent that touches your repo, infra, or secrets
  • 📚 Pair this with the discipline-adjacent reads below

Related on fundesk: Vibe Coding Explained · Context Engineering for AI Coding Agents · Spec-Driven Development with GitHub Spec Kit · AI Coding Agents and Security Risks




