From Vibe Coding to Agentic Engineering: Karpathy's Software 3.0 Discipline

In February 2025, Andrej Karpathy coined the term vibe coding in a throwaway tweet: "fully give in to the vibes, embrace exponentials, and forget that the code even exists." A year later, at Sequoia Ascent 2026, he gave the serious version of the same practice a new name: agentic engineering.

The shift isn't cosmetic. Vibe coding raised the floor: anyone could build something. Agentic engineering raises the ceiling, and it demands a discipline that vibe coding deliberately abandons. This guide breaks down Karpathy's Software 3.0 framing, the maturity ladder underneath it, and how to figure out which rung your team actually belongs on.


📋 What You'll Need

  • A working AI coding tool: Cursor, Claude Code, Copilot, Aider, or similar
  • One real codebase to evaluate against (not a toy project; the discipline only matters when correctness does)
  • Honest answers to: who reviews the diffs, who owns the bugs, who gets paged at 3am
  • 15 minutes to self-assess against the maturity ladder below

🧠 Software 1.0 → 2.0 → 3.0: The Quick Recap

In his June 2025 YC AI Startup School talk, Karpathy introduced three software paradigms that now coexist in modern systems.

| Paradigm | Programmer writes | Computer runs | Example |
|---|---|---|---|
| Software 1.0 | Source code (Python, C++, etc.) | Compiled instructions | A Django view |
| Software 2.0 | Datasets + objectives | Trained neural network weights | An image classifier |
| Software 3.0 | Prompts, examples, context | An LLM interpreting English | A coding agent fixing a bug |

In Software 3.0, the unit of programming shifts from a function to a paragraph. The context window is the program. The LLM is the interpreter. Your repo is no longer code alone; it's code, weights, prompts, tool definitions, and humans reviewing output.

💡 Why this matters: Software 3.0 isn't replacing 1.0 or 2.0. It's a third layer that sits on top, and the engineer's new job is deciding which layer should handle which part of the work.
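
To make the three-layer picture concrete, here's a minimal sketch (not from Karpathy's talk) of the same task expressed in each paradigm. The `spam_model` and `llm` objects are hypothetical stand-ins, not any particular library:

```python
# Software 1.0: the programmer writes explicit logic; the computer executes it.
def is_spam_v1(email: str) -> bool:
    return "limited time offer" in email.lower()

# Software 2.0: the programmer curates data and an objective; the "program" is learned weights.
def is_spam_v2(email: str, spam_model) -> bool:
    return spam_model.predict(email) > 0.5  # spam_model is a hypothetical trained classifier

# Software 3.0: the programmer writes English; an LLM interprets it at runtime.
def is_spam_v3(email: str, llm) -> bool:
    prompt = f"Answer yes or no: is the following email spam?\n\n{email}"
    return llm.complete(prompt).strip().lower().startswith("yes")  # llm is a hypothetical client
```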

🌀 What Vibe Coding Actually Is (and What It Isn't)

Vibe coding is the workflow where you describe what you want, an AI writes it, and you ship without reading the code carefully. It's powerful for prototypes, personal tools, throwaway scripts, and learning.

It collapses the moment that code touches real users, real data, or real money. Karpathy is blunt about why: "You can outsource your thinking, but you can't outsource your understanding."

Vibe coding's failure modes are predictable:

  • Security vulnerabilities sneak in because nobody read the diff
  • The codebase becomes "bloated, copy-pasted, awkwardly abstracted, brittle"
  • Bug fixes break unrelated features because there's no mental model
  • Onboarding a new engineer is impossible because nobody understands the code, including you

If you want the longer explainer of where vibe coding came from and where it actually shines, see Vibe Coding Explained.


๐Ÿ› ๏ธ Agentic Engineering: Karpathy's Definition

At Sequoia Ascent 2026, Karpathy defined agentic engineering as:

๐ŸŽฏ "The professional discipline of coordinating fallible agents while preserving correctness, security, taste, and maintainability."

Three words in that definition do all the work.

"Discipline" โ€” it's a practice with rules, not a vibe. Specs, reviews, tests, evals.

"Fallible" โ€” agents are spiky. Karpathy: "It can refactor a 100,000-line codebase or find zero-day vulnerabilities, yet tells me to walk to the car wash." Treat them as powerful interns with savant patches, not senior engineers.

"Preserving" โ€” the quality bar doesn't drop because AI wrote the code. Karpathy: "You are not allowed to introduce vulnerabilities because of vibe coding. You are still responsible for your software, just as before."

Floor vs. Ceiling

The cleanest way to remember the distinction:

| | Vibe Coding | Agentic Engineering |
|---|---|---|
| Goal | Raise the floor | Raise the ceiling |
| Audience | Anyone with an idea | Working engineers |
| Output | Prototypes, demos, scripts | Production systems |
| Code review | ⚠️ Optional | ✅ Mandatory |
| Tests / CI | ⚠️ Optional | ✅ Control surface |
| Spec | ❌ Skipped | ✅ Source of truth |
| Failure cost | Low | Real (security, downtime, data) |
| Engineer responsibility | Low | Unchanged from pre-AI era |

Karpathy's prediction: the agentic-engineering ceiling is far higher than the old "10x engineer" myth ever was. Practitioners who internalize the discipline will outperform by orders of magnitude, but only if they actually internalize it.


📈 The Agentic Engineering Maturity Ladder

The most useful operationalization of Karpathy's framing is Swarmia's five-level autonomy ladder. Use it to honestly assess where your team is, and where you should aim.

| Level | Name | Human role | Agent role | Best for |
|---|---|---|---|---|
| L1 | 🪶 Assistive | Accept/reject every suggestion | Single-file autocomplete | Learning a codebase |
| L2 | 💬 Conversational | Steer continuously, review every multi-file edit | Navigates repo, edits across files | Ambiguous tasks, architecture |
| L3 | 🎯 Task Agent | Assign scoped work, review the PR | Plans → codes → tests → opens PR | One-sentence specs, dependency bumps |
| L4 | 🤖 Autonomous Teammate | Set objectives, review outputs | Picks work from a backlog continuously | Flaky tests, doc drift, recurring chores |
| L5 | 🌊 Agentic Avalanche | Set trust boundaries at the orchestrator | Multi-agent swarm, parallel subagents | Large, parallelizable refactors |

The counter-intuitive rule: higher isn't always better

A senior engineer running L1 autocomplete on a sensitive payments file is being responsible. A junior dev pointing an L4 swarm at production infrastructure is being reckless. Match the level to the task, not to your ambition.

Anthropic's own internal teams keep "80–100% active human oversight" at L2 for most work. Most professional teams should plateau at L2–L3, reach for L4 only on narrow recurring chores, and treat L5 as experimental.

โš ๏ธ Common mistake: Teams adopt Cursor or Claude Code, immediately try to operate at L4 ("just build the feature"), and ship bugs they don't understand. The skill isn't unlocking higher levels โ€” it's choosing the right level on purpose.

🧱 The Six Practices That Define the Discipline

Karpathy and the broader 2026 agentic-engineering community converge on roughly the same playbook. None of it is glamorous; all of it is what separates the discipline from the vibe.

1. Spec-driven workflow

Detailed specifications become the source of truth that agents implement against. Not a one-line prompt, but an artifact you'd be willing to hand to a contractor. See Spec-Driven Development with GitHub Spec Kit for the working pattern.
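
As a rough illustration (this is not the Spec Kit format), a spec detailed enough to hand to an agent looks closer to this than to a one-line prompt:

```python
# An illustrative spec structure that a team might hand to an L3 task agent.
# The fields and task are invented for the sketch; the point is the level of detail.
spec = {
    "title": "Rate-limit the password-reset endpoint",
    "motivation": "Credential-stuffing attempts observed in access logs.",
    "behavior": [
        "Max 5 reset requests per email address per hour.",
        "Return HTTP 429 with a Retry-After header when the limit is hit.",
        "Limits use a rolling window, not a fixed hourly boundary.",
    ],
    "non_goals": ["No CAPTCHA", "No changes to the reset email template"],
    "acceptance_tests": [
        "6th request within an hour returns 429.",
        "Requests from a different email address are unaffected.",
    ],
    "files_likely_touched": ["auth/views.py", "auth/ratelimit.py", "tests/test_reset.py"],
}
```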

2. Context as the program

In Software 3.0 the context window is your program. Curating which files, docs, screenshots, and prior decisions the agent sees matters more than the prompt itself. See Context Engineering for AI Coding Agents for nine techniques.
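
A minimal sketch of what deliberate curation can look like in practice; `run_agent` and the file paths are hypothetical placeholders, and real tools each have their own way of attaching context:

```python
# Deliberate context assembly before invoking an agent: the context IS the program.
from pathlib import Path

def build_context(task: str, files: list[str], decisions: list[str]) -> str:
    """Assemble exactly what the agent should see: the task, curated files, prior decisions."""
    parts = [f"## Task\n{task}"]
    parts.append("## Prior decisions\n" + "\n".join(f"- {d}" for d in decisions))
    for f in files:
        parts.append(f"## {f}\n{Path(f).read_text()}")  # curated files only, not the whole repo
    return "\n\n".join(parts)

# context = build_context(
#     task="Add a rolling-window rate limit to the password-reset endpoint (see spec).",
#     files=["auth/views.py", "auth/ratelimit.py"],
#     decisions=["We use Redis for counters.", "429 responses must include Retry-After."],
# )
# run_agent(context)  # hypothetical call into whatever coding agent you use
```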

3. Tests and CI as the control surface

Agentic engineering treats automated tests, linting, and CI as the safety net that makes agent output trustworthy. Vibe coding treats them as optional. This is the single biggest cultural divide between the two.
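
For instance, the acceptance criteria from the earlier spec sketch can become tests the agent's PR must pass before anyone even reviews it. `RollingWindowLimiter` is a hypothetical module the agent is asked to produce, not an existing library:

```python
# Tests as the control surface: the spec's acceptance criteria as executable checks.
from auth.ratelimit import RollingWindowLimiter  # hypothetical module named in the spec

def test_sixth_request_within_an_hour_is_rejected():
    limiter = RollingWindowLimiter(limit=5, window_seconds=3600)
    for _ in range(5):
        assert limiter.allow("user@example.com")
    assert not limiter.allow("user@example.com")

def test_other_emails_are_unaffected():
    limiter = RollingWindowLimiter(limit=5, window_seconds=3600)
    for _ in range(5):
        limiter.allow("user@example.com")
    assert limiter.allow("someone-else@example.com")
```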

4. Mandatory diff review

Every agent-generated PR gets read by a human who could have written it themselves. Karpathy's framing: the agent is your intern, not your replacement. You sign off on the code that ships under your name.

5. Adversarial testing

Karpathy's Twitter-clone-for-agents example: build the system, then have agents simulate adversarial activity against it. If the agent wrote the auth, another agent should try to break it.
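
A hedged sketch of what that can look like in code; the brief and the `attacker_llm` client are hypothetical, and real red-team setups vary widely:

```python
# Illustrative red-team pass: a second agent probes what the first agent built.
ATTACK_BRIEF = (
    "You are a security red-teamer. Target: the password-reset flow described below.\n"
    "Attempt auth bypass, rate-limit evasion, and token reuse. For each finding,\n"
    "output a reproducible request sequence."
)

def red_team(target_description: str, attacker_llm) -> list[str]:
    """Ask an attacker agent for concrete exploit attempts to replay against staging."""
    response = attacker_llm.complete(ATTACK_BRIEF + "\n\n" + target_description)
    return [line for line in response.splitlines() if line.strip()]

# Each attempt gets replayed against a staging deployment; anything that succeeds
# becomes a regression test before the fix ships.
```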

6. Permission and isolation boundaries

What the agent can read, write, run, and reach over the network is part of the architecture, not an afterthought. The whole topic deserves its own deep dive; see AI Coding Agents and Security Risks.
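
As a rough illustration of treating the boundary as architecture, the policy can be written down as data and checked before any agent action. The paths and allow-lists here are assumptions, and every agent runtime has its own configuration format:

```python
# An illustrative permission boundary expressed as data (not any specific tool's config).
AGENT_POLICY = {
    "read":    ["src/", "tests/", "docs/"],
    "write":   ["src/", "tests/"],
    "run":     ["pytest", "ruff", "mypy"],
    "network": [],                        # no outbound network access by default
    "deny":    [".env", "secrets/", "infra/prod/"],
}

def is_write_allowed(path: str) -> bool:
    """Check a proposed file edit against the policy before applying it."""
    if any(path.startswith(d) for d in AGENT_POLICY["deny"]):
        return False
    return any(path.startswith(w) for w in AGENT_POLICY["write"])

assert is_write_allowed("src/auth/ratelimit.py")
assert not is_write_allowed("secrets/api_keys.json")
```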


🔄 What Stays Human (And Always Will)

Karpathy is precise about which parts of engineering you cannot delegate to an agent, even at L5:

  • Aesthetics and taste: what "good" looks like for this product
  • System-level tradeoffs: storage vs. compute, latency vs. cost, simplicity vs. flexibility
  • Security boundaries: what data crosses which trust line
  • Knowing when the model is out-of-distribution: when the agent is bullshitting confidently
  • Choosing what to build: in Karpathy's words, "I am becoming the bottleneck of even knowing what we are trying to build, why it is worth doing, and how to direct my agents."

The skills that matter in 2026 are the skills that were always load-bearing; they're just no longer hidden underneath the typing.


🚦 How to Move Up the Ladder Without Falling Off

A practical migration path for a team currently doing ad-hoc vibe coding on production code:

| From | Move | Why |
|---|---|---|
| Vibe coding at L3+ | Drop to L2 immediately | Stop the bleeding; rebuild trust through reviewed diffs |
| L2 with no tests | Invest in CI before L3 | L3+ is unsafe without a control surface |
| L2 with tests | Pilot L3 on one scoped task type | Dependency bumps and bugs with a clear repro are the safe entry |
| L3 working well | Add specs as the L3 input | Specs are what unlock reliable handoff |
| L3 across the team | Try L4 on recurring chores only | Flaky tests, doc updates; never feature work |
| L4 on chores | Treat L5 as a research project | Multi-agent orchestration is still bleeding edge in 2026 |

💡 Heuristic: If you can't write the one-sentence spec the agent will execute against, you're not ready to be at L3. Drop to L2 and keep steering.

🩺 Troubleshooting the Transition

| Problem | Likely cause | Fix |
|---|---|---|
| Agent "finishes" but the code is wrong | Spec was vague; you're at L3 with L1 inputs | Write a real spec; run at L2 until specs improve |
| PRs pass tests but break in prod | Tests don't cover the surface that matters | Treat test coverage as a prerequisite, not a metric |
| Reviews take longer than just writing it | Agent generated 2000 lines when 200 would do | Constrain scope in the prompt; reject sprawling diffs (see the sketch below) |
| Security regression introduced | Permission boundaries weren't defined | Define what the agent can read/write/run before the task |
| Team velocity dropped after adopting agents | Operating one level too high for the work | Move down a level on the ladder |
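
For the "reject sprawling diffs" fix, a small pre-merge guard is one hedged option. It shells out to git, and the 400-line threshold is an arbitrary assumption to tune per repo:

```python
# Minimal pre-merge guard that rejects sprawling agent diffs.
import subprocess
import sys

MAX_CHANGED_LINES = 400  # arbitrary threshold; tune per repo

def changed_lines(base: str = "origin/main") -> int:
    """Count added + deleted lines in the current branch relative to base."""
    out = subprocess.run(
        ["git", "diff", "--numstat", base],
        capture_output=True, text=True, check=True,
    ).stdout
    total = 0
    for line in out.splitlines():
        added, deleted, _path = line.split("\t", 2)
        if added != "-":                      # binary files report "-"
            total += int(added) + int(deleted)
    return total

if __name__ == "__main__":
    n = changed_lines()
    if n > MAX_CHANGED_LINES:
        sys.exit(f"Diff is {n} changed lines (limit {MAX_CHANGED_LINES}); split the PR.")
    print(f"Diff size OK: {n} changed lines.")
```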

💰 The Cost Math (Roughly)

Across 2026 reporting, the agentic-engineering cost curve looks like this:

| Tier | Tools | Approx. monthly cost / engineer | Typical level |
|---|---|---|---|
| 🆓 Hobby | Free Copilot tier, Aider + own API key | $0–20 | L1–L2 |
| Pro | Cursor Pro, Claude Pro, Copilot Pro | $20–40 | L2–L3 |
| Team | Cursor Business, Claude Code, Copilot Business | $40–100 | L3 |
| Heavy agentic | Claude Code + multiple parallel agents + API spend | $200–600 | L3–L4 |
| Multi-agent R&D | Custom orchestrators, large API budgets | $1000+ | L4–L5 |

The Anthropic 2026 Agentic Coding Trends Report notes that Claude Code adoption in the workplace grew 6x in under 12 months; most of that growth is at L2–L3, not the science-fiction tier.


🚀 What's Next

  • 🧭 Self-assess which level your team actually operates at; most teams overestimate by one rung
  • 📝 Write one real spec this week and hand it to an L3 agent, then observe the gap between intent and output
  • 🧪 Audit your test coverage before promoting any agent above L2; tests are the agentic safety net
  • 🔐 Define permission boundaries for any agent that touches your repo, infra, or secrets
  • 📚 Pair this with the discipline-adjacent reads below

Related on fundesk: Vibe Coding Explained · Context Engineering for AI Coding Agents · Spec-Driven Development with GitHub Spec Kit · AI Coding Agents and Security Risks




