From Vibe Coding to Agentic Engineering: Karpathy's Software 3.0 Discipline
In February 2025, Andrej Karpathy named vibe coding in a throwaway tweet: "fully give in to the vibes, embrace exponentials, and forget that the code even exists." A year later, at Sequoia Ascent 2026, he renamed the serious version of the same practice: agentic engineering.
The shift isn't cosmetic. Vibe coding raised the floor: anyone could build something. Agentic engineering raises the ceiling, and demands a discipline that vibe coding deliberately abandons. This guide breaks down Karpathy's Software 3.0 framing, the maturity ladder underneath it, and how to figure out which rung your team actually belongs on.
What You'll Need
- A working AI coding tool: Cursor, Claude Code, Copilot, Aider, or similar
- One real codebase to evaluate against (not a toy project; the discipline only matters when correctness does)
- Honest answers to: who reviews the diffs, who owns the bugs, who gets paged at 3am
- 15 minutes to self-assess against the maturity ladder in section 4
Software 1.0 → 2.0 → 3.0: The Quick Recap
Karpathy's framing in his June 2025 YC AI Startup School talk introduced three software paradigms that now coexist in modern systems.
| Paradigm | Programmer writes | Computer runs | Example |
|---|---|---|---|
| Software 1.0 | Source code (Python, C++, etc.) | Compiled instructions | A Django view |
| Software 2.0 | Datasets + objectives | Trained neural network weights | An image classifier |
| Software 3.0 | Prompts, examples, context | An LLM interpreting English | A coding agent fixing a bug |
In Software 3.0, the unit of programming shifts from a function to a paragraph. The context window is the program. The LLM is the interpreter. Your repo is no longer code alone: it's code, weights, prompts, tool definitions, and humans reviewing output.
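To make "the context window is the program" concrete, here is a minimal sketch of what a Software 3.0 "program" actually is: not a function body, but an assembled context handed to a model. All names here (`build_context`, the tool names, the file excerpt) are illustrative assumptions, not any vendor's API.

```python
# A Software 3.0 "program" is the context itself: instructions, repo
# excerpts, and tool definitions, assembled for an LLM to interpret.

def build_context(task: str, files: dict[str, str], tools: list[str]) -> list[dict]:
    """Assemble the context window as a list of chat messages."""
    system = "You are a coding agent. Available tools: " + ", ".join(tools)
    # Each curated file becomes a labeled block the model can reference.
    context_blocks = [f"### {path}\n{content}" for path, content in sorted(files.items())]
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": task + "\n\n" + "\n\n".join(context_blocks)},
    ]

messages = build_context(
    task="Fix the off-by-one error in pagination.",
    files={"views.py": "def page(n): return items[n * SIZE : (n + 1) * SIZE]"},
    tools=["read_file", "run_tests"],
)
print(messages[0]["content"])
```

Everything a Software 1.0 programmer would express as code, the Software 3.0 programmer expresses as what goes into (and stays out of) this structure.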
What Vibe Coding Actually Is (and What It Isn't)
Vibe coding is the workflow where you describe what you want, an AI writes it, and you ship without reading the code carefully. It's powerful for prototypes, personal tools, throwaway scripts, and learning.
It collapses the moment that code touches real users, real data, or real money. Karpathy is blunt about why: "You can outsource your thinking, but you can't outsource your understanding."
Vibe coding's failure modes are predictable:
- Security vulnerabilities sneak in because nobody read the diff
- The codebase becomes "bloated, copy-pasted, awkwardly abstracted, brittle"
- Bug fixes break unrelated features because there's no mental model
- Onboarding a new engineer is impossible; nobody understands the code, including you
If you want the longer explainer of where vibe coding came from and where it actually shines, see Vibe Coding Explained.
Agentic Engineering: Karpathy's Definition
At Sequoia Ascent 2026, Karpathy defined agentic engineering as the discipline of building production software with fallible AI agents while preserving the engineering standards that applied before AI wrote any of the code.
Three words in that definition do all the work.
"Discipline" โ it's a practice with rules, not a vibe. Specs, reviews, tests, evals.
"Fallible" โ agents are spiky. Karpathy: "It can refactor a 100,000-line codebase or find zero-day vulnerabilities, yet tells me to walk to the car wash." Treat them as powerful interns with savant patches, not senior engineers.
"Preserving" โ the quality bar doesn't drop because AI wrote the code. Karpathy: "You are not allowed to introduce vulnerabilities because of vibe coding. You are still responsible for your software, just as before."
Floor vs. Ceiling
The cleanest way to remember the distinction:
| | Vibe Coding | Agentic Engineering |
|---|---|---|
| Goal | Raise the floor | Raise the ceiling |
| Audience | Anyone with an idea | Working engineers |
| Output | Prototypes, demos, scripts | Production systems |
| Code review | ⚠️ Optional | ✅ Mandatory |
| Tests / CI | ⚠️ Optional | ✅ Control surface |
| Spec | ❌ Skipped | ✅ Source of truth |
| Failure cost | Low | Real (security, downtime, data) |
| Engineer responsibility | Low | Unchanged from pre-AI era |
Karpathy's prediction: the agentic-engineering ceiling is far higher than the old "10x engineer" myth ever was. Practitioners who internalize the discipline will outperform by orders of magnitude, but only if they actually internalize it.
The Agentic Engineering Maturity Ladder
The most useful operationalization of Karpathy's framing is Swarmia's five-level autonomy ladder. Use it to honestly assess where your team is, and where you should aim.
| Level | Name | Human role | Agent role | Best for |
|---|---|---|---|---|
| L1 | Assistive | Accept/reject every suggestion | Single-file autocomplete | Learning a codebase |
| L2 | Conversational | Steer continuously, review every multi-file edit | Navigates repo, edits across files | Ambiguous tasks, architecture |
| L3 | Task Agent | Assign scoped work, review the PR | Plans → codes → tests → opens PR | One-sentence specs, dependency bumps |
| L4 | Autonomous Teammate | Set objectives, review outputs | Picks work from a backlog continuously | Flaky tests, doc drift, recurring chores |
| L5 | Agentic Avalanche | Set trust boundaries at the orchestrator | Multi-agent swarm, parallel subagents | Large, parallelizable refactors |
The counter-intuitive rule: higher isn't always better
A senior engineer running L1 autocomplete on a sensitive payments file is being responsible. A junior dev pointing an L4 swarm at production infrastructure is being reckless. Match the level to the task, not to your ambition.
Anthropic's own internal teams keep "80–100% active human oversight" at L2 for most work. Most professional teams should plateau at L2–L3, reach for L4 only on narrow recurring chores, and treat L5 as experimental.
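The "match the level to the task" rule can be captured as a small policy table that caps autonomy by blast radius. The task categories and thresholds below are illustrative assumptions for one hypothetical team, not a standard.

```python
# Cap agent autonomy by how much damage a bad change can do.
# These thresholds are example policy, not a published standard.
MAX_LEVEL = {
    "payments": 1,  # sensitive money paths: autocomplete only (L1)
    "auth": 2,      # security-critical: human steers every edit (L2)
    "feature": 3,   # scoped work with tests: task agent opens a PR (L3)
    "chore": 4,     # flaky tests, doc drift: autonomous teammate (L4)
}

def allowed_level(task_kind: str, requested: int) -> int:
    """Clamp the requested autonomy level to the ceiling for this kind of work."""
    return min(requested, MAX_LEVEL.get(task_kind, 2))  # unknown work defaults to L2

print(allowed_level("payments", 4))  # a swarm pointed at payments gets clamped to 1
```

The point of encoding the rule is that it removes ambition from the decision: the junior dev's L4 swarm and the senior's L1 autocomplete both get the level the task deserves.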
The Six Practices That Define the Discipline
Karpathy and the broader 2026 agentic-engineering community converge on roughly the same playbook. None of it is glamorous; all of it is what separates the discipline from the vibe.
1. Spec-driven workflow
Detailed specifications become the source of truth that agents implement against. Not a one-line prompt, but an artifact you'd be willing to hand to a contractor. See Spec-Driven Development with GitHub Spec Kit for the working pattern.
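One way to think about the difference between a prompt and a spec is structure: a spec has testable acceptance criteria and an explicit out-of-scope list. The field names below are illustrative assumptions, not GitHub Spec Kit's format.

```python
# A spec as a structured artifact rather than a one-line prompt.
from dataclasses import dataclass, field

@dataclass
class Spec:
    goal: str                                          # one sentence: the "what"
    acceptance: list[str]                              # testable criteria the PR must satisfy
    out_of_scope: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)

    def ready(self) -> bool:
        """A spec isn't handoff-ready without a goal and acceptance criteria."""
        return bool(self.goal and self.acceptance)

spec = Spec(
    goal="Paginate the /orders endpoint",
    acceptance=["returns 20 items per page", "rejects negative page numbers"],
    out_of_scope=["changing the ORM"],
)
print(spec.ready())  # True
```

The `ready()` check is the contractor test in code form: if you couldn't hand this object to a stranger, don't hand it to an agent.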
2. Context as the program
In Software 3.0 the context window is your program. Curating which files, docs, screenshots, and prior decisions the agent sees matters more than the prompt itself. See Context Engineering for AI Coding Agents for nine techniques.
3. Tests and CI as the control surface
Agentic engineering treats automated tests, linting, and CI as the safety net that makes agent output trustworthy. Vibe coding treats them as optional. This is the single biggest cultural divide between the two.
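A control surface is ultimately a merge gate: the agent's PR lands only if every automated check passes, with no override path. This is a minimal sketch; the check names are illustrative.

```python
# CI as the control surface: agent output merges only when every check exits 0.
def gate(results: dict[str, int]) -> bool:
    """Return True only if every check passed. There is no human override for agents."""
    failing = [name for name, code in results.items() if code != 0]
    if failing:
        print("blocked by:", ", ".join(failing))
    return not failing

# Exit codes as a CI runner would report them for each check.
print(gate({"pytest": 0, "ruff": 0, "mypy": 1}))  # prints "blocked by: mypy" then False
```

Vibe coding skips this function entirely; agentic engineering routes every diff through it.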
4. Mandatory diff review
Every agent-generated PR gets read by a human who could have written it themselves. Karpathy's framing: the agent is your intern, not your replacement. You sign off on the code that ships under your name.
5. Adversarial testing
Karpathy's Twitter-clone-for-agents example: build the system, then have agents simulate adversarial activity against it. If the agent wrote the auth, another agent should try to break it.
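In miniature, adversarial testing looks like a red-team harness throwing hostile inputs at whatever the first agent built. The `is_authorized` function below is a deliberately naive stand-in, and the payload list is illustrative, not a real attack corpus.

```python
# Red-team harness: one agent wrote the check, this harness tries to break it.

def is_authorized(token: str) -> bool:
    # Deliberately naive implementation under test.
    return token.startswith("user:") and token != ""

ADVERSARIAL_TOKENS = ["", "user:", "USER:admin", "user:\x00root", "user:' OR 1=1 --"]

def red_team(check) -> list[str]:
    """Return every adversarial token the check wrongly accepts."""
    return [t for t in ADVERSARIAL_TOKENS if check(t)]

print(red_team(is_authorized))  # every accepted token here is a finding
```

A non-empty result is the whole point: the harness surfaces holes ("user:" with no username, null-byte smuggling) before an attacker does.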
6. Permission and isolation boundaries
What the agent can read, write, run, and reach over the network is part of the architecture, not an afterthought. The whole topic deserves its own deep dive; see AI Coding Agents and Security Risks.
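Treating boundaries as architecture means every tool call passes through an explicit, deny-by-default policy before it runs. This is a sketch; the policy shape, paths, and tool names are illustrative assumptions.

```python
# Permission boundary check: deny by default, allow only declared targets.
from pathlib import PurePosixPath

POLICY = {
    "read":  ["src/", "tests/", "docs/"],
    "write": ["src/", "tests/"],
    "run":   ["pytest", "ruff"],   # the only commands the agent may execute
}

def permitted(action: str, target: str) -> bool:
    """Return True only if the target falls inside the declared boundary."""
    allowed = POLICY.get(action, [])
    if action == "run":
        return target in allowed
    return any(PurePosixPath(target).as_posix().startswith(p) for p in allowed)

print(permitted("write", "src/app.py"))  # True: inside the boundary
print(permitted("write", ".env"))        # False: secrets are outside it
```

The design choice worth copying is the default: an action absent from the policy is denied, so forgetting to configure something fails closed rather than open.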
What Stays Human (And Always Will)
Karpathy is precise about which parts of engineering you cannot delegate to an agent, even at L5:
- Aesthetics and taste: what "good" looks like for this product
- System-level tradeoffs: storage vs. compute, latency vs. cost, simplicity vs. flexibility
- Security boundaries: what data crosses which trust line
- Knowing when the model is out-of-distribution: when the agent is bullshitting confidently
- Choosing what to build. Karpathy: "I am becoming the bottleneck of even knowing what we are trying to build, why it is worth doing, and how to direct my agents."
The skills that matter in 2026 are the skills that were always load-bearing; they're just no longer hidden underneath the typing.
How to Move Up the Ladder Without Falling Off
A practical migration path for a team currently doing ad-hoc vibe coding on production code:
| From | Move | Why |
|---|---|---|
| Vibe at L3+ | Drop to L2 immediately | Stop the bleeding; rebuild trust through reviewed diffs |
| L2 with no tests | Invest in CI before L3 | L3+ is unsafe without a control surface |
| L2 with tests | Pilot L3 on one scoped task type | Dependency bumps and bug-with-clear-repro are the safe entry |
| L3 working well | Add specs as the L3 input | Specs are what unlock reliable handoff |
| L3 across the team | Try L4 on recurring chores only | Flaky tests, doc updates; never feature work |
| L4 on chores | Treat L5 as a research project | Multi-agent orchestration is still bleeding edge in 2026 |
Troubleshooting the Transition
| Problem | Likely cause | Fix |
|---|---|---|
| Agent "finishes" but the code is wrong | Spec was vague; you're at L3 with L1 inputs | Write a real spec; run at L2 until specs improve |
| PRs pass tests but break in prod | Tests don't cover the surface that matters | Treat test coverage as a prerequisite, not a metric |
| Reviews take longer than just writing it | Agent generated 2000 lines when 200 would do | Constrain scope in the prompt; reject sprawling diffs |
| Security regression introduced | Permission boundaries weren't defined | Define what the agent can read/write/run before the task |
| Team velocity dropped after adopting agents | Operating one level too high for the work | Move down a level on the ladder |
The Cost Math (Roughly)
Across 2026 reporting, the agentic-engineering cost curve looks like this:
| Tier | Tools | Approx. monthly cost / engineer | Typical level |
|---|---|---|---|
| Hobby | Free Copilot tier, Aider + own API key | $0–20 | L1–L2 |
| Pro | Cursor Pro, Claude Pro, Copilot Pro | $20–40 | L2–L3 |
| Team | Cursor Business, Claude Code, Copilot Business | $40–100 | L3 |
| Heavy agentic | Claude Code + multiple parallel agents + API spend | $200–600 | L3–L4 |
| Multi-agent R&D | Custom orchestrators, large API budgets | $1000+ | L4–L5 |
The Anthropic 2026 Agentic Coding Trends Report notes Claude Code adoption grew 6x in workplace usage in under 12 months; most of that growth is at L2–L3, not the science-fiction tier.
What's Next
- Self-assess which level your team actually operates at; most teams overestimate by one rung
- Write one real spec this week and hand it to an L3 agent; observe the gap between intent and output
- Audit your test coverage before promoting any agent above L2; tests are the agentic safety net
- Define permission boundaries for any agent that touches your repo, infra, or secrets
- Pair this with the discipline-adjacent reads below
Related on fundesk: Vibe Coding Explained · Context Engineering for AI Coding Agents · Spec-Driven Development with GitHub Spec Kit · AI Coding Agents and Security Risks