Human orchestrator directing multiple autonomous AI agent processes — agentic engineering visualization

    Agentic Engineering Is Here: What Karpathy’s Naming Means for Your AI Investment


    In February 2026, Andrej Karpathy proposed retiring the term “vibe coding” and replacing it with something more precise. The replacement he suggested: agentic engineering. Within weeks, monthly searches for the term grew from a few hundred to nearly 3,000. The naming stuck because it named something real.

    This article covers what the naming shift signals for investment decisions, why the productivity numbers you’ve probably seen are more complicated than they appear, and how to tell whether a team or vendor is genuinely doing agentic engineering or just using the phrase.

    In this article:

    • What Karpathy actually said, and why the language matters for business
    • The productivity paradox: why some developers get slower with AI tools, not faster
    • Two distinct things called “agentic engineering” that require completely different evaluation criteria
    • What a production agentic system for business operations actually looks like
    • Five signs a team or vendor is doing agentic engineering vs. just claiming it
    • How to frame the budget decision for 2026

    Developer at workstation with AI agent companions under human direction — agentic engineering in practice

    What Karpathy Actually Said (And Why the Language Matters)

    The quote is worth reading in full, because it’s precise in a way that most coverage of it hasn’t been:

    “Agentic because the new default is that you are not writing the code directly 99% of the time, you are orchestrating agents who do and acting as oversight. Engineering to emphasize that there is an art & science and expertise to it.”

    Two things are happening in that sentence. First, the default mode of working has changed: instead of a developer writing code, a developer is directing agents that write code and then reviewing what comes back. Second, that orchestration takes expertise. It is not just a different interface for the same work. It’s a different discipline with its own skills, failure modes, and quality standards.

    The contrast with vibe coding is direct. Vibe coding was the early name for “give the AI a rough idea of what you want and see what it generates.” It worked well for prototypes, demos, and things that didn’t need to survive contact with reality. Agentic engineering is what you need when the output has to actually hold up.

    Addy Osmani’s practitioner piece on the topic quotes a developer describing vibe coding: “This isn’t engineering, it’s hoping.” That line captures the boundary. Vibe coding means the AI builds it and you hope it works. Agentic engineering means the AI builds it and a human who understands the system validates every step.

    For business leaders, the naming matters because it signals professional maturation. When a field gets a name that distinguishes craft from carelessness, it usually means the field is serious enough to have developed standards. That’s now true of this one.

    The Productivity Paradox Business Leaders Need to Understand

    The productivity claims for AI-assisted development have ranged from 55-88% improvement (early Copilot studies from 2023-2024) down to zero or negative. A METR study from mid-2025 found that experienced open-source developers were approximately 20% slower when using AI tools on their own codebases. The study ran 16 developers across real repositories averaging 22,000 GitHub stars, not toy projects.

    Research by Yegor Denisov-Blanch at Stanford puts the median productivity lift at 10-15%, not the 55-88% figure that circulated in early coverage.

    These numbers don’t contradict each other. They describe different situations. The high-end figures came from developers using AI on unfamiliar tasks: generating boilerplate, writing documentation, producing code in languages they knew less well. The lower or negative figures came from experienced developers working on complex codebases they already understood deeply. There, AI interrupted their flow more than it accelerated it.

    The implication for business is specific. Addy Osmani’s practitioner analysis states it directly: “Agentic engineering disproportionately benefits senior engineers. If you have deep fundamentals, you can leverage AI as a massive force multiplier.” The inverse is also true. Developers who use AI to skip fundamentals accumulate invisible debt. Code that demos fine fails six months later when something needs to change and nobody understands the underlying structure.

    According to IBM’s coverage of the Stack Overflow 2025 Developer Survey, 84% of developers use or intend to use AI-assisted programming, but only 3% say they “highly trust” AI-generated output. Experienced developers are the most skeptical: seasoned engineers reported the lowest rate of high trust (2.6%) and the highest rate of high distrust (20%). Senior engineers are adopting these tools while being the most cautious about their outputs. That’s the professional posture agentic engineering describes.

    For budget purposes, the key takeaway is this: the ROI from agentic engineering depends far more on the skill of the orchestrator than on the cost of the AI tools. A senior engineer or a team that has put in the deliberate practice required will get dramatically different results than someone who installed an AI extension and called it done. Tool cost is nearly irrelevant. The human running the system determines the outcome.

    Diagram comparing AI productivity outcomes: junior developers accumulate technical debt vs senior engineers gain compounding returns from agentic engineering

    Two Things People Call Agentic Engineering (That Are Very Different)

    One reason the term is generating confusion is that it’s being used for two distinct applications. They share a methodology but produce completely different types of value and require completely different evaluation criteria.

    The first meaning is the one Karpathy coined: an engineering team using AI agents to write, test, and refine code. The human developer orchestrates the agents, reviews outputs, sets standards, and owns the final system. This applies to software product teams building applications.

    The second meaning is newer and gets far less coverage: agents performing specific business functions end-to-end. Content production, research, data analysis, customer operations, process automation. No code is being written. Business work is being done. The orchestration discipline is the same, but the domain is operational rather than technical.

    The distinction matters practically. If you’re evaluating a software development firm’s claim to “do agentic engineering,” you should be asking about their code review processes, their testing methodology, and how they handle agent-generated code that fails quietly. If you’re evaluating a vendor claiming to use agentic engineering for business operations, you should be asking about their quality gates, their output validation processes, and what their failure response looks like.

    The skills required are also different. Agentic engineering for software development requires deep engineering fundamentals. Agentic engineering for business operations requires deep domain expertise in whatever function the agent is performing, plus the architectural knowledge to design systems that catch their own errors.

    Diagram comparing two meanings of agentic engineering: software development vs business operations — shared methodology with different domain expertise requirements

    What Agentic Engineering for Business Operations Actually Looks Like

    Search results for “agentic engineering” are full of definitional content and developer-facing advice. What’s missing is a description of what this looks like when applied to a business’s ongoing operations: not building software, but doing the work.

    Fountain City’s own agentic coding stack is a concrete example. It is applied to software development rather than content production, but it demonstrates the same engineering principles. Here is what it looks like to move a single feature from idea to merge.

    Work starts as an issue in BEADS, our issue tracker. The issue has a title, a rationale, and acceptance criteria. Not a Slack message or a sticky note. The issue is the unit of work, and it persists across sessions, machines, and agents.

    Claude Code, the directing agent, claims the issue and runs /start-task, which loads relevant prior context from the knowledge base before any design is drafted. Brainstorming then produces a plan. The plan does not get implemented yet.

    Three adversarial reviewers (Feasibility, Completeness, and Scope & Alignment) are spawned in parallel inside metaswarm, our workflow harness. All three must pass before the plan is shown to the human. If any one flags a problem, the plan goes back for revision. This gate exists because plans drafted by a single agent quietly assume away their own weaknesses.
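
    The all-must-pass gate can be sketched in a few lines of Python. This is a minimal illustration, not metaswarm’s actual implementation: the `Verdict` type, the three reviewer functions, and `review_plan` are hypothetical names, and the reviewers are simple stand-in rules where a real harness would call separate agents.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Verdict:
    reviewer: str
    passed: bool
    notes: str = ""


# Hypothetical stand-ins: in a real harness each reviewer would be a separate agent.
def feasibility(plan: str) -> Verdict:
    ok = "TODO" not in plan
    return Verdict("feasibility", ok, "" if ok else "plan contains unresolved TODOs")


def completeness(plan: str) -> Verdict:
    ok = "acceptance criteria" in plan.lower()
    return Verdict("completeness", ok, "" if ok else "no acceptance criteria listed")


def scope_alignment(plan: str) -> Verdict:
    ok = len(plan.split()) <= 500
    return Verdict("scope_alignment", ok, "" if ok else "plan exceeds agreed scope")


REVIEWERS = (feasibility, completeness, scope_alignment)


def review_plan(plan: str) -> tuple[bool, list[Verdict]]:
    """Run all reviewers in parallel; the plan advances only if every one passes."""
    with ThreadPoolExecutor(max_workers=len(REVIEWERS)) as pool:
        verdicts = list(pool.map(lambda reviewer: reviewer(plan), REVIEWERS))
    return all(v.passed for v in verdicts), verdicts
```

    Returning every verdict, not just the overall result, gives the revision step the specific objections to address before the plan is resubmitted.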

    Implementation is test-driven. Tests are written first, watched to fail, then code is written until they pass. A coverage threshold in .coverage-thresholds.json is checked mechanically. Falling under it blocks the PR.
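
    A mechanical coverage gate of this kind might look like the sketch below. The schema of `.coverage-thresholds.json` is not described here, so the code assumes a flat JSON object mapping a metric name to a minimum percentage. The function name `check_coverage` and the metric names are likewise hypothetical.

```python
import json
from pathlib import Path


def check_coverage(thresholds_path: Path, measured: dict[str, float]) -> list[str]:
    """Return a list of failures; an empty list means the gate passes."""
    # Assumed schema: {"lines": 80, "branches": 70}. The real file may differ.
    thresholds = json.loads(thresholds_path.read_text())
    failures = []
    for metric, minimum in thresholds.items():
        actual = measured.get(metric)
        if actual is None:
            # A missing measurement fails the gate rather than passing silently.
            failures.append(f"{metric}: no measurement reported")
        elif actual < minimum:
            failures.append(f"{metric}: {actual:.1f}% is below the {minimum}% threshold")
    return failures
```

    In CI, a wrapper script would exit non-zero whenever the returned list is non-empty, which is what blocks the PR. Treating a missing metric as a failure keeps the gate from being bypassed by a misconfigured coverage run.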

    Before the PR opens, Codex CLI, a separate model from a different provider, reviews the diff independently. Two model families catch different categories of mistake. Anything Codex flags goes back through the same gates.

    Once merged, /self-reflect extracts durable lessons and writes them back into the knowledge base, so the next issue benefits from this one.

    The same shape (issue, plan, adversarial review, mechanical gate, post-merge reflection) runs our content pipeline, our SEO research pipeline, and the systems we build for clients. The vocabulary changes (“article” instead of “PR,” “editorial review” instead of “code review”), but the engineering posture is identical.

    Single-agent self-review is a known weak point, and it is not unique to our setup. Anthropic’s engineering team published a reference architecture for exactly this challenge in their multi-agent harness design: a planner, generator, and evaluator in sequence. Their finding: agents that generate content “confidently praise” their own output even when quality is mediocre, tending toward self-approval rather than honest critique. The solution is architectural: separate the generator from the evaluator so that no system assesses its own work.

    We hit the same wall, in the same shape, in our content pipeline and in our coding stack. The adversarial review gates described above, with separate agents checking feasibility, completeness, and alignment before a plan advances, are our structural answer to it.
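
    The generator/evaluator separation can be illustrated with a short loop. This is a sketch under stated assumptions, not Anthropic’s code or ours: the `Generator` and `Evaluator` signatures and the function `run_with_separate_evaluator` are hypothetical, and in practice each callable would wrap a different agent or model.

```python
from typing import Callable, Optional

# Hypothetical signatures: a generator drafts from a task plus prior feedback;
# an evaluator returns None to approve or a string describing the problem.
Generator = Callable[[str, Optional[str]], str]
Evaluator = Callable[[str, str], Optional[str]]


def run_with_separate_evaluator(
    task: str, generate: Generator, evaluate: Evaluator, max_rounds: int = 3
) -> tuple[Optional[str], list[str]]:
    """Loop draft -> independent critique until approved or rounds run out."""
    feedback: Optional[str] = None
    history: list[str] = []
    for _ in range(max_rounds):
        draft = generate(task, feedback)
        feedback = evaluate(task, draft)  # the generator never grades itself
        if feedback is None:
            return draft, history
        history.append(feedback)
    return None, history  # no approved draft: escalate to a human
```

    Capping the rounds matters as much as the separation. A draft that cannot clear the evaluator after a few attempts is returned as `None` along with the critique history, so a person sees why it stalled.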

    For longer worked examples, see our case studies on the Voice Intelligence Platform (telephony + AI orchestration, zero human-written code) and the Hydraulic 3D Simulation (18,000 lines of physics code, $360 in API spend).

    The broader point is that agentic engineering for business operations is fundamentally about orchestration design. The AI capability matters, but the system design (how tasks move, how quality is assessed, how errors get caught before they propagate) is where the real engineering lives.

    The 5 Signs Your Team (or Vendor) Is Actually Doing Agentic Engineering

    Because the term is being applied broadly, it’s useful to have a practical evaluation framework. These five markers separate professional practice from label adoption.

    1. They start with a spec, not a prompt. Agentic engineering requires designing the task before AI touches it: what inputs, what outputs, what quality criteria, what failure modes. If someone jumps straight to prompting without this design phase, that’s vibe coding with extra steps, not agentic engineering.
    2. They review every output, every time, through a defined process rather than spot-checks. The human owns the output even if an agent created it. A team genuinely doing agentic engineering will have a clear answer to “what is your output review process.” A team that isn’t will talk about how good the AI is.
    3. They have quality gates, not just outputs. Results pass through defined criteria before moving to the next stage. Automated tests, structured review rubrics, or a validation step that must pass before handoff. If every stage produces output that flows directly to the next stage without validation, that’s a pipeline, not engineering.
    4. They can explain what went wrong. Production agentic systems fail. The failure stories are the proof of production experience. A practitioner running real systems can tell you how a specific run failed, why it failed, and what changed in response. If someone has no failure stories, they have no production systems.
    5. Their agents do boring work reliably. The best agentic systems are optimized for repeatability, not just capability. A system that produces impressive output occasionally is a demo. A system that produces good-enough output consistently is engineering. If every run requires significant cleanup, it’s not there yet.
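
    Several of these signs can be made concrete in code. The sketch below is illustrative only: `TaskSpec`, `gate`, and `pass_rate` are hypothetical names, and the quality criteria are whatever pass/fail checks a team defines for its own task.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class TaskSpec:
    """Sign 1: the task is designed before any agent runs."""
    name: str
    inputs: list[str]
    # Sign 3: named pass/fail checks the output must clear before handoff.
    quality_criteria: dict[str, Callable[[str], bool]]
    # Sign 4: a record of what failed, so failures can be explained later.
    failure_log: list[str] = field(default_factory=list)


def gate(spec: TaskSpec, output: str) -> bool:
    """Return True only if every criterion passes; record each failure."""
    failed = [name for name, check in spec.quality_criteria.items() if not check(output)]
    for name in failed:
        spec.failure_log.append(f"{spec.name}: failed '{name}'")
    return not failed


def pass_rate(spec: TaskSpec, outputs: list[str]) -> float:
    """Sign 5: repeatability as the share of runs that clear the gate."""
    if not outputs:
        return 0.0
    return sum(gate(spec, output) for output in outputs) / len(outputs)
```

    A team could track `pass_rate` across real runs. A system that clears its gate only occasionally is still a demo, whatever its best output looks like.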

    These questions work for evaluating internal teams and vendors equally. The answers reveal whether someone has worked through the hard parts of production deployment, or is still describing what the technology is theoretically capable of.

    What This Means for Your AI Budget in 2026

    Agentic engineering is not a tool you buy. It’s a capability you build, hire, or contract for. The AI subscriptions are a small part of the cost. The capability to orchestrate, validate, and run systems reliably is where the investment actually goes.

    Three paths forward, with real tradeoffs for each:

    Build the capability in-house. This requires hiring engineers who understand both the domain and the orchestration layer. Practitioner analysis suggests consistent productivity gains require roughly 30-100 hours of deliberate practice per person. This is not something that comes from onboarding documentation. Expect a real ramp time before the investment returns measurable value. The payoff, when it arrives, compounds: a senior engineer running agentic workflows can handle workloads that would otherwise require multiple people.

    Train your existing team. Structured training on agentic development (how to design tasks, validate outputs, and build quality gates) accelerates the learning curve significantly. This is what our agentic coding workshops are built to do: take developers who understand their domain and give them the orchestration discipline that makes their AI use productive rather than risky.

    Contract with a team already running production systems. This is the lowest-risk path if the need is immediate. The cost is real, but you’re paying for operational depth, not just AI access. The key question to ask any vendor: “Show me a production system you’ve been running for more than six months. What failed, and what did you fix?” The answer tells you more than any capability list. If you’re evaluating this path, our agentic development services are built on production systems that have been running and failing and improving for well over a year.

    On the cost question specifically: production agentic systems for business operations are not expensive to run once they’re built. The AI infrastructure cost is a fraction of what the equivalent human work would cost. The investment is in building and validating the system, not in running it, and the savings hold only after that engineering work is done correctly.

    Three illustrated paths for agentic engineering investment in 2026: build in-house, train your team, or partner with practitioners

    The Consensus Behind the Name

    Karpathy’s naming didn’t create this paradigm. It named something that was already developing. What makes early 2026 a meaningful moment is that three independent signals converged on the same conclusion within weeks of each other.

    Karpathy named the discipline from the practitioner developer community. Separately, Anthropic published a reference architecture for multi-agent systems, the planner/generator/evaluator design they developed through running production multi-hour autonomous coding sessions. And Cloudflare launched their Agents Week, announcing infrastructure specifically designed for agentic workloads, built on the premise that agents require one-to-one compute isolation that the container model can’t provide efficiently at scale.

    A prominent practitioner named the discipline. A leading AI lab published its reference architecture. A major infrastructure provider built the plumbing for it. When those three things happen independently within weeks of each other, the paradigm is established rather than emerging.

    The question for business leaders is no longer whether agentic engineering is established. The evidence is clear. The question is how quickly your organization needs to develop or access the capability, and which path gets you there most efficiently given your current team and timeline.

    FAQ

    Is agentic engineering the same as vibe coding?

    No. Vibe coding describes generating code through informal prompting without systematic validation: the AI builds something, you hope it works. Agentic engineering describes orchestrating AI agents with professional discipline: designing tasks before executing them, validating outputs systematically, and maintaining human ownership of results. Vibe coding produces prototypes. Agentic engineering produces systems that hold up.

    What skills do you need to do agentic engineering?

    For software development: deep software engineering fundamentals plus the discipline to design, validate, and own AI-generated outputs. For business operations: deep domain expertise in whatever function the agent is performing, plus architectural knowledge of how to build multi-agent systems with reliable quality gates. In both cases, senior-level mastery of the underlying domain is the prerequisite. AI amplifies that expertise; it doesn’t substitute for it.

    How long does it take to see productivity gains from agentic engineering?

    Practitioner research suggests 30-100 hours of deliberate practice before consistent gains appear. That’s per person, per domain. The gains compound over time: once the orchestration patterns are internalized, the productivity differential between AI-augmented and non-augmented work becomes substantial. Expecting immediate returns from minimal onboarding will produce disappointment, not results.

    Can agentic engineering be applied to business operations, not just software development?

    Yes. This is the use case that gets least coverage. Agents can perform specific business functions end-to-end: content production, market research, data analysis, customer operations, knowledge management, process documentation. The orchestration discipline is identical; the domain expertise required shifts to match the function. We design and deploy these systems, and the methodology is the same as for software: spec the task, validate the output, gate the handoffs.

    What’s the difference between agentic engineering and AI automation?

    AI automation describes rule-based or AI-assisted workflows where the logic is predefined and the AI fills in specific tasks within that logic. Agentic engineering involves agents that make judgment calls, handle exceptions, and operate across long-horizon tasks with minimal handholding. The boundary is blurring, but the distinction is useful: automation executes defined steps; agentic engineering handles the steps that aren’t fully defined in advance.

    How do I evaluate whether a vendor is actually doing agentic engineering?

    Ask for their failure stories. Ask how their output review process works and who is accountable for results. Ask what their quality gates look like. A vendor running production agentic systems will have specific, concrete answers, including what broke, when, and what changed. A vendor who has adopted the terminology without the practice will describe capabilities and architectures. The difference in response texture is usually clear within a few questions.