Human orchestrator directing multiple autonomous AI agent processes — agentic engineering visualization

    Agentic Engineering Is Here: What Karpathy’s Naming Means for Your AI Investment


    Your team adopted AI coding tools six months ago. Are they actually faster?

    If the answer is ambiguous, you’re in good company. The productivity claims for AI-assisted development have ranged from 55-88% improvement (early Copilot studies) down to negative results for experienced engineers working on codebases they know well. The gap between those numbers isn’t a measurement error. It describes two different situations, and the difference shapes every AI investment decision.

In February 2026, Andrej Karpathy gave this gap a name. He proposed retiring the term “vibe coding” and replacing it with something more precise: agentic engineering. Within weeks, monthly searches for the term grew from a few hundred to nearly 3,000. The naming stuck because the discipline behind it has its own skills, failure modes, and quality standards, distinct from both traditional software engineering and casual AI prompting.

    Developer at workstation with AI agent companions under human direction — agentic engineering in practice

    What Karpathy Actually Said (And Why the Language Matters)

    Karpathy’s framing:

    “Agentic because the new default is that you are not writing the code directly 99% of the time, you are orchestrating agents who do and acting as oversight. Engineering to emphasize that there is an art & science and expertise to it.”

Two things are happening in that sentence. First, the default mode of working has changed: instead of a developer writing code, a developer is directing agents that write code and then reviewing what comes back. Second, that orchestration takes expertise. It is not just a different interface for the same work; it is a different discipline.

    Vibe coding was the early name for “give the AI a rough idea of what you want and see what it generates.” It worked well for prototypes, demos, and things that didn’t need to survive contact with reality. Agentic engineering is what you need when the output has to actually hold up.

    When a field gets a name that distinguishes craft from carelessness, it usually means the field is serious enough to have developed standards. That’s now true of this one.

    The Productivity Paradox Business Leaders Need to Understand

Those numbers deserve a closer look. The early Copilot studies from 2023-2024 reported 55-88% improvements; later work found zero or negative results. A METR study from mid-2025 found that experienced open-source developers were approximately 20% slower when using AI tools on their own codebases. The study followed 16 developers working in real repositories averaging 22,000 GitHub stars, not toy projects.

    Research by Yegor Denisov-Blanch at Stanford puts the median productivity lift at 10-15%, not the 55-88% figure that circulated in early coverage.

    These numbers don’t contradict each other. They describe different situations. The high-end figures came from developers using AI on unfamiliar tasks: generating boilerplate, writing documentation, producing code in languages they knew less well. The lower or negative figures came from experienced developers working on complex codebases they already understood deeply. There, AI interrupted their flow more than it accelerated it.

    Addy Osmani’s practitioner analysis states it directly: “Agentic engineering disproportionately benefits senior engineers. If you have deep fundamentals, you can leverage AI as a massive force multiplier.” The inverse is also true. Developers who use AI to skip fundamentals accumulate invisible debt. Code that demos fine fails six months later when something needs to change and nobody understands the underlying structure.

According to IBM’s coverage of the Stack Overflow 2025 Developer Survey, 84% of developers use or intend to use AI-assisted programming, but only 3% say they “highly trust” AI-generated output. Seasoned engineers reported the lowest rate of high trust (2.6%) and the highest rate of high distrust (20%). The developers best positioned to use these tools well are also the most skeptical of what the tools produce, and that caution is itself a core agentic engineering practice.

    ROI from agentic engineering depends far more on the skill of the orchestrator than on the cost of the AI tools. A senior engineer or a team that has put in the deliberate practice required will get dramatically different results than someone who installed an AI extension and called it done. Tool cost is nearly irrelevant. The human running the system determines the outcome.

    Diagram comparing AI productivity outcomes: junior developers accumulate technical debt vs senior engineers gain compounding returns from agentic engineering

    Two Things People Call Agentic Engineering (That Are Very Different)

    The term is being used for two distinct applications. They share a methodology but produce different value and require different evaluation criteria.

    The first meaning is the one Karpathy coined: an engineering team using AI agents to write, test, and refine code. The human developer orchestrates the agents, reviews outputs, sets standards, and owns the final system. This applies to software product teams building applications.

    The second meaning is newer and gets far less coverage: agents performing specific business functions end-to-end. Content production, research, data analysis, customer operations, process automation. No code is being written. Business work is being done. The orchestration discipline is the same, but the domain is operational rather than technical.

    If you’re evaluating a software development firm’s claim to “do agentic engineering,” you should be asking about their code review processes, their testing methodology, and how they handle agent-generated code that fails quietly. If you’re evaluating a vendor claiming to use agentic engineering for business operations, you should be asking about their quality gates, their output validation processes, and what their failure response looks like.

    The skills required are also different. Agentic engineering for software development requires deep engineering fundamentals. Agentic engineering for business operations requires deep domain expertise in whatever function the agent is performing, plus the architectural knowledge to design systems that catch their own errors.

    Diagram comparing two meanings of agentic engineering: software development vs business operations — shared methodology with different domain expertise requirements

    What Agentic Engineering for Business Operations Actually Looks Like

    Most coverage of agentic engineering is developer-facing. The same discipline applies to ongoing business operations, and one worked example is the pipeline that produced this article.

    The article you are reading started as a content brief produced by our SEO research agent. The brief contained a target keyword cluster, a competitive analysis of the top ten SERP results, and a set of source links to anchor factual claims. The brief is the spec. Without it, the writing agent would be generating content from vibes, not from data. The task is designed before the agent touches it.

    Once the brief was approved, the writing agent loaded it along with the company’s brand voice rules, positioning documents, and recent article history. The agent writes a first draft, but the draft does not go to the human yet. It passes through a self-review stage where the same agent evaluates the draft against the voice guide, checking for banned patterns (guru framing, AI-sounding repetition, dramatic setups), verifying that every specific claim has a source, and flagging sections that feel thin. The review generates a report.
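To make the self-review stage concrete, here is a minimal sketch of the shape of that check. The banned patterns and report fields are illustrative stand-ins, not our actual voice guide, and the real stage is model-driven rather than regex-driven; the sketch shows only the mechanical structure of the report it produces.

```python
import re
from dataclasses import dataclass, field

# Illustrative stand-ins only; the real voice guide is model-evaluated.
BANNED_PATTERNS = [
    r"\bgame[- ]chang(er|ing)\b",            # guru framing
    r"\bin today's fast-paced world\b",      # AI-sounding filler
    r"\bunlock(ing)? the (power|potential)\b",
]

@dataclass
class ReviewReport:
    banned_hits: list[str] = field(default_factory=list)
    unsourced_claims: list[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        return not self.banned_hits and not self.unsourced_claims

def self_review(draft: str, claim_sources: dict[str, str | None]) -> ReviewReport:
    """Check a draft against voice rules and flag claims without sources."""
    report = ReviewReport()
    for pattern in BANNED_PATTERNS:
        report.banned_hits += [m.group(0) for m in re.finditer(pattern, draft, re.IGNORECASE)]
    # Every specific claim must carry a source; flag the ones that don't.
    report.unsourced_claims = [claim for claim, src in claim_sources.items() if src is None]
    return report
```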

Anthropic’s research on multi-agent harnesses surfaces the same pattern: when an agent is asked to evaluate work it produced, it tends toward confident self-approval rather than honest critique. Their engineering team published a reference architecture for this exact challenge (a planner, a generator, and an evaluator in sequence), and their finding was blunt: agents that generate content “confidently praise” their own output even when quality is mediocre. The solution is architectural: separate the generator from the evaluator so they’re not the same system assessing its own work.

    In our pipeline, the structural answer to this problem is adversarial review. After self-review, the draft goes to a separate review stage that evaluates it from a different angle: not “does this match the voice guide” but “does this article add something new that a reader couldn’t get from the other nine results on the SERP.” A single agent reviewing its own work will miss things. Two stages with different evaluation criteria catch more. The generator and the evaluator have to be structurally separate.
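As an illustration of that separation, here is a minimal sketch. `call_model` is a hypothetical stand-in for whatever LLM client you use, and the prompts are illustrative, not our production prompts; the point is that the adversarial reviewer runs with different instructions and without the generator’s context.

```python
def call_model(system_prompt: str, user_prompt: str) -> str:
    raise NotImplementedError("hypothetical stand-in: wire up your LLM client")

GENERATOR = "Write the article draft described by this brief, in the brand voice."
SELF_REVIEW = "Evaluate this draft against the voice guide. List every violation."
ADVERSARIAL = (
    "You did not write this draft. Answer one question only: does it add "
    "something a reader could not get from the other top results for this query?"
)

def draft_and_review(brief: str) -> tuple[str, str, str]:
    draft = call_model(GENERATOR, brief)
    voice_report = call_model(SELF_REVIEW, draft)    # same agent's criteria
    novelty_report = call_model(ADVERSARIAL, draft)  # different criteria, no shared context
    return draft, voice_report, novelty_report
```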

    Once the review passes, the human editor, Sebastian in our case, reads the final draft. He approves, requests changes, or rejects. The human owns the output even though an agent produced the draft. The approval is not a formality. Articles come back with revision instructions regularly, and the revision loop runs until the human is satisfied.

    The article then moves through art direction (image generation based on brand visual guidelines), deduplication checking (ensuring this article doesn’t repeat the same proof points as the last three published pieces), and finally publication to WordPress. At each stage, defined quality gates determine whether the article advances or goes back. The article doesn’t flow forward because someone clicked approve. It flows forward because it passed a mechanical check.
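A sketch of the gate mechanics, under assumed names (the stage list, `revise`, and the check callables are all hypothetical): each stage’s check must return a pass before the article advances, and a failure routes it back with the report attached.

```python
STAGES = ["draft", "self_review", "adversarial_review", "art", "dedupe", "publish"]

def revise(article, notes):
    raise NotImplementedError("hypothetical: send the notes back to the writing agent")

def run_pipeline(article, checks, max_attempts=5):
    """checks maps a stage name to a callable returning (passed, notes)."""
    stage, attempts = 0, 0
    while STAGES[stage] != "publish":
        passed, notes = checks[STAGES[stage]](article)
        if passed:
            stage += 1                        # gate open: advance mechanically
        else:
            attempts += 1
            if attempts > max_attempts:
                raise RuntimeError("gates not cleared; escalate to the human editor")
            article = revise(article, notes)  # gate closed: back with instructions
            stage = 0                         # rerun from the draft stage
    return article
```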

This is one article. The same pipeline runs dozens of pieces per month. The same architectural shape (spec, generate, review, gate, publish) runs our software development pipeline, our SEO research, and the systems we build for clients. The vocabulary changes (“article” instead of “PR,” “editorial review” instead of “code review”), but the engineering posture is identical.

    For longer worked examples, see our case studies on the Voice Intelligence Platform (telephony + AI orchestration, zero human-written code) and the Hydraulic 3D Simulation (18,000 lines of physics code, $360 in API spend).

Agentic engineering for business operations is orchestration design. The AI capability matters, but the system design (how tasks move, how quality is assessed, how errors get caught before they propagate) is where the engineering lives.

    The 5 Signs Your Team (or Vendor) Is Actually Doing Agentic Engineering

    Five markers separate professional practice from label adoption:

1. They start with a spec, not a prompt. Agentic engineering requires designing the task before AI touches it: what inputs, what outputs, what quality criteria, what failure modes. If someone jumps straight to prompting without this design phase, that’s vibe coding with extra steps, not agentic engineering. (A minimal spec sketch follows this list.)
    2. They review every output every time through a defined process, not spot-checks. Systematic validation. The human owns the output even if an agent created it. A team genuinely doing agentic engineering will have a clear answer to “what is your output review process.” A team that isn’t will talk about how good the AI is.
    3. They have quality gates, not just outputs. Results pass through defined criteria before moving to the next stage. Automated tests, structured review rubrics, or a validation step that must pass before handoff. If every stage produces output that flows directly to the next stage without validation, that’s a pipeline, not engineering.
    4. They can explain what went wrong. Production agentic systems fail. The failure stories are the proof of production experience. A practitioner running real systems can tell you how a specific run failed, why it failed, and what changed in response. If someone has no failure stories, they have no production systems.
    5. Their agents do boring work reliably. The best agentic systems are optimized for repeatability, not just capability. A system that produces impressive output occasionally is a demo. A system that produces good-enough output consistently is engineering. If every run requires significant cleanup, it’s not there yet.
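The spec sketch promised in marker 1, as a hedged example. Field names and values here are illustrative, not a prescribed format; the substance is that inputs, outputs, quality criteria, and failure modes are written down before the agent runs, so the review stages have something concrete to check against.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskSpec:
    inputs: tuple[str, ...]            # what the agent receives
    outputs: tuple[str, ...]           # what it must produce
    quality_criteria: tuple[str, ...]  # checks the output must pass
    failure_modes: tuple[str, ...]     # known ways this task goes wrong

# Hypothetical values for the article pipeline described above.
ARTICLE_SPEC = TaskSpec(
    inputs=("SEO brief with keyword cluster", "brand voice rules", "recent article history"),
    outputs=("draft article", "self-review report"),
    quality_criteria=("every specific claim has a source", "no banned voice patterns"),
    failure_modes=("thin sections", "repeated proof points from recent pieces"),
)
```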

    These questions work for evaluating internal teams and vendors equally. The answers reveal whether someone has worked through the hard parts of production deployment, or is still describing what the technology is theoretically capable of.

    What This Means for Your AI Budget in 2026

    Agentic engineering is not a tool you buy. It’s a capability you build, hire, or contract for. The AI subscriptions are a small part of the cost. The capability to orchestrate, validate, and run systems reliably is where the investment goes. Three paths get you there:

    Build the capability in-house. This requires hiring engineers who understand both the domain and the orchestration layer. Practitioner analysis suggests consistent productivity gains require roughly 30-100 hours of deliberate practice per person. This is not something that comes from onboarding documentation. Expect a real ramp time before the investment returns measurable value. The payoff, when it arrives, compounds: a senior engineer running agentic workflows can handle workloads that would otherwise require multiple people. The risk: if that engineer leaves, the capability leaves with them. For companies with thin technical teams, this is the strongest argument for the other two paths.

Train your existing team. Structured training on agentic development (how to design tasks, validate outputs, and build quality gates) accelerates the learning curve significantly. This is what our agentic coding workshops are built to do: take developers who understand their domain and give them the orchestration discipline that makes their AI use productive rather than risky. Training distributes the knowledge across the team rather than concentrating it in one person, which mitigates the key-person risk.

    Contract with a team already running production systems. This is the lowest-risk path if the need is immediate. The cost is real, but you’re paying for operational depth, not just AI access. The key question to ask any vendor: “Show me a production system you’ve been running for more than six months. What failed, and what did you fix?” The answer tells you more than any capability list. If you’re evaluating this path, our agentic development services are built on production systems that have been running and failing and improving for well over a year.

Production agentic systems for business operations are not expensive to run once they’re built; the AI infrastructure cost is a fraction of what the equivalent human work would cost. The investment is in building and validating the system, not in running it, and the economics hold only after that engineering work is done correctly.

    Three illustrated paths for agentic engineering investment in 2026: build in-house, train your team, or partner with practitioners

    The Consensus Behind the Name

    Karpathy’s naming didn’t create this paradigm. It named something that was already developing. What makes early 2026 a meaningful moment is that three independent signals converged on the same conclusion within weeks of each other.

Karpathy, speaking from the practitioner developer community, named the discipline. Separately, Anthropic published a reference architecture for multi-agent systems, the planner/generator/evaluator design they developed through running production multi-hour autonomous coding sessions. And Cloudflare launched their Agents Week, announcing infrastructure specifically designed for agentic workloads, built on the premise that agents require one-to-one compute isolation that the container model can’t provide efficiently at scale.

A leading practitioner named the discipline. A leading AI lab published its reference architecture. A major infrastructure provider built the plumbing for it. When those three things happen independently in the same month, the paradigm is established rather than emerging.

Whether agentic engineering is established is no longer the question. The question is how quickly your organization needs to develop or access the capability, and which of the three paths fits your current team and timeline.

    FAQ

    Is agentic engineering the same as vibe coding?

    No. Vibe coding describes generating code through informal prompting without systematic validation: the AI builds something, you hope it works. Agentic engineering describes orchestrating AI agents with professional discipline: designing tasks before executing them, validating outputs systematically, and maintaining human ownership of results. Vibe coding produces prototypes. Agentic engineering produces systems that hold up.

    What skills do you need to do agentic engineering?

    For software development: deep software engineering fundamentals plus the discipline to design, validate, and own AI-generated outputs. For business operations: deep domain expertise in whatever function the agent is performing, plus architectural knowledge of how to build multi-agent systems with reliable quality gates. In both cases, senior-level mastery of the underlying domain is the prerequisite. AI amplifies that expertise; it doesn’t substitute for it.

    How long does it take to see productivity gains from agentic engineering?

    Practitioner research suggests 30-100 hours of deliberate practice before consistent gains appear. That’s per person, per domain. The gains compound over time: once the orchestration patterns are internalized, the productivity differential between AI-augmented and non-augmented work becomes substantial. Expecting immediate returns from minimal onboarding will produce disappointment, not results.

    Can agentic engineering be applied to business operations, not just software development?

    Yes. This is the use case that gets least coverage. Agents can perform specific business functions end-to-end: content production, market research, data analysis, customer operations, knowledge management, process documentation. The orchestration discipline is identical; the domain expertise required shifts to match the function. We design and deploy these systems, and the methodology is the same as for software: spec the task, validate the output, gate the handoffs.

    What’s the difference between agentic engineering and AI automation?

    AI automation describes rule-based or AI-assisted workflows where the logic is predefined and the AI fills in specific tasks within that logic. Agentic engineering involves agents that make judgment calls, handle exceptions, and operate across long-horizon tasks with minimal handholding. The boundary is blurring, but the distinction is useful: automation executes defined steps; agentic engineering handles the steps that aren’t fully defined in advance.
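One way to see the boundary, as a schematic sketch rather than production code (every function here is a hypothetical stub): the automation path executes a fixed sequence with AI filling one slot, while the agentic path lets the model choose the next step within limits.

```python
def extract_fields(invoice):
    raise NotImplementedError("hypothetical stub: AI-assisted extraction")

def validate(fields):
    raise NotImplementedError("hypothetical stub: fixed business rule")

def post_to_ledger(fields):
    raise NotImplementedError("hypothetical stub: fixed business rule")

def call_model(prompt):
    raise NotImplementedError("hypothetical stub: LLM call")

def automation(invoice):
    # Predefined logic; the AI fills one step, the sequence never changes.
    fields = extract_fields(invoice)
    validate(fields)
    post_to_ledger(fields)

def agentic(task, max_steps=20):
    # The agent picks actions, handles exceptions, and stops when done.
    for _ in range(max_steps):
        action = call_model(f"State: {task.state}. Choose the next action.")
        task.apply(action)
        if task.done:
            return task
    raise RuntimeError("step budget exhausted; escalate to a human")
```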

    How do I evaluate whether a vendor is actually doing agentic engineering?

    Ask for their failure stories. Ask how their output review process works and who is accountable for results. Ask what their quality gates look like. A vendor running production agentic systems will have specific, concrete answers, including what broke, when, and what changed. A vendor who has adopted the terminology without the practice will describe capabilities and architectures. The difference in response texture is usually clear within a few questions.