
    Evaluation-Led Agent Development: Five Disciplines That Separate Production from Pilot


    The gap between an agent that runs in a demo and an agent that runs in production isn’t a tooling gap or a model-capability gap. It’s a gap in discipline. The discipline that closes it is evaluation: not a QA afterthought, but the operating practice that determines whether the rest of the work ever gets used.

    Why evaluation became the production distinction

    Databricks’ State of AI Agents 2026 report (via Lovelytics’ practitioner summary) found that organizations using systematic evaluation frameworks achieve nearly 6× higher production success rates. At the practitioner tier, NAV43’s Frase/Graphed data shows that 90.3% of marketing organizations have AI agents somewhere in their stack, but only about 13% have those agents integrated into production workflows. The root cause both sources point to isn’t a poor model, framework, or orchestrator. It’s a lack of discipline in how teams build, test, and re-test their systems on the way from pilot to production.

    The academic name for this is evaluation-driven development and operations, or EDDOps in the literature. We’ll use the more practitioner-readable phrase here: evaluation-led agent development. The disciplines themselves are converging across publications. The InfoQ five-pillar framework, AWS Strands’ Cases/Experiments/Evaluators pattern, Microsoft’s online-and-offline split, and Arthur AI’s supervised-versus-unsupervised distinction all describe the same shape from different angles. Below we go deeper into where these sources overlap and add our own perspective and experience.

    P.S. For more on why the pilot stage is the failure point in the first place, our piece on why AI pilots fail covers the broader operational pattern.

    Accuracy tier is the upfront decision that cascades into everything

    Before a single test gets written, the team needs to declare the acceptable failure rate for the system. This decision cascades into every downstream choice:

    • What counts as a passing test?
    • What level of judge-human agreement is acceptable?
    • Where can spot-test sampling replace full-coverage testing?

    Some example tiers:

    • 6-sigma for finance, scientific, regulated, and medical-adjacent domains. The cost of a misstep is large or irreversible; the system needs full-coverage validation against ground truth, often including cross-checks against external sources.
    • 4-sigma for HR, general knowledge-work, and most internal productivity agents. The cost of a misstep is real but recoverable; the volume is high enough that exhaustive coverage isn’t economical.
    • 80% tier for systems designed to augment human productivity rather than replace it. Examples include a system where the AI sets up the initial custom engineering solution for a prospective client’s new project, or an automated RFP response system. Getting the human 80% of the way there at the start saves significant, measurable hours without needing to be 100% accurate.
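
    To make the tiers concrete, the sketch below turns each one into a failure budget per million runs. It assumes the conventional Six Sigma definition with a 1.5-sigma long-term shift, which the tiers above may follow only loosely. The names `dpmo_for_sigma`, `TIERS`, and `within_tier` are illustrative.

```python
from statistics import NormalDist


def dpmo_for_sigma(sigma_level: float, shift: float = 1.5) -> float:
    """Defects per million opportunities for a given sigma level.

    Assumption: the conventional Six Sigma 1.5-sigma long-term shift.
    """
    defect_rate = 1.0 - NormalDist().cdf(sigma_level - shift)
    return defect_rate * 1_000_000


# Hypothetical tier table: acceptable failures per million agent runs.
TIERS = {
    "6-sigma": dpmo_for_sigma(6.0),
    "4-sigma": dpmo_for_sigma(4.0),
    "80%": 200_000.0,  # 20% of runs may need a human to finish the work
}


def within_tier(failures: int, runs: int, tier: str) -> bool:
    """True if the observed failure rate fits inside the tier's budget."""
    return failures / runs * 1_000_000 <= TIERS[tier]


if __name__ == "__main__":
    for name, dpmo in TIERS.items():
        print(f"{name}: {dpmo:,.1f} failures per million runs")
```

    Under that convention, 6-sigma allows about 3.4 failures per million runs and 4-sigma about 6,210. This is why the 6-sigma tier pushes toward full-coverage validation while the 4-sigma tier can tolerate sampling.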

    An illustration from a recent client build: we worked on a hydraulic 3D simulation system that produced engineering models without human-written code in the loop. That system needed 4-sigma accuracy, with no error producing anything beyond a small imperfection at very small scales. So the validation method wasn’t a small gold set. It used Gemini 3.1 Pro to cross-check the output of Anthropic’s Opus against published peer-reviewed literature, followed by a substantial number of repeated generations to confirm the model stayed consistent from one generation to the next. The tier dictated the validation method, not the other way around.

    Tier choice also sets what “passing” your LLM judge means. Arize’s published target of 75–90% judge-human agreement reads as a fixed number until you notice it’s tier-dependent — a system designed to augment human productivity can live with 80% right, 20% left to the human to finalize; a 6-sigma financial system likely can’t. Name the tier first, and every other decision in the chain gets cheaper to make.


    Evaluation belongs at the harness layer, not just at the output

    An agent doesn’t fail at the output. It fails at memory, at a tool call, at a feedback loop that doesn’t terminate, at an API budget that doesn’t trigger a cutoff. An output-only eval tells you something broke. A layer-targeted eval tells you what — and that’s the difference between an alert and a fix.

    The anatomy of an agent harness breaks the harness into seven components: execution sandbox, auth and identity, memory and context, tool calls, orchestration, cost governance, and observability. Each has its own failure modes, and each gets its own test surface.

    What we’ve found running this in practice is that the test surface gets concrete quickly once the layers are named. The following list is the set of tests we think of first when planning our work:

    • Memory recall under context churn. Does the agent retrieve the right prior context when the window has been rewritten several times? Synthesize churn by injecting unrelated turns between question and answer, then measure retrieval accuracy.
    • Tool-call schema adherence. Does the agent produce tool arguments that match the declared schema, including under prompt variation? Does it always call the tools you expect? A tool-call linter at the gateway catches drift before it reaches the tool.
    • API overspend cutoffs. Does the cost-governance layer actually halt the run when the per-task budget is hit? Test by setting a deliberately low cap and confirming the cutoff fires; many systems alert without halting.
    • Feedback-loop termination. Does the agent escape a stuck state? Inject a recoverable failure (a tool that fails on the first call, succeeds on the second) and confirm the agent retries and proceeds, rather than looping or stalling without a logged failure.
    • Hallucination control gates. Where are the gates that catch fabricated outputs, and do they fire on known failure cases? Run a held-out set of prompts that are known to induce hallucination in similar systems and confirm the gates catch them.
    • Permission and policy boundaries. Does the agent attempt actions outside its authorization scope, and does the sandbox refuse correctly? When a permission is denied, does the agent recover or go into a death spiral? Test by running prompts that try to escalate, and confirm the refusal is logged and surfaced.
    • Observability completeness. Can a trace be reconstructed for any production interaction? If a failure can’t be debugged after the fact, the observability layer itself has a failure mode the evaluation needs to catch.
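
    Two of the tests in this list are small enough to sketch directly: a tool-call linter and a budget cutoff that halts rather than alerts. This is a minimal illustration, not a real gateway. The schema format, the `search_orders` tool, and every name below are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical declared schema: tool name -> {argument name: expected type}.
TOOL_SCHEMAS = {
    "search_orders": {"customer_id": str, "limit": int},
}


def lint_tool_call(name: str, args: dict) -> list[str]:
    """Return schema violations; an empty list means the call is clean."""
    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        return [f"unknown tool: {name}"]
    problems = []
    for key, expected in schema.items():
        if key not in args:
            problems.append(f"missing argument: {key}")
        elif not isinstance(args[key], expected):
            got = type(args[key]).__name__
            problems.append(f"{key}: expected {expected.__name__}, got {got}")
    problems += [f"unexpected argument: {k}" for k in args if k not in schema]
    return problems


class BudgetExceeded(RuntimeError):
    """Raised to halt a run, not merely to alert."""


@dataclass
class BudgetGuard:
    cap_usd: float
    spent_usd: float = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record a call's cost and halt once the per-task cap is exceeded."""
        self.spent_usd += cost_usd
        if self.spent_usd > self.cap_usd:
            raise BudgetExceeded(
                f"spent {self.spent_usd:.2f} > cap {self.cap_usd:.2f}"
            )


def cutoff_fires_under_low_cap() -> bool:
    """Set a deliberately low cap and confirm the run actually stops early."""
    guard = BudgetGuard(cap_usd=0.05)
    calls_made = 0
    try:
        for _ in range(10):
            guard.charge(0.02)
            calls_made += 1
    except BudgetExceeded:
        return calls_made < 10
    return False
```

    The design choice worth noting is that the guard raises an exception instead of returning a flag. The run cannot continue unless something explicitly catches the exception, which is the difference between halting and alerting.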

    None of these tests live at the output. They live at the layer where the failure originates. Output-level evals stay useful as the canary; layer-level evals are how the team fixes what the canary surfaces.
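
    The feedback-loop termination test described above can be sketched with a tool that fails a fixed number of times before succeeding. `FlakyTool` and `run_step` are hypothetical stand-ins; a real harness would wrap its own retry logic in place of `run_step`.

```python
class FlakyTool:
    """Hypothetical tool that fails a fixed number of times, then succeeds."""

    def __init__(self, failures_before_success: int):
        self.remaining_failures = failures_before_success
        self.calls = 0

    def __call__(self) -> str:
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("transient failure")
        return "ok"


def run_step(tool, max_attempts: int = 3) -> dict:
    """Stand-in for an agent's retry loop: bounded retries, every failure logged."""
    log = []
    for attempt in range(1, max_attempts + 1):
        try:
            result = tool()
            return {"status": "done", "result": result,
                    "attempts": attempt, "log": log}
        except ConnectionError as exc:
            log.append(f"attempt {attempt} failed: {exc}")
    return {"status": "gave_up", "result": None,
            "attempts": max_attempts, "log": log}


# Recoverable failure: the loop should retry once and proceed.
recovered = run_step(FlakyTool(failures_before_success=1))

# Unrecoverable failure: the loop should stop at the cap with a logged trail.
stuck = run_step(FlakyTool(failures_before_success=100))
```

    The two cases check different things. The first confirms the agent recovers; the second confirms that when it cannot recover, it stops at a known bound and leaves a log entry for every failed attempt.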

    Where your evaluation signal comes from determines the cadence

    In our direct experience, “how often should evaluation crons run?” is usually the wrong question. The right one, we find, is “where does your evaluation signal come from?” There are three sources:

    1. Production observability. Real usage is the strongest evaluation signal. If the system is being used at any meaningful volume, the production traffic itself becomes the eval dataset. Microsoft’s continuous improvement loop describes the mechanic: observability data from production informs offline experimentation and refinement; the loop runs continuously, not as a one-time gate. Arthur AI’s distinction between supervised evaluations (which require a known correct answer) and unsupervised evaluations (which assess behavior from the agent’s own context alone) is the operational mechanism. Unsupervised evals can run against every production interaction without needing a labeled set.
    2. Trigger-based, in-process evaluation. One agent judges the prior agent’s output as part of the workflow. This doesn’t run on a clock; it’s driven by execution. For high-volume, lower-criticality operations, sampling is fine: the judge checks a random percentage of runs, or a risk model routes higher-stakes outputs to the strict judge gate. We tend to think of this the way a factory tests bolts: you don’t have to inspect every bolt to know the batch is good, but you do have to inspect enough that the inference is defensible.
    3. Cron-based evaluation. For a system that isn’t used enough to accumulate production observations but has to perform when called, cron is the fallback. Low-traffic internal agents, regulated systems with sparse usage, and pre-launch pilots where production data doesn’t exist yet are the cases where a scheduled benchmark run earns its place. Pilot-phase batch testing, where we synthesize thousands to hundreds of thousands of test interactions to surface failure modes before users see them, is another good example, though it’s batch-on-demand rather than truly a “cron”.
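
    The sampling idea in item 2 can be sketched in two parts: a router that always judges high-risk outputs and samples the rest, and a sample-size calculation for the bolt-testing inference. The thresholds and function names are assumptions for illustration, and the sample-size formula assumes runs are independent.

```python
import math
import random
from typing import Optional


def should_judge(risk_score: float, sample_rate: float = 0.05,
                 risk_threshold: float = 0.8,
                 rng: Optional[random.Random] = None) -> bool:
    """Always judge high-risk outputs; sample the rest at random."""
    if risk_score >= risk_threshold:
        return True
    roll = rng.random() if rng is not None else random.random()
    return roll < sample_rate


def samples_for_zero_failures(max_failure_rate: float,
                              confidence: float = 0.95) -> int:
    """Smallest n such that n consecutive passing samples support the claim
    'the failure rate is below max_failure_rate' at the given confidence.

    Solves (1 - p)^n <= 1 - confidence for n, assuming independent runs.
    """
    return math.ceil(
        math.log(1.0 - confidence) / math.log(1.0 - max_failure_rate)
    )
```

    For example, `samples_for_zero_failures(0.01)` gives 299: if 299 sampled runs all pass, a failure rate above 1% would have produced at least one failure with 95% probability. That is the sense in which a sample becomes defensible.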

    A system with strong production traffic shouldn’t run synthetic crons it doesn’t need, unless there are critical scenarios that production traffic isn’t otherwise hitting. Meanwhile, a system with little or no production usage shouldn’t pretend trigger-based evals will catch what only batch testing finds.


    When LLM-as-judge fails, it fails at the rubric

    It is good practice to use an LLM to test and evaluate the quality of your AI system. This is called LLM-as-judge: the LLM tests your system, or specific agents within it, to determine whether they are making mistakes.

    The judge is only as good as the rubric it judges against. Teams iterate the judge prompt and the judge model without iterating the pass/fail definition, and the wrong things keep passing while the right things keep failing. The dominant failure mode of LLM-as-judge in practice isn’t bias in the model; it’s a pass/fail definition that was never sharpened against actual failure cases. Refining the criteria (what specifically counts as a pass for a given test case, broken down by what the system needs to demonstrate) often yields far greater improvements than refining the prompt or swapping the judge model.

    Practically, the discipline has three moves. First, score binary pass/fail rather than on a range. Arthur AI’s observation is that the same interaction can score a 4 on one run and a 6 on another from the same judge; binary judgments are more consistent and force the rubric to be sharp. Second, validate the judge against a small golden dataset. Hitting your accuracy-tier-appropriate judge-human agreement target on the gold set is a reasonable bar for most 4-sigma systems and a starting point to tighten upward for higher-tier work. Third, refine the rubric on every failure case before refining anything else. If the judge passed something that shouldn’t have passed, review the rubric carefully to rule out your criteria as the problem. The model is mostly innocent.
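
    The second and third moves can be sketched as a single report: judge-human agreement on the gold set, plus the specific disagreements to take back to the rubric. The case IDs, the `GoldCase` structure, and the 0.80 target are hypothetical. The judge verdicts are assumed to have been collected already.

```python
from dataclasses import dataclass


@dataclass
class GoldCase:
    case_id: str
    human_pass: bool  # label from a human reviewer
    judge_pass: bool  # binary verdict from the LLM judge


def agreement_report(cases: list[GoldCase]) -> dict:
    """Judge-human agreement on the gold set, plus cases for rubric review."""
    agree = sum(c.human_pass == c.judge_pass for c in cases)
    return {
        "agreement": agree / len(cases),
        # Judge passed what a human failed: the rubric is too loose here.
        "false_passes": [c.case_id for c in cases
                         if c.judge_pass and not c.human_pass],
        # Judge failed what a human passed: the rubric is too strict here.
        "false_fails": [c.case_id for c in cases
                        if c.human_pass and not c.judge_pass],
    }


gold = [
    GoldCase("refund-01", human_pass=True, judge_pass=True),
    GoldCase("refund-02", human_pass=False, judge_pass=True),
    GoldCase("escalate-01", human_pass=True, judge_pass=True),
    GoldCase("escalate-02", human_pass=False, judge_pass=False),
]
report = agreement_report(gold)
meets_target = report["agreement"] >= 0.80  # hypothetical tier target
```

    Here the judge agrees with the human on three of four cases (0.75), so it misses the hypothetical target, and `refund-02` is the false pass to review the rubric against before touching the prompt or the model.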

    This is not to say that swapping models and comparing them isn’t worthwhile. It can lead to very measurable changes in price or performance. But it doesn’t change the fact that models, and your systems, will always optimize toward the thing you evaluate them against, not toward how smart they are in general.

    DSPy fits here as the structured-optimization layer for cases where the rubric is well-defined enough to optimize against. In plain English, DSPy is a way to declare your task as composable modules and let a compiler optimize the prompts against a measurable downstream metric: instead of hand-tuning prompts, you tune the metric and let the compiler find the prompt. It pays off most clearly for people-facing systems where input prompt quality varies widely (the input you can’t control), and less for closed-domain backend tasks where prompt quality is already stable. DSPy doesn’t replace LLM-as-judge; it operates on top of a judge metric that’s already calibrated. Sequence matters: calibrate the judge first, then optimize against it.

    A minimum-viable evaluation setup for agent testing

    1. Trace and logging layer. You need to be able to review exactly what fails, when, under what conditions, how often, and after how many tries. You can’t really over-log: logging is cheap, especially in the pilot and development stages.
    2. A small gold set of 10–50 examples. The most important cases the system has to handle correctly, written down explicitly, with expected outputs or expected trajectories.
    3. One deterministic grader. Schema validity, latency, cost per task, token usage. Things that don’t need an LLM to judge. Run on every interaction.
    4. One LLM judge with a calibrated rubric. Calibrated to your accuracy-tier-appropriate judge-human agreement target on the gold set before scaling to production traffic. Binary pass/fail. Rubric updates on every failure case.
    5. Production-feedback loop. Failures from production get added back to the gold set. The judge gets re-validated against the expanded set periodically. The system learns from being used, not just from being built.
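
    Items 3 and 5 are the easiest to sketch without any model calls. The grader below checks schema validity, latency, and cost; the second function adds a confirmed production failure to the gold set. The required keys, thresholds, and gold-set record shape are assumptions for illustration.

```python
import json


def grade_deterministic(raw_output: str, latency_s: float, cost_usd: float,
                        required_keys=("answer", "sources"),
                        max_latency_s: float = 10.0,
                        max_cost_usd: float = 0.25) -> dict:
    """Checks that need no LLM: schema validity, latency, cost per task."""
    try:
        parsed = json.loads(raw_output)
        schema_ok = (isinstance(parsed, dict)
                     and all(k in parsed for k in required_keys))
    except json.JSONDecodeError:
        schema_ok = False
    checks = {
        "schema": schema_ok,
        "latency": latency_s <= max_latency_s,
        "cost": cost_usd <= max_cost_usd,
    }
    return {"passed": all(checks.values()), "checks": checks}


def add_failure_to_gold_set(gold_set: list, case_input: str,
                            expected: str) -> bool:
    """Production-feedback loop: a confirmed failure becomes a regression case.

    Returns True if the case was added, False if it was already present.
    """
    if any(case["input"] == case_input for case in gold_set):
        return False
    gold_set.append({"input": case_input, "expected": expected})
    return True
```

    Because the grader reports each check separately, a failed run says whether the schema, the latency, or the cost was the problem, which keeps the deterministic layer cheap enough to run on every interaction.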

    In our experience, the sequence matters: accuracy needs are defined upfront; infrastructure (including cost) is designed before the PoC; the gold set comes before the judge platform; the judge gets calibrated before any DSPy optimization gets layered on; and cost-of-evaluation is baked into the infrastructure design from the start, then optimized with testing.

    Evaluation crons, judge calls, and continuous test runs all show up on the API invoice like any other model call. Our work on cost optimization in AI systems covers the dispatcher-first architecture that catches needless model calls in the agent workflow.
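
    A rough way to put evaluation on its own cost line is to add up sampled judge calls and scheduled benchmark runs, then divide by the failures they caught. The sketch below assumes flat per-call prices; the parameter names and the figures in the usage example are illustrative, not measured.

```python
from typing import Optional


def weekly_eval_cost(runs_per_week: int, sample_rate: float,
                     judge_cost_usd: float,
                     benchmark_runs_per_week: int = 0,
                     benchmark_cost_usd: float = 0.0) -> float:
    """Weekly evaluation spend: sampled judge calls plus scheduled benchmarks."""
    judge_spend = runs_per_week * sample_rate * judge_cost_usd
    benchmark_spend = benchmark_runs_per_week * benchmark_cost_usd
    return judge_spend + benchmark_spend


def cost_per_failure_caught(weekly_cost_usd: float,
                            failures_caught: int) -> Optional[float]:
    """Evaluation spend per caught failure; None if nothing was caught."""
    if failures_caught == 0:
        return None
    return weekly_cost_usd / failures_caught


# Illustrative figures only: 20,000 runs, 10% sampled, $0.004 per judge call,
# plus one $12 benchmark run per week.
cost = weekly_eval_cost(20_000, 0.10, 0.004,
                        benchmark_runs_per_week=1, benchmark_cost_usd=12.0)
per_failure = cost_per_failure_caught(cost, failures_caught=5)
```

    Tracking the per-failure figure over time shows whether a higher sample rate is buying more caught failures or only a larger invoice.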

    Flowchart of the minimum viable evaluation loop: define success criteria, build gold set, run hybrid eval, identify regressions, feed failures back

    The five disciplines

    | Discipline | What it measures | What it costs to do badly |
    | --- | --- | --- |
    | Accuracy tier declaration | The acceptable failure rate, named before any test gets written | Wasted budget on over-engineered evals for low-stakes systems; shipping high-stakes systems without defensible accuracy |
    | Harness-layer testing | Memory, tool calls, cost cutoffs, feedback loops, hallucination gates, permissions, observability, each with its own test surface | Failures that surface at the output with no signal about which layer broke; alerts you can’t act on |
    | Signal-source matching | Whether evaluation runs against production traffic, in-process triggers, or scheduled batches, based on usage volume | Synthetic crons that miss what real users do; production systems with no offline regression coverage |
    | Judge + rubric calibration | Pass/fail definition, judge prompt effectiveness, model validation | Confident wrong answers passing through unnoticed; correct answers flagged as failures |
    | Cost-of-evaluation budgeting | Per-task judge cost, weekly benchmark cost, cost per failure caught | Evaluation infrastructure costing more than the agents it evaluates; evaluation rollbacks under cost pressure |

    The ten-question audit

    Here are some questions a technical lead, agency owner, or program owner can ask their team to quickly learn where the production gaps lie:

    1. Have we defined the level of accuracy this system has to meet — and does the team agree on it?
    2. Do we have a gold set of test cases (10–50 examples) that the system has to pass before any change ships?
    3. When a test fails, can we tell which harness layer broke: memory, tool call, cost cutoff, hallucination gate, or only that the output was wrong?
    4. Where does our evaluation signal come from: production traffic, in-process triggers, or scheduled batches? Have we made that choice deliberately?
    5. If we run an LLM as a judge, do we know what percentage of the time it agrees with a human on the gold set? Is that percentage acceptable for our accuracy tier?
    6. When the judge passes something that shouldn’t have passed, do we update the rubric, or only the data and the prompt?
    7. What does evaluation cost us per week, and is that cost line tracked alongside the agents’ own cost line?
    8. When a report of a real failure lands, does that failure end up in the gold set automatically, or does it get lost?
    9. If a high-volume operation can’t run full evaluation on every call, what is our sampling strategy — and is the risk model behind it defensible?
    10. Could we hand this evaluation setup to a new engineer joining the team next month and have them know what each component does and why?


    FAQ

    What is evaluation-led agent development?

    A development practice in which evaluation is the primary discipline shaping how an agent is built, tested, and operated — not a quality-assurance step at the end. The academic name is evaluation-driven development and operations (EDDOps). In practice, it means defining accuracy tiers before writing tests, testing at the harness layer rather than only at the output, matching evaluation signal to production usage patterns, calibrating LLM-as-judge against human-labeled gold sets, and budgeting evaluation as infrastructure from the start.

    How do I evaluate an AI agent in production?

    Three signal sources to choose from based on usage volume: production observability with unsupervised evals running against every interaction, trigger-based in-process evals where one agent judges the prior agent’s output, and scheduled batch or cron evaluation for systems without enough production traffic to self-validate. Most production systems with real usage volume run unsupervised evals on production data continuously, with offline regression tests against a gold set on every change.

    What’s the difference between offline and online evaluation for AI agents?

    Offline evaluation runs against fixed datasets (held-out test cases, historical traces, synthesized usage) and is the default for pre-production regression testing and CI/CD gates. Online evaluation runs against live production traffic, often using unsupervised evals that don’t require a known correct answer. Both belong in a production system: offline catches regressions before deployment, online catches drift after deployment.

    When should I use LLM-as-judge for evaluating AI agents?

    Whenever the evaluation requires semantic judgment — helpfulness, groundedness, tone, reasoning quality — that deterministic checks can’t capture. Reserve deterministic graders (schema, latency, cost) for what they’re good at, and use LLM-as-judge for the rest. Always calibrate against a gold set first; aim for 75–90% agreement with human labels before scaling.

    What are the limitations of LLM-as-judge?

    The published limitations (position bias, verbosity bias, self-enhancement bias, prompt sensitivity) are real and worth knowing. The more common failure in practice is that the test criteria themselves were undertheorized: the rubric the judge judges against was never sharpened against actual failure cases. Recent measurement work finds 74% of production agents still rely primarily on human-in-the-loop evaluation rather than standardized benchmarks.

    What is DSPy and when should I use it?

    DSPy is a framework for declaring tasks as composable modules and letting a compiler optimize prompts against a measurable downstream metric. Use it when the metric is well-defined (a calibrated judge counts) and the input prompt quality is variable — typically people-facing systems where you can’t control what users type. Skip it when the metric is squishy or when prompt quality is already stable; hand-tuning still wins there.

    How big should my held-out test set be for an AI agent?

    Start with 10–50 examples — small enough to write by hand, large enough to catch the failure modes you already know about. The set grows as production failures get added back into it. Most small-team systems plateau usefully around 100–300 examples, though the right size is whatever covers the failure modes the accuracy tier requires.

    How often should I run evaluation crons against my agents?

    Probably not on a clock. For systems with meaningful production traffic, run unsupervised evals against production interactions and offline regression tests on every deployment. Cron-based evaluation is the right cadence for systems with sparse usage — internal agents called rarely but expected to perform when called — where production data isn’t accumulating fast enough to provide its own signal.

    Can a small team (1-5 agents) actually do evaluation-led development?

    Yes, and the discipline matters more for small teams than for large ones because there’s less margin for a failure mode to surface twice. The five-component minimum stack (trace layer, gold set of 10–50 examples, deterministic grader, calibrated LLM judge, production-feedback loop) is buildable in one sprint. The constraint is sequencing discipline; headcount isn’t the gating factor.

    Is evaluation-led development the same as MLOps?

    Overlapping, not identical. MLOps covers the full lifecycle of ML systems (training, deployment, monitoring, retraining) and predates agentic systems. Evaluation-led development focuses on the testing and judgment discipline specifically, and applies to agent systems that often don’t involve model training at all. EDDOps is closer to TDD for agents than to MLOps for models.