How to Test AI Agents: Evaluation Framework for Production

How to Test AI Agents Before (and After) You Deploy Them: The Evaluation Gap That Kills Most Projects

Most AI agents don't fail with a crash. They fail silently - completing workflows, returning plausible-looking results, and confidently producing output that's subtly wrong in ways you won't catch until the damage compounds downstream.

This is the evaluation gap: the distance between "works in a demo" and "works reliably in production." It's the primary reason 80% of AI projects fail to deliver business value. And it's the question most teams skip entirely in the rush to ship.

If you're building AI agents - or evaluating whether a development partner knows what they're doing - the evaluation strategy tells you more about production readiness than any feature demo ever will.

Why Traditional Testing Breaks Down for AI Agents

Traditional software testing is binary: the API returns the right response code, the function produces the expected output, the database writes correctly. You can write a test, run it a thousand times, and get the same result.

AI agents break every one of these assumptions:

Non-deterministic outputs. The same input can produce different execution paths on different runs due to temperature sampling, tool response variations, and timing differences.
Multi-step state dependencies. A wrong tool argument at step 2 silently corrupts every subsequent step. The failure at step 8 looks like a step 8 problem when it's actually a step 2 problem.
Success is contextual. A research agent can call every required API correctly and still deliver a summary a domain expert would reject. "Working" depends on who's asking and why.
Errors don't throw exceptions. Goal drift, context loss, and quality degradation produce no error codes. The agent keeps running - it just stops being useful.

Anthropic's engineering team, drawing on its experience building Claude Code and working with frontier agent developers, describes the breaking point clearly: teams get surprisingly far on manual testing and intuition during early prototyping. Then they ship to production, users report that the agent "feels worse" after changes, and the team is flying blind, with no way to verify anything except guess-and-check.

The Six Agent-Specific Failure Modes

Before you can test agents effectively, you need to understand what specifically can go wrong - and why standard monitoring won't catch it.

Based on production data from teams running agents at scale, six failure modes are unique to (or significantly worse in) agentic systems:

1. Tool Misuse and Call Failures

The agent calls a tool with incorrect arguments, selects the wrong tool, or fails to handle a tool error and continues as if the call succeeded. This is the most common production failure mode, and one of the most damaging, because a single malformed argument silently corrupts every downstream step that depends on that output.

Amazon's shopping assistant team discovered this firsthand while integrating hundreds of APIs: poorly defined tool schemas led to wrong tool selection, which expanded the context window unnecessarily, increased latency, and escalated costs through redundant LLM calls.
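
One way to catch this class of failure early is a code-based check that validates every tool call against a declared schema before it runs. The sketch below is a minimal illustration, not a production validator: `ToolSpec`, `validate_tool_call`, the `refund_order` tool, and its argument names are all hypothetical, invented for this example.

```python
from dataclasses import dataclass


@dataclass
class ToolSpec:
    """Hypothetical tool schema: required argument names mapped to expected types."""
    name: str
    required: dict


def validate_tool_call(call: dict, registry: dict) -> list:
    """Return a list of problems; an empty list means the call looks valid."""
    spec = registry.get(call.get("tool"))
    if spec is None:
        return [f"unknown tool: {call.get('tool')!r}"]

    args = call.get("args", {})
    problems = []
    for arg, expected in spec.required.items():
        if arg not in args:
            problems.append(f"missing argument: {arg}")
        elif not isinstance(args[arg], expected):
            problems.append(
                f"{arg}: expected {expected.__name__}, got {type(args[arg]).__name__}"
            )
    # Arguments the schema does not declare are also flagged.
    for arg in args:
        if arg not in spec.required:
            problems.append(f"unexpected argument: {arg}")
    return problems


# Illustrative registry with a single made-up tool.
REGISTRY = {
    "refund_order": ToolSpec("refund_order", {"order_id": str, "amount_cents": int}),
}

good = {"tool": "refund_order", "args": {"order_id": "A-17", "amount_cents": 1250}}
bad = {"tool": "refund_order", "args": {"order_id": "A-17", "amount_cents": "12.50"}}

print(validate_tool_call(good, REGISTRY))
print(validate_tool_call(bad, REGISTRY))
```

In an eval suite, a non-empty problem list fails the test case at the step where the bad call happened, rather than several steps later where the symptoms show up.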

2. Context Loss Across Turns

In multi-turn workflows, the agent loses track of constraints established earlier. Studies show context retention accuracy drops 15–30% in sessions exceeding 10 turns. Each individual response looks reasonable in isolation - it's only wrong relative to earlier context.

3. Goal Drift

The agent gradually shifts from the original objective. No individual step fails, but small reasoning deviations accumulate. Research on LLM agent benchmarks found that agents graded only on final-output quality pass 20–40% more test cases than they do under full trajectory evaluation - meaning the path matters, not just the destination.
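
To make the difference between final-output grading and trajectory grading concrete, here is a small sketch. It assumes each run is recorded as a dict with a `steps` list and a `final` answer. That record shape, the tool names, and both grader functions are assumptions made for illustration.

```python
def final_output_passes(run: dict, expected_answer: str) -> bool:
    """Final-output grading: only the last answer is inspected."""
    return run["final"] == expected_answer


def trajectory_passes(run: dict, required_in_order: list, forbidden: set) -> bool:
    """Trajectory grading: required tools must appear in order, forbidden tools never."""
    tools = [step["tool"] for step in run["steps"]]
    if forbidden.intersection(tools):
        return False
    remaining = iter(tools)
    # Subsequence check: each required tool must occur after the previous one.
    return all(required in remaining for required in required_in_order)


# A run that reaches the right answer but skips source verification.
run = {
    "steps": [{"tool": "search"}, {"tool": "summarize"}],
    "final": "Q3 revenue grew 12%",
}

print(final_output_passes(run, "Q3 revenue grew 12%"))
print(trajectory_passes(run, ["search", "verify_source", "summarize"], {"delete_record"}))
```

This run passes the final-output check and fails the trajectory check, which is exactly the kind of case that final-output-only grading overcounts.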

4. Retry Loops

The agent enters a loop - calling the same tool with the same arguments, cycling between sub-goals, or re-attempting a failed approach without updating strategy. In one documented production case, a CRM API timeout caused an agent to retry identically 11 times per session before timeout, generating 2,717 error log entries across 247 affected sessions in just 4 hours.
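
The figures in that case are consistent with each other: 247 sessions at 11 identical retries each gives 2,717 log entries. A loop like this is cheap to detect from traces. The sketch below assumes each step is a dict with a `tool` name and JSON-serializable `args`; the function name, the record shape, and the threshold of 3 are illustrative assumptions, not values from any particular framework.

```python
import json
from collections import Counter


def find_retry_loops(steps: list, max_identical_calls: int = 3) -> list:
    """Flag (tool, args) pairs called more than max_identical_calls times in a session.

    This counts identical calls anywhere in the session, not only consecutive ones.
    """
    counts = Counter(
        (step["tool"], json.dumps(step["args"], sort_keys=True)) for step in steps
    )
    return [
        {"tool": tool, "args": json.loads(args), "count": n}
        for (tool, args), n in counts.items()
        if n > max_identical_calls
    ]


# Simulated session: one normal call, then 11 identical CRM lookups.
session = [{"tool": "get_user", "args": {"id": 7}}]
session += [{"tool": "crm_lookup", "args": {"account": "acme"}} for _ in range(11)]

loops = find_retry_loops(session)
print(loops)
print(247 * 11)  # sessions x retries per session
```

The same check can run inside the agent loop as a circuit breaker, or offline against captured traces as a code-based grader.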

5. Cascading Errors in Multi-Agent Systems

In coordinated multi-agent architectures, failure in one agent propagates to dependent agents that receive its corrupted output. The receiving agent doesn't detect the problem, producing second-order failures that are extremely difficult to trace back to the source.

6. Silent Quality Degradation

Output quality decreases gradually without any discrete failure event. No error fires. Quality degrades due to model version changes, prompt drift, distribution shift in incoming queries, or accumulated technical debt. This is completely invisible to error-rate monitoring.
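
Because nothing errors, the only way to see this failure mode is to compare quality scores over time. A minimal sketch, assuming you already have per-session quality scores on a 0-to-1 scale from some grader: the function name and the `max_drop` threshold of 0.05 are placeholders to tune, not recommended values.

```python
from statistics import mean


def quality_drift(baseline_scores: list, recent_scores: list, max_drop: float = 0.05) -> dict:
    """Compare mean quality in a recent window against a baseline window."""
    drop = mean(baseline_scores) - mean(recent_scores)
    return {"drop": drop, "alert": drop > max_drop}


# Scores from the weeks after launch vs. the most recent week (made-up numbers).
baseline = [0.9, 0.9, 0.9, 0.9]
recent = [0.8, 0.8, 0.8, 0.8]

result = quality_drift(baseline, recent)
print(result["alert"])
```

A comparison of means is the simplest possible version. With enough traffic, comparing the full score distributions gives earlier and less noisy signals.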

A Practical Evaluation Framework: What to Test and How

Anthropic's evaluation guidance and Amazon's production framework converge on a layered approach. Here's how to apply it pragmatically:

Layer 1: Define What "Working" Actually Means

Before writing a single test, specify success criteria explicitly. Two engineers reading the same spec will come away with different interpretations of edge case handling. An evaluation suite resolves this ambiguity.

For each agent capability, define:

The outcome check: Did the intended state change actually happen? (Not "did the agent say it booked the flight" - but does a reservation actually exist in the database?)
The quality bar: Is the output useful to the specific user in the specific context?
The constraint set: What must the agent NOT do? (Rate limits, data boundaries, tone requirements)

Amazon's evaluation library structures this across three layers: final response quality (correctness, faithfulness, helpfulness), component performance (tool selection accuracy, reasoning groundedness, memory retrieval), and task completion (did the agent achieve the user's actual goal).
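
The outcome check is the easiest of these to express in code. The sketch below uses Python's built-in `sqlite3` with an in-memory database to stand in for the real system of record. The `reservations` table, its columns, and the grader functions are hypothetical and exist only to illustrate checking state rather than trusting the agent's reply.

```python
import sqlite3


def reservation_exists(conn: sqlite3.Connection, user_id: str, flight_no: str) -> bool:
    """Check the system of record directly instead of reading the agent's message."""
    row = conn.execute(
        "SELECT 1 FROM reservations WHERE user_id = ? AND flight_no = ? LIMIT 1",
        (user_id, flight_no),
    ).fetchone()
    return row is not None


def grade_booking(conn: sqlite3.Connection, agent_reply: str, user_id: str, flight_no: str) -> dict:
    """Pass only if the reservation really exists; record what the agent claimed."""
    claimed = "booked" in agent_reply.lower()
    actual = reservation_exists(conn, user_id, flight_no)
    return {"claimed": claimed, "actual": actual, "passed": actual}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reservations (user_id TEXT, flight_no TEXT)")

# The agent says it booked the flight, but no row was written.
before = grade_booking(conn, "Your flight is booked!", "u1", "XY123")

# After the write actually happens, the same reply passes.
conn.execute("INSERT INTO reservations VALUES (?, ?)", ("u1", "XY123"))
after = grade_booking(conn, "Your flight is booked!", "u1", "XY123")

print(before)
print(after)
```

Recording `claimed` alongside `actual` is a deliberate choice: sessions where the agent claims success without the state change are the silent failures worth surfacing first.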

Layer 2: Choose the Right Grading Method for Each Dimension

Not everything can be graded the same way. Effective evaluation combines three grader types:

Grader Type	Best For	Limitations
Code-based (pass/fail tests, schema validation, state checks)	Tool call correctness, outcome verification, performance constraints	Brittle to valid variations; can't assess nuance
Model-based (LLM-as-judge with rubrics)	Communication quality, reasoning coherence, open-ended tasks	Non-deterministic; requires calibration against human judgment
Human review (expert sampling, A/B testing)	Gold-standard quality assessment; calibrating model graders	Expensive; slow; doesn't scale to every session

The practical combination: use code-based graders for everything that has a verifiable right answer (did the tool get called with correct parameters? did the database state change?). Use model-based graders with clear rubrics for quality dimensions. Sample with human experts to calibrate the model graders periodically.
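
Calibrating a model-based grader can start as simply as measuring how often it agrees with human reviewers on the same sessions. The sketch below assumes both sets of verdicts are dicts of session ID to pass/fail. The function name and data shape are illustrative, and raw agreement is the simplest possible measure rather than a complete calibration method.

```python
def calibrate(model_verdicts: dict, human_verdicts: dict) -> dict:
    """Measure agreement between a model grader and human reviewers on shared sessions."""
    shared = sorted(model_verdicts.keys() & human_verdicts.keys())
    if not shared:
        raise ValueError("no sessions were graded by both the model and a human")
    disagreements = [
        sid for sid in shared if model_verdicts[sid] != human_verdicts[sid]
    ]
    return {
        "n": len(shared),
        "agreement": 1 - len(disagreements) / len(shared),
        "disagreements": disagreements,
    }


# Made-up verdicts for four sampled sessions.
model = {"s1": True, "s2": True, "s3": False, "s4": True}
human = {"s1": True, "s2": False, "s3": False, "s4": True}

report = calibrate(model, human)
print(report)
```

The disagreement list matters more than the headline number: reading those sessions is how the rubric gets fixed.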

Layer 3: Separate Capability Evals from Regression Evals

This distinction is crucial and often missed:

Capability evals ask "What can this agent do well?" They should start at a low pass rate, targeting tasks the agent struggles with. These give you a hill to climb.
Regression evals ask "Does the agent still handle everything it used to?" These should have a nearly 100% pass rate. A decline signals something broke.

As capability evals hit high pass rates, they "graduate" to the regression suite - ensuring yesterday's hard-won improvements don't silently disappear with tomorrow's prompt change.
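
The graduation step can be a small piece of bookkeeping. In the sketch below, suites are plain sets of task IDs and `pass_rates` maps each task to its recent pass rate. The 0.95 threshold and all names are assumptions for illustration; the right cutoff depends on your product.

```python
def graduate(capability: set, regression: set, pass_rates: dict, threshold: float = 0.95):
    """Move capability tasks that now pass reliably into the regression suite.

    Returns (new_capability, new_regression, promoted_task_ids).
    """
    promoted = sorted(
        task for task in capability if pass_rates.get(task, 0.0) >= threshold
    )
    return capability - set(promoted), regression | set(promoted), promoted


capability_suite = {"multi_city_booking", "refund_partial", "fare_rules_qa"}
regression_suite = {"simple_booking"}
recent_pass_rates = {
    "multi_city_booking": 0.55,
    "refund_partial": 0.98,
    "fare_rules_qa": 0.97,
}

capability_suite, regression_suite, promoted = graduate(
    capability_suite, regression_suite, recent_pass_rates
)
print(promoted)
```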

Layer 4: Account for Non-Determinism

Because agent outputs vary between runs, single-attempt evaluation is misleading. Two metrics matter:

pass@k - probability of at least one success in k attempts. Relevant when finding one good solution matters (code generation, research synthesis).
pass^k - probability of ALL k trials succeeding. Relevant for customer-facing agents where users expect consistent behavior every time.

At k=1, these are identical. At k=10, they tell opposite stories. Choose based on your product requirements: if users interact once and need it to work, you care about pass^k.
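
A quick calculation shows how far the two metrics diverge. The sketch assumes each trial succeeds independently with the same probability `p`, which is a simplification, and the 75% per-trial success rate is a made-up example value.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k independent trials."""
    return 1 - (1 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed (pass^k)."""
    return p ** k


p = 0.75  # assumed per-trial success rate

for k in (1, 10):
    print(k, round(pass_at_k(p, k), 4), round(pass_hat_k(p, k), 4))
```

Under these assumptions, an agent that succeeds 75% of the time per attempt has a pass@10 above 99.9% but a pass^10 of about 5.6%. The same agent looks nearly perfect or nearly unusable depending on which metric you report.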

What Production Monitoring Adds (That Pre-Deployment Testing Can't)

Pre-deployment tests cannot replace production monitoring. They can't capture distribution drift, unexpected real-world inputs, or the gradual quality degradation that only appears over time.

The minimum production monitoring stack for agents:

Structured trace capture - Every agent action as a span with session IDs linking all steps together. You need the causal chain, not isolated log entries.
Continuous quality evaluation - Sample production sessions and score proactively against quality criteria. Don't wait for user complaints.
Failure clustering - Group related failures by shared signature before surfacing them. Forty sessions failing for the same underlying reason should surface as one issue with a frequency count, not 40 separate incidents.
Regression test generation from production failures - Every diagnosed production failure becomes a regression test. This is the step most teams skip, which is why the same failure recurs after every model upgrade.
Quality metric alerting - Alert on quality score distribution, task completion rate, and average session step count. Error rate and latency alone will miss the majority of agent-specific failures.
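
Failure clustering in particular is simple to prototype. The sketch below groups failure records by a signature of tool name plus error type. Both the record fields and the choice of signature are assumptions; real signatures often need normalized error messages or step positions as well.

```python
from collections import defaultdict


def cluster_failures(failures: list) -> list:
    """Group failures by (tool, error_type) and return issues sorted by frequency."""
    clusters = defaultdict(list)
    for failure in failures:
        signature = (failure["tool"], failure["error_type"])
        clusters[signature].append(failure["session_id"])
    issues = [
        {"signature": signature, "count": len(session_ids), "sessions": session_ids}
        for signature, session_ids in clusters.items()
    ]
    return sorted(issues, key=lambda issue: issue["count"], reverse=True)


# 40 sessions failing for the same reason, plus one unrelated failure.
failures = [
    {"session_id": f"s{i}", "tool": "crm_lookup", "error_type": "timeout"}
    for i in range(40)
]
failures.append({"session_id": "s40", "tool": "send_email", "error_type": "bad_address"})

issues = cluster_failures(failures)
for issue in issues:
    print(issue["signature"], issue["count"])
```

Here 41 raw failures surface as two issues, with the CRM timeout at the top of the list.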

The 20-Task Starting Point

Teams delay building evals because they think they need hundreds of test cases. Anthropic's guidance is clear: 20–50 simple tasks drawn from real failures is a great start.

In early agent development, each change has a large, noticeable impact - small sample sizes suffice. As the agent matures, expand the suite to detect smaller effects. But don't let perfect be the enemy of deployed:

Collect 20 real failure cases from manual testing or early user sessions
Define explicit success criteria for each
Build code-based graders for the clear pass/fail dimensions
Add one model-based rubric for quality assessment
Run this suite before every deployment
Expand as production reveals new failure patterns

This takes days, not months. And it immediately separates your team from the majority of AI projects that ship without any systematic evaluation at all.
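
As a rough picture of how small the first version can be, here is a sketch of a suite runner with a deployment gate. The agent is a stub, the two tasks are invented, and `run_suite` and its `min_pass_rate` parameter are hypothetical. A real suite would call your actual agent and use the graders you defined for each task.

```python
def run_suite(agent, tasks: list, min_pass_rate: float = 1.0) -> dict:
    """Run every task through the agent, grade it, and decide whether to deploy."""
    if not tasks:
        raise ValueError("the suite has no tasks")
    results = {task["id"]: bool(task["grader"](agent(task["input"]))) for task in tasks}
    pass_rate = sum(results.values()) / len(results)
    return {"results": results, "pass_rate": pass_rate, "deploy": pass_rate >= min_pass_rate}


def stub_agent(prompt: str) -> str:
    """Stand-in for a real agent call; only handles refunds."""
    return "refund issued" if "refund" in prompt.lower() else "I am not sure"


TASKS = [
    {
        "id": "refund-basic",
        "input": "Please refund order A-17",
        "grader": lambda output: "refund issued" in output,
    },
    {
        "id": "cancel-basic",
        "input": "Cancel my subscription",
        "grader": lambda output: "cancelled" in output,
    },
]

report = run_suite(stub_agent, TASKS)
print(report)
```

Wiring the `deploy` flag into a CI step is what turns the suite from a report into a gate.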

What This Means for Choosing a Development Partner

If you're evaluating AI development partners, ask about evaluation strategy early. The answers reveal more about production capability than any demo:

"How do you know when an agent is ready to deploy?" - Look for specific metrics, not vibes.
"What happens when the agent fails silently?" - Look for structured observability, not just error logging.
"How do you prevent regressions when you update the model or prompts?" - Look for regression suites that run automatically, not manual spot-checks.
"How do you handle non-determinism?" - Look for multi-trial evaluation, not single-pass testing.

A partner who builds evaluation into the architecture from day one will deliver agents that work in production - not just in the demo room. A partner who treats testing as an afterthought will deliver an agent that works perfectly on the day of the handoff and degrades the moment real users interact with it.

The Apptitude Approach

At Apptitude, we build evaluation into every agent engagement from the first sprint. We define success criteria before writing agent code, instrument traces from the start, and maintain regression suites that grow with every production deployment. When we hand off an agent, we hand off the evaluation infrastructure that keeps it working - because an agent without evaluation is just a demo with a deployment pipeline.

If you're planning an AI agent build and want to understand what production-grade evaluation looks like for your specific use case, start a conversation with us.