Agent Evaluation & Observability in Production AI

You deploy an agent to production. Task completion rate looks acceptable in your logs — roughly 80% success. Two weeks later, a customer files a support ticket: the agent has been silently returning malformed outputs on a specific input pattern. You check your dashboards. Nothing flagged it. You trace backward through the execution logs and find the agent had been invoking a tool with a stale API token for eleven days, receiving 401 errors, silently retrying with the same token, then proceeding with empty data as if the tool call succeeded.

This is not a hypothetical. This is the exact failure mode that blindsides teams who treat agent observability as an afterthought — who instrument output success/fail counts but not the intermediate execution trace.

Agent evaluation and observability in production AI is not just metrics dashboards and error rate graphs. It is a structured introspection layer that gives you causality, not just correlation. It is an evaluation pipeline that runs continuously, not just before a deploy. And it is the infrastructure difference between operating a system you understand and operating one that surprises you at the worst possible moments.

This article covers the architecture patterns for production-grade agent observability, what an evaluation pipeline actually needs to catch real failures, and which metrics reflect genuine system health versus vanity numbers that look good until something breaks.



Why Agent Observability Is Harder Than Traditional Service Monitoring

The observability patterns that work for microservices — request/response latency, error rate, throughput — are necessary but deeply insufficient for agents. Traditional services execute deterministic logic. An agent executing a multi-step research task may take fifteen different tool call paths depending on what each intermediate result looks like. The same user request produces structurally different execution graphs on different runs.

This non-determinism creates three problems that traditional monitoring does not solve.

Silent failure propagation. A downstream tool call that returns a schema-mismatched response does not throw an exception — the agent incorporates the malformed data into its context window and continues. By the time you see a bad output, the failure happened four steps earlier. Distributed tracing in traditional services catches this because failure surfaces as an error status code. In an agent, failure often surfaces as bad data dressed up as a successful response.

No single bottleneck to instrument. In a standard service, you trace the request through a defined set of components. In an agent, the number and type of tool calls varies per execution. You cannot pre-define spans. You need dynamic instrumentation that captures whatever execution path the agent actually took, not the path you expected it to take.

Evaluation and monitoring are the same problem. For a traditional service, testing happens pre-deploy and monitoring happens post-deploy. For an agent, you need both simultaneously in production. A customer query that worked yesterday may fail today because a tool’s API changed its response format, because the LLM’s behavior shifted on a particular input pattern, or because context accumulated across a session hit a threshold that causes the model to degrade. Pre-deploy testing cannot catch these — they require continuous evaluation against production traffic.


Building the Agent Execution Trace Architecture

The foundation of agent observability is structured execution traces. Not logs — traces. Logs capture events. Traces capture causality: this tool was called because of this LLM output, which was produced in response to this context state, which was assembled from these components.

A production execution trace for an agent needs to capture five categories of data per step:

  1. Context snapshot: The full input to the LLM at that step — system prompt, conversation history, retrieved documents, tool results accumulated so far. This is expensive to store at full fidelity, so most production systems store a hash of the context and the delta from the previous step.

  2. LLM invocation record: Model ID, temperature, token counts (prompt + completion), latency, finish reason. The finish reason is often overlooked — a length finish reason on a step that should produce structured output is a silent failure that corrupts the agent’s subsequent reasoning.

  3. Tool call record: Tool name, input arguments, raw response, response latency, HTTP status if applicable, schema validation result. The schema validation result is the critical addition most teams miss.

  4. Verification result: Whether the step’s output passed or failed any inline verification checks — more on this in the evaluation pipeline section.

  5. State transition: What changed in the agent’s working state after this step. For agents with explicit state machines, this is the state transition. For ReAct-style agents, this is the updated scratchpad or memory.

Here is a minimal instrumentation wrapper that captures this structure in Python:

import time
import hashlib
import json
from dataclasses import dataclass
from typing import Any, Optional
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("agent.harness")

@dataclass
class StepTrace:
    step_id: str
    agent_run_id: str
    step_type: str  # "llm_call", "tool_call", "verification"
    input_context_hash: str
    output: Any
    latency_ms: float
    token_usage: Optional[dict] = None
    tool_name: Optional[str] = None
    tool_status: Optional[int] = None  # HTTP status for external tools
    schema_valid: Optional[bool] = None
    finish_reason: Optional[str] = None
    verification_passed: Optional[bool] = None
    error: Optional[str] = None

def instrument_tool_call(tool_name: str, tool_fn, args: dict,
                         agent_run_id: str, step_id: str,
                         schema_validator=None) -> StepTrace:
    """
    Wraps a tool call with full observability instrumentation.
    Captures timing, response, and schema validation in a single span.
    `schema_validator`, if provided, takes the raw output and returns
    True/False; the result is recorded rather than raised, so the trace
    preserves what happened even when validation fails.
    """
    with tracer.start_as_current_span(f"tool.{tool_name}") as span:
        span.set_attribute("agent.run_id", agent_run_id)
        span.set_attribute("agent.step_id", step_id)
        span.set_attribute("tool.name", tool_name)
        # Hash the arguments once; reuse it for the span attribute
        # and the step trace.
        args_hash = hashlib.sha256(
            json.dumps(args, sort_keys=True).encode()).hexdigest()
        span.set_attribute("tool.args_hash", args_hash[:16])

        start = time.monotonic()
        error = None
        output = None
        schema_valid = None

        try:
            output = tool_fn(**args)
            if schema_validator is not None:
                schema_valid = schema_validator(output)
                span.set_attribute("tool.schema_valid", schema_valid)
            span.set_status(Status(StatusCode.OK))
        except Exception as e:
            error = str(e)
            span.set_status(Status(StatusCode.ERROR, str(e)))
            span.record_exception(e)

        latency_ms = (time.monotonic() - start) * 1000
        span.set_attribute("tool.latency_ms", latency_ms)

        return StepTrace(
            step_id=step_id,
            agent_run_id=agent_run_id,
            step_type="tool_call",
            input_context_hash=args_hash,
            output=output,
            latency_ms=latency_ms,
            tool_name=tool_name,
            schema_valid=schema_valid,
            error=error,
        )

This is the instrumentation floor, not the ceiling. In production, you will add context hashing before each LLM call, finish reason capture on every model invocation, and schema validation on every tool response. The key principle: every step in the agent’s execution graph gets a span, and every span carries enough structured data to reconstruct what happened without needing to replay the execution.
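The context-snapshot strategy from category 1 — store a hash of the full context plus the delta from the previous step — can be sketched as follows. This is a minimal illustration, not a standard API: `context_hash`, `context_delta`, and the append-only assumption are all choices you would adapt to your own message format.

```python
import hashlib
import json

def context_hash(messages: list[dict]) -> str:
    """Stable hash of the full LLM input context."""
    canonical = json.dumps(messages, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

def context_delta(prev: list[dict], curr: list[dict]) -> list[dict]:
    """Messages appended since the previous step.

    Assumes context grows append-only between steps; if earlier
    messages were trimmed or rewritten, fall back to a full snapshot.
    """
    if curr[:len(prev)] == prev:
        return curr[len(prev):]
    return curr  # non-append change: store the full snapshot

prev_ctx = [{"role": "system", "content": "You are a research agent."}]
curr_ctx = prev_ctx + [{"role": "user", "content": "Summarize the report."}]

record = {
    "context_hash": context_hash(curr_ctx),  # cheap equality check across runs
    "delta": context_delta(prev_ctx, curr_ctx),  # only the new messages
}
```

The hash lets you ask "did two runs see identical context at step N?" without storing or comparing the full snapshots.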


The Evaluation Pipeline: Continuous Assessment in Production

Pre-deployment evaluation suites are necessary but structurally insufficient for agents. They test against a fixed dataset, under controlled conditions, with known input distributions. Production agents encounter a continuous stream of novel inputs, evolving tool API responses, model behavior shifts after provider updates, and context patterns that your pre-deploy dataset never covered.

A production evaluation pipeline runs against live traffic — sampled, replayed, or shadowed — using a combination of automated checks and async evaluation jobs. The architecture has three layers.

Layer 1: Inline Verification (Synchronous, Per-Step)

These checks run during agent execution, on the critical path. They should be fast — under 50ms — and focus on structural correctness rather than semantic quality. The investment is worth it: inline verification that catches a bad tool response at step 3 prevents the agent from propagating that error through steps 4 through 12.

Practical inline checks:

  - Schema validation on every tool response (use pydantic or jsonschema; fail loudly, not silently)
  - Finish reason check after every LLM call — stop is expected, length means context truncation, content_filter means something in your pipeline triggered a safety block
  - Required field presence — if your workflow requires the agent to produce a specific structured output, verify the fields exist before treating the step as successful
  - Sanity bounds on numeric outputs — an agent computing a financial figure that returns a negative number or an impossibly large value should not proceed
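A dependency-free sketch of these checks, returning failures instead of raising so the caller decides whether to halt. The field names and bounds here are illustrative; a real deployment would use pydantic or jsonschema for the schema check rather than hand-rolled field tests.

```python
def verify_step(tool_response: dict, finish_reason: str,
                required_fields: tuple = ("price", "currency"),
                bounds: tuple = (0.0, 1e9)) -> list[str]:
    """Run the inline checks; return a list of failure descriptions.

    An empty list means the step passes. Field names and bounds are
    example values, not a real schema.
    """
    failures = []
    # 1. Finish reason: anything but "stop" is a red flag.
    if finish_reason != "stop":
        failures.append(f"finish_reason={finish_reason}")
    # 2. Required field presence.
    for f in required_fields:
        if f not in tool_response:
            failures.append(f"missing field: {f}")
    # 3. Sanity bounds on numeric outputs.
    price = tool_response.get("price")
    if isinstance(price, (int, float)) and not (bounds[0] <= price <= bounds[1]):
        failures.append(f"price out of bounds: {price}")
    return failures

# A malformed response caught at this step, before it contaminates later ones:
bad = verify_step({"price": -4.2}, finish_reason="length")
```

The key design choice is accumulating every failure rather than stopping at the first: the trace then records the full shape of the bad step, which makes the postmortem far faster.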

Layer 2: Async Quality Evaluation (Asynchronous, Per-Task)

After each completed agent task, an async job evaluates the full execution trace and output against a richer set of criteria. This layer is where LLM-as-judge patterns are practical — you can afford higher latency because you are not on the critical path.

Evaluation dimensions for this layer:

  - Task completion fidelity: Did the output actually address what was asked? LLM-graded on a 1-5 rubric, with the evaluator LLM provided the original task and the output.
  - Tool usage efficiency: Did the agent use an appropriate number of tool calls for the task complexity? Agents that over-call tools (>3x expected for a task class) are exhibiting planning failures.
  - Context coherence: Did the agent’s reasoning across steps remain consistent, or did it contradict earlier conclusions? This catches context window contamination and context drift.
  - Output schema compliance: For structured output agents, did the final output conform to the expected schema?
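The fidelity dimension can be sketched as an LLM-as-judge job along these lines. `call_llm` is an assumption standing in for whatever client you use — it takes a prompt string and returns the model's text reply — and the rubric wording is illustrative:

```python
import json

JUDGE_PROMPT = """You are grading an AI agent's output.
Task: {task}
Output: {output}
Score task completion fidelity from 1 (missed the task) to 5 (fully
addressed it). Reply as JSON: {{"score": <int>, "reason": "<short>"}}"""

def grade_fidelity(task: str, output: str, call_llm) -> dict:
    """Async-layer judge: off the critical path, so latency is cheap.

    `call_llm` is injected rather than hard-coded, so the judge model
    can differ from the agent model.
    """
    reply = call_llm(JUDGE_PROMPT.format(task=task, output=output))
    graded = json.loads(reply)
    assert 1 <= graded["score"] <= 5, "judge returned out-of-rubric score"
    return graded

# With a stubbed judge model, for illustration:
stub = lambda prompt: '{"score": 4, "reason": "addressed all but one sub-question"}'
result = grade_fidelity("Summarize Q3 revenue drivers", "...", stub)
```

Validating the judge's own output (JSON parse plus range check) matters: an out-of-rubric score silently poisons the downstream regression statistics.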

Store these evaluations in a time-series database alongside the execution traces. The evaluation scores become the signal for your agent deployment pipeline.

Layer 3: Regression Detection (Asynchronous, Population-Level)

The third layer operates on aggregated evaluation data, looking for population-level shifts rather than individual task failures. This is where you catch the model provider update that degraded your agent’s performance by 8% across a specific task class — a change that individual task evaluation missed because the failures were distributed across thousands of executions.

Regression detection runs on a sliding window of evaluation scores, using statistical process control to flag when a metric shifts outside its expected bounds. A p-value threshold on a rolling 24-hour window of task completion fidelity scores will catch model behavior shifts within a day of them occurring.
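A minimal version of this statistical process control check — flagging when the rolling window's mean score leaves the baseline's 3-sigma control limits — might look like the following. The window sizes and sigma multiplier are illustrative defaults:

```python
from statistics import mean, stdev

def control_limits(baseline: list[float], sigma: float = 3.0):
    """Control limits derived from a stable baseline window of scores."""
    m, s = mean(baseline), stdev(baseline)
    return m - sigma * s, m + sigma * s

def detect_shift(baseline: list[float], window: list[float]) -> bool:
    """Flag a population-level shift: the rolling window's mean
    falls outside the baseline's control limits."""
    lo, hi = control_limits(baseline)
    return not (lo <= mean(window) <= hi)

# Stable fidelity scores, then a degraded day after a provider update:
baseline = [4.2, 4.1, 4.3, 4.2, 4.0, 4.3, 4.1, 4.2]
degraded = [3.6, 3.5, 3.7, 3.6]
# detect_shift(baseline, degraded) -> True
```

Each individual degraded score (3.5-3.7) would pass a per-task threshold; only the population-level view catches the shift.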


Production Agent Metrics That Actually Matter

Most agent monitoring setups track the wrong things. Error rate is a lagging indicator that captures only the failures loud enough to throw an exception. Latency averages hide the bimodal distribution you actually have: fast successful paths and slow failing paths that eventually time out. Response length distributions look fine in aggregate while individual task classes degrade silently.

These are the metrics that reflect genuine production agent health:

Task completion rate by task class. Not overall success rate — broken down by the categories of tasks your agent handles. A 94% overall completion rate can hide a 40% failure rate on a specific task type that accounts for 5% of volume. Segment first, aggregate second.

Step-level tool call success rate. What percentage of tool invocations return valid, schema-compliant responses? This catches tool API degradation before it surfaces as task failures.

Finish reason distribution. Track the ratio of stop vs length vs content_filter finish reasons over time. A rising length rate means your context window management is degrading — context is growing per task, likely due to verbose tool responses or insufficient context trimming.

Inline verification pass rate. What percentage of step-level verification checks pass? A drop here is an early warning signal — it means your agent is encountering more malformed data or unexpected tool responses than normal.

Token cost per successful task. Not total token cost — cost per successful completion. If this metric rises, your agent is either taking more steps per task (planning degradation), making more tool calls (context quality degradation), or producing more output per step (verbosity drift). Each root cause has a different fix.

p99 task latency. The 99th percentile matters more than the mean for agents because failure paths are dramatically slower than success paths. A rising p99 with a stable mean means your slow-path failures are getting slower, often a symptom of retry exhaustion without circuit breaking.
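The "segment first, aggregate second" principle behind the first and fifth metrics can be computed directly from trace rows. The row shape here (`task_class`, `success`, `tokens`) is an assumed minimal schema, not a fixed format:

```python
from collections import defaultdict

def metrics_by_task_class(runs: list[dict]) -> dict:
    """Completion rate and token cost per *successful* task,
    broken down by task class rather than aggregated overall."""
    buckets = defaultdict(lambda: {"n": 0, "ok": 0, "ok_tokens": 0})
    for r in runs:
        b = buckets[r["task_class"]]
        b["n"] += 1
        if r["success"]:
            b["ok"] += 1
            b["ok_tokens"] += r["tokens"]
    return {
        cls: {
            "completion_rate": b["ok"] / b["n"],
            # Cost divided by successes, not by total runs: failed
            # runs burn tokens without producing value.
            "tokens_per_success": b["ok_tokens"] / b["ok"] if b["ok"] else None,
        }
        for cls, b in buckets.items()
    }

runs = [
    {"task_class": "research", "success": True, "tokens": 12_000},
    {"task_class": "research", "success": True, "tokens": 14_000},
    {"task_class": "billing", "success": False, "tokens": 9_000},
    {"task_class": "billing", "success": True, "tokens": 5_000},
]
m = metrics_by_task_class(runs)
```

In this toy sample, a 75% overall completion rate hides that the billing class is failing half the time — exactly the aggregation trap described above.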


Connecting Evaluation to Deployment Decisions

Observability data is only valuable if it connects to action. The most important action it should gate is deployment.

Most teams deploying new agent versions — whether that means a new model, an updated system prompt, or changes to the tool integration layer — rely on pre-deploy evaluation against a static test set. This is insufficient for catching regressions that appear only on the long tail of production inputs.

A production-grade agent deployment process uses shadow evaluation: the new agent version runs in parallel against sampled production traffic without serving responses to users. Both versions produce execution traces and evaluation scores. The deployment gate is a statistical comparison of the evaluation score distributions — the new version must not show a statistically significant regression in task completion fidelity, schema compliance, or cost per successful task before it receives live traffic.
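The statistical comparison behind the deployment gate can be sketched with a permutation test, which makes no distributional assumptions about the evaluation scores. The alpha threshold and one-sided regression check are illustrative gate policies, not the only reasonable ones:

```python
import random

def permutation_pvalue(control: list[float], candidate: list[float],
                       n_iter: int = 10_000, seed: int = 0) -> float:
    """Two-sided permutation test on the difference of mean scores."""
    rng = random.Random(seed)
    observed = abs(sum(candidate) / len(candidate) - sum(control) / len(control))
    pooled = control + candidate
    n = len(candidate)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n]) / n - sum(pooled[n:]) / len(pooled[n:]))
        if diff >= observed:
            hits += 1
    return hits / n_iter

def gate_deploy(control: list[float], candidate: list[float],
                alpha: float = 0.05) -> bool:
    """True = safe to ship: block only on a statistically
    significant *regression*, not on any difference."""
    regressed = (sum(candidate) / len(candidate) < sum(control) / len(control)
                 and permutation_pvalue(control, candidate) < alpha)
    return not regressed
```

An improvement in scores passes the gate; only a significant drop blocks it, which keeps the gate from punishing genuine wins.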

This pattern applies equally to model provider updates you did not initiate. When Anthropic or OpenAI updates a model, your agent’s behavior can shift without any change on your side. Shadow evaluation running continuously against production traffic will catch these shifts within hours of them occurring.


Where Agent Observability Breaks Down

Inline verification and execution tracing are not magic. Three failure modes will burn you if you do not build defenses against them.

Semantic failure is invisible to structural checks. An agent that returns a factually incorrect but structurally valid response passes every schema check, every finish reason check, and every field presence check. LLM-as-judge evaluation in the async layer catches this — structural checks do not. Do not substitute structural observability for semantic evaluation.

Trace volume overwhelms storage at scale. Full-fidelity execution traces for an agent handling 10,000 tasks per day — each with 8-15 steps — generate substantial data. Context snapshots are the biggest cost driver. Most production deployments store full traces for a 7-day window, hash-indexed compressed traces for 30 days, and only evaluation scores and metadata beyond that. Design your retention policy before you hit storage pressure.
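One way to encode the tiered retention policy described above is a simple age-to-tier lookup; the tier names and horizons here mirror the 7-day/30-day example but are hypothetical, not a standard:

```python
from datetime import timedelta

# Hypothetical retention tiers matching the policy sketched above.
RETENTION_TIERS = [
    (timedelta(days=7),  "full_trace"),        # full-fidelity spans + context snapshots
    (timedelta(days=30), "compressed_trace"),  # hash-indexed, context deltas only
    (timedelta.max,      "scores_only"),       # evaluation scores + metadata forever
]

def storage_tier(age: timedelta) -> str:
    """Map a trace's age to the storage tier it belongs in."""
    for horizon, tier in RETENTION_TIERS:
        if age <= horizon:
            return tier
    return "scores_only"

tier = storage_tier(timedelta(days=12))  # -> "compressed_trace"
```

A nightly job that walks traces and demotes them by tier is usually enough; the important part is that the policy exists in code before storage pressure forces an ad-hoc purge.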

Evaluation latency creates a feedback blindspot. Async evaluation jobs that take 2-4 minutes per task mean you are 2-4 minutes behind when a failure mode emerges. For most deployments, this is acceptable. For agents handling time-sensitive or high-stakes tasks, you need the inline verification layer to be comprehensive enough to catch the failure modes that cannot wait for async evaluation.


Building the Observability Stack Incrementally

Do not wait until you have a fully instrumented pipeline to deploy agents to production. The right order is:

  1. Start with execution traces — even unstructured JSON logs with step ID, tool name, and outcome are better than nothing. You can query them when something breaks.

  2. Add inline schema validation on tool responses — this is a one-hour investment that eliminates the entire class of silent failures caused by bad data.

  3. Instrument token costs per task — you need to know your cost envelope before you scale.

  4. Add async LLM-as-judge evaluation for your most critical task classes — even covering 20% of your task types with quality evaluation is a meaningful improvement over none.

  5. Build regression detection on your evaluation score time series — this is the layer that makes your deployment pipeline genuinely safe.

Most teams who operate agents at scale tell a version of the same story: they built 80% of their observability stack in response to production incidents that would have been preventable. The incremental path above front-loads the coverage that prevents the costliest failures.


Conclusion

Agent evaluation and observability is not an instrumentation problem — it is an architecture problem. The teams operating agents reliably in production have built three interconnected layers: inline verification that catches structural failures on the critical path, async quality evaluation that catches semantic failures off the critical path, and regression detection that catches population-level shifts that individual evaluation misses.

The metrics that matter — task completion rate by task class, tool call success rate, finish reason distribution, cost per successful task, p99 latency — are not the ones most teams start tracking. But they are the ones that tell you whether your agent is working or quietly degrading.

Observability is what separates an agent deployment you understand from one that surprises you. And in production AI, surprises are expensive.


The evaluation pipeline architecture described here is tightly coupled to verification loop design — inline checks are only as good as the verification patterns you have built into your harness layer. Read our deep dive on verification loop architectures for production agents for the companion patterns that make inline verification effective rather than theatrical.

If you are building agent infrastructure and want to avoid the observability gaps that cause expensive production incidents, subscribe to the harness-engineering.ai newsletter for production patterns published weekly.
