March 7, 2025
Claude AI Infrastructure Development

Agent Evaluation Quality Metrics

Difficulty: Intermediate | Cluster: Subagents and Agent Teams

When you deploy agents to do real work, you need to know if they're actually doing it well. That's where evaluation and quality metrics come in. In this guide, we'll walk you through defining success criteria, measuring agent output automatically, running A/B tests on instructions, and building feedback loops that make your agents progressively smarter.

The core challenge is this: how do you measure success for something as nuanced as AI-generated output? You can't just check if a task ran without errors. You need deeper metrics that capture quality, consistency, cost-efficiency, and speed. This guide shows you exactly how to set up that measurement infrastructure using Claude Code. We'll cover not just the theory but the practical implementation details that make the difference between a metrics system you abandon and one you actually use.

Table of Contents
  1. Why Agent Evaluation Matters
  2. Pillar 1: Defining Success Criteria
  3. Pillar 2: Automated Quality Scoring with Rubrics
  4. Pillar 3: Task Completion Rate and Performance Tracking
  5. Pillar 4: A/B Testing Agent Instructions
  6. Pillar 5: Feedback Loops and Continuous Improvement
  7. Advanced: Stratified A/B Testing and Subgroup Analysis
  8. Advanced: Longitudinal Quality Tracking and Drift Detection
  9. Practical: Setting Up Metrics in Code
  10. Common Pitfalls (and How to Avoid Them)
  11. Advanced: Dimension Correlation and Tradeoff Analysis
  12. Advanced: Failure Pattern Clustering
  13. Advanced: Cost-Quality Pareto Frontier
  14. Building Your Evaluation System
  15. Real-World Example: Multi-Metric Evaluation in Practice
  16. Integrating Metrics with CI/CD
  17. The Investment Thesis: Why Evaluation Pays Off
  18. When NOT to Build Evaluation (Yet)
  19. Scaling Evaluation Across Many Agents
  20. Conclusion

Why Agent Evaluation Matters

Before you ship agents into production, you should know: Do they complete tasks reliably? How good is their output? Are they getting faster or slower over time? Can you spot patterns in failures? What's the ROI on running this agent versus human effort?

Without evaluation metrics, you're flying blind. Your agent might be consistently missing edge cases, hallucinating facts, or producing verbose output that wastes API tokens. You won't know until a human manually reviews dozens of outputs—and by then, you've burned money and credibility. You might also have shipped broken features to users.

The right evaluation setup gives you continuous visibility into agent performance. You catch regressions early (within hours, not days). You validate improvements before shipping them (preventing bad changes from reaching production). You understand exactly where your agents are strong and where they need work. You can even compare agents objectively and choose the best one for each task.

There are five pillars to solid agent evaluation:

  1. Success Criteria - What does "done well" mean? Which failures are hard gates?
  2. Output Quality Scoring - Is the output actually good? By what dimensions?
  3. Completion, Cost, and Performance Tracking - Does the agent finish tasks reliably? Are we efficient with API calls and time?
  4. A/B Testing - Do instruction changes actually improve results?
  5. Feedback Loops - Can we learn from failures and improve iteratively?

Let's build each one. By the end, you'll have a system that continuously improves your agents without manual intervention.

Pillar 1: Defining Success Criteria

You can't measure what you haven't defined. So the first step is to be explicit about what success looks like for your agent. This forces you to think deeply about what matters.

A good success criterion answers these questions:

  • Does the agent understand the task?
  • Does it produce output in the expected format?
  • Does it complete within acceptable time and token budgets?
  • Does it handle edge cases gracefully?
  • Are the outputs useful to the end user?

Let's say you have an agent that writes code comments. Success might mean:

  • Task accepted and started within 2 seconds
  • Comments generated for 100% of public methods
  • No syntax errors in generated code
  • Explanation is clear and under 120 characters per line
  • Completes in under 30 seconds per file
  • Comments are accurate (parameters match function signature)

Here's how to encode that in YAML:

yaml
success_criteria:
  task_name: code-comment-generation
  version: 1.0
  last_updated: 2026-03-16
 
  acceptance_criteria:
    - criterion: task_understanding
      description: Agent correctly identifies scope (public methods only)
      pass_condition: agent_request_includes(['public', 'method', 'comment'])
      weight: 1.0
      notes: "This is a gate condition. Misunderstanding scope is fatal."
 
    - criterion: completeness
      description: All public methods receive comments
      pass_condition: commented_methods >= total_public_methods
      weight: 1.0
      notes: "100% coverage expected. Missing any method is a failure."
 
    - criterion: syntax_integrity
      description: Generated code has no syntax errors
      pass_condition: parser.validate(output) == true
      weight: 1.5
      notes: "Broken syntax blocks downstream processes. Critical gate."
 
    - criterion: clarity
      description: Comments are concise (max 120 chars per line)
      pass_condition: all(len(line) <= 120 for line in comments)
      weight: 0.8
      notes: "Nice to have, but not critical. Readability matters though."
 
    - criterion: latency
      description: Task completes within acceptable time
      pass_condition: execution_time_seconds <= 30
      weight: 0.5
      notes: "Lower weight. Users care more about quality than speed here."
 
    - criterion: parameter_accuracy
      description: Documented parameters match function signature
      pass_condition: documented_params == actual_params
      weight: 1.2
      notes: "Inaccurate docs are worse than missing docs. Important gate."
 
  scoring:
    pass_threshold: 4.0 # Must achieve at least 4 points
    fail_on: [task_understanding, syntax_integrity, parameter_accuracy] # Critical gates
    notes: "If ANY gate fails, the output is rejected regardless of other scores."

Notice the weight field. Not all criteria are equally important. Syntax errors are a hard fail (weight 1.5), but exceeding time budget is less critical (weight 0.5). The fail_on list defines gates that can't be negotiated. If the agent doesn't understand the task or breaks syntax, you reject the output immediately.

The notes field is for you—future you, and teammates. Explain the reasoning. This isn't boilerplate. It's institutional knowledge.
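To make the gate-plus-weights logic concrete, here is a minimal sketch of how such criteria might be evaluated in code. The criterion names, weights, and gates mirror the YAML above; the `evaluate` helper and the boolean check results are hypothetical stand-ins for your own validators.

```python
# Hypothetical sketch: hard gates first, then a weighted sum against pass_threshold.
# The per-criterion booleans would come from your own automated checks.

def evaluate(checks, weights, gates, pass_threshold):
    """checks: {criterion: bool}, weights: {criterion: float}."""
    # Any failed gate rejects the output outright, regardless of total score.
    for gate in gates:
        if not checks.get(gate, False):
            return {"verdict": "fail", "score": 0.0, "failed_gate": gate}

    # Otherwise, sum the weights of every passing criterion.
    score = sum(w for name, w in weights.items() if checks.get(name, False))
    verdict = "pass" if score >= pass_threshold else "fail"
    return {"verdict": verdict, "score": score, "failed_gate": None}

weights = {
    "task_understanding": 1.0, "completeness": 1.0, "syntax_integrity": 1.5,
    "clarity": 0.8, "latency": 0.5, "parameter_accuracy": 1.2,
}
gates = ["task_understanding", "syntax_integrity", "parameter_accuracy"]

checks = {name: True for name in weights}
checks["latency"] = False  # over the time budget, but latency is not a gate
result = evaluate(checks, weights, gates, pass_threshold=4.0)
```

With only the latency criterion failing, the score is 5.5 of a possible 6.0 and the output passes; flip any gate to `False` and the output is rejected no matter what else passed.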

Pillar 2: Automated Quality Scoring with Rubrics

Once you know what success looks like, you need to measure it automatically. That's where rubrics come in. A rubric is a structured scoring system that assigns points based on dimensions of quality. Instead of yes/no pass-fail, you score each dimension on a scale (typically 0-5), then calculate a weighted average.

Here's a rubric for code comment quality:

yaml
quality_rubric:
  task_name: code-comment-generation
  version: 2.1
  dimensions:
    - name: clarity
      weight: 0.25
      description: Are comments clear and understandable?
      levels:
        0: Comments are missing or unintelligible
        1: Comments exist but are vague or confusing
        2: Comments explain the basics but miss important details
        3: Comments are clear and cover main functionality
        4: Comments are clear with good examples
        5: Comments are exemplary - clear, concise, helpful, include examples
 
      scoring_function: |
        def score_clarity(comments, code):
          # Automated checks
          score = 3  # baseline: assume average clarity
 
          # Penalize length
          avg_length = sum(len(c) for c in comments) / len(comments) if comments else 0
          if avg_length > 120:
            score -= 0.5
          elif avg_length < 20:
            score -= 0.5  # Too short might mean insufficient info
 
          # Bonus for parameter docs
          if has_param_documentation(comments):
            score += 1
 
          # Bonus for return docs
          if has_return_documentation(comments):
            score += 0.5
 
          # Bonus for examples
          if has_code_example(comments):
            score += 0.5
 
          return min(score, 5)
 
    - name: coverage
      weight: 0.25
      description: Are all significant methods commented?
      levels:
        0: No methods are commented
        1: Less than 25% of public methods commented
        2: 25-50% of public methods commented
        3: 50-75% of public methods commented
        4: 75-99% of public methods commented
        5: 100% of public methods commented
 
      scoring_function: |
        def score_coverage(commented, total):
          if total == 0:
            return 5  # No methods to comment = full coverage
          percentage = commented / total
          return min(percentage * 5, 5)
 
    - name: correctness
      weight: 0.30
      description: Do comments accurately describe the code?
      levels:
        0: Comments contradict the code
        1: Comments describe wrong functionality
        2: Comments are partially accurate but miss key behavior
        3: Comments accurately describe code with minor gaps
        4: Comments are accurate and nearly comprehensive
        5: Comments are perfectly accurate and comprehensive
 
      scoring_function: |
        def score_correctness(comments, code_ast):
          score = 3  # baseline: assume average accuracy
 
          # Check parameter names and counts
          doc_params = extract_documented_params(comments)
          actual_params = extract_function_signature(code_ast)
          if doc_params == actual_params:
            score += 0.5
          elif set(doc_params) == set(actual_params):
            score += 0.2  # Correct params, wrong order
 
          # Check for return type documentation
          if has_return_type(comments):
            score += 0.5
 
          # Check for exceptions documentation
          if has_exception_docs(comments):
            score += 0.5
 
          # Check for major implementation details (loops, recursion, etc)
          complex_features = detect_complex_features(code_ast)
          if documents_complex_features(comments, complex_features):
            score += 0.5
 
          return min(score, 5)
 
    - name: efficiency
      weight: 0.20
      description: Are comments concise, not verbose?
      levels:
        0: Comments are excessively verbose (>300 chars per method)
        1: Comments are verbose (200-300 chars per method)
        2: Comments are somewhat verbose (150-200 chars per method)
        3: Comments are reasonably concise (100-150 chars per method)
        4: Comments are concise (50-100 chars per method)
        5: Comments are exemplary (under 50 chars per method) - say more with less
 
      scoring_function: |
        def score_efficiency(comments):
          if not comments:
            return 0
          avg_length = sum(len(c) for c in comments) / len(comments)
 
          if avg_length > 300:
            return 0
          elif avg_length > 200:
            return 1
          elif avg_length > 150:
            return 2
          elif avg_length > 100:
            return 3
          elif avg_length > 50:
            return 4
          else:
            return 5
 
  calculation:
    method: weighted_average
    formula: >
      SCORE =
        (clarity × 0.25) +
        (coverage × 0.25) +
        (correctness × 0.30) +
        (efficiency × 0.20)
 
    thresholds:
      excellent: 4.5
      good: 4.0
      acceptable: 3.0
      poor: 2.0
      failing: 0.0
 
    notes: |
      A score of 3.0+ is acceptable but not good. Aim for 4.0+.
      4.5+ is excellent and production-ready without review.
      Anything below 3.0 should trigger investigation and retraining.

Notice how the rubric includes automated scoring functions. These aren't human judgments—they're deterministic algorithms that measure specific, verifiable attributes. This is crucial for automation at scale.
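The weighted-average calculation itself is small enough to sketch directly. This is a minimal illustration of the formula and thresholds from the rubric above; the dimension scores are made-up inputs standing in for what the scoring functions would return.

```python
# Minimal sketch of the rubric's weighted-average calculation and threshold
# grading. Dimension scores here are illustrative.

RUBRIC_WEIGHTS = {"clarity": 0.25, "coverage": 0.25, "correctness": 0.30, "efficiency": 0.20}
THRESHOLDS = [("excellent", 4.5), ("good", 4.0), ("acceptable", 3.0), ("poor", 2.0)]

def weighted_score(dimension_scores):
    return sum(dimension_scores[name] * w for name, w in RUBRIC_WEIGHTS.items())

def grade(score):
    for label, cutoff in THRESHOLDS:
        if score >= cutoff:
            return label
    return "failing"

scores = {"clarity": 4, "coverage": 5, "correctness": 4, "efficiency": 3}
total = weighted_score(scores)  # 4*0.25 + 5*0.25 + 4*0.30 + 3*0.20 = 4.05
```

A 4.05 lands in the "good" band: shippable, but short of the 4.5 that would skip human review.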

How do you actually run this? You'll implement a validator agent that:

  1. Takes the agent output
  2. Applies each scoring function
  3. Calculates the weighted score
  4. Returns a structured report

Here's a YAML config for that validator:

yaml
validator_agent:
  name: quality-scorer
  model: claude-opus-4
  version: 1.0
 
  system_prompt: |
    You are a quality validator for code comments. You score outputs on four dimensions.
 
    For each dimension, apply the scoring function provided. Return results as JSON.
 
    Do NOT make subjective judgments. Use only the automated scoring functions.
    If a function requires human interpretation, delegate that to the human reviewer.
 
    Your job is to be consistent, deterministic, and fair.
 
  tools:
    - type: code_analyzer
      functions:
        - analyze_syntax
        - extract_comments
        - extract_function_signatures
        - measure_length
        - detect_complex_features
 
    - type: comparison
      functions:
        - compare_documented_vs_actual_params
        - check_return_documentation
        - check_exception_documentation
        - verify_parameter_names
        - verify_return_type
 
  output_format:
    type: json
    schema:
      type: object
      properties:
        task_id:
          type: string
        timestamp:
          type: string
          format: date-time
        dimensions:
          type: array
          items:
            type: object
            properties:
              name:
                type: string
              raw_score:
                type: number
                minimum: 0
                maximum: 5
              reasoning:
                type: string
              evidence:
                type: array
                items:
                  type: string
              weight:
                type: number
        weighted_score:
          type: number
          minimum: 0
          maximum: 5
        verdict:
          type: string
          enum: [pass, conditional, fail]
        human_review_required:
          type: boolean
          description: Set to true if any dimension requires human judgment
        recommendations:
          type: array
          items:
            type: string
 
  usage:
    frequency: "On every task completion"
    timeout_seconds: 60
    retry_on_failure: true

Run this validator on every task completion. Store the results in a metrics database so you can track trends over time. You now have data.

Pillar 3: Task Completion Rate and Performance Tracking

Beyond quality, you need to track whether agents complete tasks at all, and how efficiently they do it. Task completion rate is simple on the surface: of N tasks assigned, how many did the agent finish successfully? But there are important nuances:

  • Did it finish on first try, or did it need retries?
  • Did it hit rate limits or timeouts?
  • Did it ask for help (human-in-the-loop)?
  • Did it fail gracefully or crash?
  • How much time did it spend versus what was budgeted?

Here's a YAML config for tracking task lifecycle:

yaml
task_tracking:
  version: 2.0
 
  task_event_schema:
    task_id: string # UUID
    agent_id: string
    agent_name: string
    task_type: string # e.g., "code-comment-generation"
    created_at: timestamp
    started_at: timestamp
    completed_at: timestamp
    status: enum [pending, started, completed, failed, abandoned, retrying]
 
    # Completion metrics
    completion_time_seconds: number
    tokens_used:
      input: number
      output: number
      total: number
 
    cost_usd: number
    cost_per_quality_point: number # cost / quality_score (useful for ROI)
 
    retries: integer
    retry_reasons: array[string] # e.g., ["timeout", "rate_limit"]
 
    # Quality metrics (populated by validator)
    quality_score: number (0-5)
    dimensions:
      clarity: number
      coverage: number
      correctness: number
      efficiency: number
 
    # Failure info
    failure_reason: string # if status=failed
    failure_category: enum # e.g., "timeout", "rate_limit", "invalid_input", "hallucination"
    error_message: string
    human_intervention_required: boolean
    root_cause: string # Diagnosed root cause after analysis
 
  aggregation:
    intervals:
      - period: hourly
        metrics:
          - completion_rate: COUNT(status=completed) / COUNT(*)
          - avg_completion_time: AVG(completion_time_seconds)
          - avg_quality_score: AVG(quality_score)
          - total_cost: SUM(cost_usd)
          - retry_rate: COUNT(retries > 0) / COUNT(*)
          - failure_rate: COUNT(status=failed) / COUNT(*)
          - tokens_per_task: AVG(tokens_used.total)
 
      - period: daily
        metrics:
          - all_hourly_metrics_plus:
              - completion_rate_trend: has_completion_rate_changed?
              - quality_trend: has_quality_changed_significantly?
              - cost_efficiency: total_cost / COUNT(*)
              - failure_breakdown_by_reason: group_failures_by_category
              - most_common_failure: most_frequent_failure_reason
              - worst_performing_task_type: which_task_types_failed_most
 
      - period: weekly
        metrics:
          - all_daily_metrics_plus:
              - trend_analysis: linear_regression(metric, time)
              - anomaly_detection: is_outlier(metric)
              - improvement_from_last_week: (current - previous) / previous
              - correlation_with_model_changes: did_changes_improve_metrics?
 
    storage:
      backend: timeseries_database # e.g., InfluxDB, CloudWatch
      retention: 90_days_minimum
      resolution: 1_minute # Store raw data at 1-min granularity

Here's what this gives you:

Completion Rate: If you complete 95 tasks out of 100, that's a 95% rate. But if those 95 tasks took 1.3 attempts each on average, your effective completion rate is 95% / 1.3 ≈ 73%. That's the real metric. Track both the nominal and effective rates.

Quality Trend: If your avg quality score drops from 4.2 to 3.8 over a week, you have a regression. Trigger alerts. Investigate what changed. Roll back recent agent changes if needed.

Cost Tracking: If tokens spike 40% week-over-week, investigate whether:

  • The agent is being verbose (fixable through instructions)
  • Task complexity increased (might be expected)
  • The model changed (might explain the jump)

Failure Patterns: If 30% of failures are "hallucinated dependency not found", that's actionable feedback. You need to retrain the agent on dependency verification or add guardrails.

Cost per Quality Point: This is the ROI metric. If quality improved 0.5 points but cost increased $2/task, is it worth it? You decide based on business value.

Store these metrics in a timeseries database so you can query trends, set up alerts, and build dashboards.
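Before the timeseries machinery exists, the core aggregation can be sketched in a few lines. This illustrative example computes a handful of the hourly metrics from the config above, including the nominal-versus-effective completion rate distinction; the field names follow the task_event_schema, the events and the `aggregate` helper are hypothetical.

```python
# Sketch: roll task events up into hourly-style metrics. Field names follow the
# task_event_schema above; the event data is illustrative.

def aggregate(events):
    n = len(events)
    completed = [e for e in events if e["status"] == "completed"]
    # Effective rate discounts retries: a task that averaged 1.3 attempts
    # consumed 1.3 completions' worth of work.
    attempts = [1 + e["retries"] for e in completed]
    avg_attempts = sum(attempts) / len(attempts) if attempts else 1
    return {
        "completion_rate": len(completed) / n,
        "effective_completion_rate": (len(completed) / n) / avg_attempts,
        "retry_rate": sum(e["retries"] > 0 for e in events) / n,
        "avg_quality_score": sum(e["quality_score"] for e in completed) / len(completed),
        "total_cost_usd": sum(e["cost_usd"] for e in events),
    }

events = [
    {"status": "completed", "retries": 0, "quality_score": 4.2, "cost_usd": 0.03},
    {"status": "completed", "retries": 2, "quality_score": 3.8, "cost_usd": 0.07},
    {"status": "failed",    "retries": 1, "quality_score": 0.0, "cost_usd": 0.05},
    {"status": "completed", "retries": 0, "quality_score": 4.5, "cost_usd": 0.02},
]
metrics = aggregate(events)
```

Here the nominal completion rate is 75%, but retries drag the effective rate down to 45%: exactly the gap the nominal number hides.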

Pillar 4: A/B Testing Agent Instructions

Here's where it gets strategic. Suppose you believe you can improve an agent's quality score by tweaking its instructions. How do you validate that the change actually works? You could just deploy it and hope. But smart teams A/B test instead.

In an A/B test, you run the same tasks with two different instruction sets and compare results. This gives you statistical confidence that changes are improvements, not regressions.

Here's a practical example. Your code-comment agent currently uses this instruction:

Write helpful comments for public methods. Comments should be clear and concise.

You hypothesize that adding explicit guidance about parameter documentation will improve quality:

Write helpful comments for public methods. Comments should:
- Clearly explain what the method does
- Document all parameters and their types
- Document the return value and its type
- Mention any exceptions that might be raised
- Keep comments under 120 characters per line

Set up an A/B test like this:

yaml
ab_test:
  name: parameter_documentation_guidance
  test_id: ab-2026-03-16-001
  created_at: 2026-03-16T09:00:00Z
 
  hypothesis: |
    Adding explicit guidance about parameter and return documentation
    will improve quality scores by at least 0.3 points (from 4.0 to 4.3)
    without increasing token usage by more than 15%.
 
    Reasoning: Current agent sometimes misses parameter documentation.
    Explicit guidance in the system prompt should increase coverage.
 
  variant_a:
    name: baseline
    description: "Current instruction set"
    instructions: |
      Write helpful comments for public methods. Comments should be
      clear and concise.
    expected_quality_baseline: 4.0
    expected_tokens_baseline: 150
 
  variant_b:
    name: enhanced_with_param_docs
    description: "Adds explicit parameter documentation guidance"
    instructions: |
      Write helpful comments for public methods. Comments should:
      - Clearly explain what the method does
      - Document all parameters and their types
      - Document the return value and its type
      - Mention any exceptions that might be raised
      - Keep comments under 120 characters per line
 
  test_design:
    sample_size: 200 # 100 tasks per variant
 
    sampling_strategy: stratified_random
    strata:
      - name: method_count
        buckets: [1-5 methods, 6-20 methods, 20+ methods]
        distribution: [0.3, 0.5, 0.2] # Match production distribution
      - name: code_type
        buckets: [api, business_logic, utilities, data_models]
        distribution: [0.25, 0.35, 0.25, 0.15]
 
    randomization:
      method: random_assignment
      seed: 12345 # For reproducibility
      stratification: ensure_even_distribution_across_strata
 
  metrics:
    primary:
      - name: quality_score
        measure: avg(quality_score)
        hypothesis_direction: greater_than
        hypothesis_delta: 0.3
        statistical_test: welch_t_test # Accounts for unequal variances
        significance_level: 0.05
        minimum_sample_size: 100_per_variant
 
    secondary:
      - name: token_efficiency
        measure: (output_tokens / quality_score)
        hypothesis_direction: less_than
        hypothesis_delta: 0.15 # 15% improvement in efficiency
 
      - name: parameter_coverage
        measure: avg(documentation_includes_parameters?)
        hypothesis_direction: greater_than
        hypothesis_delta: 0.2 # 20% improvement in coverage
 
      - name: completion_time
        measure: avg(completion_time_seconds)
        hypothesis_direction: not_significantly_different
 
      - name: failure_rate
        measure: COUNT(failed) / COUNT(*)
        hypothesis_direction: less_than_or_equal
        notes: "Don't want variant B to crash more often"
 
  analysis:
    period: 1_day # Complete test in 24 hours
 
    checkpoint_at:
      - 50_samples: interim_analysis
        # Optional: could stop early if results are conclusive
      - 100_samples: final_analysis
 
    success_criteria:
      - primary_metric_achieves_hypothesis
      - secondary_metrics_dont_regress
      - statistical_significance: p_value < 0.05
      - failure_rate_not_increased
 
  rollout_plan:
    if_success:
      - update_instructions_for_variant_b
      - deploy_to_production_gradually # 10% → 50% → 100% over 24 hours
      - monitor_production_quality_score_for_1_week
      - alert_if_regression_detected
      - document_change_in_changelog
      - archive_test_results_and_analysis
 
    if_failure:
      - analyze_why_hypothesis_was_wrong
      - check_for_confounding_variables
      - sample_failures_for_detailed_inspection
      - refine_hypothesis_based_on_findings
      - propose_new_variant_c
      - schedule_new_test
 
    if_inconclusive:
      - increase_sample_size_to_300
      - extend_test_duration_to_2_days
      - re_run_statistical_analysis

Here's the key insight: you're not just eyeballing results. You're running a proper statistical test (Welch's t-test accounts for different variances). You're measuring not just the primary metric (quality) but secondary metrics too (efficiency, time, parameter coverage). You're checking that you haven't optimized for one thing at the expense of another.
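For intuition, here is what the Welch statistic looks like computed by hand on illustrative quality scores. In practice you would likely call `scipy.stats.ttest_ind(b, a, equal_var=False)` and use the returned p-value; this sketch just shows that the statistic does not assume the two variants share a variance, which is the reason the test config specifies Welch rather than Student.

```python
# Sketch of Welch's t-statistic on per-variant quality scores (illustrative data).
# Positive t favors variant B; the p-value would normally come from
# scipy.stats.ttest_ind(b, a, equal_var=False).

from statistics import mean, variance
from math import sqrt

def welch_t(a, b):
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)       # sample variances (may differ)
    se = sqrt(va / na + vb / nb)            # standard error of the difference
    t = (mean(b) - mean(a)) / se
    # Welch-Satterthwaite degrees of freedom
    df = (va / na + vb / nb) ** 2 / (
        (va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1)
    )
    return t, df

variant_a = [3.8, 4.0, 4.1, 3.9, 4.2, 4.0]  # baseline quality scores
variant_b = [4.2, 4.4, 4.3, 4.5, 4.1, 4.4]  # enhanced-instructions scores
t, df = welch_t(variant_a, variant_b)
```

With real test data you would use the full 100-per-variant samples; six points per arm is only enough to show the mechanics.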

How do you implement this? Use a testing agent:

yaml
testing_agent:
  name: ab_tester
  model: claude-opus-4
 
  capabilities:
    - split_tasks_into_variants
    - assign_tasks_randomly_to_cohorts
    - invoke_agent_variant_a_with_config
    - invoke_agent_variant_b_with_config
    - validate_outputs_with_rubric
    - collect_metrics_in_timeseries_db
    - run_statistical_analysis
    - generate_test_report
    - make_go_nogo_decision
 
  workflow:
    1_prepare:
      - load_test_config
      - fetch_historical_baseline_metrics
      - verify_sample_size_adequacy
      - seed_random_number_generator
      - validate_test_design
 
    2_run_test:
      - for_each_task_in_batch:
          - assign_random_variant (stratified)
          - invoke_variant_with_config
          - collect_raw_output
          - validate_against_rubric
          - record_metrics
          - check_if_checkpoint_reached
 
    3_analyze:
      - calculate_summary_statistics
      - run_hypothesis_test
      - check_success_criteria
      - identify_confounding_variables
      - generate_detailed_report
 
    4_decide:
      - if_test_passed: trigger_rollout_plan_success
      - if_test_failed: trigger_rollout_plan_failure
      - if_inconclusive: trigger_extended_test
 
  output:
    format: json_and_markdown
    includes:
      - summary_statistics_table
      - statistical_test_results
      - confidence_intervals
      - effect_sizes
      - detailed_findings
      - recommendations

Run this test before shipping any instruction change. Yes, it takes 24 hours. But that's how long it takes to build statistical confidence. A bad instruction change can tank agent quality and waste thousands in API costs. Invest the time upfront.

Pillar 5: Feedback Loops and Continuous Improvement

Now you have metrics. You have A/B test results. The question is: what do you do with that data?

Build a feedback loop. This is the engine that makes your agents better over time. It's not a one-time thing. It's a repeating cycle.

Here's the structure:

yaml
feedback_loop:
  name: quality_improvement_cycle
  version: 2.0
 
  phases:
    phase_1_monitoring:
      cadence: daily_at_09_00_utc
      description: "Continuously track agent performance"
 
      actions:
        - fetch_metrics_from_timeseries_db
        - compare_current_to_baseline
        - detect_regressions_and_improvements
        - identify_top_failing_tasks
        - sample_failures_for_inspection
 
      failure_detection:
        - regression: current_quality < baseline - 0.2
        - anomaly: z_score > 2.5 (standard deviations)
        - pattern: same_failure_reason in >20% of failures
        - cost_spike: cost_per_task > historical_avg * 1.4
 
      alert_threshold:
        critical: quality < 2.5
        high: quality < 3.0 OR failure_rate > 0.3
        medium: quality < 3.5 OR cost increased 20%
 
    phase_2_diagnosis:
      trigger: if_regression_or_pattern_detected
      description: "Understand what went wrong"
 
      actions:
        - run_root_cause_analysis
        - sample_5_to_10_failures_for_human_review
        - correlate_failures_with_code_changes
        - check_for_data_distribution_shift
        - analyze_failure_patterns
        - look_for_timing_correlations
 
      root_cause_categories:
        - instruction_clarity: examples_given <= 2
        - task_complexity_increase: avg_tokens_per_task up >30%
        - model_change: was_model_updated_recently?
        - edge_case: failure_concentrated_in_specific_task_type
        - data_shift: input_characteristics_changed?
        - external_factor: rate_limits, timeouts, API changes?
 
      diagnosis_output:
        format: structured_report
        includes:
          - failure_samples
          - identified_root_cause
          - confidence_in_diagnosis
          - affected_tasks_and_patterns
          - estimated_impact
 
    phase_3_hypothesis:
      trigger: if_root_cause_identified
      description: "Formulate and validate improvement hypothesis"
 
      actions:
        - formulate_improvement_hypothesis
        - design_ab_test_to_validate_hypothesis
        - estimate_effort_and_expected_impact
        - assign_priority
 
      hypothesis_template: |
        Root Cause: {reason}
 
        Hypothesis: {proposed_improvement}
 
        Expected Impact:
          - Quality score: {delta} points
          - Token efficiency: {delta}%
          - Completion time: {delta}%
 
        Confidence: {low/medium/high}
 
        Why We Think This Will Work: {reasoning}
 
        Test Design: {brief overview}
 
        Success Criteria: {what proves hypothesis correct}
 
    phase_4_experimentation:
      trigger: if_hypothesis_approved_by_human
      description: "Test the improvement hypothesis"
 
      actions:
        - execute_ab_test_as_per_test_config
        - monitor_intermediate_results
        - check_success_criteria_at_checkpoints
        - stop_early_if_results_conclusive
        - adjust_if_confounds_detected
 
      resource_budget:
        - max_time: 24_hours
        - max_tasks: 500
        - max_cost: 50_usd
        - rollback_on_budget_exceeded: true
 
      checkpoint:
        at_50_samples: optional_early_stopping
        at_100_samples: required_decision_point
 
    phase_5_decision:
      trigger: if_ab_test_complete
      description: "Decide whether to deploy the improvement"
 
      if_test_passed:
        - update_instructions_for_variant_b
        - deploy_to_production_gradually
        - monitor_production_metrics_for_1_week
        - alert_if_regression_detected
        - document_improvement_in_changelog
 
      if_test_failed:
        - analyze_why_hypothesis_was_wrong
        - investigate_confounding_variables
        - propose_alternative_hypothesis
        - return_to_phase_3
 
      if_test_inconclusive:
        - increase_sample_size_if_promising
        - extend_test_duration
        - re_run_analysis_with_more_data
 
    phase_6_monitoring:
      trigger: always (continuous, especially post-deployment)
      description: "Track impact of deployed changes"
 
      actions:
        - track_quality_score_post_deployment
        - check_for_unexpected_side_effects
        - verify_no_regression_in_secondary_metrics
        - correlate_with_external_factors
 
      alert_conditions:
        - quality < baseline - 0.1 (regression detected)
        - token_efficiency worse than expected
        - completion_time increased by >20%
        - failure_pattern_emerged_in_new_area
 
      if_alert_triggered:
        - assess_severity
        - if_critical: rollback_to_previous_version
        - if_minor: document_for_future_fix
        - file_incident_report
        - return_to_phase_2_diagnosis
 
  cadence:
    - daily: monitoring, anomaly detection, diagnostics
    - weekly: review_of_trends, prioritization_of_improvements
    - biweekly: hypothesis_formulation_and_ab_test_scheduling
    - monthly: strategic_review, major_instruction_revisions, agent_training
 
  success_metrics:
    - quality_score_trend: should_be_increasing_or_stable
    - failure_rate_trend: should_be_decreasing_or_stable
    - cost_efficiency_trend: should_be_improving_or_stable
    - cycle_time: time_from_detection_to_deployment (aim_for_3_days)

This is a closed loop:

  1. Monitor - Continuously track quality and spot issues
  2. Detect - Find regressions, anomalies, patterns
  3. Diagnose - Understand root causes and inspect failures
  4. Hypothesize - Propose improvements backed by theory
  5. Experiment - Test with statistical rigor
  6. Deploy - Ship what works
  7. Monitor - Back to step 1

Each cycle should take 2-7 days depending on complexity. Over time, your agents get progressively better. After six months of these cycles, you can end up with an agent scoring 2+ quality points higher than where it started. That's not incremental; that's transformative.
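The phase-6 alert conditions above translate directly into a simple check. This is a minimal sketch with assumed metric names; the 10% token tolerance is an illustrative threshold, not from the config:

```python
def check_alerts(current, baseline):
    """Return the list of triggered alert conditions from phase_6_monitoring."""
    alerts = []
    # quality < baseline - 0.1 (regression detected)
    if current['quality'] < baseline['quality'] - 0.1:
        alerts.append('quality_regression')
    # token_efficiency worse than expected (10% tolerance assumed)
    if current['tokens_per_task'] > baseline['tokens_per_task'] * 1.1:
        alerts.append('token_efficiency_worse')
    # completion_time increased by >20%
    if current['completion_time'] > baseline['completion_time'] * 1.2:
        alerts.append('completion_time_increased')
    return alerts

baseline = {'quality': 4.2, 'tokens_per_task': 850, 'completion_time': 18.0}
current = {'quality': 4.05, 'tokens_per_task': 860, 'completion_time': 22.0}
print(check_alerts(current, baseline))  # ['quality_regression', 'completion_time_increased']
```

In practice this runs on a daily cadence against the same aggregates your monitoring already computes.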

Advanced: Stratified A/B Testing and Subgroup Analysis

In real-world evaluation, you often discover that an improvement works great for some users or tasks but fails for others. Detecting this requires subgroup analysis, and it's critical for avoiding deployment disasters.

For example, your code-comment agent might improve quality on short functions (under 10 lines) but degrade on long functions (50+ lines). If you deploy based on overall metrics, you'll look good in aggregate but break for users with large codebases.

Here's how to detect and analyze subgroups:

python
from statistics import mean
from scipy.stats import ttest_ind  # assumes SciPy; Welch's t-test via equal_var=False

def analyze_subgroups(test_results, stratification_key='code_complexity'):
  """
  Analyze test results separately for different subgroups.
  Detects whether an improvement is consistent across all segments.
  """

  subgroups = {}

  # Partition results by stratification key
  for result in test_results:
    key = result[stratification_key]

    if key not in subgroups:
      subgroups[key] = {
        'variant_a': [],
        'variant_b': [],
      }

    variant = result['variant']
    subgroups[key][variant].append(result['quality_score'])

  # Analyze each subgroup
  findings = {}

  for key, groups in subgroups.items():
    avg_a = mean(groups['variant_a'])
    avg_b = mean(groups['variant_b'])
    delta = avg_b - avg_a

    # Statistical significance (Welch's t-test handles unequal variances)
    t_stat, p_value = ttest_ind(groups['variant_a'], groups['variant_b'], equal_var=False)
 
    findings[key] = {
      'variant_a_avg': avg_a,
      'variant_b_avg': avg_b,
      'delta': delta,
      'p_value': p_value,
      'significant': p_value < 0.05,
      'sample_size_a': len(groups['variant_a']),
      'sample_size_b': len(groups['variant_b']),
    }
 
  return findings
 
# Example results showing subgroup analysis:
# Subgroup: 'simple' (5-10 lines)
#   Variant A: 4.1, Variant B: 4.4, Delta: +0.3, p=0.008 (SIGNIFICANT)
#
# Subgroup: 'medium' (10-50 lines)
#   Variant A: 3.9, Variant B: 4.0, Delta: +0.1, p=0.34 (NOT SIGNIFICANT)
#
# Subgroup: 'complex' (50+ lines)
#   Variant A: 3.6, Variant B: 3.4, Delta: -0.2, p=0.12 (NOT SIGNIFICANT, NEGATIVE TREND)
#
# Interpretation:
# Variant B is great for simple code, neutral for medium, potentially bad for complex code.
# Don't deploy globally. Deploy only for simple code, or refine variant to handle complex cases.

With subgroup analysis, you make targeted deployment decisions:

  • Deploy globally: if improvement is consistent across all subgroups
  • Deploy selectively: if improvement only works for certain tasks/users
  • Don't deploy: if improvement has mixed or negative results in important subgroups
  • Refine and retest: if you see potential but need to address failing subgroups

This is how mature teams avoid the trap of "metrics look good in aggregate but terrible in practice."
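The four decision rules above can be encoded directly. This sketch assumes findings shaped like the output of a subgroup analysis (`delta`, `significant` per subgroup); the zero `min_delta` is an illustrative default:

```python
def deployment_decision(findings, min_delta=0.0):
    """Map per-subgroup A/B findings to one of the four rollout decisions."""
    wins = [k for k, f in findings.items() if f['significant'] and f['delta'] > min_delta]
    losses = [k for k, f in findings.items() if f['significant'] and f['delta'] < -min_delta]
    negative_trends = [k for k, f in findings.items() if f['delta'] < 0]

    if losses:                           # significant regression somewhere
        return 'do_not_deploy'
    if wins and not negative_trends:     # consistent improvement everywhere
        return 'deploy_globally'
    if wins:                             # improvement only in some segments
        return ('deploy_selectively', wins)
    return 'refine_and_retest'           # nothing conclusive yet

findings = {
    'simple':  {'delta': 0.3,  'significant': True},
    'medium':  {'delta': 0.1,  'significant': False},
    'complex': {'delta': -0.2, 'significant': False},
}
print(deployment_decision(findings))  # ('deploy_selectively', ['simple'])
```

With the example numbers from the subgroup analysis above, this lands on the selective-deploy path, matching the interpretation in the comments.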

Advanced: Longitudinal Quality Tracking and Drift Detection

Once you deploy an agent, you can't just set it and forget it. Agent quality can drift over time for various reasons:

  • User inputs changing (distribution shift)
  • The model's behavior changing (Anthropic releases updates)
  • The agent's context window degrading due to long conversations
  • Edge cases you didn't test becoming common in production

You need to track quality longitudinally—over time—and detect when drift happens. Here's how:

python
from statistics import mean, stdev

def detect_quality_drift(daily_metrics, baseline_period=30, alert_threshold=0.2):
  """
  Detect significant quality degradation over time.
  Positive drift means recent quality is below the baseline average.
  """

  if len(daily_metrics) < baseline_period + 7:
    return None  # Not enough data yet

  # Calculate baseline (first N days)
  baseline_scores = [m['quality_score'] for m in daily_metrics[:baseline_period]]
  baseline_avg = mean(baseline_scores)
  baseline_std = stdev(baseline_scores)

  # Track recent metrics (last 7 days)
  recent_scores = [m['quality_score'] for m in daily_metrics[-7:]]
  recent_avg = mean(recent_scores)

  # Calculate drift (in quality points and in baseline standard deviations)
  drift_magnitude = baseline_avg - recent_avg
  drift_in_stdevs = drift_magnitude / baseline_std if baseline_std > 0 else 0

  # Determine severity
  if drift_magnitude < 0:
    # Quality improved, not a concern
    return {
      'status': 'improving',
      'drift': drift_magnitude,
      'drift_in_stdevs': drift_in_stdevs,
      'severity': None,
    }

  elif drift_magnitude < alert_threshold:
    return {
      'status': 'stable',
      'drift': drift_magnitude,
      'drift_in_stdevs': drift_in_stdevs,
      'severity': 'low',
    }

  elif drift_magnitude < alert_threshold * 1.5:
    return {
      'status': 'degrading',
      'drift': drift_magnitude,
      'drift_in_stdevs': drift_in_stdevs,
      'severity': 'medium',
      'action': 'investigate_root_cause',
    }

  else:
    return {
      'status': 'critical_degradation',
      'drift': drift_magnitude,
      'drift_in_stdevs': drift_in_stdevs,
      'severity': 'high',
      'action': 'consider_rollback',
    }
 
# Example: Continuous monitoring
# Day 1-30 (baseline): avg quality = 4.2, std = 0.15
# Day 31-37 (recent): avg quality = 3.95
# Drift: 0.25 points (about 1.7 baseline standard deviations)
# Status: MEDIUM SEVERITY DEGRADATION
# Action: Investigate what changed. Check if the model was updated, if user inputs shifted, etc.

With drift detection, you're not surprised by quality degradation. You catch it within a week and investigate. This is the difference between "agent mysteriously got worse" and "agent degraded when user input distribution changed, and we caught it."

Set up automated daily reports:

yaml
daily_quality_report:
  cadence: 09:00_UTC_every_day
  checks:
    - quality_score_trend
    - failure_rate_trend
    - cost_trend
    - drift_detection
    - subgroup_performance
  alerts:
    - if quality < baseline - 0.2: email_ops
    - if failure_rate > 0.15: page_oncall
    - if cost increased 30%: notify_finance
    - if subgroup_degrading: create_ticket

Practical: Setting Up Metrics in Code

Let's implement a real-world metrics collector. This is pseudocode, but it's close enough to real Python that you can adapt it to your stack.

python
from datetime import datetime, timedelta
from dataclasses import dataclass
from typing import Dict, Any, Optional
 
@dataclass
class TaskMetrics:
  """Metrics for a single task execution."""
  task_id: str
  agent_id: str
  agent_name: str
  task_type: str
  created_at: datetime
  started_at: datetime
  completed_at: datetime
  status: str  # pending, completed, failed, abandoned
  completion_time_seconds: float
  tokens_used_input: int
  tokens_used_output: int
  cost_usd: float
  retries: int
  quality_score: Optional[float] = None
  dimensions: Optional[Dict[str, float]] = None
  failure_reason: Optional[str] = None
  human_review_required: bool = False
 
  def to_dict(self) -> Dict[str, Any]:
    return {
      'task_id': self.task_id,
      'timestamp': self.completed_at.isoformat(),
      'agent_id': self.agent_id,
      'agent_name': self.agent_name,
      'task_type': self.task_type,
      'completion_time': self.completion_time_seconds,
      'tokens_total': self.tokens_used_input + self.tokens_used_output,
      'cost': self.cost_usd,
      'quality_score': self.quality_score,
      'status': self.status,
      'retries': self.retries,
    }
 
class MetricsCollector:
  """Collects and stores metrics for agents."""
 
  def __init__(self, influxdb_client, prometheus_registry):
    self.influx = influxdb_client
    self.prometheus = prometheus_registry
    self.metrics_buffer = []  # Local buffer before flush
 
  def record_task_start(self, task_id: str, agent_id: str):
    """Record that a task has started."""
    self.prometheus.counter(
      'agent_tasks_started',
      labels={'agent_id': agent_id}
    ).inc()
 
  def record_task_completion(self, metrics: TaskMetrics):
    """Record task completion metrics."""
 
    # Write to timeseries DB for trend analysis
    self.influx.write_point({
      'measurement': 'agent_task_completion',
      'tags': {
        'agent_id': metrics.agent_id,
        'agent_name': metrics.agent_name,
        'task_type': metrics.task_type,
        'status': metrics.status,
      },
      'fields': {
        'completion_time_seconds': metrics.completion_time_seconds,
        'tokens_total': (
          metrics.tokens_used_input +
          metrics.tokens_used_output
        ),
        'cost_usd': metrics.cost_usd,
        'quality_score': metrics.quality_score,
        'retries': metrics.retries,
      },
      'time': metrics.completed_at,
    })
 
    # Update Prometheus metrics for real-time monitoring
    completion_time = self.prometheus.histogram(
      'agent_completion_time_seconds',
      labels={'agent_id': metrics.agent_id}
    )
    completion_time.observe(metrics.completion_time_seconds)
 
    # Quality score may be None if scoring hasn't run yet
    if metrics.quality_score is not None:
      quality_score = self.prometheus.gauge(
        'agent_quality_score',
        labels={'agent_id': metrics.agent_id}
      )
      quality_score.set(metrics.quality_score)
 
    cost = self.prometheus.gauge(
      'agent_cost_per_task',
      labels={'agent_id': metrics.agent_id}
    )
    cost.set(metrics.cost_usd)
 
    # Increment counters
    if metrics.status == 'completed':
      self.prometheus.counter(
        'agent_tasks_completed',
        labels={'agent_id': metrics.agent_id}
      ).inc()
    elif metrics.status == 'failed':
      self.prometheus.counter(
        'agent_tasks_failed',
        labels={
          'agent_id': metrics.agent_id,
          'reason': metrics.failure_reason,
        }
      ).inc()
 
  def calculate_daily_summary(self, agent_id: str, date) -> Optional[Dict]:
    """Calculate daily metrics summary."""
 
    # Query InfluxDB for all tasks on this date
    query = (
      f"SELECT * FROM agent_task_completion "
      f"WHERE agent_id = '{agent_id}' AND time > '{date}'"
    )
    tasks = self.influx.query(query)
 
    if not tasks:
      return None
 
    completed = [t for t in tasks if t['status'] == 'completed']
    failed = [t for t in tasks if t['status'] == 'failed']
 
    completion_rate = len(completed) / len(tasks) if tasks else 0
    avg_quality = (
      sum(t['quality_score'] for t in completed) / len(completed)
      if completed else 0
    )
    avg_time = (
      sum(t['completion_time_seconds'] for t in tasks) / len(tasks)
      if tasks else 0
    )
    total_cost = sum(t['cost_usd'] for t in tasks)
 
    # Failure analysis
    failure_reasons = {}
    for task in failed:
      reason = task.get('failure_reason', 'unknown')
      failure_reasons[reason] = failure_reasons.get(reason, 0) + 1
 
    return {
      'date': date.isoformat(),
      'agent_id': agent_id,
      'completion_rate': completion_rate,
      'avg_quality_score': avg_quality,
      'avg_completion_time': avg_time,
      'total_cost': total_cost,
      'task_count': len(tasks),
      'completed_count': len(completed),
      'failed_count': len(failed),
      'failure_reasons': failure_reasons,
    }
 
  def get_weekly_trend(self, agent_id: str, weeks: int = 4):
    """Get weekly trend data for trend analysis."""
    summaries = []
    for i in range(weeks):
      date = datetime.now() - timedelta(days=i*7)
      summary = self.calculate_daily_summary(agent_id, date)
      if summary:
        summaries.append(summary)
 
    return summaries

Now you have instrumentation. Every task generates metrics. Those metrics flow into monitoring systems where you can query them, set alerts, and build dashboards. You have the data infrastructure to make smart decisions.
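One detail the collector glosses over is how `cost_usd` gets computed from the token counts. A minimal sketch, with illustrative per-million-token rates (not actual pricing; substitute your model's real rates):

```python
# Illustrative rates in USD per million tokens; replace with your model's pricing.
RATES = {'input_per_mtok': 3.00, 'output_per_mtok': 15.00}

def task_cost_usd(tokens_in, tokens_out, rates=RATES):
    """Compute per-task cost from input/output token counts."""
    return (tokens_in * rates['input_per_mtok'] +
            tokens_out * rates['output_per_mtok']) / 1_000_000

print(round(task_cost_usd(600, 250), 6))  # 0.00555
```

Computing this at record time (rather than in dashboards) keeps every downstream summary consistent even when prices change.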

Common Pitfalls (and How to Avoid Them)

Pitfall 1: Overfitting to the Rubric

If your rubric only measures conciseness, agents will get shorter and shorter until comments stop being helpful. Always include multiple dimensions with weights that balance competing concerns. A good rubric captures what matters to actual users.

Pitfall 2: Ignoring Statistical Noise

If you see quality jump from 4.0 to 4.1, is that real improvement or random variation? That's why you run statistical tests. A 0.1-point increase on 50 samples is noise. But a 0.3-point increase on 200 samples across multiple dimensions is signal.
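You can see the sample-size effect directly. This pure-Python sketch uses a normal approximation to Welch's t-test so it needs no stats library; the noise level of 0.5 points is an assumption:

```python
import math
import random
from statistics import mean, variance

def welch_p_approx(a, b):
    """Two-sided p-value for Welch's t-test, using a normal approximation
    to the t distribution (reasonable at these sample sizes)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (mean(b) - mean(a)) / se
    return 1 - math.erf(abs(t) / math.sqrt(2))

random.seed(42)

def simulate(n, lift, noise=0.5):
    """Simulate an A/B test with n samples per arm and a true quality lift."""
    a = [4.0 + random.gauss(0, noise) for _ in range(n)]
    b = [4.0 + lift + random.gauss(0, noise) for _ in range(n)]
    return welch_p_approx(a, b)

# A 0.1-point lift on 50 samples usually looks like noise;
# a 0.3-point lift on 200 samples is reliably significant.
print(simulate(50, 0.1), simulate(200, 0.3))
```

Run it a few times with different seeds and the pattern holds: the small lift on a small sample bounces around, while the larger lift on the larger sample stays far below 0.05.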

Pitfall 3: Measuring the Wrong Thing

You measure completion rate, but you don't measure customer satisfaction. Your agent completes 100% of tasks, but they're useless. Define what success means to your users, then measure that. Completion without quality is worthless.

Pitfall 4: Not Automating Fast Feedback

If you can only manually review outputs once a week, your feedback loop moves at a glacial pace. Automate everything you can. Reserve human review for high-uncertainty cases. Fast feedback loops drive improvement.

Pitfall 5: Testing Without a Hypothesis

Don't A/B test randomly. Have a reason. "I think parameter documentation will improve quality by 0.3 points because..." If you can't finish that sentence, don't run the test. You're just wasting time.

Pitfall 6: Not Acting on Data

You have metrics. You see a pattern. And then... nothing happens. Metrics without action are just vanity. When data shows a problem, invest time in understanding and fixing it.

Advanced: Dimension Correlation and Tradeoff Analysis

As your evaluation system matures, you'll notice that dimensions don't always improve together. Sometimes optimizing for one dimension makes another worse. This is the quality tradeoff problem.

For example, if you optimize an agent for speed (lower completion time), it might sacrifice correctness (accuracy). Or if you optimize for completeness (100% coverage), you might increase cost (more tokens used). Understanding these tradeoffs is crucial.

Here's how to analyze dimension correlations:

python
from scipy.stats import pearsonr  # assumes SciPy; pearsonr returns (r, p_value)

def analyze_dimension_correlations(metrics_history):
  """
  Calculate correlations between quality dimensions, cost, and time.
  Helps identify tradeoffs.
  """

  # Extract dimension scores over time
  clarity_scores = [m['dimensions']['clarity'] for m in metrics_history]
  coverage_scores = [m['dimensions']['coverage'] for m in metrics_history]
  correctness_scores = [m['dimensions']['correctness'] for m in metrics_history]
  efficiency_scores = [m['dimensions']['efficiency'] for m in metrics_history]
  cost_per_task = [m['cost_usd'] for m in metrics_history]
  completion_time = [m['completion_time_seconds'] for m in metrics_history]

  # Calculate Pearson correlation coefficients
  correlations = {
    'clarity_vs_correctness': pearsonr(clarity_scores, correctness_scores)[0],
    'clarity_vs_efficiency': pearsonr(clarity_scores, efficiency_scores)[0],
    'completeness_vs_cost': pearsonr(coverage_scores, cost_per_task)[0],
    'completeness_vs_time': pearsonr(coverage_scores, completion_time)[0],
    'correctness_vs_efficiency': pearsonr(correctness_scores, efficiency_scores)[0],
  }

  return correlations
 
# Example results:
# clarity_vs_correctness: 0.92 (strong positive: clearer comments tend to be more correct)
# clarity_vs_efficiency: -0.15 (weak negative: sometimes clarity requires more words)
# completeness_vs_cost: 0.78 (positive: more complete docs cost more tokens)
# completeness_vs_time: 0.65 (positive: more completeness takes more time)
# correctness_vs_efficiency: 0.22 (weak positive: correctness doesn't require extra cost)

With this analysis, you can make intelligent decisions:

If clarity and correctness are highly correlated (0.92): improvements in one benefit both. Invest in clarity.

If completeness and cost are highly correlated (0.78): you're facing a tradeoff. Ask yourself: is 100% coverage worth the cost? Maybe 90% coverage with lower cost is better ROI.

If correctness and efficiency are weakly correlated (0.22): you can improve correctness without significantly impacting cost. This is a "free win"—pursue it.

Use this data to guide your A/B test design. When you're testing a hypothesis, check if you're inadvertently creating negative correlations. For example, if your new instruction improves coverage but demolishes efficiency (tokens spike), you might want to refine it before deploying.
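That last check can be automated as a guard in your test analysis. The 15% token-growth ceiling here is an illustrative assumption, not a recommendation:

```python
def tradeoff_guard(baseline, candidate, max_token_growth=0.15):
    """Flag a variant that buys quality gains with a disproportionate token spike."""
    quality_gain = candidate['quality'] - baseline['quality']
    token_growth = (candidate['tokens'] - baseline['tokens']) / baseline['tokens']
    if quality_gain > 0 and token_growth > max_token_growth:
        return 'refine_before_deploy'
    return 'ok'

print(tradeoff_guard({'quality': 3.8, 'tokens': 850},
                     {'quality': 4.1, 'tokens': 1200}))
# refine_before_deploy: a 0.3-point gain bought with a ~41% token increase
```

Tune the ceiling to your own cost sensitivity; the point is that the check runs on every experiment, not just the ones you remember to eyeball.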

Advanced: Failure Pattern Clustering

Not all failures are equal. Some are systemic (the agent consistently misses something). Some are contextual (the agent fails on a specific type of input). Some are random noise.

Clustering failures helps you identify systemic issues:

python
import re

def extract_pattern(error_message):
  """Collapse variable parts (numbers, quoted names) so similar errors match.
  Simple illustrative normalizer; substitute your own."""
  return re.sub(r"\d+|'[^']*'", '<VAR>', error_message)

def hash_features(features):
  """Build a deterministic, hashable cluster key from the feature dict."""
  return tuple(sorted(features.items()))

def cluster_failures(failures_list):
  """
  Group failures by similarity to identify patterns.
  """
  clusters = {}
 
  for failure in failures_list:
    # Extract features that characterize this failure
    features = {
      'input_type': failure['task_type'],
      'error_category': failure['failure_category'],
      'complexity_level': failure['input_complexity'],
      'error_message_pattern': extract_pattern(failure['error_message']),
    }
 
    # Find similar failures
    cluster_key = hash_features(features)
 
    if cluster_key not in clusters:
      clusters[cluster_key] = {
        'count': 0,
        'examples': [],
        'features': features,
      }
 
    clusters[cluster_key]['count'] += 1
    clusters[cluster_key]['examples'].append(failure)
 
  return sorted(clusters.values(), key=lambda c: c['count'], reverse=True)
 
# Example output:
# Cluster 1 (47 failures):
#   Pattern: task_type=code-comment-generation, error=hallucinated_param
#   Features: happens when method has 5+ parameters
#   Root cause: Agent loses track of parameter names in long signatures
#   Action: Retrain on param-heavy functions
 
# Cluster 2 (23 failures):
#   Pattern: task_type=code-comment-generation, error=timeout
#   Features: happens on files with 100+ functions
#   Root cause: Agent processes sequentially, gets slow on large files
#   Action: Implement chunking or parallel processing
 
# Cluster 3 (8 failures):
#   Pattern: task_type=code-comment-generation, error=invalid_syntax
#   Features: random, no discernible pattern
#   Root cause: Random noise or environmental issue
#   Action: Monitor, unlikely to be systemic

With failure clusters, your improvement efforts are targeted. You fix the 47 failures in Cluster 1 first because that has the highest ROI. Cluster 3 (8 random failures) gets deprioritized because there's no systemic fix.
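That prioritization can be a one-liner on the cluster output. The `min_count` cutoff of 10 is an illustrative threshold for "probably systemic":

```python
def rank_clusters(clusters, min_count=10):
    """Rank failure clusters by size; clusters below min_count are treated as noise."""
    systemic = [c for c in clusters if c['count'] >= min_count]
    return sorted(systemic, key=lambda c: c['count'], reverse=True)

# The three clusters from the example output above
clusters = [{'count': 23}, {'count': 47}, {'count': 8}]
print([c['count'] for c in rank_clusters(clusters)])  # [47, 23]
```

The 8-failure cluster drops out of the work queue automatically, which matches the "monitor, unlikely to be systemic" call in the example.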

Advanced: Cost-Quality Pareto Frontier

In real production systems, you often face a choice: cheaper but lower quality, or expensive but higher quality. Which is the right balance?

The Pareto frontier helps you find the "sweet spot":

python
def find_pareto_frontier(experiments):
  """
  Identify the set of non-dominated solutions.
  On the Pareto frontier, you can't improve one metric without worsening another.
  """
 
  frontier = []
 
  for experiment in experiments:
    is_dominated = False
 
    # Check if any other experiment is better on all metrics
    for other in experiments:
      if (other['quality_score'] >= experiment['quality_score'] and
          other['cost_usd'] <= experiment['cost_usd'] and
          (other['quality_score'] > experiment['quality_score'] or
           other['cost_usd'] < experiment['cost_usd'])):
        is_dominated = True
        break
 
    if not is_dominated:
      frontier.append(experiment)
 
  return sorted(frontier, key=lambda e: e['cost_usd'])
 
# Example: Testing different prompt strategies
# Variant A: quality=3.8, cost=$0.42
# Variant B: quality=4.0, cost=$0.45
# Variant C: quality=3.9, cost=$0.50
# Variant D: quality=4.2, cost=$0.48
 
# Pareto frontier:
# Variant A: quality=3.8, cost=$0.42 (cheapest)
# Variant B: quality=4.0, cost=$0.45 (balanced)
# Variant D: quality=4.2, cost=$0.48 (best quality)
 
# Not on frontier: Variant C (dominated by B: same cost, worse quality)

The frontier points are your real options. You choose based on business requirements:

  • Cheapest path: Variant A (cost-sensitive, acceptable quality)
  • Balanced path: Variant B (good quality for reasonable cost)
  • Premium path: Variant D (best quality, worth the extra cost)

This removes the "should we even improve?" question. The data shows you exactly what's possible at different cost levels.
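To sanity-check the worked example, here's a compact, self-contained version of the same dominance test run against the four variants above:

```python
def pareto(experiments):
    """Return the non-dominated experiments, cheapest first."""
    def dominated(e):
        return any(
            o['quality_score'] >= e['quality_score'] and o['cost_usd'] <= e['cost_usd'] and
            (o['quality_score'] > e['quality_score'] or o['cost_usd'] < e['cost_usd'])
            for o in experiments
        )
    return sorted((e for e in experiments if not dominated(e)), key=lambda e: e['cost_usd'])

variants = [
    {'name': 'A', 'quality_score': 3.8, 'cost_usd': 0.42},
    {'name': 'B', 'quality_score': 4.0, 'cost_usd': 0.45},
    {'name': 'C', 'quality_score': 3.9, 'cost_usd': 0.50},
    {'name': 'D', 'quality_score': 4.2, 'cost_usd': 0.48},
]
print([v['name'] for v in pareto(variants)])  # ['A', 'B', 'D']
```

C falls off the frontier exactly as described: B beats it on both quality and cost.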

Building Your Evaluation System

Here's a minimal implementation checklist:

  • Define success criteria for your specific agent tasks (YAML config)
  • Build a rubric with 4-7 quantifiable dimensions
  • Implement a validator agent that scores outputs
  • Log metrics to a timeseries database
  • Calculate daily/weekly summaries and set up alerts
  • Set up A/B testing framework for instruction changes
  • Establish feedback loop: monitor → diagnose → hypothesize → experiment → deploy
  • Review trends weekly, make improvements monthly
  • Share results with the team

Start small. Measure one agent task thoroughly before expanding to others. Once the system is in place, it's self-reinforcing: you catch issues fast, you fix them faster, your agents get better, your users are happier.
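For the first checklist item, the success-criteria config can start tiny. These field names and thresholds are illustrative, not a fixed schema:

```yaml
success_criteria:
  agent: doc-generator
  quality_score_min: 4.0
  completion_rate_min: 0.95
  cost_per_task_max_usd: 0.50
  p95_completion_time_max_seconds: 30
```

Everything else in the checklist (validator, alerts, A/B gates) reads its thresholds from this one file, so tightening standards later is a one-line change.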

Real-World Example: Multi-Metric Evaluation in Practice

Let's tie this all together with a concrete example. Imagine you have a documentation generation agent. You want to evaluate it holistically.

Month 1 Baseline:

  • Quality score: 3.8/5.0
  • Completion rate: 92%
  • Cost per document: $0.42
  • Avg time per document: 18 seconds
  • Top failure: "Missing API endpoint documentation" (35% of failures)
  • Token usage: 850 tokens per doc on average

You notice the "missing API endpoint" pattern. That's actionable. You hypothesize that adding explicit instruction about checking API schemas will improve coverage. You review 5 failure samples. Each one is missing endpoint documentation that was available in an OpenAPI spec. The agent just didn't think to look for it.

A/B Test Design:

  • Variant A (baseline): Current instructions
  • Variant B: Add "Always extract and document API endpoints from openapi.json or similar schemas. Look for endpoints that aren't documented."
  • Sample size: 200 documents (100 per variant)
  • Success metric: coverage_dimension_score improves by 0.5 points, quality score improves by 0.3 points
  • Timeline: 2 days
  • Failure acceptance: 5% max failure rate (variant B shouldn't be worse)

Results after 2 days:

  • Variant A quality: 3.8 (baseline confirmed)
  • Variant B quality: 4.2 (0.4 point gain - exceeds hypothesis!)
  • Variant B cost: $0.38 per document (actually cheaper due to more focused output)
  • Variant B time: 16 seconds per document (faster)
  • Variant B failure pattern: "missing endpoint" now only 8% of failures (down from 35%)
  • Variant B coverage dimension: improved from 3.0 to 3.9 (0.9 point gain)
  • p-value: 0.021 (statistically significant at p < 0.05)

Outcome: Deploy Variant B to production. Gradual rollout: 10% → 50% → 100% over 24 hours. Monitor for unexpected issues.
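The gradual rollout can be driven by a tiny ramp schedule. This sketch assumes an hour-based schedule and random per-request routing; the 8-hour steps are an illustrative way to spread 10% → 50% → 100% over 24 hours:

```python
import random

# Hours since deploy -> fraction of traffic routed to the new variant
SCHEDULE = [(0, 0.10), (8, 0.50), (16, 1.00)]

def fraction_at(hour):
    """Traffic fraction for the new variant at a given hour since deploy."""
    return max(frac for start, frac in SCHEDULE if hour >= start)

def pick_variant(hour):
    """Route one request according to the current ramp fraction."""
    return 'variant_b' if random.random() < fraction_at(hour) else 'variant_a'

print(fraction_at(3), fraction_at(10), fraction_at(20))  # 0.1 0.5 1.0
```

Pair this with the drift-detection alerts so a regression mid-ramp freezes or reverses the schedule instead of completing it.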

One Week Later:

  • New baseline: 4.2 quality (improvement maintained!)
  • Cost per document: $0.38 (sustained)
  • But you notice a new pattern emerging: 20% of failures are now "inaccurate endpoint descriptions"
  • This is a new failure mode—the agent is documenting endpoints but sometimes getting them wrong

Diagnosis: The agent is hallucinating parameter types or descriptions. It's guessing at details instead of verifying them from the schema.

Next Hypothesis: Add guidance to the agent: "If you cannot verify endpoint details from the schema file, mark them as [UNVERIFIED] instead of guessing. It's better to be incomplete than incorrect."

Run another A/B test to validate this hypothesis. Test for 2 days. Results show:

  • Quality improves to 4.35 (endpoint accuracy improves, completeness stays high)
  • Inaccurate endpoint descriptions drop from 20% to 4%
  • Users are happy because they trust the documentation

Deploy Variant C. Monitor for one week. Trends look good.

Two Months Later:

  • Quality has improved from 3.8 to 4.4
  • Cost per document: down to $0.35
  • Completion rate: up to 98% (from 92%)
  • Top failure now: "unclear description" (only 7% of failures, was 35%)

And the cycle continues. Each small improvement compounds. You've gone from an agent that was "okay but unreliable" to one that's "good and trustworthy."

Integrating Metrics with CI/CD

One more critical piece: making evaluation part of your deployment pipeline.

When an engineer proposes a new agent instruction or logic change, you don't just merge it and hope. You:

  1. Run the change against your test task suite
  2. Score the outputs using your rubric
  3. Compare to baseline metrics
  4. Flag if any dimension regressed
  5. Block merge if metrics fall below thresholds

Here's a YAML config for that gate:

yaml
ci_cd_quality_gate:
  name: agent_evaluation_gate
  trigger: on_pull_request_to_main
 
  steps:
    - name: collect_baseline_metrics
      run: fetch_metrics_for(branch=main, agent_id=current)
 
    - name: run_test_suite
      run: |
        # Run agent against standardized test tasks
        test_results = run_agent_on_test_suite(
          num_tasks=50,
          task_types=[api, schema, function],
          agent_config=pr_branch
        )
 
    - name: validate_with_rubric
      run: |
        # Score each output
        for output in test_results:
          score(output, rubric=quality_rubric)
 
    - name: compare_to_baseline
      run: |
        pr_metrics = aggregate_metrics(test_results)
        baseline_metrics = fetch_metrics(branch=main)
 
        regression_check:
          - if quality_score < baseline - 0.2: FAIL
          - if completion_rate < baseline - 0.05: FAIL
          - if cost_increase > 20%: WARN
          - if any_critical_failure: FAIL
 
    - name: report_to_pr
      run: |
        # Post detailed report as comment
        comment_pr(
          title="Quality Gate Results",
          metrics=pr_metrics,
          comparison=baseline_metrics,
          verdict=pass/fail,
          dimension_breakdown=scores_by_dimension
        )
 
    - name: block_or_approve
      run: |
        if verdict == fail:
          block_merge()
          require_manual_review()
        else:
          auto_approve_quality_gate()

This is a hard gate. You cannot merge agent changes without proving they don't regress quality. This is non-negotiable for production systems.

The Investment Thesis: Why Evaluation Pays Off

You might be thinking: "This is a lot of infrastructure. Do I really need all of this for agents?"

Short answer: yes, if you care about production quality.

Here's the investment thesis. Building an evaluation system costs time upfront—maybe 2-3 weeks of work. But it pays dividends that compound. Here's the math:

Without evaluation system:

  • Deploy agent, assume it's fine
  • Users complain, you notice quality is bad
  • Spend a week debugging, make random changes
  • Deploy change, hope it's better
  • Cycle time: 2-3 weeks per improvement
  • Quality improvement: 0.2 points per month (slow and unpredictable)

With evaluation system:

  • Deploy agent with automated quality tracking
  • Metrics immediately show if it's good or bad
  • Spot patterns automatically (clustering failures)
  • Design hypotheses based on data
  • Run A/B test in 24 hours, validate improvement
  • Deploy with confidence
  • Cycle time: 2-3 days per improvement
  • Quality improvement: 0.5+ points per month (fast and predictable)

Over 6 months:

  • Without system: +1.2 quality points (slow, frustrating)
  • With system: +3+ quality points (fast, measurable)

And that's just quality. The system also helps with:

  • Cost reduction: Understanding tradeoffs means you spend less on API calls
  • Reliability: Catching regressions before they hit users
  • Confidence: Data-driven decisions instead of guessing

The ROI is massive if you care about production agents. The teams that win at scale all have evaluation infrastructure.

When NOT to Build Evaluation (Yet)

That said, there are scenarios where a full evaluation system is overkill:

  • Experimental/prototype agent: Just launched, not in production yet. Focus on getting it working first.
  • Low-impact agent: Used occasionally, failures aren't critical. A simple pass/fail gate is enough.
  • Simple task with obvious quality: Sometimes "does it work?" is obvious by inspection.

But the moment you go to production, the moment you care about cost, the moment you want to improve—build the evaluation system. The earlier you build it, the more data you accumulate, the better your decisions become.

Scaling Evaluation Across Many Agents

Real organizations don't have one agent. They have dozens. How do you scale evaluation?

Key principle: shared infrastructure, agent-specific rubrics.

Build one metrics collection system that all agents feed into. Build one A/B testing framework that all agents use. But build separate rubrics for each agent type.

For example:

  • Code reviewer agent: Rubric measures security depth, performance insights, actionability
  • Documentation generator: Rubric measures example quality, API coverage, searchability
  • Test writer: Rubric measures edge case coverage, test readability, maintainability

Each agent has different success criteria. But they all use the same evaluation infrastructure to track themselves.

yaml
evaluation_infrastructure:
  shared_components:
    - metrics_collection_service
    - timeseries_database (InfluxDB/CloudWatch)
    - statistical_analysis_library
    - ab_testing_framework
    - dashboard_and_alerting
 
  per_agent_components:
    - agent_specific_rubric
    - test_suite (task samples)
    - success_criteria_thresholds
    - improvement_hypotheses
 
  scaling_pattern:
    - 1_define_rubric_for_new_agent
    - 2_bootstrap_test_suite (50-100 representative tasks)
    - 3_run_initial_baseline (get starting metrics)
    - 4_integrate_with_shared_metrics_system
    - 5_set_up_automated_daily_evaluation
    - 6_start_iterating_with_ab_tests

With this pattern, you can onboard a new agent into the evaluation system in about a week. The infrastructure is there; you just plug in the agent-specific parts.

Conclusion

Agent evaluation isn't a one-time checklist. It's an ongoing system that keeps your agents sharp and getting better over time. By defining success criteria, scoring quality automatically, tracking performance, running A/B tests, and analyzing data, you build the infrastructure to continuously improve.

The teams that ship production agents at scale all do this. They measure everything, they test every change, and they iterate relentlessly. That's how you go from "agent completed a task" to "agent completed a task consistently, cost-effectively, and with high quality."

The payoff compounds over time. Your first improvement might be small: 0.2 points of quality. But month after month, those small improvements stack. By month six, you're running an agent that's 2.0 quality points higher than where you started. That's not luck; that's compounding. And it's only possible if you instrument your system to capture, analyze, and act on data.

Build your metrics first. Automate the feedback loops. Let the data guide your improvements. Start with one agent task, one rubric, one feedback loop. As you see results, expand to more agents. After a year, you'll have a system that continuously improves itself without constant manual intervention. That's the future of agent-assisted development—systems that get smarter every day without you having to think about it.


-iNet

Need help implementing this?

We build automation systems like this for clients every day.

Discuss Your Project