April 9, 2025
Claude AI Testing Development

Agent SDK: Integration Testing Strategies

When you're building agents with Claude Code, you're orchestrating complex multi-step workflows. An agent spawns sessions, calls tools, processes results, makes decisions, and calls more tools. Each step is a potential failure point. And unlike traditional unit tests where you can mock everything, integration tests need to validate the actual flow—from Claude's responses to your tool execution to the results flowing back.

Here's what makes agent testing different from traditional testing. When you test a function, you control everything—inputs, outputs, side effects. You mock the database, mock the API, mock the filesystem. When you test an agent, you're testing a fundamentally multi-turn interaction where the agent's next action depends on the outcome of its previous action. The agent might fetch data, decide it needs more data, fetch more, analyze the combined data, and make a decision. Each of those steps can fail in different ways. The agent might get stuck in a loop. The agent might make the wrong decision. The agent might interpret the tool results incorrectly. Traditional testing approaches break down.

This guide walks through practical integration testing strategies for the Agent SDK. We'll cover mocking Claude API responses, testing tool execution chains, snapshot testing conversations, handling error scenarios, and integrating everything into CI pipelines. By the end, you'll have confidence that your agents work end-to-end, not just in isolation.

The hidden layer here is understanding that integration tests for agents are really about workflow validation. You're not testing that Claude works—that's Anthropic's job. You're not testing that your tools are implemented correctly—that's covered by unit tests. You're testing that the orchestration is correct: that tools are called in the right order, that results are interpreted correctly, that the agent adapts properly to unexpected situations, and that the overall workflow achieves its goal.

Table of Contents
  1. Why Integration Tests Matter for Agents
  2. Mocking Claude API Responses
  3. Testing Tool Execution Chains
  4. Snapshot Testing Conversations
  5. Testing Error Scenarios
  6. Testing Agent State Management
  7. Testing with Real External Services
  8. CI Pipeline Integration
  9. Performance Testing Integration Tests
  10. Test Organization Best Practices
  11. Debugging Failed Integration Tests
  12. Advanced: Testing Tool Output Validation
  13. Testing Conversations with Memory
  14. Testing Concurrent Agent Requests
  15. Testing Agent Behavior Under Load
  16. Real-World Test Suite Example
  17. Key Takeaways
  18. Test Organization and File Structure
  19. Debugging Integration Tests
  20. Snapshot Comparison and Regression Detection
  21. The Philosophy of Agent Integration Testing
  22. Making Agent Testing Practical
  23. The Testing Mindset for Agents
  24. Test Organization at Scale
  25. Capturing Agent Behavior Over Time
  26. Stress Testing Your Agents
  27. Monitoring and Observability in Tests
  28. Evolving Your Test Suite
  29. The Confidence Multiplier
  30. Real-World Test Coverage Patterns
  31. Test Maintenance as Ongoing Investment
  32. Building a Testing Culture Around Agents
  33. The Compounding Returns of Systematic Testing
  34. Advanced Pattern: Regression Testing for AI Systems
  35. Debugging Integration Test Failures: A Systematic Approach
  36. Cost Optimization for Integration Tests
  37. Building Integration Tests for Multi-Agent Systems
  38. Documentation Through Tests
  39. The Path to Test Maturity
  40. Metrics That Matter for Agent Testing

Why Integration Tests Matter for Agents

Unit tests tell you that individual functions work. Integration tests tell you that your agent actually accomplishes its goal. There's a huge difference. An agent might pass all unit tests but fail at runtime because the interaction between components breaks down in production.

Think about a web server. You can unit test that every endpoint handler works correctly. You can test the database layer. You can test authentication separately. But if you don't integration test, you might discover at runtime that one endpoint hits the database, the database responds with data the endpoint wasn't expecting, the endpoint crashes, and suddenly your whole service is down. Integration tests catch this. They run the actual code flow, including all the real interactions.

Agents amplify this problem because they're fundamentally reactive systems. The agent makes a decision based on one tool's output, then calls another tool, which outputs something slightly different from what you expected, and suddenly the agent's behavior is completely wrong. An agent that's supposed to fetch data and summarize it might instead get stuck in a loop fetching data forever. An agent that's supposed to be conservative might be overconfident and make dangerous decisions.

An agent might pass all unit tests but fail at runtime because:

  • Claude's response format changed slightly
  • A tool throws an unexpected exception
  • Tool results chain incorrectly
  • Error handling logic has a bug
  • The agent gets stuck in a loop

Integration tests catch these problems before production.

Mocking Claude API Responses

You don't want to hit the real Claude API for every test run. It's slow, it costs money, it's flaky, and it's not repeatable. You also can't provoke specific behavior on demand: you can't test "what happens if Claude returns an empty response" against the live API, because you can't make Claude produce one. Instead, mock responses with predictable content.

The key principle is deterministic mocking. Every test should run the same way every time. If a test passes sometimes and fails sometimes, you have a flaky test, and flaky tests are worse than no tests because they destroy confidence in your test suite. You'll start ignoring failures because "oh, it's probably just flaky." So you mock Claude's responses to be predictable.

When you mock, you're simulating Claude's behavior. You provide the exact response you want to test against. If you want to test what happens when Claude decides to call a tool, your mock returns a response with a tool call. If you want to test what happens when Claude finishes the conversation, your mock returns a response without tool calls. This lets you test specific scenarios in isolation.

Here's a TypeScript approach using Jest:

typescript
import { Agent } from "@anthropic-ai/agent-sdk";
 
describe("Agent Tool Calling", () => {
  let agent: Agent;
 
  beforeEach(() => {
    // Mock the Claude API
    jest.spyOn(global, "fetch").mockResolvedValue({
      json: async () => ({
        id: "msg-123",
        content: [
          {
            type: "text",
            text: "I need to search for this information",
          },
          {
            type: "tool_use",
            id: "tool-1",
            name: "search",
            input: { query: "test query" },
          },
        ],
      }),
    } as any);
  });
 
  afterEach(() => {
    jest.restoreAllMocks();
  });
 
  test("agent calls search tool when needed", async () => {
    agent = new Agent({
      apiKey: "test-key",
      model: "claude-3-5-sonnet",
      tools: [
        {
          name: "search",
          description: "Search for information",
          input_schema: {
            type: "object",
            properties: {
              query: { type: "string" },
            },
          },
        },
      ],
    });
 
    const result = await agent.send("Find me information about test");
    expect(result).toContain("test");
  });
});

The key here is predictability. Mock responses should be deterministic so tests don't flake.
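For reuse across tests, the canned responses can be pulled into a small scripted queue. This is a hypothetical helper, not part of the SDK: it returns responses in order and fails loudly if a test makes more API calls than the script expects.

```typescript
// Hypothetical helper (not SDK API): a deterministic queue of canned Claude
// responses. next() returns them in declared order and throws if a test
// makes more API calls than the script anticipated.
type ScriptedResponse = {
  text?: string;
  toolCalls?: { name: string; input: unknown }[];
};

class ResponseScript {
  private index = 0;

  constructor(private responses: ScriptedResponse[]) {}

  next(): ScriptedResponse {
    if (this.index >= this.responses.length) {
      throw new Error(`Script exhausted after ${this.responses.length} responses`);
    }
    return this.responses[this.index++];
  }

  get remaining(): number {
    return this.responses.length - this.index;
  }
}
```

Wiring `next()` into the fetch mock keeps each test's entire conversation declared up front, so an unexpected extra call fails immediately instead of flaking.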

Testing Tool Execution Chains

Tools rarely execute in isolation. Usually tool A calls tool B, and B's output feeds into C. This is where integration testing shines. You're not testing that each tool works—that's the unit test's job. You're testing that the flow works: that the agent knows to call tool A, that the result of tool A makes sense to the agent, that the agent then correctly calls tool B with the appropriate input derived from tool A's output.

This requires testing the agent's reasoning, not just its tool execution. Does the agent interpret tool A's result correctly? Does it extract the right data to pass to tool B? Does it understand when to stop calling tools and return a final answer?

The chain test is really a test of agent orchestration. You're mocking Claude's responses to simulate a specific decision flow, mocking tool responses to return specific data, and then verifying that the flow works end-to-end. This is hard to unit test because each piece works correctly in isolation, but the pieces don't fit together right.

Test these chains end-to-end:

typescript
describe("Tool Execution Chains", () => {
  test("agent chains multiple tools correctly", async () => {
    const mockTools = {
      fetch_article: jest.fn().mockResolvedValue({
        title: "Test Article",
        content: "Article content about topic X",
        url: "https://example.com/article",
      }),
      summarize: jest.fn().mockResolvedValue({
        summary: "This article discusses topic X in detail",
        key_points: ["Point 1", "Point 2"],
      }),
      send_email: jest.fn().mockResolvedValue({
        status: "sent",
        recipient: "user@example.com",
      }),
    };
 
    const agent = new Agent({
      apiKey: "test-key",
      model: "claude-3-5-sonnet",
      tools: [
        {
          name: "fetch_article",
          description: "Fetch an article",
          handler: mockTools.fetch_article,
        },
        {
          name: "summarize",
          description: "Summarize content",
          handler: mockTools.summarize,
        },
        {
          name: "send_email",
          description: "Send an email",
          handler: mockTools.send_email,
        },
      ],
    });
 
    // Mock Claude to call tools in sequence
    let callCount = 0;
    jest.spyOn(agent, "send").mockImplementation(async (message) => {
      callCount++;
      if (callCount === 1) {
        // First turn: fetch article
        return {
          toolCalls: [
            {
              name: "fetch_article",
              input: { url: "test-url" },
            },
          ],
        };
      } else if (callCount === 2) {
        // Second turn: summarize
        return {
          toolCalls: [
            {
              name: "summarize",
              input: { content: "Article content" },
            },
          ],
        };
      } else if (callCount === 3) {
        // Third turn: send email
        return {
          toolCalls: [
            {
              name: "send_email",
              input: { recipient: "user@example.com", message: "Summary" },
            },
          ],
        };
      } else {
        // Fourth turn: no tool calls, so the workflow loop below terminates
        return { text: "Fetched the article and emailed you a summary." };
      }
    });
 
    // Execute the agent workflow
    const workflow = async () => {
      let result = await agent.send("Fetch an article and email me a summary");
      while (result.toolCalls && result.toolCalls.length > 0) {
        const toolCall = result.toolCalls[0];
        const toolResult =
          await mockTools[toolCall.name as keyof typeof mockTools](toolCall.input);
        result = await agent.send(`Tool result: ${JSON.stringify(toolResult)}`);
      }
      return result;
    };
 
    const final = await workflow();
 
    // Verify the chain executed
    expect(mockTools.fetch_article).toHaveBeenCalled();
    expect(mockTools.summarize).toHaveBeenCalled();
    expect(mockTools.send_email).toHaveBeenCalled();
  });
});

This tests not just that tools execute, but that results flow correctly through the chain.

Snapshot Testing Conversations

Agent conversations should be deterministic given the same inputs. Snapshot tests capture conversation state and fail if output unexpectedly changes. This is powerful because it lets you catch unintended behavioral changes that you might otherwise miss.

Think about how snapshot testing works. The first time you run the test, it captures the agent's output and saves it as a "snapshot." On subsequent runs, if the agent's output changes, the test fails and shows you the diff. This is great for catching regressions. If someone changes a system prompt, a tool definition, or the agent's behavior in any subtle way, the snapshot diff will catch it.

The hidden value is that snapshots make you deliberate about changes. When a snapshot fails, you have to make a conscious decision: is this change intentional? If it is, you approve the new snapshot. If it's not, you find and fix the bug. This forces you to think about what your agent is actually doing, not just what you think it should be doing.

Snapshot tests are particularly valuable for agents because agent behavior can be surprising. An agent might behave correctly for your test case but produce different output than you expected. Snapshots let you see exactly what the agent is doing, turn by turn, so you can understand its reasoning and verify it's correct.

Here's a basic conversation snapshot test:

typescript
describe("Agent Conversation Snapshots", () => {
  test("agent conversation matches snapshot", async () => {
    const agent = new Agent({
      apiKey: "test-key",
      model: "claude-3-5-sonnet",
    });
 
    // Mock API to return predictable responses
    const mockResponses = [
      "I understand you want to analyze this data.",
      "Let me use the analyze_data tool.",
      "The analysis shows an upward trend.",
    ];
 
    let responseIndex = 0;
    jest.spyOn(agent, "send").mockImplementation(async () => {
      const response = mockResponses[responseIndex++];
      return { text: response };
    });
 
    const conversation = [];
    conversation.push(await agent.send("Analyze this dataset"));
    conversation.push(await agent.send("What does the trend show?"));
    conversation.push(await agent.send("Thanks"));
 
    expect(conversation).toMatchSnapshot();
  });
});

If Claude's response format changes or your system prompt changes, the snapshot breaks and alerts you to the change.
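Under the hood, snapshot matching is just record-then-compare. Here's a minimal sketch of the mechanics, assuming an in-memory store instead of Jest's snapshot files:

```typescript
// Minimal sketch of snapshot logic: the first run records the serialized
// value; later runs compare against it and report a diff on mismatch.
class SnapshotStore {
  private snapshots = new Map<string, string>();

  match(name: string, value: unknown): { pass: boolean; diff?: string } {
    const serialized = JSON.stringify(value, null, 2);
    const existing = this.snapshots.get(name);
    if (existing === undefined) {
      // First run: record the snapshot and pass
      this.snapshots.set(name, serialized);
      return { pass: true };
    }
    if (existing === serialized) {
      return { pass: true };
    }
    return {
      pass: false,
      diff: `expected:\n${existing}\nreceived:\n${serialized}`,
    };
  }
}
```

Jest does the same thing with files under `__snapshots__/`, plus an approval step (`jest -u`) for intentional changes.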

Testing Error Scenarios

Agents should gracefully handle failures. Test what happens when tools fail or return unexpected data. This is crucial testing because agents operate in the real world where things go wrong constantly.

Error handling is where many agent implementations fail. They work perfectly when everything goes right—when tools return data in the expected format, when networks are stable, when APIs don't rate-limit. But real systems are noisy. Networks timeout. APIs rate-limit. Tools return data in slightly different formats than expected. Databases go down.

Your agent needs to handle all of this gracefully. What happens when a tool throws an exception? Does the agent recover or does it crash? What happens when a tool returns unexpected data? Does the agent ask clarifying questions or does it assume and potentially make a wrong decision? What happens when a tool times out? Does the agent retry, move on, or give up?

These aren't edge cases. These are normal cases that happen in production. An agent that handles them gracefully is an agent that works. An agent that doesn't is a liability.

The test patterns here are about simulating these failure modes and verifying the agent's response. You're mocking tools to fail in specific ways and checking that the agent handles it correctly. You're mocking rate limit responses and verifying the agent retries. You're returning malformed data and checking that the agent doesn't crash.

Here's how to simulate those failure modes:

typescript
describe("Error Handling", () => {
  test("agent handles tool failure gracefully", async () => {
    const mockTools = {
      unstable_tool: jest
        .fn()
        .mockRejectedValue(new Error("Tool temporarily unavailable")),
    };
 
    const agent = new Agent({
      apiKey: "test-key",
      model: "claude-3-5-sonnet",
      tools: [
        {
          name: "unstable_tool",
          description: "A tool that might fail",
          handler: mockTools.unstable_tool,
        },
      ],
    });
 
    // Mock Claude: first turn requests the tool, second turn recovers
    jest
      .spyOn(agent, "send")
      .mockResolvedValueOnce({
        toolCalls: [{ name: "unstable_tool", input: {} }],
      })
      .mockResolvedValueOnce({
        text: "The tool failed, so I'll proceed without it",
      });

    let result = await agent.send("Use the tool");
    if (result.toolCalls) {
      // Run the tool, capture its failure, and feed the error back
      let toolError = "";
      try {
        await mockTools.unstable_tool(result.toolCalls[0].input);
      } catch (e: any) {
        toolError = e.message;
      }
      result = await agent.send(`Tool error: ${toolError}`);
    }
    expect(result.text).toContain("failed");
  });
 
  test("agent retries on rate limits", async () => {
    let attempts = 0;
    const mockTool = jest.fn().mockImplementation(() => {
      attempts++;
      if (attempts < 2) {
        const error: any = new Error("Rate limited");
        error.status = 429;
        throw error;
      }
      return { success: true };
    });
 
    const agent = new Agent({
      apiKey: "test-key",
      model: "claude-3-5-sonnet",
      tools: [
        {
          name: "test_tool",
          handler: mockTool,
        },
      ],
    });
 
    // Agent should retry automatically
    const result = await agent.executeWithRetry("test_tool", {});
    expect(result.success).toBe(true);
    expect(attempts).toBe(2);
  });
 
  test("agent handles malformed tool results", async () => {
    const mockTools = {
      buggy_tool: jest.fn().mockResolvedValue("just a string"),
    };
 
    const agent = new Agent({
      apiKey: "test-key",
      model: "claude-3-5-sonnet",
      tools: [
        {
          name: "buggy_tool",
          handler: mockTools.buggy_tool,
          outputSchema: {
            type: "object",
            properties: {
              status: { type: "string" },
              data: { type: "object" },
            },
          },
        },
      ],
    });
 
    // Should validate or handle gracefully
    const result = await agent.send("Call buggy tool");
    expect(result).toBeDefined();
  });
 
  test("agent avoids infinite loops", async () => {
    const agent = new Agent({
      apiKey: "test-key",
      model: "claude-3-5-sonnet",
      maxToolCalls: 5, // Prevent infinite loops
    });
 
    // Mock Claude to keep requesting the same tool forever
    jest.spyOn(agent, "send").mockResolvedValue({
      toolCalls: [
        {
          name: "some_tool",
          input: {},
        },
      ],
    });

    // Drive the tool loop manually; the guard mirrors maxToolCalls
    let toolCallCount = 0;
    let result = await agent.send("Do something");
    while (result.toolCalls && result.toolCalls.length > 0) {
      if (toolCallCount >= 5) break; // stop at maxToolCalls
      toolCallCount++;
      result = await agent.send("Tool result: {}");
    }

    // Should have stopped after exactly maxToolCalls tool executions
    expect(toolCallCount).toBe(5);
  });
});

Error handling tests are where you catch the edge cases that unit tests miss.
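The retry behavior the rate-limit test exercises can be sketched as a standalone wrapper. `withRetry` is an invented name, not SDK API; it retries only 429s, with exponential backoff, and rethrows everything else.

```typescript
// Sketch of retry-on-rate-limit, assuming errors carry a numeric `status`
// field as in the tests above. Retries only 429s; everything else rethrows.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 100,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      if (err?.status !== 429 || attempt >= maxRetries) {
        throw err; // not retryable, or out of retries
      }
      // Exponential backoff: baseDelayMs, then 2x, 4x, ...
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
}
```

In tests, pass a tiny `baseDelayMs` so retry paths run in milliseconds instead of real backoff time.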

Testing Agent State Management

Agents maintain state across turns. Test that state persists and updates correctly:

typescript
describe("Agent State Management", () => {
  test("agent maintains conversation state", async () => {
    const agent = new Agent({
      apiKey: "test-key",
      model: "claude-3-5-sonnet",
    });
 
    // Mock responses that reference previous state
    const responses = [
      { text: "I noted you want to analyze sales data" },
      { text: "Based on what you said earlier, Q3 was strong" },
      { text: "Yes, that insight from earlier applies here too" },
    ];
 
    let turnCount = 0;
    jest.spyOn(agent, "send").mockImplementation(async (message) => {
      // Mocking send bypasses the SDK's bookkeeping, so record turns here
      agent.conversationHistory.push({ role: "user", content: message });
      return responses[turnCount++];
    });
 
    const msg1 = await agent.send("I want to analyze Q3 sales");
    expect(msg1.text).toContain("sales data");
 
    const msg2 = await agent.send("How was performance?");
    expect(msg2.text).toContain("Q3");
 
    const msg3 = await agent.send("Any patterns?");
    expect(msg3.text).toContain("earlier");
 
    // Verify context was maintained across turns
    expect(agent.conversationHistory.length).toBe(3);
  });
 
  test("agent clears state when requested", async () => {
    const agent = new Agent({
      apiKey: "test-key",
      model: "claude-3-5-sonnet",
    });
 
    // Build up conversation
    await agent.send("Message 1");
    await agent.send("Message 2");
    expect(agent.conversationHistory.length).toBe(2);
 
    // Reset
    agent.clearHistory();
    expect(agent.conversationHistory.length).toBe(0);
  });
});

State management tests ensure agents don't "forget" context between messages.

Testing with Real External Services

For comprehensive testing, sometimes you need real external service calls:

typescript
describe("Integration Tests with Real Services", () => {
  // Mark as integration test (slower, skipped in unit test runs)
  test("agent calls real search API", async () => {
    const agent = new Agent({
      apiKey: process.env.ANTHROPIC_API_KEY,
      model: "claude-3-5-sonnet",
      tools: [
        {
          name: "search",
          description: "Search the web",
          handler: async (input: { query: string }) => {
            // Real API call
            const response = await fetch(
              `https://api.duckduckgo.com/?q=${encodeURIComponent(input.query)}&format=json`,
            );
            return response.json();
          },
        },
      ],
    });
 
    const result = await agent.send("Search for Claude AI");
    expect(result).toBeDefined();
    expect(result.text.length).toBeGreaterThan(0);
  }, 30000); // 30 second timeout
});

Mark these tests to run separately from unit tests.
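One way to enforce that separation is Jest's `projects` config plus `--selectProjects`. The paths here assume a `tests/unit` / `tests/integration` layout like the trees in this guide:

```javascript
// jest.config.js — a possible split between fast unit tests and slower
// integration tests (paths are assumptions matching the layout above)
module.exports = {
  projects: [
    {
      displayName: "unit",
      testMatch: ["<rootDir>/tests/unit/**/*.test.ts"],
    },
    {
      displayName: "integration",
      testMatch: ["<rootDir>/tests/integration/**/*.test.ts"],
      testTimeout: 30000, // integration tests talk to slow services
    },
  ],
};
```

Then `"test:unit": "jest --selectProjects unit"` and `"test:integration": "jest --selectProjects integration"` in package.json give you the separate commands the CI pipeline calls.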

CI Pipeline Integration

Run integration tests in CI with proper isolation:

yaml
name: Agent SDK Integration Tests
 
on:
  pull_request:
  push:
    branches: [main]
 
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        node-version: [18.x, 20.x]
    steps:
      - uses: actions/checkout@v3
 
      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: ${{ matrix.node-version }}
 
      - name: Install dependencies
        run: npm ci
 
      - name: Run unit tests
        run: npm run test:unit
 
      - name: Run integration tests
        run: npm run test:integration
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          TEST_TIMEOUT: 30000
 
      - name: Generate coverage
        run: npm run test:coverage
 
      - name: Upload coverage
        uses: codecov/codecov-action@v3
        with:
          files: ./coverage/lcov.info
 
      - name: Test with different API models
        run: |
          TEST_MODEL=claude-3-opus npm run test:integration
          TEST_MODEL=claude-3-sonnet npm run test:integration
          TEST_MODEL=claude-3-haiku npm run test:integration

Running tests with multiple models catches model-specific behavior differences.
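On the test side, the model can be read from that matrix variable with a plain helper; `TEST_MODEL` and the default value are assumptions matching the pipeline above:

```typescript
// Resolve the model under test from the environment, with a default so
// local runs need no setup. Call as modelUnderTest(process.env) in setup.
function modelUnderTest(env: Record<string, string | undefined>): string {
  return env.TEST_MODEL ?? "claude-3-5-sonnet";
}
```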

Performance Testing Integration Tests

Integration tests can be slow. Monitor and optimize:

typescript
describe("Integration Test Performance", () => {
  beforeEach(() => {
    jest.setTimeout(10000); // 10 second timeout
  });
 
  test("agent responds within acceptable time", async () => {
    const agent = new Agent({
      apiKey: "test-key",
      model: "claude-3-5-sonnet",
    });
 
    const start = Date.now();
    await agent.send("Quick question");
    const elapsed = Date.now() - start;
 
    expect(elapsed).toBeLessThan(5000);
  });
 
  test("complex workflow completes in time", async () => {
    const workflow = buildComplexWorkflow();
 
    const start = Date.now();
    await workflow.execute();
    const elapsed = Date.now() - start;
 
    // Complex workflows should still complete reasonably
    expect(elapsed).toBeLessThan(30000);
  });
});

Performance assertions catch regressions early.
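A tiny wrapper makes these budget checks reusable across tests; `timeBudget` is a name invented for this sketch:

```typescript
// Run an async step and fail if it exceeds a millisecond budget.
async function timeBudget<T>(limitMs: number, fn: () => Promise<T>): Promise<T> {
  const start = Date.now();
  const result = await fn();
  const elapsed = Date.now() - start;
  if (elapsed > limitMs) {
    throw new Error(`Step took ${elapsed}ms, budget was ${limitMs}ms`);
  }
  return result;
}
```

Usage: `await timeBudget(5000, () => agent.send("Quick question"))` replaces the manual start/elapsed bookkeeping in each test.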

Test Organization Best Practices

Organize tests by agent responsibility:

tests/
├── unit/
│   ├── agent.test.ts
│   ├── tools.test.ts
│   └── state.test.ts
├── integration/
│   ├── workflows.test.ts
│   ├── tool-chains.test.ts
│   └── error-handling.test.ts
└── e2e/
    └── critical-paths.test.ts

Debugging Failed Integration Tests

When tests fail, track down the issue systematically:

typescript
describe("Debugging Integration Tests", () => {
  test("agent with detailed logging", async () => {
    const agent = new Agent({
      apiKey: "test-key",
      model: "claude-3-5-sonnet",
      debug: true, // Enable detailed logging
    });
 
    // Capture logs for inspection
    const logs: string[] = [];
    agent.on("log", (message: string) => logs.push(message));
 
    try {
      await agent.send("Complex request");
    } catch (error) {
      console.log("Logs leading to failure:");
      logs.forEach((log) => console.log(log));
      throw error;
    }
  });
});

Advanced: Testing Tool Output Validation

Verify that tools return properly formatted data:

typescript
describe("Tool Output Validation", () => {
  test("tool returns data matching schema", async () => {
    const mockTool = jest.fn().mockResolvedValue({
      results: [
        { title: "Article 1", url: "https://example.com/1" },
        { title: "Article 2", url: "https://example.com/2" },
      ],
      count: 2,
      hasMore: false,
    });
 
    const agent = new Agent({
      apiKey: "test-key",
      model: "claude-3-5-sonnet",
      tools: [
        {
          name: "search",
          handler: mockTool,
          outputSchema: {
            type: "object",
            required: ["results", "count"],
            properties: {
              results: {
                type: "array",
                items: {
                  type: "object",
                  required: ["title", "url"],
                  properties: {
                    title: { type: "string" },
                    url: { type: "string", format: "uri" },
                  },
                },
              },
              count: { type: "number" },
              hasMore: { type: "boolean" },
            },
          },
        },
      ],
    });
 
    const result = await agent.send("Search for test");
    expect(result).toBeDefined();
 
    // Verify Claude received properly formatted output
    expect(mockTool).toHaveBeenCalledWith(expect.any(Object));
  });
});
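What that `outputSchema` check has to do can be sketched with a hand-rolled validator for a tiny JSON Schema subset (required keys plus primitive `type` tags). A production version would use a real validator such as Ajv:

```typescript
// Hand-rolled check for a small JSON Schema subset: top-level object,
// required keys, and primitive type tags on properties.
type MiniSchema = {
  type: "object";
  required?: string[];
  properties?: Record<string, { type: string }>;
};

function validateOutput(schema: MiniSchema, value: unknown): string[] {
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return [`expected object, got ${typeof value}`];
  }
  const obj = value as Record<string, unknown>;
  const errors: string[] = [];
  for (const key of schema.required ?? []) {
    if (!(key in obj)) errors.push(`missing required key: ${key}`);
  }
  for (const [key, spec] of Object.entries(schema.properties ?? {})) {
    if (key in obj) {
      const actual = Array.isArray(obj[key]) ? "array" : typeof obj[key];
      if (actual !== spec.type) {
        errors.push(`${key}: expected ${spec.type}, got ${actual}`);
      }
    }
  }
  return errors;
}
```

Running this on a tool's return value before handing it back to Claude turns the "just a string" failure mode into a clear, testable error list.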

Testing Conversations with Memory

Agents maintain state across turns. Test memory persistence:

typescript
describe("Agent Memory and Learning", () => {
  test("agent remembers user preferences across sessions", async () => {
    const agent = new Agent({
      apiKey: "test-key",
      model: "claude-3-5-sonnet",
      memory: new ConversationMemory(),
    });
 
    // First conversation: user states preference
    await agent.send("I prefer verbose explanations");
 
    // Second conversation: agent recalls preference
    const response = await agent.send("Explain quantum computing");
 
    // Verify agent remembered preference
    expect(response.text.length).toBeGreaterThan(500); // Verbose
  });
 
  test("agent updates memory with new information", async () => {
    const agent = new Agent({
      apiKey: "test-key",
      model: "claude-3-5-sonnet",
      memory: new ConversationMemory(),
    });
 
    // Build up knowledge over conversation
    await agent.send("My name is Alice");
    await agent.send("I work in finance");
    await agent.send("I am interested in crypto");
 
    // Later, agent uses accumulated context
    const response = await agent.send("What should I read?");
    expect(response.text).toMatch(/finance|crypto/);
  });
});

Testing Concurrent Agent Requests

Verify agents handle parallel requests correctly:

typescript
describe("Concurrent Agent Operations", () => {
  test("agent handles concurrent requests independently", async () => {
    const agent = new Agent({
      apiKey: "test-key",
      model: "claude-3-5-sonnet",
    });
 
    const requests = [
      agent.send("Calculate 2 + 2"),
      agent.send("What is AI?"),
      agent.send("Explain blockchain"),
    ];
 
    const responses = await Promise.all(requests);
 
    expect(responses).toHaveLength(3);
    responses.forEach((response) => {
      expect(response.text).toBeDefined();
    });
 
    // Verify responses are independent
    expect(responses[0].text).toContain("4"); // Math answer
    expect(responses[1].text).toContain("intelligent"); // AI answer
    expect(responses[2].text).toContain("chain"); // Blockchain answer
  });
 
  test("agent isolates state between concurrent users", async () => {
    const agent = new Agent({
      apiKey: "test-key",
      model: "claude-3-5-sonnet",
      perUserIsolation: true,
    });
 
    // User 1 sets context
    const user1 = agent.createSession("user_1");
    await user1.send("My name is Alice");
 
    // User 2 sets different context
    const user2 = agent.createSession("user_2");
    await user2.send("My name is Bob");
 
    // Verify contexts are isolated
    const user1Response = await user1.send("Who am I?");
    const user2Response = await user2.send("Who am I?");
 
    expect(user1Response.text).toContain("Alice");
    expect(user2Response.text).toContain("Bob");
  });
});
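The isolation that test asserts boils down to keeping one history per session key. A minimal sketch, where `SessionRouter` is an invented name rather than the SDK's implementation:

```typescript
// Per-session isolation: each session key gets its own message history,
// created lazily on first access.
class SessionRouter {
  private sessions = new Map<string, string[]>();

  session(userId: string): string[] {
    let history = this.sessions.get(userId);
    if (!history) {
      history = [];
      this.sessions.set(userId, history);
    }
    return history;
  }

  record(userId: string, message: string): void {
    this.session(userId).push(message);
  }
}
```

The isolation test is then just: record into two sessions, and assert that neither history contains the other's messages.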

Testing Agent Behavior Under Load

Simulate realistic load to identify scalability issues:

typescript
describe("Agent Load Testing", () => {
  test("agent maintains performance under sustained load", async () => {
    const agent = new Agent({
      apiKey: "test-key",
      model: "claude-3-5-sonnet",
    });
 
    const loadTest = async () => {
      const concurrency = 10;
      const requestsPerClient = 5;
      // Launch all clients at once, each firing its requests in parallel
      const clients = Array.from({ length: concurrency }, (_, i) =>
        Promise.all(
          Array.from({ length: requestsPerClient }, (_, j) =>
            agent.send(`Client ${i} request ${j}`),
          ),
        ),
      );

      return (await Promise.all(clients)).flat();
    };
 
    const startTime = Date.now();
    const responses = await loadTest();
    const elapsed = Date.now() - startTime;
 
    expect(responses).toHaveLength(50); // 10 clients × 5 requests
    expect(elapsed).toBeLessThan(30000); // Completes in 30 seconds
  });
 
  test("agent recovers from transient failures", async () => {
    let callCount = 0;
    jest.spyOn(global, "fetch").mockImplementation(async () => {
      callCount++;
 
      // Fail first 2 calls
      if (callCount <= 2) {
        throw new Error("Transient error");
      }
 
      return {
        json: async () => ({
          /* valid response */
        }),
      };
    });
 
    const agent = new Agent({
      apiKey: "test-key",
      model: "claude-3-5-sonnet",
      maxRetries: 3,
      retryDelay: 100,
    });
 
    const response = await agent.send("Test query");
    expect(response).toBeDefined();
    expect(callCount).toBe(3); // Retried twice, succeeded on third try
  });
});

Real-World Test Suite Example

Here's a complete integration test suite structure:

tests/
├── integration/
│   ├── fixtures/
│   │   ├── api-responses.ts      # Mock API responses
│   │   └── test-data.ts          # Test data factories
│   ├── mocks/
│   │   ├── claude-api.ts         # Claude API mocking
│   │   └── external-services.ts  # External service mocks
│   ├── setup.ts                  # Test environment setup
│   ├── agent.test.ts             # Core agent tests
│   ├── tool-chains.test.ts       # Tool execution tests
│   ├── error-handling.test.ts    # Error scenario tests
│   ├── state-management.test.ts  # Memory and state tests
│   ├── concurrency.test.ts       # Concurrent operation tests
│   └── performance.test.ts       # Performance benchmarks
└── jest.config.js                # Jest configuration

Key Takeaways

Integration testing for agents requires thinking about the entire flow—from Claude's responses through tool execution to final results. Mock predictably, test chains thoroughly, handle errors gracefully, and maintain state correctly. Run tests in CI with multiple configurations to catch edge cases.

Key principles:

  • Mock Claude API responses for speed and determinism
  • Test tool chains end-to-end, including data flow
  • Snapshot conversation output for regression detection
  • Test error scenarios and recovery paths thoroughly
  • Verify state management and memory across turns
  • Use CI to run tests with multiple models and configurations
  • Monitor performance of integration tests under load
  • Separate unit, integration, and end-to-end tests clearly
  • Test concurrent operations and multi-user scenarios
  • Validate tool output against schemas
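The last bullet deserves a concrete shape. Here's a minimal sketch of validating tool output against a declared schema before it flows back to the agent; the validator, the `Schema` type, and the field names are all hypothetical (a real suite might use a library such as zod instead):

```typescript
// Minimal structural validator for tool results. Illustrative only —
// field names and the Schema shape are made up for this example.
type Schema = { [field: string]: "string" | "number" | "boolean" };

function validateToolOutput(
  output: Record<string, unknown>,
  schema: Schema,
): { ok: boolean; errors: string[] } {
  const errors: string[] = [];
  for (const [field, expected] of Object.entries(schema)) {
    if (!(field in output)) {
      errors.push(`missing field: ${field}`);
    } else if (typeof output[field] !== expected) {
      errors.push(`${field}: expected ${expected}, got ${typeof output[field]}`);
    }
  }
  return { ok: errors.length === 0, errors };
}

// A hypothetical search tool's result, checked against the shape the agent expects.
const searchResultSchema: Schema = { query: "string", totalHits: "number" };
const result = validateToolOutput(
  { query: "quantum computing", totalHits: 42 },
  searchResultSchema,
);
console.log(result.ok); // true
```

Running this check inside integration tests catches tools that drift away from the shape the agent was prompted to expect.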

Test Organization and File Structure

Organize integration tests by responsibility, extending the structure above with unit and end-to-end tiers:

tests/
├── unit/
│   ├── agents/
│   │   ├── agent.test.ts
│   │   └── state.test.ts
│   ├── tools/
│   │   ├── tool-executor.test.ts
│   │   └── tool-registry.test.ts
│   └── utilities/
│       └── context-pruning.test.ts
├── integration/
│   ├── fixtures/
│   │   ├── api-responses.ts
│   │   ├── mock-tools.ts
│   │   └── test-data.ts
│   ├── mocks/
│   │   ├── claude-api.ts
│   │   ├── stripe-mock.ts
│   │   └── email-mock.ts
│   ├── setup.ts
│   ├── agents.test.ts
│   ├── tool-chains.test.ts
│   ├── error-handling.test.ts
│   ├── state-management.test.ts
│   └── concurrency.test.ts
├── e2e/
│   ├── critical-paths.test.ts
│   └── workflows.test.ts
├── jest.config.js
└── test-utils.ts

Debugging Integration Tests

When tests fail, debugging strategies matter:

typescript
describe("Debugging Agent Issues", () => {
  test("agent with detailed logging", async () => {
    const agent = new Agent({
      apiKey: "test-key",
      model: "claude-3-5-sonnet",
      debug: true,
    });
 
    // Capture all logs
    const logs: any[] = [];
    const originalLog = console.log;
    console.log = (...args) => {
      logs.push({ timestamp: Date.now(), message: args });
      originalLog(...args);
    };
 
    try {
      await agent.send("Test query");
    } catch (error) {
      // Print logs leading to failure
      console.log("\n=== Test Failure Logs ===");
      logs.forEach((log) => {
        console.log(`[${log.timestamp}] ${JSON.stringify(log.message)}`);
      });
 
      throw error;
    } finally {
      console.log = originalLog;
    }
  });
 
  test("agent with network tracing", async () => {
    const networkRequests: any[] = [];
 
    // Patch fetch to capture requests
    const originalFetch = global.fetch;
    global.fetch = async (url, options) => {
      const start = Date.now();
      const response = await originalFetch(url, options);
      const duration = Date.now() - start;
 
      networkRequests.push({
        url,
        method: options?.method || "GET",
        status: response.status,
        duration,
      });
 
      return response;
    };
 
    try {
      const agent = new Agent({
        apiKey: "test-key",
        model: "claude-3-5-sonnet",
      });
 
      await agent.send("Test");
 
      // Analyze network behavior
      console.log("\n=== Network Trace ===");
      networkRequests.forEach((req) => {
        console.log(
          `${req.method} ${req.url} -> ${req.status} (${req.duration}ms)`,
        );
      });
 
      expect(networkRequests.length).toBeGreaterThan(0);
    } finally {
      global.fetch = originalFetch;
    }
  });
});

Snapshot Comparison and Regression Detection

Use snapshot diffs to detect regressions:

bash
#!/bin/bash
# detect-snapshot-regressions.sh
 
# Get list of changed snapshots
CHANGED=$(git diff --name-only | grep __snapshots__)
 
if [ -z "$CHANGED" ]; then
  echo "No snapshot changes"
  exit 0
fi
 
echo "Changed snapshots:"
echo "$CHANGED"
 
# Run tests without writing snapshots (--ci makes Jest fail on mismatches
# instead of silently creating new snapshots)
npm test -- --ci
 
# Display snapshot diffs
for snapshot in $CHANGED; do
  echo ""
  echo "=== $snapshot ==="
  git diff "$snapshot"
done
 
# Fail if changes are unexpected
echo ""
read -p "Approve these snapshot changes? (y/n) " -n 1 -r
echo
if [[ ! $REPLY =~ ^[Yy]$ ]]; then
  exit 1
fi

The Philosophy of Agent Integration Testing

Here's the deeper principle at work. In traditional software, you test that pieces work correctly, then you reason about how they fit together. In agent systems, you can't reason about the fit—you have to test it. Because the agent's behavior depends on Claude's response, and Claude's response depends on state, and state depends on previous tool calls, and those tool calls depend on previous Claude responses. It's a complex feedback loop.

Integration testing for agents is really about validating the feedback loop. You're checking that the loop closes correctly: that Claude's response leads to the right tool call, that the tool call returns sensible data, that Claude interprets that data correctly and makes the right next decision, and that the loop eventually terminates in a satisfactory state.

This is why deterministic mocking is crucial. You can't test feedback loops with non-deterministic systems because you can't predict what should happen next. You have to mock Claude's responses to be predictable so you can verify the loop works correctly.
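To make that concrete, here's one way to build a deterministic mock: a scripted stand-in for Claude that replays a fixed sequence of turns, so every iteration of the feedback loop is predictable. The `MockTurn` shape, the loop runner, and the tool names are all hypothetical, not the SDK's actual API:

```typescript
// A scripted mock: each call returns the next canned "Claude" turn,
// making the agent's feedback loop fully deterministic. Names are illustrative.
type MockTurn =
  | { type: "tool_call"; tool: string; input: unknown }
  | { type: "final"; text: string };

function makeScriptedClaude(script: MockTurn[]) {
  let i = 0;
  return async (_prompt: string): Promise<MockTurn> => {
    if (i >= script.length) throw new Error("Mock script exhausted");
    return script[i++];
  };
}

async function runLoop(
  claude: (prompt: string) => Promise<MockTurn>,
  tools: Record<string, (input: unknown) => unknown>,
): Promise<string> {
  let prompt = "start";
  for (let turn = 0; turn < 10; turn++) { // hard cap guards against infinite loops
    const step = await claude(prompt);
    if (step.type === "final") return step.text;
    // Feed the tool result back in as the next prompt, closing the loop.
    prompt = JSON.stringify(tools[step.tool](step.input));
  }
  throw new Error("Loop did not terminate");
}

// The loop should call search exactly once, then finish.
const claude = makeScriptedClaude([
  { type: "tool_call", tool: "search", input: { q: "test" } },
  { type: "final", text: "done" },
]);
runLoop(claude, { search: () => ({ hits: 3 }) }).then((out) => console.log(out)); // prints "done"
```

Because the script is fixed, you can assert exactly which tool was called, with what input, and in what order.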

This is why error handling tests matter so much. Real feedback loops fail. Networks time out. APIs rate-limit. Tools return unexpected data. Your agent needs to handle these failures gracefully and recover. An agent that works perfectly when everything goes right but breaks when anything goes wrong is a liability in production.

This is why you test with real services occasionally. Mocks are great for determinism, but they're also lies. They hide the ways that real systems behave differently from mocks. Eventually, you need to test against real services to verify that your mocks were accurate. This is why we have integration testing with real services as a separate category—you run it less frequently because it's slow and flaky, but you still run it to catch real-world integration issues.

Making Agent Testing Practical

The practical truth is that you'll iterate on your integration tests. Your first test suite will be incomplete. You'll discover edge cases in production that your tests didn't cover. You'll add tests for those edge cases. Over time, your test suite becomes a map of all the ways your agent can fail. That's the goal.

Start with the critical paths. What's the happy path through your agent? Test that first. Then test the most likely failure modes. Then test the weird edge cases. Prioritize tests that would cause real harm if they failed in production.

Claude Code makes this practical because your tests run against the actual SDK. You're not testing an abstraction—you're testing real agent behavior. The investment in comprehensive integration tests pays dividends: confident deployments, fewer production incidents, and easier maintenance as the agent evolves.

The Testing Mindset for Agents

Here's something many developers miss: testing agents requires a different mental model than testing traditional applications. With normal code, you test inputs and outputs. With agents, you test decision loops. The agent receives information, makes a decision, takes an action, receives new information, and makes another decision. Each loop is a branch point where things can go wrong.

Think about a research agent that's supposed to find information, synthesize it, and write a summary. The happy path is straightforward: search → read → summarize → return. But real agents hit dozens of branches: maybe the first search returns irrelevant results. The agent adapts. Maybe the second search times out. The agent retries. Maybe the source format is unexpected. The agent handles it gracefully. Or it crashes and returns garbage.

Your tests need to validate all these branches. You can't just test the happy path. You need to test that the agent behaves intelligently when paths diverge. And that's where comprehensive integration testing becomes essential.

Test Organization at Scale

As your agent system grows, you'll have dozens of agents. Each with different purposes. Each with different failure modes. How do you keep tests organized? The pattern is to mirror your agent structure.

If you have specialized agents for research, writing, coding, and analysis, create test suites that match. Each test suite covers that agent's specific workflows and failure modes. When you need to understand how a particular agent behaves, you go to its test suite. You're not hunting through a massive test file trying to find the relevant tests.

But there's a deeper level too. You'll have tests that validate how agents collaborate. Agent A fetches data, Agent B analyzes it, Agent C reports results. That end-to-end flow needs integration tests. These are the tests that catch the subtle bugs where agents interact in unexpected ways.

Capturing Agent Behavior Over Time

Here's a powerful pattern that many teams miss: version your snapshots. When you update your agent's system prompt, capture a new snapshot. When you upgrade Claude to a new model, capture snapshots. Over time, you have a history of how your agent's behavior evolved.

This history is invaluable. You can look back and see exactly how output changed when you adjusted parameters. You can see whether behavior got better or worse. You can identify unintended side effects of changes that seemed benign. This longitudinal view of agent behavior is how you build real expertise.
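One simple way to implement this is to encode the model and prompt revision into the snapshot key itself, so each configuration keeps its own history instead of overwriting the last. This naming scheme is an illustration, not a convention from the SDK:

```typescript
// Versioned snapshot naming: the model and prompt revision become part of
// the snapshot key, so old behavior survives alongside new. Illustrative only.
function snapshotKey(
  testName: string,
  model: string,
  promptVersion: string,
): string {
  const slug = testName.toLowerCase().replace(/\s+/g, "-");
  return `${slug}.${model}.${promptVersion}.snap`;
}

console.log(snapshotKey("Summarize Article", "claude-3-5-sonnet", "prompt-v2"));
// summarize-article.claude-3-5-sonnet.prompt-v2.snap
```

When you upgrade the model or revise the prompt, new snapshots land beside the old ones, and diffing the two files shows exactly how behavior shifted.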

Stress Testing Your Agents

In production, your agent will face conditions you didn't anticipate. Heavy load. Network timeouts. Rate limiting. Malformed responses. Your test suite should stress your agent deliberately, at scale, to find breaking points before users do.

Stress testing agents means running hundreds of requests in parallel, simulating realistic failure conditions, and measuring how gracefully the agent degrades. Does it return sensible partial results when APIs are slow? Does it retry appropriately without getting stuck in loops? Does it report reasonable errors when everything fails?
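A minimal stress sketch looks like this: fire many concurrent requests at a flaky stand-in service, retry a bounded number of times, and measure how many ultimately succeed. The service, failure rate, and retry helper are all hypothetical:

```typescript
// Stress sketch: concurrent requests against a flaky stand-in service,
// with bounded retries. All names and rates are illustrative.
async function flakyService(failRate: number): Promise<string> {
  if (Math.random() < failRate) throw new Error("Transient failure");
  return "ok";
}

async function withRetries<T>(
  fn: () => Promise<T>,
  attempts: number,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err; // swallow and retry up to the attempt limit
    }
  }
  throw lastError;
}

async function stressTest(requests: number): Promise<number> {
  const results = await Promise.allSettled(
    Array.from({ length: requests }, () =>
      withRetries(() => flakyService(0.3), 3),
    ),
  );
  return results.filter((r) => r.status === "fulfilled").length;
}

stressTest(100).then((succeeded) =>
  console.log(`${succeeded}/100 succeeded`), // typically ~97/100 with these rates
);
```

Asserting a minimum success rate (rather than perfection) is the point: you're measuring graceful degradation, not demanding that nothing ever fails.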

The agents that survive stress testing are the ones you can confidently deploy to production. The ones that haven't seen stress testing are time bombs waiting to explode under real conditions.

Monitoring and Observability in Tests

The best integration tests aren't just pass/fail assertions. They're instrumented. They log intermediate states. They capture performance metrics. When a test fails in CI, you should be able to understand exactly where it failed and why, without re-running it locally.

This requires thinking about observability from day one. Log every decision the agent makes. Record tool execution times. Capture Claude's responses. When you need to debug a failure, you have the full trace. You're not guessing. You're analyzing actual behavior.
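A small trace recorder is enough to get started. Every decision and tool call is appended with a timestamp, and on failure the whole trace is dumped into the CI log. The `Trace` class and event kinds here are hypothetical:

```typescript
// A tiny trace recorder: every agent step is logged with a timestamp so a
// CI failure can be diagnosed from the trace alone. Names are illustrative.
type TraceEvent = { at: number; kind: string; detail: string };

class Trace {
  private events: TraceEvent[] = [];

  record(kind: string, detail: string): void {
    this.events.push({ at: Date.now(), kind, detail });
  }

  // Render the full trace for a CI log when a test fails.
  dump(): string {
    return this.events
      .map((e) => `[${e.at}] ${e.kind}: ${e.detail}`)
      .join("\n");
  }

  count(kind: string): number {
    return this.events.filter((e) => e.kind === kind).length;
  }
}

const trace = new Trace();
trace.record("decision", "chose search tool");
trace.record("tool", "search took 120ms");
trace.record("decision", "enough data, summarizing");
console.log(trace.count("decision")); // 2
```

Assertions can then target the trace itself: verify the agent made exactly two decisions, or that no tool call exceeded a latency budget.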

Evolving Your Test Suite

Your test suite should evolve with your agents. When you discover a bug in production, add a test that would have caught it. When you make a breaking change, update the corresponding tests. When you add a new feature, write tests for it first (yes, test-driven development works for agents too).

Over time, your test suite becomes a specification of your agent's behavior. It documents what the agent does, how it handles edge cases, and what failure modes are acceptable. New team members can read the tests to understand the agent's capabilities and limitations.

The Confidence Multiplier

Here's what you gain from comprehensive integration testing: confidence. Confidence to deploy without fear. Confidence to iterate quickly. Confidence to refactor. Confidence that your agents work as intended. In production systems, confidence is currency. You trade careful testing for the ability to ship fast.

Teams with weak test coverage ship slowly. They're terrified of changes. They test manually. They take hours to verify that nothing broke. Teams with comprehensive integration tests ship fast. They iterate on features. They refactor confidently. They spend their energy on building new capabilities, not debugging production failures.

The investment in building a strong integration test suite pays dividends in team velocity, product reliability, and peace of mind. You're not just catching bugs. You're enabling your team to move faster without introducing risk.

Real-World Test Coverage Patterns

Let me walk you through what mature agent test coverage actually looks like in production systems. The pattern evolves through stages.

Stage 1: Happy Path (Week 1). You write tests for the core functionality. The agent receives proper input, makes correct decisions, executes tools, and returns the expected output. These tests establish that your basic infrastructure works. They usually pass on the first try because you've mocked everything perfectly.

Stage 2: Realistic Input Variation (Weeks 2-3). Real users provide input that varies from what you expected. Maybe they use different phrasing. Maybe their data has unexpected structure. You add tests that cover reasonable variations. The agent should handle "find me information about quantum computing" the same way it handles "I need to learn about quantum mechanics and quantum entanglement." These tests often fail initially because your agent is brittle to input variation.

Stage 3: Error Paths (Weeks 4-5). You discover failure modes from early users. A timeout here, a rate limit there, an API returning unexpected data. You add tests that deliberately trigger these errors and verify the agent handles them gracefully. Many agents fail at this stage because error handling was an afterthought. Adding these tests forces you to build proper resilience.

Stage 4: Interaction Between Agents (Weeks 6-7). If you have multiple agents working together, you discover they don't always communicate cleanly. Agent A returns data in a format that breaks Agent B. You add integration tests that validate the whole pipeline. These tests are where you discover subtle bugs that don't appear in single-agent tests.

Stage 5: Scale and Stress (Week 8+). You test with realistic load. 100 concurrent users. 1000 concurrent requests. You discover bottlenecks. Maybe tool calls start timing out. Maybe your mock system can't handle the throughput. You stress test deliberately to find the breaking point.

Stage 6: Longitudinal Testing (Ongoing). You run the same tests over time with different models, different parameters, different data. You capture how behavior changes. You identify regressions. You track improvements.

This progression is normal. You're not expected to have comprehensive tests on day one. You build them as you learn what breaks. The key is being intentional about it—each stage of testing teaches you something new about your agent.

Test Maintenance as Ongoing Investment

Here's something that catches many teams: tests require maintenance. When you update your agent's system prompt, some tests might fail. When you upgrade Claude, behavior might change subtly. When you add a new tool, new tests are needed. Some teams treat test failures as "test issues" and disable tests or adjust them to pass. This is a mistake.

When a test fails after you make a change, that's information. Either your change broke something (in which case, fix the change) or the test was too brittle (in which case, make the test more robust). Either way, failing tests are guiding you toward better systems.

The teams that maintain strong test suites treat test failures with respect. They investigate. They understand what changed and why. They update tests when behavior rightfully changes, and they fix code when tests reveal bugs. This discipline is what separates teams that merely have test suites from teams whose tests actually work.

Building a Testing Culture Around Agents

The practical truth is that testing culture matters as much as testing tools. You can have the best testing framework in the world, but if developers don't write tests, your test suite is useless.

Here's how mature teams build testing culture:

  • Write tests before deployment: Make it a requirement. No agent goes to production without integration tests. This isn't negotiable.
  • Show test results in code review: When reviewing agent changes, run the test suite and show results. Failed tests are blockers. Decreased coverage is a conversation starter.
  • Celebrate test improvements: When someone significantly improves test coverage or adds a test that catches a real issue, call it out. Make testing visible and valued.
  • Use tests as documentation: Your test suite documents what your agent does and how it behaves. New team members should be able to read tests and understand the system.
  • Learn from test failures: When a test fails, don't just fix it and move on. Understand why it failed. Was the test too strict? Was the agent logic wrong? Use failures as learning opportunities.

When testing becomes part of your team culture, agents become more reliable and development becomes faster. It's not complicated. It's just discipline.

The Compounding Returns of Systematic Testing

Over time, comprehensive integration testing compounds in value. Your first test suite takes effort. Your tenth test suite is faster to build because you have patterns. Your fiftieth test suite is almost automatic because the discipline is ingrained.

More importantly, the bug rate decreases over time. You've tested so many edge cases that new ones are rare. New team members make fewer mistakes because tests catch them early. Refactoring becomes safe because tests verify nothing broke. Iteration becomes fast because you're not constantly fighting production issues.

This compounding effect is why mature organizations invest heavily in testing. They know that the upfront investment pays dividends forever. A good test suite is like a permanent asset. It works for you every day, catching bugs before they reach users.

Advanced Pattern: Regression Testing for AI Systems

One pattern that works remarkably well for agent systems is regression testing specific to AI behavior. Not just "does the agent still do its core job," but "does the agent still make the same decisions when given the same input?" This is harder than traditional regression testing because agents might legitimately vary in minor ways, but the core reasoning should remain stable.

The technique is to save snapshots of agent decisions on key test cases. When you make changes to the agent or upgrade the underlying model, run the tests again. If the decisions change significantly, you've found a regression. Maybe it's acceptable—you improved something—but at least you know it changed. This visibility into behavioral change is invaluable for maintaining agent quality over time.

What "significant change" means depends on your use case. For a code review agent, a change from "has a SQL injection vulnerability" to "does not have a SQL injection vulnerability" is significant—you've changed the core analysis. A change from "I recommend parameterized queries" to "I recommend prepared statements" is stylistic. The regression test suite helps you distinguish between meaningful changes and implementation details.

This becomes especially important when Anthropic releases new Claude models. You want to upgrade to get better performance, but you need to verify you didn't accidentally change how your agent reasons. Regression tests give you that assurance. They transform model upgrades from scary "will this break everything?" moments into confident "let me verify nothing significant changed" processes.

Debugging Integration Test Failures: A Systematic Approach

When integration tests fail in the real world, debugging is harder than traditional code debugging because part of the system (Claude) is a black box. You can't step through its reasoning. You can't see intermediate states. You just see the final output that doesn't match expectations.

Here's a systematic debugging approach that works reliably. First, understand what changed. Did you update the agent's system prompt? Did you upgrade Claude? Did you change the test? Isolate the variable. If you changed three things, change one back and rerun. Repeat until you've identified which change broke what.

Second, examine the actual vs. expected output in detail. Don't just say "it's different." Identify specifically how it's different. Did Claude misunderstand the task? Is it misinterpreting the input? Is it executing the tools incorrectly? This specificity guides you toward the fix.

Third, test in isolation. If your test calls Agent A which calls Agent B which calls a tool, disable Agent B and see if the problem is in Agent A. Isolate where the failure occurs. Many integration test failures look complex until you systematically narrow them down.

Fourth, add verbose logging to understand Claude's decision-making. What did it understand about the task? What did it think the tools do? Why did it choose that tool over another? This logging makes Claude's reasoning visible, which makes the failure obvious.

Finally, consider whether the test is too strict. Some test failures aren't bugs—they're tests that need to be more flexible. Claude might reasonably interpret ambiguous input in multiple ways. A test that requires one specific response might be too rigid. The question is: if I were using this agent, would this output be wrong, or just different from what I specified? If it's just different, maybe relax the test. If it's wrong, fix the agent.

Cost Optimization for Integration Tests

Here's a practical concern: running comprehensive integration tests against Claude for every change gets expensive. Each test call is an API call with associated costs. Multiply by dozens or hundreds of tests, and test costs become significant.

Smart teams use a tiered testing approach. Fast, cheap tests run frequently—snapshot tests, unit tests, mock-based tests. These are nearly free and catch obvious issues. Slower, more expensive integration tests against the actual Claude API run less frequently—maybe once per day or before deployment. Stress tests and comprehensive multi-agent tests run weekly or before releases.

This tiered approach balances cost and quality. You're not running expensive tests constantly. You're running them strategically, at key decision points. You verify code works cheaply and often, verify integration works less frequently but thoroughly.
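A tier gate can be as simple as a function mapping CI triggers to the test tiers they run. The tier names and trigger events below are hypothetical, not part of any SDK or CI system:

```typescript
// Tier gate sketch: decide which test tiers run for a given CI trigger.
// Tier names and trigger events are hypothetical.
type Tier = "mock" | "live" | "stress";

function tiersFor(trigger: "commit" | "nightly" | "release"): Tier[] {
  switch (trigger) {
    case "commit":
      return ["mock"]; // cheap, deterministic — runs on every push
    case "nightly":
      return ["mock", "live"]; // real API calls once a day
    case "release":
      return ["mock", "live", "stress"]; // everything before shipping
  }
}

console.log(tiersFor("nightly")); // [ "mock", "live" ]
```

The CI pipeline reads the trigger, asks the gate which tiers apply, and skips the expensive suites on routine commits.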

Another cost optimization is test data reuse. Don't spin up new test scenarios for each test. Share test data across multiple assertions. One Claude response can be validated against multiple expectations. This reduces API calls without reducing coverage.

The third optimization is batching. Instead of running tests sequentially, run them in parallel. If you have test infrastructure that calls Claude in batches, you can dramatically reduce latency and sometimes get volume discounts.

For most organizations, test cost becomes a non-issue once you factor in the value of prevented production issues. A single bug in production might require expensive manual incident response, lost customer trust, and corrective work. By contrast, comprehensive testing costs are a fraction of that. But early on, teams need to understand the cost tradeoff and optimize accordingly.

Building Integration Tests for Multi-Agent Systems

When you have multiple agents working together, integration testing becomes more complex. You need to test not just individual agents but their interactions. This requires thinking carefully about what can go wrong at the boundaries between agents.

The pattern is to treat inter-agent communication as a separate test concern. Test the contract between Agent A and Agent B. Does Agent A's output match what Agent B expects as input? Does Agent B correctly interpret Agent A's output? Does Agent B handle cases where Agent A fails?

These boundary tests are where you'll find subtle bugs. Agent A might return results that are technically correct but structured differently from what Agent B expects. Agent B might assume all results from Agent A are present, missing the case where A returns a partial failure. These aren't single-agent bugs—they're coordination bugs that only appear when you test the system end-to-end.

The practical approach is to create fixtures that represent the boundary. "When Agent A returns this type of output, does Agent B handle it correctly?" Test that. "When Agent A fails partway, does Agent B recover gracefully?" Test that. By systematically testing the boundaries, you catch coordination bugs before they reach production.
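Here's a minimal sketch of such a boundary test: a fixture type describes what Agent A may hand to Agent B, including the partial-failure case, and the test verifies B handles each shape. The types, field names, and handler are all hypothetical:

```typescript
// Boundary contract sketch: fixtures describe what Agent A may hand to
// Agent B, including partial failure. Shapes and names are hypothetical.
type AgentAOutput =
  | { status: "complete"; records: string[] }
  | { status: "partial"; records: string[]; failedSources: string[] };

function agentBHandles(input: AgentAOutput): string {
  // Agent B must not assume all records are present.
  const note =
    input.status === "partial"
      ? ` (warning: ${input.failedSources.length} sources failed)`
      : "";
  return `analyzed ${input.records.length} records${note}`;
}

// Happy-path fixture and partial-failure fixture, exercised at the boundary.
console.log(agentBHandles({ status: "complete", records: ["a", "b"] }));
// analyzed 2 records
console.log(agentBHandles({
  status: "partial",
  records: ["a"],
  failedSources: ["newsapi"],
}));
// analyzed 1 records (warning: 1 sources failed)
```

The discriminated union forces Agent B's handler to acknowledge the partial case at compile time, which is exactly the class of coordination bug these tests exist to catch.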

Documentation Through Tests

One underappreciated benefit of comprehensive integration tests is that they serve as documentation. When a new team member joins, they can read the tests to understand how agents actually behave. Unlike the system prompt, tests show real input-output pairs. Unlike architecture documentation, tests show how agents actually interact.

Make your test names clear and descriptive. test_agent_handles_empty_input_gracefully is more informative than test_empty. Group related tests with clear section comments. Use fixtures that document what the test data represents. Over time, your test suite becomes a living specification of how your agents behave.

Some teams even generate documentation from their tests. They extract test cases, annotate them with explanations, and publish them as examples. This keeps documentation in sync with tests because they're the same source.

The Path to Test Maturity

No team starts with a perfect test suite. The path to maturity looks like this: You start by testing the happy path because that's what's easiest. Then you discover edge cases and add tests for them. Then you discover integration issues and add multi-agent tests. Then you experience production issues and add regression tests to prevent recurrence. Then you stress test because you realize scale matters. Then you build tooling to make testing easier because you realize scale matters at the testing level too.

This progression is normal and expected. Don't feel bad about starting with simple tests. Feel bad if you're still starting with simple tests six months later without having expanded them. The key is progression. Each phase teaches you what to test next.

Metrics That Matter for Agent Testing

If you're going to build a strong testing culture, what should you measure? Not just "test pass rate"—that's a lagging indicator. Instead, focus on these leading indicators: code coverage of agent logic, variety of test inputs, number of edge cases covered, distribution of test types (happy path, error, boundary, integration, stress).

Track how long bugs survive in production before being caught. If you catch bugs in testing, the time is zero. If bugs reach users, that's a real cost. Over time, as your testing improves, this metric should trend to zero.

Track how many production issues would have been caught by existing tests. When something breaks in production, did you have a test that would have caught it? If not, add one. This turns production incidents into opportunities to improve your test suite.

Track iteration speed. Teams with weak tests ship slowly. Teams with strong tests ship fast. If your velocity is degrading over time, it might be because your tests aren't keeping up with your code. Conversely, if velocity is increasing, tests are working.

These metrics help you understand whether your testing investment is paying off. Not in abstract terms, but in real improvements to reliability, speed, and confidence.


-iNet
