April 16, 2025
Claude AI Performance Development

Agent SDK: Performance Tuning and Optimization

Speed matters. When users interact with your agent, they expect sub-second responses. When agents orchestrate complex workflows, they expect efficient execution. But Claude Code agents handle asynchronous operations, manage context across turns, and execute tools—all of which introduce latency if not optimized carefully.

This guide covers practical performance optimization strategies for the Agent SDK. We'll tackle time-to-first-token, reducing redundant tool calls, managing context windows efficiently, implementing caching layers, and profiling agents to find bottlenecks. By the end, you'll know exactly where your agents are slow and how to fix it.

Table of Contents
  1. The Performance Baseline: Where Time Actually Goes
  2. Why Profiling Matters More Than Guessing
  3. The Context Window Trap
  4. The Hidden Impact of Perceived Performance
  5. Reducing Time-to-First-Token
  6. Optimizing Tool Call Frequency
  7. The Cost of Tool Call Overhead
  8. The Latency Budget: Thinking in Percentiles
  9. Managing Context Windows Efficiently
  10. Implementing Caching Strategies
  11. Profiling Agents to Find Bottlenecks
  12. Batch Tool Execution
  13. Connection Pooling and Reuse
  14. Model Selection for Different Use Cases
  15. Real-World Performance Case Study
  16. Before Optimization
  17. Problems Identified
  18. Optimizations Applied
  19. After Optimization
  20. Performance Monitoring in Production
  21. Advanced: Adaptive Model Selection
  22. Key Takeaways
  23. The Deeper Patterns in Agent Performance
  24. When Optimization Becomes Over-Optimization
  25. Building Observability from Day One
  26. Common Performance Optimization Mistakes
  27. Real-World Scenario: The Slow Code Review Agent
  28. Production Patterns for Performance

The Performance Baseline: Where Time Actually Goes

Before optimizing, understand where time actually goes. When you invoke Claude Code with a complex request, the elapsed time isn't just API latency. It's a combination of several factors, and understanding the breakdown is critical to optimizing effectively:

  1. Network latency (typically 100-300ms): Time for request to reach Anthropic's servers and back.
  2. Prompt processing (typically 50-200ms): Claude processes your system prompt and context.
  3. Token generation (depends on response length): Claude generates tokens (roughly 50-100ms per 100 tokens).
  4. Tool execution (highly variable): Your tools run and return results. This is often the biggest variable.
  5. Response parsing (typically 10-50ms): SDK parses Claude's response.

The biggest optimization opportunities are usually in tool execution and context management, not in Claude itself. Claude is already fast. Your tools and context are where you can squeeze the most performance gains.

Why Profiling Matters More Than Guessing

The biggest performance optimization mistake is optimizing something that doesn't matter. You look at your agent's average latency (8 seconds) and immediately think "the API calls are too slow!" You optimize the API calls and reduce them from 2 seconds to 1 second. But then you measure again and the agent is still 8 seconds. Turns out the API was only 2 seconds of the 8. The other 6 seconds were tool execution (parsing JSON, calling external services, loading from database) and context management (copying large context objects around).

You've wasted engineering effort on something that barely matters. This is why profiling is non-negotiable. You need data. You need to measure before you optimize. The ProfiledAgent class shown later in this guide does exactly this—it instruments every phase of execution so you see exactly where time goes.

When you have data, optimization becomes strategic. You see that tool X takes 3 seconds (40% of total), tool Y takes 1 second (12% of total). You now know to focus on tool X first. You implement caching for tool X, cut it to 0.2 seconds on cache hits, and total latency drops by nearly 40%. That's data-driven optimization.

Without profiling, you're guessing. With profiling, you're targeting.

The Context Window Trap

Here's a subtle performance trap that many teams don't see: as you build agents that interact for longer, context windows grow. You've had 10 back-and-forth exchanges with Claude. That's 10k tokens of history. Claude spends time processing all that history before answering your new question. More tokens in = more processing time.

The obvious solution is "prune old messages." But that's too aggressive. Sometimes Claude needs history from turn 3 to understand turn 11. What you really want is intelligent context management: keep the messages Claude needs, discard the ones it doesn't. You might keep the original task description and recent messages, but summarize or discard intermediate exploratory messages.

This requires understanding what Claude actually uses from context. Some of this is heuristic (recent messages are probably more useful), but you can also ask Claude to summarize long contexts: "Summarize our conversation so far in 200 words, keeping only the decision points and findings." Now instead of 5k tokens of history, you have 200 tokens of summary. Claude can still understand where you are, but processing is faster.
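That summarization step can be sketched as follows. This is a minimal sketch, assuming a hypothetical agent object with a `send` method like the ones used in the examples in this guide:

```typescript
// Sketch: once history grows past maxTurns, compress the older turns into a
// short summary and keep only the recent turns verbatim.
async function compressHistory(
  agent: { send: (msg: string) => Promise<{ text: string }> },
  history: string[],
  maxTurns = 6,
): Promise<string[]> {
  if (history.length <= maxTurns) return history;

  const old = history.slice(0, history.length - maxTurns);
  const recent = history.slice(history.length - maxTurns);

  // Ask Claude to compress the old turns, keeping only what matters
  const summary = await agent.send(
    `Summarize our conversation so far in 200 words, keeping only the ` +
      `decision points and findings:\n\n${old.join("\n")}`,
  );

  return [`[Summary of earlier conversation] ${summary.text}`, ...recent];
}
```

The trade-off is one extra (cheap, short-output) model call in exchange for every subsequent turn processing hundreds of tokens instead of thousands.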

The Hidden Impact of Perceived Performance

Before we dive into optimization techniques, let's understand why perception matters so much. Research in human-computer interaction shows that users' subjective experience of performance is often disconnected from actual performance. A system that takes 5 seconds total can feel faster than one that takes 3 seconds, if the first shows something after 0.5 seconds while the second shows nothing until the end.

This is called perceived latency, and it's crucial in agent design. Users don't care that your agent's total time is 5 seconds if they see nothing for those 5 seconds. But if they see "Claude is analyzing..." immediately, then "Found 3 issues..." after 2 seconds, then more details after 4 seconds, the same 5-second task feels much faster and more responsive.

This has implications for how you design agents. A streaming response feels better than a non-streaming response even if they take the same total time. Progress indicators matter. Showing intermediate results matters. Silence kills perceived performance; feedback improves it.

Reducing Time-to-First-Token

Time-to-first-token (TTFT) is the elapsed time from request to receiving the first token of Claude's response. Lower TTFT creates the illusion of responsiveness even if total time is long. Users perceive a system as faster when they see content appearing immediately, even if the full response takes longer. It's a psychological effect that matters more than you'd think.

Optimize TTFT with streaming and parallel operations. Instead of waiting for complete response:

typescript
import { Agent, StreamingAgent } from "@anthropic-ai/agent-sdk";
 
// Instead of waiting for complete response
const agent = new Agent({
  apiKey: process.env.ANTHROPIC_API_KEY,
  model: "claude-3-5-sonnet",
});
 
// Use streaming to get first tokens faster
const streamingAgent = new StreamingAgent({
  apiKey: process.env.ANTHROPIC_API_KEY,
  model: "claude-3-5-sonnet",
  onToken: (token) => {
    // Process token immediately
    console.log(token);
    // Send to client immediately if web socket is open
    sendToClient(token);
  },
});
 
// Instead of:
const response = await agent.send("Analyze this");
 
// Do:
for await (const token of streamingAgent.sendStream("Analyze this")) {
  // Handle each token as it arrives
  updateUI(token);
}

Streaming reduces perceived latency significantly. Users see responses appearing in real-time rather than waiting for complete computation. The total time might be the same, but it feels much faster.

Optimizing Tool Call Frequency

Tools are expensive—they require serialization, network calls, parsing, and execution. Every unnecessary tool call costs time and tokens. Think about tool calls as your most expensive operation.

Reduce tool calls with better prompting and intelligent caching. Tell Claude explicitly that tools are expensive:

typescript
const agent = new Agent({
  apiKey: process.env.ANTHROPIC_API_KEY,
  model: "claude-3-5-sonnet",
  systemPrompt: `You are an efficient research assistant.
 
IMPORTANT: Before calling a tool, check if you already know the answer from previous messages.
Only call tools when you genuinely need new information.
If you're repeating a query, reference the previous result instead.
 
Tools available:
- search: Search for information (EXPENSIVE, use sparingly)
- calculate: Perform calculations (FAST, use freely)
- cache_lookup: Check if we've already researched this topic`,
 
  tools: [
    {
      name: "cache_lookup",
      description: "Check if this topic was already researched",
      handler: async (input: { topic: string }) => {
        return {
          found: cache.has(input.topic),
          cachedResult: cache.get(input.topic),
        };
      },
    },
    {
      name: "search",
      description: "Search for information",
      handler: async (input: { query: string }) => {
        // Actual search that hits external API
        const result = await externalSearchAPI(input.query);
        cache.set(input.query, result); // Cache for next time
        return result;
      },
    },
  ],
});

The system prompt explicitly tells Claude to avoid redundant tool calls. This reduces calls significantly. Claude learns from these instructions and respects them.

The Cost of Tool Call Overhead

Here's what people miss about tool optimization: the overhead of calling a tool isn't just the tool execution time. It includes:

  • Serialization overhead (5-10ms): Converting input to JSON and back
  • Network round-trip (100-300ms): Sending request to server, getting response
  • Parsing overhead (10-20ms): Parsing the response
  • Context overhead (10-50ms): Adding tool results back to Claude's context

So a tool that "should take 50ms" actually takes 200-400ms when you add in all the overhead. A tool that takes 1 second actually takes 1.2-1.5 seconds.

This is why avoiding unnecessary tool calls is so important. Every tool call you eliminate saves 200+ milliseconds of overhead on top of the execution time itself. If you call the same search tool 10 times in a conversation for the same thing, and each call costs about 2 seconds (execution plus overhead), the 9 redundant calls waste roughly 18 seconds.

The system prompt approach helps, but you can go further. Implement a tool call deduplication layer at the SDK level:

typescript
class DedupingAgent {
  private toolCallHistory = new Map<string, any>();
 
  private getToolCallKey(toolName: string, input: any): string {
    return `${toolName}:${JSON.stringify(input)}`;
  }
 
  async executeWithDedup(
    toolName: string,
    input: any,
    handler: Function,
  ): Promise<any> {
    const key = this.getToolCallKey(toolName, input);
 
    // If we've called this exact tool with these exact inputs before, return cached result
    if (this.toolCallHistory.has(key)) {
      console.log(`Tool call dedup hit: ${key}`);
      return this.toolCallHistory.get(key);
    }
 
    console.log(`Tool call dedup miss: ${key}`);
    const result = await handler(input);
    this.toolCallHistory.set(key, result);
    return result;
  }
}

This layer sits between Claude and your tools. If Claude tries to call search("typescript best practices") twice in one conversation, the second call gets the cached result from the first call instead of hitting the API again. Claude doesn't even know it's being deduplicated—it gets a result either way. But you've saved the tool execution time and overhead on the second call.

The Latency Budget: Thinking in Percentiles

Before you start optimizing, you need to understand your latency budget. What's acceptable? For different use cases, acceptable latency is different:

  • Interactive web UI: users expect under 500ms for perceived responsiveness, under 2s for an acceptable wait
  • Chat interfaces: users expect the first token under 1s, total response under 10s
  • Background jobs: users expect completion within hours
  • API endpoints: depends on SLA, typically under 200ms for real-time, under 5s for async

The key insight is thinking in percentiles. Your average response time might be 500ms, but your P99 (worst 1% of requests) might be 5 seconds. Which percentile matters?

If you're building an interactive UI, P50 matters (the typical user experience) but P99 also matters (it doesn't help to be fast for 99% of users if the other 1% get a terrible experience). If you're building batch processing, P99 might not matter—you care about P95 and total throughput.

Define your latency SLA: "95% of requests under 1 second" or "P99 under 5 seconds." Once you've defined it, optimize toward that percentile. Don't waste effort optimizing P50 if your SLA is about P99.

This is subtle but important. You might optimize a feature that improves average latency by 100ms, but doesn't improve P99 latency at all. If your SLA is about P99, you've wasted effort. Focus on what your SLA actually demands.
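Computing those percentiles from your collected latency samples is straightforward. A minimal sketch using the nearest-rank method:

```typescript
// Compute the p-th percentile of latency samples (nearest-rank method).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return 0;
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest rank: smallest value such that at least p% of samples are <= it
  const idx = Math.min(
    sorted.length - 1,
    Math.max(0, Math.ceil((p / 100) * sorted.length) - 1),
  );
  return sorted[idx];
}

// A sample distribution: mostly fast, one slow outlier
const latencies = [120, 95, 110, 480, 130, 105, 4900, 115, 125, 100];
console.log(`P50=${percentile(latencies, 50)}ms`);
console.log(`P99=${percentile(latencies, 99)}ms`);
```

Note how different P50 and P99 look for the same data: the median is comfortable while the tail is dominated by the outlier. This is exactly why averages hide the experience your SLA cares about.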

Managing Context Windows Efficiently

Large context windows hurt performance—more tokens to process means slower responses. Keep context minimal and intentional:

typescript
interface ConversationTurn {
  role: "user" | "assistant";
  content: string;
  estimatedTokens: number;
}
 
class EfficientAgent {
  private conversationHistory: ConversationTurn[] = [];
  private maxContextTokens = 8000; // Leave room for new response
 
  async send(message: string) {
    this.conversationHistory.push({
      role: "user",
      content: message,
      estimatedTokens: this.estimateTokens(message),
    });
 
    // Prune old messages if context getting too large
    this.pruneContext();
 
    const response = await this.agent.send(message);
 
    this.conversationHistory.push({
      role: "assistant",
      content: response.text,
      estimatedTokens: this.estimateTokens(response.text),
    });
 
    return response;
  }
 
  private pruneContext() {
    let totalTokens = this.conversationHistory.reduce(
      (sum, turn) => sum + turn.estimatedTokens,
      0,
    );
 
    // If context is too large, remove oldest turns
    while (
      totalTokens > this.maxContextTokens &&
      this.conversationHistory.length > 2
    ) {
      const removed = this.conversationHistory.shift();
      totalTokens -= removed!.estimatedTokens;
    }
  }
 
  private estimateTokens(text: string): number {
    // Rough estimate: 1 token per 4 characters
    return Math.ceil(text.length / 4);
  }
}

Pruning old messages keeps context windows tight without losing critical context. This is especially important for long-running agents that accumulate history over time.

Implementing Caching Strategies

Cache tool results to avoid redundant executions. This is often the single biggest performance win:

typescript
import NodeCache from "node-cache";
 
class CachingAgent {
  private toolCache = new NodeCache({ stdTTL: 3600 }); // 1 hour TTL
 
  private createCachedTool(toolName: string, handler: Function) {
    return async (input: any) => {
      // Create cache key from tool name and input
      const cacheKey = `${toolName}:${JSON.stringify(input)}`;
 
      // Check cache first (NodeCache returns undefined on a miss)
      const cached = this.toolCache.get(cacheKey);
      if (cached !== undefined) {
        console.log(`Cache hit for ${toolName}`);
        return cached;
      }
 
      // Execute tool if not cached
      console.log(`Cache miss for ${toolName}, executing...`);
      const startTime = Date.now();
      const result = await handler(input);
      const elapsed = Date.now() - startTime;
 
      // Log cache miss metrics
      console.log(`Tool ${toolName} took ${elapsed}ms`);
 
      // Store in cache
      this.toolCache.set(cacheKey, result);
 
      return result;
    };
  }
 
  setupCachedTools() {
    return [
      {
        name: "search",
        description: "Search for information",
        handler: this.createCachedTool(
          "search",
          async (input: { query: string }) => {
            // Actual search implementation
            return await performSearch(input.query);
          },
        ),
      },
      {
        name: "fetch_data",
        description: "Fetch structured data",
        handler: this.createCachedTool(
          "fetch_data",
          async (input: { url: string }) => {
            // Actual data fetch
            return await fetchData(input.url);
          },
        ),
      },
    ];
  }
}

Caching is especially effective for read-only operations like searches and data fetches. A 1-second operation hitting a 1-hour cache has dramatic performance implications for agents that process similar queries repeatedly.

Profiling Agents to Find Bottlenecks

You can't optimize what you don't measure. Profile your agents to identify slow paths:

typescript
class ProfiledAgent {
  private metrics = {
    apiLatency: [] as number[],
    toolExecutionTime: new Map<string, number[]>(),
    contextSize: [] as number[],
  };
 
  async send(message: string) {
    const startTime = Date.now();
    const apiStart = Date.now();
 
    const response = await this.agent.send(message);
 
    const apiLatency = Date.now() - apiStart;
    this.metrics.apiLatency.push(apiLatency);
 
    // Process tool calls
    if (response.toolCalls) {
      for (const toolCall of response.toolCalls) {
        const toolStart = Date.now();
        const result = await this.executeTool(toolCall);
        const toolTime = Date.now() - toolStart;
 
        if (!this.metrics.toolExecutionTime.has(toolCall.name)) {
          this.metrics.toolExecutionTime.set(toolCall.name, []);
        }
        this.metrics.toolExecutionTime.get(toolCall.name)!.push(toolTime);
      }
    }
 
    const totalTime = Date.now() - startTime;
    return { response, totalTime };
  }
 
  getMetricsSummary() {
    const apiLatencies = this.metrics.apiLatency;
    const avgApiLatency =
      apiLatencies.reduce((a, b) => a + b, 0) / apiLatencies.length;
 
    const toolMetrics: any = {};
    for (const [toolName, times] of this.metrics.toolExecutionTime) {
      const avgTime = times.reduce((a, b) => a + b, 0) / times.length;
      toolMetrics[toolName] = {
        calls: times.length,
        avgTime: Math.round(avgTime),
        maxTime: Math.max(...times),
        minTime: Math.min(...times),
      };
    }
 
    return {
      apiLatency: {
        avg: Math.round(avgApiLatency),
        max: Math.max(...apiLatencies),
        min: Math.min(...apiLatencies),
      },
      toolMetrics,
      totalRequests: apiLatencies.length,
    };
  }
 
  printProfile() {
    const summary = this.getMetricsSummary();
    console.log("\n=== Agent Performance Profile ===");
    console.log(
      `API Latency: avg=${summary.apiLatency.avg}ms, max=${summary.apiLatency.max}ms`,
    );
    console.log("\nTool Execution Times:");
    for (const [tool, metrics] of Object.entries(summary.toolMetrics)) {
      console.log(
        `  ${tool}: ${metrics.avgTime}ms avg (${metrics.calls} calls)`,
      );
    }
  }
}

Run this profiler on your agent workflows to see where time is actually spent. Often you'll find one tool is much slower than others, or API latency is consistently higher than expected. Data-driven optimization beats guessing.

Batch Tool Execution

If your agent calls multiple independent tools, execute them in parallel:

typescript
async executeBatchTools(toolCalls: ToolCall[]) {
  // Instead of sequential: await tool1; const result1 = ...; await tool2; ...
 
  // Do parallel:
  const results = await Promise.all(
    toolCalls.map(toolCall =>
      this.executeTool(toolCall).catch(error => ({
        toolName: toolCall.name,
        error: error.message
      }))
    )
  );
 
  return results;
}

This parallelizes independent tool executions, reducing total time. If you have three tools that each take 500ms, sequential execution takes 1500ms. Parallel execution takes 500ms. The difference is dramatic.

Connection Pooling and Reuse

Reuse connections instead of creating new ones per request:

typescript
import { Agent as HTTPAgent } from "http";
import { Agent as HTTPSAgent } from "https";
 
const httpAgent = new HTTPAgent({
  keepAlive: true,
  keepAliveMsecs: 30000,
  maxSockets: 50,
  maxFreeSockets: 10,
});
 
const httpsAgent = new HTTPSAgent({
  keepAlive: true,
  keepAliveMsecs: 30000,
  maxSockets: 50,
  maxFreeSockets: 10,
});
 
// Use these agents for all requests
const agent = new Agent({
  apiKey: process.env.ANTHROPIC_API_KEY,
  model: "claude-3-5-sonnet",
  httpAgent,
  httpsAgent,
});

Connection pooling reduces overhead for repeated requests. Instead of establishing a new TCP connection for each request, you reuse existing connections. This saves hundreds of milliseconds per request at scale.

Model Selection for Different Use Cases

Choose the right model for the job. Not every task needs your most capable (and slowest) model:

typescript
class SmartAgent {
  selectModel(complexity: "low" | "medium" | "high") {
    switch (complexity) {
      case "low":
        return "claude-3-haiku"; // Fast, cheap
      case "medium":
        return "claude-3-sonnet"; // Balanced
      case "high":
        return "claude-3-opus"; // Most capable
    }
  }
 
  async send(message: string, complexity: "low" | "medium" | "high") {
    const model = this.selectModel(complexity);
    const agent = new Agent({
      apiKey: process.env.ANTHROPIC_API_KEY,
      model,
    });
    return agent.send(message);
  }
}

Using Claude 3 Haiku for simple tasks (classification, extraction, basic analysis) saves latency and costs. Save your more powerful models for genuinely complex reasoning.

Real-World Performance Case Study

Let's walk through optimizing a real agent that was too slow. Understanding how optimization compounds is powerful for your own work.

Before Optimization

Initial agent response time: ~2500ms
- API latency: 800ms
- Tool execution (search): 1200ms
- Tool execution (summarize): 400ms
- Response parsing: 100ms

Problems Identified

  1. Search tool called twice for the same query (no caching)
  2. No streaming—users wait for complete response
  3. Full conversation history sent every request (unnecessary context)
  4. Redundant tool invocations due to vague prompting

Optimizations Applied

1. Add caching for search results.
   Result: search time reduced from 1200ms to 50ms on cache hits; 60% of user queries repeat within 5 minutes.

2. Implement streaming.
   Result: perceived latency reduced dramatically; users see the first token in 300ms instead of waiting 2500ms.

3. Prune conversation history.
   Result: API latency reduced from 800ms to 200ms; context size halved from 12k tokens to 6k tokens.

4. Reduce tool calls with better prompts.
   Result: 40% fewer tool invocations; average request time 1500ms → 900ms.

After Optimization

Optimized agent response time: ~900ms (64% improvement)
- API latency: 200ms (75% faster)
- Tool execution: 500ms (69% faster, mostly cache hits)
- Response parsing: 50ms
- Perceived latency (TTFT): 300ms (88% faster)

Performance Monitoring in Production

Set up alerts for performance regressions. You don't want to wake up to discover your agents have slowed down:

typescript
// Set performance budgets
const PERFORMANCE_BUDGETS = {
  apiLatency: 800, // ms
  totalLatency: 3000, // ms
  toolExecutionAvg: 500, // ms
  cacheHitRate: 0.5, // 50%
};
 
// Monitor continuously
setInterval(() => {
  const metrics = agent.getMetricsSummary();
 
  if (metrics.apiLatency.avg > PERFORMANCE_BUDGETS.apiLatency) {
    alertSlack("Agent API latency exceeding budget");
  }
 
  if (metrics.cacheHitRate < PERFORMANCE_BUDGETS.cacheHitRate) {
    alertSlack("Cache hit rate below threshold");
  }
}, 60000); // Check every minute
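The check above assumes a `cacheHitRate` metric that the earlier profiler doesn't yet track. A minimal counter sketch (names are illustrative) that could feed that budget check:

```typescript
// Sketch: track cache hits and misses so the monitor can compute a hit rate.
class CacheStats {
  private hits = 0;
  private misses = 0;

  recordHit() {
    this.hits++;
  }

  recordMiss() {
    this.misses++;
  }

  // Hit rate over all lookups so far; 0 when nothing has been recorded yet
  get hitRate(): number {
    const total = this.hits + this.misses;
    return total === 0 ? 0 : this.hits / total;
  }
}
```

Call `recordHit()` and `recordMiss()` from the caching wrapper's hit and miss branches, and expose `hitRate` from your metrics summary.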

Advanced: Adaptive Model Selection

Choose models based on query complexity. Estimate complexity and use the minimum capable model:

typescript
class AdaptiveAgent {
  async send(message: string) {
    // Estimate complexity of query
    const complexity = this.estimateComplexity(message);
 
    const model =
      complexity < 3
        ? "claude-3-haiku"
        : complexity < 6
          ? "claude-3-sonnet"
          : "claude-3-opus";
 
    const agent = new Agent({
      apiKey: process.env.ANTHROPIC_API_KEY,
      model,
    });
 
    return agent.send(message);
  }
 
  estimateComplexity(message: string): number {
    // Simple heuristic: longer messages = more complex
    const wordCount = message.split(" ").length;
    const toolMentions = (message.match(/tool|search|fetch|calculate/gi) || [])
      .length;
    return Math.min(10, Math.floor((wordCount + toolMentions * 2) / 5));
  }
}

Key Takeaways

Performance optimization for agents requires understanding where time actually goes, then targeting the biggest wins. Profile your agents to identify bottlenecks. Optimize tool calls through intelligent caching and better prompting. Stream responses to reduce perceived latency. Manage context windows to keep processing fast. Parallelize independent operations.

The most impactful optimizations usually come from:

  1. Reducing unnecessary tool calls (biggest win, often 30-40% improvement)
  2. Implementing tool result caching (often 50-80% reduction on repeated queries)
  3. Streaming responses for perceived speed (dramatic UX improvement)
  4. Managing context windows efficiently (20-30% reduction in API latency)
  5. Profiling and measuring actual performance (essential for identifying real bottlenecks)

Claude Code makes this practical because you can measure and optimize the actual SDK behavior. You're not guessing—you're profiling real workflows and fixing what's slow. The data drives your decisions.

The Deeper Patterns in Agent Performance

Understanding agent performance requires thinking about what's actually happening under the hood. When you send a message to Claude, you're not just transmitting text. You're asking the model to think through the problem, decide if tools are needed, formulate exactly what information it wants, and synthesize all responses into an answer. Each step has latency implications.

The hidden layer here is recognizing that perceived performance and actual performance are sometimes different things. A streaming agent might take the same total time as a non-streaming agent, but users perceive it as much faster because they see content appearing. This isn't an illusion—it's genuinely better UX. The user can start reading while computation continues, which is psychologically more satisfying than staring at a loading spinner.

The same principle applies to error handling. A slow error message is worse than no message, because the user gets no feedback about what's happening. If your agent catches an error, immediately surface something to the user (streaming a partial response, showing what it tried, explaining why it failed) rather than waiting for the perfect response.

When Optimization Becomes Over-Optimization

There's a trap in performance optimization: you can spend enormous effort squeezing another 50ms of latency when that time has zero impact on user perception. Your P99 latency might be 500ms when your P50 is 100ms. It's tempting to optimize P99, but if 99% of your users are fine with 100ms, optimizing for the 1% is low ROI.

Focus on the percentiles that matter. If your SLA is "95% of requests under 1 second," optimize P95. If it's "all requests under 3 seconds," optimize P100 (worst-case). Most of the time, P95-P99 is where the business value lives.

Also recognize when network latency becomes your hard ceiling. If you're calling Anthropic's API across the Atlantic, you've got a floor of maybe 300-400ms just for the round trip. No amount of local optimization gets you below that. The wins come from reducing the number of round trips (fewer tool calls, better batching) not from making individual trips faster.

Building Observability from Day One

The most successful teams instrument their agents from the beginning. Not extensive logging that drowns you in data, but strategic metrics that tell you what matters. Log the fields from that ProfiledAgent example above. Set up dashboards showing average latency, cache hit rate, tool execution time, error rate. When performance changes, you'll know immediately what changed.

Observability also means being able to replay failures. If a user reports "the agent was slow," can you reconstruct what happened? What was the query? Which tools ran? How long did each take? If you're not logging enough context, you're just guessing at the problem.
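A minimal per-request trace record makes that reconstruction possible. A sketch, with hypothetical field names:

```typescript
// Sketch: one structured trace record per agent request—enough context to
// replay a slow request after the fact.
interface RequestTrace {
  requestId: string;
  query: string;
  startedAt: number; // epoch ms
  totalMs: number;
  toolCalls: { name: string; input: string; durationMs: number }[];
  cacheHits: number;
  error?: string;
}

const traces: RequestTrace[] = [];

function recordTrace(trace: RequestTrace) {
  traces.push(trace);
  // In production, ship this to your logging backend instead of keeping it in memory
  console.log(JSON.stringify(trace));
}
```

With records like this, "the agent was slow" becomes answerable: find the trace by request ID, look at which tool dominated `totalMs`, and check whether the cache was hit.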

Common Performance Optimization Mistakes

Teams trying to optimize agent performance often make the same mistakes. Understanding these pitfalls helps you avoid them.

Mistake 1: Optimizing the Wrong Thing
You see 5 seconds of latency and immediately try to speed up the API calls. But maybe the API calls are only 1 second—the other 4 seconds are tool execution. You've wasted effort optimizing something that barely matters. Profile first. Optimize second.

Mistake 2: Assuming Determinism
You optimize an agent's latency from 2 seconds to 1 second. Congratulations. But Claude is non-deterministic. Sometimes it uses different tools. Sometimes it needs to call tools multiple times. Your "1 second" result might become 3 seconds on a different query. Optimize for the median case, not the outlier.

Mistake 3: Breaking Functionality for Speed
You remove context to speed things up. Now the agent doesn't understand the domain as well and makes worse decisions. You've traded quality for speed. That's almost never the right trade. Users prefer slow and correct to fast and broken.

Mistake 4: Not Thinking About Cost
You spend engineering time saving 100ms of latency, costing the team $2000 in salary time. Meanwhile, Claude Code's API cost for that 100ms is $0.02. You've got the ROI backwards. Sometimes the right optimization is "spend more on the API, save time on engineering."

Mistake 5: Over-Caching
You cache everything to avoid API calls. But now your cache is stale. The agent gives users outdated information because it's serving month-old cache results. Caching is good; mindless caching is bad.

Real-World Scenario: The Slow Code Review Agent

Let me walk you through a real optimization story. This is a composite from several teams, but the pattern is universal.

A company built a code review agent using the Agent SDK. It reviewed PRs for security, performance, and architectural issues. It worked great but was slow—averaging 8-10 seconds per PR. Users complained. The team decided to optimize.

Initial Profiling:

  • API latency: 2 seconds average
  • Security tool execution: 1.5 seconds (network call to security scanner)
  • Performance tool execution: 3.5 seconds (analyzing large codebases)
  • Response parsing: 0.2 seconds
  • Other overhead: 1.3 seconds

Total: ~8.5 seconds. The obvious target was the performance tool—it was 40% of the latency.

But wait—they dug deeper. The performance tool was calling an external static analysis service that analyzed the entire codebase every time. Unnecessary. The company had a pre-computed performance baseline for the codebase. They changed the tool to load the baseline (cached locally, 50ms) and just diff against it. Result: performance tool went from 3.5 seconds to 0.2 seconds.

New breakdown:

  • API latency: 2 seconds
  • Security tool: 1.5 seconds
  • Performance tool: 0.2 seconds (cache hit)
  • Response parsing: 0.2 seconds
  • Other: 1.3 seconds

Total: ~5.2 seconds. Still too slow. They looked at "other overhead"—what was that 1.3 seconds?

Turns out the agent was loading the entire PR context into memory for each tool call, copying huge strings around. They fixed this by passing references instead of copies. Overhead went from 1.3 seconds to 0.2 seconds.
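A toy sketch of the fix (names hypothetical): load the PR context once, and hand every tool the same object by reference instead of building a fresh "snapshot" per call.

```python
from dataclasses import dataclass

@dataclass
class PRContext:
    """Loaded once per review; every tool call should see this same object."""
    diff_lines: list[str]

def review_copying(ctx: PRContext, tools) -> list[int]:
    # anti-pattern: give each tool its own snapshot of the diff,
    # allocating a fresh list (and its contents) on every call
    return [tool(list(ctx.diff_lines)) for tool in tools]

def review_sharing(ctx: PRContext, tools) -> list[int]:
    # pattern: pass the same list by reference; nothing is duplicated
    return [tool(ctx.diff_lines) for tool in tools]
```

Both produce identical results; only the allocation behavior differs, which is exactly why this kind of overhead hides until you profile memory or wall time.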

After two optimizations:

  • API latency: 2 seconds
  • Security tool: 1.5 seconds (external, couldn't improve easily)
  • Performance tool: 0.2 seconds
  • Response parsing: 0.2 seconds
  • Other: 0.2 seconds

Total: ~4.1 seconds. But wait—the security tool is still 1.5 seconds. That's network latency to an external service. Could they cache that too?

They couldn't cache security results directly (they change with each PR), but they could batch security checks. Instead of scanning one PR in isolation, they scanned 5 PRs in one batch API call to the security service. Now the cost per PR is 1.5 / 5 = 0.3 seconds instead of 1.5 seconds.
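The batching pattern itself is generic. A hedged sketch, assuming the security service exposes some batch endpoint (the `scan_batch` callable here is a stand-in, not a real API): group PRs and pay the fixed network cost once per group instead of once per PR.

```python
def batch_scan(prs: list[str], scan_batch, batch_size: int = 5) -> dict:
    """Scan PRs in groups of batch_size, amortizing the fixed
    round-trip cost of the external service across each group."""
    results: dict = {}
    for i in range(0, len(prs), batch_size):
        batch = prs[i:i + batch_size]
        results.update(scan_batch(batch))  # one round trip per batch
    return results
```

With a batch size of 5 and a fixed 1.5-second round trip, the amortized cost per PR drops to 0.3 seconds, matching the numbers above.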

Final optimization:

  • API latency: 2 seconds
  • Security tool: 0.3 seconds (batched)
  • Performance tool: 0.2 seconds
  • Response parsing: 0.2 seconds
  • Other: 0.2 seconds

Total: ~2.9 seconds. They'd gone from 8.5 to 2.9 seconds—66% faster.

Key insight: The biggest wins came from caching (baseline cache for performance), avoiding unnecessary work (memory copy elimination), and batching (security checks). They didn't optimize the Claude API calls themselves. Claude was already fast. The overhead around Claude was the problem.

Production Patterns for Performance

When your agent is in production serving real users, a few patterns keep performance consistent:

Pattern 1: Timeout Tiers Don't have one timeout. Have tiers:

  • Fast tier: 2 seconds max (simple analysis)
  • Normal tier: 5 seconds max (standard review)
  • Thorough tier: 10 seconds max (deep analysis)

Users choose what they need. Fast analysis for quick feedback. Thorough analysis for critical PRs. This respects user time constraints and gives you knobs to turn.
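Timeout tiers are easy to express as a small config plus a deadline wrapper. A minimal sketch using `asyncio.wait_for` (the tier names and `run_review` helper are hypothetical):

```python
import asyncio

# Hypothetical tier config: maximum seconds allowed per tier
TIMEOUT_TIERS = {"fast": 2.0, "normal": 5.0, "thorough": 10.0}

async def run_review(make_coro, tier: str = "normal"):
    """Run a review coroutine under the chosen tier's deadline.
    Raises asyncio.TimeoutError if the deadline is exceeded."""
    return await asyncio.wait_for(make_coro(), timeout=TIMEOUT_TIERS[tier])
```

On timeout you'd catch `asyncio.TimeoutError` and either return a partial result or offer the user the next tier up.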

Pattern 2: Streaming for Perceived Speed Even if total time is 5 seconds, if users see output in 0.5 seconds, it feels responsive. Stream responses as they come in. Users can start reading while computation continues.
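The effect is easy to demonstrate with a simulated producer (all names here are illustrative). Total work is identical in both cases; only time-to-first-output changes.

```python
import time

def model_chunks(n: int = 5, per_chunk_s: float = 0.02):
    """Simulated model output: one chunk every per_chunk_s seconds."""
    for i in range(n):
        time.sleep(per_chunk_s)
        yield f"chunk-{i} "

def buffered(chunks):
    """Anti-pattern: collect everything, then emit it all at once."""
    yield "".join(chunks)

def time_to_first_output(stream) -> float:
    """Seconds until the user sees the first piece of output."""
    start = time.monotonic()
    next(stream)
    return time.monotonic() - start
```

Streaming shows the first chunk after roughly one chunk's latency; buffering makes the user wait for all of them before showing anything.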

Pattern 3: Predictable Latency Users prefer consistent latency to unpredictable latency. A 3-second response every time beats 1-second average with 20-second outliers. Keep your P99 latency reasonable. If you can't, tell users what's happening ("this is taking longer than usual").

Pattern 4: SLO-Driven Optimization Define your Service Level Objective: "95% of code reviews complete in under 5 seconds." Then optimize for that SLO. Once you hit it, stop optimizing. Further optimization is low ROI. Spend time on other features users care about.
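Checking an SLO like this is a one-liner over a nearest-rank percentile. A minimal sketch (the helper names are mine, not from any SDK):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the value at or below which
    p percent of the samples fall."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[rank]

def meets_slo(latencies_s: list[float], p: float = 95,
              target_s: float = 5.0) -> bool:
    """SLO check: 'p% of reviews complete in under target_s seconds.'"""
    return percentile(latencies_s, p) < target_s
```

Run this over a rolling window of production latencies; when it returns True consistently, stop optimizing and move on.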

The Deeper Patterns in Agent Performance

Performance optimization for agents requires thinking about what's actually happening. When you send a message to Claude, you're not just transmitting text. You're asking the model to read your prompt, understand what you're asking, decide if tools are needed, formulate exactly what information it wants, and synthesize all responses. Each step has cost and latency implications.

The hidden layer here is recognizing that perceived performance and actual performance are different things. A streaming agent might take the same total time as a non-streaming agent, but users perceive it as much faster because they see content appearing. This isn't an illusion—it's genuinely better UX. The user can start reading while computation continues, which is psychologically more satisfying than staring at a loading spinner.

The same principle applies to error handling. A delayed error message is nearly as bad as no message, because the user sits without feedback the whole time. If your agent catches an error, immediately surface something to the user (streaming a partial response, showing what it tried, explaining why it failed) rather than waiting for the perfect response.

When Optimization Becomes Over-Optimization

There's a trap in performance optimization: you can spend enormous effort squeezing another 50ms of latency when that time has zero impact on user perception. Your P99 latency might be 500ms when your P50 is 100ms. It's tempting to chase that tail, but if the vast majority of your users already see responses near 100ms, optimizing for the slowest 1% is low ROI.

Focus on the percentiles that matter. If your SLA is "95% of requests under 1 second," optimize P95. If it's "all requests under 3 seconds," optimize P100 (worst-case). Most of the time, P95-P99 is where the business value lives.

Also recognize when network latency becomes your hard ceiling. If you're calling Anthropic's API across the Atlantic, you've got a floor of maybe 300-400ms just for the round trip. No amount of local optimization gets you below that. The wins come from reducing the number of round trips (fewer tool calls, better batching), not from making individual trips faster.


Claude Code makes this practical because you can measure and optimize the actual SDK behavior. You're not guessing—you're profiling real workflows and fixing what's slow.

-iNet

Need help implementing this?

We build automation systems like this for clients every day.

Discuss Your Project