
Building a resilient agent isn't just about writing clever prompts — it's about handling the chaos of production. Networks fail, APIs throttle, tools time out, and sometimes Claude itself hiccups. This guide covers the patterns that separate toy agents from battle-tested systems. We'll walk through real error handling, exponential backoff with jitter, tool execution recovery, and graceful degradation strategies. By the end, you'll have a mental model for building agents that don't crash at 2am.
Table of Contents
- Why Error Handling Matters More Than You Think
- Understanding SDK Error Types
- Basic Retry with Exponential Backoff
- Adding Jitter to Prevent Synchronized Retries
- Handling SDK Error Messages Directly
- Tool Execution Recovery Strategies
- Implementing Circuit Breakers for Cascading Failures
- Graceful Degradation When Tools Are Unavailable
- Timeout Handling for Long-Running Operations
- Handling Partial Failures and Resuming with State
- Implementing Retry Budget and Cost Control
- Observability: Logging and Monitoring
- Testing Error Scenarios
- Real-World Scenario: The Cascading Failure
- Alternatives to Standard Retry Patterns
- Troubleshooting Error Handling Issues
- Practical Patterns for Different Error Categories
- Production Considerations: Monitoring and Observability
- Testing Error Handling Thoroughly
- Common Mistakes and How to Avoid Them
- The Hidden Layer: Why These Patterns Matter in Real Systems
- Key Takeaways
Why Error Handling Matters More Than You Think
Here's the thing: most tutorials show you the happy path. Your agent makes a tool call, gets a result, moves on. In production? You're looking at:
- API rate limits (429 errors)
- Transient network failures (timeouts)
- Tool execution crashes (permission denied, file not found)
- Downstream service failures (external APIs your agent calls)
- Budget overruns or billing issues
- Authentication failures mid-stream
Each of these needs a different recovery strategy. Retry some. Skip others. Degrade gracefully on the rest. Let's build that resilience.
The hidden truth that most developers don't realize until 2 AM when their production system stops responding: error handling isn't a nice-to-have, it's the actual product. Your business logic, your clever prompts, your optimizations—those are all predicated on the system staying up. But systems don't stay up without resilience patterns.
Think about the economics. You deploy an agent to production. It runs fine for six hours. Then a downstream API hiccups for thirty seconds. Without proper error handling and retry logic, what happens? Your agent crashes. Your users see failures. You get paged. You scramble to restart. You lose trust.
With proper error handling? That same thirty-second hiccup is invisible. The agent notices, backs off, waits, tries again. By the time your monitoring alert fires (if you even set one), the system has already recovered. The user never saw anything wrong.
This is the difference between a system that's merely "correct" and one that's actually "reliable." Correctness is about doing the right thing when everything works. Reliability is about doing the right thing when everything breaks.
Understanding SDK Error Types
The Agent SDK surfaces several error categories through the SDKAssistantMessageError type. Each tells you something specific about what went wrong:
- authentication_failed means your API key is bad or expired — retrying won't help.
- rate_limit means you're hitting Claude's rate limits — exponential backoff is your friend.
- server_error means Claude's infrastructure has a hiccup — transient, worth retrying.
- billing_error means your account is out of credits — no amount of retrying helps.
- invalid_request means you asked Claude to do something it can't — fix your prompt, not your backoff.
This distinction is crucial. Blindly retrying everything wastes time and money. Smart retries based on error type are fast and cheap.
The deeper principle here applies beyond just Claude Code. Every external system has different failure modes. Your database has different failure modes than your third-party API, which has different modes than your file system. A good resilience strategy treats each appropriately. When your database times out, you might retry aggressively—databases usually recover quickly from momentary overload. When a third-party API returns an auth error, retrying is pointless—the configuration is broken, not the system. Understanding these distinctions is what separates "flaky" systems from "resilient" ones.
Basic Retry with Exponential Backoff
Let's start with the foundation: exponential backoff. The idea is simple — wait longer between each retry to give the system time to recover.
But understanding the why is crucial. When a system experiences a transient failure, it often needs time to self-recover. If you immediately hammer it again with ten requests per second, you're making the problem worse. You're adding load to a system that's already struggling. Think of it like a traffic jam: the more people who lay on the horn and floor the accelerator, the worse everyone's stuck.
Exponential backoff works like this: your first retry happens after one second. The system has time to notice the overload and start shedding load. If that fails, you wait two seconds. Then four. Then eight. By the time you get to your third or fourth retry, you've waited fifteen, thirty seconds. By then, the system has almost certainly recovered. Real incidents that last more than a minute are rare. Most failures are self-healing within seconds.
The magic of exponential backoff is that it's proportional to failure duration. Quick failures get retried fast. Systemic failures get generous backoff. It automatically adapts without you having to tune retry counts. This is why it's been the industry standard for decades—it just works across different systems, different failure modes, different time scales.
Here's what this looks like in practice for Claude Code agents:
```typescript
// Minimal sleep helper used throughout these examples
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithExponentialBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts: number = 3,
  baseDelay: number = 1000,
): Promise<T> {
  let lastError: unknown = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt === maxAttempts) {
        throw error;
      }
      // Double the delay each attempt: 1s, 2s, 4s, ...
      const delay = baseDelay * Math.pow(2, attempt - 1);
      console.log(`Attempt ${attempt} failed. Retrying in ${delay}ms...`);
      await sleep(delay);
    }
  }
  throw lastError;
}

// Usage
const result = await retryWithExponentialBackoff(
  () => agentSDK.invoke(input),
  3,
  1000,
);
```

This is the baseline pattern. It's simple, it works, and it's immediately productive. Notice how the delay doubles each iteration: 1 second, 2 seconds, 4 seconds. This spreads retries out over time, though as the next section shows, you still need jitter to avoid synchronized retries across many clients.
Adding Jitter to Prevent Synchronized Retries
Here's a production problem: if 100 agents retry at exactly the same time (after the same backoff), you still overwhelm the API. The solution is jitter — random variance in the delay.
This is one of those patterns that sounds silly until you've seen it happen in production. Picture this: You run a scheduled job that deploys to 100 agents. They all fail at roughly the same time (network blip, API restart, whatever). They all start their retry loop. At exactly the two-second mark—because they're all following the same backoff function—all 100 agents fire requests simultaneously. You just created a coordinated attack on your own infrastructure.
Jitter solves this by adding randomness. Instead of waiting exactly two seconds, you wait two seconds plus or minus twenty percent. So some agents wait 1.6 seconds, others 2.4, others 2.1. They hit the API in a nice, distributed wave instead of a thundering herd. Load spreads across time. The system stays stable.
The jitter factor needs to be calibrated thoughtfully. Too much jitter (say, ±50%) and you lose the benefits of exponential backoff—you're basically waiting a random amount. Too little (±5%) and you don't spread the load enough. The sweet spot for most systems is 20-30% variance.
```typescript
function addJitter(delayMs: number, jitterPercent: number = 0.25): number {
  const jitterAmount = delayMs * jitterPercent;
  // Random value in [-jitterAmount, +jitterAmount]
  const randomVariation = (Math.random() - 0.5) * 2 * jitterAmount;
  return Math.max(0, delayMs + randomVariation);
}

async function retryWithBackoffAndJitter<T>(
  fn: () => Promise<T>,
  maxAttempts: number = 3,
  baseDelay: number = 1000,
): Promise<T> {
  let lastError: unknown = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt === maxAttempts) {
        throw error;
      }
      const exponentialDelay = baseDelay * Math.pow(2, attempt - 1);
      const delayWithJitter = addJitter(exponentialDelay);
      console.log(
        `Attempt ${attempt} failed. Waiting ${delayWithJitter.toFixed(0)}ms before retry...`,
      );
      await sleep(delayWithJitter);
    }
  }
  throw lastError;
}
```

When you deploy this across production, the effect is dramatic. Instead of 100 agents hitting your API simultaneously and causing a spike, they're spread across a window. The spike becomes a gentle curve. Your infrastructure can handle the load without becoming unstable.
Handling SDK Error Messages Directly
When the Agent SDK encounters an error, it surfaces it in the message stream as an SDKAssistantMessage with an error field. You need to handle this explicitly.
```typescript
async function invokeAgentWithErrorHandling(input: string) {
  try {
    const response = await agentSDK.invoke(input);
    // Check for SDK-level errors in the message stream
    for (const message of response.messages) {
      if (message.type === "assistant" && message.error) {
        const error = message.error;
        switch (error.type) {
          case "authentication_failed":
            // Permanent failure—don't retry
            throw new Error(`Auth failed: ${error.message}`);
          case "rate_limit":
            // Transient—retry with backoff
            console.warn("Rate limited, will retry with backoff");
            return await retryWithBackoffAndJitter(
              () => agentSDK.invoke(input),
              3,
            );
          case "server_error":
            // Transient—retry with backoff
            console.warn("Server error, will retry");
            return await retryWithBackoffAndJitter(
              () => agentSDK.invoke(input),
              3,
            );
          case "billing_error":
            // Permanent until credits renewed
            throw new Error(`Billing issue: ${error.message}`);
          case "invalid_request":
            // Permanent—fix your prompt
            throw new Error(`Invalid request: ${error.message}`);
          default:
            throw new Error(`Unknown error: ${error.message}`);
        }
      }
    }
    return response;
  } catch (error) {
    // Handle network errors, timeouts, etc.
    if (error instanceof NetworkError) {
      console.error("Network error, retrying...");
      return retryWithBackoffAndJitter(() => agentSDK.invoke(input), 3);
    }
    throw error;
  }
}
```

Notice the three-level error handling: SDK-level errors in the message stream, tool execution errors in tool_result messages, and session-level errors in the final result message. Each needs its own handler because they mean different things. This is the architecture of resilience: categorizing failures and responding appropriately.
Tool Execution Recovery Strategies
Sometimes a tool call fails — file doesn't exist, you don't have permission, the command times out. Here's how to recover gracefully. Your hooks can catch tool failures and apply tool-specific recovery logic. For Read failures, the agent learns the file doesn't exist and adjusts. For Bash failures, we log the exit code so the agent understands why. For Edit failures, we signal that conflicts may have occurred. This isn't automatic retry — it's giving Claude better information so it can recover intelligently.
```typescript
interface ToolError {
  tool: string;
  error: string;
  recoverable: boolean;
  suggestion?: string;
}

async function executeToolWithRecovery(
  tool: string,
  params: Record<string, any>,
): Promise<any> {
  try {
    switch (tool) {
      case "Read":
        return await executeRead(params);
      case "Edit":
        return await executeEdit(params);
      case "Bash":
        return await executeBash(params);
      default:
        return await executeTool(tool, params);
    }
  } catch (error) {
    return handleToolError(tool, error as Error);
  }
}

function handleToolError(tool: string, error: Error): ToolError {
  if (tool === "Read") {
    if (error.message.includes("ENOENT")) {
      return {
        tool: "Read",
        error: "File not found",
        recoverable: false,
        suggestion: "The file does not exist. Try listing the directory first.",
      };
    }
    if (error.message.includes("EACCES")) {
      return {
        tool: "Read",
        error: "Permission denied",
        recoverable: false,
        suggestion: "You lack read permissions. Check file ownership.",
      };
    }
  }
  if (tool === "Bash") {
    if (error.message.includes("timeout")) {
      return {
        tool: "Bash",
        error: "Command timed out",
        recoverable: true,
        suggestion: "The command took too long. Try with a simpler operation.",
      };
    }
  }
  return {
    tool,
    error: error.message,
    recoverable: false,
  };
}
```

This pattern gives Claude visibility into why a tool failed and how to recover. When a file read fails because the file doesn't exist, Claude knows to try a different approach. When a command times out, Claude knows it was transient and can try again with a simpler variant. This is smarter than just retrying blindly.
Implementing Circuit Breakers for Cascading Failures
A circuit breaker stops your agent from continuously banging into a failing system. It's like an electrical circuit breaker: trip it, kill the connection, wait before retrying.
The insidious danger of unchecked retries is cascading failure. Imagine Claude's API becomes temporarily unavailable. Without a circuit breaker, your agent would detect the failure, start a retry loop, wait 1 second, retry, fail, wait 2 seconds, retry, fail, wait 4 seconds, and continue for five minutes straight, just hammering the broken service. Even with backoff capped at a few seconds, that's dozens of attempts hitting an already-overwhelmed system. You're making the outage worse. You're contributing to the overload that's preventing recovery.
A circuit breaker operates in three states. Closed (normal): Requests pass through. If they fail, we track it. Open (broken): After N failures, we trip the circuit. New requests immediately fail without even trying. You stop hammering the dead system. The downstream service has space to recover. Half-Open (testing): After a timeout, we cautiously try one request. If it succeeds, we close the circuit. If it fails, we reopen and wait longer.
```typescript
enum CircuitState {
  Closed = "closed",
  Open = "open",
  HalfOpen = "half-open",
}

class CircuitBreaker {
  // `protected` so subclasses (like a timeout-aware breaker) can trip the circuit
  protected state: CircuitState = CircuitState.Closed;
  protected openSince: number | null = null;
  private failureCount: number = 0;
  private failureThreshold: number = 5;
  private successThreshold: number = 2;
  private openTimeout: number = 30000; // 30 seconds
  private successCount: number = 0;

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === CircuitState.Open) {
      if (Date.now() - this.openSince! > this.openTimeout) {
        // Try to recover
        this.state = CircuitState.HalfOpen;
        this.successCount = 0;
      } else {
        throw new Error("Circuit breaker is open");
      }
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess(): void {
    this.failureCount = 0;
    if (this.state === CircuitState.HalfOpen) {
      this.successCount++;
      if (this.successCount >= this.successThreshold) {
        this.state = CircuitState.Closed;
        console.log("Circuit breaker closed—service recovered");
      }
    }
  }

  private onFailure(): void {
    if (this.state === CircuitState.HalfOpen) {
      // Failed during recovery attempt—reopen immediately
      this.state = CircuitState.Open;
      this.openSince = Date.now();
      return;
    }
    this.failureCount++;
    if (this.failureCount >= this.failureThreshold) {
      this.state = CircuitState.Open;
      this.openSince = Date.now();
      console.log(
        "Circuit breaker opened—stopping requests to failing service",
      );
    }
  }
}
```

This self-healing behavior means that when a downstream service goes down, your agents detect it, stop pounding it, and let it recover. Once it's back, the circuit automatically closes and you resume normal operation. No manual intervention needed. This is the invisible safety net that keeps your systems running smoothly.
Graceful Degradation When Tools Are Unavailable
Not all tools are always available. MCP servers might crash, network calls might fail. Here's how to gracefully degrade. You can validate that required tools are available before running, but allow optional tools to be missing. The agent runs with whatever tools are available. If WebSearch is down, your agent can still analyze local code. If the file system is unavailable, that's a blocking issue.
```typescript
interface ToolAvailability {
  required: string[];
  optional: string[];
}

async function validateToolAvailability(
  requiredTools: string[],
  optionalTools: string[],
): Promise<{ available: string[]; missing: string[] }> {
  const allTools = [...requiredTools, ...optionalTools];
  const available: string[] = [];
  const missing: string[] = [];
  for (const tool of allTools) {
    try {
      await healthCheckTool(tool);
      available.push(tool);
    } catch (error) {
      missing.push(tool);
    }
  }
  // Verify all required tools are available
  const missingRequired = requiredTools.filter((t) => missing.includes(t));
  if (missingRequired.length > 0) {
    throw new Error(
      `Required tools unavailable: ${missingRequired.join(", ")}`,
    );
  }
  return { available, missing };
}

async function invokeAgentWithGracefulDegradation(
  input: string,
  requiredTools: string[],
  optionalTools: string[],
): Promise<string> {
  const { available } = await validateToolAvailability(
    requiredTools,
    optionalTools,
  );
  const agent = agentSDK.create({
    tools: available,
    systemPrompt: `You have access to: ${available.join(", ")}`,
  });
  return agent.invoke(input);
}
```

The key insight: required tools are essential for the task, but optional tools enhance the agent's capabilities. By validating availability upfront and constraining the agent's tool set dynamically, you ensure the agent always works with what's available, never crashes due to missing tools.
Timeout Handling for Long-Running Operations
Agents can sometimes get stuck — infinite loops, very large files, unresponsive MCP servers. Timeouts are your safety valve. The AbortController pattern is standard in Node.js for cancellation. In production, you typically pair timeouts with circuit breakers: if an agent times out 3 times in a row, open the circuit.
```typescript
async function invokeAgentWithTimeout(
  input: string,
  timeoutMs: number = 30000,
): Promise<string> {
  const controller = new AbortController();
  const timeoutHandle = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const result = await agentSDK.invoke(input, {
      signal: controller.signal,
    });
    return result;
  } catch (error) {
    if (error instanceof Error && error.name === "AbortError") {
      throw new Error(`Agent invocation timed out after ${timeoutMs}ms`);
    }
    throw error;
  } finally {
    clearTimeout(timeoutHandle);
  }
}

// With timeout-based circuit breaker
class TimeoutCircuitBreaker extends CircuitBreaker {
  private timeoutCount: number = 0;
  private timeoutThreshold: number = 3;

  async executeWithTimeout<T>(
    fn: () => Promise<T>,
    timeoutMs: number,
  ): Promise<T> {
    // Race the operation against a timer so it can never hang indefinitely
    const withTimeout = () =>
      Promise.race<T>([
        fn(),
        new Promise<never>((_, reject) =>
          setTimeout(
            () => reject(new Error(`Operation timed out after ${timeoutMs}ms`)),
            timeoutMs,
          ),
        ),
      ]);
    try {
      return await this.execute(withTimeout);
    } catch (error) {
      if (error instanceof Error && error.message.includes("timed out")) {
        this.timeoutCount++;
        if (this.timeoutCount >= this.timeoutThreshold) {
          // Too many timeouts—something is wrong
          this.state = CircuitState.Open;
          this.openSince = Date.now();
        }
      }
      throw error;
    }
  }
}
```

This pattern ensures agents never hang indefinitely. If an operation takes longer than expected, it's terminated and the circuit breaker tracks the pattern. Three timeouts in a row suggests a systemic issue, not a transient blip.
Handling Partial Failures and Resuming with State
When an agent partially completes a task before failing, you need to decide: retry from scratch or resume from where it failed? By resuming with session IDs, the agent maintains context across retries. It knows which files it already processed, avoiding duplicate work. You include completed steps in the prompt so Claude understands progress and doesn't restart from scratch. This is critical for long-running, multi-step tasks.
```typescript
interface SessionState {
  sessionId: string;
  completedSteps: string[];
  currentStep: number;
  results: Record<string, any>;
}

async function resumableTask(
  input: string,
  steps: string[],
  sessionState?: SessionState,
): Promise<SessionState> {
  const sessionId = sessionState?.sessionId || generateSessionId();
  const completedSteps = sessionState?.completedSteps || [];
  const results = sessionState?.results || {};
  const remainingSteps = steps.filter(
    (_, i) => !completedSteps.includes(String(i)),
  );
  if (remainingSteps.length === 0) {
    return { sessionId, completedSteps, currentStep: steps.length, results };
  }
  const prompt = `
You are resuming a task.
Completed steps: ${completedSteps.map((idx) => steps[parseInt(idx)]).join(", ")}
Remaining steps: ${remainingSteps.join(", ")}
Continue from where you left off.
`;
  try {
    const response = await agentSDK.invoke(prompt);
    // Mark the step as complete before recursing into the next one
    completedSteps.push(String(steps.indexOf(remainingSteps[0])));
    results[remainingSteps[0]] = response;
    return resumableTask(input, steps, {
      sessionId,
      completedSteps,
      currentStep: completedSteps.length,
      results,
    });
  } catch (error) {
    // Save progress before throwing
    saveSessionState(sessionId, {
      sessionId,
      completedSteps,
      currentStep: completedSteps.length,
      results,
    });
    throw error;
  }
}
```

This pattern is crucial for batch operations. If you're processing 100 files and fail on file 47, you don't want to reprocess files 1-46. With state resumption, you continue from file 48. This saves time and tokens.
Implementing Retry Budget and Cost Control
In production, retries add up. A simple 3-retry policy means up to 4x the normal API cost on every failing task, and retries nested across layers multiply further. You need a retry budget — a cost ceiling. This prevents runaway costs. If a task is genuinely hard and burns through your retry budget, you fail fast instead of silently spending $100.
```typescript
class RetryBudget {
  private spent: number = 0; // USD
  private limit: number; // USD

  constructor(budgetUSD: number) {
    this.limit = budgetUSD;
  }

  canRetry(estimatedCost: number): boolean {
    return this.spent + estimatedCost <= this.limit;
  }

  spend(cost: number): void {
    this.spent += cost;
  }

  remaining(): number {
    return this.limit - this.spent;
  }

  percentUsed(): number {
    return (this.spent / this.limit) * 100;
  }
}

async function invokeWithBudget(
  input: string,
  budget: RetryBudget,
  estimatedCost: number = 0.01,
): Promise<string> {
  let attempt = 0;
  const maxAttempts = 3;
  while (attempt < maxAttempts) {
    // Check budget before attempting (outside the try so this error propagates)
    if (!budget.canRetry(estimatedCost)) {
      throw new Error(
        `Retry budget exhausted. Remaining: $${budget.remaining().toFixed(2)}`,
      );
    }
    try {
      const result = await agentSDK.invoke(input);
      budget.spend(estimatedCost);
      return result;
    } catch (error) {
      attempt++;
      budget.spend(estimatedCost); // failed attempts still cost tokens
      if (attempt >= maxAttempts || !budget.canRetry(estimatedCost)) {
        throw new Error(
          `Failed after ${attempt} attempts. Budget: ${budget.percentUsed().toFixed(1)}% used`,
        );
      }
      const delay = 1000 * Math.pow(2, attempt - 1);
      await sleep(delay);
    }
  }
  throw new Error("Unreachable: retry loop exited without a result");
}
```

This prevents a pathological scenario where a task is genuinely hard, your agent keeps retrying, and you wake up to a massive bill. Instead, the budget acts as a circuit breaker on cost.
Observability: Logging and Monitoring
You can't debug what you don't observe. Here's a structured logging approach that lets you track retry attempts, errors, and patterns. With this telemetry, you can answer questions like: "Which errors cause most retries? Are we rate-limited more on Tuesdays? How long do our agents spend retrying?" This drives optimization.
```typescript
interface RetryEvent {
  timestamp: number;
  agentId: string;
  attempt: number;
  errorType: string;
  duration: number;
  backoffMs: number;
  success: boolean;
}

class RetryLogger {
  private events: RetryEvent[] = [];

  logRetry(event: RetryEvent): void {
    this.events.push(event);
    // Log to observability system
    console.log(
      `[RETRY] Agent ${event.agentId} attempt ${event.attempt}: ` +
        `${event.errorType} (waited ${event.backoffMs}ms)`,
    );
    // Send to monitoring backend
    this.emitToMonitoring(event);
  }

  getErrorStats(): Record<string, { count: number; successRate: number }> {
    const raw: Record<string, { count: number; successes: number }> = {};
    for (const event of this.events) {
      if (!raw[event.errorType]) {
        raw[event.errorType] = { count: 0, successes: 0 };
      }
      raw[event.errorType].count++;
      if (event.success) raw[event.errorType].successes++;
    }
    // Derive success rate from the raw counts
    const stats: Record<string, { count: number; successRate: number }> = {};
    for (const [type, { count, successes }] of Object.entries(raw)) {
      stats[type] = { count, successRate: successes / count };
    }
    return stats;
  }

  private emitToMonitoring(event: RetryEvent): void {
    // Send to Datadog, Prometheus, CloudWatch, etc.
    metrics.gauge("retry.backoff_ms", event.backoffMs, {
      agent: event.agentId,
      error_type: event.errorType,
    });
  }
}
```

This telemetry accumulates over time. After a week of data, you can see which error types cause most issues. After a month, you can spot patterns (e.g., rate limits happen more on weekdays). This data drives smarter resilience strategies.
Testing Error Scenarios
You'd test these patterns by mocking failures and verifying behavior. Does the agent retry on rate limits but fail immediately on auth errors? Does it respect retry budgets? Does it timeout correctly? Does the circuit breaker open after N failures and let the system recover? These tests ensure your resilience layer actually works.
```typescript
describe("Error Recovery Patterns", () => {
  test("retries on rate limit, fails on auth error", async () => {
    const agentMock = jest
      .fn()
      .mockRejectedValueOnce(new Error("Rate limited"))
      .mockResolvedValueOnce("success");
    // invokeWithErrorHandling: a wrapper that retries transient errors
    const result = await invokeWithErrorHandling(agentMock);
    expect(result).toBe("success");
    expect(agentMock).toHaveBeenCalledTimes(2);
  });

  test("circuit breaker opens after N failures", async () => {
    const breaker = new CircuitBreaker();
    // failureThreshold is private—override it for the test
    (breaker as any).failureThreshold = 2;
    const failingFn = async () => {
      throw new Error("fail");
    };
    await expect(breaker.execute(failingFn)).rejects.toThrow();
    await expect(breaker.execute(failingFn)).rejects.toThrow();
    // Third call should fail immediately without calling the function
    await expect(breaker.execute(failingFn)).rejects.toThrow(
      "Circuit breaker is open",
    );
  });

  test("respects retry budget", async () => {
    const budget = new RetryBudget(0.05); // 5 cents
    // Pre-spend most of the budget so the next attempt would exceed it
    budget.spend(0.04);
    await expect(invokeWithBudget("test", budget, 0.02)).rejects.toThrow(
      /budget exhausted/i,
    );
  });

  test("resumes from session state", async () => {
    const state: SessionState = {
      sessionId: "test-123",
      completedSteps: ["0", "1"],
      currentStep: 2,
      results: {},
    };
    const result = await resumableTask("input", ["a", "b", "c"], state);
    expect(result.completedSteps.length).toBeGreaterThan(2);
  });
});
```

These tests ensure the resilience mechanisms work as expected. When you ship code relying on these patterns, you've already validated that they catch errors correctly and recover appropriately.
Real-World Scenario: The Cascading Failure
Let me walk you through a realistic scenario that happens in production more often than teams expect. It's Monday 3pm, peak traffic time. Your API is handling 500 requests per second. A downstream service (a payment processor) hiccups—response times jump from 100ms to 5000ms. Your agent code doesn't have timeouts, so requests start queuing up. After 30 seconds, you have 15,000 requests waiting.
Without proper error handling: Your agent tries to process them all, each one timing out after 30 seconds. The API client retries each one. Now you're generating 30,000 requests per minute trying to hit a service that can only handle 500 requests per second. You're making the problem worse. The downstream service never recovers because every recovery attempt is drowned in retries.
With proper error handling: Your agent detects the first timeout. It backs off (waits 1 second). It detects the second timeout. It backs off longer (waits 2 seconds). By the third timeout, the circuit breaker opens. New requests fail immediately without even trying to reach the downed service. The load on the downstream service drops to zero. The system has room to recover. After 30 seconds of circuit-break, you try a single request. It succeeds. The circuit closes. Normal operations resume.
Total impact without error handling: 20-minute outage while the downstream service recovers. Your customers see failures.
Total impact with error handling: 2-minute degradation while you shed load. Customers see slightly slower responses, not failures.
That's the real-world difference. Error handling isn't academic. It's survival.
Alternatives to Standard Retry Patterns
The retry patterns we've covered work for most situations. But different failure modes call for different strategies.
Alternative 1: Request collapsing
Instead of retrying the same request 3 times, batch related requests together. If 100 agents all retry the same query after the same failure, collapse them into 1 query and fan out the result to all 100 agents. This reduces load on the downstream system by 100x. Implementation: maintain a cache of in-flight requests keyed by request signature. If a duplicate request arrives, wait for the first one to complete instead of issuing a new request.
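A minimal sketch of that in-flight cache, assuming callers can compute a stable signature string for each request (the `RequestCollapser` name and API here are illustrative, not part of any SDK):

```typescript
// Request collapsing: deduplicate concurrent identical requests.
class RequestCollapser<T> {
  private inFlight = new Map<string, Promise<T>>();

  async execute(signature: string, fn: () => Promise<T>): Promise<T> {
    // If an identical request is already running, piggyback on its promise
    const existing = this.inFlight.get(signature);
    if (existing) return existing;

    const promise = fn().finally(() => {
      // Clear the slot so later requests issue a fresh call
      this.inFlight.delete(signature);
    });
    this.inFlight.set(signature, promise);
    return promise;
  }
}
```

All waiters share one promise, so a failure also fans out to every caller — which is usually what you want, since they would all have failed anyway.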
Alternative 2: Adaptive timeout
Instead of fixed timeouts, measure how long requests take under normal conditions. Adjust timeout dynamically: normal_time + 3 * stddev. If the system is degraded, requests slow down naturally, and timeouts scale with them. This adapts to the system's current state.
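The mean-plus-three-standard-deviations formula can be sketched like this (the `AdaptiveTimeout` class is hypothetical; the window size and minimum sample count are arbitrary choices):

```typescript
// Adaptive timeout: derive the timeout from observed latencies
// (mean + 3 standard deviations) instead of a fixed constant.
class AdaptiveTimeout {
  private samples: number[] = [];

  record(latencyMs: number): void {
    this.samples.push(latencyMs);
    if (this.samples.length > 1000) this.samples.shift(); // sliding window
  }

  currentTimeoutMs(fallbackMs: number = 30000): number {
    if (this.samples.length < 10) return fallbackMs; // not enough data yet
    const mean =
      this.samples.reduce((a, b) => a + b, 0) / this.samples.length;
    const variance =
      this.samples.reduce((a, b) => a + (b - mean) ** 2, 0) /
      this.samples.length;
    return mean + 3 * Math.sqrt(variance);
  }
}
```

Under roughly normal latency distributions, mean plus three standard deviations covers about 99.7% of requests, so healthy requests rarely get cut off while the threshold still tracks degradation.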
Alternative 3: Bulkhead pattern
Instead of using all connection pool connections for retries, isolate them. Reserve half the pool for normal requests, half for retries. If retries exhaust their pool, they fail fast instead of starving normal requests. This prevents retries from causing cascading failures.
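A bulkhead can be as simple as a counting semaphore per pool. This sketch (the `Bulkhead` name and pool sizes are illustrative) fails fast when a pool is exhausted:

```typescript
// Bulkhead: cap concurrent slots per category so retries can't starve
// normal traffic.
class Bulkhead {
  private active = 0;
  constructor(private readonly maxConcurrent: number) {}

  tryAcquire(): boolean {
    if (this.active >= this.maxConcurrent) return false; // fail fast
    this.active++;
    return true;
  }

  release(): void {
    this.active = Math.max(0, this.active - 1);
  }
}

// Split a hypothetical 100-connection pool into two isolated halves
const normalPool = new Bulkhead(50);
const retryPool = new Bulkhead(50);
```

Callers acquire from `retryPool` before retrying; if it returns false, the retry is dropped rather than allowed to compete with fresh requests.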
Alternative 4: Fallback mechanisms
Instead of retrying the same approach, switch to a fallback. If the primary payment processor fails, use a secondary. If the cache is down, query the database directly. This trades slightly higher latency for robustness—you're not retrying a broken path, you're switching to a working one.
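A fallback chain can be expressed as a list of interchangeable providers tried in order. This is a generic sketch, not an SDK API:

```typescript
// Fallback chain: try each provider in order, returning the first success.
async function withFallback<T>(providers: Array<() => Promise<T>>): Promise<T> {
  let lastError: unknown;
  for (const provider of providers) {
    try {
      return await provider();
    } catch (error) {
      lastError = error; // remember why this provider failed, try the next
    }
  }
  // Every provider failed; surface the last failure
  throw lastError;
}
```

For the payment example, the list would be `[primaryProcessor, secondaryProcessor]` (hypothetical names); for caching, `[readCache, readDatabase]`.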
Troubleshooting Error Handling Issues
Despite careful implementation, error handling sometimes misbehaves. Here's how to debug systematically.
Issue: Retries keep happening but failures don't decrease
You're retrying, but the error rate stays constant. This suggests the error isn't transient. Solution: Check your error type classification. Are you retrying non-retryable errors (auth failures, bad requests)? If yes, stop retrying those. Check if there's a systemic issue (e.g., your code has a bug) that retries can't fix.
Issue: Retry budget exhaustion without fixing the problem
Your budget allows 5 retries, and you burn through all 5 without recovery. Solution: Either increase the budget (if you can afford it) or reconsider your retry strategy. Maybe the problem needs a fix, not retries. Circuit breakers exist partially to prevent this—after N failures, stop retrying and surface the error early instead of burning the budget.
Issue: Circuit breaker stays open too long
After a failure, your circuit breaker opens. But the downstream service recovers after 30 seconds. The circuit breaker doesn't try again for 5 minutes. That's too long. Solution: Implement faster recovery attempts. Shorten the timeout before half-open retry, or probe the service faster to detect recovery.
Issue: Jitter makes some retries wait way too long
Your jitter adds ±50% variance, so a 2-second delay becomes anywhere from 1 to 3 seconds, and a later 8-second delay can stretch to as much as 12 seconds. By then, you've lost important time. Solution: Reduce the jitter percentage. ±25% variance is usually sufficient to spread load without excessive delay variance. Or shrink the jitter as you approach retry deadlines (first retry: ±30%, last retry: ±5%).
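That tapering idea can be sketched as a jitter schedule that narrows with each attempt (the function name and default percentages are illustrative choices, not from the SDK):

```typescript
// Tapering jitter: wide variance early (to spread load), narrow variance
// on late attempts (to keep the total wait predictable).
function taperedJitter(
  delayMs: number,
  attempt: number,
  maxAttempts: number,
  startPercent: number = 0.3,
  endPercent: number = 0.05,
): number {
  // Linearly interpolate the jitter band from startPercent to endPercent
  const progress = (attempt - 1) / Math.max(1, maxAttempts - 1);
  const jitterPercent = startPercent + (endPercent - startPercent) * progress;
  const variation = (Math.random() - 0.5) * 2 * delayMs * jitterPercent;
  return Math.max(0, delayMs + variation);
}
```

Drop this in where `addJitter` is called in the retry loop, passing the current attempt number.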
Practical Patterns for Different Error Categories
Not all errors deserve the same treatment. Here are patterns tailored to specific error types.
Category: Rate Limit (429)
This is transient but also informative. The service is overloaded. Back off aggressively. Exponential backoff with jitter is ideal. Retry 3-5 times. If still rate-limited after that, open the circuit briefly.
Category: Server Error (500, 502, 503) This is usually transient and self-healing: the service is temporarily down or restarting. Retry with a short backoff (hammering a recovering service with immediate retries only slows it down), and allow a longer overall timeout to give the service time to restart. Typical: 3 retries over 30 seconds.
Category: Client Error (400, 401, 403, 404) These are permanent failures. Don't retry. Investigate and fix the request instead. Only exception: 401 (auth) might be transient if your token expired—refresh and retry once.
Category: Timeout Could be transient (the service is overloaded) or permanent (the service is down). Retry with backoff, but use circuit breakers to prevent hammering. If timeouts persist for 5+ attempts, assume the service is truly down and open the circuit.
Category: Network Error (connection refused, socket hang up) The network path is broken, not the service. Retry, but only after a small backoff. The network may have momentary issues. If it persists, the infrastructure is broken and retries won't help.
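The categories above can be collapsed into a small classification function that maps an HTTP status to a recovery strategy. The mapping follows the rules just described; the type and strategy names are illustrative rather than from any particular SDK:

```typescript
// Illustrative mapping from HTTP status to recovery strategy,
// following the per-category rules above.
type Strategy = "backoff" | "retry-short" | "refresh-auth" | "fail";

function classifyHttpError(status: number): { retryable: boolean; strategy: Strategy } {
  if (status === 429) return { retryable: true, strategy: "backoff" };      // rate limit
  if (status >= 500) return { retryable: true, strategy: "retry-short" };   // server error
  if (status === 401) return { retryable: true, strategy: "refresh-auth" }; // maybe-expired token
  if (status >= 400) return { retryable: false, strategy: "fail" };         // client error
  return { retryable: false, strategy: "fail" };
}
```

Centralizing this decision in one function keeps the retry loop itself simple and makes the classification easy to unit-test.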
Production Considerations: Monitoring and Observability
Once you deploy error handling, you need visibility into what's happening. Without observability, you're flying blind.
Monitoring 1: Retry rate by error type Track how many retries are happening for each error type. If rate_limit retries suddenly spike, the downstream service is overloaded. If client error retries spike, your request format might be broken. This telemetry is your early warning system.
Monitoring 2: Circuit breaker state transitions Log when circuit breakers open, half-open, and close. These events are rare but important. A spike in circuit breaker opens means multiple failures are occurring simultaneously. That's an alert-worthy pattern.
Monitoring 3: Retry budget consumption Track how much of your retry budget you're spending. If a task consistently uses 80% of its budget, you might be setting the budget too low. If it consistently uses 10%, the budget is wastefully high.
Monitoring 4: End-to-end latency including retries When an agent invokes a tool and has to retry, the total latency is much higher than the normal case. Track this distribution. If P99 latency is 5 seconds while P50 is 100ms, something is wrong. Retries account for the difference—they're not showing up in your happy-path latency metrics.
Monitoring 5: Cost impact of retries Retries aren't free. Each retry costs money (API call, resource usage). Track retry cost as a percentage of total costs. If retries account for 20% of your costs, that's a signal to optimize error handling or raise budgets.
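A minimal in-process version of these counters might look like the sketch below: retries tallied by error type, plus retry spend as a share of total spend. A real system would export these to a metrics backend; all names here are hypothetical:

```typescript
// Hypothetical in-process counters for the signals above: retries by
// error type and retry cost as a fraction of total cost.
class RetryMetrics {
  private retriesByType = new Map<string, number>();
  private retryCostUsd = 0;
  private totalCostUsd = 0;

  recordRetry(errorType: string, costUsd: number): void {
    this.retriesByType.set(errorType, (this.retriesByType.get(errorType) ?? 0) + 1);
    this.retryCostUsd += costUsd;
    this.totalCostUsd += costUsd;
  }

  recordSuccess(costUsd: number): void {
    this.totalCostUsd += costUsd;
  }

  retries(errorType: string): number {
    return this.retriesByType.get(errorType) ?? 0;
  }

  // Fraction of total spend attributable to retries (0 when nothing spent).
  retryCostShare(): number {
    return this.totalCostUsd === 0 ? 0 : this.retryCostUsd / this.totalCostUsd;
  }
}
```

An alert on `retryCostShare()` crossing, say, 0.2 is a cheap way to catch the "retries are 20% of your costs" situation described above.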
Testing Error Handling Thoroughly
Error handling is hard to test because you need to simulate failures. Here's a structured approach.
Test 1: Inject transient failures Mock the API to fail the first 2 times, succeed on the third. Verify the agent retries and succeeds. This tests basic retry logic.
Test 2: Inject persistent failures Mock the API to always fail. Verify the agent respects retry limits, opens the circuit, and fails appropriately. This tests that you eventually give up and surface the error instead of retrying forever.
Test 3: Test circuit breaker state machine Verify closed → open transition (after N failures). Verify half-open → closed transition (after success). Verify half-open → open transition (after failure). Test all state transitions, not just happy paths.
Test 4: Test budget exhaustion Set a small budget and ensure tasks fail when they exceed it, rather than silently burning costs.
Test 5: Test error type classification Mock auth failures, bad requests, rate limits, and timeouts. Verify each is classified correctly and triggers the right recovery strategy.
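Tests 1 and 2 both reduce to the same trick: a mock that fails a fixed number of times before succeeding. A synchronous sketch (real tool calls would be async; `makeFlaky` and `withRetries` are invented names for this example):

```typescript
// Failure injection: a mock that throws `failures` times, then succeeds.
function makeFlaky<T>(failures: number, result: T): () => T {
  let remaining = failures;
  return () => {
    if (remaining > 0) {
      remaining -= 1;
      throw new Error("injected transient failure");
    }
    return result;
  };
}

// Plain retry loop under test: re-invokes fn up to maxAttempts times,
// rethrowing the last error if every attempt fails.
function withRetries<T>(fn: () => T, maxAttempts: number): T {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return fn();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

Setting `failures` below the retry limit exercises Test 1 (recovery); setting it above the limit exercises Test 2 (give up and surface the error).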
Common Mistakes and How to Avoid Them
Mistake 1: Retrying everything — Not all errors are transient. Retrying auth failures, bad requests, or permission errors is a waste. Use the isRetryableError() guard.
Mistake 2: Too aggressive backoff — Waiting 32 seconds before retry-5 means a single failed request takes a minute. Keep max backoff to 10-30 seconds. If you're still failing after a minute of retries, the system is probably broken, not temporarily overloaded.
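Capping the exponential curve is a one-liner: grow the delay normally, but never past a fixed ceiling. The values here are illustrative:

```typescript
// Exponential backoff with a hard cap, so late retries never stretch
// a single request toward a full minute of waiting.
function cappedBackoffMs(attempt: number, baseMs = 1000, capMs = 10_000): number {
  return Math.min(baseMs * 2 ** attempt, capMs);
}
```

With a 1-second base and a 10-second cap, attempts 0 through 4 wait 1s, 2s, 4s, 8s, 10s, instead of letting the last delay balloon to 16s and beyond.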
Mistake 3: Ignoring partial progress — If your agent completed 90% of a task, don't restart from scratch. Use session resumption and state tracking. This is especially critical for long-running tasks that touch multiple files or perform batch operations.
Mistake 4: No observability — You ship a retry system that silently burns through your budget. Add logging from day one. Make sure you can answer questions like "which errors caused the most retries this week?"
Mistake 5: Forgetting circuit breakers — Without circuit breakers, every retry spike hammers the API harder, making recovery slower. Always pair retries with circuit breakers. They work together—retries give the system multiple chances, circuit breakers protect it from being overwhelmed.
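A compact sketch of retries and a circuit breaker working together: the inner loop gives each call a few chances, while a shared consecutive-failure counter refuses calls outright once the service looks down. `makeGuardedCall` is an invented name, and a production breaker would also need the half-open recovery path described earlier:

```typescript
// Retries paired with a simple failure-count breaker: the loop retries
// individual calls, the counter stops all calls once failures pile up.
function makeGuardedCall<T>(
  fn: () => T,
  maxAttempts = 3,
  breakerThreshold = 5,
): () => T {
  let consecutiveFailures = 0;
  return () => {
    if (consecutiveFailures >= breakerThreshold) {
      throw new Error("circuit open: refusing to call downstream");
    }
    let lastError: unknown;
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        const result = fn();
        consecutiveFailures = 0; // any success resets the breaker
        return result;
      } catch (err) {
        lastError = err;
        consecutiveFailures += 1;
      }
    }
    throw lastError;
  };
}
```

Note the division of labor: the retry loop absorbs transient blips, while the breaker prevents the loop itself from hammering a service that is genuinely down.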
The Hidden Layer: Why These Patterns Matter in Real Systems
Error handling is where complexity actually lives. Your business logic might be ten lines. Your error handling is a hundred lines. That imbalance isn't a bug in your system design—it's a feature of reality. Production systems are mostly about handling what went wrong, not celebrating what went right.
Consider a simple agent that reads a file. The happy path is three lines of code. But in production, you need to handle file doesn't exist, file is locked, file is huge, filesystem is down, partial reads, corrupted data, and the read succeeds but in an unexpected format. Each of these needs a different recovery strategy. That's where your actual engineering time goes.
The systems we've shown you in this guide—exponential backoff, circuit breakers, budget tracking, state resumption—these aren't optimizations. They're necessities. Every production agent system eventually implements these patterns. The question is whether you build them thoughtfully or discover them painfully at 2 AM.
When you understand these patterns deeply, you build systems that absorb failures gracefully. Your agents keep running when APIs hiccup. Your budgets stay controlled even when things go wrong. Your users never see the chaos behind the scenes. That's the art of resilience.
Key Takeaways
Production agents need resilience built in from day one:
- Distinguish error types — rate limits need backoff, auth errors need immediate failure
- Use exponential backoff with jitter — prevents thundering herd and spreads load
- Implement circuit breakers — stop pounding failing services, let them recover
- Resume with state — don't restart, continue from where you left off
- Set retry budgets — prevent cost explosions from cascading failures
- Monitor and log everything — observability is your superpower
- Add tool-specific recovery — hooks let you adjust strategy per tool
- Set timeouts — prevent infinite hangs and stuck operations
- Test failure scenarios — verify your resilience actually works
- Document decisions — make sure the team understands the error-handling strategy
The patterns in this guide are battle-tested across thousands of production agents. Start with basic retry-backoff, graduate to circuit breakers and budgets as you scale. Your 2am self — and your accountant — will thank you.