Agent SDK Architecture: How It Works Under the Hood

You're using the Claude Code Agent SDK. You're spawning sessions, sending messages, handling tool calls. It all feels smooth—almost magical. But if you're building production systems, deploying at scale, or debugging subtle issues, you need to understand what's actually happening under the hood. How does a message travel from your code to Claude and back? How are tools registered, validated, and dispatched? What's really happening inside a session lifecycle? What happens when you have dozens of concurrent sessions battling for resources?
This article pulls back the curtain on the Agent SDK's internals. We're talking architecture, state machines, message flow, and extension points. By the end, you'll understand not just how to use the SDK, but why it works the way it does. That knowledge transforms you from someone using a tool to someone who can debug, optimize, and extend it. You'll know where bottlenecks hide, how errors propagate, and what safety rails protect you.
Table of Contents
- The High-Level Mental Model
- Core Components and Their Responsibilities
- 1. The Session Manager
- 2. The Message Processor
- 3. The Tool Registry and Dispatcher
- 4. The Built-in Tool Executors
- 5. The Message Flow Controller
- The Session Lifecycle and State Machine
- 1. CREATE: Constructor Runs
- 2. CONFIGURE: Optional Customization
- 3. READY: Session Awaits Input
- 4. MESSAGE_PROCESSING: Message Arrives
- 5. TOOL_EXECUTION: If Needed
- 6. RESPONSE: Final Output
- 7. IDLE: Ready for Next Message
- Message Format and the Tool Calling Protocol
- What Claude Receives (System Context)
- How Claude Calls Tools (Tool Use Format)
- Extension Points and the Plugin Architecture
- 1. Custom Tools
- 2. Custom System Prompts
- 3. Message Interceptors
- 4. Custom Validators
- Error Handling and Recovery
- 1. Input Validation Errors
- 2. Tool Validation Errors
- 3. Tool Execution Errors
- 4. API Errors
- Performance Characteristics and Optimization
- 1. Reduce Tool Calls
- 2. Batch Operations
- 3. Cache Frequently Accessed Data
- 4. Stream Responses
- Debugging and Introspection
- Concurrency and Parallelism: Handling Multiple Sessions
- Session Isolation
- Queuing and Rate Limiting
- Memory Management and Context Windows
- 1. Summarization
- 2. Rolling Windows
- 3. Selective Retention
- Testing and Mocking
- Real-World Deployment Patterns
- Single-Machine Development Setup
- Server-Based Deployment with Session Pooling
- Multi-Tenant Deployment
- Performance Profiling and Optimization
- Backward Compatibility and Versioning
- Deployment Patterns and Real-World Configurations
- Advanced Memory Management Strategies
- Observability and Debugging in Production
- The Evolution of Patterns: From Prototyping to Scale
- The Reliability Mindset
- Conclusion: The Design Principles
The High-Level Mental Model
Before we dive into internals, let's establish a mental model. The Agent SDK is built on three pillars:
- Session Management: A session holds state (context, tools, working directory, conversation history). It's stateful and persistent across multiple turns. Think of it as your personal Claude instance with memory.
- Message Processing Pipeline: Every message you send goes through a validation→dispatch→execution→response flow. This is where the magic happens. Each step is deliberate.
- Tool Abstraction Layer: Tools are pluggable. You define them, the SDK registers them, validates their inputs, executes them, and feeds results back to Claude. This separation is what makes the whole system robust.
Think of the SDK as a layered onion. At the center is Claude's language model. Around it are tool execution engines (Bash, Read, Write, browser APIs). Around that is the message routing layer. And around everything is the session container that holds it all together.
Core Components and Their Responsibilities
Let's break down the architecture into functional components. Each has a single responsibility, and they communicate via well-defined interfaces.
1. The Session Manager
The SessionManager is the entry point. When you create a ClaudeCodeSession, you're instantiating this component. It owns:
- Configuration state: API keys, model selection, working directory, enabled tools
- Conversation history: Every message and response, kept for multi-turn context
- Tool registry: Mapping of tool names to tool implementations
- Connection state: Whether the session is active, idle, or closed
Here's a simplified conceptual model:
interface SessionConfig {
apiKey: string;
model: string;
workingDirectory: string;
tools: Tool[];
systemPrompt?: string;
maxTokens?: number;
temperature?: number;
}
class SessionManager {
private config: SessionConfig;
private history: Message[] = [];
private toolRegistry: Map<string, Tool> = new Map();
private state: "idle" | "processing" | "error" = "idle";
constructor(config: SessionConfig) {
this.config = config;
this.initializeToolRegistry();
}
async message(userInput: string): Promise<Response> {
// Simplified flow: validate → dispatch → execute → respond
this.state = "processing";
const response = await this.dispatchMessage(userInput);
this.history.push({ role: "user", content: userInput });
this.history.push({ role: "assistant", content: response });
this.state = "idle";
return response;
}
private initializeToolRegistry(): void {
// Register built-in tools (Bash, Read, Write, etc.)
// Register custom tools from config
}
private async dispatchMessage(input: string): Promise<Response> {
// We'll dive deeper into this next
}
}
The SessionManager is an orchestrator. It doesn't execute tools itself—it coordinates the flow between the message processor, tool executor, and response handler. This separation means you can swap implementations without breaking the contract.
2. The Message Processor
The MessageProcessor handles the parsing and validation of incoming messages. It doesn't care about Claude or tools; it just ensures input is well-formed and safe.
interface ProcessedMessage {
text: string;
metadata: {
timestamp: number;
turnNumber: number;
sourceContext?: string;
};
validationErrors: string[];
}
class MessageProcessor {
process(input: string): ProcessedMessage {
const processed: ProcessedMessage = {
text: input.trim(),
metadata: {
timestamp: Date.now(),
turnNumber: this.getTurnNumber(),
},
validationErrors: [],
};
// Validate length
if (processed.text.length === 0) {
processed.validationErrors.push("Message cannot be empty");
}
if (processed.text.length > 10000) {
processed.validationErrors.push("Message exceeds 10,000 character limit");
}
// Validate for injection attacks (basic example)
if (processed.text.includes("DELETE DATABASE")) {
processed.validationErrors.push("Potentially dangerous command detected");
}
return processed;
}
}
This layer is a guard rail. It catches obvious problems before they reach Claude. In production systems, this is where you'd add rate limiting, content filtering, or audit logging. This is also where you'd implement tenant isolation if building multi-tenant systems.
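As an example of the kind of check that belongs at this layer, here is a sketch of a sliding-window rate limiter. The class name and the limits are illustrative, not SDK defaults:

```typescript
// Sketch: a sliding-window rate limiter that could sit in the
// MessageProcessor, before any API call is made.
class RateLimiter {
  private timestamps: Map<string, number[]> = new Map();

  constructor(
    private maxRequests: number,
    private windowMs: number,
  ) {}

  // Returns true if the client is within its budget for the window.
  allow(clientId: string, now: number = Date.now()): boolean {
    const recent = (this.timestamps.get(clientId) ?? []).filter(
      (t) => now - t < this.windowMs,
    );
    if (recent.length >= this.maxRequests) {
      this.timestamps.set(clientId, recent);
      return false;
    }
    recent.push(now);
    this.timestamps.set(clientId, recent);
    return true;
  }
}

const limiter = new RateLimiter(3, 60_000); // 3 messages per minute
const verdicts = [1, 2, 3, 4].map(() => limiter.allow("tenant-a", 1_000));
// First three calls allowed, fourth rejected
```

Because rejection happens before dispatch, a throttled message never costs an API call.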
3. The Tool Registry and Dispatcher
The ToolRegistry is the directory of all tools available to Claude. When Claude says "call the fetch_customer tool," the dispatcher looks it up here.
interface Tool {
name: string;
description: string;
input_schema: JSONSchema;
handler: (input: any) => Promise<any>;
validate?: (input: any) => ValidationResult;
}
class ToolRegistry {
private tools: Map<string, Tool> = new Map();
register(tool: Tool): void {
// Validate tool definition
if (!tool.name || !tool.description || !tool.input_schema) {
throw new Error(`Invalid tool definition: missing required fields`);
}
// Prevent duplicate registration
if (this.tools.has(tool.name)) {
throw new Error(`Tool "${tool.name}" already registered`);
}
this.tools.set(tool.name, tool);
}
get(name: string): Tool | undefined {
return this.tools.get(name);
}
list(): Tool[] {
return Array.from(this.tools.values());
}
}
class ToolDispatcher {
private registry: ToolRegistry;
async dispatch(toolName: string, input: any): Promise<ToolResult> {
const tool = this.registry.get(toolName);
if (!tool) {
return {
success: false,
error: `Tool not found: ${toolName}`,
};
}
// Validate input against the tool's schema
const validation = this.validateInput(input, tool.input_schema);
if (!validation.valid) {
return {
success: false,
error: `Validation failed: ${validation.errors.join(", ")}`,
};
}
// Execute the tool
try {
const result = await tool.handler(input);
return {
success: true,
data: result,
};
} catch (error) {
return {
success: false,
error: `Tool execution failed: ${error.message}`,
};
}
}
private validateInput(input: any, schema: JSONSchema): ValidationResult {
// Use ajv (JSON Schema validator) or similar
// Returns { valid: boolean; errors: string[] }
}
}
Why this architecture? Separating registry from dispatcher lets you:
- Inspect available tools without executing them
- Validate inputs before execution (fail fast)
- Audit tool calls by intercepting at the dispatch layer
- Mock tools for testing by swapping the registry
This is the open/closed principle in action. The dispatcher is closed for modification (it always does the same thing), but open for extension (you add tools by registering them).
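Because every tool call funnels through dispatch, auditing is a thin wrapper around that one choke point. The following is a self-contained sketch; the `Tool`, `ToolResult`, and `AuditingDispatcher` names mirror the conceptual interfaces above, not a public SDK API:

```typescript
// Sketch: an auditing dispatcher that records every call at the choke point.
interface Tool {
  name: string;
  handler: (input: unknown) => unknown;
}

interface ToolResult {
  success: boolean;
  data?: unknown;
  error?: string;
}

class ToolRegistry {
  private tools = new Map<string, Tool>();
  register(tool: Tool): void {
    this.tools.set(tool.name, tool);
  }
  get(name: string): Tool | undefined {
    return this.tools.get(name);
  }
}

class AuditingDispatcher {
  public auditLog: Array<{ tool: string; ok: boolean }> = [];
  constructor(private registry: ToolRegistry) {}

  dispatch(toolName: string, input: unknown): ToolResult {
    const tool = this.registry.get(toolName);
    if (!tool) {
      this.auditLog.push({ tool: toolName, ok: false });
      return { success: false, error: `Tool not found: ${toolName}` };
    }
    try {
      const data = tool.handler(input);
      this.auditLog.push({ tool: toolName, ok: true });
      return { success: true, data };
    } catch (e) {
      this.auditLog.push({ tool: toolName, ok: false });
      return { success: false, error: String(e) };
    }
  }
}

const registry = new ToolRegistry();
registry.register({ name: "echo", handler: (input) => input });
const dispatcher = new AuditingDispatcher(registry);

const ok = dispatcher.dispatch("echo", { msg: "hi" });
const missing = dispatcher.dispatch("nope", {});
```

The same wrapper pattern works for policy enforcement: reject the call before invoking the handler.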
4. The Built-in Tool Executors
The SDK ships with built-in tool implementations: Bash, Read, Write, Browser, and Network. Each is a separate executor with its own error handling.
// Example: Bash executor
class BashExecutor implements Tool {
name = "bash";
description = "Execute shell commands in the working directory";
input_schema = {
type: "object",
properties: {
command: {
type: "string",
description: "The shell command to execute",
},
timeout: {
type: "number",
description: "Timeout in milliseconds (default: 30000)",
},
},
required: ["command"],
};
async handler(input: { command: string; timeout?: number }): Promise<any> {
const timeout = input.timeout || 30000;
// Spawn a child process in the working directory
const process = spawn("bash", ["-c", input.command], {
cwd: this.workingDirectory,
timeout,
});
return new Promise((resolve, reject) => {
let stdout = "";
let stderr = "";
process.stdout.on("data", (data) => {
stdout += data.toString();
});
process.stderr.on("data", (data) => {
stderr += data.toString();
});
process.on("close", (code) => {
resolve({
exitCode: code,
stdout,
stderr,
success: code === 0,
});
});
process.on("error", (error) => {
reject(error);
});
});
}
}
Each executor:
- Defines its own input schema (what parameters it accepts)
- Implements the handler interface (async function)
- Handles its own error cases (command not found, timeout, permission denied)
- Returns structured results that Claude can reason about
This is why the SDK feels so smooth. Each executor is battle-tested, handles edge cases, and provides consistent output. The Bash executor, for example, carefully captures both stdout and stderr, distinguishes between success and failure, and respects timeouts to prevent hanging.
5. The Message Flow Controller
This is the orchestrator that ties everything together. It implements the state machine that governs how messages flow through the system.
type MessageState =
| "idle"
| "awaiting_claude"
| "executing_tool"
| "processing_response"
| "error";
class MessageFlowController {
private state: MessageState = "idle";
private currentToolCall?: ToolCall;
async processMessage(
input: string,
sessionManager: SessionManager,
dispatcher: ToolDispatcher,
): Promise<Response> {
// 1. VALIDATE: Ensure input is well-formed
this.state = "idle";
const processed = sessionManager.validateInput(input);
if (!processed.valid) {
throw new Error(`Validation failed: ${processed.errors}`);
}
// 2. SEND_TO_CLAUDE: Add message to history and call the model
this.state = "awaiting_claude";
const claudeResponse = await this.callClaude(input, sessionManager);
// 3. CHECK_FOR_TOOL_CALLS: Does Claude want to use a tool?
if (claudeResponse.hasToolCall) {
this.currentToolCall = claudeResponse.toolCall;
return this.handleToolCall(
claudeResponse.toolCall,
dispatcher,
sessionManager,
);
}
// 4. RETURN_FINAL_RESPONSE: No more tool calls, Claude is done
this.state = "idle";
return {
text: claudeResponse.text,
toolCalls: [],
finalResponse: true,
};
}
private async handleToolCall(
toolCall: ToolCall,
dispatcher: ToolDispatcher,
sessionManager: SessionManager,
): Promise<Response> {
this.state = "executing_tool";
// Dispatch the tool
const toolResult = await dispatcher.dispatch(toolCall.name, toolCall.input);
// Add the tool result to conversation history
sessionManager.addMessage({
role: "user",
type: "tool_result",
toolName: toolCall.name,
result: toolResult,
});
// Claude may want to use another tool or generate final response
// Loop back to "awaiting_claude"
this.state = "processing_response";
return this.processMessage("", sessionManager, dispatcher);
}
private async callClaude(
input: string,
sessionManager: SessionManager,
): Promise<ClaudeResponse> {
const client = new Anthropic({
apiKey: sessionManager.config.apiKey,
});
const response = await client.messages.create({
model: sessionManager.config.model,
max_tokens: sessionManager.config.maxTokens || 4096,
system: sessionManager.getSystemPrompt(),
tools: sessionManager.getToolDefinitions(),
messages: sessionManager.getHistory(),
});
return {
text: response.content[0].type === "text" ? response.content[0].text : "",
hasToolCall: response.stop_reason === "tool_use",
toolCall: this.extractToolCall(response),
};
}
private extractToolCall(response: Message): ToolCall | undefined {
// Find the tool_use block in the response
const toolUseBlock = response.content.find(
(block) => block.type === "tool_use",
);
if (!toolUseBlock || toolUseBlock.type !== "tool_use") {
return undefined;
}
return {
id: toolUseBlock.id,
name: toolUseBlock.name,
input: toolUseBlock.input,
};
}
}
This controller implements a loop: Claude produces output → we check for tool calls → we execute tools → we feed results back to Claude → repeat until done.
Why is this a loop and not a single request? Because Claude doesn't execute tools; we execute tools on Claude's behalf. When Claude says "call fetch_customer with id='123'," we:
- Execute fetch_customer in our environment
- Get the result
- Tell Claude what happened
- Let Claude decide what to do next (use another tool, generate final response, ask for clarification)
This is the agentic loop. It's the reason Claude Code can work with real systems. It's also why debugging tool calls is easier—you can intercept at each step.
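The loop can be sketched end to end with a scripted stand-in for the model. Everything here (`makeFakeModel`, `runAgentLoop`) is a hypothetical stub to show the shape of the loop, not real SDK code:

```typescript
// Sketch of the agentic loop: ask the model, run any requested tool,
// feed the result back, repeat until the model produces plain text.
type Turn =
  | { kind: "tool_use"; name: string; input: { command: string } }
  | { kind: "text"; text: string };

function makeFakeModel(): (history: string[]) => Turn {
  let calls = 0;
  return (history) => {
    calls++;
    if (calls === 1) {
      // First turn: the "model" requests a tool call.
      return { kind: "tool_use", name: "bash", input: { command: "ls" } };
    }
    // Second turn: it answers using the tool result it was fed.
    return { kind: "text", text: `Done. Last result: ${history[history.length - 1]}` };
  };
}

function runAgentLoop(
  model: (history: string[]) => Turn,
  tools: Record<string, (input: { command: string }) => string>,
): string {
  const history: string[] = [];
  for (let i = 0; i < 10; i++) { // hard cap to avoid infinite loops
    const turn = model(history);
    if (turn.kind === "text") return turn.text;
    const result = tools[turn.name](turn.input);
    history.push(`tool_result(${turn.name}): ${result}`);
  }
  throw new Error("Loop cap exceeded");
}

const answer = runAgentLoop(makeFakeModel(), {
  bash: (input) => `ran "${input.command}"`,
});
```

Note the hard iteration cap: production loops need one, because a model can keep requesting tools indefinitely.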
The Session Lifecycle and State Machine
Understanding the session lifecycle is critical for debugging and optimization. A session goes through predictable states:
CREATE → CONFIGURE → READY → MESSAGE_PROCESSING → (TOOL_EXECUTION)* → RESPONSE → IDLE → MESSAGE_PROCESSING → ...
Let's map this out:
1. CREATE: Constructor Runs
const session = new ClaudeCodeSession({
apiKey: process.env.ANTHROPIC_API_KEY,
model: "claude-3-5-sonnet-20241022",
workingDirectory: "/tmp/project",
});
// At this point:
// - API key is validated
// - Model is set
// - Working directory is created (or verified)
// - Tool registry is initialized with built-in tools
// - Conversation history is empty
// - State: CREATED
2. CONFIGURE: Optional Customization
session.registerTool({
name: "fetch_user_data",
description: "Retrieve user by ID",
input_schema: {
type: "object",
properties: {
userId: { type: "string" },
},
required: ["userId"],
},
handler: async (input) => {
// Your implementation
return { name: "John", email: "john@example.com" };
},
});
// At this point:
// - Custom tools are registered
// - System prompt can be customized
// - State: CONFIGURED
3. READY: Session Awaits Input
The session is configured and waiting. No API calls have happened yet. This is idle time, and it's important for understanding costs. You only pay for API usage when you call message(), not for sessions sitting idle.
4. MESSAGE_PROCESSING: Message Arrives
const response = await session.message(
"What files are in the working directory?",
);
// Timeline:
// T0: message() called
// T1: Input validated (MessageProcessor)
// T2: API call to Anthropic (awaiting Claude)
// T3: Response from Claude received
// T4: Check for tool calls
// T5: If tool calls: execute them (ToolDispatcher)
// T6: If more tool calls needed: loop back to T2
// T7: Final response returned to user
This timeline is critical for understanding latency. The latency you observe includes network round-trips to Anthropic's API plus tool execution time. If Claude is calling 5 different tools, you're waiting for all 5 to execute sequentially.
5. TOOL_EXECUTION: If Needed
This only happens if Claude requested a tool call. The dispatcher:
- Validates the input against the tool's schema
- Executes the tool
- Captures stdout, stderr, exit codes
- Adds the result to the conversation history
- Signals the flow controller to loop back to Claude
The key insight: Tool execution failures are not session failures. If a tool crashes, the error is captured and reported to Claude. Claude can then decide to retry, use a different tool, or explain the issue to the user.
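Concretely, a failed call gets packaged as a tool_result content block with an error flag (the Messages API supports `is_error` on tool_result blocks), so Claude sees the failure as data rather than the session dying. A minimal sketch, reusing the conceptual `ToolResult` shape from earlier:

```typescript
// Sketch: turning a failed ToolResult into the tool_result block Claude
// receives. is_error tells Claude the call failed without ending the turn.
interface ToolResult {
  success: boolean;
  data?: unknown;
  error?: string;
}

function toToolResultBlock(toolUseId: string, result: ToolResult) {
  return {
    type: "tool_result" as const,
    tool_use_id: toolUseId,
    is_error: !result.success,
    content: result.success
      ? JSON.stringify(result.data)
      : result.error ?? "unknown error",
  };
}

const block = toToolResultBlock("tool_use_xyz123", {
  success: false,
  error: "Command failed: ENOENT",
});
```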
6. RESPONSE: Final Output
Claude has made a final statement (no more tool calls), and we return that text to the user.
7. IDLE: Ready for Next Message
The session is back to state IDLE. Conversation history persists. The next message() call will include all previous context. This is where sessions shine—multi-turn conversations feel natural because Claude remembers everything.
Critical insight: Each message() call is not independent. Claude sees the entire history. This is why multi-turn conversations work so well. Claude knows what happened in previous turns and can reference them naturally.
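The accumulation is easy to see in a toy stub (hypothetical, not the real SDK): every call appends to the same history array that the next call would send.

```typescript
// Toy stub showing why turn N sees turns 1..N-1.
interface ChatMsg {
  role: "user" | "assistant";
  content: string;
}

class ToySession {
  history: ChatMsg[] = [];

  message(input: string): string {
    this.history.push({ role: "user", content: input });
    // A real session would send this.history to the API here.
    const reply = `seen ${this.history.length} message(s) so far`;
    this.history.push({ role: "assistant", content: reply });
    return reply;
  }
}

const toy = new ToySession();
toy.message("Create notes.txt");
const second = toy.message("Now add a heading to it"); // "it" resolves via history
```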
Message Format and the Tool Calling Protocol
Let's zoom in on exactly what Claude sees and how it communicates back.
What Claude Receives (System Context)
When you call session.message(), the SDK builds a context and sends it to Claude:
// Example of what gets sent to Claude's API:
{
"model": "claude-3-5-sonnet-20241022",
"max_tokens": 4096,
"system": "You are Claude Code, an AI assistant that helps with...",
"tools": [
{
"name": "bash",
"description": "Execute shell commands...",
"input_schema": { /* JSON Schema */ }
},
{
"name": "read",
"description": "Read file contents...",
"input_schema": { /* JSON Schema */ }
},
// ... more tools
],
"messages": [
{
"role": "user",
"content": "What files are in the working directory?"
}
]
}
Claude sees:
- System prompt: Instructions on how to behave
- Available tools: Complete list with descriptions and schemas
- Message history: Everything that's happened so far
Understanding this is crucial. The system prompt drives behavior. The tool list determines what Claude can do. The history determines context.
How Claude Calls Tools (Tool Use Format)
Claude doesn't execute tools. Instead, it says "I want to use tool X with these parameters." The response looks like:
{
"content": [
{
"type": "tool_use",
"id": "tool_use_xyz123",
"name": "bash",
"input": {
"command": "ls -la"
}
}
],
"stop_reason": "tool_use"
}
The SDK extracts this and passes it to the ToolDispatcher. The dispatcher:
- Validates the input against the Bash tool's schema
- Executes the command
- Captures the output
Then the SDK adds the result back to the message history:
{
"role": "user",
"content": [
{
"type": "tool_result",
"tool_use_id": "tool_use_xyz123",
"content": "drwxr-xr-x 5 user staff 160 Mar 17 10:45 .\ndrwxr-xr-x 10 user staff 320 Mar 17 10:40 .."
}
]
}
And Claude continues. It might use another tool, or it might have enough information to respond to the user.
Why this design? It enforces a clear separation: Claude doesn't execute code; it describes code execution. We execute on its behalf and report results. This is safer, more auditable, and lets us sandbox Claude's access. If you need to revoke Claude's access to a tool, you just unregister it. No code changes needed.
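Revocation really is just a registry operation. A sketch (the `unregister` method is a hypothetical mirror of `register`, not a documented SDK call):

```typescript
// Sketch: revoking Claude's access to a tool by removing its registry entry.
class ToolRegistry {
  private tools = new Map<string, () => string>();

  register(name: string, handler: () => string): void {
    this.tools.set(name, handler);
  }

  unregister(name: string): boolean {
    return this.tools.delete(name);
  }

  list(): string[] {
    return [...this.tools.keys()];
  }
}

const registry = new ToolRegistry();
registry.register("bash", () => "ok");
registry.register("read", () => "ok");
registry.unregister("bash"); // Claude can no longer see or call bash
```

Since the tool list sent to the API is built from the registry on each turn, the next request simply omits the revoked tool.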
Extension Points and the Plugin Architecture
The SDK is designed for extension. Here's where you can hook in:
1. Custom Tools
Register tools that fit your domain:
session.registerTool({
name: "query_database",
description: "Execute SQL queries",
input_schema: {
/* ... */
},
handler: async (input) => {
// Your database logic
},
});2. Custom System Prompts
Change how Claude behaves:
session.setSystemPrompt(`
You are an expert backend engineer reviewing pull requests.
Focus on: security, performance, and maintainability.
Always check for SQL injection vulnerabilities.
`);
3. Message Interceptors
Hook into the message flow to log, filter, or modify messages:
session.onBeforeToolCall((toolCall) => {
console.log(`Tool called: ${toolCall.name}`);
// Could reject certain tools here
return toolCall;
});
session.onAfterToolExecution((toolCall, result) => {
console.log(`Tool completed: ${toolCall.name}`);
// Could modify the result here
return result;
});
4. Custom Validators
Add validation logic before tools execute:
session.registerValidator("bash", (input) => {
if (input.command.includes("rm -rf")) {
return { valid: false, reason: "Dangerous command blocked" };
}
return { valid: true };
});
These extension points let you:
- Enforce security policies (no dangerous commands)
- Audit usage (log who did what)
- Customize behavior (rate limit, sandboxing)
- Integrate with external systems (send tool results to analytics)
In production systems, you'll use all of these. Security validators prevent accidents. Interceptors enable audit trails. Custom tools connect Claude to your business systems.
Error Handling and Recovery
The architecture handles errors at multiple levels:
1. Input Validation Errors
Caught immediately, before API calls:
try {
const response = await session.message("x".repeat(10001)); // Too long
} catch (error) {
// Error: Message exceeds 10,000 character limit
// State: IDLE (no API call made)
}
2. Tool Validation Errors
Caught before execution:
// Claude tries to call fetch_user with missing userId parameter
try {
await dispatcher.dispatch("fetch_user", {});
} catch (error) {
// Error: Validation failed
// The tool's handler never runs
// State: READY_FOR_RETRY (Claude can try again with correct params)
}
3. Tool Execution Errors
Caught during execution:
// Tool handler throws an error
const result = await dispatcher.dispatch("bash", {
command: "/nonexistent/file",
});
// Result: { success: false, error: "Command failed: ENOENT" }
// The error is returned to Claude as context
// Claude can decide what to do next
4. API Errors
Network or API issues:
try {
const response = await session.message("Hello");
// Network error, timeout, rate limit, invalid key, etc.
} catch (error) {
// Error: Network/API error
// Message history is preserved
// Session can retry from this point
}
Key insight: Errors don't necessarily break the conversation. Many errors are recoverable. The SDK feeds them back to Claude, which can retry, use a different tool, or explain the problem to the user. This resilience is baked into the architecture.
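Because history is preserved across a failed API call, retrying is safe. A retry wrapper can be sketched like this (real code would back off between attempts; the delay is omitted to keep the sketch synchronous, and the flaky function just simulates transient network errors):

```typescript
// Sketch: retry a transient failure; conversation history is untouched,
// so re-invoking the same call is safe.
function withRetry<T>(fn: () => T, maxAttempts: number): T {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return fn();
    } catch (e) {
      lastError = e; // swallow and retry
    }
  }
  throw lastError;
}

let attempts = 0;
const result = withRetry(() => {
  attempts++;
  if (attempts < 3) throw new Error("ECONNRESET"); // simulated network error
  return "ok";
}, 5);
```

In practice you would retry only errors worth retrying (timeouts, rate limits, 5xx) and fail fast on things like an invalid API key.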
Performance Characteristics and Optimization
Understanding the architecture helps you optimize for production:
1. Reduce Tool Calls
Each tool call adds latency (API round-trip + execution). Design tools to minimize this:
Bad: 10 separate tool calls to fetch individual customer fields
Good: One tool call that fetches the entire customer object
// Bad pattern:
await bash("get-customer-name", { id: "123" });
await bash("get-customer-email", { id: "123" });
await bash("get-customer-orders", { id: "123" });
// Good pattern:
await bash("get-customer", { id: "123", include: ["name", "email", "orders"] });
2. Batch Operations
If Claude needs to process multiple items, batch them:
// Tool definition
{
name: "batch_process",
handler: async (input: { items: string[] }) => {
// Process all items at once
// Faster than 100 individual calls
}
}
3. Cache Frequently Accessed Data
If Claude repeatedly asks for the same data, cache it:
session.registerTool({
name: "get_config",
handler: async (input) => {
// First call: expensive
// Subsequent calls: cached
return cache.get("config") || fetchConfig();
},
});
4. Stream Responses
For long-running operations, stream progress:
session.onMessage(async (input) => {
return {
stream: true, // Indicate streaming
streamHandler: async (chunk) => {
// Each chunk arrives as it's generated
},
};
});
Debugging and Introspection
The architecture exposes debugging capabilities:
// Inspect session state
console.log(session.getState()); // "idle", "processing", etc.
// View conversation history
console.log(session.getHistory());
// [
// { role: "user", content: "list files" },
// { role: "assistant", content: "..." },
// ]
// List registered tools
console.log(session.getTools());
// View the last API request/response
console.log(session.getLastRequest());
console.log(session.getLastResponse());
// Enable debug logging
session.setDebug(true);
// Outputs: [API] Request sent, [TOOL] bash called, [RESULT] ...
With this introspection, you can diagnose issues, understand what Claude is thinking, and optimize your setup. You can see exactly what gets sent to Claude's API, which is invaluable for debugging strange behavior.
Concurrency and Parallelism: Handling Multiple Sessions
In production, you'll often need multiple concurrent sessions. The architecture handles this gracefully.
Session Isolation
Each session is fully isolated:
// Create two independent sessions
const session1 = new ClaudeCodeSession({
apiKey: process.env.ANTHROPIC_API_KEY,
workingDirectory: "/tmp/project-a",
});
const session2 = new ClaudeCodeSession({
apiKey: process.env.ANTHROPIC_API_KEY,
workingDirectory: "/tmp/project-b",
});
// Run concurrently
const [result1, result2] = await Promise.all([
session1.message("List files"),
session2.message("List files"),
]);
// Each sees its own directory
console.log(result1); // Files from /tmp/project-a
console.log(result2); // Files from /tmp/project-b
Sessions don't share state. Each maintains its own:
- Conversation history
- Tool registry
- Working directory
- Configuration
This means you can spawn dozens of sessions in parallel without interference. Perfect for batch processing, parallel code reviews, or multi-tenant systems.
Queuing and Rate Limiting
If you need to throttle requests, the architecture supports it:
class SessionPool {
private queue: Array<() => Promise<any>> = [];
private activeCount = 0;
private maxConcurrent = 5;
async enqueue<T>(fn: () => Promise<T>): Promise<T> {
return new Promise((resolve, reject) => {
this.queue.push(async () => {
try {
const result = await fn();
resolve(result);
} catch (error) {
reject(error);
}
});
this.process();
});
}
private async process(): Promise<void> {
if (this.activeCount >= this.maxConcurrent || this.queue.length === 0) {
return;
}
this.activeCount++;
const fn = this.queue.shift()!;
await fn();
this.activeCount--;
this.process();
}
}
// Usage
const pool = new SessionPool();
const results = await Promise.all(
items.map((item) => pool.enqueue(() => session.message(`Process: ${item}`))),
);
The architecture supports this because sessions don't depend on global state. You can queue them, throttle them, load-balance them across servers.
Memory Management and Context Windows
One subtle but critical aspect: how much history does Claude see?
When you call session.message(), the SDK builds a context array with:
- System prompt (usually ~1000 tokens)
- Conversation history (grows with each turn)
- Tool definitions (fixed, but can be large if you have many tools)
// In your MessageFlowController
const context = {
system: this.getSystemPrompt(), // Fixed size
messages: this.getHistory(), // Grows over time
tools: this.getToolDefinitions(), // Fixed size
};
// By the 100th turn, history might look like:
{
system: "You are Claude Code...", // 1000 tokens
messages: [
{ role: "user", content: "First question" }, // T1
{ role: "assistant", content: "..." }, // T50
{ role: "user", content: "Second question" }, // T1
{ role: "assistant", content: "..." }, // T100
// ... 98 more turns ...
],
tools: [ /* 20 tools, ~2000 tokens */ ] // 2000 tokens
}
The problem: As the history grows, the total context grows. Eventually, you approach the model's context limit (usually 100k-200k tokens for Claude). This impacts:
- Latency: Larger context = slower API requests
- Cost: You pay per token, including context
- Quality: Older context becomes less relevant
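Before picking a mitigation, you need to know when you're near the limit. A rough sketch using the common chars/4 heuristic (an approximation, not an exact tokenizer; the budget value is illustrative):

```typescript
// Sketch: estimate context size to decide when to trim history.
interface HistoryMsg {
  role: string;
  content: string;
}

function estimateTokens(messages: HistoryMsg[], systemPrompt: string): number {
  const chars =
    systemPrompt.length +
    messages.reduce((sum, m) => sum + m.content.length, 0);
  return Math.ceil(chars / 4); // ~4 chars per token heuristic
}

function shouldTrim(
  messages: HistoryMsg[],
  systemPrompt: string,
  budget: number,
): boolean {
  return estimateTokens(messages, systemPrompt) > budget;
}

const history: HistoryMsg[] = Array.from({ length: 100 }, (_, i) => ({
  role: i % 2 === 0 ? "user" : "assistant",
  content: "x".repeat(400), // ~100 tokens per message
}));
const trim = shouldTrim(history, "You are Claude Code...", 5_000);
```

For billing-grade accuracy you'd use a real token counter, but a heuristic like this is usually enough to trigger summarization or trimming.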
Solutions the architecture provides:
1. Summarization
Summarize old history:
class SessionManager {
private history: Message[] = [];
private maxHistoryMessages = 50;
async addMessage(message: Message): Promise<void> {
this.history.push(message);
// If history grows too large, summarize it
if (this.history.length > this.maxHistoryMessages) {
await this.summarizeOldMessages();
}
}
private async summarizeOldMessages(): Promise<void> {
const oldMessages = this.history.slice(0, -20); // Keep last 20
const summaryPrompt = `Summarize this conversation for context:\n${JSON.stringify(oldMessages)}`;
const summary = await this.callClaude(summaryPrompt);
// Replace old messages with summary
this.history = [
{
role: "system",
content: `Summary of earlier conversation: ${summary}`,
},
...this.history.slice(-20),
];
}
}
2. Rolling Windows
Keep only recent history:
class SessionManager {
async message(input: string): Promise<Response> {
// Only send the last 10 messages
const recentHistory = this.history.slice(-10);
const response = await this.callClaudeWithHistory(input, recentHistory);
this.history.push({ role: "user", content: input });
this.history.push({ role: "assistant", content: response });
return response;
}
}
3. Selective Retention
Keep important messages, discard trivial ones:
private isImportant(message: Message): boolean {
// Tool calls and results are important
if (message.hasToolCall || message.type === "tool_result") {
return true;
}
// Long user messages are important
if (message.role === "user" && message.content.length > 100) {
return true;
}
// Short acknowledgments are not
return false;
}
Understanding memory management is crucial for production use. A 500-turn conversation will eventually become unwieldy unless you manage context carefully.
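A retention predicate like the one above becomes useful as a pruning pass over the history. A sketch, combining selective retention with a small rolling window (the predicate is a simplified version of isImportant):

```typescript
// Sketch: keep important messages plus the most recent few.
interface Msg {
  role: "user" | "assistant";
  content: string;
  type?: "tool_result";
}

function isImportant(message: Msg): boolean {
  if (message.type === "tool_result") return true; // tool results matter
  if (message.role === "user" && message.content.length > 100) return true;
  return false; // short acknowledgments can be dropped
}

function prune(history: Msg[], keepRecent: number): Msg[] {
  const recent = new Set(history.slice(-keepRecent));
  return history.filter((m) => isImportant(m) || recent.has(m));
}

const history: Msg[] = [
  { role: "user", content: "hi" },
  { role: "assistant", content: "hello" },
  { role: "user", content: "q".repeat(200) }, // long: retained
  { role: "assistant", content: "ok", type: "tool_result" }, // retained
  { role: "user", content: "thanks" }, // retained only as recent
  { role: "assistant", content: "np" }, // retained only as recent
];
const pruned = prune(history, 2);
```

One caveat: if you prune a tool_use message, prune its paired tool_result too, or the API will see an orphaned reference.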
Testing and Mocking
The architecture's modularity makes testing straightforward:
describe("MessageFlowController", () => {
it("should call a tool when Claude requests it", async () => {
// Mock the tool
const mockTool = {
name: "test_tool",
handler: jest.fn().mockResolvedValue({ success: true }),
};
// Create a session with the mock tool
const session = new ClaudeCodeSession({
apiKey: "fake-key",
tools: [mockTool],
});
// Mock Claude's response to request a tool call
jest.spyOn(session, "callClaude").mockResolvedValue({
toolCall: {
name: "test_tool",
input: { param: "value" },
},
});
// Execute
await session.message("Use test_tool");
// Verify the tool was called with correct input
expect(mockTool.handler).toHaveBeenCalledWith({ param: "value" });
});
});
You can test:
- Tool dispatch logic
- Validation rules
- Error handling
- State transitions
Without any API calls. Pure, fast unit tests.
Real-World Deployment Patterns
Understanding the architecture helps you deploy confidently. Different deployment contexts need different configurations.
Single-Machine Development Setup
For local development, simplicity wins:
const session = new ClaudeCodeSession({
apiKey: process.env.ANTHROPIC_API_KEY,
model: "claude-3-5-sonnet-20241022",
workingDirectory: process.cwd(),
tools: ["bash", "read", "write"],
});
// Session runs entirely locally
// No cross-machine concerns
// Maximum flexibility
This setup is great for experimentation, prototyping, and learning. The downside: no persistence, no scalability, no audit trail.
Server-Based Deployment with Session Pooling
For production, you typically run multiple sessions in a pool:
class SessionServer {
  private pool: SessionPool;
  private database: AuditLog;

  constructor() {
    this.pool = new SessionPool({
      maxSessions: 50,
      sessionTimeout: 3600000, // 1 hour
      idleCleanup: 300000, // 5 minutes
    });
  }

  async handleRequest(request: SessionRequest): Promise<Response> {
    const session = await this.pool.acquire();
    try {
      const result = await session.message(request.prompt);
      await this.database.log({
        clientId: request.clientId,
        sessionId: session.id,
        prompt: request.prompt,
        result: result,
        timestamp: Date.now(),
      });
      return result;
    } finally {
      this.pool.release(session);
    }
  }
}
This pattern:
- Reuses sessions (faster than creating new ones)
- Limits concurrent sessions (resource control)
- Logs everything (audit trail)
- Cleans up idle sessions (prevents leaks)
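The article references a SessionPool without showing its internals. As a rough illustration of the acquire/release mechanics (the class, option names, and factory callback here are my own sketch, not the SDK's actual API), a minimal generic pool might look like:

```typescript
// Minimal generic pool sketch: reuse idle sessions, cap concurrency,
// and queue callers when the pool is exhausted.
type PoolOptions<T> = {
  maxSessions: number;
  create: () => T; // factory for new sessions (placeholder)
};

class SimplePool<T> {
  private idle: T[] = [];
  private inUse = new Set<T>();
  private waiters: ((s: T) => void)[] = [];

  constructor(private opts: PoolOptions<T>) {}

  async acquire(): Promise<T> {
    // Reuse an idle session if one exists -- faster than creating new ones
    const existing = this.idle.pop();
    if (existing) {
      this.inUse.add(existing);
      return existing;
    }
    // Create a new session only while under the concurrency cap
    if (this.inUse.size < this.opts.maxSessions) {
      const created = this.opts.create();
      this.inUse.add(created);
      return created;
    }
    // Pool exhausted: wait until someone releases a session
    return new Promise((resolve) => this.waiters.push(resolve));
  }

  release(session: T): void {
    this.inUse.delete(session);
    const waiter = this.waiters.shift();
    if (waiter) {
      // Hand the session directly to the next queued caller
      this.inUse.add(session);
      waiter(session);
    } else {
      this.idle.push(session);
    }
  }
}
```

A production pool would add idle timeouts and health checks on reuse, but the core invariant is the same: the number of live sessions never exceeds the cap, and released sessions are recycled before new ones are created.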
Multi-Tenant Deployment
In multi-tenant systems, isolation is critical:
class TenantManager {
  private sessionsByTenant: Map<string, SessionPool> = new Map();

  async getSessionForTenant(tenantId: string): Promise<ClaudeCodeSession> {
    if (!this.sessionsByTenant.has(tenantId)) {
      // Create a dedicated pool for this tenant
      this.sessionsByTenant.set(
        tenantId,
        new SessionPool({
          maxSessions: 10, // Per-tenant limit
          workingDirectory: `/data/tenants/${tenantId}`,
          tools: await this.getAuthorizedTools(tenantId),
        }),
      );
    }
    // Non-null assertion is safe: the pool was just created if it was missing
    return this.sessionsByTenant.get(tenantId)!.acquire();
  }

  private async getAuthorizedTools(tenantId: string): Promise<Tool[]> {
    // Different tenants can access different tools
    // based on their license tier or permissions
    const tier = await this.getTenantTier(tenantId);
    return toolsByTier[tier];
  }
}
This ensures:
- Tenants can't interfere with each other
- Resource limits prevent one tenant from starving others
- Permissions are enforced at the session level
- Working directories are isolated
Performance Profiling and Optimization
Once you understand the architecture, you can profile and optimize. The SDK exposes hooks for instrumentation:
// Measure message latency
class PerformanceMonitor {
  async measureMessage(
    session: ClaudeCodeSession,
    prompt: string,
  ): Promise<PerformanceMetrics> {
    const startTime = performance.now();

    const validationStart = performance.now();
    const processed = session.validateInput(prompt);
    const validationTime = performance.now() - validationStart;

    const claudeStart = performance.now();
    const response = await session.callClaude(prompt);
    const claudeTime = performance.now() - claudeStart;

    let executionTime = 0;
    if (response.hasToolCall) {
      const execStart = performance.now();
      // Tool execution happens here
      executionTime = performance.now() - execStart;
    }

    const totalTime = performance.now() - startTime;
    return {
      total: totalTime,
      validation: validationTime,
      apiCall: claudeTime,
      execution: executionTime,
      overhead: totalTime - claudeTime, // Everything except API call
    };
  }
}
With this instrumentation, you can identify where time is actually spent:
- If apiCall is 90%+ of the time: you're I/O bound. Parallelize across multiple sessions.
- If overhead is high: the work outside the API call (validation, bookkeeping, tool handling) is expensive. Optimize those paths.
- If execution is high: tool execution is the bottleneck. Run tools in parallel or use faster implementations.
This empirical approach prevents premature optimization. Measure first, optimize where it matters.
Backward Compatibility and Versioning
The SDK needs to evolve, but existing code shouldn't break. The architecture supports this through careful versioning:
// SDK version tracking
const SDK_VERSION = "1.2.0";
class ClaudeCodeSession {
  private apiVersion = "2024-03-17"; // API version session was created with
  private sdkVersion = SDK_VERSION;

  async message(input: string): Promise<Response> {
    // Use the API version this session was created with,
    // even if the SDK is updated to a newer version
    const response = await this.callClaudeWithVersion(input, this.apiVersion);
    return response;
  }
}
// Breaking changes happen in major versions
// Deprecations happen gradually:
// 1.0.0: Feature exists
// 1.1.0: Feature marked deprecated, new API provided
// 2.0.0: Feature removed
class SessionManager {
  registerTool(tool: Tool | LegacyTool): void {
    if (isLegacyTool(tool)) {
      console.warn("Tool format deprecated in v2.0. Please update.");
      // Still works, but with a warning
      tool = adaptLegacyTool(tool);
    }
    this.toolRegistry.register(tool);
  }
}
This approach lets teams upgrade gradually without emergency migrations.
Deployment Patterns and Real-World Configurations
Understanding the architecture is one thing. Applying it effectively in production is another. Let's walk through how different organizations deploy the Agent SDK based on their specific constraints and opportunities.
Startups and small teams benefit from the simplicity of single-session deployments. Minimal operational overhead, no need for session pooling or distributed tracing. A single developer spins up a session, sends messages, and gets results. The SDK handles all the complexity behind the scenes. This lets small teams punch above their weight—you get sophisticated AI-assisted workflows without needing to manage the infrastructure that makes them possible.
Mid-size organizations typically adopt session pools with basic load balancing. Run multiple sessions, distribute requests across them, set resource limits per session. At this scale, you're starting to think about availability: if one session crashes, others keep working. You're starting to think about fairness: one long-running operation shouldn't starve others. The architecture supports this naturally—sessions are isolated, so you can implement straightforward queue-and-worker patterns.
Large enterprises implement multi-region deployments with sophisticated logging, monitoring, and compliance requirements. Sessions might be pinned to specific regions for data residency. Requests might be logged to audit systems. Different user tiers might get different tool access. The architecture supports all of this through its extension points. You're not modifying the SDK—you're plugging your infrastructure into it.
The key insight is that the architecture doesn't force a particular deployment pattern. It provides the building blocks that support patterns ranging from toy projects to enterprise-scale systems.
Advanced Memory Management Strategies
Earlier we discussed summarization and rolling windows. Let's go deeper into memory optimization because it's critical for long-running systems.
The naive approach is to keep the entire history forever. This works fine for short conversations (10-50 turns) but breaks for long conversations (hundreds of turns). The context window fills up, cost increases linearly with history length, and older information becomes noise.
A sophisticated approach uses semantic compression. Instead of summarizing old messages as text, extract their semantic meaning. "User asked about authentication issues, we diagnosed it as a JWT expiration problem, implemented token refresh" becomes a structured fact: {topic: "auth", issue: "jwt_expiration", resolution: "token_refresh"}. Store these facts in a separate structured format that Claude can reference without them taking up tokens in the message history.
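The structured-fact idea can be sketched concretely. The `SemanticFact` shape and `FactStore` class below are illustrative names of my own, mirroring the auth example above; a real system would extract the facts with a model call rather than by hand:

```typescript
// Semantic compression sketch: store resolved exchanges as compact
// structured facts instead of keeping the raw message text in history.
interface SemanticFact {
  topic: string;
  issue: string;
  resolution: string;
}

class FactStore {
  private facts: SemanticFact[] = [];

  add(fact: SemanticFact): void {
    this.facts.push(fact);
  }

  // Render all facts as a compact context block -- far fewer tokens
  // than the raw conversation turns they summarize.
  toContextBlock(): string {
    return this.facts
      .map((f) => `[${f.topic}] ${f.issue} -> ${f.resolution}`)
      .join("\n");
  }
}
```

For example, `store.add({ topic: "auth", issue: "jwt_expiration", resolution: "token_refresh" })` reduces a whole diagnostic exchange to one line that can be injected into context when the topic resurfaces.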
Implement hierarchical memory. Recent messages stay in the conversation history (full fidelity). Older messages are compressed into summaries (lower fidelity). Older summaries are further compressed into just facts (minimal tokens). When Claude needs to reference something old, you can fetch the full history if necessary. This creates a memory system that's efficient for the common case (recent context) but can still access older information if needed.
Use importance scoring to decide what to keep. Assign scores to messages based on how much information they contain. "User provided error message" is low importance. "User explained the entire system architecture" is high importance. When memory is constrained, discard low-importance messages first. This biases toward keeping information that matters.
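As a minimal sketch of importance-based pruning (the length-style scoring is left to the caller here; real systems might score with embeddings or model judgments, and these names are mine, not the SDK's):

```typescript
// Importance-scored pruning sketch: when memory is constrained, keep only
// the highest-scoring messages, preserving their original order.
interface ScoredMessage {
  text: string;
  score: number;
}

function pruneByImportance(
  messages: ScoredMessage[],
  maxMessages: number,
): ScoredMessage[] {
  if (messages.length <= maxMessages) return messages;
  // Pick the top-scoring messages (sort a copy so input order is untouched)
  const keep = new Set(
    [...messages].sort((a, b) => b.score - a.score).slice(0, maxMessages),
  );
  // Filter in original order so the surviving conversation still reads coherently
  return messages.filter((m) => keep.has(m));
}
```

The key design choice is the final `filter`: selection happens by score, but the retained messages stay in chronological order, which matters for conversation coherence.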
Implement topic-based organization. Group messages by topic (authentication, payments, logging, etc.). Within each topic, apply different retention strategies. Topics that are actively being discussed stay in full context. Topics that are inactive get summarized. Topics that are very old get compressed to facts. This matches how humans remember—recent and relevant stuff stays sharp, old stuff becomes fuzzy.
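A topic-bucketed memory with per-topic demotion could be sketched like this (the `TopicMemory` class and its state shape are hypothetical, and the summarizer is injected so any summarization strategy can plug in):

```typescript
// Topic-bucketed retention sketch: active topics keep full messages;
// inactive topics are demoted to a single summary string.
type TopicState =
  | { kind: "active"; messages: string[] }
  | { kind: "summarized"; summary: string };

class TopicMemory {
  private topics = new Map<string, TopicState>();

  record(topic: string, message: string): void {
    const state = this.topics.get(topic);
    if (state && state.kind === "active") {
      state.messages.push(message);
    } else {
      // New topic, or a summarized topic becoming active again
      this.topics.set(topic, { kind: "active", messages: [message] });
    }
  }

  // Demote an inactive topic: replace its messages with one summary line
  summarize(topic: string, summarizer: (msgs: string[]) => string): void {
    const state = this.topics.get(topic);
    if (state?.kind === "active") {
      this.topics.set(topic, {
        kind: "summarized",
        summary: summarizer(state.messages),
      });
    }
  }

  get(topic: string): TopicState | undefined {
    return this.topics.get(topic);
  }
}
```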
Observability and Debugging in Production
When your sessions are running in production, handling thousands of concurrent requests, you need visibility. The architecture provides hooks for instrumentation, but you need to know which ones to use.
Instrument the message flow controller to track state transitions. "Session moved from IDLE to AWAITING_CLAUDE at T=0s. Received response at T=2.3s. Started executing tool at T=2.4s." This timeline tells you where latency is coming from. If most latency is in AWAITING_CLAUDE, it's Claude's API. If most is in tool execution, your tools are slow.
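A timeline recorder like the one described could be sketched as follows. The `SessionTimeline` class is my own illustration (the SDK's actual instrumentation hooks may differ); state names follow the article's state machine, and the timestamp parameter is injectable so the derivation is testable:

```typescript
// Timeline instrumentation sketch: record state transitions with
// timestamps, then derive how long the session spent in each state.
class SessionTimeline {
  private start = Date.now();
  private transitions: { state: string; at: number }[] = [];

  // `at` is ms since the timeline started; defaults to "now"
  enter(state: string, at: number = Date.now() - this.start): void {
    this.transitions.push({ state, at });
  }

  // Total time per state, derived from consecutive transitions.
  // The final (still-open) state is not counted until it is exited.
  durations(): Record<string, number> {
    const out: Record<string, number> = {};
    for (let i = 0; i < this.transitions.length - 1; i++) {
      const { state, at } = this.transitions[i];
      out[state] = (out[state] ?? 0) + (this.transitions[i + 1].at - at);
    }
    return out;
  }
}
```

Feeding the example timeline from the text into this recorder makes the latency attribution mechanical: the state with the largest accumulated duration is where to look first.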
Instrument tool dispatchers to track which tools are called, with what input, how long they take, what they return. Build a dashboard showing tool usage patterns. "The bash tool is called 10,000 times/day, averaging 200ms. The read tool is called 50,000 times/day, averaging 50ms." This data reveals where to optimize—maybe bash commands are slow because they're spawning many processes, and you could batch them.
Instrument error handling to distinguish between different error types. A tool returning an error is different from a tool crashing. A validation failure is different from a timeout. An API error from Claude is different from a network error. Categorizing errors lets you route them differently—some might retry automatically, some might alert on-call, some might just log.
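A simple router for that categorization might look like the sketch below. The category names, the message patterns, and the `ToolCrashError` name are illustrative assumptions, not SDK-defined types:

```typescript
// Error-routing sketch: classify failures so each category can be handled
// differently (automatic retry, on-call alert, or plain logging).
type ErrorCategory = "retry" | "alert" | "log";

function categorizeError(err: Error): ErrorCategory {
  // Transient network/timeout failures: usually safe to retry automatically
  if (/timeout|ECONNRESET|ETIMEDOUT/i.test(err.message)) return "retry";
  // A tool crashing (as opposed to returning an error) should page someone
  if (err.name === "ToolCrashError") return "alert";
  // Validation failures and expected tool errors: record and move on
  return "log";
}
```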
Use structured logging. Instead of plain text logs, emit JSON with consistent fields: timestamp, session_id, user_id, message_type, duration, error_code. This lets you query your logs: "Show me all sessions with errors in the tool dispatcher." "How many message processing timeouts occurred in the last hour?" Structured logs are indexable and analyzable in ways plain text never will be.
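A minimal structured-log emitter, using the field names suggested above (the `emitLog` helper and `LogEntry` shape are my own sketch; adapt the fields to whatever your log pipeline indexes):

```typescript
// Structured-logging sketch: emit one JSON object per line with consistent
// fields, instead of free-form text.
interface LogEntry {
  timestamp: string;
  session_id: string;
  user_id?: string;
  message_type: string;
  duration_ms?: number;
  error_code?: string;
}

function emitLog(entry: Omit<LogEntry, "timestamp">): string {
  const line = JSON.stringify({
    timestamp: new Date().toISOString(),
    ...entry,
  });
  console.log(line); // one JSON object per line: indexable and queryable
  return line;
}
```

With logs in this shape, queries like "all sessions with errors in the tool dispatcher" become a field filter rather than a regex over prose.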
Build synthetic tests that exercise the SDK in ways production doesn't. A test session that goes through 500 turns to check memory management. A test that spawns 1000 concurrent sessions to check for resource leaks. A test that generates pathological inputs (huge strings, deeply nested objects, circular references) to check error handling. These tests catch issues before production sees them.
The Evolution of Patterns: From Prototyping to Scale
The SDK supports a maturity curve. You start simple and grow sophisticated without rewriting fundamentally.
Phase 1: Prototyping. Single session, simple tools, no optimization. You're learning how Claude handles your domain, what patterns work, what instructions are effective. Minimal code. Minimal infrastructure.
Phase 2: Production MVP. Session pool, basic monitoring, structured logging. You're handling real requests, so you need availability and observability. Still straightforward—no distributed tracing or multi-region complexity.
Phase 3: Scale. Load balancing, regional deployment, advanced memory management, comprehensive monitoring. You're handling significant volume and need to think about efficiency and reliability at scale.
Phase 4: Sophistication. Custom session management, specialized tool dispatch, contextual memory, advanced optimization. You've integrated the SDK deeply into your infrastructure and are extracting maximum value.
Each phase builds on the previous one. You're not replacing the SDK—you're using it in more sophisticated ways. The SDK's architecture supports this progression because it's designed with extension points and clear separation of concerns.
The Reliability Mindset
The Agent SDK is built on a reliability-first mindset. Every component is designed with failure modes in mind. Tools can fail—that's expected, not an error. Claude can ask for things that aren't available—the SDK handles it gracefully. Network requests can time out—the SDK has strategies for recovery.
This mindset shapes how you deploy and operate the SDK. You don't assume sessions will run forever without issues. You design for restarts, for failures, for graceful degradation. You monitor proactively. You have runbooks for common failure modes.
The SDK gives you tools to build reliable systems. It's up to you to use them effectively. Understand the architecture, instrument it well, monitor it closely, test it thoroughly. Do those things, and you can confidently deploy AI-assisted workflows that the SDK orchestrates reliably.
Conclusion: The Design Principles
The Agent SDK's architecture reflects several core design principles:
- Separation of concerns: Each component has a single responsibility
- Clear interfaces: Tools, validators, and executors have well-defined contracts
- Layered security: Validation happens at multiple levels
- Auditability: Every step can be logged and inspected
- Extensibility: Custom tools, validators, and interceptors hook into defined extension points
- Robustness: Errors are caught, contextualized, and fed back to Claude
- Concurrency-friendly: Sessions are isolated, supporting parallel execution
- Production-ready: Built-in support for streaming, memory management, and testing
- Observable: Instrumentation hooks for performance monitoring
- Evolvable: Versioning and backward compatibility built in
Understanding this architecture means you can:
- Debug effectively: Know where problems occur in the pipeline
- Optimize for your use case: Adjust tools, prompts, and flow
- Scale confidently: Handle hundreds of concurrent sessions
- Extend fearlessly: Add tools, validators, and behaviors knowing the design patterns
- Build safer systems: Understand where security boundaries are
- Deploy strategically: Choose the right deployment pattern for your use case
- Monitor comprehensively: Instrument and measure what matters
- Upgrade smoothly: Understand versioning and backward compatibility
This isn't a simple library. It's a foundation for building intelligent systems that interact with your infrastructure. And now you understand how it works, from message entry to response exit, with all the safety rails in place. You can deploy it with confidence, knowing that the design is solid and that you can reason about its behavior at every level.
The architects and engineers who built this SDK made intentional decisions at every layer. They prioritized clarity over cleverness, robustness over performance, flexibility over prescriptiveness. Those decisions create a system that can grow with your needs—from a simple toy project to a mission-critical component of your infrastructure.