
State management is the unsexy part of building agents. Nobody gets excited about it in demos. But in production? It's the difference between an agent that works and one that forgets everything after five messages, hallucinates context, or breaks when you restart it.
When you're building agents with Claude, state isn't magic. It's just the conversation history—the message list you send to the API each time. Simple concept, but managing it properly requires understanding some subtle constraints and tradeoffs. This guide covers how to track state, work within token limits, persist conversations across sessions, implement branching, and serialize state for storage.
Let's be clear about what "state" means here: it's the accumulated knowledge of the conversation. What the agent has seen, what it's been told, what it's done, and what the user knows it's done. That state lives in the message history you maintain.
Think about your own experience with long conversations. You remember what was discussed five turns ago because you have memory. Claude doesn't have that memory by default. Every API call is fresh. If you don't explicitly maintain the conversation history and send it back each time, Claude has no idea what you've already discussed. This is actually a feature, not a bug—it gives you complete control over what context Claude sees. But it means you, the developer, must manage that context intentionally.
Table of Contents
- Understanding the Conversation Message Model
- Managing Context Window Limits
- Persisting Conversation State Across Sessions
- Implementing Conversation Branching and Forking
- State Serialization and Deserialization
- Putting It All Together: A Production-Ready Agent
- Best Practices and Common Pitfalls
- Why This Matters: State is the Invisible Foundation
- Common Pitfalls in State Management
- Under the Hood: Token Counting Deep Dive
- Real-World Scenario: A Long-Running Customer Support Agent
- Alternatives to Token Truncation
- Troubleshooting Conversation State Issues
- Production Considerations: Scaling State Management
- Wrapping Up
- Epilogue: State Management as a First-Class Concern
Understanding the Conversation Message Model
Every conversation with Claude is a sequence of messages. You build a list, send it to the API, and get back a response. Here's the fundamental structure:
interface Message {
role: "user" | "assistant";
content: string | ContentBlock[];
}
interface ContentBlock {
type: "text" | "tool_use" | "tool_result";
text?: string;
id?: string;
name?: string;
input?: Record<string, any>;
tool_use_id?: string;
content?: string;
is_error?: boolean;
}

The role is who sent the message: "user" (the person or system calling the API) or "assistant" (Claude). The content can be text or structured blocks (text, tool calls, tool results). This dual structure is crucial because Claude's reasoning often involves tool use. It might read a file, then analyze what it found, then call another tool. All of that happens in a single turn, and the message structure captures each step.
Here's a concrete example of a realistic conversation flow that shows how messages accumulate:
const messages: Message[] = [
{
role: "user",
content: "What's the current status of the database?",
},
{
role: "assistant",
content: [
{
type: "text",
text: "I'll check the database status for you.",
},
{
type: "tool_use",
id: "tool_call_1",
name: "check_database_status",
input: {},
},
],
},
{
role: "user",
content: [
{
type: "tool_result",
tool_use_id: "tool_call_1",
content: '{"status": "healthy", "uptime": "45 days"}',
},
],
},
{
role: "assistant",
content: [
{
type: "text",
text: "The database is healthy and has been running for 45 days without issues.",
},
],
},
];

This is the entire conversation. When you call the API again, you send this whole list plus a new user message. Claude reads through it, understands the context, and generates the next response. This becomes critical to understand: every message in the history provides context for future responses. The longer the history, the more context Claude has, but also the more tokens you're using on each request.
The hidden why: This architecture is simple but has consequences. You're not calling a stateful service that remembers you. Each API call includes the full history. It's stateless from Claude's perspective—it's reading what you tell it. This means:
- Every call includes everything — no hidden server-side state.
- Longer histories = longer requests = more tokens used = higher cost.
- If you lose the message history, the conversation is lost. You can't recover it from the API.
- You have complete control—you can edit, prune, or summarize history as needed.
Understanding this is key to everything else we'll cover. It's empowering because you can shape context intentionally, but it's also your responsibility.
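A tiny sketch makes the statelessness concrete. `buildRequest` is a hypothetical helper, not part of the SDK; it shows what each API call actually carries: the entire accumulated history plus the new user message.

```typescript
interface Message {
  role: "user" | "assistant";
  content: string;
}

// Hypothetical helper: every request body carries the ENTIRE history.
// Claude sees only what this array contains -- nothing more.
function buildRequest(history: Message[], userInput: string) {
  const messages = [...history, { role: "user" as const, content: userInput }];
  return { model: "claude-3-5-sonnet-20241022", max_tokens: 1024, messages };
}

const history: Message[] = [
  { role: "user", content: "My name is Alice." },
  { role: "assistant", content: "Nice to meet you, Alice!" },
];

const request = buildRequest(history, "What's my name?");
console.log(request.messages.length); // 3: both prior turns plus the new one
```

If the `history` array here were empty, Claude would have no way to answer "What's my name?" — that is the whole statelessness story in one function.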
Managing Context Window Limits
The context window is a hard constraint: Claude can only process a fixed number of tokens in a single request. A 200K-token window fits roughly 1,000 messages of about 200 tokens each, but in practice you run out much sooner, because real conversations contain long messages and every tool call adds its own overhead.
When you're building a long-running agent, you need a strategy for when messages start hitting limits. You have choices, each with tradeoffs:
Choice 1: Truncate the oldest messages. Keep the most recent N messages and drop the rest. Fast, simple, loses context. Good when you don't care about history beyond the last few turns.
Choice 2: Summarize old messages. Keep the most recent messages, summarize older ones into a compact summary, and include that summary. Slower (requires an extra API call to summarize) but preserves context better. Good for preserving important decisions or context from earlier in the conversation.
Choice 3: Let it fail. Just hit the token limit and handle the error. Not recommended—this creates unpredictable failures.
Choice 4: Implement smart retention. Keep recent messages, keep messages that mention decisions or findings, drop filler messages. More complex but often best in practice.
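Choice 4 can be sketched as a scoring pass over the history. The keyword list below is an illustrative assumption; a real system would tune what counts as "important" for its domain.

```typescript
interface Msg {
  role: "user" | "assistant";
  content: string;
}

// Illustrative retention policy: always keep the last `recentCount`
// messages, and keep older messages only if they look important.
function smartRetain(messages: Msg[], recentCount: number): Msg[] {
  // Assumed keyword heuristic -- stand-in for real importance scoring.
  const important = /decided|password|requirement|must|error/i;
  const cutoff = Math.max(0, messages.length - recentCount);
  return messages.filter(
    (msg, i) => i >= cutoff || important.test(msg.content),
  );
}

const history: Msg[] = [
  { role: "user", content: "We decided to use PostgreSQL." },
  { role: "assistant", content: "Noted." },
  { role: "user", content: "What's the weather like?" },
  { role: "assistant", content: "I don't have live weather data." },
  { role: "user", content: "Back to the database design." },
];

// Keeps the early decision message plus the 2 most recent messages.
const kept = smartRetain(history, 2);
console.log(kept.map((m) => m.content));
```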
Let's implement the simplest version—truncation:
interface ConversationManager {
messages: Message[];
maxMessages: number;
addMessage(message: Message): void;
getMessages(): Message[];
}
class SimpleTruncationManager implements ConversationManager {
messages: Message[] = [];
maxMessages: number = 50; // Keep only the last 50 messages
addMessage(message: Message): void {
this.messages.push(message);
// Trim oldest messages if we exceed the limit
if (this.messages.length > this.maxMessages) {
this.messages = this.messages.slice(-this.maxMessages);
}
}
getMessages(): Message[] {
return this.messages;
}
getStats(): { total: number; kept: number; truncated: number } {
return {
total: this.messages.length,
kept: this.messages.length,
truncated: 0, // We don't track this in simple version
};
}
}
// Usage
const manager = new SimpleTruncationManager();
manager.addMessage({ role: "user", content: "Hello" });
manager.addMessage({ role: "assistant", content: "Hi there!" });
manager.addMessage({ role: "user", content: "How are you?" });
const messagesToSend = manager.getMessages();
console.log(`Sending ${messagesToSend.length} messages to Claude`);

Expected Output:
Sending 3 messages to Claude
The problem with truncation is straightforward: you lose old context. The agent might forget what happened 50 messages ago. For many applications, this is fine. For a technical support agent, you might only care about the last few turns anyway. But for complex reasoning tasks, losing context is painful.
Here's a token-aware manager that actually counts tokens before deciding what to drop:
import Anthropic from "@anthropic-ai/sdk";
interface TokenCountResult {
input_tokens: number;
}
class TokenAwareManager {
messages: Message[] = [];
maxTokens: number = 150000; // Leave 50K buffer on 200K limit
client: Anthropic;
totalTokensUsed: number = 0;
constructor(apiKey?: string) {
this.client = new Anthropic({ apiKey });
}
async addMessage(message: Message): Promise<void> {
this.messages.push(message);
await this.enforceTokenLimit();
}
private async enforceTokenLimit(): Promise<void> {
// Count tokens for current messages
try {
const result = await this.client.messages.countTokens({
model: "claude-3-5-sonnet-20241022",
messages: this.messages,
});
this.totalTokensUsed = result.input_tokens;
// If over limit, remove old messages until we're under
while (
this.totalTokensUsed > this.maxTokens &&
this.messages.length > 2
) {
// Keep at least the first and last messages
this.messages.splice(1, 1);
const recount = await this.client.messages.countTokens({
model: "claude-3-5-sonnet-20241022",
messages: this.messages,
});
this.totalTokensUsed = recount.input_tokens;
}
} catch (err) {
console.error("Token counting failed:", err);
}
}
getMessages(): Message[] {
return this.messages;
}
getStats(): {
messageCount: number;
estimatedTokens: number;
tokenUtilization: string;
} {
const utilization = ((this.totalTokensUsed / this.maxTokens) * 100).toFixed(
1,
);
return {
messageCount: this.messages.length,
estimatedTokens: this.totalTokensUsed,
tokenUtilization: `${utilization}%`,
};
}
}
// Usage
const manager = new TokenAwareManager();
// Add messages and track tokens
await manager.addMessage({
role: "user",
content: "Explain quantum computing in detail.",
});
await manager.addMessage({
role: "assistant",
content:
"Quantum computing uses quantum bits (qubits) that can exist in superposition...",
});
const stats = manager.getStats();
console.log(`Messages: ${stats.messageCount}`);
console.log(`Tokens used: ${stats.estimatedTokens}`);
console.log(`Utilization: ${stats.tokenUtilization}`);

Expected Output:
Messages: 2
Tokens used: 87
Utilization: 0.1%
The TokenAwareManager is smarter. It counts actual tokens before you send the request, and when you approach the limit it drops old messages (keeping the first) until the count is back under the ceiling. This keeps you from hitting errors, lets long conversations continue, and means you're never surprised by token overages. One caveat the splice above glosses over: tool_use blocks and their matching tool_result blocks must travel together, so a production pruner has to remove those pairs as a unit rather than deleting messages one at a time.
The tradeoff here is important: each addMessage makes an API call to count tokens. Anthropic's token-counting endpoint is free to call, but each call adds latency and consumes rate limit. For real applications, you might count tokens only periodically, or estimate from character count and call the endpoint only once the estimate passes roughly 80% utilization. This balances accuracy with cost.
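A cheap hybrid might look like the sketch below. The 4-characters-per-token ratio is a rough rule of thumb, not an API guarantee; the exact count would still come from the counting endpoint, but only once the estimate crosses the threshold.

```typescript
interface Msg {
  role: "user" | "assistant";
  content: string;
}

// Rough heuristic: English text averages ~4 characters per token.
// This is an estimation assumption only, never an exact count.
function estimateTokens(messages: Msg[]): number {
  const chars = messages.reduce((sum, m) => sum + m.content.length, 0);
  return Math.ceil(chars / 4);
}

// Only pay for an exact count when the cheap estimate says we're close.
function needsExactCount(messages: Msg[], maxTokens: number): boolean {
  return estimateTokens(messages) > maxTokens * 0.8;
}

const short: Msg[] = [{ role: "user", content: "Hello!" }];
console.log(estimateTokens(short)); // 2
console.log(needsExactCount(short, 150000)); // false -- skip the API call
```

The design choice here is to trade a little accuracy for a lot of latency: the estimate is free and synchronous, and the exact count runs only near the limit, where accuracy actually matters.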
Persisting Conversation State Across Sessions
Here's where things get real: you build a conversational agent, the process restarts, and the user comes back. Their conversation should still exist. This is non-negotiable in production—imagine a customer talking to support, the system restarts, and they lose all context. That's a terrible experience.
To persist state, you serialize the message history to storage (database, file, cache) and load it back. The principle is simple: when a conversation ends (or periodically during a long conversation), you write the message history to durable storage. When the user returns, you load it back. Here's a complete example:
import fs from "fs/promises";
import path from "path";
interface ConversationSession {
id: string;
created_at: string;
updated_at: string;
messages: Message[];
metadata?: Record<string, any>;
}
class PersistentConversationManager {
private conversationDir: string = "./conversations";
private sessions: Map<string, ConversationSession> = new Map();
async init(): Promise<void> {
try {
await fs.mkdir(this.conversationDir, { recursive: true });
console.log(
`Initialized conversation directory: ${this.conversationDir}`,
);
} catch (err) {
console.error("Failed to initialize directory:", err);
}
}
private getSessionPath(sessionId: string): string {
return path.join(this.conversationDir, `${sessionId}.json`);
}
async createSession(
sessionId: string,
metadata?: Record<string, any>,
): Promise<ConversationSession> {
const now = new Date().toISOString();
const session: ConversationSession = {
id: sessionId,
created_at: now,
updated_at: now,
messages: [],
metadata,
};
this.sessions.set(sessionId, session);
await this.saveSession(session);
return session;
}
async loadSession(sessionId: string): Promise<ConversationSession | null> {
// Check memory first
if (this.sessions.has(sessionId)) {
return this.sessions.get(sessionId) || null;
}
// Try to load from disk
try {
const filePath = this.getSessionPath(sessionId);
const content = await fs.readFile(filePath, "utf-8");
const session = JSON.parse(content) as ConversationSession;
this.sessions.set(sessionId, session);
return session;
} catch (err) {
if ((err as NodeJS.ErrnoException).code === "ENOENT") {
return null; // File doesn't exist
}
console.error(`Failed to load session ${sessionId}:`, err);
return null;
}
}
async saveSession(session: ConversationSession): Promise<void> {
session.updated_at = new Date().toISOString();
this.sessions.set(session.id, session);
try {
const filePath = this.getSessionPath(session.id);
await fs.writeFile(filePath, JSON.stringify(session, null, 2), "utf-8");
} catch (err) {
console.error(`Failed to save session ${session.id}:`, err);
}
}
async addMessage(sessionId: string, message: Message): Promise<void> {
let session = this.sessions.get(sessionId);
if (!session) {
session = await this.loadSession(sessionId);
}
if (!session) {
throw new Error(`Session not found: ${sessionId}`);
}
session.messages.push(message);
await this.saveSession(session);
}
async getMessages(sessionId: string): Promise<Message[]> {
const session = await this.loadSession(sessionId);
if (!session) {
throw new Error(`Session not found: ${sessionId}`);
}
return session.messages;
}
async deleteSession(sessionId: string): Promise<void> {
try {
const filePath = this.getSessionPath(sessionId);
await fs.unlink(filePath);
this.sessions.delete(sessionId);
console.log(`Deleted session: ${sessionId}`);
} catch (err) {
console.error(`Failed to delete session ${sessionId}:`, err);
}
}
async listSessions(): Promise<ConversationSession[]> {
try {
const files = await fs.readdir(this.conversationDir);
const sessions: ConversationSession[] = [];
for (const file of files) {
if (file.endsWith(".json")) {
const sessionId = file.replace(".json", "");
const session = await this.loadSession(sessionId);
if (session) {
sessions.push(session);
}
}
}
return sessions;
} catch (err) {
console.error("Failed to list sessions:", err);
return [];
}
}
}
// Usage
const persistenceManager = new PersistentConversationManager();
await persistenceManager.init();
// Create a new session
const session = await persistenceManager.createSession("user-123", {
user_name: "Alice",
created_by: "web_app",
});
console.log(`Created session: ${session.id}`);
// Add messages
await persistenceManager.addMessage("user-123", {
role: "user",
content: "Remember my name is Alice",
});
await persistenceManager.addMessage("user-123", {
role: "assistant",
content: "Got it! Your name is Alice. I'll remember this.",
});
// Later... (process restarts, new code runs)
// Load the session
const loadedSession = await persistenceManager.loadSession("user-123");
const messages = loadedSession ? loadedSession.messages : [];
console.log(`Loaded session with ${messages.length} messages`);
console.log(`Last message: ${(messages[messages.length - 1] as any).content}`);
// List all sessions
const allSessions = await persistenceManager.listSessions();
console.log(`Total sessions: ${allSessions.length}`);

Expected Output:
Initialized conversation directory: ./conversations
Created session: user-123
Loaded session with 2 messages
Last message: Got it! Your name is Alice. I'll remember this.
Total sessions: 1
This approach works because of several critical design decisions:
- Serialization: The message list is JSON-serializable. Just dump it to a file.
- Session ID: Each conversation gets an ID (user ID, session ID, conversation ID). Use that as the filename.
- Load on demand: When the user returns, load their session and continue from where they left off.
- Durability: The file survives process restarts. The agent remembers.
- In-memory cache: We keep a map of recently accessed sessions in memory for fast retrieval.
For production, you'd use a database instead of files: better for concurrent access, querying, and replication. The pattern is identical, though. Serialize to durable storage, keyed by session ID, and load back when needed. A table with session_id, created_at, updated_at, and messages (stored as JSON) columns is the standard shape; the logic stays the same.
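To make "the pattern is identical" concrete, here's a sketch where an in-memory Map stands in for that table. The row shape mirrors the columns described above; swapping the Map for real INSERT/SELECT statements changes nothing else.

```typescript
// One row per session, mirroring the table shape described above:
// session_id | created_at | updated_at | messages (JSON)
interface SessionRow {
  session_id: string;
  created_at: string;
  updated_at: string;
  messages: string; // JSON-encoded Message[]
}

type Message = { role: "user" | "assistant"; content: string };

class TableBackedStore {
  // Stand-in for a database table; a real backend would issue SQL here.
  private table = new Map<string, SessionRow>();

  save(sessionId: string, messages: Message[]): void {
    const now = new Date().toISOString();
    const existing = this.table.get(sessionId);
    this.table.set(sessionId, {
      session_id: sessionId,
      created_at: existing?.created_at ?? now,
      updated_at: now,
      messages: JSON.stringify(messages),
    });
  }

  load(sessionId: string): Message[] {
    const row = this.table.get(sessionId);
    return row ? (JSON.parse(row.messages) as Message[]) : [];
  }
}

const store = new TableBackedStore();
store.save("user-123", [{ role: "user", content: "Hello" }]);
console.log(store.load("user-123").length); // 1
```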
Implementing Conversation Branching and Forking
Sometimes you want to explore multiple conversation paths. Maybe the user says "let's try a different approach" and you want to keep both branches. Or you're testing different strategies and want to explore them in parallel without losing the original thread.
Branching means: at point X in the conversation, create two copies of the history and have them diverge. This is useful for "what if" analysis—the user can ask "what if we used approach B instead of approach A?" and the agent can fork the conversation, explore approach B, while maintaining the original approach A conversation for comparison.
interface Branch {
id: string;
parent_id?: string; // ID of the branch this was forked from
messages: Message[];
created_at: string;
}
class BranchingConversationManager {
private branches: Map<string, Branch> = new Map();
private activeBranch: string = "main";
createBranch(name: string, parentBranchId?: string): string {
const parentBranch = parentBranchId
? this.branches.get(parentBranchId)
: this.branches.get(this.activeBranch);
const newBranch: Branch = {
id: name,
parent_id: parentBranch?.id,
messages: parentBranch ? [...parentBranch.messages] : [],
created_at: new Date().toISOString(),
};
this.branches.set(name, newBranch);
return name;
}
switchBranch(branchId: string): void {
if (!this.branches.has(branchId)) {
throw new Error(`Branch not found: ${branchId}`);
}
this.activeBranch = branchId;
}
addMessage(message: Message, branchId?: string): void {
const targetBranchId = branchId || this.activeBranch;
const branch = this.branches.get(targetBranchId);
if (!branch) {
throw new Error(`Branch not found: ${targetBranchId}`);
}
branch.messages.push(message);
}
getMessages(branchId?: string): Message[] {
const targetBranchId = branchId || this.activeBranch;
const branch = this.branches.get(targetBranchId);
if (!branch) {
throw new Error(`Branch not found: ${targetBranchId}`);
}
return branch.messages;
}
getBranches(): Branch[] {
return Array.from(this.branches.values());
}
getActiveBranchId(): string {
return this.activeBranch;
}
}
// Usage
const branchManager = new BranchingConversationManager();
// Create main branch
branchManager.createBranch("main");
// Add some messages to main
branchManager.addMessage({
role: "user",
content: "How do I optimize database queries?",
});
branchManager.addMessage({
role: "assistant",
content: "Here are some optimization techniques...",
});
// At this point, the user wants to explore two directions
// Fork into approach A
const approachA = branchManager.createBranch("approach-a", "main");
branchManager.switchBranch(approachA);
branchManager.addMessage({
role: "user",
content: "Tell me more about indexing strategies",
});
// Fork into approach B
const approachB = branchManager.createBranch("approach-b", "main");
branchManager.switchBranch(approachB);
branchManager.addMessage({
role: "user",
content: "Tell me about caching instead",
});
// Get both branches
const allBranches = branchManager.getBranches();
console.log(`Total branches: ${allBranches.length}`);
for (const branch of allBranches) {
console.log(`\nBranch: ${branch.id} (${branch.messages.length} messages)`);
if (branch.parent_id) {
console.log(` Parent: ${branch.parent_id}`);
}
}
// Switch back and see messages
branchManager.switchBranch(approachA);
const approachAMessages = branchManager.getMessages();
console.log(
`\nApproach A last message: ${(approachAMessages[approachAMessages.length - 1] as any).content}`,
);

Expected Output:
Total branches: 3
Branch: main (2 messages)
Branch: approach-a (3 messages)
Parent: main
Branch: approach-b (3 messages)
Parent: main
Approach A last message: Tell me more about indexing strategies
Why branching matters: Users often want to explore "what if" scenarios. Branching lets you preserve the conversation tree. Each branch maintains its own message history, so you can switch between them and the agent continues from that branch's perspective. This is powerful for interactive applications where users are exploring solutions.
Imagine a user interacting with an agent that helps design system architectures. They discuss a monolithic approach (main branch), but then say "what if we did microservices instead?" The agent can fork the conversation. Now branch A explores the monolith path further, and branch B explores microservices. The user can switch between them, compare insights from both approaches, and understand the tradeoffs. When they're ready to commit, they merge one branch into their final design.
For interactive applications (chat UIs, IDE integrations), this capability is transformative. Users can rewind to a decision point and try a different approach without losing the original conversation. Both conversations are preserved. Both branches remain available for comparison.
State Serialization and Deserialization
Serialization is converting state to something you can store. Deserialization is reading it back. Messages can contain complex structures (tool results with nested JSON). You need to handle that carefully.
interface SerializedConversation {
version: string;
created_at: string;
messages: Array<{
role: "user" | "assistant";
content: string | object;
}>;
metadata: Record<string, any>;
}
class ConversationSerializer {
static serialize(
messages: Message[],
metadata: Record<string, any> = {},
): SerializedConversation {
return {
version: "1.0",
created_at: new Date().toISOString(),
messages: messages.map((msg) => ({
role: msg.role,
content:
typeof msg.content === "string"
? msg.content
: JSON.stringify(msg.content),
})),
metadata,
};
}
static deserialize(data: SerializedConversation): Message[] {
return data.messages.map((msg) => {
const raw = msg.content as string;
// serialize() stringifies content-block arrays, so detect and parse
// them back. Plain text is returned untouched; the try/catch guards
// against a user message that merely starts with "[".
if (typeof raw === "string" && raw.startsWith("[")) {
try {
return { role: msg.role, content: JSON.parse(raw) };
} catch {
// Not valid JSON after all; fall through and treat as plain text.
}
}
return { role: msg.role, content: raw };
});
}
static toJSON(conversation: SerializedConversation): string {
return JSON.stringify(conversation, null, 2);
}
static fromJSON(json: string): SerializedConversation {
const data = JSON.parse(json);
// Validate structure
if (!data.version || !Array.isArray(data.messages)) {
throw new Error("Invalid conversation format");
}
return data as SerializedConversation;
}
}
// Usage
const messages: Message[] = [
{
role: "user",
content: "Check the database",
},
{
role: "assistant",
content: [
{
type: "text",
text: "I'll query the database.",
},
{
type: "tool_use",
id: "call_1",
name: "query_db",
input: { query: "SELECT * FROM users" },
},
],
},
{
role: "user",
content: [
{
type: "tool_result",
tool_use_id: "call_1",
content: '{"rows": 42, "status": "success"}',
},
],
},
];
// Serialize
const serialized = ConversationSerializer.serialize(messages, {
user_id: "alice",
timestamp: Date.now(),
});
const json = ConversationSerializer.toJSON(serialized);
console.log("Serialized:");
console.log(json);
// Deserialize
const parsed = ConversationSerializer.fromJSON(json);
const deserialized = ConversationSerializer.deserialize(parsed);
console.log(`\nDeserialized ${deserialized.length} messages`);
console.log(`First message: ${(deserialized[0] as any).content}`);
console.log(
`Second message has ${(deserialized[1] as any).content.length} content blocks`,
);

Expected Output:
Serialized:
{
"version": "1.0",
"created_at": "2026-03-17T10:30:00.000Z",
"messages": [
{
"role": "user",
"content": "Check the database"
},
{
"role": "assistant",
"content": "[{\"type\":\"text\",\"text\":\"I'll query the database.\"},{\"type\":\"tool_use\",\"id\":\"call_1\",\"name\":\"query_db\",\"input\":{\"query\":\"SELECT * FROM users\"}}]"
},
{
"role": "user",
"content": "[{\"type\":\"tool_result\",\"tool_use_id\":\"call_1\",\"content\":\"{\\\"rows\\\":42,\\\"status\\\":\\\"success\\\"}\"}]"
}
],
"metadata": {
"user_id": "alice",
"timestamp": 1710694200000
}
}
Deserialized 3 messages
First message: Check the database
Second message has 2 content blocks
The key insight: the serializer normalizes content to a single shape. Plain strings pass through untouched, and content-block arrays are stringified, so every stored message has the same structure. JSON could store the arrays directly, but whichever convention you choose, serialize and deserialize must agree on it exactly, or round-trips silently corrupt the history. This works across any storage backend.
In practice, you'd store this in a database:
// Pseudo-code for database storage
async function saveConversation(
userId: string,
conversation: SerializedConversation,
) {
const json = ConversationSerializer.toJSON(conversation);
await database.insert("conversations", {
user_id: userId,
data: json,
created_at: conversation.created_at,
});
}
async function loadConversation(userId: string): Promise<Message[]> {
const row = await database.query(
"SELECT data FROM conversations WHERE user_id = ?",
[userId],
);
if (!row) return [];
const conversation = ConversationSerializer.fromJSON(row.data);
return ConversationSerializer.deserialize(conversation);
}

Putting It All Together: A Production-Ready Agent
Here's a complete example that combines everything—token awareness, persistence, state management, and error handling:
import Anthropic from "@anthropic-ai/sdk";
// Define our types
interface ConversationState {
sessionId: string;
messages: Message[];
tokenCount: number;
createdAt: string;
updatedAt: string;
}
interface Message {
role: "user" | "assistant";
content: string | any[];
}
// Core state manager
class ProductionStateManager {
private state: ConversationState;
private client: Anthropic;
private maxTokens: number = 150000;
constructor(sessionId: string, apiKey?: string) {
this.client = new Anthropic({ apiKey });
this.state = {
sessionId,
messages: [],
tokenCount: 0,
createdAt: new Date().toISOString(),
updatedAt: new Date().toISOString(),
};
}
async addMessage(message: Message): Promise<void> {
this.state.messages.push(message);
// Count tokens
try {
const result = await this.client.messages.countTokens({
model: "claude-3-5-sonnet-20241022",
messages: this.state.messages,
});
this.state.tokenCount = result.input_tokens;
this.state.updatedAt = new Date().toISOString();
// If over limit, remove old messages (but keep first and last)
while (
this.state.tokenCount > this.maxTokens &&
this.state.messages.length > 2
) {
this.state.messages.splice(1, 1);
const recount = await this.client.messages.countTokens({
model: "claude-3-5-sonnet-20241022",
messages: this.state.messages,
});
this.state.tokenCount = recount.input_tokens;
}
} catch (err) {
console.error("Token counting failed:", err);
}
}
getMessages(): Message[] {
return this.state.messages;
}
getState(): ConversationState {
return this.state;
}
getStats(): {
sessionId: string;
messageCount: number;
tokenCount: number;
tokenUtilization: string;
} {
return {
sessionId: this.state.sessionId,
messageCount: this.state.messages.length,
tokenCount: this.state.tokenCount,
tokenUtilization: `${((this.state.tokenCount / this.maxTokens) * 100).toFixed(1)}%`,
};
}
}
// Main agent class
class ProductionAgent {
private stateManager: ProductionStateManager;
constructor(sessionId: string) {
this.stateManager = new ProductionStateManager(sessionId);
}
async chat(userInput: string): Promise<string> {
// Add user message to state
await this.stateManager.addMessage({
role: "user",
content: userInput,
});
// Get the Anthropic client
const client = new Anthropic();
// Call Claude with current state
const response = await client.messages.create({
model: "claude-3-5-sonnet-20241022",
max_tokens: 1024,
messages: this.stateManager.getMessages(),
});
// Extract response text
const responseText = response.content
.filter((block) => block.type === "text")
.map((block) => (block.type === "text" ? block.text : ""))
.join("");
// Add assistant response to state
await this.stateManager.addMessage({
role: "assistant",
content: responseText,
});
return responseText;
}
getStats() {
return this.stateManager.getStats();
}
}
// Usage demonstration
async function demo() {
const agent = new ProductionAgent("session-001");
console.log("Agent: Starting conversation...\n");
// First turn
const response1 = await agent.chat("What is machine learning?");
console.log(`User: What is machine learning?`);
console.log(`Agent: ${response1}\n`);
let stats = agent.getStats();
console.log(
`Stats: ${stats.messageCount} messages, ${stats.tokenCount} tokens`,
);
console.log(`Utilization: ${stats.tokenUtilization}\n`);
// Second turn
const response2 = await agent.chat("Can you give me a practical example?");
console.log(`User: Can you give me a practical example?`);
console.log(`Agent: ${response2}\n`);
stats = agent.getStats();
console.log(
`Stats: ${stats.messageCount} messages, ${stats.tokenCount} tokens`,
);
console.log(`Utilization: ${stats.tokenUtilization}`);
}
// Uncomment to run:
// demo().catch(console.error);

Expected Output (simulated):
Agent: Starting conversation...
User: What is machine learning?
Agent: Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed...
Stats: 2 messages, 156 tokens
Utilization: 0.1%
User: Can you give me a practical example?
Agent: Sure! A practical example is email spam filtering. The system learns from examples of spam and legitimate emails...
Stats: 4 messages, 289 tokens
Utilization: 0.2%
Best Practices and Common Pitfalls
Do This:
- Persist state immediately. Don't wait until the end of a conversation. Save after each turn. This ensures you never lose context if the system crashes mid-conversation.
- Monitor token usage. Know when you're approaching limits. Track the trend over time so you can predict when you'll need to truncate.
- Use session IDs. Every conversation should have a unique identifier. This makes debugging easier and lets you query conversation history.
- Compress old messages strategically. Don't just truncate. Consider summarizing important context so you retain knowledge even as you remove old messages.
- Handle errors gracefully. Token counting fails sometimes. Plan for it. Have fallback strategies.
Don't Do This:
- Don't assume the conversation will never restart. Always design for persistence. Treat every conversation as potentially long-lived.
- Don't ignore token limits. Test with long conversations. See where they break. Know your limits before they cause production failures.
- Don't store credentials in message history. Scrub sensitive data before persisting. API keys, passwords, tokens have no business in conversation history.
- Don't recreate state from scratch on errors. Keep what you have. Resume from where you left off. Partial progress is better than nothing.
- Don't send the entire history on every call if the conversation is huge. Implement truncation or summarization. Large histories are slow and expensive.
Why This Matters: State is the Invisible Foundation
State management seems boring compared to the flashy parts of agents—the reasoning, the tool use, the clever prompts. But boring infrastructure is what separates toy agents from production systems. Without proper state management, your agent can't have a meaningful conversation. It forgets. It hallucinates context. It fails in ways that are hard to debug.
Consider what happens with poor state management. A user starts a conversation. The agent makes progress. Then the system restarts. Now what? If you didn't persist state, the user has to start over. That's frustrating. If you did persist state but didn't manage token limits, you silently lose context and the agent gives wrong answers. That's worse—at least "start over" is honest. Silent context loss is insidious.
The best agents are invisible about state. The user doesn't think about it. They start a conversation, it runs for hours, they come back a week later and the agent remembers everything. That seamlessness is the result of meticulous state management. It's the engineering behind the magic.
Common Pitfalls in State Management
Most teams implementing conversation state hit the same problems. Let me help you avoid them.
Pitfall 1: Not counting tokens until it's too late You keep adding messages to history without checking token usage. When you finally check, you're at 95% of the context window. Now you're in crisis mode, truncating messages frantically. Better: count tokens after every message. When you reach 80%, start your truncation strategy proactively. You'll never panic.
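The proactive check described above can be sketched as follows. This is a minimal illustration, assuming a rough 4-characters-per-token estimate; the names (`truncate_if_needed`, `SOFT_LIMIT`) are illustrative, not from any SDK:

```python
CONTEXT_WINDOW = 200_000   # tokens available to the model
SOFT_LIMIT = 0.8           # start truncating at 80% utilization

def estimate_tokens(messages):
    """Cheap character-based estimate (~4 chars per token for English text)."""
    chars = sum(len(m["content"]) for m in messages)
    return chars // 4

def truncate_if_needed(messages, keep_recent=20):
    """Drop oldest messages (after the first) until under the soft limit."""
    while (estimate_tokens(messages) > CONTEXT_WINDOW * SOFT_LIMIT
           and len(messages) > keep_recent + 1):
        # Keep messages[0] (initial context), drop the next-oldest message.
        del messages[1]
    return messages
```

Running this after every append means you cross the 80% line gradually and deliberately, never in crisis mode.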
Pitfall 2: Losing state on errors Your system crashes while saving state. Now the conversation is partially persisted. You resume from an inconsistent state. The agent gets confused. Better: implement atomic saves. Use transactions or write-then-rename patterns. Ensure that either the full state is saved or nothing is saved. No partial states.
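The write-then-rename pattern mentioned above can be sketched like this. `os.replace` is atomic on both POSIX and Windows for same-filesystem paths, so readers see either the old state or the new one, never a partial write:

```python
import json
import os
import tempfile

def save_state_atomic(path, state):
    """Write state to a temp file, then atomically rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())   # make sure bytes hit disk before the rename
        os.replace(tmp, path)      # atomic: old state or new state, never partial
    except BaseException:
        os.unlink(tmp)             # clean up the temp file on any failure
        raise
```

If the process crashes mid-write, the target file still holds the previous complete state.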
Pitfall 3: Not distinguishing between critical and filler messages When truncating, you remove messages uniformly. But some messages are critical ("The user said the password is X") while others are filler ("I'll now process..."). Better: implement message importance scoring. Mark critical messages (user inputs, key decisions) as "always keep." When truncating, remove filler first.
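A sketch of importance-aware truncation, assuming each message carries a `critical` flag — an illustrative convention, not an API field. Critical messages are never dropped; filler goes first, oldest first:

```python
def truncate_by_importance(messages, max_messages):
    """Trim to max_messages, dropping oldest non-critical messages first."""
    if len(messages) <= max_messages:
        return messages
    # Indices of droppable (non-critical) messages, in chronological order.
    droppable = [i for i, m in enumerate(messages) if not m.get("critical")]
    to_drop = set(droppable[: len(messages) - max_messages])
    return [m for i, m in enumerate(messages) if i not in to_drop]
```

Note that if nearly everything is marked critical, the result can exceed the target; that is the signal to summarize rather than truncate.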
Pitfall 4: Serialization/deserialization bugs You serialize messages to JSON, persist, and months later try to deserialize. The format has changed. The tool_use structure is different. Deserialization fails. Better: version your serialization format. Add a version: "1.0" field. When you change the format, bump to "2.0" and implement migration logic that converts old formats to new ones.
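A minimal sketch of the versioned format with migration on load. The field names and the hypothetical v1-to-v2 change (plain strings becoming role/content dicts) are illustrative assumptions:

```python
import json

CURRENT_VERSION = "2.0"

def serialize(messages):
    return json.dumps({"version": CURRENT_VERSION, "messages": messages})

def migrate_v1_to_v2(state):
    """Hypothetical change: v1 stored plain strings, v2 stores role/content dicts."""
    state["messages"] = [
        {"role": "user", "content": m} if isinstance(m, str) else m
        for m in state["messages"]
    ]
    state["version"] = "2.0"
    return state

# Map each old version to the function that upgrades it one step.
MIGRATIONS = {"1.0": migrate_v1_to_v2}

def deserialize(blob):
    state = json.loads(blob)
    while state.get("version", "1.0") != CURRENT_VERSION:
        state = MIGRATIONS[state.get("version", "1.0")](state)
    return state["messages"]
```

Chaining one-step migrations means a "1.0" blob can still be read years later, after several format bumps, without special-casing every old version at every call site.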
Pitfall 5: Not handling concurrent access Two processes try to load the same session simultaneously. They both succeed, make changes, save independently. The second save overwrites the first. State is corrupted. Better: use database-level locks or optimistic locking with version numbers. Ensure only one writer can modify a session at a time.
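Optimistic locking can be sketched with an in-memory store; in a real database the compare-and-increment would be a single `UPDATE ... WHERE version = ?` statement. The class and method names here are assumptions for illustration:

```python
class ConflictError(Exception):
    """Raised when another writer modified the session first."""

class SessionStore:
    def __init__(self):
        self._rows = {}   # session_id -> (version, state)

    def load(self, session_id):
        return self._rows.get(session_id, (0, None))

    def save(self, session_id, state, expected_version):
        current_version, _ = self._rows.get(session_id, (0, None))
        if current_version != expected_version:
            raise ConflictError("session modified by another writer")
        self._rows[session_id] = (current_version + 1, state)
```

A writer that loses the race gets a ConflictError instead of silently overwriting; it can then reload, merge, and retry.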
Under the Hood: Token Counting Deep Dive
Token counting seems simple: call the API, get a number. But there's hidden complexity worth understanding.
Claude's token counting includes overhead you don't expect. Each message has framing tokens. Tool calls have structure overhead. If your message is 100 tokens of content plus 20 tokens of framing, counting just the content undercounts by 20%. That's why the official countTokens API exists—it includes all the overhead.
But there's more nuance. Different models have different tokenization. Claude 3 Opus tokenizes differently than Claude 3.5 Sonnet. If you're upgrading models, your token counts change. A conversation that fit in 200K tokens under Opus might overflow under Sonnet. Plan for this. When you upgrade models, re-count your token usage for active conversations.
Also, token counting isn't free. Each countTokens call is an extra network round trip and counts against your rate limits. A naive implementation that counts tokens before every message could double your API traffic. Better: count tokens periodically (every 10 messages) or estimate based on character count and only verify when you're above 80% utilization.
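The hybrid approach can be sketched as follows: trust a cheap character-based estimate most of the time, and consult the exact counter (in production, the API's token-counting endpoint; here an injected function) only once the estimate crosses 80% utilization. The names are illustrative:

```python
CONTEXT_WINDOW = 200_000

def estimate_tokens(messages):
    """Cheap estimate: ~4 characters per token for English text."""
    return sum(len(m["content"]) for m in messages) // 4

def utilization(messages, exact_counter):
    """Fraction of the context window used, verified only when it matters."""
    est = estimate_tokens(messages)
    if est < CONTEXT_WINDOW * 0.8:
        return est / CONTEXT_WINDOW                   # trust the cheap estimate
    return exact_counter(messages) / CONTEXT_WINDOW   # verify near the limit
```

Below the threshold you pay nothing extra; above it, accuracy matters enough to justify the call.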
Real-World Scenario: A Long-Running Customer Support Agent
Imagine you're building an agent that helps customers troubleshoot issues. They start a conversation at 10am Monday. They add details, the agent asks clarifying questions. By 3pm, they've exchanged 50 messages. The conversation goes quiet—the customer is testing the agent's suggestion. At 5pm Tuesday, they return with an update.
Without proper state management: The agent has no idea what happened 24 hours ago. It asks the same questions. The customer is frustrated. They abandon the chat.
With proper state management: The agent loads the conversation history. It sees they were troubleshooting a network issue. It remembers the steps they've tried. It asks targeted follow-ups based on progress, not from scratch. The conversation is continuous, even with a day's gap.
This isn't magic. It's state management. The conversation history is persisted to a database with a session ID. When the customer returns, you load by session ID. The agent continues naturally.
But what if the conversation is long? What if they've exchanged 200 messages over a week? That's a lot of tokens. The agent needs to see recent messages (for immediate context) and a summary of earlier messages (for background). This is where intelligent truncation shines.
You keep the first 5 messages (establishes context, initial question). You keep the last 20 messages (recent discussion). For everything in between, you generate a brief summary: "Customer reported network issue. Tried restarting modem (didn't help). Tried different device (worked). Suggests issue with WiFi rather than connection." Now the agent can reason about the full conversation in a reasonable token budget.
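The first-5/summary/last-20 layout described above can be sketched like this. The summary text itself would come from a model call; here it is passed in, and the function name is an assumption:

```python
def compress_history(messages, summary_text, keep_first=5, keep_last=20):
    """Keep the opening and recent messages; replace the middle with a summary."""
    if len(messages) <= keep_first + keep_last:
        return messages   # nothing to compress yet
    dropped = len(messages) - keep_first - keep_last
    summary_msg = {
        "role": "user",
        "content": f"[Summary of {dropped} earlier messages] {summary_text}",
    }
    return messages[:keep_first] + [summary_msg] + messages[-keep_last:]
```

A week-long, 200-message conversation collapses to 26 messages while the agent still knows how the issue started, what was tried, and where things stand now.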
Alternatives to Token Truncation
Truncation works, but sometimes you need smarter strategies. Here are alternatives to consider.
Strategy 1: Hierarchical summarization Instead of summarizing all old messages into one blob, create hierarchical summaries. Group messages by topic: "WiFi troubleshooting, System updates, Network diagnostics." For each group, create a summary. The agent can reference summaries by topic when needed. This gives structure to old context.
Strategy 2: Semantic search over history Store message embeddings. When you need to truncate, find the N most semantically similar messages to recent messages. Keep those. This preserves context about topics the agent is currently discussing. It's more sophisticated but costs more (embedding API calls).
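The selection logic of Strategy 2 can be sketched without an embedding API: a toy bag-of-words vector stands in for real embeddings so the ranking step is clear. Everything here is illustrative; production systems would call an actual embedding model:

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def keep_most_relevant(old_messages, recent_messages, n):
    """Keep the n old messages most similar to the recent discussion."""
    query = embed(" ".join(m["content"] for m in recent_messages))
    scored = sorted(old_messages,
                    key=lambda m: cosine(embed(m["content"]), query),
                    reverse=True)
    kept = scored[:n]
    # Preserve chronological order among the survivors.
    return [m for m in old_messages if m in kept]
```

The shape is the same with real embeddings: score old messages against the current topic, keep the top n, drop or summarize the rest.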
Strategy 3: Time-window based retention Keep messages from the last 24 hours. After 24 hours, summarize them into a single message. This creates a clean time boundary. It's simple, predictable, and works well for time-bounded tasks (customer support during a single incident).
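The time boundary in Strategy 3 is a simple partition, assuming each message carries a `ts` (Unix seconds) field — an illustrative convention, not an API field:

```python
import time

WINDOW_SECONDS = 24 * 60 * 60   # the 24-hour retention window

def split_by_window(messages, now=None):
    """Return (to_summarize, to_keep): messages older than 24h vs recent ones."""
    now = time.time() if now is None else now
    cutoff = now - WINDOW_SECONDS
    old = [m for m in messages if m["ts"] < cutoff]
    recent = [m for m in messages if m["ts"] >= cutoff]
    return old, recent
```

Everything in the `old` half gets summarized into a single message; the `recent` half stays verbatim.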
Strategy 4: User-guided curation Let the user explicitly decide what to keep. "What should I remember from this conversation for next time?" The user highlights the important parts. You compress those and discard the rest. This is manual but gives maximum control.
Troubleshooting Conversation State Issues
When things go wrong, it's often hard to debug state issues—they're distributed across time and space. Here's how to troubleshoot systematically.
Problem: Agent gives inconsistent answers to the same question Same user, same conversation, asks "what's my name?" and gets different answers. This suggests state corruption. The agent's view of conversation history is inconsistent. Solution: Dump the conversation history and inspect it directly. Look for duplicate messages, messages in wrong order, corrupted content. Check your persistence layer—maybe writes are failing silently.
Problem: Token count keeps growing without adding messages You track token usage. It was 5000 tokens with 10 messages. After adding 1 message, it jumps to 8000. That's not proportional. Something is wrong. Solution: Check if you're re-serializing messages each turn. Sometimes old serialization formats get normalized when loading, increasing token count. Verify the token counting logic isn't including duplicates.
Problem: Old context suddenly becomes unavailable A feature that worked fine for a month suddenly breaks. The agent can no longer reference messages from a week ago. Solution: Check if you changed truncation thresholds. Maybe you lowered maxMessages from 100 to 50, and conversations older than that are now being truncated. Check your migration code if you changed persistence format.
Production Considerations: Scaling State Management
As you scale to thousands of conversations, state management becomes infrastructure engineering.
Consideration 1: Database performance If you're storing conversations in a database, load becomes important. A query like "load all messages for session X" on a table with millions of messages should be fast. Solution: Index by session ID. Consider partitioning conversations by date if your system gets massive.
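The schema and query pattern can be sketched with SQLite for illustration; table and column names are assumptions. The composite index is what makes "load all messages for session X" fast even with millions of rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE messages (
        session_id TEXT NOT NULL,
        seq        INTEGER NOT NULL,
        role       TEXT NOT NULL,
        content    TEXT NOT NULL
    )
""")
# Composite index: the session lookup becomes an index range scan,
# and ORDER BY seq comes out of the index for free.
conn.execute("CREATE INDEX idx_messages_session ON messages (session_id, seq)")

def load_session(session_id):
    rows = conn.execute(
        "SELECT role, content FROM messages WHERE session_id = ? ORDER BY seq",
        (session_id,),
    ).fetchall()
    return [{"role": r, "content": c} for r, c in rows]
```

The same shape carries over to Postgres or MySQL; only the connection setup changes.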
Consideration 2: Backup and recovery Conversation data is valuable. If you lose it, you've lost customer history. Solution: Implement regular backups. Test restore procedures—don't discover in production that your backups are corrupt. For critical systems, consider replication (a primary-replica database setup).
Consideration 3: Privacy and retention Conversations often contain sensitive data. Regulations require you to delete old data (GDPR, CCPA). Solution: Implement automatic data retention policies. Conversations older than 1 year are deleted automatically. Sensitive data (passwords, API keys) are redacted on insertion. Add audit logging so you can prove compliance.
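Redaction on insertion can be sketched with a few regex patterns. The patterns below catch some common secret shapes and are purely illustrative; a real deployment needs a vetted secret scanner, not this list:

```python
import re

SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),             # API-key-like strings
    re.compile(r"(?i)password\s*[:=]\s*\S+"),       # "password: hunter2"
    re.compile(r"(?i)bearer\s+[A-Za-z0-9._\-]+"),   # bearer tokens
]

def redact(text):
    """Replace anything matching a known secret shape before persisting."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```

Run this on every message before it touches the database, so secrets never enter storage in the first place and there is nothing to scrub from backups later.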
Consideration 4: Concurrent access patterns If multiple agents are accessing the same conversation, synchronization is critical. Solution: Use optimistic locking (store version number, increment on write, fail if version mismatches on write). Or implement database-level pessimistic locking. Choose the strategy that fits your access patterns.
Wrapping Up
State management is how agents remember. It's the message history, properly tracked and persisted. Understand token limits, implement truncation or summarization when needed, persist across sessions, support branching for exploring alternatives, and serialize carefully.
The principle: your agent is only as good as the state it tracks. Get state right, and the agent can reason clearly about long-running problems. Lose state or manage it badly, and the agent forgets, hallucinates, or crashes.
But there's more to it than just technical implementation. When you get state management right, you enable richer interactions. Users can have multi-turn conversations where the agent remembers what they said five turns ago. They can explore branches—"what if we tried this approach instead?"—without losing the original conversation. They can close the conversation, come back a week later, and the agent picks up where they left off, maintaining full context.
This matters because conversation is how humans think. We don't give single-turn instructions. We explore ideas iteratively. We say something, the agent responds, we ask a follow-up, the agent adapts. That natural back-and-forth is what makes agents useful. And that's only possible with proper state management.
The technical details matter—token counting, truncation strategies, serialization formats. But they matter because they enable the human experience. That's what to keep in mind as you build: state management isn't an implementation detail. It's the foundation for meaningful human-agent interaction.
Get it right, and your agents will be genuinely useful. Get it wrong, and they'll be frustrating. The difference between the two is attention to how state flows through your system.
Epilogue: State Management as a First-Class Concern
In the early days of LLM-based applications, people treated state management as an afterthought. "We'll add persistence later." But state is foundational. It's not something you add once the core system works. It's something you design for from the beginning.
Think about Claude Code agent design. The SDK is architected with state management in mind. The message history is first-class. The session concept is built-in, not bolted-on. The token counting API is available immediately, not discovered later when you hit limits. This is because the Claude team learned from watching people build agents and hit exactly the problems this guide describes.
You should approach it the same way in your own systems. Don't build the agent and then add state. Design the agent with state in mind. Plan for long conversations. Budget tokens conservatively. Implement persistence from day one. Test backup and recovery. These aren't luxuries for later. They're requirements for a system that actually works.
The best engineers treat state management as a first-class citizen alongside the agent logic itself. The conversation history isn't a database table—it's the core data structure that enables reasoning. The session is the identity of the conversation—it's as fundamental as a user ID in a web application. This mindset shift changes how you architect the whole system.
Your agents will be better for it. They'll be more reliable, more predictable, and more useful. That's worth the effort of getting state management right from the start.