Building an On-Call Assistant with Claude Code

On-call rotations suck. You're woken up at 2 AM, your heart rate spikes, and you've got maybe 30 seconds to understand what's actually broken before customers notice. Most of that time gets lost to context-switching—digging through alert dashboards, parsing cryptic error messages, hunting down related metrics. What if your on-call experience didn't have to feel like navigating in the dark?
In this article, we're building an on-call assistant with Claude Code that does the heavy lifting: it ingests raw alerts, stitches together context from multiple systems, suggests investigation steps, and even automates safe remediations. By the time you've finished your coffee, you'll know what's wrong and have a clear playbook for fixing it.
Understanding the Hidden Impact
Before diving into the code, let's understand what on-call actually costs your organization. Most managers measure on-call by counting pages—you got paged 5 times last week, so your on-call burden must have been 5 incidents' worth. But that only measures the dramatic moments. The real cost is more insidious.
Being on-call fragments your mental state. You can't fully relax because part of your attention is always on your phone. If you're not sleeping well, you're less sharp at your day job. If you're responding to pages at 2 AM, you're exhausted the next day. Studies on on-call workers in other industries show measurable impacts on sleep quality, stress levels, and family time. For knowledge workers, this translates to reduced productivity, increased mistakes, and higher burnout.
Then there's the knowledge gap. A 2 AM incident response is not a good time to learn about a service. Your brain is foggy. You're stressed. You're working from memory, and memory is unreliable when you're tired. So you make conservative choices. You restart services instead of investigating. You get lucky and it works. You go back to bed not having learned anything useful about your system. Next time the same service fails, you're equally confused.
Compare that to an on-call assistant that immediately provides context: "Here's what changed in the last hour. Here's what similar incidents looked like. Here's what worked before." Now you're making decisions from a position of information rather than panic. You sleep better knowing you responded well. You learn something about your system that will help next time.
The assistant is ultimately about reducing the cognitive load of being on-call. Not eliminating it—the urgency is still real, the stakes are still high—but reducing the part that's about information gathering so you can focus on the part that's about problem-solving.
Table of Contents
- Understanding the Hidden Impact
- The Deeper Challenge: Alert Fatigue and On-Call Burnout
- Why Alerts Alone Aren't Enough
- Architecture: From Alert to Resolution
- Designing for the Human in the Loop
- Building the Alert Handler
- Learning from Resolutions: Building Feedback Loops
- Integration with Your Existing Stack
- Automating Safe Remediations
- Building a Sustainable Knowledge Base
- Handling Cascading Failures and Cross-Service Issues
- Improving Over Time Through Analysis
- On-Call Rotation Resilience
- Metrics and Success Measurement
- Wrapping It All Up: The Real-World Impact
- Implementation Roadmap: From Concept to Production
- Where to Start Your First On-Call Assistant
The Deeper Challenge: Alert Fatigue and On-Call Burnout
Before we build the on-call assistant, let's understand the real problem we're solving. It's not just about speed—it's about the human cost of being on-call.
On-call rotations are inherently stressful. Your sleep is fragile. Your evenings are never fully yours. You're always half-listening for your phone to buzz. This creates a low-level chronic stress that compounds over time. People burn out. Retention suffers. Teams get depleted.
Part of the stress comes from alert fatigue. You get paged at 2 AM for an alert, and when you investigate, it's a false positive or something that resolved itself. You get paged three times in a night for different alerts, and none of them are actual problems. The boy-who-cried-wolf effect makes you cynical about all alerts. When a real problem occurs, you're slower to respond because you've learned from experience that most alerts don't matter.
But another part of the stress comes from the cognitive burden. You're woken up disoriented, and immediately you have to shift your brain into problem-solving mode. You need to understand what's wrong, why it happened, and what to do about it. None of this is in your short-term memory—you have to reconstruct the entire context from logs, dashboards, and memory. That's cognitively expensive, especially at 2 AM when you're half-asleep.
An on-call assistant addresses the second problem directly. We can't eliminate alerts (they're necessary), and we can't make emergencies not emergencies. But we can dramatically reduce the cognitive load of responding to them. By the time you read the summary, you already know what's wrong and what to do. Your brain doesn't have to reconstruct context from fragments. You can get straight to solving the problem.
Why Alerts Alone Aren't Enough
Here's the specific problem: your monitoring tools send you alerts, but alerts are dumb. They tell you a metric crossed a threshold, but they don't tell you why, what usually causes it, or what you should do about it. You're expected to fill in those gaps manually, at 2 AM, under pressure.
A good on-call assistant bridges that gap. It:
- Pulls context from multiple sources (metrics, logs, deployment history, past incidents)
- Synthesizes information into a coherent narrative
- Suggests investigation steps based on alert type and patterns
- Proposes safe remediations that have worked before
- Learns from resolved incidents to improve future responses
The key insight here is that context is expensive to gather but incredibly valuable once you have it. Claude Code excels at this kind of data gathering and synthesis—it can read files, query APIs, parse structured data, and reason about what matters.
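To make that gap concrete, here's a minimal sketch (all type and field names are illustrative, not from any specific monitoring product) of the difference between the raw alert you get paged with and the enriched picture the assistant hands you:

```typescript
// A raw alert: a metric crossed a threshold, nothing more.
interface RawAlert {
  name: string;       // "HighCPU"
  service: string;    // "checkout-api"
  value: number;      // 92
  threshold: number;  // 80
}

// What the assistant produces: the same alert plus synthesized context.
interface EnrichedIncident {
  alert: RawAlert;
  recentDeploys: string[];     // what changed recently
  similarIncidents: string[];  // what this looked like before
  hypothesis: string;          // likely root cause, stated plainly
  suggestedActions: string[];  // what worked last time
}

// Hypothetical enrichment step: stitches context onto the raw alert.
// The hardcoded deploy line stands in for a real deployment-API lookup.
function enrich(
  alert: RawAlert,
  history: Map<string, string[]>,
): EnrichedIncident {
  return {
    alert,
    recentDeploys: ["v3.2.2 deployed 38 minutes ago"],
    similarIncidents: history.get(alert.name) ?? [],
    hypothesis: `${alert.service} CPU at ${alert.value}% (threshold ${alert.threshold}%)`,
    suggestedActions: ["Check deploy diff", "Compare request-rate graph"],
  };
}
```

The rest of this article is about building the real version of that enrich step.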
Architecture: From Alert to Resolution
Let's sketch out what we're building:
Alert fires
↓
Alert ingestion service (receives webhook)
↓
Claude Code orchestrator (coordinates investigation)
├─ Fetch alert details and history
├─ Pull relevant metrics from monitoring system
├─ Query logs for error patterns
├─ Check deployment timeline
├─ Look up related incidents
├─ Generate summary and recommendations
└─ Suggest remediation steps
↓
Human on-call engineer
├─ Reviews recommendation
├─ Executes suggested remediation OR investigates deeper
└─ Documents resolution (feedback loop)
The human is still in control—Claude is the assistant, not the decision-maker. This is crucial. We automate the gathering and reasoning, but the on-call engineer makes the final call on what to do. Over time, as the system sees how engineers resolve incidents, it learns what actually works.
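The pipeline above can be sketched as a small orchestration function. The fetchers here are hypothetical stand-ins for the real monitoring, logging, and deployment integrations:

```typescript
// Hypothetical data sources; in practice these call your monitoring,
// logging, and deployment APIs.
type Fetcher = (service: string) => Promise<string>;

interface Sources {
  metrics: Fetcher;
  logs: Fetcher;
  deploys: Fetcher;
}

// Run the gathering steps from the diagram concurrently, since none
// of them depend on each other, then return the context in a fixed order.
async function investigate(service: string, sources: Sources): Promise<string[]> {
  const [metrics, logs, deploys] = await Promise.all([
    sources.metrics(service),
    sources.logs(service),
    sources.deploys(service),
  ]);
  // The summary step would hand this context to Claude for analysis;
  // here we just return the gathered pieces.
  return [metrics, logs, deploys];
}
```

Keeping the gathering concurrent matters: at 2 AM, three sequential 5-second API calls feel very different from one 5-second wait.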
Designing for the Human in the Loop
Before we write code, let's establish a crucial principle: the human is never removed from incident response. Claude makes suggestions, but humans make decisions. Claude performs investigation, but humans evaluate the results. This isn't a limitation—it's a feature.
Removing humans from incident response is dangerous. Claude might suggest a remediation that is technically correct but politically problematic (like canceling an important customer's subscription to stop resource usage). Claude might misinterpret a metric change and suggest action based on incorrect assumptions. Claude might recommend a remediation that violates compliance requirements it doesn't know about.
By keeping humans in the loop, we get the best of both worlds. Claude's analysis is fast and thorough, but it isn't informed by the compliance requirements, organizational politics, and contextual knowledge that humans have. When Claude's suggestion conflicts with human judgment, the human chooses. The human learns from the suggestion and updates their mental model. The system learns from the human's choice for next time.
This also creates psychological safety in on-call. You're not trusting a black box to make critical decisions. You're using Claude as a very smart assistant. You read its analysis, you consider its suggestions, you apply your judgment. You stay in control of outcomes. That's what reduces stress. Not removing decisions from you, but giving you better information to make those decisions.
Design your entire system with this principle. No automatic remediation without approval. No suppressing alerts without human notification. No escalations that bypass humans. When you're tempted to automate something fully, ask yourself: what could go wrong if the automation makes a mistake? If the answer is "something serious," keep the human in the loop.
Building the Alert Handler
Let's start with TypeScript code that listens for alerts and kicks off the investigation. The on-call assistant receives webhooks from your monitoring system (Datadog, Prometheus AlertManager, PagerDuty, etc.) and coordinates immediate investigation:
import Anthropic from "@anthropic-ai/sdk";
import express, { Request, Response } from "express";
import fs from "fs";
import path from "path";
interface AlertPayload {
alertName: string;
severity: "critical" | "warning" | "info";
message: string;
alertedAt: string;
affectedService: string;
metricsUrl?: string;
logsUrl?: string;
}
interface IncidentContext {
alert: AlertPayload;
recentMetrics: string;
relevantLogs: string;
deploymentHistory: string;
similarPastIncidents: string;
}
class OnCallAssistant {
private client: Anthropic;
private app: express.Application;
private knowledgeBase: Map<string, string> = new Map();
constructor() {
this.client = new Anthropic();
this.app = express();
this.app.use(express.json());
this.setupRoutes();
this.loadKnowledgeBase();
}
private setupRoutes(): void {
// Webhook endpoint for alerts
this.app.post("/alert", async (req: Request, res: Response) => {
const alert: AlertPayload = req.body;
console.log(
`[${new Date().toISOString()}] Alert received: ${alert.alertName}`,
);
try {
const investigation = await this.investigateAlert(alert);
res.json({ status: "success", investigation });
// Record the investigation for the learning feedback loop
this.recordResolutionFeedback(alert, investigation);
} catch (error) {
console.error("Investigation failed:", error);
res.status(500).json({ status: "error", message: String(error) });
}
});
this.app.listen(3000, () => {
console.log("On-call assistant listening on port 3000");
});
}
private loadKnowledgeBase(): void {
// In production, this would read from a database or persistent store
// For now, we'll seed with a few common patterns
this.knowledgeBase.set(
"high_cpu",
`
PATTERN: Service experiencing high CPU
COMMON CAUSES:
- Memory leak in recent deployment
- Unexpected traffic spike
- Inefficient database query
- Runaway background process
INVESTIGATION STEPS:
1. Check deployment timeline (was something deployed in last 1h?)
2. Look at request rate graphs
3. Profile CPU usage by process
4. Check database query logs for slow queries
REMEDIATION OPTIONS:
- Roll back last deployment
- Scale horizontally (add instances)
- Kill specific process if runaway detected
- Enable query cache if applicable
`,
);
this.knowledgeBase.set(
"database_connection_pool_exhausted",
`
PATTERN: Database connection pool exhausted
COMMON CAUSES:
- Connection leak (not returning to pool)
- Sudden traffic spike
- Long-running queries blocking connections
- Application restart backlog
INVESTIGATION STEPS:
1. Check active connections vs pool size
2. Look for long-running queries
3. Check application logs for connection errors
4. Verify no cascading failures from upstream
REMEDIATION OPTIONS:
- Increase connection pool size
- Kill long-running queries (carefully)
- Restart application servers to clear leak
- Implement circuit breaker for downstream services
`,
);
this.knowledgeBase.set(
"disk_space_critical",
`
PATTERN: Disk space critically low
COMMON CAUSES:
- Log files not rotating
- Temporary files accumulating
- Database bloat
- Unexpected large file creation
INVESTIGATION STEPS:
1. Find largest directories (du -sh /*)
2. Check log rotation config
3. Look for large temp files
4. Verify database maintenance ran
REMEDIATION OPTIONS:
- Clean old logs (after backing up if needed)
- Clear temp directories
- Add disk space (if cloud VM)
- Trigger manual database maintenance
- Implement cleanup cron jobs
`,
);
}
private async investigateAlert(alert: AlertPayload): Promise<object> {
// Step 1: Gather context
const context = await this.gatherContext(alert);
// Step 2: Get Claude's analysis
const analysis = await this.analyzeWithClaude(context);
// Step 3: Suggest remediation
const remediation = await this.suggestRemediation(context, analysis);
return {
alert: alert.alertName,
severity: alert.severity,
timestamp: new Date().toISOString(),
contextSummary: {
affectedService: alert.affectedService,
alertedAt: alert.alertedAt,
},
analysis,
recommendedActions: remediation.actions,
estimatedResolutionTime: remediation.estimatedTime,
};
}
private async gatherContext(alert: AlertPayload): Promise<IncidentContext> {
// Fetch from every system in parallel; none of these calls depend on each other
// In production, these would be real API calls to your monitoring/logging infrastructure
const [recentMetrics, relevantLogs, deploymentHistory, similarPastIncidents] =
await Promise.all([
this.fetchMetrics(alert.affectedService),
this.fetchLogs(alert.affectedService),
this.fetchDeployments(alert.affectedService),
this.findSimilarIncidents(alert.alertName),
]);
return {
alert,
recentMetrics,
relevantLogs,
deploymentHistory,
similarPastIncidents,
};
}
private async fetchMetrics(service: string): Promise<string> {
// Mock implementation - would call actual monitoring API
return `
Service: ${service}
Last 1 hour metrics:
- CPU: 85% (baseline: 45%)
- Memory: 78% (baseline: 62%)
- Request rate: 12,500 req/s (baseline: 8,000 req/s)
- P99 latency: 450ms (baseline: 120ms)
- Error rate: 2.3% (baseline: 0.1%)
`;
}
private async fetchLogs(service: string): Promise<string> {
// Mock implementation - would query actual log aggregation system
return `
Last 50 error logs from ${service}:
2026-03-17T14:23:45Z ERROR: Connection pool exhausted (waiting for 342ms+)
2026-03-17T14:23:42Z WARN: 10 slow queries detected (>1s)
2026-03-17T14:23:38Z ERROR: OOM killer invoked on 3 processes
[... and 47 more errors with stack traces ...]
`;
}
private async fetchDeployments(service: string): Promise<string> {
// Mock implementation - would query deployment system
return `
Recent deployments for ${service}:
- 2026-03-17 13:45 UTC: Version 3.2.1 → 3.2.2 (rollout 100% complete)
Changes: Refactored query caching, added metrics collection
- 2026-03-17 08:30 UTC: Version 3.2.0 → 3.2.1 (minor patch)
Changes: Fixed logging format bug
- 2026-03-16 19:15 UTC: Version 3.1.9 → 3.2.0 (major update)
Changes: New feature flag system, database migrations
`;
}
private async findSimilarIncidents(alertName: string): Promise<string> {
// Check knowledge base for similar incidents
const knownPattern = this.knowledgeBase.get(
alertName.toLowerCase().replace(/ /g, "_"),
);
if (knownPattern) {
return `Known pattern detected: "${alertName}"\n${knownPattern}`;
}
return `No similar incidents found for "${alertName}" in knowledge base.`;
}
private async analyzeWithClaude(context: IncidentContext): Promise<object> {
const prompt = `
You are an expert on-call assistant. An alert has fired and you have the following context:
ALERT: ${context.alert.alertName}
SEVERITY: ${context.alert.severity}
AFFECTED SERVICE: ${context.alert.affectedService}
MESSAGE: ${context.alert.message}
ALERTED AT: ${context.alert.alertedAt}
RECENT METRICS:
${context.recentMetrics}
RELEVANT LOGS:
${context.relevantLogs}
DEPLOYMENT HISTORY:
${context.deploymentHistory}
SIMILAR PAST INCIDENTS:
${context.similarPastIncidents}
Based on this context, provide:
1. Root cause hypothesis (what's probably wrong?)
2. Confidence level (how sure are you?)
3. Key signals that support your hypothesis
4. What you'd investigate next if this is wrong
5. Estimated impact (how many users affected?)
Be concise and actionable. Assume the on-call engineer has 2 minutes to read this.
`;
const message = await this.client.messages.create({
model: "claude-opus-4-1",
max_tokens: 1024,
messages: [
{
role: "user",
content: prompt,
},
],
});
const analysisText =
message.content[0].type === "text" ? message.content[0].text : "";
return {
rootCauseHypothesis: this.extractSection(analysisText, "Root cause"),
confidenceLevel: this.extractConfidence(analysisText),
keySignals: this.extractSection(analysisText, "Key signals"),
nextSteps: this.extractSection(analysisText, "investigate next"),
estimatedImpact: this.extractSection(analysisText, "Estimated impact"),
};
}
private async suggestRemediation(
context: IncidentContext,
analysis: any,
): Promise<{ actions: Array<string | object>; estimatedTime: string }> {
const prompt = `
Based on this incident analysis, what are the safest remediation steps?
Alert: ${context.alert.alertName}
Root cause hypothesis: ${analysis.rootCauseHypothesis}
Confidence: ${analysis.confidenceLevel}
Return a JSON object with:
{
"actions": [
{
"step": 1,
"action": "description of action",
"safety": "high/medium/low",
"rollback": "how to undo if it fails",
"estimatedDuration": "5 minutes"
}
],
"estimatedTotalTime": "15 minutes"
}
Only suggest actions that:
- Have a clear rollback path
- Have been used successfully before for similar issues
- Don't require human judgment to be safe
`;
const message = await this.client.messages.create({
model: "claude-opus-4-1",
max_tokens: 1024,
messages: [
{
role: "user",
content: prompt,
},
],
});
try {
const responseText =
message.content[0].type === "text" ? message.content[0].text : "{}";
const jsonMatch = responseText.match(/\{[\s\S]*\}/);
const actionData = jsonMatch ? JSON.parse(jsonMatch[0]) : { actions: [] };
return {
actions: actionData.actions || [],
estimatedTime: actionData.estimatedTotalTime || "unknown",
};
} catch {
return {
actions: [
"Manual investigation required - Claude analysis inconclusive",
],
estimatedTime: "30+ minutes",
};
}
}
private recordResolutionFeedback(
alert: AlertPayload,
investigation: any,
): void {
// In production, log this to a database or file for learning
const feedbackDir = path.join(".", "incident_feedback");
fs.mkdirSync(feedbackDir, { recursive: true }); // writeFileSync fails if the directory is missing
const feedbackFile = path.join(
feedbackDir,
`${alert.affectedService}_${Date.now()}.json`,
);
fs.writeFileSync(
feedbackFile,
JSON.stringify({ alert, investigation }, null, 2),
);
}
private extractSection(text: string, keyword: string): string {
const regex = new RegExp(`${keyword}[^]*?(?=\\n\\d|$)`, "i");
const match = text.match(regex);
return match ? match[0].trim() : "Not found in analysis";
}
private extractConfidence(text: string): string {
const match = text.match(/confidence[:\s]+([^.!?\n]+)/i);
return match ? match[1].trim() : "Unknown";
}
start(): void {
console.log("Starting on-call assistant...");
}
}
// Initialize and start
const assistant = new OnCallAssistant();
assistant.start();
export { OnCallAssistant, AlertPayload, IncidentContext };

This code sets up an alert ingestion server that listens for webhooks and coordinates investigation. Here's what happens when an alert fires:
- The /alert endpoint receives the alert payload
- investigateAlert() orchestrates the whole investigation
- gatherContext() pulls data from multiple systems in parallel
- analyzeWithClaude() sends all that context to Claude for analysis
- suggestRemediation() asks Claude for safe, actionable fixes
- We return everything to the on-call engineer in a structured format
The knowledge base (initialized in loadKnowledgeBase()) is the secret sauce here. It stores patterns from past incidents. When we see a similar alert again, we can immediately suggest what worked before. Over time, this becomes incredibly valuable—new on-call engineers don't need to reinvent the wheel.
Learning from Resolutions: Building Feedback Loops
Here's where the system gets smarter over time. We need to capture how incidents were actually resolved. The on-call assistant stores resolution records and analyzes patterns across multiple incidents to improve future recommendations:
interface ResolutionRecord {
incidentId: string;
alertName: string;
service: string;
timestamp: string;
suggestedActions: string[];
actualAction: string;
actionWasSuccessful: boolean;
timeToResolution: number; // in seconds
postMortemNotes: string;
}
class IncidentLearner {
private client: Anthropic;
private resolutionHistory: ResolutionRecord[] = [];
constructor() {
this.client = new Anthropic();
}
recordResolution(record: ResolutionRecord): void {
this.resolutionHistory.push(record);
}
async learnPatterns(): Promise<void> {
if (this.resolutionHistory.length < 5) {
console.log("Need at least 5 resolution records to identify patterns");
return;
}
const prompt = `
Analyze these ${this.resolutionHistory.length} incident resolutions to identify patterns:
${this.resolutionHistory
.map(
(r, i) =>
`Incident ${i + 1}:
Alert: ${r.alertName}
Service: ${r.service}
Suggested: ${r.suggestedActions.join(", ")}
Actual: ${r.actualAction}
Success: ${r.actionWasSuccessful}
Time: ${r.timeToResolution}s`,
)
.join("\n\n")}
What patterns do you see? Which suggested actions work best? Which should we improve or retire?
`;
const message = await this.client.messages.create({
model: "claude-opus-4-1",
max_tokens: 1024,
messages: [
{
role: "user",
content: prompt,
},
],
});
const analysis =
message.content[0].type === "text" ? message.content[0].text : "";
console.log("\n--- Pattern Analysis ---");
console.log(analysis);
// Store these patterns back in the knowledge base
this.updateKnowledgeBase(analysis);
}
private updateKnowledgeBase(analysis: string): void {
// In production, write this back to persistent storage
console.log(
"Updating knowledge base with patterns...\n(In production: write to KB database)",
);
}
}

After handling just five incidents, Claude can analyze what actually works and update recommendations. The system gets better every time you use it. On-call engineers implicitly teach the system through their actions.
Integration with Your Existing Stack
Your on-call assistant doesn't exist in isolation—it needs to integrate with the tools your team already uses. Most teams have invested in monitoring, logging, incident tracking, and communication platforms. The on-call assistant should be the glue that connects these systems.
For monitoring integration, you need to ingest alerts from your monitoring platform. Datadog, Prometheus, New Relic, and others all have webhook capabilities. Hook up your monitoring platform to send alerts to your on-call assistant. The assistant then gathers context from the monitoring system's APIs. Many platforms expose metric history, alert history, and service metadata through APIs. Use these to build the context.
For logging integration, you need to query logs based on the alert. Maybe you don't have a direct integration—no problem. The on-call assistant can query logs via API. Elasticsearch, Splunk, Datadog, and others expose search APIs. Build an abstraction that lets the assistant query logs from your platform of choice.
For incident tracking, record every incident the assistant handles. Many teams use Jira, PagerDuty, or custom tracking systems. When the assistant generates analysis and recommendations, create an incident record automatically. Link it to related alerts. Update it as the incident is resolved. This creates a historical record you can analyze.
For communication, alert the on-call engineer through their preferred channel. Slack, email, SMS, PagerDuty—get the message to them where they'll see it immediately. Don't just push data to a web dashboard and hope they notice.
When designing integrations, start with webhooks and APIs. Most modern platforms support these. Write adapters that let you swap implementations. Today you're monitoring with Datadog, but in two years you might switch. If your system is loosely coupled through adapters, the switch doesn't require rewriting everything.
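Here's one way that adapter layer might look as a sketch; the interface and class names are invented for illustration, and a real Elasticsearch or Splunk client would replace the stub:

```typescript
// A minimal adapter interface for log backends. Method names are
// illustrative; real clients (Elasticsearch, Splunk, Datadog) differ.
interface LogBackend {
  search(service: string, sinceMinutes: number): Promise<string[]>;
}

// Stub adapter standing in for an Elasticsearch-backed implementation.
class ElasticsearchLogs implements LogBackend {
  async search(service: string, sinceMinutes: number): Promise<string[]> {
    // A real implementation would call the Elasticsearch search API here.
    return [`[es] errors for ${service} in last ${sinceMinutes}m`];
  }
}

// The assistant depends only on the interface, so backends can be
// swapped without touching investigation logic.
class LogContext {
  constructor(private backend: LogBackend) {}

  async gather(service: string): Promise<string[]> {
    return this.backend.search(service, 60);
  }
}
```

Switching log platforms later then means writing one new adapter class, not rewriting the assistant.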
Automating Safe Remediations
Now for the scary part—actually executing fixes automatically. We need guardrails to ensure we never make things worse:
interface RemediationAction {
id: string;
name: string;
description: string;
commandOrAPI: string;
safetyLevel: "low" | "medium" | "high";
requiresApproval: boolean;
rollbackCommand?: string;
estimatedDuration: number;
affectsSLA: boolean;
}
class SafeRemediationExecutor {
private client: Anthropic;
async executeRemediationWithConfirmation(
action: RemediationAction,
onCallEngineer: string,
): Promise<{ success: boolean; output: string; rollbackNeeded: boolean }> {
console.log(`\n🚨 Remediation requiring approval:`);
console.log(`Action: ${action.name}`);
console.log(`Description: ${action.description}`);
console.log(`Duration: ~${action.estimatedDuration}s`);
console.log(`Rollback: ${action.rollbackCommand || "Not reversible"}`);
console.log(`\nOn-call engineer (${onCallEngineer}): Approve? (yes/no)`);
// In real implementation, this would wait for API response, webhook, or UI approval
const approved = true; // Simulating approval
if (!approved) {
console.log("Remediation cancelled by on-call engineer");
return { success: false, output: "Cancelled", rollbackNeeded: false };
}
try {
console.log(`\n▶ Executing: ${action.name}`);
const startTime = Date.now();
// Execute the remediation (command, API call, etc.)
const output = await this.executeCommand(action.commandOrAPI);
const duration = Date.now() - startTime;
// Verify it worked
const success = await this.verifyRemediation(action.id);
if (success) {
console.log(`✓ Remediation succeeded in ${duration}ms`);
return { success: true, output, rollbackNeeded: false };
} else {
console.log(`✗ Verification failed, attempting rollback...`);
if (action.rollbackCommand) {
await this.executeCommand(action.rollbackCommand);
return {
success: false,
output: "Remediation failed, rolled back",
rollbackNeeded: true,
};
} else {
return {
success: false,
output:
"Remediation failed, no rollback available - manual intervention needed",
rollbackNeeded: true,
};
}
}
} catch (error) {
console.error(`✗ Execution failed: ${error}`);
if (action.rollbackCommand) {
console.log("Attempting rollback...");
await this.executeCommand(action.rollbackCommand);
}
return {
success: false,
output: `Execution failed: ${error}`,
rollbackNeeded: true,
};
}
}
private async executeCommand(command: string): Promise<string> {
// Mock implementation
console.log(` Running: ${command}`);
return `Command executed: ${command}`;
}
private async verifyRemediation(actionId: string): Promise<boolean> {
// Check if the remediation actually fixed the problem
// This would be alert-specific verification
console.log(` Verifying remediation...`);
return true; // Mock: assume success
}
}

The key principle here: human approval gates for any action with risk. We never execute remediations blindly. The on-call engineer makes the final call, and we provide clear information for that decision.
Building a Sustainable Knowledge Base
The on-call assistant's knowledge base is the foundation of its value. When it's empty, the system is just gathering data and asking Claude for analysis—useful but not transformative. After three months of incidents, the knowledge base contains patterns. After a year, it's incredibly valuable. The trick is building it incrementally and iteratively.
Start with a template of common patterns you already know. Your team has handled similar issues before. Write down what you've learned. High CPU is often caused by a memory leak. Database connection exhaustion usually means a new feature is hitting the database harder. These patterns, encoded in the knowledge base, let the on-call assistant make better suggestions immediately.
As the on-call assistant handles incidents, feed the resolutions back into the knowledge base. Not manually—automatically. Every resolution is a data point. Claude sees the pattern, updates the knowledge base entry. Over time, the suggestions become increasingly tailored to your specific systems and your team's approach to problem-solving.
Periodically review and curate the knowledge base. Some patterns become outdated as you refactor systems. Some patterns emerge as more important than originally thought. Manage the knowledge base like you manage documentation—keep it current, keep it accurate, keep it organized.
The knowledge base is also incredibly valuable for new team members. Onboarding to on-call usually involves learning tribal knowledge—who to call for database issues, which services tend to fail together, what remediation steps usually work. Much of this tribal knowledge can be captured in the knowledge base. New engineers read the patterns and immediately understand your systems.
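A sketch of what a curated knowledge-base entry might carry beyond the guidance text itself. The schema and thresholds here are suggestions, not a standard:

```typescript
// A knowledge-base entry with curation metadata (field names are
// illustrative, not a fixed schema).
interface KBEntry {
  pattern: string;      // e.g. "high_cpu"
  guidance: string;     // investigation + remediation notes
  lastVerified: string; // ISO date of last human review
  timesSuggested: number;
  timesHelpful: number; // incremented from resolution feedback
}

// Flag entries that are stale or rarely helpful so a human can
// review, update, or retire them during periodic curation.
function needsReview(entry: KBEntry, now: Date, maxAgeDays = 180): boolean {
  const ageDays =
    (now.getTime() - new Date(entry.lastVerified).getTime()) / 86_400_000;
  const helpRate =
    entry.timesSuggested === 0 ? 1 : entry.timesHelpful / entry.timesSuggested;
  return ageDays > maxAgeDays || helpRate < 0.3;
}
```

Running a check like this monthly gives you a short review queue instead of a stale knowledge base nobody trusts.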
Handling Cascading Failures and Cross-Service Issues
Real production incidents are rarely isolated. A database slowdown in one service triggers backpressure in a dependent service, which then times out and affects a third service downstream. By the time you get paged, you have multiple alerts firing, and it's not immediately obvious which one is the root cause.
Your on-call assistant needs to be smart about this. Instead of treating each alert independently, correlate alerts across time and services. If three services all degrade at exactly the same time, they're probably related. If service B starts failing five minutes after service A, service A might be the root cause.
In your alert ingestion logic, check the alert timestamp against other recent alerts for related services. Build an incident correlation system that groups related alerts together. When you send context to Claude, include not just the direct alert but also nearby alerts for related services. This helps Claude see the bigger picture: maybe the root cause is upstream, not in the directly affected service.
Build a dependency graph of your services (even a simple one) and include it in the context. When investigating a database alert, Claude should know which services depend on that database. When troubleshooting a cache layer, Claude should know which services use it. This dependency awareness helps separate root causes from cascading effects.
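As a sketch of that idea (the grouping window, graph shape, and heuristic are assumptions you'd tune for your own systems), correlation plus a dependency graph might look like:

```typescript
// A toy dependency graph: service → services it depends on.
type DepGraph = Record<string, string[]>;

interface FiredAlert {
  service: string;
  firedAt: number; // epoch ms
}

// Group alerts that fired within a window, then prefer the most
// upstream alerting service in the group as the root-cause candidate.
function rootCauseCandidate(
  alerts: FiredAlert[],
  deps: DepGraph,
  windowMs = 10 * 60_000,
): string | null {
  if (alerts.length === 0) return null;
  const earliest = Math.min(...alerts.map((a) => a.firedAt));
  const group = alerts.filter((a) => a.firedAt - earliest <= windowMs);
  const inGroup = new Set(group.map((a) => a.service));

  // A service is "upstream" in the group if some other alerting
  // service directly depends on it.
  const dependedOn = new Set<string>();
  for (const s of inGroup) {
    for (const dep of deps[s] ?? []) {
      if (inGroup.has(dep)) dependedOn.add(dep);
    }
  }

  // Prefer a service that is depended on but depends on no other
  // alerting service itself — a likely root cause.
  for (const s of dependedOn) {
    const own = (deps[s] ?? []).filter((d) => inGroup.has(d));
    if (own.length === 0) return s;
  }
  return group[0].service; // fall back to the earliest alert
}
```

Even this crude heuristic changes the page from "three services are broken" to "orders-db is the likely root cause; checkout and billing are downstream."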
Improving Over Time Through Analysis
The real power of an on-call assistant emerges when you analyze patterns across multiple incidents. After handling fifty incidents, what did you learn? Which types of issues get resolved fastest? Which types tend to recur? Are there services that trigger disproportionately many alerts?
Create a periodic analysis pipeline that Claude runs weekly. Feed it the entire incident history for the week and ask it to identify patterns: Which services were most unstable? Which remediation steps worked best? Which suggested actions did on-call engineers ignore, and why? This feedback loop turns your on-call system into a learning system.
When Claude identifies a pattern—say, "Service X fails within 30 minutes after every deployment"—surface that to your engineering team. Not as a criticism, but as actionable intelligence. Maybe there's a pre-deployment verification missing. Maybe the service needs better health checks. Maybe it's database migrations that need to complete before traffic ramps up.
Use these insights to improve your services themselves, not just your on-call response. The goal is fewer incidents, not better incident handling. Better incident handling is a means to that end.
On-Call Rotation Resilience
Another real-world consideration: what happens when your on-call assistant goes down? Your engineers still need to handle incidents. Build graceful degradation.
If the assistant service is unreachable, your alerting system should still page the on-call engineer—just without the assistant context. The alert still goes through. Humans still respond. You just don't get the analysis and recommendations, which is suboptimal but acceptable.
Document fallback procedures that work without the assistant. Keep runbooks in accessible places. Make sure newer team members aren't 100% reliant on the assistant for understanding what to do. The assistant amplifies human expertise, but humans should still be able to function without it.
Test your fallback procedures periodically. Do a practice incident response without the assistant. This serves two purposes: it verifies your fallbacks work, and it keeps your team's incident response muscles from atrophying. Even if the assistant works perfectly 99.9% of the time, that 0.1% when it fails shouldn't leave your team helpless.
Metrics and Success Measurement
How do you know if your on-call assistant is actually working? Measure these metrics before and after deployment:
Mean Time to Resolution (MTTR) tells you how long incidents take to resolve from alert to fix deployed. This should decrease significantly with a working assistant—maybe from 25 minutes to 8 minutes.
Mean Time to Detection (MTTD) tells you how long it takes between when an issue starts and when your monitoring alerts on it. This doesn't change (your monitoring is unchanged), but it's useful context when interpreting MTTR.
Mean Time to Acknowledge (MTTA) tells you how long it takes between the alert firing and the on-call engineer engaging with it, which here means reading the assistant's summary. With a good assistant that immediately generates a summary, MTTA should drop to near zero. With a slow assistant, MTTA increases, which defeats the purpose.
Incident recurrence rate tells you what percentage of incidents are repeats (same root cause, same service, similar timeframe). The assistant should help reduce this by identifying patterns and suggesting structural fixes, not just quick remediations.
On-call engineer satisfaction matters too. Survey them regularly. Is the assistant actually saving them time, or is it adding overhead? Is it providing useful information or noise? The best technical improvement means nothing if engineers disable it because it's not helpful.
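The duration metrics are straightforward to compute once incidents are stored with timestamps. A sketch, assuming each incident record carries `alerted_at`, `resolved_at`, `service`, and `root_cause` fields:

```python
from datetime import datetime
from statistics import mean

def mttr_minutes(incidents: list[dict]) -> float:
    """Mean minutes from alert to resolution. Timestamps are ISO-8601 strings."""
    durations = [
        (datetime.fromisoformat(i["resolved_at"]) -
         datetime.fromisoformat(i["alerted_at"])).total_seconds() / 60
        for i in incidents
    ]
    return mean(durations)

def recurrence_rate(incidents: list[dict], window_days: int = 30) -> float:
    """Fraction of incidents whose (service, root_cause) pair repeats
    within the window -- the signal that quick fixes aren't sticking."""
    repeats = 0
    seen: dict[tuple, datetime] = {}
    for i in sorted(incidents, key=lambda i: i["alerted_at"]):
        key = (i["service"], i["root_cause"])
        t = datetime.fromisoformat(i["alerted_at"])
        prev = seen.get(key)
        if prev and (t - prev).days <= window_days:
            repeats += 1
        seen[key] = t
    return repeats / len(incidents) if incidents else 0.0
```

Compute these on the incident history you're already storing; comparing the month before deployment against the month after gives you the before/after picture.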
Wrapping It All Up: The Real-World Impact
Let's talk about what this means in practice. An on-call incident that might have taken 20 minutes to resolve ("Who wrote this service? What changed? Where are the logs?") can now take 5 minutes because Claude pre-fetches all that context and synthesizes it into a coherent picture.
The mean time to resolution (MTTR) drops significantly. The mean time to detection (MTTD) stays the same (your monitoring is unchanged), but now the time between detection and understanding is minimal.
For teams with high-velocity deployments and complex microservices, this is transformative. You spend less time on on-call drudgery and more time on actual problem-solving. And crucially, new team members benefit immediately—they don't have to ask "what does this service do?" because Claude's summaries capture it.
But the deeper impact is on on-call culture. With an assistant that provides immediate context and suggestions, being on-call becomes less stressful. You're not going in blind. You have a co-worker who's already started investigating. You have a suggested first step. This reduces the psychological burden of on-call duties, which reduces burnout and improves team retention.
The safety-first approach (approval gates, clear rollback paths, non-critical investigation steps) means you're not handing over incident response to automation—you're augmenting human judgment with better information and suggestions. That's the sweet spot.
Your on-call assistant doesn't replace human judgment. It amplifies it, making engineers more effective and their decisions faster. It's the difference between being handed a problem and a few data points, versus being handed a problem with analysis, context, and multiple suggested solutions. The human still decides which solution to pursue, but they do so with much better information.
Over time, as the system learns from resolutions, it gets smarter. The suggestions get better. The patterns it identifies become more refined. An on-call assistant deployed today will be significantly more useful six months from now, not because the code changed, but because the knowledge base grew.
That's sustainable on-call support. Not just faster incident resolution, but continuous improvement driven by every incident your team handles. Every page becomes a learning opportunity. Every resolution contributes to the system's knowledge. After a year, your on-call system is substantially better at its job than when you started.
Implementation Roadmap: From Concept to Production
Building an on-call assistant is a multi-phase project. You don't need to implement everything at once. Start with the core, validate it works, then iterate.
Phase one is alerting integration. Get alerts flowing into your assistant. Build the webhook receiver. Parse alerts from your monitoring system. Store them with timestamps. This is the foundation. Once you have alerts flowing, you can build everything else on top.
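A minimal receiver for phase one might look like this. It assumes an Alertmanager-style payload (a JSON body with an `alerts` list carrying `labels`); adapt the field names to whatever your monitoring system sends.

```python
import json
import sqlite3
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, HTTPServer

DB = sqlite3.connect(":memory:")  # use a file path in production
DB.execute("""CREATE TABLE IF NOT EXISTS alerts
              (received_at TEXT, service TEXT, severity TEXT, raw TEXT)""")

def store_alert(payload: dict) -> None:
    """Normalize and persist one alert with a received timestamp."""
    labels = payload.get("labels", {})
    DB.execute("INSERT INTO alerts VALUES (?, ?, ?, ?)",
               (datetime.now(timezone.utc).isoformat(),
                labels.get("service", "unknown"),
                labels.get("severity", "unknown"),
                json.dumps(payload)))
    DB.commit()

class AlertWebhook(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        for alert in json.loads(body).get("alerts", []):
            store_alert(alert)
        self.send_response(202)  # accepted; analysis happens asynchronously
        self.end_headers()

# To run: HTTPServer(("", 9090), AlertWebhook).serve_forever()
```

Responding 202 immediately and processing afterward matters: monitoring systems retry slow webhooks, and you don't want duplicate alerts because analysis took ten seconds.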
Phase two is basic context gathering. When an alert fires, gather basic metrics and logs. This doesn't need to be perfect—just get the data flowing. Fetch metrics from your monitoring API. Query logs. Pull deployment history. Combine all this into a context object and pass it to Claude. Get a basic analysis back. This proves the concept works.
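The context object itself can be as simple as a dataclass. In this sketch, the fetch functions are stand-ins for your real monitoring, logging, and CI/CD API clients.

```python
import json
from dataclasses import dataclass, field

# Stand-ins: replace with your monitoring, logging, and CI/CD clients.
def fetch_metrics(service: str) -> dict: return {}
def fetch_logs(service: str) -> list: return []
def fetch_deploys(service: str) -> list: return []

@dataclass
class IncidentContext:
    """Everything handed to Claude for a first-pass analysis."""
    alert: dict
    metrics: dict = field(default_factory=dict)
    recent_logs: list = field(default_factory=list)
    recent_deploys: list = field(default_factory=list)

    def to_prompt(self) -> str:
        return ("You are assisting an on-call engineer. Analyze this incident.\n"
                f"Alert: {json.dumps(self.alert)}\n"
                f"Key metrics (last 30 min): {json.dumps(self.metrics)}\n"
                f"Recent log lines: {json.dumps(self.recent_logs[:50])}\n"
                f"Deployments (last 24 h): {json.dumps(self.recent_deploys)}\n"
                "Give a likely root cause and one suggested first step.")

def gather_context(alert: dict) -> IncidentContext:
    service = alert.get("service", "unknown")
    return IncidentContext(alert=alert,
                           metrics=fetch_metrics(service),
                           recent_logs=fetch_logs(service),
                           recent_deploys=fetch_deploys(service))
```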
Phase three is knowledge base and learning. Start recording resolutions. Extract patterns. Build a knowledge base that feeds into future analyses. This is where the system starts becoming valuable—it learns from your experience.
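The knowledge base can start as an append-only JSON-lines file (the path below is arbitrary): one function records each resolution, another retrieves the fixes that worked for a service so they can be fed into the next analysis.

```python
import json
from pathlib import Path

KB_PATH = Path("knowledge_base.jsonl")  # arbitrary location; pick your own

def record_resolution(service: str, symptom: str, fix: str, worked: bool) -> None:
    """Append one resolved incident to the knowledge base."""
    entry = {"service": service, "symptom": symptom, "fix": fix, "worked": worked}
    with KB_PATH.open("a") as f:
        f.write(json.dumps(entry) + "\n")

def prior_fixes(service: str, limit: int = 5) -> list[dict]:
    """The most recent fixes that worked for this service, ready to feed
    into Claude's context on the next incident."""
    if not KB_PATH.exists():
        return []
    entries = [json.loads(line) for line in KB_PATH.read_text().splitlines()]
    hits = [e for e in entries if e["service"] == service and e["worked"]]
    return hits[-limit:]
```

Recording the fixes that didn't work is just as valuable: it stops Claude from repeatedly suggesting a remediation your team has already ruled out.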
Phase four is safe automation. Identify remediations that are safe to attempt automatically. These are operations that have clear rollback paths and high success rates. Start automating these with approval gates: an engineer approves, the system attempts the remediation, and then verifies that it worked.
Phase five is integration with your communication stack. Get alerts to your on-call engineer through the channels they use. Build dashboards. Make sure information reaches the right people at the right time.
Phase six is optimization. Now that you have a working system, optimize it. Improve alert correlation. Reduce noise. Improve suggestion quality. Learn from what works and what doesn't.
This phased approach lets you get value early while building toward a more complete system over time. Each phase is a complete, working system. Each phase adds value. You're never stuck waiting for the perfect solution.
As you move through phases, document what you learn. Document the patterns you discover. Document the remediations that work best. Document the gotchas. This documentation becomes invaluable as you scale the system.
Building an on-call assistant might feel like a significant undertaking, but the payoff is genuine. You reduce on-call burden. You improve incident response. You build organizational knowledge. You create a system that compounds in value over time.
Start with phase one. Get alerts flowing. Move to phase two. Get basic analysis working. Then iterate from there. In six months, you'll have a system your team deeply depends on. In a year, it'll be an indispensable part of how you operate.
That's when you know you've built something worth building.
Where to Start Your First On-Call Assistant
If this vision excites you but feels overwhelming, start with the smallest possible version. Pick one service. Pick the most common incident affecting that service. Build an assistant that handles just that incident.
Feed it three past incidents for that service. Have Claude analyze them. Extract what it learns. Use that knowledge base entry to analyze the next incident. See if Claude's suggestions match what you actually did.
Once you validate that basic pattern for one service and one incident type, expand to a second service. Then a third. Grow your knowledge base as you go.
After three months of incremental development, you'll have an on-call assistant that's genuinely useful for your team. It won't handle every incident, but it'll handle the most common ones. The suggestions will be accurate because they're based on your team's actual incident history.
From there, continuous improvement is natural. Each month, the assistant handles more incidents. The knowledge base grows. Suggestions become more refined. You're not replacing your incident response process—you're augmenting it month by month.
That's sustainable systems development: start small, iterate constantly, let the real world guide improvements.