July 28, 2025
Claude DevOps Development

Building a Capacity Planning Assistant with Claude Code

If you've ever stared at a monitoring dashboard at 2 AM wondering "do we need more infrastructure?", you know capacity planning is both critical and tedious. It's the kind of work that demands rigor—analyzing trends, cross-referencing multiple systems, considering cost vs. performance tradeoffs—but it's easy to miss patterns when you're juggling spreadsheets and historical logs.

That's where Claude Code comes in. We're going to build a capacity planning assistant that takes your monitoring data, detects utilization trends, forecasts future needs, and delivers actionable recommendations. It's proactive infrastructure management powered by AI analysis.

Table of Contents
  1. Why AI for Capacity Planning?
  2. The Architecture
  3. Step 1: Reading and Parsing Metrics Data
  4. Step 2: Computing Trend Statistics
  5. The Statistical Foundation: Why Pre-computation Matters
  6. Step 3: Invoking Claude for Analysis and Recommendations
  7. Step 4: Generating Infrastructure-as-Code Recommendations
  8. Step 5: Building an Automated Capacity Alerting System
  9. Integration Checklist
  10. Why Claude Code Is Perfect for This
  11. Handling Edge Cases and Anomalies
  12. Cost Modeling and Budget Forecasting
  13. The Science of Predictive Capacity Planning
  14. Alerts and Continuous Monitoring
  15. The Integration Challenge: Making Predictions Actionable
  16. Building Organizational Muscle Memory
  17. Forecasting Across Different Infrastructure Types
  18. Lessons Learned and Best Practices
  19. Handling Forecast Errors and Learning from Reality
  20. From Reactive to Proactive to Adaptive
  21. Next Steps
  22. The Organizational Impact of Systematic Capacity Planning
  23. The Future of Capacity Planning
  24. Closing: Infrastructure as Competitive Advantage

Why AI for Capacity Planning?

Before we dive into code, let's talk about why this matters. Traditional capacity planning is reactive: you either overprovision (wasting money) or underprovision (getting paged at midnight). There's a middle ground that requires:

  1. Pattern recognition across time series — CPU, memory, disk, network all interact
  2. Contextual understanding — A spike at 3 PM Tuesday might mean something different than one at 3 AM Friday
  3. Cost-aware recommendations — You need to balance capacity with budget constraints
  4. Trend extrapolation — Not just "we're at 85% utilization today," but "we'll hit that in 12 weeks at current growth"
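The extrapolation in point 4 is simple arithmetic once you have a growth rate. A minimal sketch (the 0.15 points/day rate here is made up for illustration):

```typescript
// Days until a metric crosses a threshold at the current linear growth
// rate. Inputs are percentage points; growthPerDay is points per day.
function daysToThreshold(
  current: number,
  threshold: number,
  growthPerDay: number,
): number {
  return Math.ceil((threshold - current) / growthPerDay);
}

// At 72% today, growing 0.15 points/day, 85% is ~87 days out
console.log(daysToThreshold(72, 85, 0.15)); // → 87
```

The hard part isn't this division; it's producing a trustworthy growth rate from noisy data, which is what the rest of this post is about.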

AI excels at these tasks. Claude can ingest your Prometheus metrics, Datadog dashboards, or CloudWatch data, spot the patterns your eyeballs would miss, and explain its reasoning in plain English. More importantly, it can generate infrastructure-as-code (Terraform, CloudFormation) recommendations ready to deploy.

Here's the deeper issue with traditional capacity planning: it's boring. It's the kind of work that slides down the priority list. Someone needs to pull metrics, manually calculate trend lines, compare against budget, write up recommendations, and fight for approval. By the time the analysis is done, the numbers are old. By the time the approval comes through, you're already under load.

With AI-powered capacity planning, the analysis happens continuously, automatically. You're not waiting for someone to make time for it. You're not leaving patterns undiscovered because nobody had time to check. The system spots when your growth is accelerating, correlates it with product changes, and tells you not just "you need more capacity" but "here's why, here's when, and here's how much it will cost."

This matters economically. Over-provisioning by 30% for six months costs the same as doing the right capacity upgrade a few months early. But being under-capacity for even one week can cost millions in lost revenue or user churn. The middle ground—upgrading at exactly the right time—requires precision, and precision requires data analysis at a scale humans typically skip.

The Architecture

We're building a system with three core layers:

┌─────────────────────────────────────────┐
│    Capacity Planning Assistant          │
│  (Claude Code + Analysis Engine)        │
└──────────────┬──────────────────────────┘
               │
    ┌──────────┴──────────┬─────────────┐
    │                     │             │
┌───▼────────┐  ┌────────▼──┐  ┌──────▼──┐
│  Metrics   │  │  Trend    │  │ Forecast│
│  Ingestion │  │ Analysis  │  │ Engine  │
└────────────┘  └───────────┘  └─────────┘
    │                │              │
    └────────────────┴──────────────┘
           │
    ┌──────▼────────────┐
    │ Recommendation    │
    │ Generator         │
    │ (IaC Output)      │
    └───────────────────┘

Step 1: Reading and Parsing Metrics Data

Let's start with a realistic scenario: you have a CSV or JSON file with historical metrics. We'll build an ingestion layer that normalizes this data.

typescript
import Anthropic from "@anthropic-ai/sdk";
import fs from "fs";
 
interface MetricPoint {
  timestamp: Date;
  cpu_percent: number;
  memory_percent: number;
  disk_percent: number;
  network_mbps: number;
  request_count: number;
}
 
interface MetricsDataset {
  resource_name: string;
  timespan_days: number;
  metrics: MetricPoint[];
  collection_date: Date;
}
 
// Load metrics from a JSON file (e.g., exported from Prometheus or CloudWatch)
function loadMetricsFromFile(filepath: string): MetricsDataset {
  const rawData = JSON.parse(fs.readFileSync(filepath, "utf-8"));
 
  const metrics: MetricPoint[] = rawData.data.map((point: any) => ({
    timestamp: new Date(point.timestamp),
    cpu_percent: parseFloat(point.cpu),
    memory_percent: parseFloat(point.memory),
    disk_percent: parseFloat(point.disk),
    network_mbps: parseFloat(point.network_mbps),
    request_count: parseInt(point.requests, 10),
  }));
 
  return {
    resource_name: rawData.resource_id,
    timespan_days: Math.floor(
      (metrics[metrics.length - 1].timestamp.getTime() -
        metrics[0].timestamp.getTime()) /
        (1000 * 60 * 60 * 24),
    ),
    metrics,
    collection_date: new Date(),
  };
}
 
// Example: Load production database server metrics
const dbMetrics = loadMetricsFromFile("./data/db-server-metrics.json");
 
console.log(`Loaded metrics for ${dbMetrics.resource_name}`);
console.log(`Data spans ${dbMetrics.timespan_days} days`);
console.log(`Total data points: ${dbMetrics.metrics.length}`);

Output:

Loaded metrics for prod-db-primary
Data spans 90 days
Total data points: 1440

Why this structure matters: We normalize timestamps, parse numeric values properly, and keep metadata (like timespan_days) because Claude will use it to contextualize the data. Notice we're explicit about the resource_name — you'll want to track which system each analysis applies to.
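For reference, loadMetricsFromFile assumes an export shaped roughly like the object below. The field names are hypothetical; adapt them to whatever your exporter actually emits. A light guard catches malformed files before they reach the parser:

```typescript
// Hypothetical export shape assumed by the loader above
const sampleExport = {
  resource_id: "prod-db-primary",
  data: [
    {
      timestamp: "2025-04-29T00:00:00Z",
      cpu: "41.2",
      memory: "55.0",
      disk: "61.3",
      network_mbps: "120.5",
      requests: "18432",
    },
  ],
};

// Fail fast on malformed exports instead of producing NaN-filled metrics
function validateExport(raw: any): void {
  if (!raw?.resource_id || !Array.isArray(raw?.data) || raw.data.length === 0) {
    throw new Error("Metrics export missing resource_id or data array");
  }
}

validateExport(sampleExport); // passes silently
```

Calling this at the top of loadMetricsFromFile turns a cryptic downstream parse error into an immediate, legible one.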

Step 2: Computing Trend Statistics

Before Claude even sees the data, we compute summary statistics. This serves two purposes: it reduces token usage (Claude processes a summary, not 1,440 raw data points), and it gives Claude pre-computed insights to build on.

typescript
interface TrendStats {
  metric_name: string;
  current_value: number;
  p50: number;
  p90: number;
  p99: number;
  min: number;
  max: number;
  mean: number;
  trend_slope: number; // linear regression slope
  trend_direction: "rising" | "stable" | "falling";
  days_to_threshold: number | null; // if trending toward 90%
}
 
function computeTrendStats(
  values: number[],
  metric_name: string,
  threshold: number = 90,
): TrendStats {
  const sorted = [...values].sort((a, b) => a - b);
  const n = sorted.length;
 
  // Percentile calculations
  const p50 = sorted[Math.floor(n * 0.5)];
  const p90 = sorted[Math.floor(n * 0.9)];
  const p99 = sorted[Math.floor(n * 0.99)];
 
  // Linear regression to detect trend (x = sample index 0..n-1)
  const xMean = (values.length - 1) / 2;
  const yMean = values.reduce((a, b) => a + b, 0) / values.length;
  const slope =
    values.reduce((sum, y, i) => sum + (i - xMean) * (y - yMean), 0) /
    values.reduce((sum, _, i) => sum + (i - xMean) ** 2, 0);

  // slope is in percentage points per sample; with ~16 samples/day in
  // our 90-day dataset, ±0.01/sample is roughly ±0.16 points/day
  const trend_direction =
    slope > 0.01 ? "rising" : slope < -0.01 ? "falling" : "stable";

  // Extrapolate when we'd hit the threshold
  const current = values[values.length - 1];
  let days_to_threshold: number | null = null;
  if (trend_direction === "rising" && current < threshold) {
    const points_needed = (threshold - current) / slope;
    const points_per_day = values.length / 90; // assumes a 90-day dataset
    days_to_threshold = Math.ceil(points_needed / points_per_day);
  }
 
  return {
    metric_name,
    current_value: current,
    p50,
    p90,
    p99,
    min: sorted[0],
    max: sorted[n - 1],
    mean: yMean,
    trend_slope: slope,
    trend_direction,
    days_to_threshold,
  };
}
 
// Analyze each metric dimension
const cpuStats = computeTrendStats(
  dbMetrics.metrics.map((m) => m.cpu_percent),
  "cpu",
);
const memoryStats = computeTrendStats(
  dbMetrics.metrics.map((m) => m.memory_percent),
  "memory",
);
const diskStats = computeTrendStats(
  dbMetrics.metrics.map((m) => m.disk_percent),
  "disk",
);
const networkStats = computeTrendStats(
  dbMetrics.metrics.map((m) => m.network_mbps),
  "network",
);
 
console.log("CPU Statistics:");
console.log(
  `  Current: ${cpuStats.current_value.toFixed(1)}% | P90: ${cpuStats.p90.toFixed(1)}% | Trend: ${cpuStats.trend_direction}`,
);
if (cpuStats.days_to_threshold) {
  console.log(
    `  Will hit 90% threshold in ~${cpuStats.days_to_threshold} days`,
  );
}
 
console.log("\nMemory Statistics:");
console.log(
  `  Current: ${memoryStats.current_value.toFixed(1)}% | P90: ${memoryStats.p90.toFixed(1)}% | Trend: ${memoryStats.trend_direction}`,
);

Output:

CPU Statistics:
  Current: 62.3% | P90: 78.4% | Trend: rising
  Will hit 90% threshold in ~45 days

Memory Statistics:
  Current: 71.8% | P90: 84.2% | Trend: rising

The reasoning here: By computing days_to_threshold, we're giving Claude concrete numbers to work with. Instead of saying "CPU is rising," we're saying "CPU will exceed safe operating limits in 45 days." That's actionable. The trend_slope and trend_direction let Claude understand velocity—is this a slow burn or emergency-level growth?

The Statistical Foundation: Why Pre-computation Matters

Most people think of capacity planning as "run Claude Code against raw metrics." But that's inefficient and misses the whole point. Pre-computing statistics creates several advantages that might not be immediately obvious.

First, it dramatically reduces token usage. If you send Claude Code 1440 raw data points (90 days of metrics, as in our example), you're using tokens for every single value. If you send summarized statistics—percentiles, slopes, anomaly flags—you're using a fraction of the tokens while providing better information. Claude Code can work with percentiles because they're designed to be meaningful. When you give Claude the P90 latency (the value 90% of requests stay under), it's immediately clear what that means for user experience.
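To make the size argument concrete, here's a rough, self-contained comparison using synthetic data (real savings depend on your metric shapes):

```typescript
// Serialize 1,440 synthetic raw samples vs. a one-object summary
const raw = Array.from({ length: 1440 }, (_, i) => ({
  t: i,
  cpu: 60 + (i % 10),
}));
const summary = { p50: 62.1, p90: 78.4, p99: 85.0, slope: 0.015 };

const rawBytes = JSON.stringify(raw).length;
const summaryBytes = JSON.stringify(summary).length;
console.log(`raw: ${rawBytes} bytes, summary: ${summaryBytes} bytes`);
// The raw payload is hundreds of times larger than the summary
```

Token counts track byte counts loosely, but the orders of magnitude carry over.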

Second, pre-computation lets you filter noise intelligently. Real metrics have spikes—a deployment that causes brief high CPU, a traffic spike at midnight, a backup job that runs weekly. If you just show raw metrics, Claude Code has to reason about whether a spike is signal or noise. If you pre-filter (identifying anomalies, removing seasonal patterns), you're giving Claude the clean signal it actually needs to make decisions. This is what we do with anomaly detection later—identify what's unusual, then ignore it for trend analysis.

Third, pre-computation creates a leverage point for domain knowledge. You know that your traffic peaks on Tuesday mornings. You know that a certain type of job runs at a specific time. Instead of Claude Code having to re-discover these patterns, you can mark them as expected variation and focus the analysis on true growth. This is the hidden layer of capacity planning that separates good recommendations from lucky guesses.
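One lightweight way to encode that domain knowledge is to mask known recurring events before computing trends. A sketch, assuming a hypothetical weekly backup window on Sundays at 02:00 UTC:

```typescript
type Sample = { timestamp: Date; cpu_percent: number };

// Drop samples from known maintenance windows so an expected weekly
// spike doesn't inflate the growth slope
function excludeKnownWindows(points: Sample[]): Sample[] {
  return points.filter((p) => {
    const inBackupWindow =
      p.timestamp.getUTCDay() === 0 && p.timestamp.getUTCHours() === 2;
    return !inBackupWindow;
  });
}

const kept = excludeKnownWindows([
  { timestamp: new Date("2025-07-27T02:30:00Z"), cpu_percent: 97 }, // Sunday backup
  { timestamp: new Date("2025-07-28T14:00:00Z"), cpu_percent: 63 },
]);
console.log(kept.length); // → 1
```

In a real system you'd drive the window list from a config file or your change calendar rather than hardcoding it.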

The key insight is this: every bit of work you do to make metrics more understandable to Claude Code comes back as better recommendations. Pre-statistics aren't overhead; they're the foundation for good analysis.

Step 3: Invoking Claude for Analysis and Recommendations

Now we send the pre-computed statistics to Claude. This is where the magic happens: Claude understands the business context and generates actionable recommendations.

typescript
async function generateCapacityRecommendations(
  dataset: MetricsDataset,
  stats: {
    cpu: TrendStats;
    memory: TrendStats;
    disk: TrendStats;
    network: TrendStats;
  },
  budget_monthly_usd: number,
) {
  const client = new Anthropic();
 
  const prompt = `You are an expert infrastructure engineer. Analyze these metrics and generate capacity planning recommendations.
 
## Current System
- Resource: ${dataset.resource_name}
- Data collected: ${dataset.metrics.length} points over ${dataset.timespan_days} days
- Collection date: ${dataset.collection_date.toISOString()}
 
## Metrics Summary
CPU:
- Current: ${stats.cpu.current_value.toFixed(1)}%
- P90: ${stats.cpu.p90.toFixed(1)}%
- P99: ${stats.cpu.p99.toFixed(1)}%
- Trend: ${stats.cpu.trend_direction} (slope: ${stats.cpu.trend_slope.toFixed(3)})
- Days to 90% threshold: ${stats.cpu.days_to_threshold || "N/A"}
 
Memory:
- Current: ${stats.memory.current_value.toFixed(1)}%
- P90: ${stats.memory.p90.toFixed(1)}%
- P99: ${stats.memory.p99.toFixed(1)}%
- Trend: ${stats.memory.trend_direction} (slope: ${stats.memory.trend_slope.toFixed(3)})
- Days to 90% threshold: ${stats.memory.days_to_threshold || "N/A"}
 
Disk:
- Current: ${stats.disk.current_value.toFixed(1)}%
- P90: ${stats.disk.p90.toFixed(1)}%
- P99: ${stats.disk.p99.toFixed(1)}%
- Trend: ${stats.disk.trend_direction}
 
Network:
- Current: ${stats.network.current_value.toFixed(1)} Mbps
- P90: ${stats.network.p90.toFixed(1)} Mbps
 
## Budget Constraints
- Monthly capacity budget: $${budget_monthly_usd.toLocaleString()}
 
## Your Task
Provide:
1. **Risk Assessment**: What's the biggest bottleneck? When will it become critical?
2. **Scaling Options**: List 2-3 concrete options (e.g., "upgrade CPU cores," "add 50GB disk," "increase network tier")
3. **Cost Projection**: Estimate monthly cost for each option
4. **Timeline**: Recommend when to act (now, within 30 days, within 90 days)
5. **Mitigation**: Any quick wins (config changes, caching, etc.) that buy time?
 
Be specific. Use the metric trends to justify your recommendations.`;
 
  const response = await client.messages.create({
    model: "claude-3-5-sonnet-20241022",
    max_tokens: 1500,
    messages: [
      {
        role: "user",
        content: prompt,
      },
    ],
  });
 
  return response.content[0].type === "text" ? response.content[0].text : "";
}
 
// Run the analysis
const recommendations = await generateCapacityRecommendations(
  dbMetrics,
  {
    cpu: cpuStats,
    memory: memoryStats,
    disk: diskStats,
    network: networkStats,
  },
  5000,
);
 
console.log("=== Capacity Planning Recommendations ===\n");
console.log(recommendations);

Sample Output:

=== Capacity Planning Recommendations ===

## Risk Assessment
The most critical bottleneck is CPU, which is trending upward and will exceed 90%
threshold in approximately 45 days. Memory is also rising but less aggressively.
Disk is stable but approaching 80% utilization. Network traffic is moderate.

## Scaling Options

**Option 1: Vertical Scaling (Recommended)**
- Upgrade from 16 vCPU to 32 vCPU
- Increase RAM from 64GB to 128GB
- Estimated monthly cost: $1,200 (+$500 vs. current)
- ROI: Buys you 120+ days before next upgrade

**Option 2: Horizontal Scaling**
- Deploy 2 read replicas for database load balancing
- Each replica: same 16 vCPU config
- Estimated monthly cost: $2,400 total (+$1,700)
- Benefit: Improved availability and read throughput

**Option 3: Hybrid Approach**
- Upgrade primary to 24 vCPU (moderate scaling)
- Add 1 read replica
- Estimated monthly cost: $1,600 total (+$600)

## Timeline
**Immediate (Next 7 days)**: Review slow queries and optimize indexes. This can reduce
CPU load by 10-15% without new hardware.

**Within 30 days**: Implement Option 1 or 3. CPU will be at 75%+ by then.

**Contingency**: If growth accelerates, Option 1 → Option 2 transition within 60 days.

What's powerful here: Claude is combining quantitative trends (the percentile data) with qualitative reasoning (understanding that a rising slope matters more than current absolute value). It's suggesting optimization as a precursor to scaling, which is smart.

Step 4: Generating Infrastructure-as-Code Recommendations

Now let's generate actual Terraform code that implements these recommendations. This bridges the gap between analysis and deployment.

typescript
async function generateTerraformRecommendations(
  resource_name: string,
  current_spec: { cpu_cores: number; memory_gb: number; disk_gb: number },
  scaling_option: "vertical" | "horizontal" | "hybrid",
) {
  const client = new Anthropic();
 
  const specs: Record<
    "vertical" | "horizontal" | "hybrid",
    {
      cpu_cores: number;
      memory_gb: number;
      disk_gb: number;
      description: string;
      replica_count?: number;
    }
  > = {
    vertical: {
      cpu_cores: 32,
      memory_gb: 128,
      disk_gb: 500,
      description: "Upgrade primary instance: more CPU and memory",
    },
    horizontal: {
      cpu_cores: 16,
      memory_gb: 64,
      disk_gb: 250,
      description: "Add 2 read replicas, keep primary unchanged",
      replica_count: 2,
    },
    hybrid: {
      cpu_cores: 24,
      memory_gb: 96,
      disk_gb: 400,
      description: "Moderate primary upgrade + 1 read replica",
      replica_count: 1,
    },
  };
 
  const selected = specs[scaling_option];
 
  const prompt = `Generate Terraform code for AWS that implements the following capacity scaling change:
 
## Current Configuration
- Resource: ${resource_name}
- CPU: ${current_spec.cpu_cores} cores
- Memory: ${current_spec.memory_gb} GB
- Disk: ${current_spec.disk_gb} GB
 
## Proposed Change
${selected.description}
${scaling_option === "horizontal" || scaling_option === "hybrid" ? `- Add ${selected.replica_count} replicas` : ""}
 
## Requirements
- Use aws_instance resource
- Use the existing default VPC via a `data "aws_vpc" "default"` data source
- Tag resources: Owner="Platform", Project="CapacityUpgrade", Environment="prod"
- Use GP3 EBS volumes
- Include security group rules (open port 5432 for PostgreSQL replication)
- Add monitoring CloudWatch alarms for CPU > 80%
 
Generate complete, runnable Terraform code. Include:
1. Updated instance type (use t3.2xlarge for 8 vCPU, c6i.4xlarge for 16 vCPU, etc.)
2. EBS volume configuration
3. Security groups
4. CloudWatch alarms
5. Outputs
 
Be concise but complete.`;
 
  const response = await client.messages.create({
    model: "claude-3-5-sonnet-20241022",
    max_tokens: 2000,
    messages: [
      {
        role: "user",
        content: prompt,
      },
    ],
  });
 
  return response.content[0].type === "text" ? response.content[0].text : "";
}
 
// Generate Terraform code
const terraformCode = await generateTerraformRecommendations(
  "prod-db-primary",
  { cpu_cores: 16, memory_gb: 64, disk_gb: 250 },
  "hybrid",
);
 
console.log("=== Terraform Configuration ===\n");
console.log(terraformCode);
fs.writeFileSync("./terraform/capacity_upgrade.tf", terraformCode);
console.log("\n✓ Terraform code saved to ./terraform/capacity_upgrade.tf");

Sample Output:

hcl
# Capacity Upgrade: prod-db-primary
# Generated: 2025-07-28
 
# Data sources: default VPC and base AMI
data "aws_vpc" "default" {
  default = true
}
 
data "aws_ami" "ubuntu_latest" {
  most_recent = true
  owners      = ["099720109477"] # Canonical
 
  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
}
 
# Security group for database replication
resource "aws_security_group" "db_replication" {
  name   = "prod-db-replication"
  vpc_id = data.aws_vpc.default.id
 
  ingress {
    from_port   = 5432
    to_port     = 5432
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/8"]
    description = "PostgreSQL replication from replicas"
  }
 
  tags = {
    Owner       = "Platform"
    Project     = "CapacityUpgrade"
    Environment = "prod"
  }
}
 
# Upgraded primary instance (24 vCPU / 96GB RAM)
resource "aws_instance" "db_primary_upgraded" {
  ami               = data.aws_ami.ubuntu_latest.id
  instance_type     = "m6i.6xlarge" # 24 vCPU, 96GB RAM
  availability_zone = "us-east-1a"
  # subnet omitted: the default VPC supplies one for this AZ
 
  # EBS configuration: 400GB GP3 volume
  root_block_device {
    volume_type           = "gp3"
    volume_size           = 400
    iops                  = 4000
    throughput            = 250
    delete_on_termination = true
 
    tags = {
      Name = "prod-db-primary-vol"
    }
  }
 
  vpc_security_group_ids = [aws_security_group.db_replication.id]
 
  tags = {
    Name        = "prod-db-primary-upgraded"
    Owner       = "Platform"
    Project     = "CapacityUpgrade"
    Environment = "prod"
  }
}
 
# Read replica 1
resource "aws_instance" "db_replica_1" {
  ami               = data.aws_ami.ubuntu_latest.id
  instance_type     = "m6i.4xlarge" # 16 vCPU, 64GB RAM
  availability_zone = "us-east-1b"
  # subnet omitted: the default VPC supplies one for this AZ
 
  root_block_device {
    volume_type = "gp3"
    volume_size = 400
    iops        = 3000
  }
 
  vpc_security_group_ids = [aws_security_group.db_replication.id]
 
  tags = {
    Name        = "prod-db-replica-1"
    Owner       = "Platform"
    Project     = "CapacityUpgrade"
    Environment = "prod"
  }
}
 
# CloudWatch alarm: Primary CPU > 80%
resource "aws_cloudwatch_metric_alarm" "db_primary_cpu_high" {
  alarm_name          = "prod-db-primary-cpu-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 300
  statistic           = "Average"
  threshold           = 80
  alarm_description   = "Alert when primary database CPU exceeds 80%"
 
  dimensions = {
    InstanceId = aws_instance.db_primary_upgraded.id
  }
}
 
output "primary_instance_id" {
  value = aws_instance.db_primary_upgraded.id
}
 
output "primary_instance_ip" {
  value = aws_instance.db_primary_upgraded.private_ip
}
 
output "replica_1_instance_id" {
  value = aws_instance.db_replica_1.id
}

Why this matters: We're not just telling you "upgrade your database." We're giving you executable code that:

  • Chooses the right instance type based on compute requirements
  • Configures storage for performance (GP3 with specific IOPS)
  • Sets up security groups for replication
  • Includes monitoring so you catch the next bottleneck before it crashes
  • Is tagged for cost attribution
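One caution before applying anything generated: gate it behind `terraform validate` (and a human review). A minimal sketch, assuming the `terraform` CLI is on the PATH:

```typescript
import { execSync } from "child_process";

// Returns true only if `terraform validate` succeeds in the given
// directory; any failure (bad HCL, missing binary, bad path) is a veto
function terraformValidates(dir: string): boolean {
  try {
    execSync("terraform validate -no-color", { cwd: dir, stdio: "pipe" });
    return true;
  } catch {
    return false;
  }
}
```

Wire this between file generation and any `terraform plan`/`apply` step, and treat a false result as "needs human eyes," not as something to retry automatically.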

Step 5: Building an Automated Capacity Alerting System

Let's build a loop that monitors your systems continuously and alerts you when capacity projections change significantly.

typescript
interface CapacityAlert {
  resource: string;
  severity: "info" | "warning" | "critical";
  message: string;
  recommended_action: string;
  days_until_critical: number | null;
  generated_at: Date;
}
 
async function runCapacityMonitoringCycle(
  metricsDirectory: string,
  previousAlerts: Map<string, CapacityAlert>,
): Promise<CapacityAlert[]> {
  const client = new Anthropic();
  const alerts: CapacityAlert[] = [];
 
  // Read all metrics files in directory
  const files = fs
    .readdirSync(metricsDirectory)
    .filter((f) => f.endsWith(".json"));
 
  for (const file of files) {
    const dataset = loadMetricsFromFile(`${metricsDirectory}/${file}`);
    const cpuStats = computeTrendStats(
      dataset.metrics.map((m) => m.cpu_percent),
      "cpu",
    );
    const memoryStats = computeTrendStats(
      dataset.metrics.map((m) => m.memory_percent),
      "memory",
    );
 
    // Only alert if something changed significantly
    const previousAlert = previousAlerts.get(dataset.resource_name);
 
    const alertPrompt = `Evaluate this system's capacity status and determine if an alert should be issued.
 
Resource: ${dataset.resource_name}
CPU: ${cpuStats.current_value.toFixed(1)}% (trend: ${cpuStats.trend_direction}, days to critical: ${cpuStats.days_to_threshold || "stable"})
Memory: ${memoryStats.current_value.toFixed(1)}% (trend: ${memoryStats.trend_direction})
 
${previousAlert ? `Previous alert (${Math.floor((Date.now() - previousAlert.generated_at.getTime()) / (1000 * 60 * 60))} hours ago): "${previousAlert.message}"` : "No previous alert"}
 
Respond with JSON:
{
  "should_alert": boolean,
  "severity": "info" | "warning" | "critical",
  "message": "brief alert message",
  "recommended_action": "specific next step",
  "days_until_critical": number or null
}
 
Only alert if:
- Trend is changing significantly from last check
- Days to critical decreased by >20%
- New bottleneck emerged
- OR system recovered (alert_resolved)`;
 
    const response = await client.messages.create({
      model: "claude-3-5-sonnet-20241022",
      max_tokens: 300,
      messages: [
        {
          role: "user",
          content: alertPrompt,
        },
      ],
    });
 
    const alertText =
      response.content[0].type === "text" ? response.content[0].text : "{}";
    // Claude may wrap the JSON in prose; extract the first JSON object
    const jsonMatch = alertText.match(/\{[\s\S]*\}/);
    const alertJson = jsonMatch
      ? JSON.parse(jsonMatch[0])
      : { should_alert: false };
 
    if (alertJson.should_alert) {
      alerts.push({
        resource: dataset.resource_name,
        severity: alertJson.severity,
        message: alertJson.message,
        recommended_action: alertJson.recommended_action,
        days_until_critical: alertJson.days_until_critical,
        generated_at: new Date(),
      });
    }
  }
 
  return alerts;
}
 
// Example monitoring loop (run every 6 hours via cron)
async function main() {
  // NOTE: this in-memory map resets on every run; persist it (e.g., to
  // S3 or a small database) so cron invocations can compare to history
  const previousAlerts = new Map<string, CapacityAlert>();
 
  console.log("Starting capacity monitoring cycle...\n");
 
  const alerts = await runCapacityMonitoringCycle("./metrics", previousAlerts);
 
  if (alerts.length === 0) {
    console.log("✓ No capacity alerts. All systems nominal.");
  } else {
    alerts.forEach((alert) => {
      const icon =
        alert.severity === "critical"
          ? "🚨"
          : alert.severity === "warning"
            ? "⚠️"
            : "ℹ️";
 
      console.log(`${icon} [${alert.resource}] ${alert.message}`);
      console.log(
        `   Action: ${alert.recommended_action}${alert.days_until_critical ? ` (${alert.days_until_critical} days)` : ""}`,
      );
    });
  }
 
  // Update alert history
  alerts.forEach((alert) => {
    previousAlerts.set(alert.resource, alert);
  });
}
 
main();

Output:

Starting capacity monitoring cycle...

⚠️ [prod-db-primary] CPU growth accelerating: 45 days to critical → 38 days
   Action: Schedule capacity upgrade within 20 days (38 days - safety buffer)

ℹ️ [prod-api-1] Memory usage stabilized at 58% after code optimization
   Action: Continue current monitoring cadence (monthly reviews sufficient)

What we've built: A system that only alerts you when something changes meaningfully, not on every cycle. If your system was at 60% CPU yesterday and 61% today, that's noise. But if the trend line changed from "60 days to critical" to "45 days," that's signal.
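You can even make that signal test deterministic before spending an API call, mirroring the ">20%" rule from the prompt (a sketch; tune the ratio to your own tolerance):

```typescript
// Alert-worthy only if the days-to-critical projection shrank by >20%
function projectionChangedSignificantly(
  previousDays: number | null,
  currentDays: number | null,
): boolean {
  if (previousDays === null || currentDays === null) {
    return previousDays !== currentDays; // appearing or disappearing is signal
  }
  return (previousDays - currentDays) / previousDays > 0.2;
}

console.log(projectionChangedSignificantly(60, 45)); // → true (25% drop)
console.log(projectionChangedSignificantly(60, 58)); // → false (noise)
```

Running this check first and only invoking Claude when it fires keeps the monitoring loop cheap at scale.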

Integration Checklist

To operationalize this capacity planning system, you'll want:

  1. Metrics Collection

    • Export your monitoring data (Prometheus, Datadog, CloudWatch) to JSON/CSV
    • Schedule daily exports to ./metrics/
  2. Automated Analysis

    • Deploy this script in a Lambda or cron job
    • Run every 6-12 hours
    • Log results to S3 for historical tracking
  3. Alerting Pipeline

    • Send critical alerts to PagerDuty
    • Send recommendations to Slack (with Terraform code snippet)
    • Generate weekly HTML reports for executives
  4. Infrastructure-as-Code Integration

    • Version control generated Terraform in git
    • Require code review before terraform apply
    • Tag resources with capacity_plan_id for tracking
  5. Cost Tracking

    • Correlate capacity upgrades with bill changes
    • Build a cost model: "10% growth costs $X/month"
    • Use this for budget forecasting
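For item 5, even a toy model is a useful starting point; the rates below are hypothetical placeholders for your own billing data:

```typescript
// "X% growth costs $Y/month": scale a baseline bill by a growth rate,
// rounded to cents to avoid floating-point noise
function monthlyCostAtGrowth(baselineUsd: number, growthPct: number): number {
  return Math.round(baselineUsd * (1 + growthPct / 100) * 100) / 100;
}

console.log(monthlyCostAtGrowth(5000, 10)); // → 5500
```

Once you've correlated a few real upgrades with bill changes, replace the linear assumption with observed per-tier pricing.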

Why Claude Code Is Perfect for This

You might ask: couldn't I build this with a regular API? Sure. But Claude Code gives you:

  • No boilerplate: You focus on the analysis logic while Claude Code handles the API plumbing
  • Reasoning transparency: You see why Claude recommends an upgrade, not just a binary decision
  • Flexibility: Change the threshold (85% instead of 90%), add new metrics, adjust cost models — all in natural language
  • Explainability: When your CFO asks "why are we spending $500k on infrastructure," you have Claude's written reasoning in the ticket

The system learns from your infrastructure patterns. Next month, Claude might recognize a seasonal pattern you missed. The month after, it might suggest scheduling upgrades around the traffic peaks it spotted in your data.

That's proactive capacity management. Not reactive scrambling at midnight.

Handling Edge Cases and Anomalies

Real-world metrics are messy. You'll encounter spikes that aren't normal growth, maintenance windows that distort trends, and seasonal patterns that simple linear regression misses. Let's build robustness into our analysis.

typescript
interface AnomalyDetectionResult {
  anomalies: Array<{
    timestamp: Date;
    metric: string;
    value: number;
    expectedRange: [number, number];
    severity: "minor" | "significant" | "extreme";
    possibleCause: string;
  }>;
  seasonalPattern: "none" | "daily" | "weekly" | "monthly";
  confidence: number;
}
 
function detectAnomalies(
  metrics: MetricPoint[],
  windowSize: number = 24,
): AnomalyDetectionResult {
  const anomalies: AnomalyDetectionResult["anomalies"] = [];
  const cpuValues = metrics.map((m) => m.cpu_percent);
 
  // Use a rolling median + standard deviation approach
  for (let i = windowSize; i < cpuValues.length; i++) {
    const window = cpuValues.slice(i - windowSize, i);
    const median = window.sort((a, b) => a - b)[Math.floor(window.length / 2)];
    const stdDev = Math.sqrt(
      window.reduce((sum, val) => sum + (val - median) ** 2, 0) / window.length,
    );
 
    const value = cpuValues[i];
    const expectedRange: [number, number] = [
      median - 3 * stdDev,
      median + 3 * stdDev,
    ];
 
    if (value < expectedRange[0] || value > expectedRange[1]) {
      // Everything here is already >3σ out; grade how far out it is
      let severity: "minor" | "significant" | "extreme" = "minor";
      let possibleCause = "Isolated outlier";
 
      if (Math.abs(value - median) > 5 * stdDev) {
        severity = "extreme";
        possibleCause = "Possible deployment, traffic spike, or incident";
      } else if (Math.abs(value - median) > 4 * stdDev) {
        severity = "significant";
        possibleCause = "Elevated load or unusual pattern";
      }
 
      anomalies.push({
        timestamp: metrics[i].timestamp,
        metric: "cpu_percent",
        value,
        expectedRange,
        severity,
        possibleCause,
      });
    }
  }
 
  // Detect seasonal patterns
  // (simplified: check if similar hours/days show consistent patterns)
  let seasonalPattern: "none" | "daily" | "weekly" | "monthly" = "none";
  if (metrics.length > 7 * 24) {
    // If we have more than a week of hourly data
    const dailyPattern = cpuValues.slice(0, 24);
    const nextDayPattern = cpuValues.slice(24, 48);
    const correlation =
      dailyPattern.reduce((sum, val, i) => sum + val * nextDayPattern[i], 0) /
      Math.sqrt(
        dailyPattern.reduce((sum, val) => sum + val ** 2, 0) *
          nextDayPattern.reduce((sum, val) => sum + val ** 2, 0),
      );
    if (correlation > 0.7) {
      seasonalPattern = "daily";
    }
  }
 
  return {
    anomalies,
    seasonalPattern,
    confidence: 0.85, // placeholder; in production, derive from correlation strength
  };
}
 
// Detect anomalies in our dataset
const anomalies = detectAnomalies(dbMetrics.metrics);
 
console.log(`\nAnomaly Detection Results:`);
console.log(`Found ${anomalies.anomalies.length} anomalies`);
console.log(`Seasonal pattern: ${anomalies.seasonalPattern}`);
 
if (anomalies.anomalies.length > 0) {
  console.log(`\nSignificant anomalies:`);
  anomalies.anomalies
    .filter((a) => a.severity !== "minor")
    .slice(0, 5)
    .forEach((a) => {
      console.log(
        `  ${a.timestamp.toISOString()}: ${a.metric}=${a.value.toFixed(1)}% (expected: ${a.expectedRange[0].toFixed(1)}-${a.expectedRange[1].toFixed(1)}%)`,
      );
      console.log(`  → ${a.possibleCause}`);
    });
}

Why this matters: Anomaly detection filters out noise. If your CPU spiked to 95% for 30 minutes during a deployment, that's not a capacity planning signal—it's normal. By identifying and excluding anomalies, we make Claude's trend analysis more accurate.
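To make that concrete, here is a minimal sketch of the filtering step, using small local types that mirror the fields above (the `excludeAnomalies` name is my own, not part of the earlier code):

```typescript
interface CpuPoint {
  timestamp: Date;
  cpu_percent: number;
}

interface AnomalyFlag {
  timestamp: Date;
  severity: "minor" | "significant" | "extreme";
}

// Drop points flagged as significant/extreme so short-lived spikes
// (deployments, incidents) don't skew the regression slope.
function excludeAnomalies(
  points: CpuPoint[],
  anomalies: AnomalyFlag[],
): CpuPoint[] {
  const flagged = new Set(
    anomalies
      .filter((a) => a.severity !== "minor")
      .map((a) => a.timestamp.getTime()),
  );
  return points.filter((p) => !flagged.has(p.timestamp.getTime()));
}
```

Run this before `computeTrendStats` so the regression sees steady-state load rather than deployment noise.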

Cost Modeling and Budget Forecasting

Capacity planning isn't just about resources—it's about money. Let's add cost modeling so recommendations include budget impact.

typescript
interface CostModel {
  compute_per_hour_usd: number;
  storage_per_gb_month_usd: number;
  bandwidth_per_gb_usd: number;
  monitoring_per_endpoint_month_usd: number;
}
 
interface CostProjection {
  baseline_monthly_usd: number;
  projectedMonthlyInNext90Days: number;
  recommendedOption1Cost: number;
  recommendedOption1ROI: string;
  budgetGrowthRate: number; // fractional growth over the 90-day horizon
}
 
function projectCosts(
  currentMetrics: MetricPoint[],
  cpuStats: TrendStats,
  costModel: CostModel,
  currentInstanceCount: number,
): CostProjection {
  // Simple model: cost scales proportionally with resource utilization.
  // Monthly compute cost = instances × average hours per month × hourly rate.
  const HOURS_PER_MONTH = 730;
  const currentCost =
    currentInstanceCount * HOURS_PER_MONTH * costModel.compute_per_hour_usd;
 
  // Project 90-day cost based on trend
  const growthFactor = cpuStats.trend_slope * 90; // approximate slope over 90 days
  const projectedUtilization = Math.min(100, cpuStats.p90 + growthFactor);
  const projectedCost = currentCost * (projectedUtilization / cpuStats.p90);
 
  // Cost of recommended upgrade (assume +40% resource cost for vertical scaling)
  const upgradeCost = currentCost * 1.4;
 
  // ROI: upgrade now vs. emergency scale later
  // Assume emergency scaling costs 2x due to downtime, expedited service
  const emergencyScalingCost = currentCost * 2.0;
  const roi = ((emergencyScalingCost - upgradeCost) / upgradeCost) * 100;
 
  return {
    baseline_monthly_usd: Math.round(currentCost),
    projectedMonthlyInNext90Days: Math.round(projectedCost),
    recommendedOption1Cost: Math.round(upgradeCost),
    recommendedOption1ROI: `${roi.toFixed(0)}% (avoid $${Math.round(emergencyScalingCost - upgradeCost)} emergency cost)`,
    budgetGrowthRate: (projectedCost - currentCost) / currentCost,
  };
}
 
// Calculate cost impact
const costModel: CostModel = {
  compute_per_hour_usd: 5.0, // e.g. a large production database instance
  storage_per_gb_month_usd: 0.01,
  bandwidth_per_gb_usd: 0.1,
  monitoring_per_endpoint_month_usd: 2.0,
};
 
const costProjection = projectCosts(
  dbMetrics.metrics,
  cpuStats,
  costModel,
  1, // one production database instance
);
 
console.log("\nCost Projection:");
console.log(
  `  Current monthly cost: $${costProjection.baseline_monthly_usd.toLocaleString()}`,
);
console.log(
  `  Projected (90 days): $${costProjection.projectedMonthlyInNext90Days.toLocaleString()}`,
);
console.log(
  `  Recommended upgrade cost: $${costProjection.recommendedOption1Cost.toLocaleString()}`,
);
console.log(`  ROI: ${costProjection.recommendedOption1ROI}`);

Output:

Cost Projection:
  Current monthly cost: $3,650
  Projected (90 days): $5,240
  Recommended upgrade cost: $5,110
  ROI: 43% (avoid $2190 emergency cost)

This is powerful for executive conversations. You're not just asking for a budget increase—you're showing that upgrading now prevents a much more expensive emergency later.

The Science of Predictive Capacity Planning

What we're building here is fundamentally predictive. The question isn't "are we at capacity now?" but "when will we be at capacity?" This distinction changes everything about how you approach capacity planning.

Traditional capacity planning is reactive because it waits for problems. You hit 80% CPU, realize you have 20% headroom, and start the upgrade process. By the time the new infrastructure arrives, you're at 90%. Some of your users are experiencing degradation. Some quit and don't come back. The upgrade happens in a scramble, mistakes get made, and everyone's frazzled.

Predictive planning is different. You see the trend, calculate when you'll hit critical, and schedule the upgrade for a time that's convenient—often during a planned maintenance window. Users don't experience degradation. You have time to test thoroughly. Your team isn't stressed. The upgrade is boring, which is exactly what you want from infrastructure changes.

The math is simple. If you're growing at two percentage points per week and you're at 60% CPU utilization, you'll hit 90% in 15 weeks. You schedule the upgrade for week 12. Easy. The problem is that most teams don't do this, because it requires discipline: you have to upgrade before you're forced to. It feels wasteful to add capacity when you're only at 60%, but that feeling is the enemy of good operations.
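That runway arithmetic is trivial to encode. A minimal sketch, assuming growth is measured in percentage points per week as in the example above:

```typescript
// Weeks until utilization crosses a threshold, assuming linear growth
// measured in percentage points per week. Returns Infinity if flat/declining.
function weeksToThreshold(
  currentPct: number,
  thresholdPct: number,
  growthPointsPerWeek: number,
): number {
  if (growthPointsPerWeek <= 0) return Infinity;
  return Math.ceil((thresholdPct - currentPct) / growthPointsPerWeek);
}

// 60% now, 90% critical, +2 points/week → 15 weeks of runway
console.log(weeksToThreshold(60, 90, 2)); // 15
```

Compounding growth shortens the runway relative to this linear estimate, so treat it as an upper bound.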

This is where Claude Code's role becomes crucial. It removes the emotional component and replaces it with analysis. You can't argue with "growth is consistent at 2.1% per week, P99 utilization will exceed safe levels in 14.7 days, here's the cost impact, and here's the recommended action." It's data-driven. It's defensible to finance. And it prevents the midnight crisis calls.

The second part of predictive planning is cost optimization. You don't just need capacity—you need the right capacity. Do you scale vertically (bigger instance) or horizontally (more instances)? Do you use reserved instances (cheaper but committed) or on-demand (expensive but flexible)? Claude Code can evaluate these options not just on technical merit but on cost impact and risk. It understands the business tradeoff between $500/month saved versus having zero buffer for unexpected growth.

Alerts and Continuous Monitoring

Capacity planning isn't a one-time event. You need continuous monitoring that alerts you when projections change significantly.

typescript
interface CapacityForecast {
  resource: string;
  current_utilization: number;
  days_to_critical: number;
  risk_level: "green" | "yellow" | "red";
  confidence: number;
}
 
async function generateWeeklyCapacityReport(
  allMetrics: Map<string, MetricsDataset>,
  previousForecasts: Map<string, CapacityForecast>,
) {
  const client = new Anthropic();
  const newForecasts = new Map<string, CapacityForecast>();
  const alerts: string[] = [];
 
  for (const [resourceName, dataset] of allMetrics) {
    const cpuStats = computeTrendStats(
      dataset.metrics.map((m) => m.cpu_percent),
      "cpu",
    );
    const previousForecast = previousForecasts.get(resourceName);
 
    // Generate forecast
    const prompt = `Based on this trend data, forecast when ${resourceName} will hit critical capacity.
 
Current CPU: ${cpuStats.current_value.toFixed(1)}%
P90: ${cpuStats.p90.toFixed(1)}%
Trend slope: ${cpuStats.trend_slope.toFixed(3)}
Days to critical: ${cpuStats.days_to_threshold || "stable"}
${previousForecast ? `Previous forecast: ${previousForecast.days_to_critical} days to critical` : ""}
 
Respond with JSON:
{
  "risk_level": "green" | "yellow" | "red",
  "confidence": 0.0-1.0,
  "alert": "if risk changed significantly, what should we alert about?"
}`;
 
    const response = await client.messages.create({
      model: "claude-3-5-sonnet-20241022",
      max_tokens: 300,
      messages: [{ role: "user", content: prompt }],
    });
 
    const forecastText =
      response.content[0].type === "text" ? response.content[0].text : "{}";
    // The model may wrap its JSON in prose; extract the first object defensively
    const jsonMatch = forecastText.match(/\{[\s\S]*\}/);
    const forecast = JSON.parse(jsonMatch ? jsonMatch[0] : "{}");
 
    newForecasts.set(resourceName, {
      resource: resourceName,
      current_utilization: cpuStats.current_value,
      days_to_critical: cpuStats.days_to_threshold || 999,
      risk_level: forecast.risk_level,
      confidence: forecast.confidence,
    });
 
    // Generate alert if risk changed
    if (
      previousForecast &&
      previousForecast.risk_level !== forecast.risk_level
    ) {
      alerts.push(
        `🚨 ${resourceName}: Risk escalated from ${previousForecast.risk_level} to ${forecast.risk_level}`,
      );
    }
 
    if (
      previousForecast &&
      cpuStats.days_to_threshold != null &&
      previousForecast.days_to_critical - cpuStats.days_to_threshold >= 14
    ) {
      alerts.push(
        `⚠️ ${resourceName}: Critical timeline accelerated by ${(previousForecast.days_to_critical - cpuStats.days_to_threshold).toFixed(0)} days`,
      );
    }
  }
 
  return { forecasts: newForecasts, alerts };
}

This runs weekly, generating alerts only when things meaningfully change. Your team isn't overwhelmed with daily noise; they're notified when decisions need to be made.

The Integration Challenge: Making Predictions Actionable

Having accurate capacity predictions is only half the battle. The other half is making them actionable. This is where many teams fail. They build a fancy forecasting system that correctly predicts they'll need more capacity, but then... nothing happens. The prediction sits in a dashboard. Nobody acts on it until it's a crisis.

Making predictions actionable requires several pieces. First, they need to flow to the right people. A capacity forecast that nobody sees might as well not exist. This means integrating with Slack channels where ops teams gather, creating calendar events for capacity review meetings, or filing tickets in your project management system.
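As one concrete example of routing forecasts to people, Slack's incoming webhooks accept a simple JSON payload with a `text` field; the sketch below shows the shape (the helper names and URL handling are mine):

```typescript
// Build the incoming-webhook payload Slack expects ({"text": "..."}).
function buildSlackPayload(alerts: string[]): string {
  return JSON.stringify({ text: alerts.join("\n") });
}

// Post a batch of capacity alerts to a Slack channel. The webhook URL
// comes from Slack's "Incoming Webhooks" app configuration.
async function postAlertsToSlack(
  alerts: string[],
  webhookUrl: string,
): Promise<void> {
  if (alerts.length === 0) return; // nothing to report this week
  await fetch(webhookUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: buildSlackPayload(alerts),
  });
}
```

Wire this to the `alerts` array from `generateWeeklyCapacityReport` and the forecast stops living in a dashboard nobody checks.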

Second, predictions need to be tied to decisions. "CPU will be critically high in 45 days" is abstract. "Schedule a capacity upgrade for week 40, estimated cost $15,000, implementation time 4 hours" is concrete. People can plan around concrete information.

Third, predictions need enough lead time. If Claude Code tells you that you need to upgrade in 14 days, that's panic mode: most infrastructure upgrades require weeks of planning and procurement. The sweet spot is usually 4-6 weeks, early enough that you're not stressed, late enough that you haven't paid for months of idle capacity. This is why the trend analysis is so important: you need to be forecasting far enough out that decisions can be made thoughtfully.

Fourth, predictions need to be bounded by uncertainty. Perfect forecasts don't exist. Traffic might spike unexpectedly. A product launch might accelerate growth. A new competitor might cause users to leave. These are real uncertainties. Good capacity forecasts include confidence intervals. "CPU will hit 90% in 45 days (90% confidence range: 35-55 days)" is much more useful than a point estimate. It tells you how much buffer to build into your planning.
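One lightweight way to produce that point-plus-range estimate is to evaluate the linear runway at the slope plus or minus its standard error. The helper below is an illustrative sketch, not part of the earlier pipeline, and is no substitute for a proper regression confidence interval:

```typescript
// Rough interval on days-to-critical from slope uncertainty: evaluate the
// linear runway at slope ± z * stderr. Faster growth → critical arrives
// sooner, so the upper slope gives the lower bound on days.
function daysToCriticalRange(
  currentPct: number,
  criticalPct: number,
  slopePerDay: number,
  slopeStdErr: number,
  z = 1.645, // ~90% two-sided interval under a normal approximation
): { low: number; mid: number; high: number } {
  const headroom = criticalPct - currentPct;
  const days = (s: number) => (s > 0 ? headroom / s : Infinity);
  return {
    low: days(slopePerDay + z * slopeStdErr),
    mid: days(slopePerDay),
    high: days(slopePerDay - z * slopeStdErr),
  };
}
```

Reporting `low`/`mid`/`high` gives you exactly the "45 days (range: 35-55)" framing described above.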

Building Organizational Muscle Memory

One pattern we've noticed: teams that successfully implement AI capacity planning share a common characteristic. They use the system to build organizational memory about their growth patterns. They don't just react to Claude Code's recommendations; they study them. Why did growth accelerate this quarter? What changed in product or marketing? What events correlated with the forecast errors?

This pattern-building has a downstream benefit. Over time, your team develops intuition about your infrastructure's needs. You start to understand the relationship between user growth and resource consumption. You internalize the cost-growth tradeoff. When someone proposes a feature that might impact database load, you can reason about the capacity implications.

This is different from relying blindly on an algorithm. You're using the algorithm as a teaching tool. Each capacity decision becomes a learning opportunity. This is what separates teams that use AI successfully from teams that treat it as a black box.

Forecasting Across Different Infrastructure Types

One complexity we haven't addressed yet is the reality that different infrastructure types have different scaling characteristics. Your database grows differently than your cache. Your cache grows differently than your API servers. Multi-tier forecasting requires understanding these differences.

Database growth is typically driven by data volume, not by traffic. Your database grows because you store more records, not because you serve more requests (though both matter). The forecast for databases needs to account for data retention policies, archival schedules, and compliance requirements. If you're deleting old data, database growth slows. If you're required to keep everything, growth compounds.

Caching layers are different. Cache growth is driven by hot dataset size—the size of the data you need to keep in fast storage. As your product expands, your hot dataset grows. But cache can be more efficiently shared across instances than databases can. This means horizontal caching (adding cache instances) scales better than vertical database scaling.

API servers have yet another pattern. API server capacity is typically driven by concurrent requests, not by total requests. You have 1000 concurrent users, each holding one connection, regardless of whether they're making 1 request or 1000 requests. This means API server forecasting is about understanding concurrency patterns, not request volume. Peak concurrency might occur at different times than peak request volume.

Claude Code can reason about these differences if you encode them into your analysis. For each infrastructure component, include not just utilization metrics but the drivers of that utilization. "Database is at 60% capacity with 450GB used, growing at 5GB per month, retention policy is 2 years (estimated final size 1.2TB)." This context lets Claude Code make better recommendations.
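One way to encode those drivers is a small per-component context record that gets serialized into the prompt. The shape below is illustrative, not prescriptive:

```typescript
// Per-component capacity context so the prompt carries growth drivers,
// not just utilization numbers. Field names here are my own.
interface ComponentCapacityContext {
  component: string;
  utilizationPct: number;
  driver: "data_volume" | "hot_dataset" | "concurrency";
  currentSize: string; // e.g. "450GB used"
  growthDescription: string; // e.g. "5GB per month"
  constraints: string; // e.g. "2-year retention policy"
}

// Render one line of prompt context per component.
function describeComponent(c: ComponentCapacityContext): string {
  return (
    `${c.component}: ${c.utilizationPct}% capacity (driver: ${c.driver}), ` +
    `${c.currentSize}, growing ${c.growthDescription}; ${c.constraints}`
  );
}
```

Feeding one such line per tier lets Claude recommend horizontal scaling for the cache and retention changes for the database, rather than a one-size-fits-all answer.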

Lessons Learned and Best Practices

After building and running capacity planning systems in production, here are the patterns that work:

  1. Start with simple linear regression. You don't need fancy ML. Most growth patterns are predictable with basic math.

  2. Filter anomalies aggressively. Deployments, testing spikes, and maintenance windows distort analysis. Clean the data before analysis.

  3. Communicate in business terms. "CPU will hit 90% in 45 days" means something. "Slope is rising at 0.47% per day" doesn't.

  4. Always include cost impact. Capacity decisions are business decisions. Show the money.

  5. Have Claude explain the reasoning. When you show executives "CPU trending up due to 15% QoQ user growth," they trust the recommendation more than a raw forecast.

  6. Compare against reality. Keep a log of your forecasts vs. actual capacity events. Did you predict when the crash happened? Did you miss spikes? Adjust your models based on what actually occurred.

Handling Forecast Errors and Learning from Reality

One of the most important aspects of capacity planning that's often overlooked is understanding where your forecasts go wrong and learning from those errors. Perfect predictions don't exist, and the teams that succeed at capacity planning are the ones that study their misses.

Maybe Claude Code predicted you'd hit critical CPU in 45 days, but you hit it in 30. That means your growth accelerated. Why? Was there a product change you didn't account for? Did a competitor shutdown cause traffic to migrate? Did a viral event happen? Understanding the delta between prediction and reality teaches you about your business.

Conversely, maybe you predicted critical capacity in 45 days and you're still comfortable at 60 days in. That means growth slowed. Why? Did a customer churn? Did a product launch get delayed? Understanding these patterns helps you make better predictions next time.

The key is maintaining a forecast journal. Every month, record what you predicted and what actually happened. Over time, you'll see systematic biases. "We always overestimate growth in Q4" or "We systematically underestimate the impact of marketing campaigns." Once you identify these biases, you can correct them.
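A forecast journal doesn't need to be elaborate; even a small typed record plus a mean-error calculation will surface systematic bias. A sketch (the names are mine):

```typescript
// One journal entry per forecast; fill in the actual once reality arrives.
interface ForecastJournalEntry {
  resource: string;
  forecastDate: string; // ISO date the prediction was made
  predictedDaysToCritical: number;
  actualDaysToCritical: number | null; // null until resolved
}

// Mean signed error across resolved entries. Positive → we predicted
// critical too late (reality arrived sooner), i.e. we underestimate growth.
function meanForecastBias(entries: ForecastJournalEntry[]): number {
  const resolved = entries.filter((e) => e.actualDaysToCritical !== null);
  if (resolved.length === 0) return 0;
  return (
    resolved.reduce(
      (sum, e) => sum + (e.predictedDaysToCritical - e.actualDaysToCritical!),
      0,
    ) / resolved.length
  );
}
```

Slice the bias by quarter or by resource and the "we always overestimate growth in Q4" pattern falls out of the data.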

This feedback loop is what transforms capacity planning from a mechanical exercise into a strategic tool. You're not just predicting capacity; you're understanding your business's growth patterns deeply. This knowledge informs product decisions, hiring decisions, and strategic planning. It ties infrastructure planning to business strategy in a way most companies never achieve.

From Reactive to Proactive to Adaptive

The final maturity stage of AI-powered capacity planning is when your system stops being predictive and becomes adaptive. You're not just forecasting what you need; you're automatically optimizing your infrastructure to match predicted demand.

This looks like: "Peak traffic is Tuesday 2-4 PM. We run on reserved instances Monday-Friday. On weekends, we scale down to 60% of capacity. The system automatically adjusts instance counts, applies spot instances where appropriate, and migrates non-critical workloads during peak traffic." This requires deeper integration between your capacity planning system and your orchestration platform, but it's the next frontier.

Some teams implement this with scheduled autoscaling. Others use machine learning to predict the exact demand curve. The most sophisticated teams use predictive scaling plus anomaly detection: "Traffic is 20% higher than predicted, trigger emergency autoscaling rules." This removes the human in the loop entirely for routine optimization while maintaining human control for anomalies.

The infrastructure becomes a living system that adapts to demand rather than a static system you provision annually. Your team stops thinking about capacity as a fixed problem to solve and starts thinking about it as a continuous optimization loop. You're always running the right amount of infrastructure for the actual demand, not the average demand or the peak demand.

Next Steps

Start small: export one week of metrics from your critical database, run this analysis, and see what Claude recommends. Then compare those recommendations against your actual capacity events from the past 6 months. You'll be amazed at how often AI analysis catches patterns you missed in the dashboards.

The beauty of Claude Code is that you can iterate quickly. Week one: basic trend analysis. Week two: add anomaly detection. Week three: integrate cost modeling. Week four: fully automated alerts to Slack. Each iteration builds on the previous, and Claude helps you think through each step.

Your infrastructure will be more stable, your budget more predictable, and your team more confident in scaling decisions. That's worth the investment. But more than that, you'll have transformed capacity planning from something people dread into something that just happens in the background, informing business decisions quietly and reliably.

The Organizational Impact of Systematic Capacity Planning

When you move from reactive capacity planning to systematic forecasting, you unlock organizational capabilities that don't show up on any engineering dashboard. You stop getting paged at 2 AM because the database is full. You stop having urgent provisioning sprints right before Black Friday because you don't know if you have capacity. You stop making expensive last-minute decisions about infrastructure during crisis moments.

More subtly, you enable teams to plan better. Product managers can ask "if we acquire 10,000 new customers, what does our infrastructure cost increase to?" and get a real answer. Executive leadership can model "if we double revenue this year, what happens to our infrastructure costs?" Teams making architectural decisions can understand the capacity implications of different choices.

This is organizational knowledge that most companies never build. They treat infrastructure as a fixed cost center, something to minimize. But teams that systematically plan capacity understand it as a key lever for business decisions. Infrastructure decisions become strategic decisions, informed by data.

The Future of Capacity Planning

As AI systems become more sophisticated, capacity planning will evolve. Future systems will integrate with your deployment pipelines, automatically detecting when new features or optimizations change capacity requirements. They'll model "if we deploy this feature, what's the capacity impact?" and warn you before deployment if the impact is problematic.

Machine learning models will capture domain-specific patterns. Your system will learn that every September, you see a 15% traffic spike from the back-to-school shopping season. It will predict this automatically rather than requiring manual adjustment. Over time, your forecasts improve because the model learns your specific business patterns.

The most exciting possibility: AI-driven capacity planning that understands your business model well enough to suggest optimization opportunities. "Based on our capacity trends and cost structure, we could save $200K/year by optimizing query patterns on the user_sessions table. Here's the optimization strategy." That's moving beyond capacity planning into active infrastructure optimization driven by business insights.

But all of that starts with the foundational work described in this guide. You need the data collection, the analysis pipeline, the forecasting models, and the feedback loops. Build those systematically now, and future AI capabilities will layer on top of that foundation.

Closing: Infrastructure as Competitive Advantage

In the modern era, infrastructure isn't just a cost center; it's a competitive advantage. Companies that can scale rapidly and cost-effectively win their markets. Companies that lurch from one capacity crisis to the next lose them.

By building a systematic capacity planning system with Claude Code, you're not just preventing crises. You're building organizational capability. You're creating a culture where infrastructure decisions are informed by data, where growth is planned for, and where surprises are rare. That capability compounds over time.

Six months from now, your team will look back and wonder how you ever managed without this system. Twelve months from now, capacity planning will feel like something your team does automatically rather than something that requires heroic effort. That's when you know you've won—when the infrastructure just works in the background, informing decisions without demanding constant attention.


-iNet
