Building a Cost Optimization Assistant with Claude Code

Cloud bills are a problem that only gets worse if you ignore them. Most teams throw dashboards at the problem—red flags, yellow warnings, maybe a Slack notification. But dashboards don't fix anything. They just show you the damage. What if instead of staring at metrics, you had an assistant that actually generates code to fix your cloud waste?
That's what Claude Code does for cost optimization. It's not just an AI that tells you "your RDS instance is underutilized." It's an AI that can analyze your billing data, understand why resources are costing you money, and then write the actual Terraform changes to right-size them. This is the shift from reporting to action—and it fundamentally changes how you think about cloud costs.
In this article, we're building a real cost optimization assistant. We'll walk through analyzing AWS billing data, identifying wasteful patterns, generating infrastructure-as-code fixes, and tracking savings over time. By the end, you'll have a system that runs on a schedule, catches cost issues before they spiral, and actually fixes them. You'll move from "we should optimize" to "we are optimizing" with systematic automation backing the claim.
Table of Contents
- Why This Matters: The Cost Optimization Gap
- The Architecture: Four Layers
- Layer 1: Parsing and Analyzing Billing Data
- Layer 2: Identifying Optimization Opportunities
- Layer 3: Generating Terraform Code
- Layer 4: Validation, Tracking, and Reporting
- Putting It Together: The Full Pipeline
- Building Cost Guardrails
- Handling the Reserved Instances Trap
- Integration: Slack Notifications and Approvals
- The Measurement Loop: Quarterly Savings Audits
- Summary: From Insight to Action
- Real-World Scenario: A Complete Cost Optimization Success Story
- Handling Complexity: Multi-Account and Multi-Region Optimization
- Communicating Value: Building Cost Awareness in Your Organization
- Automating the Optimization Cycle
- Managing Risk in Automated Optimizations
- Avoiding the Pitfalls: Common Mistakes in Cost Optimization
- Building Organizational Culture Around Cost
Why This Matters: The Cost Optimization Gap
Here's the brutal truth: most teams know where their money is being wasted. They know the dev database that's 1TB but only used by three people. They know the NAT gateway that costs $30/day but could be replaced by VPC endpoints. They know the autoscaling group that scales to 50 instances at 2am because someone set the metrics wrong.
But knowing and fixing are different things. Fixing requires:
- Understanding the current infrastructure
- Doing the math on alternative approaches
- Writing the IaC code to change it
- Testing it in a non-prod environment
- Getting it into your deployment pipeline
- Actually deploying it
That's friction. Six distinct steps, each requiring thought and execution. And friction is why cloud bills stay high. Claude Code removes that friction by automating the jump from "we have a problem" to "here's the code that fixes it." You go from needing five hours of human effort to needing five minutes to approve a generated solution.
The strategic insight: cost optimization isn't a technology problem anymore. It's a management problem. You're not lacking the ability to optimize—AWS documentation tells you exactly how. You're lacking the organizational will to do the optimization work because it's tedious and low-priority compared to feature development. Claude Code flips the equation: optimization is now cheaper to implement than to ignore.
The Architecture: Four Layers
We're building a system with four tightly integrated layers:
- Layer 1: Billing Analysis - Parse AWS Cost Explorer exports, identify patterns, spot anomalies.
- Layer 2: Resource Optimization - Cross-reference usage metrics, identify underutilized resources, calculate savings.
- Layer 3: Code Generation - Write Terraform modules that implement the optimizations.
- Layer 4: Validation & Reporting - Estimate real savings, track deployed changes, measure actual impact.
Let's build this piece by piece. Each layer builds on the previous, creating a full pipeline from raw billing data to deployed infrastructure changes.
Layer 1: Parsing and Analyzing Billing Data
The first thing you need is raw billing data. AWS Cost Explorer exports are CSV files, and they're messy. But they're also your source of truth. The data is real. It's not estimated or projected—it's what you actually paid.
The reason billing data matters so much is that it's the only objective measure of cloud waste. You can have 100 opinions about whether a resource is needed, but billing data doesn't argue. It simply shows what you spent. The challenge is that billing data is usually aggregated and abstracted—AWS shows you "compute costs" or "database costs," but you need to drill down to understand which specific instances or databases are actually expensive.
You also need to correlate billing costs with actual resource utilization. An EC2 instance that costs $200/month might be either properly sized for a critical service or a forgotten development environment. Billing data alone can't tell you which. You need to combine it with CloudWatch metrics to understand the full picture. Is that expensive instance actually being used? Is it running at high utilization, justifying its cost, or is it sitting idle? This is where analysis becomes crucial—you're not just looking at costs, you're understanding whether those costs are necessary.
The deeper challenge is that cost patterns aren't static. They evolve over time. A spike in your RDS bill might mean a new application went into production (expected) or it might mean a query optimization issue caused runaway usage (a problem). A decrease might mean you decommissioned a service or it might mean a service migrated to a cheaper cloud provider. As a cost analyst, you're trying to detect not just current state but trends and anomalies. Why did costs spike last Tuesday? Why did they drop on Friday? Did something change, or is this normal variation?
This analytical thinking is why Claude Code is powerful here. Claude can hold context about historical patterns, understand that costs on weekends are always lower (batch jobs don't run), and detect when something genuinely unusual happens. It's not just running a SQL query; it's interpreting results in context and asking the right follow-up questions.
Here's a TypeScript function that reads an exported billing file and extracts the useful parts:
import * as fs from "fs";
import * as csv from "csv-parse/sync";
interface BillingRecord {
service: string;
linkedAccount: string;
usageAmount: number;
unblendedCost: number;
region: string;
resourceId: string;
}
async function parseBillingExport(filePath: string): Promise<BillingRecord[]> {
const fileContent = fs.readFileSync(filePath, "utf-8");
const records = csv.parse(fileContent, {
columns: true,
skip_empty_lines: true,
});
return records.map((record: any) => ({
service: record["product"] || "unknown",
linkedAccount: record["linkedAccount"] || "primary",
usageAmount: parseFloat(record["usageAmount"]) || 0,
unblendedCost: parseFloat(record["unblendedCost"]) || 0,
region: record["region"] || "global",
resourceId: record["resourceId"] || record["itemDescription"] || "",
}));
}
What's happening here? We're reading the CSV, parsing it with proper type safety, and normalizing the fields. The BillingRecord interface ensures we work with consistent data downstream. The skip_empty_lines flag cuts down noise from AWS's sometimes-weird exports.
Now let's aggregate this to spot patterns:
interface ServiceCostBreakdown {
service: string;
totalCost: number;
totalUsage: number;
regions: Map<string, number>;
trend: "increasing" | "stable" | "decreasing";
}
function analyzeServiceCosts(records: BillingRecord[]): ServiceCostBreakdown[] {
const byService = new Map<string, BillingRecord[]>();
records.forEach((record) => {
const key = record.service;
if (!byService.has(key)) {
byService.set(key, []);
}
byService.get(key)!.push(record);
});
const breakdowns: ServiceCostBreakdown[] = [];
byService.forEach((recs, service) => {
const totalCost = recs.reduce((sum, r) => sum + r.unblendedCost, 0);
const totalUsage = recs.reduce((sum, r) => sum + r.usageAmount, 0);
const regions = new Map<string, number>();
recs.forEach((r) => {
const current = regions.get(r.region) || 0;
regions.set(r.region, current + r.unblendedCost);
});
breakdowns.push({
service,
totalCost,
totalUsage,
regions,
trend: "stable", // We'll calculate this from time-series data
});
});
return breakdowns.sort((a, b) => b.totalCost - a.totalCost);
}
This groups costs by service and shows you the breakdown by region. The totalCost field tells you where the money is going. Run this function and you immediately know: "RDS is eating 40% of our bill, EC2 is 35%, and S3 is 15%." That focus is critical. Instead of trying to optimize everything, you focus on the services that matter most. An optimization that saves 20% on RDS is worth your time. An optimization that saves 20% on a service that's 1% of your bill is probably not.
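The trend field above is stubbed out as "stable". As a minimal sketch of how you might fill it in from time-series data, here's a hypothetical classifyTrend helper that compares the recent half of a window of daily cost totals against the earlier half (the 10% threshold is an assumption you would tune):

```typescript
type Trend = "increasing" | "stable" | "decreasing";

// Classify a cost trend by comparing the average of the recent days
// against the average of the earlier days. Hypothetical helper; the
// threshold is an illustrative assumption, not an AWS default.
function classifyTrend(dailyCosts: number[], thresholdPct = 10): Trend {
  if (dailyCosts.length < 4) return "stable"; // not enough data to judge

  const mid = Math.floor(dailyCosts.length / 2);
  const avg = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;

  const earlier = avg(dailyCosts.slice(0, mid));
  const recent = avg(dailyCosts.slice(mid));
  if (earlier === 0) return "stable";

  const changePct = ((recent - earlier) / earlier) * 100;
  if (changePct > thresholdPct) return "increasing";
  if (changePct < -thresholdPct) return "decreasing";
  return "stable";
}
```

Feed it per-day totals for one service: a sustained jump in the recent half of the window reads as "increasing", which is exactly the signal you want before asking why costs moved.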
Layer 2: Identifying Optimization Opportunities
Now we need to spot the actual waste. This is where domain knowledge comes in. Let's detect underutilized resources by combining billing data with CloudWatch metrics. You can't optimize what you don't understand. We need actual usage metrics to correlate with costs.
Here's the key insight: utilization tells the story. If an RDS instance is costing $500/month and its CPU usage is averaging 5%, that's waste. The instance is vastly oversized for the actual workload. You could downsize it significantly, cut the cost by 60%, and still have plenty of capacity.
import {
CloudWatchClient,
GetMetricStatisticsCommand,
} from "@aws-sdk/client-cloudwatch";
interface OptimizationOpportunity {
resourceId: string;
service: string;
currentMonthlyCost: number;
utilizationPercent: number;
recommendation: string;
estimatedMonthlySavings: number;
effort: "low" | "medium" | "high";
}
async function findOptimizationOpportunities(
records: BillingRecord[],
): Promise<OptimizationOpportunity[]> {
const cloudwatch = new CloudWatchClient({ region: "us-east-1" });
const opportunities: OptimizationOpportunity[] = [];
// Filter to high-cost services we can actually optimize
const rdsRecords = records.filter((r) => r.service.includes("RDS"));
const ec2Records = records.filter((r) => r.service.includes("EC2"));
// Analyze RDS instances
for (const record of rdsRecords) {
const instanceId = extractInstanceId(record.resourceId);
if (!instanceId) continue;
// Get CPU utilization for the past 7 days
const command = new GetMetricStatisticsCommand({
Namespace: "AWS/RDS",
MetricName: "CPUUtilization",
Dimensions: [{ Name: "DBInstanceIdentifier", Value: instanceId }],
StartTime: new Date(Date.now() - 7 * 24 * 60 * 60 * 1000),
EndTime: new Date(),
Period: 3600,
Statistics: ["Average"],
});
const response = await cloudwatch.send(command);
const avgCpuUtilization =
response.Datapoints?.reduce((sum, dp) => sum + (dp.Average || 0), 0) /
(response.Datapoints?.length || 1) || 0;
// If average CPU is below 10%, this instance is underutilized
if (avgCpuUtilization < 10) {
opportunities.push({
resourceId: instanceId,
service: "RDS",
currentMonthlyCost: record.unblendedCost,
utilizationPercent: avgCpuUtilization,
recommendation: `Downsize from current instance type. CPU averaged ${avgCpuUtilization.toFixed(
1,
)}% over the past week.`,
estimatedMonthlySavings: record.unblendedCost * 0.4, // Assume 40% cost reduction
effort: "medium",
});
}
}
return opportunities.sort(
(a, b) => b.estimatedMonthlySavings - a.estimatedMonthlySavings,
);
}
function extractInstanceId(resourceId: string): string | null {
const match = resourceId.match(/db-[A-Z0-9]+|i-[a-z0-9]+/);
return match ? match[0] : null;
}
This is doing real work now. We're fetching CloudWatch metrics and correlating them with billing data. An instance with 5% CPU utilization is wasting money. The estimatedMonthlySavings field tells you where to focus—optimize the high-impact items first. You don't have time to optimize everything, so you optimize the things that save the most money.
But here's the hidden layer: why this approach? Because you can't trust AWS tags alone. Your team probably has instances they forgot about, or instances with misleading names. By correlating usage metrics with costs, you catch actual waste, not just things with suspicious tags. You're working with hard evidence, not assumptions.
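The flat 40% savings assumption in the code above can be refined. Here's a sketch that maps observed CPU to a number of instance-size steps down, under two illustrative rules of thumb (assumptions, not AWS facts): each size step within a family roughly halves the price and doubles per-instance CPU load, and we only step down while projected CPU stays under 50%:

```typescript
// Estimate monthly savings from downsizing based on average CPU.
// Assumptions (hypothetical): each instance-size step down roughly
// halves the cost and doubles CPU utilization, capped at three steps.
function estimateDownsizeSavings(
  currentMonthlyCost: number,
  avgCpuPercent: number,
): { stepsDown: number; estimatedMonthlySavings: number } {
  let stepsDown = 0;
  let projectedCpu = avgCpuPercent;
  // Keep stepping down while the next step would leave CPU under 50%.
  while (projectedCpu * 2 < 50 && stepsDown < 3) {
    projectedCpu *= 2;
    stepsDown++;
  }
  const newCost = currentMonthlyCost / 2 ** stepsDown;
  return { stepsDown, estimatedMonthlySavings: currentMonthlyCost - newCost };
}
```

An instance averaging 8% CPU comes out two steps down, a 75% cost reduction; one averaging 40% yields zero steps, so no change is proposed.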
Layer 3: Generating Terraform Code
This is where Claude Code shines. Instead of you writing the Terraform, Claude writes it based on the optimization opportunities you found. You found the waste; Claude turns it into executable code.
The fundamental reason code generation matters here is that writing infrastructure code is tedious. You know what needs to change—a database should be smaller, an autoscaling policy should be adjusted, a lifecycle rule should be added to S3. But translating that knowledge into correct, safe Terraform code takes time and carries risk. You might miss a subtle detail. You might write code that's technically correct but not idiomatic. You might forget to add the necessary migration steps or validation logic. All of this slows down optimization. And delay means money continues to leak.
Claude Code solves this by generating production-ready code from high-level specifications. You tell Claude "downsize this RDS instance from db.t3.xlarge to db.t3.large, but include validation that the smaller instance can still handle peak load" and Claude generates code that does exactly that. The code includes comments explaining the change, validation steps that verify the smaller instance will work, and rollback procedures in case something goes wrong. This isn't code that needs heavy review and modification—it's code you can deploy with confidence.
The deeper insight is that code generation combined with Claude's reasoning creates a new kind of safety. Claude doesn't just generate code; it reasons about why the code is safe. It looks at historical peak loads and verifies that the new instance type can handle them. It checks if the resource is tagged as "critical" and applies extra scrutiny if so. It identifies dependencies and warns about potential blast radius. This contextual reasoning is what makes Claude-generated code more reliable than blindly applying optimization patterns.
Another critical aspect: cost optimization code needs to be maintainable. When you implement an optimization, you're not just deploying code—you're making a decision that affects your cloud bill for months or years. If the code is incomprehensible, nobody understands why the decision was made. Six months from now, when your bill is lower, you want to know which optimizations contributed. Claude's code includes comments and explanations that create a paper trail. You can look at a Terraform module and understand exactly what was optimized, why, and what the expected savings are. This is how you create institutional knowledge about cost optimization.
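That kind of contextual scrutiny can also be encoded deterministically as a gate before any code is generated. A sketch, with an assumed 70% peak-CPU ceiling and an assumed "criticality" tag convention (neither comes from AWS or from the pipeline above):

```typescript
// A pre-generation safety gate: refuse to propose a downsize when
// historical peak load or tagging suggests it's risky. Thresholds and
// the tag name are illustrative assumptions.
function isDownsizeSafe(opts: {
  peakCpuPercent: number;   // max CPU over the lookback window
  sizeFactorAfter: number;  // e.g. 0.5 when halving the instance
  tags: Record<string, string>;
}): { safe: boolean; reason: string } {
  // Halving the instance roughly doubles the projected peak CPU.
  const projectedPeak = opts.peakCpuPercent / opts.sizeFactorAfter;
  if (projectedPeak > 70) {
    return {
      safe: false,
      reason: `projected peak CPU ${projectedPeak.toFixed(0)}% exceeds 70%`,
    };
  }
  if (opts.tags["criticality"] === "critical") {
    return { safe: false, reason: "resource tagged critical; require manual review" };
  }
  return { safe: true, reason: "peak load fits with headroom" };
}
```

Opportunities that fail the gate can still be surfaced in the report, just routed to a human instead of straight into code generation.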
import Anthropic from "@anthropic-ai/sdk";
interface TerraformChange {
resourceType: string;
currentConfig: Record<string, any>;
proposedConfig: Record<string, any>;
terraformCode: string;
estimatedSavings: number;
}
async function generateTerraformOptimizations(
opportunities: OptimizationOpportunity[],
): Promise<TerraformChange[]> {
const client = new Anthropic();
const changes: TerraformChange[] = [];
for (const opp of opportunities) {
const prompt = `
You are a Terraform expert. Given this optimization opportunity, generate a Terraform module that implements the change.
Resource: ${opp.resourceId}
Service: ${opp.service}
Current Monthly Cost: $${opp.currentMonthlyCost.toFixed(2)}
Utilization: ${opp.utilizationPercent.toFixed(1)}%
Recommendation: ${opp.recommendation}
Estimated Savings: $${opp.estimatedMonthlySavings.toFixed(2)}/month
Generate ONLY the Terraform code (HCL) that implements this optimization. Include comments explaining what changed and why. Do not include the resource ID or any variables—provide complete, runnable code.
`;
const message = await client.messages.create({
model: "claude-3-5-sonnet-20241022",
max_tokens: 1024,
messages: [{ role: "user", content: prompt }],
});
const terraformCode =
message.content[0].type === "text" ? message.content[0].text : "";
changes.push({
resourceType: opp.service,
currentConfig: {}, // In practice, you'd fetch the current state
proposedConfig: {}, // Claude would generate this
terraformCode,
estimatedSavings: opp.estimatedMonthlySavings,
});
}
return changes;
}
When you run this, Claude returns Terraform code that:
- Modifies the RDS instance to a smaller type
- Updates autoscaling policies
- Adds lifecycle rules to S3 buckets
- Removes unused resources
The prompt is the key. Notice how we're giving Claude the context—the current cost, the utilization percentage, and the recommendation. This grounds Claude's reasoning: it has to justify each change against real numbers rather than invent one. "Change from db.t3.xlarge to db.t3.large because CPU is averaging 8%. This cuts cost by 40% and still leaves 2x overhead capacity."
Layer 4: Validation, Tracking, and Reporting
Now you need to validate that the changes are safe and track whether they actually deliver the savings you estimated. This is the feedback loop that teaches your system.
The challenge with cost optimization is that the savings are only real if you actually measure them. You can estimate that downsizing a database will save $400/month, but what if you're wrong about the workload characteristics? What if there's a periodic batch job that uses more resources than you expected? What if the application's performance degrades and it's no longer acceptable? These are real risks, and they're why validation and tracking matter so much.
The key is creating a feedback loop: estimate savings, implement changes, measure actual savings, compare estimate to actual. This feedback teaches you and Claude where your estimation models are wrong. If you consistently underestimate savings, future recommendations will be scaled accordingly. If you consistently overestimate, you know to be more conservative.
Tracking also creates accountability. When you implement an optimization, it should appear in a report showing estimated vs. actual savings. This forces the organization to follow through on optimizations. It's easy to delay deployment of a Terraform change if nobody's watching. It's much harder to delay when everyone knows the savings are stalled until the change goes live.
The deeper purpose of tracking is organizational learning. Over time, you build a database of optimizations: which ones worked, which ones didn't, how much each actually saved. This becomes invaluable intelligence. New engineers joining the team can look at past optimizations and understand the patterns. Your CFO can see exactly where the cost savings are coming from and whether they're sustainable. Claude can learn from this data and make better recommendations in the future.
interface CostOptimizationReport {
generatedAt: Date;
opportunities: OptimizationOpportunity[];
terraformChanges: TerraformChange[];
totalEstimatedMonthlySavings: number;
implementationRisk: "low" | "medium" | "high";
recommendations: string[];
}
function generateOptimizationReport(
opportunities: OptimizationOpportunity[],
changes: TerraformChange[],
): CostOptimizationReport {
const totalSavings = changes.reduce((sum, c) => sum + c.estimatedSavings, 0);
// Risk assessment: high-effort changes are riskier
const highEffortCount = opportunities.filter(
(o) => o.effort === "high",
).length;
const implementationRisk =
highEffortCount > 3 ? "high" : highEffortCount > 1 ? "medium" : "low";
const recommendations: string[] = [];
// Generate actionable recommendations
opportunities.forEach((opp) => {
if (opp.effort === "low") {
recommendations.push(
`[QUICK WIN] ${opp.recommendation} (Est. $${opp.estimatedMonthlySavings.toFixed(2)}/mo)`,
);
} else if (opp.estimatedMonthlySavings > 1000) {
recommendations.push(
`[HIGH IMPACT] ${opp.recommendation} (Est. $${opp.estimatedMonthlySavings.toFixed(2)}/mo)`,
);
}
});
return {
generatedAt: new Date(),
opportunities,
terraformChanges: changes,
totalEstimatedMonthlySavings: totalSavings,
implementationRisk,
recommendations,
};
}
The report is the communication layer. It tells your team exactly what to optimize, in what order, and why. The QUICK WIN and HIGH IMPACT tags focus attention on the things that matter.
Now, the critical part: tracking actual savings. Generate a baseline bill today, implement the changes, and measure the bill 30 days from now.
interface CostOptimizationMetric {
changeId: string;
resourceId: string;
estimatedSavings: number;
actualSavings: number;
deploymentDate: Date;
measurementDate: Date;
accuracyPercent: number;
}
function calculateActualSavings(
beforeBillingData: BillingRecord[],
afterBillingData: BillingRecord[],
): CostOptimizationMetric[] {
const beforeByResource = new Map<string, number>();
const afterByResource = new Map<string, number>();
beforeBillingData.forEach((r) => {
const key = r.resourceId;
beforeByResource.set(
key,
(beforeByResource.get(key) || 0) + r.unblendedCost,
);
});
afterBillingData.forEach((r) => {
const key = r.resourceId;
afterByResource.set(key, (afterByResource.get(key) || 0) + r.unblendedCost);
});
const metrics: CostOptimizationMetric[] = [];
beforeByResource.forEach((beforeCost, resourceId) => {
const afterCost = afterByResource.get(resourceId) || 0;
const actualSavings = Math.max(0, beforeCost - afterCost);
const estimatedSavings = beforeCost * 0.4; // Example: 40% reduction
metrics.push({
changeId: `change-${resourceId}`,
resourceId,
estimatedSavings,
actualSavings,
deploymentDate: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000),
measurementDate: new Date(),
accuracyPercent: (actualSavings / estimatedSavings) * 100,
});
});
return metrics;
}
This is your feedback loop. If Claude estimates $500/month in savings and you actually see $480/month, that's 96% accuracy—Claude got it right. If you see $200/month instead, then Claude was too optimistic, and you adjust future estimates. Over time, your accuracy improves. You build institutional knowledge about how much you can actually save.
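That adjustment can be made mechanical. A sketch of a calibration multiplier derived from past estimate-vs-actual records like the CostOptimizationMetric above (the function names here are hypothetical):

```typescript
interface SavingsRecord {
  estimatedSavings: number;
  actualSavings: number;
}

// Scale new estimates by how well past estimates matched reality.
// If deployed optimizations delivered 80% of promised savings overall,
// discount new estimates to 80% as well.
function calibrationMultiplier(history: SavingsRecord[]): number {
  if (history.length === 0) return 1; // no history yet: trust the raw estimate
  const estimated = history.reduce((s, m) => s + m.estimatedSavings, 0);
  const actual = history.reduce((s, m) => s + m.actualSavings, 0);
  return estimated > 0 ? actual / estimated : 1;
}

function calibratedEstimate(rawEstimate: number, history: SavingsRecord[]): number {
  return rawEstimate * calibrationMultiplier(history);
}
```

With two past changes that promised $800 combined and delivered $640, a new raw estimate of $1,000 would be reported as $800.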
Putting It Together: The Full Pipeline
Here's how all four layers work together:
async function runCostOptimizationPipeline(
billingFilePath: string,
): Promise<CostOptimizationReport> {
// Layer 1: Parse billing data
console.log("Parsing billing data...");
const billingRecords = await parseBillingExport(billingFilePath);
// Layer 1: Analyze costs
const costBreakdown = analyzeServiceCosts(billingRecords);
console.log(`Found ${costBreakdown.length} services. Top 3:`);
costBreakdown.slice(0, 3).forEach((s) => {
console.log(` ${s.service}: $${s.totalCost.toFixed(2)}/month`);
});
// Layer 2: Identify opportunities
console.log("\nScanning for optimization opportunities...");
const opportunities = await findOptimizationOpportunities(billingRecords);
console.log(`Found ${opportunities.length} optimization opportunities.`);
// Layer 3: Generate Terraform code
console.log("\nGenerating Terraform modules...");
const terraformChanges = await generateTerraformOptimizations(opportunities);
// Layer 4: Generate report
const report = generateOptimizationReport(opportunities, terraformChanges);
console.log(`\n=== COST OPTIMIZATION REPORT ===`);
console.log(
`Estimated Monthly Savings: $${report.totalEstimatedMonthlySavings.toFixed(2)}`,
);
console.log(`Implementation Risk: ${report.implementationRisk}`);
console.log(`\nRecommendations:`);
report.recommendations.forEach((rec) => {
console.log(` - ${rec}`);
});
return report;
}
// Run it
const report = await runCostOptimizationPipeline("./aws-billing-export.csv");
Expected output:
Parsing billing data...
Found 8 services. Top 3:
AmazonRDS: $12,450.32/month
AmazonEC2: $8,920.15/month
AmazonS3: $3,210.88/month
Scanning for optimization opportunities...
Found 7 optimization opportunities.
Generating Terraform modules...
=== COST OPTIMIZATION REPORT ===
Estimated Monthly Savings: $3,847.50
Implementation Risk: medium
Recommendations:
- [HIGH IMPACT] Downsize RDS instances. CPU averaged 8.2% over the past week. (Est. $1,820.00/mo)
- [QUICK WIN] Migrate old S3 data to Glacier. Unaccessed for 180+ days. (Est. $520.50/mo)
- [HIGH IMPACT] Reduce NAT Gateway redundancy across 3 AZs. (Est. $950.00/mo)
- [QUICK WIN] Remove unused Elastic IPs. (Est. $156.00/mo)
Building Cost Guardrails
Here's where this gets powerful: you can automate this. Run this pipeline weekly, diff the Terraform changes against what's already deployed, and flag new optimization opportunities.
interface CostGuardrail {
name: string;
threshold: number;
action: "alert" | "auto-remediate" | "require-approval";
enabled: boolean;
}
async function enforceCostGuardrails(
report: CostOptimizationReport,
): Promise<void> {
const guardrails: CostGuardrail[] = [
{
name: "monthly-savings-threshold",
threshold: 500, // Alert if we can save more than $500/month
action: "alert",
enabled: true,
},
{
name: "underutilized-resources",
threshold: 5, // Alert if more than 5 underutilized resources found
action: "alert",
enabled: true,
},
{
name: "quick-win-implementation",
threshold: 0,
action: "auto-remediate", // Automatically implement low-risk, low-effort optimizations
enabled: false, // Disabled by default for safety
},
];
guardrails.forEach((guardrail) => {
if (!guardrail.enabled) return;
if (guardrail.name === "monthly-savings-threshold") {
if (report.totalEstimatedMonthlySavings > guardrail.threshold) {
console.log(
`⚠️ Cost guardrail triggered: Potential $${report.totalEstimatedMonthlySavings.toFixed(2)}/month in savings available`,
);
}
}
if (guardrail.name === "underutilized-resources") {
if (report.opportunities.length > guardrail.threshold) {
console.log(
`⚠️ Cost guardrail triggered: ${report.opportunities.length} underutilized resources detected`,
);
}
}
});
}
The action field is the key: alert means "flag it for a human," auto-remediate means "deploy it automatically," and require-approval means "generate the code and wait for manual review."
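One way to make those three behaviors concrete is a small dispatcher; the handler callbacks here are placeholders for your real integrations (a Slack post, an auto-apply job, a PR opener):

```typescript
type GuardrailAction = "alert" | "auto-remediate" | "require-approval";

// Route each guardrail action to its behavior. Handlers are injected
// so the policy stays separate from the integrations behind it.
function dispatchGuardrailAction(
  action: GuardrailAction,
  message: string,
  handlers: {
    alert: (msg: string) => void;           // flag it for a human
    remediate: (msg: string) => void;       // deploy automatically
    requestApproval: (msg: string) => void; // generate code, wait for review
  },
): void {
  switch (action) {
    case "alert":
      handlers.alert(message);
      break;
    case "auto-remediate":
      handlers.remediate(message);
      break;
    case "require-approval":
      handlers.requestApproval(message);
      break;
  }
}
```

Keeping the policy table (the guardrails) separate from the dispatcher means enabling auto-remediation later is a one-line config change, not a code change.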
Handling the Reserved Instances Trap
Here's something that trips up most cost optimization efforts: reserved instances (RIs) create artificial sunk costs. Your team bought 3-year RIs for m5.xlarge instances. But now you want to downsize to m5.large. The RI is non-refundable. Do you abandon it and buy new smaller RIs? Or do you run oversized instances to "get your money's worth"?
Claude Code can navigate this complexity. It understands sunk costs and can weigh the remaining commitment against forward-looking savings:
interface ReservedInstanceAnalysis {
reservationId: string;
instanceType: string;
sunkCostRemaining: number;
yearsRemaining: number;
recommendedPath:
| "keep-and-oversized"
| "abandon-and-downsize"
| "resell-on-marketplace";
breakEvenAnalysis: {
daysUntilBreakeven: number;
futureAnnualSavings: number;
isBeneficial: boolean;
};
}
async function analyzeReservedInstanceDecision(
riDetails: any,
currentUsagePattern: BillingRecord[],
): Promise<ReservedInstanceAnalysis> {
const client = new Anthropic();
const prompt = `
You are analyzing whether to keep, abandon, or resell a reserved instance.
RI Details:
- Type: ${riDetails.instanceType}
- Sunk Cost Remaining: $${riDetails.sunkCostRemaining}
- Years Remaining: ${riDetails.yearsRemaining}
- Current On-Demand Cost: $${riDetails.onDemandCostPerMonth}/month
- RI Cost Per Month: $${riDetails.riCostPerMonth}/month
Proposed Change:
- New Instance Type: ${riDetails.proposedType}
- Downsizing Savings: ${riDetails.downsizingSavingsPercent}%
Analyze:
1. Should we continue using this RI (sunk cost) or abandon it?
2. Would downsizing save more than maintaining the RI?
3. Is the RI worth selling on the marketplace?
4. Break-even point in days/months
Provide a JSON response with: { recommendation, breakEvenDays, futureAnnualSavings, explanation }
`;
const message = await client.messages.create({
model: "claude-3-5-sonnet-20241022",
max_tokens: 512,
messages: [{ role: "user", content: prompt }],
});
const analysis = JSON.parse(
message.content[0].type === "text" ? message.content[0].text : "{}",
);
return {
reservationId: riDetails.id,
instanceType: riDetails.instanceType,
sunkCostRemaining: riDetails.sunkCostRemaining,
yearsRemaining: riDetails.yearsRemaining,
recommendedPath: analysis.recommendation,
breakEvenAnalysis: {
daysUntilBreakeven: analysis.breakEvenDays,
futureAnnualSavings: analysis.futureAnnualSavings,
isBeneficial: analysis.breakEvenDays < 365,
},
};
}
The key insight: sunk cost is sunk. Claude understands this and can reason through whether abandoning an RI is actually the right call, despite the psychological pain of "wasting money" you already spent.
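You can also sanity-check Claude's recommendation with plain arithmetic. A sketch that compares only forward-looking cash flows; the field names are assumptions, and recoverableMonthly stands for whatever you'd stop paying (or recover, e.g. via a marketplace sale) each month by offloading the reservation:

```typescript
// Compare only future cash flows; money already paid is sunk.
// Path A: keep running the oversized instance on the RI.
// Path B: offload the reservation and run a right-sized replacement.
function riOffloadAnalysis(opts: {
  recoverableMonthly: number;  // monthly RI cost avoided/recovered by offloading
  newInstanceMonthly: number;  // cost of the right-sized replacement
  monthsRemaining: number;     // months left on the reservation term
}): { monthlyDelta: number; totalFutureSavings: number; worthOffloading: boolean } {
  const monthlyDelta = opts.recoverableMonthly - opts.newInstanceMonthly;
  return {
    monthlyDelta,
    totalFutureSavings: monthlyDelta * opts.monthsRemaining,
    worthOffloading: monthlyDelta > 0,
  };
}
```

Note the edge case this makes obvious: for an all-upfront RI you can't offload, recoverableMonthly is zero, so continuing to use it always wins, regardless of how much was paid up front.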
Integration: Slack Notifications and Approvals
Cost insights mean nothing if they sit in a database. You need them in your team's workflow. Here's how to integrate cost optimization reports into Slack with approval workflows:
async function notifySlackWithApproval(
report: CostOptimizationReport,
slackWebhookUrl: string,
): Promise<void> {
const blocks = [
{
type: "header",
text: {
type: "plain_text",
text: "💰 Cost Optimization Report",
},
},
{
type: "section",
text: {
type: "mrkdwn",
text: `*Estimated Monthly Savings:* $${report.totalEstimatedMonthlySavings.toFixed(2)}\n*Implementation Risk:* ${report.implementationRisk}`,
},
},
{
type: "divider",
},
];
// Add top 5 recommendations
report.recommendations.slice(0, 5).forEach((rec) => {
blocks.push({
type: "section",
text: {
type: "mrkdwn",
text: `• ${rec}`,
},
});
});
blocks.push({
type: "actions",
elements: [
{
type: "button",
text: {
type: "plain_text",
text: "📊 View Full Report",
},
url: `https://your-dashboard.internal/reports/${report.generatedAt.getTime()}`,
},
{
type: "button",
text: {
type: "plain_text",
text: "✅ Approve Optimizations",
},
value: "approve",
action_id: "approve_optimizations",
},
],
});
await fetch(slackWebhookUrl, {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ blocks }),
});
}
When someone clicks "Approve Optimizations," you can trigger a CI/CD pipeline that generates the Terraform code, runs terraform plan in a PR, waits for code review, and merges on approval.
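A sketch of what that approval hook might look like on your side. It builds the CI trigger request rather than sending it, so the transport stays up to you; the endpoint URL and the payload shape are assumptions based on the action_id in the Slack block above:

```typescript
interface SlackActionPayload {
  actions: { action_id: string; value: string }[];
  user: { username: string };
}

// Turn an "Approve Optimizations" click into a CI trigger request.
// Returns null when the interaction wasn't an approval. The URL is a
// placeholder for your real pipeline trigger (e.g. a workflow dispatch).
function buildCiTrigger(
  payload: SlackActionPayload,
  triggerUrl = "https://ci.internal/hooks/cost-optimization",
): { url: string; method: "POST"; body: string } | null {
  const approved = payload.actions.some(
    (a) => a.action_id === "approve_optimizations" && a.value === "approve",
  );
  if (!approved) return null;
  return {
    url: triggerUrl,
    method: "POST",
    body: JSON.stringify({ approvedBy: payload.user.username }),
  };
}
```

The caller would POST the returned request with fetch; recording approvedBy gives you an audit trail of who green-lit each batch of changes.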
The Measurement Loop: Quarterly Savings Audits
Here's what separates good cost optimization from great cost optimization: you track whether estimates match reality. Every quarter, run a comprehensive audit:
async function quarterlyAuditSavings(
previousQuarterReport: CostOptimizationReport,
): Promise<{ estimatedVsActual: number; accuracy: number }> {
// fetchBillingData is assumed to be a helper that returns BillingRecords for a date range
const beforeBilling = await fetchBillingData("3-months-ago", "2-months-ago");
const afterBilling = await fetchBillingData("1-month-ago", "now");
// Calculate actual cost savings
const beforeTotal = beforeBilling.reduce(
(sum, r) => sum + r.unblendedCost,
0,
);
const afterTotal = afterBilling.reduce((sum, r) => sum + r.unblendedCost, 0);
const actualSavings = beforeTotal - afterTotal;
const estimatedSavings =
previousQuarterReport.totalEstimatedMonthlySavings * 3;
const accuracy = (actualSavings / estimatedSavings) * 100;
console.log(`
=== QUARTERLY AUDIT ===
Estimated Savings (Q): $${estimatedSavings.toFixed(2)}
Actual Savings (Q): $${actualSavings.toFixed(2)}
Accuracy: ${accuracy.toFixed(1)}%
${accuracy > 90 ? "✅ Estimates are accurate" : "⚠️ Review estimation logic"}
`);
return {
estimatedVsActual: actualSavings - estimatedSavings,
accuracy,
};
}
This creates a feedback loop. If Claude's estimates are consistently 20% too optimistic, you adjust the confidence multiplier in future reports. Over time, your estimates become razor-sharp. You're not just generating recommendations—you're learning from them.
Summary: From Insight to Action
A cost optimization assistant built with Claude Code does something most FinOps tools don't: it closes the gap between finding waste and fixing it. You get:
- Automated analysis of billing data and resource utilization
- Generated Terraform code that implements optimizations without human IaC writing
- Measured feedback on whether optimizations actually save money
- Automated guardrails that catch cost issues before they spiral
- Organizational awareness of cost implications of infrastructure decisions
The pipeline runs weekly. By month's end, you've implemented $10-20K in annual savings without asking your engineers to dig through dashboards or write IaC by hand.
Start small: pick one service (RDS, EC2, or S3), build the analysis layer, and test the optimization detection. Once you trust the recommendations, add the Terraform generation. Once you trust the generated code, automate it.
Real-World Scenario: A Complete Cost Optimization Success Story
Let's walk through a realistic scenario to see how this system actually works in practice. Imagine a Series B startup spending $45,000/month on AWS. They know they're wasting money but haven't prioritized fixing it. Every week there's a new feature to build, a new market to target. Cost optimization feels like something for later.
Week 1 of cost optimization: Your team runs the pipeline. It parses the billing export, analyzes services, finds optimization opportunities. The report shows:
- RDS: $12,450/month (28% of bill). Average CPU 8%. Opportunity: downsize from db.r5.2xlarge to db.r5.xlarge. Savings: $4,900/month.
- EC2: $8,920/month (20% of bill). One cluster is running 50 instances at 2 AM for no reason. Opportunity: fix autoscaling metrics. Savings: $2,100/month.
- NAT Gateway: $2,400/month (5% of bill). Redundancy across 3 AZs, but you only need 1. Opportunity: reduce redundancy. Savings: $1,600/month.
- S3: $3,210/month (7% of bill). 400GB of data unaccessed for 18 months. Opportunity: migrate to Glacier. Savings: $320/month.
Total identified savings: $8,920/month. That's $107,000/year. Your CFO wanted a 15% cost reduction. This gets you most of the way there.
Week 2: Claude generates Terraform code for all four optimizations. Your infrastructure team reviews the code. Two are approved immediately. One (the autoscaling fix) requires more investigation—they want to understand why scaling happens at 2 AM. One (NAT Gateway) gets approved after clarification that single-AZ is acceptable for this workload.
Week 3: Three optimizations are deployed. The fourth is scheduled for a safer window.
Week 4: You measure results. RDS cost dropped from $12,450 to $7,850. The downsize worked. Actual savings: $4,600 (close to the $4,900 estimate). EC2 cost dropped from $8,920 to $6,900. That 2 AM spike is gone. Actual savings: $2,020. NAT dropped from $2,400 to $900. Actual savings: $1,500. S3 didn't change much yet—migration to Glacier is slower to show impact—but it will.
Total actual savings: $8,120/month. That's 91% of the $8,920 estimate—accurate enough to trust the system.
Month 2: The pattern repeats. New billing data arrives. New opportunities are identified. The system found smaller optimizations: unused security groups, oversized Lambda functions, CloudWatch Logs that are too verbose. More Terraform code is generated. More savings are realized.
By month 3, the company is running $35,000/month instead of $45,000/month. They've reduced cloud costs by 22% without reducing capacity or functionality. And the system runs automatically every month, continuously looking for new waste. Every time an engineer spins up a resource, the system will notice if it's underutilized. Every time billing patterns change, opportunities surface.
That's the power of systematic cost optimization: it doesn't just save money once. It creates a culture of cost awareness because visibility and actionability are now automatic.
Handling Complexity: Multi-Account and Multi-Region Optimization
As you scale to multiple AWS accounts and regions, cost optimization gets more complex. You can't optimize in isolation—some resources are intentionally redundant across regions for disaster recovery. Some accounts exist for specific purposes. You need to respect architectural constraints while still finding waste.
Here's how to handle multi-account analysis:
interface AccountOptimization {
accountId: string;
accountName: string;
opportunities: OptimizationOpportunity[];
constraints: string[]; // "Keep 2 RDS replicas for DR", etc.
totalEstimatedSavings: number;
}
async function analyzeMultiAccountOpportunities(
allBillingData: BillingRecord[],
constraints: Map<string, string[]>,
): Promise<AccountOptimization[]> {
const byAccount = new Map<string, BillingRecord[]>();
allBillingData.forEach((record) => {
const key = record.linkedAccount;
if (!byAccount.has(key)) {
byAccount.set(key, []);
}
byAccount.get(key)!.push(record);
});
const accountOptimizations: AccountOptimization[] = [];
for (const [accountId, records] of byAccount.entries()) {
const accountConstraints = constraints.get(accountId) || [];
// Find opportunities for this account
const opportunities = await findOptimizationOpportunities(records);
// Filter opportunities that violate constraints
const constrainedOpportunities = opportunities.filter((opp) => {
const violatesConstraint = accountConstraints.some(
(constraint) =>
constraint.toLowerCase().includes(opp.resourceId.toLowerCase()) ||
constraint.toLowerCase().includes(opp.service.toLowerCase()),
);
return !violatesConstraint;
});
accountOptimizations.push({
accountId,
accountName: getAccountName(accountId),
opportunities: constrainedOpportunities,
constraints: accountConstraints,
totalEstimatedSavings: constrainedOpportunities.reduce(
(sum, opp) => sum + opp.estimatedMonthlySavings,
0,
),
});
}
return accountOptimizations.sort(
(a, b) => b.totalEstimatedSavings - a.totalEstimatedSavings,
);
}

Now your report respects constraints. If an account is marked "keep 2x RDS redundancy for critical systems," the system won't recommend consolidation that violates that requirement. Smart optimization respects architectural decisions.
Communicating Value: Building Cost Awareness in Your Organization
The technical system is only half the battle. The other half is organizational: making cost visible and actionable for the entire company, not just the infrastructure team.
Here's how to build cost awareness without friction:
Every sprint: Include cost in the definition of done. "Feature complete, tested, documented, and cost-optimized." For most features, this is no extra work. But for resource-intensive features (batch processing, data pipelines), optimization becomes part of the requirement.
Every dashboard: Include cost projections. "This cluster will cost $8,900 this month based on current usage." Visibility makes impact real.
Every meeting: Show cost trends. "Cloud spend went down 4% last month. Here's what we did." Celebrate wins. Build momentum.
Every onboarding: Teach cost culture. "When you spin up infrastructure, know its cost. If it's not delivering $100+ value per month, consolidate or remove it." Engineers who understand costs make better decisions.
The goal: cost optimization becomes a shared value, not an infrastructure team burden. When the whole organization cares about cost efficiency, optimization becomes self-reinforcing.
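For the dashboard projections mentioned above ("this cluster will cost $8,900 this month"), a simple linear extrapolation of month-to-date spend is often enough. A sketch, assuming roughly uniform daily spend:

```typescript
// Sketch: project month-end cost from month-to-date spend, assuming
// roughly uniform daily spend. asOf is the date the snapshot was taken.
function projectMonthlyCost(monthToDateCost: number, asOf: Date): number {
  // Day 0 of the next month is the last day of this month
  const daysInMonth = new Date(asOf.getFullYear(), asOf.getMonth() + 1, 0).getDate();
  const daysElapsed = asOf.getDate();
  return (monthToDateCost / daysElapsed) * daysInMonth;
}
```

It's deliberately naive—spiky batch workloads break the linearity assumption—but it makes cost trajectories visible with almost no work.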
Automating the Optimization Cycle
Once you've validated your approach with manual optimization, automate the entire pipeline. This is where your cost optimization system becomes self-sustaining:
// src/automation.ts
import cron from "node-cron";
interface OptimizationJob {
id: string;
service: string;
schedule: string; // cron expression
enabled: boolean;
lastRun?: Date;
nextRun?: Date;
}
const optimizationJobs: OptimizationJob[] = [
{
id: "weekly-rds-analysis",
service: "rds",
schedule: "0 2 * * 1", // Every Monday at 2 AM
enabled: true,
},
{
id: "daily-ec2-analysis",
service: "ec2",
schedule: "0 3 * * *", // Every day at 3 AM
enabled: true,
},
{
id: "weekly-s3-analysis",
service: "s3",
schedule: "0 2 * * 0", // Every Sunday at 2 AM
enabled: true,
},
];
async function runOptimizationJob(job: OptimizationJob) {
console.log(`Starting optimization job: ${job.id}`);
try {
// Step 1: Fetch billing data
const billingData = await fetchBillingExport();
// Step 2: Analyze for opportunities
const opportunities = await findOptimizationOpportunities(billingData);
if (opportunities.length === 0) {
console.log(`No opportunities found for ${job.service}`);
job.lastRun = new Date();
return;
}
// Step 3: Generate Terraform code
const changes = await generateTerraformOptimizations(opportunities);
// Step 4: Create PR with changes
const prUrl = await createPullRequestWithChanges(
job.service,
changes,
opportunities,
);
// Step 5: Notify team
await notifyTeamOfOptimizations(job.service, opportunities, prUrl);
// Step 6: Log results
await logOptimizationRun({
jobId: job.id,
service: job.service,
opportunitiesFound: opportunities.length,
potentialSavings: opportunities.reduce(
(sum, o) => sum + o.estimatedMonthlySavings,
0,
),
prUrl,
timestamp: new Date(),
});
job.lastRun = new Date();
console.log(`Optimization job completed: ${job.id}`);
} catch (error) {
console.error(`Optimization job failed: ${job.id}`, error);
await alertOncallAboutFailure(job.id, error);
}
}
// Schedule all enabled jobs
for (const job of optimizationJobs) {
if (job.enabled) {
cron.schedule(job.schedule, () => runOptimizationJob(job));
console.log(`Scheduled optimization job: ${job.id} (${job.schedule})`);
}
}
async function createPullRequestWithChanges(
service: string,
changes: TerraformChange[],
opportunities: OptimizationOpportunity[],
): Promise<string> {
const timestamp = new Date().toISOString().replace(/[:.]/g, "-"); // colons are not valid in git branch names
const branchName = `cost-opt/${service}/${timestamp}`;
// Create branch
await git.checkoutBranch(branchName);
// Write Terraform files
for (const [index, change] of changes.entries()) {
// Suffix with the index so multiple changes don't overwrite the same file
const filePath = `terraform/optimizations/${service}-${timestamp}-${index}.tf`;
const tfCode = `
# Auto-generated cost optimization
# Generated: ${timestamp}
# Service: ${service}
# Estimated Savings: $${change.estimatedSavings.toFixed(2)}/month
${change.terraformCode}
`;
await fs.writeFile(filePath, tfCode);
}
// Commit and push
await git.add(".");
await git.commit(
`Cost optimization: ${service} - ${opportunities.length} opportunities`,
);
await git.push("origin", branchName);
// Create PR
const pr = await github.createPullRequest({
title: `Cost Optimization: ${service} - $${opportunities.reduce((s, o) => s + o.estimatedMonthlySavings, 0).toFixed(2)}/mo potential savings`,
body: generatePRDescription(service, opportunities),
head: branchName,
base: "main",
});
return pr.html_url;
}
function generatePRDescription(
service: string,
opportunities: OptimizationOpportunity[],
): string {
let description = `# Cost Optimization: ${service}\n\n`;
description += `## Summary\n`;
description += `Identified ${opportunities.length} opportunities for cost optimization.\n\n`;
description += `**Estimated Monthly Savings**: $${opportunities.reduce((s, o) => s + o.estimatedMonthlySavings, 0).toFixed(2)}\n\n`;
description += `## Opportunities\n`;
opportunities.forEach((opp, i) => {
description += `\n### ${i + 1}. ${opp.recommendation}\n`;
description += `- **Resource**: ${opp.resourceId}\n`;
description += `- **Current Cost**: $${opp.currentMonthlyCost.toFixed(2)}\n`;
description += `- **Utilization**: ${opp.utilizationPercent.toFixed(1)}%\n`;
description += `- **Estimated Savings**: $${opp.estimatedMonthlySavings.toFixed(2)}/month\n`;
description += `- **Effort**: ${opp.effort}\n`;
});
description += `\n## Validation\n`;
description += `This PR was auto-generated by the Cost Optimization Assistant.\n`;
description += `- [ ] Manual validation of changes\n`;
description += `- [ ] Testing in staging environment\n`;
description += `- [ ] Approval for production deployment\n`;
return description;
}

The automated pipeline runs on schedule, identifies opportunities, generates PRs, and notifies teams. No manual intervention needed. The system continuously hunts for waste and proposes fixes.
Managing Risk in Automated Optimizations
Automation is powerful but risky. You need guardrails that prevent bad optimizations from reaching production:
// src/risk-management.ts
interface OptimizationRiskAssessment {
changeId: string;
riskLevel: "low" | "medium" | "high";
risks: string[];
requiredApprovals: number;
autoMergeAllowed: boolean;
}
async function assessOptimizationRisk(
change: TerraformChange,
opportunity: OptimizationOpportunity,
): Promise<OptimizationRiskAssessment> {
const risks: string[] = [];
let riskLevel: "low" | "medium" | "high" = "low";
// Check if it's a database change
if (change.resourceType.includes("RDS")) {
// Database downsizing is risky
if (opportunity.effort === "high") {
risks.push("Database instance type change—verify capacity with team");
riskLevel = "high";
}
}
// Check if it affects production
if (opportunity.resourceId.includes("prod")) {
riskLevel = "high";
risks.push("Affects production resources—requires extra validation");
}
// Check savings magnitude
if (opportunity.estimatedMonthlySavings > 5000) {
risks.push("Large cost savings—verify assumptions are correct");
if (riskLevel !== "high") riskLevel = "medium";
}
const requiredApprovals = riskLevel === "high" ? 2 : 1;
const autoMergeAllowed = riskLevel === "low";
return {
changeId: opportunity.resourceId, // identify the change by the resource it touches
riskLevel,
risks,
requiredApprovals,
autoMergeAllowed,
};
}
// Apply risk assessment to PR workflow
async function configurePRProtections(
prNumber: number,
risk: OptimizationRiskAssessment,
) {
// Require multiple reviews for high-risk changes
if (risk.riskLevel === "high") {
await github.updateBranchProtection({
required_approving_review_count: risk.requiredApprovals,
require_code_owner_reviews: true,
});
// Add labels for visibility
await github.addLabels(prNumber, ["cost-optimization", "high-risk"]);
}
// Auto-merge safe changes
if (risk.autoMergeAllowed) {
// Wait for CI to pass, then auto-merge
await github.enableAutomerge(prNumber);
}
}

Risk management prevents runaway automation. High-risk changes require human review. Low-risk changes merge automatically. This keeps the system fast while maintaining safety.
Avoiding the Pitfalls: Common Mistakes in Cost Optimization
Even with automation, teams make mistakes that undermine savings. Here are the most common pitfalls:
Pitfall 1: Optimizing the wrong things
You identify 20 opportunities, implement all of them, but only 5 deliver real savings. The others were false positives or the estimates were wrong. You spent engineering time on low-impact work.
Prevention: Pilot high-confidence optimizations first. Measure actual savings. Only then do the medium-confidence items. Score opportunities by confidence and impact; tackle high-confidence, high-impact items first.
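Scoring can be as simple as confidence times impact. A sketch — note the `confidence` field is an assumption added for illustration, not part of the `OptimizationOpportunity` interface used elsewhere in this article:

```typescript
// Sketch: rank opportunities by confidence × estimated savings so
// high-confidence, high-impact items get piloted first.
interface ScoredOpportunity {
  resourceId: string;
  estimatedMonthlySavings: number;
  confidence: number; // 0..1, how much you trust the estimate
}

function prioritize(opps: ScoredOpportunity[]): ScoredOpportunity[] {
  return [...opps].sort(
    (a, b) =>
      b.confidence * b.estimatedMonthlySavings -
      a.confidence * a.estimatedMonthlySavings,
  );
}
```

A $500/month saving you're 90% sure of outranks a $1,000/month saving you're 30% sure of, which matches how you'd want engineers to spend their time.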
Pitfall 2: Breaking things during optimization
You downsize a database to save cost, but production traffic increases unexpectedly and now you've undersized the resource. Outage. That cost savings looks foolish when revenue is impacted.
Prevention: Add safety margins to optimizations. "This workload peaks at 40% CPU, so a downsize that puts projected peak utilization at 70% still leaves headroom." Better to save $3,000 and keep a safety margin than save $5,000 and risk an outage. Always preserve headroom for growth and unexpected traffic.
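That safety-margin rule can be encoded directly. A hypothetical sketch — the candidate capacity fractions and the 70% ceiling are illustrative defaults, not fixed policy:

```typescript
// Hypothetical sketch: choose the smallest capacity that keeps projected
// peak utilization under a safety ceiling. Candidates are fractions of
// current capacity (1.0 = keep the current size).
function safeDownsizeFactor(
  peakUtilization: number, // observed peak as a fraction, e.g. 0.4 = 40% CPU
  ceiling = 0.7,
  candidates = [0.25, 0.5, 0.75, 1.0],
): number {
  // On a smaller instance, the same load produces proportionally higher utilization
  const safe = candidates.filter((c) => peakUtilization / c <= ceiling);
  // If nothing is safe (already over the ceiling), keep the current size
  return safe.length > 0 ? Math.min(...safe) : 1.0;
}
```

A workload peaking at 40% CPU can't safely halve its capacity (projected peak 80%), but it can drop to three-quarters (projected peak ~53%)—the function makes that arithmetic explicit instead of leaving it to intuition.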
Pitfall 3: Optimizing away future capacity
You consolidate databases to save cost. But now you're at 85% capacity with no room for growth. The next quarter when business grows, you have no headroom and need to expand quickly, negating savings.
Prevention: Respect growth projections. Factor in 6-month capacity plans when making long-term optimizations. Talk to product and planning teams about expected growth before making consolidation decisions.
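A growth-aware headroom check can gate consolidation decisions automatically. A sketch, assuming compounding monthly growth — the 85% ceiling and 6-month horizon are illustrative:

```typescript
// Hypothetical sketch: veto consolidations that would leave no room for
// projected growth. Assumes compounding monthly growth.
function hasGrowthHeadroom(
  currentUtilization: number, // fraction, e.g. 0.5 = 50%
  monthlyGrowthRate: number, // fraction, e.g. 0.05 = 5% per month
  months = 6,
  ceiling = 0.85, // don't consolidate past this projected utilization
): boolean {
  const projected = currentUtilization * Math.pow(1 + monthlyGrowthRate, months);
  return projected <= ceiling;
}
```

Run this against the growth projections you get from product and planning teams before approving any consolidation.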
Pitfall 4: Losing insights after implementation
You generated Terraform code based on analysis. Code got deployed. But now nobody remembers the analysis. Six months later, someone sees the old resource and spins it back up, recreating the waste.
Prevention: Document assumptions alongside Terraform. Include comments explaining why configuration changed. Link to the cost analysis report. Make the reasoning discoverable in version control.
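One lightweight way to make the reasoning discoverable is to prepend it to the generated Terraform itself. A hypothetical sketch — `OptimizationContext` and its fields are illustrative:

```typescript
// Hypothetical sketch: prepend the "why" to generated Terraform so the
// reasoning lives in version control next to the change itself.
interface OptimizationContext {
  resourceId: string;
  reason: string; // e.g. "peak CPU 40% over 90 days"
  reportUrl: string; // link back to the cost analysis report
}

function annotateTerraform(tfCode: string, ctx: OptimizationContext): string {
  return [
    `# WHY THIS CHANGED: ${ctx.reason}`,
    `# Analysis report: ${ctx.reportUrl}`,
    `# Do not revert without re-checking utilization for ${ctx.resourceId}`,
    tfCode,
  ].join("\n");
}
```

Six months later, `git blame` on the resource block leads straight to the analysis, which is exactly when someone is tempted to spin the old configuration back up.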
Pitfall 5: Ignoring team concerns
Your optimization system proposes downsizing a resource that an application team is using. The team has specific performance requirements you didn't know about. They override your change. Now your system loses credibility.
Prevention: Involve teams early. Share optimization proposals before generating PRs. Get feedback. Let teams explain why your assumptions might be wrong. This builds trust and improves accuracy.
Building Organizational Culture Around Cost
The most sophisticated cost optimization system still needs human participation. Engineers need to care about cost. Product teams need to understand cost implications. Here's how to build that culture:
// src/cost-culture.ts
import cron from "node-cron";
interface CostInsight {
message: string;
impact: "low" | "medium" | "high";
action?: string;
}
async function generateCostInsights(): Promise<CostInsight[]> {
const insights: CostInsight[] = [];
// Find expensive but unused services
const unusedServices = await findUnusedButExpensiveServices();
if (unusedServices.length > 0) {
insights.push({
message: `You're paying for ${unusedServices.length} services nobody is using. Total waste: $${unusedServices.reduce((s, u) => s + u.cost, 0).toFixed(2)}/month`,
impact: "high",
action: "Review and remove unused services",
});
}
// Find over-provisioned resources
const overProvisioned = await findOverProvisionedResources();
if (overProvisioned.length > 0) {
const totalWaste = overProvisioned.reduce(
(s, r) => s + r.potentialSavings,
0,
);
insights.push({
message: `${overProvisioned.length} resources are significantly over-sized. Potential savings: $${totalWaste.toFixed(2)}/month`,
impact: "high",
action: "Review resource sizing",
});
}
// Find opportunities for better pricing
const pricingOpportunities = await findPricingOpportunities();
if (pricingOpportunities.length > 0) {
insights.push({
message: `Switching to reserved instances could save $${pricingOpportunities.reduce((s, p) => s + p.savings, 0).toFixed(2)}/month`,
impact: "medium",
action: "Evaluate reserved instances",
});
}
return insights;
}
// Share insights in team channels
async function shareWeeklyCostInsights() {
const insights = await generateCostInsights();
const slack = new Slack();
const message = {
blocks: [
{
type: "header",
text: {
type: "plain_text",
text: "💰 This Week's Cost Optimization Opportunities",
},
},
...insights.map((insight) => ({
type: "section",
text: {
type: "mrkdwn",
text: `*${insight.impact.toUpperCase()}*: ${insight.message}`, // Slack mrkdwn uses single asterisks for bold
},
})),
{
type: "actions",
elements: [
{
type: "button",
text: {
type: "plain_text",
text: "View Dashboard",
},
url: "https://dashboard.internal/cost-insights",
},
],
},
],
};
await slack.postMessage("engineering", message);
}
// Share monthly report with leadership
async function generateMonthlyReport(): Promise<CostReport> {
const previousMonth = await getPreviousMonthCosts();
const thisMonth = await getCurrentMonthCosts();
const optimizations = await getImplementedOptimizations();
return {
period: "This Month",
totalCost: thisMonth,
previousCost: previousMonth,
changePercent: ((thisMonth - previousMonth) / previousMonth) * 100,
optimizationsSaved: optimizations.reduce((s, o) => s + o.actualSavings, 0),
pendingOpportunities: optimizations.filter((o) => o.status === "pending")
.length,
highlights: await generateHighlights(),
};
}
// Schedule weekly insights and monthly reports
cron.schedule("0 15 * * 5", shareWeeklyCostInsights); // Fridays at 3 PM
cron.schedule("0 9 1 * *", async () => {
// 1st of each month at 9 AM
const report = await generateMonthlyReport();
await sendToLeadership(report);
});

When cost insights are shared regularly and connected to organizational goals, teams start optimizing naturally. They see "we saved $50K this month because we right-sized 12 instances" and think "maybe my project could use some optimization too."
Summary: From Insight to Action
A cost optimization assistant built with Claude Code bridges the gap between finding waste and fixing it. You get:
- Automated analysis that runs continuously, not just quarterly
- Measured savings so you know whether optimizations actually work
- Proof-by-code where optimization comes with executable changes
- Organizational visibility where cost is everyone's concern, not just operations
- Feedback loops where estimates are compared against reality
- Automation with guardrails that generates fixes while protecting against risky changes
- Cultural alignment where teams understand cost implications of their decisions
The complete system—from billing analysis through implementation to measurement and cultural sharing—creates accountability and learning. Your second-quarter optimizations are better than your first quarter because you measured what worked.
Start small: analyze one service (RDS is a good choice). Build just the analysis layer. Run it manually once. If you find $500+ in savings, you've proven the concept. Then automate it. Then add code generation. Then expand to other services. Then add risk assessment and cultural sharing.
The most important insight: cost optimization isn't about being cheap, it's about being efficient. Waste isn't frugal, it's irresponsible. A systematic approach to finding and eliminating waste, combined with organizational culture that values efficiency, is how mature organizations operate sustainably.