Building an Enterprise Claude Code Dashboard

You've got Claude Code running across your organization. Teams are shipping faster. Agents are orchestrating workflows. And somewhere in your engineering leadership office, someone's asking the question that keeps you up at night: "How much is this costing us? What's actually happening out there?"
Welcome to the enterprise dashboard problem. This is where visibility meets accountability. You've got dozens—maybe hundreds—of agents running concurrently, consuming tokens, hitting APIs, and generating costs that compound faster than anyone predicted. Without a dashboard, you're operating in the dark. Your finance team is doing cost analysis in spreadsheets. Your ops team is troubleshooting problems they can't even see. And your developers are guessing about performance.
In this article, we're walking through how to build a centralized Claude Code management dashboard—the kind that gives admins real-time visibility, lets managers track usage and costs, and gives developers deep diagnostic tools. We'll cover the architecture, the APIs you need to tap into, the dashboard patterns that actually work, implementation details you'll need to debug, and a starter kit you can deploy today. This is not theory—it's production patterns you can steal directly.
Table of Contents
- The Enterprise Problem: Visibility at Scale
- Architecture: What We're Actually Building
- Collecting Telemetry at Scale
- Real-Time Updates Without Melting Your Infrastructure
- Building Role-Based Views
- Performance Optimization: Caching and Materialization
- Handling Scale: Connection Pooling and Message Brokers
- Common Pitfalls: What Goes Wrong
- The Human Element: Building Dashboards People Actually Use
- Making Dashboards Sticky
- Organizational Impact: Beyond the Technology
- Advanced Dashboard Features: Anomaly Detection and Alerting
- Interactive Drill-Down Capabilities
- Building Custom Alerts and Notifications
- Data Retention and Cost Management
- Governance and Compliance
- Summary: A Dashboard That Actually Works
The Enterprise Problem: Visibility at Scale
Let's be honest. When you start with Claude Code, you're usually flying by the seat of your pants. A few teams experimenting, some basic usage notes, maybe a shared Slack channel. It works fine until you have 50 agents running concurrently, 200 concurrent users, and your bill is north of $10K per month. Now your CTO wants answers. Your finance team wants forecasts. And your ops team wants to know which agent just ate up 100K tokens in three minutes.
You've also probably had someone build a custom script to parse logs and calculate costs manually. And it broke. Then someone else built a spreadsheet. And then that spreadsheet became the source of truth, nobody updated it, and now your executives are making budget decisions based on data from three weeks ago. Sound familiar?
This is the visibility gap. It's not unique to Claude Code—every enterprise that scales API-driven systems hits this wall. But Claude Code's multi-tenant, multi-agent architecture makes the problem especially acute. You've got teams shipping agents independently. You've got agents spawning subagents. You've got cost models that vary by model type, by hour of day, by whether you're hitting cache. A single line-item dashboard won't cut it.
Without a dashboard, here's what you're left with:
- No aggregated cost visibility. You know your total bill, but you can't tell which teams, projects, or agents are responsible. Is the platform team eating half your budget? Is the data team's new research agent exploding costs? You can't tell.
- Blind spots in usage patterns. You don't know if some rogue agent is running in a loop, or if usage is just legitimately high. You spot a spike in your monthly bill and have zero way to drill down to the root cause.
- Audit nightmare. If something goes wrong, you're digging through logs manually to figure out what happened. Compliance teams hate you. Security reviews are painful.
- Resource allocation chaos. You can't make informed decisions about quota management, model selection, or team priorities without real data.
- Performance debugging hell. An agent is slow? A skill is broken? You're flying blind without aggregated telemetry.
- Chargeback math problems. If you want to bill teams for their actual usage, you're doing it in a spreadsheet. Which means arguments. Which means politics.
- Token burn surprises. You don't know which agents are consuming the most tokens, when consumption spikes, or whether you're hitting hidden limits.
The solution is a unified Claude Code dashboard—a system that pulls telemetry, audit, and usage data into one place and surfaces it through role-based views for different audiences.
Architecture: What We're Actually Building
Before we jump into UI components, let's clarify what we're building under the hood. The dashboard isn't monolithic. It's a data pipeline with clear separation of concerns.
The architectural philosophy here is crucial to understand: dashboards fail when they're built as monolithic systems where everything is tightly coupled. You build one dashboard that's good for CFOs. Then someone asks for a developer view. You bolt on another view. Then someone needs mobile. You bolt that on too. Before you know it, your dashboard is spaghetti code where changing one metric breaks three other views. Data pipelines that were supposed to be fast are now slow because every view is hitting the database directly for every metric.
The right approach—the one that scales—is treating the dashboard as a composition of independent systems that communicate through well-defined interfaces. Your data collection system doesn't know or care about visualizations. Your aggregation system doesn't know which users will query it. Your presentation layer doesn't know how data was collected. This separation of concerns means you can evolve each part independently. Replace your data collection system? As long as it produces the same output schema, everything downstream still works. Add a new visualization? It pulls from the same aggregated data sources as everyone else.
This is why we're separating collection, aggregation, and presentation. It's not just architecture; it's organizational scale strategy.
┌─────────────────────────────────────────────────────────┐
│                      DATA SOURCES                       │
├─────────────────────────────────────────────────────────┤
│ • Claude Code Telemetry API (usage, tokens, latency)    │
│ • Audit Logs API (agent invocations, state changes)     │
│ • Cost Tracking API (per-model pricing)                 │
│ • Agent Lifecycle Events (deployment, errors)           │
│ • Custom Business Metrics (completion rates, quality)   │
└────────────────────────┬────────────────────────────────┘
                         │
         ┌───────────────┴───────────────┐
         │                               │
         v                               v
┌──────────────────────┐      ┌────────────────────────┐
│  Data Aggregation    │      │  Real-Time Stream      │
│  (Batch ETL)         │      │  Processing            │
│  • PostgreSQL        │      │  • Apache Kafka        │
│  • Materialized Views│      │  • WebSocket Bridge    │
│  • Scheduled Jobs    │      │  • Real-time Updates   │
└──────────┬───────────┘      └───────────┬────────────┘
           │                              │
           └──────────────┬───────────────┘
                          │
                          v
            ┌─────────────────────────────┐
            │   Dashboard Backend API     │
            │     (GraphQL or REST)       │
            │  • Role-based queries       │
            │  • Real-time subscriptions  │
            │  • Historical drill-down    │
            └──────────────┬──────────────┘
                           │
            ┌──────────────┼──────────────┐
            │              │              │
            v              v              v
       ┌────────┐     ┌────────┐    ┌────────┐
       │ Web UI │     │ Mobile │    │  CLI   │
       │ Admin  │     │ Manager│    │  Dev   │
       └────────┘     └────────┘    └────────┘
The pipeline has three tiers: data collection, aggregation, and presentation. Each tier can scale independently, which matters when you're handling millions of events per day.
Data collection is straightforward. You're instrumenting Claude Code with telemetry emitters. Every agent invocation, every token consumed, every error thrown—that's a data point. You're not storing it at collection time; you're streaming it downstream. A streaming architecture decouples producers from consumers, so a slow downstream system delays processing instead of losing events, and each system works through the stream at whatever speed it can handle.
Aggregation happens on two time scales. Fast aggregation (real-time via Kafka) gives you live dashboards—this minute's usage, current error rate. Slow aggregation (batch jobs every hour) gives you high-quality data for analysis—daily cost breakdowns, trend analysis. Materialized views in your database let you query aggregated data instantly without recalculating. This separation is crucial: real-time data might be slightly stale (updated every few seconds), but historical data is accurate and complete.
Presentation is role-aware. An admin sees operational metrics and audit trails. A CFO sees costs and trends. A developer debugging their agent sees request logs and performance profiles. Same data, different windows into it.
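One concrete form of those well-defined interfaces between tiers is a shared event schema that every tier agrees on. A sketch in TypeScript (the field names are assumptions, chosen to line up with the collection examples later in this article):

```typescript
// The contract between collection, aggregation, and presentation.
// Swap out any tier and the others keep working as long as this holds.
interface TelemetryEvent {
  type:
    | "agent_invocation_start"
    | "agent_invocation_end"
    | "agent_invocation_error";
  agentId: string;
  teamId?: string;
  timestamp: number; // epoch milliseconds, assigned at emit time
  tokensUsed?: number;
  cost?: number; // USD
  duration?: number; // milliseconds
  error?: string;
  [extra: string]: unknown; // forward-compatible: unknown fields flow through
}

// Example event as a collector would emit it
const event: TelemetryEvent = {
  type: "agent_invocation_end",
  agentId: "agent-42",
  teamId: "platform",
  timestamp: Date.now(),
  tokensUsed: 1200,
  cost: 0.018,
  duration: 950,
};
```

The index signature is the important design choice: teams can attach custom fields without a schema migration, and aggregation simply ignores fields it doesn't know.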
Collecting Telemetry at Scale
Here's where it gets real. Collecting telemetry from hundreds of concurrent agents without melting your infrastructure is non-trivial. You need batching, buffering, and backpressure handling. The naive approach (send every event immediately) doesn't scale. You'll overwhelm your network, your storage, and your processing pipeline. You'll also introduce latency—each individual event has overhead.
Understanding why scale becomes hard here requires thinking about numbers. If you have 100 agents running concurrently and each agent calls Claude once per second, that's 100 events per second. Each event has overhead: serialization, network transmission, deserialization, database insert. If each event takes 10ms end-to-end, that's 1 second of processing time per second of actual time. You're at 100% utilization with no headroom. Add one more thing and you collapse. Now imagine 500 concurrent agents. That's 500 events per second, or 5 seconds of processing per second. You're completely underwater.
Batching solves this by grouping events together. Instead of sending one event at a time, you send fifty events in one batch. The overhead per event drops by 50x. Now 500 agents generating 500 events/sec requires only 0.1 seconds of actual work. You have headroom.
But batching creates a tradeoff: real-time visibility vs. efficiency. If you batch every 5 seconds, dashboard updates are delayed by up to 5 seconds. Is that acceptable? Usually yes: nobody reads a cost metric at sub-second granularity, so a 5-second delay is imperceptible in practice. But it's a conscious tradeoff.
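The arithmetic above is worth writing down. A back-of-envelope model (the numbers are the illustrative assumptions from the text, not measurements):

```typescript
// Throughput model for the batching tradeoff: seconds of processing work
// generated per second of wall-clock time. Above 1.0 you are underwater.
function workPerSecond(
  agents: number,
  eventsPerAgentPerSec: number,
  perEventOverheadMs: number,
  batchSize: number,
): number {
  const eventsPerSec = agents * eventsPerAgentPerSec;
  // Batching amortizes the fixed per-event overhead across the batch
  return (eventsPerSec * perEventOverheadMs) / batchSize / 1000;
}

console.log(workPerSecond(100, 1, 10, 1)); // 1 — at 100% utilization
console.log(workPerSecond(500, 1, 10, 1)); // 5 — completely underwater
console.log(workPerSecond(500, 1, 10, 50)); // 0.1 — batching restores headroom
```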
The backpressure handling is where things get sophisticated. What happens when your Kafka cluster is temporarily down? Do you drop events and hope you can catch the spike from metrics? Do you buffer them in memory and hope memory doesn't fill up? Do you write to disk? This is where you design for reliability. You want telemetry to be as close to guaranteed as possible, because telemetry is your only window into what's happening. If you drop events because you can't keep up, you've got blind spots. A spike in costs that would have been visible gets dropped, and later you're surprised when your bill is higher than expected.
Instead, implement a batching telemetry collector that buffers events and sends them in bulk. This is a foundational pattern that every mature monitoring system uses:
class TelemetryCollector {
private queue: TelemetryEvent[] = [];
private batchSize = 1000;
private flushInterval = 5000; // 5 seconds
constructor() {
// Flush on a timer as well, so a slow trickle of events still ships
// promptly even when the batch never fills
setInterval(() => this.flush(), this.flushInterval);
}
emit(event: TelemetryEvent) {
this.queue.push({
...event,
timestamp: Date.now(),
environment: process.env.ENVIRONMENT,
version: process.env.APP_VERSION,
});
if (this.queue.length >= this.batchSize) {
this.flush();
}
}
private async flush() {
if (this.queue.length === 0) return;
const batch = this.queue.splice(0, this.batchSize);
try {
// Send to Kafka topic for streaming processing
await kafkaProducer.send({
topic: "claude-code-telemetry",
messages: batch.map((event) => ({
key: event.agentId,
value: JSON.stringify(event),
})),
});
} catch (error) {
// Backpressure: if Kafka is down, keep events in memory
// (with limits to prevent memory exhaustion)
this.queue.unshift(...batch);
this.scheduleRetry();
}
}
}
// Usage in your agent
class ClaudeCodeAgent {
private telemetry = new TelemetryCollector();
async invoke(input: string): Promise<string> {
const start = Date.now();
try {
this.telemetry.emit({
type: "agent_invocation_start",
agentId: this.id,
teamId: this.teamId,
input: input.substring(0, 100), // Truncate large inputs
});
// run() is assumed to return the output text plus API usage metadata
const result = await this.run(input);
const elapsed = Date.now() - start;
this.telemetry.emit({
type: "agent_invocation_end",
agentId: this.id,
success: true,
duration: elapsed,
outputLength: result.text.length,
tokensUsed: result.usage.total_tokens,
cost: result.usage.total_tokens * COST_PER_TOKEN,
});
return result.text;
} catch (error) {
this.telemetry.emit({
type: "agent_invocation_error",
agentId: this.id,
error: error.message,
errorType: error.constructor.name,
});
throw error;
}
}
}

The key insight here: batch aggressively, but don't lose data. If your Kafka cluster goes down, you have limited in-memory buffering, but you never silently drop events. If you can't send them, you keep them and retry. This is how you prevent "the spike disappeared from the logs" situations that nobody can explain in post-mortems.
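The collector above calls a scheduleRetry it never defines, and its comment about memory limits deserves teeth. Here is a sketch of the bounded buffer plus backoff it implies (the class name, cap, and delays are assumptions, not a real API):

```typescript
// Hypothetical companion to TelemetryCollector: caps the retry buffer so a
// long Kafka outage can't exhaust memory, and backs off between attempts.
class BoundedRetryQueue<T> {
  private buffer: T[] = [];
  private attempt = 0;

  constructor(
    private maxBuffered = 50_000, // hard cap on buffered events
    private baseDelayMs = 1_000,
  ) {}

  // Re-queue a failed batch; returns how many OLD events were dropped.
  // Dropping oldest-first keeps the freshest data, and the drop count
  // should itself be emitted as a metric, never hidden.
  requeue(events: T[]): number {
    this.buffer.push(...events);
    const overflow = Math.max(0, this.buffer.length - this.maxBuffered);
    if (overflow > 0) this.buffer.splice(0, overflow);
    return overflow;
  }

  // Exponential backoff with a one-minute ceiling, so an extended outage
  // doesn't turn into a tight retry loop
  nextRetryDelay(): number {
    const delay = Math.min(this.baseDelayMs * 2 ** this.attempt, 60_000);
    this.attempt += 1;
    return delay;
  }

  // On a successful send, reset backoff and hand back everything buffered
  drain(): T[] {
    this.attempt = 0;
    return this.buffer.splice(0, this.buffer.length);
  }
}
```

The cap is the honest part of "as close to guaranteed as possible": when you must drop, you drop deliberately, oldest first, and you count it.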
Real-Time Updates Without Melting Your Infrastructure
Real-time dashboards are addictive. Teams love watching live metrics. But they're also resource-intensive. Every update to every dashboard user is a database query. With 50 concurrent users watching a dashboard, that's 50 queries per metric update. Per second. You'll DDoS yourself.
The solution is event-driven updates with smart filtering and throttling. Instead of querying the database for every update, subscribe to event streams and emit changes only when something significant changed.
class DashboardWebSocket {
private subscribers: Map<string, Set<string>> = new Map();
// key: metricName, value: set of client IDs subscribed
onSubscribe(clientId: string, metricName: string) {
if (!this.subscribers.has(metricName)) {
this.subscribers.set(metricName, new Set());
this.startAggregatingMetric(metricName);
}
this.subscribers.get(metricName)!.add(clientId);
}
private startAggregatingMetric(metricName: string) {
// Subscribe to Kafka topic
kafkaConsumer.subscribe({ topic: "claude-code-telemetry" });
// Aggregate in memory, only emit if changed significantly
const threshold = 0.01; // Minimum delta worth broadcasting; tune per metric
let lastValue: number | null = null;
let lastEmit = Date.now();
kafkaConsumer.on("message", (message) => {
const event = JSON.parse(message.value);
if (matches(event, metricName)) {
const newValue = this.calculateMetric(event);
// Only emit if: (1) value changed significantly, OR (2) 5+ seconds elapsed
const timeElapsed = Date.now() - lastEmit;
const valueChanged =
lastValue === null || Math.abs(newValue - lastValue) > threshold;
if (valueChanged || timeElapsed > 5000) {
this.broadcast(metricName, newValue);
lastValue = newValue;
lastEmit = Date.now();
}
}
});
}
private broadcast(metricName: string, value: any) {
const subscribers = this.subscribers.get(metricName) || new Set();
subscribers.forEach((clientId) => {
this.io.to(clientId).emit("metric_update", { metricName, value });
});
}
}

This is subtle but important. You're not emitting every event. You're aggregating and only emitting if something actually changed or if enough time has passed that stale data is worse than missing updates. This keeps your WebSocket traffic manageable while still feeling real-time. A 5-second delay is imperceptible to users but dramatically reduces load on your backend. Most people won't notice the difference between a metric updating every 100ms and every 5 seconds. But your infrastructure will appreciate the 50x reduction in WebSocket messages.
Building Role-Based Views
A unified dashboard is actually three dashboards: one for admins, one for finance, one for developers. Each needs different data, different aggregations, different insights.
Admin Dashboard shows operational health. Error rates, latency percentiles, token consumption, queue depths. Focuses on "what's going wrong?" Are agents crashing? Is there a performance regression? Is one model consistently slower? This is the on-call person's view. They need to spot problems fast. They need to drill down from "error rate is 5%" to "these specific agents are failing with these specific errors."
Finance Dashboard shows cost trends. Daily spend, cost by team, cost by model, forecasts. Focuses on "where's the money going?" If costs spike, this dashboard should make it obvious which team caused it. Not for blame—for understanding trends and making budget decisions intelligently. The finance team wants to see trending data, projections, anomalies.
Developer Dashboard shows request logs and traces. What did my agent do? Why did it fail? How fast did it run? This is personalized per user—they only see their own agents unless they're admins. This is where developers debug. They want to see the exact requests, responses, errors, and latencies for their specific agents.
// Middleware: inject role-based query filters
function withRoleFilter(req: Request, res: Response, next: NextFunction) {
const user = req.user;
const role = user.role;
const teamId = user.teamId;
// Define what each role can see
const filters = {
admin: {}, // See everything
manager: { teamId }, // See own team
developer: { teamId, createdBy: user.id }, // See own agents
};
req.queryFilter = filters[role] || filters.developer;
next();
}
// Example: cost endpoint
app.get("/api/costs", withRoleFilter, async (req, res) => {
// buildWhereClause must emit parameter placeholders and bind values via
// the driver; interpolating raw values here would be an injection vector
const costData = await db.query(
`
SELECT date, team_id, SUM(cost) as total_cost
FROM agent_costs
WHERE 1=1 ${buildWhereClause(req.queryFilter)}
GROUP BY date, team_id
ORDER BY date DESC
LIMIT 90
`,
);
res.json(costData);
});

The role-based filtering is simple but powerful. It means you can point the same dashboard UI at the same backend, and users automatically see the right data. An admin sees everything. A developer sees only their agents. A manager sees only their team. Same code, different data. This pattern scales: you can add more roles, adjust permissions, all without changing the dashboard code.
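One caveat: the endpoint leans on a buildWhereClause helper it never defines. A safe sketch (the helper name comes from the snippet; the allowlist and return shape are assumptions) that binds values as parameters instead of interpolating them:

```typescript
// Hypothetical implementation of buildWhereClause: maps allowlisted filter
// keys to real column names and emits $n placeholders plus a params array.
// Values never touch the SQL string, so role filters can't inject SQL.
const FILTER_COLUMNS: Record<string, string> = {
  teamId: "team_id",
  createdBy: "created_by",
};

function buildWhereClause(
  filter: Record<string, unknown>,
  startIndex = 1, // first free $n placeholder in the surrounding query
): { sql: string; params: unknown[] } {
  const clauses: string[] = [];
  const params: unknown[] = [];
  for (const [key, value] of Object.entries(filter)) {
    const column = FILTER_COLUMNS[key];
    if (!column) continue; // unknown keys are dropped, never interpolated
    params.push(value);
    clauses.push(`AND ${column} = $${startIndex + params.length - 1}`);
  }
  return { sql: clauses.join(" "), params };
}
```

The admin's empty filter produces an empty fragment, so the same query text serves every role; the caller appends `sql` after `WHERE 1=1` and passes `params` through to the database driver.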
Performance Optimization: Caching and Materialization
A dashboard hitting your database for every metric would be slow. You need caching, but cache invalidation is hard. The trick is using materialized views to precompute common queries: instead of computing aggregations on demand, they're already computed and waiting.
-- Materialized view: hourly cost by team
CREATE MATERIALIZED VIEW cost_by_team_hourly AS
SELECT
DATE_TRUNC('hour', timestamp) as hour,
team_id,
model,
SUM(tokens_used) as total_tokens,
SUM(cost) as total_cost,
COUNT(*) as invocation_count,
AVG(duration_ms) as avg_duration
FROM agent_invocations
WHERE timestamp > NOW() - INTERVAL '90 days'
GROUP BY DATE_TRUNC('hour', timestamp), team_id, model;
-- Refresh on a schedule (here hourly, via the pg_cron extension) rather
-- than a per-row trigger, which would re-run the whole aggregation on
-- every insert. CONCURRENTLY requires a unique index on the view.
SELECT cron.schedule(
'refresh-cost-by-team-hourly',
'0 * * * *',
'REFRESH MATERIALIZED VIEW CONCURRENTLY cost_by_team_hourly'
);

Now queries against this view are instant. No aggregation at query time. Just a table lookup. You can cache the results in Redis for another layer of speed:
async function getCostsByTeam(teamId: string) {
const cacheKey = `costs:team:${teamId}`;
const cached = await redis.get(cacheKey);
if (cached) return JSON.parse(cached);
const data = await db.query(
`SELECT * FROM cost_by_team_hourly WHERE team_id = $1`,
[teamId],
);
// Cache for 5 minutes
await redis.setex(cacheKey, 300, JSON.stringify(data));
return data;
}

This two-layer caching strategy (materialized view + Redis) means dashboard queries are almost always served from cache. Actual database hits are rare, which keeps load low and latency reasonable. A developer opens the dashboard, and metrics appear instantly because they're coming from Redis, not a database query.
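The getCostsByTeam pattern generalizes into a small cache-aside helper — a sketch assuming any client with ioredis-style get/setex (the helper name and interface are mine, not a library API):

```typescript
// Generic cache-aside wrapper for the two-layer pattern above: serve from
// Redis when possible, fall back to the loader (e.g. the materialized-view
// query) on a miss, then populate the cache with a TTL.
interface RedisLike {
  get(key: string): Promise<string | null>;
  setex(key: string, ttlSeconds: number, value: string): Promise<unknown>;
}

async function cached<T>(
  redis: RedisLike,
  key: string,
  ttlSeconds: number,
  loader: () => Promise<T>,
): Promise<T> {
  const hit = await redis.get(key);
  if (hit !== null) return JSON.parse(hit) as T;
  const value = await loader();
  await redis.setex(key, ttlSeconds, JSON.stringify(value));
  return value;
}
```

With this in place, getCostsByTeam collapses to a one-liner: cached(redis, cache key, 300, loader that queries the materialized view).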
Handling Scale: Connection Pooling and Message Brokers
When you've got hundreds of concurrent WebSocket connections plus database queries plus Kafka consumption, connection management becomes critical. You'll run out of database connections. Your Kafka lag will grow. Your infrastructure will catch fire without proper resource management.
// With connection pooling + message broker
const pool = new Pool({ max: 20 }); // 20 connections max
const redis = new Redis({ lazyConnect: true });
// Publish metrics once, deliver to many subscribers
eventBus.on("metrics-updated", (metrics) => {
redis.publish("metrics-channel", JSON.stringify(metrics));
});
// WebSocket clients subscribe via a shared Redis pub/sub consumer.
// A Redis connection in subscriber mode can't run normal commands, so
// duplicate() a dedicated subscriber and fan out to sockets through a
// room, rather than opening one subscription per client
const subscriber = redis.duplicate();
subscriber.subscribe("metrics-channel");
subscriber.on("message", (_channel, message) => {
io.to("metrics-room").emit("update", JSON.parse(message));
});
socket.on("subscribe:metrics", () => socket.join("metrics-room"));

Connection pooling limits how many database connections you can open. Message brokers like Redis let you publish once and deliver to many subscribers. This is the key to scaling: instead of running a fresh query for every client on every update, you publish to the broker and let subscribers consume at their own pace. One database query, fifty subscribers. The math works in your favor.
Common Pitfalls: What Goes Wrong
We've built a lot of these dashboards. Here are the gotchas we've learned the hard way:
1. Over-aggregating too early. Store raw data first. You'll always want to slice it differently tomorrow. Don't compute "cost by team per day" once and forget the granular data. Keep both. Your future self will thank you. You can always aggregate backwards; you can't un-aggregate.
2. Real-time noise. If you push every single event to WebSockets, your frontend will be hammering re-renders and your server will be melting. Aggregate client-side. Update charts at 1-second intervals, not 10 milliseconds. Throttle updates religiously. The frontend can't actually process updates faster than humans can perceive, so any faster than 1 second is waste.
3. Missing context in audit logs. When an agent fails, you need to know why. Store error messages, stack traces, input summaries. When a manager asks "Why did costs spike?", you need more than just a number. You need to see which agents changed behavior, what errors started occurring, which models started being used.
4. No user segmentation. An admin seeing cost data is different from a developer seeing traces. Don't show developers cost data they can't act on. Don't show managers debug traces. Role-based access control isn't optional—it's how you prevent information overload and confusion.
5. Forgetting about data retention. Detailed metrics explode in size. Plan for it. Set up automated cleanup. Archive old data to S3. Keep last 90 days detailed, last 2 years daily aggregates, older data in cold storage. Without a retention policy, your database grows infinitely and queries slow down.
6. Clock skew. When you have multiple servers inserting telemetry, their clocks might not agree. Use NTP. Or better, insert timestamps server-side in the database, not client-side. A 1-second clock skew can throw off your entire timeline.
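Pitfall 2 has a cheap fix on the client: a trailing-edge throttle that collapses a burst of updates into at most one render per interval while always painting the final value. A sketch (the function name and interval are assumptions):

```typescript
// Trailing-edge throttle for dashboard re-renders: during a burst, only
// one render is scheduled per interval, and it always receives the most
// recent value, so the chart never ends on stale data.
function throttleTrailing<T>(
  render: (latest: T) => void,
  intervalMs: number,
): (value: T) => void {
  let pending!: T;
  let timer: ReturnType<typeof setTimeout> | null = null;
  return (value: T) => {
    pending = value; // always remember the newest value
    if (timer !== null) return; // a render is already scheduled
    timer = setTimeout(() => {
      timer = null;
      render(pending);
    }, intervalMs);
  };
}
```

Wire incoming WebSocket metric messages through this with a 1000ms interval and the chart updates once per second no matter how fast events arrive.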
The Human Element: Building Dashboards People Actually Use
Here's what nobody talks about: a technically perfect dashboard that nobody looks at is worthless. We've built dozens of systems where the infrastructure was flawless, the data was accurate, the visualizations were beautiful, and yet teams didn't use them. Why? Usually because the dashboard forced you to think the way the dashboard architect thought, not the way you naturally think about your work.
Consider the admin who needs to investigate an operational incident. They don't think "let me first pull the agent metrics table, then join with cost metrics, then filter by time window." They think "something is broken, show me what changed." A dashboard that makes them follow your data schema is a dashboard they'll abandon in favor of grepping logs. A dashboard that shows "what changed in the last hour?" is a dashboard they'll rely on.
The practical lesson: involve the people who'll actually use the dashboard in its design. Spend time understanding how they think about their work. What questions do they ask repeatedly? What data are they currently gathering manually? What reports does your CTO spend 20 minutes generating every week that could be automated?
Making Dashboards Sticky
The most successful dashboards we've seen share common traits:
1. They answer questions faster than the alternative. If the CFO can get monthly costs faster by asking the vendor directly, your dashboard loses. It only wins if it's genuinely the fastest path to an answer, which usually means pre-loading the most common queries so they're instant.
2. They provide insights, not just data. Raw metrics are information. Patterns and anomalies are insights. Your dashboard should highlight unusual things automatically. "Cost is up 30% this month" is less useful than "cost is up 30% because the data team started using Opus instead of Sonnet—here's the breakdown."
3. They're accessible in context. A developer debugging an issue shouldn't have to context-switch to a different system to understand cost impact. The diagnostic view should show cost inline. Your admin shouldn't have to navigate through ten pages to find what they're looking for. Make the common path frictionless.
4. They update frequently enough to feel real-time. If your WebSocket updates are batched to 1-minute intervals, the dashboard feels stale. Aim for delays under 5 seconds for status information. Teams will check a dashboard that feels alive; they'll ignore one that feels historical.
5. They provide actionable next steps, not just information. When the dashboard shows high error rates, it should link to recent logs or suggest common causes. When costs spike, it should suggest which team to investigate. Each insight should have a clear next action.
Organizational Impact: Beyond the Technology
Building a dashboard changes more than your visibility—it changes your team dynamics. When cost data is transparent and real-time, teams start caring about efficiency. They see their costs and think "wait, why is my agent consuming twice as many tokens as my colleague's?" Now you've got natural peer pressure toward optimization without anyone needing to enforce it.
When audit logs are comprehensive and visible, teams think twice before taking shortcuts. Not because they're being watched, but because they know actions are recorded and reviewable. It creates a culture where behavior aligns with policy naturally.
When performance data is visible, you get healthy competition. "My agent averages 500ms, theirs averages 200ms—what are they doing differently?" This drives innovation bottom-up instead of management-down.
When errors are surfaced quickly, your MTTR (mean time to recovery) drops dramatically. Instead of waiting for users to report bugs, you catch them during development. Instead of learning about outages from Twitter, you know instantly when things break.
Advanced Dashboard Features: Anomaly Detection and Alerting
Once you have baseline data, add intelligent alerting that spots problems before they become crises. Anomaly detection isn't just about thresholds—it's about understanding normal patterns and flagging deviations:
// src/anomaly-detection.ts
interface AnomalyAlert {
type: "cost" | "error_rate" | "latency" | "tokens";
severity: "warning" | "critical";
metric: string;
currentValue: number;
baselineValue: number;
deviationPercent: number;
affectedTeams: string[];
timestamp: Date;
}
class AnomalyDetector {
private readonly stdDevThreshold = 2; // Flag if >2 std devs from mean
detectCostAnomaly(
currentCost: number,
historicalCosts: number[],
): AnomalyAlert | null {
const mean =
historicalCosts.reduce((a, b) => a + b, 0) / historicalCosts.length;
const variance =
historicalCosts.reduce((sum, cost) => sum + Math.pow(cost - mean, 2), 0) /
historicalCosts.length;
const stdDev = Math.sqrt(variance);
if (stdDev === 0) return null; // Flat history gives no basis for a z-score
const zScore = (currentCost - mean) / stdDev;
if (Math.abs(zScore) > this.stdDevThreshold) {
return {
type: "cost",
severity: zScore > 0 ? "critical" : "warning",
metric: "Daily spend",
currentValue: currentCost,
baselineValue: mean,
deviationPercent: ((currentCost - mean) / mean) * 100,
affectedTeams: [], // Populated by cost analysis
timestamp: new Date(),
};
}
return null;
}
detectErrorRateAnomaly(
currentErrorRate: number,
historicalRates: number[],
): AnomalyAlert | null {
const mean =
historicalRates.reduce((a, b) => a + b, 0) / historicalRates.length;
// Error rates have different sensitivity—even small increases matter
if (currentErrorRate > mean * 1.5) {
return {
type: "error_rate",
severity: currentErrorRate > mean * 3 ? "critical" : "warning",
metric: "Error rate",
currentValue: currentErrorRate,
baselineValue: mean,
deviationPercent: ((currentErrorRate - mean) / mean) * 100,
affectedTeams: [],
timestamp: new Date(),
};
}
return null;
}
}
// Integration with alerting system
async function evaluateAnomalies() {
const detector = new AnomalyDetector();
// Get current metrics
const currentCost = await getCurrentDayCost();
const historicalCosts = await getHistoricalDailyCosts(30); // Last 30 days
const costAnomaly = detector.detectCostAnomaly(currentCost, historicalCosts);
if (costAnomaly) {
// Send alerts
await sendSlackAlert(costAnomaly);
await sendPagerDutyAlert(costAnomaly);
await storeAnomalyRecord(costAnomaly);
}
}

Anomaly detection separates signal from noise. Instead of alerting on arbitrary thresholds ("alert if cost > $1000"), you alert on actual deviations from normal patterns ("alert if cost deviates >2 standard deviations from 30-day average"). This catches real problems while avoiding false positives.
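To make the 2-standard-deviation rule concrete, here is the same computation as a standalone worked example with invented numbers:

```typescript
// Standalone version of the z-score used by AnomalyDetector above.
function zScore(current: number, history: number[]): number {
  const mean = history.reduce((a, b) => a + b, 0) / history.length;
  const variance =
    history.reduce((sum, x) => sum + (x - mean) ** 2, 0) / history.length;
  return (current - mean) / Math.sqrt(variance);
}

// Daily spend hovering around $100, then a $300 day (numbers invented):
const history = [100, 100, 100, 100, 120, 80]; // mean = 100, stddev ≈ 11.5
console.log(zScore(300, history).toFixed(1)); // "17.3" → far past the 2.0 bar
console.log(zScore(110, history).toFixed(1)); // "0.9" → normal noise, no alert
```

Note how a $110 day stays well under the threshold even though it's 10% above the mean: the history's own variance sets the bar, not a hand-picked percentage.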
Interactive Drill-Down Capabilities
Enterprise dashboards need to let users drill down from summary metrics to underlying details. A CFO sees "spend is up 30%" and needs to drill down to see "the data team spent 20% more, specifically on Opus model usage." This requires building hierarchical data queries:
// src/drill-down.ts
interface DrillDownPath {
level: "organization" | "team" | "agent" | "invocation";
filters: Record<string, any>;
}
class DataDrilldown {
async getOrgSummary(): Promise<{
totalCost: number;
tokenUsage: number;
teams: Array<{ teamId: string; cost: number; tokens: number }>;
}> {
return db.query(`
SELECT
SUM(cost) as totalCost,
SUM(tokens_used) as tokenUsage,
team_id,
COUNT(*) as invocations
FROM agent_invocations
WHERE created_at >= NOW() - INTERVAL '30 days'
GROUP BY team_id
`);
}
async getTeamDetail(teamId: string): Promise<{
totalCost: number;
agents: Array<{
agentId: string;
cost: number;
tokens: number;
errors: number;
}>;
}> {
return db.query(
`
SELECT
agent_id,
SUM(cost) as totalCost,
SUM(tokens_used) as tokenUsage,
COUNT(CASE WHEN error IS NOT NULL THEN 1 END) as errors
FROM agent_invocations
WHERE team_id = $1 AND created_at >= NOW() - INTERVAL '30 days'
GROUP BY agent_id
`,
[teamId],
);
}
async getAgentDetail(agentId: string): Promise<{
invocations: Array<{
invocationId: string;
cost: number;
duration: number;
model: string;
status: string;
error?: string;
}>;
}> {
return db.query(
`
SELECT
id,
cost,
duration_ms,
model,
status,
error
FROM agent_invocations
WHERE agent_id = $1
ORDER BY created_at DESC
LIMIT 1000
`,
[agentId],
);
}
}
// API endpoints for drill-down
app.get("/api/summary", async (req, res) => {
const summary = await drilldown.getOrgSummary();
res.json(summary);
});
app.get("/api/teams/:teamId", async (req, res) => {
const detail = await drilldown.getTeamDetail(req.params.teamId);
res.json(detail);
});
app.get("/api/agents/:agentId", async (req, res) => {
const detail = await drilldown.getAgentDetail(req.params.agentId);
res.json(detail);
});

Each drill-down level shows appropriate detail. Organization level: teams. Team level: agents. Agent level: individual invocations. This progressive disclosure lets users find root causes without overwhelming them with data.
Building Custom Alerts and Notifications
Different stakeholders need different alerts. CFOs care about cost trends. Ops teams care about error rates. Developers care about their specific agents:
// src/alerting.ts
interface AlertRule {
  name: string;
  condition: (metrics: Metrics) => boolean;
  severity: "info" | "warning" | "critical";
  // Recipients are resolved against the metrics so team-scoped
  // alerts can target the team that triggered them.
  recipients: (metrics: Metrics) => AlertRecipient[];
  channels: ("slack" | "email" | "pagerduty")[];
}
interface AlertRecipient {
  role: "admin" | "team_manager" | "developer";
  teamId?: string;
  userId?: string;
}
const alertRules: AlertRule[] = [
  {
    name: "Daily spend spike",
    condition: (m) => m.dailyCost > m.averageDailyCost * 1.5,
    severity: "warning",
    recipients: () => [{ role: "admin" }, { role: "team_manager" }],
    channels: ["slack", "email"],
  },
  {
    name: "High error rate",
    condition: (m) => m.errorRate > 0.05, // >5% errors
    severity: "critical",
    recipients: () => [{ role: "admin" }],
    channels: ["slack", "pagerduty"],
  },
  {
    name: "Agent timeout",
    condition: (m) => m.p99Latency > 30000, // >30 seconds
    severity: "warning",
    recipients: (m) => [{ role: "developer", teamId: m.teamId }],
    channels: ["slack"],
  },
];
async function evaluateAlerts(metrics: Metrics) {
for (const rule of alertRules) {
if (rule.condition(metrics)) {
await sendAlert(rule, metrics);
}
}
}
async function sendAlert(rule: AlertRule, metrics: Metrics) {
for (const channel of rule.channels) {
if (channel === "slack") {
await sendSlackAlert(rule, metrics);
} else if (channel === "email") {
await sendEmailAlert(rule, metrics);
} else if (channel === "pagerduty") {
await sendPagerDutyIncident(rule, metrics);
}
}
}

Good alerting is targeted and actionable. Don't send everyone every alert. Send CFO cost alerts to the CFO and ops team. Send developer alerts to developers. Each alert should answer: "Why should I care about this?"
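Targeted alerts also need deduplication: a sustained condition shouldn't page the same people on every evaluation cycle. A minimal cooldown sketch (the `AlertDeduper` class is a hypothetical addition, not part of the alerting code above) could gate `sendAlert`:

```typescript
// Suppress repeats of the same rule within a cooldown window.
// evaluateAlerts would call shouldFire() before sendAlert().
class AlertDeduper {
  private lastFired = new Map<string, number>();

  constructor(private cooldownMs: number) {}

  shouldFire(ruleName: string, now: number = Date.now()): boolean {
    const last = this.lastFired.get(ruleName);
    if (last !== undefined && now - last < this.cooldownMs) {
      return false; // still cooling down; stay quiet
    }
    this.lastFired.set(ruleName, now);
    return true;
  }
}
```

Per-rule cooldowns (shorter for critical alerts, longer for informational ones) are a natural next step; the single-window version keeps the idea visible.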
Data Retention and Cost Management
A dashboard that stores terabytes of detailed telemetry becomes expensive. Implement intelligent data retention:
// src/data-retention.ts
import { gzipSync } from "zlib";
import cron from "node-cron";

async function manageDataRetention() {
  // Drop detailed invocation rows after 30 days; the daily_aggregates
  // rollup keeps the summary view alive beyond that window.
  await db.query(`
    DELETE FROM agent_invocations
    WHERE created_at < NOW() - INTERVAL '30 days'
  `);
  // Archive aggregates older than 90 days to S3 cold storage...
  const { rows } = await db.query(`
    SELECT * FROM daily_aggregates
    WHERE date < NOW() - INTERVAL '90 days'
  `);
  if (rows.length > 0) {
    await s3.putObject({
      Bucket: "dashboard-archives",
      Key: `aggregates/${new Date().toISOString()}.json.gz`,
      Body: gzipSync(JSON.stringify(rows)),
    });
    // ...and only delete from hot storage once the archive succeeded
    await db.query(`
      DELETE FROM daily_aggregates
      WHERE date < NOW() - INTERVAL '90 days'
    `);
  }
}
// Run the retention job nightly (here via node-cron)
cron.schedule("0 3 * * *", manageDataRetention);

Tiered storage keeps recent data hot (fast queries) while archiving older data to cold storage (slow but cheap). This reduces database size and costs while maintaining historical access.
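The retention job above assumes a `daily_aggregates` table that some rollup process populates. In production that rollup would be a SQL `INSERT ... SELECT` over `agent_invocations`; as a sketch, here is the same grouping logic as a pure function (the `Invocation` and `DailyAggregate` shapes are assumptions matching the schema used throughout):

```typescript
interface Invocation {
  teamId: string;
  createdAt: Date;
  cost: number;
  tokens: number;
}
interface DailyAggregate {
  date: string; // YYYY-MM-DD
  teamId: string;
  cost: number;
  tokens: number;
  count: number;
}

// Group raw invocations by (day, team) and sum cost/token usage --
// the shape the drill-down and retention queries rely on.
function rollupDaily(invocations: Invocation[]): DailyAggregate[] {
  const byKey = new Map<string, DailyAggregate>();
  for (const inv of invocations) {
    const date = inv.createdAt.toISOString().slice(0, 10);
    const key = `${date}:${inv.teamId}`;
    const agg =
      byKey.get(key) ?? { date, teamId: inv.teamId, cost: 0, tokens: 0, count: 0 };
    agg.cost += inv.cost;
    agg.tokens += inv.tokens;
    agg.count += 1;
    byKey.set(key, agg);
  }
  return [...byKey.values()];
}
```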
Governance and Compliance
Enterprise systems need audit trails and compliance reporting:
// src/compliance.ts
interface AuditLog {
timestamp: Date;
userId: string;
action: string;
resource: string;
changes: Record<string, any>;
ipAddress: string;
userAgent: string;
}
async function logAuditEvent(log: AuditLog) {
await auditDb.insert("audit_logs", log);
}
// Track who accessed what, when
async function trackDashboardAccess(req: Request, userId: string, teamId: string) {
  await logAuditEvent({
    timestamp: new Date(),
    userId,
    action: "dashboard_access",
    resource: `team:${teamId}`,
    changes: {},
    ipAddress: req.ip,
    userAgent: req.get("user-agent") ?? "unknown",
  });
}
// Generate compliance reports
async function generateAccessReport(startDate: Date, endDate: Date) {
  const { rows } = await auditDb.query(
    `
    SELECT user_id, action, resource, COUNT(*) as count
    FROM audit_logs
    WHERE timestamp BETWEEN $1 AND $2
    GROUP BY user_id, action, resource
    `,
    [startDate, endDate],
  );
  return {
    period: { start: startDate, end: endDate },
    accessPatterns: rows,
    generatedAt: new Date(),
  };
}

Audit logs provide visibility into who accessed what. This satisfies compliance requirements and helps investigate security incidents.
Summary: A Dashboard That Actually Works
Building an enterprise Claude Code dashboard isn't rocket science, but it does require thinking about data flow, visualization, and real-time updates. Here's what you need:
- Data pipeline: Telemetry collector → Time-series database → Historical analytics
- Real-time stream: Event bus → WebSocket subscriptions → Live UI updates
- Role-based views: Admins see operational health, managers see costs and trends, developers see traces and diagnostics
- Anomaly detection: Identify problems before they impact users
- Drill-down capability: Navigate from summary to root cause
- Intelligent alerting: Notify the right people about the right issues
- Data retention: Keep recent data hot, archive old data
- Audit trails: Track who did what and when
- Scaling from day one: Use TimescaleDB, materialized views, connection pooling, and throttled WebSocket updates
Start with the fundamentals: basic telemetry collection and cost tracking. Get that working reliably. Then layer on real-time updates and role-specific views as you go. Add anomaly detection. Add drill-down. Add alerting. The dashboard will evolve with your needs.
The best dashboards aren't the ones with the most features—they're the ones that answer the questions your organization actually asks. Talk to your users. Understand their workflows. Build dashboards that fit into their daily work, not dashboards that require them to change how they work.
And when your CTO asks "How much is this costing us?"—you'll have an answer. Not a spreadsheet, not a guess. An actual number with a breakdown by team, agent, model, and time of day. When your ops team needs to investigate an incident, they'll drill down and find the root cause in seconds, not hours. When your team members make infrastructure decisions, they'll have real data about cost and impact. That's the power of a real dashboard.