Prometheus and Grafana for ML Infrastructure Metrics
So you've got your machine learning models in production, they're serving predictions, and life is good - until it isn't. Suddenly, inference times spike. GPU memory fills up. Models start predicting garbage. You're scrambling to understand what went wrong, wishing you had actual numbers instead of guesses.
This is where Prometheus and Grafana come in. We're going to walk through building a comprehensive observability system for ML infrastructure - one that gives you real-time visibility into everything from custom model KPIs to raw GPU metrics.
The infrastructure you're about to build is the same foundation used by production ML teams at major companies. It's battle-tested, proven at scale, and gives you the visibility to catch problems before they cascade into customer-facing issues.
Table of Contents
- Why You Need Metrics for ML Infrastructure
- Why This Matters More Than You Think
- Part 1: Instrumenting Custom ML Metrics with prometheus_client
- Setting Up prometheus_client
- Understanding Why Each Metric Type Matters
- Model-Level KPIs
- Part 2: GPU Metrics with DCGM Exporter
- Why You Can't Ignore GPU Metrics
- Deploying DCGM Exporter
- Key GPU Metrics to Track
- Part 3: Prometheus Configuration
- Recording Rules: Pre-compute Expensive Queries
- Part 4: Building Grafana Dashboards
- Dashboard JSON (Importable)
- Part 5: Alerting with AlertManager
- Production Considerations
- Metric Cardinality Planning
- Scrape Interval vs. Storage
- Recording Rules Strategy
- GPU Monitoring Gotchas
- Common Pitfalls and Troubleshooting
- "My dashboards are slow"
- "Alert fatigue - too many alerts"
- "GPU metrics aren't showing up"
- "Accuracy metric won't update"
- "Latency is spiky"
- The Complete Picture
- Decision Framework: When to Use What
- Best Practices Summary
- Summary
Why You Need Metrics for ML Infrastructure
Let's be honest: ML systems are different from typical applications. A service might return an HTTP 500 and you'll catch it immediately. But your model can return perfectly valid predictions while silently degrading in accuracy. Your batch job can process data slower and slower without throwing errors.
That's why standard application metrics aren't enough. You need:
- Custom business metrics: How accurate are your predictions? How long does inference take at the p99?
- Infrastructure metrics: GPU utilization, memory pressure, thermal throttling
- System metrics: Queue depths, throughput, error rates
- Training metrics: Loss curves, validation performance, data quality issues
Prometheus gives you a time-series database and scraping framework. Grafana lets you visualize it. Together, they form the backbone of ML observability.
Why This Matters More Than You Think
Here's the reality: your model didn't crash, so your monitoring probably shows "everything is fine." But your model's F1 score dropped from 0.92 to 0.88 over the last two weeks - a silent catastrophe. Your inference latency crept from 50ms to 200ms because you never noticed the GPU memory fragmentation problem. Your batch job that was taking 2 hours is now taking 6, but it's not failing, so nobody knew until customers complained about stale predictions.
Prometheus and Grafana solve this by creating a single source of truth for everything that matters: infrastructure health, throughput, latency percentiles, and - most importantly - your model's actual business metrics. Once you instrument properly, you can correlate infrastructure problems with model performance problems. You'll see that accuracy dropped exactly when a new batch of training data came in. You'll notice latency spiked right when GPU utilization hit 95%.
This isn't theoretical. Teams that use this approach catch issues in hours instead of days. Issues that might have cost thousands in lost revenue or customer trust get fixed before they escalate. This is the leverage of proper observability.
Part 1: Instrumenting Custom ML Metrics with prometheus_client
Your ML workload is unique, so you'll instrument it with custom metrics. The Python prometheus_client library makes this straightforward.
When you're building observability for machine learning systems, the generic application metrics you're used to - response time, error count, memory usage - aren't enough. They tell you whether your infrastructure is working, but they don't tell you whether your model is working correctly. This is the fundamental blindness that catches teams by surprise. Your model's inference server might be responding in 50 milliseconds with a 99.9% success rate, the infrastructure looks perfect, and your model is silently degrading by 5% accuracy per week because you're not tracking the metrics that actually matter.
This is where custom metrics come in. You need to instrument your inference code to capture what's happening at the model level. Are you processing requests? How long are they taking? Are predictions getting cached or computed? When was the last time you retrained? What's the current distribution of predicted classes? These questions point to metrics you need to track.
The beautiful part about Prometheus is that it's been designed from the ground up with exactly this use case in mind. It's not bolted on. It's native. You write code that emits metrics, Prometheus scrapes those metrics on a schedule, and you query them whenever you need answers. No custom collection infrastructure, no database schema design, no schema migrations.
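For concreteness, here's what Prometheus actually sees when it scrapes a metrics endpoint - the plain-text exposition format. The metric names match the examples in this post; the sample values are illustrative:

```
# HELP ml_inferences_total Total number of inferences
# TYPE ml_inferences_total counter
ml_inferences_total{model_name="fraud_detector",endpoint="v1_predict"} 18423.0
# HELP ml_batch_queue_depth Current number of items in batch queue
# TYPE ml_batch_queue_depth gauge
ml_batch_queue_depth 12.0
```

Every scrape is just an HTTP GET that returns this text, which is why instrumenting a service adds almost no operational surface area.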
Setting Up prometheus_client
```python
from prometheus_client import Counter, Histogram, Gauge, CollectorRegistry
from prometheus_client import start_http_server
import time

# Create a dedicated registry for your metrics
registry = CollectorRegistry()

# Counter: increment for each inference
inference_counter = Counter(
    'ml_inferences_total',
    'Total number of inferences',
    ['model_name', 'endpoint'],
    registry=registry
)

# Histogram: track inference latency with buckets
inference_latency = Histogram(
    'ml_inference_latency_seconds',
    'Inference latency in seconds',
    ['model_name'],
    buckets=(0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0),
    registry=registry
)

# Gauge: current queue depth
batch_queue_depth = Gauge(
    'ml_batch_queue_depth',
    'Current number of items in batch queue',
    registry=registry
)

# Start the metrics HTTP server on port 8000
start_http_server(8000, registry=registry)
```

The setup code here is doing something important. We're creating a dedicated CollectorRegistry rather than using the default global one - this prevents metric collisions if you're running multiple services in the same process. Each metric type serves a distinct purpose: Counter for things that only go up (inference count), Histogram for latency distributions, and Gauge for values that fluctuate (queue depth). The buckets parameter in the histogram is critical - we're defining boundaries at 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, and 1000ms. These boundaries matter because Prometheus will later ask "what fraction of requests were faster than 100ms?" and these buckets provide the answer.
Now instrument your inference code:
```python
def run_inference(input_data, model_name="production_model"):
    """Run inference and record metrics."""
    start = time.time()
    try:
        # Your actual inference logic
        prediction = model.predict(input_data)
        # Record success
        inference_counter.labels(
            model_name=model_name,
            endpoint="v1_predict"
        ).inc()
    finally:
        # Always record latency, even on failure
        duration = time.time() - start
        inference_latency.labels(model_name=model_name).observe(duration)
    return prediction
```

This pattern is crucial. Notice we're using a try/finally block - we always record the latency, even if inference fails. Why? Because if your model crashes on 50% of requests, but you only record latency for successful requests, your dashboard will show 50ms average latency when the truth is "your model is broken half the time." The finally block ensures you capture failures. Also notice we're labeling metrics with model_name and endpoint - this lets us answer questions like "what's the latency for model X versus model Y?" without writing separate metrics.
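If you instrument many functions, the try/finally timing pattern is worth factoring into a decorator. Here's a stdlib-only sketch - in practice you'd pass something like `inference_latency.labels(model_name=...).observe` as the `observe` callback; the names here are illustrative:

```python
import time
from functools import wraps

def timed(observe):
    """Decorator: report duration via observe(seconds), even when the call fails."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # Runs on success AND on exception, so failures still count
                observe(time.perf_counter() - start)
        return wrapper
    return decorator

durations = []

@timed(durations.append)
def flaky_inference():
    raise RuntimeError("model crashed")

try:
    flaky_inference()
except RuntimeError:
    pass

print(len(durations))  # 1 - the latency was recorded despite the failure
```

The decorator keeps the "record latency no matter what" guarantee in one place instead of repeating it in every handler.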
Understanding Why Each Metric Type Matters
Understanding the different metric types is not just about picking the right one for a given situation - it's about understanding how these primitive building blocks let you answer complex questions about your system's behavior over time.
Counters capture volume and monotonic growth. They only increase, and they reset on service restart. If your inference counter shows it's gone up by 1000 in the last hour, you know you processed 1000 requests. This is your baseline - it tells you "we were running" and gives you throughput when you divide by time. But counters are more powerful than they first appear. Because Prometheus stores the raw counter value plus the rate of change over time, you can reconstruct not just "how many requests" but "how many requests per minute" and detect when your throughput is dropping or spiking. A sudden drop in counter growth rate is often the first sign that something is wrong with your pipeline, even before errors start appearing.
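To see why counter values plus timestamps are enough to recover throughput, here's a simplified sketch of the rate computation Prometheus performs - real `rate()` also extrapolates to the window boundaries, but this shows the core idea, including the counter-reset handling:

```python
def counter_rate(prev_value, curr_value, interval_seconds):
    """Per-second rate between two counter samples, tolerating a restart."""
    delta = curr_value - prev_value
    if delta < 0:
        # Counter reset (service restarted and the counter began again at zero):
        # the current value is the growth since the reset.
        delta = curr_value
    return delta / interval_seconds

print(counter_rate(1000, 1300, 60))  # steady traffic: 5.0 req/s
print(counter_rate(1300, 120, 60))   # restart mid-window: 2.0 req/s
```

This is also why you never graph a raw counter - you graph its rate, and resets come out in the wash.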
Histograms capture distributions, not just central tendency. If you only tracked average latency, you'd miss that 1% of requests take 5 seconds - that's the 99th percentile, and your customers experience it regularly. Your users don't care about the median. They care about the slowest requests because those are the ones that frustrate them. Histograms with buckets solve this by binning latencies into ranges. When Prometheus scrapes your metrics endpoint, it gets counts for each bucket: how many requests fell between 0 and 10 milliseconds, how many between 10 and 25, and so on. Later, you can ask Prometheus to compute the p99 latency from those buckets, and it uses interpolation to give you an accurate answer. The elegance here is that by storing counts per bucket, Prometheus avoids storing individual latency samples, which would explode your storage requirements. You get percentile tracking at constant storage cost.
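The bucket-to-percentile step is easy to demystify. Here's a minimal sketch of the linear interpolation that `histogram_quantile` performs over cumulative bucket counts - the bucket boundaries match the Histogram defined earlier, and the counts are made up for illustration:

```python
def quantile_from_buckets(q, buckets):
    """buckets: [(upper_bound_seconds, cumulative_count)], sorted ascending."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Linearly interpolate the quantile's position inside this bucket
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 1000 requests binned into (a subset of) the histogram's buckets
buckets = [(0.01, 200), (0.025, 600), (0.05, 900), (0.1, 990), (0.25, 1000)]
print(quantile_from_buckets(0.99, buckets))  # p99 lands at the 0.1s boundary
```

Note the tradeoff this implies: the answer is only as precise as your bucket boundaries, which is why choosing buckets around your SLO thresholds matters.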
Gauges are snapshots of instantaneous values. A gauge represents what something is right now. Queue depth is a gauge because it changes constantly - it goes up when messages arrive, down when workers process them. Unlike counters that only increase, gauges can go down. When you're monitoring GPU memory usage or the number of models loaded in memory, you're using gauges. They answer the question "what's the current state?" rather than "what's the total accumulation?"
Model-Level KPIs
Beyond infrastructure metrics, track what actually matters for your model:
```python
# Gauge for current model accuracy (update periodically)
model_accuracy = Gauge(
    'ml_model_accuracy',
    'Current model accuracy on validation set',
    ['model_name', 'version'],
    registry=registry
)

# Counter for prediction errors or anomalies
prediction_errors = Counter(
    'ml_prediction_errors_total',
    'Predictions flagged as anomalies',
    ['model_name', 'error_type'],
    registry=registry
)

# Update periodically (e.g., hourly from a validation job)
model_accuracy.labels(
    model_name="fraud_detector",
    version="v2.3.1"
).set(0.945)
```

This is where most teams get it wrong. They instrument infrastructure metrics religiously but skip the business metrics. Your accuracy gauge should update every hour from a validation job that evaluates your model on a holdout test set. Your error counter should increment whenever you detect a prediction that's obviously wrong (e.g., a fraud detector flagging legitimate transactions).
The key difference: this isn't just monitoring "is the system up?" - it's monitoring "is the system working correctly?" You want to see accuracy trends over days and weeks. When you see accuracy drop from 94% to 92%, you want to know immediately, because that's probably a harbinger of worse problems coming.
Part 2: GPU Metrics with DCGM Exporter
Your models run on GPUs, but the Prometheus client library doesn't give you GPU metrics directly. That's where NVIDIA's DCGM (Data Center GPU Manager) exporter comes in. This is where observability gets serious, because GPU metrics reveal problems that are completely invisible to application-level monitoring.
Why You Can't Ignore GPU Metrics
People think GPU monitoring is a nice-to-have optimization. It's not. It's essential infrastructure, and the difference between a system that works and a system that breaks down mysteriously lies in GPU-level visibility.
Start with memory fragmentation, which is the silent killer of GPU infrastructure. Your GPU has 40 gigabytes of total memory. You're using 35 gigabytes, and on paper, you have 5 gigabytes free. But memory is fragmented. The largest contiguous block you can allocate is 2 gigabytes because the 35 gigabytes in use is scattered across the memory space in fragments. Now your next inference request needs a 4-gigabyte allocation for activations. It fails. There's no error message that makes this obvious. You don't get a nice "out of memory" exception. You get a segmentation fault buried deep in CUDA kernel code. From the application's perspective, the GPU just crashed silently. Without DCGM metrics showing you the memory fragmentation pattern, you'll spend days debugging, thinking your model code has a memory leak.
Then there's thermal throttling, which is even more insidious. Your GPU was originally running at 2.0 gigahertz. Now it's overheating because your data center's cooling isn't keeping up with the power draw from eight GPUs running simultaneously. The GPU firmware automatically throttles down to 1.2 gigahertz to keep temperatures manageable. Your inference latency climbs from 50 milliseconds to 200 milliseconds. You dig into your code, profile everything, and find nothing wrong. The problem is that you're not looking at GPU clock speed. Without DCGM metrics, you'd never know your hardware is throttling.
Finally, consider XID errors, which are GPU-level exceptions that the GPU firmware handles internally. An XID error happens when something goes wrong deep in the GPU execution pipeline - maybe a kernel timeout, maybe a hardware error that's correctable but still shouldn't happen. The GPU firmware logs it and keeps running. Your Kubernetes cluster doesn't notice because the pod didn't crash. Your application doesn't notice because the GPU kept running. But every 50th request silently produces incorrect results inside the GPU. Without DCGM metrics showing you the XID error rate, you'll have silent failures in production and no way to know about them until your metrics start degrading and you can't figure out why.
Without DCGM metrics, you're flying blind. With them, you can see memory pressure in real time, spot thermal issues before they cascade into latency spikes that affect your SLA, and catch XID errors before they affect production accuracy.
Deploying DCGM Exporter
The DCGM exporter runs as a DaemonSet in your Kubernetes cluster, collecting metrics from every GPU on every node.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dcgm-exporter-config
  namespace: monitoring
data:
  default_counters.csv: |
    DCGM_FI_DEV_SM_CLOCK,1000,0
    DCGM_FI_DEV_GPU_CLOCK,0,1000
    DCGM_FI_DEV_MEMORY_CLOCK,0,1000
    DCGM_FI_DEV_GPU_UTIL,1000,0
    DCGM_FI_DEV_FB_USED,1000,0
    DCGM_FI_DEV_FB_FREE,1000,0
    DCGM_FI_DEV_GPU_TEMP,1000,0
    DCGM_FI_DEV_POWER_USAGE,1000,0
    DCGM_FI_DEV_PCIE_REPLAY_COUNTER,1000,0
    DCGM_FI_DEV_XID_ERRORS,1000,0
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      nodeSelector:
        accelerator: nvidia-gpu
      tolerations:
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: Exists
      containers:
        - name: dcgm-exporter
          image: nvidia/dcgm-exporter:3.1.2-3.1.4-ubuntu20.04
          ports:
            - containerPort: 9400
              name: metrics
          securityContext:
            privileged: true
          volumeMounts:
            - name: pod-resources
              mountPath: /var/lib/kubelet/pod-resources
            - name: config
              mountPath: /etc/dcgm-exporter
          env:
            - name: DCGM_EXPORTER_INTERVAL
              value: "30000"
            - name: DCGM_EXPORTER_KUBERNETES
              value: "true"
      volumes:
        - name: pod-resources
          hostPath:
            path: /var/lib/kubelet/pod-resources
        - name: config
          configMap:
            name: dcgm-exporter-config
---
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    app: dcgm-exporter
  ports:
    - port: 9400
      targetPort: 9400
```

The ConfigMap defines which metrics DCGM should export. Each line represents a metric: the first column is the metric ID, and the subsequent numbers define measurement intervals. The DaemonSet ensures every node with a GPU label gets an exporter instance. The privileged: true security context is required - DCGM needs low-level GPU access. The volume mount to /var/lib/kubelet/pod-resources is crucial: it lets DCGM map GPU usage back to specific pods and containers.
Key GPU Metrics to Track
Once deployed, you'll collect a range of metrics that tell you the complete story of your GPU health. Each metric answers a specific question about what's happening on your hardware.
The first metric you need is SM utilization, which tells you if the GPU is actually doing work. Streaming multiprocessor utilization ranges from 0 to 100 percent. A busy model inference should show 80 to 95 percent here. If you're seeing 20 percent, your GPU is sitting idle waiting for data, which means your data loading pipeline is your bottleneck, not the GPU itself. This metric immediately tells you whether to optimize GPU kernels or optimize data loading.
GPU memory tracking requires two metrics working together. Framebuffer memory used tells you how much of your GPU's memory is currently allocated. Framebuffer memory free tells you how much is available. Together, they let you calculate memory pressure. When memory used is at 95 percent or higher, you're walking a tightrope. You have almost no headroom for dynamic allocations, and memory fragmentation becomes critical. Watch the free memory trend over time. If it's consistently declining as your system runs, fragmentation is happening.
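The two framebuffer metrics combine into the memory-pressure percentage used throughout this post. As a quick sanity check on the arithmetic (the GB values are illustrative):

```python
def memory_used_percent(fb_used, fb_free):
    """Same formula as the gpu:memory_used_percent:5m recording rule."""
    return fb_used / (fb_used + fb_free) * 100

print(memory_used_percent(38.0, 2.0))  # 95.0 - danger zone, almost no headroom
```

Computing the percentage from used and free (rather than against a hard-coded total) means the same expression works across GPU models with different memory sizes.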
Power consumption is a window into thermal health. Your GPU draws a certain amount of power when running at full clock speed. If you see power consumption suddenly drop while utilization stays high, thermal throttling is almost certainly happening. The firmware detected high temperatures and dialed down the clock speed to reduce heat output. This is your warning sign to investigate cooling or to reduce the workload on that GPU.
Finally, XID errors are critical. This counter should never increase in a healthy system. Any increase means the GPU experienced an internal error. Even if the GPU recovered, you've got a reliability problem. Set your alert threshold to any increase at all. These metrics are automatically tagged with GPU, node, and other Kubernetes labels so you can drill down by node or GPU and understand which specific hardware is having problems.
Part 3: Prometheus Configuration
At this point, you've got metrics being emitted from your inference code and from the DCGM exporter. Now you need to tell Prometheus where to find these metrics and how often to collect them. This is where the configuration tells the story of how you're thinking about observability.
Wire up Prometheus to scrape your application metrics and the DCGM exporter. Prometheus works by periodically connecting to exposed endpoints and pulling metric data. This is a fundamentally different approach from pushing metrics to a collection point, and it matters because Prometheus can decide when and how often to scrape without your application code needing to know anything about it. You just expose metrics, Prometheus finds them, collects them, and stores them.
```yaml
global:
  scrape_interval: 30s
  evaluation_interval: 30s

scrape_configs:
  - job_name: "ml-inference"
    static_configs:
      - targets: ["localhost:8000"]
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance

  - job_name: "dcgm-exporter"
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - monitoring
    relabel_configs:
      - source_labels: [__meta_kubernetes_endpoints_name]
        regex: dcgm-exporter
        action: keep
```

The scrape_interval: 30s tells Prometheus to pull metrics from all targets every 30 seconds. This is a good balance - faster means more storage usage, slower means less granular dashboards. For ML workloads, 30 seconds is usually right because inference latencies are typically in the 50-500ms range, and you want enough data points to calculate accurate percentiles. The static config for ml-inference assumes your inference service is running locally; adjust the target as needed. The Kubernetes service discovery for DCGM automatically finds all DCGM exporter endpoints in the monitoring namespace.
Recording Rules: Pre-compute Expensive Queries
Recording rules represent a fundamental shift in how you think about metrics. Instead of computing complex queries on demand when someone looks at a dashboard, you precompute expensive queries automatically at regular intervals and store the results as new metrics. This is the difference between a system that's responsive and a system that bogs down when multiple people are looking at dashboards simultaneously.
Recording rules run expensive PromQL queries at scrape time and store results. This saves compute and makes dashboards fast. The concept is elegantly simple but profound in its impact. When you have a dashboard showing percentile latencies to fifty people simultaneously, without recording rules, Prometheus would compute that percentile calculation from raw data fifty times in parallel. With recording rules, it computes once per interval and stores the answer, and everyone just reads the precomputed result. The mathematics is identical, but the infrastructure load is completely different.
```yaml
groups:
  - name: ml_recording_rules
    interval: 30s
    rules:
      # Latency percentiles per model
      - record: ml:inference_latency_p50:5m
        expr: histogram_quantile(0.50, rate(ml_inference_latency_seconds_bucket[5m]))
      - record: ml:inference_latency_p95:5m
        expr: histogram_quantile(0.95, rate(ml_inference_latency_seconds_bucket[5m]))
      - record: ml:inference_latency_p99:5m
        expr: histogram_quantile(0.99, rate(ml_inference_latency_seconds_bucket[5m]))
      # GPU utilization by node
      - record: gpu:utilization:avg_by_node:5m
        expr: avg(dcgm_sm_utilization) by (node)
      # GPU memory pressure
      - record: gpu:memory_used_percent:5m
        expr: |
          (dcgm_fb_used_bytes / (dcgm_fb_used_bytes + dcgm_fb_free_bytes)) * 100
      # Throughput: inferences per second
      - record: ml:inference_throughput:5m
        expr: rate(ml_inferences_total[5m])
```

Here's why recording rules matter: calculating the p99 latency from raw histogram buckets requires reading hundreds or thousands of data points and computing quantiles. If you run this query on a dashboard that refreshes every 5 seconds, you'll compute it hundreds of times per minute. That's wasteful. Recording rules solve this by pre-computing the answer every 30 seconds and storing the result. When your dashboard queries ml:inference_latency_p99:5m, it's just reading a single pre-computed value instead of doing expensive calculations.
The naming convention here (ml: prefix, :5m suffix) makes it clear what these metrics are: recorded metrics over a 5-minute window. This is a best practice that keeps your metric namespace organized.
Part 4: Building Grafana Dashboards
Up to this point, you've been collecting metrics. Now comes the human interface. Raw metrics are numbers in a database. Dashboards are stories about what those numbers mean. The difference between a useful dashboard and a useless one is the difference between understanding what's happening in your system and staring at meaningless graphs.
Grafana visualizes your metrics and lets you build dashboards that tell coherent stories about your system's behavior. Let's build a dashboard specifically for ML inference monitoring. The key principle is that every panel should answer a specific operational question. If you're staring at a graph and can't articulate why it matters, it shouldn't be on your dashboard.
Dashboard JSON (Importable)
Here's a complete, importable Grafana dashboard JSON:
```json
{
  "dashboard": {
    "title": "ML Inference Monitoring",
    "panels": [
      {
        "id": 1,
        "title": "Inference Latency (p50/p95/p99)",
        "type": "graph",
        "targets": [
          { "expr": "ml:inference_latency_p50:5m", "legendFormat": "p50", "refId": "A" },
          { "expr": "ml:inference_latency_p95:5m", "legendFormat": "p95", "refId": "B" },
          { "expr": "ml:inference_latency_p99:5m", "legendFormat": "p99", "refId": "C" }
        ],
        "yaxes": [{ "format": "s", "label": "Latency" }]
      },
      {
        "id": 2,
        "title": "GPU Utilization by Node",
        "type": "heatmap",
        "targets": [
          { "expr": "dcgm_sm_utilization", "format": "heatmap", "refId": "A" }
        ]
      },
      {
        "id": 3,
        "title": "GPU Memory Used %",
        "type": "gauge",
        "targets": [
          { "expr": "gpu:memory_used_percent:5m", "refId": "A" }
        ],
        "fieldConfig": {
          "defaults": {
            "max": 100,
            "min": 0,
            "thresholds": {
              "steps": [
                { "color": "green", "value": 0 },
                { "color": "yellow", "value": 75 },
                { "color": "red", "value": 90 }
              ]
            }
          }
        }
      },
      {
        "id": 4,
        "title": "Inference Throughput",
        "type": "graph",
        "targets": [
          { "expr": "ml:inference_throughput:5m", "legendFormat": "{{model_name}}", "refId": "A" }
        ],
        "yaxes": [{ "format": "ops", "label": "Inferences/sec" }]
      },
      {
        "id": 5,
        "title": "GPU XID Errors",
        "type": "stat",
        "targets": [
          { "expr": "increase(dcgm_xid_errors_total[5m])", "legendFormat": "{{gpu}}/{{node}}", "refId": "A" }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                { "color": "green", "value": 0 },
                { "color": "red", "value": 1 }
              ]
            }
          }
        }
      },
      {
        "id": 6,
        "title": "Model Accuracy Trend",
        "type": "graph",
        "targets": [
          { "expr": "ml_model_accuracy", "legendFormat": "{{model_name}} {{version}}", "refId": "A" }
        ],
        "yaxes": [{ "format": "percentunit", "max": 1, "min": 0 }]
      }
    ]
  }
}
```

To import this:
- Create a new dashboard in Grafana
- Go to Dashboard settings → JSON Model
- Paste the JSON above
- Adjust datasource references as needed
The dashboard panels tell a coherent story about your system's health and behavior. The latency graph shows you if inference is getting slower - not just the average, but the percentiles, because that's where you see user-facing degradation. The GPU utilization heatmap shows you which nodes are bottlenecks and whether you have uneven load distribution across your cluster. The memory gauge shows you when you're close to out-of-memory crashes before they actually happen. The throughput graph shows you if you're handling the expected load or if you're saturating. The XID errors panel goes red if there are any GPU errors, and because it's red when there should be no errors, it's immediately visible. The accuracy trend shows if your model is degrading, which is often the most important metric but the easiest to forget to track. Together, they give you a 360-degree view of your ML system's health, with every panel designed to answer an operational question that matters.
Part 5: Alerting with AlertManager
Here's the hard truth about observability: metrics mean nothing without action. A dashboard that nobody looks at is worthless. A metric that tells you something important but doesn't trigger action is just noise. Alerting is how you bridge the gap between "we're collecting data" and "we're actually responding to problems."
AlertManager is the mechanism that routes alerts to the right people at the right time. It's not just a notification system. It's a filter and router that understands alert grouping, deduplication, inhibition, and escalation. The wrong alert routing can lead to alert fatigue, where your team stops paying attention to alerts because they're drowning in noise. The right routing means people see exactly the alerts they need to see.
```yaml
groups:
  - name: ml_alerts
    rules:
      # GPU out of memory
      - alert: GPUMemoryHigh
        expr: gpu:memory_used_percent:5m > 95
        for: 2m
        labels:
          severity: critical
          team: infra
        annotations:
          summary: "GPU {{ $labels.gpu }} memory critical"
          description: "GPU {{ $labels.gpu }} on node {{ $labels.node }} at {{ $value }}%"
      # Model accuracy drop
      - alert: ModelAccuracyDegraded
        expr: |
          (ml_model_accuracy offset 1h - ml_model_accuracy) > 0.05
        for: 5m
        labels:
          severity: warning
          team: ml
        annotations:
          summary: "Model {{ $labels.model_name }} accuracy dropped"
          description: "Accuracy fell by {{ $value | humanizePercentage }} in the last hour"
      # Inference latency spike
      - alert: InferenceLatencyHigh
        expr: ml:inference_latency_p99:5m > 0.5
        for: 5m
        labels:
          severity: warning
          team: ml
        annotations:
          summary: "p99 latency spike for {{ $labels.model_name }}"
```

Notice the for: 2m condition on the GPU memory alert. This means the condition must be true for 2 minutes before firing. Why? Because transient spikes don't matter. If GPU memory hits 96% for 10 seconds then drops back down, that's normal behavior - don't wake anyone up. But if it stays above 95% for 2 minutes, something is wrong, and we should alert.
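The `for:` clause is essentially a debounce over consecutive rule evaluations. A stdlib sketch of the behavior - Prometheus re-evaluates the expression every evaluation_interval, and here each list element stands in for one evaluation:

```python
def fires(samples, threshold, for_points):
    """Alert fires only after `for_points` consecutive samples exceed threshold."""
    streak = 0
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak >= for_points:
            return True
    return False

print(fires([96, 96, 96, 96], 95, 4))  # True  - sustained breach fires
print(fires([96, 90, 96, 90], 95, 4))  # False - transient spikes stay quiet
```

With a 30-second evaluation interval, `for: 2m` corresponds to roughly four consecutive breaching evaluations, which is exactly the streak logic above.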
The accuracy alert is more sophisticated. It calculates the difference between current accuracy and accuracy from 1 hour ago, then fires if it dropped by more than 5 percentage points. This catches model drift automatically.
Configure AlertManager routing:
```yaml
route:
  receiver: "default"
  routes:
    - match:
        team: infra
      receiver: "infra-oncall"
      continue: true
    - match:
        team: ml
      receiver: "ml-team"
      group_wait: 30s
      group_interval: 5m

receivers:
  - name: "default"
    slack_configs:
      - api_url: $SLACK_WEBHOOK_DEFAULT
  - name: "infra-oncall"
    pagerduty_configs:
      - service_key: $PAGERDUTY_KEY_INFRA
  - name: "ml-team"
    slack_configs:
      - api_url: $SLACK_WEBHOOK_ML
        channel: "#ml-alerts"

inhibit_rules:
  # Don't alert on high latency if GPU is throttling
  - source_match:
      alertname: GPUThermalThrottle
    target_match:
      alertname: InferenceLatencyHigh
    equal: ["instance"]
```

AlertManager doesn't just send alerts in isolation - it routes them intelligently to the right channels and people. Infrastructure alerts go to the on-call engineer via PagerDuty, which means they're woken up if something is wrong. ML-team alerts go to Slack, which means the team sees them in their normal workflow. The group_wait setting means AlertManager waits up to 30 seconds to batch related alerts together before sending them. Why does this matter? Imagine your system has eight GPUs and all eight simultaneously exceed memory threshold within a few seconds. Without grouping, you get eight separate PagerDuty notifications that all say the same thing: memory is high. Your on-call engineer is woken up eight times in rapid succession. With grouping, you get one notification that says eight GPUs have memory pressure, which is vastly more useful information and less disruptive.
The inhibition rule pattern shows sophisticated alerting thinking. If the GPU is thermally throttling, you're expecting latency to be high. That's not a surprise. So don't alert on high latency in that case - it would just be noise that distracts from the real problem, which is the thermal throttling. By suppressing downstream alerts when the root cause is already firing, you focus attention on fixing the actual problem rather than dealing with cascading symptom alerts.
Production Considerations
Taking this to production requires more thought than just copy-pasting configs.
Metric Cardinality Planning
"Cardinality" is the number of unique label combinations. A metric with 100 different model names, 10 different endpoints, and unlimited request IDs has cardinality of 100 × 10 × unlimited = unlimited. Prometheus will run out of memory. Always use bounded labels.
Good: model_name, endpoint, gpu, node (these are finite)
Bad: request_id, user_id, input_hash (these are unbounded)
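A quick back-of-envelope check before you add a label: multiply the number of possible values of every label on the metric. A sketch:

```python
from math import prod

def series_upper_bound(label_value_counts):
    """Worst-case number of time series one metric name can create."""
    return prod(label_value_counts)

# model_name (100) x endpoint (10) x gpu (8): fine
print(series_upper_bound([100, 10, 8]))             # 8000 series - manageable

# ...but add a user_id label with a million values and it explodes
print(series_upper_bound([100, 10, 8, 1_000_000]))  # 8000000000 series
```

If the product isn't obviously bounded and small (thousands, not millions), the label doesn't belong on a metric - put it in logs or traces instead.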
Scrape Interval vs. Storage
A 30-second scrape interval at 100 metrics per target means 12,000 new data points per target per hour (120 scrapes × 100 series). Multiply that by the number of targets and you see storage grow quickly. But slower scrapes mean less granular percentiles. A useful rule of thumb: for ML workloads, 30 seconds is a good default. If you need to catch short-lived latency spikes, scrape faster. If you're monitoring batch jobs (which run for hours), scrape slower.
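The arithmetic behind that rule of thumb, as a sketch - the ~1.5 bytes/sample figure is an assumption based on Prometheus's typical compressed sample cost, so treat the MB estimate as a ballpark:

```python
def samples_per_day(scrape_interval_s, series_per_target, targets):
    """Ingested samples per day for a fleet of identical scrape targets."""
    scrapes_per_day = 86_400 // scrape_interval_s
    return scrapes_per_day * series_per_target * targets

n = samples_per_day(30, 100, 50)
print(n)                               # 14400000 samples/day for 50 targets
print(f"~{n * 1.5 / 1e6:.0f} MB/day")  # at an assumed ~1.5 bytes/sample
```

Run the same function with your real target count and retention window before sizing Prometheus's disk.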
Recording Rules Strategy
Not all queries need recording rules. Record if:
- The query is expensive (complex aggregations, histogram_quantile)
- The query is run frequently (on dashboards that auto-refresh)
- Multiple queries depend on it
Don't record simple queries like rate(requests_total[5m]) unless they're used on dashboards.
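As an illustration of the first bullet, a recording rule that pre-computes a p99 latency percentile might look like the following. The metric name ml_inference_latency_seconds_bucket, the group name, and the 5m window are assumptions for the sketch, not names from the article's earlier config:

```yaml
groups:
  - name: ml_latency_percentiles
    interval: 30s
    rules:
      - record: job:ml_inference_latency_seconds:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (le, job) (rate(ml_inference_latency_seconds_bucket[5m])))
```

Dashboards then query job:ml_inference_latency_seconds:p99_5m directly instead of re-running the expensive histogram_quantile on every refresh.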
GPU Monitoring Gotchas
DCGM exporter can be resource-hungry if you're collecting a lot of metrics. Start with the essentials (utilization, memory, temperature) and add more only if needed. Also, DCGM requires a compatible NVIDIA driver version - check compatibility before deploying.
Common Pitfalls and Troubleshooting
"My dashboards are slow"
You're running expensive PromQL queries directly in dashboard panels instead of using recording rules. Move complex queries (especially histogram_quantile over long ranges) into recording rules; Prometheus evaluates them on a schedule, and your dashboards query the cheap pre-computed series instead.
"Alert fatigue - too many alerts"
You're alerting on infrastructure metrics instead of business metrics. A GPU at 95% utilization is fine if latency is good. Alert on p99 latency > threshold, not on GPU utilization > 90%. Also use alert grouping and inhibition rules to reduce noise.
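A business-metric alert along those lines might look like this. It assumes a pre-computed p99 series named job:ml_inference_latency_seconds:p99_5m exists; the threshold, labels, and names are illustrative:

```yaml
groups:
  - name: ml_slo_alerts
    rules:
      - alert: HighInferenceLatencyP99
        expr: job:ml_inference_latency_seconds:p99_5m > 0.5
        for: 5m
        labels:
          severity: page
          team: ml
        annotations:
          summary: "p99 inference latency above 500ms for 5 minutes"
```

The for: 5m clause is doing real work here: a single spiky scrape won't page anyone, only sustained degradation will.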
"GPU metrics aren't showing up"
Check that the DCGM exporter pod is running and has privileged access. Check that Prometheus can reach the DCGM exporter endpoint (port 9400). Verify that the node selector accelerator: nvidia-gpu matches your actual GPU node labels.
"Accuracy metric won't update"
Your batch validation job might be failing silently. Check logs. Also, make sure you're using the right metric registry and that the validation job has network access to your metrics endpoint.
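One common fix for short-lived validation jobs is to push to a Pushgateway rather than waiting to be scraped, since the job may exit before Prometheus's next scrape. A sketch, assuming a Pushgateway at pushgateway:9091; the metric name, job name, and helper are hypothetical:

```python
# Batch validation job publishing accuracy via the Pushgateway.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def publish_accuracy(accuracy: float, gateway: str = "pushgateway:9091") -> CollectorRegistry:
    # Fresh registry per run so stale metrics from earlier runs aren't re-pushed.
    registry = CollectorRegistry()
    gauge = Gauge(
        "model_validation_accuracy",
        "Accuracy from the latest batch validation run",
        registry=registry,
    )
    gauge.set(accuracy)
    try:
        push_to_gateway(gateway, job="batch_validation", registry=registry)
    except OSError as exc:
        # Don't fail silently - surface network problems loudly in the job logs.
        print(f"failed to push metrics to {gateway}: {exc}")
    return registry

registry = publish_accuracy(0.931, gateway="localhost:9091")
```

If the push fails, the error lands in the job logs instead of disappearing, which is exactly the silent-failure mode this pitfall describes.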
"Latency is spiky"
This is usually GPU memory fragmentation or thermal throttling. Look at the GPU memory gauge and temperature. If memory is at 95% or temperature is at 85°C+, that's your problem. Consider reducing batch size, restarting the serving process to clear fragmentation, or adding more GPUs.
The Complete Picture
Here's how these pieces fit together:
```mermaid
graph LR
    A["ML Inference Code<br/>prometheus_client"] -->|Port 8000| C["Prometheus<br/>Scraper"]
    B["NVIDIA DCGM<br/>Exporter DaemonSet"] -->|Port 9400| C
    C -->|Recording Rules| D["Time Series DB<br/>prometheus/tsdb"]
    D -->|PromQL Queries| E["Grafana<br/>Dashboards"]
    D -->|Alert Rules| F["AlertManager"]
    F -->|Route & Filter| G["Slack/PagerDuty<br/>Notifications"]
    E -->|Visual Display| H["Team Dashboard"]
```

And here's your monitoring data flow:
```mermaid
sequenceDiagram
    participant App as ML App
    participant Prom as Prometheus
    participant TSDB as Time Series DB
    participant Rules as Recording Rules
    participant Grafana as Grafana
    participant Alerts as AlertManager
    App->>App: Record metric<br/>inference_latency
    Prom->>App: Scrape /metrics<br/>every 30s
    Prom->>TSDB: Store raw samples
    Rules->>TSDB: Compute p50, p95, p99<br/>at scrape interval
    Grafana->>TSDB: Query pre-computed<br/>percentiles (fast!)
    TSDB->>Alerts: Evaluate alert rules
    Alerts->>Alerts: Group similar alerts
    Alerts->>Alerts: Apply inhibition
    Alerts->>Alerts: Route by team
```

Decision Framework: When to Use What
| Metric Type | When to Use | Example |
|---|---|---|
| Counter | Track volume or cumulative events | Total inferences, errors since startup |
| Histogram | Track latency or request size distributions | p50/p95/p99 inference latency |
| Gauge | Track instantaneous values | GPU memory used, queue depth |
| Custom Business Metric | Track model-specific KPIs | Accuracy, precision, recall |
| DCGM Metric | Track GPU hardware health | Utilization, memory, temperature, errors |

| Query Type | When to Use | Example |
|---|---|---|
| Direct Query | Dashboard panels, one-off exploration | "What's the p99 latency right now?" |
| Recording Rule | Expensive queries that run frequently | Pre-compute percentiles every 30s |
| Alert Rule | Conditions that need human attention | "GPU memory > 95% for 2 minutes" |
Best Practices Summary
Label cardinality: Don't add unbounded labels like request IDs. Use bounded labels like model_name, endpoint, gpu.
Scrape interval balance: 30 seconds is usually right. Faster scrapes mean more storage; slower means less granular dashboards.
Recording rules: Use them aggressively. A histogram_quantile query on millions of samples is slow. Pre-compute percentiles.
Alert thresholds: Set alerts on business metrics, not just infrastructure. A GPU at 95% utilization is fine if latency is good.
Silence over inhibit: When you're deploying and expect alerts, silence them for 15 minutes. Use inhibition rules for permanent alert relationships.
Correlation first: When something goes wrong, look at the timeline. Did accuracy drop right when GPU memory spiked? Correlate metrics to understand root causes.
Summary
Prometheus and Grafana give you industrial-grade observability for ML workloads. With custom metrics from your inference code, DCGM exporter for GPU stats, recording rules to pre-compute expensive queries, and AlertManager routing, you've got a complete picture of your ML infrastructure.
The key insight: ML systems need both infrastructure metrics and business metrics. You need to know when your GPU is saturated and when your model's accuracy is drifting. Build both into your observability from day one. Start simple - instrument inference latency and accuracy - then expand to GPU metrics and alerting as you scale. The investment in observability pays for itself the first time you catch a silent failure before it affects customers.