Prometheus and Grafana for ML Infrastructure Metrics
So you've got your machine learning models in production, they're serving predictions, and life is good - until it isn't. Suddenly, inference times spike. GPU memory fills up. Models start predicting garbage. You're scrambling to understand what went wrong, wishing you had actual numbers instead of guesses.
This is where Prometheus and Grafana come in. We're going to walk through building a comprehensive observability system for ML infrastructure - one that gives you real-time visibility into everything from custom model KPIs to raw GPU metrics.
The infrastructure you're about to build is the same foundation used by production ML teams at major companies. It's battle-tested, proven at scale, and gives you the visibility to catch problems before they cascade into customer-facing issues.
Table of Contents
- Why You Need Metrics for ML Infrastructure
- Why This Matters More Than You Think
- Part 1: Instrumenting Custom ML Metrics with prometheus_client
- Setting Up prometheus_client
- Understanding Why Each Metric Type Matters
- Model-Level KPIs
- Part 2: GPU Metrics with DCGM Exporter
- Why You Can't Ignore GPU Metrics
- Deploying DCGM Exporter
- Key GPU Metrics to Track
- Part 3: Prometheus Configuration
- Recording Rules: Pre-compute Expensive Queries
- Part 4: Building Grafana Dashboards
- Dashboard JSON (Importable)
- Part 5: Alerting with AlertManager
- Production Considerations
- Metric Cardinality Planning
- Scrape Interval vs. Storage
- Recording Rules Strategy
- GPU Monitoring Gotchas
- Common Pitfalls and Troubleshooting
- "My dashboards are slow"
- "Alert fatigue - too many alerts"
- "GPU metrics aren't showing up"
- "Accuracy metric won't update"
- "Latency is spiky"
- The Complete Picture
- Decision Framework: When to Use What
- Best Practices Summary
- Summary
Why You Need Metrics for ML Infrastructure
Let's be honest: ML systems are different from typical applications. A service might return an HTTP 500 and you'll catch it immediately. But your model can return perfectly valid predictions while silently degrading in accuracy. Your batch job can process data slower and slower without throwing errors.
That's why standard application metrics aren't enough. You need:
- Custom business metrics: How accurate are your predictions? How long does inference take at the p99?
- Infrastructure metrics: GPU utilization, memory pressure, thermal throttling
- System metrics: Queue depths, throughput, error rates
- Training metrics: Loss curves, validation performance, data quality issues
Prometheus gives you a time-series database and scraping framework. Grafana lets you visualize it. Together, they form the backbone of ML observability.
Why This Matters More Than You Think
Here's the reality: your model didn't crash, so your monitoring probably shows "everything is fine." But your model's F1 score dropped from 0.92 to 0.88 over the last two weeks - a silent catastrophe. Your inference latency crept from 50ms to 200ms because you never noticed the GPU memory fragmentation problem. Your batch job that was taking 2 hours is now taking 6, but it's not failing, so nobody knew until customers complained about stale predictions.
Prometheus and Grafana solve this by creating a single source of truth for everything that matters: infrastructure health, throughput, latency percentiles, and - most importantly - your model's actual business metrics. Once you instrument properly, you can correlate infrastructure problems with model performance problems. You'll see that accuracy dropped exactly when a new batch of training data came in. You'll notice latency spiked right when GPU utilization hit 95%.
This isn't theoretical. Teams that use this approach catch issues in hours instead of days. Issues that might have cost thousands in lost revenue or customer trust get fixed before they escalate. This is the leverage of proper observability.
Part 1: Instrumenting Custom ML Metrics with prometheus_client
Your ML workload is unique, so you'll instrument it with custom metrics. The Python prometheus_client library makes this straightforward.
When you're building observability for machine learning systems, the generic application metrics you're used to - response time, error count, memory usage - aren't enough. They tell you whether your infrastructure is working, but they don't tell you whether your model is working correctly. This is the fundamental blindness that catches teams by surprise. Your model's inference server might be responding in 50 milliseconds with a 99.9% success rate, the infrastructure looks perfect, and your model is silently degrading by 5% accuracy per week because you're not tracking the metrics that actually matter.
This is where custom metrics come in. You need to instrument your inference code to capture what's happening at the model level. Are you processing requests? How long are they taking? Are predictions getting cached or computed? When was the last time you retrained? What's the current distribution of predicted classes? These questions point to metrics you need to track.
The beautiful part about Prometheus is that it's been designed from the ground up with exactly this use case in mind. It's not bolted on. It's native. You write code that emits metrics, Prometheus scrapes those metrics on a schedule, and you query them whenever you need answers. No custom collection infrastructure, no database schema design, no schema migrations.
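For concreteness, here's what Prometheus actually sees when it scrapes a metrics endpoint - the plain-text exposition format. The metric names match the examples in this post; the sample values are illustrative:

```
# HELP ml_inferences_total Total number of inferences
# TYPE ml_inferences_total counter
ml_inferences_total{model_name="fraud_detector",endpoint="v1_predict"} 18423.0
# HELP ml_batch_queue_depth Current number of items in batch queue
# TYPE ml_batch_queue_depth gauge
ml_batch_queue_depth 12.0
```

Every scrape is just an HTTP GET that returns this text, which is why instrumenting a service adds almost no operational surface area.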
Setting Up prometheus_client
```python
from prometheus_client import Counter, Histogram, Gauge, CollectorRegistry
from prometheus_client import start_http_server
import time

# Create a dedicated registry for your metrics
registry = CollectorRegistry()

# Counter: increment for each inference
inference_counter = Counter(
    'ml_inferences_total',
    'Total number of inferences',
    ['model_name', 'endpoint'],
    registry=registry
)

# Histogram: track inference latency with buckets
inference_latency = Histogram(
    'ml_inference_latency_seconds',
    'Inference latency in seconds',
    ['model_name'],
    buckets=(0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0),
    registry=registry
)

# Gauge: current queue depth
batch_queue_depth = Gauge(
    'ml_batch_queue_depth',
    'Current number of items in batch queue',
    registry=registry
)

# Start the metrics HTTP server on port 8000
start_http_server(8000, registry=registry)
```

The setup code here is doing something important. We're creating a dedicated CollectorRegistry rather than using the default global one - this prevents metric collisions if you're running multiple services in the same process. Each metric type serves a distinct purpose: Counter for things that only go up (inference count), Histogram for latency distributions, and Gauge for values that fluctuate (queue depth). The buckets parameter in the histogram is critical - we're defining boundaries at 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, and 1000ms. These boundaries matter because Prometheus will later ask "what fraction of requests were faster than 100ms?" and these buckets provide the answer.
Now instrument your inference code:
```python
def run_inference(input_data, model_name="production_model"):
    """Run inference and record metrics."""
    start = time.time()
    try:
        # Your actual inference logic
        prediction = model.predict(input_data)
        # Record success
        inference_counter.labels(
            model_name=model_name,
            endpoint="v1_predict"
        ).inc()
    finally:
        # Always record latency, even on failure
        duration = time.time() - start
        inference_latency.labels(model_name=model_name).observe(duration)
    return prediction
```

This pattern is crucial. Notice we're using a try/finally block - we always record the latency, even if inference fails. Why? Because if your model crashes on 50% of requests, but you only record latency for successful requests, your dashboard will show 50ms average latency when the truth is "your model is broken half the time." The finally block ensures you capture failures. Also notice we're labeling metrics with model_name and endpoint - this lets us answer questions like "what's the latency for model X versus model Y?" without writing separate metrics.
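If you instrument many functions, the try/finally timing pattern is worth factoring into a decorator. Here's a stdlib-only sketch - in practice you'd pass something like `inference_latency.labels(model_name=...).observe` as the `observe` callback; the names here are illustrative:

```python
import time
from functools import wraps

def timed(observe):
    """Decorator: report duration via observe(seconds), even when the call fails."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # Runs on success AND on exception, so failures still count
                observe(time.perf_counter() - start)
        return wrapper
    return decorator

durations = []

@timed(durations.append)
def flaky_inference():
    raise RuntimeError("model crashed")

try:
    flaky_inference()
except RuntimeError:
    pass

print(len(durations))  # 1 - the latency was recorded despite the failure
```

The decorator keeps the "record latency no matter what" guarantee in one place instead of repeating it in every handler.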
Understanding Why Each Metric Type Matters
Understanding the different metric types is not just about picking the right one for a given situation - it's about understanding how these primitive building blocks let you answer complex questions about your system's behavior over time.
Counters capture volume and monotonic growth. They only increase, and they reset on service restart. If your inference counter shows it's gone up by 1000 in the last hour, you know you processed 1000 requests. This is your baseline - it tells you "we were running" and gives you throughput when you divide by time. But counters are more powerful than they first appear. Because Prometheus stores the raw counter value plus the rate of change over time, you can reconstruct not just "how many requests" but "how many requests per minute" and detect when your throughput is dropping or spiking. A sudden drop in counter growth rate is often the first sign that something is wrong with your pipeline, even before errors start appearing.
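To see why counter values plus timestamps are enough to recover throughput, here's a simplified sketch of the rate computation Prometheus performs - real `rate()` also extrapolates to the window boundaries, but this shows the core idea, including the counter-reset handling:

```python
def counter_rate(prev_value, curr_value, interval_seconds):
    """Per-second rate between two counter samples, tolerating a restart."""
    delta = curr_value - prev_value
    if delta < 0:
        # Counter reset (service restarted and the counter began again at zero):
        # the current value is the growth since the reset.
        delta = curr_value
    return delta / interval_seconds

print(counter_rate(1000, 1300, 60))  # steady traffic: 5.0 req/s
print(counter_rate(1300, 120, 60))   # restart mid-window: 2.0 req/s
```

This is also why you never graph a raw counter - you graph its rate, and resets come out in the wash.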
Histograms capture distributions, not just central tendency. If you only tracked average latency, you'd miss that 1% of requests take 5 seconds - that's the 99th percentile, and your customers experience it regularly. Your users don't care about the median. They care about the slowest requests because those are the ones that frustrate them. Histograms with buckets solve this by binning latencies into ranges. When Prometheus scrapes your metrics endpoint, it gets counts for each bucket: how many requests fell between 0 and 10 milliseconds, how many between 10 and 25, and so on. Later, you can ask Prometheus to compute the p99 latency from those buckets, and it uses interpolation to give you an accurate answer. The elegance here is that by storing counts per bucket, Prometheus avoids storing individual latency samples, which would explode your storage requirements. You get percentile tracking at constant storage cost.
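The bucket-to-percentile step is easy to demystify. Here's a minimal sketch of the linear interpolation that `histogram_quantile` performs over cumulative bucket counts - the bucket boundaries match the Histogram defined earlier, and the counts are made up for illustration:

```python
def quantile_from_buckets(q, buckets):
    """buckets: [(upper_bound_seconds, cumulative_count)], sorted ascending."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Linearly interpolate the quantile's position inside this bucket
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 1000 requests binned into (a subset of) the histogram's buckets
buckets = [(0.01, 200), (0.025, 600), (0.05, 900), (0.1, 990), (0.25, 1000)]
print(quantile_from_buckets(0.99, buckets))  # p99 lands at the 0.1s boundary
```

Note the tradeoff this implies: the answer is only as precise as your bucket boundaries, which is why choosing buckets around your SLO thresholds matters.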
Gauges are snapshots of instantaneous values. A gauge represents what something is right now. Queue depth is a gauge because it changes constantly - it goes up when messages arrive, down when workers process them. Unlike counters that only increase, gauges can go down. When you're monitoring GPU memory usage or the number of models loaded in memory, you're using gauges. They answer the question "what's the current state?" rather than "what's the total accumulation?"
Model-Level KPIs
Beyond infrastructure metrics, track what actually matters for your model:
```python
# Gauge for current model accuracy (update periodically)
model_accuracy = Gauge(
    'ml_model_accuracy',
    'Current model accuracy on validation set',
    ['model_name', 'version'],
    registry=registry
)

# Counter for prediction errors or anomalies
prediction_errors = Counter(
    'ml_prediction_errors_total',
    'Predictions flagged as anomalies',
    ['model_name', 'error_type'],
    registry=registry
)

# Update periodically (e.g., hourly from a validation job)
model_accuracy.labels(
    model_name="fraud_detector",
    version="v2.3.1"
).set(0.945)
```

This is where most teams get it wrong. They instrument infrastructure metrics religiously but skip the business metrics. Your accuracy gauge should update every hour from a validation job that evaluates your model on a holdout test set. Your error counter should increment whenever you detect a prediction that's obviously wrong (e.g., a fraud detector flagging legitimate transactions).
The key difference: this isn't just monitoring "is the system up?" - it's monitoring "is the system working correctly?" You want to see accuracy trends over days and weeks. When you see accuracy drop from 94% to 92%, you want to know immediately, because that's probably a harbinger of worse problems coming.
Part 2: GPU Metrics with DCGM Exporter
Your models run on GPUs, but the Prometheus client library doesn't give you GPU metrics directly. That's where NVIDIA's DCGM (Data Center GPU Manager) exporter comes in. This is where observability gets serious, because GPU metrics reveal problems that are completely invisible to application-level monitoring.
Why You Can't Ignore GPU Metrics
People think GPU monitoring is a nice-to-have optimization. It's not. It's essential infrastructure, and the difference between a system that works and a system that breaks down mysteriously lies in GPU-level visibility.
Start with memory fragmentation, which is the silent killer of GPU infrastructure. Your GPU has 40 gigabytes of total memory. You're using 35 gigabytes, and on paper, you have 5 gigabytes free. But memory is fragmented. The largest contiguous block you can allocate is 2 gigabytes because the 35 gigabytes in use is scattered across the memory space in fragments. Now your next inference request needs a 4-gigabyte allocation for activations. It fails. There's no error message that makes this obvious. You don't get a nice "out of memory" exception. You get a segmentation fault buried deep in CUDA kernel code. From the application's perspective, the GPU just crashed silently. Without DCGM metrics showing you the memory fragmentation pattern, you'll spend days debugging, thinking your model code has a memory leak.
Then there's thermal throttling, which is even more insidious. Your GPU was originally running at 2.0 gigahertz. Now it's overheating because your data center's cooling isn't keeping up with the power draw from eight GPUs running simultaneously. The GPU firmware automatically throttles down to 1.2 gigahertz to keep temperatures manageable. Your inference latency climbs from 50 milliseconds to 200 milliseconds. You dig into your code, profile everything, and find nothing wrong. The problem is that you're not looking at GPU clock speed. Without DCGM metrics, you'd never know your hardware is throttling.
Finally, consider XID errors, which are GPU-level exceptions that the GPU firmware handles internally. An XID error happens when something goes wrong deep in the GPU execution pipeline - maybe a kernel timeout, maybe a hardware error that's correctable but still shouldn't happen. The GPU firmware logs it and keeps running. Your Kubernetes cluster doesn't notice because the pod didn't crash. Your application doesn't notice because the GPU kept running. But every 50th request silently produces incorrect results inside the GPU. Without DCGM metrics showing you the XID error rate, you'll have silent failures in production and no way to know about them until your metrics start degrading and you can't figure out why.
Without DCGM metrics, you're flying blind. With them, you can see memory pressure in real time, spot thermal issues before they cascade into latency spikes that affect your SLA, and catch XID errors before they affect production accuracy.
Deploying DCGM Exporter
The DCGM exporter runs as a DaemonSet in your Kubernetes cluster, collecting metrics from every GPU on every node.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dcgm-exporter-config
  namespace: monitoring
data:
  default_counters.csv: |
    DCGM_FI_DEV_SM_CLOCK,1000,0
    DCGM_FI_DEV_GPU_CLOCK,0,1000
    DCGM_FI_DEV_MEMORY_CLOCK,0,1000
    DCGM_FI_DEV_GPU_UTIL,1000,0
    DCGM_FI_DEV_FB_USED,1000,0
    DCGM_FI_DEV_FB_FREE,1000,0
    DCGM_FI_DEV_GPU_TEMP,1000,0
    DCGM_FI_DEV_POWER_USAGE,1000,0
    DCGM_FI_DEV_PCIE_REPLAY_COUNTER,1000,0
    DCGM_FI_DEV_XID_ERRORS,1000,0
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      nodeSelector:
        accelerator: nvidia-gpu
      tolerations:
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: Exists
      containers:
        - name: dcgm-exporter
          image: nvidia/dcgm-exporter:3.1.2-3.1.4-ubuntu20.04
          ports:
            - containerPort: 9400
              name: metrics
          securityContext:
            privileged: true
          volumeMounts:
            - name: pod-resources
              mountPath: /var/lib/kubelet/pod-resources
            - name: config
              mountPath: /etc/dcgm-exporter
          env:
            - name: DCGM_EXPORTER_INTERVAL
              value: "30000"
            - name: DCGM_EXPORTER_KUBERNETES
              value: "true"
      volumes:
        - name: pod-resources
          hostPath:
            path: /var/lib/kubelet/pod-resources
        - name: config
          configMap:
            name: dcgm-exporter-config
---
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    app: dcgm-exporter
  ports:
    - port: 9400
      targetPort: 9400
```

The ConfigMap defines which metrics DCGM should export. Each line represents a metric: the first column is the metric ID, and the subsequent numbers define measurement intervals. The DaemonSet ensures every node with a GPU label gets an exporter instance. The privileged: true security context is required - DCGM needs low-level GPU access. The volume mount to /var/lib/kubelet/pod-resources is crucial: it lets DCGM map GPU usage back to specific pods and containers.
Key GPU Metrics to Track
Once deployed, you'll collect a range of metrics that tell you the complete story of your GPU health. Each metric answers a specific question about what's happening on your hardware.
The first metric you need is SM utilization, which tells you if the GPU is actually doing work. Streaming multiprocessor utilization ranges from 0 to 100 percent. A busy model inference should show 80 to 95 percent here. If you're seeing 20 percent, your GPU is sitting idle waiting for data, which means your data loading pipeline is your bottleneck, not the GPU itself. This metric immediately tells you whether to optimize GPU kernels or optimize data loading.
GPU memory tracking requires two metrics working together. Framebuffer memory used tells you how much of your GPU's memory is currently allocated. Framebuffer memory free tells you how much is available. Together, they let you calculate memory pressure. When memory used is at 95 percent or higher, you're walking a tightrope. You have almost no headroom for dynamic allocations, and memory fragmentation becomes critical. Watch the free memory trend over time. If it's consistently declining as your system runs, fragmentation is happening.
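The two framebuffer metrics combine into the memory-pressure percentage used throughout this post. As a quick sanity check on the arithmetic (the GB values are illustrative):

```python
def memory_used_percent(fb_used, fb_free):
    """Same formula as the gpu:memory_used_percent:5m recording rule."""
    return fb_used / (fb_used + fb_free) * 100

print(memory_used_percent(38.0, 2.0))  # 95.0 - danger zone, almost no headroom
```

Computing the percentage from used and free (rather than against a hard-coded total) means the same expression works across GPU models with different memory sizes.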
Power consumption is a window into thermal health. Your GPU draws a certain amount of power when running at full clock speed. If you see power consumption suddenly drop while utilization stays high, thermal throttling is almost certainly happening. The firmware detected high temperatures and dialed down the clock speed to reduce heat output. This is your warning sign to investigate cooling or to reduce the workload on that GPU.
Finally, XID errors are critical. This counter should never increase in a healthy system. Any increase means the GPU experienced an internal error. Even if the GPU recovered, you've got a reliability problem. Set your alert threshold to any increase at all. These metrics are automatically tagged with GPU, node, and other Kubernetes labels so you can drill down by node or GPU and understand which specific hardware is having problems.
Part 3: Prometheus Configuration
At this point, you've got metrics being emitted from your inference code and from the DCGM exporter. Now you need to tell Prometheus where to find these metrics and how often to collect them. This is where the configuration tells the story of how you're thinking about observability.
Wire up Prometheus to scrape your application metrics and the DCGM exporter. Prometheus works by periodically connecting to exposed endpoints and pulling metric data. This is a fundamentally different approach from pushing metrics to a collection point, and it matters because Prometheus can decide when and how often to scrape without your application code needing to know anything about it. You just expose metrics, Prometheus finds them, collects them, and stores them.
```yaml
global:
  scrape_interval: 30s
  evaluation_interval: 30s

scrape_configs:
  - job_name: "ml-inference"
    static_configs:
      - targets: ["localhost:8000"]
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance

  - job_name: "dcgm-exporter"
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - monitoring
    relabel_configs:
      - source_labels: [__meta_kubernetes_endpoints_name]
        regex: dcgm-exporter
        action: keep
```

The scrape_interval: 30s tells Prometheus to pull metrics from all targets every 30 seconds. This is a good balance - faster means more storage usage, slower means less granular dashboards. For ML workloads, 30 seconds is usually right because inference latencies are typically in the 50-500ms range, and you want enough data points to calculate accurate percentiles. The static config for ml-inference assumes your inference service is running locally; adjust the target as needed. The Kubernetes service discovery for DCGM automatically finds all DCGM exporter endpoints in the monitoring namespace.
Recording Rules: Pre-compute Expensive Queries
Recording rules represent a fundamental shift in how you think about metrics. Instead of computing complex queries on demand when someone looks at a dashboard, you precompute expensive queries automatically at regular intervals and store the results as new metrics. This is the difference between a system that's responsive and a system that bogs down when multiple people are looking at dashboards simultaneously.
Recording rules run expensive PromQL queries at scrape time and store results. This saves compute and makes dashboards fast. The concept is elegantly simple but profound in its impact. When you have a dashboard showing percentile latencies to fifty people simultaneously, without recording rules, Prometheus would compute that percentile calculation from raw data fifty times in parallel. With recording rules, it computes once per interval and stores the answer, and everyone just reads the precomputed result. The mathematics is identical, but the infrastructure load is completely different.
```yaml
groups:
  - name: ml_recording_rules
    interval: 30s
    rules:
      # Latency percentiles per model
      - record: ml:inference_latency_p50:5m
        expr: histogram_quantile(0.50, rate(ml_inference_latency_seconds_bucket[5m]))
      - record: ml:inference_latency_p95:5m
        expr: histogram_quantile(0.95, rate(ml_inference_latency_seconds_bucket[5m]))
      - record: ml:inference_latency_p99:5m
        expr: histogram_quantile(0.99, rate(ml_inference_latency_seconds_bucket[5m]))
      # GPU utilization by node
      - record: gpu:utilization:avg_by_node:5m
        expr: avg(dcgm_sm_utilization) by (node)
      # GPU memory pressure
      - record: gpu:memory_used_percent:5m
        expr: |
          (dcgm_fb_used_bytes / (dcgm_fb_used_bytes + dcgm_fb_free_bytes)) * 100
      # Throughput: inferences per second
      - record: ml:inference_throughput:5m
        expr: rate(ml_inferences_total[5m])
```

Here's why recording rules matter: calculating the p99 latency from raw histogram buckets requires reading hundreds or thousands of data points and computing quantiles. If you run this query on a dashboard that refreshes every 5 seconds, you'll compute it hundreds of times per minute. That's wasteful. Recording rules solve this by pre-computing the answer every 30 seconds and storing the result. When your dashboard queries ml:inference_latency_p99:5m, it's just reading a single pre-computed value instead of doing expensive calculations.
The naming convention here (ml: prefix, :5m suffix) makes it clear what these metrics are: recorded metrics over a 5-minute window. This is a best practice that keeps your metric namespace organized.
Part 4: Building Grafana Dashboards
Up to this point, you've been collecting metrics. Now comes the human interface. Raw metrics are numbers in a database. Dashboards are stories about what those numbers mean. The difference between a useful dashboard and a useless one is the difference between understanding what's happening in your system and staring at meaningless graphs.
Grafana visualizes your metrics and lets you build dashboards that tell coherent stories about your system's behavior. Let's build a dashboard specifically for ML inference monitoring. The key principle is that every panel should answer a specific operational question. If you're staring at a graph and can't articulate why it matters, it shouldn't be on your dashboard.
Dashboard JSON (Importable)
Here's a complete, importable Grafana dashboard JSON:
```json
{
  "dashboard": {
    "title": "ML Inference Monitoring",
    "panels": [
      {
        "id": 1,
        "title": "Inference Latency (p50/p95/p99)",
        "type": "graph",
        "targets": [
          { "expr": "ml:inference_latency_p50:5m", "legendFormat": "p50", "refId": "A" },
          { "expr": "ml:inference_latency_p95:5m", "legendFormat": "p95", "refId": "B" },
          { "expr": "ml:inference_latency_p99:5m", "legendFormat": "p99", "refId": "C" }
        ],
        "yaxes": [{ "format": "s", "label": "Latency" }]
      },
      {
        "id": 2,
        "title": "GPU Utilization by Node",
        "type": "heatmap",
        "targets": [
          { "expr": "dcgm_sm_utilization", "format": "heatmap", "refId": "A" }
        ]
      },
      {
        "id": 3,
        "title": "GPU Memory Used %",
        "type": "gauge",
        "targets": [
          { "expr": "gpu:memory_used_percent:5m", "refId": "A" }
        ],
        "fieldConfig": {
          "defaults": {
            "max": 100,
            "min": 0,
            "thresholds": {
              "steps": [
                { "color": "green", "value": 0 },
                { "color": "yellow", "value": 75 },
                { "color": "red", "value": 90 }
              ]
            }
          }
        }
      },
      {
        "id": 4,
        "title": "Inference Throughput",
        "type": "graph",
        "targets": [
          { "expr": "ml:inference_throughput:5m", "legendFormat": "{{model_name}}", "refId": "A" }
        ],
        "yaxes": [{ "format": "ops", "label": "Inferences/sec" }]
      },
      {
        "id": 5,
        "title": "GPU XID Errors",
        "type": "stat",
        "targets": [
          { "expr": "increase(dcgm_xid_errors_total[5m])", "legendFormat": "{{gpu}}/{{node}}", "refId": "A" }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                { "color": "green", "value": 0 },
                { "color": "red", "value": 1 }
              ]
            }
          }
        }
      },
      {
        "id": 6,
        "title": "Model Accuracy Trend",
        "type": "graph",
        "targets": [
          { "expr": "ml_model_accuracy", "legendFormat": "{{model_name}} {{version}}", "refId": "A" }
        ],
        "yaxes": [{ "format": "percentunit", "max": 1, "min": 0 }]
      }
    ]
  }
}
```

To import this:
- Create a new dashboard in Grafana
- Go to Dashboard settings → JSON Model
- Paste the JSON above
- Adjust datasource references as needed
The dashboard panels tell a coherent story about your system's health and behavior. The latency graph shows you if inference is getting slower - not just the average, but the percentiles, because that's where you see user-facing degradation. The GPU utilization heatmap shows you which nodes are bottlenecks and whether you have uneven load distribution across your cluster. The memory gauge shows you when you're close to out-of-memory crashes before they actually happen. The throughput graph shows you if you're handling the expected load or if you're saturating. The XID errors panel goes red if there are any GPU errors, and because it's red when there should be no errors, it's immediately visible. The accuracy trend shows if your model is degrading, which is often the most important metric but the easiest to forget to track. Together, they give you a 360-degree view of your ML system's health, with every panel designed to answer an operational question that matters.
Part 5: Alerting with AlertManager
Here's the hard truth about observability: metrics mean nothing without action. A dashboard that nobody looks at is worthless. A metric that tells you something important but doesn't trigger action is just noise. Alerting is how you bridge the gap between "we're collecting data" and "we're actually responding to problems."
AlertManager is the mechanism that routes alerts to the right people at the right time. It's not just a notification system. It's a filter and router that understands alert grouping, deduplication, inhibition, and escalation. The wrong alert routing can lead to alert fatigue, where your team stops paying attention to alerts because they're drowning in noise. The right routing means people see exactly the alerts they need to see.
```yaml
groups:
  - name: ml_alerts
    rules:
      # GPU out of memory
      - alert: GPUMemoryHigh
        expr: gpu:memory_used_percent:5m > 95
        for: 2m
        labels:
          severity: critical
          team: infra
        annotations:
          summary: "GPU {{ $labels.gpu }} memory critical"
          description: "GPU {{ $labels.gpu }} on node {{ $labels.node }} at {{ $value }}%"
      # Model accuracy drop
      - alert: ModelAccuracyDegraded
        expr: |
          (ml_model_accuracy offset 1h - ml_model_accuracy) > 0.05
        for: 5m
        labels:
          severity: warning
          team: ml
        annotations:
          summary: "Model {{ $labels.model_name }} accuracy dropped"
          description: "Accuracy fell by {{ $value | humanizePercentage }} in the last hour"
      # Inference latency spike
      - alert: InferenceLatencyHigh
        expr: ml:inference_latency_p99:5m > 0.5
        for: 5m
        labels:
          severity: warning
          team: ml
        annotations:
          summary: "p99 latency spike for {{ $labels.model_name }}"
```

Notice the for: 2m condition on the GPU memory alert. This means the condition must be true for 2 minutes before firing. Why? Because transient spikes don't matter. If GPU memory hits 96% for 10 seconds then drops back down, that's normal behavior - don't wake anyone up. But if it stays above 95% for 2 minutes, something is wrong, and we should alert.
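The `for:` clause is essentially a debounce over consecutive rule evaluations. A stdlib sketch of the behavior - Prometheus re-evaluates the expression every evaluation_interval, and here each list element stands in for one evaluation:

```python
def fires(samples, threshold, for_points):
    """Alert fires only after `for_points` consecutive samples exceed threshold."""
    streak = 0
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak >= for_points:
            return True
    return False

print(fires([96, 96, 96, 96], 95, 4))  # True  - sustained breach fires
print(fires([96, 90, 96, 90], 95, 4))  # False - transient spikes stay quiet
```

With a 30-second evaluation interval, `for: 2m` corresponds to roughly four consecutive breaching evaluations, which is exactly the streak logic above.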
The accuracy alert is more sophisticated. It calculates the difference between current accuracy and accuracy from 1 hour ago, then fires if it dropped by more than 5 percentage points. This catches model drift automatically.
Configure AlertManager routing:
```yaml
route:
  receiver: "default"
  routes:
    - match:
        team: infra
      receiver: "infra-oncall"
      continue: true
    - match:
        team: ml
      receiver: "ml-team"
      group_wait: 30s
      group_interval: 5m

receivers:
  - name: "default"
    slack_configs:
      - api_url: $SLACK_WEBHOOK_DEFAULT
  - name: "infra-oncall"
    pagerduty_configs:
      - service_key: $PAGERDUTY_KEY_INFRA
  - name: "ml-team"
    slack_configs:
      - api_url: $SLACK_WEBHOOK_ML
        channel: "#ml-alerts"

inhibit_rules:
  # Don't alert on high latency if GPU is throttling
  - source_match:
      alertname: GPUThermalThrottle
    target_match:
      alertname: InferenceLatencyHigh
    equal: ["instance"]
```

AlertManager doesn't just send alerts in isolation - it routes them intelligently to the right channels and people. Infrastructure alerts go to the on-call engineer via PagerDuty, which means they're woken up if something is wrong. ML-team alerts go to Slack, which means the team sees them in their normal workflow. The group_wait setting means AlertManager waits up to 30 seconds to batch related alerts together before sending them. Why does this matter? Imagine your system has eight GPUs and all eight simultaneously exceed memory threshold within a few seconds. Without grouping, you get eight separate PagerDuty notifications that all say the same thing: memory is high. Your on-call engineer is woken up eight times in rapid succession. With grouping, you get one notification that says eight GPUs have memory pressure, which is vastly more useful information and less disruptive.
The inhibition rule pattern shows sophisticated alerting thinking. If the GPU is thermally throttling, you're expecting latency to be high. That's not a surprise. So don't alert on high latency in that case - it would just be noise that distracts from the real problem, which is the thermal throttling. By suppressing downstream alerts when the root cause is already firing, you focus attention on fixing the actual problem rather than dealing with cascading symptom alerts.
Production Considerations
Taking this to production requires more thought than just copy-pasting configs.
Metric Cardinality Planning
"Cardinality" is the number of unique label combinations. A metric with 100 different model names, 10 different endpoints, and unlimited request IDs has cardinality of 100 × 10 × unlimited = unlimited. Prometheus will run out of memory. Always use bounded labels.
Good: model_name, endpoint, gpu, node (these are finite)
Bad: request_id, user_id, input_hash (these are unbounded)
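A quick back-of-envelope check before you add a label: multiply the number of possible values of every label on the metric. A sketch:

```python
from math import prod

def series_upper_bound(label_value_counts):
    """Worst-case number of time series one metric name can create."""
    return prod(label_value_counts)

# model_name (100) x endpoint (10) x gpu (8): fine
print(series_upper_bound([100, 10, 8]))             # 8000 series - manageable

# ...but add a user_id label with a million values and it explodes
print(series_upper_bound([100, 10, 8, 1_000_000]))  # 8000000000 series
```

If the product isn't obviously bounded and small (thousands, not millions), the label doesn't belong on a metric - put it in logs or traces instead.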
Scrape Interval vs. Storage
A 30-second scrape interval at 100 metrics per target means 12,000 new data points per target per hour (120 scrapes × 100 series). Multiply that by the number of targets and you see storage grow quickly. But slower scrapes mean less granular percentiles. A useful rule of thumb: for ML workloads, 30 seconds is a good default. If you need to catch short-lived latency spikes, scrape faster. If you're monitoring batch jobs (which run for hours), scrape slower.
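The arithmetic behind that rule of thumb, as a sketch - the ~1.5 bytes/sample figure is an assumption based on Prometheus's typical compressed sample cost, so treat the MB estimate as a ballpark:

```python
def samples_per_day(scrape_interval_s, series_per_target, targets):
    """Ingested samples per day for a fleet of identical scrape targets."""
    scrapes_per_day = 86_400 // scrape_interval_s
    return scrapes_per_day * series_per_target * targets

n = samples_per_day(30, 100, 50)
print(n)                               # 14400000 samples/day for 50 targets
print(f"~{n * 1.5 / 1e6:.0f} MB/day")  # at an assumed ~1.5 bytes/sample
```

Run the same function with your real target count and retention window before sizing Prometheus's disk.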
Recording Rules Strategy
Not all queries need recording rules. Record if:
- The query is expensive (complex aggregations, histogram_quantile)
- The query is run frequently (on dashboards that auto-refresh)
- Multiple queries depend on it
Don't record simple queries like rate(requests_total[5m]) unless they're used on dashboards.
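As an illustration of the first bullet, a recording rule that pre-computes a p99 latency percentile might look like the following. The metric name ml_inference_latency_seconds_bucket, the group name, and the 5m window are assumptions for the sketch, not names from the article's earlier config:

```yaml
groups:
  - name: ml_latency_percentiles
    interval: 30s
    rules:
      - record: job:ml_inference_latency_seconds:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (le, job) (rate(ml_inference_latency_seconds_bucket[5m])))
```

Dashboards then query job:ml_inference_latency_seconds:p99_5m directly instead of re-running the expensive histogram_quantile on every refresh.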
GPU Monitoring Gotchas
DCGM exporter can be resource-hungry if you're collecting a lot of metrics. Start with the essentials (utilization, memory, temperature) and add more only if needed. Also, DCGM requires a compatible NVIDIA driver version - check compatibility before deploying.
Common Pitfalls and Troubleshooting
"My dashboards are slow"
You're running expensive PromQL queries directly in dashboard panels instead of using recording rules. Move complex queries (especially histogram_quantile over long ranges) into recording rules; Prometheus evaluates them on a schedule, and your dashboards query the cheap pre-computed series instead.
"Alert fatigue - too many alerts"
You're alerting on infrastructure metrics instead of business metrics. A GPU at 95% utilization is fine if latency is good. Alert on p99 latency > threshold, not on GPU utilization > 90%. Also use alert grouping and inhibition rules to reduce noise.
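A business-metric alert along those lines might look like this. It assumes a pre-computed p99 series named job:ml_inference_latency_seconds:p99_5m exists; the threshold, labels, and names are illustrative:

```yaml
groups:
  - name: ml_slo_alerts
    rules:
      - alert: HighInferenceLatencyP99
        expr: job:ml_inference_latency_seconds:p99_5m > 0.5
        for: 5m
        labels:
          severity: page
          team: ml
        annotations:
          summary: "p99 inference latency above 500ms for 5 minutes"
```

The for: 5m clause is doing real work here: a single spiky scrape won't page anyone, only sustained degradation will.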
"GPU metrics aren't showing up"
Check that the DCGM exporter pod is running and has privileged access. Check that Prometheus can reach the DCGM exporter endpoint (port 9400). Verify that the node selector accelerator: nvidia-gpu matches your actual GPU node labels.
"Accuracy metric won't update"
Your batch validation job might be failing silently. Check logs. Also, make sure you're using the right metric registry and that the validation job has network access to your metrics endpoint.
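One common fix for short-lived validation jobs is to push to a Pushgateway rather than waiting to be scraped, since the job may exit before Prometheus's next scrape. A sketch, assuming a Pushgateway at pushgateway:9091; the metric name, job name, and helper are hypothetical:

```python
# Batch validation job publishing accuracy via the Pushgateway.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def publish_accuracy(accuracy: float, gateway: str = "pushgateway:9091") -> CollectorRegistry:
    # Fresh registry per run so stale metrics from earlier runs aren't re-pushed.
    registry = CollectorRegistry()
    gauge = Gauge(
        "model_validation_accuracy",
        "Accuracy from the latest batch validation run",
        registry=registry,
    )
    gauge.set(accuracy)
    try:
        push_to_gateway(gateway, job="batch_validation", registry=registry)
    except OSError as exc:
        # Don't fail silently - surface network problems loudly in the job logs.
        print(f"failed to push metrics to {gateway}: {exc}")
    return registry

registry = publish_accuracy(0.931, gateway="localhost:9091")
```

If the push fails, the error lands in the job logs instead of disappearing, which is exactly the silent-failure mode this pitfall describes.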
"Latency is spiky"
This is usually GPU memory fragmentation or thermal throttling. Look at the GPU memory gauge and temperature. If memory is at 95% or temperature is at 85°C+, that's your problem. Consider reducing batch size, restarting the serving process to clear fragmentation, or adding more GPUs.
The Complete Picture
Here's how these pieces fit together:
```mermaid
graph LR
    A["ML Inference Code<br/>prometheus_client"] -->|Port 8000| C["Prometheus<br/>Scraper"]
    B["NVIDIA DCGM<br/>Exporter DaemonSet"] -->|Port 9400| C
    C -->|Recording Rules| D["Time Series DB<br/>prometheus/tsdb"]
    D -->|PromQL Queries| E["Grafana<br/>Dashboards"]
    D -->|Alert Rules| F["AlertManager"]
    F -->|Route & Filter| G["Slack/PagerDuty<br/>Notifications"]
    E -->|Visual Display| H["Team Dashboard"]
```

And here's your monitoring data flow:
```mermaid
sequenceDiagram
    participant App as ML App
    participant Prom as Prometheus
    participant TSDB as Time Series DB
    participant Rules as Recording Rules
    participant Grafana as Grafana
    participant Alerts as AlertManager
    App->>App: Record metric<br/>inference_latency
    Prom->>App: Scrape /metrics<br/>every 30s
    Prom->>TSDB: Store raw samples
    Rules->>TSDB: Compute p50, p95, p99<br/>at scrape interval
    Grafana->>TSDB: Query pre-computed<br/>percentiles (fast!)
    TSDB->>Alerts: Evaluate alert rules
    Alerts->>Alerts: Group similar alerts
    Alerts->>Alerts: Apply inhibition
    Alerts->>Alerts: Route by team
```

Decision Framework: When to Use What
| Metric Type | When to Use | Example |
|---|---|---|
| Counter | Track volume or cumulative events | Total inferences, errors since startup |
| Histogram | Track latency or request size distributions | p50/p95/p99 inference latency |
| Gauge | Track instantaneous values | GPU memory used, queue depth |
| Custom Business Metric | Track model-specific KPIs | Accuracy, precision, recall |
| DCGM Metric | Track GPU hardware health | Utilization, memory, temperature, errors |

| Query Type | When to Use | Example |
|---|---|---|
| Direct Query | Dashboard panels, one-off exploration | "What's the p99 latency right now?" |
| Recording Rule | Expensive queries that run frequently | Pre-compute percentiles every 30s |
| Alert Rule | Conditions that need human attention | "GPU memory > 95% for 2 minutes" |
Best Practices Summary
Label cardinality: Don't add unbounded labels like request IDs. Use bounded labels like model_name, endpoint, gpu.
Scrape interval balance: 30 seconds is usually right. Faster scrapes mean more storage; slower means less granular dashboards.
Recording rules: Use them aggressively. A histogram_quantile query on millions of samples is slow. Pre-compute percentiles.
Alert thresholds: Set alerts on business metrics, not just infrastructure. A GPU at 95% utilization is fine if latency is good.
Silence over inhibit: When you're deploying and expect alerts, silence them for 15 minutes. Use inhibition rules for permanent alert relationships.
Correlation first: When something goes wrong, look at the timeline. Did accuracy drop right when GPU memory spiked? Correlate metrics to understand root causes.
Summary
Prometheus and Grafana give you industrial-grade observability for ML workloads. With custom metrics from your inference code, DCGM exporter for GPU stats, recording rules to pre-compute expensive queries, and AlertManager routing, you've got a complete picture of your ML infrastructure.
The key insight: ML systems need both infrastructure metrics and business metrics. You need to know when your GPU is saturated and when your model's accuracy is drifting. Build both into your observability from day one. Start simple - instrument inference latency and accuracy - then expand to GPU metrics and alerting as you scale. The investment in observability pays for itself the first time you catch a silent failure before it affects customers.