Alerting Strategies for ML Systems: What to Monitor and When
You've probably been there: your ML system is humming along, predictions flowing smoothly, and then - suddenly - your dashboard lights up like a Christmas tree. Fifty alerts in five minutes. Some are real problems. Some are noise. Some are telling you about issues that don't matter yet.
The difference between a mature ML observability practice and a chaotic one comes down to knowing what to alert on and when. This isn't just about preventing alert fatigue (though that's important). It's about creating a system where each alert represents genuine, actionable intelligence about your models' health.
Let's build that system together.
Table of Contents
- Why ML Alerting Is Different From Traditional Systems
- Defining SLOs for ML Systems
- Why SLOs Matter in Practice
- The Alert Taxonomy: Three Layers
- Layer 1: Infrastructure Alerts
- Layer 2: Service Alerts
- Layer 3: Model Quality Alerts
- Multi-Window Burn Rate Alerts: The Google SRE Approach
- Why This Approach Is Revolutionary
- Prediction Quality Alerts: Drift Detection
- Output Distribution Drift
- Embedding Drift
- Ground Truth Feedback Loop
- Reducing Alert Fatigue: The Real Challenge
- 1. Symptom-Based Alerting
- 2. Alert Inhibition During Maintenance
- 3. Composite Alerts and Correlation
- 4. ML-Based Anomaly Detection (Advanced)
- Alert Fatigue in Numbers
- Implementation Roadmap: Starting Simple
- Real-World Alerting Scenarios
- Scenario 1: The Silent Model Degradation
- Scenario 2: The Cascading Failure
- Scenario 3: The Resource Squeeze
- Real-World Questions from ML Teams
- Putting It Together: AlertManager Configuration
- Visualizing Your Alert Strategy
- The SLO Decision Framework
- Summary
Why ML Alerting Is Different From Traditional Systems
Traditional infrastructure monitoring is straightforward: your server is either up or down, your database either responds or hangs. ML systems are sneakier. Your model can be completely live, serving predictions with perfect latency, and still be completely wrong. This is the fundamental challenge of ML monitoring: all your traditional signals can look healthy while your model silently decays.
Consider what happens when you rely purely on infrastructure metrics for an ML system. Your CPU is fine. Your memory is fine. Your network is fine. Your inference latency is perfect. All green lights. But your recommendation model has learned outdated patterns. It's recommending products users don't want. Your fraud detection model is missing fraud because user behavior has changed. Your forecasting model is predicting demand incorrectly because the world changed. The infrastructure looks perfect while the model has become useless.
This is why traditional alerting fails for ML systems. A traditional on-call engineer monitoring infrastructure metrics would never be alerted to these problems. They might notice slow business metrics - engagement dropping, fraud increasing, forecast inaccuracy rising - but those metrics aren't tied to the ML system's health, so the alerts don't fire.
The solution is ML-specific monitoring. You need to understand not just whether your system is running, but whether it's running correctly. This requires domain knowledge. For a recommendation system, you need to understand what "good" recommendations look like and alert when recommendations degrade. For a fraud detector, you need to understand fraud patterns and alert when the model starts missing fraud. For a forecast model, you need to understand demand patterns and alert when forecast accuracy drops.
This domain knowledge is the hard part. Infrastructure monitoring is mechanical - CPU > 90% is bad for any workload. ML monitoring requires thinking about what your specific model should be doing and detecting when it drifts from that expectation.
A model can slowly drift toward worthlessness. A feature pipeline can corrupt data silently. Your embeddings can shift in subtle ways that tank relevance. A recommendation model might start recommending irrelevant products. A fraud detector might start missing obvious fraud. None of this shows up in your infrastructure metrics. Your API responds in 200ms. Your GPU utilization is normal. Your error rates are fine. But your model is broken.
These problems don't trigger traditional alerts because the infrastructure looks healthy. They require domain-specific monitoring. This is why so many ML systems fail invisibly. Teams build sophisticated monitoring for infrastructure - CPUs, memory, disk, network - but forget to monitor the thing that actually matters: is the model still right?
That's where a proper alerting strategy comes in. You need three distinct monitoring layers working together:
- Infrastructure alerts (GPU, memory, network)
- Service alerts (latency, throughput, error rates)
- Model quality alerts (predictions, drift, feedback loops)
Most teams focus on layers one and two. Layer three is where ML differentiation happens. It's also where most teams fall short, which is why so many ML systems degrade undetected. Building this layer requires a different mindset than traditional monitoring. You're not checking if the system is up. You're checking if the system is right.
Defining SLOs for ML Systems
Before you can alert intelligently, you need to know what "healthy" actually means. Service Level Objectives (SLOs) are your foundation.
An SLO is a contractual promise: we will deliver X level of service over Y time period. Unlike an SLA (Service Level Agreement, which includes penalties for missing it), an SLO is your internal target. It shapes your alerting, your reliability engineering, your on-call rotations, your deployment strategies.
The magic of SLOs is that they force clarity about what actually matters. Do you care about P99 latency or average latency? (You care about P99 - users feel the slow cases.) Do you want 99.9% uptime or 99.99%? (99.9% allows 43 minutes of downtime monthly, 99.99% allows 4 minutes. Pick based on impact if the service goes down.) What accuracy baseline does your model need to maintain? SLOs make these decisions explicit and force you to quantify them.
For ML specifically, you need three dimensions of SLOs: latency SLO (predictions arrive within X milliseconds), availability SLO (the service is up X% of the time), and quality SLO (predictions are correct within X% accuracy or other domain metric). Most teams focus on latency and availability, treating quality as an afterthought. That's a mistake. Your users care about quality most - a slow prediction is bad, a down service is bad, but a wrong prediction is worst. Yet quality SLOs are hardest to define because they're domain-specific.
The benefit is that SLOs become the source of truth for your alerting strategy. Once you have SLOs, everything else follows: what to alert on (things that violate SLOs), when to page (fast burn of error budget), when to wait (slow burn of error budget), and when to deploy (only if you still have budget to spare).
For ML systems, typical SLOs look like this:
Latency SLO: P99 latency < 500ms
- "99% of predictions arrive in under 500 milliseconds"
- Measured over a rolling 30-day window
- Supports your business requirements (user experience, real-time constraints)
Availability SLO: 99.9% uptime
- Your model endpoint is reachable and responds
- Account for planned maintenance windows
- This is about service availability, not prediction quality
Quality SLO: Population Stability Index (PSI) < 0.2
- Your prediction distribution hasn't shifted dramatically from baseline
- PSI measures how much your current predictions differ from your training distribution
- Lower is better; PSI > 0.2 usually signals something's broken
Error Budget: If your SLO is 99.9%, your error budget is 0.1%
- Over 30 days: ~43 minutes of unplanned downtime
- Over 90 days: ~130 minutes
- Your burn rate tells you how fast you're consuming this budget
The key insight? You only have so many failures to burn through. Once your error budget is consumed, you need to take action - either fix the problem or reduce traffic.
SLO: 99.9% availability
Daily error budget: 86.4 seconds (1 day = 86,400 seconds)
If you burn 86.4 seconds in 1 hour → 24x baseline burn rate
If you burn 86.4 seconds in 6 hours → 4x baseline burn rate
This math drives everything that follows.
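This budget arithmetic is easy to script. A quick sketch in Python (function names are mine, for illustration only):

```python
def error_budget_seconds(slo: float, window_seconds: float) -> float:
    """Seconds of allowed errors/downtime in the window for a given SLO."""
    return (1.0 - slo) * window_seconds

def burn_rate(budget_consumed_s: float, elapsed_s: float, slo: float) -> float:
    """Multiples of the baseline rate (1.0 = budget lasts exactly the window)."""
    return (budget_consumed_s / elapsed_s) / (1.0 - slo)

daily_budget = error_budget_seconds(0.999, 86_400)         # 86.4 seconds/day
monthly_budget = error_budget_seconds(0.999, 30 * 86_400)  # ~2,592 seconds/month

# Burn the entire monthly budget in 1 hour vs. 6 hours:
fast = burn_rate(monthly_budget, 3_600, 0.999)       # 720x baseline
slow = burn_rate(monthly_budget, 6 * 3_600, 0.999)   # 120x baseline
```

The 720x and 120x values are exactly the multipliers used in the burn-rate alerts later in this article.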
Why SLOs Matter in Practice
Most teams skip SLO definition because "it feels like extra work." But here's what happens without them: you end up with random alert thresholds, no correlation between metrics, and no way to distinguish critical signals from noise. An engineer wards off alert fatigue by setting thresholds so high that they miss real problems. Then a P99 latency spike at 2 AM goes unnoticed because the threshold was "good enough" and no alert ever fired.
With SLOs, you have a shared understanding: we care about P99 < 500ms because our customers experience delays above that. We're willing to burn 86 seconds of downtime per day, but not more, because the business would suffer. This shared understanding lets you design alerts that align with business impact, not just technical metrics.
The Alert Taxonomy: Three Layers
Let's get specific about what to alert on.
Layer 1: Infrastructure Alerts
These are your smoke detectors. They tell you when the building's on fire.
- GPU/TPU Memory OOM: If you're running out of VRAM, inference will fail or become glacially slow
- CPU Utilization > 90%: Indicates bottleneck, may cause timeout cascades
- Network Latency to Model Server: Unusual spikes indicate network problems
- Container Restarts: Frequent restarts suggest the model server is crashing
- Disk Space: If your model artifacts run out of space, you can't update models
Alert Condition: Trigger immediately on hard limits (OOM, disk full), and on sustained 90th percentile metrics.
- alert: GPUMemoryOOM
expr: gpu_memory_available_bytes < 100000000 # 100MB
for: 1m
annotations:
summary: "GPU memory critically low"
- alert: ContainerRestartSpike
expr: rate(container_restarts_total[5m]) > 0.1 # > 0.1 restarts/sec, about 6/minute
for: 5m
annotations:
summary: "Model server restarting frequently"
Layer 2: Service Alerts
These tell you whether your system is responsive.
- Prediction Latency P99 > 500ms: You're missing your SLO
- Request Error Rate > 1%: Something's returning errors that shouldn't
- Throughput Drop > 50%: Suddenly getting way fewer requests (upstream issue?)
- Response Timeout Rate: Requests timing out at the client
These are your first line of defense. They alert you to systemic availability problems.
- alert: HighPredictionLatency
expr: histogram_quantile(0.99, rate(prediction_latency_seconds_bucket[5m])) > 0.5
for: 5m
annotations:
summary: "P99 latency exceeding SLO"
runbook: "Check GPU utilization, batch size, downstream API calls"
- alert: HighErrorRate
expr: rate(prediction_errors_total[5m]) / rate(predictions_total[5m]) > 0.01
for: 10m
annotations:
summary: "Error rate > 1%"
Layer 3: Model Quality Alerts
This is where you catch silent failures. Your model is serving, but it's serving garbage.
- Prediction Distribution Shift (PSI > 0.2): Your predictions have moved away from the baseline
- Input Feature Distribution Shift: Your features look different than training data
- Embedding Drift: Your learned representations are shifting
- Ground Truth Feedback Divergence: Users are correcting your predictions more than usual
- Calibration Drift: Your confidence scores no longer match actual accuracy
These require domain knowledge and ground truth feedback loops.
- alert: PredictionDistributionShift
expr: psi_score > 0.2
for: 1h
annotations:
summary: "Prediction distribution shifted"
runbook: "Review recent training data, check for data pipeline changes"
- alert: HighUserCorrectionRate
expr: rate(user_corrections_total[1h]) / rate(predictions_total[1h]) > 0.05
annotations:
summary: "Users correcting >5% of predictions"
Multi-Window Burn Rate Alerts: The Google SRE Approach
Here's where things get sophisticated. Google SRE's multi-window burn rate approach lets you catch problems before you consume your entire error budget.
The idea: measure how fast you're burning through your error budget at different timescales. If you're burning fast, you need to move fast. If you're burning slowly, you have time to think.
Baseline error budget burn rate = 1.0 (consuming the budget exactly over the SLO window)
With a 99.9% SLO, baseline consumption corresponds to a steady 0.1% error rate
A "fast burn" is when you'd burn your *entire monthly budget in 1 hour*
A "slow burn" is when you'd burn it in 6 hours; spreading it over the full 30 days is expected
Fast burn multiplier = 30 days / 1 hour = 720x
Slow burn multiplier = 30 days / 6 hours = 120x
In practice:
Fast Burn Alert (1-hour window):
- Your error rate is 720x the baseline
- You're on pace to burn your entire monthly budget in 1 hour
- Action: Page an on-call engineer immediately, consider traffic reduction
- Threshold: 0.4% error rate (720 × 0.1%, scaled by the 20-second duration buffer in the math below)
Slow Burn Alert (6-hour window):
- Your error rate is 120x the baseline
- You're on pace to burn your entire monthly budget in 6 hours
- Action: Create incident ticket, investigate root cause
- Threshold: 0.2% error rate (120 × 0.1%, scaled by the 360-second duration buffer in the math below)
The math looks like this:
Target SLO = 99.9% (0.1% error budget)
Burn rate multiplier (1 hour) = 24 * 30 = 720x
Fast burn threshold = 0.001 * 720 = 0.72 (72% errors)
↓ (with 20 sec duration buffer)
= 0.72 * (20 sec / 3600 sec) = 0.004 = 0.4%
Burn rate multiplier (6 hours) = 4 * 30 = 120x
Slow burn threshold = 0.001 * 120 = 0.12 (12% errors)
↓ (with 360 sec duration buffer)
= 0.12 * (360 sec / 21600 sec) = 0.002 = 0.2%
Why both windows? Fast burn catches catastrophic failures (page someone now). Slow burn catches degradation (fix it this week). Together, they prevent alert fatigue while catching real problems.
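Scripting the derivation makes the two windows easy to compare. This sketch follows the worked example's numbers, including its duration-buffer scaling; it is illustrative, not a standard library API:

```python
def burn_multiplier(slo_window_hours: float, alert_window_hours: float) -> float:
    """Multiplier for consuming the whole SLO-window budget in the alert window."""
    return slo_window_hours / alert_window_hours

def alert_threshold(error_budget: float, multiplier: float,
                    buffer_s: float, window_s: float) -> float:
    """Error-rate threshold after the duration-buffer scaling shown above."""
    return error_budget * multiplier * (buffer_s / window_s)

HOURS_30D = 24 * 30  # a 30-day SLO window is 720 hours

fast = alert_threshold(0.001, burn_multiplier(HOURS_30D, 1), 20, 3_600)    # 0.004 → 0.4%
slow = alert_threshold(0.001, burn_multiplier(HOURS_30D, 6), 360, 21_600)  # 0.002 → 0.2%
```

These two values, 0.4% and 0.2%, are the thresholds wired into the Prometheus rules later in this article.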
Why This Approach Is Revolutionary
Traditional alerting says "alert if error rate > 1%". That's static. It doesn't account for context. Maybe 1% errors is fine - your error budget can absorb it. Maybe 0.1% errors is catastrophic - you're burning budget at 10x normal rate. Burn rate alerts adapt to your SLO, which means they adapt to your business impact.
Prediction Quality Alerts: Drift Detection
Your model can serve predictions that are technically "correct" (the model ran, inference succeeded) but increasingly wrong.
Here's what to monitor:
Output Distribution Drift
Compare your current prediction distribution to your baseline (from training data or recent history).
# PSI calculation (runnable version, using NumPy)
import numpy as np

def psi_score(current_dist, baseline_dist, bins=10):
    """Population Stability Index: measures distribution shift."""
    # Split the baseline into quantile bins
    edges = np.quantile(baseline_dist, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    current_pct = np.histogram(current_dist, edges)[0] / len(current_dist)
    baseline_pct = np.histogram(baseline_dist, edges)[0] / len(baseline_dist)
    # Clip empty bins to avoid log(0) and division by zero
    current_pct = np.clip(current_pct, 1e-6, None)
    baseline_pct = np.clip(baseline_pct, 1e-6, None)
    # For each bin: (current_pct - baseline_pct) * ln(current_pct / baseline_pct)
    # Summed across bins; PSI > 0.2 signals significant shift
    return float(np.sum((current_pct - baseline_pct)
                        * np.log(current_pct / baseline_pct)))
Alert when: PSI > 0.2 for 1+ hours. Action: Retrain the model on recent data, or reduce traffic until you understand the shift.
Embedding Drift
If your model uses learned embeddings (recommenders, search, NLP):
# Compare embeddings from incoming data to reference embeddings
# Use cosine similarity or L2 distance
current_embeddings = model.encode(current_batch)
baseline_embeddings = reference_embeddings # from training
similarity = cosine_similarity(current_embeddings, baseline_embeddings)
if mean(similarity) < threshold: # embeddings diverging
alert("embedding_drift")
Ground Truth Feedback Loop
Once users interact with your predictions, you get feedback. Monitor it:
# After collecting ground truth feedback
actual_labels = get_ground_truth_feedback()
predictions = get_recent_predictions()
# Calculate metric that matters (accuracy, AUC, MAP@k, etc)
current_accuracy = calculate_accuracy(predictions, actual_labels)
# Compare to historical baseline
if current_accuracy < baseline_accuracy * 0.95: # 5% drop
alert("accuracy_degradation")
This signal is gold because ground truth is objective. If accuracy is dropping, your model needs help.
Reducing Alert Fatigue: The Real Challenge
You can define a hundred alerts. The hard part is tuning them so only real problems trigger pages.
This is the dirty secret of monitoring: defining metrics is easy. Choosing alert thresholds is hard. Set the threshold too low and you're paged constantly on false alarms; set it too high and you miss real problems. Constant false alarms mean you stop believing alerts and stop responding quickly - a phenomenon called alert fatigue. An on-call engineer who's been paged 20 times in the past week for non-critical alerts will treat the 21st page with skepticism, even if it's critical.
The statistics bear this out. Studies of production systems show alert fatigue is endemic. Teams receive thousands of alerts per month but only 10-20% are actionable. The noise crowds out the signal. The solution isn't fewer alerts (you actually do need to monitor those things), but smarter alerts that are more selective about what triggers a page.
The other dimension of alert tuning is understanding your baseline. What does "normal" look like for your system? This changes over time. A model that just got deployed might have different behavior characteristics than one that's been running for six months. Traffic patterns change seasonally. Your data distribution shifts. What counts as healthy changes.
This is why smart teams review their alerts regularly. Once a quarter, you look at alerts that fired and ask: was this actually a problem we needed to respond to? If you're seeing lots of false positives, raise the threshold or add more conditions. If you're missing problems, lower the threshold or add new signals.
The best practice is instrumenting alerts with action buttons or runbooks so on-call engineers can quickly understand what to do when an alert fires. An alert that just says "high latency" is less useful than "high latency - check GPU utilization (dashboard link) and see if batch size needs adjustment (runbook link)." A well-designed alert provides context and action directly.
This is where most organizations struggle. They build comprehensive monitoring, get flooded with alerts, and people start ignoring them. When a critical alert fires, nobody notices because they've trained themselves to tune out the noise. That's alert fatigue at work, and it's the enemy of observability. It's not just annoying - it's dangerous. Alert fatigue has directly caused major incidents when engineers didn't respond to critical alerts because they were numb to the notifications.
The solution isn't fewer alerts. It's smarter alerts.
1. Symptom-Based Alerting
Alert on symptoms (impacts to users), not causes (internal metrics).
Bad: "Alert on CPU > 90%"
Good: "Alert on P99 latency > 500ms"
The second one tells you something users care about. The first is just a proxy. There might be good reasons CPU is high (spike in traffic) that don't matter. Or CPU could be low but your model could still be latency-bound due to network calls.
Think about it: what does the user experience? Slow predictions. What matters to your business? Prediction quality and speed. A high CPU alert might be a symptom, but it's not the disease. The disease is missing your SLO.
This principle applies across all three alert layers. For infrastructure, alert on the impact (latency spike) not the mechanism (CPU). For quality, alert on the outcome (accuracy drop) not the signal (a specific metric threshold).
2. Alert Inhibition During Maintenance
You know you're taking the service down for 30 minutes. Silence alerts during that window.
inhibit_rules:
  - source_match:
      alertname: "PlannedMaintenance"
    target_match:
      alertname: "HighLatency"
    equal: ["instance"]
When a PlannedMaintenance alert is firing for an instance, HighLatency won't fire for it.
More importantly, you can use inhibition rules to suppress lower-severity alerts when higher-severity ones fire. For example, don't alert on high error rate if the service is completely down. The downstream issue (high error rate) is a symptom of the upstream problem (service down).
Inhibition rules prevent alert storms where one failure cascades into dozens of redundant notifications.
3. Composite Alerts and Correlation
Sometimes one thing causes ten alerts. Correlate them:
- alert: ModelServerCrash
expr: |
rate(container_restarts_total[5m]) > 0.1
and on() rate(prediction_errors_total[5m]) > 0.1
for: 2m
annotations:
summary: "Model server crashing and returning errors"
Instead of 10 separate alerts, you get 1 that says "here's the problem."
This requires understanding causal relationships in your system. A container restart causes error spikes. A model overfitting causes poor prediction quality on new data. When you understand these relationships, you can create composite alerts that tell the real story.
For ML systems specifically, think about what cascades:
- Feature pipeline failure → prediction errors → user corrections → accuracy drop
- Model deployment bug → latency spike → error rate increase → user complaints
- Resource shortage → batch timeout → queue buildup → throughput drop
Create alerts for the chain, not individual links.
4. ML-Based Anomaly Detection (Advanced)
For sophisticated teams:
# Use an unsupervised anomaly detector on your metrics
from sklearn.ensemble import IsolationForest
# Train on historical "normal" behavior
normal_metrics = collect_metrics(days=30)
detector = IsolationForest(contamination=0.05)
detector.fit(normal_metrics)
# Score new incoming metrics
current_metrics = collect_metrics(minutes=5)
anomaly_scores = detector.score_samples(current_metrics)  # lower = more anomalous
if anomaly_scores.min() < threshold:
    alert("unusual_system_behavior")
This catches weird combinations of metrics you didn't think to define explicitly.
The beauty of this approach: you're learning what "normal" actually looks like for your system. Monday morning spike? Normal. 3am traffic drop? Normal. But 3am spike? Anomalous.
The downside: you need 2-4 weeks of baseline data before you can trust it. And you need to retrain when you legitimately change your system's characteristics (new model, new infrastructure, new user base).
Alert Fatigue in Numbers
A study of production systems found that:
- Average team receives 1,500 alerts per service per month
- Of those, 70-80% are noise or redundant
- Average response time to critical alert: 9 minutes
- Alert response time with alert fatigue: 20+ minutes
The math: if you can reduce your alert volume by 75% while keeping signal constant, you cut response time in half. That's the game.
Implementation Roadmap: Starting Simple
Don't try to implement everything at once. Here's a realistic progression:
Week 1-2: Foundation
- Define your three SLOs (latency, availability, quality)
- Create Layer 1 (infrastructure) alerts
- Set up basic AlertManager routing to Slack
Week 3-4: Service Observability
- Add Layer 2 (service) alerts with burn rate multipliers
- Implement alert inhibition for maintenance windows
- Start tracking alert volume and false positive rate
Week 5-6: Quality Monitoring
- Add PSI calculation pipeline
- Create Layer 3 (quality) alerts for prediction drift
- Add ground truth feedback loop collection
Week 7-8: Optimization
- Analyze alerts that fired but didn't require action
- Create composite alerts to reduce noise
- Tune thresholds based on 2 weeks of data
Week 9-10: Advanced
- Implement ML-based anomaly detection
- Create runbooks for the 5 most common alerts
- Establish an on-call rotation with alert ownership
Real-World Alerting Scenarios
Let me walk through a few common scenarios and how proper alerts would handle them.
Scenario 1: The Silent Model Degradation
Your checkout model has been serving for 6 months. Latency is perfect. Errors are zero. Your infrastructure team is sleeping soundly.
But the product changed. Customers can now buy subscriptions, not just one-time purchases. The model was trained on one-time purchase patterns. On subscriptions, it's only 60% accurate - but you're not checking for accuracy.
Without proper alerting: Users gradually stop using the recommendation feature. Revenue slowly declines. You discover the problem 3 months later in a business review.
With proper alerting:
- You have a ground truth feedback loop where users indicate whether recommendations were helpful
- You calculate rolling accuracy every hour
- You alert when accuracy drops 5% below baseline
- You catch it in 2 hours
- You retrain on new patterns or route subscription customers to a different model
Cost of the second scenario: 2 hours of manual investigation and a retrain job. Cost of the first: $500K in lost revenue.
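The feedback-loop check from that scenario can be sketched in a few lines. The helper names and sample data here are hypothetical, standing in for your prediction log and feedback store:

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match ground truth labels."""
    return sum(p == l for p, l in zip(predictions, labels)) / len(predictions)

def accuracy_alert(predictions, labels, baseline_accuracy, max_drop=0.05):
    """Fire when rolling accuracy falls more than max_drop below baseline."""
    return accuracy(predictions, labels) < baseline_accuracy * (1 - max_drop)

# Baseline 90%; the new subscription traffic is only ~60% accurate
preds  = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
labels = [1, 1, 1, 0, 0, 1, 1, 1, 0, 0]
assert accuracy_alert(preds, labels, baseline_accuracy=0.90)  # fires
```

Run hourly over the most recent window of feedback, this is the whole "catch it in 2 hours" mechanism.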
Scenario 2: The Cascading Failure
Your feature engineering pipeline runs every hour. At 2 AM, there's a brief network blip. One feature computation fails and retries, but the retry logic has a bug. It now returns NaN values for 0.001% of requests.
The model handles NaN gracefully - it just returns the average prediction. Latency? Perfect. Error rate? Zero. The service is technically working.
But now 0.001% of predictions are garbage. Multiply that across millions of requests per day, and you're degrading quality silently.
Without proper alerting: You find out 3 weeks later when you run a manual quality audit.
With proper alerting:
- You're monitoring input feature distribution (are we getting unexpected values?)
- You alert when NaN frequency exceeds baseline
- Or you're monitoring output distributions and alert when predictions become suspiciously uniform
- You catch it within an hour
- You can either fix the pipeline or add NaN handling logic
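The NaN-frequency check from that list might look like this (a sketch; the baseline rate is whatever your feature history shows, near zero in this scenario):

```python
import math

def nan_rate(values):
    """Fraction of feature values that are NaN."""
    return sum(math.isnan(v) for v in values) / len(values)

def nan_alert(values, baseline_rate=0.0, tolerance=1e-5):
    """Fire when NaN frequency exceeds the historical baseline."""
    return nan_rate(values) > baseline_rate + tolerance

batch = [0.7, 0.2, float("nan"), 0.9]
assert nan_alert(batch)  # 25% NaN, far above a near-zero baseline
```

Note the tolerance: even at 0.001% NaN, a strict greater-than-zero check catches the bug, while the tolerance keeps a single stray NaN in a huge batch from paging anyone.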
Scenario 3: The Resource Squeeze
Your model uses 8GB of VRAM. Your GPU has 40GB. Life is good. Then product scales up 10x. Batch size increases. Model size increases (you added features). One day, you OOM.
Without proper alerting: The service crashes. Customers get errors for 15 minutes until your auto-scaling or recovery logic kicks in.
With proper alerting:
- You're tracking GPU memory usage continuously
- You alert at 70% utilization (still healthy, but trending badly)
- You alert at 90% utilization (critical, need action now)
- When the 70% alert fires, your on-call engineer can scale out or optimize batch size before hitting the wall
This is an infrastructure alert, but it's tied to a business impact: keeping the service available.
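The tiered thresholds above reduce to a small severity map. This is illustrative, not tied to any monitoring client:

```python
def gpu_memory_severity(used_bytes: int, total_bytes: int):
    """Map GPU memory utilization to the tiered alert levels above."""
    utilization = used_bytes / total_bytes
    if utilization >= 0.90:
        return "critical"   # need action now
    if utilization >= 0.70:
        return "warning"    # still healthy, but trending badly
    return None             # no alert

GB = 1024 ** 3
assert gpu_memory_severity(30 * GB, 40 * GB) == "warning"   # 75% used
assert gpu_memory_severity(38 * GB, 40 * GB) == "critical"  # 95% used
```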
Real-World Questions from ML Teams
We've worked with dozens of teams on their monitoring strategies. Here are the questions that come up repeatedly:
Q: Should we alert on every metric we track?
No. Track everything you need for debugging, but only alert on things that require immediate action. Use your monitoring dashboard for exploration, use alerts for action.
Q: How often should we retune alert thresholds?
At least quarterly, or whenever you make major changes (new model, new infrastructure, new traffic patterns). Too tight and you get false positives. Too loose and you miss real problems.
Q: What should be our target alert volume?
Aim for 5-10 actionable alerts per service per week. If you're getting several per day, you have too much noise. If you're getting fewer than one per week, you might be missing problems.
Q: Who should we page for a warning-level alert?
Create a ticket, don't page. Pages should be reserved for critical alerts where human intervention is needed within 15 minutes. Warnings can wait until the next business day for triage.
Q: Should we alert on predicted metrics?
Yes, if you have a model that predicts when a resource will be exhausted. Predicting GPU OOM in 2 hours is valuable. But keep predictions validated - if your prediction model is wrong, it's noise.
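For the predicted-exhaustion idea, even a least-squares extrapolation is enough to start. A sketch with made-up sample data (hours, GB used):

```python
def hours_until_exhaustion(samples, capacity):
    """Linear least-squares extrapolation of (hour, used) samples to capacity.

    Returns None if usage is flat or shrinking.
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    den = sum((x - mean_x) ** 2 for x, _ in samples)
    slope = num / den  # growth per hour
    if slope <= 0:
        return None
    latest_x, latest_y = samples[-1]
    return (capacity - latest_y) / slope

# Growing 2 GB/hour with 4 GB of headroom left → ~2 hours to OOM
samples = [(0, 30.0), (1, 32.0), (2, 34.0), (3, 36.0)]
assert abs(hours_until_exhaustion(samples, 40.0) - 2.0) < 1e-9
```

Validate the forecast the same way you validate any alert: if it keeps predicting exhaustion that never comes, it's noise.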
Putting It Together: AlertManager Configuration
Here's a production-ready AlertManager configuration with multi-window burn rates and ML alert rules:
# alertmanager.yml
global:
resolve_timeout: 5m
slack_api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
inhibit_rules:
# Don't alert on latency if the service is down
- source_match:
severity: "critical"
alertname: "PredictionServiceDown"
target_match:
alertname: "HighLatency"
equal: ["instance"]
# Silence model alerts during maintenance windows
- source_match:
alertname: "PlannedMaintenance"
target_match_re:
alertname: ".*Drift|.*Quality"
equal: ["job"]
route:
receiver: "default"
group_by: ["alertname", "cluster"]
group_wait: 10s
group_interval: 10s
repeat_interval: 4h
# Fast burn → immediate page
routes:
- match:
severity: "critical"
receiver: "pagerduty"
group_wait: 0s
repeat_interval: 15m
# Slow burn → incident ticket
- match:
severity: "warning"
receiver: "slack"
group_wait: 30s
repeat_interval: 2h
receivers:
- name: "default"
slack_configs:
- channel: "#ml-alerts"
title: "{{ .GroupLabels.alertname }}"
text: "{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}"
- name: "pagerduty"
pagerduty_configs:
- service_key: "${PAGERDUTY_SERVICE_KEY}"
---
# prometheus_rules.yml
groups:
- name: ml_service_alerts
interval: 30s
rules:
# Layer 2: Service Alerts
- alert: HighPredictionLatencyFastBurn
expr: |
histogram_quantile(0.99,
rate(prediction_latency_seconds_bucket[1m])
) > 0.5
for: 1m
labels:
severity: critical
annotations:
summary: "P99 latency > 500ms (fast burn)"
runbook: "https://wiki.internal/runbook/high-latency"
- alert: HighPredictionLatencySlowBurn
expr: |
histogram_quantile(0.99,
rate(prediction_latency_seconds_bucket[6m])
) > 0.5
for: 5m
labels:
severity: warning
annotations:
summary: "P99 latency > 500ms (slow burn)"
# Multi-window error rate (burn rate)
- alert: HighErrorRateFastBurn
expr: |
(rate(prediction_errors_total[1m]) /
rate(predictions_total[1m])) > 0.004
for: 1m
labels:
severity: critical
annotations:
summary: "Error rate > 0.4% (burning budget in 1 hour)"
- alert: HighErrorRateSlowBurn
expr: |
(rate(prediction_errors_total[6m]) /
rate(predictions_total[6m])) > 0.002
for: 10m
labels:
severity: warning
annotations:
summary: "Error rate > 0.2% (burning budget in 6 hours)"
# Layer 3: Model Quality Alerts
- alert: PredictionDistributionShift
expr: psi_score > 0.2
for: 1h
labels:
severity: warning
alert_type: "quality"
annotations:
summary: "Prediction distribution shifted (PSI={{ $value | humanize }})"
runbook: "Check feature pipelines, recent training data changes"
- alert: EmbeddingDrift
expr: |
mean(embedding_similarity_to_baseline) < 0.85
for: 2h
labels:
severity: warning
alert_type: "quality"
annotations:
summary: "Embedding drift detected"
- alert: AccuracyDegradation
expr: |
current_accuracy < baseline_accuracy * 0.95
for: 30m
labels:
severity: warning
alert_type: "quality"
annotations:
summary: "Model accuracy dropped > 5%"
dashboard: "https://dash.internal/model-metrics"
- alert: HighUserCorrectionRate
expr: |
(rate(user_corrections_total[1h]) /
rate(predictions_total[1h])) > 0.05
for: 1h
labels:
severity: warning
alert_type: "quality"
annotations:
summary: "Users correcting > 5% of predictions"
# Layer 1: Infrastructure
- alert: GPUMemoryOOM
expr: gpu_memory_available_bytes < 100000000
for: 1m
labels:
severity: critical
annotations:
summary: "GPU memory critically low"
- alert: ContainerRestartSpike
expr: |
rate(container_restarts_total[5m]) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "Model server restarting (>6/minute)"
Visualizing Your Alert Strategy
Here's how these pieces fit together:
graph TD
A[SLO Defined<br/>99.9% Availability<br/>P99<500ms] -->|defines| B[Error Budget<br/>86.4 seconds/day]
B -->|drives| C[Burn Rate Multipliers<br/>1h: 720x<br/>6h: 120x]
C -->|creates| D1[Fast Burn Alert<br/>Error Rate > 0.4%<br/>Page Engineer]
C -->|creates| D2[Slow Burn Alert<br/>Error Rate > 0.2%<br/>File Ticket]
E[Infrastructure Metrics<br/>GPU, CPU, Network] -->|feed| F{Symptom-Based<br/>Alerts}
G[Service Metrics<br/>Latency, Errors, Throughput] -->|feed| F
H[Quality Metrics<br/>PSI, Drift, Feedback] -->|feed| F
F -->|yes| I[Inhibit During<br/>Maintenance?]
I -->|no| J[Correlate with<br/>Other Alerts?]
J -->|no| K[Page/Ticket/Observe]
J -->|yes| L[Composite Alert<br/>Root Cause]
And here's the decision framework for what to monitor:
graph TD
A["Is this a user-facing metric?"] -->|yes| B["Measure and Alert"]
A -->|no| C{"Could it cause<br/>user impact?"}
C -->|yes| D["Measure, Alert with<br/>Correlation"]
C -->|no| E["Measure, Don't Alert"]
B -->|P99 Latency?| F["SLO-based<br/>Multi-Window"]
B -->|Error Rate?| G["Burn Rate<br/>Fast/Slow"]
B -->|Prediction Quality?| H["Domain Metrics<br/>PSI, Drift, Feedback"]
D -->|GPU Util?| I["Alert on Spikes<br/>or Duration"]
D -->|CPU Util?| J["Alert on 90th%ile<br/>+ Duration"]
The SLO Decision Framework
When you're deciding what SLO to set:
- Start with user impact: What do users actually care about? Latency? Accuracy?
- Measure the baseline: What are you actually delivering today?
- Set ambitious but achievable SLO: Usually 99.9-99.95% for most systems
- Define error budget usage: How fast can it burn before you take action?
- Set alert thresholds: Use burn rate multipliers (1h, 6h, 30d)
- Iterate: You'll get this wrong the first time. That's normal.
Summary
Building a mature alerting strategy for ML systems means:
- Define SLOs that capture latency, availability, and quality
- Understand error budgets and burn rate multipliers for fast/slow problems
- Layer your alerts: infrastructure, service, quality
- Monitor for drift in predictions, embeddings, and ground truth feedback
- Reduce fatigue through symptom-based alerting and intelligent inhibition
- Use AlertManager to route critical alerts to on-call, warnings to tickets
The teams with the best ML systems aren't those with the most alerts. They're the ones with the right alerts - ones that tell them something actionable about their system's health.
Start with the infrastructure and service layers. Get those stable. Then add quality monitoring. You'll be amazed at what you catch once you're looking for silent failures.