Alerting Strategies for ML Systems: What to Monitor and When
You've probably been there: your ML system is humming along, predictions flowing smoothly, and then - suddenly - your dashboard lights up like a Christmas tree. Fifty alerts in five minutes. Some are real problems. Some are noise. Some are telling you about issues that don't matter yet.
The difference between a mature ML observability practice and a chaotic one comes down to knowing what to alert on and when. This isn't just about preventing alert fatigue (though that's important). It's about creating a system where each alert represents genuine, actionable intelligence about your models' health.
Let's build that system together.
Table of Contents
- Why ML Alerting Is Different From Traditional Systems
- Defining SLOs for ML Systems
- Why SLOs Matter in Practice
- The Alert Taxonomy: Three Layers
- Layer 1: Infrastructure Alerts
- Layer 2: Service Alerts
- Layer 3: Model Quality Alerts
- Multi-Window Burn Rate Alerts: The Google SRE Approach
- Why This Approach Is Revolutionary
- Prediction Quality Alerts: Drift Detection
- Output Distribution Drift
- Embedding Drift
- Ground Truth Feedback Loop
- Reducing Alert Fatigue: The Real Challenge
- 1. Symptom-Based Alerting
- 2. Alert Inhibition During Maintenance
- 3. Composite Alerts and Correlation
- 4. ML-Based Anomaly Detection (Advanced)
- Alert Fatigue in Numbers
- Implementation Roadmap: Starting Simple
- Real-World Alerting Scenarios
- Scenario 1: The Silent Model Degradation
- Scenario 2: The Cascading Failure
- Scenario 3: The Resource Squeeze
- Real-World Questions from ML Teams
- Putting It Together: AlertManager Configuration
- Visualizing Your Alert Strategy
- The SLO Decision Framework
- Summary
Why ML Alerting Is Different From Traditional Systems
Traditional infrastructure monitoring is straightforward: your server is either up or down, your database either responds or hangs. ML systems are sneakier. Your model can be completely live, serving predictions with perfect latency, and still be completely wrong. This is the fundamental challenge of ML monitoring: all your traditional signals can look healthy while your model silently decays.
Consider what happens when you rely purely on infrastructure metrics for an ML system. Your CPU is fine. Your memory is fine. Your network is fine. Your inference latency is perfect. All green lights. But your recommendation model has learned outdated patterns. It's recommending products users don't want. Your fraud detection model is missing fraud because user behavior has changed. Your forecasting model is predicting demand incorrectly because the world changed. The infrastructure looks perfect while the model has become useless.
This is why traditional alerting fails for ML systems. A traditional on-call engineer monitoring infrastructure metrics would never be alerted to these problems. They might notice slow business metrics - engagement dropping, fraud increasing, forecast inaccuracy rising - but those metrics aren't tied to the ML system's health, so the alerts don't fire.
The solution is ML-specific monitoring. You need to understand not just whether your system is running, but whether it's running correctly. This requires domain knowledge. For a recommendation system, you need to understand what "good" recommendations look like and alert when recommendations degrade. For a fraud detector, you need to understand fraud patterns and alert when the model starts missing fraud. For a forecast model, you need to understand demand patterns and alert when forecast accuracy drops.
This domain knowledge is the hard part. Infrastructure monitoring is mechanical - CPU > 90% is bad for any workload. ML monitoring requires thinking about what your specific model should be doing and detecting when it drifts from that expectation.
A model can slowly drift toward worthlessness. A feature pipeline can corrupt data silently. Your embeddings can shift in subtle ways that tank relevance. A recommendation model might start recommending irrelevant products. A fraud detector might start missing obvious fraud. None of this shows up in your infrastructure metrics. Your API responds in 200ms. Your GPU utilization is normal. Your error rates are fine. But your model is broken.
These problems don't trigger traditional alerts because the infrastructure looks healthy. They require domain-specific monitoring. This is why so many ML systems fail invisibly. Teams build sophisticated monitoring for infrastructure - CPUs, memory, disk, network - but forget to monitor the thing that actually matters: is the model still right?
That's where a proper alerting strategy comes in. You need three distinct monitoring layers working together:
- Infrastructure alerts (GPU, memory, network)
- Service alerts (latency, throughput, error rates)
- Model quality alerts (predictions, drift, feedback loops)
Most teams focus on layers one and two. Layer three is where ML differentiation happens. It's also where most teams fall short, which is why so many ML systems degrade undetected. Building this layer requires a different mindset than traditional monitoring. You're not checking if the system is up. You're checking if the system is right.
Defining SLOs for ML Systems
Before you can alert intelligently, you need to know what "healthy" actually means. Service Level Objectives (SLOs) are your foundation.
An SLO is a contractual promise: we will deliver X level of service over Y time period. Unlike an SLA (Service Level Agreement, which includes penalties for missing it), an SLO is your internal target. It shapes your alerting, your reliability engineering, your on-call rotations, your deployment strategies.
The magic of SLOs is that they force clarity about what actually matters. Do you care about P99 latency or average latency? (You care about P99 - users feel the slow cases.) Do you want 99.9% uptime or 99.99%? (99.9% allows 43 minutes of downtime monthly, 99.99% allows 4 minutes. Pick based on impact if the service goes down.) What accuracy baseline does your model need to maintain? SLOs make these decisions explicit and force you to quantify them.
For ML specifically, you need three dimensions of SLOs: latency SLO (predictions arrive within X milliseconds), availability SLO (the service is up X% of the time), and quality SLO (predictions are correct within X% accuracy or other domain metric). Most teams focus on latency and availability, treating quality as an afterthought. That's a mistake. Your users care about quality most - a slow prediction is bad, a down service is bad, but a wrong prediction is worst. Yet quality SLOs are hardest to define because they're domain-specific.
The benefit is that SLOs become the source of truth for your alerting strategy. Once you have SLOs, everything else follows: what to alert on (things that violate SLOs), when to page (fast burn of error budget), when to wait (slow burn of error budget), and when to deploy (only if you still have budget to spare).
For ML systems, typical SLOs look like this:
Latency SLO: P99 latency < 500ms
- "99% of predictions arrive in under 500 milliseconds"
- Measured over a rolling 30-day window
- Supports your business requirements (user experience, real-time constraints)
Availability SLO: 99.9% uptime
- Your model endpoint is reachable and responds
- Account for planned maintenance windows
- This is about service availability, not prediction quality
Quality SLO: Population Stability Index (PSI) < 0.2
- Your prediction distribution hasn't shifted dramatically from baseline
- PSI measures how much your current predictions differ from your training distribution
- Lower is better; PSI > 0.2 usually signals something's broken
Error Budget: If your SLO is 99.9%, your error budget is 0.1%
- Over 30 days: ~43 minutes of unplanned downtime
- Over 90 days: ~130 minutes
- Your burn rate tells you how fast you're consuming this budget
The key insight? You only have so many failures to burn through. Once your error budget is consumed, you need to take action - either fix the problem or reduce traffic.
SLO: 99.9% availability
Daily error budget: 86.4 seconds (1 day = 86,400 seconds)
If you burn 86.4 seconds in 1 hour → 24x baseline burn rate
If you burn 86.4 seconds in 6 hours → 4x baseline burn rate
This math drives everything that follows.
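This budget arithmetic is easy to script. A quick sketch in Python (function names are mine, for illustration only):

```python
def error_budget_seconds(slo: float, window_seconds: float) -> float:
    """Seconds of allowed errors/downtime in the window for a given SLO."""
    return (1.0 - slo) * window_seconds

def burn_rate(budget_consumed_s: float, elapsed_s: float, slo: float) -> float:
    """Multiples of the baseline rate (1.0 = budget lasts exactly the window)."""
    return (budget_consumed_s / elapsed_s) / (1.0 - slo)

daily_budget = error_budget_seconds(0.999, 86_400)         # 86.4 seconds/day
monthly_budget = error_budget_seconds(0.999, 30 * 86_400)  # ~2,592 seconds/month

# Burn the entire monthly budget in 1 hour vs. 6 hours:
fast = burn_rate(monthly_budget, 3_600, 0.999)       # 720x baseline
slow = burn_rate(monthly_budget, 6 * 3_600, 0.999)   # 120x baseline
```

The 720x and 120x values are exactly the multipliers used in the burn-rate alerts later in this article.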
Why SLOs Matter in Practice
Most teams skip SLO definition because "it feels like extra work." But here's what happens without them: you end up with random alert thresholds, no correlation between metrics, and no way to distinguish critical signals from noise. An engineer wards off alert fatigue by setting thresholds so high that they miss real problems. Then a P99 latency spike at 2 AM goes unnoticed because the threshold was "good enough" and no alert ever fired.
With SLOs, you have a shared understanding: we care about P99 < 500ms because our customers experience delays above that. We're willing to burn 86 seconds of downtime per day, but not more, because the business would suffer. This shared understanding lets you design alerts that align with business impact, not just technical metrics.
The Alert Taxonomy: Three Layers
Let's get specific about what to alert on.
Layer 1: Infrastructure Alerts
These are your smoke detectors. They tell you when the building's on fire.
- GPU/TPU Memory OOM: If you're running out of VRAM, inference will fail or become glacially slow
- CPU Utilization > 90%: Indicates bottleneck, may cause timeout cascades
- Network Latency to Model Server: Unusual spikes indicate network problems
- Container Restarts: Frequent restarts suggest the model server is crashing
- Disk Space: If your model artifacts run out of space, you can't update models
Alert Condition: Trigger immediately on hard limits (OOM, disk full), and on sustained 90th percentile metrics.
- alert: GPUMemoryOOM
expr: gpu_memory_available_bytes < 100000000 # 100MB
for: 1m
annotations:
summary: "GPU memory critically low"
- alert: ContainerRestartSpike
expr: rate(container_restarts_total[5m]) > 0.1 # > 0.1 restarts/sec, about 6/minute
for: 5m
annotations:
summary: "Model server restarting frequently"
Layer 2: Service Alerts
These tell you whether your system is responsive.
- Prediction Latency P99 > 500ms: You're missing your SLO
- Request Error Rate > 1%: Something's returning errors that shouldn't
- Throughput Drop > 50%: Suddenly getting way fewer requests (upstream issue?)
- Response Timeout Rate: Requests timing out at the client
These are your first line of defense. They alert you to systemic availability problems.
- alert: HighPredictionLatency
expr: histogram_quantile(0.99, rate(prediction_latency_seconds_bucket[5m])) > 0.5
for: 5m
annotations:
summary: "P99 latency exceeding SLO"
runbook: "Check GPU utilization, batch size, downstream API calls"
- alert: HighErrorRate
expr: rate(prediction_errors_total[5m]) / rate(predictions_total[5m]) > 0.01
for: 10m
annotations:
summary: "Error rate > 1%"
Layer 3: Model Quality Alerts
This is where you catch silent failures. Your model is serving, but it's serving garbage.
- Prediction Distribution Shift (PSI > 0.2): Your predictions have moved away from the baseline
- Input Feature Distribution Shift: Your features look different than training data
- Embedding Drift: Your learned representations are shifting
- Ground Truth Feedback Divergence: Users are correcting your predictions more than usual
- Calibration Drift: Your confidence scores no longer match actual accuracy
These require domain knowledge and ground truth feedback loops.
- alert: PredictionDistributionShift
expr: psi_score > 0.2
for: 1h
annotations:
summary: "Prediction distribution shifted"
runbook: "Review recent training data, check for data pipeline changes"
- alert: HighUserCorrectionRate
expr: rate(user_corrections_total[1h]) / rate(predictions_total[1h]) > 0.05
annotations:
summary: "Users correcting >5% of predictions"
Multi-Window Burn Rate Alerts: The Google SRE Approach
Here's where things get sophisticated. Google SRE's multi-window burn rate approach lets you catch problems before you consume your entire error budget.
The idea: measure how fast you're burning through your error budget at different timescales. If you're burning fast, you need to move fast. If you're burning slowly, you have time to think.
Baseline error budget burn rate = 1.0 (consuming the budget exactly over the SLO window)
With a 99.9% SLO, baseline consumption corresponds to a steady 0.1% error rate
A "fast burn" is when you'd burn your *entire monthly budget in 1 hour*
A "slow burn" is when you'd burn it in 6 hours; spreading it over the full 30 days is expected
Fast burn multiplier = 30 days / 1 hour = 720x
Slow burn multiplier = 30 days / 6 hours = 120x
In practice:
Fast Burn Alert (1-hour window):
- Your error rate is 720x the baseline
- You're on pace to burn your entire monthly budget in 1 hour
- Action: Page an on-call engineer immediately, consider traffic reduction
- Threshold: 0.4% error rate (720 × 0.1%, scaled by the 20-second duration buffer in the math below)
Slow Burn Alert (6-hour window):
- Your error rate is 120x the baseline
- You're on pace to burn your entire monthly budget in 6 hours
- Action: Create incident ticket, investigate root cause
- Threshold: 0.2% error rate (120 × 0.1%, scaled by the 360-second duration buffer in the math below)
The math looks like this:
Target SLO = 99.9% (0.1% error budget)
Burn rate multiplier (1 hour) = 24 * 30 = 720x
Fast burn threshold = 0.001 * 720 = 0.72 (72% errors)
↓ (with 20 sec duration buffer)
= 0.72 * (20 sec / 3600 sec) = 0.004 = 0.4%
Burn rate multiplier (6 hours) = 4 * 30 = 120x
Slow burn threshold = 0.001 * 120 = 0.12 (12% errors)
↓ (with 360 sec duration buffer)
= 0.12 * (360 sec / 21600 sec) = 0.002 = 0.2%
Why both windows? Fast burn catches catastrophic failures (page someone now). Slow burn catches degradation (fix it this week). Together, they prevent alert fatigue while catching real problems.
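Scripting the derivation makes the two windows easy to compare. This sketch follows the worked example's numbers, including its duration-buffer scaling; it is illustrative, not a standard library API:

```python
def burn_multiplier(slo_window_hours: float, alert_window_hours: float) -> float:
    """Multiplier for consuming the whole SLO-window budget in the alert window."""
    return slo_window_hours / alert_window_hours

def alert_threshold(error_budget: float, multiplier: float,
                    buffer_s: float, window_s: float) -> float:
    """Error-rate threshold after the duration-buffer scaling shown above."""
    return error_budget * multiplier * (buffer_s / window_s)

HOURS_30D = 24 * 30  # a 30-day SLO window is 720 hours

fast = alert_threshold(0.001, burn_multiplier(HOURS_30D, 1), 20, 3_600)    # 0.004 → 0.4%
slow = alert_threshold(0.001, burn_multiplier(HOURS_30D, 6), 360, 21_600)  # 0.002 → 0.2%
```

These two values, 0.4% and 0.2%, are the thresholds wired into the Prometheus rules later in this article.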
Why This Approach Is Revolutionary
Traditional alerting says "alert if error rate > 1%". That's static. It doesn't account for context. Maybe 1% errors is fine - your error budget can absorb it. Maybe 0.1% errors is catastrophic - you're burning budget at 10x normal rate. Burn rate alerts adapt to your SLO, which means they adapt to your business impact.
Prediction Quality Alerts: Drift Detection
Your model can serve predictions that are technically "correct" (the model ran, inference succeeded) but increasingly wrong.
Here's what to monitor:
Output Distribution Drift
Compare your current prediction distribution to your baseline (from training data or recent history).
# PSI calculation (runnable version, using NumPy)
import numpy as np

def psi_score(current_dist, baseline_dist, bins=10):
    """Population Stability Index: measures distribution shift."""
    # Split the baseline into quantile bins
    edges = np.quantile(baseline_dist, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    current_pct = np.histogram(current_dist, edges)[0] / len(current_dist)
    baseline_pct = np.histogram(baseline_dist, edges)[0] / len(baseline_dist)
    # Clip empty bins to avoid log(0) and division by zero
    current_pct = np.clip(current_pct, 1e-6, None)
    baseline_pct = np.clip(baseline_pct, 1e-6, None)
    # For each bin: (current_pct - baseline_pct) * ln(current_pct / baseline_pct)
    # Summed across bins; PSI > 0.2 signals significant shift
    return float(np.sum((current_pct - baseline_pct)
                        * np.log(current_pct / baseline_pct)))
Alert when: PSI > 0.2 for 1+ hours. Action: Retrain the model on recent data, or reduce traffic until you understand the shift.
Embedding Drift
If your model uses learned embeddings (recommenders, search, NLP):
# Compare embeddings from incoming data to reference embeddings
# Use cosine similarity or L2 distance
current_embeddings = model.encode(current_batch)
baseline_embeddings = reference_embeddings # from training
similarity = cosine_similarity(current_embeddings, baseline_embeddings)
if mean(similarity) < threshold: # embeddings diverging
alert("embedding_drift")
Ground Truth Feedback Loop
Once users interact with your predictions, you get feedback. Monitor it:
# After collecting ground truth feedback
actual_labels = get_ground_truth_feedback()
predictions = get_recent_predictions()
# Calculate metric that matters (accuracy, AUC, MAP@k, etc)
current_accuracy = calculate_accuracy(predictions, actual_labels)
# Compare to historical baseline
if current_accuracy < baseline_accuracy * 0.95: # 5% drop
alert("accuracy_degradation")
This signal is gold because ground truth is objective. If accuracy is dropping, your model needs help.
Reducing Alert Fatigue: The Real Challenge
You can define a hundred alerts. The hard part is tuning them so only real problems trigger pages.
This is the dirty secret of monitoring: defining metrics is easy. Choosing alert thresholds is hard. Set the threshold too low and you're paged constantly on false alarms; set it too high and you miss real problems. Constant false alarms mean you stop believing alerts and stop responding quickly - a phenomenon called alert fatigue. An on-call engineer who's been paged 20 times in the past week for non-critical alerts will treat the 21st page with skepticism, even if it's critical.
The statistics bear this out. Studies of production systems show alert fatigue is endemic. Teams receive thousands of alerts per month but only 10-20% are actionable. The noise crowds out the signal. The solution isn't fewer alerts (you actually do need to monitor those things), but smarter alerts that are more selective about what triggers a page.
The other dimension of alert tuning is understanding your baseline. What does "normal" look like for your system? This changes over time. A model that just got deployed might have different behavior characteristics than one that's been running for six months. Traffic patterns change seasonally. Your data distribution shifts. What counts as healthy changes.
This is why smart teams review their alerts regularly. Once a quarter, you look at alerts that fired and ask: was this actually a problem we needed to respond to? If you're seeing lots of false positives, raise the threshold or add more conditions. If you're missing problems, lower the threshold or add new signals.
The best practice is instrumenting alerts with action buttons or runbooks so on-call engineers can quickly understand what to do when an alert fires. An alert that just says "high latency" is less useful than "high latency - check GPU utilization (dashboard link) and see if batch size needs adjustment (runbook link)." A well-designed alert provides context and action directly.
This is where most organizations struggle. They build comprehensive monitoring, get flooded with alerts, and people start ignoring them. When a critical alert fires, nobody notices because they've trained themselves to tune out the noise. That's alert fatigue at work, and it's the enemy of observability. It's not just annoying - it's dangerous. Alert fatigue has directly caused major incidents when engineers didn't respond to critical alerts because they were numb to the notifications.
The solution isn't fewer alerts. It's smarter alerts.
1. Symptom-Based Alerting
Alert on symptoms (impacts to users), not causes (internal metrics).
Bad: "Alert on CPU > 90%"
Good: "Alert on P99 latency > 500ms"
The second one tells you something users care about. The first is just a proxy. There might be good reasons CPU is high (spike in traffic) that don't matter. Or CPU could be low but your model could still be latency-bound due to network calls.
Think about it: what does the user experience? Slow predictions. What matters to your business? Prediction quality and speed. A high CPU alert might be a symptom, but it's not the disease. The disease is missing your SLO.
This principle applies across all three alert layers. For infrastructure, alert on the impact (latency spike) not the mechanism (CPU). For quality, alert on the outcome (accuracy drop) not the signal (a specific metric threshold).
2. Alert Inhibition During Maintenance
You know you're taking the service down for 30 minutes. Silence alerts during that window.
inhibit_rules:
  - source_match:
      alertname: "PlannedMaintenance"
    target_match:
      alertname: "HighLatency"
    equal: ["instance"]
When a PlannedMaintenance alert is firing for an instance, HighLatency won't fire for it.
More importantly, you can use inhibition rules to suppress lower-severity alerts when higher-severity ones fire. For example, don't alert on high error rate if the service is completely down. The downstream issue (high error rate) is a symptom of the upstream problem (service down).
Inhibition rules prevent alert storms where one failure cascades into dozens of redundant notifications.
3. Composite Alerts and Correlation
Sometimes one thing causes ten alerts. Correlate them:
- alert: ModelServerCrash
expr: |
rate(container_restarts_total[5m]) > 0.1
and on() rate(prediction_errors_total[5m]) > 0.1
for: 2m
annotations:
summary: "Model server crashing and returning errors"
Instead of 10 separate alerts, you get 1 that says "here's the problem."
This requires understanding causal relationships in your system. A container restart causes error spikes. A model overfitting causes poor prediction quality on new data. When you understand these relationships, you can create composite alerts that tell the real story.
For ML systems specifically, think about what cascades:
- Feature pipeline failure → prediction errors → user corrections → accuracy drop
- Model deployment bug → latency spike → error rate increase → user complaints
- Resource shortage → batch timeout → queue buildup → throughput drop
Create alerts for the chain, not individual links.
4. ML-Based Anomaly Detection (Advanced)
For sophisticated teams:
# Use an unsupervised anomaly detector on your metrics
from sklearn.ensemble import IsolationForest
# Train on historical "normal" behavior
normal_metrics = collect_metrics(days=30)
detector = IsolationForest(contamination=0.05)
detector.fit(normal_metrics)
# Score new incoming metrics
current_metrics = collect_metrics(minutes=5)
anomaly_scores = detector.score_samples(current_metrics)  # lower = more anomalous
if anomaly_scores.min() < threshold:
    alert("unusual_system_behavior")
This catches weird combinations of metrics you didn't think to define explicitly.
The beauty of this approach: you're learning what "normal" actually looks like for your system. Monday morning spike? Normal. 3am traffic drop? Normal. But 3am spike? Anomalous.
The downside: you need 2-4 weeks of baseline data before you can trust it. And you need to retrain when you legitimately change your system's characteristics (new model, new infrastructure, new user base).
Alert Fatigue in Numbers
A study of production systems found that:
- Average team receives 1,500 alerts per service per month
- Of those, 70-80% are noise or redundant
- Average response time to critical alert: 9 minutes
- Alert response time with alert fatigue: 20+ minutes
The math: if you can reduce your alert volume by 75% while keeping signal constant, you cut response time in half. That's the game.
Implementation Roadmap: Starting Simple
Don't try to implement everything at once. Here's a realistic progression:
Week 1-2: Foundation
- Define your three SLOs (latency, availability, quality)
- Create Layer 1 (infrastructure) alerts
- Set up basic AlertManager routing to Slack
Week 3-4: Service Observability
- Add Layer 2 (service) alerts with burn rate multipliers
- Implement alert inhibition for maintenance windows
- Start tracking alert volume and false positive rate
Week 5-6: Quality Monitoring
- Add PSI calculation pipeline
- Create Layer 3 (quality) alerts for prediction drift
- Add ground truth feedback loop collection
Week 7-8: Optimization
- Analyze alerts that fired but didn't require action
- Create composite alerts to reduce noise
- Tune thresholds based on 2 weeks of data
Week 9-10: Advanced
- Implement ML-based anomaly detection
- Create runbooks for the 5 most common alerts
- Establish an on-call rotation with alert ownership
Real-World Alerting Scenarios
Let me walk through a few common scenarios and how proper alerts would handle them.
Scenario 1: The Silent Model Degradation
Your checkout model has been serving for 6 months. Latency is perfect. Errors are zero. Your infrastructure team is sleeping soundly.
But the product changed. Customers can now buy subscriptions, not just one-time purchases. The model was trained on one-time purchase patterns. On subscriptions, it's only 60% accurate - but you're not checking for accuracy.
Without proper alerting: Users gradually stop using the recommendation feature. Revenue slowly declines. You discover the problem 3 months later in a business review.
With proper alerting:
- You have a ground truth feedback loop where users indicate whether recommendations were helpful
- You calculate rolling accuracy every hour
- You alert when accuracy drops 5% below baseline
- You catch it in 2 hours
- You retrain on new patterns or route subscription customers to a different model
Cost of the second scenario: 2 hours of manual investigation and a retrain job. Cost of the first: $500K in lost revenue.
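The feedback-loop check from that scenario can be sketched in a few lines. The helper names and sample data here are hypothetical, standing in for your prediction log and feedback store:

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match ground truth labels."""
    return sum(p == l for p, l in zip(predictions, labels)) / len(predictions)

def accuracy_alert(predictions, labels, baseline_accuracy, max_drop=0.05):
    """Fire when rolling accuracy falls more than max_drop below baseline."""
    return accuracy(predictions, labels) < baseline_accuracy * (1 - max_drop)

# Baseline 90%; the new subscription traffic is only ~60% accurate
preds  = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
labels = [1, 1, 1, 0, 0, 1, 1, 1, 0, 0]
assert accuracy_alert(preds, labels, baseline_accuracy=0.90)  # fires
```

Run hourly over the most recent window of feedback, this is the whole "catch it in 2 hours" mechanism.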
Scenario 2: The Cascading Failure
Your feature engineering pipeline runs every hour. At 2 AM, there's a brief network blip. One feature computation fails and retries, but the retry logic has a bug. It now returns NaN values for 0.001% of requests.
The model handles NaN gracefully - it just returns the average prediction. Latency? Perfect. Error rate? Zero. The service is technically working.
But now 0.001% of predictions are garbage. Multiply that across millions of requests per day, and you're degrading quality silently.
Without proper alerting: You find out 3 weeks later when you run a manual quality audit.
With proper alerting:
- You're monitoring input feature distribution (are we getting unexpected values?)
- You alert when NaN frequency exceeds baseline
- Or you're monitoring output distributions and alert when predictions become suspiciously uniform
- You catch it within an hour
- You can either fix the pipeline or add NaN handling logic
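The NaN-frequency check from that list might look like this (a sketch; the baseline rate is whatever your feature history shows, near zero in this scenario):

```python
import math

def nan_rate(values):
    """Fraction of feature values that are NaN."""
    return sum(math.isnan(v) for v in values) / len(values)

def nan_alert(values, baseline_rate=0.0, tolerance=1e-5):
    """Fire when NaN frequency exceeds the historical baseline."""
    return nan_rate(values) > baseline_rate + tolerance

batch = [0.7, 0.2, float("nan"), 0.9]
assert nan_alert(batch)  # 25% NaN, far above a near-zero baseline
```

Note the tolerance: even at 0.001% NaN, a strict greater-than-zero check catches the bug, while the tolerance keeps a single stray NaN in a huge batch from paging anyone.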
Scenario 3: The Resource Squeeze
Your model uses 8GB of VRAM. Your GPU has 40GB. Life is good. Then product scales up 10x. Batch size increases. Model size increases (you added features). One day, you OOM.
Without proper alerting: The service crashes. Customers get errors for 15 minutes until your auto-scaling or recovery logic kicks in.
With proper alerting:
- You're tracking GPU memory usage continuously
- You alert at 70% utilization (still healthy, but trending badly)
- You alert at 90% utilization (critical, need action now)
- When the 70% alert fires, your on-call engineer can scale out or optimize batch size before hitting the wall
This is an infrastructure alert, but it's tied to a business impact: keeping the service available.
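The tiered thresholds above reduce to a small severity map. This is illustrative, not tied to any monitoring client:

```python
def gpu_memory_severity(used_bytes: int, total_bytes: int):
    """Map GPU memory utilization to the tiered alert levels above."""
    utilization = used_bytes / total_bytes
    if utilization >= 0.90:
        return "critical"   # need action now
    if utilization >= 0.70:
        return "warning"    # still healthy, but trending badly
    return None             # no alert

GB = 1024 ** 3
assert gpu_memory_severity(30 * GB, 40 * GB) == "warning"   # 75% used
assert gpu_memory_severity(38 * GB, 40 * GB) == "critical"  # 95% used
```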
Real-World Questions from ML Teams
We've worked with dozens of teams on their monitoring strategies. Here are the questions that come up repeatedly:
Q: Should we alert on every metric we track?
No. Track everything you need for debugging, but only alert on things that require immediate action. Use your monitoring dashboard for exploration, use alerts for action.
Q: How often should we retune alert thresholds?
At least quarterly, or whenever you make major changes (new model, new infrastructure, new traffic patterns). Too tight and you get false positives. Too loose and you miss real problems.
Q: What should be our target alert volume?
Aim for 5-10 actionable alerts per service per week. If you're getting several per day, you have too much noise. If you're getting fewer than one per week, you might be missing problems.
Q: Who should we page for a warning-level alert?
Create a ticket, don't page. Pages should be reserved for critical alerts where human intervention is needed within 15 minutes. Warnings can wait until the next business day for triage.
Q: Should we alert on predicted metrics?
Yes, if you have a model that predicts when a resource will be exhausted. Predicting GPU OOM in 2 hours is valuable. But keep predictions validated - if your prediction model is wrong, it's noise.
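For the predicted-exhaustion idea, even a least-squares extrapolation is enough to start. A sketch with made-up sample data (hours, GB used):

```python
def hours_until_exhaustion(samples, capacity):
    """Linear least-squares extrapolation of (hour, used) samples to capacity.

    Returns None if usage is flat or shrinking.
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    den = sum((x - mean_x) ** 2 for x, _ in samples)
    slope = num / den  # growth per hour
    if slope <= 0:
        return None
    latest_x, latest_y = samples[-1]
    return (capacity - latest_y) / slope

# Growing 2 GB/hour with 4 GB of headroom left → ~2 hours to OOM
samples = [(0, 30.0), (1, 32.0), (2, 34.0), (3, 36.0)]
assert abs(hours_until_exhaustion(samples, 40.0) - 2.0) < 1e-9
```

Validate the forecast the same way you validate any alert: if it keeps predicting exhaustion that never comes, it's noise.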
Putting It Together: AlertManager Configuration
Here's a production-ready AlertManager configuration with multi-window burn rates and ML alert rules:
# alertmanager.yml
global:
resolve_timeout: 5m
slack_api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
inhibit_rules:
# Don't alert on latency if the service is down
- source_match:
severity: "critical"
alertname: "PredictionServiceDown"
target_match:
alertname: "HighLatency"
equal: ["instance"]
# Silence model alerts during maintenance windows
- source_match:
alertname: "PlannedMaintenance"
target_match_re:
alertname: ".*Drift|.*Quality"
equal: ["job"]
route:
receiver: "default"
group_by: ["alertname", "cluster"]
group_wait: 10s
group_interval: 10s
repeat_interval: 4h
# Fast burn → immediate page
routes:
- match:
severity: "critical"
receiver: "pagerduty"
group_wait: 0s
repeat_interval: 15m
# Slow burn → incident ticket
- match:
severity: "warning"
receiver: "slack"
group_wait: 30s
repeat_interval: 2h
receivers:
- name: "default"
slack_configs:
- channel: "#ml-alerts"
title: "{{ .GroupLabels.alertname }}"
text: "{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}"
- name: "pagerduty"
pagerduty_configs:
- service_key: "${PAGERDUTY_SERVICE_KEY}"
---
# prometheus_rules.yml
groups:
- name: ml_service_alerts
interval: 30s
rules:
# Layer 2: Service Alerts
- alert: HighPredictionLatencyFastBurn
expr: |
histogram_quantile(0.99,
rate(prediction_latency_seconds_bucket[1m])
) > 0.5
for: 1m
labels:
severity: critical
annotations:
summary: "P99 latency > 500ms (fast burn)"
runbook: "https://wiki.internal/runbook/high-latency"
- alert: HighPredictionLatencySlowBurn
expr: |
histogram_quantile(0.99,
rate(prediction_latency_seconds_bucket[6m])
) > 0.5
for: 5m
labels:
severity: warning
annotations:
summary: "P99 latency > 500ms (slow burn)"
# Multi-window error rate (burn rate)
- alert: HighErrorRateFastBurn
expr: |
(rate(prediction_errors_total[1m]) /
rate(predictions_total[1m])) > 0.004
for: 1m
labels:
severity: critical
annotations:
summary: "Error rate > 0.4% (burning budget in 1 hour)"
- alert: HighErrorRateSlowBurn
expr: |
(rate(prediction_errors_total[6m]) /
rate(predictions_total[6m])) > 0.002
for: 10m
labels:
severity: warning
annotations:
summary: "Error rate > 0.2% (burning budget in 6 hours)"
# Layer 3: Model Quality Alerts
- alert: PredictionDistributionShift
expr: psi_score > 0.2
for: 1h
labels:
severity: warning
alert_type: "quality"
annotations:
summary: "Prediction distribution shifted (PSI={{ $value | humanize }})"
runbook: "Check feature pipelines, recent training data changes"
- alert: EmbeddingDrift
expr: |
mean(embedding_similarity_to_baseline) < 0.85
for: 2h
labels:
severity: warning
alert_type: "quality"
annotations:
summary: "Embedding drift detected"
- alert: AccuracyDegradation
expr: |
current_accuracy < baseline_accuracy * 0.95
for: 30m
labels:
severity: warning
alert_type: "quality"
annotations:
summary: "Model accuracy dropped > 5%"
dashboard: "https://dash.internal/model-metrics"
- alert: HighUserCorrectionRate
expr: |
(rate(user_corrections_total[1h]) /
rate(predictions_total[1h])) > 0.05
for: 1h
labels:
severity: warning
alert_type: "quality"
annotations:
summary: "Users correcting > 5% of predictions"
# Layer 1: Infrastructure
- alert: GPUMemoryOOM
expr: gpu_memory_available_bytes < 100000000
for: 1m
labels:
severity: critical
annotations:
summary: "GPU memory critically low"
- alert: ContainerRestartSpike
expr: |
rate(container_restarts_total[5m]) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "Model server restarting (>6/minute)"
Visualizing Your Alert Strategy
Here's how these pieces fit together:
graph TD
A[SLO Defined<br/>99.9% Availability<br/>P99<500ms] -->|defines| B[Error Budget<br/>86.4 seconds/day]
B -->|drives| C[Burn Rate Multipliers<br/>1h: 720x<br/>6h: 120x]
C -->|creates| D1[Fast Burn Alert<br/>Error Rate > 0.4%<br/>Page Engineer]
C -->|creates| D2[Slow Burn Alert<br/>Error Rate > 0.2%<br/>File Ticket]
E[Infrastructure Metrics<br/>GPU, CPU, Network] -->|feed| F{Symptom-Based<br/>Alerts}
G[Service Metrics<br/>Latency, Errors, Throughput] -->|feed| F
H[Quality Metrics<br/>PSI, Drift, Feedback] -->|feed| F
F -->|yes| I[Inhibit During<br/>Maintenance?]
I -->|no| J[Correlate with<br/>Other Alerts?]
J -->|no| K[Page/Ticket/Observe]
J -->|yes| L[Composite Alert<br/>Root Cause]
And here's the decision framework for what to monitor:
graph TD
A["Is this a user-facing metric?"] -->|yes| B["Measure and Alert"]
A -->|no| C{"Could it cause<br/>user impact?"}
C -->|yes| D["Measure, Alert with<br/>Correlation"]
C -->|no| E["Measure, Don't Alert"]
B -->|P99 Latency?| F["SLO-based<br/>Multi-Window"]
B -->|Error Rate?| G["Burn Rate<br/>Fast/Slow"]
B -->|Prediction Quality?| H["Domain Metrics<br/>PSI, Drift, Feedback"]
D -->|GPU Util?| I["Alert on Spikes<br/>or Duration"]
D -->|CPU Util?| J["Alert on 90th%ile<br/>+ Duration"]
The SLO Decision Framework
When you're deciding what SLO to set:
- Start with user impact: What do users actually care about? Latency? Accuracy?
- Measure the baseline: What are you actually delivering today?
- Set ambitious but achievable SLO: Usually 99.9-99.95% for most systems
- Define error budget usage: How fast can it burn before you take action?
- Set alert thresholds: Use burn rate multipliers (1h, 6h, 30d)
- Iterate: You'll get this wrong the first time. That's normal.
Summary
Building a mature alerting strategy for ML systems means:
- Define SLOs that capture latency, availability, and quality
- Understand error budgets and burn rate multipliers for fast/slow problems
- Layer your alerts: infrastructure, service, quality
- Monitor for drift in predictions, embeddings, and ground truth feedback
- Reduce fatigue through symptom-based alerting and intelligent inhibition
- Use AlertManager to route critical alerts to on-call, warnings to tickets
The teams with the best ML systems aren't those with the most alerts. They're the ones with the right alerts - ones that tell them something actionable about their system's health.
Start with the infrastructure and service layers. Get those stable. Then add quality monitoring. You'll be amazed at what you catch once you're looking for silent failures.