December 9, 2025
AI/ML Infrastructure • Monitoring • Logging

Log Aggregation for ML Systems: ELK Stack and Beyond

You've deployed your machine learning model to production. Everything looks good in dev. Then, at 3 AM, your model starts returning confidence scores of 0.01 on everything. Your alert fires, but where do you even start debugging? By the time you sift through logs scattered across a dozen containers, your model has already tanked your user experience.

This is where proper log aggregation becomes your lifeline.

ML systems are fundamentally different from traditional applications. They don't just have latency and error rates - they have model drift, cache hits, token consumption, inference latency variations, and request dependencies across feature stores. Generic logging won't cut it. You need structured, queryable logs that capture the entire ML pipeline context.

This guide walks you through building a production-grade logging system for ML workloads, starting with the ELK Stack (Elasticsearch, Logstash, Kibana) and exploring alternatives like Loki. By the end, you'll understand how to instrument your models, aggregate logs at scale, and detect anomalies before your users do.

Logs are the historical record of what your ML system has done. Without comprehensive logging, you're flying blind. When something goes wrong - a model producing bad predictions, a training job failing, inference latency spiking - you have no way to investigate. You don't know what the system was doing when the problem occurred. You don't know what changed. You can't reproduce the issue. Logging is not optional in production systems; it's fundamental to operational visibility.

The challenge with ML systems specifically is that they generate vastly more data than traditional applications. Every training step produces metrics. Every inference request produces results. Every model produces predictions with confidence scores. If you log everything verbosely, you drown in data. If you log too little, you lack visibility when you need it most. Finding the right balance is an art.

The other challenge is that ML logs are not uniform. Training logs look different from inference logs. Batch processing logs look different from real-time logs. Logs from different frameworks look different. Without a unified logging strategy, you end up with incompatible log formats, making it hard to correlate events across your system. One component might log as plain text, another as JSON, another as CSV. When something goes wrong that spans multiple components, you can't trace it easily.

The ELK stack (Elasticsearch, Logstash, Kibana) provides a powerful solution to these challenges. Logstash normalizes logs from diverse sources into a unified format. Elasticsearch indexes and stores them efficiently, making them searchable. Kibana visualizes them, enabling interactive exploration and analysis. Together, they form a system that can handle the volume and variety of ML logs while making the data accessible and useful.

Table of Contents
  1. The ML Logging Problem
  2. Understanding the Logging Landscape
  3. Structured Logging for ML: The Schema
  4. Why Schema Design Matters for Production
  5. ELK Stack Architecture for ML Systems
  6. Elasticsearch: Index Templates and Sharding
  7. Logstash: Filtering and Enriching
  8. Kibana: Dashboards and Queries
  9. Training Logs: From PyTorch to JSON
  10. PyTorch Lightning with Structured Callbacks
  11. Hugging Face Trainer Integration
  12. Log-Based Anomaly Detection with Elasticsearch ML
  13. Loki + Grafana: The Lighter Alternative
  14. The Practical Reality of Log-Driven Operations
  15. Production Log Schema Reference
  16. Common Pitfalls in ML Log Aggregation
  17. Pitfall 1: Overwhelming Log Volume
  18. Pitfall 2: Unbounded Nested Objects
  19. Pitfall 3: Timezone Confusion
  20. Pitfall 4: Cardinality Explosions
  21. Pitfall 5: Losing Logs During Shutdown
  22. Production Considerations
  23. Cost Optimization
  24. High-Availability Elasticsearch
  25. Logstash Scaling
  26. Request Tracing with Correlation IDs
  27. Advanced Topics: Alert Strategies and On-Call Integration
  28. Debugging Workflows: From Alert to Root Cause
  29. Wrapping It Up

The ML Logging Problem

Traditional application logs capture what your code is doing. ML system logs need to answer: "Why did this prediction happen?" This distinction is profound and changes everything about how you approach logging for ML systems.

In traditional software, debugging is largely about execution flow. You follow the code path: function A called function B, which called function C, which threw an exception. Stack traces tell you the story. Logs capture discrete events: user logged in, database query executed, HTTP response sent. These logs are sufficient because the software is deterministic. Given the same input, it produces the same output. Bugs are usually in the code logic or the infrastructure. You find them by tracing execution.

Machine learning breaks this determinism. Your model runs, processes the same input that worked yesterday, and produces a different output today. Why? There are dozens of possible reasons. Did the model version change? Did the feature values drift? Did your feature store return stale data? Did the model overfit to training data and now encounters novel patterns? Did your preprocessing introduce a subtle bug?

Traditional logs don't answer these questions. You need a different kind of logging that captures the entire context of a prediction. Not just "inference completed" but "inference completed with these features, from this model version, with this confidence score, in this amount of time."

The second reason ML logging is different: volume. A traditional API might handle 100 requests per second. An ML inference system might handle 10,000 per second or more. You can't log every request at the same detail level - you'd drown in data. But you also can't ignore any request - the problem might manifest as a rare edge case.

The third reason: correlation across services. A fraud detection request might flow through a feature store, a model server, a ranking layer, and a cache. To understand what happened, you need to trace that request end-to-end. Traditional logs, scattered across different services, make that nearly impossible without a correlation ID strategy.

These differences mean ML systems need a purpose-built logging approach, not just application logging with a few tweaks.

Consider an inference request hitting your recommendation model. What you actually need to understand:

  • Request Context: Which user, session, timestamp, request ID
  • Model Metadata: Model version, training date, serving variant
  • Input Features: Feature vector hash, feature store latency, cache hit
  • Inference Details: Latency in milliseconds, GPU memory used, batch size
  • Output: Prediction, confidence score, threshold applied
  • Side Effects: Feature engineering duration, A/B test group, ranking position

A single line of generic logging like "Inference completed" tells you almost nothing. You need structured JSON with semantics.

The second problem: volume. A single model serving thousands of requests per second across a distributed cluster generates logs faster than you can read them. You need indexing, aggregation, and smart querying.

The third problem: correlation. When something goes wrong, you need to trace a request through multiple services - feature store, model server, cache layer, ranking logic. Log aggregation with request IDs ties it all together.
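A minimal sketch of that correlation-ID strategy in Python, using contextvars so every log line within a request carries the same ID (names like `request_id_var` and `handle_request` are illustrative, not from any specific framework):

```python
import contextvars
import json
import logging
import uuid

# Context variable carrying the request ID across function calls
# (and across await points in async code).
request_id_var = contextvars.ContextVar("request_id", default=None)

def log_json(logger, event_type, **fields):
    """Emit one structured log line stamped with the current request ID."""
    entry = {"event_type": event_type, "request_id": request_id_var.get(), **fields}
    logger.info(json.dumps(entry))
    return entry

def handle_request(payload):
    # Generate the ID once at the service edge; every downstream
    # log line in this request's context inherits it automatically.
    request_id_var.set(f"req-{uuid.uuid4()}")
    logger = logging.getLogger("inference")
    log_json(logger, "features_fetched", feature_count=len(payload))
    log_json(logger, "inference_complete", latency_ms=87)
```

Downstream services receive the ID in a header and set the same context variable, so a Kibana search for one request_id reconstructs the whole path.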

Understanding the Logging Landscape

Before you jump to ELK, understand the trade-offs. Logging infrastructure choices shape how easily you can debug production issues. Get it wrong, and you're back to SSH-ing into containers at 3 AM to grep through raw files.

There are roughly three categories of logging solutions:

  1. Full-text search indexing (ELK, Splunk): Every field is indexed and searchable. Powerful for exploratory debugging, expensive at scale.
  2. Label-based indexing (Grafana Loki): Only configured labels are indexed. Cheaper, faster, but less flexible for ad-hoc queries.
  3. Structured time-series logging (CloudWatch, Datadog): Logs as structured events with optional querying. Middle ground in flexibility and cost.

For ML systems, you typically want #1 or a hybrid of #1 and #2 - detailed logs for active debugging, but efficient storage and fast queries for known-pattern searches.

Structured Logging for ML: The Schema

Before you touch ELK, start with your logging schema. This is your contract with your observability system. Schema design is the most important decision you'll make about logging. It's easy to underestimate - schema seems like an implementation detail, something you refine over time. But in reality, your schema shapes what questions you can answer.

Consider two approaches. In the first, you log everything as free-text strings: "inference completed in 87ms with confidence 0.94". In the second, you log structured data: latency_ms: 87, confidence_score: 0.94. The first feels simpler - you're just writing strings. But try to answer a question: "What's the 95th percentile latency?" You'd need to parse every log entry, extract the number, and compute the percentile. Now try: "Show me confidence scores broken down by model version." You'd need to parse the confidence number and separately figure out which model version processed each request. These queries become increasingly complex.

With structured data, these questions are trivial. The system knows latency_ms is a number, so it can compute percentiles directly. It knows model_version is a categorical dimension, so it can group and aggregate. You can compose queries fluidly.

The schema matters because it's expensive to change later. Say you ship logging without request_id, then six months later realize you need to correlate logs across services. You can't add request_id retroactively to old logs - they're already indexed without it. Your tracing capability will be incomplete. Or imagine logging confidence_score but realizing your models output arrays of confidence scores (one per candidate). Your schema assumed a scalar; now you need to change it. Migrating a petabyte of logs to a new schema is painful.

This is why leading with schema design is essential. Spend time thinking about what you'll query. What dimensions do you need to slice data by? What metrics do you need to compute? What correlations matter for debugging? Encode all of that into your schema upfront. You can refine the schema over time - add new fields, deprecate old ones - but your core dimensions should be stable.

The right mindset is to think of your logging schema as a contract between your application and your observability system. The application promises to log certain fields with specific meanings. The observability system promises to index, search, and aggregate those fields. Both sides can optimize based on that contract. Violating it (logging inconsistent values, omitting fields) breaks the entire system.

Here's a production-grade schema for ML inference:

json
{
  "timestamp": "2026-02-27T14:32:15.847Z",
  "request_id": "req-8f4c2a1b-92d3-4e8c-b2f1-5d3c1e9a2b4f",
  "service": "recommendation-model-prod",
  "environment": "production",
  "log_level": "INFO",
  "event_type": "inference_complete",
  "model_metadata": {
    "model_name": "collaborative_filtering_v3",
    "model_version": "3.2.1",
    "training_date": "2026-02-20",
    "serving_variant": "prod-shadow-test"
  },
  "request_context": {
    "user_id": "user_5f2c9e1a",
    "session_id": "session_3b1d4e2c",
    "timestamp_ms": 1740648735847,
    "batch_size": 32
  },
  "input_features": {
    "feature_vector_hash": "sha256_abc123xyz",
    "feature_count": 128,
    "feature_store_latency_ms": 45,
    "cache_hit": true,
    "feature_engineering_duration_ms": 12
  },
  "inference_metrics": {
    "latency_ms": 87,
    "gpu_memory_mb": 324,
    "batch_processing_time_ms": 78,
    "post_processing_time_ms": 9
  },
  "output": {
    "prediction": [4, 12, 7, 2, 18],
    "confidence_scores": [0.94, 0.87, 0.76, 0.65, 0.52],
    "confidence_threshold_applied": 0.50,
    "items_above_threshold": 5
  },
  "quality_metrics": {
    "input_tokens": 256,
    "output_tokens": 12,
    "perplexity": 2.34
  },
  "anomaly_signals": {
    "latency_anomalous": false,
    "confidence_anomalous": false,
    "drift_score": 0.12
  }
}

This schema serves multiple purposes. It's queryable, it preserves ML context, and it's extensible. You'll query latency_ms across different model_version values, or count inference requests grouped by serving_variant.

The key principle: make debugging possible. If something goes wrong, your logs should let you reconstruct exactly what happened.

Why Schema Design Matters for Production

A poorly designed schema creates technical debt that compounds. You start logging latency_ms inconsistently (sometimes in milliseconds, sometimes in seconds). Months later, half your queries break. You miss real anomalies because critical fields are missing.

Invest in schema upfront. Use semantic versioning for your log schema, and handle backward compatibility explicitly. When you add a field, update your schema version. When you're querying, account for missing fields in older logs.

A schema contract looks like:

json
{
  "schema_version": "2.1",
  "fields": {
    "required": ["timestamp", "request_id", "service", "event_type"],
    "inference_specific": ["inference_metrics", "model_metadata", "output"],
    "optional": ["anomaly_signals", "quality_metrics"]
  },
  "history": [
    {
      "version": "2.1",
      "date": "2026-02-27",
      "added": ["quality_metrics"]
    },
    {
      "version": "2.0",
      "date": "2026-01-15",
      "added": ["anomaly_signals"]
    }
  ]
}
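One way to enforce such a contract at emit time is a small validator in the application; this is a sketch against the example contract above (the `REQUIRED_FIELDS` list mirrors its `required` array; `emit` and its tag-don't-drop behavior are illustrative choices):

```python
import json

SCHEMA_VERSION = "2.1"
REQUIRED_FIELDS = ["timestamp", "request_id", "service", "event_type"]

def validate_log_entry(entry: dict) -> list:
    """Return a list of contract violations (empty means valid)."""
    return [f"missing required field: {f}" for f in REQUIRED_FIELDS if f not in entry]

def emit(entry: dict, sink=print):
    entry.setdefault("schema_version", SCHEMA_VERSION)
    errors = validate_log_entry(entry)
    if errors:
        # Tag rather than drop: a malformed log line is still evidence.
        entry["validation_errors"] = errors
    sink(json.dumps(entry))
    return entry
```

Tagging invalid entries instead of discarding them mirrors the `invalid_schema` tag used later in the Logstash filter, so broken producers are visible in a dashboard rather than silently lost.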

The challenge of log volume in ML systems requires thinking about sampling and aggregation strategies early. If you log every single batch iteration during training, you might generate terabytes of logs for a single model training job. That's not sustainable. Instead, you might log only every hundredth batch. Or you might log only when metrics change significantly. Or you might log aggregate statistics rather than per-batch details. These sampling strategies reduce data volume while maintaining visibility.
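The every-Nth and significant-change strategies can be combined in one small sampler; this is a sketch, and the 100-batch interval and 10% change threshold are illustrative, not recommendations:

```python
class TrainingLogSampler:
    """Decide whether a batch-level metric is worth logging.

    Logs every Nth batch unconditionally, plus any batch whose loss
    moved by more than `rel_change` relative to the last logged value.
    """
    def __init__(self, every_n: int = 100, rel_change: float = 0.10):
        self.every_n = every_n
        self.rel_change = rel_change
        self._last_logged_loss = None

    def should_log(self, batch_idx: int, loss: float) -> bool:
        if batch_idx % self.every_n == 0:
            self._last_logged_loss = loss
            return True
        if self._last_logged_loss:
            if abs(loss - self._last_logged_loss) / abs(self._last_logged_loss) > self.rel_change:
                self._last_logged_loss = loss
                return True
        return False
```

This keeps steady-state training quiet while still capturing the batches where something interesting happened, which is usually what you scroll back to find.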

The selection of what fields to include in logs requires understanding what you'll need for debugging. Do you need to log the full input to a model, or just statistics about it? Do you need to log intermediate activations, or just final outputs? Do you need to log all hyperparameters, or only the ones you're actively tuning? Every field you log consumes storage and processing power. The fields you include should be the ones you'll actually use for debugging or optimization.

The format of logs matters for usability. JSON logs are structured and queryable, but they're harder to read for humans. Plain text logs are human-readable but hard to parse. The best approach is often to have both: machine-readable JSON logs for aggregation and analysis, with a human-readable summary as part of the JSON structure. This gives you the best of both worlds.
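A hedged sketch of that both-worlds pattern: derive a human-readable `message` field from the structured fields and carry it alongside them, rather than instead of them (the field names match this guide's schema; the message format is illustrative):

```python
import json

def make_log_entry(event_type: str, **fields) -> str:
    entry = {"event_type": event_type, **fields}
    # Human-readable summary derived from the structured fields.
    # Machines query latency_ms; humans skim message.
    if event_type == "inference_complete":
        entry["message"] = (
            f"inference completed in {fields.get('latency_ms')}ms "
            f"(mean confidence {fields.get('mean_confidence')})"
        )
    return json.dumps(entry)
```

In Kibana, the `message` field makes the log stream scannable, while every aggregation still runs on the typed fields.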

The latency of log availability matters for operational awareness. If you log something and it takes ten minutes to appear in Elasticsearch, that's not useful for real-time monitoring. Logstash ingestion should be fast. Elasticsearch indexing should be fast. For critical systems, you might use in-memory caches that display logs in near-real-time before they're fully indexed. This gives you immediate visibility while you wait for full indexing.
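The in-memory cache mentioned above can be as simple as a bounded ring buffer kept beside the log shipper, queryable instantly while Elasticsearch catches up (a sketch; the class name, default size, and fields are illustrative):

```python
import collections
import threading

class RecentLogBuffer:
    """Keep the last N log entries in memory for near-real-time inspection."""
    def __init__(self, maxlen: int = 10_000):
        self._buf = collections.deque(maxlen=maxlen)  # old entries fall off
        self._lock = threading.Lock()

    def append(self, entry: dict):
        with self._lock:
            self._buf.append(entry)

    def tail(self, n: int = 50, event_type: str = None):
        """Return the most recent n entries, optionally filtered by type."""
        with self._lock:
            entries = list(self._buf)
        if event_type is not None:
            entries = [e for e in entries if e.get("event_type") == event_type]
        return entries[-n:]
```

Expose `tail()` on a debug endpoint and you have sub-second visibility into the newest logs without waiting for the 30s Elasticsearch refresh interval.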

The aggregation of logs is a powerful capability for understanding system behavior. Instead of looking at individual logs, you look at statistics: how many errors per minute? What's the distribution of latencies? What's the most common error message? Elasticsearch aggregations can compute these statistics efficiently. Good dashboards use aggregations to show you the big picture at a glance.

The correlation of logs across time is useful for understanding trends. How has error rate changed over the past month? Is latency increasing or decreasing? Are errors becoming more or less frequent? By plotting these metrics over time, you can identify trends and patterns. You can correlate these trends with changes you made to the system.

The ability to drill down from high-level dashboards to individual logs is important for investigation. Maybe your dashboard shows that error rate is elevated. You drill down to see what errors are occurring. You drill down further to see the specific requests that triggered those errors. You might find that all errors are for a particular type of request or from a particular user. This drill-down capability turns logs from passive records into active investigation tools.

The cross-team sharing of logs and log insights is valuable for organizational learning. When one team learns how to debug a particular kind of problem, that knowledge should be shared. Shared dashboards and runbooks make this easier. Some teams maintain log analysis repositories: curated sets of queries and dashboards that solve common problems. This is organizational knowledge that benefits everyone.

ELK Stack Architecture for ML Systems

The ELK Stack has three components:

  • Elasticsearch: Stores and indexes your logs. Superpowers for querying.
  • Logstash: Transforms raw logs into structured format. Filters, enriches, validates.
  • Kibana: Visualizes logs and creates dashboards. This is where you spend your time.

Here's how they flow together:

graph LR
    A["ML Services<br/>PyTorch, HF Transformers<br/>Custom Training Loops"] -->|JSON Logs| B["Logstash<br/>Filter → Enrich → Validate"]
    B -->|Structured Data| C["Elasticsearch<br/>Index Templates<br/>Shard Strategy"]
    C -->|Query & Visualize| D["Kibana<br/>Dashboards<br/>Alerts"]
    E["Feature Store<br/>Cache Layer<br/>Batch Jobs"] -->|JSON Logs| B
    D -->|Anomalies| F["Alerting System<br/>PagerDuty<br/>Slack"]

Let's build each piece.

Elasticsearch: Index Templates and Sharding

Elasticsearch is a distributed search engine. To make it work for ML logs, you need proper index templates that define how data gets stored.

Here's a production template for ML inference logs:

json
{
  "index_patterns": ["ml-inference-*"],
  "template": {
    "settings": {
      "number_of_shards": 5,
      "number_of_replicas": 1,
      "refresh_interval": "30s",
      "index": {
        "lifecycle": {
          "name": "ml-logs-policy",
          "rollover_alias": "ml-inference-write"
        }
      }
    },
    "mappings": {
      "properties": {
        "timestamp": {
          "type": "date",
          "format": "strict_date_time"
        },
        "request_id": {
          "type": "keyword"
        },
        "service": {
          "type": "keyword"
        },
        "event_type": {
          "type": "keyword"
        },
        "model_metadata": {
          "properties": {
            "model_name": {"type": "keyword"},
            "model_version": {"type": "keyword"},
            "serving_variant": {"type": "keyword"}
          }
        },
        "inference_metrics": {
          "properties": {
            "latency_ms": {"type": "integer"},
            "gpu_memory_mb": {"type": "integer"},
            "batch_processing_time_ms": {"type": "integer"}
          }
        },
        "output": {
          "properties": {
            "confidence_scores": {"type": "float"}
          }
        },
        "anomaly_signals": {
          "properties": {
            "latency_anomalous": {"type": "boolean"},
            "drift_score": {"type": "float"}
          }
        }
      }
    }
  }
}

This template does several things:

  • Creates new indices daily with rollover (keeps queries fast)
  • Uses keyword type for fields you'll filter on (model_version, service)
  • Uses integer for metrics you'll aggregate (latency_ms)
  • Maps confidence_scores as float (Elasticsearch stores arrays of scalars natively; nested is only needed for arrays of objects)
  • Sets sharding strategy for your data volume

Deploy this with:

bash
curl -X PUT "localhost:9200/_index_template/ml-logs-template" \
  -H 'Content-Type: application/json' \
  -d @template.json

Logstash: Filtering and Enriching

Logstash receives raw logs from your services and transforms them. Here's a realistic config for ML inference logs:

input {
  tcp {
    port => 5000
    codec => json
  }
}

filter {
  # Parse and validate timestamp
  date {
    match => [ "timestamp", "ISO8601" ]
    target => "@timestamp"
  }

  # Validate required fields
  if ![request_id] or ![model_metadata][model_version] {
    mutate {
      add_tag => [ "invalid_schema" ]
      add_field => { "validation_error" => "Missing required fields" }
    }
  }

  # Enrich with deployment context
  mutate {
    add_field => {
      "deployment_environment" => "production"
      "cluster_name" => "ml-us-west-2"
      "ingest_timestamp" => "%{@timestamp}"
    }
  }

  # Parse confidence scores for per-item analysis
  if [output][confidence_scores] {
    ruby {
      code => '
        scores = event.get("[output][confidence_scores]")
        if scores.is_a?(Array)
          event.set("[output][mean_confidence]",
            scores.sum.to_f / scores.length)
          event.set("[output][max_confidence]", scores.max)
          event.set("[output][min_confidence]", scores.min)
        end
      '
    }
  }

  # Calculate derived metrics
  if [inference_metrics][latency_ms] {
    mutate {
      convert => {
        "[inference_metrics][latency_ms]" => "integer"
      }
    }

    # Mark anomalies for later aggregation
    if [inference_metrics][latency_ms] > 500 {
      mutate {
        add_tag => [ "high_latency" ]
      }
    }
  }

  # Detect potential data drift
  if [output][confidence_threshold_applied] and [output][items_above_threshold] {
    ruby {
      code => '
        threshold = event.get("[output][confidence_threshold_applied]")
        items = event.get("[output][items_above_threshold]")

        if items && items < 2
          event.set("[anomaly_signals][low_confidence_warning]", true)
        end
      '
    }
  }
}

output {
  # Send to Elasticsearch with proper index naming
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "ml-inference-%{+YYYY.MM.dd}"
  }

  # Also send high-severity logs to a separate stream for alerting
  if "high_latency" in [tags] or "invalid_schema" in [tags] {
    file {
      path => "/var/log/ml-anomalies.log"
      codec => json
    }
  }

  # Debug: print a sample of incoming logs
  if [@metadata][index_type] == "sample" {
    stdout {
      codec => rubydebug
    }
  }
}

This Logstash config:

  • Parses incoming JSON logs over TCP
  • Validates required fields
  • Enriches logs with deployment metadata
  • Calculates derived metrics (mean confidence, max confidence)
  • Tags anomalies for easy filtering
  • Routes high-severity logs to separate files
  • Sends everything to Elasticsearch with date-based indices
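On the application side, feeding the `tcp` input on port 5000 means writing one newline-terminated JSON document per log entry. A minimal shipper sketch (host and port must match your Logstash config; a production version would add reconnection and local buffering):

```python
import json
import socket

class LogstashTcpShipper:
    """Ship structured log entries to a Logstash tcp/json input."""
    def __init__(self, host: str = "localhost", port: int = 5000):
        self.host, self.port = host, port
        self._sock = None

    @staticmethod
    def encode(entry: dict) -> bytes:
        # The json codec expects one JSON document per newline-terminated line.
        return (json.dumps(entry) + "\n").encode("utf-8")

    def ship(self, entry: dict):
        if self._sock is None:
            self._sock = socket.create_connection((self.host, self.port), timeout=5)
        self._sock.sendall(self.encode(entry))
```

In practice many teams skip a custom shipper and let Filebeat tail JSON log files instead, which survives application crashes; the wire format is the same idea.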

Kibana: Dashboards and Queries

Now that your logs are indexed, Kibana lets you explore and visualize them. Here's a query to find inference latency anomalies:

json
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "timestamp": {
              "gte": "now-1h",
              "lte": "now"
            }
          }
        },
        {
          "term": {
            "model_metadata.model_version": "3.2.1"
          }
        }
      ],
      "filter": [
        {
          "range": {
            "inference_metrics.latency_ms": {
              "gte": 500
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "latency_by_variant": {
      "terms": {
        "field": "model_metadata.serving_variant",
        "size": 10
      },
      "aggs": {
        "p99_latency": {
          "percentiles": {
            "field": "inference_metrics.latency_ms",
            "percents": [99]
          }
        }
      }
    }
  },
  "size": 0
}

In Kibana's UI, you'd create a dashboard showing:

  1. Real-time latency distribution (p50, p95, p99 by model variant)
  2. Confidence score trends (are predictions getting less confident?)
  3. Cache hit rate (feature store performance)
  4. Error rate timeline (schema validation failures)
  5. Anomaly detector results (automated drift detection)

A production dashboard might look like:

graph TB
    subgraph Dashboard["Kibana ML Monitoring Dashboard"]
        A["Latency Distribution<br/>p50: 78ms | p95: 234ms | p99: 512ms"]
        B["Confidence Trends<br/>Mean: 0.82 | 24h change: +3%"]
        C["Cache Hit Rate<br/>Feature Store: 94.2%"]
        D["Error Rate<br/>Invalid Schema: 0.12%"]
        E["Anomaly Score<br/>Drift: 0.15 | Latency: LOW"]
        F["Model Version Distribution<br/>v3.2.1: 85% | v3.1.9: 15%"]
    end

The architectural choice of ELK has important implications for how you instrument your system. Because Elasticsearch is built for search and aggregation, you can afford to log more detail than you could with traditional file-based logging. Instead of sampling one percent of requests, you can log all of them. This transforms your visibility from sampling-based approximations to complete coverage.

The scalability characteristics of ELK are important to understand. Logstash can become a bottleneck if you have more logging volume than it can process. Solutions include running multiple Logstash instances, distributing load across them, or using more efficient log shippers like Filebeat. Elasticsearch performance depends on your indexing strategy and retention policies. Indexing too much data or retaining data too long can make queries slow and cost excessive disk space.

The alerting capabilities that Elasticsearch provides are valuable for production operations. You can set up alerts that trigger when certain log patterns occur. If a model starts producing errors at high rates, an alert fires. If inference latency percentiles exceed thresholds, an alert fires. These alerts enable rapid response to problems. The key is setting alert thresholds carefully so you catch real problems without overwhelming yourself with false alarms.

Data retention and archival are practical considerations. Elasticsearch is optimized for recent data, not historical data. As logs age, you might archive them to cheaper storage like S3 or HDFS. This lets you maintain complete historical logs for compliance while keeping Elasticsearch responsive. When you need to investigate a historical issue, you can search both recent logs in Elasticsearch and historical logs in archive storage.

Training Logs: From PyTorch to JSON

Inference logging is one half. Training logs are the other. You need to capture model training in structured format too.

PyTorch Lightning with Structured Callbacks

If you're using PyTorch Lightning (you should be), here's a callback that emits structured logs:

python
import json
import logging
import time
from datetime import datetime
import pytorch_lightning as pl
from pytorch_lightning.callbacks import Callback
 
def _to_float(value):
    """Convert tensors (or None) to JSON-serializable floats."""
    return float(value) if value is not None else None
 
class StructuredLoggingCallback(Callback):
    def __init__(self, experiment_id: str, model_name: str):
        self.experiment_id = experiment_id
        self.model_name = model_name
        self.logger = logging.getLogger(__name__)
        self.logger.setLevel(logging.INFO)
        self._epoch_start = None
 
    def on_train_epoch_start(self, trainer, pl_module):
        self._epoch_start = time.monotonic()
 
    def on_train_epoch_end(self, trainer, pl_module):
        log_entry = {
            "timestamp": datetime.utcnow().isoformat() + "Z",
            "experiment_id": self.experiment_id,
            "model_name": self.model_name,
            "event_type": "training_epoch_end",
            "epoch": trainer.current_epoch,
            "global_step": trainer.global_step,
            "metrics": {
                # callback_metrics holds tensors; convert before serializing
                "train_loss": _to_float(trainer.callback_metrics.get("train_loss")),
                "val_loss": _to_float(trainer.callback_metrics.get("val_loss")),
                "learning_rate": trainer.optimizers[0].param_groups[0]["lr"]
            },
            "training_context": {
                "batch_size": trainer.datamodule.batch_size,
                "num_devices": trainer.num_devices,
                "world_size": trainer.world_size,
                "gradient_accumulation_steps": trainer.accumulate_grad_batches
            },
            "system_metrics": {
                "epoch_duration_seconds": (
                    time.monotonic() - self._epoch_start
                    if self._epoch_start is not None else None
                )
            }
        }
 
        self.logger.info(json.dumps(log_entry))
 
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        # Sample every Nth batch to avoid overwhelming logs
        if batch_idx % 100 == 0:
            loss = outputs.get("loss") if isinstance(outputs, dict) else outputs
            log_entry = {
                "timestamp": datetime.utcnow().isoformat() + "Z",
                "experiment_id": self.experiment_id,
                "event_type": "training_batch_sample",
                "batch_idx": batch_idx,
                "loss": _to_float(loss),
                "batch_size": len(batch[0]) if isinstance(batch, (list, tuple)) else batch.size(0)
            }
            self.logger.info(json.dumps(log_entry))
 
# Usage
trainer = pl.Trainer(
    callbacks=[
        StructuredLoggingCallback(
            experiment_id="exp_2026_02_27_v1",
            model_name="collaborative_filtering_v3"
        )
    ]
)
trainer.fit(model, dataloader)

This callback captures training progress in structured JSON that flows directly to your ELK stack.

Hugging Face Trainer Integration

For transformer models, the Hugging Face Trainer supports custom callbacks:

python
from transformers import Trainer, TrainerCallback
from datetime import datetime
import json
import logging
 
class MLStructuredLoggingCallback(TrainerCallback):
    def __init__(self, model_name: str):
        self.model_name = model_name
        self.logger = logging.getLogger(__name__)
 
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs:
            log_entry = {
                "timestamp": datetime.utcnow().isoformat() + "Z",
                "model_name": self.model_name,
                "event_type": "training_step",
                "global_step": state.global_step,
                "epoch": state.epoch,
                "metrics": logs,
                "learning_rate": logs.get("learning_rate"),
                "training_loss": logs.get("loss"),
                "eval_metrics": {
                    "eval_loss": logs.get("eval_loss"),
                    "eval_f1": logs.get("eval_f1"),
                    "eval_accuracy": logs.get("eval_accuracy")
                }
            }
            self.logger.info(json.dumps(log_entry))
 
# Usage
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=[MLStructuredLoggingCallback(model_name="bert-finetuned")]
)
trainer.train()

Log-Based Anomaly Detection with Elasticsearch ML

Elasticsearch has built-in machine learning for anomaly detection. You can create jobs that automatically detect:

  • Latency spikes
  • Error rate increases
  • Confidence score drops
  • Cache hit rate degradation

Here's how to create an anomaly detection job for inference latency:

json
{
  "job_id": "ml-inference-latency-anomaly",
  "description": "Detect latency anomalies in inference requests",
  "analysis_config": {
    "detectors": [
      {
        "detector_description": "mean of latency_ms by model_version",
        "function": "mean",
        "field_name": "inference_metrics.latency_ms",
        "by_field_name": "model_metadata.model_version",
        "partition_field_name": "model_metadata.serving_variant"
      }
    ],
    "bucket_span": "5m",
    "influencers": [
      "model_metadata.model_version.keyword",
      "model_metadata.serving_variant.keyword"
    ]
  },
  "data_description": {
    "time_field": "@timestamp",
    "time_format": "epoch_ms"
  },
  "model_plot_config": {
    "enabled": true,
    "terms_field": "model_metadata.model_version.keyword"
  }
}

Deploy with:

bash
curl -X PUT "localhost:9200/_ml/anomaly_detectors/ml-inference-latency-anomaly" \
  -H 'Content-Type: application/json' \
  -d @anomaly_job.json

The anomaly detector runs continuously and scores each bucket on how anomalous it is (0-100). You can alert when scores exceed 75.
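To act on those scores programmatically, you can pull records from the results API (`GET _ml/anomaly_detectors/<job_id>/results/records`) and filter client-side. A sketch of the filtering step, assuming each record carries a `record_score` in the 0-100 range as that API returns:

```python
def records_to_alert(records, threshold=75.0):
    """Given anomaly records from the Elasticsearch ML results API,
    return those whose record_score crosses the alerting threshold,
    sorted most-anomalous first."""
    hits = [r for r in records if r.get("record_score", 0) >= threshold]
    return sorted(hits, key=lambda r: r["record_score"], reverse=True)
```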

Understanding your logs at scale requires thinking differently about what to log and how to query it. With millions of logs per day, you need to be selective. Logging every single inference request might generate more logs than your storage can handle. Instead, you log a sample of requests, or you log requests that meet certain criteria (slow requests, requests with errors, etc.). This requires careful thinking about what information is valuable and what is noise.

The schema design for ML logs is more complex than for traditional applications. You need to capture metrics like training loss, validation accuracy, inference latency, model predictions, and confidence scores. You need to capture environment information like which GPU device ran the computation, which version of the model was used, and what hyperparameters were configured. You need to capture context information like the user who triggered the computation, the feature flag configuration, and the ab-test variant being served. A well-designed log schema captures all of this in a queryable format.

One challenge is that different parts of your system might log at different levels of detail. Your training loop might log every step. Your inference servers might log one entry per request. Your data pipeline might log one entry per batch. Without careful normalization, comparing across these components is difficult. That's where Logstash's transformation capabilities become valuable. You can use Logstash to parse, enrich, and normalize logs from diverse sources.

The query language and visualization capabilities of Kibana enable powerful analysis. You can ask questions like "what was the average inference latency per model version last week?" or "how many requests had predictions below the confidence threshold?" or "what was the error rate for each geographic region?" These kinds of questions drive operational insights. But asking them requires understanding your logs well enough to write the right queries.

Performance optimization of the ELK stack is an ongoing concern. As your log volume grows, Elasticsearch queries can become slow. You might index too many fields, making indexing slow. You might retain too much data, making searches slow. You need to periodically review your indexing strategy and retention policies. You might use index templates to automatically apply consistent indexing patterns to new indices. You might use index lifecycle management to automatically delete old data.

Alerting on logs is more nuanced than alerting on metrics. With metrics like CPU utilization, a threshold alert is straightforward: alert if CPU is above eighty percent. With logs, you need to look for patterns. Alert if the error rate exceeds a threshold. Alert if you see specific error messages. Alert if the distribution of response times changes. This requires writing alert logic that understands your logs.
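As a concrete sketch of such alert logic, here's a sliding-window error-rate check you could run over parsed log entries (the thresholds are illustrative, not prescriptive):

```python
from collections import deque
import time

class ErrorRateAlert:
    """Sliding-window error-rate check over parsed log entries.
    Fires when the error fraction in the window exceeds `threshold`
    and at least `min_requests` have been seen, which avoids noisy
    alerts on tiny samples."""

    def __init__(self, window_seconds=300, threshold=0.05, min_requests=20):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_requests = min_requests
        self.events = deque()  # (timestamp, is_error) pairs

    def observe(self, is_error, now=None):
        now = now if now is not None else time.time()
        self.events.append((now, is_error))
        # Evict entries that have aged out of the window
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def should_alert(self):
        if len(self.events) < self.min_requests:
            return False
        errors = sum(1 for _, is_err in self.events if is_err)
        return errors / len(self.events) > self.threshold
```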

The integration between logs and metrics is important. Your metrics system might tell you that inference latency is high. Your logs can tell you why. Maybe all requests are going to a particular GPU that's overloaded. Maybe a particular model is slow. Maybe there's a network issue. Correlating logs and metrics helps you understand the full picture.

Cost management of log aggregation is a practical concern. Storing and indexing billions of log entries is expensive. Every byte costs money for storage, and every search costs money for compute. You need to be thoughtful about what to log, how long to retain it, and how often you search it. Some teams implement tiered logging where recent logs are in Elasticsearch for fast search, and old logs are in cheaper storage for rare access.
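To make the tiering trade-off concrete, a back-of-the-envelope estimator (the per-GB-month prices here are placeholders; substitute your provider's actual rates):

```python
def monthly_log_cost(daily_gb, hot_days=7, warm_days=30, cold_days=90,
                     hot_price=0.30, warm_price=0.10, cold_price=0.02):
    """Rough monthly storage cost for a tiered retention scheme.
    Prices are illustrative $/GB-month, not real quotes.
    Each tier holds the volume that accumulates during its window."""
    hot = daily_gb * hot_days * hot_price
    warm = daily_gb * (warm_days - hot_days) * warm_price
    cold = daily_gb * (cold_days - warm_days) * cold_price
    return round(hot + warm + cold, 2)
```

At 10 GB/day with these placeholder prices, the hot tier dominates despite holding the least data, which is exactly why aggressive rollover to warm pays off.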

Loki + Grafana: The Lighter Alternative

ELK is powerful but heavyweight. If you want something simpler, Loki paired with Grafana is a strong alternative.

Loki doesn't index log content - it only indexes labels. This makes it fast and cheap but less flexible for ad-hoc querying.

For ML logs, you'd send the same JSON but configure Loki to extract important labels:

yaml
scrape_configs:
  - job_name: ml-inference
    static_configs:
      - targets:
          - localhost
        labels:
          job: ml-inference
          __path__: /var/log/ml/*.log
 
    pipeline_stages:
      - json:
          expressions:
            request_id: request_id
            model_version: model_metadata.model_version
            latency: inference_metrics.latency_ms
            confidence: output.mean_confidence
 
      - labels:
          request_id:
          model_version:
          latency:
          confidence:

In Grafana, you'd query with LogQL:

{job="ml-inference"}
  | json model_version="model_metadata.model_version", latency="inference_metrics.latency_ms"
  | model_version="3.2.1"
  | latency > 500

ELK vs Loki for ML:

  • ELK: Full-text search, complex aggregations, Elasticsearch ML jobs. Better when you need detailed debugging.
  • Loki: Lightweight, cheaper, faster for known queries. Better when you have high volume and know what to look for.

For ML systems at scale, we recommend ELK + Loki hybrid: ELK for detailed inference logs, Loki for high-volume trace/event logs.

The Practical Reality of Log-Driven Operations

Running production ML systems without proper logging is like flying a plane without instruments. You might make it for a while, but when things go wrong, you're helpless. Logs are your instruments, giving you visibility into what the system is doing, why it's behaving a certain way, and where the problems lie.

The operational cadence of a well-logged system is fundamentally different from a poorly-logged one. With good logging, when an alert fires, you can immediately correlate it to your logs and understand context. When a customer reports an issue, you can pull up their request logs and see exactly what happened. When you deploy a new model version, you can compare its logs to the previous version and spot behavioral changes. These capabilities transform operations from reactive firefighting to proactive problem-solving.

The investment required to build good logging infrastructure pays dividends immediately. Yes, it takes engineering effort to design schemas, instrument code, and set up aggregation infrastructure. But the time you save by being able to debug issues quickly and confidently repays that investment within days or weeks.

The skill of reading logs effectively is worth developing. An experienced engineer can look at a dashboard of logs and spot anomalies that others miss. They understand what normal looks like and can quickly identify deviations. They know which fields to correlate when investigating issues. These skills come from practice and from familiarity with your system's specific logging patterns.

The continuous improvement of your logging infrastructure is important. As your system grows and changes, your logging needs change. Features that seemed unimportant initially might become critical for debugging. Fields that seemed essential might become noise. Regularly reviewing your logging strategy and adjusting it based on what you've learned from actual incidents is a best practice.

The documentation of your logging system is essential for team onboarding. When new engineers join, they need to understand your logging strategy. What logs exist? What do they contain? How do you search them? How do you interpret them? Without documentation, knowledge lives only in the heads of senior engineers. Documentation lets knowledge scale across the team.

The budget allocation for logging infrastructure is worth taking seriously. Storage costs money. Processing costs money. Search bandwidth costs money. For a large ML system, logging costs can be significant. But viewing this as an expense misses the point. Logging is an investment in operational visibility and reliability. The cost of not being able to debug issues quickly is much higher than the cost of logging.

The tooling choices matter but shouldn't paralyze you. ELK Stack is powerful and popular, but Datadog might be simpler to operate. Loki is lightweight and cheap. Splunk is enterprise-grade. The specific tool matters less than having a tool, instrumenting your system thoroughly, and using the data to improve your system. Pick something reasonable and get started. You can always migrate later if needed.

The relationship between logging and compliance is worth understanding. Compliance requirements like GDPR, HIPAA, and SOC2 often require audit logs: records of who accessed what and when. Your logging infrastructure needs to capture this information. Some regulations require retention of logs for specific periods. Your retention policies need to comply. Some regulations require encryption of logs or restriction of who can access them. These requirements might drive your infrastructure choices.

The integration of logging with incident response workflows makes response faster and more effective. When you're in an incident, you need information fast. Good integration means you can get from alert to relevant logs in seconds. Some teams use tools like PagerDuty that integrate with Elasticsearch for one-click access to context. Others build custom tools. Whatever you do, optimize for speed in the incident response path.

The psychology of logging is worth considering. Engineers are more likely to add good logs if they've been burned by poor logs in the past. They're more likely to maintain logging infrastructure if they've seen its value in debugging production issues. Building a culture around observability requires leaders who emphasize its importance and teams that celebrate the debugging successes that good logging makes possible.

The future trajectory of log aggregation includes increasing automation and intelligence. Instead of engineers manually writing alerts, systems will automatically detect anomalies. Instead of engineers writing dashboards, systems will automatically create relevant visualizations. Instead of engineers writing queries, natural language interfaces will understand intent. Machine learning on logs will become standard practice. The basic infrastructure of collecting and indexing logs remains the same, but the layers of intelligence on top will become more sophisticated.

The final principle: logs are not an afterthought. They're a first-class citizen in your system design. Think about logging from the very beginning of system design. Plan what you need to observe. Design your log schema to support that observation. Instrument your code comprehensively. Set up aggregation and visualization. Use it to improve the system. This intentional approach to logging pays for itself many times over in reduced debugging time and improved system reliability.

Production Log Schema Reference

Here's your complete ML inference logging schema as a reference:

json
{
  "timestamp": "ISO8601",
  "request_id": "UUID or unique identifier",
  "service": "service_name",
  "environment": "production|staging|development",
  "log_level": "ERROR|WARN|INFO|DEBUG",
  "event_type": "inference_complete|training_epoch_end|validation_start",
 
  "model_metadata": {
    "model_name": "string",
    "model_version": "semantic_version",
    "training_date": "YYYY-MM-DD",
    "serving_variant": "prod|canary|shadow|experiment_id"
  },
 
  "request_context": {
    "user_id": "string",
    "session_id": "string",
    "timestamp_ms": "epoch_ms",
    "batch_size": "integer"
  },
 
  "input_features": {
    "feature_vector_hash": "sha256_hex",
    "feature_count": "integer",
    "feature_store_latency_ms": "integer",
    "cache_hit": "boolean",
    "feature_engineering_duration_ms": "integer"
  },
 
  "inference_metrics": {
    "latency_ms": "integer",
    "gpu_memory_mb": "integer",
    "batch_processing_time_ms": "integer",
    "post_processing_time_ms": "integer"
  },
 
  "output": {
    "prediction": "array|string|number",
    "confidence_scores": "array of floats",
    "confidence_threshold_applied": "float",
    "items_above_threshold": "integer"
  },
 
  "quality_metrics": {
    "input_tokens": "integer",
    "output_tokens": "integer",
    "perplexity": "float"
  },
 
  "anomaly_signals": {
    "latency_anomalous": "boolean",
    "confidence_anomalous": "boolean",
    "drift_score": "float 0-1"
  }
}

Common Pitfalls in ML Log Aggregation

You've deployed your logging infrastructure. Then reality hits. Here are the traps teams fall into. These aren't theoretical - they're the actual failure modes we've seen repeatedly in production systems.

The common thread through all these pitfalls is the tension between wanting complete information and needing to operate at scale. You want to log everything about every request, to maximize your debugging capability. But if you actually do that at scale, your infrastructure collapses. Storage explodes. Query latency becomes unbearable. Your logs become such a firehose that finding signal in the noise becomes impossible.

This is why the best logging systems are designed around principles like adaptive sampling, cardinality limits, and aggressive filtering. You're not trying to log everything. You're trying to log the right things, in the right proportions, to enable effective debugging without overwhelming your infrastructure.

The teams that succeed at logging at scale all converge on similar patterns: log samples of normal requests (maybe 5%), but log 100% of errors. Log all anomalies. Log all requests involving new users or rare feature values. Log long-tail edge cases. This way, your normal workload stays manageable, but you still have rich data about anything unusual. When something goes wrong, you have the context you need to understand it.

Pitfall 1: Overwhelming Log Volume

Your model serves 1000 requests/second. You log every request. That's 86 million log entries per day. At average 2KB per entry, that's 172GB per day. Elasticsearch collapses trying to index this.

The problem: You're logging too much.

You don't need every request. You need samples and anomalies.

Solution: Implement adaptive sampling:

python
import random
import logging
 
class AdaptiveSamplingLogger:
    def __init__(self, base_rate=0.1, latency_threshold_ms=500):
        self.base_rate = base_rate  # Fraction of normal requests to log
        self.latency_threshold_ms = latency_threshold_ms

    def should_log(self, latency_ms, is_error=False):
        if is_error:
            return True  # Always log errors
        if latency_ms > self.latency_threshold_ms:
            return True  # Always log slow requests
        return random.random() < self.base_rate
 
# Usage
sampler = AdaptiveSamplingLogger(base_rate=0.05)  # Log 5% of normal requests
 
if sampler.should_log(inference_latency, is_error=request_failed):
    logger.info(json.dumps(log_entry))

This way, you capture 5% of normal traffic (for baseline metrics), 100% of slow requests (for performance debugging), and 100% of errors. Your log volume drops by roughly 95%, and you lose no critical information.

Pitfall 2: Unbounded Nested Objects

You log the entire inference request:

python
log_entry = {
    "request": full_request_object,   # 50 fields
    "features": feature_vector,       # 10,000 dimensions
    "model_weights": weights          # Serialized weights blob
}

Elasticsearch tries to index all 10,000 feature dimensions individually. Your index explodes. Queries slow down.

Solution: Hash complex objects, keep summaries:

python
import hashlib
import json
 
def compute_feature_hash(features):
    """Hash features instead of storing them."""
    return hashlib.sha256(
        json.dumps(features, sort_keys=True).encode()
    ).hexdigest()
 
log_entry = {
    "request": {
        "user_id": request.user_id,
        "session_id": request.session_id,
        # DON'T store full feature vector
    },
    "features": {
        "vector_hash": compute_feature_hash(features),
        "vector_dimension": len(features),
        "vector_norm": sum(f**2 for f in features) ** 0.5
    }
}

This way, you capture enough context to correlate logs, but avoid indexing massive nested objects.

Pitfall 3: Timezone Confusion

Your servers log in UTC. Your Kibana session displays PST. Your alerts appear to fire at the "wrong time." You're actually looking at the right data, but the timestamp confusion makes correlating events painful.

Solution: Always use ISO8601 with timezone:

python
from datetime import datetime, timezone
 
# Good
timestamp = datetime.now(tz=timezone.utc).isoformat()
# "2026-02-27T22:15:30.123456+00:00"
 
# Bad
timestamp = datetime.now().isoformat()
# "2026-02-27T22:15:30.123456"  (no timezone, ambiguous)

In Elasticsearch mapping, enforce this:

json
"timestamp": {
  "type": "date",
  "format": "strict_date_time"  # Enforces ISO8601 with timezone
}

Pitfall 4: Cardinality Explosions

You log user_id as a keyword field. Your platform has 100 million users. Elasticsearch tries to build a dictionary of all 100 million values. Memory usage explodes.

Solution: Don't index unbounded high-cardinality fields. Use them only in aggregations:

json
{
  "properties": {
    "user_id": {
      "type": "keyword",
      "ignore_above": 512,  # Don't index values > 512 chars
      "index": false        # Don't index at all for filtering
    }
  }
}

Or don't store the raw user_id - store a hash:

python
import hashlib
 
def user_id_hash(user_id):
    return hashlib.md5(user_id.encode()).hexdigest()
 
log_entry = {
    "user_id_hash": user_id_hash(user_id),
    "user_cohort": compute_user_cohort(user_id)  # Lower cardinality
}

This trades specificity for performance. You can't query for a specific user, but you can analyze by cohort.
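The `compute_user_cohort` helper above is left undefined; one plausible sketch (an assumption, not a prescribed implementation) maps each user deterministically into a fixed number of buckets, so the logged field has bounded cardinality:

```python
import hashlib

def compute_user_cohort(user_id, n_cohorts=100):
    """Hypothetical sketch: bucket users deterministically so the
    logged field takes at most n_cohorts distinct values (100 here)
    instead of one value per user."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return f"cohort_{int(digest, 16) % n_cohorts:02d}"
```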

Pitfall 5: Losing Logs During Shutdown

Your model server gets a SIGTERM. Logstash is buffering 10,000 log entries. Your code exits immediately. Logs vanish.

Solution: Implement graceful shutdown with buffer flushing:

python
import logging
import signal
import sys
import time

logger = logging.getLogger(__name__)

def graceful_shutdown(signum, frame):
    """Flush buffered logs before exiting."""
    logger.info("Graceful shutdown initiated")

    # Flush all handlers so buffered entries reach Logstash
    for handler in logging.root.handlers:
        handler.flush()

    # Give async senders a moment to drain
    time.sleep(2)

    sys.exit(0)

signal.signal(signal.SIGTERM, graceful_shutdown)
signal.signal(signal.SIGINT, graceful_shutdown)

Or log structured JSON to a local file with a library like python-json-logger and let a shipper such as Filebeat tail it - the file survives a process crash even when buffered network sends don't:

python
from pythonjsonlogger import jsonlogger
import logging
 
logger = logging.getLogger()
handler = logging.FileHandler('logs.json')
formatter = jsonlogger.JsonFormatter()
handler.setFormatter(formatter)
logger.addHandler(handler)
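For genuinely non-blocking logging, the standard library's `QueueHandler`/`QueueListener` pair decouples the request path from slow I/O. A sketch - the bounded queue size is an assumption you'd tune:

```python
import logging
import logging.handlers
import queue

def enable_async_logging(target_handler):
    """Application threads drop records onto an in-memory queue;
    a background QueueListener thread forwards them to the real
    handler (file, socket, Logstash shipper)."""
    # Bounded queue: when full, records are dropped rather than
    # blocking the inference path or exhausting memory.
    log_queue = queue.Queue(maxsize=10000)
    logging.getLogger().addHandler(logging.handlers.QueueHandler(log_queue))
    listener = logging.handlers.QueueListener(log_queue, target_handler)
    listener.start()
    return listener  # call listener.stop() during graceful shutdown to drain
```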

Testing your logging infrastructure is often neglected but critical for reliability. You need to test that logs are actually being captured and indexed. You need to test that your Logstash transformations are working correctly. You need to test that your queries return the expected results. This might sound basic, but teams often discover that their logs were being dropped or incorrectly parsed only when they desperately need them during an incident.

The operational burden of maintaining an ELK stack should not be underestimated. You need to monitor Elasticsearch health, manage disk space, handle shard allocation, and maintain the cluster. You might need Logstash and Kibana running as well. For small teams, this operational overhead might be too high. This is where managed services like Datadog or New Relic have value. They handle the operational burden in exchange for cost. The choice between self-hosted and managed depends on your team's operational maturity and budget.

Integration with incident response workflows is important. When you have an alert, you need a way to quickly go from the alert to the relevant logs. Some teams build custom tools that hyperlink alerts to Kibana dashboards showing relevant logs. Others use tools like Opsgenie that integrate with Elasticsearch for one-click access to context. This integration matters because time matters in incidents; every second spent searching for context is a second not spent fixing the problem.

One often overlooked aspect is log correlation across distributed systems. When an inference request flows through multiple services, different services log different parts of the request. The frontend logs that a request was made. The load balancer logs the routing decision. The inference server logs the actual computation. Without correlation, you're just looking at fragmented pieces. With a trace ID that flows through all services, you can reconstruct the full request. This is where distributed tracing concepts apply to logs.

The human factor matters in log analysis. Developers need to understand what the logs mean. What does this error message indicate? What values are normal, and what values indicate a problem? This is where documentation and runbooks come in. Good teams maintain documentation of their log formats, what different error messages mean, and how to debug common issues using logs.

Production Considerations

You're scaling your logging to thousands of requests per second. New constraints emerge:

Cost Optimization

Elasticsearch storage is expensive. One GB of logs per day, retained for three years, accumulates to over a terabyte of indexed data - easily on the order of $1000/year on managed Elasticsearch. Here's how to optimize:

  1. Delete low-value logs after a shorter retention period:
json
{
  "index_lifecycle_policy": {
    "phases": {
      "hot": {
        "min_age": "0d",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50GB"
          }
        }
      },
      "warm": {
        "min_age": "3d",
        "actions": {
          "set_priority": {
            "priority": 50
          },
          "forcemerge": {
            "max_num_segments": 1
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "searchable_snapshot": {}
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

This policy:

  • Keeps logs in the hot tier until rollover (full indexing and search speed)
  • Moves them to warm after 3 days (force-merged, cheaper)
  • Moves them to cold after 30 days (searchable snapshots: very cheap, slower search)
  • Deletes them after 90 days

Cost savings: 60-70%.

  2. Compress JSON before sending:
python
import gzip
import json
 
def log_compressed(log_entry):
    json_bytes = json.dumps(log_entry).encode()
    compressed = gzip.compress(json_bytes)

    # Send compressed bytes to Logstash; logstash_client stands in
    # for your transport wrapper, and the receiving input must be
    # configured to decompress gzip
    logstash_client.send(compressed)

Compression ratio: 3-5x for JSON.
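That ratio is easy to verify on your own payloads - repetitive JSON keys compress very well. A quick measurement on a representative batch (identical duplicate entries, used here for brevity, will overstate the ratio relative to varied traffic):

```python
import gzip
import json

# Hypothetical entry mirroring the schema used in this article
entry = {
    "request_id": "req-0001",
    "model_metadata": {"model_name": "bert-finetuned", "model_version": "3.2.1"},
    "inference_metrics": {"latency_ms": 42, "gpu_memory_mb": 1024},
}
raw = json.dumps([entry] * 100).encode()  # a batch, as a shipper would send it
packed = gzip.compress(raw)
print(f"compression ratio: {len(raw) / len(packed):.1f}x")
```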

High-Availability Elasticsearch

Production Elasticsearch needs redundancy:

yaml
# docker-compose.yml
services:
  elasticsearch-1:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.0.0
    environment:
      - discovery.seed_hosts=elasticsearch-1,elasticsearch-2,elasticsearch-3
      - cluster.initial_master_nodes=elasticsearch-1,elasticsearch-2,elasticsearch-3
      - xpack.security.enabled=true
    ports:
      - "9200:9200"
 
  elasticsearch-2:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.0.0
    environment:
      - discovery.seed_hosts=elasticsearch-1,elasticsearch-2,elasticsearch-3
      - cluster.initial_master_nodes=elasticsearch-1,elasticsearch-2,elasticsearch-3
 
  elasticsearch-3:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.0.0
    environment:
      - discovery.seed_hosts=elasticsearch-1,elasticsearch-2,elasticsearch-3
      - cluster.initial_master_nodes=elasticsearch-1,elasticsearch-2,elasticsearch-3

A 3-node cluster survives any single node failure.

Logstash Scaling

Logstash becomes the bottleneck. Single-instance Logstash can handle ~10K logs/sec. Beyond that, you need multiple Logstash instances with load balancing:

yaml
# Logstash cluster with load balancer
services:
  logstash-1:
    image: docker.elastic.co/logstash/logstash:8.0.0

  logstash-2:
    image: docker.elastic.co/logstash/logstash:8.0.0

  logstash-3:
    image: docker.elastic.co/logstash/logstash:8.0.0

  nginx:
    image: nginx:latest
    ports:
      - "5000:5000"   # Only the load balancer is exposed on the host
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf

nginx.conf:

upstream logstash {
    server logstash-1:5000;
    server logstash-2:5000;
    server logstash-3:5000;
}

server {
    listen 5000;
    location / {
        proxy_pass http://logstash;
    }
}

Load-balanced Logstash can handle 100K+ logs/sec.

Request Tracing with Correlation IDs

For complex distributed systems, trace requests across services using correlation IDs:

python
import contextvars
import json
import logging
import time
import uuid

from flask import Flask, request

logger = logging.getLogger(__name__)

# Context variable holding the correlation ID for the current request
correlation_id = contextvars.ContextVar('correlation_id', default=None)

def set_correlation_id(req):
    """Extract or generate a correlation ID from request headers."""
    cid = req.headers.get('X-Correlation-ID', str(uuid.uuid4()))
    correlation_id.set(cid)
    return cid

def log_with_correlation(event):
    """Add the correlation ID to every log entry."""
    cid = correlation_id.get()
    if cid:
        event['correlation_id'] = cid
    return event

# Usage
app = Flask(__name__)

@app.before_request
def before():
    set_correlation_id(request)

@app.route('/predict', methods=['POST'])
def predict():
    start = time.monotonic()

    # Fetch features; forward the correlation ID in outbound headers
    # so downstream services log the same ID
    features = feature_store.get(request.json['user_id'])
    logger.info(json.dumps(log_with_correlation({
        "step": "feature_fetch",
        "feature_count": len(features)
    })))

    # Predict
    prediction = model.predict(features)
    elapsed_ms = (time.monotonic() - start) * 1000
    logger.info(json.dumps(log_with_correlation({
        "step": "inference",
        "latency_ms": elapsed_ms
    })))

    return prediction

In Kibana, you can now search by correlation_id to see the entire request flow across all services.

Advanced Topics: Alert Strategies and On-Call Integration

Once you've got logs aggregating and dashboards visualized, the next frontier is alerting. Not all alerts are equal. An alert that fires 100 times a day trains your on-call engineer to ignore it. An alert that misses a real issue makes you look careless to your team.

The key is context-aware alerting. Don't alert on raw metrics. Alert on anomalies relative to baseline. Your model's latency might normally be 80ms but occasionally spike to 200ms due to network variability. A naive alert on latency > 200ms fires constantly. A smarter alert: "latency exceeded baseline by >3 standard deviations" fires only when something is genuinely wrong.
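A minimal version of that baseline-deviation check looks like this (Elasticsearch ML additionally models trend and seasonality, which this sketch deliberately does not):

```python
import math

def latency_anomalous(history, current_ms, n_sigma=3.0):
    """Return True when current_ms exceeds the mean of a recent
    latency window by more than n_sigma sample standard deviations."""
    if len(history) < 2:
        return False  # not enough data for a baseline
    mean = sum(history) / len(history)
    var = sum((x - mean) ** 2 for x in history) / (len(history) - 1)
    std = math.sqrt(var)
    if std == 0:
        return current_ms > mean  # flat baseline: any rise is anomalous
    return (current_ms - mean) / std > n_sigma
```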

Elasticsearch ML jobs compute this for you. They learn the normal pattern and alert only on genuine anomalies. Hook those alerts to your on-call rotation: when Elasticsearch ML detects a latency anomaly, it sends a Slack message with context (which model, which variant, what changed?), and an engineer can jump straight into Kibana to investigate.

Another pattern: multi-signal alerting. Alert when multiple indicators fire together, not individually. A single slow request might be noise. Latency slow AND error rate high AND confidence scores dropping? That's a real issue. You can encode this in Kibana as a compound query, or in your alerting layer (Elastic Alerting, PagerDuty, Opsgenie) as a correlation rule.
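If your alerting layer allows custom logic, the multi-signal rule can be a small function over a recent window of parsed entries. Field names loosely follow the schema in this article (`mean_confidence` matches the Loki extraction above, not the reference schema verbatim), and the thresholds are illustrative:

```python
def compound_alert(window):
    """Fire only when several weak signals agree. `window` is a list
    of parsed log entries covering the last few minutes."""
    if not window:
        return False
    n = len(window)
    slow = sum(1 for e in window
               if e["inference_metrics"]["latency_ms"] > 500) / n
    errors = sum(1 for e in window if e.get("log_level") == "ERROR") / n
    low_conf = sum(1 for e in window
                   if e["output"]["mean_confidence"] < 0.5) / n
    # Any single signal may be noise; all three together rarely are.
    return slow > 0.2 and errors > 0.05 and low_conf > 0.3
```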

The relationship between logging and metrics is symbiotic. Metrics give you the high-level view: your inference latency is high. Logs give you the details: which specific requests were slow, what parameters they had, what model they used. The combination is powerful. A well-designed observability system captures both metrics and logs and makes it easy to correlate them.

Sampling strategies for logs require careful thought. Logging everything creates too much data. Sampling too aggressively loses important information. One effective strategy is adaptive sampling where you log all errors, all slow requests, and a random sample of normal requests. This gives you visibility into problems without the overhead of logging everything. Another strategy is context-aware sampling where you log more detail in certain contexts: more detail on requests from a particular customer, more detail during certain hours, more detail for certain features.

The data model for ML logs often includes nested structures. A training job log might contain logs from each training step. An inference request log might contain logs from each stage of processing. Elasticsearch handles this well with nested documents, allowing you to query at multiple levels of detail. But this requires thoughtful design of your log schema. You need to decide what goes at the top level and what goes nested.

Machine learning on logs is a powerful capability. Beyond simple alerting based on thresholds, you can use unsupervised learning to detect anomalies. You can use classification to categorize logs. You can use clustering to discover patterns. Elasticsearch ML provides built-in capabilities for these kinds of analyses. This is where your logs become actionable intelligence, not just recording of events.

Compliance and data privacy are important considerations for logging. Logs might contain personally identifiable information. Logs might contain sensitive information like API keys or passwords. Your logging system needs to protect this information. You might need to redact or encrypt certain fields. You need to ensure that access to logs is properly controlled. Compliance requirements like GDPR or HIPAA might require retention limits on logs. You need policies and procedures for handling these requirements.

The operational aspect of log aggregation includes handling failures gracefully. What happens if Elasticsearch goes down? Do you keep logging locally and replay when Elasticsearch comes back up? Do you discard logs and accept the loss of visibility? Do you have a backup log aggregation system? These are operational questions that need answers before you have an emergency.

Log storage optimization involves index lifecycle management. Older indices might be read-heavy but rarely queried. You might compress them or move them to cheaper storage. New indices might be write-heavy. You might use different settings for new indices than old ones. Elasticsearch supports index lifecycle management policies that automate this process. These policies can automatically roll indices, delete old indices, or move them to different storage tiers.

Searching and querying logs at scale requires understanding Elasticsearch query patterns. Full-text search is powerful but slow. Filtering on indexed fields is fast. Aggregations and analytics are powerful for summarization. Time-series queries are optimized for time-based analysis. Understanding these patterns helps you write efficient queries that return results quickly.

The visualization of logs in Kibana should tell a story. Dashboards should answer specific questions. What's the health of my system? What are the top errors? What's the distribution of latencies? A good dashboard provides at-a-glance visibility into what matters. Dashboards are valuable for on-call rotations where engineers need to quickly assess system health.

Log retention policies balance visibility with cost. Keeping logs forever is expensive. Deleting logs immediately loses visibility. A common pattern is to keep detailed logs for a week, aggregated logs for a month, and summarized statistics forever. This gives you detailed visibility for recent events and broad context for historical patterns.

The integration of logging with other observability tools like tracing and metrics is increasingly important. A complete observability picture includes logs, metrics, and traces. Logs tell you what happened. Metrics tell you how much. Traces tell you how requests flowed through the system. The three together provide comprehensive visibility.

The operational excellence around logging comes from standardization. When all ML systems in your organization log in the same format with the same level of detail, operations becomes much easier. Engineers don't have to learn different logging styles for different systems. Dashboards and alerts can be reused across systems. Runbooks written for one system often apply to others. This standardization is valuable enough to spend time on.
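
One way to enforce a standard format is a shared formatter that every service imports. A minimal sketch with Python's standard `logging` module (the exact field set here is an assumption, not a prescribed schema):

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Emit every record as one JSON object with org-wide standard fields."""

    def format(self, record):
        doc = {
            "@timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ",
                                        time.gmtime(record.created)),
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "message": record.getMessage(),
        }
        # Merge structured extras (model_version, latency_ms, ...) if present.
        doc.update(getattr(record, "fields", {}))
        return json.dumps(doc)

logger = logging.getLogger("ml")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("prediction served",
            extra={"service": "ranker", "fields": {"latency_ms": 12.4}})
```

Because every team uses the same formatter, a Kibana dashboard built on `service` and `level` works across all of them unchanged.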

Mixing structured and unstructured logging is a practical reality. Some parts of your system output free-form text logs (deep learning frameworks during training, for example), while others output structured JSON. Logstash can parse both and convert them to structured form, but parsing free-form text is error-prone. Finding the right balance between structured and flexible is important.
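
As a sketch of what that parsing step looks like (the log line format here is hypothetical, modeled on typical training-loop output):

```python
import re

# Hypothetical free-form line from a training framework:
#   "Epoch 3/10 - loss: 0.4512 - val_loss: 0.5120"
LINE_RE = re.compile(r"Epoch (\d+)/(\d+) - loss: ([\d.]+) - val_loss: ([\d.]+)")

def parse_training_line(line: str):
    """Convert a free-form training line to a structured dict.

    Returns None when the line doesn't match, so the caller can fall
    back to shipping the raw text instead of dropping it.
    """
    m = LINE_RE.match(line)
    if not m:
        return None
    epoch, total, loss, val_loss = m.groups()
    return {"epoch": int(epoch), "total_epochs": int(total),
            "loss": float(loss), "val_loss": float(val_loss)}
```

The same idea expressed as a Logstash grok pattern would live in the pipeline config instead; either way, the key design choice is the explicit fallback for lines that fail to parse.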

The performance impact of logging on your application is worth considering. Logging is not free. Writing logs consumes CPU. Sending logs over the network consumes bandwidth. If your logging is too aggressive, it can slow down your inference. You need to measure the overhead and optimize. Some applications log to disk asynchronously to avoid blocking on network I/O.
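
Python's standard library supports this pattern directly with `QueueHandler` and `QueueListener`: the hot path only enqueues records, and a background thread does the slow writing. A minimal sketch:

```python
import logging
import logging.handlers
import queue

# Bounded queue so a stuck sink can't exhaust memory.
log_queue = queue.Queue(maxsize=10000)

queue_handler = logging.handlers.QueueHandler(log_queue)
sink = logging.StreamHandler()  # stand-in for a slow network/file handler

listener = logging.handlers.QueueListener(log_queue, sink)
listener.start()  # background thread drains the queue

logger = logging.getLogger("inference")
logger.addHandler(queue_handler)
logger.setLevel(logging.INFO)

logger.info("request handled")  # returns almost immediately; no I/O here
listener.stop()  # flushes remaining records on shutdown
```

The inference thread pays only the cost of a queue put; network and disk latency are absorbed by the listener thread.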

The cardinality of your log fields matters for Elasticsearch performance. If you have a field that can take millions of unique values (like request IDs), Elasticsearch needs to index all of them. This consumes memory and makes searches slow. You need to be selective about what you index. Request IDs might not need to be indexed; they're useful only for debugging specific requests. User IDs might need to be indexed because you frequently filter by them.
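
In an index mapping this selectivity is explicit. A sketch (field names follow this article's examples; whether `request_id` should really be unindexed depends on your query patterns):

```python
# Index mapping sketch: request_id stays stored and retrievable in results,
# but "index": False means Elasticsearch builds no search structures for it.
mapping = {
    "mappings": {
        "properties": {
            "user_id": {"type": "keyword"},                      # filtered often -> indexed
            "request_id": {"type": "keyword", "index": False},   # debug-only -> not indexed
            "latency_ms": {"type": "float"},
            "message": {"type": "text"},
        }
    }
}
```

You can still see `request_id` in any document you pull up by other filters; you just can't search by it, which is usually an acceptable trade for the memory savings.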

The backup and recovery of logs is important for disaster recovery. What if you lose the log files? What if Elasticsearch becomes corrupted? You need backups of your logs. You might use snapshots to backup Elasticsearch. You might archive logs to S3 regularly. You might run multiple Elasticsearch clusters for redundancy. The choice depends on how critical your logs are.

The security of log data is important. Logs might contain sensitive information. You need to ensure that access to logs is properly controlled. You might need encryption at rest and in transit. You might need to redact sensitive information before storing logs. You might need to maintain audit logs of who accessed which logs.

The migration of logs between systems is sometimes necessary. Maybe you outgrow ELK and want to move to a managed service. Maybe you want to change the log format. Maybe you want to consolidate logs from multiple systems. Planning for these migrations is wise. You should design your logging infrastructure to be portable.

The cost analysis of logging is important for budgeting. How much does it cost to log, store, and search logs? How many queries do you run per day? These costs add up. Understanding them helps you make informed decisions about logging granularity and retention.

The relationship between debugging and logging is intimate. When something goes wrong, the first thing engineers do is search the logs. Good logs make debugging easy. Poor logs make debugging frustrating. Logging is worth the effort because it directly impacts how quickly you can resolve issues.

The culture of logging in an organization matters. If engineers don't see value in logging, they won't spend time making logs good. If logs have helped solve actual problems, engineers become believers. Building this culture requires leaders who emphasize the value of observability and who allocate time for improving logging infrastructure.

The future of log aggregation will likely involve more AI and automation. Instead of engineers writing alerts, the system automatically detects anomalies. Instead of engineers writing dashboards, the system automatically creates relevant visualizations. Instead of engineers writing queries, the system understands what you're asking and executes the right query. These advancements will make log aggregation more powerful and accessible.

Debugging Workflows: From Alert to Root Cause

The ultimate goal is fast debugging. When an alert fires at 3 AM, you want to go from "something is wrong" to "the culprit is X" in minutes, not hours.

Your ELK setup enables this by making logs searchable and correlated. When latency spikes, you can immediately:

  1. Filter logs by model_version to see if it's a recent deployment
  2. Filter by serving_variant to isolate canary vs. production traffic
  3. Check feature_store_latency to see if the bottleneck is upstream
  4. Look at GPU memory usage to see if OOM is forcing slowdowns
  5. Trace request IDs through the inference pipeline to spot where time is lost
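
The first two steps above collapse into a single query body. A sketch, assuming the field names used in this article:

```python
# Latency spike triage: restrict to recent canary traffic, then break the
# p99 latency down by model_version to spot a bad deployment at a glance.
spike_query = {
    "query": {
        "bool": {
            "filter": [
                {"range": {"@timestamp": {"gte": "now-30m"}}},
                {"term": {"serving_variant": "canary"}},
            ]
        }
    },
    "aggs": {
        "by_model_version": {
            "terms": {"field": "model_version"},
            "aggs": {
                "p99": {"percentiles": {"field": "latency_ms",
                                        "percents": [99]}}
            },
        }
    },
    "size": 0,  # we only want the aggregation, not raw hits
}
```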

Without structured logging, you're grepping raw log files and stitching context together by hand. With ELK, you're clicking through dashboards and building a mental model in real time.

Write your debugging workflow down. Document the sequence: "When the latency alert fires, I check these three dashboards in this order, then dive into Kibana with these queries." Make it repeatable. Train new team members on it. Your on-call rotation will run smoother.

Wrapping It Up

Log aggregation transforms debugging from archaeological excavation to scientific investigation. You move from "why is the model down?" to "here's the exact sequence of events" in seconds.

The workflow is:

  1. Instrument your model code with structured JSON logging
  2. Aggregate those logs with Logstash or similar
  3. Index in Elasticsearch with proper templates
  4. Query and visualize with Kibana dashboards
  5. Alert on anomalies before users notice

Start simple - capture model_version, latency_ms, and confidence_scores. Add more fields as you understand your system better. Your logs are your system's nervous system. Make them informative.
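
A minimal version of that starting point, assuming logs go to stdout for a shipper like Filebeat to pick up (the wrapper function and field names are illustrative):

```python
import json
import sys
import time

def log_prediction(model_fn, features, model_version, stream=sys.stdout):
    """Run `model_fn` and emit one structured JSON log line per request."""
    start = time.perf_counter()
    scores = model_fn(features)  # the actual inference call
    stream.write(json.dumps({
        "event": "prediction",
        "model_version": model_version,
        "latency_ms": round((time.perf_counter() - start) * 1000, 3),
        "confidence_scores": scores,
    }) + "\n")
    return scores

# Usage with a stand-in model:
log_prediction(lambda x: [0.9, 0.1], {"feature_a": 1.0}, "v1")
```

Three fields, one line per request, and you already have enough structure to build the latency and confidence dashboards described above.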

