Scaling ML: Kubernetes and Cloud Deployment

You've built a machine learning model that actually works. You've trained it, validated it, monitored it in production. Now your team wants to run three models. Then ten. Then forty. You need infrastructure that doesn't break when demand spikes, and you need to manage it without hiring an entire DevOps team.
Enter Kubernetes. It's not magic (though it sometimes feels that way). It's orchestration: a system that manages containers at scale, handles resource allocation, restarts failed services, and scales automatically when traffic surges. For ML workloads, it's become the industry standard.
This isn't theory. We're walking through a complete Kubernetes deployment, from Docker image to live endpoint with automatic scaling, and then looking at when you might skip Kubernetes entirely for simpler cloud options.
But first, let's talk about why this problem exists in the first place. Moving from a notebook experiment to a model that serves real users is one thing. Moving from one model to a fleet of models, each with different hardware requirements, traffic patterns, and update schedules, is an entirely different engineering challenge. The naive approach, SSH into a server, run your script, cross your fingers, collapses under that kind of load. You end up with snowflake servers no one fully understands, models that quietly die in the night, and engineers spending weekends restarting inference jobs instead of building new ones.
What you actually want is infrastructure-as-code: a declarative description of what your system should look like, enforced continuously by tooling you don't have to babysit. Kubernetes is that tooling. It was built by Google, who ran more containers in a week than most companies run in a lifetime, and open-sourced in 2014. Since then it has become the foundation that AWS, Google Cloud, Azure, and virtually every serious production ML platform is built on top of. Learning Kubernetes isn't just a career skill, it's the common language of production ML infrastructure.
In this guide we'll build from first principles. We'll containerize a real PyTorch model, write the Kubernetes manifests to deploy it, expose it to traffic, add persistent storage for model artifacts, wire up autoscaling, and finally step back to compare cloud-managed Kubernetes options and the simpler serverless alternatives that might suit your needs better. By the end you'll have a repeatable blueprint you can adapt for your own models.
Table of Contents
- Why Kubernetes for ML?
- Kubernetes Core Concepts for ML
- Deploying a Model with Kubernetes: Step-by-Step
- Step 1: Package Your Model in Docker
- Step 2: Create a Kubernetes Deployment
- Step 3: Expose with a Service
- Step 4: Add Persistent Storage for Model Artifacts
- Step 5: Automatic Scaling with HPA
- Managing Multiple Models with Helm
- Why Kubernetes for ML: The Deeper Case
- Cloud Deployment Patterns
- Comparing Cloud Kubernetes Services
- Auto-Scaling Strategies
- Common Scaling Mistakes
- When to Skip Kubernetes Entirely
- GPU Scheduling and Resource Management
- Monitoring and Observability
- The Complete Picture: From Commit to Production
- Conclusion
Why Kubernetes for ML?
ML models have unique infrastructure demands that standard web hosting strains to meet. A typical web service scales roughly linearly with user traffic: add more RAM, add more CPU, done. Models demand GPU memory, consistent model artifact storage, and precise resource allocation. They also need to coexist peacefully: one model consuming all available GPU RAM shouldn't crash another.
Beyond resource contention, ML workloads tend to be stateful in ways that complicate traditional deployment. A model's weights might be several gigabytes. Loading them on every container restart wastes minutes and generates real costs. Model serving often benefits from batching multiple requests together for GPU efficiency, which requires careful orchestration between processes. And ML teams deploy frequently, new training runs produce new model versions weekly or even daily, which demands zero-downtime update mechanisms.
Kubernetes solves this by:
- Resource isolation: Each pod (container) gets declared CPU/memory requests and limits
- Hardware targeting: Schedule pods on nodes with GPUs, high memory, or specific hardware
- State persistence: Persistent volumes hold model artifacts, avoiding expensive downloads on restart
- Autoscaling: Horizontal Pod Autoscaler watches CPU and custom metrics, spinning up replicas under load
- Rolling updates: Deploy new model versions without downtime
- Service discovery: Internal DNS so services find each other automatically
For ML teams, this is the difference between "manually logging into servers to restart processes" and "infrastructure that fixes itself." When a pod crashes at 3 AM, Kubernetes detects it within seconds, starts a replacement, and routes traffic away from the failed instance, all before you even get a notification. When a traffic spike hits your prediction API, HPA can double your replica count in under a minute without a human in the loop. That operational confidence is what lets small ML teams manage large fleets of models without burning out.
Kubernetes Core Concepts for ML
If you're new to K8s, here's the essential vocabulary:
Pod: The smallest unit. Usually one container (your model service), but it can be multiple. Pods are ephemeral, they get created and destroyed constantly.
Deployment: Manages a set of pods. "I want 3 replicas of my FastAPI model server." Deployment watches them, restarts failures, scales them up or down.
Service: Exposes pods to the network. Think of it as a stable load balancer in front of your pods. Your application talks to the service; the service routes to healthy pods.
Ingress: Routes external traffic to services. Handles TLS, path-based routing, and domain binding, the front door of your cluster.
PersistentVolume (PV) & PersistentVolumeClaim (PVC): Persistent storage. Models live here. When a pod restarts, it reconnects to the same storage.
ConfigMap & Secret: Configuration and credentials. Store API keys, model paths, hyperparameters without baking them into your Docker image.
HPA (Horizontal Pod Autoscaler): Watches metrics (CPU, custom metrics) and automatically scales replica count. Underutilized? Scale down. Traffic spiking? Scale up.
This might seem abstract. Let's make it concrete.
Deploying a Model with Kubernetes: Step-by-Step
We'll deploy a simple PyTorch image classifier served by FastAPI. The goal: a scalable, self-healing model endpoint that can handle hundreds of concurrent requests.
Step 1: Package Your Model in Docker
Your model lives in a Docker image. The Dockerfile is the blueprint Kubernetes uses to start new pods, every replica of your service is a fresh container built from this same image, which is why getting the Dockerfile right matters enormously. We want the image to be lean (faster pulls), deterministic (consistent behavior across restarts), and self-contained (no runtime dependency on the host machine). Here's a minimal Dockerfile:
FROM python:3.11-slim
WORKDIR /app
# Copy model artifact
COPY model.pt /app/model.pt
COPY requirements.txt /app/requirements.txt
RUN pip install -r requirements.txt
# Copy FastAPI server
COPY server.py /app/server.py
EXPOSE 8000
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]

Your FastAPI server (server.py):
from fastapi import FastAPI
from pydantic import BaseModel
import base64
import io
import torch
import torch.nn.functional as F
from torchvision import transforms
from PIL import Image

app = FastAPI()

# Load model once at startup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.load("/app/model.pt", map_location=device)
model.eval()

class PredictionRequest(BaseModel):
    image_base64: str

@app.post("/predict")
async def predict(request: PredictionRequest):
    # Decode and preprocess image
    image_data = io.BytesIO(base64.b64decode(request.image_base64))
    image = Image.open(image_data).convert("RGB")
    # Normalize
    transform = transforms.Compose([
        transforms.Resize(224),
        transforms.ToTensor(),
        transforms.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225]
        )
    ])
    x = transform(image).unsqueeze(0).to(device)
    with torch.no_grad():
        logits = model(x)
    probs = F.softmax(logits, dim=1)
    return {"prediction": probs.tolist()}

@app.get("/health")
async def health():
    return {"status": "ok"}

Notice two important design decisions in that server code. First, the model is loaded at startup, once, not on every request. That single torch.load() call might take a few seconds, but after that, every prediction hits an already-warm model in GPU memory. Loading per-request would make your latency 10x worse and your GPU thrash constantly. Second, the /health endpoint exists purely for Kubernetes: it has no ML logic, just a 200 OK response, so the orchestrator can tell the difference between "container started" and "model loaded and ready to serve traffic."
Build and push to your container registry before deploying:
docker build -t myregistry.azurecr.io/image-classifier:v1 .
docker push myregistry.azurecr.io/image-classifier:v1

Why this structure? Your model artifact (model.pt) is loaded at container startup, not on every request. The health endpoint is critical for Kubernetes readiness probes. Once this image is in your registry, Kubernetes can pull it onto any node in your cluster: the model is portable.
Step 2: Create a Kubernetes Deployment
The Deployment manifest is where you tell Kubernetes what you want your service to look like. You're not running a command like "start three containers", you're declaring desired state and handing that declaration to a control loop that enforces it continuously. If a pod dies, the control loop creates a new one. If you change the replica count, the control loop converges on the new target. This declarative model is what makes Kubernetes so powerful for production workloads.
Save this as deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: image-classifier
  labels:
    app: image-classifier
spec:
  replicas: 2  # Start with 2 pods
  selector:
    matchLabels:
      app: image-classifier
  template:
    metadata:
      labels:
        app: image-classifier
    spec:
      containers:
      - name: model-server
        image: myregistry.azurecr.io/image-classifier:v1
        ports:
        - containerPort: 8000
        # Resource requests and limits
        resources:
          requests:
            memory: "2Gi"
            cpu: "500m"
            nvidia.com/gpu: "1"
          limits:
            memory: "4Gi"
            cpu: "2000m"
            nvidia.com/gpu: "1"
        # Readiness probe: only send traffic if this passes
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 10
          periodSeconds: 5
        # Liveness probe: restart if this fails
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10

Key decisions here:
- replicas: 2: We run 2 copies. If one dies, the other handles traffic.
- requests: Kubernetes reserves this much memory/CPU per pod. Your model gets 2GB RAM, 0.5 CPU cores, and 1 GPU.
- limits: Pod gets killed if it exceeds these. Prevents runaway processes from starving the node.
- readinessProbe: Waits 10 seconds after startup, then checks /health every 5 seconds. If it fails, Kubernetes removes the pod from service (but doesn't restart it).
- livenessProbe: After 30 seconds, checks /health every 10 seconds. If it fails repeatedly, Kubernetes restarts the container.
The distinction between requests and limits trips up a lot of people early on. Think of requests as a reservation, Kubernetes won't schedule a pod on a node that doesn't have enough free capacity to honor the request. Limits are a ceiling, if a pod tries to use more memory than its limit, the kernel kills it with an OOM error. Setting limits prevents one misbehaving model from taking down an entire node. For ML workloads, set your memory limit generously (2x your expected peak) but your CPU limit tightly, since inference code rarely benefits from unbounded CPU beyond a certain point.
Deploy and verify the pods come up healthy:
kubectl apply -f deployment.yaml

Check status:
kubectl get pods
kubectl describe pod image-classifier-<pod-id>
kubectl logs image-classifier-<pod-id>

Step 3: Expose with a Service
Services give you a stable DNS name. Create service.yaml:
apiVersion: v1
kind: Service
metadata:
  name: image-classifier-service
spec:
  type: LoadBalancer
  selector:
    app: image-classifier
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8000

Deploy:
kubectl apply -f service.yaml
kubectl get svc

You'll see an EXTERNAL-IP. Hit that at port 80, it routes to port 8000 inside the pods.
The Service abstracts away the fact that pods come and go. Your client code calls one stable endpoint; the Service figures out which pod is healthy and routes accordingly. This is also how you get load balancing for free, requests distribute across all healthy replicas. For production, you'd use type: ClusterIP (internal only) and pair it with an Ingress for external routing, TLS, and domain binding. The Ingress layer lets you route /classifier/v1/predict and /detector/v2/predict to different services behind a single external IP, which becomes essential once you're running multiple models.
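To make that multi-model routing concrete, here's a sketch of an Ingress using the NGINX ingress controller. The hostname and the second service name (object-detector-service) are illustrative assumptions; the first backend matches the Service created above.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: models-ingress
spec:
  ingressClassName: nginx
  rules:
  - host: models.example.com   # illustrative hostname
    http:
      paths:
      - path: /classifier
        pathType: Prefix
        backend:
          service:
            name: image-classifier-service
            port:
              number: 80
      - path: /detector        # hypothetical second model
        pathType: Prefix
        backend:
          service:
            name: object-detector-service
            port:
              number: 80
```

With this in place, both models share one external IP and one TLS certificate, and adding a new model is just another path entry.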
Step 4: Add Persistent Storage for Model Artifacts
Models are often too large to rebuild on every deployment. Use a PersistentVolumeClaim:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
spec:
  accessModes:
  - ReadOnlyMany  # Multiple pods can read simultaneously
  storageClassName: standard
  resources:
    requests:
      storage: 10Gi

Update your deployment to mount it:
spec:
  containers:
  - name: model-server
    volumeMounts:
    - name: model-volume
      mountPath: /models
  volumes:
  - name: model-volume
    persistentVolumeClaim:
      claimName: model-storage

Now /models is backed by durable storage. Deploy a new version, the old pods restart, and they reconnect to the same model files. No re-download needed. This separation between model artifacts and application code is one of the key patterns in production ML: it means you can update your serving code without retraining, or swap in a new model version without rebuilding your Docker image. In larger setups, teams point this PVC at a model registry (MLflow, W&B, or a plain S3 bucket mounted via FUSE), so model promotion from staging to production is a metadata update rather than a file copy.
Step 5: Automatic Scaling with HPA
Here's the magic. The Horizontal Pod Autoscaler watches CPU usage and scales automatically. Without autoscaling, you're forced to over-provision, running enough replicas to handle your peak traffic at all times, even when utilization is at 5% at 3 AM. With HPA, you run a lean baseline and let the cluster breathe to meet demand, dramatically cutting your cloud bill while improving reliability under load.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: image-classifier-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: image-classifier
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Don't thrash, wait 5 min before scaling down
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0  # Scale up immediately
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30

Deploy:
kubectl apply -f hpa.yaml
kubectl get hpa

Now: if CPU usage exceeds 70% across your pods, Kubernetes spins up a new replica within 30 seconds. If usage drops below 70%, it scales down after 5 minutes (to avoid flapping). The asymmetry in the behavior block is intentional and reflects the reality of ML workloads: you want to scale up fast when a traffic surge hits, but scale down slowly to avoid the situation where HPA removes a pod mid-request and then immediately needs to add it back. The five-minute stabilization window for scale-down is conservative by design.
Why custom metrics? You can tie scaling to model-specific metrics: inference latency, queue depth, or custom metrics from Prometheus. That's advanced, but if your inference time explodes, you want to scale before CPU maxes out. A model processing large batches might peg one GPU thread at 100% while other cores sit idle, CPU-based HPA would miss that signal entirely, but a custom metric tracking p95 inference latency would catch it immediately.
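As a sketch of what that looks like: assuming you've installed the Prometheus Adapter and it exposes a per-pod latency metric (the metric name below is hypothetical), the metrics section of the HPA could target latency directly instead of CPU:

```yaml
metrics:
- type: Pods
  pods:
    metric:
      name: inference_latency_p95_seconds  # hypothetical adapter-exposed metric
    target:
      type: AverageValue
      averageValue: "500m"  # scale out when per-pod p95 exceeds 0.5 s
```

The rest of the HPA manifest stays the same; only the signal driving the scaling decision changes.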
Managing Multiple Models with Helm
Deploying 10 models means 10 deployments, 10 services, 10 HPAs. That's 30 YAML files. Managing that by hand is not just tedious, it's a maintenance nightmare. When you need to update resource limits across all your models, or change your readiness probe timeout globally, you're editing 30 files and hoping you don't miss one. Helm is templating for Kubernetes, it reduces boilerplate and makes deployments repeatable.
Create a Helm chart structure:
my-ml-chart/
├── Chart.yaml
├── values.yaml
└── templates/
    ├── deployment.yaml
    ├── service.yaml
    ├── hpa.yaml
    └── pvc.yaml
Chart.yaml:
apiVersion: v2
name: ml-model
description: ML model deployment
version: 1.0.0

values.yaml (defaults):
replicaCount: 2
image:
  repository: myregistry.azurecr.io/image-classifier
  tag: v1
resources:
  requests:
    memory: "2Gi"
    cpu: "500m"
  limits:
    memory: "4Gi"
    cpu: "2000m"
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  cpuUtilizationPercent: 70

templates/deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: {{ .Release.Name }}
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}
    spec:
      containers:
      - name: model-server
        image: {{ .Values.image.repository }}:{{ .Values.image.tag }}
        resources: {{ toYaml .Values.resources | nindent 12 }}

The power of this structure is that each model gets its own values-<model>.yaml override file that specifies only what's different, image tag, resource requests, max replicas, while inheriting all the default patterns from the chart. A new team member can deploy a new model to production by writing a fifteen-line YAML file and running one command, without needing to understand the full Kubernetes machinery underneath.
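For instance, a hypothetical values-object-detector.yaml might look like this; every number here is an illustrative override of the chart defaults, not a recommendation:

```yaml
# values-object-detector.yaml -- overrides only what differs from defaults
replicaCount: 3
image:
  repository: myregistry.azurecr.io/object-detector
  tag: v2
resources:
  requests:
    memory: "6Gi"    # detection models tend to need more RAM
    cpu: "1000m"
  limits:
    memory: "8Gi"
    cpu: "4000m"
autoscaling:
  minReplicas: 3
  maxReplicas: 20
  cpuUtilizationPercent: 60
```

Everything not listed here (probes, service wiring, autoscaling behavior) comes from the chart's values.yaml.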
Deploy multiple models:
helm install classifier1 my-ml-chart -f values-classifier1.yaml
helm install classifier2 my-ml-chart -f values-classifier2.yaml
helm install object-detector my-ml-chart -f values-object-detector.yaml

Each gets its own deployment, service, HPA, with one Helm command. Upgrades are unified:
helm upgrade classifier1 my-ml-chart -f values-classifier1.yaml

This single command triggers a rolling update: Kubernetes starts new pods with the updated image, waits for them to pass readiness checks, then terminates the old ones. If anything goes wrong, helm rollback classifier1 0 reverts to the previous release in seconds.
Why Kubernetes for ML: The Deeper Case
We covered the tactical reasons earlier, resource isolation, autoscaling, rolling updates. But it's worth stepping back to understand why Kubernetes specifically, rather than just any container orchestration system, has become the default for ML infrastructure at scale.
The answer is ecosystem depth. Kubernetes has accumulated an extraordinary collection of ML-specific tooling built on top of its core primitives. Kubeflow turns a Kubernetes cluster into an end-to-end ML platform, you get Jupyter notebooks, distributed training via TFJob and PyTorchJob, hyperparameter tuning via Katib, and model serving via KFServing, all orchestrated on the same infrastructure. NVIDIA's GPU Operator automates the installation and configuration of GPU drivers across your nodes. Argo Workflows lets you define complex ML pipelines as Kubernetes resources, with branching, retries, and artifact passing between steps. MLflow can be deployed as a Kubernetes service and integrates with your model registry. Seldon Core and BentoML both offer Kubernetes-native model serving with A/B testing, canary deployments, and traffic splitting built in.
The cumulative effect is that adopting Kubernetes for your ML infrastructure unlocks an ecosystem rather than just a tool. Your DevOps team almost certainly already knows Kubernetes. Your cloud provider offers managed Kubernetes as a first-class product. The hiring pool for Kubernetes engineers is far larger than for any proprietary orchestration system. And when you eventually want to run multi-cloud or hybrid infrastructure, some training on-prem, inference on cloud, Kubernetes provides the common abstraction layer that makes that possible.
This doesn't mean Kubernetes is always the right choice. But it does mean that when you outgrow simpler solutions, Kubernetes is the direction the industry has converged on.
Cloud Deployment Patterns
Beyond choosing which Kubernetes service to use, you also need to decide how you structure your workloads across cloud resources. The patterns that work for web services often need adjustment for ML, because ML has fundamentally different cost and performance characteristics.
The most common pattern for ML inference is a spot/preemptible instance pool for training combined with on-demand instances for inference serving. Training jobs can tolerate interruption, if a spot instance gets reclaimed, you checkpoint and resume. Inference serving cannot tolerate interruption, a user waiting for a prediction cannot wait for a new node to spin up. By separating these workloads into different node pools with different instance types, you dramatically cut training costs while maintaining serving reliability.
A second important pattern is model caching at the edge of your cluster. If you have twenty models and each is 500MB, you don't want every node to download all twenty models on startup. Instead, use a distributed caching layer, Redis, Memcached, or a purpose-built model cache, so models are downloaded once per node and then served from local storage. Combined with Kubernetes node affinity rules that keep a model's pods pinned to specific nodes, you can effectively pre-warm model caches and eliminate cold-start latency.
A third pattern that production teams rely on is blue-green or canary deployment for model updates. Rather than rolling a new model version out to all replicas simultaneously, you route a small percentage of traffic, say, 5%, to the new version and watch the metrics. If latency, error rate, and accuracy look good, gradually shift more traffic. If something goes wrong, flip 100% back to the old version with a single command. Most Kubernetes ingress controllers (Istio, Nginx, Traefik) support traffic splitting out of the box, making this pattern easy to implement without custom code.
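With Istio, for example, that 5% canary split could be declared as a VirtualService. This is a sketch: it assumes a DestinationRule already defines the v1 and v2 subsets (e.g. by pod label), and the host name matches our Service from earlier.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: image-classifier
spec:
  hosts:
  - image-classifier-service
  http:
  - route:
    - destination:
        host: image-classifier-service
        subset: v1     # current model version
      weight: 95
    - destination:
        host: image-classifier-service
        subset: v2     # canary model version
      weight: 5
```

Shifting traffic is then just editing the weights and re-applying; rolling back is setting them to 100/0.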
Comparing Cloud Kubernetes Services
You don't have to run Kubernetes yourself. Cloud providers offer managed K8s:
EKS (AWS Elastic Kubernetes Service)
- AWS ecosystem integration (S3 for models, ECR for images, CloudWatch for logs)
- Spot instances for cheap GPU nodes (great for ML)
- AWS-native autoscaling and load balancing
- Cost: You pay for the control plane (roughly $0.10/hour) plus node costs
GKE (Google Kubernetes Engine)
- Google's own platform; tight integration with GCP services
- Built-in monitoring and logging dashboards
- Anthos for hybrid/multi-cloud (if you have on-prem clusters too)
- Cost: Control plane is free; you pay for nodes
AKS (Azure Kubernetes Service)
- Azure integration (Azure ML, Azure Container Registry)
- Cheaper than EKS for some node types
- GPU node pools with automatic scaling
- Cost: Control plane is free; you pay for nodes
Decision matrix:
- Using AWS heavily? EKS.
- Using Google Cloud ML Platform? GKE integrates seamlessly.
- Standardizing on Azure? AKS.
- Evaluating all three? GKE has the lowest control-plane costs.
Auto-Scaling Strategies
The HPA we configured earlier is a solid foundation, but real production systems need more nuance. The challenge with ML workloads is that the relationship between traffic and resource consumption is often non-linear, a model might handle 10 requests per second at 30% CPU utilization, but 15 requests per second at 95% utilization due to request batching behavior or memory pressure. Purely reactive scaling (wait until CPU hits 70%, then scale) can result in degraded latency during the ramp-up period.
The more sophisticated approach is predictive scaling. If you have historical traffic patterns, and most production systems do, you can pre-scale before demand arrives rather than reacting to it. AWS and GKE both offer scheduled scaling actions that let you say "increase minimum replicas to 8 between 9 AM and 6 PM on weekdays." This is crude but effective for predictable patterns.
For unpredictable traffic, KEDA (Kubernetes Event-Driven Autoscaler) is worth investigating. KEDA extends HPA with dozens of event sources, SQS queue depth, Kafka lag, Prometheus metrics, Azure Service Bus, so you can scale based on signals that are more directly meaningful to your model's load. If inference requests queue in an SQS queue, scaling on queue depth is far more responsive than scaling on CPU, because you can add replicas the moment the queue starts growing rather than waiting for CPU to spike.
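A minimal KEDA ScaledObject for that SQS case might look like the sketch below; the queue URL, region, and thresholds are placeholder assumptions.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: image-classifier-scaler
spec:
  scaleTargetRef:
    name: image-classifier     # the Deployment from earlier
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/inference-queue
      queueLength: "20"        # target messages per replica
      awsRegion: us-east-1
```

KEDA manages an HPA under the hood, so this coexists with the resource-based scaling concepts we've already covered.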
Finally, don't neglect Cluster Autoscaler alongside pod-level HPA. HPA adds more pods; Cluster Autoscaler adds more nodes when pods can't be scheduled. Both are necessary for a fully elastic system. Without Cluster Autoscaler, HPA creates pods that sit in Pending state waiting for capacity that never comes. The two work together: HPA signals demand via pending pods, Cluster Autoscaler provisions nodes to satisfy it, pods start, and HPA continues scaling until utilization stabilizes.
Common Scaling Mistakes
After helping teams move from manual server management to Kubernetes, the same patterns of mistakes appear repeatedly. Knowing them in advance will save you significant pain.
Setting resource requests too low. This is the most common mistake. Underestimating memory requests causes two problems: Kubernetes schedules too many pods onto a single node (because it thinks there's more room than there is), and pods get OOM-killed during inference spikes when actual usage exceeds the (wrong) limit. The fix is to profile your model's memory usage under realistic load before setting these values. A simple approach: run your model in Docker locally, load-test it at 2x expected peak traffic, and watch docker stats. Set your Kubernetes memory request to 110% of the median usage, and your limit to 150%.
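To turn that rule of thumb into concrete values, here's a tiny helper. The 110%/150% factors are the heuristic described above, not anything Kubernetes-defined, and the profiled median is an input you measure yourself with docker stats.

```python
def mebibytes(mi: float) -> str:
    """Format a MiB value as a Kubernetes quantity string, e.g. '1980Mi'."""
    return f"{int(round(mi))}Mi"

def size_memory(median_mib: float,
                request_factor: float = 1.10,
                limit_factor: float = 1.50) -> dict:
    """Derive memory request/limit from profiled median usage under load."""
    return {
        "request": mebibytes(median_mib * request_factor),
        "limit": mebibytes(median_mib * limit_factor),
    }

# Profiled median of 1800 MiB under 2x expected peak traffic:
print(size_memory(1800))  # {'request': '1980Mi', 'limit': '2700Mi'}
```

Paste the resulting strings straight into the resources block of your Deployment manifest.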
Ignoring pod disruption budgets. A PodDisruptionBudget (PDB) tells Kubernetes how many pods can be unavailable simultaneously during voluntary disruptions like node drains or upgrades. Without a PDB, Kubernetes might evict all your pods at once during a cluster upgrade, taking your model endpoint down completely. A simple minAvailable: 1 PDB ensures at least one pod stays running through any disruption.
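A minimal PDB for our classifier, matching the labels from the Deployment above:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: image-classifier-pdb
spec:
  minAvailable: 1          # keep at least one pod serving through drains/upgrades
  selector:
    matchLabels:
      app: image-classifier
```

With replicas: 2, this forces cluster upgrades to evict the pods one at a time instead of both at once.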
Scaling on CPU alone for GPU workloads. GPU-accelerated inference often runs at 10-20% CPU while the GPU is at 100%. If your HPA watches CPU, it will see low utilization and scale down, right when your service is under maximum load. Always expose GPU utilization as a custom metric via DCGM Exporter and set your HPA to scale on GPU utilization when running GPU workloads.
Not setting a scale-down stabilization window. Without the stabilizationWindowSeconds we configured earlier, HPA can exhibit "flapping", scaling down as a traffic burst ends, then immediately scaling back up as the next burst arrives. Each scale event takes 30-60 seconds, during which newly-scheduled pods are starting their model loading sequences. The result is degraded latency at exactly the moments you need headroom most.
Forgetting that scale-up has latency. Even with HPA configured correctly, there's a gap between when load exceeds your threshold and when new pods are fully ready to serve. For ML models, "fully ready" means the model is loaded into GPU memory, not just that the container started. If your model takes 45 seconds to load, your minimum replicas need to be sized to handle peak load with some headroom, because the buffer between "HPA decides to scale" and "new pod is serving traffic" is at least a minute.
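A back-of-the-envelope way to reason about that headroom. All the numbers below are illustrative assumptions (per-pod throughput, the length of the scale-up gap, the traffic growth rate), not measured values.

```python
import math

def min_replicas(peak_rps: float, rps_per_pod: float,
                 scaleup_gap_s: float, growth_rps_per_s: float) -> int:
    """Replicas needed to absorb traffic growth during the scale-up gap.

    The gap is the time between 'HPA decides to scale' and 'new pod is
    actually serving' -- for ML, that includes model loading.
    """
    # Worst-case traffic that can arrive before a new pod helps out:
    worst_case_rps = peak_rps + growth_rps_per_s * scaleup_gap_s
    return math.ceil(worst_case_rps / rps_per_pod)

# 200 RPS peak, 25 RPS per pod, 60 s gap (45 s model load + scheduling),
# traffic growing at 1 RPS per second during a surge:
print(min_replicas(200, 25, 60, 1.0))  # 11
```

The point of the exercise: a 45-second model load pushes your minimum replica count meaningfully above what steady-state throughput alone would suggest.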
When to Skip Kubernetes Entirely
Kubernetes is powerful but also complex. You don't need it if:
You have a single model, low traffic: Deploy to AWS Lambda or Google Cloud Run. Code is simpler, no infrastructure to manage, pay per invocation.
Example: Cloud Run deployment
Cloud Run is particularly compelling for ML models that see bursty, unpredictable traffic. You write the same FastAPI server code, containerize it identically, and hand it to Google Cloud instead of a Kubernetes cluster. The tradeoff is that you give up control over infrastructure details in exchange for never having to think about nodes, pods, or cluster management again. For a team of two data scientists who want to ship a model and focus on improving it rather than operating it, that tradeoff is often the right one.
gcloud run deploy image-classifier \
--source . \
--platform managed \
--memory 4Gi \
--timeout 3600 \
--max-instances 100

Google handles scaling, load balancing, and TLS. Your FastAPI server runs the same code. Cost: roughly $0.00002 per invocation. Perfect for bursty, unpredictable traffic.
You need ultra-low latency: Edge deployment or specialized hardware. The extra network hops through a cluster's load balancers and proxies add latency that a co-located process doesn't pay.
You have one GPU and it's never idle: Kubernetes overhead isn't worth it. Run the server directly on your machine.
Serverless + Lambda
Lambda is the most extreme version of managed infrastructure, you provide a function, AWS provides everything else. Cold starts are the main gotcha for ML: a Lambda that hasn't been invoked recently needs to download your model weights, initialize PyTorch, and load weights into memory before it can serve its first request. For large models, that can be 10-30 seconds of latency. Provisioned Concurrency solves this by keeping a number of Lambda instances pre-warmed, but at that point you're paying for idle capacity in a way that starts to resemble just running a server.
# lambda_handler.py
import base64
import io
import json

import torch
from PIL import Image
from torchvision import transforms

# Load once per execution environment; reused across warm invocations
model = torch.load("/opt/ml/model.pt", map_location="cpu")
model.eval()

transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
])

def lambda_handler(event, context):
    image_base64 = event["body"]["image"]
    image_data = base64.b64decode(image_base64)
    image = Image.open(io.BytesIO(image_data)).convert("RGB")
    # Predict
    x = transform(image).unsqueeze(0)
    with torch.no_grad():
        result = model(x)
    prediction = result.tolist()
    return {
        "statusCode": 200,
        "body": json.dumps({"prediction": prediction})
    }

Package with Docker, push to ECR, deploy to Lambda. Scales automatically. No Kubernetes. Lambda is genuinely excellent for lightweight models, preprocessing pipelines, and any workload where requests are infrequent enough that cold starts are acceptable.
The tradeoff: cold start latency (Lambda: ~5-10 sec on first invoke after idle), limited execution time (15 minutes max), and less control over the runtime environment.
GPU Scheduling and Resource Management
For serious ML, you need GPUs. Kubernetes schedules them like any other resource.
In your deployment, you declare GPU requirements alongside CPU and memory. Kubernetes will only schedule the pod on a node that has a free GPU slot, if no such node exists and Cluster Autoscaler is configured, it will provision a new GPU node. The NVIDIA device plugin (installed via the GPU Operator) is what makes this work at the node level, exposing each GPU as an allocatable resource that the Kubernetes scheduler can reason about.
resources:
  requests:
    nvidia.com/gpu: "1"
  limits:
    nvidia.com/gpu: "1"

This asks for 1 GPU (any kind). Kubernetes will only schedule the pod on a node with free GPU slots.
GPU-specific scheduling:
nodeSelector:
  gpu: "true"
  gpu-type: "a100"

Tells Kubernetes: "Schedule me only on nodes with A100 GPUs." Label your nodes accordingly:

kubectl label nodes my-gpu-node gpu=true gpu-type=a100

Shared GPUs: If you're running many small models on one GPU, use libraries like NVIDIA's GPU sharing orchestrator or TensorFlow serving's batching to pack multiple models per GPU. Kubernetes doesn't handle this natively, you manage it in your application layer. For teams running dozens of small models on a limited GPU budget, Multi-Instance GPU (MIG) partitioning on A100s is worth investigating: it lets you split one physical GPU into up to seven isolated instances with guaranteed memory and compute, each appearing as a separate GPU to Kubernetes.
Monitoring and Observability
Kubernetes clusters are opaque without monitoring. Deploy Prometheus for metrics:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true

Expose metrics from your FastAPI server using prometheus-client. The pattern here is deliberately simple: a latency histogram and a request counter give you the two most important signals for a model serving endpoint. From these, you can derive throughput (rate of predictions_total), error rate (if you add an errors_total counter), and latency percentiles (p50, p95, p99 from the histogram). That's a complete picture of your serving health from a handful of metrics.
from fastapi import Response
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
import time

inference_latency = Histogram('inference_latency_seconds', 'Inference latency')
predictions_total = Counter('predictions_total', 'Total predictions')

@app.post("/predict")
async def predict(request: PredictionRequest):
    start = time.time()
    # ... inference ...
    inference_latency.observe(time.time() - start)
    predictions_total.inc()
    return result

@app.get("/metrics")
async def metrics():
    # generate_latest() returns bytes in Prometheus text format; wrap it in a
    # Response so FastAPI serves it as-is instead of trying to JSON-encode it.
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

Now Prometheus scrapes /metrics from all pods, stores timeseries data, and you can alert on high latency, OOM events, or scaling decisions. Wire this up to Grafana for dashboards and PagerDuty for alerts, and you have full observability into your model fleet. The annotation prometheus.io/scrape: "true" in your pod template is what tells Prometheus to include your pods in its scrape targets; add that to your Deployment's pod metadata and monitoring is automatic for every replica.
The Complete Picture: From Commit to Production
Here's the workflow:
- Develop locally: Train, test, validate your model.
- Commit to repo: Code + model.pt (or a reference to a model registry).
- CI/CD triggers build: Docker image built, tests run, pushed to registry.
- Helm release: helm upgrade deploys the new image to staging, canary, then production.
- Kubernetes takes over: Pods start, readiness probes check health, HPA scales as traffic comes in.
- Monitoring dashboards show inference latency, throughput, and GPU utilization.
- On alert: PagerDuty pages the on-call engineer, who can roll back in seconds: helm rollback image-classifier 0.
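The dashboard signals in that workflow, latency percentiles and throughput, can be read straight off the metrics from the monitoring section. A couple of PromQL sketches (metric names assume the prometheus-client instrumentation shown earlier):

```promql
# p95 inference latency over the last 5 minutes, aggregated across replicas
histogram_quantile(0.95, sum(rate(inference_latency_seconds_bucket[5m])) by (le))

# throughput: predictions per second, summed across replicas
sum(rate(predictions_total[5m]))
```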
This is the gold standard for ML operations. It's not trivial to set up (first time: ~1 week). But once operational, you can deploy new models with one command, confident that the infrastructure will scale and self-heal.
Conclusion
Kubernetes for ML is a significant investment, and it's worth being honest about that upfront. The learning curve in the first week is steep. You'll fight with YAML indentation, misconfigured probes, and node selector typos. You'll google "kubernetes pod crashloopbackoff" more times than you'd like. But the infrastructure you end up with is genuinely transformative for a team that manages multiple models in production.
The value proposition is this: every hour you spend building Kubernetes infrastructure properly is an hour you don't spend manually restarting model servers, investigating why a pod consumed all available memory and killed everything else on the node, or figuring out why your model endpoint went down because someone restarted a server during an update. Kubernetes doesn't eliminate operational burden; it automates the repetitive, mechanical parts of it so your team can focus on the work that actually matters: training better models, improving data quality, and shipping features.
Start simple. One model, one cluster, one managed Kubernetes service. Get the basic Deployment/Service/HPA stack working. Add monitoring. Then add Helm to manage multiple models. Then explore Kubeflow or KEDA for more advanced workflows. Build incrementally rather than trying to architect the complete system before you've deployed anything.
And remember that Kubernetes is not always the answer. If you're running a single model with unpredictable traffic and a small team, Cloud Run or Lambda will serve you better. The goal is matching infrastructure complexity to your actual needs, not adopting sophisticated tooling for its own sake.