December 30, 2025
AI/ML · Infrastructure · Platform · GPU · Kubernetes

Kubernetes for ML: Getting Started with GPU Workloads

You've got a shiny GPU cluster, a pile of ML training jobs, and a growing team that can't keep stepping on each other's toes while provisioning resources. Welcome to the world of Kubernetes GPU scheduling - a place where "my job crashed because the GPU was already taken" happens faster than you can say "out of memory."

Here's the good news: Kubernetes has had GPU support for years, and the NVIDIA ecosystem has matured significantly. But the gotchas? They're real. Understanding how Kubernetes advertises GPUs, how the scheduler places your workloads, and how to prevent resource conflicts is the difference between a cluster that hums along smoothly and one that leaves expensive compute sitting idle.

Let's walk through what you actually need to know to get GPU workloads running reliably on Kubernetes - starting with the device plugin and moving all the way to handling multi-team resource contention.

Table of Contents
  1. The NVIDIA GPU Device Plugin: How Kubernetes Sees Your GPUs
  2. How the Device Plugin Works
  3. Installing the Device Plugin
  4. GPU Resource Requests and Limits: Whole GPUs and MIG Slices
  5. The Asymmetry: GPU Limits, Not Requests
  6. Whole GPUs vs. MIG Slices
  7. Node Affinity and Tolerations: Keeping GPU Workloads on GPU Nodes
  8. Labeling GPU Nodes
  9. Taints and Tolerations
  10. NodeSelector vs. NodeAffinity
  11. Production Deployment Checklist: Validating Your GPU Cluster
  12. Pre-Deployment Validation
  13. Pod Submission Validation
  14. Runtime Monitoring
  15. Key Takeaways
  16. Advanced Scheduling Patterns for Multi-User Clusters
  17. The Hidden Complexity of Mixed Workloads
  18. Observability and Debugging GPU Scheduling Issues
  19. Building an Observability Dashboard
  20. Optimizing GPU Cluster Efficiency at Scale
  21. Troubleshooting GPU Scheduling in Production
  22. Related Resources

The NVIDIA GPU Device Plugin: How Kubernetes Sees Your GPUs

Before your cluster can schedule a single ML job on a GPU, Kubernetes needs to know those GPUs exist. That's where the NVIDIA Kubernetes Device Plugin comes in. Understanding how this plugin works is crucial because without it, your Kubernetes cluster has no idea that your expensive GPUs are available. Your pods can land on GPU nodes, but they'll just sit there, unable to access the hardware they need.

Kubernetes has a philosophy of treating compute resources as abstract quantities - it doesn't care if you're running on Intel or AMD processors, as long as you've reserved the CPU cores. The same abstraction applies to GPUs, except GPUs are fundamentally different. A CPU core can time-slice between multiple processes. GPUs cannot. If your neural network training kernel is running on a GPU, no other process can use that GPU until the training completes. This incompatibility with traditional resource sharing means Kubernetes can't treat GPUs like CPUs. It needs a specialized mechanism to manage them.

That's where the device plugin comes in. It's essentially a bridge between the GPU hardware and Kubernetes' resource scheduler. The plugin is aware of GPU memory, compute capability, interconnect topology, and driver versions. It advertises all of this to Kubernetes so the scheduler can make intelligent placement decisions. Without this information, Kubernetes is scheduling blind. It might place two GPU-intensive workloads on the same node that only has one GPU, causing both to fail. Or it might place a job requiring H100s on a node with A100s. The device plugin prevents these mistakes.

What's particularly elegant about the device plugin architecture is that it lets NVIDIA evolve GPU support without modifying Kubernetes itself. When a new GPU model arrives, NVIDIA updates the plugin. When new GPU interconnect technology emerges, they add support. The Kubernetes community doesn't have to understand the intricacies of NVIDIA hardware. They just have to support a standard interface for device plugins, which they do. This separation of concerns has allowed Kubernetes GPU support to evolve rapidly without requiring core Kubernetes changes.

The device plugin runs as a DaemonSet on every GPU-enabled node. It talks to the kubelet on that node and registers GPUs as schedulable resources under the nvidia.com/gpu resource class. Without it, your GPUs are invisible to the Kubernetes scheduler - your pods will land on nodes with GPUs, but the scheduler has no idea they're there. The plugin essentially translates the hardware layer (NVIDIA GPUs) into Kubernetes' abstract resource model, making GPUs first-class schedulable resources.

How the Device Plugin Works

Here's the flow: The DaemonSet deployment runs a plugin container on each node with GPUs. The plugin registers with the kubelet via the device plugin API (Unix socket communication). The kubelet reports available GPUs to the API server. When you request nvidia.com/gpu in a pod spec, the scheduler reserves that resource and the device plugin injects the GPU device into the container at runtime.

The plugin also manages device isolation - it ensures that if Pod A gets GPU 0, Pod B can't accidentally access the same device. This prevents cross-workload interference and data corruption. Without proper isolation, two training jobs running on the same physical GPU could corrupt each other's memory, leading to silent failures and incorrect results.

Think of the device plugin as a translator between the GPU hardware and Kubernetes' resource system. GPUs are special - they can't be oversubscribed like CPUs, and they require special drivers and runtime. The device plugin bridges that gap, making GPUs look like regular Kubernetes resources that the scheduler understands.

Installing the Device Plugin

You can install the device plugin via Helm, Kubernetes manifests, or through the NVIDIA GPU Operator (which manages the entire GPU software stack: drivers, runtime, device plugin, and monitoring). For a minimal setup, here's a basic DaemonSet manifest:

yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        nvidia.com/gpu: "true"
      containers:
        - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0
          name: nvidia-device-plugin-ctr
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins

Once this DaemonSet is running, you should see GPU capacity reported. The device plugin immediately exposes GPUs as a schedulable resource. Your cluster is now aware of the hardware.

bash
kubectl describe node <gpu-node> | grep nvidia.com/gpu

You'll see something like nvidia.com/gpu: 8 (meaning 8 GPUs available on that node). This tells the scheduler exactly how many GPUs it can allocate before the node is full.

GPU Resource Requests and Limits: Whole GPUs and MIG Slices

Here's where things get interesting. GPU resources in Kubernetes work differently than CPU and memory - and this trips up a lot of folks on their first ML cluster. The distinction comes from a fundamental physical difference between GPUs and other resources.

When you request 4 CPU cores in Kubernetes, you're not getting 4 exclusive physical cores. You're reserving 4 units of compute, but the kernel's CPU scheduler may time-slice you across fewer physical cores. You might get a few milliseconds on core 1, then a few milliseconds on core 2, back to core 1, cycling hundreds of times per second. Averaged out, from your application's perspective, you got 4 cores. The CPU handles the illusion. Memory works similarly - you request 8GB, and the kernel manages paging, caching, and allocation transparently.

GPUs fundamentally cannot do this. A kernel running on a GPU either has the device or it doesn't. There's no transparent preemption, no fair-share time-slicing the way there is for CPU threads, no "sharing" in the traditional sense. If your training job is in the middle of processing a batch on GPU 0, and another process wants GPU 0, the only options are: (1) wait for the first job to finish, or (2) fail. This binary nature of GPU allocation is why Kubernetes treats them differently. The scheduler can't overcommit GPUs like it does CPUs because the hardware doesn't support it.

The practical consequence: when you request a GPU in Kubernetes, you're requesting exclusive access. The scheduler can't oversubscribe GPUs like it does with CPUs. If a node has 8 GPUs, it can only service 8 GPU requests (or fewer, depending on your workloads' requirements). The minute the 9th GPU request arrives, it has to wait for a GPU to free up. This exclusivity is also why GPU limits are handled differently. With CPU and memory, you specify both requests and limits. The request is what you're guaranteed to get; the limit is the maximum you can use. GPUs don't have this distinction. You either get the GPU or you don't. So you put the GPU request/limit in the limits field only, and Kubernetes treats it as both.

The Asymmetry: GPU Limits, Not Requests

This is the biggest source of confusion: CPUs and memory? You specify both requests and limits. GPUs? You only specify in the limits section. Period.

Here's why: GPUs can't be oversubscribed (unlike CPU, where cgroups can time-slice across cores). If your pod asks for GPU and another pod gets it, your pod doesn't get a fraction - it gets nothing. So the scheduler needs to be conservative.

When you specify nvidia.com/gpu: 1 in the limits section, Kubernetes automatically treats it as the request too. If you do write an explicit GPU request, it must exactly equal the limit or the pod will be rejected. This asymmetry confuses people because it breaks the mental model they have from CPU and memory. But it makes sense once you understand that GPU allocation is binary: either you have the GPU or you don't.

Understanding this constraint helps you design your resource requests correctly. You're not trying to overcommit - you're declaring exclusive need. If you request 1 GPU, the scheduler interprets that as "this pod will monopolize 1 GPU until it terminates," not "this pod uses an average of 0.5 GPUs."

Whole GPUs vs. MIG Slices

Modern NVIDIA GPUs (A100, H100) support Multi-Instance GPU (MIG), which natively partitions a single physical GPU into isolated logical GPUs. Each slice gets its own memory and compute units. This is game-changing for multi-tenant clusters where you have a mix of small inference jobs and large training jobs.

For a full GPU, your pod spec looks like this:

yaml
apiVersion: v1
kind: Pod
metadata:
  name: ml-training-full-gpu
spec:
  containers:
    - name: trainer
      image: nvidia/cuda:12.2.0-runtime-ubuntu22.04
      command: ["python", "train.py"]
      resources:
        limits:
          nvidia.com/gpu: 1
          memory: "16Gi"
          cpu: "8"

If you're running multiple lighter workloads, you can use MIG. Say you've partitioned an A100 into 1g.10gb slices (each slice has 10GB VRAM and 1/7th of compute). Your pod spec becomes:

yaml
apiVersion: v1
kind: Pod
metadata:
  name: ml-inference-mig
spec:
  containers:
    - name: inference
      image: nvidia/cuda:12.2.0-runtime-ubuntu22.04
      command: ["python", "serve.py"]
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1
          memory: "12Gi"
          cpu: "4"

Note: The exact MIG profile names depend on your GPU model and how you've configured partitioning. MIG is powerful for sharing but requires careful planning - once you partition a GPU into MIG mode, you can't use it as a whole GPU until you reconfigure.
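
Which resource name a MIG slice is advertised under depends on how the device plugin is configured. As a hedged sketch (this is the config-file format used by recent device plugin releases - verify the flag names against the version you deploy), a config that exposes slices under per-profile resource names looks like:

yaml
version: v1
flags:
  migStrategy: mixed  # advertise slices as nvidia.com/mig-<profile> resources

With migStrategy: single, slices are instead advertised as plain nvidia.com/gpu, which keeps pod specs uniform but hides the profile from the scheduler.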

Node Affinity and Tolerations: Keeping GPU Workloads on GPU Nodes

You don't want your expensive ML job landing on a CPU-only node by accident. You need two things: taints on GPU nodes and matching tolerations in your pods.

Labeling GPU Nodes

First, label your GPU nodes. It's simple but essential:

bash
kubectl label nodes <gpu-node-1> gpu=true
kubectl label nodes <gpu-node-1> gpu-type=a100
kubectl label nodes <gpu-node-2> gpu-type=h100

Labels become the language you use to express scheduling constraints. You're telling Kubernetes "this node has GPU capability, specifically this GPU type." The scheduler uses these labels to make intelligent placement decisions.

Taints and Tolerations

Next, taint the GPU nodes so non-GPU workloads can't accidentally land there:

bash
kubectl taint nodes <gpu-node-1> gpu=true:NoSchedule

This means: "No pod can land on this node unless it explicitly tolerates the gpu=true taint." Taints are the enforcement mechanism - they prevent accidental misconfiguration. A data-processing job with no GPU needs won't accidentally land on your expensive GPU node and waste resources.

In your ML pod, add the toleration:

yaml
apiVersion: v1
kind: Pod
metadata:
  name: ml-training-with-affinity
spec:
  tolerations:
    - key: gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu
                operator: In
                values:
                  - "true"
  containers:
    - name: trainer
      image: nvidia/cuda:12.2.0-runtime-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 2

The combination of taint + toleration + nodeAffinity ensures your GPU job can land on GPU nodes and nowhere else. This prevents resource waste - GPU nodes stay dedicated to GPU workloads.

NodeSelector vs. NodeAffinity

For simple cases, you can use nodeSelector:

yaml
spec:
  nodeSelector:
    gpu-type: a100
  ...

But nodeAffinity is more flexible - it supports OR logic, negation, and preferred vs. required constraints. Use it when you need to say "prefer H100s, but fall back to A100s if needed":

yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: gpu-type
              operator: In
              values:
                - h100
      - weight: 50
        preference:
          matchExpressions:
            - key: gpu-type
              operator: In
              values:
                - a100

This gives you scheduling flexibility. Your job will prefer H100s but is happy with A100s if that's all that's available. Without this fallback, jobs would sit pending waiting for the perfect GPU type.

Production Deployment Checklist: Validating Your GPU Cluster

Before you let your team run training jobs on your Kubernetes GPU cluster, validate these critical items. Miss any one of these and you'll spend weeks debugging mysterious failures.

Pre-Deployment Validation

Hardware Verification:

  • All GPU nodes have same driver version
  • All GPUs visible to OS: lspci | grep NVIDIA shows expected count
  • GPU VRAM matches specification: nvidia-smi --query-gpu=name,memory.total --format=csv
  • NVLink active (if applicable): nvidia-smi nvlink -s
  • GPU power delivery stable: No throttling warnings in DCGM metrics

Kubernetes Integration:

  • Device plugin DaemonSet running on all GPU nodes
  • GPU capacity reported: kubectl describe nodes | grep nvidia.com/gpu
  • GFD labels present: kubectl get nodes --show-labels | grep nvidia.com/gpu

These validations ensure your infrastructure is correctly configured before users start scheduling jobs. A single misconfigured node can cause silent failures that waste days of GPU time.

Pod Submission Validation

Before submitting a training job, use this validation script:

bash
#!/bin/bash
# validate-pod.sh - Verify pod can run before submitting
 
POD_MANIFEST="training-pod.yaml"
 
# 1. Check if namespace exists and isn't full
NAMESPACE=$(grep "namespace:" $POD_MANIFEST | awk '{print $NF}')
QUOTA=$(kubectl describe quota -n $NAMESPACE 2>/dev/null | grep nvidia.com/gpu)
echo "Namespace quota: $QUOTA"
 
# 2. Check if requested GPUs are available
REQUESTED_GPUS=$(grep "nvidia.com/gpu:" $POD_MANIFEST | awk '{print $NF}')
AVAILABLE=$(kubectl get nodes -o jsonpath='{range .items[*]}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' | awk '{sum += $1} END {print sum}')
echo "Requesting $REQUESTED_GPUS GPUs, $AVAILABLE available"
 
# 3. Check node affinity constraints
grep -A 5 "nodeAffinity" $POD_MANIFEST
echo "Verifying node affinity matches actual nodes..."
kubectl get nodes --show-labels | grep -E "gpu|$(grep 'key:' $POD_MANIFEST | awk '{print $NF}')"
 
# 4. Simulate scheduling
echo "Dry-run scheduling..."
kubectl apply -f $POD_MANIFEST --dry-run=server
 
echo "✓ All checks passed"

This prevents obvious mistakes (requesting GPUs that don't exist, namespace over quota, affinity constraints that no nodes satisfy) before you waste time waiting for pod scheduling.

Runtime Monitoring

After your pod starts, watch for these red flags:

bash
# Monitor GPU utilization in real-time
watch -n 1 'kubectl exec <pod-name> -- nvidia-smi'
 
# Check for memory leaks
kubectl exec <pod-name> -- nvidia-smi dmon  # per-second utilization and memory counters
 
# Verify NCCL communication (for distributed training)
kubectl logs <pod-name> | grep "NCCL"
 
# Check for thermal throttling
kubectl exec <pod-name> -- nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader

These diagnostics catch problems early. Memory leaks that would waste days of training are caught in minutes. Thermal throttling that slowly degrades performance becomes visible immediately.

Key Takeaways

  1. Install the device plugin first: Without it, Kubernetes doesn't know about your GPUs. Use either the NVIDIA device plugin directly or the GPU Operator for a complete stack.

  2. GPU requests live in limits: Don't try to specify GPU requests separately. Put nvidia.com/gpu in the limits section and Kubernetes handles the request automatically. This quirk of GPU scheduling reflects the exclusive nature of GPU allocation.

  3. Label and taint GPU nodes: Use node labels and taints to keep ML workloads on GPU nodes and prevent accidental scheduling conflicts. A CPU-only pod landing on a GPU node is a waste; taints prevent this.

  4. Leverage GPU Feature Discovery: Use GFD labels to schedule workloads to the right GPU types (A100 vs. H100, GPU memory, compute capability). This prevents "job lands on wrong GPU" surprises.

  5. Manage contention with priority and quotas: High-priority training jobs should preempt low-priority inference. Resource quotas prevent one team from consuming all capacity. These policies ensure fair resource sharing.

  6. Debug with kubectl describe: When a pod is stuck pending, the event messages from describe will tell you exactly why the scheduler rejected it. "Insufficient nvidia.com/gpu" is clear. "Tolerations don't match taint" is clear. Messages are your debugger.

GPU scheduling in Kubernetes is powerful but nuanced and requires careful attention to detail. Start simple with one node pool, one taint, and one priority class, then layer in complexity as your team grows. The NVIDIA ecosystem is mature enough that you can rely on the device plugin and GFD without building custom tooling for most use cases - unless you're running thousands of GPUs, in which case you'll probably want to look at Kueue or Volcano for gang scheduling and advanced quota management. The time invested in understanding your cluster's GPU topology and scheduling constraints pays enormous dividends in reliability and efficiency.

Advanced Scheduling Patterns for Multi-User Clusters

When you have multiple teams sharing a GPU cluster - some training large models, others running inference, others experimenting with new architectures - scheduling becomes genuinely complex. The naive approach is first-come-first-served: jobs get scheduled in the order they're submitted. This leads to tragedy-of-the-commons dynamics where one team's long-running training job blocks everyone else. The solution is implementing priority classes and resource quotas that enforce fairness while allowing flexibility.

Kubernetes has built-in PriorityClass resources. You define classes like production-inference (highest priority), model-training (medium), experimentation (low). When a training job from team A is running and team B submits a production inference pod, the scheduler can preempt team A's pod if needed, restarting it later when resources free up. This preemption is powerful but requires careful design. Your training jobs must be resumable (checkpointing before preemption) or you're just wasting compute. You also need pod disruption budgets to avoid simultaneously preempting too many replicas.
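
As an illustrative sketch (the class names and priority values here are arbitrary choices, not fixed conventions), the priority classes and a disruption budget might look like:

yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-inference
value: 1000000
description: "Latency-critical serving; may preempt lower classes"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: model-training
value: 100000
description: "Checkpointed training jobs; preemptible"
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inference-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: inference

Pods opt in by setting priorityClassName: model-training (or similar) in their spec.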

Resource quotas add another dimension. You allocate each team a GPU quota per namespace: team-ml gets 32 GPUs, team-data gets 16 GPUs, team-eng gets 8 GPUs. Each team can use up to their quota, preventing any one team from monopolizing capacity. But quotas are often too rigid. A team that's careful about resource usage shouldn't be blocked because they temporarily hit quota limits. The solution is combining quotas with priority - high-priority jobs get their quota regardless, while lower-priority jobs can use excess capacity that other teams aren't using. This overcommitment creates efficient utilization while preserving fairness.
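
A per-namespace GPU quota is a standard ResourceQuota using the extended-resource key (the namespace and number here mirror the team-ml allocation described above):

yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-ml
spec:
  hard:
    requests.nvidia.com/gpu: "32"

Check consumption at any time with kubectl describe quota gpu-quota -n team-ml.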

The Hidden Complexity of Mixed Workloads

A subtle but important challenge in multi-user clusters is that different workload types have different constraints. Training jobs are typically large, long-running, and need stable resources. Inference jobs are smaller, latency-sensitive, and might tolerate preemption better if they have caching. Batch processing jobs are flexible on timing but want throughput. Running these alongside each other requires scheduling strategies tailored to each type.

For training jobs, you want to guarantee exclusive GPU access. If a training job is interrupted mid-epoch, it wastes all computation since the last checkpoint. For inference, you want to pack multiple small jobs per GPU if possible, but guarantee latency SLAs. Batch jobs, meanwhile, might be happy with whatever GPU time is available - they're not latency-sensitive.

A production cluster might implement separate nodegroups (in cloud terms) or node pools (in on-prem terms) for each workload type. Training nodes are optimized for sustained throughput with large caches and bandwidth. Inference nodes might be smaller GPUs, optimized for latency. Batch nodes might mix GPU types. This requires discipline and costs more in total capacity, but provides predictability for each workload class.

Observability and Debugging GPU Scheduling Issues

When things go wrong in Kubernetes GPU scheduling, debugging is notoriously hard and requires systematic investigation. A pod sits pending for hours with no clear explanation. When you run kubectl describe pod, it says "insufficient nvidia.com/gpu" but you can clearly see free GPUs available in kubectl describe nodes. This discrepancy indicates a mismatch between what the pod is requesting and what the cluster has available. The common causes are many: (1) node affinity constraints are too restrictive (the pod wants GPU type X, but only GPU type Y is available in free slots), (2) resource fragmentation (three nodes with 1 GPU each free, but the pod wants 2 GPUs on a single node), (3) taints preventing scheduling (the node is tainted and the pod's toleration doesn't match), or (4) capacity overcommitment (resource limits are set too low, preventing more pods even though GPUs are free).

Understanding your cluster's resource topology is essential. Use this command to get a clear picture:

bash
kubectl describe nodes | grep -A 3 "nvidia.com"

This shows how many GPUs each node has and how many are allocated. If you see a node with free GPUs but pods pending, check the pod's resource requests and node affinity. Often the pod is over-requesting resources or asking for GPU types that don't exist on free nodes.

Building an Observability Dashboard

Successful GPU clusters have observability dashboards that show: (1) GPU utilization across the cluster over time (should be >70 percent for cost-efficiency), (2) Pod scheduling latency (how long pods sit pending before getting scheduled), (3) GPU preemption rate (how often pods are being evicted, should be low for stable workloads), (4) Per-team resource usage (quota tracking), and (5) Bottleneck identification (what's preventing more pods from scheduling?). Building this requires integrating Prometheus metrics from the kubelet with Kubernetes API events and custom logging from your workload orchestrator.
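
The usual source for the GPU-level metrics is NVIDIA's DCGM exporter, which runs as a DaemonSet and exposes Prometheus metrics on port 9400. A minimal sketch (the image tag is an assumption - check NGC for the current release):

yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04  # assumed tag; verify
          ports:
            - name: metrics
              containerPort: 9400

Point a Prometheus scrape job (or ServiceMonitor) at port 9400 and graph per-node utilization from DCGM_FI_DEV_GPU_UTIL.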

One team we worked with discovered through dashboards that their cluster was 40 percent idle, even though users complained about long scheduling waits. Investigation revealed that nearly half the nodes had mixed GPU types (some A100s, some V100s), and many pods were requesting specific types. When a pod wanted an A100 but only V100s were free, it sat pending. The fix was either being more flexible with GPU type requests or ensuring homogeneous GPU types per node pool.

Optimizing GPU Cluster Efficiency at Scale

Moving from single-node GPU scheduling to cluster-wide optimization requires thinking about global resource utilization, not just per-node correctness. A cluster with eight GPU nodes can run eight single-GPU jobs, but that leaves zero flexibility. A job that needs two GPUs has to wait. A burst of small inference jobs gets starved by one large training job. Managing this efficiently is what separates amateur clusters from professional ones.

GPU binpacking is the first technique. Instead of assigning GPUs randomly, you want to pack jobs onto as few nodes as possible. This leaves entire nodes empty and available for large jobs. Kubernetes doesn't do this automatically. The default scheduler assigns pods to nodes based on available resources, which spreads jobs across the cluster. You need custom scheduling or a tool like Karpenter to implement smart binpacking.

The second technique is burst tolerance. Not all GPU workloads need dedicated resources. Batch inference, hyperparameter search, and experimental training can tolerate sharing. You can oversubscribe GPUs (pack more jobs than GPUs available) if you implement preemption and checkpointing. A job running on oversubscribed GPUs might be interrupted, but it can checkpoint state and resume later. This dramatically improves cluster utilization. A cluster that runs at 40 percent average utilization without oversubscription might run at 80 percent with it, assuming your workloads are tolerant of interruptions.
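
One concrete way to oversubscribe is the device plugin's time-slicing mode, which advertises each physical GPU as multiple schedulable replicas. A hedged sketch of the plugin config (supported in recent device plugin releases; note that time-slicing provides no memory isolation, so it only suits the interruption-tolerant workloads described above):

yaml
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4  # each physical GPU shows up as 4 allocatable units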

The third technique is geographic distribution. If you have GPUs in multiple zones or regions, you can distribute workloads to maximize parallelism while managing latency. Long-running training jobs prefer colocation (all on one node for fast all-reduce). Inference jobs prefer geographic spread (serve from edge locations). Dynamic rescheduling lets you move jobs based on where free capacity exists. Again, Kubernetes doesn't do this automatically - you need custom logic or external orchestrators.

These optimization techniques are advanced, but they're necessary if you're managing serious GPU clusters with tens or hundreds of nodes. The gains are real: 40 percent improvement in utilization is common, which translates directly to cost savings.

Troubleshooting GPU Scheduling in Production

When a GPU pod sits pending for hours with no clear reason, systematic debugging is essential. The first step is understanding why the scheduler rejected it. Use kubectl describe pod <pod-name> and look at the "Events" section. The messages will tell you exactly why scheduling failed. "Insufficient nvidia.com/gpu" means you're out of GPU capacity. "Tolerations don't match taint" means your pod's toleration doesn't match the node's taint. "Node didn't match NodeAffinity rules" means your node affinity constraints are too restrictive.

If GPU capacity is available but pods aren't scheduling, the issue is usually node affinity. A pod requesting a specific GPU type (a100-40gb) won't schedule on nodes with different GPUs (a100-80gb), even if the 80gb GPUs are available. The fix is loosening constraints or using preferred affinity instead of required. If you need strict GPU type matching, you need to ensure you have the right node pool available.
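
GPU Feature Discovery publishes the product name as a node label, which is usually the cleanest way to pin (or prefer) a specific GPU model. The exact label value depends on your hardware - confirm it with kubectl get nodes --show-labels before relying on it:

yaml
spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB  # assumed value; verify on your nodes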

Another common issue is resource quota exhaustion. Namespaces have resource quotas. If your namespace has a quota of 4 GPUs and you've already allocated 4, new GPU pods sit pending even if the cluster has free GPUs. Check your quota with kubectl describe quota -n <namespace>. Adjust quotas when you expect usage spikes.

