January 20, 2025
Azure Kubernetes Terraform Machine Learning DevOps

Building a Real ML Platform on Azure: The Messy Truth

So you want to build an ML platform on Azure. Not the toy examples. Not the "hello world" notebooks. The real thing - training jobs that run for days, model serving that handles actual traffic, and a bill that doesn't make your CFO cry.

I've done this dance. Let me walk you through what actually works, what breaks, and what Azure doesn't tell you in the docs.

Table of Contents
  1. What We're Building
  2. Part 1: Terraform - Infrastructure as Code That Actually Works
  3. The Provider Setup
  4. Variables - Make It Yours
  5. The Resource Group
  6. Container Registry
  7. Storage Account for Data
  8. The AKS Cluster - The Main Event
  9. The CPU Work Node Pool
  10. The GPU Node Pool - Where the Money Goes
  11. Log Analytics for Monitoring
  12. Outputs - What We Need for Later
  13. Part 2: Deploying It - The Terraform Dance
  14. Getting kubectl Access
  15. Part 3: Helm - Installing the Good Stuff
  16. The GPU Operator - Making GPUs Work
  17. Karpenter - The Magic Autoscaler
  18. Prometheus + Grafana - Seeing What's Happening
  19. KServe - Model Serving
  20. Part 4: Testing It - Does It Actually Work?
  21. A Simple GPU Test
  22. A Real Training Job
  23. Part 5: The Cost Reality
  24. Part 6: What Breaks and How to Fix It
  25. Problem: GPU nodes won't scale down
  26. Problem: Pods stuck in Pending
  27. Problem: Out of memory (OOM) during training
  28. Summary: What You Built

What We're Building

Here's the architecture we're aiming for:

mermaid
graph TD
    subgraph Azure
        AKS["AKS<br/>(Kubernetes)"]
        ACR["ACR<br/>(Container Images)"]
        Storage["Storage<br/>(Data & Models)"]
 
        subgraph Node Pools
            System["System Pool<br/>(tainted)"]
            CPU["CPU Work Pool<br/>(general compute)"]
            GPU["GPU Pool<br/>(NVIDIA)"]
        end
 
        AKS --> System
        AKS --> CPU
        AKS --> GPU
        AKS -.->|pull images| ACR
        AKS -.->|mount data| Storage
    end

Three node pools:

  • System: Core Kubernetes stuff, tainted so workloads don't land here
  • CPU Work: General compute, data preprocessing, web services
  • GPU: The expensive stuff, only for training and inference

Let's build it.

Part 1: Terraform - Infrastructure as Code That Actually Works

First, the Terraform setup. I'm assuming you have the Azure CLI logged in (az login) and a subscription selected.

The Provider Setup

hcl
# providers.tf
terraform {
  required_version = ">= 1.5.0"
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.75"
    }
    helm = {
      source  = "hashicorp/helm"
      version = "~> 2.11"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.23"
    }
  }
 
  backend "azurerm" {
    resource_group_name  = "tfstate"
    storage_account_name = "tfstatemlplatform"
    container_name       = "tfstate"
    key                  = "ml-platform.tfstate"
  }
}
 
provider "azurerm" {
  features {
    resource_group {
      prevent_deletion_if_contains_resources = false
    }
  }
}
 
# We'll configure helm/kubernetes providers after AKS is created

The gotcha: That backend block assumes you've already created the storage account for Terraform state. Do that first, manually. Don't try to Terraform your Terraform backend. I've seen people recurse themselves into oblivion.

bash
# One-time setup for Terraform state
az group create --name tfstate --location eastus
az storage account create \
  --name tfstatemlplatform \
  --resource-group tfstate \
  --sku Standard_LRS \
  --encryption-services blob
az storage container create \
  --name tfstate \
  --account-name tfstatemlplatform

Variables - Make It Yours

hcl
# variables.tf
variable "prefix" {
  description = "Prefix for all resources"
  type        = string
  default     = "mlplatform"
}
 
variable "location" {
  description = "Azure region"
  type        = string
  default     = "eastus"  # Good GPU availability
}
 
variable "kubernetes_version" {
  description = "AKS Kubernetes version"
  type        = string
  default     = "1.28"
}
 
variable "gpu_node_count" {
  description = "Initial GPU node count"
  type        = number
  default     = 1  # Start small, scale up
}
 
variable "gpu_vm_size" {
  description = "GPU VM size"
  type        = string
  default     = "Standard_NC6s_v3"  # V100, cheaper than A100
}

Why East US? GPU quota. Some regions (looking at you, West Europe) are perpetually out of GPU capacity. East US and South Central US usually have stock. Check before you build.
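Both checks can be scripted before you run Terraform (standard `az` commands; the grep pattern is just an example for the NCSv3 family):

```bash
# Is the SKU offered in the region (and not restricted for your subscription)?
az vm list-skus --location eastus --size Standard_NC6s_v3 --all --output table

# Do you have vCPU quota for the NC family?
az vm list-usage --location eastus --output table | grep -i "NCSv3"
```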

The Resource Group

hcl
# main.tf
resource "azurerm_resource_group" "main" {
  name     = "${var.prefix}-rg"
  location = var.location
 
  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

Container Registry

hcl
resource "azurerm_container_registry" "main" {
  name                = "${var.prefix}acr"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  sku                 = "Standard"
  admin_enabled       = false  # Use service principals, not admin account
 
  # Enable for AKS pull
  anonymous_pull_enabled = false
}
 
# Grant AKS pull access to ACR
resource "azurerm_role_assignment" "aks_acr_pull" {
  scope                = azurerm_container_registry.main.id
  role_definition_name = "AcrPull"
  principal_id         = azurerm_kubernetes_cluster.main.kubelet_identity[0].object_id
}

The gotcha: ACR takes forever to delete. If you're testing and destroying repeatedly, use a random suffix on the name or you'll hit "name already exists" errors for 30+ minutes.
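One way to dodge the collision is a random suffix on the registry name — a sketch assuming you add the `hashicorp/random` provider to `required_providers`:

```hcl
# Assumes the hashicorp/random provider is declared in required_providers
resource "random_string" "acr_suffix" {
  length  = 4
  upper   = false
  special = false
}

# Then in azurerm_container_registry.main:
#   name = "${var.prefix}acr${random_string.acr_suffix.result}"
```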

Storage Account for Data

hcl
resource "azurerm_storage_account" "ml" {
  name                     = "${var.prefix}mlsa"
  resource_group_name      = azurerm_resource_group.main.name
  location                 = azurerm_resource_group.main.location
  account_tier             = "Standard"
  account_replication_type = "LRS"  # Upgrade to GRS for production
 
  # Hierarchical namespace for Data Lake Gen2
  is_hns_enabled = true
 
  blob_properties {
    versioning_enabled = true
  }
}
 
resource "azurerm_storage_container" "datasets" {
  name                  = "datasets"
  storage_account_name  = azurerm_storage_account.ml.name
  container_access_type = "private"
}
 
resource "azurerm_storage_container" "models" {
  name                  = "models"
  storage_account_name  = azurerm_storage_account.ml.name
  container_access_type = "private"
}
 
resource "azurerm_storage_container" "checkpoints" {
  name                  = "checkpoints"
  storage_account_name  = azurerm_storage_account.ml.name
  container_access_type = "private"
}

Why hierarchical namespace? Because you'll eventually want to use Azure ML or Databricks with this storage, and they expect Data Lake Gen2. Enable it now, save the migration headache later.

The AKS Cluster - The Main Event

This is where it gets interesting. We're building a cluster with three node pools, and we need to be careful about how we do it.

hcl
resource "azurerm_kubernetes_cluster" "main" {
  name                = "${var.prefix}-aks"
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name
  dns_prefix          = var.prefix
  kubernetes_version  = var.kubernetes_version
 
  # System node pool - this is mandatory and can't be deleted
  default_node_pool {
    name                = "system"
    node_count          = 2
    vm_size             = "Standard_D4s_v3"
    type                = "VirtualMachineScaleSets"
    enable_auto_scaling = true
    min_count           = 2
    max_count           = 4
 
    # Keep workloads off the system pool. node_taints isn't allowed on the
    # default pool; this flag applies the CriticalAddonsOnly taint for you.
    only_critical_addons_enabled = true
 
    # Labels for node selection
    node_labels = {
      "node-type" = "system"
    }
 
    os_disk_size_gb = 128
  }
 
  identity {
    type = "SystemAssigned"
  }
 
  network_profile {
    network_plugin    = "azure"
    network_policy    = "calico"
    load_balancer_sku = "standard"
  }
 
  # Enable features we'll need
  azure_policy_enabled = true
 
  # OMS agent for monitoring
  oms_agent {
    log_analytics_workspace_id = azurerm_log_analytics_workspace.main.id
  }
 
  # Enable CSI drivers for storage
  storage_profile {
    blob_driver_enabled = true
    disk_driver_enabled = true
    file_driver_enabled = true
  }
}

Critical detail: That CriticalAddonsOnly taint on the system pool. Without this, your GPU workloads will happily schedule on your system nodes and either fail (no GPU) or cost you a fortune (if you accidentally used GPU-enabled VMs for system). Always taint your system pool.

The CPU Work Node Pool

hcl
resource "azurerm_kubernetes_cluster_node_pool" "cpu" {
  name                  = "cpu"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
  vm_size               = "Standard_D8s_v3"
  node_count            = 2
  enable_auto_scaling   = true
  min_count             = 2
  max_count             = 10
 
  node_labels = {
    "node-type" = "cpu-work"
    "workload"  = "general"
  }
 
  os_disk_size_gb = 256
 
  # Spot instances for cost savings (optional, remove for production stability)
  priority = "Spot"
  eviction_policy = "Delete"
 
  tags = {
    "node-type" = "cpu-work"
  }
}

The spot instance gamble: I'm using Spot VMs here because they're 70-90% cheaper. But Azure can evict them with 30 seconds' notice, and AKS automatically taints Spot pools with kubernetes.azure.com/scalesetpriority=spot:NoSchedule, so pods need a matching toleration to land there. For training jobs that can't be interrupted, use priority = "Regular". For web services that can fail over, Spot is fine.
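A pod spec fragment for workloads that are allowed onto the Spot pool (tolerating the taint AKS applies automatically):

```yaml
# Fragment of a pod spec targeting the Spot CPU pool
tolerations:
- key: kubernetes.azure.com/scalesetpriority
  operator: Equal
  value: spot
  effect: NoSchedule
nodeSelector:
  node-type: cpu-work
```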

The GPU Node Pool - Where the Money Goes

hcl
resource "azurerm_kubernetes_cluster_node_pool" "gpu" {
  name                  = "gpu"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
  vm_size               = var.gpu_vm_size
  node_count            = var.gpu_node_count
  enable_auto_scaling   = true
  min_count             = 0    # Scale to zero when not needed!
  max_count             = 4    # Limit for cost control
 
  node_labels = {
    "node-type" = "gpu"
    "accelerator" = "nvidia"
  }
 
  # Taint so only GPU workloads schedule here
  node_taints = [
    "nvidia.com/gpu=true:NoSchedule"
  ]
 
  os_disk_size_gb = 512  # Large models need space
 
  # GPU nodes are expensive - use regular priority
  priority = "Regular"
 
  tags = {
    "node-type" = "gpu"
    "cost-center" = "ml-training"
  }
}

The scale-to-zero trick: min_count = 0 means when you're not training, you're not paying. Karpenter (which we'll install later) will spin these up on demand. This one setting can save you thousands per month.

Log Analytics for Monitoring

hcl
resource "azurerm_log_analytics_workspace" "main" {
  name                = "${var.prefix}-logs"
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name
  sku                 = "PerGB2018"
  retention_in_days   = 30
}

Outputs - What We Need for Later

hcl
# outputs.tf
output "aks_name" {
  value = azurerm_kubernetes_cluster.main.name
}
 
output "aks_resource_group" {
  value = azurerm_resource_group.main.name
}
 
output "acr_login_server" {
  value = azurerm_container_registry.main.login_server
}
 
output "storage_account_name" {
  value = azurerm_storage_account.ml.name
}
 
output "kube_config" {
  value     = azurerm_kubernetes_cluster.main.kube_config_raw
  sensitive = true
}

Part 2: Deploying It - The Terraform Dance

Now let's actually run this thing.

bash
# Initialize Terraform
terraform init
 
# See what it will create
terraform plan -out=tfplan
 
# Apply it (this takes 10-15 minutes)
terraform apply tfplan

While you wait: AKS cluster creation is slow. Go get coffee. Seriously, 10-15 minutes is normal.

Getting kubectl Access

bash
# Get credentials
az aks get-credentials \
  --name $(terraform output -raw aks_name) \
  --resource-group $(terraform output -raw aks_resource_group) \
  --overwrite-existing
 
# Verify
kubectl get nodes

You should see something like:

NAME                              STATUS   ROLES   AGE   VERSION
aks-cpu-12345678-vmss000000       Ready    agent   5m    v1.28.3
aks-cpu-12345678-vmss000001       Ready    agent   5m    v1.28.3
aks-gpu-87654321-vmss000000       Ready    agent   3m    v1.28.3
aks-system-abcdef12-vmss000000    Ready    agent   8m    v1.28.3
aks-system-abcdef12-vmss000001    Ready    agent   8m    v1.28.3

Notice the GPU node is there even though we said min_count = 0? That's because we set node_count = 1 at creation time (var.gpu_node_count), so the pool starts with one node. Once it sits idle, the autoscaler will scale it down to zero.

Part 3: Helm - Installing the Good Stuff

Now we need to install the Kubernetes add-ons that make this actually usable for ML.

The GPU Operator - Making GPUs Work

First, the NVIDIA GPU operator. This installs drivers, device plugin, and all the machinery to make GPUs usable in Kubernetes.

bash
# Add the NVIDIA Helm repo
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
 
# Install GPU operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --wait

The gotcha: This takes 5-10 minutes. It's compiling drivers and setting up the stack. Don't panic if pods stay in Init state for a while.

Verify it worked:

bash
# Check GPU nodes are labeled
kubectl get nodes -L nvidia.com/gpu.product
 
# Should show:
# NAME                            GPU.PRODUCT
# aks-gpu-87654321-vmss000000     Tesla-V100
 
# Check device plugin
kubectl get pods -n gpu-operator
 
# Should see pods like:
# gpu-operator-xxx
# nvidia-container-toolkit-daemonset-xxx
# nvidia-dcgm-exporter-xxx
# nvidia-device-plugin-daemonset-xxx

Karpenter - The Magic Autoscaler

Karpenter is better than the default Cluster Autoscaler. It provisions nodes based on pending pod specs, not pre-defined pools.

Wait, Karpenter on Azure? Yes — but not via the upstream chart. The chart at charts.karpenter.sh is the AWS distribution (it even expects an eks.amazonaws.com/role-arn service-account annotation, which means nothing on Azure). On Azure, Karpenter ships as the AKS Karpenter provider, exposed as the managed "node auto provisioning" mode. As of late 2024 it's still in preview, so for production you may prefer the Cluster Autoscaler — but Karpenter is the future. Enabling the preview looks roughly like this (requires the aks-preview CLI extension; flags may change while it's in preview):

bash
# Node auto provisioning (managed Karpenter) - preview as of late 2024
az aks update \
  --name $(terraform output -raw aks_name) \
  --resource-group $(terraform output -raw aks_resource_group) \
  --node-provisioning-mode Auto

Here's the Cluster Autoscaler alternative. Note that our Terraform already set enable_auto_scaling = true on every pool, so you only need this if a pool was created without it — and on a multi-pool cluster you target a specific pool with az aks nodepool update:

bash
# Enable the Azure-managed cluster autoscaler on a specific pool
az aks nodepool update \
  --cluster-name $(terraform output -raw aks_name) \
  --resource-group $(terraform output -raw aks_resource_group) \
  --name cpu \
  --enable-cluster-autoscaler \
  --min-count 2 \
  --max-count 10

Prometheus + Grafana - Seeing What's Happening

bash
# Add Prometheus Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
 
# Install kube-prometheus-stack
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set grafana.enabled=true \
  --set grafana.adminPassword=admin123 \
  --wait

Access Grafana:

bash
# Port-forward to access Grafana
kubectl port-forward -n monitoring svc/monitoring-grafana 8080:80
 
# Now open http://localhost:8080 in your browser
# Login: admin / admin123

Change that password. Seriously. The default is fine for testing, but change it before you expose this to the internet.
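One way to keep the password out of your shell history — assuming the bundled Grafana chart's admin.existingSecret support (with its default admin-user/admin-password keys):

```bash
# Store the admin credentials in a Kubernetes secret first
kubectl create secret generic grafana-admin \
  --namespace monitoring \
  --from-literal=admin-user=admin \
  --from-literal=admin-password="$(openssl rand -base64 24)"

# Point the chart at the secret instead of a plaintext value
helm upgrade monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --reuse-values \
  --set grafana.admin.existingSecret=grafana-admin
```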

KServe - Model Serving

bash
# Install cert-manager first (KServe dependency)
helm repo add jetstack https://charts.jetstack.io
helm repo update
 
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set installCRDs=true \
  --wait
 
# Install KServe (its charts are published as OCI artifacts - no repo add)
helm install kserve-crd oci://ghcr.io/kserve/charts/kserve-crd \
  --namespace kserve \
  --create-namespace \
  --wait

helm install kserve oci://ghcr.io/kserve/charts/kserve \
  --namespace kserve \
  --wait
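To smoke-test the install, the canonical sklearn iris example from the KServe docs works well:

```yaml
# inference-test.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
```

Apply it, then watch kubectl get inferenceservice sklearn-iris until READY shows True.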

Part 4: Testing It - Does It Actually Work?

Let's run a real training job and see if this thing works.

A Simple GPU Test

yaml
# gpu-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  containers:
  - name: pytorch
    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
    command: ["python", "-c"]
    args:
      - |
        import torch
        print(f"PyTorch version: {torch.__version__}")
        print(f"CUDA available: {torch.cuda.is_available()}")
        if torch.cuda.is_available():
            print(f"CUDA version: {torch.version.cuda}")
            print(f"GPU count: {torch.cuda.device_count()}")
            print(f"GPU name: {torch.cuda.get_device_name(0)}")
    resources:
      limits:
        nvidia.com/gpu: 1
  nodeSelector:
    node-type: gpu
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

bash
kubectl apply -f gpu-test.yaml
kubectl logs -f gpu-test

Expected output:

PyTorch version: 2.1.0
CUDA available: True
CUDA version: 12.1
GPU count: 1
GPU name: Tesla V100-PCIE-16GB

If you see CUDA available: False, something's wrong with the GPU operator. Check the logs:

bash
kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset

A Real Training Job

yaml
# training-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-training
spec:
  template:
    spec:
      containers:
      - name: training
        image: your-acr.azurecr.io/training:latest  # Build and push this
        command: ["python", "train.py"]
        env:
        - name: DATA_PATH
          value: "/data"
        - name: MODEL_PATH
          value: "/models"
        - name: CHECKPOINT_PATH
          value: "/checkpoints"
        resources:
          limits:
            nvidia.com/gpu: 2  # Request 2 GPUs
            memory: "64Gi"
            cpu: "16"
        volumeMounts:
        - name: data
          mountPath: /data
        - name: models
          mountPath: /models
        - name: checkpoints
          mountPath: /checkpoints
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: dataset-pvc
      - name: models
        persistentVolumeClaim:
          claimName: model-pvc
      - name: checkpoints
        persistentVolumeClaim:
          claimName: checkpoint-pvc
      nodeSelector:
        node-type: gpu
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      restartPolicy: Never

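The job above references PVCs like dataset-pvc that we haven't created yet. With the blob CSI driver we enabled on the cluster, a dynamically provisioned claim is the quickest sketch — note that dynamic provisioning creates a new blob container; mounting the existing datasets container requires a static PersistentVolume instead:

```yaml
# dataset-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dataset-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: azureblob-fuse-premium  # built-in class from the blob CSI driver
  resources:
    requests:
      storage: 1Ti
```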
Part 5: The Cost Reality

Let me be honest about what this costs.

Component                        Monthly Cost (approx)
AKS Control Plane                $0.10/hour = ~$73/month
System nodes (2x D4s_v3)         ~$280/month
CPU nodes (2x D8s_v3, Spot)      ~$150/month
GPU nodes (1x NC6s_v3)           ~$800/month
Storage (1TB)                    ~$20/month
Monitoring                       ~$50/month
Total (running)                  ~$1,373/month

But: With min_count = 0 on GPU nodes and smart scheduling, you can get this down to ~$600/month if you're only training occasionally.

Part 6: What Breaks and How to Fix It

Problem: GPU nodes won't scale down

Symptom: GPU nodes stay at 1 even when no workloads are running.

Cause: Something is running on them (system pods, monitoring, etc.).

Fix: Check what's running:

bash
kubectl describe node aks-gpu-xxxxx

Look for pods with kube-system namespace. Add node affinity to keep them off GPU nodes:

yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node-type
          operator: NotIn
          values:
          - gpu
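If nothing is pinning the node and it still lingers, the autoscaler's default scale-down timers are conservative; you can tighten them via the cluster autoscaler profile (the key names below are the documented az profile settings, but double-check against your CLI version):

```bash
# Scale idle nodes down after 5 minutes instead of the default 10
az aks update \
  --name $(terraform output -raw aks_name) \
  --resource-group $(terraform output -raw aks_resource_group) \
  --cluster-autoscaler-profile \
    scale-down-unneeded-time=5m \
    scale-down-delay-after-add=5m
```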

Problem: Pods stuck in Pending

Symptom: Your GPU job stays in Pending forever.

Check:

bash
kubectl describe pod pytorch-training-xxxxx

Look for events. Common causes:

  • Insufficient nvidia.com/gpu - No GPU nodes available, check autoscaler
  • Failed to pull image - ACR authentication issue
  • 0/4 nodes are available - Check taints/tolerations match

Problem: Out of memory (OOM) during training

Symptom: Pod restarts, exit code 137.

Fix: Increase memory limit or use gradient checkpointing. Also check if you're loading the whole dataset into RAM - use DataLoader with num_workers and pin_memory.

Summary: What You Built

You now have:

  • An AKS cluster with separate system, CPU, and GPU node pools
  • Auto-scaling that scales GPU nodes to zero
  • Container registry for your training images
  • Blob storage for datasets and models
  • GPU operator for NVIDIA driver management
  • Monitoring with Prometheus and Grafana
  • KServe for model serving

This is a production-ready foundation. From here, you can add:

  • Kubeflow or MLflow for experiment tracking
  • Argo Workflows for pipeline orchestration
  • Vault or Azure Key Vault for secrets management
  • Azure AD integration for authentication

The real work starts now - building your actual ML pipelines. But the infrastructure? That's solid.

Need help with the next layer? That's where the real fun begins.


Want us to build this for your team? We've deployed ML infrastructure on Azure for businesses of all sizes. Schedule a discovery call and let's talk about what you need.
