Building a Real ML Platform on Azure: The Messy Truth
So you want to build an ML platform on Azure. Not the toy examples. Not the "hello world" notebooks. The real thing - training jobs that run for days, model serving that handles actual traffic, and a bill that doesn't make your CFO cry.
I've done this dance. Let me walk you through what actually works, what breaks, and what Azure doesn't tell you in the docs.
Table of Contents
- What We're Building
- Part 1: Terraform - Infrastructure as Code That Actually Works
- The Provider Setup
- Variables - Make It Yours
- The Resource Group
- Container Registry
- Storage Account for Data
- The AKS Cluster - The Main Event
- The CPU Work Node Pool
- The GPU Node Pool - Where the Money Goes
- Log Analytics for Monitoring
- Outputs - What We Need for Later
- Part 2: Deploying It - The Terraform Dance
- Getting kubectl Access
- Part 3: Helm - Installing the Good Stuff
- The GPU Operator - Making GPUs Work
- Karpenter - The Magic Autoscaler
- Prometheus + Grafana - Seeing What's Happening
- KServe - Model Serving
- Part 4: Testing It - Does It Actually Work?
- A Simple GPU Test
- A Real Training Job
- Part 5: The Cost Reality
- Part 6: What Breaks and How to Fix It
- Problem: GPU nodes won't scale down
- Problem: Pods stuck in Pending
- Problem: Out of memory (OOM) during training
- Summary: What You Built
What We're Building
Here's the architecture we're aiming for:
graph TD
subgraph Azure
AKS["AKS<br/>(Kubernetes)"]
ACR["ACR<br/>(Container Images)"]
Storage["Storage<br/>(Data & Models)"]
subgraph Node Pools
System["System Pool<br/>(tainted)"]
CPU["CPU Work Pool<br/>(general compute)"]
GPU["GPU Pool<br/>(NVIDIA)"]
end
AKS --> System
AKS --> CPU
AKS --> GPU
AKS -.->|pull images| ACR
AKS -.->|mount data| Storage
end
Three node pools:
- System: Core Kubernetes stuff, tainted so workloads don't land here
- CPU Work: General compute, data preprocessing, web services
- GPU: The expensive stuff, only for training and inference
Let's build it.
Part 1: Terraform - Infrastructure as Code That Actually Works
First, the Terraform setup. I'm assuming you have the Azure CLI logged in (az login) and a subscription selected.
The Provider Setup
# providers.tf
terraform {
required_version = ">= 1.5.0"
required_providers {
azurerm = {
source = "hashicorp/azurerm"
version = "~> 3.75"
}
helm = {
source = "hashicorp/helm"
version = "~> 2.11"
}
kubernetes = {
source = "hashicorp/kubernetes"
version = "~> 2.23"
}
}
backend "azurerm" {
resource_group_name = "tfstate"
storage_account_name = "tfstatemlplatform"
container_name = "tfstate"
key = "ml-platform.tfstate"
}
}
provider "azurerm" {
features {
resource_group {
prevent_deletion_if_contains_resources = false
}
}
}
# We'll configure helm/kubernetes providers after AKS is created
The gotcha: That backend block assumes you've already created the storage account for Terraform state. Do that first, manually. Don't try to Terraform your Terraform backend. I've seen people recurse themselves into oblivion.
# One-time setup for Terraform state
az group create --name tfstate --location eastus
az storage account create \
--name tfstatemlplatform \
--resource-group tfstate \
--sku Standard_LRS \
--encryption-services blob
az storage container create \
--name tfstate \
--account-name tfstatemlplatform
Variables - Make It Yours
# variables.tf
variable "prefix" {
description = "Prefix for all resources"
type = string
default = "mlplatform"
}
variable "location" {
description = "Azure region"
type = string
default = "eastus" # Good GPU availability
}
variable "kubernetes_version" {
description = "AKS Kubernetes version"
type = string
default = "1.28"
}
variable "gpu_node_count" {
description = "Initial GPU node count"
type = number
default = 1 # Start small, scale up
}
variable "gpu_vm_size" {
description = "GPU VM size"
type = string
default = "Standard_NC6s_v3" # V100, cheaper than A100
}
Why East US? GPU quota. Some regions (looking at you, West Europe) are perpetually out of GPU capacity. East US and South Central US usually have stock. Check before you build.
The Resource Group
# main.tf
resource "azurerm_resource_group" "main" {
name = "${var.prefix}-rg"
location = var.location
tags = {
Environment = "production"
ManagedBy = "terraform"
}
}
Container Registry
resource "azurerm_container_registry" "main" {
name = "${var.prefix}acr"
resource_group_name = azurerm_resource_group.main.name
location = azurerm_resource_group.main.location
sku = "Standard"
admin_enabled = false # Use service principals, not admin account
# No anonymous pulls - AKS authenticates via the AcrPull role assignment below
anonymous_pull_enabled = false
}
# Grant AKS pull access to ACR
resource "azurerm_role_assignment" "aks_acr_pull" {
scope = azurerm_container_registry.main.id
role_definition_name = "AcrPull"
principal_id = azurerm_kubernetes_cluster.main.kubelet_identity[0].object_id
}
The gotcha: ACR takes forever to delete. If you're testing and destroying repeatedly, use a random suffix on the name or you'll hit "name already exists" errors for 30+ minutes.
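One way to do that suffix - a sketch using the hashicorp/random provider (you'd add it to required_providers alongside azurerm):

```hcl
# Random suffix so create/destroy cycles don't collide on ACR's globally-unique name
resource "random_string" "acr_suffix" {
  length  = 6
  special = false
  upper   = false # ACR names are lowercase alphanumeric only
}

# then, in the registry resource:
#   name = "${var.prefix}acr${random_string.acr_suffix.result}"
```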
Storage Account for Data
resource "azurerm_storage_account" "ml" {
name = "${var.prefix}mlsa"
resource_group_name = azurerm_resource_group.main.name
location = azurerm_resource_group.main.location
account_tier = "Standard"
account_replication_type = "LRS" # Upgrade to GRS for production
# Hierarchical namespace for Data Lake Gen2
is_hns_enabled = true
blob_properties {
versioning_enabled = true
}
}
resource "azurerm_storage_container" "datasets" {
name = "datasets"
storage_account_name = azurerm_storage_account.ml.name
container_access_type = "private"
}
resource "azurerm_storage_container" "models" {
name = "models"
storage_account_name = azurerm_storage_account.ml.name
container_access_type = "private"
}
resource "azurerm_storage_container" "checkpoints" {
name = "checkpoints"
storage_account_name = azurerm_storage_account.ml.name
container_access_type = "private"
}
Why hierarchical namespace? Because you'll eventually want to use Azure ML or Databricks with this storage, and they expect Data Lake Gen2. Enable it now, save the migration headache later.
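Once the cluster is up (with the blob CSI driver we enable below), you can surface blob storage to pods as PersistentVolumeClaims. A sketch - the azureblob-fuse-premium storage class is the built-in one I've seen on AKS, but confirm with kubectl get storageclass on your cluster:

```yaml
# dataset-pvc.yaml - dynamic provisioning via the blob CSI driver.
# Note: dynamic provisioning creates a fresh container; to mount the
# existing "datasets" container you'd bind a static PersistentVolume instead.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dataset-pvc
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: azureblob-fuse-premium
  resources:
    requests:
      storage: 100Gi
```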
The AKS Cluster - The Main Event
This is where it gets interesting. We're building a cluster with three node pools, and we need to be careful about how we do it.
resource "azurerm_kubernetes_cluster" "main" {
name = "${var.prefix}-aks"
location = azurerm_resource_group.main.location
resource_group_name = azurerm_resource_group.main.name
dns_prefix = var.prefix
kubernetes_version = var.kubernetes_version
# System node pool - this is mandatory and can't be deleted
default_node_pool {
name = "system"
node_count = 2
vm_size = "Standard_D4s_v3"
type = "VirtualMachineScaleSets"
enable_auto_scaling = true
min_count = 2
max_count = 4
# Taint this pool so only system workloads run here.
# Note: azurerm doesn't accept node_taints on the default pool -
# this flag applies CriticalAddonsOnly=true:NoSchedule for you.
only_critical_addons_enabled = true
# Labels for node selection
node_labels = {
"node-type" = "system"
}
os_disk_size_gb = 128
}
identity {
type = "SystemAssigned"
}
network_profile {
network_plugin = "azure"
network_policy = "calico"
load_balancer_sku = "standard"
}
# Enable features we'll need
azure_policy_enabled = true
# OMS agent for monitoring
oms_agent {
log_analytics_workspace_id = azurerm_log_analytics_workspace.main.id
}
# Enable CSI drivers for storage
storage_profile {
blob_driver_enabled = true
disk_driver_enabled = true
file_driver_enabled = true
}
}
Critical detail: That CriticalAddonsOnly taint on the system pool. Without this, your GPU workloads will happily schedule on your system nodes and either fail (no GPU) or cost you a fortune (if you accidentally used GPU-enabled VMs for system). Always taint your system pool.
The CPU Work Node Pool
resource "azurerm_kubernetes_cluster_node_pool" "cpu" {
name = "cpu"
kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
vm_size = "Standard_D8s_v3"
node_count = 2
enable_auto_scaling = true
min_count = 2
max_count = 10
node_labels = {
"node-type" = "cpu-work"
"workload" = "general"
}
os_disk_size_gb = 256
# Spot instances for cost savings (optional, remove for production stability)
priority = "Spot"
eviction_policy = "Delete"
tags = {
"node-type" = "cpu-work"
}
}
The spot instance gamble: I'm using Spot VMs here because they're 70-90% cheaper. But Azure can evict them with 30 seconds' notice. For training jobs that can't be interrupted, use priority = "Regular". For web services that can fail over, Spot is fine.
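One more thing about Spot on AKS: the platform automatically taints Spot pools with kubernetes.azure.com/scalesetpriority=spot:NoSchedule, so nothing lands there unless it explicitly tolerates eviction. Any workload you want on this pool needs a fragment like this in its pod spec:

```yaml
# Pod spec fragment - required for anything scheduled onto the Spot pool,
# because AKS taints Spot node pools automatically.
tolerations:
- key: kubernetes.azure.com/scalesetpriority
  operator: Equal
  value: spot
  effect: NoSchedule
```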
The GPU Node Pool - Where the Money Goes
resource "azurerm_kubernetes_cluster_node_pool" "gpu" {
name = "gpu"
kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
vm_size = var.gpu_vm_size
node_count = var.gpu_node_count
enable_auto_scaling = true
min_count = 0 # Scale to zero when not needed!
max_count = 4 # Limit for cost control
node_labels = {
"node-type" = "gpu"
"accelerator" = "nvidia"
}
# Taint so only GPU workloads schedule here
node_taints = [
"nvidia.com/gpu=true:NoSchedule"
]
os_disk_size_gb = 512 # Large models need space
# GPU nodes are expensive - use regular priority
priority = "Regular"
tags = {
"node-type" = "gpu"
"cost-center" = "ml-training"
}
}
The scale-to-zero trick: min_count = 0 means when you're not training, you're not paying. The autoscaler will spin these up on demand. This one setting can save you thousands per month.
Log Analytics for Monitoring
resource "azurerm_log_analytics_workspace" "main" {
name = "${var.prefix}-logs"
location = azurerm_resource_group.main.location
resource_group_name = azurerm_resource_group.main.name
sku = "PerGB2018"
retention_in_days = 30
}
Outputs - What We Need for Later
# outputs.tf
output "aks_name" {
value = azurerm_kubernetes_cluster.main.name
}
output "aks_resource_group" {
value = azurerm_resource_group.main.name
}
output "acr_login_server" {
value = azurerm_container_registry.main.login_server
}
output "storage_account_name" {
value = azurerm_storage_account.ml.name
}
output "kube_config" {
value = azurerm_kubernetes_cluster.main.kube_config_raw
sensitive = true
}
Part 2: Deploying It - The Terraform Dance
Now let's actually run this thing.
# Initialize Terraform
terraform init
# See what it will create
terraform plan -out=tfplan
# Apply it (this takes 10-15 minutes)
terraform apply tfplan
While you wait: AKS cluster creation is slow. Go get coffee. Seriously, 10-15 minutes is normal.
Getting kubectl Access
# Get credentials
az aks get-credentials \
--name $(terraform output -raw aks_name) \
--resource-group $(terraform output -raw aks_resource_group) \
--overwrite-existing
# Verify
kubectl get nodes
You should see something like:
NAME STATUS ROLES AGE VERSION
aks-cpu-12345678-vmss000000 Ready agent 5m v1.28.3
aks-cpu-12345678-vmss000001 Ready agent 5m v1.28.3
aks-gpu-87654321-vmss000000 Ready agent 3m v1.28.3
aks-system-abcdef12-vmss000000 Ready agent 8m v1.28.3
aks-system-abcdef12-vmss000001 Ready agent 8m v1.28.3
Notice the GPU node is there even though we said min_count = 0? That's because we also set node_count = 1, which creates one node up front. The autoscaler will take it down to zero after a few minutes of idle.
Part 3: Helm - Installing the Good Stuff
Now we need to install the Kubernetes add-ons that make this actually usable for ML.
The GPU Operator - Making GPUs Work
First, the NVIDIA GPU operator. This installs drivers, device plugin, and all the machinery to make GPUs usable in Kubernetes.
# Add the NVIDIA Helm repo
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
# Install GPU operator
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator \
--create-namespace \
--wait
The gotcha: This takes 5-10 minutes. It's installing drivers and setting up the whole stack. Don't panic if pods stay in Init state for a while.
Verify it worked:
# Check GPU nodes are labeled
kubectl get nodes -L nvidia.com/gpu.product
# Should show:
# NAME GPU.PRODUCT
# aks-gpu-87654321-vmss000000 Tesla-V100
# Check device plugin
kubectl get pods -n gpu-operator
# Should see pods like:
# gpu-operator-xxx
# nvidia-container-toolkit-daemonset-xxx
# nvidia-dcgm-exporter-xxx
# nvidia-device-plugin-daemonset-xxx
Karpenter - The Magic Autoscaler
Karpenter is better than the default Cluster Autoscaler. It provisions nodes based on pending pod specs, not pre-defined pools.
Bad news first: the upstream chart at charts.karpenter.sh - the one most tutorials show, complete with an eks.amazonaws.com IAM role annotation - is the AWS distribution, and it does nothing useful on AKS. The Azure port lives in the karpenter-provider-azure project and is surfaced through AKS's "node autoprovisioning" feature, which as of late 2024 is still in preview. For production, stick with the Azure-managed Cluster Autoscaler - which our Terraform already enabled on every pool via enable_auto_scaling = true.
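If you do experiment with node autoprovisioning, Karpenter is configured with NodePool resources instead of Terraform node pools. A sketch of the shape - the AKSNodeClass group and field names come from the preview-stage Azure provider and may shift, so treat this as illustrative, not gospel:

```yaml
# Karpenter NodePool sketch for GPU capacity (Azure provider, preview)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu
spec:
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand"]
      - key: node.kubernetes.io/instance-type
        operator: In
        values: ["Standard_NC6s_v3"]
      nodeClassRef:
        group: karpenter.azure.com
        kind: AKSNodeClass
        name: default
  limits:
    nvidia.com/gpu: 4 # hard cap, same idea as max_count
  disruption:
    consolidationPolicy: WhenEmpty # tear nodes down when idle
```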
To adjust the Azure-managed Cluster Autoscaler's bounds on a pool later, without touching Terraform:
# Tune autoscaler bounds per node pool (Azure-managed)
az aks nodepool update \
--cluster-name $(terraform output -raw aks_name) \
--resource-group $(terraform output -raw aks_resource_group) \
--name cpu \
--update-cluster-autoscaler \
--min-count 2 \
--max-count 10
Prometheus + Grafana - Seeing What's Happening
# Add Prometheus Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install kube-prometheus-stack
helm install monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set prometheus.prometheusSpec.retention=30d \
--set grafana.enabled=true \
--set grafana.adminPassword=admin123 \
--wait
Access Grafana:
# Port-forward to access Grafana
kubectl port-forward -n monitoring svc/monitoring-grafana 8080:80
# Now open http://localhost:8080 in your browser
# Login: admin / admin123
Change that password. Seriously. The default is fine for testing, but change it before you expose this to the internet.
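Better still, keep the password out of your shell history entirely and point the Grafana subchart at a Kubernetes Secret. The admin.existingSecret / userKey / passwordKey values come from the upstream Grafana chart that kube-prometheus-stack wraps - double-check the value paths against the chart version you install:

```yaml
# values.yaml for kube-prometheus-stack
grafana:
  admin:
    existingSecret: grafana-admin # create this Secret in the monitoring namespace
    userKey: admin-user
    passwordKey: admin-password
```

Then apply it with helm upgrade monitoring prometheus-community/kube-prometheus-stack -n monitoring -f values.yaml.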
KServe - Model Serving
# Install cert-manager first (KServe dependency)
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--create-namespace \
--set installCRDs=true \
--wait
# Install KServe - charts are published as OCI artifacts, so no repo add needed
helm install kserve-crd oci://ghcr.io/kserve/charts/kserve-crd \
--namespace kserve \
--create-namespace \
--wait
helm install kserve oci://ghcr.io/kserve/charts/kserve \
--namespace kserve \
--wait
One caveat: KServe defaults to its "Serverless" mode, which expects Knative Serving and an ingress layer on the cluster. If you don't want that stack, switch the default deployment mode to RawDeployment (plain Kubernetes Deployments) in KServe's inferenceservice-config ConfigMap.
Part 4: Testing It - Does It Actually Work?
Let's run a real training job and see if this thing works.
A Simple GPU Test
# gpu-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  containers:
  - name: pytorch
    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
    command: ["python", "-c"]
    args:
      - |
        import torch
        print(f"PyTorch version: {torch.__version__}")
        print(f"CUDA available: {torch.cuda.is_available()}")
        if torch.cuda.is_available():
            print(f"CUDA version: {torch.version.cuda}")
            print(f"GPU count: {torch.cuda.device_count()}")
            print(f"GPU name: {torch.cuda.get_device_name(0)}")
    resources:
      limits:
        nvidia.com/gpu: 1
  nodeSelector:
    node-type: gpu
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
kubectl apply -f gpu-test.yaml
kubectl logs -f gpu-testExpected output:
PyTorch version: 2.1.0
CUDA available: True
CUDA version: 12.1
GPU count: 1
GPU name: Tesla V100-PCIE-16GB
If you see CUDA available: False, something's wrong with the GPU operator. Check the logs:
kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset
A Real Training Job
# training-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-training
spec:
  template:
    spec:
      containers:
      - name: training
        image: your-acr.azurecr.io/training:latest # Build and push this
        command: ["python", "train.py"]
        env:
        - name: DATA_PATH
          value: "/data"
        - name: MODEL_PATH
          value: "/models"
        - name: CHECKPOINT_PATH
          value: "/checkpoints"
        resources:
          limits:
            nvidia.com/gpu: 1 # NC6s_v3 has a single V100; asking for 2 needs NC12s_v3+
            memory: "64Gi"
            cpu: "4" # NC6s_v3 has 6 vCPUs - leave headroom for system daemons
        volumeMounts:
        - name: data
          mountPath: /data
        - name: models
          mountPath: /models
        - name: checkpoints
          mountPath: /checkpoints
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: dataset-pvc
      - name: models
        persistentVolumeClaim:
          claimName: model-pvc
      - name: checkpoints
        persistentVolumeClaim:
          claimName: checkpoint-pvc
      nodeSelector:
        node-type: gpu
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      restartPolicy: Never
Part 5: The Cost Reality
Let me be honest about what this costs.
| Component | Monthly Cost (approx) |
|---|---|
| AKS Control Plane | $0.10/hour = ~$73/month |
| System nodes (2x D4s_v3) | ~$280/month |
| CPU nodes (2x D8s_v3, Spot) | ~$150/month |
| GPU nodes (1x NC6s_v3) | ~$800/month |
| Storage (1TB) | ~$20/month |
| Monitoring | ~$50/month |
| Total (running) | ~$1,373/month |
But: With min_count = 0 on GPU nodes and smart scheduling, you can get this down to ~$600/month if you're only training occasionally.
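That ~$600 figure is just the table above with the GPU line scaled by its duty cycle. A quick sanity-check model - the numbers are the table's rough estimates, not Azure quotes:

```python
# Back-of-envelope monthly cost model for the platform above.
# Line items mirror the cost table; tweak for your region and usage.
COSTS = {
    "aks_control_plane": 73,
    "system_nodes_2x_d4s_v3": 280,
    "cpu_nodes_2x_d8s_v3_spot": 150,
    "gpu_node_1x_nc6s_v3": 800,
    "storage_1tb": 20,
    "monitoring": 50,
}

def monthly_total(gpu_duty_cycle: float = 1.0) -> float:
    """Total monthly cost, scaling the GPU line by the fraction of the
    month a GPU node actually exists (scale-to-zero = pay only while up)."""
    baseline = sum(COSTS.values()) - COSTS["gpu_node_1x_nc6s_v3"]
    return baseline + COSTS["gpu_node_1x_nc6s_v3"] * gpu_duty_cycle

print(monthly_total(1.0))  # GPU node always on
print(monthly_total(0.0))  # GPU pool scaled to zero
```

With the GPU always on you land at the table's ~$1,373; scaled to zero you pay ~$573, so "occasional training" settling around $600 checks out.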
Part 6: What Breaks and How to Fix It
Problem: GPU nodes won't scale down
Symptom: GPU nodes stay at 1 even when no workloads are running.
Cause: Something is running on them (system pods, monitoring, etc.).
Fix: Check what's running:
kubectl describe node aks-gpu-xxxxx
Look for pods in the kube-system namespace. Add node affinity to keep them off GPU nodes:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node-type
          operator: NotIn
          values:
          - gpu
Problem: Pods stuck in Pending
Symptom: Your GPU job stays in Pending forever.
Check:
kubectl describe pod pytorch-training-xxxxx
Look for events. Common causes:
- Insufficient nvidia.com/gpu - no GPU nodes available; check the autoscaler
- Failed to pull image - ACR authentication issue
- 0/4 nodes are available - check that taints and tolerations match
Problem: Out of memory (OOM) during training
Symptom: Pod restarts, exit code 137.
Fix: Increase the memory limit or use gradient checkpointing. Also check whether you're loading the whole dataset into RAM - stream batches through a DataLoader instead of materializing everything up front.
Summary: What You Built
You now have:
- An AKS cluster with separate system, CPU, and GPU node pools
- Auto-scaling that scales GPU nodes to zero
- Container registry for your training images
- Blob storage for datasets and models
- GPU operator for NVIDIA driver management
- Monitoring with Prometheus and Grafana
- KServe for model serving
This is a production-ready foundation. From here, you can add:
- Kubeflow or MLflow for experiment tracking
- Argo Workflows for pipeline orchestration
- Vault or Azure Key Vault for secrets management
- Azure AD integration for authentication
The real work starts now - building your actual ML pipelines. But the infrastructure? That's solid.
Need help with the next layer? That's where the real fun begins.
Want us to build this for your team? We've deployed ML infrastructure on Azure for businesses of all sizes. Schedule a discovery call and let's talk about what you need.