February 15, 2024
Terraform Azure IaC DevOps

Terraform on Azure: Production-Ready Best Practices

You've been clicking around the Azure portal for months. Maybe years. You've got resource groups named "test-rg-final-v2" and a storage account somebody created at 2am that nobody wants to delete because something depends on it. Sound familiar?

I've been managing Azure infrastructure for over six years now and I've done 50+ cloud migrations. The single biggest force multiplier I've found isn't a fancy service mesh or a Kubernetes operator. It's Terraform. Properly structured, version-controlled, pipeline-automated Terraform.

This isn't the "here's how to run terraform init" post. This is the post I wish someone had handed me before I learned these lessons the hard way across dozens of production environments. We're going deep on project structure, state management, module patterns, networking, identity, CI/CD, testing, and the real-world pitfalls that will bite you at 3am if you don't plan for them.

Let's get into it.

Table of Contents
  1. Why Terraform Over Bicep or ARM Templates?
  2. Project Structure That Actually Scales
  3. State Management Deep Dive
     • The Azure Storage Backend
     • Setting Up the State Storage Properly
     • State Locking
     • State File Disaster Recovery
  4. Provider Configuration and Version Pinning
  5. Variable Management That Doesn't Drive You Insane
     • The tfvars Pattern
     • Secrets with Azure Key Vault
  6. Module Patterns for the Real World
     • When to Modularize
     • A Real Module Example
     • Versioned Modules vs. Local Modules
  7. Identity and RBAC with Terraform
     • Service Principal for Terraform
     • Managing Role Assignments
  8. CI/CD Integration
     • GitHub Actions
     • Azure DevOps Pipeline
  9. Importing Existing Resources
  10. Common Pitfalls That Will Ruin Your Week
     • Provider Version Drift
     • The Accidental Destroy
     • State Conflicts During Team Development
     • Sensitive Data in Plan Output
  11. Testing Your Terraform
     • Level 1: Validate and Lint
     • Level 2: Plan Analysis
     • Level 3: Integration Testing with Terratest
  12. Real-World Lessons from Production
  13. Summary

Why Terraform Over Bicep or ARM Templates?

I get this question constantly. Azure has Bicep. It's first-party. It's got great VS Code tooling. Why would you choose Terraform?

Here's my honest answer: it depends on your reality, not Azure's marketing.

Multi-cloud is real. Not in the "we're running identical workloads on three clouds" fantasy way. In the "we have Azure for production, AWS for a data pipeline a contractor built, and Cloudflare for DNS and CDN" way. Terraform handles all of those with one language and one workflow. Bicep handles exactly one of them.

Team familiarity matters more than you think. If your team already knows HCL from managing AWS or GCP resources, forcing them to learn Bicep for Azure creates cognitive overhead that slows deployments. I've watched teams spend weeks learning a new DSL when they could have been shipping infrastructure.

The ecosystem is massive. The Terraform Registry has thousands of community modules and providers. Need to manage your Datadog monitors alongside your Azure resources? PagerDuty escalation policies? GitHub repository settings? Terraform has providers for all of it. Bicep will never have that breadth.

State management gives you power. This is controversial, but hear me out. Bicep fans love to point out that Bicep is stateless. But Terraform's state file is actually a feature. It gives you terraform plan -- a genuine preview of exactly what will change before you touch anything. It gives you drift detection. It gives you import capabilities. The state file is a responsibility, yes. But it's a superpower when managed correctly.

That said, if you're an all-Azure shop with a team that's already deep in the Microsoft ecosystem and you have zero multi-cloud ambitions, Bicep is a perfectly valid choice. I'm not religious about tools. I'm religious about shipping reliable infrastructure.

Project Structure That Actually Scales

Most Terraform tutorials show you a flat directory with main.tf, variables.tf, and outputs.tf. That works for a tutorial. It falls apart the moment you have dev, staging, and production environments with different configurations.

Here's the structure I use after years of iteration:

infrastructure/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── terraform.tfvars
│   │   ├── backend.tf
│   │   └── providers.tf
│   ├── staging/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── terraform.tfvars
│   │   ├── backend.tf
│   │   └── providers.tf
│   └── prod/
│       ├── main.tf
│       ├── variables.tf
│       ├── outputs.tf
│       ├── terraform.tfvars
│       ├── backend.tf
│       └── providers.tf
├── modules/
│   ├── networking/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── compute/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── storage/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── identity/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   └── monitoring/
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
└── shared/
    ├── locals.tf
    └── tags.tf

Why separate directories per environment instead of workspaces? Because Terraform workspaces share backend configuration. That means your dev state file lives in the same storage account as your prod state file with only the key name differentiating them. I've seen engineers accidentally run commands against the wrong workspace and nuke production resources. Separate directories with separate backend configs make that physically impossible.

Why backend.tf and providers.tf in each environment? Because each environment might target a different subscription, use different credentials, or need different provider versions during a migration. Explicit is better than implicit when your production environment is on the line.

The shared/ directory holds common locals and tag definitions that get referenced by environment configs. Tags are especially important -- I define them once and reference everywhere:

hcl
# shared/tags.tf
locals {
  common_tags = {
    managed_by  = "terraform"
    project     = "myproject"
    cost_center = "engineering"
    owner       = "platform-team"
  }
}
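
In each environment, I merge the shared map with environment-specific tags. A minimal sketch, assuming common_tags is made available to the environment config (via a symlink or a tiny shared module):

```hcl
# environments/prod/main.tf (sketch)
locals {
  tags = merge(local.common_tags, {
    environment = "prod"
  })
}
```

merge() gives later maps precedence, so an environment can also deliberately override a shared tag when it needs to.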

State Management Deep Dive

State is where Terraform beginners get burned. Let me save you the pain.

The Azure Storage Backend

Never, ever store state locally. Not even for "quick tests." You will forget. You will run terraform apply from your laptop and then wonder why the pipeline is showing drift. Use Azure Blob Storage from day one:

hcl
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "tfstateproduction"
    container_name       = "tfstate"
    key                  = "prod.terraform.tfstate"
  }
}

Pro tip: Enable blob versioning on the state storage account (the bootstrap script below does this). It's saved me more than once when someone ran terraform destroy against the wrong environment.

Setting Up the State Storage Properly

Here's the bootstrap script I use for every new project. Yes, you have to create the state storage outside of Terraform. It's the one chicken-and-egg problem you just have to accept:

bash
#!/bin/bash
RESOURCE_GROUP="rg-terraform-state"
STORAGE_ACCOUNT="tfstate$(openssl rand -hex 4)"
CONTAINER="tfstate"
LOCATION="uksouth"
 
# Create resource group
az group create \
  --name $RESOURCE_GROUP \
  --location $LOCATION
 
# Create storage account with security defaults
az storage account create \
  --name $STORAGE_ACCOUNT \
  --resource-group $RESOURCE_GROUP \
  --location $LOCATION \
  --sku Standard_GRS \
  --kind StorageV2 \
  --min-tls-version TLS1_2 \
  --allow-blob-public-access false \
  --https-only true
 
# Enable blob versioning
az storage account blob-service-properties update \
  --account-name $STORAGE_ACCOUNT \
  --resource-group $RESOURCE_GROUP \
  --enable-versioning true \
  --enable-delete-retention true \
  --delete-retention-days 30
 
# Create container
az storage container create \
  --name $CONTAINER \
  --account-name $STORAGE_ACCOUNT
 
# Enable storage account lock to prevent accidental deletion
az lock create \
  --name "CanNotDelete" \
  --resource-group $RESOURCE_GROUP \
  --resource-name $STORAGE_ACCOUNT \
  --resource-type Microsoft.Storage/storageAccounts \
  --lock-type CanNotDelete
 
echo "Storage Account: $STORAGE_ACCOUNT"

Notice a few things: I use Standard_GRS (geo-redundant storage) because losing your state file is a really bad day. I enable blob versioning and soft delete with a 30-day retention. And I slap a CanNotDelete lock on the storage account because someone will try to clean up "unused" resources someday.

State Locking

Terraform uses Azure Blob Storage lease mechanisms for state locking automatically. When one person or pipeline runs terraform apply, it takes a lease on the blob. Anyone else trying to run simultaneously gets blocked. This prevents two people from writing conflicting changes to the same infrastructure.

But here's what nobody tells you: leases can get stuck. If your pipeline crashes mid-apply, the lease might not release. You'll see an error like "Error acquiring the state lock." Don't panic. You can break it:

bash
# Break a stuck lease (use with caution!)
az storage blob lease break \
  --blob-name prod.terraform.tfstate \
  --container-name tfstate \
  --account-name tfstateproduction

Only do this when you're absolutely sure no other process is running. Breaking a lease while someone else is mid-apply is a recipe for corrupted state. If the error message gives you a lock ID, prefer terraform force-unlock <lock-id>, which asks Terraform to release the lease itself.

State File Disaster Recovery

Your state file is your infrastructure's source of truth. If you lose it, Terraform thinks nothing exists. Every resource becomes unknown. Rebuilding state by hand with terraform import for hundreds of resources is a multi-day nightmare. I've done it. Once. Never again.

Here's my disaster recovery checklist:

  1. Geo-redundant storage (GRS or GZRS) for the state storage account
  2. Blob versioning enabled so you can roll back to a previous state
  3. Soft delete with 30-day retention
  4. Resource lock on the storage account
  5. Periodic state backups to a separate subscription (I use an Azure Function on a timer)
  6. State file in a dedicated resource group that nobody else touches

Provider Configuration and Version Pinning

This will save you from a class of bugs that are incredibly hard to diagnose. Always pin your provider versions.

hcl
terraform {
  required_version = ">= 1.7.0, < 2.0.0"
 
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.85.0"
    }
    azuread = {
      source  = "hashicorp/azuread"
      version = "~> 2.47.0"
    }
    random = {
      source  = "hashicorp/random"
      version = "~> 3.6.0"
    }
  }
}
 
provider "azurerm" {
  features {
    key_vault {
      purge_soft_delete_on_destroy    = false
      recover_soft_deleted_key_vaults = true
    }
    resource_group {
      prevent_deletion_if_contains_resources = true
    }
  }
 
  subscription_id = var.subscription_id
}

A few things worth calling out:

The ~> operator (pessimistic constraint) allows patch version updates but not minor or major. So ~> 3.85.0 allows 3.85.1, 3.85.2, etc., but not 3.86.0. This protects you from breaking changes while still getting bug fixes.

The features block is Azure-specific and incredibly important. Setting purge_soft_delete_on_destroy = false prevents Terraform from permanently purging Key Vaults when you destroy them. Setting prevent_deletion_if_contains_resources = true stops Terraform from deleting a resource group that still has resources in it -- a safety net that has prevented data loss more times than I can count.

Run terraform init -upgrade intentionally, not accidentally. When you want to update providers, do it deliberately, test in dev first, and commit the updated .terraform.lock.hcl file. That lock file is your guarantee that everyone on the team and every pipeline run uses the exact same provider binaries.

Variable Management That Doesn't Drive You Insane

Variables in Terraform can come from five different places, and understanding the precedence order is critical. From lowest to highest:

  1. Variable defaults in variables.tf
  2. Environment variables (TF_VAR_name)
  3. The terraform.tfvars file
  4. *.auto.tfvars files (processed in alphabetical order)
  5. -var and -var-file command-line flags (later flags win)
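
As a concrete example (sketch): the default below only applies when nothing else sets the value. A sku_tier line in terraform.tfvars overrides it, and -var="sku_tier=Premium" on the command line overrides everything.

```hcl
# variables.tf -- lowest-precedence fallback
variable "sku_tier" {
  type    = string
  default = "Standard"
}
```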

The tfvars Pattern

Each environment gets its own terraform.tfvars:

hcl
# environments/prod/terraform.tfvars
location            = "uksouth"
environment         = "prod"
subscription_id     = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
vnet_address_space  = ["10.1.0.0/16"]
sku_tier            = "Premium"
min_instance_count  = 3
enable_diagnostics  = true
hcl
# environments/dev/terraform.tfvars
location            = "uksouth"
environment         = "dev"
subscription_id     = "yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy"
vnet_address_space  = ["10.100.0.0/16"]
sku_tier            = "Standard"
min_instance_count  = 1
enable_diagnostics  = false
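
Typos in tfvars files are a common failure mode, so one addition worth making (a sketch, not shown in the files above) is a validation block, which rejects bad values at plan time instead of at apply time:

```hcl
variable "environment" {
  type        = string
  description = "Environment name (dev, staging, prod)"

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be one of: dev, staging, prod."
  }
}
```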

Secrets with Azure Key Vault

Never put secrets in tfvars files. Not even in "private" repos. Use Azure Key Vault as a data source:

hcl
data "azurerm_key_vault" "main" {
  name                = "kv-${var.project}-${var.environment}"
  resource_group_name = "rg-${var.project}-shared"
}
 
data "azurerm_key_vault_secret" "db_password" {
  name         = "database-admin-password"
  key_vault_id = data.azurerm_key_vault.main.id
}
 
resource "azurerm_mssql_server" "main" {
  name                         = "sql-${var.project}-${var.environment}"
  resource_group_name          = azurerm_resource_group.main.name
  location                     = var.location
  version                      = "12.0"
  administrator_login          = "sqladmin"
  administrator_login_password = data.azurerm_key_vault_secret.db_password.value
 
  tags = local.common_tags
}

This way, secrets live in Key Vault (where they belong), Terraform reads them at plan/apply time, and they never touch your version control. One caveat: values read through a data source are still written to the state file, which is one more reason to lock down the state storage account as hard as described above.

Module Patterns for the Real World

Modules are where Terraform goes from "neat tool" to "force multiplier." But there's a right way and a wrong way to use them.

When to Modularize

Don't create a module for everything. My rule of thumb:

  • Two or more environments deploying the same pattern? Module.
  • One environment, one deployment? Inline resources are fine.
  • Cross-team shared infrastructure pattern? Definitely a module.
  • Single resource with three attributes? Please don't. A module wrapping a single resource just adds indirection.

A Real Module Example

Here's a networking module I use across most Azure projects:

hcl
# modules/networking/variables.tf
variable "resource_group_name" {
  type        = string
  description = "Name of the resource group"
}
 
variable "location" {
  type        = string
  description = "Azure region"
}
 
variable "environment" {
  type        = string
  description = "Environment name (dev, staging, prod)"
}
 
variable "project" {
  type        = string
  description = "Project name"
}
 
variable "vnet_address_space" {
  type        = list(string)
  description = "Address space for the VNet"
}
 
variable "subnets" {
  type = map(object({
    address_prefixes                          = list(string)
    service_endpoints                         = optional(list(string), [])
    private_endpoint_network_policies_enabled = optional(bool, true)
    delegation = optional(object({
      name = string
      service_delegation = object({
        name    = string
        actions = list(string)
      })
    }), null)
  }))
  description = "Map of subnet configurations"
}
 
variable "tags" {
  type        = map(string)
  description = "Resource tags"
  default     = {}
}
hcl
# modules/networking/main.tf
resource "azurerm_virtual_network" "main" {
  name                = "vnet-${var.project}-${var.environment}"
  location            = var.location
  resource_group_name = var.resource_group_name
  address_space       = var.vnet_address_space
 
  tags = var.tags
}
 
resource "azurerm_subnet" "main" {
  for_each = var.subnets
 
  name                 = "snet-${each.key}"
  resource_group_name  = var.resource_group_name
  virtual_network_name = azurerm_virtual_network.main.name
  address_prefixes     = each.value.address_prefixes
  service_endpoints    = each.value.service_endpoints
 
  private_endpoint_network_policies_enabled = each.value.private_endpoint_network_policies_enabled
 
  dynamic "delegation" {
    for_each = each.value.delegation != null ? [each.value.delegation] : []
    content {
      name = delegation.value.name
      service_delegation {
        name    = delegation.value.service_delegation.name
        actions = delegation.value.service_delegation.actions
      }
    }
  }
}
 
resource "azurerm_network_security_group" "main" {
  for_each = var.subnets
 
  name                = "nsg-${each.key}"
  location            = var.location
  resource_group_name = var.resource_group_name
 
  tags = var.tags
}
 
resource "azurerm_subnet_network_security_group_association" "main" {
  for_each = var.subnets
 
  subnet_id                 = azurerm_subnet.main[each.key].id
  network_security_group_id = azurerm_network_security_group.main[each.key].id
}
hcl
# modules/networking/outputs.tf
output "vnet_id" {
  value = azurerm_virtual_network.main.id
}
 
output "vnet_name" {
  value = azurerm_virtual_network.main.name
}
 
output "subnet_ids" {
  value = { for k, v in azurerm_subnet.main : k => v.id }
}
 
output "nsg_ids" {
  value = { for k, v in azurerm_network_security_group.main : k => v.id }
}

And calling it from your environment:

hcl
# environments/prod/main.tf
module "networking" {
  source = "../../modules/networking"
 
  resource_group_name = azurerm_resource_group.main.name
  location            = var.location
  environment         = var.environment
  project             = var.project
  vnet_address_space  = var.vnet_address_space
 
  subnets = {
    app = {
      address_prefixes  = ["10.1.1.0/24"]
      service_endpoints = ["Microsoft.Sql", "Microsoft.Storage"]
    }
    data = {
      address_prefixes  = ["10.1.2.0/24"]
      service_endpoints = ["Microsoft.Sql"]
      private_endpoint_network_policies_enabled = false
    }
    appgw = {
      address_prefixes = ["10.1.3.0/24"]
    }
  }
 
  tags = local.common_tags
}

Versioned Modules vs. Local Modules

For a single team working on a single project, local module references (source = "../../modules/networking") are fine. But the moment you have multiple teams consuming shared modules, you need versioning.

You have two options: a private Terraform registry (Terraform Cloud, Spacelift, or a self-hosted solution) or Git-based versioning:

hcl
module "networking" {
  source = "git::https://dev.azure.com/myorg/infra-modules/_git/terraform-modules//networking?ref=v2.1.0"
  # ...
}

The ?ref=v2.1.0 pins to a Git tag. This means Team A can upgrade to v2.2.0 when they're ready while Team B stays on v2.1.0 until they've tested. No surprises.

Identity and RBAC with Terraform

Managing Azure RBAC through Terraform is one of those things that seems simple until you hit the edge cases.

Service Principal for Terraform

Your pipeline needs an identity. Use a service principal with a federated credential (no secrets to rotate) for GitHub Actions, or a service connection in Azure DevOps:

hcl
# Create a service principal for Terraform pipelines
resource "azuread_application" "terraform" {
  display_name = "sp-terraform-${var.environment}"
}
 
resource "azuread_service_principal" "terraform" {
  client_id = azuread_application.terraform.client_id
}
 
resource "azurerm_role_assignment" "terraform_contributor" {
  scope                = data.azurerm_subscription.current.id
  role_definition_name = "Contributor"
  principal_id         = azuread_service_principal.terraform.object_id
}
 
# For managing RBAC itself, you also need this:
resource "azurerm_role_assignment" "terraform_user_access_admin" {
  scope                = data.azurerm_subscription.current.id
  role_definition_name = "User Access Administrator"
  principal_id         = azuread_service_principal.terraform.object_id
}
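
For the federated credential itself (so GitHub Actions can authenticate without a client secret), the azuread provider has a dedicated resource. A sketch -- the repo and branch in subject are placeholders you'd replace with your own:

```hcl
resource "azuread_application_federated_identity_credential" "github" {
  application_object_id = azuread_application.terraform.object_id
  display_name          = "github-actions-main"
  audiences             = ["api://AzureADTokenExchange"]
  issuer                = "https://token.actions.githubusercontent.com"
  subject               = "repo:myorg/myrepo:ref:refs/heads/main"
}
```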

Important lesson learned: Contributor alone isn't enough if Terraform needs to manage role assignments. You'll also need User Access Administrator. But don't give that to the dev environment's service principal. Follow least privilege -- dev gets Contributor, prod gets Contributor plus User Access Administrator on specific resource groups only.

Managing Role Assignments

hcl
# Assign roles to your application's managed identity
resource "azurerm_role_assignment" "app_storage_reader" {
  scope                = azurerm_storage_account.main.id
  role_definition_name = "Storage Blob Data Reader"
  principal_id         = azurerm_linux_web_app.main.identity[0].principal_id
}
 
resource "azurerm_role_assignment" "app_keyvault_reader" {
  scope                = azurerm_key_vault.main.id
  role_definition_name = "Key Vault Secrets User"
  principal_id         = azurerm_linux_web_app.main.identity[0].principal_id
}

Prefer managed identities over service principals for your applications. They eliminate the credential rotation problem entirely.
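
The principal_id references above assume the web app declares a managed identity, which is a single block on the resource:

```hcl
resource "azurerm_linux_web_app" "main" {
  # ... name, resource group, service plan, etc.

  identity {
    type = "SystemAssigned"
  }
}
```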

CI/CD Integration

This is where everything comes together. You need two things: automated plan on every PR, and controlled apply on merge.

GitHub Actions

yaml
name: Terraform
 
on:
  pull_request:
    paths:
      - 'infrastructure/**'
  push:
    branches:
      - main
    paths:
      - 'infrastructure/**'
 
permissions:
  id-token: write
  contents: read
  pull-requests: write
 
jobs:
  plan:
    name: Terraform Plan
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    strategy:
      matrix:
        environment: [dev, staging, prod]
    steps:
      - uses: actions/checkout@v4
 
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.0
 
      - name: Azure Login (OIDC)
        uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
 
      - name: Terraform Init
        run: terraform init
        working-directory: infrastructure/environments/${{ matrix.environment }}
 
      - name: Terraform Validate
        run: terraform validate
        working-directory: infrastructure/environments/${{ matrix.environment }}
 
      - name: Terraform Plan
        id: plan
        run: terraform plan -no-color -out=tfplan
        working-directory: infrastructure/environments/${{ matrix.environment }}
        env:
          ARM_CLIENT_ID: ${{ secrets.AZURE_CLIENT_ID }}
          ARM_TENANT_ID: ${{ secrets.AZURE_TENANT_ID }}
          ARM_SUBSCRIPTION_ID: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
          ARM_USE_OIDC: true
 
      - name: Comment Plan on PR
        uses: actions/github-script@v7
        if: github.event_name == 'pull_request'
        with:
          script: |
            const output = `#### Terraform Plan - ${{ matrix.environment }} \`${{ steps.plan.outcome }}\`
            <details><summary>Show Plan</summary>
 
            \`\`\`
            ${{ steps.plan.outputs.stdout }}
            \`\`\`
 
            </details>`;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: output
            });
 
  apply:
    name: Terraform Apply
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    environment: production
    steps:
      - uses: actions/checkout@v4
 
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.0
 
      - name: Azure Login (OIDC)
        uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
 
      - name: Terraform Init
        run: terraform init
        working-directory: infrastructure/environments/prod
 
      - name: Terraform Apply
        run: terraform apply -auto-approve
        working-directory: infrastructure/environments/prod
        env:
          ARM_CLIENT_ID: ${{ secrets.AZURE_CLIENT_ID }}
          ARM_TENANT_ID: ${{ secrets.AZURE_TENANT_ID }}
          ARM_SUBSCRIPTION_ID: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
          ARM_USE_OIDC: true

Notice I'm using OIDC authentication (ARM_USE_OIDC: true) instead of client secrets. No secrets to rotate. No credentials stored in GitHub. Azure and GitHub handle the token exchange through federation. If you're still using ARM_CLIENT_SECRET, stop. Set up OIDC federation today.

The plan job runs against all environments in parallel using a matrix strategy. The apply job only runs on merge to main and targets production. I also use GitHub's environment: production protection rule, which requires manual approval before the apply runs.
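
One refinement I'd add to the workflow above (a sketch using GitHub Actions' built-in concurrency control): serialize runs per ref so a second apply queues behind the first instead of fighting it for the state lease.

```yaml
concurrency:
  group: terraform-${{ github.ref }}
  cancel-in-progress: false
```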

Azure DevOps Pipeline

If your team is in the Azure DevOps ecosystem, here's the equivalent:

yaml
trigger:
  branches:
    include:
      - main
  paths:
    include:
      - infrastructure/**
 
pool:
  vmImage: 'ubuntu-latest'
 
stages:
  - stage: Plan
    jobs:
      - job: TerraformPlan
        steps:
          - task: TerraformInstaller@1
            inputs:
              terraformVersion: '1.7.0'
 
          - task: TerraformTaskV4@4
            displayName: 'Terraform Init'
            inputs:
              provider: 'azurerm'
              command: 'init'
              workingDirectory: 'infrastructure/environments/prod'
              backendServiceArm: 'Azure-Terraform-SC'
              backendAzureRmResourceGroupName: 'rg-terraform-state'
              backendAzureRmStorageAccountName: 'tfstateproduction'
              backendAzureRmContainerName: 'tfstate'
              backendAzureRmKey: 'prod.terraform.tfstate'
 
          - task: TerraformTaskV4@4
            displayName: 'Terraform Plan'
            inputs:
              provider: 'azurerm'
              command: 'plan'
              workingDirectory: 'infrastructure/environments/prod'
              environmentServiceNameAzureRM: 'Azure-Terraform-SC'
 
  - stage: Apply
    dependsOn: Plan
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
    jobs:
      - deployment: TerraformApply
        environment: 'production'
        strategy:
          runOnce:
            deploy:
              steps:
                - checkout: self
 
                - task: TerraformInstaller@1
                  inputs:
                    terraformVersion: '1.7.0'
 
                - task: TerraformTaskV4@4
                  displayName: 'Terraform Init'
                  inputs:
                    provider: 'azurerm'
                    command: 'init'
                    workingDirectory: 'infrastructure/environments/prod'
                    backendServiceArm: 'Azure-Terraform-SC'
                    backendAzureRmResourceGroupName: 'rg-terraform-state'
                    backendAzureRmStorageAccountName: 'tfstateproduction'
                    backendAzureRmContainerName: 'tfstate'
                    backendAzureRmKey: 'prod.terraform.tfstate'
 
                - task: TerraformTaskV4@4
                  displayName: 'Terraform Apply'
                  inputs:
                    provider: 'azurerm'
                    command: 'apply'
                    workingDirectory: 'infrastructure/environments/prod'
                    environmentServiceNameAzureRM: 'Azure-Terraform-SC'

The deployment job type with an environment reference gives you approval gates in Azure DevOps. Set those up. Nobody should be able to push to production without at least one other person reviewing the plan output.

Importing Existing Resources

You've got a portal-created mess and you want to bring it under Terraform control? Welcome to reality. The terraform import command is your friend, but it's a tedious friend.

bash
# Import an existing resource group
terraform import azurerm_resource_group.main /subscriptions/xxxx/resourceGroups/rg-myapp-prod
 
# Import an existing App Service
terraform import azurerm_linux_web_app.main /subscriptions/xxxx/resourceGroups/rg-myapp-prod/providers/Microsoft.Web/sites/myapp-prod

The workflow goes like this:

  1. Write the Terraform resource block (even with placeholder values)
  2. Run terraform import with the Azure resource ID
  3. Run terraform plan to see the diff
  4. Adjust your HCL until the plan shows no changes
  5. Repeat for the next resource
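
Since the provider block earlier requires Terraform 1.7, you can also use the declarative import block (added in 1.5) instead of the CLI, and even have Terraform draft the HCL for you:

```hcl
# Declarative import -- lives in your config and is applied like any other change
import {
  to = azurerm_resource_group.main
  id = "/subscriptions/xxxx/resourceGroups/rg-myapp-prod"
}
```

Pair it with terraform plan -generate-config-out=generated.tf to have Terraform write a first-draft resource block, then clean it up by hand.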

For large imports, look at aztfexport -- Microsoft's tool that generates Terraform config from existing Azure resources. It's not perfect, but it gets you 80% of the way there and saves hours of manual work.

bash
# Export an entire resource group to Terraform
aztfexport resource-group rg-myapp-prod

Common Pitfalls That Will Ruin Your Week

I've hit every single one of these. Learn from my mistakes.

Provider Version Drift

Developer A runs terraform init -upgrade on their laptop, gets azurerm 3.90. Developer B still has 3.85. They both make changes. Merge conflicts in .terraform.lock.hcl ensue, and now nobody's sure which version is correct. Always commit the lock file. Always upgrade providers through a deliberate PR.

The Accidental Destroy

Someone runs terraform destroy thinking they're in dev. They're in prod. I've seen it happen to experienced engineers. Mitigations:

  • Separate directories per environment (not workspaces)
  • Use prevent_destroy lifecycle blocks on critical resources
  • Require manual approval in your pipeline before any destroy
  • Name your resources with the environment so it's obvious where you are
hcl
resource "azurerm_mssql_database" "main" {
  name     = "sqldb-${var.project}-${var.environment}"
  # ...
 
  lifecycle {
    prevent_destroy = true
  }
}

State Conflicts During Team Development

Two engineers working on the same environment at the same time will hit state lock conflicts. This is by design -- it's protecting you. But it's annoying. The fix is cultural: coordinate who's working on what, use short-lived branches, and consider splitting large environments into smaller state files (e.g., separate networking from compute).
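
Splitting state works because one configuration can read another's outputs. A sketch, assuming the networking state lives under its own key and exposes the subnet_ids output from the module earlier:

```hcl
data "terraform_remote_state" "networking" {
  backend = "azurerm"

  config = {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "tfstateproduction"
    container_name       = "tfstate"
    key                  = "prod.networking.tfstate"
  }
}

# e.g. data.terraform_remote_state.networking.outputs.subnet_ids["app"]
```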

Sensitive Data in Plan Output

terraform plan will happily print your database passwords in the CI/CD logs if you're not careful. Mark sensitive variables and outputs:

hcl
variable "db_password" {
  type      = string
  sensitive = true
}
 
output "connection_string" {
  value     = azurerm_mssql_server.main.fully_qualified_domain_name
  sensitive = true
}

Testing Your Terraform

"It worked in dev" is not a test strategy. Here are three levels of testing I use.

Level 1: Validate and Lint

The bare minimum. Run these in every PR:

bash
terraform fmt -check -recursive
terraform validate
tflint --init && tflint

tflint catches things that validate misses, like using an invalid Azure VM size or referencing a region that doesn't exist.
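
tflint's Azure rules live in a plugin that you enable via a .tflint.hcl file (the version below is illustrative -- pin whatever is current):

```hcl
# .tflint.hcl
plugin "azurerm" {
  enabled = true
  version = "0.25.1"
  source  = "github.com/terraform-linters/tflint-ruleset-azurerm"
}
```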

Level 2: Plan Analysis

Parse the plan output programmatically. Check that no unexpected destroys are happening, that costs are within budget, and that naming conventions are followed. Tools like OPA (Open Policy Agent) with conftest are excellent here:

bash
# Convert plan to JSON
terraform plan -out=tfplan
terraform show -json tfplan > plan.json
 
# Run policy checks
conftest test plan.json -p policies/

Example policy (Rego):

rego
# policies/tags.rego
package main
 
deny[msg] {
  resource := input.resource_changes[_]
  resource.change.actions[_] == "create"
  not resource.change.after.tags.managed_by
  msg := sprintf("Resource %s is missing the 'managed_by' tag", [resource.address])
}

Level 3: Integration Testing with Terratest

For critical shared modules, write actual integration tests that deploy real infrastructure, validate it, and tear it down:

go
package test
 
import (
    "testing"
    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/gruntwork-io/terratest/modules/azure"
    "github.com/stretchr/testify/assert"
)
 
func TestNetworkingModule(t *testing.T) {
    t.Parallel()
 
    terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
        TerraformDir: "../modules/networking",
        Vars: map[string]interface{}{
            "resource_group_name": "rg-test-networking",
            "location":           "uksouth",
            "environment":        "test",
            "project":            "terratest",
            "vnet_address_space":  []string{"10.200.0.0/16"},
            "subnets": map[string]interface{}{
                "app": map[string]interface{}{
                    "address_prefixes": []string{"10.200.1.0/24"},
                },
            },
        },
    })
 
    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)
 
    vnetName := terraform.Output(t, terraformOptions, "vnet_name")
    assert.Equal(t, "vnet-terratest-test", vnetName)
 
    // Verify the VNet actually exists in Azure
    exists := azure.VirtualNetworkExists(t, vnetName, "rg-test-networking", "")
    assert.True(t, exists)
}

Yes, this spins up real Azure resources and costs real money. Run it in CI on a schedule (nightly, not on every PR) or gate it behind a label. The cost is trivial compared to the cost of a broken module hitting production.

Real-World Lessons from Production

After six years of managing Azure infrastructure with Terraform and more than 50 migrations under my belt, here are the lessons that don't show up in documentation.

Start with networking. On every migration or greenfield project, nail down the network architecture first: VNet address spaces, subnet sizing, peering topology, DNS. Getting this wrong means re-IPing later, and re-IPing in production is painful.

Use consistent naming conventions religiously. I follow Microsoft's Cloud Adoption Framework naming: {resource-type}-{project}-{environment}-{region}-{instance}. When you're staring at 200 resources in the portal at 2am during an incident, clear names save minutes that matter.
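One way to enforce that pattern is to build the name suffix once in a `locals` block and reuse it everywhere. This is a sketch, not the one true layout -- `var.location_short` and the `001` instance number are illustrative assumptions:

hcl
locals {
  # e.g. produces "payments-prod-uks-001"
  # location_short is an assumed variable mapping "uksouth" -> "uks"
  name_suffix = "${var.project}-${var.environment}-${var.location_short}-001"
}

resource "azurerm_virtual_network" "main" {
  name = "vnet-${local.name_suffix}"
  # ...
}

resource "azurerm_storage_account" "main" {
  # storage accounts disallow hyphens, so strip them for this type
  name = replace("st${local.name_suffix}", "-", "")
  # ...
}

Centralizing the suffix means a naming change is one edit, not two hundred.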

Don't manage everything in Terraform. Some things are better left to Azure Policy, especially compliance and governance guardrails. Terraform manages your application infrastructure. Azure Policy ensures nobody creates a VM in a non-approved region or a storage account without encryption. They complement each other.

Keep your state files small. If a single terraform plan takes more than 30 seconds, your state file is too big. Split it. Networking in one state, compute in another, data tier in a third. Use terraform_remote_state data sources or output values to share information between them.
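As a sketch of the cross-state wiring, the compute stack can read the networking stack's outputs through a `terraform_remote_state` data source. The backend config values here are hypothetical, and this assumes the networking stack exposes an `app_subnet_id` output:

hcl
data "terraform_remote_state" "networking" {
  backend = "azurerm"
  config = {
    resource_group_name  = "rg-terraform-state"   # hypothetical names
    storage_account_name = "stterraformstate"
    container_name       = "tfstate"
    key                  = "prod/networking.tfstate"
  }
}

resource "azurerm_network_interface" "app" {
  name                = "nic-app"
  location            = var.location
  resource_group_name = var.resource_group_name

  ip_configuration {
    name                          = "internal"
    subnet_id                     = data.terraform_remote_state.networking.outputs.app_subnet_id
    private_ip_address_allocation = "Dynamic"
  }
}

The compute stack never touches networking's resources directly; it only reads published outputs, which keeps the blast radius of each plan small.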

Document your modules. Not in a wiki that nobody reads. In the variables.tf file with description fields and in a README.md at the module root with usage examples. Future you will be grateful.

Have a break-glass procedure. Sometimes Terraform state gets corrupted, or a provider bug causes a resource to be destroyed and recreated when it shouldn't be. Have a documented procedure for: pulling state, making manual fixes with terraform state rm or terraform state mv, and pushing state back. Practice it before you need it.

Summary

Terraform on Azure isn't hard. But doing it well -- in a way that scales across teams, environments, and years of maintenance -- requires deliberate structure and learned lessons that only come from production experience.

The patterns in this post aren't theoretical. They're what I use every day managing real Azure infrastructure for production workloads. Start with the basics: remote state, pinned providers, separate environment directories. Then layer in modules, CI/CD pipelines, and testing as your infrastructure grows.

The single most important takeaway? Treat your infrastructure code with the same rigor as your application code. Code review, automated testing, version control, CI/CD pipelines. If you wouldn't deploy application code by SSH-ing into a server and editing files, don't deploy infrastructure by running Terraform from your laptop.

Your infrastructure deserves better than portal clicking and prayer. Give it Terraform, give it structure, and give yourself the ability to sleep through the night.
