Terraform on Azure: Production-Ready Best Practices

You've been clicking around the Azure portal for months. Maybe years. You've got resource groups named "test-rg-final-v2" and a storage account somebody created at 2am that nobody wants to delete because something depends on it. Sound familiar?
I've been managing Azure infrastructure for over six years now and I've done 50+ cloud migrations. The single biggest force multiplier I've found isn't a fancy service mesh or a Kubernetes operator. It's Terraform. Properly structured, version-controlled, pipeline-automated Terraform.
This isn't the "here's how to run terraform init" post. This is the post I wish someone had handed me before I learned these lessons the hard way across dozens of production environments. We're going deep on project structure, state management, module patterns, networking, identity, CI/CD, testing, and the real-world pitfalls that will bite you at 3am if you don't plan for them.
Let's get into it.
Table of Contents
- Why Terraform Over Bicep or ARM Templates?
- Project Structure That Actually Scales
- State Management Deep Dive
- The Azure Storage Backend
- Setting Up the State Storage Properly
- State Locking
- State File Disaster Recovery
- Provider Configuration and Version Pinning
- Variable Management That Doesn't Drive You Insane
- The tfvars Pattern
- Secrets with Azure Key Vault
- Module Patterns for the Real World
- When to Modularize
- A Real Module Example
- Versioned Modules vs. Local Modules
- Identity and RBAC with Terraform
- Service Principal for Terraform
- Managing Role Assignments
- CI/CD Integration
- GitHub Actions
- Azure DevOps Pipeline
- Importing Existing Resources
- Common Pitfalls That Will Ruin Your Week
- Provider Version Drift
- The Accidental Destroy
- State Conflicts During Team Development
- Sensitive Data in Plan Output
- Testing Your Terraform
- Level 1: Validate and Lint
- Level 2: Plan Analysis
- Level 3: Integration Testing with Terratest
- Real-World Lessons from Production
- Summary
Why Terraform Over Bicep or ARM Templates?
I get this question constantly. Azure has Bicep. It's first-party. It's got great VS Code tooling. Why would you choose Terraform?
Here's my honest answer: it depends on your reality, not Azure's marketing.
Multi-cloud is real. Not in the "we're running identical workloads on three clouds" fantasy way. In the "we have Azure for production, AWS for a data pipeline a contractor built, and Cloudflare for DNS and CDN" way. Terraform handles all of those with one language and one workflow. Bicep handles exactly one of them.
Team familiarity matters more than you think. If your team already knows HCL from managing AWS or GCP resources, forcing them to learn Bicep for Azure creates cognitive overhead that slows deployments. I've watched teams spend weeks learning a new DSL when they could have been shipping infrastructure.
The ecosystem is massive. The Terraform Registry has thousands of community modules and providers. Need to manage your Datadog monitors alongside your Azure resources? PagerDuty escalation policies? GitHub repository settings? Terraform has providers for all of it. Bicep will never have that breadth.
State management gives you power. This is controversial, but hear me out. Bicep fans love to point out that Bicep is stateless. But Terraform's state file is actually a feature. It gives you terraform plan -- a genuine preview of exactly what will change before you touch anything. It gives you drift detection. It gives you import capabilities. The state file is a responsibility, yes. But it's a superpower when managed correctly.
That said, if you're an all-Azure shop with a team that's already deep in the Microsoft ecosystem and you have zero multi-cloud ambitions, Bicep is a perfectly valid choice. I'm not religious about tools. I'm religious about shipping reliable infrastructure.
Project Structure That Actually Scales
Most Terraform tutorials show you a flat directory with main.tf, variables.tf, and outputs.tf. That works for a tutorial. It falls apart the moment you have dev, staging, and production environments with different configurations.
Here's the structure I use after years of iteration:
infrastructure/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── terraform.tfvars
│   │   ├── backend.tf
│   │   └── providers.tf
│   ├── staging/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── terraform.tfvars
│   │   ├── backend.tf
│   │   └── providers.tf
│   └── prod/
│       ├── main.tf
│       ├── variables.tf
│       ├── outputs.tf
│       ├── terraform.tfvars
│       ├── backend.tf
│       └── providers.tf
├── modules/
│   ├── networking/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── compute/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── storage/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── identity/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   └── monitoring/
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
└── shared/
    ├── locals.tf
    └── tags.tf
Why separate directories per environment instead of workspaces? Because Terraform workspaces share backend configuration. That means your dev state file lives in the same storage account as your prod state file with only the key name differentiating them. I've seen engineers accidentally run commands against the wrong workspace and nuke production resources. Separate directories with separate backend configs make that physically impossible.
Why backend.tf and providers.tf in each environment? Because each environment might target a different subscription, use different credentials, or need different provider versions during a migration. Explicit is better than implicit when your production environment is on the line.
The shared/ directory holds common locals and tag definitions that get referenced by environment configs. Tags are especially important -- I define them once and reference everywhere:
# shared/tags.tf
locals {
  common_tags = {
    managed_by  = "terraform"
    project     = "myproject"
    cost_center = "engineering"
    owner       = "platform-team"
  }
}

State Management Deep Dive
State is where Terraform beginners get burned. Let me save you the pain.
The Azure Storage Backend
Never, ever store state locally. Not even for "quick tests." You will forget. You will run terraform apply from your laptop and then wonder why the pipeline is showing drift. Use Azure Blob Storage from day one:
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "tfstateproduction"
    container_name       = "tfstate"
    key                  = "prod.terraform.tfstate"
  }
}

Pro tip: Enable versioning on the container. It's saved me more than once when someone ran terraform destroy on the wrong workspace.
Setting Up the State Storage Properly
Here's the bootstrap script I use for every new project. Yes, you have to create the state storage outside of Terraform. It's the one chicken-and-egg problem you just have to accept:
#!/bin/bash
RESOURCE_GROUP="rg-terraform-state"
STORAGE_ACCOUNT="tfstate$(openssl rand -hex 4)"
CONTAINER="tfstate"
LOCATION="uksouth"
# Create resource group
az group create \
--name $RESOURCE_GROUP \
--location $LOCATION
# Create storage account with security defaults
az storage account create \
--name $STORAGE_ACCOUNT \
--resource-group $RESOURCE_GROUP \
--location $LOCATION \
--sku Standard_GRS \
--kind StorageV2 \
--min-tls-version TLS1_2 \
--allow-blob-public-access false \
--https-only true
# Enable blob versioning
az storage account blob-service-properties update \
--account-name $STORAGE_ACCOUNT \
--resource-group $RESOURCE_GROUP \
--enable-versioning true \
--enable-delete-retention true \
--delete-retention-days 30
# Create container
az storage container create \
--name $CONTAINER \
--account-name $STORAGE_ACCOUNT
# Enable storage account lock to prevent accidental deletion
az lock create \
--name "CanNotDelete" \
--resource-group $RESOURCE_GROUP \
--resource-name $STORAGE_ACCOUNT \
--resource-type Microsoft.Storage/storageAccounts \
--lock-type CanNotDelete
echo "Storage Account: $STORAGE_ACCOUNT"

Notice a few things: I use Standard_GRS (geo-redundant storage) because losing your state file is a really bad day. I enable blob versioning and soft delete with a 30-day retention. And I slap a CanNotDelete lock on the storage account because someone will try to clean up "unused" resources someday.
State Locking
Terraform uses Azure Blob Storage lease mechanisms for state locking automatically. When one person or pipeline runs terraform apply, it takes a lease on the blob. Anyone else trying to run simultaneously gets blocked. This prevents two people from writing conflicting changes to the same infrastructure.
But here's what nobody tells you: leases can get stuck. If your pipeline crashes mid-apply, the lease might not release. You'll see an error like "Error acquiring the state lock." Don't panic. You can break it:
# Break a stuck lease (use with caution!)
az storage blob lease break \
--blob-name prod.terraform.tfstate \
--container-name tfstate \
--account-name tfstateproduction

Only do this when you're absolutely sure no other process is running. Breaking a lease while someone else is mid-apply is a recipe for corrupted state.
State File Disaster Recovery
Your state file is your infrastructure's source of truth. If you lose it, Terraform thinks nothing exists. Every resource becomes unknown. Rebuilding state by hand with terraform import for hundreds of resources is a multi-day nightmare. I've done it. Once. Never again.
Here's my disaster recovery checklist:
- Geo-redundant storage (GRS or GZRS) for the state storage account
- Blob versioning enabled so you can roll back to a previous state
- Soft delete with 30-day retention
- Resource lock on the storage account
- Periodic state backups to a separate subscription (I use an Azure Function on a timer)
- State file in a dedicated resource group that nobody else touches
Provider Configuration and Version Pinning
This will save you from a class of bugs that are incredibly hard to diagnose. Always pin your provider versions.
terraform {
  required_version = ">= 1.7.0, < 2.0.0"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.85.0"
    }
    azuread = {
      source  = "hashicorp/azuread"
      version = "~> 2.47.0"
    }
    random = {
      source  = "hashicorp/random"
      version = "~> 3.6.0"
    }
  }
}

provider "azurerm" {
  features {
    key_vault {
      purge_soft_delete_on_destroy    = false
      recover_soft_deleted_key_vaults = true
    }
    resource_group {
      prevent_deletion_if_contains_resources = true
    }
  }

  subscription_id = var.subscription_id
}

A few things worth calling out:
The ~> operator (pessimistic constraint) allows patch version updates but not minor or major. So ~> 3.85.0 allows 3.85.1, 3.85.2, etc., but not 3.86.0. This protects you from breaking changes while still getting bug fixes.
The features block is Azure-specific and incredibly important. Setting purge_soft_delete_on_destroy = false prevents Terraform from permanently purging Key Vaults when you destroy them. Setting prevent_deletion_if_contains_resources = true stops Terraform from deleting a resource group that still has resources in it -- a safety net that has prevented data loss more times than I can count.
Run terraform init -upgrade intentionally, not accidentally. When you want to update providers, do it deliberately, test in dev first, and commit the updated .terraform.lock.hcl file. That lock file is your guarantee that everyone on the team and every pipeline run uses the exact same provider binaries.
Variable Management That Doesn't Drive You Insane
Variables in Terraform can come from five different places, and understanding the precedence order is critical. From lowest to highest precedence:
- Variable defaults in variables.tf
- Environment variables (TF_VAR_name)
- The terraform.tfvars file
- *.auto.tfvars files (processed in lexical order)
- -var and -var-file command line flags
Later sources win: a -var flag overrides anything set in a tfvars file or the environment.
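To make that concrete, here's a sketch using a hypothetical sku_tier variable (the variable name and values are illustrative):

```hcl
# variables.tf - the default is the weakest source
variable "sku_tier" {
  type    = string
  default = "Standard"
}

# export TF_VAR_sku_tier="Basic"            beats the default
# terraform.tfvars: sku_tier = "Premium"    beats the environment variable
# terraform plan -var='sku_tier="PremiumV2"' beats everything else
```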
The tfvars Pattern
Each environment gets its own terraform.tfvars:
# environments/prod/terraform.tfvars
location           = "uksouth"
environment        = "prod"
subscription_id    = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
vnet_address_space = ["10.1.0.0/16"]
sku_tier           = "Premium"
min_instance_count = 3
enable_diagnostics = true

# environments/dev/terraform.tfvars
location           = "uksouth"
environment        = "dev"
subscription_id    = "yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy"
vnet_address_space = ["10.100.0.0/16"]
sku_tier           = "Standard"
min_instance_count = 1
enable_diagnostics = false

Secrets with Azure Key Vault
Never put secrets in tfvars files. Not even in "private" repos. Use Azure Key Vault as a data source:
data "azurerm_key_vault" "main" {
  name                = "kv-${var.project}-${var.environment}"
  resource_group_name = "rg-${var.project}-shared"
}

data "azurerm_key_vault_secret" "db_password" {
  name         = "database-admin-password"
  key_vault_id = data.azurerm_key_vault.main.id
}

resource "azurerm_mssql_server" "main" {
  name                         = "sql-${var.project}-${var.environment}"
  resource_group_name          = azurerm_resource_group.main.name
  location                     = var.location
  version                      = "12.0"
  administrator_login          = "sqladmin"
  administrator_login_password = data.azurerm_key_vault_secret.db_password.value
  tags                         = local.common_tags
}

This way, secrets live in Key Vault (where they belong), Terraform reads them at plan/apply time, and they never touch your version control.
Module Patterns for the Real World
Modules are where Terraform goes from "neat tool" to "force multiplier." But there's a right way and a wrong way to use them.
When to Modularize
Don't create a module for everything. My rule of thumb:
- Two or more environments deploying the same pattern? Module.
- One environment, one deployment? Inline resources are fine.
- Cross-team shared infrastructure pattern? Definitely a module.
- Single resource with three attributes? Please don't. A module wrapping a single resource just adds indirection.
A Real Module Example
Here's a networking module I use across most Azure projects:
# modules/networking/variables.tf
variable "resource_group_name" {
  type        = string
  description = "Name of the resource group"
}

variable "location" {
  type        = string
  description = "Azure region"
}

variable "environment" {
  type        = string
  description = "Environment name (dev, staging, prod)"
}

variable "project" {
  type        = string
  description = "Project name"
}

variable "vnet_address_space" {
  type        = list(string)
  description = "Address space for the VNet"
}

variable "subnets" {
  type = map(object({
    address_prefixes                          = list(string)
    service_endpoints                         = optional(list(string), [])
    private_endpoint_network_policies_enabled = optional(bool, true)
    delegation = optional(object({
      name = string
      service_delegation = object({
        name    = string
        actions = list(string)
      })
    }), null)
  }))
  description = "Map of subnet configurations"
}

variable "tags" {
  type        = map(string)
  description = "Resource tags"
  default     = {}
}

# modules/networking/main.tf
resource "azurerm_virtual_network" "main" {
  name                = "vnet-${var.project}-${var.environment}"
  location            = var.location
  resource_group_name = var.resource_group_name
  address_space       = var.vnet_address_space
  tags                = var.tags
}

resource "azurerm_subnet" "main" {
  for_each = var.subnets

  name                 = "snet-${each.key}"
  resource_group_name  = var.resource_group_name
  virtual_network_name = azurerm_virtual_network.main.name
  address_prefixes     = each.value.address_prefixes
  service_endpoints    = each.value.service_endpoints

  private_endpoint_network_policies_enabled = each.value.private_endpoint_network_policies_enabled

  dynamic "delegation" {
    for_each = each.value.delegation != null ? [each.value.delegation] : []
    content {
      name = delegation.value.name
      service_delegation {
        name    = delegation.value.service_delegation.name
        actions = delegation.value.service_delegation.actions
      }
    }
  }
}

resource "azurerm_network_security_group" "main" {
  for_each = var.subnets

  name                = "nsg-${each.key}"
  location            = var.location
  resource_group_name = var.resource_group_name
  tags                = var.tags
}

resource "azurerm_subnet_network_security_group_association" "main" {
  for_each = var.subnets

  subnet_id                 = azurerm_subnet.main[each.key].id
  network_security_group_id = azurerm_network_security_group.main[each.key].id
}

# modules/networking/outputs.tf
output "vnet_id" {
  value = azurerm_virtual_network.main.id
}

output "vnet_name" {
  value = azurerm_virtual_network.main.name
}

output "subnet_ids" {
  value = { for k, v in azurerm_subnet.main : k => v.id }
}

output "nsg_ids" {
  value = { for k, v in azurerm_network_security_group.main : k => v.id }
}

And calling it from your environment:
# environments/prod/main.tf
module "networking" {
  source = "../../modules/networking"

  resource_group_name = azurerm_resource_group.main.name
  location            = var.location
  environment         = var.environment
  project             = var.project
  vnet_address_space  = var.vnet_address_space

  subnets = {
    app = {
      address_prefixes  = ["10.1.1.0/24"]
      service_endpoints = ["Microsoft.Sql", "Microsoft.Storage"]
    }
    data = {
      address_prefixes                          = ["10.1.2.0/24"]
      service_endpoints                         = ["Microsoft.Sql"]
      private_endpoint_network_policies_enabled = false
    }
    appgw = {
      address_prefixes = ["10.1.3.0/24"]
    }
  }

  tags = local.common_tags
}

Versioned Modules vs. Local Modules
For a single team working on a single project, local module references (source = "../../modules/networking") are fine. But the moment you have multiple teams consuming shared modules, you need versioning.
You have two options: a private Terraform registry (Terraform Cloud, Spacelift, or a self-hosted solution) or Git-based versioning:
module "networking" {
  source = "git::https://dev.azure.com/myorg/infra-modules/_git/terraform-modules//networking?ref=v2.1.0"
  # ...
}

The ?ref=v2.1.0 pins to a Git tag. This means Team A can upgrade to v2.2.0 when they're ready while Team B stays on v2.1.0 until they've tested. No surprises.
Identity and RBAC with Terraform
Managing Azure RBAC through Terraform is one of those things that seems simple until you hit the edge cases.
Service Principal for Terraform
Your pipeline needs an identity. Use a service principal with a federated credential (no secrets to rotate) for GitHub Actions, or a service connection in Azure DevOps:
# Create a service principal for Terraform pipelines
resource "azuread_application" "terraform" {
  display_name = "sp-terraform-${var.environment}"
}

resource "azuread_service_principal" "terraform" {
  client_id = azuread_application.terraform.client_id
}

resource "azurerm_role_assignment" "terraform_contributor" {
  scope                = data.azurerm_subscription.current.id
  role_definition_name = "Contributor"
  principal_id         = azuread_service_principal.terraform.object_id
}

# For managing RBAC itself, you also need this:
resource "azurerm_role_assignment" "terraform_user_access_admin" {
  scope                = data.azurerm_subscription.current.id
  role_definition_name = "User Access Administrator"
  principal_id         = azuread_service_principal.terraform.object_id
}

Important lesson learned: Contributor alone isn't enough if Terraform needs to manage role assignments. You'll also need User Access Administrator. But don't give that to the dev environment's service principal. Follow least privilege -- dev gets Contributor, prod gets Contributor plus User Access Administrator on specific resource groups only.
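What that scoping looks like in practice, with a hypothetical prod resource group (the resource names here are illustrative, not from a real project):

```hcl
# Prod pipeline identity: User Access Administrator only where it's needed,
# scoped to a single resource group rather than the whole subscription
resource "azurerm_role_assignment" "terraform_prod_uaa" {
  scope                = azurerm_resource_group.prod_app.id
  role_definition_name = "User Access Administrator"
  principal_id         = azuread_service_principal.terraform.object_id
}
```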
Managing Role Assignments
# Assign roles to your application's managed identity
resource "azurerm_role_assignment" "app_storage_reader" {
  scope                = azurerm_storage_account.main.id
  role_definition_name = "Storage Blob Data Reader"
  principal_id         = azurerm_linux_web_app.main.identity[0].principal_id
}

resource "azurerm_role_assignment" "app_keyvault_reader" {
  scope                = azurerm_key_vault.main.id
  role_definition_name = "Key Vault Secrets User"
  principal_id         = azurerm_linux_web_app.main.identity[0].principal_id
}

Prefer managed identities over service principals for your applications. They eliminate the credential rotation problem entirely.
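The identity[0].principal_id references above assume the web app actually has a managed identity enabled. A minimal sketch (the surrounding attributes are illustrative):

```hcl
resource "azurerm_linux_web_app" "main" {
  name                = "app-${var.project}-${var.environment}"
  resource_group_name = azurerm_resource_group.main.name
  location            = var.location
  service_plan_id     = azurerm_service_plan.main.id

  # System-assigned managed identity: Azure issues and rotates the
  # credential, so there is nothing to store, rotate, or leak
  identity {
    type = "SystemAssigned"
  }

  site_config {}
}
```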
CI/CD Integration
This is where everything comes together. You need two things: automated plan on every PR, and controlled apply on merge.
GitHub Actions
name: Terraform

on:
  pull_request:
    paths:
      - 'infrastructure/**'
  push:
    branches:
      - main
    paths:
      - 'infrastructure/**'

permissions:
  id-token: write
  contents: read
  pull-requests: write

jobs:
  plan:
    name: Terraform Plan
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    strategy:
      matrix:
        environment: [dev, staging, prod]
    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.0

      - name: Azure Login (OIDC)
        uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

      - name: Terraform Init
        run: terraform init
        working-directory: infrastructure/environments/${{ matrix.environment }}

      - name: Terraform Validate
        run: terraform validate
        working-directory: infrastructure/environments/${{ matrix.environment }}

      - name: Terraform Plan
        id: plan
        run: terraform plan -no-color -out=tfplan
        working-directory: infrastructure/environments/${{ matrix.environment }}
        env:
          ARM_CLIENT_ID: ${{ secrets.AZURE_CLIENT_ID }}
          ARM_TENANT_ID: ${{ secrets.AZURE_TENANT_ID }}
          ARM_SUBSCRIPTION_ID: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
          ARM_USE_OIDC: true

      - name: Comment Plan on PR
        uses: actions/github-script@v7
        if: github.event_name == 'pull_request'
        with:
          script: |
            const output = `#### Terraform Plan - ${{ matrix.environment }} \`${{ steps.plan.outcome }}\`
            <details><summary>Show Plan</summary>

            \`\`\`
            ${{ steps.plan.outputs.stdout }}
            \`\`\`

            </details>`;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: output
            });

  apply:
    name: Terraform Apply
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    environment: production
    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.0

      - name: Azure Login (OIDC)
        uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

      - name: Terraform Init
        run: terraform init
        working-directory: infrastructure/environments/prod

      - name: Terraform Apply
        run: terraform apply -auto-approve
        working-directory: infrastructure/environments/prod
        env:
          ARM_CLIENT_ID: ${{ secrets.AZURE_CLIENT_ID }}
          ARM_TENANT_ID: ${{ secrets.AZURE_TENANT_ID }}
          ARM_SUBSCRIPTION_ID: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
          ARM_USE_OIDC: true

Notice I'm using OIDC authentication (ARM_USE_OIDC: true) instead of client secrets. No secrets to rotate. No credentials stored in GitHub. Azure and GitHub handle the token exchange through federation. If you're still using ARM_CLIENT_SECRET, stop. Set up OIDC federation today.
The plan job runs against all environments in parallel using a matrix strategy. The apply job only runs on merge to main and targets production. I also use GitHub's environment: production protection rule, which requires manual approval before the apply runs.
Azure DevOps Pipeline
If your team is in the Azure DevOps ecosystem, here's the equivalent:
trigger:
  branches:
    include:
      - main
  paths:
    include:
      - infrastructure/**

pool:
  vmImage: 'ubuntu-latest'

stages:
  - stage: Plan
    jobs:
      - job: TerraformPlan
        steps:
          - task: TerraformInstaller@1
            inputs:
              terraformVersion: '1.7.0'
          - task: TerraformTaskV4@4
            displayName: 'Terraform Init'
            inputs:
              provider: 'azurerm'
              command: 'init'
              workingDirectory: 'infrastructure/environments/prod'
              backendServiceArm: 'Azure-Terraform-SC'
              backendAzureRmResourceGroupName: 'rg-terraform-state'
              backendAzureRmStorageAccountName: 'tfstateproduction'
              backendAzureRmContainerName: 'tfstate'
              backendAzureRmKey: 'prod.terraform.tfstate'
          - task: TerraformTaskV4@4
            displayName: 'Terraform Plan'
            inputs:
              provider: 'azurerm'
              command: 'plan'
              workingDirectory: 'infrastructure/environments/prod'
              environmentServiceNameAzureRM: 'Azure-Terraform-SC'

  - stage: Apply
    dependsOn: Plan
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
    jobs:
      - deployment: TerraformApply
        environment: 'production'
        strategy:
          runOnce:
            deploy:
              steps:
                - checkout: self
                - task: TerraformInstaller@1
                  inputs:
                    terraformVersion: '1.7.0'
                - task: TerraformTaskV4@4
                  displayName: 'Terraform Init'
                  inputs:
                    provider: 'azurerm'
                    command: 'init'
                    workingDirectory: 'infrastructure/environments/prod'
                    backendServiceArm: 'Azure-Terraform-SC'
                    backendAzureRmResourceGroupName: 'rg-terraform-state'
                    backendAzureRmStorageAccountName: 'tfstateproduction'
                    backendAzureRmContainerName: 'tfstate'
                    backendAzureRmKey: 'prod.terraform.tfstate'
                - task: TerraformTaskV4@4
                  displayName: 'Terraform Apply'
                  inputs:
                    provider: 'azurerm'
                    command: 'apply'
                    workingDirectory: 'infrastructure/environments/prod'
                    environmentServiceNameAzureRM: 'Azure-Terraform-SC'

The deployment job type with an environment reference gives you approval gates in Azure DevOps. Set those up. Nobody should be able to push to production without at least one other person reviewing the plan output.
Importing Existing Resources
You've got a portal-created mess and you want to bring it under Terraform control? Welcome to reality. The terraform import command is your friend, but it's a tedious friend.
# Import an existing resource group
terraform import azurerm_resource_group.main /subscriptions/xxxx/resourceGroups/rg-myapp-prod
# Import an existing App Service
terraform import azurerm_linux_web_app.main /subscriptions/xxxx/resourceGroups/rg-myapp-prod/providers/Microsoft.Web/sites/myapp-prod

The workflow goes like this:
- Write the Terraform resource block (even with placeholder values)
- Run terraform import with the Azure resource ID
- Run terraform plan to see the diff
- Adjust your HCL until the plan shows no changes
- Repeat for the next resource
For large imports, look at aztfexport -- Microsoft's tool that generates Terraform config from existing Azure resources. It's not perfect, but it gets you 80% of the way there and saves hours of manual work.
# Export an entire resource group to Terraform
aztfexport resource-group rg-myapp-prod

Common Pitfalls That Will Ruin Your Week
I've hit every single one of these. Learn from my mistakes.
Provider Version Drift
Developer A runs terraform init -upgrade on their laptop, gets azurerm 3.90. Developer B still has 3.85. They both make changes. Merge conflicts in .terraform.lock.hcl ensue, and now nobody's sure which version is correct. Always commit the lock file. Always upgrade providers through a deliberate PR.
The Accidental Destroy
Someone runs terraform destroy thinking they're in dev. They're in prod. I've seen it happen to experienced engineers. Mitigations:
- Separate directories per environment (not workspaces)
- Use prevent_destroy lifecycle blocks on critical resources
- Require manual approval in your pipeline before any destroy
- Name your resources with the environment so it's obvious where you are
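You can also gate the pipeline itself: refuse to apply any plan that contains a destroy unless a human signs off. A crude sketch -- the sample plan.json below is fabricated for illustration, and a real gate should parse the terraform show -json output properly (jq, or the OPA approach later in this post) rather than string-matching:

```shell
# Fabricated stand-in for `terraform show -json tfplan > plan.json`
cat > plan.json <<'EOF'
{"resource_changes": [
  {"address": "azurerm_mssql_database.main", "change": {"actions": ["delete"]}}
]}
EOF

# Block the run if any planned action is a destroy
if grep -q '"delete"' plan.json; then
  echo "Plan contains a destroy - manual review required"
fi
```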
resource "azurerm_mssql_database" "main" {
  name = "sqldb-${var.project}-${var.environment}"
  # ...

  lifecycle {
    prevent_destroy = true
  }
}

State Conflicts During Team Development
Two engineers working on the same environment at the same time will hit state lock conflicts. This is by design -- it's protecting you. But it's annoying. The fix is cultural: coordinate who's working on what, use short-lived branches, and consider splitting large environments into smaller state files (e.g., separate networking from compute).
Sensitive Data in Plan Output
terraform plan will happily print your database passwords in the CI/CD logs if you're not careful. Mark sensitive variables and outputs:
variable "db_password" {
  type      = string
  sensitive = true
}

output "connection_string" {
  value     = azurerm_mssql_server.main.fully_qualified_domain_name
  sensitive = true
}

Testing Your Terraform
"It worked in dev" is not a test strategy. Here are three levels of testing I use.
Level 1: Validate and Lint
The bare minimum. Run these in every PR:
terraform fmt -check -recursive
terraform validate
tflint --init && tflint

tflint catches things that validate misses, like using an invalid Azure VM size or referencing a region that doesn't exist.
Level 2: Plan Analysis
Parse the plan output programmatically. Check that no unexpected destroys are happening, that costs are within budget, and that naming conventions are followed. Tools like OPA (Open Policy Agent) with conftest are excellent here:
# Convert plan to JSON
terraform plan -out=tfplan
terraform show -json tfplan > plan.json
# Run policy checks
conftest test plan.json -p policies/

Example policy (Rego):
# policies/tags.rego
package main

deny[msg] {
  resource := input.resource_changes[_]
  resource.change.actions[_] == "create"
  not resource.change.after.tags.managed_by
  msg := sprintf("Resource %s is missing the 'managed_by' tag", [resource.address])
}

Level 3: Integration Testing with Terratest
For critical shared modules, write actual integration tests that deploy real infrastructure, validate it, and tear it down:
package test

import (
    "testing"

    "github.com/gruntwork-io/terratest/modules/azure"
    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestNetworkingModule(t *testing.T) {
    t.Parallel()

    terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
        TerraformDir: "../modules/networking",
        Vars: map[string]interface{}{
            "resource_group_name": "rg-test-networking",
            "location":            "uksouth",
            "environment":         "test",
            "project":             "terratest",
            "vnet_address_space":  []string{"10.200.0.0/16"},
            "subnets": map[string]interface{}{
                "app": map[string]interface{}{
                    "address_prefixes": []string{"10.200.1.0/24"},
                },
            },
        },
    })

    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)

    vnetName := terraform.Output(t, terraformOptions, "vnet_name")
    assert.Equal(t, "vnet-terratest-test", vnetName)

    // Verify the VNet actually exists in Azure
    exists := azure.VirtualNetworkExists(t, vnetName, "rg-test-networking", "")
    assert.True(t, exists)
}

Yes, this spins up real Azure resources and costs real money. Run it in CI on a schedule (nightly, not on every PR) or gate it behind a label. The cost is trivial compared to the cost of a broken module hitting production.
Real-World Lessons from Production
After six years of managing Azure infrastructure with Terraform and more than 50 migrations under my belt, here are the lessons that don't show up in documentation.
Start with networking. Every single migration or greenfield project, nail down the network architecture first. VNet address spaces, subnet sizing, peering topology, DNS. Getting this wrong means re-IPing later, and re-IPing in production is painful.
Use consistent naming conventions religiously. I follow Microsoft's Cloud Adoption Framework naming: {resource-type}-{project}-{environment}-{region}-{instance}. When you're staring at 200 resources in the portal at 2am during an incident, clear names save minutes that matter.
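One way to keep the convention honest is to build names in one place and interpolate everywhere. A sketch -- the variables and the instance number are assumptions:

```hcl
locals {
  # {project}-{environment}-{region}-{instance}, per the CAF pattern
  name_suffix = "${var.project}-${var.environment}-${var.location}-001"
}

resource "azurerm_virtual_network" "main" {
  name = "vnet-${local.name_suffix}"
  # ...
}

resource "azurerm_storage_account" "main" {
  # Storage account names disallow hyphens, so strip them for this type
  name = replace("st${local.name_suffix}", "-", "")
  # ...
}
```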
Don't manage everything in Terraform. Some things are better left to Azure Policy, especially compliance and governance guardrails. Terraform manages your application infrastructure. Azure Policy ensures nobody creates a VM in a non-approved region or a storage account without encryption. They complement each other.
Keep your state files small. If a single terraform plan takes more than 30 seconds, your state file is too big. Split it. Networking in one state, compute in another, data tier in a third. Use terraform_remote_state data sources or output values to share information between them.
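Wiring split states together looks like this -- a sketch in which the storage account and state key names are illustrative:

```hcl
# In the compute configuration: read outputs published by the networking state
data "terraform_remote_state" "networking" {
  backend = "azurerm"

  config = {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "tfstateproduction"
    container_name       = "tfstate"
    key                  = "prod-networking.terraform.tfstate"
  }
}

resource "azurerm_linux_web_app" "main" {
  # ...
  virtual_network_subnet_id = data.terraform_remote_state.networking.outputs.subnet_ids["app"]
}
```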
Document your modules. Not in a wiki that nobody reads. In the variables.tf file with description fields and in a README.md at the module root with usage examples. Future you will be grateful.
Have a break-glass procedure. Sometimes Terraform state gets corrupted, or a provider bug causes a resource to be destroyed and recreated when it shouldn't be. Have a documented procedure for: pulling state, making manual fixes with terraform state rm or terraform state mv, and pushing state back. Practice it before you need it.
Summary
Terraform on Azure isn't hard. But doing it well -- in a way that scales across teams, environments, and years of maintenance -- requires deliberate structure and learned lessons that only come from production experience.
The patterns in this post aren't theoretical. They're what I use every day managing real Azure infrastructure for production workloads. Start with the basics: remote state, pinned providers, separate environment directories. Then layer in modules, CI/CD pipelines, and testing as your infrastructure grows.
The single most important takeaway? Treat your infrastructure code with the same rigor as your application code. Code review, automated testing, version control, CI/CD pipelines. If you wouldn't deploy application code by SSH-ing into a server and editing files, don't deploy infrastructure by running Terraform from your laptop.
Your infrastructure deserves better than portal clicking and prayer. Give it Terraform, give it structure, and give yourself the ability to sleep through the night.