“You don’t scale AI with spreadsheets. You scale it with code.”
Why IaC is essential for AI workloads
AI environments are fundamentally different from traditional application stacks.
They are:
Complex. GPU compute, high-throughput networking, storage tiers, and fine-grained identity.
Costly. Every GPU minute matters.
Dynamic. Experiments, models, and scaling patterns change constantly.
Manual provisioning does not scale in this reality.
It is slow, error-prone, and impossible to reproduce reliably.
That is why Infrastructure as Code (IaC) is not optional for AI. It is the foundation.
Direct benefits
IaC replaces click-ops and ad-hoc scripts with:
Versioned, declarative infrastructure
Idempotent, repeatable deployments
Environments created in minutes
Reviewed, traceable changes
IaC turns an AI environment into something reproducible, auditable, and secure.
IaC fundamentals for AI
Each concept plays a specific role in AI environments:
Declarative. Infrastructure defined declaratively.
Idempotent. Same code, same result, every time.
Modular. Modules reused across teams and projects.
Versioned. Git history, reviews, and approvals.
✅ Terraform. Multi-cloud. Strong module ecosystem.
✅ Bicep. Azure-native. Clean syntax. ARM-integrated.
✅ Azure CLI. Fast iteration and glue automation.
✅ GitHub Actions. CI/CD pipelines for infrastructure.
Common components of an AI environment
Networking. VNets, subnets, NSGs, private endpoints
Compute. GPU VMs, AKS with GPU node pools
Storage. Blob, Data Lake, ephemeral NVMe
Identity. Managed Identity, RBAC, Key Vault
Observability. Log Analytics, metrics, alerts
AI services. Azure ML, Azure OpenAI, Front Door, Purview
Example 1. Creating a GPU VM with Bicep
A virtual machine cannot exist in isolation.
It needs networking, disks, and authentication.
This example is intentionally minimal but complete.
What this example includes
Network Security Group allowing SSH
Ubuntu 22.04 GPU-capable VM
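A minimal sketch of such a template follows. Resource names, address ranges, and the GPU SKU are illustrative assumptions; pick a GPU size for which you have quota in your region.

```bicep
// Illustrative sketch: names, CIDR ranges, and SKU are placeholders.
param location string = resourceGroup().location
param adminUsername string = 'azureuser'
@secure()
param sshPublicKey string

// NSG allowing inbound SSH
resource nsg 'Microsoft.Network/networkSecurityGroups@2023-09-01' = {
  name: 'gpu-vm-nsg'
  location: location
  properties: {
    securityRules: [
      {
        name: 'allow-ssh'
        properties: {
          priority: 1000
          direction: 'Inbound'
          access: 'Allow'
          protocol: 'Tcp'
          sourcePortRange: '*'
          destinationPortRange: '22'
          sourceAddressPrefix: '*'
          destinationAddressPrefix: '*'
        }
      }
    ]
  }
}

// Minimal network: one VNet, one subnet, NSG attached
resource vnet 'Microsoft.Network/virtualNetworks@2023-09-01' = {
  name: 'gpu-vnet'
  location: location
  properties: {
    addressSpace: { addressPrefixes: ['10.0.0.0/16'] }
    subnets: [
      {
        name: 'gpu-subnet'
        properties: {
          addressPrefix: '10.0.0.0/24'
          networkSecurityGroup: { id: nsg.id }
        }
      }
    ]
  }
}

resource pip 'Microsoft.Network/publicIPAddresses@2023-09-01' = {
  name: 'gpu-vm-pip'
  location: location
  sku: { name: 'Standard' }
  properties: { publicIPAllocationMethod: 'Static' }
}

resource nic 'Microsoft.Network/networkInterfaces@2023-09-01' = {
  name: 'gpu-vm-nic'
  location: location
  properties: {
    ipConfigurations: [
      {
        name: 'ipconfig1'
        properties: {
          subnet: { id: vnet.properties.subnets[0].id }
          publicIPAddress: { id: pip.id }
          privateIPAllocationMethod: 'Dynamic'
        }
      }
    ]
  }
}

// Ubuntu 22.04 LTS on a GPU-capable size, SSH-key auth only
resource vm 'Microsoft.Compute/virtualMachines@2024-03-01' = {
  name: 'gpu-vm'
  location: location
  properties: {
    hardwareProfile: { vmSize: 'Standard_NC4as_T4_v3' } // example GPU SKU
    osProfile: {
      computerName: 'gpu-vm'
      adminUsername: adminUsername
      linuxConfiguration: {
        disablePasswordAuthentication: true
        ssh: {
          publicKeys: [
            {
              path: '/home/${adminUsername}/.ssh/authorized_keys'
              keyData: sshPublicKey
            }
          ]
        }
      }
    }
    storageProfile: {
      imageReference: {
        publisher: 'Canonical'
        offer: '0001-com-ubuntu-server-jammy'
        sku: '22_04-lts-gen2'
        version: 'latest'
      }
      osDisk: {
        createOption: 'FromImage'
        managedDisk: { storageAccountType: 'Premium_LRS' }
      }
    }
    networkProfile: {
      networkInterfaces: [{ id: nic.id }]
    }
  }
}
```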
After deployment, retrieve the VM's public IP, connect over SSH, and confirm the GPU is visible to the driver.
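One way to verify is with the Azure CLI; resource group, VM name, and user below are placeholders:

```shell
# Get the VM's public IP (resource group and VM name are placeholders)
az vm show -d -g ai-rg -n gpu-vm --query publicIps -o tsv

# Connect and confirm the driver sees the GPU
ssh azureuser@<public-ip> nvidia-smi
```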
Terraform is ideal for composable, multi-environment platforms, especially when AKS is the control plane.
💡 Inline resources are great for learning. Use modules in production.
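As a sketch of what that composition looks like in Terraform, the module paths, variables, and outputs below are hypothetical:

```hcl
# Hypothetical module layout; paths, variables, and outputs are illustrative.
module "network" {
  source   = "./modules/network"
  location = var.location
}

module "aks_gpu" {
  source        = "./modules/aks"
  location      = var.location
  subnet_id     = module.network.training_subnet_id
  gpu_node_sku  = "Standard_NC4as_T4_v3"
  min_gpu_nodes = 0
  max_gpu_nodes = 4
}
```

The same modules can then be instantiated per environment (dev, staging, prod) with different variable files.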
Automating IaC with GitHub Actions
💡 Combine with protected branches, reviewers, and environment approvals.
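A minimal workflow sketch, assuming OIDC federated credentials and a Bicep template at a hypothetical `infra/main.bicep` path:

```yaml
# Illustrative workflow; secret names, paths, and resource group are placeholders.
name: deploy-infra
on:
  push:
    branches: [main]
    paths: ['infra/**']

permissions:
  id-token: write   # OIDC login, no stored cloud credentials
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production   # gate with required reviewers
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - name: Deploy Bicep
        run: |
          az deployment group create \
            --resource-group ai-rg \
            --template-file infra/main.bicep
```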
Recommended patterns
Separate modules for network, compute, storage, and observability
Parameterize region, SKU, and scale limits
Automate inference rollouts
New model → Storage update → AKS rollout → Endpoint refresh
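That rollout chain can be scripted; the storage account, container, and deployment names below are placeholders:

```shell
# Illustrative rollout script; names are placeholders.
# 1. Publish the new model artifacts
az storage blob upload-batch \
  --account-name aimodels --destination models/v2 --source ./model-v2

# 2. Point the inference deployment at the new version (triggers a rolling update)
kubectl set env deployment/inference MODEL_VERSION=v2
kubectl rollout status deployment/inference
```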
“If you can destroy and recreate your entire AI environment safely, you control it.”
Security and governance with IaC
Managed Identity instead of secrets
Key Vault injected via policy
Private networking by default
Azure Policy to enforce SKU, region, and tagging
Validate these controls continuously: preview changes before applying them, and check policy compliance after deployment.
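One hedged way to do both with the Azure CLI; the resource group and template path are placeholders:

```shell
# Preview what a deployment would change, without applying it
az deployment group what-if \
  --resource-group ai-rg \
  --template-file infra/main.bicep

# List resources that violate assigned Azure Policy definitions
az policy state list --resource-group ai-rg \
  --filter "complianceState eq 'NonCompliant'" -o table
```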
Advanced curiosity
You can estimate the average request size from TPM (tokens per minute) and QPS (queries per second): average tokens per request ≈ TPM / (60 × QPS). Getting this estimate right is critical to avoid both throttling and over-provisioning.
👉 See Chapter 8 for deep dives on TPM, RPM, PTUs, and performance modeling.
Azure Bicep documentation
https://learn.microsoft.com/azure/azure-resource-manager/bicep/
Terraform on Azure
https://learn.microsoft.com/azure/developer/terraform/
GitHub Actions for Azure
https://learn.microsoft.com/azure/developer/github/