Infrastructure as Code (IaC) and Automation
“You don’t scale AI with spreadsheets. You scale it with code.”
Why IaC is essential for AI workloads
AI environments are fundamentally different from traditional application stacks.
They are:
Complex. GPU compute, high-throughput networking, storage tiers, and fine-grained identity.
Costly. Every GPU minute matters.
Dynamic. Experiments, models, and scaling patterns change constantly.
Manual provisioning does not scale in this reality. It is slow, error-prone, and impossible to reproduce reliably.
That is why Infrastructure as Code (IaC) is not optional for AI. It is the foundation.
Direct benefits
Click-ops and ad-hoc scripts → Versioned, declarative infrastructure
Configuration drift → Idempotent, repeatable deployments
Slow experimentation → Environments created in minutes
No audit trail → Reviewed, traceable changes
IaC turns an AI environment into something reproducible, auditable, and secure.
IaC fundamentals for AI
Infrastructure as Code. Infrastructure defined declaratively.
Idempotency. Same code, same result, every time.
Reusability. Modules reused across teams and projects.
Auditability. Git history, reviews, and approvals.
Core tools
✅ Terraform. Multi-cloud. Strong module ecosystem.
✅ Bicep. Azure-native. Clean syntax. ARM-integrated.
✅ Azure CLI. Fast iteration and glue automation.
✅ GitHub Actions. CI/CD pipelines for infrastructure.
Common components of an AI environment
Networking. VNets, subnets, NSGs, private endpoints
Compute. GPU VMs, AKS with GPU node pools
Storage. Blob, Data Lake, ephemeral NVMe
Identity. Managed Identity, RBAC, Key Vault
Observability. Log Analytics, metrics, alerts
AI services. Azure ML, Azure OpenAI, Front Door, Purview
Example 1. Creating a GPU VM with Bicep
A virtual machine cannot exist in isolation. It needs networking, disks, and authentication.
This example is intentionally minimal but complete.
What this example includes
VNet and subnet
Network Security Group allowing SSH
Public IP and NIC
Ubuntu 22.04 GPU-capable VM
SSH key authentication
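A main.bicep covering those pieces can be sketched as follows. This is a minimal, hedged sketch rather than a production template: resource names, API versions, and the GPU SKU (Standard_NC4as_T4_v3) are illustrative and should be adjusted to your region, quota, and security requirements.

```bicep
@description('SSH public key for the admin user')
param adminPublicKey string
param location string = resourceGroup().location

resource vnet 'Microsoft.Network/virtualNetworks@2023-09-01' = {
  name: 'ai-vnet'
  location: location
  properties: {
    addressSpace: { addressPrefixes: ['10.0.0.0/16'] }
    subnets: [
      {
        name: 'gpu-subnet'
        properties: { addressPrefix: '10.0.1.0/24' }
      }
    ]
  }
}

resource nsg 'Microsoft.Network/networkSecurityGroups@2023-09-01' = {
  name: 'gpu-nsg'
  location: location
  properties: {
    securityRules: [
      {
        name: 'allow-ssh'
        properties: {
          priority: 1000
          direction: 'Inbound'
          access: 'Allow'
          protocol: 'Tcp'
          sourcePortRange: '*'
          destinationPortRange: '22'
          sourceAddressPrefix: '*' // lock this down to your IP range in real use
          destinationAddressPrefix: '*'
        }
      }
    ]
  }
}

resource pip 'Microsoft.Network/publicIPAddresses@2023-09-01' = {
  name: 'gpu-pip'
  location: location
  sku: { name: 'Standard' }
  properties: { publicIPAllocationMethod: 'Static' }
}

resource nic 'Microsoft.Network/networkInterfaces@2023-09-01' = {
  name: 'gpu-nic'
  location: location
  properties: {
    ipConfigurations: [
      {
        name: 'ipconfig1'
        properties: {
          subnet: { id: vnet.properties.subnets[0].id }
          publicIPAddress: { id: pip.id }
        }
      }
    ]
    networkSecurityGroup: { id: nsg.id }
  }
}

resource vm 'Microsoft.Compute/virtualMachines@2023-09-01' = {
  name: 'gpu-vm'
  location: location
  properties: {
    // GPU-capable SKU; pick one your subscription has quota for
    hardwareProfile: { vmSize: 'Standard_NC4as_T4_v3' }
    osProfile: {
      computerName: 'gpu-vm'
      adminUsername: 'azureuser'
      linuxConfiguration: {
        disablePasswordAuthentication: true
        ssh: {
          publicKeys: [
            {
              path: '/home/azureuser/.ssh/authorized_keys'
              keyData: adminPublicKey
            }
          ]
        }
      }
    }
    storageProfile: {
      imageReference: {
        publisher: 'Canonical'
        offer: '0001-com-ubuntu-server-jammy'
        sku: '22_04-lts-gen2'
        version: 'latest'
      }
      osDisk: {
        createOption: 'FromImage'
        managedDisk: { storageAccountType: 'Premium_LRS' }
      }
    }
    networkProfile: {
      networkInterfaces: [{ id: nic.id }]
    }
  }
}
```

Deploy it with `az deployment group create --resource-group <rg> --template-file main.bicep --parameters adminPublicKey="$(cat ~/.ssh/id_ed25519.pub)"`.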
After deployment:
Connect via SSH
Install NVIDIA drivers
Validate with nvidia-smi
Example 2. AKS cluster with GPU node pool (Terraform)
Terraform is ideal for composable, multi-environment platforms, especially when AKS is the control plane.
💡 Inline resources are great for learning. Use modules in production.
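A cluster with a dedicated GPU node pool can be sketched in Terraform roughly as below. Names, regions, node counts, and SKUs are illustrative assumptions; the taint on the GPU pool keeps non-GPU workloads off the expensive nodes.

```hcl
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
  }
}

provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "ai" {
  name     = "rg-ai-platform"
  location = "eastus"
}

resource "azurerm_kubernetes_cluster" "ai" {
  name                = "aks-ai"
  location            = azurerm_resource_group.ai.location
  resource_group_name = azurerm_resource_group.ai.name
  dns_prefix          = "aksai"

  # CPU-only system pool for cluster services
  default_node_pool {
    name       = "system"
    node_count = 2
    vm_size    = "Standard_D4s_v5"
  }

  # Managed Identity instead of service-principal secrets
  identity {
    type = "SystemAssigned"
  }
}

resource "azurerm_kubernetes_cluster_node_pool" "gpu" {
  name                  = "gpu"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.ai.id
  vm_size               = "Standard_NC4as_T4_v3" # adjust to your GPU quota
  node_count            = 1

  # Only workloads that tolerate this taint land on GPU nodes
  node_taints = ["sku=gpu:NoSchedule"]
  node_labels = { "accelerator" = "nvidia" }
}
```

Splitting the GPU pool into its own resource lets you scale (or destroy) it independently of the cluster, which matters when every GPU minute is billed.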
Automating IaC with GitHub Actions
💡 Combine with protected branches, reviewers, and environment approvals.
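A pipeline along these lines plans on every pull request and applies only on main. It is a sketch under assumptions: the `infra/` path, secret names, and the use of OIDC federation via azure/login are illustrative, not prescribed by this chapter.

```yaml
name: infra

on:
  pull_request:
    paths: ['infra/**']
  push:
    branches: [main]
    paths: ['infra/**']

permissions:
  id-token: write   # OIDC federation to Azure, no long-lived credentials
  contents: read

jobs:
  terraform:
    runs-on: ubuntu-latest
    environment: production   # gates apply behind reviewers and approvals
    defaults:
      run:
        working-directory: infra
    steps:
      - uses: actions/checkout@v4

      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}           # illustrative secret names
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

      - uses: hashicorp/setup-terraform@v3

      - run: terraform init
      - run: terraform plan -out=tfplan

      # Apply only from main; PRs stop at the plan for review
      - if: github.ref == 'refs/heads/main'
        run: terraform apply -auto-approve tfplan
```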
Recommended patterns
Separate modules for network, compute, storage, and observability
Parameterize region, SKU, and scale limits
Automate inference rollouts: new model → storage update → AKS rollout → endpoint refresh
Pro insight
“If you can destroy and recreate your entire AI environment safely, you control it.”
Security and governance with IaC
Managed Identity instead of secrets
Key Vault injected via policy
Private networking by default
Azure Policy to enforce SKU, region, and tagging
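As one hedged illustration of policy-as-code, a subscription-scope Bicep file can assign the built-in "Allowed locations" policy. The definition GUID shown is the well-known built-in ID; verify it against your tenant, and treat the assignment name and location list as placeholders.

```bicep
targetScope = 'subscription'

// Assign the built-in "Allowed locations" policy definition
// (GUID is the documented built-in ID; confirm in your tenant)
resource allowedLocations 'Microsoft.Authorization/policyAssignments@2022-06-01' = {
  name: 'allowed-locations-ai'
  properties: {
    policyDefinitionId: subscriptionResourceId(
      'Microsoft.Authorization/policyDefinitions',
      'e56962a6-4747-49cd-b67b-bf8b01975c4c'
    )
    parameters: {
      listOfAllowedLocations: {
        value: ['eastus', 'swedencentral'] // regions with your GPU quota
      }
    }
  }
}
```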
Hands-on recap
Validate:
SSH access
GPU visibility
Cost and quota alignment
Advanced curiosity
You can estimate average request size using TPM (Tokens per Minute) and QPS (Queries per Second). This becomes critical to avoid throttling and over-provisioning.
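The arithmetic behind that estimate is simple: requests per minute is QPS × 60, so average tokens per request is TPM divided by that figure, and the same relation inverted gives the highest QPS a TPM quota can sustain. A small helper (function names are my own, not from this chapter):

```python
def avg_tokens_per_request(tpm: float, qps: float) -> float:
    """Average tokens consumed per request, from observed TPM and QPS."""
    requests_per_minute = qps * 60
    return tpm / requests_per_minute

def max_sustainable_qps(tpm_quota: float, avg_tokens: float) -> float:
    """Highest QPS that stays under a TPM quota for a given request size."""
    return tpm_quota / (avg_tokens * 60)

# 600k tokens/min observed at 50 queries/sec → 200 tokens per request
print(avg_tokens_per_request(600_000, 50))   # 200.0

# A 600k TPM quota with 1,000-token requests sustains at most 10 QPS
print(max_sustainable_qps(600_000, 1_000))   # 10.0
```

Running this kind of check before sizing endpoints is what keeps you clear of both throttling (quota too small) and over-provisioning (quota far above real demand).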
👉 See Chapter 8 for deep dives on TPM, RPM, PTUs, and performance modeling.
References
Azure Bicep documentation https://learn.microsoft.com/azure/azure-resource-manager/bicep/
Terraform on Azure https://learn.microsoft.com/azure/developer/terraform/
GitHub Actions for Azure https://learn.microsoft.com/azure/developer/github/