Infrastructure as Code (IaC) and Automation

“You don’t scale AI with spreadsheets. You scale it with code.”


Why IaC is essential for AI workloads

AI environments are fundamentally different from traditional application stacks.

They are:

  • Complex. GPU compute, high-throughput networking, storage tiers, and fine-grained identity.

  • Costly. Every GPU minute matters.

  • Dynamic. Experiments, models, and scaling patterns change constantly.

Manual provisioning does not scale in this reality. It is slow, error-prone, and impossible to reproduce reliably.

That is why Infrastructure as Code (IaC) is not optional for AI. It is the foundation.


Direct benefits

  • Click-ops and ad-hoc scripts → versioned, declarative infrastructure

  • Configuration drift → idempotent, repeatable deployments

  • Slow experimentation → environments created in minutes

  • No audit trail → reviewed, traceable changes

IaC turns an AI environment into something reproducible, auditable, and secure.


IaC fundamentals for AI

  • Infrastructure as Code → infrastructure defined declaratively, not built by hand

  • Idempotency → same code, same result, every time

  • Reusability → modules reused across teams and projects

  • Auditability → Git history, reviews, and approvals

Core tools

  • Terraform. Multi-cloud, with a strong module ecosystem.

  • Bicep. Azure-native, clean syntax, ARM-integrated.

  • Azure CLI. Fast iteration and glue automation.

  • GitHub Actions. CI/CD pipelines for infrastructure.


Common components of an AI environment

  • Networking. VNets, subnets, NSGs, private endpoints

  • Compute. GPU VMs, AKS with GPU node pools

  • Storage. Blob, Data Lake, ephemeral NVMe

  • Identity. Managed Identity, RBAC, Key Vault

  • Observability. Log Analytics, metrics, alerts

  • AI services. Azure ML, Azure OpenAI, Front Door, Purview


Example 1. Creating a GPU VM with Bicep

A virtual machine cannot exist in isolation. It needs networking, disks, and authentication.

This example is intentionally minimal but complete.

What this example includes

  • VNet and subnet

  • Network Security Group allowing SSH

  • Public IP and NIC

  • Ubuntu 22.04 GPU-capable VM

  • SSH key authentication


main.bicep
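A sketch of what main.bicep could look like. Resource names, address ranges, API versions, and the Standard_NC6s_v3 SKU are illustrative assumptions; substitute a GPU SKU your subscription actually has quota for.

```bicep
@description('SSH public key for the admin user')
param sshPublicKey string
param adminUsername string = 'azureuser'
param location string = resourceGroup().location
// GPU SKU is an assumption; choose one your subscription has quota for
param vmSize string = 'Standard_NC6s_v3'

resource vnet 'Microsoft.Network/virtualNetworks@2023-09-01' = {
  name: 'ai-vnet'
  location: location
  properties: {
    addressSpace: { addressPrefixes: ['10.0.0.0/16'] }
    subnets: [
      { name: 'gpu-subnet', properties: { addressPrefix: '10.0.1.0/24' } }
    ]
  }
}

resource nsg 'Microsoft.Network/networkSecurityGroups@2023-09-01' = {
  name: 'gpu-vm-nsg'
  location: location
  properties: {
    securityRules: [
      {
        name: 'allow-ssh'
        properties: {
          priority: 1000
          direction: 'Inbound'
          access: 'Allow'
          protocol: 'Tcp'
          sourcePortRange: '*'
          destinationPortRange: '22'
          sourceAddressPrefix: '*' // tighten to your own IP range in practice
          destinationAddressPrefix: '*'
        }
      }
    ]
  }
}

resource pip 'Microsoft.Network/publicIPAddresses@2023-09-01' = {
  name: 'gpu-vm-pip'
  location: location
  sku: { name: 'Standard' }
  properties: { publicIPAllocationMethod: 'Static' }
}

resource nic 'Microsoft.Network/networkInterfaces@2023-09-01' = {
  name: 'gpu-vm-nic'
  location: location
  properties: {
    networkSecurityGroup: { id: nsg.id }
    ipConfigurations: [
      {
        name: 'ipconfig1'
        properties: {
          subnet: { id: vnet.properties.subnets[0].id }
          publicIPAddress: { id: pip.id }
          privateIPAllocationMethod: 'Dynamic'
        }
      }
    ]
  }
}

resource vm 'Microsoft.Compute/virtualMachines@2023-09-01' = {
  name: 'gpu-vm'
  location: location
  properties: {
    hardwareProfile: { vmSize: vmSize }
    osProfile: {
      computerName: 'gpu-vm'
      adminUsername: adminUsername
      linuxConfiguration: {
        disablePasswordAuthentication: true
        ssh: {
          publicKeys: [
            {
              path: '/home/${adminUsername}/.ssh/authorized_keys'
              keyData: sshPublicKey
            }
          ]
        }
      }
    }
    storageProfile: {
      imageReference: {
        publisher: 'Canonical'
        offer: '0001-com-ubuntu-server-jammy'
        sku: '22_04-lts-gen2'
        version: 'latest'
      }
      osDisk: { createOption: 'FromImage' }
    }
    networkProfile: {
      networkInterfaces: [{ id: nic.id }]
    }
  }
}
```

Note that the NSG is attached at the NIC for brevity; attaching it at the subnet is equally valid and often preferred.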


Deploy
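One way to deploy the template with the Azure CLI; the resource group name and region are placeholders.

```shell
# Create a resource group (name and region are examples)
az group create --name rg-ai-gpu --location eastus

# Deploy the Bicep template, passing your SSH public key as a parameter
az deployment group create \
  --resource-group rg-ai-gpu \
  --template-file main.bicep \
  --parameters sshPublicKey="$(cat ~/.ssh/id_rsa.pub)"
```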

After deployment:

  • Connect via SSH

  • Install NVIDIA drivers

  • Validate with nvidia-smi
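A minimal validation sequence on the VM itself, assuming Ubuntu's recommended-driver tooling; the pinned driver version is only an example.

```shell
# On the VM: install NVIDIA drivers using Ubuntu's recommended driver
sudo apt-get update
sudo ubuntu-drivers install   # or pin one, e.g. sudo apt-get install -y nvidia-driver-535
sudo reboot

# After reconnecting, confirm the GPU is visible to the driver
nvidia-smi
```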


Example 2. AKS cluster with GPU node pool (Terraform)

Terraform is ideal for composable, multi-environment platforms, especially when AKS is the control plane.
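A sketch of an AKS cluster with a dedicated GPU node pool using the azurerm provider. Names, regions, node counts, and the GPU SKU are illustrative assumptions.

```terraform
provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "ai" {
  name     = "rg-ai-aks"
  location = "eastus"
}

resource "azurerm_kubernetes_cluster" "ai" {
  name                = "aks-ai"
  location            = azurerm_resource_group.ai.location
  resource_group_name = azurerm_resource_group.ai.name
  dns_prefix          = "aksai"

  # System node pool for cluster services; CPU-only
  default_node_pool {
    name       = "system"
    node_count = 2
    vm_size    = "Standard_D4s_v5"
  }

  identity {
    type = "SystemAssigned"
  }
}

# Separate GPU pool so expensive nodes scale independently
resource "azurerm_kubernetes_cluster_node_pool" "gpu" {
  name                  = "gpu"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.ai.id
  vm_size               = "Standard_NC6s_v3" # GPU SKU; check regional quota
  node_count            = 1

  # Taint keeps non-GPU workloads off GPU nodes
  node_taints = ["sku=gpu:NoSchedule"]
}
```

Keeping the GPU pool as its own resource lets you resize or replace it without touching the system pool.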

💡 Inline resources are great for learning. Use modules in production.


Automating IaC with GitHub Actions
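A minimal workflow sketch that deploys the Bicep template on pushes to main. Secret names, the infra/ path, and the OIDC federated-credential setup for azure/login are assumptions you would adapt to your repository.

```yaml
name: deploy-infra

on:
  push:
    branches: [main]
    paths: ['infra/**']

permissions:
  id-token: write   # required for OIDC login to Azure
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production   # gated by environment approvals
    steps:
      - uses: actions/checkout@v4

      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

      - name: Deploy Bicep
        run: |
          az deployment group create \
            --resource-group rg-ai-gpu \
            --template-file infra/main.bicep \
            --parameters sshPublicKey="${{ secrets.SSH_PUBLIC_KEY }}"
```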

💡 Combine with protected branches, reviewers, and environment approvals.


As your AI platform grows:

  • Separate modules for network, compute, storage, and observability

  • Parameterize region, SKU, and scale limits

  • Automate inference rollouts: new model → storage update → AKS rollout → endpoint refresh


Pro insight

“If you can destroy and recreate your entire AI environment safely, you control it.”


Security and governance with IaC

  • Managed Identity instead of secrets

  • Key Vault injected via policy

  • Private networking by default

  • Azure Policy to enforce SKU, region, and tagging


Hands-on recap

Validate:

  • SSH access

  • GPU visibility

  • Cost and quota alignment
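These checks can be scripted; the hostname and region below are placeholders.

```shell
# Confirm SSH access and GPU visibility in one step
ssh azureuser@<public-ip> 'nvidia-smi'

# Check remaining regional vCPU quota for the GPU family before scaling up
az vm list-usage --location eastus --output table | grep -i "NC"
```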


Advanced curiosity

You can estimate the average request size in tokens from TPM (Tokens per Minute) and QPS (Queries per Second): tokens per request ≈ TPM / (QPS × 60). Knowing this budget is critical to avoid both throttling and over-provisioning.
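The arithmetic is simple enough to sketch; the function name and the example numbers are illustrative, not from any Azure API.

```python
def avg_tokens_per_request(tpm: float, qps: float) -> float:
    """Average tokens per request implied by a TPM quota and an observed QPS.

    tpm: tokens per minute (quota or measured throughput)
    qps: queries per second
    """
    requests_per_minute = qps * 60
    return tpm / requests_per_minute

# Example: a 120,000 TPM quota served at 4 QPS leaves a budget of
# 120,000 / (4 * 60) = 500 tokens per request (prompt + completion).
budget = avg_tokens_per_request(120_000, 4)
print(round(budget))  # 500
```

If your real prompts plus completions average more than this budget, you will hit TPM throttling before you hit your QPS target.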

👉 See Chapter 8 for deep dives on TPM, RPM, PTUs, and performance modeling.


References

  • Azure Bicep documentation https://learn.microsoft.com/azure/azure-resource-manager/bicep/

  • Terraform on Azure https://learn.microsoft.com/azure/developer/terraform/

  • GitHub Actions for Azure https://learn.microsoft.com/azure/developer/github/
