Creating an AKS Cluster with GPU using Terraform
Objective
Provision an Azure Kubernetes Service (AKS) cluster with a dedicated GPU node pool for AI and inference workloads.
This lab shows how infrastructure engineers can:
Provision AKS using Terraform
Add a GPU-enabled user node pool
Prepare the cluster so GPUs are actually usable by workloads
Region: West US 3
GPU SKU: Standard_NCas_T4_v3 series, for example Standard_NC4as_T4_v3 (cost‑efficient inference GPU)
Lab scope and expectations
This is an infrastructure enablement lab.
It covers:
AKS cluster provisioning
CPU system node pool + GPU user node pool
GPU readiness using the NVIDIA device plugin
Basic validation with a GPU test pod
It does not cover:
Model training or fine-tuning
Advanced MLOps pipelines
Production hardening (private AKS, firewall, policy-as-code)
Prerequisites
Terraform CLI (v1.5+)
Azure CLI
kubectl
Azure subscription with AKS + GPU quota in West US 3
RBAC: Owner or Contributor on the subscription
Login first:
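The login commands are not shown here, so the following is a minimal sketch rather than the lab's original script. It checks that the three CLIs from the prerequisites are on PATH and then signs in with `az login`. The `missing_tools` helper is a hypothetical name, and exporting `ARM_SUBSCRIPTION_ID` assumes that Terraform's azurerm provider should read the subscription from the environment.

```shell
#!/usr/bin/env bash

# Print each named tool that is not on PATH, one per line.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

missing="$(missing_tools terraform az kubectl)"
if [ -n "$missing" ]; then
  echo "Install before continuing: $(echo $missing)" >&2
else
  az login
  # Assumption: the azurerm provider picks the subscription up from
  # ARM_SUBSCRIPTION_ID, so export the one the CLI is currently using.
  ARM_SUBSCRIPTION_ID="$(az account show --query id --output tsv)"
  export ARM_SUBSCRIPTION_ID
fi
```

If the account has several subscriptions, `az account set --subscription <id>` selects the one to use before the export.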
⚠️ Cost warning
GPU node pools are expensive and bill while nodes exist.
This lab uses 1 × Standard_NCas_T4_v3 GPU node.
Destroy resources when finished.
Folder structure
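The original layout is not shown, so this sketch scaffolds one possible structure: a lab directory with `providers.tf`, `main.tf`, and `outputs.tf`. Every name in it is an assumption made for illustration (the directory `aks-gpu-lab`, the resource group and cluster names, the `Standard_D2s_v5` system pool size, and `Standard_NC4as_T4_v3` as the concrete T4 size). The azurerm 4.x pin also assumes the subscription ID is supplied through `ARM_SUBSCRIPTION_ID`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical lab directory; override by exporting LAB_DIR first.
LAB_DIR="${LAB_DIR:-aks-gpu-lab}"
mkdir -p "$LAB_DIR"

# providers.tf: pin Terraform and the azurerm provider.
cat > "$LAB_DIR/providers.tf" <<'EOF'
terraform {
  required_version = ">= 1.5.0"
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 4.0"
    }
  }
}

provider "azurerm" {
  features {}
  # Assumption: the subscription ID comes from ARM_SUBSCRIPTION_ID.
}
EOF

# main.tf: resource group, AKS cluster with a CPU system pool, GPU user pool.
cat > "$LAB_DIR/main.tf" <<'EOF'
resource "azurerm_resource_group" "lab" {
  name     = "rg-aks-gpu-lab" # hypothetical name
  location = "westus3"
}

resource "azurerm_kubernetes_cluster" "lab" {
  name                = "aks-gpu-lab" # hypothetical name
  location            = azurerm_resource_group.lab.location
  resource_group_name = azurerm_resource_group.lab.name
  dns_prefix          = "aksgpulab"

  default_node_pool {
    name       = "system"
    vm_size    = "Standard_D2s_v5" # assumed CPU size for the system pool
    node_count = 1
  }

  identity {
    type = "SystemAssigned"
  }
}

resource "azurerm_kubernetes_cluster_node_pool" "gpu" {
  name                  = "gpu"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.lab.id
  mode                  = "User"
  vm_size               = "Standard_NC4as_T4_v3" # one T4 GPU per node
  node_count            = 1
}
EOF

# outputs.tf: names that later validation steps can reuse.
cat > "$LAB_DIR/outputs.tf" <<'EOF'
output "resource_group_name" {
  value = azurerm_resource_group.lab.name
}

output "cluster_name" {
  value = azurerm_kubernetes_cluster.lab.name
}
EOF

ls "$LAB_DIR"
```

Keeping the GPU pool in its own `azurerm_kubernetes_cluster_node_pool` resource means it can be scaled or removed without touching the cluster resource.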
Deploy the cluster
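A typical Terraform deployment runs `init`, `plan`, and `apply` in that order. The sketch below wraps the three in a hypothetical `deploy` function and assumes the configuration lives in `LAB_DIR` (the current directory by default). It does nothing unless Terraform is installed and `.tf` files are present.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumption: the Terraform files live in LAB_DIR (default: current directory).
LAB_DIR="${LAB_DIR:-.}"

deploy() {
  terraform -chdir="$LAB_DIR" init
  # Save the plan so that apply executes exactly what was reviewed.
  terraform -chdir="$LAB_DIR" plan -out=tfplan
  terraform -chdir="$LAB_DIR" apply tfplan
}

# Deploy only when Terraform is installed and there is a configuration to apply.
if command -v terraform >/dev/null 2>&1 && compgen -G "$LAB_DIR/*.tf" >/dev/null; then
  deploy
else
  echo "Need terraform on PATH and .tf files in $LAB_DIR; nothing deployed" >&2
fi
```

Applying a saved plan file skips the interactive approval prompt, so review the `plan` output before the `apply` step runs. Provisioning a GPU node pool can take several minutes.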
Validate AKS and GPU nodes
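One way to validate the deployment is to fetch the cluster credentials and list the nodes, first all of them and then only the GPU pool. This is a sketch under assumptions: the resource group, cluster, and pool names are hypothetical defaults and must match whatever the Terraform configuration actually created.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical names; replace them with the values from your configuration.
RESOURCE_GROUP="${RESOURCE_GROUP:-rg-aks-gpu-lab}"
CLUSTER_NAME="${CLUSTER_NAME:-aks-gpu-lab}"
GPU_POOL="${GPU_POOL:-gpu}"

validate_cluster() {
  # Merge the cluster's credentials into the local kubeconfig.
  az aks get-credentials \
    --resource-group "$RESOURCE_GROUP" \
    --name "$CLUSTER_NAME" \
    --overwrite-existing
  # Every node should be Ready; AKS labels each node with its pool name.
  kubectl get nodes -o wide
  kubectl get nodes -l "kubernetes.azure.com/agentpool=$GPU_POOL"
}

# Run only when the Azure CLI is installed and logged in.
if az account show >/dev/null 2>&1; then
  validate_cluster
else
  echo "Azure CLI missing or not logged in; skipping validation" >&2
fi
```

The second `kubectl get nodes` call should list exactly the GPU node. An empty result usually means the pool name differs or the node is still provisioning.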
Enable GPU support (required)
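By default, AKS does not expose GPUs to workloads automatically. Until the NVIDIA device plugin is running on the GPU nodes, they advertise no nvidia.com/gpu resource, and pods that request a GPU stay Pending.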
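A common way to enable GPU scheduling is to deploy the NVIDIA device plugin DaemonSet from the project's static manifest. The sketch below assumes release tag `v0.17.1`; check the plugin's releases page and adjust `PLUGIN_VERSION` if needed. It also assumes the GPU node pool carries no custom taints. If it does, the DaemonSet needs a matching toleration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumption: pin a plugin release; verify the tag against the project's releases.
PLUGIN_VERSION="${PLUGIN_VERSION:-v0.17.1}"
PLUGIN_URL="https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/${PLUGIN_VERSION}/deployments/static/nvidia-device-plugin.yml"

install_device_plugin() {
  kubectl apply -f "$PLUGIN_URL"
  # The static manifest creates this DaemonSet in the kube-system namespace.
  kubectl -n kube-system rollout status \
    daemonset/nvidia-device-plugin-daemonset --timeout=180s
}

# Install only when the current kubectl context can reach a cluster.
if kubectl cluster-info >/dev/null 2>&1; then
  install_device_plugin
else
  echo "No reachable cluster in the current kubectl context; plugin not installed" >&2
fi
```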
Verify:
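One way to verify is to list each node's allocatable `nvidia.com/gpu` count. Nodes without a GPU show `<none>`, and the GPU node should show `1` once the device plugin is running. In this sketch, `count_gpu_nodes` and `list_gpu_capacity` are hypothetical helper names.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Count data rows whose second column is a positive integer,
# that is, nodes that advertise at least one GPU.
count_gpu_nodes() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 > 0 { n++ } END { print n + 0 }'
}

# Print one row per node with its allocatable nvidia.com/gpu value.
list_gpu_capacity() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# Query only when the current kubectl context can reach a cluster.
if kubectl cluster-info >/dev/null 2>&1; then
  list_gpu_capacity
  echo "Nodes advertising GPUs: $(list_gpu_capacity | count_gpu_nodes)"
else
  echo "No reachable cluster in the current kubectl context" >&2
fi
```

A count of zero after the DaemonSet has rolled out usually points to a toleration mismatch or to drivers that are still initializing on the node.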
GPU validation test
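A basic validation is a one-shot pod that requests a single GPU and runs a CUDA sample. The sketch below writes such a manifest and applies it. The pod name `gpu-test`, the file name, and the sample image tag are assumptions; confirm the image exists in NVIDIA's registry before relying on it.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical manifest file name.
MANIFEST="${MANIFEST:-gpu-test-pod.yaml}"

cat > "$MANIFEST" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda-vectoradd
      # Assumption: sample image and tag; verify before use.
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
      resources:
        limits:
          nvidia.com/gpu: 1 # schedules only onto a node advertising a GPU
EOF

run_gpu_test() {
  kubectl apply -f "$MANIFEST"
  # Block until the pod has run to completion, then show what it printed.
  kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/gpu-test --timeout=300s
  kubectl logs gpu-test
}

# Run only when the current kubectl context can reach a cluster.
if kubectl cluster-info >/dev/null 2>&1; then
  run_gpu_test
else
  echo "No reachable cluster; wrote $MANIFEST without applying it" >&2
fi
```

If the pod stays Pending, `kubectl describe pod gpu-test` shows whether the scheduler found a node with a free `nvidia.com/gpu`. Delete the pod with `kubectl delete pod gpu-test` once the logs have been checked.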
Cleanup
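Because the GPU node bills for as long as it exists, tear everything down with `terraform destroy` from the same directory that holds the state. This sketch uses a hypothetical `destroy_lab` function and assumes the configuration lives in `LAB_DIR`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumption: the Terraform files and state live in LAB_DIR.
LAB_DIR="${LAB_DIR:-.}"

destroy_lab() {
  # Terraform lists what it will delete and asks for confirmation.
  terraform -chdir="$LAB_DIR" destroy
}

# Destroy only when Terraform is installed and a configuration is present.
if command -v terraform >/dev/null 2>&1 && compgen -G "$LAB_DIR/*.tf" >/dev/null; then
  destroy_lab
else
  echo "Need terraform on PATH and .tf files in $LAB_DIR; nothing destroyed" >&2
fi
```

AKS also creates a separate node resource group for the cluster's VMs and disks. Azure removes it together with the cluster, but it is worth confirming in the portal that nothing is left behind.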
References
AKS GPU clusters: https://learn.microsoft.com/azure/aks/gpu-cluster
Terraform AKS: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/kubernetes_cluster
NVIDIA device plugin: https://github.com/NVIDIA/k8s-device-plugin