Creating an AKS Cluster with GPU using Terraform

Objective

Provision an Azure Kubernetes Service (AKS) cluster with a dedicated GPU node pool for AI and inference workloads.

This lab shows how infrastructure engineers can:

  • Provision AKS using Terraform

  • Add a GPU-enabled user node pool

  • Prepare the cluster so GPUs are actually usable by workloads

Region: West US 3

GPU SKU: Standard_NCas_T4_v3 (cost-efficient inference GPU)


Lab scope and expectations

This is an infrastructure enablement lab.

It covers:

  • AKS cluster provisioning

  • CPU system node pool + GPU user node pool

  • GPU readiness using the NVIDIA device plugin

  • Basic validation with a GPU test pod

It does not cover:

  • Model training or fine-tuning

  • Advanced MLOps pipelines

  • Production hardening (private AKS, firewall, policy-as-code)


Prerequisites

  • Terraform CLI (v1.5+)

  • Azure CLI

  • kubectl

  • Azure subscription with AKS + GPU quota in West US 3

  • RBAC: Owner or Contributor on the subscription

Login first:
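
The login step can be sketched as follows. This is a minimal example, not the lab's original commands: `SUBSCRIPTION_ID` is a hypothetical placeholder variable, and the function names are introduced here for illustration only.

```shell
# Hypothetical placeholder: set this to your own subscription ID or name.
SUBSCRIPTION_ID="${SUBSCRIPTION_ID:-<your-subscription-id>}"

azure_login() {
  # Interactive sign-in for the Azure CLI.
  az login || return 1
  # Select the subscription the lab should deploy into.
  az account set --subscription "$SUBSCRIPTION_ID" || return 1
  # Confirm the active subscription before creating billable resources.
  az account show --output table
}

show_gpu_quota() {
  # List compute quota for West US 3 and keep only the T4-related rows.
  az vm list-usage --location westus3 --output table | grep -i "T4"
}
```

Run `azure_login` once per shell session. `show_gpu_quota` is an optional check that the subscription has vCPU quota for the T4 VM family before Terraform tries to create the GPU node.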


⚠️ Cost warning

GPU nodes are expensive and are billed for as long as they exist, whether or not any workload is running on them.

This lab uses:

  • 1 × Standard_NCas_T4_v3 GPU node

Destroy resources when finished.


Folder structure
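
One possible layout is a single directory, `aks-gpu-lab/`, holding `providers.tf` and `main.tf`. The sketch below creates that layout and writes a minimal configuration. Everything in it is an assumption rather than the lab's original files: the directory name, resource names, the `~> 3.100` provider constraint, and the VM sizes are all illustrative. In particular, the lab names the GPU series as Standard_NCas_T4_v3, and the sketch assumes the smallest size in that series, `Standard_NC4as_T4_v3`.

```shell
# Hypothetical directory name; change it to suit your workspace.
LAB_DIR="${LAB_DIR:-aks-gpu-lab}"

scaffold_lab() {
  mkdir -p "$LAB_DIR" || return 1

  # providers.tf: Terraform and provider version constraints.
  cat > "$LAB_DIR/providers.tf" <<'EOF'
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.100" # assumption: a recent 3.x release
    }
  }
}

provider "azurerm" {
  features {}
}
EOF

  # main.tf: resource group, cluster with a CPU system pool, GPU user pool.
  cat > "$LAB_DIR/main.tf" <<'EOF'
resource "azurerm_resource_group" "lab" {
  name     = "rg-aks-gpu-lab" # hypothetical name
  location = "westus3"
}

resource "azurerm_kubernetes_cluster" "lab" {
  name                = "aks-gpu-lab" # hypothetical name
  location            = azurerm_resource_group.lab.location
  resource_group_name = azurerm_resource_group.lab.name
  dns_prefix          = "aksgpulab"

  # CPU-only system pool for cluster services.
  default_node_pool {
    name       = "system"
    vm_size    = "Standard_D2s_v5" # assumption: a small general-purpose size
    node_count = 1
  }

  identity {
    type = "SystemAssigned"
  }
}

# Dedicated GPU user pool, left untainted to keep the lab simple.
resource "azurerm_kubernetes_cluster_node_pool" "gpu" {
  name                  = "gpu"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.lab.id
  mode                  = "User"
  vm_size               = "Standard_NC4as_T4_v3" # assumption: smallest T4 size
  node_count            = 1
}
EOF
}
```

Calling `scaffold_lab` writes both files under `$LAB_DIR`. The GPU pool is deliberately not tainted here; a tainted pool would need matching tolerations on the device plugin and on every GPU workload.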


Deploy the cluster
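
The deployment follows the usual Terraform init, plan, apply sequence. A minimal sketch, assuming it is run from the directory that holds the `.tf` files (the function name and the `tfplan` file name are illustrative):

```shell
deploy_cluster() {
  # Download the provider plugins and initialise the working directory.
  terraform init || return 1
  # Save the plan so that apply executes exactly what was reviewed.
  terraform plan -out=tfplan || return 1
  # Applying a saved plan does not prompt for confirmation.
  terraform apply tfplan
}
```

Review the plan output before the apply step runs if you want to confirm that exactly one GPU node will be created.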


Validate AKS and GPU nodes
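
Validation amounts to fetching cluster credentials and confirming that nodes from both pools are Ready. The sketch below assumes hypothetical resource group, cluster, and node pool names; substitute the values from your own Terraform configuration.

```shell
# Hypothetical names; use the values from your Terraform configuration.
RESOURCE_GROUP="${RESOURCE_GROUP:-rg-aks-gpu-lab}"
CLUSTER_NAME="${CLUSTER_NAME:-aks-gpu-lab}"
GPU_POOL="${GPU_POOL:-gpu}"

validate_cluster() {
  # Merge the cluster's credentials into the local kubeconfig.
  az aks get-credentials --resource-group "$RESOURCE_GROUP" --name "$CLUSTER_NAME" || return 1
  # Show the node pools as Azure sees them.
  az aks nodepool list --resource-group "$RESOURCE_GROUP" --cluster-name "$CLUSTER_NAME" --output table || return 1
  # Nodes from both pools should report Ready.
  kubectl get nodes -o wide || return 1
  # AKS labels each node with the name of its node pool.
  kubectl get nodes -l "kubernetes.azure.com/agentpool=$GPU_POOL"
}
```

The last command should list exactly one node for this lab. A Ready GPU node does not yet mean the GPU is schedulable.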


Enable GPU support (required)

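AKS does not expose GPUs to the Kubernetes scheduler automatically. Until the NVIDIA device plugin is running on the GPU nodes, they advertise no nvidia.com/gpu resource, and pods that request a GPU cannot be scheduled.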
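
One way to install the NVIDIA device plugin is to apply the static manifest from the NVIDIA repository listed in the references. This is a sketch under assumptions: the release tag `v0.17.1` and the manifest path are values to check against the repository's README before use, and the function name is illustrative.

```shell
# Assumption: a release tag you have verified in the NVIDIA repository.
PLUGIN_VERSION="${PLUGIN_VERSION:-v0.17.1}"

install_device_plugin() {
  # Build the manifest URL from the pinned release tag, then apply it.
  kubectl apply -f "https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/${PLUGIN_VERSION}/deployments/static/nvidia-device-plugin.yml"
}
```

The stock manifest is understood to tolerate only a taint keyed `nvidia.com/gpu`. If the GPU pool carries a different taint, the DaemonSet needs matching tolerations before its pods can land on the GPU node.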

Verify:
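
A possible check, assuming the plugin was installed from the NVIDIA static manifest (which creates a DaemonSet named `nvidia-device-plugin-daemonset` in `kube-system`); the function name is illustrative:

```shell
verify_gpu_resources() {
  # The plugin DaemonSet should show one ready pod per GPU node.
  kubectl get daemonset nvidia-device-plugin-daemonset -n kube-system || return 1
  # Each GPU node should list a GPU count; CPU-only nodes show <none>.
  kubectl get nodes -o 'custom-columns=NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}
```

If the GPU column stays empty for the GPU node, inspect the plugin pod's logs before moving on to the validation test.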


GPU validation test
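
A simple validation is a one-shot pod that requests one GPU and runs `nvidia-smi`. The sketch below is an assumption-laden example: the pod name `gpu-smoke-test` is hypothetical, and the image tag `nvidia/cuda:12.2.0-base-ubuntu22.04` is assumed to be pullable and compatible with the node's driver, so substitute a tag you have verified. It also assumes the GPU pool is untainted.

```shell
# Assumption: a public CUDA base image tag; substitute one you have verified.
CUDA_IMAGE="${CUDA_IMAGE:-nvidia/cuda:12.2.0-base-ubuntu22.04}"

gpu_test_manifest() {
  # Print a pod spec that requests exactly one GPU and runs nvidia-smi once.
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: nvidia-smi
      image: ${CUDA_IMAGE}
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}

run_gpu_test() {
  # Create the pod from the generated manifest.
  gpu_test_manifest | kubectl apply -f - || return 1
  # Wait until the container has run to completion.
  kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/gpu-smoke-test --timeout=300s || return 1
  # The logs should contain the nvidia-smi table listing the GPU.
  kubectl logs gpu-smoke-test || return 1
  # Remove the pod so the GPU is free again.
  kubectl delete pod gpu-smoke-test
}
```

If the pod stays Pending, the node is probably not advertising nvidia.com/gpu yet. If it fails, `kubectl describe pod gpu-smoke-test` usually shows whether the cause is the image pull or the runtime.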


Cleanup
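
Cleanup can be sketched as a Terraform destroy followed by a check that the resource group is gone. The resource group name is a hypothetical default; use the one from your configuration, and run this from the directory that holds the Terraform state.

```shell
# Hypothetical name; use the resource group from your Terraform configuration.
RESOURCE_GROUP="${RESOURCE_GROUP:-rg-aks-gpu-lab}"

cleanup_lab() {
  # Prompts for confirmation, then deletes everything tracked in state.
  terraform destroy || return 1
  # Prints "false" once the resource group no longer exists.
  az group exists --name "$RESOURCE_GROUP"
}
```

Billing for the GPU node stops only after the destroy completes, so it is worth waiting for the final check rather than closing the terminal early.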


References

  • AKS GPU clusters: https://learn.microsoft.com/azure/aks/gpu-cluster

  • Terraform AKS: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/kubernetes_cluster

  • NVIDIA device plugin: https://github.com/NVIDIA/k8s-device-plugin
