Creating an AKS Cluster with GPU using Terraform

Objective

Provision an Azure Kubernetes Service (AKS) cluster with a dedicated GPU node pool for AI and inference workloads.

This lab demonstrates how to:

  • Use Terraform for Infrastructure-as-Code (IaC)

  • Deploy a GPU-enabled AKS node pool

  • Prepare your cluster for AI workloads such as Azure OpenAI, MLflow, or custom inference containers

Prerequisites

Before running this lab, make sure you have:

  • Terraform CLI installed (v1.5+ recommended)

  • Access to an Azure subscription with permission to create resource groups and AKS clusters

  • Quota for GPU-enabled VM SKUs (e.g., Standard_NC6s_v3 or Standard_NCas_T4_v3)

Folder Structure

terraform-aks-gpu/
├── main.tf
├── variables.tf
├── outputs.tf
└── README.md
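
The contents of main.tf are not reproduced in this lab, but a minimal sketch for this setup might look like the following. The resource group and cluster names match the commands used later in the lab; the region, node counts, taint, and VM size are assumptions you should adjust to your subscription and quota.

```hcl
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
  }
}

provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "lab" {
  name     = "rg-ai-lab"
  location = "eastus" # assumption: pick a region where you have GPU quota
}

resource "azurerm_kubernetes_cluster" "aks" {
  name                = "aks-ai-cluster"
  location            = azurerm_resource_group.lab.location
  resource_group_name = azurerm_resource_group.lab.name
  dns_prefix          = "aksailab"

  # System node pool for cluster services; GPU workloads go to the
  # dedicated pool below.
  default_node_pool {
    name       = "system"
    node_count = 2
    vm_size    = "Standard_D4s_v3"
  }

  identity {
    type = "SystemAssigned"
  }
}

resource "azurerm_kubernetes_cluster_node_pool" "gpu" {
  name                  = "gpu"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
  vm_size               = "Standard_NC6s_v3" # requires GPU quota
  node_count            = 1
  mode                  = "User"

  # Optional: taint the pool so only workloads that explicitly
  # tolerate GPUs are scheduled here.
  node_taints = ["sku=gpu:NoSchedule"]
}
```

AKS automatically labels each node with agentpool=&lt;pool name&gt;, so naming the pool "gpu" is what makes the later `kubectl get nodes -l "agentpool=gpu"` check work.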

Configuration

1. Authenticate with Azure

Authenticate and set your default subscription:

az login
az account set --subscription "<your-subscription-id>"

2. Initialize Terraform

terraform init

3. Validate and review the plan

terraform validate
terraform plan -out=tfplan

4. Apply the configuration

terraform apply "tfplan"

What this deployment creates

  • Resource group: a logical container for all deployed resources

  • AKS cluster: a managed Kubernetes cluster with a default (system) node pool

  • GPU node pool: a secondary node pool using Standard_NC6s_v3 (or similar)

  • Managed identity: used for AKS and node pool operations

  • Network resources: VNet, subnets, NSG (if defined)

Validation

After deployment, verify your GPU node pool:

az aks nodepool list \
  --cluster-name aks-ai-cluster \
  --resource-group rg-ai-lab \
  --query "[].{Name:name,VMSize:vmSize,NodeCount:count,Mode:mode}"

You can also connect to your cluster:

az aks get-credentials --resource-group rg-ai-lab --name aks-ai-cluster
kubectl get nodes -o wide

Check that the GPU node pool is labeled and ready:

kubectl get nodes -l "agentpool=gpu"
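
Once the NVIDIA device plugin DaemonSet is running on the cluster, the GPU nodes advertise an nvidia.com/gpu resource, and you can verify scheduling end to end with a small test pod. The manifest below is a sketch: the CUDA image tag is an assumption (any CUDA-capable image will do), and the toleration is only needed if your GPU pool is tainted.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      # Assumed image/tag; substitute any CUDA-capable image.
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
  # Only needed if the GPU pool is tainted (e.g., sku=gpu:NoSchedule).
  tolerations:
    - key: "sku"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
```

Apply it with kubectl apply -f gpu-test.yaml, then check kubectl logs gpu-test for nvidia-smi output listing the GPU.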

Next steps

With the cluster running, you can:

  • Install the NVIDIA device plugin DaemonSet so that pods can request nvidia.com/gpu resources

  • Deploy an inference workload (Azure OpenAI-backed services, MLflow model serving, or a custom container) targeting the GPU node pool

  • Enable the cluster autoscaler on the GPU node pool to control cost when no GPU workloads are running

Cleanup

To remove all resources:

terraform destroy
