githubEdit

Cheatsheets

A condensed, at-a-glance guide to essential commands, architectures, and Azure resources for managing AI workloads efficiently.


Azure GPU VM comparison

VM SKU
Ideal for
GPU type
Inference
Training
Notes

Standard_NC6s_v3

General AI workloads

1Γ— V100

βœ…

⚠️ (limited)

Balanced price/performance

Standard_NCas_T4_v3

Cost-effective inference

1Γ— T4

βœ…βœ…

❌

Best option for production inference

Standard_ND_A100_v4

LLM training, deep learning

1–8Γ— A100

βœ…βœ…βœ…

βœ…βœ…βœ…

High cost, top-tier performance

Standard_NVads_A10

Visualization + lightweight AI

1Γ— A10

βœ…

❌

Good for inference, dev/test, and visualization

πŸ’‘ Tip: For inference, prefer T4 or A10 GPUs. For large-scale training, use A100s.


CPU vs GPU quick reference

Attribute
CPU
GPU

Cores

Tens

Thousands

Strength

Serial and mixed workloads

Massively parallel vector processing

Use cases

Web apps, ETL, monitoring

Deep learning, embeddings, image processing

Azure examples

Dv5, Ev4

NCas_T4, ND_A100

Cost

πŸ’²

πŸ’²πŸ’²πŸ’²

πŸ’‘ Rule of thumb: GPUs are specialized hardware. Use them only where parallelism matters.


Security checklist for AI workloads

Control
Description
Status

RBAC with Managed Identities

Assign least-privilege roles

βœ…

Private Link for all endpoints

No public exposure of APIs

βœ…

Key Vault for secrets

No credentials in code or YAML

βœ…

Diagnostic logs enabled

Send to Log Analytics workspace

βœ…

API rate limiting configured

Prevent misuse and abuse

βœ…

Prompt injection testing done

Validate model input security

βœ…

Data encryption (in transit & at rest)

TLS 1.2+ and SSE enabled

βœ…

Network egress control

Restrict outbound traffic from inference

βœ…

πŸ’‘ Tip: Treat every model as a production API β€” with the same level of security scrutiny.


Monitoring and observability cheat sheet

Metric
Source
Tool

GPU utilization

nvidia-smi, DCGM Exporter

Prometheus / Grafana

Latency (P95)

Application telemetry

Azure Monitor / Application Insights

Errors (429s, 5xx)

API logs

Application Gateway / App Insights

Cost per request

Logs + billing data

Cost Management / Power BI

Token usage (TPM/RPM)

Azure OpenAI metrics and logs

Azure Monitor / Log Analytics

πŸ’‘ Tip: Always correlate GPU utilization, latency, and token throughput.


Infrastructure quick deploy commands

Create a GPU VM

⚠️ Ensure GPU quota is approved in the target region before deployment.


Deploy a model endpoint (Azure ML)


Create an AKS cluster GPU node pool (Terraform)


Performance and throughput formulas

Metric
Formula
Example

TPM (Tokens per Minute)

(Input + Output Tokens) Γ— RPM

(500 + 300) Γ— 1000 = 800,000 TPM

QPS (Queries per Second)

RPM Γ· 60

300 RPM = 5 QPS

Cost estimation (Azure OpenAI)

Tokens Γ— Cost per 1K Tokens

Pricing varies by model

GPU scaling efficiency

Active GPU Time Γ· Total Allocated Time

80% = efficient utilization

πŸ’‘ Tip: Log and analyze TPM, RPM, and QPS to prevent throttling and overprovisioning.


Where to run your model β€” Decision flow

πŸ’‘ Rule:

  • AKS + GPU Pool: Scalable inference and production APIs

  • VM: Isolated testing or proof of concept

  • Azure ML: Training, registry, and lifecycle management

  • Azure Functions: Bursty, low-QPS inference with scale-to-zero


Resource tagging convention for AI workloads

Tag
Example
Purpose

Environment

dev, prod

Identify lifecycle stage

Team

AI-Infra

Ownership

CostCenter

12345

Chargeback and budgeting

Model

gpt4, vision-v2

Correlate to inference and token metrics

Region

eastus, swedencentral

Deployment region

πŸ’‘ Tip: Standardize tags for automation, budget tracking, and governance.


  • https://learn.microsoft.com/azure/machine-learning/

  • https://learn.microsoft.com/azure/ai-services/openai/quotas-limits

  • https://learn.microsoft.com/azure/aks/

  • https://learn.microsoft.com/azure/cost-management-billing/

  • https://learn.microsoft.com/azure/azure-monitor/containers/container-insights-overview

Last updated