Technical FAQ
A practical reference for infrastructure and cloud engineers adopting Artificial Intelligence in their environments.
1. Can I run AI workloads without a GPU?
Yes, but with limitations.
Lightweight models (regression, decision trees, classical ML) can run efficiently on CPU.
Large Language Models (LLMs), computer vision, and deep learning workloads typically need GPUs to reach acceptable latency and throughput.
For cost‑effective inference on Azure, consider the NCasT4_v3 (NVIDIA T4) or NVadsA10 v5 (NVIDIA A10) VM series.
💡 Tip: Use Spot VMs only for non‑critical inference or batch jobs and always implement checkpointing and retries.
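The checkpoint-and-retry advice can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function names, the JSON checkpoint layout, and the `backoff_seconds` parameter are all assumptions made for the example, and a real job would also persist its outputs rather than hold them in memory.

```python
import json
import os
import tempfile
import time


def load_checkpoint(path):
    """Return the index of the next unprocessed item (0 if no checkpoint exists)."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_index"]


def save_checkpoint(path, next_index):
    """Write the checkpoint atomically so an eviction mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp_path, path)


def run_batch(items, process, checkpoint_path, max_retries=3, backoff_seconds=1.0):
    """Process items in order, resuming from the last checkpoint after a restart.

    Returns only the results produced in this run (hypothetical design choice).
    """
    results = []
    start = load_checkpoint(checkpoint_path)
    for index in range(start, len(items)):
        for attempt in range(1, max_retries + 1):
            try:
                results.append(process(items[index]))
                break
            except Exception:
                if attempt == max_retries:
                    raise  # give up; the checkpoint still points at this item
                time.sleep(backoff_seconds * attempt)
        save_checkpoint(checkpoint_path, index + 1)
    return results
```

If the Spot VM is evicted partway through, rerunning `run_batch` with the same `checkpoint_path` skips the items that already completed.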
2. What’s the difference between training and inference?
Training: Builds the model using historical data. Example: large batch job or offline benchmark.
Inference: Executes the model for predictions. Example: stateless API responding to requests.
💡 Infra lens: Training is bursty and compute‑heavy. Inference is latency‑sensitive and continuous.
3. How can I auto‑scale AI workloads?
Recommended signals:
GPU utilization and memory usage
Request queue depth
P95/P99 inference latency
Tokens per minute (TPM) growth rate
Azure implementations:
AKS: Cluster Autoscaler + HPA using Prometheus metrics
Azure ML: min_instances/max_instances on online endpoints
VMSS: Autoscale rules based on custom GPU metrics
💡 Best practice: Always configure cooldown periods to avoid scale oscillation.
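To show how a cooldown interacts with the signals above, here is a minimal sketch of a scaling decision. It is not how HPA or VMSS autoscale is implemented; `ScalePolicy`, its thresholds, and `desired_replicas` are hypothetical names and values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScalePolicy:
    # All thresholds are illustrative assumptions, not recommended values.
    min_replicas: int = 1
    max_replicas: int = 8
    scale_out_queue_depth: int = 20
    scale_in_queue_depth: int = 2
    scale_out_p95_ms: float = 800.0
    cooldown_seconds: float = 300.0


def desired_replicas(current, queue_depth, p95_ms, now, last_scale_time, policy):
    """Return (replica_count, last_scale_time) after applying thresholds and cooldown."""
    # Inside the cooldown window: hold steady so the system cannot oscillate.
    if now - last_scale_time < policy.cooldown_seconds:
        return current, last_scale_time

    target = current
    if queue_depth >= policy.scale_out_queue_depth or p95_ms >= policy.scale_out_p95_ms:
        target = current + 1
    elif queue_depth <= policy.scale_in_queue_depth and p95_ms < policy.scale_out_p95_ms / 2:
        target = current - 1

    # Clamp to the configured bounds.
    target = max(policy.min_replicas, min(policy.max_replicas, target))
    if target == current:
        return current, last_scale_time
    return target, now
```

Scaling one replica at a time and requiring a clearly lower latency before scaling in are conservative choices; a real policy would tune both to the workload.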
4. How do I secure inference endpoints?
Baseline controls:
Private Endpoints with VNet integration
Authentication using Entra ID (Azure AD) where supported
API keys only as a fallback, stored in Azure Key Vault
WAF or API Management for throttling and abuse prevention
Diagnostic logs enabled in Application Insights
💡 Zero Trust rule: Treat inference endpoints like any external‑facing production API.
5. How much does AI cost to run on Azure?
Costs vary by region, model, and usage profile.
Standard_NC6s_v3: ~$1–$1.5/hr. Entry GPU workloads.
Standard_NCas_T4_v3: ~$0.8–$1/hr. Best inference cost efficiency.
ND_A100_v4: $25–$35/hr. High‑end training.
Azure OpenAI (Standard): Pay per token. Variable latency.
Azure OpenAI (PTU): Fixed hourly. Predictable throughput.
💡 Tip: Always pair GPU usage with Azure Cost Management budgets and alerts.
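A rough way to turn an hourly rate into a monthly figure and compare it with a budget is sketched below. The 730-hour month is a common approximation (8,760 hours divided by 12), and the function names and alert thresholds are assumptions for illustration; actual prices should come from the Azure pricing pages for your region.

```python
HOURS_PER_MONTH = 730  # approximation: 8760 hours per year / 12 months


def monthly_cost(hourly_rate_usd, instances=1, hours_per_month=HOURS_PER_MONTH):
    """Estimate monthly compute cost for always-on or scheduled instances."""
    return hourly_rate_usd * instances * hours_per_month


def crossed_thresholds(cost_usd, budget_usd, thresholds=(0.5, 0.8, 1.0)):
    """Return the budget fractions that the estimated cost has reached."""
    return [t for t in thresholds if cost_usd >= budget_usd * t]
```

For example, a GPU VM at $1/hr running around the clock comes to about $730 per month, while the same VM limited to 220 hours by scheduled shutdowns comes to about $220.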
6. How do I monitor GPU usage and model latency?
Key telemetry sources:
nvidia-smi for node‑level checks
DCGM Exporter for Prometheus
Azure Monitor for Containers
Application Insights for request latency, errors, and dependencies
Grafana for unified dashboards
💡 Golden rule: Correlate GPU utilization, latency, and token throughput.
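As a node-level illustration of correlating these signals, the sketch below parses the CSV output of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`, computes a nearest-rank percentile over latency samples, and combines GPU utilization with P95 latency into a coarse diagnosis. The helper names, the 50% idle threshold, and the diagnosis strings are assumptions; token throughput is left out to keep the example short, and the parser assumes every field is numeric.

```python
import math
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse lines such as '87, 10240, 15360' into one dict per GPU."""
    gpus = []
    for line in text.strip().splitlines():
        util, used, total = (float(v) for v in line.split(","))
        gpus.append({"util_pct": util, "mem_used_mib": used, "mem_total_mib": total})
    return gpus


def read_gpus():
    """Run nvidia-smi on the local node (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def diagnose(gpu_util_pct, p95_latency_ms, latency_target_ms, low_util_pct=50.0):
    """Combine utilization and latency into a coarse hint about where to look."""
    slow = p95_latency_ms > latency_target_ms
    idle = gpu_util_pct < low_util_pct
    if slow and idle:
        return "slow with idle GPU: check data path, batching, or network"
    if slow:
        return "slow with busy GPU: scale out or use a larger SKU"
    if idle:
        return "fast with idle GPU: capacity may be over-provisioned"
    return "healthy"
```

In a cluster, DCGM Exporter and Application Insights would supply the same inputs; the point is that neither metric is meaningful on its own.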
7. What are common bottlenecks in AI infrastructure?
Storage: Slow dataset reads. Mitigation: NVMe, Premium SSD.
Network: Cross‑region latency. Mitigation: Regional inference, Private Link.
Compute: GPU idle time. Mitigation: Autoscaling, batching.
Cost: Idle clusters. Mitigation: Scheduled shutdowns.
💡 Reality: The bottleneck is often the data path, not the GPU.
8. How do I estimate TPM, RPM, and cost for Azure OpenAI?
Formula: TPM ≈ RPM × average tokens per request
Important considerations:
Tokens include both prompt and response.
Retries amplify real TPM consumption.
Monitor HTTP 429 and Retry-After headers.
💡 Rule: For sustained production traffic, evaluate Provisioned Throughput Units (PTUs).
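A minimal sketch of the estimate, assuming the definitional relationship that tokens per minute equal requests per minute multiplied by tokens per request (prompt plus response). The function names, the `retry_rate` parameter, and the single blended price per 1,000 tokens are illustrative assumptions; real pricing usually differs for input and output tokens. The last helper handles only the delta-seconds form of the Retry-After header and otherwise falls back to capped exponential backoff.

```python
def estimate_tpm(rpm, avg_prompt_tokens, avg_response_tokens, retry_rate=0.0):
    """Tokens per minute, inflated by the fraction of requests that are retried."""
    tokens_per_request = avg_prompt_tokens + avg_response_tokens
    return rpm * tokens_per_request * (1 + retry_rate)


def max_rpm_for_quota(tpm_quota, avg_prompt_tokens, avg_response_tokens):
    """Largest whole RPM that fits inside a given TPM quota."""
    return tpm_quota // (avg_prompt_tokens + avg_response_tokens)


def monthly_token_cost(tpm, usd_per_1k_tokens, active_minutes_per_month):
    """Pay-per-token cost using a single blended price (a simplifying assumption)."""
    return tpm * active_minutes_per_month / 1000 * usd_per_1k_tokens


def retry_delay_seconds(headers, attempt, base=1.0, cap=60.0):
    """Honor Retry-After when present; otherwise use capped exponential backoff."""
    normalized = {key.lower(): value for key, value in headers.items()}
    value = normalized.get("retry-after")
    if value is not None:
        try:
            return float(value)
        except ValueError:
            pass  # HTTP-date form is not handled in this sketch
    return min(cap, base * 2 ** attempt)
```

For example, 60 requests per minute at 500 prompt tokens and 300 response tokens is 48,000 TPM; if half of those requests are retried, real consumption rises to 72,000 TPM.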
9. What’s the best architecture for hybrid environments?
Reference pattern:
Azure Arc for resource management
Azure Monitor Agent for telemetry
Private Link for secure connectivity
Arc‑enabled AKS for unified Kubernetes control
💡 Design principle: Keep inference close to the data to minimize latency and egress.
10. What’s the best way to learn AI for infra engineers?
Suggested path:
AI‑900: Azure AI Fundamentals
Hands‑on labs (GPU VM, AKS GPU, Azure ML inference)
Azure OpenAI and monitoring deep dives
Build an internal Copilot or automation assistant
💡 Mindset: You don’t need to become a data scientist. You need to master how AI runs.
Infrastructure doesn’t compete with AI. It makes AI reliable, secure, and scalable.