“You only control what you can measure — and with AI, that’s even more critical.”
Why monitoring AI is different
AI environments behave differently from traditional workloads.
A model can be up and running and still return incorrect results, respond with high latency, or generate unexpected costs.
Common scenarios:
The model looks fine but predictions are degraded.
The GPU is active but underutilized.
Inference responds, but with high latency noticeable to users.
Costs spike suddenly due to the volume of processed tokens.
Conclusion: Observability isn’t optional. It’s a core part of AI reliability.
What to monitor in AI workloads
| Layer/Category | Key Metrics | Tools/Sources |
| --- | --- | --- |
| GPU / Infrastructure | Utilization, memory, temperature, failures | DCGM, nvidia-smi, Azure Managed Prometheus, Azure Monitor |
| Model / Application | Accuracy, inference latency, TPM/RPM | Application Insights, Azure ML, Azure OpenAI Logs |
| Network | Throughput, jitter, slow connections | Azure Monitor for Network |
| Data | Integrity, freshness, ingestion failures | Data Factory, Synapse, Log Analytics |
| Cost | GPU usage, token volume, inference time | Cost Management + Log Analytics |
| Security | Secret access, Key Vault logs | Azure Policy, Defender for Cloud |
💡 Tip: Monitor both the model behavior and the infrastructure that supports it.
Inference without GPU visibility is an incomplete diagnosis.
Core Azure services and open standards that cover these layers:

| Tool | Role |
| --- | --- |
| Azure Monitor | Collect and visualize resource metrics and logs |
| Log Analytics | Store logs and enable advanced KQL queries (see the example query after this table) |
| Azure Managed Prometheus | Kubernetes and custom metrics, including GPU and application metrics |
| Grafana | Real-time dashboard visualization |
| Application Insights | Telemetry, response time, tracing |
| Azure ML | Model and endpoint monitoring |
| OpenTelemetry | Standardized metrics, logs, and traces |
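As an example of the Log Analytics role above, the following sketch counts recent error lines per pod. It assumes AKS Container Insights is enabled and ships container logs to the workspace; ContainerLogV2 is the Container Insights log table, and the "error" keyword filter is an illustrative choice.

```kusto
// Top pods by error log lines in the last hour
// (assumes Container Insights is writing to ContainerLogV2).
ContainerLogV2
| where TimeGenerated > ago(1h)
| where tostring(LogMessage) has "error"
| summarize errorLines = count() by PodName, ContainerName
| top 10 by errorLines
```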
Practical example — Monitoring GPUs in AKS
Install NVIDIA’s DCGM Exporter
Prometheus integration model
You can integrate DCGM metrics using one of the following approaches:
Azure Managed Prometheus (recommended for production AKS clusters)
Self-managed Prometheus, such as kube-prometheus-stack
Azure Managed Prometheus does not require deploying kube-prometheus-stack.
The Helm-based stack is only needed if you operate Prometheus yourself.
Visualize in Grafana
Add panels with GPU-focused metrics such as:
DCGM_FI_DEV_GPU_UTIL — GPU utilization
DCGM_FI_DEV_FB_USED — GPU memory usage
DCGM_FI_DEV_MEM_COPY_UTIL — memory copy pressure
Sample Grafana dashboards for AI workloads
GPU efficiency dashboard
Recommended panels:
GPU Utilization (%) vs Pod Count
GPU Memory Used (MB) vs Inference Latency
GPU Temperature over time
GPU Utilization vs Requests per Second
p95 / p99 latency per endpoint
Error rate by HTTP status
Use Application Insights to track:
duration — average and tail response time
successRate — success percentage
dependency calls — external API latency
🔧 Recommendations:
Track p95 and p99 latency per endpoint
Alert on HTTP 429 and 503
Correlate latency, token usage, and GPU utilization
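A minimal KQL sketch covering the first two recommendations, assuming workspace-based Application Insights (the AppRequests table); the one-hour window is an illustrative choice.

```kusto
// p95/p99 latency, error rate, and throttled (429) calls per endpoint, last hour.
AppRequests
| where TimeGenerated > ago(1h)
| summarize
    p95DurationMs = percentile(DurationMs, 95),
    p99DurationMs = percentile(DurationMs, 99),
    errorRatePct  = 100.0 * countif(Success == false) / count(),
    throttled     = countif(ResultCode == "429")
    by Name
| order by p99DurationMs desc
```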
Azure OpenAI–specific monitoring
Key metrics and signals
Throughput and rate limiting
RPM (Requests per Minute)
TTFT (Time to First Token)
Capacity planning and stability
Monitoring throttling
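One way to watch for throttling is from the calling application's own telemetry. A minimal sketch, assuming workspace-based Application Insights and that Azure OpenAI calls appear in AppDependencies with an *.openai.azure.com target:

```kusto
// Throttle rate and p95 latency of Azure OpenAI calls, in 5-minute bins.
AppDependencies
| where TimeGenerated > ago(1h)
| where Target has "openai.azure.com"   // assumed endpoint host
| summarize
    calls         = count(),
    throttled     = countif(ResultCode == "429"),
    p95DurationMs = percentile(DurationMs, 95)
    by bin(TimeGenerated, 5m)
| extend throttleRatePct = 100.0 * throttled / calls
| order by TimeGenerated asc
```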
Cost observability
GPUs and tokens are expensive, and those costs scale fast. Combine these signals and tools:
Azure Monitor + Metrics Explorer
Token consumption (TPM/RPM) from Azure OpenAI Metrics and Logs
Cost Management with tags
Azure Anomaly Detector or Machine Learning
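To keep token spend visible day to day, here is a sketch of daily token consumption, assuming Azure OpenAI diagnostic settings export platform metrics to the AzureMetrics table; the metric names below are assumptions, so confirm what your resource actually emits.

```kusto
// Daily token consumption from exported platform metrics.
AzureMetrics
| where TimeGenerated > ago(7d)
| where MetricName in ("ProcessedPromptTokens", "GeneratedTokens")   // assumed metric names
| summarize totalTokens = sum(Total) by MetricName, bin(TimeGenerated, 1d)
| order by TimeGenerated asc
```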
Predictive analysis and intelligent autoscaling
Predict GPU usage peaks based on historical data
Detect latency anomalies using Azure Anomaly Detector (or directly in KQL, as sketched after this list)
Trigger intelligent autoscaling (AKS / VMSS)
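If you prefer to stay inside Log Analytics, KQL's built-in series_decompose_anomalies is a lightweight alternative to Azure Anomaly Detector. A minimal sketch over hourly average request latency, assuming workspace-based Application Insights:

```kusto
// Flag hours whose average latency deviates from the weekly baseline.
AppRequests
| where TimeGenerated > ago(7d)
| make-series avgLatencyMs = avg(DurationMs) default = 0 on TimeGenerated step 1h
| extend (anomalies, score, baseline) = series_decompose_anomalies(avgLatencyMs, 1.5)
| mv-expand TimeGenerated to typeof(datetime),
            avgLatencyMs to typeof(double),
            anomalies to typeof(long)
| where anomalies != 0
| project TimeGenerated, avgLatencyMs
```

The 1.5 argument is the anomaly threshold; tune it against a period of known-good traffic before wiring the result into autoscaling decisions.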
Alerts and automated responses
When an alert fires, pair it with a predefined response, for example:
Investigate data bottlenecks or scale replicas
Validate model, network, or rate limits
Trigger a fallback pipeline
Retrain or activate the previous model
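A log-alert condition is a concrete way to drive those responses. A minimal sketch that fires when p95 latency over the last 15 minutes exceeds two seconds (both numbers are illustrative), assuming workspace-based Application Insights:

```kusto
// Alert condition: p95 latency above 2000 ms in the last 15 minutes, per role.
AppRequests
| where TimeGenerated > ago(15m)
| summarize p95DurationMs = percentile(DurationMs, 95) by AppRoleName
| where p95DurationMs > 2000
```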
Hands-On — Correlating metrics and logs with KQL
GPU metrics live in Prometheus and Grafana; application telemetry, platform metrics, and logs land in Log Analytics, where KQL is used to correlate them.
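A minimal correlation sketch: join p95 request latency with token throughput in 5-minute bins. It assumes workspace-based Application Insights (AppRequests) and Azure OpenAI metrics exported to AzureMetrics; ProcessedPromptTokens is an assumed metric name.

```kusto
// Does latency rise when token throughput spikes?
let latency = AppRequests
    | where TimeGenerated > ago(6h)
    | summarize p95LatencyMs = percentile(DurationMs, 95) by bin(TimeGenerated, 5m);
let tokens = AzureMetrics
    | where TimeGenerated > ago(6h)
    | where MetricName == "ProcessedPromptTokens"   // assumed metric name
    | summarize totalTokens = sum(Total) by bin(TimeGenerated, 5m);
latency
| join kind=inner tokens on TimeGenerated
| project TimeGenerated, p95LatencyMs, totalTokens
| order by TimeGenerated asc
```

Chart the two columns together, or put them next to a GPU utilization panel in Grafana over the same window, to see whether latency spikes line up with token bursts.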
Best practices for security and observability
Never log sensitive data (prompts, PII, responses)
Enable automatic diagnostics with Azure Policy
Centralize logs in a single workspace
Retain logs for at least 30 days
Use Managed Identity and Key Vault
AI observability checklist
References:
https://learn.microsoft.com/en-us/azure/aks/monitor-gpu-metrics
https://learn.microsoft.com/azure/azure-monitor/
https://learn.microsoft.com/azure/azure-monitor/app/opentelemetry-enable
“Good infrastructure is invisible when it works — but poorly monitored AI shows up quickly, either in your monthly bill or in the user experience.”