Skip to content

Chapter 1 — Why AI Needs You

"The best AI infrastructure I've ever seen wasn't built by a machine learning engineer. It was built by a sysadmin who got curious."


The Monday Morning Message

It's 8:47 AM on a Monday. You're halfway through your coffee, reviewing a Terraform plan for a network redesign, when a Slack message lights up your screen. It's from the data science team lead:

"Hey — we need 8 GPU VMs provisioned by Wednesday for a fine-tuning job. We also need a private endpoint for the model inference API, and can you set up monitoring for TPM? Thanks!"

You read it twice. GPU VMs? Fine-tuning? You know what a private endpoint is — you've built hundreds. Monitoring? That's your bread and butter. But what on earth is "TPM" in this context? It's not Trusted Platform Module — it's Tokens Per Minute, a throughput metric for large language models. You don't know that yet, but here's the thing: everything else in that request is pure infrastructure.

Provisioning compute. Configuring network security. Setting up observability. You've been doing this for years. The only difference is the workload type. This book exists to close that gap — to translate what you already know into the language of AI, and to make you the engineer every AI team is desperate to find.


What AI Actually Is (From Your Perspective)

Let's cut through the hype. If you strip away the buzzwords, AI is a workload. It consumes compute, storage, and network — just like every other workload you've ever managed. The difference is in the shape of that consumption: more parallel compute, larger datasets, and different performance metrics.

Here's how the AI landscape breaks down, explained in terms you already understand:

Artificial Intelligence (AI) is the broad discipline. Think of it as "cloud computing" — a huge umbrella that covers everything from simple automation to complex reasoning systems. Machine Learning (ML) is a subset where systems learn patterns from data instead of being explicitly programmed. If AI is "cloud," then ML is "IaaS" — a specific, practical layer within it. Deep Learning (DL) goes deeper still, using neural networks with many layers to handle complex tasks like image recognition and language generation. It's the equivalent of a highly optimized, purpose-built service.

The AI Stack: Three Layers You Already Know

Every AI system rests on three pillars, and you'll recognize them immediately:

AI Layer What It Does Your Infrastructure Equivalent
Data Feeds the model with examples to learn from Storage — Blob, Data Lake, NFS, databases
Model Learns patterns and makes predictions The application — your compiled binary that runs on compute
Infrastructure Powers everything underneath Your domain — compute, network, security, observability

The model is the application. The data is what it consumes and produces. The infrastructure is everything that makes it run reliably, securely, and at scale. That last part? That's you.

Infra ↔ AI Translation

This is the mental model that will carry you through the entire book. When someone on the AI team uses unfamiliar jargon, map it back to what you know:

AI Concept Infrastructure Equivalent Why It Maps
A trained model A compiled binary It's a static artifact produced by a build process, deployed to serve requests
Training a model A batch job Long-running, compute-intensive process that reads data and produces an output artifact
Inference An API call A request comes in, the model processes it, a response goes out — just like any microservice
Fine-tuning Patching a binary You take an existing artifact and customize it for your environment
A dataset A database or data lake Structured input that the workload depends on
A training pipeline A CI/CD pipeline Automated workflow: ingest → process → build → validate → deploy
Model registry Artifact repository Versioned storage for deployable artifacts (think Azure Container Registry, but for models)
GPU cluster High-performance compute tier Specialized hardware allocated for demanding workloads

💡 Pro Tip: When you're in a meeting and the data science team starts talking about "epochs," "hyperparameters," and "loss functions," don't panic. Those are their tuning knobs — the equivalent of your connection pool sizes, cache TTLs, and autoscale thresholds. You don't need to master their knobs. You need to understand what their knobs demand from your infrastructure.


Traditional Infrastructure vs. AI Infrastructure

Here's the good news: AI infrastructure isn't a foreign planet. It's more like a new neighborhood in a city you already know. The streets follow the same grid, the utilities work the same way — but the buildings look different and the tenants have unusual requirements.

What Changes

Dimension Traditional Infrastructure AI Infrastructure What You Need to Learn
Compute CPUs, general-purpose VMs GPUs (NVIDIA T4, A100, H100), multi-GPU nodes GPU SKU families, CUDA compatibility, vGPU vs. passthrough
Storage SSD/HDD, SAN/NAS, managed disks Data Lakes, Blob with high-throughput tiers, local NVMe for scratch I/O patterns for training (sequential reads), checkpoint storage
Networking 1–25 GbE Ethernet, load balancers InfiniBand (up to 400 Gb/s), RDMA, GPU-to-GPU communication Multi-node training topologies, NCCL configuration
Deployment VMs, App Services, containers Inference endpoints, model-as-a-service, GPU-enabled containers Managed endpoints vs. self-hosted, cold start behavior
Observability CPU %, memory, disk I/O, request latency GPU utilization, VRAM usage, tokens/second, time-to-first-token New metric categories, GPU-specific telemetry collection
Cost Model $/hour per VM, reserved instances $/hour per GPU (10-30× CPU cost), PTUs for managed AI services GPU-specific cost governance, scheduling, auto-shutdown

What Stays the Same

This is equally important — maybe more so. These fundamentals don't change just because the workload runs on GPUs:

  • Security: Network segmentation, private endpoints, identity management, encryption at rest and in transit. A GPU VM still needs an NSG. An inference API still needs authentication.
  • Networking: VNets, subnets, DNS, load balancing, private connectivity. The packets still flow the same way.
  • Infrastructure as Code: Bicep, Terraform, ARM templates. GPU VMs are still Azure resources with properties and parameters.
  • Monitoring and alerting: You're still setting thresholds, building dashboards, and responding to incidents. The metrics just have different names.
  • Incident response: When the model inference API goes down at 2 AM, someone still needs to triage it. That someone should be you.
  • Cost management: Budgets, tagging, right-sizing, reserved capacity. If anything, cost governance is more critical with AI workloads.

⚠️ Production Gotcha: Don't assume AI workloads need entirely new tooling. Teams often over-invest in specialized "ML platforms" while ignoring basic infrastructure hygiene. The most common production failures in AI systems aren't model accuracy problems — they're the same old culprits: disk full, network timeout, expired certificate, missing RBAC permission. Your instincts are right.


Where You Fit: The AI Stack Needs Infrastructure Engineers

The AI industry has a staffing problem, and it's not what you think. There are plenty of data scientists who can build models in Jupyter notebooks. What's critically scarce are engineers who can take those models and run them reliably in production. That's the gap. That's where you come in.

Three Worlds Colliding

AI projects sit at the intersection of three disciplines that historically operated in silos:

Role What They Bring What They Typically Lack
Developers Application logic, APIs, frontend integration GPU infrastructure knowledge, network architecture
Data Scientists / ML Engineers Model development, training, evaluation Production operations, security, cost awareness
Infrastructure Engineers Reliability, security, scalability, cost governance AI/ML vocabulary, model lifecycle understanding

When these worlds collide without a common language, bad things happen. The data science team provisions GPU VMs directly through the portal — no IaC, no tagging, no auto-shutdown policies. The developers expose the model API on a public endpoint because "it's just for testing." Nobody monitors GPU utilization, so half the expensive compute sits idle.

You've seen this pattern before. It's the same chaos that happened when developers first got cloud access without guardrails. And just like cloud governance, the solution is infrastructure engineering discipline applied to a new workload type.

What Goes Wrong Without Infrastructure Expertise

These aren't hypothetical scenarios. They're real patterns I've seen repeatedly across organizations adopting AI:

Ungoverned GPU sprawl. A data scientist requests four Standard_NC24ads_A100_v4 VMs for a training experiment. No resource locks, no budget alerts, no tagging. Three weeks later, the VMs are still running — nobody remembers who provisioned them or whether the experiment finished. Monthly cost: $35,000+.

Exposed inference endpoints. The ML team deploys a model to an Azure Machine Learning managed endpoint with a public IP. No private endpoint, no WAF, no API management. The model serves responses that include proprietary business logic and customer data patterns.

Blind spots in observability. The team monitors model accuracy but not infrastructure health. When inference latency spikes from 200ms to 8 seconds, nobody can tell whether it's the model, the compute, the network, or a noisy neighbor. There are no GPU metrics in the monitoring stack.

⚠️ Production Gotcha — The $50K GPU Weekend: A team provisioned 8 × Standard_ND96asr_v4 VMs (A100 GPUs) on a Friday afternoon for a training run they expected to finish by Saturday morning. The training job crashed at 3 AM due to a checkpoint storage misconfiguration, but the VMs kept running. Nobody had set up auto-shutdown policies or budget alerts. Monday morning surprise: $53,000 in compute charges for 60 hours of idle GPU time. An infrastructure engineer would have configured auto-shutdown, set a budget alert at $5,000, and stored checkpoints on Blob with lifecycle policies. Fifteen minutes of infrastructure work would have saved $48,000.


The AI-Ready Infrastructure Professional

You don't need to reinvent yourself. You need to extend your existing expertise into a new domain. Think of it like when virtualization arrived — you didn't stop being a server engineer. You learned a new abstraction layer and became more valuable. AI infrastructure is the same transition.

Skills You Already Have That Transfer Directly

Take inventory of what you bring to the table. It's more than you think:

Your Existing Skill How It Applies to AI
VM provisioning and management GPU VM SKU selection, CUDA driver management, multi-GPU configuration
Network architecture Private endpoints for model APIs, VNet integration, InfiniBand for distributed training
Kubernetes (AKS) GPU node pools, NVIDIA device plugin, model serving on containers
Storage architecture Data lake design, high-throughput storage for training, checkpoint management
IaC (Terraform/Bicep) Automating AI infrastructure — same resources, new parameters
Monitoring and alerting GPU telemetry, inference latency dashboards, token throughput alerts
Security and compliance Model endpoint authentication, data encryption, network isolation
Cost management GPU cost governance, reserved instances for GPUs, spot instances for training
Incident response Triaging inference failures, GPU memory errors, storage bottlenecks

New Skills to Add

The gap is narrower than you think. Here's what you'll learn through this book:

  • GPU fundamentals: SKU families (NC, ND, NV), CUDA compatibility, GPU memory (VRAM) as a capacity constraint, multi-GPU interconnects
  • Model lifecycle: How models go from training to deployment, what "serving" means, versioning and rollback strategies
  • AI observability: New metrics like tokens per second, time-to-first-token, GPU utilization and VRAM pressure, inference queue depth
  • AI-specific networking: InfiniBand topology for multi-node training, NCCL communication patterns, bandwidth requirements between GPU nodes
  • Managed AI services: Azure OpenAI deployment types (standard vs. provisioned), Azure Machine Learning endpoints, AI Foundry

💡 Pro Tip: You don't need to learn Python, TensorFlow, or PyTorch. You don't need to understand backpropagation or gradient descent. You need to understand what those tools and processes demand from infrastructure. When a data scientist says "my training job needs 8 A100 GPUs with NVLink and 2 TB of NVMe scratch space," you need to know which Azure VM SKU delivers that — not how the training algorithm works internally.

Career Paths in AI Infrastructure

This isn't a niche. It's a rapidly growing field with multiple trajectories:

Role Focus Area Key Skills
AI Infrastructure Engineer Provisioning and managing GPU compute, storage, and networking for AI workloads GPU SKUs, InfiniBand, high-performance storage, IaC
MLOps Engineer Automating the model lifecycle — training, validation, deployment, monitoring CI/CD for models, model registries, A/B deployment
AI Cloud Architect Designing end-to-end AI platforms, reference architectures, governance frameworks Azure AI services, cost optimization, security architecture
AI Platform Engineer Building internal platforms that enable data science teams to self-serve Kubernetes, developer experience, API management, quotas

Decision Matrix — Where to Start: - If you're strong in compute and networking → start with AI Infrastructure Engineer - If you're strong in automation and CI/CD → start with MLOps Engineer - If you're strong in architecture and governance → start with AI Cloud Architect - If you're strong in Kubernetes and platform tooling → start with AI Platform Engineer


Key Terms You'll Encounter

Here's an expanded glossary with infrastructure analogies for every term. Bookmark this section — you'll reference it throughout the book.

Term Definition Infrastructure Analogy
Inference Running a trained model to get predictions from new input An API call — request in, response out. This is the "production" phase.
Training The process of teaching a model by feeding it data A batch job — long-running, compute-intensive, produces an artifact (the trained model)
Fine-tuning Customizing a pre-trained model with your specific data Patching a binary — you take an existing artifact and adapt it for your environment
GPU Graphics Processing Unit — hardware optimized for parallel math operations A co-processor, like a network offload card but for matrix math. Thousands of small cores working in parallel.
CUDA NVIDIA's framework for programming GPUs A driver and runtime framework — like the hypervisor layer that lets your workload talk to the hardware
VRAM Video RAM — the GPU's dedicated memory Think of it as the GPU's "RAM." Models must fit in VRAM to run. It's the most common capacity constraint.
LLM Large Language Model (GPT, Llama, Mistral) A large, stateful application that requires significant compute and memory to serve
Token A chunk of text (~4 characters) that the model processes The unit of work — like a packet in networking or a transaction in a database
TPM Tokens Per Minute — throughput metric for language models Requests per second (RPS) equivalent — measures how much work the model can do
PTU Provisioned Throughput Unit — reserved capacity in Azure OpenAI Reserved instances — you pay for guaranteed capacity rather than pay-as-you-go
MLOps DevOps practices applied to the machine learning lifecycle DevOps for models — version control, CI/CD, monitoring, rollback, but for model artifacts
ONNX Open Neural Network Exchange — portable model format Like an OVA/OVF for VMs — a standardized format that runs across different runtimes
Checkpoint A snapshot of model state saved during training A VM snapshot or database backup — lets you resume training from a known good state
Epoch One complete pass through the training dataset Like a full backup cycle — the job processes every record once
Inference endpoint An API that serves model predictions A web service endpoint — same concepts of scaling, load balancing, and health probes

Infra ↔ AI Translation — The One-Liner Cheat Sheet: A trained model is a compiled binary. Training is a batch job. Inference is an API call. A dataset is a database. A training pipeline is a CI/CD pipeline. If you can hold this mental model, you can navigate any AI architecture conversation.


Hands-On: Your First AI Infrastructure Exploration

Let's get practical. You don't need to train a model or write Python. You need to understand what GPU compute is available to you and what your subscription limits look like. This is reconnaissance — the same first step you'd take before architecting any new workload.

Step 1: Discover GPU VM SKUs in Your Region

Open your terminal and run the following command, replacing <your-region> with your Azure region (e.g., eastus2, westus3, swedencentral):

az vm list-skus --location <your-region> --size Standard_N --output table

This filters for the Standard_N family, which includes all GPU-accelerated VMs in Azure. You'll see SKU names like Standard_NC24ads_A100_v4, Standard_ND96asr_v4, and Standard_NV36ads_A10_v5.

💡 Pro Tip: Pay attention to three GPU VM family prefixes: - NC — Compute-optimized GPUs for training and inference (NVIDIA T4, A100) - ND — High-end GPUs designed for distributed deep learning with InfiniBand (A100, H100) - NV — Visualization and lightweight inference GPUs (AMD Radeon, NVIDIA A10)

For production AI workloads, you'll mostly work with NC and ND series. The ND series VMs with InfiniBand are what large-scale training jobs require.

Step 2: Check Your GPU Quota

Having SKUs available in a region doesn't mean you can deploy them. You need quota. Run this command to see your current GPU-family vCPU limits and usage:

az vm list-usage --location <your-region> --output table | Select-String -Pattern "NC|ND|NV"

Note: On Linux/macOS, replace Select-String -Pattern "NC|ND|NV" with grep -E "NC|ND|NV".

This shows you how many vCPUs you've used and your current limit for each GPU family. If the limit is 0, you'll need to request a quota increase before deploying any GPU VMs.

Questions to Explore

Now that you have the output, try answering these:

  1. Which VM SKU uses the NVIDIA T4 GPU? (Hint: look for NC*T4* in the name — great for inference workloads)
  2. Which SKU uses the NVIDIA A100? (Hint: look for NC*A100* or ND*A100* — the workhorse for training)
  3. What's the difference in vCPU count between the smallest and largest GPU SKU available?
  4. What are your current vCPU quotas for the NC and ND families? Would you be able to deploy a single A100 VM today?
  5. How many GPUs per VM do the ND-series SKUs offer? (Some provide up to 8 GPUs in a single node)

⚠️ Production Gotcha: GPU quota requests in Azure can take 24–72 hours for approval, sometimes longer for high-demand SKUs like A100 and H100. If a project has a Wednesday deadline, don't wait until Tuesday to request quota. Build quota planning into your AI project kickoff process — just like you'd plan IP address space or subscription limits for any large deployment.


Chapter Checklist

Before moving on, make sure you're comfortable with these concepts:

  • AI is an infrastructure workload, not a data science mystery. It consumes compute, storage, and network — your domain.
  • The AI stack has three layers: data (storage), model (the application), and infrastructure (your responsibility).
  • A trained model is a compiled binary. Training is a batch job. Inference is an API call. You already know these patterns.
  • AI infrastructure changes some things (GPUs instead of CPUs, InfiniBand instead of Ethernet, tokens instead of requests) but keeps the fundamentals (security, networking, IaC, monitoring, cost governance).
  • Infrastructure engineers are critically needed in AI — data scientists can build models but struggle with production operations, security, and cost management.
  • You don't need to learn Python or ML theory. You need to learn what AI workloads demand from infrastructure.
  • GPU VMs come in families: NC (compute/training), ND (distributed deep learning), NV (visualization). Quota must be requested in advance.
  • Career paths exist: AI Infrastructure Engineer, MLOps Engineer, AI Cloud Architect, AI Platform Engineer — all build on your current skills.

What's Next

Now that you understand why AI needs your skills, let's look at the fuel that powers it all — data. In Chapter 2, you'll learn how data flows from raw storage to trained model, why I/O performance is the bottleneck most teams discover too late, and how your storage architecture decisions directly impact whether a training job takes 4 hours or 4 days.

The infrastructure decisions you make around data — what tier, what format, what throughput — will determine whether the AI team sees you as an order-taker or a strategic partner. Let's make sure it's the latter.

Next up: Chapter 2 — Data: The Fuel That Powers AI


References